Stand-alone page

Some time ago, while I was waiting to see if I got into the PhD programme I am currently in, a thought came into my mind: "Can I make a CAT tool from scratch?".

The reasoning was simple - I like to challenge myself and learn through trial and error.

Where to start?

The way I go about this every time is by trying to simplify everything as much as possible. Interestingly enough, it works most of the time.

So, what would my CAT tool do? That was a really great question.

  • Make a basic text editor
  • Translation
  • Database related issues
  • Throw in some API
  • Other Stuff

UI Mock-Up

I needed something that would look like a simple text editor. Fortunately, Notepad is a thing and I could just get inspired by its UI.

Initial window UI

Honestly, I am not a software dev.

What I ended up with

It took some time to set everything up. This was probably the easiest part of the entire project.

UI Break-Down

File

  • New Project - Deletes Source Text/Target Text, Intermediate Text, TM, Glossary
  • Open Original - Opens text in the left Text-Box
  • Open Translation - Opens text in the right Text-Box
  • Save project - Saves ST/TT/IT/TM and Glossary to a folder - bugged atm
  • Save - Initial UI saves text that is in the right Text-Box
  • Save As / Print - don't work
  • Exit - Releases you from using this awful thing.

Edit

  • Undo - for when you messed up
  • Redo - for when you messed up while correcting the previous mess-up
  • Cut - time to move to another line
  • Copy - when you don't want to write it again
  • Paste - insert copied/cut fragment
  • Deselect All - You shall not be selected!
  • Select All - You shall be selected!

Text

  • Fonts - opens up the fonts menu
  • Small - text font size set to small
  • Medium - text font size set to medium
  • Large - text font size set to large
  • Search - search the selected text - works 3/10 times
  • Spelling - Language - Opens up the Languages Selection Menu for Spelling - Uses Hunspell
  • Spelling - Advanced - Not implemented

Separators

  • Comma - splits the text into fragments when encountering a comma.
  • Fullstop - splits the text into fragments when encountering a fullstop.
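A minimal sketch of what the two separator options could do, written in Python for illustration. It is not the code from the tool itself; the `SEPARATORS` mapping and the `split_into_fragments` function are made-up names.

```python
import re

# Hypothetical mapping from the menu option to the character it splits on.
SEPARATORS = {"comma": ",", "fullstop": "."}


def split_into_fragments(text: str, separator: str) -> list[str]:
    """Split text after every occurrence of the chosen separator."""
    mark = re.escape(SEPARATORS[separator])
    # The lookbehind keeps the separator attached to the fragment it ends.
    pieces = re.split(rf"(?<={mark})\s*", text)
    return [piece.strip() for piece in pieces if piece.strip()]


fragments = split_into_fragments(
    "First sentence. Second one, with a comma. Third.", "fullstop"
)
for fragment in fragments:
    print(fragment)
```

A split this naive also breaks on abbreviations such as "e.g." and on decimal numbers, so real segmentation rules need more care.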

Tools

In-depth look at it a bit later.

  • TTS - text-to-speech -> can be used to read text
  • VTT - voice-to-text -> sub-par dictation
  • Word Analysis - WordNet API - not by me - Shows Semantic/Lexical relations / Synsets / Semantic Similarity
  • Token Analyser - not by me - Keyword/Whitespace/Stop/Simple/Standard analysers
  • XML editor - not by me - I was attempting to open XML files to see if I could translate them - nope, does not work yet.

Glossary

In-depth look at it a bit later.

  • Quick Glossary - opens up a basic glossary UI
  • Glossary - opens the full Glossary UI
  • New Glossary - creates a glossary in the glossary folder
  • New Custom Glossary - creates a glossary at the user-specified location

API Menu

In-depth look at it a bit later.

  • Dictionaries
    • Oxford API - UI finished
    • MW API - raw answers - no UI
    • MW Medical API - raw answers - no UI
  • Yandex
    • Translation - UI works - API key is dead
    • Spellcheck - UI works - basic spell check

Translation Menu

In-depth look at it a bit later.

  • Switch to Translation - Start translation process
  • Next ST segment - next Source Text segment - move segments up
  • Previous ST segment - previous Source Text segment - move segments down
  • Validate segment - validates the segment and moves segments up
  • Next Target Text segment - does not work atm.
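The menu entries above boil down to moving an index over a list of segments. Here is a small Python sketch under that assumption; `SegmentSession` and its method names are hypothetical and do not come from the tool.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentSession:
    """Hypothetical model of the translation window state."""

    source: list[str]
    target: list[str] = field(default_factory=list)
    validated: set[int] = field(default_factory=set)
    index: int = 0

    def __post_init__(self) -> None:
        # Both sides start with the same sentences; only the target is edited.
        if not self.target:
            self.target = list(self.source)

    def next_segment(self) -> int:
        """Move to the next source segment, stopping at the last one."""
        last = max(len(self.source) - 1, 0)
        self.index = min(self.index + 1, last)
        return self.index

    def previous_segment(self) -> int:
        """Move to the previous source segment, stopping at the first one."""
        self.index = max(self.index - 1, 0)
        return self.index

    def validate(self, translation: str) -> int:
        """Store the translation for the current segment and move on."""
        self.target[self.index] = translation
        self.validated.add(self.index)
        return self.next_segment()


session = SegmentSession(["One.", "Two.", "Three."])
session.validate("Unu.")
print(session.index, session.target)
```

Keeping the position in one place means "next", "previous" and "validate" cannot disagree about which segment is active.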

Translation Memory Menu

In-depth look at it a bit later.

  • New - Creates a new TM based on the existing template
  • Align - Opens up the alignment UI

Top Ribbon Menu

Bottom Ribbon Menu

Translation Window UI

This was probably the most annoying part of the whole project.

The idea was to split the right and left windows into smaller text-boxes. In each smaller text-box, the software will load one line of text.

Both sides will contain the same sentences, with the right window allowing text to be edited.

Some time later - and after a really interesting period in which I had to figure out how to load a text line by line and then pass the lines from one text box to another - I ended up with this

Segment control menu

Demo

Load text file -> split text into translation segments -> translate -> save translated file.
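The flow above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: one segment per non-empty line, and a `translate` callback standing in for the human translator. The function names are hypothetical, and the sketch writes a single output file.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_segments(path: Path) -> list[str]:
    """Read a text file and return its non-empty lines as segments."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def translate_file(
    source_path: Path, target_path: Path, translate: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Load, split, translate and save; return the (source, target) pairs."""
    segments = load_segments(source_path)
    pairs = [(segment, translate(segment)) for segment in segments]
    translated = "\n".join(target for _, target in pairs) + "\n"
    target_path.write_text(translated, encoding="utf-8")
    return pairs


# Demo in a throwaway folder, with str.upper standing in for the translator.
with tempfile.TemporaryDirectory() as folder:
    source_path = Path(folder) / "source.txt"
    target_path = Path(folder) / "target.txt"
    source_path.write_text("First line.\n\nSecond line.\n", encoding="utf-8")
    pairs = translate_file(source_path, target_path, str.upper)
    saved = target_path.read_text(encoding="utf-8")

print(pairs)
```

Returning the pairs as well as saving the file keeps the source and target segments together, which is what a translation memory needs later on.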

When done - program generates three ".txt" files:


Translation Memory Generator

Everything before this part was done without using any database; it was just .txt manipulation. If I wanted to do more complex things, such as aligning texts to create a translation memory, I needed to use some sort of database.

The plan

I knew that TMX is an XML specification, and XML is not that difficult to get around - I started by looking at the structure of a .tmx file.

I used the Translation Memory eXchange format (TMX) definition document by the Localisation Industry Standards Association (LISA), available under the terms of the CC BY 3.0 license.

I opened up a .tmx file, used the definition document to make a map for myself, and threw the results into an HTML page so they would look tidier.

TMX summary

A simple overview of a TMX file structure

<?xml version="1.0"?>/*XML version.*/
<!-- Example of TMX document -->/*XML comment.*/

<tmx version="1.4"> /*The version attribute indicates the version of the TMX format.*/


<header /*Beginning of the header tag, which contains metadata about the document.*/
/*Attributes declared inside the start tag.*/
  creationtool="XYZTool"/*Name of the tool which created this TMX.*/
  creationtoolversion="1.01-023"/*Version of the above mentioned tool.*/
  datatype="PlainText"/*Type of contained data.*/
  segtype="sentence"/*Specifies the kind of segmentation used in the <tu> element. */
  adminlang="en-us"/*Specifies the default language for the administrative and informative elements <note> and <prop>.*/
  srclang="EN"/*Specifies the source language.*/
  o-tmf="ABCTransMem"/*Specifies the format of the translation memory file from which the TMX document or segment thereof have been generated.*/
  creationdate="20020101T163812Z"/*Creation date having the following format YYYYMMDDThhmmssZ.*/
  creationid="John Doe"/*Specifies the user that created the entry.*/
  changedate="20020413T023401Z"/*Change date having the following format YYYYMMDDThhmmssZ.*/
  changeid="Jane Doe"/*Specifies the user that modified the entry.*/
  o-encoding="iso-8859-1"/*The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set.*/
  /*Inside of the start tag.*/
 > /*Start tag closed.*/
 

/*Inside the header element.*/
 <note> This is a note at document level. </note>/*Self-explanatory.*/
 <prop type="RTFPreamble">{\rtf1\ansi\tag etc} {/fonttbl}</prop>/*Define custom properties of the parent element.*/
  <ude name="MacRoman" base="Macintosh" >/*Specify a set of user-defined characters.*/
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>/*Mapping from Unicode to the user-defined encoding.*/
 </ude>/*ude tag closed.*/
 </header>/*header element tag closed.*/
/*Outside of the header element.*/

 <body>/*Container for the translation unit collection.*/
    <tu /*Translation unit start.*/
		tuid="0001" /*Translation unit id.*/
		datatype="Text" /*Type of data contained in the translation unit.*/
		usagecount="2" /*Number of times the unit was accessed.*/
		lastusagedate="20181129T122945Z" /*Last time the unit was accessed.*/
	>/*tu start tag closed.*/
		<note>Text of a note at the TU level.</note> /*Self-explanatory.*/
		<prop type="x-Domain">Christmas</prop> /*Custom property tags.*/
		<prop type="x-Project">Santa</prop> /*Custom property tags.*/
			<tuv /*Contains text in a given language.*/
				xml:lang="ro" /*Language for the current text.*/
				creationdate="20181129T122945Z" /*Date it was created.*/
				creationid="John Doe" /*User who created it.*/
				> /*tuv start tag closed.*/
				<seg>data Propozitia ta aici.</seg> /*The text data.*/
			</tuv> /*tuv closing tag*/
			<tuv /*Contains text in a given language.*/
				xml:lang="en" /*Language for the current text.*/
				creationdate="20181129T122945Z" /*Date it was created.*/
				creationid="John Doe" /*User who created it.*/
				changedate="20181129T122945Z" /*Date it was modified.*/
				changeid="Jane Doe" /*User who modified it.*/
			> /*tuv start tag closed.*/
				<seg>Your sentence here.</seg> /*The text data.*/
			</tuv> /*tuv closing tag*/
		</tu> /*tu closing tag*/
 </body> /*body closing tag*/

  
</tmx> /*tmx closing tag*/
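For comparison, here is how a stripped-down, valid version of that structure can be read with Python's standard `xml.etree.ElementTree` module. The inline `/* */` annotations above are notes for the reader and are not valid XML, so the sample leaves them out. `SAMPLE_TMX` and `read_translation_units` are illustrative names, not part of the tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
          datatype="PlainText" segtype="sentence" adminlang="en-us"
          srclang="EN" o-tmf="ABCTransMem"/>
  <body>
    <tu tuid="0001">
      <tuv xml:lang="ro"><seg>Propozitia ta aici.</seg></tuv>
      <tuv xml:lang="en"><seg>Your sentence here.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_translation_units(tmx_text: str) -> list[dict[str, str]]:
    """Return one {language: segment text} dictionary per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            variants[tuv.get(XML_LANG, "")] = text
        units.append(variants)
    return units


units = read_translation_units(SAMPLE_TMX)
print(units)
```

Using `itertext()` rather than `.text` means a segment still comes out whole if it contains inline markup elements.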

Database time

Until now I had got away with using only text files, but I realised that would not be the case if I wanted to work with TMs and glossaries.

At the time I made this, I had almost no knowledge of working with databases. Sure, I could do pivot tables and some fancy stuff in Excel, but nothing that might help me here.

Nevertheless, I knew how I would do things; I just had no idea how to code them.

  • Send data
  • Retrieve data
  • Edit existing data

  • Menu bar
  • Ribbon menu
  • Database controls - database search, hopefully.
  • 1a - Source Text read line by line
  • 2a - Target Text read line by line
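Reading both texts line by line and pairing them up can be sketched as follows. This is a naive positional alignment in Python, given only as an illustration: `align_lines` and `batches` are hypothetical names, and the batch size is an arbitrary choice.

```python
from itertools import zip_longest
from typing import Iterator


def clean_lines(text: str) -> list[str]:
    """Return the non-empty lines of a text, stripped of outer whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def align_lines(source_text: str, target_text: str) -> list[tuple[str, str]]:
    """Pair line N of the source with line N of the target."""
    # zip_longest pads the shorter side with "" so no line is dropped.
    return list(
        zip_longest(clean_lines(source_text), clean_lines(target_text), fillvalue="")
    )


def batches(pairs: list[tuple[str, str]], size: int) -> Iterator[list[tuple[str, str]]]:
    """Yield the pairs in groups small enough to review on one screen."""
    for start in range(0, len(pairs), size):
        yield pairs[start:start + size]


pairs = align_lines("One.\nTwo.\nThree.", "Unu.\nDoi.")
for batch in batches(pairs, 2):
    print(batch)
```

An empty string on one side marks a pair that still needs attention, which is why the text boxes have to stay editable during alignment.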

There was also the issue of saving the Access database data as a .tmx file. More about that a bit later; first, I had to learn about OLE DB.

Fast forward about two days or so, and I was able to write data to and read data from a database using the TM Alignment UI I made.
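The three operations (send, retrieve, edit) look roughly like this in code. Note the swap: the tool talks to an Access database through OLE DB, while this sketch uses Python's built-in `sqlite3` module as a stand-in so that it runs anywhere. The `tm` table and the function names are assumptions made for the example.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the (hypothetical) tm table exists."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS tm ("
        "id INTEGER PRIMARY KEY, source TEXT NOT NULL, target TEXT NOT NULL)"
    )
    return connection


def add_pair(connection: sqlite3.Connection, source: str, target: str) -> int:
    """Send data: store one aligned pair and return its row id."""
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute(
            "INSERT INTO tm (source, target) VALUES (?, ?)", (source, target)
        )
    return cursor.lastrowid


def find_pairs(connection: sqlite3.Connection, text: str) -> list[tuple]:
    """Retrieve data: return pairs whose source contains the given text."""
    rows = connection.execute(
        "SELECT id, source, target FROM tm WHERE source LIKE ? ORDER BY id",
        (f"%{text}%",),
    )
    return rows.fetchall()


def update_target(connection: sqlite3.Connection, pair_id: int, target: str) -> None:
    """Edit existing data: replace the target text of one pair."""
    with connection:
        connection.execute("UPDATE tm SET target = ? WHERE id = ?", (target, pair_id))


connection = open_memory()
pair_id = add_pair(connection, "Your sentence here.", "Propozitia ta aici")
update_target(connection, pair_id, "Propozitia ta aici.")
matches = find_pairs(connection, "sentence")
connection.close()
print(matches)
```

The `?` placeholders let the database driver handle quoting, so a segment containing an apostrophe does not break the query.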

Default UI - when opening the TM Creator
Alignment UI - after loading ST and TT and pressing Begin
TMX export - after pressing the TMX export button

UI - Break-Down
  • Open ST - open Source Text (text 1)
  • Open TT - open Target Text (text 2)

  • Lock ALL - locks the text box UI - text cannot be edited
  • Unlock ALL - unlocks the text box UI - text can be edited
  • Next Batch - validates all pairs and moves to the next batch of pairs
  • Start/Begin - start alignment

  • to TMX - export to TMX
  • to accdb - export to .accdb
  • Options
    • Show - shows tmx export menu
    • Hide - hides tmx export menu

  • Open ST - open 1st text file
  • Open TT - open 2nd text file
  • Start/Begin - start alignment
  • Lock ALL - locks the text box UI - text cannot be edited
  • Unlock ALL - unlocks the text box UI - text can be edited
  • Next Batch - validates all pairs and moves to the next batch of pairs
  • Status - Shows the status of the connection with the database
  • Dev Options - Activates the test menu - allows searching the TM - bugged atm

Check TMX summary for more info

  • Language Pair - drop-down menu for language selection
  • Creation Tool - Creation tool name
  • Creation id - input creation ID
  • Creation Date - Unix date
  • Source Language - srclang
  • Data Type - drop-down menu for data-type (e.g. PlainText)
  • Administrative Language - adminlang
  • Segment Type - drop-down menu for segtype (e.g. sentence)
  • Original TMF - o-tmf
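As an illustration of how fields like these can end up in a .tmx file, here is a Python sketch built on the standard `xml.etree.ElementTree` module. It is not the exporter from the tool: `build_tmx` and `tmx_timestamp` are hypothetical helpers, and the header values are the sample ones from the TMX summary.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# ElementTree writes this namespaced key out as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYYMMDDThhmmssZ (UTC)."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def build_tmx(
    pairs: list[tuple[str, str]],
    source_lang: str,
    target_lang: str,
    header: dict[str, str],
) -> str:
    """Build a TMX document with one <tu> per (source, target) pair."""
    root = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(root, "header", {**header, "srclang": source_lang})
    body = ET.SubElement(root, "body")
    for number, (source, target) in enumerate(pairs, start=1):
        tu = ET.SubElement(body, "tu", {"tuid": f"{number:04d}"})
        for lang, text in ((source_lang, source), (target_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(root, encoding="unicode")


header = {
    "creationtool": "XYZTool",
    "creationtoolversion": "1.01-023",
    "datatype": "PlainText",
    "segtype": "sentence",
    "adminlang": "en-us",
    "o-tmf": "ABCTransMem",
    "creationdate": tmx_timestamp(datetime(2018, 11, 29, 12, 29, 45, tzinfo=timezone.utc)),
}
document = build_tmx([("Your sentence here.", "Propozitia ta aici.")], "en", "ro", header)
print(document)
```

Building the tree with a library instead of concatenating strings means characters such as `&` and `<` inside a segment get escaped automatically.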

Demo

The exporter gets the needed data from two areas:

Errors and bugs

Glossary

One annoying error that is yet to be fixed is properly closing the database. It's the same error as with the TMX exporter.
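I cannot say what causes the error in the tool itself, but the usual way to make sure a connection gets closed is to tie it to a scoped block. A Python sketch of that idea, again with `sqlite3` standing in for the Access database; the `glossary` table and `count_entries` are made up for the example.

```python
import sqlite3
from contextlib import closing


def count_entries(path: str) -> int:
    """Count glossary rows; the connection is closed even if a query fails."""
    # closing() calls connection.close() when the block exits, error or not.
    with closing(sqlite3.connect(path)) as connection:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS glossary (term TEXT, translation TEXT)"
        )
        (total,) = connection.execute("SELECT COUNT(*) FROM glossary").fetchone()
    return total


print(count_entries(":memory:"))
```

In `sqlite3`, using the connection itself in a `with` statement only commits or rolls back the transaction; it does not close the connection, which is why `closing()` is used here.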


Conclusion

Did I manage to make a CAT tool? Nope, not even close.

It was an interesting project and there were plenty of issues that I had to solve.

Some of the issues were solved, and some are still there and will have to be solved, eventually.

There are some other things that could be showcased such as: spellcheck, some dictionary APIs and MT. Compared to what I have already shown, they are not that interesting.

All in all, it was a good learning opportunity and a lot of practice with code writing (several thousand lines of code).