Some time ago, while I was waiting to see if I got into the PhD programme I am currently in, a thought came into my mind: "Can I make a CAT Tool from scratch?".
The reasoning was simple - I like to challenge myself and learn through trial and error.
Where to start?
The way I go about this every time is by trying to simplify everything as much as possible. Interestingly enough, it works most of the time.
So, what would my CAT tool do? That was a really great question.
Make a basic text editor
Load text from a file - plain .txt file
Allow me to manipulate that text
Save it to another file - plain .txt file
Translation
Separators - most likely "comma" and "full stop"
Texts loaded line by line - based on the previous splitting operations using separators
Autocorrect / Spelling - for various languages - Will use Hunspell
Some sort of dictionary
Word / Character count - for Source Text Box
Word / Character count + Translation % for Target Text Box
Database related issues
Find something to use as a database - Not Excel - most likely an Access Database
Create a Translation Memory generator
Create a Glossary System
Throw in some API
Dictionary APIs
Oxford Dictionary API
Merriam-Webster API
Merriam-Webster Medical API
Machine Translation
Google Translate API
Yandex Translation API
Other Stuff
Yandex Spellchecker API
Some Dictation capabilities
Some Text-to-Speech capabilities
Word Analysis / Token Analysis
UI Mock-up
I needed something that would look like a simple text editor. Fortunately, Notepad is a thing and I could just get inspired by its UI.
Details
Initial window UI
1 - Source text window - Imported files will be open here
2 - Text window - Input text manually if there is no file to be imported.
3 - Sort of ribbon menu - Hope I don't get sued for it.
6 - Menu bar
4 /5 - Text meta-info: word/character count + other metrics
What I ended up with
It took some time to set everything up. This was probably the easiest part of the entire project.
Open Translation - Opens text in the right Text-Box
Save project - Save ST/TT/IT/TM and Gloss to a folder - bugged atm
Save - Initial UI saves text that is in the right Text-Box
Save As / Print - don't work
Exit - Releases you from using this awful thing.
Edit
Undo - for when you messed up
Redo - for when you messed up while correcting the previous mess-up
Cut - time to move to another line
Copy - when you don't want to write it again
Paste - insert copied/cut fragment
Deselect All - You shall not be selected!
Select All - You shall be selected!
Text
Fonts - opens up the fonts menu
Small - text font size set to small
Medium - text font size set to medium
Large - text font size set to large
Search - search the selected text - works 3/10 times
Spelling - Language - Opens up the Languages Selection Menu for Spelling - Uses Hunspell
Spelling - Advanced - Not implemented
Separators
Comma - splits the text into fragments when encountering a comma.
Fullstop - splits the text into fragments when encountering a fullstop.
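The two separator options can be sketched roughly like this. It is a Python illustration, not the tool's own code: the function name `split_segments` and the decision to keep the separator attached to the end of each fragment are my assumptions.

```python
import re

def split_segments(text, separators=(".",)):
    """Split text into fragments, cutting right after each separator character."""
    if not separators:
        stripped = text.strip()
        return [stripped] if stripped else []
    # Build a character class from the chosen separators, e.g. "[\.,]"
    pattern = "[" + "".join(re.escape(s) for s in separators) + "]"
    segments = []
    start = 0
    for match in re.finditer(pattern, text):
        fragment = text[start:match.end()].strip()
        if fragment:
            segments.append(fragment)
        start = match.end()
    # Whatever is left after the last separator is a fragment too
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

Calling `split_segments(text, (",", "."))` turns on both separators at once; passing only `(".",)` gives sentence-like fragments.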
Tools
In-depth look at it a bit later.
TTS - text-to-speech -> can be used to read text
VTT - voice-to-text -> sub-par dictation
Word Analysis - WordNet API - not by me - Show Semantic/Lexical relations / Synsets / Semantic Similarity
Token Analyser - not by me - Keyword/Whitespace/Stop/Simple/Standard analysers
XML editor - not by me - I was attempting to open xml files to see if I can translate them - nope, does not work, yet.
Glossary
In-depth look at it a bit later.
Quick Glossary - opens up a basic glossary UI
Glossary - opens the full Glossary UI
New Glossary - creates a glossary in the glossary folder
New Custom Glossary - creates a glossary at the user-specified location
API Menu
In-depth look at it a bit later.
Dictionaries
Oxford API - UI finished
MW API - raw answers - no UI
MW Medical API - raw answers - no UI
Yandex
Translation - UI works - API key is dead
Spellcheck - UI works - basic spell check
Translation Menu
In-depth look at it a bit later.
Switch to Translation - Start translation process
Next ST segment - next Source Text segment - move segments up
Previous ST segment - previous Source Text segment - move segments down
Validate segment - validates the segment and moves segments up
Next Target Text segment - does not work atm.
Translation Memory Menu
In-depth look at it a bit later.
New - Creates a new TM based on the existing template
Align - Opens up the alignment UI
Top Ribbon Menu
Text Font/Size/Bold/Italics/Under - menu
Source Language / Target Language selectors
Experimental controls - work in the translation UI
Bottom Ribbon Menu
Word count - number of words
Character count - number of characters
Translated % - percent of translated text
Toggle spellcheck
Spellcheck language indicator
Book button - dictionary - not working atm
Spellcheck button - opens spellcheck language selection UI
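The three counters in the bottom ribbon could be computed along these lines. This is a hedged Python sketch: the function names are hypothetical, and I am assuming that "Translated %" means validated segments divided by total segments, which may not be how the tool actually calculates it.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())

def char_count(text, include_whitespace=True):
    """Count characters, optionally ignoring spaces, tabs and line breaks."""
    if include_whitespace:
        return len(text)
    return sum(1 for ch in text if not ch.isspace())

def translated_percent(validated_segments, total_segments):
    """Share of validated segments, as a percentage (assumed definition)."""
    if total_segments == 0:
        return 0.0
    return round(100.0 * validated_segments / total_segments, 1)
```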
Translation Window UI
This was probably the most annoying part of the whole project.
Details
The idea was to split the right and left windows into smaller text-boxes. In each smaller text-box,
the software will load one line of text.
Both sides will contain the same sentences - with the right window allowing text to be edited
1a - contain uneditable text lines
1b - will contain the entire text - lines will be appended instead of passed from one text-box to another.
2a - contain editable text lines
2b - will contain the translated text - translated lines will be appended instead of passed from one text-box to another.
3/4/5/6 - unchanged
3a - additional menu related to translation controls
Some time later - and after a really interesting period in which I had to figure out how to load a text line by line and then pass the lines
from one text box to another - I ended up with this
Segment control menu
Validate Segment - validates translated segment and then reads the next line in the text file
Experimental controls need to be activated to have access to them - these aren't working as intended at the moment
Next ST segment - moves source text lines up
Previous ST segment - moves source text lines down
Next TT segment - moves target text lines up
Machine Translation Controls - allows target language selection for MT translation
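The segment controls boil down to moving an index over a list of lines. Here is a minimal Python sketch of that idea; the class `SegmentCursor` and its method names are hypothetical and only mirror the buttons described above, not the tool's real implementation.

```python
class SegmentCursor:
    """Keeps track of the current source segment and the validated translations."""

    def __init__(self, source_segments):
        self.source = list(source_segments)
        self.targets = [None] * len(self.source)  # None = not validated yet
        self.index = 0

    def current(self):
        return self.source[self.index] if self.source else None

    def next(self):
        # "Next ST segment": move forward, but stop at the last segment
        if self.index < len(self.source) - 1:
            self.index += 1
        return self.current()

    def previous(self):
        # "Previous ST segment": move back, but stop at the first segment
        if self.index > 0:
            self.index -= 1
        return self.current()

    def validate(self, translation):
        # "Validate segment": store the translation, then load the next line
        if not self.source:
            return None
        self.targets[self.index] = translation
        return self.next()
```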
Demo
Load text file -> split text into translation segments -> translate -> save translated file.
When done - program generates three ".txt" files:
original.txt - original after being split into translation units
intermediate.txt - contains pairs of sentences - was planning on using it for a TM - plans changed
yourtranslationame.txt - finished translation
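The saving step at the end of the demo can be sketched as follows. This is Python, not the tool's code, and it only covers the last arrow of the pipeline. The function name `save_translation` is hypothetical, and the tab-separated layout of intermediate.txt is my assumption, since the real layout of that file is not shown here.

```python
from pathlib import Path

def save_translation(pairs, out_dir, translation_name="translation.txt"):
    """Write the three output files for a list of (source, target) pairs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sources = [source for source, _ in pairs]
    targets = [target for _, target in pairs]
    # original.txt: the source text after splitting, one unit per line
    (out / "original.txt").write_text("\n".join(sources) + "\n", encoding="utf-8")
    # intermediate.txt: one "source<TAB>target" pair per line (assumed layout)
    (out / "intermediate.txt").write_text(
        "".join(f"{source}\t{target}\n" for source, target in pairs),
        encoding="utf-8",
    )
    # the finished translation, under a user-chosen file name
    (out / translation_name).write_text("\n".join(targets) + "\n", encoding="utf-8")
    return out
```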
Translation Memory Generator
Everything before this part was done without using any database; it was just .txt manipulation. If I wanted to do more complex things
such as creating something that would allow me to align texts to create a translation memory, I needed to use some sort of database.
The plan
I knew that TMX is an XML specification, and XML is not that difficult to get around - I started by looking at the structure of a .tmx file.
I used the Translation Memory eXchange format (TMX) definition document by the Localisation Industry Standards Association (LISA), available under the terms of the CC BY 3.0 license.
Opened up a .tmx file - then used the definition document to make a map for myself - threw the results into an HTML page so they would look tidier.
TMX summary
A simple overview of a TMX file structure
<?xml version="1.0"?>/*XML version.*/
<!-- Example of TMX document -->/*XML comment.*/
<tmx version="1.4"> /*The version attribute indicates the version of the TMX format.*/
<header /*Beginning of the header tag which contains meta-data about the document.*/
/*Attributes declared inside the start tag.*/
creationtool="XYZTool"/*Name of the tool which created this TMX.*/
creationtoolversion="1.01-023"/*Version of the above mentioned tool.*/
datatype="PlainText"/*Type of contained data.*/
segtype="sentence"/*Specifies the kind of segmentation used in the <tu> element. */
adminlang="en-us"/*Specifies the default language for the administrative and informative elements <note> and <prop>.*/
srclang="EN"/*Specifies the source language.*/
o-tmf="ABCTransMem"/*Specifies the format of the translation memory file from which the TMX document or segment thereof have been generated.*/
creationdate="20020101T163812Z"/*Creation date having the following format YYYYMMDDThhmmssZ.*/
creationid="John Doe"/*Specifies the user that created the entry.*/
changedate="20020413T023401Z"/*Change date having the following format YYYYMMDDThhmmssZ.*/
changeid="Jane Doe"/*Specifies the user that modified the entry.*/
o-encoding="iso-8859-1"/*The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set.*/
/*Inside of the start tag.*/
> /*Start tag closed.*/
/*Inside the header element.*/
<note> This is a note at document level. </note>/*Self-explanatory.*/
<prop type="RTFPreamble">{\rtf1\ansi\tag etc} {/fonttbl}</prop>/*Define custom properties of the parent element.*/
<ude name="MacRoman" base="Macintosh" >/*Specify a set of user-defined characters.*/
<map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>/*Mapping from Unicode to the user-defined encoding.*/
</ude>/*ude tag closed.*/
</header>/*header element tag closed.*/
/*Outside of the header element.*/
<body>/*Container for the translation unit collection.*/
<tu /*Translation unit start.*/
tuid="0001" /*Translation unit id.*/
datatype="Text" /*Type of data contained in the translation unit.*/
usagecount="2" /*Number of times the unit was accessed.*/
lastusagedate="20181129T122945Z" /*Last time the unit was accessed.*/
>/*tu start tag closed.*/
<note>Text of a note at the TU level.</note> /*Self-explanatory.*/
<prop type="x-Domain">Christmas</prop> /*Custom property tags.*/
<prop type="x-Project">Santa</prop> /*Custom property tags.*/
<tuv /*Contains text in a given language.*/
xml:lang="ro" /*Language for the current text.*/
creationdate="20181129T122945Z" /*Date it was created.*/
creationid="John Doe" /*User who created it.*/
> /*tuv start tag closed.*/
<seg>data Propozitia ta aici.</seg> /*The text data.*/
</tuv> /*tuv closing tag*/
<tuv /*Contains text in a given language.*/
xml:lang="en" /*Language for the current text.*/
creationdate="20181129T122945Z" /*Date it was created.*/
creationid="John Doe" /*User who created it.*/
changedate="20181129T122945Z" /*Date it was modified.*/
changeid="Jane Doe" /*User who modified it.*/
> /*tuv start tag closed.*/
<seg>Your sentence here.</seg> /*The text data.*/
</tuv> /*tuv closing tag*/
</tu> /*tu closing tag*/
</body> /*body closing tag*/
</tmx> /*tmx closing tag*/
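To check that the map above makes sense, a real .tmx file (without the /* */ annotations, which are not valid XML) can be read back with a few lines of Python. This is a sketch using only the standard library; the function name `read_tmx` and the shape of the returned data are my own choices.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(tmx_text):
    """Return (header attributes, list of {language: segment text} dicts)."""
    root = ET.fromstring(tmx_text)
    header = dict(root.find("header").attrib)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text that sits inside inline tags
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return header, units
```

Each translation unit comes back as a small dictionary such as `{"ro": "...", "en": "..."}`, which is close to the sentence pairs the rest of the tool works with.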
Database time
I got away with only using text files until now, but I realised that this would not be the case if I wanted to work with TMs and Glossaries.
At the time I made this, I had almost no knowledge of working with databases. Sure, I could do pivot tables and some fancy stuff in Excel, but nothing that might help me here.
Nevertheless, I knew how I would do things, I just had no idea how to code them.
Send data
Retrieve data
Edit existent data
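Those three operations look roughly like this in code. Loud caveat: this Python sketch uses SQLite from the standard library as a stand-in, not an Access database over OLE DB, and the table `tm` with its `source` and `target` columns is hypothetical.

```python
import sqlite3

def create_store(conn):
    """Create the hypothetical table used by this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tm ("
        "id INTEGER PRIMARY KEY, source TEXT, target TEXT)"
    )

def send(conn, source, target):
    """Send data: insert one sentence pair and return its row id."""
    cursor = conn.execute(
        "INSERT INTO tm (source, target) VALUES (?, ?)", (source, target)
    )
    conn.commit()
    return cursor.lastrowid

def retrieve(conn, row_id):
    """Retrieve data: fetch one pair as a (source, target) tuple, or None."""
    return conn.execute(
        "SELECT source, target FROM tm WHERE id = ?", (row_id,)
    ).fetchone()

def edit(conn, row_id, new_target):
    """Edit existing data: replace the target text of one pair."""
    conn.execute("UPDATE tm SET target = ? WHERE id = ?", (new_target, row_id))
    conn.commit()
```

The `?` placeholders keep the sentence text out of the SQL string itself, which matters once segments start containing quotes and apostrophes.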
Menu bar
Ribbon menu
Database controls - database search, hopefully.
1a - Source Text read line by line
2a - Target Text read line by line
There's also the issue of saving the Access database data as a .tmx file. But more about that a bit later; first I had to learn about OLE DB.
Fast forward about two days, or so, and I was able to write and read data from a database using the TM Alignment UI I made.
Default UI - when opening the TM Creator
Alignment UI - after loading ST and TT and pressing Begin
TMX export - after pressing the TMX export button
UI - Break-Down
Open ST - open Source Text (text 1)
Open TT - open Target Text (text 2)
Lock ALL - locks the text box UI - text cannot be edited
Unlock ALL - unlocks the text box UI - text can be edited
Next Batch - validates all pairs and moves to the next batch of pairs
Start/Begin - start alignment
to TMX - export to TMX
to accdb - export to .accdb
Options
Show - shows tmx export menu
Hide - hides tmx export menu
Open ST - open 1st text file
Open TT - open 2nd text file
Start/Begin - start alignment
Lock ALL - locks the text box UI - text cannot be edited
Unlock ALL - unlocks the text box UI - text can be edited
Next Batch - validates all pairs and moves to the next batch of pairs
Status - Shows the status of the connection with the database
Dev Options - Activates the test menu - allows searching the TM - bugged atm
Check TMX summary for more info
Language Pair - drop-down menu for language selection
Creation Tool - Creation tool name
Creation id - input creation ID
Creation Date - Unix date
Source Language - srclang
Data Type - drop-down menu for data-type (eg. PlainText)
Administrative Language - adminlang
Segment Type - drop-down menu for segtype (eg. sentence)
Original TMF - o-tmf
Demo
The exporter gets the needed data from two areas:
Meta-data for the TMX - the header part - comes from the TM Options Menu
From the .accdb which contains four columns
Stsegment - source text segment
Ttsegment - target text segment
Slang - source text language code
Tlang - target text language code
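Putting the two sources together, an export step might look something like the sketch below. It is Python with the standard library only, not the tool's exporter: the function names are hypothetical, the rows are assumed to arrive as plain (Stsegment, Ttsegment, Slang, Tlang) string tuples, and the header options as a dictionary of strings.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# ElementTree writes this namespaced name out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_timestamp(moment=None):
    """Format a datetime as YYYYMMDDThhmmssZ (UTC), as used in TMX dates."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

def build_tmx(rows, header_options):
    """Build a TMX string from (stsegment, ttsegment, slang, tlang) rows."""
    tmx = ET.Element("tmx", version="1.4")
    header = dict(header_options)
    header.setdefault("creationdate", tmx_timestamp())
    ET.SubElement(tmx, "header", header)
    body = ET.SubElement(tmx, "body")
    for number, (st, tt, slang, tlang) in enumerate(rows, start=1):
        tu = ET.SubElement(body, "tu", tuid=f"{number:04d}")
        # One <tuv> per language: source first, then target
        for lang, text in ((slang, st), (tlang, tt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

The header dictionary would carry the values from the TM Options Menu, for example `{"srclang": "ro", "segtype": "sentence", "datatype": "PlainText"}`.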
Errors and bugs
Source Language and Target Language codes seem to not get written into the database. This means those tmx
entries will have default values.
Database is not closed upon exiting the program.
On pressing begin the Source Text and the Target Text get swapped. It seems to be a UI issue.
Glossary
One annoying error that is yet to be fixed is properly closing the database. It's the same error as with the TMX exporter.
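For what it's worth, the usual way around this kind of problem is to tie the connection to a scope, so that it gets closed no matter how the code leaves that scope. A hedged Python sketch of the pattern follows, again with SQLite standing in for the Access database; the helper name `with_database` is hypothetical and this is not a fix for the actual bug.

```python
import sqlite3
from contextlib import closing

def with_database(db_path, work):
    """Run work(conn) and close the connection even if work raises."""
    # closing() guarantees conn.close() runs when the block is left;
    # sqlite3's own "with conn:" only commits or rolls back.
    with closing(sqlite3.connect(db_path)) as conn:
        result = work(conn)
        conn.commit()
        return result
```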
Conclusion
Did I manage to make a CAT tool? Nope, not even close.
It was an interesting project and there were plenty of issues that I had to solve.
Some of the issues were solved, and some are still there and will have to be solved, eventually.
There are some other things that could be showcased such as: spellcheck, some dictionary APIs and MT. Compared
to what I have already shown, they are not that interesting.
All in all it was a good learning opportunity and a lot of practice with code writing (several thousand lines of code).