Some time ago, while I was waiting to see if I had got into the PhD programme I am currently in, a thought came to mind: "Can I make a CAT tool from scratch?"
The reasoning was simple - I like to challenge myself and learn through trial and error.
The way I go about this every time is by trying to simplify everything as much as possible. Interestingly enough, it works most of the time.
So, what would my CAT tool do? That was a really great question.
I needed something that would look like a simple text editor. Fortunately, Notepad is a thing and I could just take inspiration from its UI.
It took some time to set everything up. This was probably the easiest part of the entire project.
In-depth look at it a bit later.
This was probably the most annoying part of the whole project.
The idea was to split the right and left windows into smaller text-boxes. In each smaller text-box, the software will load one line of text.
Both sides will contain the same sentences, with the right window allowing the text to be edited.
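A minimal sketch of that idea in Python, not the code of the tool itself: each non-empty line becomes one segment whose source and target start out identical, and only the target is meant to change. The names `Segment`, `split_into_segments` and `load_segments` are made up for illustration, and skipping blank lines is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str  # shown read-only in the left window
    target: str  # shown in the right window, where it can be edited


def split_into_segments(lines):
    """Turn an iterable of text lines into one Segment per non-empty line."""
    segments = []
    for line in lines:
        text = line.rstrip("\r\n")
        if text.strip():
            # Both sides start with the same sentence; only target gets edited.
            segments.append(Segment(source=text, target=text))
    return segments


def load_segments(path):
    """Read a UTF-8 text file line by line (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return split_into_segments(handle)


segments = split_into_segments(["First sentence.\n", "\n", "Second sentence.\n"])
segments[0].target = "Prima propozitie."  # simulate an edit in the right window
```

Keeping the source and target in one object per line means the two windows can never drift out of step, which is the part that tends to go wrong when the lines are copied between separate text boxes.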
Some time later - and after a really interesting period in which I had to figure out how to load a text line by line and then pass the lines from one text box to another - I ended up with this.
Segment control menu
Load text file -> split text into translation segments -> translate -> save translated file.
When done, the program generates three ".txt" files:
Everything before this part was done without using any database; it was just .txt manipulation. If I wanted to do more complex things, such as creating something that would allow me to align texts to create a translation memory, I needed to use some sort of database.
I knew that TMX is an XML specification, and XML is not that difficult to find your way around, so I started by looking at the structure of a .tmx file.
I used the Translation Memory eXchange format (TMX) definition document by the Localisation Industry Standards Association (LISA), available under the terms of the CC BY 3.0 license.
I opened up a .tmx file, used the definition document to make a map for myself, and threw the results into an HTML page so they would look tidier.
<?xml version="1.0"?>/*XML version.*/
<!-- Example of TMX document -->/*XML comment.*/
<tmx version="1.4"> /*The version attribute indicates the version of the TMX format.*/
<header /*Beginning of the header tag, which contains metadata about the document.*/
/*Attributes declared inside the start tag.*/
creationtool="XYZTool"/*Name of the tool which created this TMX.*/
creationtoolversion="1.01-023"/*Version of the above mentioned tool.*/
datatype="PlainText"/*Type of contained data.*/
segtype="sentence"/*Specifies the kind of segmentation used in the <tu> element. */
adminlang="en-us"/*Specifies the default language for the administrative and informative elements <note> and <prop>.*/
srclang="EN"/*Specifies the source language.*/
o-tmf="ABCTransMem"/*Specifies the format of the translation memory file from which the TMX document or segment thereof have been generated.*/
creationdate="20020101T163812Z"/*Creation date having the following format YYYYMMDDThhmmssZ.*/
creationid="John Doe"/*Specifies the user that created the entry.*/
changedate="20020413T023401Z"/*Change date having the following format YYYYMMDDThhmmssZ.*/
changeid="Jane Doe"/*Specifies the user that modified the entry.*/
o-encoding="iso-8859-1"/*The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set.*/
/*Inside of the start tag.*/
> /*Start tag closed.*/
/*Inside the header element.*/
<note> This is a note at document level. </note>/*Self-explanatory.*/
<prop type="RTFPreamble">{\rtf1\ansi\tag etc} {\fonttbl}</prop>/*Define custom properties of the parent element.*/
<ude name="MacRoman" base="Macintosh" >/*Specify a set of user-defined characters.*/
<map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>/*Mapping from Unicode to the user-defined encoding.*/
</ude>/*ude tag closed.*/
</header>/*header element tag closed.*/
/*Outside of the header element.*/
<body>/*Container for the translation unit collection.*/
<tu /*Translation unit start.*/
tuid="0001" /*Translation unit id.*/
datatype="Text" /*Type of data contained in the translation unit.*/
usagecount="2" /*Number of times the unit was accessed.*/
lastusagedate="20181129T122945Z" /*Last time the unit was accessed.*/
>/*tu start tag closed.*/
<note>Text of a note at the TU level.</note> /*Self-explanatory.*/
<prop type="x-Domain">Christmas</prop> /*Custom property tags.*/
<prop type="x-Project">Santa</prop> /*Custom property tags.*/
<tuv /*Contains text in a given language.*/
xml:lang="ro" /*Language of the current text.*/
creationdate="20181129T122945Z" /*Date it was created.*/
creationid="John Doe" /*User who created it.*/
> /*tuv start tag closed.*/
<seg>data Propozitia ta aici.</seg> /*The text data.*/
</tuv> /*tuv closing tag*/
<tuv /*Contains text in a given language.*/
xml:lang="en" /*Language of the current text.*/
creationdate="20181129T122945Z" /*Date it was created.*/
creationid="John Doe" /*User who created it.*/
changedate="20181129T122945Z" /*Date it was modified.*/
changeid="Jane Doe" /*User who modified it.*/
> /*tuv start tag closed.*/
<seg>Your sentence here.</seg> /*The text data.*/
</tuv> /*tuv closing tag*/
</tu> /*tu closing tag*/
</body> /*body closing tag*/
</tmx> /*tmx closing tag*/
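To check that a map like this matches a real file, it helps to read one back in. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed-down TMX document; `SAMPLE_TMX` and `read_translation_units` are illustrative names, and the sample leaves out the `/* */` annotations, which are notes for the reader and not valid XML.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
          datatype="PlainText" segtype="sentence" adminlang="en-us"
          srclang="EN" o-tmf="ABCTransMem"/>
  <body>
    <tu tuid="0001">
      <tuv xml:lang="ro"><seg>Propozitia ta aici.</seg></tuv>
      <tuv xml:lang="en"><seg>Your sentence here.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_translation_units(tmx_text):
    """Return a list of {language: segment text} dicts, one per <tu>."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text around inline markup inside <seg>.
                variants[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(variants)
    return units


units = read_translation_units(SAMPLE_TMX)
```

Each translation unit comes back as a small dictionary keyed by language code, which is roughly the shape a translation memory lookup needs.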
I had got away with using only text files until now, but I realised that would not be the case if I wanted to work with TMs and glossaries.
At the time I made this, I had almost no knowledge of working with databases. Sure, I could do pivot tables and some fancy stuff in Excel, but nothing that might help me here.
Nevertheless, I knew how I would do things; I just had no idea how to code them.
There's also the issue of saving the Access database data as a .tmx file, but more on that a bit later; first I had to learn about OLE DB.
Fast forward about two days or so, and I was able to write and read data from a database using the TM Alignment UI I made.
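For readers who want to try the same write-and-read round trip, here is a rough Python sketch. It swaps in SQLite (via the standard `sqlite3` module) for the Access database and OLE DB used in the project, so it only illustrates the idea; the table name `tm` and the helper functions are assumptions.

```python
import sqlite3
from contextlib import closing


def create_tm(conn):
    """Create the (hypothetical) table holding aligned sentence pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tm ("
        "id INTEGER PRIMARY KEY, source TEXT NOT NULL, target TEXT NOT NULL)"
    )


def add_pair(conn, source, target):
    # Parameter placeholders keep quotes in the text from breaking the SQL.
    conn.execute("INSERT INTO tm (source, target) VALUES (?, ?)", (source, target))
    conn.commit()


def lookup(conn, source):
    """Return the stored translation for an exact source match, or None."""
    row = conn.execute(
        "SELECT target FROM tm WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else None


# An in-memory database keeps the example self-contained.
with closing(sqlite3.connect(":memory:")) as conn:
    create_tm(conn)
    add_pair(conn, "Propozitia ta aici.", "Your sentence here.")
    found = lookup(conn, "Propozitia ta aici.")
    missing = lookup(conn, "No such sentence.")
```

This only does exact matching; fuzzy matching, which a real translation memory would need, is out of scope here.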
Check the TMX summary for more info.
The exporter gets the needed data from two areas:
One annoying error that is yet to be fixed is that the database is not closed properly. It's the same error as with the TMX exporter.
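I can't say what causes the error in the original code, but a common pattern for making sure a connection always gets closed looks like the sketch below. It is written in Python with SQLite standing in for the Access database, and it also shows one way an exporter could turn stored pairs into TMX. Every name here (`fetch_pairs`, `build_tmx`, `tmx_timestamp`, the `tm` table, the header values) is an assumption made for illustration.

```python
import os
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from contextlib import closing
from datetime import datetime, timezone

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_timestamp(moment):
    """Format a datetime as YYYYMMDDThhmmssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def fetch_pairs(db_path):
    """Read all aligned pairs; closing() closes the connection even on errors."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT source, target FROM tm ORDER BY id").fetchall()


def build_tmx(pairs, src_lang, tgt_lang, created):
    """Build a TMX document as a string from (source, target) pairs."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "MyCatTool",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "datatype": "PlainText",
        "segtype": "sentence",
        "adminlang": "en-us",
        "srclang": src_lang,
        "o-tmf": "sqlite",
    })
    body = ET.SubElement(tmx, "body")
    stamp = tmx_timestamp(created)
    for number, (source, target) in enumerate(pairs, start=1):
        tu = ET.SubElement(body, "tu", {"tuid": f"{number:04d}"})
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang, "creationdate": stamp})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Demo: create a throwaway database file, fill it, then export it.
with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "tm.db")
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE tm (id INTEGER PRIMARY KEY, source TEXT, target TEXT)")
        conn.execute(
            "INSERT INTO tm (source, target) VALUES (?, ?)",
            ("Propozitia ta aici.", "Your sentence here."),
        )
        conn.commit()
    pairs = fetch_pairs(db_path)
    created = datetime(2018, 11, 29, 12, 29, 45, tzinfo=timezone.utc)
    tmx_text = build_tmx(pairs, "ro", "en", created)
```

One detail worth knowing: in `sqlite3`, using the connection itself in a `with` statement only commits or rolls back a transaction and does not close the connection, which is why `closing()` is used here.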
Did I manage to make a CAT tool? Nope, not even close.
It was an interesting project and there were plenty of issues that I had to solve.
Some of the issues were solved, and some are still there and will have to be solved eventually.
There are some other things that could be showcased such as: spellcheck, some dictionary APIs and MT. Compared to what I have already shown, they are not that interesting.
All in all, it was a good learning opportunity and a lot of practice with code writing (several thousand lines of code).