CAT Tool

Some time ago, while I was waiting to see if I got into the PhD programme I am currently in, a thought came into my mind: "Can I make a CAT Tool from scratch?"

The reasoning was simple - I like to challenge myself and learn through trial and error.

Where to start?

The way I go about this every time is by trying to simplify everything as much as possible. Interestingly enough, it works most of the time.

So, what would my CAT tool do? That was a really great question.

Make a basic text editor

Load text from a file - plain .txt file
Allow me to manipulate that text
Save it to another file - plain .txt file

Translation

Separators - most likely "comma" and "full stop"
Texts loaded line by line - based on the previous splitting operations using separators
Autocorrect / Spelling - for various languages - Will use Hunspell
Some sort of dictionary
Word / Character count - for Source Text Box
Word / Character count + Translation % for Target Text Box
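
The counters in the list above can be sketched in a few lines. This is only an illustration in Python, not code taken from the tool: the function names are made up, and "Translation %" is assumed to mean the share of source segments that already have a non-empty target segment.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def count_characters(text: str, include_spaces: bool = True) -> int:
    """Count characters, optionally ignoring all whitespace."""
    if include_spaces:
        return len(text)
    return sum(1 for ch in text if not ch.isspace())


def translated_percent(source_segments, target_segments) -> float:
    """Percent of source segments that have a non-empty target segment.

    Assumption: targets are stored in the same order as the sources,
    and an empty or whitespace-only target means "not translated yet".
    """
    if not source_segments:
        return 0.0
    relevant = target_segments[:len(source_segments)]
    done = sum(1 for target in relevant if target.strip())
    return 100.0 * done / len(source_segments)
```

With four source segments and two filled-in targets, `translated_percent` returns 50.0.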

Database related issues

Find something to use as a database - Not Excel - most likely an Access Database
Create a Translation Memory generator
Create a Glossary System

Throw in some APIs

Dictionary APIs

Oxford Dictionary API
Merriam-Webster API
Merriam-Webster Medical API

Machine Translation

Google Translate API
Yandex Translation API

Other Stuff

Yandex Spellchecker API
Some Dictation capabilities
Some Text-to-Speech capabilities
Word Analysis / Token Analysis

UI Mock-up

I needed something that would look like a simple text editor. Fortunately, Notepad is a thing and I could just get inspired by its UI.

Initial window UI

Honestly, I am not a software dev

1 - Source text window - Imported files will be opened here
2 - Text window - Input text manually if there is no file to be imported.
3 - Sort of ribbon menu - Hope I don't get sued for it.
4/5 - Text meta-info: word/character count + other metrics
6 - Menu bar

What I ended up with

It took some time to set everything up. This was probably the easiest part of the entire project.

UI Break-Down

File

New Project - Deletes Source Text/Target Text, Intermediate Text, TM, Glossary
Open Original - Open text in the left Text-Box
Open Translation - Opens text in the right Text-Box
Save project - Save ST/TT/IT/TM and Gloss to a folder - bugged atm
Save - Initial UI saves text that is in the right Text-Box
Save As / Print - don't work
Exit - Releases you from using this awful thing.

Edit

Undo - for when you messed up
Redo - for when you messed up while correcting the previous mess-up
Cut - time to move to another line
Copy - when you don't want to write it again
Paste - insert copied/cut fragment
Deselect All - You shall not be selected!
Select All - You shall be selected!

Text

Fonts - opens up the fonts menu
Small - text font size set to small
Medium - text font size set to medium
Large - text font size set to large
Search - search the selected text - works 3/10 times
Spelling - Language - Opens up the Languages Selection Menu for Spelling - Uses Hunspell
Spelling - Advanced - Not implemented

Separators

Comma - splits the text into fragments when encountering a comma.
Fullstop - splits the text into fragments when encountering a fullstop.

Tools

In-depth look at it a bit later.

TTS - text-to-speech -> can be used to read text
VTT - voice-to-text -> sub-par dictation
Word Analysis - WordNet API - not by me - Shows Semantic/Lexical relations / Synsets / Semantic Similarity
Token Analyser - not by me - Keyword/Whitespace/Stop/Simple/Standard analysers
XML editor - not by me - I was attempting to open xml files to see if I can translate them - nope, does not work, yet.

Glossary

In-depth look at it a bit later.

Quick Glossary - opens up a basic glossary UI
Glossary - opens the full Glossary UI
New Glossary - creates a glossary in the glossary folder
New Custom Glossary - creates a glossary at the user-specified location

API Menu

In-depth look at it a bit later.

Dictionaries
- Oxford API - UI finished
- MW API - raw answers - no UI
- MW Medical API - raw answers - no UI
Yandex
- Translation - UI works - API key is dead
- Spellcheck - UI works - basic spell check

Translation Menu

In-depth look at it a bit later.

Switch to Translation - Start translation process
Next ST segment - next Source Text segment - move segments up
Previous ST segment - previous Source Text segment - move segments down
Validate segment - validates the segment and moves segments up
Next Target Text segment - does not work atm.

Translation Memory Menu

In-depth look at it a bit later.

New - Creates a new TM based on the existing template
Align - Opens up the alignment UI

Top Ribbon Menu

Text Font/Size/Bold/Italics/Under - menu
Source Language / Target Language selectors
Experimental controls - work in the translation UI

Bottom Ribbon Menu

Word count - number of words
Character count - number of characters
Translated % - percent of translated text
Toggle spellcheck
Spellcheck language indicator
Book button - dictionary - not working atm
Spellcheck button - opens spellcheck language selection UI

Translation Window UI

This was probably the most annoying part of the whole project.

The idea was to split the right and left windows into smaller text-boxes. In each smaller text-box, the software will load one line of text.

Both sides will contain the same sentences - with the right window allowing text to be edited

1a - contain uneditable text lines
1b - will contain the entire text - lines will be appended instead of passed from one text-box to another.
2a - contain editable text lines
2b - will contain the translated text - translated lines will be appended instead of passed from one text-box to another.
3/4/5/6 - unchanged
3a - additional menu related to translation controls

Some time later - and after a really interesting period in which I had to figure out how to load a text line by line and then pass the lines from one text box to another - I ended up with this

Segment control menu

Validate Segment - validates translated segment and then reads the next line in the text file
Experimental controls need to be activated to have access to them - these aren't working as intended at the moment
- Next ST segment - moves source text lines up
- Previous ST segment - moves source text lines down
- Next TT segment - moves target text lines up
Machine Translation Controls - allows target language selection for MT translation

Demo

Load text file -> split text into translation segments -> translate -> save translated file.

When done - program generates three ".txt" files:

original.txt - original after being split into translation units
intermediate.txt - contains pairs of sentences - was planning on using it for a TM - plans changed
yourtranslationname.txt - finished translation
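
The save step can be sketched roughly like this in Python. It is an illustration under assumptions, not the tool's code: `save_translation` is a made-up name, and the layout of intermediate.txt (one tab-separated source/target pair per line) is a guess, since only "pairs of sentences" is known.

```python
import tempfile
from pathlib import Path


def save_translation(out_dir, source_segments, target_segments,
                     translation_name="translation.txt"):
    """Write the three output files and return the output folder."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # original.txt: the source text after splitting, one unit per line.
    (out / "original.txt").write_text(
        "\n".join(source_segments) + "\n", encoding="utf-8")

    # intermediate.txt: assumed format, source<TAB>target on each line.
    pairs = [f"{st}\t{tt}" for st, tt in zip(source_segments, target_segments)]
    (out / "intermediate.txt").write_text(
        "\n".join(pairs) + "\n", encoding="utf-8")

    # The finished translation, under the name chosen by the user.
    (out / translation_name).write_text(
        "\n".join(target_segments) + "\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = save_translation(tmp, ["Hello,", "world."], ["Salut,", "lume."])
        print(sorted(p.name for p in folder.iterdir()))
```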

Translation Memory Generator

Everything before this part was done without using any database; it was just .txt manipulation. If I wanted to do more complex things, such as creating something that would allow me to align texts to create a translation memory, I needed to use some sort of database.

The plan

I knew that TMX is an XML specification, and XML is not that difficult to get around - I started by looking at the structure of a .tmx file.

I used the Translation Memory eXchange format (TMX) definition document by the Localisation Industry Standards Association (LISA), available under the terms of the CC BY 3.0 license.

I opened up a .tmx file, then used the definition document to make a map for myself, and threw the results into an HTML page so they would look tidier.

TMX summary

A simple overview of a TMX file structure

<?xml version="1.0"?>/*XML version.*/
<!-- Example of TMX document -->/*XML comment.*/

<tmx version="1.4"> /*The version attribute indicates the version of the TMX format.*/


<header /*Beginning of the header tag which contains meta-data about the document.*/
/*Attributes declared inside the start tag.*/
  creationtool="XYZTool"/*Name of the tool which created this TMX.*/
  creationtoolversion="1.01-023"/*Version of the above mentioned tool.*/
  datatype="PlainText"/*Type of contained data.*/
  segtype="sentence"/*Specifies the kind of segmentation used in the <tu> element. */
  adminlang="en-us"/*Specifies the default language for the administrative and informative elements <note> and <prop>.*/
  srclang="EN"/*Specifies the source language.*/
  o-tmf="ABCTransMem"/*Specifies the format of the translation memory file from which the TMX document or segment thereof have been generated.*/
  creationdate="20020101T163812Z"/*Creation date having the following format YYYYMMDDThhmmssZ.*/
  creationid="John Doe"/*Specifies the user that created the entry.*/
  changedate="20020413T023401Z"/*Change date having the following format YYYYMMDDThhmmssZ.*/
  changeid="Jane Doe"/*Specifies the user that modified the entry.*/
  o-encoding="iso-8859-1"/*The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set.*/
  /*Inside of the start tag.*/
 > /*Start tag closed.*/


/*Inside the header element.*/
 <note> This is a note at document level. </note>/*Self-explanatory.*/
 <prop type="RTFPreamble">{\rtf1\ansi\tag etc} {\fonttbl}</prop>/*Define custom properties of the parent element.*/
  <ude name="MacRoman" base="Macintosh" >/*Specify a set of user-defined characters.*/
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>/*Mapping from Unicode to the user-defined encoding.*/
 </ude>/*ude tag closed.*/
 </header>/*header element tag closed.*/
/*Outside of the header element.*/


 <body>/*Container for the translation unit collection.*/
    <tu /*Translation unit start.*/
		tuid="0001" /*Translation unit id.*/
		datatype="Text" /*Type of data contained in the translation unit.*/
		usagecount="2" /*Number of times the unit was accessed.*/
		lastusagedate="20181129T122945Z" /*Last time the unit was accessed.*/
	>/*tu start tag closed.*/
		<note>Text of a note at the TU level.</note> /*Self-explanatory.*/
		<prop type="x-Domain">Christmas</prop> /*Custom property tags.*/
		<prop type="x-Project">Santa</prop> /*Custom property tags.*/
			<tuv /*Contains text in a given language.*/
				xml:lang="ro" /*Language for the current text.*/
				creationdate="20181129T122945Z" /*Date it was created.*/
				creationid="John Doe" /*User who created it.*/
				> /*tuv start tag closed.*/
				<seg>data Propozitia ta aici.</seg> /*The text data.*/
			</tuv> /*tuv closing tag*/
			<tuv /*Contains text in a given language.*/
				xml:lang="en" /*Language for the current text.*/
				creationdate="20181129T122945Z" /*Date it was created.*/
				creationid="John Doe" /*User who created it.*/
				changedate="20181129T122945Z" /*Date it was modified.*/
				changeid="Jane Doe" /*User who modified it.*/
			> /*tuv start tag closed.*/
				<seg>Your sentence here.</seg> /*The text data.*/
			</tuv> /*tuv closing tag*/
		</tu> /*tu closing tag*/
 </body> /*body closing tag*/

</tmx> /*tmx closing tag*/
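
To see how this structure looks from the reading side, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It is not part of the tool; `SAMPLE_TMX` is a trimmed-down copy of the overview above (without the annotations, which are not valid XML), and `read_translation_units` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
          datatype="PlainText" segtype="sentence" adminlang="en-us"
          srclang="EN" o-tmf="ABCTransMem"/>
  <body>
    <tu tuid="0001">
      <tuv xml:lang="ro"><seg>Propozitia ta aici.</seg></tuv>
      <tuv xml:lang="en"><seg>Your sentence here.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline tags.
            text = "".join(seg.itertext()) if seg is not None else ""
            variants[tuv.get(XML_LANG)] = text
        units.append(variants)
    return units
```

Calling `read_translation_units(SAMPLE_TMX)` gives a list with a single dictionary that maps "ro" and "en" to their segments.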

Database time

I had got away with using only text files until now, but I realised that would no longer be the case if I wanted to work with TMs and glossaries.

At the time I made this, I had almost no knowledge of working with databases. Sure, I could do pivot tables and some fancy stuff in Excel, but nothing that might help me here.

Nevertheless, I knew how I would do things, I just had no idea how to code them.

Send data
Retrieve data
Edit existing data

Menu bar
Ribbon menu
Database controls - database search, hopefully.
1a - Source Text read line by line
2a - Target Text read line by line

There's also the issue of saving the Access database data as a .tmx file. But more about that a bit later; first I had to learn about OLE DB.

Fast forward about two days, or so, and I was able to write and read data from a database using the TM Alignment UI I made.
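
The three operations (send, retrieve, edit) look roughly like this. Note the swap: the tool talks to an Access database through OLE DB, while this sketch uses Python's built-in `sqlite3` as a stand-in so it can run anywhere. The table name `tm` and the helper names are made up for illustration; the column names are the ones used in the .accdb.

```python
import sqlite3
from contextlib import closing


def open_tm(path=":memory:"):
    """Open (or create) the TM database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tm ("
        "id INTEGER PRIMARY KEY, "
        "Stsegment TEXT NOT NULL, Ttsegment TEXT NOT NULL, "
        "Slang TEXT, Tlang TEXT)"
    )
    return conn


def add_pair(conn, st, tt, slang, tlang):
    """Send data: store one aligned pair and return its row id."""
    cur = conn.execute(
        "INSERT INTO tm (Stsegment, Ttsegment, Slang, Tlang) VALUES (?, ?, ?, ?)",
        (st, tt, slang, tlang),
    )
    conn.commit()
    return cur.lastrowid


def find_pairs(conn, needle):
    """Retrieve data: pairs whose source segment contains the search text."""
    rows = conn.execute(
        "SELECT Stsegment, Ttsegment, Slang, Tlang FROM tm "
        "WHERE Stsegment LIKE ? ORDER BY id",
        (f"%{needle}%",),
    )
    return rows.fetchall()


def update_target(conn, pair_id, new_tt):
    """Edit existing data: replace the target segment of one pair."""
    conn.execute("UPDATE tm SET Ttsegment = ? WHERE id = ?", (new_tt, pair_id))
    conn.commit()


if __name__ == "__main__":
    # closing() guarantees the connection is closed, even after an error.
    with closing(open_tm()) as conn:
        pair_id = add_pair(conn, "Your sentence here.", "Propozitia ta aici.", "en", "ro")
        print(find_pairs(conn, "sentence"))
```

Wrapping the connection in `closing()` is one simple way to make sure the database is always closed when the work is done.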

Default UI - when opening the TM Creator

Alignment UI - after loading ST and TT and pressing Begin

TMX export - after pressing the TMX export button

UI - Break-Down

Open ST - open Source Text (text 1)
Open TT - open Target Text (text 2)

Lock ALL - locks the text box UI - text cannot be edited
Unlock ALL - unlocks the text box UI - text can be edited
Next Batch - validates all pairs and moves to the next batch of pairs
Start/Begin - start alignment

to TMX - export to TMX
to accdb - export to .accdb
Options
- Show - shows tmx export menu
- Hide - hides tmx export menu

Open ST - open 1st text file
Open TT - open 2nd text file
Start/Begin - start alignment
Lock ALL - locks the text box UI - text cannot be edited
Unlock ALL - unlocks the text box UI - text can be edited
Next Batch - validates all pairs and moves to the next batch of pairs

Status - Shows the status of the connection with the database
Dev Options - Activates the test menu - allows searching the TM - bugged atm

Check TMX summary for more info

Language Pair - drop-down menu for language selection
Creation Tool - Creation tool name
Creation id - input creation ID
Creation Date - Unix date
Source Language - srclang
Data Type - drop-down menu for data-type (eg. PlainText)
Administrative Language - adminlang
Segment Type - drop-down menu for segtype (eg. sentence)
Original TMF - o-tmf
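
One detail worth a sketch: the TMX header expects dates in the YYYYMMDDThhmmssZ format shown in the TMX summary. Assuming that "Unix date" means a Unix timestamp in seconds (an assumption on my part), the conversion in Python could look like this. The helper name is hypothetical.

```python
from datetime import datetime, timezone


def unix_to_tmx_date(timestamp):
    """Format a Unix timestamp (seconds) as a TMX date in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y%m%dT%H%M%SZ")
```

For example, `unix_to_tmx_date(0)` returns "19700101T000000Z".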

Demo

The exporter gets the needed data from two areas:

Meta-data for the TMX - the header part - comes from the TM Options Menu
Segment data - comes from the .accdb, which contains four columns

Stsegment - source text segment
Ttsegment - target text segment

Slang - source text language code
Tlang - target text language code
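
Putting the two sources together, an exporter along these lines would do the job. This is a Python sketch with `xml.etree.ElementTree`, not the tool's actual exporter: `rows_to_tmx` and its default language codes are hypothetical, and each row is assumed to arrive as a (Stsegment, Ttsegment, Slang, Tlang) tuple.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this attribute name out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def rows_to_tmx(rows, header, default_slang="en", default_tlang="ro"):
    """Build a TMX document from database rows and header meta-data.

    rows   - iterable of (Stsegment, Ttsegment, Slang, Tlang) tuples
    header - dict of header attributes, e.g. from an options menu
    Missing language codes fall back to the (hypothetical) defaults.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", header)
    body = ET.SubElement(tmx, "body")

    for number, (st, tt, slang, tlang) in enumerate(rows, start=1):
        tu = ET.SubElement(body, "tu", {"tuid": f"{number:04d}"})
        variants = ((slang or default_slang, st), (tlang or default_tlang, tt))
        for lang, text in variants:
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text

    return '<?xml version="1.0"?>\n' + ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    header = {
        "creationtool": "XYZTool", "creationtoolversion": "1.01-023",
        "datatype": "PlainText", "segtype": "sentence",
        "adminlang": "en-us", "srclang": "en", "o-tmf": "ABCTransMem",
    }
    rows = [("Your sentence here.", "Propozitia ta aici.", "en", "ro")]
    print(rows_to_tmx(rows, header))
```

Letting the XML library build the elements means special characters such as & and < in a segment are escaped automatically.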

Errors and bugs

Source Language and Target Language codes seem to not get written into the database. This means those tmx entries will have default values.
Database is not closed upon exiting the program.
On pressing begin the Source Text and the Target Text get swapped. It seems to be a UI issue.

Glossary

One annoying error that is yet to be fixed is properly closing the database. It's the same error as with the TMX exporter.

Conclusion

Did I manage to make a CAT tool? Nope, not even close.

It was an interesting project and there were plenty of issues that I had to solve.

Some of the issues were solved, and some are still there and will have to be solved, eventually.

There are some other things that could be showcased such as: spellcheck, some dictionary APIs and MT. Compared to what I have already shown, they are not that interesting.

All in all it was a good learning opportunity and a lot of practice with code writing (several thousand lines of code).