This is part of my series of posts about the PolyAnno project – more here
Translations are handled almost identically to transcriptions in the project as I left it at the end of 2016, with the only real differences being in variable names.
However, this was not the result of simply dismissing this facet. I did research crowdsourced translation and other methodologies for mass-translating texts, and found several main points worth considering for the PolyAnno project:
- We will not necessarily have a consistent transcription.
- There is no freely accessible single package for automating the translation of all the possible languages – and periods of those languages – that the Collections at Edinburgh contain, even if we had a definitive transcription for each one.
- Users contributing could easily be misled or confused by how languages change over time – words shift their meaning, and for documents with missing or incorrect metadata about the dating of a digitised item, people could en masse translate incorrectly, leaving the qualified few who recognise the style of the correct period fairly helpless. I investigated using Google Ngrams, the Etymological Wordnet, or other large text datasets to automate triggers for changes of meaning, but I am severely underqualified (and lacking the resources to learn within the scope of the project) for a linguistics and NLP project like that.
- Certain formal features ensure a consistency that could ease translation – knowing, for example, that legal documents from a certain time and place always start with the same phrase. I investigated the possibility of prompts providing hints when the metadata suggests a high likelihood of this, but we are lacking the relevant metadata, and each of these hints would need to be written and added to the code manually.
- The metadata for the languages of documents was incorrect surprisingly often, because the Victorians were idiots who didn’t know what languages looked like. Or didn’t care, I don’t know. Either way, it doesn’t help.
- There could be multiple languages within one document.
- Regional dialects. Written.
- The most successful crowdsourced translation projects worked because they already had huge user bases – e.g. Facebook and Twitter – that they could just kindly ask to translate modern languages. Others just cheat and pay people. The remaining projects take a very long time and only work when mapping one specific language (and period of it) to another, not the range of languages and periods found in these collections.
- Some people may wish to use this project to build a portfolio of public, useful, niche translations for their C.V.
- Some people may wish to use this project for teaching: to generate discussions about translation by finding controversial translations that are not generally well documented, or simply to boost students’ investment, since they know that working on translations here would actually be contributing original work.
And most importantly of all, I learned that it is unlikely I will ever be able to read the “History of the Adventures, Deeds and Courageous Actions of the Hero and Knight Thewerdanck” unless I get this project to work or I learn medieval High German.

Language Identification
Within the JSONs used for storing the transcriptions and translations there is a language field, but I never designed or developed a methodology for identifying the language(s) involved. This matters because the metadata provided at the time of digitisation is quite possibly incomplete or incorrect.
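As a rough sketch of what that looks like (the field names here are my own illustration rather than the project’s exact schema), each stored translation would carry a language tag alongside its text:

```json
{
  "type": "translation",
  "text": "History of the Adventures, Deeds and Courageous Actions of the Hero and Knight Thewerdanck",
  "language": "en"
}
```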
There are a lot of possible variations of languages in the world; however, I am limiting the design to those that can be defined by the W3C standards as outlined here.
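One possible way of enforcing that restriction is sketched below – this is purely an illustration in TypeScript assuming a modern JavaScript runtime, not validation code that actually exists in PolyAnno. The built-in Intl API rejects values that are not structurally valid BCP 47 language tags:

```ts
// Minimal sketch (my assumption, not actual PolyAnno code): check that a
// user-supplied language value is a structurally valid BCP 47 tag before
// storing it in the annotation's language field.
function isValidLanguageTag(tag: string): boolean {
  try {
    // Intl.getCanonicalLocales throws a RangeError for structurally invalid tags.
    Intl.getCanonicalLocales(tag);
    return true;
  } catch {
    return false;
  }
}

isValidLanguageTag("gmh");       // true  – ISO 639 code for Middle High German
isValidLanguageTag("de-1901");   // true  – German, traditional orthography variant
isValidLanguageTag("not a tag"); // false – not a well-formed language tag
```

Note that this only checks that a tag is well formed; it cannot tell us whether the tag actually matches the language on the page, which is the harder identification problem described above.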
Next: Verification
This is part of my series of posts about the PolyAnno project – more here