A framework for learning
A fundamental limitation of Machine Assisted Translation (MAT) systems is their need for very large corpora of training data. Most Bible translation projects cannot provide such corpora, which restricts the contribution of these systems to the later stages of a project.
Project Paddington seeks to overcome this limitation by kick-starting MAT with a bilingual lexicon and morphology and syntax tables compiled from very small amounts of text. This brings forward the moment when MAT systems can contribute to a translation.
There are three major areas of work being developed for Paddington:
- A set of small processes we call parseBots (pB, hence Paddington), each tasked with learning about the content and structure of a very small piece of text,
- The Language Module which collects and collates items learnt by the bots and
- An Interaction Module tasked with engaging with the user to verify the findings of the rest of the system.
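As a rough illustration of how these three parts might fit together, here is a minimal Python sketch. Every name in it (Finding, LanguageModule, cooccurrence_bot, verify) is hypothetical and not taken from Paddington itself, and the co-occurrence heuristic is a deliberately naive stand-in for whatever the real parseBots do. The sentence pairs are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One item a bot believes it has learnt (hypothetical structure)."""
    kind: str      # e.g. "lexicon"
    source: str    # word in the model language
    target: str    # proposed equivalent in the target language


def cooccurrence_bot(source_sentence, target_sentence):
    """Toy parseBot: propose every source/target word pair in one aligned sentence pair."""
    return [
        Finding("lexicon", s, t)
        for s in source_sentence.lower().split()
        for t in target_sentence.lower().split()
    ]


class LanguageModule:
    """Collects findings from all bots and counts how often each one recurs."""

    def __init__(self):
        self.findings = Counter()

    def collect(self, findings):
        self.findings.update(findings)

    def candidates(self, min_count=2):
        # Only findings proposed at least min_count times are worth showing to a user.
        return [f for f, n in self.findings.items() if n >= min_count]


def verify(candidates, ask):
    """Toy Interaction Module: keep only the candidates the user confirms via ask()."""
    return [f for f in candidates if ask(f)]


module = LanguageModule()
for src, tgt in [("God loves", "Mungu anapenda"), ("God speaks", "Mungu anasema")]:
    module.collect(cooccurrence_bot(src, tgt))

# A real system would prompt the translator here; this sketch accepts everything.
confirmed = verify(module.candidates(), ask=lambda finding: True)
```

With only two sentence pairs, the single pairing that recurs is the one proposed for confirmation, which is the sense in which very small amounts of text can still yield lexicon entries.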
Many of our other systems have a role to play in building the capabilities of Paddington.
We presented a paper on Project Paddington, Learning from Sparse Data, at the ASLING TC39 conference in London in November 2017.
ParaTExt Glossing Technologies
The first major sub-system delivered for ParaTExt has become known as Glossing Technologies (GT), although the team knows it as Augustus. GT/Augustus provides Key Term glossing and automatic interlinear back-translation for ParaTExt.
Transforming the word
Sometimes elements of a language, particularly words borrowed from other languages, are transformed to match the phonemes available in the borrowing language. Such transformations are generally regular, so they can be learnt and then applied automatically, allowing other systems to recognise similar forms that have undergone the same transformation.
PToleMy stands for Phoneme Transliteration Matrix and is an implementation of a Hidden Markov Model (HMM) which learns common transformations and then applies them to predict similar transformations in new contexts.
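The idea of learning regular transformations and reapplying them can be sketched in a few lines of Python. This is not an HMM and not PToleMy's implementation: it is a context-free frequency table, roughly the emission part of such a model, and it assumes the word pairs are already aligned one symbol to one symbol. The function names and the training pairs are made up for illustration.

```python
from collections import Counter, defaultdict


def learn_substitutions(pairs):
    """Count how often each source symbol corresponds to each target symbol."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        if len(source) != len(target):
            continue  # assumption: this sketch only handles one-to-one alignments
        for s, t in zip(source, target):
            counts[s][t] += 1
    return counts


def transform(word, counts):
    """Replace each symbol with its most frequently observed counterpart."""
    return "".join(
        counts[ch].most_common(1)[0][0] if ch in counts else ch
        for ch in word
    )


# Hypothetical borrowing language that renders "v" as "w".
table = learn_substitutions([("david", "dawid"), ("levi", "lewi")])
predicted = transform("eva", table)  # "ewa"
```

A full HMM would additionally model how the preceding symbols influence each substitution, and would handle insertions and deletions, which this sketch ignores.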
What’s in a name?
CogNomen is our proper-name finding system. It uses Percival to identify proper names in any language, based on their phonemic similarity to a model form of the name. Where there are consistent changes in phonemes between model and target languages, CogNomen uses PToleMy to learn and map these changes.
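Matching candidate words against a model form of a name can be illustrated with a plain edit-distance score. This is a stand-in under stated assumptions: CogNomen relies on Percival and on phonemic similarity, whereas the sketch below compares spellings character by character. The function names, the threshold of 0.6 and the token list are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalise the distance to a score between 0.0 (unrelated) and 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def find_name(model, tokens, threshold=0.6):
    """Return (score, token) pairs that resemble the model name, best first."""
    scored = [(similarity(model.lower(), t.lower()), t) for t in tokens]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


matches = find_name("Abraham", ["na", "Ibrahimu", "alisema"])
```

Where the two languages differ by consistent sound changes, applying a learnt substitution table to the model name before scoring would raise the similarity of the true match.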
A Pattern for Learning
Natural language is a messy affair, riddled with complex and sometimes contradictory patterns. The ability to find complex, non-concatenative patterns is crucial for any system that aims to parse language. Percival can parse such patterns and identify linguistic items that are related to one another through a shared, though perhaps complex, structure.
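One familiar kind of non-concatenative pattern is the consonantal root found in Semitic languages, where related words share a consonant skeleton while the vowels change. The Python sketch below groups words by that skeleton. It is only an illustration of the kind of relationship involved, not a description of how Percival works: the vowel set, the function names and the romanised word list are assumptions.

```python
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: simple romanised text with five vowels


def skeleton(word):
    """Strip vowels so that only the consonant frame of the word remains."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def group_by_skeleton(words):
    """Group words that share a consonant frame; drop frames seen only once."""
    groups = defaultdict(list)
    for word in words:
        groups[skeleton(word)].append(word)
    return {frame: members for frame, members in groups.items() if len(members) > 1}


related = group_by_skeleton(["kitab", "katib", "kutub", "dars", "durus", "bayt"])
```

Prefixes, infixes and consonant changes would defeat this simple approach, which is why finding such patterns in real text needs something considerably more capable.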
Percival can work with any language and even with language pairs with disparate alphabets by using the capabilities of our PToleMy sub-system.