7000++ => ParaLexica
Some of you may remember a bulletin in April entitled Aeroplanes, Jet Lag and Sparse Learning. Thankfully, there have been no more aeroplanes since then and the jet lag has passed, but the question of Learning from Sparse Data is still very much with us.
The biggest limitation of all Machine Translation (MT) systems is the need to train the system on lots of examples before it can function. In the case of an NT translation, this means that much of the task may be complete before there is enough data to train the machine to help. Colleagues in both UBS and SIL have been encouraging us to consider how our systems might be brought online earlier in a project, perhaps even from day one.
We set out to imagine a way for a machine to learn right from the very start of a translation project. This proved a very fruitful exercise. It is astonishing what can be imagined once you put aside the idea that something cannot be done… The idea began in an office in the SIL centre in Dallas, was further developed at EACL 2017 in Valencia and finally had some flesh put on its bones in the first part of the summer.
We can think of Language Learning as having three phases: Discovery, Validation and Verification. Discovery is the task of recognising structures or patterns in a language that may represent meaning or function. Validation is convincing ourselves that a pattern is worth investigating further. Verification is seeking confirmation from the translator that the analysis is good. Much of the early summer was devoted to modelling this concept and we have been very pleased with the results.
Introducing Project Paddington
We are calling the idea Project Paddington (PB). Why? Well, unusually, it is not just the name of a bear. As we began to model the discovery stage of the process we imagined a whole set of ‘Bots’, each of which was able to parse one element of natural language. This gave us morphBots, nameBots, syntaxBots, stemBots and so on. Collectively we thought of them as parseBots (PB), at which point a name for the project became obvious.
Turning this proposal into a viable process is a lot of work. Our existing systems (not least CogNomen) can drive many of the Bots but others will need to be developed from scratch. This task alone would easily consume the remaining funding we have and that is before we have begun to consider how best to aggregate the results from the various Bots into a coherent model for the language. Equally, designing and building a verification process which allows the (non-technical) translator to verify what the Bots are learning is not trivial.
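To make the parseBot idea a little more concrete, here is a purely illustrative sketch of how a set of Bots and a simple aggregator might fit together. Every name and interface here is our invention for the sake of the example; it is not the project's actual design, and real Bots (and anything driven by CogNomen) would be far more sophisticated.

```python
# Illustrative sketch only: each hypothetical Bot proposes analyses of a
# word, and an aggregator merges all proposals, best-confidence first,
# ready for a human translator to validate. None of these names come
# from the actual project.
from dataclasses import dataclass

@dataclass
class Proposal:
    bot: str          # which Bot produced this analysis
    pattern: str      # the structure it thinks it found
    confidence: float # how sure the Bot is (0..1)

class MorphBot:
    """Toy Bot that guesses a common suffix might be a morpheme."""
    def analyse(self, word):
        if word.endswith("ing"):
            return [Proposal("morphBot", word[:-3] + "+ing", 0.6)]
        return []

class NameBot:
    """Toy Bot that flags capitalised tokens as possible proper names."""
    def analyse(self, word):
        if word[:1].isupper():
            return [Proposal("nameBot", f"NAME({word})", 0.5)]
        return []

def aggregate(bots, words):
    """Collect every Bot's proposals across all words and rank them by
    confidence, so a translator can verify the most promising first."""
    proposals = [p for w in words for b in bots for p in b.analyse(w)]
    return sorted(proposals, key=lambda p: p.confidence, reverse=True)

bots = [MorphBot(), NameBot()]
for p in aggregate(bots, ["Paddington", "walking"]):
    print(p.bot, p.pattern, p.confidence)
```

The point of the sketch is the shape of the problem: each Bot is cheap to run on its own, but turning a pile of competing, partially overlapping proposals into one coherent model of the language, and presenting that to a non-technical translator for verification, is where the real work lies.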
We took the proposal to UBS, showed them a demo and asked the question: Do we put this to one side until we have finished the current work schedule or should we make this a priority? The response was: This is a very exciting development which keeps the translator at the heart of the process. We like it a lot and we want you to prioritise it.
So, we are embarking on a major piece of work of a scale well beyond the capacity of our current funding. Did we mention prayer…?