Broadening the Perspective

As we approach Autumn there is much for which to be thankful in the work of the last few months. We have made steady progress with our Sparse Data Learning Model, and good outcomes were forthcoming from both the software symposium at the American Bible Society in Philadelphia and from the ACL/DeepLo 2018 conference in Australia.

Major international conferences are excellent ways to get a synopsis of the ‘state of the art’ in a field. ACL 2018 was no exception and, coupled with the DeepLo conference later the same week, which focussed on low-resource languages, the two events provided a clear picture of how well, or otherwise, languages without strong commercial support are served by mainstream Machine Translation (MT).

When the MAT team was first set up by BFBS in 1991 the state of the art was considered to be Rule-Based MT (RBMT). Our work then in Statistical MT (SMT) was very much left-field. Over the next 15-20 years RBMT gradually lost ground to SMT and by about 2010 Phrase-Based SMT (PBSMT) was established as the leading methodology for MT systems. Our own SMT systems have been available in ParaTExt since about 2007.

The last two years, however, have seen a sea-change in state-of-the-art MT. The advent of Neural MT (NMT) from Google (and others) in 2016 brought a new kid onto the block who seemed able to do things that bit better than the existing PBSMT systems. You might expect that we too would be looking to NMT for future systems but the reality is a little different.

SMT, particularly PBSMT, systems require a lot of example data from which to learn the equivalences between a pair of texts. NMT, sadly, needs even more, to the point that using NMT to translate between language pairs with little training data is simply impractical. This limitation has led us to imagine other approaches, and it was the chance to examine our approach alongside mainstream NMT solutions that took Jon to Melbourne.

There were some interesting outcomes. Broadly speaking, NMT systems cannot contribute where low-resource languages are concerned. There are, however, some things we can borrow from NMT systems.

NMT is based upon what is sometimes called ‘Deep Learning’ (DL). DL uses multi-layered neural networks (NN) to learn the correct outputs for a given input. This involves combining signals from many different nodes in the network to assess the validity of a hypothesis. NNs have a number of mechanisms that do this, one of which (Threshold Logic Units) we have adapted to allow us to combine signals from different analyses towards aggregate conclusions.
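For the technically curious, the little sketch below gives a flavour of what a threshold logic unit does: several analyses each contribute a weighted signal and the unit ‘fires’ only when the combined evidence clears a threshold. It is purely illustrative; the analysis names, weights and threshold are invented for the example and this is not our production code.

    # Purely illustrative: a threshold logic unit combining evidence from
    # several analyses. The analysis names, weights and threshold are invented.

    def threshold_unit(signals, weights, threshold):
        """Fire (return True) when the weighted evidence clears the threshold."""
        activation = sum(weights[name] * value for name, value in signals.items())
        return activation >= threshold

    # Each analysis reports its confidence (0.0 - 1.0) that a word is a proper name.
    signals = {"spelling": 0.9, "position": 0.4, "frequency": 0.7}
    weights = {"spelling": 0.5, "position": 0.2, "frequency": 0.3}

    if threshold_unit(signals, weights, threshold=0.6):
        print("Aggregate evidence supports the hypothesis")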

Something else NNs offer is an intriguing ability to perform better when they are thinking about different things at the same time(!). How does that work? Well, a NN is trained to perform a task and then, once it is performing well, the same network is trained to do something entirely different while retaining its original ability. The odd thing is that when you do this, the performance of the network on its original task improves. How’s that for bizarre?

These two characteristics are also implicit in the work we have in hand in our Sparse Data Learning Model. Disparate but related analyses are run against the same data and the outputs combine to strengthen learning outcomes.

All in all it is fascinating, distinctly leading (bleeding?) edge, and building it into a useful machine for Bible translation is proving one of the most interesting and extensive projects we have ever undertaken.

 

Working together

Chiang Mai report

Both Jon and Neil travelled to Payap University in Chiang Mai early in May for meetings with colleagues working with SIL/Wycliffe and GBI. The focus was on discovering and exploiting the ways in which our respective research efforts might support one another’s work, and the outcomes were very encouraging. Our own focus on language learning machines was strongly endorsed and we can already see ways in which this work might be exploited in systems under development by colleagues elsewhere.

The whole area of machine learning and automatic translation technologies is an increasingly important topic for Bible translation and our key funders are known to be keen to exploit such technological developments. Keeping our work central to this is important, so that opportunities are not missed and the widest possible benefits are secured for translators.

Soon after our return from Chiang Mai we received an invitation to contribute to an important discussion with ETEN, our major funders, at the American Bible Society HQ in Philadelphia PA in July. This meeting will bring together the Chiang Mai participants with a wider group from other organisations working in the same field in discussion with funders. This is an important meeting for us. The work we do is primary research in a highly complex, technical field. Building confidence and understanding with funders is key to the long term outcomes towards which we are working. Jon travels out to PA on 10th July for this meeting.

DeepLo 2018

The response to our request for help with the costs of attending the DeepLo conference on MAT for low-resource languages in Melbourne was strong. Jon travels on from PA to Australia (via LHR) on 13-16th July to attend ACL 2018 and DeepLo from 16-19th July. Thank you to all those who responded so generously. We are hopeful that we shall not only have the opportunity to present our work in this key research forum but that there will also be opportunity to learn more about other initiatives in support of low-resource languages from amongst the wider research community.
This is a punishing schedule but these two events will be key for the project in terms both of wider support and future research directions.

Less is More

In a world where we are surrounded by texts it can be strange to realise that sometimes the first ever piece of text written in a language is the one being typed by the Bible translator. Our Sparse Data Research project looks to glean as much information about a language as possible from the very start of a project, perhaps from very small amounts of text. Research is showing good results after just ten verses have been translated.

In fact we find that more data doesn’t necessarily help this process. Small groups of verses will often have related words, forms and names and be in a consistent genre. When more data is added we find the ‘noise’ of this new data can drown out the ‘signal’ that can be found in the small group of verses.

Less truly is more in these circumstances. One could almost say there’s a ‘still, small voice’ waiting to be heard above the loudness of the less focussed approach.

Less is, unfortunately, not always more where budgets are concerned. An invitation to participate in a workshop focussed precisely on the problem of working in ‘low-resource languages’ (DeepLo 2018) is both welcome and timely. DeepLo is the first forum at which machine learning specialists and computational linguists are invited to meet and discuss these issues. Our work is central to their focus. The invitation is an affirmation of our work but the venue, Melbourne, Australia, would make a huge hole in our travel budget for the year. Nevertheless, the opportunity not only to present our work but to hear from others how they are approaching these issues is important. If you would like to make a donation towards the costs of attending the DeepLo workshop, please do so here.

Getting a wider team together

The ParaLexica team will be in Chiang Mai, Thailand in May to meet colleagues looking at progress in computer aided translation projects and specifically at how our research might help. On-line meetings have already taken place with a larger group to prepare for the meetings in Chiang Mai, to ensure the time is used as efficiently as possible. Bringing a focus to these meetings is good, and the chance to meet colleagues face-to-face is so important. Having the time to chat over a meal and to let the team run with an idea in an unstructured way often brings good outcomes. Those back-of-the-envelope or napkin scribblings can be highly fruitful!

Chiang Mai was chosen for our meetings partly because there is a great deal of Bible translation work in South East Asia, where many communities are still in need of scripture in their own language. The difficulty of providing good linguistic analysis for languages with little or no resources is well recognised here.

 

ASLING TC39

Friday 17th November 2017 found us in London attending the Association internationale pour la promotion des technologies Linguistiques (#ASLING) 39th Translating and the Computer conference (#TC39). The conference is held annually in London at the Institution of Mechanical Engineers on Birdcage Walk under the watchful eye of George Stephenson, whose portrait hangs in the conference room.

The conference attracts a wide following of computational linguists and translators in equal measure, representing academic researchers, commercial translation providers and government agencies, including the EU and UN translation services. In this, the TC conference is, in our experience, a unique blend of research, reality and pragmatism, and as such it represents a highly knowledgeable arena within which to present our work for peer review.

We have presented here on many occasions (see: publication list) and our work has always been well received (as evidenced by the number of return invitations we get). This year we presented our early research on Learning from Sparse Data or, as we call it, Project Paddington. We were not at all sure how well this would go down. The peer reviewers’ comments on our paper had not been entirely enthusiastic, not least because the approach we are working with is, for good reasons, diametrically opposed to most current research in Machine Translation. Some of the reviewers clearly felt we ought to fall into line. So it was with a little trepidation that Jon clambered onto the rostrum to present our paper in the very last session of the conference.

To our relief (and, to be honest, some surprise) the response was enthusiastic. The first comment during questions came from the CEO of a commercial MT provider, who thanked us for the paper and then went on to say, “This is fantastic work. I have always felt that this kind of model is how our systems should be approaching language learning, it just feels right, and now you have demonstrated that it can work! I am going to model this as soon as possible, thank you!” Further, equally supportive questions and comments followed until the session had to be closed to allow the conference as a whole to be formally ended. As we left the building half an hour later we were still in deep discussion with other delegates about our work, including the conference keynote speaker, Prof. Alexander Waibel from Carnegie Mellon University International Center (sic) for Advanced Communication Technologies, who identified many points of contact with the work of his department and mainstream MT challenges.

All in all, it was a very pleasing outcome and we came away much encouraged that our recent research is very much at the forefront of developing language technologies.

ParseBots, Language Models and a New Name

It has been rather too long since our last newsletter. Our only excuse is that we have been busy but the summer holiday season has offered us a chance to draw breath at last.

7000++ => ParaLexica

The first thing we need to share is that we have decided to rebrand the team. With so much of our work so closely coupled to the UBS ParaTExt project we felt a better name for the project would be ParaLexica which has something of a family feel to it with ParaTExt as well as sounding vaguely linguistic. We have a nice new website – paralexica.net – to go with our new name and a blog – paralexic.net/theblog/ – to which copies of these bulletins will be posted.

Some of you may remember a bulletin in April entitled Aeroplanes, Jet Lag and Sparse Learning. Thankfully, there have been no more aeroplanes since then and the jet lag has passed but the question of Learning from Sparse Data is still very much with us.

The biggest limitation of all Machine Translation (MT) systems is the need to train the system with lots of examples before it can function. In the case of an NT translation this means that much of the task may be complete before there is enough data to train the machine to help. Colleagues in both UBS and SIL have been encouraging us to consider how our systems might be brought on line earlier in a project, perhaps even from day one.

We set out to imagine a way for a machine to learn right from the very start of a translation project. This proved a very fruitful exercise. It is astonishing what can be imagined once you put aside the idea that something cannot be done… The idea began in an office in the SIL centre in Dallas, was further developed at EACL 2017 in Valencia and finally put some flesh on its bones in the first part of the summer.

We can think of Language Learning as having three phases: Discovery, Validation and Verification. Discovery is the task of recognising structures or patterns in a language that may represent meaning or function. Validation is convincing ourselves that a pattern is worth investigating further. Verification is seeking confirmation that the analysis is good. Much of the early summer was devoted to modelling this concept and we have been very pleased with the results.
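For those who like to see ideas in code, here is a deliberately over-simplified sketch of the three phases. Everything in it (the toy suffix-counting ‘discovery’, the cut-off value, the question put to the translator) is invented for illustration and is not the Sparse Data Learning Model itself.

    # Purely illustrative: the Discovery / Validation / Verification cycle as a
    # tiny pipeline. The heuristics and thresholds are invented for the example.

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        pattern: str   # e.g. a candidate word ending
        score: float   # evidence gathered so far (0.0 - 1.0)

    def discover(verses):
        """Discovery: propose candidate patterns from a very small sample of text."""
        endings = {}
        for word in " ".join(verses).split():
            endings[word[-2:]] = endings.get(word[-2:], 0) + 1
        total = sum(endings.values())
        return [Hypothesis(e, n / total) for e, n in endings.items()]

    def validate(hypotheses, cutoff=0.1):
        """Validation: keep only the hypotheses worth investigating further."""
        return [h for h in hypotheses if h.score >= cutoff]

    def verify(hypothesis):
        """Verification: in the real system this is a question put to the translator."""
        return input(f"Is '{hypothesis.pattern}' a meaningful ending? (y/n) ").lower() == "y"

    sample = ["la komenco estis la vorto", "kaj la vorto estis kun Dio"]
    for hypothesis in validate(discover(sample)):
        print(hypothesis)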

Introducing Project Paddington

We are calling the idea Project Paddington (PB). Why? Well, unusually, it is not just the name of a bear. As we began to model the discovery stage of the process we imagined a whole set of ‘Bots’, each of which was able to parse an element of natural language. This gave us morphBots, nameBots, syntaxBots, stemBots etc, etc… Collectively we thought of them as parseBots (PB), at which point a name for the project became obvious.
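To give a feel for the idea, here is a toy sketch of what a common parseBot interface might look like. The class names and the heuristics inside them are our invention for this bulletin, not the project’s actual design.

    # Purely illustrative: a common interface each parseBot might share.
    # The class names and heuristics are invented for this bulletin.

    class ParseBot:
        """Every Bot examines a verse and reports what it thinks it has found."""
        def parse(self, verse):
            raise NotImplementedError

    class NameBot(ParseBot):
        def parse(self, verse):
            # Toy heuristic: capitalised words after the first are candidate names.
            return {"names": [w for w in verse.split()[1:] if w[:1].isupper()]}

    class StemBot(ParseBot):
        def parse(self, verse):
            # Toy heuristic: strip a two-letter ending from longer words to guess stems.
            return {"stems": [w[:-2] if len(w) > 4 else w for w in verse.split()]}

    bots = [NameBot(), StemBot()]
    verse = "En la komenco Dio kreis la cielon kaj la teron"
    for bot in bots:
        print(type(bot).__name__, bot.parse(verse))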

Turning this proposal into a viable process is a lot of work. Our existing systems (not least CogNomen) can drive many of the Bots but others will need to be developed from scratch. This task alone would easily consume the remaining funding we have and that is before we have begun to consider how best to aggregate the results from the various Bots into a coherent model for the language. Equally, designing and building a verification process which allows the (non-technical) translator to verify what the Bots are learning is not trivial.

We took the proposal to UBS, showed them a demo and asked the question: Do we put this to one side until we have finished the current work schedule or should we make this a priority? The response was: This is a very exciting development which keeps the translator at the heart of the process. We like it a lot and we want you to prioritise it.

So, we are embarking on a major piece of work of a scale well beyond the capacity of our current funding. Did we mention prayer..?

Publications

Riding J and Boulton N (2019), The ParseBot Language Analyser, In Proceedings of BT2019. SIL.

Riding J and Boulton N (2017), Learning from Sparse Data, In Translating and the Computer 39 – Proceedings, pp. 89-97. ASLING.

Riding J and Boulton N (2016), What’s in a Name?, In Translating and the Computer 38 – Proceedings, pp. 122-132. ASLING.

Riding JD (Forthcoming), Translation, Technology, Churches and the Bible, In The Signs of the Times. Ed. Norton, J, Wipf and Stock.

Riding J (2012), Hunting the Snark – the problem posed for MT by complex, non-concatenative morphologies, In Translating and the Computer 34. ASLIB/IMI.

Riding J (2011), PToleMy – Transliterating Proper-Names, United Bible Societies Europe and Middle East Area Translation Conference – Proceedings.

Riding JD (2011), God and The Machine, The Linguist. Vol. 50(5), pp. 18-19.

Riding J and van Steenbergen G (2011), Glossing Technology in Paratext 7, The Bible Translator. Vol. 62(2), pp. 92-102.

Riding J (2010), Towards an understanding of word formation in natural language. Presentation to: UBS (GIAG).

Rees N and Riding J (2009), Automatic Concordance Creation for Texts in Any Language, In Proceedings of Translating and the Computer 31. IMI/ASLIB.

Riding J (2009), MT and MAT and Developing World Vernacular Languages, In Proceedings of Machine Translation 25 Years On. BCS NLTSG.

Riding J (2008), Statistical Glossing, Language Independent Analysis in Bible Translation, In Translating and the Computer 30. ASLIB/IMI.

Riding JD (2007), A Relational Method for the Automatic Analysis of Highly-Inflectional Morphologies. Thesis at: Oxford Brookes University.

Project Paddington

A framework for learning.

A fundamental limitation of Machine Assisted Translation systems is the need for very large corpora of training data. Most Bible translation projects cannot provide this, which limits the contribution of such systems to the later stages of a project.

Project Paddington seeks to overcome this limitation by kick-starting MAT with a bilingual lexicon, morphology and syntax tables compiled from very small amounts of text. This brings forward the moment when MAT systems can contribute to a translation.

There are three major areas of work being developed for Paddington:

  1. A set of small processes we call parseBots (pB – hence Paddington), tasked with learning about the content and structure of a very small piece of text,
  2. The Language Module which collects and collates items learnt by the bots and
  3. An Interaction Module tasked with engaging with the user to verify the findings of the rest of the system.

Many of our other systems have a role to play in building the capabilities of Paddington.
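As an illustration of how the pieces might fit together, the sketch below shows a toy Language Module collating what the bots report and flagging items with enough independent support to be put to the translator. The names and structures are invented for the example and are not the real Paddington code.

    # Purely illustrative: a Language Module collating items reported by the bots.
    # The names and structure are invented for the example.

    from collections import defaultdict

    class LanguageModule:
        def __init__(self):
            # (category, item) -> set of bots that have reported it
            self.support = defaultdict(set)

        def collect(self, bot_name, findings):
            for category, items in findings.items():
                for item in items:
                    self.support[(category, item)].add(bot_name)

        def confident_items(self, minimum=2):
            """Items reported by enough independent bots to put to the translator."""
            return [item for item, bots in self.support.items() if len(bots) >= minimum]

    lm = LanguageModule()
    lm.collect("nameBot", {"names": ["Dio"]})
    lm.collect("morphBot", {"names": ["Dio"], "suffixes": ["-o"]})
    print(lm.confident_items())   # [('names', 'Dio')]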

We presented a paper about Project Paddington to the ASLING TC39 conference in London in Nov 2017: Learning from Sparse Data.

Project Augustus

ParaTExt Glossing Technologies

The first major sub-system delivered for ParaTExt has become known as the Glossing Technologies, though within the team it is called Augustus. GT/Augustus provides Key Term glossing and automatic interlinear back-translation for ParaTExt.

 

Project PToleMy

Transforming the word

Sometimes elements of a language, particularly words borrowed from other languages, are transformed to match the phonemes available in the borrowing language. Such transformations are generally regular and can be learnt and then applied automatically, allowing other systems to recognise similar forms which have undergone transformation.

PToleMy stands for Phoneme Transliteration Matrix and is an implementation of a Hidden Markov Model (HMM) which learns common transformations and then applies them to predict similar transformations in new contexts.
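As a rough illustration of the ‘learn the transformations, then apply them’ idea, the toy sketch below simply counts character-level substitutions from crudely aligned name pairs and applies the commonest mapping to a new name. PToleMy itself is an HMM working over phonemes with proper alignment; everything here, including the example names, is simplified and invented for illustration.

    # Purely illustrative. PToleMy itself is an HMM over phonemes; this toy
    # stand-in just counts character substitutions from crudely aligned name
    # pairs and applies the commonest mapping for each character.

    from collections import defaultdict

    def learn_substitutions(name_pairs):
        counts = defaultdict(lambda: defaultdict(int))
        for source, target in name_pairs:
            for s, t in zip(source, target):   # assumes a naive 1:1 alignment
                counts[s][t] += 1
        return {s: max(targets, key=targets.get) for s, targets in counts.items()}

    def apply_substitutions(name, table):
        return "".join(table.get(ch, ch) for ch in name)

    pairs = [("barnabas", "varnavas"), ("abba", "avva")]   # invented toy data
    table = learn_substitutions(pairs)
    print(apply_substitutions("boaz", table))              # -> "voaz"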

Project CogNomen

What’s in a name?

CogNomen is our proper-name finding system. It uses Percival to identify proper names in any language, based on their phonemic similarity to a model form of the name. Where there are consistent changes in phonemes between model and target languages, CogNomen uses PToleMy to learn and map these changes.
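The toy sketch below shows the general shape of the matching step: score each word in a verse against a model form of the name and keep the close matches. The real system works over phonemic forms and uses PToleMy’s learnt mappings; this example uses plain string similarity and invented data.

    # Purely illustrative: scoring words in a verse against a model form of a
    # name. The real system works over phonemic forms and uses PToleMy's learnt
    # mappings; this toy uses plain string similarity and invented data.

    from difflib import SequenceMatcher

    def similarity(model_form, candidate):
        return SequenceMatcher(None, model_form.lower(), candidate.lower()).ratio()

    def find_name_candidates(model_form, verse, threshold=0.7):
        """Return words in the verse that look like the model form of the name."""
        return [w for w in verse.split() if similarity(model_form, w) >= threshold]

    print(find_name_candidates("Simon", "tedy rzekl mu Szymon Piotr panie"))   # ['Szymon']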