A New Year, New Challenges

Merry Christmas! We are writing this just before the Feast of Candlemas on 2nd February when Christmas properly comes to an end with a celebration of Christ as ‘a Light to enlighten the Gentiles’. Which, of course, is us. We hope you have all enjoyed a peaceful and blessed Christmas season.

The last few weeks have proved surprisingly busy. Following the conference in Singapore we returned home already planning what should be the final testing phase of CogNomen, our names finder, with a view to releasing it for integration within ParaTExt later in 2017. The plans were laid and the work begun and then, out of the blue, came a request from UBS colleagues to evaluate a proposal from another organisation for a system to assist translators. Interestingly, there are many points of contact with our own work, both current and in the past. It is, in fact, partly based on work done by us around 2002. It’s a small world…

And so, Jon finds himself en route to Thailand (Chiang Mai) in February to meet the authors of the proposal and to make a technical assessment of the work. It is really encouraging to find that our work in the past is now the inspiration for others (even if it makes us feel old) and we are really looking forward to seeing where this new initiative will take us.

Here is Genesis 1 in Thai.

Please pray:

  • Give thanks for the fellowship of Bible translators and the Bible Societies.
  • Give thanks that labours in the past are bearing fruit now.
  • Whilst it may feel like a small world, the flights are long. Please pray for travelling mercies for all those attending the meeting in Chiang Mai.
  • For heads free from jet-lag to understand technically complex proposals.

Our appeal for help upgrading tired computers has already gathered more than £1,000; thank you so much to those who contributed. We still have a little way to go. One of our supporting fellowships here in the UK has set up an on-line donation page on MyDonate where you can send a donation and get the benefit of Gift Aid as well. If you’d prefer to donate via more traditional means please contact us and we shall send you details.


O Lord, you have given us your word for a light to shine upon our path; grant us so to meditate on that word, and to follow its teaching, that we may find in it the light that shines more and more until the perfect day; through Jesus Christ our Lord. Amen.

Staying Connected

We’d like to begin this bulletin with a message of thanks to all of you who remember us in your prayers. A few of you we see fairly regularly but most only rarely. These occasional meetings, so often prefaced with the assurance that you want to continue receiving news of the project and to support us in prayer, mean a lot to us and we value your support very greatly. Thank you.

Developing computer programs can be a very self-contained world. There is the task, the tools and the developer. Attention to detail is everything and it can be easy to lose sight of the bigger picture. Teams are often very ‘distributed’. Face to face contact is usually by Skype (when timezones coincide) and the day to day management of the work is via systems such as ‘Git’ hosted remotely on sites like ‘Bitbucket’ and ‘GitHub’. It is all moderately efficient and quite impersonal, which is one reason we value your support and the opportunity to share something of what we are doing via this list.

The other way we keep ourselves in touch with the wider world is by interacting with colleagues in the field by email and Skype and, just occasionally, by meeting together. Such a meeting happens next month in Singapore. We shall join colleagues from all over the world and it is a chance to hear what others are doing and to show the outcomes of our own work. The focus this year is a major new release of the ParaTExt translation editing software. The PT developers will be there as will regional PT support staff and representatives / early adopters from the translation community. We have been allocated a slot in the programme to talk about our work on Saturday 5th November.

Please pray:

  • For travelling mercies. For some it is a long flight with all that entails. Most will be en route during 2nd-3rd Nov.
  • For the smooth running of the conference, which takes place from 4th-11th Nov. Bringing together more than 100 people from all over the world needs a lot of administrative support.
  • Give thanks for Bible Society of Singapore who have undertaken the local arrangements (some of you will see a very familiar figure on their logo).
  • For our major funders, ETEN, whose representatives will no doubt be there to see what progress their investment has generated.
    Without their support much of global Bible translation would be starved of resources and our work would be immeasurably more difficult if not impossible. Pray they are encouraged.

Some of you have asked how you might contribute financially to support the team’s work. It is hard to do this via the bible societies as the mechanisms don’t really exist to send funds directly to a particular project. One of our supporting fellowships here in the UK has set up an on-line donation page on MyDonate where you can send a donation and get the benefit of Gift Aid as well. They will hold any gifts for us to draw on directly. We shall need to replace some fairly elderly computers soon and hope to use these funds towards the cost (about £4,000). If you’d prefer to donate via more traditional means please contact us at gtp@biblesocieties.org and we shall send you details.


Many UK churches have just celebrated Bible Sunday and some of us will have heard once again Thomas Cranmer’s collect:

BLESSED Lord, who hast caused all holy Scriptures to be written for our learning: Grant that we may in such wise hear them, read, mark, learn, and inwardly digest them, that by patience and comfort of thy holy Word, we may embrace and ever hold fast the blessed hope of everlasting life, which thou hast given us in our Saviour Jesus Christ. Amen.

It’s a great reminder of how important Bible translation is. We have been blessed with the Bible in English since the 16th century but there are still between 1 and 2 billion people in the world without scripture in their language.

Thank you for your prayers and support.

Jigsaws and Meerkats

Writing a computer program is a little like completing a jigsaw puzzle. One of the fundamentals is that you have all the pieces and (unless you are trying to make it last) you probably have a picture of the finished jigsaw on the lid of the box. The task is first to understand how all of the parts contribute to the picture and then to arrange them in the right order to make the picture appear. It can be an absorbing task but in the end it is a problem with a known solution.

One of the interesting things about MAT research is that we often find ourselves faced with problems for which we have no known solution. Instead of working out how to reproduce the picture on the lid we can be faced with a stack of pieces which may or may not make a picture, and the picture they make is unknown before we start the puzzle. In these circumstances we aren’t so much coding a solution as exploring possibilities and it’s only when we have examined all the possible solutions that we begin to get an idea of the problem we are trying to solve.

If this all sounds a little arcane then you’re right, it can be. More to the point, writing a program to solve this kind of puzzle can be quite challenging. Our program needs to know what pieces of the puzzle belong together and in what order. One snag is that we have no idea how many, if any, of the pieces we have are part of the solution. This creates problems. One way to solve this kind of problem is by using a technique called ‘recursion’. Recursion is a way of examining each piece in turn without needing to know how many pieces there are or how many are actually part of the solution.

Starting with the first piece we look to see how many others fit alongside it. We’ll call this our FindNext function. Now here comes the fun bit. Let’s suppose FindNext discovered three other pieces that fit alongside the first. For each one of those three pieces we ask FindNext to see how many others fit alongside that piece; and then we ask FindNext to do the same for those and so on. Every time FindNext finds an adjacent piece we ask FindNext to find others that are adjacent to that one. We can even get FindNext to ask itself to do the next piece! It is possible to solve quite complex problems using this technique and ‘Recursive’ solutions like this are very elegant ways to deal with such problems.
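The idea can be sketched in a few lines of Python. The names, the toy puzzle and the data structure here are our own invention for illustration, not CogNomen's actual code:

```python
def find_next(piece, fits, solution):
    """Recursively gather every piece reachable from `piece`.
    `fits` maps a piece to the pieces that fit alongside it."""
    solution.add(piece)
    for neighbour in fits.get(piece, []):
        if neighbour not in solution:              # skip pieces already placed
            find_next(neighbour, fits, solution)   # FindNext calls itself
    return solution

# A toy puzzle: 'a' fits beside 'b' and 'c'; 'c' also fits beside 'd'.
fits = {"a": ["b", "c"], "c": ["d"]}
print(sorted(find_next("a", fits, set())))         # ['a', 'b', 'c', 'd']
```

Notice that `find_next` never needs to be told how many pieces there are: it simply keeps calling itself until no unexamined neighbours remain.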

Regular readers will know that we have been building a system to find proper-names in text. The system we have built to do this, CogNomen (CGN for short), has at its heart a recursive algorithm we call Percival. Percival is a very elegant solution with surprisingly few lines of code but we discovered when we sent it out for testing that it had a major problem. Finding names up to about twelve characters long worked fine and took CogNomen no more than a second or two. But when we asked CGN to find longer names it took a lot loooooonger… Nebuchadnezzar took 13 seconds before CGN had found it in a text. That’s much too long. Worse was to come. One of our testers asked CGN to find Maher-Shalal-Hash-Baz; everything went very quiet and then, after some minutes, their computer announced “Unable to allocate memory, out of heap space”. CogNomen had broken the machine.

The problem proved to be recursion or, more accurately, how Windows (.NET) handles deeply recursive functions: not well. Every recursive call consumes a little more of a fixed-size stack, and enough calls exhaust it. In short, there wasn’t anything wrong with CGN but Windows couldn’t manage what CGN was asking it to do. This was a disappointment. Whilst there are only a handful of names longer than about twelve characters in the Bible we wanted to use the CGN processing in other contexts where the number of items to be processed might be a lot more than twelve.

The solution was to rewrite the Percival algorithm which underpins CGN as an iterative, rather than a recursive, function. This has taken us two months and the outcome is a very much more complicated implementation of Percival with many hundreds of lines of code but which nevertheless runs very fast indeed. Remember Nebuchadnezzar – 13 seconds to process? The new algorithm can do the same in 13 milliseconds! The basic process remains the same but the way we have implemented it is now iterative rather than recursive.
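The shift from recursive to iterative can be pictured like this: instead of letting the system's call stack remember where we are, the function keeps its own explicit to-do list. Again, a hedged sketch of the technique rather than the real Percival:

```python
def find_next_iterative(start, fits):
    """The same search, iteratively: an explicit to-do list replaces the
    chain of recursive calls, so very deep puzzles no longer exhaust the
    call stack."""
    solution = set()
    todo = [start]                        # pieces still waiting to be examined
    while todo:
        piece = todo.pop()
        if piece in solution:
            continue
        solution.add(piece)
        todo.extend(fits.get(piece, []))  # queue up the neighbours
    return solution

# A chain far deeper than a typical recursion limit would allow:
deep = {i: [i + 1] for i in range(10_000)}
print(len(find_next_iterative(0, deep)))  # 10001
```

The recursive version of this search would overflow long before reaching the end of that chain; the iterative version's to-do list simply grows and shrinks on the ordinary heap.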

All in all it has been a very interesting exercise but it has taken a lot of time we would have preferred to have given to other tasks. We now have a very efficient solution which has been, paradoxically, a lot more complicated to develop than our original one.

 

A Wider World

Back in November 2015 we blogged about the Translating and the Computer conference in London (TC37). Amongst the contributions was a presentation about the problems of ‘Discontinuous Structures’ in natural languages. The speaker introduced an EU funded project which hoped to make progress assessing the problems such structures create for NLP. For us this was an exciting moment. Many of the deficiencies in current NLP can be traced to failing to handle discontinuities amongst words in a text and deriving the underlying meaning.

By this we don’t mean that we can’t understand the individual words (although that can be a problem) but rather that it can be hard to see how words in a sentence relate to one another. Human beings often produce disjointed sentences where parts of the sentence which actually refer to the same thing are displaced across the clause. A straightforward example of this behaviour is the German split verb: Ich komme um 15 Uhr in München an (I arrive in Munich at 3 PM) where the separable prefix an becomes detached from its verb komme (ankommen = to arrive). More complex examples involve a word moving from its natural position in the sentence to the head (or, sometimes, the tail) of a phrase to give emphasis to a particular part of the utterance. The human being listening to the speaker perceives only the outcome of the shift, not least because the inflections in the speaker’s voice give many clues as to what is happening. It gets harder when the text is written down but the expectations of speech allow us to reconstruct the meaning.

This kind of thing is a lot harder for a computer. For a start, it has no knowledge base of speech patterns to fall back on. If it is working from a vast knowledge base of parallel clauses it may solve the problem by recognising the phrase as a whole and substituting the equivalent from its tables, but that can only work where there are vast parallel corpora for the particular pair of languages being processed. The other option is to encode a set of rules to describe the phenomenon, but the inventiveness of human beings in generating new forms will inevitably break the rule set sooner or later.

The world of NLP is only just beginning to address such problems and we were excited to hear at TC37 of a workshop on exactly these kinds of problems planned for the conference of the North American Chapter of the Association for Computational Linguistics in June 2016 (NAACL 2016). Our interest stems from our own work on discontinuous structures, which has been in the context of analysing complex (and discontinuous) morphologies and, latterly, recognising potentially discontinuous structures in transliterated proper-names. We were very interested to see what we might learn from others working in the same field.


de grammatica

Papers presented at NAACL 2016 ranged from the general problems of accounting for discontinuous structures in formal grammars of language through to an analysis of how Urdu/Hindi uses such forms to express possessives. All bar one of the papers presented sought to apply the principles of Lexical Functional Grammar (LFG) to map meaning to words. LFGs have grown out of work on Generative Grammars. They represent an attempt to map the surface form of a sentence to its meaning. Here are some examples:

LFGs construct two representations of a sentence, the c-structure (constituents) and the f-structure (features). C-structures try to represent the words in a sentence as they are spoken. In this example we are presented with two c-structures using the same three words but ordered differently. Now here is a question for you: are “John resigned yesterday” and “Yesterday John resigned” the same thing? Whilst the underlying event is common to both, the first one focusses on what happened (John resigned) but the second emphasises the moment when that event happened (Yesterday, …). Now look at the two f-structures; they are identical. By encoding language in this way will machines be able to figure out how utterances are actually working? Perhaps, but see above for rule sets…
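As a rough textual sketch of the point (the attribute names below are our own informal rendering, not formal LFG notation), both word orders share a single feature structure:

```python
# Two c-structures (word orders), one f-structure.
F_STRUCTURE = {
    "PRED":    "resign<SUBJ>",
    "SUBJ":    {"PRED": "John"},
    "TENSE":   "past",
    "ADJUNCT": [{"PRED": "yesterday"}],
}

def analyse(sentence):
    """Toy analysis: any ordering of these three words maps to the
    single feature structure above."""
    assert set(sentence.lower().split()) == {"john", "resigned", "yesterday"}
    return F_STRUCTURE

print(analyse("John resigned yesterday") == analyse("Yesterday John resigned"))  # True
```

The f-structure records who did what and when; what it deliberately throws away is the word order, which is exactly where the emphasis lives.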

One last diagram for you. Here is what happens when you show how a c-structure relates to its corresponding f-structure (John is clearly having a bad few days):


de forma

The exception to all the work on LFGs was a paper given by Prof. David Chiang from Univ. of Notre Dame. He set out to explore how ‘free’ word order in NL might be handled by NLP systems. He proposed using a form of Finite State Automaton (FSA, a much simpler cousin of the Turing Machine) to handle the problem. For us this was of particular interest. The FSA is a machine for handling a stream of events. In other words it expects to deal with events one at a time and in the order it encounters them. It is a solution we have used in the past for identifying syllable boundaries and it underpinned the work done to decode Enigma at Bletchley Park. This may all sound mighty technical but in the end it is simply a way of recognising that there is a dimension to language we are prone to overlook – time/order. As creatures we live our lives in time. Our world streams by us as a series of experiences ordered by time. One of those ordered streams is language. David observed that a sentence is not really just a bag of words. He spoke of using a development of regular expressions (a form of template) to give a machine the ability to build an expectation of what may be coming down the stream of language it is processing.
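A finite state automaton of this kind can be sketched very simply. Here is a toy two-state machine for marking syllable boundaries; the vowel/consonant classification and the boundary rule are deliberately naive illustrations of the streaming idea, not the system we or Prof. Chiang actually use:

```python
VOWELS = set("aeiou")

def syllable_boundaries(word):
    """Mark a boundary wherever a vowel is followed by a consonant.
    A two-state machine consuming the letter stream one event at a
    time, in the order it encounters them."""
    state = "C"                     # start off expecting consonants
    out = []
    for ch in word:
        kind = "V" if ch in VOWELS else "C"
        if state == "V" and kind == "C":
            out.append(".")         # vowel -> consonant: close a syllable
        out.append(ch)
        state = kind
    return "".join(out)

print(syllable_boundaries("banana"))   # ba.na.na
```

The machine never looks ahead and never looks back more than one state: everything it knows about the stream is folded into its current state, which is what makes FSAs so cheap to run.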

This is an interesting approach and one which is actually quite close to what we are doing. Instead of relying on artificial constructs such as c and f-structures it sets out to provide the machine with a way of interpreting the language stream item by item. This has a number of advantages. First of all, it is dealing with reality (something formal grammars don’t always do). Secondly, it encourages us to think of language as sentences/clauses and meaning rather than attempting to derive one of these from the other via a constructed grammar. This is close to the work of cognitive linguists who tend to speak of ‘form – meaning’ pairs (Langacker) rather than constructing elaborate representations of form and meaning such as LFGs.

Our own work is about automatically constructing templates which can then be used to identify similar constructions elsewhere in the same text. David Chiang began by using an FSA to identify related items in a text stream. It is particularly interesting that he is now observing that this processing model is analogous to a Recurrent Neural Network (RNN). In our own work we too are beginning to consider RNN-like models as holding much potential for NLP. It may well be that we can benefit from his work as we develop our own discontinuous structures processing.

The thing which was absent from all of the NAACL 2016 presentations was the recognition that discontinuous structures are endemic within NL, occurring not only in syntax but also in morphology and transliteration. We already know that complex morphology can be analysed using this kind of template recognition and we are also experimenting with similar techniques as a way of deriving syntax structures directly from text. Watch this space…

 

Testing Times

Towards Deployment…

March has been an exciting time for the team. Work begun in September last year has started to come together into something our users will be able to evaluate for us in the field. Some of you will know what that entails but for those that don’t here’s a short list of some of the work involved in turning an idea into a testable reality:

  • Decide how the idea can be made computable – some ideas never get past this stage; some things computers simply can’t do.
  • Work out a design for the solution. If you have ever made something from a kit, this stage is equivalent to deciding what parts you need in the kit to begin with.
  • Make the parts. This can take quite a while. Some will be simple components, others may be more complex, particularly those which provide the core processing to make the system work. Simpler parts can be constructed in a few hours, others may take weeks.
  • Test everything! Before we ask our users to test a system we have to make sure that each component in the system does exactly what it says on the tin. Programmers call this unit testing. It is very tedious but done properly it saves a huge amount of time in the long run. One component written recently took about 50 lines of code to create but the unit tests have run to well over 1,000 lines of code.

 

Unit testing is very important. We usually only sit down together once a month. The rest of the time we are talking each day on Skype but because we are in different places we each have a copy of the system we are working on and each of us will be working on a different element of the system. There is always the possibility that something one of us does in one context will break something the other has done elsewhere. Having unit tests allows us to test all our changes and make sure everything is still working as it should.

  • All systems have dependencies. We need to make sure our system is compatible with the different versions of ParaTExt and that we ship everything it needs to be able to work properly.
  • Build and Deploy. This is the moment when everything comes together and we finally have something we can offer to our testers to download and install on their computers. It is also the point when we need a name for the system. The front runner at the moment is Cognomen but if you can think of something better for a system to identify proper-names in a Bible, do let us know!
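To give a flavour of what the unit testing described above looks like in practice, here is a toy component with its tests. The function is our own invention for illustration, not a real part of the system:

```python
# A flavour of unit testing: small, exhaustive checks on one component.
def fold_accents(name):
    """Fold a handful of accented vowels down to plain ASCII."""
    table = str.maketrans("áéíóúâêîôû", "aeiouaeiou")
    return name.translate(table)

# Each unit test pins down one expected behaviour of the component.
assert fold_accents("Abiú") == "Abiu"
assert fold_accents("plain") == "plain"     # unaccented text is unchanged
assert fold_accents("") == ""               # empty input is safe
print("all unit tests passed")
```

Tests like these run automatically after every change, so if a tweak made in one place quietly breaks a component somewhere else, the failing assertion says so immediately.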

 

We are blessed with supportive colleagues all over the world and are not short of volunteers to test out our systems on different languages. These and other colleagues will be testing Cognomen. Does it do what it says on the tin? Where could it be improved? Can it handle the complexities of many different languages and still give helpful results?

The honest answer is, we don’t know. We do our own testing of course but that is inevitably limited. Handing the system over to people at the sharp end of Bible translation is the only way we can find out if it works…

easy as ABC…

Damage Limitation…

While Neil has been working on a Percival-based names finder Jon has been looking at the same problem but from the ‘other end’. Sorting out a train wreck is always a good thing to do but wouldn’t it be even better to avoid it in the first place?

The snag, of course, is that expecting human beings to remember exactly how they dealt with each one of 4,700 names over a 10-15 year period is probably optimistic.

Time to introduce you to a Bear called…

PToleMy

Most of our systems are named after real teddy bears, for largely historical reasons. (Perhaps one day we’ll post a picture of Percival but he’s a shy character and prefers to avoid the limelight). PToleMy is an unusual Bear because his name actually stands for something: PToleMy = Proper-name Transliteration Mapping. In fact he is named for Claudius Ptolemy whose treatise Geographia included the first map of the known world.

Proper-name Transliteration Mapping

When translators encounter a proper name in the text they don’t usually try and translate it; instead, they try to render it phonemically (by using letters for sounds) into their language. So, names like Πετρος and Στεφανος in Greek come into English as Peter and Stephen. So far so good. All we have done (well, nearly all) is replace the Greek letters with English ones. Where languages share a common alphabet it is even more straightforward. A team working from an English model text into a language which uses the same alphabet might find they needed to make very few changes as is the case with, for example Swahili where the English form Dorcas becomes Dorkasi.

The trick is knowing which letters go to which between the two languages. We could always ask the team to make us a table and then simply use that but it’s a bit of a blunt instrument. The letter ‘c’ might go to ‘c’ or to ‘k’ or perhaps ‘ck’ so it’s not just a case of one to one pairings. Not only do we need to know what the possible transformations are for a letter, to have any chance of predicting which one it ought to be we need to know how likely each possibility is overall and in context.

We do this by using a thing called a Markov Chain. Using a set of sample names we can train the system to know what transliterations are possible and what the probability of each possible transliteration is. We build, in effect, a map of possible transliterations. That information allows us to generate a transliteration hypothesis for names we haven’t yet seen. (Strictly speaking this particular variety of Markov Chain is called an Ehrenfest Chain). In practice we generate a set of possible transliterations ranked by probability.
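A much-simplified sketch of the training step might look like this in Python. We align the sample name pairs naively, letter by letter, so this only handles forms of equal length; the real system also has to cope with insertions and deletions (a ‘c’ going to ‘ck’, for instance). The training pairs are invented for illustration:

```python
from collections import Counter, defaultdict

def train(pairs):
    """Count letter-to-letter mappings in aligned name pairs and turn
    the counts into probabilities: a crude map of possible
    transliterations."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

# Invented sample pairs: source spellings and their target renderings.
model = train([("cyrus", "kyrus"), ("caleb", "kaleb"), ("carmel", "carmel")])
print(model["c"])    # 'c' maps to 'k' twice as often as it stays 'c'
```

Given an unseen name, ranking the candidate letters by these probabilities at each position yields the set of transliteration hypotheses, ordered from most to least likely.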

The initial experiments are giving good results and although we still have some problems to solve it looks like we will be able to suggest consistent transliterations for names as the translators encounter them in the text. Two possible enhancements are:

  1. To adjust the probability for a given letter mapping to another on the basis of its predecessors in the stream of text. For example, if we were transliterating into English and have just seen a ‘q’ go by, we might consider the probability of it being followed by a ‘u’ to be much higher than any other letter. We can generate this information by watching the stream of letters go by and noting what follows what.
  2. Many languages change the form of names depending on their role in a sentence. In Latin for example, if Peter is the subject of a sentence he is Petrus, if he is the direct object he is Petrum and if he is the indirect object he becomes Petro. Just as we can learn what letters can follow others from watching the stream of text go by so too we learn these changes and incorporate them into our transliteration predictions.
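Enhancement (1) amounts to counting which letter follows which as the stream goes by. A minimal sketch, with sample names invented for illustration:

```python
from collections import Counter, defaultdict

def train_bigrams(names):
    """Watch the stream of letters go by and note what follows what."""
    follows = defaultdict(Counter)
    for name in names:
        for prev, nxt in zip(name, name[1:]):
            follows[prev][nxt] += 1
    return follows

follows = train_bigrams(["quartus", "aquila", "quirinius"])
print(follows["q"])    # Counter({'u': 3}) - 'q' has only ever preceded 'u'
```

Conditioning the letter-mapping probabilities on counts like these is what lets the predictor say, having just emitted a ‘q’, that a ‘u’ is overwhelmingly the most likely next letter.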

It all looks very interesting and distinctly possible, although one big question has begun to creep over the horizon. Whilst our Markov Chain works well as a simple letter-to-letter predictor, once we begin to introduce more complex predictions based on predecessors and so forth we might actually be better off using a different processing model, and that could be an artificial neural network (ANN). ANNs have had a bad press for a while but recent developments have made them rather more tractable.

Time to do some reading…

You know, whassisname…

A Job for Percival

For the last four months or so the MAT team has been looking at the problems caused for Bible translators by proper-names. You might think proper-names would be a fairly easy thing to translate, not least because most of them aren’t really translated. In fact, they can generate an astonishing amount of hassle. For a start, there’s a lot of them, about 4,700 in fact. Many of them only occur once in the entire Bible but they still need to be rendered into the target language in a way that’s accessible for the readers.

But it gets worse. In the original Greek and Hebrew texts of the Bible names are loaded with significance in a way that is foreign to much of our 21st century world. The meaning of many Hebrew names adds real significance to their narratives. Consider the book of Ruth. This little gem of a book is only four chapters long and doesn’t have that many people in it to begin with, but the name of every character is carefully chosen to add meaning to the story:

  • Elimelech – means “My God is King” (A great name for a god-fearing Jew)
  • Naomi – means gracious one or perhaps beautiful one
  • Mahlon – means ‘sickly’
  • Kilion – ‘weakling’
  • Ruth – has to do with loyalty
  • Orpah – in Hebrew is the ‘nape of the neck’ (which is what you see when someone turns away from you)
  • Boaz – is less clear but may be linked to concepts like ‘upright citizen’ or ‘protector’
  • Obed – means ‘servant’. The same word is used by Isaiah to describe the ‘Servant of the Lord’ who will redeem Israel.
A Judaean threshing floor:
 

Each name adds something to the story. Perhaps sadly they are not often translated although some translations will footnote their meanings to help the reader. Whilst it isn’t that hard to render each of the names in Ruth into another language, doing so in a way that is consistent with the other 4,700 or so in the text is a bigger challenge. Consider Elimelech, El[God] -i[my] -melech[king]. When this is rendered in another language it is not only important that it is recognisable and pronounceable, it really ought to show its relationship to names like Abimelech (my father is king) and Ahimelech (my brother is king). If these names are rendered consistently across the text as a whole the reader has a much better chance of spotting the similarities between their forms.

So, consistency in rendering names is very important. The problems arise when we remember that it is not unusual for a translation project to run for 10 or even 15 years. Over that time different members of the team may work on different parts of the text. New members may join the team and older members may retire. But the way the team handles the names in the text really needs to be consistent across all 66 (85) books and throughout the duration of the project. The reality is that this kind of consistency is hard to achieve and projects often find themselves with a huge task verifying all their names before publication.

To help translators avoid what can be a mammoth task towards the end of a project the team is developing a system to find proper-names throughout a text even when their spelling is a little variable. This will mean that translators will be able to see, for example, that ‘Immmanuel, Imanuel, Emanuel and Emmanuel’ or ‘Ahimelek, Achimelek, Ahimelech, and Ahimelekh’ are all the same name, without having to try and guess what the spelling differences might be. We will be able to mark all the different renderings as references to the same name. Once we have done that the user can pick the preferred rendering and we can adjust the rest accordingly. In itself this will save many hours over many hundreds of projects.
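Percival's internals are a story for another day, but a standard measure like Levenshtein edit distance gives a feel for how variant spellings cluster around a preferred form (this is only an illustration of the clustering idea, not how Percival itself works):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: the number of single-letter
    insertions, deletions and substitutions turning one spelling
    into another."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # delete from a
                           cur[j - 1] + 1,        # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

variants = ["Immmanuel", "Imanuel", "Emanuel", "Emmanuel"]
# Every variant sits within two edits of 'Emmanuel':
print([edit_distance(v, "Emmanuel") for v in variants])  # [2, 2, 1, 0]
```

Spellings of the same name tend to sit a small number of edits apart, while unrelated names sit much further away, so small distances are one useful signal that two renderings are really the same name.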

The system we are using to do this is being developed by Neil. It is a derivative of Percival, which we wrote about last year. Not only will it bring real benefits to translators all over the world, it will also allow us to test the Percival algorithm in a real application, which will be invaluable for our own research and development programme.

Translating and Computers…

When we are out talking to people about our work a common question is “where are you based?” It’s actually quite hard to answer. I could say Dorset in the UK, but Neil is in Wiltshire. Then again our colleagues and those who use our systems are quite literally all over the world. We don’t really have an office as such. The work we do involves long hours staring at computer screens, usually in isolation, trying to work out how to persuade a brainless machine to do something a translator might find useful. Not only can it be a solitary world, the work of Bible translation we support is in itself a niche market. So it can be very good to break out of the bubble for a day or two and take a look at what the wider world is doing with computer based translation systems.

One of the very best places to do this is ASLING’s Translating and the Computer Conference, held annually in London and now in its 37th year. In recent years the conference has met at One Birdcage Walk, home of the UK Engineering Institutes. It is the longest running conference on translation and computers and unique in that it brings together not just academic researchers but those developing and working with translation systems in the real world. Delegates come from Europe (EU translation services are always strongly represented), from the US (Microsoft sent the head of their Skype Translator project this year) and then there are the mavericks, like us. As a networking opportunity it is unrivalled and, besides, it gets us out of the study, which has to be a good thing.

This is not the first time we have attended. We last presented in 2012 when we spoke about the early work that has become the Percival Project, currently under development (see the post for 23 Sep 2015). When we presented Percival in 2012 the paper was unique. Nobody else was looking at the problems which arise for computers when they try to deal with patterns of language which don’t come in neat rows. Sadly for computers, an awful lot of language doesn’t come in neat rows, but it does seem that we were amongst the first to recognise this as a generic problem for machine translation and to begin to work on ways of equipping computers to deal rather better with things like the horrors of complex word-formation, free word order and the like.

After a gap of three years we were interested to see how the world had moved on. There was the usual eclectic mix of papers and presentations and amongst the concerns of commercial translators we noted one or two points of particular interest and some general themes beginning to emerge:

  • One theme was how computer driven translation is now looking to integrate approaches which tended to be distinct in the past. Whereas in years gone by we heard about Translation Memory (TM) systems (like Google) and Statistical Machine Translation (SMT) systems (like Moses) and a lot about post-editing systems to tidy up the mess the first two generate, these are now being combined into hybrid systems which try to take the best from each world.
  • We learnt of concordancing systems now being used by major dictionary publishers like Harper Collins, OUP, CUP and Macmillan – we first presented our work on this at TC31 in 2009.
  • A number of presenters spoke about the lack of Machine Translation provision for languages other than the 40 or so international lingua francas which account for 90% of global purchasing power. Money talks in only a limited number of languages, it seems. It was good to hear about the KAMUSI project, which aims to provide training data for SMT and TM systems in less well resourced languages. Perhaps output from our glossing technologies might be able to help here?

Most interesting was the growing recognition that, as we had observed to the 34th conference in 2012, language doesn’t come in neatly packaged and closely related chunks. A number of speakers were describing research which tries to deal with the tendency of natural language to be messy and disparate:

  • Microsoft’s William Stevens (head of their Skype Translator project), who noted that before you can translate speech you need to remove the ‘disfluencies’ (great word!);
  • Alan Melby from the EU, who lamented the poor performance of MT systems in morphologically rich languages; and
  • Constantin Orazan from the EXPERT project, whose team is working to find ways to help MT and TM systems deal with ‘discontinuous structures’ in languages.

Sounds familiar…

All in all it was an excellent opportunity to catch up with the world of computer based translation systems. We are much encouraged that our own research is clearly very much at the heart of current concerns for the wider world and we came away with invitations to present next year at TC38 and also at a workshop conference planned for 2016 addressing the problem of Discontinuous Structures in Natural Language Processing.

It seems we are ahead of the game but others are now coming along with new ideas which may well help us in the future. Time to plunder the Philistines…

Why are we doing this?

Human language is a fascinating thing. We use it to pass on information, to express our deepest emotions, to pray, to praise, to encourage and, sadly, sometimes to denigrate. The conduit for all these is language, spoken and written. Language defines who we are, preselects our friends, links us with our heritage and sets our expectations. Not only do we shape language, language shapes us.

Different languages can make a dramatic difference to our understanding of a particular event or story. Recently, researchers in Germany assembled three groups of people: one group were monoglot English speakers, another monoglot German speakers, and the third were bilingual in English and German. They showed each person a photograph of a woman walking across a street in a city. Those who spoke only English described the scene as ‘a woman crossing a street’. The monoglot German speakers saw ‘a woman walking towards a building’. Most interestingly, the bilinguals fell into two groups: those who had been given a text in English to read before being shown the photograph saw it as the English speakers had, while those who were given a German text to read saw what the German speakers had perceived.

Eleven years ago, Neil and I were present in St Paul’s Cathedral in London for a service to celebrate 200 years since the founding of the Bible Society movement. The speaker was Rowan Williams, then the Archbishop of Canterbury. He had some interesting things to say about language and translation:

“Of all the great world religions, it is Christianity that has the most obvious and pervasive investment in translation. We do not have a sacred language; from the very first, Christians have been convinced that every human language can become the bearer of scriptural revelation. 
The words in which revelation is first expressed are not solid, impenetrable containers of the mystery; they are living realities which spark recognition across even the deepest of gulfs between cultures, and generate new words native to diverse cultures which will in turn become alive and prompt fresh surprise and recognition.
Biblical translation represents an enormous act of faith – the faith that what is given by God in one context is capable of being equally transfiguring and authoritative in all other human environments. Jesus speaks Greek and Aramaic; but the whole narrative of his words and work, his ministry and death and resurrection, is such that he can speak to call, to judge, to forgive and to bless in every human language that has been or will be”.

Now if this is so, the task of translating the Bible is foundational for the mission of the church all over the world. Through scripture Christ is welcomed into every language, culture and experience. But this remains a hope, not a reality. There are about 7,000 active languages in the world today. Only about 500 have a translation of the whole Bible. A further 1,300 languages have a NT translation and another 1,000 languages have a translation of at least one book of the Bible. That leaves more than 4,000 languages without a translation of even a part of the Bible. True, the major international lingua francas of our times all have a translation of the Bible, and by this measure between 4 and 5 billion of the 7 billion souls on earth have access to a translation of scripture they can understand, at least to some degree. But this is not the same as hearing God speak in the language we learned as a child. Language is formative. To encounter God within the culture and language that made us who we are is transformative.

When a translation is completed the whole community gathers to celebrate. Copies of the new scripture are distributed and eager eyes scan the pages. “Now!”, they say, “now we know that God is one of us, that he understands us and shares our lives”! That’s a powerful thing. But perhaps the most exciting thing is that, just as the English and German speakers saw different things in the photograph of the woman walking across a street, so too in their new translation a community may discover things about God that you and I may not have seen through our Bibles.

“We have a gospel to proclaim”, says the hymn; but so too does God, and one of the ways we can help that happen is through translating the story of a radical Creator whose extravagant love led him to a gibbet outside Jerusalem and whose gospel continues to transform lives all over the world as more and more of his people encounter him through his word in their language.

That’s why we do this.

The Percival Project

Language and Computers

Language is a fundamental part of who we are and of how we think of the world around us, and different languages can do things very differently. What they have in common is the fact that small human beings from the age of about two years are able to absorb and reproduce the language they hear around them without, apparently, trying very hard! How are they able to do this?

Children can recognise the shapes and patterns of language even when those patterns are very complex. This is something computers find very hard. It is not difficult to tell a computer to go and find a particular pattern in a piece of text. It is a lot harder to have the computer work out what patterns are present in the text for itself. The Percival Project is trying to help with this problem.

The Percival Project

Bible translators already have MAT (machine-assisted translation) systems which help them create more consistent translations. These systems are able to recognise how key biblical terms have been rendered by the translators. If the team then translates the term differently later on, the system can spot the change and flag it for review. These systems work for most languages, but they work better for some than for others. When they work less well it is often because the system cannot recognise different words which are in fact closely related. Provided the language forms words by adding suffixes or prefixes, like English, the system usually works well and can recognise that a set of words like {love, loved, loveliness, lovely, loves, loving etc…} are closely related. But some languages are not so straightforward.
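To give a feel for the kind of consistency check described above, here is a toy sketch. It is emphatically not the real MAT software, and the terms, renderings and verse references are invented placeholders; the idea is simply that the system remembers every rendering it has seen for a key term and flags any verse that departs from them.

```python
# A sketch of the consistency check described above -- not the real MAT
# software. Terms, renderings and references are invented placeholders.
renderings = {}  # key term -> set of renderings seen so far

def check_term(term, rendering, ref):
    """Return a warning string if this rendering is new for the term, else None."""
    seen = renderings.setdefault(term, set())
    warning = None
    if seen and rendering not in seen:
        warning = f"{ref}: '{term}' rendered '{rendering}', previously {sorted(seen)}"
    seen.add(rendering)
    return warning

print(check_term("covenant", "word-A", "Gen 9:9"))    # first sighting: no flag
print(check_term("covenant", "word-A", "Gen 15:18"))  # consistent: no flag
print(check_term("covenant", "word-B", "Gen 17:2"))   # flagged for review
```

A real system would of course look the terms up against the source text rather than being told them, but the bookkeeping is essentially this simple.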

Here are some closely related words in a language which forms its words in a much more complicated way. {QA+aL, QA+aleNU, TiQe+OL, yiQe+eLU, QO+eLey, Qe+ULOT}. We have written them using the English alphabet but you can see the original on the right. If you stare at these for long enough you will see that the thing which is common to them all is the stem letters Q+L (in the original we have coloured them red). But just look at what is happening around them! Not only can we see prefixes and suffixes being added, there are even changes in between the letters of the stem.
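To get a feel for what the machine is being asked to do, here is a small brute-force sketch (not Percival itself, whose whole point is to do this at scale without hand-written rules) that finds the longest sequence of letters shared, in order, by all the transliterated forms above:

```python
from itertools import combinations

forms = ["QA+aL", "QA+aLeNU", "TiQe+OL", "yiQe+eLU", "QO+eLey", "Qe+ULOT"]

def is_subsequence(pattern, word):
    """True if the letters of `pattern` appear in `word` in that order."""
    it = iter(word)
    return all(ch in it for ch in pattern)

def common_stem(words):
    """Brute-force the longest letter sequence shared, in order, by all words."""
    words = [w.lower() for w in words]
    shortest = min(words, key=len)
    # Try candidate subsequences of the shortest word, longest first.
    for n in range(len(shortest), 0, -1):
        for idxs in combinations(range(len(shortest)), n):
            cand = "".join(shortest[i] for i in idxs)
            if all(is_subsequence(cand, w) for w in words):
                return cand
    return ""

print(common_stem(forms))  # → q+l
```

Even this toy version recovers the stem Q+L from the six forms, but it only works because it is told in advance which words belong together; Percival has to discover both the groupings and the patterns from raw text.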

Percival allows the computer to sort out this kind of problem for itself, without the need for human beings to show it how. In fact, not only can Percival recognise that the words in the example above belong together, in a language like Tonga it can see that the name Abraham in English becomes Abulahamu and in still another context it can recognise patterns in phrases which correspond to the syntax of the language.

How this helps translators

  • Just imagine how difficult it is to spell check texts in languages as complex as the example above. But if the computer can see the structure of the words it will be able to flag any that don’t conform to the patterns typical for the language. Think how much time that could save.
  • If the computer can group related words together think how much easier it becomes to create an index for the text. And just imagine how much more useful that makes the text on a computer, tablet or mobile phone!
  • Suppose you are working to translate the Bible into a number of closely related languages. What if the computer could not only show you which word in language A corresponds to which word in language B but could even arrange them in the right order for the syntax of the language?
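To see why grouping related words matters for something like an index, here is a crude sketch for the English example {love, loved, loveliness…}. The suffix list is hand-written and works only for this handful of words; needing such hand-written rules for every language is precisely the drudgery a system that learns the patterns for itself would remove.

```python
from collections import defaultdict

# Hand-written suffix list for this toy example only (longest first).
SUFFIXES = ["eliness", "ing", "ely", "ed", "es", "e", "s"]

def naive_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def index_groups(words):
    """Group word forms under their (naively guessed) shared stem."""
    groups = defaultdict(list)
    for w in sorted(words):
        groups[naive_stem(w)].append(w)
    return dict(groups)

print(index_groups(["love", "loved", "loveliness", "lovely", "loves", "loving"]))
# → {'lov': ['love', 'loved', 'loveliness', 'lovely', 'loves', 'loving']}
```

All six forms end up under one index entry, which is exactly what a reader scanning an index wants to find.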

These are just some of the benefits Percival could bring to translators.

When will it be ready?

It’s early days. At present Percival exists only as a proof of concept prototype, an experiment on a laboratory bench if you like. There is a lot of work to do testing and refining the process to make sure it can benefit the widest possible set of languages.

Who benefits?

In the end the people who benefit are the billions of people across the world who still cannot read scripture in their heart language.
Systems like Percival can’t translate the Bible but they can help Bible translators do their work more quickly and more consistently so that more and more people can hear what God is saying to them today in their scripture.

Please pray for the team and their colleagues as they work to make these things a reality for hundreds of translators all over the world and the people they serve.