Machine translating the Bible into new languages

 

“In the beginning God made heaven and earth. The earth was without sight and above, and darkness was over the bottomless pit, and the spirit of God was brought on the water. And God said, be light! And there was light. And God saw that the light was good, and God divided the light from the darkness. And God called the light day, and the darkness called night. And it was evening and it was morning, the first day.”

— Genesis 1:1–5, automatically translated using only the New Testament in English and the entire Bible in 52 other languages

Introduction

Current machine translation systems focus on translating previously unseen sentences between a small set of well-known languages, keeping the languages fixed and generalizing over sentences with different meanings. This formulation suggests a dual problem, which we consider here: that of translating a fixed set of well-known sentences into a large number of little-known languages, keeping the meaning fixed and generalizing over different languages. Intuitively, this corresponds to translating a single book into a large number of languages. The archetypal example is the Bible, parts of which have been translated into some 3,000 languages. Specifically, we consider the problem of producing a translation of the full Bible given only a partial translation, such as the New Testament or only the Gospels.

Being able to automate part of the translation work and, say, generate a raw translation suitable for further editing would free human resources and allow a larger number of translations to be produced with the same effort. Moreover, the Bible is not the only text that could benefit from this; as will be demonstrated, the approach can be useful with as few as a few dozen parallel translations.

Can machine learning help us here? Existing forms of machine translation are of no use: to produce a useful model, at least tens or hundreds of megabytes of parallel corpora (the same text in the source and target languages) are needed. For most of our target languages, if we are lucky, the text we have is the New Testament, or about one megabyte of text.

It turns out that machine translation models can indeed learn to generalize over languages. For this purpose, the Bible translations are a treasure: a parallel corpus par excellence. No other parallel text exists that gives us linguistic insight into hundreds or thousands of languages. The division of the Bible into verses provides a natural correspondence between texts in different languages1. Moreover, most Bible translations tend to err on the conservative side rather than take liberties, resulting in somewhat more mechanical translations than in most other non-technical genres.

Overview of the chosen approach

Traditional machine translators are trained to translate from a single source language to a single target language. The training data, then, must be a sufficiently large parallel corpus in those two languages. The more modern approach is to train a single model to translate from one or more languages to other languages, giving the model the desired target language as input at translation time. It has been observed that this approach helps when translating between languages with a dearth of parallel texts. This is the approach we also choose here.

We choose one of the translations as a “source” language. It should be noted that this is not really a source language in the traditional sense, because it is not the only source text that determines the translation. Rather, when we train the model to translate from this source language into all the different target languages, the model incorporates knowledge from all of these texts and uses that knowledge when producing any translation. This means that the model can learn to abstract over languages. Intuitively, it also means that the model can extract information about the meaning of individual verses from the many ways in which different languages express the same thoughts. However, the choice of this language is not without consequences: the model has access to text in this language at translation time and can attend to different parts of it. For this reason, we will call this language the attention language.
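
To make this concrete, here is a minimal sketch of how a single verse expands into several training examples, one per target language, all sharing the same attention-language source. The verse texts and the tag format (e.g. <2fin>) are illustrative assumptions, not the exact format used by our scripts.

    # A minimal sketch of how one verse becomes many training pairs.
    # The verse texts and tag names below are illustrative, not the exact
    # format used by our preprocessing scripts.

    attention_text = "en arche epoiesen ho theos ton ouranon kai ten gen"  # romanized Greek

    targets = {
        "<2eng>": "in the beginning god created the heaven and the earth",
        "<2fin>": "alussa jumala loi taivaan ja maan",
        "<2deu>": "im anfang schuf gott himmel und erde",
    }

    # Each training example pairs the tagged attention-language verse with one
    # target-language rendering of the same verse.
    training_pairs = [(f"{tag} {attention_text}", text) for tag, text in targets.items()]

    for source, target in training_pairs:
        print(source, "=>", target)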

As a translation framework, we use Fairseq-py, the Python version of the Facebook AI Research Sequence-to-Sequence Toolkit. We use the Moses project's tools for tokenization of the input text, although we have to disable all of the language-specific tokenization features, which makes tokenization a fairly mechanical process of separating punctuation from words.
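
As an illustration of the tokenization step, the sketch below uses the sacremoses Python port of the Moses tokenizer; this is an assumption for readability, since our pipeline calls the Moses scripts directly.

    # Sketch of Moses-style tokenization using the sacremoses port (an
    # assumption; our pipeline uses the Moses scripts themselves). With the
    # language-specific features effectively unusable for most of our
    # languages, tokenization mostly just separates punctuation from words.
    from sacremoses import MosesTokenizer

    tokenizer = MosesTokenizer(lang="en")  # the language code matters little here
    line = "And God said, be light! And there was light."
    print(" ".join(tokenizer.tokenize(line, escape=False)))
    # -> And God said , be light ! And there was light .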

Since many of our languages have rich morphology, it does not make much sense to translate word by word. Instead, we use byte pair encoding as implemented in the Subword Neural Machine Translation project. We generate a single set of byte pairs for all of the languages together to facilitate the sharing of subword snippets. To maximize this sharing potential, we have so far used only languages that either use the Latin script or can easily be converted to a Latin script and still pass a visual inspection2. It is unknown at this point how much, if at all, this helps the translation. Thus, the input tokens of the model are subword snippets (the byte pairs) of one or more characters, plus, as the first token of each source sentence, a special tag indicating the target language.
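
For reference, here is a sketch of the byte pair encoding step using the subword-nmt package's Python interface; the file names and the tag handling are hypothetical, but the idea is that a single set of merges is learned over all languages concatenated together and then applied to everything except the target-language tag.

    # Sketch of joint BPE learning and application with the subword-nmt
    # package. File names and the tag handling are hypothetical.
    import codecs
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Learn one set of merge operations over all languages concatenated
    # together, so that subword snippets can be shared across languages.
    with codecs.open("all_languages.tok", encoding="utf-8") as corpus, \
         codecs.open("bpe.codes", "w", encoding="utf-8") as codes_out:
        learn_bpe(corpus, codes_out, num_symbols=30000)

    with codecs.open("bpe.codes", encoding="utf-8") as codes_in:
        bpe = BPE(codes_in)

    # Segment a tagged line, leaving the target-language tag itself intact.
    line = "<2fin> en arche epoiesen ho theos ton ouranon kai ten gen"
    tag, text = line.split(" ", 1)
    print(tag + " " + bpe.process_line(text))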

The parallel Bible translations used were fetched from the archives of the Sword project. The Sword project provides Bible translations in a machine-readable format and at least attempts to align them by verses, i.e. it uses a uniform verse numbering regardless of the one chosen by the translation in question.
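
Conceptually, building the parallel corpus then amounts to indexing each module by the uniform verse identifier and keeping the verses present in both the attention language and the target language. The sketch below uses a hypothetical read_module() helper to stand in for the actual module parsing.

    # Conceptual sketch of verse alignment across Sword modules.
    def read_module(name):
        """Hypothetical helper: yield (verse_id, text) pairs for a module,
        where verse_id is the uniform reference, e.g. "Gen.1.1"."""
        raise NotImplementedError

    def align_verses(attention_module, target_module):
        source = dict(read_module(attention_module))
        target = dict(read_module(target_module))
        # Keep only verses present in both texts; verses missing from either
        # side are dropped rather than guessed at.
        return [(vid, source[vid], target[vid]) for vid in source if vid in target]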

We chose some languages as test languages, omitting either the book of Genesis or the entire Old Testament from their training data. Since it is unknown whether the chosen number of languages approaches the capacity of the model, we decided to give larger weights to the test languages. This was done crudely, by repeating each of these translations three times in the training set, as sketched below. This has the unfortunate side effect of also making training slower, and we in fact hypothesize that it should not be necessary with the parameters and the number of input languages we used. Alternatively, the training code could be modified to give additional weight to certain inputs.
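
A minimal sketch of this crude weighting, assuming the training pairs are held in per-language lists keyed by the target-language tag:

    # Crude oversampling: the test-language material is simply repeated three
    # times in the training set. pairs_by_language is assumed to map a
    # target-language tag to its list of (source, target) pairs.
    TEST_LANGUAGE_TAGS = {"<2eng>", "<2fin>", "<2deu>"}
    REPEATS = 3

    def build_training_set(pairs_by_language):
        training = []
        for tag, pairs in pairs_by_language.items():
            copies = REPEATS if tag in TEST_LANGUAGE_TAGS else 1
            training.extend(pairs * copies)
        return training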

Test setting

We decided to use one of Fairseq's presets, fconv_wmt_en_ro, as our network architecture. This is a fully convolutional model in which both the encoder and the decoder have 20 layers of 512 units each, connected by convolutions of kernel width 3. It has previously been used to translate between English and Romanian.
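
For reference, fairseq describes fconv architectures as lists of (dimension, kernel width) pairs, so the preset's 20 layers of 512 units with width-3 kernels can be written down roughly as follows (a sketch; see fairseq's fconv model definitions for the authoritative values):

    # Rough layer specification of the fconv_wmt_en_ro preset: 20 convolutional
    # layers of 512 units with kernel width 3 on both the encoder and the
    # decoder side. (Approximate; the toolkit's own definitions are
    # authoritative.)
    ENCODER_LAYERS = [(512, 3)] * 20  # (hidden dimension, kernel width)
    DECODER_LAYERS = [(512, 3)] * 20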

We trained the model with the following 53 Bible translations:

Sword module | Language | Notes
2TGreek | Koine Greek (LXX + Tischendorf) | attention language, romanized
Afr1953 | Afrikaans
Alb | Albanian
BasHautin | Basque | NT only
Bela | Belarusian | romanized
BretonNT | Breton | NT only
BulVeren | Bulgarian | romanized
CSlElizabeth | Church Slavonic | romanized
Chamorro | Chamorro | Gospels, Psalms & Acts
ChiNCVs | Chinese (Simplified) | romanized
Cro | Croatian
CzeCEP | Czech
DaOT1931NT1907 | Danish
Dari | Dari | romanized
DutSVV | Dutch
ESV2011 | English | target language; NT only used in training
Esperanto | Esperanto
Est | Estonian
FarTPV | Farsi | romanized
FrePGR | French
FinPR | Finnish | target language; Genesis omitted from training
GerNeUe | German | target language; Genesis omitted from training
GreVamvas | Greek (modern) | romanized
Haitian | Haitian Creole
HebModern | Hebrew (modern) | romanized
HinERV | Hindi | romanized
HunUj | Hungarian
ItaRive | Italian
Kekchi | K’ekchi’
KorHKJV | Korean | romanized
LtKBB | Lithuanian
LvGluck8 | Latvian
ManxGaelic | Manx Gaelic | Matthew, Luke, John
Maori | Maori
Mg1865 | Malagasy
Norsk | Norwegian
NorthernAzeri | Northern Azeri
Peshitta | Classical Syriac | romanized, NT only
PolUGdanska | Polish
PorAlmeida1911 | Portuguese
PotLykins | Potawatomi | Matthew and Acts only
RomCor | Romanian
RusSynodal | Russian | romanized
ScotsGaelic | Scots Gaelic | Mark only
SloStritar | Slovenian | NT and Psalms
SomKQA | Somali
SpaRV | Spanish
Swahili | Swahili | NT only
SweFolk1998 | Swedish
TagAngBiblia | Tagalog
TurHADI | Turkish | easy-to-read, NT only
Ukrainian | Ukrainian | romanized
Vulgate | Latin

For byte pair encoding, we somewhat arbitrarily chose to use 30,000 byte pairs over the entire corpus (the attention language and the target languages alike).

We used the omitted portions of the target languages (Finnish and German Genesis and English Old Testament) as the combined validation and test set.

For prediction, we used a beam search of width 120.

Results

Running on a single Nvidia GeForce GTX 1080 Ti card, the training took roughly 7 days to converge, severely limiting our ability to experiment with different setups. The resulting model allows us to produce missing portions (those which are present in the attention language) in any of the training languages.

The model produces translations that we generally find helpful, if not always correct. Naturally, the Old Testament contains words that do not occur at all in the New Testament; in these cases, the model often uses a wrong word or adopts a word from one of the related languages. We surmise the results might be useful as raw translations for further editing, possibly even in a collaborative (Wikipedia or GitHub style) translation effort where not all of the translators are intimately familiar with biblical Hebrew and Greek.

The English produced is visibly better than the Finnish, in spite of the larger amount of Finnish available in the training phase (the entire Bible except for Genesis, as opposed to only the New Testament). We suggest there may be a few different contributing factors. First, most of the training languages are more closely related to English than to Finnish. Second, the attention language, Greek, is closer to English than to Finnish. And lastly, the byte pair encoding and the chosen number of byte pairs may actually hide some of the structure in Finnish because of inflection and compound words. As a specific example, we observed the word poikalapsi (a boy child, from poika ‘boy’ + lapsi ‘child’) split into tokens as poi + kala + psi.

As examples of the translations produced, here are some of the more familiar Old Testament texts (also see the one at the beginning of the article). Capital letters have been added for legibility, as the model currently works entirely in lower case.

1. The serpent was answer than all the beast of the earth that the Lord God had done. And he said to the woman, is it that God said, you shall not eat of any tree in the paradise?
2. And the woman said to the serpent, we are to eat of the fruit of the tree,
3. but from the fruit of the tree that is in the grain God said, you shall not eat it and do not touch it, so that you may not die.
4. And the serpent said to the woman, you will not die.
5. For God knew that on the day you will eat, your eyes will be opened, and you will be like God knowing good and evil.
6. And when the woman saw that the tree was good to eat, and that it was right for the eyes, and is right to show it. Then she took his fruit and ate, and she gave it to her husband, and she ate.
7. Then the eyes of the two were opened, and they recognized that they were naked, and they wounded figures of fig tree and made to themselves white.

—Genesis 3:1-7

 

22. And the Lord spoke to Moses, saying,
23. speak to Aaron and to his sons, saying, so shall you blessed the children of Israel, saying to them, they shall put my name upon the people of Israel, and I will blessed them.
24. The Lord blessed you and keep you.
25. Let the Lord show his face to you and have mercy on you.
26. Let the Lord looke his face to you and give you peace.

—Numbers 6:22-26

 

1. A salm of David. The Lord is my shepherd, nothing will be lacking.
2. He dwells me in wealth of graves, he grows me upon the rest of rest.
3. He turn my soul. He guides me on the sends of righteousness for his name.
4. Even if I go in the midst of the shadow of death, I will not fear evil, for you are with me. Your scepter and your back do me console me.
5. You prepares a table for me before my enemies, you anointed my head with olie, my cup is grained.
6. Yet all the days of my life and mercy will follow me all the days of my life, and I will dwell in the house of the Lord for long days.

—Psalm 23

Here is the beginning of Genesis automatically translated into Finnish:

1. Alussa Jumala on tehnyt taivaan ja maan.
2. Ja maa oli luopunut ja tyhjäksi, ja pimeys oli syvyydestä; ja Jumalan henki tuli vetten yläpuolelle.
3. Ja Jumala sanoi: “Tapahtukoon valkeus!” ja tuli valkeus.
4. Ja Jumala näki valon, että se oli hyvä; ja Jumala erotti valon pimeyden välistä.
5. Ja Jumala kutsui valkeuden päivän, ja pimeyden hän kutsui yön. Ja tuli ilta, ja tuli aamulla: yksi päivä.
6. Ja Jumala sanoi: “Tapahtukoon taivaanvahva vetten keskellä, ja se erottakoon veden ja vedet”. Ja niin tuli.
7. Ja Jumala teki taivaanvahvan ja erotti veden, joka oli taivaanvahvaan ja veden päällä, joka on taivaanvahvuuden yläpuolella. ja niin tapahtui.
8. Ja Jumala kutsui taivaan taivaan. Ja tuli ilta, ja tuli aamulla: toinen päivä.
9. Ja Jumala sanoi: “Vettä, joka on taivaan alla, kokoontukoon yhteen, ja tehtäköön se kuiva”. Ja niin tuli.
10. Ja Jumala kutsui kuivan maan, ja vesipurot kutsui meren. Ja Jumala näki, että se oli hyvä.

—Genesis 1:1-10

More translations are available at https://sliedes.kapsi.fi/bib-trans/. Since we can only confidently assess the quality of the Finnish and English translations, and to an extent the German one, we would appreciate hearing how the other translations (Basque, Breton, Slovenian, Swahili and Turkish) compare to the English one. As noted, the English, Finnish and German translations were given larger weights in training, which may or may not negatively affect the quality of the other translations.

The source code for all the scripts, tools and modifications to third party tools is available at https://github.com/sliedes/fairseq-py.

Further research

The perceived difference in quality between English and Finnish suggests using, as the attention language, the language most closely related to the target language. Alternatively, it would be possible to use several languages as attention languages and let the learnable attention mechanism decide which of them to attend to.

Since the task is to generalize over languages, we wonder what the result would be if we also allowed the source language to be any of the languages, i.e. train a full n-to-n translation model. This would also presumably obviate the need for different attention languages, since any of the languages could be used as the attention language and the best translation chosen.

Research should be done on the optimal number of languages to include in a model of a given size. Different ways to choose the languages to include could also be investigated. Moreover, some of the hyperparameters in this work were essentially just guesses of good values, like the byte pair count. Ensemble models usually allow for better translations; whether that also holds in this case is an interesting question of practical value. The precise effect of the target language weighting should be investigated. It may be that it would be beneficial to give unit weights to all languages unless the training is done with a very large set of languages. Also, we have only tried the fully convolutional model; other machine translator network models should be evaluated.

We suspect the translator may be more suitable for creoles and pidgins and other languages with very close relatives. Conversely, language isolates should be harder. This effect needs to be investigated further.

Other parallel Bible corpora besides the Sword project exist, and might be of higher quality. Better aligned verses should result in a higher quality translation.

Since the training process is resource intensive, it would be interesting to try to retrain a model to a new target language, for example by replacing a language in the model by another one. Whether this is viable and results in a speedup is an open question.

We also propose investigating the possibility of integrating the model and training into Bible translation software. Even now the model is capable of incrementally giving suggestions. It might also be possible to revise the model after a human has edited parts of the translation or marked some translated text as accepted.

Conclusion

We have presented a framework for translating text into languages for which only a very modest amount of parallel text exists, under the constraint that we only translate a fixed, small set of sentences. To the best of our knowledge, this is a new capability, and we hope that it will assist Bible translators and free up their resources.


Footnotes

^1. Of course, the reality is a bit messier than this. Sometimes verse boundaries fall in the middle of a sentence, and there are a few different verse numbering schemes. Some translations use slightly different source texts, but the differences tend to be insignificant, except where they cause the omission of entire verses, as is sometimes the case.

^2. We used the Python unidecode package to convert non-Latin scripts into an ASCII representation. Languages that use a Latin script but include characters outside the ASCII character set were used as is. It should be noted that this transformation can be very lossy; one clear example is the Chinese Bible text, which unidecode converts into something resembling pinyin, but without tone markings. Such a text simply does not contain enough information for a Chinese speaker to understand it. Nevertheless, we hypothesize that it is sufficiently regular as a language to be useful in this training context.
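
For example, with the unidecode package (outputs shown approximately):

    # Romanization with the unidecode package; the transformation is lossy by
    # design, and the outputs shown are approximate.
    from unidecode import unidecode

    # Cyrillic: "In the beginning was the Word" (John 1:1)
    print(unidecode("В начале было Слово"))  # -> V nachale bylo Slovo

    # Chinese: yields a pinyin-like form without tone marks,
    # roughly "Tai Chu You Dao" for these four characters.
    print(unidecode("太初有道"))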


References

Gehring, Jonas, et al. “Convolutional sequence to sequence learning.” Proc. of ICML. 2017.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

Christodouloupoulos, Christos, and Mark Steedman. “A massively parallel corpus: The Bible in 100 languages.” Language resources and evaluation 49.2 (2015): 375-395.

Agić, Željko, Dirk Hovy, and Anders Søgaard. “If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages.” The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015). 2015.

Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. “Exploiting similarities among languages for machine translation.” arXiv preprint arXiv:1309.4168 (2013).

Johnson, Melvin, et al. “Google’s multilingual neural machine translation system: Enabling zero-shot translation.” arXiv preprint arXiv:1611.04558 (2016).
