This article was originally posted in the TWB Tech Blog on medium.com

TWB’s current research focuses on bringing language technology to marginalized communities

Translators without Borders (TWB) aims to empower people through access to critical information and two-way communication in their own language. We believe language technology such as machine translation is essential to achieving this. This is a challenging task, given that many of the languages we work with have little to no language data available to build such systems.

In this post, I’ll explain some methods for dealing with low-resource languages. I’ll also report on our experiments in obtaining a Tigrinya-English neural machine translation (NMT) model.

Machine translation (MT) has reached many remarkable milestones over the last few years, and it is likely to progress further. However, the development of MT technology has mainly benefited a small number of languages.

Building an MT system relies on the availability of parallel data. The more present a language is digitally, the higher the probability of collecting large parallel corpora which are needed to train these types of systems. However, most languages do not have the amount of written resources that English, German, French and a few other languages spoken in highly developed countries have. The lack of written resources in other languages drastically increases the difficulty of bringing MT services to speakers of these languages.

Low-resource MT scenario

In scientific literature for machine translation, there is no particular consensus on which corpus size constitutes a low-resource scenario. But we can say roughly that a low-resource condition is when the size of the parallel training corpus is not sufficient for reaching an acceptable result with the standard MT approaches. This is usually judged with a standardized automatic evaluation metric called BLEU, which correlates with human translation assessments.
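To make the metric concrete, here is a minimal sketch of how a BLEU-style score can be computed. It is a simplified version written for illustration, assuming whitespace tokenization, a single reference per sentence, and no smoothing; the function names `ngrams` and `corpus_bleu` are our own, and published results are normally computed with a standard evaluation tool rather than code like this.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with one reference per hypothesis.

    Returns a score between 0 and 1 (multiply by 100 for the usual scale).
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize system output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on the mat"]
perfect = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on the mat today"], reference)
unrelated = corpus_bleu(["completely different words here"], reference)
```

An exact match scores 1, output that shares no words with the reference scores 0, and partially overlapping output falls in between.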

Figure 2, modified from Koehn and Knowles (2017), shows the relationship between the BLEU score and the corpus size for three MT approaches: phrase-based, phrase-based with a large language model, and neural.

A classic phrase-based MT model outperforms NMT for smaller training sets. Only above a corpus size of 15M words, roughly equivalent to 1 million sentence pairs, does NMT show its superiority.

Low-resource MT, on the other hand, deals with corpus sizes of around a couple of thousand sentences. Although this figure suggests at first glance that there is no way to obtain anything useful for low-resource languages, there are ways to leverage even small data sets. One of these is a deep learning technique called transfer learning, which applies the knowledge gained while solving one problem to a different but related problem.

Cross-lingual transfer learning

Zoph et al. (2016) applied transfer learning to machine translation and showed that prior knowledge from translating a separate language pair can improve translation of a low-resource language.

Figure 3 illustrates their idea of cross-lingual transfer learning.

The researchers first trained an NMT model on a large parallel corpus (French–English) to create what they call the parent model. In a second stage, they continued to train this model, but fed it a considerably smaller parallel corpus of a low-resource language. The resulting child model inherits the knowledge from the parent model by reusing its parameters. Compared to the classic approach of training only on the low-resource language, they report an average improvement of 5.6 BLEU points over the four languages they experiment with. They further show that the child model reuses knowledge not only of the structure of the high-resource target language but also of the process of translation itself.
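The parent-child idea can be illustrated with a deliberately tiny stand-in. The sketch below is not an NMT system: it assumes a one-dimensional linear model, made-up data, and our own helper names (`train`, `mse`), and only shows the mechanism of initializing the child from the parent's parameters instead of from scratch.

```python
def train(data, weights, steps, lr=0.05):
    """Run full-batch gradient descent on a 1-D linear model y = w * x + b."""
    w, b = weights
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(data, weights):
    """Mean squared error of the model on a dataset."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# "High-resource" parent task: many examples of y = 2x + 1 (toy data).
parent_data = [(i / 10, 2 * (i / 10) + 1) for i in range(-20, 21)]

# "Low-resource" child task: a related function, but only three examples.
child_train = [(x, 2.2 * x + 0.8) for x in (-1.0, 0.0, 1.0)]
child_test = [(i / 2, 2.2 * (i / 2) + 0.8) for i in range(-4, 5)]

# Stage 1: train the parent model from scratch on the large dataset.
parent = train(parent_data, (0.0, 0.0), steps=500)

# Stage 2: continue training on the small dataset, reusing parent parameters.
child = train(child_train, parent, steps=20)

# Baseline: same small dataset and training budget, but no parent.
scratch = train(child_train, (0.0, 0.0), steps=20)

child_loss = mse(child_test, child)
scratch_loss = mse(child_test, scratch)
```

With the same small training budget, the model that starts from the parent's parameters ends up with a lower test error than the one trained from scratch, because the parent already sits close to the child task's solution.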

The choice of high-resource parent language is a key parameter in this approach. The decision is usually made heuristically, judging by closeness to the language of interest in terms of distance in the language family tree or shared linguistic properties. Lin et al. (2019) explore more rigorously which parent language is the best choice for a given language.

Multilingual training

The path that was cleared by cross-lingual transfer learning led naturally to the use of multiple parent languages. The straightforward approach, first described by Dong et al. (2015), mixes all the available parallel data in the languages of interest and sends them into training as illustrated in Figure 4.

What results from the example is one single model that translates from the four languages (French, Spanish, Portuguese and Italian) to English.
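A minimal sketch of this data mixing step follows, assuming each corpus is a list of (source, English) sentence pairs held in memory. The function name `mix_corpora`, the language codes, and the toy sentences are illustrative assumptions, not the setup used in the cited work.

```python
import random


def mix_corpora(corpora, seed=0):
    """Pool sentence pairs from several source languages into one shuffled set.

    `corpora` maps a language code to a list of (source, english) pairs.
    Each pooled example keeps its language code so it can be filtered later.
    """
    pooled = [
        (lang, source, english)
        for lang, pairs in corpora.items()
        for source, english in pairs
    ]
    # Shuffle so that every training batch sees a blend of languages.
    random.Random(seed).shuffle(pooled)
    return pooled


corpora = {
    "fr": [("Bonjour", "Hello"), ("Merci", "Thank you")],
    "es": [("Hola", "Hello"), ("Gracias", "Thank you")],
    "pt": [("Olá", "Hello"), ("Obrigado", "Thank you")],
    "it": [("Ciao", "Hello"), ("Grazie", "Thank you")],
}
mixed = mix_corpora(corpora)
```

A single model trained on `mixed` sees all four source languages with English as the shared target. In practice the corpora differ greatly in size, so some form of balancing may be needed; that is left out of this sketch.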

Multilingual NMT offers three main advantages. Firstly, it reduces the number of individual training processes needed to one, yet the resulting model can translate many languages at once. Secondly, all languages can benefit from each other through the transfer of knowledge. And finally, the model serves as a more solid starting point for a possible low-resource language.

For instance, if we were interested in training MT for Galician, a low-resource Romance language, the model illustrated in Figure 4 would be a perfect fit, as it already knows how to translate well from four other high-resource Romance languages.

A solid report on the use of multilingual models is given by Neubig and Hu (2018). They use a “massively multilingual” corpus of 58 languages to leverage MT for four low-resource languages: Azeri, Belarusian, Galician, and Slovak. With a parallel corpus of only 4,500 sentences for Galician, they achieved a BLEU score of up to 29.1%, in contrast to the 22.3% and 16.2% obtained with classic single-language training using statistical machine translation (SMT) and NMT respectively.

Transfer learning also enables what is called zero-shot translation, where no training data is available for the language of interest. For Galician, the authors report a BLEU score of 15.5% on their test set without the model having seen any Galician sentences.

Case of Tigrinya NMT

Tigrinya is an Ethio-Semitic language spoken by around 7.9 million people in Eritrea and Ethiopia. It is not supported by any commercial MT provider, nor does it have any publicly available models. TWB is currently developing open datasets and MT for Tigrinya in cooperation with the Masakhane initiative.

Tigrinya is no longer in the very low-resource category thanks to the recently released JW300 dataset by Agić and Vulić. Nevertheless, we wanted to see if a higher-resource language could help build a Tigrinya-to-English machine translation model. We used Amharic as a parent language, which is written in the same Ge’ez script as Tigrinya and has more public data available.

The datasets that were available to us at the time of writing this post are listed below. After the JW300 dataset, the largest resource we found is the Parallel Corpora for Ethiopian Languages.

Our transfer-learning-based training process consists of four phases. First, we train on a random mix of all the datasets, totaling up to 1.45 million sentences. Second, we fine-tune the model on Tigrinya using only the Tigrinya portion of the mix. In a third phase, we fine-tune on the training partition of our in-house data. Finally, we test on 200 samples set aside earlier from this in-house corpus.

As a baseline, we skip the first multilingual training step and use only Tigrinya data to train on.
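The staging of the data across these phases can be sketched as follows. This is an illustration under stated assumptions, not our actual training code: the helper names `split_holdout` and `build_schedule`, the in-memory tuple format, the toy data, and the reading of the baseline as "phases two and three only" are all assumptions made for the example. The codes `am` and `ti` are the ISO 639-1 codes for Amharic and Tigrinya.

```python
import random


def split_holdout(pairs, n_test=200, seed=0):
    """Shuffle a copy of `pairs` and set aside `n_test` of them for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def build_schedule(mixed, in_house, child_lang="ti", use_parent_phase=True):
    """Return (phases, test_set); phases is an ordered list of (name, data).

    `mixed` holds (lang, source, english) examples from all public datasets.
    `in_house` holds (source, english) pairs from the in-house corpus.
    """
    in_house_train, test_set = split_holdout(in_house)
    child_only = [example for example in mixed if example[0] == child_lang]
    phases = [
        ("tigrinya-only", child_only),  # fine-tune on the Tigrinya portion
        ("in-house", in_house_train),   # fine-tune on in-house training data
    ]
    if use_parent_phase:
        # Transfer-learning setup: start with the multilingual mix.
        phases.insert(0, ("multilingual-mix", mixed))
    return phases, test_set


# Toy stand-ins for the real corpora.
mixed = [("am", f"am-src-{i}", f"en-{i}") for i in range(6)]
mixed += [("ti", f"ti-src-{i}", f"en-{i}") for i in range(4)]
in_house = [(f"ti-inhouse-{i}", f"en-inhouse-{i}") for i in range(250)]

phases, test_set = build_schedule(mixed, in_house)
baseline_phases, _ = build_schedule(mixed, in_house, use_parent_phase=False)
```

Both setups share the same held-out test set because the split uses a fixed seed, so the transfer-learning model and the baseline can be compared on identical samples.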

We see a slight increase in the accuracy of the model on our in-house test set when we use the transfer learning approach. The results in various automatic evaluation metrics are as follows:

Conclusion

Neural machine translation is a data-hungry technology. Although this severely limits the possibility of expanding it to the majority of the world’s languages, we can still apply various techniques to make it available to more people than if we limited ourselves to approaches tuned towards high-resource languages. Methodologies like transfer learning and linguistically informed data mixture have a role to play in helping everyone communicate in their language.

Written by Alp Öktem, Computational Linguist for Translators without Borders

Translators improve lives by translating potentially lifesaving information into often ‘marginalized’ languages spoken by vulnerable individuals. Those who volunteer for Translators without Borders (TWB) have a range of experiences and skills, and share a vision of a world where knowledge knows no language barriers. We are grateful for all our translators, and we love sharing their stories.

Two of our top translators of Tigrinya, a language spoken by approximately seven million people, deserve special recognition for the work they did in 2018. Our featured translators, Kidane Haile and Kalayu Menasbo, have their roots in Eritrea and Ethiopia respectively. But they are united by a common language and their tireless desire to use their skills to support those in need.

Tigrinya is a Semitic language, belonging to the same language family as Amharic, Hebrew, Arabic, and Maltese. It is widely spoken in Eritrea and in northern Ethiopia, and by immigrant communities in Sudan, Saudi Arabia, the United States, and parts of Europe.

Photo: landscape at Ghinda, Eritrea.

Missing Children Europe

Tigrinya was one of the most important marginalized languages at TWB in 2018, primarily because of our partners’ work with refugees. For example, Missing Children Europe works with refugee youth in Europe who are unaccompanied; Tigrinya is one of the most important languages for this work. Kalayu and Kidane both contributed to the Missing Children Europe work, giving hope to people who have been forced from home due to poverty, hunger, persecution, discrimination, civil war, or unemployment. Young people and displaced or unaccompanied children are particularly vulnerable in such situations. They need to be able to report problems and to know their rights and responsibilities. They cannot do any of that without information in a language they understand.

Kalayu knows how important it is to ensure communication does not become a barrier to humanitarians providing safety. Language mediators are crucial. So the documents provided by our Tigrinya translators can be life-changing.

Kidane, too, sees it as a privilege to work with an organization like Missing Children Europe: to know he is supporting young children, and that the work he does is valuable.

A translator’s journey: taking refuge and delivering safety with words

Kidane now works from his home office in Buffalo, New York, translating from English to Tigrinya. The dedicated volunteer prides himself on communication and a desire to help others, hence his enthusiasm for working with TWB. Since joining in April 2018, Kidane has completed 60 tasks, amounting to 32,000 words.

“At one time in my life, I was a refugee. So, I understand what it is like to be in an unfamiliar country, facing a language barrier and other challenges. When I work with people in that situation, I understand what they are going through and it makes me happy to help them,” Kidane Haile, Translator

In 2010, Kidane arrived in the United States with refugee status. For four years he worked part-time, studied full-time, and worked on his English fluency. It was then that he realized his knowledge of Tigrinya and English opened up an opportunity to work and help the community simultaneously. Now he works as a full-time interpreter, though he never forgets where his journey began:

“I often think about making life easier for people who start in a new country and need help communicating and understanding their new situation, the way I was years ago.”

Kalayu, the second of our spotlighted Tigrinya translators, works in the same language pair from his home in Ethiopia. This busy volunteer has translated almost 30,000 words across 17 tasks since he joined TWB in October 2018. He continually aims to serve and provide for others through improved communications.

And his dedication to the mission is evident: Kalayu often works late into the night to complete translation tasks, without the convenience of a home laptop.

In fact, the keen reader and ex-radio journalist wears many charitable hats: he also works for World Vision Ethiopia, a nongovernmental organization dedicated to transforming the lives of vulnerable children and families. In his various roles, he creates safe, protected environments by translating vital information into local languages.

Beyond TWB

Kidane’s experience with TWB has expanded his written translation skills and helped him to take on work outside of his primary field of interpretation.

Kalayu explains how working with TWB helped him understand the impact a translation can make:

“I have no money to support people, but I have the skill of translation – a skill that can support those who need it in their daily life.” This revelation has made Kalayu a committed language professional.

**Photo by Kalayu. Sunset over the Adwa mountains, Ethiopia.**

“A translation task may take you a day, but for those who need it, it may serve as a life continuing catalyst,” Kalayu Menasbo.

To get in touch about any of the topics mentioned in this post, please join the discussion or email [email protected].

If you know a second language, and you too want to help build a world where knowledge knows no language barriers, apply here to become a translator for TWB.

Written by Danielle Moore, Communications Officer for TWB, with interview responses by Kidane Haile and Kalayu Menasbo, Kató translators for TWB.


Transfer Learning Approaches for Machine Translation