Transfer Learning Approaches for Machine Translation

This article was originally posted on the TWB Tech Blog on medium.com

TWB’s current research focuses on bringing language technology to marginalized communities

Translators without Borders (TWB) aims to empower people through access to critical information and two-way communication in their own language. We believe language technology such as machine translation is essential to achieving this. It is a challenging task, given that many of the languages we work with have little to no data available for building such systems.

In this post, I’ll explain some methods for dealing with low-resource languages. I’ll also report on our experiments in obtaining a Tigrinya-English neural machine translation (NMT) model.

Progress in machine translation (MT) has reached many remarkable milestones over the last few years, and it is likely to continue. However, the development of MT technology has mainly benefited a small number of languages.

Building an MT system relies on the availability of parallel data. The more digitally present a language is, the higher the probability of collecting the large parallel corpora needed to train these systems. However, most languages do not have the volume of written resources that English, German, French and a few other languages spoken in highly developed countries have. This lack of written resources drastically increases the difficulty of bringing MT services to speakers of these languages.

Low-resource MT scenario

Figure 2, modified from Koehn and Knowles (2017), shows the relationship between the BLEU score and the corpus size for the three MT approaches.

A classic phrase-based MT model outperforms NMT for smaller training sets. Only beyond a threshold of roughly 15 million words of training data, equivalent to about 1 million sentence pairs, does NMT show its superiority.

Low-resource MT, on the other hand, deals with corpora of only a few thousand sentences. Although at first glance the figure suggests that nothing useful can be obtained for low-resource languages, there are ways to leverage even small data sets. One of them is a deep learning technique called transfer learning, which applies the knowledge gained while solving one problem to a different but related problem.

Cross-lingual transfer learning

Figure 3 illustrates the idea of cross-lingual transfer learning introduced by Zoph et al. (2016).

The researchers first trained an NMT model on a large parallel corpus — French–English — to create what they call the parent model. In a second stage, they continued training this model, but fed it a considerably smaller parallel corpus of a low-resource language. The resulting child model inherits knowledge from the parent model by reusing its parameters. Compared to the classic approach of training only on the low-resource language, they record an average improvement of 5.6% BLEU across the four languages they experiment with. They further show that the child model reuses not only knowledge of the structure of the high-resource target language but also knowledge of the translation process itself.
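Under the hood, “reusing its parameters” simply means the child model’s weights are initialised from the trained parent’s weights instead of randomly. The toy sketch below illustrates that mechanic with a dict of scalar parameters and a dummy update rule standing in for real NMT training; all names, corpora and the “training” objective are invented for illustration:

```python
import random

def train(params, corpus, lr=0.1, epochs=3):
    """Toy 'training': nudge every parameter toward a corpus-derived
    statistic. This stands in for gradient descent on a real NMT
    objective; the point is only that params are updated in place."""
    target = sum(len(s) for s in corpus) / len(corpus)  # dummy statistic
    for _ in range(epochs):
        for k in params:
            params[k] += lr * (target - params[k])
    return params

# Parent model: random initialisation, trained on the large parent corpus.
random.seed(0)
parent = {f"w{i}": random.random() for i in range(4)}
parent_corpus = ["le chat ||| the cat", "un grand arbre ||| a big tree"] * 100
train(parent, parent_corpus)

# Child model: starts from a *copy of the parent's parameters* rather than
# from scratch, then continues training on the small low-resource corpus.
child = dict(parent)  # parameter reuse = the transfer step
child_corpus = ["toy low-resource pair ||| example"] * 5
train(child, child_corpus, epochs=1)
```

In a real toolkit the same effect is achieved by loading the parent checkpoint and resuming training on the child corpus, usually after adapting the vocabulary and embeddings to the new source language.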

The choice of high-resource parent source language is a key parameter in this approach. The decision is usually made heuristically, judging by closeness to the target language in the language family tree or by shared linguistic properties. A more principled exploration of which parent language works best for a given target language is presented in Lin et al. (2019).

Multilingual training

The result, in this example, is one single model that translates from four languages (French, Spanish, Portuguese and Italian) into English.
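A common way to fold several translation directions into one network, popularised by Google’s multilingual NMT work (Johnson et al., 2017), is to mark each source sentence with a token indicating the desired target language and merge all corpora into a single training stream. A minimal sketch of that preprocessing step — the `<2xx>` tag format and the example sentences are illustrative:

```python
def tag_pair(src_sentence, tgt_lang):
    """Prepend a target-language token so a single model can serve many
    translation directions (the <2xx> tag format is illustrative)."""
    return f"<2{tgt_lang}> {src_sentence}"

# Four parallel corpora, here reduced to one sentence each.
corpus = [
    ("le chat dort", "fr", "en"),
    ("el gato duerme", "es", "en"),
    ("o gato dorme", "pt", "en"),
    ("il gatto dorme", "it", "en"),
]

# Merge all language pairs into one tagged training stream.
training_stream = [(tag_pair(src, tgt), tgt) for src, _, tgt in corpus]
```

With a single target language, as here, the tag is redundant; it becomes essential as soon as the same model must also produce multiple target languages.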

Multilingual NMT offers three main advantages. Firstly, it reduces the number of individual training processes needed to one, yet the resulting model can translate many languages at once. Secondly, transfer learning makes it possible for all languages to benefit from each other through the transfer of knowledge. And finally, the model serves as a more solid starting point for a possible low-resource language.

For instance, if we were interested in training MT for Galician, a low-resource Romance language, the model illustrated in Figure 4 would be a perfect fit, as it already knows how to translate well in four other high-resource Romance languages.

A solid report on the use of multilingual models is given by Neubig and Hu (2018). They use a “massively multilingual” corpus of 58 languages to improve MT for four low-resource languages: Azeri, Belarusian, Galician, and Slovak. With a parallel corpus of only 4,500 Galician sentences, they achieve a BLEU score of up to 29.1%, in contrast to 22.3% and 16.2% obtained with classic single-language training using statistical machine translation (SMT) and NMT respectively.

Transfer learning also enables what is called zero-shot translation: translating a language for which no training data is available. For Galician, the authors report a BLEU score of 15.5% on their test set without the model having seen a single Galician sentence before.

Case of Tigrinya NMT

Tigrinya is no longer in the very low-resource category, thanks to the recently released JW300 dataset by Agić and Vulić. Nevertheless, we wanted to see whether a higher-resource language could help build a Tigrinya-to-English machine translation model. We chose Amharic as the parent language: it is written in the same Ge’ez script as Tigrinya and has more public data available.

The datasets that were available to us at the time of writing are listed below. After the JW300 dataset, the largest resource is the Parallel Corpora for Ethiopian Languages.

Our transfer-learning-based training process consists of four phases. First, we train on a random mix of all sets, totaling 1.45 million sentences. Second, we fine-tune the model on the Tigrinya portion of that mix. Third, we fine-tune on the training partition of our in-house data. Finally, 200 sentence pairs held out from this in-house corpus are used for testing.
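The first two phases amount to plain data preparation: build one randomly mixed corpus of everything, then restrict to the Tigrinya–English portion for fine-tuning. A sketch of that step with toy data — the dataset names and sizes here are illustrative, not our actual corpora:

```python
import random

def make_phase_corpora(datasets, seed=1234):
    """Build the corpora for the first two training phases.

    datasets maps a language-pair name to a list of sentence pairs.
    Phase 1 trains on a random mix of everything; phase 2 keeps only
    the Tigrinya-English portion."""
    rng = random.Random(seed)
    mixed = [pair for pairs in datasets.values() for pair in pairs]
    rng.shuffle(mixed)                       # phase 1: random mix of all sets
    tigrinya_only = list(datasets["ti-en"])  # phase 2: Tigrinya portion only
    return mixed, tigrinya_only

datasets = {
    "am-en": [("amharic sent %d" % i, "english am %d" % i) for i in range(6)],
    "ti-en": [("tigrinya sent %d" % i, "english ti %d" % i) for i in range(3)],
}
mixed, ti = make_phase_corpora(datasets)
```

Phases 3 and 4 then reuse the same mechanism with the in-house training partition and the held-out test pairs.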

As a baseline, we skip the first multilingual training phase and train on the Tigrinya data alone.

We see a slight increase in the accuracy of the model on our in-house test set when we use the transfer learning approach. The results in various automatic evaluation metrics are as follows:
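For reference, the BLEU scores quoted in this post are corpus-level: a geometric mean of clipped n-gram precisions (n = 1..4) multiplied by a brevity penalty. A minimal implementation conveying the idea — real toolkits such as sacreBLEU add tokenization and smoothing details omitted here:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU over whitespace-tokenized strings."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if 0 in match or 0 in total:
        return 0.0  # no overlap at some n-gram order (no smoothing)
    precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(precision)
```

A perfect match scores 100, and fully disjoint output scores 0; published scores should always come from a standard tool so that numbers are comparable across papers.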

Conclusion

Written by Alp Öktem, Computational Linguist for Translators without Borders

Language: Our Collective Blind Spot in the Participation Revolution

Two years ago, I embarked on an amazing journey. I started working for Translators without Borders (TWB). While being a first-time Executive Director poses challenges, immersing myself in the world of language and language technology has by far been the more interesting and perplexing challenge.

 

Students practising to write Rohingya Zuban (Hanifi script) in Kutupalong Refugee Camp near Cox’s Bazar, Bangladesh.

Language issues in humanitarian response seem like a “no-brainer” to me. A lot of others in the humanitarian world feel the same way – “why didn’t I think of that before” is a common refrain. Still, we sometimes struggle to convince humanitarians that if people don’t understand the message, they aren’t likely to follow it. When I worked in South Sudan for another organisation, in one village, I spoke English, one of our team interpreted into Dinka or Nuer, and then a local teacher translated into the local language (I don’t even know what it was). I asked a question about how women save money; the response had something to do with the local school not having textbooks. It was clear that there was no communication happening. At the time, I didn’t know what to do to fix it. Now I do – and it’s not difficult or particularly expensive.

That’s the interesting part. TWB works in 300 languages, most of which I’d never heard of, and this is a very small percentage of the over 1,300 languages spoken in the 15 countries currently experiencing the most severe crises. There’s also no reliable data on where exactly each language is spoken. I’ve learned so much about language technology that my dog can almost talk about the importance of maintaining translation memories and clean parallel datasets.

Communicating with conflict-affected people

The International Committee of the Red Cross and the Harvard Humanitarian Initiative have just published a report about communicating with conflict-affected people that mentions language issues and flags challenges with digital communications. (Yay!) Here are some highlights:

  • Language is a consistent challenge in situations of conflict or other violence, but often overlooked amid other more tangible factors.

  • Humanitarians need to ‘consider how to build “virtual proximity” and “digital trust” to complement their physical proximity.’

  • Sensitive issues relating to sexual and gender-based violence are largely “lost in translation.” At the same time, key documents on this topic are rarely translated and usually exclusively available in English.

  • Translation is often poor, particularly in local languages. Some technology-based solutions have been attempted, for example, to provide multilingual information support to migrants in Europe. However, there is still a striking inability to communicate directly with most people affected by crises.

TWB’s work, focusing on comprehension and technology, has found that humanitarians are simply unaware of the language issues they face.

  • In north-east Nigeria, TWB research at five sites last year found that 79% of people wanted to receive information in their own language, yet less than 9% of the sample were mother-tongue Hausa speakers. Only 23% could understand simple written messages in Hausa or Kanuri, falling to just 9% among less educated women who spoke Hausa or Kanuri as a second language. Even so, 94% of internally displaced persons receive information chiefly in one of these languages.
  • In Greece, TWB found that migrants relied on informal channels, such as smugglers, as their trusted sources of information in the absence of any other information they could understand.

  • TWB research in Turkey in 2017 found that organizations working with refugees were often assuming they could communicate with them in Arabic. That ignores the over 300,000 people who are Kurds or from other countries.

  • In Cox’s Bazar, Bangladesh, aid organizations supporting the Rohingya refugees were working on the assumption that the local Chittagonian language was mutually intelligible with Rohingya, to which it is related. Refugees interviewed by TWB estimate there is a 70-80% convergence; words such as ‘safe’, ‘pregnant’ and ‘storm’ fall into the other 20-30%.

What can we do?

Humanitarian response is becoming increasingly digital. How do we build trust, even when remote from people affected by crises?

‘They only hire Iranians to speak to us. They often can’t understand what I’m saying and I don’t trust them to say what I say.’ – Dari-speaking Afghan man in Chios, Greece.

Speak to people in their language and use a format they understand: communicating digitally – or any other way – means being even more sensitive to what makes people feel comfortable and builds trust. Communicating in the right language and format is key to encouraging participation and ensuring impact, especially when the relevant information is culturally or politically sensitive. The right language is the one spoken or understood and trusted by crisis-affected communities; the right format means the information is accessible and comprehensible. Providing only written information can hamper communication and engagement with all sectors of the community from the start – especially women, who are more likely to be illiterate.

Lack of data is the first problem: humanitarians do not routinely collect information about the languages people speak and understand, or whether they can read them. It is thus easy to make unsafe assumptions about how far humanitarian communication ‘with communities’ is reaching, and to imagine that national or international lingua francas are sufficient. Collecting this data can be done safely, without harming individuals or putting communities at risk.

Budgets: Language remains below the humanitarian radar and often absent from humanitarian budgets. Budgeting for and mobilizing trained and impartial translators, interpreters and cultural mediators can ensure aid providers can listen and provide information to affected people in a language they understand.

Language tools: Language information fact-sheets and multilingual glossaries can help organizations better understand key characteristics of the languages affected people speak and ensure use of the most appropriate and accurate terminology to communicate with them. TWB’s latest glossary for Nigeria provides terminology in English/Hausa/Kanuri on general protection issues and housing, land and property rights.

A global dataset on language

TWB is exploring ways of fast-tracking the development and dissemination of a global dataset on language and communication for crisis-affected countries, as a basis for planning effective communication and engagement in the early stages of a response. We plan to complement this with data mining and mapping of new humanitarian language data.

TWB has seen some organizations take this on – The World Health Organization and the International Federation of Red Cross and Red Crescent Societies have both won awards for their approaches to communicating in the right language. Oxfam and Save the Children regularly prioritize language and the International Organization for Migration and the United Nations Office for the Coordination of Humanitarian Affairs are starting to routinely include language and translation in their programs. A few donors are beginning to champion the issue, too.

TWB has only really been able to demonstrate the possibilities for two or three years – and it’s really taking off. It’s such a no-brainer, so cost-effective, it’s not surprising that so many organizations are taking it on. Our next step is to ensure that language and two-way communication are routinely considered, information is collected on the languages that crisis-affected people speak, accountability mechanisms support it, and we make the overall response accessible for those who need protection and assistance.

Written by Aimee Ansari, Executive Director, Translators without Borders.