Language data fills a critical gap for humanitarians

Until now, humanitarians have not had access to data about the languages people speak. But a series of open-source language datasets is about to improve how we communicate with communities in crisis. Eric DeLuca and William Low explain how a seemingly simple question drove an innovative solution.

“Do you know what languages these new migrants speak?”

Lucia, an aid worker based in Italy, asked this seemingly simple question to researchers from Translators without Borders in 2017. Her organization was providing rapid assistance to migrants as they arrived at the port in Sicily. Lucia and her colleagues were struggling to provide appropriate language support. They often lacked interpreters who spoke the right languages and they asked migrants to fill out forms in languages that the migrants didn’t understand.

Unfortunately, there wasn’t a simple answer to Lucia’s question. In the six months prior to our conversation with Lucia, Italy registered migrants from 21 different countries. Even when we knew that people came from a particular region in one of these countries, there was no simple way to know what language they were likely to speak.

The problem wasn’t exclusive to the European refugee response. Translators without Borders partners with organizations around the world which struggle with a similar lack of basic language data.

Where is the data?

As we searched various linguistic and humanitarian resources, we were convinced that we were missing something. Surely there was a global language map? Or at least language data for individual countries?

The more we looked, the more we discovered how much we didn’t know. The language data that does exist is often protected by restrictive copyrights or locked behind paywalls. Languages are often visualized as discrete polygons or specific points on a map, which seems at odds with the messy spatial dynamics that we experience in the real world.

In short, language data isn’t accessible, or easily verifiable, or in a format that humanitarians can readily use.

We are releasing language datasets for nine countries

Today we launch the first openly available language datasets for humanitarian use. This includes a series of static and dynamic maps and 23 datasets covering nine countries: DRC, Guatemala, Malawi, Mozambique, Nigeria, Pakistan, Philippines, Ukraine, and Zambia.

This work is based on a partnership between TWB and University College London. The pilot project received support from Research England’s Higher Education Innovation Fund, managed by UCL Innovation & Enterprise. With support from the Centre for Translation Studies at UCL, this project was the first of its kind in the world to systematically gather and share language data for humanitarian use.

The majority of these datasets are based on existing sources — census and other government data. We curated, cleaned, and reformatted the data to be more accessible for humanitarian purposes. We are exploring ways of deriving new language data in countries without existing sources, and extracting language information from digital sources.

This project is built on four main principles:

TWB Language Data Initiative

1. Language data should be easily accessible

We started analyzing existing government data because we realized there was a lot of quality information that was simply hard to access and analyze. The language indicators from the 2010 Philippines census, for example, were spread over 87 different spreadsheets. Many census bureaus also publish in languages other than English, making it difficult for humanitarians who work primarily in English to access the data. We have gone through the process of curating, translating, and cleaning these datasets to make them more accessible.

2. Language data should work across different platforms

We believe that data interoperability is important. That is, it should be easy to share and use data across different humanitarian systems. This requires data to be formatted in a consistent way and spatial parameters to be well documented. As much as possible, we applied a consistent geographic standard to these datasets. We avoided polygons and GPS points, opting instead to use OCHA administrative units and P-codes. At times this will reduce data precision, but it should make it easier to integrate the datasets into existing humanitarian workflows.

We worked with the Centre for Humanitarian Data to develop and apply consistent standards for coding. We built an HXL hashtag scheme to help simplify integration and processing. Language standardization was one of the most difficult aspects of the project, as governments do not always refer to languages consistently. The Malawi dataset, for example, distinguishes between “Chewa” and “Nyanja,” which are two different names for the same language. In some cases, we merged duplicate language names. In others, we left the discrepancies as they exist in the original dataset and made a note in the metadata.

Even when language names are consistent, the spelling isn’t always. In the DRC dataset, “Kiswahili” is displayed with its Bantu prefix. We have opted instead to use the more common English reference of “Swahili.”

Every dataset uses ISO 639-3 language codes and provides alternative names and spellings to alleviate some of the typical frustrations associated with inconsistent language references.

3. Language data should be open and free to use

We have made all of these datasets available under a Creative Commons Attribution Noncommercial Share Alike license (CC BY-NC-SA-4.0). This means that you are free to use and adapt them as long as you cite the source and do not use them for commercial purposes. You can also share derivatives of the data as long as you comply with the same license when doing so.

The datasets are all available in .xlsx and .csv formats on HDX, and detailed metadata clearly states the source of each dataset along with known limitations.

Importantly, everything is free to access and use.

4. Language data should not increase people’s vulnerability

Humanitarians often cite the potential sensitivities of language as the primary reason for not sharing language data. In many cases, language can be used as a proxy indicator for ethnicity. In some, the two factors are interchangeable.

As a result, we developed a thorough risk-review process for each dataset. This identifies specific risks associated with the data, which we can then mitigate. It also helps us to understand the potential benefits. Ultimately, we have to balance the benefits and risks of sharing the data. Sharing data helps humanitarian organizations and others to develop communication strategies that address the needs of minority language speakers.

In most cases, we aggregated the data to protect individuals or vulnerable groups. For each dataset, we describe the method we used to collect and clean the data, and specify potential imitations. In a few instances, we chose to not publish datasets at all.

How can you help?

This is just the beginning of our effort to provide more accessible language data for humanitarian purposes. Our goal is to make language data openly available for every humanitarian crisis, and we can’t do it alone. We need your help to:

Integrate and share this data. We are not looking to create another data portal. Our strategy is to make these datasets as accessible and interoperable as possible using existing platforms. But we need your feedback so we can improve and expand them.
Add language-related questions into your ongoing surveys. Existing language data is often outdated and does not necessarily represent large-scale population movements. Over the past year, we have worked with partners such as IOM DTM, REACH, WFP, and UNICEF to integrate standard language questions into ongoing surveys. This is essential if we are to develop language data for the countries that don’t have regular censuses. The recent multi-sectoral needs assessment in Nigeria is a good example of how a few strategic language questions can lead to data-driven humanitarian decisions.
Use this language data to improve humanitarian communication strategies. As we develop more data, we hope to provide the tools for Lucia and other humanitarians to design more appropriate communication strategies. Decisions to hire interpreters and field workers, develop radio messaging, or create new posters and flyers should all be data-driven. That’s only possible if we know which languages people speak. An inclusive and participatory humanitarian system requires two-way communication strategies that use languages and formats that people understand.

Clearly, the answer to Lucia’s question turned out to be more complicated than any of us expected. This partnership between TWB and the Centre for Translation Studies at UCL has finally made it possible to incorporate language data into humanitarian workflows. We have established a consistent format, an HXL coding scheme, and processes for standardizing language references. But the work does not stop with these nine countries. Over the next few months we will continue to curate and share existing language datasets for new countries. In the longer term we will be working with various partners to collect and share language data where it does not currently exist. We believe in a world where knowledge knows no language barriers. Putting language on the map is the first step to achieving that.

Eric DeLuca is the Monitoring, Evaluation, and Learning Manager at Translators without Borders.

William Low is a Senior Data and GIS Researcher at University College London.

Funding for this project was provided by Research England’s Higher Education Innovation Fund, managed by UCL Innovation & Enterprise.

Transfer Learning Approaches for Machine Translation

This article was originally posted in the TWB Tech Blog on medium.com

TWB’s current research focuses on bringing language technology to marginalized communities

Translators without Borders (TWB) aims to empower people through access to critical information and two-way communication in their own language. We believe language technology such as machine translation systems are essential to achieving this. This is a challenging task given many of the languages we work with have little to no language data available to build such systems.

In this post, I’ll explain some methods for dealing with low-resource languages. I’ll also report on our experiments in obtaining a Tigrinya-English neural machine translation (NMT) model.

The progress in machine translation (MT) has reached many remarkable milestones over the last few years, and it is likely that it will progress further. However, the development of MT technology has mainly benefited a small number of languages.

Building an MT system relies on the availability of parallel data. The more present a language is digitally, the higher the probability of collecting large parallel corpora which are needed to train these types of systems. However, most languages do not have the amount of written resources that English, German, French and a few other languages spoken in highly developed countries have. The lack of written resources in other languages drastically increases the difficulty of bringing MT services to speakers of these languages.

Low-resource MT scenario

In scientific literature for machine translation, there is no particular consensus on which corpus size constitutes a low-resource scenario. But we can say roughly that a low-resource condition is when the size of the parallel training corpus is not sufficient for reaching an acceptable result with the standard MT approaches. This is usually judged with a standardized automatic evaluation metric called BLEU, which correlates with human translation assessments.

Figure 2, modified from Koehn and Knowles (2017), shows the relationship between the BLEU score and the corpus size for the three MT approaches.

A classic phrase-based MT model outperforms NMT for smaller training set sizes. Only after a corpus size threshold of 15M words, roughly equivalent to 1 million sentence pairs, classic NMT shows its superiority.

Low-resource MT, on the other hand, deals with corpus sizes that are around a couple of thousand sentences. Although this figure shows at first glance that there is no way to obtain anything useful for low resource languages, there are ways to leverage even small data sets. One of these is a deep learning technique called transfer learning, which makes use of the knowledge gained while solving one problem to apply it to a different but related problem.

Cross-lingual transfer learning

Zoph et al. (2018) applied transfer learning in machine translation and proved that having prior knowledge in translation of a separate language pair can improve translating a low-resource language.

Figure 3 illustrates their idea of cross-lingual transfer learning.

The researchers first trained an NMT model on a large parallel corpus — French–English — to create what they call the parent model. In a second stage, they continued to train this model, but fed it with a considerably smaller parallel corpus of a low-resource language. The resulting child model inherits the knowledge from the parent model by reusing its parameters. Compared to a classic approach of training only on the low-resource language, they record an average improvement of 5.6% BLEU over the four languages they experiment with. They further show that the child model doesn’t only reuse knowledge of the structure of the high resource target language but also on the process of translation itself.

The high-resource language to choose as the parent source language is a key parameter in this approach. This decision is usually made in a heuristic way judging by the closeness to the target language in terms of distance in the language family tree or shared linguistic properties. A more sound exploration of which language is best to go for a given language is made in Lin et al. (2019).

Multilingual training

The path that was cleared by cross-lingual transfer learning led naturally to the use of multiple parent languages. The straightforward approach, first described by Dong et al. (2015), mixes all the available parallel data in the languages of interest and sends them into training as illustrated in Figure 4.

What results from the example is one single model that translates from the four languages (French, Spanish, Portuguese and Italian) to English.

Multilingual NMT offers three main advantages. Firstly, it reduces the number of individual training processes needed to one, yet the resulting model can translate many languages at once. Secondly, transfer learning makes it possible for all languages to benefit from each other through the transfer of knowledge. And finally, the model serves as a more solid starting point for a possible low-resource language.

For instance, if we were interested in training MT for Galician, a low-resource romance language, the model illustrated in Figure 4 would be a perfect fit as it already knows how to translate well in four other high-resource romance languages.

A solid report on the use of multilingual models is given by Neubig and Hu (2018). They use a “massively multilingual” corpus of 58 languages to leverage MT for four low-resource languages: Azeri, Belarusian, Galician, and Slovakian. With a parallel corpus size of only 4500 sentences for Galician, they achieved a BLEU score of up to 29.1% in contrast to 22.3% and 16.2% obtained with a classic single-language training with statistical machine translation (SMT) and NMT respectively.

Transfer learning also enables what is called a zero-shot translation, when no training data is available for the language of interest. For Galician, the authors report a BLEU score of 15.5% on their test set without the model seeing any Galician sentences before.

Case of Tigrinya NMT

Tigrinya is an Ethiopian language spoken by around 7.9 million people in Eritrea and Ethiopia. It is neither supported by any commercial MT provider, nor has any publicly available models. TWB is currently developing open datasets and MT for Tigrinya in cooperation with the Masakhane initiative.

Tigrinya is no longer in the very low-resource category thanks to the recently released JW300 dataset by Agic and Vulic. Nevertheless, we wanted to see if a higher resource language could help build a Tigrinya-to-English machine translation model. We used Amharic as a parent language, which is written with the same Ge’ez script as Tigrinya and has larger public data available.

The datasets that were available to us at the time of writing this post are listed below. After JW300 dataset, the largest resource to be found is Parallel Corpora for Ethiopian Languages.

Our transfer-learning-based training process consists of four phases. First, we train on a dataset that is a random mix of all sets totaling up to 1.45 million sentences. Second, we fine-tune the model on Tigrinya using only the Tigrinya portion of the mix. In a third phase, we fine-tune on the training partition of our in-house data. Finally, 200 samples earlier allocated aside from this corpus are used for testing purposes.

As a baseline, we skip the first multilingual training step and use only Tigrinya data to train on.

We see a slight increase in the accuracy of the model on our in-house test set when we use the transfer learning approach. The results in various automatic evaluation metrics are as follows:

Conclusion

Neural machine translation is a data-hungry technology. Although this severely reduces the possibility to expand it to the majority of the world’s languages, we can still apply various techniques to make it available to more people than if we limited ourselves to approaches tuned towards high-resource languages. Methodologies like transfer learning and linguistically informed data mixture have a role to play in helping everyone communicate in their language.

Written by Alp öktem, Computational Linguist for Translators without Borders

Language: Our Collective Blind Spot in the Participation Revolution

Two years ago, I embarked on an amazing journey. I started working for Translators without Borders (TWB). While being a first-time Executive Director poses challenges, immersing myself in the world of language and language technology has by far been the more interesting and perplexing challenge.

Students, Writing, Language — Students practising to write Rohingya Zuban (Hanifi script) in Kutupalong Refugee Camp near Cox’s Bazar, Bangladesh.

Language issues in humanitarian response seem like a “no-brainer” to me. A lot of others in the humanitarian world feel the same way – “why didn’t I think of that before” is a common refrain. Still, we sometimes struggle to convince humanitarians that if people don’t understand the message, they aren’t likely to follow it. When I worked in South Sudan for another organisation, in one village, I spoke English, one of our team interpreted to Dinka or Nuer, and then a local teacher translated to the local language (I don’t even know what it was). I asked a question about how women save money; the response had something to do with the local school not having textbooks. It was clear that there was no communication happening. At the time, I didn’t know what to do to fix it. Now I do – and it’s not difficult or particularly expensive.

That’s the interesting part. TWB works in 300 languages, most of which I’d never heard of, and this is a very small percentage of the over 1,300 languages spoken in the 15 countries currently experiencing the most severe crises. There’s also no reliable data on where exactly each language is spoken. I’ve learned so much about language technology that my dog can almost talk about the importance of maintaining translation memories and clean parallel datasets.

Communicating with conflict-affected people

The International Committee of the Red Cross and the Harvard Humanitarian Initiative have just published a report about communicating with conflict-affected people that mentions language issues and flags challenges with digital communications. (Yay!) Here are some highlights:

Language is a consistent challenge in situations of conflict or other violence, but often overlooked amid other more tangible factors.
Humanitarians need to ‘consider how to build “virtual proximity” and “digital trust” to complement their physical proximity.’
Sensitive issues relating to sexual and gender-based violence are largely “lost in translation.” At the same time, key documents on this topic are rarely translated and usually exclusively available in English.
Translation is often poor, particularly in local languages. Some technology-based solutions have been attempted, for example, to provide multilingual information support to migrants in Europe. However, there is still a striking inability to communicate directly with most people affected by crises.

TWB’s work, focusing on comprehension and technology, has found that humanitarians are simply unaware of the language issues they face.

In north-east Nigeria, TWB research at five sites last year found that 79% of people wanted to receive information in their own language; less than 9% of the sample were mother-tongue Hausa speakers. Only 23% were able to understand simple written messages in Hausa or Kanuri; that went down to just 9% among less educated women who were second-language speakers of Hausa or Kanuri, yet 94% of internally displaced persons receive information chiefly in one of these languages.

In Greece, TWB found that migrants relied on informal channels, such as smugglers, as their trusted sources of information in the absence of any other information they could understand.
TWB research in Turkey in 2017 found that organizations working with refugees were often assuming they could communicate with them in Arabic. That ignores the over 300,000 people who are Kurds or from other countries.
In Cox’s Bazar, Bangladesh, aid organizations supporting the Rohingya refugees were working on the assumption that the local Chittagonian language was mutually intelligible with Rohingya, to which it is related. Refugees interviewed by TWB estimate there is a 70-80% convergence; words such as ‘safe’, ‘pregnant’ and ‘storm’ fall into the other 20-30%.

What can we do?

Humanitarian response is becoming increasingly digital. How do we build trust, even when remote from people affected by crises?

‘They only hire Iranians to speak to us. They often can’t understand what I’m saying and I don’t trust them to say what I say.’ – Dari-speaking Afghan man in Chios, Greece.

Speak to people in their language and use a format they understand: communicating digitally – or any other way – will mean being even more sensitive to what makes people feel comfortable and builds trust. The right language is key to that. Communicating in the right language and format is key to encouraging participation and ensuring impact, especially if the relevant information is culturally or politically sensitive. The right language is the language spoken or understood and trusted by crisis-affected communities; the right format means information is accessible and comprehensible. Providing only written information can hamper communication and engagement efforts with all sectors of the community from the start – especially women, who are more likely to be illiterate.

Lack of data is the first problem: humanitarians do not routinely collect information about the languages people speak and understand, or whether they can read them. It is thus easy to make unsafe assumptions about how far humanitarian communication ‘with communities’ is reaching, and to imagine that national or international lingua francas are sufficient. This can be done safely without harming the individuals or putting the community at risk.

Budgets: Language remains below the humanitarian radar and often absent from humanitarian budgets. Budgeting for and mobilizing trained and impartial translators, interpreters and cultural mediators can ensure aid providers can listen and provide information to affected people in a language they understand.

Language tools: Language information fact-sheets and multilingual glossaries can help organizations better understand key characteristics of the languages affected people speak and ensure use of the most appropriate and accurate terminology to communicate with them. TWB’s latest glossary for Nigeria provides terminology in English/Hausa/Kanuri on general protection issues and housing, land and property rights.

A global dataset on language

TWB is exploring ways of fast-tracking the development and dissemination of a global dataset on language and communication for crisis-affected countries, as a basis for planning effective communication and engagement in the early stages of a response. We plan to complement this with data mining and mapping of new humanitarian language data.

TWB has seen some organizations take this on – The World Health Organization and the International Federation of Red Cross and Red Crescent Societies have both won awards for their approaches to communicating in the right language. Oxfam and Save the Children regularly prioritize language and the International Organization for Migration and the United Nations Office for the Coordination of Humanitarian Affairs are starting to routinely include language and translation in their programs. A few donors are beginning to champion the issue, too.

TWB has only really been able to demonstrate the possibilities for two or three years – and it’s really taking off. It’s such a no-brainer, so cost-effective, it’s not surprising that so many organizations are taking it on. Our next step is to ensure that language and two-way communication are routinely considered, information is collected on the languages that crisis-affected people speak, accountability mechanisms support it, and we make the overall response accessible for those who need protection and assistance.

Written by Aimee Ansari, Executive Director, Translators without Borders.