July 2019

A match made in Amsterdam

Introducing TAUS' "hi-fi" translation data service

Six years ago, in 2013, a Dutch government-funded research project was launched in Amsterdam under the leadership of Professor Khalil Sima’an, a computational linguist at the Institute for Logic, Language and Computation of the University of Amsterdam. Called DatAptor, it brought together a team from Intel, the Directorate General of Translation of the European Commission, and TAUS, the translation automation think tank.

The partners set themselves a critical machine translation (MT) goal: finding a more efficient way to immediately select the very best translation data for an MT engine from the vast amount of available content. Three years later, their response emerged as a brand-new online service – the TAUS Matching Data solution.

Language data is in demand right across the technology map as machine learning solutions are used for applications such as chatbots, automated report writers, and high-quality transcription.

Most of all, machine translation has been able to leverage the language data contained in translation "memories", the databases of parallel human translations created during the first wave of crowdsourced automation.

However, the fit between data availability and optimum usability has never been easy.

The big data-domain fit problem

"Finding language data for MT training has always been a big challenge," says Jaap van der Meer, director of TAUS. "Selecting data for a particular domain was almost impossible. Back in 2010, we began taking an example data set (a simple domain-specific translation memory) to help users compile a personalized corpus from the repository of many billions of segments in the TAUS Data Cloud."

Today, the dramatic arrival of neural MT has made it even more necessary to prime the machine with selected data. But the problem has been how to automatically select the "high-fidelity", or in-domain, data that closely matches your document’s source language style.

Mini-corpus as query term

Using DatAptor, the task is remarkably simple. You create a mini-corpus of segments that typify your domain in your source language (say 20,000 data segments on oil drilling) and use it as the "query term" to search a big data repository. This will then select high-fidelity candidates of in-domain segments from any parallel corpora in the repository. All you need is a smart way to access lots of data!
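To make the "query term" idea concrete, here is a minimal sketch of how a mini-corpus could be used to rank repository segments. It is not the DatAptor or TAUS method, whose details the article does not give. It assumes a deliberately simple stand-in: a bag-of-words profile of the query corpus, compared to each source segment by cosine similarity. The function names (`build_profile`, `select_in_domain`), the sample segments and the placeholder target strings are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a segment and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_profile(query_corpus):
    """Count word frequencies over the whole query corpus (hypothetical helper)."""
    counts = Counter()
    for segment in query_corpus:
        counts.update(tokenize(segment))
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_domain(query_corpus, repository, top_k):
    """Rank (source, target) pairs by how closely the source matches the query corpus.

    Assumption: simple bag-of-words similarity stands in for the real matching method.
    """
    profile = build_profile(query_corpus)
    scored = [
        (cosine(Counter(tokenize(src)), profile), src, tgt)
        for src, tgt in repository
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Illustrative data only: a tiny "oil drilling" query corpus and a mixed repository.
query_corpus = [
    "The drilling rig reached the oil reservoir.",
    "Drill bit wear slows oil drilling.",
]
repository = [
    ("Please reset your password.", "TARGET-1"),
    ("The oil drilling rig was shut down.", "TARGET-2"),
    ("The recipe needs two eggs.", "TARGET-3"),
]

for score, src, tgt in select_in_domain(query_corpus, repository, top_k=2):
    print(f"{score:.2f}  {src}  ->  {tgt}")
```

In this toy run the oil-drilling segment ranks first, because it shares the most vocabulary with the query corpus. A production system would need a far more robust similarity measure than raw word overlap.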

As Khalil Sima’an explains: "Our dream was to make the World Wide Web itself the source of all data selections." But this would have been overly ambitious. So, the team decided to prove the concept on a smaller dataset, the best multiple-domain collection of translation data they knew: TAUS’s own data repository.

This collection of parallel corpora from the language services industry has been accumulating since 2007, and by 2013 held almost 70 billion words in 220 language combinations, indexed by domain type for easier searching.

Improving segment matches

By closely examining corpora associated with given domains, the team also learned that each domain is in fact a mixture of many subdomains. This means that by making metrics of combinations of segments across all subdomains in a very large repository, you can open up a wealth of new, untapped selections. In other words: better matching of segments.

So, if the user provides a Query Corpus representing their domain of interest, the Matching Data method (as it is now called) will find the most suitable selection in the repository.

Moreover, this can be improved by using matching scores for each segment. Users can then decide whether they want to download compact, medium or large selections of data, depending on their needs.
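As a rough illustration of how per-segment matching scores could translate into compact, medium or large downloads, the sketch below groups scored segments by cut-off value. The threshold values, the tier logic and the function name `tier_selections` are assumptions made for this example. The article does not say how the service actually sizes its selections.

```python
def tier_selections(scored_segments, thresholds=None):
    """Group (score, segment) pairs into nested selections by minimum score.

    Assumption: each tier keeps every segment at or above its cut-off,
    so "large" contains "medium", which contains "compact".
    """
    if thresholds is None:
        # Hypothetical cut-offs, chosen only for illustration.
        thresholds = {"compact": 0.6, "medium": 0.4, "large": 0.2}
    return {
        name: [segment for score, segment in scored_segments if score >= cutoff]
        for name, cutoff in thresholds.items()
    }


# Illustrative scores only.
scored_segments = [
    (0.9, "segment A"),
    (0.5, "segment B"),
    (0.25, "segment C"),
    (0.1, "segment D"),
]

tiers = tier_selections(scored_segments)
for name, segments in tiers.items():
    print(name, len(segments), segments)
```

A stricter cut-off gives a smaller selection that sits closer to the domain, while a looser one trades some fidelity for volume.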

Successful road-test

To road-test Matching Data, the team ran a trial with Oracle International Product Solutions by developing a colloquial corpus for general online conversations and chats, between English and Chinese, Korean, Japanese, Spanish and Brazilian Portuguese.

Oracle’s average quality score on the segments found through the Matching Data process was a satisfying 84 percent. DatAptor proved to be a highly successful partnership between research and industry, delivering on its promise and successfully transmuting the outcomes into a new service – the TAUS Matching Data service, which went live at the beginning of 2019.

Check it out at www.tausdata.org, and join in the ongoing effort to use, develop, fine-tune and improve on matching data provision for the industry as a whole.