June 2017
By Konstantin Dranch


Konstantin Dranch is an independent researcher in the translation industry and the founder of the translationrating.com portal. From 2014 to 2016 he worked for Memsource, the company behind a cloud translation platform.

Twitter: @constanch

What Big Data can tell us about the translation industry

The rapid adoption of cloud-based translation tools, which aggregate usage statistics across thousands of organizations and translators, is producing Big Data for the translation industry. With the help of this Big Data, we can benchmark productivity, identify industry trends, and train machine intelligence.

Only five years ago, translation data was mostly isolated and siloed. Companies employed on-premises translation software that kept data locked away within the company, perhaps only shared at conferences. Moreover, comparing statistics such as translation speed and quality was difficult because globalization teams developed custom metrics and tracked performance in their own unique ways.

Cloud computing leads to Big Data

By the early 2010s, cloud translation tools started to gain prominence and user communities grew. Unlike their predecessors, cloud tools had central storage and aggregated data from many organizations and individual translators. They offered the same standard metrics to everyone, solving the problem of data inhomogeneity.

Initially, the cloud approach encountered severe resistance. Enterprises with secrets to protect and translation companies working with confidential materials were reluctant to upload their translations onto third-party servers. Germany's market was particularly hard for cloud platforms to penetrate. By 2017, however, cloud platforms had overcome this resistance and gained widespread adoption. Major European platforms currently have tens of thousands or even hundreds of thousands of registered users. Memsource exceeded 100,000 accounts by 2017. SmartCAT boasts 70,000 translators available via the platform. The crowdsourcing platform Crowdin claims to have 900,000 registered users.

Each of these platforms can collect the data from all of its users under one roof, eventually amassing billions of translated words and thousands of working hours: the Big Data of the translation industry.

Machine learning

Big Data has the potential to dramatically shake up the translation industry. Its most promising long-term use is machine learning. With editing, proofreading and terminology work stored in hundreds of language combinations, platforms have plenty of material to train translation engines and spell-checkers. Translation data usually remains the property of its users, however, and is not accessible to software engineers. Under some circumstances, users agree to share their non-confidential content: Microsoft Translator, for example, comes with a free feedback engine that provides and collects reusable translations used to train the engine further.

There are already a few examples of machine learning implemented in translation systems. Matecat, for instance, used its material to automate tag placement during translation. SmartCAT analyzed patterns in translator behavior to machine-"guess" whether a selected translator could complete a task by the deadline. Lilt.com and SDL are experimenting with adaptive MT, which learns from users as they work within the system. None of these experiments has yet evolved into a "killer feature" that lets one platform take control of the market, but tool providers around the world keep experimenting.

Productivity benchmarking

A more mature use for Big Data is to get a bird's-eye view of the translation industry. This can already be achieved.

In Memsource, we were able to set up a business intelligence feature that helped gather and analyze metadata from translation jobs, including:

  • volume of words
  • language combinations
  • leverage of technology

The information was anonymized and analyzed in bulk, irrespective of the user. After a few weeks of cleaning the data, we were able to extract several relevant findings that indicate market trends.

Finding 1: Languages of India, Asia and Eastern Europe are growing in prominence

Volume: * low millions; ** above 5 million words; *** above 10 million words; **** above 20 million words; ***** above 100 million words

Figure 1: Annual growth of the number of words translated into each target language


For the chart shown in Figure 1, we compared the number of words translated into each language in Memsource in the second quarter of 2015 and the second quarter of 2016. The overall volume tripled, from about 150 million words to 450 million words. Languages that exceeded this overall growth rate show a particularly steep rise in demand.

We used only target languages with more than one million words translated over the second quarter of 2016.
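The comparison behind Figure 1 can be sketched in a few lines. The per-language volumes below are hypothetical, chosen only to illustrate the low-base effect; they are not actual Memsource figures.

```python
# A minimal sketch of the growth comparison, assuming hypothetical word
# volumes per target language (these are not actual Memsource figures).

def growth(old_words: int, new_words: int) -> float:
    """Later-quarter volume as a multiple of the earlier quarter."""
    return new_words / old_words

# Hypothetical Q2 2015 vs. Q2 2016 word volumes per target language.
volumes = {
    "hi": (300_000, 1_500_000),     # low base, steep rise
    "ja": (20_000_000, 35_000_000), # high base, slower relative growth
}

overall = 450 / 150  # total platform volume tripled, per the text

# Languages whose growth outpaced the platform overall stand out in the chart.
outliers = {lang: growth(old, new)
            for lang, (old, new) in volumes.items()
            if growth(old, new) > overall}
print(outliers)  # {'hi': 5.0}
```

This is why Hindi can top the table while Japanese, with far more absolute volume, does not appear among the fastest growers.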

Here are the findings:

  1. Marathi and Hindi lead the table with incredible growth from a low base. This signifies that some brands are now starting to translate into the languages of India and target consumers there.
  2. The growth of Latvian, Serbian, Romanian and Hungarian signifies rising business interest in the smaller but less saturated markets of Eastern Europe.
  3. Finnish, Swedish, Dutch, Norwegian and Norwegian Bokmål show the continuing growth of business interest in the Nordic countries. The community of Memsource users in these countries is strong and grows stronger every year, which might influence the data.
  4. Chinese, Vietnamese and Indonesian represent the rise of Asian languages. Japanese is not part of this trend, perhaps because Memsource already has a stable community of users in Japan, which led to a high base volume and makes growth of 300 percent or more harder to achieve.

Finding 2: Implementing translation memory saves 36 percent of the translation budget

Figure 2: Translation memory savings across the top 100 Memsource users


Using a large sample of 516.5 million words translated in Memsource, we were able to pinpoint the average increase in productivity and the budget savings from translation memory.

For this sample, we looked at the last six months' worth of translations from our top 100 users. These are mostly large translation companies and in-house translation departments of software and manufacturing enterprises.

Translation memory is the foundational technology of professional language services: it allows the reuse of previous translations and, depending on content type and volume, speeds up work considerably. Translation memory works by checking how similar a new text segment is to the best available match in the translation memory database, classifying segments into eight categories.

Words belong to one of the following categories:

  • Repetition: a segment that is repeated within the translated document
  • Context match: a 100% match preceded and followed by other 100% matches
  • 100% match: a segment identical to a segment in the TM
  • 95–99% similarity
  • 85–94% similarity
  • 75–84% similarity
  • 50–74% similarity
  • 0–49% similarity (no match)

Table 1: Translation memories classify text segments into different categories.
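The bucketing in Table 1 can be sketched as a simple classification function. The similarity score itself would come from the tool's own matching algorithm, which is not described in the article; only the thresholds below follow Table 1.

```python
# A sketch of the segment bucketing described in Table 1. The similarity
# score would come from the tool's own (proprietary) matching algorithm;
# only the category thresholds follow the article.

def classify(score: float, in_context: bool = False, repeated: bool = False) -> str:
    """Map a similarity score in [0.0, 1.0] to a TM match category."""
    if repeated:
        return "repetition"       # segment repeats within the document
    if score == 1.0:
        # a 100% match surrounded by other 100% matches is a context match
        return "context match" if in_context else "100% match"
    if score >= 0.95:
        return "95-99% fuzzy"
    if score >= 0.85:
        return "85-94% fuzzy"
    if score >= 0.75:
        return "75-84% fuzzy"
    if score >= 0.50:
        return "50-74% fuzzy"
    return "no match"             # 0-49% similarity

print(classify(0.97))  # 95-99% fuzzy
```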


Figure 3: Sample translation memory matches


In our sample, out of the 516 million words translated, 38 percent had a match of some kind in the translation memory, and 14.6 percent more were repetitions. Translators save time only with repetitions and good quality matches. To calculate savings, we applied a discount to matches: 80 percent for exact matches, 75 percent for good quality matches and 25 percent for fuzzy matches. The resulting cost savings differed from organization to organization and ranged from 14 to 90 percent. The average value for the whole volume was 36 percent.
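The savings arithmetic can be reproduced in a few lines. The discount rates are the ones quoted above; the 1,000-word category split below is illustrative, chosen only to show how the weighted average works out.

```python
# Reproducing the savings arithmetic with the discounts quoted in the text.
# The 1,000-word category split is illustrative, not the actual sample.

DISCOUNTS = {
    "repetition": 0.80,  # treated like exact matches
    "exact": 0.80,
    "good": 0.75,        # high-similarity fuzzy matches
    "fuzzy": 0.25,
    "none": 0.00,        # full price, no savings
}

def savings(words_by_category: dict) -> float:
    """Weighted share of the budget saved across all words."""
    total = sum(words_by_category.values())
    saved = sum(words * DISCOUNTS[cat] for cat, words in words_by_category.items())
    return saved / total

sample = {"repetition": 146, "exact": 180, "good": 100, "fuzzy": 100, "none": 474}
print(f"{savings(sample):.0%}")  # 36%
```

Varying the category mix is what produces the 14 to 90 percent spread between organizations: content with many repetitions and exact matches saves far more than content that is mostly new.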


Finding 3: Professional human translators can leverage generic MT to gain a 5–20 percent boost in productivity

We used a similar method to find out how post-edited machine translation can improve the productivity of professional translators. Measuring how closely MT suggestions resembled actual human translations performed in Memsource, we were able to draw conclusions about MT relevancy for translators.

Figure 4: Machine translation leverage by language pair


In Figure 4:

  • Match 100: segments where the professional human translation is identical to the MT suggestion
  • Match 85–95: MT suggestions are close enough to use after edits
  • Match 50–75: MT is useful for auto-completion of individual words, but not whole segments
  • Match 0: segments with 0–49 percent similarity to the human translation

For the chart shown in Figure 4, we could only track projects where users first enabled machine translation.

For this experiment, we measured the performance of generic, non-customized MT engines, and used a sample of 38 million words.
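The measurement amounts to comparing each raw MT suggestion with the final human translation and bucketing the similarity. The article does not name the similarity metric used, so in the sketch below difflib's ratio stands in for it, and the buckets follow the Figure 4 labels.

```python
# A sketch of the MT-leverage measurement: compare the raw MT suggestion
# with the final human translation and bucket the similarity. The metric
# (difflib's ratio) is an assumption; the article does not name one.
from difflib import SequenceMatcher

def mt_leverage(mt_suggestion: str, human_translation: str) -> str:
    score = SequenceMatcher(None, mt_suggestion, human_translation).ratio()
    if score == 1.0:
        return "Match 100"    # MT usable as-is, no edits
    if score >= 0.85:
        return "Match 85-95"  # usable after light edits
    if score >= 0.50:
        return "Match 50-75"  # useful for word auto-completion
    return "Match 0"          # little or no correlation

print(mt_leverage("Le serveur est en marche.", "Le serveur est en marche."))  # Match 100
```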

Overall, only 5–20 percent of the suggestions from MT were good enough to simply use as final translation without any edits. Up to 40 percent of the suggestions were usable after editing.

French, Portuguese, Spanish and English machine translation engines had the highest rates of MT leverage. English to French stood out, with more than 20 percent of translations being a complete match to the MT suggestion and almost 90 percent of segments having at least some coherence with the MT. In comparison, Russian, Polish and Korean had much lower leverage rates: below 5 percent exact matches.

The difference is probably due to the morphological typology of the languages. French, Portuguese, Spanish and English are analytic languages, which rely on word order and auxiliary words such as "are" or "will" to convey meaning. Russian, Polish and Korean are synthetic, meaning they use many more inflections, and MT still struggles with inflections.

To conclude

As the above examples demonstrate, Big Data can deliver interesting findings for the translation industry. Hard data of this kind simply was not available before. There is great potential for using such findings in data-driven decision-making, and as companies learn to harness the power of Big Data, it may have a profound impact on the industry.