A practical guide to machine translation quality prediction

How much content comes out of post-editing without a single edit? Translation quality prediction can direct your focus to the sections that actually need post-editing, while the rest can be safely skipped.

Text by Adam Bittlingmayer

Image: © Valerii Evlakhov/istockphoto.com

GenAI in machine translation took to the stage in 2017, inside the likes of Google Research and Google Translate, where I worked as an engineer. The technology built upon the work of research labs at institutions such as TU Munich and RWTH Aachen in the 1990s and 2000s. Yet for half a decade it failed to accelerate human-quality translation, because every single word generated by machine translation still had to be post-edited by a human.

Today, high-volume buyers of translation are efficiently translating up to ten times more content while maintaining human quality. They do so by using private, custom-built Large Language Models (LLMs) that can automatically edit and verify most translations: millions and millions of words that need no human eye.

A main use case is the translation of technical content from global companies in critical industries such as hardware, software, chip design, biotech, machinery, and manufacturing. This content requires both efficiency and quality.

If you are wondering how to get started with quality prediction, here is my company ModelFront’s practical guide to machine translation quality prediction. These insights are based on our “quality estimation” research work, which we conducted mainly for technical content in critical industries.

What is quality prediction for?

Quality prediction is for accelerating translation safely: more efficiency, same quality.

More precisely, quality prediction helps to accelerate post-editing workflows by cutting out manual human work for millions of text segments that would not be edited anyway.

Accelerating post-editing workflows helps organizations to grow capacities, speed up turnaround times, and save costs. It could even help to improve quality by shifting more content from raw machine translation to accelerated post-editing, moving content from external vendors to in-house, expanding into more languages, providing more service-level tiers, or reducing workflow steps.

However, quality prediction is not suitable for one-off or low-value offline use cases or for comparing machine translation engines, cleaning translation memories, estimating post-editing effort at the document level, or annotating at the word level.

Quality prediction enables the safe acceleration of post-editing, directly in the main production workflow, by automatically verifying as many segments as can safely be verified so that they can skip manual human post-editing.

Integration and workflows

Quality prediction makes workflows more efficient by automatically triggering human intervention at the right points. It acts as a cache, a translation memory (TM) for new content. A quality prediction system is integrated into the translation management system (TMS) to verify millions of translations at the segment level.

Example workflow

To accelerate a traditional, fully manual post-editing workflow, an automatic quality prediction step is added before the manual human post-editing step (see Figure 1).

Figure 1: The three steps to high-quality machine-translated content with little human input

  1. Machine translation: All new segments are machine-translated. Many of these translations require no edits. But which ones?
  2. Quality prediction: All machine translations get labeled with a quality prediction: verified ✓ or not verified ✗. Verified segments automatically skip human post-editing.
  3. Human post-editing: Only the unverified segments are sent to human post-editing.

The final output is a blend of fully automatic translations and manually post-edited translations.

In the quality prediction step, the status of verified segments is changed to Translated or Confirmed, as if a human translator had already verified them manually. The exact setup is configurable, usually based on existing workflows.

At the end of this automated workflow, the translation management system is updated with the progress and remaining word count and pricing for the file and project. CAT tools skip over verified segments, while still showing them for context, like exact matches from the translation memory. Only the segments that cannot be safely verified are sent to manual human post-editing.
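The routing logic of this three-step workflow can be sketched in a few lines. This is a minimal illustration, not a real integration: `machine_translate` and `predict_quality` are hypothetical stand-ins for calls to an MT engine and a quality prediction API.

```python
def accelerate(segments, machine_translate, predict_quality):
    """Route each source segment: verified machine translations skip
    human post-editing; the rest are sent to human post-editors."""
    verified, needs_editing = [], []
    for source in segments:
        target = machine_translate(source)          # step 1: machine translation
        if predict_quality(source, target):         # step 2: boolean quality prediction
            verified.append((source, target))       # safe to skip the human step
        else:
            needs_editing.append((source, target))  # step 3: human post-editing
    return verified, needs_editing
```

The final file is then assembled from both lists, so the output is the blend of automatic and post-edited translations described above.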

Models

The core technology inside a quality prediction system is a multilingual Large Language Model (LLM) that verifies and edits translations. A quality prediction (QP) model is an LLM built and trained for the task of verifying (or rejecting) translations.

Quality prediction (QP)

LLM to verify (or reject) translations

Input: Source text → target text
Output: Verified (yes) or not verified (no), a boolean

For example:

Input: “The quick brown fox jumps over the lazy dog.” → “Der schnelle braune Fuchs springt über den faulen Hund.”

Output: yes

Or:

Input: “The quick brown fox jumps over the lazy dog.” → “Der kecke braune Fuchs gumpt über der lausigen Dogge.”

Output: no
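As a minimal sketch, the QP interface can be typed like this in Python. The `predict_quality` stub is a hypothetical placeholder, not a real model: it always rejects, which is the safe default when no trained model is available.

```python
from dataclasses import dataclass

@dataclass
class QualityPrediction:
    verified: bool  # True = safe to skip human post-editing

def predict_quality(source: str, target: str) -> QualityPrediction:
    # A real system would call a custom-trained LLM here. This placeholder
    # always rejects, so every segment still goes to a human post-editor.
    return QualityPrediction(verified=False)

# The flawed translation from the example above is (correctly) not verified:
result = predict_quality(
    "The quick brown fox jumps over the lazy dog.",
    "Der kecke braune Fuchs gumpt über der lausigen Dogge.",
)
print(result.verified)  # False
```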

In real-world scenarios, a quality prediction model is custom-trained to predict if a machine translation will not be edited anyway by professional human translators at a certain step in the workflow.

Quality estimation

In research, the precursor to quality prediction was machine translation quality estimation (QE). A quality estimation model or system returned raw scores (a regression task), not a boolean prediction (a binary classification task).

LLM to score translations

Input: Source text → target text
Output: Score from 0.0 to 1.0 (floating-point number), often displayed as 0% to 100%

The meaning of raw quality estimation scores also varies between systems, and the distribution varies between models, languages, and content types.

To make those raw scores useful in real-world workflows, the production system has to convert them into boolean quality predictions: it calibrates and applies a threshold for each model version, language, and content type, so that the final human quality stays the same.
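As an illustrative sketch of that calibration step: pick the lowest threshold at which the share of auto-verified segments that truly needed no edit meets a quality target. The scores and labels below are invented calibration data, and the function is a simplification of what a production system does per model version, language, and content type.

```python
def calibrate_threshold(scores, no_edit_needed, target_precision=0.95):
    """Return the lowest score threshold whose precision (share of
    'verified' segments that truly needed no edit) meets the target."""
    for t in sorted(set(scores)):
        picked = [ok for s, ok in zip(scores, no_edit_needed) if s >= t]
        if picked and sum(picked) / len(picked) >= target_precision:
            return t
    return None  # no safe threshold exists: verify nothing

scores = [0.2, 0.5, 0.7, 0.8, 0.9, 0.95]          # raw QE scores (made up)
no_edit = [False, False, True, True, True, True]  # did humans leave it unedited?

threshold = calibrate_threshold(scores, no_edit)
print(threshold)  # 0.7: only segments scoring at least 0.7 are auto-verified
```

Each new model version or content type would shift the score distribution, which is why the threshold must be recalibrated rather than fixed once.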

Automatic post-editing

An automatic post-editing (APE) model is a separate, generative LLM built for the task of editing translations.

LLM to edit translations

Input: Source text → target text
Output: Edited target text

In real-world scenarios, the automatic post-editing model is custom-trained to edit translations in that workflow and tightly coupled with the custom-trained quality prediction model for the same workflow.
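The coupling of the two models can be sketched as follows, with `ape_model` and `qp_model` as hypothetical callables standing in for the two custom-trained LLMs:

```python
def auto_edit_and_verify(source, target, ape_model, qp_model):
    """APE edits the machine translation; QP then decides whether the
    edited version can be auto-verified or must go to a human."""
    edited = ape_model(source, target)    # generative edit
    verified = qp_model(source, edited)   # boolean verification
    return edited, verified               # unverified → human post-editing
```

In production, both models would be trained on post-editing data from the same workflow, so the QP model learns to judge exactly the kind of output the APE model produces.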

All these models are based on the Transformer, the same core model architecture as used for machine translation systems such as Google Translate and GenAI systems such as ChatGPT. However, these models are specifically built and trained for the task of verifying or editing a translation.

Requirements for buyers

So, how do you decide if a scenario is a fit for the quality prediction of your content? There are strict prerequisites for making quality prediction work in the real world.

High volume

Are you translating a high volume of content? For now, at least, this technology is only feasible in scenarios with enough volume to justify the overhead by automating millions of words per year.

Real value

Does accelerating translation safely create real value for your company? A good test is whether this solution would still interest you and your team if it did not include an LLM. If you just want AI for AI's sake, there are plenty of better options. But if you actually need efficiency and quality, then it makes sense to use the right technology.

Control

Are you in control of your translation workflows, not just TMs? You should be managing the translation management system AND workflow. Over the years, too many companies have lost control and even visibility to agencies and legacy workflow systems. Speaking to the CTOs of tech companies here in Silicon Valley or to translation teams from automotive and machinery companies at the tcworld conference has revealed that they have the same problem.

So, if you are a high-volume buyer, who can get real value, and you are in control of your translation workflows, accelerating translation safely has never been easier.

If you don't have control, then that's the first thing to work on. The good news is that it might only take a few months to get into a better position.

Requirements for providers

Now that you have worked out that your scenario meets the prerequisites, you need to buy (or build) a system that meets your requirements. In our experience, accelerating translation safely in the real world requires safety, convenience, and alignment. These are the minimum requirements for any provider to even be worth considering.

Use these requirements as a checklist when comparing candidate systems (e.g. System A, System B, System C):

  - Safety: same final human quality
  - Convenience: works easily with any TMS, MT, or agency
  - Alignment: full focus and no conflict of interest
  - Success stories: works in real-world scenarios

Safety is key to real savings – if the quality drops, then the value is unclear. Keep in mind that bad quality prediction is worse than no quality prediction. Ultimately, the provider needs to be responsible for safety. Behind the scenes, it requires a full system and lifecycle management: data checks, custom LLMs with strong guardrails, transparent monitoring, A/B testing, engineers on call, human evaluations, and continuous re-training.

Convenience is also key to savings – upfront costs, ongoing costs, and hidden costs can destroy net savings. Quality prediction should work with your existing setup – TMS, MT engines, and translators or agencies. It should also work with your future setup and not lock you in.

Alignment is key to safety and convenience. Avoid potential conflicts of interest. You don’t want to commit to buying more manual translation, or become locked into a translation management system or machine translation engine, or grow the engineering headcount.

What quality prediction changes

Quality prediction should not change the final quality or the setup – the TMS and CAT tools, the MT engines, the agencies or in-house translators, or how they do their day-to-day work. It should just make translation more efficient.

But making a translation team two or even ten times more efficient will inevitably cause changes. More efficiency leads to more demand. And more efficiency and more demand quickly become the new normal. This comes with subtle, counterintuitive changes inside the translation team.

In the era of traditional human post-editing, professional human translators were often rushed to get through the repetitive work. After the shift to quality prediction, the motivation is to get higher quality for the segments they do look at, because that data is used to train LLMs, which ultimately drive efficiency.

In fact, now there is more direct concrete value created by all sorts of work, like cleaning up assets such as TMs and terminology, retraining engines, separating workflows, monitoring, and evaluations.

The future is here

It's just not evenly distributed.

You've already seen content that was made by blending fully automated translation with human translation – you just didn't know it! Software and hardware docs, patents, marketing material, product catalogs... And that's the point! Nobody should notice it.

For example, if you read the docs or patents from the world’s top hardware, pharmaceutical, and machinery companies in Spanish, German, Chinese, or Japanese, you’d never know that millions of words were machine translations that have been AI-edited and AI-verified, without a human in the loop. Like the airliner's autopilot, a credit card's fraud detection, or the humble translation memory, this new technology is growing efficiency without sacrificing quality.

The key is not to just get most of the work done automatically but to also trigger human intervention at the right point. And of course, the systems have to be properly built, customized, deployed, monitored, and updated.

Now quality prediction, with private custom LLMs, guardrails, and monitoring, is accessible in every major third-party translation management system, just like machine translation is.