March 2014
By Artem Ukrainets and Anna Sidorova


Anna Sidorova has been working for ABBYY, a global provider of document recognition, data capture and linguistic products, since 2007. Anna currently heads the marketing of ABBYY Language Services, the most dynamic division of the ABBYY group.

Anna_Si[at]abbyy.com
www.abbyy-ls.com

Artem Ukrainets received his PhD from the Moscow Institute of Physics and Technology and has been working in localization, translation and software development for almost ten years. In 2010, Artem joined ABBYY Language Services, a provider of language services and technologies, where he currently serves as Head of Research.

Raising productivity of automated translation: The factor of terminology

To reveal the benefits of terminology management and promote its use, linguistic services and technologies provider ABBYY Language Services conducted research to evaluate the impact of terminology management on automated translation. Here is what the company found.

Professional translators can hardly imagine their job without glossaries. Improper or inconsistent use of terms is the source of many (if not most) of the errors that Language Service Providers (LSPs) report in poor translations, and it regularly leads to customer complaints. With the tremendous technological breakthroughs in the translation industry during the past two decades, the need for proper terminology management has become more urgent. What used to be translators’ personal reference materials, glossaries and translation memories (TM), have become a company’s linguistic assets.

The proliferation of Computer-Aided Translation (CAT) tools and the emergence of Machine Translation (MT) technology integrated into some of these tools have been in the spotlight for several years. One of the most important questions today is how technology improves human productivity and what needs to be done to increase the efficiency of automated translation. In 2013, we conducted extensive research on the impact of translation memories and different machine translation systems on the speed of the translation process. This year we approached a more subtle issue that is harder to evaluate: the impact of terminology management on automated translation.

Our starting point

First, let’s touch upon existing terminology practices. A glossary consistency check is usually performed after an in-country review of the translation. We often hear from translators that they spend a lot of their time (up to 90 percent in certain cases!) searching for the right translation on the Internet or in reference materials that the client has provided. The glossary is still frequently supplied as a PDF or Excel file and is not directly integrated into the translation workflow. Users therefore have to either search for terms, which is very time-consuming, or simply learn them by heart. In some cases, knowledge of these terms is what qualifies a translator for a particular field. So if we provide a well-prepared and easy-to-access glossary, we might be able to unleash the potential of many other professional translators who are not yet familiar with the field.

And yet, in most existing CAT tools, which are widely used by professional translators and offer a wide range of features, terminology support remains a simple display of a term and its translation, in which other forms of the word may be missed.

Second, our 2013 research clearly showed the new value of terminology: terminology is the decisive factor in achieving good MT results. All of the post-editors of automated translation who took part in our experiments stated that they spent the majority of their time searching for and editing terms. Further analysis revealed that post-editors sometimes even changed terms that had already been translated properly, i.e. they did unnecessary work that increased the overall post-editing time.

All of this suggests that a proper implementation of MT technology should include an advanced terminology module to ensure real gains in productivity. Without proper implementation, the system may deliver poor fuzzy matches that simply do not improve productivity. Perhaps more alarmingly, bad MT proposals at the segment level can reduce productivity, since the translator has to read through multiple variants to decide which one to use, if any at all.

Our goal

Our 2014 research is dedicated to measuring how terminology-related functionality affects translation time. In general, the translator is expected to use the proper terms during the translation phase, and the editor then makes sure that all terms have been used correctly. The operational overhead and the time needed for these two phases depend greatly on whether the glossary and QA checks are integrated into the translation workflow.

Our technical base

We used the cloud-based automated translation environment www.SMARTCAT.pro integrated with one of the state-of-the-art terminology solutions (www.LINGVO.PRO). The latter uses advanced morphological search, which allowed us to easily find all forms of terms regardless of their tense, gender, number or case. All glossary terms are displayed at the very moment of translation. In this way the translators did not have to employ any external system or reference material to look for the term and its translation. The application of this technology allowed us to achieve a significant reduction of translation time.
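The idea of finding a term in any of its forms can be illustrated with a deliberately naive sketch. It does not show how LINGVO.PRO works internally, which this article does not describe. It replaces real morphological analysis with crude English suffix stripping, and the glossary entries as well as the helpers `naive_stem` and `find_terms` are hypothetical names chosen for illustration only.

```python
import re

# Hypothetical English-to-German glossary entries, for illustration only.
GLOSSARY = {
    "claim": "Anspruch",
    "method": "Verfahren",
}

# A tiny suffix list; real morphological analysis is far more involved.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common English suffix; a crude stand-in for real morphology."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_terms(segment: str, glossary: dict[str, str]) -> list[tuple[str, str]]:
    """Return (word as written, glossary translation) for each term occurrence."""
    stems = {naive_stem(term): translation for term, translation in glossary.items()}
    hits = []
    for word in re.findall(r"[A-Za-z]+", segment):
        translation = stems.get(naive_stem(word))
        if translation is not None:
            hits.append((word, translation))
    return hits
```

For the segment "The claimed methods differ from both claims.", `find_terms` reports "claimed", "methods" and "claims" together with their glossary translations, even though none of these word forms is listed in the glossary as such. Suffix stripping of this kind breaks down quickly for irregular forms and for highly inflected languages, which is exactly where a true morphological search is needed.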

Another important part was the built-in QA check of SmartCAT, which is performed at the segment level. Whenever a segment was confirmed, its translation was checked against the glossary for consistency. Such an automated check reduces not only the overall number of terminology-related errors, but also the editing time.
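A segment-level glossary check of this kind can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not SmartCAT's actual implementation: it uses plain case-insensitive substring matching, ignores inflected forms, and the function name `check_segment` and the sample glossary and segments are hypothetical.

```python
# Hypothetical English-to-German glossary and segment pair, for illustration only.
GLOSSARY = {
    "prior art": "Stand der Technik",
    "claim": "Anspruch",
}
SOURCE = "The prior art is cited in claim 1."
TARGET = "Der bisherige Stand wird in Anspruch 1 zitiert."


def check_segment(source: str, target: str, glossary: dict[str, str]) -> list[str]:
    """Return source terms whose glossary translation is missing from the target."""
    source_lower = source.lower()
    target_lower = target.lower()
    missing = []
    for term, translation in glossary.items():
        # Plain substring matching: inflected forms are NOT handled here.
        if term.lower() in source_lower and translation.lower() not in target_lower:
            missing.append(term)
    return missing


# Run the check when the segment is confirmed and report any problems.
for term in check_segment(SOURCE, TARGET, GLOSSARY):
    print(f"Glossary term not translated as expected: {term!r}")
```

In this example the check flags "prior art", because the target does not contain "Stand der Technik", while "claim" passes. Running such a check at the moment of confirmation means the translator sees the problem while the segment is still in focus, rather than leaving it for the editor to find.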

To be more precise, we measured the exact time a person spent on translation or editing at the segment level. SmartCAT also has a metrics functionality. With this we tracked changes made during different steps of the workflow, and thus could see which mistakes the translators had made.

All these features allowed us to perform a very detailed experiment.

Methodology

We arranged a two-step workflow consisting of translation and editing, which is quite common in the industry. Editing is essential to check the quality of the final texts and, in particular, to ensure the proper use of terminology.

We used three different methods for translating a document from scratch:

  • No initial translation memory, machine translation or glossary. We did not use an initial TM in order to reduce possible inconsistencies introduced by other translators. No MT or glossary means that the translator has to look for all translations manually, either on the Internet or in any other available sources.
  • No initial TM or MT. A glossary is provided in the form of an Excel file. In this case the translator has to switch continuously between the CAT tool and the Excel file. This simulates a regular situation for LSPs, in which the customer provides a glossary with its own structure and format.
  • No initial TM. MT output is automatically delivered into the CAT results for reference. The glossary is assigned to the project within SmartCAT, and terms are fed into the CAT search results.

The documents used for the three translation methods are different, but they cover the same topic, share the same set of terms and are of approximately the same size. The average size of a document is about two pages (500 words). The texts are patent applications, a field that is very sensitive to correct terminology.

The glossary matches the source texts well: on average, each document contains about ten unique glossary terms, which occur 26 times in total (some terms appear in the text more than once).

The same person, with medium knowledge of the topic, translated all three projects (experienced translators are likely to gain very little from the technology). The files were translated one by one without any interruptions. After the translation was completed, an editor performed the final review of the document (not a formal LQA procedure, but regular editing to obtain a final delivery document). The number of changes between these two stages was calculated for all three documents.

Work time was measured at the segment level, and the total time was calculated for both the translation and the editing phase.

Results

Aggregated data for all three documents is presented in the table below:

Parameter \ Document     1 – no glossary   2 – external glossary   3 – glossary in CAT
Translation time (min)   82                75                      64
Editing time (min)       32                26                      19
Terminology errors       9                 4                       0


The table shows that translation time decreases from document to document, which demonstrates the effect of more easily accessible terms. Translators don’t need to spend time searching in multiple resources, or to even look somewhere outside the CAT tool.

Editing time is reduced as well, as the translator makes fewer mistakes (which can also be seen in the corresponding row of the table), and the editor does not need to refer to other sources to look up terms.
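To put the table into relative terms, the savings against the no-glossary baseline can be worked out directly from the reported minutes. The short script below is only an arithmetic illustration; the variable and function names are our own, and the figures come from a single translator and three documents, so they should be read as indicative.

```python
# Figures taken from the results table (minutes).
translation = {"no glossary": 82, "external glossary": 75, "glossary in CAT": 64}
editing = {"no glossary": 32, "external glossary": 26, "glossary in CAT": 19}


def reduction(baseline: float, value: float) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal."""
    return round(100 * (baseline - value) / baseline, 1)


base = "no glossary"
summary = {}
for setup in translation:
    total = translation[setup] + editing[setup]
    summary[setup] = {
        "translation": reduction(translation[base], translation[setup]),
        "editing": reduction(editing[base], editing[setup]),
        "total": reduction(translation[base] + editing[base], total),
    }

for setup, saved in summary.items():
    print(setup, saved)
```

With the glossary integrated into the CAT tool, translation time drops by about 22 percent, editing time by about 41 percent, and the combined time (83 instead of 114 minutes) by about 27 percent. The external Excel glossary saves about 11 percent of the combined time.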

Other factors

There are several factors that we did not address in the experiment and that might have both positive and negative impacts:

  • Translators tend to learn the most common terms over time, so for long-term projects the effect might be lower.
  • Preparing an elaborate glossary takes time, so the overall time reduction for translation might be less. Translating a glossary is also a challenge, as translation really depends on the context. It makes sense to have a comprehensive glossary for long-term projects in cases where translators are not very experienced, or where quality is crucial.
  • For glossaries with a large number of terms it is almost impossible for a translator to learn all the terms, so in this case the effect of an automated terminology workflow plus QA checks would be significantly greater.

Conclusion

Terminology is an important factor affecting translation time, and one that can be easily estimated. It is also critical for the productivity of the translation process, and thus should be tackled before the project starts.

Terminology is often the most time-consuming part of a project, especially a large one. To address this, the first essential step is extracting and translating terminology, i.e. creating a term base. It helps greatly if this term base is supported by the terminology management function of your CAT tool. Terminology serves as the backbone that supports the whole body of translation.

Glossaries can also help to customize corporate MT systems. In addition, they can assist professional translators with future projects.

If integrated in a complete CAT solution, automatic quality checks help to evaluate the results of translation. Usually this is sufficient to ensure correct terminology usage. We discovered that it is extremely helpful if the automated translation system is capable of highlighting correctly used terms. This protects them from over-editing, and communicates the correct terms to the post-editor/translator without taking up extra time.

We believe that such a dynamic approach increases both productivity and consistency. The text for translation is first analyzed and the terminology extracted, saving the context for the translator's reference. Then the terminology is incorporated into the MT customization parameters, and once post-editing is completed, the automated quality check validates the translations of the terms. We suggest that this approach works most efficiently in an integrated, cloud-based application. In this way, translators have permanent access to the system during all stages: while creating the term base, while customizing the MT engine, and during post-editing.