December 2014
By Angelika Zerfass


Angelika Zerfaß runs the consulting company zaac in Bonn and offers consulting, training and technical support for translation technologies.


zerfass[at]zaac.de
www.zaac.de


 


 

Correct data exchange

Data can be exchanged between translation systems and terminology tools with the help of the TMX and TBX formats. However, segments are still transferred incorrectly or not transferred at all in spite of these standards. What could be the cause and how can the results of this data exchange be improved?

There are two typical situations in which you might need to exchange data between translation memory systems: when switching from one translation memory system to another, or when a customer hands over a translation memory or a terminology database to the service provider or vice versa.

Such an exchange should not be a problem; after all, we have TMX, the Translation Memory eXchange format, and TBX, the TermBase eXchange format. The user can export a translation memory or a terminology database in these formats and import it into another translation system. This should be an open-and-shut case.

Issues mostly with terminology

When exchanging TM data with the help of TMX, transporting the segment pairs themselves is not an issue. But a TMX file can contain much more, e.g. metadata for categorizing segment pairs or information about the context of the segments. This is where the programs differ: some information survives the exchange and some doesn’t.

The exchange of terminology data can be very simple or it can develop into a very time-consuming activity, depending on how the terminology databases of the programs are structured. Since terminology entries can be a lot more complex than the segment pairs in a translation memory, it is often necessary to spend more time and effort on the exchange.

Where are the differences?

Let’s assume the following situation: A department decides to switch from translation program A to translation program B. It is known that data can be transported from one program to the other using the TMX format, but there is some uncertainty about whether the data will really find its way to the new program and what other issues might arise.

The experts decide to test the process. One of the larger translation memories is exported to TMX format and then imported into the new program. Then the files that were already translated with program A and this TM are loaded into program B and analyzed. The analysis shows that not all segments would receive a 100% match from the translation memory. How can that be?

A look at the documents in both programs shows that the programs segment the documents differently: program A has a list of abbreviations that is not yet maintained in program B. Furthermore, the segmentation rules in program A had been customized, for instance to prevent segmentation after a number followed by a period. Program B doesn’t yet have this customization.
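The effect of diverging segmentation rules can be illustrated with a small Python sketch. It does not reproduce the segmentation of any particular translation program; the function `segment`, its parameters and the sample text are assumptions made up for this illustration.

```python
import re

def segment(text, abbreviations=(), split_after_number=True):
    """Split text at sentence-final punctuation followed by whitespace.

    abbreviations: words (including the period) after which no break is made.
    split_after_number: if False, no break is made after a number plus period.
    """
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        words = candidate.split()
        last_word = words[-1] if words else ""
        if last_word in abbreviations:
            continue  # e.g. "approx." does not end a segment
        if not split_after_number and re.fullmatch(r"\d+\.", last_word):
            continue  # e.g. "3." does not end a segment
        segments.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

text = "Check item 3. of the list, approx. 5 minutes. Then restart."

# "Program A": abbreviation list maintained, customized number rule
print(segment(text, abbreviations=("approx.",), split_after_number=False))

# "Program B": default rules only
print(segment(text))
```

With the customized rules the text yields two segments, with the default rules four. A translation memory filled under the first rule set therefore contains no exact counterpart for most segments produced under the second.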

Different calculation

The tests make it clear that merely transferring a translation memory through TMX is not enough. It is also necessary to consider that the way match rates are calculated differs from program to program. A segment that appears as an 80% match in program A can be a 75% match or even an 89% match in program B. The analysis statistics of two different programs therefore usually cannot be compared.
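Why the same pair of segments can receive different match rates becomes easier to see with two plausible similarity measures side by side. The following Python sketch uses `difflib` from the standard library; it is only an illustration and not the formula of any actual translation memory system, and the function names and sample sentences are assumptions.

```python
import difflib

def char_match_rate(a, b):
    """Similarity in percent, comparing the segments character by character."""
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()

def word_match_rate(a, b):
    """Similarity in percent, comparing the segments word by word."""
    return 100 * difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

stored = "Press the red button twice."
new = "Press the green button twice."

print(round(char_match_rate(stored, new), 1))
print(round(word_match_rate(stored, new), 1))
```

The word-based measure rates this pair at 80 percent, since four of five words agree, while the character-based measure arrives at roughly 93 percent. Both are defensible, which is why statistics from two programs should not be compared directly.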

The department now takes a closer look at its old translation memory and finds that the memory from program A has user-defined fields (metadata) for categorizing the content. Another test should now show whether this metadata is also present in program B after the TMX exchange and can be reused. A check shows that the segment pairs themselves, together with the general metadata about the creator and the date of creation, can be transferred from one program to another using the TMX format without any problems. The exchange of other metadata, however, depends on the combination of programs between which the exchange is to take place.

The following sections look at the data that a TMX file can contain, where problems may arise during an exchange, and how they can best be solved.

Matching language and language variants

The information about the language of the segment is provided for every segment in a TMX file. This is often in the form of language variants such as de-DE or en-US.

 

Language variants in a TMX file

<TU>
        <TUV xml:lang="de-DE">
            <seg>Dies ist ein Satz.</seg>
        </TUV>
        <TUV xml:lang="en-US">
            <seg>This is a sentence.</seg>
        </TUV>
</TU>

Figure 1
Explanation: TU = translation unit; TUV = translation unit variant; xml:lang = ID of the language of the segment


Issue: Some programs permit a general language such as German or English as the language identifier; the TMX file then shows DE or EN as the language ID. Other programs expect a language variant such as de-DE or en-US here. The capitalization of the language ID usually doesn’t influence the exchange, i.e. it makes no difference whether a variant is written as DE-DE or de-DE.

Solution: If language variants are needed for an import, the language ID in the translation memory needs to be adapted through search and replace.
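For larger files, the search and replace can be scripted. The following Python sketch rewrites general language IDs into language variants with the standard library's `xml.etree.ElementTree`. It assumes lowercase element names (`tu`, `tuv`), as used in real TMX files rather than in the simplified figure above; the mapping in `LANG_MAP` and the function name are hypothetical and would have to match what the receiving program expects.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from general language IDs to the expected variants.
LANG_MAP = {"de": "de-DE", "en": "en-US"}

def add_language_variants(tmx_text, lang_map=LANG_MAP):
    """Return the TMX text with general language IDs replaced by variants."""
    root = ET.fromstring(tmx_text)
    for tuv in root.iter("tuv"):
        lang = tuv.get(XML_LANG)
        if lang is None:
            continue
        variant = lang_map.get(lang.lower())  # "DE" and "de" are treated alike
        if variant:
            tuv.set(XML_LANG, variant)
    return ET.tostring(root, encoding="unicode")

sample = (
    '<tmx version="1.4"><body><tu>'
    '<tuv xml:lang="DE"><seg>Dies ist ein Satz.</seg></tuv>'
    '<tuv xml:lang="EN"><seg>This is a sentence.</seg></tuv>'
    '</tu></body></tmx>'
)
print(add_language_variants(sample))
```

Language IDs that are not in the mapping, including IDs that already are variants, are left untouched. It is advisable to run such a script on a copy of the export.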

Text and formatting

Table 1 presents a simplified view of formatting information in a TMX file. Note that a TMX file stores not only formatting as tags but also other elements, such as tabs, placeholders for index entries or even images that are anchored in the sentence.

Both representations of the formatting information in the TMX file are valid. In one instance, the actual formatting is named explicitly (here “bold”), while the formatting information is numbered in the other instance.

Issue: Here too, it becomes obvious that information can get lost. Although it is clear in both programs that there is a tag in the sentence, the actual formatting information is not available in program B. This means that there will be matches from the TM, but the tags might not show the correct content.

Solution: The matches from the TM can be used during the translation, but the tags may have to be replaced manually with the corresponding tags from the source text.

 

Simplified presentation of formatting information

Program A

<seg>This is a <Formatting Bold>sentence</Formatting Bold>.</seg>

Program B

<seg>This is a <Formatting 1>sentence</Formatting 1>.</seg>

Table 1
Source: Angelika Zerfass

 

Identifying the source of the segments

The information about whether a segment originates from an alignment allows the translation memory system to apply a penalty to the match rate when showing the match during translation. This information is provided differently in different systems, so that it usually cannot be reused after an exchange through TMX. Here are a few examples of how information on alignment is attached to the segments.

Examples of alignment information

<TU>
<prop type="aligned">yes</prop>
<TUV xml:lang="de-DE">
<seg>This is a sentence.</seg>
</TUV>
</TU>

<TU>
<prop type="x-Origin">Alignment</prop>
<TUV xml:lang="de-DE">
<seg>This is a sentence.</seg>
</TUV>
</TU>

<TU>
<CrU>ALIGN!
<TUV xml:lang="de-DE">
<seg>This is a sentence.</seg>
</TUV>
</TU>

Table 2
Source: Angelika Zerfaß


Issue: The information that a segment originates from an alignment is either not exchanged between the programs at all, or the receiving program cannot interpret the way the other program records it.

Solution: The TMX file can be edited using search/replace, so that the information on alignment matches the program into which it is to be imported.
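Such a replacement can also be done with a short script instead of a text editor. The Python sketch below converts the first notation from Table 2 into the second. It assumes lowercase element names as in real TMX files and uses the two `<prop>` notations from the table purely as examples; which notation a given program actually reads has to be checked with a test export. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def convert_alignment_marker(tmx_text):
    """Rewrite <prop type="aligned">yes</prop> as <prop type="x-Origin">Alignment</prop>.

    Returns the modified TMX text and the number of converted properties.
    """
    root = ET.fromstring(tmx_text)
    converted = 0
    for prop in root.iter("prop"):
        is_aligned = (prop.text or "").strip().lower() == "yes"
        if prop.get("type") == "aligned" and is_aligned:
            prop.set("type", "x-Origin")
            prop.text = "Alignment"
            converted += 1
    return ET.tostring(root, encoding="unicode"), converted

sample = (
    '<tmx version="1.4"><body><tu>'
    '<prop type="aligned">yes</prop>'
    '<tuv xml:lang="de-DE"><seg>Dies ist ein Satz.</seg></tuv>'
    '</tu></body></tmx>'
)
new_tmx, count = convert_alignment_marker(sample)
print(count)
print(new_tmx)
```

Returning the number of converted properties makes it easy to compare the result with the number of aligned segments reported by the exporting program.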

Including metadata

With metadata, we differentiate between system data, such as “Name of the creator” or “Date saved”, and user-defined metadata. System data is usually transferred from one program to another through the TMX file without any problem. For user-defined data, the result depends on the programs between which the TMX file is transferred. Some programs can read the metadata of other programs even though it is not written the way they would write it themselves, while other programs ignore metadata that doesn’t correspond to their own structure.

Table 3 lists some examples for what user-defined metadata can look like in TMX files from different programs.

 

Examples of user-defined metadata

<prop type="Att::Field">XYZ</prop>

<prop type="x-Field:MultiplePicklist">XYZ</prop>

<prop type="x-Field">XYZ</prop>

Table 3
Source: Angelika Zerfass


All metadata is stored in <prop> (property) elements in the TMX file. In each of the examples, a field named Field has been defined in the TM and filled with a value from a list (XYZ).

Issue: Every program has its own way of describing user-defined fields. Unfortunately, the specification of TMX permits this.

Solution: A test export and import is recommended. The receiving program may be able to take over the metadata. If not, consider adapting the metadata in the TMX file through search/replace, provided this is possible without too great an effort.

Unfortunately, there is no way to exchange a list of metadata field settings between one translation memory system and the next.
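A test export and import is easier to evaluate when the metadata in the two TMX files is counted rather than inspected by eye. The following Python sketch lists which `<prop>` types occur in a file and which ones are missing after a round trip. It assumes lowercase element names as in real TMX files; the function names and the sample data are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def inventory_prop_types(tmx_text):
    """Count how often each <prop> type occurs in a TMX document."""
    root = ET.fromstring(tmx_text)
    return Counter(prop.get("type", "") for prop in root.iter("prop"))

def lost_prop_types(exported_tmx, reexported_tmx):
    """Return the prop types (with counts) that did not survive the round trip."""
    before = inventory_prop_types(exported_tmx)
    after = inventory_prop_types(reexported_tmx)
    return dict(before - after)  # Counter subtraction keeps positive counts only

from_program_a = (
    '<tmx version="1.4"><body>'
    '<tu><prop type="x-Field">XYZ</prop><prop type="x-Origin">Alignment</prop></tu>'
    '<tu><prop type="x-Field">ABC</prop></tu>'
    '</body></tmx>'
)
from_program_b = (
    '<tmx version="1.4"><body>'
    '<tu><prop type="x-Origin">Alignment</prop></tu>'
    '<tu></tu>'
    '</body></tmx>'
)
print(lost_prop_types(from_program_a, from_program_b))
```

In this made-up example, both occurrences of the user-defined field are reported as lost, while the alignment information survives.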

Segments with context

Today's TM systems often save information about the context of a segment in addition to the actual segment pair. But each system has its own way of doing so. One program may explicitly save the sentence before and after the segment, another might save a code that is composed of different elements. The advantage of this context information is that the translation program can show the translator that a match is not just a 100% match (the segment in the document is identical to the segment in the TM), but that the context around the segment matches as well. There is therefore a greater probability that the translation from the translation memory really fits the current context.

Below is a very simplified presentation of how context information can appear in a translation system.

Context information in a translation system

<TU>
<TUV…
<prop type="x-context-pre">Sentence before.</prop>
<prop type="x-context-post">Sentence after.</prop>
<seg>This is a sentence.</seg>

Explicit context

<prop type="x-Context">-4137887936528476506, -4137887936528476506</prop>

Context as hash code

Table 4
Source: Angelika Zerfass

Issue: Without usable context information, the receiving program can offer at most 100% matches. Matches with identical context become possible again only once new segments have been saved to the translation memory with the new program.

Solution: none

Finally, our department also finds that there are some languages where the content of the translation is already quite old and doesn’t correspond to the most recent style and terminology guidelines any longer. A decision is taken to transfer this memory to program B, but to use it only as a background memory there. For this purpose many systems offer the option to apply a penalty for the content of a whole translation memory.

Between customer and service provider

Now, after the department has successfully switched from one program to another internally, the next step is to exchange the data with the service providers. How can an external service provider use the TM of the customer?

It is worth discussing with the service provider which system is used or can be used there, as many service providers use multiple translation systems. If the customer and service provider use the same translation program, then there are various options to transfer the translation memory or even to exchange it:

  • The service provider receives access to the translation memory on the server of the customer. An exchange doesn’t take place here; the service provider works directly with the customer’s memory. In such a situation it is advisable to use a working memory and a master memory, so that the master TM is updated only with the final and reviewed content.
  • The customer sends packages for the translation from his own system and includes a copy or an extract of the TM in the packages. The memory is updated by the customer himself after the bilingual files are sent back.
  • The customer sends the TM file or even the folder in which the TM files were saved by the system, so that the service provider only needs to integrate them.
  • The customer exports a TMX file and receives TMX back from the service provider. Since both work on the same system, (meta)data is not lost.

If the customer and service provider use different programs, then the options available for exchange depend on the combination of programs.

If the customer works with SDL Trados Studio or STAR Transit, for instance, the translation packages, which also contain a TM or reference material, can be edited directly in memoQ. The memory is imported from the packages in the background. To return a TM, an export to the TMX format would be necessary, since the translation memory is generally not included in the returned packages. Unfortunately, this does not work the other way round, since packages from memoQ cannot be read in SDL Trados Studio or STAR Transit.

Reduced match rates

For any other combination of programs, export to TMX is the only option for exchange. In this case, match rates can be expected to be lower than they would have been with the same program, because the programs segment differently.

Matches with context are not available, and user-defined metadata might not be reusable. This can be a disadvantage when a translation memory contains translations from different fields, for example marketing, technical texts, manuals or contracts, and a penalty on segments from specific fields could otherwise be set based on the metadata. A penalty on segments from the area of marketing could be helpful while translating contracts, for instance.
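What is lost when the metadata does not survive can be seen from how a penalty might be applied. The Python sketch below combines a penalty for a whole background memory with a penalty based on a user-defined field. The function name, the field name "Domain" and all penalty values are assumptions for illustration; actual programs configure penalties in their own ways.

```python
def penalized_match_rate(match_rate, fields, field_penalties, tm_penalty=0):
    """Subtract a TM-wide penalty and field-based penalties from a match rate.

    fields: user-defined metadata of the stored segment, e.g. {"Domain": "Marketing"}
    field_penalties: penalty points per (field name, value) pair
    tm_penalty: penalty points applied to every segment of this memory
    """
    penalty = tm_penalty
    for name, value in fields.items():
        penalty += field_penalties.get((name, value), 0)
    return max(0, match_rate - penalty)

# While translating contracts, segments from marketing are rated lower.
penalties = {("Domain", "Marketing"): 5}

print(penalized_match_rate(100, {"Domain": "Marketing"}, penalties))
print(penalized_match_rate(100, {"Domain": "Contracts"}, penalties))
# A segment whose metadata was lost in the exchange cannot be penalized by field.
print(penalized_match_rate(100, {}, penalties))
```

The marketing segment drops to 95, whereas the segment without metadata stays at 100 because nothing identifies it as marketing content any longer.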

Exchanging terminology

After the exchange of translation memory data has been analyzed in detail, the same question presents itself for the transfer of the terminology database. There is an exchange format here too: TBX – the term base exchange format.

Use of TBX

Our department now finds that exchanging the terminology data is very easy, provided that both programs work with TBX and program B can simply create a new terminology database from the TBX file of program A. The matter becomes complicated when the structures of the terminology databases are very different and program B either doesn’t allow a TBX import or cannot recreate the structures of the terminology database of program A.

Issue: When exchanging TBX files, it may happen that the unique entry numbers that the terminology database assigned in program A are not transferred and program B renumbers the entries.

Solution: If the numbers are to be maintained, the export of the data in TBX format may have to be adapted, if possible.
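Whether the entry numbers survived can be checked by comparing the TBX export of program A with a TBX export of the newly created database in program B. The Python sketch below assumes that entries are `termEntry` elements carrying an `id` attribute, as in many TBX files; other TBX dialects may name the element differently or keep the number elsewhere, so the element and attribute names are assumptions to verify against the actual export.

```python
import xml.etree.ElementTree as ET

def entry_ids(tbx_text, entry_tag="termEntry"):
    """Collect the id attributes of all entry elements in a TBX document."""
    root = ET.fromstring(tbx_text)
    return {entry.get("id") for entry in root.iter(entry_tag) if entry.get("id")}

def lost_entry_ids(exported_tbx, reexported_tbx):
    """Return the entry IDs of program A that no longer exist after the import."""
    return entry_ids(exported_tbx) - entry_ids(reexported_tbx)

from_program_a = (
    '<martif type="TBX"><text><body>'
    '<termEntry id="1001"><langSet xml:lang="en"><tig><term>screw</term></tig></langSet></termEntry>'
    '<termEntry id="1002"><langSet xml:lang="en"><tig><term>nut</term></tig></langSet></termEntry>'
    '</body></text></martif>'
)
from_program_b = (
    '<martif type="TBX"><text><body>'
    '<termEntry id="1"><langSet xml:lang="en"><tig><term>screw</term></tig></langSet></termEntry>'
    '<termEntry id="2"><langSet xml:lang="en"><tig><term>nut</term></tig></langSet></termEntry>'
    '</body></text></martif>'
)
print(sorted(lost_entry_ids(from_program_a, from_program_b)))
```

In this made-up example both original numbers are reported as lost, which indicates that program B renumbered the entries.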

Use of tables

Our department was lucky: its data can be transferred using the TBX format. The question of how to transfer the data does come up, however, with some of the service providers, where an exchange via TBX is not possible. In such cases there is usually no alternative to providing the data in a table format. For very complex entries, compromises may be necessary so that only the most important information is transferred.

Issue: It needs to be checked which table formats program A can export its terminology data to. This can be a multilingual list, for instance, a list with selected additional fields, or the entire database structure. Since terminology data can be saved to different kinds of tables, it is also necessary to check what type of table the receiving program can import. An export to Excel, for instance, can be column-oriented (an entry consists of only one row with several columns) or row-oriented (an entry consists of several rows).

Solution: Only detailed testing of what can be exported and imported will show what processes are necessary.
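The difference between the two table layouts can be made concrete with a small Python sketch that writes the same two entries in both ways. The data structure, the function names and the use of "|" to separate synonyms within one cell are assumptions for illustration; the layouts that a given program exports or imports may differ in detail.

```python
import csv
import io

# Two made-up terminology entries; the English side of entry 1 has a synonym.
entries = [
    {"id": 1, "terms": {"de": ["Schraube"], "en": ["screw", "bolt"]}},
    {"id": 2, "terms": {"de": ["Mutter"], "en": ["nut"]}},
]

def column_oriented(entries, languages, synonym_separator="|"):
    """One row per entry, one column per language."""
    rows = [["id"] + list(languages)]
    for entry in entries:
        cells = [synonym_separator.join(entry["terms"].get(lang, []))
                 for lang in languages]
        rows.append([str(entry["id"])] + cells)
    return rows

def row_oriented(entries):
    """One row per term, so an entry is spread over several rows."""
    rows = [["id", "language", "term"]]
    for entry in entries:
        for lang, terms in entry["terms"].items():
            for term in terms:
                rows.append([str(entry["id"]), lang, term])
    return rows

def to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

print(to_csv(column_oriented(entries, ["de", "en"])))
print(to_csv(row_oriented(entries)))
```

The column-oriented table has to squeeze synonyms into one cell or into additional columns, whereas the row-oriented table repeats the entry number on every row. A receiving program that expects one layout will usually not import the other without conversion.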

Exchanging specific settings

Our department has worked extensively with the functions of the terminology database. For example, it used the option of marking specific synonyms as “prohibited”. This marking can later be used for the terminology check while translating.

This kind of setting is realized in different ways in the different terminology databases. One system uses a separate field to be defined by the user. Another system uses a checkbox with standard content to mark prohibited terms.

Issue: The varied handling of marking prohibited content can make this content difficult or even impossible to exchange.

Solution: If an automated solution is not offered, the information that a term is prohibited might have to be entered manually later in the receiving system.

Different ways of matching

Translation systems calculate the similarity between the terms in the text and the terms in the terminology database in different ways during translation. Some systems use a configurable match rate. Others use a separate list of possible endings, and still others use placeholder symbols to split the root of a term from its ending.

Issue: The way terms in the text are compared with the terminology database can differ considerably between systems.

Solution: Usually it is necessary to adapt the terminology database to the receiving system.
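The three approaches mentioned above can be sketched in Python to show why the same term base behaves differently in different systems. The function names, the threshold of 0.8, the list of endings and the "|" placeholder are assumptions for illustration and do not describe the implementation of any particular program.

```python
import difflib

def match_by_rate(word, term, threshold=0.8):
    """Approach 1: the word matches if its similarity to the term reaches a threshold."""
    ratio = difflib.SequenceMatcher(None, word.lower(), term.lower()).ratio()
    return ratio >= threshold

def match_by_endings(word, term, endings=("", "s", "es", "ed", "ing")):
    """Approach 2: the word matches if it equals the term plus a listed ending."""
    return word.lower() in {term.lower() + ending for ending in endings}

def match_by_placeholder(word, marked_term, marker="|"):
    """Approach 3: the term is stored as root|ending; any word starting with the root matches."""
    root = marked_term.split(marker, 1)[0].lower()
    return word.lower().startswith(root)

for word in ("translations", "translated"):
    print(
        word,
        match_by_rate(word, "translation"),
        match_by_endings(word, "translation"),
        match_by_placeholder(word, "translat|ion"),
    )
```

In this sketch "translations" is found by all three approaches, while "translated" is found only by the placeholder approach. A term base prepared for one approach, for example with placeholder symbols inside the terms, therefore needs to be adapted before it works under another.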

Editing might be necessary

Exchanging data from a translation memory between translation programs through TMX works well for the actual segment pairs and for system data such as creator or date of creation. Everything else depends on the combination of programs. It might be worth editing the TMX file with search/replace to keep more information, e.g. the user-defined metadata fields. In addition to the TM data exchanged via TMX, other things could be reused as well. These could include abbreviation lists, metadata in the TM or an adaptation of the segmentation rules.

Technical writers may have to accept lower match rates for some elements, due to different segmentation or calculation of matches. It is also possible that the number of tags is different depending on the program in which a file is opened. There are differences here too, in what a translation program shows the translator and what it doesn’t. This naturally influences the match rates from the translation memory as well.

The exchange of terminology can develop into a truly time-consuming affair if it has to take place through table formats. Often, additional editing of the terminology database in the receiving system or in the worst case even a full reconstruction of the term base is required.

In any case, it is advisable to test the transfer of data from one translation memory system or terminology database to another extensively.