June 2010
By Ethan Shen

Ethan Shen is in charge of product development at Gabble On, an e-learning company based in New York. He is working on a language learning tool for high school and college students that will layer vocabulary building, flashcards, quizzes, and progress report features.


Comparison of online machine translation tools

Which is the best online translation tool for language learners and general users of freeware? A recent research paper evaluates the quality of three popular online translation tools: Google Translate, Bing (Microsoft) Translator, and Yahoo Babelfish. Among other findings, the data reveals that while Google Translate is widely preferred for translating long passages, Microsoft Bing Translator and Yahoo Babelfish often produce better translations for phrases shorter than 140 characters. In general, Babelfish also performs well in East Asian languages such as Chinese and Korean, and Bing Translator performs well in Spanish, German, and Italian.

The research project was created to answer a simple question I’ve had since using Babelfish.com to help with my Spanish homework many years ago: Which translation engine works best?

Since the research was intended for the general user of such online tools – rather than for professional translators – survey takers were drawn from the general multilingual internet population and quality measures were based on direct comparison, rather than subjective rating scales.

Instead of providing a pre-determined set of phrases, our survey tool was built to dynamically produce translation results to any text provided by the survey taker. The experiment strove to replicate the experience of the average student or casual reader using online translation tools.

Project hypothesis

Our hypothesis was that the relative performance of various translation engines will change depending on the language to be translated and the character length of the requested translation. For example, Engine X may be consistently more effective than Engine Y for English-Spanish translations under 50 characters, but the opposite will be true for translations over 150 characters.

Accordingly, we proposed to test the following hypotheses:

  • Hypothesis 1 – No single translation engine will be consistently most effective for all pairs of languages and text conditions.
  • Hypothesis 2 – Statistical translation engines (Google Translate) will generally be more effective for language pairs for which extremely large bodies of parallel text are available. These languages primarily include the official working languages of the European Union.
  • Hypothesis 3 – Translation engines with language-specific rules-based elements (Babelfish and Bing Translator) will be more effective for non-official UN languages and for languages whose grammar structures significantly diverge from Latin and Germanic languages.
  • Hypothesis 4 – Statistical translation engines will be more effective in longer formal sentences, while rules-based engines will be more effective in short phrase-based sentences.
  • Hypothesis 5 – Rules-based engines will be more effective for questions, because fewer parallel texts containing questions are available, and the statistical method depends on such texts for its analysis.

Experimental design

To test these hypotheses, we directly compared the quality of outputs from Google Translate (a statistical translation engine), Yahoo Babelfish (a traditional rule-based translation engine), and Microsoft Bing Translator (a hybrid statistical engine with language-specific rules).

We invited volunteers to enter text of their choice into our survey form, which routed user requests to each of the three translation engines via their server-side Application Programming Interfaces (APIs). These API connections allowed us to return the results of all three engines and allowed the user to vote on which engine produced the best and worst results. After voting, the user had to designate their fluency level in each language. This allowed us to take each voter's self-reported fluency into account in our final analysis.
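The request flow described above can be sketched roughly as follows. This is a minimal illustration, not the survey tool's actual code: the engine clients are hypothetical callables standing in for the vendors' server-side APIs, and the function names `collect_translations` and `record_vote` are assumptions made for this example.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature for an engine client: (text, source_lang, target_lang) -> translation.
# The real survey called each vendor's server-side API, which is not reproduced here.
Engine = Callable[[str, str, str], str]


def collect_translations(text: str, source_lang: str, target_lang: str,
                         engines: Dict[str, Engine]) -> Dict[str, Optional[str]]:
    """Send the same request to every engine and return {engine_name: output}."""
    results: Dict[str, Optional[str]] = {}
    for name, translate in engines.items():
        try:
            results[name] = translate(text, source_lang, target_lang)
        except Exception:
            # One failing engine should not block the other results.
            results[name] = None
    return results


def record_vote(results: Dict[str, Optional[str]], best: str, worst: str,
                fluency_source: str, fluency_target: str) -> dict:
    """Validate a best/worst vote and bundle it with the self-reported fluency."""
    if best not in results or worst not in results:
        raise ValueError("vote must name one of the engines that returned a result")
    if best == worst:
        raise ValueError("best and worst must be different engines")
    return {"best": best, "worst": worst,
            "fluency_source": fluency_source, "fluency_target": fluency_target}
```

Keeping the engines behind a single callable signature means the survey form does not need to know which vendor produced which result.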

In order to collect a quantity of data that would render our research statistically significant, we ran promotions throughout a 6-week period to drive professional translators and interpreters as well as non-professional multilingual users to our site. 

At the end of our data collection, we analyzed the distribution of the “best” and “worst” votes according to the following parameters:

  1. The input and output languages
  2. The length of the text given in characters
  3. Single sentence or phrase vs. multiple sentence paragraphs
  4. Presence or absence of a question mark  

All users were asked to self-rate their fluency in both languages. Whenever a user designated their fluency in the destination language as “limited”, that vote was discarded from the analysis. “Functional” and “Fluent” voters were treated equally for the sake of simplicity.
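The analysis parameters and the fluency filter can be illustrated with a short sketch. The vote record layout (keys such as `fluency_target`) and the sentence-splitting heuristic are assumptions made for this example; the study does not describe how it detected multiple sentences.

```python
import re
from collections import Counter


def text_features(text: str) -> dict:
    """Derive the text-level parameters used in the analysis from the submitted text."""
    # Crude heuristic (an assumption): split on runs of ., ! or ? to count sentences.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "length": len(text),                   # length in characters
        "multi_sentence": len(sentences) > 1,  # single phrase vs. multiple sentences
        "is_question": "?" in text,            # presence of a question mark
    }


def tally_votes(votes):
    """Count best and worst votes per (source, target, engine).

    Votes from users with "limited" fluency in the destination language are
    discarded; "functional" and "fluent" voters count equally.
    """
    best, worst = Counter(), Counter()
    for vote in votes:
        if vote["fluency_target"] == "limited":
            continue
        best[(vote["source"], vote["target"], vote["best"])] += 1
        worst[(vote["source"], vote["target"], vote["worst"])] += 1
    return best, worst
```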


Figure 1. – Most preferred engine and margin of preference compared to second-best engine.

The table above (Figure 1) describes the relationship between user preferences and translated text character length for 15 single-direction language pairings. The most preferred engine is given at each intersection (Google, Babelfish, or Bing) along with the magnitude of its lead over its closest competitor in that category (colored percentage). The language pairings excluded from this table are those for which preferences were overwhelming (a lead of over 100%) or for which insufficient data was available.
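The article does not spell out how the margin of preference was computed. One plausible reading, consistent with leads that can exceed 100%, is a relative lead of the top engine over the runner-up. The sketch below assumes that definition; the function name and the vote counts in the usage note are invented for illustration.

```python
def preference_margin(best_votes: dict):
    """Return (top_engine, relative_lead_over_runner_up).

    Assumed definition: (top - second) / second, so a value of 1.0
    means the top engine received twice as many "best" votes.
    """
    if len(best_votes) < 2:
        raise ValueError("need at least two engines to compute a margin")
    ranked = sorted(best_votes.items(), key=lambda item: item[1], reverse=True)
    (top_engine, top_count), (_, second_count) = ranked[0], ranked[1]
    if second_count == 0:
        return top_engine, float("inf")
    return top_engine, (top_count - second_count) / second_count
```

With made-up counts of 60, 40, and 20 "best" votes for Google, Bing, and Babelfish, this would report Google as the preferred engine with a 50% lead.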

From this data, the following conclusions can be drawn:

    1. For long passages of text up to 2000 characters, survey takers generally prefer Google Translate's results across the board.
      a.    The extent of Google’s lead varies dramatically from language to language. In some languages such as French, the strength of Google Translate’s engine is overwhelming. However, in several others like German, Italian, and Portuguese, Google holds only a very slim lead when compared to its closest competitor.
      b.    These observations validate our Hypothesis 1 that no single engine can perform equally well across a spectrum of languages or conditions.
    2. The greatest relative strength of the statistical translation engine (Google Translate) did not cluster around the European Union working languages as expected. German, Italian, and Portuguese, all EU working languages, are the most hotly contested from a performance perspective.
      a.    One possible explanation for Google’s strength in French is that large additional bodies of parallel English-French text are available from the government of Canada, whose official documents are published in both languages. To a lesser extent, this could explain the strength of Google Translate in Spanish, as many Latin American countries offer English translations of official documents.
      b.    This data partially refutes Hypothesis 2.
    3. Traditional rule-based translation engines (Babelfish) performed generally well in East Asian languages such as Chinese and Korean.
      a.    One possible reason for this performance could be that language-specific grammar and word usage rules are more effective than association-based statistical translation in these situations.
      b.    These findings are in line with Hypothesis 3, but the data set is not large enough to confirm the hypothesis in a statistically significant manner.
    4. Across almost every language Bing Translator and Yahoo Babelfish gain ground or surpass Google Translate as the text length gets shorter.
      a.    In Chinese, the gradual erosion of Google’s relative performance as total text length shrinks from 2000 characters to 50 characters is stark. In other words, as phrases get shorter and more straightforward, rule-based or hybrid translation engines perform better.
      b.    Though data is not shown, a similar effect is seen for passages that are only one sentence compared to passages with multiple sentences.
      c.    This data strongly validates Hypothesis 4.
    5. The most interesting observation is that translation quality is not a two-way street. The engine that is best for translating in one direction is not necessarily the best tool for translating back the other way.
      a.    The two most obvious cases of this are French and German. Though Google Translate dominates when translating both of these languages to English, it faces heavy competition when translating from English to the foreign language.

    Midway through the data collection we hid the brands and randomized the positions of the results, so the source was unknown before voting. Though we expected some effect, the extent of the brand bias outcome was startling.

    1. Across the general populace, users selected Google over Bing Translator 21% more often when they knew the brands than when the brands were hidden. The effect is even more pronounced among voters who self-designated their language fluency as “limited”. When given a choice between two results, these users were almost 30% more likely to choose the result given by Google if the source was known. Brand bias is the only apparent explanation for this increased preference for one tool.
    2. The brand bias in favor of Google over Yahoo Babelfish is even starker, at 136%. This could reflect both Google’s strong brand and the marketing neglect that Babelfish has suffered since its heyday in the late 1990s.
    3. When this bias is taken into account in viewing the results in Figure 1, many more language pairings would be hotly contested or would favor Bing Translator or Babelfish. Due to the size of the data set, we have chosen not to separate the data for further analysis. However, in future experiments we will attempt to test this effect on a language-by-language basis.
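The blinding step and the brand bias figures can be illustrated with a short sketch. Both parts rest on assumptions: the article does not describe how results were anonymized or exactly how the percentages were derived. Here the bias is taken to be the relative change in the ratio of Google picks to rival picks between the branded and blind phases, and all names and counts are hypothetical.

```python
import random
import string


def blind_results(results: dict, rng=None):
    """Shuffle engine outputs and replace brand names with neutral labels.

    Returns (shown, key): `shown` maps labels such as "A" to translations,
    and `key` maps each label back to the engine name for later scoring.
    """
    rng = rng or random.Random()
    names = list(results)
    rng.shuffle(names)
    key = dict(zip(string.ascii_uppercase, names))
    shown = {label: results[name] for label, name in key.items()}
    return shown, key


def brand_uplift(branded: dict, blind: dict, engine: str, rival: str) -> float:
    """Relative change in the engine-to-rival preference ratio when brands are visible.

    Assumed definition: (branded ratio / blind ratio) - 1, so 0.21 means the
    engine was chosen over the rival 21% more often when brands were shown.
    """
    branded_ratio = branded[engine] / branded[rival]
    blind_ratio = blind[engine] / blind[rival]
    return branded_ratio / blind_ratio - 1.0
```

For example, with invented counts of 121 Google picks to 100 Bing picks in the branded phase and 100 to 100 in the blind phase, the uplift would come out at roughly 0.21.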

        Practical application

        With this data, regular users of free online translation tools can choose the most effective tool for each situation. For example, a student studying French can confidently trust translations from Google. A student studying Chinese would be better served by Babelfish or Bing when translating short sentences for homework.

        Another application would be to build a universal composite engine that automatically routes translation requests based on the parameters of the given text. This could ensure that the best possible result is given every time by understanding how each engine performs under these different variables.
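Such a composite engine could start as a simple lookup keyed on language pair and text length. The sketch below is only an illustration of the idea: the table entries, the language codes, and the fallback choice are placeholders rather than results from the study, and the 140-character cut-off simply borrows the short-phrase threshold mentioned earlier in this article.

```python
SHORT_TEXT_LIMIT = 140  # characters; assumed boundary between "short" and "long" text

# Illustrative entries only. A real table would be filled in from measured
# preferences for every language pair and length bucket.
ROUTING_TABLE = {
    ("en", "zh", "short"): "Babelfish",
    ("en", "fr", "short"): "Google",
    ("en", "fr", "long"): "Google",
}


def choose_engine(text: str, source: str, target: str,
                  table=ROUTING_TABLE, default: str = "Google") -> str:
    """Pick an engine name from the routing table, falling back to a default."""
    bucket = "short" if len(text) < SHORT_TEXT_LIMIT else "long"
    return table.get((source, target, bucket), default)
```

Additional parameters from the study, such as the presence of a question mark or the number of sentences, could be added to the lookup key in the same way.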

        Overview of online translation tools evaluated

        Babelfish – rule-based engine

        • Formerly hosted by Altavista, and more recently bought by Yahoo, Babelfish is largely based on an older version of Systran.
        • Systran, founded in 1968, is one of the oldest machine translation companies.
        • Systran is primarily a rule-based translation engine that has been developed to very high precision over the last 40 years.
        • In more recent years, Systran has blended its rule-based translation engine with a statistical translation engine to improve flexibility. However, these changes are not reflected in Babelfish.
        • Yahoo Babelfish
        • More details about SYSTRAN’s hybrid engine


        Google Translate – statistical translation

        • Google Translate used Systran’s engine until 2007 when it developed its own proprietary statistical translation engine.
        • Statistical translation uses phrase correlation between known pairs of pre-translated parallel texts.
        • This leverages Google’s powerful search engines and massive computing power.
        • Google recently uploaded approximately 200 billion words of parallel-translated documents from United Nations archives to train their system. This has resulted in a significant improvement in translation accuracy.
        • The drawback of the statistical approach is that it does not apply explicit grammatical rules, since its algorithms are based on statistical analysis rather than hard-coded rule-based analysis.
        • The main benefit of the statistical approach is that it avoids the manual development of linguistic rules that rule-based systems require, which is costly and does not carry over to other languages.
        • Statistical systems are not tailored to any specific pair of languages; they simply need big bodies of parallel text to train from. This is the reason why Google has 60 languages and Babelfish has only 14, even though Babelfish has been in operation significantly longer.
        • Google Translate: http://translate.google.com
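The idea of phrase correlation can be shown with a toy example. This is a deliberately simplified sketch and not a description of Google's system: it assumes the parallel text has already been split into aligned phrase pairs, whereas real statistical engines have to learn those alignments and combine them with further models.

```python
from collections import Counter, defaultdict


def build_phrase_table(aligned_pairs):
    """Count how often each source phrase is paired with each target phrase."""
    table = defaultdict(Counter)
    for source_phrase, target_phrase in aligned_pairs:
        table[source_phrase][target_phrase] += 1
    return table


def translate_phrase(table, source_phrase):
    """Return the most frequently observed translation, or None if the phrase is unseen."""
    if source_phrase not in table:
        return None
    return table[source_phrase].most_common(1)[0][0]
```

The `None` result for unseen phrases shows why such systems depend on very large bodies of parallel text: a phrase that never appears in the training data cannot be matched.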


        Microsoft Bing Translator – Statistical translation engine with language-specific rules

        • Bing Translator is a statistical machine translation engine that also relies on a language-specific rule-based component to disassemble and reassemble sentences from one language to another. Microsoft refers to this system as “Linguistically informed statistical machine translation”.
        • This hybrid system incorporates language-specific dependency trees to augment the effectiveness of phrase-based statistical machine translation models.
        • Because statistical translation engines rely on matching phrases in the translation request with existing phrases in their database, these systems begin to fail when varied grammar structures rearrange word order so that phrases take on new meanings.
        • By employing language-specific parsing, dependency, and word alignment rules, Bing’s engine is able to generalize word order in phrases to make them easier for statistical translation engines to process, and then realign the output to match the grammatical intent of the original phrase.
        • Microsoft Bing Translator
        • More details about Microsoft’s Linguistically Informed Statistical Machine Translation Model in this paper published in the Association for Computational Linguistics
