October 2014
By Martin Volk, Anne Göhring and Rihards Kalniņš

Image: © stocksnapper/ 123rf.com

Anne Göhring is a research and teaching assistant at the University of Zurich. She grew up in Geneva and studied at ETH Zurich and the University of Zurich, where she graduated in Spanish language and literature and in Computational Linguistics. She has worked as a computer scientist in finance and information technology for many years.


www.cl.uzh.ch/index_en.html



 
Martin Volk is professor of Computational Linguistics at the University of Zurich. His research focuses on multilingual systems, in particular on Machine Translation. His group has been investigating domain adaptation techniques for statistical machine translation, hybrid machine translation for lesser resourced languages, and machine translation into sign language.


www.cl.uzh.ch/index_en.html



 
Rihards Kalniņš is the international development manager for machine translation at Tilde, a leading European language technology company. He lives in Riga, Latvia.


rihards.kalnins[at]Tilde.com
www.tilde.com


 


 

Exploring the inner life of SMT Systems

Have you ever wondered what the inside of a machine translation system looks like? Or considered what it would be like to crack open a system and examine its internal components? We’ve all dreamed of shrinking down to a few millimeters and exploring the insides of household appliances – or, better yet, the internal systems of the human body. But what would we find if we gazed down at the inner depths of machine translation?

Students at the University of Zurich’s Institute of Computational Linguistics recently had this opportunity in an introductory course on machine translation and parallel corpora. Instructors at the institute wanted to give their students a deeper understanding of the inner life of SMT systems by letting them “peek under the hood” of a real MT platform and tinker with the internal components.

The machine translation platform chosen for this examination was LetsMT, provided by Tilde, a European language technology company. Since its launch in 2012, LetsMT has been used to build numerous customized MT systems. These include MT systems deployed by European governments as well as MT solutions integrated into popular mobile apps and language software.

LetsMT had already been employed as a classroom aid at the University of Copenhagen, where the Centre for Language Technology uses LetsMT as a “hands-on” tool to teach students the basics of statistical machine translation (SMT). In the spring term of 2014 the University of Zurich joined them, furthering the goal of teaching the next generation of MT professionals about how language science and technological development can be merged to create powerful new solutions.

Changing course

In previous years, students at the Institute of Computational Linguistics who had been introduced to the principles of statistical machine translation were asked to train their own Moses system on the University of Zurich’s server. Although they didn't have to install any software or collect any parallel corpora, students often found this first encounter discouraging.

Therefore, the course instructors Prof. Martin Volk and Anne Göhring decided to follow the lead of the University of Copenhagen and use LetsMT for the assignment. The idea was for the computational linguistics students to begin as quickly as possible to experiment with a statistical MT system without worrying about the technical details.

The explicit goals for their first assessment were "to learn to train a statistical MT system, and to experiment with your own MT system (and compare it, for example, with Google Translate)." The institute also encouraged the students from the start to discover SMT in an interactive way and explore the "inner life" of SMT systems.

Exploring the inner life: benefits and advantages

It is difficult to quantify the specific insights students gained by using the LetsMT platform, how user friendly they found it, or how useful this experiment was for achieving the learning goals of the course. But student reports included lots of very positive feedback. For instance, one student wrote: "It turned out to be astonishingly easy"; and another said: "the steps were easy to follow and thoughtfully explained."

From a teaching perspective, the University of Zurich found that the LetsMT platform offers many advantages. For students, the multilingual aspect is very important. Students were asked to choose any convenient language pair, where "convenient" meant that (a) they understood both source and target languages well enough to assess the quality of the translations delivered by their systems, and (b) there was at least one parallel corpus available for this language pair.

Instructors and teaching staff did not make use of the full functionality of LetsMT, since they had decided that the students should register as demo users of the platform. This restricted students to training on the available parallel corpora of limited size.

The wide range of languages covered by LetsMT gave students the opportunity to experiment with their preferred language pair, including their mother tongues. The institute had examples of systems for translation to and from English and German, Spanish and German, Russian and German, as well as systems for English-to-French, Italian-to-English, Swedish-to-English, and Swedish-to-German translation.

The clearly structured interface of LetsMT leads students seamlessly through every step necessary for building a statistical MT system. The resulting flow chart automatically produced for each created system perfectly illustrates the whole building process and is a great help for all participants.

Another nice feature for both students and teachers at the University of Zurich is the easy access through the web interface, although the institute’s teaching team already had everything prepared to train Moses systems on its server.

One last positive feature not explicitly mentioned in the assignment is the ability to train domain-specific systems. Some students spontaneously focused on that aspect, for example, choosing corpora from the legal domain, translating some in-domain and out-of-domain texts, and finally evaluating the resulting translations manually.

Theory and practice, and beyond

Similar to the teaching goals described by the University of Copenhagen, the Institute of Computational Linguistics wants students to “study MT in both theory and practice in order for them to become competent users.” But in contrast to students from language programs, computational linguistics students should also develop more technical skills, ideally programming skills.

For this reason, at the end of the term, instructors assigned a Moses training and evaluation task to be done in groups of two to three students. All the groups completed the assignment within two weeks by following the step-by-step instructions given to them, and most of the teams needed no additional help.

Still, the Institute of Computational Linguistics continues to search for a didactic solution to bridge the gap between the easy-to-use LetsMT platform and the less comfortable path the students must follow to create an SMT system all by themselves.

The University of Zurich discovered that using LetsMT helped level out the different educational backgrounds of the participants and allowed students to immediately build their own SMT systems. Instructors also found that the platform covers many different issues treated during the course – like the size of monolingual and parallel corpora, the training steps, the translation and language models, evaluation metrics, and domain adaptation – and thus helped convey the main ideas and principles of SMT to the students.

Closing the feedback loop

Thanks to the use of LetsMT in the classroom at the University of Zurich, the team at Tilde was also able to improve the overall LetsMT experience for other users. The Institute of Computational Linguistics was in close contact with Tilde about their use of the platform throughout the spring semester. This included detailed feedback reports from professors and students in the program.

One issue that students often encountered was the handling of tags in technical documentation. Translators often receive documents that include bits of HTML code, for example, from websites and user interfaces. This is a known stumbling block for MT systems, which often translate and reformat the tags automatically, rendering them meaningless. The Institute of Computational Linguistics highlighted this issue, and Tilde was able to provide a remedy.

LetsMT can now correctly deal with complex tags and placeholders, ensuring that MT systems can accurately translate technical documents and HTML code. This is a major leap forward for the application of machine translation, made possible by the successful collaboration between the University of Zurich and Tilde.
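To make the tag problem concrete, here is a minimal sketch of one common way to shield markup from an MT engine: replace each tag with a numbered placeholder before translation and restore the tags afterwards. This is an illustration only and not necessarily how LetsMT handles tags. The function names, the `__TAGn__` placeholder format, and the `fake_translate` stand-in for a real MT engine are all assumptions made for this example.

```python
import re

# Matches simple HTML/XML-style tags such as <b>, </b>, <a href="...">.
TAG_RE = re.compile(r"<[^<>]+>")


def mask_tags(text):
    """Replace each tag with a numbered placeholder; return (masked_text, tags)."""
    tags = []

    def _store(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_RE.sub(_store, text), tags


def unmask_tags(text, tags):
    """Put the original tags back in place of their placeholders."""
    return re.sub(r"__TAG(\d+)__", lambda m: tags[int(m.group(1))], text)


def fake_translate(text):
    """Hypothetical stand-in for an MT engine: it simply upper-cases the text."""
    return text.upper()


source = "Click <b>Save</b> to continue."
masked, tags = mask_tags(source)
translated = unmask_tags(fake_translate(masked), tags)
print(translated)
```

In this toy run the tags come back unchanged while the surrounding text is altered. A real engine might reorder, split, or drop placeholders, so a production system would need additional checks that this sketch leaves out.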

In the future, the University of Zurich will continue to employ LetsMT in the classroom, teaching future language technology professionals about the inner life of SMT systems. We look forward to seeing what new inspirations these explorations will bring.