July 2014
By Donald A. DePalma

Don DePalma is the founder and Chief Strategy Officer at independent market research firm Common Sense Advisory (CSA Research). He is the author of the premier book on business globalization Business Without Borders: A Strategic Guide to Global Marketing.


don[at]commonsenseadvisory.com
www.commonsenseadvisory.com

Free machine translation can leak data

New European Union regulations will threaten companies with substantial fines if they’re found guilty of a data breach. While IT departments ensure the integrity and security of their data, many ignore the outflow of proprietary information through the frequent use of free machine translation.

Your personal and corporate data is under siege. Hackers around the world steal identities and credit cards and breach the cyber-defenses of corporations, while government agencies systematically monitor your phone calls, e-mail, and Internet usage. Whether they are driven by the promise of ill-gotten gains or claims of national defense, these invaders have inspired the European Union and other governments to institute strong laws to protect data.

But even if these hackers and government agencies stopped looking, you would still have to worry about data security. Why? Both your employees and your suppliers are unwittingly conspiring to broadcast your confidential information, trade secrets, and intellectual property (IP) to the world. How? Through unencrypted requests to Google Translate and Microsoft Bing Translator, routine use of Wi-Fi at coffee shops and airports, and translation jobs sent off to contractors. How big a problem is this?

Last year, Google disclosed that 200 million people use its free machine translation (MT) every day. That’s just one place where people go for no-cost translation. Add Babylon, Baidu Translate, Microsoft Bing Translator, SDL’s FreeTranslation.com, SYSTRAN, Yandex Translate, and their mobile equivalents. There’s simply no shortage of free MT options that tempt your employees.

In our most recent survey on machine translation, Common Sense Advisory asked localization managers at enterprises to estimate how often their corporate colleagues use the technology. Sixty-four percent figure that their fellow employees use it frequently or even more often (see Figure 1).

Figure 1: Many companies use free MT frequently


These usage levels mean that employees send significant amounts of corporate information to these online MT providers. For example, they might translate e-mails, text messages, project proposals, legal contracts, merger and acquisition documents, and other sensitive content. We asked our respondents how concerned they are with the potential loss of intellectual property or proprietary data on these free sites: 62 percent told Common Sense Advisory that they are concerned or very concerned (see Figure 2).

Figure 2: Concern with potential loss of intellectual property from free MT


Our survey did not ask whether our respondents were worried about their suppliers’ use of machine translation, although other Common Sense Advisory research shows increasing experimentation with and reliance on the technology among both freelancers and language service providers. Systematic or even ad hoc use of free MT through its integration into widely used translation memory tools will increase the outflow of corporate information to sites over which the content owner has no control.

How worried should enterprises be about free MT? Sensitive corporate data can leak in two ways – in transit or at the site:

  • The “wrong” people can see information in transit. This issue isn’t restricted to MT, but is a symptom of increasing reliance on web-based services and the cloud. Employees or providers make MT requests over unencrypted connections or use open Wi-Fi hotspots that anyone could monitor. Similarly, translators – working either for the company or for a language services provider (LSP) – may push client content over insecure communications channels, thus exposing potentially sensitive information to whoever happens to be listening in.

  • MT sites can use your data in ways you did not intend. While content ownership remains with the creator, free MT providers claim usage rights under their terms and conditions. For example, Google notes that it “does not claim any ownership in the content that you submit or in the translations of that content returned by the API.” However, as you follow the policy links, you learn that “When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.”

This license to use the data “continues even if you stop using our Services.” MT providers may offer an opt-out option: “Some Services may offer you ways to access and remove content that has been provided to that Service. Also, in some of our Services, there are terms or settings that narrow the scope of our use of the content submitted in those Services.” However, once the data has leaked, it’s difficult to get it back under your control.

Microsoft offers paying subscribers who commit to a minimum volume of 250 million characters per month the option of restricting their training documents to just their own MT system. It addresses the communication holes with interfaces that fully support Secure Sockets Layer (SSL), a common protocol for encrypting messages sent over the Internet. However, users of its free MT service are subject to the problems that we just described.

Enterprises also need to think about suppliers working with corporate content outside the firewall. Translation suppliers add another headache to free MT usage: even if all of your employees practice “safe MT” usage and avoid sending sensitive data over unencrypted communication lines to servers outside your control, you cannot expect your translation suppliers to be experts in these best practices.

  • Service providers may not tell clients that they use MT. Half of the interviewees for our “Translation Production Models” report on how LSPs produce translations were either already using or testing machine translation in their workflow. We found no consensus on whether they disclose that fact to the buyer.
  • Most buyers haven’t caught up yet with data leakage. Most procurement departments do not yet specify protections that suppliers should adopt while using machine translation. Even when such protections are in place, enforcement will always be an issue. For example, project managers may not have studied the sales agreement that the owner of the LSP or sales rep signed. If they have read it, many will have no procedure for making its terms known throughout the production process.
  • Subcontractors might not follow the agreed-upon rules. Most LSPs farm out work to third-party linguists, either other LSPs or freelancers. These other agencies or individuals may not follow the same rules. Vendor managers do not necessarily ask their subcontractors about their MT practices, although that should become a routine question.
  • No matter what anyone says, linguists can and will use MT. Competition and price pressure being what they are, there is nothing you can do to prevent linguists from using free MT as an efficiency tool. As our report on “Trends in Translation Pricing” found, market forces require suppliers to use everything they can to be competitive.

The big question is, “What can enterprises do to limit their exposure from free MT usage?”

Short of disconnecting your company from the Web or establishing and enforcing usage restrictions across an entire enterprise and at all your suppliers, what can you do? First off, you should determine whether the free MT terms and conditions comply with your data security and usage policies. If they do not, then you should begin by educating your staff about the potential dangers. Train them to follow safe network and Wi-Fi procedures such as using secure HTTP and encrypted connections wherever they happen to be. While such education will not eliminate the problem, it could limit it.
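
As a rough illustration of the “secure HTTP” advice, an internal script or browser policy could refuse to send text to any translation endpoint that is not reached over HTTPS. The Python sketch below is hypothetical: the function name and the example URLs are assumptions made for illustration and do not belong to any MT vendor’s API.

```python
from urllib.parse import urlparse


def is_encrypted_endpoint(url: str) -> bool:
    """Return True only if the URL uses the HTTPS scheme and names a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.hostname)


# Hypothetical endpoints, used only for illustration.
print(is_encrypted_endpoint("https://mt.example.com/translate"))  # True
print(is_encrypted_endpoint("http://mt.example.com/translate"))   # False
```

A check like this addresses only interception in transit. It says nothing about how the MT provider uses the text once it arrives.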

You should also work with suppliers to limit your data exposure. In your master service agreements, include confidentiality statements that outline acceptable MT use, and audit the processes by which suppliers enforce it. Identify the content types for which you would prefer that they avoid using MT.

For the employees and providers that work with your source content and translations, you can:

  • Lock down content workflows. If you or your LSP uses a translation management system (TMS), you may be able to limit the network exposure of your content. In this case, all participants – requestor, translator, reviewer, and project manager – work within a secure, hosted environment that blocks access to free online MT. If the secure workflow extends further back to the content management system and out to the deployment system, that’s even better. Many TMS suppliers should be able to provide a closed environment.
  • Find MT providers that respect your data. Some LSPs and translation portals that offer post-editing and other MT services advertise how securely they deal with your content. Look for a supplier that limits the exposure of data during file transfers through encrypted connections, authenticates valid users through certificates and passwords, and keeps your content separate from other companies’ on multi-tenant servers that isolate each customer’s data. Don’t be shy about asking your providers tough questions about how they manage your data throughout their entire process.

For the vast majority of employees, you have less control over their possible use of machine translation in their everyday work. For them, you should:

  • Anonymize outgoing MT requests. Software can automatically hide information in your content that might identify your company. When such software is installed, what’s sent to Google or Yandex for translation is text with security tokens replacing proper names of individuals, cities, countries, and companies. Lingosec is a new software company that provides such anonymity, while CipherCloud provides more generalized protection to applications in the cloud.
  • Send free MT requests to software that you control. Work with your network administrators to redirect any browser calls to a free MT site so that your own machine translation software processes them instead. Most MT vendors offer behind-the-firewall or cloud-based solutions that will meet your requirements for data security and integrity. As with any MT solution you choose, look for encrypted connections, authentication, and multi-tenant servers that keep each customer’s data separate. Most commercial MT solution providers support these and other security-enabling capabilities. Of course, bringing MT inside your firewall raises other issues about training and maintenance, but that is a topic for another debate.
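
The token-replacement idea behind anonymized MT requests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how Lingosec or any other product works: the function names, the `[[T0]]` token format, and the sample company names are all hypothetical, and the sketch assumes that you already have a list of sensitive terms and that the MT engine passes the tokens through unchanged.

```python
import re


def anonymize(text, sensitive_terms):
    """Replace each sensitive term with a placeholder token.

    Returns the masked text and a token-to-term mapping for later restoration.
    """
    mapping = {}
    # Longest terms first, so "Acme Corp" is masked before a shorter "Acme".
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    for index, term in enumerate(ordered):
        token = f"[[T{index}]]"  # hypothetical token format
        pattern = r"\b" + re.escape(term) + r"\b"
        text, count = re.subn(pattern, token, text)
        if count:
            mapping[token] = term
    return text, mapping


def restore(text, mapping):
    """Put the original terms back in place of the tokens."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


# Hypothetical content and term list, for illustration only.
source = "Acme Corp will acquire Globex in Berlin."
masked, mapping = anonymize(source, ["Acme Corp", "Globex", "Berlin"])
print(masked)                    # the text that would leave the company
print(restore(masked, mapping))  # the original names put back
```

In practice, only the masked text would go to the online MT service, and `restore` would run on the translation that comes back. Matching here is case-sensitive and limited to the supplied list, so a real deployment would need a more robust way to detect names.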

Powerful technology brings new challenges. Free online MT is no exception. While it represents just one of the many corporate holes through which data can leak, it is a growing threat that many organizations have yet to acknowledge, much less address. Both translation buyers and suppliers have their own data security requirements based on business-specific factors, so they have to calibrate their response to potential data leakage.

The bottom line: Data leakage via free online MT and supply chain flaws is a clear and present danger to enterprises and their translation suppliers. Software to plug these gaps is just now entering the market.

#2 Albert wrote at Wed, Jun 15

It is also immoral to pay a human the price of a machine translation.

#1 Mark wrote at Thu, Aug 14

Hi, is it immoral or illegal to charge a client for human translation but perform machine translation and post-editing and not inform them?

Thanks