March 2020
Text by Jörg Schmidt

Image: © metamorworks/istockphoto.com

Jörg Schmidt has more than 15 years of experience as a technical consultant, account and project manager for XML-based content management systems in industries such as automotive, aerospace and defense, life sciences and manufacturing. He has been working as a solution architect for SDL since 2013.


jschmidt[at]sdl.com

What can linguistic AI do for you today – and tomorrow?

AI is one of the biggest buzzwords in general technology, and in IT in particular. So, what impact does it have on the technical communication ecosystem – today and in the near future?

What is (linguistic) AI?

Let’s start with a definition. Colloquially, the term "Artificial Intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions we associate with the human mind, such as "learning" and "problem solving".

As this is a wide field, divided into multiple subcategories, in this article I will focus on the subcategory with the most obvious impact on technical communication: linguistic AI. This combines three subjects: language processing, language understanding and language generation – powered by machine learning (ML).


Figure 1: What is linguistic AI?

 

As in all neural technology using machine learning, successful implementation of linguistic AI applications is based on three main pillars:

  • Machine learning competence – Deep expertise in ML techniques and algorithms is a core requirement.
  • Good data – AI systems learn best from structured and intelligent content.
  • The right focus – Narrowly focused business challenges where learning data is available.

In summary, real knowledge and expertise in machine learning are required; merely throwing some data into open-source toolkits to get "something" is not enough. ML expertise takes years to develop and comes only after much experimentation.

Equally important is getting the right data and preparing it for optimal ML. This is one of the reasons why "free of charge" solutions that translate any kind of content, from any domain, at any level of language quality, are improving only slowly (if at all).

Finally, once your content is aligned, you need the right focus. For example, enterprises with large translation memories (TMs) face issues similar to those of the publicly available "free of charge" platforms: their TMs have grown over the years from the input of many different departments, mixing domains, terminology, and quality levels.

If all three items are properly in place, AI can transform your business.

 

Why does it matter for technical communication?

The ongoing content explosion is creating challenges for all participants in the content supply chain in technical communication. At a high level, the content supply chain connects content creators and content consumers through three main stages: content creation and management, content transformation and translation, and content distribution and delivery.

All three stages face specific challenges across all industries:  

  • Authors must create larger volumes of content for more product variants, adapt it to more output channels, and make it suitable for more target groups
  • Translators must localize more content in less time for less money
  • Customers must find the right information among much smaller (and thus far more numerous) chunks of content, available on more channels

All three challenges can benefit from linguistic AI in one way or another. Let’s have a more detailed look at the three use cases, starting with the most mature one.

 

Use case 1: Neural Machine Translation (NMT)

General Machine Translation (MT) has been publicly available through tools such as Google Translate for over ten years. Given the poor quality it delivered in the past, professional translators did not see it as a big threat. This has changed recently with the introduction of Neural Machine Translation (NMT), the third generation of MT, which has enabled a breakthrough in quality.

Figure 2 shows a rough overview of the development of translation quality over time. Machine translation started with rule-based approaches in the 1970s, which delivered only slow improvements in quality and were overtaken by statistical engines about 15 years ago. Third-generation NMT, commercially available for only about five years, shows by far the steepest quality curve and therefore delivers the best quality available today for practically all language pairs.


Figure 2: Development of the quality of machine translation over time


In combination with current computing power, NMT engines can translate 100,000 words per minute at reasonable quality. Combined with human post-editing, NMT provides a cost-efficient translation method for any type of content.

Figure 3: Translation methods vs. content types

 

However, will this make human translators obsolete in the near future? Most likely not! As with all disruptive technologies, it will change the working environment for the majority of professional translators rather than eliminate it. New job profiles such as "NMT machine trainer" will also arise. And traditional translation tasks in highly sensitive areas such as advertising (where content very often has to be recreated) or highly demanding areas such as legal (where every character counts) will always require human creativity.

For more standard content such as technical documentation, a combination of NMT and human post-editing still delivers the best quality. SDL’s internal statistics show an average quality level of 92.5 percent for our NMT engines – still behind the 96.6 percent our human translators deliver – while the combination of NMT and human post-editing reaches a superior 99.5 percent. In the end, NMT is, like all other CAT tools before it, a utility that makes human translators better and faster.
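To make this tangible, here is a minimal sketch of running a freely available NMT model from Python. It assumes the open-source Hugging Face transformers library and a public Helsinki-NLP Marian model – an illustration of the technology, not the SDL engines discussed above:

    # Minimal sketch: machine-translating a sentence with an open-source NMT model.
    # Assumes the Hugging Face "transformers" library and a public Marian model;
    # it illustrates the concept only and is unrelated to SDL's commercial engines.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # English -> German
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    source = ["Check the oil level before starting the engine."]
    batch = tokenizer(source, return_tensors="pt", padding=True)
    output = model.generate(**batch)
    print(tokenizer.batch_decode(output, skip_special_tokens=True))
    # e.g. ['Überprüfen Sie den Ölstand, bevor Sie den Motor starten.']

In a professional workflow, output like this would be the draft that a human post-editor refines, not the final translation.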

 

Use case 2: Content classification

With the notorious content explosion, findability of information is an issue at both ends of the content supply chain. According to IDC’s Information Worker Survey, workers spend an average of 4.5 hours a week looking for documents – and then re-authoring what they did not find. No reliable numbers are available for the average time content consumers waste searching for information, but anyone who has ever tried to solve an issue with a technical product will agree that finding answers to product-related questions takes far too much time today.

The solution is "intelligent content."

Intelligent content is structured and tagged, making it highly searchable and well organized. The goal is to shorten the time it takes content consumers (both internal and external) to answer their questions or complete their research – ultimately, to deliver relevance. That said, different content types require different levels of structure. Technical documentation, which is frequently cited and reused and has a shelf life matching that of the respective products, requires more structure than a marketing campaign website that is only live for a short while.

However, there still needs to be a unifying structure that spans these disparate pieces of content. That’s where a corporate taxonomy and ontology are critical:

  • Search and Discovery – Make users more productive and your content more findable.
  • Groupings and Relationships – Organize hierarchies and topics.
  • Content Linking – Create pathways that help users discover related content.
  • Integrations and Interoperability – Connect external sources to content and exchange information automatically.

Information systems such as content management and knowledge management increasingly use taxonomies to improve access to and discovery of the right content. The more content an organization manages, the more critical this becomes.

Findable content relies on strong information relationships. These are typically designed around the way a company sees its content, but don’t necessarily reflect how employees and customers want to use it.

 

Taxonomies and ontologies

Two important classification concepts are taxonomy and ontology. If a taxonomy is a tree, an ontology is a forest: it encompasses multiple taxonomies along with the relationships between their concepts. Taxonomies and ontologies expose the meaning of content in a way that lets computers process, find, filter, and connect information.
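As a toy illustration of the tree metaphor, the following Python sketch models a small taxonomy as nested dictionaries and resolves the path from the root to a given term. All category names are invented for the example:

    # A toy taxonomy as a nested dict: each key is a node, each value its children.
    # All category names are invented for illustration.
    taxonomy = {
        "Products": {
            "Engines": {"Diesel": {}, "Electric": {}},
            "Software": {"Firmware": {}, "Diagnostics": {}},
        }
    }

    def path_to(term, tree, trail=()):
        """Return the path from the root to `term`, or None if it is absent."""
        for node, children in tree.items():
            here = trail + (node,)
            if node == term:
                return here
            found = path_to(term, children, here)
            if found:
                return found
        return None

    print(" > ".join(path_to("Diagnostics", taxonomy)))
    # Products > Software > Diagnostics

An ontology would add typed relationships across such trees – for example, linking the "Diagnostics" software category to the engine models it supports.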

As great as these systems are for organizing content, today they mostly rely on people manually applying the right tags to the right content. Unfortunately, manual tagging by humans is subjective and often applied unevenly. It frequently deviates from corporate taxonomies, resulting in poor enterprise search.

One of the main reasons for this might be that content classification is a fairly boring task and, for the majority of people doing it, not a core part of their job description. But it is exactly this kind of task that has historically benefited from improvements in technology, and AI might be the technology that helps here. Today, linguistic AI can already suggest tags, but given the low volume of correctly tagged content, the training data might not be sufficient to guarantee reliable recommendations. However, if the suggestions at least remind authors to apply classification metadata more often, the volume and quality of the data will improve – and with that, linguistic AI can learn and improve as well.
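To sketch what such tag suggestions could look like under the hood, here is a deliberately small example using the open-source scikit-learn library. The sample topics, tags, and thresholding idea are invented for illustration and are far simpler than a production linguistic AI system:

    # Minimal sketch of AI-assisted tagging: train a classifier on already-tagged
    # topics, then rank candidate tags for new content.
    # Uses scikit-learn; the sample texts and tags are invented for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "Replace the filter cartridge and check for leaks.",
        "Configure the network interface and set a static IP address.",
        "Tighten the mounting bolts to the specified torque.",
        "Update the firmware via the maintenance port.",
    ]
    tags = ["maintenance", "configuration", "maintenance", "configuration"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, tags)

    new_topic = "Set the IP address of the diagnostic interface."
    for tag, prob in zip(model.classes_, model.predict_proba([new_topic])[0]):
        print(f"{tag}: {prob:.2f}")  # suggest only tags above a confidence threshold

With only a handful of training examples the probabilities are unreliable – which mirrors the point above: the more correctly tagged content exists, the better the suggestions become.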

So, keep an eye out for upcoming solutions in this area – but don’t be disappointed with initial results that do not look too useful. They will improve with every correction you apply!

 

Use case 3: Content creation

Today, then, content translation and classification can already benefit from linguistic AI, with further improvements expected in the near future. But what about a technical author’s core task – content creation? Linguistic AI can "understand" text and thus help extract key concepts and supporting points that serve as a jump-start for derivative content.
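To illustrate the principle of extracting key points, here is a deliberately naive, frequency-based extractive summarizer in pure Python. Commercial linguistic AI relies on far more capable neural models, but the basic idea – scoring and selecting the most representative sentences – is the same:

    # A naive extractive summarizer: score each sentence by the corpus-wide
    # frequency of its words and keep the top-scoring sentences.
    # Illustrates the concept only; real systems use neural language models.
    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z']+", text.lower()))
        scored = sorted(
            sentences,
            key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
            reverse=True,
        )
        top = set(scored[:n_sentences])
        # Preserve the original sentence order in the summary.
        return " ".join(s for s in sentences if s in top)

    spec = ("The pump delivers coolant to the engine. The pump housing is "
            "made of aluminum. Coolant flow is monitored by a sensor. The "
            "sensor reports coolant flow to the engine control unit.")
    print(summarize(spec))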

Most technical documentation is derived from product specifications, whitepapers, and software documentation maintained as part of the code. Linguistic AI could assist a technical author in the early stages of such creation tasks. From a commercial, off-the-shelf product perspective, first applications are indeed available (see Figure 4 below) that allow you to upload larger sets of documents and automatically extract summaries or identify the most relevant taglines.

While this may already be useful for marketing or internal documents such as executive summaries, it does not (yet) seem appropriate for security-relevant or legal content types because of the possible ramifications of errors. But this should change in the near future, so keep an eye out for these kinds of solutions if you spend a significant part of your working time reading source documents from other departments.


Figure 4: Content assistant

 

Summary

Linguistic AI is a quickly progressing discipline within AI that can support humans at multiple stages of the content supply chain, and it will change the professional lives of many people in the field of technical communication. However, like most disruptive technologies in the past, it will not make jobs obsolete; rather, it will provide new opportunities for innovation and allow people to work more efficiently.