November 2017
Text by Don DePalma

Image: ©AH68/istockphoto.com

Don DePalma is the founder and Chief Strategy Officer at independent market research firm Common Sense Advisory (CSA Research). He is the author of the premier book on business globalization, Business Without Borders: A Strategic Guide to Global Marketing.

don[at]commonsenseadvisory.com
www.commonsenseadvisory.com

Talk to me

There’s an explosion of speech interfaces for computers, smartphones, and other devices. None of them would pass the Turing Test, a classic test for determining whether a computer is thinking and responding as a human would. However, developers around the world are enhancing conversational user interfaces (CUIs) to mimic human response.

Billions of people around the world would benefit from a human-machine interface that requires no training, specialized hardware, or skills beyond being able to talk and listen. The mission: To enable conversational interactions with machines like those we’ve seen for decades in Star Trek or Star Wars.

Computers underpin the modern world, adding intelligence to everything from the most mundane household appliances to sophisticated power grids. The nature of computers is changing. Embedded in smartphones and activity trackers, they enable communications with others and monitor our movements and rest. As the cost, size, and complexity of computer chips continue to shrink, computer-powered devices will find their way into even more applications as the Internet of Things (IoT) expands to control, compute, connect, and even care for us throughout every aspect of our daily lives.

What we should remember is that this expansive Internet of Things is not just about stuff, but also about the people using those things. This means that all of this computer-equipped gear has to communicate with whoever is wearing it, using it, or affected by it. Traditionally, we tell computers what we want them to do via keyboard and mouse, touchscreen, gestures, or simply by pressing buttons such as on/off/mute. The computer – in whatever device it happens to be embedded – responds via a screen message, by simply turning on, or by muting itself. Wearables on our bodies and the IoT devices around us change this. Some devices autonomously do what we tell them to, others provide haptic feedback through our fingertips or skin, and yet others interact through immersive Virtual Reality headgear and gloves.

But there is one interface that will rule them all, especially as the next one or two billion users of computing, smartphones, and Internet technology come online: speech.

From automated call centers to virtual assistants

Our voice and ears are the most natural interface for communicating not only with other people, but also with businesses, cars, and phones:

  • Speech recognition software (SRS) has long powered interactive voice response (IVR) systems that direct us to the appropriate agent when calling companies or government offices. Cheaper computers and better software led to widespread use of IVR as a way to optimize contact centers. Because IVR limits callers to pre-programmed dialogues, straying from an expected script leads to communication failures. IVR systems are limited by the operational dialogues that their owners devise – thus, they would fail the Turing Test because they "think" like a workflow, not like a responsive human agent in a conversation.
  • Automobiles began incorporating voice-command devices (VCD) for in-vehicle communications and entertainment in the early 2000s. Initially, this specialized speech recognition software understood just a handful of commands to operate the radio and air conditioner, then added support for navigation systems and, over time, expanded to thousands of commands, adding queries for driving directions or finding gas stations. But like IVR, many in-vehicle systems still react only to known scenarios and aren’t very conversational. Only the latest generations have incorporated smartphone SRS technology in the form of CarPlay and Android Auto, automotive versions of Apple Siri and Google Assistant, respectively.
  • Mobile phones initially offered speech support for simple canned actions like "call Mom", but over the last six years have evolved their speech interfaces into what seems like more conversationally capable virtual assistants. Then Apple Siri, Google Assistant, and Microsoft Cortana migrated from the phone to the desktop and into cars. Along with Amazon Alexa, they started finding their way into other devices. Today, those technologies represent the most powerful speech interfaces available to the mass market.

Improved as they are over IVR and VCD, even virtual assistants are not yet where users want them to be. So where would that be?

Conversationally challenged devices

Everyone really wants the fully conversational and intuitive talking computers that science fiction has tantalized us with for years. For all their power, today’s mobile-phone conversational user interfaces (CUIs) are still an early demonstration of technology rather than the evolved C-3PO or Marvin the Paranoid Android experiences we expect.

Let's consider the shortcomings of today’s interfaces, and then outline what’s happening in development laboratories.

First, where do today's CUIs fail to meet user expectations? Every user of speech recognition has experienced failures of understanding and gaps in recognition. For example, the device doesn't recognize you or your accent, so you have to repeat or enunciate. Or it misses something you say because a car horn sounds, another speaker talks over you, or you turn away from the microphone for a second – and unlike a human interlocutor, the technology doesn't fill in that short gap from the known context. People are understandably frustrated when Alexa or Siri respond with "I don’t understand" to even simple requests. In turn, users raise their voices, speak more slowly, dumb down their questions, or simply abandon the interface.

You might stray from the dialogues or domains that the phone understands, say something the CUI developer didn't program for, or ask a question that requires the ability to make an inference or interact with another application. For example, you ask your brother-in-law in Poughkeepsie what the weather will be like for your visit this coming weekend. Knowing that you're visiting, he realizes that you're really asking what you should pack and answers, "It looks like thunderstorms Saturday afternoon, so bring some rain gear."

Now let's ask one of these CUIs about Saturday’s weather in Poughkeepsie. It will tell you the forecasted temperature and the likelihood of precipitation. But it won't make the inference that your brother-in-law did, so you ask more explicitly, "Will I need a jacket in Poughkeepsie tomorrow?" You'll either get the temperature or maybe directions to a site telling you under which conditions you might want to wear a jacket, but it won't make the connection between your request and the weather. Or it might simply misunderstand one or two words and give you the wrong answer.

Given this inability, you won’t rely on Alexa or Siri to arrange all the details for your next trip. If a CUI can’t infer your intent behind a simple question about what you should wear next weekend, we can be sure that the combinatorial complexity of an international trip – with call-outs to other websites or micro-services for things like flight reservations, hotels, and possible visa requirements – is well beyond its current capability.

Finally, none of them remember state – that is, memory and context – so you really can’t have a conversation with one of them. For example, a CUI won’t remember what you asked it to do a few minutes ago, so subsequent dialogue starts from scratch. CUIs can’t save results or provide a persistent link. Until they can remember state, they remain conversationally hobbled.
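
To make the state problem concrete, here is a minimal Python sketch of the kind of dialogue memory a CUI would need: it remembers the last intent and its details ("slots") so that a follow-up such as "What about Sunday?" can inherit the missing pieces instead of starting from scratch. The class, intent, and slot names are illustrative assumptions, not any vendor's API.

```python
# Hypothetical sketch: a tiny dialogue-state store that lets a follow-up
# question reuse slots ("context") remembered from the previous turn.

class DialogueState:
    def __init__(self):
        self.last_intent = None   # e.g. "get_weather"
        self.slots = {}           # e.g. {"city": "Poughkeepsie", "day": "Saturday"}

    def update(self, intent, slots):
        """Merge the new turn into memory instead of starting from scratch."""
        self.last_intent = intent
        self.slots.update(slots)

    def resolve(self, intent, slots):
        """Fill gaps in a follow-up turn from what we already know."""
        if intent is None:
            intent = self.last_intent          # "What about Sunday?" keeps the weather intent
        merged = {**self.slots, **slots}       # new values override remembered ones
        self.update(intent, merged)
        return intent, merged


state = DialogueState()
state.update("get_weather", {"city": "Poughkeepsie", "day": "Saturday"})
# Follow-up turn: "What about Sunday?" supplies only a new day.
print(state.resolve(None, {"day": "Sunday"}))
# ('get_weather', {'city': 'Poughkeepsie', 'day': 'Sunday'})
```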

Those are some of the flaws of today’s CUIs. We’re still at a crossroads, waiting for more intelligence to deliver the meaningful conversation we expect with the devices at hand.

Technological advancements leading us into the future

Successful conversational interfaces deal with vague, non-deterministic, arbitrarily complex, and very likely contradictory input. The potential for semantic errors multiplies dramatically as these inference engines work through steps and transitions between tasks fulfilled by different systems. These evolved CUIs will use program generators to parse input, determine intent, create code on the fly, and call the services that can fulfill the request. According to our findings at Common Sense Advisory (CSA Research), the technology is advancing on a few fronts – speech recognition, connectivity, and inference.
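
As a rough illustration of that pipeline – parse the utterance, determine intent, then call whichever service can fulfill it – consider the Python sketch below. The keyword patterns stand in for the statistical intent classifiers real platforms use, and the back-end functions (look_up_weather, book_flight) are hypothetical placeholders, not any vendor's engine.

```python
# Illustrative sketch of an intent pipeline: parse -> classify -> dispatch.
# The patterns and back-end services are hypothetical placeholders.
import re

def look_up_weather(city):           # stand-in for a call to a weather micro-service
    return f"Forecast for {city}: thunderstorms Saturday afternoon."

def book_flight(city):               # stand-in for a call to a travel micro-service
    return f"Searching flights to {city}..."

INTENTS = [
    ("get_weather", re.compile(r"\b(weather|forecast|rain|jacket)\b", re.I), look_up_weather),
    ("book_flight", re.compile(r"\b(flight|fly|book)\b", re.I), book_flight),
]

def handle(utterance):
    """Determine intent from the utterance and call the matching service."""
    city_match = re.search(r"in ([A-Z][a-z]+)", utterance)
    city = city_match.group(1) if city_match else "your location"
    for name, pattern, service in INTENTS:
        if pattern.search(utterance):
            return name, service(city)
    return "unknown", "Sorry, I don't understand."   # the failure mode users know too well

print(handle("Will I need a jacket in Poughkeepsie tomorrow?"))
```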

Speech recognition

Speech platforms must process complicated utterances in a variety of accents, dialects, and idiolects. In August 2017, Microsoft announced that its speech transcription software generated fewer transcription errors than a team of humans would have. The test took place in a lab and used audio simulating a stable landline, but it’s a step in the right direction for dealing with the noisy backgrounds of mobile phones, factory floors, and fast-food drive-through ordering systems.


Connectivity

Like any platform, CUIs benefit from broad usage. This network effect requires that they interact with a wide variety of programs. Similar to WeChat (China) and Line (Japan), they’ll allow an array of specialized functions to be integrated. Unlike those apps, they’ll offer even broader portfolios of services and – for several of them – the ability to support dozens of locales in these interactions. Earlier this year, developers of the leading CUIs announced open interfaces so other systems could incorporate a conversational interface. Amazon and Microsoft have already agreed to have Alexa and Cortana talk to each other, thus allowing them to exchange knowledge from each other’s domains.


Inference

To avoid the Poughkeepsie rain gear problem, these evolving speech platforms are adding artificial intelligence (AI) to help decipher the intent of user input and provide the conversation with some context and memory. Inference engines apply AI techniques to figure out what people are actually asking and to provide an appropriate response – which will often require a determination of state and history, disambiguation, clarification, and calls to other services. Hence, connectivity to other systems is essential. Developers are also adding machine learning frameworks from Amazon, Google, IBM, Microsoft, and other big-data suppliers so that the CUIs benefit from previous interactions. Backed by AI software, these apps will learn from interactions, become more predictive and insightful in their responses, and simulate a more conversational interaction. 
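
Returning to the rain-gear example, the missing piece is an inference step that combines a literal forecast lookup with a simple rule about what the forecast implies for the traveler. The Python sketch below is purely illustrative; the forecast values, thresholds, and function names are assumptions, not output from any real weather service.

```python
# Illustrative inference step: turn raw forecast data into the answer the
# user actually wanted ("should I bring rain gear?"). Values are made up.

def get_forecast(city, day):
    # Stand-in for a call to an external weather service.
    return {"high_f": 78, "precip_chance": 0.7, "condition": "thunderstorms"}

def packing_advice(city, day):
    forecast = get_forecast(city, day)
    advice = []
    if forecast["precip_chance"] >= 0.5:
        advice.append("bring rain gear")          # the inference the brother-in-law makes
    if forecast["high_f"] < 60:
        advice.append("pack a jacket")
    detail = f"{forecast['condition']}, {int(forecast['precip_chance'] * 100)}% chance of rain"
    return f"Looks like {detail} in {city} on {day}; " + (", ".join(advice) or "no special gear needed") + "."

print(packing_advice("Poughkeepsie", "Saturday"))
# Looks like thunderstorms, 70% chance of rain in Poughkeepsie on Saturday; bring rain gear.
```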


Preparing for conversational interfaces

How should your company react to these changes? Our data at CSA Research shows that speech is becoming the dominant interface, especially for new users coming online with mobile phones or interacting with the IoT around the planet:

  • Add mobile to your content and platform strategy – and budget accordingly. Mobile and IoT devices constitute a tremendous growth opportunity in both developed and new markets. Besides provisioning for staffing and technology changes to your existing localization models, you’ll also have to conduct the budgetary exercise of establishing the return on investment for adding mobile-centric locales with large but not yet economically attractive populations.

  • Add plans for speech to your global content strategy. Extend current processes to meet expanding requirements for spoken interactions such as searches and customer support. Devices and content types differ for mobile, but core development concepts such as separating presentation and content from code still apply. Recognize the need to meet the increased expectations for local experiences that arise from the proximity, immediacy, and intimacy of mobile devices, and train staff and contractors accordingly.

  • Experiment with commercial software for spoken language support. Today’s virtual assistants provide an off-the-shelf, evolving platform for spoken interactions with a variety of devices. Evaluate the alternatives and choose the one that best meets your requirements:
    1) If your app is cross-platform, identify one that runs on multiple devices – Google and Microsoft currently lead in this area;
    2) Pick one that has an API or SDK to allow integration with your app – in 2017, suppliers began releasing these low-level interfaces; and
    3) Consider the current state of foreign-language support and quiz prospective suppliers on future offerings – today, Apple and Microsoft are ahead on that front.

  • Work within that platform’s ecosystem. Today’s mobile apps are largely single-function affairs. Speech is one of the first shared services that many will incorporate. We expect that apps will increasingly be re-developed around micro-services, using modularized code that runs a single process and interacts with other services through a well-defined interface. Make your own apps self-describing or interrogatable so that they can be more easily discovered by others. More metadata about what they do will enhance the ecosystem.
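
As one way to picture what "self-describing" could mean in practice, the sketch below shows a hypothetical micro-service (built with Flask) that publishes machine-readable metadata about the single capability it offers, so a conversational platform could discover it and route matching intents to it. The service name, fields, and endpoints are illustrative assumptions.

```python
# Hypothetical sketch: a micro-service that describes its own capability
# so a conversational platform could discover and invoke it.
# Requires Flask (pip install flask); names and fields are illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

SERVICE_DESCRIPTION = {
    "name": "packing-advisor",
    "intents": ["packing_advice"],
    "slots": {"city": "string", "day": "string"},
    "locales": ["en-US", "de-DE"],        # which locales the service can answer in
    "endpoint": "/advise",
}

@app.route("/describe")
def describe():
    """Machine-readable metadata so other services can discover what we do."""
    return jsonify(SERVICE_DESCRIPTION)

@app.route("/advise")
def advise():
    city = request.args.get("city", "Poughkeepsie")
    day = request.args.get("day", "Saturday")
    return jsonify({"text": f"Thunderstorms expected in {city} on {day}; bring rain gear."})

if __name__ == "__main__":
    app.run(port=5000)
```

A conversational platform could fetch /describe once when the service registers and thereafter route matching user intents to /advise; richer metadata of this kind is what makes an app discoverable within the ecosystem.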

In the final analysis, the spoken language that people use to communicate with fellow humans will be their interface of choice for dealing with machines. As billions of people interact with an enormous number and variety of computer-equipped devices, conversational interfaces will become more prevalent. Commercial enterprises and government agencies should begin researching and experimenting with spoken-language interfaces that their customers and citizens will come to expect.