Speaking of Which: Let’s Throw Off Our Robot Overlords ;-)

Living in a Speech-Triggered World

Like many people in the developed world, I spend every day surrounded by technology – laptops, phones, car touch-screens, networked gadgets and smart TVs. Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment. And I have learned to use them pretty well – I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe and peck. The devices themselves have not learned much of anything!

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation by allowing good solutions to bigger, more complex, more ambiguous problems than before. The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming very good substitutes for human vision in tasks like driving, surveillance and inspection.

Deep learning advances in speech have a completely different character from those in vision – these advances are rarely about substituting for humans, but about paying more attention to humans. Speech recognition, speaker identification, and speech enhancement are all functions that enhance human-machine interaction. The net effect of better human-machine interaction is less about displacing human-to-human interaction and more about displacing the keyboards, mice, remote controls and touchpads we have learned to use, albeit painfully. In a very real sense, the old interfaces required us to retrain the neurological networks in our brains – new speech interfaces move that neural network effort over onto the computers!

Speech is coming into use in many forms – enhancement, generation, translation, and identification – but various flavors of speech recognition are getting the most attention. Speech recognition seems to be used in at least three distinct ways.

  • First, it is used for large-scale transcription, often for archival purposes, in medical records and in business and legal operations. This is typically done in the cloud or on PCs.
  • Second, we have seen the rapid rise, with Alexa, Siri, Google Voice and others, of browser-like information queries. This is almost always cloud-based, not just because the large vocabulary recognizers require, for now, server-based compute resources, but also because the information being sought naturally lives in the cloud.
  • Third, there are local system controls, where the makers of phones, cars, air conditioners and TVs want the convenience of voice command. In some cases, these may also need voice responses, but for simple commands proper recognition has an obvious natural response: the device simply performs the action.

Voice has some key advantages as a means of user-interface control, due to the richness of the information it carries. Not only do the words themselves carry meaning, but the tone and emotion, the identity of the speaker, and the sound environment may bring additional clues to the user’s intent and context. But leveraging all this latent information is tricky. Noise is particularly disruptive to understanding, as it masks the words and tone, sometimes irretrievably. We also want the speech for user interfaces to be concise and unambiguous, yet flexible. This places a sophisticated set of demands on even these “simple” voice command UIs:

  • Command sets that comprehensively cover all relevant UI operations, sometimes including obscure ones.
  • Coverage of multiple phrases for each operation, for example: “On”, “Turn on”, “Turn on TV”, “Turn on the TV”, “Turn on television”, “Turn on the television” (and maybe even “Turn on the damn TV”). These lists can get pretty long, as the sketch after this list suggests.
  • Tolerance of background noise, competing voices and, most especially, the noise of the device itself, which is a particular problem for music and video output devices. In some cases, the device can know what interfering noise it is generating and attempt to subtract it out of the incoming speech, but complex room acoustics make this surprisingly tricky.
  • Ability to distinguish commands directed to a particular device from similar commands for other voice-triggered devices in the room.
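
To make the phrase-coverage demand concrete, here is a minimal sketch in Python of how a command table might map many phrasings onto one operation. The intent names and phrase lists are hypothetical examples, not any vendor’s actual command set, and a real system would match scored recognizer hypotheses rather than exact strings, but the shape of the problem is the same: many surface forms per command.

```python
# Minimal sketch of a phrase-to-intent table for a voice-command UI.
# The intents and phrasings are hypothetical examples only.

COMMANDS = {
    "tv.power_on": [
        "on", "turn on", "turn on tv", "turn on the tv",
        "turn on television", "turn on the television",
        "turn on the damn tv",
    ],
    "tv.volume_down": [
        "volume down", "turn volume down", "turn the volume down",
    ],
}

def match_command(transcript):
    """Return the intent whose phrase list contains the transcript, or None."""
    text = transcript.lower().strip()
    for intent, phrases in COMMANDS.items():
        if text in phrases:
            return intent
    return None

if __name__ == "__main__":
    print(match_command("Turn on the TV"))          # -> tv.power_on
    print(match_command("Open the pod bay doors"))  # -> None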

The dilemma of multiple voice-triggered devices in one room is particularly intriguing. Devices often use a trigger phrase (“Hey Siri” or “Alexa”) to get the attention of the system before a full set of commands or language can be recognized. This may be particularly practical for cloud-based systems, which don’t want or need to stream audio to the cloud continuously. Listening for a single wake-up phrase can take significantly less compute than always listening for a whole vocabulary of phrases. However, it does increase the human speech effort to get even the simplest service. Moreover, most device users don’t really want to endure the essential latencies – 1-2 seconds for the simplest commands – inherent in cloud-based recognition.

As the number of devices in a space increases, it will be more and more difficult to remember all the trigger phrases and the available set of commands for each device. In theory, a unified system could be created, in which commands are recognized via a single rational interface and then routed to the specific gadget to be controlled. Amazon or Google might be delighted with this approach, but it seems somewhat unlikely that all potential voice-triggered devices will be built or controlled by a single vendor. Instead, we are likely to see some slightly-contained chaos, in which each device vendor dispenses with wake words for local devices and attempts to make their commands as disjoint as possible from the commands for other devices. In cases of conflict – e.g. when both the music player and the TV support a command “Turn volume down” – the devices will work from context, and may even signal to one another that a relevant command has been captured and acted on. (And rather than relying on a network handshake for synchronization, the devices may signal audibly to indicate a successful command capture. Both other devices and the user may appreciate a distinct notification.)
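
As an illustration only, here is one way that working-from-context might look in code. The Device structure, the is_active flag, and the preference for whichever device is currently playing are all assumptions layered onto the scenario above, not a description of how any shipping product behaves.

```python
# Hedged sketch of context-based arbitration: when two devices share a
# command such as "volume down", prefer whichever device is currently active.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    commands: set
    is_active: bool = False  # e.g. currently playing audio or video

def route(command, devices):
    """Pick one device for a possibly ambiguous command, preferring the active one."""
    candidates = [d for d in devices if command in d.commands]
    if not candidates:
        return None
    active = [d for d in candidates if d.is_active]
    return (active or candidates)[0]

if __name__ == "__main__":
    tv = Device("tv", {"volume down", "power off"})
    speaker = Device("speaker", {"volume down", "next track"}, is_active=True)
    print(route("volume down", [tv, speaker]).name)  # -> speaker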

One great challenge of ubiquitous speech UIs is endowing each device with sufficient smarts. Most devices need far less than the full continuous speech recognition needed for transcription, but even fairly simple devices will likely need a vocabulary of tens of phrases, often comprising hundreds of words. Accurately spotting all those words and phrases in a noisy environment, with multiple speakers and the audio blaring, is tricky, especially given reasonable expectations for accuracy. We simply don’t want to have to repeat ourselves. This means running fairly sophisticated recognition neural networks, either to pick out words for matching against command patterns or to directly recognize entire command phrases.

Fortunately, neural network inference hardware is getting dramatically more efficient. Creators of neural network processors and accelerators for embedded systems are now routinely touting efficiencies above one trillion operations per second per watt, often driven by the demands of computer vision. Short command recognition is likely to take 3-5 orders of magnitude less compute than that. This implies that as neural network-specific acceleration becomes common in embedded platforms, we may well see rich speech interfaces implemented in microwatts of power. In addition, we are likely to see hybrid systems that follow a 90/10 rule – 90% of commands are recognized and acted on locally, to minimize latency, computing cost and network bandwidth consumption, while the most obscure and difficult phrases are passed on to the cloud for interpretation. A well-designed hybrid system will have the latency of local processing and the sophistication of the cloud. All this is profoundly good news for users, who hate to get up off the couch or go looking for the remote!
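
A hybrid flow like that could be as simple as a confidence-gated fallback. The sketch below is purely illustrative: both recognizers are stand-ins and the threshold is a made-up number, but it captures the local-first, cloud-as-backstop structure.

```python
# Sketch of a hybrid "90/10" flow: try a small on-device recognizer first,
# fall back to the cloud only when local confidence is low.
# The recognizers and the 0.85 threshold are illustrative assumptions.

def recognize_local(audio):
    """Stand-in for an on-device phrase spotter; returns (intent, confidence)."""
    return "tv.volume_down", 0.93  # placeholder result

def recognize_cloud(audio):
    """Stand-in for a large-vocabulary cloud recognizer (adds network latency)."""
    return "complex-or-obscure-request"  # placeholder result

def handle_utterance(audio, threshold=0.85):
    intent, confidence = recognize_local(audio)
    if confidence >= threshold:
        return intent               # the common case: fast, local, low power
    return recognize_cloud(audio)   # the rare, harder case: slower but smarter

if __name__ == "__main__":
    print(handle_utterance(b""))    # -> tv.volume_down (handled locally)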

In the last couple of years, handwringing over the potential downsides of AI has become a surefire attention getter. Some of the concerns are real – we should be worrying about implicit bias in training sets, about the fragility and incomprehensibility of systems in mission-critical situations, and about the potential economic disruption from rapid change in some classes of work. I am less worried about the most apocalyptic “rise of the robot overlords” scenarios. In fact, I foresee the end of our fifty years of forced learning to type, swipe, mouse and point. Perhaps instead of computers forcing us to learn their preferred interface methods, we can teach them ours. Perhaps we can now finally throw off our robot overlords!