Speech! Speech! Speech!

New technology is a source of excitement at any time, but we don’t live at just “any time”. Right now, we are experiencing an era of particularly rapid evolution of computing technology, and a period of particularly dramatic evolution of new business models and real-world applications. The blanket term, “AI”, has captured the world’s imagination, and it is tempting to dismiss much of the breathless enthusiasm – and doomsaying – as just so much hype. While there is a large dose of hype circulating, we must not overlook the very real and very potent emergence of deep learning or neural network methods as a substantially fresh approach to computing. The pace of improvement of algorithms, applications and computing platforms over just the past five years shows that this approach – more statistical, more parallel, and more suitable for complex, essentially ambiguous problems – really is a big deal.

Not surprisingly, a great deal of today’s deep learning work is being directed at problems in computer vision – locating, classifying, segmenting, tagging and captioning images and videos. Roughly half of all deep learning startup companies are focused on one sort of vision problem or another. Deep learning is a great fit for these problems. Other developers and researchers have fanned out across a broad range of other complex, data-intensive tasks in modeling financial markets, network security, recruiting, drug discovery, and transportation logistics. One domain is showing particular promise: speech processing. Speech has all the key characteristics of big data and deep complexity that suggest potential for neural networks. And researchers and industry have made tremendous progress on some major tasks like automatic speech recognition and text-to-speech synthesis. We might even think that the interesting speech problems are all getting solved.

In fact, we have just scratched the surface on speech, especially on the exploitation of the combination of audio signal processing and neural network methods to extract more information, remove ambiguity and improve quality of speech-based systems. Deep networks can follow not just the sounds, but the essential semantics of the audio trace, providing powerful means to overcome conflicting voices, audio impairments and confusion of meaning. The applications of improved speech understanding extend well beyond the applications we know today – smart speakers, cloud-based conversation bots and limited vocabulary device command systems. We expect to see semantic speech processing permeate into cars, industrial control systems, smart home appliances, new kinds of telephony and a vast range of new personal gadgets. We fully expect to see lots of new cloud APIs for automated speech processing services, new packaged software and new speech-centric devices. Speech is so natural and so accessible for humans, that I can predict it will be a preferred interface mechanism for hundreds of billions of electronic systems over time.

Ironically, the entrepreneurial world is not pursuing speech opportunities in the way it has been chasing vision applications. In Cognite Ventures’ list of the top 300 deep learning startups, for example, about 160 are focused on vision, but only 16 on speech (see www.cogniteventures.com). While vision is certainly fertile ground, speech is equally fertile. Moreover, most of the innovations in new deep learning network structures, reinforcement learning methods, generative adversarial networks, accelerated neural network training and inference chips, model distillation and new intuitive training frameworks apply to speech as completely and easily as to vision tasks.

This convergence is also true when we look at new hardware platforms. The computing patterns and the essential data-types for classical vision functions and for classical audio functions have long been quite distinct. Vision had lots of parallelism, consistently exploited spatial locality in multiple dimensions, worked largely on 8b and 16b data, and often needed throughput of hundreds of billions of operations per second. Audio algorithms were more sequential, worked on 16b, 24b and 32b data and typically used only a few billion operations per second in the most demanding applications. Today, however, both vision and speech processing are adopting very similar neural network inference methods, so they can use much the same hardware – the network really doesn’t care whether it is operating on voice samples or image samples. These methods produce better and better results as the network dimensions (and training data sets!) scale up, so that real-time vision problems can now consume trillions of operations per second. Clever architects and silicon platform developers are inventing ways to deliver these extraordinary operation rates, at efficiencies easily 10x better than conventional DSPs and 100x better than conventional CPUs. A dramatic step-up in available computing throughput and efficiency for speech computing is the happy and unexpected by-product of this vision computing breakthrough!

So I fully anticipate an acceleration in speech applications over the next couple of years, just as we’re enjoying today in vision applications. The convergence of algorithmic innovation in deep learning for speech and language, new understanding of the interaction between AI and signal processing, the huge increase in efficient and available speech computing bandwidth, and the expanding sphere of applications in phones, cars, IoT and the cloud all suggest a period of particularly rapid change.