Speaking of Which: Let’s Throw Off Our Robot Overlords ;-)

Living in a Speech-Triggered World

Like many people in the developed world, I spend every day surrounded by technology – laptops, phones, car touch-screens, networked gadgets and smart TVs. Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment. And I have learned to use them pretty well – I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe and peck. The devices themselves have not learned much of anything!

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation, by allowing good solutions to bigger, more complex, more ambiguous problems than before. The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming viable substitutes for human vision in tasks like driving, surveillance and inspection.

Deep learning advances in speech have a completely different character from those in vision – these advances are rarely about substituting for humans, but about paying more attention to humans. Speech recognition, speaker identification, and speech enhancement are all functions that enhance human-machine interaction. The net effect of better human-machine interaction is less about displacing human-to-human interaction and more about displacing the keyboards, mice, remote controls and touchpads we have learned, often painfully, to use. In a very real sense, the old interfaces required us to retrain the neurological networks in our brains – new speech interfaces move that neural network effort over onto the computers!

Speech is coming into use in many forms – speech enhancement, generation, translation, and identification – but various flavors of speech recognition are getting the most attention. Speech recognition seems to be used in at least three distinct ways.

  • First, it is used for large-scale transcription, often for archival purposes, in medical records and in business and legal operations. This is typically done in the cloud or on PCs.
  • Second, we have seen the rapid rise, with Alexa, Siri, Google Voice and others, of browser-like information queries. This is almost always cloud-based, not just because the large vocabulary recognizers require, for now, server-based compute resources, but also because the information being sought naturally lives in the cloud.
  • Third, there are local system controls, where the makers of phones, cars, air conditioners and TVs want the convenience of voice command. In some cases, these may also need voice responses, but for simple commands, proper recognition leads to obvious natural responses.

Voice has some key advantages as a means of user-interface control, due to the richness of the information it carries. Not only do the words themselves carry meaning, but the tone and emotion, the identity of the speaker, and the sound environment may bring additional clues to the user’s intent and context. But leveraging all this latent information is tricky. Noise is particularly disruptive to understanding, as it masks the words and tone, sometimes irretrievably. We also want the speech for user interfaces to be concise and unambiguous, yet flexible. This places a sophisticated set of demands on even these “simple” voice command UIs:

  • Command sets that comprehensively cover all relevant UI operations, sometimes including obscure ones.
  • Coverage of multiple phrases for each operation, for example: “On”, “Turn on”, “Turn on TV”, “Turn on the TV”, “Turn on television”, “Turn on the television” (and maybe even “Turn on the damn TV”). These lists can get pretty long.
  • Tolerance of background noise, competing voices and, most especially, the noise of the device itself – a particular issue for music and video output devices. In some cases, the device can know what interfering noise it is generating and attempt to subtract it out of the incoming speech, but complex room acoustics make this surprisingly tricky.
  • Ability to distinguish commands directed to a particular device from similar commands for other voice-triggered devices in the room.
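
The phrase-coverage demand above can be sketched as a simple normalization table – many surface phrasings mapping to one canonical command. This is only an illustrative sketch, not any vendor’s actual API; all names here are hypothetical:

```python
# Illustrative command-phrase table: many phrasings per UI operation,
# inverted once into a flat lookup for fast matching at recognition time.

COMMAND_PHRASES = {
    "power_on": [
        "on", "turn on", "turn on tv", "turn on the tv",
        "turn on television", "turn on the television",
    ],
    "volume_down": ["volume down", "turn volume down", "lower the volume"],
}

# Invert the table so recognition-time lookup is a single dictionary hit.
PHRASE_TO_COMMAND = {
    phrase: cmd for cmd, phrases in COMMAND_PHRASES.items() for phrase in phrases
}

def match_command(utterance: str):
    """Return the canonical command for a recognized utterance, or None."""
    return PHRASE_TO_COMMAND.get(utterance.lower().strip())
```

Even this toy version shows why the lists get long: every added phrasing is another table entry, and every device needs its own table.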

The dilemma of multiple voice-triggered devices in one room is particularly intriguing. Devices often use a trigger phrase (“Hey Siri” or “Alexa”) to get the attention of the system, before a full set of commands or language can be recognized. This may be particularly practical for cloud-based systems that don’t want or need to be always listening by streaming audio to the cloud. Listening for a single wake-up phrase can take significantly less compute than always listening for a whole vocabulary of phrases. However, it does increase the human speech effort to get even the simplest service. Moreover, most device users don’t really want to endure the essential latencies – 1-2 seconds for the simplest commands – inherent in cloud-based recognition.
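
The trigger-phrase pattern amounts to a tiny two-stage gate: a cheap always-on detector listens for one phrase, and the expensive recognizer (local or cloud) sees audio only after a trigger. A minimal illustrative sketch, with hypothetical names:

```python
# Minimal sketch of a wake-phrase gate. In a real system the "cheap path"
# would be a small always-on keyword-spotting network, not string matching.

class TriggerGate:
    def __init__(self, wake_phrase: str):
        self.wake_phrase = wake_phrase.lower()
        self.armed = False          # True once the wake phrase is heard

    def hear(self, utterance: str):
        """Feed one utterance; return a command if one is accepted."""
        text = utterance.lower().strip()
        if not self.armed:
            # Cheap path: compare against one phrase, stream nothing.
            if text == self.wake_phrase:
                self.armed = True
            return None
        # Expensive path: hand the next utterance to the full recognizer.
        self.armed = False
        return text

gate = TriggerGate("hey tv")
assert gate.hear("turn it up") is None          # ignored: gate not armed
assert gate.hear("hey tv") is None              # wake phrase arms the gate
assert gate.hear("turn it up") == "turn it up"  # now passed through
```

The structure makes the latency cost visible too: every command now takes two utterances instead of one.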

As the number of devices in a space increases, it will be more and more difficult to remember all the trigger phrases and the available set of commands for each device. In theory, a unified system could be created, in which commands are recognized via a single rational interface, and then routed to the specific gadget to be controlled. While Amazon or Google might be delighted with this approach, it seems somewhat unlikely that all potential voice-triggered devices will be built or controlled by a single vendor. Instead, we are likely to see some slightly-contained chaos, in which each device vendor dispenses with wake words for local devices and attempts to make its commands as disjoint as possible from the commands for other devices. In cases of conflict – e.g. when both the music player and TV support a command “Turn volume down” – the devices will work from context, and may even signal to one another that a relevant command has been captured and acted on. (And rather than relying on a network handshake for synchronization, the devices may signal audibly to indicate a successful command capture. Both other devices and the user may like discrete notification.)

One great challenge of ubiquitous speech UIs is endowing each device with sufficient smarts. Most devices need far less than the full continuous speech recognition needed for transcription, but even fairly simple devices will likely need a vocabulary of tens of phrases, often comprising hundreds of words. Accurately spotting all those words and phrases, especially in a noisy environment with multiple speakers and the audio blaring, is tricky, especially given reasonable expectations for accuracy. We simply don’t want to have to repeat ourselves. This means running fairly sophisticated recognition neural networks, either to pick out words for matching against command patterns or to directly recognize entire command phrases. Fortunately, neural network inference hardware is getting dramatically more efficient. Creators of neural network processors and accelerators for embedded systems are now routinely touting efficiencies above 1T operations per second per watt, often driven by the demands of computer vision. Short command recognition is likely to take 3-5 orders of magnitude less compute than that. This implies that as neural network-specific acceleration grows common in embedded platforms, we may well see rich speech interfaces implemented in microwatts of power. In addition, we are likely to see hybrid systems that follow a 90/10 rule – 90% of commands are recognized and acted on locally, to minimize latency, computing cost and network bandwidth consumption, while the most obscure and difficult phrases are passed on to the cloud for interpretation. A well-designed hybrid system will have the latency of local and the sophistication of cloud. All this is profoundly good news for users, who hate to get up off the couch or to look for the remote!
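
The power claim is easy to check on the back of an envelope. Assuming, illustratively, that vision-class inference needs on the order of 1 trillion operations per second, that accelerators deliver the quoted 1T operations per second per watt, and that command recognition needs 3-5 orders of magnitude less compute:

```python
# Back-of-envelope check of the low-power claim. Both constants below are
# assumptions for illustration, not measured figures.

VISION_OPS_PER_SEC = 1e12       # ~1 TOPS for vision-class inference
OPS_PER_SEC_PER_WATT = 1e12     # 1 TOPS/W accelerator efficiency

for orders_less in (3, 4, 5):
    command_ops = VISION_OPS_PER_SEC / 10 ** orders_less
    watts = command_ops / OPS_PER_SEC_PER_WATT
    print(f"{orders_less} orders less -> {watts * 1e6:.0f} microwatts")
```

So the optimistic end of the range lands in the tens of microwatts, and even the conservative end is around a milliwatt – either way, far below the power budget of the devices in question.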

In the last couple of years, handwringing over the potential downsides of AI has become a surefire attention getter. Some of the concerns are real – we should be worrying about implicit bias in training sets, about the fragility and incomprehensibility of systems in mission-critical situations, and about the potential economic disruption from rapid change in some classes of work. I am less worried about the most apocalyptic “rise of the robot overlords” scenarios. In fact, I see the end of our fifty years of forced learning to type, swipe, mouse and point. Perhaps instead of computers forcing us to learn their preferred interface methods, we can teach them ours. Perhaps we can now finally throw off our robot overlords!

The Latest Cognite Ventures Deep Learning Startup List

Last year at this time, AI and deep learning were very hot. Twelve months later, they are still very hot ;-). But the market has certainly matured. The startups are refining their technology ambitions and market segments to leverage deep learning with greater thoroughness and precision. Investors are getting more sophisticated in separating generic AI from true deep learning companies. The broader public is seeing the shift from isolated examples of extraordinary results (the AlphaGo board game breakthrough, autonomous vehicle demos) to more widespread deployment of deep learning applications (cloud-based natural language translation, automated photo labeling). So it is time to revisit the Cognite 300 Deep Learning Startup list.

Amidst starting and funding companies like Babblabs, I have also been continuously updating my list of the most interesting, most real deep learning startups I can find. I have gotten better geographical balance, especially by getting a better handle on the large cluster of deep learning startups in Israel. I have worked to add new companies that meet my standard criteria – substantial knowledge, use and dependence on deep learning in the product – and to cull the small handful that have fallen off the radar, or were somewhat miscategorized as deep learning companies.

A few observations may be worth adding:

  • Funding for deep learning seems to continue to be robust. This is especially true in China, where the level of enthusiasm for AI and the government-encouraged tech investment climate are combining for a feeding frenzy over the relatively small number of good startups. Valuations appear to be climbing to the sky. That’s fun in the short term for entrepreneurs, but the inevitable return of gravity will be less fun.
  • Investors all over the world are opening their wallets for AI chip startups in a way not seen in more than a decade. Startups recognize how the extraordinary computational intensity of deep learning training and inference opens the door to massive architectural innovation. The rate of improvement in absolute performance and energy efficiency is dramatic, and will likely follow a super-Moore’s-Law trajectory for at least a couple of years. For better or for worse, however, even an architect finding a theoretical factor-of-a-hundred gain may discover it is simply not enough! ALL the rival platforms are finding big gains in neural-network-specific architectures. The real test will be in building the software systems and applications to exploit this cacophony of new silicon. The Cognite list now includes 22 chip startups worldwide (a third of them in China) and I am also following 5 or 6 more that are still in stealth mode. It is not premature to worry about over-saturation of the AI chip sub-segment.
  • Startups across all the application segments are getting significantly more savvy about the technology and the positioning. It is still the case that putting the “AI” label on a tech startup is just prudent – AI is such a broad term that almost any company that processes data can make a claim. The deep learning or neural network subset of AI is much narrower and better defined. Nevertheless, more teams really do understand the technology better, thanks to the widespread embrace of the great open-source tools, massive on-line training opportunities, and the surprising agility in universities to respond to student interest and industry demand. So more teams are really able to do something real with deep learning and to think intelligently about how to apply it to myriad valuable application problems. This is making the task of identifying and filtering startup candidates more difficult: sophisticated deep learning will become increasingly standard – and assumed – as table stakes in many application segments.

I have now let the list grow to 350 companies in order to maintain consistency, and to avoid dropping strong deep learning players. The infographic I designed last June is now out of date, so I am releasing a new improved design, now highlighting the geographical distribution of the companies.

Here are the companies added to the list since the last edition:

Beyond Verbal
Cognitive ID
Deep Learning Robotics
Deep Solutions
Deep Trading
Element AI
Fifth Dimension
Gyrfalcon Technology
Jungo Connectivity
Magentiq eye
Mod9 Technologies
Morpheus Labs
OrCam Technologies
Prisma Labs
Soteria Intelligence
Voyager Labs

And here are the handful dropped:

Blue River Technology

* a Cognite Ventures portfolio company



Speech! Speech! Speech!

New technology is a source of excitement at any time, but we don’t live at just “any time”. Right now, we are experiencing an era of particularly rapid evolution of computing technology, and a period of particularly dramatic evolution of new business models and real-world applications. The blanket term, “AI”, has captured the world’s imagination, and it is tempting to dismiss much of the breathless enthusiasm – and doomsaying – as just so much hype. While there is a large dose of hype circulating, we must not overlook the very real and very potent emergence of deep learning or neural network methods as a substantially fresh approach to computing. The pace of improvement of algorithms, applications and computing platforms over just the past five years shows that this approach – more statistical, more parallel, and more suitable for complex, essentially ambiguous problems – really is a big deal.

Not surprisingly, a great deal of today’s deep learning work is being directed at problems in computer vision – locating, classifying, segmenting, tagging and captioning images and videos. Roughly half of all deep learning startup companies are focused on one sort of vision problem or another. Deep learning is a great fit for these problems. Other developers and researchers have fanned out across a broad range of other complex, data-intensive tasks in modeling financial markets, network security, recruiting, drug discovery, and transportation logistics. One domain is showing particular promise: speech processing. Speech has all the key characteristics of big data and deep complexity that suggest potential for neural networks. And researchers and industry have made tremendous progress on some major tasks like automatic speech recognition and text-to-speech synthesis. We might even think that the interesting speech problems are all getting solved.

In fact, we have just scratched the surface on speech, especially on the exploitation of the combination of audio signal processing and neural network methods to extract more information, remove ambiguity and improve quality of speech-based systems. Deep networks can follow not just the sounds, but the essential semantics of the audio trace, providing powerful means to overcome conflicting voices, audio impairments and confusion of meaning. The applications of improved speech understanding extend well beyond the applications we know today – smart speakers, cloud-based conversation bots and limited vocabulary device command systems. We expect to see semantic speech processing permeate into cars, industrial control systems, smart home appliances, new kinds of telephony and a vast range of new personal gadgets. We fully expect to see lots of new cloud APIs for automated speech processing services, new packaged software and new speech-centric devices. Speech is so natural and so accessible for humans, that I can predict it will be a preferred interface mechanism for hundreds of billions of electronic systems over time.

Ironically, the entrepreneurial world is not pursuing speech opportunities in the way it has been chasing vision applications. In Cognite Ventures’ list of the top 300 deep learning startups, for example, about 160 are focused on vision, but only 16 on speech (see www.cogniteventures.com). While vision IS a fertile ground, speech is equally fertile. Moreover, most of the innovations in new deep learning network structures, reinforcement learning methods, generative adversarial networks, accelerated neural network training and inference chips, model distillation and new intuitive training frameworks apply to speech as completely and easily as to vision tasks.

This convergence is also true when we look at new hardware platforms. The computing patterns and the essential data types for classical vision functions and for classical audio functions have long been quite distinct. Vision had lots of parallelism, constantly exploited spatial locality in multiple dimensions, worked largely on 8b and 16b data, and often needed throughput of hundreds of billions of operations per second. Audio algorithms were more sequential, worked on 16b, 24b and 32b data, and typically used only a few billion operations per second in the most demanding applications. Today, however, both vision and speech processing are adopting very similar neural network inference methods, so they can use much the same hardware – the network really doesn’t care whether it is operating on voice samples or image samples. These methods produce better and better results as the network dimensions (and training data sets!) scale up, so that real-time vision problems can now consume trillions of operations per second. Clever architects and silicon platform developers are inventing ways to deliver these extraordinary operation rates, at efficiencies easily 10x better than conventional DSPs and 100x better than conventional CPUs. A dramatic step-up in available computing throughput and efficiency for speech computing is the happy and unexpected by-product of this vision computing breakthrough!

So I fully anticipate an acceleration in speech applications over the next couple of years, just as we’re enjoying today in vision applications. The convergence of algorithmic innovation in deep learning for speech and language, new understanding of the interaction between AI and signal processing, the huge increase in efficient and available speech computing bandwidth, and the expanding sphere of applications in phones, cars, IoT and the cloud all suggest a period of particularly rapid change.

What does a $5 camera cost?

An envelope is a powerful tool, especially the back of the envelope. It enables simple calculations that often illuminate some important, complicated problem or opportunity. We have just that kind of problem in working to understand the structure of vision processing systems, especially the key question of where the processing power for vision will live. Will it live in the camera itself? Will it live in the cloud? Will it live somewhere in between, on a local network or at the edge of the cloud? Issues like this are rarely determined by just one factor. It’s not just “what’s the fastest” or “what’s the cheapest”. A rich mix of technical and economic forces drives the solutions.

One of the most fundamental forces in vision is ever-cheaper cameras. When I checked a couple of weeks ago I could buy a decent-looking security camera, with high resolution lens and imager, IR LED illumination, processing electronics, power supply and waterproof housing on Amazon for $11.99.

The price has been falling steadily – when I checked three months ago, the entry price was about $18. We can reasonably expect that the price will continue to fall over the next couple of years – the $5 security camera cannot be far off.

The incredible shrinking camera cost is the result of classic economies of scale in CMOS semiconductors and assembly of small electronic gadgets. That scaling is inspired by and reinforces huge increases in the volume of image sensor shipments, and in the installed base of low-cost cameras. Using sensor volume data from SemiCo (2014), and the assumption of a three-year useful field life for a camera, you can see the exponential growth in the world’s population of cameras of all kinds.

I show the camera population alongside the human population to reinforce the significance of the crossover – more cameras capturing pixels than there are people to look at them.

I’ve written before about the proliferation of cameras in general, but now I want to get out my envelope and do some further calculations on the key question, “What does a $5 camera cost?” or, more accurately, “What’s the cost of ownership of a $5 camera over a three-year lifetime?”

The answer, as you might already guess, is “More than $5.” Let’s break down some of the factors:

  1. The camera takes power. The $12 ZOSI camera above takes about 6W; I’ll guess that my near-future $5 camera will cut power too, so I’ll estimate 3W. That’s certainly not low compared with the low-power CMOS sensors found in mobile phones. As a rule of thumb, electricity in the US costs about $1 per watt per year, so we get a three-year power cost of about $9 for our $5 camera.
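
The arithmetic behind that $9 figure, using the rule of thumb above:

```python
# Three-year electricity cost for the camera, using the rough rule of thumb
# that continuous US power draw costs about $1 per watt per year.

WATTS = 3.0                  # estimated draw of the near-future $5 camera
DOLLARS_PER_WATT_YEAR = 1.0  # US electricity rule of thumb
YEARS = 3

power_cost = WATTS * DOLLARS_PER_WATT_YEAR * YEARS
print(f"${power_cost:.0f}")  # $9
```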

If we did all the image processing locally, $5 + $9 = $14 might be a good total for the cost of owning that camera. And you can imagine some mobile gadgets and IoT applications – smart doorbells and baby monitors – where 100% local vision processing is adequate (or even preferred). But what if we want to centralize the processing and take advantage of cloud data? What if we want to push the stream up to the datacenter? Then we have some more costs to consider:

  2. The camera needs a network connection. Local cabling or Wi-Fi may have only capital costs, but almost all Internet access technologies cost real money. It turns out that Internet access costs vary ENORMOUSLY. There are three basic cost categories.
    1. If you’re a residential customer in an area with fiber to the home, you’re in great shape – you can get costs as low as $0.10 per TB (10Gbps line for $300 from US Internet).
    2. If you’re a residential customer using cable or DSL, you’re going to pay $8-20 per TB (based on AT&T and Comcast pricing).
    3. If you’re a business customer, and have to buy a commercial T1 or T3 connection, you’re going to pay a lot – probably more than $100 per TB, maybe a lot more. (Based on broker T1Shopper.com)

Of course, I am glossing over differences in quality of service, but let’s just break it down into two categories – fiber at $0.10 per TB and non-fiber at $10 per TB. Interestingly, it doesn’t appear that DSL, cable and T1 connections are dropping much in price. Clearly, that will happen only in areas where a much higher capacity option like fiber becomes available.

  3. The camera needs storage in the cloud. Almost any application that does analysis of a video stream needs a certain amount of recall, so users can apply further study to events of interest. In some applications, the recall time may be quite short – minutes or hours; in others, there may be a need for days or weeks of storage. We’ll look at two scenarios to span the range – a rolling one-day buffer of video storage and a rolling one-month buffer. The cost of storage is a little tricky to nail down. Amazon is now offering “unlimited” storage for $5/month, but that’s just for consumers, not relevant to AWS cloud compute customers. AWS S3 storage currently costs about $25/TB-month for standard storage, and $5/TB-month for “glacier storage”, which may take minutes to hours to access. Glacier storage seems appropriate for the cases where a month of recall is needed. Just as the camera gets cheaper over time, we’d expect the storage to get cheaper too, so we’ll estimate $12.50/TB-month for standard storage and $2.50/TB-month for glacier storage.
  4. The camera needs computing in the cloud. The whole point of moving the camera stream to the cloud is to be able to compute on it, presumably with sophisticated deep learning image understanding algorithms. There is no fixed standard for the sophistication of computation on video frames, but we can set some expectations. In surveillance video, we anticipate that we want to know what all the major objects in the frame are, and where they are located – in other words, object detection. This is a harder problem than simple classification of a single object in each image, as in the most common ImageNet benchmark, and it gives important clues on the progress of activity in the video stream. The best popular algorithm for this, YOLO (You Only Look Once) by Joseph Redmon, requires about 35 GFLOPS of computation per frame. We can get a reasonable picture of the cost by looking at AWS compute offerings, especially the g3 GPU instances. John Joo did some good estimates of the cost of AlexNet inference on AWS g3.4xlarge instances – 3200 inferences per second on the Tesla M60. YOLOv2 is about 25x more compute intensive, so we’d expect a cost of roughly $2.30 per million frames, based on AWS g3.4xlarge pricing of $1.14/hour. Over time, we expect both the algorithms to improve – fewer compute cycles for the same accuracy – and the compute to get more economical. We will again assume a factor of two improvement in each of these, so future YOLO inference on AWS GPUs might cost about $0.58 per million frames.
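
That per-frame cost estimate can be reconstructed from the stated assumptions (3200 AlexNet inferences/s on a Tesla M60, YOLOv2 at ~25x the per-frame compute, $1.14/hour instance pricing):

```python
# Rough reconstruction of the per-million-frames cloud compute cost.
# All inputs are the figures quoted in the text above.

ALEXNET_FPS = 3200.0     # AlexNet inferences/s on one Tesla M60 (g3.4xlarge)
YOLO_RATIO = 25.0        # YOLOv2 compute per frame relative to AlexNet
PRICE_PER_HOUR = 1.14    # g3.4xlarge hourly price

yolo_fps = ALEXNET_FPS / YOLO_RATIO                       # 128 frames/s
cost_per_million = PRICE_PER_HOUR / (yolo_fps * 3600) * 1e6
print(f"${cost_per_million:.2f} per million frames")      # $2.47 per million frames

# Assuming the 2x algorithm and 2x price improvements discussed above:
future_cost = cost_per_million / 4                        # ~$0.62 per million
```

This lands within about 10% of the $2.30 and $0.58 figures in the text; the small differences are presumably rounding in the original estimates.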

Now let’s try to put it all together, for a comparison of cloud computing costs relative to our 3-year camera cost of $14. The last key variable is the data rate coming out of the camera. I will assume an 8Mpixel camera capable of producing 60 frames per second. This is lower resolution than what is found in high-end smartphones, so it qualifies as a commodity spec, I think. Let’s compute the costs in four scenarios:

  1. The camera streams raw pixels at 8Mpixel * 2 bytes/pixel * 60 fps into the cloud. While it is unlikely that we would adopt raw pixel streaming, it still provides a useful baseline.
  2. The camera uses standard H.264p60 video compression on the 4K UHD stream. This reduces the video bandwidth requirement to about 40 Mbps.
  3. We assume some filtering or frame selection intelligence in the camera that reduces the frame rate, after H.264 compression, to 1 frame per second.
  4. We assume more filtering or frame selection intelligence in the camera to reduce the frame rate to 1 frame per minute.


Data stream     Network (fiber)   Network (conv.)   Storage (1 day)   Storage (1 month)   Compute   Total
8Mp raw         $9,500            $950,000          $79,000           $476,000            $3,300    $92,000-$1,430,000
H.264 p60       $47               $4,700            $400              $2,400              $3,300    $3,700-$10,000
H.264 @ 1 fps   $1                $79               $7                $39                 $55       $62-$173
H.264 @ 1 fpm   $0                $1.30             $0.11             $0.66               $0.91     $1-$3
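
The table can be roughly reproduced from the stated assumptions. A sketch, using $0.10/TB fiber and $10/TB conventional network pricing, the current S3 rates of $25/TB-month (standard) and $5/TB-month (glacier) – which the storage columns appear to reflect – and the future compute estimate of $0.58 per million frames; results land within roughly 10% of the table entries:

```python
# Back-of-envelope model of three-year cloud costs per camera.
# All rates are the assumptions stated in the text, not measured prices.

SECONDS = 3 * 365 * 24 * 3600        # three years of continuous operation
TB = 1e12                            # bytes per terabyte

def three_year_costs(bytes_per_sec, frames_per_sec):
    volume_tb = bytes_per_sec * SECONDS / TB     # total data shipped, TB
    day_tb = bytes_per_sec * 86400 / TB          # rolling one-day buffer, TB
    return {
        "net_fiber":  volume_tb * 0.10,          # fiber, $0.10/TB
        "net_conv":   volume_tb * 10.0,          # DSL/cable, $10/TB
        "store_1day": day_tb * 25 * 36,          # standard S3, 36 months
        "store_1mo":  day_tb * 30 * 5 * 36,      # glacier, 36 months
        "compute":    frames_per_sec * SECONDS / 1e6 * 0.58,
    }

raw  = three_year_costs(8e6 * 2 * 60, 60)         # uncompressed 8Mp, 16b, 60 fps
h264 = three_year_costs(40e6 / 8, 60)             # 40 Mbps H.264 p60
fps1 = three_year_costs(40e6 / 8 / 60, 1)         # selected down to 1 frame/s
fpm1 = three_year_costs(40e6 / 8 / 3600, 1 / 60)  # selected down to 1 frame/min
```

For example, the H.264 p60 row comes out as about $47 fiber networking, $389 one-day storage and $3,292 compute, matching the table to within rounding.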

The analysis, rough as it is, gives us some clear takeaways:

  • The bandwidth and storage costs of raw uncompressed video are absurd, so we’ll never see raw streaming of any significance
  • Even normally compressed video is very expensive – $4K to $10K over the three year life of our $5 camera. Very few video streams are so valuable as to justify this kind of cost.
  • Smart selection down to one frame per second helps cloud costs significantly, but the cloud costs dwarf the camera costs. Commodity solutions are likely to push to lower costs, hence lower cloud frame rate.
  • Selection down to one frame per minute makes cloud costs insignificant, even relative to our cheapo camera. It may not be necessary to go this far to leverage the cost reductions in the hardware.
  • We might reasonably expect that the sweet spot is 5 to 20 frames per minute of cloud-delivered video, with the high end most appropriate where cheap fiber to the home (or business) is available, and the low-end more appropriate with conventional network access.
  • The total value of these cameras is economically significant, when we consider the likely deployment of about 20 billion cameras by 2020. While the actual business model for different camera systems will vary widely, we can get a rough order of magnitude by just assuming the cameras are streaming just 10 frames per minute, half over fiber, half over conventional network access. The three-year cost comes to roughly $280B for the cameras and $300B for the network and cloud computing, or about $190B per year. Not chump change!

The big takeaway is that the promise of inexpensive cameras cannot be realized without smart semantic filtering in or near the camera. Cameras, especially monitoring cameras, need to make intelligent choices about whether anything interesting is happening in a time window. If nothing meets the criteria of interest, nothing is sent to the cloud. If activities of possible interest are detected, a window of video, from before and after the triggering action, needs to be shipped to the cloud for in-depth analysis. The local smart filtering subsystem may also include in the video trigger package a preliminary assessment and summary of activities outside the window of interest.

You might reasonably ask, “Won’t putting more intelligence in the cameras make them much more expensive?” It is clear that these low-end cameras are designed for minimum cost today, with the simplest viable digital electronics for interfacing to the network. But just as video compression engines have become commoditized, cheap and standard in these cameras, I expect neural network processors to go the same route. For example, running standard YOLOv2 at 60 fps requires about 2.3 TFLOPs or 1.2 T multiply-adds per second. That would fit comfortably in 3 Tensilica Vision C5 cores and less than 5mm2 of 16nm silicon including memory.   That probably translates into an added silicon manufacturing cost of less than 50 cents. So it might push up the total camera and power cost by a few dollars, but not enough to shift the balance towards the cloud. After all, doing YOLOv2 on the full 60 fps in the cloud can cost thousands of dollars.
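
As a quick check of that in-camera compute budget, using the ~35 GFLOPS-per-frame YOLO figure quoted earlier at 60 fps:

```python
# In-camera compute requirement for full-rate object detection.
# Uses the ~35 GFLOPS/frame figure from the cloud-cost discussion above;
# the text's ~2.3 TFLOPS presumably reflects a slightly different
# per-frame estimate, but the ballpark is the same.

GFLOPS_PER_FRAME = 35.0
FPS = 60

tflops = GFLOPS_PER_FRAME * FPS / 1000    # 2.1 TFLOPS
tmacs = tflops / 2                        # one multiply-add = 2 FLOPs
print(f"{tflops:.1f} TFLOPS, {tmacs:.2f} T multiply-adds/s")
```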

This model also suggests that the algorithms on both ends – at the camera and in the cloud – will need to adapt to specific application requirements and setting, and evolve significantly over time as better trigger and analysis methods emerge. This is the essence of a smart distributed system, with global scope, continuous evolution of functionality, and high economic value.

I should finish with a few caveats on this little analysis. First, it is full of simplifying assumptions about cameras, networks, storage needs and compute intensity. The cost of network access is particularly tricky to nail down, not least because wired data access is unmetered, so that the marginal cost of one more bit is zero until you run out of capacity – then the marginal bit is very expensive. Second, I have focused on costs in the U.S. Some other regions have better access to low-cost fiber, so they may be able to leverage video cloud computing better. Other regions have significantly more expensive network access, so the device-cloud tradeoff may shift heavily towards smarter device filtering.

Finally, the uses of cameras are enormously varied, so it is naïve (but instructive) to boil every video analysis application down to object detection. There may be significantly heavier and lighter compute demands in the mix.

This analysis focuses on wired cameras. However, cameras in smart phones make up a meaningful fraction of the expected 20B installed image sensors. Wireless bandwidth to the cloud is dramatically more expensive than wired bandwidth. The going price is very roughly $10 per GB in high-bandwidth plans, or 1,000 times the price of DSL and cable and 100,000 times the price of fiber. If, hypothetically, you somehow managed to stream your 8Mp 60 fps camera to the cloud as raw pixels continuously for three years, it would cost you about a billion dollars ;-). Compressing it to 40Mbps H.264p60 drops the cost to a mere five million dollars. Of course, wireless data costs may well come down, but not enough to make continuous streaming from phones attractive. Especially smart compression is going to be needed for any application that wants to rely on continuous streaming over significant periods.
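
Those wireless figures check out on the same envelope, assuming roughly $10 per GB:

```python
# Three-year wireless streaming cost at ~$10/GB, for the same two data rates
# used in the wired analysis above.

GB = 1e9
SECONDS = 3 * 365 * 24 * 3600    # three years
PRICE_PER_GB = 10.0              # rough high-bandwidth-plan price

raw_gb = 8e6 * 2 * 60 * SECONDS / GB    # 8Mp, 2 bytes/pixel, 60 fps
h264_gb = 40e6 / 8 * SECONDS / GB       # 40 Mbps H.264 p60

print(f"raw: ${raw_gb * PRICE_PER_GB / 1e9:.2f}B")    # ~$0.91 billion
print(f"h264: ${h264_gb * PRICE_PER_GB / 1e6:.1f}M")  # ~$4.7 million
```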

So that’s the story of a cheap camera on Amazon and the back of an envelope.

The Top Deep Learning Startups in Israel

I’ve been pondering a small mystery for months. I have finally resolved it. The mystery was this: “Where are all the Israeli deep learning startups?” Ten months ago I started putting together the Cognite Ventures list of the top worldwide deep learning startups. I drew on lots of existing lists of AI startup activity around the world, especially surveys published in the US, UK and China. I also searched the on-line startup-funding tracking site, Crunchbase, to get a systematic look at companies that self-report as AI startups. I filtered through thousands of startup descriptions and visited almost one thousand startup websites to get the basics of their product and technology strategy. And I talked to colleagues who also track the AI explosion.

Much to my surprise, only half a dozen startups based in Israel made it through my search and filtering – just these few:

I wondered “Why?”

  • Was it because Israeli startups were focused on other areas?
  • Was it because startups were doing advanced AI, but weren’t publicly highlighting use of deep learning methods?
  • Were the companies slipping through my search net?

A few weeks ago, Daniel Singer, an independent market analyst in Israel, published a detailed and extremely useful info-graphic showing the logos of more than 420 startups in Israel associated with Artificial Intelligence.  See the graphic

His analysis put the companies into eight major categories and a total of 39 sub-categories. The scale of the list certainly suggests a great deal of general AI activity, and perhaps a lot of action in that most interesting subset, deep learning.

Over the past week, I have visited the websites of every single one of the 420+ companies on Singer’s chart. Using company mission statements, product descriptions, blogs, and job postings, plus additional information from Crunchbase and YouTube videos, I have worked to assess the product focus and technical reliance on hard-core deep learning methods of these companies – the same methodology used for the entire “Cognite 300 Deep Learning Startup” list. Happily, I have identified a significant number – 34 – to add to the Cognite List (see below). Of course, I now recognize that these companies were slipping through my net because of lower press visibility in the US, less reporting on Crunchbase, and smaller typical start-up size, all of which make them a little harder to see.

This is an interesting, and I hope important, group of companies, with significant clusters in embedded vision for autonomous vehicles, human-machine interface and robotics, in cloud-based security and surveillance, in marketing, and in medical care. These companies have invested to understand the impact of neural networks on end markets, and have built products that rely heavily on harnessing the combination of large data sets for training and the opportunity to extract hidden patterns from images, transactions, user clicks, sounds and other massive data streams. I suspect that companies that understand the implications of deep learning first will enjoy comparative advantages.

Some of the more intriguing ones on that list include:

  • OrCam Technologies builds smart glasses for the visually impaired that can recognize individual friends and family, read text out-loud and warn users of dangers.
  • GetAlert does modern surveillance and monitoring using deep neural networks in the cloud, but adds extraction of 3D structure from image streams for improved action classification.
  • Vault predicts the financial success of film productions from deep analysis of scripts and casts.
  • Augury uses vibration and ultrasonic sensors to monitor mechanical systems to get early warning of slight changes that signal developing malfunctions.

Of course, deep learning is not the only form of “AI” that will legitimately contribute to market disruption and start-up traction. AI covers a lot of techniques, including other forms of statistical analysis and machine learning, natural language processing in chat-bots and other structured dialog systems, and other methods for finding and exploiting patterns in big data. For many companies their “less deep” learning is appropriate to their tasks and will lead them to success too. I suspect, though, that their success will often be driven by other factors – end market understanding, good UI integration, key business relationships – more than by AI technology mastery.

It is interesting to look for patterns in Singer’s list of 420+ companies, for it gives another useful view of what’s happening in the Israeli startup scene. Several large clusters are particularly noticeable:

  • Chat-bots for engaging with customers more continuously and adaptively and coaching sales people
  • All kinds of e-commerce optimization – enhanced bidding, ad campaign optimization, pricing and incentives
  • Fraud protection in ecommerce including ad fraud
  • Predictive maintenance in manufacturing and operations
  • IT infrastructure management especially for cyber security
  • All sorts of brand management and promotion

While some of these strong areas tap into Israel’s historical tradition of technical innovation driven by defense needs, many are not clearly tied to vision and surveillance, wireless communication, signals intelligence or threat profiling. Instead they reflect the entrepreneurial drive to exploit huge global trends, especially in the rise of ecommerce and on-line marketing.

You might ask why one analyst identifies 420 AI start-ups and another only 40. Daniel Singer clearly wanted to explore the full scope of “smart” startups in Israel. He used a very broad definition of AI, including companies as diverse as one doing “connected bottle caps” [Water.IO], another doing a job search tool [Workey], and a third, frighteningly, automatic web content composition from a few keywords [Articoolo]. This expansive definition of AI is more inclusive, and reflects, in part, the enormous fascination among entrepreneurs, investors and the general public for all things related to AI. This broad definition subsumes many of the longer-term trends in big data analysis, ecommerce automation and predictive marketing.

I have taken a more selective view. Much of the enthusiasm for AI in the past three years has been specifically triggered by the huge, well-publicized strides in just one subdomain of AI – neural networks. This has been particularly striking in areas like computer vision, automated language translation and automatic speech recognition, but also in the most complex and ambiguous tasks in business data analysis. I have chosen to focus on companies that appear to be at the cutting edge of neural networks, using the most sophisticated deep learning methods. I believe this will be the greatest area of disruption of existing products, companies and business models.

So the broad view and the selective view are complementary, both giving windows into the lively entrepreneurial climate in Israel.

Does Vision Research Drive Deep Learning Startups?

What’s the relationship between academic research and entrepreneurship? Does writing technical papers help startups? Does government support of computer science research translate into commercial innovation? We’d all like to know!

I just spent a week at the Computer Vision and Pattern Recognition (CVPR) Conference in Honolulu. It is one of the three big academic conferences – along with Neural Information Processing Systems (NIPS) and the International Conference on Machine Learning (ICML) – that have embraced and proliferated cutting-edge research in deep learning. CVPR has grown rapidly, with attendance almost doubling annually over the past few years (to more than 5200 this year). More importantly, it has played a key role in encouraging all kinds of computer vision researchers to explore the potential of deep learning methods on almost every standing vision challenge.

So many teams submit worthwhile papers that the conference has adopted a format to expose as many people to as many papers as possible. Roughly 2500 papers were submitted this year, of which only about 30% were accepted. Even so, how can people absorb more than 750 papers? All the accepted papers get into poster sessions, and those sessions are unlike any poster session I’ve seen before. You often find two or three authors surrounded by a crowd of 25 or more, each explaining the gist of the work over and over again for all comers. Some notable papers are given a chance to be shared as a rapid-fire FOUR minute summary in one of many parallel tracks. And an even smaller handful of papers gets much bigger exposure – a whole TWELVE minute slot, including Q&A!

Remarkably, the format does work – the short talks serve as useful teasers to draw people to the posters. A session of short talks gives a useful cross-section of the key application problems and algorithmic methods across the whole field of computer vision. That broad sampling of the papers confirms the near-total domination of computer vision by deep learning methods.

You can find interesting interactive statistics on the submitted CVPR17 papers here and on the accepted papers here. Ironically, only 12% of the papers are assigned to the “CNN and Deep Learning” subcategory, but in reality almost every other category, from “Scene Recognition and Scene Understanding” to “Video Events and Activities”, is also dominated by deep learning and CNN methods. Of course, there is still a role for classical image processing and geometry-based vision algorithms, but these topics occupy something of a backwater at CVPR. I believe this radical shift in focus reflects a combination of the power of deep learning methods on the inherent complexity (and ambiguity) of many vision tasks, but deep learning is also a shiny new “hammer” for many research teams – they want to find every vision problem that remotely resembles a “nail”!

Nevertheless, CVPR is so rich with interesting ideas that the old “drinking from a fire hose” analogy fits well. It would be hard to do justice to the range of ideas and the overall buzz of activity there, but here are a few observations about the show and the content.

  • Academia meets industry – for recruiting. The conference space is built around a central room that serves as poster space, exhibition space and dining area. This means you have a couple of thousand elite computer vision and neural network experts wandering around at any moment. The exhibitors – Microsoft, NVidia, Facebook, Google, Baidu and even Apple, plus scores of smaller companies – ring the outside of the poster zone, all working to lure prospective recruits into their booths.
  • Creative applications abound. I was genuinely surprised at how broadly research teams are applying neural network methods. Yes, there is still work on the old workhorses of image classification and segmentation, but much of the work is happening on newer problems in video question answering, processing of 3D scene information, analyzing human action sequences, and using generative adversarial networks and weakly supervised systems to learn from limited data. The speed of progress is also impressive – some papers build on methods first published at this very CVPR 2017! On the other hand, not all the research shows breakthrough results. Even very novel algorithms sometimes show just minor improvements in accuracy over previous work, but improvements of a few percent add up quickly when rival groups release new results every few months.
  • Things are moving too fast for silos to develop. I got a real sense that the crowd was trying to absorb everything, not just focusing on narrow niches. In most big fields like computer vision, researchers tend to narrow their attention to a specific set of problems or algorithms, and those areas develop highly specialized terminology and metrics. At CVPR, I found something closer to a common language across most of the domain, with everyone eager to learn and adopt methods from others, regardless of specialty. That cross-fertilization certainly seems to be aiding the ferocious pace of research.
  • The set of research institutions is remarkably diverse. With so much enthusiasm for deep learning and its applications to vision, an enormous range of universities, research institutes and companies are getting into the research game. To quantify the diversity, I took a significant sample of papers (about 100 of the 780 published) and looked at the affiliations of the authors. I counted each unique institution on a paper, but did not count the number of authors from the same institution. In my sample, the typical paper had authors from 1.6 institutions, with a maximum of four institutions. I suspect that mixed participation reflects both overt collaboration and movement of people, both between universities and from universities into companies. In addition, one hundred unique institutions are involved in the one hundred papers I sampled. This means there are probably many hundreds of institutions doing world-class research in computer vision and deep learning. While the usual suspects – the leading tech universities – are producing more papers, the overall participation is extraordinarily broad.
  • Research hot spots by country. The geographical distribution of the institutions says a lot about how different academic communities are responding to the deep learning phenomenon. As you’d expect, the US has the greatest number of author institutions, with Europe and China following closely. The rest of Asia lags pretty far behind in sheer numbers. Here’s the makeup of our sample, by country – the tiny slices in this chart are Belgium, Denmark, India, Israel, Sweden, Finland, Netherlands, Portugal, Australia and Austria. Europe altogether (the EU with the UK ;-), plus Switzerland) produced about 28% of the papers, significantly more than China, but still less than the US.
  • Research hot spots by university. We can also use this substantial sample to get a rough sense of which specific universities are putting the most effort into this. Here are the top ten institutions worldwide for this sample of paper authors – with some key institutions in the UK (Oxford and Cambridge) and Germany (Karlsruhe and the Max Planck Institute) rounding out the next ten:

    All this raises an interesting question – what’s the relationship between academic research and startups in deep learning technology? Do countries with strong research also have strong startup communities? This is a question that this authorship sample, plus the Cognite Ventures deep learning startup database, can try to answer, since we have a pretty good picture of where the top computer vision startups are based, and we know where the research is based. In the chart below, I show the fraction of deep vision startups and the fraction of computer vision papers for the top dozen countries (by number of startups):

    The data appears to tell some interesting stories:

    • US, UK and China really dominate in the combination of research and startups.
    • US participation in computer vision research and startups is fairly balanced, though research actually lags a bit behind startup activity.
    • The UK actually has meaningfully more startup activity than research, relative to other countries. This may reflect good leveraging of worldwide research and a good climate for vision startups.
    • China is fairly strong on research relative to vision startups, suggesting perhaps some upside potential for the Chinese startup scene, as the entrepreneurial community leverages more of the local research expertise.
    • Though the numbers are not big enough to reach strong conclusions, it appears that research in Germany, France and Japan significantly exceeds any conversion of research into startups. I think this reflects fairly strong research traditions, especially in Germany, combined with a less developed overall startup culture and climate.


    Both the live experience of CVPR and the statistics underscore the real and vital link between research in computer vision and the emergence of a deep understanding of methods and applications for new enterprises. This simple analysis doesn’t directly reveal causality, but it is a pretty good bet that computer vision researchers often become deep learning founders and technologists, fueling the ongoing transformation of the vision space. More research means more and better startups.

Deep Learning Startups in China: Report from the Leading Edge

Everyone knows the classic Chinese curse, “May you live in interesting times”. Well, it turns out the Chinese origin of this pithy phrase is apocryphal – the British statesman Austen Chamberlain probably popularized the phrase in the 1930s and attributed it to the Chinese to lend it gravity. We do, however, live in interesting times, in no field better epitomized than deep learning, and in no location more poignantly than China.

I have just returned from a ten-day tour of Beijing, Shenzhen, Shanghai and Hangzhou, meeting with deep learning startups and giving a series of talks on the worldwide deep learning market. In the most fundamental ways, neither the technology, nor the applications, nor the startup process is so different from what you find in Silicon Valley or Europe, but the trip was full of little eye-openers about deep learning in China, and about the entrepreneurial process there. It reinforced a few long-standing observations, but also shifted my point of view in important ways.

The most striking reflection on the China startup scene is how much it feels like the Silicon Valley environment, and how it seems to differ from other Asian markets. First, there seems to be quite a bit of money available, from classic VCs, from industrial sponsors and even from US semiconductor companies – Xilinx and NVidia have investments in high-profile startups in China, and I’m sure other major players do too. Second, deep learning is very active, with much of the same “gold-rush” feeling I observe in the US. This contrasts with the Taiwan, Japan and Korea markets, where the deep learning startup theme is less developed, either because startups are less central to the business environment (Japan) or because the deep learning enthusiasm has not grown so intense (Taiwan, Korea). Ample funding and enthusiasm also mean rapid growth of significant engineering teams – the smallest company I saw had 25 people, the biggest about 400. California teams have evolved office layouts that look like Chinese ones – open offices without cubicle walls – and Chinese teams have adopted the California tradition of endless free food. We are not there yet, but we are closer than ever to a common startup culture spanning the Pacific.

Observation: The Chinese academic and industrial technical community is closely tuned into the explosion of activity in deep learning, and many companies are looking to leverage it in products. Baidu’s heavy investment in deep learning – with a research team of more than 1000 – is already well known. The number of papers on deep learning from Chinese universities and the interest level among startups are also very high. Overall, Chinese industry seems to be gradually shifting from a “cost-down” mindset – focused on taking established products and optimizing the whole bill-of-materials for lower cost – towards greater attention to functional innovation. A strong orientation towards hardware and system products remains: I have found many fewer pure-play cloud software startups in China than in the US or UK. Nevertheless, the original software content in these systems is growing rapidly. Almost every company I visited had polished and impressive demos of vehicle tracking, face recognition, crowd surveillance or demographic assessment.

Observation: Chinese startups are unafraid of doing new deep-learning silicon platforms. Quite a few of the software companies I visited are building or planning chips that capture their insights into neural network inference computation. Perhaps one in four Chinese startups is working towards silicon, while only one in 15 worldwide is doing custom silicon. One executive explained that Chinese investors really like to see the potential differentiation that comes from chips, and startups believe that committing to silicon actually helps secure capital. This is in stark contrast to current Silicon Valley wisdom – that investors flee at the mention of chip investment. This striking dichotomy reflects a combination of perceived lower chip development costs in China (because of lower engineering salaries, avoidance of bleeding-edge semiconductor technologies below 20nm, and smaller, niche-oriented designs) and the widespread belief that tying software to silicon protects software value. Ironically, silicon development is now strikingly rare among Silicon Valley startups, driven partly by the high costs and long timelines of chip products, and partly by the comparative attractiveness for investment of cloud software startups, where the upfront costs are so much less and countless new market niches seem to appear weekly.

Observation: The China startups are almost entirely focused on real-time vision and audio applications, with only a supporting role for cloud-based deployment. Cars, human-machine interface and surveillance are the standout segments. DJI, the world’s biggest consumer drone maker, uses highly sophisticated deep learning for target tracking and gesture control. Most of the companies doing vision have capability and demos in identifying and tracking vehicles, pedestrians and bicycles, which applies to both automotive driver assistance/self-driving vehicles and surveillance. Top startups in the vision space include Cambricon, DeepGlint, Deephi, Emotibot, Megvii, Horizon Robotics, Intellifusion, Minieye, Momenta, MorphX, Rokid, SenseTime, and Zero Zero Robotics. Audio systems are also a big area, with particular emphasis on automated speech recognition, including increasingly embedded real-time speech processing. Top startups here include AISpeech, Mobvoi, and Unisound.

Observation: China has a disproportionately large surveillance market, and correspondingly heavy interest in deep learning-based applications for face identification, pedestrian tracking, and crowd monitoring. China is already the world’s largest video surveillance market and has been among the fastest growing. Chinese suppliers describe usage scenarios of identifying and tracking criminals, but China does not have a serious conventional crime problem (both violent crime and property crime are well below US levels, for example). To some extent, “monitoring crime” is a code word for monitoring political disturbances, a deep obsession of the Chinese Communist Party. This is not the only driver for vision-based applications – face ID for access control and transaction verification are also important.

Over the course of ten days, I saw some particularly interesting startups:

  • Horizon Robotics [Beijing]: Horizon is a deep-learning powerhouse, led by Yu Kai. With 220 people, they are innovating across a broad front on vision systems, including smart home, gaming, voice-vision integration, and self-driving cars. They have also adopted a tight hardware-software integration strategy for more complete and efficient solutions.
  • Intellifusion [Shenzhen]: Intellifusion is a fairly complete video system supplier with close ties to the government organizations deploying public surveillance. They currently deploy their own servers with GPUs and FPGAs, but are moving increasing functionality into edge devices, like the cameras themselves.
  • NextVPU [Shanghai]: NextVPU is the youngest (about 12 months old) and smallest (24 people) of the startups I saw. They are pursuing AR, robotics and ADAS systems, but their first product, an AR headset for the visually impaired, is compelling in both a technical and a social sense. Their headset does full scene segmentation for pedestrian navigation and recognizes dozens of key objects – signs, obstacles and common home and urban elements – to help their users.
  • Deephi [Beijing]: Deephi is one of the most advanced and impressive of all the deep learning startups, with close ties to both the leading technical university in China, Tsinghua, and leading US research universities.   They have a particularly sophisticated understanding of what it takes to map leading edge neural networks into the small power, compute and memory resources of embedded devices, using world-class compression techniques. They are pursuing both surveillance (vision) and data-center (vision and speech) applications with a range of innovative programmable and optimized inference architectures.
  • Sensetime [Beijing]: Sensetime is one of the biggest and most visible software startups using deep learning for vision. They have impressive demos spanning surveillance, face recognition, demographic and mood analysis, and street view identification and tracking of vehicles. They are sufficiently competent and confident to have developed their own training framework, Parrots, in lieu of Caffe, Tensor Flow and the other standard platforms.
  • Megvii [Beijing]: Megvii is a prominent Chinese “unicorn” – a startup valued at US$1B+ – and is often known by the name of its leading application, Face++. Face++ is a sophisticated and widely used face ID environment, leveraging the Chinese government’s face database. This official and definitive database enables customer verification for transaction systems like Alibaba’s AliPay and DiDi’s leading ride-hailing system. They show an impressive collection of demos for recognition, augmented reality and “augmented photography”. Like many other Chinese companies, Megvii is moving functionality from the cloud to embedded devices, to improve latency, availability and security.
  • Bitmain [Beijing]: Bitmain is hardly a startup and is wildly successful in non-deep-learning applications, specifically cryptocurrency mining hardware. They have become the biggest supplier of ASICs for computational hashing, especially for Bitcoin, but are now spreading into rival currencies like Litecoin. Founded in 2013, they hit US$100M in revenue in 2014 and are on track to do US$500M this year. This revenue stream and profitability are allowing them to explore new areas, and deep learning platforms seem to be a candidate for further expansion.

Here’s a more complete list of top Chinese deep-learning startups:

  • 4Paradigm – Scaled deep learning cloud platform
  • AISpeech – Real-time and cloud-based automated speech recognition for car, home and robot UI
  • Cambricon – Device and cloud processors for AI
  • DeepGlint – 3D computer vision and deep learning for human & vehicle detection, tracking and recognition
  • Deephi – Compressed CNN networks and processors
  • Emotibot – A natural interaction interface between human and machine based on multi-modal
  • Face++ – Face recognition
  • Horizon Robotics – Smart home, automotive and public safety
  • ICarbonX – Individualized health analysis and prediction of health index by machine analysis
  • Intellifusion – Cloud-based deep learning for public safety and industrial monitoring
  • Minieye – ADAS vision cameras and software
  • Mobvoi – Smart watch with voice search using cloud
  • Momenta – AI platform for level 5 autonomous driving
  • MorpX – Commercializes computer vision and deep learning technologies for low-cost/-power platforms
  • Rokid – Home personal assistant – ASR + face/gesture
  • SeetaTech – Open source development platform to enable enterprise vision and machine learning
  • SenseTime – Computer vision
  • TUPU – Image recognition technology and services
  • tuSimple – Software for self-driving cars: detection and tracking, flow, SLAM, segmentation, face analysis
  • Unisound – AI-based speech and text
  • YITU Technology – Computer vision for surveillance, transportation and medical imaging
  • Zero Zero Robotics – Smart following drone camera

Of course, no one can claim to understand everything that’s happening in the vibrant Chinese startup community, least of all a non-native speaker. Nevertheless, everyone I spoke with in China validated this list of the top deep learning startups. Some were a bit surprised at the depth of the list, especially in identifying startups that were not yet on their radar. Both technically, and in exploring market trends, the China startup world is at the cutting edge in many areas. It bears close watching for worldwide impact.

The Cognite 300 Poster

Today I’m rolling out the Cognite 300 poster, a handy guide to the more focused and interesting startup companies in cognitive computing and deep learning.  I wrote about the ideas behind the creation of the list in an earlier blog posting:

Who are the most important start-ups in cognitive computing?

I have updated the on-line list every couple of months since the start of 2017, and will continue to do so, because the list keeps changing. Some companies close, some are acquired, some shift their focus. Most importantly, I continue to discover startups that belong on the list, so I will keep adding those, using approximately the same criteria of focus used for the first batch.

I should underscore that many potentially interesting companies haven’t gone on the list:

  • because it appears that AI is only a modest piece of their value proposition, or
  • because there is too little information on their websites to judge, or
  • because they are doing interesting work on capturing and curating big data sets (that may ultimately require deep learning methods)  but don’t emphasize the learning work themselves, or
  • because I just failed to understand the significance of the company’s offerings.

A few weeks ago, a venture capital colleague suggested that I should do a poster, to make the list more accessible and to communicate a bit of the big picture of segments and focus areas for these companies.  I classified the companies (alas, by hand, not with a neural network 😉) into 16 groups:

  1. Sec – Security, Surveillance, Monitoring
  2. Cars – Autonomous Vehicles
  3. HMI – Human-Machine Interface
  4. Robot – Drones and Robots
  5. Chip – Silicon Platforms
  6. Plat – Deep Learning Cloud Compute/Data Platform and Services
  7. VaaS – Vision as a Service
  8. ALaaS – Audio and Language as a Service
  9. Mark – Marketing and Advertising
  10. CRM – Customer Relationship Management and Human Resources
  11. Manf – Operations, Logistics and Manufacturing
  12. Sat – Aerial Imaging and Mapping
  13. Med – Medicine and Health Care
  14. Media – Social Media, Entertainment and Lifestyle
  15. Fin – Finance, Insurance and Commerce
  16. IT – IT Operations and Security

I’ve also included an overlay of two broader categories, Vision and Embedded.  Many of the 16 categories fall cleanly inside or outside embedded and vision, but some categories include all the combinations.  A few companies span two of the 16 groups, so they are shown in both.
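The grouping scheme above lends itself to a simple machine-readable form. Here is a hypothetical sketch (my illustration only – the company entries and field layout are invented, not Cognite’s actual data) of how the 16 group codes and the two overlay flags might be tagged and queried:

```python
# Hypothetical sketch of the Cognite 300 tagging scheme described above.
# The 16 group codes are from the list; the company entries below are
# invented for illustration only.

GROUPS = {
    "Sec":   "Security, Surveillance, Monitoring",
    "Cars":  "Autonomous Vehicles",
    "HMI":   "Human-Machine Interface",
    "Robot": "Drones and Robots",
    "Chip":  "Silicon Platforms",
    "Plat":  "Deep Learning Cloud Compute/Data Platform and Services",
    "VaaS":  "Vision as a Service",
    "ALaaS": "Audio and Language as a Service",
    "Mark":  "Marketing and Advertising",
    "CRM":   "Customer Relationship Management and Human Resources",
    "Manf":  "Operations, Logistics and Manufacturing",
    "Sat":   "Aerial Imaging and Mapping",
    "Med":   "Medicine and Health Care",
    "Media": "Social Media, Entertainment and Lifestyle",
    "Fin":   "Finance, Insurance and Commerce",
    "IT":    "IT Operations and Security",
}

# name -> (set of group codes, set of overlay flags); a company spanning
# two of the 16 groups simply carries both codes.
startups = {
    "ExampleCam": ({"Sec", "Chip"}, {"Vision", "Embedded"}),
    "ExampleBot": ({"CRM"}, set()),
}

def in_overlay(name, overlay):
    """True if the startup carries the given overlay flag."""
    return overlay in startups[name][1]

print(sorted(startups["ExampleCam"][0]))   # ['Chip', 'Sec']
```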

You may download and use the poster as you wish, so long as you reproduce it in its entirety, do not modify it and maintain the attribution to Cognite Ventures.

The Cognite 300 Startup Poster

Finally, I have also updated the list itself, including details on the classification of the startups by the 16 categories and the 2 broader classes, and identifying the primary country of operations.  For US companies I’ve also included the primary state.

The Cognitive Computing Startup List


How to Start an Embedded Vision Company – Part 3

This is the third installment of my thoughts on starting an embedded vision company. In part 1, I focused on the opportunity, especially how the explosion in the number of image sensors was overwhelming human capacity to directly view all the potential image streams, creating a pressing need for orders-of-magnitude increases in the volume and intelligence of local vision processing. In part 2, I shifted to a discussion of some core startup principles and models for teams in embedded vision or beyond. In this final section, I focus on how the combination of the big vision opportunity and the inherent agility (and weaknesses!) of startups can guide teams to some winning technologies and business approaches.

Consider areas of high leverage on embedded vision problems:

  1. The Full Flow: Every step of the pixel flow, from the sensor interface, through the ISP, to the mix of video analytics (classical and neural network-based), has an impact on vision system performance. Together with choices in training data, user interface, application targeting and embedded-vs.-cloud application partitioning, these steps give an enormous range of options on vision quality, latency, cost, power, and functionality. That diversity of choices creates many potential niches where a startup can take root and grow, without initially having to attack huge, obvious markets using the most mainstream techniques.
  2. Deep Neural Networks: At this point it is pretty obvious that neural network methods are transforming computer vision. However, applying neural networks in vision is much more than just doing ImageNet classification. It pays to invest in thoroughly understanding the variety of both discrimination methods (object identification, localization, tracking) and generation methods. Neural-network-based image synthesis may start to play a significant role in augmenting or even displacing 3D graphics rendering in some scenarios. Moreover, Generative Adversarial Network methods allow a partially trained discrimination network and a generation network to iterate through refinements that improve both networks automatically.
  3. Data Sets: To find, create and repurpose data for better training is half the battle in deep learning. Having access to unique data sets can be the key differentiator for a startup, and brainstorming new problems that can be solved with available large data sets is a useful discipline in startup strategy development. Ways to maximize data leverage may include the following:
    1. Create an engagement model with customers, so that their data can contribute to the training data set for future use. Continuous data bootstrapping, perhaps spurred by free access to a cloud service, may allow creation of large, unique training data collections.
    2. Build photo-realistic simulations of the usage scenes and sequences in your target world. The extracted image sequences are inherently labeled by the underlying scene structures and can generate large training sets to augment real-world captured training data. Moreover, simulation can systematically cover rare but important combinations of object motion, lighting, and camera impairments for added system robustness. For example, the automotive technology startup AImotive builds both sophisticated fused cognition systems from image, LiDAR and radar streams, and sophisticated driving simulators with an accurate 3D world to train and test neural network-based systems.
    3. Some embedded vision systems can be designed as subsets of bigger, more expensive server-based vision systems, especially when neural networks of heroic scale are developed by cloud-based researchers. If the reference network is sufficiently better than the goals for the embedded system, the behavior of that big model can be used as “ground truth” for the embedded system. This makes generation of large training sets for the embedded version much easier.
    4. Data augmentation is a powerful method. If you have only a moderate amount of training data, you may be able to apply a series of transformations to the data that allow the prior labeling to be maintained. (We know a dog is still a dog, no matter how we scale, rotate or flip its image.) Be careful though – neural networks can be so discriminating that a network trained on artificial or augmented data may respond only to such examples, however similar those examples may be to real-world data in human perception.
  4. New Device Types: The low cost and high intelligence of vision subsystems are allowing imaging-based systems in lots of new form-factors. These new device types may create substantially new vision problems. Plausible new devices include augmented reality headsets and glasses, ultra-small always-alert “visual dust” motes, new kinds of vehicles from semi trucks to internal body “drones”, and cameras embedded in clothing, toys, disposable medical supplies, packaging materials, and other unconventional settings. It may not be necessary in these new devices to deliver fine images or achieve substantial autonomy. Instead, the imagers may just be the easiest way to get a little more information from the environment or insight about the user.
  5. New Silicon Platforms: Progress in the new hardware platforms for vision processing, especially for deep neural network methods, is nothing less than breathtaking. We’re seeing improvements in efficiency of at least 3x per year, which translates into both huge gains in absolute performance at the high end, and percolation of significant neural network capacity into low-cost and low-power consumer-class systems. Of course, 200% per year efficiency growth cannot continue for very long, but it does let design teams think big about what’s possible in a given form-factor and budget. This rapid advance in computing capacity appears to be happening in many different product categories – in server-grade GPUs, embedded GPUs, mobile phone apps processors, and deeply embedded platforms for IoT. As just one typical example, the widely used Tensilica Vision DSP IP cores have seen the multiply rate – a reasonable proxy for neural network compute throughput – increase by 16x (64 to 1,024 8x8b multiplies per cycle per core) in just over 18 months. Almost every established chip company doing system-on-chip platforms is rolling out significant enhancements or new architectures to support deep learning. In addition, almost 20 new chip startups are taking the plunge with new platforms, typically aiming at huge throughput to rival high-end GPUs or at ultra-high efficiency to fit into IoT roles. This wealth of new platforms will make choosing a target platform more complex, but will also dramatically increase the potential speed and capability of new embedded vision platforms.
  6. More Than Just Vision: When planning an embedded vision product, it’s important to remember that embedded vision is a technology, not an end application. Some applications will be completely dominated by their vision component, but for many others the vision channel will be combined with many other information channels. These may come from other sensors, especially audio and motion sensors, from user controls, or from background data, especially cloud data. In addition, each vision node may be just one piece of a distributed application, so that node-to-node and node-to-cloud-to-node application coordination may be critical, especially in developing a wide assessment of a key issue or territory. Once all the channels of data are aggregated and analyzed, for example through convolutional neural networks, what then? Much of the value of vision is in taking action, whether the action is real-time navigation, event alerts, emergency braking, visual or audio response to users, or updating of central event databases. In thinking about the product, map out the whole flow to capture a more complete view of user needs, dependencies on other services, computation and communication latencies and throughput bottlenecks, and competitive differentiators for the total experience.
  7. Point Solution to Platform: In the spirit of “crossing the chasm”, it is often necessary to define the early product as a solution for a narrow constituency’s particular needs. Tight targeting of a point solution may let you stand out in a noisy market of much bigger players, and reduce the integration risks faced by your potential early adopters. However, that also limits the scope of the system to just what you directly engineer. Opening up the interfaces and the business model to let both customers and third parties add functionality has two big benefits. First, it means that the applicability of your core technology can expand to markets and customers that you couldn’t necessarily serve with your finite resources to adapt and extend the product. Second, the more a customer invests their own engineering resources into writing code or developing peripheral hardware around your core product, the more stake they have in your success. Both practical and psychological factors make your product sticky. It turns a point product into a platform. Sometimes, that opening of the technology can leverage an open-source model, so long as some non-open, revenue-generating dimension remains. Proliferation is good, but it is not the same as bookings. Some startups start with a platform approach, but that has challenges. It may be difficult to get customers to invest in building your interfaces into their system if you’re too small and unproven, and it may be difficult to differentiate against big players able to simply declare a “de facto industry standard”.
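The label-preserving augmentation described in point 3 above can be sketched in a few lines. This is a minimal illustration using NumPy – the `augment` function is hypothetical, and a real pipeline would add cropping, noise, color jitter and many more transforms:

```python
import numpy as np

def augment(image, label, rng):
    """Apply simple label-preserving transforms: a dog is still a dog
    after a flip, a 90-degree rotation, or a mild brightness change."""
    variants = [(image, label)]
    variants.append((np.fliplr(image), label))    # horizontal flip
    variants.append((np.rot90(image), label))     # 90-degree rotation
    gain = rng.uniform(0.8, 1.2)                  # mild photometric change
    variants.append((np.clip(image * gain, 0.0, 1.0), label))
    return variants

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))      # stand-in for a real training image
expanded = augment(img, "dog", rng)
print(len(expanded))               # 4 training examples from 1 original
```

Each transform multiplies the effective size of the training set without any new labeling effort, which is exactly the leverage the point above describes.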

Any startup walks a fine line between replicating what others have done before, and attempting something so novel that no one can appreciate the benefit. One useful way to look for practical novelty is to look at possible innovation around the image stream itself. Here are four ways you might think about new business around image streams:

  1. Take an existing image stream, and apply improved algorithms. For example, build technology that operates on users’ videos and does improved captioning, tagging and indexing.
  2. Take an existing image stream and extract new kinds of data beyond the original intent. For example, use outdoor surveillance video streams to do high-resolution weather reporting, or to assess traffic congestion.
  3. Take an existing image stream and provide services on it under new business models. For example, build software for user video search that doesn’t charge by copy or by subscription, but by success in finding specific events.
  4. Build new image streams by putting cameras in new places. For example, chemical refiners are installing IR cameras that can identify leaks of otherwise invisible gases. An agricultural automation startup, Blue River, is putting sophisticated imaging on herbicide sprayers, so that herbicides are applied only to recognized weeds, not to crop plants or bare soil, increasing yields and reducing chemical use.

Thinking beyond just the image stream can be important too. Consider ways that cameras, microphones and natural language processing methods can be combined to get richer insights into the environment and users’ intent.

  • Can the reflected sound of an aerial drone’s blades give additional information for obstacle avoidance?
  • Can the sound of tires on the road surface give clues about driving conditions for autonomous cars?
  • Can the pitch and content of voices give indications of stress levels in drivers, or crowds in public places?
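One simple way to act on several sensing channels at once is late fusion: run a separate classifier per modality and combine their per-class confidence scores with weights. The sketch below is purely illustrative – the class names, scores and weights are invented, and real systems would learn the fusion weights from data:

```python
import numpy as np

def late_fusion(scores_by_modality, weights):
    """Weighted average of per-modality class-score vectors."""
    fused = np.zeros_like(np.asarray(next(iter(scores_by_modality.values())), dtype=float))
    for modality, scores in scores_by_modality.items():
        fused += weights[modality] * np.asarray(scores, dtype=float)
    return fused / sum(weights.values())

# Hypothetical "road condition" scores over classes [dry, wet, icy],
# echoing the tire-sound question above
scores = {
    "vision": [0.7, 0.2, 0.1],   # camera sees a mostly dry surface
    "audio":  [0.2, 0.6, 0.2],   # tire noise hints at a wet surface
}
weights = {"vision": 0.6, "audio": 0.4}
fused = late_fusion(scores, weights)
print(fused)   # combined belief over [dry, wet, icy]
```

Even this naive weighted average shows the principle: a second modality can shift the system’s belief when the primary channel is ambiguous.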

The breakdown below explores a range of applications and function types using multiple modes of sensing and analysis, across four application areas:

Vision
  • Autonomous Vehicles and Robotics: multi-sensor (image, depth, speed); environmental assessment; localization and odometry; full surround views; obstacle avoidance
  • Monitoring, Inspection and Surveillance: attention monitoring
  • Human-Machine Interface: command interface; multi-mode automatic speech recognition
  • Personal Device Enhancement: social photography; augmented reality; localization and odometry

Audio
  • Autonomous Vehicles and Robotics: ultrasonic sensing
  • Monitoring, Inspection and Surveillance: acoustic surveillance; health and performance monitoring
  • Human-Machine Interface: mood analysis; command interface
  • Personal Device Enhancement: ASR in social media context; hands-free UI; audio geolocation

Natural Language
  • Monitoring, Inspection and Surveillance: access control; sentiment analysis
  • Human-Machine Interface: sentiment analysis; command interface
  • Personal Device Enhancement: real-time translation; local service bots; enhanced search

The variety of vision opportunities is truly unbounded. The combination of inexpensive image sensors, huge cognitive computing capacity, rich training data and ubiquitous communications makes this time absolutely unique. Doing a vision startup is hard, just as any startup venture is hard. Finding the right team, the right market, the right product and the right money is never easy, but the rewards, especially the emotional, technical and professional rewards can be enormous.

Good luck!

How to Start an Embedded Vision Company – Part 2

In my previous blog post, I outlined how the explosion of high-resolution, low-cost image sensors was transforming the nature of vision, as we rapidly evolve to a world where most pixels are never seen by humans, but captured, analyzed and used by embedded computing systems. This discontinuity is creating ample opportunities for new technologies, new business models and new companies. In this second part, we look at the basic ingredients of a startup, and two rival models for building a successful enterprise.

Let’s look at the basic ingredients of starting a technology business – not just a vision venture. We might call this “Startup 101”. The first ingredient is the team.

  • You will need depth of skills. It is impossible to be outstanding in absolutely everything, but success does depend on having world-class capability in one or two disciplines, usually including at least one technology dimension. Without leadership somewhere, it’s hard to differentiate from others, and to see farther down the road on emerging applications.
  • You don’t need to be world-class in everything, but having a breadth of skills across the basics – hardware, software, marketing, sales, fund-raising, strategy, infrastructure – will help enormously in moving forward as a business. The hardware/software need is obvious and usually first priority. You have to be able to develop and deliver something useful, unique and functional. But sooner or later you’ll also need to figure out how to describe it to customers, make the product and company visible, and complete business transactions. You’ll also need to put together some infrastructure, so that you can hire and pay people, get reasonably secure network access and supply toilet paper in the bathrooms.
  • Some level of experience on the team is important. You don’t need to be graybeards with rich startup and big-company track records, but some real-world history is enormously valuable. You need enough to avoid the rookie mistakes and to recognize the difference between a normal pothole and an existential crevasse. Potholes you do your best to avoid, but if you have to survive a bump, you can. A bit of experience can alert you when you’re approaching the abyss, so you can do everything possible to get around it. Is there a simple formula for recognizing those crevasses? Unfortunately, no (but they often involve core team conflict, legal issues, or cash flow). Startups throw a lot of issues, big and small, at the leadership team, so there will be plenty of opportunity to gain experience along the way.
  • The last key dimension of team is perhaps the most important, but also the most nebulous – character. Starting a company is hard work, with plenty of stress and emotion, because of the stakes. A team that is capable, and committed to openness, patience and honesty, will perform better, last longer, and have more fun than other teams. That does NOT mean the team should agree all the time – part of the point of constructing a team with diverse skills is to get good “parallax perspective” on the thorniest problems. It DOES mean trusting one another to do their jobs, being willing to ask tough questions about assumptions and methods, and working hard for the common effort. More than anything, it means putting ego and individual glory aside.

The second key ingredient for a startup is the product. Every startup’s product is different (or it had better be!), but here are four criteria to apply to the product concept:

  1. The product should be unique in at least one major dimension.
  • The uniqueness could be functionality – the product does something that wasn’t possible before, or does a set of functions together that weren’t possible before.
  • The uniqueness could be performance – it does a known job faster, at lower power, cheaper or in a smaller form-factor than anyone else.
  • The uniqueness could be the business or usage model – it allows a task to be done by a new – usually less sophisticated – customer, or lets the customer pay for it in a different way.
  2. Building the product must be feasible. It isn’t enough just to have a MATLAB model of a great new vision algorithm – you need to make it work at the necessary speed, and fit in the memory of the target embedded platform, with all the interfaces to cameras, networks, storage and other software layers.
  3. The product should be defensible. Once others learn about the product, can they easily copy it? When you work with customers on real needs, will you be able to incorporate improvements more rapidly and more completely than others? Can you gather training data and interesting usage cases more quickly? Can you protect your code, your algorithms, and your hardware design from overt cloners?
  4. You should be able to explain the product relative to the competition. In some ideal world, customers would be able to appreciate the magnificence of your invention without any point of comparison – they would instantly understand how to improve their lives by buying your product. In that ideal world you would have no competition. In the long run, you ideally want to so dominate your segment that no one else comes close. However, if you have no initial reference point – no competition – you may struggle to discover and explain the product’s virtues. Having some competition is not a bad thing – it gives a preexisting reference point by which the performance, functionality and usage model breakthrough can be made vivid to potential customers. In fact, if you think you have no competition, you should probably go find some, at least for the purpose of showing the superiority of your solution.

The third key ingredient for a startup is the target market: the group of users plus the context for use. Think “teenage girls” + “shopping for clothes” or “PCB layout designers” + “high frequency multi-layer board timing closure”.

Finding the right target market for a new technology product faces an inherent dilemma. In the early going, it is not hard to find a group of technology enthusiasts who will adopt the product because it is new, cool and powerful. They have an easy time picturing how it may serve their uses and are comfortable with the hard work of adapting the inherent power of your technology to their needs. Company progress often stalls, however, once this small population of early adopters has embraced the product. The great majority of users are not looking for an application or integration challenge – they just want to get some job done. They may tolerate using new technology from an untried startup, but only if it clearly addresses their use case. This transition to mainstream users has been characterized by author Geoffrey Moore as “crossing the chasm”. The recognized strategy for getting into wider use is to narrow the focus to more completely solve the problems of a smaller group of mainstream customers, often by solving the problem more completely for one vertical application or for one specific usage scenario. So “going vertical” puts the fear (and hypothetical potential) of the technology into the background and emphasizes the practical and accessible benefits of the superior solution.

This narrowing of focus, however, can sometimes create a dilemma in working with potential investors. Investors, especially big VCs, want to hit home runs by winning huge markets. They don’t want narrow niche plays. The highly successful investor Peter Thiel dramatizes this point of view by saying “competition is for losers”, meaning that growing and dominating a new market can be much more profitable than participating in an existing commodity market.

The answer is to think about, and where appropriate, talk about the market strategy in two levels. First identify huge markets that are still under-served or latent. Then focus on an early niche within that emerging market which can be dominated by a concentrated effort, where the insights and technologies needed to master the niche are excellent preparation for larger and larger surrounding use-cases with the likely huge market. Talking about both the laser focus on the “beachhead” initial market AND the setup for leadership in the eventual market can often resolve the apparent paradox.

The accumulated wisdom of startup methods is evolving continuously, both as teams refine what works, and as the technology and applications create new business models [think advertising], new platforms [think applications as a cloud service], new investment methods [think crowdfunding] and new team structures [think gig economy]. The software startup world, in particular, has been dramatically influenced by the “Lean Startup” principle. This idea has evolved over the past fifteen years, spawned by the writings of Steve Blank and Eric Ries more than any other individuals. It contrasts in key ways with the long-standing model, which we can call “Old School”.

Old School vs. Lean Startup:

Funding
  • Old School: Seed Round based on team and idea; A Round to develop the product; B Round after revenue
  • Lean Startup: Develop a prototype to get the Seed Round; A Round after revenue; B Round, if any, for global expansion
Product Types
  • Old School: Hardware/software systems and silicon
  • Lean Startup: Easiest with software
Customer Acquisition
  • Old School: Develop a sales and marketing organization to sell direct or build a channel
  • Lean Startup: CEO and CTO are the chief salespeople until product and revenue potential are proven in the market
Business Models
  • Old School: Mostly B2B with large transactions
  • Lean Startup: Web-centric B2B and B2C with subscriptions and micro-transactions

In vastly simplified form, the Lean Startup model is built on five elements of guidance:

  1. Rapidly develop a Minimum Viable Product (MVP) – the simplest product-like deliverable that customers will actually use in some form. Getting engaged with customers as early as possible gives you the kind of feedback on real problems that you cannot get from theoretical discussions. It gives you a chance to concentrate on the most customer-relevant features and skip the effort on features that customers are less likely to care about.
  2. Test prototypes on target users early and often – Once you have an MVP, you have a great platform to evolve incrementally. If you can figure out how to float new features into the product without jeopardizing basic functionality, then you can do rapid experimentation on customer usage. This allows the product to evolve more quickly.
  3. Measure market and technical progress dispassionately and scientifically – New markets and technologies often don’t follow old rules-of-thumb, so you may need to develop new, more appropriate metrics of progress for both. Methods like A-B testing of alternative mechanisms can give quick feedback on real customer usage, and enhance a sense of honesty and realism in the effort.
  4. Don’t take too much money too soon – Taking money from investors is an implied commitment to deliver predictable returns in a predictable fashion. If you try to make that promise too early, people won’t believe you, so you won’t get the funding. Even if you can convince investors to fund you, taking money too early may make you commit to a product before you really know what works best. In some areas, like cloud software, powerful ideas can sometimes be developed and launched by small teams, so that little outside funding is needed in the early days. Startup and funding culture have evolved together so that teams often don’t expect to get outside funding until they have their MVP. Some teams expect to be eating ramen noodles for many months.
  5. Leverage open source and crowd source thinking – It is hard to overstate the impact of open source on many areas of software. The combination of compelling economics, rapid evolution and vetting by a wide base of users makes open source important in two ways – as a building block within your own technical development strategy, and as part of a proliferation strategy that creates a wider market for your product. Crowd sourcing represents an analogous method to harness wide enthusiasm for your mission or product to gather and refine content, generate data and get startup funds.
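The A-B testing advocated in points 2 and 3 can be made quantitative with a standard two-proportion z-test, using only the Python standard library. The conversion numbers below are invented purely for illustration:

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic comparing conversion rates of variants A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)        # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))     # pooled standard error
    return (p_b - p_a) / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal CDF."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical experiment: variant B's new feature converts 120 of 1000
# users, versus 90 of 1000 for the existing variant A
z = two_proportion_z(90, 1000, 120, 1000)
print(two_sided_p(z))   # small p-value -> the difference is likely real
```

Running the numbers rather than eyeballing them is the “dispassionate and scientific” discipline those points call for: a small p-value says the usage difference is unlikely to be noise.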

As these methods have grown up in the cloud software world, they do not all automatically apply to embedded vision startups. Some technologies, like new silicon platforms, require such high upfront investments and expensive iterations that deferring funding or iterating customer prototypes may not be viable. In addition, powerful ideas like live (and unannounced) A-B testing on customers will not be acceptable for some embedded products, especially in safety-critical applications. The lean methods here work most obviously in on-line environments, with large numbers of customers and small transactions.   A design win for an embedded system may have much greater transaction value than any single order in on-line software, so the sales process may be quite different, with a significant role for well-orchestrated sales and marketing efforts with key customers. Nevertheless, we can compare typical “Old School” and “Lean Startup” methods across key areas like funding, product types, methods for getting customers and core business models.