Part 1: Why Vision
Since I started Cognite Ventures eight months ago, my activity with startup teams has ramped up dramatically. Many of these groups are targeting some kind of embedded vision application, and many want advice on how to succeed. Those conversations developed into an idiosyncratic set of thoughts on vision startup guidance, which in turn spawned a talk at the Embedded Vision Summit that I’m now expanding as a blog. You can find the slides here, but I will also break this discussion down into a three-part article.
Please allow me to start with some caveats! Every startup is different, every team is different, and the market is constantly evolving – so there is no right answer. Moreover, I have had success in my startups, especially Tensilica, but I can hardly claim that I have succeeded just because of following these principles. I have been blessed with an opportunity to work with remarkable teams, whose own talent, energy and principles have been enormously influential on the outcome. To the extent that I have directly contributed to startup success, is it because of applying these ideas? Or in spite of these ideas? Or just dumb luck?
I believe the current energy around new ventures in vision comes from two fundamental technical and market trends. First, the cost of capturing image streams has fallen dramatically. I can buy an HD-resolution security camera with IR illumination and an aluminum housing for $13.91 on Amazon. This implies that the core electronics – CMOS sensor, basic image signal processing and video output – probably cost about two dollars at the component level. This reflects the manufacturing learning curve from the exploding volume of cameras. It’s useful to compare the trend for the population of people with the population of cameras on the planet, based on SemiCo data on image sensors from 2014 and assuming each sensor has a useful life in the wild of three years.
What does it say? First, it appears that the number of cameras crossed over the number of people sometime in the last year. This means that even if every human spent every moment of every day and night watching video, a significant fraction of the output of these cameras would go unwatched. Of course, many of these cameras are shut off, or sitting in someone’s pocket, or watching complete darkness at any given time. Nevertheless, it is certain that humans will very rarely see the captured images. If installing or carrying those cameras around is going to make any sense, it will be because we use vision analysis to filter, select or act on the streams without human involvement in every frame.
But the list of implications goes on!
- We now have more than 10B image sensors installed. If each can produce an HD video stream of 1080p60, we have potential raw content of roughly 100M pixels per second per camera, or 10^18 new pixels per second, or something more than 10^25 bytes per year of raw pixel data. If, foolishly, we tried to keep all the raw pixels, the storage requirement would exceed the annual production of hard disk plus NAND flash by a factor of roughly 10,000. Even if we compressed the video down to 5 Mbps, we would fill up a year’s supply of storage by sometime on January 4 of the next year. Clearly we’re not going to store all that potential content. (Utilization and tolerable compression rates will vary widely by type of camera – the camera on my phone is likely to be less active than a security camera, and some security cameras may get by on less than 5 Mbps, but the essential problem remains.)
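The arithmetic behind those big numbers can be checked in a few lines. This is a back-of-envelope sketch using my own illustrative assumptions (10 billion sensors, every one streaming 1080p60, one byte per raw pixel), not a measurement:

```python
# Back-of-envelope check of the worldwide raw-pixel volume.
# Assumptions (illustrative): 10B sensors, all streaming 1080p60,
# one byte per raw pixel.

SENSORS = 10e9
PIXELS_PER_FRAME = 1920 * 1080            # ~2.1 MP per 1080p frame
FPS = 60
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

pixels_per_cam_per_s = PIXELS_PER_FRAME * FPS          # ~1.2e8, i.e. ~100M
pixels_per_s = pixels_per_cam_per_s * SENSORS          # ~1.2e18 worldwide
raw_bytes_per_year = pixels_per_s * SECONDS_PER_YEAR   # ~4e25 bytes

print(f"pixels/s per camera: {pixels_per_cam_per_s:.2e}")
print(f"pixels/s worldwide:  {pixels_per_s:.2e}")
print(f"raw bytes/year:      {raw_bytes_per_year:.2e}")
```

Even with generous rounding in either direction, the conclusion is stable: the yearly raw total lands above 10^25 bytes, far beyond any plausible storage supply.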
- Where do new bits come from? New bits are captured from the real world, or “synthesized” from other data. Synthesized data includes credit card transactions, packet headers, stock trades, emails, and other data created within electronic systems as a byproduct of applications. Real-world data can be pixels from cameras, or audio samples from microphones, or accelerometer data from MEMS sensors. Synthetic data is ultimately derived from real-world data, through the transformations of human interaction, economic transactions and sharing. Audio and motion sensors are rich sources of data, but their data rates are dramatically lower – 3 to 5 orders of magnitude lower – than that of even cheap image sensors. So virtually all of the real data of the world – and an interesting fraction of all electronic data – is pixels.
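The “3 to 5 orders of magnitude” gap is easy to sanity-check. The sensor rates below are my own illustrative assumptions (uncompressed 8-bit 1080p60 video, 16-bit 48 kHz mono audio, a 1 kHz 3-axis 16-bit accelerometer), not figures from the article:

```python
# Rough comparison of real-world sensor data rates.
# All rates are illustrative assumptions, not measurements.
import math

raw_video_bps = 1920 * 1080 * 60 * 8   # 1080p60, 8 bits/pixel: ~1e9 bps
audio_bps     = 48_000 * 16            # 48 kHz, 16-bit mono:   ~7.7e5 bps
imu_bps       = 1_000 * 3 * 16         # 1 kHz, 3-axis, 16-bit: ~4.8e4 bps

video_vs_audio = math.log10(raw_video_bps / audio_bps)   # ~3.1
video_vs_imu   = math.log10(raw_video_bps / imu_bps)     # ~4.3

print(f"video vs audio: {video_vs_audio:.1f} orders of magnitude")
print(f"video vs IMU:   {video_vs_imu:.1f} orders of magnitude")
```

Under these assumptions the gap comes out at roughly 3 orders of magnitude against audio and over 4 against motion sensing, consistent with the range claimed above.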
- The overwhelming volume of pixels has deep implications for computing and communications. Consider that $13.91 video camera. Even if we found a way to ship that continuous video stream up to the cloud, we couldn’t afford to use some x86 or GPU-enabled server to process all that content – over the life of that camera, we could easily spend thousands of dollars on the hardware (and power) dedicated to that video channel. Similarly, 5 Mbps of compressed video × 60 seconds × 60 minutes × 24 hours × 30 days is 12,960 Gbits per month. I don’t know about your wireless plan, but that’s more than my cellular wireless plan absorbs easily. So it is pretty clear that we’re not going to be able to either do the bulk of the video analysis on cloud servers, or communicate it via cellular. Wi-Fi networks may have no per-bit charges, and greater overall capacity, but wireless infrastructure will have trouble scaling to the necessary level to handle tens of billions of streams. We must find ways to do most of the computing on embedded systems, so that no video, or only the most salient video, is sent to the cloud for storage, further processing or human review and action.
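The bandwidth figure above can be sketched the same way, tallying one camera’s 5 Mbps compressed stream over a 30-day month:

```python
# One camera's compressed-video traffic per 30-day month,
# at the 5 Mbps rate assumed above.

MBPS = 5
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

gbits_per_month = MBPS * SECONDS_PER_MONTH / 1_000   # Mbit -> Gbit
tbytes_per_month = gbits_per_month / 8 / 1_000       # Gbit -> terabytes

print(f"{gbits_per_month:,.0f} Gbits per month (~{tbytes_per_month:.2f} TB)")
```

That works out to 12,960 Gbits, or about 1.6 terabytes per month per camera, which makes it obvious why a cellular plan cannot carry the raw stream.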
The second reason for the enthusiasm for vision is the revolution in computation methods for extracting insights from image streams. In particular, the emergence of convolutional neural networks as a key analytical building block has dramatically improved the potential for vision systems to extract subtle insightful results from complex, noisy image streams. While no product is just a neural network, the increasingly well-understood vocabulary of gathering and labeling large data sets, constructing and training neural networks, and deploying those computational networks onto efficient embedded hardware, has become part of the basic language of vision startups.
When we reflect these observations back onto the vision market, we can discern three big useful categories of applications:
- Capture of images and video for human consumption. This includes everything from fashion photography and snapshots posted on Facebook to Hollywood films and document scanning. This is the traditional realm of imaging, and much of the current technology base – image signal processing pipelines, video compression methods and video displays – is built around particular characteristics of the human visual system. This area has been the mainstay of digital imaging and video-related products for the past two decades. Innovation in new higher-resolution formats, new cameras and new image enhancement remains a plausible area for startup activity even today, but it is not as hot as it has been. While this area has been the home of classical image enhancement methods, there is ample technical innovation in this category, for example, in new generative neural network models that can synthesize photo-realistic images.
- Capture of images and video, then filtering, reducing and organizing into a concise form for human decision-making. This category includes a wide range of vision processing and analytics technologies, including most activity in video monitoring and surveillance. The key here is often to make huge bodies of video content tagged, indexed and searchable, and to filter out irrelevant content so only a tiny fraction needs to be uploaded, stored, reviewed or more exhaustively analyzed. This area is already active, but we would expect even more activity, especially as teams work to exploit the potential for joint analytics spanning many cameras simultaneously. Cloud applications are particularly important in this area because of their storage, computing and collaboration flexibility.
- Capture of images and video, then analyzing and using the insights to take autonomous action. This domain has captured the world’s imagination in recent years, especially with the success of autonomous vehicle prototypes and smart aerial drones. The rapid advances in convolutional neural networks are particularly vivid and important in this area, as vision processing becomes accurate and robust enough to trust with decision-making in safety-critical systems. Key characteristics of these systems are short latency, robustness and hard real-time performance. System architects will rely on autonomous vision systems to the extent that the systems can guarantee short decision latency and ~100% availability.
Needless to say, some good ideas may be hybrids of these three, especially in systems that use vision for some simple autonomous decision-making, but rely on humans for backup, or for more strategic decisions, based on the consolidated data.
In the next part of the article, we’ll take a look at the ingredients of a startup – team, product and target market – and look at some cogent lessons from the “lean startup” model that rules software entrepreneurship today.