What does a $5 camera cost?

An envelope is a powerful tool, especially the back of the envelope. It enables simple calculations that often illuminate some important, complicated problem or opportunity. We have just that kind of problem in working to understand the structure of vision processing systems, especially the key question of where the processing power for vision will live. Will it live in the camera itself? Will it live in the cloud? Will it live somewhere in between, on a local network or at the edge of the cloud? Issues like this are rarely determined by just one factor. It is not just “what’s the fastest” or “what’s the cheapest”. A rich mix of technical and economic forces drives the solutions.

One of the most fundamental forces in vision is ever-cheaper cameras. When I checked a couple of weeks ago I could buy a decent-looking security camera, with a high-resolution lens and imager, IR LED illumination, processing electronics, power supply and waterproof housing, on Amazon for $11.99:

The price has been falling steadily – when I checked three months ago, the entry price was about $18. We can reasonably expect that the price will continue to fall over the next couple of years – the $5 security camera cannot be far off.

The incredible shrinking camera cost is the result of classic economies of scale in CMOS semiconductors and assembly of small electronic gadgets. That scaling is inspired by and reinforces huge increases in the volume of image sensor shipments, and in the installed base of low-cost cameras. Using sensor volume data from SemiCo (2014), and the assumption of a three-year useful field life for a camera, you can see the exponential growth in the world’s population of cameras of all kinds.

I show the camera population alongside the human population to reinforce the significance of the crossover – more cameras capturing pixels than there are people to look at them.

I’ve written before about the proliferation of cameras in general, but now I want to get out my envelope and do some further calculations on the key question, “What does a $5 camera cost?” or more accurately, “What’s the cost of ownership of a $5 camera over a three-year lifetime?”

The answer, as you might already guess, is “More than $5.” Let’s break down some of the factors:

  1. The camera takes power. The $12 ZOSI camera above draws about 6W. I’ll guess that my near-future $5 camera will cut power too, so I’ll estimate 3W – still far from low compared with the low-power CMOS sensors found in mobile phones. As a rule of thumb, electricity in the US costs about $1 per watt per year, so we get a three-year power cost of about $9 for our $5 camera.

If we did all the image processing locally, $5 + $9 = $14 might be a good total for the cost of owning that camera. And you can imagine some mobile gadgets and IoT applications – smart doorbells and baby monitors – where 100% local vision processing is adequate (or even preferred). But what if we want to centralize the processing and take advantage of cloud data? What if we want to push the stream up to the datacenter? Then we have some more costs to consider:
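For anyone who wants to follow along on their own envelope, here is a minimal Python sketch of that local-only cost of ownership. The 3W draw, the $1-per-watt-year rule of thumb and the three-year life are the assumptions above; change them and the total moves accordingly.

```python
# Back-of-envelope: three-year cost of owning a $5 camera with all processing done locally.
CAMERA_PRICE = 5.00          # dollars
POWER_DRAW_W = 3.0           # watts, continuous operation
DOLLARS_PER_WATT_YEAR = 1.0  # rule of thumb: ~$0.11/kWh * 8.76 kWh per watt-year ~= $1
LIFETIME_YEARS = 3

power_cost = POWER_DRAW_W * DOLLARS_PER_WATT_YEAR * LIFETIME_YEARS  # ~$9
total_local_only = CAMERA_PRICE + power_cost                        # ~$14

print(f"Three-year power cost: ${power_cost:.0f}")
print(f"Camera + power, local processing only: ${total_local_only:.0f}")
```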

  2. The camera needs a network connection. Local cabling or Wi-Fi may just have capital costs, but almost all Internet access technologies cost real money. It turns out that Internet access costs vary ENORMOUSLY. There are three basic cost categories.
    1. If you’re a residential customer in an area with fiber to the home, you’re in great shape – you can get costs as low as $0.10 per TB (10Gbps line for $300 per month from US Internet).
    2. If you’re a residential customer using cable or DSL, you’re going to pay $8-20 per TB (based on AT&T and Comcast pricing).
    3. If you’re a business customer, and have to buy a commercial T1 or T3 connection, you’re going to pay a lot – probably more than $100 per TB, maybe a lot more (based on the broker T1Shopper.com).

Of course, I am glossing over differences in quality of service, but let’s just break it down into two categories – fiber at $0.10 per TB and non-fiber at $10 per TB. Interestingly, it doesn’t appear that DSL, cable and T1 connections are dropping much in price. Clearly, that will happen only in areas where a much higher capacity option like fiber becomes available.

  3. The camera needs storage in the cloud. Almost any application that does analysis of a video stream needs a certain amount of recall, so users can apply further study to events of interest. In some applications, the recall time may be quite short – minutes or hours; in others, there may be a need for days or weeks of storage. We’ll look at two scenarios – a rolling one-day buffer of video storage and a rolling one-month buffer – to span the range. The cost of storage is a little tricky to nail down. Amazon is now offering “unlimited” storage for $5/month, but that’s just for consumers, not relevant to AWS cloud compute customers. AWS S3 storage currently costs about $25/TB-month for standard storage, and $5/TB-month for “glacier” storage, which may take minutes to hours to access. The glacier storage seems appropriate for the cases where a month of recall is needed. Just as the camera gets cheaper over time, we’d expect the storage to get cheaper too, so we’ll estimate $12.50/TB-month for standard storage and $2.50/TB-month for glacier storage.
  4. The camera needs computing in the cloud. The whole point of moving the camera stream to the cloud is to be able to compute on it, presumably with sophisticated deep learning image understanding algorithms. There is no fixed standard for the sophistication of computation on video frames, but we can set some expectations. In surveillance video, we anticipate that we want to know what the major objects in the frame are, and where they are located – in other words, object detection. This is a harder problem than simple classification of a single object in each image, as in the most common ImageNet benchmark, and it gives the important clues about the progress of activity in the video stream. The best popular algorithm for this, YOLO (You Only Look Once) by Joseph Redmon, requires about 35 GFLOPs per frame of computation. We can get a reasonable picture of the cost by looking at AWS compute offerings, especially the g3 GPU instances. John Joo did some good estimates of the cost of AlexNet inference on AWS g3.4xlarge instances – 3200 inferences per second on the Tesla M60. YOLOv2 is about 25x more compute intensive, so we’d expect a cost of roughly $2.30 per million frames, based on AWS g3.4xlarge pricing of $1.14/hour. Over time, we expect both the algorithms to improve – fewer compute cycles for the same accuracy – and the compute to get more economical. We will again assume a factor of two improvement in each of these, so that future YOLO inference on AWS GPUs might cost about $0.58 per million frames.
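As a sanity check on that compute number, here is the per-million-frames arithmetic in a few lines of Python. The AlexNet throughput, the 25x ratio and the $1.14/hour instance price are the estimates above; the factor-of-four "future" improvement combines the assumed 2x algorithm and 2x hardware gains, and the result comes out close to the $2.30 and $0.58 figures.

```python
# Rough cost of YOLO-class object detection per million frames on an AWS g3.4xlarge.
ALEXNET_INFERENCES_PER_SEC = 3200   # John Joo's AlexNet measurement on the Tesla M60
YOLO_COMPUTE_RATIO = 25             # YOLOv2 is roughly 25x the work of AlexNet
G3_PRICE_PER_HOUR = 1.14            # dollars, AWS g3.4xlarge

yolo_frames_per_sec = ALEXNET_INFERENCES_PER_SEC / YOLO_COMPUTE_RATIO   # ~128 frames/s
frames_per_hour = yolo_frames_per_sec * 3600                            # ~460,000 frames
cost_per_million_today = G3_PRICE_PER_HOUR / frames_per_hour * 1e6      # ~$2.5
cost_per_million_future = cost_per_million_today / 4                    # 2x algorithms * 2x hardware

print(f"Today:  ~${cost_per_million_today:.2f} per million frames")
print(f"Future: ~${cost_per_million_future:.2f} per million frames")
```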

Now let’s try to put it all together, for a comparison of cloud computing costs relative to our three-year camera cost of $14. The last key variable is the data rate coming out of the camera. I will assume an 8Mpixel camera capable of producing 60 frames per second. This is lower resolution than what is found in high-end smartphones, so it qualifies as a commodity spec, I think. Let’s compute the costs in four scenarios:

  1. Camera streams raw pixels at 8Mpixel * 2 bytes/pixel * 60 fps into the cloud. While it is unlikely that we would adopt raw pixel streaming, it still provides a useful baseline.
  2. Camera uses standard H.264p60 video compression on the 4K UHD stream. This reduces the video bandwidth requirement to about 40 Mbps.
  3. We assume some filtering or frame selection intelligence in the camera that reduces the frame rate, after H.264 compression, to 1 frame per second.
  4. We assume more filtering or frame selection intelligence in the camera that reduces the frame rate to 1 frame per minute.

 

| Data stream | Network cost (fiber) | Network cost (conventional) | Storage cost (1-day buffer) | Storage cost (1-month buffer) | Compute cost | Total cost |
|---|---|---|---|---|---|---|
| 8Mp raw | $9,500 | $950,000 | $79,000 | $476,000 | $3,300 | $92,000 – $1,430,000 |
| H.264 p60 | $47 | $4,700 | $400 | $2,400 | $3,300 | $3,700 – $10,000 |
| H.264 @ 1 fps | $1 | $79 | $7 | $39 | $55 | $62 – $173 |
| H.264 @ 1 fpm | $0 | $1.30 | $0.11 | $0.66 | $0.91 | $1 – $3 |
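For readers who want to turn the knobs themselves, here is one way to reconstruct the arithmetic behind this table in Python. It uses the network prices above ($0.10/TB fiber, $10/TB conventional), today's S3 prices for storage ($25 and $5 per TB-month) and the projected compute price of about $0.58 per million inferences; the outputs land within rounding of the table entries, and any of the unit costs can be swapped for your own.

```python
# Rough three-year cloud cost model for one camera, for the four scenarios above.
SECONDS_3Y = 3600 * 24 * 365 * 3
MONTHS_3Y = 36

FIBER_PER_TB = 0.10         # $/TB, fiber network access
CONV_PER_TB = 10.0          # $/TB, cable/DSL network access
STD_PER_TB_MONTH = 25.0     # $/TB-month, S3 standard (one-day rolling buffer)
GLACIER_PER_TB_MONTH = 5.0  # $/TB-month, glacier-class (one-month rolling buffer)
COMPUTE_PER_MFRAME = 0.58   # $ per million detection inferences (projected)

def scenario(name, bytes_per_sec, frames_per_sec):
    tb_shipped = bytes_per_sec * SECONDS_3Y / 1e12          # total TB sent over 3 years
    net_fiber = tb_shipped * FIBER_PER_TB
    net_conv = tb_shipped * CONV_PER_TB
    day_buffer_tb = bytes_per_sec * 86400 / 1e12            # rolling one-day buffer size
    store_1day = day_buffer_tb * STD_PER_TB_MONTH * MONTHS_3Y
    store_1month = day_buffer_tb * 30 * GLACIER_PER_TB_MONTH * MONTHS_3Y
    compute = frames_per_sec * SECONDS_3Y / 1e6 * COMPUTE_PER_MFRAME
    print(f"{name:13s} network ${net_fiber:,.2f}-${net_conv:,.2f}  "
          f"storage ${store_1day:,.2f}-${store_1month:,.2f}  compute ${compute:,.2f}")

scenario("8Mp raw",     8e6 * 2 * 60,      60)    # ~960 MB/s uncompressed
scenario("H.264 p60",   40e6 / 8,          60)    # 40 Mbps compressed
scenario("H.264 @1fps", 40e6 / 8 / 60,      1)    # compressed, 1 frame per second kept
scenario("H.264 @1fpm", 40e6 / 8 / 3600, 1/60)    # compressed, 1 frame per minute kept
```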

The analysis, rough as it is, gives us some clear takeaways:

  • The bandwidth and storage costs of raw uncompressed video are absurd, so we’ll never see raw streaming of any significance
  • Even normally compressed video is very expensive – $4K to $10K over the three year life of our $5 camera. Very few video streams are so valuable as to justify this kind of cost.
  • Smart selection down to one frame per second helps cloud costs significantly, but the cloud costs still dwarf the camera costs. Commodity solutions are likely to push toward lower costs, and hence lower cloud frame rates.
  • Selection down to one frame per minute makes cloud costs insignificant, even relative to our cheapo camera. It may not be necessary to go this far to leverage the cost reductions in the hardware.
  • We might reasonably expect that the sweet spot is 5 to 20 frames per minute of cloud-delivered video, with the high end most appropriate where cheap fiber to the home (or business) is available, and the low-end more appropriate with conventional network access.
  • The total value of these cameras is economically significant, when we consider the likely deployment of about 20 billion cameras by 2020. While the actual business model for different camera systems will vary widely, we can get a rough order of magnitude by assuming the cameras are streaming just 10 frames per minute, half over fiber, half over conventional network access. The three-year cost comes to roughly $280B for the cameras and $300B for the network and cloud computing, or nearly $200B per year. Not chump change!

The big takeaway is that the promise of inexpensive cameras cannot be realized without smart semantic filtering in or near the camera. Cameras, especially monitoring cameras, need to make intelligent choices about whether anything interesting is happening in a time window. If nothing meets the criteria of interest, nothing is sent to the cloud. If activities of possible interest are detected, a window of video, from before and after the triggering action, needs to be shipped to the cloud for in-depth analysis. The local smart filtering subsystem may also include a preliminary assessment and summary of activity outside the window of interest in the video trigger package sent to the cloud.

You might reasonably ask, “Won’t putting more intelligence in the cameras make them much more expensive?” It is clear that these low-end cameras are designed for minimum cost today, with the simplest viable digital electronics for interfacing to the network. But just as video compression engines have become commoditized, cheap and standard in these cameras, I expect neural network processors to go the same route. For example, running standard YOLOv2 at 60 fps requires about 2.3 TFLOPS, or 1.2 T multiply-adds per second. That would fit comfortably in 3 Tensilica Vision C5 cores and less than 5 mm² of 16nm silicon, including memory. That probably translates into an added silicon manufacturing cost of less than 50 cents. So it might push up the total camera and power cost by a few dollars, but not enough to shift the balance towards the cloud. After all, doing YOLOv2 on the full 60 fps in the cloud can cost thousands of dollars.
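Here is a quick sizing check on that claim, using the 1,024 8x8b multiplies per cycle per core figure for the latest Tensilica Vision DSPs mentioned later in these posts; the roughly 1 GHz clock is my own assumption for a 16nm embedded core, not a published spec.

```python
# Rough sizing: does 60 fps YOLOv2-class detection fit in a few embedded DSP cores?
REQUIRED_MACS_PER_SEC = 1.2e12   # ~2.3 TFLOPS ~= 1.2 T multiply-adds/s (from the text)
MACS_PER_CYCLE_PER_CORE = 1024   # Vision C5-class: 1024 8x8b multiplies per cycle
ASSUMED_CLOCK_HZ = 1.0e9         # ~1 GHz in 16nm -- an assumption, not a published spec

per_core_macs_per_sec = MACS_PER_CYCLE_PER_CORE * ASSUMED_CLOCK_HZ   # ~1 TMAC/s per core
cores_at_full_utilization = REQUIRED_MACS_PER_SEC / per_core_macs_per_sec

print(f"~{cores_at_full_utilization:.1f} cores if every MAC slot were used every cycle")
# ~1.2 cores at perfect utilization; three cores leaves realistic headroom for memory
# stalls, awkward layer shapes and the non-convolution parts of the network.
```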

This model also suggests that the algorithms on both ends – at the camera and in the cloud – will need to adapt to specific application requirements and settings, and evolve significantly over time as better trigger and analysis methods emerge. This is the essence of a smart distributed system, with global scope, continuous evolution of functionality, and high economic value.

I should finish with a few caveats on this little analysis. First, it is full of simplifying assumptions about cameras, networks, storage needs and compute intensity. The cost of network access is particularly tricky to nail down, not least because wired data access is unmetered, so that the marginal cost of one more bit is zero, until you run out of capacity – then the marginal bit is very expensive. Second, I have focused on costs in the U.S. Some other regions have better access to low-cost fiber, so they may be able to leverage video cloud computing better. Other regions have significantly more expensive network access, so the device-cloud tradeoff may shift heavily towards smarter device filtering.

Finally, the uses of cameras are enormously varied, so it is naïve (but instructive) to boil every video analysis application down to object detection. There may be significantly heavier and lighter compute demands in the mix.

This analysis focuses on wired cameras. However, cameras in smart phones make up a meaningful fraction of the expected 20B installed image sensors. Wireless bandwidth to the cloud is dramatically more expensive than wired bandwidth. The going price is very roughly $10 per GB in high-bandwidth plans, or 1,000 times the price of DSL and cable and 100,000 times the price of fiber. If, hypothetically, you somehow managed to stream your 8Mp 60 fps camera to the cloud as raw pixels continuously for three years, it would cost you about a billion dollars ;-). Compressing it to 40Mbps H.264p60 drops the cost to a mere five million dollars. Of course, wireless data costs may well come down, but not enough to make continuous streaming from phones attractive. Especially smart compression is going to be needed for any application that wants to rely on continuous streaming over significant periods.
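The wireless figures follow from the same stream sizes used earlier; here is the two-line check, with $10/GB as the assumed going rate for high-bandwidth plans.

```python
# Wireless sanity check: three years of continuous streaming at ~$10/GB.
SECONDS_3Y = 3600 * 24 * 365 * 3
WIRELESS_PER_GB = 10.0                      # dollars, rough high-bandwidth plan rate

raw_gb = 8e6 * 2 * 60 * SECONDS_3Y / 1e9    # 8 Mpixel * 2 B/pixel * 60 fps -> ~91 million GB
h264_gb = 40e6 / 8 * SECONDS_3Y / 1e9       # 40 Mbps H.264p60 -> ~473,000 GB

print(f"Raw stream:   ~${raw_gb * WIRELESS_PER_GB / 1e9:.1f} billion")
print(f"H.264 stream: ~${h264_gb * WIRELESS_PER_GB / 1e6:.1f} million")
```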

So that’s the story of a cheap camera on Amazon and the back of an envelope.

Some resources:

Network costs:

https://fiber.usinternet.com/plans-and-prices/

http://www.att-services.net/att-high-speed-internet.html

https://www.connecttoxfinity.com/internet.html

http://www.t1shopper.com

Storage costs:

https://aws.amazon.com/s3/pricing/

Computing costs:

https://blog.dominodatalab.com/new-g3-instances-in-aws-worth-it-for-ml/

https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/

https://pjreddie.com/darknet/yolo/

Wireless costs:

https://www.forbes.com/sites/tristanlouis/2013/09/22/the-real-price-of-wireless-data/#7136a248749f

The Top Deep Learning Startups in Israel

I’ve been pondering a small mystery for months. I have finally resolved it. The mystery was this: “Where are all the Israeli deep learning startups?” Ten months ago I started putting together the Cognite Ventures list of the top worldwide deep learning startups. I drew on lots of existing lists of AI startup activity around the world, especially surveys published in the US, UK and China. I also searched the on-line startup-funding tracking site Crunchbase to get a systematic look at companies that self-report as AI startups. I filtered through thousands of startup descriptions and visited almost one thousand startup websites to get the basics of their product and technology strategy. I talked to colleagues who also track the AI explosion.

Much to my surprise, only half a dozen startups based in Israel made it through my search and filtering – just these few:

I wondered “Why?”

  • Was it because Israeli startups were focused on other areas?
  • Was it because startups were doing advanced AI, but weren’t publicly highlighting use of deep learning methods?
  • Were the companies slipping through my search net?

A few weeks ago, Daniel Singer, an independent market analyst in Israel, published a detailed and extremely useful infographic showing the logos of more than 420 startups in Israel associated with Artificial Intelligence. See the graphic

His analysis put the companies in eight major categories and a total of 39 subcategories. The scale of the list certainly suggests a great deal of general AI activity, and perhaps a lot of action in that most interesting subset, deep learning.

Over the past week, I have visited the websites of every single one of the 420+ companies on Singer’s chart. Using company mission statements, product descriptions, blogs, and job postings, plus additional information from Crunchbase and YouTube videos, I have worked to assess the product focus and the technical reliance on hard-core deep learning methods of these companies – the same methodology used for the entire “Cognite 300 Deep Learning Startup” list. Happily, I have identified a significant number – 34 – to add to the Cognite list (see below). Of course, I recognize that these companies were slipping through my net because of lower press visibility in the US, less reporting on Crunchbase, and smaller typical start-up size, all of which make them a little harder to see.

This is an interesting, and I hope, important group of companies, with significant clusters in embedded vision for autonomous vehicles, human-machine interface and robotics, in cloud-based security and surveillance, in marketing, and in medical care. These companies have invested to understand the impact of neural networks on end markets and have built products that rely heavily on harnessing large data sets for training, and on the opportunity to extract hidden patterns from images, transactions, user clicks, sounds and other massive data streams. I suspect that companies that understand the implications of deep learning first will enjoy comparative advantages.

Some of the more intriguing ones on that list include:

  • OrCam Technologies builds smart glasses for the visually impaired that can recognize individual friends and family, read text aloud and warn users of dangers.
  • GetAlert does modern surveillance and monitoring using deep neural networks in the cloud, but adds extraction of 3D structure from image streams for improved action classification.
  • Vault predicts the financial success of film productions from deep analysis of scripts and casts.
  • Augury uses vibration and ultrasonic sensors to monitor mechanical systems to get early warning of slight changes that signal developing malfunctions.

Of course, deep learning is not the only form of “AI” that will legitimately contribute to market disruption and start-up traction. AI covers a lot of techniques, including other forms of statistical analysis and machine learning, natural language processing in chat-bots and other structured dialog systems, and other methods for finding and exploiting patterns in big data. For many companies their “less deep” learning is appropriate to their tasks and will lead them to success too. I suspect, though, that their success will often be driven by other factors – end market understanding, good UI integration, key business relationships – more than by AI technology mastery.

It is interesting to look for patterns in Singer’s list of 420+ companies, for it gives another useful view of what’s happening in the Israeli startup scene. Several large clusters are particularly noticeable:

  • Chat-bots for engaging with customers more continuously and adaptively and coaching sales people
  • All kinds of e-commerce optimization – enhanced bidding, ad campaign optimization, pricing and incentives
  • Fraud protection in ecommerce including ad fraud
  • Predictive maintenance in manufacturing and operations
  • IT infrastructure management especially for cyber security
  • All sorts of brand management and promotion

While some of these strong areas tap into Israel’s historical tradition of technical innovation driven by defense needs, many are not clearly tied to vision and surveillance, wireless communication, signals intelligence or threat profiling. Instead they reflect the entrepreneurial drive to exploit huge global trends, especially in the rise of ecommerce and on-line marketing.

You might ask why one analyst identifies 420 AI start-ups and another only 40. Daniel Singer clearly wanted to explore the full scope of “smart” startups in Israel. He used a very broad definition of AI, including companies as diverse as one doing “connected bottle caps” [Water.IO], another doing a job search tool [Workey], and a third, frighteningly, automatic web content composition from a few keywords [Articoolo]. This expansive definition of AI is more inclusive, and reflects, in part, the enormous fascination among entrepreneurs, investors and the general public for all things related to AI. This broad definition subsumes many of the longer-term trends in big data analysis, ecommerce automation and predictive marketing.

I have taken a more selective view. Much of the enthusiasm for AI in the past three years has been specifically triggered by the huge, well-publicized strides in just one subdomain of AI – neural networks. This has been particularly striking in areas like computer vision, automated language translation and automatic speech recognition, but also in the most complex and ambiguous tasks in business data analysis. I have chosen to focus on companies that appear to be at the cutting edge of neural networks, using the most sophisticated deep learning methods. I believe this will be the greatest area of disruption of existing products, companies and business models.

So the broad view and the selective view are complementary, each giving a window into the lively entrepreneurial climate in Israel.

Does Vision Research Drive Deep Learning Startups?

What’s the relationship between academic research and entrepreneurship? Does writing technical papers help startups? Does government support of computer science research translate into commercial innovation? We’d all like to know!

I just spent a week at the Computer Vision and Pattern Recognition (CVPR) Conference in Honolulu. It is one of the three big academic conferences – along with Neural Information Processing Systems (NIPS) and the International Conference on Machine Learning (ICML) – that have embraced and proliferated cutting-edge research in deep learning. CVPR has grown rapidly, with attendance almost doubling annually over the past few years (to more than 5200 this year). More importantly, it has played a key role in encouraging all kinds of computer vision researchers to explore the potential for deep learning methods on almost every standing vision challenge.

So many teams submit worthwhile papers that the conference has adopted a format to expose as many people to as many papers as possible. Roughly 2500 papers were submitted this year, of which only about 30% were accepted. Even so, how can people absorb more than 750 papers? All the accepted papers get into poster sessions, and those sessions are unlike any poster session I’ve seen before. You often find two or three authors surrounded by a crowd of 25 or more, each explaining the gist of the paper over and over again for all comers. Some notable papers are given a chance to be shared as a rapid-fire FOUR-minute summary in one of many parallel tracks. And an even smaller handful of papers gets much bigger exposure – a whole TWELVE-minute slot, including Q&A!

Remarkably, the format does work – the short talks serve as useful teasers to draw people to the posters. A session of short talks gives a useful cross-section of the key application problems and algorithmic methods across the whole field of computer vision. That broad sampling of the papers confirms the near-total domination of computer vision by deep learning methods.

You can find interesting interactive statistics on the submitted CVPR17 papers here and on the accepted papers here. Ironically, only 12% of the papers are assigned to the “CNN and Deep Learning” subcategory, but in reality almost every other category, from “Scene Recognition and Scene Understanding” to “Video Events and Activities”, is also dominated by deep learning and CNN methods. Of course, there is still a role for classical image processing and geometry-based vision algorithms, but these topics occupy something of a backwater at CVPR. I believe this radical shift in focus reflects a combination of the power of deep learning methods on the inherent complexity (and ambiguity) of many vision tasks, but deep learning is also a shiny new “hammer” for many research teams – they want to find every vision problem that remotely resembles a “nail”!

Nevertheless, CVPR is so rich with interesting ideas that the old “drinking from a fire hose” analogy fits well. It would be hard to do justice to the range of ideas and the overall buzz of activity there, but here are a few observations about the show and the content.

  • Academia meets industry – for recruiting. The conference space is built around a central room that serves for poster space, exhibition space and dining. This means you have a couple of thousand elite computer vision and neural network experts wandering around at any moment. The exhibitors – Microsoft, NVidia, Facebook, Google, Baidu and even Apple, plus scores of smaller companies – ring the outside of the poster zone, all working to lure prospective recruits into their booths.
  • Creative applications abound. I was genuinely surprised at how broadly research teams are applying neural network methods. Yes, there is still work on the old workhorses of image classification and segmentation, but so much of the work is happening on newer problems in video question answering, processing of 3D scene information, analyzing human action sequences, and using generative adversarial networks and weakly supervised systems to learn from limited data. The speed of progress is also impressive – some papers build on methods themselves first published at this very CVPR! On the other hand, not all the research shows breakthrough results. Even very novel algorithms sometimes show just minor improvements in accuracy over previous work, but improvements of a few percent add up quickly if rival groups release new results every few months.
  • Things are moving too fast for silos to develop. I got a real sense that the crowd was trying to absorb everything, not just focusing on narrow niches. In most big fields like computer vision, researchers tend to narrow their attention to a specific set of problems or algorithms, and those areas develop highly specialized terminology and metrics. At CVPR, I found something closer to a common language across most of the domain, with everyone eager to learn and adopt methods from others, regardless of specialty. That cross-fertilization certainly seems to be aiding the ferocious pace of research.
  • The set of research institutions is remarkably diverse. With so much enthusiasm for deep learning and its applications to vision, an enormous range of universities, research institutes and companies are getting into the research game. To quantify the diversity, I took a significant sample of papers (about 100 out of the 780 published) and looked at the affiliations of the authors. I counted each unique institution on a paper, but did not count the number of authors from the same institution. In my sample, the typical paper had authors from 1.6 institutions, with a maximum of four institutions. I suspect that mixed participation reflects both overt collaboration and movement of people, both between universities and from universities into companies. In addition, one hundred unique institutions are involved in the one hundred papers I sampled. This means there are probably many hundreds of institutions doing world-class research in computer vision and deep learning. While the usual suspects – the leading tech universities – are producing more papers, the overall participation is extraordinarily broad.
  • Research hot spots by country. The geographical distribution of the institutions says a lot about how different academic communities are responding to the deep learning phenomenon. As you’d expect, the US has the greatest number of author institutions, with Europe and China following closely. The rest of Asia lags pretty far behind in sheer numbers. Here’s the makeup of our sample, by country – the tiny slices in this chart are Belgium, Denmark, India, Israel, Sweden, Finland, Netherlands, Portugal, Australia and Austria. Europe altogether (the EU, including the UK ;-), plus Switzerland) produced about 28% of the papers, significantly more than China, but still less than the US.
  • Research hotspots by university. We can also use this substantial sample to get a rough sense of which specific universities are putting the most effort into this. Here are the top ten institutions worldwide, for this sample of the paper authors – with some key institutions in the UK (Oxford and Cambridge) and Germany (Karlsruhe and the Max Planck Institute) rounding out the next ten:

    All this raises an interesting question – what’s the relationship between academic research and startups in deep learning technology? Do countries with strong research also have strong startup communities? This is a question this authorship sample, plus the Cognite Ventures deep learning startup database, can try to answer, since we have a pretty good model of where the top computer vision startups are based, and we know where the research is based. In the chart below, I show the fraction of deep vision startups and the fraction of computer vision papers, for the top dozen countries (by number of startups):

    The data appears to tell some interesting stories:

    • US, UK and China really dominate in the combination of research and startups.
    • US participation in computer vision research and startups is fairly balanced, though research actually lags a bit behind startup activity.
    • The UK actually has meaningfully more startup activity than research, relative to other countries. This may reflect good leveraging of worldwide research and a good climate for vision startups.
    • China is fairly strong on research relative to vision startups, suggesting perhaps some upside potential for the Chinese startup scene, as the entrepreneurial community leverages more of the local research expertise.
    • Though the numbers are not big enough to reach strong conclusions, it appears that research in Germany, France and Japan significantly exceeds any conversion of research into startups. I think this reflects a fairly strong research tradition, especially in Germany, combined with a less-developed overall startup culture and climate.

     

    Both the live experience of CVPR and the statistics underscore the real and vital link between research in computer vision and the emergence of a deep understanding of methods and applications for new enterprises. This simple analysis doesn’t directly reveal causality, but it is a pretty good bet that computer vision researchers often become deep learning founders and technologists, fueling the ongoing transformation of the vision space. More research means more and better startups.

Deep Learning Startups in China: Report from the Leading Edge

Everyone knows the classic Chinese curse, “May you live in interesting times”. Well, it turns out, the Chinese origin for this pithy phrase is apocryphal – the British statesman Austen Chamberlain probably popularized the phrase in the 1930s and attributed it to the Chinese to lend it gravity. We do, however, live in interesting times, in no field better epitomized than in deep learning, and in no location more poignantly than in China.

I have just returned from a ten-day tour of Beijing, Shenzhen, Shanghai and Hangzhou, meeting with deep learning startups and giving a series of talks on the worldwide deep learning market. In the most fundamental ways, neither the technology, nor the applications, nor the startup process are so different from what you find in Silicon Valley or Europe, but the trip was full of little eye-openers about deep learning in China, and about the entrepreneurial process there. It reinforced a few long-standing observations, but also shifted my point of view in important ways.

The most striking reflection on the China startup scene is how much it feels like the Silicon Valley environment, and how it seems to differ from other Asian markets. First, there seems to be quite a bit of money available, from classic VCs, from industrial sponsors and even from US semiconductor companies – Xilinx and NVidia have investments in high-profile startups in China, and I’m sure other major players do too. Second, deep learning is very active, with much of the same “gold-rush” feeling I observe in the US. This contrasts with the Taiwan, Japan and Korea markets, where the deep learning startup theme is less developed, either because startups are less central to the business environment (Japan) or because the deep learning enthusiasm has not grown so intense (Taiwan, Korea). Ample funding and enthusiasm also means rapid growth of significant engineering teams – the smallest company I saw had 25 people, the biggest had about 400. California teams have evolved office layouts that look like Chinese ones – open offices without cubicle walls – and Chinese teams have developed the California tradition of endless free food. We are not there yet, but we are closer than ever to a common startup culture spanning the Pacific.

Observation: The Chinese academic and industrial technical community is closely tuned into the explosion of activity in deep learning, and many companies are looking to leverage it in products. Baidu’s heavy investment in deep learning – with a research team of more than 1000 – is already well known. The number of papers on deep learning from Chinese universities and the interest level among startups are also very high. Overall, Chinese industry seems to be gradually shifting from a “cost-down” mindset – focused on how to take established products and optimize the whole bill-of-materials for lower cost – towards greater attention to functional innovation. A strong orientation towards hardware and system products remains: I have found many fewer pure-play cloud software startups in China than in the US or UK. Nevertheless, the original software content in these systems is growing rapidly. Almost every company I visited had polished and impressive demos of vehicle tracking, face recognition, crowd surveillance or demographic assessment.

Observation: Chinese startups are unafraid of doing new deep-learning silicon platforms. Quite a few of the software companies I visited are building or planning chips capturing their insights into neural network inference computation. Perhaps one in four Chinese startups is working towards silicon, while only one in 15 worldwide is doing custom silicon. One executive explained that Chinese investors really like to see the potential differentiation that comes from chips, and startups believe that committing to silicon actually helps secure capital. This is in stark contrast to current Silicon Valley wisdom – that investors flee at the mention of chip investment. This striking dichotomy reflects a combination of perceived lower chip development costs in China (because of lower engineering salaries, avoidance of bleeding-edge semiconductor technologies below 20nm, and smaller, niche-oriented designs) and the widespread belief that tying software to silicon protects software value. Ironically, silicon development is now strikingly rare among Silicon Valley startups, driven partly by the high costs and long timelines for chip products, and partly by the comparative attractiveness of cloud software startups for investment, where the upfront costs are so much less, and countless new market niches seem to appear weekly.

Observation: The Chinese startups are almost entirely focused on real-time vision and audio applications, with only a supporting role for cloud-based deployment. Cars, human-machine interface and surveillance are the standout segments. DJI, the world’s biggest consumer drone maker, uses highly sophisticated deep learning for target tracking and gesture control. Most of the companies doing vision have capability and demos in identifying and tracking vehicles, pedestrians and bicycles, which applies to both automotive driver assistance/self-driving vehicles and surveillance. Top startups in the vision space include Cambricon, DeepGlint, Deephi, Emotibot, Megvii, Horizon Robotics, Intellifusion, Minieye, Momenta, MorphX, Rokid, SenseTime, and Zero Zero Robotics. Audio systems are also a big area, with particular emphasis on automated speech recognition, including increasing amounts of embedded real-time speech processing. Top startups here include AISpeech, Mobvoi, and Unisound.

Observation: China has a disproportionately large surveillance market, and correspondingly heavy interest in deep learning-based applications for face identification, pedestrian tracking, and crowd monitoring. China is already the world’s largest video surveillance market and has been among the fastest growing. Chinese suppliers describe usage scenarios of identifying and tracking criminals, but China does not have a serious conventional crime problem (both violent crime and property crime are well below US levels, for example). To some extent, “monitoring crime” is a code word for monitoring political disturbances, a deep obsession of the Chinese Communist Party. This is not the only driver for vision-based applications – face ID for access control and transaction verification are also important.

Over the course of ten days, I saw some particularly interesting startups:

  • Horizon Robotics [Beijing]: Horizon is a deep-learning powerhouse, led by Yu Kai. With 220 people, they are innovating across a broad front on vision systems, including smart home, gaming, voice-vision integration, and self-driving cars. They have also adopted a tight hardware-software integration strategy for more complete and efficient solutions.
  • Intellifusion [Shenzhen]: Intellifusion is a fairly complete video system supplier with close ties to the government organizations deploying public surveillance. They currently deploy their own servers with GPUs and FPGAs, but are moving increasing functionality into edge devices, like the cameras themselves.
  • NextVPU [Shanghai]: NextVPU is the youngest (about 12 months old) and smallest (24 people) of the startups I saw. They are pursuing AR, robotics and ADAS systems, but their first product, an AR headset for the visually impaired, is compelling in both a technical and a social sense. Their headset does full scene segmentation for pedestrian navigation and recognizes dozens of key objects – signs, obstacles and common home and urban elements – to help their users.
  • Deephi [Beijing]: Deephi is one of the most advanced and impressive of all the deep learning startups, with close ties to both the leading technical university in China, Tsinghua, and leading US research universities.   They have a particularly sophisticated understanding of what it takes to map leading edge neural networks into the small power, compute and memory resources of embedded devices, using world-class compression techniques. They are pursuing both surveillance (vision) and data-center (vision and speech) applications with a range of innovative programmable and optimized inference architectures.
  • SenseTime [Beijing]: SenseTime is one of the biggest and most visible software startups using deep learning for vision. They have impressive demos spanning surveillance, face recognition, demographic and mood analysis, and street-view identification and tracking of vehicles. They are sufficiently competent and confident to have developed their own training framework, Parrots, in lieu of Caffe, TensorFlow and the other standard platforms.
  • Megvii [Beijing]: Megvii is a prominent Chinese “unicorn” – a startup valued at US$1B+ – and is often known by the name of its leading application, Face++. Face++ is a sophisticated and widely used face ID environment, leveraging the Chinese government’s face database. This official and definitive database enables customer verification for transaction systems like Alibaba’s AliPay and DiDi’s leading ride-hailing system. They show an impressive collection of demos for recognition, augmented reality and “augmented photography”. Like many other Chinese companies, Megvii is moving functionality from the cloud to embedded devices, to improve latency, availability and security.
  • Bitmain [Beijing]: Bitmain is hardly a startup and is wildly successful in non-deep-learning applications, specifically cryptocurrency mining hardware. They have become the biggest supplier of ASICs for computational hashing, especially for Bitcoin, but are now spreading into rival currencies like Litecoin. Founded in 2013, they hit US$100M in revenue in 2014 and are on track to do US$500M this year. This revenue stream and profitability is allowing them to explore new areas, and deep learning platforms seem to be a candidate for further expansion.

Here’s a more complete list of top Chinese deep-learning startups:

| Name | Description |
|---|---|
| 4Paradigm | Scaled deep learning cloud platform |
| AISpeech | Real-time and cloud-based automated speech recognition for car, home and robot UI |
| Cambricon | Device and cloud processors for AI |
| DeepGlint | 3D computer vision and deep learning for human & vehicle detection, tracking and recognition |
| Deephi | Compressed CNN networks and processors |
| Emotibot | Multi-modal natural interaction interface between human and machine |
| Face++ | Face recognition |
| Horizon Robotics | Smart home, automotive and public safety |
| iCarbonX | Individualized health analysis and prediction of health index by machine analysis |
| Intellifusion | Cloud-based deep learning for public safety and industrial monitoring |
| Minieye | ADAS vision cameras and software |
| Mobvoi | Smart watch with voice search using cloud |
| Momenta | AI platform for Level 5 autonomous driving |
| MorpX | Computer vision and deep learning technologies for low-cost, low-power platforms |
| Rokid | Home personal assistant – ASR + face/gesture recognition |
| SeetaTech | Open-source development platform to enable enterprise vision and machine learning |
| SenseTime | Computer vision |
| TUPU | Image recognition technology and services |
| tuSimple | Software for self-driving cars: detection and tracking, flow, SLAM, segmentation, face analysis |
| Unisound | AI-based speech and text |
| YITU Technology | Computer vision for surveillance, transportation and medical imaging |
| Zero Zero Robotics | Smart following drone camera |
Of course, no one can claim to understand everything that’s happening in the vibrant Chinese startup community, least of all a non-native speaker. Nevertheless, everyone I spoke with in China validated this list of the top deep learning startups. Some were a bit surprised at the depth of the list, especially in identifying startups that were not yet on their radar. Both technically, and in exploring market trends, the China startup world is at the cutting edge in many areas. It bears close watching for worldwide impact.

The Cognite 300 Poster

Today I’m rolling out the Cognite 300 poster, a handy guide to the more focused and interesting startup companies in cognitive computing and deep learning.  I wrote about the ideas behind the creation of the list in an earlier blog posting:

Who are the most important start-ups in cognitive computing?

I have updated the on-line list every couple of months since the start of 2017, and will continue to do so, because the list keeps changing. Some companies close, some are acquired, some shift their focus. Most importantly, I continue to discover startups that belong on the list, so I will keep adding those, using approximately the same criteria of focus used for the first batch.

I should underscore that many potentially interesting companies haven’t gone on the list:

  • because it appears that AI is only a modest piece of their value proposition, or
  • because there is too little information on their websites to judge, or
  • because they are doing interesting work on capturing and curating big data sets (that may ultimately require deep learning methods) but don’t emphasize deep learning work themselves, or
  • because I just failed to understand the significance of the company’s offerings.

A few weeks ago, a venture capital colleague suggested that I should do a poster, to make the list more accessible and to communicate a bit of the big picture of segments and focus areas for these companies. I classified the companies (alas, by hand, not with a neural network 😉) into 16 groups:

  1. Sec – Security, Surveillance, Monitoring
  2. Cars – Autonomous Vehicles
  3. HMI – Human-Machine Interface
  4. Robot – Drones and Robots
  5. Chip – Silicon Platforms
  6. Plat – Deep Learning Cloud Compute/Data Platform and Services
  7. VaaS – Vision as a Service
  8. ALaaS – Audio and Language as a Service
  9. Mark – Marketing and Advertising
  10. CRM – Customer Relationship Management and Human Resources
  11. Manf – Operations, Logistics and Manufacturing
  12. Sat – Aerial Imaging and Mapping
  13. Med – Medicine and Health Care
  14. Media – Social Media, Entertainment and Lifestyle
  15. Fin – Finance, Insurance and Commerce
  16. IT – IT Operations and Security

I’ve also included an overlay of two broader categories, Vision and Embedded.  Many of the 16 categories fall cleanly inside or outside embedded and vision, but some categories include all the combinations.  A few companies span two of the 16 groups, so they are shown in both.

You may download and use the poster as you wish, so long as you reproduce it in its entirety, do not modify it and maintain the attribution to Cognite Ventures.

The Cognite 300 Startup Poster

Finally, I have also updated the list itself, including details on the classification of the startups by the 16 categories and the 2 broader classes, and identifying the primary country of operations. For US companies I’ve also included the primary state.

The Cognitive Computing Startup List

 

How to Start an Embedded Vision Company – Part 3

This is the third installment of my thoughts on starting an embedded vision company. In part 1, I focused on the opportunity, especially how the explosion in the number of image sensors was overwhelming human capacity to directly view all the potential image streams, creating a pressing need for an orders-of-magnitude increase in the volume and intelligence of local vision processing. In part 2, I shifted to a discussion of some core startup principles and models for teams in embedded vision or beyond. In this final section, I focus on how the combination of the big vision opportunity, plus the inherent agility (and weakness!) of startups, can guide teams to some winning technologies and business approaches.

Consider areas of high leverage on embedded vision problems:

  1. The Full Flow: Every step of the pixel flow, from the sensor interface, through the ISP, to the mix of video analytics (classical and neural-network-based), has an impact on vision system performance. Together with choices in training data, user interface, application targeting and embedded-versus-cloud application partitioning, these steps give an enormous range of options on vision quality, latency, cost, power, and functionality. That diversity of choices creates many potential niches where a startup can take root and grow, without having to attack huge obvious markets using the most mainstream techniques initially.
  2. Deep Neural Networks: At this point it is pretty obvious that neural network methods are transforming computer vision. However, applying neural networks in vision is much more than just doing ImageNet classification. It pays to invest in thoroughly understanding the variety of both discrimination methods (object identification, localization, tracking) and generation methods. Neural-network-based image synthesis may start to play a significant role in augmenting or even displacing 3D graphics rendering in some scenarios. Moreover, Generative Adversarial Network methods allow a partially trained discrimination network and a generation network to iterate through refinements that improve both networks automatically.
  3. Data Sets: To find, create and repurpose data for better training is half the battle in deep learning. Having access to unique data sets can be the key differentiator for a startup, and brainstorming new problems that can be solved with available large data sets is a useful discipline in startup strategy development. Ways to maximize data leverage may include the following:
    1. Create an engagement model with customers, so that their data can contribute to the training data set for future use. Continuous data bootstrapping, perhaps spurred by free access to cloud service, may allow creation of large, unique training data collections.
    2. Build photo-realistic simulations of the usage scenes and sequences in your target world. The extracted image sequences are inherently labeled by the underlying scene structures and can generate large training sets to augment real-world captured training data. Moreover, simulation can systematically cover rare but important combinations of object motion, lighting, and camera impairments for added system robustness. For example, the automotive technology startup AIMotive builds both sophisticated fused cognition systems from image, LiDAR and radar streams, and sophisticated driving simulators with accurate 3D worlds to train and test neural network-based systems.
    3. Some embedded vision systems can be designed as subsets of bigger, more expensive server-based vision systems, especially when neural networks of heroic scale are developed by cloud-based researchers. If the reference network is sufficiently better than the goals for the embedded system, the behavior of that big model can be used as “ground truth” for the embedded system. This makes generation of large training sets for the embedded version much easier.
    4. Data augmentation is a powerful method. If you have only a moderate amount of training data, you may be able to apply a series of transformations to the data that allow the prior labeling to be maintained. (We know a dog is still a dog, no matter how we scale it, rotate it or flip its image.) Be careful though – neural networks can be so discriminating that a network trained only on artificial or augmented data may respond only to such examples, however similar those examples may appear to real-world data in human perception. A minimal sketch of label-preserving augmentation appears after this list.
  4. New Device Types: The low cost and high intelligence of vision subsystems is allowing imaging-based systems in lots of new form-factors. These new device types may create substantially new vision problems. Plausible new devices include augmented reality headsets and glasses, ultra-small always-alert “visual dust” motes, new kinds of vehicles from semi trucks to internal body “drones”, and cameras embedded in clothing, toys, disposable medical supplies, packaging materials, and other unconventional settings. It may not be necessary in these new devices to deliver either fine images, or achieve substantial autonomy. Instead, the imagers may just be the easiest way to get a little bit more information from the environment or insight about the user.
  5. New Silicon Platforms: Progress in the new hardware platforms for vision processing, especially for deep neural network methods, is nothing less than breathtaking. We’re seeing improvements in efficiency of at least 3x per year, which translates into both huge gains in absolute performance at the high end, and percolation of significant neural network capacity into low cost and low power consumer-class systems. Of course, 200% per year efficiency growth cannot continue for very long, but it does let design teams think big about what’s possible in a given form-factor and budget. This rapid advance in computing capacity appears to be happening in many different product categories – in server-grade GPUs, embedded GPUs, mobile phone apps processors, and deeply embedded platforms for IoT. As just one typical example, the widely used Tensilica Vision DSP IP cores have seen the multiply rate – a reasonable proxy for neural network compute throughput – increase by 16x (64 to 1024 8x8b multiplies per cycle per core) in just over 18 months. Almost every established chip company doing system-on-chip platforms is rolling out significant enhancements or new architectures to support deep learning. In addition, almost 20 new chip startups are taking the plunge with new platforms, typically aiming at huge throughput to rival high-end GPUs or ultra high efficiency to fit into IoT roles. This wealth of new platforms will make choosing a target platform more complex, but will also dramatically increase the potential speed and capability of new embedded vision platforms.
  6. More Than Just Vision: When planning an embedded vision product, it’s important to remember that embedded vision is a technology, not an end application. Some applications will be completely dominated by their vision component, but for many others the vision channel will be combined with many other information channels. These may come from other sensors, especially audio and motion sensors, or from user controls, or from background data, especially cloud data. In addition, each vision node may be just one piece of a distributed application, so that node-to-node and node-to-cloud-to-node application coordination may be critical, especially in developing a wide assessment of a key issue or territory. Once all the channels of data are aggregated and analyzed, for example, through convolutional neural networks, what then? Much of the value of vision is in taking action, whether the action is real-time navigation, event alerts, emergency braking, visual or audio response to users, or updating of central event databases. In thinking about the product, map out the whole flow to capture a more complete view of user needs, dependencies on other services, computation and communication latencies and throughput bottlenecks, and competitive differentiators for the total experience.
  7. Point Solution to Platform: In the spirit of “crossing the chasm”, it is often necessary to define the early product as a solution for a narrow constituency’s particular needs. Tight targeting of a point solution may let you stand out in a noisy market of much bigger players, and reduce the integration risks faced by your potential early adopters. However, that also limits the scope of the system to just what you directly engineer. Opening up the interfaces and the business model to let both customers and third parties add functionality has two big benefits. First, it means that the applicability of your core technology can expand to markets and customers that you couldn’t necessarily serve with your finite resources to adapt and extend the product. Second, the more a customer invests their own engineering resources into writing code or developing peripheral hardware around your core product, the more stake they have in your success. Both practical and psychological factors make your product sticky. It turns a point product into a platform. Sometimes, that opening of the technology can leverage an open-source model, so long as some non-open, revenue-generating dimension remains. Proliferation is good, but is not the same as bookings. Some startups start with a platform approach, but that has challenges. It may be difficult to get customers to invest to build your interfaces into their system if you’re too small and unproven, and it may be difficult to differentiate against big players able to simply declare a “de facto industry standard”.
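Returning to the data augmentation point in item 3 above, here is a minimal label-preserving augmentation sketch. It uses plain NumPy flips and brightness changes as generic examples; the particular transforms and the 4x expansion factor are illustrative choices, not a recipe from any specific framework.

```python
import numpy as np

def augment(image, label):
    """Yield label-preserving variants of one training example.

    A mirrored or slightly brightened dog is still a dog, so the class label
    carries over unchanged. (For detection tasks, bounding boxes would need to
    be transformed along with the pixels.)
    """
    yield image, label                                                # original
    yield np.fliplr(image), label                                     # horizontal mirror
    yield np.clip(image * 1.2, 0, 255).astype(image.dtype), label     # brighter
    yield np.clip(image * 0.8, 0, 255).astype(image.dtype), label     # darker

# Example: expand a tiny labeled set 4x before training.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
augmented = list(augment(image, label="dog"))
print(len(augmented), "training examples from 1 original")
```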

Any startup walks a fine line between replicating what others have done before, and attempting something so novel that no one can appreciate the benefit. One useful way to look for practical novelty is to look at possible innovation around the image stream itself. Here are four ways you might think about new business around image streams:

  1. Take an existing image stream, and apply improved algorithms. For example, build technology that operates on users’ videos and does improved captioning, tagging and indexing.
  2. Take an existing image stream and extract new kinds of data beyond the original intent. For example, use outdoor surveillance video streams to do high-resolution weather reporting, or to look at traffic congestion.
  3. Take an existing image stream and provide services on it under new business models. For example, build software for user video search that doesn’t charge by copy or by subscription, but by success in finding specific events.
  4. Build new image streams by putting cameras in new places. For example, chemical refiners are installing IR cameras that can identify leaks of otherwise invisible gases. An agricultural automation startup, Blue River, is putting sophisticated imaging on herbicide sprayers, so that herbicides can be applied just on recognized weeds, not on crop plants or bare soil, increasing yields and reducing chemical use.

Thinking beyond just the image stream can be important too. Consider ways that cameras, microphones and natural language processing methods can be combined to get richer insights into the environment and users' intent.

  • Can the reflected sound of an aerial drone's blades give additional information for obstacle avoidance?
  • Can the sound of tires on the road surface give clues about driving conditions for autonomous cars?
  • Can the pitch and content of voices give indications of stress levels in drivers, or crowds in public places?

The breakdown below explores a range of application and function types that use multiple modes of sensing and analysis, across four application categories: autonomous vehicles and robotics; monitoring, inspection and surveillance; human-machine interface; and personal device enhancement.

  • Vision: multi-sensor fusion of image, depth and speed; environmental assessment; localization and odometry; full surround views; obstacle avoidance; attention monitoring; command interfaces; multi-mode automatic speech recognition; social photography; augmented reality.
  • Audio: ultrasonic sensing; acoustic surveillance; health and performance monitoring; mood analysis; command interfaces; automatic speech recognition in social media contexts; hands-free UI; audio geolocation.
  • Natural language: access control; sentiment analysis; command interfaces; real-time translation; local service bots; enhanced search.

The variety of vision opportunities is truly unbounded. The combination of inexpensive image sensors, huge cognitive computing capacity, rich training data and ubiquitous communications makes this time absolutely unique. Doing a vision startup is hard, just as any startup venture is hard. Finding the right team, the right market, the right product and the right money is never easy, but the rewards, especially the emotional, technical and professional ones, can be enormous.

Good luck!

How to Start an Embedded Vision Company – Part 2

In my previous blog post, I outlined how the explosion of high-resolution, low-cost image sensors is transforming the nature of vision, as we rapidly evolve to a world where most pixels are never seen by humans, but are captured, analyzed and used by embedded computing systems. This discontinuity is creating ample opportunities for new technologies, new business models and new companies. In this second part, we look at the basic ingredients of a startup and two rival models of how to approach building a successful enterprise.

Let’s look at the basic ingredients of starting a technology business – not just a vision venture. We might call this “Startup 101”. The first ingredient is the team.

  • You will need depth of skills. It is impossible to be outstanding in absolutely everything, but success does depend on having world-class capability in one or two disciplines, usually including at least one technology dimension. Without leadership somewhere, it’s hard to differentiate from others, and to see farther down the road on emerging applications.
  • You don’t need to be world-class in everything, but having a breadth of skills across the basics – hardware, software, marketing, sales, fund-raising, strategy, infrastructure – will help enormously in moving forward as a business. The hardware/software need is obvious and usually first priority. You have to be able to develop and deliver something useful, unique and functional. But sooner or later you’ll also need to figure out how to describe it to customers, make the product and company visible, and complete business transactions. You’ll also need to put together some infrastructure, so that you can hire and pay people, get reasonably secure network access and supply toilet paper in the bathrooms.
  • Some level of experience on the team is important. You don't need to be graybeards with rich startup and big-company track records, but some level of real-world history is enormously valuable. You need enough to avoid the rookie mistakes and to recognize the difference between a normal pothole and an existential crevasse. Potholes you do your best to avoid, but if you have to survive a bump, you can. A bit of experience can alert you when you're approaching the abyss, so you can do everything possible to get around it. Is there a simple formula for recognizing those crevasses? Unfortunately, no (but they often involve core team conflict, legal issues, or cash flow). Startups throw a lot of issues, big and small, at the leadership team, so there will be plenty of opportunity to gain experience along the way.
  • The last key dimension of team is perhaps the most important, but also the most nebulous – character. Starting a company is hard work, with plenty of stress and emotion, because of the stakes. A team that is capable and committed to openness, patience and honesty will perform better, last longer and have more fun than other teams. That does NOT mean that the team should agree all the time – part of the point of constructing a team with diverse skills is to get good "parallax perspective" on the thorniest problems. It DOES mean trusting one another to do their jobs, being willing to ask tough questions about assumptions and methods, and working hard for the common effort. More than anything, it means putting ego and individual glory aside.

The second key ingredient for a startup is the product. Every startup’s product is different (or it had better be!), but here are four criteria to apply to the product concept:

  1. The product should be unique in at least one major dimension.
  • The uniqueness could be functionality – the product does something that wasn't possible before, or it does a set of functions together that weren't possible before.
  • The uniqueness could be performance – it does a known job faster, at lower power, cheaper or in a smaller form-factor than anyone else.
  • The uniqueness could be the business or usage model – it allows a task to be done by a new – usually less sophisticated – customer, or lets the customer pay for it in a different way.
  2. Building the product must be feasible. It isn't enough just to have a MATLAB model of a great new vision algorithm – you need to make it work at the necessary speed, and fit in the memory of the target embedded platform, with all the interfaces to cameras, networks, storage and other software layers.
  3. The product should be defensible. Once others learn about the product, can they easily copy it? When you work with customers on real needs, will you be able to incorporate improvements more rapidly and more completely than others? Can you gather training data and interesting usage cases more quickly? Can you protect your code, your algorithms, and your hardware design from overt cloners?
  4. You should be able to explain the product relative to the competition. In some ideal world, customers would be able to appreciate the magnificence of your invention without any point of comparison – they would instantly understand how to improve their lives by buying your product. In that ideal world you would have no competition. In the long run, you ideally want to so dominate your segment that no one else comes close. However, if you have no initial reference point – no competition – you may struggle to discover and explain the product's virtues. Having some competition is not a bad thing – it gives a preexisting reference point by which the performance, functionality and usage-model breakthrough can be made vivid to potential customers. In fact, if you think you have no competition, you should probably go find some, at least for the purpose of showing the superiority of your solution.

The third key ingredient for a startup is the target market: the group of users plus the context for use. Think “teenage girls” + “shopping for clothes” or “PCB layout designers” + “high frequency multi-layer board timing closure”.

Finding the right target market for a new technology product faces an inherent dilemma. In the early going, it is not hard to find a group of technology enthusiasts who will adopt the product because it is new, cool and powerful. They have an easy time picturing how it may serve their uses and are comfortable with the hard work of adapting the inherent power of your technology to their needs. Company progress often stalls, however, once this small population of early adopters has embraced the product. The great majority of users are not looking for an application or integration challenge – they just want to get some job done. They may tolerate new technology from an untried startup, but only if it clearly addresses their use case. This transition to mainstream users has been characterized by author Geoffrey Moore as "crossing the chasm". The recognized strategy for getting into wider use is to narrow the focus to a smaller group of mainstream customers, often by solving the problem more completely for one vertical application or one specific usage scenario. "Going vertical" puts the fear (and hypothetical potential) of the technology into the background and emphasizes the practical and accessible benefits of the superior solution.

This narrowing of focus, however, can sometimes create a dilemma in working with potential investors. Investors, especially big VCs, want to hit home runs by winning huge markets. They don't want narrow niche plays. The highly successful investor Peter Thiel dramatizes this point of view by saying "competition is for losers", meaning that growing and dominating a new market can be much more profitable than participating in an existing commodity market.

The answer is to think about, and where appropriate talk about, the market strategy on two levels. First identify huge markets that are still under-served or latent. Then focus on an early niche within that emerging market which can be dominated by a concentrated effort, where the insights and technologies needed to master the niche are excellent preparation for larger and larger surrounding use-cases within the likely huge market. Talking about both the laser focus on the "beachhead" initial market AND the setup for leadership in the eventual market can often resolve the apparent paradox.

The accumulated wisdom of startup methods is evolving continuously, both as teams refine what works, and as the technology and applications create new business models [think advertising], new platforms [think applications as a cloud service], new investment methods [think crowd funding] and new team structures [think gig economy]. The software startup world, in particular, has been dramatically influenced by the "Lean Startup" principle. This idea has evolved over the past fifteen years, spawned by the writing of Steve Blank more than by any other single individual. It contrasts in key ways with the long-standing model, which we can call "Old School".

Old School vs. Lean Startup:

  • Funding. Old School: Seed Round based on team and idea, A Round to develop the product, B Round after revenue. Lean Startup: develop a prototype to get the Seed Round, A Round after revenue, B Round, if any, for global expansion.
  • Product types. Old School: hardware/software systems and silicon. Lean Startup: easiest with software.
  • Customer acquisition. Old School: develop a sales and marketing organization to sell direct or build a channel. Lean Startup: the CEO and CTO are the chief salespeople until product and revenue potential are proven in the market.
  • Business models. Old School: mostly B2B with large transactions. Lean Startup: web-centric B2B and B2C with subscriptions and micro-transactions.

In vastly simplified form, the Lean Startup model is built on five elements of guidance:

  1. Rapidly develop a Minimum Viable Product (MVP) – the simplest product-like deliverable that customers will actually use in some form. Getting engaged with customers as early as possible gives you the kind of feedback on real problems that you cannot get from theoretical discussions. It gives you a chance to concentrate on the most customer-relevant features and skip the effort on features that customers are less likely to care about.
  2. Test prototypes on target users early and often – Once you have an MVP, you have a great platform to evolve incrementally. If you can figure out how to float new features into the product without jeopardizing basic functionality, then you can do rapid experimentation on customer usage. This allows the product to evolve more quickly.
  3. Measure market and technical progress dispassionately and scientifically – New markets and technologies often don't follow old rules of thumb, so you may need to develop new, more appropriate metrics of progress for both. Methods like A-B testing of alternative mechanisms can give quick feedback on real customer usage, and enhance a sense of honesty and realism in the effort (a small sketch of evaluating such a test appears after this list).
  4. Don't take too much money too soon – Taking money from investors is an implied commitment to deliver predictable returns in a predictable fashion. If you try to make that promise too early, people won't believe you, so you won't get the funding. Even if you can convince investors to fund you, taking money too early may make you commit to a product before you really know what works best. In some areas, like cloud software, powerful ideas can sometimes be developed and launched by small teams, so that little funding is absolutely necessary in the early days. Startup and funding culture have evolved together so that teams often don't expect to get outside funding until they have their MVP. Some teams expect to be eating ramen noodles for many months.
  5. Leverage open source and crowd source thinking – It is hard to overstate the impact of open source on many areas of software. The combination of compelling economics, rapid evolution and vetting by a wide base of users makes open source important in two ways – as a building block within your own technical development strategy, and as part of a proliferation strategy that creates a wider market for your product. Crowd sourcing represents an analogous method to harness wide enthusiasm for your mission or product to gather and refine content, generate data and get startup funds.
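
To make the A-B testing idea in item 3 concrete, here is a minimal sketch of how a team might compare two feature variants by their observed usage rates. It is only an illustration: the two-proportion z-test shown is one common choice, and the counts, function name and data are hypothetical, not anything prescribed by the Lean Startup literature.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Compare usage rates of variants A and B with a two-proportion z-test."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: users who engaged with each variant of a new feature
z, p = two_proportion_z_test(hits_a=48, n_a=400, hits_b=74, n_b=410)
print(f"z = {z:.2f}, p = {p:.3f}")  # a small p-value suggests a real difference
```

The point is less the particular statistic than the habit: define the metric before the experiment, collect enough users in each arm, and let the numbers, not enthusiasm, decide which variant survives.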

As these methods have grown up in the cloud software world, they do not all automatically apply to embedded vision startups. Some technologies, like new silicon platforms, require such high upfront investments and expensive iterations that deferring funding or iterating customer prototypes may not be viable. In addition, powerful ideas like live (and unannounced) A-B testing on customers will not be acceptable for some embedded products, especially in safety-critical applications. The lean methods here work most obviously in on-line environments, with large numbers of customers and small transactions.   A design win for an embedded system may have much greater transaction value than any single order in on-line software, so the sales process may be quite different, with a significant role for well-orchestrated sales and marketing efforts with key customers. Nevertheless, we can compare typical “Old School” and “Lean Startup” methods across key areas like funding, product types, methods for getting customers and core business models.

How to Start an Embedded Vision Company — Part 1

Part 1: Why Vision

 

Since I started Cognite Ventures eight months ago, my activity with startup teams has ramped up dramatically. Many of these groups are targeting some kind of embedded vision application, and many want advice on how to succeed. This conversation developed into an idiosyncratic set of thoughts on vision startup guidance, which in turn spawned a talk at the Embedded Vision Summit that I'm now expanding as a blog. You can find the slides here, but I will also break this conversation down into a three-part article.

Please allow me to start with some caveats! Every startup is different, every team is different, and the market is constantly evolving – so there is no right answer. Moreover, I have had success in my startups, especially Tensilica, but I can hardly claim that I have succeeded just because of following these principles. I have been blessed with an opportunity to work with remarkable teams, whose own talent, energy and principles have been enormously influential on the outcome. To the extent that I have directly contributed to startup success, is it because of applying these ideas? Or in spite of these ideas? Or just dumb luck?

I believe the current energy around new ventures in vision comes from two fundamental technical and market trends. First, the cost of capturing image streams has fallen dramatically. I can buy an HD-resolution security camera with IR illumination and an aluminum housing for $13.91 on Amazon. This implies that the core electronics – CMOS sensor, basic image signal processing and video output – probably cost about two dollars at the component level. This reflects the manufacturing learning curve from the exploding volume of cameras. It's useful to compare the trend for the population of people with the population of cameras on the planet, based on SemiCo data on image sensors from 2014 and assuming each sensor has a useful life in the wild of three years.

What does it say? First, it appears that the number of cameras crossed over the number of people sometime in the last year. This means that even if every human spent every moment of every day and night watching video, a significant fraction of the output of these cameras would go unwatched. Of course, many of these cameras are shut off, or sitting in someone's pocket, or watching complete darkness at any given time. Nevertheless, it is certain that humans will very rarely see the captured images. If installing or carrying those cameras around is going to make any sense, it will be because we use vision analysis to filter, select or act on the streams without human involvement in every frame.

But the list of implications goes on!

  • We now have more than 10B image sensors installed. If each can produce an HD video stream of 1080p60, we have potential raw content of roughly 100M pixels per second per camera, or about 10^18 new pixels per second, or something more than 10^25 bytes per year of raw pixel data (the arithmetic is sketched after this list). If, foolishly, we tried to keep all the raw pixels, the storage requirement would exceed the annual production of hard disk plus NAND flash by a factor of roughly 10,000. Even if we compressed the video down to 5 Mbps, we would fill up a year's supply of storage by sometime on January 4 of the next year. Clearly we're not going to store all that potential content. (Utilization and tolerable compression rates will vary widely by type of camera – the camera on my phone is likely to be less active than a security camera, and some security cameras may get by on less than 5 Mbps, but the essential problem remains.)
  • Where do new bits come from? New bits are captured from the real world, or "synthesized" from other data. Synthesized data includes credit card transactions, packet headers, stock trades, emails, and other data created within electronic systems as a byproduct of applications. Real-world data can be pixels from cameras, or audio samples from microphones, or accelerometer data from MEMS sensors. Synthetic data is ultimately derived from real-world data, through the transformations of human interaction, economic transactions and sharing. Audio and motion sensors are rich sources of data, but their data rates are dramatically less – three to five orders of magnitude less – than that of even cheap image sensors. So virtually all of the real data of the world – and an interesting fraction of all electronic data – is pixels.
  • The overwhelming volume of pixels has deep implications for computing and communications. Consider that $13.91 video camera. Even if we found a way to ship that continuous video stream up to the cloud, we couldn't afford to use some x86 or GPU-enabled server to process all that content – over the life of that camera, we could easily spend thousands of dollars on the hardware (and power) dedicated to that video channel. Similarly, 5 Mbps of compressed video * 60 seconds * 60 minutes * 24 hours * 30 days is 12,960 Gbits per month. I don't know about your wireless plan, but that's more than my cellular plan absorbs easily. So it is pretty clear that we're not going to be able either to do the bulk of the video analysis on cloud servers or to communicate it via cellular. Wi-Fi networks may have no per-bit charges, and greater overall capacity, but wireless infrastructure will have trouble scaling to handle tens of billions of streams. We must find ways to do most of the computing on embedded systems, so that no video, or only the most salient video, is sent to the cloud for storage, further processing or human review and action.
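
To make these back-of-envelope numbers easy to check or adjust, here is a minimal sketch of the arithmetic behind the two bullets above. The camera count, per-pixel byte estimate and bit rates are the rough assumptions stated in the text, not measured data.

```python
# Back-of-envelope arithmetic for worldwide pixel volume and per-camera bandwidth.
# All inputs are rough assumptions taken from the estimates above.

CAMERAS = 10e9                       # ~10B installed image sensors
PIXELS_PER_SEC = 1920 * 1080 * 60    # one 1080p60 stream, ~124M pixels/s
SECONDS_PER_YEAR = 365 * 24 * 3600
BYTES_PER_PIXEL = 1                  # assumed for raw, uncompressed capture

raw_pixels_per_sec = CAMERAS * PIXELS_PER_SEC
raw_bytes_per_year = raw_pixels_per_sec * SECONDS_PER_YEAR * BYTES_PER_PIXEL

print(f"raw pixels per second, worldwide: {raw_pixels_per_sec:.1e}")  # ~1e18
print(f"raw bytes per year, worldwide:    {raw_bytes_per_year:.1e}")  # >1e25

# One camera's compressed stream at 5 Mbps, accumulated over a 30-day month
MBPS = 5
gbits_per_month = MBPS * 3600 * 24 * 30 / 1000
print(f"one 5 Mbps camera, per month: {gbits_per_month:,.0f} Gbits")  # ~12,960 Gbits
```

Under these assumptions the totals land close to the 10^18 pixels per second and 10^25-plus bytes per year quoted above, and a single always-on camera alone would swamp a typical cellular data plan.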

The second reason for the enthusiasm for vision is the revolution in computation methods for extracting insights from image streams. In particular, the emergence of convolutional neural networks as a key analytical building block has dramatically improved the potential for vision systems to extract subtle, insightful results from complex, noisy image streams. While no product is just a neural network, the increasingly well-understood vocabulary of gathering and labeling large data sets, constructing and training neural networks, and deploying those networks onto efficient embedded hardware has become part of the basic language of vision startups.
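
To make that vocabulary concrete, here is a minimal sketch of the kind of small convolutional classifier a vision team might train on labeled images before worrying about embedded deployment. The layer sizes, the PyTorch framework and the random stand-in data are my own assumptions for illustration, not a recipe from any particular product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisionNet(nn.Module):
    """A small CNN classifier: the kind of block trained on labeled images and
    later optimized (pruned, quantized) for an embedded target."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                      # x: (batch, 3, 64, 64)
        x = self.pool(F.relu(self.conv1(x)))   # -> (batch, 16, 32, 32)
        x = self.pool(F.relu(self.conv2(x)))   # -> (batch, 32, 16, 16)
        return self.fc(x.flatten(1))

model = TinyVisionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a stand-in batch of labeled 64x64 images
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 10, (8,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The real engineering work, of course, is in gathering and labeling the data and in squeezing the trained network into the latency, memory and power budget of the embedded platform.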

When we reflect these observations back onto the vision market, we can discern three big useful categories of applications:

  1. Capture of images and video for human consumption. This includes everything from fashion photography and snapshots posted on Facebook to Hollywood films and document scanning. This is the traditional realm of imaging, and much of the current technology base – image signal processing pipelines, video compression methods and video displays – is built around particular characteristics of the human visual system. This area has been the mainstay of digital imaging and video-related products for the past two decades. Innovation in new higher-resolution formats, new cameras and new image enhancement remains a plausible area for startup activity even today, but it is not as hot as it has been. While this area has been the home of classical image enhancement methods, there is ample technical innovation in this category, for example in new generative neural network models that can synthesize photo-realistic images.
  2. Capture of images and video, then filtering, reducing and organizing into a concise form for human decision-making. This category includes a wide range of vision processing and analytics technologies, including most activity in video monitoring and surveillance. The key here is often to make huge bodies of video content tagged, indexed and searchable, and to filter out irrelevant content so that only a tiny fraction needs to be uploaded, stored, reviewed or more exhaustively analyzed (a simple filtering sketch appears below). This area is already active, but we should expect even more, especially as teams work to exploit the potential for joint analytics spanning many cameras simultaneously. Cloud applications are particularly important in this area, because of the cloud's storage, computing and collaboration flexibility.
  3. Capture of images and video, analyzing and then using insights to take autonomous action. This domain has captured the world's imagination in recent years, especially with the success of autonomous vehicle prototypes and smart aerial drones. The rapid advances in convolutional neural networks are particularly vivid and important in this area, as vision processing becomes accurate and robust enough to trust with decision-making in safety-critical systems. Key characteristics of these systems are short latency, robustness and hard real-time performance. System architects will rely on autonomous vision systems only to the extent that the systems can guarantee short decision latency and ~100% availability.

Needless to say, some good ideas may be hybrids of these three, especially in systems that use vision for some simple autonomous decision-making, but rely on humans for backup, or for more strategic decisions, based on the consolidated data.
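
As an illustration of the second category's "upload only the salient fraction" idea, here is a minimal frame-differencing filter. It is only a sketch: the threshold, the file name and the choice of OpenCV are my own assumptions, and a production system would use far more sophisticated analytics (motion models, object detection, multi-camera correlation).

```python
import cv2

def salient_frames(video_path, diff_threshold=12.0):
    """Yield only frames that differ noticeably from the last kept frame,
    so just a small fraction of the stream needs uploading or review."""
    cap = cv2.VideoCapture(video_path)
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (5, 5), 0)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            prev_gray = gray
            yield frame          # candidate for upload or further analysis
    cap.release()

# Hypothetical usage: count how many frames of a clip would be kept
kept = sum(1 for _ in salient_frames("driveway.mp4"))
print(f"kept {kept} salient frames")
```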

In the next part of the article, we’ll take a look at the ingredients of a startup – team, product and target market – and look at some cogent lessons from the “lean startup” model that rules software entrepreneurship today.

 

What’s happening in startup funding?

I've spent the last few months digging into the intersection between the ongoing deep learning revolution and the worldwide opportunity for startups. This little exercise has highlighted both how the startup funding world is evolving, and some of the unique issues and opportunities for deep-learning-based startups.

Looking at some basic funding trends is a good place to start. Pitchbook has just published an excellent public summary of key quantitative trends in US startup funding: http://pitchbook.com/news/reports/2016-annual-vc-valuations-report

These show the growth in seed funding levels and valuations, the stretching out of the pre-seed stage for companies, and a reduction in overall funding activity from the exceedingly frothy levels of 2015.

Let’s look at some key pictures – first seed funding:

That’s generally a pretty comforting trend – seed round funding levels and valuations increasing steadily over time, without direct signs of a funding bubble or “irrational enthusiasm”.   This says that strong teams with great ideas and demonstrated progress on their initial product (their Minimum Viable Product or “MVP”) are learning from early trial customers, getting some measurable traction and able to articulate a differentiated story to seed investors.

A second picture gives a more sobering angle – time to funding:

This picture suggests that the timeline for progressing through the funding stages is stretching out meaningfully. In particular, it says that it is taking longer to get to seed funding – now more than two years. How do startups operate before seed? I think the answer is pre-seed angel funding, "friends-and-family" investment, credit cards and a steady diet of ramen noodles ;-). This means treating the minimally funded startup not as a transitory moment but as a lifestyle. It takes toughness and faith.

That commitment to toughness has been codified as the concept of the Lean Startup. In the "good old days" a mainstream entrepreneur has an idea, assembles a stellar team, raises money, rents space, buys computers, a phone system, networks and cubicles, builds prototypes, hires sales and marketing people and takes a product to market. And everyone hopes customers will buy it just as they are supposed to. The Lean Startup model turns that around – an entrepreneur has an idea, gathers two talented technical friends, uses their old laptops and an AWS account, builds prototypes and takes themselves to customers. They iterate on customer-mandated features for a few months and take it to market as a cloud-based service. Then they raise money. More ramen-eating for the founding team, less risk for the investors, and better return on investment overall.

Some kinds of technologies and business models fit the Lean Startup model easily – almost anything delivered as software, especially in the cloud or in non-mission-critical roles.  Some models don’t fit so well – it is tough to build new billion-transistor chips on just a ramen noodle budget, and tough to get customers without a working prototype.  So the whole distribution of startups has shifted in favor of business models and technologies that look leaner.

If you’re looking for sobering statistics, the US funding picture shows that funding has retreated a bit from the highs of 2015 and early 2016.

Does that mean that funding is drying up? I don’t think so. It just makes things look like late 2013 and early 2014, and certainly higher than 2011 and 2012. In fact, I believe that most quality startups are going to find adequate funding, though innovation, “leanness” and savvy response to emerging technologies all continue to be critically important.

To get a better idea of the funding trend, I dug a bit deeper into one segment – computer vision and imaging – that I feel may be representative of a broad class of emerging technology-driven applications, especially as investment shifts towards artificial intelligence in all its forms.

For this, I mined Crunchbase, the popular startup funding event database and service, to get a rough picture of what has happened in funding over the past five years. It's quite hard to get unambiguous statistics from a database like this when your target technology or market criteria don't neatly fit the predefined categories. You're forced to resort to description-text keyword filtering, which is slow and imperfect. Nevertheless, a systematic set of keyword filters can give good relative measures over time, even if they can't give very good absolute numbers. Specifically, I looked at the number of funding deals, and the number of reported dollars, for fundings in embedded vision (EV) companies in each quarter over the past five years, as reported in Crunchbase and as filtered down to represent the company's apparent focus. (It's not trivial. Lots of startups' descriptions talk, for example, about their "company vision", but that doesn't mean they're in the vision market ;-). The quarter-by-quarter numbers jump around a lot, of course, but the linear trend is pretty clearly up and to the right. This data seems to indicate a healthy level of activity and a healthy funding climate for embedded vision.
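
For readers curious what that kind of keyword filtering looks like in practice, here is a minimal sketch over a hypothetical CSV export of funding rounds. The column names, keyword list and file name are my own assumptions for illustration; they are not the actual Crunchbase schema or the exact filter I used.

```python
import csv
from collections import Counter

# Hypothetical export with columns: company, description, quarter, amount_usd
KEYWORDS = ("embedded vision", "computer vision", "image recognition", "video analytics")

deals_per_quarter = Counter()
dollars_per_quarter = Counter()

with open("funding_rounds.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        text = row["description"].lower()
        # Crude keyword match; beware false hits like "our company vision is..."
        if any(keyword in text for keyword in KEYWORDS):
            quarter = row["quarter"]
            deals_per_quarter[quarter] += 1
            dollars_per_quarter[quarter] += float(row["amount_usd"] or 0)

for quarter in sorted(deals_per_quarter):
    print(quarter, deals_per_quarter[quarter], f"${dollars_per_quarter[quarter]:,.0f}")
```

Counts built this way are noisy in absolute terms, but applied consistently across quarters they give the kind of relative trend described above.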

I'd say that the overall climate for technologies related to cognitive computing – AI, machine learning, neural networks, computer vision, speech recognition, natural language processing and their myriad applications – continues to look healthy as a whole as well.

In parallel with this look at funding, I’ve also been grinding away at additions, improvements, corrections and refinements on the Cognitive Computing Startup List. I’ve just made the third release of that list. Take a look!

 

 

A global look at the cognitive computing start-up scene

I published the first version of my cognitive computing startup list about six weeks ago.  As I poked around further, and got some great questions from the community, I discovered a range of new resources on deep learning and AI startups, and literally thousands of new candidates.  In particular, I started using Crunchbase as a resource to spread my net further for serious cognitive computing companies.  If you simply search their database for companies that mention artificial intelligence somewhere in their description, you get about 2200 hits.  Even the Crunchbase category of Artificial Intelligence companies has more than 1400 companies currently.

As I described in the first release, the majority of companies in the AI category, while having generally interesting or even compelling propositions, are using true cognitive computing as just a modest element of some broader product value, or may be playing up the AI angle because it is so sexy right now. Instead, I really tried to identify those companies operating on inherently huge data analytics and generation problems, which have a tight focus on automated machine learning, and whose blogs and job postings suggest depth of expertise and commitment to machine learning and neural network methods.

I also found other good lists of AI-related startups, like MMC Ventures' "Artificial Intelligence in the UK: Landscape and learnings from 226 startups":

https://medium.com/mmc-writes/artificial-intelligence-in-the-uk-landscape-and-learnings-from-226-startups-70b9551f3e4c#.l7elokutt

and the Chinese Geekpark A100 list of worldwide startups:

http://www.geekpark.net/topics/217003

With all this, I could filter the vast range of startups down to about 275 that seem to represent the most focused, the most active and the most innovative, according to my admittedly idiosyncratic criteria.

The geographical distribution is instructive. Not surprisingly, about half are based in the US, with two-thirds of the US start-ups found in California. More surprising is that the UK is a strong second, with more than 20% of the total, followed by China and Canada. I was somewhat surprised to find China with just 8% of the startups, so I asked a number of colleagues to educate me about cognitive computing startups in China. This yielded a few more important entrants, but China still lags behind the UK in cognitive computing startups.

I have split the list a number of different ways, identifying those

  • with a significant focus on embedded systems (not just cloud-based software): 82 companies
  • working primarily on imaging and vision-based cognitive computing: 125 companies
  • doing embedded vision: 74 companies

Within embedded vision, you’ll find 10 or more each focused on surveillance, autonomous cars, drones and robotics, human-machine interface, and new silicon platforms for deep learning.  It’s a rich mix.

Stay tuned for more on individual companies, startup strategies and trends in the different segments of cognitive computing.  And take a look at the list!