This is the third installment of my thoughts on starting an embedded vision company. In part 1, I focused on the opportunity, especially how the explosion in the number of image sensors was overwhelming human capacity to directly view all the potential image streams, and creating a pressing need for orders-of-magnitude increase in the volume and intelligence of local vision processing. In part 2, I shifted to a discussion of some core startup principles and models for teams in embedded vision or beyond. In this final section, I focus on how the combination of the big vision opportunity, plus the inherent agility (and weakness!) of startups can guide teams to some winning technologies and business approaches.
Consider areas of high leverage on embedded vision problems:
- The Full Flow: Every step of the pixel flow from sensor interface, through the ISP, to the mix video analytics (classical and neural network-based) has impact on vision system performance. Together with choices in training data, user interface, application targeting and embedded vs. cloud application partitioning give an enormous range of options on vision quality, latency, cost, power, and functionality. That diversity of choices creates many potential niches where a startup can take root and grow, without having to attack huge obvious markets using the most mainstream techniques initially.
- Deep Neural Networks: At this point it is pretty obvious to that neural network methods are transforming computer vision. However, applying neural networks in vision is much more than just doing ImageNet classification. It pays to invest in thoroughly understanding the variety of both discrimination methods (object identification, localization, tracking) and generation methods. Neural-network-based image synthesis may start to play a significant role in augmenting or even displacing 3D graphics rendering in some scenarios. Moreover, Generative Adversarial Network methods allow a partially trained discrimination network and a generation network to iterate through refinements that improve both networks automatically.
- Data Sets: To find, create and repurpose data for better training is half the battle in deep learning. Having access to unique data sets can be the key differentiator for a startup, and brainstorming new problems that can be solved with available large data sets is a useful discipline in startup strategy development. Ways to maximize data leverage may include the following:
- Create an engagement model with customers, so that their data can contribute to the training data set for future use. Continuous data bootstrapping, perhaps spurred by free access to cloud service, may allow creation of large, unique training data collections.
- Build photo-realistic simulations of the usage scenes and sequences in your target world. The extracted image sequences are inherently labeled by the underlying scene structures and can generate large training sets to augment real world image captured training data. Moreover, simulation can systematically cover rare but important combinations of object motion, lighting, and camera impairments for added system robustness. For example the automotive technology startup up, AIMotive, builds both sophisticated fused cognition systems from image, LiDAR and radar streams, and sophisticated driving simulators with accurate 3D world to train and test neural network-based systems.
- Some embedded vision systems can be designed as subsets of bigger, more expensive server-based vision systems, especially when neural networks of heroic scale are development by cloud-based researchers. If the reference network is enough better than the goals for the embedded system, the behavior of that big model can be used as “ground truth” for the embedded system. This makes generation of large training sets for the embedded version much easier.
- Data augmentation is a powerful method. If you have only a moderate amount of training data, you may be able to apply a series of transformations to the data and allow prior labeling to be maintained. (We know a dog is still a dog, no matter how we scale it, rotate it or flip its image.) Be careful though – neural networks can be so discriminating that a network trained on artificial or augmented data, may only respond to such example, however similar those examples may be to real world data, in human perception.
- New Device Types: The low cost and high intelligence of vision subsystems is allowing imaging-based systems in lots of new form-factors. These new device types may create substantially new vision problems. Plausible new devices include augmented reality headsets and glasses, ultra-small always-alert “visual dust” motes, new kinds of vehicles from semi trucks to internal body “drones”, and cameras embedded in clothing, toys, disposable medical supplies, packaging materials, and other unconventional settings. It may not be necessary in these new devices to deliver either fine images, or achieve substantial autonomy. Instead, the imagers may just be the easiest way to get a little bit more information from the environment or insight about the user.
- New Silicon Platforms: Progress in the new hardware platforms for vision processing, especially for deep neural network methods, is nothing less than breathtaking. We’re seeing improvements in efficiency of at least 3x per year, which translates into both huge gains in absolute performance at the high end, and percolation of significant neural network capacity into low cost and low power consumer-class systems. Of course, 200% per year efficiency growth cannot continue for very long, but it does let design teams think big about what’s possible in a given form-factor and budget. This rapid advance in computing capacity appears to be happening in many different product categories – in server-grade GPUs, embedded GPUs, mobile phone apps processors, and deeply embedded platforms for IoT. As just one typical example, the widely used Tensilica Vision DSP IP cores have seen the multiply rate – a reasonable proxy for neural network compute throughput – increase by 16x (64 è 1024 8x8b multiplies per cycle per core) in just over 18 months. Almost every established chip company doing system-on-chip platforms is rolling out significant enhancements or new architectures to support deep learning. In addition, almost 20 new chip startups are taking the plunge with new platforms, typically aiming at huge throughput to rival high-end GPUs or ultra high efficiency to fit into IoT roles. This wealth of new platforms will make choosing a target platform more complex, but will also dramatically increase the potential speed and capability of new embedded vision platforms.
- More Than Just Vision: When planning an embedded vision product, it ‘s important to remember that embedded vision is a technology, not an end application. Some applications will be completely dominated by their vision component, but for many others the vision channel will be combined with many other information channels. This may come from other sensors, especially audio and motion sensors, or from user controls or from background data, especially cloud data. In addition, each vision node may be just one piece of a distributed application, so that node-to-node and node-to-cloud-to-node application coordination may be critical, especially in developing a wide assessment of a key issue or territory. Once all the channels of data are aggregate and analyzed, for example, through convolutional neural networks, what then? Much of the value of vision is in taking action, whether the action is real-time navigation, event alerts, emergency braking, visual or audio response to users, or updating of central event databases. In thinking about the product, map out the whole flow to capture a more complete view of user needs, dependencies on other services, computation and communication latencies and throughput bottlenecks, and competitive differentiators for the total experience.
- Point Solution to Platform: In the spirit of “crossing the chasm” it is often necessary to define the early product as a solution for a narrow constituency’s particular needs. Tight targeting of a point solution may let you stand out in a noisy market of much bigger players, and to reduce the integration risks faced by your potential early adopters. However, that also limits the scope of the system to just what you directly engineer. Opening up the interfaces and the business model to let both customers and third parties add functionality has two big benefits. First, it means that the applicability of your core technology can expand to markets and customers that you couldn’t necessarily serve with your finite resources to adapt and extend the product. Second, the more a customer invests their own engineering resources into writing code or developing peripheral hardware around your core product, the more stake they have in your success. Both practical and psychological factors make your product sticky. It turns a point product into a platform. Sometimes, that opening of the technology can leverage an open-source model, so long as some non-open, revenue-generating dimension remains. Proliferation is good, but is not the same as bookings. Some startups start with a platform approach, but that has challenges. It may be difficult to get customers to invest to build your interfaces into their system if you’re too small and unproven, and it may be difficult to differentiate against big players able to simply declare an “de facto industry standard”.
Any startup walks a fine line between replicating what others have done before, and attempting something so novel that no one can appreciate the benefit. One useful way to look for practical novelty is to look at possible innovation around the image stream itself. Here are four ways you might think about new business around image streams:
- Take an existing image stream, and apply improved algorithms. For example, build technology that operates on user’s videos and does improved captioning, tagging and indexing.
- Take and existing image stream and extract new kinds of data beyond the original intent. For example, use outdoor surveillance video streams to do high resolution weather reporting, or look at traffic congestion.
- Take an existing image stream and provide services on it under new business models. For example, build a software for user video search that doesn’t charge by copy or by subscription, but by success in finding specific events
- Build new image streams by putting cameras in new places. For example, chemical refiners are installing IR cameras that can identify leaks of otherwise invisible gases. A agricultural automation startup, Blue River, is putting sophisticated imaging on herbicide sprayers, so that herbicides can be applied just on recognized weeds, not on crop plants or bare soil, increasing yields and reducing chemical use.
Thinking beyond just the image stream can be important too. Consider ways that cameras, microphones and natural language processing methods can be combined to get richer insights into the environment and users intent.
- Can the reflected sound of a aerial drone’s blades give additional information for obstacle avoidance?
- Can the sound of tires on the road surface give clues about driving conditions for autonomous cars?
- Can the pitch and content of voices give indications of stress levels in drivers, or crowds in public places?
The figure below explores a range of application and functions types using multiple modes of sensing and analysis
|Autonomous Vehicles and Robotics||Monitoring, Inspection and Surveillance||Human-Machine Interface||Personal Device Enhancement|
|Vision||· Multi-sensor: image, depth, speed
· Environmental assessment
· Localization and odometry
· Full surround views
· Obstacle avoidance
|· Attention monitoring
· Command interface
· Multi-mode automatic speech recognition
|· Social photography
· Augmented Reality
· Localization and odometry
|Audio||· Ultrasonic sensing||· Acoustic surveillance
· Health and performance monitoring
|· Mood analysis
· Command interface
|· ASR in social media context
· Hands-free UI
· Audio geolocation
|Natural Language||· Access control
· Sentiment analysis
|· Sentiment analysis
· Command interface
|· Real-time translation
· Local service bots
· Enhanced search
The variety of vision opportunities is truly unbounded. The combination of inexpensive image sensors, huge cognitive computing capacity, rich training data and ubiquitous communications makes this time absolutely unique. Doing a vision startup is hard, just as any startup venture is hard. Finding the right team, the right market, the right product and the right money is never easy, but the rewards, especially the emotional, technical and professional rewards can be enormous.