An envelope is a powerful tool, especially the back of the envelope. It enables simple calculations that often illuminate some important, complicated problem or opportunity. We have just that kind of problem in working to understand the structure of vision processing systems, especially the key question of where the processing power for vision will live. Will it live in the camera itself? Will it live in the cloud? Will it live somewhere in between, on a local network or at the edge of the cloud? Issues like this are rarely determined by just one factor. It is not just “what’s the fastest” or “what’s the cheapest”. A rich mix of technical and economic forces drives the solutions.
One of the most fundamental forces in vision is ever-cheaper cameras. When I checked a couple of weeks ago I could buy a decent-looking security camera, with high-resolution lens and imager, IR LED illumination, processing electronics, power supply and waterproof housing, on Amazon for $11.99.
The price has been falling steadily – when I checked three months ago, the entry price was about $18. We can reasonably expect that the price will continue to fall over the next couple of years – the $5 security camera cannot be far off.
The incredible shrinking camera cost is the result of classic economies of scale in CMOS semiconductors and assembly of small electronic gadgets. That scaling is inspired by and reinforces huge increases in the volume of image sensor shipments, and in the installed base of low-cost cameras. Using sensor volume data from SemiCo (2014), and the assumption of a three-year useful field life for a camera, you can see the exponential growth in the world’s population of cameras of all kinds.
I show the camera population alongside the human population to reinforce the significance of the crossover – more cameras capturing pixels than there are people to look at them.
I’ve written before about the proliferation of cameras in general, but now I want to get out my envelope and do some further calculations on the key question, “What does a $5 camera cost?” or more accurately, “What’s the cost of ownership of a $5 camera over a three-year lifetime?”
The answer, as you might already guess, is “More than $5.” Let’s break down some of the factors:
- The camera takes power. The $12 ZOSI camera above takes about 6W. I’ll guess that my near-future $5 camera will cut power too, and estimate 3W. That’s certainly not low compared with the low-power CMOS sensors found in mobile phones. As a rule of thumb, electricity in the US costs about $1 per watt per year, so we get a three-year power cost of about $9 for our $5 camera.
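The power rule of thumb is easy to check. The sketch below converts a price per kWh into dollars per watt-year and applies it to the 3W, three-year estimate. The $0.114/kWh figure is my own illustrative assumption, picked because it makes the $1 per watt per year rule come out almost exactly; it is not a number from the analysis above.

```python
# Back-of-envelope power cost for an always-on camera.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def dollars_per_watt_year(price_per_kwh: float) -> float:
    """Cost of running a 1 W load continuously for one year."""
    return HOURS_PER_YEAR / 1000 * price_per_kwh

def lifetime_power_cost(watts: float, years: float, price_per_kwh: float) -> float:
    """Electricity cost of a constant load over its field life."""
    return watts * years * dollars_per_watt_year(price_per_kwh)

# ASSUMPTION: illustrative US retail electricity price, not from the article.
ASSUMED_PRICE_PER_KWH = 0.114

rate = dollars_per_watt_year(ASSUMED_PRICE_PER_KWH)
power_cost = lifetime_power_cost(3, 3, ASSUMED_PRICE_PER_KWH)
print(f"${rate:.2f} per watt-year")
print(f"3 W for 3 years: ${power_cost:.2f}")
```

With a different local electricity price the rule scales linearly, so the $9 estimate moves in proportion.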
If we did all the image processing locally, $5 + $9 = $14 might be a good total for the cost of owning that camera. And you can imagine some mobile gadgets and IoT applications – smart doorbells and baby monitors – where 100% local vision processing is adequate (or even preferred). But what if we want to centralize the processing and take advantage of cloud data? What if we want to push the stream up to the datacenter? Then we have some more costs to consider:
- The camera needs a network connection. Local cabling or Wi-Fi may just have capital costs, but almost all Internet access technologies cost real money. It turns out that Internet access costs vary ENORMOUSLY. There are three basic cost categories.
- If you’re a residential customer in an area with fiber to the home, you’re in great shape. You can get costs as low as $0.10 per TB (10Gbps line for $300 from US Internet).
- If you’re a residential customer using cable or DSL, you’re going to pay $8-20 per TB (based on AT&T and Comcast pricing).
- If you’re a business customer, and have to buy a commercial T1 or T3 connection, you’re going to pay a lot – probably more than $100 per TB, maybe a lot more. (Based on broker T1Shopper.com)
Of course, I am glossing over differences in quality of service, but let’s just break it down into two categories – fiber at $0.10 per TB and non-fiber at $10 per TB. Interestingly, it doesn’t appear that DSL, cable and T1 connections are dropping much in price. Clearly, that will happen only in areas where a much higher capacity option like fiber becomes available.
- The camera needs storage in the cloud. Almost any application that does analysis of a video stream needs a certain amount of recall, so users can apply further study. In some applications, the recall time may be quite short – minutes or hours; in others, there may be a need for days or weeks of storage. We’ll look at two scenarios – a rolling one-day buffer of video storage and a rolling one-month buffer of video storage – to span the range. The cost of storage is a little tricky to nail down. Amazon is now offering “unlimited” storage for $5/month, but that’s just for consumers, and not relevant to AWS cloud compute customers. AWS S3 storage currently costs about $25/TB-month for standard storage, and $5/TB-month for “glacier storage”, which may take minutes to hours to access. The “glacier storage” seems appropriate for the cases where a month of recall is needed. Just as the camera gets cheaper over time, we’d expect the storage to get cheaper too, so we’ll estimate $12.50/TB-month for standard storage and $2.50/TB-month for glacier storage.
- The camera needs computing in the cloud. The whole point of moving the camera stream to the cloud is to be able to compute on it, presumably with sophisticated deep learning image understanding algorithms. There is no fixed standard for the sophistication of computation on video frames, but we can set some expectations. In surveillance video, we anticipate that we want to know what all the major objects in the frame are, and where they are located – in other words, object detection. This is a harder problem than simple classification of a single object in each image, as in the most common ImageNet benchmark. Object detection gives the important clues on the progress of activity in the video stream. The best popular algorithm for this, YOLO (You Only Look Once) by Joseph Redmon, requires about 35 GFLOPs (billions of floating-point operations) of computation per frame. We can get a reasonable picture of the cost by looking at AWS compute offerings, especially the g3 GPU instances. John Joo did some good estimates of the cost of AlexNet inference on AWS g3.4xlarge instances – 3200 inferences per second on the Tesla M60. YOLOv2 is about 25x more compute intensive, so we’d expect a cost of roughly $2.30 per million frames, based on AWS g3.4xlarge pricing of $1.14/hour. Over time, we expect both the algorithms to improve – fewer compute cycles for the same accuracy – and the compute to get more economical. We will again assume a factor of two improvement in each of these, so that future YOLO inference on AWS GPUs might cost about $0.58 per million frames.
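Two of the unit prices above can be re-derived in a few lines. This is a rough sketch, not the original spreadsheet: it assumes the $300 fiber price is a monthly charge, that the line could be saturated for a full 30-day month, and that YOLOv2 throughput is simply AlexNet throughput divided by 25. The function names are mine.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # ASSUMPTION: 30-day month

def dollars_per_tb(monthly_price: float, line_rate_gbps: float) -> float:
    """Price per terabyte if a flat-rate line ran at full speed all month."""
    bytes_per_month = line_rate_gbps * 1e9 / 8 * SECONDS_PER_MONTH
    return monthly_price / (bytes_per_month / 1e12)

def dollars_per_million_frames(hourly_price: float,
                               inferences_per_second: float,
                               relative_cost: float = 1.0) -> float:
    """GPU rental cost per million frames for a model that is
    `relative_cost` times heavier than the benchmarked one."""
    frames_per_hour = inferences_per_second / relative_cost * 3600
    return hourly_price / frames_per_hour * 1e6

# ASSUMPTION: the $300 for the 10 Gbps line is a monthly price.
fiber = dollars_per_tb(300, 10)
alexnet = dollars_per_million_frames(1.14, 3200)
yolo = dollars_per_million_frames(1.14, 3200, relative_cost=25)
future_yolo = yolo / 4  # 2x better algorithms times 2x cheaper compute

print(f"fiber: ${fiber:.3f} per TB")
print(f"AlexNet: ${alexnet:.3f} per million frames")
print(f"YOLOv2: ${yolo:.2f} per million frames")
print(f"future YOLOv2: ${future_yolo:.2f} per million frames")
```

The fiber figure lands at about $0.09 per TB, in line with the $0.10 used above. The YOLOv2 figure comes out somewhat above the $2.30 quoted, which is what you would expect if the 25x factor is itself a rounded number.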
Now let’s try to put it all together, for a comparison of cloud computing costs relative to our 3-year camera cost of about $14. The last key variable is the data rate coming out of the camera. I will assume an 8Mpixel camera capable of producing 60 frames per second. This is lower resolution than what is found in high-end smartphones, so it qualifies as a commodity spec, I think. Let’s compute the costs in four scenarios:
- Camera streams raw pixels at 8Mpixel * 2 bytes/pixel * 60 fps into the cloud. While it is unlikely that we would adopt raw pixel streaming, it still provides a useful baseline.
- Camera uses standard H.264p60 video compression on the 4K UHD stream. This reduces the video bandwidth requirement to about 40 Mbps.
- We assume some filtering or frame selection intelligence in the camera that reduces the frame rate, after H.264 compression, to 1 frame per second.
- We assume more filtering or frame selection intelligence in the camera to reduce the frame rate to 1 frame per minute.
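The four scenarios translate into data rates as sketched below. One assumption here is mine: when the camera drops frames, I treat the bits per compressed frame as constant, so 1 frame per second costs 1/60 of the 40 Mbps stream. Real encoders lose inter-frame compression at low frame rates, so treat the last two numbers as optimistic.

```python
PIXELS = 8e6           # 8 Mpixel imager
BYTES_PER_PIXEL = 2
FPS = 60
H264_MBPS = 40         # compressed 4K stream at 60 fps

raw_bps = PIXELS * BYTES_PER_PIXEL * FPS * 8   # bits per second
h264_bps = H264_MBPS * 1e6

# ASSUMPTION: bits per compressed frame stay constant when frames are dropped.
one_fps_bps = h264_bps / FPS
one_fpm_bps = one_fps_bps / 60

for label, bps in [("raw pixels", raw_bps),
                   ("H.264 @ 60 fps", h264_bps),
                   ("H.264 @ 1 fps", one_fps_bps),
                   ("H.264 @ 1 fpm", one_fpm_bps)]:
    print(f"{label:>15}: {bps / 1e6:10.4f} Mbps")

print(f"compression ratio, raw to H.264: {raw_bps / h264_bps:.0f}x")
```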
|Data stream|Network cost (fiber)|Network cost (conventional)|Storage cost (1 day)|Storage cost (1 month)|Compute cost|Total cost|
|---|---|---|---|---|---|---|
|H.264 @ 1 fps|$1|$79|$7|$39|$55|$62-$173|
|H.264 @ 1 fpm|$0|$1.30|$0.11|$0.66|$0.91|$1-$3|
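For anyone who wants to redo the envelope, here is a minimal sketch of a three-year cost model for the compressed scenarios. It is my reconstruction, not the original calculation. Two assumptions are worth flagging: frame dropping is treated as a linear reduction of the 40 Mbps stream, and storage is priced at the current S3 figures ($25 and $5 per TB-month), because those, rather than the halved estimates, are what come close to the storage columns in the table above.

```python
SECONDS_3Y = 3 * 365 * 24 * 3600   # 94,608,000 seconds
MONTHS_3Y = 36
BITS_PER_FRAME = 40e6 / 60         # ASSUMPTION: constant bits per H.264 frame

NETWORK_PER_TB = {"fiber": 0.10, "conventional": 10.0}
# ASSUMPTION: (buffer length in days, $ per TB-month) at current S3 prices.
STORAGE = {"1 day": (1, 25.0), "1 month": (30, 5.0)}
COMPUTE_PER_MILLION_FRAMES = 0.58

def three_year_costs(frames_per_second: float):
    """Return (network, storage, compute, low_total, high_total) in dollars."""
    bytes_per_second = frames_per_second * BITS_PER_FRAME / 8
    streamed_tb = bytes_per_second * SECONDS_3Y / 1e12
    network = {name: streamed_tb * price
               for name, price in NETWORK_PER_TB.items()}
    storage = {}
    for name, (days, price_per_tb_month) in STORAGE.items():
        buffer_tb = bytes_per_second * days * 86400 / 1e12
        storage[name] = buffer_tb * price_per_tb_month * MONTHS_3Y
    compute = (frames_per_second * SECONDS_3Y / 1e6
               * COMPUTE_PER_MILLION_FRAMES)
    low = network["fiber"] + storage["1 day"] + compute
    high = network["conventional"] + storage["1 month"] + compute
    return network, storage, compute, low, high

for label, fps in [("H.264 @ 60 fps", 60),
                   ("H.264 @ 1 fps", 1),
                   ("H.264 @ 1 fpm", 1 / 60)]:
    network, storage, compute, low, high = three_year_costs(fps)
    print(f"{label}: network ${network['fiber']:.2f}/"
          f"${network['conventional']:.2f}, "
          f"storage ${storage['1 day']:.2f}/${storage['1 month']:.2f}, "
          f"compute ${compute:.2f}, total ${low:.2f} to ${high:.2f}")
```

The low total pairs fiber with the one-day buffer and the high total pairs conventional access with the one-month buffer, which is how the ranges in the table appear to be built. The individual storage cells round slightly differently from the table.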
The analysis, rough as it is, gives us some clear takeaways:
- The bandwidth and storage costs of raw uncompressed video are absurd, so we’ll never see raw streaming of any significance.
- Even normally compressed video is very expensive – $4K to $10K over the three-year life of our $5 camera. Very few video streams are so valuable as to justify this kind of cost.
- Smart selection down to one frame per second helps cloud costs significantly, but the cloud costs dwarf the camera costs. Commodity solutions are likely to push to lower costs, hence lower cloud frame rate.
- Selection down to one frame per minute makes cloud costs insignificant, even relative to our cheapo camera. It may not be necessary to go this far to leverage the cost reductions in the hardware.
- We might reasonably expect that the sweet spot is 5 to 20 frames per minute of cloud-delivered video, with the high end most appropriate where cheap fiber to the home (or business) is available, and the low-end more appropriate with conventional network access.
- The total value of these cameras is economically significant, when we consider the likely deployment of about 20 billion cameras by 2020. While the actual business model for different camera systems will vary widely, we can get a rough order of magnitude by just assuming the cameras are streaming just 10 frames per minute, half over fiber, half over conventional network access. The three-year cost comes to $260B for the cameras and $300B for the network and cloud computing, or about $220B per year. Not chump change!
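One way to sanity-check the sweet spot is to ask at what cloud frame rate the three-year cloud bill equals the three-year cost of the camera itself. The sketch below does that using only the 1 frame per minute row of the table plus the $5 camera and $9 power figures. It assumes cloud costs scale linearly with frame rate. The break-even framing is my own illustration, not a criterion the analysis above states.

```python
# Three-year cloud cost per camera at 1 frame per minute, from the table.
LOW_PER_FPM = 0.00 + 0.11 + 0.91    # fiber + 1-day storage + compute
HIGH_PER_FPM = 1.30 + 0.66 + 0.91   # conventional + 1-month storage + compute

CAMERA_PRICE = 5.0
POWER_COST_3Y = 9.0
camera_cost_3y = CAMERA_PRICE + POWER_COST_3Y

def break_even_fpm(cloud_cost_per_fpm: float) -> float:
    """Frames per minute at which cloud cost equals camera cost.
    ASSUMPTION: cloud cost is linear in frame rate."""
    return camera_cost_3y / cloud_cost_per_fpm

fiber_break_even = break_even_fpm(LOW_PER_FPM)
conventional_break_even = break_even_fpm(HIGH_PER_FPM)
print(f"fiber, 1-day buffer: {fiber_break_even:.1f} frames per minute")
print(f"conventional, 1-month buffer: "
      f"{conventional_break_even:.1f} frames per minute")
```

Under these assumptions the cloud bill matches the camera cost at roughly 5 frames per minute on conventional access and roughly 14 on fiber, which sits inside the 5 to 20 range suggested above.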
The big takeaway is that the promise of inexpensive cameras cannot be realized without smart semantic filtering in or near the camera. Cameras, especially monitoring cameras, need to make intelligent choices about whether anything interesting is happening in a time window. If nothing meets the criteria of interest, nothing is sent to the cloud. If activities of possible interest are detected, a window of video, from before and after the triggering action, needs to be shipped to the cloud for in-depth analysis. The local smart filtering subsystem may include preliminary assessment and summary of activities outside the window of interest in the video trigger package for the cloud.
You might reasonably ask, “Won’t putting more intelligence in the cameras make them much more expensive?” It is clear that these low-end cameras are designed for minimum cost today, with the simplest viable digital electronics for interfacing to the network. But just as video compression engines have become commoditized, cheap and standard in these cameras, I expect neural network processors to go the same route. For example, running standard YOLOv2 at 60 fps requires about 2.3 TFLOPs or 1.2 T multiply-adds per second. That would fit comfortably in 3 Tensilica Vision C5 cores and less than 5 mm² of 16nm silicon including memory. That probably translates into an added silicon manufacturing cost of less than 50 cents. So it might push up the total camera and power cost by a few dollars, but not enough to shift the balance towards the cloud. After all, doing YOLOv2 on the full 60 fps in the cloud can cost thousands of dollars.
This model also suggests that the algorithms on both ends – at the camera and in the cloud – will need to adapt to specific application requirements and settings, and evolve significantly over time as better trigger and analysis methods emerge. This is the essence of a smart distributed system, with global scope, continuous evolution of functionality, and high economic value.
I should finish with a few caveats on this little analysis. First, it is full of simplifying assumptions about cameras, networks, storage needs and compute intensity. The cost of network access is particularly tricky to nail down, not least because wired data access is unmetered, so that the marginal cost of one more bit is zero, until you run out of capacity – then the marginal bit is very expensive. Second, I have focused on costs in the U.S. Some other regions have better access to low-cost fiber, so may be able to leverage video cloud computing better. Other regions have significantly more expensive network access, so the device-cloud tradeoff may shift heavily towards smarter device filtering.
Finally, the uses of cameras are enormously varied, so it is naïve (but instructive) to boil every video analysis application down to object detection. There may be significantly heavier and lighter compute demands in the mix.
This analysis focuses on wired cameras. However, cameras in smartphones make up a meaningful fraction of the expected 20B installed image sensors. Wireless bandwidth to the cloud is dramatically more expensive than wired bandwidth. The going price is very roughly $10 per GB in high-bandwidth plans, or 1,000 times the price of DSL and cable and 100,000 times the price of fiber. If, hypothetically, you somehow managed to stream your 8Mp 60 fps camera to the cloud as raw pixels continuously for three years, it would cost you about a billion dollars ;-). Compressing it to 40Mbps H.264p60 drops the cost to a mere five million dollars. Of course, wireless data costs may well come down, but not enough to make continuous streaming from phones attractive. Especially smart compression is going to be needed for any application that wants to rely on continuous streaming over significant periods.
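The wireless numbers follow directly from the data rates and the $10 per GB price. Here is a small sketch of the arithmetic, assuming decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes) and continuous streaming for three 365-day years:

```python
SECONDS_3Y = 3 * 365 * 24 * 3600   # continuous streaming for three years
WIRELESS_PER_GB = 10.0             # rough price from the paragraph above

raw_bytes_per_second = 8e6 * 2 * 60    # 8 Mpixel * 2 bytes * 60 fps
h264_bytes_per_second = 40e6 / 8       # 40 Mbps compressed stream

def wireless_cost(bytes_per_second: float) -> float:
    """Three-year cost of streaming continuously over a metered wireless plan."""
    gigabytes = bytes_per_second * SECONDS_3Y / 1e9
    return gigabytes * WIRELESS_PER_GB

raw_cost = wireless_cost(raw_bytes_per_second)
h264_cost = wireless_cost(h264_bytes_per_second)
print(f"raw pixels: ${raw_cost / 1e6:,.0f} million")
print(f"H.264 at 40 Mbps: ${h264_cost / 1e6:,.2f} million")

# Price ratios against the wired categories, all expressed per TB.
wireless_per_tb = WIRELESS_PER_GB * 1000
print(f"vs DSL/cable at $10 per TB: {wireless_per_tb / 10:,.0f}x")
print(f"vs fiber at $0.10 per TB: {wireless_per_tb / 0.10:,.0f}x")
```

That gives about $908 million for raw pixels and about $4.7 million for the compressed stream, matching the rounded figures in the text.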
So that’s the story of a cheap camera on Amazon and the back of an envelope.