Where Are We with Computer Vision?


In the past several years, we’ve witnessed how deep learning, including convolutional neural networks, has been successfully applied to computer vision, natural language processing, speech recognition, logistics, online advertising, and many other problem domains. A few things are unique about applying deep learning to computer vision, and understanding these characteristics helps in understanding the state of the field.

In this article, I’d like to share a nice summary of the state of computer vision from Course 4, “Convolutional Neural Networks,” of the new Deep Learning Specialization series on Coursera. Dr. Andrew Ng provides some compelling observations about deep learning and computer vision with the goal of mapping out the future of this increasingly popular technology. Consider that many machine learning problems fall somewhere on a spectrum from “small data” to “big data.” For example, there is a decent amount of data available for speech recognition. On the other hand, for image recognition and image classification it feels like we still wish we had more data, even though the online data sets are quite big (many with over a million images). In addition, there are some problems, like object detection, where we have even less data.

Image recognition is the problem of looking at an image and determining what it shows, for example whether a cat is present or not. Object detection, by contrast, is looking at an image and determining where in the image certain objects are located, with each location marked by a bounding box. This is used extensively in autonomous driving systems. Because drawing bounding boxes is costly, labeling data for object detection is more expensive than labeling it for image recognition, so we tend to have less data for object detection.

Consequently, if you look across a broad spectrum of machine learning problems, you see on average that when you have a lot of data, data scientists tend to get away with simpler algorithms and less manual feature engineering. There is less need to carefully design features for the problem. Instead, you can devise a large neural network and let it learn whatever it needs to learn from the data. In contrast, when you don’t have that much data, you see on average that people engage in more manual engineering, because in that setting manual feature engineering is often the best way to get good performance.

When you look at machine learning applications, the learning algorithm has two sources of knowledge. One source of knowledge is the labeled data, the (x, y) pairs you use for supervised learning. The second source of knowledge is manual engineering. There are many ways to manually engineer a system: carefully designing the features, carefully designing the network architecture, or designing other components of your system. So when you don’t have much labeled data, you need to count more on manual engineering.

Computer vision is trying to learn a really complex function, and it often feels like we don’t have enough data for computer vision even though data sets are getting bigger and bigger. Often we just don’t have as much data as we need. This is why computer vision, historically and even today, has relied more on manual engineering. This is also why the field has developed rather complex network architectures: in the absence of more data, the way to get good performance is to spend more time experimenting with network architecture.

This is not meant to slight manual feature engineering. When you don’t have enough data, manual feature engineering is a difficult, highly skilled task that requires a lot of insight. A data scientist who is insightful with manual feature engineering will get better performance, so it’s a positive contribution to a project when data is scarce. But if you have lots of data, it’s not necessary to spend time on manual feature engineering. Historically, the field of computer vision has used very small data sets, and so the computer vision literature has relied on a lot of manual feature engineering. In the last few years, though, the amount of data available for many computer vision tasks has increased dramatically, which has significantly reduced the amount of manual feature engineering being done.

There’s still a lot of manual engineering of network architectures in computer vision, which is why you see very complicated hyperparameter choices in computer vision, more complex than in many other disciplines. In fact, because object detection data sets are usually smaller than image recognition data sets, object detection algorithms become even more complex and have even more specialized components. Fortunately, one thing that helps a lot when you have little data is transfer learning.

If you look at the computer vision literature and review the set of ideas out there, you’ll find that people are really enthusiastic about doing well on standardized benchmark data sets and winning competitions. For computer vision researchers, doing well on the benchmarks makes it easier to get a paper published. The positive side of this is that it helps the whole community figure out which algorithms are most effective. You also see papers using techniques that do well on benchmarks but that you wouldn’t really use in a production system that you’d deploy in an actual application.

Here are a couple of tips for doing well on benchmarks. These are techniques that may not find their way into production systems.

  • Ensembles – Once you’ve determined the specific neural network architecture you want to use, you can train several neural networks independently and average their outputs. This is called an ensemble. For example, you can randomly initialize three, five, or seven neural networks, train all of them, and then average their outputs. Note that you average their predictions, the y-hats, and not their weights. With seven neural networks, you get seven different predictions to average. This may improve results by perhaps 1% or 2% on a benchmark, which can help win a benchmark competition. Using ensembles means that you typically need to run each image through a number of different neural networks. Consequently, this slows down your running time by that same factor, or sometimes even more. As a result, ensembles are one of those tips that data scientists use for doing well on benchmarks and winning competitions but that are almost never used in a production system, unless you have a large computational budget and don’t mind burning a lot of money on each customer image.
  • Data Augmentation – “Multi-crop” is a form of data augmentation applied to your test image. For example, to decide whether an image shows a cat, you run several crops of the image, including crops of its mirrored version, through the classifier instead of running the image just once. A common variant is the “10-crop”: take a central crop of the image and run it through your classifier, then do the same for a crop from each of the four corners. Repeat all five crops on the mirrored image. That gives ten different crops in total, hence the name 10-crop. You run these ten images through your classifier and then average the results. If you have the computational budget you can use all ten crops; otherwise you can use fewer. Multi-crop might get you slightly better performance in a production system, meaning a system deployed for actual users, but it is used much more for doing well on benchmarks than in actual production systems.
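To make the ensemble idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `make_toy_model` is a hypothetical stand-in that builds an untrained, randomly initialized linear classifier, whereas in practice each member would be a separately trained neural network. The point is the averaging step, which operates on the predicted probabilities (the y-hats) rather than on the weights.

```python
import math
import random


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_model(seed, n_features=4, n_classes=3):
    """Hypothetical stand-in for one independently initialized network."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
               for _ in range(n_classes)]

    def predict(x):
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        return softmax(logits)

    return predict


def ensemble_predict(models, x):
    """Average the models' predicted probabilities (y-hats), not their weights."""
    predictions = [model(x) for model in models]  # one forward pass per model
    n_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / len(predictions)
            for c in range(n_classes)]


# Seven toy models, matching the seven-network example in the text.
models = [make_toy_model(seed) for seed in range(7)]
x = [0.5, -1.2, 0.3, 2.0]
y_hat = ensemble_predict(models, x)
print(y_hat)  # three averaged class probabilities
```

Each call to `ensemble_predict` runs one forward pass per model, which is why inference cost grows with the size of the ensemble.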

One of the drawbacks of ensembles is that you need to keep a number of different neural networks around, which takes up a lot more computer memory. For multi-crop, you keep just one network around, so it doesn’t use as much memory, but it still slows down your run-time to some degree.
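The 10-crop procedure can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: the image is a plain nested list of pixel values, and `toy_classifier` is a hypothetical scoring function (mean pixel value) standing in for a trained network. Only one classifier is needed, but it runs ten times per image.

```python
def crop(image, top, left, size):
    """Return a size x size square of the image starting at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def mirror(image):
    """Flip the image horizontally."""
    return [row[::-1] for row in image]


def ten_crop(image, size):
    """Four corner crops plus the central crop, for the image and its mirror."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size must fit inside the image")
    positions = [
        (0, 0),                                       # upper left
        (0, width - size),                            # upper right
        (height - size, 0),                           # lower left
        (height - size, width - size),                # lower right
        ((height - size) // 2, (width - size) // 2),  # center
    ]
    crops = []
    for source in (image, mirror(image)):
        for top, left in positions:
            crops.append(crop(source, top, left, size))
    return crops


def multi_crop_predict(classifier, image, size):
    """Run every crop through the same classifier and average the scores."""
    scores = [classifier(c) for c in ten_crop(image, size)]
    return sum(scores) / len(scores)


def toy_classifier(patch):
    """Hypothetical stand-in for a trained network: mean pixel value."""
    pixels = [value for row in patch for value in row]
    return sum(pixels) / len(pixels)


# A 4x4 toy image with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
score = multi_crop_predict(toy_classifier, image, size=2)
```

Using fewer crops, for example only the center crop and its mirror, trades some of the benefit for a proportionally lower run-time cost.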

These are techniques you see in the wild, and research papers refer to them as well. But they may not be appropriate when building production systems, even though they are great for doing better on benchmarks and winning competitions.

Because so many computer vision problems are in the small data regime, others have done a lot of manual engineering of network architectures. Surprisingly, a neural network that works well on one vision problem will often work for other vision problems as well. So, to build a practical system, you can do well by starting off with a neural network architecture that someone else has devised. Use an open source implementation if possible, because the open source implementation might have figured out all the minute details such as the various hyperparameters. Finally, another team may have spent weeks training a model on a bank of GPUs and on a million or more images. By taking another team’s pretrained model and fine-tuning it on your own data set, you can often get going much faster on an application. Alternatively, if you have the computing resources and the inclination, you can train your own neural networks from scratch. In fact, if you want to invent your own computer vision algorithm, that’s what you might have to do.
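The fine-tuning idea can be sketched on a toy scale in plain Python. Everything here is an assumption for illustration: `FROZEN_WEIGHTS` and `frozen_features` are hypothetical stand-ins for a pretrained network body that is reused as-is, and the only part that gets trained on your own small data set is a new logistic output layer (the “head”). A real project would instead load a pretrained model from an open source framework.

```python
import math

# Hypothetical frozen weights standing in for a pretrained network body.
FROZEN_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def frozen_features(x):
    """Pretrained feature extractor (linear map plus ReLU); never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_WEIGHTS]


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def head_predict(head, x):
    """Probability of the positive class from the trainable head."""
    weights, bias = head
    feats = frozen_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)


def log_loss(head, inputs, labels):
    """Mean binary cross-entropy of the head on a labeled data set."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(inputs, labels):
        p = head_predict(head, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(inputs)


def fine_tune_head(inputs, labels, lr=0.1, epochs=300):
    """Train only the new head with full-batch gradient descent."""
    n_feats = len(FROZEN_WEIGHTS)
    weights, bias = [0.0] * n_feats, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feats, 0.0
        for x, y in zip(inputs, labels):
            feats = frozen_features(x)
            error = head_predict((weights, bias), x) - y
            grad_w = [g + error * f for g, f in zip(grad_w, feats)]
            grad_b += error
        weights = [w - lr * g / len(inputs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(inputs)
    return weights, bias


# A tiny labeled data set of (x, y) pairs for the new task.
inputs = [[1.0, 1.0], [2.0, 0.5], [0.5, 1.5],
          [-1.0, -1.0], [-2.0, -0.5], [-0.5, -1.5]]
labels = [1, 1, 1, 0, 0, 0]

untrained_head = ([0.0] * len(FROZEN_WEIGHTS), 0.0)
trained_head = fine_tune_head(inputs, labels)
loss_before = log_loss(untrained_head, inputs, labels)
loss_after = log_loss(trained_head, inputs, labels)
```

Because the feature extractor stays fixed, only four numbers are learned here, which is the reason this approach can work with very little labeled data.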


Contributed by Daniel D. Gutierrez, Managing Editor and Resident Data Scientist for insideBIGDATA. In addition to being a tech journalist, Daniel is also a consultant in data science, an author, and an educator, and sits on a number of advisory boards for various start-up companies.



Speak Your Mind



  1. I thought it was interesting when you explained that computer vision is trying to learn a function that is complex. If I were to guess, computer vision is the process of automating machines to perform physical jobs. It would be interesting to learn more about the software systems that are put in place to control these machines.