Field Report: Deep Learning Summit 2016


FIELD REPORT

insideBIGDATA was pleased to be on hand for the recent Deep Learning Summit in Boston, May 12-13, 2016, as guests of the conference sponsor RE•WORK. Deep Learning (DL) researchers and industry leaders gathered for the event to hear extraordinary speakers, discover emerging trends and expand their professional networks, all around this hugely popular field.

Deep Learning is a family of neural network algorithms that might not only drive our cars in ten years but might also drive an artificial intelligence revolution in the fields of medicine, education, advertising, and robotics. Jana Eggers, CEO of Nara Logics, predicted in her opening remarks to the conference that artificial intelligence will eventually have a greater impact on society than the Internet or mobile.

Conference Highlights

Yoshua Bengio, Professor at the University of Montreal and one of the founding fathers of Deep Learning, reviewed the basic conceptual ideas about neural networks and spoke about the future of the field. He also spent a lot of time reviewing the history of neural networks and some time speaking about his lengthy list of contributions to the field. According to Bengio, Deep Learning algorithms work analogously to the way the brain works. As data moves from the input of the network through successive layers, its representation inside the computer becomes more abstract. If a robot looks at a car, for example, low-level features such as lines or colors would be encoded by the initial layers; slightly more abstract features, like circular wheels, might be formed within the middle layers; and by the final layer, mid-layer features combine to form a classifiable concept.
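Bengio's layer-by-layer picture can be made concrete with a toy convolution. The sketch below uses plain NumPy, and the kernel is hand-written for illustration, whereas a trained network would learn it from data; it shows how a first-layer filter responds to a low-level feature like a vertical edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D convolution (really cross-correlation, as in most DL frameworks)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny 6x6 "image" with a vertical edge: dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A vertical-edge detector of the kind a first layer often learns on its own.
vertical_edge = np.array([[-1.0, 1.0],
                          [-1.0, 1.0]])

response = conv2d(image, vertical_edge)
# The response is strongest exactly where the edge sits.
print(response)
```

Stacking more layers on top of responses like this one is what yields the increasingly abstract features Bengio described.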



Depending on the application, actual Deep Learning networks can be over 100 layers and require massive amounts of training data and weeks to train. According to Bengio, for AI to get to the next level, which would include “machine reasoning” and reporting simple narratives about images, current processing speed must improve at least one hundred-fold.

To rise to this processing challenge, companies have begun to specialize in providing full-stack Deep Learning solutions. Google’s open source TensorFlow framework is currently popular, while new players like Silicon Valley-based Nervana Systems are entering the space as well. Nervana supplies a framework that even includes specialized ASIC hardware optimized for the type of math performed within Deep Learning networks, something Google does not currently provide. Nervana also provides customized cloud-based software to run on their hardware.

In the next series of sections, we bring you short summaries of the technical sessions we attended. Our goal is to bring you up to speed with everything discussed at the conference so you can have a foundation for moving forward with this exciting technology.

DAY 1

Hugo Larochelle, Research Scientist, Twitter

Twitter is trying to develop a set of generic tools that transfer across text, images, and video. If a Deep Learning Neural Network (DLNN) sees a picture or video of a horse, or the word “horse,” it should know the user means the same thing. Obviously this is one major objective for DL in the near future.

Lifelong learning is a big research topic these days: how can a machine learn over time instead of just in training mode? Unsupervised learning is a big deal. This is learning with unlabeled data sets (i.e. no class variable). There is so much data available — way more data than anyone can possibly process — that there is a need for a way to train these DLNNs without labels. Labels are expensive: a person literally needs to tell the machine what it is seeing, the same way a human teacher tells a student things. There are efforts underway for automatic machine labeling, although it is currently more error-prone than human labeling. Due to the expense of labeling a data set, some companies employ Amazon's Mechanical Turk or Figure Eight (previously known as CrowdFlower).

A common unsupervised learning technique is clustering. The algorithm places “concepts,” as vectors, into an N-dimensional hyperspace. Ideas that are similar are located closer together in this space. Another big deal in the neural network community these days is called Word2Vec. It is a machine learning method which places words into an N-dimensional hyperspace and then manipulates the words with linear algebra.
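The vector arithmetic Word2Vec enables can be illustrated with a toy sketch. The vectors below are hand-made for illustration (a real model learns its embeddings from text), but the mechanics of "nearby concepts" and the classic king − man + woman analogy are the same:

```python
import numpy as np

# Hand-made toy "embeddings" -- a real Word2Vec model learns these from text.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.5, 0.5]),
}

def cosine(a, b):
    """Similarity of two concept vectors: close in hyperspace means similar."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    """Word whose vector is closest (by cosine) to the target vector."""
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cosine(vecs[w], target))

# The classic analogy, solved with linear algebra: king - man + woman ~= queen
result = nearest(vecs["king"] - vecs["man"] + vecs["woman"],
                 exclude={"king", "man", "woman"})
print(result)
```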

Larochelle is also famous for posting a series of instructional YouTube videos about DL and neural networks. Everyone we talked to mentioned his videos as one of the main learning resources out there. Interestingly, Larochelle was a student of Yoshua Bengio.

Daniel McDuff, Principal Research Scientist, Affectiva

Dr. McDuff’s presentation was “Emotion Intelligence to our Digital Experiences - Turning Everyday Devices into Health Sensors.” He talked about how today’s electronics have very sensitive optical and motion sensors that can capture subtle signals resulting from cardio-respiratory activity. He discussed how webcams can be used to measure important physiological parameters without contact with the body. In addition, he showed how ordinary smartphones can be turned into continuous physiological monitors. Both of these techniques reveal the surprising power of devices around us all the time. He also showed how DL technology is helping us create highly scalable and low-cost applications based on these sensor measurements.

These guys are trying to recognize facial expressions in video. They have an office right here in Waltham, MA. In a previous life, one of us once stopped by their office to see if they were hiring, and we walked right in without anyone noticing. We sort of walked around and overheard one of their marketing people discussing Google Analytics. We think their whole game plan is to help advertisers better understand their audience, though they didn’t mention anything about their business plan at the conference. McDuff talked mostly about how they use support vector machines (SVMs) and convolutional neural networks to encode facial expressions. It seems that convolutional neural networks are the method of choice for recognizing images; they’re one of many flavors of DL. McDuff showed a lot of examples of many emotions and how they break faces down into different parts.

Tony Jebara, Professor of Computer Science at Columbia University and Director of Machine Learning Research at Netflix

Dr. Jebara’s talk was “Double-cover Inference in Deep Belief Networks” where he talked about how most DL architectures utilize forward-propagation to predict and back-propagation to learn. Meanwhile, in deep belief networks, performing loopy propagation and iterating forward and backward steps could yield better estimates of the probabilities of the unknown variables in hidden layers. Unfortunately, loopy propagation is plagued with convergence issues. He described a double-cover construction which makes two identical copies of the network. Excitatory connections are preserved within copies while inhibitory connections are placed across the copies. Remarkably, loopy propagation on the double-cover results in much better convergence and estimation from both a theoretical and practical perspective.
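The double-cover construction can be sketched in a few lines of NumPy. The reading below is our own illustrative interpretation of the talk, not Jebara's implementation: split the weight matrix into its excitatory (positive) and inhibitory (negative) parts, keep the former within each copy, and route the latter across the two copies:

```python
import numpy as np

def double_cover(W):
    """Two identical copies of a network: excitatory (positive) weights stay
    within each copy, inhibitory (negative) weights are rerouted across
    copies. Illustrative reading of the construction described in the talk."""
    W_exc = np.maximum(W, 0.0)   # excitatory part
    W_inh = np.minimum(W, 0.0)   # inhibitory part
    return np.block([[W_exc, W_inh],
                     [W_inh, W_exc]])

# A tiny 2-unit network: one excitatory edge (+1.5), one inhibitory (-2.0).
W = np.array([[0.0,  1.5],
              [-2.0, 0.0]])
D = double_cover(W)
print(D)  # 4x4: each original edge appears twice, sorted by sign
```

Loopy propagation would then be run on the doubled network `D`, where (per the talk) it converges far more reliably than on the original.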

Honglak Lee, Asst. Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor

Dr. Lee’s presentation was “Deep learning with disentangled representations” where he spoke about how, over recent years, DL has emerged as a powerful method for learning feature representations from complex input data, and it has been greatly successful in computer vision, speech recognition, and language modeling. The recent successes typically rely on a large amount of supervision (e.g., class labels). While many DL algorithms focus on a discriminative task and extract only task-relevant features that are invariant to other factors, complex sensory data is often generated from intricate interaction between underlying factors of variation (for example, pose, morphology and viewpoints for 3d object images). In this work, Lee tackles the problem of learning deep representations that disentangle underlying factors of variation and allow for complex reasoning and inference that involve multiple factors. Specifically, he developed deep generative models with higher-order interactions among groups of hidden units, where each group learns to encode a distinct factor of variation. Lee presented several successful instances of deep architectures and their learning methods, including supervised and weakly-supervised settings.

In brief, Lee talked about disentangled representations. This is similar to what Twitter was talking about: making meaning from various forms of sensory inputs. This is the key to smarts; e.g. if I see a dog, hear a dog, and feel a dog, how do I know it’s the same thing? Lee also talked a bit about analogy making and cited a paper that seemed interesting: Reed et al., NIPS 2015.

John Hershey, Sr. Principal Research Scientist, Mitsubishi Electric Research Labs

Dr. Hershey’s session was “Cracking the Cocktail Party Problem: Deep Clustering for Speech Separation.” He talked about the “cocktail problem,” focusing on one signal when it is mixed with another signal. This can happen in voice (two people talking at once) or imagery – one image superimposed on another, e.g. someone looking into a window with a reflection; how do you know which image is on the inside and which on the outside?

The human auditory system gives us the extraordinary ability to converse above the chatter of a lively cocktail party. Selective listening in such conditions is an extremely challenging task for computers, and has been the holy grail of speech processing for more than 50 years. Previously, no practical method existed in the case of single channel mixtures of speech, especially when the speakers are unknown. Hershey presented a breakthrough in this area using a new type of neural network called “deep clustering.” His deep clustering network assigns embedding vectors to different sonic elements of the noisy signal. When the embeddings are clustered the constituent sources are revealed. The system is able to extract clean speech from single channel mixtures of unknown speakers, with an astounding 10 dB improvement in signal to noise ratio — a level of improvement previously unobtainable even in simpler speech enhancement tasks. Amazingly, the system can even generalize between two- and three-speaker mixtures. The belief is that this technology is on the verge of solving the general audio separation problem, opening up a new era in spontaneous human-machine communication.
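The clustering step at the heart of the method can be sketched as follows. Here synthetic 2-D points stand in for the embedding vectors the deep clustering network assigns to each time-frequency bin (the network itself is not shown, and the data is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the embedding vectors a deep clustering network would assign
# to each time-frequency bin of the noisy spectrogram; bins dominated by the
# same speaker should land close together in embedding space.
speaker_a = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(50, 2))
speaker_b = rng.normal(loc=[-1.0, -1.0], scale=0.1, size=(50, 2))
embeddings = np.vstack([speaker_a, speaker_b])

def two_means(X, iters=10):
    """Minimal 2-cluster k-means with a deterministic init for this sketch."""
    centers = np.stack([X[0], X[-1]])
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == 0].mean(0), X[labels == 1].mean(0)])
    return labels

labels = two_means(embeddings)
# Bins that share a label are assigned to the same speaker; in the real
# system this clustering becomes a mask that carves the mixture apart.
print(labels)
```

Because the number of clusters is chosen at separation time rather than baked into the network, the same trained model can handle two- or three-speaker mixtures, which is the generalization Hershey highlighted.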

Spyros Matsoukas, Principal Scientist, Amazon

Matsoukas’ talk was “Deep Learning for Amazon Echo.” He introduced Amazon Echo and Alexa, the virtual personal assistant that powers Echo, and discussed the challenges his team faces when developing machine learning solutions for wake word detection, automatic speech recognition, natural language understanding, question answering, dialog management, and speech synthesis. He described how his team applies deep learning to the problem of speech recognition, addressing challenges associated with training on large data sets and adapting the acoustic model to a variety of speakers and environmental conditions.

Matsoukas gave a very nice overview of what goes on under the hood in the Echo. It uses the Amazon Alexa framework for speech recognition. He showed the full block diagram of what happens from the time a user talks into the device to when the user gets a reply. He touched on some of the smarts, like how Echo puts placeholders into common request patterns, e.g. “Echo, what is the weather today?” The word “today” can be any date, so their technology looks for patterns like that. The presentation included a lot about Big Data. They have dedicated data servers (not the ones they offer to the public) to manage their data. One interesting point was that there is a trend to bring apps to the data rather than the other way around, which circumvents many problems where possible. For example, a few companies are experimenting with running DL algorithms on individual phones, rather than just collecting data from those phones. In the medical community there are privacy laws which prevent data from moving; instead, the goal is to bring the algorithms to the hospitals and run them there. The talk included a bit about moving the code rather than the data.
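The placeholder idea can be illustrated with a trivial pattern matcher. Alexa's real pipeline uses statistical language-understanding models rather than regular expressions, so treat this only as a sketch of the slot-filling concept, with a pattern we invented:

```python
import re

# Illustrative slot pattern: a fixed carrier phrase with a placeholder
# that can match different date expressions.
PATTERN = re.compile(r"what is the weather (?P<date>today|tomorrow|on \w+)",
                     re.IGNORECASE)

def parse(utterance):
    """Return the filled 'date' slot, or None if the pattern doesn't match."""
    m = PATTERN.search(utterance)
    return m.group("date") if m else None

print(parse("Echo, what is the weather today?"))      # today
print(parse("Echo, what is the weather on Friday?"))  # on Friday
print(parse("Echo, play some music"))                 # None
```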

Andrew McCallum, Professor and Director of the Center for Data Science at the University of Massachusetts Amherst

Dr. McCallum’s presentation was “Deep Learning for Representation and Reasoning from Natural Language” where he described advances in deep learning for extracting entity-relations from natural language as well as for representing and reasoning about the resulting knowledge base. He introduced “universal schema,” his approach that embeds many database schemas and natural language expressions into a common semantic space. Then he described recent research in Gaussian embeddings that capture uncertainty and asymmetries, collaborative filtering with text, and logical implicature of new relations through multi-hop relation paths compositionally modeled by recursive neural tensor networks.

McCallum also collaborates with Amazon and is working on unsupervised learning. His big thing is to “use a vector relationship to create a chain of reasoning on vector embeddings in some semantic space.” In translation, this means he is working on DL methods to create vector relationships between concepts, similar to the way Word2Vec does it, as described above.
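One simple way to picture "a chain of reasoning on vector embeddings" is additive relation composition in the style of TransE. McCallum's own models are recursive neural tensor networks, so treat this as a simplified stand-in; the entities, relation, and numbers below are invented toy values:

```python
import numpy as np

# Toy entity and relation embeddings (a real system learns these from data).
entity = {
    "Boston":        np.array([1.0, 0.0]),
    "Massachusetts": np.array([1.0, 1.0]),
    "USA":           np.array([1.0, 2.0]),
}
relation = {
    "located_in": np.array([0.0, 1.0]),  # relation holds when head + r ~= tail
}

def follows(head, r, tail, tol=1e-6):
    """TransE-style check: does head + r land on tail in the semantic space?"""
    return bool(np.linalg.norm(entity[head] + relation[r] - entity[tail]) < tol)

# Multi-hop path: composing the relation vector twice chains the reasoning
# Boston -> Massachusetts -> USA, entirely with vector addition.
path = entity["Boston"] + relation["located_in"] + relation["located_in"]
print(np.allclose(path, entity["USA"]))
```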

Nathan Wilson, CTO & Co-Founder, Nara Logics

Dr. Wilson’s presentation was “Biological Foundations for Deep Learning: Towards Decision Networks” where he discussed the basic principles of intelligence that have been pursued by two parallel research communities: computer scientists developing artificial intelligence, and neuroscientists exploring the brain. Recent advances, particularly in deep learning, present a key opportunity for new homologies and cross-pollination. In his talk he highlighted some of the latest learning rules discovered by each community and their surprising convergence. He then described how these rules can be coordinated at scale to take learning networks from perception to decisions, to help solve mature enterprise problems that are ripe for AI applications.

Nara Logics is basically creating a machine learning system inspired by biology, specifically cortical physiology. Wilson’s talk was high-level, with little technical content. He showed the many parallels between DL and biology, and emphasized that despite the neural network winter of the ’80s and ’90s, we should never forget that there are just too many similarities between the brain and machine learning to ignore. Both disciplines feed and learn from each other: neuroscientists can help computer scientists, and vice versa. The talk reminded us of Jeff Hawkins’ company Numenta, which is based on a similar biologically-inspired structure. Basically, these biologically-inspired companies are creating proprietary neural networks whose structure is much more similar to the human cortex than your typical DLNN. For DLNNs, the similarity basically ends with the fact that both use neurons and a layered hierarchy, whereas these other biologically-inspired companies try to incorporate many more similarities.

Adam Lerer, Facebook

Adam Lerer spoke about “Learning Physical Intuition by Example,” which examined how babies are known to acquire visual “common sense” concepts, such as object permanence, gravity, and intuitive physics, at a young age. For example, infants play with toy blocks, allowing them to gain intuition about the physical behavior of the world at a young age. While deep neural networks have exhibited state-of-the-art performance on many computer vision tasks, more complex reasoning (e.g. ‘what will happen next in this scene?’) requires an understanding of how the physical world behaves. Lerer explored the ability of deep feedforward models to learn such intuitive physics. Using a 3D game engine, he created small towers of wooden blocks, and trained large convolutional network models to accurately predict their stability, as well as estimating block trajectories. The models are able to generalize to new physical scenarios and to images of real blocks.

Lerer ran some experiments training DLNN on their reactions to physical simulations of toy blocks falling. Then he compared the machines’ expectations to those of toddlers. Lerer indicated that Facebook is interested in this research to learn about why people post certain pictures. He commented that people often post pics of people and things in precarious positions: trucks about to hit an overpass, cats about to fall into the toilet, etc. Facebook wants to recognize these pics automatically.
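For contrast with the learned convnet, the stability question itself can be answered by a hand-written physics heuristic. This toy version (our own illustration, not Lerer's code) treats each block as a 1-D segment of uniform density, given as (x_center, width), and checks that the center of mass of everything above each block stays over that block:

```python
def is_stable(tower):
    """Tower of blocks listed bottom to top, each as (x_center, width).
    Stable iff, for every block, the combined center of mass of the blocks
    above it lies within its horizontal extent (mass proportional to width,
    assuming uniform density)."""
    for i in range(len(tower) - 1):
        above = tower[i + 1:]
        total_mass = sum(w for _, w in above)
        com = sum(x * w for x, w in above) / total_mass  # x of center of mass
        x, w = tower[i]
        if not (x - w / 2 <= com <= x + w / 2):
            return False
    return True

print(is_stable([(0.0, 2.0), (0.2, 2.0), (0.4, 2.0)]))  # slight stagger: stable
print(is_stable([(0.0, 2.0), (1.5, 2.0)]))              # big overhang: topples
```

The point of Lerer's work is that a convnet learns roughly this judgment from rendered images alone, without ever being given the rule.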

Vivienne Sze and Yu-Hsin Chen, both from MIT

Dr. Sze and postdoc Chen talked on the topic “Building Energy-Efficient Accelerators for Deep Learning” from the perspective that as deep learning is becoming more ubiquitous in our lives, we are in need of better hardware infrastructure to support the large amount of computation foreseeable. In particular, the high energy/power consumption of current CPU and GPU systems prevents the deployment of deep learning at a larger scale, and dedicated deep learning accelerators will be the key to solving this problem. In this talk, they gave an overview of their work to build an energy-efficient accelerator, called Eyeriss, for deep convolutional neural networks (CNNs), which are currently the cornerstone of many deep learning algorithms. Eyeriss is reconfigurable to support state-of-the-art deep CNNs. Focusing on minimizing data movement between the accelerator and the main memory as well as within the computation fabric of the accelerator, they were able to achieve 10 times higher energy efficiency compared to modern mobile GPUs.

These researchers talked about energy-efficient chips for deep learning and the theme was ways to improve the efficiency of chips. The GPU chips are actually NOT fast enough in their current state, believe it or not. Like Bengio said, they need to improve about 100x in order to get to the “next level” of cognition — at least using current algorithms.

DAY 2

Nashlie Sephus, CTO, Partpic

Dr. Sephus spoke about “An Industrial-Strength Pipeline for Recognizing Replacement Parts,” indicating that image classification and computer vision for search are rapidly emerging in today’s technology and consumer markets. Partpic focuses on image search for replacement parts, and the focus was on presenting their industrial pipeline for such, with applications to fasteners. She discussed how they have aimed to overcome issues such as acquiring enough training data, training and classification of many different types of parts, identification of customized specifications of parts (such as finish type, dimensions, etc.), establishing constraints for the user to take a “good-enough” image, and scalability of many pieces of data associated with thousands of parts.

Partpic doesn’t actually use deep learning. Sephus’ very interesting talk was about how her company finds all kinds of parts, from screws to sink fixtures, etc. The user takes a pic of the part, it gets recognized by the database, and then the user can buy the part.

Cambron Carter, GumGum

The topic for Cambron Carter’s talk was “How Deep Learning and Image Recognition are Changing the Advertising Experience” where he indicated that despite being so vital to the financial fitness of many tech companies, there is disparity in the experience of online advertising. Deep learning is changing that experience. As a computer vision firm applying their technology to advertising, Carter discussed how GumGum is using deep learning for a multitude of purposes including content safety, reduction of redundant processing, and general image understanding. He also shared some highly specific – and occasionally peculiar – image recognition use cases to which GumGum employed deep learning techniques to afford the user a more organic experience, such as serving ads for lipstick only on pages which have images of people with “bold” lips. He described the problems they have attacked with deep learning, both supervised and unsupervised, their battle with statistics at scale, and how they see deep learning dramatically benefiting both consumers and marketers in the long-term.

In a nutshell, Carter explained how GumGum tries to find marketing/advertising solutions for their clients. They’re constantly getting unusual requests, such as L’Oréal’s “bold lips” classifier. For these, they get together in a room, try to decide what visual features correspond to the client’s request, and then train neural networks to find those images. So each client is different.

Adham Ghazali, CEO, Imagry

Ghazali’s presentation was “Large Scale Visual Understanding For Enterprise.” This company recognizes objects in real-time video, using code running directly on individual phones. They’ve figured out how to do image recognition while keeping the computational footprint low. He didn’t go into much detail about their code, but the main point is that a lot of the processing is done on the phones themselves. Someone asked a question about battery life, and the CEO just smiled. Perhaps this is their Achilles heel? Anyway, he gave a cool demo with live video in which objects had bounding boxes drawn around them, each labeled with a text description of what was recognized inside the box (e.g. a boy riding a skateboard), with each object identified in real time on a phone.

Mark Hammond, CEO, Bonsai

Hammond’s talk was “Doing for Artificial Intelligence what Databases did for Data,” where he made the case that building deep learning systems at present is part science, part art, and a whole lot of arcana. Rather than focusing on the concepts we want the system to learn and how those can be taught, one often finds oneself dealing with low-level details like network topology and hyperparameters. It is easy to lose the forest for the trees. Databases solved this problem for data by allowing users to program at a higher level of abstraction. With a database, one eschews low-level implementation details and instead builds a model of the information (the schema) using a high-level declarative programming language (e.g. SQL). The database server is then used to actualize this model and manage its usage with real data. Similarly, for artificial intelligence, one can build a model for conceptual understanding (the mental model) using a high-level declarative programming language (Inkling). An intelligence server can then be used to actualize this model and manage its usage with real data.

Mark explored the underpinnings of this technique, detailed the Inkling programming language, and demonstrated how one can build, debug, and iteratively refine models. To make things concrete and fun, Mark detailed creating a system to play the video game Breakout using deep learning, requiring codifying only the high-level concepts relevant for intelligent play and a curriculum for how to teach them.

What they’re doing is encapsulating all the mathematical details and overhead of programming a neural network into a new programming language designed to “do for Deep Learning what databases have done for data.” In other words, they hide the schema and structure associated with neural nets behind a simplified programming language (like SQL), so the programmer does not need to know the math behind neural networks. They plan to introduce the language in a few months. It’s similar to and compatible with Python, but more than just a library: it has new keywords and schema designed for writing neural network code.

Most important, Bonsai has changed the way they train neural networks. Hammond spent a lot of time on the pedagogy behind the way people learn, and argued that computers and neural networks learn the same way humans do: learning MUST begin with simple tasks and work up to more complex ones. The way we currently train DLNNs, with tons of data thrown at them, is like a human trying to learn chess by watching a complex game between grandmasters. No, we first learn very simple tasks and then work our way up. He argues that DLNNs likewise learn most efficiently when their training data is presented in a specific order. In fact, Google DeepMind never solved Pac-Man the way it solved twenty-something other Atari games. Why? Hammond argues it is because the DeepMind DLNN was not trained the correct way. Bonsai is working on this problem, using their pedagogy and their software/language to train their neural networks.
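The curriculum idea can be sketched without any neural network at all: score each training task for difficulty, order the set easy to hard, and feed it in stages. The task names and difficulty scores below are invented for illustration (Inkling's actual syntax was not yet public at the time):

```python
def curriculum_batches(examples, difficulty, stages=3):
    """Yield the training set in stages of increasing difficulty,
    each stage re-training on everything seen so far."""
    ordered = sorted(examples, key=difficulty)
    stage_size = -(-len(ordered) // stages)  # ceiling division
    seen = []
    for s in range(stages):
        seen.extend(ordered[s * stage_size:(s + 1) * stage_size])
        yield list(seen)

# Invented tasks with hand-assigned difficulty scores, simplest first.
games = [("chess_endgame", 3), ("tic_tac_toe", 1), ("full_chess", 9),
         ("checkers", 2), ("go", 10), ("connect_four", 4)]

for stage in curriculum_batches(games, difficulty=lambda g: g[1]):
    print([name for name, _ in stage])
```

The contrast with standard DLNN training is that the usual approach is equivalent to a single stage containing everything at once, in random order.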

Alejandro Jaimes, Chief Scientist, AiCure

The topic for Dr. Jaimes’ talk was “Artificial Intelligence in Improving Health Outcomes and De-Risking Clinical Trials,” based on the premise that it’s very common for people to not take their medication as prescribed. In population health, medication adherence estimates are around 50%, while in clinical trials estimates range between 43% and 78%. This results in huge costs for the pharmaceutical industry: around $378 billion a year. Jaimes gave an overview of why this matters and described how Artificial Intelligence can be used for medication adherence. Patients, or participants in clinical trials, use mobile phones while taking their medication. AiCure uses computer vision in real time to identify the medication and the patient, and to verify medication ingestion. The platform provides insights to physicians and clinical trial coordinators to produce better health outcomes and de-risk clinical trials, while encouraging patients or clinical trial participants to take their medication as prescribed, effectively impacting everyone on the planet.

In essence, AiCure is trying to make you take your medicine. They take a picture of people taking their medicine and report the data to doctors. Apparently this is important during clinical trials, where people often skip their dosage or don’t care. There are issues with image recognition of the correct pills, and different types of patients (from addicts to the elderly) have different needs (e.g. the elderly might need to take 20 pills simultaneously, while addicts might skip doses or try to trick the system).

Byron Galbraith, Talla

Dr. Galbraith’s talk was “Beyond the Keyword Search: Finding Job Candidates with CV2Vec.” A major challenge for HR teams is finding, interviewing, and on-boarding job candidates. At Talla, Galbraith is building intelligent assistants that employ deep learning to help offload some of the tedious and time-consuming parts of this workload. His talk focused on CV2Vec, a set of experiments they’ve done on the candidate sourcing side of this process. By training neural models to map CV and resume documents into a dense vector representation, they are able to perform candidate searches on more than just keywords. They can find candidates that are most similar to a reference person or the job ad itself, cluster people together and visualize how CVs align with each other, and even make a prediction as to what someone’s next job will be.

Galbraith uses a mathematical technique similar to Word2Vec to place resumes (around 30,000 to start, a somewhat small data set but enough) into a vector space analogous to the Word2Vec vector space. It’s not clear whether this method is valid, but he’s trying to make a business out of it by providing more than just keyword searches for candidates and the companies trying to hire them.
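The similarity search CV2Vec enables can be sketched as nearest-neighbor ranking in the embedding space. The vectors below are random stand-ins for learned resume embeddings, and the setup (a job ad embedded in the same space) is our assumption from the talk's description:

```python
import numpy as np

# Hypothetical "CV2Vec" vectors: each row embeds one resume; the query
# embeds the job ad. In the real system a neural model produces these.
rng = np.random.default_rng(1)
cvs = rng.normal(size=(5, 8))
job_ad = cvs[2] + 0.05 * rng.normal(size=8)  # candidate 2 is a near-match

def rank(query, docs):
    """Indices of docs ordered by cosine similarity to the query, best first."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    return np.argsort(docs_n @ query_n)[::-1]

order = rank(job_ad, cvs)
print(order)  # candidate 2 should come out on top
```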


One general takeaway from the summit is that the dimensionality of the data is often much higher than the number of data points available, and the number of data points is monstrously large to begin with! This is the basic problem of cognition: how to manage the dimensionality of the data for processing and the amount of data required for training.

We can end with a great quote someone put up about what these Deep Learning researchers have been trying to do for the past ten years, since Deep Learning algorithms showed up on the scene: “We are trying to replace symbols by vectors so we can replace logic by algebra.” — Yann LeCun (one of the three founding fathers of DL, along with Yoshua Bengio and Geoff Hinton). This quote basically sums up what these researchers are trying to do mathematically, and perhaps even summarizes pretty well what goes on inside the brain.


