Commonsense Understanding: The Big Apple of Our AI


A Report from Two New York City AI Conferences: Artificial General Intelligence Research Backs Away from Back-propagation to Leverage Logic, Physical and Social Models of the World

Although machines can defeat the best human Go players, drive cars under ideal conditions, and answer natural-language inquiries about the weather or other narrow topics, they do not actually understand what they are doing. Their success has been achieved primarily through a mathematical technique applied to multi-layer artificial neural networks, called back-propagation, which achieves its results by gobbling vast amounts of labeled training data and computing power. These neural networks are often optimized by adding signal-processing structures such as convolution or recurrent feedback. The finished product goes by the name Deep Learning.

Top AI researchers are beginning to believe, however, that Deep Learning may not be the correct path to artificial general intelligence (AGI), the term for future general-purpose learning algorithms that could one day produce human-level cognition. Because Deep Learning relies so heavily on large quantities of data, it cannot generalize from a few examples, a divergence from the way the human brain learns. Geoff Hinton, a senior Google scientist and one of the inventors of back-propagation as applied to Deep Learning, said in a September 2017 interview [1], “My view is to throw [back-propagation] away and start again … we clearly do not need all that labeled data.”

According to a joint paper [2] published in 2017 by the Chinese Academy of Sciences and the University of Nebraska, today’s best “intelligent” machines fail to demonstrate even the general cognitive ability of a six-year-old. So as we enter 2018, what is the state of progress in AGI, and why has it been so difficult to move forward? Researchers at both the 2017 O’Reilly AI and the 2017 Cognitive Computational Neuroscience (CCN) conferences, which took place in New York City last year, believe that the primary component of general intelligence still missing from today’s state-of-the-art AI is commonsense understanding. So what does it take to imbue a silicon image with this characteristic?

Carnap and the Definition of Commonsense Understanding

Explanation and prediction form the backbone of scientific reasoning and human commonsense. Back in 1966, Rudolf Carnap, a giant of 20th-century philosophy, wrote, “What good are [scientific] laws? What purposes do they serve in science and everyday life? The answer is twofold: they are used to explain facts already known, and they are used to predict facts not yet known [italics Carnap’s].” [3] At CCN, over a half century after these words were published, Josh Tenenbaum, Professor of Cognitive Science at MIT, stood on Carnap’s shoulders as he spoke about the importance of developing AI with a human ability to model the world. Humans learn and predict using scientific laws that are translated into commonsense — these laws range from universal physical laws (if the cup tips, milk will spill; if the ball drops, it will fall to the ground) to complex social laws (when, where, and from whom is it okay to steal a french fry or steal a kiss?). And as Carnap reminds us from the same source, “In many cases, the laws involved may be statistical rather than universal. The prediction will then be only probable.”

Tenenbaum uses Bayesian Theory, a form of inductive logic, as a means to model these probabilities and supplement more traditional Deep Learning with physical and social models of the world. When the brain models a physical or social law, whether causal or statistical, the model acts as a template to facilitate prediction. Prediction constitutes a key element of commonsense. “Intelligence is not just about pattern recognition,” says Tenenbaum. “The mind is a modeling engine.”
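The Bayesian updating Tenenbaum describes can be illustrated with a minimal sketch. The scenario and all the probabilities below are invented for illustration: a prior belief that a cup will spill is revised after a noisy observation of how far the cup is tilted.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) via Bayes' rule."""
    numerator = likelihood_if_true * prior
    evidence = numerator + likelihood_if_false * (1.0 - prior)
    return numerator / evidence

# Prior: cups on this table rarely spill (hypothetical number).
p_spill = 0.1

# Observation: the cup looks strongly tilted. Assume (hypothetically)
# that a strong tilt is seen 90% of the time when a spill follows,
# but only 20% of the time otherwise.
p_spill = bayes_update(p_spill, 0.9, 0.2)
print(round(p_spill, 3))  # 0.333
```

One weak piece of evidence raises the belief from 10% to about 33%; each further observation would update it again, which is the sense in which the prediction is "only probable," as Carnap put it.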

The Difficulty with Creating Models of the World

At CCN, Tenenbaum referenced a paper he wrote with Peter Battaglia and Jessica Hamrick, titled “Simulation as an Engine of Physical Scene Understanding.” In the abstract to that paper, the authors spell out why the modeling of physical systems, and hence the emulation of the mind, proves so difficult.

“In a glance, we [humans] can perceive whether a stack of dishes will topple, a branch will support a child’s weight … or [if] a tool is firmly attached to a table or free to be lifted. Such rapid physical inferences are central to how people interact with the world and with each other, yet their computational underpinnings are poorly understood. We propose a model based on an “intuitive physics engine,” a cognitive mechanism … that uses approximate, probabilistic simulations to make robust and fast inferences in complex natural scenes where crucial information is unobserved.”

Figure 1

That last phrase, “…where crucial information is unobserved,” is a key reason these models are so difficult to create artificially. A Deep Learning neural network can successfully label all the objects in a scene, but it cannot infer the relationships between them. For example, in the stack of dishes shown in Figure 1, what’s unobserved is the lack of support. A barista could tell that this is a disaster waiting to happen, but a machine sees just a stack of dishes. “How do you [program an AI to] see not just the objects,” asks Tenenbaum, “but what’s going on physically?”
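The "approximate, probabilistic simulations" idea can be sketched in a few lines. This is a toy, one-dimensional stand-in for an intuitive physics engine, not the authors' model: the geometry, the stability rule, and the perceptual-noise parameter are all invented. The stack is a list of plate offsets; the engine runs many simulations with noise added to the observed positions and reports how often the stack topples.

```python
import random

PLATE_HALF_WIDTH = 1.0  # hypothetical units

def stack_topples(offsets):
    """1-D stability check: the stack topples if, at any level, the
    center of mass of everything above overhangs the plate below."""
    for i in range(len(offsets) - 1):
        above = offsets[i + 1:]
        com = sum(above) / len(above)
        if abs(com - offsets[i]) > PLATE_HALF_WIDTH:
            return True
    return False

def p_topple(observed_offsets, noise=0.2, n_sims=10_000, seed=0):
    """Estimate P(topple) by Monte Carlo: add perceptual noise to each
    observed position (crucial information is unobserved) and count
    how many simulated stacks fall."""
    rng = random.Random(seed)
    topples = 0
    for _ in range(n_sims):
        noisy = [x + rng.gauss(0.0, noise) for x in observed_offsets]
        if stack_topples(noisy):
            topples += 1
    return topples / n_sims

print(p_topple([0.0, 0.1, 0.2, 0.3]))  # tidy stack: low risk
print(p_topple([0.0, 0.6, 1.2, 1.8]))  # leaning stack: high risk
```

Even in this toy, the output is a graded probability rather than a yes/no label, which is what lets a barista-like observer flag the leaning stack as "a disaster waiting to happen."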

Social models have also proven difficult to build. Non-verbal cues such as body language and emotive facial expressions are vital for social success, but at CCN 2017, Rebecca Saxe, another MIT cognitive scientist, showed that facial expressions and body language, when presented without context in either pictures or video, are not a reliable indicator of human emotional state. Saxe described an experiment she conducted using scenes from the British game show Golden Balls. The show pits two contestants against each other in a type of Prisoner’s Dilemma, in which each contestant chooses to share or steal a giant cash prize. If both contestants decide to steal, they both get nothing. If both decide to share, they split the prize. If one decides to share and the other decides to steal, the contestant who steals takes the whole prize, while the one who shares gets nothing. Saxe asked human subjects to watch decontextualized video (no sound, etc.) and rate, based solely on facial expressions, which contestants had won or lost money. Across 88 episodes, subjects rated 20 different emotions. With the context stripped away, Saxe found that human observers could determine no better than chance which contestants had won or lost money; the average emotion profiles for winning and losing contestants were the same. “Facial expressions by themselves are pretty ambiguous,” says Saxe. “They get their meaning often by their context.”
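For readers unfamiliar with the show, the split-or-steal rules described above reduce to a tiny payoff function (the prize amount here is arbitrary):

```python
def payoffs(choice_a, choice_b, prize=100_000):
    """Return (payoff_a, payoff_b) for choices 'share' or 'steal',
    per the Golden Balls rules."""
    if choice_a == "share" and choice_b == "share":
        return prize / 2, prize / 2   # both share: split the prize
    if choice_a == "steal" and choice_b == "steal":
        return 0, 0                   # both steal: both get nothing
    # One steals, the other shares: the stealer takes everything.
    return (prize, 0) if choice_a == "steal" else (0, prize)

print(payoffs("share", "share"))  # (50000.0, 50000.0)
print(payoffs("steal", "share"))  # (100000, 0)
```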

Probabilistic Programming

Deep Learning and model engines working together provide a simplified hierarchical framework similar to the one we see in the brain. Deep Learning provides the pattern recognition needed to label objects, and models provide the foundation for the inductive logic needed for commonsense reasoning. Tenenbaum and his colleagues are in the initial stages of fusing Deep Learning with physical and social models to build cognitive systems they call probabilistic programs.

For a toy example of this hierarchical structure in action, see Figure 2. A machine could use a Deep Learning neural network to count three bananas, place that number into its memory, and then, if it saw a monkey, predict, using a knowledge base, that the monkey would probably want a banana. The machine could then use the same inductive logic and knowledge base to reason that if a truck rolled by, the truck would probably not want a banana. The first level of this process (called the connectionist level) classifies bananas and trucks via a Deep Learning neural network. This level performs signal processing: a reduction of millions of image pixels into a few dozen bytes, such as a label for the bananas, a label for the number three, a label for the monkey, and a label for the truck.

Figure 2

Once the labels are stored in memory, the AI can process them as symbols within an inductive logic program derived from a model, and come to a probabilistic conclusion or prediction. A model of the world acts as an interface between the signal processing and the symbol processing; the model grounds the cognitive engine in a form of semantic understanding.
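The banana-and-monkey pipeline from Figure 2 can be sketched end to end. Everything here is a stand-in: the "connectionist level" is faked with a stub classifier, and the knowledge base is a hand-written table of invented probabilities, standing in for the statistical laws the symbolic level would reason over.

```python
def classify_scene(pixels):
    """Stand-in for the connectionist level: a Deep Learning network
    would reduce millions of pixels to a few symbolic labels."""
    return {"banana": 3, "monkey": 1, "truck": 1}  # pretend detections

# Knowledge base: P(agent wants a banana). A statistical rather than
# universal law, so the conclusion is only probable (made-up values).
WANTS_BANANA = {"monkey": 0.95, "truck": 0.01}

def predict_banana_seekers(labels, threshold=0.5):
    """Symbolic level: apply the knowledge base to the stored labels
    and return the agents likely to want a banana."""
    return {agent: p for agent, p in WANTS_BANANA.items()
            if labels.get(agent, 0) > 0 and p > threshold}

labels = classify_scene(pixels=None)    # signal processing (stubbed)
print(predict_banana_seekers(labels))   # {'monkey': 0.95}
```

The division of labor mirrors the text: the first function compresses a scene into symbols, and the second reasons over those symbols with a model, predicting that the monkey, but not the truck, probably wants a banana.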

2018 and Beyond

Future progress in AI will be made by seamlessly integrating the connectionist and symbolic levels, with or without the need for back-propagation.

To learn more about probabilistic causal models and Bayesian inference, see the eBook Probabilistic Models of Cognition, written by Josh Tenenbaum and his colleague Noah Goodman. The eBook is a fantastic summary of Tenenbaum’s vision for causal models.

Another resource is a seminal 72-page article Tenenbaum wrote with Brenden Lake of NYU, Tomer Ullman of MIT, and Samuel Gershman of Harvard, titled “Building Machines that Learn and Think Like People.” [4] In this paper, which amounts to a short book, Tenenbaum and his colleagues lay out their vision for moving artificial general intelligence research into 2018 and beyond.



[3] Rudolf Carnap, Martin Gardner (editor), An Introduction to the Philosophy of Science, Basic Books, 1966


Contributed by: Howard Goldowsky who lives near Boston, MA, where he programs DSP algorithms, trains at chess, and studies AI and cognitive science. He has been writing about chess for almost 20 years and is the author of two chess books. Recently he has begun to write about machine learning and AI.




