
Interview: David Blei, Professor, Columbia University

Today, big data has implications across all industries, from healthcare, automotive, and telecom to IoT and security. As the data deluge continues, we are finding new ways of managing and analyzing it to gather actionable insights and grapple with the challenges of security and privacy.

The Association for Computing Machinery (ACM) just concluded a celebration of 50 years of the ACM A.M. Turing Award (commonly known as the “Nobel Prize of computing”) with a two-day conference in San Francisco. The conference brought together some of the brightest minds in computing to explore how computing has evolved and where the field is headed. Big data was the focus of a number of panels and discussions at the conference. The following is a discussion with David Blei, Professor, Columbia University; recipient of the 2013 ACM-Infosys Foundation Award; ACM Fellow (2015).

Question: Gartner estimates there are currently about 4.9 billion connected devices (cars, homes, appliances, industrial equipment, etc.) generating data. This is expected to reach 25 billion by 2020. What do you see as some of the primary challenges and opportunities this wave of data will create?

David Blei: There are two levels of opportunities, with one being at the personal level. We are now surrounded by a variety of connected devices, each one eventually connecting to a person, and all of that data can help us make things easier for that person. For example, think about Netflix’s recommendation algorithm or email spam filters. These all use the large data sets that come from connected devices to make predictions and make things better for an individual person. The key idea here is that the data from something as simple as one person’s Netflix viewing habits doesn’t, by itself, yield the recommendation of a new movie. It’s that data alongside the data from everybody else that makes recommendations possible.
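The intuition that one person’s viewing habits only become useful alongside everyone else’s can be sketched with a tiny user-based collaborative filter. This is a minimal illustration of the idea, not Netflix’s actual algorithm; the users, movies and ratings below are made up.

```python
from math import sqrt

# Toy ratings (hypothetical): user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann":  {"matrix": 5, "alien": 4, "amelie": 1},
    "bob":  {"matrix": 4, "alien": 5, "amelie": 1},
    "cara": {"matrix": 1, "alien": 1, "amelie": 5},
    "dan":  {"matrix": 1, "amelie": 4},   # dan has not seen "alien"
}

def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    na = sqrt(sum(a[m] ** 2 for m in common))
    nb = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (na * nb)

def predict(user, movie, ratings):
    """Similarity-weighted average of OTHER users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        w = cosine(ratings[user], their)
        num += w * their[movie]
        den += w
    return num / den if den else None

score = predict("dan", "alien", ratings)
```

Dan’s own two ratings say nothing about “alien”; the prediction comes entirely from how his tastes line up with everyone else’s, which is exactly the aggregate effect described above.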

It’s an exciting world because we are personalizing our interaction with devices through the aggregate data of everybody using their devices. And of course, this all comes with a challenge around privacy: what we give up when we make our data available, and the trade-off between how much we give up and how much personalization power we get in return.

The other opportunity is an unprecedented way to learn about the world through these huge collections from many individuals. This is a massive data set, and patterns of communication, patterns of interaction and patterns of movement–including all types of other macro-level descriptions of society, people and the world–are now available to us. You can think of all this data together as something akin to a telescope that gives us high-resolution information about the universe around us. This provides us with information about behavioral patterns.

Another place where this comes up is in genetics, medical records and medical matters. For instance, you can look at the history of treatments and medical records alongside a person’s genes. On a worldwide scale, Big Data can help us both at the individual level, to form treatments for people, and at the macro level, to understand relationships between genes and diseases, to understand patterns of health, and to understand global patterns of genetic traits.

Studying and collecting Big Data comes with its challenges. If you take the example of genes and diseases, it’s an important computer science and statistics problem that’s unsolved. Data scientists are looking to answer how we take data that we observe from the world and use it to identify causal connections between two variables.

Question: As more data is collected from a growing pool of devices, has the individual lost the right to information privacy?

David Blei: I think this is an important, and often foggy, issue. We’ve seen that there are two types of people: those who are uncomfortable with everything that they do with their computer and phone being available to someone else, and those who don’t seem to be.

Question: What moral quandaries do you see arising from the increasing use of predictive data analytics? How do we overcome these challenges?

David Blei: The question of morality is an important and difficult one, and it relates to the distinction between the power to build tools and the question of what is the right way to use those tools. One must be deliberate and adhere to the principles that we decide we want to maintain.

This distinction is one that we as computer scientists and machine learners need to think about and address head on. With different types of predictive tools, and especially those being deployed at large scale, come consequences around who sees what and how people get their information. And we can’t simply ignore those consequences and say, “Well, all I did was build a machine learning tool that showed you what news articles I think you would like.” I don’t think we can say “it’s not our problem” if a negative effect comes from something we created just because we didn’t intend it.

How to solve this problem is a philosophical question–an applied philosophical question, if that kind of thing can exist. Computer scientists can sit down and come up with the criteria for what it means for an algorithm to be responsibly deployed, and then ask whether the algorithms we build meet those criteria. One challenge here is that we don’t know what’s going to happen when we deploy our algorithms. So, another new problem becomes understanding the effect of an algorithm on the domain where you deploy it. You have to be deliberate and you have to take responsibility. Just because you made it for one task doesn’t mean it can’t be used for another.

Question: Security is a hot topic regarding Big Data. To what extent will Big Data be responsible for new security problems and challenges?

David Blei: That’s an area that is important for computer security experts to study. But it’s more than computer security; it’s also about sociology and economics, specifically the effects of data availability. What is the effect of all data being private? What’s the effect of all data being public? What’s the effect of particular data being public and particular data being private? These are difficult questions to answer.

For example, imagine there are thousands of hospitals across the country, and each one collects records about patients. Those records contain the patient’s name and the diseases and conditions the patient has had. They may also contain procedures that the patient decided to have, maybe elective procedures, and lab results. So, each hospital record is a very rich collection of data–private data about a person. Now, one reaction is to say that has to be private, and it’s a fair reaction. However, imagine all that data in aggregate: what could we learn? We could learn how certain treatments can lead to better future treatment. We could learn patterns of health and some of the societal or environmental causes of those health problems, which we could then perhaps intervene in and fix. We could even learn connections between genetics and diseases and other traits. So, by making all the data completely private, we are relinquishing the power to use that data for these kinds of inferences that we might want to make.

So, with that in mind, what do you do? Well, that’s where security and privacy experts come in–thinking about how public is too public, whether I can work with the data to make it more anonymous while still keeping it usable for global, or even individualized, inferences, and then what line to draw. To me, we get the most bang for our buck if we make everything public; however, that comes with serious security and morality issues, so we shouldn’t do that.
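One concrete way privacy researchers split this difference is differential privacy, which releases aggregate statistics with calibrated noise. Blei doesn’t name the technique here, but a minimal sketch of the Laplace mechanism for a hospital-style count query illustrates the trade-off; all the numbers below are hypothetical.

```python
import random

def dp_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    A counting query has sensitivity 1: adding or removing one patient's
    record changes the count by at most 1, so Laplace noise with scale
    1/epsilon suffices. The difference of two Exponential(epsilon) draws
    is exactly Laplace-distributed with scale 1/epsilon.
    """
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

rng = random.Random(42)
true_count = 1283   # hypothetical: patients across hospitals with a diagnosis
releases = [dp_count(true_count, epsilon=0.5, rng=rng) for _ in range(1000)]
average = sum(releases) / len(releases)
```

Each individual release is noisy enough to hide any one patient’s presence, yet the aggregate signal survives: averaged over many queries, the released values concentrate around the true count. Smaller `epsilon` means more privacy and more noise, which is exactly the “what line do I draw” question.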

But making everything completely private and not benefitting from that data also doesn’t seem like a great option. I think this is a difficult, thorny  issue. It lives at the intersection of policy, philosophy, morality, computer science, data science and machine learning.

Question: Are there potential technological breakthroughs on the horizon that you think could transform this area again in the near future?

David Blei: I think we are in the middle of a transformative time for machine learning and statistics, and it’s fueled by a few ideas. Reinforcement learning is a big one. This is the idea that we can learn how to act in the face of an uncertain environment with uncertain consequences of our actions. This is the idea that is fueling a lot of the amazing results that we’re seeing in machine learning and AI. Deep learning is another idea: a very flexible class of learners that, when given massive data sets, can identify complex and compositional structure in high-dimensional data. Another idea that’s fueling this–a 60-year-old idea–is optimization. I have some kind of function and I want the maximal value of that function; how do I do that? Well, it’s called an optimization procedure. Optimization tells us how to do that very efficiently with massive data sets.
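The optimization idea–find the input that maximizes a function–can be sketched in a few lines of gradient ascent, the workhorse behind the large-scale variants (such as stochastic gradient descent) used on big data sets. The function below is a made-up one-dimensional example, not anything from the interview.

```python
def gradient_ascent(grad, x0, lr=0.1, steps=200):
    """Repeatedly step uphill along the gradient to find a local maximum."""
    x = x0
    for _ in range(steps):
        x = x + lr * grad(x)
    return x

# Maximize f(x) = 7 - (x - 3)**2, whose gradient is -2 * (x - 3).
# The unique maximum is at x = 3.
x_star = gradient_ascent(lambda x: -2.0 * (x - 3.0), x0=0.0)
```

The “big data” twist is that when the function is a sum over millions of data points, each step can use the gradient of a small random batch instead of the whole sum, which is what makes these procedures efficient at scale.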

So, what’s on the horizon? One is probabilistic programming–devising models of the world that capture our theories and assumptions, and then using data in concert with the model to reason about those assumptions and draw inferences and predictions about unknown things and the future. The idea of probabilistic modeling has been around a long time, and in the 90s and early 2000s, machine learners built a methodology for computing with probability models. What probabilistic programming does is make that whole idea very generic.

Whereas in the 90s we would make assumptions about the world and then have to work hard to devise an algorithm that computes with data under those assumptions, probabilistic programming takes the computer science view: “I’m going to write the model down like a program, and then to get that algorithm, I’m going to build a compiler that can take any program, any model, and create the algorithm that computes with data under those assumptions.” That suddenly opens the door: instead of taking eight months to fit one probability model, I can explore many probability models, and that’s synonymous with exploring many theories about how the world works under a data set. Probabilistic programming also interacts beautifully with things like deep learning, where you take neural networks and use them as pieces in a larger probability model that expresses your theories and assumptions about the world in a structured way.

And the other idea, which has also been around, is causality: the idea that forming a prediction about the future is different from making a causal claim about the way the world works. A lot of the successes of AI and machine learning are predictive in nature. They take the world as it comes to us, form a prediction about the future, and then use that prediction to succeed. But if we want to take the next step–to know what would happen if I intervene in a certain way, and how that will change the world–that’s a causal problem. This is something that people like Turing Award winner Judea Pearl and many statisticians like Don Rubin have worked on for years and years. I think that the intersection of Big Data, causality and the scientific method is a big idea that’s going to be transformative soon.
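The gap between prediction and intervention can be seen in a toy simulation: a hidden common cause makes x and y strongly correlated, so x predicts y very well, yet intervening on x (in the spirit of Pearl’s do-operator) has no effect on y at all. The variables and noise levels here are purely illustrative.

```python
import random

rng = random.Random(0)

def observe(n=50_000):
    """Observational world: hidden z causes both x and y; x does NOT cause y."""
    pairs = []
    for _ in range(n):
        z = rng.gauss(0, 1)              # hidden common cause
        x = z + rng.gauss(0, 0.1)
        y = z + rng.gauss(0, 0.1)        # depends on z only, never on x
        pairs.append((x, y))
    return pairs

def correlation(pairs):
    """Pearson correlation of the observed (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

def do_x(x_value, n=50_000):
    """Intervention do(X = x_value): x is set by fiat, cutting the z -> x
    arrow; y is generated exactly as before, since it never depended on x."""
    return sum(rng.gauss(0, 1) + rng.gauss(0, 0.1) for _ in range(n)) / n

r = correlation(observe())        # strong association: x predicts y well
effect = do_x(2.0) - do_x(-2.0)   # yet forcing x to change has no effect on y
```

A purely predictive model trained on the observational data would happily use x to forecast y, and be right; but a policy that manipulates x expecting y to move would fail. Distinguishing the two is exactly the causal problem.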

Question: In what ways can Big Data be better utilized for greater public benefit?

David Blei: If I had to sum it up in one word–science. Understanding the world through observation is the key problem, and Big Data–by which I mean both large data sets with many measurements and also data sets with many variables–is transforming what’s possible in terms of understanding the world through observation, and that’s science.

