Interview: Ankur Goyal, VP of Engineering at MemSQL

I recently caught up with Ankur Goyal, Vice President of Engineering at MemSQL, to get an inside look at the new era of predictive analytics. Ankur runs engineering at MemSQL. He was one of the first employees at MemSQL, starting in 2011, and has actively driven growth of the team since then. Ankur previously held systems research positions at Microsoft and Carnegie Mellon University, where he worked on innovative distributed systems. He studied computer science at Carnegie Mellon University.

Daniel D. Gutierrez – Managing Editor, insideBIGDATA

insideBIGDATA: Predictive analytics has been described as using modeling, machine learning, and data mining of historical data to make predictions about the future. How do you define or characterize predictive analytics?

Ankur Goyal: We define predictive analytics as taking the best of both current and historical data to model the most likely outcomes, enabling companies to adapt and learn in real-time.

Use of historical data is an important cornerstone of predictive modeling; however, the most accurate predictions come from a combination of real-time and historical data. For example, weeks of historical data on grocery inventory means little when a big storm hits and shoppers rush to stock up. In that case, real-time inventory assessment can help grocery chains optimize distribution with the most recent information.

Predictive analytics has two parts:

  1. The use of historical data to train and parameterize machine learning models. Retraining models in real-time is a newer approach, and it requires training on a combination of real-time and historical data.
  2. The application of the machine learning model to data as it streams in. Each record is tagged with the model's output at time of ingest, so simple SQL queries can be run against it to surface anomalies, and this tagged data can in turn be used to train future models.
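
The two steps above can be sketched in Python. This is a minimal illustration, not MemSQL functionality: the "model" is a trivial z-score threshold, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TaggedReading:
    sensor_id: str
    value: float
    anomaly_score: float  # attached at ingest time

def train_threshold_model(history: List[float]) -> Callable[[float], float]:
    """Step 1: 'train' on historical data by computing mean and stddev."""
    n = len(history)
    mean = sum(history) / n
    std = (sum((x - mean) ** 2 for x in history) / n) ** 0.5
    # Score = distance from the historical mean, in standard deviations.
    return lambda x: abs(x - mean) / std if std else 0.0

# Step 1: train on historical readings.
score = train_threshold_model([10.0, 10.5, 9.8, 10.2, 9.9])

# Step 2: apply the model to data as it streams in, tagging each row.
stream = [("s1", 10.1), ("s1", 25.0), ("s1", 9.7)]
table = [TaggedReading(sid, v, score(v)) for sid, v in stream]

# A simple filter (the SQL-query analogue) then surfaces anomalies.
anomalies = [r for r in table if r.anomaly_score > 3.0]
```

Because the score is stored with each row, finding anomalies later is an ordinary filter over the table rather than a re-run of the model.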

insideBIGDATA: Why does it matter to today’s businesses?

Ankur Goyal: We live in a real-time world and no one likes to wait. Predictive analytics is the next step for businesses to succeed in the digital economy, where the winners will be able to tailor offerings and make every moment work for them.

Time is a valuable asset for every company. And any sufficiently low-latency data pipeline is indistinguishable from a time machine. If you can adapt and learn before your competitors, you come out ahead.

Predictive analytics is the natural next step after real-time dashboards. Once analytics are fast enough that you can view a dashboard in real-time, the next step is to skip looking yourself and have a machine figure out what went wrong or recommend what to consider. That is the natural evolution from a real-time dashboard to predictive analytics, and it lets you automate many analytical processes performed by hand today. Essentially, predictive analytics gives you a virtual army of software analysts who can find anomalies with extreme speed.

insideBIGDATA: What’s the potential upside for companies that can combine historical data with real-time data when doing predictive analytics?

Ankur Goyal: The upside of combining historical and real-time data is improved accuracy. Admittedly, history alone is not a predictor of the future, and for that matter neither is the present. However, the closer a company can get to both current and historical data, the more accurate the prediction.

Real-time data is an essential component of predictive analytics, as these events themselves drive predictions. It matters because older data and slower predictions waste time before action can be taken. Historical data significantly enhances the context in which a prediction can be made. For example, when predicting whether a sensor is faulty, a query can look at both the current and historical sensor readings to detect anomalies.
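
The faulty-sensor example can be expressed as comparing each new reading against a rolling window of historical readings. The sketch below is illustrative only; the window size and cutoff are arbitrary choices, not values from the interview.

```python
from collections import deque

def make_detector(window=5, cutoff=2.0):
    """Flag a reading as anomalous if it deviates from the rolling
    historical median by more than `cutoff` times the median absolute
    deviation (MAD) of that window."""
    history = deque(maxlen=window)

    def check(reading):
        flagged = False
        if len(history) == window:
            ordered = sorted(history)
            median = ordered[window // 2]
            mad = sorted(abs(x - median) for x in history)[window // 2]
            if mad and abs(reading - median) / mad > cutoff:
                flagged = True
        history.append(reading)  # the present becomes history
        return flagged

    return check

check = make_detector()
readings = [20.0, 20.3, 19.8, 20.1, 20.2, 20.0, 55.0, 20.1]
flags = [check(r) for r in readings]  # only 55.0 is flagged
```

Note that the same stream serves both roles: each reading is judged against history, then folded into the history used to judge the next one.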

insideBIGDATA: What are some examples of applications performing predictive analytics across large data sets in real-time?

Ankur Goyal: One example of applications performing predictive analytics in real-time involves industrial equipment and the need to predict equipment failure points. The operational and capital equipment costs due to failure while in production can be catastrophic to a company’s bottom line and cause a delay across an entire value chain.

Another example is using click-tracking history to predict which web pages a user will visit and serve ads accordingly. This can be modeled and implemented in real-time as a user visits new web pages.

Using high performance analytical solutions such as an in-memory database, companies can analyze sensor information from industrial equipment in real-time and ensure that they get maximum use while minimizing downtime, getting ahead of maintenance issues and increasing overall production output.

insideBIGDATA: What are the technical challenges related to predicting future behavior based on querying real-time data?

Ankur Goyal: Many companies have been using predictive analytics, but typically in a batch-process workflow. The main challenge stems from legacy systems being based on disk drives rather than memory. These systems are simply too slow to operate in real-time.

Another challenge is that predictive analytics usually involves multiple systems: one to capture data, another to query data, and sometimes a third to develop predictive models.

By combining these systems, companies can shrink a typical Extract, Transform, and Load (ETL) process from overnight to intraday. This is possible with in-memory databases that can handle transactional and analytical workloads simultaneously, something Gartner refers to as Hybrid Transactional/Analytical Processing, or HTAP.

Most systems are simply not able to handle the complex processing needed for real-time predictive analytics. From a technical perspective, you need to be able to score each data point and compare it to the scores of previous data points, i.e. be able to ingest and process data quickly but also be able to flexibly query it against historical data. Because of this requirement, MemSQL is a natural choice for predictive analytics workloads.

insideBIGDATA: IDC recently reported that by 2018, there will be 22 billion IoT devices installed, driving the development of over 200,000 new IoT apps and services. How will predictive analytics play a part in the Internet of Things?

Ankur Goyal: There is no question data volumes are on the rise. And unlike the world of “virtual goods,” the Internet of Things promises to link the real world with our vast connected computing infrastructure. The data volumes driven by always-on devices and sensors demand infrastructure that can keep up.

MemSQL supports this cultural shift and subsequent data growth through our in-memory, distributed database, which is capable of capturing millions of transactions per second, far surpassing the limits of conventional infrastructure. Further, MemSQL allows concurrent analytics on this infrastructure so that every query is accurate to the last transaction, or the last click.

As companies seek to understand what is happening with their applications, devices, cars, drones, equipment, and all-connected infrastructure of IoT, they need data processing infrastructure to capture, process, and analyze metrics in real-time. Doing this effectively with predictive analytics will elevate the winners in the digital economy.

insideBIGDATA: Can you elaborate on how companies are changing the ways they extract value from their data and specifically, how predictive models can be applied to these businesses?

Ankur Goyal: Companies like Pinterest, Comcast, Samsung, and Shutterstock choose MemSQL to process massive amounts of information and to gather analytics on rapidly changing datasets. The first step for all companies is to build real-time data pipelines and dashboards. This provides visibility.

The next step is to move into the realm of predictive analytics. The good news is that many existing predictive models can be implemented without rework. For example, predictive models built in SAS can be exported using the predictive model markup language (PMML) and then embedded into a real-time workflow using MemSQL.
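
Embedding an exported model in a real-time workflow can be sketched as follows. The parameters here are hypothetical stand-ins for what a format like PMML would carry; this is not a PMML parser, just the scoring step applied inline as events arrive.

```python
import math

# Hypothetical parameters exported from an offline modeling tool
# (the kind of coefficients a PMML file would describe).
MODEL = {
    "intercept": -4.0,
    "coefficients": {"clicks_last_hour": 0.9, "pages_viewed": 0.35},
}

def predict_visit_probability(features):
    """Score one event with the exported logistic-regression model."""
    z = MODEL["intercept"] + sum(
        MODEL["coefficients"][name] * value
        for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link function

# Embedded in a real-time workflow: score each event as it streams in.
events = [
    {"clicks_last_hour": 1, "pages_viewed": 2},   # low engagement
    {"clicks_last_hour": 6, "pages_viewed": 8},   # high engagement
]
scores = [predict_visit_probability(e) for e in events]
```

The point is that training stays offline while scoring is cheap enough to run per event, so the model built elsewhere needs no rework to serve real-time predictions.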

insideBIGDATA: We’ve seen companies such as Uber and Airbnb disrupt industries that have been virtually unchanged or unchallenged for decades. What other industries might be prime for disruption for companies that can effectively tap into real-time data for predictive analytics?

Ankur Goyal: Any company successfully using its data will move ahead in the digital economy. Fundamentally, data processing is about decision making. Predictive analytics changes the way decisions are made and enables extreme personalization. Every company benefits from such higher precision, which results in much higher overall efficiency.

Beyond transportation and hospitality, represented by the likes of Uber and Airbnb, we will likely see similar shakeups in industries such as finance, technology, manufacturing, retail, energy, insurance, and more. As the technology becomes more available, classic enterprises are also gearing up to run predictive analytics and adapt to compete with new companies.

insideBIGDATA: For companies to be able to shift direction as fast as the trends do, how relevant do you think historical data will be? How far back should companies look, or should they only look across real-time data going forward?

Ankur Goyal: The answer is both. If you only look at historical data, as with traditional data warehouses, you have no ability to see what is happening in the present. If you only look at the present, as with some stream-processing engines, you lose the context of history.

By analyzing both historical and real-time data, companies can build the most comprehensive, accurate models possible, and get closest to the world of predictive analytics.

Real-time data is meaningless without historical context. The whole point of analytics is to look across data, but without the ability to flexibly and dynamically look at data across time, whether a few minutes, hours, or even years, the analysis is very limited. Fundamentally, having historical data means having the flexibility to run queries with more context and more accuracy. The more historical data you can take into account while making a prediction on real-time data, the more accurate your prediction will be.

