One of the benefits of going around town attending various Big Data events is meeting the most fascinating people in this industry. During the last Los Angeles Hadoop User Group meetup, I caught up with Eli Karplus, Senior Director, Predictive Analytics & Optimization for Shopzilla, the price comparison site that enables shoppers to find, compare and buy anything, sold by virtually anyone, anywhere. Needless to say, Shopzilla collects a lot of data and has embraced the big data paradigm in a significant way. Eli agreed to the following interview, which offers considerable insight into how leading-edge Shopzilla's use of technology has become:
insideBIGDATA: How is Shopzilla using big data technology and what role do you play in making that happen?
Eli Karplus: At our core, we are fundamentally a big data company. We always have been. More than 40 million in-market shoppers funnel through our e-commerce sites every month. We connect those consumers with 5,000+ retailers who are selling 200 million+ products. We process over 10,000 searches per second and have one of the world’s largest SEM (search engine marketing) systems with billions of keywords. It’s a lot to keep track of.
The adoption of Hadoop into our ecosystem over the past couple of years has ushered in a new era for us, enabling us to store so much more and perform so many more computationally intensive operations.
One of my roles is to frame business challenges as data and math problems that we can solve with data science, then productize and bring to market.
insideBIGDATA: How does Shopzilla use the Hadoop technology stack?
Eli Karplus: We have 2 production clusters, with a combined total of 191 data nodes and total capacity of about 400TB (and immediate plans to add to it). Though our clusters are centrally managed, the users and the applications are distributed throughout the various verticals of the company. Tools used across the different teams include: Java MapReduce, Pig, Hive, Mahout, R, and Python.
Nearly 5,000 scheduled jobs run daily.
We leverage the platform for core operations like processing of logs, sessionization, and general enrichment. We mine and analyze our logs for actionable insights and optimization opportunities.
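Editor's illustration: the sessionization step mentioned above can be sketched in a few lines of Python. This is only a minimal single-machine sketch, not Shopzilla's actual pipeline. The event shape (`user_id`, timestamp), the function name `sessionize`, and the 30-minute inactivity timeout are all assumptions made for the example; at production scale the same grouping would typically be expressed as a MapReduce, Pig, or Hive job.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity gap that closes a session; not a figure from the interview.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp) events into per-user sessions.

    Returns a dict mapping each user_id to a list of sessions, where each
    session is a time-ordered list of timestamps.
    """
    # Collect every timestamp seen for each user.
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # A gap longer than the timeout starts a new session.
            if ts - user_sessions[-1][-1] > timeout:
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


if __name__ == "__main__":
    # Hypothetical log events, invented for the example.
    log_events = [
        ("a", datetime(2013, 1, 1, 10, 0)),
        ("a", datetime(2013, 1, 1, 10, 10)),
        ("a", datetime(2013, 1, 1, 11, 0)),
        ("b", datetime(2013, 1, 1, 9, 0)),
    ]
    for user, user_sessions in sessionize(log_events).items():
        print(user, len(user_sessions), "session(s)")
```

Sorting each user's timestamps and splitting on gaps keeps the logic easy to reason about; in a distributed setting the per-user grouping would correspond to the shuffle key and the gap check to the reducer.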
It is on the Hadoop platform that we’ve been able to make great strides in deriving insights into relationships between entities (like products, merchants, or keywords), through classification and clustering. And it is the scalability of Hadoop that empowers us to run computationally intensive scoring algorithms at high frequency across our product inventory or our expansive SEM and SEO keyword repositories.
insideBIGDATA: Can you give us a profile of your data science group – number of members, areas of expertise, etc.?
Eli Karplus: There is no single profile. The contributors are as different as the initiatives they work on, and each brings a unique and valuable skill set. The small set of folks here called “data scientists” have expertise in techniques like regression, clustering, segmentation, classification, and neural networks. When these techniques are brought together with the other essential ingredients (effective data stewardship, powerful and reliable infrastructure, and well-written software), that’s when the magic happens.
insideBIGDATA: Are there any exciting plans in the future for Shopzilla using data science, big data and machine learning?
Eli Karplus: Absolutely. In addition to the ongoing investments we continue to make in the science of optimizing the user shopping experience, there are a couple of new, data-science-heavy initiatives that I’m particularly excited about in the short term:
In collaboration with our Aisle A division, we’re augmenting our already-massive shopping intent databases with even more data sources and new machine learning methods, to build audience profiles (known as “Aisles” in our business) for brands and retailers. Powered by big data and rich analytics, we can offer our advertising partners the ability to reach and engage the right set of consumers at the ideal time with the right message.
In our Bizrate Insights program, we’re developing our next generation of analytics products for retailers, all under our value-pricing model. By analyzing billions of visits, abandons, and transactions, we’ll deliver customized predictive models that not only provide retailers with actionable recommendations for increasing user engagement and sales, but also allow them to benchmark their performance against relevant groups.
This is an exciting time to be a data scientist. Especially in the retail and advertising space. The data are abundant. And so are the opportunities.
Daniel – Managing Editor, insideBIGDATA