Most traditional applications of Spark involve massive data sets that already exist. A less common but nevertheless extremely useful use case is simulation, where massive amounts of data are generated from model parameters. In this talk from Spark Summit East 2016, Prasad Chalasani explores some of the challenges that arise in setting up scalable simulations for a specific application, and shares solutions and lessons learned along the way in both mathematics and programming.
The application scenario he explores is quantifying the impact of cookie contamination in randomized experiments that measure the lift (effectiveness) of digital advertising. Cookies are randomly assigned to test or control; those in test are exposed to ads, while those in control are not. The goal is to measure the lift in conversion rate due to ad exposure. One important factor that taints such measurements is cookie contamination: a real-world user may have multiple cookies, but the system is unaware of this linkage. If a user's cookies fall in both the test and control groups, the cookie in control may show a higher conversion rate than a clean control cookie that has no “siblings” in the test group. Quantifying the impact of this contamination analytically is difficult without overly simplistic assumptions, so the approach pursued here is to simulate it, with millions of trials over tens of millions of users. The goals are (a) to understand and quantify the effect of cookie distribution and contamination on the expected value of the computed lift and on its 90% confidence interval, and (b) to derive approximate analytical formulas for the observed lift. Scaling the simulations up to a large number of trials and users is challenging; the talk shares some of the solutions and describes the analysis of error and expectation.
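To make the contamination effect concrete, here is a minimal single-machine sketch in Python. It is not the speaker's model or code. It assumes a toy setup in which every user has the same number of cookies, each cookie is assigned to test independently, a user is exposed if any of their cookies is in test, and each cookie converts independently at a rate that depends only on exposure. The names `simulate_observed_lift` and `lift_summary` and all parameters are hypothetical.

```python
import random


def simulate_observed_lift(n_users, cookies_per_user, base_rate, true_lift,
                           p_test=0.5, seed=0):
    """Run one trial of the toy model and return the cookie-level observed lift.

    Assumes the sample is large enough that both groups see conversions.
    """
    rng = random.Random(seed)
    test_n = test_conv = ctrl_n = ctrl_conv = 0
    for _ in range(n_users):
        # Assign each of the user's cookies to test or control independently.
        in_test = [rng.random() < p_test for _ in range(cookies_per_user)]
        # Toy assumption: the user sees ads if any of their cookies is in test.
        exposed = any(in_test)
        rate = base_rate * (1 + true_lift) if exposed else base_rate
        for is_test in in_test:
            converted = rng.random() < rate
            if is_test:
                test_n += 1
                test_conv += converted
            else:
                # A control cookie with a test "sibling" converts at the lifted rate.
                ctrl_n += 1
                ctrl_conv += converted
    return (test_conv / test_n) / (ctrl_conv / ctrl_n) - 1


def lift_summary(n_trials, **params):
    """Repeat the trial with different seeds; return mean and a rough 90% interval."""
    lifts = sorted(simulate_observed_lift(seed=s, **params) for s in range(n_trials))
    mean = sum(lifts) / n_trials
    low = lifts[int(0.05 * (n_trials - 1))]
    high = lifts[int(0.95 * (n_trials - 1))]
    return mean, low, high


clean = simulate_observed_lift(200_000, 1, base_rate=0.1, true_lift=0.5)
contaminated = simulate_observed_lift(200_000, 2, base_rate=0.1, true_lift=0.5)
print(f"one cookie per user:  observed lift = {clean:.3f}")
print(f"two cookies per user: observed lift = {contaminated:.3f}")
```

Under these toy assumptions, with two cookies per user and a 50/50 split, half of the control cookies have a sibling in test, so the expected observed lift is roughly (1 + L) / (1 + L/2) - 1, which is 0.2 for a true lift L = 0.5. The simulated value should land near that figure rather than near 0.5. Because trials are independent of one another, a loop like the one in `lift_summary` is the part that lends itself to being distributed across a cluster.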
Prasad Chalasani is currently the SVP of Data Science at MediaMath, leading the development of innovative, proprietary, scalable algorithms and analytics that leverage massive amounts of data to power smarter digital marketing for the world’s leading advertisers. Prior to joining MediaMath, Prasad led Data Science at Yahoo Research, and before that worked for 10 years as a quantitative researcher and portfolio manager of statistical trading strategies at hedge funds and at Goldman Sachs. Prasad holds a PhD in Computer Science from CMU and a BTech in Computer Science from IIT.