One of my favorite taglines is “the best-run businesses run SAP®.” It is clean. It creates a sense of elitism by association. And upon hearing this tagline, those using SAP for planning think, “I am an SAP shop; my business is efficient.” Those not using SAP think, “my company doesn’t use SAP, and we have a number of deficiencies.” (Guess what? All companies have problems, but here those problems are attributed to not using SAP. Smart.) So what does this have to do with analytics and the use of R or Python?
For the past four years I was an evangelist for advanced analytics algorithms at a large analytics software firm. My job was to show how advanced algorithms can better match customer preferences with products, decrease waste in production, and reduce customer churn, to name a few applications. Over time I realized that to maximize the traction of my “preaching,” I needed to spend more time on the algorithm and less time on the syntax. How did I come to this conclusion? I observed that a syntax message didn’t resonate; a knowledge message did. And with this realization I spent less time on the syntax. Less time on the software. In fact, I found that the majority of my audiences were avid users of R and Python. Over the past ten years there has been a tectonic shift in the tools used by budding data scientists. They have preferred modern languages, namely R and Python, to older commercial software products from the traditional vendors. With my history of interaction at more than 400 companies, I feel confident making this generalized statement: the best data scientists use R and Python.
Now, there will be exceptions to this claim, and certainly some of the best scientists I know use closed-source products, but the time has come to acknowledge this statement: the best data scientists, and the overwhelming majority of new data scientists, choose R and Python. This is borne out in research done by Robert Muenchen, who aggregates these trends in his article “The Popularity of Data Analysis Software.” So, for a moment, let’s assume this is true. That data scientists want to use R and Python. Then why do companies continue to force the use of proprietary software? Good question. Here is a plausible story. Proprietary software historically came with an implicit guarantee that the computation is correct. By paying for software, you were paying for accuracy. Not for better answers, but for the assurance that every time 2 plus 2 was added, it equaled 4. This is the software model that has allowed older proprietary languages to persist. So what does this mean going forward?
In this new era, an era of openness and customization, developers will begin to create layers of abstraction between the computation and the language. In essence, a “Google Translate” for statistical computing. R, Python, and other “glue languages” can enable this translation through a series of APIs that allow data scientists to write R and Python code and have the computation done in an external engine, bypassing the native frames or data sets available within the platform. Data scientists are now able to write in their native language. For the company, the predictive model output is the same regardless of whether it is written in R or Python. As a veteran of statistical computing, I know this is a huge concern with using open-source packages in a production environment. For models to be used in production, they must be validated and supported, and the software must be robust to all the problems inherent in real-world data. This is the beauty of the abstraction layer: it provides uniformity of output and a path to production.
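The translation-layer idea can be sketched in a few lines of Python. All names here (`engine_fit_ols`, `fit_from_r`, `fit_from_python`) are hypothetical illustrations, not any real product’s API: two language-style front ends hand their model specification to one shared engine, so the fitted result is identical regardless of which syntax the data scientist prefers.

```python
# Hypothetical sketch: two language-style front ends dispatch to one
# shared computation engine, so output is uniform across languages.

def engine_fit_ols(x, y):
    """Shared engine: closed-form least squares for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return {"intercept": my - b * mx, "slope": b}

def fit_from_r(formula, data):
    """Front end mimicking R's lm("y ~ x", data) call style."""
    lhs, rhs = [s.strip() for s in formula.split("~")]
    return engine_fit_ols(data[rhs], data[lhs])

def fit_from_python(X, y):
    """Front end mimicking a scikit-learn-style fit(X, y) call."""
    return engine_fit_ols(X, y)

data = {"x": [1, 2, 3, 4], "y": [3, 5, 7, 9]}
r_style = fit_from_r("y ~ x", data)
py_style = fit_from_python(data["x"], data["y"])
assert r_style == py_style  # same engine, same answer
```

Real implementations replace the toy engine with calls to a distributed back end over a REST or native API, but the principle is the same: the front-end language is just syntax, and validation happens once, in the engine.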
In short, it lets the best companies hire the best data scientists.
Contributed by: Ken Sanford, an Analytics Architect and Evangelist at H2O.ai. Ken is a reformed academic economist who likes to empower customers to solve problems with data. Ken’s primary passion is teaching and explaining; he likes to simplify and tell stories. Ken has spent time in academia (Middle Tennessee State University, University of Cincinnati, Peace College), consulting (Deloitte), and software development (SAS). He has a Ph.D. in Economics from the University of Kentucky in Lexington, and his work on price optimization has been published in peer-reviewed journals.