As I frequently travel in data science circles, I’m hearing more and more about a new kind of tech war: Python vs. R. I’ve lived through many tech wars in the past, e.g. Windows vs. Linux, iPhone vs. Android, etc., but this tech war seems to have a different flavor to it. What feels different in this case is that the application area is the same, namely performing work in data science where the solution often depends on the use of libraries that implement various machine learning algorithms. This being the case, the question is what language should you adopt as a data scientist?
While R has traditionally been the programming language of choice for data scientists, some believe it is ceding ground to Python. Here is a short list of some of the arguments I’ve heard of late, along with my personal assessment of each:
R is Too Complex
The most frequently stated argument I’ve heard is that Python is general purpose and comparatively easy to learn, whereas R remains a somewhat complex programming environment to master. I think this view is misguided, since complexity is in the eye of the beholder (i.e. the programmer). I will agree that R is certainly a very powerful data analysis and data modeling tool with a specific emphasis on machine learning. Moreover, much of the power of R is in its ecosystem. There are more than 5,000 packages that extend the open source statistical environment to new heights.
When I first learned R, I did not find it particularly complex; it was a lot easier for me to learn R than C++ or Java with their mammoth frameworks. Besides, the application of machine learning is much more “complex” than any programming language used to develop a given algorithm. Using Python won’t circumvent that fact.
R Isn’t Really a Language
Another argument says that part of the reason people struggle to learn R is that it’s not really a language. As R expert John Cook points out, R “is an interactive environment for doing statistics,” not really a programming language. He also suggests, “I find it more helpful to think of R as having a programming language than being a programming language.” This view may be somewhat accurate, but if R doesn’t look like a traditional programming language, it doesn’t mean it isn’t one. It is simply well-suited for its intended problem domain, namely statistical analysis and machine learning. Once its nuances are mastered, R developers tend to swear by it and use it as a primary tool for data science projects. Plus, R tends to reduce complexity for data scientists because it incorporates vectorized operations that are important to the linear algebra principles inherent in many machine learning algorithms.
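To make the point about vectorized operations concrete, here is a minimal sketch of the difference between looping over elements and writing one whole-vector expression. It is written in Python only for illustration; the Vec class and the two predict functions are hypothetical names invented for this example, not part of any library. In R, this elementwise arithmetic is built into the language itself, so no helper type of this kind is needed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vec:
    """Hypothetical minimal vector type, for illustration only."""
    values: List[float]

    def _check(self, other: "Vec") -> None:
        # Both operands must have the same number of elements.
        if len(self.values) != len(other.values):
            raise ValueError("vectors must have the same length")

    def __add__(self, other: "Vec") -> "Vec":
        # Elementwise addition of two equal-length vectors.
        self._check(other)
        return Vec([a + b for a, b in zip(self.values, other.values)])

    def __mul__(self, scalar: float) -> "Vec":
        # Multiply every element by the same scalar.
        return Vec([a * scalar for a in self.values])

    def dot(self, other: "Vec") -> float:
        # Inner product: the sum of the elementwise products.
        self._check(other)
        return sum(a * b for a, b in zip(self.values, other.values))


def predict_loop(weights: List[float], features: List[float], bias: float) -> float:
    # Explicit-loop version of a linear model: index bookkeeping is visible.
    total = bias
    for i in range(len(weights)):
        total += weights[i] * features[i]
    return total


def predict_vectorized(weights: Vec, features: Vec, bias: float) -> float:
    # Whole-vector version: reads like the formula w . x + b.
    return weights.dot(features) + bias
```

The vectorized form mirrors the linear algebra directly, which is the reduction in complexity described above: the data scientist writes the formula and leaves the element-by-element work to the environment.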
Python is More Approachable
Some feel that Python is more approachable. I’ve heard some say that since all sorts of developers are familiar with Python and use it for a wide array of applications, it is the better choice for data science, unlike R, which is pretty much only used for data analysis. I feel this is a silly argument. Wouldn’t you want to use a tool that is specifically suited to a particular task, rather than one that doesn’t include specific features for the intended problem domain? There’s nothing wrong with using a special purpose programming language to solve special purpose problems.
Remember, R is a very old statistical environment that has an incredible global following. The functions in the base stats package, as well as many packages found in CRAN, are in many cases based on very old implementations (some in Fortran) of classic algorithms (e.g. the Random Forest algorithm in R is based on the original Fortran code by Leo Breiman and Adele Cutler). It is good to know that the modeling language I’m using has a long and trusted history.
People in the Organization Already Know Python
I’ve heard some people express the view that as businesses struggle to get more value out of their data assets, they’re also struggling to find qualified data scientists. They say that more often than not such data scientists may already work internally and likely have some familiarity with Python. The feeling is that given the importance of asking the right questions of one’s data, training up homegrown talent on big data technologies is much more effective than training new-hire data scientists on the complexities of one’s business.
While it is a well-founded policy to hire within an organization to fill certain positions since the candidate may likely have valuable domain experience, I think it is a huge stretch of the imagination to think a talented data scientist is lurking somewhere in the organization just waiting to pick up the torch without losing stride. Huh? Is the data scientist just slumming for a while as a UI developer using Python? I don’t think this scenario is very likely. It is much more reasonable to see that data scientist as a new hire. The talent you’re seeking is based on computer science, machine learning, mathematical statistics and probability theory, hardly the skill set of a run-of-the-mill IT staffer.
A Single Language Across Applications
Yet another argument I’ve heard is that beyond tapping into a ready-made Python developer pool, one of the biggest benefits of doing data science in Python is the added efficiency of using one programming language across different applications. I really don’t see using two languages as a problem, because in my world there are both “theorist” data scientists and “experimental” data scientists. R is ideal for the theorist who does exploratory data analysis, data munging, modeling and algorithm building. Python, on the other hand, is a good tool for the experimentalist who implements algorithms for production use. I believe R and Python make excellent partners in building data science solutions. As a consultant I often work with internal developers who, many times using Python, implement the algorithms I have developed in R.
A Path for the Future
Python lacks much of R’s richness for data analysis, data modeling and machine learning, but it is making progress. At this point, data science is a very technical area and in my mind you can’t give up R’s depth in favor of Python’s approachability and general-purpose nature. As I mentioned above, the two languages can and do live together nicely. Data science will always be the realm of “scientists who deal with data,” and that will not change anytime soon considering the overly simple nature of the recent “machine learning as a service” product offerings. Practitioners still maintain a firm foundation in mathematics and statistics, which is beyond mortal business analysts and others.
So at least for now, I’d like to douse the flames of the Python vs. R tech war. I think there’s plenty of room for two good choices in the pursuit of robust data science.