Book Review: Doing Data Science


O’Reilly Media does it right. Their PR department gives valuable support to the grassroots efforts in the data science community by helping out local Meetup groups. A case in point is how they provided a number of current titles to be given away as raffle prizes for the Los Angeles R User Group, of which I am a member. So I thought I’d review a new O’Reilly book that just came out in October 2013, Doing Data Science: Straight Talk from the Frontline by Rachel Schutt & Cathy O’Neil.

I found this book to be a very odd bird indeed. It is one book you can read from back cover to front cover and not be at a disadvantage. This is because the book is really just a collection of presentations made by various people to Introduction to Data Science, a class taught by the primary author, Rachel Schutt, at Columbia University in the fall of 2012. It wasn’t entirely clear what content Schutt was directly responsible for, since only some of the chapters indicate who the contributors were (one of the chapters was contributed by a group of her students!). I’ve encountered the co-author, Cathy O’Neil, before as an outspoken blogger going by the name “mathbabe,” but it wasn’t specifically stated how she became part of the book project, other than that she was one of the students in Schutt’s class. Chapter 6 was partly written by O’Neil.

Both Schutt and O’Neil hold Ph.D.s in fields appropriate to data science, but the book was not “written” by the two. Rather, they seem to have performed some kind of editing function on the materials submitted by each contributor and added commentaries of their own. As a result, the book is a hodgepodge of anecdotes, factoids, R code snippets, plots, and mathematics, all from the in-class presentations. I enjoy seeing math in data science books, but the equations in this book were sort of just floating there, requiring the reader to explore further at another time.

Although I have issues with the book, as it is not any sort of text for the field, I did enjoy reading it, with a number of “Ah, I didn’t know that!” moments. Schutt’s credentials in data science are considerable, having worked at Google for a few years around the same time that “data science” was growing up in Silicon Valley. As a result, the book has many memorable anecdotes about the early days of the data science industry, and observations about what makes big data tick. I enjoyed the story about the Google software engineer who accidentally deleted 10 petabytes of data, and I think my favorite quote from the book is from the students’ Chapter 15:

Kaggle competitions could be described as the dick-measuring contests of data science.

With contributors’ chapters on statistical inference, machine learning algorithms, logistic regression, financial modeling, recommendation engines, data visualization, Hadoop, MapReduce, and more, I’d say the book is worth a read, not necessarily as a source for learning data science but more as a high-level guide and short historical account of this young industry. You get to learn about the people, companies, and technologies that have collectively built the data science arena, and you’ll be better for it, especially if you are working to become a data scientist yourself.

