I’m frequently asked about educational resources for people entering the data science and machine learning professions. There are plenty of good advanced books, including theoretical masterpieces such as Elements of Statistical Learning (free PDF) and Learning From Data, to name just two. Also very good and still somewhat advanced, but based on R, is Introduction to Statistical Learning (free PDF). My own book, Machine Learning and Data Science, on the other hand, is expressly for beginners. That leaves a gap, and a market, for a good intermediate learning resource. The book “R for Everyone: Advanced Analytics and Graphics” by Jared P. Lander covers that intermediate ground very well.
This book is basically two books in one. The first 13 chapters cover the basics of the R language. They are quite good, and if you are new to R you will find them extremely useful.
The remaining chapters cover using R for statistical learning techniques. As with most other books on the subject, there is little effort to teach statistics and probability theory, although Chapters 14 and 15 skim the surface. This is the challenge with writing books on machine learning: for readers to fully understand the subject, the author needs to provide an enormous amount of background material, including mathematics and statistics. But doing so instantly cuts the audience for the book, because many readers won’t be prepared for it.
One very positive aspect of the book, and one that qualifies it as an intermediate text, is its use of the ggplot2 data visualization package. ggplot2 generates far better graphics than the base R graphics functions. The book doesn’t offer much explanation of why you are doing what you are doing, but I understand why: ggplot2 is a big topic.
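To give a sense of the difference, here is a minimal sketch that draws the same scatterplot with base R graphics and with ggplot2. It is my own illustration, not an example from the book; the choice of the built-in mtcars data set, the variables plotted, and the per-group trend lines are all assumptions made for the comparison.

```r
# Assumes the ggplot2 package is installed: install.packages("ggplot2")
library(ggplot2)

# Base R graphics: weight versus fuel economy from the built-in mtcars data
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Base R scatterplot")

# ggplot2: the same data, with color mapped to the number of cylinders
# and a straight-line fit added for each cylinder group
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)
```

In ggplot2 the grouping, legend, and fitted lines each come from one added layer or mapping, whereas the base version would need separate calls and manual bookkeeping to get the same result.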
The balance of the book covers material that I classify as machine learning: linear models, generalized linear models (e.g. logistic regression), model evaluation (cross-validation, the bootstrap, stepwise feature selection), shrinkage methods (e.g. regularization), non-linear models (e.g. decision trees and random forests), and time series. Chapter 22 covers unsupervised techniques including K-means and hierarchical clustering. I think this breadth of coverage will help anyone transitioning into the field and would serve as a good template for learning. That being said, a serious student will want to draw on outside resources along the way, such as the R-bloggers digest, The R Journal, and Stack Overflow.
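For readers who have not yet seen these techniques in R, a few of the topics listed above can be sketched using only functions that ship with base R. This is a rough illustration under my own assumptions, not code from the book: the built-in mtcars and iris data sets, the model formulas, and the choice of three clusters are all picked purely for demonstration.

```r
# Linear model: fuel economy as a function of weight and horsepower
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Generalized linear model: logistic regression for transmission type
# (am is coded 0 = automatic, 1 = manual)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Standardize the four numeric iris measurements before clustering
iris_scaled <- scale(iris[, 1:4])

# K-means clustering; the seed is fixed because the starting centers are random
set.seed(42)
km <- kmeans(iris_scaled, centers = 3, nstart = 25)

# Hierarchical clustering on the same data, cut into three groups
hc <- hclust(dist(iris_scaled), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Inspect the fitted coefficients and compare clusters with the known species
print(coef(fit_lm))
print(coef(fit_glm))
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```

The model evaluation, shrinkage, tree-based, and time series topics generally rely on add-on packages, so they are left out of this sketch.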
Finally, I think the title “R for Everyone” is a poor representation of the role the book actually fills. It is a good intermediate resource for teaching machine learning, and one that I plan to recommend to my students after they graduate from my own book.