# Book Review: Tree-based Methods for Statistical Learning in R

Here’s a new title that is a “must have” for any data scientist who uses the R language. It’s a wonderful learning resource for tree-based techniques in statistical learning, and it has become my go-to text when I need to do a deep dive into tree-based ML topics for my work. The methods discussed are the cornerstone of making predictions from tabular data sets: decision trees, ensemble methods like random forests, and of course the industry’s darling, gradient boosting machines (GBMs). Algorithms like XGBoost are king of the hill for solving problems involving tabular data. A number of recent and fairly high-profile benchmarks show that this class of algorithm beats deep learning algorithms in many problem domains.

This book, “Tree-based Methods for Statistical Learning in R,” is by Brandon M. Greenwell, a data scientist with 84.51° where he works on a diverse team to enable, empower, and enculturate statistical and machine learning best practices where applicable to help others solve real business problems. Greenwell’s book covers important topics such as decision trees and tree-based ensembles, including random forests and gradient boosting machines. Chapter 7 on random forests and Chapter 8 on GBMs are brimming with information that provides a strong foundation for doing real-world machine learning (along with a moderate amount of math throughout), coupled with plenty of code examples.

The book is primarily aimed at researchers and practitioners who want to go beyond a fundamental understanding of tree-based methods. It could also serve as a useful supplementary text for a graduate-level course on statistical/machine learning. Some parts of the book necessarily involve more math and notation than others. For example, Chapter 3 on conditional inference trees involves a bit of linear algebra and matrix notation, but the math-oriented sections can often be skipped without sacrificing much understanding of the core concepts. The code examples also help drive the main concepts home by connecting the math to simple coding logic.

The book does assume some familiarity with the basics of machine learning, as well as the R programming language. Useful references and resources are provided in the introductory material in Chapter 1. While Greenwell tries to provide sufficient detail and background where possible, some topics receive only a cursory treatment. Whenever possible he makes an effort to point the more ambitious reader in the right direction in terms of useful references.

The author developed an R package, “treemisc,” expressly to support the examples in the book. It is available on CRAN and in a GitHub repo set up by the author. The R code from the book is also available. I found the code in the book to be straightforward and easy to understand. There are also plenty of insightful data visualizations. NOTE: this is not a Tidyverse book; the author opts instead for traditional R coding practices.

For background material, I thought Chapter 2 was superb in its coverage of classification and regression trees (CART), originally proposed by Leo Breiman and his co-authors in their seminal 1984 book on the subject. I found Chapters 7 and 8 to be the most useful. Chapter 7 does a great job of outlining and drilling down into random forests, while Chapter 8 does the same for GBMs. At the end of Chapter 8 you’ll find a brief discussion of the most popular boosting algorithms: XGBoost, LightGBM, and CatBoost. Section 8.9.4 has a very nice code example for using XGBoost. Chapter 5 on ensemble algorithms includes a useful treatment of bagging (bootstrap aggregating) and boosting. Finally, Chapter 6 covers ML interpretability, a hot topic in the industry right now.

## So Many Packages, So Little Time

Another area in which this book excels is making the reader aware of all the great tree-based R packages that are out there. I learned about a number of packages I had never come across. For example, Chapter 3 identifies implementations of CTree, one of the more important developments in recursive partitioning in the past two decades. I learned that it is only available in R (see the `party` and `partykit` packages), a good reason to have R programming in your data science arsenal.

Contributed by Daniel D. Gutierrez, Editor-in-Chief and Resident Data Scientist for insideBIGDATA. In addition to being a tech journalist, Daniel is also a data science consultant, author, and educator, and sits on a number of advisory boards for various start-up companies.