# Book Review: Statistical Learning with Sparsity – The Lasso and Generalizations

As a data scientist, I have a handful of books that serve as important resources for my work in the field – “Statistical Learning with Sparsity – The Lasso and Generalizations” by Trevor Hastie, Robert Tibshirani, and Martin Wainwright is one of them. This book earned a prominent position on my desk for a number of reasons. First, the authors are all luminaries in the field of machine learning. Two other books by these experts also sit on my desk – “The Elements of Statistical Learning” (the machine learning bible, also known as ESL) and its R-specific cousin “An Introduction to Statistical Learning” (ISL). I’ll take a close look at any materials from Stanford professors Hastie and Tibshirani; their work is that valuable to my profession. All three books are available for free in PDF form here:

Statistical Learning with Sparsity – The Lasso and Generalizations

The Elements of Statistical Learning – Data Mining, Inference, and Prediction – Second Edition

An Introduction to Statistical Learning with Applications in R

First, a little background on why Statistical Learning with Sparsity (SLS) is important to data scientists. A number of model-fitting procedures can yield better prediction accuracy and model interpretability than ordinary least squares. One such method is known as shrinkage, because the estimated coefficients are shrunken toward zero relative to the least squares estimates. Shrinkage is also known as regularization and has the effect of reducing variance. Depending on the type of shrinkage performed, some of the coefficients may be estimated to be exactly zero, which means shrinkage methods can also perform variable selection.

Ridge regression is the most common shrinkage method. It adds a penalty term to the least squares objective that discourages large coefficient estimates, with the amount of shrinkage controlled by a tuning parameter. Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. Ridge regression has one obvious disadvantage – it will include all predictors in the final model. The shrinkage term (also known as the penalty) will shrink all of the coefficients toward zero, but it will not set any of them to exactly zero. This can present a challenge in situations where the number of predictors is large. The lasso is a relatively new alternative to ridge regression that overcomes this disadvantage.
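The ridge behavior described above is easy to see in a few lines of code. The book itself contains no code, so the sketch below is my own illustration using scikit-learn; the synthetic data and the `alpha` value (scikit-learn’s name for the tuning parameter) are arbitrary choices for demonstration.

```python
# Illustrative sketch (not from the book): ridge shrinks coefficients
# toward zero relative to least squares, but sets none of them to zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two of ten predictors carry signal; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge coefficients have a smaller norm than the least squares ones...
print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
# ...but none is exactly zero, so all predictors remain in the model.
print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))
```

Increasing `alpha` shrinks the coefficients further, yet even the pure-noise predictors never drop out of the model entirely – exactly the disadvantage the lasso addresses.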

As with ridge regression, the lasso shrinks the coefficient estimates toward zero using the lasso penalty. However, in the case of the lasso some of the coefficient estimates are forced to be exactly zero when the tuning parameter is sufficiently large. This means that the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than those produced with ridge regression. We say that the lasso yields sparse models, i.e. models that involve only a subset of the predictors.

One great thing about SLS is that author Tibshirani is the inventor of the lasso shrinkage method. Here is the seminal paper from 1996 that first introduced the lasso: “Regression Shrinkage and Selection via the Lasso.”

Sparsity is the central theme of the book. In general, a sparse statistical model is one in which only a relatively small number of predictors play an important role, and it is therefore much easier to estimate and interpret than a dense model. SLS presents methods that exploit sparsity to help recover the underlying signal in the data set. In summary, the advantages of sparsity are interpretability of the fitted model and computational convenience. A third advantage has recently emerged from some deep mathematical analyses in this field of research. This has become known as the “bet on sparsity” principle:

“Use a procedure that does well in sparse problems, since no procedure does well in dense problems.”

The area of sparse statistical modelling is exciting for both data scientists and theorists, and it is very useful in practice. SLS will help you understand why this is so.

Here is a list of chapters in the text:

1 – Introduction

2 – The Lasso for Linear Models

3 – Generalized Linear Models

4 – Generalizations of the Lasso Penalty

5 – Optimization Methods

6 – Statistical Inference

7 – Matrix Decompositions, Approximations, and Completion

8 – Sparse Multivariate Methods

9 – Graphs and Model Selection

10 – Signal Approximation and Compressed Sensing

11 – Theoretical Results for the Lasso

SLS should be viewed primarily as an academic or theoretical resource, since it is composed mostly of mathematics (no code). However, if you’re a true data scientist, you’re going to want to fully absorb much of this book. It is a graduate-level text on machine learning, best suited to master’s and Ph.D. students in computer science and applied statistics. I would recommend SLS as an important resource for any data scientist, especially when used in conjunction with ESL and ISL mentioned above.

Contributed by Daniel D. Gutierrez, Managing Editor and Resident Data Scientist for insideBIGDATA. In addition to being a tech journalist, Daniel is also a practicing data scientist, author, and educator, and sits on a number of advisory boards for various start-up companies.