I’ve been a big fan of MIT mathematics professor Dr. Gilbert Strang for many years. A few years ago I reviewed the latest 5th edition of his venerable text on linear algebra. Then last year I learned how he morphed his delightful mathematics book into a brand new title (2019) designed for data scientists – “Linear Algebra and Learning from Data.” I was intrigued, so after I received my review copy I did a deep dive without hesitation. I appreciate Strang’s approach to learning mathematics. His books are completely no-nonsense and no fluff. He gets down to business and the result is something mathematics lovers enjoy. This book resembles a collection of well-crafted, albeit dense, course notes for an honors section of a class designed for advanced undergraduate/graduate level learners. Moreover, the breadth of material covered is impressive. I enjoyed the “Learning from Data” text immensely and this book review should serve to encourage all data scientists who may be lacking in the mathematical foundations of data science to make an investment in this great learning resource.

As I tell my own data science students about the importance of learning the mathematical underpinnings of machine learning: while leading-edge machine learning applications can change in the short term (e.g. CNNs, RNNs, reinforcement learning, NLP, LSTM, transfer learning, etc.), the underlying mathematical concepts do not. This book will help you establish a foundation.

The text is divided into seven main parts, each with a special focus designed to provide the mathematical architecture important for data scientists to gain a firm understanding of machine learning.

- Highlights of Linear Algebra
- Computations with Large Matrices
- Low Rank and Compressed Sensing
- Special Matrices
- Probability and Statistics
- Optimization
- Learning from Data

Part I highlights the fundamental elements of linear algebra, including such important topics for machine learning as matrix multiplication, eigenvalues and eigenvectors, singular value decomposition (SVD), principal components, and many other topics needed for understanding what drives machine learning. Bear in mind that this content is an overview of linear algebra, and it is advised that the reader already have exposure to the material (e.g. vector spaces, subspaces, independence, bases, and linear operators).

Part II discusses computation with large matrices, with a focus on matrix factorization and iterative methods, along with insights into recent powerful algorithms that use randomization and projection to approximate the solutions of matrix problems.

Part III discusses a variety of low-rank and sparse approximation techniques, ending with methods such as LASSO and matrix completion algorithms.

Part IV drills down into the topic of special matrices and constitutes a compilation of specially structured matrices that have applications in a variety of data and signal analysis areas. The matrices range from classical discrete Fourier transforms to graph node-adjacency matrices used for clustering. This section contains a tiny bit of MATLAB code on Page 249 that’s used in one example. I appreciated the section on Page 253, “Applications of Clustering” after a discussion of k-means.

Part V provides a good dose of probability and statistics, something I advise all my introductory data science students to absorb. The chapters include useful topics like: mean and variance, probability distributions, covariance matrices, multivariate Gaussian and weighted least squares, plus a discussion of Markov Chains. Also featured is the *Central Limit Theorem*, something all data scientists need to know. Data scientists can limp along without knowledge of probability and statistics, but eventually you’ll hit the wall, and need to sit down and learn these fundamentals. This book’s coverage is pretty much all you’ll need.

One great reason to invest in this book is Part VI on optimization. Optimization techniques are the lifeblood of many machine learning algorithms. This part of the book examines several workhorse algorithms based on optimization, including linear programming, gradient descent, and, most relevant to machine learning, stochastic gradient descent. The chapter even has a definition of the ubiquitous “argmin” expression used in most machine learning theory texts. The problem is that these books never define what argmin is! (Hint: the argmin of a function F is the value, or set of values, of the argument at which F reaches its minimum.) Kudos to Strang for completeness.
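To make the argmin hint concrete, here is a minimal Python sketch. It is not taken from the book: the function `F`, the grid, the step size, and the iteration count are all hypothetical choices made for illustration. It contrasts the minimum value of a function with the argument that attains it, then shows plain gradient descent arriving at the same argument.

```python
import numpy as np

def F(x):
    # Hypothetical example function: a parabola whose minimum value is 1, attained at x = 3
    return (x - 3.0) ** 2 + 1.0

def grad_F(x):
    # Derivative of F, worked out by hand for this example
    return 2.0 * (x - 3.0)

# Evaluate F on a grid of candidate arguments
xs = np.linspace(-10.0, 10.0, 2001)
values = F(xs)

# min F is the smallest value of F; argmin F is the argument where that value occurs
i = np.argmin(values)
min_value = values[i]
argmin_x = xs[i]

# Gradient descent searches for the same argmin by repeatedly stepping downhill
x = 0.0      # arbitrary starting point (assumption)
step = 0.1   # fixed step size (assumption)
for _ in range(200):
    x = x - step * grad_F(x)
gd_x = x

print("min F    =", min_value)
print("argmin F =", argmin_x)
print("gradient descent estimate of argmin F =", gd_x)
```

Note that `np.argmin` returns an index into the grid, so the argmin itself is `xs[i]`, not `i`. Stochastic gradient descent follows the same update rule but replaces the exact gradient with a noisy estimate computed from a sample of the data.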

The crescendo of the book is reached in Part VII “Learning from Data” which contains all the meat for data scientists interested in truly taking command of what underlies machine learning algorithms. Here, Strang overviews the mathematics of machine learning including deep neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), the backprop algorithm, bias-variance tradeoff, use of hyperparameters, and how the chain rule in Calculus is king.

**Caveats**

Every book has its caveats, so here’s my slant with respect to the Strang text. The book is not for novices in terms of mathematics or data science. It is best used once you have a number of years of independent work with the mathematics behind machine learning under your belt. The book doesn’t make an attempt to tie specific mathematics topics to parallel topics in data science. For example, the book covers Singular Value Decomposition (SVD) but doesn’t make a tight connection to Principal Component Analysis (PCA) and dimensionality reduction. But if you’re far enough along in your study of the subject, the connections should be a natural progression from reading the book.
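For readers who want to see the SVD-to-PCA connection spelled out, the following is a minimal sketch and not the book's presentation. It assumes NumPy and a small synthetic data set (the mixing matrix and sample size are made up for illustration). After centering the data, the right singular vectors are the principal directions, and the squared singular values divided by n − 1 are the variances along them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples, 3 correlated features
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
A = rng.normal(size=(200, 3)) @ mixing

# PCA works on centered data: subtract each column's mean
X = A - A.mean(axis=0)
n = X.shape[0]

# Route 1: SVD of the centered data matrix, X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
variance_from_svd = s ** 2 / (n - 1)

# Route 2: eigen-decomposition of the sample covariance matrix
C = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
eigvals = eigvals[::-1]                # reorder to descending, like singular values
eigvecs = eigvecs[:, ::-1]

# Dimensionality reduction: project onto the top two principal directions
scores = X @ Vt[:2].T                  # same as U[:, :2] * s[:2]

print("variances from SVD:       ", variance_from_svd)
print("covariance eigenvalues:   ", eigvals)
print("reduced data shape:       ", scores.shape)
```

The two routes agree up to the sign of each direction, which is arbitrary in both decompositions. In practice the SVD route is usually preferred because it avoids forming the covariance matrix explicitly.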

A wise and strategic data scientist will use this text as a road map for further study, using the various references throughout the text along with topic-specific courses aligned with the topics in the book. For example, Strang introduces the reader to Fourier transforms (see Part IV), undeniably important in various areas of data science, but the treatment is rather brief. It would be a good idea to drill down into the subject using additional learning resources, e.g. papers, books, videos, etc.

One last nit is that the book has no bibliography! I typically look forward to a valued textbook’s bibliography because it is where I can learn where to find additional information on topics discussed in the text. Strang does include citations embedded in various parts of the book, but these are harder to review all at once.

**Even More!**

As extensions to the book, and as additional learning resources, Strang offers his video lectures available on MIT OpenCourseWare for Math 18.06 and 18.065. The book also has a useful website that includes a number of PDF files containing extracted pages from the book including a review of the Central Limit Theorem.

Using textbooks on which I place a high value (like this one), I make sure to spend time working out the problem sets (or “psets” in MIT vernacular) at the end of each section. I find this extra effort enhances the learning experience. Fortunately, a manual for instructors that includes solutions to the problems can be found HERE.

The book is a fine addition to any data scientist’s library, and maintains a prime position on my desk. Further, I’ve added the book to the bibliography I’ve put together for my own students in data science. Highly recommended!

*Contributed by Daniel D. Gutierrez, Managing Editor and Resident Data Scientist for insideBIGDATA. In addition to being a tech journalist, Daniel is also a consultant in data science, an author, and an educator, and sits on a number of advisory boards for various start-up companies.*


Great review! I’m also a fan of Prof. Strang. I read through his first intro book and am now reading this one. I also wrote a blog post about it. Happy to have found another fan! https://anagileway.com/2020/06/04/prof-gilbert-strang-linear-algebra/

I cannot agree that the book does not connect linear algebra topics with their applications in data science. The claim that singular value decomposition is not tied to principal component analysis is simply false. Chapters I.8 and I.9 of Strang’s book are dedicated to this topic; the chapters are titled “Singular Values and Singular Vectors in the SVD” and “Principal Components and the Best Low Rank Matrix” respectively. Other points in the review are valid, especially the one about missing bibliography.

It would be interesting to compare Gilbert Strang’s 2019 book with Charu Aggarwal’s “Linear Algebra and Optimization for Machine Learning”.

Hello Andrey,

Thank you for your comments about my book review, much appreciated. In reference to your comment, I indicated that the Strang book didn’t make a “tight” connection between SVD and PCA as I’ve seen in other books. In fact, I really like the connections made in the Aggarwal text you mention. Chapter 7 of that book is great, and ties all the areas surrounding SVD to data science. Don’t get me wrong, I really like the Strang book; I just like the treatment given in other books better.

Daniel

Prof. Gilbert Strang has published an Indian edition of this book.

For more information, visit Wellesley Publishers (India) : http://www.wellesleypublishers.com

This book is based on the MIT course 18.065 at OpenCourseWare (ocw.mit.edu).

Book-specific website : math.mit.edu/learningfromdata