Bootstrapping is a widely used statistical learning technique that falls under the broader category of resampling methods. Bootstrapping is typically used in the estimation of various statistics and can be used to quantify the uncertainty associated with a given estimator or machine learning algorithm. As a simple example, it can be used to estimate the standard errors of the coefficients from a linear regression fit.
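To make the regression example concrete, here is a minimal sketch using only the Python standard library. It assumes the "pairs" form of the bootstrap, in which whole (x, y) rows are resampled with replacement and the line is refit each time. The data are synthetic, and the helper names `fit_line` and `bootstrap_coef_se` are my own, not taken from any of the books or packages discussed here.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bootstrap_coef_se(xs, ys, n_boot=2000, seed=0):
    """Pairs bootstrap: resample (x, y) rows with replacement and refit."""
    rng = random.Random(seed)
    n = len(xs)
    intercepts, slopes = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is undefined
        b0, b1 = fit_line(bx, by)
        intercepts.append(b0)
        slopes.append(b1)
    # The spread of the refitted coefficients estimates their standard errors.
    return statistics.stdev(intercepts), statistics.stdev(slopes)


# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 3.0 * x + data_rng.gauss(0, 1) for x in xs]

intercept, slope = fit_line(xs, ys)
se_intercept, se_slope = bootstrap_coef_se(xs, ys)
print(f"slope = {slope:.3f} (bootstrap SE {se_slope:.3f})")
print(f"intercept = {intercept:.3f} (bootstrap SE {se_intercept:.3f})")
```

Resampling rows is only one option. Resampling the residuals of the fitted model is another common choice, and which one is appropriate depends on what you are willing to assume about the data.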
The bootstrap approach lets us use a computer to emulate the process of obtaining new sample sets, so that we can estimate variability without collecting additional samples. Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations, with replacement, from the original data set. This is a very powerful realization. The name “bootstrapping” comes from the phrase “to lift yourself up by your bootstraps.” There is extensive mathematical theory that justifies bootstrapping techniques, yet using them can still feel like performing magic. It does not seem as though reusing the same sample over and over could tell you anything new about a population statistic, but bootstrapping can in fact give a sound estimate of how much that statistic would vary from sample to sample.
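The resampling idea fits in a few lines. The sketch below is my own illustration, using only the Python standard library and synthetic data. It estimates the standard error of a sample median, a statistic with no simple textbook formula, and also forms a rough 95% percentile interval. The function name `bootstrap_replicates` is hypothetical.

```python
import random
import statistics


def bootstrap_replicates(sample, statistic, n_boot=5000, seed=0):
    """Recompute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # rng.choices samples with replacement, so each resample has the same
    # size as the original but repeats some observations and omits others.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]


# Synthetic sample (an assumption for illustration)
data_rng = random.Random(1)
sample = [data_rng.gauss(50, 10) for _ in range(200)]

reps = bootstrap_replicates(sample, statistics.median)

# The standard deviation of the replicates estimates the standard error.
se_median = statistics.stdev(reps)

# Percentile interval: the 2.5% and 97.5% cut points of the replicates.
cuts = statistics.quantiles(reps, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"median = {statistics.median(sample):.2f}, bootstrap SE = {se_median:.2f}")
print(f"95% percentile interval: ({lower:.2f}, {upper:.2f})")
```

The percentile interval is the simplest of several bootstrap interval constructions; the books reviewed below cover refinements that behave better in small or skewed samples.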
Bootstrap techniques are relatively new to the field of machine learning. The first use was published in a 1979 paper by Bradley Efron, “Bootstrap Methods: Another Look at the Jackknife.” As computing power has increased and become less expensive, bootstrap techniques have become more widespread in their use.
For readers using the R platform, there are two packages available for performing the bootstrap technique: boot, which is based on the functions and data sets discussed in the book “Bootstrap Methods and their Application,” and bootstrap, which is based on the functions (bootstrap, cross-validation, jackknife) and data sets for the book “An Introduction to the Bootstrap.”
Given the importance of bootstrap methods to contemporary machine learning, I’d like to review several prominent books on the subject. Some of the titles are relatively new, while others can be considered “classics.”
An Introduction to the Bootstrap by Bradley Efron and Robert J. Tibshirani, Chapman and Hall/CRC; Softcover reprint of the original 1st ed. 1993 edition (January 1, 1993)
This is the seminal text on the subject because one of the authors, Efron, is the creator of the technique. His famous 1979 paper in the Annals of Statistics introduced the bootstrap and put it in its proper place among other resampling techniques. That work was a breakthrough that has since led to hundreds of other publications and several books on the bootstrap and more general resampling procedures, by Efron himself, his students and many other statisticians.
This book is a concise presentation of the bootstrap and its wide variety of applications and is very much up to the state-of-the-art in this rapidly growing area of statistics. It is written in an intuitive fashion and avoids much of the mathematics needed to prove formally that the bootstrap does what it is intended to do. If you’re like me, and like to travel back to original works for important milestones in the field, then this book is for you. The book is pricey, but I once found it at a used technical book seller for half the price.
Bootstrap Methods and their Application by A. C. Davison and D. V. Hinkley, Cambridge University Press; 1 edition (October 28, 1997)
This book has an excellent balance between practice and theory. It presents the bootstrap as a powerful tool through examination of practical issues. I would recommend this book for everyone interested in improving their use of statistical learning techniques. Applications covered in the book include stratified data, finite populations, censored and missing data, linear, nonlinear, and smooth regression models, classification, time series and spatial problems. The book includes a disk of S-Plus programs for implementing the methods described in the text. I did find the authors’ mathematical notation a bit confusing and too dense. If you’re selecting a book for a course, Davison and Hinkley is a good choice because of the many exercises.
The Bootstrap and Edgeworth Expansion by P. Hall, Springer Series in Statistics (1992)
If you are interested in the theory and formal mathematics of this area of statistics, this text may be for you. The material is advanced and rigorous. First, it lays the foundation for a particular view of the bootstrap. Second, it gives an account of Edgeworth expansion. It is well written but requires a good mathematical background, and knowledge of advanced probability theory would be helpful. It is not easy reading even for doctoral students and postdoctoral researchers, but it is certainly worth the effort. I tend to like more theoretical treatments of topics I can use directly in my machine learning work, so this text is a useful addition to my library.
The Weighted Bootstrap (Lecture Notes in Statistics) by Philippe Barbe and Patrice Bertail, Springer; Softcover reprint of the original 1st ed. 1995 edition (October 4, 2013)
This graduate level monograph could be used as part of an advanced topics course on the bootstrap. Its goal is to answer two questions:
- How well does the generalized bootstrap work?
- What are the differences between all the different weighted schemes?
The authors tried to make the proofs as detailed as possible, and some proofs have been relegated to appendices, which I believe helps keep the presentation focused on the primary material.
Chapter 1 investigates the weighted bootstrap of statistical functions and looks for some general regularity conditions under which the generalized bootstrap may be used. Chapter 2 gives some information concerning the practical choice of the weights and the difference between all these random weighted methods in the regular cases investigated in Chapter 1. Chapter 3 looks at some non-regular cases which require a drastic modification of the bootstrap. Chapters 4-6 contain proofs.
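For readers who have not met the idea, a weighted bootstrap replaces resampling with random reweighting: every observation stays in the data set, but each replicate draws a fresh set of random weights and recomputes a weighted version of the statistic. The sketch below is my own illustration, not code from the monograph. It assumes one particular scheme, i.i.d. exponential weights normalized to sum to one (the weights of Rubin's Bayesian bootstrap), applied to the sample mean of synthetic data.

```python
import math
import random
import statistics


def weighted_mean(values, weights):
    """Mean of `values` under non-negative weights (normalized here)."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def weighted_bootstrap_means(sample, n_boot=5000, seed=0):
    """Replicates of the mean under random exponential weights."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        # One Exp(1) weight per observation; after normalization the
        # weight vector follows a flat Dirichlet distribution.
        weights = [rng.expovariate(1.0) for _ in sample]
        reps.append(weighted_mean(sample, weights))
    return reps


# Synthetic sample (an assumption for illustration)
data_rng = random.Random(7)
sample = [data_rng.gauss(50, 10) for _ in range(200)]

reps = weighted_bootstrap_means(sample)
se_weighted = statistics.stdev(reps)

# Classical standard error of the mean, for comparison
se_classical = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"weighted bootstrap SE = {se_weighted:.3f}")
print(f"classical SE          = {se_classical:.3f}")
```

For a well-behaved statistic like the mean, the two numbers should be close. The questions the monograph asks are when that agreement can be trusted and how the choice of weight distribution changes the answer.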
Bootstrap Methods: An Overview by Michael R. Chernick, Springer; 2015 edition (February 28, 2015)
This is a new book on the bootstrap and won’t be available until February 2015. Pre-release information from the publisher indicates that the book will provide a compact reference guide to many key aspects of the bootstrap. It will explore bootstrap methods with an introductory chapter that also addresses other resampling methods and their broad range of applications (e.g. time series). The individual chapters are planned to focus on estimation, confidence intervals and hypothesis testing.
Bootstrap Methods: A Guide for Practitioners and Researchers by Michael R. Chernick, Wiley-Interscience; 2 edition (November 12, 2007)
This is another book often discussed in statistical learning circles, although not everyone has a favorable opinion of it. I wanted to review it myself to give it a fair shake in this article, but alas, the publisher, John Wiley & Sons, declined to provide me with a review copy. They only provide review copies for titles no older than 18 months; too bad for their authors! So rather than steer you wrong, I’d go with the upcoming Springer title by the same author (see above).