How Statistical Science Can Advance Big Data Research Projects


A recently released American Statistical Association (ASA) white paper recommends a multidisciplinary approach that brings together statisticians, mathematicians, data scientists and relevant domain scientists to tackle the challenges of the federal government’s Big Data Research and Development Initiative and similar private-sector projects.

The free PDF white paper, titled Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society, is aimed at anyone using Big Data to advance their work in research, business or policy, whether in the private sector, academia or government. It was written by a dozen ASA members with expertise in Big Data.

“The data science revolution is being powered by advances in statistics, computer science, applied mathematics and other fields. The ASA’s Discovery with Data white paper makes it clear how essential these partnerships are. Science, society and our fields will all benefit,” says Edward Lazowska, the Bill and Melinda Gates Chair in Computer Science and Engineering at the University of Washington and one of the nearly 50 prominent members of the Big Data community who reviewed the document.

The ASA report notes, “Insight is required to distinguish meaningful signals from noise. The ability to explore data with skepticism is required to determine when systematic error is masquerading as a pattern of interest. The keys to such skeptical insight are rigorous data exploration, statistical inference and the understanding of variability and uncertainty. These keys are the heart of statistics and remain to be used to their full potential in Big Data research.

“Statistics—the science of learning from data and of measuring, controlling and communicating uncertainty—is the most mature of the data sciences. Over the last two centuries, particularly the last 30 years coinciding with the advent of large-scale computing, statistics has been an essential part of the social, natural, biomedical and physical sciences; engineering; and business analytics, among others.

“Statistical thinking not only helps make scientific discoveries, but it quantifies the reliability, reproducibility and general uncertainty associated with these discoveries. Because one can easily be fooled by complicated biases and patterns arising by chance, and because statistics has matured around making discoveries from data, statistical thinking will be integral to Big Data challenges,” the paper continues.

Cynthia Rudin, associate professor of statistics at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, chaired the ASA Big Data R&D Initiative Work Group that wrote the white paper. She recently presented at the KDD 2014 Conference, which was themed “Data Mining for Social Good.”

“There was tremendous excitement at the event for how statistics is changing the world, with applications ranging from reducing parking congestion in Los Angeles to identifying impoverished villages in Africa from aerial photographs,” she explains. “I also was struck by the huge demand for statisticians and data scientists by organizations in the public and private sectors; many companies are hiring. KDD 2014 demonstrates that the themes of the ASA’s Discovery with Data white paper are resonating now more than ever and will continue to do so for the foreseeable future.”

The ASA report offers examples of several scientific and medical research areas to which statistics has contributed. Following are select examples:

Biological Sciences/Bioinformatics – Biology has changed from a data-poor discipline to a data-intensive one. Today, biologists regularly sift through large data sets. Furthermore, many outcomes of interest, such as gene expression, are dynamic quantities. The complexity is further exacerbated by cutting-edge, yet unpolished, technologies producing measurements noisier than anticipated. This complexity and level of variability make statistical thinking an indispensable aspect of the analysis. Biologists are now seeking statisticians as collaborators, and these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays that identify patients at risk of distant recurrence following surgery.

Health Care and Public Health – Personalized predictions of disease risk, as well as time to onset, progression or disease-related adverse events, have the potential to revolutionize clinical practice. Such predictions are important in the context of mental health, cancer, autoimmune disorders, diabetes, inflammatory bowel diseases, and stroke and organ transplants, among others. Statisticians are using huge amounts of medical data to make personalized predictions of disease risk, understand the benefits and harms of drugs and other treatments in addition to environmental factors, analyze the quality of care and understand changing health trends.

Societal Benefit – Statistics and data mining for crime analysis and predictive policing have had a major impact on how police patrol major cities and respond to domestic violence cases, as well as on sentencing and crime policy. Statistics was used to demonstrate improved efficiency and better performance for the electric utility grid. Statisticians also have made important contributions on the challenges of measuring traffic and civic infrastructure maintenance. Last, statisticians have made great progress toward public use of government microdata with synthetic data techniques to protect respondents’ privacy.

Social Sciences – Statistics was central to new United Nations population and fertility probabilistic projections obtained by combining demographic models with Bayesian methods. These methods were central to producing projections released by the UN that influence national policies worldwide. The large-scale field experiments that revolutionized political campaigns are another example in which statistical methods to adjust for noncompliance were critical. Last, statistics was central to research deducing the behavior of individual group members, work that has subsequently been used in litigation by both sides in every state over the Voting Rights Act.

The ASA Big Data paper ends with a call for more ambitious incentives for multidisciplinary teams to address Big Data challenges and other research priorities, and for federal attention to attracting and retaining the next generation of statisticians. Multidisciplinary teams, in which each discipline has much to learn from the others, will ensure the best science is brought to bear, help to avoid reinvention of existing techniques from the contributing data science disciplines and spur development of new theories and approaches. “The history of statistics shows how statisticians have engaged in interdisciplinary research, and how that engagement advanced domain sciences, informed policy and provided new insights.… Today, statisticians continue to contribute expertise to a growing range of scientific and social problems. Their training and experience in collaborative research make them the natural leaders of interdisciplinary teams.… However, the Big Data era is marked by an ever-growing class of truly important and hard interdisciplinary problems where more engagement would be widely productive,” the paper stresses.

“Big Data challenges at the interface of statistics and computer science highlight where statistical thinking is required and multidisciplinary teams involving statisticians important. Statistical thinking fuels the cross-fertilization of ideas between scientific fields, industry and government. Further engagement of statisticians and cutting-edge statistics, as one of the core data science disciplines, will help advance the aims of the administration’s Big Data Initiative. In the work of OSTP, NSF and other federal agencies to address STEM workforce issues, we strongly encourage attention to attracting and retaining the next generation of statisticians, especially those who can work seamlessly across disciplines. The statisticians engaged in interdisciplinary research involving Big Data will need to be computationally savvy, possessing expertise in statistical principles and an understanding of algorithmic complexity, computational cost, basic computer architecture and the basics of both software engineering principles and handling/management of large-scale data,” the paper concludes.

The paper lists nearly 50 prominent members of the Big Data community—including computer scientists, statisticians and data scientists—who reviewed the document.

