In this new insideBIGDATA Guide to Scientific Research, the goal is to provide a road map for scientific researchers wishing to capitalize on the rapid growth of big data technology for collecting, transforming, analyzing, and visualizing large scientific data sets. This article is the first in a series that explores the benefits that researchers across a wide spectrum of scientific disciplines can achieve by adopting big data technologies. The complete insideBIGDATA Guide to Scientific Research is available for download from the insideBIGDATA White Paper Library.
The rapid evolution of big data technology in the past few years has changed forever the pursuit of scientific exploration and discovery. Along with traditional experiment and theory, computational modeling and simulation is a third paradigm for science. Its value lies in exploring areas of science in which physical experimentation is infeasible and insights cannot be revealed analytically, such as climate modeling, seismology and galaxy formation. More recently, big data has been called the “fourth paradigm” of science. Big data can be observed, in a real sense, by computers processing it and often by humans reviewing visualizations created from it. In the past, humans had to reduce the data, often using statistical sampling techniques, to be able to make sense of it. Now, new big data processing techniques can help us make sense of it without traditional reduction.
In 2007, Jim Gray, the late U.S. computer scientist from Microsoft, described a major shift under way in scientific research as a “fourth paradigm” for scientific exploration and discovery.
He predicted that the collection, analysis, and visualization of increasingly large amounts of data would change the very nature of science. One of the goals of big data discussed in the book The Fourth Paradigm is to make the scientific record a first-class scientific object. Fast forward to 2015 and we see distinct evidence of how the big data technology stack is facilitating this change. This technology guide is geared toward scientific researchers working at universities and other research institutions (e.g., NASA, JPL, NIH) who may benefit from learning more about how big data can transform the data collection and analysis parts of their projects. Further, we’ll illustrate how Dell big data technology solutions powered by Intel are actively helping scientists who are focused on their data, on their models and on their research results.
Here is a short list of scientific areas currently using or planning to use big data technology solutions to manage the influx of unparalleled amounts of data:
- Astronomy – the proposed Large Synoptic Survey Telescope (LSST) in Chile is expected to create 12.8 gigabytes of data every 39 seconds, for a sustained data rate of 330 megabytes per second. Over a ten-hour winter night, LSST will thus collect up to 13 terabytes.
- Genomics – the Wellcome Trust Sanger Institute in Cambridge, UK can store 18 petabytes of data. All labs need to manipulate data to yield research results. As prices drop for high-throughput instruments such as automated genome sequencers, small biology labs can become big data generators. Biological data are much more heterogeneous than those in other scientific fields. They stem from a wide range of experiments that yield many types of information, such as genetic sequences, interactions of proteins or findings in Electronic Medical Records (EMRs). A single sequenced human genome is around 140 gigabytes.
- Neuroscience – the U.S.-based BRAIN Initiative uses big data to map the human brain. By mapping the activity of neurons in the brain, researchers hope to discover fundamental insights into how the mind develops and functions, as well as new ways to address brain trauma and diseases. Researchers plan to build instruments that will monitor the activity of hundreds of thousands and perhaps 1 million neurons, taking 1,000 or more measurements each second. This goal will unleash a torrent of data. A brain observatory that monitors 1 million neurons 1,000 times per second would generate 1 gigabyte of data every second, roughly 4 terabytes each hour, and roughly 100 terabytes per day. Even after compressing the data by a factor of 10, a single advanced brain laboratory would produce about 3 petabytes of data annually.
- Climate sciences – the NASA Center for Climate Simulation (NCCS) crunches massive amounts of climate and weather information, giving researchers eye-opening visibility into their data, which currently totals around 32 petabytes. Climate and environmental science is an excellent proving ground for big data technology because the field has a wide variety and large volume of data that needs to be captured rapidly.
- Health sciences – the European Bioinformatics Institute in Hinxton, UK is one of the world’s largest biology data repositories and currently stores 20 petabytes of data about genes, proteins and small molecules.
- Cosmology – the Square Kilometer Array (SKA) is one of the most ambitious science projects ever undertaken. A consortium of 10 nations, with the involvement of numerous university scientists and industrial companies, plans on setting up a massive radio telescope made up of millions of antennas spread out across vast swaths of southern Africa and Australia. When it’s completed in 2024, the array will give astronomers insights into the evolution of the first stars and galaxies after the Big Bang so they can better understand the history of the universe and the nature of matter. Every day, the antennas will gather 14 exabytes of data and store about one petabyte.
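The data rates quoted in the list above can be sanity-checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, decimal (SI) units are assumed (1 GB = 1,000 MB), and the one-byte-per-measurement figure for the brain observatory is an assumption chosen to match the quoted 1 gigabyte per second, since the sample size is not stated here.

```python
# Back-of-the-envelope check of the data rates quoted in the list above.
# Assumption: decimal (SI) units, i.e. 1 GB = 1000 MB and 1 TB = 1000 GB.

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
DAYS_PER_YEAR = 365


def lsst_rates(gb_per_exposure=12.8, seconds_per_exposure=39, night_hours=10):
    """Return (sustained MB/s, TB per night) for the quoted LSST figures."""
    mb_per_second = gb_per_exposure * 1000 / seconds_per_exposure
    tb_per_night = mb_per_second * night_hours * SECONDS_PER_HOUR / 1_000_000
    return mb_per_second, tb_per_night


def brain_observatory_rates(neurons=1_000_000, samples_per_second=1_000,
                            bytes_per_sample=1, compression_factor=10):
    """Return (GB/s, TB/hour, TB/day, compressed PB/year).

    bytes_per_sample=1 is an assumption chosen so that the raw rate matches
    the quoted 1 GB/s; the sample size is not given in the text.
    """
    bytes_per_second = neurons * samples_per_second * bytes_per_sample
    gb_per_second = bytes_per_second / 1e9
    tb_per_hour = bytes_per_second * SECONDS_PER_HOUR / 1e12
    tb_per_day = bytes_per_second * SECONDS_PER_DAY / 1e12
    # Apply compression, then convert TB/year to PB/year.
    pb_per_year = tb_per_day * DAYS_PER_YEAR / compression_factor / 1000
    return gb_per_second, tb_per_hour, tb_per_day, pb_per_year


if __name__ == "__main__":
    print(lsst_rates())
    print(brain_observatory_rates())
```

Under these assumptions the LSST figures work out to about 328 megabytes per second and about 11.8 terabytes over a ten-hour night, and the brain observatory to 3.6 terabytes per hour, 86.4 terabytes per day and about 3.15 petabytes per year after tenfold compression. These are consistent with the rounded figures quoted in the list.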
Along with the many significant opportunities, data-intensive scientific research will also bring complex challenges. Many scientists are concerned that the data deluge will make it increasingly difficult to find relevant data and to understand the context of collective data. In addition, the management of data presents increasingly difficult issues. For example, how do international, multidisciplinary and often competitive collaborations of researchers address challenges related to the creation and use of metadata, ontologies, semantics and data curation while still conforming to the principles of security, privacy and data integrity? For a loosely connected community of researchers, these challenges could be substantial.
Coupling big data with scientific research may bring distinct challenges, but as with all new technological paradigms, growing pains are to be expected on the way to the benefits for scientific progress. Caveats notwithstanding, there is no doubt that big data is quickly becoming an integral part of scientific research today.
Over the next few weeks we will explore these scientific research topics:
- Big Data for Scientific Research – An Overview
- Primary Motivators of Big Data vis-à-vis Scientific Research
- Big Data Technology for Scientific Research
- Big Data and Open Science Data
- Case Studies: Big Data and Scientific Research
If you prefer, the complete insideBIGDATA Guide to Scientific Research is available for download in PDF from the insideBIGDATA White Paper Library, courtesy of Dell and Intel.
 Tony Hey, Stewart Tansley, and Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery (Redmond, Wash.: Microsoft Research, 2009).