Case Studies: Big Data and Scientific Research


This is the fifth and final article in an editorial series that aims to provide a road map for scientific researchers wishing to capitalize on the rapid growth of big data technology for collecting, transforming, analyzing, and visualizing large scientific data sets.

In the last article, we reviewed the open data movement in scientific research and how it relates to big data. The complete insideBIGDATA Guide to Scientific Research is available for download from the insideBIGDATA White Paper Library.

To illustrate how rapidly the scientific community is adopting the big data technology stack, this section considers a number of research projects that have benefited from these tools. In addition, these project profiles show how big data is steadily merging with traditional HPC architectures. In each case, significant amounts of data are being collected and analyzed in the pursuit of unparalleled understanding of nature and the universe.

Tulane University

As part of its rebuilding efforts after Hurricane Katrina, Tulane University partnered with Dell and Intel to build a new HPC cluster to enable the analysis of large sets of scientific data. The cluster is essential to power big data analytics in support of scientific research in the life sciences and other fields. For example, the school has numerous oncology research projects that involve statistical analysis of large data sets. Tulane also has researchers studying nanotechnology, the manipulation of matter at the molecular level, which likewise involves large amounts of data.

Tulane worked with Dell to design a new HPC cluster dubbed Cypress, consisting of 124 Xeon-based Dell PowerEdge C8220X server nodes connected through the high-density, low-latency Z9500 switch, providing a total theoretical peak performance of more than 350 teraflops. Dell also drew on its relationship with Intel, which in turn drew on its relationship with leading Hadoop distribution provider Cloudera, allowing Tulane to run big data analytics using Hadoop in an HPC environment.

Cypress enables Tulane to conduct new scientific research in fields such as epigenetics (the study of the mechanisms that regulate gene activity), cytometry (the measurement of certain subsets of cells within a kind of tissue in the human body), primate research, sports-related concussion research, and the mapping of the human brain.

Arizona State University

ASU worked with Dell to create a powerful HPC cluster that supports big data analytics. As a result, ASU built a holistic Next Generation Cyber Capability (NGCC) using Dell and Intel technologies that is able to process structured and unstructured data, as well as support diverse biomedical genomics tools and platforms.

ASU turned to Dell and Intel to expand its HPC cluster. The resulting NGCC delivers 29.98 teraflops of sustained performance for HPC, big data and massively parallel (or transactional) processing with 150 nodes and 2,400 cores. The HPC side of the NGCC includes 100 Dell PowerEdge M620 servers with Intel® Xeon® E5-2660 processors and 1,360 cores. The NGCC’s transactional side includes 20 Dell PowerEdge M420 servers, each with Intel Xeon E5-2430 processors.
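To put the NGCC figures in perspective, the quoted sustained performance can be divided across the quoted node and core counts. The following is a minimal sketch of that arithmetic only; the helper name `per_unit_gflops` is hypothetical, and the result is a simple average that says nothing about how performance is actually distributed between the HPC and transactional sides.

```python
# Back-of-the-envelope averages from the figures quoted in the article.
# per_unit_gflops is a hypothetical helper name used for illustration.

NGCC_SUSTAINED_TFLOPS = 29.98  # sustained performance quoted above
NGCC_NODES = 150               # total nodes quoted above
NGCC_CORES = 2400              # total cores quoted above


def per_unit_gflops(total_tflops: float, units: int) -> float:
    """Average gigaflops per unit (node or core) for a given total."""
    if units <= 0:
        raise ValueError("units must be a positive integer")
    return total_tflops * 1000.0 / units  # 1 teraflop = 1,000 gigaflops


per_core = per_unit_gflops(NGCC_SUSTAINED_TFLOPS, NGCC_CORES)
per_node = per_unit_gflops(NGCC_SUSTAINED_TFLOPS, NGCC_NODES)

print(f"Average per core: {per_core:.2f} gigaflops")  # 12.49
print(f"Average per node: {per_node:.2f} gigaflops")  # 199.87
```

Averages like these are mainly useful for rough comparisons between clusters; real workloads rarely spread evenly across every core.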

HPC and Cloudera’s Hadoop distribution, upon which the NGCC is based, can handle data sets of more than 300 terabytes of genomic data. In addition, ASU is using the NGCC to understand certain types of cancer by analyzing patients’ genetic sequences and mutations.

National Center for Supercomputing Applications

The National Center for Supercomputing Applications (NCSA) provides computing, data, networking, and visualization resources and services that help scientists, engineers, and scholars at the University of Illinois at Urbana-Champaign and across the country. The organization manages several supercomputing resources, including the iForge HPC cluster based on Dell and Intel technologies.

One particularly compelling scientific research project that’s housed in the NCSA building is the Dark Energy Survey (DES), a survey of the Southern sky aimed at understanding the accelerating expansion rate of the universe. The project is based on the iForge cluster and ingests about 1.5 terabytes daily.
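The daily ingest figure gives a sense of how quickly survey data accumulates. The sketch below assumes, purely for illustration, that roughly 1.5 terabytes arrive every calendar day at a constant rate, which the article does not state; the function names are hypothetical.

```python
import math

# Daily ingest quoted in the article; a constant, every-day rate is an
# assumption made here for illustration only.
DAILY_INGEST_TB = 1.5


def cumulative_ingest_tb(days: int, daily_tb: float = DAILY_INGEST_TB) -> float:
    """Total terabytes ingested after the given number of days."""
    return days * daily_tb


def days_to_reach(target_tb: float, daily_tb: float = DAILY_INGEST_TB) -> int:
    """Whole days needed to accumulate at least target_tb terabytes."""
    return math.ceil(target_tb / daily_tb)


print(cumulative_ingest_tb(365))  # 547.5 terabytes after one year
print(days_to_reach(1000))        # 667 days to pass 1 petabyte (1,000 TB)
```

Under that assumption, a single year of ingest already exceeds half a petabyte, which helps explain why the project relies on an HPC cluster.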

Translational Genomics Research Institute

To advance health through genomic sequencing and personalized medicine, the Translational Genomics Research Institute (TGen) requires a robust, scalable high-performance computing environment complemented with powerful big data analytics tools for its Dell | Hadoop platform. TGen optimized its infrastructure by implementing the Dell Statistica analytics software solution and scaling its existing Dell HPC cluster with Dell PowerEdge M1000e blades, Dell PowerEdge M420 blade servers and Intel processors. The increased performance accelerated experimental results, enabling researchers to expand treatments to a larger number of patients.

As gene sequencers increased in speed and capacity, TGen scaled its HPC cluster to 96 nodes, using cutting-edge PowerEdge servers with Intel® Xeon® processors that achieved 19 teraflops of processing. The cluster supports 1 million CPU hours per month and 100% year-to-year data growth. To manage this level of big data, TGen scaled its existing Terascala storage so it can hold 1 petabyte.
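The growth and utilization figures above lend themselves to a quick capacity estimate. The sketch below is illustrative only: the 250-terabyte starting volume and the 30-day month are assumptions not found in the article, and all function names are hypothetical. It treats 100% year-to-year growth as a doubling each year.

```python
import math

ANNUAL_GROWTH = 1.0              # 100% year-to-year growth, from the article
STORAGE_CAPACITY_TB = 1000.0     # 1 petabyte, from the article
CPU_HOURS_PER_MONTH = 1_000_000  # from the article
CLUSTER_NODES = 96               # from the article
HOURS_PER_MONTH = 30 * 24        # assumption: a 30-day month


def projected_data_tb(current_tb: float, years: float,
                      annual_growth: float = ANNUAL_GROWTH) -> float:
    """Data volume after compounding growth for the given number of years."""
    return current_tb * (1 + annual_growth) ** years


def years_until_full(current_tb: float,
                     capacity_tb: float = STORAGE_CAPACITY_TB,
                     annual_growth: float = ANNUAL_GROWTH) -> float:
    """Years until the data volume reaches the storage capacity."""
    if current_tb >= capacity_tb:
        return 0.0
    return math.log(capacity_tb / current_tb) / math.log(1 + annual_growth)


def implied_busy_cores(cpu_hours: float = CPU_HOURS_PER_MONTH,
                       hours: float = HOURS_PER_MONTH) -> float:
    """Average number of cores kept busy to deliver cpu_hours in one month."""
    return cpu_hours / hours


start_tb = 250.0  # hypothetical starting volume, not from the article
print(f"Years until full: {years_until_full(start_tb):.1f}")
print(f"Average busy cores: {implied_busy_cores():.0f}")
print(f"Busy cores per node: {implied_busy_cores() / CLUSTER_NODES:.1f}")
```

With data doubling every year, any fixed storage capacity is consumed quickly: a system that is one-quarter full would reach its limit in about two years, which is why storage had to be scaled alongside compute.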

Summary

The explosion of big data is transforming how scientists conduct research. Grants and research programs are geared toward improving the core technologies for managing and processing big data sets, and toward speeding up scientific research with big data. The emergent field of data science is changing the direction and speed of scientific research by letting people fine-tune their inquiries by tapping into giant data sets. Scientists have been using data for a long time. What’s new is that the scale of the data is overwhelming, which can be an infrastructure challenge. Researchers now need to be able to tame large data sets with new big data software tools and HPC to make rapid advances in their fields.

If you prefer, the complete insideBIGDATA Guide to Scientific Research is available for download in PDF from the insideBIGDATA White Paper Library, courtesy of Dell and Intel.
