Data Science 101: How to Build Big Data Pipelines


Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline must be developed that incorporates and orchestrates many diverse technologies. A Hadoop-focused data pipeline needs not only to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, Pig or Cascading), but also to encompass real-time data acquisition and the analysis of reduced data sets extracted into relational/NoSQL databases or dedicated analytical engines.
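To make the idea of orchestration concrete, here is a minimal sketch in Python of a pipeline that runs named stages in order and reports which stage failed. It is an illustration only, not the approach shown in the talk: the `Pipeline` class and the `acquire`, `batch_count` and `export` stages are hypothetical stand-ins for real-time acquisition, a Hadoop batch job, and an export into a relational/NoSQL store, and no real Hadoop or Spring API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A stage receives the shared context dict and returns it, possibly updated.
Stage = Callable[[Dict], Dict]


@dataclass
class Pipeline:
    """Hypothetical orchestrator: runs stages sequentially, stops on failure."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, context: Dict) -> Dict:
        completed: List[str] = []
        for name, stage in self.stages:
            try:
                context = stage(context)
            except Exception as exc:
                # Surface the failing stage and what had already finished.
                raise RuntimeError(
                    f"stage '{name}' failed after {completed}"
                ) from exc
            completed.append(name)
        context["completed"] = completed
        return context


def acquire(ctx: Dict) -> Dict:
    # Stand-in for real-time data acquisition (assumed sample log lines).
    ctx["raw"] = ["error db timeout", "info ok", "error disk full"]
    return ctx


def batch_count(ctx: Dict) -> Dict:
    # Stand-in for a batch job (e.g. MapReduce or Hive): count lines per level.
    counts: Dict[str, int] = {}
    for line in ctx["raw"]:
        level = line.split()[0]
        counts[level] = counts.get(level, 0) + 1
    ctx["counts"] = counts
    return ctx


def export(ctx: Dict) -> Dict:
    # Stand-in for loading the reduced data set into a database.
    ctx["exported_rows"] = sorted(ctx["counts"].items())
    return ctx


def build_pipeline() -> Pipeline:
    return (
        Pipeline()
        .add("acquire", acquire)
        .add("batch_count", batch_count)
        .add("export", export)
    )


result = build_pipeline().run({})
print(result["exported_rows"])
```

In a real deployment each stage would submit a job or call an external system, and the orchestrator would also need retries, scheduling and restartability, which this sketch leaves out.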

In the video presentation below from the SpringOne 2GX 2012 conference in Washington, DC, Costin Leau looks at the architecture of Big Data pipelines, the challenges ahead, and how to build manageable and robust solutions using open source software such as Apache Hadoop, Hive, Pig, Spring for Apache Hadoop, Spring Batch and Spring Integration.








  1. Nice video. We have an article on data pipelines as well, suggesting that data analytics should get involved earlier in the pipeline.