There has been a lot of discussion lately about enterprise data hubs. Two major Hadoop distribution vendors, Cloudera and MapR, are making a lot of noise to the business community with the statement that the center of gravity is shifting away from data warehouses toward Hadoop. The backdrop driving this discussion is that organizations are struggling today with rapidly growing data from multiple data sources. Transactional data from ERP and e-commerce systems, log files, sensor data (e.g. RFID) and unstructured social media data are a few of these fast-growing data sources.
There are many existing analytic platforms in organizations today that are focused on specific data and applications. Enterprises recognize that there is more value in combining these silos of data to open up new use cases. What is required is a central platform that can serve as a collection point for a broad and varied set of data and can accommodate a wide set of use cases—an Enterprise Data Hub.
Hadoop as an Enterprise Data Hub was discussed in depth by independent analyst Mike Ferguson in his May 2013 paper “Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop.” Mike was a principal and co-founder of Codd and Date Europe Limited (the inventors of the Relational Model) and was also a Chief Architect at Teradata. In his paper, Mike delineates the requirements for an enterprise data hub from the perspective of the MapR Distribution for Hadoop. The platform capabilities that he discusses include full data protection, business continuity and availability features to form the foundation for cleansing, transforming and integrating structured and multi-structured data from multiple sources.
“MapR’s data protection and disaster recovery capabilities make MapR Hadoop distributions suitable for long-term storage of Big Data and data warehouse archived data, which can then be selectively re-processed in specific analyses,” noted Mike Ferguson.
MapR invested several years of engineering effort to re-architect a data platform for Hadoop so it could support such enterprise-grade capabilities.
Later in November 2013, Cloudera made its enterprise data hub intentions known at the Strata/Hadoop World event in New York with the announcement of the beta release of Cloudera Enterprise 5.
“With an Enterprise Data Hub, information is cheap to store, and you can keep full-fidelity [unaggregated] data forever if you want to,” Cloudera co-founder Mike Olson explained. “You can do your ETL and your data cleaning and preening on this new platform and deliver derived data sets to special-purpose data warehouses and document management systems for advanced processing there.”
When Cloudera Enterprise 5 becomes generally available early next year, the idea is to keep the broadest and deepest swath of data on Hadoop — or more correctly, Cloudera’s commercially enhanced version of Hadoop — and use Cloudera Impala SQL capabilities, Cloudera Search, and Cloudera Navigator access management and auditing to take over the broad and high-scale workloads from traditional database systems such as data integration, data warehouse and document management systems.
As someone who is deeply entrenched in the big data industry, including Hadoop, I had been wondering when the fall of the data warehouse and data mart might occur. With the new enterprise data hub initiatives described above, it appears that this day may come sooner than I expected.