
Chaos and Hard Labor Are Not the Foundations of Data-Driven Organizations

In this special guest feature, Murthy Mathiprakasam, Principal Product Marketing Manager at Informatica, discusses the increased adoption of Data Lake Management practices and how organizations are quickly getting value from big data. Murthy Mathiprakasam is a director of product marketing for Informatica’s Big Data products. Murthy has a decade and a half of experience with high-growth software technologies, including roles at Mercury Interactive, Google, eBay, VMware and Oracle.

Big data continues to be on a tear with the exponential growth in adoption for Hadoop and other modern data platforms. There are more and more case studies and success stories of organizations leveraging big data to become more efficient, competitive and responsive to their customers.

However, there’s an awful lot of noise and confusion in the industry too. While some of this noise might just be the experimentation from new approaches that is inevitable with any transformative technology, it is also starting to feel like some vendors and some pundits are promoting a deliberate trend toward data anarchy inside organizations. The core principle of data anarchy is simple: cut through the “bottleneck” of IT by letting all the organization’s data go free and let a jungle of data engineers and data scientists get pervasive access to data whenever they want. There’s only one problem: this makes no sense for data engineers and data scientists, and it indisputably makes no sense for the organization as a whole.

The role of data engineers and data scientists should in no way be diminished. Data engineers and data scientists are the yin and yang of turning raw sets of big data into trusted and useful insights. However, most data engineers and data scientists in these anarchist environments will tell you that they feel far from empowered. There are countless anecdotes of these professionals talking about how the chaos and hard labor of anarchy impede their ability to actually exercise the skills they should be bringing to the organization. So, clearly, chaos and hard labor are no formula for success. The organizations that are delivering trusted information using big data quickly and repeatedly have one thing in common: they are leveraging big data fabrics based on the principles of data lake management.

Data lake management is a set of technologies and practices that quickly and repeatedly turn raw big data into trusted information. Make no mistake; this is not another set of bureaucratic processes that slow down progress or results. Quite to the contrary: when data lake management is used as a foundation for a data democracy, end user professionals, such as data engineers and data scientists, enjoy the highest productivity since data flows through the organization quickly and predictably.

Data lake management starts with the recognition that the whole point of big data and the concept of a data lake is to have all of an organization’s data in one place. Data silos don’t help the individuals hoarding the data or the organizations that fail to encourage these individuals to participate in a communal and centralized data lake. Once the data itself is targeted for collection in a centralized data lake, it can be curated with a well-governed process. Most organizations have some notion of data lake zones, with an initial ingestion zone for raw data, an intermediary discovery zone for early modeling, and a final analytical zone for fully certified data. The segmentation of data into zones in the data lake itself accelerates access to fit-for-purpose datasets. Instead of viewing data curation as a static process, the organization can now offer different SLAs for different levels of quality and certification of data. Combined with an agile execution process and short project cycle times, this zoning lets cross-functional teams curate data assets very quickly for a variety of purposes.
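To make the zone idea concrete, here is a minimal Python sketch of a dataset moving from the ingestion zone through discovery to the analytical zone. The names `Zone`, `Dataset`, and `promote` are hypothetical and chosen only for illustration, and the rule that a dataset advances one zone at a time when its checks pass is an assumption, not something any particular product prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Zone(IntEnum):
    """The three zones named in the text, ordered by level of curation."""
    INGESTION = 1   # raw data as it lands
    DISCOVERY = 2   # early modeling and exploration
    ANALYTICAL = 3  # fully certified data


@dataclass(frozen=True)
class Dataset:
    name: str
    zone: Zone = Zone.INGESTION
    certified: bool = False


def promote(ds: Dataset, checks_passed: bool) -> Dataset:
    """Move a dataset one zone forward if its quality checks passed.

    Assumption: promotion is strictly one step at a time, and only data
    that reaches the analytical zone is marked as certified.
    """
    if not checks_passed or ds.zone is Zone.ANALYTICAL:
        return ds
    next_zone = Zone(ds.zone + 1)
    return Dataset(ds.name, next_zone, certified=next_zone is Zone.ANALYTICAL)
```

Because each zone carries a different level of certification, a consumer can pick the earliest zone that is good enough for the task instead of waiting for full certification.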

Collaboration can be fostered by the technology as well. A comprehensive data lake management solution provides persona-specific user interfaces for data engineers and data scientists who seek self-service. These interfaces are tied back into a centralized enterprise information catalog, which automatically classifies and indexes the data. This intelligent use of data science techniques on the data structure, data field names, and data usage itself enables data relationships to be inferred between datasets that could never have been inferred through manual effort. These automated classifications facilitate collaboration and self-service by making the process of interacting with data fully guided instead of one of roaming around amidst chaos.
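As a toy illustration of inferring relationships from field names, the Python sketch below links datasets whose schemas share a field once naming differences such as case and underscores are stripped. It is a deliberately simple stand-in, not how any real catalog works: the function names and the sample schemas are hypothetical, and a production catalog would also draw on data structure and usage, as described above.

```python
import re
from itertools import combinations


def normalize(field: str) -> str:
    """Lower-case a field name and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", field.lower())


def infer_links(schemas: dict[str, list[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (dataset_a, dataset_b, shared_fields) for every pair of
    datasets that has at least one normalized field name in common."""
    links = []
    for (a, fields_a), (b, fields_b) in combinations(schemas.items(), 2):
        shared = {normalize(f) for f in fields_a} & {normalize(f) for f in fields_b}
        if shared:
            links.append((a, b, shared))
    return links
```

For example, a schema with `customer_id` and another with `CustomerID` would be linked on the normalized name `customerid`, which is the kind of match that is tedious to find by hand across many datasets.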

Finally, data lake management leverages business glossaries to establish taxonomies and rules for what constitutes meaningful and valid data. With the help of data stewards in lines of business, the standard for what is correct can be established. These standards can also be coded as business rules that are automatically run against datasets to identify anomalies, duplications, inaccuracies and inconsistencies for further human investigation. Again, the guided approach to data lake management eliminates unnecessary manual effort and dramatically increases organization productivity by quickly driving visibility around data quality and compliance.
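The idea of standards coded as business rules can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the rule names, the `customer_id` key, and the thresholds are hypothetical, and real standards would come from the data stewards and the business glossary. The sketch only flags records for human investigation; it does not fix them.

```python
from collections import Counter
from typing import Callable

Record = dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical rules; each returns True when the record is valid.
RULES: dict[str, Rule] = {
    "email_present": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}


def find_issues(records: list[Record], key: str = "customer_id") -> list[tuple[int, str]]:
    """Run every rule against every record and flag duplicate keys.

    Returns (record_index, issue_name) pairs for later human review.
    """
    issues = []
    key_counts = Counter(r.get(key) for r in records)
    for i, record in enumerate(records):
        for name, rule in RULES.items():
            if not rule(record):
                issues.append((i, name))
        if key_counts[record.get(key)] > 1:
            issues.append((i, "duplicate_" + key))
    return issues
```

Keeping the rules in a plain mapping means a steward can add or retire a rule without touching the code that runs them.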

The evidence is clear: chaos and hard labor are simply not winning approaches to managing big data. At best, organizations face excessively long project cycle times from the absence of comprehensive and systematic data lake management practices. More frighteningly, the absence of such practices and lack of understanding of big data can put organizations at risk for potential failures of internal controls or external regulations.

Establishing trust and understanding for big data assets rapidly and repeatedly is essential in a world where data volumes and data variety will only continue to grow. The undeniable approach for establishing trust and understanding around big data assets is through data lake management technologies and practices. Invest in a comprehensive and systematic approach to data lake management to dramatically accelerate big data projects, and empower data engineers and data scientists to unprecedented levels of productivity.


Sign up for the free insideBIGDATA newsletter.
