Six Tenets of Data Lake Access Control and Governance

In this special guest feature, Amandeep Khurana, Chief Executive Officer and Co-founder at Okera, presents six important tenets of data lake access control and governance: Data-Centricity, Rich Access Policies, Scalability and Automation, Unified Visibility, Open, API-First Design, and Hybrid and Multi-Cloud Ready. Amandeep launched Okera in 2016 with CTO and co-founder Nong Li. While supporting customer cloud initiatives at Cloudera and playing an integral role at AWS on the Elastic MapReduce team, Amandeep oversaw some of the industry’s largest big data implementations. After witnessing first-hand the challenges companies faced in adopting big data technologies, especially in the cloud, he founded Okera to empower all users with easy data access through a unified, secured, and governed platform across heterogeneous sources. Amandeep is also the co-author of HBase in Action, a book on building applications with HBase, and is passionate about distributed systems, big data and everything cloud. Amandeep received his MS in Computer Science from the University of California, Santa Cruz and a Bachelor of Engineering from Thapar Institute of Engineering and Technology.

It may be old news that data is today’s true differentiator for businesses, but companies still face a daunting challenge when trying to use data for their digital transformation initiatives. Companies create massive data lakes and hire data scientists and analysts, yet they often fail to resolve the tension between the people who want to use data to power new applications and those charged with ensuring proper access controls and governance. The latter group must meet rapidly evolving regulatory requirements for protecting private customer and employee information, and both groups need to be served at scale.

Resolving this tension requires both process and technology. We’ll focus on technology here. As organizations build and use platforms to achieve their data goals, they must ensure that any solution for access control and governance is built on the following six foundational tenets:

  1. Data-Centricity
  2. Rich Access Policies
  3. Scalability and Automation
  4. Unified Visibility
  5. Open, API-First Design
  6. Hybrid and Multi-Cloud Ready

1. Data-Centricity

Data access policies and governance should not be tied to the storage system or analytics engine being used. Instead, the solution must be data-centric and enable the consistent enforcement of policies across the tools you have deployed today, as well as those you may deploy in the future. To do this, the architecture should support multiple analytics frameworks, including pure SQL, hybrid SQL, structured but non-SQL interfaces (data frames, Spark), machine learning, and business intelligence.
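
As a rough illustration of what data-centric can mean in practice, the sketch below defines one row-level rule as plain data and derives two enforcement forms from it: a SQL predicate for SQL engines and a Python predicate for data-frame style code. The policy shape and the names `ROW_POLICY`, `to_sql_predicate`, and `to_row_predicate` are assumptions made for this example, not part of any particular product.

```python
# Hypothetical policy description: the rule is attached to the data
# (a column and its allowed values), not to any particular engine.
ROW_POLICY = {"column": "region", "allowed_values": ["EU", "UK"]}


def to_sql_predicate(policy):
    """Render the policy as a WHERE-clause fragment for SQL engines."""
    # Double any single quotes so the values stay valid SQL string literals.
    quoted = ", ".join(
        "'" + value.replace("'", "''") + "'" for value in policy["allowed_values"]
    )
    return f"{policy['column']} IN ({quoted})"


def to_row_predicate(policy):
    """Render the same policy as a function for non-SQL row processing."""
    allowed = set(policy["allowed_values"])
    column = policy["column"]
    return lambda row: row.get(column) in allowed


rows = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
    {"id": 3, "region": "UK"},
]
keep = to_row_predicate(ROW_POLICY)

print(to_sql_predicate(ROW_POLICY))            # region IN ('EU', 'UK')
print([r["id"] for r in rows if keep(r)])      # [1, 3]
```

Because both forms come from the same definition, a change to the rule is made once and reaches every engine that consumes it.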

2. Rich Access Policies

Effective access control and governance must support both structured and unstructured data at various granularities. For unstructured data, granularity should range from groups of folders down to individual files. For structured data, granularity should range from sets of datasets to individual datasets, and further down to columns, rows and even cells. Other key capabilities include anonymization, tokenization, masking, and redaction of data (generally referred to as obfuscation) for different users and use cases. It’s also important to be able to apply policies, such as consent management and right to erasure, that support evolving privacy regulations like GDPR and CCPA.
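
To make these granularities concrete, here is a minimal sketch that combines a row-level filter with per-column actions (allow, tokenize, mask), and redacts any column the policy does not list. The roles, column names, and the `ROLE_POLICIES` structure are hypothetical, and the hash-based token is only a stand-in for a real tokenization scheme.

```python
import hashlib


def tokenize(value):
    """Replace a value with a stable token derived from a hash.
    Illustrative only: a real system would use a keyed or vault-based scheme."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def mask(value):
    """Keep the first character and hide the rest."""
    text = str(value)
    return text[:1] + "*" * (len(text) - 1)


TRANSFORMS = {"allow": lambda v: v, "tokenize": tokenize, "mask": mask}

# Hypothetical policies keyed by role. Columns not listed are dropped (redacted).
ROLE_POLICIES = {
    "analyst": {
        "row_filter": lambda row: row["country"] == "DE",  # row-level rule
        "columns": {"user_id": "tokenize", "email": "mask", "country": "allow"},
    },
    "admin": {
        "row_filter": lambda row: True,
        "columns": {"user_id": "allow", "email": "allow",
                    "country": "allow", "ssn": "allow"},
    },
}


def apply_policy(rows, role):
    """Return only the rows and column values the given role may see."""
    rules = ROLE_POLICIES[role]
    visible = []
    for row in rows:
        if not rules["row_filter"](row):
            continue
        visible.append({
            column: TRANSFORMS[action](row[column])
            for column, action in rules["columns"].items()
            if column in row
        })
    return visible


SAMPLE_ROWS = [
    {"user_id": 1, "email": "ann@example.com", "country": "DE", "ssn": "123"},
    {"user_id": 2, "email": "bob@example.com", "country": "US", "ssn": "456"},
]

for visible_row in apply_policy(SAMPLE_ROWS, "analyst"):
    print(visible_row)
```

The analyst sees a single row with a tokenized `user_id`, a masked `email`, and no `ssn` column at all, while the admin sees the data unchanged.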

3. Scalability and Automation

A key goal in deploying a data lake is the ability to operate and use the platform at scale without having to scale human resources and integration costs. Your approach to data lake access control and governance should have the same scalability and cost goals. Achieving this requires:

  • Definition – Support for more sophisticated policy constructs, such as context-based dynamic views and attribute-based policies. These make defining policies at scale much easier.
  • Enforcement – Policies need to be applied to datasets and workloads at any scale for any tool, ranging from single digit gigabytes to multiple petabytes, without impacting performance.
  • Management – Managing access policies, specifically around fine-grained access control, should be based on an API-first design and allow for automation. Costly manual management methods cannot support a high number of users, which will limit the ability to scale data lake access control and governance over time. Policies must also be applied to data without creating multiple views and data objects for each policy and consumption tool.
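
One way to picture attribute-based policies is the sketch below, in which rules refer to tags on datasets and attributes on users rather than to individual datasets or people. The classes, tag names, and attribute names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    tags: frozenset


@dataclass(frozen=True)
class User:
    name: str
    attributes: frozenset


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical rule: data carrying `data_tag` needs `required_attribute`."""
    data_tag: str
    required_attribute: str


def can_read(user, dataset, policies):
    """Allow access only if every restricted tag on the dataset is
    cleared by an attribute the user holds."""
    for policy in policies:
        if (policy.data_tag in dataset.tags
                and policy.required_attribute not in user.attributes):
            return False
    return True


# Two rules cover every dataset that is tagged, now or in the future.
POLICIES = [
    TagPolicy("pii", "privacy_trained"),
    TagPolicy("finance", "finance_team"),
]

orders = Dataset("sales.orders", frozenset({"pii"}))
ledger = Dataset("finance.ledger", frozenset({"pii", "finance"}))
weather = Dataset("public.weather", frozenset())

ana = User("ana", frozenset({"privacy_trained"}))

print(can_read(ana, orders, POLICIES))   # True
print(can_read(ana, ledger, POLICIES))   # False
print(can_read(ana, weather, POLICIES))  # True
```

In this arrangement, onboarding a new dataset means tagging it rather than writing new rules, which is the kind of saving the Definition and Management points describe.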

4. Unified Visibility

Data access control and governance should address two aspects of usage visibility:

  • Historical visibility – Provides a view into user activity and access patterns via an audit trail. The quality and richness of the audit trail must be consistent and cannot vary depending on the consuming application or the source system. This audit trail can also be used later to build the next set of capabilities for the data lake, such as usage analytics, chargebacks and resource management.
  • Current state visibility – The ability to answer questions like, “Who has access to a given dataset, and what is their view?” “What data can this user access?” “How can I get access to this dataset?”

If either aspect of data usage visibility is missing or inadequate, the organization cannot gain the necessary insight into user activity required for effective governance.
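
Both aspects can be sketched in a few lines. In the example below, every access is recorded with the same fields regardless of the consuming tool, and two lookup functions answer the current-state questions from a single table of grants. The `GRANTS` table, the field names, and the tool labels are hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical grants: who may see which dataset, and through which view.
GRANTS = [
    {"user": "ana", "dataset": "sales.orders", "view": "masked"},
    {"user": "raj", "dataset": "sales.orders", "view": "full"},
    {"user": "ana", "dataset": "hr.salaries", "view": "aggregates_only"},
]

AUDIT_LOG = []  # in a real deployment this would be durable storage


def record_access(user, dataset, tool, columns):
    """Historical visibility: one event schema for every consuming tool."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "tool": tool,
        "columns": sorted(columns),
    }
    AUDIT_LOG.append(json.dumps(event, sort_keys=True))
    return event


def who_can_access(dataset):
    """Current state: which users can see this dataset, and through what view?"""
    return {g["user"]: g["view"] for g in GRANTS if g["dataset"] == dataset}


def datasets_for(user):
    """Current state: which datasets can this user access?"""
    return sorted(g["dataset"] for g in GRANTS if g["user"] == user)


record_access("ana", "sales.orders", "spark", ["order_id", "amount"])
record_access("raj", "sales.orders", "bi_dashboard", ["order_id"])

print(who_can_access("sales.orders"))  # {'ana': 'masked', 'raj': 'full'}
print(datasets_for("ana"))             # ['hr.salaries', 'sales.orders']
```

Because the Spark job and the dashboard produce records with identical fields, usage analytics or chargeback reports built later do not need per-tool parsing.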

5. Open, API-First Design

The approach to access control and governance needs to support the new tools, frameworks and vendors that will inevitably join the analytics and machine learning ecosystem. This means it should use a simple, service-oriented architecture that is API-first in design. Insisting on an API-first design enables easy integration with current and future enterprise tools, such as Active Directory (AD) or Single Sign-On (SSO) systems for identity management, log management frameworks for diagnostics and anomaly detection, and catalogs for business metadata.

Access control and governance solutions must also be agnostic to both storage and analytics platforms, and their interfaces should be pluggable. Additionally, the solution should be vendor-neutral to avoid lock-in.
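
A pluggable interface might look something like the following sketch, in which the access service depends only on an abstract identity provider, so an AD or SSO integration can be swapped in without touching the policy logic. The class names and the `groups_for` method are assumptions for this example, and `StaticDirectory` is a stand-in rather than a real directory client.

```python
from abc import ABC, abstractmethod


class IdentityProvider(ABC):
    """Hypothetical plug-in contract for any identity system."""

    @abstractmethod
    def groups_for(self, username):
        """Return the set of groups the user belongs to."""


class StaticDirectory(IdentityProvider):
    """Stand-in for an AD or SSO integration; a real plug-in
    would query that system instead of an in-memory mapping."""

    def __init__(self, membership):
        self._membership = membership

    def groups_for(self, username):
        return set(self._membership.get(username, ()))


class AccessService:
    """Policy logic that knows nothing about where identities come from."""

    def __init__(self, identity_provider, grants):
        self._identity = identity_provider
        self._grants = grants  # maps dataset name -> set of allowed groups

    def is_allowed(self, username, dataset):
        allowed_groups = self._grants.get(dataset, set())
        return bool(self._identity.groups_for(username) & allowed_groups)


directory = StaticDirectory({"ana": ["analysts"], "raj": ["finance", "analysts"]})
service = AccessService(directory, {"finance.ledger": {"finance"}})

print(service.is_allowed("raj", "finance.ledger"))  # True
print(service.is_allowed("ana", "finance.ledger"))  # False
```

The same pattern extends to storage systems, log sinks, and metadata catalogs: each sits behind its own small interface, and replacing a vendor means writing one new implementation.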

6. Hybrid and Multi-Cloud Ready

A modern organization that aspires to be agile and to adopt best-of-breed technologies for faster business value tends to think about its infrastructure the same way. Hybrid and multi-cloud are top of mind for many C-level executives. This means that, in addition to being API-first, the approach to data access control and governance should be vendor-agnostic and cloud-native, and should support hybrid infrastructure. This combination will help ensure the right architecture for the long term.

Conclusion

For most organizations, providing access control and governance on modern, cloud-based data lakes requires striking a balance between empowering users and protecting private information. Having the right technology and tools is critical to supporting the organization’s overarching goal of achieving business agility without compromising on security, governance or privacy. By ensuring their approach to data access control and governance is based on the six tenets above, organizations can make this goal a reality.
