Myth or Reality? The Truth Behind the Evolution of Apache Ranger

In this special guest feature, Balaji Ganesan, CEO and co-founder of both Privacera, the cloud data governance and security leader, and XA Secure, acquired by Hortonworks, discusses the truth behind the evolution of Apache Ranger. Balaji is an Apache Ranger™ committer and member of its project management committee (PMC).

As companies of all sizes migrate their data and analytical workloads to take advantage of the cloud’s lower capital costs and efficient resource utilization, they face data security and access control challenges similar to those they faced when they started building on-premises Hadoop data lakes in the mid-2000s. To meet that earlier need, Apache Ranger emerged as a leading centralized platform to administer access control policies across a number of open-source applications, including Apache Hive, Apache Spark, and Apache Kafka, to name just a few.

Considered a highly successful open-source project that is in use at hundreds of enterprises around the world, Apache Ranger started off as a commercial software project. My partners Don Bosco Durai and Selvamohan Neethiraj and I founded XA Secure in 2013 with the goal of developing an enterprise-ready, centralized platform, built from the ground up, to define and administer data access controls for on-premises Hadoop data lakes. Within the year, XA Secure was acquired by Hortonworks, which quickly donated XA Secure’s entire code base, comprising roughly 440,000 lines, to the Apache Software Foundation (ASF).

In releasing the code, Hortonworks laid the foundation of Apache Ranger as an Incubator project, with the first version released in November 2014. In 2017, Ranger was recognized as a top-level project (TLP), a testament to the project’s growing community and adoption. As of this writing, Apache Ranger has had more than 15 major and minor releases.

Ranger is a centralized framework to define, administer, and manage access control policies. Thanks to the Ranger community, the platform provides the most comprehensive security coverage across Hadoop and other Big Data components, all from a single interface. Its main capabilities include:

  • Lightweight plugin-based architecture which authorizes access to data in the context of the resources being requested. These plugins are lightweight, distributed agents that act as gatekeepers for various big data projects such as Apache Spark or Apache Kafka. When a user executes a SQL query or reads a file, the Ranger plugin performs a quick authorization check against the resources that the user is requesting to access. If the user has the required permissions, the plugin then essentially lets the Hadoop cluster take over the processing of the query. The plugin architecture also provides the ability to extend Ranger’s authorization model to systems that are not part of the Hadoop ecosystem.
  • Central audit location which records access requests across all the components. The comprehensive audit framework provides rich reporting along with contextual metadata such as resource classification, IP, locale, the specific policy, and its version for each access request.
  • Advanced security features include dynamic column masking and row filtering. Dynamic data masking capability allows only authorized users to see the data they are permitted to see, while for other users the same data is masked or anonymized. Ranger’s row-level security is a default filter condition that empowers administrators to render a limited number of filtered rows from a Hive table without the need to manually add these as predicates or create multiple views.
  • Key Management Service (KMS) stores and manages encryption keys for HDFS Transparent Data Encryption. Ranger KMS is compatible with Hadoop’s native KMS API. 
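
To make the dynamic masking and row filtering described in the list above more concrete, here is a minimal Python sketch of the idea. It is a toy model and not Ranger's API: `MaskRule`, `RowFilter`, `read_table`, the group names, and the sample rows are all hypothetical names chosen for illustration. It assumes that a mask hides everything except the last four characters of a value, and that a row filter only applies to users in the group it names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MaskRule:
    """Hypothetical rule: mask `column` unless the caller is in an exempt group."""
    column: str
    unmasked_groups: frozenset[str]

    def apply(self, value: object, user_groups: set[str]) -> object:
        if user_groups & self.unmasked_groups:
            return value  # authorized users see the raw value
        text = str(value)
        # Show only the last four characters; values of four characters
        # or fewer are returned unchanged in this simplified sketch.
        return "x" * max(len(text) - 4, 0) + text[-4:]


@dataclass(frozen=True)
class RowFilter:
    """Hypothetical rule: users in `group` only see rows passing `predicate`."""
    group: str
    predicate: Callable[[dict], bool]


def read_table(rows, user_groups, masks, filters):
    """Return the rows a user may see, with masked columns rewritten."""
    # Row filtering: apply every filter registered for one of the user's groups.
    active = [f.predicate for f in filters if f.group in user_groups]
    visible = [row for row in rows if all(p(row) for p in active)]

    # Column masking: rewrite values column by column.
    mask_by_column = {m.column: m for m in masks}
    result = []
    for row in visible:
        masked = {}
        for column, value in row.items():
            rule = mask_by_column.get(column)
            masked[column] = rule.apply(value, user_groups) if rule else value
        result.append(masked)
    return result


# Illustrative data only.
ROWS = [
    {"name": "Ana", "region": "EU", "ssn": "123-45-6789"},
    {"name": "Bob", "region": "US", "ssn": "987-65-4321"},
]
MASKS = [MaskRule("ssn", frozenset({"compliance"}))]
FILTERS = [RowFilter("eu_analysts", lambda row: row["region"] == "EU")]

eu_view = read_table(ROWS, {"eu_analysts"}, MASKS, FILTERS)
compliance_view = read_table(ROWS, {"compliance"}, MASKS, FILTERS)
```

In this sketch the same table yields two different results: `eu_view` contains only the EU row with the `ssn` column masked, while `compliance_view` contains both rows unmasked. No predicates were added to the query itself and no extra views were created, which is the point the row-level security bullet makes.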

Although Apache Ranger is a mainstay of the open-source community, a number of misconceptions persist about it:

  • Myth #1: Apache Ranger is exclusively an RBAC solution for heterogeneous data services – It is a common misconception that Ranger relies exclusively on role-based access control (RBAC) to implement access control policies. The reality is that Ranger started its journey as an open-source project based on an attribute-based access control (ABAC) approach. In addition to empowering data administrators to define access policies based on roles and users, Ranger also offers the flexibility to authorize access based on a combination of the subject, action, resource, and environment. Using descriptive attributes such as Active Directory (AD) group, Apache Atlas-based tags or classifications, geo-location, etc., Ranger provides a holistic approach to data governance that encompasses both ABAC and RBAC approaches.
  • Myth #2: Apache Ranger follows a rejection-based approach to access control – In fact, it’s quite the contrary: Apache Ranger follows the industry best practice of writing access policies with the least privilege. Under this approach, users are denied by default unless a policy is in place that specifically grants them access to the requested data. For example, a user may have Select but not Update privileges.
  • Myth #3: Apache Ranger produces a large number of access policies that are difficult to maintain – Apache Ranger leverages the best practices of Allow and Deny conditions to deliver a precise level of access control to enterprises. The ability to support conditions for deny/allow along with specific exclude/include conditions enables security and compliance administrators to achieve fine-grained access control by writing a small set of easily understandable policies. In some cases, what would have required a dozen roles and permissions to specify can now be done with a single simple policy in Apache Ranger’s comprehensive policy framework.
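
The three points above (attributes alongside roles, deny by default, and allow/deny conditions with exceptions) can be sketched together in a few lines of Python. This is a simplified illustration, not Ranger's implementation or API: the class names, the resource and tag names, and the users are hypothetical, and the evaluation order used here (deny conditions and their exceptions first, then allow conditions and their exceptions, otherwise deny) is an assumption made for the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Request:
    user: str
    groups: frozenset[str]
    action: str
    resource: str
    tags: frozenset[str] = frozenset()  # e.g. classifications on the resource


@dataclass
class Condition:
    """Matches a request by user or group (subject) and by action."""
    actions: set[str]
    users: set[str] = field(default_factory=set)
    groups: set[str] = field(default_factory=set)

    def matches(self, req: Request) -> bool:
        subject_ok = req.user in self.users or bool(req.groups & self.groups)
        return subject_ok and req.action in self.actions


@dataclass
class Policy:
    """Applies either to a named resource or to any resource carrying a tag."""
    resource: Optional[str] = None
    tag: Optional[str] = None
    allow: list[Condition] = field(default_factory=list)
    allow_except: list[Condition] = field(default_factory=list)
    deny: list[Condition] = field(default_factory=list)
    deny_except: list[Condition] = field(default_factory=list)

    def applies_to(self, req: Request) -> bool:
        by_resource = self.resource is not None and self.resource == req.resource
        by_tag = self.tag is not None and self.tag in req.tags
        return by_resource or by_tag


def _hit(conditions: list[Condition], req: Request) -> bool:
    return any(c.matches(req) for c in conditions)


def is_allowed(req: Request, policies: list[Policy]) -> bool:
    relevant = [p for p in policies if p.applies_to(req)]
    # Assumed order: a matching deny (without a matching exception) wins.
    for p in relevant:
        if _hit(p.deny, req) and not _hit(p.deny_except, req):
            return False
    # Then a matching allow (without a matching exception) grants access.
    for p in relevant:
        if _hit(p.allow, req) and not _hit(p.allow_except, req):
            return True
    # Least privilege: nothing granted access, so the request is denied.
    return False


# Illustrative policies only.
POLICIES = [
    Policy(
        resource="sales.orders",
        allow=[Condition(actions={"select"}, groups={"analysts"})],
        allow_except=[Condition(actions={"select"}, users={"eve"})],
    ),
    Policy(
        tag="PII",
        deny=[Condition(actions={"select"}, groups={"analysts"})],
        deny_except=[Condition(actions={"select"}, groups={"compliance"})],
    ),
]

analysts = frozenset({"analysts"})
can_select = is_allowed(Request("ana", analysts, "select", "sales.orders"), POLICIES)
can_update = is_allowed(Request("ana", analysts, "update", "sales.orders"), POLICIES)
```

Here `can_select` is true and `can_update` is false, mirroring the Select-but-not-Update example: no policy grants `update`, so it is denied by default. The two policies, one resource-based and one tag-based, stand in for what might otherwise be several separate roles and grants.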

Four years after becoming a TLP, Apache Ranger is now a commonly used data governance framework for on-premises data lakes. Deployed across thousands of companies around the world and managing petabytes of data, it has proven to be the scalable, flexible data governance framework needed to solve the problem of managing data stored in disparate and heterogeneous Big Data environments.

Looking to the future, the distributed cloud platforms that enterprises use today have created a problem similar to that of the Big Data environments of the past. It is difficult to secure data distributed across different cloud environments because of the disparate access control mechanisms offered by cloud service providers. For enterprises migrating to the cloud, there are lessons to be learned from seven years of Apache Ranger community development and support, which can now be applied to the cloud.

The realities behind the myths outlined above double as best practices for responsibly managing data in the cloud. Specifically, when selecting a data governance solution for the cloud, enterprises should consider an access control method which can: (a) provide a centralized platform to define and manage access control policies across on-premises environments, cloud environments, and cloud-native services; (b) leverage both attribute-based and role-based access control; (c) create access policies based on least privilege; and (d) maximize fine-grained security without the headache of managing an exponential number of policies. By applying these best practices, cultivated from my experience developing Apache Ranger, you will be well on your way to optimizing your cloud data privacy and governance infrastructure.
