Interview: Ayush Parashar, Co-Founder and Vice President of Engineering at Unifi Software

I recently caught up with Ayush Parashar, Co-Founder and Vice President of Engineering at Unifi Software, to discuss the role and value of metadata as enterprises embrace self-service access to data. Ayush is a technologist with a passion for generating value out of big data with deep expertise in product development and enterprise customer implementations. He has software engineering expertise around big data solutions and has strong domain knowledge around Hadoop, MPP Database and Systems, Performance Engineering and Data Integration. Before Unifi he was part of the founding engineering team at Greenplum.

Daniel D. Gutierrez – Managing Editor, insideBIGDATA

Gartner recently issued a report* that found data management architectures and technologies are rapidly shifting to become more highly distributed. The firm predicts that heading into 2018 strategies will continue to shift in this direction. Data and analytics leaders can take advantage of this trend by enabling governance, improving sharing ability and deriving business value through metadata.

With regard to metadata, Unifi leverages metadata in almost every aspect of what they do. From delivering an elegant user experience to powering AI, it’s a key asset in supporting the capabilities and functionality they are able to deliver to a business user.

Most enterprises run their business on a vast array of disparate data sources. That being the case, we have to have the ability to talk to every data source an enterprise might be leveraging, as well as learn what we think might be interesting inside that data source. This is step 1 of the search and discovery process. What we learn in this process is extremely valuable across our platform for Catalog, Prep, Governance, Collaboration and Automation—it’s how we leverage the metadata across our entire ecosystem that distinguishes Unifi in the marketplace.

insideBIGDATA: In your opinion, what are the most important things an analyst should be armed with before they embark on the journey of insight, discovery and iteration?

Ayush Parashar: Most of the time insight creation has a problem statement associated with it such as, ‘Why are sales lower in this product category?’, ‘How can I sell more to certain segments of customers?’ or ‘Why do we have a high drop rate on videos on the west coast?’. For all these questions, an analyst needs to start by finding primary data related to that question. This results in a relevant search that’s personalized for that user and returns data of interest.

The next step is to understand the data of interest in depth by using metadata definitions, profiling the data and looking at its statistics. If the data is not in a form that’s ready for visual analysis, discovery is followed by some form of data preparation in which other kinds of data are combined. Then business rules and filters are applied, and finally data is aggregated at different levels, for example by geography and/or time, to create an insight. The data preparation stage is iterative. After looking at any output, an analyst may realize that a few things were missed, or may want to look at the data in a different way.
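As an illustration of the preparation steps described above (combine, filter, aggregate), here is a minimal Python sketch. The sample records, field names and the `prepare` function are hypothetical and are not taken from the interview or from Unifi's product; they only show the shape of the workflow.

```python
from collections import defaultdict

# Hypothetical sample data; none of these fields come from the interview.
sales = [
    {"product_id": 1, "region": "West", "month": "2017-10", "amount": 120.0},
    {"product_id": 2, "region": "West", "month": "2017-10", "amount": 80.0},
    {"product_id": 1, "region": "East", "month": "2017-10", "amount": 200.0},
    {"product_id": 1, "region": "West", "month": "2017-11", "amount": 50.0},
]
product_categories = {1: "video", 2: "audio"}


def prepare(sales_rows, categories, wanted_category):
    """Combine, filter and aggregate sales rows (illustrative only)."""
    # Step 1: combine sales with another dataset (product categories).
    enriched = [
        dict(row, category=categories[row["product_id"]]) for row in sales_rows
    ]
    # Step 2: apply a business rule / filter.
    filtered = [row for row in enriched if row["category"] == wanted_category]
    # Step 3: aggregate by geography and time.
    totals = defaultdict(float)
    for row in filtered:
        totals[(row["region"], row["month"])] += row["amount"]
    return dict(totals)


video_sales = prepare(sales, product_categories, "video")
for (region, month), total in sorted(video_sales.items()):
    print(region, month, total)
```

Because the stage is iterative, an analyst would typically rerun a function like this with a different filter or grouping after inspecting the output.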

insideBIGDATA: What are examples where having an integrated platform around metadata usage provides benefits to the enterprise and enables a data-driven organization?

Ayush Parashar: Metadata definitions are important for a detailed understanding of the data, to collaborate on it and verify its quality. Metadata definitions and the data’s statistics, available during data discovery or preparation, allow users to choose the right data for their analysis. For example, marketing’s representation of sales data might be different than finance’s representation. If you are from a finance group, you are more likely to look at definitions and the dataset that your group uses. In today’s data-driven enterprises, data and its definitions are shared inside and outside of the organization. A central repository of metadata definitions provides a single source of truth and results in consistent data usage across the enterprise.

insideBIGDATA: How is understanding the business ontology of data important to an analyst?

Ayush Parashar: Business ontology is very important and relevant to enterprises. For instance, the definition of revenue can be very different for a retail company compared to a tech company. The ontology that explains such a definition becomes extremely relevant when work has to be done on related data problem statements. Also, for new employees who join a company, understanding the ontology helps them understand the business in a deeper way and get closer to data that’s pertinent to that business.

insideBIGDATA: What are some examples of things that are interesting or that seem to be related to each other that you’d correlate to another dataset?

Ayush Parashar: Data sets are related to each other in multiple forms. Customer data can be linked to sales data through common keys where you can join them to achieve sales by customer. There can be customer data in a CRM system and Call Center systems that you may want to merge to see a 360-degree view of the customer. There are different relationships between datasets that can be useful in solving different sets of data problems. It’s important for a tool to automatically understand these relationships in order to guide end users to create value.
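The "sales by customer" join mentioned above can be sketched in a few lines of Python. The datasets, the `customer_id` key and the `sales_by_customer` function are assumptions made for illustration; a catalog tool would infer such a shared key automatically rather than have it hard-coded.

```python
# Hypothetical datasets that share a common key, "customer_id".
customers = [
    {"customer_id": "c1", "name": "Acme"},
    {"customer_id": "c2", "name": "Globex"},
]
sales = [
    {"customer_id": "c1", "amount": 100.0},
    {"customer_id": "c1", "amount": 50.0},
    {"customer_id": "c2", "amount": 75.0},
    {"customer_id": "c9", "amount": 10.0},  # no matching customer record
]


def sales_by_customer(customer_rows, sales_rows):
    """Join sales to customers on customer_id and total the amounts."""
    names = {c["customer_id"]: c["name"] for c in customer_rows}
    totals = {}
    for sale in sales_rows:
        name = names.get(sale["customer_id"])
        if name is None:
            continue  # inner-join behaviour: skip sales with no customer
        totals[name] = totals.get(name, 0.0) + sale["amount"]
    return totals


totals = sales_by_customer(customers, sales)
print(totals)
```

The same pattern, keyed on a shared customer identifier, would apply when merging CRM and call center records into a single customer view.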

insideBIGDATA: Are all forms of metadata easily discoverable – for example, GPS data, video, audio clips, etc.?

Ayush Parashar: Most forms of metadata can be scraped from source data and made discoverable including metadata from structured/multi-structured sources—text data, video, audio & GPS related data, etc. However, end user annotation and collaboration is very important too.

insideBIGDATA: Some industry analysts believe there will be an automated enrichment of metadata through machine learning, crowdsourcing, search capabilities, and other processes. Would you agree?

Ayush Parashar: Absolutely! It’s a combination of AI and crowdsourcing that will result in deep and rich metadata creation. AI can result in understanding metadata definitions and categorization based on high level data types—address, SSN, etc. When AI is used on crowd-sourced metadata definitions, it takes the automation experience to a totally different level. AI also plays a part in ‘personalized’ search. Users see relevant information which may differ for each of them when they search for the same key words.

insideBIGDATA: Is metadata critical to solving for GDPR?

Ayush Parashar: Yes, metadata is critical for all of the following problem statements related to GDPR:

  • Find all PII attributes – which takes into account all the datasets that are spread across the enterprise
  • Classify data sets into different categories and levels of sensitivity
  • Restrict access (mask access) to data/metadata that’s classified as PII data
  • Alert – when someone accesses data or metadata that’s sensitive
  • Lastly, to identify metadata when usage of a product is relevant to GDPR, for example to detect unusual patterns of data access or processing of data

Metadata also helps in providing audit trails of data access and usage, which is particularly relevant to the right-to-be-forgotten provisions of the GDPR.
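The first three items in the list above (find PII attributes, classify them, mask access) can be sketched as follows. This is a deliberately simple, rule-based illustration using regular expressions; it is not the AI-driven approach discussed in the interview, and the patterns, labels and function names are all hypothetical.

```python
import re

# Assumed value patterns for two PII types (US-style SSN and email).
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_column(values):
    """Return 'ssn', 'email' or None if every non-empty value matches a pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    if all(SSN_PATTERN.match(v) for v in non_empty):
        return "ssn"
    if all(EMAIL_PATTERN.match(v) for v in non_empty):
        return "email"
    return None


def find_pii(rows):
    """Map each column that looks like PII to its detected type."""
    if not rows:
        return {}
    detected = {}
    for column in rows[0]:
        label = classify_column([str(row[column]) for row in rows])
        if label is not None:
            detected[column] = label
    return detected


def mask(rows, pii_columns):
    """Return copies of the rows with PII column values replaced by '***'."""
    return [
        {key: ("***" if key in pii_columns else value) for key, value in row.items()}
        for row in rows
    ]


rows = [
    {"name": "Ann", "tax_id": "123-45-6789", "contact": "ann@example.com"},
    {"name": "Bob", "tax_id": "987-65-4321", "contact": "bob@example.com"},
]
pii = find_pii(rows)
masked = mask(rows, pii)
print(pii)
print(masked[0])
```

Note that detection here relies on the values rather than the column names, so a column called `tax_id` is still flagged; a real system would combine many more signals, including user annotations.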

insideBIGDATA: What role does it play in preventing a security breach, identifying fraud or detecting a threat?

Ayush Parashar: Metadata can help classify the sensitivity of data, and then rules can be created at different levels to take different actions when responding to a security breach, identifying fraud or detecting a threat.

insideBIGDATA: How are conversations that are captured around the use of various datasets cataloged?

Ayush Parashar: Conversations on the quality, value or validity of a dataset are extremely important and they lead to higher value creation out of data. It makes sense to capture the conversation right next to the dataset in the catalog. That way a tool can also derive insights from these conversations and make recommendations to users that lead to better collaboration.

Making comments searchable helps with discovery. For example, duplicate data sets or similar-sounding data sets can be clarified with comments. All this leads to shared learning, something that dramatically reduces the time, effort and cost of re-inventing the wheel around data discovery. Another key aspect of this shared learning environment is the reduction in reliance on what I call “institutional knowledge”: the concept that a few key people, typically in IT, have detailed knowledge about data, and Heaven forbid they should leave.

insideBIGDATA: From 2016 to 2017 the number of vendors providing solutions for metadata discovery grew. What’s your prediction for this market segment in 2018?

Ayush Parashar: The market for metadata discovery/catalog is going to see robust growth in the coming years. Business users and analysts alike realize they need to have governance, data cataloging/discovery, data preparation and data pipeline automation on the same platform so they can have a 360-degree view of their data. Having metadata discovery as part of a holistic solution will lead to more automation and intelligence being generated, which will enhance the end user experience and create efficiencies for analysts. While it’s a challenging problem, the companies that can achieve this will have an edge.

*Source: Published by Gartner October 31, 2017: Predicts 2018: Data Management Strategies Continue to Shift Toward Distributed

 
