Enabling Federated Querying & Analytics While Accelerating Machine Learning Projects

In this special guest feature, Brendan Newlon, Solutions Architect at Stardog, indicates that for an increasing number of organizations, a semantic data layer powered by an enterprise knowledge graph provides the solution that enables them to connect relevant data elements in their true context and provide greater meaning to their data. Stardog is a leading Enterprise Knowledge Graph (EKG) platform provider.

Just as the DevOps movement has driven greater automation of the software development lifecycle and increased the speed with which developers can get code into production, it also promises to improve the speed at which data can be provisioned to support both production applications, via regularly scheduled pipelines, and on-the-fly analysis.

Organizations today recognize the excessive cost and latency that result when data engineering teams must constantly wrangle data to make it ready for analysis. By logically connecting enterprise data, knowledge graphs identify the key data elements that make sense to the business. Analytics applications that point to a semantic data layer powered by an enterprise knowledge graph can offer end users better and faster services by reaching directly into the data lake. This reduces query latency dramatically, often from tens of seconds to hundreds or even dozens of milliseconds.

The advent of Machine Learning (ML) created even more challenges. When developing an ML strategy, many organizations focus on ensuring that data is accessible, reusable, interpretable, and high quality, but with infrastructures that include a data lakehouse this has proved difficult. Most businesses today use statistical learning to model and understand complex datasets through data science projects. However, rule-based AI is growing, and this approach includes everything from making intelligent inferences about schemas to expedite data integration to assembling techniques for text analytics or Natural Language Processing (NLP). The most successful data science projects combine more than one source of data, and that data science nightmare has become less terrifying with the advent of data lakes. When it comes to combining data sources and datasets, ontologies and context help get machine learning projects through the last mile.

Knowledge graphs can assist by making it simpler to feed sound and rich data into ML algorithms. Leveraging industry-standard models and ontologies, organizations can model their domain knowledge and connect disparate data sources across the enterprise. By showing meaningful relationships between data, the business is able to maximize the use and reuse of its internal content and lay the foundation for AI and semantic applications.

The Benefits of Using Knowledge Graphs and Machine Learning Together

The inherent traits of knowledge graphs make them a key part of a modern AI and ML strategy because they create advantages for organizations including:

  • Improve Productivity. We all know that data scientists spend a great deal of time cleansing data to produce the desired model or expected results. Knowledge graphs save data scientists time by allowing them to train models directly on unified data. Some solutions also include built-in inference capabilities which allow users to resolve conflicting data definitions without the need to change or copy the underlying data. Once business and domain rules are captured in the data model, the system will apply these rules at query time.
  • Leverage Existing Tools. Unlike graph databases that merely act as a knowledge graph, knowledge graphs with built-in virtualization operate as a true data layer: data stays in its source systems, which helps organizations maintain data accuracy and benefit from the security features of existing tools. Additionally, the output of existing models can easily be put back into the knowledge graph using existing infrastructure solutions.
  • Quickly Jump Start AI and Machine Learning. With built-in predictive analytics and similarity search, knowledge graphs support rapid model development and iteration for data analysis. More importantly, they allow users to extract patterns from their data and make intelligent predictions based on them.
  • Surmise New Facts. Better data means better learning. Machine learning complements logical reasoning, while inference expresses all the implied and predicated relationships and connections between data sources. Used together, they create a richer and more accurate view of the available data and provide much-needed context, not just volume.
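
To make the idea of applying business rules at query time more concrete, here is a minimal Python sketch. It is not how any particular knowledge graph product implements inference; the facts, the rule table, and the function names (`inferred_types`, `instances_of`) are all hypothetical and chosen only for illustration. Two sources describe the same concept with different words ("Client" and "Customer"), and a rule stored alongside the model reconciles them while the query runs, without changing or copying the stored facts.

```python
from typing import Iterable

Triple = tuple[str, str, str]

# Hypothetical facts from two sources that name the same concept differently.
FACTS: set[Triple] = {
    ("acme", "type", "Client"),      # source A says "Client"
    ("globex", "type", "Customer"),  # source B says "Customer"
    ("initech", "type", "Prospect"),
}

# Business rules kept in the model, not in the data: each entry says
# "anything of the key type is also of the value type".
SUBCLASS_RULES: dict[str, str] = {
    "Client": "Customer",
    "Customer": "Account",
}


def inferred_types(asserted: str) -> set[str]:
    """Return the asserted type plus every type implied by the rules."""
    types = {asserted}
    current = asserted
    # Follow the rule chain; stop at the end or if a rule loops back.
    while current in SUBCLASS_RULES and SUBCLASS_RULES[current] not in types:
        current = SUBCLASS_RULES[current]
        types.add(current)
    return types


def instances_of(wanted: str, facts: Iterable[Triple]) -> list[str]:
    """Answer a type query, applying the rules while the query runs."""
    return sorted(
        subject
        for subject, predicate, obj in facts
        if predicate == "type" and wanted in inferred_types(obj)
    )


if __name__ == "__main__":
    print(instances_of("Customer", FACTS))
```

Because the rule lives outside the data, changing a definition means editing `SUBCLASS_RULES` rather than rewriting or duplicating the underlying records.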

Creating Value: How a Semantic Layer Connects Data Outside a Data Lakehouse

While a lakehouse provides a view of the data lake sitting beside it, it usually cannot capture the full picture of an organization’s data landscape, because data may remain siloed outside the data lake in places such as multi-cloud apps. Sometimes this is for good reasons, such as regulatory or sovereignty concerns. Adding a semantic layer to the data lake preserves all the benefits of the lakehouse, such as scalability and lower cost-per-GB storage, while supporting quick unification that reduces or eliminates the need for complex ETL pipelines and manual data mart stewardship.

A knowledge-graph-powered semantic data layer functions as the go-between for an organization’s storage and consumption layers. It acts as the glue and the multiplier, connecting and enhancing all data to deliver greater value to citizen data scientists and analysts in the context of actual business use cases, without the need for additional IT involvement.
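
As a rough illustration of a semantic layer sitting between storage and consumption, the Python sketch below maps two sources into one shared vocabulary at query time instead of copying them through an ETL job. Everything in it is an assumption made for the example: the source records, the field names, the mapping functions, and `describe` are hypothetical and do not represent any vendor's API.

```python
from typing import Callable, Iterator

Triple = tuple[str, str, str]

# Hypothetical source 1: rows as they sit in the data lake.
LAKE_ORDERS = [
    {"cust_id": "C1", "order_total": "120.50"},
    {"cust_id": "C2", "order_total": "80.00"},
]

# Hypothetical source 2: records that stay in a cloud app outside the lake.
CRM_ACCOUNTS = [
    {"accountId": "C1", "country": "DE"},
    {"accountId": "C2", "country": "US"},
]


def lake_mapping() -> Iterator[Triple]:
    """Expose lake rows in the shared vocabulary, read on demand."""
    for row in LAKE_ORDERS:
        yield (row["cust_id"], "orderTotal", row["order_total"])


def crm_mapping() -> Iterator[Triple]:
    """Expose cloud-app records in the same vocabulary, read on demand."""
    for record in CRM_ACCOUNTS:
        yield (record["accountId"], "country", record["country"])


# The "semantic layer" here is just the list of mappings; no data is copied.
MAPPINGS: list[Callable[[], Iterator[Triple]]] = [lake_mapping, crm_mapping]


def describe(customer: str) -> dict[str, list[str]]:
    """Collect everything known about one customer across all sources."""
    result: dict[str, list[str]] = {}
    for mapping in MAPPINGS:
        for subject, predicate, value in mapping():
            if subject == customer:
                result.setdefault(predicate, []).append(value)
    return result


if __name__ == "__main__":
    print(describe("C1"))
```

Since each mapping reads its source only when a query runs, a record added to either source shows up in the next call to `describe` with no pipeline to rerun. A real deployment would push queries down to the sources rather than scan them, which this sketch does not attempt.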

This combination generates a direct ROI over data warehouses lacking the semantic layer. Let’s look at Boehringer Ingelheim as an example. As the world’s largest privately held pharmaceutical company, it had teams of researchers working independently to develop new treatments. However, data was often siloed within these groups, making it difficult to link target, gene, and disease data across various parts of the company. Boehringer Ingelheim had a vision of making data available “Wikipedia-style” for the entire organization. This was the impetus behind its Semantic Integration Project, which would build a semantic layer atop Boehringer’s data lake to provide a consolidated, one-stop shop for nearly all of its R&D data. With all relevant data logically connected, bioinformaticians can search for specific diseases, studies, or genes and easily explore how individual elements relate to one another. The new linked data dictionary enabled users to fetch data directly from the data lake and put it into R, ready for analysis. Finally, by increasing the connectedness of its data, Boehringer has been able to simplify complex data analysis efforts and answer complex questions more quickly.
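
The kind of exploration described here, finding how a gene, a target, a study, and a disease relate once their data is connected, can be sketched as a path search over a small graph. The Python below is only an illustration under stated assumptions: the entity names are placeholders, the edges are invented for the example and are not Boehringer Ingelheim data, and `find_path` is a hypothetical helper using a plain breadth-first search.

```python
from collections import deque
from typing import Optional

Triple = tuple[str, str, str]

# Hypothetical, placeholder edges; each could come from a different team's silo.
EDGES: list[Triple] = [
    ("GENE_A", "encodes", "TARGET_1"),
    ("TARGET_1", "associatedWith", "DISEASE_X"),
    ("STUDY_7", "investigates", "DISEASE_X"),
    ("GENE_B", "encodes", "TARGET_2"),
]


def build_adjacency(edges: list[Triple]) -> dict[str, set[str]]:
    """Index the edges so each entity can be explored in both directions."""
    adjacency: dict[str, set[str]] = {}
    for subject, _predicate, obj in edges:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)
    return adjacency


def find_path(start: str, goal: str,
              edges: list[Triple] = EDGES) -> Optional[list[str]]:
    """Return a shortest chain of entities linking start to goal, if any."""
    adjacency = build_adjacency(edges)
    parents: dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path: list[str] = []
            step: Optional[str] = node
            while step is not None:
                path.append(step)
                step = parents[step]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    print(find_path("GENE_A", "STUDY_7"))
```

When the edges live in separate silos, no single team can run this search; once they are connected in one graph, the link from a gene to a study falls out of a simple traversal.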

Data lakes are highly effective at consolidating data that has naturally amassed in silos. However, to truly deliver ROI, data lakes need to demonstrate business value. To understand how to accelerate ROI from data lake investments, organizations must start by answering questions such as: Why are we looking to consolidate enterprise data in the first place? Why do we need to cut through siloed architectures?

Once answered, the business will be well on its way to giving users the ability to answer complex questions. For an increasing number of organizations, a semantic data layer powered by an enterprise knowledge graph provides the solution that enables them to connect relevant data elements in their true context and provide greater meaning to their data.