Beware "Big Schema" - insideBIGDATA

In this special guest feature, Michael Blaha writes about the ever increasing dimensionality of enterprise data sets – Big Schema. Michael Blaha is a consultant and trainer who specializes in conceiving, architecting, modeling, designing and tuning databases. He has worked with dozens of organizations around the world. Blaha has authored seven U.S. patents, many articles and seven books, the most recent of which is the “UML Database Modeling Workbook.” He received his doctorate from Washington University in St. Louis, and is an alumnus of GE Global Research in Schenectady, New York.

Most everyone has heard of “big data” – the popular term for data so massive it’s difficult to manage. Today, the volume of search engine queries, online retail sales and Twitter messages regularly exceeds the capabilities of traditional databases.

There’s a complement to big data that we call “big schema”. Modern data can have not only vast volume and high arrival rates, but also diverse structure. Big schema can arise with enterprise data models, large data warehouses and scientific data.

Enterprise data models

An enterprise data model (EDM) describes the essence of an organization – it abstracts multiple apps, combining and reconciling their content. EDMs have many purposes, such as integrating app data, driving consistency across apps, documenting enterprise scope, finding functional gaps and overlaps, and providing a vision for future apps. Many enterprises have dozens of apps, so schema size can be very large.

The UK financial software vendor Avelo has been using an EDM to coordinate and integrate apps. Avelo was formed by the merger of four predecessor companies, so its apps aligned poorly: they had different abstractions, naming approaches and development styles. As a result, it was difficult to construct an EDM.

We limited the scope of Avelo’s EDM to cope with the poor alignment. We started by seeding the EDM via rapid reverse engineering. We browsed each app’s schema to find core concepts – the tables with the most foreign key connections – and used only the top 10. Business experts helped us reconcile the concepts to create a high-level EDM.
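
As a rough illustration of the seeding step, the sketch below ranks the tables of a schema by their number of foreign key connections and keeps the top few. It is not the tooling used at Avelo; it assumes a SQLite database, and the demo schema, table names and function names are all hypothetical. Each foreign key is counted once for the referencing table and once for the referenced table.

```python
import sqlite3
from collections import Counter


def rank_tables_by_fk_connections(conn, top_n=10):
    """Return up to top_n (table, connection_count) pairs, most connected first."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    counts = Counter({table: 0 for table in tables})
    for child in tables:
        # PRAGMA foreign_key_list yields one row per column of each foreign key;
        # rows sharing an id belong to the same (possibly composite) key.
        keys = {
            (row[0], row[2])  # (foreign key id, referenced table)
            for row in conn.execute(f'PRAGMA foreign_key_list("{child}")')
        }
        for _key_id, parent in keys:
            counts[child] += 1
            counts[parent] += 1
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]


def build_demo_schema():
    """Create a small hypothetical schema in memory for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY);
        CREATE TABLE advisor (id INTEGER PRIMARY KEY);
        CREATE TABLE account (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(id),
            advisor_id INTEGER REFERENCES advisor(id)
        );
        CREATE TABLE trade (
            id INTEGER PRIMARY KEY,
            account_id INTEGER REFERENCES account(id)
        );
        CREATE TABLE statement (
            id INTEGER PRIMARY KEY,
            account_id INTEGER REFERENCES account(id)
        );
        CREATE TABLE audit_log (id INTEGER PRIMARY KEY);
        """
    )
    return conn


if __name__ == "__main__":
    for table, count in rank_tables_by_fk_connections(build_demo_schema(), top_n=3):
        print(table, count)
```

In the demo schema, account comes out on top because it both references and is referenced by other tables. A ranking like this only suggests candidate concepts; reconciling them still takes the business experts.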

Large data warehouses

Data warehouses can also involve big schema. A data warehouse combines data from day-to-day operational apps and places it on a common basis for analysis and reporting. A large enterprise can have a great deal of data to analyze, leading to many data warehouse tables.

We can’t do much to restrain the size of a large data warehouse. But by using agile data modeling, we can make sure that payoff occurs incrementally, as the warehouse is constructed.

We recently worked on a large data warehouse encompassing multiple departments that illustrates both good and bad approaches. One department’s staff focused on building their portion of the warehouse and deferring usage. After many months of work, they are still building. Another department chose to build incrementally, according to business demand. This latter approach has been more successful and easier to justify for continued funding.

Scientific data

Scientific data is a third source of big schema. Scientific apps are extremely complex, involving time series, complex data types, and deep dependencies and constraints. Scientific schema is often not only large, but also difficult to represent.

Many years ago, we worked on the Process Data eXchange Institute (PDXI) project sponsored by the American Institute of Chemical Engineers (AIChE). The purpose of PDXI was to produce a data model to serve as the basis for a data exchange standard for chemical engineering apps. Chemical plants have a wide variety of equipment, complex mixtures of substances and a range of operating conditions, so there is a lot of data to represent. The PDXI model ran to several hundred pages. This was too much to manage, too much to explain and too much to understand.

In retrospect, we now realize that we should have used more generic data structures. For example, the PDXI model had fifty pages for equipment, such as tanks, reactors, pumps and distillation columns. A better model would have avoided all this detail by combining data and metadata. Then the fine particulars of each kind of equipment could have been specified elsewhere.
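
To make the idea of combining data and metadata concrete, here is a minimal sketch in Python. It is not the PDXI model; the class names, attribute names and units are hypothetical. Each kind of equipment is described by metadata (an equipment type with attribute definitions) rather than by its own dedicated structure, and the data for one piece of equipment is checked against that metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class AttributeDef:
    """Metadata: one attribute that a kind of equipment may carry."""
    name: str
    unit: str
    required: bool = False


@dataclass
class EquipmentType:
    """Metadata: a kind of equipment, described by its attribute definitions."""
    name: str
    attributes: Dict[str, AttributeDef] = field(default_factory=dict)

    def define(self, name: str, unit: str, required: bool = False) -> "EquipmentType":
        self.attributes[name] = AttributeDef(name, unit, required)
        return self  # allow chained definitions


@dataclass
class Equipment:
    """Data: one piece of equipment, validated against its type."""
    tag: str
    kind: EquipmentType
    values: Dict[str, float] = field(default_factory=dict)

    def set_value(self, name: str, value: float) -> None:
        # Reject attributes that the equipment type does not define.
        if name not in self.kind.attributes:
            raise KeyError(f"{self.kind.name} has no attribute {name!r}")
        self.values[name] = value

    def missing_required(self) -> List[str]:
        """Names of required attributes that have no value yet."""
        return sorted(
            attr.name
            for attr in self.kind.attributes.values()
            if attr.required and attr.name not in self.values
        )


# Adding a new kind of equipment means adding rows of metadata, not new structures.
tank = EquipmentType("Tank").define("volume", "m3", required=True).define("design_pressure", "kPa")
pump = EquipmentType("Pump").define("flow_rate", "m3/h", required=True)

t101 = Equipment("T-101", tank)
t101.set_value("volume", 50.0)
```

The trade-off is that a generic structure moves detail out of the schema and into data, so the rules for each kind of equipment must be maintained and validated as metadata instead.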

So when you build applications, think not only about big data, but also big schema. For where there is big data, there is often big schema. And big schema can even arise by itself.

Comments

Alan Sill says

May 13, 2014 at 9:42 am

Michael, you raise a good point. Many scientific data formats are well described for particular sub-fields, but the general problem you mention has often gone unaddressed.

For this reason the Open Grid Forum has created the Data Format Description Language standard, which does not require that you rewrite the format of your data to conform to a particular schema or add other overhead, but allows for the efficient description of existing data formats in a flexible and systematic way.

A DFDL description allows any text or binary data to be read from its native format and to be presented as an instance of an information set. DFDL also allows data to be taken from an instance of an information set and written out to its native format. You can use DFDL to create a schema for description of the *logical model* of the data, leading to an open standard capable of describing almost any format of text or binary data. We have several implementations from the commercial sector as well as an open-source package called “Daffodil” that implements DFDL from the National Center for Supercomputing Applications (NCSA).

Learn more about DFDL and its flexible solution to the problem that you mention at https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
