Big Data and Butterflies

In this special guest feature, Mark Gross, President & CEO and founder of Data Conversion Laboratory (DCL), asserts that the definition of the term “Big Data” is very fluid historically and will continue to change as technology advances. DCL is a recognized authority on XML implementation and document conversion. Prior to founding DCL in 1981, Mark was with the consulting practice of Arthur Young & Co. Mark has a BS in Engineering from Columbia University and an MBA from New York University. He has also taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker and writer on the topic of automated conversions to XML.

“Big Data”: we’ve all heard the term, but what exactly does it mean? According to Wikipedia, Big Data refers to “data sets that are so large or complex that traditional data processing applications are inadequate,” and that is correct, to a point. The reality is that Big Data means different things to different people and organizations. We all use the term, but not in the same way; in every industry or area of endeavor it means something different.

The National Security Agency (NSA) has often made the news over its collection of phone call recordings and emails. It has billions of voice recordings and emails, and the tools to analyze this enormous amount of data in the hope of preventing terrible things from happening. We know that’s Big Data. But what if you’re not the NSA?

Big Data Is Relative

To me, Big Data simply refers to data sets that are bigger than you’re accustomed to handling. I’ll give some examples. One of our clients, The Optical Society of America, needed to convert approximately 750,000 pages of materials for their archives going back to 1917. That’s Big Data in the technical publishing world. In the legal sector, a litigation firm might have tens of millions of pages in an eDiscovery production. The US Patent Office, also one of our clients, receives five million pages of filings every month. Examples like these change the definition of Big Data on a case-by-case basis, from industry to industry.

Another definition of Big Data emerges when you are dealing with compilations of content arriving from different sources. Take, for instance, Elsevier’s Scopus database (the largest abstract and citation database of peer-reviewed literature), with upwards of 60 million entries. Multiply that by the pages per entry, and you’re getting into fairly large numbers.

Big Data Project Plans Are a Big Help

In every field of endeavor, data sets are getting bigger because we now have the technology and processes to deal with ever larger content collections, and the means to better monetize them. No matter what your definition of Big Data is, when it comes to a conversion project the keys to success are a project plan and the right people to work with. You need to ask yourself:

  1. What does the content look like? You need to develop an inventory of some sort, because for any large collection, the content is likely to be varied and be spread out over multiple locations.
  2. What do you want out of it? Think about the value in your content, and how you envision people using it. What products will you be able to develop now that you have the data?
  3. How will you get there? Plan a pathway, knowing that you may not have all the answers upfront and may need a flexible plan that develops over time as you learn more about what’s in your collections. Remember that agile development provides for flexibility, but doesn’t mean “winging it.”
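
The inventory mentioned in the first question can start out very simply. The following is a minimal sketch, not a description of any tool DCL uses: it assumes the content sits in ordinary folders on disk, and the function names `inventory` and `format_report` are made up for illustration. It walks several locations and tallies how many files of each type it finds and how much space they take up.

```python
import os
import sys
from collections import Counter
from pathlib import Path


def inventory(roots):
    """Count files and total bytes per file extension across several folders.

    `roots` is a list of folder paths (hypothetical locations of the content).
    Returns two Counters keyed by lowercase extension: file counts and byte totals.
    """
    counts = Counter()
    sizes = Counter()
    for root in roots:
        # os.walk visits every subfolder; a missing root simply yields nothing
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = Path(dirpath) / name
                ext = path.suffix.lower() or "(none)"
                try:
                    size = path.stat().st_size
                except OSError:
                    # Unreadable file or broken link: count it, size unknown
                    size = 0
                counts[ext] += 1
                sizes[ext] += size
    return counts, sizes


def format_report(counts, sizes):
    """Return one tab-separated line per extension, most common first."""
    lines = []
    for ext, n in counts.most_common():
        lines.append(f"{ext}\t{n} files\t{sizes[ext]} bytes")
    return "\n".join(lines)


if __name__ == "__main__":
    # Usage: python inventory.py /path/to/archive /another/location
    file_counts, byte_totals = inventory(sys.argv[1:])
    print(format_report(file_counts, byte_totals))
```

A tally like this only shows what file types exist and where the bulk lies; it says nothing about page counts or content quality, which still need sampling and human review.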

The definition of Big Data will continue to change. “Big” today is very different from what it was just a few years ago, and as technology progresses, what counts as Big Data will keep shifting. Think about this: The Smithsonian National Museum of Natural History has a meticulously catalogued archive that includes 30 million dried insects, 4.5 million preserved plants, 7 million preserved fish in jars, birds, butterflies, sea shells, minerals and so much more, all in drawers and cabinets. Many of these specimens are digitized and findable as collections on the museum’s website. The collection for the Department of Botany alone boasts 1,424,662 records. That’s their Big Data.

