Big Data or Small Data? The Correct Answer is Both

In this special guest feature, Dr. Ricardo Baeza-Yates, CTO at NTENT, discusses why it’s not enough to weigh data decisions on the descriptor of big versus small alone; a number of other factors must be considered. As CTO, he oversees the technical vision of the company. Prior to NTENT, he spent 10 years at Yahoo!, ultimately rising to Vice President and Chief Research Scientist. He has also been a professor at Universidad de Chile since 1985, where he founded and directed the Center for Web Research and twice served as Computer Science Department Chair, and a professor at Universitat Pompeu Fabra in Barcelona since 2005, where he founded and directed the Web Research Group. Ricardo is an ACM and IEEE Fellow with over 500 publications, tens of thousands of citations, multiple awards and several patents. He has co-authored several books, including “Modern Information Retrieval”, the most widely used textbook on search. He earned Bachelor’s and Master’s degrees in Computer Science and Electrical Engineering from the University of Chile and a Ph.D. in Computer Science from the University of Waterloo.

In the current era of big data, content is collected in massive, steady, heterogeneous streams from human-related sources. Characterized by volume, variety and velocity, big data is not only large; it is also diverse and produced quickly. Its main sources range from computer networks, social media activity and browser history to sensor technology and commercial transactions.

The challenges of processing and analyzing big data have driven progress in several areas of computer science, such as parallel processing and machine learning, and have pushed the limits of datacenter design, storage capacity and communication bandwidth. While important research problems remain with respect to big data, some improvements can be had simply by using more data. For most companies, however, big data is impractical: too costly, or impossible to grasp at all. SMBs lack the budget for such efforts, and their goals differ from those of large companies. Instead, they rely on small data.

Small data can best be summarized as any piece of data that is small enough for humans to comprehend, access, interpret and use to take specific action. In its simplest form, a single bit that drives a yes or no decision is considered small data.

Big data is often sliced into myriad smaller, coherent, specialized datasets; indeed, its most common purpose is to produce small data, which is usually more effective for the task at hand [1]. Small data is available, precise and complete while providing valuable insight into individuals, small groups and communities. It describes each person in the right context and often leads to innovation, adding a level of value that big data can’t quite match.
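
To make that big-to-small reduction concrete, here is a minimal sketch in plain Python. The event stream and field names (events, country, clicked_ad) are hypothetical assumptions for illustration, not data from the article:

```python
from collections import defaultdict

# Hypothetical raw "big data": one dict per click event.
# In practice this stream would hold millions of rows.
events = [
    {"user": "u1", "country": "CL", "clicked_ad": True},
    {"user": "u2", "country": "CL", "clicked_ad": False},
    {"user": "u3", "country": "CA", "clicked_ad": True},
]

# Reduce the stream to a small, human-readable summary:
# click-through rate per country.
totals = defaultdict(lambda: [0, 0])  # country -> [clicks, impressions]
for e in events:
    totals[e["country"]][0] += e["clicked_ad"]
    totals[e["country"]][1] += 1

for country, (clicks, impressions) in sorted(totals.items()):
    print(f"{country}: CTR = {clicks / impressions:.0%} ({impressions} events)")
```

The raw stream may be arbitrarily large, but the summary that actually drives a decision fits on one screen: that summary is the small data.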

It’s not enough to weigh data decisions on the descriptor of big versus small alone. Other factors must be considered (see the checklist sketch after this list):

  • What is the scope of the data? How exhaustive is it relative to the problem at hand?
  • What about the resolution and identity of the data? How fine grained is it and how well can every item be identified?
  • How relational is the data? How easy is it to conjoin different datasets through encoding or common fields?
  • Does it have flexibility? How easy is it to extend the data, add new fields or scale in size?
  • How is privacy handled? For most people it is a prominent and highly relevant dimension.
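
One lightweight way to make these questions operational is a simple checklist structure. The sketch below assumes nothing beyond the dimensions listed above; the class name, fields and example values are illustrative, not a standard API:

```python
from dataclasses import dataclass

@dataclass
class DatasetAssessment:
    """A hypothetical checklist for one dataset, one field per dimension."""
    scope: str        # how exhaustive the data is for the problem at hand
    resolution: str   # how fine-grained items are and how well they identify
    relational: str   # how easily it joins other datasets via common fields
    flexibility: str  # how easily new fields can be added or size scaled
    privacy: str      # what personal information is exposed, and to whom

    def summary(self) -> str:
        """Render the assessment as a short, human-readable report."""
        return "\n".join(f"{k}: {v}" for k, v in vars(self).items())

# Example: assessing a small CRM export before analysis.
report = DatasetAssessment(
    scope="all 2023 customers, but no prospects",
    resolution="one row per transaction, stable customer IDs",
    relational="joins to support tickets via customer_id",
    flexibility="easy to add columns; fits in memory",
    privacy="contains emails; must be masked before sharing",
)
print(report.summary())
```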

To truly benefit from big data, we must also explore the limits of small data. Businesses must determine how much data they actually need and find ways to maximize existing resources. They must also drill down into specifics: addressing privacy issues, identifying data bias and comparing results with others. Having less data doesn’t make it any less difficult to analyze: computation may be simpler, but in most cases supervised deep learning cannot be used, due to a lack of training data and an unwillingness to incur the cost of acquiring it.

Data is also dynamic, so new problems arise, such as bias detection and correction, or quantifying error and uncertainty. To top it off, each of these interdependencies has its own trade-offs, which often go unstudied. If we consider that in most cases the target data is personal and lives on a small portable device, then we must preserve privacy and/or solve the problem on a device with limited computing power, memory, communication and energy.
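
As one illustration of privacy-preserving computation on such a device, here is a minimal sketch of randomized response, a classic form of local differential privacy. It is not the article’s method, and the probability parameter and simulation are illustrative assumptions:

```python
import random

def randomized_response(truth: bool, p_truth: float = 0.75) -> bool:
    """Report the true bit with probability p_truth, else a fair coin flip.

    The device never reveals its raw answer with certainty, giving each
    individual response plausible deniability.
    """
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports: list[bool], p_truth: float = 0.75) -> float:
    """Invert the known noise to estimate the population's true 'yes' rate."""
    observed = sum(reports) / len(reports)
    # observed = p_truth * true + (1 - p_truth) * 0.5  =>  solve for true
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Simulate 10,000 devices where 30% of users would truthfully answer "yes".
random.seed(42)
reports = [randomized_response(random.random() < 0.30) for _ in range(10_000)]
print(f"estimated true rate: {estimate_true_rate(reports):.3f}")  # ~0.30
```

Each device reveals only a noisy bit, yet aggregating many such bits recovers the population statistic, which ties back to the single yes/no bit that counts as small data in its simplest form.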

In a 2013 experiment, 29 teams comprising 61 analysts were given the same dataset and the same hypothesis to test [3]. The result was disagreement: 20 of the 29 teams found support for the hypothesis, while the rest rejected it. The most notable takeaway, however, was that every team took a different analytical path to its result. This suggests that the main source of bias may lie not in the data, but in how people use it.

Sources:

[1] R. Baeza-Yates (2013). “Big Data or Right Data?” In Proceedings of the International Workshop on Foundations of Data Management, CEUR Workshop Proceedings, vol. 1087.

[2] M. Lindstrom (2016). Small Data: The Tiny Clues That Uncover Huge Trends. St. Martin’s Press, NY, USA.

[3] R. Silberzahn et al. (2015). “Many Analysts, One Dataset: Making Transparent How Variations in Analytical Choices Affect Results.” Center for Open Science, Virginia, USA. URL: https://osf.io/gvm2z/

 
