Data Refinement: The Dirty Side of Data Science


In this special guest feature, Stephen H. Yu of eClerx looks at the importance of data refinement in order to maximize the benefit of big data assets. Stephen H. Yu is a world-class database marketer and Associate Principal, Analytics & Insights Practice Lead for eClerx. Stephen has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from over 28 years of experience in best practices of database marketing. Prior to eClerx, he served as VP, Data Strategy & Analytics at Infogroup, and previously he was the founding CTO of I-Behavior Inc.

Many people are talking about the new role of the Data Scientist. Many aspiring young professionals openly express a desire to enter the world of data science, analytics and Big Data, but are daunted by the complexity of the data operations landscape. All work is noble, but it’s not always glorious. Intrepid young data scientists quickly learn that the data business resembles the construction industry, where development involves an ample amount of dirty work and sweat, while the end result is clean and sleek, showing no signs of the immense building process.

Most data sets in our age of “Big Data” are utterly inadequate for advanced analytics, let alone for end users. That is the primary reason why most analysts spend 50 to 80 percent of their valuable time fixing and preparing unstructured, uncategorized and unrefined raw data sets. The situation is not much different from trying to run a brand-new sports car on unrefined oil.

With shiny new tool sets and myriad buzzwords surrounding data operations, consumers of information often overlook the building process, believing insights gleaned from data are instantaneous and automated. We live in the “pre-AI” age, where getting clear answers to qualitative questions from quantitative analysis requires human intervention. Yes, advanced data science gives us extensive means to visualize and cross-section data, but human beings are still needed to ask questions in a logical fashion and to find the significance of the resulting insights.

Data Refinement is the Key

Serious data scientists need to make data refinement their first priority, and break down the data work into three steps:

  • Collection
  • Refinement
  • Delivery


In this age of Big Data, dealing with the sheer size of data is a challenge in itself, and that is why most modern databases are optimized for collection, storage and rapid retrieval. The three V’s (volume, velocity and variety) are the most commonly cited attributes of Big Data, but too much emphasis is given to them, resulting in many frustrated decision makers sitting on mounds of unstructured, dirty data. Speedy mass collection ensures databases are comprehensive, but data refinement is needed to make the data usable.


To be instrumentally useful, data must be converted into “answers” to questions. In other words, Big Data must get smaller through refinement.

The data refinement process is the most important (but often neglected) part of all data operations, where data must be selected (or purged), standardized, tagged, categorized and summarized properly before it’s used for advanced analytics. Most prominently, refined data feeds statistical modeling, which converts piles of raw data into usable answers.
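As a rough illustration of the steps named above (select or purge, standardize, tag and categorize, summarize), here is a minimal Python sketch. Everything in it is an assumption made for the example: the sample records, the field names, the keyword-to-category lookup and the `refine` function are hypothetical, not taken from any real system, and a production process would involve far more rules than this.

```python
from collections import defaultdict

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"customer": " Alice ", "state": "ny", "item": "running shoes", "amount": "120.00"},
    {"customer": "alice", "state": "NY", "item": "Yoga Mat", "amount": "35.50"},
    {"customer": "Bob", "state": "N.J.", "item": "laptop", "amount": "899"},
    {"customer": "", "state": "NJ", "item": "laptop", "amount": "899"},         # no customer
    {"customer": "Bob", "state": "nj", "item": "headphones", "amount": "n/a"},  # bad amount
]

# Hypothetical keyword-to-category lookup used for tagging.
CATEGORY_KEYWORDS = {
    "shoes": "fitness",
    "yoga": "fitness",
    "laptop": "electronics",
    "headphones": "electronics",
}


def parse_amount(text):
    """Return the amount as a float, or None if it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def categorize(item):
    """Tag an item with the first matching category, else 'other'."""
    lowered = item.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in lowered:
            return category
    return "other"


def refine(records):
    """Purge, standardize, categorize and summarize raw records per customer."""
    summary = defaultdict(lambda: {"state": None, "total": 0.0, "categories": set()})
    for rec in records:
        name = rec["customer"].strip().title()          # standardize the name
        amount = parse_amount(rec["amount"])
        if not name or amount is None:                  # purge unusable rows
            continue
        row = summary[name]
        row["state"] = rec["state"].replace(".", "").upper()  # standardize the state
        row["total"] += amount                          # summarize spend
        row["categories"].add(categorize(rec["item"]))  # tag and categorize
    return dict(summary)


refined = refine(RAW)
for name, row in sorted(refined.items()):
    print(name, row["state"], round(row["total"], 2), sorted(row["categories"]))
```

In this toy case five messy rows shrink to two customer-level summaries, which is the sense in which refinement makes data "smaller": the output is closer to an answer (who spends how much, on what) than the raw rows are.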


Once the answers are formulated, data players should then deliver them through optimized channels, in the proper format and at the proper frequency. Delivery can come in many forms depending on the goal and desired action. Marketers need vital statistics and key performance metrics through online dashboards. In the field, sales forces may require real-time information delivered to handheld devices. Product developers and marketers alike apply model-based personas to customize products and services on a personal level. The ways to act on data are nearly limitless, but even the most efficient delivery mechanism is not useful without proper data refinement.

We are clearly living in the age of ubiquitous data, yet more and more users are lost in it. To make data usable for high-level science, analysts must reduce the data into bite-size answers and deliver insights to their target audience in a consistent, useful manner.


