
Are Data Scientists Overrated? No, But Their Role is Evolving

In this special guest feature, Daniel Nieten of Red Lambda lends his perspective on the evolving role of the data scientist and where these skill sets are required. Daniel Nieten is CTO of Red Lambda, a technology company that has developed a next-generation IT security and analytics solution for Big Data environments. Dr. Nieten received his undergraduate, master's and Ph.D. degrees in Computer Science & Engineering from the University of Florida, and has been the recipient of numerous awards throughout his career for his scientific and engineering achievements.

In a recent news article about the role of data scientists, the author takes the stance that we need to make data accessible and usable by the non-technical end user and provide tools that help that user make sense of the data. He also cites the data scientist shortage and the high price tag of hiring these high-level staffers as key reasons why analytics and BI solutions need to stop being solely the domain of PhDs and become tools that we can put in the hands of the end user.

I agree with this. Analytics should be accessible and easy to use, frankly, even for data scientists. Ultimately, the targeted user of any data analysis should not just be the data scientist, but whoever will use the data to ask and answer questions and subsequently make business decisions. This is backed by a number of industry studies that indicate a strong desire among executives and decision makers to use advanced analytics to improve their business operations. The desire is there, but those same surveys indicate a great deal of hesitancy. Among the top concerns? Data quality and the shortage (and cost) of experts who can, again, understand the results of the analytics.

Which is exactly where the skills of a data scientist come into play

For example, while many of the visualization tools already do make the data usable by the non-data scientist, they rarely, if ever, operate on the raw data. The process of transforming the raw data into a more usable form (called data munging or wrangling) is one of the expertise areas where data scientists are always going to be required. Building the transformation for the raw data and the models to perform prediction or classification requires some pretty specialized expertise.

Dasu and Johnson[1] assert that the process of cleaning and preparing data for analysis is 80% of the effort. Data cleaning is not a one-time effort: as with everything, schemas change, the types of questions that the data is being used to answer change, and there can be errors in the data. This activity encompasses parsing of the data, outlier detection, and normalization (i.e., consistent units across measurements and consistent vocabulary for contextual information). Hadley Wickham proposed in the Journal of Statistical Software the concept of “tidy data,” which specifically involves the structuring of data to enable analysis. At present this is a manual effort, and it requires the expertise of the data scientist to improve the quality by transforming the raw, messy data into a well-organized or “tidy” data set.
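To make the three cleaning activities named above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the raw temperature readings, the site labels, and the function names are invented, and the outlier rule (a modified z-score built on the median absolute deviation) is just one common choice, not a method the cited authors prescribe.

```python
import statistics

# Hypothetical raw readings: mixed units and inconsistent site labels.
RAW = [
    "site-A, 21.5 C",
    "Site A, 70.7 F",
    "site_a, 22.1 C",
    "SITE B, 20.9 C",
    "site-b, 950 C",   # implausible reading that should be flagged
    "site b, 71.6 F",
    "Site-A, 21.0 C",
    "site_B, 22.4 C",
]


def parse(line):
    """Parsing: split one raw line into (site, value, unit)."""
    site, reading = [part.strip() for part in line.split(",")]
    value, unit = reading.split()
    return site, float(value), unit.upper()


def to_celsius(value, unit):
    """Normalization of units: express every measurement in Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit!r}")


def canonical_site(site):
    """Normalization of vocabulary: one spelling per site."""
    return site.lower().replace("-", " ").replace("_", " ")


def flag_outliers(values, threshold=3.5):
    """Outlier detection: modified z-score using the median absolute deviation."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - median) / mad > threshold for v in values]


def clean(lines):
    """Run all three steps and return one dict per observation."""
    parsed = [parse(line) for line in lines]
    rows = [
        {"site": canonical_site(site), "temp_c": to_celsius(value, unit)}
        for site, value, unit in parsed
    ]
    flags = flag_outliers([row["temp_c"] for row in rows])
    for row, is_outlier in zip(rows, flags):
        row["outlier"] = is_outlier
    return rows
```

Calling `clean(RAW)` returns one row per observation with a single site vocabulary, a single unit, and an outlier flag. Even in this toy case, the judgment calls (which unit is canonical, what counts as implausible) are exactly the decisions that call for someone who understands the data.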

While there are scenarios where a non-technical end user can be brought in, the skills and expertise of a data scientist still come into play. For example:

Data acquisition and normalization

For specification-based data feeds, once the connector is created, these feeds can be used by pretty much any user. The challenge arises with unstructured data, or with structured data that does not have a corresponding schema or specification. This typically requires defining a domain-specific model and constructing components to transform or normalize the data into that model. Without this, there is no context (i.e., the values are just numbers or words). There needs to be something that indicates the contextual relationships of the data points. Therefore, the data scientist is needed to help understand that context.
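As a rough illustration of what a domain-specific model and a normalizing component might look like, consider the sketch below. It is not taken from any particular product: the `LoginEvent` model, the field aliases, and the sample records are all hypothetical, chosen only to show how schema-less records from different feeds can be mapped onto one model that carries the context.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LoginEvent:
    """Hypothetical domain model: gives each value a name, type, and meaning."""
    user: str
    source_ip: str
    timestamp: datetime
    success: bool


# Assumed aliases: the keys that different schema-less feeds might use.
FIELD_ALIASES = {
    "user": ("user", "username", "uid"),
    "source_ip": ("src", "ip", "client_ip"),
    "timestamp": ("ts", "time", "epoch"),
    "success": ("ok", "result", "status"),
}


def _pick(record, field):
    """Return the value stored under the first alias present in the record."""
    for key in FIELD_ALIASES[field]:
        if key in record:
            return record[key]
    raise KeyError(f"no source key for {field!r} in {sorted(record)}")


def _to_bool(value):
    """Map the various spellings of 'succeeded' onto a boolean."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"ok", "success", "true", "1"}


def normalize(record):
    """Transform one raw record (a dict) into the domain model."""
    return LoginEvent(
        user=str(_pick(record, "user")).lower(),
        source_ip=str(_pick(record, "source_ip")),
        timestamp=datetime.fromtimestamp(
            float(_pick(record, "timestamp")), tz=timezone.utc
        ),
        success=_to_bool(_pick(record, "success")),
    )
```

For instance, `normalize({"uid": "bob", "client_ip": "10.0.0.9", "ts": "60", "result": "fail"})` yields a `LoginEvent` with a UTC timestamp and `success=False`. Deciding which aliases are equivalent, and what a value such as "fail" means in a given feed, is the contextual knowledge the data scientist supplies.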

Classification and Prediction

Some algorithms are pretty general, and there are a number of tools that provide a user with drag-and-drop interfaces or very simple syntax to leverage the algorithms. Machine learning experts and data scientists built these algorithms, but they are not required to make them usable. Great for the end user. However, the data scientist is often needed to interpret the results, which could be quite a challenge for a non-data scientist.

Searching and Visualization

These are the two areas that actually have excelled in reducing the dependence on data scientists. Tableau is an excellent tool for building dashboards and visualizing the data in numerous permutations, and there are scores of other solutions that do this as well.

From MetaGrid’s roots in the P2P file sharing environments within higher education, we recognized that this sector didn’t have many highly technical users. That’s why MetaGrid was created to appeal to the non-technical user from its inception through its drag and drop dashboard creation and visualization tools that are very straightforward to use. Custom dashboards present the user with the relevant data to answer the questions of interest.

Yet, the environments are customized, and all the data acquisition and normalization is done behind the scenes. So is the correlation of the data and the relational data stored in the system. That means if the user wants to expand outside the scope of the data collection and correlation, then something focused on the new question has to be created, once again necessitating the skills of a trained and insightful data scientist.

Like many in the big data and analytics sector, I see the role of data scientists evolving and becoming even more pivotal in the future of Big Data analytics and in driving continued innovation. As Kevin Kelly, one of the founders of Wired magazine and its executive editor for a number of years, said, “Machines are for answers; humans are for questions.”

He went on to say that “the world that Google is constructing—a world of cheap and free answers—having answers is not going to be very significant or important. Having a really great question will be where all the value is.”

[1] Dasu T, Johnson T (2003). Exploratory data mining and data cleaning. Wiley-IEEE.

 

Sign up for the free insideBIGDATA newsletter.

Comments

  1. Is there any difference between a visual analyst and data analyst? If there is any how are they different in terms of tasks performed with the two positions?

  2. Nice article Daniel,

    Agree on the clean data and effectively enabling the subject matter expert to take control of their data science activity.

    We’re working hard to deliver an easy to use tool set for subject matter experts to acquire and clean data. The tool is an iPaaS by definition but we like to think we approach this a little differently to most providers on the market.

    Thanks and have a great 2015!
