6 Things Slowing Down Big Data

In this special guest feature, Steve Sarracino of Activant Capital discusses how big data has become a victim of its own success and identifies six distinct issues that hinder further progress. Steve Sarracino is Founder & Partner with Activant Capital, a firm that invests in technology-enabled and software businesses, with a focus on growth equity and on providing liquidity for entrepreneurs and founders.

The amount of data collected by businesses, non-profits and governments has skyrocketed in recent years. The increasingly common use of Internet-connected devices has allowed organizations to collect more and more data on their customers, based on their browsing habits, purchases, software logs, search queries and social media activity. Organizations also gather data in bulk from other sources, such as sensor networks, satellite remote sensing, vehicle diagnostics and point-of-sale terminals. This trend of automated data collection has the potential to drive a radical transformation in how enterprises research, innovate, market and ultimately grow. One would think this glut of machine-collected information would be a boon for enterprise users of Big Data, but it has become apparent that Big Data is a victim of its own success. Big Data can be very useful to an organization, yet six issues currently hinder the progress of the field.

There aren’t enough data scientists to make sense of it all

The most pressing issue is the lack of qualified data scientists in the United States. According to a report by Accenture, the demand for skilled data scientists has greatly outpaced the number of qualified graduates entering the workforce: a projected 400,000 data science jobs created from 2010 to 2015, with only 140,000 new data science specialists to fill them. Additionally, a report by McKinsey states that there will be a 50 percent gap between the supply of data scientists and the demand for them in 2018. The same report projects “a need for 1.5 million additional managers and analysts in the U.S. who can ask the right questions and consume the results of the analysis of Big Data effectively”. This astonishing shortage has led some to believe that the era of Big Data could take longer to arrive for organizations that don’t move swiftly to hire the best talent.

Lack of integration with existing tools

Even if enough talent were available, the adoption of Big Data would still be slowed by current Big Data solutions that do not integrate with existing software. A few companies are working on this problem, yet no real standards are being developed for tight integration. There is a multitude of applications, frameworks and scripting languages for processing and making sense of Big Data, but most of them aren’t straightforward enough for a casual user to understand, let alone to use the results in familiar toolsets. As a result, converting the data into formats that an organization’s information management system can use often becomes tedious. The overall goal is that querying data should be seamless to the person querying it, whether the data lives in a Big Data solution, an RDBMS, JSON/XML or a plaintext file.
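To make the goal of seamless querying concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular Big Data product works: the function names (`rows_from_sqlite`, `rows_from_json`, `rows_from_csv`, `query`), the `sales` table and the sample values are all hypothetical. An in-memory SQLite database stands in for an RDBMS and CSV text stands in for a plaintext file. Each source is adapted into plain records so that one filter function can be applied to all of them.

```python
import csv
import io
import json
import sqlite3


def rows_from_sqlite(conn, table):
    """Yield rows of a SQLite table as plain dicts (table name is trusted here)."""
    conn.row_factory = sqlite3.Row
    for row in conn.execute(f"SELECT * FROM {table}"):
        yield dict(row)


def rows_from_json(text):
    """Yield records from a JSON array of objects."""
    yield from json.loads(text)


def rows_from_csv(text):
    """Yield records from CSV text that starts with a header row."""
    yield from csv.DictReader(io.StringIO(text))


def query(rows, **equals):
    """Return rows whose fields match every keyword filter (compared as strings)."""
    return [
        r for r in rows
        if all(str(r.get(k)) == str(v) for k, v in equals.items())
    ]


def demo():
    """Run the same filter against three differently stored data sets."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)", [("north", 10), ("south", 7)]
    )
    json_text = '[{"store": "north", "amount": 3}]'
    csv_text = "store,amount\nnorth,5\nsouth,2\n"
    sources = [
        rows_from_sqlite(conn, "sales"),
        rows_from_json(json_text),
        rows_from_csv(csv_text),
    ]
    # The caller writes one query; where the data lives is hidden behind adapters.
    return [int(r["amount"]) for src in sources for r in query(src, store="north")]
```

Calling `demo()` collects the "north" amounts from all three sources with the same `query` call. The person asking the question never has to care which storage format answered it.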

The need for better security models

Information for Big Data often comes from a variety of sources, from sensor networks to social media and point-of-sale systems. There must be security at every level to protect this information, from the information collectors to data processing, data in transit and analytics. Big Data software frameworks such as Hadoop allow data to be easily stored, retrieved and queried at scale by distributing it across a number of different computers. While this is good for performance and processing, it is less beneficial for security, as each node must be secured. A processing node or a data source can be fully encrypted, but if its information is used within an insecure Big Data environment, that effort can be all for naught. Compromised information, especially an executive-level analysis, can be a major loss to a business. At the moment, most Big Data tools have almost no security in place: once a malicious user has access, everything can be compromised. Common Big Data solutions like Hadoop are open source, which is part of their appeal; all of the code for an open source program is publicly available for free. With open source solutions, the community of users can often rectify security problems and issue a patch, but that doesn’t protect against all threats, especially as Hadoop was not originally built with security as a priority. Instead, security was added after it took off in popularity. Improvements are constantly being made, but security solutions for Big Data still aren’t enterprise grade and often do not interoperate with existing enterprise-wide security solutions.

The need for security to protect privacy

Big Data solutions collect and process information ranging from the mundane, such as a user’s taste in music, all the way to social security numbers and medical history. Highly sensitive user data must be protected at every stage to ensure compliance with regulatory and statutory requirements. Data breaches must be disclosed, and disclosure can cost an organization user trust, legal fees and brand image. Securing Big Data processing systems, data collectors and so on is therefore critical, not just to preserve a competitive advantage but also to protect customers from identity theft. As with all forms of customer data collection, companies need to be careful about how they collect the information, what details are given to the consumer, how the data is used, and other legal and ethical issues. Organizations need to start focusing more heavily on securing data to protect the privacy of their customers, not just for legal reasons but to protect their image and prevent a loss in customer satisfaction.

The need for more mature software

To date, Hadoop logs are riddled with errors, warnings and numerous other issues that are almost impossible to decipher without training, experience and know-how. Despite that, the platform still seems to magically work most of the time. If problems do arise, it can be difficult to separate minor issues from major ones, or even to tell whether information is being accessed inappropriately. Moving forward, there need to be improvements in predicting and avoiding problems when running MapReduce jobs, and in providing human-readable information to solve them. Ideally, the log files should be easy enough to understand that, if there were a problem, even someone without extensive education in data science would be able to make sense of them.
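As a rough illustration of separating minor from major issues, the sketch below counts log lines by severity and pulls out only the serious ones. It assumes a log4j-style layout of timestamp, level, logger name and message. The sample lines are synthetic, with made-up `example.*` logger names, and are not real Hadoop output. The `triage` function and the choice of which levels count as serious are likewise assumptions for this example.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <LEVEL> <logger>: <message>"
LINE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<logger>\S+): (?P<msg>.*)$"
)

# Which levels count as "major" is a policy choice made up for this sketch.
SERIOUS = {"ERROR", "FATAL"}


def triage(lines):
    """Count lines per severity level and collect the serious messages."""
    counts = Counter()
    serious = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the layout, e.g. stack traces
        counts[m["level"]] += 1
        if m["level"] in SERIOUS:
            serious.append(f'{m["logger"]}: {m["msg"]}')
    return counts, serious


# Synthetic sample lines, invented for illustration only.
SAMPLE = [
    "2014-12-02 10:34:00,123 INFO example.Scheduler: job started",
    "2014-12-02 10:34:01,456 WARN example.Storage: slow read, retrying",
    "2014-12-02 10:34:02,789 ERROR example.Storage: block unavailable",
    "    at example.Storage.read(Storage.java)",
]
```

Running `triage(SAMPLE)` gives one count each for INFO, WARN and ERROR and a single serious message. A summary like this is the sort of human-readable starting point the paragraph above asks for, though real logs would need a pattern matched to their actual configuration.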

Engineer dependent

Even if the log files for Hadoop and other programs were easier to decipher, the average person wouldn’t know what to do with that information. The field of Big Data is filled with acronyms, buzzwords and methodologies that are unfamiliar to the vast majority of people, engineers excepted. Because of their complexity, most Big Data tools still require an engineering degree to operate on a daily basis, along with familiarity with the command line. Hadoop’s YARN has vastly improved accessibility for tech-savvy business analysts, but the market is still far from a true GUI-based tool that can be picked up easily.

Getting Big Data back on track

The answers to these issues are manifold, but the best solution is to take advantage of Big Data tools as they continue to evolve and grow. Hadoop has begun to take security much more seriously, as its role has changed from analyzing publicly available information on the web to powering data analysis for large organizations. Automation has also come into play, as a machine can handle most of the more tedious or challenging tasks that a human analyst has had to face. Enterprises must start to embrace these automated software solutions; many of them enable typical business analysts to perform queries and analysis in Big Data environments without needing to know MapReduce, Pig or the other highly technical tools previously required to access the data. Over time, Hadoop has made it possible for enterprises to capture, store and analyze larger quantities of data in a cost-effective way. With the advent of the Hadoop YARN job scheduler, more creative solutions that leverage the Hadoop platform are gradually starting to emerge, especially as YARN has made it easier for non-technical users to work with Hadoop.

The potential benefits of properly leveraging Big Data are vast. It’s no secret that we are in the midst of a seismic shift in the analytics world. Industries around the globe are waking up to the reality that data is an asset, not just a storage necessity. Organizations wanting a competitive advantage need to start investing in new, comprehensive data analysis software. Such tools help non-data scientists, especially marketers, product people and others on the business side of a company, extract actionable answers from their databases. Those companies that can build a reputation for providing valuable services while using consumers’ personal data in trustworthy ways will have big advantages over their competitors.

Comments

John Amraph says

December 2, 2014 at 10:34 am

How about bandwidth?
It’s honestly pretty bad in most of the world, and people are just sending raw genomes by snail mail.

Murthy Mathiprakasam says

December 7, 2014 at 4:09 pm

@Steve, your article has some great points, but I think I’m a bit more optimistic and realistic about where the industry and technology are. The relational database industry is almost 40 years old with hundreds of thousands of customers. Meanwhile, any real commercial use of Hadoop has only really been happening over the last 3-4 years. Considering that estimates show thousands of Hadoop deployments within this timeframe and annual growth rates estimated near 100%, I would say “Big Data” adoption is moving pretty fast.

And just as impressive as the customer adoption growth is also the growth of technology integration by nearly all of the major enterprise software players. Whether in the form of Hadoop hardware appliances, integrations with existing management tools, or porting of familiar functionality like ETL and data quality with Hadoop, enterprise software vendors have stepped up to help customers with existing processes and skillsets take best advantage of Hadoop technology. And of course, there are entirely new applications made available on Hadoop as well, especially in the data wrangling and visualization space, enabling everyday data stewards and business analysts to make use of Hadoop without the need for specialized data science skills.

I’m also impressed by how far security and compliance have advanced in the Hadoop ecosystem. Projects like Apache Sentry and Apache Knox have dramatically moved the needle on Hadoop security. Integration with Kerberos and enterprise-grade data masking and data lineage tracking tools ensure the right people get access to the right data and that all operations are tracked for compliance reporting. There’s always more to do, but for a technology that is only a couple years old, significant progress has been made in a short amount of time.

Fundamentally, I think the misconception of Hadoop’s maturity comes from its constant comparison to transactional and analytical RDBMS systems, neither of which Hadoop seeks to replace. Considering that commercial use of Hadoop was essentially non-existent a couple years ago, I think rapid customer adoption, the porting of familiar tools like ETL and data quality to Hadoop, and the availability of tools to ensure security and compliance are all remarkable accomplishments, and I see nothing slowing this growth and progress down any time soon.
