In this special guest feature, Diana Shealy, a data scientist at Treasure Data, highlights several commonly overlooked best practices when building out an analytics program in any organization, outlines common data-science problems and offers solutions to help you make wiser choices based on your company’s specialized needs. Diana is a data scientist who believes that insights from data are accessible to every company. After focusing her studies in both statistics and computer science, she’s enjoyed successfully growing the role of data science at two enterprise software start-ups.
With all the hype around data science and ubiquitous terms like “Big Data,” it’s easy to fall into a pernicious trap: Following ideas about how things “should” be done, rather than doing things that will best benefit your company. In my past four years as a practicing data scientist, I have noticed several commonly overlooked best practices when building out an analytics program in any organization. Below, I outline common data-science problems and offer solutions to help you make wiser choices based on your company’s specialized needs.
Treat Analytics as a Necessity
One common mentality considers analytics and other data initiatives as nice-to-have resources – something to address only after the company has reached some measure of stability. Though companies love to tout themselves as data-driven in their decision-making, under this paradigm the data usually arrives too late to inform those decisions.
Organizations should treat analytics initiatives as essential to both the company’s and the product’s success. Startup CTOs and technical product owners consider it bad practice to delay a continuous testing infrastructure for a new project, and an analytics infrastructure should be no different. Start early and build your engineering and product foundation on data, rather than on hunches and best guesses.
Find a Data Champion
While some forward-thinking companies establish analytics infrastructure or data organizations early, the majority of companies continue to follow the antiquated model of data silos. The marketing team owns marketing data, the engineering team owns server and production data, and so on. This older style of data management seems easier to manage, at first. Each group only has to deal with a limited piece of the puzzle. However, as they stack up, these information silos begin to create unnecessary complexity. Instead of one team understanding all of the nuances of the company’s data, the company has a variety of teams with varying agendas and different degrees of data fluency being forced to work together. Under this scheme, data analysis often falls to the bottom of the priority pile.
In small or young companies, a full-fledged data organization is not essential to gain the benefits of an analytics initiative. But you do need a data champion. Data champions “own” all the data. They define what will be collected, ensure the collection pipeline is working, monitor data quality and champion the use of data wherever possible. Data champions don’t necessarily need to perform analysis, but they are the keepers of the data.
Larger organizations are lucky: they can build true data or analytics organizations. Yet I still see many larger companies sticking with outdated silos. In addition, many of these data and analytics organizations get leveraged only for analysis, when they should own the entire pipeline: collection, storage, clean-up, analysis and visualization. Why have engineering or marketing talent waste precious time collecting and organizing data, when one organization can do it all?
Define Questions First, Then Collect Data
It’s easy to get excited when it’s time to start collecting data, but sometimes the thrill leads to the neglect of important steps. Deciding what data to collect is often left to the engineering department, which generally collects what it needs in the near-term, without consideration for future questions or data that will be useful in the weeks and years ahead. As a result, when it’s time to do any analysis, the data is riddled with gaps. This forces companies to return to step one, and requires a pause in analysis while new data is accumulated. The frustration caused by the wasted time and talent overshadows the benefits data brings to organizations.
The solution is simple, but sadly it is rarely implemented. Define a full range of questions first, and then build processes to collect the data necessary to answer those questions. Treat data collection like the scientific method: create a hypothesis, collect data, analyze it and review your conclusions. Then rinse and repeat. Data collection should never be finished; it will continually evolve as your needs change.
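One way to make this concrete is to write each question down next to the fields it needs, then compare that list with what the pipeline currently records. The sketch below is only an illustration under assumed names: the questions, the field names and the `find_gaps` helper are all hypothetical and do not come from any particular product.

```python
# Hypothetical example: every question, field name and helper here is an
# illustrative assumption, not part of any specific analytics product.

# Each business question is mapped to the event fields needed to answer it.
QUESTIONS = {
    "Which signup source retains users best?": {"user_id", "signup_source", "last_active_at"},
    "How long does onboarding take?": {"user_id", "signup_at", "onboarding_done_at"},
}

# Fields the current collection pipeline actually records (assumed).
COLLECTED_FIELDS = {"user_id", "signup_at", "last_active_at"}


def find_gaps(questions, collected):
    """Return, per question, the sorted list of required fields not yet collected."""
    gaps = {}
    for question, required in questions.items():
        missing = sorted(required - collected)
        if missing:
            gaps[question] = missing
    return gaps


if __name__ == "__main__":
    # Report which questions cannot be answered yet, and why.
    for question, missing in find_gaps(QUESTIONS, COLLECTED_FIELDS).items():
        print(f"{question} -> missing: {', '.join(missing)}")
```

Rerunning a check like this whenever a new question is added keeps the list of gaps visible before the analysis starts, rather than after.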
Start Small but Think about Scalability
Far too often companies start too big with their analytics infrastructure and believe they need to immediately include everything, including the kitchen sink. So they rush to build a massive and highly specialized data pipeline that is costly to run and maintain. As the first insights come in, the company recognizes the value of data analysis, but the expense is too high.
Tackling too much at once leads to sloppiness and overspending. When beginning, choose a few use cases and start building out an infrastructure that’s flexible enough for your needs and reasonably scalable. Chances are, your analytics needs are going to change rapidly once your organization develops an appetite for data, so start small and use incremental building blocks. Don’t go for the newest technology just because it’s the cool new thing. Once your initial pipeline is running smoothly, it will be easier to add and subtract pieces incrementally.
Control Your Own Data
On the other hand, when some companies start thinking about data analytics, they make the right decision and start small. But then they go overboard, and the initiative is too small. They use products and services that hold their data hostage, or are cheap or even free as long as the data volume is limited. As data volume increases, they face a trade-off between sharply increasing costs and limiting their analytics to heavily restricted datasets. Imagine what kinds of trends get missed because a third-party vendor won’t allow full access to data.
Having full control of data helps companies get the most out of analytics experiences. The majority of analytics use cases quickly outgrow the out-of-the-box solutions, and companies are left back at square one: building the analytics infrastructure they were trying to avoid in the first place. Meanwhile, valuable time and data have been wasted. Avoid the easy temptation that some of these vendors offer and focus on services and products that give full control of your data from day one.
Recognize the Data You Already Have
Remember those data silos? Good. Many sources of data are overlooked, like CRM, HCM and other enterprise application data. Intelligently defined subsets of that data should be integrated with your data warehouse or storage layer so that analysts and data scientists can easily access them without having to jump through hoops.
As companies build out data infrastructures, they should keep in mind the other sources of data they will want to collect, besides the obvious sources. Find overlooked data and pull it together into one place. This allows data champions to merge, morph and utilize everything they can, allowing organizations to optimize the value of their data initiatives.
Don’t Just Set It and Forget It
Congratulations! You have built a data analytics pipeline and data is happily streaming to a single source of truth. Give yourself a hand! You have accomplished something even top Fortune 500 companies struggle to achieve. But there is a problem: You don’t have the resources to analyze it.
I’m always amazed when I hear about organizations that want to build analytics infrastructure before hiring actual analysts. This is putting the cart before the horse. Why pay to collect and store data if you are not going to immediately start analyzing it? For startups, where everything changes at lightning speed, having massive amounts of historical data sitting and collecting dust is a waste of money and resources. Analysts and data scientists need to be part of the budget from the beginning. This is vital. A data initiative needs to be about the analysis and how that aids the organization rather than about the data. Collecting data and letting it sit is equivalent to letting money accumulate in a bank account that pays no interest.
It’s hard not to fall into the hype-trap of Big Data and analytics, for good reason — when data analytics is done correctly, it brings tremendous value to growing organizations. But doing it correctly makes all the difference — it’s what lets the reality live up to the hype. Being able to recognize and solve the problems I have outlined above will allow you to create a data analytics pipeline that makes the most of your valuable time, resources and talent.