This article is the fifth in an editorial series intended to guide line-of-business leaders and enterprise technologists, with a focus on opportunities for retailers and how Dell can help them get started. The guide also serves as a resource for retailers that are further along the big data path and have more advanced technology requirements.
In the last article, we focused on how retail has benefited from the adoption of distributed systems such as Hadoop and Spark. The complete insideBIGDATA Guide to Retail is available for download from the insideBIGDATA White Paper Library.
This section highlights a number of important technologies that enable big data solutions for the retail industry. Specifically, Dell, together with strategic partners including Cloudera, Intel and others, offers a focused big data and analytics portfolio to help you on every step of your journey. This end-to-end portfolio includes an unprecedented lineup of solutions and tools for advanced analytics, data integration, and data management, bringing together all the technology components your retail organization needs to gain a 360-degree view of customers.
Hadoop addresses many challenges associated with storing, managing and processing large amounts of data in diverse formats—structured, unstructured and semi-structured. A shortlist of some of the more common operational uses for the Apache Hadoop platform for companies in the retail industry includes: ETL offload, active archive, log aggregation, price optimization and agile data mining.
Retail organizations that are gathering insights from vast data volumes and varied data types find that managing large volumes of unstructured data exceeds the capacity and capabilities of traditional data intelligence systems. These systems were specifically designed for structured data types from sources such as relational databases. Gathering data intelligence and developing perspectives from extremely large amounts of data requires a scalable system that can process multi-structured data volumes quickly and responsively and that can easily scale to manage growing data volumes.
In order to answer these needs and to address the explosive growth in data volumes and complexity, organizations of all sizes are turning to the open source Hadoop platform to store, process and generate value from their data stores. Hadoop solutions are not just about being able to capture data but also about being able to work with the many new and different varieties of unstructured data—social media data, sensor data, machine generated data and more. There are many advantages to using Hadoop, particularly in scalability, flexibility and economics. But as with any open source technology, it presents a unique set of challenges when deployed into production.
Cloudera created its enterprise distribution of Hadoop (CDH) for this very purpose: to remove the uncertainty and barriers that may dissuade an organization from deploying open source Hadoop into production processes.
In 2011, Dell, together with Cloudera and Intel, delivered their first tested and validated Hadoop Reference Architecture, the Dell | Cloudera Apache Hadoop Solution, accelerated by Intel. This end-to-end package delivers the core elements of Hadoop, including scalable systems and distributed computing, within a turnkey solution based on Cloudera Enterprise software and Dell hardware with Intel Xeon processors.
To help retailers realize the benefits of the Hadoop architecture, Dell offers QuickStart for Cloudera Hadoop, an all-in-one system designed to reduce the complexity of deploying, configuring, and managing Hadoop systems. The solution includes the hardware, software and services needed to deliver a Hadoop cluster on which organizations can run a proof of concept and begin working with big data.
Dell QuickStart for Cloudera Hadoop enables organizations to quickly engage in Hadoop testing, development and proof of concept work. Through the combination of Dell Intel-based PowerEdge servers, Cloudera Enterprise Basic Edition, Dell Networking and Dell Services, organizations can quickly deploy Hadoop and enable development and application teams to test business processes, data analysis methodologies and operational needs against a fully functioning Hadoop cluster. With the added flexibility of Dell Professional Services, you can choose the combination of training, installation and application development that is right for your organization.
Dell QuickStart for Cloudera Hadoop is deployed as a packaged and supported solution. Organizations can optionally explore Hadoop software at a Dell Solution Center and work on-premises with a fully functioning Hadoop environment through the Dell Hadoop Pod Loaner Program.
To enable fast analytics and stream processing, Dell offers another big data solution, the Dell In-Memory Appliance for Cloudera Enterprise. It is bundled with Cloudera Enterprise, which includes Apache Spark, an open source parallel data processing framework.
With its appliance-based approach, the Dell solution simplifies and accelerates the otherwise complex process of creating large cluster deployments. Rather than focusing on building and deploying an analytics platform, your IT team can now spend more time helping the business gain fast, critical insights from huge amounts of data.
ETL Offload Reference Architecture
The Dell | Cloudera | Syncsort Data Warehouse Optimization—ETL Offload Reference Architecture (RA), accelerated by Intel, serves to augment your enterprise data warehouse (EDW) by providing the means for running ETL jobs in Cloudera Enterprise with Syncsort DMX-h software. The solution makes it easy to build and deploy ETL jobs in Hadoop. Dell’s value has been validated by Principled Technologies, whose study highlights how customers can save $425,972 over 3 years, run ETL jobs 60% faster, and give the business back 4 days, all with entry-level Hadoop expertise.
The ETL process can create bottlenecks in EDWs. A few heavy jobs can bog down an enterprise data warehouse, and more processing means less query capacity. This processing work can be offloaded to Hadoop to reduce CPU utilization for heavy jobs and to accelerate complex ETL processes. The goal here isn’t to replace your EDW but rather to augment it by moving certain data, workloads and processes from your existing systems into Hadoop to gain new capabilities and cost economies.
Syncsort’s high-performance ETL software enables your users to maximize the benefits of MapReduce. Syncsort software enables faster time to value by reducing the need to develop expertise on Pig, Hive and Sqoop, or other technologies that are essential for creating ETL jobs in MapReduce.
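To make the offload idea concrete, here is a minimal sketch in plain Python of an ETL aggregation expressed in map/reduce terms. It is not Syncsort DMX-h or Hadoop code; the record layout, the `RAW_LINES` sample data and every function name are hypothetical, chosen only to show the extract, transform and aggregate steps that such a job would move off the data warehouse.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical raw point-of-sale lines: store_id,sku,quantity,unit_price
RAW_LINES = [
    "S01,SKU-1,2,9.50",
    "S01,SKU-2,1,20.00",
    "S02,SKU-1,3,9.50",
    "S02,SKU-1,bad,9.50",  # malformed quantity, dropped during transform
]


def map_line(line):
    """Extract and transform: parse one line, emit (store_id, revenue)."""
    try:
        store_id, _sku, qty, price = line.split(",")
        revenue = int(qty) * float(price)
    except ValueError:
        return  # skip malformed records instead of failing the job
    yield store_id, revenue


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_store(store_id, revenues):
    """Aggregate: one summary row per store, ready to load into the EDW."""
    return store_id, round(sum(revenues), 2)


def run_job(lines):
    mapped = chain.from_iterable(map_line(line) for line in lines)
    return dict(reduce_store(k, v) for k, v in shuffle(mapped).items())


summary = run_job(RAW_LINES)
print(summary)
```

In this sketch only the small per-store summary would be loaded into the warehouse, while the parsing and cleansing of raw records happens elsewhere, which is the division of labor the offload approach aims for.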
Dell Statistica’s Analytics platform extends the Statistica portfolio as a content mining and analytics solution with the ability to transform complex and time-consuming manipulation of web-scale data resources into a fast and intuitive process. Features include advanced natural language processing (NLP), entity extraction, interactive visualizations and dashboards, and the capability to create advanced analytic models and distribute them across Hadoop, databases and database appliances.
Statistica provides the ability to harvest sentiments from unstructured data such as Twitter feeds, blogs, news reports, CRM systems, and other sources, and combine them with additional data, including demographic and regional data, to better understand market traction and opportunities in the retail space.
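As an illustration of combining sentiment with regional data, the following Python sketch scores short posts with a tiny word list and averages the scores by region. It does not represent how Statistica works; the word lists, the `POSTS` sample and the `REGION_BY_CUSTOMER` lookup are all hypothetical, and a production system would use a trained NLP model rather than keyword counting.

```python
from collections import defaultdict

# Hypothetical word lists; a stand-in for a real sentiment model.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"slow", "broken", "rude"}

# Hypothetical inputs: free-text posts keyed by customer, plus a region lookup.
POSTS = [
    ("c1", "Love the new store layout, great staff"),
    ("c2", "Checkout was slow and the kiosk was broken"),
    ("c3", "Fast delivery"),
]
REGION_BY_CUSTOMER = {"c1": "West", "c2": "East", "c3": "West"}


def score(text):
    """Count positive words minus negative words in one post."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_region(posts, regions):
    """Join each post to its customer's region and average the scores."""
    scores = defaultdict(list)
    for customer_id, text in posts:
        region = regions.get(customer_id, "Unknown")
        scores[region].append(score(text))
    return {region: sum(vals) / len(vals) for region, vals in scores.items()}


result = sentiment_by_region(POSTS, REGION_BY_CUSTOMER)
print(result)
```

The join on customer ID is the key step: the unstructured text alone gives a score, and the regional attribute is what turns it into something a merchandising or marketing team can act on.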
Developing Statistica models and deploying them to Hadoop “data lakes” yields valuable insights by bringing advanced analytics to the full volume of data where it is stored.
This big data analytics solution also provides excellent performance and scalability by leveraging next-generation technologies like Hadoop, Lucene/Solr search, Mahout machine learning and interactive visualization. As one use case example, Statistica is used by Dell Global Analytics (DGA) to help over 500 internal clients improve customer acquisition and retention, identify up-sell and cross-sell opportunities, increase revenue and more.
For the data experts, this solution provides:
- Out-of-the-box structured and unstructured analytics
- Drag-and-drop creation of analytic workflows
- Hadoop-enabled for big data scalability
- Functional widgets that can be configured for individual analytic needs
- Open-source search indexing of complex, faceted metadata
Statistica Big Data Analytics enables enterprises of all types to more efficiently and effectively process all data.
If you prefer, the complete insideBIGDATA Guide to Retail is available for download in PDF from the insideBIGDATA White Paper Library, courtesy of Dell and Intel.