APM for Big Data: An Architect’s Guide


In this special guest feature, Kunal Agarwal, CEO at Unravel Data, explores the area of big data Application Performance Management (APM) and why enterprises need it. APM is not a new discipline, but it is a new best practice for big data: adopting an application-first approach to guarantee full-stack performance and maximize utilization of cluster resources while minimizing the TCO of the infrastructure. For architects, it means that the big data architecture has to be designed to meet new business needs for speed, reliability, and cost-effectiveness, as well as align with architecture standards for performance, scalability, and availability.

Evolution of APM

Big data is moving from experimental projects to a mission-critical data platform that offers a range of big data applications. Enterprises look to these applications (e.g., ETL offload, Business Intelligence, Analytics, Machine Learning, and IoT) to drive strategic business value.

As big data applications move to production, performance expectations also need to be production-grade. The business needs answers in seconds, not hours; hardware and resources need to be continuously optimized for cost; and deadlines and SLAs need to be guaranteed. This means that APM needs to become a strategic component of a big data architecture in order to eliminate the risks and costs associated with poor performance, availability, and scalability.

Current Challenges

The fundamental problem is that the big data stack is complex due to its distributed nature: infrastructure, storage, and compute are spread across many layers, components, and heterogeneous technologies. This problem exists regardless of the specific architecture (i.e., traditional, streaming analytics, Lambda, Kappa, or Unified). From the perspective of the production big data platform and the applications that run on it, everything must run like clockwork: ETL jobs must happen at fixed intervals; users expect dashboards to be up-to-date in real time; user-facing data products must work constantly. But from the perspective of the underlying platform, the application is not an isolated job, but rather a set of processing steps that are threaded through the big data stack. For example, a fraud detection application (i.e., a data consumer) would comprise a chain of many systems, such as Spark SQL, Spark Streaming, HDFS, MapReduce, and Kafka, as well as many processing steps within each system. As a result, managing performance and utilization across all these layers is extremely complex.

This complexity makes it very hard to implement Application Performance Management services that provide a single view for managing performance and utilization across the full stack. In particular, there is no rationalized instrumentation across the stack to enable a holistic approach to guaranteeing performance and maximizing utilization. Instead, performance and utilization information is scattered across disjointed metrics, buried in logs, or spread across performance monitoring and management tools that provide only an incomplete infrastructure view rather than a full-stack view.

Challenges for Architects

As a result, the process of planning, operationalizing, and scaling the performance and utilization across applications, systems, and infrastructure is not production-ready. This challenge is called out in Gartner’s March 2017 Market Guide for Hadoop Operations Providers. The report states “scaling Hadoop from small, pilot projects to large-scale production clusters involves a steep learning curve in terms of operational know-how that many enterprises are unprepared for.”

The lack of production readiness spans multiple areas across business units, developers, and operations. At its core, it makes it impossible to implement a multi-tenant cluster model, where a small Ops team needs to support a large number of applications, business units, and blended workloads that combine SLA-bound jobs with data discovery. Ultimately, this affects the adoption of big data and the realization of business value.

Architect’s Checklist

The architect should play a pivotal role in ensuring that the big data platform is designed for production. The architect can ensure that the platform will meet the needs of the business within time and budget constraints, and that the architecture will adapt to new business needs as they evolve over time. The architect’s checklist can be used in planning, operationalizing, and scaling the big data platform in order to manage performance, utilization, and cost:

  1. What types of applications is the business trying to build and deploy?
  2. Will applications be SLA-bound or ad-hoc? How will workloads be prioritized cost-effectively?
  3. Which systems are best suited for the applications (e.g., Spark, Hadoop, Kafka)?
  4. Which architecture approach is best suited (e.g., Lambda)? Will the cluster be on-premises, in the cloud, or hybrid?
  5. How many concurrent users need to run on the same cluster without running out of resources? How many applications need to run on the same cluster within 24 hours? How will throughput be optimized?
  6. How should storage be tiered?
  7. How many nodes will the cluster need? What infrastructure capabilities need to be in place to ensure scalability, low latency, and performance, including compute, storage, and network capabilities?
  8. What data governance policies need to be in place?
  9. How will dev, QA, and production be staged?
  10. How much will the cluster cost to run? How will the business be charged back?
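
As an illustration of question 10, the following is a minimal sketch of one possible chargeback model, not a prescribed method. It assumes the cluster's total cost for a billing period is known and that usage can be exported as (business unit, vcore-hours) records; the function name `charge_back`, the tenant names, and the figures are all hypothetical.

```python
from collections import defaultdict


def charge_back(total_cost, usage_records):
    """Split total_cost across tenants in proportion to vcore-hours used.

    usage_records is an iterable of (tenant, vcore_hours) pairs. This is
    an assumed input format, not the output of any specific tool.
    """
    usage_by_tenant = defaultdict(float)
    for tenant, vcore_hours in usage_records:
        usage_by_tenant[tenant] += vcore_hours

    total_usage = sum(usage_by_tenant.values())
    if total_usage == 0:
        # Nothing ran, so nobody is charged.
        return {tenant: 0.0 for tenant in usage_by_tenant}

    return {
        tenant: total_cost * used / total_usage
        for tenant, used in usage_by_tenant.items()
    }


# Hypothetical monthly figures for two business units.
records = [("marketing", 300.0), ("fraud", 500.0), ("marketing", 200.0)]
bill = charge_back(10000.0, records)
print(bill)
```

A proportional split is only one option; a real model might also weight memory, storage, or reserved capacity.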

The operationalizing phase can be broken into four stages. The staged approach helps to gradually shape and scale the successful implementation and ROI of big data applications.

For each stage, the following key questions apply:

  1. What are the SLAs for applications? How can they be guaranteed?
  2. What are the latency targets for applications? How will they be met?
  3. How will Ops be able to support business units and users in a multi-tenant cluster? How will dev be able to monitor applications in a self-service fashion? How will Ops troubleshoot issues?
  4. When do users typically log in and out? How frequently?
  5. Do different groups of users behave differently? How do activity profiles of users change over time?
  6. How will costs be tracked in a multi-tenant cluster? How will they be assigned to projects, business units, users, and applications?
  7. How will data governance policies be enforced?
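
To make question 1 concrete, here is a small sketch of how SLA attainment per application could be measured from run history. It assumes run durations are already being collected somewhere; the `Run` record, the `sla_compliance` function, the application names, and the SLA targets are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    app: str           # application name (hypothetical identifiers below)
    duration_s: float  # wall-clock duration of one run, in seconds


def sla_compliance(runs, sla_seconds):
    """Return, per SLA-bound app, the fraction of runs that met the SLA.

    Apps without an entry in sla_seconds are treated as ad-hoc and skipped.
    """
    met = {}
    total = {}
    for run in runs:
        if run.app not in sla_seconds:
            continue  # ad-hoc workload, no SLA to check
        total[run.app] = total.get(run.app, 0) + 1
        if run.duration_s <= sla_seconds[run.app]:
            met[run.app] = met.get(run.app, 0) + 1
    return {app: met.get(app, 0) / count for app, count in total.items()}


# Hypothetical run history and SLA targets.
runs = [
    Run("etl_nightly", 3200),
    Run("etl_nightly", 4000),
    Run("dashboard_refresh", 45),
    Run("adhoc_query", 900),
]
slas = {"etl_nightly": 3600, "dashboard_refresh": 60}
report = sla_compliance(runs, slas)
print(report)
```

Tracking this ratio per stage gives Ops and business units a shared, simple measure of whether SLAs are actually being met.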


Big Data is hard. With so many new technologies and emerging layers, the big data stack is exceedingly complex. The only way to properly navigate this complexity is to take an application-centric approach. But this approach alone isn’t enough, as it’s difficult to get clear and consistent insight into the performance of your applications. APM is the answer.

To properly roll out APM, you’ll need to do thorough planning. Once you’ve made your way through the above checklists, you’ll be ready to execute APM and see better returns on your big data investments.



