Field Report: DataWorks Summit 2018



I was pleased to be on hand for the 2018 edition of the venerable DataWorks Summit conference (previously known as Hadoop Summit) in vibrant Silicon Valley on June 18-21. This was my fourth time covering the show for insideBIGDATA, and I’ve noticed the tone has changed significantly over the years. In the early years (there have been 11 such conferences so far) it was the wild west of Hadoop techies, but now it’s more about leading enterprises that are using advanced analytics, data science, and artificial intelligence to transform the way they deliver customer and product experiences at scale. With 2,100+ attendees this year, DataWorks Summit offers an opportunity to learn about the latest developments while networking with industry peers and pioneers to see how open source technology can make data work and accelerate your company’s digital transformation. I still consider this conference one of the world’s premier big data community events.

Formerly known as the Hadoop Summit, DataWorks Summit has expanded its sphere of influence beyond Hadoop. This includes a greater focus on Spark and the use cases enabled by Spark like machine learning, predictive analytics, and artificial intelligence. In addition, DevOps for Big Data strongly resonated with attendees who expressed the corresponding need for tools that accelerate the DevOps cycle from code to monitor. As customers maximize their Big Data investments, there is a corresponding growth in multi-tenancy usage, and an acute need expressed by operators for tools that can help them deliver performance across clusters.

A lofty view of Silicon Valley

Each year I attend DataWorks Summit I experience a number of random things that help define the conference for me. This year they were: getting a warm feeling with each repeated recorded announcement I heard walking through the San Jose airport (“Welcome to Mineta San Jose Airport, Silicon Valley’s airport …”), because it always makes me feel I’m at the center of the technology universe; getting a top floor room at the San Jose Marriott hotel that’s adjacent to the McEnery Convention Center so I could feel “on top” of Silicon Valley (although even from that lofty perspective I could NOT see Hooli or Pied Piper!); meeting famed Gartner analyst Merv Adrian in the hotel elevator and thanking him for his acid Tweets; and wandering around downtown San Jose until I found the local Philz Coffee (my favorite even in my own backyard of Santa Monica).

When attending technology conferences as a member of the press, I take the opportunity to learn more about the technology for sure, but I also take the time to talk to the people and make observations in order to get a pulse of the industry. DataWorks Summit is a great venue for this. I talked to most vendors at their booths in the Community Expo area. I found all of the company reps, including many top level execs, to be very engaging and truly excited about the industry in general, as well as their relationships with Hortonworks (the sponsor of the conference). I also learned a lot from the attendees who represent a Who’s Who list of top enterprises. Most of all, I was able to engage in “data,” something I’ve been enamored with for all of my professional life.

Big data ads hit everyone arriving at the San Jose airport

In this field report I wanted to give you a sense for what the vendor ecosystem was saying at DataWorks Summit, their corporate message if you will. Each company had a somewhat different slant of course which aligned with their products and services, but there was also a lot of commonality. Most everyone had some tie into the industry’s current buzz – AI, machine learning and deep learning. This was perfect for me as a practicing data scientist myself. Let’s get started with some vendor snapshots …


The conference host company Hortonworks, of course, had the largest booth with the best position near the main doors in the Community Expo area. They had a number of important announcements including the new HDP 3.0. I’ll keep this snapshot brief as I will publish at a later date my one-on-one interview with the company’s VP Product Management, Jamie Engesser for an in depth look into the progress of one of the companies that helped shape the big data industry.


IBM is serious about playing an important role in today’s big data landscape. They were at the conference promoting their Data Science Experience enterprise platform that provides teams with the broadest set of open source and data science tools for any skill set, the flexibility to build and deploy anywhere in a multi-cloud environment, and the ability to operationalize data science results faster.


Our relationship with Impetus Technologies (and sister company Kyvos Insights) goes back years, and we’ve worked together on a number of projects to amplify their message in the marketplace. This year, Impetus was discussing recent enhancements to StreamAnalytix, and also showcasing some customer success stories. In fact, the company held some compelling sessions:

  • “Migrating Analytics to the Cloud at Fannie Mae” which described the modernization of Fannie Mae’s analytics platform and corresponding, full migration of its Netezza assets to the cloud.
  • “How a major bank leveraged Apache Spark and StreamAnalytix to rapidly re-build their Insider Threat Detection application” which discussed how one of the world’s largest banks, a Fortune 25 customer of Impetus, used a powerful visual platform based on Apache Spark for unified streaming and batch data processing. The project made it possible to rapidly develop and deploy Spark applications for threat detection.
  • “BI on Big Data with instant response times at Verizon” which discussed how using Kyvos, Verizon was able to build a BI Consumption Layer that helped them analyze this massive data in its entirety, with the ability to slice and dice and drill down to the lowest levels of granularity, with instantaneous response times.

The company also highlighted their recent survey on BI on Big Data Adoption. The research of more than 300 organizations is eye opening. For example, more than 52 percent of respondents feel their organization lacks the skill set required to effectively implement BI on big data.


Teradata is another long-time partner of ours with a number of nice projects under our belt. It was great to reconnect with our old friends at Teradata to learn about all the company’s new initiatives in this fast-paced industry.

The company offered a compelling session, “Teradata AppCenter: Supercharge your data science environment,” which focused on the premise that many companies are faced with numerous data scientists working in silos with limited ability to collaborate or easily access corporate data sources. Teradata AppCenter is a scalable, enterprise-ready environment that allows data scientists to import and leverage numerous different processing engines and a data scientist workbench (Spark, TensorFlow, Jupyter, etc.) designed to access shared services and numerous data sources as well as share and visualize results. The presentation did a great job demonstrating the functionality of Teradata AppCenter as well as highlighting some use cases for customers who need to support a growing data science team.

Teradata is also serious about AI, and presented an excellent session “Deploying AI in the fight against financial crime in the banking industry” by Peter MacKenzie, Teradata’s Services Director for Artificial Intelligence in America.

Women in Big Data lunch panel at DataWorks Summit 2018


Another valued insideBIGDATA partner is Hewlett Packard Enterprise (HPE). The company had a high profile at DataWorks Summit, talking about its new family of edge-to-cloud solutions enabled by HPE Edgeline Converged Edge Systems to help organizations simplify their hybrid IT environment. By running the same enterprise applications at the edge, in data centers and in the cloud, the solutions allow organizations to more efficiently capitalize on the vast amounts of data created in remote and distributed locations like factories, oil rigs or energy grids.

To fully exploit the data and enable real-time action, organizations need to run enterprise-class applications close to the point where the data is created — at the edge. HPE’s new edge-to-cloud solutions operate unmodified enterprise software from partners Citrix, GE Digital, Microsoft, PTC, SAP and SparkCognition, both on HPE Edgeline Converged Edge Systems – rugged, compact systems delivering immediate insight from data at the edge – and on data center and cloud platforms. This capability enables customers to harness the value of the data generated at the edge to increase operational efficiency, create new customer experiences and introduce new revenue streams. At the same time, edge-to-cloud solutions enabled by HPE Edgeline simplify the management of the hybrid IT environment, as the same application and management software can be used from edge to cloud.

The company presented a session that was aligned with their new edge computing message – “Big Data Edge to Core” – which talked about how we’re living in an era of digital disruption, where the accessibility and adoption of emerging digital technologies are enabling enterprises to reimagine their businesses in exciting new ways. The session highlighted an edge-to-core-to-cloud digital infrastructure that can adapt to your flexing business needs, capturing expanding data flows at the edge and aligning them to a core infrastructure that can drive insight.


Of course Microsoft was present at DataWorks Summit in force, and not surprisingly, their message centered around the Azure cloud platform. I enjoyed one of the company’s sessions “Build big data enterprise solutions faster on Azure HDInsight” which focused on how Hadoop and Spark can be used to extract useful information in a variety of scenarios, such as ingestion, data prep, data management, processing, analyzing, and visualizing data. However, each step requires specialized tool-sets to be productive. Presenter Pranav Rastogi explained how to simplify your big data solutions with open source and ISV solutions on Microsoft’s Azure HDInsight—a fully managed cloud Hadoop distribution enabling reliable open source analytics with an industry-leading SLA.

Microsoft is very committed to data science and machine learning technologies. The company is a good partner of ours, and we have completed a number of compelling projects together in the past.


Syncsort is a very old company, per the name, but they recently announced a brand refresh to reflect the company’s transformation over the past year, with advancements in areas like data quality and data availability, to better communicate how they help customers solve data challenges: We organize data everywhere, to keep the world working.

At DataWorks Summit, Syncsort was well-represented with some new data integration innovations: the company recently announced a new Ironstream product for IBM i, delivering machine log data in real-time to Splunk for advanced analytics to support SIEM (Security Information and Event Management) and ITOA (IT Operations Analytics) initiatives. The company also announced new CDC (change data capture) capabilities to extend supported data sources and targets, helping organizations keep data from diverse sources fresh and in sync with next gen environments for advanced analytics. The additions include IBM i and Oracle as sources, and Kafka, Azure SQL, Hive and MySQL as targets.

Arcadia Data

Another insideBIGDATA partner, Arcadia Data, was showcasing their customer Neustar and how it is leveraging Arcadia Data visual analytics to identify and isolate DDoS attacks. In the video interview below, courtesy of SiliconANGLE’s theCUBE, Arcadia Data’s VP of Marketing, Steve Wooledge, and Satya Ramachandran, VP of Engineering at Neustar, discuss details of this collaboration.


Attunity is a leading provider of data integration and big data management software solutions, and the company’s booth was creating a lot of buzz at DataWorks Summit. I had a nice chat with some personnel to find out about their newly enhanced data governance capabilities for on-prem and cloud data lakes. As these analytic systems grow in size and scale, a lack of understanding and confidence in the data can be one of the largest barriers to adoption. Attunity’s enhanced metadata and data lineage capabilities are designed to help users understand what the source of the data is, and highlight any modifications or transformations that have been applied within the Attunity platform. These enhanced capabilities, which now include a unified repository for Attunity-sourced data and metadata, are designed to enable interoperability with third-party metadata repositories to support enterprise-class data governance, compliance and data management. The Attunity solution is now certified on Apache Atlas and Hortonworks DataPlane.

“Without the complete view of data and metadata as it is loaded into Hortonworks DataPlane, enterprises struggle to maintain proper end-to-end data compliance internally and to meet industry regulations,” said Jamie Engesser, Senior Vice President of Product Management at Hortonworks. “With Attunity’s platform, we’ve empowered Hortonworks DataPlane and Data Steward Studio customers with better information about their data. Now they are able to gain the missing data lineage and insight into transactional data source information that was previously inaccessible. Apache Atlas and Apache Ranger are key to making this possible for customers.”


Very cool AtScale t-shirt from DataWorks Summit 2018

I got the coolest t-shirt from AtScale that had a great quote from one of my personal idols, the late Stephen Hawking: “Intelligence is the ability to adapt to change.” The quote was printed with digits standing in for letters, so in order to get the shirt you had to read it by mentally doing the text mappings, 1->I, 7->T, etc. Best swag in a very long time! I’ve already worn it to the gym.
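For readers who would rather let a script do the mental mapping, here is a minimal sketch of the decoding. Only the 1->I and 7->T pairs come from the description above; the other substitutions, the `LEET_TO_LETTER` table, the `decode_leet` helper, and the encoded string itself are assumptions for illustration, and the actual text on the shirt may differ.

```python
# Hypothetical digit-to-letter table. Only 1->I and 7->T are mentioned in the
# article; the remaining pairs are common conventions assumed for this sketch.
LEET_TO_LETTER = {"1": "I", "7": "T", "3": "E", "4": "A", "5": "S", "0": "O"}


def decode_leet(text: str) -> str:
    """Replace each mapped digit with its letter; leave other characters alone."""
    return text.translate(str.maketrans(LEET_TO_LETTER))


# Assumed rendering of the quote, not a transcription of the shirt.
encoded = "1N73LL1G3NC3 15 7H3 4B1L17Y 70 4D4P7 70 CH4NG3"
decoded = decode_leet(encoded)
print(decoded)
```

The lookup table is the whole trick: `str.translate` walks the string once and swaps any character that has an entry, which is the same thing a reader does by eye.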

AtScale announced general availability (GA) of its new AtScale Cloud. By providing a ‘one-click’ solution for running otherwise extremely complex Big Data workloads in the cloud, AtScale simplifies the process of provisioning an end-to-end analytics environment that lets users work with data at unlimited scale and top performance. With AtScale Cloud, enterprises can run Big Data analytics on one cloud, multiple clouds or in hybrid mode (on-premises + multi-clouds). AtScale Cloud is available on the Amazon, Microsoft and Google Cloud marketplaces. It is an enterprise-grade solution that lets enterprises run high-performance analytics workloads on Big Data regardless of where or how that data is stored.

“When we decided to adopt new Business Intelligence tooling, one of our top priorities was to have the ability to run rapid-fire multi-dimensional analytics at large scale, directly from the tools our users preferred,” said Maurice Lacroix, Business Intelligence Leader at, one of Europe’s largest online retailers. “With AtScale, users can run live queries, straight to Google BigQuery at great speeds. This is something that we saw no other intelligence platform able to deliver.”


Pepperdata is the big data performance company. Their importance to a show like DataWorks Summit is a given. A Pepperdata blog post by Vinod Nair, Director of Project Management, included the following assessment of the company’s position in the industry at this point in time:

“Based on hundreds of conversations over the past couple of weeks, I am confident that Pepperdata is well positioned to serve industry trends. Pepperdata uses fine granularity time-series data across the full stack, combined with active automatic controls that maximize cluster utilization to display the performance impacts of developing and running Big Data clusters. Other tools don’t provide performance views into applications, nodes, and clusters to identify where problems originate. Several customers revealed that prior to selecting Pepperdata, they spent months using other existing tools to troubleshoot performance problems in the cluster without diagnosing the root cause. In most cases, Pepperdata can pinpoint the root cause of performance issues within days. To help customers solve the problems of scaling Spark applications from development to production clusters, we recently announced Code Analyzer for Apache Spark, which identifies lines of code and related stages in applications causing performance issues related to resource consumption over CPU, memory, garbage collection, network, and disk I/O. This is just the latest addition to our comprehensive portfolio that shortens time to production, and increases cluster ROI on-premise or in the cloud, and improves communication and resolution of performance issues between Dev and Ops.”

Robin Systems

I had a rousing discussion with Robin Systems, a company that extended the containerization benefits to all classes of enterprise applications, including databases and Big Data clusters, to enable “zero performance loss” consolidation, agile data management, and simplified operations.

Robin presented two sessions at DataWorks Summit. Partha Seetala, Robin Systems CTO delivered a presentation, “Containerized Hadoop Beyond Kubernetes,” covering how Big Data applications benefit from container technology through the entire application stack. Another presentation by Ankur Desai, Director of Products, offered insight around “6 Best Practices for Containerizing Hortonworks HDP.”

Robin Execs were on-hand to discuss how the Robin Cloud Platform enables an app-store experience and agility in Application Lifecycle Management while lowering administration costs and reducing time-to-market.


Unravel, a provider of Application Performance Management (APM) for Big Data, is on a mission to simplify big data operations. The company’s vision is to provide one management tool that can manage all big data apps that are used within an enterprise stack. With this vision in mind they’ve built out operations and application management for Spark, Kafka, Hadoop and MPP platforms. Now they’ve moved their focus to NoSQL systems and, based on customer requests, the first system to receive support is HBase.

At DataWorks Summit, Unravel discussed several exciting recent developments:

  • Announced a collaboration with Azure to boost adoption of Big Data in the cloud. Unravel on HDInsight enables developers and IT Admins to manage performance, auto scaling & cost optimization better than ever
  • Introduced APM for streaming data, allowing users to improve performance and reliability of their Internet of Things (IoT), real-time, and other streaming applications
  • Secured $15 million in Series B funding, with M12 (Microsoft Ventures) participating
  • Appointed a new VP of Worldwide Sales and also a new CMO


I spent some time hanging around the sizable crowd at the BlueData booth to learn how the company is transforming how enterprises deploy Big Data analytics and machine learning. The BlueData EPIC™ software platform uses Docker container technology to make it easier, faster, and more cost-effective for enterprises to innovate with Big Data and AI technologies – enabling Big-Data-as-a-Service either on-premises, in the cloud, or in a hybrid architecture. With BlueData, they can spin up containerized environments within minutes, providing data scientists with on-demand access to the applications, data, and infrastructure they need.

At DataWorks Summit, BlueData announced the new summer release for BlueData EPIC™. This release builds upon BlueData’s innovations in running large-scale distributed analytics and machine learning (ML) workloads on Docker containers, with new functionality to deliver even greater agility and cost savings for enterprise Big Data and AI initiatives.

This summer release is the result of collaboration with BlueData’s enterprise customers to develop new functionality in each of these areas to support their Big Data and AI initiatives – as they extend well beyond Hadoop and Spark to a range of different ML/DL and data science workloads, and beyond on-premises infrastructure to public cloud and hybrid architectures. These customer-driven innovations provide the agility of containers and elasticity of cloud computing, while ensuring enterprise-class security and reducing costs with automation. Now BlueData customers can benefit from AI-as-a-Service and ML-as-a-Service capability for their enterprise deployments – whether on-premises, in multiple public clouds, or in a hybrid model.

One of the key concepts underpinning this new release is the separation of compute and storage for Big Data and ML/DL workloads. This is a fundamental tenet of the BlueData EPIC architecture, and it allows organizations to deploy multiple containerized compute clusters for different workloads (e.g., Spark, Kafka, TensorFlow) while sharing access to a common data lake. This also enables hybrid and multi-cloud implementations, with the ability to mix and match compute/storage resources (whether on- or off-premises) depending upon the nature of the workload. And it provides the ability to scale and optimize compute resources independent from data storage – delivering greater flexibility, improving resource efficiency, eliminating data duplication, and reducing cost through the reuse of existing storage investments.


I took my attendance of DataWorks Summit as an opportunity to catch up with an old friend of insideBIGDATA, Stuart Tarmy, VP Sales at Io-Tahoe who has written a number of popular contributed pieces for us.

Io-Tahoe simplifies data discovery, enabling enterprises to find and make sense of structured, unstructured and hidden data with ease, throughout their entire business environment. Today, Io-Tahoe analyzes up to billions of rows of data with 90 percent accuracy, using its machine learning solution that spans data lakes and relational databases. Io-Tahoe is unique in its ability to discover relationships across heterogeneous databases.

Io-Tahoe was demonstrating its smart data discovery platform, featuring the Data Catalog, which allows data owners and stewards to utilize machine learning to create, maintain and search business rules on a consistent basis, regardless of how much data a company may have or where it is located. Io-Tahoe’s data discovery capability provides complete business rule management and enrichment, enabling a business user to govern the rules and define policies for critical data elements. It allows data-driven enterprises to enhance information about data automatically, regardless of the underlying technology and build a data catalog.


Streamlio was not previously in our big data industry database, so after meeting with company officials I quickly added it so it would be considered for our next quarterly IMPACT 50 list of the most impactful companies. I found that Streamlio has some very interesting tech.

Founded in 2017, the start-up is banking on organizations that are ready for real-time streaming architectures to process their basic data needs, and now it has brought three of the latest open-source technologies to bear on the process. Streamlio’s new real-time analytics suite incorporates Apache Pulsar, Apache Heron, and Apache BookKeeper.

Apache Pulsar is an open-source distributed publish-subscribe messaging engine originally created at Yahoo and now part of the Apache Software Foundation. Apache Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter and now incubating with the Apache Software Foundation. Apache BookKeeper is a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads.
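To make the publish-subscribe idea concrete, here is a toy in-memory sketch of the pattern. It is not Pulsar’s client API and says nothing about how Pulsar, Heron, or BookKeeper work internally; the `InMemoryBroker` class, the topic names, and the messages are all hypothetical, chosen only to show that publishers and subscribers know about a topic rather than about each other.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class InMemoryBroker:
    """Toy broker: routes each published message to every handler on its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Register a callback that will receive future messages on this topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        # Deliver to all current subscribers and report how many received it.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = InMemoryBroker()
dashboard_feed: List[str] = []
audit_log: List[str] = []

# Two independent consumers subscribe to the same (hypothetical) topic.
broker.subscribe("clicks", dashboard_feed.append)
broker.subscribe("clicks", audit_log.append)

delivered = broker.publish("clicks", "user-42 viewed /pricing")
ignored = broker.publish("orders", "no one is listening")
```

A production messaging engine adds what this sketch leaves out, such as distribution across machines, durable storage, and delivery guarantees, which is where a storage layer like BookKeeper comes into the picture.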

Streamlio enables enterprises to connect, process, and store data in motion. Streamlio’s unified solution makes it possible to process and access data immediately, even before it reaches data lakes, data warehouses, and other repositories.


I was looking forward to my visit to the TigerGraph booth because graph databases have a certain mystique in my mind after many years using traditional relational databases. I wanted a deeper dive into the world of nodes and edges and I wasn’t disappointed. I waited until sessions were under way so most of the DataWorks Summit attendees were away from the Community Expo area. The TigerGraph booth was pretty quiet when I chatted with Gaurav Deshpande, VP of Marketing, and he gave me a nice high-level tutorial on graph databases. Much appreciated!

TigerGraph was excited about their announcement of the free Developer Edition of its graph analytics platform for lifetime non-commercial use. Users can experience firsthand TigerGraph’s superiority in scalability, performance and ease-of-use compared to other solutions – including Neo4j and Amazon Neptune.

“As graphs continue to go mainstream, the next phase of the graph evolution has arrived. Cypher vs. Gremlin is no longer the right question to ask,” said Dr. Yu Xu, founder and CEO of TigerGraph. “The time has come to rethink graph analytics with TigerGraph and GSQL, the most complete query language on the market. One hour with our free Developer Edition is all you need to experience TigerGraph’s superiority in unlocking value from connected data at massive scale.”

TheCUBE in action at DataWorks Summit 2018


I always enjoy attending conferences where theCUBE is on the air. There is usually a big, raised, centrally located stage with the hosts doing the interviews and one or more representatives from the companies being interviewed. Although the sessions are recorded for YouTube so you can view them at any time after the conference, if you happen to be on-site you can also watch the action and listen live. It’s kind of exciting to watch news being made in front of your eyes.

TheCUBE is the flagship program of SiliconANGLE Media, and each week is broadcast to dozens of digital platforms including its own organic community, reaching millions of viewers each quarter. Last year, theCUBE covered over 100 events, interviewing more than 1,500 guests about the mega trends disrupting our world.

Well, I’ll call it a wrap now and leave you with my sense that the big data industry is going strong, if not accelerating, in its importance to business on a global basis.

Contributed by: Daniel D. Gutierrez, Managing Editor of insideBIGDATA. He is also a practicing data scientist through his consultancy AMULET Analytics.



