While at Hadoop Summit 2016, I had the opportunity to catch up with Jack Norris, Senior Vice President of Data and Applications at MapR, to discuss all the progress the company has made in the past year. Jack drives understanding and adoption of new applications enabled by data convergence. With over 20 years of enterprise software marketing experience, he has demonstrated success from defining new markets for small companies to increasing sales of new products for large public companies. Jack’s broad experience includes launching and establishing analytic, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. He has also held senior executive roles with EMC, Rainfinity (now EMC), Brio Technology, SQRIBE, and Bain and Company. Jack earned an MBA from UCLA Anderson and a BA in Economics with honors and distinction from Stanford University.
Daniel D. Gutierrez – Managing Editor, insideBIGDATA
insideBIGDATA: Since our visit with you during last year’s Hadoop Summit, can you give me a quick update on MapR?
Jack Norris: Sure. We brought in a new president and COO, Matt Mills, who comes from 20+ years at Oracle. He had 8,000 people and $4.5 billion of revenue responsibility. He’s been incredible. He has hand-picked different people to slot in on the field side. The dramatic difference is the background that some reps come from: they’re used to really working with customers, understanding the business issues, and then knowing how to apply technology to those issues. It’s a perfect time for a Converged Data Platform to match that skill set.
insideBIGDATA: Can you give us a perspective of MapR’s current view of itself and the industry?
Jack Norris: I think we’re a different company. Our view has been that the company is an enterprise software play. It’s very product-centric. That’s the way that we’re going to drive things. There probably is a once-in-30-year re-platforming of the enterprise happening now, and it’s all about that data layer. The hardware is being commoditized. The operating system is being standardized on a Linux flavor. There is the use of containers to try to improve that. It puts a lot of pressure on the question “How are you doing the data?” The old model was “start with the application”: make sure you completely understand the application, and the application dictates the data schema and how it’s formulated. That was the real key to success, especially in data warehouse projects. This results in a series of silos across the organization. That might work for particular applications, but what organizations are trying to do now is really compress their responsiveness and compress their data-to-action cycle. You can explain it different ways, but if we say “real-time,” sometimes real-time is taken out of context. Some people assert “Look how fast this query is,” but the data is a week old. Or “Look at my website,” but I’m basing those analytics on historical information that hasn’t been updated for several weeks.
So the key to that flexibility and agility is to free up the data and bring the applications to the data; then you can get very fast response. At that point, the analytics become less of a reporting function and more of an integration into the business. This way you’re not reporting on fraud, you’re actually understanding and acting before fraud happens, and the same applies to the top line, to the customer experience, or to operations. For example, consider an oil company – they’re adjusting the drill bit as it’s happening so that they get a straight line into the oil field, taking measurements and understanding the drill bit before it breaks, and avoiding millions of dollars of downtime related to that. If you look at the market from that perspective, then Hortonworks and Cloudera are basically starting with the Hadoop Distributed File System and basing everything on that, and it’s write-once. It’s not really a file system, it’s a Java layer that’s using the Linux file system, so there are no real breakthrough capabilities in that.
Spark is really exciting, but Spark doesn’t have a persistent data layer. So there’s this gap. How do you fill that data layer? That’s basically the MapR Converged Data Platform. It’s not just about the new technologies, it’s also about how I leverage legacy applications with a more scalable, flexible, manageable platform by supporting standards like the network file system. So we’ve done things like mainframe offloads that aren’t necessarily about new applications. It’s just that I want to save millions of dollars a year on the expense of the mainframe and run it on a platform that’s a scale-up platform that allows me to continue to go up and to the right, and not have these really expensive plateaus as I max out the systems.
insideBIGDATA: The Converged Data Platform concept has become quite important to MapR of late, yes?
Jack Norris: It’s basically our view that this has to be a better platform. When it originally came out it was, “Let’s make sure we’re aligned with Hadoop,” because Hadoop is where all the heat and excitement is. But as we added MapR Streams and delivered MapR-DB with JSON on the platform, it became clear that it’s so much bigger than Hadoop. We looked at different wording and messaging and the Converged Data Platform really resonated, so we ran it by our customer advisory board and that’s it.
So yes, our platform is important to MapR. It’s interesting because now we feel like even at this summit there will be convergence, because right now we’re seeing data in motion and data at rest treated as separate, and it’s really arbitrary. Take the fraud applications where data is streaming in and you’re making a decision within milliseconds. Well, that decision is based on historical purchases, longer-term data that’s not just in that stream. Then when you make that decision, there’s typically some database and a real-time transaction that’s part of that. So being able to do a file, and a stream, and a database all at the same time in a coordinated unified platform with common security and management is a no-brainer.
insideBIGDATA: What do you see about Spark in terms of excitement?
Jack Norris: There was a lot of excitement at the recent Spark Summit. We’re committed to Spark, but we don’t dictate platform, because we’ve got some customers in production using Storm and that’s working fine. We’ve got others that are using DataTorrent and other commercial engines. Our Streams is the publish-and-subscribe infrastructure to get the data there, and then you can use whichever streaming analytics platform makes sense on top of it. There is also Apache Flink, which is an up-and-coming solution with a lot of buzz. It’s not quite at the maturity level that Spark is, but it’s continuous rather than microbatch, so it’s got some advantages for certain workloads. Some of the Google engineers have said that it’s the closest to Dataflow, so that’s one to watch.
insideBIGDATA: I notice that you’re still pushing the education aspect more than anybody else in the field with the MapR free courses. I think that puts you guys on the radar because if you’re a new developer, and you’re moving into the big data space, then you need education. A lot of people are wanting to become data scientists and big data engineers, so it’s nice to have a free educational resource.
Jack Norris: That’s part of being a software resources company, not a services company. There are a lot of things we do in the community. There is a lot of open-source development, e.g. Apache Drill, that we view as another community donation. On the education side, we have a pretty large educational services team, so this is not a fly-by-night training course. It’s specifically developed for on-demand training where there are hands-on labs and it’s paced a certain way. It’s flexible in that you can start and stop modules at any point, and you can pursue the curriculum or even review it. It kind of fits in with the Khan Academy model. I haven’t looked in the last 2 months, but we were over 50,000 participants, and then we rolled out the community. The purpose of the community is this: we’ve got all these people coming to the class, and in a classroom setting, sometimes the conversations with peers are as beneficial as the course materials. So the community is a digital place to have that interaction happen. We have MapR personnel that are engaged and part of it, but it’s also peers. The announcement we did yesterday on the Spyglass initiative included customizable, sharable dashboards, and there’s a place in the community where you can exchange ideas as well as those dashboards, so that’s an important ingredient.
insideBIGDATA: Tell me more about Spyglass that was announced at the show.
Jack Norris: We’ve had in place the MapR Control System with the heat map, and all sorts of capabilities. We looked at having ease of management, visibility, and full control as part of that. Spyglass is an initiative to really go a lot further, particularly into the deep visibility layer. We use the term “Spyglass initiative” because it ties in a number of different technologies, time series database, different dashboards and so forth. The Spyglass initiative was kind of pulling that together to drive these features and customized extensible dashboards were a part of it. The kind of summary of the features is this convergence – you get a complete view of all operations. Because of our multi-tenancy features, and because of the Converged Data Platform, we’ve got customers with 50, 80, over 100 different applications running on the platform.
This Spyglass initiative is to take the next big leap on the management side, where you’ve got ease of administration and visibility into operations. We came out with the dashboard view, but we’ve also centralized all of that log information. It’s in a JSON format so it’s very flexible. If there are certain things that the customer wants to see, it’s very easy for them to customize it, and that kind of customization is exposed in the dashboards as well.
We also extended the APIs for third-party tool integration, and that’s just part of our general design center, in that the big data platform needs to be a bridge between existing legacy systems and next-generation systems. IT organizations are faced with budgets that are flat for the next five years through 2020. If you look at IDC and Gartner, they all project flat IT budgets. If you look underneath that, however, it’s a consistent shrinkage in legacy spend, but an increase in next-gen, cloud, big data, etc., to the point where 30% of all IT spend in 2020 will be on next-gen technologies, and 90% of the data will be in next-gen. So flat budgets do not mean flat business as usual. IT sees a tremendous amount of change.
With Spyglass, if we look at what exactly we’re monitoring, it’s kind of full circle. You look at the node, the infrastructure, everything from utilization, CPU, disk utilization, node info, throughput, file system, database information, etc. Then there’s information on the applications themselves. We’re looking at YARN and the scheduling information about when things are running, looking at it from a container perspective in terms of what is active, what is pending, the virtual resources, the queueing information. You can set quotas and we see those as well. Some of those are unique features; for example, MapR is the only one that has logical volumes. There is services monitoring as well – what are the service utilizations, what are the version numbers, etc. If you look at the dashboards, it can be everything from highly visual to diving in and looking at the individual logs.
insideBIGDATA: What was the genesis of Spyglass? Was it customer driven?
Jack Norris: It was a natural evolution and it was the next shoe to drop after delivering the Converged Data Platform and that huge technology unification. Now there is management capability that goes in front of it. We’re on the cusp of what we’ve referred to internally as global cloud processing. I see the on-premise versus cloud debate as simplistic. If you look at MapR’s architecture, we’re used today by Amazon, which OEMs us for Elastic MapReduce (EMR). There are no changes to our software to be part of that. They’ve added some integration into their own services so it integrates with some of their tools like S3.
So what we see is that it’s not an either/or, and it’s not hybrid either, because it’s the same technology. It’s more that processing is taking place in a variety of locations based on things like government regulations, resource availability, and SLA requirements. Suppose you have a global cloud processing framework, regardless of where it is, that can accommodate continuous data flows from multiple locations, which essentially is our Streams product, that enforces sequence across locations, and that has strong consistency so you can do mission-critical things without being tied up with eventual consistency. Now you’ve got something that fits under the IoT umbrella and fits into some of these real-time workloads that we’ve seen from Rubicon and different security applications. Spyglass is inserted between where we are now with convergence and where we’re going with global cloud processing.
insideBIGDATA: How do your customers get started with Spyglass?
Jack Norris: It’s basically part of the distribution. It’s an upgrade of our control system. You’ll see additional things roll out over time and continue to expand those capabilities so it’s not a separate thing we’re monetizing, it’s not like we’re selling it. We think the customizations will be pretty interesting. We want to participate in the community and spur some of those discussions so there’s some sort of open-endedness there.
insideBIGDATA: What else is new from MapR?
Jack Norris: We did an announcement around ecosystem packs. The best way to explain this is that we wanted to simplify the moving parts associated with upgrades, so we came up with these packs. We have a fairly frequent release schedule on the core platform, and we have different versions of different packages that are part of that core platform. On a monthly basis, we update and do certain point releases of open-source projects, and that’s based on the development in the community, testing, hardening, and our take on what’s ready for prime time. If you look at the number of customers that have multiple applications, it’s very difficult to say, “Okay, this month everyone’s updating to version 14 of Hive, or version 5 of Azure.” It would create a lot of chaos. So MapR uniquely supports multiple versions of the same package on the cluster. You’re not beholden to our scheduling, and as an IT administrator you don’t have to communicate with all the users and run some lockstep migration.
insideBIGDATA: It sounds like a lot of work was completed since the last time we met at Hadoop Summit 2015.
Jack Norris: Yes, if you look at what we’ve delivered in the past year, there was a lot of breakthrough stuff. The native JSON support in MapR-DB is a huge deal. We looked at what our customers were doing and a lot of customers were taking JSON files and then flattening them. When you flatten, you make decisions about what’s important: am I going to aggregate it out? There are applications that want to see both. The ability to have that complex document in the database and be able to view things at the sub-level or aggregate level is pretty powerful. And it simplifies the process, because you don’t have that interim step that’s dependent on some IT function for that data.
The other thing it does is simplify the ability to link applications and share data across an organization, because you don’t have to agree on a common data model. One of the things we’ve seen in dealing with customers and their legacy environments is that changing the data model can take months. If instead you have a publish-and-subscribe network with JSON, where these things can change quickly and the schemas can be discovered on the fly, now all of a sudden you’ve got the ability to link processing without having to agree ahead of time. It provides the flexibility where they can add fields, change fields, etc. To me that’s equally as important as the convergence.
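Editor’s note: to make the flattening point concrete, here is a minimal sketch in plain Python. It does not use any MapR-DB or MapR Streams API; the `order` record, its field names, and the helper functions are all hypothetical and invented purely for illustration. It shows how flattening a nested JSON document fixes one aggregate view up front, while keeping the document intact leaves both the sub-level and aggregate views available and lets a consumer discover fields without an agreed schema.

```python
# Hypothetical purchase record, invented for illustration only.
order = {
    "order_id": 1001,
    "customer": {"id": 42, "region": "EMEA"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.50},
        {"sku": "B-7", "qty": 1, "price": 20.00},
    ],
}


def flatten_to_summary(doc):
    """Flatten to one row per order; item-level detail is discarded."""
    return {
        "order_id": doc["order_id"],
        "customer_id": doc["customer"]["id"],
        "total": sum(item["qty"] * item["price"] for item in doc["items"]),
    }


def item_view(doc):
    """Sub-level view that is only possible if the nested document is kept."""
    return [(item["sku"], item["qty"]) for item in doc["items"]]


def discover_fields(doc, prefix=""):
    """Walk a document and return the set of dotted field paths it contains."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            paths |= discover_fields(value, f"{prefix}{key}.")
    elif isinstance(doc, list):
        for value in doc:
            paths |= discover_fields(value, prefix)
    else:
        paths.add(prefix.rstrip("."))
    return paths


if __name__ == "__main__":
    print(flatten_to_summary(order))      # aggregate view
    print(item_view(order))               # sub-level view
    print(sorted(discover_fields(order)))  # fields found without a fixed schema
```

A producer that later adds a field (say, a hypothetical `coupon` key) changes what `discover_fields` returns without breaking either view, which is the kind of flexibility described above.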