Making Hadoop Realtime


Last evening I had the pleasure of attending the latest installment of our local Los Angeles Big Data User Group, of which I am a co-organizer. The featured speaker was Dr. William Bain, Founder & CEO of ScaleOut Software, a company providing distributed data grids for the enterprise. Bill’s talk, “Making Hadoop Realtime,” discussed how in-memory data grids bring tremendous value to the Hadoop software stack. It was a very illuminating discussion, and Bill turned out to be a superb communicator, providing an abundance of information about his company and its underlying technology. The only deficiency he admitted to was not being a Java developer, a reference to the Java code samples on some of his slides. Even so, he explained the code perfectly!

The event took place at the new corporate HQ of Factual in the upscale neighborhood of Century City. I like to get to Factual-hosted meetups early, before the sun sets, so I can gaze out of the windows of their 35th-floor office; what a spectacular view of the LA basin! As Bill was testing the video setup before his presentation, I took the opportunity to introduce myself and kibitz a bit. In the several weeks prior, ScaleOut’s PR personnel had been pitching me interview opportunities with Bill, but I reminded them there was no need since I was going to meet him in person here in LA at the meetup. He didn’t know about any of that, and was quite gracious. I found out Bill was a CS grad student at UCLA when I was an undergrad, so we compared notes about our favorite professors – Dr. Leonard Kleinrock (father of the Internet and packet switching), and Dr. Gerald Popek (my professor for compiler construction).

The abstract for the talk made me eager to attend this meetup:

Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.

He described ScaleOut’s in-memory, object-oriented architecture, compared and contrasted it with Spark, showed how it integrates with Hadoop and Hive, and gave some use cases. He also worked in a compelling, well-crafted example involving a hedge fund customer.

He was forthright about his company’s position in the big data industry, referring to a competitor, GridGain, that recently released its product as an open source platform. Bill said GridGain’s transition to open source was wildly successful, resulting in many downloads. ScaleOut remains proprietary and has a number of important patents supporting its technology.

Bill was able to speak at a very detailed level about ScaleOut products, even from an implementation point of view, including the rationale for certain design decisions. It is rare for a CEO to be able to speak at this technical level. I was duly impressed. Plus, Bill is a really nice guy, sort of like that favorite professor you could easily approach during office hours to hash out the technology of the day. His presentation brought me back to my college days sitting in one of my favorite classes. This was a great meetup.

The slides for Bill’s presentation are available HERE.

Daniel, Managing Editor – insideBIGDATA




  1. miko matsumura says

    Hazelcast is the leading open source in memory data grid, so it should be of great utility for this use case. It has the advantage of being open source as well.