In this special guest feature, Alex Bordei, head of product management at Bigstep, offers 5 examples of how Apache Spark has maximized its user experience – its feel. Alex is a developer and the head of product management at Bigstep, the big-data cloud provider. Prior to Bigstep, he was one of the core developers for Hostway Corporation’s provisioning platform.
Spark’s rise to prominence surprised some in the big-data community, catching even seasoned technical professionals off guard. After all, Hadoop, Spark’s main competitor, has the backing of powerful companies like Cloudera, Hortonworks, IBM, Oracle, Microsoft and Google.
Who would take the other side of that fight?
But even with the deck stacked against it, Spark is growing. Fast. Wikibon’s George Gilbert estimates that Spark-based investments captured 6 percent of total big-data spending in 2016 – a figure he projects will grow to 37 percent by 2022.
While Spark is certainly fast, speed alone isn’t what gets people to use it. Speed draws people’s attention, but making them stick around and invest years of their lives takes more.
Beyond the technical aspects, Spark has a softer side. That side is the true reason why some open-source (and not so open) technologies spread and others don’t. As engineers and architects, we kid ourselves into believing that we’re hyper-logical and that we make choices that maximize metrics such as performance or scalability. In fact, we’re just as much driven by emotions as anybody else.
And Spark has maximized its user experience – its feel. Let me give you a few examples:
- Downloading it from the Apache site feels comfortable. All Apache projects look similar and have the same links and the same content sections. I immediately know that I’ll need to download a tar.gz file with binaries and look for the “quick start” documentation, where I can copy-paste the commands to get it running.
- Writing my first few lines of code felt intriguing: Where’s my “for…” and “while…”? What’s a map(), and more importantly, what’s this collect() thing? The documentation says that it does something on every element of this array. It also says that the array can be bigger than a server! I don’t yet know what collect() or parallelize() does – unknowns that spark my interest. Since engineers like to build things, especially new things, this makes Spark intriguing. This is a new toy.
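To make that first-contact moment concrete, here’s the shift from loops to map() sketched in plain local Python – an analogy only, not actual Spark code; in Spark the same pattern would run on a distributed RDD via sc.parallelize() and .collect():

```python
# Local Python analogy for Spark's functional style (not actual Spark code).

# The familiar imperative loop:
numbers = [1, 2, 3, 4]
squares_loop = []
for n in numbers:
    squares_loop.append(n * n)

# The same computation expressed as a map, the way Spark encourages:
squares_map = list(map(lambda n: n * n, numbers))

# In Spark this would read roughly:
#   rdd = sc.parallelize(numbers)       # distribute the data across workers
#   rdd.map(lambda n: n * n).collect()  # run the function remotely, gather results
print(squares_map)  # [1, 4, 9, 16]
```

The point of the analogy: once the loop becomes a function passed to map(), nothing in the code says the data has to fit on one machine – which is exactly the door Spark walks through.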
- There are also wow moments: As I used it more, I hit more complicated situations with objects that could not be serialized automatically to be sent to workers. I had to dig deeper into what Spark actually does for me automatically. It turns out that, using reflection, it generates Java bytecode that gets sent to the workers. It also packages the dependent variables from the enclosing scope and sends those along. And that’s only one of the things it does for you. It has its own off-heap memory management engine and copies blocks from HDFS into this off-heap memory area to bypass garbage-collection issues. You don’t see any of this. It’s all abstracted away behind the functions of that distributed array called an RDD.
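That “packages the dependent variables from the scope” part is closure capture, and a toy Python example shows the idea (Spark’s real mechanism – closure cleaning plus Java serialization – is far more involved; this is just the concept):

```python
# Toy illustration of closure capture: the reason Spark must ship
# "dependent variables from the scope" to workers (not actual Spark code).

def make_filter(threshold):
    # keep_large closes over `threshold`; the value travels with the function.
    def keep_large(x):
        return x > threshold
    return keep_large

f = make_filter(10)

# The captured variable is physically attached to the function object --
# this is what a serializer has to package up and send to each worker.
captured = f.__closure__[0].cell_contents
print(captured)                           # 10
print([x for x in [3, 12, 25] if f(x)])  # [12, 25]
```

If `threshold` (or anything else the function drags along) isn’t serializable, shipping the function fails – which is exactly the class of error the article runs into.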
- Working with DataFrames feels familiar. This is a fairly new concept that allows you to think in terms of SQL. If I have a multi-dimensional distributed array, writing the maps and reduces can be complicated to get my head around. No biggie – Spark can do it for me. Just speak “SQL” to it.
- Spark does all the little things for you, like fetching your dependencies from a Maven repository and distributing them to the workers, or letting you look at Parquet files with an antediluvian BI tool through the JDBC connector. I’m not sure what feeling describes those, but it’s certainly a nice one.
Almost nothing that Spark has is necessarily novel, which may be why its adoption is catching so many by surprise. Spark didn’t invent functional programming, code generation, query planning or off-heap memory. However, with Spark, the way they all fit together with simplicity just feels right. As long as these feelings are there, Spark’s importance will continue to grow.