Floating Elephants: Developing Data Wrangling Systems on Docker

Technology evolves quickly in the big data ecosystem, and deploying the latest tools is a complex undertaking. At Trifacta, success means developing our data wrangling software against cutting-edge systems and maintaining excellent compatibility with data platforms from our technology partners. We need to do this with an efficiency that allows our team to make rapid progress.

For example, one recent afternoon our development task was to support a new authentication protocol for one of our Hadoop integrations. We needed a specific distribution of Hadoop running the right services, an LDAP server, and a Kerberized environment. That sounds daunting, yet we set up these dependencies and got to work in just a few minutes without having any of this software installed ahead of time. We used Docker to do it.

The Road to Docker

If you’ve developed software systems for a while, you may remember a time when installing your dependencies was an exercise in manual heroism. Some of us have hand-built developer machines by installing and configuring Linux, an RDBMS, and a raft of development tools, all at the right version. If things didn’t work out on one developer’s system, the debugging session could be painful or involve a reinstall. Later, virtual machines and provisioning tools revolutionized this experience and brought reproducible environments to the masses. With that strategy, you could get one good configuration, but each additional configuration meant launching and provisioning another VM, so the effort grew linearly. Even though virtual machines offered an enormous improvement over hand-built systems, the provisioning time they require slows the workflow and allows environments to drift out of date.

Over a year ago, Trifacta started expanding our matrix of supported big-data platforms. The engineering team needed immediate access to everything in the matrix for feature development, continuous integration, or reproducing issues. We tried using an array of virtual machines featuring different configurations of Hadoop, but that approach didn’t meet our ease-of-use or performance requirements. Clearly we needed an engineering strategy that scaled up at the pace of our engagement with our partners. As with the Trifacta product itself, scaling up includes both a human and a technical component.

To solve this problem, we took a close look at Docker. Since its 1.0 debut last year, this Linux containerization technology has sparked a tremendous surge of interest and a profusion of creative uses. It can package, deliver, and run a complex Linux system with the simplicity of running an app. Docker containers can be composed into flexible systems in elegant ways. They are self-contained and can be rapidly switched out with low overhead. Package once, run everywhere. Docker had all the elements of the solution we needed.

Easy Access + Efficient Development = Better Innovation

We created Docker containers for the Hadoop distributions we support and for other services we depend on. Now we can switch out configurations with just a few short commands. Our CI system tests against a matrix of supported platforms on each build, dynamically swapping out Docker containers to give us immediate feedback on quality. Test setup and teardown are clean and quick with Docker.
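To make the matrix idea concrete, here is a minimal Python sketch of how a CI step might swap containers in and out around a test run. It is an illustration under stated assumptions, not Trifacta's actual tooling: the image names, the `ci-` container naming scheme, and the `run_matrix` and `run_tests` helpers are all hypothetical. The only Docker commands it relies on are `docker run -d --name` and `docker rm -f`.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical image names; the article does not say which images were used.
PLATFORM_IMAGES: Dict[str, str] = {
    "hadoop-distro-a": "example/hadoop-distro-a:latest",
    "hadoop-distro-b": "example/hadoop-distro-b:latest",
}


def start_command(name: str, image: str) -> List[str]:
    """Build the argv that starts a detached, named container."""
    return ["docker", "run", "-d", "--name", name, image]


def teardown_command(name: str) -> List[str]:
    """Build the argv that force-removes the named container."""
    return ["docker", "rm", "-f", name]


def run_matrix(
    images: Dict[str, str],
    run_tests: Callable[[str], bool],
    execute: Callable[[List[str]], object] = subprocess.check_call,
) -> Dict[str, bool]:
    """Run the test callback once per platform, one container at a time."""
    results: Dict[str, bool] = {}
    for platform, image in images.items():
        name = f"ci-{platform}"  # assumed naming scheme
        execute(start_command(name, image))
        try:
            results[platform] = run_tests(platform)
        finally:
            # Remove the container even if the tests raise,
            # so the next platform starts from a clean slate.
            execute(teardown_command(name))
    return results
```

Passing a different `execute` callable, such as a list's `append` method, records the commands instead of running them, which makes the sketch easy to try on a machine without Docker installed.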

Better tools enable innovation. During hackathons we’ve experimented with new components like Spark and Impala by building and sharing Docker containers. Some of our developers have even taken the “less is more” approach by leaving VM environments behind and developing directly on their MacBooks, running containerized Hadoop formations as if they were on a real Linux cluster.

We anticipate even more exciting developments ahead. Docker Swarm and Docker Machine may give us easy ways to dynamically fire up performance testing clusters on our dedicated racks with far less setup overhead and contention. Docker-based systems can scale up and down on real hardware to meet our different needs. Trifacta’s data wrangling software itself could be packaged as a Docker container to reduce the friction of installing it into a customer’s environment. Bringing these tools to our users and democratizing access to software fits well with our company mission.

Contributed by Jeremy Mailen. Jeremy is Principal Engineer at Trifacta, a software company developing productivity platforms for data analysis, management and manipulation.

