Domino Data Lab, Inc. is a new company that started out with a focus on enabling much easier cloud computation, and doing “version control for data science.” I first heard about Domino at a local Meetup group event where one of the presenters said he’d been using Domino to run his R code on powerful hardware with just one command. I took a look and the company’s products seemed very compelling. I caught up with co-founder Nick Elprin at the recent useR!2014 conference to get the high-level view of his company. In the interview below Nick talks candidly about Domino’s story, plans and a couple of use case examples for his technology. Read on for more important insights from Domino!
insideBIGDATA: Please give us a high level view of Domino and what pain points it addresses in the marketplace.
Nick Elprin: Domino is a platform for analysts and data scientists that handles all the “plumbing” required for enterprise analytics workflows. It sits on top of whatever languages and tools you’re already using (e.g., R, Python, Matlab, Julia) and provides functionality in four areas: job distribution (and scalable computation); version control and reproducibility; sharing and collaboration; and model publishing and deployment.
We have a variety of customers across different industries, including a large automotive manufacturer, a social media aggregator, a marketing firm, and several mobile application start-ups. They are using Domino to:
- Move long-running or computationally intensive jobs off of analysts’ desktops, either to powerful machines within their network or to cloud compute resources, without any changes to their code and without any infrastructure setup or hassle.
- Automatically keep records of the results being produced as an analysis evolves, and make those results available to the entire team, so nobody needs to worry about reproducibility, and work is always easily accessible through a central web interface.
- Create self-service user interfaces on top of analytical models, so non-technical stakeholders can adjust parameters and run analyses and what-if scenarios on-demand, without distracting the analysts.
- Expose models via a REST API, so existing software applications and business processes can easily integrate with analytical models, and analysts can update models without any help from software engineers or IT.
For example, the marketing firm I mentioned earlier uses sophisticated machine learning models to figure out how to target direct mail campaigns. They were using AWS to manage all their infrastructure for doing this, but found they were spending so much time managing machines that they were distracted from their actual analytical work. They switched to Domino to dramatically reduce the overhead of managing infrastructure, saving them a great deal of time and reducing their total cost of ownership. We have a case study up about this on our site.
Similarly, there’s a real-time social-media aggregator we work with, and they have their own machine-learning models to detect spam in the content they process. They are now using Domino to train and collaborate on these models, instead of building and supporting their own custom infrastructure, dramatically accelerating their work.
To explore a different use case, a major car manufacturer uses Domino to collaborate on and share their analyses internally. Domino has amplified the impact of their data scientists by enabling them to make their analyses more widely consumable and usable.
insideBIGDATA: Can you tell us about a recurring use case example of how customers are using your technology?
Nick Elprin: One recurring theme is turning analytical models into “self-service” tools. We have seen many companies where data scientists are stretched too thin. In addition to doing their own work developing and improving their models, they have to spend time servicing requests from non-technical users. For example, “re-run that analysis with updated data” or “re-run that analysis with this different parameter value.” This is incredibly distracting for data scientists. We built a feature called Launchers to solve this problem. Launchers let analysts build simple, self-service web UIs around their models. Other users can then run the model by inputting parameter values, or even uploading custom data sets to process. The analysis runs on our cloud infrastructure, just like any other run — and the requesting user gets the results when it finishes executing. A large, publicly-traded company is using this, and their data scientists estimate that it has saved them about one day per week in “support requests” from internal stakeholders.
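To make the self-service idea concrete, here is a minimal sketch of an analysis script written so that its inputs arrive as named parameters rather than being hard-coded. The interview does not describe how Launchers pass values to a script, so the command-line convention, the `run_analysis` function, and the `--region` and `--discount` parameters are all hypothetical illustrations, not Domino's actual interface.

```python
import argparse


def run_analysis(region: str, discount: float) -> dict:
    """Hypothetical placeholder for a real model: projects revenue for a region."""
    base_revenue = 1000.0  # made-up constant, for illustration only
    return {"region": region, "projected_revenue": base_revenue * (1 - discount)}


def parse_args(argv=None):
    """Declare the parameters a non-technical user would fill in on a form."""
    parser = argparse.ArgumentParser(description="Re-run the analysis with new inputs")
    parser.add_argument("--region", default="US", help="market to analyze")
    parser.add_argument("--discount", type=float, default=0.1, help="discount rate, 0 to 1")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Explicit argument list so the example behaves the same however it is invoked.
    args = parse_args(["--region", "EU", "--discount", "0.25"])
    print(run_analysis(args.region, args.discount))
```

Once every input is a declared parameter, a "re-run that analysis with this different parameter value" request no longer requires anyone to edit code; a form only has to supply the values.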
insideBIGDATA: That’s great. Can you think of a second common use case?
Nick Elprin: In addition to exposing models to other people, we also see companies that need to expose their models to other software applications. So we created an “API Endpoints” feature that lets you expose a REST API around your models. We take care of all the hosting and infrastructure — you just provide the file, and which function to call in your script. Domino will serialize the input parameters from an HTTP request, pass them to your code, and serialize the results back over HTTP. The latency is very low, around 100ms, so models can be used in production applications. We are finding that this liberates data science teams: they are able to deploy changes to their models without being limited by the availability of the software engineering team.
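The request cycle Nick describes (deserialize parameters, call a function in your script, serialize the result) can be sketched in a few lines. This is only an illustration of the general pattern, not Domino's implementation: the JSON layout with a `parameters` list, the `handle_request` helper, and the `predict` function are all assumptions made for the example.

```python
import json


def predict(x, y):
    """Hypothetical model function that an analyst would nominate for the endpoint."""
    return {"score": 0.5 * x + 2.0 * y}


def handle_request(body: str, func) -> str:
    """Turn a JSON request body into a function call and a JSON response body."""
    payload = json.loads(body)              # deserialize the incoming parameters
    params = payload.get("parameters", [])  # assumed layout: {"parameters": [...]}
    result = func(*params)                  # pass them to the analyst's code
    return json.dumps({"result": result})   # serialize the result for the HTTP response


if __name__ == "__main__":
    print(handle_request('{"parameters": [2, 3]}', predict))
```

Because the calling application only sees the request and response bodies, the analyst can change what `predict` does internally without the application's code changing, which is the decoupling from the software engineering team described above.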
insideBIGDATA: What’s next for Domino? What can you tell us about your next steps in doing version control for data science?
Nick Elprin: We’ve built a solid platform for automatically tracking your work, so you can always reproduce something you’ve done in the past. Now we want to enable much richer collaboration and knowledge sharing on top of that. We already let you leave notes about past versions of your analysis, but we are planning to add support for commenting on specific parts of a result (e.g., an area on a chart, or a cell in a table), and to be able to track a threaded discussion around all that content, so you can retrace the thought process and the questions that drove the evolution of your analysis. The idea is to create a collaborative “lab notebook” for your analysis.
Daniel, Managing Editor – insideBIGDATA