Paradigm4 is the company behind SciDB, a scalable array database with native complex analytics. CEO Marilyn Matz is an expert in big data, having co-founded Cognex Corp in 1981. Marilyn has some interesting perspectives on why Hadoop might not always be the right choice for big data deployments, and I recently caught up with her to discuss these views.
insideBIGDATA: In a nutshell, can you describe the genesis of your SciDB database and what hole in the market it fills?
Marilyn Matz: We call SciDB “the scalable array database with native complex analytics, programmable from R and Python.” Its genesis is reflected in the name: SciDB. Scientists were dealing with Big Data before that phrase was even coined. The development of SciDB was motivated by demands originally articulated by astronomers and other scientists who found that existing data-management and analytical tools just didn’t do the job. For example, keeping data in files didn’t support fast ad hoc data exploration, and forcing data into relational database tables was an ‘unnatural’ act. Mathematical software for complex analytics didn’t scale to data volumes that were quite literally astronomical.
Of course, the explosion of Big Data affects everyone now—not just scientists—so SciDB meets a demand experienced by many industries, including life science and healthcare analytics, e-commerce, quantitative finance, industrial analytics, and sensor analytics (i.e., the industrial internet and the internet of things). There are many new sources and types of data, from location data to genomics data. The recurring theme is large volumes of a variety of kinds of data requiring complex analytics—analytics based on sophisticated mathematical techniques far beyond what Hadoop or Hadoop-plus-SQL databases can provide.
insideBIGDATA: Paradigm4 is promoting a new direction for big data: “Why Not Hadoop?” Can you briefly explain this concept?
Marilyn Matz: Hadoop does some things well and some things quite poorly. Pure-play Hadoop, based on Google’s Map-Reduce framework, works well when you can break a large problem into smaller pieces, work on each piece totally independently, and then assemble the individual answers to these smaller problems into an overarching answer to the original, large problem. But many big data problems – like finding patterns, trends, or clusters – defy such attempts at partitioning. Many critical insights can be found only by looking at all the data as a whole unit. They may require doing a single math operation (like matrix multiply) over the entire data set rather than doing that math operation many times, each on a piece of the data. We call these kinds of math operations ‘complex analytics.’
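To make that distinction concrete, here is a toy NumPy sketch (illustrative only — not SciDB or Hadoop code, and the data and variable names are invented): a global sum decomposes cleanly into independent per-chunk work that a map-reduce job could parallelize, while the principal components of a matrix depend on every row jointly, so chunk-local results do not simply combine.

```python
import numpy as np

# Toy data: 1,000 observations of 5 variables, split into 4 chunks
# the way a map-reduce job would partition them.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
chunks = np.array_split(data, 4)

# Decomposable problem (map-reduce friendly): a global sum.
# Each chunk is reduced independently; partial results combine trivially.
partial_sums = [chunk.sum(axis=0) for chunk in chunks]   # "map" step
total = np.sum(partial_sums, axis=0)                     # "reduce" step
assert np.allclose(total, data.sum(axis=0))

# Whole-dataset problem: principal component analysis via SVD.
# The singular vectors depend on all rows jointly; SVDs of the
# individual chunks do not assemble into the SVD of the full matrix.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top_component = vt[0]  # direction of greatest variance across ALL rows
```

The second computation is the kind of operation Matz calls ‘complex analytics’: it has to see the data as a whole rather than as independent pieces.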
As problems get harder, Hadoop’s struggles grow. Writing Hadoop ‘map-reduce’ programs is hard work, and the map-reduce architecture runs much slower than other approaches. For somewhat harder problems like SQL analytics, pure-play Hadoop doesn’t work. That’s why the Hadoop ecosystem includes Hive, and why many Hadoop vendors, like Cloudera, are circumventing the Map-Reduce framework altogether: they are building products like Impala that provide SQL database functionality on top of the Hadoop file system (HDFS), reclaiming the benefits of databases.
For the hardest problems—complex analytics at big data scale—Hadoop offers little. Although these problems are difficult, they are common. Data-intensive, analytically ambitious organizations want to perform sophisticated techniques like correlation, covariance, principal component analysis, multivariate statistics, generalized linear models, clustering, feature detection, and machine learning. At modest scales of data, organizations handle these problems with math/statistics packages such as R, SAS, and MATLAB. But at big data scale, Hadoop is not the solution.
SciDB addresses this problem. It is a new big data solution built from the ground up without Hadoop or any other part of the Hadoop ecosystem. It is a new kind of DBMS with all the benefits of DBMSs: reliability, data sharing, and a query model that allows data retrieval and manipulation without custom programming or high-maintenance code. In addition, it includes sophisticated mathematical operations as native, in-database primitives. And it scales to large cloud-based clusters or clusters of commodity hardware, which is how it extends complex analytics to big data scale.
insideBIGDATA: Can you highlight a specific use case example where your solution shines over Hadoop?
Marilyn Matz: Many use cases – from recommendation engines to medical outcomes research to financial trading algorithms to predictive maintenance – rely on math techniques like correlation, covariance, principal component analysis, multivariate statistics, generalized linear models, clustering, feature detection, and machine learning.
Hadoop is just not the right computing architecture for these kinds of complex analytics. Other non-Hadoop architectures can do complex analytics but require the data to fit in memory. SciDB can do those kinds of math operations fast and efficiently on matrices far too big to fit in memory.
SciDB also shines at working with geospatial data and time-series data. Because SciDB preserves the natural order of data, it can do really fast windowing – selecting regions of data in time or space, or finding subpopulations that share a set of attributes – and really fast math on that data in place, in its native format.
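A toy NumPy sketch of the idea (illustrative only — the array shape and names are invented, and SciDB performs such operations natively on disk-resident distributed arrays rather than in-memory NumPy arrays): when data is stored in its natural time order, a window is just a coordinate slice, and math runs directly on it without scans, joins, or re-sorting.

```python
import numpy as np

# Toy sensor array: readings indexed by (time, sensor), stored in
# natural order so a time window is a contiguous slice.
rng = np.random.default_rng(1)
readings = rng.normal(loc=20.0, scale=2.0, size=(86400, 8))  # 1 day, 8 sensors

# Windowing: select a region of data in time (seconds 3600..7199)
# directly by coordinates -- no scan, no join, no re-sorting.
window = readings[3600:7200, :]

# Math on that window in place: per-sensor mean, plus a rolling mean
# over 60-second windows computed column-by-column on the ordered data.
per_sensor_mean = window.mean(axis=0)
kernel = np.ones(60) / 60.0
rolling = np.apply_along_axis(
    lambda s: np.convolve(s, kernel, mode="valid"), 0, window
)
```

The point of the sketch is the access pattern: because order is preserved, the window selection and the windowed math are both cheap, which is what an array database exploits at scale.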
insideBIGDATA: Can you give us a peek into your company’s future plans and what role you see yourself playing in the big data marketplace moving forward?
Marilyn Matz: As the saying goes, “One size fits none.” Companies will need multiple Big Data solutions, and will have to choose the right tool for each job. Paradigm4’s SciDB addresses two specific emerging needs: first, the need to do large-scale complex analytics; second, the need to mash up different kinds of data to capture context, in order to better understand and manage complex systems – from populations to fleets of equipment – and to provide more personalized products and services.
As the analytical aspirations of companies grow, so too will Paradigm4. More and more, organizations will want to do complex analytics on these vast mash-ups of data without resorting to hand-coded, customized, file-based solutions. And most importantly, organizations will want to analyze that data to yield new insights, new products, and new discoveries.
SciDB, with its offering of complex analytics within the context of DBMS-style data management, will fill this need.
insideBIGDATA: What is the history of Paradigm4?
Marilyn Matz: Mike Stonebraker, Paradigm4’s CTO, has led many of the advances in database research, starting more than 30 years ago with Postgres, which is still widely used. He has also built companies – nine to date, including Vertica, which was sold to HP – to develop and support products based on those advances.
Mike and others believe array databases with built-in scalable complex analytics are the best approach for managing the variety and volume of data – from location data to image data – and for doing high-value, complex analytics on that rich data.
Mike also believes that the future of software is open source. Paradigm4 offers an open source community edition and an enterprise edition with enhanced functionality and support. There is a very active user community at scidb.org/forum.
Paradigm4 is venture-funded by Sigma Partners and Kepha Partners. The company’s SciDB enables data scientists, quants, bioinformaticians, healthcare analysts, and scientists to ask and answer big, hard, and important questions.