SciDB is an open-source software project building a next-generation computational database for data scientists, bioinformaticians, financial and commercial analysts, and anyone else with massive volumes of multi-dimensional data, such as geospatial, genomic, and financial data. Paradigm4 is the company behind SciDB.
I found a very interesting technical report that shows SciDB’s usefulness for machine learning applications: “SciDB – How Linear Algebra Operations Scale.” SciDB accelerates linear algebra operations, which are the basis for statistical and machine learning computations such as correlation, covariance, and singular value decomposition (SVD), because it has been built from the ground up to store and process arrays efficiently. Consequently, data analysis with SciDB scales seamlessly to trillions of data points without the need to rewrite your analytic code or manually distribute your data. This blog post explains one benchmark that shows how SciDB is pushing the frontier for in-database scalable linear algebra.
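To make the link between linear algebra and statistics concrete, here is a minimal sketch in plain NumPy (not SciDB's own interface, which I am not reproducing here) of covariance and correlation written as matrix products. The function names and the random test data are my own illustrative assumptions.

```python
import numpy as np

def covariance_matrix(X):
    """Sample covariance of the columns of X (rows are observations)."""
    Xc = X - X.mean(axis=0)                # center each column
    return Xc.T @ Xc / (X.shape[0] - 1)    # one matrix product does the work

def correlation_matrix(X):
    """Pearson correlation, obtained by rescaling the covariance matrix."""
    cov = covariance_matrix(X)
    std = np.sqrt(np.diag(cov))            # per-column standard deviations
    return cov / np.outer(std, std)

# Hypothetical data: 1000 observations of 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
cov = covariance_matrix(X)
corr = correlation_matrix(X)
```

Both statistics reduce to a single large matrix multiplication plus some cheap rescaling, which is why a system that distributes array storage and matrix products can speed them up.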
The part about SVD operations caught my eye since I have an interest in the role SVD plays in Principal Component Analysis (PCA) solutions for dimensionality reduction and unsupervised learning.
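For readers who have not seen the connection, the following is a small NumPy sketch of PCA computed through SVD. It is only an illustration under my own assumptions (the helper name `pca_svd`, the choice of k, and the synthetic data are all hypothetical), and it says nothing about how SciDB implements SVD internally.

```python
import numpy as np

def pca_svd(X, k):
    """Project X (rows are observations) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                        # PCA requires centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # principal directions (rows)
    scores = Xc @ components.T                     # reduced-dimension coordinates
    explained_variance = s[:k] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Hypothetical data: 200 observations of 5 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
scores, components, explained_variance = pca_svd(X, k=2)
```

The squared singular values, divided by n - 1, are the eigenvalues of the covariance matrix, so the SVD of the centered data yields the principal components without ever forming the covariance matrix explicitly.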
The graph below shows a use case that measures SciDB’s speed-up scalability. The process is described in detail in the technical report. SciDB sounds like it might be worth a close look as an economical alternative to an HPC system.