Crunching Time and Space to Turn It Into Usable Data



Today's platforms include open source solutions for Big Data (like Hadoop), GIS mapping solutions (like ESRI), SAP HANA, and BigTable, the highly scalable distributed storage system used in Google Maps. But none of these solutions scale to the requirements of data with space and time attributes, nor are they designed to enable real-time decision-making with that data.

Why? Because conventional databases are designed to manage the numeric and character types of data found in text and documents, while a spatial database is optimized to store and query spatial data, that is, data that defines a geometric space. In a spatial database, data is stored as coordinates: points, lines, polygons, and topology, and in some cases more complex data such as 3-D objects and linear networks.
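To make the distinction concrete, here is a minimal Python sketch of the kinds of structures a spatial database stores. This is purely illustrative (the class names and the Seattle-area coordinates are made up for the example), not how any of the products above represent geometry internally:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float  # longitude
    y: float  # latitude

@dataclass(frozen=True)
class Polygon:
    vertices: list  # ordered list of Point

    def bounding_box(self):
        """Axis-aligned bounding box: (min_x, min_y, max_x, max_y).

        Bounding boxes are the workhorse of spatial filtering: a cheap
        rectangle test prunes candidates before exact geometry is computed.
        """
        xs = [p.x for p in self.vertices]
        ys = [p.y for p in self.vertices]
        return (min(xs), min(ys), max(xs), max(ys))

# A rough triangle over part of Seattle (coordinates are illustrative)
tri = Polygon([Point(-122.40, 47.55), Point(-122.30, 47.65), Point(-122.20, 47.55)])
print(tri.bounding_box())  # (-122.4, 47.55, -122.2, 47.65)
```

A conventional database sees only rows of numbers here; a spatial database understands that these rows describe shapes and can index and query them as such.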

However, these platforms aren't optimized to handle streaming sensor data with spatial and time-series attributes, and they cannot provide the scale and performance needed to analyze this type of data in real time for immediate action.

First, you need a database engine underneath the platform that can continuously index and store high-velocity sensor data at wire speed while being queried in real-time. Many valuable IoT data sources individually generate tens of gigabytes of complex records per second without interruption, and many applications of that data combine multiple data sources. These records must be parsed, processed, indexed, and stored at these rates if they are going to be analyzed.
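The parse, process, index, and store stages described above can be sketched as a toy ingestion loop. To be clear, this is a single-threaded illustration of the stages only; a real engine of the kind described runs these stages in parallel at wire speed while serving queries concurrently. The record fields and the one-degree grid bucketing are assumptions made for the example:

```python
import json, time

def ingest(stream):
    """Minimal parse -> process -> index -> store loop over raw JSON records."""
    index = {}   # coarse spatial bucket -> list of record ids
    store = []   # append-only record log
    for raw in stream:
        rec = json.loads(raw)                        # parse
        rec["ingested_at"] = time.time()             # process / enrich
        bucket = (int(rec["lon"]), int(rec["lat"]))  # index by 1-degree cell
        index.setdefault(bucket, []).append(len(store))
        store.append(rec)                            # store
    return index, store

records = ['{"lon": -122.3, "lat": 47.6, "v": 1}',
           '{"lon": -122.4, "lat": 47.6, "v": 2}']
index, store = ingest(records)
print(len(store))  # 2
```

The point of the sketch is that every record pays the full pipeline cost on arrival; sustaining tens of gigabytes per second means each of these stages must be engineered for throughput rather than bolted on afterward.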

Second, you need a geometry engine — the software that computes the mathematical relationships between shapes — that has correctness, precision, and performance suitable for large-scale geospatial analytics rather than the looser requirements of making maps. Events in data streams can be correlated and contextualized across diverse data sources in terms of when and where they happen because reality is fundamentally organized around space and time relationships. The data typically involves complex geospatial geometry such as the paths people take or polygons from weather sensors.
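What a geometry engine computes can be illustrated with the classic ray-casting point-in-polygon test, one of the simplest of the "mathematical relationships between shapes" mentioned above. This is a minimal Python sketch, not SpaceCurve's implementation, and production engines must also handle numerical precision at the edges, which this toy version ignores:

```python
def point_in_polygon(px, py, vertices):
    """Ray-casting test: count how many polygon edges a ray going
    right from (px, py) crosses; an odd count means 'inside'."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```

Large-scale geospatial analytics runs predicates like this billions of times against streaming data, which is why correctness and performance matter far more here than they do for drawing a map.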

Third, you need to process in “real” real time. True real-time is characterized by the amount of time that passes between data being available for ingestion and that data being reflected in an application. Machine-generated spatial data sources tend to be extremely high velocity and high volume, far beyond the design assumptions of most big data platforms, requiring data architectures that can continuously ingest millions of complex records per second and store them to disk. Furthermore, these data models tend to be operational, requiring that these real-time data sources be immediately analyzable concurrent with that ingestion. IoT data applications are often for an operational environment where the realizable value from data is highly perishable.

J. Andrew Rogers, founder and CTO of SpaceCurve, solved these issues. Most software engineers assume today, as he did a decade ago, that designing a scalable geospatial database is a straightforward task. As it turned out, in 2005 the computer science required did not exist. The typical methods for representing dynamic geospatial data, from R-trees to "hyperdimensional hashing" to space-filling curves, were all invented in the 1980s or earlier. If scalable geospatial databases could be built that way, someone would have done so; Rogers had incorrectly assumed that no one had tried.
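To give a flavor of those 1980s-era techniques: a space-filling curve maps 2-D coordinates to a single 1-D key so that spatial data can be stored in an ordinary ordered index. Here is a minimal Z-order (Morton) encoding in Python, shown only to illustrate the idea and its known weakness, not as anything SpaceCurve uses:

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of integer grid coordinates x and y
    into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions from x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions from y
    return key

# Nearby cells in 2-D get (mostly) nearby keys in 1-D
print(morton_encode(0, 0))  # 0
print(morton_encode(1, 0))  # 1
print(morton_encode(0, 1))  # 2
print(morton_encode(1, 1))  # 3
```

The "mostly" is the catch: the curve periodically jumps across space, so range queries shatter into many key ranges, which is one reason these classic methods fall short at the scales described above.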

But finding no recent literature, he went further back, to the 1970s (the 1970s!), and bought every forgotten piece of literature on spatial indexing, spatial databases, parallelization and scalability, and data structures and algorithms, then spent months analyzing each roadblock, solving them one by one.

By 2007 he had the framework of a solution. But he was ahead of the market. Then serendipity! In 2009 Rogers discovered similar work from the 1980s, in which the author had gotten as far as creating the super-fast algorithms needed to process this amount of data, but had stalled on how to index complex data sets such as rectangles, intervals, and hyperrectangles.

Rogers realized these were completely new classes of data types, requiring an entire theory, along with a set of algorithms and data structures, for indexing these interval data types, all of which he invented. He had been ahead of the market while completing his work.
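To see why intervals resist conventional indexing, consider the basic question an interval index must answer: which stored intervals overlap a query interval? A B-tree orders data by a single value, but an interval has two endpoints that both matter. The naive Python sketch below answers the question by scanning; the point of the specialized theory mentioned above is precisely to avoid this scan at scale (this sketch is not that theory, just the problem statement in code):

```python
class IntervalSet:
    """Naive interval index: intervals sorted by start point."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)

    def overlapping(self, lo, hi):
        # Interval (a, b) overlaps (lo, hi) iff a <= hi and b >= lo.
        # Sorting by start lets us stop once a > hi, but intervals that
        # start early and end late still force a long scan.
        out = []
        for a, b in self.intervals:
            if a > hi:
                break
            if b >= lo:
                out.append((a, b))
        return out

ix = IntervalSet([(1, 3), (2, 6), (8, 10), (15, 18)])
print(ix.overlapping(4, 9))  # [(2, 6), (8, 10)]
```

The same overlap question generalizes to rectangles and hyperrectangles, one interval per dimension, which is why solving interval indexing unlocks the complex geospatial data types discussed earlier.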

While working as one of the Google engineers on the first iterations of Google Earth in 2006, Rogers was on a team tasked with creating an efficient way to add "layers" onto the screen that could show what the weather was like in a given area, how the traffic was moving, or what people were saying on social media. Building this required an infrastructure that could index every data source in reality, in real time. This presented a plethora of computer science problems that had vexed the industry since the 1970s.

Google was trying to bring live sensor data into these analyses. There was a big open question at the time: how could you analyze the real world in the same way we were analyzing relationships in the virtual world?

At the time, Google had tried and exhausted its GIS systems (a combination of MySQL, PostGIS, and BigTable databases). While those technologies may allow you to manipulate and analyze upwards of 100 million objects at a time, Google saw that it would need to ingest and index trillions of objects. They brought in experts from academia to advise them, but hit the same roadblocks computer scientists had hit in the 1970s. Then, just like those researchers, they gave up.

But not Rogers. Nearly ten years and $16 million in venture capital later, SpaceCurve came to market in Q4 2014 with the first spatial data platform that enables truly real-time common operating pictures and unprecedented speed-to-value from fused complex geospatial, sensing, social media, and other streaming and historical data.

Today SpaceCurve enables organizations to process, fuse, and act upon satellite imagery, social media, weather, and cell phone telemetry data. No big data system built to date can act on this magnitude of time-series data in anything close to real time (cliché alert: you've heard the phrase "real time" before, but this is true real time, built on brand-new algorithms).

