
Just Give Me My Data, Dammit!

In this special guest feature, Lloyd Tabb, Looker Chairman, Founder and CTO, discusses a number of trends playing out in the data space, from SaaS, to the data “share” house, to latency, to data weight. It’s a fun ride brought to you by an industry veteran. Lloyd has spent the last 25 years revolutionizing how the world uses the internet and, by extension, data. Originally a database and languages architect at Borland, Lloyd founded Commerce Tools, which was acquired by Netscape. As the Principal Engineer on Netscape Navigator Gold, he led several releases of Communicator and helped define Mozilla.org. Lloyd was later CTO of LiveOps, co-founder of Readyforce, and founder of and then advisor to Luminate. He combined his passion for data, love of programming languages, and commitment to nurturing talent when he founded Looker.

Once upon a time, enterprise applications stored data in their own way. Each enterprise application had its own file formats and data storage mechanisms. This Tower of Babel was a bad thing: it was impossible to relate data in one of your applications (like your finance system) to data in another (like your customer relationship management system).

Giant businesses, like Oracle, were built on creating a shared database that both of these applications could use. Huge value came from sharing data between applications.

Believe it or not, this revolution is only just beginning to recur in the cloud.

Back to the Future

Welcome to the world of SaaS, where each vendor holds your data. Each vendor expects you to use their API to copy out the portions of the data you care to share and relate. Unfortunately, this mechanism sucks for two reasons: 1) data has weight, so moving it is slow; and 2) latency: by the time you read the data, it is out of date.

Welcome to the Revolution!

The good news is that getting access to your SaaS application data is changing. Instead of having to create some connector that reads data from the SaaS application’s API and writes it into your data warehouse, SaaS applications are starting to publish all of this data, kept up to date, into places like S3, Google Cloud Storage, and BigQuery.

  • Amazon is publishing all their billing data into S3 and keeping it up to date in real time.
  • Google is publishing AdWords data, Google Cloud audit data, Google Analytics data, and more directly into Google Cloud Storage and BigQuery.
  • Companies like Segment are designed from the ground up for delivering an application AND the data to your data warehouse.

The ‘Share’ House

The shareable database trend is all the hype: Snowflake Computing announced its Data ‘Share’ House, a really nice pun on data warehouse. Google BigQuery can read data from Google Cloud Storage, Google Drive, and Google Sheets. Amazon’s Athena is built entirely on S3, a shareable medium. Amazon Redshift introduced Spectrum, a way to read data directly from S3. The design point of all these systems is to access data without moving it. Why are all these really smart database folks focusing on this? For the same two reasons: latency and weight.

The State of the Art

The current state of the art for most SaaS applications, as it relates to your data warehouse, is to run some kind of application that reads from the SaaS service’s API and writes the data into your data warehouse. There is first a full scan that reads the data, and then some kind of polling system that tries to keep it up to date. Everyone who wants the data from their SaaS application builds some implementation of this against the API, as though everyone who wants a car had to build it from a kit. Besides the insane level of reproduced work – how many programmers have implemented some script to read data out of Salesforce? – there are many, many versions of this work that are poorly constructed.
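To make the pattern concrete, here is a minimal Python sketch of the “scan, then poll” connector everyone keeps rebuilding. The `FakeSaasApi` class is a hypothetical stand-in for a vendor API (its method names are invented for illustration, not Salesforce’s actual interface), and the “warehouse” is just a dict, not a real database.

```python
class FakeSaasApi:
    """Hypothetical vendor API supporting full scans and incremental reads."""

    def __init__(self):
        self._records = {}
        self._seq = 0  # monotonically increasing change cursor

    def upsert(self, record_id, fields):
        self._seq += 1
        self._records[record_id] = {**fields, "_seq": self._seq}

    def full_scan(self):
        return list(self._records.items())

    def changes_since(self, cursor):
        """Return only the records modified after the given cursor."""
        return [(rid, rec) for rid, rec in self._records.items()
                if rec["_seq"] > cursor]


class Connector:
    """Copies SaaS records into a local 'warehouse' table."""

    def __init__(self, api):
        self.api = api
        self.warehouse = {}
        self.cursor = 0

    def initial_scan(self):
        # First pass: read everything.
        for rid, rec in self.api.full_scan():
            self.warehouse[rid] = rec
            self.cursor = max(self.cursor, rec["_seq"])

    def poll(self):
        # Subsequent passes: "give me any new changes."
        for rid, rec in self.api.changes_since(self.cursor):
            self.warehouse[rid] = rec
            self.cursor = max(self.cursor, rec["_seq"])
```

Every team that wants its Salesforce (or Zendesk, or Marketo) data ends up writing some variation of this loop, plus the error handling, schema mapping, and rate-limit logic a real API demands.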

It would be one thing if the APIs were similar and consistent, but they aren’t. Each API has its own idiosyncrasies, and each changes over time. Keeping up with the changes and keeping the data flowing is a constant struggle.

Latency

Even when this method works, by the time you go to read from your copy of the replicated Salesforce data, it is out of date. It has to be. The program has to wake up, say “give me any new changes to the Salesforce data,” and then copy those changes into your data warehouse’s copy of the Salesforce data. This is a custom, hand-coded form of replication that can only be as good as the SaaS vendor’s API, and there is always some time lag from when a change is made in Salesforce until it shows up in the database. In our data warehouse, the Salesforce import lag is about 30 minutes.
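As a rough back-of-the-envelope model (the numbers below are illustrative, not measurements from the pipeline described above), the worst-case staleness of polled replication is the poll interval, plus the time the copy itself takes, plus any lag inside the vendor’s API:

```python
def worst_case_staleness(poll_interval_s, copy_duration_s, api_lag_s=0.0):
    """Worst case for polled replication: a change lands just after a
    poll finishes, then waits a full interval, the copy itself, and any
    lag inside the vendor's API before it is queryable."""
    return poll_interval_s + copy_duration_s + api_lag_s

# e.g. polling every 30 minutes with a 2-minute copy:
print(worst_case_staleness(30 * 60, 2 * 60) / 60)  # prints 32.0 (minutes)
```

No amount of tuning removes this lag entirely; shrinking the poll interval just trades staleness for API quota and load.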

Weight

It is weird to talk about data having weight, but it most certainly does. Weight is the effort it takes to move data from one place to another. The clearest illustration of this came at last year’s AWS re:Invent, when Amazon unveiled a semi-trailer filled entirely with data storage. The idea was to offer a service in which Amazon would back the semi-trailer up to your data center, literally pump in your data, and drive it to the cloud. The point being, on-premises databases can be huge, and trying to move that data over a fiber connection could literally take years. Some data has so much weight that it takes a semi to move it.

Imagining (and Betting) on a Better World

Wouldn’t the world be better if your Salesforce or Zendesk or Marketo data were kept up to date someplace like S3, with the replication implemented by Salesforce or Zendesk or Marketo instead of by someone on your staff? You probably have a programmer on your staff whose job it is to move data around. Wouldn’t it be better if that person were doing analysis instead?

The database vendors are all thinking this way: “access data where it sits.” Some SaaS apps are starting to automatically share your data back with you. I think this trend will continue. Playing together is fun, but no one wants to play with a kid who won’t share his toys.

 

Sign up for the free insideBIGDATA newsletter.
