
As a Data Scientist, Do You Have to Start Worrying About Blockchain Any Time Soon?

Admit it: before you could get to grips with TensorFlow and Keras, the Data Science world had moved on to PyTorch. I am not even going back to the time when “R vs Python” was still a debate; that almost seems a lifetime ago. The mere introduction of a new library creates a demand for people “skilled” in it that completely changes the dynamics of the hiring wave. It doesn’t matter what else you are good at; if you haven’t got that one shiny (no R-pun intended) new thing that’s burning up the market, your resume often gets passed over by potential employers. Staying in tune with the times is just not enough; you have to fight to keep yourself ahead of the times. And that can be both tricky and hard. That one thing you were betting on may not end up being the next market mover. Anticipating the next big money-making skill-set can take a lot of expert research. And it does not end there. Not every new technology lends itself to the broader domain in which you apply your skill-sets. It is one thing to learn a new library or a data streaming application; it is a whole other thing to pick up a never-before-used lower-level programming language. But when you hear something repeated as frequently and as religiously as the term Blockchain is these days, you cannot help but wonder – am I immune to its onslaught?

When over 50 financial services firms (according to CBInsights charts) or their strategic investment arms have invested in a Blockchain-specific startup since the start of 2014, and when, according to Fortune, about 40 of the top financial companies in the world are currently experimenting with Blockchain, will I end up being a dinosaur for not paying enough attention to it? And that in turn raises the question: I want in on the action, but as a Data Scientist, where do I even start?

Well, for all the uses that Blockchain has currently been put to, whether in reinventing currency or in redefining finance, at its very core Blockchain is just a basic substratum for computing. You can therefore look at it as a large-scale data management mechanism that can handle billions of records, only far more secure than just about anything else out there. Put very simplistically, you have to be able to extract the data from this new data management mechanism before you can perform any kind of cognitive work on it. You can probably look at it as the biggest open data set that will be available for ML/AI in the near future, given that Blockchain technology usage is expected to expand beyond financial transactions and smart contracts to energy consumption, autonomous vehicles, smart cities, and much more.

How soon can you expect adoption to spread across industries?

Well, there’s enough noise being made about the unfair competitive advantage that data oligopolies like Google, Facebook, and Amazon have. Given that Blockchain’s entire premise is to break up data oligopolies by democratizing data through its distributed registries, which are accessible to everybody, adoption across industries is probably going to be much quicker than you think! True, in its current layout data storage within Blockchain is super-expensive, so that can be a little prohibitive for companies at first, but there seem to be quite a few workarounds for that already. Plus there are the brownie points it gets for its security protocols. The powerful encryption algorithms in the distributed ledger, which is replicated on thousands of computers around the world, sort of make it a no-brainer, wouldn’t you say?

So now that we have sort of agreed on the inevitability of this technology (call it a data management mechanism, if you want to) taking over, we are back to our original question: “As a data scientist, where do I start?” Let’s take stock of where Data Science in the world of Blockchain stands today. At this point in time, the biggest area in which data science has been applied to the world of Blockchain is the AI algorithms used by crypto-currency trading bots. You may have heard of BTC Robot or Crypto Trader? Their Artificial Neural Network (ANN) based algorithms perform automatic pattern charting for crypto markets. When you hear ANN, that sounds like familiar territory, right? But before you can get all fancy with your ANNs, you probably need to learn how to access all that data.

It really comes down to the two components that sum up all big data analytics approaches: the first is the Data Engineering component and the second is the Data Science component. The Data Engineering component will largely depend on where the data is stored. As mentioned earlier, Blockchains today are not designed to store a lot of data. Storage is super expensive because it is maintained by thousands of peers/devices. So the first step is to learn how to handle access to this data using autonomous contracts and store it on decentralized clouds like IPFS or Storj. Then get it to a centralized server to prepare the data for analysis and insights. The data in Blockchain is stored in a custom binary format that is a little tricky to untangle without the help of existing APIs. Of course, over time, if you are so inclined, you can learn to build your own APIs. And if you are looking for something that will pull all that data without any advanced coding, it’s probably time to start getting yourself acquainted with “Blockchain ABE”. If you find yourself already overwhelmed by the mention of binary formats and APIs, breathe easy: you may never really have to go that far upstream in the process!
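For those who do want a peek upstream, here is a minimal sketch of what untangling such a binary format can look like. It assumes the 80-byte block header layout used by Bitcoin (version, previous block hash, Merkle root, timestamp, difficulty bits, nonce, all little-endian); other Blockchains use different layouts. The function name parse_block_header and the sample header are hypothetical and built in memory purely for illustration, not taken from any real chain or API.

```python
import hashlib
import struct

# Assumption: Bitcoin-style 80-byte header, little-endian fields.
# version (4) | prev block hash (32) | merkle root (32) | time (4) | bits (4) | nonce (4)
HEADER_FORMAT = "<I32s32sIII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 80 bytes


def parse_block_header(raw: bytes) -> dict:
    """Decode a Bitcoin-style block header into named fields (illustrative helper)."""
    if len(raw) != HEADER_SIZE:
        raise ValueError(f"expected {HEADER_SIZE} bytes, got {len(raw)}")
    version, prev_hash, merkle_root, timestamp, bits, nonce = struct.unpack(
        HEADER_FORMAT, raw
    )
    # A block's identifier is the double SHA-256 of its header bytes.
    digest = hashlib.sha256(hashlib.sha256(raw).digest()).digest()
    return {
        "version": version,
        # Hashes are conventionally displayed byte-reversed, as hex.
        "prev_block": prev_hash[::-1].hex(),
        "merkle_root": merkle_root[::-1].hex(),
        "timestamp": timestamp,
        "bits": bits,
        "nonce": nonce,
        "block_hash": digest[::-1].hex(),
    }


# Hypothetical header packed in memory for the demo; not real chain data.
sample_raw = struct.pack(
    HEADER_FORMAT, 1, bytes(32), bytes(range(32)), 1231006505, 0x1D00FFFF, 42
)
header = parse_block_header(sample_raw)
print(header["version"], header["nonce"], header["prev_block"][:16])
```

Once headers and transactions are decoded into plain records like this dictionary, they can be loaded into whatever analysis tooling you already use.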

The three scenarios you will likely be dealing with for data engineering and extraction are 1. peer-to-peer file systems, 2. decentralized cloud storage systems, and 3. distributed databases. The third is probably going to be the one that ends up being most popular, and it is likely to take the form of distributed NoSQL databases along the lines of MongoDB, Apache Cassandra, RethinkDB and so on. A lot of you may already have some familiarity, if not a certain level of proficiency, with NoSQL databases. If not, given how long NoSQL databases have been around, there are plenty of resources and training available on them.

When it comes to the Data Science knowledge and tools needed for this, you probably already have them in your arsenal. What’s going to be important is learning how to apply them to these new scenarios and the new kinds of transactions at hand. Hence understanding the business use cases of Blockchain technology, whether in cryptocurrency or elsewhere, will be key. For example, unless you understand how Point-of-Sale (POS) transactions work, you cannot possibly build a transaction fraud model in retail. Similarly, if you are trying to analyze real-estate transactions occurring on Blockchain, you cannot do it effectively unless you understand how transactions are requested, how they are sent, how they are stored in blocks, and how they are validated in sequence. So immersing yourself in the flow and workings of the technology would be a good place to start strengthening your fundamentals.

At its core Blockchain is very simple, and not quite the unknown monster it is made out to be. Think of blocks as tables and transactions as records. Except that the blocks are linked (each block contains the hash of the previous block) and each transaction references the output of a previous transaction. And every record is immutable! That wouldn’t be enough to scare you off, would it? At least there’s some ANN in there to ease you into it!
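The "blocks as linked tables" picture can be sketched in a few lines of Python. This is a toy model only, under stated assumptions: blocks are plain dictionaries hashed via their JSON form, and the helper names (block_hash, append_block, is_valid) and the sample transactions are hypothetical. Real Blockchains use their own serialization and add consensus rules that this sketch leaves out.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """SHA-256 over a deterministic JSON serialization of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Add a block that stores the hash of the block before it."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append(
        {"index": len(chain), "prev_hash": prev_hash, "transactions": transactions}
    )


def is_valid(chain: list) -> bool:
    """Check that every block still points at the true hash of its predecessor."""
    return all(
        current["prev_hash"] == block_hash(previous)
        for previous, current in zip(chain, chain[1:])
    )


# Hypothetical transactions, for illustration only.
chain = []
append_block(chain, [{"from": "alice", "to": "bob", "amount": 5}])
append_block(chain, [{"from": "bob", "to": "carol", "amount": 2}])
append_block(chain, [{"from": "carol", "to": "dave", "amount": 1}])
valid_before = is_valid(chain)

# Quietly edit an old record: the next block's stored hash no longer matches.
chain[0]["transactions"][0]["amount"] = 500
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

This is where the immutability comes from in practice: changing an old record changes that block's hash, so every later block would have to be rewritten for the tampering to go unnoticed.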

About the Author

Contributed by: Smita Adhikary, Managing Consultant at Big Data Analytics Hires – a talent search and recruiting firm focused primarily on Data Science and Decision Science professionals. Having started her career as a “quant” more than a decade ago, building scorecards and statistical models for banks and credit card companies, and having spent many years in management consulting, she has witnessed at very close quarters the transformation that the advent of “Big Data” has brought about in the skill-sets desired in “quants.” Like most “quants” she holds a Masters in Economics, and like a lot of management consultants, an MBA from Kellogg School of Management.




