In this special guest feature, Koichi Fujikawa of FlyData Inc. discusses the current role of big data in back-end infrastructure and how big data can be utilized in the modern world of business. Koichi Fujikawa is the Founder of FlyData Inc., a cloud-based data integration company.
Whether or not you know it, data dictates our lives. From GPS traces to personal behavior, our experiences are transformed into strings of data that are then used by businesses, governments, and organizations. Insights derived from data analytics are already well integrated into our daily experiences, but such a proliferation of data came about only recently. As our world becomes increasingly digital, torrents of information are being created, processed, and stored. This sets the stage for “big data,” a concept that is already guiding many facets of our lives.
One can say that Moore’s Law, the observation that the number of transistors on a chip (and with it, computing power) roughly doubles every two years, indirectly predicted this rapid growth in data: advanced processors can now record more complex forms of information, such as pedestrian traffic, weather patterns, and particle collisions. What was arguably not expected is the even more rapid increase in the amount of data that is collected and needs to be processed today. Unfortunately, the more complex the data, the more distributed and scattered it tends to be, and in many cases it might not even be stored in the appropriate place. Both these advancements and these problems translate directly into the fields of information technology and business intelligence, where data of this complexity is processed on a day-to-day basis. Despite the difficulty of utilizing big data, it is very beneficial for the modern company: it can provide insights into our markets and a larger perspective from which to understand the people who use our products.
Back-end infrastructure such as the servers that enable rapid collection and processing of data is critical to most companies. When these systems store complex data such as logs, financial records, and user behavior, they become a treasure trove of information that can guide product development and company direction. To derive insights, engineers and analysts must request, or “query,” this data. With the advent of big data, which differs from ordinary data in size, velocity, and type, querying is beginning to require considerable effort and time. Once the amount of data reaches a few terabytes, it becomes a problem for organizations without the IT resources or staff to handle the load: it can take hours, and in some cases a whole weekend, to compile the data and create a report. This bottleneck ultimately results in lost opportunities and decisions made on stale, obsolete data. With columnar storage, query-optimized processing, and SQL compatibility, services such as Amazon’s Redshift are taking the lead, offering cheaper and easier access to big data querying and analytics. For those who would rather not build servers on-premises, nor spend the money to maintain, scale, and monitor them, this kind of convenience may be exactly what is needed. In essence, these services let you process large amounts of complex data as soon as it is loaded.
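Why columnar storage helps analytics can be shown with a toy sketch in pure Python (this is only an illustration of the idea; real engines like Redshift add compression, zone maps, and distributed execution on top). An aggregate over one column of a column-oriented table reads a single contiguous array, while a row-oriented layout forces a scan through every field of every record:

```python
# Toy illustration of row-oriented vs. column-oriented storage.

# Row-oriented: each record is stored together.
rows = [
    {"user_id": 1, "country": "JP", "revenue": 120.0},
    {"user_id": 2, "country": "US", "revenue": 75.5},
    {"user_id": 3, "country": "JP", "revenue": 42.25},
]

# Column-oriented: each field is stored as its own array.
columns = {
    "user_id": [1, 2, 3],
    "country": ["JP", "US", "JP"],
    "revenue": [120.0, 75.5, 42.25],
}

def sum_revenue_rowwise(rows):
    # Must walk every record (all fields) just to reach "revenue".
    return sum(r["revenue"] for r in rows)

def sum_revenue_columnar(columns):
    # Reads only the one array the query needs.
    return sum(columns["revenue"])

print(sum_revenue_rowwise(rows))     # 237.75
print(sum_revenue_columnar(columns)) # 237.75
```

Both layouts give the same answer; the columnar one simply scans far less data for a query like `SELECT SUM(revenue)`, which is why analytic warehouses favor it.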
Okay, so we have solved the problem of processing large amounts of data, but two questions that many analysts and engineers still pose are “where is the data?” and “is the data available right away?” These questions will continue to fuel the field of data analytics. The standard tool for querying data is “Structured Query Language,” or “SQL” for short, and traditionally, typical SQL queries are sufficient. With big data, however, it is commonplace for queries to slow down, often taking hours or even days to process a single day’s worth of data. Those working in data analytics and business intelligence can’t afford to slow down analysis, as the delay could ultimately affect the product and result in the loss of valuable opportunities. Amazon’s Redshift data warehouse addresses this with its parallel, batch-oriented processing. But what use is a high-performance database when your data isn’t there to begin with? You need to upload your data in a fast but consistent manner so that fresh information is available when you need to analyze it. Companies such as FlyData take care of this by replicating changes and handling the “Extract, Transform, and Load” (ETL) process: automatically (and continuously) uploading data from MySQL servers to Redshift and handling any errors along the way.
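The continuous-load idea described above can be sketched in a few lines of Python. The in-memory “source table” and “warehouse” below are hypothetical stand-ins for a MySQL server and a Redshift cluster, and the function names are illustrative, not FlyData’s or Amazon’s actual APIs:

```python
# Minimal sketch of incremental extract-transform-load (ETL).
# (Hypothetical stand-ins: lists instead of MySQL/Redshift.)

source_table = [
    {"id": 1, "event": "signup",   "ts": 100},
    {"id": 2, "event": "purchase", "ts": 105},
    {"id": 3, "event": "login",    "ts": 110},
]
warehouse = []        # plays the role of the Redshift table
last_loaded_id = 0    # replication checkpoint

def extract(since_id):
    # Pull only rows created after the last checkpoint.
    return [r for r in source_table if r["id"] > since_id]

def transform(rows):
    # Example transformation: normalize event names.
    return [{**r, "event": r["event"].upper()} for r in rows]

def load(rows):
    warehouse.extend(rows)

def sync():
    # One replication cycle; a real service runs this continuously
    # and retries on errors instead of crashing mid-load.
    global last_loaded_id
    new_rows = transform(extract(last_loaded_id))
    load(new_rows)
    if new_rows:
        last_loaded_id = max(r["id"] for r in new_rows)
    return len(new_rows)

print(sync())  # 3 -- the first cycle loads everything
source_table.append({"id": 4, "event": "logout", "ts": 120})
print(sync())  # 1 -- later cycles load only the new row
```

Because each cycle checkpoints the last id it loaded, re-running `sync` never duplicates rows, which is what keeps the warehouse consistently fresh.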
I believe that the concept of “big data” is still relatively hard to grasp, and it will take much work to take full advantage of what it can offer. Technology grows at a rapid rate, continually opening new angles from which we can record data, and fortunately, companies are catching on. In time, we will see more services such as Redshift and Google’s Cloud Platform, making it even cheaper and easier to process and analyze large streams of data. For big data, the future looks promising.
Sign up for the free insideBIGDATA newsletter.