
An Overview of Hulu’s Data Platform

FIELD REPORT

If you want to get a true pulse of an industry, you should attend a relevant local Meetup group. Admittedly, living in Los Angeles I probably have a wider selection than most, given my proximity to the Silicon Beach boom area. Still, try your best to find a Meetup; you may be surprised.

Last night I attended the Los Angeles Hadoop Users Group (LA-HUG) meeting hosted by Shopzilla at their office on the border of West LA and Santa Monica (which is very convenient for me since it is in the same building as my gym). The topic for the evening was “An Overview of Hulu’s Data Platform,” presented by Prasan Samtani and Tristan Reid of Hulu. You can watch the presentation HERE (sorry, portions are without sound due to technical difficulties). About 100 attendees were on hand for pizza, beer, and technology. What could be better?

For those not familiar with Hulu, they are a streaming video service with 5.5 million subscribers, and 20 million unique visitors per month. The total amount of data they have currently is around 2-3 petabytes. Hulu viewers generate a tremendous amount of data: users watch over 400 million videos and 2 billion advertisements a month. Hulu’s data platform is definitely Big Data in every sense of the term.

The presentation was very well organized and informative. Prasan and Tristan went through their company’s data pipeline in surprising detail. They use the Cloudera Hadoop distribution as an integral part of their MapReduce-based data processing. Processing and analyzing that data is critical to the business, whether for deciding what content to invest in going forward or for convincing external partners of the superiority of Hulu as an advertising platform.

In the presentation, the Hulu developers provided an overview of the company’s entire data platform, from collecting and storing the raw event data (called beacons) to transforming it into a relational structure and performing analysis. They described how and for what purpose they use various technologies in the Hadoop ecosystem, such as MapReduce, HBase and Hive. The key focus of the talk was to describe how data flows through their pipeline, and how they’ve built a powerful tool-chain, both on top of and around Hadoop, to suit their business needs. They even have their own pre-processed language that’s translated into Java MapReduce code using their homegrown Beaconspec compiler. They also compared and contrasted their methods with those adopted by other companies seeking to perform similar tasks. I found it interesting that Hulu built a custom ad-hoc reporting tool for internal use. Called RP2, it is a reporting portal for pulling metrics and dimensions.
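To make the beacon-to-metric flow a little more concrete, here is a minimal sketch of the map and reduce steps in plain Python. It is only an illustration of the general MapReduce pattern, not Hulu's actual code or the output of their Beaconspec compiler. The tab-separated beacon layout, the event names, and the function names are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical raw beacons: one tab-separated event per line
# (event type, user id, video id). The real beacon format was not
# described at this level of detail in the talk.
RAW_BEACONS = [
    "playback_start\tu1\tv42",
    "playback_start\tu2\tv42",
    "ad_impression\tu1\tv42",
    "playback_start\tu1\tv7",
]


def map_beacon(line):
    """Map step: emit (video_id, 1) for each playback_start beacon."""
    event, _user, video = line.split("\t")
    if event == "playback_start":
        yield video, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_video_starts(lines):
    """Run the map step over every beacon, then reduce by video id."""
    pairs = (pair for line in lines for pair in map_beacon(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(count_video_starts(RAW_BEACONS))
```

On a real cluster the map and reduce steps would run in parallel across many machines, with Hadoop handling the shuffle between them; the single-process version above only shows the shape of the computation.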


After the presentation, Prasan and Tristan took some Q&A and they got a question that was best answered by their boss who happened to be sitting in the audience. I am a terrible judge of age, but the presenters seemed to be in their late 20s and surprisingly (maybe to me only) their “boss” was even younger! Big data is a very “young” industry after all. We learned that the Hulu data team began modestly with 3 developers, but now has 20. From all indications, Hulu is a significant player in the Hadoop user community and this talk documented the team’s command of big data technology.

Daniel, Managing Editor – insideBIGDATA

 
