Yahoo Releases the Largest-ever Machine Learning Data Set for Researchers


Yahoo Inc. (NASDAQ: YHOO) announced the public release of the largest-ever machine learning data set to the academic research community. With this release, the company aims to advance the field of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research.

“Many academic researchers and data scientists don’t have access to truly large-scale data sets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research, Yahoo Labs. “We are releasing this data set for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems.”

The Yahoo News Feed data set is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The data set stands at a massive ~110B events (13.5TB uncompressed) of user-news item interaction data, collected by recording the user-item interactions of about 20M users from February 2015 to May 2015.
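To put those figures in perspective, the short Python sketch below derives two rough averages from the numbers quoted above: the uncompressed size of a single event and the number of events recorded per user. It is a back-of-envelope illustration only. The constants are the approximate values from the announcement, the terabyte is assumed to be decimal (10^12 bytes), and the function names are made up for this example.

```python
# Approximate figures quoted in the announcement.
TOTAL_EVENTS = 110e9          # ~110B user-item interaction events
UNCOMPRESSED_BYTES = 13.5e12  # 13.5TB, assuming decimal terabytes (10**12 bytes)
USERS = 20e6                  # about 20M users


def bytes_per_event(total_bytes: float, events: float) -> float:
    """Average uncompressed size of one event, in bytes."""
    return total_bytes / events


def events_per_user(events: float, users: float) -> float:
    """Average number of recorded events per user."""
    return events / users


if __name__ == "__main__":
    size = bytes_per_event(UNCOMPRESSED_BYTES, TOTAL_EVENTS)
    per_user = events_per_user(TOTAL_EVENTS, USERS)
    print(f"~{size:.0f} bytes per event, ~{per_user:.0f} events per user")
```

Under those assumptions, each event averages a little over 120 bytes uncompressed, and each user contributes about 5,500 events over the collection period.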

“Yahoo’s release of the Yahoo News Feed data set is a significant contribution to the research community. Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case,” said Tom Mitchell, machine learning department chair, Carnegie Mellon University. “Here at CMU we’ll certainly be using it for our research.”

The data set provides categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, it includes the title, summary, and key phrases of each news article. Interaction data is timestamped with the user’s local time and also contains partial information about the device used to access the news feeds.

“Access to data sets of this size is essential to design and develop machine learning algorithms and technology that scales to truly ‘big’ data,” said Gert Lanckriet, professor, Department of Electrical and Computer Engineering, University of California, San Diego. “At the Jacobs School of Engineering at UC San Diego, it will directly and significantly benefit the wide variety of ongoing research in machine learning, artificial intelligence, information retrieval, and big data applications.”

About the Webscope program:

The data set is available as part of the Yahoo Labs Webscope data-sharing program, a reference library of scientifically useful data sets composed of anonymized user data for non-commercial use. The data set we are releasing today is governed by our commitment to safeguard our users’ privacy and follows our practice of protecting and anonymizing user data.

“At the UMass Amherst Center for Data Science we have broad interests in developing new methods for scalable analytics on a wide variety of big-data domains,” said Andrew McCallum, director of the Center and professor in the College of Information and Computer Sciences. “The release of this large Yahoo News Feed data set will be a tremendous asset for the academic research community, and for us at UMass particularly, given our major research activities in natural language processing, information retrieval, databases and computational social science.”




