As a technology journalist and data scientist, I frequently find myself at the intersection of two worlds. So imagine my delight when I heard about a new Kaggle machine learning competition sponsored by StumbleUpon – build a classifier to categorize webpages as evergreen or non-evergreen. Evergreen content is the jewel of journalism: content that is perpetually relevant. I’m confident the proposed algorithm would put a smile on the face of insideBIGDATA’s very able and experienced publisher Kevin Normandeau, who is always in pursuit of the greenest of evergreen content.
StumbleUpon is a user-curated web content discovery engine that recommends relevant, high-quality pages and media to its users, based on their interests. While some pages it recommends, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can be classified as either “ephemeral” or “evergreen”. The ratings StumbleUpon gets from its community provide strong signals that a page may no longer be relevant – but what if this distinction could be made ahead of time? A high-quality prediction of “ephemeral” or “evergreen” would greatly improve such a recommendation system. Many people know evergreen content when they see it, but can an algorithm make the same determination without human intuition? The mission of the contestants is to build a classifier that will evaluate a large set of URLs and label each one as either evergreen or ephemeral.
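For readers curious what such a classifier might look like in its simplest form, here is a minimal sketch in Python. It is only an illustration, not the method StumbleUpon or any contestant uses: it assumes each page has already been reduced to a short string of text, it swaps in a plain naive Bayes model with add-one smoothing, and the four training snippets, the `NaiveBayes` class and the `tokenize` helper are all hypothetical names and data invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Count documents per label and word occurrences per label.
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training data, made up purely for illustration.
train_texts = [
    "how to bake classic sourdough bread at home",
    "a beginner guide to stretching exercises",
    "election results announced last night",
    "breaking news storm closes roads today",
]
train_labels = ["evergreen", "evergreen", "ephemeral", "ephemeral"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("guide to bake bread"))
print(model.predict("breaking news today"))
```

A real entry would need far more training data and richer signals than a handful of words, but the shape of the problem is the same: learn from labeled pages, then predict a label for pages the model has not seen.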
The winner of the contest will receive $5,000. Top performers may also have the opportunity to interview remotely for an internship at the StumbleUpon office in San Francisco.