Free Data Set Archive Helps Researchers Quickly Find a Needle in a Haystack

Let’s say you’re doing research that requires millions of geotagged tweets. Or perhaps you’re a journalist who wants to map murders in Chicago from 2001 to the present. You need to find large spatio-temporal data sets — but where?

While there are hundreds of publicly available data sets, locating them can take months of searching. Even when potential sources are found, they rarely provide enough information for researchers to decide whether a set contains the kind of data they need without first downloading the often huge file and sorting through it.

Thanks to a computer scientist at the University of California, Riverside, finding the right data set is now as easy as bookmarking a website, and it costs absolutely nothing.

Ahmed Eldawy, an assistant professor of computer science in the Marlan and Rosemary Bourns College of Engineering, and his group spent the last three years combing the internet for public spatio-temporal data sets, studying their attributes, and summarizing the results for each set on interactive maps that show the user exactly what they’re getting.

“People who work on data science need data sets but can spend a lot of time finding them,” Eldawy said. “I wanted to build an archive they can find easily.”

Called the UCR Spatio-temporal Active Repository, or UCR STAR, the archive is made available as a service to the research community to provide easy access to large spatio-temporal data sets through an interactive exploratory interface. Users can search and filter those data sets as if shopping for their research, except that everything is free.

“The map interface visualizes the data, so you can see if it’s a good fit,” Eldawy said. “It’s like a catalog for data sets.”

At the heart of UCR STAR, the map provides an interactive exploratory interface for each data set. As with Google Maps or other web maps, users can zoom in and out and pan around to get a quick overview of the data distribution, coverage, and accuracy.

Important details are displayed once a data set is selected, such as the original homepage, a link to the original download source, size in bytes, number of records, file format, and other useful information. The subset download feature allows users to quickly download only the data in a given geographical region, which reduces the download size. Users can also embed a customized view on a webpage, share the link via social media, or bookmark it to revisit later.
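
The idea behind a subset download can be illustrated with a short sketch. The Python below is not UCR STAR's code or API; it simply assumes a region is given as a longitude/latitude bounding box and keeps only the records that fall inside it. The names `Point`, `BBox`, and `subset_by_region`, as well as the approximate sample coordinates, are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Point(NamedTuple):
    """A single geotagged record (hypothetical structure)."""
    lon: float
    lat: float
    label: str


class BBox(NamedTuple):
    """A rectangular region in longitude/latitude degrees (hypothetical structure)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def subset_by_region(points: Iterable[Point], box: BBox) -> Iterator[Point]:
    """Yield only the points that lie inside the bounding box (edges included)."""
    for p in points:
        inside_lon = box.min_lon <= p.lon <= box.max_lon
        inside_lat = box.min_lat <= p.lat <= box.max_lat
        if inside_lon and inside_lat:
            yield p


# Illustrative records with approximate coordinates, not real archive data.
records = [
    Point(-87.63, 41.88, "Chicago"),
    Point(-117.40, 33.95, "Riverside"),
    Point(-74.01, 40.71, "New York"),
]

# A rough box around part of Southern California (assumed values).
southern_california = BBox(-119.0, 32.5, -116.0, 34.5)

subset = list(subset_by_region(records, southern_california))
print([p.label for p in subset])
```

Because the filter discards everything outside the box before anything is saved, the result is only as large as the region of interest, which is why a regional subset can be much smaller than the full file.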
