An illustration of website pages

Illustration by Neha Kavan

A Search Engine for Datasets

Haiyan Jia and a team of Lehigh researchers work to create a dataset search engine prototype that allows users to find data online regardless of domain.

Story by

Stephen Gross

Illustrations by Neha Kavan

For most people, the task of searching for news, images or random facts is easy: Grab the nearest internet-connected device and consult Google.

It isn’t as simple for data journalists taking on the often difficult task of searching for datasets.

Haiyan Jia, an assistant professor of data journalism, notes several challenges to this type of search. One, she says, is that data journalists aren’t equipped with the skills needed because data journalism wasn’t part of the curriculum when they entered the field. Another is that no standards exist for publishing datasets online, so they often appear in different formats, many of them unsearchable. Sometimes, she says, a data journalist might find only the titles of datasets; in other cases, only metadata and, occasionally, text.

In an effort to improve the dataset search process, Jia has joined Brian Davison and Jeff Heflin, both associate professors of computer science and engineering, in developing a dataset search engine prototype that allows users of any discipline to find data online regardless of domain. Their three-year project is funded by the National Science Foundation.

Illustration with colors in the background and partial lettering reading “data search engine”

“We realized this is a huge challenge,” Jia says. “It’s an important issue to address because if people like data journalists cannot get data that they want, they cannot create quality news stories. And that’s really a hurdle for a well-informed society.”

The team is taking an approach that combines technical improvements and user perspectives to enhance dataset search. While Davison and Heflin’s work focuses on schema label generation and user interface design, Jia is studying different use cases and interviewing data journalists, scholars, professors, graduate students and librarians to compile as many scenarios as possible. That information will shape the design of a search tool prototype. Once the prototype is complete, the team will test its effectiveness to see whether it improves on the current process.

“We have to really think about the characteristics of datasets and see what factors are most important when we try to decide the relevance of the search results for users,” Jia says. “How can we make a better ranking of the results?”

Most importantly, Jia says, they have to address the issues of indexing search results. Dataset searches tend to be very specific. She uses the example of a data journalist trying to find how Bethlehem, Pa., residents of a specific gender, in a specific part of the city, aged 25 to 35, voted in an election. The journalist might be able to find the voting data divided by ward, she says, but to find out if the data includes the actual age of voters, they would have to enter the dataset and look at each cell.

“When you are putting all these justifications together, it’s hard to determine if the dataset actually contains what you’re looking for by just looking at the description or the title itself,” Jia says. “It’s a ton of work to find that dataset, download it to clean it up and then decide if it’s even usable. So can we, maybe at the search level, give users a sneak peek and understand whether the dataset is actually something useful for their project or their research?”

Or, she says, the users may be looking for a “Facebook for datasets.” In its latest survey study, the team found that, beyond highly relevant search results, users wanted to communicate with data creators and data providers, as well as other users, to learn more about the datasets and what stories or insights have been generated using them.
