Notes on Google Dataset Search

By Ian Mulvany, Head of Transformation, Product Innovation, SAGE Publishing

I’ve just got back from a fantastic workshop looking at infrastructure for research data discovery. I’ll blog about the workshop in due course, but I was asked to comment on Google Dataset Search (GDS). I had the chance to meet with Natasha Noy from Google, who is behind the service.

As with many Google services, it has been created by a small team, but with Google’s underlying web-scale infrastructure to build on top of. The team looks for datasets on the web that have been marked up with schema.org tags. Data repositories that expose these tags get indexed by GDS (this includes both Figshare and DataDryad).
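For illustration, the markup in question is usually a JSON-LD block embedded in a dataset’s landing page. A minimal, hypothetical example (the dataset, DOI and URLs here are invented, and real repository records carry much richer metadata) might look something like this:

```html
<!-- Hypothetical schema.org Dataset markup on a repository landing page -->
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example survey responses, 2018 wave",
  "description": "An invented dataset description, for illustration only.",
  "url": "https://example-repository.org/datasets/12345",
  "identifier": "https://doi.org/10.1234/example.12345",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": {"@type": "Person", "name": "A. Researcher"}
}
</script>
```

Anything exposing descriptions in this form, whether a large repository or an individual project site, is in principle discoverable by the crawler.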

Natasha said the following about the service:

There are thousands of data repositories on the Web, providing access to millions of datasets. Google’s Dataset Search provides search capabilities over dataset descriptions (metadata) in all of these repositories, enabling the users to find datasets, wherever they are. You can think of Dataset Search as “Google Scholar for datasets”. Dataset Search includes open-government datasets from many local and federal governments across the globe, a large number of repositories for scientific data, economics data, data for machine learning, and so on. It includes both commercial and noncommercial providers.

While any other group could in principle replicate this by extracting all schema.org-marked-up pages from something like Common Crawl, in reality there are probably only a handful of other organizations that might feel they have the mandate or the resources to do something like this. I could imagine Archive.org, Microsoft or Allen.ai having a crack at it. In the meantime, GDS is there, and it’s great that someone is doing this.
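To give a sense of what the core of such a replication would involve, here is a rough, illustrative sketch in Python: given the HTML of a crawled page, it pulls out the JSON-LD blocks and keeps anything typed as a schema.org Dataset. The page URL is hypothetical, and a real pipeline would of course work over Common Crawl’s WARC archives at scale rather than fetching individual pages.

```python
import json

import requests
from bs4 import BeautifulSoup


def extract_datasets(html):
    """Return schema.org Dataset descriptions found in JSON-LD blocks on a page."""
    soup = BeautifulSoup(html, "html.parser")
    datasets = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # malformed markup is common in the wild; skip it
        # A JSON-LD block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


if __name__ == "__main__":
    # Hypothetical landing page, for illustration only.
    html = requests.get("https://example-repository.org/datasets/12345").text
    for dataset in extract_datasets(html):
        print(dataset.get("name"), dataset.get("identifier"))
```

The extraction itself is the easy part; the hard parts, and the reason web-scale players have an advantage, are crawl coverage, deduplication and keeping an index of that size fresh.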

Peter Kraker has been vocal in his concerns about this project (see his post “Update on the #DontLeaveItToGoogle campaign” on his blog, Science and the Web). While I agree with Peter that there should be better support for research and discovery infrastructures, I’m not opposed to the existence of GDS. I had a chance to talk to Peter briefly about his views on this at the Force2019 conference, and my takeaway from that discussion is that he was less concerned about the potential for Google to capture researcher attention (I mean, surfacing these datasets is broadly a good thing) and more concerned that the existence of something like GDS could act as a suppressor of support for research and innovation in this space. The argument could go: GDS exists, so why should we fund anything like it? That is a fair point, but the overall landscape of how funding decisions get made is a complex one. As much as we might wish things to be different, we can’t wish away the existence of web-scale players.

What I like about what Natasha is doing is that she is asking repositories to use schema.org, which is an open standard and which, as I mentioned above, opens the door for others to replicate what they have done. A project like Hyphe shows that there are interesting open tools out there to address crawling, but in the end we are left having to compete for researcher attention, so I’m not totally convinced by the argument that, were GDS not to exist, we would be in a position to create an alternative that would gain both funding and researcher attention. I say that because of the very large amount of scar tissue I carry around from many years of trying to draw academics’ attention to the tools I have had the opportunity to work on.

One note of caution I would strike: how we choose what counts as a trust metric, and how we report back on how data is being used, are both increasingly critical aspects of the impact discussion, and are beginning to be built into how researchers are assessed. I don’t know whether dataset creators get any indication from GDS that their datasets are being used, but as the community starts to create standards and norms around this, I think it’s important to get all players to adopt them.

This post originally appeared on Ian Mulvany’s personal blog. You can find the original version here.

About

Ian Mulvany is Head of Transformation at SAGE Publishing. He helped set up SAGE’s methods innovation incubator, SAGE Ocean, following a lean product development approach. Previously he ran technology operations for eLife, was Head of Product for Mendeley and ran a number of early web 2.0 products for Nature Publishing Group. He is passionate about creating digital tools that support the research enterprise, and is interested in the interplay between different stakeholders that can lead to the sustainability of these kinds of tools. Follow him on Twitter @IanMulvany
