Last month, the US government announced that research articles and most underlying data generated with federal funds must be made public at no cost, a policy to be implemented by the end of 2025. This adds to other important measures: the European Union’s science funding program, Horizon Europe, already requires almost all data to be FAIR (that is, findable, accessible, interoperable and reusable). The motivation behind these data-sharing policies is to make data more accessible, so that others can use it both to verify results and to perform further analyses.
But simply putting these datasets online will not bring the expected benefits: few datasets will be truly FAIR, because most will be undiscoverable. What is needed are policies and infrastructure to organize the metadata.
Imagine you need to search for publications on a topic – say, carbon-recovery methods – but you can use only article titles (no keywords, abstracts or search terms). This is essentially the situation for finding datasets. If I wanted to identify all the deposited data related to carbon recovery, the task would be futile. Current metadata often contain only administrative and organizational information, such as the name of the investigator and the date on which the data were acquired.
Additionally, for scientific data to be useful to other researchers, metadata must meaningfully and consistently communicate the gist of the experiments – what was measured, and under what conditions. As a researcher who develops technology to help with data annotation, I find it frustrating that in the majority of fields, the metadata standards needed to make data FAIR don’t even exist.
Metadata for datasets generally lack experiment-specific descriptors; where present, they are sparse and idiosyncratic. An investigator searching the Gene Expression Omnibus (GEO), for example, might seek genomic datasets containing information about how a disease or condition manifests in young animals or humans. Performing such a search requires knowing how the age of individuals is represented – which in the GEO repository could be age, AGE, age (after birth), age (years) or dozens of other possibilities. (Often, such information is missing from the datasets altogether.) Because metadata are so ad hoc, automated searches fail, and investigators waste enormous amounts of time manually sifting through records to locate relevant datasets, with no guarantee that most (or any) can be found.
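The brittleness of searching over such inconsistent field names can be sketched in a few lines of code. This is a minimal illustration, not a real GEO client: the records and key variants below are invented for the example, and the pattern match only catches keys that happen to begin with "age" – any dataset using a different word, or omitting the field entirely, is simply lost.

```python
import re

# Match keys that begin with the word "age", in any capitalization.
# (Illustrative only: real repository records vary far more widely.)
AGE_PATTERN = re.compile(r"^\s*age\b", re.IGNORECASE)

def extract_age(record: dict):
    """Return the first value whose key looks like an age descriptor,
    or None if the record carries no recognizable age field."""
    for key, value in record.items():
        if AGE_PATTERN.match(key):
            return value
    return None

# Hypothetical records showing a few of the many ways "age" can appear.
records = [
    {"age": "6 weeks"},
    {"AGE": "12"},
    {"age (after birth)": "3 days"},
    {"age (years)": "2"},
    {"organism": "mouse"},  # no age field at all: invisible to the search
]

ages = [extract_age(r) for r in records]
print(ages)  # → ['6 weeks', '12', '3 days', '2', None]
```

With a community-agreed standard, this kind of guesswork disappears: there is exactly one field name to query.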
Some optimists assume that this problem can be solved by referencing datasets in published manuscripts, which at least include experimental details. But often, no published manuscript ever appears; and when one does, its descriptions are rarely sufficient to make sense of the data in the form in which they are deposited. Thus, metadata for datasets should be self-contained, and should follow community-accepted guidelines that list the key attributes of experiments.
Where metadata standards exist, technology can help. The CEDAR Workbench, developed by my group at Stanford University in California, offers a versatile approach to creating standardized metadata. CEDAR relies on machine-readable libraries of metadata-reporting guidelines and controlled terminologies adopted by specific disciplines, and it automatically generates forms that prompt those submitting data online to fill in metadata fields and to annotate datasets with all the experimental descriptors sanctioned by a given scientific community. (The tool has been used in fields as disparate as biomarker surveys and wind-power experiments.) The result is metadata that can be searched reliably and that use specific, consistent terms indicating what an experiment really was. (For example, there is only one way to denote “age”.) But the Workbench is useless in disciplines that lack basic metadata standards – which includes most scientific fields.
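The general idea behind such template-driven tools can be sketched simply: a machine-readable template encodes the community's reporting rules, and submissions are checked against it before deposit. The template format and field names below are invented for illustration – they are not CEDAR's actual schema or API.

```python
# A hypothetical machine-readable metadata template: each field carries
# the rules a community has agreed on (required? numeric? controlled terms?).
TEMPLATE = {
    "age": {"required": True, "type": "number", "unit": "years"},
    "organism": {"required": True, "allowed": ["Homo sapiens", "Mus musculus"]},
}

def validate(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, rule in TEMPLATE.items():
        if field not in metadata:
            if rule.get("required"):
                problems.append(f"missing required field: {field}")
            continue
        value = metadata[field]
        if rule.get("type") == "number" and not isinstance(value, (int, float)):
            problems.append(f"{field} must be a number (in {rule['unit']})")
        allowed = rule.get("allowed")
        if allowed and value not in allowed:
            problems.append(f"{field} must be one of {allowed}")
    return problems

print(validate({"age": 42, "organism": "Homo sapiens"}))  # → []
print(validate({"organism": "Canis lupus"}))
# → ['missing required field: age', "organism must be one of ['Homo sapiens', 'Mus musculus']"]
```

The hard part is not the validator – it is getting each discipline to agree on the template in the first place, which is precisely what is missing in most fields.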
If we really want to share data, we need to develop standards that can make data FAIR, and funding agencies must go beyond mere mandates. The Netherlands Organization for Health Research and Development (ZonMw) in The Hague, for example, runs workshops to develop simple metadata standards that its grantees can use. This process has already produced guidelines for reporting results related to COVID-19 and to antimicrobial resistance, and many more workshops are planned. As a condition of funding, ZonMw requires new grant recipients to use these standards. Other funders should adopt this more participatory approach, providing tailored assistance alongside their mandates. This would not only lead to better datasets for targeted research programs, but also create metadata standards that communities will actually want to apply.
Making the new international data-sharing mandates deliver truly FAIR data will come at a significant price: ZonMw workshop managers estimate that developing a single standard costs €40,000 (US$40,000), given the labor provided by workshop participants.
Scientists and their funders need to recognize that FAIR data require more than just a mandate – they require a huge investment. The research community must commit to creating discipline-specific standards for metadata and to applying them across the scientific enterprise.
The author declares no competing interests.