Metadata and the need for consistency

As powerful and useful as metadata are, they are not a quick fix for every data management issue: they only pay off when used correctly. Here we delve into accuracy and consistency issues with metadata, and some of the work being done to address them.


There are two main problems associated with metadata that stand in the way of data discovery and reusability: issues with accuracy and consistency.

Accuracy is self-explanatory: a mistake when entering metadata means you no longer have an accurate description of your data.

Inaccurate metadata cannot be reliably searched for, which can make retrieving the associated data significantly harder, or potentially impossible.

Consistency here refers to consistent metadata fields for similar data, and consistent vocabulary within a field.

Ideally, you want to describe your data with the same set of characteristics (fields) so you can query all of your data at once when performing a search.

At the same time, you want to always use identical strings to describe a feature of your data. For example, if you store half your photos with .jpeg extensions and the other half with .jpg, a search for .jpg will only return half of your data. An example with biological data is the species name: if you describe human data as “Homo sapiens” once and “human” another time, you are going to run into problems with data search later on.
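To make this concrete, here is a minimal sketch of normalizing metadata values against a controlled vocabulary before they are stored. The lookup tables and function names are our own illustration, not any official standard:

```python
# Illustrative only: these lookup tables are not an official controlled vocabulary.

CANONICAL_SPECIES = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}

CANONICAL_EXTENSIONS = {".jpeg": ".jpg", ".jpg": ".jpg"}

def normalise_species(value: str) -> str:
    """Map free-text species labels onto one canonical name."""
    try:
        return CANONICAL_SPECIES[value.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown species label: {value!r}")

def normalise_extension(path: str) -> str:
    """Rewrite a file extension to its canonical spelling."""
    for ext, canonical in CANONICAL_EXTENSIONS.items():
        if path.lower().endswith(ext):
            return path[: -len(ext)] + canonical
    return path

print(normalise_species("human"))           # Homo sapiens
print(normalise_extension("holiday.JPEG"))  # holiday.jpg
```

Rejecting unknown labels outright, rather than storing them as-is, is what keeps the vocabulary from silently fragmenting over time.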


Ontologies

You can use an ontology to ensure your metadata are consistent. An ontology is a structured collection of possible entries, together with their properties and the relationships between them. The aim of an ontology is to unambiguously define the terms you use to describe objects.

Ontologies are also a way to aggregate the knowledge you have about an object from many disparate sources and present it in a format that is easy to read for both humans and machines.
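As a toy illustration of that structure, here is a sketch in which each term carries properties and explicit relationships to other terms. The terms and fields are invented for the example:

```python
# Toy ontology: terms, definitions, and relationships invented for illustration.

ONTOLOGY = {
    "cell line": {
        "definition": "A population of cells maintained in culture.",
        "is_a": [],
    },
    "cancer cell line": {
        "definition": "A cell line derived from a tumour.",
        "is_a": ["cell line"],
    },
}

def ancestors(term: str) -> list:
    """Follow is_a relationships to collect every broader term."""
    found = []
    for parent in ONTOLOGY[term]["is_a"]:
        found.append(parent)
        found.extend(ancestors(parent))
    return found

print(ancestors("cancer cell line"))  # ['cell line']
```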

An example of an ontology is Cellosaurus, which describes cell lines used in biomedical research. Each entry contains several key pieces of information:

  1. Name: recommended name and synonyms.
  2. Origin: species, tissue, disease, details from the original paper, etc.
  3. Structured comments:
    • Common problems, such as contamination
    • Genetic modifications
    • Groups the cell may belong to, e.g. cell catalogs or cell panels
  4. Cross-references: links to other databases, e.g. other cell line catalogs or resources that use cell lines in their definitions, such as ChEMBL for the effects of compounds.

Tagging a research sample with an ontology term allows you to efficiently integrate with a multitude of other resources. For example, synonymous names are a common headache when searching experiment databases – a pain that’s quite easily solved by implementing an ontology, which allows searches to automatically return results for all synonyms.
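A sketch of how such synonym-aware search might work, using a hypothetical miniature ontology and record set:

```python
# Hypothetical miniature ontology and records, for illustration only.

ONTOLOGY = {
    "HeLa": {"synonyms": {"HeLa", "Hela", "HELA"}},
    "Homo sapiens": {"synonyms": {"Homo sapiens", "human"}},
}

RECORDS = [
    {"id": 1, "cell_line": "Hela"},
    {"id": 2, "cell_line": "HeLa"},
    {"id": 3, "cell_line": "HEK293"},
]

def search(term: str) -> list:
    """Return every record whose cell_line matches any synonym of `term`."""
    synonyms = ONTOLOGY.get(term, {}).get("synonyms", {term})
    return [r for r in RECORDS if r["cell_line"] in synonyms]

print(search("HeLa"))  # matches records 1 and 2, despite spelling differences
```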

The ontology can also be browsed to find cell lines that could be of interest. You could filter ontology terms by criteria like tissue and disease of interest, and then use the corresponding cell lines when searching for experiments in relevant repositories.
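A minimal sketch of that browsing workflow, with invented entries whose fields loosely mirror the Cellosaurus information listed above:

```python
# Invented entries modelled loosely on the Cellosaurus fields described above.

from dataclasses import dataclass

@dataclass
class CellLineEntry:
    name: str
    species: str
    tissue: str
    disease: str

ENTRIES = [
    CellLineEntry("LineA", "Homo sapiens", "lung", "adenocarcinoma"),
    CellLineEntry("LineB", "Homo sapiens", "breast", "adenocarcinoma"),
    CellLineEntry("LineC", "Mus musculus", "lung", "none"),
]

def browse(tissue: str, disease: str) -> list:
    """Filter entries down to a tissue and disease of interest."""
    return [e for e in ENTRIES if e.tissue == tissue and e.disease == disease]

for entry in browse("lung", "adenocarcinoma"):
    print(entry.name)  # candidate cell lines for repository searches
```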

Finally, ontologies such as Cellosaurus are fully open access, with a core group responsible for their curation. This means that anyone can suggest new entries, additions, or corrections, with quality controls in place to maintain consistency.

Maintaining consistency

In theory, maintaining consistency and accuracy sounds simple, and to an extent it is – at least at the start. When you first decide on a metadata vocabulary, you are looking at data you already have, so you can intuitively predict the sorts of fields and vocabulary you will want to use. But as additional data become available, a field you previously thought unnecessary (or that didn't exist at all) may become needed. Likewise, older vocabulary may no longer provide the granularity you require: terms may need to be split or merged. Managing this branching structure becomes more and more difficult over time.
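One illustrative sketch of such vocabulary evolution: an old umbrella term is split into finer-grained terms, and existing records are migrated with an explicit mapping so old and new data stay searchable together. Everything here is invented for the example:

```python
# Invented terms and mapping, for illustration only.

TERM_SPLITS = {
    # old umbrella term -> rule deciding which new, finer term applies
    "carcinoma": lambda record: f"carcinoma ({record['tissue']})",
}

def migrate(record: dict) -> dict:
    """Rewrite a record's disease term if its old term has been split."""
    disease = record["disease"]
    if disease in TERM_SPLITS:
        record = {**record, "disease": TERM_SPLITS[disease](record)}
    return record

print(migrate({"sample": "S1", "tissue": "lung", "disease": "carcinoma"}))
# {'sample': 'S1', 'tissue': 'lung', 'disease': 'carcinoma (lung)'}
```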

Mistakes are increasingly likely to slip in when you have more fields to fill and vocabulary terms to choose from, especially if your pipelines aren’t automated.

In everyday life these problems are often mitigated by the nature of the data we handle. Metadata are usually generated automatically by the services and capture devices we use. In most situations outside of research, it's not much of a problem if our metadata aren't perfect – needing to spend an additional five minutes searching for the particular piece of music you wanted to listen to isn't going to ruin your day. However, the stakes are a lot higher when data volumes, accuracy, and speed requirements increase to the levels required by researchers in industry.

Public data and metadata inconsistencies

In recent years there has been a great push to encourage scientists to publish their raw data, especially their genomic data, online. Most major journals require authors to deposit their sequencing data into one of the major genetic sequence archives.

GenBank, EMBL-Bank, and DDBJ are examples of repositories for processed sequence data, while the Sequence Read Archive (SRA) stores raw sequencing data. To get a sense of the scale of these databases: as of August 2019, GenBank stored more than 213 million sequences, ranging from individual viral genes to whole eukaryotic genomes, and has historically doubled in size every 18 months.

Conflicting standards

It’s important to note that while the major repositories are well-curated and deeply integrated, they do not accept all datasets or even all data types. Many other special-purpose repositories have emerged that capture, for instance, traditional low-throughput bench science data. The standards of metadata accepted here are less strict and the repositories place fewer restrictions on what data can be deposited. For instance, guidelines from the BBSRC, which is the lead UK life sciences funding agency for non-human studies, state:

“Data should be accompanied by the contextual information or documentation (metadata) needed to provide a secondary user with any necessary details on the origin or manipulation of the data in order to prevent any misuse, misinterpretation or confusion. Where standards for metadata exist, it is expected that these should be adhered to.”

The standard of metadata required is therefore left to the judgment of reviewers and funding body advisors, which will inevitably vary both within and between organizations. Database providers, scientists, and the funding institutions they all ultimately answer to have different roles when it comes to managing public research data.

Scientists are the ones who produce and use the data, so it is they who most often define metadata standards.

Yet, with fields of study so diverse in topic and scale, there are many competing and overlapping standards. Database providers have the most potential reach when it comes to enforcing metadata requirements, but they lack the resources to police them. Research funding agencies are in a much better position to ensure that the right metadata standards are used: they have access to the large and varied networks of scientists required to vet standards, and they have the power to enforce them through grant conditions. However, even here, the diversity of standards and individual use cases means that guidelines are not always strictly worded.

An effort to improve standards

The Minimum Information About a Microarray Experiment (MIAME) standard for recording and reporting microarray data, first described in a 2001 paper and currently the dominant standard for microarray experiments, shows that a different, better scenario is possible. It has six critical elements:

  1. Raw Data (Raw): data extracted from imaging files.
  2. Processed Data (Processed): final normalized data.
  3. Sample Annotation (Experimental Factor Value / Variables): what experimental conditions were each sample subjected to?
  4. Experimental design (Experimental Factor Value / Variables): why were these particular samples processed?
  5. Array design details (Platform): e.g. probe sequences and where they hybridize.
  6. Protocols (Protocol): what experimental protocols were used in the laboratory and for data processing?

As you can see, the standard outlines both data and metadata that should be recorded and subsequently reported. The creators of MIAME hoped that this consistent reporting would help to develop databases, public repositories, and data analysis tools.
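As a hedged sketch of what enforcing such a checklist could look like in practice, the snippet below checks a submission for the six MIAME elements listed above. The field names are our own shorthand, not an official schema:

```python
# Field names are informal shorthand for the six MIAME elements, not a standard.

MIAME_ELEMENTS = [
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_design",
    "protocols",
]

def missing_miame_elements(submission: dict) -> list:
    """Return the MIAME elements the submission has not provided."""
    return [e for e in MIAME_ELEMENTS if not submission.get(e)]

submission = {
    "raw_data": "scans/",
    "processed_data": "normalized.tsv",
    "protocols": "lab_and_analysis_protocols.txt",
}
print(missing_miame_elements(submission))
# ['sample_annotation', 'experimental_design', 'array_design']
```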

Another suggested format for managing metadata for consistency and adherence is MicroArray Gene Expression Tabular (MAGE-TAB). Designed for both data collection and exchange between databases and tools, MAGE-TAB aims to help scientists communicate functional genomics experimental data in a standardized manner. There are many potential fields in MAGE-TAB that can be filled, and it is often up to reviewers to decide whether submitted data sufficiently answer the questions set out in the rating criteria. Users of private or public repositories where such data are stored can choose to filter experiments based on raw metadata and/or on the reviewers' scores of submitted data.
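A MAGE-TAB submission pairs an investigation description (IDF) with a tab-delimited sample table (SDRF). The sketch below writes a minimal SDRF-like file; the column headers follow the SDRF Characteristics[...] convention, but the rows are invented:

```python
# Rows are invented; only the header convention follows the SDRF format.

import csv

HEADER = [
    "Source Name",
    "Characteristics[organism]",
    "Characteristics[disease]",
    "Array Data File",
]

ROWS = [
    ["sample1", "Homo sapiens", "adenocarcinoma", "sample1.CEL"],
    ["sample2", "Homo sapiens", "normal", "sample2.CEL"],
]

with open("example.sdrf.txt", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(HEADER)
    writer.writerows(ROWS)
```

Because the format is plain tab-delimited text, the same table can be filled in by hand in a spreadsheet or generated programmatically, which is part of what makes it practical for exchange between tools.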

Despite MIAME being a good effort to standardize data reporting, there are still inconsistencies in data standards and varying ability to enforce them. This makes the overall genomic data ecosystem somewhat disorganized, which in turn makes data discovery and reusability, by both humans and computers, increasingly difficult. This is in contrast with, for example, the environmental sciences, where metadata standards are defined more strictly (e.g. by NERC).

However, there is a real drive to improve this, and a great deal of work is being done to develop more robust and standardized systems for recording and reporting metadata.

If you’d like to learn more about metadata and their potential future in the omics industry, you can download our metadata eBook.


Related content:

> What are metadata and why are they so important to the omics industry?

> A FAIR genomic data and metadata management system