What are metadata and why are they so important to the omics industry?
Metadata sit at the core of our ability not only to put data into context, but also to find and access those data on demand. Here we explore how we define metadata and why they’re so important to those working in the world of omics.
What are metadata?
Metadata are data that provide information about data. Metadata exist to give data context. For example, a photograph is the raw data. Additional parameters, like the date and location the photo was taken, the color profile, lens, and aperture used, constitute the metadata.
In genomics or transcriptomics, metadata describe the sample that the DNA/RNA sequence was obtained from, e.g. the organism, the cell line, and the library-preparation method.
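As a minimal sketch, the metadata for a single sequenced sample could be kept as a small record alongside the sequence file. The class name, field names, and example values below are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical metadata record for one sequenced sample."""
    sample_id: str
    organism: str
    cell_line: str
    library_prep: str


# Example values are invented for illustration only.
record = SampleMetadata(
    sample_id="S001",
    organism="Homo sapiens",
    cell_line="HeLa",
    library_prep="poly-A selection",
)

# Convert to plain key/value pairs so the metadata can be stored with the data.
metadata_dict = asdict(record)
print(metadata_dict)
```

Keeping the record immutable (`frozen=True`) is one possible design choice here: it guards against the description of a sample being changed by accident after the data were generated.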
Why are metadata important?
Metadata allow you to make better use of your data. With Life Science Data, for example, metadata are essential to correctly interpret the data and carry out meaningful comparisons with data from other samples or studies.
There is almost never a good reason not to save metadata: they take up very little storage compared to the data they describe, and they offer several advantages.
Data reusability and reproducibility
You need metadata to reproduce your data correctly. For example, if you want to decode a picture with colors as they were intended, you need to use the right color profile (e.g. sRGB). If you want to reproduce DNA/RNA sequencing data or compare it with data from experiments in other studies, you need to know what cell line and experimental conditions were used, what treatments the cells or tissues were subjected to, etc.
Search for and find stored data
Say you want to find a picture of interest, perhaps from a holiday trip last year. Most likely, you’ve copied the picture over from your smartphone along with a hundred others. If you can remember the date the picture was taken, you can use that as a criterion in your search. Similarly, metadata allow you to find and retrieve your data by searching for experiments by criteria like the cell line or organism, a certain sample treatment, or by a specific person or department involved. As long as your metadata have been accurately input and saved, you can retrieve the associated data.
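To make the idea of searching by criteria concrete, here is a rough sketch that filters a handful of experiment records by exact metadata matches. The records, field names, and the `find_experiments` helper are all hypothetical and exist only for this example.

```python
# Hypothetical experiment records; every value is invented for illustration.
experiments = [
    {"id": "EXP-1", "organism": "Homo sapiens", "cell_line": "HeLa", "treatment": "DMSO"},
    {"id": "EXP-2", "organism": "Mus musculus", "cell_line": "NIH-3T3", "treatment": "DMSO"},
    {"id": "EXP-3", "organism": "Homo sapiens", "cell_line": "HeLa", "treatment": "doxorubicin"},
]


def find_experiments(records, **criteria):
    """Return the records whose metadata match every given criterion exactly."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Search by organism and cell line, without opening any of the underlying data.
hits = find_experiments(experiments, organism="Homo sapiens", cell_line="HeLa")
print([record["id"] for record in hits])  # ['EXP-1', 'EXP-3']
```

Exact matching only works when values are entered consistently: a record saved with "hela" or "H. sapiens" would be missed by this search.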
Integration
Metadata also provide an easy path to integrate your Life Science Data into your proprietary or public repositories. Repositories often rely on fields to be filled in by the researchers adding their Life Science Data. Metadata fields, when accurately and consistently entered into your repository, allow users and data management software to index, access, and recall the data entered.
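One way to support accurate and consistent entry is to check a record for missing fields before it is added to a repository. The sketch below assumes a made-up list of required fields; a real repository would define its own.

```python
# Assumed required fields for illustration; not taken from any real repository.
REQUIRED_FIELDS = ("organism", "cell_line", "library_prep")


def missing_fields(metadata):
    """List the required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


complete = {
    "organism": "Homo sapiens",
    "cell_line": "HeLa",
    "library_prep": "poly-A selection",
}
incomplete = {"organism": "Homo sapiens", "cell_line": " "}

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['cell_line', 'library_prep']
```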
Interpret data
You may be able to draw conclusions about the contents of the data without needing to access and analyze them directly, or you can use the full context to interpret the data correctly. If you took a lot of pictures around the time of sunset, you can make a reasonable guess that the weather was clear that particular evening. If you know your sample came from a cancer cell line, you can relate the data to a pathological state. In sequencing data, for example, you can expect to find more point mutations, structural variations, and chromosomal fusions.
Metadata in omics
The omics industry (i.e. genomics, transcriptomics, proteomics) is unusual when it comes to the data it uses. Data can usually be classified by the effort required to produce a unit and the reusability of those units: either the data are easy to produce and easy to reuse, or they are difficult to produce and difficult to reuse. Most high-volume industry data fall into one of these two categories.
The data are easy to produce and easy to reuse
A database of customer purchases for a store loyalty card can intuitively be broken down into rows. Each row is a unit within the database, which is easy to extract and manipulate. The same data could be filtered for a study into seasonal buying patterns or used to build detailed customer profiles. Likewise, each unit is produced automatically and with little intervention whenever a purchase is made by a customer. Data like these are easy to produce and re-use.
The data are an effort to produce and difficult to reuse
An example of this would be a complex simulation for a new piece of equipment. It takes thousands of person-hours to produce and many hours of processing time to run. Since accuracy is so important in such simulations, and the end result is often both complex and bespoke, the reuse potential of data like these is limited.
Omics seems to be an exception to the rule: data are an effort to produce, but in theory, easy to reuse.
The effort to complete a biological experiment is relatively high. For example, to find out the side effects of a new compound, researchers often employ differential gene expression experiments on cell lines. These are time- and resource-intensive to perform, but once you have the data, you can reuse them repeatedly.
Retaining the full experimental details used to generate the data is absolutely vital to making Life Science Data reusable, and this is why metadata in genomics are so important.
If you’d like to learn more about metadata and the role they play in the omics industry, you can download our metadata eBook.
Related content:
> Metadata and the need for consistency
> A FAIR genomics data and metadata management system