
Defining standards for FAIR data

11.11.22

Alright everyone, take your seats! In our last article, we went on a data trek and set up our metaphorical data tent - our data management solution. We also hiked around the idea that traditional data management approaches don’t really work for scientific experimental data. Fallen behind? Catch up here.

FAIR data?

During this outing, the term FAIR data also cropped up quite a lot - but what exactly does it mean?

For those less familiar, FAIR data means data that is Findable, Accessible, Interoperable and Reusable. Think of these four terms as guidelines or standards that must be upheld to ensure that data remains usable in the future. We’re going to find out just how to make the F and R of FAIR happen in our research, and how ineffective decisions can restrict access to the data and limit its usage.

There are a significant number of publicly available data warehouses - with shining examples like Gene Expression Omnibus (GEO) and ArrayExpress. Now, don’t get us wrong! These data portals have been crucial to the modern genomics environment, but they also highlight the critical negative patterns that prevent data from becoming FAIR!

How did data try to become FAIR?

Some data historians may already know this, but for the rest of us - it all began with the creation of public data repositories. At the dawn of omics technologies, these methods were accessible to only a few organizations within the scientific community, so the community decided that anybody publishing a scientific paper based on or including omics data must deposit the data itself in a public repository.

That’s great, I hear you say - but what does this data look like? Could researchers include only the data most crucial to the experiment - even a single line - and dump the rest? Not everybody is keen on other scientists being able to use their data for their own purposes - funding is competitive, after all - and so, in the early years, much of the data uploaded to public repositories was cryptic at best and complete gibberish at worst.

The MIAME (Minimum Information About a Microarray Experiment) standard was introduced in 2001. It summarized the basic requirements and guidelines that scientists must follow when describing their omics experiments: study design, sample source and preparation, data processing, and so on.

It’s important for data to have regulations, but regulations are malleable. Ever since MIAME’s inception, we’ve witnessed a struggle between two parties across the scientific community: those who would like to establish the most detailed and strict rules of data description, and those who prefer doing it in a relaxed manner to save time.

Now, when we look back, we can definitely say victory rests comfortably with the latter. For good reason and intention too - have a gander at this:

  1. A detailed experimental design is already described in the corresponding article’s Methods section - copying it into the repository is simply redundant!
  2. “Democratizing access to all data for everyone” is a noble vision, but that vision begins to sour when another scientific group can use the data you produced to make a more impactful discovery and eventually win a Nobel Prize?! Or, even worse, prove our research results wrong! The academic world is a competitive place, so it definitely makes sense to put as little description and information into the public domain as possible.
  3. Filling in sample descriptions is tedious work, so it is natural to spend as little effort on it as possible. That can lead to shortcuts such as putting “cancer” instead of “non-small cell lung carcinoma” in a sample annotation.

Overall, it might seem a good time-saving approach: the dataset is analyzed, the article is published, the data is archived. It’s all kept available, like a little, slightly limited library of data!

Then comes the day when you need to find something in that data for your own research, and the cryptic repositories of results available to you start to feel a little overwhelming. How on earth are you ever going to make sense of it all?

Let’s break it down.

Exemplar case

Let’s say we have a hypothesis that a biomarker we know of can differentiate between early stages of non-small cell lung cancer (NSCLC). Before we begin taking biopsies from real patients - planning, budgeting, and all that - it makes sense to test the hypothesis. Doing this with public data repositories sounds brilliant, right!?

Let’s do the same simple hypothesis test with the MIAME-compliant NCBI GEO, a leading data repository. Think of GEO as a city full of stops connected by bus routes - our job is to work out which bus to board to visit the right stops and gather as much relevant data as we can. We need to find samples of NSCLC from patients with early or developing cancer onset; and we want them to be, say, below 40 years old. We also need some control samples from healthy lungs. Oh! The data has to be all NGS-generated too! Got it? Let’s give it a go:

  1. First of all, search “non-small cell lung cancer”. We get 1050 datasets back as results. That’s quite a lot.
  2. Let’s next filter out all non-human data. Right. 954 results; still quite a few.
  3. How about we put in a filter for only RNA-seq data. 321! Better.
  4. And that’s about it for the general filters. There are, of course, filters that tell us which datasets annotate particular phenotypic fields:
    1. An “Age” field is annotated only for 21 datasets
    2. A “Tissue” field, which we hope to use to distinguish tumor from healthy samples (in free text, of course), is available for 116 datasets.
    3. And a “Tumor stage” only for five datasets.

The intersection of the three filters above gives us only two datasets to explore (if only LA intersections were that quiet). We now have a couple of datasets that could work for our research, but on the other hand we have just excluded 319 datasets that might have worked for testing our scientific hypothesis too, because the required fields could be annotated with different keywords, have no keywords at all, or be listed only in the original article among many other potential fields. Someone else’s lack of effort months or even years ago, compounded by GEO’s absence of manual review and curation, is coming back to bite us.
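For the programmatically inclined, the filtering funnel above can be reproduced against GEO’s Entrez interface. The sketch below uses NCBI’s public E-utilities; the Entrez field tags ([ETYP], [ORGN], [DataSet Type]) are assumptions based on common usage and may need checking against the live service, and hit counts will have drifted since this article was written.

```python
# Sketch: reproducing the manual GEO filtering steps via NCBI E-utilities.
# Field tags are assumptions; verify against the live Entrez documentation.
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_geo_hits(query: str) -> int:
    """Return the number of GEO DataSets ("gds") entries matching a query."""
    params = urllib.parse.urlencode({
        "db": "gds",        # the GEO DataSets Entrez database
        "term": query,
        "retmode": "json",
        "retmax": 0,        # we only want the hit count, not the IDs
    })
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        return int(json.load(resp)["esearchresult"]["count"])

# Step 1: all series mentioning the disease
step1 = '"non-small cell lung cancer" AND gse[ETYP]'
# Step 2: keep only human data
step2 = step1 + ' AND "Homo sapiens"[ORGN]'
# Step 3: keep only RNA-seq (high-throughput sequencing) series
step3 = step2 + ' AND "expression profiling by high throughput sequencing"[DataSet Type]'

# Calling count_geo_hits(step1), (step2), (step3) against the live service
# would reproduce the narrowing funnel described above.
```

Note that the queries only narrow on fields GEO indexes consistently; the “Age” and “Tumor stage” filters from the walkthrough would still leave us opening datasets by hand, which is exactly the problem.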

Much like how Dante exclaimed “I am the way into the city of woe”, this is where the nine circles of hell begin to unravel around our hypothesis: the grueling task of inspecting every dataset by hand. At this point MIAME compliance has done little for us, and we’re left doing all of this manually. Oh, and we should perhaps read the corresponding articles to hunt down the samples for our meta-analysis, so that our hypothesis test carries as little statistical bias as possible.

Alright, some more pessimistic people may now say “it is a public repository, nobody pays scientists to organize datasets”. However, the same mindset is being inherited across the industry. Data is an asset, not a one-off thing to be produced, analyzed and archived, only to be scattered across multiple file shares, local computers and storage systems. (Learn more about how to put an actual monetary value on your data in our article here.)

Making data easier to access

The example above illustrates that whilst having a sprawling, complex data portal is a gigantic step forward, it doesn’t fix everything overnight. Rome wasn’t built in a day, and the absence of further guidelines means data scientists cannot get the data they need in seconds - they still have to spend hours, days and weeks trawling through piles of irrelevant information.

This is honestly made all the more surprising by the fact that only a few components of a successful FAIR system matter:

  1. The Data model - the relationships between patients, samples, measurements and other experimental entities, previously discussed here.
  2. The Phenotypic and clinical data requirements. You might know this as “metadata”, as it describes each entity in a project. In the GEO example above, the inconsistent phenotypic data (like “age” and “tissue”) across datasets does not allow us to search effectively across the repository, instead requiring us to open each of the 319 excluded datasets to identify its content.
  3. Omics data standards. Making sure all omics data pass through the same processing pipelines, in order to provide consistent results and rules.

These guidelines are pretty straightforward, but seldom followed. The ultimate data storage system has to assist data custodians and curators in following the guidelines once they are agreed on, and everyone in the community needs easy access to legacy data. This can be achieved by introducing templates, which cover all three components: defining suitable data models for different departments and setting omics data requirements that simplify reusability.
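One way to picture such a template is as a small validation schema: required fields plus controlled vocabularies. The field names and allowed terms below are purely illustrative assumptions - not an actual Genestack or GEO schema - but they show how a template catches exactly the shortcuts described earlier, like the free-text “cancer” annotation:

```python
# Minimal sketch of a metadata template: required fields and controlled
# vocabularies. All field names and terms here are illustrative assumptions.
REQUIRED_FIELDS = {
    # field name -> allowed values (None means free text is accepted)
    "organism": {"Homo sapiens", "Mus musculus"},
    "tissue": {"lung", "liver", "blood"},
    "disease": {"non-small cell lung carcinoma", "healthy control"},
    "age": None,
}

def validate_sample(sample: dict) -> list:
    """Return a list of template violations for one sample annotation."""
    errors = []
    for field, allowed in REQUIRED_FIELDS.items():
        if field not in sample:
            errors.append(f"missing required field: {field}")
        elif allowed is not None and sample[field] not in allowed:
            errors.append(f"{field}: {sample[field]!r} is not a controlled term")
    return errors

good = {"organism": "Homo sapiens", "tissue": "lung",
        "disease": "non-small cell lung carcinoma", "age": 36}
bad = {"organism": "Homo sapiens", "tissue": "lung", "disease": "cancer"}

print(validate_sample(good))  # []
print(validate_sample(bad))   # flags the free-text "cancer" and the missing "age"
```

Run at submission time rather than search time, a check like this is what turns a repository from an archive into something searchable.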

Now imagine that everyone putting data into GEO followed the same data curation rules. You could find the data you need in mere seconds! Suddenly, you get straight to all of the results required to generate that one crucial insight you were looking for all along.

The Data Template system is a crucial element of making data truly FAIR. In reality, what makes data FAIR isn’t any single institution or tool - it’s all of us agreeing on what the guidelines need to be. It’s on us. Of course, it may take some time, but you can always measure the ROI of applying it: data scientists can interact with data naturally, saving their time.

If you would like to see how easy it is to implement a templating system at your institute, and hear how some of the world’s leading commercial and academic institutes are using it to make their data FAIR, then just drop us a message at sales@genestack.com or via our website contact form here.
