
Six Steps to FAIR Data: How FAIR data enables searches to power your research and LLMs

06.09.24

In our previous articles

In our previous articles, we explored the steps required to make data FAIR—Findable, Accessible, Interoperable, and Reusable. Data contributors identified all data sources, defined standards, and transformed the data accordingly. With such FAIR data, Data Consumers can significantly enhance their analytics, scientific research, and decision-making processes. Ensuring that no crucial data point is missed helps prevent skewed results and leads to more informed business decisions.

When an organization has truly invested in making its data FAIR (connecting all data sources, establishing clear guidelines, and harmonizing the data), the ability to locate or search for data comes almost out of the box. The foundational work of organizing and standardizing data keeps the search process simple, allowing users to quickly find the information they need without additional hurdles.

With this robust FAIR data foundation in place, the next step for Data Consumers is to leverage it effectively by locating the right datasets to answer specific research questions. While there are many search engines and technologies available to facilitate data retrieval, we will not delve into the specifics of these IT tools and systems. Once data is made FAIR, the technical task of implementing search functionality becomes much more straightforward. Instead, our focus will be on the real-world use cases that scientists encounter, demonstrating how FAIR data empowers them to effectively find and utilize the information they need to advance their research.

Typically, this begins with a project-related query, where the true value of FAIR data becomes evident in enabling precise and efficient data retrieval.

Understanding FAIR Data in Life Sciences: In the Life Sciences domain, a typical dataset consists of three key layers:

Dataset Description: This includes both structured and unstructured metadata, covering how the dataset was generated, experimental design, relevant statistics, contributors, and associated documents (e.g., experimental protocols, reports, scientific papers).
Bio-Object Description: This describes the biological entities involved, such as samples, device runs, or patients, often organized hierarchically (e.g., Patients → Samples → Device Runs). Each bio-object is accompanied by rich descriptive data, including age, disease, treatment details, etc.
Measurements: These are structured, typically tabular representations of biological measurements (e.g., gene expression, protein abundance, blood count) across multiple bio-objects.
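The three layers can be pictured as a small data model. The following is a minimal sketch in Python, not the schema of any particular product: the class names (`Dataset`, `BioObject`), the field names, and the example values are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BioObject:
    """Layer 2: a biological entity with its descriptive attributes."""
    object_id: str
    kind: str                                        # e.g. "patient", "sample", "device_run"
    attributes: dict = field(default_factory=dict)   # e.g. age, disease, treatment
    parent_id: Optional[str] = None                  # link to the parent in the hierarchy


@dataclass
class Dataset:
    dataset_id: str
    description: dict = field(default_factory=dict)   # Layer 1: dataset-level metadata
    bio_objects: list = field(default_factory=list)   # Layer 2: bio-object descriptions
    measurements: dict = field(default_factory=dict)  # Layer 3: feature -> {object_id: value}


def lineage(dataset, object_id):
    """Follow parent links upward, e.g. device run -> sample -> patient."""
    by_id = {obj.object_id: obj for obj in dataset.bio_objects}
    chain = []
    current = by_id.get(object_id)
    while current is not None:
        chain.append(current.object_id)
        current = by_id.get(current.parent_id)
    return chain


# Hypothetical example data, invented purely for illustration.
example = Dataset(
    dataset_id="DS-1",
    description={"title": "Illustrative study", "contributors": ["A. Researcher"]},
    bio_objects=[
        BioObject("P1", "patient", {"age": 52, "sex": "female"}),
        BioObject("S1", "sample", {"tissue": "blood"}, parent_id="P1"),
        BioObject("R1", "device_run", {}, parent_id="S1"),
    ],
    measurements={"gene_expression:SYT1": {"R1": 42.0}},
)

print(lineage(example, "R1"))  # ['R1', 'S1', 'P1']
```

Keeping the measurements keyed by bio-object ID is one possible design choice; it lets a search start from any layer and still reach the other two.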

Ideally, you already have a curated and uniform data collection, so the search exercise is straightforward.

Now that we understand the structure of FAIR data, let's explore the main search scenarios a Data Consumer might encounter. These scenarios highlight how different data layers are utilized, showcasing the power of FAIR data in various contexts.

Scenario 1: Locating Known Data: Data Consumers often need to analyze and interpret data from ongoing research projects. However, interpretation rarely follows data production immediately; gaps of months or even years are common. In such cases, searching by attributes like the study owner, author, or key terms in the top data layer (the Dataset Description) allows for quick retrieval of the relevant datasets, demonstrating the efficiency of FAIR data.
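A search over the top data layer can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a real search engine: the dictionary layout, the `owner` and `title` fields, and the catalog entries are hypothetical.

```python
def search_descriptions(datasets, owner=None, term=None):
    """Return IDs of datasets whose description matches the owner and/or a key term."""
    hits = []
    for ds in datasets:
        desc = ds["description"]
        # Owner must match exactly (ignoring case) when an owner is given.
        if owner is not None and desc.get("owner", "").lower() != owner.lower():
            continue
        # The key term may appear in any description field.
        if term is not None:
            text = " ".join(str(value) for value in desc.values()).lower()
            if term.lower() not in text:
                continue
        hits.append(ds["id"])
    return hits


# Hypothetical catalog, invented for illustration.
catalog = [
    {"id": "DS-1", "description": {"owner": "A. Smith", "title": "Epilepsy rat model, SV2A therapy"}},
    {"id": "DS-2", "description": {"owner": "B. Jones", "title": "Wheat drought trial"}},
]

print(search_descriptions(catalog, owner="a. smith"))  # ['DS-1']
print(search_descriptions(catalog, term="drought"))    # ['DS-2']
```

The lookup stays this simple only because the metadata was harmonized beforehand: every dataset carries the same fields, filled in the same way.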

Scenario 2: Locating Unknown Data: Scientists often need to find datasets to test new hypotheses, even if they are unaware of the dataset's existence. In these cases, FAIR data allows for more intricate querying at the Bio-Object level. For example:

Find all studies with female participants over 45 who were treated with a specific drug.
Retrieve datasets where epilepsy model rats were administered SV2A inhibition therapy.
Retrieve datasets where soil samples from coastal regions were tested for heavy metal contamination after a major industrial spill.
Find all studies where drought-resistant wheat varieties were grown under specific irrigation conditions.
Locate datasets where coral reef ecosystems were monitored for bleaching events under varying water temperature and acidity levels.

Such searches remove the need for redundant experiments, saving both time and resources. Going a level deeper, FAIR data supports even more complex queries, such as identifying datasets based on specific genetic mutations or measured values.
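A Bio-Object-level query such as "female participants over 45 who were treated with a specific drug" can be sketched as a predicate applied to every bio-object in every dataset. The function name, attribute names, and sample records below are assumptions made for illustration only.

```python
def find_studies(datasets, predicate):
    """Return IDs of datasets containing at least one bio-object that satisfies the predicate."""
    return [
        ds["id"]
        for ds in datasets
        if any(predicate(obj) for obj in ds["bio_objects"])
    ]


def female_over_45_on_drug_x(obj):
    """Hypothetical predicate for the first example query above."""
    return (
        obj.get("sex") == "female"
        and obj.get("age", 0) > 45
        and obj.get("treatment") == "drug X"
    )


# Hypothetical records, invented for illustration.
datasets = [
    {"id": "ST-1", "bio_objects": [
        {"sex": "female", "age": 52, "treatment": "drug X"},
        {"sex": "male", "age": 61, "treatment": "drug X"},
    ]},
    {"id": "ST-2", "bio_objects": [
        {"sex": "female", "age": 38, "treatment": "drug X"},
    ]},
]

print(find_studies(datasets, female_over_45_on_drug_x))  # ['ST-1']
```

The exact-match comparisons work only if values such as `"female"` or the drug name are spelled identically across studies, which is what the harmonization step of FAIR data provides.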

After becoming familiar with general search techniques, Data Consumers can go a step deeper into the data to identify more specific data slices that meet intricate criteria. For instance:

Find all studies with female participants over 45 years old, treated with drug X, and carrying at least one non-synonymous variant in the BRCA1 gene.
Get all datasets where epilepsy model rats were administered SV2A inhibition therapy and showed decreased expression of Synaptotagmin-1.
Retrieve datasets where soil samples from coastal regions were tested for heavy metal contamination after a major industrial spill and showed significant changes in microbial diversity.
Find all studies where drought-resistant wheat varieties were grown under specific irrigation conditions and exhibited enhanced yield performance, including measurements of soil moisture and crop resilience.
Locate datasets where coral reef ecosystems were monitored for bleaching events under varying water temperature and acidity levels, with associated changes in symbiotic algae populations and reef health metrics.

These refined searches allow scientists to pinpoint datasets that meet very specific experimental or clinical criteria, demonstrating the true power of FAIR data and LLMs in handling complex data queries.
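Refined searches of this kind combine the Bio-Object layer with the Measurements layer. The sketch below illustrates the idea for the epilepsy example; the data layout, the feature name `SYT1_expression_tpm`, the threshold, and the values are all hypothetical, and a real system would define "decreased expression" against a proper control.

```python
def find_by_attributes_and_measurement(datasets, attributes, feature, below):
    """Datasets where some bio-object matches all attributes AND has `feature` below a threshold."""
    hits = []
    for ds in datasets:
        values = ds["measurements"].get(feature, {})
        for obj in ds["bio_objects"]:
            attrs_ok = all(obj.get(key) == val for key, val in attributes.items())
            value = values.get(obj["id"])
            if attrs_ok and value is not None and value < below:
                hits.append(ds["id"])
                break  # one matching bio-object is enough for this dataset
    return hits


# Hypothetical records and values, invented for illustration.
rat = {"species": "R. norvegicus", "disease": "epilepsy", "treatment": "levetiracetam"}
datasets = [
    {"id": "ST-10",
     "bio_objects": [{"id": "rat-1", **rat}],
     "measurements": {"SYT1_expression_tpm": {"rat-1": 31.5}}},
    {"id": "ST-11",
     "bio_objects": [{"id": "rat-7", **rat}],
     "measurements": {"SYT1_expression_tpm": {"rat-7": 88.0}}},
]

result = find_by_attributes_and_measurement(
    datasets, rat, feature="SYT1_expression_tpm", below=50
)
print(result)  # ['ST-10']
```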

Leveraging Large Language Models (LLMs) in Search: The example queries above demonstrate how scientists naturally formulate their search tasks in plain language. Traditionally, these queries would need to be translated into complex keyword combinations or advanced search forms, which can be cumbersome and error-prone.

With the advent of LLMs, these natural language queries can now be seamlessly transformed into technical search queries. By designing prompts that incorporate both the user's request and the necessary querying syntax, LLMs can generate precise search commands that effectively navigate complex datasets.
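One way to design such a prompt is to place the querying syntax and a worked example ahead of the user's request. The sketch below only assembles the prompt text and does not call any model; the syntax summary, the single few-shot example, and the function name `build_prompt` are assumptions made for illustration.

```python
# Hypothetical summary of the query syntax the search backend accepts.
QUERY_SYNTAX = (
    "GET studies where Sample attributes (<field>=<value> AND ...) "
    "AND <measurement condition>"
)

# One illustrative request/query pair shown to the model as a pattern to follow.
FEW_SHOT = [
    (
        "Find all studies with female participants over 45 treated with drug X.",
        "GET studies where Sample attributes (sex=female AND age>45 AND treatment=X)",
    ),
]


def build_prompt(user_request, syntax=QUERY_SYNTAX, examples=FEW_SHOT):
    """Combine the query syntax, worked examples, and the user's request into one prompt."""
    lines = [
        "Translate the request into a search query.",
        f"Query syntax: {syntax}",
        "Return only the query.",
        "",
    ]
    for request, query in examples:
        lines += [f"Request: {request}", f"Query: {query}", ""]
    lines += [f"Request: {user_request}", "Query:"]
    return "\n".join(lines)


print(build_prompt("Retrieve datasets where epilepsy model rats received SV2A inhibition therapy."))
```

Listing the harmonized field names and allowed values in the prompt is a natural extension, since the model can only produce valid queries for vocabulary it has been shown.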

User search phrase	Technical search query
Find all studies with female participants over 45 treated with drug X and carrying a non-synonymous variant in BRCA1.	`GET studies where Sample attributes (sex=female AND age>45 AND treatment=X) AND Gene=BRCA1 variants=missense, nonsense, stop_loss, frameshift, in-frame_insertions/deletions`
Get all datasets where epilepsy model rats were administered SV2A inhibition therapy and showed decreased expression of Synaptotagmin-1.	`GET studies where Sample attributes (species=R.norvegicus AND disease=epilepsy AND treatment=levetiracetam, brivaracetam, seletracetam, padsevonil) AND Gene=SYT1 expression < 50 TPM`
Retrieve datasets where soil samples from coastal regions were tested for heavy metal contamination after a major industrial spill and showed significant changes in microbial diversity.	`GET studies where Sample attributes (location=coastal AND sample_type=soil AND event=industrial_spill AND contaminant_concentration>50 ppm) AND Microbial_diversity_index <20%`
Find all studies where drought-resistant wheat varieties were grown under specific irrigation conditions and exhibited enhanced yield performance, including measurements of soil moisture and crop resilience.	`GET studies where Sample attributes (species=T. aestivum AND trait=drought_resistance AND irrigation_condition=5-10 mm/day) AND yield_increase>15% AND Measurements=soil_moisture (>30% field capacity), crop_resilience (>25% reduction in water stress impact)`
Locate datasets where coral reef ecosystems were monitored for bleaching events under varying water temperature and acidity levels, with associated changes in symbiotic algae populations and reef health metrics.	`GET studies where Sample attributes (ecosystem=coral_reef AND monitoring_event=bleaching AND water_temperature_increase>2°C AND pH<7.8) AND Associated_change=symbiotic_algae_population_reduction (>30% decrease), reef_health_metric_decline (>40% decrease in coral cover)`
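Before a generated query is executed, it can be useful to break it into individual conditions, for example to check the field names against the harmonized vocabulary. The parser below is a minimal sketch that handles only the `Sample attributes (...)` clause of the illustrative syntax in the table; it is an assumption about how such a check might look, not part of any real query engine.

```python
import re

# field, comparison operator, value (two-character operators are tried first)
CONDITION = re.compile(r"\s*(\w+)\s*(>=|<=|=|>|<)\s*(.+?)\s*$")


def parse_sample_attributes(query):
    """Extract (field, operator, value) triples from the 'Sample attributes (...)' clause."""
    match = re.search(r"Sample attributes?\s*\((.*?)\)", query)
    if match is None:
        return []
    conditions = []
    for part in match.group(1).split(" AND "):
        parsed = CONDITION.match(part)
        if parsed:
            conditions.append(parsed.groups())
    return conditions


query = (
    "GET studies where Sample attributes (sex=female AND age>45 AND treatment=X) "
    "AND Gene=BRCA1 variants=missense, nonsense"
)
print(parse_sample_attributes(query))
# [('sex', '=', 'female'), ('age', '>', '45'), ('treatment', '=', 'X')]
```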

Conclusion

As we've explored throughout this series, the transition to FAIR data in the Life Sciences brings immense value to both data contributors and consumers. By making data Findable, Accessible, Interoperable, and Reusable, researchers can more effectively answer complex questions, avoid redundant experiments, and accelerate scientific discovery. The integration of Large Language Models (LLMs) further enhances these capabilities, allowing for natural language queries to be transformed into precise technical searches. This powerful combination of FAIR data principles and advanced AI tools empowers researchers to navigate vast datasets with ease, unlocking new insights and driving more informed, data-driven decisions. As we continue to refine and leverage these technologies, the potential for breakthroughs in research and innovation becomes boundless.


By Alexey Dubovenko
