
Getting Data In

13.03.23

Introduction

Hello again, data farmers! If you've been keeping up with our previous articles in this series, you'll already know we've discussed the strategic questions around making Life Sciences data FAIR at an organizational level, touring the macro world of data parks and cities. Now let's zoom in on the data ranch and see how to make it happen at a tactical level, in practice. Experienced data scientists' ears will have pricked up already, and yes - we'll be breaking out the tractors and irrigation metaphors today as we investigate the data streams that can feed into our FAIR solution!

Firstly, before we get to work, let's clear up a few things. Data does not exist in a vacuum. We generate, curate, store, and analyze it. Depending on the organization's complexity, there is a constant waterfall of data from generation devices, software, and storage systems governed by distinct teams and departments. Our goal for this exercise is to design an organized flow of data from those sources through a single system of processing, curation, and sorting, so that it becomes accessible in a consistent format to data scientists (if they have the proper permissions, of course). Got it? Aye!

We all know that in reality this part of the FAIR transformation may not be as exciting as some other exploratory avenues, such as AI approaches to data. Still, only limited insights can be generated without an established way to locate and grab all the data from the farthest corners of the organization. A farmer understands how important this monotony is - crops need constant planting and fertilizing - so why wouldn't you nurture your data by making it FAIR? For those who still believe "none of this is my problem; I know where the data for my project is located," countless counterarguments can be heard at Life Sciences data conferences. Take this quote from a Big Pharma representative's talk at Bio Data World (Basel, November 2021): "We know that in the recent past, we produced a critical diabetes dataset. We spent six (!) months trying to find it."

If gathered and described adequately, a good farmer's data yield increases in value over time. The more data we have, the more accurate the hypotheses we can generate, so it is natural for every data scientist to ask: "do I possess all the data possible for my project?" It might sound cliché (and it is), but the answer requires a Single Point of Truth, and the key step is to get data into it. So, let's get on board and discover how fluid getting data in can really be.

We often apply water-related associations when we talk about data (a data source, a data lake, a data flow, and sometimes even a data swamp), so it is natural to exploit this irrigation analogy further. Think of your data not as the goal itself but as water: a resource we use to grow crops (the scientific advances that improve people's lives and earn a living). So we know it's about generating insights, and generating them quicker. But why keep our data FAIR for this? Well, a critical success factor of a FAIR data transformation is identifying all the possible data streams in your organization. Some may not seem obvious at first and could surprise you in a nasty way. Do you know all of them?
Below we describe the major classes of data sources - rather than yet another boring list of exact systems - each of which requires a separate software connector to ingest its data into your FAIR system. Think of each of these as a channel bringing water in to nourish your crops!

Public databases. The first and obvious choice for every data scientist: why spend money producing your own data when you can test your hypothesis on something already out there? The caveat is that each database has its own guidelines for data description and format (see our previous article in this series), which makes it very hard to create a standard connector (various BioBanks and TCGA, with their standard protocols, being exceptions).

Data generation devices, tools, and software. The core of scientific innovation is testing hypotheses with purpose-designed experiments. A vital data flow comes from the lab, either directly from a measurement device or - most likely - from software coupled with the device that processes the measurements into a file. In many cases, such as sequencing, the data passes through extra processing and normalization pipelines that transform the generated measurements into an analysis-ready format.

Electronic Lab Notebooks and Laboratory Information Management Systems (ELNs and LIMS). Thanks to the global digitalization and FAIRification effort, these are an essential source of sample and overall experimental descriptions.

Data storage systems. For historical reasons, many organizations already have at least one data storage system (most likely one per department) containing the data generated so far. Amazon S3, Azure, Arvados, and the like - proprietary, open-source, or internally developed - serve as reservoirs, frequently run with the logic "let's put everything there and then decide what to do with all this pile of data. Someday."

Local machines and file shares. Last but not least comes the infamous headache for everyone who wants a structured data source. How many times have you sent countless emails or messages to your peers asking if they have, know about, or have heard of a dataset you need? How did you exchange that data - email, file share, messenger? How many datasets were lost because the person responsible for them left the organization three years ago?

To complicate things further, these data sources are not mutually exclusive. A single organization can have several data storage systems, an ELN, and multiple pipelines, and it is crucial to ingest the data from all of them. Moreover, different parts of the same dataset can live in different sources: e.g., Gene Variant data is stored in Amazon S3 while the corresponding sample and patient descriptions are available in the ELN. As we said before, not an isolated flow but an entire system!

We promised you a solution to all this data chaos, and here it is: the solution isn't necessarily one product, though some products may help you along the way. At Genestack, we have built a system that makes data FAIR. While data harmonization and search requirements are shared across (almost) all Life Sciences organizations, the Get Data In step requires customization, because the composition of data sources is unique to every organization. Despite that, we have identified that every successful software connector bringing data sources and the catalog together must:

  1. Be automated. Once the connection is set up, the program ingests the data, either creating new datasets in the catalog or updating existing ones. A set of interpretation rules can be introduced to make this more precise (the simplest are "each folder in the defined location corresponds to a separate dataset" and "if new files appear in a folder, update the dataset in the catalog") - see the first sketch after this list.
  2. Be scheduled. When there is a steady flow of new data, we don't want to run the ingestion manually every time. A scheduled data upload (once a week, day, or hour) provides the desired level of automation.
  3. Be able to merge information from different sources. As discussed above, different data slices of the same dataset can live in different systems. A data catalog must identify those slices as related and merge them into a single dataset - illustrated in the second sketch below.
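
To make the first two requirements concrete, here is a minimal Python sketch of an automated, scheduled connector. Everything in it is a stand-in assumption rather than a real Genestack API: the `catalog` object and its methods, the `/data/incoming` landing zone, and the hourly polling loop (a cron job would do the same work).

```python
"""A minimal sketch of an automated, scheduled connector.

Hypothetical assumptions (not a real Genestack API):
- `catalog` is any object exposing find_dataset / create_dataset /
  update_dataset, standing in for the real catalog interface.
- Interpretation rules: each folder under WATCHED_ROOT is one dataset;
  new files appearing in a folder update that dataset.
"""
import time
from pathlib import Path

WATCHED_ROOT = Path("/data/incoming")  # hypothetical landing zone
POLL_INTERVAL_SECONDS = 60 * 60        # hourly; a cron job works just as well


def ingest_once(catalog) -> None:
    """Apply the interpretation rules to the watched location once."""
    for folder in sorted(p for p in WATCHED_ROOT.iterdir() if p.is_dir()):
        files = sorted(f for f in folder.iterdir() if f.is_file())
        dataset = catalog.find_dataset(name=folder.name)
        if dataset is None:
            # Rule: each folder corresponds to a separate dataset.
            catalog.create_dataset(name=folder.name, files=files)
        else:
            # Rule: new files in a folder update the existing dataset.
            catalog.update_dataset(dataset, files=files)


def run_scheduled(catalog) -> None:
    """Requirement 2: re-run the ingestion on a fixed schedule."""
    while True:
        ingest_once(catalog)
        time.sleep(POLL_INTERVAL_SECONDS)
```

The point of the loop (or whichever scheduler replaces it) is simply that no human ever triggers the ingestion by hand.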

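For the third requirement, here is a similarly hypothetical sketch of merging two slices of the same dataset: variant files sitting in Amazon S3 and sample descriptions exported from an ELN, linked by a shared sample identifier (the field names are invented for illustration).

```python
"""A sketch of merging dataset slices from different sources.

Hypothetical assumptions: variant files live in S3, sample descriptions
come from an ELN export, and a shared 'sample_id' links the two slices.
"""
from collections import defaultdict


def merge_slices(s3_files, eln_samples):
    """Combine file locations and sample metadata into one record per sample."""
    merged = defaultdict(dict)
    for sample in eln_samples:  # metadata slice from the ELN
        merged[sample["sample_id"]].update(sample)
    for f in s3_files:          # data slice from S3
        merged[f["sample_id"]].setdefault("files", []).append(f["uri"])
    return dict(merged)


# Example: the catalog ends up with one entry holding both the ELN
# description and the S3 file location for sample S1.
dataset = merge_slices(
    s3_files=[{"sample_id": "S1", "uri": "s3://my-bucket/S1.vcf.gz"}],
    eln_samples=[{"sample_id": "S1", "tissue": "liver", "donor": "D042"}],
)
print(dataset)
```
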
Once we have set up all our data sources, we can transform and harmonize the data flowing into the catalog. That's really where Genestack takes full effect and starts working with your data. Next time, we'll watch data harmonization in action via Genestack's ODM and take a deep dive into data curation. Can't wait until then, or curious about what Genestack could offer your data streams? Request a demonstration!

