Glossary of data management terms

There’s a wealth of terminology when it comes to how we talk about data and the methods employed to look after it. Here you can find helpful descriptions and examples of key data management terms, from access control to versioning. 

Access control

Access control is a security technique that regulates who or what can view or use resources in a computing environment. It is a fundamental concept in security that minimizes risk to the business or organization.

In the context of life sciences, there is a special need for access controls on biological data and records, whether to comply with clinical data practices and ethical requirements when accessing sensitive patient data, or to decide who should be trusted with the permission to edit (curate) data in an organisation to ensure data integrity.
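As a minimal sketch (the role names and permissions below are hypothetical, not drawn from any particular system), a role-based access check could look like this:

```python
# Sketch of role-based access control (RBAC); roles and permissions are illustrative only.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "curator": {"read", "edit"},
    "admin": {"read", "edit", "share", "delete"},
}

def is_allowed(user_role: str, action: str) -> bool:
    """Return True if the given role grants the requested action."""
    return action in ROLE_PERMISSIONS.get(user_role, set())

# Example: a curator may edit metadata, but a viewer may not.
assert is_allowed("curator", "edit")
assert not is_allowed("viewer", "edit")
```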

 

Annotation

Annotation is the process of adding notes to data, for example through description, explanation, or commentary. In genomics, for example, annotation can specify the locations and functions of genes on genomic sequences. For biological investigations, annotations often come in the form of detailed descriptions of the study, the samples used, and the data files generated, using defined attributes (e.g. a biological “sample” often has “organism” as an attribute).

When annotation is harmonized by adopting a metadata model and curating with ontologies, the annotation is not only human-readable but also machine-readable, and can then be used for training machine learning (ML) models.
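For illustration only (the attribute names and ontology identifiers below are examples, not a specific standard), a machine-readable sample annotation might be captured as a set of defined attributes:

```python
# Sketch of a machine-readable sample annotation using defined attributes;
# attribute names and ontology identifiers are illustrative.
sample_annotation = {
    "sample_id": "S001",
    "organism": {"label": "Homo sapiens", "ontology_id": "NCBITaxon:9606"},
    "tissue": {"label": "liver", "ontology_id": "UBERON:0002107"},
    "disease": {"label": "type II diabetes", "ontology_id": "MONDO:0005148"},
}

# Because the attributes are structured, a program can filter samples reliably,
# e.g. select human liver samples without parsing free text.
is_human_liver = (
    sample_annotation["organism"]["label"] == "Homo sapiens"
    and sample_annotation["tissue"]["label"] == "liver"
)
```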

 

Artificial Intelligence/Machine Learning (AI/ML)

Machine learning (ML) is an application of artificial intelligence (AI) where computer programs can automatically learn and improve based on experience. At the heart of machine learning is the development of programs that have the ability to access data and use it to learn for themselves.

For example, when fed a large volume of Life Science Data, phenotypic data, and clinical observations from a group of patients presenting similar symptoms of a disease, ML models can uncover patterns of mutations, gene expression, or clinical traits that are common to these patients. Such patterns can inform pharmaceutical companies as they devise personalized healthcare treatments and improve clinical trial efficiency to speed up drug discovery.
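As a minimal sketch (using scikit-learn and synthetic data in place of real expression profiles and patient labels), training a model to separate responders from non-responders could look like this:

```python
# Minimal sketch: train a classifier on synthetic "gene expression" profiles.
# Real projects would use curated patient data, feature selection, and careful validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 200 patients x 50 gene expression values (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic "responder" label driven by two genes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```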

 

API

Application programming interfaces (APIs) are created by data providers to let users access, query, and manipulate data using pre-defined, structured commands, often with customisable parameters, without having to know how the data are structured or stored by the providers behind the scenes. Since an API consists of a set of clearly defined methods of communication among various software components, it is a critical tool for integrating and exchanging data between different providers/sources.
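As an illustration (the endpoint URL, parameters, and response structure below are hypothetical, not a specific provider's API), querying a REST-style API for samples from Python might look like this:

```python
# Sketch of querying a data provider's REST API; the URL, parameters, and
# response fields are hypothetical.
import requests

response = requests.get(
    "https://example.org/api/v1/samples",  # hypothetical endpoint
    params={"organism": "Homo sapiens", "tissue": "liver", "limit": 10},
    headers={"Authorization": "Bearer <access-token>"},
    timeout=30,
)
response.raise_for_status()
for sample in response.json().get("data", []):
    print(sample.get("id"), sample.get("attributes"))
```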

Related:

> What are APIs and why are they key to Life Science Data management?

 

Biomarker

A biomarker is a molecule, gene, or biological characteristic that can be reproducibly measured or assayed, and indicates the activity of a particular pathological or physiological process, or the presence or absence of a disease state.

Biomarkers are used in the R&D phase of drug development as they can help to predict the effect of a treatment on a patient (or on a group of patients), diagnose the presence of a disease, or give an accurate prognosis. For example, serum cholesterol level and blood pressure are biomarkers for cardiovascular diseases, and one could predict that a drug that lowers serum cholesterol level could reduce the risk of cardiovascular diseases.

In the clinical trial phase of drug development, biomarkers identified from earlier R&D phases are increasingly used to stratify patients into cohorts (in addition to conventional stratifying criteria, such as presence or absence of disease symptoms), so the candidate drug will be tested in those patients most likely to respond to and benefit from it.

 

Cohort

A cohort is a group of subjects who share a defining characteristic at a specific time. Examples of the common characteristics defining a cohort are birth, exposure to an infectious agent, disease diagnosis, treatment, etc. In a cohort study, the subjects are usually tracked over a time-course that lasts for days, months, or even years, with data being collected at one or more time-points.

 

Controlled vocabulary

A controlled vocabulary is an organized arrangement of words or phrases used to standardize or harmonize terms. Controlled vocabularies are scope or domain specific, e.g. in life sciences, the scope could be “disease”, “tissue”, or “cell line”. For a given subject (e.g. Species), a controlled vocabulary typically includes preferred (Homo sapiens) and variant/synonymous (H. sapiens) terms to capture the reality of how biological characteristics are often described by researchers.

By adopting controlled vocabularies in data management, an organisation can ensure their data and metadata are consistently described and so can be accurately compared. Such standardization also allows content to be indexed properly, which in turn enables data to be effectively searched and retrieved.

For example, in your organization, you may be required to insert an internal project’s code or name selected from a controlled vocabulary, to avoid different people using slightly different spelling or punctuation in such names.
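A minimal sketch of how a controlled vocabulary with preferred and synonymous terms could be used to normalize free-text entries (the terms below are illustrative):

```python
# Sketch: map variant/synonymous terms to a preferred term from a controlled vocabulary.
SPECIES_VOCABULARY = {
    "homo sapiens": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "mus musculus": "Mus musculus",
    "mouse": "Mus musculus",
}

def normalize_species(raw_value: str) -> str:
    """Return the preferred term, or raise if the value is not in the vocabulary."""
    key = raw_value.strip().lower()
    if key not in SPECIES_VOCABULARY:
        raise ValueError(f"'{raw_value}' is not in the species controlled vocabulary")
    return SPECIES_VOCABULARY[key]

print(normalize_species("H. sapiens"))  # -> Homo sapiens
```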

 

Curation / Biocuration

Curation or biocuration is the process of identifying, organizing, correcting, annotating, standardizing, and enriching biological data. In the process, valuable knowledge is extracted from the data, which is then accurately represented in human-readable and machine-readable metadata formats, preferably using controlled vocabularies. It is the accurate and rich metadata that provides data with its context, to ensure the data are easily interpreted and disseminated, promoting data reuse. 

Related:

> Ultimate biocuration guide poster

 

Data catalog

A data catalog is a tool that helps companies organize, find, and manage the data stored in their systems. Data catalogs contain information about the data and the metadata. By centralizing metadata in one location, data catalogs increase data discoverability and reusability. For example, a good ecommerce catalog lets customers discover products easily because they're described properly (by metadata) and are discoverable on the same platform (centralization).

Similarly, a good data catalog of an R&D department in a global pharma company allows anyone in the R&D team to discover experimental data generated in any of the R&D sites around the world. This allows that organization to find and reuse their data easily, and thereby save time and money.

 

Data manager

Data managers oversee data management processes and policies for an organization throughout the data lifecycle. Typical tasks of the manager include sourcing and collecting data from biological investigations or clinical trials with provenance, performing extract-transform-load (ETL) procedures, generating data analytics (e.g. data traffic and usage patterns), updating existing data so they stay relevant to the users, sharing data with internal/external stakeholders, and backing up data.

In biological science, the data manager also looks after auxiliary or reference data, which are crucial for annotating or interpreting data from biological investigations or clinical trials. Examples of such auxiliary data are reference genome sequences and ontologies. In the case of handling commercially sensitive or potentially human-identifiable data, the manager is also expected to ensure the data are stored, accessed, or shared in accordance with legal, ethical, and company standards.

 

Data model

Data models are abstractions of real-world concepts (e.g. “samples” or “sequencing runs” in the life science domain) to help organize entities of data and standardize how those elements interact with or relate to each other and any external entities. Data models help with the design of the database at the conceptual, physical, and logical levels.

The entity-relationship model, for example, uses core concepts like entities, attributes, and relationships. Some real-world examples in life-science research and healthcare are the study-sample-data model and the patient-sample-measurement model.
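A minimal sketch of a study-sample-data model (the entity names and attributes are illustrative, not a prescribed schema):

```python
# Sketch of a simple study-sample-data model using dataclasses; names are illustrative.
from dataclasses import dataclass

@dataclass
class Study:
    study_id: str
    title: str

@dataclass
class Sample:
    sample_id: str
    study_id: str   # relationship: each sample belongs to one study
    organism: str

@dataclass
class DataFile:
    file_id: str
    sample_id: str  # relationship: each data file derives from one sample
    path: str

study = Study("STUDY1", "Liver transcriptomics pilot")
sample = Sample("S001", study.study_id, "Homo sapiens")
data_file = DataFile("F001", sample.sample_id, "runs/S001_R1.fastq.gz")
```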

 

Data reuse

Data reuse is a concept that involves using data (such as experimental data or data from clinical trials) for a research activity or purpose other than that for which it was originally intended. Implementing appropriate metadata schemas, versioning, provenance, as well as curation and annotation can contribute to enhancing data discoverability and reusability. As more and more life sciences organizations treat their R&D and clinical data as an asset, and given the vast amount of resources invested in generating the data in the first place, data reuse translates to a higher return on investment (ROI) too.

 

Electronic lab notebook (ELN)

An electronic lab notebook (ELN) is a computer program designed to document experiments and procedures performed in a research laboratory, as a replacement of the traditional paper laboratory notebook where researchers keep a diary or log of the experiments planned and performed.

Two major drawbacks of the traditional lab notebook are the lack of search capabilities, which forces users to rely on a crude index or dates to look for a record, and the problem of unintelligible handwriting. An ELN must be able to create, import, store, and retrieve all important data types in digital format. ELNs also feature search functionalities that allow users to quickly retrieve data by author, date, experiment ID, data type, etc.

 

Facets

Facets are independent attributes that can be used to classify each entry in a collection. Faceted search allows users to use those attributes (the facets) to refine a search query and narrow down the search results. This is particularly useful when the initial query returns a large number of results.

For example, in an online shopping website for shoes, the “collection” would consist of many pairs of shoes, each “entry” would be a particular pair of shoes, and facets such as “colour”, “size”, “style” and “upper material” are also attributes of the shoes. In life sciences, “organism”, “sex”, “age” and “tissue” are common examples of facets for narrowing down the search for relevant data.
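A minimal sketch of faceted filtering over a small collection (the entries and facet names are illustrative):

```python
# Sketch of faceted search: refine a collection by facet values; the data are illustrative.
samples = [
    {"id": "S1", "organism": "Homo sapiens", "tissue": "liver", "sex": "female"},
    {"id": "S2", "organism": "Homo sapiens", "tissue": "brain", "sex": "male"},
    {"id": "S3", "organism": "Mus musculus", "tissue": "liver", "sex": "female"},
]

def facet_filter(entries, **facets):
    """Keep only entries whose attributes match every requested facet value."""
    return [e for e in entries if all(e.get(k) == v for k, v in facets.items())]

print(facet_filter(samples, organism="Homo sapiens", tissue="liver"))  # -> entry S1 only
```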

 

FAIR

FAIR is a set of principles that serve as guidelines for researchers in the context of biological data management. The FAIR principles aim to overcome data discovery and reuse obstacles by ensuring data and metadata are findable, accessible, interoperable, and reusable.

A particular emphasis has been put on making the data discovery and reuse easier not only for humans, but also for machines. The FAIR principles have been formulated by stakeholders from academia, industry, funding agencies, and scholarly publishers.

Related:

> The FAIR principles of data management

 

Index

The index of a database is a structure that lets a search engine quickly find specific data without needing to search every row out of millions individually, and so speeds up data retrieval operations and in turn ensures a fast response to a user’s data query.

The analogy is the keyword index often found at the end of a book, which allows the reader to go straight to the pages where a keyword of interest appears, without having to read every single page of the book to search for it. In IT, the index is especially useful given that many databases are a series of related tables that often feature a very large number of rows.
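A minimal sketch using SQLite from Python (the table and column names are illustrative): creating an index on a frequently queried column lets the database seek directly to matching rows instead of scanning the whole table.

```python
# Sketch: create an index on a frequently queried column; the schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, organism TEXT, tissue TEXT)")
conn.executemany(
    "INSERT INTO samples (organism, tissue) VALUES (?, ?)",
    [("Homo sapiens", "liver"), ("Homo sapiens", "brain"), ("Mus musculus", "liver")],
)

# Without an index, this query scans every row; with the index it can seek directly.
conn.execute("CREATE INDEX idx_samples_tissue ON samples (tissue)")
rows = conn.execute("SELECT id, organism FROM samples WHERE tissue = ?", ("liver",)).fetchall()
print(rows)
```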

 

Laboratory information management system (LIMS)

A laboratory information management system (LIMS) is computer software for managing laboratory experiments, samples, and associated data. The use of a LIMS allows scientists to automate and track laboratory workflows (e.g. how sequencing libraries are generated from RNA samples, and how the libraries proceed to be loaded onto sequencing machines), integrate instruments (e.g. robots preparing sequencing libraries, sequencing machines), and manage samples and associated information (e.g. the barcode on a tube of blood sample, the location of an output file from the sequencing machine).

 

Metadata

Metadata are data that provide information about other data. Metadata exist to give data context. For example, in genomics or transcriptomics, metadata describe the sample that the DNA or RNA sequence was obtained from, e.g. the organism, the cell line, and the library preparation method.

Related: 

> What are metadata and why are they important?

 

Omics / multi-omics

Omics are fields of study that aim to completely characterize molecular profiles in living organisms. Examples are genomics, transcriptomics, epigenomics, and proteomics. Omics experiments typically generate large datasets.

Multi-omics refers to the integration of multiple types of Life Science Data sets (e.g. genomics and transcriptomics) obtained from the same sample or related samples from the same study. Multi-omics can also refer to the integration of multiple types of Life Science Datasets coming from different studies.

 

Genestack ODM (ODM)

Genestack ODM (ODM) is Genestack’s data management tool for cataloging, curating, indexing, searching, and sharing biodata.

It is designed for users to record samples, studies, and linked Life Science Data, with rich metadata and relationships to capture provenance. Data adheres to customizable metadata templates with ontology support.

ODM indexes metadata and data, allowing users to interrogate both simultaneously (e.g. cell line and gene expression). A powerful search engine enables users to query multiple internal and public data sources.

Related:

> AstraZeneca implements Genestack ODM - Press Release

 

Ontology

An ontology is a set of concepts and categories in a subject area that shows their properties and the relationships between them. With an ontology, each concept or category comes with a unique name (or “label”) out of all possible synonyms, and an identifier to avoid any ambiguity. Relationships between the categories are explicitly defined. Such labels, identifiers and relationships are designed to be machine-readable too to facilitate data integration.

For example, in life science, for the popular subject area of “disease”, an ontology can be created where each category would be a distinct illness (e.g. “type II diabetes”) with its own set of synonyms and definition. Relationships can be “is a risk factor for” (e.g. between “obesity” and “type II diabetes”), or “is a subclass of” (e.g. “type II diabetes” can be a subclass of “glucose metabolism disease” and “adult late-onset diseases” at the same time).

The value of biocuration is greatly enhanced when ontology terms are used in metadata, not only in standardising the vocabulary used, but also in harnessing the relationships between the concepts to broaden data searches (e.g. searching for “glucose metabolism disease” will automatically return studies on “diabetes” too, even though the latter is not explicitly present in the search term).
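A minimal sketch of using “is a subclass of” relationships to broaden a search (the hierarchy below is a toy fragment, not a real ontology):

```python
# Sketch: expand a search term to all of its subclasses using a toy ontology fragment.
SUBCLASSES = {
    "glucose metabolism disease": ["diabetes"],
    "diabetes": ["type I diabetes", "type II diabetes"],
}

def expand_term(term: str) -> set[str]:
    """Return the term plus all of its (transitive) subclasses."""
    terms = {term}
    for child in SUBCLASSES.get(term, []):
        terms |= expand_term(child)
    return terms

# A search for the broad term also matches studies annotated with narrower terms.
print(expand_term("glucose metabolism disease"))
# {'glucose metabolism disease', 'diabetes', 'type I diabetes', 'type II diabetes'}
```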

 

Processed data

Processed data are data generated from "raw" data obtained from an instrument. In life science, there is a more granular distinction between “pre-processed” and “processed” data too. “Pre-processed” data sit between “raw” and “processed”, representing the preparation of raw data through cleaning and/or transformation steps such as quality control filtering, removal of background signal, normalization, etc.

Pre-processing has little to do with addressing the biological questions of the study (e.g. looking at gene expression differences between patients and healthy individuals) but prepares the data to a point that they can be used in downstream statistical analyses to calculate such differences.
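As a small sketch of one common pre-processing step (the counts below are made up), raw counts can be scaled by library size so that samples become comparable:

```python
# Sketch of a counts-per-million style normalization step; the counts are made up.
import numpy as np

raw_counts = np.array([
    [10, 200, 35],  # sample 1: raw counts for three genes
    [20, 380, 80],  # sample 2
])
library_sizes = raw_counts.sum(axis=1, keepdims=True)
normalized = raw_counts / library_sizes * 1_000_000  # counts per million
print(normalized.round(1))
```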

 

Provenance

Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and their origins.

 

Public repository

A public repository refers to any non-private collection of data that is available for use by anyone, often free of charge and without access control (except for sensitive human clinical data or potentially human-identifiable data). In life sciences, where public data sharing goes back decades in the form of shared DNA or protein sequences from primary databases or assembled genome sequences in public genome browsers, repositories have also been established for collecting and disseminating data from biological investigations, e.g. NCBI Gene Expression Omnibus is a public repository for functional genomics studies.

 

Raw data

Raw data are data or measurements that come straight from an assay instrument (e.g. a sequencing machine, a mass spectrometer) prior to being processed. Raw data are the starting point of any data processing and analysis. Researchers starting with the same raw data and the same biological question could eventually draw very different conclusions, depending on the data processing and analysis methods used. Therefore, raw data need to be traceable and accessible for the sake of checking the reproducibility of the processed data and the robustness of the processing steps.

 

Relationship

Relationships describe the connections between entities of interest, an entity being any independently existing thing that can be uniquely defined. A simple entity-relationship model includes the different entity types and specifies the connections (the relationships) that may exist between those entities.

In life-science research, an entity can be a study (or investigation), a sample, or experimental data. A typical relationship that needs to be captured is which investigation a sample belongs to, which allows a researcher to understand under what circumstances the sample was collected and for what reason related to the aim of the investigation.

 

Single Point of Truth (SPoT)

The Single Point of Truth (SPoT) ensures that every data element in all data models and associated data schema exists (or is mastered) in only one place. In such a model, links to each data element are by reference only. In business, this concept can be applied to ensure that everyone in the same organization uses the same data when making business decisions. 

In data management, when properly applied, a SPoT prevents data from being siloed and stops departments in the same company from using slightly different versions or copies of the same data, which can result in costly errors and reduced profits.

 

Validation

Validation is the name given to the process whereby the information entered in a database or data management system is checked to ensure that it complies with the rules specified for that attribute field. In life sciences, given the volume of data in an organisation, automating the validation by feeding the rules into a validation algorithm helps to flag errors effortlessly and draw the attention of curators to rectify them.

For example, validation can ensure that only numbers between "0" and "100" are allowed in a percentage field, that only "Male" or "Female" (from a controlled vocabulary or ontology) is accepted in a gender field, or that the “compound treatment” field for samples is not left blank in a study comparing the effect of a new drug against a placebo.
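A minimal sketch of automated validation rules of that kind (the field names and rules are illustrative):

```python
# Sketch of automated metadata validation; field names and rules are illustrative.
ALLOWED_SEX = {"Male", "Female"}

def validate_sample(record: dict) -> list[str]:
    """Return a list of validation errors for one sample record."""
    errors = []
    pct = record.get("tumour_percentage")
    if pct is not None and not (0 <= pct <= 100):
        errors.append("tumour_percentage must be between 0 and 100")
    if record.get("sex") not in ALLOWED_SEX:
        errors.append("sex must be one of: " + ", ".join(sorted(ALLOWED_SEX)))
    if not record.get("compound_treatment"):
        errors.append("compound_treatment must not be blank")
    return errors

print(validate_sample({"tumour_percentage": 120, "sex": "F", "compound_treatment": ""}))
```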

 

Versioning

Versioning refers to tracking changes, with each change producing a new version so that there is a record of the state of a particular piece of information at any given point. Versioning allows the management of changes to documents, programs, pipelines, data values, or even the state of metadata associated with a piece of data.
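A minimal sketch of versioning a metadata record so that earlier states remain retrievable (purely illustrative, not a specific system):

```python
# Sketch: keep every change to a record as a new version so any past state can be retrieved.
import copy
import datetime

class VersionedRecord:
    def __init__(self, initial: dict):
        self.versions = [(datetime.datetime.now(datetime.timezone.utc), copy.deepcopy(initial))]

    def update(self, **changes):
        """Apply changes as a new version; earlier versions are kept unchanged."""
        latest = copy.deepcopy(self.versions[-1][1])
        latest.update(changes)
        self.versions.append((datetime.datetime.now(datetime.timezone.utc), latest))

    def at_version(self, index: int) -> dict:
        return self.versions[index][1]

record = VersionedRecord({"tissue": "liver", "sex": "unknown"})
record.update(sex="Female")
print(record.at_version(0))  # original state is still available
print(record.at_version(1))  # latest state
```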