Background
During the course of its operations, our client has produced and collected a wealth of diverse toxicology data and metadata, including transcriptomics, methylation, proteomics and other assay data. These data are used to investigate the mode of action of chemical compounds, to identify off-target effects, and to aid risk assessment reporting as part of the Adverse Outcome Pathways framework.
To maximise the return on investment in data production and to increase the efficiency of their research, the company’s bioinformaticians were tasked with developing a platform for managing and integrating these datasets. In January 2016 Genestack was chosen to deliver a proof of concept for this strategy, which led to the final product implementation in December 2016.
The Challenge
The most pressing challenges faced by our client included:
– Lack of centralised knowledge management: in a large organisation, data are stored across different sites, are not accessible to everyone, and often have inconsistently described metadata.
– Difficult data integration: there was no straightforward way to leverage existing public datasets and analyse them alongside private or shared data.
– No user-friendly tools for processing omics data without scripting skills, and difficulty keeping track of data provenance.
The major objectives of the project were:
– to provide the company with a centralised data management infrastructure that allows for seamless storage, management and querying of private and public microarray-based transcriptomics data, along with metadata and reports
– to develop a meta-analysis application that allows users to browse and aggregate differential expression results across thousands of transcriptomics experiments, in order to identify chemical compounds with similar transcriptomic signatures
– to provide users with an easy-to-use pipeline, tailored to their needs, for transcriptomics microarray analysis, with visual and interactive reports. The pipeline includes quality control, microarray normalisation, differential expression and dose-response analysis.
Our Solution
Built upon the Genestack Platform, the solution delivers a secure cloud-based environment for storing, analysing and browsing toxicogenomics data. Firstly, it allows users to seamlessly upload data and browse through hundreds of thousands of public, private and shared datasets, using faceted metadata search.
Secondly, it empowers multiple types of users to perform data analysis to investigate how genes and pathways respond to particular chemical compounds and conduct benchmark dose analysis, i.e. identify the concentration of a compound above which a specific gene or pathway starts to show a significant response.
Finally, when an experiment investigating the effects of a particular compound is carried out, the solution enables the scientists to immediately identify other compounds exhibiting a similar transcriptomic signature. These can come either from experiments conducted internally, or from publicly available datasets that have been processed by Genestack for this project.
To validate the pipeline implemented in Genestack, two large public toxicogenomics datasets were re-analysed with it: the Connectivity Map dataset from the Broad Institute and the LINCS L1000 datasets from the NIH. The results of the pipeline were compared with published analyses and found to be consistent with them.
The solution developed during the course of the project enables the user group to maximise the use of the toxicogenomics data they produce across different sites, streamline collaboration across these sites, and lower the barrier to entry for bioinformatics, allowing scientists without a computer science background to perform data analysis themselves.
Centralised data management infrastructure
To provide users with an easy-to-use infrastructure for data loading, browsing and querying, Genestack expanded its data and metadata management functionality with a set of client-specific add-ons: support for a range of microarray platforms (including Affymetrix and Agilent), metadata templates (to standardise the way users describe the metadata of their experiments) and the ability to import metadata from Excel spreadsheets. Users of the platform thus benefit from a powerful and highly customisable metadata system, with validation and autocomplete against ontologies such as ChEBI or custom controlled vocabularies, and with an efficient faceted search interface for browsing data by metadata attributes and finding relevant experiments easily.
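The ideas of vocabulary-based validation and faceted search described above can be illustrated with a minimal sketch. This is not Genestack's implementation or API: the record layout, the function names (`validate`, `facet_counts`, `filter_by`) and the small compound vocabulary are all hypothetical, chosen only to show the principle.

```python
from collections import Counter

# Hypothetical controlled vocabulary (an assumption for illustration);
# a real system would draw its terms from an ontology such as ChEBI.
ALLOWED_COMPOUNDS = {"acetaminophen", "caffeine", "valproic acid"}


def validate(record, vocabulary=ALLOWED_COMPOUNDS):
    """Return a list of problems found in one metadata record."""
    problems = []
    compound = record.get("compound")
    if compound is None:
        problems.append("missing 'compound'")
    elif compound.lower() not in vocabulary:
        problems.append(f"unknown compound: {compound}")
    return problems


def facet_counts(records, field):
    """Count how many records carry each value of a metadata field."""
    return Counter(r[field] for r in records if field in r)


def filter_by(records, **facets):
    """Keep records whose metadata matches every requested facet value."""
    return [r for r in records
            if all(r.get(key) == value for key, value in facets.items())]


# Made-up example records
records = [
    {"compound": "caffeine", "platform": "Affymetrix"},
    {"compound": "caffeine", "platform": "Agilent"},
    {"compound": "acetaminophen", "platform": "Affymetrix"},
]
print(facet_counts(records, "platform"))
print(filter_by(records, compound="caffeine", platform="Agilent"))
```

Counting values per field gives the numbers a faceted interface would show next to each filter, and combining facet filters narrows the result set one attribute at a time.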
Expression Meta-analysis application
To allow our client to browse and aggregate differential expression results across multiple experiments, Genestack developed the Expression Meta-analysis Application. This tool runs similarity searches over collections of differential expression analysis files, finding compounds with similar transcriptional responses. Users can search by gene set, specifying genes by symbol or identifier, by GO category or KEGG pathway, or by taking the query from an existing analysis file. The search returns existing private or public differential expression experiments exhibiting transcriptomic responses similar to the query set, together with a visualisation of expression levels.
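As a rough illustration of what a similarity search over differential expression results can look like, the sketch below ranks compounds by the cosine similarity of their log fold changes over shared genes. The source does not state which similarity measure the application uses, so cosine similarity is an assumption, as are the function names and the made-up signature values.

```python
import math


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two signatures (gene -> log2 fold change),
    computed over the genes present in both."""
    shared = sig_a.keys() & sig_b.keys()
    if not shared:
        return 0.0
    dot = sum(sig_a[g] * sig_b[g] for g in shared)
    norm_a = math.sqrt(sum(sig_a[g] ** 2 for g in shared))
    norm_b = math.sqrt(sum(sig_b[g] ** 2 for g in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, library):
    """Rank library compounds by similarity to the query, best first."""
    scores = {name: cosine_similarity(query, sig)
              for name, sig in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up signatures for illustration only
query = {"CYP1A1": 2.0, "TP53": -1.0}
library = {
    "compound_A": {"CYP1A1": 4.0, "TP53": -2.0},   # same direction as query
    "compound_B": {"CYP1A1": -2.0, "TP53": 1.0},   # opposite direction
}
for name, score in rank_similar(query, library):
    print(name, round(score, 3))
```

A score close to 1 indicates genes moving in the same direction as in the query, while a score close to -1 indicates an opposite response.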
When an experiment analysing the effect of a new compound is performed, scientists can check whether anyone inside or outside their organisation has already studied compounds that behave like the one they are investigating. This saves the time that would otherwise be spent re-investigating a given compound, which in turn reduces costs. We believe the development and implementation of the Expression Meta-analysis application is one of the crucial added-value elements of the project.
Interactive, user-friendly tools and pipelines
One of the central ideas behind the project was empowering scientists to perform bioinformatics tasks without the need to be proficient coders. In order to do that, the technical side of bioinformatics must be reduced to the absolute minimum and users have to be equipped with a set of interactive and visual applications that will help them understand their results.
One of the key applications developed within the scope of this project was the Dose Response Application. Using this tool, users are able to investigate how genes respond to particular chemical compounds at different doses and perform benchmark dose analysis at gene and pathway levels.
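To make the idea of a benchmark dose concrete, the sketch below estimates the lowest dose at which a gene's response reaches a chosen threshold, using linear interpolation between measured doses. This is a deliberate simplification and not the method used by the Dose Response Application: established benchmark dose approaches typically fit parametric dose-response models. The function name, the threshold and the example values are assumptions.

```python
def benchmark_dose(doses, responses, threshold):
    """Estimate the lowest dose at which |response| reaches `threshold`.

    Interpolates linearly between adjacent measured doses (a simplifying
    assumption). Returns None if the threshold is never reached.
    """
    points = sorted(zip(doses, responses))
    if not points:
        return None
    prev_dose, prev_resp = points[0]
    if abs(prev_resp) >= threshold:
        return prev_dose
    for dose, resp in points[1:]:
        if abs(resp) >= threshold:
            # Fraction of the way from the previous dose to this one
            # at which the threshold is crossed.
            fraction = (threshold - abs(prev_resp)) / (abs(resp) - abs(prev_resp))
            return prev_dose + fraction * (dose - prev_dose)
        prev_dose, prev_resp = dose, resp
    return None


# Made-up dose series and log2 fold changes for one gene
doses = [0.0, 1.0, 10.0]
responses = [0.0, 0.5, 1.5]
print(benchmark_dose(doses, responses, threshold=1.0))
```

Applying the same calculation to a pathway-level summary, for example the median response of its genes, would give a pathway-level estimate in the same spirit.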
The Result
The deployment of the solution was expected to bring, and is already bringing, substantial time and cost savings. Firstly, having a centralised data and metadata management solution provided by Genestack saves our client time otherwise spent looking for datasets, communicating between units and so on. It allows the company to re-use data repeatedly once they have been produced and to avoid re-doing experiments, which often happens when organisations do not have a “single point of truth” solution.