We were very excited to attend TriCon for the first time a couple of weeks ago. From now on, this event will be a must on our conference list. We put together our notes from the conference and wrote a summary of the topics we found most interesting.

The "Integrative Informatics for pharma" track was the closest to our interests. During the talks, representatives of pharmaceutical companies and software vendors addressed the challenges of multi-omics data management and analysis in the context of pharma R&D, things that we at Genestack are working on extensively as well.

A common thread throughout most of the talks was the goal of data integration: building actionable knowledge bases at the scale of large organisations. The challenge here is to bring together internal data, produced in isolation by separate departments, and external data, coming from public repositories or commercial databases. This challenge is made harder by:
- The vast heterogeneity in data types that can be produced (genomics, transcriptomics, metabolomics, imaging, clinical, chemical...)
- Their intrinsic complexity: for each data type, the data often needs complex processing via computationally intensive pipelines in order to extract actionable information in a research context
- The isolation of the data sources (often dubbed "data silos"): there is often a lack of infrastructure or standards in place to share data across sources, and sometimes legal barriers as well (especially for human genomic and clinical data)

In response to these challenges, an initiative called the "FAIR data principles" has emerged over the past few years. These are a set of guidelines meant to facilitate the sharing and integration of data. "FAIR" stands for findable, accessible, interoperable and reusable. The guidelines are detailed in a publication from 2016. Listening to the talks in this track, it definitely felt like the idea of FAIR data stewardship is gaining traction across the industry. The challenge that remains is to build the necessary infrastructure (both computational and organisational) to support this vision for data integration, and to lead the way on the journey from data to knowledge.

After the main conference, we attended a symposium on NGS Diagnostics: Knowledge Bases, Annotation and Interpretation. As the cost of WES/WGS drops rapidly and success stories about their applications emerge, these technologies are gaining wider adoption in clinical settings. We are now accumulating an enormous number of variants associated with complex diseases and phenotypes, and the challenge of managing, analysing, and annotating them for clinical-grade reporting and interpretation is becoming more apparent. After listening to leading researchers and clinicians share their experiences, we took home a lot of best-practice tips, and we hope to translate them into better variant management, analysis, and interpretation for Genestack users.

The symposium started with talks about data solutions to advance genomic medicine. Louis M. Staudt, the Co-Chief of the Center for Cancer Research at NIH, emphasised the importance of sharing genomic and clinical data by presenting the current process and the benefits of the National Cancer Institute (NCI) Genomic Data Commons (GDC) system, which now contains data from roughly 14,000 patients. The GDC was launched in mid-2016 and is a core component of the US National Cancer Moonshot and the President's Precision Medicine Initiative.
It centralises and makes accessible data from large-scale NCI programs such as TCGA and TARGET. Data are made available in a standardised format via a harmonisation mapping procedure, an approach we also use at Genestack. The GDC runs multiple variant-calling pipelines, because somatic variant detection from DNA-seq data of tumour tissues is a complicated process and there is no consensus in the community on the best variant-calling algorithm. As the system grows, the hope is to better understand the genomic drivers and determinants of cancer and its associated therapeutic response.

Madhuri Hegde, Adjunct Professor at Emory University, gave an interesting presentation about Emory Genetics Laboratory's in-house laboratory data management system (EmVar and EmVClass), which provides a solution to two pressing issues faced by clinical genetics laboratories: how to manage a large variant inventory with evolving variant classifications that need to be communicated to healthcare providers, and how to make that inventory of variants freely available to the community. EmVar tracks changes in variant classifications, creating a record of previous cases that need updated reports when a classification changes.

Birgit H. Funke addressed the challenges of developing knowledge resources for a large number of variants and disorders by showcasing the ClinGen/ClinVar initiative. ClinVar is a publicly available, curated database of the clinical significance of variants relative to phenotypes. This will be helpful in guiding diagnostic laboratory directors through all aspects of genomic sequencing, including test design, validation and interpretation. Birgit also highlighted the particular challenge of discrepant variant classifications between labs, and how ClinGen addresses it using their star-based review system.

Reference models and population-based screening are particularly important for guiding variant interpretation, and we've seen their usefulness at Genestack when we used the 1KG datasets for association analysis. We were particularly pleased to hear about a follow-up to the Exome Aggregation Consortium (ExAC) dataset: the recent release of the Genome Aggregation Database (gnomAD), which consists of two call-sets, exome sequence data from 123,136 individuals and whole genome sequence data from 15,496 individuals.

The second day saw comprehensive discussions on variant interpretation and presentation. Keith Nykamp from Invitae presented the company's framework for building a consistent and accurate clinical interpretation of genomic variants. The framework takes into account the factors that determine the pathogenicity of variants: reference population data (e.g. using ExAC, gnomAD, 1KG to rule out the possibility that a variant can cause a rare disease), variant type, clinical observations (critical for demonstrating a direct link between a variant and a disease), experimental studies (useful for assessing the functional impact of a variant), and indirect/computational assessment. Their framework closely follows the guidelines on the interpretation of sequence variants published by ACMG. Following this strategy does seem like a recipe for success, as Invitae has a strong track record in the interpretation and reporting of clinical sequencing results, being one of the winners of the CLARITY challenge.

In summary, TriCon was a fascinating conference, filled with discussions on the biggest challenges faced by pharma R&D and brainstorming on how to address them. We'll surely be attending next year too! The next conference we're going to is Revolutionising Next Generation Sequencing in Antwerp, where Dr. Misha Kapushesky, the CEO of Genestack, will be presenting Genestack's metadata management solutions, reproducible pipelines and interactive analytics for enterprise bioinformatics R&D on 21st March at 3:10pm.