ReComp for genomics


Our vision: selective re-computation of genomics pipelines in reaction to changes

Nov. 2016

Dr. Paolo Missier, School of Computing Science, Newcastle University

Data Analytics enabled by NGS

Genomics: WES / WGS, variant calling, variant interpretation, diagnosis
- E.g. 100,000 Genomes Project, Genomics England, GeCIP

Submission of sequence data for archiving and analysis

Data analysis using selected EBI and external software tools

Data presentation and visualisation through web interface

Visualisation

Metagenomics: species identification
- E.g. the EBI metagenomics portal

http://oceans.taraexpeditions.org/

Understanding change: threats and opportunities

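[Figure: Big Data flows through Life Sciences Analytics to produce “Valuable Knowledge” (versions V1, V2, V3). The analytics depend on meta-knowledge (algorithms, tools, middleware, reference datasets), and each of these elements changes over time (t).]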

Key questions for the ReComp project:

• Threats: Will any of the changes invalidate prior findings?

• Opportunities: Can the findings from the pipelines be improved over time?

• Cost: Need to model future costs based on past history and pricing trends for virtual appliances

• Impact:
  • Which patients/samples are likely to be affected?
  • How do we estimate the potential benefits for affected patients?
  • Re-computations are expensive. Can we estimate the impact of these changes without re-computing entire cohorts?

Many of the elements involved in producing analytical knowledge change over time:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCards, …)
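One way to approach the impact question without re-running a whole cohort is to compare two versions of a reference database and flag only the samples that carry a changed entry. The following is a minimal sketch under stated assumptions, not the project's actual method: the dictionary representation of a reference database, the variant and patient identifiers, and the function names `changed_variants` and `affected_samples` are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: one version of a reference database maps a
# variant ID to its clinical classification (e.g. "benign", "pathogenic").
RefDB = Dict[str, str]


def changed_variants(old: RefDB, new: RefDB) -> Set[str]:
    """Return variant IDs that were added, removed, or reclassified."""
    return {v for v in old.keys() | new.keys() if old.get(v) != new.get(v)}


def affected_samples(sample_variants: Dict[str, Set[str]],
                     changed: Set[str]) -> Set[str]:
    """Return the samples that carry at least one changed variant."""
    return {s for s, variants in sample_variants.items() if variants & changed}


# Illustrative data only (not real variants or patients).
old_db = {"var1": "uncertain", "var2": "benign"}
new_db = {"var1": "pathogenic", "var2": "benign", "var3": "pathogenic"}
cohort = {
    "patient_A": {"var1", "var9"},
    "patient_B": {"var2"},
    "patient_C": {"var3"},
}

changed = changed_variants(old_db, new_db)
to_recompute = affected_samples(cohort, changed)
print(sorted(to_recompute))
```

In this toy cohort only the samples touching a changed entry are candidates for re-computation; the rest can be left alone.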

The ReComp vision

Observe change
• In big data
• In meta-knowledge

Assess and measure
• Knowledge decay

Estimate
• Cost and benefits of refresh

Enact
• Reproduce (analytics) processes


ReComp: a decision support system for selectively re-computing complex analytics in reaction to change

- Generic: not just for the life sciences!
- Customisable: e.g. for genomics pipelines
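The "estimate" and "enact" steps of the vision amount to a selection problem: given estimated benefit and cost per sample, which re-computations are worth running? The sketch below illustrates one possible decision rule, a greedy benefit-per-cost ranking under a fixed budget. This heuristic, the `ChangeAssessment` record, and the numeric scores are assumptions made for illustration; the slides do not specify how ReComp makes this decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeAssessment:
    """Hypothetical per-sample estimate produced after observing a change."""
    sample_id: str
    estimated_benefit: float  # assumed score, higher means more to gain
    estimated_cost: float     # assumed positive, in arbitrary cost units


def select_for_recomputation(assessments: List[ChangeAssessment],
                             budget: float) -> List[str]:
    """Greedily pick samples by benefit/cost ratio until the budget is spent."""
    # Only samples with a positive expected benefit and a positive cost qualify.
    candidates = [a for a in assessments
                  if a.estimated_benefit > 0 and a.estimated_cost > 0]
    ranked = sorted(candidates,
                    key=lambda a: a.estimated_benefit / a.estimated_cost,
                    reverse=True)
    chosen, spent = [], 0.0
    for a in ranked:
        if spent + a.estimated_cost <= budget:
            chosen.append(a.sample_id)
            spent += a.estimated_cost
    return chosen


# Illustrative numbers only.
assessments = [
    ChangeAssessment("s1", estimated_benefit=0.9, estimated_cost=10.0),
    ChangeAssessment("s2", estimated_benefit=0.2, estimated_cost=10.0),
    ChangeAssessment("s3", estimated_benefit=0.5, estimated_cost=4.0),
]
selected = select_for_recomputation(assessments, budget=15.0)
print(selected)
```

A greedy ranking is simple but not optimal in general; it is used here only to make the cost/benefit trade-off concrete.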

Approach and challenges

Challenges:

1. Learning from history and optimisation:
• What types of meta-knowledge need to be captured, and how much history is required to make optimal re-computation decisions?
• Can we use history to learn estimates of impact without the need for actual re-computation?

2. Software infrastructure and tooling: ReComp aims to deliver a metadata management and analytics stack

3. Reproducibility: How do we ensure that the “ReComp” button will actually perform a valid re-computation?

4. Impact: Which areas of genomics, and more broadly bioinformatics, can benefit from ReComp?

Approach: It’s all in the meta-data!

1. History of past computations. Capture details of analytics tasks and their executions:
- Structure and dependencies of the process
- Cost
- Provenance of the outcomes

2. Metadata analytics: Learn from history
- Estimation models for impact, cost, benefits
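The two steps above can be made concrete with a small sketch: a record type for past executions, and a deliberately naive cost estimate learned from that history (the mean cost of earlier runs of the same pipeline). The `ExecutionRecord` fields, the pipeline and tool names, and the averaging model are hypothetical choices for illustration, not the schema or estimation model used by ReComp.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class ExecutionRecord:
    """Hypothetical metadata captured for one past pipeline execution."""
    pipeline: str                  # which process was run
    tool_versions: Dict[str, str]  # dependencies of the process
    inputs: List[str]              # provenance: what the outcome was derived from
    outputs: List[str]             # provenance: what was produced
    cost: float                    # e.g. cloud spend, in arbitrary units


def estimate_cost(history: List[ExecutionRecord],
                  pipeline: str) -> Optional[float]:
    """Naive estimate: mean cost of past runs of the same pipeline."""
    costs = [r.cost for r in history if r.pipeline == pipeline]
    return mean(costs) if costs else None


# Illustrative history only; names and versions are made up.
history = [
    ExecutionRecord("variant_calling", {"caller": "1.0"},
                    ["sample1.bam"], ["sample1.vcf"], cost=12.0),
    ExecutionRecord("variant_calling", {"caller": "1.1"},
                    ["sample2.bam"], ["sample2.vcf"], cost=8.0),
]
print(estimate_cost(history, "variant_calling"))
print(estimate_cost(history, "variant_interpretation"))
```

A realistic model would also account for input size, tool versions, and pricing trends, which is why the record keeps those details alongside the cost.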

Project structure

• 3 years of funding from the EPSRC (£585,000 grant) on the Making Sense from Data call
• Feb. 2016 - Jan. 2019

• 2 RAs fully employed in Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-Investigators (8% each):
  • Prof. Watson, School of Computing Science, Newcastle University
  • Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
  • Dr. Phil James, Civil Engineering, Newcastle University

Builds upon the experience of the Cloud-e-Genome project: 2013-2015

Aims:
- To demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud
- To facilitate the adoption of reliable genetic testing in clinical practice

- A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University

- Funding: NIHR / Newcastle BRC (£180,000) plus $40,000 Microsoft Research grant “Azure for Research”
