NGS Data Analysis Workshop Complete · • Variant calling • Variant interpretation •...
NGS Data Analysis Workshop Barts and the London Genome Centre & Ingenuity Systems
Wednesday 27th February 2013
www.smd.qmul.ac.uk/gc
www.ingenuity.com
Welcome and Introduction
The Ingenuity Systems - Barts Genome Centre Partnership
Charles Mein, D.Phil., Centre Manager [email protected]
Niels Nielsen, Senior Sales Executive [email protected]
Proprietary and Confidential 3
Agenda
12.00pm Lunch
12.30pm Niels Nielsen, Ingenuity Systems, and Charles Mein, Barts and the London Medical School
Welcome and Introduction; The Ingenuity Systems - Barts Genome Centre Partnership
12.35pm Tim Bonnert, Ingenuity Systems
Running and Analysing NGS Resequencing Experiments: An Overview of Data Generation, Analysis, and Biological Interpretation
1.00pm Charles Mein and Michael Barnes, Barts and the London Medical School
Sequencing & Bioinformatics Services of the Genome Centre
1.15pm Sasha Howard and Sayka Barry, Centre for Endocrinology, William Harvey Research Institute
Use of Ingenuity Variant Analysis in a Next-Generation Sequencing Project:
Investigating the Genetic Factors Controlling the Timing of Puberty
1.35pm Tea Break
1.55pm Tim Bonnert, Ingenuity Systems
Intro and Case Study: NGS Biological Interpretation using Ingenuity Variant Analysis
2.25pm Michael Barnes, Barts and the London Medical School
Predicting Variant Function
2.45pm Closing Remarks
3.00pm Q&A
Attendees may optionally bring VCF files for analysis and interpretation in Ingenuity Variant Analysis
An Overview of Data Generation, Analysis, and Biological Interpretation
Tim Bonnert, Ph.D.
Field Application [email protected]
NGS: A Powerful but Time Consuming Process
[Figure: workflow stages Sample Preparation, Sequencing, Annotation, Analysis, Validation, with the labels Biological Hypothesis, Experimentation, Biological Interpretation]
Results from 267 scientists currently using Next Generation Sequencing: What takes the most amount of time?
Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow (2011)
NGS: Multiple Options & Complex Considerations
Sequence
• In-House / Service Provider
• Sequence Quality
• Sequencing Coverage
• Sequencing Technology
Align
• Reference Genome
• Computational Resources
• Alignment Methods
Call Variants
• Calling Applications
• Computational Resources
• Quality Filtering
Annotation
• Allele Frequencies
• Variant Location
• Function Prediction
• Known Pathogenicity
Analysis
• Study Design & Type
• Association Statistics
• Genotype & Frequency
Biological Interpretation
• Function Prediction
• Phenotype & Disease Relationship
• Gene & Pathway Membership
[Figure: workflow stages Sample Preparation, Sequencing, Annotation, Analysis, Validation, with the labels Biological Hypothesis, Experimentation, Biological Interpretation]
Ingenuity Systems: Understanding Biology
Content Acquisition Ingenuity Ontology
Ingenuity Variant Analysis
Identify causal variants from human sequencing data in just hours
[Figure: Upstream Pipeline (Sequence & Align, Call Variants); Upload to Variant Analysis™; Loaded Samples; Combine in Analyses; Annotate & Interactively Filter; Link Variants to Biology; Share & Collaborate; of human whole genome, whole exome, and targeted exome samples]
Analysis: Statistics, Pathways, Biology, Annotation
Stratification Studies
Personal Genome
Tumour-Normal Pair
Trio/Quad Study
Genetic Disease Cohort
Large Cancer Studies
Pedigree Support
Disease Identification
Statistical Burden Testing
Rapid Prioritization and Annotation of Variants
Content Critical for Rich Biological Interpretation of Variants from DNA Re-Sequencing Studies
Unified Ontology
Disease models
Pathways
Biomarkers
Causal Networks
Regulatory
Hereditary
Experimental
Somatic
Mouse Ortholog Models
Associations
Copy number
What variants are associated with any type of [skeletal abnormality]?
What variants are associated with [breast cancer]?
What pathways have most deleterious variants in [ALL] tumor vs. normal samples?
What variants are associated with modified [warfarin] dosing?
What variants are expected to [activate] genes involved in [bone morphogenesis]?
What variants would be expected to impact expression of [predicted] [NFkB] targets?
Is this variant associated with [early onset breast cancer]? What is the literature evidence?
Which variants [in BRAF] lack kinase activity [in HELA cells]?
Which variants are observed in [>10%] of [melanomas]?
Which variants are deleterious in genes with [tumorigenic] mouse ortholog KO phenotypes?
Which variants are associated with [elevated CVD] risk at p < 10^-5 in [Framingham SHARe]?
Which variants are in regions [deleted] in [>20%] of [glioblastomas]?
What variant(s) are associated with [cystic fibrosis] patient response to [VX-770] treatment?
PGx & Clinically Validated
Biological Models
Mutation Content
Ingenuity Variant Analysis Content
• 130,000+ Ph.D./M.D. expert-curated human phenotype-associated mutations
• 145,000+ somatic variants modeled from COSMIC database
• 680,000+ somatic variants modeled from Cancer Genome Atlas (TCGA)
• 13,400+ Pharmacogenetic findings curated
• 16,000+ OMIM disease findings modeled
• 81,000+ Jackson Laboratory MGD mouse knockout database curated
• 5,900+ findings supporting Haploinsufficient genes
• 78,000+ miRNA predicted/observed binding sites integrated
• 4.9M+ Transcription factor binding sites (observed + JASPAR predicted)
• 1,300+ enhancers integrated (observed + VISTA predicted)
• SIFT, BSIFT & PolyPhen-2 functional prediction calls pre-loaded and maintained up-to-date
• Reference Genomes: dbSNP, 1000 genomes project, 54 Complete Genomics unrelated healthy reference genomes, NHLBI Exome Sequencing Project
• ~4.6 M findings from the Ingenuity Knowledge Base providing foundation layer of literature and integrated content mapped onto ~1.8M classes of the Ingenuity Ontology
Filters Utilize Ingenuity Knowledge Base, Ingenuity Analytics, and Integrated Content to Refine Variants
Ingenuity Knowledge Base + Analytics
Ingenuity Knowledge Base
Ingenuity Analytics
External Algorithm
Integrated Content or Model
Multiple Methods for Variant Selection & Prioritisation
• Genetic Analysis
– Refined using genotype selection, variant call quality, read depth, VCF filter status, and observed frequency at Variant or Gene level
– Annotate samples using a .PED file to restrict to de novo, transmitted, or Mendelian inheritance of variants across related individuals
• Statistical Association
– Over-representation analysis of genes associated with Diseases, Biological Processes, and Signalling & Metabolic Pathways
– Variant-level statistical association in Case vs. Control using basic allelic, dominant, or recessive models
– Gene-level Case (or Control) Burden or C-alpha Test with significance evaluated by permutation
– Pathway-level Case (or Control) Burden or C-alpha Test with significance evaluated by permutation
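The idea of a gene-level case burden test with permutation-based significance can be sketched as below. This is a minimal illustration under stated assumptions, not the Ingenuity implementation: the statistic (number of case samples carrying at least one qualifying variant in the gene), the function names, and the add-one p-value correction are all choices made here for the example.

```python
import random

def burden_statistic(carrier_flags, is_case):
    """Count case samples that carry at least one qualifying variant in the gene."""
    return sum(1 for carries, case in zip(carrier_flags, is_case) if carries and case)

def permutation_p_value(carrier_flags, is_case, n_permutations=10000, seed=0):
    """One-sided p-value: how often shuffled case/control labels give a
    burden at least as large as the observed one."""
    rng = random.Random(seed)
    observed = burden_statistic(carrier_flags, is_case)
    labels = list(is_case)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # break any link between carrier status and phenotype
        if burden_statistic(carrier_flags, labels) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero (an assumption of this sketch).
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical toy data: 3 cases (all carriers) and 7 controls (no carriers).
carriers = [True] * 3 + [False] * 7
cases = [True] * 3 + [False] * 7
p_value = permutation_p_value(carriers, cases, n_permutations=2000)
```

With so few samples the smallest achievable p-value is limited by the number of distinct label arrangements, which is one reason additional control samples add power.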
Refine Variants Based on Biological Associations and Molecular Interactions
Diseases, Genes, Biological Processes, Phenotypes, Signalling Pathways, Protein Domains, Protein Families
Variants
Disease
Pathways
Phenotype
Your Gene List
Protein Domain
Regulators of…
Biological Context via Gene and Pathway Relationships
Streamlined Upload of Data via DropZone
[Figure: Upstream Pipeline (Sequence & Align, Call Variants); Upload to Variant Analysis™; Loaded Samples; Combine in Analyses; Annotate & Interactively Filter; Link Variants to Biology; Share & Collaborate; of human whole genome, whole exome, and targeted exome samples]
DataDrop via Ingenuity DropZone
• VCF 4.x
• Complete Genomics masterVar
• Complete Genomics Var
• GVF
Access to Ingenuity Variant Analysis and the use of DropZones for data loading are available to all users
Summary of Features and Benefits
Fast
Knowledge-driven
Scalable and Secure
User friendly
Easy to purchase
Easy to implement
Try Ingenuity® Variant Analysis™ with your Called Variant Files
Free Previews of Unlimited Analyses with Unlimited Samples
www.ingenuity.com/variants
Sequencing & Bioinformatics Services of the Genome Centre
Charles Mein, D.Phil., Centre Manager [email protected]
Michael Barnes, Ph.D., Director of Bioinformatics [email protected]
Sequencing and Bioinformatics at the Barts and the London Genome Centre
Number of Bases Interrogated    Application
3,000,000,000                   Whole Genome Sequencing
30,000,000                      Exome Sequencing
4,000,000                       Targeted Sequence Capture
2,500,000                       Whole Genome SNP genotyping
200,000                         Targeted SNP typing
6,000                           Linkage arrays
1,000                           Sanger Sequencing
300                             Microsatellite Genotyping
1                               TaqMan SNP typing
Next Generation Sequencing - Illumina MiSeq, GAIIx, HiSeq. Species independent, most comprehensive tool currently available
Micro Array - Illumina iScan. Range of scales to facilitate many different approaches including GWAS and linkage. Species include mouse and human
Capillary Sequencing - Applied Biosystems 3730xl. Sequencing of plasmids and PCR products. Genotyping length variants, e.g. microsatellites
Real time PCR - Applied Biosystems 7900HT 384 well. SNP genotyping at individual loci
Target preparation for Next Generation Sequencing:
• PCR based target prep
• Target Enrichment – hybridisation methods
• Bioinformatics challenges
• Storage
• Data QC
• Alignment
• Variant calling
• Variant interpretation
• Statistical inference
Genome Centre pipeline
Genome Centre/ WHRI Bioinfo group
Research groups with Bioinformatics capabilities
Third party suppliers of software solutions
Bioinformatics challenges: Storage
• Genome Centre maintains own compute resource – access at cost
• Dedicated staff member
• Additional funding application in for beefed up system
• Tie up with other resource in college: MidPlus and Physics cluster
• Cloud in the future?
Bioinformatics challenges: Data QC
• Quality assessment at each step (bioinformatics as well as lab stuff)
• Communication between wet and dry team
• In house pipeline using open source tools
– SAMtools
– Picard
– GATK
Bioinformatics challenges: Alignment
• Alignments of individual sequence back to reference genome
• Alignment part of in house QC tool
• Important quality metric – coverage statistics
• Bowtie (open source)
• Novoalign (commercial)
Bioinformatics challenges: Variant calling
• Variant Calls
• Identify changes from reference sequence – key outcome from resequencing
• GATK
Bioinformatics challenges: Variant interpretation
Fu et al. 2013: Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants
• Difficulty in assigning function to variant – even for coding SNPs
– conservation-based methods (GERP++ and PhyloP)
– functional prediction methods (SIFT, PolyPhen2, MutationTaster)
• Little overlap in predictions
– Fu et al. 2013 only consider a variant deleterious if predicted by 4/6 methods
• New and innovative approaches required
– In-house
– Third parties
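The "4 out of 6 methods" consensus rule mentioned on this slide can be sketched as a simple vote count. This is only an illustration: the function name, the dictionary layout, and the example calls are invented here, and the sixth method is left as a placeholder because the slide names only five tools.

```python
def is_deleterious_by_consensus(predictions, min_votes=4):
    """Return True when at least `min_votes` methods call the variant deleterious.

    `predictions` maps a method name to True (deleterious) or False (tolerated).
    """
    votes = sum(1 for called_deleterious in predictions.values() if called_deleterious)
    return votes >= min_votes

# Hypothetical calls for one missense variant (values invented for illustration).
example_calls = {
    "GERP++": True,
    "PhyloP": True,
    "SIFT": True,
    "PolyPhen2": False,
    "MutationTaster": True,
    "sixth_method_placeholder": False,  # not named on the slide
}
consensus = is_deleterious_by_consensus(example_calls)
```

A stricter `min_votes` trades sensitivity for specificity, which matters precisely because the individual methods overlap so little.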
Use of Ingenuity Variant Analysis in a Next-Generation Sequencing Project
Investigating the Genetic Factors Controlling the Timing of Puberty
Sasha Howard, Ph.D.
Clinical Research [email protected]
Dr. Sasha Howard
Clinical Research Fellow
Centre for Endocrinology
William Harvey Research Institute
Our project
• Aim: To identify causal genetic factors in a cohort of patients with CDGP
– CDGP: Constitutional Delay in Growth and Puberty
– Phenotype of delayed puberty and short stature
– Segregates in Autosomal Dominant pattern
– Studies estimate 60-80% of pubertal timing is genetically determined [1]
[1] Gajdos ZK 2009
Our methods
• Use whole-exome sequencing as an initial tool to identify novel candidate genes which segregate with the disease trait
• WES of 52 individuals from 7 families (NimbleGen V2 platform)
– 36 affected (4-7 in each family)
– 14 unaffected
– 2 unknown status (grandparent pair)
Next-Generation Sequencing
[2] Bamshad 2011
NGS data
• Data returned via cloud-based storage and analysis tool - DNAnexus
• Average coverage 44x
• Mapped reads mean 93% (range 88-99%)
• Coding variants – mean 22,300 per sample
– Mean 11,174 missense
– Mean 165 nonsense
Pilot analysis
• 3 families; 16 individuals
• Analysis in DNAnexus
– Individual nucleotide-level variation analyses
– Family population-based frequency analysis
• Further filtering in Excel/Galaxy
• Manual minor allele frequency look-up in Ensembl/GO ESP/UCSC
• Pathway analysis using GeneGo MetaCore (Thomson Reuters)
Ingenuity Variant analysis
– Nucleotide-level variation analysis in DNAnexus
– Results exported and converted to VCF files using Perl
– Uploaded to Ingenuity Variant Analysis for filtering and annotation
– Additional pathway analysis using GeneGo MetaCore (Thomson Reuters)
Ingenuity variant analysis
Filtering
• Common variants filter – MAF < 5%
• Predicted Deleterious filter – deleterious/damaging mutations
• Genetic Analysis filter – segregation with trait & QC
• Biological context filter
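The first step of the cascade above, the common variants filter at MAF < 5%, can be sketched as follows. This is a minimal illustration, not the tool's logic: the record layout, the field name `maf`, the example values, and the decision to keep variants with no known population frequency are all assumptions made for the example.

```python
def passes_common_variant_filter(variant, maf_threshold=0.05):
    """Keep variants whose minor allele frequency is below the threshold.

    Variants with no recorded frequency (maf is None) are kept, on the
    assumption that they may be novel.
    """
    maf = variant.get("maf")
    return maf is None or maf < maf_threshold

# Hypothetical variant records (identifiers and frequencies invented).
variants = [
    {"id": "var_common", "maf": 0.32},
    {"id": "var_rare", "maf": 0.004},
    {"id": "var_novel", "maf": None},
]
rare_variants = [v["id"] for v in variants if passes_common_variant_filter(v)]
```

The later filters (predicted deleterious, segregation, biological context) would be applied to the surviving list in the same way, each narrowing the candidate set further.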
Pathway analysis
Filtering of candidates
• Further filtering based on:
– Gene expression profiles
– Impact on protein
– Conservation
– Sanger sequencing of potential candidates to verify mutation & sequence in further individuals
– Examine data from genome-wide association studies and related conditions with known genetic basis
Advantages of Variant Analysis
• Speed!
• User-friendly/ intuitive
• Minor allele frequency data
• Good support
• Multiple analyses of same data with small variations
• Access to Ingenuity database
Disadvantages
• Cost
• Use limited to one year
• Biological context filters can be limited (e.g. delayed puberty = FGF1)
• Keep aware of limitations of input files
Identification of candidate genes using trio family whole exome DNA data
Dr. Sayka [email protected]
Dr Sayka Barry
Centre for Endocrinology
William Harvey Research Institute
Barts and the London School of Medicine
Queen Mary University of London
John Vane Science Centre
Charterhouse Square
London EC1M 6BQ
PED FILE
Family ID   Individual ID   Paternal ID   Maternal ID   Sex   Phenotype
Sp95        Sp95M1          Sp95M3        Sp95M2        1     2
Sp95        Sp95M2                                      2     1
Sp95        Sp95M3                                      1     1
Sp95M1: affected child with pituitary hyperplasia
Sp95M2: unaffected mother
Sp95M3: unaffected father
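How a pedigree file like this one identifies a trio can be sketched in a few lines. This is an illustration only: the slide leaves the parents' own Paternal ID and Maternal ID blank, and the example assumes the common PED convention of writing "0" for unknown parents, along with sex coded 1 = male, 2 = female and phenotype coded 1 = unaffected, 2 = affected. The function and type names are invented here.

```python
from collections import namedtuple

PedRecord = namedtuple("PedRecord", "family individual father mother sex phenotype")

# The trio from the slide; "0" for the founders' parents is an assumption.
PED_TEXT = """\
Sp95 Sp95M1 Sp95M3 Sp95M2 1 2
Sp95 Sp95M2 0 0 2 1
Sp95 Sp95M3 0 0 1 1
"""

def parse_ped(text):
    """Parse whitespace-delimited PED lines into PedRecord tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        family, individual, father, mother, sex, phenotype = line.split()[:6]
        records.append(PedRecord(family, individual, father, mother, int(sex), int(phenotype)))
    return records

def find_trios(records):
    """Return (child, father, mother) for every individual whose parents are both in the file."""
    by_id = {r.individual: r for r in records}
    return [
        (r.individual, r.father, r.mother)
        for r in records
        if r.father in by_id and r.mother in by_id
    ]

records = parse_ped(PED_TEXT)
trios = find_trios(records)
affected = [r.individual for r in records if r.phenotype == 2]
```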
Trio study:
Exome sequencing: Otogenetics used NimbleGen V2 (44.1 Mbp) for whole human exome enrichment and PE100 Illumina HiSeq2000 sequencing
Data analysis: Ingenuity variant analysis tool
• 1000 genomes project
• NHLBI ESP exomes
• 54 publicly available whole genomes sequenced by Complete Genomics
Filters used
Common variant filters:
Deleterious prediction filters:
Genetic analysis filters:
Call quality filter:
De novo:
Mendelian:
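The distinction between a de novo candidate and a variant consistent with Mendelian transmission in a trio can be sketched as below. This is an assumption-laden illustration, not the filter logic of Ingenuity Variant Analysis: genotypes are encoded here as the number of alternate alleles (0, 1 or 2) at an autosomal biallelic site, and the function names are invented.

```python
def transmissible_alleles(alt_count):
    """Alleles (0 = reference, 1 = alternate) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[alt_count]

def classify_trio_genotype(child, father, mother):
    """Classify a child's genotype given both parents' genotypes."""
    if child > 0 and father == 0 and mother == 0:
        # Child carries an alternate allele that neither parent has.
        return "de novo candidate"
    possible_child_genotypes = {
        f + m
        for f in transmissible_alleles(father)
        for m in transmissible_alleles(mother)
    }
    if child in possible_child_genotypes:
        return "consistent with Mendelian transmission"
    return "Mendelian inconsistency"

heterozygous_child_of_ref_parents = classify_trio_genotype(1, 0, 0)
inherited_from_father = classify_trio_genotype(1, 1, 0)
```

In practice apparent de novo calls are often genotyping errors in a parent, which is why a call quality filter is applied alongside the de novo filter.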
De novo analysis
frameshift, stop gain
Mendelian
Total 158 variants (137 genes):
• 2 stop gain
• 5 frameshift
• 59 in-frame
• 71 missense
• 11 splice site
• 9 microRNA
• 1 unknown
De novo analysis
De novo analysis
Notch signaling
De novo analysis
Summary:
• De novo: six candidate genes and
• Mendelian: four genes selected for validation by Sanger sequencing
Total 33 variants (27 genes):
• 1 stop gain
• 5 frameshift
• 10 in-frame
• 15 missense
• 2 splice site
Mendelian:
Scientific Case Study
NGS Biological Interpretation using Ingenuity® Variant Analysis™
Tim Bonnert, Ph.D.
Field Application [email protected]
Case Study: Hereditary Pheochromocytoma
• Hereditary pheochromocytoma (PCC) is a neuroendocrine tumor of the medulla of the adrenal glands
• Whole exome sequencing (Agilent SureSelect) on Illumina Genome Analyzer II
• Sequence data obtained from European Nucleotide Archive (ENA)
– http://www.ebi.ac.uk/ena/data/view/ERR031607-ERR031626
• Published in Nature Genetics (2011); PMID: 2168591
– Exome sequencing identifies MAX mutations as a cause of hereditary pheochromocytoma
Independent hereditary pheochromocytoma
Independent HapMap samples
Analysis of Pheochromocytoma Data
[Figure: Upstream Pipeline (Public ENA Data; Galaxy (GATK); VCF 4.0); Manual or Automatic Upload to Variant Analysis™ (DataDrop via Ingenuity DropZone); Loaded Samples; Combine in Analyses; Annotate & Interactively Filter; Link Variants to Biology; Share & Collaborate; of human whole genome, whole exome, and targeted exome samples]
3 Case vs. 7 Controls
Genetic Disease Workflow: Annotation – Analysis – Biology
Create Analysis: Different Workflows Available
Add Biological Terms
• Diseases
• Genes
• Biological Processes
• Phenotypes
• Signalling Pathways
• Protein Domains
• Protein Families
Drag-and-Drop Samples and Create Analysis
• Analysis will be created with typical filters with appropriate settings for type of analysis and biological context
Causal variant identification in human DNA resequencing data
Live Demonstration
Analysis of Hereditary Pheochromocytoma
Share and Re-Use Samples: The Power of Controls
Pheochromocytoma data as 3x Case only samples
Pheochromocytoma data as 3x Case vs. 7x Controls
+73% / 13%
+175% / +67%
+67% / +13%
+168% / +144%
-86% / -87%
-89% / -89%
NGS Variant Analysis
Pitfalls and Progress
Dr. Michael R. Barnes
Director of Bioinformatics, William Harvey Research Institute, QMUL
Overview
• What X coverage is right for your experiment?
• From FASTQ to VCF file
• Alignment and Variant calling
• Balancing sensitivity and specificity of variant calls
• The Functional analysis challenge
• Defining functional impact
• SNVs are not the only variants...
• Causality and “Clinically Actionable” variants
• How to define a causal variant
• Sources of Failure and Success
How much sequence coverage is “enough”?
High pass genome re-sequencing design
Low pass genome re-sequencing design
Exome sequencing: Exome capture
Factors to define sequencing coverage
[Figure: sequence coverage (100x, 10x, 40x) set against study aim (Genotype Known Variant Detection, Population Variant Discovery, Disease Variant Identification) and inheritance model (Recessive*, Unknown**, Dominant**)]
* Causal variant is expected to be homozygous
** Causal variant could be homozygous or heterozygous
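Why a heterozygous causal variant needs more depth than a homozygous one can be illustrated with a simple binomial model. The sketch below is not from the slides: it assumes each read at a heterozygous site shows the alternate allele with probability 0.5, ignores sequencing error and capture bias, and uses an invented rule that a caller needs at least k alternate reads. The mean coverage formula (reads times read length divided by target size) is the usual back-of-envelope estimate.

```python
from math import comb

def mean_coverage(n_reads, read_length, target_size):
    """Average depth if reads were spread evenly over the target."""
    return n_reads * read_length / target_size

def prob_at_least_k_alt_reads(depth, k, alt_fraction=0.5):
    """Probability of observing at least k alternate-allele reads at a site
    sequenced to `depth`, when each read is alternate with `alt_fraction`."""
    return sum(
        comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )

# Chance of seeing 3 or more alternate reads at a heterozygous site.
p_10x = prob_at_least_k_alt_reads(depth=10, k=3)  # 0.9453125
p_40x = prob_at_least_k_alt_reads(depth=40, k=3)
```

At a homozygous site every read carries the variant, so a much lower depth suffices, which is consistent with the footnotes above distinguishing recessive from dominant or unknown models.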
Defining variants from NGS data
An overview of NGS Data processing
GATK
• Realign indels
• Flag PCR duplicates
• Mark unreliable calls
Annovar
The Variant Call Format (VCF)
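As a rough guide to what a VCF data line contains, the sketch below splits one tab-delimited record into the eight fixed columns defined by the VCF 4.x specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record and the function name are invented for illustration; a real pipeline would normally use an established VCF library rather than hand-written parsing.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # multiple alternate alleles are comma-separated
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, separator, value = item.partition("=")
            info[key] = value if separator else True  # keys without "=" are flags
    record["INFO"] = info
    return record

# Invented example record: an A>G change with depth 44, flagged as present in dbSNP.
example_line = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=44;DB"
example_record = parse_vcf_line(example_line)
```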
ANNOVAR: Using Annotation to Narrow the Search Space
• ANNOVAR SNV annotation
– Known Frequencies
  – dbSNP
  – 1000 genomes
  – NIH exome project (6500 exomes)
– Gene mapping
– Coding impact
  – SIFT
  – PhyloP
  – PolyPhen
  – MutationTaster
openbioinformatics.org/annovar
ROC and PR curve comparison of protein function prediction methods (ExoVar dataset using a 10-fold cross-validation).
Li M-X, Kwan JSH, Bao S-Y, Yang W, et al. (2013) Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies. PLoS Genet 9(1): e1003143. doi:10.1371/journal.pgen.1003143 http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1003143
Proportion of pathogenic rare nsSNPs & total load of pathogenic derived alleles in 8 HapMap subjects with high coverage sequencing
Li M-X, Kwan JSH, Bao S-Y, Yang W, et al. (2013) Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies. PLoS Genet 9(1): e1003143. doi:10.1371/journal.pgen.1003143 http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1003143
Cancer is a disease of genome alterations.
SNVs are only a partial view of variation:
M. Meyerson, S.Gabriel, G.Getz, Nature Reviews Genetics 11, 685-696
Clinical Utility: Challenges
• NGS data density = frequently encountered variants of unknown significance
• Which variants are “clinically actionable”?
• Development of evidence-based scientific standards to evaluate utility in different patient populations for accurate risk estimation
• Risk of over-interpretation: unnecessary medical action, unwarranted psychological stress
• Careful selection of patients for genome sequencing and genetic counseling is crucial
NGS analysis: The fine line between success and failure
NGS analysis: Sources of failure
• Not a SNV!
• Lack of sequence coverage
• Gene not in the exome capture
• Bioinformatic variant calling
• Bioinformatic annotation
• Mutation is not exonic
• You get scooped
• Clinical heterogeneity or wrong diagnosis
Some tips for success
• Variant analysis is only as good as the variant calls
– Better to be too stringent than too lenient
– But annotate poor calls rather than exclude
• If at first you don’t succeed, re-analyse!
– Consider other variant types
– Consider other types of impact (e.g. ENCODE - RegulomeDB)
– Consider other variant calling methods
• Focus your efforts where it matters most
– Variants that can be tested in the lab
– Genes with biological/pathway support
• Make your life easier and reduce error: automate!
– Try to avoid over-using Excel
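The tip to annotate poor calls rather than exclude them can be sketched as a small scripted step, in the spirit of automating instead of editing by hand in a spreadsheet. Everything specific here is an assumption for illustration: the thresholds, the field names `qual`, `depth` and `filter`, and the flag labels are invented, not taken from any particular caller.

```python
def flag_low_quality(calls, min_qual=30.0, min_depth=10):
    """Return copies of the calls with a 'filter' field set to PASS or to the
    reasons the call looks unreliable. No call is dropped."""
    flagged = []
    for call in calls:
        reasons = []
        if call["qual"] < min_qual:
            reasons.append("LowQual")
        if call["depth"] < min_depth:
            reasons.append("LowDepth")
        annotated = dict(call)  # leave the input records untouched
        annotated["filter"] = ";".join(reasons) if reasons else "PASS"
        flagged.append(annotated)
    return flagged

# Hypothetical variant calls (values invented).
calls = [
    {"id": "v1", "qual": 99.0, "depth": 44},
    {"id": "v2", "qual": 12.0, "depth": 6},
    {"id": "v3", "qual": 55.0, "depth": 4},
]
annotated_calls = flag_low_quality(calls)
passing = [c["id"] for c in annotated_calls if c["filter"] == "PASS"]
```

Because the poor calls are kept and labelled, a stringent first pass can later be relaxed and re-analysed without re-running the upstream pipeline.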
Closing Remarks
The Ingenuity Systems - Barts Genome Centre Partnership
Charles Mein, D.Phil. Niels Nielsen
Centre Manager Senior Sales Executive [email protected] [email protected]
Questions?
www.smd.qmul.ac.uk/gc
www.ingenuity.com
Stay and Play!
• Try your own VCF files in Ingenuity Variant Analysis
• Test the system with some example data