Ngs part iii 2013

Sample & Assay Technologies. Next Generation Sequencing: Data analysis for genetic profiling. Ravi Vijaya Satya, Ph.D., Senior Scientist, R&D

Transcript of Ngs part iii 2013

Page 1: Ngs part iii 2013

Sample & Assay Technologies

Next Generation Sequencing: Data analysis for genetic profiling

Ravi Vijaya Satya, Ph.D., Senior Scientist, R&D

Page 2: Ngs part iii 2013

Welcome to the three-part webinar series

Next Generation Sequencing and its role in cancer biology

Webinar 1: Next-generation sequencing, an introduction to technology and applications
Date: April 4, 2013
Speaker: Quan Peng, Ph.D.

Webinar 2: Next-generation sequencing for cancer research
Date: April 11, 2013
Speaker: Vikram Devgan, Ph.D., MBA

Webinar 3: Next-generation sequencing data analysis for genetic profiling
Date: April 18, 2013
Speaker: Ravi Vijaya Satya, Ph.D.

Page 3: Ngs part iii 2013

Agenda

• NGS Data Analysis
  • Read Mapping
  • Variant Calling
  • Variant Annotation
• Targeted Enrichment
  • GeneRead Gene Panels
• GeneRead Data Analysis Portal
  • Background
  • Workflow
  • Data Interpretation

Page 4: Ngs part iii 2013

Read Mapping

Millions of reads from a single run

Reads mapped to a reference genome

• Programs for read-mapping
  • Hash-based
    • MAQ, ELAND, SOAP, Novoalign
  • Suffix array/Burrows-Wheeler Transform based
    • BWA, Bowtie, Bowtie2, SOAP2
• Alignment
• Mapping Quality
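To make the hash-based idea concrete, here is a minimal Python sketch of seed-and-extend mapping. It is a toy illustration only, not how MAQ, ELAND, SOAP, or Novoalign are implemented. The function names, the seed length `k=8`, and the mismatch limit are assumptions chosen for this example; the reference and read are the sequences shown on the Variant Calling slide.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (0-based start, mismatches) of the best placement, or None.

    Each k-mer of the read is looked up in the index (the seed step); every
    hit proposes a start position, which is then checked base by base
    (the extend step).
    """
    best = None
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


REFERENCE = "ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC"
READ = "AACTAGACTAGGATCGTCCTACATAGTCTCG"  # carries one G>C difference

index = build_kmer_index(REFERENCE, k=8)
hit = map_read(READ, REFERENCE, index, k=8)
print(hit)  # (13, 1)
```

Real aligners also handle gaps, both strands, and report a mapping quality that reflects how much better the best placement is than the alternatives; none of that is modeled here.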

Page 5: Ngs part iii 2013

Variant Calling

Determine if there is enough statistical support to call a variant

Reference sequence:

ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC

Aligned reads:

             AACTAGACTAGGATCGTCCTAGATAGTCTCG
             AACTAGACTAGGATCGTCCTACATAGTCTCG
             AACTAGACTAGGATCGTCCTACATAGTCTCG
                        GATCGTCCTAGATAGTCTCGATAGCTCGAT
                        GATCGTCCTAGATAGTCTCGATAGCTCGAT
                        GATCGTCCTAGATAGTCTCGATAGCTCGAT

Multiple factors are considered in calling variants

• Number of reads with the variant
• Mapping qualities of the reads
• Base qualities at the variant position
• Strand bias (variant is seen in only one of the strands)

Variant Calling Software

• GATK Unified Genotyper, Torrent Variant Caller, SAMtools, MuTect, …
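As a rough illustration of the first factor (number of reads with the variant), the sketch below builds a pileup from the six reads on this slide and reports positions where a non-reference base is seen. The 0-based read offsets were obtained by aligning the slide's reads to its reference by eye, and the thresholds `min_depth` and `min_fraction` are arbitrary assumptions. Production callers such as those listed above use statistical models that also weigh mapping quality, base quality, and strand bias.

```python
from collections import Counter

REFERENCE = "ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC"

# (0-based start on the reference, read sequence)
ALIGNED_READS = [
    (13, "AACTAGACTAGGATCGTCCTAGATAGTCTCG"),
    (13, "AACTAGACTAGGATCGTCCTACATAGTCTCG"),
    (13, "AACTAGACTAGGATCGTCCTACATAGTCTCG"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
]


def pileup(reads, ref_length):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in range(ref_length)]
    for start, seq in reads:
        for i, base in enumerate(seq):
            columns[start + i][base] += 1
    return columns


def candidate_variants(reference, columns, min_depth=4, min_fraction=0.2):
    """List (pos, ref, alt, alt_reads, depth) where an alternate base is frequent enough."""
    calls = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, n in counts.items():
            if base != reference[pos] and n / depth >= min_fraction:
                calls.append((pos, reference[pos], base, n, depth))
    return calls


calls = candidate_variants(REFERENCE, pileup(ALIGNED_READS, len(REFERENCE)))
print(calls)  # [(34, 'G', 'C', 2, 6)]
```

Here two of the six reads covering position 34 (0-based) carry `C` instead of the reference `G`, an observed variant frequency of about 33%.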

Page 6: Ngs part iii 2013

Variant Representation

VCF – Variant Call Format

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Header lines:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=OND,Number=1,Type=Float,Description="Overall non-diploid ratio (alleles/(alleles+non-alleles))">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>

Column labels:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample

Variant calls:

chr1 11181327 rs11121691 C T 100.0 PASS DP=1000;MQ=87.67 GT:AD:DP 0/1:146,45:191
chr1 11190646 rs2275527 G A 100.0 PASS DP=1000;MQ=67.38 GT:AD:DP 0/1:462,121:583
chr1 11205058 rs1057079 C T 100.0 PASS DP=1000;MQ=79.57 GT:AD:DP 0/1:49,143:192
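A short Python sketch of how one of the variant-call lines above can be taken apart. It assumes a single-sample record with one alternate allele and splits on any whitespace (real VCF files are tab-delimited); `parse_vcf_record` and the keys of the returned dictionary are names made up for this illustration, not part of any library.

```python
def parse_vcf_record(line):
    """Split one single-sample VCF data line into a dictionary of typed fields."""
    chrom, pos, variant_id, ref, alt, qual, flt, info, fmt, sample = line.split()[:10]
    # INFO is a ';'-separated list of KEY=VALUE pairs (flags without '=' are skipped here)
    info_fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # FORMAT names the ':'-separated values in the sample column
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    # AD holds read counts per allele: reference first, then the alternate allele
    allele_depths = [int(x) for x in sample_fields["AD"].split(",")]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "filter": flt,
        "info": info_fields,
        "genotype": sample_fields["GT"],
        "allele_depths": allele_depths,
        "alt_fraction": allele_depths[1] / sum(allele_depths),
    }


record = parse_vcf_record(
    "chr1 11181327 rs11121691 C T 100.0 PASS DP=1000;MQ=87.67 GT:AD:DP 0/1:146,45:191"
)
print(record["genotype"], record["allele_depths"], round(record["alt_fraction"], 3))
# 0/1 [146, 45] 0.236
```

For anything beyond a quick look, an established VCF parser is a safer choice than hand-written splitting, since the format allows multiple alternate alleles, multiple samples, and missing values.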

Page 7: Ngs part iii 2013

Variant Annotation

Chrom  Pos        ID           Ref  Alt  Gene Name  Mutation type  Codon Change  AA Change  Filtered Coverage  Allele Frequency   Variant Frequency  snpEff Effect
chr1   11181327   rs11121691   C    T    MTOR       SNP            c.6909C>T     p.L2303    1,924              C=0.761 T=0.239    0.239              SYNONYMOUS_CODING
chr1   11190646   rs2275527    G    A    MTOR       SNP            c.5553G>A     p.S1851    5,842              G=0.791 A=0.208    0.208              SYNONYMOUS_CODING
chr1   11205058   rs1057079    C    T    MTOR       SNP            c.4731C>T     p.A1577    1,928              C=0.254 T=0.746    0.746              SYNONYMOUS_CODING
chr1   11288758   rs1064261    G    A    MTOR       SNP            c.2997G>A     p.N999     5,186              G=0.212 A=0.788    0.788              SYNONYMOUS_CODING
chr1   11300344   rs191073707  C    T    MTOR       SNP            -             -          210                C=0.924 T=0.076    0.076              INTRON
chr1   11301714   rs1135172    A    G    MTOR       SNP            c.1437A>G     p.D479     3,965              A=0.248 G=0.752    0.752              SYNONYMOUS_CODING
chr1   11322628   rs2295080    G    T    MTOR       SNP            -             -          339                G=0.239 T=0.755    0.755              UPSTREAM
chr1   186641626  rs2853805    G    A    PTGS2      SNP            -             -          97                 G=0.0 A=1.0        1                  UTR_3_PRIME
chr1   186642429  rs2206593    A    G    PTGS2      SNP            -             -          3,552              A=0.167 G=0.833    0.833              UTR_3_PRIME
chr1   186643058  rs5275       A    G    PTGS2      SNP            -             -          237                A=0.759 G=0.241    0.241              UTR_3_PRIME
chr1   186645927  rs2066826    C    T    PTGS2      SNP            -             -          209                C=0.88 T=0.12      0.12               INTRON
chr2   29415792   rs1728828    G    A    ALK        SNP            -             -          2,520              G=0.0 A=1.0        1                  UTR_3_PRIME
chr2   29416366   rs1881421    G    C    ALK        SNP            c.4587G>C     p.D1529E   4,361              G=0.907 C=0.093    0.093              NON_SYNONYMOUS_CODING
chr2   29416481   rs1881420    T    C    ALK        SNP            c.4472T>C     p.K1491R   3,061              T=0.954 C=0.045    0.045              NON_SYNONYMOUS_CODING
chr2   29416572   rs1670283    T    C    ALK        SNP            c.4381T>C     p.I1461V   5,834              T=0.0 C=0.999      0.999              NON_SYNONYMOUS_CODING
chr2   29419591   rs1670284    G    T    ALK        SNP            -             -          739                G=0.093 T=0.907    0.907              INTRON
chr2   29445458   rs3795850    G    T    ALK        SNP            c.3375G>T     p.G1125    1,776              G=0.917 T=0.082    0.082              SYNONYMOUS_CODING
chr2   29446184   rs2276550    C    G    ALK        SNP            -             -          475                C=0.895 G=0.105    0.105              INTRON

Column notes:

• ID: dbSNP/COSMIC ID
• Codon Change, AA Change: actual change and position within the codon or amino acid sequence
• snpEff Effect: effect of the variant on protein coding

• SIFT score
  • Predicts the deleterious effect of an amino acid change based on how conserved the sequence is among related species
• PolyPhen score
  • Predicts the impact of the variant on protein structure

Page 8: Ngs part iii 2013

GeneRead DNAseq Gene Panel: Targeted Sequencing

• What is targeted sequencing?
  • Sequencing a subset of regions in the whole genome
• Why do we need targeted sequencing?
  • Not all regions in the genome are of interest or relevant to a specific study
  • Exome sequencing: sequencing most of the exonic regions of the genome (exome). Protein-coding regions constitute less than 2% of the entire genome
  • Focused panel/hot spot sequencing: focused on the genes or regions of interest
• What are the advantages of focused panel sequencing?
  • More coverage per sample, more sensitive mutation detection
  • More samples per run, lower cost per sample

Page 9: Ngs part iii 2013

Target Enrichment - Methodology

• Multiplex PCR
  • Small DNA input (< 100 ng)
  • Short processing time (several hours)
  • Relatively small throughput (kb to Mb regions)

Workflow: Sample preparation (DNA isolation) → PCR target enrichment (2 hours) → Library construction → Sequencing → Data analysis

Page 10: Ngs part iii 2013

Variants Identifiable through Multiplex PCR

• SNPs – single nucleotide polymorphisms
• Indels
  • Indels < 20 bp in length
• Variants not callable
  • Structural variants
    • Large indels
    • Inversions
    • Copy Number Variants (CNVs)

[Figure: examples of a large insertion, an inversion, and a CNV]

Page 11: Ngs part iii 2013

GeneRead DNAseq Gene Panel

• Multiplex PCR technology-based targeted enrichment for DNA sequencing
• Covers all human exons (coding region + UTR)
• Division of gene primer sets into 4 tubes; up to 1200-plex in each tube

Page 12: Ngs part iii 2013

GeneRead DNAseq Gene Panel

Genes Involved in Disease

Genes with High Relevance

• Comprehensive Cancer Panel (124 genes)
• Disease Focused Gene Panels (20 genes)
  • Breast cancer
  • Colon cancer
  • Gastric cancer
  • Leukemia
  • Liver cancer
  • Lung cancer
  • Ovarian cancer
  • Prostate cancer

Focus on your Disease of Interest

Page 13: Ngs part iii 2013

GeneRead DNAseq Custom Panel

Page 14: Ngs part iii 2013

Data Analysis for Targeted Sequencing

GeneRead data analysis workflow: Read Mapping → Primer Trimming → Variant Calling → Variant Annotation

• Read mapping
  • Identify the possible position of the read within the reference
  • Align the read sequence to reference sequences
• Primer trimming
  • Remove primer sequences from the reads
• Variant calling
  • Identify differences between the reference and reads
• Variant filtering and annotation
  • Functional information about the variant

Page 15: Ngs part iii 2013

Reads from Targeted Sequencing

Typical NGS raw read from targeted sequencing:

5’- Adapter | Barcode | Primer | Insert sequence | Primer | Adapter -3’

After removal of adapters and de-multiplexing:

5’- Primer | Insert sequence | Primer -3’

Read length can vary: only part of the insert or the 3’ primer may be present
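The adapter removal and de-multiplexing step can be sketched in a few lines of Python. Everything specific here is hypothetical: the adapter and barcode sequences are invented, exact matching is assumed (real tools tolerate sequencing errors), and the sequencer's own software normally performs this step. The sketch also covers the case noted above where the read ends before the 3’ adapter is reached.

```python
def demultiplex(read, adapter_5p, barcodes, adapter_3p):
    """Return (sample_name, trimmed_read) or None if no barcode matches.

    The trimmed read still contains the primers; only the 5' adapter, the
    barcode, and the 3' adapter (when present) are removed.
    """
    if not read.startswith(adapter_5p):
        return None
    rest = read[len(adapter_5p):]
    for sample, barcode in barcodes.items():
        if rest.startswith(barcode):
            trimmed = rest[len(barcode):]
            cut = trimmed.find(adapter_3p)
            if cut != -1:  # short reads may end before the 3' adapter
                trimmed = trimmed[:cut]
            return sample, trimmed
    return None


# Hypothetical sequences, for illustration only
ADAPTER_5P = "AAGG"
ADAPTER_3P = "CCTT"
BARCODES = {"sample_1": "ACGT", "sample_2": "TGCA"}

full_read = ADAPTER_5P + "TGCA" + "GATTACAGATTACA" + ADAPTER_3P
short_read = ADAPTER_5P + "ACGT" + "GATTACA"

print(demultiplex(full_read, ADAPTER_5P, BARCODES, ADAPTER_3P))   # ('sample_2', 'GATTACAGATTACA')
print(demultiplex(short_read, ADAPTER_5P, BARCODES, ADAPTER_3P))  # ('sample_1', 'GATTACA')
```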

Page 16: Ngs part iii 2013

Read Mapping

Align reads to the reference genome

[Figure: reads aligned to the reference sequence, grouped by Amplicon 1 and Amplicon 2]

Page 17: Ngs part iii 2013

Primer Trimming

Primer sequences must be trimmed for accurate variant calling

[Figure: reads from Amplicon 1 and Amplicon 2 aligned to the reference sequence; four reads carry the variant `C`]

Frequency of `C` without primer trimming = 4/13 = 31%

Frequency of `C` after primer trimming = 4/7 = 57%
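The arithmetic on this slide can be reproduced with a small Python sketch. The breakdown of the 13 reads is an assumption made to match the slide's numbers: 7 reads in which the position lies in the insert (4 with `C`, 3 with the reference base) and 6 reads from an overlapping amplicon in which the position lies inside the primer, so those reads show the primer's reference base regardless of the sample. The reference base `T` is also invented for the example.

```python
def allele_frequency(observations, allele, trim_primers):
    """Frequency of `allele` among bases observed at one position.

    observations: list of (base, in_primer) tuples, one per read.
    When trim_primers is True, bases that come from primer sequence are ignored.
    """
    kept = [base for base, in_primer in observations
            if not (trim_primers and in_primer)]
    return kept.count(allele) / len(kept)


# Assumed composition of the 13 reads covering the variant position
observations = (
    [("C", False)] * 4    # insert-derived reads carrying the variant
    + [("T", False)] * 3  # insert-derived reads with the (hypothetical) reference base
    + [("T", True)] * 6   # primer-derived bases from the overlapping amplicon
)

untrimmed = allele_frequency(observations, "C", trim_primers=False)
trimmed = allele_frequency(observations, "C", trim_primers=True)
print(f"{untrimmed:.0%} {trimmed:.0%}")  # 31% 57%
```

Primer-derived bases dilute the variant because they reflect the synthesized primer, not the sample, which is why the frequency nearly doubles once they are removed.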

Page 18: Ngs part iii 2013

GeneRead Variant Calling Overview

[Figure: workflow diagram; the components recoverable from it are listed below]

Inputs
• BAM file (w/ flow space info)
• BED file for amplicons used
• Run parameters
• Seq. platform: IonTorrent or MiSeq

Variant calling and filtering
• GATK IndelRealigner
• GATK Base Quality Recalibrator
• Torrent Variant Caller (TVC)
• GATK Unified Genotyper
• Additional filtering (based on frequency and coverage)
• GATK Variant Annotator
• GATK Variant Filtration

Annotation
• snpEff (basic annotation)
• SnpSift (links to dbSNP, Cosmic and computation of Sift scores, etc.)
• Databases: dbSNP, Cosmic, dbNSFP
• VCF to excel: variants in excel format

Intermediate files: bam between the realignment and recalibration steps, vcf between the calling, filtering, and annotation steps.

Page 19: Ngs part iii 2013

Indel realignment

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5):491-8. PMID: 21478889

• Eliminates some false-positive variant calls around indels
  • Read aligners cannot eliminate these alignment errors since they align reads independently
  • Multiple sequence alignment can identify these errors and correct them

Page 20: Ngs part iii 2013

Base Quality Recalibration

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5):491-8. PMID: 21478889

• Eliminates sequencer-specific biases
  • Lane-specific/sample-specific biases
  • Instrument-specific under-reporting/over-reporting of quality scores
  • Systematic errors based on read position
  • Di-nucleotide-specific sequencing errors
• Recalibration leads to improved variant calls
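The idea behind recalibration can be shown with the Phred relationship Q = -10 log10(p): group bases by their reported quality (and, in practice, by covariates such as lane, read position, and dinucleotide context), count how often they disagree with the reference at positions not known to be variant, and convert the observed error rate back to a quality score. The Python sketch below is a simplified illustration, not GATK's implementation; the +1/+2 smoothing and the example counts are assumptions.

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def empirical_quality(mismatches, observations):
    """Phred-scaled empirical error rate for one bin of bases.

    The +1/+2 smoothing (an assumption of this sketch) keeps the estimate
    finite when a bin has no mismatches.
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


# Hypothetical bin: bases reported as Q30 (expected error rate 0.001)
reported_q = 30
observations = 10_000
mismatches = 99  # roughly ten times more errors than Q30 predicts

recalibrated_q = empirical_quality(mismatches, observations)
print(phred_to_error(reported_q), round(recalibrated_q, 1))  # 0.001 20.0
```

In this made-up bin the instrument over-reports quality by about 10 Phred units, so a variant caller that trusted the reported Q30 would give these bases too much weight.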

Page 21: Ngs part iii 2013

Variant Filtration

• Variant Frequency
  • Somatic mode
    • SNPs with frequency < 4% and indels with frequency < 20%
  • Germline mode
    • SNPs with frequency < 20% and indels with frequency < 25%
• Strand Bias
  • SNPs with FS ≤ 60
  • Indels with FS ≤ 200
• Mapping Quality
  • SNPs with MQ ≤ 40.0
• Haplotype Score
  • SNPs with HaplotypeScore ≤ 13.0
  • Not applicable for pooled samples

Strand Bias: variants that are present in reads from only one of the two strands
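A sketch of how these thresholds could be applied in Python. The slide does not say, for every criterion, whether the listed condition marks variants that are kept or removed, so the directions used here are an assumption: a variant is flagged when its frequency is below the mode's minimum, when FS is above the limit, when MQ is below 40, or when HaplotypeScore is above 13, which follows the usual sense of GATK hard filters. The function name, dictionary keys, and filter labels are invented for the example.

```python
MIN_FREQUENCY = {
    "somatic": {"SNP": 0.04, "INDEL": 0.20},
    "germline": {"SNP": 0.20, "INDEL": 0.25},
}
MAX_FS = {"SNP": 60.0, "INDEL": 200.0}
MIN_MQ_SNP = 40.0
MAX_HAPLOTYPE_SCORE_SNP = 13.0


def filter_reasons(variant, mode="somatic", pooled=False):
    """Return the list of filters a variant fails (an empty list means it passes)."""
    kind = variant["type"]  # "SNP" or "INDEL"
    reasons = []
    if variant["frequency"] < MIN_FREQUENCY[mode][kind]:
        reasons.append("LowFrequency")
    if variant["FS"] > MAX_FS[kind]:
        reasons.append("StrandBias")
    if kind == "SNP" and variant["MQ"] < MIN_MQ_SNP:
        reasons.append("LowMappingQuality")
    # HaplotypeScore is skipped for pooled samples, as noted on the slide
    if kind == "SNP" and not pooled and variant["HaplotypeScore"] > MAX_HAPLOTYPE_SCORE_SNP:
        reasons.append("HaplotypeScore")
    return reasons


# Hypothetical SNP with a 7.6% variant frequency
snp = {"type": "SNP", "frequency": 0.076, "FS": 2.1, "MQ": 79.57, "HaplotypeScore": 1.0}
print(filter_reasons(snp, mode="somatic"))   # []
print(filter_reasons(snp, mode="germline"))  # ['LowFrequency']
```

The same variant passes in somatic mode and is removed in germline mode, which reflects the expectation that a germline variant appears at roughly 50% or 100% while a somatic variant in a heterogeneous sample can be much rarer.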

Page 22: Ngs part iii 2013

Specificity Analysis

• Specificity: the percentage of sequences that map to the intended target regions of interest (ROI)

  specificity = number of on-target reads / total number of reads

[Figure: NGS reads aligned to the reference sequence; reads overlapping ROI 1 or ROI 2 are on-target, the remaining reads are off-target]
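A minimal Python sketch of the specificity calculation. It assumes that a read counts as on-target when its alignment overlaps any region of interest by at least one base, and that intervals are half-open (start inclusive, end exclusive); the coordinates are invented.

```python
def overlaps(read, region):
    """True if two half-open intervals (start, end) share at least one base."""
    return read[0] < region[1] and region[0] < read[1]


def specificity(reads, regions_of_interest):
    """Fraction of reads that overlap at least one region of interest."""
    on_target = sum(
        1 for read in reads
        if any(overlaps(read, roi) for roi in regions_of_interest)
    )
    return on_target / len(reads)


# Hypothetical coordinates on one chromosome
rois = [(100, 200), (400, 500)]
reads = [(90, 150), (120, 180), (210, 260), (420, 470), (300, 350)]

print(specificity(reads, rois))  # 0.6
```

Three of the five reads overlap ROI 1 or ROI 2, giving a specificity of 60%.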

Page 23: Ngs part iii 2013

Sequencing Depth

[Figure: NGS reads aligned to the reference sequence, with regions of coverage depth = 4, coverage depth = 2, and coverage depth = 3]

• Coverage depth (or depth of coverage): how many times each base has been sequenced
  • Unlike Sanger sequencing, in which each sample is sequenced 1-3 times to be confident of its nucleotide identity, NGS generally needs to cover each position many times to make a confident base call, due to its relatively high error rate (0.1 - 1% vs 0.001 – 0.01%)
• Increasing coverage depth also helps to identify low-frequency mutations in heterogeneous samples such as cancer samples
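Coverage depth can be computed directly from read alignments, as in this small Python sketch. The read coordinates are invented and intervals are assumed to be half-open (start inclusive, end exclusive).

```python
def coverage_depth(reads, region_start, region_end):
    """Per-base depth over [region_start, region_end) from (start, end) alignments."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


# Hypothetical alignments over a 10-base region
reads = [(0, 6), (2, 8), (4, 10), (5, 10)]
depth = coverage_depth(reads, 0, 10)

print(depth)                    # [1, 1, 2, 2, 3, 4, 3, 3, 2, 2]
print(sum(depth) / len(depth))  # 2.3
```

The link to low-frequency mutations is simple arithmetic: a variant present in 5% of the DNA is expected in only 1 read at depth 20, which is hard to tell apart from a sequencing error, but in about 50 reads at depth 1,000.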

Page 24: Ngs part iii 2013

NGS Data Analysis: Uniformity

• Coverage uniformity: a measure of the evenness of coverage depth across target positions

[Figure: NGS reads aligned to the reference sequence, with regions of coverage depth = 10, coverage depth = 2, and coverage depth = 3]
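The slide does not give a formula for uniformity. One common convention, used here as an assumption, is the fraction of target positions whose depth reaches at least a chosen fraction of the mean depth. The Python sketch below uses per-base depths loosely modeled on the figure (regions at depth 10, 2, and 3); the default cutoff of 0.2 is also an assumption.

```python
def uniformity(depths, fraction_of_mean=0.2):
    """Fraction of positions with depth >= fraction_of_mean * mean depth."""
    mean_depth = sum(depths) / len(depths)
    threshold = fraction_of_mean * mean_depth
    return sum(1 for d in depths if d >= threshold) / len(depths)


# Hypothetical per-base depths: three bases at 10x, two at 2x, three at 3x
depths = [10, 10, 10, 2, 2, 3, 3, 3]

print(uniformity(depths, fraction_of_mean=0.2))  # 1.0
print(uniformity(depths, fraction_of_mean=0.5))  # 0.75
```

With a mean depth of about 5.4, every position clears the lenient 0.2 cutoff, while the stricter 0.5 cutoff exposes the two under-covered bases at depth 2.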

Page 25: Ngs part iii 2013

GeneRead Data Analysis Web Portal

• FREE, complete, and easy-to-use data analysis with web-based software

Page 26: Ngs part iii 2013

GeneRead Data Analysis Web Portal

Page 27: Ngs part iii 2013

Page 28: Ngs part iii 2013

Page 29: Ngs part iii 2013

Note: Runtimes depend on the number of reads in the input files. Typical runtimes are 20-60 minutes.

Page 30: Ngs part iii 2013

Page 31: Ngs part iii 2013

Page 32: Ngs part iii 2013

Page 33: Ngs part iii 2013

Summary

Run Summary
• Specificity
• Coverage
• Uniformity
• Numbers of SNPs and Indels

Summary By Gene
• Specificity
• Coverage
• Uniformity
• Numbers of SNPs and Indels

Page 34: Ngs part iii 2013

Features of Variant Report

• SNP detection
• Indel detection

Page 35: Ngs part iii 2013

QIAGEN’s GeneRead DNAseq Gene Panel System

• Focused:
  • Biologically relevant content selection enables deep sequencing on relevant genes and identification of rare mutations
• Flexible:
  • Mix and match any gene of interest
• NGS platform independent:
  • Functionally validated for PGM, MiSeq/HiSeq
• Integrated controls:
  • Enabling quality control of prepared library before sequencing
• Free, complete, and easy-to-use data analysis tool

FOCUS ON YOUR RELEVANT GENES

Page 36: Ngs part iii 2013

Upcoming webinars

Next Generation Sequencing and its role in cancer biology

Webinar 3: Next-generation sequencing data analysis for genetic profiling
Date: April 18, 2013
Speaker: Ravi Vijaya Satya, Ph.D.

Webinar 1: Next-generation sequencing, an introduction to technology and applications
Date: May 3, 2013
Speaker: Quan Peng, Ph.D.

Webinar 2: Next-generation sequencing for cancer research
Date: May 10, 2013
Speaker: Vikram Devgan, Ph.D., MBA