NGS II – Illumina Sequencing
Genetics for Dummies 2017
Robert Kraaij
Department of Internal Medicine
Overview
• Data Analysis
• Applications
• Example: Exome Sequencing
Things to be addressed
NGS: many short reads that might contain errors
data analysis will handle these reads and errors
Illumina Sequencing
cBot
flowcell
bridge PCR
HiSeq2000
Per Cycle Imaging
G A T C
Per Cycle Imaging
G
good quality
G
poor quality
Per Cycle Base Calling
Phred Score    Incorrect base    Accuracy
10             1 in 10           90 %
20             1 in 100          99 %
30             1 in 1000         99.9 %
40             1 in 10000        99.99 %
50             1 in 100000       99.999 %
Phred scores 0 to 93 → ASCII characters 33 to 126 = a single character per base
Quality Scoring
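The table above follows from the Phred definition Q = -10 · log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion and of the single-character encoding, assuming the common Phred+33 offset implied by "ASCII 33 to 126" (the function names are illustrative only):

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 encoding (score 0 is ASCII 33, '!')


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def phred_to_char(q):
    """Encode a Phred score (0 to 93) as one printable ASCII character."""
    if not 0 <= q <= 93:
        raise ValueError("Phred score must be between 0 and 93")
    return chr(q + PHRED_OFFSET)


def char_to_phred(c):
    """Decode one quality character back to its Phred score."""
    return ord(c) - PHRED_OFFSET


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_probability(q)
        print(q, phred_to_char(q), f"1 in {round(1 / p)}")
```

Because every score maps to exactly one character, the quality string of a read has the same length as its sequence.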
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>
FASTQ File
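The four-line record above (identifier, sequence, separator, quality string) can be read with a few lines of code. This is a sketch only, assuming well-formed four-line records and Phred+33 quality characters; `parse_fastq` is an illustrative name, not part of any pipeline mentioned in these slides.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from an iterable of FASTQ lines.

    Assumes each record occupies exactly four lines and that qualities
    are Phred+33 encoded.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, [ord(c) - 33 for c in quality]


EXAMPLE_RECORD = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC",
    "+SEQ_ID",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>",
]

if __name__ == "__main__":
    for read_id, sequence, scores in parse_fastq(EXAMPLE_RECORD):
        print(read_id, len(sequence), min(scores), max(scores))
```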
T A C G G T A C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
Alignment or Mapping of Reads
REFERENCE GENOME (hg19)
chromosome + position + strand
sample.bam
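To make "chromosome + position + strand" concrete, here is a toy sketch that places the example read on the example reference by exact string matching on both strands. Real aligners such as BWA use an index and tolerate mismatches and gaps, so this is not how they work internally; the chromosome name `chrDemo` is hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def map_read(read, reference, chrom="chrDemo"):
    """Return (chromosome, 1-based position, strand) or None if unmapped.

    Toy model: exact matches only, first hit wins.
    """
    pos = reference.find(read)
    if pos != -1:
        return (chrom, pos + 1, "+")
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return (chrom, pos + 1, "-")
    return None


REFERENCE = "GATTACGGTACTTGCATAGCT"
READ = "TACGGTACTTGCATA"

if __name__ == "__main__":
    print(map_read(READ, REFERENCE))
    print(map_read(reverse_complement(READ), REFERENCE))
```

The second call shows why the strand is recorded: a read sequenced from the opposite strand maps to the same position, but only after reverse complementing.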
Run QC and filtering
sample.bam
sample.bam
• both reads
• quality scores
• chromosome
• position
• quality flag
• duplicate flag
• off target flag
Sorted BAM file
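In a BAM file the per-read flags are stored as one integer bit field, defined in the SAM specification. The sketch below decodes that field and applies a simple filter. The bit values are those of the SAM specification; the filter itself is only an illustration. "Off target" is not one of the standard SAM flag bits, so it would have to be derived separately, for example by comparing read positions with the capture target.

```python
# Flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes_filter(flag):
    """Illustrative filter: drop unmapped, QC-failed and duplicate reads."""
    return not flag & (0x4 | 0x200 | 0x400)


if __name__ == "__main__":
    for flag in (99, 1123):
        print(flag, decode_flag(flag), passes_filter(flag))
```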
Coverage
T A C G G T A C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T A C T T G C A T A G
G A T T A C G G T A C T T G C
G G T A C T T G C A T A G C T
T T A C G G T A C T T G C A T
5x coverage
Mean Coverage
mean coverage = bases on target / size of target
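A small worked example of this ratio, using the five 15-base reads from the coverage slide placed on the 21-base reference. Reads are given as hypothetical (start, end) intervals, 0-based with the end excluded; only the part of a read that overlaps the target counts as "on target".

```python
def bases_on_target(reads, target_start, target_end):
    """Sum the read bases that overlap the target interval."""
    total = 0
    for start, end in reads:
        overlap = min(end, target_end) - max(start, target_start)
        if overlap > 0:
            total += overlap
    return total


def mean_coverage(reads, target_start, target_end):
    """Mean coverage = bases on target / size of target."""
    target_size = target_end - target_start
    return bases_on_target(reads, target_start, target_end) / target_size


# Positions of the five example reads on the 21-base reference.
READS = [(3, 18), (4, 19), (0, 15), (6, 21), (2, 17)]

if __name__ == "__main__":
    print(mean_coverage(READS, 0, 21))   # whole reference: 75 / 21
    print(mean_coverage(READS, 6, 15))   # central 9 bases: 45 / 9
```

The central region reaches 5x, but the mean over the whole reference is lower because the ends are covered by fewer reads.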
% of Bases Above a Certain Threshold
T A C G G T A C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T A C T T G C A T A G
G A T T A C G G T A C T T G C
G G T A C T T G C A T A G C T
T T A C G G T A C T T G C A T
5x 5x 4x 1x
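Mean coverage hides uneven coverage, which is why the percentage of bases at or above a threshold is reported as well. A sketch using the same five example reads, again as hypothetical 0-based (start, end) intervals with the end excluded:

```python
def per_base_depth(reads, target_length):
    """Count how many reads cover each position of the target."""
    depth = [0] * target_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, target_length)):
            depth[pos] += 1
    return depth


def percent_at_or_above(depth, threshold):
    """Percentage of target bases covered by at least `threshold` reads."""
    return 100 * sum(d >= threshold for d in depth) / len(depth)


READS = [(3, 18), (4, 19), (0, 15), (6, 21), (2, 17)]

if __name__ == "__main__":
    depth = per_base_depth(READS, 21)
    print(depth)
    for threshold in (1, 4, 5):
        print(threshold, round(percent_at_or_above(depth, threshold), 1))
```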
Variant Calling
T A C G G T G C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T G C T T G C A T A G
G A T T A C G G T G C T T G C
G G T G C T T G C A T A G C T
T T A C G G T G C T T G C A T
G = homozygous alternative
- G A T T A C G G T G C
C G G T G C T T G C A T A G C
T G C A T A G C T -
A T T A C G G T G C T T G C A
Variant Calling
T A C G G T G C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T G C T T G C A T A G
G A T T A C G G T A C T T G C
G G T G C T T G C A T A G C T
T T A C G G T A C T T G C A T
A/G = heterozygous
- G A T T A C G G T A C
C G G T G C T T G C A T A G C
T G C A T A G C T -
A T T A C G G T G C T T G C A
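The two variant calling examples above can be reproduced with a naive caller that stacks the reads on the reference and looks at the fraction of non-reference bases at one position. This is a sketch only: the 0.2 and 0.8 cut-offs are arbitrary illustrative values, and real callers such as the GATK HaplotypeCaller use statistical models and base qualities instead of fixed thresholds.

```python
from collections import Counter


def pileup_bases(reads, ref_pos):
    """Collect the base each read shows at reference position ref_pos.

    reads: list of (0-based start on reference, read sequence).
    """
    bases = []
    for start, seq in reads:
        offset = ref_pos - start
        if 0 <= offset < len(seq):
            bases.append(seq[offset])
    return bases


def call_genotype(bases, ref_base, low=0.2, high=0.8):
    """Call a genotype from the alternative allele fraction (toy thresholds)."""
    if not bases:
        return None
    counts = Counter(bases)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return f"{ref_base}/{ref_base}"
    alt_base = max(alt_counts, key=alt_counts.get)
    alt_fraction = alt_counts[alt_base] / len(bases)
    if alt_fraction < low:
        return f"{ref_base}/{ref_base}"
    if alt_fraction >= high:
        return f"{alt_base}/{alt_base}"
    return f"{ref_base}/{alt_base}"


REFERENCE = "GATTACGGTACTTGCATAGCT"
VARIANT_POS = 9  # 0-based position of the A in ...GGTACTT...

HOM_ALT_READS = [
    (3, "TACGGTGCTTGCATA"), (4, "ACGGTGCTTGCATAG"), (0, "GATTACGGTGCTTGC"),
    (6, "GGTGCTTGCATAGCT"), (2, "TTACGGTGCTTGCAT"),
]
HET_READS = [
    (3, "TACGGTGCTTGCATA"), (4, "ACGGTGCTTGCATAG"), (0, "GATTACGGTACTTGC"),
    (6, "GGTGCTTGCATAGCT"), (2, "TTACGGTACTTGCAT"),
]

if __name__ == "__main__":
    for reads in (HOM_ALT_READS, HET_READS):
        bases = pileup_bases(reads, VARIANT_POS)
        print(bases, call_genotype(bases, REFERENCE[VARIANT_POS]))
```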
Variant Calling
T A C G G T G C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T G C T T G C A T A G
G A T T A C G G T A C T T G C
A/G = heterozygous?
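With only three reads (two G, one A) the call is uncertain. One way to reason about it is to compare how likely the observed read counts are under each genotype. The sketch below uses a deliberately simplified binomial model: each read of a heterozygote shows either allele with probability 0.5, and a sequencing error (an assumed rate of 1 in 100) always turns one allele into the other. These assumptions are for illustration and are not taken from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def genotype_likelihoods(alt_reads, depth, error_rate=0.01):
    """Likelihood of seeing alt_reads alternative bases among depth reads."""
    return {
        "hom_ref": binomial_pmf(alt_reads, depth, error_rate),
        "het": binomial_pmf(alt_reads, depth, 0.5),
        "hom_alt": binomial_pmf(alt_reads, depth, 1 - error_rate),
    }


if __name__ == "__main__":
    # Two G reads (alternative) and one A read (reference), as on the slide.
    for genotype, likelihood in genotype_likelihoods(2, 3).items():
        print(genotype, round(likelihood, 6))
```

Under this model heterozygous is the most likely genotype, but the homozygous alternative explanation (one erroneous read) is not negligible at such low coverage; more reads separate the two much more clearly.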
Variant Calling
T A C G G T G C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T G C T T G C A T A G
G A T T A C G G T A C T T G C
G
sequencing quality
good poor
sample.vcf
• chromosome
• position
• quality
• annotations
VCF File
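Each data line of a VCF file is tab-separated and starts with eight fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. A minimal sketch of reading one such line follows; the example line, including the chromosome name `chrDemo` and its annotation values, is made up for illustration.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    annotations = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        annotations[key] = value if sep else True  # flags carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": annotations,
    }


# Hypothetical record for the A>G variant of the earlier example.
EXAMPLE_LINE = "chrDemo\t10\t.\tA\tG\t50\tPASS\tDP=5;DB"

if __name__ == "__main__":
    print(parse_vcf_line(EXAMPLE_LINE))
```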
Variant Calling
T A C G G T G C T T G C A T A
G A T T A C G G T A C T T G C A T A G C T
A C G G T G C T T G C A T A G T A G
G A T T A C G G T A C T T G C
G G T G C T T G C A T A G C T
- G A T T A C G G T A C T T G C A T
deletion = heterozygous
- G A T T A C G G T A C
C G G T G C T T G C A T A G C
T G C A T A G C T -
- G A T T A C G G T G C T T G C A
Paired-End Sequencing
2 x 100 bp
Variant Calling: Mate Pairs
normal
400 bp
deletion
800 bp
insertion
200 bp
Variant Calling: Mate Pairs
normal
400 bp
translocation
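The mate pair logic of the last two slides can be written as a simple rule: compare the distance at which the two mates map on the reference with the expected 400 bp. Mates that map further apart suggest sequence missing in the sample (deletion), mates that map closer together suggest extra sequence in the sample (insertion), and mates on different chromosomes suggest a translocation. The tolerance of 100 bp below is an arbitrary illustrative value; in practice it would come from the observed insert size distribution, and a single pair is only a candidate, not a call.

```python
def classify_pair(chrom1, pos1, chrom2, pos2, expected=400, tolerance=100):
    """Classify one mate pair by where its two mates map on the reference."""
    if chrom1 != chrom2:
        return "translocation"
    distance = abs(pos2 - pos1)
    if distance > expected + tolerance:
        return "deletion"
    if distance < expected - tolerance:
        return "insertion"
    return "normal"


if __name__ == "__main__":
    print(classify_pair("chr1", 1000, "chr1", 1400))  # normal
    print(classify_pair("chr1", 1000, "chr1", 1800))  # deletion
    print(classify_pair("chr1", 1000, "chr1", 1200))  # insertion
    print(classify_pair("chr1", 1000, "chr2", 5000))  # translocation
```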
Variant Calling: Split Reads
genome
800 bp
mRNA (cDNA)
Applications
• Re-sequencing → full genome → SNPs and indels
• Re-sequencing → mate pairs → structural variations
• Re-sequencing → regional → SNPs and indels
• Sequencing → de novo assembly
• RNAseq
• ChIPseq
• …seq
www.illumina.com
Example:
Exome Sequencing
funding by NGI-NCHA, NWO, BBMRI
n > 3,000 samples, a random set from RS-I
start May 2011; Nimblegen
part of “CHARGE-S” effort:
>5,000 exomes across 4 cohorts
Framingham, CHS, ARIC, Rotterdam Study
Expand with exome variants array?
CHARGE
Exome Sequencing
Exome vs Full Genome
exon exon exon
genome → 3 Gb
exome → ~30 Mb
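The practical consequence of these two sizes is easy to work out: the exome is about 1 % of the genome, so reaching the same mean coverage needs roughly 100 times fewer sequenced bases. A rough sketch, using the rounded sizes from the slide and ignoring off-target reads, duplicates and uneven capture (all of which increase the real requirement):

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 Gb, rounded as on the slide
EXOME_SIZE_BP = 30_000_000      # ~30 Mb, rounded as on the slide


def bases_needed(target_size_bp, mean_coverage):
    """Sequenced bases required for a given mean coverage of a target."""
    return target_size_bp * mean_coverage


if __name__ == "__main__":
    print(EXOME_SIZE_BP / GENOME_SIZE_BP)    # fraction of the genome
    print(bases_needed(EXOME_SIZE_BP, 30))   # 30x exome
    print(bases_needed(GENOME_SIZE_BP, 30))  # 30x genome
```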
Exome Sequencing Workflow
DNA isolation → Library preparation → Exome capture → Sequencing → Data analysis
Nimblegen SeqCap EZ v2 Capture
• CCDS (Sept 2009)
• miRBase (v14, Sept 2009)
• RefSeq (Jan 2010)
• 2,100,000 probes
• 30,246 coding genes
• 329,028 exons
• 710 miRNAs
• 36.5 Mb primary target
• 44.1 Mb capture target
Illumina TruSeq V3 2x100 PE Sequencing
Data analysis: BWA-GATK pipeline
Demultiplexing
• BclToFastQ (CASAVA)
• Chastity Filter
Alignment
• BWA (paired)
• SortSam, MarkDuplicates (Picard)
Processing
• Base Quality Score Recalibration, IndelRealignment (GATK)
Variant Calling
• HaplotypeCaller
• VQSR
• VarEval
Analysis
• ANNOVAR, VCFtools
• PlinkSeq, SKAT, R
• Spotfire
Sample QC and Variant QC
RSX-2 Samples were sequenced to ~54x Mean Coverage
[Plot: percentage of 44 Mb covered 10x or better, against average mean depth of coverage across the 44 Mb SeqCap exome]
Mean Depth of Coverage by Flowcell
[Plot: mean depth of coverage, against flowcell number (roughly chronological order)]
Determining Heterozygous Concordance versus 550k genotyping arrays
[Plot: heterozygous concordance, against flowcell number (roughly chronological order)]
Sample QC and Variant QC
Number of Detected SNPs per Sample by Flowcell
Flowcell Number (Roughly Chronological Order)
Heterozygous to Homozygous Ratio per Sample by Flowcell
Flowcell Number (Roughly Chronological Order)
Transition to Transversion Ratio
[Diagram: purines and pyrimidines, with transition and transversion substitutions marked]
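A transition swaps a purine for a purine (A ↔ G) or a pyrimidine for a pyrimidine (C ↔ T); a transversion swaps a purine for a pyrimidine or the other way round. The ratio used as a QC metric is the number of transition SNPs divided by the number of transversion SNPs. A minimal sketch, with a made-up list of (reference, alternative) SNP alleles:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if the substitution stays within purines or within pyrimidines."""
    if ref == alt:
        raise ValueError("reference and alternative base are identical")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Transition to transversion ratio for a list of (ref, alt) SNPs."""
    transitions = sum(is_transition(ref, alt) for ref, alt in snps)
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions


# Made-up SNPs: three transitions and two transversions.
EXAMPLE_SNPS = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]

if __name__ == "__main__":
    print(ti_tv_ratio(EXAMPLE_SNPS))
```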
Transition to Transversion Ratio per Sample by Flowcell
Flowcell Number (Roughly Chronological Order)
QC and filtering results
Things to Remember
NGS: many short reads that might contain errors
coverage indicates the number of independent reads that
cover a base → needed to analyse a genome
FASTQ file → sequence + quality scores
BAM file → aligned reads
VCF file → called variants + annotation