Variant Calling (using High-throughput Sequencing Data)
Short course v2
Tim Hughes
Downloading data
• Wiki pages for zip file
• Backup is USB stick
Humans and other multicellular organisms
A reproducing system which is:
* Self-assembling * Self-repairing * Self-operating * Self-upgrading
* Social and spiritual * Self-aware
The full information underlying this system is stored in the DNA of EVERY cell
INTRODUCTION
What is variant calling and why do it?
• What is variation? – Variation through mutation – What kind of variation occurs? SNPs, indels, structural variation
• Variant calling – Acquire data on sequence – Make an inference on whether a variant is present relative to a reference sequence
• Why perform variant calling? – Congenital disease – Case control studies
• Different types of variant calling – probe assays – microarrays – sequencing (low and high throughput)
In a perfect world – Perfect sequencing
• Perfect sequencing: – single molecule (no PCR) – full length – no deterioration of quality
• While we are waiting: – Sanger • PCR • length: 1 kb • limited number of reads • high quality – HTS (Illumina) • PCR • 100 bp PE • billions of reads • high quality, but deteriorating along read
[Figure: read length against template. Perfect sequencing: full-length read of a 249 Mbp chromosome. Sanger: full-length read of an 800-1000 bp amplicon. HTS: 100 bp single read, or 100 bp paired-end (PE) reads from a 300 bp fragment.]
A quick overview of the HTS workflow
Fragment sample >> Capture >> Sequence >> Map >> Align >> Variant call
[Figure labels: ref, sample, sample frags, bait, sample mutation, optional]
Poor alignment >> false negative (FN) micro indel + false positive (FP) SNP
Variant sites
[Figure: three pileups, each showing aligned reads under a reference base C]
Pileup 1 (6 aligned reads: C C C T T T): the common and easy case • Good mapping of reads • Good base qualities • Good depth
Pileup 2 (reads: T T T): poor depth • May not have sampled both alleles • Could be C/T or T/T
Pileup 3 (reads: C C C T T T): poor quality • of base calls • of read mapping. Can lead to false variant calls: site could be ref C/C and Ts are just base call errors
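The poor-depth case can be made concrete with a little arithmetic. The sketch below assumes an idealised model that is not stated on the slide: each read samples one of the two alleles independently with probability 0.5, and base calls are error-free. The function name is illustrative.

```python
def prob_only_one_allele_seen(depth: int) -> float:
    """Probability that every read at a heterozygous site shows the same allele.

    Idealised assumptions: reads are independent, each allele is sampled
    with probability 0.5, and there are no base-call errors.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Either all reads come from the first allele or all from the second.
    return 2 * 0.5 ** depth


for depth in (3, 6, 20):
    print(depth, prob_only_one_allele_seen(depth))
```

Under this model, three reads at a heterozygous site all show the same allele a quarter of the time, which is why three Ts on their own cannot separate C/T from T/T.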
In a bit more detail
[Figure: pipeline] FASTQ >> Mapping (BWA) >> SAM >> Housekeeping (Picard) >> Refinements (GATK) >> Refined BAM >> Variant calling (GATK) >> VCF file >> Filtration and annotation
Metrics: coverage, alignment, insert, duplication
GSA team at the Broad Institute
• A large fraction of the materials and software in this course are produced by the Genome Sequencing and Analysis Group team at the Broad Institute
• Information sources – http://www.broadinstitute.org/gsa/wiki – http://www.getsatisfaction.com/gsa
• People – Mark A. DePristo, Manager of Medical and Population Genetics Analysis – Eric Banks, Team Lead – Guillermo del Angel – Ryan Poplin – Kiran Garimella, Team Lead – Mauricio Carneiro – Chris Hartl – Khalid Shakir, Team Lead – Matthew Hanna – David Roazen
• Others at the Broad – Heng Li: samtools and bwa – Tim Fennell: picard – Alec Wysoker: picard
• And others outside the Broad – sources at bottom of slides
Overview of topics (not in chronological order)
• Software and datasets • Fastq format • Read mapping (SAM/BAM format) • IGV • Variant calling (VCF format) • Metrics reports (esp. coverage – BED format) • Alignment refinement • Base quality score recalibration • Variant annotation and filtration
• Because of circumstances (shortened course) – exercises will NOT involve computation – we will work with pre-computed results found at the central URL
DATASETS
Introduction of dataset
• reads_exomeCapt_chr5 in fastq format (reads_agilentV1_chr5)
• reference data (human_g1k_v37_chr5) – agilentV1 >> definition of capture tiles in different formats – gatkBundle >> reference data in fasta format and vcf files of known variants (dbSNP, 1000 genomes, hapmap)
• Formats >> we will return to these later
Naming and ordering of chromosomes/contigs

| | Hg18 (UCSC) | B36 (NCBI) |
|---|---|---|
| Contig prefix | chr | none |
| Mitochondrial contig | chrM | MT |
| Contig order | chrM, chr1, chr2, ..., chrX, chrY | 1, 2, ..., X, Y, MT |
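The two naming schemes differ only in prefix, mitochondrial name and ordering, so translating between them is mechanical. The following is a minimal sketch with hypothetical helper names; it assumes only the primary chromosomes (autosomes, X, Y, mitochondrial) and would need extending for unplaced contigs.

```python
def ucsc_to_ncbi(contig: str) -> str:
    """Convert a UCSC-style name (chr1, chrM) to NCBI style (1, MT)."""
    if contig == "chrM":
        return "MT"
    if contig.startswith("chr"):
        return contig[3:]
    return contig  # already without the prefix


def ncbi_sort_key(contig: str) -> int:
    """Sort key giving the NCBI order 1, 2, ..., X, Y, MT."""
    special = {"X": 1000, "Y": 1001, "MT": 1002}
    # Assumption: anything that is not X, Y or MT is a numbered autosome.
    return special[contig] if contig in special else int(contig)


ucsc_order = ["chrM", "chr1", "chr2", "chrX", "chrY"]
ncbi_order = sorted((ucsc_to_ncbi(c) for c in ucsc_order), key=ncbi_sort_key)
print(ncbi_order)
```

Renaming contigs is not enough on its own: every file used together (reference, reads, known variants) has to agree on both names and order.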
• Genome references – Fasta file: must have .fasta extension + respect naming and order – Fai file (created by samtools faidx): contig, size, location, basesPerLine → for efficient random access – Dict file (created by Picard CreateSequenceDictionary): SAM style header describing the contents of the fasta file → for names and length of original file
• ROD (reference ordered data) – GATK supports several common file formats for reading ROD data: VCF, UCSC formatted dbSNP, BED
• dbSNP files – Must also be ROD – Generated by GSA from the dbSNP db using a bit of bash, awk and a perl script: sortByRef.pl. Full details: http://www.broadinstitute.org/gsa/wiki/index.php/The_DBSNP_rod
• All of the above delivered for human as part of the GATK resource bundle – Other species may also be available – Help on generating for another species see GATK wiki or getsatisfaction.com/gsa
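To illustrate why the fai fields allow efficient random access, here is a small sketch that computes the byte offset of a base from index values. It uses a toy in-memory FASTA and hand-written index values rather than a real samtools index; the dictionary keys are illustrative, and the bytes-per-line value (bases per line plus the line terminator) is the additional column a faidx index stores next to the ones listed above.

```python
import io


def byte_offset(seq_offset: int, line_bases: int, line_width: int, pos: int) -> int:
    """Byte offset in the FASTA file of 1-based position `pos` within a contig."""
    full_lines, within_line = divmod(pos - 1, line_bases)
    return seq_offset + full_lines * line_width + within_line


# Toy FASTA: contig "5", 10 bases, 4 bases per line, "\n" line endings.
fasta = io.BytesIO(b">5\nACGT\nTTGA\nCC\n")

# Hand-written index entry for the toy file (not produced by samtools):
# the sequence starts at byte 3, just after the ">5\n" header line.
index_entry = {
    "contig": "5",
    "size": 10,
    "location": 3,
    "bases_per_line": 4,
    "bytes_per_line": 5,
}


def fetch_base(pos: int) -> str:
    """Read a single base by seeking directly to it, without scanning the file."""
    offset = byte_offset(
        index_entry["location"],
        index_entry["bases_per_line"],
        index_entry["bytes_per_line"],
        pos,
    )
    fasta.seek(offset)
    return fasta.read(1).decode("ascii")


print(fetch_base(1), fetch_base(7), fetch_base(9))
```

The same arithmetic works on a chromosome-sized file, which is what makes fetching a small region cheap.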
GENETICS 101
Any questions?
• Cells • chromosomes • homo, hetero
Distribution of Allele Count across 21 exomes
21 individual exomes (of diploid humans) i.e. 42 alleles
SNP numbers and indel size distribution
• Sequencing of human exome – 23,602 SNPs in coding exons (approx. 25M bp size) – 40,621 SNPs outside coding exons (approx. 25M bp size)
Notice both numbers and pattern in indel figures
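The SNP counts are easier to compare as densities. This short calculation uses only the figures quoted on the slide and treats both regions as exactly 25 Mbp, which the slide gives only as an approximation.

```python
REGION_SIZE_BP = 25_000_000  # "approx. 25M bp" for both regions, taken as exact here

snp_counts = {
    "coding exons": 23_602,
    "outside coding exons": 40_621,
}

# SNPs per kilobase in each region.
snps_per_kb = {
    region: count / REGION_SIZE_BP * 1000 for region, count in snp_counts.items()
}

for region, density in snps_per_kb.items():
    print(f"{region}: {density:.2f} SNPs per kb")

ratio = snp_counts["outside coding exons"] / snp_counts["coding exons"]
print(f"non-coding / coding ratio: {ratio:.2f}")
```

With these numbers, roughly one SNP per kilobase is seen in coding exons, and about 1.7 times as many in the sequence outside them.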
EXOME CAPTURE – ESSENTIALS
[Pipeline: Capture >> Sequence >> FASTQ >> Mapping (BWA) >> SAM >> Housekeeping & refinement >> BAM >> Variant calling (GATK) >> VCF file]
An overview of exome capture
Sonication
Library prep (sequencing adaptors on)
Hybridisation to probes
Bead capture
Amplification
Sequencing
SEQUENCING – ESSENTIALS
Sequencing
Covered by Robert
FASTQ FORMAT – ESSENTIALS
In a perfect world – Perfect sequencing
• Perfect sequencing: – single molecule (no PCR) – full length – no deterioration of quality
• While we are waiting: – Sanger • PCR • length: some kb • limited number of reads • high quality – HTS (Illumina) • PCR • 100 bp PE • billions of reads • high quality, but deteriorating along read
[Figure: read length against template. Perfect sequencing: full-length read of a 249 Mbp chromosome. Sanger: full-length read of a few-kbp amplicon. HTS: 100 bp paired-end (PE) reads from a 300 bp fragment.]
Fastq format – fasta with qualities
• p = the probability that the corresponding base call is wrong
• Qualities (Phred scale, Q = -10 log10 p) – p = 0.1 → Q = 10 – p = 0.01 → Q = 20 – p = 0.001 → Q = 30
• Encoding: Sanger/Phred format can encode a quality score from 0 to 93 using ASCII 33 to 126: Q + 33 → ASCII code
Source: http://en.wikipedia.org/wiki/FASTQ_format
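The quality scale and its ASCII encoding can be sketched in a few lines. The function names are illustrative; the arithmetic is the Sanger/Phred+33 scheme described above.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Phred quality for an error probability p (Q = -10 * log10(p))."""
    return -10 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred quality q."""
    return 10 ** (-q / 10)


def phred_to_char(q: int) -> str:
    """Encode an integer quality 0..93 as a single ASCII character (Q + 33)."""
    if not 0 <= q <= 93:
        raise ValueError("Sanger encoding covers qualities 0 to 93")
    return chr(q + 33)


def char_to_phred(c: str) -> int:
    """Decode one quality character back to its integer quality."""
    return ord(c) - 33


quality_string = "II?5+"
print([char_to_phred(c) for c in quality_string])
```

Decoding the example string gives the per-base qualities 40, 40, 30, 20 and 10, i.e. error probabilities from 0.0001 up to 0.1 along the read.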
Illumina sequence identifiers
FastQC - Per cycle quality distribution
[Figure: x-axis: position in read; y-axis: %]
FastQC - Per cycle sequence content
[Figure: exome sequencing; x-axis: position in read; y-axis: %]
FastQC - Per cycle sequence content
[Figure: mRNA sequencing; x-axis: position in read; y-axis: %]
FastQC - Per cycle sequence content
Manipulating fasta and fastq files
• Fastx toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
• FASTQ trimmer
• FASTQ quality filter
• FASTQ quality trimmer
• Can do most of the obvious manipulations of fastq/a you may need
MAPPING WITH BWA
Why mapping?
• The biggest difference with Sanger – we did NOT design and use primers for sequence amplification – we sonicated – >> we do not know where the reads “originate” from
• For each read – we need to determine its likely origin – how likely it is that we have correctly identified its origin
Factors complicating mapping
• Millions of reads • Billions of positions in human genome
• The simple case: complex sequence (100 bp)
• Homologous regions (gene 1A, gene 1B): complex sequence (100 bp) with a base call error >> risk of mismapping
• Repetitive region: repetitive sequence (100 bp) within a big repetitive region >> impossible to map correctly
• Structural variation (not in reference): duplication in sample which is not present in the reference >> definite “mismapping”
What are desirable characteristics of a read mapper?
• Accurately predict the source of a read – in the normal range of base error rates – in the normal range of indel frequency and size
• But, not necessary to get the alignment exactly right as this can be done later using multiple sequence alignment (MSA)
• Produce an accurate estimate of the reliability of prediction
[Figure: alignment example]
Reference: NNNNNCAAGNNNN
Sample: NNNNNCAAAGNNN
Correct read alignment: reference NNNNNCA_AGGNNN, read NNNNNCAAAGNNNN
Alternative alignment: read NNNNNCAAAGNNNN, reference NNNNNCAAGGNNN
Different programs
• BWA • Novoalign • BOWTIE • SOAP • ....
• Most based on BWT: Burrows-Wheeler Transform – a very neat computer algorithm for finding the location of substrings within a string • can I find atgc in attgcatcgatcga....... – requires indexing of string / reference, but enables • rapid search, necessary when mapping billions of reads • manageable RAM footprint: 2.3 GB for single reads and 3 GB for paired-end (for BWA), so runs on an ordinary computer
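To give a feel for how a BWT index answers "where does this substring occur?", here is a toy sketch of exact matching by backward search. It is not how BWA is implemented: the index is built by naively sorting suffixes, the occurrence table is stored uncompressed, and there is no handling of mismatches or gaps. All names are illustrative.

```python
def build_fm_index(text: str):
    """Build a toy FM-index (suffix array, first-occurrence table, occ counts)."""
    text += "$"  # sentinel that sorts before every base
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_occurrence[c]: number of characters in the text smaller than c.
    first_occurrence, total = {}, 0
    for c in alphabet:
        first_occurrence[c] = total
        total += text.count(c)
    # occ[c][i]: number of times c appears in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first_occurrence, occ


def find(pattern: str, index) -> list:
    """Return the sorted 0-based start positions of exact matches of pattern."""
    suffix_array, first_occurrence, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of suffix-array rows
    for c in reversed(pattern):  # extend the match one base at a time, right to left
        if c not in first_occurrence:
            return []
        lo = first_occurrence[c] + occ[c][lo]
        hi = first_occurrence[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("attgcatcgatcga")
print(find("atcg", index), find("tgc", index), find("atgc", index))
```

The search cost depends on the length of the pattern, not on the length of the indexed string, which is what makes this approach practical for billions of reads against a genome-sized reference.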
Mapping quality scores
• The mapping quality score is the Phred-scaled probability of the mapping being incorrect.
• Probability is computed from the qualities of the mismatched bases between read and reference and quality features of the second best hit (see Li, Ruan, and Durbin 2008)
• Not all programs produce good estimates of mapping quality
• BWA provides good mapping qualities with slight overestimation of quality score: – empirical error rate 7x10e-06 for Q60 mappings
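Because mapping quality is Phred-scaled, it can be turned back into an expected number of wrongly placed reads. The sketch below assumes the reported qualities are perfectly calibrated, which the slide notes is not quite true in practice; the read counts are invented for illustration.

```python
def expected_mismapped(mapq_counts: dict) -> float:
    """Expected number of mismapped reads, given {mapping quality: read count}.

    Assumes each mapping quality Q corresponds exactly to an error
    probability of 10 ** (-Q / 10).
    """
    return sum(count * 10 ** (-mapq / 10) for mapq, count in mapq_counts.items())


# Hypothetical read counts per mapping quality.
mapq_counts = {60: 1_000_000, 30: 50_000, 20: 10_000}
print(expected_mismapped(mapq_counts))
```

In this invented example the million Q60 reads contribute about one expected error, while the much smaller set of Q20 reads contributes about a hundred, which is why low mapping quality reads are treated with caution downstream.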
Imperfect alignment following mapping
Source: Heng Li, presentation at GSA workshop 2011
[Figure labels: Incorrect, Correct, Base stacks]
>> Can be solved by alignment: considering all mapping reads and reference together
Notice how the inserted sequence is very similar to the sequence it is inserted in
SAM FORMAT
What does the SAM file look like?
Source: The SAM format specification
[Figure labels: Header; Data lines (one per read)]
Inspecting one record
Details coming later
Difference between 1-based and 0-based coordinates
• SAM (+ VCF and GFF) are 1-based • BED is 0-based • Can be very important when manipulating SNP coordinates >> be careful
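A minimal sketch of the conversion for a single-base site such as a SNP. It relies on BED intervals being 0-based with an exclusive end; the function names are illustrative.

```python
def one_based_to_bed(pos: int) -> tuple:
    """Convert a 1-based position (SAM/VCF/GFF style) to a BED (start, end) pair."""
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based position.
    return (pos - 1, pos)


def bed_to_one_based(start: int, end: int) -> tuple:
    """Convert a BED interval to a 1-based, inclusive (first, last) pair."""
    return (start + 1, end)


print(one_based_to_bed(100))      # SNP at 1-based position 100
print(bed_to_one_based(99, 100))  # and back again
```

An off-by-one mistake here silently shifts every SNP onto its neighbouring base, so it is worth checking a known site by eye after any conversion.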
The FLAG column – a bitwise flag
• Translate from bitwise flag to readable codes by using samtools view -X
[Figure: 100 bp paired-end reads from a 300 bp fragment]
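The FLAG value is a sum of independent bits, so it can be decoded with a bitwise AND against each bit in turn. The bit values below follow the SAM format specification; the descriptive names are this sketch's own labels, not the strings printed by samtools.

```python
# (bit value, meaning) pairs as defined for the SAM FLAG field.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary_alignment"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary_alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(99))
print(decode_flag(1171))
```

A flag of 99 describes the first read of a properly paired fragment whose mate maps to the reverse strand; 1171 describes the second read of such a pair, on the reverse strand, marked as a duplicate.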
What is a PCR duplicate?
About the SAM file produced by BWA
• It contains all the reads >> the Picard/GATK paradigm: information is annotated (and not filtered) – unique – ambiguous – unmapped
• It has a number of shortcomings – it takes a lot of space → convert to BAM – the mates are not fully updated on each other's existence → fixmate – it is not sorted → sort – it contains PCR duplicates → mark or remove duplicates – it does not contain meta-data on the reads (sample, sequencer, etc)
IGV practical on a basic BAM file
PRACTICAL
• We take a visual look at a basic BAM file in the IGV browser
• Get a feel for what an HTS dataset looks like
• On the central URL: slides/igvExercise.txt
[Basic pipeline: FASTQ >> Mapping (BWA) >> SAM >> Housekeeping (samtools) >> Sorted BAM >> Variant calling (bcftools) >> VCF file]
COMPUTING ADVANCED METRICS – PICARD
An overview of exome capture – weak points
• Sonication and library prep (sequencing adaptors on) – Problem: error in sonication >> adaptor seq in reads >> unmapped reads
• Hybridisation to probes – Possible biases in sequences that hybridise >> coverage bias
• Bead capture – Possible biases in sequences that elute >> coverage bias
• Amplification – Possible biases in sequences that amplify >> sequence PCR duplicates
• Sequencing – Possible biases in sequences that bridge PCR >> coverage bias
[Figure labels: Fragment, Adapt (adaptor)]
Metrics - Basic read classification
• Ambiguous – Typical for repeats – Also possible for homologous regions
• Unique – Complex sequence with sufficient length should map uniquely
• Unmapped – Contamination from other species – Reads containing non-genomic DNA e.g. adaptors – PCR gunk – Reads with sequencing errors – Parts of the genome that are not assembled – Parts of sample affected by structural variation
Metrics – Insert sizes
[Figure: Insert Size Histogram for All_Reads in file aln.posiSrt.mkrdDups.bam. x-axis: Insert Size (0 to 300); y-axis: Count (0 to 4000); series: FR]
Metrics - Coverage
• Even if doing Whole Genome Sequencing (WGS) >> coverage issues – due to repetitive regions – due to properties of the DNA e.g. GC content
• Exome sequencing >> Capture by hybridisation
[Figure labels: Tile, Target, Reads, Coverage]
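As an illustration of what a coverage metric measures, the sketch below computes per-base depth over one target interval from a handful of read alignments. Intervals are 0-based with an exclusive end (BED style); the coordinates are invented and the function name is illustrative, and a real metrics tool would stream a BAM file rather than loop over positions.

```python
def depth_over_target(target: tuple, reads: list) -> list:
    """Per-base read depth across a target interval (0-based, end exclusive)."""
    start, end = target
    depth = [0] * (end - start)
    for read_start, read_end in reads:
        # Only the part of the read that overlaps the target counts.
        for pos in range(max(start, read_start), min(end, read_end)):
            depth[pos - start] += 1
    return depth


target = (100, 110)
reads = [(95, 105), (98, 108), (104, 112)]

depth = depth_over_target(target, reads)
mean_depth = sum(depth) / len(depth)
fraction_at_2x = sum(d >= 2 for d in depth) / len(depth)
print(depth, mean_depth, fraction_at_2x)
```

Mean depth alone hides unevenness; the fraction of target bases above a minimum depth is usually the more informative number when judging whether variants can be called across the whole target.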
Metrics - Coverage
What is a duplicate?
Duplicates potentially introduce variant calling errors, as PCR errors may get amplified.
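One simple way to picture duplicate marking is to group read pairs that share the same mapping coordinates and keep only the best one. The sketch below is a deliberately simplified assumption, not the procedure of any particular tool: it treats reads with identical contig, start, strand and mate start as duplicates and keeps the one with the highest summed base quality. The record fields are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads: list) -> set:
    """Return the names of reads flagged as duplicates under a simplified rule."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["contig"], read["start"], read["strand"], read["mate_start"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality; flag the rest.
        best = max(group, key=lambda r: r["quality_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "a", "contig": "5", "start": 100, "strand": "+", "mate_start": 300, "quality_sum": 3000},
    {"name": "b", "contig": "5", "start": 100, "strand": "+", "mate_start": 300, "quality_sum": 3500},
    {"name": "c", "contig": "5", "start": 101, "strand": "+", "mate_start": 300, "quality_sum": 3200},
]
print(mark_duplicates(reads))
```

Reads "a" and "b" share all four coordinates, so the lower-quality "a" is flagged; "c" starts one base later and is kept as an independent fragment.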