Introduction To Next-Generation Sequencing and Variant Calling - Karin Kassahn
Variant Calling - Université de Lille
Variant Calling
Ready for variant calling!!
Discover “variants” relative to a reference genome
From GATK Introduction to Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
International Solanaceae Genomics Project
Different types of variants
Variant callers are not concordant
Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling pipelines
O'Rawe et al., Genome medicine 2013
Samtools mpileup:
- Simplest way to visualize SNP/indel calling and alignment.
- Piles up reads on each position
- Summarizes the base calls of aligned reads to a reference sequence
SAMtools mpileup
SAMtools mpileup format specification
chr2 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
chr2 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
chr2 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
chr2 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
chr2 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
chr2 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
chr2 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
Columns, from left to right :
- chromosome
- 1-based coordinate
- reference base
- number of reads covering the site
- read bases
- base qualities
read base code :
- '.' match to the reference base on the forward strand
- ',' match to the reference base on the reverse strand
- 'ACGTN' mismatch on the forward strand
- 'acgtn' mismatch on the reverse strand
- '\+[0-9]+[ACGTNacgtn]+' insertion between this reference position and the next reference position
- '-[0-9]+[ACGTNacgtn]+' deletion from the reference (see the chr5 line below)
chr3 156 A 11 .$......+2AG.+2AG.+2AGGG <975;:<<<<<
chr5 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<<
SAMtools mpileup format specification
read base code (start and end of reads) :
- '^' marks the start of a read segment; the character that follows it encodes the mapping quality
- '$' marks the end of a read segment
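The read base codes above can be decoded with a short scanner. The following is a minimal sketch in Python, not part of SAMtools: the function name `summarize_read_bases` and the category labels are hypothetical, and the string used in the usage line is the read-bases column of the chr3 example.

```python
import re
from collections import Counter


def summarize_read_bases(read_bases: str) -> Counter:
    """Count the events encoded in the read-bases column of one mpileup line."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        char = read_bases[i]
        if char == "^":
            # '^' starts a read segment; the next character is a mapping
            # quality, not a base call, so both characters are skipped.
            counts["read_start"] += 1
            i += 2
        elif char == "$":
            counts["read_end"] += 1
            i += 1
        elif char in "+-":
            # Indel: sign, length written in digits, then that many bases.
            length_text = re.match(r"\d+", read_bases[i + 1:]).group()
            counts["insertion" if char == "+" else "deletion"] += 1
            i += 1 + len(length_text) + int(length_text)
        elif char in ".,":
            counts["match"] += 1
            i += 1
        elif char in "ACGTNacgtn":
            counts["mismatch"] += 1
            i += 1
        else:
            # Any other symbol (for example '*') is only tallied here.
            counts["other"] += 1
            i += 1
    return counts


print(summarize_read_bases(".$......+2AG.+2AG.+2AGGG"))
```

Skipping the character after '^' matters: in the chr2 272 line the read start is written `^+`, and that '+' is a mapping quality rather than the beginning of an insertion.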
SAMtools mpileup
BUT
- is not a real variant caller
- must be combined with bcftools to perform the variant calling
> samtools mpileup -ugf myrefgenome.fa myreadsaligned.bam | bcftools call -vmO v -o myvariantscalled.vcf
## samtools mpileup
# -u : generate uncompressed VCF/BCF output
# -g : generate genotype likelihoods in BCF format
# -f FILE : faidx indexed reference sequence file
## bcftools
# -v : output variant sites only
# -m : alternative model for multiallelic and rare-variant calling
# -O : output type: 'v' uncompressed VCF [v]
# -o, --output <file> : write output to a file [standard output]
samtools mpileup :
- Collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format.
bcftools call :
- Applies the prior and does the actual calling.
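The split between "likelihood of the data given each genotype" and "apply the prior" can be illustrated with a toy diploid, biallelic model. This is a simplified sketch and not the samtools/bcftools implementation: the function names, the error model (an error is equally likely to be any of the three other bases) and the prior values in the usage lines are all assumptions made for illustration.

```python
import math


def genotype_likelihoods(bases, quals, ref, alt):
    """Return log10 P(reads | genotype) for the genotypes 0/0, 0/1 and 1/1."""
    logl = {"0/0": 0.0, "0/1": 0.0, "1/1": 0.0}
    for base, qual in zip(bases, quals):
        err = 10 ** (-qual / 10)  # Phred score -> probability of a wrong call
        p_if_ref = 1 - err if base == ref else err / 3
        p_if_alt = 1 - err if base == alt else err / 3
        logl["0/0"] += math.log10(p_if_ref)
        # A heterozygous site emits either allele with probability 1/2.
        logl["0/1"] += math.log10(0.5 * p_if_ref + 0.5 * p_if_alt)
        logl["1/1"] += math.log10(p_if_alt)
    return logl


def call_genotype(logl, priors):
    """Apply the prior (Bayes' theorem) and return the best genotype and posteriors."""
    unnormalized = {gt: (10 ** logl[gt]) * priors[gt] for gt in logl}
    total = sum(unnormalized.values())
    posteriors = {gt: value / total for gt, value in unnormalized.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical priors: most sites are expected to be homozygous reference.
priors = {"0/0": 0.998, "0/1": 0.001, "1/1": 0.001}
likelihoods = genotype_likelihoods("AAAAGGGG", [30] * 8, ref="A", alt="G")
print(call_genotype(likelihoods, priors))
```

The first function plays the role described for samtools mpileup and the second the role described for bcftools call; real callers work in log space throughout to avoid underflow at high depth.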
VCF : Variant Call Format
Standardised format for storing the most prevalent types of sequence variations
Text file format in 2 parts : header and body.
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 ACG A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0/0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
20 1110696 rs6040355 A G,GT 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2
Annotations on the example :
- VCF header : mandatory header lines, then optional header lines (meta-data about the annotations in the VCF body)
- Body
- Reference alleles (GT=0)
- Alternate alleles (GT>0 is an index to the ALT column)
- Phased data (G and C above are on the same chromosome)
- Variant types shown : deletion, SNP, insertion, other event
VCF : Variant Call Format
Types of variants (from http://vcftools.sourceforge.net/VCF-poster.pdf) :

SNPs
  Alignment    VCF representation
  ACGT         POS REF ALT
  ATGT         2   C   T

Insertions
  Alignment    VCF representation
  AC-GT        POS REF ALT
  ACTGT        2   C   CT

Deletions
  Alignment    VCF representation
  ACGT         POS REF ALT
  A--T         1   ACG A

Complex events
  Alignment    VCF representation
  ACGT         POS REF ALT
  A-TT         1   ACG AT

Large structural variants
  VCF representation
  POS REF ALT   INFO
  100 T   <DEL> SVTYPE=DEL;END=300
VCF : header
Lines that start with #
Some lines are mandatory : file format, column header
Optional header lines contain meta-data about annotations in the VCF body
Meta-data can vary considerably from one variant caller to another!
INFO vs FORMAT :
INFO = annotations on the variant as a whole
FORMAT = annotations that apply to each genotype
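The difference between INFO (one set of annotations per variant) and FORMAT (one set per sample) shows up directly when a body line is split into fields. The sketch below uses plain string handling and is not a full VCF parser; the function name `parse_vcf_record` and the returned dictionary layout are hypothetical, and values are kept as strings.

```python
def parse_vcf_record(line, sample_names):
    """Split one tab-separated VCF body line into fixed, INFO and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO: annotations on the variant as a whole; a key without "=" is a flag.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    # FORMAT: names the colon-separated values found in every sample column.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
        "ALT": alt.split(","), "QUAL": qual, "FILTER": flt,
        "INFO": info_dict, "samples": samples,
    }


line = "\t".join([
    "20", "1110696", "rs6040355", "A", "G,GT", "67", "PASS",
    "NS=2;DP=10;AF=0.333,0.667;AA=T;DB", "GT:GQ:DP:HQ",
    "1|2:21:6:23,27", "2|1:2:0:18,2",
])
record = parse_vcf_record(line, ["NA00001", "NA00002"])
print(record["INFO"]["DP"], record["samples"]["NA00001"]["GT"])
```

For real files an established VCF library is preferable, since it also reads the header to type each value.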
VCF representation of genotypes

Zygosity                VCF representation
Heterozygous            0/1, 1/2, 0/2, ...
Homozygous reference    0/0
Homozygous alternate    1/1, 2/2, 3/3, ...
Missing                 ./0, ./1, ./., ...

Allele indices : 0 = Ref, 1 = Alt1, 2 = Alt2, 3 = Alt3, ...
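The genotype table can be turned into a small classifier. This is a minimal sketch; the function name `zygosity` and the returned labels are hypothetical, and both the unphased '/' and phased '|' separators are accepted.

```python
import re


def zygosity(gt: str) -> str:
    """Classify a VCF GT string such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


for gt in ["0/1", "1/2", "0/0", "2/2", "./1"]:
    print(gt, zygosity(gt))
```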
VCF specification versions
VCF specifications evolve through versions!
Changes between VCFv4.2 and VCFv4.3 :
● VCF compliant implementations must support both LF and CR+LF newline conventions
● INFO and FORMAT tag names must match the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$
● Spaces are allowed in INFO field values
● Characters with special meaning (such as ';' in INFO, ':' in FORMAT, and '%' in both) can be encoded using the percent encoding (see Section 1.2)
● The character encoding of VCF files is UTF-8.
● The SAMPLE field can contain optional DOI URL for the source data file
● Introduced ##META header lines for defining phenotype metadata
● New reserved tag "CNP" analogous to "GP" was added. Both CNP and GP use 0 to 1 encoding, which is a change from previous phred-scaled GP.
● In order for VCF and BCF to have the same expressive power, we state explicitly that Integers and Floats are 32-bit numbers. Integers are signed.
● We state explicitly that zero length strings are not allowed, this includes the CHROM and ID column, INFO IDs, FILTER IDs and FORMAT IDs. Meta-information lines can be in any order, with the exception of ##fileformat which must come first.
● All header lines of the form ##key=<ID=xxx,...> must have an ID value that is unique for a given value of "key". All header lines whose value starts with "<" must have an ID field. Therefore, also ##PEDIGREE newly requires a unique ID.
● We state explicitly that duplicate IDs, FILTER, INFO or FORMAT keys are not valid.
● A section about gVCF was added, introduced the <*> symbolic allele.
...
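The tag-name rule introduced in VCFv4.3 is easy to check in code. The sketch below assumes the regular expression as written in the VCFv4.3 specification, with underscores: ^[A-Za-z_][0-9A-Za-z_.]*$. The helper name `is_valid_tag_name` is hypothetical.

```python
import re

# Tag-name pattern for INFO and FORMAT keys, as given in the VCFv4.3 specification.
TAG_NAME = re.compile(r"^[A-Za-z_][0-9A-Za-z_.]*$")


def is_valid_tag_name(name: str) -> bool:
    """Return True if name is an acceptable INFO or FORMAT tag name."""
    return TAG_NAME.match(name) is not None


for name in ["DP", "MIN_DP", "1000G", "my tag"]:
    print(name, is_valid_tag_name(name))
```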
Changes between VCFv4.1 and VCFv4.2:
● Information field format: adding source and version as recommended fields.
● INFO field can have one value for each possible allele (code R).
● For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields.
● Alternate base (ALT) can include *: missing due to an upstream deletion.
● Quality scores, a sentence removed: High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired.
● Examples changed a bit.
GATK Best practices
https://software.broadinstitute.org/gatk/best-practices
GATK Variant discovery
HaplotypeCaller : once for each sample
GenotypeGVCFs : once for the full cohort
GATK HaplotypeCaller
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK HaplotypeCaller : step 1
● Sliding window
● Count mismatches, indels and soft clips
● Measure of entropy
● Define active region according to a threshold
Active region
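The idea of step 1 (slide a window, score the evidence of variation, keep what passes a threshold) can be sketched as follows. This is a deliberately simplified illustration and not the HaplotypeCaller algorithm: the per-position activity score, the window size, the threshold and the function name `active_regions` are all assumptions.

```python
def active_regions(activity, window=3, threshold=0.2):
    """Return (start, end) index pairs where the smoothed activity passes the threshold.

    activity: one value per position, for example the fraction of reads
    showing a mismatch, an indel or a soft clip at that position.
    """
    half = window // 2
    regions = []
    start = None
    for i in range(len(activity)):
        lo, hi = max(0, i - half), min(len(activity), i + half + 1)
        smoothed = sum(activity[lo:hi]) / (hi - lo)  # mean over the window
        if smoothed >= threshold:
            if start is None:
                start = i  # open a new active region
        elif start is not None:
            regions.append((start, i - 1))  # close the current region
            start = None
    if start is not None:
        regions.append((start, len(activity) - 1))
    return regions


print(active_regions([0, 0, 0, 0.9, 0, 0, 0, 0, 0, 0]))
```

Smoothing over a window means a single noisy position widens into a small region, which gives the reassembly step some flanking context.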
GATK HaplotypeCaller : step 2
● Local reassembly via graph
● Traverse graph to collect most likely haplotypes
● Align haplotypes using Smith-Waterman
Likely haplotypes and candidate variant sites
Recovering indels and removing artifacts
Resolving complexity
HaplotypeCaller will use one representation for a cleaner output
GATK HaplotypeCaller : step 3
● PairHMM aligns each read to each haplotype
● Considers base qualities
Likelihood of the haplotype given reads
GATK HaplotypeCaller results
GVCF file for each sample
GVCF Format : Why ?
● VCF format only includes variant sites
● When performing joint calling : How to genotype homozygous reference samples ?
○ Can be homozygous reference (good coverage, alignment of good quality, etc.)
○ Can be unknown (poor coverage, alignment of bad quality, outside of WES kit, etc.)
● Solution : recording homozygous reference sites during the calling
https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf
GVCF Format specifications
GVCF Format : example
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr2 21232195 . G A,<NON_REF> 1284.77 . MLEAC=1,0;... GT:AD:DP:GQ 0/1:38,39,0:77:99
chr2 21232196 . G <NON_REF> . . END=21232802 GT:DP:GQ:MIN_DP 0/0:94:99:63
chr2 21232803 . T C,<NON_REF> 4959.77 . DP=120;... GT:AD:DP:GQ 1/1:0,120,0:120:99
chr2 21232805 . T <NON_REF> . . END=21234696 GT:DP:GQ:MIN_DP 0/0:104:99:51
chr2 21234697 . A <NON_REF> . . END=21234697 GT:DP:GQ:MIN_DP 0/0:58:96:58
chr2 21234698 . G <NON_REF> . . END=21234722 GT:DP:GQ:MIN_DP 0/0:48:99:46
chr2 21234723 . C <NON_REF> . . END=21234726 GT:DP:GQ:MIN_DP 0/0:43:90:42
GVCF Default Bands : 1, 2, 3, 4,....., 60, 70, 80, 90, 99
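The banding explains why the example above splits the homozygous reference stretch into several blocks: consecutive sites are merged only while their GQ stays in the same band. The sketch below is an illustration rather than GATK code. It assumes the default band values act as exclusive upper bounds (so GQ 90 to 98 share a band and GQ 99 forms its own); the function names are hypothetical.

```python
from bisect import bisect_right

# Default bands from the slide: 1, 2, ..., 60, 70, 80, 90, 99
DEFAULT_BANDS = list(range(1, 61)) + [70, 80, 90, 99]


def band_index(gq, bounds=DEFAULT_BANDS):
    """Index of the GQ band, treating each bound as an exclusive upper limit."""
    return bisect_right(bounds, gq)


def merge_hom_ref_blocks(sites, bounds=DEFAULT_BANDS):
    """Merge adjacent hom-ref sites that fall in the same GQ band.

    sites: list of (position, GQ) sorted by position.
    Returns a list of (start, end, minimum GQ in the block).
    """
    blocks = []
    for pos, gq in sites:
        band = band_index(gq, bounds)
        last = blocks[-1] if blocks else None
        if last and last["band"] == band and last["end"] == pos - 1:
            last["end"] = pos
            last["min_gq"] = min(last["min_gq"], gq)
        else:
            blocks.append({"start": pos, "end": pos, "band": band, "min_gq": gq})
    return [(b["start"], b["end"], b["min_gq"]) for b in blocks]


sites = [(100, 99), (101, 99), (102, 96), (103, 99), (104, 90), (105, 93)]
print(merge_hom_ref_blocks(sites))
```

With these toy sites, the position with GQ 96 separates two GQ 99 blocks, mirroring the single-position record at chr2 21234697 in the example.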
GATK GenotypeGVCFs
GATK GenotypeGVCFs
● Joint calling
● Determine most likely combination of allele(s) for each site
● Based on allele likelihoods (from PairHMM)
● Apply Bayes’ theorem with ploidy assumption
Genotype calls
https://software.broadinstitute.org/gatk/documentation/presentations
GATK Calling variants workflow
[Figure: three BAM files are called together into one VCF file]
GATK Calling variants N+1 problem
[Figure: three BAM files and their VCF file (already processed), plus one new BAM file]
GATK Calling variants tutorial
GATK: Filtering variants
GATK : Filtering variants
● Calling algorithms are very permissive
● Calling sets contain many false positives
● Two filtering approaches :
○ Hard filtering : using thresholds on annotations
○ Variant recalibration using machine learning
● Sensitivity vs Specificity
GATK : Hard filtering
● Suitable for all experiments (targeted gene, WES, small sample size, etc.)
● Goal : define annotations and thresholds to filter bad variants
● Pros :
○ Easy to perform
● Cons :
○ Hard to define annotations to use
○ Hard to define thresholds
○ May filter good variants, may keep bad variants
GATK : Annotations
● GATK adds annotations to each variant
● Represent properties/statistics describing each variant :
○ Sequence context
○ Depth of coverage
○ Number of reads covering each allele
○ Proportion of reads in forward/reverse orientation
○ ...
GATK : Hard filtering example (QualByDepth)
● QUAL divided by the unfiltered depth of the non hom-ref samples
● Avoids the inflation of QUAL caused by deep coverage
● Two peaks :
○ 12 : Heterozygous
○ 32 : Homozygous alternate
● QD > 2 :
○ Eliminates most of the false positives
○ Keeps some bad variants
○ Filters some good variants
https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
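The QualByDepth definition (QUAL divided by the depth of the non hom-ref samples) can be written out directly. This is a simplified sketch: the function name `qual_by_depth` is hypothetical, the per-sample DP is used as the depth, and GATK's own computation may differ in detail. The first usage line reuses QUAL 1284.77 and DP 77 from the heterozygous site of the GVCF example.

```python
def qual_by_depth(qual, genotypes, depths):
    """QUAL divided by the summed depth of samples that are not hom-ref."""
    depth = 0
    for gt, dp in zip(genotypes, depths):
        alleles = set(gt.replace("|", "/").split("/"))
        if "." in alleles or alleles == {"0"}:
            continue  # skip missing and homozygous reference genotypes
        depth += dp
    return qual / depth if depth else None


print(qual_by_depth(1284.77, ["0/1"], [77]))
print(qual_by_depth(100.0, ["0/1", "0/0"], [10, 40]))
```

In the second call the hom-ref sample contributes nothing to the denominator, so a deeply covered reference sample does not dilute the score.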
GATK : Hard filtering example (FisherStrand)
● Phred-scaled probability that there is a strand bias
● Identifies alternate alleles seen more or less often on the forward or reverse strand than the reference allele
● Large overlap between bad and good variants
● FS < 60 :
○ Removes many bad variants
○ Keeps many bad variants
GATK : Hard filtering examples
GATK : Hard filtering recommendations
● Filtering SNPs where any :
○ QD < 2.0
○ MQ < 40.0
○ FS > 60.0
○ SOR > 3.0
○ MQRankSum < -12.5
○ ReadPosRankSum < -8.0
● Filtering Indels where any :
○ QD < 2.0
○ ReadPosRankSum < -20.0
○ InbreedingCoeff [1] < -0.8
○ FS > 200.0
○ SOR > 10.0
[1] When sample size > 10
Warning : Threshold on maximum depth should not be used for WES data
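The recommended thresholds can be expressed as a lookup table plus one function that reports which annotations fail. This is a sketch for illustration, not a replacement for GATK's own filtering tool: the names `SNP_FILTERS`, `INDEL_FILTERS` and `failed_filters` are hypothetical, and treating a missing annotation as "not failing" is an assumption.

```python
import operator

# (comparison, threshold): the variant is filtered when comparison(value, threshold) is true.
SNP_FILTERS = {
    "QD": (operator.lt, 2.0),
    "MQ": (operator.lt, 40.0),
    "FS": (operator.gt, 60.0),
    "SOR": (operator.gt, 3.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}
INDEL_FILTERS = {
    "QD": (operator.lt, 2.0),
    "ReadPosRankSum": (operator.lt, -20.0),
    "InbreedingCoeff": (operator.lt, -0.8),  # only when sample size > 10
    "FS": (operator.gt, 200.0),
    "SOR": (operator.gt, 10.0),
}


def failed_filters(annotations, filters):
    """Return the names of the annotations that trigger a hard filter."""
    failed = []
    for name, (compare, threshold) in filters.items():
        value = annotations.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


print(failed_filters({"QD": 1.5, "MQ": 60.0, "FS": 3.2}, SNP_FILTERS))
print(failed_filters({"QD": 25.0, "FS": 120.0, "SOR": 1.1}, INDEL_FILTERS))
```

A variant passes only when the returned list is empty, which matches the "where any" wording of the recommendations.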
GATK : Hard filtering tutorial
GATK : Variant Quality Score Recalibration (VQSR)
● Preferred method
● Requires :
○ DNA-seq data (does not work on RNA-seq data)
○ Well curated training/truth resources (usually not available for non-human organisms)
○ Large number of variants (no targeted gene panels, etc.)
○ > 30 samples for WES data (1000G WES samples can be added if needed, but this is not optimal)
● Based on machine learning
GATK : VQSR
Goal : Train on high confidence known sites to determine the probability that other sites are true or false
● Assume annotations tend to form Gaussian clusters
● Build a “Gaussian mixture model” from annotations of known variants
● Score all variants by where their annotations lie relative to the clusters
● Filter based on sensitivity to truth set
GATK : VQSR
Positive model p (good variants, true positives)
Negative model q (bad variants, false positives)
VQSLOD(x) = log(p(x) / q(x))
Done for each annotation and then integrated into a single VQSLOD
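The log-odds score can be illustrated on a single annotation with two one-dimensional Gaussian mixtures. This is only a sketch of VQSLOD(x) = log(p(x)/q(x)): the mixture parameters below are invented, the natural logarithm is an assumption about the base, and the model fitted by GATK over several annotations is not reproduced here.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def mixture_pdf(x, components):
    """Density of a Gaussian mixture given as (weight, mean, sd) tuples."""
    return sum(weight * gaussian_pdf(x, mean, sd) for weight, mean, sd in components)


def vqslod(x, positive, negative):
    """Log odds that x comes from the positive (good) rather than the negative (bad) model."""
    return math.log(mixture_pdf(x, positive) / mixture_pdf(x, negative))


# Hypothetical models for one annotation (for example QD).
positive_model = [(1.0, 20.0, 5.0)]  # good variants cluster around 20
negative_model = [(1.0, 2.0, 5.0)]   # bad variants cluster around 2

for value in (2.0, 11.0, 20.0):
    print(value, vqslod(value, positive_model, negative_model))
```

A positive score means the value looks more like the good cluster, a negative score more like the bad one, and a value halfway between the two toy clusters scores zero.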
GATK : VQSR
[Figure panels: model trained on HapMap; model applied to new SNPs]
DePristo et al. Nat. Genet. 2011
GATK : VQSR workflow
SNP MODE : Original SNPs + Original Indels → VariantRecalibrator → ApplyRecalibration → Recalibrated SNPs + Original Indels
INDEL MODE : Recalibrated SNPs + Original Indels → VariantRecalibrator → ApplyRecalibration → Recalibrated SNPs + Recalibrated Indels
GATK : VQSR training and truth resources
● Training : input variants that overlap with these training sites are used to build the model
● Truth : determine where to set the cutoff
● Known : only for reporting purposes
● Prior : Phred-scaled estimate of data accuracy
GATK : VQSR SNP human resources
● Hapmap
○ Training
○ Truth
○ Prior = 15
● Omni
○ Training
○ Truth
○ Prior = 12
● 1000G SNPs High confidence
○ Training
○ Prior = 10
● dbSNP
○ Known
○ Prior = 2
https://software.broadinstitute.org/gatk/documentation/article?id=1259
Annotations : QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR, DP [1], InbreedingCoeff
[1] Not to be used for WES
GATK : VQSR Indel human resources
● Mills indels
○ Training
○ Truth
○ Prior = 12
● dbSNP
○ Known
○ Prior = 2
Annotations : QD, DP [1], FS, SOR, ReadPosRankSum, MQRankSum, InbreedingCoeff
[1] Not to be used for WES
GATK : VQSR plots
GATK : VQSR tranches plots
[Figure: tranches plot; axis: truth sensitivity (%) at 90, 99, 99.9 and 100]
GATK : VQSR output example
#CHROM POS FILTER INFO
chr2 121456 VQSRTrancheSNP99.9to100.0 AC=2;..;NEGATIVE_TRAIN_SITE;VQSLOD=-4.532;culprit=MQ
chr2 121782 PASS AC=24;..;VQSLOD=3.278;culprit=FS
chr2 121987 VQSRTrancheINDEL99.0to99.9 AC=1;..;POSITIVE_TRAIN_SITE;VQSLOD=-2.312;culprit=SOR
Variant normalization
Why is variant normalization necessary?
Every variant in the human genome can have several representations!
When merging variants from multiple variant callers for the same sample
⇒ which variants are common between callers ?
When comparing variants from the same variant caller but from different samples
⇒ which variants are shared between samples ?
A normalized variant is parsimonious and left-aligned
A variant is parsimonious if it is represented in as few nucleotides as possible without an allele of length 0.
If the leftmost nucleotide of every allele is the same and removing it from each allele does not produce an empty allele ⇒ the variant has a superfluous nucleotide on its left side!
Variant normalization - Parsimony
https://genome.sph.umich.edu/wiki/File:Normalization_mnp.png
Parsimonious !
A variant is left aligned if and only if it is no longer possible to shift its position to the left while keeping the length of all its alleles constant.
Variant normalization - Left Alignment
https://genome.sph.umich.edu/wiki/File:Normalization_str.png
Empty allele is not allowed !
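Parsimony and left alignment can be combined in one short procedure: trim shared nucleotides on the right, extend to the left whenever an allele becomes empty, then trim shared nucleotides on the left while every allele keeps at least one base. The sketch below is a plain Python illustration of that idea, not the code of any particular normalization tool; the function name `normalize`, the toy reference sequences and the choice to leave a variant at position 1 untrimmed are assumptions.

```python
def normalize(pos, ref, alts, reference):
    """Return a parsimonious, left-aligned (pos, ref, alts).

    pos is 1-based; reference is the chromosome sequence as a string.
    """
    alleles = [ref] + list(alts)
    if len(set(alleles)) < 2:
        raise ValueError("REF and ALT are identical: not a variant")

    changed = True
    while changed:
        changed = False
        same_end = len({allele[-1] for allele in alleles}) == 1
        can_extend = pos > 1 or min(len(allele) for allele in alleles) > 1
        if same_end and can_extend:
            # Right trim: drop the nucleotide shared by the end of every allele.
            alleles = [allele[:-1] for allele in alleles]
            changed = True
        if any(len(allele) == 0 for allele in alleles):
            # An empty allele is not allowed: extend one nucleotide to the left.
            pos -= 1
            alleles = [reference[pos - 1] + allele for allele in alleles]
            changed = True

    # Left trim: remove superfluous shared nucleotides on the left side.
    while (min(len(allele) for allele in alleles) >= 2
           and len({allele[0] for allele in alleles}) == 1):
        alleles = [allele[1:] for allele in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]


# Deletion of one CA unit inside a CA repeat, written at position 6.
print(normalize(6, "CAC", ["C"], "GGGCACACACAGGG"))
# Multi-nucleotide change padded on both sides.
print(normalize(3, "GCATG", ["GTGTG"], "GGGCATGGG"))
```

The first call moves the deletion to the left edge of the repeat, and the second removes the padding on both sides of the two changed nucleotides.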
Ready for annotating variants!!