Download - Variant Calling - Illinoisveda.cs.uiuc.edu/CompGen2017/lecture/03_Variant_Calling_v2.pdf · variant calling chris fields mayo-illinois computationalgenomics workshop, june 19, 2017

VariantCallingCHRISFIELDS

MAYO-ILLINOISCOMPUTATIONAL GENOMICSWORKSHOP,JUNE19, 2017

Up-frontacknowledgmentsManyfigures/slidescomefrom:◦ GATKWorkshopslides:http://www.broadinstitute.org/gatk/guide/events?id=2038

◦ IGVWorkshopslides:http://lanyrd.com/2013/vizbi/scdttf/◦ DenisBauer(CSIRO):http://www.allpower.de/◦ Manyvariedpublications

http://www.broadinstitute.org/gatk/guide/events?id=2038

http://lanyrd.com/2013/vizbi/scdttf/

http://www.allpower.de/

BackgroundVariantcallingandusecases

Errorsvs.actualvariants

Experimentaldesign(GATKfocus)

Smallvariant(SNV/SmallIndel)analysis◦ GATKPipeline◦ Formatsencounteredwithin

StructuralVariationAnalysis(SV)

Associationanalysis(briefly)

VariantCallingAsthenameimplies,we’relookingfordifferences(variations)◦ Reference– referencegenome(hg19,b37)◦ Sample(s)– oneormorecomparativesamples

Startwithrawsequencedata

Endwithahuman‘diff’file,recordingthevariants

Additionalinformationaddeddownstream:◦ Filters(qualityofthecalls)◦ Functionalannotation

VariationsDifferencebetween2individuals:1every1000bp◦ ~2.7milliondifferences

Small(<50bp)◦ SNV– singlenucleotide◦ Smallinsertionsordeletions(‘indels’)

Large(structuralvariations)◦ Indels >50bp◦ CopyNumberVariations◦ Inversions◦ Translocations

VariationsMainlyfocusondiploidorganisms

◦ Human:◦ 22pairsofautosomalchromosomes

◦ Onefrommother,onefromfather◦ 2sexchromosomes(femaleXX,maleXY)

◦ Onefrommother,onefromfather(wheredoesYcomefromformaleoffspring)◦ Mitochondrialgenome(generallymaternallyinherited)

◦ 100-10,000copiespercell

Variationcanbein◦ Onechromosome(heterozygous,or‘het’)◦ Bothchromosomes(homozygous,or‘hom’)

UsecasesMedicine

• Hereditaryorgeneticdiseases,geneticpredispositiontodisease• Normalvs.tumoranalyses• Heteroplasmy

Populationgenetics

Populationgenetics

The1000GenomesProject

CEU

JPTCHB

YRI

LWK

MXL

ASW

GBRFIN

TSI

CHS

CLM

PUR

1,100 samples early 2011; 2,500 samples 2011/12

IBSCDX

KHVGWD

ACB

AJM

PEL

PJL

MAB

ADHKAKRDHMRM

The full 1000 Genomes Project data

Cancer

NGS studies of cancer progressionThis approach was published recently in a studydesigned to compare the tumor genomes of patients withde novo AML to their relapse genomes [21]. After sequen-cing each genome (de novo tumor and relapse tumor) andthe matched normal from skin for each patient, somaticmutations and structural variants were identified. Someof these appeared to be unique to the relapse sample ineach case. We then obtained high sequencing readdepth at each somatic mutation site in the de novo andrelapse tumors, and characterized the reads that con-tained the mutated base(s) at each site to calculate anallele frequency of that variant in the tumor cell popu-lation. Using kernel density estimation, we then ident-ified groups of mutations present at the same allelefrequencies, indicative of their prevalence in the tumorcell population. This comparison of allele frequencygroups between de novo and relapse disease allowed usto model the relative numbers of tumor subclones ateach disease presentation, and defined AML progressionas a clonal process, as illustrated in Figure 2. Namely,all subclones originate from a founder clone that sharesall but the newest mutations, and relapse disease

shares mutations with the founder clone as well as newmutations that portend its proliferative advantage in therelapse presentation.

In a similar study, with a slightly different experimentaldesign, we recently explored the differences betweenmyelodysplastic syndrome (MDS) genomes and the gen-omes found in those patients’ secondary AML (sAML)tumors. MDS identifies a heterogeneous group of syn-dromes characterized by dysplasia and ineffective hema-topoesis. Since about 1/3 of these patients progress tosAML for reasons that are not well understood at thegenomic level, we characterized these genomes to under-stand novel somatic variants in the sAML cells. In ourstudy, the results were quite different than the de novo torelapse AML study outlined above. Namely, we foundthat the sAML genomes were all oligoclonal (comprisedof several related tumor cell subclones, each with uniquesets of mutations), each containing a pre-existing MDSfounder clone that was out-competed in the sAML tumorcell population in some cases. We hypothesized that theoligoclonal nature of the sAML presentation may con-tribute to the very poor response rates of these patients to

Genome sequencing and cancer Mardis 247

Figure 2

DNMT3A, NPM1, FLT3, PTRPT

12.74%

HSCs

Diagnosis: Multipleleukemic clones

present

Clinical remission:loss of most

leukemic clones

Relapse: Acquisitionof new mutations in a

pre-existing clone

29.04%

5.10%

53.12%

AML1/ UPN933124

cell type:

normal AML

mutations:

founder (cluster 1)

primary specific (cluster 2)

relapse enriched (cluster 3)relapse enriched (cluster 4) random mutations in HSCs

pathogenic mutationsrelapse specific (cluster 5)

ETV6, WNK1-WAC,MY018B

Chemotherapy

Current Opinion in Genetics & Development

Model of the clonal progression process that occurs between the initial (de novo) and relapse presentation in AML patients. At diagnosis, this patienthas an oligoclonal disease characterized by four different subclones, each present at a specific proportion in the tumor cell population and with aspecific mutational profile. Chemotherapy used to induce the patient into remission decreases clonal heterogeneity but a single subclone persists,acquires new mutations, and again proliferates in the bone marrow as a relapse-specific subclone.

www.sciencedirect.com Current Opinion in Genetics & Development 2012, 22:245–250E.Mardis,CurrentOpinioninGenetics&Development2012,22:245–250

HeteroplasmyMixtureofmorethanonetypeoforganellegenomewithincell◦ 100-10,000mtDNA copies/cell

Predisposingfactorformitochondrialdiseases◦ Late(adult)onset

Possiblybeneficialinsomecases◦ Centenarianshaveaboveavg freq forheteroplasmy

Coble etal,PLOSOne,2009;4(3):e4838

Variantsvs.ErrorsMustdistinguishbetweenactualvariation (realchange)anderrors(artifacts)introducedintotheanalysis

Errorscancreepinonvariouslevels:◦ PCRartifacts (amplificationoferrors)◦ Sequencing (errorsinbasecalling)◦ Alignment (misalignment,mis-gappedalignments)◦ Variantcalling (lowdepthofcoverage,fewsamples)◦ Genotyping (poorannotation)

Trytocontrolforthesewhenpossibletoreducefalsepositives w/oincurring(worse)falsenegatives

Example:Initialrawsequencedata

SequencequalityDifferenttechnologieshavedifferenterrors,errorrates◦ 454– homopolymer trackerrors◦ Illumina – substitutionerrors◦ PacBio – indels.Lotsandlotsof‘em.

Representedasaqualityscore(Phred)◦ Q=-10log10(e)

Phred Quality Score Probability of incorrect base call Base call accuracy10 1 in 10 90%20 1 in 100 99%30 1 in 1000 99.90%40 1 in 10000 99.99%50 1 in 100000 ~100.00%

Formats:FASTQ– ‘sequencewithquality’

Three‘variants’– Sanger,Illumina,Solexa (Sangerismostcommon)

Maybe‘raw’data(straightfromseq pipeline)orprocessed(trimmedforvariousreasons)

Canhold100’sofmillionsofrecordspersample

Filescanbeverylarge(100’sofGB)apiece

@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT

NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT

+

#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII

Formats:FASTQ– ‘sequencewithquality’@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT


+


VerylowPhred score,lessthan10



+



LowPhred score,<20



+



Highqualityreads,Phred score>30

Howdosequencingerrorsoccur?

Illumina Sequencing

Checksequencedata!

BasicExperimentalDesign

TerminologyLane – Physicalsequencinglane

Library – UnitofDNApreppooledtogether

Sample – Singleindividual

Cohort – Collectionofsamplesanalyzedtogether lane

flowcell

TerminologySample(single)vsCohort(multiple)

Library

Lane

Flowcell

lane

flowcell

Library

Ingeneral,Library=Sample

TerminologyWGSvs Exome Capture◦ Wholegenomesequencing – everything◦ Highcostifpersampleisdeepsequence(>25-30x)◦ Canrunmultisample lowcoveragesamples

◦ Exome capture – targetedsequencing(1-5%ofgenome)◦ Deepercoverageoftranscribedregions◦ Missotherimportantnon-codingregions(promoters,introns,enhancers,smallRNA,etc)

Single*vs.*mul)6sample*analysis*

Deep%singleGsample%

•  Higher*sensi)vity*for*variants*in*the*sample*

•  More*accurate*genotyping*per*sample*

•  Cost:*no*informa)on*about*other*samples*

Shallower%mulJGsample%

Sample 1"

Variants"

Fixed*sequencing*budget*

Found 3 variants total" Found 5 variants total"

Sample 1"

Sample 2"

Sample 3"

Sample 4"

•  Sensi)vity*dependent*on*frequency*of*varia)on*

•  Worse*genotyping*•  More*total*variants*

discovered*

High6pass*sequencing*design*

Exon I" Exon II"Intron I"Intergenic" Intergenic"

Targeted bases" ~3 Gb"

Coverage" Avg. 30x"

# sequenced bases" 100 Gb"

# lanes of HiSeq" ~8 lanes"

Variants found per sample" ~3-5M"

Percent of variation in genome" >99%"

Pr{singleton discovery}" >99%"

Pr{common allele discovery}" >99%"

Data requirements per sample" Variant detection among multiple samples"

~30x reads"

Variant site"

Excellent sensitivity for hetero- and homozygotes"

High depth allows excellent genotype calling"

Datarequirementspersample VariantdetectionamongmultiplesamplesTargetbases 3Gb Variantsfoundpersample ~3-5M Coverage Avg.30x Percentofvariationingenome >99% #sequencedbases 100gb Pr{singletondiscovery} >99% #perlane(HiSeq4000) ~1 Pr{commonallelediscovery} >99%

Low6pass*sequencing*design*


Targeted bases" ~3 Gb"

Coverage" Avg. 4x"


# lanes of HiSeq" ~1.25"

Variants found per sample" ~3M"

Percent of variation in genome" ~90%"

Pr{singleton discovery}" <50%"

Pr{common allele discovery}" ~99%"


~4x reads"

Variant site"

Variants missed by sampling"

Heterozygotes can be mistaken for

homozygotes due to sampling"

Significantly better power to detect homozygous sites"

Datarequirementspersample VariantdetectionamongmultiplesamplesTargetbases 3Gb Variantsfoundpersample ~3MCoverage Avg.4x Percentofvariationingenome ~90% #sequencedbases 20gb Pr{singletondiscovery} < 50% #perlane(HiSeq4000) ~6 Pr{commonallelediscovery} ~99%

Exome Capture

ExomeCapture(TruSeq ExomeCapture)

Exome*capture*sequencing*design*


Targeted bases" ~32Mb"

Coverage" >80% 20x"


# lanes of HiSeq" ~0.33"

Variants found per sample" ~20K"

Percent of variation in genome" 0.5%"

Pr{singleton discovery}" ~95%"

Pr{common allele discovery}" ~95%"


150x reads" Little off-target

coverage"

Variant site"

Datarequirementspersample VariantdetectionamongmultiplesamplesTargetbases 45Mb Variantsfoundpersample ~25,000Coverage >80%20x Percentofvariationingenome 0.005#sequencedbases 5Gb Pr{singletondiscovery} ~95% #perlane(HiSeq4000) 24 Pr{commonallelediscovery} ~95%

GeneralvariantcallingpipelinesCommonpattern:◦ Alignreads◦ Optimizealignment◦ Callvariants◦ Filtercalledvariants◦ Annotate

PipelineexamplesExamples:◦ BroadGenomeAnalysisToolkit(GATK)

◦ samtools mpileup◦ VarScan2

Specialized pipelines◦ Heteroplasmy,tumorsampleanalyses

GATKPipeline

PhaseI:NGSDataProcessing

PhaseINGSDataProcessing

◦ Alignmentofrawreads

◦ Duplicatemarking

◦ Basequalityrecalibration

◦ LocalrealignmentnolongerrequiredifyouusetheHaplotypeCaller

PhaseI:Alignmentofrawreads

Accuracy

◦ Sensitivity – mapsreadsaccurately,allowingforerrorsorvariation

◦ Specificity – mapstothecorrectregion

Heng Li’salignerassessment

PhaseI:Alignmentofrawreads

Accuracyassessedusingsimulateddata

Generally,BWA-MEMorNovoalignarerecommended

Uniquevs.multi-mappedreads◦ Shouldweretainreadsmappingtorepetitiveregions?

◦ Maydependontheapplication

Heng Li’salignerassessment

PhaseI:AlignmentofrawreadsAlmost100ofthese(2016):http://wwwdev.ebi.ac.uk/fg/hts_mappers/

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Mapping'short'reads'to'a'reference'is'simple'in'principle'

Reference)genome)

Mapping'and'alignment'algorithms'

Reads'mapped'to'reference'

Enormous)pile)of)short)reads)from)NGS)

Iden;fy'where'the'read'matches'the'reference'sequence'and'record'match'details'as'CIGAR'string'

Region'1' Region'2' Region'3'

But mapping is actually very hard because of mismatches (true mutations or sequencing errors), duplicated regions etc.!

Region'1'

High'MQ' Low'MQ'

Reference)genome)

Region'2A' Region'2B'

For'more'informa;on'see:'

Li'and'Homer'(2010).'A'survey'of'sequence'alignment'algorithms'for'nextBgenera;on'sequencing.''Briefings)in)Bioinforma.cs.)

Enormous)pile)of)short)reads)from)NGS)

Mapping'and'alignment'algorithms'

Mapping'algorithms'account''

for'this'by'choosing'the'most''

likely'placement)

�  )mapping)quality)(MQ))

Alignmentoutput:SAM/BAMSAM– SequenceAlignment/Mapformat◦ SAMfileformatstoresalignmentinformation

◦ NormallyconvertedintoBAM(textformatismostlyuselessforanalysis)

Specification:http://samtools.sourceforge.net/SAM1.pdf

ContainsFASTQreads,qualityinformation,metadata,alignmentinformation,etc.

Filesaretypicallyverylarge:Many100’sofGBormore

http://samtools.sourceforge.net/SAM1.pdf

Alignmentoutput:SAM/BAMBAM– BGZFcompressedSAMformat◦ Maybeunsorted,orsortedbysequencenameorgenomecoordinates

◦ Maybeaccompaniedbyanindexfile(.bai)(onlyifcoord-sorted)◦ Makesthealignmentinformationeasilyaccessibletodownstreamapplications(largegenomefilenotnecessary)

◦ Relativelysimpleformatmakesiteasytoextractspecificfeatures,e.g.genomiclocations

◦ BAMisthecompressed/binaryversionofSAMandisnothumanreadable.Usesaspecializecompressionalgorithmoptimizedforindexingandrecordretrieval(bgzip)

Filesaretypicallyverylarge: 1/5ofSAM,butstillverylarge

Alignment

SAMformat

Alignmentoutput:SAM/BAM

SAMformat


BitFlags

Hex 0x80 0x40 0x20 0x10 0x8 0x4 0x2 0x1Bit 128 64 32 16 8 4 2 1

r001 1 1 1 1 = 163

SAMformat


Toomanytogoover!!!

Alignmentoutput:SAM/BAMTools◦ samtools◦ Picard

MininginformationfromaproperlyformattedBAMfile:◦ Readsinaregion(goodforRNA-Seq,ChIP-Seq)◦ Qualityofalignments◦ Coverage◦ …andofcourse,differences(variants)

PhaseI:DuplicateMarking

The'reason'why'duplicates'are'bad'

Reference)genome)


FP)variant)call)(bad))

='sequencing'error'propagated'in'duplicates'

ATer)marking)duplicates,)the)GATK)will)only)see):)

…)and)thus)be)more)likely)to)make)the)right)call)

PhaseI:DuplicateMarking

Duplicates'have'the'same'star;ng'posi;on''and'the'same'CIGAR'string'

Reference)genome)


Easy)to)bag)&)tag)

Hey,)Picard)has)an)app)for)that!)

PhaseI:Sorting,ReadGroupsA'quick'diversion'about'sor;ng'and'read'groups'

The)informaKon)for)this:) …)is)actually)stored)as)a)text)file)with)one)line)per)read)which)from)far)away)looks)like)this:)

…)but)the)GATK)wants)reads)to)be)sorted)by)starKng)posiKon)like)this:)

And'while'we’re'at'it,'let’s'add'read)group)informa;on'if'it'isn’t'already'there,'so'the)GATK)will)know)what)read)belongs)to)what)sample'(that’s'kind'of'important).'

…)and)Picard)has)an)app)for)that!)

Hey,)Picard)has)an)app)for)that)too!)

The)reads)are)in)no)parKcular)order…)

So)we)need)to)explicitly)sort)the)SAM)file…)

TerminologyReadgroups– informationaboutthesamplesandhowtheywererun◦ ID– Simpleuniqueidentifier◦ Library◦ Samplename◦ Platform– sequencingplatform◦ Platformunit– barcodeoridentifier◦ Sequencingcenter(optional)◦ Description(optional)◦ Rundate(optional)

PhaseI:Sorting,ReadGroupsTypical'workflow'using'Picard'tools'to'mark'duplicates'et)al.)

java)^jar)SortSam.jar)\))I=input.sam)\))O=output.bam)\))SO=coordinate)

java)^jar)MarkDuplicates.jar)\))I=input.sam)\))O=output.bam)

java)^jar)AddOrReplaceReadGroups.jar)\))I=input.bam)\))O=output.bam)\))RGID=id)RGLB=solexa^123)\))RGPL=illumina)RGPU=AXL2342)\))RGSM=NA12878)RGCN=bi)\))RGDT=12/12/2011)

Original)SAM) Ordered)BAM)

Dedupped)BAM)Final)BAM)

Wham,)BAM,)thank)you)Picard!)

sort'

“dedup”'

add'RG'

PhaseI:LocalRealignmentRealignaroundindels

Commonmisalignmentproblem

Cancauseproblemswithbasequalityrecalibration,variantcalls

DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Gen.!

This*is*what*a*realigned*BAM*looks*like*

Before* Afer*

PhaseI:BaseQualityScoreRecalibrationQualityscoresfromsequencersarebiasedandsomewhatinaccurate

Qualityscoresarecriticalforalldownstreamanalysis

Biasesareamajorcontributortobadvariantcalls

Caveat:◦ Inpractice,generallyrequireshavingaknownsetofvariants(dbSNP)

PhaseII:VariantDiscovery/Genotyping

PhaseII:VariantCallingThisiswhereweactuallycallthevariants

Priorstepsleadinguptothishelpremovepotentialcausesofvariantcallingerrors

I’llbecoveringuseoftheUnifiedGenotyper variantcallingtoolinGATK◦ …butyoushouldkeepaneyeonthenewesttool,HaplotypeCaller

PhaseII:VariantCallingIngeneral,usesaprobabilisticmethod,e.g.Bayesianmodel◦ DeterminethepossibleSNPandindel alleles◦ Only“goodbases”areincluded:

◦ Thosesatisfyingminimumbasequality,mappingreadquality,pairmappingquality,etc.

◦ Compute,foreachsample,foreachgenotype,likelihoodsofdatagivengenotypes

◦ Computetheallelefrequencydistributiontodeterminemostlikelyallelecount;emitavariantcallifdetermined

◦ Ifwearegoingtoemitavariant,assignagenotypetoeachsample

HetREF:50%ALT:50%

HomREF:0%ALT:100%

??REF:77%ALT:23%

Variantcallingoutput:VCFVCF(VariantCallFormat)

LikeSAM/BAM,alsohasaversionedspecification◦ Fromthe1000GenomesProject◦ http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

http://www.1000genomes.org/wiki/Analysis/Variant Call Format/vcf-variant-call-format-version-41



Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:


Samples

VCFCol Field Description

1 CHROM Chromosome name

2 POS1-based position. For an indel, this is the position preceding the indel.

3 ID Variant identifier. Usually the dbSNP rsID.

4 REFReference sequence at POS involved in the variant. For a SNP, it is a single base.

5 ALT Comma delimited list of alternative seuqence(s).

6 QUALPhred-scaled probability of all samples being homozygous reference.

7 FILTER Semicolon delimited list of filters that the variant fails to pass.

8 INFO Semicolon delimited list of variant information.

9 FORMATColon delimited list of the format of individual genotypes in the following fields.

10+ Sample(s) Individual genotype information defined by FORMAT.

PhaseII:FilteringTwobasicmethods:◦ Hardfiltering◦ Variantqualityscorerecalibration(VQSR)

PhaseII:HardFilteringReducingfalsepositivesbye.g.requiring◦ SufficientDepth◦ Varianttobein>30%reads◦ Highquality◦ Strandbalance◦ Etc etc etc

Veryhighdimensionalsearchspace◦ …so,verysubjective!

StrandBias

PhaseII:HardFiltering“TheeffectofstrandbiasinIllumina short-readsequencingdata”◦ Guo etal.,BMCGenomics 2012,13:666◦ Lastsentence:“IndiscriminantuseofstrandbiasasafilterwillresultinalargelossoftruepositiveSNPs”

StrandBias

PhaseII: VariantQualityScoreRecalibration(VQSR)ConsideredGATK‘bestpractice’

Trainontrustedvariants

Requirethenewvariantstoliveinthesamehyperspace

Potentialproblems:◦ Over-fitting◦ BiasingtofeaturesofknownSNPs

PhaseIII:IntegrativeAnalysis

PhaseIII:FunctionalAnnotationArethesemutationsinimportantregions?◦ Genes?UTR?◦ Aretheychangingthecodingsequence?

Wouldthesechangeshaveanaffect?

Tools:◦ SnpEff/SnpSift◦ Annovar

Theendofthe(pipe)line

Follow-upQualityControlTransition/Transversion ratio(Ti/Tv)

◦ bcftools canhelphere

Concordancewithknownvariants:dbSNP,HapMap,1000genomes

Others?

Condition ExpectedTi/Tvrandom 0.5whole genome 2.0-2.1exome 3.0-3.5

CallingvariantsoncohortsofsamplesWhenrunningHaplotypeCaller,canuseaspecificoutputtypecalledGVCF(GenotypeVCF)◦ Containsgenotypelikelihoodandannotationforeachsiteingenome

Performjointgenotypingcallsoncohort

Canrerunasneededifmoresamplesaddedtocohort

UsedforExAC cohort(92Kexomes)

Callingvariantsoncohortsofsamples

StructuralVariationToolslikeGATK,samtools can’tcurrentlydetectlargerstructuralchangeseasily

Alkan etal,NatureGenetics12:363,2011

StructuralVariationWiththelatestreleasesofGATK(v3.7orhigher),thisischanging:

https://software.broadinstitute.org/gatk/best-practices/

StructuralVariationDetectionusingNGSdatagenerallyrequiresmulti-layeranalyses,mayfocusonspecificSVtypes

Commontools:◦ CNVnator – grossdetectionofCNVs◦ BreakDancer,Cortex– breakpointdetection◦ Pindel – largedeletions◦ Hydra-SV– readpairdiscordance

Recenttools(lumpy-sv,GASVPRO)integrateapproaches

StructuralVariationStrategiesReadpairdiscordance◦ Insertsizeisoff,orientationofreadsiswrong

Readdepth◦ Regiondeviatesfromexpectedreaddepth

Splitreads◦ Singlereadissplit,partsalignintwodistinctuniquelocations

‘Assembly’◦ Reference-basedlocalassembliesindicateinconsistencies

StructuralVariationAnalyses

Ref

Ref

StructuralVariationAnalyses

StructuralVariationStillanactiveareaofresearch

Problems:◦ Lotsoffalsepositives◦ Hardtocomparemethodologies

AssociationStudiesGenome-wideassociationstudies(GWAS)

Tryingtodeterminewhetherspecificvariant(s)inmanyindividualscanbeassociatedwithatrait

Ex: comparisonofgroupsofpeoplewithadisease(cases)andwithout(controls)

Findingthecausalvariantinidealsituations*

Spotthevariantthatiscommonamongstallaffectedbutabsentinallunaffected

Thisvariantisinagenewithknownfunctionandcausestheproteintobedisrupted

*e.g.somerareautosomal disease

InrealityYoucan’tspotthedifference◦ Youdealwith~3.5millionSNPs◦ Youneedtoemploymethodsthatsystematicallyidentifyvariantsthatstandout:GWAS

GWAStaughtusthatitisunlikelytofindacausalcommonvariantforcomplexdiseases◦ RareVariant?◦ Abunchofrareandcommonvariants?◦ Anevenmorecomplexmodel?

1000GenomesProjectConsortium.Amapofhumangenomevariationfrompopulation-scalesequencing.Nature.2010Oct28;467(7319):1061-73.PubMed PMID:20981092

GWASCriticismsofinitialGWASstudies◦ Poorcontrols◦ Severelyunderpowered◦ Tonsoffalse-positives◦ Notworththecost

Ledtosomeretractions

Wikipediaarticleoutlinesthelimitationsverywell◦ http://en.wikipedia.org/wiki/Genome-wide_association_study

http://en.wikipedia.org/wiki/Genome-wide_association_study

http://en.wikipedia.org/wiki/Genome-wide_association_study

GWASMostoftheseusegenotypingarrays

UseofNGSmayhelpaddresssomeofthepastlimitations

GATKv4Outnowinpre-release

~5xfasterwithHaplotypeCaller

Completelyopen-source,buthasspecificsoftwarerequirements

https://software.broadinstitute.org/gatk/download/alpha

https://software.broadinstitute.org/gatk/download/alpha