VariantCallingCHRISFIELDS
MAYO-ILLINOISCOMPUTATIONAL GENOMICSWORKSHOP,JUNE19, 2017
Up-frontacknowledgmentsManyfigures/slidescomefrom:◦ GATKWorkshopslides:http://www.broadinstitute.org/gatk/guide/events?id=2038
◦ IGVWorkshopslides:http://lanyrd.com/2013/vizbi/scdttf/◦ DenisBauer(CSIRO):http://www.allpower.de/◦ Manyvariedpublications
BackgroundVariantcallingandusecases
Errorsvs.actualvariants
Experimentaldesign(GATKfocus)
Smallvariant(SNV/SmallIndel)analysis◦ GATKPipeline◦ Formatsencounteredwithin
StructuralVariationAnalysis(SV)
Associationanalysis(briefly)
VariantCallingAsthenameimplies,we’relookingfordifferences(variations)◦ Reference– referencegenome(hg19,b37)◦ Sample(s)– oneormorecomparativesamples
Startwithrawsequencedata
Endwithahuman‘diff’file,recordingthevariants
Additionalinformationaddeddownstream:◦ Filters(qualityofthecalls)◦ Functionalannotation
VariationsDifferencebetween2individuals:1every1000bp◦ ~2.7milliondifferences
Small(<50bp)◦ SNV– singlenucleotide◦ Smallinsertionsordeletions(‘indels’)
Large(structuralvariations)◦ Indels >50bp◦ CopyNumberVariations◦ Inversions◦ Translocations
VariationsMainlyfocusondiploidorganisms
◦ Human:◦ 22pairsofautosomalchromosomes
◦ Onefrommother,onefromfather◦ 2sexchromosomes(femaleXX,maleXY)
◦ Onefrommother,onefromfather(wheredoesYcomefromformaleoffspring)◦ Mitochondrialgenome(generallymaternallyinherited)
◦ 100-10,000copiespercell
Variationcanbein◦ Onechromosome(heterozygous,or‘het’)◦ Bothchromosomes(homozygous,or‘hom’)
UsecasesMedicine
• Hereditaryorgeneticdiseases,geneticpredispositiontodisease• Normalvs.tumoranalyses• Heteroplasmy
Populationgenetics
Populationgenetics
The1000GenomesProject
CEU
JPTCHB
YRI
LWK
MXL
ASW
GBRFIN
TSI
CHS
CLM
PUR
1,100 samples early 2011; 2,500 samples 2011/12
IBSCDX
KHVGWD
ACB
AJM
PEL
PJL
MAB
ADHKAKRDHMRM
The full 1000 Genomes Project data
Cancer
NGS studies of cancer progressionThis approach was published recently in a studydesigned to compare the tumor genomes of patients withde novo AML to their relapse genomes [21]. After sequen-cing each genome (de novo tumor and relapse tumor) andthe matched normal from skin for each patient, somaticmutations and structural variants were identified. Someof these appeared to be unique to the relapse sample ineach case. We then obtained high sequencing readdepth at each somatic mutation site in the de novo andrelapse tumors, and characterized the reads that con-tained the mutated base(s) at each site to calculate anallele frequency of that variant in the tumor cell popu-lation. Using kernel density estimation, we then ident-ified groups of mutations present at the same allelefrequencies, indicative of their prevalence in the tumorcell population. This comparison of allele frequencygroups between de novo and relapse disease allowed usto model the relative numbers of tumor subclones ateach disease presentation, and defined AML progressionas a clonal process, as illustrated in Figure 2. Namely,all subclones originate from a founder clone that sharesall but the newest mutations, and relapse disease
shares mutations with the founder clone as well as newmutations that portend its proliferative advantage in therelapse presentation.
In a similar study, with a slightly different experimentaldesign, we recently explored the differences betweenmyelodysplastic syndrome (MDS) genomes and the gen-omes found in those patients’ secondary AML (sAML)tumors. MDS identifies a heterogeneous group of syn-dromes characterized by dysplasia and ineffective hema-topoesis. Since about 1/3 of these patients progress tosAML for reasons that are not well understood at thegenomic level, we characterized these genomes to under-stand novel somatic variants in the sAML cells. In ourstudy, the results were quite different than the de novo torelapse AML study outlined above. Namely, we foundthat the sAML genomes were all oligoclonal (comprisedof several related tumor cell subclones, each with uniquesets of mutations), each containing a pre-existing MDSfounder clone that was out-competed in the sAML tumorcell population in some cases. We hypothesized that theoligoclonal nature of the sAML presentation may con-tribute to the very poor response rates of these patients to
Genome sequencing and cancer Mardis 247
Figure 2
DNMT3A, NPM1, FLT3, PTRPT
12.74%
HSCs
Diagnosis: Multipleleukemic clones
present
Clinical remission:loss of most
leukemic clones
Relapse: Acquisitionof new mutations in a
pre-existing clone
29.04%
5.10%
53.12%
AML1/ UPN933124
cell type:
normal AML
mutations:
founder (cluster 1)
primary specific (cluster 2)
relapse enriched (cluster 3)relapse enriched (cluster 4) random mutations in HSCs
pathogenic mutationsrelapse specific (cluster 5)
ETV6, WNK1-WAC,MY018B
Chemotherapy
Current Opinion in Genetics & Development
Model of the clonal progression process that occurs between the initial (de novo) and relapse presentation in AML patients. At diagnosis, this patienthas an oligoclonal disease characterized by four different subclones, each present at a specific proportion in the tumor cell population and with aspecific mutational profile. Chemotherapy used to induce the patient into remission decreases clonal heterogeneity but a single subclone persists,acquires new mutations, and again proliferates in the bone marrow as a relapse-specific subclone.
www.sciencedirect.com Current Opinion in Genetics & Development 2012, 22:245–250E.Mardis,CurrentOpinioninGenetics&Development2012,22:245–250
HeteroplasmyMixtureofmorethanonetypeoforganellegenomewithincell◦ 100-10,000mtDNA copies/cell
Predisposingfactorformitochondrialdiseases◦ Late(adult)onset
Possiblybeneficialinsomecases◦ Centenarianshaveaboveavg freq forheteroplasmy
Coble etal,PLOSOne,2009;4(3):e4838
Variantsvs.ErrorsMustdistinguishbetweenactualvariation (realchange)anderrors(artifacts)introducedintotheanalysis
Errorscancreepinonvariouslevels:◦ PCRartifacts (amplificationoferrors)◦ Sequencing (errorsinbasecalling)◦ Alignment (misalignment,mis-gappedalignments)◦ Variantcalling (lowdepthofcoverage,fewsamples)◦ Genotyping (poorannotation)
Trytocontrolforthesewhenpossibletoreducefalsepositives w/oincurring(worse)falsenegatives
Example:Initialrawsequencedata
SequencequalityDifferenttechnologieshavedifferenterrors,errorrates◦ 454– homopolymer trackerrors◦ Illumina – substitutionerrors◦ PacBio – indels.Lotsandlotsof‘em.
Representedasaqualityscore(Phred)◦ Q=-10log10(e)
Phred Quality Score Probability of incorrect base call Base call accuracy10 1 in 10 90%20 1 in 100 99%30 1 in 1000 99.90%40 1 in 10000 99.99%50 1 in 100000 ~100.00%
Formats:FASTQ– ‘sequencewithquality’
Three‘variants’– Sanger,Illumina,Solexa (Sangerismostcommon)
Maybe‘raw’data(straightfromseq pipeline)orprocessed(trimmedforvariousreasons)
Canhold100’sofmillionsofrecordspersample
Filescanbeverylarge(100’sofGB)apiece
@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT
NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT
+
#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII
Formats:FASTQ– ‘sequencewithquality’@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT
NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT
+
#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII
VerylowPhred score,lessthan10
@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT
NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT
+
#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII
Formats:FASTQ– ‘sequencewithquality’
LowPhred score,<20
@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT
NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT
+
#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII
Formats:FASTQ– ‘sequencewithquality’
Highqualityreads,Phred score>30
Howdosequencingerrorsoccur?
Illumina Sequencing
Illumina Sequencing
Illumina Sequencing
Illumina Sequencing
Checksequencedata!
BasicExperimentalDesign
TerminologyLane – Physicalsequencinglane
Library – UnitofDNApreppooledtogether
Sample – Singleindividual
Cohort – Collectionofsamplesanalyzedtogether lane
flowcell
TerminologySample(single)vsCohort(multiple)
Library
Lane
Flowcell
lane
flowcell
Library
Ingeneral,Library=Sample
TerminologyWGSvs Exome Capture◦ Wholegenomesequencing – everything◦ Highcostifpersampleisdeepsequence(>25-30x)◦ Canrunmultisample lowcoveragesamples
◦ Exome capture – targetedsequencing(1-5%ofgenome)◦ Deepercoverageoftranscribedregions◦ Missotherimportantnon-codingregions(promoters,introns,enhancers,smallRNA,etc)
Single*vs.*mul)6sample*analysis*
Deep%singleGsample%
• Higher*sensi)vity*for*variants*in*the*sample*
• More*accurate*genotyping*per*sample*
• Cost:*no*informa)on*about*other*samples*
Shallower%mulJGsample%
Sample 1"
Variants"
Fixed*sequencing*budget*
Found 3 variants total" Found 5 variants total"
Sample 1"
Sample 2"
Sample 3"
Sample 4"
• Sensi)vity*dependent*on*frequency*of*varia)on*
• Worse*genotyping*• More*total*variants*
discovered*
High6pass*sequencing*design*
Exon I" Exon II"Intron I"Intergenic" Intergenic"
Targeted bases" ~3 Gb"
Coverage" Avg. 30x"
# sequenced bases" 100 Gb"
# lanes of HiSeq" ~8 lanes"
Variants found per sample" ~3-5M"
Percent of variation in genome" >99%"
Pr{singleton discovery}" >99%"
Pr{common allele discovery}" >99%"
Data requirements per sample" Variant detection among multiple samples"
~30x reads"
Variant site"
Excellent sensitivity for hetero- and homozygotes"
High depth allows excellent genotype calling"
Datarequirementspersample VariantdetectionamongmultiplesamplesTargetbases 3Gb Variantsfoundpersample ~3-5M Coverage Avg.30x Percentofvariationingenome >99% #sequencedbases 100gb Pr{singletondiscovery} >99% #perlane(HiSeq4000) ~1 Pr{commonallelediscovery} >99%
Low6pass*sequencing*design*
Exon I" Exon II"Intron I"Intergenic" Intergenic"
Targeted bases" ~3 Gb"
Coverage" Avg. 4x"
# sequenced bases" 20 Gb"
# lanes of HiSeq" ~1.25"
Variants found per sample" ~3M"
Percent of variation in genome" ~90%"
Pr{singleton discovery}" <50%"
Pr{common allele discovery}" ~99%"
Data requirements per sample" Variant detection among multiple samples"
~4x reads"
Variant site"
Variants missed by sampling"
Heterozygotes can be mistaken for
homozygotes due to sampling"
Significantly better power to detect homozygous sites"
Datarequirementspersample VariantdetectionamongmultiplesamplesTargetbases 3Gb Variantsfoundpersample ~3MCoverage Avg.4x Percentofvariationingenome ~90% #sequencedbases 20gb Pr{singletondiscovery} < 50% #perlane(HiSeq4000) ~6 Pr{commonallelediscovery} ~99%
Exome Capture
ExomeCapture(TruSeq ExomeCapture)
Exome*capture*sequencing*design*
Exon I" Exon II"Intron I"Intergenic" Intergenic"
Targeted bases" ~32Mb"
Coverage" >80% 20x"
# sequenced bases" 5 Gb"
# lanes of HiSeq" ~0.33"
Variants found per sample" ~20K"
Percent of variation in genome" 0.5%"
Pr{singleton discovery}" ~95%"
Pr{common allele discovery}" ~95%"
Data requirements per sample" Variant detection among multiple samples"
150x reads" Little off-target
coverage"
Variant site"
Datarequirementspersample VariantdetectionamongmultiplesamplesTargetbases 45Mb Variantsfoundpersample ~25,000Coverage >80%20x Percentofvariationingenome 0.005#sequencedbases 5Gb Pr{singletondiscovery} ~95% #perlane(HiSeq4000) 24 Pr{commonallelediscovery} ~95%
GeneralvariantcallingpipelinesCommonpattern:◦ Alignreads◦ Optimizealignment◦ Callvariants◦ Filtercalledvariants◦ Annotate
PipelineexamplesExamples:◦ BroadGenomeAnalysisToolkit(GATK)
◦ samtools mpileup◦ VarScan2
Specialized pipelines◦ Heteroplasmy,tumorsampleanalyses
GATKPipeline
PhaseI:NGSDataProcessing
PhaseINGSDataProcessing
◦ Alignmentofrawreads
◦ Duplicatemarking
◦ Basequalityrecalibration
◦ LocalrealignmentnolongerrequiredifyouusetheHaplotypeCaller
PhaseI:Alignmentofrawreads
Accuracy
◦ Sensitivity – mapsreadsaccurately,allowingforerrorsorvariation
◦ Specificity – mapstothecorrectregion
Heng Li’salignerassessment
PhaseI:Alignmentofrawreads
Accuracyassessedusingsimulateddata
Generally,BWA-MEMorNovoalignarerecommended
Uniquevs.multi-mappedreads◦ Shouldweretainreadsmappingtorepetitiveregions?
◦ Maydependontheapplication
Heng Li’salignerassessment
PhaseI:AlignmentofrawreadsAlmost100ofthese(2016):http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Mapping'short'reads'to'a'reference'is'simple'in'principle'
Reference)genome)
Mapping'and'alignment'algorithms'
Reads'mapped'to'reference'
Enormous)pile)of)short)reads)from)NGS)
Iden;fy'where'the'read'matches'the'reference'sequence'and'record'match'details'as'CIGAR'string'
Region'1' Region'2' Region'3'
But mapping is actually very hard because of mismatches (true mutations or sequencing errors), duplicated regions etc.!
Region'1'
High'MQ' Low'MQ'
Reference)genome)
Region'2A' Region'2B'
For'more'informa;on'see:'
Li'and'Homer'(2010).'A'survey'of'sequence'alignment'algorithms'for'nextBgenera;on'sequencing.''Briefings)in)Bioinforma.cs.)
Enormous)pile)of)short)reads)from)NGS)
Mapping'and'alignment'algorithms'
Mapping'algorithms'account''
for'this'by'choosing'the'most''
likely'placement)
� )mapping)quality)(MQ))
Alignmentoutput:SAM/BAMSAM– SequenceAlignment/Mapformat◦ SAMfileformatstoresalignmentinformation
◦ NormallyconvertedintoBAM(textformatismostlyuselessforanalysis)
Specification:http://samtools.sourceforge.net/SAM1.pdf
ContainsFASTQreads,qualityinformation,metadata,alignmentinformation,etc.
Filesaretypicallyverylarge:Many100’sofGBormore
Alignmentoutput:SAM/BAMBAM– BGZFcompressedSAMformat◦ Maybeunsorted,orsortedbysequencenameorgenomecoordinates
◦ Maybeaccompaniedbyanindexfile(.bai)(onlyifcoord-sorted)◦ Makesthealignmentinformationeasilyaccessibletodownstreamapplications(largegenomefilenotnecessary)
◦ Relativelysimpleformatmakesiteasytoextractspecificfeatures,e.g.genomiclocations
◦ BAMisthecompressed/binaryversionofSAMandisnothumanreadable.Usesaspecializecompressionalgorithmoptimizedforindexingandrecordretrieval(bgzip)
Filesaretypicallyverylarge: 1/5ofSAM,butstillverylarge
Alignment
SAMformat
Alignmentoutput:SAM/BAM
SAMformat
Alignmentoutput:SAM/BAM
SAMformat
SAMformat
BitFlags
Hex 0x80 0x40 0x20 0x10 0x8 0x4 0x2 0x1Bit 128 64 32 16 8 4 2 1
r001 1 1 1 1 = 163
SAMformat
SAMformat
SAMformat
CIGAR
SAMformat
SAMformat
SAMformat
SAMformat
Alignmentoutput:SAM/BAM
Toomanytogoover!!!
Alignmentoutput:SAM/BAMTools◦ samtools◦ Picard
MininginformationfromaproperlyformattedBAMfile:◦ Readsinaregion(goodforRNA-Seq,ChIP-Seq)◦ Qualityofalignments◦ Coverage◦ …andofcourse,differences(variants)
PhaseI:DuplicateMarking
The'reason'why'duplicates'are'bad'
Reference)genome)
Reads'mapped'to'reference'
FP)variant)call)(bad))
='sequencing'error'propagated'in'duplicates'
ATer)marking)duplicates,)the)GATK)will)only)see):)
…)and)thus)be)more)likely)to)make)the)right)call)
PhaseI:DuplicateMarking
Duplicates'have'the'same'star;ng'posi;on''and'the'same'CIGAR'string'
Reference)genome)
Reads'mapped'to'reference'
Easy)to)bag)&)tag)
Hey,)Picard)has)an)app)for)that!)
PhaseI:Sorting,ReadGroupsA'quick'diversion'about'sor;ng'and'read'groups'
The)informaKon)for)this:) …)is)actually)stored)as)a)text)file)with)one)line)per)read)which)from)far)away)looks)like)this:)
…)but)the)GATK)wants)reads)to)be)sorted)by)starKng)posiKon)like)this:)
And'while'we’re'at'it,'let’s'add'read)group)informa;on'if'it'isn’t'already'there,'so'the)GATK)will)know)what)read)belongs)to)what)sample'(that’s'kind'of'important).'
…)and)Picard)has)an)app)for)that!)
Hey,)Picard)has)an)app)for)that)too!)
The)reads)are)in)no)parKcular)order…)
So)we)need)to)explicitly)sort)the)SAM)file…)
TerminologyReadgroups– informationaboutthesamplesandhowtheywererun◦ ID– Simpleuniqueidentifier◦ Library◦ Samplename◦ Platform– sequencingplatform◦ Platformunit– barcodeoridentifier◦ Sequencingcenter(optional)◦ Description(optional)◦ Rundate(optional)
PhaseI:Sorting,ReadGroupsTypical'workflow'using'Picard'tools'to'mark'duplicates'et)al.)
java)^jar)SortSam.jar)\))I=input.sam)\))O=output.bam)\))SO=coordinate)
java)^jar)MarkDuplicates.jar)\))I=input.sam)\))O=output.bam)
java)^jar)AddOrReplaceReadGroups.jar)\))I=input.bam)\))O=output.bam)\))RGID=id)RGLB=solexa^123)\))RGPL=illumina)RGPU=AXL2342)\))RGSM=NA12878)RGCN=bi)\))RGDT=12/12/2011)
Original)SAM) Ordered)BAM)
Dedupped)BAM)Final)BAM)
Wham,)BAM,)thank)you)Picard!)
sort'
“dedup”'
add'RG'
PhaseI:LocalRealignmentRealignaroundindels
Commonmisalignmentproblem
Cancauseproblemswithbasequalityrecalibration,variantcalls
DePristo, M., Banks, E., Poplin, R. et. al, A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Gen.!
This*is*what*a*realigned*BAM*looks*like*
Before* Afer*
PhaseI:BaseQualityScoreRecalibrationQualityscoresfromsequencersarebiasedandsomewhatinaccurate
Qualityscoresarecriticalforalldownstreamanalysis
Biasesareamajorcontributortobadvariantcalls
Caveat:◦ Inpractice,generallyrequireshavingaknownsetofvariants(dbSNP)
PhaseII:VariantDiscovery/Genotyping
PhaseII:VariantCallingThisiswhereweactuallycallthevariants
Priorstepsleadinguptothishelpremovepotentialcausesofvariantcallingerrors
I’llbecoveringuseoftheUnifiedGenotyper variantcallingtoolinGATK◦ …butyoushouldkeepaneyeonthenewesttool,HaplotypeCaller
PhaseII:VariantCallingIngeneral,usesaprobabilisticmethod,e.g.Bayesianmodel◦ DeterminethepossibleSNPandindel alleles◦ Only“goodbases”areincluded:
◦ Thosesatisfyingminimumbasequality,mappingreadquality,pairmappingquality,etc.
◦ Compute,foreachsample,foreachgenotype,likelihoodsofdatagivengenotypes
◦ Computetheallelefrequencydistributiontodeterminemostlikelyallelecount;emitavariantcallifdetermined
◦ Ifwearegoingtoemitavariant,assignagenotypetoeachsample
HetREF:50%ALT:50%
HomREF:0%ALT:100%
??REF:77%ALT:23%
Variantcallingoutput:VCFVCF(VariantCallFormat)
LikeSAM/BAM,alsohasaversionedspecification◦ Fromthe1000GenomesProject◦ http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Formats:VCF##fileformat=VCFv4.1##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Samples
VCFCol Field Description
1 CHROM Chromosome name
2 POS1-based position. For an indel, this is the position preceding the indel.
3 ID Variant identifier. Usually the dbSNP rsID.
4 REFReference sequence at POS involved in the variant. For a SNP, it is a single base.
5 ALT Comma delimited list of alternative seuqence(s).
6 QUALPhred-scaled probability of all samples being homozygous reference.
7 FILTER Semicolon delimited list of filters that the variant fails to pass.
8 INFO Semicolon delimited list of variant information.
9 FORMATColon delimited list of the format of individual genotypes in the following fields.
10+ Sample(s) Individual genotype information defined by FORMAT.
PhaseII:FilteringTwobasicmethods:◦ Hardfiltering◦ Variantqualityscorerecalibration(VQSR)
PhaseII:HardFilteringReducingfalsepositivesbye.g.requiring◦ SufficientDepth◦ Varianttobein>30%reads◦ Highquality◦ Strandbalance◦ Etc etc etc
Veryhighdimensionalsearchspace◦ …so,verysubjective!
StrandBias
PhaseII:HardFiltering“TheeffectofstrandbiasinIllumina short-readsequencingdata”◦ Guo etal.,BMCGenomics 2012,13:666◦ Lastsentence:“IndiscriminantuseofstrandbiasasafilterwillresultinalargelossoftruepositiveSNPs”
StrandBias
PhaseII: VariantQualityScoreRecalibration(VQSR)ConsideredGATK‘bestpractice’
Trainontrustedvariants
Requirethenewvariantstoliveinthesamehyperspace
Potentialproblems:◦ Over-fitting◦ BiasingtofeaturesofknownSNPs
PhaseIII:IntegrativeAnalysis
PhaseIII:FunctionalAnnotationArethesemutationsinimportantregions?◦ Genes?UTR?◦ Aretheychangingthecodingsequence?
Wouldthesechangeshaveanaffect?
Tools:◦ SnpEff/SnpSift◦ Annovar
Theendofthe(pipe)line
Follow-upQualityControlTransition/Transversion ratio(Ti/Tv)
◦ bcftools canhelphere
Concordancewithknownvariants:dbSNP,HapMap,1000genomes
Others?
Condition ExpectedTi/Tvrandom 0.5whole genome 2.0-2.1exome 3.0-3.5
CallingvariantsoncohortsofsamplesWhenrunningHaplotypeCaller,canuseaspecificoutputtypecalledGVCF(GenotypeVCF)◦ Containsgenotypelikelihoodandannotationforeachsiteingenome
Performjointgenotypingcallsoncohort
Canrerunasneededifmoresamplesaddedtocohort
UsedforExAC cohort(92Kexomes)
Callingvariantsoncohortsofsamples
StructuralVariationToolslikeGATK,samtools can’tcurrentlydetectlargerstructuralchangeseasily
Alkan etal,NatureGenetics12:363,2011
StructuralVariationWiththelatestreleasesofGATK(v3.7orhigher),thisischanging:
https://software.broadinstitute.org/gatk/best-practices/
StructuralVariationDetectionusingNGSdatagenerallyrequiresmulti-layeranalyses,mayfocusonspecificSVtypes
Commontools:◦ CNVnator – grossdetectionofCNVs◦ BreakDancer,Cortex– breakpointdetection◦ Pindel – largedeletions◦ Hydra-SV– readpairdiscordance
Recenttools(lumpy-sv,GASVPRO)integrateapproaches
StructuralVariationStrategiesReadpairdiscordance◦ Insertsizeisoff,orientationofreadsiswrong
Readdepth◦ Regiondeviatesfromexpectedreaddepth
Splitreads◦ Singlereadissplit,partsalignintwodistinctuniquelocations
‘Assembly’◦ Reference-basedlocalassembliesindicateinconsistencies
StructuralVariationAnalyses
Ref
Ref
StructuralVariationAnalyses
StructuralVariationStillanactiveareaofresearch
Problems:◦ Lotsoffalsepositives◦ Hardtocomparemethodologies
AssociationStudiesGenome-wideassociationstudies(GWAS)
Tryingtodeterminewhetherspecificvariant(s)inmanyindividualscanbeassociatedwithatrait
Ex: comparisonofgroupsofpeoplewithadisease(cases)andwithout(controls)
Findingthecausalvariantinidealsituations*
Spotthevariantthatiscommonamongstallaffectedbutabsentinallunaffected
Thisvariantisinagenewithknownfunctionandcausestheproteintobedisrupted
*e.g.somerareautosomal disease
InrealityYoucan’tspotthedifference◦ Youdealwith~3.5millionSNPs◦ Youneedtoemploymethodsthatsystematicallyidentifyvariantsthatstandout:GWAS
GWAStaughtusthatitisunlikelytofindacausalcommonvariantforcomplexdiseases◦ RareVariant?◦ Abunchofrareandcommonvariants?◦ Anevenmorecomplexmodel?
1000GenomesProjectConsortium.Amapofhumangenomevariationfrompopulation-scalesequencing.Nature.2010Oct28;467(7319):1061-73.PubMed PMID:20981092
GWASCriticismsofinitialGWASstudies◦ Poorcontrols◦ Severelyunderpowered◦ Tonsoffalse-positives◦ Notworththecost
Ledtosomeretractions
Wikipediaarticleoutlinesthelimitationsverywell◦ http://en.wikipedia.org/wiki/Genome-wide_association_study
GWASMostoftheseusegenotypingarrays
UseofNGSmayhelpaddresssomeofthepastlimitations
GATKv4Outnowinpre-release
~5xfasterwithHaplotypeCaller
Completelyopen-source,buthasspecificsoftwarerequirements
https://software.broadinstitute.org/gatk/download/alpha
Top Related