SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes
A/Prof Ruiting Lan, University of New South Wales, Australia


Page 1: SnpFilt: A pipeline for reference-free ... - Fudan Universityadmis.fudan.edu.cn/giw2016/slides/session-13/2-GIW... · Outbreak 1 Gene stfC STM0270 STM0328.s allP fepA mrdB mrdA ybeV

SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes

A/Prof Ruiting Lan
University of New South Wales
Australia

Page 2:

Why interested in SNPs in bacteria?

• Genome sequencing for public health microbiology
  – Outbreak investigations
  – Disease transmission

Page 3:

Salmonella outbreak 1

• Outbreak 1 occurred in a residential college
• 16 cases of gastroenteritis among students and staff over two days
• MLVA profile 3-11-7-12-523
• Chocolate mousse as possible common food source
• 13 human isolates and 6 mousse isolates were sequenced

Page 4:

Outbreak 1

Gene (in slide order): stfC, STM0270, STM0328.s, allP, fepA, mrdB, mrdA, ybeV, gltL, ybiS, rpsA, rpoS, nlpD, barA, STM3073, arcB, mreB, yhhK, mtlR, rpoZ, rbsR, ilvD, rplL, yjdE, hfq, mpl, arcA

AA Change (in slide order): N -> D, Y -> C, K -> N, N -> D, L -> R, A -> V, H -> L, D -> G, E -> V, S -> I, H -> D, H -> R, A -> V, S -> R, R -> H, Q -> STOP, S -> A, V -> A, Q -> STOP, K -> T

Lab No.  Source  Epidemiological link  A A G A A T C G C C T A A C C A C A C G C A G C T C T A T C T
1687     Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . G . . . . . .
1688     Human   Yes                   . . . . . . . . . . . . . . G . . . . . . . . . . . . . . . .
1689     Human   Yes                   . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . .
1690     Human   Yes                   G . . G G A . . T T . . . . . G . . . . . . A . . T . C . T C
1691     Human   Yes                   . . . . . . . A . . C . . . . . . . . . . . . . . . . . . . .
1692     Human   Yes                   . . . . . . . . . . . . . T . . . . . . T . . . . . A . C . .
1693     Human   Yes                   . G . . . . . . . . . . T . . . . . . . . . . . . . . . . . .
1694     Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . .
1695     Human   Yes                   . . T . . . T . . . . G . . . . T . T . . . . . . . . . . . .
1696     Human   Yes                   . . . . . . . . . . . . . . . . . C . . . C . . . . . . . . .
1697     Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1698     Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1699     Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1700     Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1701     Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1702     Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1703     Mousse  Yes                   . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . .
1704     Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1705     Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: relationships among the 19 outbreak 1 isolates, with the number of SNP differences marked on branches. Legend: Human - Epidemiologically confirmed; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063

Page 5:

Salmonella outbreak 2

• Outbreak 2 was linked to ready-to-eat food from the same bakery in metropolitan Sydney
• 27 cases
• MLVA type 3-9-8-12-523
• 11 isolates sequenced
  – 9 isolates from patients with salmonellosis at the time of the outbreak and residing nearby
    • 4 confirmed outbreak cases based on descriptive case series
    • 4 unrelated following PHU investigation
    • 1 unknown link: patient did not attend PHU interview
  – 1 isolate from a boot swab and 1 from dirty egg shell rinse from the bakery

Page 6:

Outbreak 2

Gene Name (in slide order): stiC, cydC, cysK, yhgH, dlhH

AA Change (in slide order): T -> I, H -> Y

Consensus  Source                 Date of collection  Epidemiological link  A A G A A A C
1837       Human                  27-Apr-12           No                    . . . . . G .
1838       Human                  26-Apr-12           No                    . . . . . . .
1839       Human                  26-Apr-12           No                    . . . . . . .
1840       Human                  24-Apr-12           Yes                   . . . . . . .
1841       Human                  24-Apr-12           No                    . . . . . . A
1842       Human                  24-Apr-12           Yes                   . . . . . . .
1843       Human                  23-Apr-12           Yes                   . . . . . . .
1844       Human                  23-Apr-12           Yes                   . . . . . . .
1845       Human                  10-Apr-12           Unknown               G G A C G . .
1846       Boot Swab Row 4        03-May-12           Yes                   . . . . . . .
1847       Dirty Egg Shell Rinse  03-May-12           Yes                   . . . . . . .

[Figure: relationships among the 11 outbreak 2 isolates, with the number of SNP differences marked on branches. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063
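The SNP differences for outbreak 2 can be recomputed from the table on this slide. The following is a minimal Python sketch, not the authors' code: it expands each dotted profile against the consensus and counts mismatching sites. The names `expand` and `snp_distance` are illustrative, and only the three isolates that differ from the consensus, plus one identical isolate, are included.

```python
# SNP profiles transcribed from the outbreak 2 table; "." means "same as consensus".
CONSENSUS = "AAGAAAC"
PROFILES = {
    "1837": ".....G.",
    "1840": ".......",
    "1841": "......A",
    "1845": "GGACG..",
}


def expand(profile, consensus=CONSENSUS):
    """Replace each '.' with the consensus base at that site."""
    return "".join(c if p == "." else p for p, c in zip(profile, consensus))


def snp_distance(seq_a, seq_b):
    """Number of sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


SEQUENCES = {lab_no: expand(p) for lab_no, p in PROFILES.items()}

# Distance of each isolate from the consensus sequence.
FROM_CONSENSUS = {
    lab_no: snp_distance(seq, CONSENSUS) for lab_no, seq in SEQUENCES.items()
}
```

Under this counting, isolate 1845 (unknown epidemiological link) differs from the consensus at 5 sites, while 1837 and 1841 each differ at 1 site.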

Page 7:

Is it a SNP?

Page 8:

[Flow diagram: two SNP-calling routes]

Mapping based: Filter reads by quality → BWA mapping → SNPs
Assembly based: Filter reads by quality → De novo assembly (Velvet) → progressiveMauve alignment → SNPs

SNPs from both routes → Common SNPs → Filter SNPs

Page 9:

Why reference free?

• The SNPs discovered depend on the reference used
• SNPs are culled in regions of high SNP density (Zhou et al. PLoS Genetics 2013)

Page 10:

SnpFilt workflow

NGS raw reads → Assembly (SPAdes) → Map reads to contigs (BWA) → Apply filters → SNPs
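The first two tool steps of this workflow can be sketched as command lines. This is a hedged illustration only: the slide does not show how SnpFilt invokes SPAdes or BWA, so the file names, the directory layout, and the helper `build_commands` are assumptions. The filtering step is specific to SnpFilt and is not shown.

```python
from pathlib import Path


def build_commands(reads_1, reads_2, workdir):
    """Return command lines for assembly and read mapping (nothing is executed).

    Hypothetical helper: paths and layout are illustrative, not SnpFilt's own.
    """
    assembly_dir = Path(workdir) / "assembly"
    # SPAdes writes its contigs to contigs.fasta inside the output directory.
    contigs = assembly_dir / "contigs.fasta"
    return [
        # De novo assembly of the paired-end reads.
        ["spades.py", "-1", reads_1, "-2", reads_2, "-o", str(assembly_dir)],
        # Index the contigs so reads can be mapped back to them.
        ["bwa", "index", str(contigs)],
        # Map the same reads to the contigs; BWA writes SAM to standard output.
        ["bwa", "mem", str(contigs), reads_1, reads_2],
    ]
```

Each list can be passed to `subprocess.run`, with the standard output of the `bwa mem` step redirected to a SAM file for the later filters.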

Page 11:

SNP calling performance metrics

                                Actual sequence
                                SNP                    Not SNP
SNP call algorithms   SNP       True positives (TP)    False positives (FP)
                      Not SNP   False negatives (FN)   True negatives (TN)

Precision = TP / (TP + FP)

Sensitivity = TP / (TP + FN)
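The two formulas translate directly into code. A minimal Python sketch follows; the function names and the example counts in the usage note are illustrative and not taken from the slides.

```python
def precision(tp, fp):
    """Fraction of called SNPs that are true SNPs: TP / (TP + FP)."""
    called = tp + fp
    if called == 0:
        raise ValueError("precision is undefined when no SNPs are called")
    return tp / called


def sensitivity(tp, fn):
    """Fraction of true SNPs that are called: TP / (TP + FN)."""
    actual = tp + fn
    if actual == 0:
        raise ValueError("sensitivity is undefined when there are no true SNPs")
    return tp / actual
```

For example, a caller that reports 90 true SNPs and 10 false ones while missing 30 true SNPs has precision 0.9 and sensitivity 0.75.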

Page 12:

Choice of assemblers

• Abyss
• Cabog
• Mira
• MaSuRCA
• SGA
• SoapDenovo
• SPAdes
• Velvet

Page 13:

Assemblies from the GAGE-B study

[Figure: (A) sensitivity and (B) precision, each on a 0.0 to 1.0 scale, for the Abyss, Cabog, Mira, MaSuRCA, SGA, SoapDenovo, SPAdes and Velvet assemblies. Datasets: M. abscessus (HiSeq), M. abscessus (MiSeq), R. sphaeroides (HiSeq), R. sphaeroides (MiSeq).]

Page 14:

SNP filters

• F1) Regions of excessive coverage
  – The running mean of the read coverage across a window of 100 bases is greater than the median + 2 median deviations across the whole assembly
• F2) Low mapping quality
  – Mapping quality < 58, for any site within a neighbourhood of 400 bases
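Filters F1 and F2 can be sketched as per-site masks over one contig. This is a minimal Python illustration under stated assumptions, not SnpFilt's implementation: "median deviation" is taken to mean the median absolute deviation, windows and neighbourhoods are centred on the site and truncated at contig ends, and all function names are hypothetical.

```python
from statistics import median


def running_mean(values, window):
    """Centred running mean; windows are truncated at the ends of the list."""
    n = len(values)
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    half = window // 2
    means = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        means.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return means


def excessive_coverage_mask(coverage, window=100, n_dev=2):
    """F1: running mean coverage above median + n_dev * median deviation."""
    med = median(coverage)
    # Assumption: median absolute deviation around the median.
    mad = median(abs(c - med) for c in coverage)
    threshold = med + n_dev * mad
    return [m > threshold for m in running_mean(coverage, window)]


def low_mapq_mask(mapq, min_mapq=58, neighbourhood=400):
    """F2: flag every site within the neighbourhood of a low-quality site."""
    n = len(mapq)
    half = neighbourhood // 2  # assumption: 400 bases means 200 on each side
    mask = [False] * n
    for i, q in enumerate(mapq):
        if q < min_mapq:
            for j in range(max(0, i - half), min(n, i + half + 1)):
                mask[j] = True
    return mask
```

A site would be treated as unreliable when either mask is true at its position.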

Page 15:

SNP filters

• F3) Low coverage
  – < 20 reads, or 0 supporting reads in either the forward or reverse direction
• F4) Low forward coverage
  – < 10 reads in the forward direction, for any site within a neighbourhood of 20 bases
• F5) High heterogeneity
  – The proportion of supporting reads is < 70% for any site within a neighbourhood of 20 bases
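Filters F3 to F5 depend only on per-site read counts, so they can be sketched compactly. The following Python is a hedged illustration, not SnpFilt's code: the `Site` record, the function names, and the reading of "neighbourhood of 20 bases" as 10 bases on each side are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical per-site read counts taken from the re-aligned reads."""
    fwd_total: int    # reads covering the site, forward strand
    rev_total: int    # reads covering the site, reverse strand
    fwd_support: int  # forward reads agreeing with the assembled base
    rev_support: int  # reverse reads agreeing with the assembled base


def spread(flags, neighbourhood):
    """Extend each flagged site to all sites within the neighbourhood."""
    n = len(flags)
    half = neighbourhood // 2
    out = [False] * n
    for i, flagged in enumerate(flags):
        if flagged:
            for j in range(max(0, i - half), min(n, i + half + 1)):
                out[j] = True
    return out


def f3_low_coverage(sites, min_reads=20):
    """F3: too few reads, or no supporting read in one direction."""
    return [
        s.fwd_total + s.rev_total < min_reads
        or s.fwd_support == 0
        or s.rev_support == 0
        for s in sites
    ]


def f4_low_forward_coverage(sites, min_fwd=10, neighbourhood=20):
    """F4: fewer than min_fwd forward reads at any nearby site."""
    return spread([s.fwd_total < min_fwd for s in sites], neighbourhood)


def f5_high_heterogeneity(sites, min_fraction=0.7, neighbourhood=20):
    """F5: supporting-read fraction below min_fraction at any nearby site."""
    def fraction(s):
        total = s.fwd_total + s.rev_total
        return (s.fwd_support + s.rev_support) / total if total else 0.0
    return spread([fraction(s) < min_fraction for s in sites], neighbourhood)
```

F3 is evaluated at the site itself, whereas F4 and F5 also reject sites that lie near a problem position, which is why they share the `spread` helper.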

Page 16:

SNP filters

• F6) Low base quality
  – At least 50 bases within a window of 2000 bases have base quality < q.thres, where q.thres is the mean - 3 standard deviations of quality scores across the whole assembly
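Filter F6 combines a global threshold with a windowed count. A minimal Python sketch follows, with assumptions flagged: one quality score per assembled site, the population standard deviation, and a window centred on the site are choices made here for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def quality_threshold(quals, n_sd=3):
    """q.thres: mean minus n_sd standard deviations of all quality scores."""
    # Assumption: population standard deviation (pstdev).
    return mean(quals) - n_sd * pstdev(quals)


def low_base_quality_mask(quals, window=2000, min_low=50, n_sd=3):
    """F6: flag sites whose window holds at least min_low low-quality bases."""
    thres = quality_threshold(quals, n_sd)
    n = len(quals)
    # Prefix sums of the low-quality indicator give O(1) window counts.
    prefix = [0]
    for q in quals:
        prefix.append(prefix[-1] + (1 if q < thres else 0))
    half = window // 2
    mask = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        mask.append(prefix[hi] - prefix[lo] >= min_low)
    return mask
```

Because the threshold is derived from the whole assembly, a uniformly high-quality assembly yields no low-quality bases and F6 removes nothing.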

Page 17:

Effect of filters: GAGE-B assemblies

Page 18:

Effect of filters: GAGE-B assemblies

                            M. abscessus   M. abscessus   R. sphaeroides   R. sphaeroides
                            (HiSeq)        (MiSeq)        (HiSeq)          (MiSeq)
Filter                      TN     FN      TN     FN      TN     FN        TN     FN
F6: low quality             24     9       8      0       107    1         0      0
F5: high heterogeneity      0      0       61     0       42     0         297    0
F4: low forward coverage    0      0       6      0       2      2         11     0
F3: low coverage            0      0       58     2       0      3         0      2
F2: low mapping quality     0      0       10     10      27     0         3      3
F1: excessive coverage      0      7       0      7       22     0         0      0

Page 19:

Effect of filters: known genomes

                            E. coli K12         M. tuberculosis F11   S. pneumoniae TIGR4
Filter                      Sites     Errors    Sites     Errors      Sites     Errors
F6: low quality             50026     29        244705    151         0         0
F5: high heterogeneity      3652      40        1706      38          91023     121
F4: low forward coverage    8621      0         1365      0           750679    1
F3: low coverage            33832     0         7565      0           33057     4
F2: low mapping quality     15062     4         36744     14          104219    12
F1: excessive coverage      469390    6         375689    0           10357     0
F0: reliable sites          4017250   0         3713565   0           1937574   0
Total assembly size         4694957             4386568               2963539
Genome size                 4641652             4424435               2163340

Page 20:

Coverage required for full SNP calls

[Figure: number of TPs (y-axis, 0 to 40) against read depth (x-axis, 20 to 100) for MiSeq and NextSeq data.]

Page 21:

Conclusions

• Reference-free assembly-based discovery of SNPs
• Unreliable regions are removed based on the quality and coverage of re-aligned reads
• At least 40-fold coverage is required for reliable and complete SNP calls
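The 40-fold figure can be turned into a sequencing target with the usual depth formula, depth = sequenced bases / genome size. The Python sketch below is illustrative only: pairing the E. coli K12 genome size from the known-genomes slide with 2x250 bp reads is an assumption made for the example, and the function names are hypothetical.

```python
def mean_depth(read_pairs, read_length, genome_size):
    """Mean fold coverage from paired-end reads of a fixed length."""
    return read_pairs * 2 * read_length / genome_size


def read_pairs_needed(genome_size, read_length, target_depth=40):
    """Smallest number of read pairs giving at least target_depth coverage."""
    bases_needed = target_depth * genome_size
    bases_per_pair = 2 * read_length
    # Integer ceiling division avoids floating-point rounding.
    return -(-bases_needed // bases_per_pair)
```

For a 4641652 bp genome and 2x250 bp reads, `read_pairs_needed(4641652, 250)` gives 371333 read pairs. This is raw depth before any quality filtering, so a real run would need some margin above it.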

Page 22:

Acknowledgments

• Dr Carmen Chan
• Dr Sophie Octavia
• A/Prof Vitali Sintchenko
• Dr Qinning Wang
• Funding support from National Health and Medical Research Council of Australia

Page 24:

Is it a SNP?

• MiSeq (2x250 bp) sequencing
• Mapping reads to the reference genome (LT2): Burrows-Wheeler Aligner (BWA)
• Identification of SNPs: SAMtools
• De novo assembly: Velvet, SPAdes
• Alignment of contigs and scaffolds: progressiveMauve
• Manual verification of SNPs and nature of SNPs: custom scripts

Page 25:

Reads correction by QUAKE

• When coverage is low, correction is worthwhile

Page 26:

Three more outbreaks

[Figure: relationships among the isolates of Outbreak 3, Outbreak 4 and Outbreak 5, with the number of SNP differences marked on branches. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063

Page 27:

Is it part of the outbreak?

Cut-off based on SNP differences

Octavia et al. JCM 2015 53:1063

Page 28:

[Figure: the isolate relationships for Outbreak 1, Outbreak 2, Outbreak 3, Outbreak 4 and Outbreak 5 shown together, with the number of SNP differences marked on branches. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source.]

Page 29:

SNP calling performance metrics

• True positives (TP): true SNPs that are called
• True negatives (TN): non-SNP sites that are not called
• False positives (FP): called SNPs that are not true SNPs
• False negatives (FN): true SNPs that are not called

Page 30:

SNP detection

• Quality control is very important
  – Filter reads
  – Correct reads
  – Filter SNPs
  – Manual checking of SNPs

Page 31:

[Flow diagram]

Filter reads by quality / Correct reads
BWA mapping / SNP site extraction (4352)
Filter >= 20 reads coverage (3322)
Divide SNPs into three categories:
• Sites that contain >= 70% SNP supporting reads (945)
• Sites that contain 30% to < 70% SNP supporting reads (443)
• Sites that contain < 30% SNP supporting reads (1934)

Diagram annotations: Discard; 1.1% genuine SNPs thrown away; 1.8% false positives
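The three-way split on this slide is a simple threshold rule on the fraction of SNP supporting reads. The Python sketch below illustrates it; the category labels and the function name are hypothetical, and the counts are the ones shown on the slide.

```python
# Site counts per category as given on the slide.
CATEGORY_COUNTS = {"high": 945, "intermediate": 443, "low": 1934}
SITES_AFTER_COVERAGE_FILTER = 3322


def categorise(support_fraction):
    """Assign a site to a category by its fraction of SNP supporting reads."""
    if not 0.0 <= support_fraction <= 1.0:
        raise ValueError("support_fraction must be between 0 and 1")
    if support_fraction >= 0.7:
        return "high"          # >= 70% supporting reads
    if support_fraction >= 0.3:
        return "intermediate"  # 30% to < 70% supporting reads
    return "low"               # < 30% supporting reads
```

The three category counts add up to the 3322 sites that remain after the coverage filter, so every retained site falls into exactly one category.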