Scaffolding Draft Genomes Using Paired Sequencing Data

James Lindsay, Jin Zhang, Thomas Farnham, Yufeng Wu, Rachel O’Neill, Ion Mandoiu (University of Connecticut); Edward Bullwinkel, Hamed Salooti, Alex Zelikovsky (Georgia State University)

Scaffolding Problem Overview

• Draft genomes are often composed of many contigs
• Ordering and orienting the contigs relative to each other makes genomes more useful
• Accomplished using paired-end or mate-pair reads which map to different contigs
• When pairs map to different contigs, they induce an order and orientation between contigs
• Existing tools were designed to work with Sanger reads (~1000 bp)
• Long reads ensure accurate mapping
• Next-gen reads are comparatively short and mapping errors are frequent
• This work presents a method to scaffold draft genomes using paired next-gen reads

Legend: blue arrows: paired-end reads; red, green, orange lines: contigs; red, green, orange arrows: oriented contigs

Mapping and Filtering

• Read Mapping
  – Use existing short-read alignment tools to map data onto contigs
  – Tool must be able to report multiple hits for each read and generate SAM output
• Read Filtering
  – Remove pairs including reads that do not map uniquely
    • “Uniqueness” depends on mapping parameters
  – Remove pairs including reads that map within repeats
    • Repeats annotated by RepeatMasker/RepeatModeler
  – Remove pairs with reads mapped to different contigs if the insert size implied by the mapping is unlikely
    • Read pair removed if the minimum insert size implied by the mapping exceeds the expected insert size by more than 3 standard deviations

[Figure: contigs with mapped read pairs; labels: repeat, 2000 bp, 500 bp]
Legend: black lines: contigs; colored arrows: reads; braces: annotations. The blue reads are filtered because of non-uniqueness, green because of repeats, orange (assuming insert size = 1395, std dev = 250) because of minimum insert size.
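The three filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' implementation: the `MappedPair` record, its field names, and the `keep_pair` function are hypothetical, and the poster does not say how the minimum implied insert size is computed, so it is taken here as a precomputed input.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    """Hypothetical summary of one read pair after mapping."""
    hits_first: int          # number of alignments reported for the first read
    hits_second: int         # number of alignments reported for the second read
    first_in_repeat: bool    # first read overlaps an annotated repeat
    second_in_repeat: bool   # second read overlaps an annotated repeat
    different_contigs: bool  # the two reads map to different contigs
    min_insert: int          # smallest insert size (bp) compatible with the mappings


def keep_pair(pair, mean_insert, std_insert, max_sd=3.0):
    """Return True if the pair survives all three filters."""
    # Filter 1: both reads must map uniquely.
    if pair.hits_first != 1 or pair.hits_second != 1:
        return False
    # Filter 2: neither read may fall within an annotated repeat.
    if pair.first_in_repeat or pair.second_in_repeat:
        return False
    # Filter 3: for pairs spanning two contigs, the minimum implied insert
    # size must not exceed the expected size by more than max_sd deviations.
    if pair.different_contigs and pair.min_insert > mean_insert + max_sd * std_insert:
        return False
    return True


# With the legend's values (insert size 1395, std dev 250) the cutoff is
# 1395 + 3 * 250 = 2145 bp.
ok = MappedPair(1, 1, False, False, True, 2000)
too_long = MappedPair(1, 1, False, False, True, 2500)
print(keep_pair(ok, 1395, 250), keep_pair(too_long, 1395, 250))
```

The 2000 bp and 2500 bp values in the usage lines are made-up inputs chosen to fall on either side of the cutoff.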

Contig Graph and Orientation ILP

Contig graph
• Contigs are vertices
• Edges defined using redundancy parameters r and d
  – Read pairs between contigs i and j can be divided into two classes: consistent with the contigs’ current orientation, and consistent with switching the orientation of one contig
  – Contigs i, j are connected if:
    • There are at least r pairs with reads mapped between i and j
    • The ratio between the sizes of the larger and smaller consistency classes exceeds d
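The edge rule can be written as a short test on the two class sizes. The sketch below is an interpretation under stated assumptions: the function name `is_edge` is hypothetical, "at least r pairs" is read as the total over both classes, and an empty smaller class is treated as an unbounded ratio (so the ratio condition holds).

```python
def is_edge(h, u, r, d):
    """Decide whether contigs i and j are connected.

    h: pairs consistent with the current orientation
    u: pairs consistent with switching the orientation of one contig
    r: minimum number of pairs between the two contigs
    d: ratio the larger class must exceed relative to the smaller class
    """
    total = h + u
    larger, smaller = max(h, u), min(h, u)
    # Comparing larger > d * smaller avoids dividing by zero when the
    # smaller class is empty; in that case any non-empty larger class passes.
    return total >= r and larger > d * smaller


print(is_edge(5, 2, r=2, d=2))  # ratio 2.5 exceeds 2
print(is_edge(4, 2, r=2, d=2))  # ratio 2 does not exceed 2
```

Under this reading, d = 1 simply requires a strict majority for one class, and ties produce no edge.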

Integer Linear Program (ILP)
• Typically not possible to find a contig orientation consistent with all read pairs
• An integer linear program (ILP) is used to find a contig orientation that minimizes the number of inconsistent pairs
• 0/1 variable S_i indicates the final orientation of contig i (S_i = 1 iff contig i is flipped)
• h_ij and u_ij denote the number of read pairs between contigs i, j that are consistent, resp. inconsistent, with the current orientation
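To make the objective concrete, the sketch below evaluates it by brute force on a toy graph. It is a stand-in for illustration only, not the poster's ILP: it enumerates all flip assignments (exponential in the number of contigs) rather than calling a solver. It assumes that flipping exactly one endpoint of an edge swaps its consistent and inconsistent pairs, while flipping both or neither leaves them unchanged, which follows from the two consistency classes defined for the contig graph. The function names and the toy counts are hypothetical.

```python
from itertools import product


def inconsistent_pairs(flips, edges):
    """Count inconsistent read pairs for a flip assignment.

    flips: tuple of 0/1 values, flips[i] == 1 iff contig i is flipped (S_i)
    edges: dict mapping (i, j) to (h_ij, u_ij)
    """
    total = 0
    for (i, j), (h, u) in edges.items():
        if flips[i] == flips[j]:
            total += u   # relative orientation unchanged: u_ij stay inconsistent
        else:
            total += h   # one contig flipped: the h_ij pairs become inconsistent
    return total


def best_orientation(n, edges):
    """Exhaustively find the flip assignment with the fewest inconsistent pairs."""
    best = None
    for flips in product((0, 1), repeat=n):
        if flips[0] == 1:
            continue  # fix contig 0; flipping every contig gives the same cost
        cost = inconsistent_pairs(flips, edges)
        if best is None or cost < best[0]:
            best = (cost, flips)
    return best


# Toy instance with three contigs and made-up pair counts (h_ij, u_ij).
edges = {(0, 1): (5, 0), (1, 2): (0, 4), (0, 2): (1, 3)}
print(best_orientation(3, edges))
```

On this toy instance, flipping only contig 2 leaves a single inconsistent pair, whereas keeping every contig as is leaves seven.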

Experimental Setup

Contig sets
• Developed a test set consisting of contigs from chr21 of the HuRef assembly, and from the de novo assembly of a subset of HuRef Sanger reads (~4x average coverage)
• True orientation of assembled contigs found by mapping them to the reference genome; we retained all contigs with at least 50% contiguous alignment

Read pair data
• 480 million 50 bp SOLiD mate pairs mapped against contigs using Bowtie
• Reads mapped allowing 1 mismatch in the seed, with the sum of Phred quality scores of mismatches < 80, reporting the 10 best alignments
• We compared the orientation step of our algorithm to the BAMBUS2 orientation tool
• Both filtered and unfiltered read pairs were given to BAMBUS2 and the ILP for orientation*

*BAMBUS2 has its own repeat filtering capabilities; to facilitate comparison, pairs given to BAMBUS2 were also filtered using our repeat annotations

Contig Graphs for Chr21 HuRef Contigs

[Figure panels: unfiltered, r=2, d=2; filtered, r=2, d=2]

• Visualization of contig graphs for Chr21 HuRef contigs generated using Graphviz
• Connected components consisting of fewer than 5 contigs not displayed
• Red edges indicate at least one read pair inconsistent with the current contig orientation
• HuRef contigs represent an “ideal” dataset; contigs are long and contain few errors
• Filtering makes little difference in this case

Results for 5135 Chr21 Contigs from 4x Assembly

Algorithm  Pair filtering  # pairs  r  delta (d)  Singletons  Correct orientation  Incorrect orientation
bambus2    No              573984   1  *          375         2824                 1936
bambus2    Yes             497907   1  *          450         2903                 1782
ILP        No              573984   1  1
ILP        No              573984   1  2
ILP        Yes             497907   1  1
ILP        Yes             497907   1  2
bambus2    No              573984   2  *          643         2696                 1796
bambus2    Yes             497907   2  *          678         2885                 1572
ILP        No              573984   2  1          561         2893                 1681
ILP        No              573984   2  2          561         2855                 1719
ILP        Yes             497907   2  1          614         3522                 999
ILP        Yes             497907   2  2          614         3469                 1052
bambus2    No              573984   3  *          771         2757                 1607
bambus2    Yes             497907   3  *          811         2865                 1459
ILP        No              573984   3  1          671         3405                 1059
ILP        No              573984   3  2          671         3398                 1066
ILP        Yes             497907   3  1          708         3837                 590
ILP        Yes             497907   3  2          708         3798                 629

Results for 667 HuRef Chr21 Contigs

Algorithm  Pair filtering  # pairs  r  delta (d)  Singletons  Correct orientation  Incorrect orientation
bambus2    No              231257   1  *          50          350                  267
bambus2    Yes             166571   1  *          54          333                  280
ILP        No              231257   1  1          44          377                  246
ILP        No              231257   1  2          44          377                  246
ILP        Yes             166571   1  1          52          390                  225
ILP        Yes             166571   1  2          52          435                  180
bambus2    No              231257   2  *          101         381                  185
bambus2    Yes             166571   2  *          139         371                  157
ILP        No              231257   2  1          78          585                  4
ILP        No              231257   2  2          78          587                  2
ILP        Yes             166571   2  1          99          567                  1
ILP        Yes             166571   2  2          99          567                  1
bambus2    No              231257   3  *          160         359                  148
bambus2    Yes             166571   3  *          231         319                  117
ILP        No              231257   3  1          107         556                  4
ILP        No              231257   3  2          107         558                  2
ILP        Yes             166571   3  1          145         522                  0
ILP        Yes             166571   3  2          145         522                  0
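One way to read the two results tables is to check that each row accounts for every contig and to compute the fraction of oriented contigs that are correct. The sketch below does this for the filtered r=3 rows. The accuracy metric (correct divided by correct plus incorrect, ignoring singletons) is an assumption introduced here for illustration; the poster reports only the raw counts.

```python
def orientation_accuracy(total_contigs, singletons, correct, incorrect):
    """Return correct / (correct + incorrect) after a row-sum sanity check."""
    if singletons + correct + incorrect != total_contigs:
        raise ValueError("row does not account for every contig")
    oriented = correct + incorrect
    return correct / oriented if oriented else float("nan")


# (dataset, algorithm, total contigs, singletons, correct, incorrect),
# copied from the filtered r=3 rows of the tables (ILP rows use d=1).
rows = [
    ("4x", "bambus2", 5135, 811, 2865, 1459),
    ("4x", "ILP", 5135, 708, 3837, 590),
    ("HuRef", "bambus2", 667, 231, 319, 117),
    ("HuRef", "ILP", 667, 145, 522, 0),
]

for dataset, algorithm, total, singletons, correct, incorrect in rows:
    acc = orientation_accuracy(total, singletons, correct, incorrect)
    print(f"{dataset:5s} {algorithm:8s} {acc:.3f}")
```

The row-sum check passes for these rows, which suggests that the three count columns partition the contig set.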

Discussion and Future Work

HuRef Contigs
• In this ideal dataset our filtering seems detrimental to BAMBUS at r=1,2. Altering r seems to have little effect on the number of correct edges.
• The ILP is only slightly affected by filtering.
• At r=1, our ILP performs comparably to BAMBUS. Higher redundancy greatly improves ILP accuracy.
• The d parameter has little effect.

4x Contigs
• In this more realistic dataset, filtering consistently helps both BAMBUS and the ILP at all redundancies.
• The redundancy threshold has little effect on BAMBUS, but a significant effect on the ILP.
• A higher redundancy is important, although further investigation into its relationship with read coverage is necessary.
• The d parameter is sometimes detrimental.

Future Work
Orientation is only part of the scaffolding problem; in ongoing work we are developing ordering and placement algorithms. We will also explore the effect of varying assembly and read coverage on the ability to accurately scaffold draft genomes.

Acknowledgments: This work has been supported in part by NSF awards IIS-0546457, IIS-0916401, IIS-0953563, and IIS-0916948.

Contig Graphs for Chr21 4x Contigs

[Figure panels: unfiltered, r=3, d=1; filtered, r=3, d=1]

• 4x contigs are more representative of typical genome projects
• Contigs are much shorter and they create more complex structures
• Unfiltered graph has more edges inside a highly interconnected component
• Filtering separates some linear chains of contigs from the large interconnected component