Scaffolding Draft Genomes Using Paired Sequencing Data

Post on 25-Feb-2016


Scaffolding Draft Genomes Using Paired Sequencing Data. James Lindsay, Jin Zhang, Thomas Farnham, Yufeng Wu, Rachel O’Neill, Ion Mandoiu (University of Connecticut); Edward Bullwinkel, Hamed Salooti, Alex Zelikovsky (Georgia State University).

Transcript of Scaffolding Draft Genomes Using Paired Sequencing Data

Scaffolding Problem Overview
• Draft genomes often consist of many contigs
• Ordering and orienting the contigs relative to each other makes genomes more useful
• This is accomplished using paired-end or mate-pair reads whose two ends map to different contigs
• When pairs map to different contigs, they induce an order and orientation between contigs
• Existing tools were designed to work with Sanger reads (~1000 bp)
• Long reads ensure accurate mapping
• Next-gen reads are comparatively short and mapping errors are frequent
• This work presents a method to scaffold draft genomes using paired next-gen reads

Legend: blue arrows: paired-end reads; red, green, orange lines: contigs; red, green, orange arrows: oriented contigs

Mapping and Filtering
• Read Mapping
– Use existing short-read alignment tools to map data onto contigs
– The tool must be able to report multiple hits for each read and generate SAM output
• Read Filtering
– Remove pairs including reads that do not map uniquely
  • “Uniqueness” depends on mapping parameters
– Remove pairs including reads that map within repeats
  • Repeats annotated by RepeatMasker/RepeatModeler
– Remove pairs with reads mapped in different contigs if the insert size implied by the mapping is unlikely
  • A read pair is removed if the minimum insert size implied by the mapping exceeds the expected insert size by more than 3 standard deviations

[Figure: contigs with mapped read pairs; labels in the figure: repeat, 2000 bp, 500 bp]

Legend: black lines: contigs; colored arrows: reads; braces: annotations. The blue reads are filtered because of non-uniqueness, green because of repeats, and orange (assuming insert size = 1395, std dev = 250) because of minimum insert size.
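The three filtering rules can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the `MappedRead` record, its field names, and the function names are hypothetical. It also assumes that the minimum implied insert size is the sum of the distances from each read to the contig end it points toward (that is, a zero-length gap between contigs).

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    """Hypothetical summary of one read's alignment (not a real SAM parser)."""
    contig: str
    num_hits: int            # number of alignments reported for this read
    in_repeat: bool          # True if the alignment overlaps an annotated repeat
    dist_to_contig_end: int  # bases from the read to the contig end it points toward


def min_implied_insert(a: MappedRead, b: MappedRead) -> int:
    """Smallest insert size compatible with the mapping (assumes a zero-length gap)."""
    return a.dist_to_contig_end + b.dist_to_contig_end


def keep_pair(a: MappedRead, b: MappedRead,
              mean_insert: float, std_insert: float, n_std: float = 3.0) -> bool:
    """Return True if the pair survives all three filters."""
    # Rule 1: both reads must map uniquely.
    if a.num_hits != 1 or b.num_hits != 1:
        return False
    # Rule 2: neither read may fall within an annotated repeat.
    if a.in_repeat or b.in_repeat:
        return False
    # Rule 3: for pairs spanning two contigs, the minimum implied insert size
    # must not exceed the expected insert size by more than n_std deviations.
    if a.contig != b.contig:
        if min_implied_insert(a, b) > mean_insert + n_std * std_insert:
            return False
    return True
```

With the insert size statistics from the legend (mean 1395, std dev 250), the cutoff is 1395 + 3 × 250 = 2145 bp. If the 2000 bp and 500 bp labels are read as the two distances to the contig ends (an assumption about the figure), the implied minimum of 2500 bp exceeds the cutoff and the pair is removed.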

Contig Graph and Orientation ILP

Contig graph
• Contigs are vertices
• Edges are defined using redundancy parameters r and d
– Read pairs between contigs i and j can be divided into two classes: those consistent with the contigs’ current orientation, and those consistent with switching the orientation of one contig
– Contigs i,j are connected if:
  • There are at least r pairs with reads mapped between i and j
  • The ratio between the sizes of the larger and smaller consistency classes exceeds d
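The edge rule above can be sketched in Python as follows. This is an illustrative sketch only: the function name and the input format (one tuple per read pair with a consistency flag) are assumptions. "Exceeds d" is read as a strict inequality, written as `larger > d * smaller` so that an empty smaller class does not cause a division by zero.

```python
from collections import defaultdict


def build_contig_graph(pairs, r, d):
    """Build contig graph edges from read pairs.

    pairs: iterable of (contig_i, contig_j, consistent), where `consistent`
           is True if the pair agrees with the contigs' current orientation.
    Returns a dict mapping (contig_a, contig_b) to (h, u): the numbers of
    consistent and inconsistent pairs on that edge.
    """
    counts = defaultdict(lambda: [0, 0])  # [consistent, inconsistent]
    for i, j, consistent in pairs:
        if i == j:
            continue  # pairs within a single contig do not define an edge
        key = tuple(sorted((i, j)))
        counts[key][0 if consistent else 1] += 1

    edges = {}
    for key, (h, u) in counts.items():
        larger, smaller = max(h, u), min(h, u)
        # Condition 1: at least r pairs link the two contigs.
        # Condition 2: the larger class outweighs the smaller by a factor above d.
        if h + u >= r and larger > d * smaller:
            edges[key] = (h, u)
    return edges
```

For example, three consistent pairs and one inconsistent pair between two contigs give a ratio of 3, so the edge is kept for d=2 but dropped for d=3.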

Integer Linear Program (ILP)
• Typically it is not possible to find a contig orientation consistent with all read pairs
• The following integer linear program (ILP) finds a contig orientation that minimizes the number of inconsistent pairs
• 0/1 variable S_i indicates the final orientation of contig i (S_i = 1 iff contig i is flipped)
• h_ij and u_ij denote the number of read pairs between contigs i,j that are consistent, resp. inconsistent
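The ILP itself is not reproduced in this transcript, so the sketch below only illustrates the objective described in the bullets, using exhaustive enumeration instead of an ILP solver. It assumes that h_ij and u_ij are counted relative to the contigs' current orientation: if contigs i and j end up with the same flip state, the u_ij pairs remain inconsistent, and if exactly one of them is flipped, the h_ij pairs become inconsistent. The function names are hypothetical, and the enumeration is exponential in the number of contigs, so it is only usable on toy inputs.

```python
from itertools import product


def inconsistent_pairs(flip, edges):
    """Count read pairs left inconsistent by an orientation assignment.

    flip:  dict contig -> 0/1 (1 means the contig is flipped, like S_i)
    edges: dict (contig_i, contig_j) -> (h_ij, u_ij)
    """
    total = 0
    for (i, j), (h, u) in edges.items():
        if flip[i] == flip[j]:
            total += u  # relative orientation unchanged: u_ij pairs stay inconsistent
        else:
            total += h  # relative orientation switched: h_ij pairs become inconsistent
    return total


def best_orientation(contigs, edges):
    """Try every 0/1 assignment and return (min_cost, flip) for the first optimum found."""
    best_cost, best_flip = None, None
    for bits in product((0, 1), repeat=len(contigs)):
        flip = dict(zip(contigs, bits))
        cost = inconsistent_pairs(flip, edges)
        if best_cost is None or cost < best_cost:
            best_cost, best_flip = cost, flip
    return best_cost, best_flip
```

Flipping every contig at once leaves the cost unchanged, so optimal solutions always come in complementary pairs; the sketch simply returns the first one it encounters.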

Experimental Setup

Contig sets
• Developed a test set consisting of contigs from chr21 of the HuRef assembly, and from the de novo assembly of a subset of HuRef Sanger reads (~4x average coverage)
• The true orientation of assembled contigs was found by mapping them to the reference genome; we retained all contigs with at least 50% contiguous alignment

Read pair data
• 480 million 50 bp SOLiD mate pairs were mapped against contigs using Bowtie
• Reads were mapped allowing 1 mismatch in the seed and a sum of Phred quality scores of mismatches < 80, reporting the 10 best alignments
• We compared the orientation step of our algorithm to the BAMBUS2 orientation tool
• Both filtered and unfiltered read pairs were given to BAMBUS2 and the ILP for orientation*

*BAMBUS2 has its own repeat filtering capabilities; to facilitate comparison, pairs given to BAMBUS2 were also filtered using our repeat annotations

Contig Graphs for Chr21 HuRef Contigs

[Figure panels: unfiltered, r=2, d=2; filtered, r=2, d=2]

• Visualization of contig graphs for Chr21 HuRef contigs, generated using Graphviz
• Connected components consisting of fewer than 5 contigs are not displayed
• Red edges indicate at least one read pair inconsistent with the current contig orientation
• HuRef contigs represent an “ideal” dataset; contigs are long and contain few errors
• Filtering makes little difference in this case

Results for 5135 Chr21 Contigs from 4x Assembly

Algorithm  Pair Filtering  # pairs  r  delta  Singletons  Correct Orientation  Incorrect Orientation
bambus2    No              573984   1  *      375         2824                 1936
bambus2    Yes             497907   1  *      450         2903                 1782
ILP        No              573984   1  1
ILP        No              573984   1  2
ILP        Yes             497907   1  1
ILP        Yes             497907   1  2
bambus2    No              573984   2  *      643         2696                 1796
bambus2    Yes             497907   2  *      678         2885                 1572
ILP        No              573984   2  1      561         2893                 1681
ILP        No              573984   2  2      561         2855                 1719
ILP        Yes             497907   2  1      614         3522                 999
ILP        Yes             497907   2  2      614         3469                 1052
bambus2    No              573984   3  *      771         2757                 1607
bambus2    Yes             497907   3  *      811         2865                 1459
ILP        No              573984   3  1      671         3405                 1059
ILP        No              573984   3  2      671         3398                 1066
ILP        Yes             497907   3  1      708         3837                 590
ILP        Yes             497907   3  2      708         3798                 629

Results for 667 HuRef Chr21 Contigs

Algorithm  Pair Filtering  # pairs  r  delta  Singletons  Correct Orientation  Incorrect Orientation
bambus2    No              231257   1  *      50          350                  267
bambus2    Yes             166571   1  *      54          333                  280
ILP        No              231257   1  1      44          377                  246
ILP        No              231257   1  2      44          377                  246
ILP        Yes             166571   1  1      52          390                  225
ILP        Yes             166571   1  2      52          435                  180
bambus2    No              231257   2  *      101         381                  185
bambus2    Yes             166571   2  *      139         371                  157
ILP        No              231257   2  1      78          585                  4
ILP        No              231257   2  2      78          587                  2
ILP        Yes             166571   2  1      99          567                  1
ILP        Yes             166571   2  2      99          567                  1
bambus2    No              231257   3  *      160         359                  148
bambus2    Yes             166571   3  *      231         319                  117
ILP        No              231257   3  1      107         556                  4
ILP        No              231257   3  2      107         558                  2
ILP        Yes             166571   3  1      145         522                  0
ILP        Yes             166571   3  2      145         522                  0
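One way to read the two result tables is sketched below. The "fraction correct among oriented contigs" is a derived quantity introduced here for illustration, not a metric reported on the slides, and the function name is hypothetical. The sketch also checks that singletons, correctly oriented, and incorrectly oriented contigs add up to the contig total in each table title, which holds for the rows used here.

```python
def fraction_correct(singletons, correct, incorrect, total_contigs):
    """Fraction of oriented (non-singleton) contigs that were oriented correctly."""
    # Sanity check: the three counts should account for every contig.
    if singletons + correct + incorrect != total_contigs:
        raise ValueError("row does not sum to the contig total")
    oriented = correct + incorrect
    return correct / oriented


# Rows copied from the tables above (filtered pairs, r=3).
rows = {
    "4x bambus2":      (811, 2865, 1459, 5135),
    "4x ILP d=1":      (708, 3837, 590, 5135),
    "HuRef bambus2":   (231, 319, 117, 667),
    "HuRef ILP d=1":   (145, 522, 0, 667),
}

for name, row in rows.items():
    print(f"{name}: {fraction_correct(*row):.3f}")
```

Singletons are excluded from the denominator, so a stricter r can raise this fraction while orienting fewer contigs overall; both numbers matter when comparing rows.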

Discussion and Future Work

HuRef Contigs
• In this ideal dataset our filtering seems detrimental to BAMBUS at r=1,2. Altering r seems to have little effect on the number of correct edges.
• The ILP is only slightly affected by filtering.
• At r=1, our ILP performs comparably to BAMBUS. Higher redundancy greatly improves ILP accuracy.
• The d parameter has little effect.

4x Contigs
• In this more realistic dataset, filtering consistently helps both BAMBUS and the ILP at all redundancies.
• The redundancy threshold has little effect on BAMBUS, but a significant effect on the ILP.
• A higher redundancy is important, although further investigation into its relationship with read coverage is necessary.
• The d parameter is sometimes detrimental.

Future Work
Orientation is only part of the scaffolding problem; in ongoing work we are developing ordering and placement algorithms. We will also explore the effect of varying assembly and read coverage on the ability to accurately scaffold draft genomes.

Acknowledgments: This work has been supported in part by NSF awards IIS-0546457, IIS-0916401, IIS-0953563, and IIS-0916948.


Contig Graphs for Chr21 4x Contigs

[Figure panels: unfiltered, r=3, d=1; filtered, r=3, d=1]

• 4x contigs are more representative of typical genome projects
• Contigs are much shorter and they create more complex structures
• The unfiltered graph has more edges inside a highly interconnected component
• Filtering separates some linear chains of contigs from the large interconnected component