Scaffolding Draft Genomes Using Paired Sequencing Data
Scaffolding Problem
Overview
• Draft genomes often consist of many contigs
• Ordering and orienting the contigs relative to each other makes genomes more useful
• Accomplished using paired-end or mate-pair reads that map to different contigs
• When the reads of a pair map to different contigs, they induce an order and orientation between those contigs
• Existing tools were designed to work with Sanger reads (~1000 bp)
• Long reads ensure accurate mapping
• Next-gen reads are comparatively short and mapping errors are frequent
• This work presents a method to scaffold draft genomes using paired next-gen reads
Legend: blue arrows: paired-end reads; red, green, orange lines: contigs; red, green, orange arrows: oriented contigs
Mapping and Filtering
• Read Mapping
– Use existing short-read alignment tools to map data onto contigs
– Tool must be able to report multiple hits for each read and generate SAM output
• Read Filtering
– Remove pairs including reads that do not map uniquely
  • “Uniqueness” depends on mapping parameters
– Remove pairs including reads that map within repeats
  • Repeats annotated by RepeatMasker/RepeatModeler
– Remove pairs with reads mapped to different contigs if the insert size implied by the mapping is unlikely
  • A read pair is removed if the minimum insert size implied by the mapping exceeds the expected insert size by more than 3 standard deviations
[Figure: examples of filtered read pairs; labels: repeat, 2000 bp, 500 bp]
Legend: black lines: contigs; colored arrows: reads; braces: annotations. The blue reads are filtered because of non-uniqueness, the green reads because of repeats, and the orange reads (assuming insert size = 1395, std dev = 250) because of minimum insert size.
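The insert-size filter can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the way the minimum implied insert size is computed (the sum of each read's distance to the end of its own contig, assuming a gap of zero between contigs) are assumptions made for the example. The mean and standard deviation are the values quoted in the legend; the distances in the usage lines are illustrative.

```python
def min_implied_insert(dist_to_end_1, dist_to_end_2):
    """Smallest insert size compatible with a pair spanning two contigs.

    Assumption: each argument is the distance (bp) from a read's start to
    the end of its own contig in the direction the read points, and the
    gap between the two contigs is taken to be zero.
    """
    return dist_to_end_1 + dist_to_end_2


def keep_pair(dist_to_end_1, dist_to_end_2, mean_insert, sd_insert, k=3.0):
    """Return False when the minimum implied insert size is more than
    k standard deviations above the expected insert size."""
    threshold = mean_insert + k * sd_insert
    return min_implied_insert(dist_to_end_1, dist_to_end_2) <= threshold


# Library parameters from the legend: insert size 1395, std dev 250,
# so the cutoff is 1395 + 3 * 250 = 2145 bp.
far_pair = keep_pair(2000, 500, mean_insert=1395, sd_insert=250)
near_pair = keep_pair(1000, 500, mean_insert=1395, sd_insert=250)
```

Here `far_pair` is rejected because 2500 bp exceeds the 2145 bp cutoff, while `near_pair` (1500 bp) is kept.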
Contig Graph and Orientation ILP
Contig graph
• Contigs are vertices
• Edges defined using redundancy parameters r and d
– Read pairs between contigs i and j can be divided into two classes: those consistent with the contigs’ current orientation, and those consistent with switching the orientation of one contig
– Contigs i and j are connected if:
  • There are at least r pairs with reads mapped between i and j
  • The ratio between the sizes of the larger and smaller consistency classes exceeds d
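The edge rule above can be written as a small predicate. This is a sketch under stated assumptions: `has_edge` is a hypothetical name, "exceeds d" is read as a strict inequality, and an empty smaller class is treated as an unbounded ratio (the test is written as a product to avoid dividing by zero).

```python
def has_edge(n_consistent, n_inconsistent, r, d):
    """Decide whether contigs i and j are connected in the contig graph.

    n_consistent / n_inconsistent: sizes of the two consistency classes
    of read pairs between the two contigs.
    r: minimum number of supporting pairs (redundancy).
    d: minimum ratio between the larger and smaller class.
    """
    if n_consistent + n_inconsistent < r:
        return False
    larger = max(n_consistent, n_inconsistent)
    smaller = min(n_consistent, n_inconsistent)
    # Equivalent to larger / smaller > d, but safe when smaller == 0.
    return larger > d * smaller


strong = has_edge(6, 1, r=3, d=2)      # 7 pairs, ratio 6 > 2
ambiguous = has_edge(3, 2, r=3, d=2)   # 5 pairs, ratio 1.5 does not exceed 2
too_few = has_edge(2, 0, r=3, d=2)     # only 2 pairs, below r
```

Under this reading, raising r discards weakly supported links, and raising d discards links whose read pairs disagree about the relative orientation.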
Integer Linear Program (ILP)
• Typically it is not possible to find a contig orientation consistent with all read pairs
• The following integer linear program (ILP) finds a contig orientation that minimizes the number of inconsistent pairs
• 0/1 variable S_i indicates the final orientation of contig i (S_i = 1 iff contig i is flipped)
• h_ij and u_ij denote the number of read pairs between contigs i and j that are consistent, resp. inconsistent, with the current orientation
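To make the objective concrete, the sketch below evaluates it by exhaustive search over all 0/1 assignments instead of calling an ILP solver, so it is only usable on tiny inputs. It assumes that flipping exactly one endpoint of an edge swaps the consistent and inconsistent classes, which follows from the definitions above; the function names and the toy counts are hypothetical.

```python
from itertools import product


def inconsistent_pairs(flip, edges):
    """Number of read pairs left inconsistent by a 0/1 flip assignment.

    edges maps (i, j) to (h_ij, u_ij): pairs consistent, resp.
    inconsistent, with the current orientation of contigs i and j.
    """
    total = 0
    for (i, j), (h, u) in edges.items():
        # Flipping exactly one of i, j swaps the two classes.
        total += h if flip[i] != flip[j] else u
    return total


def best_orientation(contigs, edges):
    """Exhaustive search over all flip assignments (exponential time)."""
    best_cost, best_flip = None, None
    for bits in product((0, 1), repeat=len(contigs)):
        flip = dict(zip(contigs, bits))
        cost = inconsistent_pairs(flip, edges)
        if best_cost is None or cost < best_cost:
            best_cost, best_flip = cost, flip
    return best_cost, best_flip


# Toy instance: contig A disagrees with both B and C as currently oriented.
toy_edges = {("A", "B"): (1, 5), ("B", "C"): (4, 0), ("A", "C"): (0, 3)}
cost, flip = best_orientation(["A", "B", "C"], toy_edges)
```

In the toy instance, leaving every contig as is gives 8 inconsistent pairs, while flipping A relative to B and C leaves 1. Flipping every contig at once never changes the cost, so optimal solutions always come in complementary pairs.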
Experimental Setup
Contig sets
• Developed a test set consisting of contigs from chr21 of the HuRef assembly, and from the de novo assembly of a subset of HuRef Sanger reads (~4x average coverage)
• True orientation of assembled contigs found by mapping them to the reference genome; we retained all contigs with at least 50% contiguous alignment
Read pair data
• 480 million 50 bp SOLiD mate pairs mapped against contigs using Bowtie
• Reads mapped allowing 1 mismatch in the seed, with the sum of Phred quality scores of mismatches < 80, reporting the 10 best alignments
• We compared the orientation step of our algorithm to the BAMBUS2 orientation tool
• Both filtered and unfiltered read pairs were given to BAMBUS2 and the ILP for orientation *
* BAMBUS2 has its own repeat filtering capabilities; to facilitate comparison, pairs given to BAMBUS2 were also filtered using our repeat annotations
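For readers who want to reproduce a similar mapping run, the helper below assembles a Bowtie 1 command line that mirrors the three stated parameters. It is an approximation, not the command the authors ran: the index and file names are placeholders, colorspace and mate-pair options are left out, and `-e 80` is Bowtie's cap on the summed mismatch qualities, which may differ slightly from a strict "< 80".

```python
def bowtie_command(index, reads, out_sam):
    """Build an argument list for Bowtie 1 (names are placeholders)."""
    return [
        "bowtie",
        "-n", "1",    # at most 1 mismatch in the seed
        "-e", "80",   # cap on summed Phred quality at mismatched positions
        "-k", "10",   # report up to 10 alignments per read
        "--best",     # report them in best-to-worst order
        "-S",         # write SAM output
        index,
        reads,
        out_sam,
    ]


cmd = bowtie_command("contigs_index", "reads.fastq", "mapped.sam")
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once Bowtie is installed and the index has been built.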
Contig Graphs for Chr21 HuRef Contigs
[Figure panels: unfiltered, r=2, d=2; filtered, r=2, d=2]
• Visualization of contig graphs for Chr21 HuRef contigs generated using Graphviz
• Connected components consisting of fewer than 5 contigs not displayed
• Red edges indicate at least one read pair inconsistent with the current contig orientation
• HuRef contigs represent an “ideal” dataset; contigs are long and contain few errors
• Filtering makes little difference in this case
Results for 5135 Chr21 Contigs from 4x Assembly
Algorithm  Pair Filtering  # pairs  r  delta  Singletons  Correct Orientation  Incorrect Orientation
bambus2    No              573984   1  *      375         2824                 1936
bambus2    Yes             497907   1  *      450         2903                 1782
ILP        No              573984   1  1
ILP        No              573984   1  2
ILP        Yes             497907   1  1
ILP        Yes             497907   1  2
bambus2    No              573984   2  *      643         2696                 1796
bambus2    Yes             497907   2  *      678         2885                 1572
ILP        No              573984   2  1      561         2893                 1681
ILP        No              573984   2  2      561         2855                 1719
ILP        Yes             497907   2  1      614         3522                 999
ILP        Yes             497907   2  2      614         3469                 1052
bambus2    No              573984   3  *      771         2757                 1607
bambus2    Yes             497907   3  *      811         2865                 1459
ILP        No              573984   3  1      671         3405                 1059
ILP        No              573984   3  2      671         3398                 1066
ILP        Yes             497907   3  1      708         3837                 590
ILP        Yes             497907   3  2      708         3798                 629
Results for 667 HuRef Chr21 Contigs
Algorithm  Pair Filtering  # pairs  r  delta  Singletons  Correct Orientation  Incorrect Orientation
bambus2    No              231257   1  *      50          350                  267
bambus2    Yes             166571   1  *      54          333                  280
ILP        No              231257   1  1      44          377                  246
ILP        No              231257   1  2      44          377                  246
ILP        Yes             166571   1  1      52          390                  225
ILP        Yes             166571   1  2      52          435                  180
bambus2    No              231257   2  *      101         381                  185
bambus2    Yes             166571   2  *      139         371                  157
ILP        No              231257   2  1      78          585                  4
ILP        No              231257   2  2      78          587                  2
ILP        Yes             166571   2  1      99          567                  1
ILP        Yes             166571   2  2      99          567                  1
bambus2    No              231257   3  *      160         359                  148
bambus2    Yes             166571   3  *      231         319                  117
ILP        No              231257   3  1      107         556                  4
ILP        No              231257   3  2      107         558                  2
ILP        Yes             166571   3  1      145         522                  0
ILP        Yes             166571   3  2      145         522                  0
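One way to compare rows of the two tables is the fraction of oriented contigs that were oriented correctly. The snippet below computes that for a few filtered rows copied from the tables (ILP rows use delta = 1). Excluding singletons from the denominator is a choice made for this illustration, not a metric defined in the poster.

```python
def orientation_accuracy(correct, incorrect):
    """Fraction of oriented (non-singleton) contigs oriented correctly."""
    return correct / (correct + incorrect)


# (dataset, algorithm, r) -> (correct, incorrect), filtered pairs only.
rows = {
    ("4x", "bambus2", 3): (2865, 1459),
    ("4x", "ILP", 3): (3837, 590),
    ("HuRef", "bambus2", 2): (371, 157),
    ("HuRef", "ILP", 2): (567, 1),
}

accuracy = {key: orientation_accuracy(c, w) for key, (c, w) in rows.items()}
for key, value in accuracy.items():
    print(key, round(value, 3))
```

On the 4x contigs at r=3 this gives about 0.87 for the ILP against about 0.66 for bambus2. The singleton column should be read alongside it, since leaving more contigs unoriented tends to raise this fraction.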
Discussion and Future Work
HuRef Contigs
• In this ideal dataset our filtering seems detrimental to BAMBUS at r=1,2. Altering r seems to have little effect on the number of correct edges.
• The ILP is only slightly affected by filtering.
• At r=1, our ILP performs comparably to BAMBUS. Higher redundancy greatly improves ILP accuracy.
• The d parameter has little effect.
4x Contigs
• In this more realistic dataset, filtering consistently helps both BAMBUS and the ILP at all redundancies.
• Redundancy threshold has little effect on BAMBUS, but a significant effect on the ILP.
• A higher redundancy is important, although further investigation into its relationship with read coverage is necessary.
• The d parameter is sometimes detrimental.
Future Work
Orientation is only part of the scaffolding problem; in ongoing work we are developing ordering and placement algorithms. We will also explore the effect of varying assembly and read coverage on the ability to accurately scaffold draft genomes.
Acknowledgments: This work has been supported in part by NSF awards IIS-0546457, IIS-0916401, IIS-0953563, and IIS-0916948.
Scaffolding Draft Genomes Using Paired Sequencing Data
James Lindsay, Jin Zhang, Thomas Farnham, Yufeng Wu, Rachel O’Neill, Ion Mandoiu (University of Connecticut)
Edward Bullwinkel, Hamed Salooti, Alex Zelikovsky (Georgia State University)
Contig Graphs for Chr21 4x Contigs
[Figure panels: unfiltered, r=3, d=1; filtered, r=3, d=1]
• 4x contigs are more representative of typical genome projects
• Contigs are much shorter and they create more complex structures
• Unfiltered graph has more edges inside a highly interconnected component
• Filtering separates some linear chains of contigs from the large interconnected component