Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.
![Page 1: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/1.jpg)
Accurate Assembly of Maize BACs
Patrick S. Schnable, Srinivas Aluru
Iowa State University
![Page 2: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/2.jpg)
Motivation
• Maize genome is more complex than previously sequenced genomes
  – Many high-copy, long, highly conserved repeats
  – Genome contains many NIPs (Nearly Identical Paralogs: low-copy genes that are expressed and >98% identical; Emrich et al., 2007) (= CNPs and CNV)
• Hence, assembling this genome presents new challenges
• Are existing assembly programs up to the task?
![Page 3: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/3.jpg)
Evidence of Assembly Errors
• Wash U noticed examples of collapse of repeats
• ISU identified examples of NIP collapse
![Page 4: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/4.jpg)
Terms
[Figure: example base differences (A/C, A/T, G/C); labels B73 and Mo17]
• SNP: single nucleotide polymorphism between alleles of a single gene
• Paramorphism (PM): a single nucleotide substitution between paralogs
• Nearly Identical Paralogs (NIPs): paralogous sequences with >99% identity
![Page 5: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/5.jpg)
Paramorphisms Provide Evidence of NIPs
![Page 6: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/6.jpg)
Frequency of NIPs
• Conservatively ~1% of maize genes have NIPs (Emrich et al., 2007)
• Inspection of assembled BACs reveals NIP clusters
• In addition, we also detect examples of “NIP collapse”
• CNPs/CNV associated with adaptive evolution in humans (Perry et al., Nat. Genetics, 2007)
![Page 7: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/7.jpg)
BAC Assembly, Example 1
• MAGI3.1 ID: MAGI_18749 (Emrich et al., 2007)
• BAC ID: CH201-140C17
Paramorphic Sites: C/T (1,175), C/T (1,293), C/T (1,359)
CH201-140C17: gi|146322123|gb|AC203431.1 (152,054 bp)
GenBank positions: 56,572 to 55,984 (589 bp)
![Page 8: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/8.jpg)
BAC Assembly Example 1 - Site #1
BAC ID: CH201-140C17
GI: 146322123
GB: AC203431.1
152,054 bp
MAGI_18749
Paramorphic Site #1: C/T (1,175)
2 C vs 2 T
“Consensus Base”
2/7 assembled BACs known to contain NIPs exhibit evidence of NIP collapse (conservative)
![Page 9: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/9.jpg)
Traditional Assembly
• Sequence alignments between reads are identified
• Construct contigs
  – Start at a good alignment
  – Extend ends of contig one sequence at a time
• Clone pair information is used to scaffold contigs after contig construction.
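The greedy extension step described on this slide can be illustrated with a minimal Python sketch. It is not the code of any particular assembler: the function names `overlap` and `greedy_extend`, the exact-match overlap rule, and the `min_len` threshold are assumptions made for illustration, and the sketch extends only the right end of a contig seeded from the first read.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_extend(reads, min_len=3):
    """Seed a contig with the first read, then extend its right end
    one read at a time, always taking the read with the longest overlap."""
    contig = reads[0]
    unused = list(reads[1:])
    while True:
        best = max(unused, key=lambda r: overlap(contig, r, min_len), default=None)
        if best is None or overlap(contig, best, min_len) == 0:
            return contig, unused
        contig += best[overlap(contig, best, min_len):]
        unused.remove(best)


contig, leftover = greedy_extend(["ACGTAC", "TACGGA", "GGATTC"])
print(contig, leftover)
```

Because each step looks only at sequence overlap, two reads from different copies of a repeat or NIP look just as joinable as two reads from the same copy, which is how collapse can arise in this style of assembly.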
![Page 10: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/10.jpg)
Our Approach
• Integrate clone pair data into contig assembly process
• Model sequence alignments & clone pairs as a graph.
First, construct an alignment graph
• Sequence reads are nodes
• A black edge is drawn between a pair of nodes if there is a valid sequence alignment
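As a rough illustration of the alignment graph, the Python sketch below stores reads as nodes and adds an undirected (black) edge whenever two reads align. The slide does not define what counts as a valid alignment, so `has_valid_alignment` is a hypothetical stand-in that accepts any exact suffix/prefix overlap of at least `min_overlap` bases; a real pipeline would use a proper aligner with quality-aware thresholds.

```python
from itertools import combinations


def has_valid_alignment(a, b, min_overlap=4):
    """Hypothetical validity test: exact suffix/prefix overlap in either order."""
    def ov(x, y):
        longest = min(len(x), len(y))
        return any(x.endswith(y[:n]) for n in range(min_overlap, longest + 1))
    return ov(a, b) or ov(b, a)


def build_alignment_graph(reads, min_overlap=4):
    """reads maps read name -> sequence; returns adjacency sets of black edges."""
    graph = {name: set() for name in reads}
    for u, v in combinations(reads, 2):
        if has_valid_alignment(reads[u], reads[v], min_overlap):
            graph[u].add(v)
            graph[v].add(u)
    return graph


reads = {"r1": "ACGTACGG", "r2": "ACGGTTCA", "r3": "GGGGGGGG"}
print(build_alignment_graph(reads))
```

The all-pairs loop is quadratic in the number of reads, which is acceptable for a single BAC with roughly a thousand reads but would need indexing to scale further.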
![Page 11: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/11.jpg)
Clone Pair Informed Assembly
Second, introduce two additional types of edges into the graph
• Clone pair edges (red)
• Path edges (green)
A path edge exists between two nodes if:
  • they are close together in the graph
  • AND their clone pairs are also close together
Identifies assembly-relevant sequence alignments
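The path-edge rule can be sketched in Python as follows. The slide does not say how "close together" is measured, so this sketch assumes it means a bounded number of hops over black edges; `bfs_distance`, `path_edges`, the `mate` dictionary (read name to its clone-pair mate) and the `limit` parameter are all hypothetical names chosen for illustration.

```python
from collections import deque
from itertools import combinations


def bfs_distance(graph, src, dst, limit):
    """Hops from src to dst over black edges, or None if more than limit."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == limit:
            continue  # do not expand beyond the hop limit
        for nxt in graph[node]:
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_edges(graph, mate, limit=2):
    """Green edges: pairs of reads that are close AND whose mates are close."""
    edges = set()
    for u, v in combinations(sorted(graph), 2):
        if u not in mate or v not in mate or mate[u] == v:
            continue  # assumption: skip unpaired reads and a read with its own mate
        if (bfs_distance(graph, u, v, limit) is not None
                and bfs_distance(graph, mate[u], mate[v], limit) is not None):
            edges.add((u, v))
    return edges


# Toy example: reads a1-a2-a3 overlap, as do their mates b1-b2-b3.
# Read x aligns to a1 (e.g. via a repeat), but its mate y is nowhere near b1.
graph = {"a1": {"a2", "x"}, "a2": {"a1", "a3"}, "a3": {"a2"},
         "b1": {"b2"}, "b2": {"b1", "b3"}, "b3": {"b2"},
         "x": {"a1"}, "y": set()}
mate = {"a1": "b1", "b1": "a1", "a2": "b2", "b2": "a2",
        "a3": "b3", "b3": "a3", "x": "y", "y": "x"}
print(sorted(path_edges(graph, mate, limit=1)))
```

In the toy data the black edge between `a1` and `x` gets no green edge because the mates `b1` and `y` are not close, which is the sense in which path edges pick out the assembly-relevant alignments.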
![Page 12: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/12.jpg)
Repeat Example
![Page 13: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/13.jpg)
Our Approach
• Series of graph transformations to ensure black edges (sequence alignments) represent correct genomic overlaps, and resolve entries into and exits out of repeats.
  – Use clone pairs to validate alignments in repeat regions if the corresponding mate pairs are anchored to unique regions and exhibit alignment.
  – Use paramorphisms to break spurious alignments due to NIPs.
  – Use clone pairs to match entries into and exits out of repeats.
  – Use clone pairs and validated alignments to guide contigs.
  – Use graph min-cuts to find correct assignment of reads to the complementary strands.
  – Use graph reductions and visualization for further analysis.
![Page 14: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/14.jpg)
Example: Use Paramorphisms to Break Spurious Alignments
GTCT A CAG
GTCT A CAG
GTCT A CAG
GTCT C CAG
GTCT C CAG
GTCT C CAG
GTCT C CAG

GTCT A CAG
GTCT A CAG
GTCT A CAG

GTCT C CAG
GTCT C CAG
GTCT C CAG
GTCT C CAG
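A minimal Python sketch of this idea, using the seven reads from the slide: a column is treated as paramorphic when at least two different bases are each supported by `min_support` or more reads, and any alignment edge joining reads that disagree at such a column is dropped. The function names, the `min_support` threshold, and the assumption that reads are already gap-aligned to equal length are illustrative choices, not details given in the talk.

```python
from collections import Counter


def paramorphic_columns(aligned_reads, min_support=2):
    """Columns where two or more bases each have at least min_support reads."""
    length = len(next(iter(aligned_reads.values())))
    columns = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_reads.values() if seq[i] != "-")
        if sum(1 for c in counts.values() if c >= min_support) >= 2:
            columns.append(i)
    return columns


def break_spurious_edges(edges, aligned_reads, min_support=2):
    """Keep only edges whose two reads agree at every paramorphic column."""
    columns = paramorphic_columns(aligned_reads, min_support)
    kept = set()
    for u, v in edges:
        a, b = aligned_reads[u], aligned_reads[v]
        if any(a[i] != b[i] and "-" not in (a[i], b[i]) for i in columns):
            continue  # reads come from different paralogs: break the edge
        kept.add((u, v))
    return kept


aligned = {"r1": "GTCTACAG", "r2": "GTCTACAG", "r3": "GTCTACAG",
           "r4": "GTCTCCAG", "r5": "GTCTCCAG", "r6": "GTCTCCAG",
           "r7": "GTCTCCAG"}
edges = {("r1", "r2"), ("r1", "r4"), ("r4", "r5")}
print(paramorphic_columns(aligned), sorted(break_spurious_edges(edges, aligned)))
```

Requiring several reads per base is what separates a paramorphism from an isolated sequencing error, which would normally appear in only one read.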
![Page 15: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/15.jpg)
Three Random “Stage 3” BACs
• Shotgun sequences extracted from GenBank and trimmed
| Name   | Reads | Post Trim | Corrupt Quality Info |
|--------|-------|-----------|----------------------|
| 273D22 | 1402  | 1352      | 5                    |
| 306N19 | 1396  | 1310      | 1                    |
| 396H10 | 1391  | 1337      | 33                   |
![Page 16: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/16.jpg)
273D22
• Annotate paths by walking through the graph.
• Make use of three levels of pointers:
  – Black edges: show what steps are available
  – Green edges: indicate the best path
  – Red edges: indicate our final destination
![Page 17: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/17.jpg)
273D22: Incorrect Contiging
[Figure labels: Contig 0, Contig 1]
Contig 1 is a small contig in the finished BAC that contains sequences that should be attached to the end of Contig 0.
![Page 18: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/18.jpg)
273D22: Missing Scaffold
![Page 19: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/19.jpg)
306N19: Mis-assembly
[Figure labels: Contig 0, Contig 3, Contig 4, Contig 5]
![Page 20: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/20.jpg)
306N19: Complex Repeat
![Page 21: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/21.jpg)
D396H10: Missed Scaffolding
[Figure labels: Contig 5, Contig 6, Contig 8]
![Page 22: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/22.jpg)
D396H10: Missed Scaffolding
[Figure labels: Contig 2, Contig 3, Contig 7]
![Page 23: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/23.jpg)
Identifying Assembly Errors???
![Page 24: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/24.jpg)
273D22: Weak Link not Corroborated by Clone Pairs
[Figure labels: Contig 3]
![Page 25: Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.](https://reader033.fdocuments.us/reader033/viewer/2022051412/5514a32f550346f06e8b5b2e/html5/thumbnails/25.jpg)
Conclusions & Future Directions
• Discovered misassembled regions in all three randomly chosen BACs
  – Conclusions supported by multiple lines of evidence (clone pair + overlap)
  – Mis-assemblies (e.g., repeat-induced “knots”; collapsed repeats & NIPs) and missed scaffolding
• Benefits of our approach
  – Can provide better assemblies
    • Can navigate through repeats
    • Can correctly assemble NIPs
  – With development could output contigs and perform scaffolding in one step
  – Could provide refined finishing advice
  – Could include a community-accessible visualization of assembled BAC contigs and supporting data (confidence levels)
• Longer term
  – Our assembly approach could be applied to whole genome assembly of maize and other complex genomes
  – Could incorporate paired next generation sequencing data (e.g. 454, Solexa, SOLiD)
• Needed research
  – Random collection of finished BACs (“truth”)
  – Develop algorithms for navigating paths through the graph
  – Accurately construct final contigs that contain multiple copies of repeats
  – Create BAC re-assembly pipeline (inform finishing efforts in future sequencing projects)
  – Scale approach to whole genome level