Whole Genome Assembly
![Page 1: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/1.jpg)
Whole Genome Assembly
![Page 2: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/2.jpg)
WGA (Whole Genome Assembly)
1. Screener
2. Overlapper
3. Unitigger
4. Scaffolder
5. Repeat Resolver
![Page 3: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/3.jpg)
Overlapper
...looks for end-to-end overlaps of at least 40 bp with no more than 6% differences in the match.
What’s the significance? ...a 1 in 10^17 event by chance.
Sequencing Fidelity: 99.96%
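The overlap criterion on this slide can be illustrated with a small sketch. It is not the assembler's actual overlapper: it assumes substitutions only (no insertions or deletions), and the function name `best_overlap` and its parameters are hypothetical. It looks for the longest suffix of one read that matches a prefix of another with at least 40 bp and at most 6% differences.

```python
def best_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_diff` mismatching positions (as a fraction of the
    overlap length). Returns 0 if no overlap of at least `min_len` exists.

    Simplification (assumption): only substitutions are counted, so reads
    that differ by an insertion or deletion will not be matched.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0


if __name__ == "__main__":
    read_1 = "C" * 10 + "A" * 50
    read_2 = "A" * 50 + "G" * 10
    print(best_overlap(read_1, read_2))
```

Trying the longest candidate first means the function reports the largest overlap that still satisfies the 6% limit; a production overlapper would use seeded alignment rather than this quadratic scan.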
![Page 4: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/4.jpg)
However
...the Screener doesn’t include all of the “low frequency” level repeats,
...so, a majority of the Overlapper outputs are bogus.
![Page 5: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/5.jpg)
Unitigger
...differentiates between a true overlap and an overlap that includes more than one locus.
![Page 6: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/6.jpg)
8X
...over-collapsed.
...in a world where real data matches expected data, each locus would have 8X coverage,
...if there were repeats, the collapsed contigs would be “over-represented”: on average 8X more coverage per additional repeat copy.
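The coverage argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the read length of 500 bp, the threshold of 1.5, and the function names are assumptions, and a real assembler would apply a proper statistical test rather than a simple ratio. It compares the number of reads in a contig with the number expected at 8X coverage.

```python
def estimated_copies(n_reads, contig_len, coverage=8.0, read_len=500):
    """Ratio of observed reads to the reads expected for a unique locus.

    At `coverage`-fold sequencing, a unique region of `contig_len` bases
    should contain about coverage * contig_len / read_len reads.
    A ratio near 1 suggests a unique locus; near 2 suggests two
    repeat copies collapsed into one contig, and so on.
    """
    expected_reads = coverage * contig_len / read_len
    return n_reads / expected_reads


def is_over_collapsed(n_reads, contig_len, threshold=1.5, **kwargs):
    """Flag a contig whose read count is well above the unique-locus expectation."""
    return estimated_copies(n_reads, contig_len, **kwargs) > threshold


if __name__ == "__main__":
    print(estimated_copies(160, 10_000))   # unique locus at 8X
    print(estimated_copies(320, 10_000))   # two collapsed copies
    print(is_over_collapsed(320, 10_000))
```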
![Page 7: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/7.jpg)
What Now?
... uniquely assembled contigs (unitigs) are readily identifiable,
– all of the assembled sequences match over all of the known sequence,
- and -
...are consistent with 8X coverage.
![Page 8: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/8.jpg)
Unitigs
...contig cluster is consistent with expected size,
...no dissimilar sequences between any members.
...all other contigs are sent to the Discriminator.
![Page 9: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/9.jpg)
Discriminator
...parses the “over-collapsed” contig by using sequence outside of the overlap region
![Page 10: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/10.jpg)
Discriminator
...may yield unitigs.
![Page 11: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/11.jpg)
Unitigger Output
...correctly assembled contigs covering 73.6% of the genome.
![Page 12: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/12.jpg)
Repeat Resolver
...most of the remaining gaps were due to repeats.
1. Allow “low Discriminator Value” contigs to fill gaps,
2. Find BAC sequences that unambiguously match outside the nearest unitig,
– 1 in 10^7 chance of being wrong,
3. Ensure the mate-end sequences of candidate BACs match.
![Page 13: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/13.jpg)
If that Doesn’t Work
...find a mate-pair that spans the gap, and sequence it,
Chromosome Walking
...make sequencing primer from BES...
![Page 14: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/14.jpg)
Scaffolder
...assembles the contigs themselves into larger ordered, oriented scaffolds,
– uses mate-pair information.
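One way to picture how mate-pair information joins contigs is to count the mate pairs whose two reads land in different contigs. This is a minimal sketch under stated assumptions: the data layout, the function name `contig_links`, and the requirement of at least two supporting pairs are all hypothetical, and orientation and insert-size checks are left out.

```python
from collections import Counter


def contig_links(read_to_contig, mate_pairs, min_links=2):
    """Count mate pairs that bridge two different contigs.

    read_to_contig: dict mapping read id -> contig id
    mate_pairs:     iterable of (read_id, read_id) tuples
    Returns a dict {(contig_a, contig_b): link_count} holding only the
    contig pairs supported by at least `min_links` mate pairs.
    """
    counts = Counter()
    for read_1, read_2 in mate_pairs:
        c1 = read_to_contig.get(read_1)
        c2 = read_to_contig.get(read_2)
        if c1 is None or c2 is None or c1 == c2:
            continue  # unplaced read, or both mates in the same contig
        counts[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


if __name__ == "__main__":
    placement = {"r1": "A", "r2": "B", "r3": "A", "r4": "B", "r5": "A", "r6": "C"}
    mates = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
    print(contig_links(placement, mates))
```

Requiring more than one supporting pair guards against a single chimeric or misplaced clone joining two unrelated contigs.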
![Page 15: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/15.jpg)
WGA Result
...91% sequence, 9% gaps.
![Page 16: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/16.jpg)
Compartmentalized Shotgun Assembly
Mapping
![Page 17: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/17.jpg)
Scaffolds
![Page 18: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/18.jpg)
Sequence Tagged Sites (STS)
...PCR primers are designed for unique regions of the genome or chromosome,
...the chromosome is cut into fragments,
...assay two PCR products; the frequency of co-amplification indicates how close together the two sites are.
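A co-amplification frequency can be computed from a table of PCR results. The sketch below assumes each STS has been scored as present or absent in the same series of fragment pools, and it defines the frequency as the share of pools containing either marker that contain both; that definition and the function name are assumptions made for illustration, not something stated on the slide.

```python
def coamplification_frequency(sts_a, sts_b):
    """Fraction of pools positive for either STS that are positive for both.

    sts_a, sts_b: equal-length sequences of booleans, one entry per
    fragment pool (True = the PCR product was detected in that pool).
    Markers that lie close together should be retained on the same
    fragment more often, giving a value nearer 1.
    """
    both = sum(1 for a, b in zip(sts_a, sts_b) if a and b)
    either = sum(1 for a, b in zip(sts_a, sts_b) if a or b)
    return both / either if either else 0.0


if __name__ == "__main__":
    marker_1 = [True, True, False, True]
    marker_2 = [True, False, False, True]
    print(coamplification_frequency(marker_1, marker_2))
```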
![Page 19: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/19.jpg)
Sequence Tagged Sites (STS)
![Page 20: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/20.jpg)
Compartmentalized Shotgun Assembly
...ideally 24 components (one per chromosome),
...really 3845.
![Page 21: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/21.jpg)
CSA: 92.2% sequence, 7.8% gaps
WGA: 91% sequence, 9% gaps
![Page 22: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/22.jpg)
PFP
Chromosome 21
CSA
Green: same order and orientation
Yellow: same orientation
Red: out of order and orientation
Blue: gaps
Violations: red = misoriented; yellow = distance
![Page 23: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/23.jpg)
Chromosome 8
PFP
CSA
![Page 24: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/24.jpg)
PFP
CSA
![Page 25: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/25.jpg)
Major Public Sequence Databases
![Page 26: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/26.jpg)
• 281 curated databases,
• “... facilitating Biological Discovery”.
![Page 27: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/27.jpg)
What Do We Know? (based on functional group analysis)
Science 291 (5507), 1304-1351
![Page 28: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/28.jpg)
Functional Groups
1st: The GenBank NR protein database was partitioned into clusters using BLASTP,
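The slide does not say how the BLASTP results were turned into clusters. One common approach, shown here purely as an assumption, is single-linkage clustering: any two proteins connected by a sufficiently significant hit end up in the same cluster. The function name, the input format, and the E-value cut-off of 1e-10 are all hypothetical.

```python
def cluster_by_hits(proteins, hits, max_evalue=1e-10):
    """Single-linkage clustering of proteins from pairwise BLASTP hits.

    proteins: iterable of protein ids
    hits:     iterable of (query_id, subject_id, evalue) tuples
    Returns a list of sets, one per cluster, in a deterministic order.
    """
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            parent[find(query)] = find(subject)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=sorted)


if __name__ == "__main__":
    ids = ["p1", "p2", "p3", "p4"]
    blast_hits = [("p1", "p2", 1e-50), ("p2", "p3", 1e-20), ("p3", "p4", 0.5)]
    print(cluster_by_hits(ids, blast_hits))
```

Single linkage is simple but can chain unrelated families together through multi-domain proteins, which is one reason a later statistical description of each cluster is useful.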
![Page 29: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/29.jpg)
Describing Aligned Sequences
2nd: Statistical descriptions of each cluster are developed and tested,
• Hidden Markov Models (HMMs): statistical descriptions of aligned sequences.
![Page 30: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/30.jpg)
Functional Group Annotation
3rd: Categorization was done by manual review of the family and subfamily names,
...by examining SwissProt and GenBank records,
...and by review of the literature as well as resources on the World Wide Web.
http://www.expasy.ch/cgi-bin/niceprot.pl?P29965
![Page 31: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/31.jpg)
Outcomes?
• A relatively small number of structural and functional domains are used in a large number of different proteins,
• Pfam: 527 families,
• average length is 275 residues,
• 456 had “annotated functions”.
Nucleic Acids Research 26, 320-322
![Page 32: Whole Genome Assembly](https://reader030.fdocuments.us/reader030/viewer/2022012916/56812a5f550346895d8dcd98/html5/thumbnails/32.jpg)
New Genes
4th: Newly sequenced genes are translated in silico, and the predicted proteins are assayed against raw and HMM databases,
...significance cut-off levels are determined for each functional group family.
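The translation step can be sketched in a few lines using the standard genetic code. This covers only the conversion of a coding sequence into a predicted protein; the database and HMM searches that follow are not shown. The function name `translate` and the choice to stop at the first stop codon are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, reading frame 1, up to the first stop.

    Codons containing characters other than A, C, G, T become 'X'.
    Trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```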