High performance computational analysis of DNA sequences from different environments
-
Upload
ursula-moreno -
Category
Documents
-
view
33 -
download
1
description
Transcript of High performance computational analysis of DNA sequences from different environments
![Page 1: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/1.jpg)
High performance computational analysis of
DNA sequences from different environments
Rob Edwards
Computer ScienceBiology
edwards.sdsu.edu www.theseed.org
![Page 2: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/2.jpg)
Outline
There is a lot of sequence Tools for analysis More computers Can we speed analysis
![Page 3: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/3.jpg)
Firstbacterial genome
100bacterial genomes
1,000bacterial genomes
Num
ber
of
know
n s
equence
s
Year
How much has been sequenced?
Environmentalsequencing
![Page 4: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/4.jpg)
Everybody inSan Diego
Everybody inUSA
AllculturedBacteria
100people
How much will be sequenced?
One genome fromevery species
Most majormicrobial environments
![Page 5: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/5.jpg)
Metagenomics(Just sequence it)
200 liters water 5-500 g fresh fecal matter50 g soil
Sequence
Epifluorescent Microscopy
Concentrate and purify bacteria, viruses, etc
Extract nucleic acids
Publish papers
![Page 6: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/6.jpg)
The SEED Family
![Page 7: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/7.jpg)
The metagenomics RAST server
![Page 8: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/8.jpg)
Automated Processing
![Page 9: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/9.jpg)
Summary View
![Page 10: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/10.jpg)
Metagenomics ToolsAnnotation & Subsystems
![Page 11: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/11.jpg)
Metagenomics ToolsPhylogenetic Reconstruction
![Page 12: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/12.jpg)
Metagenomics ToolsComparative Tools
![Page 13: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/13.jpg)
Outline
There is a lot of sequence Tools for analysis More computers
![Page 14: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/14.jpg)
How much data so far
986 metagenomes
79,417,238 sequences
17,306,834,870 bp (17 Gbp)
Average: ~15-20 M bp per genome
~300 GS20~300 FLX~300 Sanger
![Page 15: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/15.jpg)
Computes
![Page 16: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/16.jpg)
Linear compute complexity
![Page 17: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/17.jpg)
Just waiting …
![Page 18: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/18.jpg)
Hours
of
Com
pute
Tim
e
Input size (MB)
Overall compute time~19 hours of compute per input megabyte
![Page 19: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/19.jpg)
How much so far
986 metagenomes
79,417,238 sequences
17,306,834,870 bp (17 Gbp)
Average: ~15-20 M bp per genome
Compute time (on a single CPU):
328,814 hours = 13,700 days = 38 years
~300 GS20~300 FLX~300 Sanger
![Page 20: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/20.jpg)
Outline
There is a lot of sequence Tools for analysis More computers Can we speed analysis
![Page 21: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/21.jpg)
Shannon’s Uncertainty
• Shannon’s Uncertainty – Peter’s surprisal
p(xi) is the probability of the occurrence of each base or string
![Page 22: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/22.jpg)
Surprisal in Sequences
![Page 23: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/23.jpg)
Uncertainty Correlates With Similarity
![Page 24: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/24.jpg)
But it’s not just randomness…
![Page 25: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/25.jpg)
Which has more surprisal:coding regions or non-coding regions?
Uncertainty in complete genomes
Coding regions Non-coding regions
![Page 26: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/26.jpg)
More extreme differences with 6-mers
Coding regions Non-coding regions
![Page 27: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/27.jpg)
Can we predict proteins
• Short sequences of 100 bp
• Translate into 30-35 amino acids
• Can we predict which are real and could be doing something?
• Test with bacterial proteins
![Page 28: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/28.jpg)
Kullback-Leibler Divergence
Difference between two probability distributions
Difference between amino acid composition and average amino acid composition
Calculate KLD for 372 bacterial genomes
![Page 29: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/29.jpg)
KLD varies by bacteria
Colored by taxonomy of the bacteria
![Page 30: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/30.jpg)
KLD varies by bacteria
![Page 31: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/31.jpg)
Most divergent genomes
• Borrelia garinii – Spirochaetes
• Mycoplasma mycoides – Mollicutes
• Ureaplasma parvum – Mollicutes
• Buchnera aphidicola – Gammaproteobacteria
• Wigglesworthia glossinidia – Gammaproteobacteria
![Page 32: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/32.jpg)
Divergence and metabolism
Bifidobacterium
Bacillus
Nostoc
Salmonella
Chlamydophila
Mean of all bacteria
![Page 33: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/33.jpg)
Divergence and amino acids
UreaplasmaWigglesworthia
BorreliaBuchnera
Mycoplasma
Bacteria meanArchaea mean
Eukaryotic mean
![Page 34: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/34.jpg)
Predicting KLD
y = 2x2-2x+0.5
KLD
per
gen
ome
Percent G+C
![Page 35: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/35.jpg)
Summary
• Shannon’s uncertainty could predict useful sequences
• KLD varies too much to be useful and is driven by %G+C content
![Page 36: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/36.jpg)
New solutions for old problems?
![Page 37: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/37.jpg)
Xen and the art of imagery
![Page 38: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/38.jpg)
The cell phone problem
![Page 39: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/39.jpg)
Searching the seed by SMS
1 2 34 5 67 8 9* 0 #
seed search
histidine coli
GMAIL.COM@
AUTOSEEDSEARCHES
edwar
ds.
sdsu
.ed
u
SEEDdatabases
22 proteins in E. coli
) ) ) )))))
Anywhere Idaho GMCS429 Argonne
![Page 40: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/40.jpg)
Challenges
• Too much data
• Not easy to prioritize
• New models for HPC needed
• New interfaces to look at data
![Page 41: High performance computational analysis of DNA sequences from different environments](https://reader037.fdocuments.us/reader037/viewer/2022110212/568136b8550346895d9e59ed/html5/thumbnails/41.jpg)
Acknowledgements
• Sajia Akhter• Rob Schmieder
• Nick Celms• Sheridan Wright
• Ramy Aziz
• FIG • The mg-RAST team
• Rick Stevens
• Peter Salamon• Barb Bailey• Forest Rohwer• Anca Segall