High performance computational analysis of DNA sequences from different environments


Page 1: High  performance computational  analysis of DNA sequences  from  different environments

High performance computational analysis of DNA sequences from different environments

Rob Edwards

Computer Science, Biology

edwards.sdsu.edu, www.theseed.org

Page 2: High  performance computational  analysis of DNA sequences  from  different environments

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

• Can we speed analysis

Page 3: High  performance computational  analysis of DNA sequences  from  different environments

How much has been sequenced?

[Figure: number of known sequences by year. Points marked: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]

Page 4: High  performance computational  analysis of DNA sequences  from  different environments

How much will be sequenced?

[Figure labels: 100 people, everybody in San Diego, everybody in USA, all cultured bacteria, one genome from every species, most major microbial environments]

Page 5: High  performance computational  analysis of DNA sequences  from  different environments

Metagenomics (Just sequence it)

200 liters water, 5-500 g fresh fecal matter, 50 g soil

Sequence

Epifluorescent Microscopy

Concentrate and purify bacteria, viruses, etc

Extract nucleic acids

Publish papers

Page 6: High  performance computational  analysis of DNA sequences  from  different environments

The SEED Family

Page 7: High  performance computational  analysis of DNA sequences  from  different environments

The metagenomics RAST server

Page 8: High  performance computational  analysis of DNA sequences  from  different environments

Automated Processing

Page 9: High  performance computational  analysis of DNA sequences  from  different environments

Summary View

Page 10: High  performance computational  analysis of DNA sequences  from  different environments

Metagenomics Tools: Annotation & Subsystems

Page 11: High  performance computational  analysis of DNA sequences  from  different environments

Metagenomics Tools: Phylogenetic Reconstruction

Page 12: High  performance computational  analysis of DNA sequences  from  different environments

Metagenomics Tools: Comparative Tools

Page 13: High  performance computational  analysis of DNA sequences  from  different environments

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

Page 14: High  performance computational  analysis of DNA sequences  from  different environments

How much data so far

986 metagenomes

79,417,238 sequences

17,306,834,870 bp (17 Gbp)

Average: ~15-20 Mbp per metagenome

~300 GS20, ~300 FLX, ~300 Sanger
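
The per-sample average can be checked by dividing the totals on this slide by the 986 metagenomes. The short Python sketch below does that arithmetic; the variable names are illustrative and not from the talk, and the mean sequence length is a derived figure the slide does not state.

```python
# Totals as stated on the slide.
metagenomes = 986
sequences = 79_417_238
base_pairs = 17_306_834_870

# Derived averages (simple division; no weighting by sequencing platform).
bp_per_metagenome = base_pairs / metagenomes
seqs_per_metagenome = sequences / metagenomes
mean_sequence_length = base_pairs / sequences

print(f"{bp_per_metagenome / 1e6:.1f} Mbp per metagenome")
print(f"{seqs_per_metagenome:,.0f} sequences per metagenome")
print(f"{mean_sequence_length:.0f} bp mean sequence length")
```

The result for base pairs per metagenome falls inside the ~15-20 Mbp range quoted on the slide.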

Page 15: High  performance computational  analysis of DNA sequences  from  different environments

Computes

Page 16: High  performance computational  analysis of DNA sequences  from  different environments

Linear compute complexity

Page 17: High  performance computational  analysis of DNA sequences  from  different environments

Just waiting …

Page 18: High  performance computational  analysis of DNA sequences  from  different environments

Overall compute time

~19 hours of compute per input megabyte

[Figure: hours of compute time versus input size (MB)]

Page 19: High  performance computational  analysis of DNA sequences  from  different environments

How much so far

986 metagenomes

79,417,238 sequences

17,306,834,870 bp (17 Gbp)

Average: ~15-20 Mbp per metagenome

Compute time (on a single CPU):

328,814 hours = 13,700 days = 38 years

~300 GS20, ~300 FLX, ~300 Sanger
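
The single-CPU figure appears consistent with the linear estimate of ~19 compute hours per input megabyte applied to the total base-pair count. The sketch below reproduces it under two assumptions that are not stated in the talk: one byte per base pair, and 1 MB = 1,000,000 bytes truncated to whole megabytes. The `wall_clock_days` helper is hypothetical and assumes perfect scaling with no scheduling or I/O overhead.

```python
HOURS_PER_MB = 19           # slide estimate: ~19 compute hours per input megabyte
total_bp = 17_306_834_870   # total base pairs from the slide

# Assumption: one byte per base pair, 1 MB = 1,000,000 bytes, whole megabytes only.
input_mb = total_bp // 1_000_000

cpu_hours = input_mb * HOURS_PER_MB
cpu_days = cpu_hours / 24
cpu_years = cpu_days / 365


def wall_clock_days(cpus: int) -> float:
    """Idealized elapsed days if the work splits evenly across `cpus`."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return cpu_days / cpus


print(f"{cpu_hours:,} CPU hours = {cpu_days:,.0f} days = {cpu_years:.1f} years")
print(f"{wall_clock_days(1000):.1f} days on 1,000 CPUs (idealized)")
```

Because the cost model is linear in input size, the work is divisible by sample, which is why adding CPUs is the obvious first response.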

Page 20: High  performance computational  analysis of DNA sequences  from  different environments

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

• Can we speed analysis

Page 21: High  performance computational  analysis of DNA sequences  from  different environments

Shannon’s Uncertainty

• Shannon’s Uncertainty – Peter’s surprisal

p(x_i) is the probability of the occurrence of each base or string
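
The formula itself did not survive in this transcript. Assuming the standard definition, H = -sum over i of p(x_i) * log2 p(x_i), a minimal Python sketch for a DNA string looks like the following. The function name, the use of overlapping k-mers as the "strings", and the base-2 logarithm are assumptions for illustration, not details confirmed by the talk.

```python
import math
from collections import Counter


def shannon_uncertainty(sequence: str, k: int = 1) -> float:
    """Shannon uncertainty, in bits, of the overlapping k-mers in `sequence`."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    total = len(kmers)
    # p(x_i) is estimated as the observed frequency of each k-mer.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_uncertainty("ACGTACGTACGT"))        # equal base frequencies
print(shannon_uncertainty("AAAAAAAAAAAT"))        # heavily biased composition
print(shannon_uncertainty("ACGTACGTACGT", k=3))   # same sequence, 3-mers
```

With k = 1 the maximum is 2 bits (four equally likely bases); larger k measures how predictable longer strings are.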

Page 22: High  performance computational  analysis of DNA sequences  from  different environments

Surprisal in Sequences

Page 23: High  performance computational  analysis of DNA sequences  from  different environments

Uncertainty Correlates With Similarity

Page 24: High  performance computational  analysis of DNA sequences  from  different environments

But it’s not just randomness…

Page 25: High  performance computational  analysis of DNA sequences  from  different environments

Which has more surprisal: coding regions or non-coding regions?

Uncertainty in complete genomes

[Figure panels: coding regions, non-coding regions]

Page 26: High  performance computational  analysis of DNA sequences  from  different environments

More extreme differences with 6-mers

[Figure panels: coding regions, non-coding regions]

Page 27: High  performance computational  analysis of DNA sequences  from  different environments

Can we predict proteins

• Short sequences of 100 bp

• Translate into 30-35 amino acids

• Can we predict which are real and could be doing something?

• Test with bacterial proteins
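
To make the translation step concrete, here is a minimal Python sketch that translates a short read in a chosen forward reading frame using the standard genetic code. The names `CODON_TABLE` and `translate` are hypothetical; the talk does not say which tool or frames were used, and reverse-strand frames are left out for brevity.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` from offset `frame` (0, 1, or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with ambiguous bases (for example N) are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


read = "ATGGCC" * 16 + "ATGG"   # a made-up 100 bp read
for frame in range(3):
    peptide = translate(read, frame)
    print(frame, len(peptide), peptide)
```

A 100 bp read yields 32 or 33 residues depending on the frame, which sits within the 30-35 amino acid range on the slide.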

Page 28: High  performance computational  analysis of DNA sequences  from  different environments

Kullback-Leibler Divergence

Difference between two probability distributions

Difference between amino acid composition and average amino acid composition

Calculate KLD for 372 bacterial genomes
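
As a sketch of the calculation, the Kullback-Leibler divergence of a composition P from a reference Q is D(P || Q) = sum over i of p_i * log2(p_i / q_i). The Python below assumes P is the amino acid composition of one genome's proteins and Q is the average composition. The direction of the divergence, the base-2 logarithm, and the function names are assumptions, since the transcript does not give the formula.

```python
import math
from collections import Counter


def composition(protein: str) -> dict[str, float]:
    """Relative frequency of each amino acid in a protein string."""
    counts = Counter(protein)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def kld(p: dict[str, float], q: dict[str, float]) -> float:
    """D(P || Q) in bits. Q must be nonzero wherever P is nonzero."""
    return sum(pi * math.log2(pi / q[aa]) for aa, pi in p.items() if pi > 0)


# Toy example with made-up sequences, not real genome data.
genome_comp = composition("MKKLLNNKKIIKKLNNKI")
average_comp = composition("MKLNIAGVSTDERPQFWYHC" * 3 + "KLNI")
print(f"KLD = {kld(genome_comp, average_comp):.3f} bits")
```

The divergence is zero when the two compositions are identical and grows as a genome's amino acid usage departs from the average.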

Page 29: High  performance computational  analysis of DNA sequences  from  different environments

KLD varies by bacteria

Colored by taxonomy of the bacteria

Page 30: High  performance computational  analysis of DNA sequences  from  different environments

KLD varies by bacteria

Page 31: High  performance computational  analysis of DNA sequences  from  different environments

Most divergent genomes

• Borrelia garinii – Spirochaetes

• Mycoplasma mycoides – Mollicutes

• Ureaplasma parvum – Mollicutes

• Buchnera aphidicola – Gammaproteobacteria

• Wigglesworthia glossinidia – Gammaproteobacteria

Page 32: High  performance computational  analysis of DNA sequences  from  different environments

Divergence and metabolism

Bifidobacterium

Bacillus

Nostoc

Salmonella

Chlamydophila

Mean of all bacteria

Page 33: High  performance computational  analysis of DNA sequences  from  different environments

Divergence and amino acids

Ureaplasma

Wigglesworthia

Borrelia

Buchnera

Mycoplasma

Bacteria mean

Archaea mean

Eukaryotic mean

Page 34: High  performance computational  analysis of DNA sequences  from  different environments

Predicting KLD

y = 2x^2 - 2x + 0.5

[Figure: KLD per genome versus percent G+C]
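
The fitted curve can be rewritten as y = 2(x - 0.5)^2, so the predicted divergence is zero at balanced G+C and rises symmetrically toward either extreme. The Python sketch below evaluates it, assuming x is the G+C content expressed as a fraction between 0 and 1. That is an assumption: the axis is labeled in percent, but only a fraction gives values in a plausible range. The function name is hypothetical.

```python
def predicted_kld(gc_fraction: float) -> float:
    """Evaluate the slide's fitted curve y = 2x^2 - 2x + 0.5 at a G+C fraction."""
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    return 2 * gc_fraction**2 - 2 * gc_fraction + 0.5


for gc in (0.25, 0.40, 0.50, 0.60, 0.75):
    print(f"G+C {gc:.0%}: predicted KLD {predicted_kld(gc):.3f}")
```

This fits the list of most divergent genomes, which are all organisms known for low G+C content.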

Page 35: High  performance computational  analysis of DNA sequences  from  different environments

Summary

• Shannon’s uncertainty could predict useful sequences

• KLD varies too much to be useful and is driven by %G+C content

Page 36: High  performance computational  analysis of DNA sequences  from  different environments

New solutions for old problems?

Page 37: High  performance computational  analysis of DNA sequences  from  different environments

Xen and the art of imagery

Page 38: High  performance computational  analysis of DNA sequences  from  different environments

The cell phone problem

Page 39: High  performance computational  analysis of DNA sequences  from  different environments

Searching the seed by SMS

[Diagram: phone keypad; text message "seed search histidine coli"; @GMAIL.COM; AUTOSEEDSEARCHES; edwards.sdsu.edu; SEED databases; reply "22 proteins in E. coli"; locations: Anywhere, Idaho, GMCS429, Argonne]

Page 40: High  performance computational  analysis of DNA sequences  from  different environments

Challenges

• Too much data

• Not easy to prioritize

• New models for HPC needed

• New interfaces to look at data

Page 41: High  performance computational  analysis of DNA sequences  from  different environments

Acknowledgements

• Sajia Akhter

• Rob Schmieder

• Nick Celms

• Sheridan Wright

• Ramy Aziz

• FIG

• The mg-RAST team

• Rick Stevens

• Peter Salamon

• Barb Bailey

• Forest Rohwer

• Anca Segall