Beiko hpcs

Post on 30-Jun-2015


Presentation at HPCS 2014, Halifax


(an example of)

Computing the Microbial World

Rob Beiko

June 25, 2014

Siddique et al. (2014) Front Microbiol

Lawley et al., PLoS Genet (2012)

The Breakfast Organisms

"Bacon Fields", Author: Michael DeForge

240M “pieces”, each 150 nucleotides long

3.6 x 10^10 nucleotides

~40 GB

Hundreds of “species”

Genomes between 1.5M – 6M nucleotides

150 nt x 150 nt

We know this And this

But not this

who is doing what?

Marker genes WHO

Environmental “Shotgun” WHAT

The challenge of METAGENOME CLASSIFICATION

Clues – Sequence similarity (homology)

150 nt x 150 nt

Reference genes

Take the WHOLE SEQUENCE

Best

Worst

Clues – composition

150 nt x 150 nt

Reference genome

k-mer profiles

Genome #1: 20% G & C, 30% A & T

Genome #2: 24% G & C, 26% A & T

Best

Worst

Take a K-MER FREQUENCY DECOMPOSITION
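As a rough illustration of what a k-mer frequency decomposition looks like in practice, the sketch below counts overlapping k-mers in a sequence and normalizes them to relative frequencies. The function name `kmer_profile` and the choice to skip k-mers containing non-ACGT characters are assumptions made for this example, not details from the talk.

```python
from collections import Counter

def kmer_profile(seq, k):
    """Return relative frequencies of all overlapping k-mers in seq.

    k-mers containing anything other than A, C, G, T (e.g. N) are skipped;
    that filtering rule is an assumption of this sketch.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

# k = 1 gives plain nucleotide composition; larger k gives richer profiles.
profile1 = kmer_profile("GGCTGGACCA", 1)
profile2 = kmer_profile("GGCTGGACCA", 2)
```

With k = 1 this reduces to simple base composition of the kind shown for Genome #1 and Genome #2; increasing k captures more genome-specific signal at the cost of sparser counts on short reads.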

Homology >> Composition

* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
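A small sketch of how the three 10-nt sequences above can be compared position by position. It simply counts mismatches between equal-length strings (a Hamming distance), which is a much cruder notion of similarity than a real alignment; the helper name `mismatches` is hypothetical.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

query = "GGCTGGACCA"                        # the starred sequence
candidates = ["GACTGGACCA", "GGCCGGACTA"]   # sequences labelled 1 and 2
counts = [mismatches(query, c) for c in candidates]
```

Here the first candidate differs from the starred sequence at one position and the second at two.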

But homology evidence can mislead or be absent

Homology + Composition > Homology alone

Query: GGCTGGACCA

Subject: three database sequences beginning GCCTGGTCCA, GCCAGGTGCA and GCCTGTCCA (remaining bases shown as N)

Exact string search? NO

BLAST? OK, but SLOW!

A compromise: UBLAST

• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search

• Typically, a query is compared against all candidate DB sequences, but most will return no hits

UBLAST:

Query: GGCTGGACCA

DB sequences: three sequences beginning GCCTGTCCA, GCCAGGTGCA and GCCTGGTCCA (remaining bases shown as N)

(1) Query, DB sequences

(2) k-mer table

(3) Rank DB based on k-mer matching

(4) Do detailed search until there is no more point
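The four steps can be sketched roughly as follows. This is an illustration of the general idea only, not the actual UBLAST implementation: the ungapped comparison standing in for the detailed search, the parameters `k`, `min_identity` and `max_rejects`, the database names, and the amount of N padding are all assumptions made for the example.

```python
from collections import defaultdict

def kmers(seq, k):
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_kmer_table(db, k):
    """Step 2: map each k-mer to the names of DB sequences containing it."""
    table = defaultdict(set)
    for name, seq in db.items():
        for kmer in kmers(seq, k):
            table[kmer].add(name)
    return table

def rank_candidates(query, table, k):
    """Step 3: order DB sequences by number of k-mers shared with the query."""
    shared = defaultdict(int)
    for kmer in kmers(query, k):
        for name in table.get(kmer, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))

def best_ungapped_identity(query, subject):
    """Stand-in for a detailed alignment: best ungapped match fraction."""
    best = 0.0
    for off in range(len(subject) - len(query) + 1):
        window = subject[off:off + len(query)]
        matches = sum(a == b for a, b in zip(query, window))
        best = max(best, matches / len(query))
    return best

def search(query, db, k=3, min_identity=0.8, max_rejects=2):
    """Step 4: examine candidates in rank order, stop after repeated failures."""
    table = build_kmer_table(db, k)
    hits, rejects = [], 0
    for name in rank_candidates(query, table, k):
        identity = best_ungapped_identity(query, db[name])
        if identity >= min_identity:
            hits.append((name, identity))
        else:
            rejects += 1
            if rejects >= max_rejects:
                break  # "no more point" in searching further
    return hits

# Step 1: query and DB sequences (N padding lengths are arbitrary here).
query = "GGCTGGACCA"
db = {
    "db1": "GCCTGTCCA" + "N" * 20,
    "db2": "GCCAGGTGCA" + "N" * 40,
    "db3": "GCCTGGTCCA" + "N" * 50,
}
ranking = rank_candidates(query, build_kmer_table(db, 3), 3)
hits = search(query, db)
```

The point of ranking first is that the expensive comparison is only run on the most promising candidates, and the search can be abandoned once several top-ranked candidates have failed.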

Compositional models

• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies

• One model per genome

• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence

• Model construction can take a while

k = 4 k = 5 k = 6 k = 7

PhymmBL: Brady and Salzberg (2009) Nat Methods

An alternative: Naïve Bayes

• Just compute the frequency of each k-mer for a fixed length k

• Build one frequency model for each genome

• FAST

• Assumes conditional independence – may not matter

Probability of a query fragment F originating from genome Gi:

P(F | Gi) = product, over all k-mers w in the fragment, of the frequency of w in Gi

Parks et al. (2011) BMC Bioinformatics
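A minimal sketch of a k-mer naïve Bayes classifier along these lines, assuming equal prior probability for every genome and add-one smoothing for unseen k-mers (both assumptions of this example rather than details given in the talk). Log probabilities are summed instead of multiplying raw frequencies to avoid numerical underflow. The toy genomes and all function names are hypothetical.

```python
import math
from collections import Counter

def train_model(genome, k, pseudocount=1.0):
    """Build one k-mer frequency model for a genome (counts plus smoothing)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values()) + pseudocount * 4 ** k
    return counts, total, pseudocount

def log_likelihood(fragment, model, k):
    """Sum of log frequencies of every k-mer in the fragment under the model."""
    counts, total, pseudocount = model
    return sum(
        math.log((counts[fragment[i:i + k]] + pseudocount) / total)
        for i in range(len(fragment) - k + 1)
    )

def classify(fragment, models, k):
    """Pick the genome whose model gives the fragment the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], k))

# Toy "genomes" with very different composition (illustrative only).
k = 2
models = {
    "AT_rich": train_model("ATATTAATTATAAT" * 20, k),
    "GC_rich": train_model("GCGGCCGCGCCGGC" * 20, k),
}
label_gc = classify("GGCCGCGG", models, k)
label_at = classify("ATTATAAT", models, k)
```

Training is a single counting pass per genome and classification is a handful of lookups per read, which is where the speed comes from.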

RITA: Rapid Identification of Taxonomic Assignments

UBLAST filter

MacDonald et al. (2012) Nucleic Acids Res

Evaluation set

• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials

• Build reference models from other genomes – can leave close relatives out of reference model

• Leave out other strains within the same species – not so hard

• Leave out other classes in the same phylum - HARD
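The evaluation design above can be sketched as two small helpers: one that samples fixed-length fragments from known genomes, and one that removes relatives of the query genome from the reference set at a chosen taxonomic rank. This is one possible reading of the leave-out scheme; the genome names, the toy taxonomy, and the function names are hypothetical, and the sequences are random rather than real.

```python
import random

def sample_fragments(genomes, length, n, seed=0):
    """Draw n random fragments of the given length, labelled by source genome."""
    rng = random.Random(seed)
    eligible = [name for name, seq in genomes.items() if len(seq) >= length]
    if not eligible:
        raise ValueError("no genome is long enough for this fragment length")
    fragments = []
    for _ in range(n):
        name = rng.choice(eligible)
        seq = genomes[name]
        start = rng.randrange(len(seq) - length + 1)
        fragments.append((name, seq[start:start + length]))
    return fragments

def leave_out(genomes, taxonomy, query_genome, rank):
    """Reference set without any genome sharing the query's taxon at `rank`."""
    target = taxonomy[query_genome][rank]
    return {
        name: seq
        for name, seq in genomes.items()
        if taxonomy[name][rank] != target
    }

# Toy data: random sequences and a made-up taxonomy.
rng = random.Random(42)
genomes = {g: "".join(rng.choice("ACGT") for _ in range(2000))
           for g in ("g1", "g2", "g3")}
taxonomy = {
    "g1": {"species": "sp_A", "class": "cl_X"},
    "g2": {"species": "sp_B", "class": "cl_X"},
    "g3": {"species": "sp_C", "class": "cl_Y"},
}
frags = sample_fragments(genomes, length=100, n=5)
easy_reference = leave_out(genomes, taxonomy, "g1", "species")
hard_reference = leave_out(genomes, taxonomy, "g1", "class")
```

Removing only same-species genomes leaves close relatives in the reference set, while removing everything in the same class forces the classifier to rely on much more distant relatives.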

But does it work?

Figure (x-axis: DNA sequence length). Panels: “Predicting genus from different species” and “Predicting phylum from different class”. Series: Full RITA; Best class (homology and composition agree).

Conclusions

• Careful attention needs to be paid to the choice of approach – simple is better

• RITA illustrates two key points in (microbial) bioinformatics:

1. Homology: How heuristic are you willing to go?

2. Naïve Bayes: Keep it simple until told otherwise

• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years

FIN