Beiko hpcs

(an example of)

Computing the Microbial World

Rob Beiko
June 25, 2014

Siddique et al. (2014) Front Microbiol

Lawley et al., PLoS Genet (2012)

The Breakfast Organisms
"Bacon Fields" Author: Michael DeForge

240M “pieces”, each 150 nucleotides long
3.6 × 10^10 nucleotides

~40 GB
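The numbers on this slide follow from simple multiplication, which the sketch below reproduces. The one-byte-per-nucleotide storage figure is an assumption made here for illustration; the slide does not say how the ~40 GB was measured.

```python
# Back-of-the-envelope size of the read set described on the slide.
n_reads = 240_000_000      # 240M "pieces"
read_length = 150          # nucleotides per piece

total_nt = n_reads * read_length
print(f"{total_nt:.1e} nucleotides")

# Assumption: one byte per nucleotide, ignoring read headers and quality scores.
bytes_per_nt = 1
size_gb = total_nt * bytes_per_nt / 1e9
print(f"about {size_gb:.0f} GB of raw sequence")
```

Under that assumption the raw sequence comes to roughly 36 GB, the same order as the figure quoted on the slide.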

Hundreds of “species”
Genomes between 1.5M – 6M nucleotides

150 nt x 150 nt

We know this And this

But not this

who is doing what?

Marker genes WHO

Environmental “Shotgun” WHAT

The challenge of METAGENOME CLASSIFICATION

Clues – Sequence similarity (homology)

150 nt x 150 nt

Reference genes

Take the WHOLE SEQUENCE

Clues – composition

150 nt x 150 nt

Reference genome

k-mer profiles

Genome #1: 20% G & C, 30% A & T

Genome #2: 24% G & C, 26% A & T

Take a K-MER FREQUENCY DECOMPOSITION
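A k-mer frequency decomposition can be sketched in a few lines. The function name `kmer_profile` and the choice to skip windows containing ambiguous bases are assumptions made for this illustration, not something stated on the slides.

```python
from collections import Counter

def kmer_profile(seq, k):
    """Return relative frequencies of all overlapping k-mers in seq."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")   # skip windows containing N etc.
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

# Dinucleotide profile of the example query used elsewhere in the deck.
profile = kmer_profile("GGCTGGACCA", 2)
for kmer, freq in sorted(profile.items()):
    print(kmer, round(freq, 3))
```

With k = 1 the same function gives plain base composition, the kind of per-genome signal illustrated by the Genome #1 and Genome #2 percentages.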

Homology >> Composition

* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
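One way to read the three sequences on this slide is as a query followed by two candidates that differ from it at one and two positions. The sketch below counts ungapped mismatches under that reading; real homology searches use alignment with gaps and scoring matrices, so this is only a toy measure.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

query = "GGCTGGACCA"
for candidate in ("GACTGGACCA", "GGCCGGACTA"):
    print(candidate, mismatches(query, candidate))
```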

But homology evidence can mislead or be absent

Homology + Composition > Homology alone

GGCTGGACCA

GCCTGGTCCAGCCAGGTGCAGCCTGTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Query:

Subject:

Exact string search? NO

BLAST? OK, but SLOW!

A compromise: UBLAST

• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search

• Typically, a query is compared against all candidate DB sequences, but most will return no hits

UBLAST:

(1) Query, DB sequences

GGCTGGACCA

GCCTGTCCANNNNNNNNNNNNNNNNNNNNGCCAGGTGCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCTGGTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

(2) k-mer table

(3) Rank DB based on k-mer matching

(4) Do detailed search until there is no more point
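The four steps can be sketched as follows. This is an illustration of the general idea only, not the actual UBLAST implementation: the toy database, the function names, the ungapped identity score standing in for the detailed search, and the `min_identity` and `max_rejects` parameters are all assumptions made here.

```python
from collections import defaultdict

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(db, k):
    """(2) k-mer table: map each k-mer to the ids of DB sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index

def rank_candidates(query, index, k):
    """(3) Rank DB sequences by the number of k-mers shared with the query."""
    shared = defaultdict(int)
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            shared[seq_id] += 1
    return sorted(shared, key=lambda s: (-shared[s], s))

def best_ungapped_identity(query, subject):
    """Stand-in for the detailed search: best identity over ungapped offsets."""
    best = 0.0
    for start in range(len(subject) - len(query) + 1):
        window = subject[start:start + len(query)]
        matches = sum(a == b for a, b in zip(query, window))
        best = max(best, matches / len(query))
    return best

def search(query, db, k=4, min_identity=0.8, max_rejects=2):
    """(4) Examine candidates in rank order; give up after max_rejects failures."""
    index = build_index(db, k)
    hits, rejects = [], 0
    for seq_id in rank_candidates(query, index, k):
        identity = best_ungapped_identity(query, db[seq_id])
        if identity >= min_identity:
            hits.append((seq_id, identity))
        else:
            rejects += 1
            if rejects >= max_rejects:
                break
    return hits

# (1) Query and a toy database (hypothetical sequences padded with N).
query = "GGCTGGACCA"
db = {
    "db1": "GCCTGTCCA" + "N" * 20,
    "db2": "GCCAGGTGCA" + "N" * 20,
    "db3": "GCCTGGTCCA" + "N" * 20,
}
print(rank_candidates(query, build_index(db, 3), 3))
print(search(query, db, k=3))
```

The point of the ranking step is that the expensive comparison runs only on the most promising candidates, and the search stops once a few of them in ranked order have failed.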

Compositional models

• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies

• One model per genome

• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence

• Model construction can take a while

k = 4 k = 5 k = 6 k = 7

PhymmBL: Brady and Salzberg (2009) Nat Methods
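The flavour of such a model can be conveyed with a much simpler back-off scheme: at each position, use the longest preceding context that was seen often enough in the genome, and fall back to shorter contexts otherwise. This is a deliberately simplified stand-in, not a true interpolated Markov model and not PhymmBL's implementation; the names `train_counts` and `log_prob`, the `min_count` threshold, and the add-one smoothing are assumptions made here.

```python
import math
from collections import Counter

def train_counts(genome, max_order):
    """Count every k-mer of length 1..max_order+1 in one genome."""
    counts = Counter()
    for k in range(1, max_order + 2):
        for i in range(len(genome) - k + 1):
            counts[genome[i:i + k]] += 1
    return counts

def log_prob(query, counts, max_order, min_count=5):
    """Log-probability of query, using the longest well-supported context."""
    score = 0.0
    for i, base in enumerate(query):
        # Try the longest context first, backing off while it is too rare.
        for order in range(min(i, max_order), -1, -1):
            context = query[i - order:i]
            context_count = sum(counts[context + b] for b in "ACGT")
            if order == 0 or context_count >= min_count:
                # Add-one smoothing over the four bases.
                p = (counts[context + base] + 1) / (context_count + 4)
                score += math.log(p)
                break
    return score

# One model per genome; toy "genomes" with very different composition.
genome_a = "ATATATATAT" * 10
genome_b = "GCGCGCGCGC" * 10
counts_a = train_counts(genome_a, max_order=3)
counts_b = train_counts(genome_b, max_order=3)
fragment = "ATATAT"
print(log_prob(fragment, counts_a, 3), log_prob(fragment, counts_b, 3))
```

Counting every k-mer up to the maximum order for every genome is where the model-construction cost mentioned on the slide comes from.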

An alternative: Naïve Bayes

• Just compute the frequency of each k-mer for a fixed length k

• Build one frequency model for each genome

• FAST

• Assumes conditional independence – may not matter

Probability of a query fragment originating from genome Gi:

For all k-mers in the fragment, take the product of the frequency of that k-mer in Gi

Parks et al. (2011) BMC Bioinformatics
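A minimal sketch of this kind of naïve Bayes scoring is given below. It sums log frequencies rather than multiplying raw frequencies to avoid numerical underflow. The function names, the add-one smoothing, and the implicit uniform prior over genomes are assumptions made for illustration and are not taken from the cited paper.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(genomes, k):
    """Build one k-mer count model per genome."""
    models = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        models[name] = (counts, sum(counts.values()))
    return models

def log_likelihood(fragment, model, k):
    """Sum of log frequencies of the fragment's k-mers under one genome model."""
    counts, total = model
    n_possible = 4 ** k   # add-one smoothing over all possible k-mers
    return sum(
        math.log((counts[fragment[i:i + k]] + 1) / (total + n_possible))
        for i in range(len(fragment) - k + 1)
    )

def classify(fragment, models, k):
    """Pick the genome whose model gives the fragment the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], k))

# Toy reference "genomes" (hypothetical sequences).
genomes = {"G1": "ATATTTAATA" * 20, "G2": "GCGGCCGCGC" * 20}
models = train(genomes, k=3)
print(classify("ATTAAT", models, k=3))
print(classify("GCCGCG", models, k=3))
```

Training is a single counting pass per genome, which is why this approach is fast compared with building adaptive models.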

RITA: Rapid Identification of Taxonomic Assignments

UBLAST filter

MacDonald et al. (2012) Nucleic Acids Res

Evaluation set

• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials

• Build reference models from other genomes – can leave close relatives out of reference model
  • Leave out other strains within the same species – not so hard
  • Leave out other classes in the same phylum – HARD
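The evaluation design can be sketched as two small helpers: one that samples fixed-length fragments from a known genome, and one that removes close relatives from the reference set. The random stand-in genome, the placeholder taxon labels, and the dictionary layout are hypothetical and used only for illustration.

```python
import random

def sample_fragments(genome, length, n, rng):
    """Draw n random fragments of a fixed length from one genome."""
    if len(genome) < length:
        return []
    return [
        genome[start:start + length]
        for start in (rng.randrange(len(genome) - length + 1) for _ in range(n))
    ]

def leave_out(reference, query_taxonomy, rank):
    """Remove reference genomes that share the query's taxon at the given rank."""
    return {
        name: entry
        for name, entry in reference.items()
        if entry["taxonomy"][rank] != query_taxonomy[rank]
    }

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))  # stand-in for a known genome
fake_metagenome = {
    length: sample_fragments(genome, length, 10, rng)
    for length in (50, 100, 200, 1000)
}

# Placeholder reference set with made-up taxon labels.
reference = {
    "ref_same_species": {"taxonomy": {"species": "sp1", "class": "c1", "phylum": "p1"}},
    "ref_same_class": {"taxonomy": {"species": "sp2", "class": "c1", "phylum": "p1"}},
    "ref_same_phylum": {"taxonomy": {"species": "sp3", "class": "c2", "phylum": "p1"}},
}
query_taxonomy = {"species": "sp1", "class": "c1", "phylum": "p1"}
easier = leave_out(reference, query_taxonomy, "species")  # other species remain
harder = leave_out(reference, query_taxonomy, "class")    # only other classes remain
print(sorted(easier), sorted(harder))
```

The further up the taxonomy the exclusion goes, the more distant the nearest remaining reference is, which is what makes the second setting hard.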

But does it work?

[Figure. Panels: “Predicting genus from different species” and “Predicting phylum from different class”. Labels: “Full RITA”; “Best class (homology and composition agree)”. Axis: DNA sequence length.]

Conclusions

• Careful attention needs to be paid to the choice of approach – simple is better

• RITA illustrates two key points in (microbial) bioinformatics:

1. Homology: How heuristic are you willing to go?
2. Naïve Bayes: Keep it simple until told otherwise

• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years
