Faculty of Computer Science Dalhousie University, Canada Andrew Rau-Chaplin, www.cs.dal.ca/~arc Parallel Computational Biochemistry

Proteins, DNA, etc.

DNA encodes the information necessary to produce proteins

Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)

• Proteins are formed from a chain of molecules called amino acids

• The DNA sequence encodes the amino acid sequence that constitutes the protein

• There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...

Multiple Sequence Alignment

Databases of Biological Sequences

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

• NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides

• Swiss-Prot: 104,559 sequences; 38,460,707 residues

• PDB: 17,175 structures
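
Database entries like the one above are commonly exchanged in FASTA format: a header line starting with ">" followed by the residue sequence. A minimal Python sketch of a reader is given here; the function name `parse_fasta` and the two toy records are illustrative assumptions, not part of the original slides.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy input, only to show the shape of the data.
example = """>toy_protein_1 hypothetical example
MYSFPNSFRF
GWSQAGFQSE
>toy_protein_2 hypothetical example
MKV
"""
records = parse_fasta(example)
```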

Sequence comparison

• Compare one sequence (target) to many sequences (database search)

• Compare more than two sequences simultaneously

Applications

• Phylogenetic analysis

• Identification of conserved motifs and domains

• Structure prediction

Phylogenetic Analysis

Structure Prediction

Genomic sequences

> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein sequences

Protein structures

Clustal W

Progressive Alignment

Pairwise distance matrix (lower triangle):

Scerevisiae [1]
Celegans    [2]  0.640
Drosophila  [3]  0.634  0.327
Human       [4]  0.630  0.408  0.420
Mouse       [5]  0.619  0.405  0.469  0.289

[Figure: guide tree with leaves S. cerevisiae, C. elegans, Drosophila, Mouse, Human]

1. Do pairwise alignment of all sequences and calculate distance matrix

2. Create a guide tree based on this pairwise distance matrix

3. Align progressively following guide tree.
• start by aligning most closely related pairs of sequences
• at each step align two sequences or one to an existing subalignment
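
Steps 1 and 2 above can be sketched in Python using the distance matrix from this slide. The sketch builds the tree with UPGMA (average linkage), a simpler stand-in chosen here for illustration; Clustal W itself builds its guide tree with neighbor-joining. The function name `upgma` and the nested-tuple tree representation are assumptions of this sketch.

```python
import itertools

# Lower-triangular distances from the slide, keyed by unordered species pair.
names = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
rows = [[], [0.640], [0.634, 0.327], [0.630, 0.408, 0.420],
        [0.619, 0.405, 0.469, 0.289]]
dist = {frozenset((names[i], names[j])): d
        for i, row in enumerate(rows) for j, d in enumerate(row)}


def upgma(names, dist):
    """Return a nested-tuple guide tree built by average-linkage clustering."""
    # Each key is a subtree (a name or a nested tuple); the value lists its leaves.
    clusters = {name: [name] for name in names}

    def avg(a, b):
        # Average leaf-to-leaf distance between two clusters.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda ab: avg(*ab))
        leaves = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = leaves
    return next(iter(clusters))


tree = upgma(names, dist)
```

Under this sketch the closest pair (Human, Mouse, distance 0.289) is joined first, so a progressive aligner following the tree would align those two sequences before any others.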

Parallel Clustal

• Parallel pairwise (PW) alignment matrix

• Parallel guide tree calculation

• Parallel progressive alignment

Parallel Clustal - Improvements

• Optimization of input parameters
– scoring matrices, gap penalties: requires many repetitive Clustal W calculations with various input parameters.

• Minimum Vertex Cover
– use minimum vertex cover to remove erroneous sequences, and identify clusters of highly similar sequences.

Minimum Vertex Cover

Conflict Graph
– vertex: sequence
– edge: conflict (e.g. alignment with very poor score)

TASK: remove smallest number of gene sequences that eliminates all conflicts
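
A minimal Python sketch of this task on a toy instance follows. It assumes that a conflict is a pairwise alignment score below a threshold, and it finds the smallest cover by brute force, which is only feasible for tiny graphs. The names `conflict_graph` and `min_vertex_cover` and the scores are hypothetical.

```python
from itertools import combinations


def conflict_graph(scores, threshold):
    """Return conflict edges: sequence pairs scoring below `threshold`."""
    return {frozenset(pair) for pair, s in scores.items() if s < threshold}


def min_vertex_cover(vertices, edges):
    """Smallest vertex set touching every edge (exhaustive, exponential time)."""
    for size in range(len(vertices) + 1):
        for cand in combinations(sorted(vertices), size):
            chosen = set(cand)
            if all(edge & chosen for edge in edges):
                return chosen


# Hypothetical pairwise alignment scores for four sequences a, b, c, d.
scores = {("a", "b"): 55, ("a", "c"): 4, ("a", "d"): 48,
          ("b", "c"): 60, ("b", "d"): 52, ("c", "d"): 7}
vertices = {v for pair in scores for v in pair}
edges = conflict_graph(scores, threshold=10)
to_remove = min_vertex_cover(vertices, edges)
```

In this toy instance sequence c is involved in both conflicts, so removing it alone eliminates them.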

FPT Algorithms

• Phase 1: Kernelization

Reduce problem to size f(k)

• Phase 2: Bounded Tree Search

Exhaustive tree search; exponential in f(k)

Kernelization

Buss's Algorithm for k-vertex cover

• Let G=(V,E) and let S be the subset of vertices with degree more than k (each such vertex must be in any cover of size at most k).

• Remove S and all incident edges: G -> G', k -> k' = k - |S|.

• IF G' has more than k x k' edges THEN no k-vertex cover exists

ELSE start bounded tree search on G'
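
The reduction can be sketched in Python as follows. This is a single-pass sketch that assumes a simple undirected graph given as an edge list and takes S to be the vertices of degree strictly greater than k; the function name `buss_kernel` and its return convention are assumptions made for illustration.

```python
def buss_kernel(edges, k):
    """Apply Buss's reduction for k-vertex cover.

    Returns (S, kernel_edges, k_prime), or None when no cover of
    size <= k can exist.
    """
    edges = {frozenset(e) for e in edges}
    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    # Vertices of degree > k must belong to every cover of size <= k.
    forced = {v for v, d in degree.items() if d > k}
    if len(forced) > k:
        return None

    # Remove S and all incident edges: G -> G', k -> k'.
    kernel = {e for e in edges if not (e & forced)}
    k_prime = k - len(forced)

    # k' vertices of degree <= k cover at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return forced, kernel, k_prime


# Hub "h" has degree 4 > k = 2, so it is forced into the cover.
result = buss_kernel(
    [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "b")], k=2)
```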

Bounded Tree Search

[Figure: search tree rooted at VC={}; each step down to a child adds vertices to the cover (VC+=...)]

Case 1: simple path of length 3

in graph G': path v - v1 - v2 - v3

search tree: from VC={...} branch into VC+={v,v2}, VC+={v1,v2}, VC+={v1,v3}

remove selected vertices from G'; k' -= 2

Case 2: 3-cycle

in graph G': cycle on v, v1, v2

search tree: from VC={...} branch into VC+={v,v1}, VC+={v1,v2}, VC+={v,v2}

remove selected vertices from G'; k' -= 2

Case 3: simple path of length 2

in graph G': path v - v1 - v2

search tree: from VC={...} a single branch VC+={v1}

remove v1, v2 from G'; k' -= 1

Case 4: simple path of length 1

in graph G': edge v - v1

search tree: from VC={...} a single branch VC+={v}

remove v, v1 from G'; k' -= 1

Sequential Tree Search

Depth first search

– backtrack when k'=0 and G'<>{} ("dead end")

– stop when solution found (G'={}, k'>=0)
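
A minimal Python sketch of this depth-first search is given below. For brevity it branches on the two endpoints of a single uncovered edge rather than on the four path and cycle cases from the preceding slides, so it is a simpler variant, not the deck's branching scheme. The function name `vertex_cover_dfs` is an assumption.

```python
def vertex_cover_dfs(edges, k):
    """Depth-first bounded search: return a cover of size <= k, or None."""
    edges = [tuple(e) for e in edges]

    def search(remaining, budget, cover):
        if not remaining:        # G' = {}: solution found, stop
            return cover
        if budget == 0:          # k' = 0 but edges remain: dead end
            return None
        u, v = remaining[0]
        for pick in (u, v):      # every cover contains u or v
            rest = [e for e in remaining if pick not in e]
            found = search(rest, budget - 1, cover | {pick})
            if found is not None:
                return found
        return None              # both branches failed: backtrack

    return search(edges, k, frozenset())


path = [(1, 2), (2, 3), (3, 4)]
cover = vertex_cover_dfs(path, 2)    # a cover of size 2 exists
no_cover = vertex_cover_dfs(path, 1) # no single vertex covers all edges
```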

Parallel Tree Search

Basic Idea:

– Build top log p levels of the search tree (T')

– every processor starts depth-first search at one leaf of T'

– randomize depth-first search by selecting random child
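
The idea can be simulated sequentially in Python. The sketch below assumes a binary branching rule (one of the two endpoints of an uncovered edge), so expanding about log p levels yields p leaves, and it runs the per-processor searches one after another instead of distributing them over MPI. All function names are hypothetical.

```python
import random


def branch(edges, budget, cover):
    """Children of a search node: add either endpoint of one edge to the cover."""
    u, v = edges[0]
    return [([e for e in edges if pick not in e], budget - 1, cover | {pick})
            for pick in (u, v)]


def frontier(edges, k, p):
    """Expand level by level until at least p open subproblems exist (leaves of T')."""
    nodes = [(list(edges), k, frozenset())]
    while len(nodes) < p and any(e and b > 0 for e, b, _ in nodes):
        nxt = []
        for e, b, c in nodes:
            # Finished or dead-end nodes are carried over unchanged.
            nxt.extend(branch(e, b, c) if e and b > 0 else [(e, b, c)])
        nodes = nxt
    return nodes


def random_dfs(node, rng):
    """Depth-first search below one leaf of T', visiting children in random order."""
    edges, budget, cover = node
    if not edges:
        return cover
    if budget == 0:
        return None
    children = branch(edges, budget, cover)
    rng.shuffle(children)
    for child in children:
        found = random_dfs(child, rng)
        if found is not None:
            return found
    return None


path = [(1, 2), (2, 3), (3, 4), (4, 5)]
leaves = frontier(path, k=2, p=4)
# Each list entry stands for the result of one simulated processor.
results = [random_dfs(leaf, random.Random(i)) for i, leaf in enumerate(leaves)]
```

In a real run each leaf would be handed to its own processor and the search would stop as soon as any processor reports a cover.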

Analysis: Balls-in-bins

sequential depth-first search path: total length L, number of solutions m

expected sequential time (rand. distr.): L/(m+1)

parallel search path: p processors

expected parallel time (rand. distr.): p + L/(p(m+1))

expected speedup: (L/(m+1)) / (p + L/(p(m+1))) = p / (1 + p^2 (m+1)/L)

if p^2 (m+1) << L then expected speedup ≈ p
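
A small Python sketch that evaluates the two expected-time formulas above and takes their ratio. The values L = 1,000,000 and m = 10 are taken from the deck's simulation parameters; p = 32 and the function name `expected_times` are illustrative assumptions.

```python
def expected_times(L, m, p):
    """Return (sequential time, parallel time, speedup) under the balls-in-bins model."""
    seq = L / (m + 1)            # expected steps to the first of m solutions
    par = p + L / (p * (m + 1))  # p steps of setup plus a p-fold shorter search
    return seq, par, seq / par


L, m, p = 1_000_000, 10, 32
seq, par, speedup = expected_times(L, m, p)
```

For these values the setup term p is small compared with L/(p(m+1)), so the computed speedup stays close to p.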

Simulation Experiment

[Figure: predicted speedup vs. number of processors, both axes from 0 to 200, for L = 1,000,000 and m = 10, 100, 1,000, 10,000, 100,000]

Implementation

• test platform:
– 32 node Beowulf cluster
– each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
– gcc and LAM/MPI on Red Hat Linux 7.2

• code-s: Sequential k-vertex cover

• code-p: Parallel k-vertex cover

HPCVL

High Performance Computing Virtual Laboratory - HPCVL (www.hpcvl.org)

Created by parallel computing researchers from Carleton U. (Comp. Sci.), Queen's (Engineering), and Ottawa U. (Life Sci./Hospital)

Obtained $30M+ in Federal (CFI) and Ontario (OIT, ORDCF) grants

Test Data

• Protein sequences

• Same protein from several hundred species

• Each protein sequence a few hundred amino acid residues in length

• Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

• Somatostatin

– neuropeptide involved in the regulation of many functions in different organ systems

– Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

• WW

– small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling

– Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

• Kinase

– large family of enzymes involved in cellular regulation

– Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397

• SH2 (src-homology domain 2)

– involved in targeting proteins to specific sites in cells by binding to phosphotyrosine

– Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

• Thrombin

– protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin

– Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

• PHD (pleckstrin homology domain)

– involved in cellular signaling

– Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603

• Random Graph

|V| = 220, |E| = 2155, k = 122, k' = 122

• Grid Graph

|V| = 289, |E| = 544, k = 145, k' = 145

• |VC| ~ |V| / 2

• k' = k

Sequential Times

Kinase, SH2, Thrombin: n/a

Code-p on Virtual Proc.

Parallel Times

Speedup: Somatostatin

Speedup: WW

Speedup: Rand. Graph

Speedup: Grid Graph

Thank You!

• Questions?