Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

MSA using Hadoop

Presented by:

Dr. G.Sudha Sadasivam

Professor, Dept of CSE,

PSG College of Technology,

Coimbatore

Agenda

- Sequence alignment
- Introduction to Clouds
- Approaches for MSA
- Approach 1
- Approach 2
- Results
- Other Projects

What is Sequence Alignment?

The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

- Uses
  - For sequence similarity
  - Phylogenetic tree analysis
- Factors – accuracy and speed

Cloud computing

Provides scalable, on-demand, RT computing services

Suitability of cloud for Sequence Alignment

- On-demand scalability of the cloud makes it suitable for the dynamic nature of MSA
- Low cost in maintenance of infrastructure for applications
- Data and compute parallelism in clouds through the map-reduce paradigm facilitates energy-efficient and fast MSA.

Types of Sequence Alignment

- Pair-wise Alignment
  - Alignment of two sequences
  - Global – using Needleman Wunsch algorithm.

      L G P S S K Q T G K G S _ S R A W D N
      |         |     | | |       |     |
      L N _ A T K S A G K G A I M R L G D A

  - Local – using Smith Waterman algorithm.

      _ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                          | | |
      _ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _

- Multiple Sequence Alignment
  - Alignment of more than two sequences

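Needleman Wunsch Algorithm

- Initialization

\[ F(0,0) = 0, \qquad F(i,0) = -i \cdot d, \qquad F(0,j) = -j \cdot d \]

- Main Iteration

For each i = 1…M and j = 1…N:

\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) & \text{case 1: } x_i \text{ aligns to } y_j \\ F(i-1,j) - d & \text{case 2: } x_i \text{ aligns to gap} \\ F(i,j-1) - d & \text{case 3: } y_j \text{ aligns to gap} \end{cases} \]

\[ \mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases} \]

\[ s(x_i,y_j) = \begin{cases} +1 & \text{match} \\ -1 & \text{mismatch} \end{cases} \]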

Needleman Wunsch Algorithm

Worked example: aligning AGTA with ATA, using s = +1 for a match, s = -1 for a mismatch, and d = 1.

  F(i,j)  i=0    1    2    3    4
                 A    G    T    A
  j=0       0   -1   -2   -3   -4
    1 A    -1    1    0   -1   -2
    2 T    -2    0    0    1    0
    3 A    -3   -1   -1    0    2

F(1,1) = max { f(0,0)+s(1,1) = 1,  f(0,1)-1 = -2,  f(1,0)-1 = -2 } = 1 (case 1)

F(1,2) = max { f(0,1)+s(1,2) = -2,  f(0,2)-1 = -3,  f(1,1)-1 = 0 } = 0 (case 3)

Optimal alignment:

  A_TA
  AGTA
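The matrix fill described on the two Needleman Wunsch slides can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `nw_matrix` and its parameter names are assumptions, and the scoring values (+1 match, -1 mismatch, d = 1) are the ones given on the slide.

```python
def nw_matrix(x, y, match=1, mismatch=-1, d=1):
    """Fill the Needleman-Wunsch score matrix F (x indexes rows, y indexes columns)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: first column and first row hold accumulated gap penalties.
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    # Main iteration: each cell is the best of the three cases.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # case 1: x_i aligns to y_j
                          F[i - 1][j] - d,      # case 2: x_i aligns to gap
                          F[i][j - 1] - d)      # case 3: y_j aligns to gap
    return F


F = nw_matrix("ATA", "AGTA")
for row in F:
    print(row)
```

With ATA on the rows and AGTA on the columns, the bottom-right cell `F[3][4]` holds the global alignment score of the two sequences.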

Multiple Sequence Alignment

- A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
- The input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
- From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

MSA Approaches

- Dynamic programming
- Progressive alignment
- Iterative approach

MSA methods

- Dynamic Programming (n-dim matrix)
  - Accurate
  - Computationally complex, O(N^n)
  - Exhaustive
- Progressive approximation (aligns closest seq first - heuristics)
  - Fast alignment
  - Cannot be modified
  - Local maxima
  - Less accurate
  - ClustalW, MAFFT
- Iterative
  - Probabilistic / Stochastic (Random)
  - Slow & less accurate
  - GA & HMM

N - sequence length; n - number of sequences

MSA in cloud

- CloudBurst – RMAP
  - Does not split sequences to load in cloud environment
  - Not for MSA
  - No automatic scale up/down of clusters
- CLUE - proposal from Maryland University
- VM cloning – Snowflock with MPIs

Proposed MSA Approach – hadoop data grid

[Diagram: sequences S1, S2 and S3 pass through a chain of Map/Reduce aligners; the first stage outputs A1S1 and A1S2, the following stages output A2S1, A2S2 and A1S3]

1) Identify the different permutations:
   S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
2) Perform alignment of each permutation in parallel in Map2:
   S1 and S2 are aligned to form A1S1 and A1S2
3) Align the output of the first Map-Reduce with the third sequence S3 in the Map phase:
   A1S1 is aligned with S3
   A1S2 is aligned with S3
   The best among these two is chosen to form A2S1, A2S2 and A1S3.
4) Steps 2 & 3 are repeated for all the other permutations in Map1
5) The best possible combination is chosen (by alignment score)
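The five steps can be sketched serially in Python as follows. This is a rough sketch under stated assumptions, not the presenter's Hadoop code: the three pairwise alignments per permutation follow the order used in the single-combination illustration later in the deck (first with second, the first result with the third, then the remaining two aligned strings); ranking a permutation by the sum of its three pairwise scores is an assumption, since the slides only say "alignment score"; and the names `align_pair`, `align_combination` and `best_combination` are hypothetical. A gap character already present in a sequence is treated like any other symbol.

```python
from itertools import permutations

GAP = "_"


def align_pair(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch with traceback; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # Traceback from the bottom-right cell, preferring DIAG, then UP, then LEFT.
    ax, ay, i, j = [], [], m, n
    while i > 0 or j > 0:
        diag = (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch))
        if diag:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append(GAP); i -= 1
        else:
            ax.append(GAP); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


def align_combination(order):
    """Align one permutation (a, b, c) with three pairwise alignments."""
    a, b, c = order
    s1, a1, b1 = align_pair(a, b)     # first with second
    s2, a2, c1 = align_pair(a1, c)    # first result with the third sequence
    s3, b2, c2 = align_pair(b1, c1)   # remaining two aligned strings
    return s1 + s2 + s3, (a2, b2, c2)  # assumed ranking: sum of pairwise scores


def best_combination(seqs):
    """Evaluate every permutation (serially here) and keep the best-scoring one."""
    results = [align_combination(p) for p in permutations(seqs)]
    return max(results, key=lambda r: r[0]), results


best, results = best_combination(["AGTA", "ATA", "GAT"])
print(best)
```

In a Hadoop setting each call to `align_combination` would be an independent task, which is what makes the permutations easy to evaluate in parallel.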

Varying Number of Sequences of Same Size

[Chart: time in sec (0 to 100) against number of sequences (2, 4, 6, 8, 10), for 2 nodes and 3 nodes]

Different Block Sizes

[Chart: time in sec (0 to 350) against block size in KB (10, 100, 1000, 6400), for 2 nodes and 3 nodes]

Analysis

Complexity Measure    Proposed Method        Conventional Method
Score Calculation     O(N)                   O(n*N)
Pairwise alignment    O(K^2)                 O(N^2)
MSA                   O[(n-1) * (N^2)/b]     O(N^n)

‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘b’ – Average number of blocks in a sequence
‘K’ – Size of 1 block

Proposed MSA Approach on Cloud

Time-efficient approach to sequence alignment with quality (accuracy) in the cloud

- Using hadoop framework
- Dynamic approach → accuracy
- Data and compute parallelism in hadoop → speed
- Blocking and scalability of hadoop
- Parallel transfer of sequence splits over the network to remote clusters
- Automated scale up/down of clusters based on the computational needs of the environment.
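One way to picture how blocking combines with map-reduce is the serial Python sketch below. It is only an illustration under assumptions the slides do not confirm: each sequence is cut into blocks of size K, blocks with the same index are aligned against each other in the map step, and the reduce step simply sums the per-block scores. The names `nw_score`, `blocks`, `map_phase` and `reduce_phase` are hypothetical, and a sum of block scores is generally not equal to the global alignment score of the whole sequences.

```python
def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping two rows of the matrix at a time."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xc in enumerate(x, 1):
        cur = [-i * d]
        for j, yc in enumerate(y, 1):
            s = match if xc == yc else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def blocks(seq, K):
    """Cut a sequence into consecutive blocks of at most K characters."""
    return [seq[i:i + K] for i in range(0, len(seq), K)]


def map_phase(s1, s2, K):
    """Emit one (block_index, score) record per pair of same-index blocks."""
    b1, b2 = blocks(s1, K), blocks(s2, K)
    n = max(len(b1), len(b2))
    b1 += [""] * (n - len(b1))  # pad the shorter list with empty blocks
    b2 += [""] * (n - len(b2))
    return [(i, nw_score(a, b)) for i, (a, b) in enumerate(zip(b1, b2))]


def reduce_phase(records):
    """Combine the per-block scores into a single score."""
    return sum(score for _, score in records)


records = map_phase("AGTACGTA", "AGTACGTT", K=4)
print(records, reduce_phase(records))
```

Each map record costs on the order of K^2 cell updates and is independent of the others, which is where the data and compute parallelism comes from.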

System Architecture

[Diagram: a client-side virtual environment splits the sequences (AGT….CG) into fragments and sends them in parallel over the Internet to the head server (VM) of the server-side hadoop cluster, which forks new VMs as needed]

Steps labelled in the diagram:

1. Create virtual environment
2. Split the sequences; parallel transmission over Internet
3. Copy to HDFS
4. Forking VMs / deleting VMs
5. Perform alignment
6. Report the result

A single Combination – An illustration

S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”

1. ALIGNMENT OF S1 & S2

         0    1    2    3    4
              A    G    T    A
  0      0   -1   -2   -3   -4
  1 A   -1    1    0   -1   -2
  2 T   -2    0    0    1    0
  3 A   -3   -1   -1    0    2

SCORE: 4
A1S1: “AGTA”; A1S2: “A_TA”

2. ALIGNMENT OF A1S1 & S3

         0    1    2    3    4
              A    G    T    A
  0      0   -1   -2   -3   -4
  1 G   -1   -1    0   -1   -2
  2 A   -2    0   -1    1    0
  3 T   -3   -1   -1    0   -1

SCORE: -5
A2S1: “AG_TA”; A1S3: “_GAT_”

3. ALIGNMENT OF A1S2 & A1S3

         0    1    2    3    4    5
              A    _    T    A    _
  0      0   -1   -2   -3   -4   -5
  1 _   -1    0    0   -1   -2   -3
  2 G   -2   -1   -1   -1   -2   -2
  3 A   -3   -1   -1   -2    0   -1
  4 T   -4   -2   -1    0   -1    0
  5 _   -5   -3   -1   -1    0    0

SCORE: -3
A2S2: “A__TA_”; A2S3: “_GAT__”

Analysis

Complexity Measure    Proposed Method          Conventional Method
Score Calculation     O(N)                     O(n*N)
Pairwise alignment    O(K^2)                   O(N^2)
MSA                   O[K^2 * n(n-1)/2]        O(N^n)

‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘k’ – Average number of blocks in a sequence
‘K’ – Size of 1 block
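To get a feel for the gap between the two MSA expressions in the table, they can be evaluated for sample values. The numbers below (N = 1000, n = 3, K = 100) are illustrative assumptions, not measurements from the talk, and the results are orders of growth rather than exact operation counts.

```python
def conventional_cost(N, n):
    """Order of growth of n-dimensional dynamic programming: N^n."""
    return N ** n


def proposed_cost(K, n):
    """Order of growth of the proposed method: K^2 * n(n-1)/2."""
    return K ** 2 * n * (n - 1) // 2


N, n, K = 1000, 3, 100  # assumed sample values
print(conventional_cost(N, n))  # 1000000000
print(proposed_cost(K, n))      # 30000
```

The proposed expression grows quadratically in the number of sequences n, while the conventional one grows exponentially in n.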

2. Parallelised data transfer

‘T’ – Time for sequence transfer serially & ‘k’ – block size
T/k – Time for sequence transfer in parallel

3. Dynamic cluster creation

Advantage: Computation power of the remote cluster is optimal and not wasted
Disadvantage: Time to set up the cluster

Effect of parallel file transfer

File Size  File Transfer  Split Time  Merge Time  C1     T1     C2     T2
(MB)       (sec)          (sec)       (sec)       (sec)  (sec)  (sec)  (sec)
100        6.23           0.02        0.03        2.13   2.18   0.73   0.78
200        9.32           0.23        0.43        2.96   3.62   1.23   1.89
300        11.43          0.85        1.64        3.84   6.33   1.16   3.65

C1: Communication time from 3 client VMs to the server without multithreading.
C2: Communication time from 3 client VMs to the server with multithreading.
T1: Total time for file transfer from client to server without multithreading.
T2: Total time for file transfer from client to server with multithreading.
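The split, multithreaded send, and merge stages measured in the table can be sketched as below. In each row of the table the total time equals communication time plus split time plus merge time (for example 0.73 + 0.02 + 0.03 = 0.78 for T2 at 100 MB), and the sketch mirrors those three stages. It is a minimal illustration, not the presenter's code: the `send` function is a stand-in that stores fragments in a dictionary instead of using a network, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, k):
    """Split bytes into at most k fragments of nearly equal size."""
    size = max(1, -(-len(data) // k))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def send(index, fragment, server):
    """Stand-in for a network upload: store the fragment under its index."""
    server[index] = fragment


def parallel_transfer(data, k):
    """Split, send the fragments on k threads, then merge them in order."""
    server = {}
    fragments = split(data, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(send, i, frag, server)
                   for i, frag in enumerate(fragments)]
        for f in futures:
            f.result()  # re-raise any error from a worker thread
    return b"".join(server[i] for i in range(len(fragments)))


payload = b"AGTACG" * 1000
received = parallel_transfer(payload, k=3)
print(received == payload)
```

Keeping the fragment index alongside each fragment is what lets the receiving side merge the pieces in the right order regardless of arrival order.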

Time to start virtual machines

[Chart: time in sec (0 to 120) against number of VMs (1 to 4)]

Parallelised starting of VMs can be done to reduce time

Cluster performance wrt number of VMs

30 KB sequences with 2 KB splits – up to 5 sequences

[Chart: time in sec (0 to 350) against number of sequences (1 to 10), for 4 slave VMs and 6 slave VMs]

When the number of sequences is less than 6, a five-node hadoop cluster is sufficient.

Dynamic scaling up/down of clusters

- VMs are instantiated based on the number of Map-Reduce tasks
- The number of tasks is checked dynamically → new VMs are started and tasks are reallocated
- Old VMs are destroyed if not used

Static: VM creation based on predicted application load (maps + reduces)
Dynamic: VM creation based on actual application load (maps + reduces)
Block size: 10 MB

File Size  Static Time  Static  Dynamic Time  Dynamic
(GB)       (min-sec)    VMs     (min-sec)     New VMs added
1          5-36         2       3-16          1
2          5-52         3       5-40          1
3          8-27         4       5-48          2
5          12-13        5       6-39          9
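The scaling rule described in the bullets (check the number of tasks, start VMs when more are needed, destroy idle ones) can be sketched as a small decision function. This is a hypothetical sketch: the slides do not say how many map and reduce tasks one VM handles, so `tasks_per_vm` is an assumed parameter, and the function names are illustrative only.

```python
import math


def vms_needed(num_tasks, tasks_per_vm):
    """Number of VMs required to host the current map + reduce tasks."""
    return math.ceil(num_tasks / tasks_per_vm)


def rescale(running_vms, num_tasks, tasks_per_vm=2):
    """Return (vms_to_start, vms_to_destroy) for the current task load."""
    target = vms_needed(num_tasks, tasks_per_vm)
    if target > running_vms:
        return target - running_vms, 0   # scale up: fork new VMs
    return 0, running_vms - target       # scale down: destroy unused VMs


print(rescale(running_vms=2, num_tasks=7))  # (2, 0)
print(rescale(running_vms=5, num_tasks=4))  # (0, 3)
```

A real controller would call such a function periodically and also account for the VM start-up time shown earlier in the deck.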

Conclusion

1) Proposed MSA improves on the computation time and also maintains the accuracy.
   - Parallelism of sequence alignment in three levels.
   - Hadoop data grids - Data and compute parallelism & scalability
   - Dynamic Programming - accuracy.
2) Complexity is reduced from O(N^n) to O[K^2 * (n * (n-1)/2)]
   - Combining progressive and dynamic approaches.
   - Blocking in hadoop
3) Enhancements (using clouds for MSA)
   - Automatic configuration of the cloud environment based on the computational needs
   - Efficient upload of data into the HDFS by parallel transfer of sequence fragments over the Internet.

Other Projects

- Enhancement of existing fairshare scheduler in hadoop
- Reliability using Reed Solomon codes
- Hybrid scheduler
- Motif identification for MSA
- CBIR using image signatures
- Text categorization
- Hybrid PSO (PSO and GA) for job scheduling
- Semantic search using hadoop framework.
- Others – Globus and GridSim

Acknowledgement

The research has been carried out as a result of the PSG-Yahoo Research programme on Grid and Cloud computing.

Sincere thanks to

1) Dr R Rudramoorthy, Principal, PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran, Director, Grid and Cloud Systems Group, Yahoo, Bangalore

THANK YOU

QUESTIONS?
