Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop
MSA using Hadoop
Presented by:
Dr. G.Sudha Sadasivam
Professor, Dept of CSE,
PSG College of Technology,
Coimbatore
Agenda
• Sequence alignment
• Introduction to Clouds
• Approaches for MSA
• Approach 1
• Approach 2
• Results
• Other Projects
What is Sequence Alignment?
The procedure of comparing two or more
sequences by searching for a series of individual
characters or character patterns that are in the
same order in the sequences.
• Uses
• For sequence similarity
• Phylogenetic tree analysis
• Factors – accuracy and speed
Cloud computing
Provides scalable, on-demand, RT computing services
Suitability of cloud for Sequence Alignment
• On-demand scalability of the cloud makes it suitable
for the dynamic nature of MSA
• Low infrastructure maintenance cost for
applications
• Data and compute parallelism in clouds through the
map-reduce paradigm facilitates energy-efficient and
fast MSA.
Types of Sequence Alignment
• Pair-wise Alignment
• Alignment of two sequences
• Global – using the Needleman-Wunsch algorithm.
L G P S S K Q T G K G S _ S R A W D N
|         |     | | |       |     |
L N _ A T K S A G K G A I M R L G D A
• Local – using the Smith-Waterman algorithm.
_ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                    | | |
_ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _
• Multiple Sequence Alignment
• Alignment of more than two sequences
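To make the global/local distinction concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (+1 match, -1 mismatch, gap penalty 1) are assumptions borrowed from the Needleman-Wunsch slides; the slides give no parameters for the local case, and the function name is illustrative only.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=1):
    """Return the best local alignment score between strings x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j] = best score of a local alignment ending at x[i-1] and y[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # x[i-1] aligned to y[j-1]
                H[i - 1][j] - gap,    # x[i-1] aligned to a gap
                H[i][j - 1] - gap,    # y[j-1] aligned to a gap
            )
            best = max(best, H[i][j])
    return best


# The local fragments shown on the slide: T G K G against A G K G.
print(smith_waterman_score("TGKG", "AGKG"))
```

Unlike the global case, the floor at 0 lets the alignment restart anywhere, so only the best-matching region contributes to the score.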
Needleman-Wunsch Algorithm
• Initialization
F(0, 0) = 0
F(i, 0) = −i * d
F(0, j) = −j * d
• Main Iteration
For each i = 1…M and j = 1…N
F(i,j) = max of:
  F(i-1,j-1) + s(xi,yj)   case 1: xi aligns to yj
  F(i-1,j) - d            case 2: xi aligns to gap
  F(i,j-1) - d            case 3: yj aligns to gap
Ptr(i,j) = DIAG if case 1; UP if case 2; LEFT if case 3
s(xi,yj) = +1 for a match, -1 for a mismatch
Needleman-Wunsch Algorithm (example)
x = ATA (rows, index i); y = AGTA (columns, index j); d = 1; s(xi,yj) = +1 for a match, -1 for a mismatch
F(i,j)    j=0    1    2    3    4
                 A    G    T    A
i=0         0   -1   -2   -3   -4
1    A     -1    1    0   -1   -2
2    T     -2    0    0    1    0
3    A     -3   -1   -1    0    2
F(1,1) = max{ F(0,0) + s(x1,y1) = 1, F(0,1) - 1 = -2, F(1,0) - 1 = -2 } = 1 (case 1)
F(1,2) = max{ F(0,1) + s(x1,y2) = -2, F(0,2) - 1 = -3, F(1,1) - 1 = 0 } = 0 (case 3)
Optimal alignment:
A_TA
AGTA
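The recurrence and the ATA / AGTA example above can be reproduced with a short script. This is a sketch under the slide's parameters (d = 1, +1 match, -1 mismatch); the function name and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions, since the slides do not say how ties are resolved.

```python
def needleman_wunsch(x, y, d=1, match=1, mismatch=-1):
    """Globally align x (rows) with y (columns).

    Returns (F, aligned_x, aligned_y), where F is the score matrix.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # Initialization: the first row and column contain only gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # Main iteration over the three cases.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: xi aligns to yj
                (F[i - 1][j] - d, "UP"),        # case 2: xi aligns to gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: yj aligns to gap
            ]
            # max() keeps the first best candidate, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("ATA", "AGTA")
for row in F:
    print(row)
print(aligned_x)
print(aligned_y)
```

The bottom-right entry of F is the global alignment score, and the traceback follows the stored pointers to recover the aligned strings.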
Multiple Sequence Alignment
• A multiple sequence alignment is a sequence
alignment of three or more biological sequences,
generally protein, DNA, or RNA.
• The input is a set of query sequences that are
assumed to have an evolutionary relationship by
which they share a lineage and are descended from
a common ancestor.
• From the resulting multiple sequence alignment,
phylogenetic analysis can be conducted to assess
the sequences' shared evolutionary origins.
MSA Approaches
• Dynamic programming
• Progressive alignment
• Iterative approach
MSA methods
• Dynamic programming (n-dimensional matrix): exhaustive; accurate; computationally complex, O(N^n)
• Progressive approximation (aligns closest sequences first, heuristics): fast alignment; cannot be modified; local maxima; less accurate; e.g. ClustalW, MAFFT
• Iterative: probabilistic / stochastic (random); slow and less accurate; e.g. GA and HMM
N – sequence length; n – number of sequences
MSA in cloud
• CloudBurst – RMAP
• Does not split sequences to load in cloud
environment
• Not for MSA
• No automatic scale up/down of clusters
• CLUE – proposal from Maryland University
• VM cloning – Snowflock with MPIs
Proposed MSA Approach – Hadoop data grid
[Diagram: sequences S1, S2 and S3 pass through successive Map/Reduce aligner stages, producing the intermediate alignments named in the steps below.]
1) Identify the different permutations:
S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
2) Align each permutation in parallel in Map2:
S1 and S2 are aligned to form A1S1 and A1S2
3) Align the output of the first Map-Reduce with the third
sequence S3 in the Map phase:
A1S1 is aligned with S3
A1S2 is aligned with S3
The better of these two is chosen to form
A2S1, A2S2 and A1S3.
4) Steps 2 and 3 are repeated for all the other permutations in Map1
5) The best combination is chosen by alignment score
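The five steps above can be sketched on a single machine as follows. This is not the presenters' Hadoop code: `nw_align`, `score_order` and `best_order` are hypothetical names, Python's built-in `map` stands in for the Map phase, and gap characters produced by earlier alignments are simply treated as ordinary symbols, which is a simplification.

```python
from itertools import permutations


def nw_align(x, y, d=1):
    """Needleman-Wunsch; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = 1 if x[i - 1] == y[j - 1] else -1
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # Traceback by re-checking which case produced each cell.
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        s = 1 if i > 0 and j > 0 and x[i - 1] == y[j - 1] else -1
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


def score_order(order):
    """Steps 2 and 3: progressively align one permutation, return its total score."""
    first, second, *rest = order
    total, a1, a2 = nw_align(first, second)
    candidates = [a1, a2]
    for nxt in rest:
        # Align each current aligned sequence with the next one; keep the better.
        results = [nw_align(c, nxt) for c in candidates]
        score, kept, new = max(results, key=lambda r: r[0])
        total += score
        candidates = [kept, new]
    return total


def best_order(seqs):
    """Steps 1, 4 and 5: score every permutation and pick the best one."""
    orders = list(permutations(seqs))
    scores = list(map(score_order, orders))  # stand-in for the Map phase
    best = max(range(len(orders)), key=scores.__getitem__)  # stand-in for Reduce
    return orders[best], scores[best]


print(best_order(["AGTA", "ATA", "GAT"]))
```

Because each permutation is scored independently, the `map` call is the natural place to distribute work across nodes.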
[Chart: Varying Number of Sequences of Same Size. X-axis: number of sequences (2 to 10); Y-axis: time in seconds; series: 2 nodes, 3 nodes.]
[Chart: Different Block Sizes. X-axis: block size in KB (10, 100, 1000, 6400); Y-axis: time in seconds; series: 2 nodes, 3 nodes.]
Complexity Analysis
‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘b’ – Average number of blocks in a sequence
‘K’ – Size of 1 block
Complexity measure    Proposed method         Conventional method
Score calculation     O(N)                    O(n*N)
Pairwise alignment    O(K^2)                  O(N^2)
MSA                   O[(n-1) * (N^2)/b]      O(N^n)
Proposed MSA Approach on Cloud
A time-efficient approach to sequence alignment that preserves quality (accuracy) in the cloud
• Using the Hadoop framework
• Dynamic approach → accuracy
• Data and compute parallelism in Hadoop → speed
• Blocking and scalability of Hadoop
• Parallel transfer of sequence splits over the network to remote clusters
• Automated scale up/down of clusters based on the computational needs of the environment.
System Architecture
[Diagram: a client-side virtual environment holding the sequence fragments (AGT….CG) is connected over the Internet to a server-side Hadoop cluster made up of a head server (VM) and new VMs. Labelled steps:]
1. Create virtual environment
2. Split the sequences and transmit them in parallel over the Internet
3. Copy to HDFS
4. Forking VMs / deleting VMs
5. Perform alignment
6. Report the result
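The split and parallel-transmission steps of the architecture can be sketched as below. This only simulates the idea: `send_block` is a hypothetical stand-in for a real network upload, and the worker count of 3 is an assumption rather than a figure from the architecture slide.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sequence(seq, block_size):
    """Cut seq into consecutive blocks of at most block_size characters."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def send_block(indexed_block):
    """Hypothetical upload: a real client would push the block to the server here."""
    index, block = indexed_block
    return index, block


def transfer_in_parallel(seq, block_size, workers=3):
    """Split seq, send the blocks concurrently, then merge them in order."""
    blocks = list(enumerate(split_sequence(seq, block_size)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_block, blocks))
    # Server-side merge: sort by block index before concatenating.
    return "".join(block for _, block in sorted(received))


sequence = "AGT" * 1000
assert transfer_in_parallel(sequence, block_size=2048) == sequence
```

Carrying the block index with each fragment lets the receiver reassemble the sequence even if blocks arrive out of order.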
A single Combination – An illustration
S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”
1. ALIGNMENT OF S1 & S2
        0    1    2    3    4
             A    G    T    A
0       0   -1   -2   -3   -4
1  A   -1    1    0   -1   -2
2  T   -2    0    0    1    0
3  A   -3   -1   -1    0    2
SCORE: 4
A1S1: “AGTA”; A1S2: “A_TA”
2. ALIGNMENT OF A1S1 & S3
        0    1    2    3    4
             A    G    T    A
0       0   -1   -2   -3   -4
1  G   -1   -1    0   -1   -2
2  A   -2    0   -1    1    0
3  T   -3   -1   -1    0   -1
SCORE: -5
A2S1: “AG_TA”; A1S3: “_GAT_”
3. ALIGNMENT OF A1S2 & A1S3
        0    1    2    3    4    5
             A    _    T    A    _
0       0   -1   -2   -3   -4   -5
1  _   -1    0    0   -1   -2   -3
2  G   -2   -1   -1   -1   -2   -2
3  A   -3   -1   -1   -2    0   -1
4  T   -4   -2   -1    0   -1    0
5  _   -5   -3   -1   -1    0    0
SCORE: -3
A2S2: “A__TA_”; A2S3: “_GAT__”
Complexity Analysis
‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘k’ – Average number of blocks in a sequence
‘K’ – Size of 1 block
Complexity measure    Proposed method             Conventional method
Score calculation     O(N)                        O(n*N)
Pairwise alignment    O(K^2)                      O(N^2)
MSA                   O[K^2 * (n(n-1)/2)]         O(N^n)
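To get a feel for the gap between the two MSA bounds, the expressions can be evaluated for one configuration. The values below (n = 5, N = 30,000, K = 2,000) are assumptions derived from the "30 KB sequences with 2 KB splits" experiment mentioned elsewhere in the deck, taking one character per byte. These are asymptotic bounds, so the printed figures indicate orders of magnitude and are not exact operation counts.

```python
def conventional_msa_cost(N, n):
    """Bound for n-dimensional dynamic programming: N^n."""
    return N ** n


def proposed_msa_cost(K, n):
    """Bound from the table: K^2 * n(n-1)/2 block-level pairwise alignments."""
    return K ** 2 * n * (n - 1) // 2


n, N, K = 5, 30_000, 2_000  # assumed values, see the note above
print(f"conventional: {conventional_msa_cost(N, n):.3e}")
print(f"proposed:     {proposed_msa_cost(K, n):.3e}")
```

The conventional bound grows exponentially in the number of sequences, while the proposed bound grows only quadratically in n for a fixed block size.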
2. Parallelised data transfer
‘T’ – time for serial sequence transfer; ‘k’ – block size
T/k – time for sequence transfer in parallel
3. Dynamic cluster creation
Advantage: the computation power of the remote cluster
is optimal and not wasted
Disadvantage: time to set up the cluster
Effect of parallel file transfer
File size (MB)   File transfer (sec)   Split time (sec)   Merge time (sec)   C1 (sec)   T1 (sec)   C2 (sec)   T2 (sec)
100              6.23                  0.02               0.03               2.13       2.18       0.73       0.78
200              9.32                  0.23               0.43               2.96       3.62       1.23       1.89
300              11.43                 0.85               1.64               3.84       6.33       1.16       3.65
C1: Communication time from 3 client VMs to the server without multithreading.
C2: Communication time from 3 client VMs to the server with multithreading.
T1: Total time for file transfer from client to server without multithreading.
T2: Total time for file transfer from client to server with multithreading.
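The totals in the table appear to equal split time + merge time + communication time. That relation is inferred from the numbers and is not stated on the slide. A short check, which also derives the T1/T2 speedup per file size:

```python
# (size_mb, split, merge, c1, t1, c2, t2), copied from the table above.
ROWS = [
    (100, 0.02, 0.03, 2.13, 2.18, 0.73, 0.78),
    (200, 0.23, 0.43, 2.96, 3.62, 1.23, 1.89),
    (300, 0.85, 1.64, 3.84, 6.33, 1.16, 3.65),
]


def total_time(split, merge, comm):
    """Assumed relation: total = split + merge + communication."""
    return split + merge + comm


def speedup(t1, t2):
    """How many times faster the multithreaded transfer is."""
    return t1 / t2


for size, split, merge, c1, t1, c2, t2 in ROWS:
    print(
        size,
        round(total_time(split, merge, c1), 2),
        round(total_time(split, merge, c2), 2),
        round(speedup(t1, t2), 2),
    )
```

The check reproduces T1 and T2 for all three rows, which suggests that the growing split and merge overhead is why the benefit of multithreading shrinks for larger files.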
Time to start virtual machines
[Chart: X-axis: number of VMs (1 to 4); Y-axis: time in seconds (0 to 120).]
Starting the VMs in parallel can reduce this time.
Cluster performance with respect to number of VMs
30 KB sequences with 2 KB splits – up to 5 sequences
[Chart: X-axis: number of sequences (1 to 10); Y-axis: time in seconds; series: 4 slave VMs, 6 slave VMs.]
When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.
Dynamic scaling up/down of clusters
• VMs were instantiated based on the number of Map-Reduce tasks
• The number of tasks was checked dynamically → new VMs were started and tasks were
reallocated
• Old VMs were destroyed if not used
Static VM creation is based on predicted application load (maps + reduces); dynamic VM creation is based on actual application load (maps + reduces). Block size: 10 MB.
File size (GB)   Time, static (min-sec)   VMs   Time, dynamic (min-sec)   New VMs added
1                5-36                     2     3-16                      1
2                5-52                     3     5-40                      1
3                8-27                     4     5-48                      2
5                12-13                    5     6-39                      9
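The scaling rule described above (check the task count, start VMs when more are needed, destroy unused ones) might look like the following sketch. The policy details are assumptions: `slots_per_vm` is a hypothetical capacity parameter, and the slides do not state how tasks map to VMs.

```python
import math


def required_vms(num_tasks, slots_per_vm=2, min_vms=1):
    """VMs needed to run num_tasks map + reduce tasks concurrently."""
    return max(min_vms, math.ceil(num_tasks / slots_per_vm))


def scaling_action(running_vms, num_tasks, slots_per_vm=2):
    """Decide whether to start VMs, destroy idle ones, or leave the cluster as is."""
    target = required_vms(num_tasks, slots_per_vm)
    if target > running_vms:
        return ("start", target - running_vms)
    if target < running_vms:
        return ("destroy", running_vms - target)
    return ("keep", 0)


print(scaling_action(running_vms=2, num_tasks=10))
print(scaling_action(running_vms=5, num_tasks=4))
```

Calling `scaling_action` periodically with the current task count gives the "check, start, reallocate, destroy" loop in miniature.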
Conclusion
1) The proposed MSA improves computation time while
maintaining accuracy.
• Parallelism of sequence alignment at three levels.
• Hadoop data grids – data and compute parallelism and
scalability
• Dynamic programming – accuracy.
2) Complexity is reduced from O(N^n) to O[K^2 * (n * (n-1)/2)]
• Combining progressive and dynamic approaches.
• Blocking in Hadoop
3) Enhancements (using clouds for MSA)
• Automatic configuration of the cloud environment
based on the computational needs
• Efficient upload of data into HDFS by parallel
transfer of sequence fragments over the Internet.
Other Projects
• Enhancement of the existing fair-share scheduler in Hadoop
• Reliability using Reed-Solomon codes
• Hybrid scheduler
• Motif identification for MSA
• CBIR using image signatures
• Text categorization
• Hybrid PSO (PSO and GA) for job scheduling
• Semantic search using the Hadoop framework.
• Others – Globus and GridSim
Acknowledgement
The research was carried out under the PSG-Yahoo
Research programme on Grid and Cloud computing.
Sincere thanks to
1) Dr R Rudramoorthy, Principal,
PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran,
Director, Grid and Cloud Systems Group,
Yahoo, Bangalore
THANK YOU
QUESTIONS?