Parallel DNA Sequence Alignment

Sequence Alignment Speed-Up: A Parallel Approach. University of Salerno, Parallel and Concurrent Computing Course, 19 February 2013. Giuliana Carullo, Luca Pepe, Daniele Valenza

description

The MapReduce model popularized by Google has been used successfully in several scientific applications. We investigated whether this approach can be applied effectively to DNA sequence alignment. In particular, algorithms for both perfect matching and sequence alignment are presented.

Page 1: Parallel DNA Sequence Alignment

SEQUENCE ALIGNMENT SPEED-UP

A PARALLEL APPROACH

University of Salerno

Parallel and Concurrent Computing Course

19 February 2013

Giuliana Carullo Luca Pepe

Daniele Valenza

Page 2: Parallel DNA Sequence Alignment

• Introduction

• Problem definition

• Simple Search

• Approximate Search

• Parallelization

• Cross-Chunk Matching

• Bigger chunk

• On Demand

• Side to Side Sliding Query

• Approximate Search

• Test plan

Page 3: Parallel DNA Sequence Alignment

Sequence alignment is a process for comparing two or more DNA or RNA sequences.

Sequence alignment is performed to find similar or identical regions in the provided sequences, or to check whether a sequence is already known, that is, stored in a database.

Page 4: Parallel DNA Sequence Alignment

DNA STRUCTURE

DNA bases: A C G T

Bonds (base pairs): (A, T) (C, G)

Page 5: Parallel DNA Sequence Alignment

DNA ALIGNMENT

Affinity measures:

• MATCH

• MISMATCH

• GAP

MATCHING TYPE:

• SIMPLE

• REVERSE AND COMPLEMENT

Q: ATGATTACC DNA String

R(Q): CCATTAGTA Reverse

C (R(Q)): GGTAATCAT Complement
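The reverse-and-complement transformation shown above can be sketched in a few lines. This is only an illustration in Python (the function and table names are hypothetical, not taken from the slides); it reproduces the slide's example.

```python
# Complement table for the base pairs (A, T) and (C, G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return C(R(seq)): reverse the string, then complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("ATGATTACC"))  # GGTAATCAT
```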

Page 6: Parallel DNA Sequence Alignment

• Global Alignment:

• Local Alignment:

• Local Alignment:

DNA ALIGNMENT TYPES

Page 8: Parallel DNA Sequence Alignment

SIMPLE SEARCH

Find all the perfect matchings of a short query string in a longer DNA string.

INPUT: DNA string, query string

OUTPUT: number of occurrences, starting positions of the occurrences

Notation:

Number of workers: $n$
Query length: $l_q$
DNA length: $l_d$
Relative position: $\mathit{off}_i$
Absolute position: $s_i$
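As a reference point for the parallel versions, the sequential problem can be sketched as follows. This is a minimal Python illustration with a hypothetical function name, not the authors' implementation.

```python
def simple_search(dna, query):
    """Return the starting positions of every perfect match of query in dna."""
    lq = len(query)
    # Try every alignment of the query against the DNA string.
    return [s for s in range(len(dna) - lq + 1) if dna.startswith(query, s)]


positions = simple_search("ATGATTACCATGA", "ATGA")
print(len(positions), positions)  # 2 [0, 9]
```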

Page 9: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH

Find the $n$ best alignments of a short query string in a longer DNA string.

INPUT: DNA string, query string

OUTPUT: starting positions of the best alignments

Notation:

Number of workers: $n$
Query length: $l_q$
DNA length: $l_d$
Relative position: $\mathit{off}_i$
Absolute position: $s_i$

Page 10: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH – SIMILARITY EVALUATION

Character similarity function:

$$s_i = \begin{cases} x, & \text{Match} \\ y, & \text{Mismatch} \\ z, & \text{Gap} \end{cases} \qquad x > 0;\; y, z \le 0$$

(In this work gaps are not considered)

Objective function to maximize:

$$S = \sum_{i}^{l_q} s_i$$
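The objective function can be made concrete with a small sketch. The slides do not give values for $x$ and $y$, so the defaults below (match = 1, mismatch = -1) are assumptions, as are the function names; gaps are ignored, as in the slides.

```python
def similarity(window, query, match=1, mismatch=-1):
    """Sum the per-character scores of one gap-free alignment."""
    if len(window) != len(query):
        raise ValueError("window and query must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(window, query))


def best_alignments(dna, query, top, match=1, mismatch=-1):
    """Return the `top` best (score, position) pairs, highest score first."""
    lq = len(query)
    scored = [
        (similarity(dna[s:s + lq], query, match, mismatch), s)
        for s in range(len(dna) - lq + 1)
    ]
    # Sort by decreasing score; ties are broken by increasing position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top]


print(best_alignments("ATGATTACC", "ATTA", 2))  # [(4, 3), (2, 0)]
```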

Page 12: Parallel DNA Sequence Alignment

GENERAL IDEA

All the solutions share a common approach based on the MapReduce model:

• The master node splits the DNA string into chunks and scatters them to the worker nodes.
• The workers perform the computation and send their results back to the master.
• The master combines the individual solutions and returns the output.

Care must be taken with matching strings that cross the boundary between two chunks.

Page 13: Parallel DNA Sequence Alignment

GENERIC SPLIT AND COMPUTATION

[Figure: the DNA string is split into chunks $i-1$, $i$ and $i+1$, each of size $l_d/n$. The query string slides along the DNA string over time steps $T_0$ to $T_7$. Legend: complete matching, partial matching.]

Page 14: Parallel DNA Sequence Alignment

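GENERIC REDUCE PHASE

Workers output: (worker ID, offset) pairs, e.g. $(i, \mathit{off}_i)$ and $(j, \mathit{off}_j)$, each locating a match of size $l_q$ of the query string inside that worker's part of the DNA string.

Final output: absolute positions in the DNA string,

$$s_i = i \cdot (l_d / n) + \mathit{off}_i, \qquad s_j = j \cdot (l_d / n) + \mathit{off}_j$$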
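The generic split, computation and reduce steps can be simulated sequentially. The sketch below is an illustration only: workers are plain loop iterations rather than MPI processes, worker IDs are assumed to start at 0, $n$ is assumed to divide $l_d$, and all names are hypothetical. It also shows why cross-chunk strings need special care: a match that straddles a boundary is missed.

```python
def split(dna, n):
    """Split dna into n chunks of size ld/n (assumes n divides len(dna))."""
    size = len(dna) // n
    return [dna[i * size:(i + 1) * size] for i in range(n)]


def to_absolute(worker_id, offset, ld, n):
    """Reduce step: translate (worker ID, offset) into an absolute position."""
    return worker_id * (ld // n) + offset


def map_reduce_search(dna, query, n):
    # "Workers": each chunk is searched independently.
    worker_output = [
        (i, off)
        for i, chunk in enumerate(split(dna, n))
        for off in range(len(chunk) - len(query) + 1)
        if chunk.startswith(query, off)
    ]
    # "Master": combine the partial results.
    return [to_absolute(i, off, len(dna), n) for i, off in worker_output]


print(map_reduce_search("GGATTACAGG", "GG", 2))    # [0, 8]
print(map_reduce_search("GGATTACAGG", "TTAC", 2))  # [] (match at 3 crosses the boundary)
```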

Page 16: Parallel DNA Sequence Alignment

SOLUTION APPROACHES

Bigger chunk:

The master sends every worker a chunk of size $s = l_d/n + l_q - 1$, so that cross-chunk matching strings can be found.

On Demand:

The master sends chunks of size $s = l_d/n$. If a worker finds a partial matching at the end of its chunk, it asks for the remaining part, $r \le l_q - k$, so that cross-chunk matching strings can be found.

Two possible heuristics: big request and small request.

Side to Side Sliding Query (3SQ):

Every worker receives a chunk of size $s = l_d/n$ and computes its complete matchings and all its partial matchings. The master then combines the partial matchings to find the cross-chunk matchings.

Page 17: Parallel DNA Sequence Alignment

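BIGGER CHUNK APPROACH

[Figure: chunks $i-1$, $i$ and $i+1$, each of size $l_d/n$ plus $l_q - 1$ extra characters; the extra characters are the same characters that open the next chunk ("Same Char"). The query string slides along the DNA string over time steps $T_0$ to $T_7$. Legend: complete matching.]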

Page 18: Parallel DNA Sequence Alignment

BIGGER CHUNK APPROACH

ADVANTAGES:

• it does not require communication between workers;
• it does not produce duplicated occurrences;
• the master has very little sequential work to perform.

DISADVANTAGES:

• each worker (except the last one) receives $l_q - 1$ extra characters. This produces an extra bandwidth usage $b_e$ of:

$$b_e = (l_q - 1) \cdot (n - 1)$$
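A sequential sketch of the Bigger Chunk split may help to check the formula for $b_e$. It is an illustration under stated assumptions: $n$ divides $l_d$, each chunk is at least $l_q - 1$ characters long, workers are simulated by a loop, and the function names are hypothetical.

```python
def bigger_chunks(dna, n, lq):
    """Chunk i covers ld/n characters plus the lq - 1 that follow it."""
    size = len(dna) // n
    # Slicing past the end of the string truncates the last chunk naturally.
    return [dna[i * size:(i + 1) * size + lq - 1] for i in range(n)]


def extra_bandwidth(n, lq):
    """b_e = (lq - 1) * (n - 1): every worker but the last gets lq - 1 extras."""
    return (lq - 1) * (n - 1)


def bigger_chunk_search(dna, query, n):
    size = len(dna) // n
    positions = []
    for i, chunk in enumerate(bigger_chunks(dna, n, len(query))):
        for off in range(len(chunk) - len(query) + 1):
            if chunk.startswith(query, off):
                positions.append(i * size + off)
    return positions


print(bigger_chunk_search("GGATTACAGG", "TTAC", 2))  # [3]
print(extra_bandwidth(2, 4))                         # 3
```

Because chunk $i$ only admits starting offsets from 0 to $l_d/n - 1$, each occurrence is reported by exactly one worker, which is why no duplicates arise.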

Page 19: Parallel DNA Sequence Alignment

REDUCE PHASE

Bigger Chunk:

analogous to the generic approach

On Demand:

analogous to the generic approach

Side to Side Sliding Query (3SQ):

additional work is performed by the master node to combine the partial matchings.

Page 21: Parallel DNA Sequence Alignment

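ON DEMAND – BIG REQUEST APPROACH

[Figure: chunks $i$ and $i+1$, each of size $l_d/n$. The query string slides along the DNA string over time steps $T_0$, $T_j$, $T_{j+1}$, $T_{j+2}$, $T_{j+3}$; character comparisons are marked "v" or "x". Legend: complete matching, partial matching.]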

Page 22: Parallel DNA Sequence Alignment

ON DEMAND – BIG REQUEST APPROACH

ADVANTAGES:

• extra data is requested only when needed
• it does not produce duplicated occurrences
• each worker performs a single request

DISADVANTAGES:

• extra overhead for the big request
• potentially useless extra characters

Page 23: Parallel DNA Sequence Alignment

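ON DEMAND – SMALL REQUEST APPROACH

[Figure: chunks $i$ and $i+1$, each of size $l_d/n$. The query string slides along the DNA string over time steps $T_0$, $T_j$, $T_{j+1}$, $T_{j+2}$, $T_{j+3}$; character comparisons are marked "v" or "x". Legend: complete matching, partial matching.]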

Page 24: Parallel DNA Sequence Alignment

ON DEMAND – SMALL REQUEST APPROACH

ADVANTAGES:

• extra data is requested only when needed
• it does not produce duplicated occurrences
• better bandwidth usage than big request

DISADVANTAGES:

• the number of requests grows proportionally to the length of the query
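The slides do not spell out the request protocol, so the following sketch rests on an assumption: a "small request" asks for one extra character at a time and stops as soon as every partial matching at the end of the chunk has been confirmed or ruled out. The callback `fetch_next_char` is hypothetical and stands in for the message sent to the master or to the neighbouring worker.

```python
def on_demand_worker(chunk, query, fetch_next_char):
    """Return (offsets of matches starting in chunk, number of requests)."""
    lq, lc = len(query), len(chunk)
    matches = [s for s in range(lc - lq + 1) if chunk.startswith(query, s)]
    # Partial matchings: a suffix of the chunk equals a prefix of the query.
    pending = [s for s in range(max(lc - lq + 1, 0), lc)
               if query.startswith(chunk[s:])]
    received = 0
    while pending:
        char = fetch_next_char()
        if char == "":          # end of the DNA string: nothing left to ask for
            break
        received += 1
        still_pending = []
        for s in pending:
            # Index in the query of the character just received.
            pos = (lc - s) + received - 1
            if query[pos] == char:
                if pos == lq - 1:
                    matches.append(s)   # cross-chunk matching completed
                else:
                    still_pending.append(s)
        pending = still_pending
    return matches, received


rest = iter("ACAGG")  # characters that follow the chunk in the DNA string
print(on_demand_worker("GGATT", "TTAC", lambda: next(rest, "")))  # ([3], 2)
```

Under the same assumption, a "big request" would instead ask once for all the characters that the shortest partial matching still needs, at the cost of possibly transferring characters that turn out to be useless.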

Page 25: Parallel DNA Sequence Alignment

ON DEMAND – CENTRALIZED VS DISTRIBUTED

Two kinds of communication can be adopted:

Centralized: the request is made to the master node

Distributed: the request is made to the adjacent node on the right

Page 26: Parallel DNA Sequence Alignment

ON DEMAND – CENTRALIZED VS DISTRIBUTED

Centralized

• Advantages: master idle time is reduced.
• Disadvantages: a linearization point is added; access to the DNA must be performed.

Distributed

• Advantages: no extra accesses to the DNA are needed; no linearization point.
• Disadvantages: extra data requests may be slowed down.

Page 29: Parallel DNA Sequence Alignment

SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH

[Figure: chunks $i-1$, $i$ and $i+1$, each of size $l_d/n$. The query string slides along the DNA string over time steps $T_0$ to $T_3$ and $T_j$ to $T_{j+3}$. Legend: complete matching, left-side partial matching, right-side partial matching.]

Page 30: Parallel DNA Sequence Alignment

SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH

ADVANTAGES:

• no extra data is required
• it does not produce duplicated occurrences
• no extra communication is needed
• the master does not need to store the DNA string
• it reduces the bandwidth needed to check cross-chunk strings: workers return bits instead of integers.

DISADVANTAGES:

• extra work is required from the master (combining the partial matchings)

Page 32: Parallel DNA Sequence Alignment

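3SQ REDUCE PHASE

Worker $i$, right-side array:      1 1 0 1
Worker $i+1$, left-side array:     1 0 0 1
Results array (bitwise AND):       1 0 0 1

Each bit set in the results array (at indices $j$ and $k$ in the figure) marks a query match of size $l_q$ that crosses the chunk boundary in the DNA string.

Final output positions:

$$s_j = i \cdot (l_d / n) - j, \qquad s_k = i \cdot (l_d / n) - k$$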
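The combine step can be sketched as below. The array layout is an assumption made for illustration, since the slides do not define it: entry $k$ (for $k = 1, \dots, l_q - 1$) of the right-side array says whether the last $k$ characters of a chunk equal the first $k$ characters of the query, and entry $k$ of the left-side array says whether the first $l_q - k$ characters of the next chunk equal the rest of the query. Chunks are assumed to be at least $l_q - 1$ characters long, and `boundary` is the absolute position at which the right-hand chunk starts.

```python
def right_side(chunk, query):
    """Bit k-1 is 1 if the last k chars of chunk match the first k of query."""
    return [int(chunk[-k:] == query[:k]) for k in range(1, len(query))]


def left_side(chunk, query):
    """Bit k-1 is 1 if the first lq-k chars of chunk match query[k:]."""
    lq = len(query)
    return [int(chunk[:lq - k] == query[k:]) for k in range(1, lq)]


def combine(right, left, boundary):
    """Master step: AND the two arrays; a set bit k means a match at boundary - k."""
    return [boundary - k
            for k, (r, l) in enumerate(zip(right, left), start=1)
            if r & l]


right = right_side("GGATT", "TTAC")   # [1, 1, 0]
left = left_side("ACAGG", "TTAC")     # [0, 1, 0]
print(combine(right, left, boundary=5))  # [3]
```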

Page 34: Parallel DNA Sequence Alignment

PARALLELIZATION MODEL

Same as simple search:

• Splitting phase: same as in simple search.
• Computation phase:
  • The similarity function is evaluated for every alignment of the query string.
  • As in simple search, cross-chunk strings must be considered.
  • Every worker returns its $n$ best similarity values, with their relative positions.
• Reduce phase: all the similarity values are merged in order and the best $n$ alignments are returned.

Page 35: Parallel DNA Sequence Alignment

REDUCE PHASE

Worker 1:

Off.  Similarity
X     10
Y     8
Z     3

Worker 2:

Off.  Similarity
U     7
V     2
W     -1

Worker 3:

Off.  Similarity
A     5
B     -3
C     -6

ORDERED MERGE:

W. Id  Off.  Sim.
1      X     10
1      Y     8
2      U     7
3      A     5
1      Z     3
2      V     2
2      W     -1
3      B     -3
3      C     -6

POS. TRANSLATION, FINAL OUTPUT:

Pos.  Similarity
X’    10
Y’    8
U’    7
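The ordered merge on this slide is a k-way merge of lists that are already sorted by decreasing similarity. A minimal sketch using Python's standard `heapq.merge` follows; the data is the slide's example, while the function name and the dictionary layout are hypothetical, and the position translation step is left out.

```python
import heapq
from itertools import islice


def ordered_merge(worker_results, top):
    """Merge per-worker (offset, similarity) lists, each sorted by decreasing
    similarity, and return the `top` best (worker ID, offset, similarity)."""
    streams = [
        [(wid, off, sim) for off, sim in rows]
        for wid, rows in worker_results.items()
    ]
    # Negating the similarity turns "largest first" into the ascending
    # order that heapq.merge expects from its inputs.
    merged = heapq.merge(*streams, key=lambda row: -row[2])
    return list(islice(merged, top))


results = {
    1: [("X", 10), ("Y", 8), ("Z", 3)],
    2: [("U", 7), ("V", 2), ("W", -1)],
    3: [("A", 5), ("B", -3), ("C", -6)],
}
print(ordered_merge(results, 3))  # [(1, 'X', 10), (1, 'Y', 8), (2, 'U', 7)]
```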

Page 36: Parallel DNA Sequence Alignment

CROSS-CHUNK MATCHING

Bigger chunk:

The master sends every worker a chunk of size $s \le l_d/n + l_q - 1$, so that cross-chunk similarities can be evaluated.

Side to Side Sliding Query (3SQ):

Every worker receives a chunk of size $s = l_d/n$ and computes its similarity values and all its partial similarities (left side and right side). The master then sums the partial similarities to obtain the similarity values of the cross-chunk strings.

Page 37: Parallel DNA Sequence Alignment

3SQ PARTIAL SIMILARITY COMBINE PHASE

Worker $i$ (chunk $i$), right-side array:        4 2 0 1
Worker $i+1$ (chunk $i+1$), left-side array:     1 0 3 -4
Results array (element-wise sum):                5 2 3 -3

OUTPUT:

W. Id  Off.   Sim.
i      $s_j$  5
i      $s_k$  3
i      …

where $j$ and $k$ are indices in the results array and the offset of an entry is

$$s_j = l_c - (l_q - 1) + j$$
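For approximate search the combine step is an element-wise sum instead of an AND. The sketch below uses an assumed array layout (entry $k$ covers the alignment that has $k$ characters in the left chunk and $l_q - k$ in the right one), assumed scores of 1 for a match and -1 for a mismatch, chunks of at least $l_q - 1$ characters, and hypothetical function names.

```python
def score(a, b, match=1, mismatch=-1):
    """Gap-free similarity of two equally long strings."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def right_partial(chunk, query):
    """Entry k-1: similarity of the last k chars of chunk with query[:k]."""
    return [score(chunk[-k:], query[:k]) for k in range(1, len(query))]


def left_partial(chunk, query):
    """Entry k-1: similarity of the first lq-k chars of chunk with query[k:]."""
    lq = len(query)
    return [score(chunk[:lq - k], query[k:]) for k in range(1, lq)]


def combine(right, left):
    """Master step: sum the partial similarities entry by entry."""
    return [r + l for r, l in zip(right, left)]


# The arrays from the slide.
print(combine([4, 2, 0, 1], [1, 0, 3, -4]))  # [5, 2, 3, -3]

# Cross-chunk similarities for chunks "GGATT" | "ACAGG" and query "TTAC".
print(combine(right_partial("GGATT", "TTAC"), left_partial("ACAGG", "TTAC")))
# [-2, 4, -2]
```

In the second example, entry $k$ equals the similarity of the query aligned at absolute position $5 - k$ of the full string, as a direct evaluation confirms.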

Page 39: Parallel DNA Sequence Alignment

OVERVIEW

Varying parameters:

• Number of workers
• Query length

We plan to evaluate the running times of every algorithm presented. The analysis of these results will validate our proposal and highlight the algorithm that performs best.

Page 40: Parallel DNA Sequence Alignment

DEVELOPMENT AND BENCHMARKING

Page 41: Parallel DNA Sequence Alignment

• Implementation

• Introduction

• DNA Splitting

• Bandwidth usage

• Communication

• Benchmarking

• Testing environment

• Test plan

• Results

• Conclusions

Page 42: Parallel DNA Sequence Alignment

INTRODUCTION

Every proposed algorithm has been implemented in C using the Open MPI library.

Advantages:

• High performance
• Scalability
• Portability

Page 43: Parallel DNA Sequence Alignment

DESIGN CHOICES: DNA SPLITTING

A natural approach: load the DNA entirely from file, compute its size ($l_d$), split it into $n$ chunks and send them to the workers.

Problems:

• A genome may be very large ($3.0 \times 10^9$ bp (base pairs)).
• The available memory may not be enough.

Design choice:

The whole DNA is never needed at once, so it is never entirely loaded in memory. First the DNA size and the chunk size are computed; then, step by step, $l_c$ characters are read from file and sent to a worker.
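The streaming idea can be sketched with a generator that never holds more than one chunk in memory. This is an illustration only: an in-memory `io.StringIO` stands in for the DNA file, yielding a chunk stands in for sending it to a worker, and the function name is hypothetical.

```python
import io


def stream_chunks(handle, lc):
    """Yield successive chunks of at most lc characters from an open file."""
    while True:
        chunk = handle.read(lc)
        if not chunk:      # end of file
            break
        yield chunk


dna_file = io.StringIO("GGATTACAGG")
for worker_id, chunk in enumerate(stream_chunks(dna_file, 4)):
    print(worker_id, chunk)
# 0 GGAT
# 1 TACA
# 2 GG
```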

Page 44: Parallel DNA Sequence Alignment

DESIGN CHOICES: BANDWIDTH USAGE

The messages exchanged during the simple search computation would normally consist of:

• Characters (splitting phase)
• Integers (reduce phase)

Bandwidth usage:

• Splitting phase: 1 byte (char size) × $l_c$ × $n$
• Reduce phase: 4 bytes (integer size) × $l_c$ × $n$ (limit case in which every position is a matching)

Can we do better? … YES!

Page 45: Parallel DNA Sequence Alignment

DESIGN CHOICES: BANDWIDTH USAGE

In the Simple Search algorithm, compression can drastically reduce bandwidth consumption.

Simple Search reduce phase compression: instead of sending the actual positions, a bit array of size $l_c$ is used.

Bit array construction:

for each position, the bit is set to 1 if a matching starts there, and to 0 otherwise.

Compression ratio:

1:32 (e.g. the space taken by 4 integers describes 128 positions instead of 4)
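A minimal sketch of the bit-array encoding is given below. The bit order inside each byte and the function names are arbitrary choices made for illustration; the slides do not specify them.

```python
def positions_to_bits(positions, lc):
    """Pack match positions of a chunk of lc characters into a bit array."""
    bits = bytearray((lc + 7) // 8)
    for p in positions:
        bits[p // 8] |= 1 << (p % 8)
    return bytes(bits)


def bits_to_positions(bits, lc):
    """Recover the match positions from the bit array."""
    return [p for p in range(lc) if (bits[p // 8] >> (p % 8)) & 1]


packed = positions_to_bits([0, 9, 127], 128)
print(len(packed))                     # 16 bytes for 128 positions
print(bits_to_positions(packed, 128))  # [0, 9, 127]
```

Sixteen bytes is the space of four 4-byte integers, and sending all 128 positions as integers would take 512 bytes, which is the 1:32 ratio quoted on the slide.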

Page 46: Parallel DNA Sequence Alignment

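COMMUNICATION

For each phase: messages, data type, type.

Bigger Chunk

• Master to workers: N(ld/n + lq − 1); Char; Async / Sync
• Extra communication: none
• Workers to master: N(ld/n); Int / Bit; Sync / Sync

On Demand

• Master to workers: N(ld/n); Char; Async / Sync
• Extra communication: (N − 1)(lq − 1); Char; Sync / Sync
• Workers to master: N(ld/n); Int / Bit; Sync / Sync

3SQ

• Master to workers: N(ld/n); Char; Async / Sync
• Extra communication: none
• Workers to master: N(ld/n) + 2(lq − 1); Int / Bit; Sync / Sync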

Page 48: Parallel DNA Sequence Alignment

TESTING ENVIRONMENT

Cluster

8 nodes, 100 Mbps Ethernet connection

Node

CPU: Intel Xeon Dual Core 2.8 GHz

RAM: 4 GB

Hard Drive: 2 x 30 GB SCSI

Software

OS: Debian 6.0.4

OpenMPI 1.6.1

Page 49: Parallel DNA Sequence Alignment

TEST PLAN

Benchmarking consisted of measuring and comparing the running times of each algorithm as a function of the following parameters:

• Number of processors (number of workers + 1): 2, 4, 8, 16
• DNA length: Small (5 MB), Medium (149 MB), Large (292 MB)
• Query length: Small (8 bytes), Medium (32 bytes), Large (64 bytes)
• Number of best alignments (approximate search only): 10, 50, 100

On the original slide, the value shown in grey is the one held fixed when a parameter is not being evaluated.

Page 50: Parallel DNA Sequence Alignment

SIMPLE SEARCH: NUMBER OF WORKERS (1/2)

Results:

• Good scalability for every algorithm.
• 3SQ is worse than the others because additional sequential work must be performed.

Page 51: Parallel DNA Sequence Alignment

SIMPLE SEARCH: NUMBER OF WORKERS (2/2)

Results:

• Bigger Chunk Bit performs better than the int solution.
• As the number of processors increases, Bigger Chunk performs better than the others because more cross-chunk matchings occur.
• No relevant improvement occurred between 8 and 16 processors.

Page 52: Parallel DNA Sequence Alignment

SIMPLE SEARCH: SPEED UP

[Chart: "Speed Up Simple Search" (DNA size: Big, query size: Small). x-axis: number of processors (2, 4, 8, 16); y-axis: speedup (0 to 4); series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.]

Results:

• Speedup increases for every algorithm (except BC-int).
• The speedup grows proportionally to $n + 1$.
• BC-int suffers from a network bottleneck due to the size of its messages.

Page 53: Parallel DNA Sequence Alignment

SIMPLE SEARCH: DNA LENGTH

Results:

• Good scalability for every algorithm.
• 3SQ is worse: it has more sequential work than the others.
• Bigger Chunk Bit performs better than the int solution.
• Execution time grows linearly with the DNA size.

Page 54: Parallel DNA Sequence Alignment

SIMPLE SEARCH: QUERY LENGTH

Results:

• 3SQ is highly sensitive to variations in query length because of the partial matching combine phase.
• No significant variation for the other algorithms, since a single query matching is interrupted at the first mismatch found.

Page 55: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH: NUMBER OF WORKERS

Results:

• Running times decrease linearly with the number of processors.
• 3SQ is only slightly worse than Bigger Chunk because the sequential work is almost the same (Ordered Merge).

Page 56: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH: SPEED UP

[Chart: "Speed Up Approximate Search" (DNA size: Medium, query size: Small). x-axis: number of processors (2, 4, 8, 16); y-axis: speedup (0 to 16); series: 3SQ, BC-int.]

Results: the speedup is globally better than in simple search and close to the ideal value.

Page 57: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH: DNA SIZE

Results: running times grow linearly with the DNA size.

Motivation: the main sequential computation is the Ordered Merge, which has linear complexity.

Page 58: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH: QUERY SIZE

Results: running times are influenced by the query size.

Motivation: the computation of the similarity function is affected by the query length.

Page 59: Parallel DNA Sequence Alignment

APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS

Results: running times grow almost linearly.

Motivation: each worker returns to the master as many alignments as the number of best alignments requested, so the ordered merge process is affected by that number.

[Chart: "Approximate Search" (DNA size: Big, processors: 16, query size: Small). x-axis: number of best alignments (10, 50, 100); y-axis: running time in seconds (0 to 35); series: BC-int, 3SQ.]

Page 61: Parallel DNA Sequence Alignment

The winner is….

Bigger Chunk

On Demand

3SQ

Page 62: Parallel DNA Sequence Alignment

IMPROVEMENTS

Further improvements can be applied to the presented algorithms.

Splitting phase: the DNA alphabet consists of only 4 characters, so 2 bits are enough to represent a character, instead of 8.

Bit mapping:

e.g. A=00, T=01, C=10, G=11

Compression ratio:

1:4 (e.g. one character's worth of space holds 4 bases instead of 1)
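The proposed 2-bit encoding can be sketched as follows, using the bit mapping given on the slide. The packing order inside each byte and the function names are assumptions made for illustration; the original length must be sent along with the packed data, because the last byte may be only partly used.

```python
# Bit mapping from the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq):
    """Pack four bases into each byte, two bits per base."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed data."""
    return "".join(
        BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


packed = pack("ATGATTAC")
print(len(packed))          # 2 bytes for 8 bases
print(unpack(packed, 8))    # ATGATTAC
```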

Page 63: Parallel DNA Sequence Alignment

IMPROVEMENTS

3SQ algorithm:

The partial matchings combine phase can be performed in a distributed manner.

Each node sends its left or right partial matchings to its left or right sibling, which combines them with its own results and sends them to the master.

In this way the sequential work can be reduced.

Page 64: Parallel DNA Sequence Alignment

Thanks !