
Implementation of the Smith-Waterman Algorithm on the FLEET Simulator

by

Alfred Yu-Han Pang

B.Ap.Sc., The University of British Columbia, 1998

AN ESSAY SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

The Faculty of Graduate Studies

(Computer Science)

The University Of British Columbia

February, 2007

© Alfred Yu-Han Pang 2007

Abstract

FLEET is an experimental architecture proposed by researchers at SUN Microsystems and UC Berkeley. Reflecting the fact that wiring dominates the power, delay, and area costs of most CPU designs, the FLEET architecture focuses on communication. A FLEET processor may have a large number of functional units operating in parallel. Programming essentially becomes deciding how data should be moved between these functional units.

The Berkeley/SUN research has focused on hardware design issues, and existing software examples for FLEET are very simple. In this research, we used the Smith-Waterman algorithm for string comparison as an example of a larger scale application of FLEET, as this algorithm can be accelerated through the use of computer architectures with parallel execution capability. Smith-Waterman is a dynamic programming algorithm that has applications in biology for comparing DNA and protein sequences. We attempted to evaluate the FLEET architecture’s potential for exploiting parallelism by implementing the Smith-Waterman algorithm on a simulator for FLEET. While we were not able to complete the implementation, our work revealed some shortcomings of the current simulator and programming model. On the other hand, we believe that the architecture has merit for high-performance computing, and our experience suggests ways that the tools for FLEET could be improved to facilitate the development of FLEET software.


Table of Contents

Abstract

Table of Contents

List of Figures

Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Terminology
  1.3 Project Proposal
    1.3.1 Goal
    1.3.2 Summary of Results

2 Background
  2.1 Smith-Waterman Algorithm
  2.2 Summary of Relevant Research
    2.2.1 SWASAD: An ASIC Design for High Speed DNA Sequence Matching
    2.2.2 A Run-Time Reconfigurable System for Gene-Sequence Searching (Hokiegene)
    2.2.3 A Smith-Waterman Systolic Cell
    2.2.4 Cray XD1
    2.2.5 Other Papers
  2.3 FLEET Architecture
    2.3.1 Architecture
    2.3.2 Switch Fabric and GaSP Circuits
    2.3.3 Programming for FLEET
  2.4 Simulator and Tools
    2.4.1 Simulator
    2.4.2 Visualization Tool

3 Smith-Waterman SHIP Design and Implementation
  3.1 High-Level Design (Alpha version)
    3.1.1 Overall Execution
    3.1.2 Smith-Waterman SHIP
    3.1.3 Detailed Design
  3.2 Results
    3.2.1 Experience of Designing and Implementing the SHIP

4 Discussion
  4.1 Revised SHIP Design (Beta version)
    4.1.1 SHIP Design for Beta version
  4.2 Simulation or Design Exploration
    4.2.1 Simulating Signals
    4.2.2 Design Exploration
  4.3 Future Work
  4.4 Conclusion

Bibliography

A Project Description from Proposal submitted April 2006

List of Figures

2.1 Calculating matrix for Smith-Waterman
2.2 Cells that can be updated in parallel
3.1 Evaluation strategy for FLEET implementation
3.2 Smith-Waterman SHIPs in ring configuration
3.3 Smith-Waterman SHIP

Acknowledgements

I would like to thank my supervisor Mark Greenstreet for the research assistant funding, as well as for his wisdom and guidance. As well, I would like to thank Igor Benko, the author of the FLEET simulator, for taking the time to respond to my email questions.


Chapter 1

Introduction

Why study computer architecture? We are currently reaching the limits of how far we can go in terms of the performance of single computer chips. Academic (and industry) research is leaning towards parallel computing in order to extract more computational power [5].

This project involves the implementation of the Smith-Waterman algorithm on the FLEET simulator. We would like to start with a brief description of FLEET to provide some insight into the motivation for this work.

1.1 Motivation

The overall goal of the FLEET architecture is to design a computer architecture that executes as many concurrent instructions as possible in an asynchronous fashion. Specifically, we are exploring the area of bioinformatics for computational problems that would be a good fit for evaluating the FLEET architecture. Algorithms in bioinformatics need to exploit parallel computation in order to get results with reasonable execution time.

The algorithm that we have chosen for the project is Smith-Waterman, which essentially computes the edit distance between two strings. Out of all the possible choices of algorithms, this one was chosen because there has been substantial prior work on making Smith-Waterman run on platforms with parallel execution capability, which lets us take advantage of insights and ideas from existing work. For our particular work on FLEET, we hope to observe performance similar to that of other solutions, and possibly other advantages as well, such as programmability.

1.2 Terminology

• FLEET - This term refers to the architecture and work on asynchronous computer architecture design. The idea is to have a communication focused design that allows for data to be passed easily between the many computation units on the chip. By having all the computational units work in parallel, it is anticipated that high performance could be achieved, especially in scientific computing. There are also power reduction considerations with the design, but we will not be focusing on this aspect of the design in this work.

• SHIP - A computation unit in the FLEET architecture. Data is fed into input terminals and results come out of output terminals. One can almost treat SHIPs like logic components.

• switch fabric - The circuitry to support the routing of data between multiple sources and destinations.

1.3 Project Proposal

The proposal for the project was submitted in April 2006 with the intention that the work and the report be completed by the end of August 2006. The relevant excerpt has been included in the appendix.

1.3.1 Goal

The goal of the project is to create a Smith-Waterman SHIP that could be deployed in multiples to concurrently evaluate the dynamic programming tableau entries. Essentially, there are two levels to the design: the SHIP component level, and the integration level of how all the components come together.

1.3.2 Summary of Results

Our design for the Smith-Waterman SHIP turned out to be incompatible with the simulator’s operational model, so unfortunately we were not able to complete the implementation. In the process, however, we have gained some experience and insight into working with FLEET. While the ideas and approach to parallel computing as set out by FLEET seem promising, much of the challenge will be to provide a more straightforward way to program for FLEET.


Chapter 2

Background

2.1 Smith-Waterman Algorithm

The Smith-Waterman Algorithm is used to compute the edit distance between two sequences, typically of either DNA or protein, using a dynamic programming approach. This is done by creating a matrix with cells indicating the cost to change a subsequence of one into a subsequence of the other. By building on the edit distances of the subsequences, Smith-Waterman provides an efficient way to compare the sequences.

The algorithm as originally published in [16] is often imitated but rarely implemented. In particular, papers from computer architecture journals tend to use their own simplified reinterpretation of the algorithm. The original algorithm has some aspects that can complicate the goal of showcasing a particular hardware architecture. For our work, we will follow the interpretation of the algorithm as described in [19], which we feel allows us to concentrate on the work of evaluating FLEET rather than dealing with the nuances of the full bioinformatics algorithm and implications.

As mentioned previously, the algorithm runs by filling in cells in a matrix. The cell calculation is shown in Figure 2.1 for the comparison of sequence S and sequence T. The matrix is filled out by computing d as follows:

\[
d = \min
\begin{cases}
a & \text{if } S_i = T_j \\
a + \mathit{sub} & \text{if } S_i \neq T_j \\
b + \mathit{ins} \\
c + \mathit{del}
\end{cases}
\tag{2.1}
\]

The four possible cases are:

1. The next characters in both sequences match, so there is no penalty.

2. The next character can be matched by an insertion, with penalty ins.

3. The next character can be matched by a deletion, with penalty del.

4. The next character can be matched by a substitution, with penalty sub.
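As an illustration, Equation 2.1 can be sketched in a few lines of Python. The function name `fill_matrix`, the boundary initialisation, and the default penalties (ins = del = 1, sub = 2) are assumptions made for this sketch; they were chosen because they reproduce the completed table in Figure 2.1(b), not because the essay prescribes them.

```python
def fill_matrix(s, t, ins=1, dele=1, sub=2):
    """Fill the dynamic programming matrix of Equation 2.1.

    Columns follow the characters of s and rows follow the characters of t.
    The first row and column are initialised to 0, 1, 2, ... (for unit gap
    penalties), as in Figure 2.1(a).
    """
    rows, cols = len(t) + 1, len(s) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        m[0][j] = j * dele
    for i in range(rows):
        m[i][0] = i * ins
    for i in range(1, rows):
        for j in range(1, cols):
            a = m[i - 1][j - 1]  # diagonal neighbour
            b = m[i - 1][j]      # neighbour above
            c = m[i][j - 1]      # neighbour to the left
            diag = a if s[j - 1] == t[i - 1] else a + sub
            m[i][j] = min(diag, b + ins, c + dele)
    return m


if __name__ == "__main__":
    for row in fill_matrix("ACG", "ATC"):
        print(row)
```

Calling `fill_matrix("ACG", "ATC")` yields the same values as the completed table in Figure 2.1(b).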


          A   C   G
      0   1   2   3
  A   1
  T   2
  C   3

(a) Initial table

          A   C   G
      0   1   2   3
  A   1   0   1   2
  T   2   1   2   3
  C   3   2   1   2

(b) Completed table

          Si
      a   b
  Tj  c   d

(c) Formula

Figure 2.1: Calculating matrix for Smith-Waterman

Figure 2.2: Cells that can be updated in parallel

A parallel implementation of the Smith-Waterman algorithm can be obtained by computing the diagonal cells in parallel. This is a feature of Smith-Waterman that all of the hardware implementations exploit. Figure 2.2 shows the cells that can be calculated independently. As each diagonal is computed (dark squares in the figure), the next diagonal will have its required dependencies available, and thus the entire table can be filled this way.
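The diagonal ordering can be made concrete with a short Python sketch. The generator name `anti_diagonals` is hypothetical; it simply enumerates, for a table of the given size, the groups of cells whose indices sum to the same value, since every cell in such a group depends only on cells from the two preceding groups.

```python
def anti_diagonals(rows, cols):
    """Yield lists of (row, col) cells that can be updated in parallel.

    Cell (i, j) depends on (i-1, j-1), (i-1, j) and (i, j-1), all of which
    lie on earlier anti-diagonals (smaller i + j).
    """
    for k in range(rows + cols - 1):
        lo = max(0, k - cols + 1)
        hi = min(rows - 1, k)
        yield [(i, k - i) for i in range(lo, hi + 1)]


if __name__ == "__main__":
    for step, cells in enumerate(anti_diagonals(3, 4)):
        print(step, cells)
```

The longest group has min(rows, cols) cells, which is the largest number of processing units that can be kept busy at once under this ordering.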

Suppose that the penalties of insertion and deletion are restricted to 1, so that b and c each equal either a − 1 or a + 1. Then, the equation can be simplified as follows, as noted by Lipton and Lopresti [11]:

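\[
d =
\begin{cases}
a & \text{if } ((b \text{ or } c) = a - 1) \text{ or } (S_i = T_j) \\
a + 2 & \text{if } ((b \text{ and } c) = a + 1) \text{ and } (S_i \neq T_j)
\end{cases}
\tag{2.2}
\]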
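One way to gain confidence that Equation 2.2 agrees with Equation 2.1 is to fill the same table with both cell rules and compare. The sketch below assumes ins = del = 1 and sub = 2 (the latter inferred from the a + 2 case); the function names are hypothetical, and the comparison on random DNA strings is a sanity check rather than a proof.

```python
import random


def cell_general(a, b, c, match):
    """Equation 2.1 with ins = del = 1 and sub = 2 (assumed values)."""
    return min(a if match else a + 2, b + 1, c + 1)


def cell_simplified(a, b, c, match):
    """Equation 2.2: only tests for equality against a - 1 are needed."""
    if b == a - 1 or c == a - 1 or match:
        return a
    # Remaining case: b == c == a + 1 and the characters differ.
    return a + 2


def fill(s, t, cell):
    """Fill the table for s (columns) and t (rows) using the given cell rule."""
    m = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
    for j in range(len(s) + 1):
        m[0][j] = j
    for i in range(len(t) + 1):
        m[i][0] = i
    for i in range(1, len(t) + 1):
        for j in range(1, len(s) + 1):
            m[i][j] = cell(m[i - 1][j - 1], m[i - 1][j], m[i][j - 1],
                           s[j - 1] == t[i - 1])
    return m


def rules_agree(trials=200, seed=0):
    """Compare both rules on random DNA strings of length 0 to 12."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 12)))
        t = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 12)))
        if fill(s, t, cell_general) != fill(s, t, cell_simplified):
            return False
    return True
```

The agreement rests on the fact that, with these penalties, neighbouring cells always differ by exactly 1, so the two cases in Equation 2.2 are exhaustive.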

This is essentially the same as the original equation; however, it simplifies the representation of the matrix cells in hardware implementations. It is important to note that the original Smith-Waterman paper specifies insertion and deletion penalties known as gap penalties. It is more typical to have a separate "gap open" penalty for the beginning of a gap, and a relatively smaller "gap extend" penalty for the continuation of a gap, as can be seen in ’ssearch’, a pure software implementation of Smith-Waterman [6]. It should also be noted that the algorithm may be stated equivalently as either a comparison of similarity (increasing score for better matches) or difference (decreasing score for matches).

After the entire matrix of scores has been calculated, it is a simple matter to walk from the far corner back to the origin, looking for the path of minimal penalty to determine the actual differences between the strings. As for the complexity of the algorithm under normal computational circumstances, the dynamic programming algorithm takes O(m · n) in space and time. The walk to retrieve the actual longest subsequences takes O(max(m, n)).
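The walk back from the far corner can be sketched as follows. This is a minimal illustration under assumed penalties (ins = del = 1, sub = 2); the names `fill_matrix` and `traceback`, the operation labels, and the tie-breaking order (diagonal first, then up, then left) are choices made here, not taken from the essay.

```python
def fill_matrix(s, t):
    """Fill the table with assumed penalties ins = del = 1, sub = 2."""
    m = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
    for j in range(len(s) + 1):
        m[0][j] = j
    for i in range(len(t) + 1):
        m[i][0] = i
    for i in range(1, len(t) + 1):
        for j in range(1, len(s) + 1):
            diag = m[i - 1][j - 1] + (0 if s[j - 1] == t[i - 1] else 2)
            m[i][j] = min(diag, m[i - 1][j] + 1, m[i][j - 1] + 1)
    return m


def traceback(s, t, m):
    """Walk from the far corner to the origin along a minimal-penalty path."""
    i, j = len(t), len(s)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0 if s[j - 1] == t[i - 1] else 2
            if m[i][j] == m[i - 1][j - 1] + step:
                ops.append("match" if step == 0 else "substitute")
                i, j = i - 1, j - 1
                continue
        if i > 0 and m[i][j] == m[i - 1][j] + 1:
            ops.append("insert")   # came from the cell above (b + ins)
            i -= 1
        else:
            ops.append("delete")   # came from the cell to the left (c + del)
            j -= 1
    ops.reverse()
    return ops
```

Each loop iteration decreases i, j, or both, so the walk takes at most m + n steps, which is within a constant factor of max(m, n).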

2.2 Summary of Relevant Research

The papers that I am summarizing describe different approaches to quickly calculating the dynamic programming matrix of the Smith-Waterman algorithm. All of the hardware approaches use the idea of having multiple processing units to calculate the cells of the matrix.

One thing to consider when reviewing the papers is the type of strings that could be accepted by the implementation. Specifically, DNA has four characters that need to be considered (A, T, G, and C), which correspond to the bases that make up a string; this can be represented with 2 bits per character. On the other hand, protein can be made up of 20 different amino acids, which clearly requires more bits per character.
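The bit counts can be checked with a small Python sketch. The particular code assignment in `DNA_CODE` and the helper names are assumptions for illustration; any one-to-one mapping of the four bases onto two bits would serve equally well.

```python
# Hypothetical 2-bit code for the four DNA bases.
DNA_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bits_per_symbol(alphabet_size):
    """Smallest number of bits that can distinguish alphabet_size symbols."""
    return (alphabet_size - 1).bit_length()


def pack_dna(seq):
    """Pack a DNA string into an integer, 2 bits per base, first base highest."""
    value = 0
    for ch in seq:
        value = (value << 2) | DNA_CODE[ch]
    return value
```

Here `bits_per_symbol(4)` is 2 for DNA, while `bits_per_symbol(20)` is 5 for the 20 amino acids.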

Performance of the different Smith-Waterman implementations is usually stated in millions of matrix cell updates per second (MCPS). Other criteria for comparison, besides speed, may include measures such as the number of processing units per chip, the number of chips used for the solution, and power consumption.

2.2.1 SWASAD: An ASIC Design for High Speed DNA Sequence Matching

The authors of this paper [8] implemented the Smith-Waterman algorithm as an ASIC (Application Specific Integrated Circuit) running at 50 MHz. In the introduction of the paper, it was noted that even though the design was capable of running much faster, their particular implementation connected the SWASAD (Smith and Waterman Algorithm-Specific ASIC Design) through the PCI bus to communicate with the main processor. The SWASAD design itself used normal ASIC design techniques, and would be easily adaptable into a SoC (System on a Chip) design.

SWASAD has 64 processing units per chip. The main effort for this project was to put as many units as possible on the chip, using designs that are efficient in terms of the number of transistors and space on the chip. It is important to note that the design has registers to specify different penalties for insertions, deletions, and matches.

As for the performance, the paper reports 3200 million cell updates per second. In the conclusion of the paper, the authors state that by using a 0.1 µm process rather than the 0.5 µm process that they used, it would be possible to have a design with 1024 processing elements.

2.2.2 A Run-Time Reconfigurable System for Gene-Sequence Searching (Hokiegene)

This paper by Puttegowda et al. [15] from the Virginia Bioinformatics Institute describes their effort to implement the Smith-Waterman Algorithm using FPGAs. The name of their system is Hokiegene, and it uses an FPGA-PCI board called the Osiris in the implementation.

This implementation uses the simplification of the algorithm as per Lipton and Lopresti [11], i.e. Equation 2.2. As noted earlier, this makes a simplifying assumption about the penalty model for the Smith-Waterman algorithm, which restricts the penalty of an insertion or deletion to one. As a result, the representation of each cell needs only one bit. As well, this implementation only handles DNA, so only 2 bits are needed to represent each character of the string.

The Osiris board runs at 180 MHz and has 7000 processing elements to handle the Smith-Waterman cell updates. The paper states that the performance for the Hokiegene solution is 1260 billion updates per second. Note that a higher number of processing units usually corresponds to higher performance in terms of cell updates.

2.2.3 A Smith-Waterman Systolic Cell

This paper by Yu et al. [19] is another Smith-Waterman implementation using FPGAs. The actual hardware used for this project is called the Pilchard [10]. This is a hardware platform that essentially puts an FPGA on a mini-board and uses the SDRAM interface to communicate with the main processor. The implementation benchmarked had 4032 processing elements, ran at 202 MHz, and performed 814 billion CUPS (cell updates per second); the memory mapped interface was 133 MHz at 64 bits.

Like the effort by Puttegowda et al., this implementation also makes the same simplifying assumption about the Smith-Waterman algorithm using Equation 2.2.
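The throughput figures quoted for SWASAD, Hokiegene, and the systolic cell are all consistent with a simple model in which every processing element completes one cell update per clock cycle. That model is an assumption made here for illustration (the summarized papers may account for performance differently), and the function name is hypothetical.

```python
def peak_updates_per_second(processing_elements, clock_hz):
    """Peak cell updates per second, assuming one update per element per cycle."""
    return processing_elements * clock_hz


if __name__ == "__main__":
    designs = {
        "SWASAD": (64, 50_000_000),
        "Hokiegene": (7000, 180_000_000),
        "Systolic cell": (4032, 202_000_000),
    }
    for name, (pes, hz) in designs.items():
        print(name, peak_updates_per_second(pes, hz))
```

Under this model the three designs come out at 3.2 billion, 1260 billion, and roughly 814 billion updates per second respectively, matching the reported numbers.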



2.2.4 Cray XD1

The source for this article [12] is more marketing material than scientific journal, but for our purposes of understanding the main feature of this implementation, it is sufficient. The Cray XD1 is designed to be a general purpose platform rather than strictly a single-algorithm machine. The machine’s main processing power comes from Opterons, which are connected via a fast communications fabric. Also available on this communications fabric are "Application Acceleration" components, which are FPGAs.

The use of FPGAs is similar to the other papers; however, the approach on this project is completely different. While the other implementations work on the simplified Smith-Waterman, the Cray implementation was specifically trying to optimize a software-only implementation (’ssearch34’ from the FASTA package at http://fasta.bioch.virginia.edu/). For their evaluation of the speed, the article claims that the runtime on their test dropped from 6461 seconds (pure software) to 100 seconds. The article does not provide any other numbers or ways to compare the Cray implementation to other implementations.

2.2.5 Other Papers

Dydel and Bala [4] describe yet another FPGA implementation of Smith-Waterman. However, their work compares their performance to a pure software solution (osearch34). It should also be noted that their paper explicitly criticizes other FPGA-based works that use the special case simplification of Smith-Waterman. Another departure this paper takes from the others is that the authors investigated the use of the algorithm on the alignment of protein sequences (20 letters), using a full cost matrix (400 entries, 8 bits each). Their implementation using FPGAs turned out to be 200 times faster than osearch34 running on a Pentium IV 1.7 GHz.

In the course of the research I also came across the following:

• Splash 2 [9] is a frequently cited paper which may have been one of the earlier efforts to develop an ASIC to handle the Smith-Waterman algorithm.

• Gene Matching Using JBits [7] is another FPGA implementation, except that it uses the Xilinx JBits toolkit to stream in and reconfigure the circuit at run time to include the cost penalty constants, rather than having an explicit register.



• BLAST [1] is often mentioned in the literature. BLAST works by finding subsequences of specific lengths between the strings to be compared, and then extending these matches of subsequences as far as possible, based on a score that favours matches and penalizes mismatches. An online version of BLAST is available at [13]. This heuristic approach can provide satisfactory results but is obviously different from Smith-Waterman.

• FASTA [6] provides an online interface to Smith-Waterman as well as the FASTA tool. As for the FASTA algorithm itself [14], it looks for runs of identical sequences and then joins up the runs using as little edit distance as possible. This provides an approximation rather than the exact sequence alignment, but appears to be widely used.

2.3 FLEET Architecture

2.3.1 Architecture

[17] sums up the guiding principles behind the design of the FLEET architecture as three points: simplicity, communication, and concurrency. This leads naturally to descriptions of the three main components that make up a FLEET processor.

• SHIPs - A FLEET processor has many different small processing units called SHIPs. Each SHIP performs a very specific operation such as adding, interfacing to memory, synchronizing, etc. Each SHIP has input and output terminals called "ports" to handle data.

• Switch Fabric - FLEET needs to be able to pass data efficiently between the input and output terminals of the SHIPs. The component in FLEET that handles the routing is referred to as the switch fabric. This provides a function similar to that of a bus, but is quite different [3].

• FLEET program instructions - It is easy to overlook the actual list of instructions that specify which moves are to be performed. The encoding of the individual instructions is described in [2].

Given the above, there are questions that one might ask:

• What minimal set of SHIPs is necessary to implement a particular algorithm in a reasonably competitive manner, when compared to a more conventional processor architecture?



• What kinds of limitations result from the physical constraints? How many SHIPs can we put on a chip? Specifically, given the limitation of the switch fabric, what is the maximum number of source and destination terminals that can be handled on a single FLEET processor?

• How many concurrent instructions can be handled at one time? Is the execution deterministic? Is determinism desirable?

• How does one debug a FLEET program?

• What does a C to FLEET assembly compiler look like in order to take full advantage of the architecture? Is C a suitable language for writing programs for a FLEET processor?

• What types of computation is FLEET particularly suited for?

Although the range of questions presented here is geared more towards the programming aspect of the FLEET architecture, there are also many physically constrained questions that could be asked, some of which were explored in the FLEETzero prototype [3].

2.3.2 Switch Fabric and GaSP Circuits

The switch fabric and GaSP circuits as implemented in FLEETzero are described in [18]. For our purposes, we shall not concern ourselves with details at this low level, but we acknowledge that the simulator as implemented took these physical constraints into account for the purposes of better understanding the timing and signaling requirements. However, it should be noted that this realism may impair the exploration of the SHIP design space.

2.3.3 Programming for FLEET

Essentially, the FLEET architecture has one instruction: MOVE. There are several options to the MOVE that are described in [2]. They are as follows:

• Source - The source of the move may either be a constant embedded in the instruction, or an output terminal of a SHIP.

• Destination - The target should be the input terminal of a SHIP.

• Repeated - The move may be specified with the number of times that it should be repeated. The default is once.



• Standing - The move may be performed indefinitely for as long as there continues to be data, until an out-of-band (OOB) data item is received. Note that the source for standing moves cannot be a constant embedded in the instruction.

For a given instance of a FLEET processor, the sources and destinations (i.e. terminals for SHIPs) are enumerated; the assembler uses a lookup table for the symbolic terminal names.

Besides the MOVE instruction, the other fundamental concept in programming for FLEET is the idea of bags of instructions. Bags of instructions are groups of instructions that can be evaluated concurrently. In fact, FLEET evaluates all the instructions in the bag as concurrently as possible, without any particular guarantees about the order in which they will be executed. As many outstanding instructions will be handled as possible. This is quite a paradigm shift from the serial nature of conventional programming languages, which can make for a serious learning curve.

While the default paradigm is parallel execution, it is possible to force serialization by sequencing the program as a series of bags. In fact, "goto" could be implemented by explicitly specifying the name of the bag of instructions to be handled after the current bag is finished.
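To make the bag semantics more tangible, the following toy Python model executes a bag of counted moves in an arbitrary order, performing whichever move has data available. It is purely illustrative: the `Move` class, the `run_bag` function, and the terminal names are hypothetical and do not correspond to the FLEET instruction encoding in [2] or to the simulator's Java interface. Constant sources and standing moves are deliberately left out.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Move:
    source: str      # hypothetical name of an output terminal
    dest: str        # hypothetical name of an input terminal
    count: int = 1   # repeated move; the default is once


def run_bag(bag, terminals, seed=None):
    """Run all moves in one bag with no guaranteed ordering between moves.

    terminals maps a terminal name to a deque of data items. Returns True
    if every move completed its full count, False if the bag stalled.
    """
    rng = random.Random(seed)
    pending = [[mv, mv.count] for mv in bag]
    while True:
        # A move is ready when it has repetitions left and its source has data.
        ready = [p for p in pending if p[1] > 0 and terminals.get(p[0].source)]
        if not ready:
            break
        entry = rng.choice(ready)
        mv = entry[0]
        item = terminals[mv.source].popleft()
        terminals.setdefault(mv.dest, deque()).append(item)
        entry[1] -= 1
    return all(left == 0 for _, left in pending)
```

In this model, items from a single source arrive at their destination in order, while the interleaving between different moves in the bag is left unspecified, which mirrors the lack of ordering guarantees described above.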

2.4 Simulator and Tools

2.4.1 Simulator

Given that the FLEET architecture consists of a number of SHIPs and that execution is essentially the movement of data between the input and output terminals of the SHIPs, it should be a relatively straightforward matter to create a simulator for a particular instance of a FLEET processor. The simulator was written in Java, and new SHIPs may be created by implementing the provided interface.

The inputs to the simulator are the following:

• Library of SHIPs - The simulator comes with a variety of SHIPs such as FIFOs, memory, and arithmetic operators. They are designed to do one specific thing and tend to be simple components.

• Model description - This is a file that describes the SHIPs that are available on the particular instance of the FLEET processor to be simulated. The format and syntax of this file are more complicated than just a list of the SHIPs that are available; the model description also specifies the required interconnects such as the switch fabric components.

• FLEET program - This is the actual program that would be executed. As mentioned in the architecture description, FLEET really only has one instruction: MOVE. The FLEET program consists of bags of instructions.

• Simulator parameters - This file contains the required information for the simulator, such as the name of the model file and the FLEET program file. As well, there are runtime parameters that affect how the simulator will run, such as how many cycles to execute.

2.4.2 Visualization Tool

The output logs from the simulator may be viewed using a TCL based tool. Although it does not strictly provide additional information or processing of the events in the logs, it does provide a visually pleasing way to examine the results.


Chapter 3

Smith-Waterman SHIP Design and Implementation

3.1 High-Level Design (Alpha version)

As mentioned before, it would be desirable to evaluate as many entries of the dynamic programming table as concurrently as possible. Obviously, having enough cells to calculate across the largest diagonal would accomplish this. However, this is clearly not physically possible. As a compromise, there are a variety of ways to design something that is within the realm of physical possibility. This section will describe the approach that was chosen.

We should note that this essay will refer to two versions of the design. The first version, denoted Alpha, will refer to the initial design, while Beta will refer to the proposed revised design.

3.1.1 Overall Execution

For the evaluation of an entry in the dynamic programming table, the entries immediately above, immediately to the left, and diagonally to the top-left must already be available because of the Smith-Waterman algorithm dependency. We have already discussed this in the algorithm description section of the paper.

Given the constraints, our evaluation plan could be summed up as follows: at the first level, we will evaluate strips of multiple rows across the table; within the horizontal strips, we will evaluate the individual columns as concurrently as possible. Figure 3.1 shows the sequence of evaluation.

In order to handle the columns within the strip, we will arrange four Smith-Waterman SHIPs in such a way that a given SHIP will forward the result of the column that it is responsible for to the input of the next SHIP. We will cycle the use of the SHIPs to evaluate across the entire strip; the last SHIP will feed its output back to the first SHIP, as seen in Figure 3.2. Once the strip is complete, we shall set up for evaluating across the next large strip. Note that we can allow for an arbitrary number of SHIPs. The decision to use four SHIPs was made to keep the design easy to observe and evaluate.

Figure 3.1: Evaluation strategy for FLEET implementation. The dynamic programming evaluation consists of two levels. The first is the large horizontal strips. The second level is the columns within each horizontal strip. When all the columns for a particular strip have finished evaluating, the evaluation moves to the next large horizontal strip. There are four SHIPs that are set up to evaluate the columns; the different shading shows which of the SHIPs is responsible for which column. The unshaded areas indicate table entries that have not yet been evaluated.

Figure 3.2: Smith-Waterman SHIPs in ring configuration. Only the terminals participating in this ring are shown for simplicity.

While this particular design seems rather different from an evaluation strategy that is strictly along the diagonal, evaluating down a smaller number of rows rather than down all the rows of the table gives us the advantage of requiring less buffering between the Smith-Waterman SHIPs. As well, this design could lead to designs that are less memory bandwidth intensive; however, this initial version would need to be working before other optimizations can be considered.

For initialization, the input terminals are fed with the appropriate values. Once initialized, we shall run it essentially like this (the example assumes 10 columns (i.e. 9 to compute), 8 rows, and 4 cells):

iter: move,16 cell3_d -> cell0_ac
      move,24 cell0_d -> cell1_ac
      move,16 cell1_d -> cell2_ac
      move,16 cell2_d -> cell3_ac

The move instruction allows us to specify repeated moves, and if the number of columns and rows that are needed is known ahead of time, then the exact number of moves between the SHIPs is deterministic. Another point of view that one could take is imagining the data values being passed around the ring of SHIPs, and then just counting exactly how many values need to be passed between each segment of the ring until the final column is reached.
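The counting argument can be written down as a small Python sketch. The function name `ring_move_counts` is hypothetical, and the sketch assumes that columns are assigned to SHIPs round-robin and that every D value a SHIP produces, including those of the final column, is moved on to the next SHIP in the ring; those assumptions reproduce the 24/16/16/16 counts in the listing above.

```python
def ring_move_counts(columns_to_compute, rows, ships):
    """Number of repeated moves from each SHIP's D output to the next SHIP.

    Entry k is the count for the move cell<k>_d -> cell<(k + 1) % ships>_ac.
    """
    counts = []
    for k in range(ships):
        # SHIP k handles columns k, k + ships, k + 2 * ships, ...
        columns_for_ship = len(range(k, columns_to_compute, ships))
        counts.append(columns_for_ship * rows)
    return counts


if __name__ == "__main__":
    # 9 columns to compute, 8 rows, 4 SHIPs, as in the example above.
    print(ring_move_counts(9, 8, 4))
```

For the example parameters this gives 24 moves out of cell0 and 16 out of each of the other three cells.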

Initially we considered having stages of 8/8/8/8 moves for the iteration part, rather than calculating the number of moves that we actually need. However, this would force a synchronization of all the cells every eight calculations, so that the next iteration could not begin until the slowest part of the computation was done. Clearly this is not desirable, so our 16/24/16/16 (or whatever may be the case for a particular run) allows the FLEET chip to perform whichever move is ready.

As a result of the previous observation, I suspect that if the SW cell were broken into smaller, more generic FLEET components, it may also present synchronization issues. Standing moves could also be considered for use in the implementation to avoid unintended synchronizations.

3.1.2 Smith-Waterman SHIP

Given the above strategy, the design of the Smith-Waterman SHIP has some obvious constraints. For the Alpha design, we emphasized the following ideas:

• Minimize duplicated data movement - for this, we combined the A and the C input for the Smith-Waterman calculation into a single input. When evaluating down a column, the C in the previous calculation is the A in the subsequent calculation.

• Rely on correct delimitation of the columns and rows by inputting this directly to the SHIP, rather than using an inter-SHIP protocol, i.e. using counted moves instead of standing moves that terminate on some special code. The rationale for this is that it is easier to prove the correctness of the implementation than that of a protocol.

• Prefer to have a "fat" design, rather than smaller pieces. The idea is that eventually, when this large SHIP is replaced with a more realistic multi-SHIP implementation, this design will act as a set interface, thus making the task of breaking it down into smaller SHIPs a cleaner one.

It is interesting to note, after the fact, that some of these ideas were possibly contrary to the philosophy of FLEET. We shall get into a more in-depth discussion of this later on.

The terminals for the Smith-Waterman SHIP are as shown in Figure 3.3. Note that each SHIP is designed to evaluate down a column:

• A0In - This is the top row, left value.

• B0In - This is the top row, right value.

Figure 3.3: Smith-Waterman SHIP

• SxIn - These are the characters corresponding to the column. Note that we will read this in one at a time for each column iteration.

• TyIn - These are the characters corresponding to the rows being computed. Note that given the appropriate numrows and numcols, we will only need to feed in these characters once for a computation.

• AcIn - This corresponds to the left column of the calculation.

• DOut - This is an output and corresponds to the right column of the DP calculation.

• NumrowsIn - This specifies how many rows of computation will be handled. Note that because the first row is given, the number specified here will be the same as the number of characters fed into Ty and the number of values fed into ac. (So 1 means that there is only one value of D to calculate.)

• NumcolsIn - This specifies the number of columns that will be computed. The reason for this is to simplify the Ty input, i.e. we can reuse the input rather than having to re-feed the Ty for each column computation for each cycle of SW computations.

Each of the terminal inputs corresponds to a particular input for the calculation of a particular column in the dynamic programming table. The following figure shows exactly how each of the inputs and outputs fits into the calculation. Notice that values of D are reused as values of B in subsequent calculations:

              Sx
      a0      b0
T0    ac0     d0
T1    ac1     d1
T2    ac2     d2
...
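The way one column is evaluated from these inputs can be sketched as follows. This is an illustration only, not the SHIP's Java code: the names `sw_cell` and `evaluate_column` are hypothetical, and the recurrence and its costs (insertion/deletion 1, mismatch 2, minimum taken) are assumed here to stand in for the simplified edit-distance style calculation; the essay's actual cell function may differ.

```python
def sw_cell(a, b, c, s, t, indel=1, mismatch=2):
    """One cell of an assumed edit-distance style recurrence.

    a = top-left value, b = top-right value (above), c = left value.
    The cost parameters are illustrative assumptions.
    """
    diagonal = a if s == t else a + mismatch
    return min(b + indel, c + indel, diagonal)


def evaluate_column(a0, b0, ac, sx, ty):
    """Compute the right-hand column d from the left column ac."""
    if len(ac) != len(ty):
        raise ValueError("ac and ty must both have numrows entries")
    d = []
    a, b = a0, b0
    for c, t in zip(ac, ty):
        value = sw_cell(a, b, c, sx, t)
        d.append(value)
        a = c      # the C of this row is the A of the next row
        b = value  # the D of this row is the B of the next row
    return d


# Rows "ACG" against the column characters "A" then "G".
first = evaluate_column(a0=0, b0=1, ac=[1, 2, 3], sx="A", ty="ACG")
second = evaluate_column(a0=1, b0=2, ac=first, sx="G", ty="ACG")
print(first, second)
```

Note how the output column of one call becomes the `ac` input of the next, mirroring the way DOut of one SHIP would feed AcIn of its neighbour.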

3.1.3 Detailed Design

This section describes the work to be done at the Java code level to implement the Smith-Waterman SHIP. The design breaks down into the following objects/classes:

• Terminals - The SHIP implementation API expects to have a Terminal object initialized for each of the inputs and outputs, so there is exactly one for each terminal listed in the description of the SHIP in the previous section.

• Functions - When setting up the Actions, we need to specify an output function that is triggered when the appropriate input Terminals have input. Specifically for our case, there are two types of output Functions:

– Acknowledgement - For each input Terminal, there is an implied state tracking that requires the receipt of input to be acknowledged. To that end, there will be an acknowledgement function to handle each of the inputs.

– Outputting Smith-Waterman - The process of accepting inputs will result in quiet acknowledgements until we receive the final piece of input, at which point the output of the Smith-Waterman calculation should be triggered. Once the first output is triggered, the signal for the next calculation should be the acknowledgement received on the DOut Terminal.

• Actions - Essentially each of the actions refers to a behaviour that the Smith-Waterman SHIP should handle. The actions that end in "WaitAndGo" handle the case when that particular input is the last of the set required for starting the Smith-Waterman calculation.

– ActionA0Wait - Accept A0 and ack.

– ActionA0WaitAndGo - Accept A0, ack, and start the output.

– ActionACWait - Accept AC and ack.

– ActionACWaitAndGo - Accept AC, ack, and start the output.

– ActionB0Wait - Accept B0 and ack.

– ActionB0WaitAndGo - Accept B0, ack, and start the output.

– ActionDWaitForAck - Accept the ack for the output D and output again.

– ActionDWaitForFinalAck - Accept the ack corresponding to the last row that the SHIP is supposed to handle.

– ActionNumcolsWait - Accept number of columns and ack.

– ActionNumrowsWait - Accept number of rows and ack.

– ActionSxWait - Accept character for the particular column and ack.

– ActionSxWaitAndGo - Accept character, ack, and start the output.

– ActionTyWait - Accept character for the particular row and ack.

– ActionTyWaitAndGo - Accept character, ack, and output.

• ActionManager - This module oversees all the Actions and enables and disables them as appropriate.

Note that there is actually a problem with this particular design. The abstraction used by the simulator does not allow for multiple actions to be attached to the same terminal. In the results section, we shall go into more detail about the consequences of this.
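The distinction between the "Wait" and "WaitAndGo" behaviours can be sketched with a small tracker that decides, for each arriving input, whether it is the last one needed. This is a simplified illustration, not the simulator's API: the class `InputTracker` and the fixed set of five inputs are assumptions, and the row/column counts and DOut acknowledgements are left out.

```python
class InputTracker:
    """Tracks which inputs have arrived for the next calculation."""

    REQUIRED = ("A0", "B0", "Ac", "Sx", "Ty")

    def __init__(self):
        self.pending = set(self.REQUIRED)
        self.log = []

    def accept(self, name):
        """Acknowledge one input; return True if it completed the set."""
        if name not in self.pending:
            raise ValueError(f"unexpected or repeated input: {name}")
        self.pending.remove(name)
        if self.pending:
            # Some inputs are still missing: behave like an "...Wait" Action.
            self.log.append(f"{name}: Wait (ack only)")
            return False
        # Last input of the set: behave like an "...WaitAndGo" Action.
        self.log.append(f"{name}: WaitAndGo (ack and start output)")
        self.pending = set(self.REQUIRED)  # re-arm for the next calculation
        return True


tracker = InputTracker()
results = [tracker.accept(n) for n in ["Sx", "A0", "Ty", "Ac", "B0"]]
print(results)
print(tracker.log[-1])
```

Because any input may be the last to arrive, each input Terminal needs both behaviours available, which is where the Alpha design ends up wanting two Actions per Terminal.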

3.2 Results

According to the original proposal, the goal was to get the entire implementation completed by the end of the summer of 2006. Unfortunately, the design did not take the simulator’s operation into consideration, and this was not discovered until the code was completed.

Specifically, the simulator was designed to allow one Terminal to correspond to one Action, whereas our Alpha design assumed that multiple Actions per Terminal were allowable. In a quick-fix attempt, we tried to modify the simulator minimally to accommodate this, but unfortunately the simulator has internal checking which prevents running with this hack.

The simulator was designed to capture the physical signaling that would happen should the actual chip get fabricated. To that end, the simulator was designed to be strict about the signaling between the Terminals. Specifically, the simulator expects to capture transitions between active and inactive signals (i.e. presence and absence of data) so that post-run analysis is possible. Because a double activation or a double deactivation does not show up "on a scope," this is considered a serious error and will cause the simulator to stop immediately. The attempt to quick-fix the simulator to handle more than one Action per Terminal essentially short-circuits into this condition right away, at which point it is clear that the design of the Smith-Waterman SHIP has to be scrapped.

In an e-mail exchange, Igor Benko, the author of the simulator, suggested that one possible design change would be to internally add another Action abstraction that two Actions might be hooked up to. In this way, enabling or disabling either of the two Actions would not directly affect the Terminal, and thus we would not be affected by the problem of double activation/deactivation.
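The failure mode and the suggested remedy can be sketched in a few lines. This is a toy model under stated assumptions, not the simulator's code: `StrictTerminal` and `SharedAction` are hypothetical names, and the real simulator's checks and the intermediate abstraction suggested by Benko may look quite different.

```python
class StrictTerminal:
    """Stops on a double activation or double deactivation."""

    def __init__(self):
        self.active = False
        self.transitions = 0

    def set_active(self, active):
        if active == self.active:
            raise RuntimeError("double activation/deactivation")
        self.active = active
        self.transitions += 1


class SharedAction:
    """Intermediate layer: the Terminal is active while any Action is enabled."""

    def __init__(self, terminal):
        self.terminal = terminal
        self.enabled = set()

    def enable(self, action_name):
        was_idle = not self.enabled
        self.enabled.add(action_name)
        if was_idle:
            self.terminal.set_active(True)

    def disable(self, action_name):
        self.enabled.discard(action_name)
        if not self.enabled and self.terminal.active:
            self.terminal.set_active(False)


# Two Actions share one Terminal through the intermediate layer.
terminal = StrictTerminal()
shared = SharedAction(terminal)
shared.enable("ActionA0Wait")
shared.enable("ActionA0WaitAndGo")   # no second activation reaches the Terminal
shared.disable("ActionA0Wait")       # Terminal stays active
shared.disable("ActionA0WaitAndGo")  # last Action gone: one deactivation
print(terminal.transitions)
```

Driving `StrictTerminal.set_active(True)` directly from two Actions would raise on the second call, which corresponds to the condition the quick fix ran into.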

As mentioned previously, time has essentially run out for any more programming work, and any further effort will have to be left for future work.

3.2.1 Experience of Designing and Implementing the SHIP

In the previous section, we objectively presented the results (or lack thereof) for the design and the implementation. In this section, we would like to provide an experience report of working with the simulator and what it is like to design and implement the Smith-Waterman SHIP.

There are two areas worth commenting on: one, working with the vocabulary and ideas of FLEET; the other, the actual work of designing and implementing the Smith-Waterman SHIP to run on the Java-based simulator. In our commentary, we shall attempt to address both of these areas.

Some points of note:

• After figuring out how to download the simulator from Sourceforge, it took a little bit more experimentation to figure out how to make things run and how to use the simulator log visualizer. Nicer packaging and documentation would have helped make this part move quicker, but for a project that is moving so rapidly, this is not always possible.

• The concept of bags of instructions is intuitively easy to understand, but it is not clear how to use this feature effectively. The mechanism used to specify which bags of code to evaluate next (i.e. by feeding the name of the bag to the FetchAndIssue) seems a bit awkward. Naturally, I would expect that the eventual typical user of FLEET would not have to be concerned with programming at this level.

• The simulator comes with a decent set of SHIPs, which provide an idea of how simple behaviours could be implemented. As well, there are some examples of how the SHIPs could be put together for some interesting computation (such as Fibonacci, or Euclid’s factoring). However, the learning curve and setup are considerable even for doing something simple.

I think that my extensive background in software development affects the way that I would go about the design and implementation of a FLEET SHIP, especially in the context of a software simulator. Because the API of the simulator seems relatively similar to a component-based system framework, I instinctively designed the SHIP in a similar fashion, namely attempting to encapsulate buffering and functionality into the SHIP. However, what was natural for me was actually not the correct way to go about designing the Smith-Waterman SHIP. Unfortunately, there was no particular example that I could follow which showed the implementation of a complex algorithm.

Besides the implementation of the Smith-Waterman SHIP, there is also the issue of designing the surrounding framework to support the operation, namely the initialization and the mechanisms to feed input and write the dynamic programming table values back out to memory. Despite the availability of "stride" SHIPs, which simplify loop programming, I would not say that it is the easiest construct to use. As well, there is the issue of shared access to the memory interface SHIP. This is actually a fundamental problem in parallel programming: even though you have many functional units, you will have to share data at some point. I was foreseeing the need to have a memory access serializer SHIP that would handle writes needed by the algorithm implementation. Also, I wanted to be able to implement a defensive mechanism that would be able to reliably reset the state of certain SHIPs between strips of execution; the current model of having MOVEs and codebags, while simple and elegant, does not account for the messy reality of having to express something more complicated. A hypothetical question might be: how would one reliably implement interrupts in FLEET? Surely being able to answer this question will answer my question about having a reliable way to reset the state of the SHIPs and of the execution.

Generally, not having all the utility SHIPs available and not having an expressive enough programming paradigm made implementing Smith-Waterman a challenging task. However, it is hoped that more effort will be devoted to making FLEET easier for programmers, either through tools or through a specially designed programming language. As well, being able to encapsulate a set of SHIPs within a larger SHIP would allow for the creation of more complicated behaviour. Another observation is that writing FLEET programs seems rather similar to hardware design, where SHIPs correspond to logic units and MOVE instructions correspond to wires. Efforts to improve the programmability might come from importing ideas from hardware design tools and languages to the FLEET world.

Chapter 4

Discussion

In the previous chapter, we noted the issues with the Alpha design. In this chapter, we would like to propose a modified design (4.1), take a more philosophical look back at the project work (4.2), and summarize directions for future work (4.3).

4.1 Revised SHIP Design (Beta version)

The design for the Alpha version had some problems.

• The heavyweight approach did not suit the way that SHIPs in general should be designed. All the existing SHIPs seem to be generic and minimal, rather than monolithic components that handle and encompass more complex computations.

• The idea of reading in all the parameters before starting to process the calculations seems to be contrary to the goal of taking advantage of the FLEET architecture’s concurrency.

• Related to the previous point, it is rather unrealistic to assume a SHIP that can store an arbitrary amount of input when, in reality, it is more likely to have a fixed amount of storage. In fact, an alternate, better way to handle this would be to have FIFO SHIPs precede the inputs of the Smith-Waterman calculation SHIPs; this would separate the buffering from the calculation logic, for a more modular design.
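The last point, separating buffering from calculation, can be sketched with a fixed-capacity queue placed in front of a consumer. This is only an illustration: `BoundedFifo` is a hypothetical stand-in for a FIFO SHIP, and the capacity of 2 is an arbitrary choice.

```python
from collections import deque


class BoundedFifo:
    """A fixed-capacity queue standing in for a FIFO SHIP."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, value):
        """Accept a value if there is room; otherwise the sender must wait."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(value)
        return True

    def take(self):
        """Hand the oldest buffered value to the calculation logic."""
        return self.items.popleft()

    def __len__(self):
        return len(self.items)


fifo = BoundedFifo(capacity=2)
accepted = [fifo.offer(ch) for ch in "ACG"]  # third offer finds the FIFO full
oldest = fifo.take()                         # the consumer drains one value
retry = fifo.offer("G")                      # now there is room again
print(accepted, oldest, retry)
```

With the storage limit living in the FIFO, the calculation logic itself no longer needs to hold an arbitrary amount of input.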

Perhaps a more general criticism of the Alpha design is that it treats the simulator as some kind of software framework. The simulator’s Java API and structure are such that a programmer might be led to believe that this is the case.

What kind of design would result if we were to take a more lightweight approach? We shall present the Beta design, which we will consider for future continuing work on this project.

4.1.1 SHIP Design for Beta version

The distinctive feature of Beta is to deliberately "clock" the output, rather than to buffer up the input. A single set of inputs will yield a single output value. This will force a simplification onto the design of the SHIP.

As we did with the Alpha version, we will describe the inputs/outputs and the behaviour. Although this Beta version is similar to the Alpha version, there are some differences in the way the input Terminals work:

• A0In - This is the top-left DP value.

• B0In - This is the top-right DP value.

• SxIn - This is the character corresponding to the column. Note that we expect to have all the appropriate input for each "clock" or calculation of the Smith-Waterman algorithm. So for example, if there are 10 rows to calculate, we will expect the same column character to be fed into this input 10 times.

• TyIn - These are the characters corresponding to the rows being computed.

• AcIn - This corresponds to the left column of the DP calculation.

• DOut - This is an output and corresponds to the right column of the DP calculation.

Notice that we will not be specifying the number of columns or rows with this particular version of the SHIP. Also note that the supporting logic to manage SHIPs for this configuration will be more complicated than in the Alpha version; the design for this will also have to be left for future work.
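To give a feel for what the supporting logic would have to do, the sketch below builds the per-clock input streams for one column and feeds a stateless cell one complete input set at a time. Everything here is an assumption made for illustration: the names `beta_cell` and `run_beta_column`, the edit-distance style recurrence and its costs, and the routing of the previous AcIn value to A0In and of DOut back to B0In.

```python
def beta_cell(a, b, c, s, t, indel=1, mismatch=2):
    """Stateless cell: one complete set of inputs yields one output value."""
    diagonal = a if s == t else a + mismatch
    return min(b + indel, c + indel, diagonal)


def run_beta_column(a0, b0, ac, sx, ty):
    """Supporting logic: supply every input on every clock."""
    streams = {
        "A0In": [a0] + list(ac[:-1]),  # previous row's left value
        "AcIn": list(ac),
        "SxIn": [sx] * len(ty),        # same column character each clock
        "TyIn": list(ty),
    }
    b = b0
    d_out = []
    for a, c, s, t in zip(streams["A0In"], streams["AcIn"],
                          streams["SxIn"], streams["TyIn"]):
        d = beta_cell(a, b, c, s, t)
        d_out.append(d)
        b = d  # DOut is routed back as the next clock's B0In
    return streams, d_out


streams, d_out = run_beta_column(a0=0, b0=1, ac=[1, 2, 3], sx="A", ty="ACG")
print(streams["SxIn"], d_out)
```

The cell itself keeps no counters or buffers; the bookkeeping that the Alpha SHIP did internally now sits in the code that prepares the streams.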

4.2 Simulation or Design Exploration

While working on the project, one could not help but feel the two contrasting goals of the work. On the one hand, the low-level simulation would be of interest to people who are looking to fabricate actual processors. On the other hand, there is the rather unexplored area of how one would provide an interface so that normal programmers may take advantage of the architecture’s features.

While it would have been desirable to have these two areas explored as equally and as thoroughly as possible, the work involved in these two directions is not necessarily the same or complementary.¹

4.2.1 Simulating Signals

The FLEET simulator modeled the components of the architecture at a functional description level. Specifically, the code for the SHIPs and the simulator configured and tracked specific delays for the inputs and outputs. This was not the same level of physical description as something that could be fed into a SPICE simulator, but would give us insights into potential bottlenecks in the processor.

Given that an actual prototype was fabricated (i.e. FLEETzero), it is sensible for the simulator to make some assumptions, and use that as a starting point for a higher level of simulation.

4.2.2 Design Exploration

An alternate way of thinking about the FLEET work is whether the language and abstraction are suitable for a programmer. In other words, instead of asking "does it work?" the question might be instead "is this the right way of expressing the computation?" Experience in programming language design might be more useful for this exploration than experience in computer architecture. From looking at the recent efforts on making SHIPs implement useful algorithms, working with FLEET is clearly not trivial. Questions that arose out of the design and implementation of the Smith-Waterman algorithm seem to show that there is no particularly straightforward way to take a given algorithm and translate it to run on FLEET.

On the whole, working with FLEET feels rather like hardware design; in particular the FLEET program’s MOVE instructions act like wires and the SHIPs are sources and destinations of signals. Logic design at the hardware level is a well understood exercise. Perhaps given enough time, FLEET programming expertise will be just as well understood.

Another hurdle that a programmer has to overcome is the idea that the FLEET program instructions are evaluated concurrently by default and regular assumptions about order of execution cannot be made. Perhaps it is not the concurrency that is difficult to deal with, but more the determinism of the execution. If there is anything that would make it easier to program FLEET, it would be a clearer execution model of how the instructions are handled. The recently available FLEET interpreter is a good step in this direction.

¹ The FLEET interpreter is now available (http://research.cs.berkeley.edu/class/fleet/) in addition to the original simulator. The emphasis of the interpreter is clearly on design exploration rather than physical simulation. At the time that I was implementing the Smith-Waterman algorithm, this interpreter was not available.

It is unfortunate that this project did not turn out as well as planned, as the Smith-Waterman algorithm would have shown future FLEET researchers how to exploit the architectural features.

4.3 Future Work

Besides having a proper Smith-Waterman SHIP, there is also the whole matter of having the proper supporting SHIPs and a FLEET program that could automatically feed in the appropriate characters to keep the "engine" running.

Given that the Alpha design did not work out, the next step should be one of: modify the Alpha design slightly as suggested by Igor Benko; or scrap it and work out the Beta design. At this point, I believe that the simpler Beta design has merit. Of course, now that the interpreter is available, working out the rest of the design can progress quickly.

Once a concrete implementation is available, more work can proceed on benchmarking timing constraints and other physical limitations. Of particular interest would be whether a FLEET-based implementation of the Smith-Waterman algorithm would run competitively when compared to FPGA or ASIC implementations.

4.4 Conclusion

While we were unable to complete the implementation, we believe that FLEET certainly has the potential for high-performance computing. Future efforts on tools to better help the programmer work with the architecture will help with future research. On the whole, working with FLEET has opened my eyes to some unique ideas about computer architectures.

Bibliography

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Mol. Biol, 215(3):403–410, 1990.

[2] I. Benko. FLEET Assembly. UCIB #2006-ib08, May 4, 2006. http://research.cs.berkeley.edu/class/fleet/docs/people/igor.benko/ib08-FLEET.Assembly.pdf.

[3] W. S. Coates, J. K. Lexau, I. W. Jones, S. M. Fairbanks, and I. E. Sutherland. FLEETzero: An Asynchronous Switching Experiment. Seventh International Symposium on Asynchronous Circuits and Systems (ASYNC’01), 2001.

[4] S. Dydel and P. Bala. Large scale protein sequence alignment using FPGA reprogrammable logic devices. Field Programmable Logic and Application, 14th International Conference, FPL, 2004.

[5] Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report No. UCB/EECS-2006-183. December 18, 2006. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.

[6] FASTA Sequence Comparison at the U. of Virginia. http://fasta.bioch.virginia.edu/.

[7] S. A. Guccione and E. Keller. Gene Matching Using JBits. Xilinx paper, 2002.

[8] T. Han and S. Parameswaran. SWASAD: An ASIC Design for High Speed DNA Sequence Matching. In Joint Asia South Pacific Design Automation Conference and VLSI Design Conference, 2002.

[9] D. T. Hoang. Searching genetic databases on splash 2. In Proceedings 1993 IEEE Workshop on Field-Programmable Custom Computing Machines, pages 185–192, 1993.

[10] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. H. Lee. Pilchard - a reconfigurable computing platform with memory slot interface. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2001.

[11] R. J. Lipton and D. Lopresti. A systolic array for rapid string comparison. In Proceedings of the Chapel Hill Conference on VLSI, pages 363–376, 1985.

[12] S. Margerm. Reconfigurable Computing in Real-World Applications. FPGA and Structured ASIC Journal, 10(5), February 2006.

[13] NCBI BLAST. http://www.ncbi.nlm.nih.gov/blast/ (Online; accessed February 26, 2007).

[14] W. Pearson and D. Lipman. Improved tools for biological sequence comparison. In Proc. Natl. Acad. Sci, volume 85, pages 2444–2448, April 1988.

[15] K. Puttegowda, W. Worek, N. Pappas, A. Dandapani, P. Athanas, and A. Dickerman. A Run-Time Reconfigurable System for Gene-Sequence Searching. In 16th International Conference on VLSI Design, pages 561–566, 2003.

[16] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[17] I. Sutherland. FLEET - A One-Instruction Computer. UCIES #2005-is02, August 25, 2006. http://research.cs.berkeley.edu/class/fleet/docs/people/ivan.e.sutherland/ies02-FLEET-A.Once.Instruction.Computer.pdf.

[18] I. Sutherland and S. Fairbanks. GasP: A minimal FIFO control. Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 46–53, 2001.

[19] C. W. Yu, K. H. Kwong, K. H. Lee, and P. H. W. Leong. A Smith-Waterman Systolic Cell. In Proceedings of the Tenth International Workshop on Field Programmable Logic and Applications (FPL’03), pages 375–384, 2003.

Appendix A

Project Description from Proposal submitted April 2006

For the research, I am mainly interested in the evaluation of the FLEET platform. To that end, I will be implementing a simplified version of the Smith-Waterman algorithm for DNA, i.e. A, T, G, and C, as per Lipton and Lopresti. Although the eventual goal would be to run the full version of the algorithm (i.e. with gap open and extend penalties etc.), running the simplified version will allow us to concentrate on better understanding the FLEET platform. Should a real chip be fabricated, this simplified version will allow for a direct performance comparison with other ASIC and FPGA implementations.

The final report for the project will provide a more detailed description of the workings of FLEET. In the meantime, the idea of the architecture can be summarized as: FLEET is a communication-centric, rather than operator-centric, architectural design, so that the compiler level can manage the parallelism of the chip-level operations via moves between functional units. An early version of FLEET was implemented and described in the FLEETzero paper.

A list of things to do is as follows:

1. Review FLEET materials.

2. Run Fibonacci and Euclid examples.

3. Design and implement simplified Smith-Waterman algorithm.

The time frame for the work is from May 2006 to July 2006 under the guidance of Mark Greenstreet.
