The Design of Special-Purpose VLSI Chips

Structured VLSI design proceeds from algorithm to logic cell tocell array to special-purpose chip, yielding cheap, powerful; and

modular hardware that will permanently alter the systemslazndscape of the 80's.

1/'

K_The Design of Special-Purpose\JLSI Chips

M. J. FosterH. T. Kung

Carnegie-Mellon University

We have now entered a technological domain inwhich many of the problems previously encounteredin building special-purpose hardware are much lesssevere. LSI technology allows tens of thousands ofdevices to fit on a single chip, and the advance toVLSI should increase this number. Devices once re-quiringmany components can now be built with justa few chips, reducing the difficulties in reliability,performance, and heat dissipation that arise fromcombining many standard SSI or MSI components.In addition, the development of simplified tech-niques1 and implementation guides 2 for structuringIC system design-an area often regarded as dif-ficult-allows relatively naive designers to achievesuccess.Special-purpose VLSI chips can function as periph-

eral devices attached to a conventional host com-puter. If many 'types of chips are attached, theresulting system can be considered an efficientgeneral-purpose computer. Figure 1 illustrates howspecial-purpose chips such as a pattern matcher, FFTdevice, and sorter might fotm part of such a general-purpose system.Construction of complex special-purpose VLSI

chips will be feasible only ifwe hold down design cost(i.e., design time).We will argue that chip design timecan be reduced significantly if 'the underlyingalgorithm is "good"-i.e., designed carefully in thefirst place. We will -characterize such algorithmsbelow and will examine a concrete example ofone-the design of a pattern matching chip. (We com-pleted this design in the spring of 1979 and have hadprototype chips fabricated, with testing now underway. Preliminary results show that the chip canachieve a data rate of one character every 250nanoseconds. which is higher than the memory band-

width ofmost conventional computers. This high per-formance is achieved in spite of little effort given tocircuit and layout design. We attribute this perform-ance mainly to the careful design of the underlyimgalgorithm.)We will also identify the major steps in special-

purposeVLSI chip designaccording to the good algo-rithm philosophy, and will offer a methodology fortransforming a good algorithm into a final layout in amore or less mechanical way. With this approach,designing a special-purpose chip should not be moredifficult than designing a high-level algorithm for thesame job.

Design philosophy

Algorithms that perform well on conventional ran-dom access computers are not always the best forVLSI implementation. As Sutherland and Mead3point out, good algorithms for VLSI implementationare not necessarily those requiringminimalcomputa-tion. Computation is cheap in VLSI; communicationdetermines performance. Thus, in this new era ofcomputation, we need to reconsider the algorithmsfor many tasks.A "good" algorithm in this context should possess

one or more of the following properties:

* The algorithm can be implemented by only a fewdifferent types of simple cells.

* The algorithm's data and control flow is simpleand regular, so that cells can be connected by anetwork with localand regular interconnections.Long distance or irregular communication isthus minimized.

0018-9162/80/0100-0026S00.75 © 1980 IEEE26 COMPUTER

* The algorithm uses extensive pipelining andmultiprocessing. Typically, several datastreams move at constant velocity over fixedpaths in the network, interacting at cells wherethey meet. In this way a large number of cells areactive at one time so that the computation speedcan keep up with the data rate.

Algorithms with these properties have been namedsystolic algorithms' by Kung and Leiserson.4 Manyhave been designed recently and are surveyed byKung.5

Since most special-purpose chips will be made inrelatively small quantities, the design cost must bekept low. Systolic algorithms have several advan-tages which help reduce this cost:

* One can design and test only a few different, sim-ple cells, since most of the cells on the chip arecopies of a few basic ones.

* Regular interconnection implies that the designcan be made modular and extensible, so one candesign a large chip by combining the designs ofsmal chips.

* By pipelining and multiprocessing, onecan meetthe performance requirement of a special-purpose chip simply by including many identicalcells on the chip.

*Theword "systole" was borrowed from physiologists, who use it torefer to the rhythmically recurrent contractions of the heart, whichpulse blood through the body. For a systolic algorithm, the functionof a cell is analogous to that of the heart. Each cell regularly pumpsdata in and out (performing some short computation before each"contraction"), so that a regular flow of data is kept up in the net-work.

All these imply that if a good algorithm is used, thedesign time, and therefore the design cost, can besubstantially reduced.In VLSI special-purpose chip design, then, the

most crucial decision is the choice of the underlyingalgorithm, since the suitability of the algorithmlargely determines the design cost and performance.Given the importance of algorithm design, it shouldreceive the largest part of the design effort. Low-leveloptimizations at the circuit or layout design level areprobably not worthwhile, as these will lead only tominor improvements in the overall performancewhile increasing design time.

The design of a pattern matching chip

A specific VLSI chip-one that performs on-linepattern matching of strings with wild card charac-ters-illustrates our design philosophy and method-ology. The design of the underlying algorithmdemonstrates that it can be mapped to circuit andlayout designs in a straightforward way.

The string pattern matching problem. Our chip ac-cepts two streams of characters from the host ma-chine, and produces a stream of bits as shown inFigure 2. One of the input streams, the text string, isan endless string of characters over some alphabet I.The other input stream, the pattern, contains a fixed-length vector ofcharacters over the alphabet TU{X},where X is the wild card character. The output is astream ofbits, each ofwhich corresponds toone ofthecharacters in the text string. The data streams moveat a steady rate between the host computer and the

Figure 1. Special-purpose chips attached to a general-purpose computer.

Figure 2. Data to and from the pattem matcher.

January 1980 27

pattern matcher, with a constant time between dataitems.Let us denote the input text stream as SOS0s2.

The finite pattern stream will be denoted asPoPi* Pk and the output result stream as r0rjr2...Characters in the two input streamsmay be tested forequality, with the wild card character-X deemned tomatch any character in X. The output bit ri is to be setto 1 if the substring si.si+ 1... si matches the pat-tern, and 0 otherwise, i.e.,

ri (si-k = po) A (si+i8k = p1) A A (s i =Pk)-

In Figure 2, for example, the pattern AXC matchesthe substrings SOS0s2, .S3s4S5, and s4s5s6 (ABC, AAC,and ACC). Result bits r2r5, and r6 are thus set to 1, andall other result bits are 0.This problem is important in many applications.

String pattern matching is a basic operation inSNOBOL-like languages6 and in data base querylanguages. String matching hardware has been pro-posed for use in office automation systems.7 Manyartificial intelligence systems make heavy use of pat-tern matching as a search method. Furthermore,string pattern matching is similar to many stressingnumerical computations such as convolutions andcorrelations. All -of the linear product problemsdiscussed by Fischer and Paterson8 are also similir tostring matching.

Several fast algorithms are known for solving thestring matching problem without wild card charac-ters on a normal random access machine.9' 10 Thesemethods use information about partial matches ofthe pattern with itself to avoid redundant com-parisons, skipping over parts of the string where par-tial match resultsmay be inferred fromprevious com-parisons. When wild card characters exist in the pat-tern these methods break down, since the "matches"relation is no longer transitive. The strings AC andXB both match AX, for example, but do not matcheach other. Information about matchings of the pat-tern with itself is therefore irrelevant if wild cardcharacters are present. The fastest algorithm knownfor string matching with wild card characters isbased on multiplication of large integers8 and re-quires more than linear time. The pattern matchingchip solves the problem in linear time by performingcomparisons in parallel.

The chip design. Wedesigned our chip according tothe methodology discussed earlier, beginning withthe careful design of a systolic algorithm and pro-ceeding to its hardware implementation.Algorithm design-data flow. The pattern and the

text string arrive alternately over the bus one charac-ter at a time. We will call the interval during whichone character arrives from either stream a beat. Dur-ing each pair ofconsecutive beats the chip must input

Figure 3. The flow of characters through a linear array of cells.

COMPUTER-28

two characters and output one result. All characterson the chip move during each beat.The chip is divided into character cells, each of

which can compare.two characters and accumulate atemporary result. The pattern and string follow apreset path of cells from the time they enter the chipuntil the time they leave it. On each beat every char-acter moves to a new cell. We use a linear array ofcells, with the pattern and string moving in oppositedirections, tomake each character of the stringmovepast all characters of the pattern. To make each pairof characters meet rather than just pass, we mustseparate them by one cell so that alternate cells areidle. Each cell is then active on alternate beats.Figure 3 traces the flow of characters for severalbeats.Following the pointer in Figure 3 illustrates the

history of the character cell, starting when the firstcharacter of the pattern, po, is present. Suppose thestring character si is present duringthis beat. Duringthe next beat the cell is idle, but during the beat afterthat it contains Pi and si+. Two beats later, P2 andSi+2 are together, thenp3 and s i+,, and soon. By thetime the last pattern characterp% leaves the cell, thesubstring sisi+1... si+k will have met the whole pat-tern. We can therefore keep the partial match resultsin this cell, update it whenever a new pair of charac-ters enters the cell, and output the results after thelast character of the pattern goes past. To outputresults we shift them along with the string, so thateach match result leaves the array with the last char-acter of its substring. Ifwe recirculate the pattern sothat the first character follows two beats after thelast one, we can output the completed result and in-itialize a new partial result on the same beat. Thenumber of character cells required is therefore nomore than the niitmber of characters in the pattern.Each character cell performs two separate func-

tions-it compares characters of the pattern andstring, and it updates and outputs match results.Wecan divide these functions betweeni two modules, sothat there are two linear arrays with connections be-tween corresponding cells as shown in Figure 4. Thecells on the topare the comparators; thepattern flowsthrough them from left to right, the string from rightto left, The bottom cells, or accumulators, receive theresults of the comparison from above. They maintainpartial results and shift completed results right toleft. Two bits associated with the pattern flowthrough the accumulators from left to right. One ofthese bits, called 1, marks the end of the pattern. It isone for the last character of the pattern and zero forthe others. The other bit is x, the "don't care" bit,which marks wild card characters. A one in this bittells the accumulator to ignore the result from thecomparator, since this pattern character matchesanything.We can. further divide the comparators. Rather

than using one large circuit to compare whole charac-ters, we can divide each comparator into modulesthat can compare single bits. Two characters areequal if corresponding bits are equaL By staggeringthe bits so the high-order bits enter the array before

the low-order ones, we can make a pipeline com-parator. Each single-bit comparator shifts its resultdown to meet the bits coming into thenextlowercom-parator. The active and idle comparators alternatevertically as well as horizontally, so that on each beatthe active comparators form a checkerboard patternas shown in Figure 5.

Algorithm design-cell algorithms. Two kinds ofcells must be designed to build a pattern matchingchip exhibiting the data flow described above:

* The one-bit comparatorhas one bit ofthepatternflowing from left to right, one bit of the stringflowing from right to left, and the comparisonresult for the pair of characters flowing fromtopto bottom. The cell uses this algorithm to updatethe comparison result:

din

Pout - Pinsout Sin Pin Poutdout - din AND (Pin= sin)

Sout Sin

dout

* The accumulatorreceives din (the result from thecomparator above), Ain (the end-of-pattern in-dicator), and xii (the don't care bit). It maintainsa temporary result t, and at theend of thepatternuses t to replace the result r that flows from rightto left:

din

Aout - AinXOut- xin_ AnoutIF A inTHEN rout-t; tr-TRUE xn--o -.XuELSE rout rin; t-t AND (xin OR din)

Flgure 4. Pattern matching achieved In two modules, eachconsisting of a linear array of identical cells. Comparatorsare on the top and accumulitors on the bottom.

January 1980 29

Figure 5. Comparators for single bits.

Circuit and layout design-data flow circuit. Eachpipeline used by the algorithm for data flow is imple-mented as a unidirectional shift register shifting oneach beat. Every other cell of the shift register con-tains valid data. In the NMOS technology used forthis chip, a shift register is a chain of invertersseparated by pass transistors as shown in Figure 6.When-the voltage on the gate of a transistor is nearthe supply voltage Vdd, its channel conducts current,while if the voltage is near ground it does not. The in-puts to the inverters can store charges, so data isstored within the inverters; the pass transistors con-trol the inverter inputs. A clock with two non-over-lapping phases controls the pass transistors.Because adjacent transistors are turned on by op-posite phases of the clock, there is never a closedpathbetween inverters that are separated by two transis-tors. Alternate inverters can therefore storeindependent data bits.The dynamic alternation of active and idle in-

verters in the NMOS shift register mirrors the alter-

nation of active and idle cells in the algorithm (com-pare with Figure 5). Each cell can thus contain onegated inverter from each of the shift registers thatpasses through it. The clock controlling the shiftregister stages in a cell can activate the cell. The shiftregister components are then fully utilized-all idleinverters are in idle stages.

Circuit and layout design-cell circuit. Since eachcell inverts its inputs before sending them to itsneighbors, two versions of each cell must be con-structed. One version-operates on positive inputs toproduce inverted outputs, while the other computespositive outputs from inverted inputs. Transforminga cell algorithm to its inverted twin is straightfor-ward, so the existence of two versions presents noproblem. Using the cell algorithms, wecan design cir-cuits for the twin versions of each cell. From the cir-cuit designs, we can lay out the masks for fabricatingthe chip. The positive version of the comparator cellillustrates the process. It takes positive inputs and

Figure 6. A shift register in NMOS.

COMPUTER

RASTER DISPLAY SYSTEM DESIGN NOTE 1.

How to displaya1280 x 1024

imagethat doesnt

flicker

System 3400's fast rise timeand minimal glitching yield sharp,uniform pixels at 60 Hz refresh.

uP 18

Competitive systems typically mustrun slower to minimize slowrise time and glitching problems.

Do it with the Lexidata System 3400 imageand graphics processor.

If you're designing high-resolution raster scan videodisplay systems, only Lexidatacan give you a 1280 x 1024picture that doesn't flicker.The high-speed, micro-

processor-controlled System3400 is unique among videoprocessors in its ability togenerate a pixel in only ninenanoseconds. This means youget a refresh rate thafts atleast twice as fast as otherprocessors on the marketAnd your happy systemusers get none of the eyefatigue common withconventional systems.

But a display that doesn'tflicker is just one of the waysthe System 3400 can helpimprove your image. Its ex-tensive line-drawing and tonalimaging capabilities make it aperfect fit in a wide range ofcolor, gray-scale and mono-chrome display applications.

So, whether you're design-ing a specialized system formedical imaging, or massproducing systems for avariety of CAD/CAM applica-tions, the 3400's repertoire ofover three dozen standardand optional features can giveyou the ideal mix of hardwareand software tools to handlethe job. And at a price you'resure to like.

Send For New DetailedSystem DescriptionTo find out more about theSystem 3400, send for a copyof our new 12-page systemdescription booklet Or, if youneed information immediately,call us at (617) 273-2700.

LEXIDATAISCORPORATION37 NORTH AVENUE. BURLINGTON, MA 01803

Reader Service Number 8

produces inverted outputs, so the outputs in the com-parator algorithm must be inverted:

Pout NOT Pin

Sout NOT Sin

ou-t- din NAND (Pin= Sin)

In NMOS, data storage can take place on the inputto any logic gate, as long as a pass transistor can iso-late that input. Thep and s shift registers can be im-plemented with inverters as planned, but a NANDgate can be used as the stage for the d shift register.Figure 7 is the circuit for the positive comparator.When the clock input goes from ground to Vdd, thepower supply voltage, all three pass transistors turnon. The pattern and string inputs are then stored onthe inverters, and the d input is stored on one input tothe NAND gate. The exclusive NOR gate outputsTRUE if the two inputs are equal, and FALSE other-wise. The output of this equality test goes to the otherinput of the NAND gate, which computes dou, Afterthe inputs have stabilized, the clock goes to ground.The outputs of this cell then provide stable inputs toneighboring cells until the clock goes high again.

Circuit and layout design-cell sticks. The nextstep after completing the circuit diagram is the de-sign of the cell's topological layout, or stick diagram,which shows the relative positions of all signal paths,power connections, and components but hides theirabsolute sizes and positions. Most of the circuit'scomponents can be implemented in several ways, anda choice among these must be made at this stage ofthe design. Figure 8 is an example of a stick diagram.

Silicon-gate NMOS technology uses three conduc-tion layers (differentiated by color in Figure 8). Fol-lowing Mead and Conway's 1 convention, blue linesrepresent metal conduction paths, red lines poly-crystalline silicon (polysilicon), and green lines diffu-sion into the substrate. The three layers are insulatedfrom each other except at contact cuts, representedby round black dots. The yellow squares are areas ofion implantation, used to create depletion mode tran-sistors. These serve as pull-up resistors in the gatesand inverters.NMOS field-effect transistors are created by cross-

ing a diffusion path (green) with a polysilicon area(red). The green path is the channel, and the red area isthe gate. If no ion implantation is present, the chan-nel conducts current only when the gate is at Vdd.The positive comparator cell uses pass transistors

and inverters to implement the shift registers; it alsouses a NAND gate and an equality, or NXOR, gate.These basic components are combined as shown inFigure 8 to produce the stick diagram for the positivecomparator cell. Power and ground run horizontallyacross the cell on metal (blue) paths. The clock is inpolysilicon (red) at the top and right edges, and dipsbelow the upper power wire near the middle of the cell

COMPUTER

Sin

9 ..

I-pX'lo.,

to allow the cell above to connect to the power wire.Data paths forp and s run horizontally along the top,while d runs downward in diffusion (green).Let us trace the p data path through the cell. It

enters at the left in diffusion and passes thr-ough thechannel of a transistor that is gated by the clock. Con-tact is made to a polysilicon path that goes to the in-put of the p inverter. The inverter output, in metaLcrosses the d data path with no interaction and pro-vides an input to the equality gate. It then passesover the s inverter and leaves the cell at the right.

Circuit and layout design-final layout and mask-ing. When stick diagrams have been designed for allof the cells, actual layouts can be produced. These fol-low the topology of the stick diagrams, but also in-clude the absolute sizes and positions of all com-ponents. Designing a layout involves choosing elec-trical parameters for all transistors as well as fol-lowing minimum spacing rules for the intendedfabrication process. Care must be taken to line uppower connections and data paths that cross several Figure 7. Positive comparator circuit.

NANDGATE NXORGATE

Figure 8. Stick diagram for the positive comparator cell in the pattern matching chip. Color differentiates conduction layers.

January 1980 33

Figure 9. The CMU pattern matching chip prototype. The pattern matching array at left center measures 472 by 1528 microns and Isconnected to bonding pads on a rectangle measuring 1536 by 1884 microns.

34 COMPUTER

cells. In principle, thelayout canbe designed mechan-ically from the circuit and stick diagrams.When the layouts for all cells are complete, theycan

be assembled into a working array with the inputsand outputs hooked to contact pads. The layouts canbe described using a graphics language (such as theCaltech Intermediate Form1) that can be interpretedto make the masks. These masks can then be used tofabricate the chips.Figure 9 is a photograph of the prototype pattern

matcher we constructed according to the methodolo-gy outlined above. It can handle patterns containingup to eight-two-bit characters.

Design alternatives. In designing the chip we oftenreached points wherewe had to choose among severalalternatives. There were three major decision-making areas-choice of an algorithm, choice of adata flow implementation, and choice of a method forcell implementation.Alternative algorithms. A bewildering variety of

algorithms could form the basis for a pattern match-ing chip. The desire for simple and regular data flowrules out the fast sequential algorithms described byBoyer and Moore9 and Knuth et al.10 Since these algo-rithms require dynamically changing communica-tion, their hardware implementation will be too com-plex to be modular.Mukhopadhyay" has proposed several machines

which store a character of the pattern in each cell andwhich broadcast the text string character by charac-ter to all cells. This broadcasting is the major disad-vantage of this algorithm. Each cell requires a con-nection to the broadcast channel, increasing thepower requirements of the system as a whole ordecreasing its speed. Our algorithm requires nobroadcasting of data.A chip designed by Mead et al.12 uses another

algorithm in which pattern characters are stored inthe cells. The text string passes through all of thecells, and the results of character matches are com-bined using a common wired-NOR bus. We wished toavoid unbounded fan-in of this type, since it maydegrade performance when a design is extended toVLSI.Another algorithm-similar to ours-uses a linear

array of cells with data flowing in only one direction.The pattern is permanently stored in the array ofcells, andthe text stringmoves past it. Partialresultsmove at half the speed of the text so that they accu-mulateresults from an entire substring match.Were-jected this algorithm because of the static storage ofthepattern-loadingthe cells impreparationforapat-tern match would require extra time and circuitry.Our algorithm is well suited to VLSI implementa-

tion. All communication is local, since each charactercell communicates only with its left and rightneighbors. This enhances modularity and extensibili-ty, as well as avoiding the large drivers needed forlong-range transmission. Only a few types of cells areused, with many copies of each type. By replicatingthe basic cells, pattern matching chips ofany size canbe formed. Finally, control of the chip is simplified

since our algorithm requires no separate operation toset up the system for a new pattern.Alternative data flow implementations. Although

the global flow of data is determined by the choice ofalgorithm, several methods of implementing the dataflow may be possible. Serial or parallel data transmis-sion between cells may be selected, for example. Com-munication may be coordinated in several ways. Thedata flow can even be transformed to combine severalcells into one circuit. We will discuss two of thechoices that arose in implementing the data flow ofthe pattern matching chip.The existence of idle cells can be avoided by com-

bining pairs of neighboring cells when implementingthe data flow. Because each cell pair contains one ac-tive and one idle cell at each beat, the two cells canshare circuitry. In the pattern matching chip, for ex-ample, neighboring comparators could have sharedthe equality gate, and the d datapath could havebeenmultiplexed.

If the amount of sharable circuitry is large enough,it may be advantageous to combine two or more cellsin this way. Some additional circuitry will of coursebe needed to coordinate the sharing andmay wipe outthe savings. The increased interdependence of the cir-cuit components may also offset the savings, sincedesign changes may become more difficult and errorsmay be made. The pattern matcher cells are too smallto profit from this data flow transformation.Another choice in data flow implementation is be-

tween self-timed and clocked (synchronous) datapaths. Ina clocked dataflow implementation, all datamovement is under centralized control. The data flowcontroller sends signals to each cell to enable datatransfers. The pattern matching chip uses this meth-od. In fact, the data flow control signals are the sameclock signals needed for data refreshing, althoughthis need not be true in general.

In a self-timed implementation, data flow control isdistributed among the cells so that each cell controlsits own data transfers. Neighboring cells must obey asignaling convention to coordinate their communica-tion. Self-timed data flow has advantages in modu-larity and extensibility, since no common clock isneeded. Each ofthe cells may run at its ownpace, syn-chronizing with its neighbors only when commu-nication is needed. Self-timing's disadvantage lies inthe extra circuitry needed to implement the signalingconventions. For systems small enough to use a com-mon clock-like the pattern matching chip-clockeddata flow is best. For larger systems, of course, self-timed communication may be the better choice.'3Alternative cell implementations. Two major

choices affected the design of the cells. We rejectedstatic shiftregisters, which can hold data for longperiods without shifting it, in favor of dynamic shiftregisters, which can not. Also, we chose a randomlogic implementation of the cell circuitry rather thana more structured approach using standard PLA-programmed logic array-and register layouts.The dynamic shift registers we used can not hold

data for more than about 1 millisecond without shift-ing. Data is refreshed only by shifting. Static shift

January 1980 35

registers, the alternative choice, have, regenerationcircuitry in every stage so that data can be held in-definitely without shifting. In addition to the twoclock phases, static registers need a third signal forthe shift command.

Static shift registers are probably the better choicefor most systems. They do not invert data betweenstages, as do dynamic shift registers, and theysimplify testing. For our chip, however, dynamicshift registers have advantages. The alternation ofactiveand idle cells allows justoneinverter from eachshift register to be placed in each cell. This permitsthe two-phase clock to do double duty as a data flowcontrol signal. The cells and the global layout arethus greatly simplified.The simplicity of the cell functions dictated the use

of random logic. If cells contain more than a fewgates, the state-machine design approach should betaken. The state of the cell can be held in a register,and the combinational logic used for changing statescan be implemented with a PLA. Standard layoutsfor registers and PLA cells are available, simplifyingdesign and layout tasks and shortening designchange and error correction time. However, the smallsize of ourpattern matcher cells, eachcontaining onlyfour gates, made the use of random logic possible.Design and layout of such simple circuits is easy.

Uses and extensions of the pattern matching chip.A pattern matching chip with n character cells candirectly match patterns oflength onlyup to n. Longerpatterns require the existence of more than n partialresults at each beat. Since any chip must be of finitesize, it is important that the chip be extensible. Itshould be possible to combine several chips to form alarger pattern matcher.In order to make the chip extensible, an input for

the result streamand outputs for thepattern and textstreams must be provided. Several pattern matchingchips can then be cascaded (Figure 10). The inputs toeach chip in the figure are taken from the outputs ofitsneighbors, so that the cells on all of the chips forma single linear array. The pattern is fed to the inputsof the leftmost chip, and the text string is input to therightmost chip. The result output is taken from the

leftmost chip. A cascade of k chips with n cells eachcan thus match patterns of up to kn characters.

If the pittern to be matched is longer than the ca-pacity of the available pattern matching system, thepattern can be run through the system several timesto match it against the entire string. If the systemcontains a total of n character cells, each run willmatch the complete pattern against n substrings. Tocover all substrings, all we need do is delay the stringby n characters on succeeding runs.Modifying the design of the pattern matcher can

piovide special-purpose hardware for problemssimilar to string matching. For example, we mightwish to count how many characters in a substringmatch corresponding characters in a pattern. Thisproblem can be solved by replacing the result bitstream with a stream of integers, and replacing theaccumulator cell with a counting cell:

din

Xout. xin Aln_ kAout

THEN rout-t; t-O xin_-XoutELSE IF xin OR dn,THEN t -t+ 1; rout r1n routELSE rout-rin

A problem of more practicpl interest is the com-putation of correlations. Here, the pattern, string,and result are all numbers. The result ri of a correla-tion is defined as

r1=(s_k-po)2 + (S,+i-k-Pi)2 + *(Si-p. )

A good match of substring topattern results in ahighcorrelation.Correlations can be computed by a machine with a

data flow identical to the string matching chip, ex-cept that all streams contain numbers. The com-parator is replaced by a difference cell that computes

Figure 1O. A five-chip pattem matcher-cascading of chips permits the direct matching of longer strings.

d, ' Sin-Pi.-

36 COMPUTER

Like the character comparison, this difference com-putation may be pipelined bit by bit. An adder cellreplaces the accumulator. The algorithm for the ad-der cell is

IFAinTHEN r0,t --t; t-0ELSE rout *- rin; t - t+din2.

Other problems such as convolutions and FIRfiltering have algorithms using the same data flow.4'5It should be clear that special-purpose hardwaresimilar to the pattern matching chip can be built forany of these problems.

Design methodology

A systematic approach is essential when designinga complex system of any kind. The design task mustbe broken into manageable subtasks, with a well-defined flow of information between them. Each sub-task can then be performed separately with no needto consider more than one subtask at a time. Thisallows division of labor and, more importantly,prevents mistakes and eases design changes.Because of the diversity of tasks and concerns in

VLSI design, a systematic method is especially im-portant in designing a special-purpose chip. It is im-possible, for example, to take global data flow, circuitdesign, and transistor characteristics into account allat once. We must find small subtasks, with bound-aries between them that hide the implementationdetails of one from another. Of course, any set of sub-tasks is unlikely to be completely independent, sinceproblems that crop up in performing onemay requireredoing another-difficulties in layout, for example,may mandate a circuit redesign. However, thesedesign iterations will be easier if the interactions be-tween subtasks are few.VLSI system structure suggests several natural

information boundaries. One advantage of geometri-cally regular algorithms is the spatial separation thatthey impose between subsystems. The interior of onecell can be designed in ignorance of the interiordetails of another (although exterior details such assize and data path positions must be known). If cellsare complex, the separation of circuit functionswithin each cell may provide an additional informa-tion boundary. The design of each functional block ofa cell can then be largely independent of the others.The existence of a hierarchy of abstract chip models,from algorithm to gate to layout level, is a further aidtoVLSI design. Each level of the hierarchy deals withan independent set of design issues and serves as animplementation ofthe next levelup and as a specifica-tion for the next level down.Chip design can thus be decomposed geometrical-

ly, functionally, and hierarchically. These decomposi-tions must be consistent to be used to best advan-tage. Tasks separated geometrically should also beseparated functionally and hierarchically. It wouldbe unfortunate, for example, if all cell circuits had to

be considered at once in order to construct a stickdiagram for a chip. Careful construction of a taskdependency graph, before beginning the design,avoids this problem. This graph should contain all ofthe subtasks to be performed and include the infor-mation needed for each and the precedence relationsamong them. Of course, backtrack paths resulting inseveral iterations ofone task because of difficulties inanother need not be shown. The chip design task isnot yet understood well enough to predict such back-tracking.The task dependency graph ensures that no more

than a small amount of knowledge is required for anysubtask. Each of the subtasks in the graph shoulddeal with the design ofone geometric area at one levelof abstraction. The circuit design of the entire chip allat once is too large a task because it covers too muchchip area. Generating a layout from a cell's function istoo large a task, since it spans too many levels of thehierarchy. Designing a single cell circuit from a cell'sfunction is probably a task of the proper size,although if the cell performs several different func-tions the task should be further subdivided.Figure 11 is a task dependency graph for the design

of a pattern matching chip like ours. Our own projectin fact brought out the need for the task dependencygraph and suggested its structure. It should besuitable for designing other chips of about the samescale. Each subtask deals with only one geometricregion, one circuit function, and one level of the VLSIchip model hierarchy. The arrows indicate the flow ofinformation between the subtasks, each of which webriefly describe below.

Data flow and cell type function. The chip designmust begin with an algorithm design conceptuallyspecifying the overall chip structure. Several algo-rithms will exist for any problem, and the best oneshould be found at this stage. The algorithm is a levelof abstraction at which to think about importantproperties such as regularity and modularity,without worrying about low-level issues. It should in-tegrate two distinct bodies of information. One is thedata flow pattern, including the number of cells, theirgeometric placements, and the choreography of data.The types of cells should be distinguished and thebeats onwhich each is active should be identified. Theother body of information is the function of each celltype, comprising notjust the circuit function but alsothe relative positions of signal inputs and outputsand the sequence of activity on each beat.

Cell combinations and placements. Cells in the im-plementation might not correspond one to one withthe cells in the algorithm. Several cells may be com-bined to share components or rarely used communi-cation paths in the algorithm may be multiplexed on-to one physical data path. The first task in implemen-ting the algorithm is to chooseamong these combina-tions and to position cells and cell combinations onthe chip.

This subtask requires information about the pat-tern of active and idle cells on each beat and the use of

January 1980 37

each communication path, which the data flow andgeometry subtask provides. It also requires informa-tion about the sharable subfunctions and complexityof each cell type, which the cell function subtask pro-vides. The output of this subtask is a skeleton layoutfor the chip, with each cell group assigned a locationand a set of contained cells.

Data flow control circuit.Data flow control circuitryensures the orderly movement of data on the correctbeats. Toperform this subtask we must learn the cor-rect sequence of beats from the algorithm data flowand which elements are active on each beat. From thecell combination subtask we learn which cell groupsand physical data paths contain the active elements.Based on the size and intended use of the chip we

can decide whether the data flow should be clocked orself-timed. If we choose to clock we must decidewhether to generate the signal externally or on thechip. We can then design the shift registers for datamovement and route any clock wires or synchroniza-tion signals among the cep groups on the skeletonlayout.

Cell logic circuits. We now have the three pieces ofinformation 'needed to design circuits for the

cells-the cell functions (from the cell type functionstep), thegroup ofcells tobeimplementedbyeach cir-cuit (from the cell combination step), and the shiftregister stages that must be included in each cell(from the data flow control). If the cell functions aresimple enough, ad hoc circuit design techniques maybe adequate. If the functions are complex, the cellsmay be split into subsystems to be designed inde-pendently. In this way, full advantage may be takenof the functional decomposition of each cell. In addi-tion, the circuit for each cell type can be designedwithout reference to the others, since all communica-tion needs have been considered in' the data flow con-trol. In designing the circuits, however, considera-tion must also be given to how the chip will be testedafter fabrication.

Cell timing signals. A cell function may compriseseveral distinct sequential steps performed on eachbeat. In the pattern matching accumulator, forexam-ple, the assignments

r0ut - t; t - TRUE

must take place in the correct order. The cell circuit,especially in a clocked system, may requlire signals to

PROBLEM

I.ALGORITHM LEVEI

GATE LEVEL

I I-ng_

i.~Em

I,EONv .

STICKS LEVEL

LAYOUTL

LAYOUTLEVEL j 1

MASKS FOR FABRICATION

Figure 11. A task dependency graph can guide the designer of a special-purpose VLSI chip. This graph-fora patternmatching chip-shows how the design effort can be decomposed Into simpler tasks.

COMPUTER

`N1

38 '

,L

------ --II

. - --- I

I

x-----_

- ---II

;__j

SUBTASKENE=

control such sequences, in addition to the signalsneeded for cell activation. These should be suppliedby the data flow control. Such signals should be iden-tified as soon as the cell circuits are complete, and cir-cuits to generate them added to the data flow control.

Communication sticks. When the data flow controlcircuitry is complete we can draw its stick diagram.For geometrically regular chips this will consist of anopen communication path network, control circuitry,and blank spaces for the cells. If the chip has cen-tralized clock circuitry, its topology can be designed.The distribution network for power and groundshould also be designed at this stage.

Cell sticks. The topological layouts of the in-dividual cells can now be designed. The relative loca-tions of power, ground, and all inputs and outputs areknown from the communication sticks. We must nowchoose implementations for the circuit elements anddecide on the relative positions ofinternaldata paths.

Cell layouts. Once the topological layouts of thecells are complete, the detailed layout of each cell ispossible. Following the design rules for the intendedfabrication process, actual dimensions for each elec-trical component and distances between circuit ele-ments must be chosen. This subtask's output is ascale drawing of the cell.

Cell boundary layouts. With cell sizes known, thecell boundaries can be laid out. The communicationpath and data flow control topology is known fromthecommunication sticks. Wirelengths and spacingscan be chosen, as can distances between cells. Inputsand outputs can be connected to contact pads. Thecell and cell boundary layouts form a completedescription of the chip; once they are complete,masks can be made and the chip fabricated.

Summary of the design methodology. With the helpof the task dependency graph, the seemingly com-plicated process of designing a special-purpose chipcan be carried out systematically, one subtask at atime. The graph presented here, although based onlimited design experience, seems to be a good start-ing point. The design tasks below the algorithm levelare relatively routine and may at least in principle behelped a great deal by various (future) computer-aided design systems. Eventually only the algorithmdesign level will require substantial effort and ex-perience.

Over the past few years efforts in several fields ofcomputer science have converged to make possiblethe design of special-purpose chips as described inthis article. The study of parallel algorithms, par-ticularly those for mesh-connected computers, hasprovided techniques for VLSI algorithm design.'4The work of Mead and Conway' in developing struc-tured NMOS design techniques has eased the designof reliable circuits and layouts. Improvements in

computer-aided design and graphics systems havereduced the drudgery of mask design. Finally, thedevelopment of suitable* intermediate languagesmakes the design and fabrication processes relative-ly independent and allows several users to sharedesigns.These developments allow the relatively inexperi-

enced designer to develop chips quickly and con-fidently for his own application. By concentrating onalgorithms, he can construct-with minimal designtime-chips of good performance and fairly smallarea. The design of the pattern matching chip de-scribed here took only about two man-months.Further developments' can make the designer's

task even easier. It is possible, for example, to buildlibraries of standard cells similar to subroutinelibraries. If a designer needs, say, an inner productstep cell, he can select it from a library rather thanconstruct it himself. Libraries of data flow implemen-tations are also possible, although their forms are lessobvious.Advances in fabrication technology may increase

the scale of projects that can be attempted. Asidefrom reductions in feature size, techniques such aswafer-scale integration wil increase the size andpower of special-purpose devices. The modularity in-herent in our philosophy is especially appropriate towafer-scale integration, where a wafer's chips are in-

TERMINALSPURCHASE FULL OWNERSHIP AND LEASE PLANS

PURCHASE PER MONTHDESCRIPTION PRICE 12 MOS. 24 MOS. 3 MOS.

LA36 DECwriter 11..........$1,695 $162 $ 90 $ 61LA34 DECwriter IV .......... 1,295 124 69 47LA120 DECwnter Ill KSR .... 2,295 220 122 83VT100 CRT DECscope ...... 1,895 182 101 68VT132 CRT DECscope ...... 2,295 220 122 83DT80/1 DATAMEDIA CRT .... 1,895 182 101 68T1745 Portable Terminal .... 1,595 153 85 5711765 Bubble Memory Terminal 2,795 268 149 101T1810 RO Printer .......... 1,895 182 101 68T1820 KSR Printer ......... 2,195 210 117 79T1825 KSR Printer ......... 1,695 162 90 61ADM3A CRT Terminal ...... 875 84 47 32QUME Letter Quality KSR ... 3,195 306 170 115QUME Letter Quality RO .... 2,795 268 149 101HAZELTINE 1410 CRT....... 875 84 47 32HAZELTINE 1500 CRT....... 1,195 115 64 43HAZELTINE 1552 CRT ....... 1,295 124 69 47DataProducts 2230 Printer .. 7,900 757 421 284DATAMATE Mini Floppy ..... 1,750 168 93 63

FULL OWNERSHIP AFTER 12 OR 24 MONTHS10% PURCHASE OPTION AFTER 3 MONTS

ACCESSORIES AND PERIPHERAL EQUIPMENTACOUSTIC COUPLERS * MODEMS * THERMAL PAPER

RIBBONS * INTERFACE MODULES * FLOPPY DISK UNITSPROMPT DELIVERY . EFFICIENT SERVICE

January 1980Reader Service Number 9

terconnected rather than cut apart for individualpackaging. Since some of the wafer's chips may bedefective, the fabricator must be able to reroute theinterconnections to replace a faulty chip with a func-tioning one. He can do this easily if the chips haveregular interconnections and if they include only afew types.The philosophy and methodology described here

will make practical the design of special-purposeVLSI chips by their users. Connected to a general-purpose computer, these devices will provide rapidsolutions to a variety of computations. Given theease of the design process and the availability ofnewdesign tools, VLSI modularity will become a commonarchitectural strategy in the 80's. E

Acknowledgments

We received help from many people during ourresearch. Using a preliminary version of Mead andConway's Introduction to VLSI Systems, BobSproull taught us basic NMOS design techniques inhis VLSI design course at CMU. Our pattern match-ing chip prototypes are included in the XEROXPARC multiproject chips for spring 1979 and havebeen fabricated at XEROX facilities. Larry Stewartchecked our layouts for violations of design rules.Bob Hon provided much needed help, includingbuilding the layout design system we used, convert-ing our designs into ICARUS format, and mountingchips for testing. Philip Lehman and Siang Song of-fered suggestions in the early stages of the algorithmdesign. Lynn Conway, Bob Hon, Dick Lyon, SiangSong, and Bob Sproull reviewed this paper. To thesepeople, and to the others who helped us directly or in-directly, we express our thanks.This research was supported in part by the Defense

Advanced Research Projects Agencyunder ContractF33615-78-C-1551 (monitored by the Air Force Officeof Scientific Research), the National Science Founda-tion under Grant MCS 78-236-76, the Office of NavalResearch under Contract N00014-76-C-0370, and anNSF Fellowship.

References

1. C. A. Mead and L. A. Conway, Introduction to VLSISystems, Addison-Wesley, Reading, Mass., 1980.

2. R. Hon and C. Sequin, "A Guide to LSI Implementa-tion," 2nd ed., XEROX Palo Alto Research CenterTechnical Report, 1979.

3. I. E. Sutherland and C. A. Mead, "MicroelectronicsandComputer Science," ScientificAmerican, Vol.237,No. 3, Sept. 1977, pp. 210-228.

4. H. T. Kung and C. E. Leiserson, "Systolic Arrays (forVLSI)," in I. S. Duff and G. W. Stewart, eds., SparseMatrix Proc. 1978, Society for Industrial and AppliedMathematics, Philadelphia, Pa., 1979, pp, 256-282. Aslightly different version appears in C. A. Mead andL. A. Conway, Introduction to VLSI Systems, Addi-son-Wesley, Reading, Mass., 1980, sec. 8.3.

5. H. T. Kung, "Let's Design Algorithms for VLSISystems," Proc. Caltech Conf Very Large Scale In-tegration, California Institute of Technology,Pasadena, Calif., Jan. 1979., pp. 65-90. Also availableas a Carnegie-Mellon University Computer ScienceDepartment Technical Report, 1979.

6. R. E. Griswold, J. F. Poage, and I. P. Polansky, TheSNOBOL4 Programming Language, Prentice-Hall,Englewood Cliffs, N. J., 1968.

7. P. J. WarterandD. W. Mules, "A Proposalfor an Elec-tronic File Cabinet," MICRO-DELCON, IEEE Com-puter Society, Long 'Beach, Calif., Mar. 1979, pp.56-63.

8. M. J. Fischer and M. S. Paterson, "String Matchingand Other Products," Massachussetts Institute ofTechnology, Project MAC, Technical Report 41,1974.

9. R. S. Boyer and J. S. Moore, "A Fast String SearchingAlgorithm," Comm. ACM, Vol. 20, No. 10, Oct. 1977,p. 762.

10. D. E. Knuth, J. H. Morris, and V. R. Pratt, "Fast Pat-tern Matching in Strings." SIAMJ. Computing, Vol.6, No. 2, June 1977, pp. 323-350.

11. A. Mukhopadhyay, "Hardware Algorithms for Non-numeric Computation," IEEE Trans. Computers, Vol.C-28, No. 6, June 1979, pp. 384-394.

12. C. A. Mead, R. D. Pashley, L. D. Britton, Y. T. Daimon,and S. F. Sando, "128-Bit Multicomparator," IEEEJ.Solid State Circuits, VoL SC-11, No. 5, Oct. 1976, pp.692-695.

13. C. L. Seitz, "Self-TimedVLSI Systems, " Proc. CaltechConf Very Large Scale Integration, California In-stitute of Technology, Pasadena, Calif., Jan. 1979, pp.345-355.

14. H. T. Kung, "The Structure of ParallelAlgorithms, "inM. C. Yovits, ed., Advances in Computers, Volume 19,Academic Press, N.Y., 1980.

Michael J. Foster is working toward aPhD in the Computer Science Depart-ment at Carnegie-Mellon University,where he holds a National ScienceFoundation Graduate Fellowship.From 1973 to 1977 he was employed asa programmer by Tracor Northern andPrinceton Gamma-Tech. His currentresearch interests include algorithms,

U " VLSI structures, andcomputer systemarchitecture. Foster received a BS in mathematics fromMIT in 1973 and is a member of ACM and Sigma Xi.

H. T. Kung is an associate professor of| computer science at Carnegie-Mellon

University, where he has been on thefaculty since 1974. His current researchinterests include special-purpose chipdesign, VLSI-related complexitytheory, and data base systems. He hasworked on algorithms for parallel com-puters and problems in computationalcomplexity. Kung graduated from Na-

tional Tsing-Hua University, Taiwan, and received his PhDdegree from Carnegie-Mellon University.

COMPUTER40

The Design of Special-Purpose VLSI Chips

Documents

Transcript of The Design of Special-Purpose VLSI Chips