SINGLE-CHIP COMPUTERS, THE NEW VLSI BUILDING BLOCKSkf044zq1720/kf044zq1720.pdfthe New VLSI Building...

11

/'

SINGLE -CHIP COMPUTERS, ¥

THE NEW VLSI BUILDING BLOCKS

Carlo H. Sequin

Computer Science DivisionElectrical Engineering and Computer Sciences

University of CaliforniaBerkeley CA 94720

ABSTRACT

Current trends in the design of general purpose VLSI chips are analyzedto explore what a truly modular, general-purpose component for digital comput-ing systems might look like in the mid 1980s. It is concluded that such a com-ponent would be a complete single-chip computer, in which the hardware foreffective interprocessor communication has been designed with the architectureof the overall multiprocessor system in mind. Computation and communica-tion are handled by separate processors in such a manner, that both can be per-formed simultaneously with full efficiency.

This paper then describes relevant features of X-tree, a research projectwhich addresses the question how the power of VLSI of the next decade canbest be used to build general purpose computing systems of arbitrary size. InX-tree, a general VLSI component realizable in the mid 1980's is defined, andits interconnection into a hierarchical tree-structured network is studied. Theoverall architecture, communications issues and the blockdiagram of the modu-lar component used are discussed.

ftCALTECH CONFERENCE ON VLSI, January 1979

jut r

"

y*

I1

ii1

1. INTRODUCTIONBy now it seems to be a well accepted

truth that computing demand will alwaysstay ahead of the available computing power.No matter how much computational powercan be realized on a single VLSI circuit,there will always be customers that want tohave computing power that substantiallyexceeds the capabilities of a single chip.(Even though the fraction of those custo-mers will decrease dramatically compared tothe number of single-chip users.)

If a system is to be extended overseveral integrated circuits, the questionarises how it should be partitioned onto thevarious chips. This question is crucial sincethe connections between chips represent aninherent communications bottleneck. Thenumber of interconnection points on a VLSIchip is limited and increases at a muchslower rate than the number of gates in thecircuit. Furthermore the bandwidth throughthose pins is limited; with current packagingtechnology, the physical paths leading fromone chip to the next are one or two ordersof magnitude longer than on-chip intercon-nections, and possess correspondingly largerparasitic capacitances.

It is thus important to find the rightfunctional blocks to be packaged onto indivi-dual chips. We can state immediately thatthe partitioning should not cut through thehigh-bandwidth path between main memoryand the actual processing circuits. Thusmemory and processing elements shouldstay tightly coupled on one and the samechip. VLSI technology in the mid 1980's,will make it possible to pack a sophisticatedprocessor and a substantial amount ofmemory on a single integrated circuit.

2. GENERAL PURPOSE VLSI COM-PONENTS

No matter how fast, convenient andsophisticated VLSI design tools will become,standard off-the-shelf parts will always be instrong demand for people who want to puttogether experimental systems or who wantto bring a product to market quickly. Theycannot afford to start from scratch and to

determine in all details, how the optimumchip for their purpose could be constructed.For small and medium volume production,jt is far more cost effective to pick a com-ponent from a limited selection of existingparts. If these parts have already beendebugged and if applications manuals areavailable, the user need not worry about allthe critical details of internal timing andinterconnections. It normally pays to choosea somewhat overdesigned component, usingmore circuitry, power or silicon area thanabsolutely necessary, thus guaranteeing thatthe requirements of the application will besafely met.

The question thus arises what type ofcircuits can be placed on the shelf and willprove useful in a large number of applica-tions. Within the frame work of digitalcomputing systems, what kind of chip willtake on the former role of NAND gates,flop-flops or registers ? Simple extrapola-tion of existing trends may not be the rightanswer for general purpose components witha hundredfold greater functional capability.

Memory chips with four times morestorage capacity appear on the market aboutevery three years now. Certainly, a 64k-word by 1-bit chip is a nice part to have, butwill a IM-word by 1-bit chip be equallyattractive ? - Or would most customersrather use a 64k- word by 16-bit part ? Also,from the point of view of optimal systemspartitioning, discussed in the previous sec-tion, a simple memory chip is a very poorcomponent. It has virtually no internalinteractions, and every time it is used, infor-mation has to go in and out through thepackage pins. These shortcomings shouldbe remedied by adding processing poweronto the memory chip itself.

Looking at the field of random logic,the current trends are more obscure. Thereis a certain evolution from the NAND gateto more complicated Boolean functions suchas multi-bit logic functions, parity checkers,arithmetic functions or priority encoders.However there is a limit to the size of usefullogic functions that are commonly used.The trend is to add input or output registersto the logic functions to permit easy assem-

ARCHITECTURE SESSION

o >

)

t

Iii

ii

t

1

i

the New VLSI Building Blocks

bly of synchronous machines. Here again,we find a mixture of processing and storageelements.

The most complicated parts availabletoday are obviously the single-chip micro-computers with a certain amount of on-chipmemory. But can these chips really beregarded as general purpose, modular build-ing blocks in the sense of a NAND gate ?While one can prove that all computing sys-tems could be built from nothing butNAND gates, it is rather doubtful that wecan interconnect many, say, INTEL 8048and produce a computing system of superiorpower ! What is missing is an answer to thequestion how several such componentsshould be interconnected and how theyshould interact. While successful TTL partshave been designed so that many of themcan be combined to form a system withmore capabilities than the sum of its parts,the interconnection issue was definitely nota primary concern in the construction of thefirst-single chip microcomputers. In thedesign of a truly modular VLSI componentfor large computing systems, the communi-cations issue can no longer be neglected.

3. INTER-CHIP COMMUNICATIONStandardizing the communication

between computers is a considerably morecomplicated issue than standardizing inter-connections between Boolean logic chips. Inaddition to voltage levels and load con-siderations, which apply to logic chips, inter-connection of computers involves additionalissues such as link allocation, timing, mes-sage formats, addressing and routing. Allthese issues have to be addressed in thedesign of a truly modular general purposeVLSI component. Hardware for effectivesupport of these features has to be incor-porated on the chip. The block diagram of apotential general purpose component, con-sisting of at least a processor, memory, acommunication switch and its controller, isshown in Fig.l using PMS notation [Bell &Newell 1971]. The case for such "ComputerModules" with proper consideration of thecommunications issues was first made byBell etal. [1973].

Figure 1. X-node, a modular VLSI building block ofthe mid 1980's, shown in PMS notation.

4. PROJECT X-treeIn 1977 a research project to investi-

gate these questions was started at theUniversity of California at Berkeley. Onespecific issue was to define a modular com-ponent from which general purpose comput-ing systems of arbitrary size could be built.Since the most advantageous way to inter-connect these components turned out to bea tightly coupled, hierarchical, tree-structured network, the project was namedX-tree [Despain & Patterson 1978a]. Andthe projected single component, from whichthis computing system would be assembled,is known as X-node. The project is still in itsearly stages of research. First some of thecommunications issues have been addressedin a top-down manner [Sequin et air 1978].Discussions of the design of the architectureof X-node have recently been started [Patter-son et al. 1979]. The anticipated timeframefor a potential realization of X-node is themid- 198075. It is assumed that at that timeVLSI chips with about IOO'OOO gates andhalf a million bits of storage will readily befeasible. Key research issues are the designof a truly modular VLSI component and theorganization of many such components intoa general purpose computing system.

CALTECH CONFERENCE ON VLSI, January 1979

Carlo H. Sequin438

■i

fi

i

i

5. INTERCONNECTION TOPOLOGIESIn an evaluation of different multipro-

cessor interconnection topologies [Despain& Patterson 1978b], it turned out that mostschemes have one of three disadvantages.1) The interconnecting link constitutes a

serious communication bottleneck,such as a common bus which has to beshared by all attached processors.

2) The switching hardware becomesunreasonably expensive for a largenumber of processors, as in fully inter-connected networks or in a fullcrossbar arrangement.

3) The interconnection scheme is nottruly modular, because the require-ments for the individual componentschange as the system grows. This isthe case in a hypercube network withthe topology of a cube in n dimen-sions, where the number of ports hasto increase with the logarithm of the.number of nodes.

Networks that are truly modular and incre-mentally expansible comprise lattice struc-tures in n dimensions and tree structures.Among those, trees have the further advan-tage that the average distance betweennodes grows only logarithmically with thenumber of nodes [Despain & Patterson1978a,b].

The primary contender for the inter-connection.schemeîn X-tree; is therejoreâbinary tree enhanced with additional links toform a half-ring or full-ring tree^CFig^).These additional links further shorten theaverage path length, they distribute message^traffic more evenly throughout the tree, andthey provide the potential for fault toleran?communication, with respect, to the .removalof a few nodes or links^ Various placements"for these extra links have been investigated,but most yield only insignificant improve-ment on the message traffic density and typ-ically require significantly more complicatedrouting algorithms than the half-ring orfull-ring structures [Despain & Patterson1978b].

Figure 2. Binary tree with full-ring connections. Whenthe dashed branches are omitted, a half-ring tree is ob-tained. Notice that the children of node n have nodeaddresses 2n and 2n + l respectively.

6. ADDRESSING AND ROUTINGX-tree is designed to be a truly modu-

lar, incrementally expansible system. Toâllow the system to grow to arbitrary size,neither the number of nodes nor the addressspace should be limited by artificial restric-tions. Thus unbounded, variable length

"addresses are employed throughout [Sequinetal. 1978].

Figure 3. An X-tree example in PMS notation.


mgie-unip computers, 439i<

II

!!i

!

!

i

II

I

1

tj

II

ii


T (one level below C )

Figure 4. Shortest path in a full-ring tree from current position,

C,

on node 19 to target, T, on node 50.

All communication throughout the tree is inthe form of messages. To enable effectiverouting of messages, the complete address issubdivided into a node address part and asecond part identifying a particular memorylocation if that node has any memory

that belongs to the global addressspace. In X-tree secondary memory, as well

* as -all input/output, is restricted to the fron-tier (i.e. the leaves) of the tree. This is tominimize the number of ports per node, andthus to alleviate the restrictions originatingfrom the limited number of pins. A PMSrepresentation of an example of a small X.tree is shown in Fig. 3.

The addressing scheme is designed sothat messages can be effectively routed fromnode to node simply based on local deci-sions, involving only the current nodeaddress and the destination address. In abinary tree this can be achieved in a verysimple manner. The root node is assignednode address "1". The node address of a leftchild is formed by appending a "0", and thatof a right child by appending a "1". Withthat scheme any node can readily be foundby starting at the root and using thesequence of bits in the node address tomake the routing decision at each node. Togo from an arbitrary node to another node,one has to move up in the tree to the com-mon ancestor of the two nodes, i.e. to thenode where the address matches all leadingbits in the target address. From there onemoves down to the destination.

In half-ring and full-ring binary treesthe routing algorithm is only marginallymore complicated. The horizontal links per-mit a message to take a shortcut before thecommon ancestor has been reached. Itturns out that in a full-ring tree the optimallevel to take the horizontal links is the onewhere ascending and descending path areseparated by less than five link units (Fig.4).Again the routing decision can be basedentirely on a comparison of the local nodeaddress and the target address.

Some fault tolerance can be achievedby listing secondary and ternary choices forthe routing decision, which are taken when


Choice

of

directions»s a function

of

relative target positi

far I near

left

line'ipht ne»r far

choice

left

!

left

line right right

I up

left

up upleft

uprightleft

upright

left

uprightleft

above "t

J right rif.hi right

1 up

left

left

proc.up : backIdown ' back

proc.

backright up

level 2 uprdown

nghileft3 right back

belowI

leftleft I Mown

rdo*n

right up23

upIdown

rdown

left

Idown up

rdo*n

rightrdownIdcwn rich!

fable 1. Decision table for full-ring tn « routing algO'ithm

440i

t

Ithe primary choice cannot be followed.Such a decision table for the case of the

full-ring tree is shown in table 1. Theproper field is selected from a comparison of

the horizontal and vertical distances between

current position and target node. Within

each field, first, second and third choice for

the routing out of the current position are

listed.Of course there are always cases where

this fault tolerant routing algorithm cannotbe successful - for instance when the target

node itself is missing. Special means to

safeguard against cluttering the tree with

worthless messages are thus required. Onepossible solution is to accompany each mes-

sage header with a byte that counts the

number of detours experienced, i.e. how

often the message could not be routed in

the primary direction. When this countreaches a limit, preset by the operating sys-

tem a higher level recovery mechanism is

invoked. For instance, the message couldthen be purged and a notice of this tact

could be returned to the originator.

7. SYSTEM EXPANSIBILITYIn X-tree all input or output and secon-

dary memory is at the frontiers of the tree.

Thus every time a node is added to the

tree 'a terminal or a storage device has to

move and thereby changes its node address.

A mechanism is necessary to route messagesautomatically to that node even though themessage header may still carry an outdatednode address. With the following conven-tions this can be achieved. Messages for

leaf nodes are identified as such. Each node

knows whether it is a leaf node or not

Messages destined for leaf nodes are routeddownwards from their specified target node

address along the chain of left children untilthey reach an actual leaf node. This schemeworks, if during the expansion of the tree

existing leaf attachments are moved to the

left child position, of the newly attachednode, and the right child position is used to

attach new equipment (Fig.s).

before expansion

after expansion

Figure 5. Principle of incremental expansion of X-tree, which permits messages to find the proper leafnode.

8. MESSAGESIf X-tree lives up to its potential of

substantial parallel computing, many mes-sages will travel simultaneously through the

tree, and many messages may want to use

the same link between two nodes. In ordernot to stall one message until another mes-sage has been transmitted completely, mes-sages are time multiplexed on the links on a

demand bases. For this purpose each mes-sage header carries a unique identifier called

a "slot address", so that the various partsbelonging to different messages can be

sorted out again at the receiving node, in

each node all incoming messages are

assigned such a "slot address", which is validwithin that node and on the subsequent link.

A specific slot address precedes any part of amessage transmitted over a particular link.


Single-Chip Computers 441

i

Ii

I

t

\

I

i!

t!

t

!

t


9. X-node ARCHITECTUREBytes Explanation

SA / new Slot address (SA) of type *new"

Target NA together with target node address <NA).which

may

extend overseveral bytes.NA cont. sets up a new message channel through the tree,

Beginning offirst submessage.

Here, one or several other messages

may

be interspersed on the same link.

Slot address of type *old* identifiescontinuation of a previous message.

Data may consist ofone or more submessageswhich

may

be individually checkedand acknowledged to the originator.

Here, one or several other messages

may

be interspersed on the same link.

Previous

message

continues.

End of last submessage in this channel,may include ere check remainderand end oftransmission character.

Tear-down slot addressremoves

message

channel.

At a first glance, each X-node is simplya computer that communicates with 4 or 5nearest neighbors. However because of thebandwidth requirements discussed above,normal microprocessor input/output tech-niques are inadequate. It is important thatthe processor is not involved in the menialtask of routing messages through X-tree.Computation must occur in parallel withcommunication. Thus^each JX-node, containsa self , controlled switching liSwork with its,,own I/O buffers and controllers** In the sim-plest version of X-node, the actual processoris attached to this network in the samemanner as the links to the nearest neighbors(Fig. 6). This permits an easy separation ofthe development of the communicationshardware and of an advanced X-node com-puter.

10. SWITCHING HARDWARE

Table 2. Message format.

For the purpose of creating and removingmessage paths, there are different types ofslot addresses. With the header of a newmessage, which carries the target address, aslot address of type "new" is transmitted, toestablish a path throughout the tree to theproper destination. All subsequent bytes areconsidered to be part of the message contentand .are thus not involved in the routingprocess. If a message has to be interrupted,then subsequent parts of that message willbe identified with the same slot address asbefore, but of type "old" to indicate that thisis a continuation of a previous message.Active transmission is terminated by the ori-ginator by sending a slot address of type"end", which removes the established mes-sage channel. These, conventions result inthe message format shown in Table 2.

The heart of this switching network isa time'multiprexed bus, Since this bus isanticipated to be completely containedwithin one chip, its associated parasitic capa-citances will be rather low, and the resultingbandwidth for a given amount of drivepower will thus be about an order of magni-tude higher than that through the packagepins associated with the input / output ports.This bus can thus conveniently serve -theattached six16 eignt ports at their full capa-city (Fig.6)..

Each port consists of a set of input andoutput buffers and the necessary finite statemachines to control them. Arriving siotaddresses, which precede every separate partof a message and are identified by a specialtag, switch the input multiplexer to theproper fifo buffer for that particular message(Fig.7). The internal communications busconsists of a data bus and an address buscarrying port numbers as well as slotaddresses. The combined bus is allocated ina fixed round-robin manner to all attachedinput ports, which in turn can transmit onebyte to a particular output port or to therouting controller.


SA / old

Dsta

Data

SA / end

equmarlo442

ii

»

Figure 6. Block diagram of X-node, showing the X-node processor and the switching networkconsisting of intra-nodebus, routing controller and input/outputbuffers to the communication links.

Slot addresses of type "new" cause themessage header to be sent to the routingcontroller. There the proper output port isdetermined from the routing algorithm anda suitable slot address will be assigned tothis newly to be established message chan-nel. This information is returned to the portof entry where it is written into a look-uptable (Fig.7) . From then on, the intra-nodecommunication for this channel can be han-dled directly by the input port.

Each output port monitors the intra-node bus. If it finds its own port number onthe address bus, it will pick up a slot addressand the corresponding data byte and enterthe later in the proper fifo output buffer(Fig.B). Simultaneously, the finite statemachine at the output end selects a channelwith valid data for transmission over thelink. The data bytes are preceded by thetransmission of the corresponding slotaddress.

11. X-node MEMORYMemory contained within each X-node

is not part of the global address space. It*acts only as a cache to the secondarymemory contained entirely at the. leaves ofX-tree. To alleviate the paging overhead,this memory should be as large as possibleand thus the densest storage techniqueshould be employed. For the mid 1980's itis anticipated that about 64k bytes could beimplemented together with the X-node pro-cessor on the same chip if dynamic RAM orcharge coupled devices are used. The con-tents of this local main memory can be data,program or microcode, and they will thus beused by different functional blocks. Unfor-tunately, typical high-density memory can-not easily be implemented as a true mul-tiport memory, and thus it is not possible toextract three words from different locationssimultaneously. Severe contentions for thememorybus can thus be expected.


Single-Chip Computers, 443

I

t

!


Figure 7. Blockdiagram of input port Figure 8. Blockdiagramof output port.

To alleviate this problem, .we have,decided to build an on-chip memory hierajfcchy

;

consisting of,,a Jiigh-density RAM: andthree high-speed static caches todata, instructions and microcode, and there-fore closely tied in with the ALU, theinstruction decoder and the microcontroller,respectively. Contentions for the commondata path between the main memory and thethree caches can be minimized by properchoice of the cache parameters. Thisapproach combines the storage densityadvantage of dynamic RAM with the higherspeed of static caches and yields a lot offlexibility in the allocation of local mainmemory space to data, programs or micro-code.

Swapping of microcode occurs throughthe same mechanisms as paging of programsor data from secondary memory at the fron-tier of X-tree. No special mechanism has tobe invoked to change the instruction set ofone of the X-nodes, and the latter can thusbe tailored dynamically to best solve thecurrent computational problem.

12. OTHER ARCHITECTURALFEATURES

There are other architectural featureswhich indirectly relate to the communicationbetween processors. Issues of timing, syn-chronization, data sharing and protection areall crucial in multiprocessor, multi- user sys-tems. The architecture of future chipsshould thus lend ample support to thosefeatures. Specifically, the concepts ofprocesses [Dijkstra 68, Wirth 1971, BrinchHansen 1972, 1975] and modules [Wirth1977] are important for a clean and struc-tured approach to parallel computing. In X-tree we plan to give strong support to themechanisms that create, terminate or moveprocesses throughout the system [Pattersonet al. 1979]. A vital part in this context is amechanism that permits fast process switch-ing without the explicit need to save manyspecial registers. Again the cache basedarchitecture provides a fast, automatic andtransparent mechanism for the replacementof the active parts of the memory contentswhen a process environment is changed. A


444 Carlo H. Sequin

*

Ijf

i

{

l

i

small operating systems kernel based on thelanguage Modi) la [Wirth 1977] is currentlybeing developed and will be used in futureextensive studies of the operating issues.

It goes almost without saying, that thecustomers of the next decade will expectstrong support for high level language con-structs, data structures, bounds checkingand any thing that may help to slow downthe increase of the gap between softwareand hardware costs.

13. SUMMARYIn summary, there will always be a

demand for off-the-shelf, general-purposecomponents, which can be used as trulymodular building blocks for the constructionof computing systems of any size. One suchcomponent has been identified and one wayto organize that component into an incre-mentally expansible computing system hasbeen outlined. Two crucial issues in thedesisri of such a system are interprocessorcommunication and the operating system.Both issues must be addressed at an earlystage of the design so that the necessaryhardware suDoort can be provided.

We felt that the issue of interprocessorcommunication is even more important atan early stage of a VLSI multiprocessor pro-ject. The emergence of truly modular, gen-eral purpose VLSI components may actuallydepend on the design of a standard inter-processor communication protocoll and arealization of the required switchinghardware on VLSI chips.

14. ACKNOWLEDGMENTSThe author would like to point out that

project X-tree is a "tightly coupled" teameffort of our Architecture Group in theComputer Science Division at Berkeley,including Profs. Despain and Patterson, andseveral graduate students, and that it isalmost impossible to determine the specificcontributions of each member of the team.Particular thanks go to Al Despain and DavePatterson for their suggestions, commentsand careful review of this manuscript.

This study was sponsored in part bythe Joint Services Electronics Proeram, Con-tract F44620-76-C-0100.


445

r

/

\

t

?

*

I

!

I

Single-Chip Computers,the New VLSI Building Blocks

REFERENCESBell,C.G. and Newell,A. (1971): „.„ , M , . -,

in Computer Structures: Readings and Examples, McGraw-Hill 1971, Chapter 2.

8e11,C.G., Chen,R.C, Fuller,S.H., Grason,J., Rege,S. and Siewiorek,D.P (1973):

"The Architecture and Applications of Computer Modules: A Set of Components tor

Digital Design", lEEE Compcon, March 1973, Conf. Proc. pp 177-180.

Brinch Hansen,P. (1972):_T1 1(VV)

„A <7Q"Structured Multiprogramming", Comm. ACM 15, No 7, July 1972, pp 574-578.

Brinch Hansen,P. (1975): „'^^ „ c r r ; *j« o"The programming language Concurrent Pascal", lEEE Trans. Software Eng. 1, No 2,

1975, pp 199-207.Despain,A.M. and Patterson,D.A. (1978a):

"X-Tree- A Tree Structured Multiprocessor Computer Architecture , J//7 symp. on

Comp. Arch., Palo Alto, CA, April 3-5, 1978, Conf. Proc. pp 144-151.Despain,A.M. and Patterson,D.A. (1978b):

"The Computer as a Component: Powerful Computer Systems from Monolithic

Microprocessors", submitted to Comm. ofACM.Dijkstra,E.W. (1968):

"Co-operating sequential processes", in Programming Languages, cd. F. Cxenuys,

Academic Press, London, 1968.Haendler,W., Hofman,U. and Schneider,HJ. (1976):

"A General Purpose Array with a Broad Spectrum of Applications , in Computer archi-

tecture workshop of the Gesellschaft fuer Informatik Erlangen, May 1975, InformatikFachberichte, Springer, Berlin, 1976.

Patterson,D.A.,Fehr,E.S. and Sequin,C.H. (1979):"Design Considerations for the VLSI Processor of X-tree , submitted to the 6th

Annual Symposium on Computer Architecture, Philadelphia, April 1979.

Seauin C H Despain,A.M. and Patterson,D.A. (1978):' '"'Communication in X-tree, a Modular Multiprocessor System", ACM 78, Washington

D.C., DecA 1978, Proc. pp 194-203.Wirth,N. (1971):

"The programming language Pascal", Acta Informatica 1, 1971, pp 3S-W.

IFt ' "Modula: A language for modular multiprogramming", Software - Practice and Experi-

ence 7, No 1, 1977, pp 3-35.


446

*i


SINGLE-CHIP COMPUTERS, THE NEW VLSI BUILDING BLOCKSkf044zq1720/kf044zq1720.pdfthe New VLSI Building...

Documents

Transcript of SINGLE-CHIP COMPUTERS, THE NEW VLSI BUILDING BLOCKSkf044zq1720/kf044zq1720.pdfthe New VLSI Building...