CRAY FAMILY [COA]

  • 8/8/2019 CRAY FAMILY [COA]

    TERM PAPER OF COMPUTER ARCHITECTURE AND ORGANIZATION

    TOPIC: CRAY FAMILY

    Submitted by: Avinash Manhas
    Roll no.: RE2801B46
    Reg. no.: 10809450
    Course: B.Tech-M.Tech (IT)

    Submitted to: Lect. Ruchika Dhall

    CRAY FAMILY

    INTRODUCTION

    The first Cray-1 system was installed at Los Alamos National Laboratory in 1976 for $8.8 million. It boasted a world-record speed of 160 million floating-point operations per second (160 megaflops) and an 8 megabyte (1 million word) main memory. No wire in the system was more than four feet long. To handle the intense heat generated by the computer, Cray developed an innovative refrigeration system using Freon.

    In 1988, Cray Research introduced the Cray Y-MP, the world's first supercomputer to sustain over 1 gigaflop on many applications. Multiple 333 MFLOPS processors powered the system to a record sustained speed of 2.3 gigaflops.

    The 1990s brought a number of transforming events to Cray Research. The company continued its leadership in providing the most powerful supercomputers for production applications. The Cray C90 featured a new central processor with industry-leading sustained performance of 1 gigaflop. Using 16 of these powerful processors and 256 million words of central memory, the system boasted unrivaled total performance. The company also produced its first "minisupercomputer," the Cray XMS system, followed by the Cray Y-MP EL series and the subsequent Cray J90.

    In 1993, Cray Research offered its first massively parallel processing (MPP) system, the Cray T3D supercomputer, and quickly captured MPP market leadership from early MPP companies such as Thinking Machines and MasPar. The Cray T3D proved to be exceptionally robust, reliable, sharable and easy to administer compared with competing MPP systems.

    In another technological landmark, the Cray T90 became the world's first wireless supercomputer when it was unveiled in 1994. Also introduced that year, the Cray J90 series has since become the world's most popular supercomputer, with over 400 systems sold.

    Cray Research merged with SGI (Silicon Graphics, Inc.) in February 1996. In August 1999, SGI created a separate Cray Research business unit to focus exclusively on the unique requirements of high-end supercomputing customers. Assets of this business unit were sold to Tera Computer Company in March 2000.

    Cray provides two types of dedicated nodes: compute nodes and service nodes. Compute nodes are optimized to run parallel MPI and/or OpenMP tasks with maximum efficiency. Service nodes provide scalable system and I/O connectivity and can serve as login nodes from which applications are compiled and launched. Cray provides fully integrated networking, using an efficient, low-contention three-dimensional (3D) torus architecture, designed for superior application performance for large-scale, massively parallel applications.

    Contents:

    * Cray supercomputer families

    * Assessing a supercomputer

    * Cray1

    a. Address Component

    b. Scalar Component

    c. Vector Component

    d. I/O Component

    * PVP Generations, XMP, YMP, C90, T90

    a. Parallel Vector Processors, the core product line of Cray Research.

    b. Inside a Vector CPU

    * The Cray-2

    * Cray Superserver systems

    * Cray Operating systems

    * Ows and other support equipment

    Cray supercomputer families

    1972 to 1996, very approximate time line (dates not to scale)

    T3d (* hosted by the PVP line) --> T3e --> T3e/1200    = MPP
    C1 -> XMP --> YMP --> C90 ----> T90                    = PVP
    \ C2
    \ -> C90M                                              = Large memory
    \ XMS --> ELs ---> J90 -> J90se -> SV1                 = Air cooled Vector supermini
    APP --> SMP ---> CS6400                                = Sparc Superserver
    ....... C3 --> C4                                      = Cray Computer corp.

    Object Description:

    -->  Direct architectural descendant
    \    Similar architecture but different technology
    ...  Family resemblance
    *    Hosted by

    Computers that proudly carried the Cray name can be divided into groups of related architectural families. The first incarnation was the Cray-1, designed and built by Cray Research founder Seymour Cray. This machine incorporated a number of novel architectural concepts that became the foundation for the line of vector supercomputers which made Cray Research (1978..1995) legendary in the scientific and technical computing market.

    The Cray-1 evolved through a number of, often one-of-a-kind, sub-variants before being replaced by the evolutionary XMP and substantially different Cray-2. This split between the XMP and Cray-2 marked the first divide in the Cray architectural line that together came to define and dominate the supercomputing market for the best part of 20 years. The line of machines designated C1, C2, C3 and C4 were the particular developments of Seymour Cray, who split on friendly financial terms from Cray Research in 1989 to progress the Cray 3 project and found Cray Computer Corporation (1989..1992).

    In parallel to this, the main body of Cray Research evolved and developed the original Cray-1 concept through four technological generations that culminated in the 32 CPU T90. Along with this, a line of compatible mini supercomputers, an enterprise class version of a scaled-up SMP Sparc architecture and a line of massively parallel machines were developed and brought to market.

    Cray machines were never cheap to buy or own but provided demanding customers with the most powerful computers of the time. Used by a select group of research labs and the top flight of industry, they defined the very nature of supercomputing in the 1980s and 1990s.

    Assessing a supercomputer

    It is easy to describe the power of these supercomputers in terms of CPU MHz and Mflops, but the numbers fail to quantify the real difference between Cray computers and the other available machines of the day. It is easier to think in terms of an analogy. If you compare computers to cars and lorries, the Cray machines are those big dumper trucks and land graders that you see lumbering round a quarry. They certainly are not the fastest to change direction and you can't use them to do the weekly supermarket shop, but when it comes to moving rocks in large quantities there is nothing to touch them. The speed of the CPU in a computer is only one measure of its usefulness in solving problems; the other important metrics for effective problem solution are capacity and balance.

    When looking at a high performance computer you have to examine many aspects, specifically CPU speed, memory bandwidth, parallelism, IO capacity and finally ease of use. Running briefly through this list we can see that Cray machines had features in each area that combined to deliver unmatched time to solution.

    CPU speed: all data maths was done in 64 bit words, and lists of numbers (vectors) could be worked as efficiently as single values. Special instructions could be used to short-circuit more complex operations (gather/scatter, population, leading zero, vector arithmetic). The CPUs also happened to be implemented in very fast logic.

    Memory bandwidth: Cray memory is fast real memory, with no page faults or translation tables to slow memory access. Memory was subdivided into independent banks so that a list of numbers could be fetched to a CPU faster than the per bank memory delay. In the vector machines the memory was globally and equally accessible by all CPUs, but the MPP systems had physically distributed, globally accessible memory.

    IO capacity: provided by separate subsystems that read and wrote directly to memory without the need to divert the CPU from computational tasks. The disks had to be the best available as they would often receive a pounding far in excess of most disk duty cycles. Heavy duty networking was provided initially by proprietary protocols over proprietary hardware but later, when TCP/IP had been invented, open standards were adopted. The machines often sat at the centre of large diverse networks.

    Ease of use: achieved on two fronts, for both programmers and administrators. By providing compilers that could automatically detect and optimise the parallelism and vectorisation within a program, as well as highly efficient numerical libraries, programs were developed that achieved very high percentages of the peak speed of the machines. Unicos, an extended Unix variant, provided the system administrators with an operating system with big system facilities and a familiar interface. Unicos was Unix tuned for big multi-user environments, providing mainframe class resource control and security features.

    The Cray 1

    The Cray 1 was first delivered in 1976. This was around the same time that 8-bit microprocessors were beginning to gain popularity; typical memory components were 1 K bit SRAM and 4 K bit DRAM. Most machines were operating at about a 1 MHz clock rate, had 32-bit words, and large mainframes had 1 MB to 8 MB of RAM.

    The Cray 1 had (Baron and Higbie CS manual)

    * 64-bit words
    * 8 MB of RAM
    * 16-way interleaving on low-order bits
    * 50 ns memory cycle
    * 12.5 ns clock cycle (80 MHz)
    * 12 pipelined functional units

    The Cray 1 has 3 basic data types: addresses (24-bit integer), integers (64-bit), and floating point (64-bit, 48-bit mantissa).

    The 12 functional units are divided into four groups.

    Group 1 -- Vector units

    * Vector (integer) Add: 3 stages
    * Vector Logical: 2 stages
    * Vector Shift: 4 stages

    Group 2 -- Vector and scalar units

    * Floating Add: 6 stages
    * Floating Multiply: 7 stages
    * Floating Reciprocal Approximation: 14 stages

    Group 3 -- Scalar units

    * Integer Add: 3 stages
    * Logical: 1 stage
    * Shift: 2 stages
    * Scalar population count and leading zero count: 3 stages

    Group 4 -- Address units

    * Add: 2 stages
    * Multiply: 6 stages

    The machine itself is divided into six major subsystems:

    * Memory
    * Instruction component
    * Address component
    * Scalar component
    * Vector component
    * I/O component

    Instruction Component

    Cray 1 instructions are 32 or 16 bits, so from 2 to 4 instructions can be packed into a word. Instructions are thus addressed on 16-bit boundaries while data is addressed on 64-bit boundaries.

    The instruction unit has four 16-word instruction buffers, three instruction registers, and one instruction counter. Each 16-bit field in a word is called an instruction parcel.
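
    The parcel addressing described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the sketch assumes nothing about which end of the word holds parcel 0.

```python
PARCEL_BITS = 16   # one instruction parcel, as stated in the text
WORD_BITS = 64     # one memory word
PARCELS_PER_WORD = WORD_BITS // PARCEL_BITS  # 4 parcels per word


def split_parcel_address(parcel_addr):
    """Split a parcel address into (word address, parcel index in that word)."""
    return divmod(parcel_addr, PARCELS_PER_WORD)


def instructions_per_word(instr_bits):
    """How many instructions of a single size fit in one 64-bit word."""
    return WORD_BITS // instr_bits


if __name__ == "__main__":
    for addr in (0, 3, 4, 10):
        word, parcel = split_parcel_address(addr)
        print(f"parcel address {addr}: word {word}, parcel {parcel}")
    print("16-bit instructions per word:", instructions_per_word(16))
    print("32-bit instructions per word:", instructions_per_word(32))
```

    A word therefore holds between 2 and 4 instructions, depending on the mix of 32-bit and 16-bit forms.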

    The three instruction registers are:

    * Next Instruction Parcel (NIP) -- holds the first parcel of the next instruction, prefetched from the buffer
    * Current Instruction Parcel (CIP) -- holds the high-order portion of the instruction to be issued
    * Lower Instruction Parcel (LIP) -- holds the low-order portion of the instruction to be issued

    For a 32-bit instruction, the low-order portion is fetched to the NIP and then moved to the LIP. There is no mechanism for discarding instructions in the pipe -- once in the CIP/LIP, they will be issued. At most they will be delayed for some time.

    The instruction buffers are tied to the memory via the 16-way interleaving, so it is possible to fill a buffer in 4 clock cycles (recall that the clock is 12.5 ns and memory is 50 ns). Buffers are filled on a demand basis in a round-robin pattern. They thus act as an instruction cache of 256 instruction parcels (16-bit instructions), organized into four lines of 64 parcels. Each buffer has its own address comparator, so we would call this a fully associative cache (easy to implement when there are only 4 lines). The buffers cannot be written to -- a write bypasses the instruction cache and only goes to main memory.
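
    The behaviour of the four buffers as a small fully associative cache with round-robin replacement can be modelled roughly as follows. The class and its counters are hypothetical, and the sketch assumes each buffer holds an aligned block of 16 words (64 parcels), which the text does not state explicitly.

```python
WORDS_PER_BUFFER = 16
PARCELS_PER_WORD = 4
NUM_BUFFERS = 4
PARCELS_PER_BUFFER = WORDS_PER_BUFFER * PARCELS_PER_WORD  # 64 parcels per buffer


class InstructionBuffers:
    """Toy model: four buffers, each with its own tag comparator."""

    def __init__(self):
        self.tags = [None] * NUM_BUFFERS  # block number held by each buffer
        self.next_victim = 0              # round-robin fill pointer
        self.hits = 0
        self.fills = 0

    def fetch(self, parcel_addr):
        """Return True on a buffer hit, False if a buffer had to be filled."""
        block = parcel_addr // PARCELS_PER_BUFFER  # assumed aligned blocks
        if block in self.tags:                     # all comparators checked
            self.hits += 1
            return True
        self.tags[self.next_victim] = block        # demand fill
        self.next_victim = (self.next_victim + 1) % NUM_BUFFERS
        self.fills += 1
        return False


def run_loop(loop_parcels, passes):
    """Execute a straight-line loop body of `loop_parcels` parcels `passes` times."""
    buffers = InstructionBuffers()
    for _ in range(passes):
        for addr in range(loop_parcels):
            buffers.fetch(addr)
    return buffers


if __name__ == "__main__":
    small = run_loop(100, 10)   # fits in two buffers
    large = run_loop(320, 3)    # needs five blocks, one more than available
    print("small loop fills:", small.fills, "hits:", small.hits)
    print("large loop fills:", large.fills, "hits:", large.hits)
```

    Under these assumptions a loop that fits in the four buffers is fetched from memory only once, while a loop spanning five blocks refills a buffer for every block on every pass.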

    Scalar instruction issue requires that all of the instruction's required resources be free -- otherwise the instruction waits. Vector instruction issue in the Cray involves reserving functional units, including memory, operand registers and result registers, and then releasing an instruction once all of its resources are available. In addition, some data paths are shared between the vector and scalar components, and these must be available.

    The control unit is able to detect when a result register for one vector operation is an operand for another vector operation and, if the two vector instructions do not conflict in any other resource requirements, it sets up a vector chaining operation between the two instructions.
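
    The benefit of chaining can be estimated with a deliberately simplified timing model. The model below is an assumption, not the machine's actual issue logic: each pipelined unit is taken to deliver one result per clock after a start-up equal to its stage count, and memory and issue overheads are ignored. The stage counts are the ones listed for Floating Multiply (7) and Floating Add (6).

```python
def unchained_cycles(n, stages_first, stages_second):
    """Second operation starts only after the first has produced all n results."""
    return (stages_first + n) + (stages_second + n)


def chained_cycles(n, stages_first, stages_second):
    """Results of the first unit stream straight into the second unit."""
    return stages_first + stages_second + n


if __name__ == "__main__":
    n = 64  # one full vector register
    mul_stages, add_stages = 7, 6
    print("unchained:", unchained_cycles(n, mul_stages, add_stages))
    print("chained:  ", chained_cycles(n, mul_stages, add_stages))
```

    In this model a chained multiply followed by an add on 64 elements takes a little over half the time of the unchained pair, because the two pipelines overlap almost completely.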

    Address Component

    There are 8 24-bit address registers, 64 24-bit spill registers, an adder, and a multiplier in this component. Its purpose is to perform index arithmetic and send the results to the scalar and vector components so that they can fetch the appropriate operands.

    Arithmetic is performed on the address registers directly. The spill registers are used to hold address values that do not fit into the address registers. A set of 8 addresses can be transferred between the address registers and their spill registers in a single cycle. Thus, they bear a certain similarity to the register windows of the SPARC (or vice versa). The spill registers can be thought of as an explicitly managed data cache with 8 lines. Their value is that they reduce the traffic to main memory, freeing that resource for vector operations.

    Scalar Component

    Similar to the address component, the scalar component has 8 64-bit registers and 64 64-bit spill registers. It has sole access to four functional units: Integer Add, Logical, Shift, and Population Count. The scalar component also has access to three functional units that are shared with the vector component: Floating Add, Floating Multiply, and Floating Reciprocal Approximation.

    Because the scalar component has its own integer units, it can always execute integer operations in parallel with a vector operation. However, for floating point, the vector unit takes priority.

    Vector Component

    There are 8 64-word vector registers in the vector component. It takes four memory loads to fill a vector register. Normally, this would require 16 instruction cycles. However, careful pipelining in the memory unit reduces the time to just 11 cycles.

    A vector mask register contains a bit-map of the elements in a register operand that will participate in an instruction. A vector length register determines whether fewer than 64 operands are contained in a set of vector operands. Manipulating these values is the primary reason for the population and leading zero counters. Vector loads and stores specify the first location, the length, and the stride.
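
    How the vector length, stride and mask work together can be illustrated with a short sketch. The function names are hypothetical, and the mask convention used here (bit i of the mask controls element i) is an assumption made for the example rather than the documented bit order.

```python
MAX_VL = 64  # a vector register holds at most 64 elements


def vector_load(memory, first, length, stride):
    """Load `length` elements starting at `first`, stepping by `stride`."""
    assert 0 < length <= MAX_VL
    return [memory[first + i * stride] for i in range(length)]


def masked_add(va, vb, mask, old):
    """Element i becomes va[i] + vb[i] where mask bit i is set, else keeps old[i]."""
    return [a + b if (mask >> i) & 1 else o
            for i, (a, b, o) in enumerate(zip(va, vb, old))]


def population_count(mask):
    """Number of elements selected by the mask."""
    return bin(mask).count("1")


def strip_mined_add(a, b):
    """Add two long arrays in chunks no longer than one vector register."""
    n = len(a)
    c = [0] * n
    for start in range(0, n, MAX_VL):
        vl = min(MAX_VL, n - start)        # value held in the vector length register
        va = vector_load(a, start, vl, 1)
        vb = vector_load(b, start, vl, 1)
        c[start:start + vl] = [x + y for x, y in zip(va, vb)]
    return c


if __name__ == "__main__":
    mem = list(range(100))
    print(vector_load(mem, 3, 4, 10))                        # every 10th element from 3
    print(masked_add([1, 2, 3], [10, 10, 10], 0b101, [0, 0, 0]))
    print(population_count(0b101))
    print(len(strip_mined_add(list(range(150)), [1] * 150)))
```

    A loop over 150 elements is processed as chunks of 64, 64 and 22, with the vector length set to 22 for the final chunk.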

    I/O Component

    The I/O component has 24 programmable I/O channel units. I/O has the lowest priority for memory access.

    Cray X-MP

    * Extended the Cray-1 architecture to 4-way multiprocessing.
    * Cycle reduced to 8.5 ns (117 MHz).
    * Increased instruction buffers to 32 words.
    * Added a multiport memory system.
    * Redesigned the vector unit to support arbitrary chaining.
    * Added Gather/Scatter to support sparse arrays.
    * Increased memory to 16 M words, 32-way interleave.
    * Provides a set of shared registers to support fine-grained (loop-level) multiprocessing. There are N+1 sets of these registers for an N-processor system. They include eight address registers, eight scalar registers, and 32 binary semaphores.
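
    The gather/scatter operations mentioned in the list above can be sketched as indexed loads and stores. This is a plain software illustration with hypothetical function names, not the X-MP instruction encoding.

```python
def gather(memory, base, index_vector):
    """Indexed load: collect memory[base + i] for every i in the index vector."""
    return [memory[base + i] for i in index_vector]


def scatter(memory, base, index_vector, values):
    """Indexed store: write values[k] to memory[base + index_vector[k]]."""
    for i, v in zip(index_vector, values):
        memory[base + i] = v


if __name__ == "__main__":
    # Update only the non-zero positions of a sparse array.
    mem = [0.0] * 10
    nonzero_at = [1, 5, 7]
    scatter(mem, 0, nonzero_at, [1.5, 2.5, 3.5])
    packed = gather(mem, 0, nonzero_at)          # dense vector of the non-zeros
    scatter(mem, 0, nonzero_at, [2 * x for x in packed])
    print(mem)
```

    Gathering the non-zero entries into a dense vector lets the vector units work on them at full speed before the results are scattered back.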

    The I/O system was improved and a solid state disk cache was added.

    Cray Y-MP

    * Extends the X-MP architecture to 8 processors.
    * Cycle reduced to 6 ns (166 MHz).
    * Extends memory to 128 M words.

    PVP Generations, XMP, YMP, C90, T90

    Parallel Vector Processors, the core product line of Cray Research.

    The XMP range evolved from the Cray 1 and introduced dual processing to the vector line. Originally limited to 16 MWd of memory, the later "Extended memory architecture" variants grew the address register from 24 to 32 bits, growing the maximum program size to 2 GBytes. The XMP was brought to market whilst the Cray 2 was still in development and was a huge success that proved hard to repeat in later years. The line of big-iron vector supercomputers continued with the YMP, C90 and T90, each generation of which approximately doubled the CPU speed, number of CPUs and memory size, along with improving a host of other details.

    Date   Model   Max. number of CPUs   Approx. peak speed per CPU
    1976   C1      1                     0.160 GFlop
    1982   XMP     4                     0.235 GFlop
    1988   YMP     8                     0.333 GFlop
    1991   C90     16                    1 GFlop
    1995   T90     32                    2 GFlop
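
    The per-CPU peak figures for the first three generations can be cross-checked against the clock periods quoted elsewhere in this paper (12.5 ns, 8.5 ns and 6 ns). The sketch below only does that arithmetic; the dictionary and function names are hypothetical.

```python
# (clock period in ns, quoted peak MFLOPS per CPU), both taken from the text
MACHINES = {
    "Cray-1": (12.5, 160.0),
    "X-MP": (8.5, 235.0),
    "Y-MP": (6.0, 333.0),
}


def clock_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def results_per_clock(period_ns, peak_mflops):
    """Floating-point results per clock implied by the quoted peak speed."""
    return peak_mflops / clock_mhz(period_ns)


if __name__ == "__main__":
    for name, (period, peak) in MACHINES.items():
        print(f"{name}: {clock_mhz(period):.1f} MHz, "
              f"{results_per_clock(period, peak):.2f} results per clock")
```

    All three come out at about two floating-point results per clock. One plausible reading, offered here as an interpretation rather than something the table states, is that the floating add and floating multiply units can each deliver one result per cycle.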

    The YMP range of machines started by utilising new board and cooling technology for just the CPUs, using XMP style boards for the IOS and SSD. Eventually the IOS and SSD also came to use YMP style, internally cooled boards in a range of chassis sizes. This final leg, which balanced the extreme performance of the memory and CPUs, was the model E IO subsystem, which provided high speed sustained and parallel input/output capacity for the system.

    Inside a Vector CPU

    Just what was it that made the Cray CPUs so fast? Putting aside the fact that the logic was implemented in fast bipolar hardware, there were a number of features that, combined with clever compiler technology, made the processors speed through the type of scientific and engineering problems that were the heartland of Cray customers. Described in this section are some of the features that made the difference in both speed and price.

    Registers: lots of them, in a YMP CPU for example:

    * 8 V registers, each 64 words long, each word 64 bits
    * 64 T registers, each 64 bits
    * 8 S registers, each 64 bits
    * 64 B registers, each 32 bits
    * 8 A registers, each 32 bits
    * 4 instruction buffers of 32 64-bit words (128 16-bit parcels)

    YMP functional units were:

    * address: add, multiply
    * scalar: add, shift, logical, pop/parity/leading zero
    * vector: add, shift, logical, pop/parity
    * floating: add, multiply, reciprocal approximation

    Other sundry CPU registers are vector mask, vector length, instruction issue registers, performance monitors, programmable clock, status bits and finally exchange parameters and I/O control registers. The quantity and the types of registers evolved and expanded through the life of the CPU types. The C90 added more functional units to the YMP design and the T90 even more still.

    Memory interface: CPUs are faster than memory, so the speed at which a processor can exchange information with memory limits the effectiveness of the processor. This can strangle the performance of an architecture, so a simple solution to halve the memory delay is to have two independent banks of memory. Taking this further, having enough memory banks to match the ratio of memory speed to CPU speed would remove the memory refresh delay. For example, if a CPU has an 8.5 nanosecond clock cycle time, the memory banks have a refresh time of 68 nanoseconds and there are 16 memory banks, an operation such as

    do i = 1,60000
      c[i] = a[i] + n
    done

    can run at full speed. Even in modern processors the above operation would become memory bound as soon as the processor's cache was exhausted. As well as multiple banks there were multiple ports to memory from each CPU to prevent bus contention. Looking at this from another view, sequential memory locations come from separate memory banks. As the architecture developed the number of banks and ports increased along with vector length.

    Location: 0,1,2,3,4,5,6,7,8,9,A,B,C,...
    Bank:     0,1,2,3,4,5,6,7,0,1,2,3,4,...

    This memory bank architecture also accounted for machines with identical CPUs but different memory sizes having different peak speeds. It also explained why a memory board failure could not be worked around by shrinking memory. In the above example, removing a memory board would remove every 8th memory location, which is impossible to code round. C90 systems had the ability to map spare memory chips to cover failing memory locations. Later T90s did have the ability to down half the memory or some CPUs in the event of a failure.
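
    The effect of interleaved banks on a stream of memory accesses can be explored with a toy model. It assumes one access can be issued per clock, that a bank stays busy for 68 ns / 8.5 ns = 8 clocks after each access, and that the bank number is simply the address modulo the number of banks; the function name and the model itself are illustrative only.

```python
def cycles_for_access(n_accesses, stride, n_banks, bank_busy_cycles):
    """Clocks needed to issue n_accesses strided accesses, one attempt per clock."""
    bank_free_at = [0] * n_banks       # clock at which each bank is next usable
    cycle = 0
    for i in range(n_accesses):
        bank = (i * stride) % n_banks  # consecutive addresses rotate through banks
        start = max(cycle, bank_free_at[bank])   # stall if the bank is still busy
        bank_free_at[bank] = start + bank_busy_cycles
        cycle = start + 1
    return cycle


if __name__ == "__main__":
    busy = int(68 / 8.5)  # 8 clocks per bank access
    print("8 banks, stride 1: ", cycles_for_access(64, 1, 8, busy))
    print("8 banks, stride 8: ", cycles_for_access(64, 8, 8, busy))
    print("16 banks, stride 2:", cycles_for_access(64, 2, 16, busy))
```

    With these numbers eight banks are just enough to sustain one sequential access per clock, while a stride equal to the number of banks sends every access to the same bank and runs roughly eight times slower. Sixteen banks leave headroom for a stride of 2.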

    The Cray-2

    The Cray 2 sits on the Cray time line at a position after the XMP had become well established but before the YMP range was delivered. It is however in a class of its own for a number of reasons. The Cray 2 had a huge memory starting at 512 Mbytes and rising to 4 Gbytes, a size that was not matched by other production systems for a decade. The system had a very small footprint, sitting on a 1.35 m diameter circle and rising to just 1.15 m. This very compact arrangement was made possible by the other major innovation, total immersion cooling. The processor case was filled with a circulating inert fluid that cooled the boards, allowing a much higher packing density than other arrangements.

    Some facts about the Cray 2:

    * One foreground and four background processors.
    * 4.1 ns cycle (244 MHz).
    * Up to 256 M words of memory.
    * 64 or 128 way interleave depending on configuration.
    * Eliminates the spill registers in favor of a 16K word cache.
    * Cache feeds all three computational components with 4-cycle access time.
    * Has 8 16-word instruction buffers.
    * Foreground processor controls the I/O subsystem, which has up to 4 high speed communication channels (4 Gb/s).

    Massively Parallel Processing systems, T3d, T3e

    During the late part of the 1980s a variety of companies were researching and selling a new class of machines that threatened to topple the supercomputing crown held by Cray for so long. These machines derived their compute power from using larger numbers of smaller processors. Pioneered by Thinking Machines, Kendall Square, nCUBE, MasPar and Meiko, these systems had begun to catch the eye of academics and scientists as providing an alternative to the expensive and often over-subscribed Cray machines. At first Cray reacted with scorn, emphasising how hard these machines were to program; indeed many problems just won't sub-divide enough to allow more than a handful of CPUs to work efficiently in parallel. A whole new "message passing" programming method was developed to overcome the communication and co-ordination problems inherent in such loosely bound architecture. Some people likened it to lifting a brick by harnessing a sack full of wasps.

    However the writing was on the wall, and the requirement for MPP machines that could grow in small increments, together with the programming techniques for utilising them, forced a change of heart at Cray Research. In 1991 a three stage programme was announced that would in 5 years turn Cray Research into the leading, and as it turned out only, MPP supercomputing vendor.

    Target              Original plan   Actual delivery
    300 GFlops max      1993            1993 T3d *
    1 TFlop peak        mid-90s         1996 T3e/600
    1 TFlop sustained   1997            1998 T3e/900

    The project was not without some pain along the way - essentially the company had to re-invent all its crown jewels: compiler technology, operating system internals, the IO subsystem, and even get used to someone else's CPUs.

    By joining the MPP market after the first round of machines Cray was in a position to learn from the mistakes of others. It convened a steering committee of computer scientists and MPP users to set the design parameters for the MPP development program.

    The first fruit of the MPP development programme was the T3d. Each core processor, a DEC 21064 Alpha chip, was surrounded by local memory and attached to a high speed CPU interconnection network. The T3d had no I/O capability - instead it was attached to and hosted by a YMP or C90 front-end. Using the Vhisp channels of a Model E IOS system, the T3d dumped its IO load on the host at a phenomenal rate that could swamp a smaller YMP.

    Each IOS is shared between 4 compute nodes. Each compute node interconnects in 3 bi-directional dimensions with its nearest neighbours. At the edges of the 3D cube of nodes each direction loops over to join the other side, thus placing each node on three bi-directional network loops.
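
    The wrap-around connectivity described above is a 3D torus, and its neighbour relation is easy to write down. The sketch below is illustrative: the coordinate scheme, the function names and the 4 x 4 x 4 size are assumptions, not a description of any particular T3d configuration.

```python
def torus_neighbours(node, dims):
    """Six nearest neighbours of `node` in a 3D torus of size dims = (nx, ny, nz)."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),   # X loop, both directions
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),   # Y loop
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),   # Z loop
    ]


def torus_distance(a, b, dims):
    """Minimum hop count between two nodes, taking the shorter way round each loop."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)
    print(torus_neighbours((0, 0, 0), dims))
    print(torus_distance((0, 0, 0), (3, 3, 3), dims))
```

    Because every direction wraps round, a corner node has the same six neighbours as an interior node, and the opposite corner of a 4 x 4 x 4 machine is only three hops away.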

    Cray Superserver systems

    The range of Cray Superserver systems designated APP, SMP, CS6400 started with the acquisition of some of the assets of Floating Point Systems of Beaverton, OR in 1991. These machines ran a modified version of Sun Microsystems' Solaris OS, providing a scalability well beyond that of any available Sun equipment. Using a system of domains, the machines, which could have up to 64 (60 MHz) SuperSPARC (later UltraSPARC) CPUs, could be reconfigured to appear as a group of smaller machines.

    Cray never managed to sell very many of these systems despite their industry leading performance. When Cray was bought by SGI the whole project was sold to Sun Microsystems, who developed the idea into the E10000 or "Starfire" range. A press release that went out early in 1999 announced the 1000th system sold. Unlike the vector and MPP systems, the superservers can be split into separate independent domains to provide resilience and failure isolation capability.

    Benchmark         24 CPU    32 CPU
    SPECRate-int 92   41,967    54,186
    SPECRate-fp 92    55,734    72,177

    The CS6400 was available with 4..64 SuperSPARC CPUs, 256 Mbytes..16 Gbytes of memory, and 1.8 Gbytes/s peak memory bandwidth. It could have over 5 terabytes (Tbytes) of on-line disk storage.

    Cray operating systems

    The first operating system produced by Cray Research for its machines was the Cray Operating System (COS). However, some customers chose to roll their own; one US government lab wrote and used the Cray Time Sharing System (CTSS). COS ran on all systems up to 1985 and continued on many machines for several years after that.

    Unicos 1.0, known initially as CX-OS, was a Unix derivative developed in 1986 for the Cray-2 architecture line. It was decided that it would be cheaper and faster to port Unix to the new processor architecture than COS. Later Unicos was made available on the rest of the Cray line, probably from customer demand. There were also the long term maintenance economics to consider: COS had lots of assembler, and it is easier to maintain, port and extend the C code base that forms the heart of any Unix derivative.

    Unicos shipped as source code + binaries; the release 1.0 licencing note reads ... "If only the binary is licenced, the source will be kept under the control of the site analyst who will build the product from source." Unicos 1.0 shipped with TCP/IP, C, Pascal, Fortran, CAL (assembler), and the SCM and SCCS source control packages.

    According to the Unicos 1.0 software release notice of March 1986, "The Unicos operating system can execute on the Cray-1M, Cray-1S, Cray XMP and Cray-2 computer systems. Cray-1 systems must have an I/O Subsystem (IOS) to use Unicos."

    Over the years, and 10 major releases, Unicos matured and developed into a full mainframe class Unix with full resource (and user) control, multilevel security (Orange Book C2), a comprehensive tape sub-system, high performance math and science libraries and Posix compliance. Along the way ISO networking, DCE, the TotalView debugger, a GUI called XFilingManager and lots of performance measuring tools put in appearances. The file system technology (CFS and NC1FS) remained focused on performance and scalability at a cost of flexibility. Multi-file system disks and multi-disk file systems were standard, but until the Model-E IOS arrived any file system change required a reboot.

    The Cray-3 utilised Unicos, as did the YMP-EL and J90s, but the Cray Superserver systems used a modified version of Solaris.

    The introduction of the MPP range with the T3D saw the start of major work in the operating system area. The T3D was hosted by a Unicos mainframe but it ran a microkernel OS called Unicos MAX. For the T3D all the physical IO system calls were performed by the host PVP system.

    In a computer system where there is a modest number of CPUs, say 2..8, it is possible to have all of the OS services provided by a single CPU time slicing between the kernel and user application work. As the number of CPUs in the system increases, say 8..32, the amount of OS service work increases past the point where it is possible to have just a single service thread in the kernel. Unless the kernel is modified to handle separate OS service requests in parallel, the system will lose efficiency by forcing processes in multiple CPUs to wait until there is a service slot in the kernel CPU. Prior to Unicos 8 as much as 7% of a C90 16 CPU system could be wasted waiting for OS requests. After the introduction of the multithreaded Unicos 8.0, C90 systems were seen to spend as little as 2..3% of their time servicing OS requests. However, while this multi-threading answer allows many CPU threads to be active in the kernel, the kernel still has to exist in a single CPU. In an MPP system where there could be hundreds of CPUs demanding services from the kernel, that kernel has to be able to run across multiple CPUs as well as execute multiple threads. This requires a complete rethink of traditional operating system implementation.

    The solution provided in Unicos/MK was to slice the OS into a number of services, then allow multiple instances of the services to run on separate CPUs. In one 850+ CPU system there were 17 OS CPUs with the disk, file system, process management, resource control, network and tape servers split across them. The exact number of CPUs that are dedicated to each OS task varies with the size, workload and configuration of the system but typically was in the ratio of 1 OS PE per 32 worker CPUs.

    Practical Considerations in Supercomputer Design

    To achieve such high speeds, high-power (i.e. hot) drivers are employed, signals are detected with specialized analog circuits, conductors are all shielded and precisely tuned in both impedance and length, and data is encoded with error-correcting codes so that losses can be recovered.

    In addition, the circuits are usually designed to operate in balanced mode so that there is no change in power drawn as drivers switch. As one driver switches from low to high, another switches from high to low, so that the power supply sees a DC load and there is no coupling of switching noise back into the logic via the power supply. In addition, using balanced signal lines can increase the signal to noise ratio by 6 dB, although these are not often used. In a design such as the Cray-1, roughly 40% of the transistors supposedly do nothing but balance the power loading.

    Even so, these machines dissipate large amounts of heat. The IBM 3090 uses special thermal conduction modules in which a multichip substrate is mounted in a carrier with built-in plumbing for a chilled water jacket. CDC used a similar system in its designs, and on one occasion a maintenance crew pumped live steam through the building air conditioning system, which crossed over to the processor, with predictable results. This raises the issue that these machines usually need thermal shut-down systems, and possibly even fire suppression gear.

    The Cray-1 series uses piped Freon, and each board has a copper sheet to conduct heat to the edges of the cage, where Freon lines draw it away. The first Cray-1 was in fact delayed six months due to problems in the cooling system: lubricant that is normally mixed with the Freon to keep the compressor running would leak through the seals as a mist and eventually coat the boards with oil until they shorted out.

    The Cray-2 is unique in that it uses a liquid bath to cool the processor boards. A special nonconductive liquid (Fluorinert) is pumped through the system and the chips are immersed in this.

    Special fountains aerate the liquid, and reservoirs are provided for storing the liquid when it is pumped out for service. This is somewhat reminiscent of the oil cooling bath that was sometimes used in magnetic core memory units.