San Diego, March 27th 2003
Roberto De Pietri -- chep03

apeNEXT
* The apeNEXT multi-TFlops LGT supercomputer: architecture description and project status report
* The apeNEXT project

Roberto De Pietri ([email protected]), Università di Parma & INFN gruppo collegato di Parma



The APE family
Our line of Home Made Computers

                                APE (1988)    APE100 (1993)   APEmille (1999)   apeNEXT (2003)
Architecture                    SIMD          SIMD            SIMD              SIMD++
# nodes                         16            2048            2048              4096
Topology                        flexible 1D   rigid 3D        flexible 3D       flexible 3D
Memory                          256 MB        8 GB            64 GB             1 TB
# registers (w.size)            64 (x32)      128 (x32)       512 (x32)         512 (x64)
Clock speed                     8 MHz         25 MHz          66 MHz            200 MHz
Total computing power of all …  ~1.5 GFlops   ~250 GFlops     ~2 TFlops         ~8-20 TFlops


APE (‘88) 1 GFlops


The APE paradigm

• Very efficient for LQCD
  – The “normal” operation as a basic operation
  – Native implementation of the complex type: a x b + c (complex numbers)
• Large number of registers
  – Efficient optimizations
• VLIW (very long instruction word)
• Reliable and safe HW solution
• Easy to program: software tools
  – APEse, TAO
  – Machine simulator


Since APE 100

• Our own designed VLSI
  – Pipelined normal operation on a chip (MAD)
• 3D topology
  – Remote I/O and X-link on cable
  – Y- and Z-link on the backplane
• Large number of APEmille installations in Europe
  – 30 crates (~65 GFlops each)
  – Almost 2 TeraFlops of computing power


APEmille installations

Site         Peak       Crates
Bielefeld    130 GF     2
Zeuthen      520 GF     8
Milan        130 GF     2
Bari         65 GF      1
Trento       65 GF      1
Pisa         325 GF     5
Rome 1       520 GF     8
Rome 2       130 GF     2
Orsay        16 GF      1/4
Swansea      65 GF      1

Gr. Total    ~1966 GF
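As a quick cross-check of the grand total, the per-site figures listed above can simply be summed. The sketch below only re-adds the numbers from the slide; the dictionary layout and variable names are illustrative.

```python
# Peak performance (GF) and crate count per APEmille site, copied from the list above.
installations = {
    "Bielefeld": (130, 2),
    "Zeuthen": (520, 8),
    "Milan": (130, 2),
    "Bari": (65, 1),
    "Trento": (65, 1),
    "Pisa": (325, 5),
    "Rome 1": (520, 8),
    "Rome 2": (130, 2),
    "Orsay": (16, 0.25),
    "Swansea": (65, 1),
}

total_gflops = sum(gf for gf, _ in installations.values())
total_crates = sum(crates for _, crates in installations.values())

print(f"total: {total_gflops} GF over {total_crates} crates")
```

The sum agrees with the quoted ~1966 GF and with roughly 30 crates at ~65 GF per crate.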


The apeNEXT architecture

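• 3D mesh of computing nodes
• Each node is a complete, self-sufficient computing engine (1.6 GFlops)

[Figure: processing board with 16 nodes (numbered 0-15) connected in a 3D mesh; Z+ and Y+ links run on the backplane (bp), X+ links on cables. Inset: one node, a J&T chip with DDR-MEM and links X+ … Z- plus the 7th link.]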
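To make the nearest-neighbour structure of the 3D mesh concrete, here is a minimal sketch that lists the six mesh neighbours of a node. The coordinate scheme, the function name `neighbours`, and the periodic wrap-around at the mesh edges are assumptions of this sketch; the slides do not state how boundaries are closed or how node numbers map to coordinates. The 4 x 2 x 2 size is the Processing Board topology quoted in these slides.

```python
def neighbours(node, dims):
    """Return the six nearest neighbours of `node` in a 3D mesh of size `dims`.

    Periodic wrap-around is an assumption made for illustration only.
    """
    x, y, z = node
    lx, ly, lz = dims
    return {
        "X+": ((x + 1) % lx, y, z),
        "X-": ((x - 1) % lx, y, z),
        "Y+": (x, (y + 1) % ly, z),
        "Y-": (x, (y - 1) % ly, z),
        "Z+": (x, y, (z + 1) % lz),
        "Z-": (x, y, (z - 1) % lz),
    }


board = (4, 2, 2)  # Processing Board topology, 16 nodes
links = neighbours((3, 0, 1), board)
for direction, target in links.items():
    print(direction, target)
```

With only two nodes along Y and Z, the "+" and "-" neighbours in those directions coincide under this wrap-around assumption.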


The apeNEXT architecture (2)

• Two directions (Y, Z) on the backplane
• Direction X through front panel cables
• System topologies:

  Processing Board      4 x 2 x 2         ~ 26 GF
  subCrate (16 PB)      4 x 8 x 8         ~ 0.4 TF
  Crate (32 PB)         8 x 8 x 8         ~ 0.8 TF
  Large systems         (8*n) x 8 x 8
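The peak figures for each topology follow from the node count times the 1.6 GFlops per-node peak quoted in these slides. A minimal sketch of that arithmetic (the names are illustrative):

```python
PEAK_PER_NODE_GFLOPS = 1.6  # per-node peak quoted in the slides

topologies = {
    "Processing Board": (4, 2, 2),
    "subCrate (16 PB)": (4, 8, 8),
    "Crate (32 PB)": (8, 8, 8),
}


def node_count(dims):
    x, y, z = dims
    return x * y * z


def peak_gflops(dims):
    return node_count(dims) * PEAK_PER_NODE_GFLOPS


for name, dims in topologies.items():
    print(f"{name}: {node_count(dims)} nodes, {peak_gflops(dims):.1f} GFlops")
```

This gives 16, 256 and 512 nodes, consistent with the rounded ~26 GF, ~0.4 TF and ~0.8 TF values.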


Components (1)

The CHIP

The J&T chip is the core of apeNEXT and everything is built around it !!


Components (2)

J&T Module

• 1 J&T chip
• 9 DRAM chips
  – 256 Mbit memory chips
  – 1024 Mbit memory chips (supported)


Components (3)

Processing Board


Components (4)

Back Plane

• Z+, Z- links
• Y+, Y- links


Components (5)

The Cabinet

• Standard 1U rack mounted PC
• Standard 48 Volt Power Supplies


Host Interface

• I2C: bootstrap & control
• 7th-Link (200 MB/s)


Host I/O Interface

[Figure: block diagram of the PCI host board (PCI form factor): Altera APEX II with PLDA PCI interface, PCI master and PCI target controllers, 7th-link controllers, I2C controller, FIFOs, QDR memory controller and QDR memory bank; external ports: 7th Link port, I2C (x4), PCI (64 bit, 66 MHz)]

• PCI Interface 64 bit, 66 MHz
• PCI Master Mode for 7th Link Intf
• PCI Target Mode for I2C Intf
• QuadDataRate Memory (x32)
• Altera APEX II based
• 7th Link: 1 (2) bidir chan. (200*9 M/s)
• I2C: 4 independent ports
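For orientation, the theoretical peak of a 64 bit, 66 MHz PCI bus can be compared with the 200 MB/s quoted for the 7th link. This is a rough sketch: it uses the nominal clock and ignores bus protocol overhead, so sustained rates will be lower. The variable names are illustrative.

```python
PCI_BUS_WIDTH_BITS = 64
PCI_CLOCK_MHZ = 66
SEVENTH_LINK_MB_S = 200  # 7th-link figure quoted on the Host Interface slide

# 8 bytes transferred per clock at the nominal frequency.
pci_peak_mb_s = PCI_BUS_WIDTH_BITS // 8 * PCI_CLOCK_MHZ

# How many times the 7th-link rate fits into the theoretical PCI peak.
headroom = pci_peak_mb_s / SEVENTH_LINK_MB_S

print(f"PCI peak: {pci_peak_mb_s} MB/s, headroom over one 7th link: {headroom:.2f}x")
```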


PB

• Collaboration with NEURICAM spa
• 16 nodes, 3D-interconnected
• 4x2x2 topology: 26 GFlops, 4.6 GB memory
• Light system:
  – J&T Module connectors
  – Glue logic (clock tree, 10 MHz)
  – Global signal interconnection (FPGA)
  – DC-DC converters (48 V to 3.3/2.5/1.8 V)
• Dominant technologies:
  – LVDS: 1728 (16*6*2*9) differential signals at 200 MB/s, 144 routed via cables, 576 via backplane, on 12 controlled-impedance (100 Ω) layers
  – High-speed differential connectors:
    • Samtec QTS (J&T Module)
    • Erni ERMET-ZD (Backplane)
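The two board-level numbers on this slide can be reproduced with simple arithmetic. In the sketch below, reading the factors 16*6*2*9 as nodes, links per node, directions per link and signals per direction is an assumption (the slide gives only the product), and the memory total assumes all nine 256 Mbit chips of each J&T Module are counted.

```python
NODES_PER_PB = 16
LINKS_PER_NODE = 6
DIRECTIONS = 2              # assumption: one outgoing and one incoming path per link
SIGNALS_PER_DIRECTION = 9   # assumption: meaning of the factor 9 is not stated on the slide

lvds_signals = NODES_PER_PB * LINKS_PER_NODE * DIRECTIONS * SIGNALS_PER_DIRECTION

CHIPS_PER_MODULE = 9
CHIP_MBIT = 256

module_mbyte = CHIPS_PER_MODULE * CHIP_MBIT // 8   # MByte per J&T Module
board_mbyte = NODES_PER_PB * module_mbyte          # MByte per Processing Board

print(f"LVDS differential signals per PB: {lvds_signals}")
print(f"memory per module: {module_mbyte} MB, per PB: {board_mbyte} MB")
```

The board total of 4608 MB corresponds to the quoted 4.6 GB.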


J&T Module

• J&T
• 9 DDR-SDRAM, 256 Mbit (x16) memory chips
• 6 Link LVDS up to 400 MB/s
• Host Fast I/O Link (7th Link)
• I2C Link (slow control network)
• Dual Power 2.5 V + 1.8 V, 7-10 W estimated
• Dominant technologies:
  – SSTL-II (memory interface)
  – LVDS (network interface + I/O)


Overview of the J&T Architecture

• Peak floating point performance of about 1.6 GFlops
  – IEEE compliant double precision
• Integer arithmetic performance of about 400 Mips
• Link bandwidth of about 200 MByte/sec each
  – full duplex
  – 7 links: X+, X-, Y+, Y-, Z+, Z- and the 7th link
• Support for current generation DDR memory
• Memory bandwidth of 3.2 GByte/sec
  – 400 Mword/sec
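The two memory figures are consistent if a word is 64 bits wide, which is an assumption of this sketch (it matches the 64-bit register width listed for apeNEXT in the APE family table). The same numbers give the ratio of memory bandwidth to peak floating point rate.

```python
WORD_BYTES = 8             # assumption: 64-bit words
MEM_WORDS_PER_S = 400e6    # 400 Mword/sec
PEAK_FLOPS = 1.6e9         # 1.6 GFlops peak

mem_bandwidth = MEM_WORDS_PER_S * WORD_BYTES    # bytes per second
bytes_per_flop = mem_bandwidth / PEAK_FLOPS     # memory bytes available per peak flop

print(f"memory bandwidth: {mem_bandwidth / 1e9:.1f} GByte/sec")
print(f"balance: {bytes_per_flop:.1f} bytes per flop")
```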


J&T

• Computing & control integrated
• No glue logic
• Reduced time for project, simulation and test of the prototype


J&T: Top Level Diagram


The J&T Arithmetic BOX

• 4 multipliers
• 4 adder/sub
• Pipelined complex “normal” a*b+c (8 flops) per cycle
• At 200 MHz (fully pipelined) = 1.6 GFlops
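The count of 8 flops per complex “normal” operation can be seen by writing a*b+c in terms of real arithmetic: four real multiplications and four real additions or subtractions, matching the 4 multipliers and 4 adder/sub units. The sketch below only illustrates the arithmetic in software; it does not model the hardware pipeline, and the function name is illustrative.

```python
def complex_normal(a, b, c):
    """Compute a*b + c on (re, im) pairs with 4 real multiplies and 4 real add/subs."""
    ar, ai = a
    br, bi = b
    cr, ci = c
    # 4 multiplications
    m1 = ar * br
    m2 = ai * bi
    m3 = ar * bi
    m4 = ai * br
    # 4 additions/subtractions
    re = (m1 - m2) + cr
    im = (m3 + m4) + ci
    return re, im


FLOPS_PER_NORMAL = 8
CLOCK_HZ = 200e6
peak_flops = FLOPS_PER_NORMAL * CLOCK_HZ  # one normal operation per cycle

result = complex_normal((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
print(result, f"{peak_flops / 1e9:.1f} GFlops")
```

One such operation per 200 MHz clock cycle gives the 1.6 GFlops peak.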


The J&T remote IO

fifo-based communication:

LVDS

1.6 Gb/s per link (8 bit @ 200MHz)

6 (+1) independent links
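The per-link rate above converts directly to the 200 MB/s figure used elsewhere in these slides. The aggregate value in this sketch assumes all six mesh links transmit at the same time in one direction, which is an illustrative upper bound and not a figure from the slides.

```python
LINK_WIDTH_BITS = 8
LINK_CLOCK_HZ = 200e6
N_MESH_LINKS = 6  # the 7th (host) link is left out of the aggregate

bits_per_s = LINK_WIDTH_BITS * LINK_CLOCK_HZ   # per link, per direction
bytes_per_s = bits_per_s / 8

# Assumption: all six mesh links busy simultaneously, one direction only.
aggregate_bytes_per_s = N_MESH_LINKS * bytes_per_s

print(f"{bits_per_s / 1e9:.1f} Gb/s = {bytes_per_s / 1e6:.0f} MB/s per link")
print(f"aggregate over {N_MESH_LINKS} links: {aggregate_bytes_per_s / 1e9:.1f} GB/s")
```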


J&T summary

• CMOS 0.18 µm, 7 metal (ATMEL)
• 200 MHz
• Double Precision Complex Normal Operation
• 64 bit AGU
• 8 KW program cache
• 128 bit local memory channel
• 6+1 LVDS 200 MB/s links
• BGA package, 600 pins


Key steps of the J&T design

✔ January 2001: VHDL design starts
✔ May 2001: Contract with Atmel established
✔ November 2001: First placement experiment started
✔ February 2002: Major rework on the network protocol (to increase robustness against transmission errors)
✔ April 2002: Network OK, re-start placement exercises
✔ June 2002: Good placement available
✔ June 2002 (end): Satisfactory routing available
✔ July 2002 (beginning): Power routing not OK and 5% of “random logic” removed
✔ July 2002 (end): Both problems solved

… Continues on next slides …


Key steps of the J&T design (2)

✔ September 2002: New placement available (with new power layout)
✔ September 2002: Excessive congestion … OR
✔ October 2002: Very bad timing closure
✔ November 2002: Satisfactory placement OK
✔ Dec. 9th 2002: Successful routing completed
✔ January 2003: Timing analysis reasonably satisfactory
✔ January 2003: Simulations with back annotation OK
✔ January 2003: Analysis of critical path (dangerous and not)
✔ February 2003: Hammering down remaining timing problems
✔ February 2003: Careful analysis of all risky corners
✔ February 2003: Transfer of simulation data to Atmel
✔ End of March: Final sign off (Laura … is working on it …)


Timing

• J&T ready June 03
  – We will receive between 300 and 600 chips
  – We need 256 processors to assemble a crate!
• We expect them to work!
  – The same team designed 7 ASICs of similar complexity
  – Impressive fully detailed simulations of multiple J&T systems
  – The more one simulates, the less one has to test!
• Everything else ready and tested
  – Within days/weeks the first working apeNEXT computer will operate
• September ’03: mass production will start (hopefully) at Neuricam
• INFN has already funded 8 TFlops of computing power!


Mechanics

[Figure: top view (local) of the apeNEXT PB showing the J&T Modules, DC/DC converters, board-to-board connector, frame, and air-flow channels 1, 2 and 3 (labels a1-a3, b1-b3)]

PB constraints:

• Power consumption: up to 340 W
• PB-BP insertion force: 80-150 kg (!)
• Fully populated PB weight: 4-5 kg

Custom design of card frame and insertion tool

Detailed study of airflow


PB Prototype

• T, V, I monitored
• Interfaced to I2C control network


PB (preliminary) Test

• Next Test-Bed: metal frame with power supply
• I2C Test, i.e. test of “slow-control” I/O intf.
  – minimal set of components assembled
  – simple/short test (1 week)
  – done successfully (Dec 01)
• Clock distribution test
• PB LVDS characterization


PB Status

Activity                                      Status   Who                 Cost       Note
PB development                                Done     Neuricam            67 KEuro
  (inc. feasibility study and LVDS EVB)
PB ver.1 prototypes (3)                       Done     Neuricam, DDI       10 KEuro
J&T Module develop.                           Done     Neuricam            23 KEuro
PB ver.2 prototypes (3)                       Done     Neuricam, SOMACIS   10 KEuro


NEXT BackPlane

• 16 PB Slots + Root Slot
• Size 447x600 mm²
• 4600 LVDS differential signals, point-to-point up to 600 Mb/s
• 16 controlled-imp. layers (32 Tot)
• Press-fit only
• Erni/Tyco connectors
  – ERMET-ZD
• Providers:
  – APW (primary)
  – ERNI (2nd source)
• Connector kit cost: 7 KEuro (!)
• PB insertion force: 80-150 kg (!)

Activity             Status   Who          Cost       Note
BP development       Done     APW (ERNI)   32 KEuro
BP prototypes (3)    Done     APW          41 KEuro


Host I/O Interface

• PCI Interface 64 bit, 66 MHz
• PCI Master Mode for 7th Link Intf
• PCI Target Mode for I2C Intf
• QuadDataRate Memory (x32)
• Altera APEX II based
• 7th Link: 1 (2) bidir chan. (200*9 M/s)
• I2C: 4 independent ports

[Figure: block diagram of the PCI host board (PCI form factor): Altera APEX II with PLDA PCI interface, PCI master and PCI target controllers, 7th-link controllers, I2C controller, FIFOs, QDR memory controller and QDR memory bank; external ports: 7th Link port, I2C (x4), PCI (64 bit, 66 MHz)]

Activity                     Status   Who        Cost      Note
Altera design                Done     INFN
PCB design and prototypes    Done     NEURICAM   3 KEuro


Cabinets

• Problem:
  – PB weight: 4-5 kg, PB consumption: 340 W (est.)
  – 32 PB + 2 Root Boards (2 independent subcrates)
  – Power supply: (<48 V x 150 A per subcrate)
  – Integrated host PCs
  – Forced air cooling
  – Robust, expandable/modular, CE, EMC ...
• Solution:
  – 42U rack (h: 2.10 m):
    • EMC proof
    • efficient cable routing
    • 19”-1U slots for 9 “host PCs” (rack mounted)
    • Hot-swap power supply cabinet (modular)
  – Custom design of “card cage” and “tie bar”
  – Custom design of cooling system

Activity                                         Status            Who              Cost         Note
Design of rack (inc. selection of power supply)  Done (Apr ’02)    APW (NEURICAM)   50 KEuro
Full rack prototype                              Done (Sept ’02)   APW              8-10 KEuro


Software

• TAO compilers and linker … READY
  – All existing APE programs will run with no change
  – Physics code has already been run on the simulator
  – Kernels of PHYSICS codes used to benchmark the efficiency of the FP unit
• C COMPILER
  – gcc (2.93) and lcc have been retargeted
  – lcc WORKS (almost). Factor 5 on performance
  – http://www.cs.princeton.edu/software/lcc/


Project Costs

• Total development cost of 1700 k€uro
  – 1050 k€uro for VLSI development
  – 550 k€uro non VLSI
• Manpower involved = 20 man/year
• Mass production cost ~0.5 €uro/MFlops
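As a rough illustration of what the quoted mass production cost implies, the sketch below scales ~0.5 €uro/MFlops to a machine of a given peak size. Applying the figure to peak performance, and using the 8 TFlops installation mentioned in these slides as the example size, are assumptions of this sketch; the function name is illustrative.

```python
COST_EURO_PER_MFLOPS = 0.5  # approximate figure quoted above


def production_cost_euro(peak_tflops):
    """Estimated production cost in euro for a machine of the given peak TFlops."""
    mflops = peak_tflops * 1e6
    return mflops * COST_EURO_PER_MFLOPS


cost_8_tflops = production_cost_euro(8)
print(f"8 TFlops at ~0.5 euro/MFlops: about {cost_8_tflops / 1e6:.0f} million euro")
```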


Conclusions

• J&T ready June 03 (300 … 600 chips)
• Everything else ready and tested!!!
• If tests OK, mass production starting September ’03 at Neuricam
• All components over-dimensioned
  – Cooling, LVDS tested @ 400 Mb/s, power supply on boards …
  – Makes possible a technology step with no extra design and test effort


Conclusions (2)

• Installation plans
  – INFN: 8 TFlops (10 cabinets) already approved (on delivery of a working machine)
  – DESY: considering between 8 TFlops and 16 TFlops
  – Paris ……….
• Inversion of Dirac Operator (APEmille program): 54% efficiency on the VHDL hardware simulator
  – Communications, memory refresh, synchronization wait … all included
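To put the 54% figure in context, the sketch below converts it into a sustained rate per node and for a 10-cabinet installation. It takes 32 Processing Boards of 16 nodes per cabinet and 1.6 GFlops peak per node from these slides; assuming that the efficiency measured for one benchmark on the simulator carries over to the full machine is purely illustrative.

```python
PEAK_PER_NODE_GFLOPS = 1.6
EFFICIENCY = 0.54                 # Dirac operator inversion on the VHDL simulator
NODES_PER_CABINET = 32 * 16       # 32 PBs of 16 nodes each
CABINETS = 10

sustained_per_node = PEAK_PER_NODE_GFLOPS * EFFICIENCY
peak_total_tflops = CABINETS * NODES_PER_CABINET * PEAK_PER_NODE_GFLOPS / 1000
# Assumption: the same efficiency holds across the whole installation.
sustained_total_tflops = peak_total_tflops * EFFICIENCY

print(f"per node: {sustained_per_node:.3f} GFlops sustained")
print(f"10 cabinets: {peak_total_tflops:.2f} TFlops peak, "
      f"{sustained_total_tflops:.2f} TFlops sustained")
```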


apeNEXT vs. cluster

[Figure labels: 72.5 GFlops; 409.6 GFlops; 819.2 GFlops; 1.6*16*16*2 GFlops]


ASICs of similar complexity

ASIC       Machine     Function
ADD322     APE100      3-input integer adder (prototype for APE100, integrated into ZCPU)
MAD        APE100      Floating point engine
ZCPU       APE100      Sequencer + Integer ALU + AGU
Commuter   APE100      Communication device
T1000      APEmille    Integer ALU + AGU + Program controller
J1000      APEmille    Floating point engine
COMM1000   APEmille    Communication device