Leveraging HyperTransport for a custom high-performance cluster network
Mondrian Nüssle
HTCE Symposium 2009
11.02.2009
Outline
Background & Motivation
Architecture
Hardware Implementation
Software Stack
Results
Conclusion
EXTOLL: Background & Motivation
High-performance computing is synonymous with parallel computing
Interconnection networks between processors are a key component in parallel systems
Patterson stated: “Latency lags Bandwidth”
The EXTOLL project at the CAG aims to significantly lower communication latency and improve communication in parallel systems
Goals
Enable communication with extremely low latency → close to main memory access
Enable communication/computation overlap
Design a balanced system, both in terms of CPU on-loading and off-loading and in terms of system complexity
Adding bandwidth is much easier ☺
Key design facts
Leverage HT as host interface for lowest latency of data transport between CPU and device
Leverage modified HT as on-chip communication protocol
Implement a lean network interface controller: minimize state information on the NIC; provide user-level, virtualized access (avoid the kernel); minimize the number of CPU ↔ device and memory ↔ device transactions
Network layer that provides a reliable, in-order, low-latency transport service
Block diagram
[Figure: block diagram. Host Interface: HyperTransport IP core, HTAX crossbar. NIC: ATU, VELO, RMA, control & status register file. Network: EXTOLL crossbar, three network ports, six link ports.]
Host Interface block
NIC block: several communication functions
Network block: 6 links, 9x9 crossbar
Flexible architecture: configurable data path
Communication functions: VELO
Virtualized Engine for Low Overhead
Enables ultra-low-latency send/receive communication
Supports messages of up to 64 bytes (one cache line) directly
A single PIO transaction triggers sending of a message
Message completion at the receiver is usually performed with a single DMA transaction
Minimized traffic between host and device!
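The send path can be modeled in a few lines. This is a minimal sketch and not the EXTOLL API: the names `CACHE_LINE` and `build_velo_message` and the zero padding are assumptions made for illustration. Only the 64-byte limit and the idea of one transaction per message come from the slide.

```python
CACHE_LINE = 64  # bytes; the slide gives one cache line as the VELO message limit


def build_velo_message(payload: bytes) -> bytes:
    """Return the one cache-line buffer that a single PIO write would carry.

    Hypothetical helper: the zero padding is an assumption, not the real format.
    """
    if len(payload) > CACHE_LINE:
        raise ValueError("payload exceeds 64 bytes and does not fit one message")
    return payload.ljust(CACHE_LINE, b"\x00")


if __name__ == "__main__":
    line = build_velo_message(b"hello")
    print(len(line))
```

Anything larger than one cache line is rejected here, which mirrors the slide's statement that only messages of up to 64 bytes are supported directly.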
Communication functions: RMA
EXTOLL Remote Memory Architecture
Enables access to remote memory using put, get and atomic transactions
Transaction triggered by a single 128-bit SSE2 store → minimizing start-up latency
Flexible notifications: at the requester, the completer, the responder, or any combination thereof
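To make the "single 128-bit store" and "any combination" points concrete, here is a sketch of a request descriptor that fits in exactly 16 bytes. The field layout, the opcode values and the names `Notify`, `DESCRIPTOR` and `make_descriptor` are all hypothetical and are not the EXTOLL format. The sketch only shows that a complete request, including combinable notification flags, can fit into one 128-bit write.

```python
import struct
from enum import IntFlag


class Notify(IntFlag):
    """Where a notification is written; the flags can be combined freely."""
    NONE = 0
    REQUESTER = 1
    COMPLETER = 2
    RESPONDER = 4


# Hypothetical layout: 64-bit remote address, 32-bit length, 16-bit node id,
# 8-bit opcode, 8-bit notification flags = 128 bits in total.
DESCRIPTOR = struct.Struct("<QIHBB")
OP_PUT, OP_GET = 0, 1  # hypothetical opcode values


def make_descriptor(op, node, remote_addr, length, notify):
    """Pack one request into the 16 bytes a single 128-bit store would write."""
    return DESCRIPTOR.pack(remote_addr, length, node, op, int(notify))


if __name__ == "__main__":
    desc = make_descriptor(OP_PUT, 3, 0x1000, 256,
                           Notify.REQUESTER | Notify.COMPLETER)
    print(len(desc), DESCRIPTOR.unpack(desc))
```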
Supporting modules
Address Translation Unit (ATU)
Provides address translation services for RMA
Registration/unregistration latency in prototype systems starts at ~2 µs
Translation using on-chip TLB and main-memory tables
Control and Status Register file
Automatically generated from high-level spec (including kernel code)
Local and remote access possible (network management software)
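The two-level translation (on-chip TLB first, main-memory table on a miss) can be illustrated with a toy model. This is a sketch under assumptions: the 4 KiB page size, the LRU replacement policy, the TLB size and the class name `TranslationUnit` are not taken from the slides.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


class TranslationUnit:
    """Toy model: a small LRU TLB in front of a table held in 'main memory'."""

    def __init__(self, memory_table, tlb_entries=4):
        self.memory_table = memory_table  # virtual page -> physical page
        self.tlb = OrderedDict()          # cache of recent translations
        self.tlb_entries = tlb_entries
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.tlb:
            self.tlb.move_to_end(vpage)   # mark as most recently used
        else:
            self.misses += 1
            # A KeyError here means the page was never registered.
            self.tlb[vpage] = self.memory_table[vpage]
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)  # evict the least recently used entry
        return self.tlb[vpage] * PAGE_SIZE + offset


if __name__ == "__main__":
    atu = TranslationUnit({0: 7, 1: 9})
    print(atu.translate(10), atu.translate(20), atu.misses)
```

Only the first access to a page pays for the table lookup; later accesses to the same page are served from the cache.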
HT Interface
HT-Core: interface to host
All functional units need to communicate with the host
Avoid protocol conversion for the on-chip network
→ HTAX crossbar running on-chip protocol: simplified, more source tags, fixed format
Network layer
Fully parametrizable width of data paths and number of ports
In-order delivery of packets
Virtual channels
Hardware retransmission
Cut-through switching
Credit-based flow control
Current implementations: 6 ports used to connect to external links, 16+2 bit data path width
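Credit-based flow control, one of the features listed above, follows a simple rule: the sender may transmit only while it holds a credit for a free receiver buffer. The sketch below shows the generic technique, not the EXTOLL hardware. The class name `CreditLink` and the buffer count are assumptions.

```python
from collections import deque


class CreditLink:
    """Sender may transmit only while it holds credits for free receiver buffers."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers  # one credit per free buffer slot
        self.rx_queue = deque()

    def send(self, flit):
        if self.credits == 0:
            return False                 # back-pressure: the sender must wait
        self.credits -= 1
        self.rx_queue.append(flit)
        return True

    def receiver_consume(self):
        flit = self.rx_queue.popleft()   # the receiver frees a buffer slot
        self.credits += 1                # and returns one credit to the sender
        return flit


if __name__ == "__main__":
    link = CreditLink(receiver_buffers=2)
    print(link.send("a"), link.send("b"), link.send("c"))
    print(link.receiver_consume(), link.send("c"))
```

Because the sender never transmits without a credit, the receiver buffer cannot overflow, and the queue keeps packets in order.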
Implementation I
EXTOLL prototype is implemented on the HTX-Board
Virtex 4 FX100 FPGA, speed grade -11 or -12
6 SFP optical transceivers
Currently: 16 bit width, 180 MHz core frequency
3.6 Gb/s links
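The width, frequency and link-rate figures can be cross-checked with a short calculation. The 16-bit width, 180 MHz and 3.6 Gb/s come from the slide. The 8b/10b line coding is an assumption that is not stated in the source, although it makes the numbers agree.

```python
DATA_WIDTH_BITS = 16        # from the slide
CORE_FREQ_HZ = 180_000_000  # from the slide

# Raw data rate of the 16-bit data path.
payload_bits_per_s = DATA_WIDTH_BITS * CORE_FREQ_HZ
payload_mbytes_per_s = payload_bits_per_s // 8 // 1_000_000

# Assumption: 8b/10b line coding, i.e. 10 line bits per 8 data bits.
line_bits_per_s = payload_bits_per_s * 10 // 8

if __name__ == "__main__":
    print(payload_bits_per_s, payload_mbytes_per_s, line_bits_per_s)
```

Under that assumption the data path delivers 2.88 Gb/s (360 MB/s) of data over a 3.6 Gb/s serial link.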
Implementation II
More than 90% of all slices of the FPGA are in use for the design
HT-Core runs at 200 MHz internal frequency and HT400
EXTOLL modules run at 180 MHz on a speed-grade -12 device
Software Stack
[Figure: software stack. User space: user application; middleware, e.g. MPI, GASNet (library); libVELO and libRMA; application management. Kernel space: EXTOLL base driver, velodrv, rmadrv, atudrv, extoll_rf; PCI config space. EXTOLL hardware (NIC): VELO, RMA, ATU, register file.]
OS bypass
Layered approach
PGAS support through GASNet
MPI support through OpenMPI
Linux kernel driver
Results: Latency
[Figure: latency in µs (0 to 8) versus message size in bytes (10 to 1000, logarithmic scale) for EXTOLL VELO, EXTOLL RMA Put and EXTOLL RMA Get.]
Start-up latency ~ 1 µs
RMA Put transaction beats VELO at 256 bytes
Get latency is full roundtrip
Results: Bandwidth
[Figure: bandwidth in MB/s (50 to 350) versus message size in bytes (10 to 1000, logarithmic scale) for EXTOLL VELO, EXTOLL Put and EXTOLL Get, with peak and half-peak payload bandwidth marked.]
More than half of the peak bandwidth (n½) is already reached at 32 bytes!
Maximum bandwidth reached at 4k
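The n½ metric is the smallest message size at which half of the peak bandwidth is reached. The sketch below shows how it could be read off a measured curve. The function name `half_peak_size` is hypothetical, and the sample values are made up for illustration. They are not taken from the plot.

```python
def half_peak_size(samples):
    """Return the smallest message size whose bandwidth reaches half of the peak.

    samples: list of (size_in_bytes, bandwidth) pairs in any order.
    """
    peak = max(bw for _, bw in samples)
    for size, bw in sorted(samples):
        if bw >= peak / 2:
            return size


# Made-up values in arbitrary units, NOT the measurements from the slide.
demo = [(8, 1.0), (16, 2.0), (32, 3.0), (64, 4.0)]

if __name__ == "__main__":
    print(half_peak_size(demo))
```

A small n½ means that short messages already use the link efficiently, which is the point the slide makes.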
Technology Scaling
[Figure: bar chart on an axis from 0 to 1200 comparing: HT1000 ASIC, 800 MHz (est.); HT800 ASIC, 500 MHz internal (est.); optimized FPGA, HT400, 200 MHz; FPGA, HT400, 180 MHz; reference: Mellanox ConnectX DDR IB.]
Already beats best IB silicon
ASIC would show 3 times lower latency!
Conclusion
EXTOLL is an architecture for ultra-low-latency communication in parallel systems
Prototype hardware is up and running
Basic software environment is up and running
Performance numbers are excellent: ~ 1 µs start-up latency on the FPGA prototype
Bandwidth is limited by serializers and board, but can be improved with a new platform
Next Steps
More software is being added
Most interesting: GASNet
Evaluation on 1024-core Valencia Cluster
On the hardware side, the next step is a new revision with a more powerful base technology
Evaluation of next platform for HW
Thanks!
Questions?