Reconfigurable Computing: HPC Network Aspects
Mitch Sukalski (8961)
David Thompson (8963)
Craig Ulmer (8963) [email protected]
Pete Dean R&D Seminar, December 11, 2003
FPGAs are promising…
But what’s the catch?
Three main challenges need to be addressed before FPGAs can be applied to practical scientific computing.
RC Challenge #1: Floating Point
• Most FPGAs fine grained
• Floating point units are large
  – 32b FP occupies ~1,000 CLBs
  – Commercial capacity improving
• 2000: 6,000 CLBs
• 2003: 40,000 CLBs (Max: 220,000)
• Keith Underwood at Sandia/NM
  – LDRD: Working on high-speed 64b floating-point cores
32b FP in Xilinx V2P7
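As a rough illustration of the capacity figures above, the sketch below divides each quoted CLB count by the ~1,000 CLBs cited for one 32b floating-point unit. It assumes the entire device could be devoted to FP units and ignores routing, control logic, and I/O, so real designs would fit fewer. The function name is illustrative only.

```python
# Approximate cost of one 32b floating-point unit, as quoted on the slide.
CLBS_PER_FP32_UNIT = 1_000


def max_fp32_units(total_clbs: int, clbs_per_unit: int = CLBS_PER_FP32_UNIT) -> int:
    """Upper bound on FP units if every CLB were spent on floating point."""
    return total_clbs // clbs_per_unit


# CLB counts taken from the slide; the labels are descriptive only.
for label, clbs in [("2000", 6_000), ("2003", 40_000), ("2003 (max listed)", 220_000)]:
    print(f"{label}: at most {max_fp32_units(clbs)} 32b FP units")
```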
RC Challenge #2: Design Tools
• Hardware design is non-trivial
  – Micromanage computations, clock-by-clock
  – Not appropriate for most scientists
  – Need languages, APIs that are easy to use
• Maya Gokhale at LANL
  – Streams-C: C-like language for HW design
  – Pipeline/unroll loops
  – Schedules access to external memory
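To see why pipelining loops matters in hardware, consider a simple cycle-count model. This is a generic textbook model, not Streams-C output: it assumes a loop body that takes `depth` clock cycles, and that a pipelined circuit can start a new iteration every clock. The function names are illustrative.

```python
def unpipelined_cycles(iterations: int, depth: int) -> int:
    """Each iteration must finish all `depth` stages before the next starts."""
    return iterations * depth


def pipelined_cycles(iterations: int, depth: int) -> int:
    """Fill the pipeline once, then retire one iteration per clock."""
    if iterations == 0:
        return 0
    return depth + iterations - 1


n, d = 1000, 8  # assumed example values
print(unpipelined_cycles(n, d))  # 8000
print(pipelined_cycles(n, d))    # 1007
```

Under these assumptions the pipelined loop approaches one result per clock, which is the kind of transformation a scientist would otherwise have to manage by hand, clock by clock.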
RC Challenge #3: High-speed I/O
• FPGAs have large internal computational power
  – How do we get data into/out of FPGA?
  – How do we connect to our existing HPC machines?
• Mitch Sukalski, David Thompson, Craig Ulmer
  – LDRD: Connect FPGAs to high-performance SANs
Outline
• Where we have been
  – Networking FPGAs using external NI cards
• Where we are going
  – Networking FPGAs using internal transceivers
• Project status
  – Early details
Previous Work
Where we’ve been…
Networking Earlier FPGAs
• Previous generations of FPGAs were like blank ASICs
  – Configurable logic and pins
• Attach a network card to an FPGA card
  – Communication over PCI
• Examples:
  – Virginia Tech: Myrinet
  – Washington U. in St. Louis: ATM (inline)
  – Clemson University: Gigabit Ethernet
  – Georgia Tech: Myrinet
[Figure: CPU, FPGA card, and NIC attached to the PCI bus]
GRIM Project at Georgia Tech
• Add multimedia devices to cluster
  – Message layer connects CPUs, memory, and peripherals
  – Myrinet between hosts, PCI within hosts
• Celoxica RC-1000 FPGA
  – Virtex FPGA (1M logic gates)
  – Four SRAM banks
  – PCI w/ PMC
[Figure: RC-1000 card with SRAM banks 0 to 3, the FPGA, PCI, and control & switching logic; cluster diagram with CPUs, FPGAs, RAID, and Ethernet]
GRIM
FPGA Organization
[Figure: FPGA organization. Labels: Frame; Incoming Message Queues; Outgoing Message Queues; Communication Library API; Application Data; Memory API; FPGA Card Memory; FPGA Circuit Canvas; User Circuit API; User Circuit 1 through User Circuit n]
Lessons Learned
• Frame provides simple OS
  – Isolates users from board
  – Portability
• Dynamically manage resources
  – Card memory
  – Computational circuits
• PCI bottleneck
  – Distance between NI and FPGA
  – PCI difficult to work with
[Figure: dynamic resource management example. Labels: Host CPU; NIC; FPGA with Circuit X and Circuit Y; Circuits E, F, G; SRAM 1 (Page A); SRAM 2 (Page B); Page C; message "Use Circuit F on $C0000000"; Function Fault; Page Fault]
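The dynamic resource management lesson can be illustrated with a toy model: a message names a circuit and a data page, and the frame loads whichever of the two is not currently resident. This is a hypothetical sketch, not the GRIM implementation. The `Frame` class, its method names, and the least-recently-used eviction policy are all assumptions made for illustration, as is the reading that address $C0000000 lies in Page C.

```python
from collections import OrderedDict


class Frame:
    """Toy model of on-demand circuit and page loading (hypothetical)."""

    def __init__(self, circuit_slots: int, page_slots: int):
        # Resident sets, ordered from least to most recently used.
        self.circuits = OrderedDict()
        self.pages = OrderedDict()
        self.circuit_slots = circuit_slots
        self.page_slots = page_slots
        self.faults = []  # log of (fault kind, item) tuples

    def _touch(self, table, slots, key, fault_kind):
        if key in table:
            table.move_to_end(key)      # already resident: mark as recently used
            return
        self.faults.append((fault_kind, key))
        if len(table) >= slots:
            table.popitem(last=False)   # assumed policy: evict least recently used
        table[key] = True

    def handle(self, circuit: str, page: str) -> str:
        """Process a 'use circuit on page' message, faulting in what is missing."""
        self._touch(self.circuits, self.circuit_slots, circuit, "function fault")
        self._touch(self.pages, self.page_slots, page, "page fault")
        return f"ran circuit {circuit} on page {page}"


frame = Frame(circuit_slots=2, page_slots=2)
frame.handle("X", "A")
frame.handle("Y", "B")
frame.faults.clear()        # ignore the initial cold-start faults
frame.handle("F", "C")      # models the message "Use Circuit F on $C0000000"
print(frame.faults)         # [('function fault', 'F'), ('page fault', 'C')]
```

In this model the host only names what it needs; the frame decides what to evict, which is the isolation from the board that the first lesson describes.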
Network Features of Recent FPGAs
Where we’re going…
FPGA Network Improvements
• Recent FPGAs have special, built-in cores
  – High-speed transceivers, dedicated processors
• Idea: Build our NI inside the FPGA
  – FPGA becomes a networked compute resource
  – Removes the PCI bottleneck
[Figure: FPGA containing user-defined computational circuits and two NIs (Tx/Rx), attached to a System Area Network alongside CPU/NIC hosts]
Xilinx Virtex-II Pro FPGA
• Up to 4 PowerPC 405 cores
  – Embedded version of PPC
  – 300-400 MHz
• Multiple gigabit transceivers
  – Run at 600 Mbps to 3.125 Gbps
  – Up to twenty-four transceivers
• Additional cores
  – Distributed internal memory
  – Arrays of 18b multipliers
  – Digital clock multipliers, PLLs
Xilinx V2P20
Multi-Gigabit Transceivers: Rocket I/O
• Flexible, high-speed transceivers
  – Can be configured to connect with different physical layers
  – InfiniBand, GigE, FC, 10GigE, Aurora
  – Note: low-level interface (commas, disparity, clock mismatches)
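[Figure: Rocket I/O block diagram between the FPGA fabric and the differential pin pairs. Transmit side: Tx FIFO, 8B/10B encoder, CRC, serializer. Receive side: deserializer, clock recovery, 8B/10B decoder, Rx elastic buffer, CRC check.]
[Figure: FPGA fabric with several Rocket I/O blocks, each connected to its own pins]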
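Because 8B/10B coding sends each 8-bit byte as a 10-bit symbol, only 8/10 of a transceiver's line rate carries data. The back-of-the-envelope sketch below applies that ratio to the 600 Mbps and 3.125 Gbps line rates quoted for the Virtex-II Pro transceivers. It ignores framing, CRC, and protocol overhead, and the function name is illustrative.

```python
def payload_gbps(line_rate_gbps: float) -> float:
    """Data rate left after 8B/10B coding (8 data bits per 10 line bits)."""
    return line_rate_gbps * 8 / 10


for rate in (0.6, 3.125):
    print(f"{rate} Gbps line rate -> {payload_gbps(rate):.3f} Gbps of data")
```

At the top rate this leaves 2.5 Gbps of data per transceiver, before any higher-level protocol overhead.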
Why MGTs are Important
• Direct connection to networks
  – Same chip, different network
  – Remove PCI from equation
• Fast connections between FPGAs
  – Reduces analog design issues
  – Chain FPGAs together
  – Reduce pin count
• Update: Virtex-II Pro X
  – Now 2.488 Gbps to 10.3125 Gbps
  – Chips have either 8 or 20 transceivers
3.125 Gbps over 44” FR4 *
* From Xilinx, http://www.xilinx.com/products/virtex2pro/mgtcharacter.htm
Hard PowerPC Core
• PowerPC 405
  – 16KB Instruction / 16KB Data caches
  – Real and Virtual memory modes
  – GCC is available
• Multiple memory ports for core
  – On-chip memory (OCM)
  – Processor Local Bus (PLB)
• User-defined memory map
  – Connect memory blocks or cores
  – External memory cores available
[Figure: PowerPC core with I-Cache and D-Cache, connected to the Processor Local Bus (PLB) and the On-Chip Memory (OCM) interface]
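A user-defined memory map amounts to giving each memory block or core an address range and decoding processor addresses against those ranges. The sketch below illustrates the idea only: the region names, base addresses, and sizes are invented for the example and are not taken from any Xilinx reference design.

```python
from typing import NamedTuple, Optional, Tuple


class Region(NamedTuple):
    name: str
    base: int
    size: int  # bytes


# Hypothetical memory map; every entry here is an assumption.
MEMORY_MAP = [
    Region("ocm_block_ram", 0x0000_0000, 16 * 1024),
    Region("external_memory", 0x1000_0000, 64 * 1024 * 1024),
    Region("user_core_regs", 0x8000_0000, 4 * 1024),
]


def decode(addr: int) -> Optional[Tuple[str, int]]:
    """Return (region name, offset within region), or None if unmapped."""
    for region in MEMORY_MAP:
        if region.base <= addr < region.base + region.size:
            return region.name, addr - region.base
    return None


print(decode(0x1000_0010))  # ('external_memory', 16)
print(decode(0x9000_0000))  # None
```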
System on a Chip (SoC)
• Commercial SoC
  – Designing with cores
  – Customize system
• New tools
  – Rapidly connect cores
  – Library of cores & buses
  – Saves on wiring legwork
Xilinx Platform Studio
Current Status
• Exploring V2P
  – New architecture, new tools
• Two reference boards
  – ML300 (V2P7-6)
  – Avnet (V2P20-6)
• Transceiver work
  – Raw transmission over fiber
  – Working towards IB
http://cdulmer.ran.sandia.gov