Transcript of "The World's First Hybrid-Core Computer"

THE WORLD'S FIRST HYBRID-CORE COMPUTER.
DESIGN PHILOSOPHIES FOR MEMORY-CENTRIC INSTRUCTION SET ARCHITECTURES
John Leidel, Software Architect. SAAHPC'10
Agenda
• Introduction to Convey Computer
• Architecture Overview
• Observations of Memory Technology
• Memory-Centric Instruction Sets
• Example Design
• Wrap-up
INTRODUCTION
Introduction to Convey Computer
• Developer of the HC-1 hybrid-core computer system
• Leverages the Intel x86 ecosystem
• FPGA-based coprocessor for performance & efficiency
• Experienced team
We have hit a “power wall”
[Figure: CPU clock speed and power trends from the 386 through the Pentium and Pentium 4 to Core/Core 2, illustrating the power wall. Graphic courtesy of Herb Sutter, http://www.gotw.ca/publications/concurrency-ddj.htm]
Observations:
• Heterogeneous computing is inevitable
  – More performance using less power (more efficient use of transistors)
  – Application-specific logic is the most efficient
• Successful performance enhancements are tightly integrated with the processor
  – integrated vector processors vs. array processors
  – common address space & data types
• Single compiler & programming environment
  – industry standard source (no new languages or dialects)
  – leverage of existing applications and algorithms
• Systems that are simpler to program win
ARCHITECTURE OVERVIEW
[Figure: Hybrid-core computing aims to combine the application performance/power efficiency of application-specific hardware with the programmability and deployment ease of an x86 server. Axes: application performance/power efficiency (low to high) vs. ease of deployment (difficult to easy); the Convey HC-1 targets the high-efficiency, easy-to-deploy corner.]

Heterogeneous solutions
• can be much more efficient
• still hard to program

Multicore solutions
• don't always scale well
• parallel programming is hard
HC-1 Hardware

[Figure: HC-1 block diagram. The Intel chipset connects host memory and PCI I/O, linked to the coprocessor at 8 GB/s; the coprocessor memory subsystem delivers 80 GB/s. Host and coprocessor share cache-coherent, shared virtual memory.]
Hybrid-Core Computing

[Figure: applications from domains such as Oil & Gas, Financial, CAE, Life Sciences, and custom codes flow through the Convey compilers over shared virtual memory, executing on both the x86-64 ISA and a custom ISA implemented as personalities on four FPGAs.]

Application-Specific Personalities
• Extend the x86 instruction set
• Implement key operations in hardware

Cache-coherent, shared memory
• Both ISAs address common memory

*ISA: Instruction Set Architecture
Using Personalities
• Personalities are reloadable instruction sets
• Compiler generates x86 and coprocessor instructions from ANSI standard C/C++ & Fortran
• Executable can run on x86 nodes or Convey Hybrid-Core nodes

[Figure: Convey Software Development Suite. C/C++ and Fortran sources compile into a single hybrid-core executable containing both x86-64 and coprocessor instructions. The user specifies a personality at compile time; the OS loads the personality (instruction descriptions plus FPGA bitfiles) at runtime onto the Convey HC-1's Intel x86 host and coprocessor.]
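Because the toolchain consumes unmodified ANSI C, source that targets a personality is ordinary code. Below is a minimal, hypothetical example of the kind of dense, vectorizable loop a dual-target compiler could emit as either x86-64 or coprocessor instructions; nothing here is Convey-specific syntax, and the function name is illustrative:

```c
/* daxpy.c -- plain ANSI C, no Convey-specific source changes.
 * A hypothetical example of the vectorizable-loop style that a
 * dual-target compiler can map to either ISA. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y)
{
    /* Whether this loop runs on the x86 host or on a personality is
       a toolchain/runtime decision, not a source-code one. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```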
HC-1 Hardware
• 2U enclosure:
  – Top half of the 2U platform contains the coprocessor
  – Bottom half contains the Intel motherboard

[Figure: HC-1 chassis layout: coprocessor assembly above the host x86 server assembly, with FSB mezzanine card, 3 x 3.5" disk drives, x16 PCI-E slot, and host memory DIMMs.]
HC-1 Architecture
[Figure: HC-1 architecture, pairing a "commodity" Intel server with the Convey FPGA-based coprocessor.]
Memory Subsystem
• Optimized for 64-bit accesses; 80 GB/sec peak
• Automatically maintains coherency without impacting AE (Application Engine) performance
OBSERVATIONS OF MEMORY TECHNOLOGY
Memory Performance Observations
• DRAM bandwidth is historically tracking well with Moore's Law [core DRAM technology]
  – Progression from SDRAM through DDR, DDR2, and DDR3
  – Memory clock frequency will eventually hit a power/transistor density tradeoff wall
  – Ignores macro DRAM technologies such as GDDRx
• DRAM latency is significantly lagging behind Moore's Law
  – Latency is being hidden by buffered DIMM technology, larger caches, and an increase in the number of outstanding requests
  – Becoming more painful to cover this latency gap as compared to function unit performance
• DRAM capacity is reasonably tracking Moore's Law
  – …but we already knew this
  – Looking forward to 3D stacked DRAMs
DRAM Bandwidth
[Figure: DRAM bandwidth trend across memory generations.]

DRAM Latency
[Figure: DRAM latency trend across memory generations.]
HPCC RandomAccess Performance

[Figure: HPCC RandomAccess (GUPS) results across systems, including IBM Dawn and the NEC SX9, the latter annotated for its flexible, small memory access windows.]

Luszczek, P., Dongarra, J., "Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess," Innovative Computing Laboratory (ICL) Technical Report ICL-UT-10-03, June 2010.
Conclusions
• Memory performance will continually become a larger portion of the computational bottleneck
  – Amdahl's Law is a buzz kill when analyzing memory-bound apps… but we know this
• Accesses that are latency sensitive [e.g., not in cache] will become much of the limiting factor
  – As DRAM density increases, we're not doing enough creative engineering to cover the latency hot spots… more stuff through the same soda straws
• Future algorithm and instruction set development needs to become more memory-centric in order to have a reasonable chance at utilizing new core technologies
MEMORY-CENTRIC INSTRUCTION SETS
Memory-Centric Instruction Sets
• Instruction sets designed explicitly around the functional representation and distribution of the operands
• Given a platform that permits instruction set flexibility, the key to garnering the maximum efficiency [really, throughput] is the following (a sizing sketch follows this list):
  1) Examine the operands
  2) Examine the operand dependency graphs
  3) Determine the operand width for a single clock cycle read latency
  4) From this, determine the optimal single-stage* function unit: operands/function unit
  5) Scale out the operands/function units as pipeline capacity permits

*stage ~= (single pipeline stage) and/or (die area/clock)
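A minimal sketch of the sizing arithmetic behind steps 3 through 5. Every number here is a hypothetical parameter for illustration, not a Convey hardware specification:

```c
/* isa_sizing.c -- hypothetical sizing worksheet for steps 3-5. */
#include <stdio.h>

int main(void)
{
    unsigned bus_bits_per_clock = 512; /* step 3: bits per single-cycle read */
    unsigned operand_bits       = 2;   /* step 1: operand width (e.g., 2-bit base) */
    unsigned ops_per_unit       = 8;   /* step 4: operands one single-stage unit consumes */

    unsigned operands_per_clock = bus_bits_per_clock / operand_bits;
    unsigned units_to_build     = operands_per_clock / ops_per_unit;

    /* step 5: scale out only as far as the memory system can feed */
    printf("%u operands/clock -> %u function units\n",
           operands_per_clock, units_to_build);
    return 0;
}
```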
Stage 1: Operand Examination
• How is the data represented logically in memory?
• Graph Theory?
  – Vertex, Edge representation as vectors
  – Adjacency matrices -> bit vectors
• Genomics/Proteomics?
  – UINT8 or UINT2 is sufficient
• 2D/3D Stencil?
  – How many points are required per cell?
  – Is double precision important?

X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K)
         + S3*Y(I,J-1,K) + S4*Y(I,J+1,K)
         + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)

ACTGTGACATGCTGACATGCTAGTAATGCA

[Figure: example operand representations: stencils (FD/FV/FE), genomics/proteomics sequences, and graph-theory edge and vertex vectors, etc. A packing sketch follows.]
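To make the 2-bit representation concrete, a minimal sketch of packing nucleotides into 2-bit operands, assuming one common A/C/G/T encoding (the function name and code assignment are illustrative, not Convey's):

```c
/* pack2.c -- hypothetical 2-bit nucleotide packing (Stage 1).
 * 32 nucleotides fit in one 64-bit word, so one 64-bit memory access
 * carries 32 operands instead of 8 with a byte-per-base encoding. */
#include <stdint.h>

uint64_t pack32(const char *seq) /* seq: 32 chars drawn from A/C/G/T */
{
    uint64_t w = 0;
    for (int i = 0; i < 32; i++) {
        uint64_t code;
        switch (seq[i]) {          /* 2-bit code per base (one common choice) */
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        default:  code = 3; break; /* 'T' */
        }
        w |= code << (2 * i);
    }
    return w;
}
```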
Stage 2: Operand Dependency Graphs
• What are the data interdependencies: e.g., what level of data parallelism can I achieve?
• Graph Theory?
  – Vertex, Edge representation as vectors
  – Adjacency matrices -> bit vectors
• Genomics/Proteomics?
  – Embarrassingly parallel
• 2D/3D Stencil?
  – The (I,J,K) case is dependent upon (I-1,J-1,K-1)
• Algorithmic decomposition

X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K)
         + S3*Y(I,J-1,K) + S4*Y(I,J+1,K)
         + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)

ACTGTGACATGCTGACATGCTAGTAATGCA

[Figure: the same operand examples as Stage 1: stencils (FD/FV/FE), genomics/proteomics, graph-theory edge and vertex vectors, etc. A stencil sketch follows.]
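As an illustration of the dependency question, a minimal C sketch of the 7-point stencil above. When the update is out of place (X and Y distinct, as the formula is written), every output point is independent and the I/J/K loops are fully data parallel; an in-place update would instead carry the (I-1,J-1,K-1)-style dependency the slide notes. Array names and the extent N are hypothetical:

```c
/* stencil7.c -- hypothetical 7-point stencil (Stage 2). With X and Y
 * as distinct arrays, each X(I,J,K) reads only Y, so all output points
 * can be computed in parallel. */
#define N 64

void stencil7(double X[N][N][N], double Y[N][N][N], const double S[7])
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            for (int k = 1; k < N - 1; k++)
                X[i][j][k] = S[0]*Y[i][j][k]
                           + S[1]*Y[i-1][j][k] + S[2]*Y[i+1][j][k]
                           + S[3]*Y[i][j-1][k] + S[4]*Y[i][j+1][k]
                           + S[5]*Y[i][j][k-1] + S[6]*Y[i][j][k+1];
}
```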
Stage 3: Operand Width per Functional Memory Bus
• My data has no dependencies [e.g., embarrassingly parallel], so how many operands can I service in a single read cycle?
• My data is non-unit stride, so how many operands can I service in a single read cycle after scalar address expansion?

X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K)
         + S3*Y(I,J-1,K) + S4*Y(I,J+1,K)
         + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)

ACTGTGACATGCTGACATGCTAGTAATGCA

[Figure: the same operand examples: stencils (FD/FV/FE), genomics/proteomics, graph-theory edge and vertex vectors, etc.]
Stage 3: Genomics Example

[Figure: eight memory controllers (MC) behind a crossbar, each with a 333 MHz clock; SG-DIMMs deliver 8-byte single-word operations, for 64 bytes/512 bits per clock.]

• Single protein = 8 bits; single nucleotide = 2 bits
• 512 bits / 8 bits => 64 proteins per clock
• 512 bits / 2 bits => 256 nucleotides per clock
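A minimal sketch of consuming one such 512-bit line in software: treat it as eight 64-bit words (one per memory controller) and extract 2-bit nucleotides, reproducing the 256-operands-per-clock figure. Types and names are hypothetical:

```c
/* unpack512.c -- hypothetical consumption of one 512-bit memory line
 * (Stage 3): 8 MCs x 64 bits = 512 bits = 256 two-bit nucleotides. */
#include <stdint.h>

/* line: the 512-bit read as eight 64-bit words; out: 256 base codes. */
void unpack_line(const uint64_t line[8], uint8_t out[256])
{
    for (int w = 0; w < 8; w++)               /* one word per MC   */
        for (int i = 0; i < 32; i++)          /* 32 bases per word */
            out[32 * w + i] = (line[w] >> (2 * i)) & 3;
}
```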
Stage 4: Single Stage Function Unit
• Now that I know how much data can be ingested per memory cycle, what function units are appropriate?
• Can I perform multiple ops/clock?
• Is this a single pipeline model?
• Do I need to develop a new algorithmic model?
[Figure: eight memory controllers (MC) feeding the crossbar.]
Stage 4: Genomics Example
[Figure: the Application Engine pulls 32 input characters and 32 reference characters (e.g., ACTGTGACATGCTGACATGCTAGTAATGCATG) from the eight memory controllers through the crossbar into pipeline stage 1, an all-to-all comparator.]
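A minimal software model of what such an all-to-all comparator computes, reusing the 2-bit packing from the Stage 1 sketch. In hardware all 32x32 = 1024 comparisons happen in a single pipeline stage; this C model simply loops. All names are hypothetical:

```c
/* comparator.c -- hypothetical model of a 32x32 all-to-all comparator
 * (Stage 4, pipeline stage 1). Each input base is compared against
 * every reference base. */
#include <stdint.h>

/* in, ref: 32 nucleotides packed 2 bits each (see pack32 above).
 * match[i] gets a 32-bit mask: bit j set iff input base i == ref base j. */
void all_to_all(uint64_t in, uint64_t ref, uint32_t match[32])
{
    for (int i = 0; i < 32; i++) {
        uint32_t m = 0;
        uint64_t a = (in >> (2 * i)) & 3;      /* input base i     */
        for (int j = 0; j < 32; j++)
            if (a == ((ref >> (2 * j)) & 3))   /* reference base j */
                m |= (uint32_t)1 << j;
        match[i] = m;
    }
}
```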
Stage 5: Functional Scale Out
• Given this function unit, how many additional function units can be implemented?
• Single clock read offset
• Multi-clock reads and multi-stage pipeline
• Heavy pipelining with single stage pipeline

[Figure: eight memory controllers (MC) behind the crossbar feeding four Application Engines (AE).]
Stage 5: Scale Out Example
[Figure: three scale-out organizations, each an AE with a crossbar and a dispatch unit. Traditional pipelined model: a single path through pipeline stages 0, 1, and 2. Traditional vector model: function pipes 0-3 alongside a scalar unit and a misc unit. Pipelined, multi-stage SIMD units: function pipes 0-2, each itself a multi-stage pipeline. A software model follows.]
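A minimal C model of the SIMD scale-out idea: the 512-bit line from the crossbar is split across several function pipes, each running the Stage 4 comparator on its slice. The pipe count and names are hypothetical; hardware runs the pipes concurrently, while this model loops over them:

```c
/* scaleout.c -- hypothetical model of Stage 5 scale-out: one 512-bit
 * memory line split across 8 function pipes of 64 bits (32 bases) each. */
#include <stdint.h>

#define PIPES 8  /* hypothetical: one pipe per 64-bit word of the line */

void all_to_all(uint64_t in, uint64_t ref, uint32_t match[32]); /* Stage 4 sketch */

/* line_in/line_ref: 512-bit lines as eight 64-bit words. Each function
 * pipe processes one word per clock in hardware; here we just iterate. */
void scale_out(const uint64_t line_in[PIPES], const uint64_t line_ref[PIPES],
               uint32_t match[PIPES][32])
{
    for (int p = 0; p < PIPES; p++)
        all_to_all(line_in[p], line_ref[p], match[p]);
}
```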
CONCLUSION
Conclusions
• Insanity in Physical Science: “The definition of insanity is doing the same thing over and over again and expecting different results” – Albert Einstein
• Insanity in Computational Science: The definition of insanity is building more function units than one can reasonably utilize with a fixed memory system… and expecting faster results.
Energy Efficient, Hybrid-Core Computing
• Higher Performance
  – 5x to 25x application gains
• Energy Saving
  – Up to 90% reduction in data center power usage
• Easy to program
  – ANSI standard C, C++ and Fortran
• Reloadable Personalities
  – application-specific performance on an x86 base
"Convey Computers may be at the forefront of a wave of innovation brought on by developing FPGAs as a viable alternative to CPUs…"

"Convey Computer seeks to use FPGAs to create a hybrid computing platform" – 451 Group MIS Impact Report, 12/09/08

"We have found that one rack of HC-1 servers will replace eight racks of other servers… with correspondingly lowered energy requirements" – Pavel Pevzner, UCSD
THE WORLD'S FIRST HYBRID-CORE COMPUTER.