Transcript of "The World's First Hybrid-Core Computer"

THE WORLD'S FIRST HYBRID-CORE COMPUTER.
DESIGN PHILOSOPHIES FOR MEMORY-CENTRIC INSTRUCTION SET ARCHITECTURES
John Leidel, Software Architect. SAAHPC'10
Agenda
• Introduction to Convey Computer
• Architecture Overview
• Observations of Memory Technology
• Memory-Centric Instruction Sets
• Example Design
• Wrap-up
INTRODUCTION
Introduction to Convey Computer
• Developer of the HC-1 hybrid-core computer system
• Leverages the Intel x86 ecosystem
• FPGA-based coprocessor for performance & efficiency
• Experienced team
We have hit a “power wall”
[Figure: CPU clock speed and power trends from the 386 through the Pentium and Pentium 4 to Core/Core 2, illustrating the power wall. Graphic courtesy of Herb Sutter, http://www.gotw.ca/publications/concurrency-ddj.htm]
Observations:
• Heterogeneous computing is inevitable
  – More performance using less power (more efficient use of transistors)
  – Application-specific logic is the most efficient
• Successful performance enhancements are tightly integrated with the processor
  – integrated vector processors vs. array processors
  – common address space & data types
• Single compiler & programming environment
  – industry standard source (no new languages or dialects)
  – leverage of existing applications and algorithms
• Systems that are simpler to program win
ARCHITECTURE OVERVIEW
[Figure: Hybrid-core computing aims to combine the application performance/power efficiency of application-specific hardware with the programmability and deployment ease of an x86 server. Axes: application performance/power efficiency (low to high) vs. ease of deployment (difficult to easy); the Convey HC-1 targets the high-efficiency, easy-to-deploy corner.]

Heterogeneous solutions
• can be much more efficient
• still hard to program

Multicore solutions
• don't always scale well
• parallel programming is hard
HC-1 Hardware

[Figure: HC-1 block diagram. The Intel chipset connects host memory and PCI I/O, linked to the coprocessor at 8 GB/s; the coprocessor memory subsystem delivers 80 GB/s. Host and coprocessor share cache-coherent, shared virtual memory.]
Hybrid-Core Computing

[Figure: applications from domains such as Oil & Gas, Financial, CAE, Life Sciences, and custom codes flow through the Convey compilers over shared virtual memory, executing on both the x86-64 ISA and a custom ISA implemented as personalities on four FPGAs.]

Application-Specific Personalities
• Extend the x86 instruction set
• Implement key operations in hardware

Cache-coherent, shared memory
• Both ISAs address common memory

*ISA: Instruction Set Architecture
Using Personalities
• Personalities are reloadable instruction sets
• Compiler generates x86 and coprocessor instructions from ANSI standard C/C++ & Fortran
• Executable can run on x86 nodes or Convey Hybrid-Core nodes

[Figure: Convey Software Development Suite. C/C++ and Fortran sources compile into a single hybrid-core executable containing both x86-64 and coprocessor instructions. The user specifies a personality at compile time; the OS loads the personality (instruction descriptions plus FPGA bitfiles) at runtime onto the Convey HC-1's Intel x86 host and coprocessor.]
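Because the toolchain consumes unmodified ANSI C, source that targets a personality is ordinary code. Below is a minimal, hypothetical example of the kind of dense, vectorizable loop a dual-target compiler could emit as either x86-64 or coprocessor instructions; nothing here is Convey-specific syntax, and the function name is illustrative:

```c
/* daxpy.c -- plain ANSI C, no Convey-specific source changes.
 * A hypothetical example of the vectorizable-loop style that a
 * dual-target compiler can map to either ISA. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y)
{
    /* Whether this loop runs on the x86 host or on a personality is
       a toolchain/runtime decision, not a source-code one. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```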
HC-1 Hardware
• 2U enclosure:
  – Top half of the 2U platform contains the coprocessor
  – Bottom half contains the Intel motherboard

[Figure: HC-1 chassis layout: coprocessor assembly above the host x86 server assembly, with FSB mezzanine card, 3 x 3.5" disk drives, x16 PCI-E slot, and host memory DIMMs.]
HC-1 Architecture
[Figure: HC-1 architecture, pairing a "commodity" Intel server with the Convey FPGA-based coprocessor.]
Memory Subsystem
• Optimized for 64-bit accesses; 80 GB/sec peak
• Automatically maintains coherency without impacting AE (Application Engine) performance
OBSERVATIONS OF MEMORY TECHNOLOGY
Memory Performance Observations
• DRAM bandwidth is historically tracking well with Moore's Law [core DRAM technology]
  – Progression from SDRAM through DDR, DDR2, and DDR3
  – Memory clock frequency will eventually hit a power/transistor density tradeoff wall
  – Ignores macro DRAM technologies such as GDDRx
• DRAM latency is significantly lagging behind Moore's Law
  – Latency is being hidden by buffered DIMM technology, larger caches, and an increase in the number of outstanding requests
  – Becoming more painful to cover this latency gap as compared to function unit performance
• DRAM capacity is reasonably tracking Moore's Law
  – …but we already knew this
  – Looking forward to 3D stacked DRAMs
DRAM Bandwidth
[Figure: DRAM bandwidth trend across memory generations.]

DRAM Latency
[Figure: DRAM latency trend across memory generations.]
HPCC RandomAccess Performance

[Figure: HPCC RandomAccess (GUPS) results across systems, including IBM Dawn and the NEC SX9, the latter annotated for its flexible, small memory access windows.]

Luszczek, P., Dongarra, J., "Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess," Innovative Computing Laboratory (ICL) Technical Report ICL-UT-10-03, June 2010.
Conclusions
• Memory performance will continually become a larger portion of the computational bottleneck
  – Amdahl's Law is a buzz kill when analyzing memory-bound apps… but we know this
• Accesses that are latency sensitive [e.g., not in cache] will become much of the limiting factor
  – As DRAM density increases, we're not doing enough creative engineering to cover the latency hot spots… more stuff through the same soda straws
• Future algorithm and instruction set development needs to become more memory-centric in order to have a reasonable chance at utilizing new core technologies
MEMORY-CENTRIC INSTRUCTION SETS
Memory-Centric Instruction Sets
• Instruction sets designed explicitly around the functional representation and distribution of the operands
• Given a platform that permits instruction set flexibility, the key to garnering the maximum efficiency [really, throughput] is the following (a sizing sketch follows this list):
  1) Examine the operands
  2) Examine the operand dependency graphs
  3) Determine the operand width for a single clock cycle read latency
  4) From this, determine the optimal single-stage* function unit: operands/function unit
  5) Scale out the operands/function units as pipeline capacity permits

*stage ~= (single pipeline stage) and/or (die area/clock)
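A minimal sketch of the sizing arithmetic behind steps 3 through 5. Every number here is a hypothetical parameter for illustration, not a Convey hardware specification:

```c
/* isa_sizing.c -- hypothetical sizing worksheet for steps 3-5. */
#include <stdio.h>

int main(void)
{
    unsigned bus_bits_per_clock = 512; /* step 3: bits per single-cycle read */
    unsigned operand_bits       = 2;   /* step 1: operand width (e.g., 2-bit base) */
    unsigned ops_per_unit       = 8;   /* step 4: operands one single-stage unit consumes */

    unsigned operands_per_clock = bus_bits_per_clock / operand_bits;
    unsigned units_to_build     = operands_per_clock / ops_per_unit;

    /* step 5: scale out only as far as the memory system can feed */
    printf("%u operands/clock -> %u function units\n",
           operands_per_clock, units_to_build);
    return 0;
}
```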
Stage 1: Operand Examination
• How is the data represented logically in memory?
• Graph Theory?
  – Vertex, Edge representation as vectors
  – Adjacency matrices -> bit vectors
• Genomics/Proteomics?
  – UINT8 or UINT2 is sufficient
• 2D/3D Stencil?
  – How many points are required per cell?
  – Is double precision important?

X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K)
         + S3*Y(I,J-1,K) + S4*Y(I,J+1,K)
         + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)

ACTGTGACATGCTGACATGCTAGTAATGCA

[Figure: example operand representations: stencils (FD/FV/FE), genomics/proteomics sequences, and graph-theory edge and vertex vectors, etc. A packing sketch follows.]
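To make the 2-bit representation concrete, a minimal sketch of packing nucleotides into 2-bit operands, assuming one common A/C/G/T encoding (the function name and code assignment are illustrative, not Convey's):

```c
/* pack2.c -- hypothetical 2-bit nucleotide packing (Stage 1).
 * 32 nucleotides fit in one 64-bit word, so one 64-bit memory access
 * carries 32 operands instead of 8 with a byte-per-base encoding. */
#include <stdint.h>

uint64_t pack32(const char *seq) /* seq: 32 chars drawn from A/C/G/T */
{
    uint64_t w = 0;
    for (int i = 0; i < 32; i++) {
        uint64_t code;
        switch (seq[i]) {          /* 2-bit code per base (one common choice) */
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        default:  code = 3; break; /* 'T' */
        }
        w |= code << (2 * i);
    }
    return w;
}
```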
Stage 2: Operand Dependency Graphs
• What are the data interdependencies: e.g., what level of data parallelism can I achieve?
• Graph Theory?
  – Vertex, Edge representation as vectors
  – Adjacency matrices -> bit vectors
• Genomics/Proteomics?
  – Embarrassingly parallel
• 2D/3D Stencil?
  – The (I,J,K) case is dependent upon (I-1,J-1,K-1)
• Algorithmic decomposition

X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K)
         + S3*Y(I,J-1,K) + S4*Y(I,J+1,K)
         + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)

ACTGTGACATGCTGACATGCTAGTAATGCA

[Figure: the same operand examples as Stage 1: stencils (FD/FV/FE), genomics/proteomics, graph-theory edge and vertex vectors, etc. A stencil sketch follows.]
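As an illustration of the dependency question, a minimal C sketch of the 7-point stencil above. When the update is out of place (X and Y distinct, as the formula is written), every output point is independent and the I/J/K loops are fully data parallel; an in-place update would instead carry the (I-1,J-1,K-1)-style dependency the slide notes. Array names and the extent N are hypothetical:

```c
/* stencil7.c -- hypothetical 7-point stencil (Stage 2). With X and Y
 * as distinct arrays, each X(I,J,K) reads only Y, so all output points
 * can be computed in parallel. */
#define N 64

void stencil7(double X[N][N][N], double Y[N][N][N], const double S[7])
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            for (int k = 1; k < N - 1; k++)
                X[i][j][k] = S[0]*Y[i][j][k]
                           + S[1]*Y[i-1][j][k] + S[2]*Y[i+1][j][k]
                           + S[3]*Y[i][j-1][k] + S[4]*Y[i][j+1][k]
                           + S[5]*Y[i][j][k-1] + S[6]*Y[i][j][k+1];
}
```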
Stage 3: Operand Width per Functional Memory Bus
• My data has no dependencies [e.g., embarrassingly parallel], so how many operands can I service in a single read cycle?
• My data is non-unit stride, so how many operands can I service in a single read cycle after scalar address expansion?

X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K)
         + S3*Y(I,J-1,K) + S4*Y(I,J+1,K)
         + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)

ACTGTGACATGCTGACATGCTAGTAATGCA

[Figure: the same operand examples: stencils (FD/FV/FE), genomics/proteomics, graph-theory edge and vertex vectors, etc.]
Stage 3: Genomics Example

[Figure: eight memory controllers (MC) behind a crossbar, each with a 333 MHz clock; SG-DIMMs deliver 8-byte single-word operations, for 64 bytes/512 bits per clock.]

• Single protein = 8 bits; single nucleotide = 2 bits
• 512 bits / 8 bits => 64 proteins per clock
• 512 bits / 2 bits => 256 nucleotides per clock
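A minimal sketch of consuming one such 512-bit line in software: treat it as eight 64-bit words (one per memory controller) and extract 2-bit nucleotides, reproducing the 256-operands-per-clock figure. Types and names are hypothetical:

```c
/* unpack512.c -- hypothetical consumption of one 512-bit memory line
 * (Stage 3): 8 MCs x 64 bits = 512 bits = 256 two-bit nucleotides. */
#include <stdint.h>

/* line: the 512-bit read as eight 64-bit words; out: 256 base codes. */
void unpack_line(const uint64_t line[8], uint8_t out[256])
{
    for (int w = 0; w < 8; w++)               /* one word per MC   */
        for (int i = 0; i < 32; i++)          /* 32 bases per word */
            out[32 * w + i] = (line[w] >> (2 * i)) & 3;
}
```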
Stage 4: Single Stage Function Unit
• Now that I know how much data can be ingested per memory cycle, what function units are appropriate?
• Can I perform multiple ops/clock?
• Is this a single pipeline model?
• Do I need to develop a new algorithmic model?
[Figure: eight memory controllers (MC) feeding the crossbar.]
Stage 4: Genomics Example
[Figure: the Application Engine pulls 32 input characters and 32 reference characters (e.g., ACTGTGACATGCTGACATGCTAGTAATGCATG) from the eight memory controllers through the crossbar into pipeline stage 1, an all-to-all comparator.]
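A minimal software model of what such an all-to-all comparator computes, reusing the 2-bit packing from the Stage 1 sketch. In hardware all 32x32 = 1024 comparisons happen in a single pipeline stage; this C model simply loops. All names are hypothetical:

```c
/* comparator.c -- hypothetical model of a 32x32 all-to-all comparator
 * (Stage 4, pipeline stage 1). Each input base is compared against
 * every reference base. */
#include <stdint.h>

/* in, ref: 32 nucleotides packed 2 bits each (see pack32 above).
 * match[i] gets a 32-bit mask: bit j set iff input base i == ref base j. */
void all_to_all(uint64_t in, uint64_t ref, uint32_t match[32])
{
    for (int i = 0; i < 32; i++) {
        uint32_t m = 0;
        uint64_t a = (in >> (2 * i)) & 3;      /* input base i     */
        for (int j = 0; j < 32; j++)
            if (a == ((ref >> (2 * j)) & 3))   /* reference base j */
                m |= (uint32_t)1 << j;
        match[i] = m;
    }
}
```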
Stage 5: Functional Scale Out
• Given this function unit, how many additional function units can be implemented?
• Single clock read offset
• Multi-clock reads and multi-stage pipeline
• Heavy pipelining with single stage pipeline

[Figure: eight memory controllers (MC) behind the crossbar feeding four Application Engines (AE).]
Stage 5: Scale Out Example
[Figure: three scale-out organizations, each an AE with a crossbar and a dispatch unit. Traditional pipelined model: a single path through pipeline stages 0, 1, and 2. Traditional vector model: function pipes 0-3 alongside a scalar unit and a misc unit. Pipelined, multi-stage SIMD units: function pipes 0-2, each itself a multi-stage pipeline. A software model follows.]
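A minimal C model of the SIMD scale-out idea: the 512-bit line from the crossbar is split across several function pipes, each running the Stage 4 comparator on its slice. The pipe count and names are hypothetical; hardware runs the pipes concurrently, while this model loops over them:

```c
/* scaleout.c -- hypothetical model of Stage 5 scale-out: one 512-bit
 * memory line split across 8 function pipes of 64 bits (32 bases) each. */
#include <stdint.h>

#define PIPES 8  /* hypothetical: one pipe per 64-bit word of the line */

void all_to_all(uint64_t in, uint64_t ref, uint32_t match[32]); /* Stage 4 sketch */

/* line_in/line_ref: 512-bit lines as eight 64-bit words. Each function
 * pipe processes one word per clock in hardware; here we just iterate. */
void scale_out(const uint64_t line_in[PIPES], const uint64_t line_ref[PIPES],
               uint32_t match[PIPES][32])
{
    for (int p = 0; p < PIPES; p++)
        all_to_all(line_in[p], line_ref[p], match[p]);
}
```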
CONCLUSION
Conclusions
• Insanity in Physical Science: “The definition of insanity is doing the same thing over and over again and expecting different results” – Albert Einstein
• Insanity in Computational Science: The definition of insanity is building more function units than one can reasonably utilize with a fixed memory system… and expecting faster results.
Energy Efficient, Hybrid-Core Computing
• Higher Performance
  – 5x to 25x application gains
• Energy Saving
  – Up to 90% reduction in data center power usage
• Easy to program
  – ANSI standard C, C++ and Fortran
• Reloadable Personalities
  – application-specific performance on an x86 base
"Convey Computers may be at the forefront of a wave of innovation brought on by developing FPGAs as a viable alternative to CPUs…"

"Convey Computer seeks to use FPGAs to create a hybrid computing platform" – 451 Group MIS Impact Report, 12/09/08

"We have found that one rack of HC-1 servers will replace eight racks of other servers… with correspondingly lowered energy requirements" – Pavel Pevzner, UCSD
THE WORLD'S FIRST HYBRID-CORE COMPUTER.