THE WORLD'S FIRST HYBRID-CORE COMPUTER.

Transcript of "The World's First Hybrid-Core Computer"

Page 1

THE WORLD'S FIRST HYBRID-CORE COMPUTER.

Page 2

DESIGN PHILOSOPHIES FOR MEMORY-CENTRIC INSTRUCTION SET ARCHITECTURES

John Leidel: Software Architect, SAAHPC'10

Page 3

Agenda

•  Introduction to Convey Computer
•  Architecture Overview
•  Observations of Memory Technology
•  Memory-Centric Instruction Sets
•  Example Design
•  Wrap-up

Page 4

INTRODUCTION

Page 5

Introduction to Convey Computer

•  Developer of the HC-1 hybrid-core computer system
•  Leverages the Intel x86 ecosystem
•  FPGA-based coprocessor for performance & efficiency
•  Experienced team

Page 6

We have hit a "power wall"

[Graph: clock speed, power, and transistor trends across the 386, Pentium, Pentium 4, and Core/Core 2 generations. Graphic courtesy of Herb Sutter, http://www.gotw.ca/publications/concurrency-ddj.htm]

Page 7

Observations:

•  Heterogeneous computing is inevitable
   –  More performance using less power (more efficient use of transistors)
   –  Application-specific logic is the most efficient
•  Successful performance enhancements are tightly integrated with the processor
   –  integrated vector processors vs. array processors
   –  common address space & data types
•  Single compiler & programming environment
   –  industry-standard source (no new languages or dialects)
   –  leverage of existing applications and algorithms
•  Systems that are simpler to program win

Page 8

ARCHITECTURE OVERVIEW

Page 9

Hybrid-core Computing

[Figure: Application Performance/Power Efficiency (low to high) vs. Ease of Deployment (difficult to easy). The Convey HC-1 combines the performance of application-specific hardware with the programmability and deployment ease of an x86 server.]

Heterogeneous solutions
•  can be much more efficient
•  still hard to program

Multicore solutions
•  don't always scale well
•  parallel programming is hard

Page 10

HC-1 Hardware

[Diagram: an Intel chipset with memory and PCI I/O, linked at 8 GB/s to a four-FPGA coprocessor running personalities, which reaches its own memory at 80 GB/s; host and coprocessor share cache-coherent, shared virtual memory]

Page 11

Hybrid-Core Computing

[Diagram: applications (Oil & Gas, Financial, Custom, CAE, Life Sciences) compiled by the Convey compilers to the x86-64 ISA plus an application-specific custom ISA, over shared virtual memory]

Application-Specific Personalities
•  Extend the x86 instruction set
•  Implement key operations in hardware

Cache-coherent, shared memory
•  Both ISAs address common memory

*ISA: Instruction Set Architecture

Page 12

Using Personalities

•  Personalities are reloadable instruction sets
•  Compiler generates x86 and coprocessor instructions from ANSI standard C/C++ & Fortran
•  Executable can run on x86 nodes or Convey Hybrid-Core nodes

[Diagram: C/C++ and Fortran sources pass through the Convey Software Development Suite to produce a hybrid-core executable containing x86-64 and coprocessor instructions, which runs on the Convey HC-1 (Intel x86 host plus coprocessor). The user specifies a personality (instruction descriptions plus FPGA bitfiles) at compile time; the personality is loaded at runtime by the OS.]

Page 13

HC-1 Hardware

•  2U enclosure:
   –  Top half of the 2U platform contains the coprocessor
   –  Bottom half contains the Intel motherboard

[Photo: the coprocessor assembly above the host x86 server assembly, with the FSB mezzanine card, 3 x 3½" disk drives, an x16 PCI-E slot, and host memory DIMMs labeled]

Page 14

HC-1 Architecture

[Diagram: a "commodity" Intel server paired with the Convey FPGA-based coprocessor]

Page 15

Memory Subsystem

•  Optimized for 64-bit accesses; 80 GB/sec peak
•  Automatically maintains coherency without impacting AE performance

Page 16

OBSERVATIONS OF MEMORY TECHNOLOGY

Page 17

Memory Performance Observations

•  DRAM bandwidth is historically tracking well with Moore's Law [core DRAM technology]
   –  Progression from SDRAM through DDR, DDR2, and DDR3
   –  Memory clock frequency will eventually hit a power/transistor-density tradeoff wall
   –  Ignores macro DRAM technologies such as GDDRx
•  DRAM latency is significantly lagging behind Moore's Law
   –  Latency is being hidden by buffered DIMM technology, larger caches, and an increasing number of outstanding requests
   –  Becoming more painful to cover this latency gap as compared to function unit performance
•  DRAM capacity is reasonably tracking Moore's Law
   –  …but we already knew this
   –  Looking forward to 3D stacked DRAMs

Page 18

DRAM Bandwidth

Page 19

DRAM Latency

Page 20

HPCC Random Access Performance

[Chart: HPCC RandomAccess performance; labeled points include IBM Dawn and the NEC SX-9 (small memory access windows)]

Luszczek, P., and Dongarra, J., "Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess," Innovative Computing Laboratory (ICL) Technical Report ICL-UT-10-03, June 2010.

Page 21

Flexible Conclusions

•  Memory performance will continually become a larger portion of the computational bottleneck
   –  Amdahl's Law is a buzz kill when analyzing memory-bound apps… but we know this
•  Accesses that are latency sensitive [e.g., not in cache] will become much of the limiting factor
   –  As DRAM density increases, we're not doing enough creative engineering to cover the latency hot spots… more stuff through the same soda straws
•  Future algorithm and instruction set development needs to become more memory centric in order to have a reasonable chance at utilizing new core technologies

Page 22

MEMORY-CENTRIC INSTRUCTION SETS

Page 23

Memory-Centric Instruction Sets

•  Instruction sets designed explicitly around the functional representation and distribution of the operands
•  Given a platform that permits instruction set flexibility, the key to garnering the maximum efficiency [really, throughput] is the following (sketched in code after this list):
   1)  Examine the operands
   2)  Examine the operand dependency graphs
   3)  Determine the operand width for a single clock cycle read latency
   4)  Out of this, determine the optimal single-stage* function unit: operands/function unit
   5)  Scale out the operands/function units as pipeline capacity permits

*stage ~= (single pipeline stage) and/or (die area/clock)
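
To make steps 3 through 5 concrete, here is a minimal C sketch of the sizing arithmetic. The bus width, operand size, and pipeline count are illustrative assumptions, not HC-1 specifications.

    #include <stdio.h>

    /* Rough sizing arithmetic for steps 3-5: given the read width
     * delivered per clock and the operand width, how many operands can
     * a single-stage function unit ingest, and how far can it scale?
     * All numbers here are illustrative assumptions, not HC-1 specs. */
    int main(void) {
        const int bus_bits_per_clock = 512; /* assumed read width/clock */
        const int operand_bits       = 8;   /* e.g., one UINT8 code     */
        const int pipeline_capacity  = 4;   /* assumed number of pipes  */

        int operands_per_clock = bus_bits_per_clock / operand_bits; /* step 3 */
        int ops_per_func_unit  = operands_per_clock;                /* step 4 */
        int scaled_out_ops     = ops_per_func_unit * pipeline_capacity; /* step 5 */

        printf("operands/clock: %d\n", operands_per_clock);
        printf("scaled across %d pipes: %d operands/clock\n",
               pipeline_capacity, scaled_out_ops);
        return 0;
    }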

Page 24

Stage 1: Operand Examination

•  How is the data represented logically in memory?
•  Graph theory?
   –  Vertex, edge representation as vectors
   –  Adjacency matrices -> bit vectors
•  Genomics/proteomics?
   –  UINT8, UINT2 is sufficient
•  2D/3D stencil?
   –  How many points required per cell?
   –  Is double precision important?

Stencils [FD/FV/FE] (written out in C below):

X(I,J,K) = S0*Y(I  ,J  ,K  )
         + S1*Y(I-1,J  ,K  )
         + S2*Y(I+1,J  ,K  )
         + S3*Y(I  ,J-1,K  )
         + S4*Y(I  ,J+1,K  )
         + S5*Y(I  ,J  ,K-1)
         + S6*Y(I  ,J  ,K+1)

Genomics/Proteomics: ACTGTGACATGCTGACATGCTAGTAATGCA

Graph Theory: Edge Vector, Vertex Vector, etc.
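
As a reference point for the stencil case, a minimal C rendering of the 7-point stencil above; the array extents and the choice of double precision are illustrative assumptions.

    /* 7-point stencil from the slide, written out in C.
     * NX/NY/NZ and double precision are illustrative choices. */
    #define NX 64
    #define NY 64
    #define NZ 64

    void stencil7(double x[NX][NY][NZ], const double y[NX][NY][NZ],
                  const double s[7]) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    x[i][j][k] = s[0] * y[i][j][k]
                               + s[1] * y[i-1][j][k] + s[2] * y[i+1][j][k]
                               + s[3] * y[i][j-1][k] + s[4] * y[i][j+1][k]
                               + s[5] * y[i][j][k-1] + s[6] * y[i][j][k+1];
    }

Each output cell reads seven neighbors of Y; counting how many such reads one memory cycle can service is exactly the Stage 3 question.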

Page 25

Stage 2: Operand Dependency Graphs

•  What are the data interdependencies: e.g., what level of data parallelism can I achieve?
•  Graph theory?
   –  Vertex, edge representation as vectors
   –  Adjacency matrices -> bit vectors
•  Genomics/proteomics?
   –  Embarrassingly parallel
•  2D/3D stencil?
   –  The (I,J,K) case is dependent upon (I-1,J-1,K-1) (contrasted in the C sketch below)
•  Algorithmic decomposition

[The Stage 1 examples repeat here: the 7-point stencil, the nucleotide string, and the graph edge/vertex vectors]
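
To illustrate the dependency question, a small C contrast between an independent (Jacobi-style) update, which parallelizes freely, and an in-place (Gauss-Seidel-style) update, where an iteration depends on already-updated neighbors; the 1D form and loop bounds are illustrative.

    #define N 64

    /* Independent update: every x[i] is computed only from y, so all
     * iterations can run in parallel (the genomics case is like this). */
    void independent(double x[N], const double y[N]) {
        for (int i = 1; i < N - 1; i++)
            x[i] = 0.5 * (y[i-1] + y[i+1]);
    }

    /* In-place update: x[i] reads the x[i-1] written this sweep, so
     * iteration i depends on iteration i-1 and the loop cannot be
     * naively parallelized (the dependent stencil case on the slide). */
    void in_place(double x[N]) {
        for (int i = 1; i < N - 1; i++)
            x[i] = 0.5 * (x[i-1] + x[i+1]);
    }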

Page 26

Stage 3: Operand Width per Functional Memory Bus

•  My data has no dependencies [e.g., embarrassingly parallel], so how many operands can I service in a single read cycle?
•  My data is non-unit stride, so how many operands can I service for a single read cycle after scalar address expansion? (see the address-expansion sketch below)

[The Stage 1 examples repeat here: the 7-point stencil, the nucleotide string, and the graph edge/vertex vectors]
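
For the non-unit-stride case, the limit hinges on how many element addresses must be generated per read. A minimal sketch of scalar address expansion for a strided stream; the bus width, element size, and stride are assumed parameters, not hardware values.

    #include <stddef.h>
    #include <stdint.h>

    /* Scalar address expansion for a non-unit-stride stream: expand a
     * base/stride pair into the per-element addresses that one read
     * cycle would need to cover. */
    size_t expand_addresses(uintptr_t base, size_t stride_bytes,
                            size_t elem_bytes, size_t bus_bytes,
                            uintptr_t addrs[]) {
        /* Dense data packs bus_bytes/elem_bytes operands into one read;
         * strided data needs a separately generated address per operand,
         * which is what bounds the operands serviced per cycle. */
        size_t max_ops = bus_bytes / elem_bytes;
        for (size_t n = 0; n < max_ops; n++)
            addrs[n] = base + n * stride_bytes;
        return max_ops; /* operands serviceable per read cycle, at best */
    }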

Page 27

Stage 3: Genomics Example

[Diagram: 8 memory controllers (MC) at a 333 MHz clock behind a crossbar; SG-DIMMs deliver 8-byte single operations, for 64 bytes / 512 bits per clock]

Single protein = 8 bits
Single nucleotide = 2 bits

512 bits / 8 bits => 64 proteins per clock
512 bits / 2 bits => 256 nucleotides per clock
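
The bandwidth arithmetic above is easy to verify; a small C check (the controller count and per-op width come from the slide, the rest follows):

    #include <stdio.h>

    int main(void) {
        const int controllers  = 8;  /* memory controllers (from slide) */
        const int bytes_per_op = 8;  /* SG-DIMM single-op width         */
        const int bus_bits     = controllers * bytes_per_op * 8; /* 512 */

        printf("bits per clock:        %d\n", bus_bits);     /* 512 */
        printf("proteins per clock:    %d\n", bus_bits / 8); /* 64  */
        printf("nucleotides per clock: %d\n", bus_bits / 2); /* 256 */
        return 0;
    }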

Page 28

Stage 4: Single Stage Function Unit

•  Now that I know how much data can be ingested per memory cycle, what function units are appropriate?
•  Can I perform multiple ops/clock?
•  Is this a single pipeline model?
•  Do I need to develop a new algorithmic model?

[Diagram: 8 memory controllers behind a crossbar]

Page 29

Stage 4: Genomics Example

[Diagram: 8 memory controllers behind a crossbar feed an Application Engine; each clock delivers 32 input chars and 32 reference chars (e.g., ACTGTGACATGCTGACATGCTAGTAATGCATG) into an all-to-all comparator at pipeline stage 1 (modeled in C below)]
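
As a software model of the all-to-all comparator, a minimal C sketch: in the hardware personality all 32x32 comparisons happen in a single pipeline stage, while here they are simply two loops. Accumulating the match bits into a count is an assumed use of the results, for illustration only.

    #include <stdint.h>

    #define WIDTH 32 /* chars delivered per clock, per the slide */

    /* Software model of the all-to-all comparator: compare every input
     * char against every reference char. In hardware, all WIDTH*WIDTH
     * comparisons occur in one pipeline stage. */
    uint32_t all_to_all(const char in[WIDTH], const char ref[WIDTH]) {
        uint32_t matches = 0;
        for (int i = 0; i < WIDTH; i++)
            for (int j = 0; j < WIDTH; j++)
                matches += (in[i] == ref[j]);
        return matches;
    }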

Page 30

Stage 5: Functional Scale Out

•  Given this function unit, how many additional function units can be implemented?
•  Single clock read offset
•  Multi-clock reads and multi-stage pipeline
•  Heavy pipelining with single stage pipeline

[Diagram: 8 memory controllers behind a crossbar feeding four Application Engines (AEs)]

Page 31

Stage 5: Scale Out Example

[Diagram: three scale-out models, each an AE fed by a crossbar and a dispatch unit:
•  Traditional Pipelined Model: one function pipe with pipeline stages 0, 1, and 2
•  Traditional Vector Model: function pipes 0-3 alongside a scalar unit and a misc unit
•  Pipelined, multi-stage SIMD Units: function pipes 0-2, each with its own multi-stage pipeline]
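
A software analogy for the scale-out step: replicate the Stage 4 comparator across several application engines, each taking its own slice of the reference stream. The AE count and slice width mirror the diagrams; the serial driver loop is only a stand-in for engines that run concurrently in hardware.

    #include <stdint.h>

    #define WIDTH 32   /* chars per clock per engine, per the slide    */
    #define N_AES  4   /* application engines, per the Stage 5 figure  */

    /* Stage 4 comparator, reused from the sketch above. */
    uint32_t all_to_all(const char in[WIDTH], const char ref[WIDTH]);

    /* Scale-out driver: each AE takes its own WIDTH-char slice of the
     * reference stream every "clock". The ae loop stands in for N_AES
     * engines running concurrently behind the crossbar. */
    void scale_out(const char in[WIDTH], const char *ref, long ref_len,
                   uint32_t totals[N_AES]) {
        for (long off = 0; off + N_AES * WIDTH <= ref_len;
             off += N_AES * WIDTH)
            for (int ae = 0; ae < N_AES; ae++) /* parallel in hardware */
                totals[ae] += all_to_all(in, ref + off + ae * WIDTH);
    }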

Page 32

CONCLUSION

Page 33

Conclusions

•  Insanity in Physical Science: "The definition of insanity is doing the same thing over and over again and expecting different results" – Albert Einstein
•  Insanity in Computational Science: The definition of insanity is building more function units than one can reasonably utilize with a fixed memory system… and expecting faster results.

Page 34

Energy Efficient, Hybrid-Core Computing

•  Higher performance
   –  5x to 25x application gains
•  Energy saving
   –  Up to 90% reduction in data center power usage
•  Easy to program
   –  ANSI standard C, C++ and Fortran
•  Reloadable personalities
   –  application-specific performance on an x86 base

"Convey Computers may be at the forefront of a wave of innovation brought on by developing FPGAs as a viable alternative to CPUs…" – 451 Group

"Convey Computer seeks to use FPGAs to create a hybrid computing platform" – MIS Impact Report, 12/09/08

"We have found that one rack of HC-1 servers will replace eight racks of other servers… with correspondingly lowered energy requirements" – Pavel Pevzner, UCSD

Page 35

THE WORLD'S FIRST HYBRID-CORE COMPUTER.