Post on 01-Jan-2016
1
A GPU-Like Soft Processor for High-Throughput
Acceleration
Jeffrey Kingyens and J. Gregory SteffanElectrical and Computer Engineering
University of Toronto
A GPU-Like Soft Processor
FGPA-Based AccelerationIn-socket acceleration platforms
FPGA and CPU on same motherboardXtremedata, Nallatech, SGI RASCIntel Quick-Assist, AMD Torrenza
How to program them?HDL is for expertsBehavioural synthesis is limited
Can we provide a more familiar programming model?
2
XD1000
A GPU-Like Soft Processor
Potential Solution: Soft ProcessorAdvantages of soft processors:
Familiar, portable, customizableOur Goal: Develop a new S.P architecture that is:
Naturally capable of utilizing FPGA resourcesSuitable for high-throughput workloads
Challenges:Memory latencyPipeline latency and hazardsExploiting parallelismScaling
3
A GPU-Like Soft Processor
Inspiration: GPU ArchitectureMultithreading
Tolerate memory and pipeline latenciesVector instructions
Data-level parallelism, scalingPredication
Minimize impact of control flowMultiple processors
Scaling
We propose a GPU-like S.P. and programming model
4
A GPU-Like Soft Processor
OverviewA GPU-based system
NVIDIA’s CgAMD’s CTM r5xx ISA
A GPU-like architectureOvercoming port limitationsAvoiding stalls
Preliminary resultsSimulation based on Xtremedata XD1000
5
A GPU-Like Soft Processor
A GPU-Based System
6
A GPU-Like Soft Processor
GPU Shader Processors
7
ShaderProgram
( X o , Yo )
Input Buffers
Fetch( n,x,y)
Data
Register File
ConstantRegisters
Output Buffer
Xo
Yo
Domain Size
Separate in/out buffers simplify memory coherence
A GPU-Like Soft Processor
Software Compilation
8
CTM Binary
Cg Shader Program
C/C++ Host Program
Written by Developer:
CTM Driver
GPU Processor
System Memorycgc
ps3 asm
amucomp
ctm asm
Our system behaves like a real graphics card!
A GPU-Like Soft Processor
NVIDIA’s Cg Language (C-like)Cg Shader Program
struct data_out { float4 sum : COLOR;};data_out multadd(float2 coord : TEXCOORD0, uniform sampler2D A: TEXUNIT0, uniform sampler2D B: TEXUNIT1) { data_out r; float4 offset = {1.0f, 1.0f, 1.0f, 1.0f}; r.sum = tex2D(A,coord)*tex2D(B,coord)+offset; return r;}
9
Matrix-matrix element-wise multiplication + offset
A GPU-Like Soft Processor
AMD’s CTM r5xx ISA (simplified)multadd:TEX r1 r0 s1TEX r0 r0 s0MAD o0 r1 r0 c0END
Loads
A B CSource regsDest regs
10
ALU op
Each register is a 4-element vector
Central regfile: need 2 writes and 4 reads per cycle
A GPU-Like Soft Processor
A GPU-Like Architecture
11
A GPU-Like Soft Processor
Soft Processor Architecture
12
Soft Processor
Coordinate Generator
Register File
ConfigHT Slave
Constant File
HT Master
Output Register
ALU
A B C
TEX
Fifo
Fifo
A
FPGA Block-RAMshave only 2 ports!
64 Cycles!
305 cycles!
Must tolerate port limitations and latencies
A GPU-Like Soft Processor
Overcoming Port LimitationsProblem: central register file:
Needs four reads and two writes per cycleFPGA block RAMs have only two ports
Solution: exploit symmetry of threadsGroup threads into batches of 4Fetch operands across batches in lock-step
13
Only read one operand per thread per cycle
A GPU-Like Soft Processor
Transposed RegFile Access
14
3 2 1 0
C
3 2 1 0
B
3 2 1 0
A
T3RF
T2RF
T1RF
T0RF
A GPU-Like Soft Processor
Avoiding Stalls
Problem: long pipeline and memory latency Frequent long stalls lead to underutilized ALU datapath
Solution: exploit abundance of threads Store contexts for multiple batches of threads Issue instructions from different batches to hide latencies
15
Requires logic to issue-from and manage batches
A GPU-Like Soft Processor
Methodology and Results
16
A GPU-Like Soft Processor
Simulation Methodology
SystemC-based simulation Parameterized to model XD1000 Assume conservative 100Mhz soft processor clock Cycle accurate at the block interfaces Models HyperTransport (bandwidth and latency)
Benchmarksphoton: monte-carlo heat-transfer sim (ALU-intensive)matmatmult: dense matrix multiplication (mem-intensive)
17
A GPU-Like Soft Processor
ALU Utilization
18
Number of Hardware Batch Contexts(Photon)
Utilized
Mem Wait
Not ALU
Inside ALU
100%
80%
60%
40%
20%
1 2 4 8 16
32
640%
A GPU-Like Soft Processor 19
ALU Utilization
19
Utilized Mem WaitNot ALU Inside ALU
32 batches is sufficient
A GPU-Like Soft Processor
ConclusionsGPU-inspired soft processor architecture
exploits multithreading, vector operations, predicationThread symmetry and batching allows:
tolerating limited block RAM portstolerating long memory and pipeline latencies
32 batches sufficient to achieve 100% ALU utilization
Future work:customize programming model and arch. to FPGAsexploit longer vectors, multiple CPUs, custom ops
20
A GPU-Like Soft Processor
Backups
A GPU-Like Soft Processor
ALU Datapath
ALU
FPADD FPADD FPADD FPADD
FPMULT FPMULT FPMULT FPMULT
FPADD
11
FPADD FPADD FPADD FPADD
1414
14
pre-subtract
multiply
add
dp3/dp4
64 Cycles!
A GPU-Like Soft Processor
Reducing Register File Size
M-RAM64KB
.r
M-RAM64KB
.g
M-RAM64KB
.b
M-RAM64KB
.a
R W R W R W R WW
Reg=3 Reg=5r3.r0-3 r3.g0-3 r3.b0-3 r5.a0-3
512 Bits
Possible to shrink this?Some programs use much fewer registersEx) photon uses 4, can replace below with 16
x M4k
23
A GPU-Like Soft Processor
Multi-threading
24
Next Type: ALU, TEX, ALU, TEX, ALU, TEX…
Context RAM
True False FalseTrue
ALUReady
ALUReady Executing
TEXWaiting
0 1 2 3
CurrentBatch States
Next Batch ( to decode) Instruction (to decode)
= 0Next Batch
Rotate
Priority Encode
Last Batch = 2
False True True False3 0 1 2
Highest Priority
Instruction RAM
ProgramCounter[0 ]
A GPU-Like Soft Processor
Implementation
Assume an Xtremedata XD1000 system (or similar)FPGA plugs into second socket on dual CPU boardCommunication with CPU via HyperTransport 16-bit wide, 400 Mhz DDR interface 1.6GB/sec
CPU FPGA
System Memory
HT Master
HT Slave
Accelerator
xd1000Driver
Host Program
25
A GPU-Like Soft Processor
Estimating Memory Latency
26
A GPU-Like Soft Processor 27
ALU Utilization
Utilized Mem WaitNot ALU Inside ALU
Matmatmult is bottlenecked on load latency
A GPU-Like Soft Processor 28
Performance (16 Bit HT Link)
Number of Hardware Batch Contexts 2 4 8
16 32
64