Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Greg Stitt Dept. of ECE

University of Florida

This research was supported in part by the National Science Foundation and the Semiconductor

Research Corporation

Frank VahidDept. of CS&E

University of California, Riverside

Also with the Center for Embedded Computer Systems, UC Irvine

2/30

Binary Translation

VLIWµP

Background Motivated by commercial dynamic binary translation of early

2000s

x86Binary x86 VLIW

x86 VLIW FPGA

VLIWBinary

FPGAµP

Binary

Warp processing (Lysecky/Stitt/Vahid 2003-2007): dynamically translate binary to circuits on FPGAs

Performance

e.g., Transmeta Crusoe “code morphing”

Binary “Translation”

3/30

µP

FPGAOn-chip CAD

Warp Processing Background

Profiler

Initially, software binary loaded into instruction memory

11

I Mem

D$

Mov reg3, 0Mov reg4, 0loop:Shl reg1, reg3, 1Add reg5, reg2, reg1Ld reg6, 0(reg5)Add reg4, reg4, reg6Add reg3, reg3, 1Beq reg3, 10, -5Ret reg4

Software Binary

4/30

µP

FPGAOn-chip CAD


ProfilerI Mem

D$


Software BinaryMicroprocessor executes

instructions in software binary

22

Time EnergyµP

5/30

µP

FPGAOn-chip CAD


Profiler

µP

I Mem

D$


Software BinaryProfiler monitors instructions

and detects critical regions in binary

33

Time Energy

Profiler

add

add

add

add

add

add

add

add

add

add

beq

beq

beq

beq

beq

beq

beq

beq

beq

beq

Critical Loop Detected

6/30

µP

FPGAOn-chip CAD


Profiler

µP

I Mem

D$


Software BinaryOn-chip CAD reads in critical

region44

Time Energy

Profiler

On-chip CAD

7/30

µP

FPGADynamic Part. Module (DPM)


Profiler

µP

I Mem

D$


Software BinaryOn-chip CAD decompiles critical

region into control data flow graph (CDFG)

55

Time Energy

Profiler

On-chip CAD

loop:reg4 := reg4 + mem[ reg2 + (reg3 << 1)]reg3 := reg3 + 1if (reg3 < 10) goto loop

ret reg4

reg3 := 0reg4 := 0

Decompilation surprisingly effective at recovering high-level program structures Stitt et al ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07

Recover loops, arrays, subroutines, etc. – needed to synthesize good circuits

8/30

µP



Profiler

µP

I Mem

D$


Software BinaryOn-chip CAD synthesizes

decompiled CDFG to a custom (parallel) circuit

66

Time Energy

Profiler

On-chip CAD


ret reg4

reg3 := 0reg4 := 0+ + ++ ++

+ ++

+

+

+

. . .

. . .

. . .

9/30

µP



Profiler

µP

I Mem

D$


Software BinaryOn-chip CAD maps circuit onto

FPGA77

Time Energy

Profiler

On-chip CAD


ret reg4

reg3 := 0reg4 := 0+ + ++ ++

+ ++

+

+

+

. . .

. . .

. . .

CLB

CLB

SM

SM

SM

SM

SM

SM

SM

SM

SM

SM

SM

SM

++

FPGA

Lean place&route/FPGA 10x faster CAD (Lysecky et al DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06)

Multi-core chips – use 1 powerful core for CAD

10/30

µP



Profiler

µP

I Mem

D$


Software Binary88

Time Energy

Profiler

On-chip CAD


ret reg4

reg3 := 0reg4 := 0+ + ++ ++

+ ++

+

+

+

. . .

. . .

. . .

CLB

CLB

SM

SM

SM

SM

SM

SM

SM

SM

SM

SM

SM

SM

++

FPGA

On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more

Mov reg3, 0Mov reg4, 0loop:// instructions that interact with FPGA

Ret reg4

FPGA

Time Energy

Software-only“Warped”

>10x speedups for some apps

11/30

Warp Scenarios

µP

Time

µP (1st execution)

Time

On-chip CAD

µP FPGA

Speedup

Long Running Applications Recurring Applications

Long-running applications Scientific computing, etc.

Recurring applications (save FPGA configurations) Common in embedded systems Might view as (long) boot phase

On-chip CAD

Single-execution speedup

FPGA

Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...

Warping takes time – when useful?

12/30

µP

Thread Warping - Overview

FPGAµPµP

µP

OS

µP

f()

f()f()

Compiler

Binary

for (i = 0; i < 10; i++) {

thread_create( f, i );

}

f()

µP

On-chip CAD

Acc. Lib

f() f()

OS schedules threads onto available µPs

Remaining threads added to queue

OS invokes on-chip CAD tools to create accelerators for f()

OS schedules threads onto accelerators (possibly dozens), in addition to µPs

Thread warping: use one core to create accelerator for waiting threads

Very large speedups possible – parallelism at bit, arithmetic, and now thread level too

x86 VLIW TWrp

PerformanceMulti-core platforms multi-threaded apps

13/30

Decompilation

Memory Access Synchronization

High-level Synthesis

Thread Functions

Netlist

Binary Updater

Updated Binary

Hw/Sw Partitioning

Hw Sw

Thread Group Table

Thread Warping Tools Invoked by OS Uses pthread library (POSIX)

Mutex/semaphore for synchronization

Defined methods/algorithms of a thread warping framework

Accelerator Instantiation

Thread Queue

Thread Functions

Thread Counts

Accelerator Synthesis

Accelerator Library

FPGA

Not In Library?

DoneAccelerators Synthesized?

Queue Analysis

falsefalse

true true

Updated Binary

Schedulable Resource List

Place&RouteThread Group Table

NetlistBitfile

On-chip CAD

FPGA

µP



14/30

Must deal with widely known memory bottleneck problem FPGAs great, but often can’t get data to them fast enough

void f( int a[], int val ){ int result; for (i = 0; i < 10; i++) { result += a[i] * val; } . . . . }

Memory Access Synchronization (MAS)

Same array

FPGA b()a()

RAMData for dozens of threads can create bottleneck

for (i = 0; i < 10; i++) {

thread_create( thread_function, a, i );

}DMA

Threaded programs exhibit unique feature: Multiple threads often access same data

Solution: Fetch data once, broadcast to multiple threads (MAS)

….

15/30


1) Identify thread groups – loops that create threads

for (i = 0; i < 100; i++) { thread_create( f, a, i );}

void f( int a[], int val ){ int result; for (i = 0; i < 10; i++) { result += a[i] * val; } . . . . }

Thread Group

Def-Use: a is constant for all threads

Addresses of a[0-9] are constant for thread group

f() f() ……………… f()

DMA RAMA[0-9]

A[0-9] A[0-9] A[0-9]

Before MAS: 1000 memory accesses

After MAS: 100 memory accesses

Data fetched once, delivered to entire group

2) Identify constant memory addresses in thread function Def-use analysis of parameters to thread function

3) Synthesis creates a “combined” memory access Execution synchronized by OS

enable (from OS)

16/30


Also detects overlapping memory regions – “windows”

void f( int a[], int i ){ int result; result += a[i]+a[i+1]+a[i+2]+a[i+3]; . . . . }

for (i = 0; i < 100; i++) { thread_create( thread_function, a, i );}

a[0] a[1] a[2] a[3] a[4] a[5] ………

f() f() ……………… f()

DMA RAMA[0-103]

A[0-3] A[1-4] A[6-9]

Data streamed to “smart buffer”

Smart Buffer

Buffer delivers window to each thread

W/O smart buffer: 400 memory accessesWith smart buffer: 104 memory accesses

Synthesis creates extended “smart buffer” [Guo/Najjar FPGA04]

Caches reused data, delivers windows to threads

Each thread accesses different addresses – but addresses may overlap

enable

17/30

Framework

Accelerator Instantiation

Thread Queue

Thread Functions

Thread Counts


Accelerator Library

FPGA

Not In Library? Done

Accelerators Synthesized?

Queue Analysis

falsefalse

true true

Updated Binary

Schedulable Resource List

Place&RouteThread Group Table

NetlistBitfile

Also developed initial algorithms for: Queue analysis Accelerator

instantiation OS scheduling

of threads to accelerators and cores

18/30

Thread Warping Example

int main( ){ . . . . . . for (i=0; i < 50; i++) { thread_create( filter, a, b, i ); } . . . . . .}void filter( int a[51], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] );}

FPGAµPµP

µP

OS

µP

main() filter()

µP

On-chip CAD filter()

Thread Queue

Queue Analysis

Thread functions: filter()

filter() threads execute on available cores

Remaining threads added to queue

OS invokes CAD (due to queue size or periodically)

CAD tools identify filter() for synthesis

19/30

Example

int main( ){ . . . . . . for (i=0; i < 50; i++) { thread_create( filter, a, b, i ); } . . . . . .}void filter( int a[51], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] );}

FPGAµPµP

µP

OS

µP

main() filter()

µP


filter() binary

Decompilation

CDFG


MAS detects overlapping windows

MAS detects thread group

CAD reads filter() binary

20/30

Example

int main( ){ . . . . . . for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } . . . . . .}void filter( int a[51], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] );}

FPGAµPµP

µP

OS

µP

main() filter()

µP


filter() binary

Decompilation

CDFG


High-level Synthesis

+ +

+

>>2

filter() filter(). . . . .

Smart Buffer

RAM

RAM Accelerator Library

filter filter

Synthesis creates pipelined accelerator for filter() group: 8 accelerators

Stored for future use

Accelerators loaded into FPGA

21/30

Example

int main( ){ . . . . . . for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } . . . . . .}void filter( int a[53], int b[50], int i, ) { b[i][=avg( a[i], a[i+1], a[i+2], a[i+3] );}

FPGAµPµP

µP

OS

µP

main() filter()

µP



Smart Buffer

RAM

RAM

filter filter

a[0-52]

a[2-5] a[9-12]

Smart buffer streams a[] data

After buffer fills, delivers a window to all eight accelerators

OS schedules threads to accelerators

enable (from OS)

22/30

Example


FPGAµPµP

µP

OS

µP

main() filter()

µP



Smart Buffer

RAM

RAM

filter filter

a[0-53]

a[10-13] a[17-20]

Each cycle, smart buffer delivers eight more windows – pipeline remains full

23/30

Example


FPGAµPµP

µP

OS

µP

main() filter()

µP



Smart Buffer

RAM

RAM

filter filter

a[0-53]

b[2-9]Accelerators create 8 outputs after pipeline latency passes

24/30

Example


FPGAµPµP

µP

OS

µP

main() filter()

µP



Smart Buffer

RAM

RAM

filter filter

a[0-53]

b[10-17]

Thread warping: 8 pixel outputs per cycle Software: 1 pixel output every ~9 cycles72x cycle count improvement

Additional 8 outputs each cycle

25/30

Experiments to Determine Thread Warping Performance: Simulator Setup

main

filter filter filter…………

main

……

Parallel Execution Graph (PEG) – represents thread level parallelism

Nodes: Sequential execution blocks (SEBs)

Edges: pthread calls

Generate PEG using pthread wrappers

Determine SEB performancesSw: SimpleScalarHw: Synthesis/simulation (Xilinx)

Event-driven simulation – use defined algoritms to change architecture dynamically

Simulation Summary

Complete when all SEBs simulated Observe total cycles

1)

2)

3)

4)

Optimistic for Sw execution (no memory contention)

Pessimistic for warped execution (accelerators/microprocessors execute exclusively)

int main( ){ . . . . . . for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } . . . . . .}

26/30

Experiments

Benchmarks: Image processing, DSP, scientific computing Highly parallel examples to illustrate thread warping potential We created multithreaded versions

Base architecture – 4 ARM cores Focus on recurring applications (embedded)

TW: FPGA running at whatever frequency determined by synthesis

On-chip CAD

FPGAµP

µP

µP

µP

4 ARM11 400 MHz

Compared to

4 ARM11 400 MHz + FPGA (synth freq)

Multi-core Thread Warping

µP

µP

µP

µP

27/30

Speedup from Thread Warping

Average 130x speedup

130 502 63 130 38308

01020304050

Fir Prewitt Linear Moravec Wavelet Maxfilter 3DTrans N-body Avg. Geo.Mean

4-uP

TW

8-uP

16-uP

32-uP

64-uP

11x faster than 64-core system Simulation pessimistic, actual results likely better

But, FPGA uses additional area

So we also compare to systems with 8 to 64 ARM11 uPs – FPGA size = ~36 ARM11s

28/30

Limitations

Dependent on coding practices Assumes boss/worker thread model Not all apps amenable to FPGA

speedup Commercial CAD slow – warping

takes time But in worst case, FPGA just not

used by application

29/30

Why not Partition Statically?

Static good, but hiding FPGA opens technique to all sw platforms

Standard languages/tools/binaries

On-chip CAD

FPGA

µP

Any Compiler

FPGA

µP

Specialized Compiler

Binary Netlist Binary

Specialized Language Any Language

Static Thread Synthesis Dynamic Thread Warping

Can add FPGA without changing binaries – like expanding memory, or adding processors to multiprocessor

Can adapt to changing workloads Smaller & more accelerators, fewer & large accelerators, ...

Memory-access synchronization applicable to static approach

30/30

Conclusions

Thread warping framework dynamically synthesizes accelerators for thread functions

Memory Access Synchronization Helps reduce memory bottleneck problem

130x speedups for chosen examples Future work

Handle wider variety of coding constructs Improve for different thread models Numerous open problems

E.g., dynamic reallocation of FPGA resources

Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Documents

Transcript of Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators