Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Greg Stitt, Dept. of ECE, University of Florida

Frank Vahid, Dept. of CS&E, University of California, Riverside (also with the Center for Embedded Computer Systems, UC Irvine)

This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation.


Transcript of Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Page 1: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Greg Stitt
Dept. of ECE
University of Florida

Frank Vahid
Dept. of CS&E
University of California, Riverside
Also with the Center for Embedded Computer Systems, UC Irvine

This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation.

Page 2: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Background

Motivated by commercial dynamic binary translation of the early 2000s, e.g., Transmeta Crusoe "code morphing":
x86 binary -> binary translation -> VLIW binary -> VLIW µP

Warp processing (Lysecky/Stitt/Vahid 2003-2007): dynamically translate binary to circuits on FPGAs:
binary -> binary "translation" -> µP + FPGA

[Figure: performance bars comparing x86 and VLIW, then x86, VLIW, and FPGA.]

Page 3: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 1: Initially, software binary loaded into instruction memory.

Software binary:

    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

[Figure: µP with I Mem and D$, profiler, FPGA, and on-chip CAD.]

Page 4: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 2: Microprocessor executes instructions in software binary.

[Figure: same platform diagram; time and energy bars for the µP.]

Page 5: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 3: Profiler monitors instructions and detects critical regions in binary.

[Figure: profiler observing a stream of add and beq instructions; critical loop detected.]
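One way to picture the detection step is as a table that counts how often each backward branch is taken. The sketch below is an illustration under stated assumptions, not the profiler from the warp processing papers: the table size, the threshold parameter, and the names `profiler_t`, `profiler_observe`, and `profiler_hottest` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILER_SLOTS 16   /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILER_SLOTS]; /* loop-head address of each tracked branch */
    uint32_t count[PROFILER_SLOTS];  /* times that backward branch was taken */
    size_t used;
} profiler_t;

/* Record one taken branch. Only backward branches (target <= pc) are counted,
   since those are the ones that close loops. */
void profiler_observe(profiler_t *p, uint32_t pc, uint32_t target)
{
    if (target > pc)
        return;                      /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILER_SLOTS) {  /* new loop head; ignored once the table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the most frequently reached loop head if its count meets the
   threshold, or 0 when no loop is hot enough yet (address 0 is assumed
   never to be a loop head). */
uint32_t profiler_hottest(const profiler_t *p, uint32_t threshold)
{
    uint32_t best = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best_count >= threshold ? best : 0;
}
```

In this model the µP (or a bus monitor) would call `profiler_observe` for every taken branch, and the on-chip CAD would poll `profiler_hottest` to pick the critical region.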

Page 6: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 4: On-chip CAD reads in critical region.

[Figure: same platform diagram; profiler passes the critical region to on-chip CAD.]

Page 7: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG).

    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Decompilation surprisingly effective at recovering high-level program structures (Stitt et al. ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07).

Recover loops, arrays, subroutines, etc., which are needed to synthesize good circuits.

[Figure: platform diagram, with the Dynamic Part. Module (DPM) shown alongside the FPGA.]
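Read as C, the recovered CDFG is a ten-iteration array sum. The rendering below is only an illustration: the function name `sum10` is hypothetical, and the signed 16-bit element type is an assumption inferred from the shift by 1 (index times two bytes).

```c
#include <stdint.h>

/* reg2 holds the array base, reg3 the index, reg4 the running sum. */
int32_t sum10(const int16_t *a)
{
    int32_t sum = 0;                 /* reg4 := 0 */
    for (int i = 0; i < 10; i++)     /* reg3 := 0; ...; if (reg3 < 10) goto loop */
        sum += a[i];                 /* reg4 := reg4 + mem[reg2 + (reg3 << 1)] */
    return sum;                      /* ret reg4 */
}
```

Recovering the loop bound and the array access pattern in this form is what lets synthesis unroll the loop into parallel adders instead of emulating the instruction sequence.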

Page 8: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit.

[Figure: the decompiled loop becomes a tree of adders.]

Page 9: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 7: On-chip CAD maps circuit onto FPGA.

Lean place&route/FPGA: 10x faster CAD (Lysecky et al. DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06).

Multi-core chips: use 1 powerful core for CAD.

[Figure: adder tree mapped onto an FPGA fabric of CLBs and SMs.]

Page 10: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Processing Background

Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more.

Updated binary:

    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4

>10x speedups for some apps.

[Figure: time and energy bars, software-only vs. "warped".]

Page 11: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Warp Scenarios

Warping takes time. When is it useful?

Long-running applications: scientific computing, etc.

Recurring applications (save FPGA configurations): common in embedded systems; might view as (long) boot phase.

Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...

[Figure: execution timelines for long-running and recurring applications, showing µP execution, on-chip CAD, FPGA execution, and the resulting speedup (single-execution speedup in the long-running case).]

Page 12: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Thread Warping - Overview

Multi-core platforms, multi-threaded apps.

    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }

OS schedules threads onto available µPs.
Remaining threads added to queue.
OS invokes on-chip CAD tools to create accelerators for f().
OS schedules threads onto accelerators (possibly dozens), in addition to µPs.

Thread warping: use one core to create accelerator for waiting threads.

Very large speedups possible: parallelism at bit, arithmetic, and now thread level too.

[Figure: compiler produces the binary; multi-core chip with µPs, OS, FPGA, on-chip CAD, and accelerator library (Acc. Lib), with f() threads on µPs and on the FPGA; performance bars for x86, VLIW, and TWrp.]
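With POSIX threads, the creation loop on this slide might look like the sketch below. `thread_create` on the slide is shorthand; `pthread_create` and `pthread_join` are the real POSIX calls, while the worker body, the `results` array, and the name `run_group` are placeholders invented for illustration.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 10

static int results[NUM_THREADS];

/* Stand-in for f(): each thread squares its own index. */
static void *f(void *arg)
{
    int i = (int)(intptr_t)arg;
    results[i] = i * i;
    return NULL;
}

/* Boss thread: create the group, wait for every worker, sum the results.
   Returns -1 if any pthread call fails. */
int run_group(void)
{
    pthread_t tid[NUM_THREADS];
    int created = 0, sum = 0, ok = 1;

    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&tid[i], NULL, f, (void *)(intptr_t)i) != 0) {
            ok = 0;
            break;
        }
        created++;
    }
    for (int i = 0; i < created; i++)
        if (pthread_join(tid[i], NULL) != 0)
            ok = 0;
    if (!ok)
        return -1;
    for (int i = 0; i < NUM_THREADS; i++)
        sum += results[i];
    return sum;
}
```

Built with a POSIX toolchain (for example `cc -pthread`), `run_group()` returns 285, the sum of the squares of 0 through 9. A loop of this shape, creating many threads that all run the same function, is what the framework looks for.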

Page 13: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Thread Warping Tools

Invoked by OS.
Uses pthread library (POSIX); mutex/semaphore for synchronization.
Defined methods/algorithms of a thread warping framework.

[Figure: framework flowchart. Boxes: Thread Queue, Queue Analysis, Thread Functions, Thread Counts, Not In Library?, Accelerator Library, Accelerator Synthesis, Accelerators Synthesized?, Accelerator Instantiation, Place&Route, Netlist, Bitfile, Thread Group Table, Schedulable Resource List, Updated Binary, Done. On-chip CAD runs on a µP and targets the FPGA.]

[Figure: Accelerator Synthesis detail. Boxes: Thread Functions, Decompilation, Hw/Sw Partitioning (Hw, Sw), Memory Access Synchronization, High-level Synthesis, Netlist, Binary Updater, Updated Binary, Thread Group Table.]

Page 14: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Memory Access Synchronization (MAS)

Must deal with widely known memory bottleneck problem: FPGAs great, but often can't get data to them fast enough.

    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }

    void f( int a[], int val )
    {
      int result = 0;
      for (int i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      . . . .
    }

Data for dozens of threads can create bottleneck.

Threaded programs exhibit unique feature: multiple threads often access same data (here, the same array a).

Solution: fetch data once, broadcast to multiple threads (MAS).

[Figure: FPGA holding accelerators a() and b(), fed from RAM through DMA.]

Page 15: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Memory Access Synchronization (MAS)

1) Identify thread groups: loops that create threads.

    for (i = 0; i < 100; i++) {
      thread_create( f, a, i );
    }

2) Identify constant memory addresses in thread function, using def-use analysis of parameters to thread function.

    void f( int a[], int val )
    {
      int result;
      for (i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      . . . .
    }

Def-use: a is constant for all threads, so addresses of a[0-9] are constant for thread group.

3) Synthesis creates a "combined" memory access. Execution synchronized by OS (enable from OS). Data fetched once, delivered to entire group.

Before MAS: 1000 memory accesses
After MAS: 100 memory accesses

[Figure: DMA reads a[0-9] from RAM and delivers it to every f() in the thread group.]

Page 16: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Memory Access Synchronization (MAS)

Also detects overlapping memory regions ("windows").

    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }

    void f( int a[], int i )
    {
      int result;
      result += a[i]+a[i+1]+a[i+2]+a[i+3];
      . . . .
    }

Each thread accesses different addresses, but addresses may overlap.

Synthesis creates extended "smart buffer" [Guo/Najjar FPGA04]: caches reused data, delivers windows to threads.

Data streamed to "smart buffer"; buffer delivers window to each thread.

W/O smart buffer: 400 memory accesses
With smart buffer: 104 memory accesses

[Figure: DMA streams a[0-103] from RAM into the smart buffer, which on enable delivers windows such as a[0-3], a[1-4], ..., a[6-9] to the f() accelerators.]

Page 17: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Framework

Also developed initial algorithms for:
Queue analysis
Accelerator instantiation
OS scheduling of threads to accelerators and cores

[Figure: framework flowchart, as on slide 13.]

Page 18: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Thread Warping Example

    int main( )
    {
      . . . . . .
      for (i = 0; i < 50; i++) {
        thread_create( filter, a, b, i );
      }
      . . . . . .
    }

    void filter( int a[53], int b[50], int i )
    {
      b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] );
    }

filter() threads execute on available cores.
Remaining threads added to queue.
OS invokes CAD (due to queue size or periodically).
CAD tools identify filter() for synthesis (queue analysis of the thread queue yields thread functions: filter()).

[Figure: main() and filter() running on µPs, on-chip CAD on another µP, further filter() threads waiting in the thread queue.]

Page 19: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Example

CAD reads filter() binary.
Decompilation produces CDFG.
Memory Access Synchronization: MAS detects thread group; MAS detects overlapping windows.

[Figure: same platform diagram; on-chip CAD flow from filter() binary through decompilation and CDFG to Memory Access Synchronization.]

Page 20: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Example

Synthesis creates pipelined accelerator for filter() group: 8 accelerators.
Accelerators stored in accelerator library for future use.
Accelerators loaded into FPGA.

[Figure: on-chip CAD flow from filter() binary through decompilation, CDFG, Memory Access Synchronization, and high-level synthesis; each accelerator is two adders feeding a third adder and a >>2 shift; filter() accelerators on the FPGA connected to a smart buffer and RAM.]
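The datapath in the figure (two adders feeding a third, then a right shift by 2) corresponds to a four-element average. Below is a minimal C model, assuming `avg` is a truncating integer mean over non-negative pixel values; the unsigned element type and the names `filter_window` and `filter_all` are assumptions made for this sketch.

```c
/* One accelerator's datapath: (w0 + w1) + (w2 + w3), then >> 2 to divide by 4.
   Unsigned values keep the shift well defined. */
unsigned filter_window(const unsigned w[4])
{
    unsigned s01 = w[0] + w[1];     /* first-level adder */
    unsigned s23 = w[2] + w[3];     /* first-level adder */
    return (s01 + s23) >> 2;        /* second-level adder, then shift */
}

/* Software reference for the whole thread group: b[i] = avg(a[i..i+3]). */
void filter_all(const unsigned a[53], unsigned b[50])
{
    for (int i = 0; i < 50; i++)
        b[i] = filter_window(&a[i]);
}
```

Because each output depends only on its own window, the eight accelerators can compute eight values of b[] side by side, and pipelining lets a new set of windows enter every cycle.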

Page 21: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Example

Smart buffer streams a[] data (a[0-52]).
After buffer fills, delivers a window to all eight accelerators (a[2-5] through a[9-12]).
OS schedules threads to accelerators (enable from OS).

[Figure: RAM feeding the smart buffer, which feeds the filter() accelerators on the FPGA.]
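The saving from the smart buffer can be checked by counting reads. The model below is a sketch, not the buffer design of Guo/Najjar: it assumes the buffer reads each element of a[] from RAM exactly once, in order, and hands out a 4-element window as soon as one is complete. All names are hypothetical.

```c
#include <stddef.h>

#define WINDOW 4

typedef struct {
    size_t ram_reads;       /* elements fetched from RAM by the buffer */
    size_t windows;         /* windows delivered to accelerators */
} stream_stats_t;

/* Stream a[0..n-1] through a WINDOW-deep shift register. Each time the
   register is full, the window starting at i-WINDOW+1 is passed to emit()
   (emit may be NULL when only the counts are wanted). */
stream_stats_t smart_buffer_stream(const int *a, size_t n,
                                   void (*emit)(size_t first, const int win[WINDOW]))
{
    stream_stats_t st = {0, 0};
    int win[WINDOW] = {0};

    for (size_t i = 0; i < n; i++) {
        for (int k = 0; k < WINDOW - 1; k++)   /* shift out the oldest element */
            win[k] = win[k + 1];
        win[WINDOW - 1] = a[i];                /* one RAM read per element */
        st.ram_reads++;
        if (i + 1 >= WINDOW) {
            if (emit)
                emit(i + 1 - WINDOW, win);
            st.windows++;
        }
    }
    return st;
}

/* Reads needed when every thread fetches its own window from RAM. */
size_t naive_reads(size_t threads)
{
    return threads * WINDOW;
}
```

For the filter() example, streaming the 53 elements of a[] yields 50 windows from 53 reads, versus 200 reads if each of the 50 threads fetched its own four elements.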

Page 22: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Example

Each cycle, smart buffer delivers eight more windows (a[10-13] through a[17-20]); pipeline remains full.

[Figure: RAM streaming a[0-52] into the smart buffer, which feeds the filter() accelerators on the FPGA.]

Page 23: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Example

Accelerators create 8 outputs (b[2-9]) after pipeline latency passes.

[Figure: filter() accelerators writing b[2-9] to RAM while the smart buffer continues to stream a[0-52].]

Page 24: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Example

Additional 8 outputs each cycle (b[10-17]).

Thread warping: 8 pixel outputs per cycle
Software: 1 pixel output every ~9 cycles
72x cycle count improvement
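The 72x figure follows from the two throughput numbers on this slide, taking the software cost as exactly 9 cycles per output (the slide says approximately 9) and ignoring pipeline fill latency:

```latex
\text{cycle-count improvement}
  = \frac{8\ \text{outputs/cycle}}{\tfrac{1}{9}\ \text{outputs/cycle}}
  = 8 \times 9 = 72
```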

Page 25: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Experiments to Determine Thread Warping Performance: Simulator Setup

Simulation summary:

1) Generate PEG using pthread wrappers.
   Parallel Execution Graph (PEG): represents thread-level parallelism.
   Nodes: sequential execution blocks (SEBs). Edges: pthread calls.
2) Determine SEB performances. Sw: SimpleScalar. Hw: synthesis/simulation (Xilinx).
3) Event-driven simulation: use defined algorithms to change architecture dynamically.
4) Complete when all SEBs simulated; observe total cycles.

Optimistic for Sw execution (no memory contention).
Pessimistic for warped execution (accelerators/microprocessors execute exclusively).

[Figure: PEG for the filter example: main, then parallel filter nodes, then main.]
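As a rough picture of what such a simulation computes, the sketch below assigns a list of sequential execution blocks to whichever resource becomes free first and reports total cycles. The greedy policy, the fixed resource limit, and the names `simulate_cycles` and `MAX_RESOURCES` are assumptions for illustration; the slides do not spell out the scheduling algorithm, and dependencies between SEBs are ignored here.

```c
#include <stddef.h>

#define MAX_RESOURCES 64    /* hypothetical upper bound on cores + accelerators */

/* seb_cycles[i] is the cycle count of independent SEB i. Returns the cycle at
   which the last SEB finishes, or 0 if the resource count is out of range. */
unsigned long simulate_cycles(const unsigned long *seb_cycles, size_t n_sebs,
                              size_t n_resources)
{
    unsigned long free_at[MAX_RESOURCES] = {0};
    unsigned long finish = 0;

    if (n_resources == 0 || n_resources > MAX_RESOURCES)
        return 0;

    for (size_t i = 0; i < n_sebs; i++) {
        size_t r = 0;
        for (size_t k = 1; k < n_resources; k++)   /* find earliest-free resource */
            if (free_at[k] < free_at[r])
                r = k;
        free_at[r] += seb_cycles[i];               /* run SEB i on it */
        if (free_at[r] > finish)
            finish = free_at[r];
    }
    return finish;
}
```

With 50 SEBs of 9 cycles each on 4 resources, this model returns 117 cycles (13 rounds of 9 cycles on the busiest resource).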

Page 26: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Experiments

Benchmarks: image processing, DSP, scientific computing.
Highly parallel examples to illustrate thread warping potential.
We created multithreaded versions.

Base architecture: 4 ARM cores.
Focus on recurring applications (embedded).

TW: FPGA running at whatever frequency determined by synthesis.

Compared:
Multi-core: 4 ARM11 at 400 MHz
Thread warping: 4 ARM11 at 400 MHz + FPGA (synthesis frequency), with on-chip CAD

Page 27: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Speedup from Thread Warping

Average 130x speedup.
11x faster than 64-core system.
Simulation pessimistic; actual results likely better.

But FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (FPGA size = ~36 ARM11s).

[Figure: speedup bar chart. Benchmarks: Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, Avg., Geo. Mean. Series: 4-uP, 8-uP, 16-uP, 32-uP, 64-uP, TW. Y-axis 0 to 50; off-scale bar labels as extracted: 130 502 63 130 38308.]

Page 28: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Limitations

Dependent on coding practices: assumes boss/worker thread model.
Not all apps amenable to FPGA speedup.
Commercial CAD slow: warping takes time.
But in worst case, FPGA just not used by application.

Page 29: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Why not Partition Statically?

Static good, but hiding FPGA opens technique to all sw platforms: standard languages/tools/binaries.

Static thread synthesis: specialized language -> specialized compiler -> binary + netlist -> µP + FPGA
Dynamic thread warping: any language -> any compiler -> binary -> µP + FPGA with on-chip CAD

Can add FPGA without changing binaries, like expanding memory or adding processors to multiprocessor.

Can adapt to changing workloads: smaller & more accelerators, fewer & larger accelerators, ...

Memory-access synchronization applicable to static approach.

Page 30: Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Conclusions

Thread warping framework dynamically synthesizes accelerators for thread functions.
Memory Access Synchronization helps reduce memory bottleneck problem.
130x speedups for chosen examples.

Future work:
Handle wider variety of coding constructs.
Improve for different thread models.
Numerous open problems, e.g., dynamic reallocation of FPGA resources.