A Code Refinement Methodology for Performance-Improved Synthesis from C
Greg Stitt, Frank Vahid*, Walid Najjar
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This research is supported in part by the National Science Foundation and the Semiconductor Research Corporation
2/23
Introduction
[Figure: H.264 decoder functions (motionComp(), filterLuma(), filterChroma(), deblocking(), ...) go through a compiler to the uP and through synthesis to the FPGA; a critical region is selected for hardware]
Previous work: In-depth hw/sw partitioning study of H.264 decoder
Collaboration with Freescale
3/23
Introduction
Previous work: In-depth hw/sw partitioning study of H.264 decoder, in collaboration with Freescale
[Chart: speedup (0-10) vs. number of functions in hardware (1-52), comparing speedup from C partitioning against ideal speedup (zero-time hw execution)]
Large gap between ideal and actual speedup
Obtained 2.5x speedup
4/23
Introduction
Noticed coding constructs/practices limited hw speed
Identified problematic coding constructs
Developed simple coding guidelines
Dozens of lines of code; minutes per guideline
Refined critical regions using guidelines
[Figure: applying the guidelines to the H.264 functions (motionComp(), filterLuma(), filterChroma(), deblocking(), ...) yields refined versions (motionComp'(), deblocking'(), ...), which then go through hw/sw partitioning]
5/23
Introduction
Noticed coding constructs/practices limited hw speed
Identified problematic coding constructs
Developed simple coding guidelines
Dozens of lines of code; minutes per guideline
Refined critical regions using guidelines
[Chart: speedup (0-10) vs. number of functions in hardware (1-51), comparing speedup after rewrite (C partitioning) against ideal speedup (zero-time hw execution)]
Simple guidelines increased speedup to 6.5x
Can simple coding guidelines show similar improvements on other applications?
6/23
Coding Guidelines
Analyzed dozens of benchmarks
Identified common problems related to synthesis
Developed 10 guidelines to fix problems
Although some are well known, analysis shows they are rarely applied
Automation unlikely or impossible in many cases
The 10 guidelines:
Conversion to Constants (CC)
Conversion to Fixed Point (CF)
Conversion to Explicit Data Flow (CEDF)
Conversion to Explicit Memory Accesses (CEMA)
Function Specialization (FS)
Constant Input Enumeration (CIE)
Loop Rerolling (LR)
Conversion to Explicit Control Flow (CECF)
Algorithmic Specialization (AS)
Pass-By-Value Return (PVR)
7/23
Fast Refinement
Several dozen lines of code provide most performance improvement
Refining takes minutes/hours
Profiling results (sample application):
Name         %Time
IDCT         65%
Compress     11%
FIR          10%
Dequantize   3%
Sort         2%
Memset       1%
Matrix       1%
Search       0.5%
ReadInput    0.5%
WriteOutput  0.5%
Brev         0.5%
. . .        . . .
Only several performance-critical regions; apply guidelines to only the critical regions
8/23
Conversion to Constants (CC)
Problem: Arrays of constants commonly not specified as constants
  Initialized at runtime
Guideline: Use constant wrapper function
  Specifies array as constant for all future functions
Automation: Difficult, requires global def-use/alias analysis

Original code:

  int coef[100];
  void initCoef() { // initialize coef
  }
  void fir() { // fir filter using coef
  }
  void f() {
    initCoef();
    // other code
    fir();
  }

Refined code with a constant wrapper:

  int coef[100];
  void initCoef() { // initialize coef
  }
  void fir(const int array[100]) { // fir filter using const array
  }
  void constWrapper(const int array[100]) {
    prefetchArray( array );
    // misc code
    . . .
    fir(array);
  }
  void f() {
    initCoef();
    constWrapper(coef);
  }

Can also enable constant folding. Since the array can't change inside the wrapper, prefetching won't violate dependencies.
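The wrapper pattern can be sketched concretely as below. This is a minimal illustrative example, not the paper's benchmark code: the 4-tap coefficients, initCoef(), fir(), and constWrapper() names follow the slide, but the filter computation itself is assumed.

```c
/* Hypothetical 4-tap FIR coefficients, initialized once at startup. */
int coef[4];

void initCoef(void) {
    /* Runtime initialization hides the fact that coef never changes. */
    coef[0] = 1; coef[1] = 2; coef[2] = 2; coef[3] = 1;
}

/* FIR written against a const array: every callee from here on sees
 * the coefficients as constants, which synthesis can exploit. */
int fir(const int c[4], const int x[4]) {
    int acc = 0;
    for (int i = 0; i < 4; i++)
        acc += c[i] * x[i];
    return acc;
}

/* Constant wrapper: the single place that converts the mutable global
 * into a const-qualified parameter for the rest of the call chain. */
int constWrapper(const int c[4], const int x[4]) {
    /* c cannot change below this point, so a tool could safely
     * prefetch it or constant-fold it into the datapath. */
    return fir(c, x);
}
```

After initCoef() runs, all downstream code accesses coef only through the const-qualified parameter.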
9/23
Conversion to Explicit Data Flow (CEDF)
Problem: Global variables make determination of parallelism difficult
  Requires global def-use/alias analysis
Guideline: Replace globals with extra parameters
  Makes data flow explicit; simpler analysis may expose parallelism
Automation: Has been proposed [Lee01], but difficult because of aliases

Original code: a(), b(), c() must execute sequentially because of global array dependencies:

  int array[100];
  void a() {
    for (i=0; i < 100; i++)
      array[i] = . . . . .
  }
  void b() {
    for (i=0; i < 100; i++)
      array[i] = array[i]+f(i);
  }
  int c() {
    for (i=0; i < 100; i++)
      temp += array[i];
  }
  void d() {
    for (. . . . .) {
      a(); b(); c();
    }
  }

Refined code: a() and c() can execute in parallel after the 1st iteration:

  void a(int array[100]) {
    for (i=0; i < 100; i++)
      array[i] = . . . . .
  }
  void b(int array1[100], int array2[100]) {
    for (i=0; i < 100; i++)
      array2[i] = array1[i]+f(i);
  }
  int c(int array[100]) {
    for (i=0; i < 100; i++)
      temp += array[i];
  }
  void d() {
    int array1[100], array2[100];
    for (. . . . .) {
      a(array1); b(array1, array2); c(array2);
    }
  }
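A compilable sketch of the same idea, assuming illustrative stage bodies (stage_a/stage_b/stage_c and the fill/increment/sum computations are stand-ins, not the slide's elided code):

```c
#define N 8

/* Stage a: writes its output array (was: a global). */
void stage_a(int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = i;
}

/* Stage b: reads one array, writes another. */
void stage_b(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + 1;
}

/* Stage c: reduces its input array. */
int stage_c(const int in[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += in[i];
    return sum;
}

/* Explicit dataflow: a -> b -> c through named buffers. Because
 * stage_a writes only buf1 and stage_c reads only buf2, even a simple
 * analysis can see the stages touch disjoint data, so successive
 * invocations can be pipelined. */
int pipeline(void) {
    int buf1[N], buf2[N];
    stage_a(buf1);
    stage_b(buf1, buf2);
    return stage_c(buf2);
}
```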
10/23
Constant Input Enumeration (CIE)
Problem: Function parameters may limit parallelism
Guideline: Create enum for possible values
  Synthesis can create specialized functions
Automation: In some cases, def-use analysis may identify all inputs; in general, difficult due to aliases

Original code: bounds not known, hard to unroll; hardware computes one iteration of c[i][j] = i+j at a time:

  void f(int a, int b) {
    . . . .
    for (i=0; i < a; i++) {
      for (j=0; j < b; j++) {
        c[i][j] = i+j;
      }
    }
  }

Refined code: iterations can be parallelized in each specialized version (c[0][0], c[0][1], c[0][2], ... computed concurrently):

  enum PRM { VAL1=2, VAL2=4 };
  void f(enum PRM a, enum PRM b) {
    . . . .
    for (i=0; i < a; i++) {
      for (j=0; j < b; j++) {
        c[i][j] = i+j;
      }
    }
  }

Specialized versions: f(2,2), f(2,4), f(4,2), f(4,4)
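The specialization the slide describes can be sketched as below, with a hypothetical dispatcher showing how one enumerated input pair maps to a fully unrolled body (sum_ij, sum_ij_2x2, and sum_ij_enum are illustrative names, and only one specialization is written out):

```c
enum PRM { VAL2 = 2, VAL4 = 4 };

/* General version: bounds unknown at compile time, loops stay rolled. */
int sum_ij(int a, int b) {
    int total = 0;
    for (int i = 0; i < a; i++)
        for (int j = 0; j < b; j++)
            total += i + j;
    return total;
}

/* Specialized version for a == 2, b == 2: bounds are constants, so
 * every iteration is visible and could be evaluated in parallel. */
int sum_ij_2x2(void) {
    return (0 + 0) + (0 + 1) + (1 + 0) + (1 + 1);
}

/* Dispatcher over the enumerated inputs; remaining specializations
 * (2x4, 4x2, 4x4) are elided and fall back to the general version. */
int sum_ij_enum(enum PRM a, enum PRM b) {
    if (a == VAL2 && b == VAL2)
        return sum_ij_2x2();
    return sum_ij(a, b);
}
```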
11/23
Conversion to Explicit Control Flow (CECF)
Problem: Function pointers may prevent static control flow analysis
Guideline: Replace function pointer with if-else, static calls
  Makes possible targets explicit
Automation: In general, impossible (equivalent to the halting problem)

Original code: synthesis unlikely to determine possible targets of the function pointer:

  void f( int (*fp)(int) ) {
    . . . . .
    for (i=0; i < 10; i++) {
      a[i] = fp(i);
    }
  }

Refined code: synthesized hardware computes f1(i), f2(i), f3(i) and selects the result for a[i] with a 3x1 mux controlled by fp:

  enum Target { FUNC1, FUNC2, FUNC3 };
  void f( enum Target fp ) {
    . . . . .
    for (i=0; i < 10; i++) {
      if (fp == FUNC1) a[i] = f1(i);
      else if (fp == FUNC2) a[i] = f2(i);
      else a[i] = f3(i);
    }
  }
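A self-contained sketch of the transformation, with assumed target bodies (f1/f2/f3 and the fill() driver are illustrative; the slide leaves the targets unspecified):

```c
enum Target { FUNC1, FUNC2, FUNC3 };

static int f1(int i) { return i + 1; }
static int f2(int i) { return i * 2; }
static int f3(int i) { return -i; }

/* Before: a[i] = fp(i) through an opaque function pointer.
 * After: every possible target is a static call, so the complete call
 * graph is visible to analysis, and hardware can evaluate all three
 * targets and select one with a mux. */
void fill(int a[10], enum Target t) {
    for (int i = 0; i < 10; i++) {
        if (t == FUNC1)      a[i] = f1(i);
        else if (t == FUNC2) a[i] = f2(i);
        else                 a[i] = f3(i);
    }
}
```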
12/23
Algorithmic Specialization (AS)
Algorithms targeting sw may not be fast in hw
  Sequential vs. parallel
C code generally uses sw algorithms
Guideline: Specialize critical functions with hw algorithms
Automation: Requires higher-level specification (e.g., intrinsics)

Software algorithm (binary search, inherently sequential):

  int search(int a[], int k, int l, int r) {
    while (l <= r) {
      mid = (l+r)/2;
      if (k > a[mid]) l = mid+1;
      else if (k < a[mid]) r = mid-1;
      else return mid;
    }
    return -1;
  }

Hardware algorithm (linear search, can be parallelized in hardware):

  int search(int a[], int k, const int s) {
    for (i=0; i < s; i++) {
      if (a[i] == k) return i;
    }
    return -1;
  }
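Both variants from the slide can be made compilable as below (renamed search_binary/search_linear here only so the two versions can coexist in one file):

```c
/* Software algorithm: each step depends on the previous comparison,
 * so the iterations form a sequential chain. Assumes a[] is sorted. */
int search_binary(const int a[], int k, int l, int r) {
    while (l <= r) {
        int mid = (l + r) / 2;
        if (k > a[mid])      l = mid + 1;
        else if (k < a[mid]) r = mid - 1;
        else                 return mid;
    }
    return -1;
}

/* Hardware-friendly algorithm: all s comparisons are independent of
 * one another, so synthesized hardware can evaluate them in parallel
 * and reduce the results with a priority encoder. */
int search_linear(const int a[], int k, int s) {
    for (int i = 0; i < s; i++)
        if (a[i] == k)
            return i;
    return -1;
}
```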
13/23
Pass-By-Value Return (PVR)
Problem: Array parameters cannot be prefetched due to potential aliases
  Designer may know aliases don't exist
Guideline: Use pass-by-value-return
Automation: Requires global alias analysis

Original code: can't prefetch array for g(), may be aliased:

  void f(int *a, int *b, int array[16]) {
    . . . // unrelated computation
    g(array);
    . . . // unrelated computation
  }
  int g(int array[16]) {
    // computation done on array
  }

Refined code: the local array can't be aliased, so it can be prefetched:

  void f(int *a, int *b, int array[16]) {
    int localArray[16];
    memcpy(localArray, array, 16*sizeof(int));
    . . . // misc computation
    g(localArray);
    . . . // misc computation
    memcpy(array, localArray, 16*sizeof(int));
  }
  int g(int array[16]) {
    // computation done on array
  }
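A runnable sketch of pass-by-value-return, assuming an illustrative body for g() (the slide leaves its computation elided; doubling each element is purely for demonstration):

```c
#include <string.h>

/* Illustrative stand-in for the slide's g(): doubles each element. */
static void g(int array[16]) {
    for (int i = 0; i < 16; i++)
        array[i] *= 2;
}

void f(int *a, int *b, int array[16]) {
    int localArray[16];
    /* Copy in: localArray cannot alias a or b, so a synthesis tool may
     * prefetch it into fast local memory without violating dependencies. */
    memcpy(localArray, array, 16 * sizeof(int));
    g(localArray);
    /* Copy out: make the results visible through the original array. */
    memcpy(array, localArray, 16 * sizeof(int));
}
```

The two memcpy calls are the software overhead the methodology slide later weighs against the prefetching benefit.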
14/23
Why Synthesis From C? Why not use HDL?
HDL may yield better results, but C is the mainstream language
  Acceptable performance in many cases
  Learning HDL is a large overhead
Approaches are orthogonal; this work focuses on improving the mainstream
Guidelines are common for HDL and can also be applied to algorithmic HDL
15/23
Software Overhead
Refined regions may not be partitioned to hardware
  Partitioner may select non-refined regions
  OS may select software or hardware implementation, based on state of FPGA
Coding guidelines have potential software overhead
[Figure: after hw/sw partitioning, some refined functions (motionComp'(), deblocking'()) map to the FPGA while others (filterLuma(), filterChroma()) stay on the uP; problem arises when refined code is mapped to software]
16/23
Refinement Methodology
Considerations: reduce software overhead, reduce refinement time
Methodology: profiling plus iterative improvement
1. Profile
2. Determine critical region
3. Apply all guidelines except PVR/AS: CC, CF, CEMA, CIE, CEDF, CECF, FS, LR (minimal overhead)
4. Is the overhead of copying the array acceptable? If yes, apply PVR
5. Does a suitable hw algorithm exist with acceptable sw performance? If yes, apply AS
Repeat until performance acceptable
17/23
Experimental Setup
Benchmark suite: MediaBench, Powerstone
Manually applied guidelines
  1-2 hours; 23 additional lines/benchmark, on average
Target architecture: Xilinx Virtex-II FPGA with ARM9 uP
Hardware/software partitioning: selects critical regions for hardware
Synthesis: high-level synthesis tool (~30,000 lines of C code) outputs register-transfer level (RTL) VHDL; RTL synthesis using Xilinx ISE
Compilation: gcc with -O1 optimizations
[Figure: tool flow; benchmarks undergo manual refinement, refined code goes through hw/sw partitioning, sw is compiled for the ARM9, and hw is synthesized into a bitfile for the Virtex-II]
18/23
Speedups from Guidelines: g3fax
[Chart: speedup (0-20) for g3fax, comparing sw, hw/sw with original code, and hw/sw with guidelines]
No guidelines: 2x speedup
Conversion to constants: 3.6x speedup, total time 5 minutes
Explicit dataflow + algorithmic specialization: 16.4x speedup, total time 15 minutes
19/23
Speedups from Guidelines: crc
[Chart: speedup (0-20) for g3fax and crc, comparing sw, hw/sw with original code, and hw/sw with guidelines]
No guidelines: 8.6x speedup
Conversion to constants: 14.4x speedup, total time 10 minutes
Input enumeration: 16.7x speedup, total time 20 minutes
Algorithmic specialization: 19x speedup, time 30 minutes, sw overhead 6000%
20/23
Speedups from Guidelines
[Chart: speedup (0-20) for g3fax, crc, jpeg, brev, fir, mpeg2, comparing sw, hw/sw with original code, and hw/sw with guidelines; brev's off-scale bars are labeled 573 and 842]
Original code: speedups range from 1x (no speedup) to 573x; average 2.6x (excludes brev)
Refined code with guidelines: average 8.4x (excludes brev); 3.5x average improvement compared to original code
21/23
Speedups from Guidelines
Guidelines move speedups closer to ideal
  Almost identical for mpeg2, fir
Several examples still far from ideal; may imply new guidelines needed
[Chart: speedup (0-20) for g3fax, crc, jpeg, brev, fir, mpeg2, comparing sw, hw/sw with original code, hw/sw with guidelines, and ideal; some bars are off-scale]
22/23
Guideline SW Overhead/Improvement
Average sw performance overhead: -15.7% (an improvement); -1.1% excluding brev; 3 examples improved
Average sw size overhead (lines of C code): 8.4% excluding brev
[Chart: performance overhead and size overhead per benchmark (g3fax, mpeg2, jpeg, brev, fir, crc), ranging from roughly -88% and -47% (improvement) to +30% (overhead)]
23/23
Summary
Simple coding guidelines significantly improve synthesis from C
  3.5x speedup compared to hw/sw synthesized from unrefined code
Major rewrites may not be necessary: between 1-2 hours
Refinement methodology reduces software size/performance overhead; in some cases, an improvement
Future work: test on commercial synthesis tools; new guidelines for different domains