January 2007
Luca Pezzoni
VLIW Programming
Section 1: Introduction
VLIW Programming AST/SEMP/Agrate
Agenda: section 1 (Introduction)
• Background: VLIW and ILP
• Multimedia VLIW processors:
  - Philips TriMedia
  - TI C6x
  - ST200
• ST200 in SoC:
  - STm8000 DVD recorder
• Efficient C programming
Learning Objectives
• At the end of this section, you will be able to:
  - Compare different VLIW processor architectures
  - Describe what a System-on-Chip is
  - Recall the “golden rules” for efficient C programming on a VLIW processor
Background: VLIW, ILP and SIMD
Background: VLIW
[Figure: RISC, Super-Scalar and VLIW compared. In each case the C-language source goes through a compiler and an assembler. Each core is drawn with a register file (REG), one or more ALUs, load/store (LD/S) and jump (JMP) units, and instruction and data caches (I$, D$). The Super-Scalar is labelled "Real Time ILP", the VLIW "Static ILP".]
Background: VLIW
[Figure: block diagrams of a 4-issue Super-Scalar and a 4-issue VLIW. Both have an instruction cache, instruction fetch, a control unit, four execution units, a register file and a data cache; the Super-Scalar additionally shows an instruction pipeline next to its control unit.]
• 4-issue Super-Scalar:
  - Large instruction pipeline and control unit
  - Instruction scheduling is handled at program run time.
• 4-issue VLIW:
  - Very small control unit
  - The compiler schedules the instructions efficiently before the program runs.
Background: VLIW
• Instruction Level Parallelism (ILP) speeds up programs by executing several elementary RISC operations in parallel, such as memory loads and stores, integer additions or floating-point multiplications.
• VLIW: the compiler determines the ILP and statically schedules the operations on the proper functional units.
• This parallelism is “invisible” to the user, though to achieve a high degree of ILP the programmer may need to restructure the code according to some rules.
• In a Super-Scalar processor, the ILP is exploited on the fly by complex scheduling hardware, which usually requires a large silicon area (more than the CPU itself).
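As a small illustration of what the compiler looks for, the sketch below computes the same four-term sum of products in two ways. The function names are illustrative and not taken from the slides. In the first version every addition depends on the previous one; in the second the partial sums are independent, so a static scheduler has more freedom to place operations in the same cycle, subject to the functional units actually available.

```c
#include <stdint.h>

/* Chained accumulation: the multiplies are independent, but each
   addition needs the result of the previous one. */
uint32_t dot4_chain(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t acc = 0;
    acc += a[0] * b[0];
    acc += a[1] * b[1];
    acc += a[2] * b[2];
    acc += a[3] * b[3];
    return acc;
}

/* Tree form: (p0 + p1) and (p2 + p3) do not depend on each other,
   so they can be scheduled in parallel. */
uint32_t dot4_tree(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t p0 = a[0] * b[0];
    uint32_t p1 = a[1] * b[1];
    uint32_t p2 = a[2] * b[2];
    uint32_t p3 = a[3] * b[3];
    return (p0 + p1) + (p2 + p3);
}
```

Both functions return the same value for unsigned operands; only the dependency structure differs.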
Background: ILP
• MPEG2: IDCT & Reconstruction, Avg ILP = 3.91
• MPEG2: Intra/Inter Q and IQ, Avg ILP = 3.96
• MPEG2: Motion Compensation, Avg ILP = 3.95
Background: SIMD 8-bit (SAD)
[Figure: 8-bit SIMD sum of absolute differences (SAD). Two 32-bit unsigned ints, each packing four u-char values, are combined into one 32-bit unsigned int result; the numbers 1 to 11 label the steps in the drawing.]
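The operation in the figure can be emulated in portable C as follows. This is only a reference sketch of what a 4 x 8-bit SAD computes; the function name is hypothetical, and a real SIMD unit would perform the whole operation in a single instruction rather than in a loop.

```c
#include <stdint.h>

/* Sum of absolute differences over four unsigned 8-bit lanes
   packed into two 32-bit words. */
uint32_t sad4_u8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);  /* extract lane of a */
        int y = (int)((b >> (8 * lane)) & 0xFFu);  /* extract lane of b */
        sum += (uint32_t)(x > y ? x - y : y - x);  /* |x - y| */
    }
    return sum;
}
```

For example, `sad4_u8(0x01020304, 0x04030201)` adds the lane differences 3, 1, 1 and 3.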
Background: SIMD 16-bit (MinMax)
[Figure: 16-bit SIMD min. Two 32-bit unsigned ints each hold two signed shorts; the four shorts are sign-extended (SX), compared pairwise (Min), and the two resulting signed shorts are packed into one 32-bit unsigned int.]
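A portable C emulation of a dual 16-bit signed minimum is sketched below. It assumes the reading of the figure given above (two signed 16-bit lanes per 32-bit word, lane-wise minimum); the helper and function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a word (the "SX" step). */
static int32_t sx16(uint32_t lane)
{
    int32_t v = (int32_t)(lane & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Lane-wise minimum of two signed 16-bit values packed in each word. */
uint32_t min2_s16(uint32_t a, uint32_t b)
{
    int32_t lo_a = sx16(a),       lo_b = sx16(b);
    int32_t hi_a = sx16(a >> 16), hi_b = sx16(b >> 16);
    int32_t lo = (lo_a < lo_b) ? lo_a : lo_b;
    int32_t hi = (hi_a < hi_b) ? hi_a : hi_b;
    /* Repack the two 16-bit results into one 32-bit word. */
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

The comparison is signed, so a lane holding 0xFFFF (that is, -1) is smaller than a lane holding 3.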
Background: SIMD (Shuffles/Permutes)
[Figure: SIMD lane reordering. Shuffle examples: lanes 7 6 5 4 3 2 1 0 become 7 3 6 2 5 1 4 0, and lanes 7 6 5 4 3 2 1 0 become 7 3 5 1 6 2 4 0. Permute example: lanes 3 2 1 0 with control fields C3 C2 C1 C0 become 0 1 2 3.]
Multimedia VLIW Processors: Philips TriMedia
http://www.nxp.com/products/nexperia/about/index.html
TriMedia SoC
• 1998:
  - TM-1000 @ 100 MHz
  - TM-1100 @ 125 MHz
• From 2000 to 2001:
  - TM-1300 @ 133 MHz
  - TM-1600 @ 166 MHz
• From 2004:
  - New name: Nexperia
  - PNX1300 @ 143, 180, 200 MHz
  - PNX1500 @ 266, 300 MHz
  - PNX1700 @ 450, 500 MHz
• Rich multimedia SW library available
• Comprehensive software development environment (C/C++)
• One VLIW CPU (TM5250)
• Video input processor (on-the-fly cropping, downscaling, format conversion, de-interlacing, …)
• Audio input/output (up to 8 channels)
• 2D drawing engine for accelerating 2D graphics operations
• VLD (MPEG1/2)
• Ethernet
• …
TriMedia VLIW CPU (TM5250)
[Figure: TM5250 block diagram showing 30 functional units, a 64 KB I-cache, and a D-cache with 16 KB L1 and 128 KB L2.]
• VLIW CPU:
  - 32-bit, 30 functional units
  - 5 units addressed in a single cycle (ILP = 5)
  - 128 general-purpose 32-bit registers
  - Floating-point units (IEEE-754)
• Caches:
  - 2 D-cache memory accesses per cycle
  - 64 KB instruction cache (4/8-way associative)
  - L1 data cache 16 KB (4/8-way associative)
  - L2 data cache 128 KB (4/8-way associative)
• SIMD:
  - 8-bit and 16-bit SIMD
CPU Functional Unit Assignment
Profiling with TM simulator
Multimedia VLIW Processors: Texas Instruments C64x family
http://focus.ti.com/paramsearch/docs/parametricsearch.tsp?family=dsp&sectionId=2&tabId=217&familyId=477
TI C64x Architecture
TI C64x
• Up to 1.1 GHz
• Divided into 2 data paths
• 2 general-purpose 32-bit register files, A and B (32 registers each)
• Supports up to 40-bit fixed point and 64-bit floating point (two registers)
• 8 functional units (2 groups of 4 FUs)
• .M unit (.M1, .M2):
  - 16x16 multiply
  - 16x32 multiply
  - Quad 8x8 multiply
  - Dual 16x16 multiply
  - Quad 8x8 multiply with add
• .L unit (.L1, .L2):
  - 32/40-bit arithmetic and cmp
  - 32-bit logical op
  - Dual 16-bit arithmetic op
  - Quad 8-bit arithmetic op
  - Dual 16-bit min/max op
  - Quad 8-bit min/max op
• .S unit (.S1, .S2):
  - Dual 16-bit cmp op
  - Quad 8-bit cmp op
  - Dual 16-bit shift op
• .D unit (.D1, .D2):
  - Load/store double words
  - Load/store unaligned words and double words
  - 32-bit logical op
  - Dual 16-bit arithmetic op
Multimedia VLIW Processors: STMicroelectronics ST200 family
http://www.st.com/stonline/press/news/year2002/t1124p.htm
ST210 Features:
• ST210 is the first implementation of the ST200 (Nov 2001):
  - 4-issue VLIW (4 operations per clock cycle)
  - 64 general-purpose 32-bit registers
  - 1 load/store per cycle
  - 2 multiplies (16x32->32) per cycle
  - 32 KB direct-mapped I-cache, 32 KB 4-way D-cache
  - 250 MHz
ST210 workload
• Video:
  - MPEG2 loop encoder: 200 MHz (15 Mbit/s, 25 PAL frames/s)
  - MPEG2 decoder: 250 MHz (15 Mbit/s, 25 PAL frames/s)
  - MPEG4 SP L3 decoder: 12 MHz (64 Kbit/s, 30 QCIF frames/s)
• Audio:
  - MPEG1 layer 2 encoder (256 Kbit/s stereo): 28 MHz @ 32 kHz sampling rate, 37 MHz @ 44.1 kHz sampling rate
  - MPEG1 layer 2 decoder: 24 MHz @ 256 Kbit/s stereo at 32 kHz
ST220 scheme:
ST220 Cluster:
ST220 target applications
• ST220 is a 400 MHz high-performance media core:
  - Low cost with functional flexibility, providing optimum SoC solutions for embedded systems
  - Designed for the STm8000 SoC
• Video processing:
  - MPEG2 video encoding loop: Q, IQ, HVLC, RLC, ZZ and rate control
• Audio processing:
  - MPEG1 Layers 1, 2, 3 dual-channel stereo encoding
  - Dolby Ace dual-channel stereo encoding
ST220 features
• I and D protection units:
  - Support a supervisor/user protection model
  - Allow easy future integration of a virtual memory management unit
• Core Memory Controller (CMC):
  - Allows multiple masters to access the STBus via a single port
  - Provides arbitration between multiple requestors
• Pre-fetch buffer:
  - Requests data to be loaded into the local D-cache if not present
  - Used to reduce the D-cache miss ratio
  - The SW programmer can use it through pragmas in the C code
• Streaming Data Interface (SDI):
  - Provides a mechanism for attaching a HW co-processor to the core
  - Reduces STBus traffic
  - Reduces D-cache pollution and control complexity
ST220 Pipeline
ST231-ST240-..
• ST231 vs ST220:
  - Data cache partitioning (32K 4-way, 8K 1-way + 24K 3-way, …)
  - Addition of a TLB (translation lookaside buffer), for implementing OS memory management
  - Better cache flush and refill
  - Debug support
  - Idle mode
  - Performance monitoring
  - Speculative loads
• ST240 vs ST231:
  - 64-bit load/store
  - 8-bit and 16-bit SAD
  - Floating point
  - Multi-way I-cache
VLIW cores in SoC
“Traditional” System-on-Chip
[Figure: SoC block diagram with a core micro and caches, two DSPs and two ASICs.]
DSPs run computational kernels (asm)
ASICs run critical pipelines (hard-wired)
The CPU core runs the main application (in C)
New VLIW-based System-on-Chip
[Figure: SoC block diagram with a core micro and caches, two VLIW DSP-CPUs and one ASIC.]
Some ASICs and DSPs are no longer needed: the VLIW CPU can be fast enough to absorb their functionality.
VLIW DSP-CPU: smaller, customizable to the application, and faster (in C)
Trends in the consumer electronics market
• SoC designs use a combination of core processors, DSP engines and specialized ASICs
• High performance, low cost, time-to-market
• VLIW is becoming the predominant embedded/DSP technology
• VLIW processors allow many functions to be implemented as SW algorithms instead of HW circuits, at the cost of hard-wired logic
STm8000 DVD recorder SoC
[Figure: ST220-based video encoding subsystem: video input interface, video pre-processor (hard-wired), SHE pipeline, DSP coprocessor and ST220 core (D-cache, I-cache), connected through the STBus interconnect.]
• Zooming in on the ST220-based video encoding subsystem cell
• SLIMPEG Hardware Engine (SHE):
  - Motion estimation
  - Motion compensation
• ILA coprocessor:
  - DCT/IDCT
  - Reconstruction
• ST220 VLIW:
  - Video loop encoder
Efficient C programming guidelines
Profiling Methodology
http://kcachegrind.sourceforge.net/cgi-bin/show.cgi
http://valgrind.org/
Profiling Methodology (SVC Decoding)
Profiling Methodology
/* Main module: non-real-time set-up */
MemoryAllocation();
ReadInputParameters();
OpenIOFile();

/* Main module: real-time DSP part to profile */
Start = clock();
DSP_Routines();
Stop = clock();

/* Main module: non-real-time ending part */
WriteOutputResults();
CloseIOFile();
MemoryFree();
Profiling Methodology
• File I/O subroutines are not counted in the profiling results, since in a real-time environment the I/O peripherals move data to and from SDRAM via DMA, without stalling the VLIW CPU.
• Do not call stdio or stdlib functions (printf, fread, fwrite, etc.) in the module being profiled.
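The timing pattern can be wrapped in a small helper, sketched below. The kernel here is a hypothetical stand-in for the real-time DSP part, and `clock()` measures host CPU time, so on a target simulator a cycle counter would normally replace it.

```c
#include <time.h>

/* Hypothetical stand-in for the real-time DSP part under profile.
   The volatile sink keeps the compiler from removing the loop. */
static volatile unsigned sink;

void dsp_routines_stub(void)
{
    unsigned acc = 0;
    for (unsigned i = 0; i < 100000u; i++)
        acc += i * i;
    sink = acc;
}

/* Time only the kernel: set-up, file I/O and tear-down stay outside
   the measured region. Returns elapsed CPU time in seconds. */
double profile_kernel(void (*kernel)(void))
{
    clock_t start = clock();
    kernel();
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A caller would allocate memory and read the input first, call `profile_kernel(dsp_routines_stub)`, and only then write results and free memory.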
Golden rules: globals
• Copy global variables into local variables.
  - The compiler cannot tell whether a global variable accessed in a function will have its value modified.
  - Most of the time the compiler does not copy the value into a local variable (a register), but generates a memory access every time the value is needed.
• Local variables should be used because they are semantically on the stack, so their addresses cannot conflict with global variables.
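A minimal sketch of this rule follows. The global `gain` and both function names are hypothetical. In the first version the store through `dst` could in principle modify `gain`, so the compiler may reload it on every iteration; in the second the value is read once into a local.

```c
/* Hypothetical global parameter, for illustration only. */
int gain = 3;

/* The global may be re-read on every iteration, because the compiler
   cannot prove that writing dst[i] leaves gain unchanged. */
void scale_global(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * gain;
}

/* The global is copied once; g can stay in a register for the whole loop. */
void scale_local(int *dst, const int *src, int n)
{
    const int g = gain;
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * g;
}
```

Both versions compute the same result as long as `dst` does not actually overlap `gain`.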
Golden rules: pointers
• Memory disambiguation.
  - C pointers may alias, that is, point to overlapping memory areas. The compiler cannot determine whether *src and *dst point to the same memory locations, and this blocks further optimisation.
• Use restricted pointers (not part of ANSI C89; standardised as restrict in C99) or a suitable #pragma ivdep
For example:
    for(i=0;i<MAX;i++)
        *dst++ = *src++;
Generates:                    Instead of:
LD src->reg1; src++;;         LD src->reg1; src++;;
NOP;;                         LD src->reg2; src++;;
NOP;;                         NOP;;
ST reg1->dst; dst++;;         ST reg1->dst; dst++;;
LD src; src++;;               ST reg2->dst; dst++;;
NOP;;
NOP;;
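With C99 the copy loop above can be written with `restrict`-qualified pointers, as in the sketch below. The function name is illustrative. The qualifier is a promise by the programmer that the two areas do not overlap; calling the function with overlapping buffers is undefined behaviour.

```c
#include <stddef.h>

/* restrict tells the compiler that dst and src never alias, so loads
   from src may be scheduled ahead of earlier stores to dst. */
void copy_words(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Whether a given compiler also accepts `#pragma ivdep` for the same purpose depends on the toolchain and should be checked in its manual.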
Golden rules: miscellany
• Use 32-bit variables when possible
• Load and store 32 bits at a time (4 packed bytes or 2 packed 16-bit values)
• Use look-up tables to save clock cycles by trading off memory for speed
• Concentrate all the processing in a few for loops
• A single-dimension array is faster than a multi-dimensional one
• Replace small, frequently called functions with macros
• Reduce the number of function calls
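The packed load/store rule can be sketched as follows: an 8-bit image is inverted four pixels at a time through one 32-bit word. The function name is hypothetical, and `memcpy` is used so the code stays valid for any alignment; most compilers turn these fixed-size copies into a single 32-bit load and store when the target allows it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invert n 8-bit pixels, processing four pixels per 32-bit access. */
void invert_u8x4(uint8_t *pix, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, pix + i, sizeof w);   /* one 32-bit load  */
        w = ~w;                          /* four pixels at once */
        memcpy(pix + i, &w, sizeof w);   /* one 32-bit store */
    }
    for (; i < n; i++)                   /* leftover pixels, one by one */
        pix[i] = (uint8_t)~pix[i];
}
```

Bitwise inversion is independent of byte order, so the result is the same on little-endian and big-endian targets.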
Golden rules: miscellany
• Floating-point multiplication is often faster than division:
  - Use (val*0.5) instead of (val/2.0)
• Avoid operators like &&, || and %.
  - Use: if((a==0) & (b==0)) instead of if((a==0) && (b==0))
• On ST200:
  - Floating point is emulated
  - Division is emulated
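The two rewrites can be compared side by side in the sketch below (function names are illustrative). Note that `&` always evaluates both operands, so it is only a safe replacement for `&&` when the right-hand side has no side effects and does not rely on the left-hand test, for example a pointer check guarding a dereference.

```c
/* Short-circuit form: typically compiled to a conditional branch. */
int both_zero_branchy(int a, int b)
{
    return (a == 0) && (b == 0);
}

/* Bitwise form: two independent compares and one AND, no branch needed. */
int both_zero_flat(int a, int b)
{
    return (a == 0) & (b == 0);
}

/* Division by a constant replaced by multiplication by its reciprocal.
   For 2.0 the reciprocal 0.5 is exact, so the results are identical. */
float half_div(float v) { return v / 2.0f; }
float half_mul(float v) { return v * 0.5f; }
```

For reciprocals that are not exactly representable (1/3, for instance) the multiply form can differ from the division in the last bit.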
Golden rules: Select
Convert IF into SELECT
if(x>y)
result = val1;
else
result = val2;
result = (x>y) ? val1 : val2;
Generates the following ST200 assembler:
Cycle 1: cmpgt b0,x,y;;
Cycle 2: slct b0,result,val1,val2;;
Golden rules: Min/Max
• Upper bound (min):
    x = (x>255) ? 255 : x;
• Generates:
    min x,255,x;;
• Lower bound (max):
    x = (x<0) ? 0 : x;
• Generates:
    max x,0,x;;
• Clipping between 0-255 is therefore:
    x = (x>255) ? 255 : (x<0) ? 0 : x;
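As a usage sketch, the clipping expression can be packaged as a small inline helper (the name is hypothetical). Written as two conditional expressions, each line maps naturally onto a single min or max operation.

```c
/* Clip a value to the 8-bit range 0..255 without branches in the source. */
static inline int clip_u8(int x)
{
    x = (x > 255) ? 255 : x;   /* upper bound: min(x, 255) */
    x = (x < 0)   ? 0   : x;   /* lower bound: max(x, 0)   */
    return x;
}
```

Typical use is the final step of pixel arithmetic, for example `out[i] = (unsigned char)clip_u8(pred[i] + residual[i]);`.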
Loop Interchange
int a[200][30][5];
for(i=0;i<200;i++)
  for(k=0;k<5;k++)
    for(j=0;j<30;j++)
      a[i][j][k] = i*j*k;
becomes:
int a[200][30][5];
for(i=0;i<200;i++)
  for(j=0;j<30;j++)
    for(k=0;k<5;k++)
      a[i][j][k] = i*j*k;
Array collapse
int a[5*30*200];
for(i=0;i<(5*30*200);i++)
  a[i] = i;
[Figure: memory layout. int A[3][3] is drawn as a 3x3 grid, A(0,0) … A(2,2), and is stored row by row as A(0,0) A(0,1) A(0,2) A(1,0) A(1,1) A(1,2) A(2,0) A(2,1) A(2,2), which corresponds element by element to int B[9], B(0) … B(8).]
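The two transformations can be sketched in compilable form as below. The enum constants mirror the array sizes on the slide; the function names are illustrative. C stores arrays in row-major order, so the last index must vary fastest for consecutive memory accesses.

```c
enum { NI = 200, NJ = 30, NK = 5 };

/* Loop interchange: k is innermost, so successive iterations touch
   consecutive addresses. */
void fill_ordered(int a[NI][NJ][NK])
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)
                a[i][j][k] = i * j * k;
}

/* Row-major position of element (i, j, k) in the equivalent flat array. */
int flat_index(int i, int j, int k)
{
    return (i * NJ + j) * NK + k;
}

/* Array collapse: when the operation does not depend on the indices,
   one loop over a flat array replaces the three nested loops. */
void scale_flat(int *a, int n, int s)
{
    for (int i = 0; i < n; i++)
        a[i] *= s;
}
```

With a single loop there is one induction variable and one loop-exit test, which leaves the scheduler a longer straight-line body to work with.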
Loop Fusion: Basic transformation
for(i=0;i<N;i++)
B[i] = F1(A[i]);
for(j=0;j<N;j++)
C[j] = F2(B[j],A[j]);
for(i=0;i<N;i++)
{
B[i] = F1(A[i]);
C[i] = F2(B[i],A[i]);
}
for(i=0;i<N;i++)
{
int a,b;
a = A[i];
b = F1(a);
C[i] = F2(b,a);
B[i] = b;
}
Loop Fusion: Bump
for(i=2;i<N;i++)
B[i] = F1(A[i]);
for(j=0;j<N-2;j++)
C[j] = F2(B[j+2],A[j+2]);
for(i=2;i<N;i++)
{
B[i] = F1(A[i]);
C[i-2] = F2(B[i],A[i]);
}
for(i=2;i<N;i++)
{
int a,b;
a = A[i];
b = F1(a);
C[i-2] = F2(b,a);
B[i] = b;
}
Loop Fusion: Reverse
for(i=0;i<=N;i++)
B[i] = F1(A[i]);
for(j=N;j>=0;j--)
C[j] = F2(B[N-j],A[N-j]);
for(i=0;i<=N;i++)
B[i] = F1(A[i]);
for(j=0;j<=N;j++)
C[N-j] = F2(B[j],A[j]);
for(i=0;i<=N;i++)
{
int a,b;
a = A[i];
b = F1(a);
C[N-i] = F2(b,a);
B[i] = b;
}
Loop unrolling
for(i=0;i<N;i++)
{
something(i);
}
for(i=0;i<N;i+=4)
{
something(i);
something(i+1);
something(i+2);
something(i+3);
}
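The unrolled loop above is only correct when N is a multiple of 4. A minimal sketch that also handles the leftover iterations is shown below; the operation (adding a constant) and the function name are placeholders for something(i).

```c
/* Add c to every element, four elements per iteration of the main loop. */
void add_const_unrolled(int *x, int n, int c)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* unrolled body: 4 independent updates */
        x[i]     += c;
        x[i + 1] += c;
        x[i + 2] += c;
        x[i + 3] += c;
    }
    for (; i < n; i++)            /* remainder: at most 3 iterations */
        x[i] += c;
}
```

The four statements in the unrolled body do not depend on each other, which is what allows them to share a cycle on a 4-issue machine.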
Loop unrolling: Example (reduction)
sum = 0;
for(j=0;j<M;j++)
{
for(i=0;i<N;i++)
{
sum += B[j]*C[i];
}
}
/* Loops interchanged and inner loop unrolled by 4 (assumes M is a multiple of 4) */
sum1 = sum2 = sum3 = sum4 = 0;
for(i=0;i<N;i++)
{
int c = C[i];
for(j=0;j<M;j+=4)
{
sum1 += B[j ]*c;
sum2 += B[j+1]*c;
sum3 += B[j+2]*c;
sum4 += B[j+3]*c;
}
}
sum = sum1+sum2+sum3+sum4;
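A compilable sketch of the same reduction is given below, with a reference version for comparison and a remainder loop so that M need not be a multiple of 4. Function names are illustrative. The four partial sums break the single dependency chain on the accumulator.

```c
/* Reference: one accumulator, one long dependency chain. */
int sum_products_ref(const int *B, int M, const int *C, int N)
{
    int sum = 0;
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            sum += B[j] * C[i];
    return sum;
}

/* Unrolled by 4 with independent accumulators. */
int sum_products_unrolled(const int *B, int M, const int *C, int N)
{
    int s1 = 0, s2 = 0, s3 = 0, s4 = 0;
    for (int i = 0; i < N; i++) {
        int c = C[i];             /* loaded once per outer iteration */
        int j = 0;
        for (; j + 3 < M; j += 4) {
            s1 += B[j]     * c;
            s2 += B[j + 1] * c;
            s3 += B[j + 2] * c;
            s4 += B[j + 3] * c;
        }
        for (; j < M; j++)        /* remainder when M is not a multiple of 4 */
            s1 += B[j] * c;
    }
    return s1 + s2 + s3 + s4;
}
```

For integer data the two versions agree as long as no intermediate sum overflows; for floating-point data the reordering of additions can change rounding.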
Software Pipeline
for(i=0;i<N;i++)
{
int a = src1[i];
int b = src2[i];
d = a*b;
F(d);
}
/* Prologue */
int a = src1[0];
int b = src2[0];
for(i=0;i<N-1;i++)
{
d = a*b;
F(d);
a = src1[i+1];
b = src2[i+1];
}
/* Epilogue */
d = a*b;
F(d);
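A self-contained sketch of the same manual software pipelining is shown below. Here F(d) is replaced by a simple accumulation so the function has an observable result; that substitution and the function name are assumptions for illustration. The guard for n <= 0 is needed because the prologue reads element 0 unconditionally.

```c
/* Software-pipelined multiply-accumulate: the loads for iteration i+1
   are issued while the product of iteration i is being consumed. */
int mul_acc_pipelined(const int *src1, const int *src2, int n)
{
    if (n <= 0)
        return 0;

    int acc = 0;
    int a = src1[0];                 /* prologue: first loads */
    int b = src2[0];
    for (int i = 0; i < n - 1; i++) {
        int d = a * b;               /* compute with current operands */
        a = src1[i + 1];             /* fetch operands for next iteration */
        b = src2[i + 1];
        acc += d;                    /* stands in for F(d) */
    }
    acc += a * b;                    /* epilogue: last product */
    return acc;
}
```

Many VLIW compilers perform this transformation automatically on simple loops, so the manual form is mainly useful when the compiler cannot prove it safe.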