
January 2007

Luca Pezzoni

VLIW Programming

Section 1: Introduction

VLIW Programming AST/SEMP/Agrate

Agenda: section 1 (Introduction)

• Background: VLIW and ILP
• Multimedia VLIW processors:
  • Philips TriMedia
  • TI c6x
  • ST200
• ST200 in SoC:
  • STm8000 DVD-recorder
• Efficient C programming


Learning Objectives

• At the end of this section, you will be able to:
  • Compare different VLIW processor architectures
  • Describe what a System-on-Chip is
  • Recall the “golden rules” for efficient C programming on a VLIW processor


Background: VLIW, ILP and SIMD


Background: VLIW

[Figure: RISC, Super-Scalar and VLIW compared. In all three, C-language source goes through a Compiler and an Assembler. Each core has registers (REG), load/store (LD/S) and jump (JMP) units, and instruction and data caches (I$, D$). Super-Scalar and VLIW have multiple ALUs: Super-Scalar is labelled "Real Time ILP", VLIW is labelled "Static ILP".]

Background: VLIW

[Figure: block diagrams of a 4-issue Super-Scalar core (Instruction Cache, Instruction Fetch, Instruction Pipeline & Control Unit, four Execution Units, Register File, Data Cache) and of a 4-issue VLIW core (Instruction Cache, Instruction Fetch, Control Unit, four Execution Units, Register File, Data Cache).]

• 4-issue Super-Scalar:
  • Large Instruction Pipeline and Control Unit
  • Instruction scheduling is handled at program run time
• 4-issue VLIW:
  • Very small Control Unit
  • The compiler schedules the instructions efficiently before the program runs


Background: VLIW

• Instruction Level Parallelism (ILP) speeds up programs by executing several elementary RISC operations in parallel, such as memory loads and stores, integer additions or floating point multiplications.
• VLIW: the compiler determines the ILP and statically schedules the operations on the proper functional units.
• This parallelism is “invisible” to the user, though to achieve a high degree of ILP the programmer may need to restructure the code according to some rules.
• In a Super-Scalar processor, the ILP is exploited on the fly by dedicated scheduling hardware, which can be very complex and usually requires a large silicon area (more than the CPU itself).


Background: ILP

MPEG2: IDCT & Reconstruction

Avg ILP = 3.91


Background: ILP

MPEG2: Intra/Inter Q and IQ

Avg ILP = 3.96


Background: ILP

MPEG2: Motion Compensation

Avg ILP = 3.95


Background: SIMD 8-bit (SAD)

[Figure: 8-bit SIMD sum of absolute differences (SAD). Two 32-bit unsigned ints, each holding four packed u-char values, are combined; the figure numbers its steps from 1 to 11.]
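As a reference for what a packed 8-bit SAD operation computes, the following is a minimal scalar sketch in portable C. The function name `sad4_u8` is hypothetical and is not an intrinsic of any of the processors discussed here; a SIMD unit would produce the same result in one operation instead of a loop.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of absolute differences of the four unsigned bytes packed in a and b.
   Hypothetical helper: a scalar model of what a packed SAD unit computes. */
uint32_t sad4_u8(uint32_t a, uint32_t b)
{
    uint32_t sad = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);  /* byte of a in this lane */
        int y = (int)((b >> (8 * lane)) & 0xFFu);  /* byte of b in this lane */
        sad += (uint32_t)(x > y ? x - y : y - x);  /* |x - y| */
    }
    assert(sad <= 4u * 255u);  /* four lanes, each difference at most 255 */
    return sad;
}
```

Written this way, each 32-bit load brings in four pixels, which is why packing data into 32-bit words matters for this kind of kernel.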


Background: SIMD 16-bit (MinMax)

[Figure: 16-bit SIMD MinMax. Two 32-bit unsigned ints, each holding two packed signed shorts, are unpacked and sign-extended (SX), then combined by Min operations into two signed shorts; the figure numbers its steps from 1 to 4.]
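One plausible reading of the 16-bit case is a lane-wise minimum on two packed signed shorts. The sketch below models that in portable C; `sign_extend16` and `min2_s16` are hypothetical names, and the exact dataflow of the original figure may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of half to a 32-bit signed value (the "SX" step). */
int32_t sign_extend16(uint32_t half)
{
    int32_t v = (int32_t)(half & 0xFFFFu);
    if (v >= 0x8000)
        v -= 0x10000;
    assert(v >= -32768 && v <= 32767);
    return v;
}

/* Lane-wise minimum of two signed 16-bit values packed in each 32-bit word. */
uint32_t min2_s16(uint32_t a, uint32_t b)
{
    int32_t a_lo = sign_extend16(a),       b_lo = sign_extend16(b);
    int32_t a_hi = sign_extend16(a >> 16), b_hi = sign_extend16(b >> 16);
    int32_t lo = (a_lo < b_lo) ? a_lo : b_lo;
    int32_t hi = (a_hi < b_hi) ? a_hi : b_hi;
    /* Repack the two results into one 32-bit word. */
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```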


Background: SIMD (Shuffles/Permutes)

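[Figure: SIMD Shuffle and Permute. Two 8-lane examples reorder lanes 7 6 5 4 3 2 1 0 into 7 3 6 2 5 1 4 0 and into 7 3 5 1 6 2 4 0; a 4-lane example reorders lanes 3 2 1 0 into 0 1 2 3 under control fields C3 C2 C1 C0.]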


Multimedia VLIW Processors: Philips TriMedia

http://www.nxp.com/products/nexperia/about/index.html


TriMedia SoC

• 1998:
  • TM-1000 @ 100 MHz
  • TM-1100 @ 125 MHz
• From 2000 to 2001:
  • TM-1300 @ 133 MHz
  • TM-1600 @ 166 MHz
• From 2004:
  • New name: Nexperia
  • PNX1300 @ 143, 180, 200 MHz
  • PNX1500 @ 266, 300 MHz
  • PNX1700 @ 450, 500 MHz
• Rich multimedia SW library available
• Comprehensive software development environment (C/C++)
• One VLIW CPU (TM5250)
• Video Input Processor (on-the-fly cropping, downscaling, format conversion, de-interlacer, …)
• Audio input/output (up to 8 channels)
• 2D drawing engine for accelerating 2D graphics operations
• VLD (MPEG1/2)
• Ethernet
• …


TriMedia VLIW CPU (TM5250)

[Figure: TM5250 block diagram, showing 30 functional units, a 64 KB I-cache, and a D-cache with 16 KB L1 and 128 KB L2.]


TriMedia VLIW CPU (TM5250)

• VLIW CPU:
  • 32-bit, 30 functional units
  • 5 units addressed in a single cycle (ILP = 5)
  • 128 general purpose 32-bit registers
  • Floating point units (IEEE-754)
• Caches:
  • 2 D-cache memory accesses per cycle
  • 64 KB Instruction Cache (4/8-way associative)
  • L1 Data Cache 16 KB (4/8-way associative)
  • L2 Data Cache 128 KB (4/8-way associative)
• SIMD:
  • 8-bit and 16-bit SIMD


CPU Functional Unit Assignment


Profiling with TM simulator


Multimedia VLIW Processors: Texas Instruments C64x family

http://focus.ti.com/paramsearch/docs/parametricsearch.tsp?family=dsp&sectionId=2&tabId=217&familyId=477


TI C64x Architecture


TI C64x

• Up to 1.1 GHz
• Divided into 2 data paths
• 2 general purpose 32-bit register files, A and B (32 registers each)
• Supports up to 40-bit fixed point and 64-bit floating point (two registers)
• 8 functional units (2 groups of 4 FUs)
• .M unit (.M1, .M2):
  • 16x16 multiply
  • 16x32 multiply
  • Quad 8x8 multiply
  • Dual 16x16 multiply
  • Quad 8x8 multiply with add
• .L unit (.L1, .L2):
  • 32/40-bit arithmetic and cmp
  • 32-bit logical op
  • Dual 16-bit arithmetic op
  • Quad 8-bit arithmetic op
  • Dual 16-bit min/max op
  • Quad 8-bit min/max op
• .S unit (.S1, .S2):
  • Dual 16-bit cmp op
  • Quad 8-bit cmp op
  • Dual 16-bit shift op
• .D unit (.D1, .D2):
  • Load/store double words
  • Load/store un-aligned words and double words
  • 32-bit logical op
  • Dual 16-bit arithmetic op


Multimedia VLIW Processors: STMicroelectronics ST200 family

http://www.st.com/stonline/press/news/year2002/t1124p.htm


ST210 Features:

• ST210 is the first implementation of ST200 (Nov 2001):
  • 4-issue VLIW (4 operations per clock cycle)
  • 64 general purpose 32-bit registers
  • 1 load/store per cycle
  • 2 multiplies (16x32->32) per cycle
  • 32 KB I-cache (direct mapped), 32 KB D-cache (4-way)
  • 250 MHz


ST210 workload

• Video:
  • MPEG2 loop encoder: 200 MHz (15 Mbit/s, 25 PAL frame/s)
  • MPEG2 decoder: 250 MHz (15 Mbit/s, 25 PAL frame/s)
  • MPEG4 SP L3 decoder: 12 MHz (64 Kbit/s, 30 QCIF frame/s)
• Audio:
  • MPEG1 layer 2 encoder (256 Kbit/s stereo):
    • 28 MHz @ 32 KHz s.r.
    • 37 MHz @ 44.1 KHz s.r.
  • MPEG1 layer 2 decoder:
    • 24 MHz @ 256 Kbit/s stereo at 32 KHz


ST220 scheme:


ST220 Cluster:


ST220 target applications

• ST220 is a 400 MHz high performance media core:
  • Low cost with functional flexibility, providing optimum SoC solutions for embedded systems
• Designed for STM8000 SoC:
  • Video processing: MPEG2 video encoding loop: Q, IQ, HVLC, RLC, ZZ and rate control
  • Audio processing:
    • MPEG1 Layers 1, 2, 3 dual channel stereo encoding
    • Dolby Ace dual channel stereo encoding


ST220 features

• I and D Protection units:
  • Support the Supervisor/User model of protection
  • Allow easy future integration of a virtual memory management unit
• Core Memory Controller (CMC):
  • Allows multiple masters to access the STBus via a single port
  • Provides arbitration between multiple requestors
• Pre-fetch buffer:
  • Requests data to be loaded into the local D-cache if not present
  • Used to reduce the D-cache miss ratio
  • The SW programmer can use it through pragmas in the C code
• Streaming Data Interface (SDI):
  • Provides a mechanism for attaching a HW co-processor to the core
  • Reduces STBus traffic
  • Reduces D-cache pollution and control complexity


ST220 Pipeline


ST231-ST240-..

• ST231 vs ST220:
  • Data cache partitioning (32K 4-way, 8K 1-way + 24K 3-way, …)
  • Insertion of a TLB (Translation Lookaside Buffer), for implementing an OS memory management system
  • Better cache flush and refill
  • Debug support
  • Idle mode
  • Performance monitoring
  • Speculative loads
• ST240 vs ST231:
  • 64-bit load/store
  • 8-bit and 16-bit SAD
  • Floating point
  • Multi-way I-cache


VLIW cores in SoC


“Traditional” System-on-Chip

[Figure: SoC block diagram with a core micro and its caches, two DSPs and two ASICs.]

• The CPU core runs the main application (in C)
• DSPs run computational kernels (asm)
• ASICs run critical pipelines (hard-wired)


New VLIW-based System-on-Chip

[Figure: SoC block diagram with a core micro and its caches, two VLIW DSP-CPUs and one ASIC.]

• Some ASICs and DSPs are no longer needed: a VLIW CPU can be fast enough to absorb their functionality
• VLIW DSP-CPU: smaller, customizable to the application, and faster (programmed in C)


Trends in the consumer electronics market

• SoC designs use a combination of core processors, DSP engines and specialized ASICs
• High performance, low cost, time-to-market
• VLIW is becoming the predominant embedded/DSP technology
• VLIW processors allow many functions to be implemented as SW algorithms instead of HW circuits, at the cost of hard-wired logic


STm8000 DVD recorder SoC


STm8000 DVD recorder SoC

[Figure: ST220-based video encoding subsystem: Video Input Interface, Video Pre-processor (hard-wired), SHE pipeline, DSP coprocessor, ST220 core (D-cache, I-cache), STbus interconnect.]

• Zooming in on the ST220-based video encoding subsystem cell
• SLIMPEG Hardware Engine (SHE):
  • Motion estimation
  • Motion compensation
• ILA-coprocessor:
  • DCT/IDCT
  • Reconstruction
• ST220 VLIW:
  • Video loop encoder


Efficient C programming guidelines


Profiling Methodology

http://kcachegrind.sourceforge.net/cgi-bin/show.cgi

http://valgrind.org/


Profiling Methodology (SVC Decoding)


Profiling Methodology

/* Main module: non-real-time setup */
MemoryAllocation();
ReadInputParameters();
OpenIOFile();

/* Main module: real-time DSP part to profile */
Start = clock();
DSP_Routines();
Stop = clock();

/* Main module: non-real-time ending part */
WriteOutputResults();
CloseIOFile();
MemoryFree();


Profiling Methodology

• File I/O subroutines are not counted in the profiling results, since in a real-time environment the I/O peripherals move data to and from SDRAM via DMA, without stalling the VLIW CPU.
• No stdio or stdlib calls (printf, fread, fwrite, etc.) in the module under profile.
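The methodology above can be sketched with the standard `clock()` function. In this minimal sketch, `dsp_routines` is a hypothetical stand-in for the real-time kernel and `profile_dsp` is an illustrative wrapper; only the call between the two `clock()` readings is measured, and no I/O happens inside it.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical kernel standing in for the real-time DSP part (sum of squares). */
long dsp_routines(const int *samples, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)samples[i] * samples[i];
    return acc;
}

/* Times only the DSP part; setup and file I/O stay outside the measured region. */
long profile_dsp(const int *samples, int n, double *seconds)
{
    assert(seconds != NULL);
    clock_t start = clock();
    long result = dsp_routines(samples, n);
    clock_t stop = clock();
    *seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return result;
}
```

On a simulator or target board a cycle counter would normally replace `clock()`, whose resolution is often too coarse for short kernels.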


Golden rules: globals

• Copy global variables into local variables.
  • The compiler cannot tell whether a global variable accessed in a function will have its value modified elsewhere.
  • Most of the time the compiler does not copy the value into a local variable (a register), but generates a memory access every time the value is needed.
  • Local variables should be used, since they are semantically on the stack, so their addresses cannot conflict with global variables.
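A minimal sketch of this rule follows. The global `gain` and both function names are hypothetical, chosen only for illustration; whether a particular compiler actually reloads the global depends on its alias analysis.

```c
#include <assert.h>

int gain = 3;  /* hypothetical global, for illustration only */

/* The global is read inside the loop: since dst could point at gain,
   the compiler may have to reload it from memory on every iteration. */
void scale_with_global(int *dst, const int *src, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * gain;
}

/* The global is copied once into a local, which can stay in a register. */
void scale_with_local(int *dst, const int *src, int n)
{
    assert(n >= 0);
    const int g = gain;
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * g;
}
```

Both functions give the same result as long as `dst` does not overlap `gain`, which is exactly the guarantee the local copy makes explicit.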


Golden rules: pointers

• Memory disambiguation.
  • C pointers may alias, i.e. point to overlapping memory areas. The compiler cannot determine whether *src and *dst point to the same memory locations, and stops any further optimisation.
  • Use restricted pointers (not part of ANSI C89; standardised as the restrict qualifier in C99) or a proper #pragma ivdep.

For example:

for(i=0;i<MAX;i++)
    *dst++ = *src++;

Generates:                    Instead of:
LD src->reg1; src++;;         LD src->reg1; src++;;
NOP;;                         LD src->reg2; src++;;
NOP;;                         NOP;;
ST reg1->dst; dst++;;         ST reg1->dst; dst++;;
LD src; src++;;               ST reg2->dst; dst++;;
NOP;;
NOP;;
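In C99 the no-overlap promise can be written with the `restrict` qualifier. The sketch below uses hypothetical function names; `restrict` is standard C99, while `#pragma ivdep` is compiler-specific and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Without restrict: a store to dst[i] might change a later src element,
   so the compiler cannot move later loads above earlier stores. */
void copy_may_alias(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* With restrict: the caller promises that the two areas do not overlap,
   so loads and stores of different iterations can be interleaved. */
void copy_no_alias(int *restrict dst, const int *restrict src, size_t n)
{
    assert(n == 0 || dst != src);  /* sanity check only, not a full overlap test */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Calling `copy_no_alias` with overlapping buffers is undefined behaviour, so the qualifier should be used only where the promise really holds.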


Golden rules: miscellany

• Use 32-bit variables when possible
• Load and store 32 bits at a time (4 packed bytes or 2 packed 16-bit values)
• Use look-up tables to save clock cycles, trading memory for speed
• Concentrate all the processing in a few FOR loops
• A single-dimension array is faster than a multi-dimensional one
• Replace small, frequently called functions with macros
• Reduce the number of function calls
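As an illustration of the look-up table rule, the sketch below counts the set bits of a 32-bit word with four table reads instead of a 32-step loop. The table and function names are hypothetical; the table costs 256 bytes of memory.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t bit_count_lut[256];  /* 256 bytes of memory traded for speed */
static int lut_ready = 0;

/* Fill the table once, outside the time-critical part. */
void init_bit_count_lut(void)
{
    for (int v = 0; v < 256; v++) {
        int bits = 0;
        for (int b = 0; b < 8; b++)
            bits += (v >> b) & 1;
        bit_count_lut[v] = (uint8_t)bits;
    }
    lut_ready = 1;
}

/* Four table reads, one per byte of the 32-bit word. */
int count_bits32(uint32_t x)
{
    assert(lut_ready);
    return bit_count_lut[x & 0xFFu]
         + bit_count_lut[(x >> 8) & 0xFFu]
         + bit_count_lut[(x >> 16) & 0xFFu]
         + bit_count_lut[(x >> 24) & 0xFFu];
}
```

Whether the table wins in practice depends on D-cache behaviour, so it is worth confirming with the profiler.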


Golden rules: miscellany

• Floating point multiplication is often faster than division:
  • Use (val*0.5) instead of (val/2.0)
• Avoid operators such as &&, || and %.
  • Use: if((a==0) & (b==0)) instead of if((a==0) && (b==0))
• On ST200:
  • Floating point is emulated
  • Divide is emulated
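The two replacements above can be checked with a small sketch; all function names are hypothetical. The `&` form is only equivalent to `&&` when both operands are side-effect free and safe to evaluate, because `&` always evaluates both sides.

```c
#include <assert.h>

/* Short-circuit form: implies a conditional branch between the two tests. */
int both_zero_logical(int a, int b) { return (a == 0) && (b == 0); }

/* Bitwise form: both comparisons yield 0 or 1 and are simply ANDed. */
int both_zero_bitwise(int a, int b) { return (a == 0) & (b == 0); }

/* Dividing by 2.0 and multiplying by 0.5 give the same value. */
double half_by_division(double v) { return v / 2.0; }
double half_by_multiply(double v) { return v * 0.5; }

/* Self-check: the paired forms agree on a few sample inputs. */
void check_forms_agree(void)
{
    for (int a = -1; a <= 1; a++)
        for (int b = -1; b <= 1; b++)
            assert(both_zero_logical(a, b) == both_zero_bitwise(a, b));
    assert(half_by_division(3.0) == half_by_multiply(3.0));
}
```

The multiply trick is exact here because 0.5 is a power of two; for other divisors, multiplying by a reciprocal can change the rounding.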


Golden rules: Select

Convert IF into SELECT:

if(x>y)
    result = val1;
else
    result = val2;

becomes:

result = (x>y) ? val1 : val2;

Generates the following ST200 assembler:

Cycle 1: cmpgt b0,x,y;;
Cycle 2: slct b0,result,val1,val2;;


Golden rules: Min/Max

• Min (clipping to an upper bound):
    x = (x>255) ? 255 : x;
• Generates:
    min x,255,x;;
• Max (clipping to a lower bound):
    x = (x<0) ? 0 : x;
• Generates:
    max x,0,x;;
• Clipping between 0-255 is therefore:
    x = (x>255) ? 255 : (x<0) ? 0 : x;
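A minimal sketch of the clipping idiom as a C function follows; the name `clip_0_255` is hypothetical. Note that the upper bound is a minimum, min(x, 255), and the lower bound is a maximum, max(x, 0). Both are written as conditional expressions so that a compiler with select or min/max operations can avoid branches.

```c
#include <assert.h>

/* Clip x to the range 0..255 using two branch-free conditional expressions. */
int clip_0_255(int x)
{
    x = (x > 255) ? 255 : x;  /* upper bound: min(x, 255) */
    x = (x < 0) ? 0 : x;      /* lower bound: max(x, 0)   */
    assert(x >= 0 && x <= 255);
    return x;
}
```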


Loop Interchange

int a[200][30][5];
for(i=0;i<200;i++)
    for(k=0;k<5;k++)
        for(j=0;j<30;j++)
            a[i][j][k] = i*j*k;

becomes:

int a[200][30][5];
for(i=0;i<200;i++)
    for(j=0;j<30;j++)
        for(k=0;k<5;k++)
            a[i][j][k] = i*j*k;

Array collapse

int a[5*30*200];
for(i=0;i<(5*30*200);i++)
    a[i] = i;

A two-dimensional array:

int A[3][3];
A(0,0) A(0,1) A(0,2)
A(1,0) A(1,1) A(1,2)
A(2,0) A(2,1) A(2,2)

is laid out in memory exactly like a one-dimensional one:

int A[3][3];  A(0,0) A(0,1) A(0,2) A(1,0) A(1,1) A(1,2) A(2,0) A(2,1) A(2,2)
int B[9];     B(0)   B(1)   B(2)   B(3)   B(4)   B(5)   B(6)   B(7)   B(8)
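The sketch below, with hypothetical function names, fills the same 200x30x5 array in three ways: with the strided loop order, with the interchanged order that walks memory contiguously, and as a collapsed flat array with a single running index. All three produce identical memory contents; only the access pattern differs.

```c
#include <assert.h>
#include <string.h>

enum { NI = 200, NJ = 30, NK = 5 };

/* j is the innermost loop: consecutive writes are NK ints apart in memory. */
void fill_j_inner(int a[NI][NJ][NK])
{
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++)
            for (int j = 0; j < NJ; j++)
                a[i][j][k] = i * j * k;
}

/* Interchanged: k is innermost, so writes go to consecutive addresses. */
void fill_k_inner(int a[NI][NJ][NK])
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)
                a[i][j][k] = i * j * k;
}

/* Collapsed: the same storage seen as one flat array with a running index. */
void fill_collapsed(int *flat)
{
    int idx = 0;
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++)
                flat[idx++] = i * j * k;
    assert(idx == NI * NJ * NK);
}

/* Helper: byte-wise comparison of two buffers. */
int same_bytes(const void *x, const void *y, size_t n)
{
    return memcmp(x, y, n) == 0;
}
```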


Loop Fusion: Basic transformation

Original:

for(i=0;i<N;i++)
    B[i] = F1(A[i]);
for(j=0;j<N;j++)
    C[j] = F2(B[j],A[j]);

Fused:

for(i=0;i<N;i++)
{
    B[i] = F1(A[i]);
    C[i] = F2(B[i],A[i]);
}

Fused, with scalar temporaries:

for(i=0;i<N;i++)
{
    int a,b;
    a = A[i];
    b = F1(a);
    C[i] = F2(b,a);
    B[i] = b;
}
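To check that fusion preserves the results, here is a minimal runnable sketch. `F1` and `F2` are hypothetical placeholder functions, since the slides do not define them, and the arrays are assumed not to overlap.

```c
#include <assert.h>

enum { N = 8 };

int F1(int x)        { return 2 * x + 1; }  /* hypothetical placeholder */
int F2(int b, int a) { return b - a; }      /* hypothetical placeholder */

/* Two separate loops: B is written to memory, then read back. */
void two_loops(const int *A, int *B, int *C)
{
    for (int i = 0; i < N; i++)
        B[i] = F1(A[i]);
    for (int j = 0; j < N; j++)
        C[j] = F2(B[j], A[j]);
}

/* Fused loop with scalar temporaries: A[i] is loaded once and
   the value of B[i] is reused from a register. */
void fused_loop(const int *A, int *B, int *C)
{
    assert(A != B && A != C && B != C);  /* arrays assumed distinct */
    for (int i = 0; i < N; i++) {
        int a = A[i];
        int b = F1(a);
        C[i] = F2(b, a);
        B[i] = b;
    }
}
```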


Loop Fusion: Bump

Original:

for(i=2;i<N;i++)
    B[i] = F1(A[i]);
for(j=0;j<N-2;j++)
    C[j] = F2(B[j+2],A[j+2]);

Fused, with the second loop's index bumped by 2:

for(i=2;i<N;i++)
{
    B[i] = F1(A[i]);
    C[i-2] = F2(B[i],A[i]);
}

Fused, with scalar temporaries:

for(i=2;i<N;i++)
{
    int a,b;
    a = A[i];
    b = F1(a);
    C[i-2] = F2(b,a);
    B[i] = b;
}


Loop Fusion: Reverse

Original:

for(i=0;i<=N;i++)
    B[i] = F1(A[i]);
for(j=N;j>=0;j--)
    C[j] = F2(B[N-j],A[N-j]);

Second loop reversed:

for(i=0;i<=N;i++)
    B[i] = F1(A[i]);
for(j=0;j<=N;j++)
    C[N-j] = F2(B[j],A[j]);

Fused, with scalar temporaries:

for(i=0;i<=N;i++)
{
    int a,b;
    a = A[i];
    b = F1(a);
    C[N-i] = F2(b,a);
    B[i] = b;
}


Loop unrolling

Original:

for(i=0;i<N;i++)
{
    something(i);
}

Unrolled by 4 (N must be a multiple of 4):

for(i=0;i<N;i+=4)
{
    something(i);
    something(i+1);
    something(i+2);
    something(i+3);
}


Loop unrolling: Example (reduction)

Original:

sum=0;
for(j=0;j<M;j++)
{
    for(i=0;i<N;i++)
    {
        sum+=B[j]*C[i];
    }
}

Loops interchanged, inner loop unrolled by 4 with independent accumulators (M must be a multiple of 4):

sum1=sum2=sum3=sum4=0;
for(i=0;i<N;i++)
{
    int c = C[i];
    for(j=0;j<M;j+=4)
    {
        sum1 += B[j  ]*c;
        sum2 += B[j+1]*c;
        sum3 += B[j+2]*c;
        sum4 += B[j+3]*c;
    }
}
sum = sum1+sum2+sum3+sum4;
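The unrolled reduction only covers the case where M is a multiple of 4. The sketch below, with hypothetical function names, adds a remainder loop for the leftover iterations and checks the result against the straightforward version. Accumulators are `long` here to reduce the risk of overflow in the illustration.

```c
#include <assert.h>

/* Straightforward double loop, used as the reference. */
long sum_products_reference(const int *B, int M, const int *C, int N)
{
    long sum = 0;
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            sum += (long)B[j] * C[i];
    return sum;
}

/* Unrolled by 4 with independent accumulators, plus a remainder loop. */
long sum_products_unrolled(const int *B, int M, const int *C, int N)
{
    assert(M >= 0 && N >= 0);
    long sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    const int m4 = M - (M % 4);  /* largest multiple of 4 not above M */
    for (int i = 0; i < N; i++) {
        const long c = C[i];
        int j;
        for (j = 0; j < m4; j += 4) {
            sum1 += B[j]     * c;
            sum2 += B[j + 1] * c;
            sum3 += B[j + 2] * c;
            sum4 += B[j + 3] * c;
        }
        for (; j < M; j++)       /* remainder: at most 3 iterations */
            sum1 += B[j] * c;
    }
    return sum1 + sum2 + sum3 + sum4;
}
```

The four accumulators have no dependence on each other, which is what lets a 4-issue VLIW schedule the four multiply-adds in parallel.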


Software Pipeline

Original:

for(i=0;i<N;i++)
{
    int a = src1[i];
    int b = src2[i];
    d = a*b;
    F(d);
}

Software pipelined:

/* Prologue */
int a = src1[0];
int b = src2[0];

for(i=0;i<N-1;i++)
{
    d = a*b;
    F(d);
    a = src1[i+1];
    b = src2[i+1];
}

/* Epilogue */
d = a*b;
F(d);
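A runnable sketch of the same transformation follows. Here the consumer F(d) is replaced by a store to a hypothetical `out` array so that the two versions can be compared, and a guard is added for an empty input, since the prologue always reads element 0.

```c
#include <assert.h>

/* Plain loop: each iteration loads, multiplies and stores. */
void products_plain(const int *src1, const int *src2, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        int a = src1[i];
        int b = src2[i];
        out[i] = a * b;
    }
}

/* Software pipelined: the loads for iteration i+1 are issued while
   the result of iteration i is being computed and stored. */
void products_pipelined(const int *src1, const int *src2, int *out, int n)
{
    assert(n >= 0);
    if (n == 0)
        return;               /* the prologue would read src1[0] */

    int a = src1[0];          /* prologue: first loads */
    int b = src2[0];
    int i;
    for (i = 0; i < n - 1; i++) {
        out[i] = a * b;       /* compute with the values already loaded */
        a = src1[i + 1];      /* load ahead for the next iteration */
        b = src2[i + 1];
    }
    out[i] = a * b;           /* epilogue: last element, i == n - 1 */
}
```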