Progress on media processor design

Xiaolang Yan, Xing Qin, Jian Yang, Xiaohua Luo, Peiyong Zhang, Dake Liu

Embedded DSP Research & Develop Group

Presented by Chunyue Liu

Outline

- Overview of media processor
- Progress on Spock
- Progress on Schubert
  - Overview
  - Key features
  - Performance
- Conclusions & Problems

Background and Challenges

Media applications have very high computation complexity
- H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS

Media processors are in demand
- Some state-of-the-art media processors (e.g. Nomadik, DaVinci)

Multiple standards coexist
- Flexible & programmable

Our current IC design level constraint ([email protected])

ASIP is the best choice

Figure: Nomadik (ARM, audio accelerator, video accelerator, communication network)

Figure: our proposal at IC-DFN’05 (general MCU, enhanced DSP, vector processor)
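To put the 30 GOPS figure quoted above in perspective, the small sketch below converts a compute budget into operations available per pixel. The function name `ops_per_pixel` is illustrative and does not come from the slides; only the numbers (720 x 576, 30 frames/s, 30 GOPS) do.

```c
#include <assert.h>
#include <stdint.h>

/* Operations available per pixel for a given compute budget.
 * Integer division, so the result is rounded down. */
uint64_t ops_per_pixel(uint64_t ops_per_second,
                       uint32_t width, uint32_t height, uint32_t fps)
{
    uint64_t pixels_per_second = (uint64_t)width * height * fps;
    assert(pixels_per_second != 0);   /* guard against division by zero */
    return ops_per_second / pixels_per_second;
}
```

With the slide's numbers, `ops_per_pixel(30000000000ULL, 720, 576, 30)` gives roughly 2,400 operations per pixel, which illustrates why a general-purpose core alone is not enough.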

Overview of media processor

Programmable and heterogeneous processors on a SoC platform
- General MCU (CK510, a 32-bit RISC core)
  - Interface (GUI), OS (Linux)
- Enhanced DSP (Spock)
  - Audio processing, bitstream parsing, data transfer
- Vector processor (Schubert)
  - Video processing

Figure (SoC block diagram): Schubert (core, external bus interface, matrix memory controller, matrix memory); Spock (DM, TM); DMA; SDRAM controller; off-chip memory; mailbox; peripheral; AMBA bus; CK510; media access controller.


Progress on Spock


- Overview

- Key features


Progress on Spock

Developed tool chain
- Assembler, simulator and debugger

FPGA prototype: real-time decoding
- 128 kb/s OGG @ 40 MHz

To test Spock, a dual-core SoC platform was developed
- Integrated with CK510
- Inter-processor communication uses mailbox and shared memory
- 0.18 um, less than 500 mW, 166 MHz
- CK510 core area: 2 x 2 mm2
- Spock core area: 1.5 x 1.5 mm2
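The mailbox plus shared memory scheme mentioned above can be pictured with a small software model. This is only a sketch under assumptions: the `mailbox_t` layout, the single-slot behavior and the function names are hypothetical and are not the CK510/Spock hardware interface. It is also a single-threaded model; a real dual-core system would need the hardware's ordering guarantees or memory barriers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-slot mailbox visible to both cores.
 * On real hardware these fields would be memory-mapped registers. */
typedef struct {
    volatile uint32_t full;     /* 1 while a message is waiting */
    volatile uint32_t message;  /* e.g. an offset into shared memory */
} mailbox_t;

/* Sender side: returns 1 if posted, 0 if the previous message is unread. */
int mailbox_post(mailbox_t *mb, uint32_t message)
{
    assert(mb != 0);
    if (mb->full)
        return 0;
    mb->message = message;
    mb->full = 1;               /* publish after the payload is written */
    return 1;
}

/* Receiver side: returns 1 and stores the message, or 0 if empty. */
int mailbox_fetch(mailbox_t *mb, uint32_t *message)
{
    assert(mb != 0 && message != 0);
    if (!mb->full)
        return 0;
    *message = mb->message;
    mb->full = 0;               /* free the slot for the next post */
    return 1;
}
```

In this model the bulk data stays in shared memory and only a small descriptor (here a 32-bit value) travels through the mailbox.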

Overview of Spock

Figure (pipeline diagram): stages IF, ID, RF, EX1, EX2/MEM; blocks shown include PM, PC FSM, PC MUX (pc+2, br), aligner, decode, dependency table, issuing logic, GPR, operand bypass, vlx, alu, mul, acc, address adder, DM/TM, write buffer and external bus.

Optimization for control
- Branch optimization:
  - Conditional execution
  - 2-level hardware loop, repeat

Optimization for signal processing
- Multiple addressing modes:
  - Post-address ++/--
  - Reverse/modulo addressing
- MAC with parallel load
- VLX instruction set extension:
  - putbits, showbits, getbits, etc.
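The VLX instructions named above are listed without further detail, so the following is a software sketch of what `showbits` and `getbits` usually mean in video decoder code. The MSB-first bit order, the `bitreader_t` type and the zero padding past the end of the buffer are assumptions, not Spock's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of an MSB-first bitstream reader. */
typedef struct {
    const uint8_t *buf;   /* bitstream bytes */
    size_t size;          /* buffer length in bytes */
    size_t bitpos;        /* next bit to read, counted from the start */
} bitreader_t;

/* Peek at the next n bits (1..32) without consuming them.
 * Bits past the end of the buffer read as zero. */
uint32_t showbits(const bitreader_t *br, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint32_t value = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t pos = br->bitpos + i;
        uint32_t bit = 0;
        if (pos / 8 < br->size)
            bit = (br->buf[pos / 8] >> (7 - pos % 8)) & 1u;
        value = (value << 1) | bit;
    }
    return value;
}

/* Read the next n bits and advance the read position. */
uint32_t getbits(bitreader_t *br, unsigned n)
{
    uint32_t value = showbits(br, n);
    br->bitpos += n;
    return value;
}
```

The loop costs one iteration per bit in software. A dedicated instruction avoids this per-bit work, which is presumably the motivation for the extension in a bitstream-parsing DSP.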


Progress on Schubert

Released 316 novel instructions
- SIMD and RISC

Developed tool chain
- Assembler
- Cycle-accurate simulator

Mapped kernels
- H.264/AVC
  - IT/IIT, intra/inter-prediction
  - De-blocking, motion estimation
- MPEG2
  - DCT, motion compensation

Micro-architecture is designed
- Estimated area: 3.5 x 3.5 [email protected] with a 70 KB SRAM

Figure (Design Methodology flow chart), boxes in the order extracted:
- Application coverage to function coverage
- SW-HW partition: 10%-90% locality
- Assembly instruction set specification
- Design of assembler and simulator
- Build golden model
- Benchmark instruction set
- Behavior function verification
- Micro-architecture design
- RTL coding
- Backend design
- Design for test
- RTL code verification
- Test chip fabrication & test board prototype
- Decision: good performance?

Key features of Schubert

- Dual clusters and dual coupling pipelines
  - SIMD combined with VLIW architecture
- Explicit Data Organization SIMD (EDO-SIMD)
- 2-dimensional and byte-aligned addressing storage
- Cycle-accurate instruction set simulator

Dual clusters and dual coupling pipelines

Two clusters:
- Cluster0: computation (+/-, *, &, >/<, etc.)
- Cluster1: data conversion & LD/ST
- Based on Decoupled Access & Execution (DAE)

Two pipelines:
- Each cluster holds its own execution-stage pipeline
- Both clusters share the IF & ID pipeline stages

Advantages
- Parallelize computation operations with non-computation operations
- Perform well on cycle count
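The cycle-count advantage claimed above can be illustrated with a deliberately idealized model. It assumes one operation per cycle per cluster and ignores dependencies, stalls and pipeline fill, so it gives a best case rather than a description of the Schubert pipeline. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles when every operation issues one after another on one pipeline. */
uint32_t cycles_single_issue(uint32_t compute_ops, uint32_t memory_ops)
{
    assert(compute_ops <= UINT32_MAX - memory_ops);   /* no overflow */
    return compute_ops + memory_ops;
}

/* Best-case cycles when computation ops and load/store or data-conversion
 * ops issue side by side on two clusters (no stalls modeled). */
uint32_t cycles_dual_cluster(uint32_t compute_ops, uint32_t memory_ops)
{
    return compute_ops > memory_ops ? compute_ops : memory_ops;
}
```

For a kernel with 60 computation and 40 non-computation operations, the model gives 100 cycles for single issue and 60 for the dual-cluster best case.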

Figure (pipeline diagram): shared front-end stages IF, DP, UD (instruction fetch and instruction decoder); Cluster0 execution pipeline RF0, EX0, EX1, EX2, EX3, WB0; Cluster1 execution pipeline RF1, AD0/PERM, AD1, MEM, WB1.

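Dual clusters and dual coupling pipelines

Figure (datapath diagram): Cluster 0 with stages RF0, EX0, EX1, EX2, EX3, WB and four accumulators (ACC); Cluster 1 with stages RF1, AD0/PERM, AD1, MEM, WB; other blocks shown are ADG, the general register file and memory.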

Explicit Data Organization SIMD ISA

Bottleneck of conventional SIMD ISA
- SIMD is inefficient when sub-word data are not aligned with each other
- SIMD is less flexible than VLIW

Cycle percent of conventional SIMD ISA

SIMD class    VIS      MMX/SSE   AltiVec
Ld/St         11.70%   21.00%    17.90%
Organize      9.70%    12.60%    17%
Integer ALU   13.60%   18.80%    11.80%
Float ALU     --       9.30%     6.90%

Annotations on the table: "This overhead is reduced by Dual-Cluster" and "How to reduce this overhead?"

Related works
- Complex streamed instructions, TU Delft
- Stream buffer, stream processor, Stanford University
- Indirect register addressing, Elite project, IBM

Explicit Data Organization SIMD ISA

Proposed EDO-SIMD ISA
- Explicit data organization information (e.g. 3x8|3:4:7:0:1:2:6:5)
  - Indicates operand relations (align, merge, extract, broadcast, cross)
- Append a permutation network onto the RF pipeline stage of Cluster0
- Add a permutation pipeline in Cluster1 in parallel with AD0

Advantages
- Merge organization with computation to reduce overhead
- As flexible as VLIW
- Simplified implementation

Figure labels: interpolate, DCT, intra predict, IIT

Example (registers vR0, vR1, vR2; values in hex):

vOADD vR2<3x8|3:4:7:0:1:2:6:5>, vR1, vR0

  operand:                      34 12 10 1a 2f 02 10 a0
  permuted by 3:4:7:0:1:2:6:5:  1a 2f a0 34 12 10 10 02
  added lane by lane to:        03 02 00 04 02 01 00 03
  result:                       1d 31 a0 38 14 11 10 05
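The vOADD example above can be reproduced with a short software model: each output lane takes the source lane named by the pattern and adds the matching lane of the second operand. The function name `permute_add8`, the wrap-around on overflow and the reading of the pattern as "output lane i takes source lane pattern[i]" are assumptions. The last one is consistent with the hex values shown in the example.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* out[i] = src[pattern[i]] + addend[i] for each 8-bit lane.
 * Sums wrap modulo 256 (an assumption; the slides do not say
 * whether the real instruction wraps or saturates). */
void permute_add8(const uint8_t src[LANES], const uint8_t pattern[LANES],
                  const uint8_t addend[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(pattern[i] < LANES);   /* pattern entries index source lanes */
        out[i] = (uint8_t)(src[pattern[i]] + addend[i]);
    }
}
```

A conventional SIMD ISA would need a separate shuffle instruction before the add. Here the organization step and the computation are expressed as one operation, which is the point of the proposal.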

2-D stream storage and addressing

Multimedia temporal data behavior
- 2-D block by block
- Row and column access
- Byte alignment
- Flexible block jumping

Conventional 1-D addressing imposes burdens on computation elements for address generation and address alignment tasks

Related works
- Linear addressing with circular buffer, Blackfin
- Special transpose unit, TriMedia

Figure: 2-D blocks B0 and B1 with offsets (ox0, oy0), (ox1, oy1), (ox2, oy2), illustrating row access, column access and block jump.

2-D stream storage and addressing

Proposed storage and addressing mode
- 2-D stream storage (base, 2-D stride, 2-D offset)
- Row and interleaved data arrangement (row access & column access)
- Base update for block jump (UPDATE B0, OX0, OY0, B0)
- C-like programming model is friendly to the programmer

    asm: vLDOBR B0, 4, 2, vR0;
    C:   for (i = 0; i < 8; i++)
             r[i] = b[2][4 + i];

Advantages
- Reduce addressing and aligning overhead (avoid transpose)
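The (base, 2-D stride, 2-D offset) scheme can be sketched in software as follows. The `stream2d_t` type and the function names are hypothetical, and the address formula `base + y * y_stride + x * x_stride` is an assumed reading of the slide. The row case matches the C loop `r[i] = b[2][4 + i]` given for `vLDOBR B0, 4, 2, vR0`. No bounds checking is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-D stream descriptor over a flat byte memory. */
typedef struct {
    size_t base;       /* address of element (0, 0) */
    size_t x_stride;   /* bytes between horizontally adjacent elements */
    size_t y_stride;   /* bytes between vertically adjacent elements */
} stream2d_t;

static size_t element_address(const stream2d_t *s, size_t x, size_t y)
{
    return s->base + y * s->y_stride + x * s->x_stride;
}

/* Row access: out[i] = b[oy][ox + i]. */
void load_row(const uint8_t *mem, const stream2d_t *s,
              size_t ox, size_t oy, uint8_t *out, size_t n)
{
    assert(mem != NULL && s != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = mem[element_address(s, ox + i, oy)];
}

/* Column access: out[i] = b[oy + i][ox], with no transpose step. */
void load_col(const uint8_t *mem, const stream2d_t *s,
              size_t ox, size_t oy, uint8_t *out, size_t n)
{
    assert(mem != NULL && s != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = mem[element_address(s, ox, oy + i)];
}
```

A block jump then amounts to updating `base`, and row and column accesses share the same descriptor.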

Figure: logic space of the 2-D storage, showing rows of elements 0 to 7, annotated with base address, x offset, y offset, x stride and y stride.

Cycle-accurate instruction set simulator

Useful for benchmarking and ISA design space exploration during the early stage
- Input is an assembly text program, not binary code
- Focus on function, not micro-architecture

Consists of
- Resource modeling
- ISA function modeling at each pipeline stage
- Behavior and timing modeling
- Debug and profiling support

Three people for two months of work; about 60,000 lines of C++ code

Figure (simulator structure): resource model, ISA model, behavior & timing model, support; pipeline stages IF, IS, ID/RF, EX0, EX1, EX2, WB; operations shown are decode, perm, read OP, mult, add, logic, shift/rounding, reduction and write back.

Benchmarking and performance

Mapped benchmarks:
- Full H.264 baseline decoder kernels such as integer transform, intra prediction, interpolation and de-blocking
- H.264 fast motion estimation
- MPEG2 motion compensation and DCT/IDCT

The cycle-accurate and functionally correct programs help:
- Make the assembler and simulator more robust
- Demonstrate the performance of the ISA
- Explore and refine the ISA (more than 900 instructions were refined to 316 in the end)

Performance
- 4-CIF (704x576) H.264 baseline real-time decoder @ 200 MHz
- 16 kB code size for the H.264 baseline decoder
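The 4-CIF @ 200 MHz figure above implies a cycle budget per 16x16 macroblock, which the sketch below computes. The frame rate is an assumption: the slides say only "real-time", so `fps` is a parameter (25 and 30 are common choices). The function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available per 16x16 macroblock.
 * Frame dimensions are assumed to be non-zero multiples of 16.
 * Integer division, so the result is rounded down. */
uint32_t cycles_per_macroblock(uint32_t clock_hz, uint32_t width,
                               uint32_t height, uint32_t fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    assert(width % 16 == 0 && height % 16 == 0);
    uint32_t mbs_per_frame = (width / 16) * (height / 16);
    return clock_hz / (mbs_per_frame * fps);
}
```

A 704x576 frame holds 44 x 36 = 1584 macroblocks, so at an assumed 30 frames/s the decoder has about 4,200 cycles per macroblock at 200 MHz, and about 5,050 at an assumed 25 frames/s.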

Figure (bar chart): cycles for 8x8 IDCT with IEEE-compliant precision, on a 0 to 600 cycle axis, comparing RISC-Media[10], MMX, TMS320C6x, NEC V830, VIRAM and the proposed design.

Conclusions

Integration of a general MCU with heterogeneous ASIPs in a SoC platform is a good choice for media processing in China
- A good trade-off between performance and flexibility
- Overcomes our IC design level constraint ([email protected])

Progress on our media processor
- CK510 and Spock are finished
- A dual-core SoC of CK510 and Spock is taped out
- Novel features of Schubert are verified and the RTL implementation is ongoing

Figure: SoC block diagram, repeated from "Overview of media processor".

Problems

Figure: Design Methodology flow chart, repeated from "Progress on Schubert", with the added label "Behavior synthesis tool".

The behavior synthesis stage in our ASIP design depends on human experience rather than tools, which takes too much effort.

It would be very valuable to research and develop CAD tools for design space exploration of the ASIP ISA and ASIP SoC communication during the early stage.

Thank you!!!
