Progress on media processor design
Xiaolang Yan, Xing Qin, Jian Yang, Xiaohua Luo, Peiyong Zhang, Dake Liu
Embedded DSP Research & Develop Group
Presented by Chunyue Liu
Outline
Overview of media processor
Progress on Spock
Progress on Schubert
- Overview
- Key features
- Performance
Conclusions & Problems
Background and Challenges
[Figure: Nomadik block diagram (ARM, Audio Accelerator, Video Accelerator, Communication network), with the labels General MCU, Enhanced DSP, Vector Processor]
Media applications have very high computation complexity
- H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
Media processors are in demand
- Some state-of-the-art media processors (e.g. Nomadik, DaVinci)
Multiple standards coexist
- Flexible & programmable
Our current IC design level constraint ([email protected])
ASIP is the best choice
Our proposal at IC-DFN’05
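As a rough sanity check on the 30 GOPS figure, the sketch below divides an operation budget by the pixel rate of the video format. The function name ops_per_pixel is illustrative and does not come from the slides.

```c
/* Operations available per pixel for a given throughput and video format.
   Illustrative helper only; it is not part of the original slides. */
double ops_per_pixel(double ops_per_second, int width, int height, int fps)
{
    double pixels_per_second = (double)width * (double)height * (double)fps;
    return ops_per_second / pixels_per_second;
}
```

With 30e9 operations per second and 720 x 576 pixels at 30 frames/s, this comes to roughly 2,400 operations per pixel.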
Overview of media processor
Programmable and heterogeneous processors on a SoC platform
- General MCU (CK510, a 32-bit RISC core)
  Interface (GUI), OS (Linux)
- Enhanced DSP (Spock)
  Audio processing, bitstream parsing, data transfer
- Vector processor (Schubert)
  Video processing
[Figure: SoC block diagram. Blocks: Core, External Bus Interface, Matrix Memory Controller, Matrix memory, Schubert, DM, TM, Spock, DMA, SDRAM Controller, Off-chip Memory, MailBox, peripheral, AMBA BUS, CK510, Media Access Controller]
Progress on Spock
Developed tool chain
- Assembler, Simulator and Debugger
FPGA prototype: real-time decoding
- 128 kb/s OGG @ 40 MHz
To test Spock, a dual-core SoC platform was developed
- Integrated with CK510
- Inter-processor communication uses mailbox and shared memory
- 0.18 um, less than 500 mW, 166 MHz
- CK510 core area: 2 x 2 mm2
- Spock core area: 1.5 x 1.5 mm2
Overview of Spock
[Figure: Spock pipeline. Stages: IF, ID, RF, EX1, EX2/MEM. Blocks: PM, PC, FSM, PC MUX (pc+2, br), Aligner, Decode, dependency table, Issuing logic, GPR, Operand Bypass, vlx, alu, mul, Address adder, acc, Write buffer, DM/TM, External Bus]
Optimization for Control
- Branch optimization:
  conditional execution
  2-level hardware loop, repeat
Optimization for Signal Processing
- Multiple addressing modes:
  Post address ++/--
  Reverse/modulo addressing
- MAC with parallel load
- VLX instruction set extension:
  putbits, showbits, getbits, etc.
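To illustrate what the showbits and getbits operations typically do in bitstream parsing, here is a minimal software model. It is only a sketch: the BitReader type, the MSB-first bit order and the zero padding past the end of the buffer are assumptions, and the slides do not describe how the Spock VLX instructions are actually defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a bitstream reader; all names are illustrative. */
typedef struct {
    const uint8_t *data;   /* bitstream bytes */
    size_t size;           /* number of bytes in data */
    size_t bitpos;         /* index of the next unread bit, MSB first */
} BitReader;

/* showbits: return the next n bits (1 <= n <= 32) without consuming them.
   Bits past the end of the buffer read as zero (an assumption). */
uint32_t showbits(const BitReader *br, unsigned n)
{
    uint32_t value = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t pos = br->bitpos + i;
        size_t byte = pos >> 3;
        uint32_t bit = 0;
        if (byte < br->size)
            bit = (uint32_t)(br->data[byte] >> (7u - (unsigned)(pos & 7u))) & 1u;
        value = (value << 1) | bit;
    }
    return value;
}

/* getbits: return the next n bits and advance the read position. */
uint32_t getbits(BitReader *br, unsigned n)
{
    uint32_t value = showbits(br, n);
    br->bitpos += n;
    return value;
}
```

In software each call costs a loop like the one above, or several shift and mask operations; a dedicated instruction set extension would presumably exist to shorten exactly this kind of sequence.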
Progress on Schubert
[Figure: Design Methodology flow]
- Application coverage to function coverage
- SW-HW partition: 10%-90% locality
- Assembly instruction set specification
- Design of Assembler and Simulator
- Build golden model
- Benchmark instruction set
- Behavior function verification
- Micro-architecture design
- RTL coding
- Backend design
- Design for test
- RTL code verification
- Test chip fabrication & test board prototype
- Decision node: Good performance?
Released 316 novel instructions
- SIMD and RISC
Developed tool chain
- Assembler
- Cycle-accurate simulator
Mapped kernels
- H.264/AVC: IT/IIT, intra/inter-prediction, de-blocking, motion estimation
- MPEG2: DCT, motion compensation
Micro-architecture is designed
- Estimated area: 3.5 x 3.5 [email protected] with a 70 KB SRAM
Key features of Schubert
Dual clusters and dual coupling pipelines
- SIMD combined with VLIW architecture
Explicit Data Organization SIMD (EDO-SIMD)
2-dimensional and byte-aligned addressing storage
Cycle-accurate instruction set simulator
Dual clusters and dual coupling pipelines
Two clusters:
- Cluster0: Computation (+/-, *, &, >/<, etc.)
- Cluster1: Data conversion & LD/ST
- Based on Decoupled Access & Execution (DAE)
Two pipelines:
- Each cluster holds its own execution-level pipeline
- Both share the IF & ID level pipeline
Advantages
- Parallelize computation operations with non-computation operations
- Perform well on cycle count
[Figure: dual pipelines. Front end: IF, DP, UD (labels: Instruction fetch, Instruction decoder). Cluster0 executive: RF0, EX0, EX1, EX2, EX3, WB0. Cluster1 executive: RF1, AD0/PERM, AD1, MEM, WB1]
Dual clusters and dual coupling pipelines
[Figure: dual-cluster datapath. Cluster 0: RF0, EX0, EX1, EX2, EX3, WB. Cluster 1: RF1, AD0/PERM, AD1, MEM, WB. Other blocks: ADG, ACC (x4), General Register File, Memory]
Explicit Data Organization SIMD ISA
Bottleneck of conventional SIMD ISA
- SIMD is inefficient if sub-word data are not aligned with each other
- SIMD is less flexible than VLIW

Cycle percent of conventional SIMD ISA
SIMD class     VIS       MMX/SSE   AltiVec
Ld/St          11.70%    21.00%    17.90%
Organize       9.70%     12.60%    17%
Integer ALU    13.60%    18.80%    11.80%
Float ALU      --        9.30%     6.90%

Annotations on the table:
- This overhead is reduced by Dual-Cluster
- How to reduce this overhead?
Related works
- Complex streamed instruction, TU Delft
- Stream buffer, Stream processor, Stanford University
- Indirect register addressing, Elite project, IBM
Explicit Data Organization SIMD ISA
Proposed EDO-SIMD ISA
- Explicit data organization information (e.g. 3x8|3:4:7:0:1:2:6:5)
  indicates operand relations (align, merge, extract, broadcast, cross)
- Append a permutation network onto the RF pipeline of Cluster0
- Add a permutation pipeline in Cluster1 in parallel with AD0
Advantages
- Merge organization with computation to reduce overhead
- As flexible as VLIW
- Simplified implementation
[Figure labels: interpolate, DCT, Intra predict, IIT]
Example: vOADD vR2<3x8|3:4:7:0:1:2:6:5>, vR1, vR0
  source row:                34 12 10 1a 2f 02 10 a0
  permuted 3:4:7:0:1:2:6:5:  1a 2f a0 34 12 10 10 02
  addend row:                03 02 00 04 02 01 00 03
  lane-wise sum:             1d 31 a0 38 14 11 10 05
  (register labels in the figure: vR0, vR1, vR2)
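The vOADD example above can be reproduced in a few lines of scalar code. This is a sketch under stated assumptions: the permutation is read as out[i] = src[perm[i]], which matches the byte rows shown on the slide, and lanes wrap modulo 256 because the slides do not say whether the instruction saturates. The name permute_add is hypothetical.

```c
#include <stdint.h>

#define LANES 8

/* Organize src with perm (organized[i] = src[perm[i]]), then add addend
   lane by lane. dst must not alias src. Wrap-around arithmetic is an
   assumption; the slides do not specify saturation. */
void permute_add(uint8_t dst[LANES], const uint8_t src[LANES],
                 const uint8_t perm[LANES], const uint8_t addend[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint8_t organized = src[perm[i] % LANES];
        dst[i] = (uint8_t)(organized + addend[i]);
    }
}
```

On a conventional SIMD ISA the permutation would be a separate "organize" instruction before the add; folding it into the arithmetic instruction is what the slide means by merging organization with computation.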
2-D stream storage and addressing
Multimedia temporal data behavior
- 2-D block by block
- Row and column access
- Byte alignment
- Flexible block jumping
Conventional 1-D addressing imposes burdens on computation elements for address generation and address alignment tasks
Related works
- Linear addressing with circular buffer, Blackfin
- Special transpose unit, TriMedia
[Figure: 2-D blocks B0 and B1 with offsets (ox0, oy0), (ox1, oy1), (ox2, oy2); labels: Row access, Column access, Block jump]
2-D stream storage and addressing
Proposed storage and addressing mode
- 2-D stream storage (base, 2-D stride, 2-D offset)
- Row and interleaved data arrangement (row access & column access)
- Base update for block jump (UPDATE B0, OX0, OY0, B0)
- C-like programming model is friendly to the programmer
  asm: vLDOBR B0, 4, 2, vR0;
  C:   for(i=0; i<8; i++)
           r[i] = b[2][4+i];
Advantages
- Reduce addressing and aligning overhead (avoid transpose)
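The (base, 2-D stride, 2-D offset) scheme can be modelled in software as below. This is a hypothetical sketch: the Stream2D type and the function names are illustrative, the operand order (x offset, then y offset) is read off the C equivalent of vLDOBR on the slide, and the hardware's interleaved physical layout is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of a 2-D stream; field names are illustrative. */
typedef struct {
    const uint8_t *base;  /* base address of the current block */
    size_t x_stride;      /* bytes between horizontally adjacent elements */
    size_t y_stride;      /* bytes between vertically adjacent rows */
} Stream2D;

/* Row access: load n elements of row oy, starting at column ox. */
void load_row(const Stream2D *s, size_t ox, size_t oy, uint8_t *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = s->base[oy * s->y_stride + (ox + i) * s->x_stride];
}

/* Column access: load n elements of column ox, starting at row oy. */
void load_column(const Stream2D *s, size_t ox, size_t oy, uint8_t *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = s->base[(oy + i) * s->y_stride + ox * s->x_stride];
}

/* Block jump: move the base by (ox, oy) elements. */
void update_base(Stream2D *s, size_t ox, size_t oy)
{
    s->base += oy * s->y_stride + ox * s->x_stride;
}
```

With x_stride = 1 and y_stride equal to the row pitch, load_row(&s, 4, 2, r, 8) reads the same elements as the slide's loop r[i] = b[2][4+i], and load_column reads a column without a separate transpose step.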
[Figure: row and interleaved data arrangement in the logic space; labels: Base Address, x offset, y offset, x stride, y stride, Logic Space]
Cycle-accurate instruction set simulator
Useful for benchmarking and ISA design space exploration during the early stage
- Input is an assembly text program, not binary code
- Focus on function, not micro-architecture
Consists of
- Resource modeling
- ISA function modeling at each pipeline stage
- Behavior and timing modeling
- Debug and profiling support
3 people for 2 months of work, about 60,000 lines of C++ code
[Figure: simulator structure. Layers: Resource Model, ISA model, Behavior & Timing model, Support. Pipeline labels: IF, IS, ID/RF, EX0, EX1, EX2, WB. Function labels: Decode, Perm, Read OP, Mult, Add, Logic, Shift/rounding, reduction, Write back]
Benchmarking and performance
Mapped benchmarks:
- Full H.264 baseline decoder kernels like integer transform, intra prediction, interpolation and de-blocking
- H.264 fast motion estimation
- MPEG2 motion compensation and DCT/IDCT
The cycle-accurate and functionally correct programs help:
- Make the assembler and simulator more robust
- Demonstrate the performance of the ISA
- Explore and refine the ISA (more than 900 instructions are refined to 316 in the end)
Performance
- 4-CIF (704x576) H.264 baseline real-time decoder @ 200 MHz
- 16 kB code size for H.264 baseline decoder
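To put the 200 MHz figure in perspective, the sketch below computes the cycle budget per 16x16 macroblock. The frame rate is an assumption, since the slides only say "real-time", and the helper name is illustrative.

```c
/* Cycles available per 16x16 macroblock at a given clock and frame rate.
   Frame dimensions are rounded up to whole macroblocks. The frame rate is
   a parameter because the slides do not state it. */
double cycles_per_macroblock(double clock_hz, int width, int height, double fps)
{
    int mbs_per_frame = ((width + 15) / 16) * ((height + 15) / 16);
    return clock_hz / ((double)mbs_per_frame * fps);
}
```

A 704x576 frame has 44 x 36 = 1584 macroblocks, so at 200 MHz the decoder would have roughly 5,050 cycles per macroblock at an assumed 25 frames/s, or about 4,209 at an assumed 30 frames/s.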
[Figure: bar chart, cycles for 8x8 IDCT with IEEE compliant precision (y-axis 0 to 600). Compared: RISC-Media[10], MMX, TMS320C6x, NEC V830, VIRAM, Proposed]
Conclusions
Integration of a general MCU with heterogeneous ASIPs in a SoC platform is a good choice for media processing in China
- A good trade-off between performance and flexibility
- Overcomes our IC design level constraint ([email protected])
Progress on our media processor
- CK510 and Spock are finished
- A dual-core SoC of CK510 and Spock is taped out
- Novel features of Schubert are verified and the RTL implementation is ongoing
[Figure: SoC block diagram, same as on the "Overview of media processor" slide]
Problems
[Figure: Design Methodology flow, same as on the "Progress on Schubert" slide, with the annotation "Behavior Synthesis tool"]
The behavior synthesis stage in our ASIP design depends on human experience, not tools, which takes too much effort.
It is very valuable to research and develop CAD tools for design space exploration of the ASIP ISA and ASIP SoC communication during the early stage.
Thank you!!!