Achieving over 50% system speedup with custom instructions and multi-threading

Kaiming Ho

Fraunhofer

June 3rd, 2014

Overview

• Introduction and system description
• Motivation for work
• Optimization approach
  – using user defined instructions (UDI)
  – using multi-threading (MT)
• Results
• Concluding remarks

Video Encoder System (Overview)

[Block diagram with the following elements: video in (1080p30), dedicated hardware, MIPS processor running s/w, memory, DDR memory, ethernet out (1000Mbps)]

ethernet out (1000Mbps):
- encoded byte stream (IP/UDP/RTP)
- statistics (IP/UDP)

Example bytes: ff 4c ff 51 00 2f 00 00 07 80 00 04 38 00 ff 93 f3 b6 ...

Overview of software

• Main software is partitioned into three parts
  – Each part must finish before the next starts

[Diagram: from h/w -> PART1 (rate optimization) -> PART2 (codestream formation) -> PART3 (output to network) -> DONE]

• Timestamps are added to measure how long each part takes. Add up time for all three parts for performance metric.
  – convert absolute time to frames/sec. (33.33ms -> 30fps)
• s/w also instrumented to count instructions.
  – can calculate instr./cycle (IPC)
• h/w delivers input at 30 fps. Analyze rate at which s/w is done.
  – visualize in GUI
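The performance metric described above can be sketched in a few lines of C. This is only an illustration: the struct and function names (part_times, frame_time_ms, fps_from_ms) are hypothetical and not taken from the encoder software.

```c
#include <assert.h>

/* Hypothetical record of the measured time of PART1..PART3, in ms. */
struct part_times {
    double part_ms[3];
};

/* Performance metric: the three part times added up, in ms. */
double frame_time_ms(const struct part_times *t) {
    return t->part_ms[0] + t->part_ms[1] + t->part_ms[2];
}

/* Convert a per-frame time to frames/sec, e.g. 33.33 ms -> about 30 fps. */
double fps_from_ms(double ms) {
    assert(ms > 0.0);
    return 1000.0 / ms;
}
```

A frame time above 33.33 ms means the s/w finishes frames more slowly than the h/w delivers them at 30 fps.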


Visualization GUI

[Screenshot: performance before all optimizations]

Optimization approach

1. Identify functional hot-spots which can be replaced by user-defined custom instructions (UDI).
  – base instruction-set is extended
  – One custom instruction replaces many instructions from the base ISA.
  – Highest impact when
    • # instructions replaced is high
    • function is called often.
2. Use multi-threading (MT) to run all three parts simultaneously.
  – stalls in execution pipeline reduce instructions/cycle (IPC).
  – when one thread stalls, attempt to schedule an instruction from another thread.
  – increases effective IPC.

Using User-defined instructions (UDI)

• MIPS UDI allows complex functions to be implemented in a single custom instruction.
  – ISA is extended to include new custom instructions
  – Fully supported in compiler tool-chain.
• Instructions take the form:
    reg_result = custom_udi(reg_src1, reg_src2);
  – Two 32-bit source operands (both optional) and one 32-bit result (also optional).
  – Typical RISC style.
  – Instructions can be pure (no side-effects), or can update internal state.
• Instructions are likely domain specific.

UDI Examples (1)

• Bit accumulation, with zero-stuffing.
  – hard for 32-bit processor to do.
• <n> bits are pushed into an accumulator.
• When eight 1’s in a row occur, an extra “0” is added.
• data is popped out 16/32-bits at a time.

Example sequence:
    bitwr_push   0x1f2, 10
    bitwr_push   0xfd, 8
    bitwr_getlen r10          (r10 <= 19)
    bitwr_pop16  r11          (r11 <= 0xecff)
    bitwr_push   0x17ffd, 18

[Diagram: accumulator state after each instruction]
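To show why this is awkward in plain C, here is a minimal software model of a zero-stuffing bit accumulator. It is a sketch under assumptions: the bit order (MSB first), the pop order (oldest bits first), the 64-bit capacity and all bitacc_* names are hypothetical, and the model does not try to reproduce the register values shown in the example sequence above.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the accumulator state (illustrative only). */
struct bitacc {
    uint64_t bits;      /* pending bits; the oldest bit is the most significant */
    unsigned len;       /* number of valid bits in 'bits' (capacity: 64) */
    unsigned ones_run;  /* length of the current run of consecutive 1 bits */
};

/* Append a single bit to the accumulator. */
static void bitacc_put(struct bitacc *a, unsigned bit) {
    assert(a->len < 64);
    a->bits = (a->bits << 1) | (bit & 1u);
    a->len++;
}

/* Push the low n bits of value, MSB first. After eight 1s in a row,
 * an extra 0 is stuffed and the run counter restarts. */
void bitacc_push(struct bitacc *a, uint32_t value, unsigned n) {
    assert(n >= 1 && n <= 32);
    for (unsigned i = n; i-- > 0; ) {
        unsigned bit = (value >> i) & 1u;
        bitacc_put(a, bit);
        a->ones_run = bit ? a->ones_run + 1 : 0;
        if (a->ones_run == 8) {
            bitacc_put(a, 0);
            a->ones_run = 0;
        }
    }
}

/* Number of bits currently held, including stuffed zeros. */
unsigned bitacc_getlen(const struct bitacc *a) {
    return a->len;
}

/* Pop the 16 oldest bits. */
uint16_t bitacc_pop16(struct bitacc *a) {
    assert(a->len >= 16);
    a->len -= 16;
    uint16_t out = (uint16_t)(a->bits >> a->len);
    a->bits &= a->len ? ((UINT64_C(1) << a->len) - 1) : 0;
    return out;
}
```

Every pushed bit costs a shift, a compare and a possible stuff in this model, which is the per-bit loop that a single custom instruction can absorb.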

UDI Examples (2)

• FIFO pointer management.
  – not domain specific. Could find use in multiple applications.
• Internal state: ring_start, ring_end, rd_ptr, wr_ptr
• s/w writes one word at a time
  – check for buffer full
  – handle wraparound

    #include <stddef.h>   /* NULL */

    struct {
        unsigned *ring_start;
        unsigned *ring_end;    /* one past the last slot */
        unsigned *wr_ptr;
        unsigned *rd_ptr;
    } FIFO_PTR;

    unsigned *FIFO_PTR_INC_WP(void) {
        unsigned *retval, *next_wp;
        next_wp = retval = FIFO_PTR.wr_ptr;
        // increment and wrap
        next_wp += 1;
        if (next_wp == FIFO_PTR.ring_end)
            next_wp = FIFO_PTR.ring_start;
        // check for full
        if (next_wp == FIFO_PTR.rd_ptr)
            return NULL;
        FIFO_PTR.wr_ptr = next_wp;
        return retval;
    }

Usage:

    ptr = FIFO_PTR_INC_WP();
    if (ptr) *ptr = data;

FIFO_PTR_INC_WP() reduced to one atomic UDI:

    PC: bfc059fc UDI r3 // inc_wp
    PC: bfc05a00 BEQZ r3, 0xbfc05ac8
    PC: bfc05a04 NOP
    PC: bfc05a08 SW r3, 0(r3)
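The write-pointer logic above has a natural read-side counterpart. The sketch below is a self-contained software model of both; the read-pointer function is an assumption about what an "inc rp" operation might do, and the names fifo_ptr, fifo_init, fifo_inc_wp and fifo_inc_rp are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct fifo_ptr {
    unsigned *ring_start;  /* first slot of the ring */
    unsigned *ring_end;    /* one past the last slot */
    unsigned *wr_ptr;      /* next slot to write */
    unsigned *rd_ptr;      /* next slot to read */
};

/* Point the FIFO at a caller-supplied buffer; it starts out empty. */
void fifo_init(struct fifo_ptr *f, unsigned *buf, size_t slots) {
    assert(slots >= 2);
    f->ring_start = buf;
    f->ring_end = buf + slots;
    f->wr_ptr = f->rd_ptr = buf;
}

/* Same logic as FIFO_PTR_INC_WP(): returns the slot to write,
 * or NULL when the ring is full. */
unsigned *fifo_inc_wp(struct fifo_ptr *f) {
    unsigned *slot = f->wr_ptr;
    unsigned *next = slot + 1;
    if (next == f->ring_end)
        next = f->ring_start;          /* wrap around */
    if (next == f->rd_ptr)
        return NULL;                   /* full: one slot is always kept free */
    f->wr_ptr = next;
    return slot;
}

/* Assumed read-side counterpart: returns the slot to read,
 * or NULL when the ring is empty. */
unsigned *fifo_inc_rp(struct fifo_ptr *f) {
    unsigned *slot = f->rd_ptr;
    if (slot == f->wr_ptr)
        return NULL;                   /* empty */
    unsigned *next = slot + 1;
    if (next == f->ring_end)
        next = f->ring_start;          /* wrap around */
    f->rd_ptr = next;
    return slot;
}
```

With this full/empty convention a ring of N slots holds at most N-1 words, which is what lets "full" and "empty" be told apart from the two pointers alone.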

UDI savings

UDI name              cycles saved   instr. saved   freq. of use   overall
                      (per use)      (per use)      (per frame)    speedup
BIT WRITE (push)      46-161         29-77          20889          22% (BIT WRITE group)
BIT WRITE (get_len)   46-108         24-48          4185
BIT WRITE (pop)       31-82          16-42          3288
FIFO PTR (inc wp)     39-101         22-46          3288           1.9% (FIFO PTR group)
FIFO PTR (inc rp)     16             8              9

[Chart: cycle count and instr. count for two replaced code sequences: 13 instr, 38 cyc. and 34 instr, 57 cyc.]

• Two UDI replace 47 standard instructions, taking 95 cycles.
• UDI does not stall.
• Amount saved is dependent on input.
• # standard instructions variable.
• With UDI, always 2 instructions.

Performance gain from UDI

[Chart: frame time before and after UDI]

Before: 83.72ms
After: 62.76ms
Savings: 20.96ms (25%)

multi-threading (1)

• instructions/cycle (IPC) is a measure of efficiency in CPU execution pipeline.
  – stalls due to cache misses, multi-cycle instructions, branch penalties, etc… decrease IPC.
• A CPU working in multi-threaded mode attempts to schedule instructions from a different thread when one stalls.
  – increases effective IPC
• Programs with low IPC in single-threaded mode benefit most from multi-threading.

Representative execution statistics of our program gathered in the lab:

         cyc.       instr.
part1:   3056       1587
part2:   4597034    1954337
part3:   2454570    816940
total:   7054660    2772864

avg. IPC is 0.393

avg. IPC is low!! Expect MT to have significant impact.
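The average IPC quoted above follows directly from the counters. A minimal sketch of the arithmetic, using the lab figures from the slide (the type and function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

struct part_counts {
    uint64_t cycles;
    uint64_t instr;
};

/* Counter values quoted on the slide for part1..part3. */
const struct part_counts lab_parts[3] = {
    {    3056,    1587 },
    { 4597034, 1954337 },
    { 2454570,  816940 },
};

/* IPC of a single part. */
double part_ipc(const struct part_counts *p) {
    assert(p->cycles > 0);
    return (double)p->instr / (double)p->cycles;
}

/* Average IPC: total instructions over total cycles, so long-running
 * parts weigh more than short ones. */
double average_ipc(const struct part_counts *parts, int n) {
    uint64_t cyc = 0, ins = 0;
    for (int i = 0; i < n; i++) {
        cyc += parts[i].cycles;
        ins += parts[i].instr;
    }
    assert(cyc > 0);
    return (double)ins / (double)cyc;
}
```

Per part this gives roughly 0.52, 0.43 and 0.33; part1 is so short that it barely moves the average, which is dominated by part2 and part3.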

multi-threading (2)

• Execution of our program (in ST) over time is shown below.

[Timeline: frame1, frame2 and frame3 each run part1, part2, part3 back to back against the 30fps frame boundaries; every frame is marked "TOO SLOW"]

  – Too slow. The 30fps time budget is overrun.
• With MT, each part runs in its own thread, and the threads are interleaved.
  – overall effect is better performance.
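The effect of interleaving can be illustrated with a toy model. This is not the MIPS MT scheduler: it assumes a single-issue pipeline, a fixed stall after every instruction and round-robin selection among ready threads, and all names are hypothetical. Real workloads stall far less regularly, so measured gains are much smaller than in this idealized case.

```c
#include <assert.h>

/* One thread of the toy model. */
struct toy_thread {
    unsigned remaining;  /* instructions still to issue */
    unsigned stall;      /* stall cycles after every instruction (toy assumption) */
    unsigned ready_at;   /* first cycle at which the thread may issue again */
};

/* Interleaved execution: each cycle, issue one instruction from the next
 * ready thread in round-robin order. Returns the cycle at which the last
 * thread has completed its final instruction and stall. */
unsigned run_interleaved(struct toy_thread *t, int n) {
    unsigned cycle = 0, finish = 0;
    int next = 0;
    for (;;) {
        int pending = 0;
        for (int i = 0; i < n; i++)
            pending += (t[i].remaining > 0);
        if (pending == 0)
            return finish;
        for (int k = 0; k < n; k++) {
            int i = (next + k) % n;
            if (t[i].remaining > 0 && t[i].ready_at <= cycle) {
                t[i].remaining--;
                t[i].ready_at = cycle + 1 + t[i].stall;
                if (t[i].ready_at > finish)
                    finish = t[i].ready_at;
                next = (i + 1) % n;
                break;  /* single issue: at most one instruction per cycle */
            }
        }
        cycle++;
    }
}

/* Single-threaded execution: the parts run back to back, so every stall
 * cycle is wasted. Works on copies and leaves the input untouched. */
unsigned run_sequential(const struct toy_thread *t, int n) {
    unsigned total = 0;
    for (int i = 0; i < n; i++) {
        struct toy_thread one = t[i];
        total += run_interleaved(&one, 1);
    }
    return total;
}
```

For three threads of 100 instructions each with a 2-cycle stall, the sequential run takes 900 cycles (IPC 0.33) and the interleaved run 302 cycles, because another thread is always ready to fill the stall slots.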

Multi-threading and IRQ handling

• Traditional ST programs get interrupted when external IRQs are asserted.
  – the ‘normal’ program is suspended while the IRQ handler runs.
• When MT programs are architected the same way, ALL threads are interrupted when IRQ occurs.
  – On IRQ, CPU goes to exception level and MT is effectively turned off.
  – very inefficient. When IRQ handler stalls, cycles are wasted.
• Our program takes many interrupts. (175k / sec.)
• Different approach:
  – IRQ handler is given its own thread.
  – Assertion of IRQ does not cause a CPU interrupt. It wakes up the thread with the IRQ handler.
  – When IRQ handler runs, it is scheduled simultaneously with other threads in the system.
  – No IRQ overhead.
  – CPU never goes to exception level.

Performance gain from MT

[Chart: ST vs. MT performance; gain annotated as 45%]

Original performance: 83.72ms
With UDI and MT: 43.37ms

Discussion of Results

• Adding UDI decreases #instr. and IPC.
  – custom instructions are part of multiplier pipeline.
• When MT is used, same # instr. takes longer.
  – IPC of individual threads lower
  – Overall IPC (performance) is higher.
• lower IPC in ST means greater gain from ST->MT
• Frequency of CPU does not matter
  – Our application is not I/O or memory bound.

ST/noUDI (111MHz): 86.6ms. IPC 42.42%
        cyc.         instr.     IPC
  p1:   2*1126       1130       50.17%
  p2:   2*3301726    2967154    44.95%
  p3:   2*1508140    1114213    37.01%

ST/UDI (111MHz): 65.4ms. IPC 39.39%
        cyc.         instr.     IPC
  p1:   2*1126       1130       50.17%
  p2:   2*2118058    1741554    41.17%
  p3:   2*1508130    1114291    37.01%

MT/noUDI (111MHz): 68.6ms.
        cyc.         instr.     IPC
  p1:   2*1458       1125       38.58%
  p2:   2*3745384    2967201    39.61%
  p3:   2*1508443    1080524    35.89%
  Annotated gain: 26% (consistent with ST/noUDI 86.6ms -> MT/noUDI 68.6ms)

MT/UDI (111MHz): 43.8ms.
        cyc.         instr.     IPC
  p1:   2*1973       1125       28.50%
  p2:   2*2435277    1741563    35.76%
  p3:   2*1508548    1078515    35.83%
  Annotated gain: 49% (consistent with ST/UDI 65.4ms -> MT/UDI 43.8ms)

ST/UDI/rate_alloc (111MHz): 89.5ms. IPC 35.22%
        cyc.         instr.     IPC
  p1:   2*1339904    639013     23.84%
  p2:   2*2115672    1741554    41.17%
  p3:   2*1508090    1113907    37.00%

MT/UDI/rate_alloc (111MHz): 57.3ms. (34/30/32)
        cyc.         instr.     IPC
  p1:   2*1531915    639041     19.25%
  p2:   2*3187194    1741574    27.34%
  p3:   2*2249536    1057951    23.56%
  Annotated gain: 56% (consistent with ST/UDI/rate_alloc 89.5ms -> MT/UDI/rate_alloc 57.3ms)

• adding extra processing with memory accesses and FPU decreases IPC.
• effect of MT is enhanced.

Annotated gain: 98% (consistent with ST/noUDI 86.6ms -> MT/UDI 43.8ms)
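The percentages on the results slides appear to follow two different conventions. The MT gains (26%, 49%, 56%, 98%, and the 45% on the MT chart) match a throughput gain, before/after - 1, while the 25% UDI saving matches the fraction of frame time removed. This reading is an inference from the arithmetic, not something the slides state, and the function names below are hypothetical.

```c
#include <assert.h>

/* Throughput gain in percent: how many more frames per unit of time
 * the faster configuration processes. */
double throughput_gain_pct(double before_ms, double after_ms) {
    assert(before_ms > 0.0 && after_ms > 0.0);
    return (before_ms / after_ms - 1.0) * 100.0;
}

/* Time saved in percent of the original frame time. */
double time_saved_pct(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

For example, throughput_gain_pct(86.6, 43.8) is about 97.7, shown as 98%, whereas time_saved_pct(86.6, 43.8) for the same pair is only about 49.4. The two figures describe the same measurement, so it matters which convention a quoted percentage uses.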

Concluding Remarks

• Over 50% improvement in performance was obtained by using two simple techniques:
  – Use of custom user-defined instructions (UDI)
  – Use of multi-threading (MT) technology.
• UDI reduces the number of instructions executed. Consistently saves 20-25%.
  – Easy to implement compared to dedicated h/w design.
  – man-weeks of work vs. man-years.
• Benefit of MT is more variable.
  – Between 26-49% has been measured.
  – depends on operating point. Image complexity. IPC of application.
  – Heavily loaded systems benefit more.
  – memory or I/O bound applications benefit more


THANK YOU!!!
