ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami† †Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan ‡School of Electrical and Computer Engineering, University of Tehran


Page 1: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami†

†Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
‡School of Electrical and Computer Engineering, University of Tehran

Page 2: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Outline

• Introduction
• ADEXOR: Adaptive Extensible Processor
  – Overview
  – Microarchitecture
  – Coarse-grained Reconfigurable Functional Unit
• Evaluation
• Conclusions

Page 3: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Motivation and Solution

• Embedded processors have to achieve
  – Low cost
  – High performance
  – Low power or low energy consumption
• Key point
  – How can processors adapt to target applications?
• Solution: ASIP w/ Re-configurability
  – Application specific ISA
    • Provide custom instructions (CIs)
  – Implement re-configurable FUs

Page 4: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

ADaptive EXtensible processOR (ADEXOR)

GPP: General Purpose Processor
CRFU: Coarse-grained Reconfigurable Functional Unit

[Figure: ADEXOR block diagram. GPP blocks: Register File, ID/EXE Reg, ALU, EXE/MEM Reg. Augmented HW blocks: CRFU, RFU Configuration Memory, MUX, Counter. Annotations: "Triggered by mtc1 or sequencer", "Indexed by mtc1 or sequencer".]

Hot Basic Block

400680 subiu $25,$25,1
400688 lbu $13,0($7)
400690 lbu $2,0($4)
400698 sll $2,$2,0x18
4006a0 sra $14,$2,0x18
4006a8 addiu $4,$4,1
4006b0 srl $8,$2,0x1c
4006b8 sll $2,$8,0x2
4006c0 addu $2,$2,$25
4006c8 bgez $10,4006f0
4006d0 xori $13,$13,1
4006d8 addu $10,$10,$2
400680 subiu $25,$25,1
400698 sll $2,$2,0x18
4006a0 sra $14,$2,0x18
400688 lbu $13,0($7)
4006e0 bgez $10,4006f0
....

• Has a coarse-grained re-configurable functional unit
• Supports efficient “Multi-Exits CIs”
• Achieves high-performance and low energy

Page 5: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

CRFU Microarchitecture

• 16 FUs controlled by configuration bits
• MUX-based interconnection between FUs
• Early stage data can be transferred to output ports

[Figure: CRFU array of FUs arranged in rows (Row 1 to Row 5), with configuration bits driving the rows. FU operations shown: adder/subtractor, AND, OR, XOR, barrel shifter.]
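The CRFU description on this slide (up to 16 FUs selected by configuration bits, MUX-based routing between FUs, and early-stage values visible at the output ports) can be illustrated with a small behavioural model. The Python sketch below is not the authors' design: the configuration encoding (`FUConfig`, `src_a`, `src_b`), the operation names, and the 32-bit data width are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List

MASK = 0xFFFFFFFF   # assumed 32-bit datapath
MAX_FUS = 16        # the slide states 16 FUs

# Operations named on the slide: adder/subtractor, AND, OR, XOR, barrel shifter.
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "sll": lambda a, b: (a << (b & 31)) & MASK,
    "srl": lambda a, b: a >> (b & 31),
}

@dataclass
class FUConfig:
    """Hypothetical configuration word for one FU."""
    op: str      # which operation the FU performs
    src_a: int   # MUX select: index of a CRFU input or an earlier FU result
    src_b: int

def run_crfu(inputs: List[int], config: List[FUConfig], outputs: List[int]) -> List[int]:
    """Evaluate the configured FUs in order and return the selected values."""
    if len(config) > MAX_FUS:
        raise ValueError("a custom instruction may use at most 16 FUs")
    values = [v & MASK for v in inputs]  # indices 0..len(inputs)-1 are inputs
    for fu in config:
        for idx in (fu.src_a, fu.src_b):
            if not 0 <= idx < len(values):
                raise ValueError("an FU may only read inputs or earlier FU results")
        values.append(OPS[fu.op](values[fu.src_a], values[fu.src_b]))
    # Any value, including an early-stage result, can be routed to an output port.
    return [values[i] for i in outputs]

example = [
    FUConfig("add", 0, 1),   # value 4 = in0 + in1
    FUConfig("add", 4, 2),   # value 5 = (in0 + in1) + in2
    FUConfig("xor", 0, 3),   # value 6 = in0 ^ in3
]
result = run_crfu([3, 4, 5, 1], example, outputs=[5, 6, 4])
```

In this example the third output is the result of the first FU, which models an early-stage value being passed straight to an output port.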

Page 6: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Supporting Multi-Exits Custom Instructions (MECIs)

Multiple-Exits Custom Instruction
Conditional Execution + Hot-Path Selection

Assume that at most 16 nodes can be included in one CI

#Required nodes: 16

[Figure: control-flow graph labelled adpcm with basic blocks BB1 to BB6, numbered instruction nodes including branch nodes (beq, bgez, bne), branch probabilities on the edges (50%/50%, 60%/40%, 95%/5%), and two points marked "Exit".]
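Hot-path selection under the 16-node limit can be sketched as a greedy walk over a control-flow graph: starting from an entry block, follow the most probable successor and stop before the node budget is exceeded. Every branch that leaves the chosen path becomes an exit of the custom instruction. This is only one plausible reading of the slide, not the authors' algorithm; the CFG edges, block sizes, and function name below are hypothetical, and only the 16-node limit and the style of branch probabilities come from the slide.

```python
from typing import Dict, List, Tuple

def select_hot_path(cfg: Dict[str, Dict[str, float]],
                    sizes: Dict[str, int],
                    entry: str,
                    max_nodes: int = 16) -> Tuple[List[str], int, List[Tuple[str, str]]]:
    """Greedily follow the most probable successor while the node budget allows."""
    path: List[str] = []
    used = 0
    block = entry
    while block is not None and block not in path:
        if used + sizes[block] > max_nodes:
            break  # adding this block would exceed the CI capacity
        path.append(block)
        used += sizes[block]
        succs = cfg.get(block, {})
        block = max(succs, key=succs.get) if succs else None
    # Every edge that leaves the selected path is an exit of the CI.
    next_on_path = dict(zip(path, path[1:]))
    exits = [(b, s) for b in path for s in cfg.get(b, {})
             if next_on_path.get(b) != s]
    return path, used, exits

# Hypothetical CFG: successor -> branch probability.
cfg = {
    "BB1": {"BB2": 0.60, "BB3": 0.40},
    "BB2": {"BB4": 0.95, "BB5": 0.05},
    "BB3": {"BB4": 1.0},
    "BB4": {"BB6": 1.0},
    "BB5": {"BB6": 1.0},
    "BB6": {},
}
# Hypothetical number of instruction nodes per basic block.
sizes = {"BB1": 4, "BB2": 5, "BB3": 3, "BB4": 6, "BB5": 2, "BB6": 7}

path, used, exits = select_hot_path(cfg, sizes, "BB1")
```

With these made-up numbers the path BB1, BB2, BB4 uses 15 of the 16 nodes, BB6 no longer fits, and the CI has three exits.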

Page 7: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Experimental Setup (1/2)

Base Processor Configuration

Issue                          1-way
L1 Instruction Cache           32K, 4-way, 1 cycle latency, miss penalty 20 cycles
L1 Data Cache                  16K, 4-way, 1 cycle latency, miss penalty 20 cycles
ALUs                           1 integer unit, 1 floating point unit
Multiplier                     1 integer (5 cycles)
Divider                        1 integer (8 cycles)
Branch predictor               bimodal
Branch prediction table size   256
Extra branch misprediction     3
Register File                  4 read ports, 2 write ports
Clock Frequency                135 MHz

Page 8: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Experimental Setup (2/2)

[Figure: datapath diagram, shown once for each architecture. Blocks: DEC/EXE Pipeline Registers, ALU, MUL/DIV, CRFU, register file (Reg0 ... Reg31), EXE/MEM Pipeline Registers, Counter (from decode stage), Config Memory, CRFU Input Regs with enable (En), Result bus. Annotations: "Triggered by mtc1 or sequencer".]

arch1: (4-read/2-write)
• Clock freq: 135MHz
• RF read/write access
  – Input: 5, 6, 7, or 8: +1 extra cycle
  – Output: 3 or 4: +1 extra cycle
  – Output: 5 or 6: +2 extra cycles
• CRFU execution
  – arch-1-var: variable (1 or 2 cycles)
  – arch-1-fix: 2 cycles

arch2: (8-read/4-write)
• Clock freq: 130MHz
• RF read/write access
  – Input: no extra cycle
  – Output: 5 or 6: +1 extra cycle
• CRFU execution
  – arch-2-var: variable (1 or 2 cycles)
  – arch-2-fix: 2 cycles
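The register-file port limits on this slide translate into extra cycles per custom instruction, which a short calculation makes concrete. The Python sketch below encodes the arch1 and arch2 rules as listed. It assumes that the extra cycles simply add to the CRFU execution cycles (the slide does not say whether they can overlap) and that a custom instruction has at most 8 inputs and 6 outputs, the largest counts the slide mentions. The function name `ci_cycles` is hypothetical.

```python
def ci_cycles(arch: str, n_inputs: int, n_outputs: int, crfu_cycles: int = 2) -> int:
    """Estimate cycles for one custom instruction (additive model, an assumption).

    crfu_cycles is 2 for the "fix" variants and 1 or 2 for the "var" variants.
    """
    if not 0 <= n_inputs <= 8 or not 0 <= n_outputs <= 6:
        raise ValueError("the slide only covers up to 8 inputs and 6 outputs")
    if crfu_cycles not in (1, 2):
        raise ValueError("CRFU execution takes 1 or 2 cycles")

    if arch == "arch1":    # 4-read / 2-write register file
        extra_in = 1 if n_inputs >= 5 else 0
        if n_outputs >= 5:
            extra_out = 2
        elif n_outputs >= 3:
            extra_out = 1
        else:
            extra_out = 0
    elif arch == "arch2":  # 8-read / 4-write register file
        extra_in = 0
        extra_out = 1 if n_outputs >= 5 else 0
    else:
        raise ValueError(f"unknown architecture: {arch}")

    return crfu_cycles + extra_in + extra_out

# A CI with 6 inputs and 3 outputs, fixed 2-cycle CRFU execution:
arch1_cost = ci_cycles("arch1", 6, 3)   # 2 + 1 (inputs) + 1 (outputs)
arch2_cost = ci_cycles("arch2", 6, 3)   # 2, no extra cycles
```

Under this model the wider register file of arch2 removes the extra cycles that arch1 pays, at the cost of a slightly lower clock frequency (130MHz instead of 135MHz).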

Page 9: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Performance Evaluation

[Figure: bar chart of speedup (y-axis, 1 to 5). Series: arch1-var, arch2-fix, arch2-var.]

Page 10: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Energy Consumption

Pros.
• Low activity of hardware components
  – I-Cache, Bpred
  – Decoder
  – Register File
  – Functional Unit
• Higher I-Cache hit rates
  – Reduce the energy for off-chip accesses

Cons.
• RFU configuration
  – Accessing the config. memory
  – Setting control signals in the RFU
• Increased complexity
  – Communication between the processor’s data-path and the RFU

Page 11: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Total Energy Reduction

[Figure: bar chart of total energy reduction (%) (y-axis, 0 to 80). Series: clk-gating-arch2-var, arch2-var, arch2-fix, arch1-var.]

Page 12: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Temperature Analysis

[Figure: benchmark axis labels recovered from a chart: basic..., qsort, susan, c-jpeg, d-jpeg, lame, dijkstra, patricia, stringse..., blowfish, sha, adpcm, fft, crc, gsm.]

[Figure: temperature (℃) (y-axis, 45 to 48) at clock frequencies of 130MHz, 260MHz, 390MHz, 520MHz, and 650MHz.]

[Figure: CRFU Floor Plan (1.7x1.7 [mm2]) showing the placement of the 16 FUs.]

Page 13: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Conclusions

• ADEXOR: Adaptive Extensible Processor
  – Has a coarse-grain reconfigurable functional unit
  – Supports multi-exit custom instructions
• Performance / Energy Analysis
  – 5X speed up (best case)
  – 60% energy reduction (best case)
• Future Work
  – Extend for 3D-IC Implementation

Page 14: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Acknowledgement

• This research was supported in part by
  – New Energy and Industrial Technology Development Organization
  – The chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Hitachi Ltd. and Dai Nippon Printing Corporation.

Page 15: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Backup Slides


Page 16: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Area overhead (1/2)

• VHDL & Hitachi 0.18μm library
  – Base processor: 4.5 mm²
  – CRFU: 1.7 mm²
• CACTI 4.2 (0.18μm)
  – I-Cache & D-Cache (32KB 4-way): 2.25 mm²
  – Configuration Memory (SRAM, for 32 MECIs): 0.56 mm²
  – Sequencer (CAM, 32 entries): 0.092 mm²
• Base Processor (with caches)
  – Area: 9.0 mm²

Page 17: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Area overhead (2/2)

Configuration      Area Overhead
arch1/mtc1         25.1%
arch1/sequencer    26.1%
arch2/mtc1         30%
arch2/sequencer    31%
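The arch1 figures can be cross-checked against the component areas given on the "Area overhead (1/2)" slide. The Python sketch below assumes that overhead is defined as the added area (CRFU plus configuration memory, plus the sequencer where used) divided by the 9.0 mm² base processor with caches. The slides do not state this formula, so it is an inference, although it does reproduce both arch1 values. The arch2 values are not reproduced because the slides give no area for the additional register-file ports.

```python
# Areas in mm^2, taken from the "Area overhead (1/2)" slide.
BASE_WITH_CACHES = 9.0
CRFU = 1.7
CONFIG_MEMORY = 0.56
SEQUENCER = 0.092

def overhead_percent(added_area: float) -> float:
    """Added area as a percentage of the base processor (assumed definition)."""
    return 100.0 * added_area / BASE_WITH_CACHES

arch1_mtc1 = overhead_percent(CRFU + CONFIG_MEMORY)
arch1_sequencer = overhead_percent(CRFU + CONFIG_MEMORY + SEQUENCER)

print(f"arch1/mtc1:      {arch1_mtc1:.1f}%")
print(f"arch1/sequencer: {arch1_sequencer:.1f}%")
```

The same arithmetic suggests that the sequencer alone accounts for roughly one percentage point of overhead, which matches the gap between the mtc1 and sequencer rows.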

Page 18: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Access Reduction

[Figure: bar chart of access reduction (%) (y-axis, -5 to 95) per benchmark: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, sha, adpcm, crc, fft, gsm, stringsearch, plus avg-seq and avg-mtc1. Series: decoder, branch pred, reg file, icache, ALU, icache-miss. A second panel (axis ticks 15, 35, 55) compares seq and mtc1.]

Page 19: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Energy Consumption Breakdown for arch1/invoke-mtc1

[Figure: stacked bar chart of energy consumption breakdown (y-axis, 0 to 100) per benchmark: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, sha, adpcm, crc, fft, gsm, stringsearch. Series: Base Processor, CRFU, Config Mem.]

Page 20: ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions

Energy Consumption Breakdown for arch2/invoke-seq

[Figure: stacked bar chart of energy consumption breakdown (y-axis, 0 to 100) per benchmark: basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, sha, adpcm, crc, fft, gsm, stringsearch. Series: Base Processor, CRFU, Config Mem, Reg File, Sequencer.]