Hot Chips 16 August 24, 2004

OptimoDE: Programmable Accelerator Engines Through Retargetable Customization

Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke

CCCP Research Group, University of Michigan

http://cccp.eecs.umich.edu

Krisztián Flautner, Koen Van Nieuwenhove

ARM Limited

OptimoDE Overview

• OptimoDE
  – A configurable VLIW-styled Data Engine architecture
  – Targeted at intensive data processing
• Characteristics
  – Very wide performance envelope
    • Power / area / speed tradeoff
    • Exploiting parallelism in applications
  – Unlimited data path configuration options
  – User extensible through ISA customization
• Semi-automatic design system
  – User-in-the-loop design, retargetable compiler toolchain

OptimoDE in a System On Chip

[Figure: SoC block diagram. Labeled blocks: SDRAM, SoC, AHB Bus Matrix, ARM CPU, DMA Controller, Interrupt Controller, SRAM, I/O, SDRAM Controller, and a Data Engine containing Memory Control, DATA 1, DATA 2, MEM, CTRL, FIFO, switch, and Memory blocks. Bus ports are marked M and S.]

OptimoDE Architecture Model

• Functional Units
  – ALU, ACU, Multipliers
  – Custom
• Memory
  – RAM (asynch / synch)
  – ROM
• I/O ports
  – addressable
  – handshake protocol
• Registers
  – Register files
• Interconnect
  – Direct connection
  – Shared bus
• Controller

• All layers required
• Intra-layer configuration

[Figure: layered architecture model with Controller, Interconnect, Function units, Memories, I/O ports, and Registers (regs) layers.]

Design Toolchain

[Figure: design toolchain flow. Tools and libraries: DesignDE, DEvelop, Librarian, ISS, OptimoDE Library, User Library. Stages: A Definition, A Evaluation. Actions: create_resource, instantiate, save, set target, load, run / profile. Files: A.tmp, A.inc, A.inc.c1, A.inc.c2, A.inc.c3. Other labels: LIFETIME, LOAD. The example program is a C main() that calls dct().]

Compiler Toolchain

[Figure: example dataflow graphs of numbered multiply (*) and add (+) operations between INPUT and OUTPUT.]

C Source Description, processed by DEvelop into Micro-code, with analysis feedback:

• check: syntax checks, dataflow analysis
• map: match architecture and dataflow graph
• compile: optimize code and register use

32-point DCT Microarchitecture

[Figure: datapath with blocks inport, outport, acu_1, acu_2, acu_3, rom, ram_1, ram_2, alu_1, alu_2, imm_1, imm_2, control.]

• 2 Custom FUs, 2 RAM, 1 ROM, 3 ACU, 2 I/O ports
• Designer responsible for creating custom units manually
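As a rough illustration of how such a resource list might be tallied, the sketch below counts the block names visible in the figure by kind. The list `BLOCKS` and the helper `kind` are hypothetical names introduced here, not part of the OptimoDE tools, and the figure does not say which blocks are the two custom FUs.

```python
from collections import Counter

# Block names as they appear in the 32-point DCT figure.
BLOCKS = [
    "inport", "outport",
    "acu_1", "acu_2", "acu_3",
    "rom", "ram_1", "ram_2",
    "alu_1", "alu_2",
    "imm_1", "imm_2",
    "control",
]


def kind(name):
    """Strip a trailing _N instance suffix to get the block kind."""
    base, _, suffix = name.rpartition("_")
    return base if base and suffix.isdigit() else name


counts = Counter(kind(b) for b in BLOCKS)
io_ports = counts["inport"] + counts["outport"]

print(counts["ram"], counts["rom"], counts["acu"], io_ports)
```

The RAM, ROM, ACU, and I/O port counts agree with the slide's bullet; the mapping of "2 Custom FUs" onto figure labels is left open.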

Retargetable Customization

• Prototype 2 technologies in OptimoDE
  – Automated ISA customization
  – Retargetable customization to an “application-area”
• Customizing for 1 application
  – Programmability: nominally programmable
  – Critical problem: cannot sustain performance across similar applications
  – How well does a custom ISA generalize?
    • 5 encryption algorithms, create custom design for each
    • Average loss >80% versus native [MICRO, 2003]
  – Proactive generalization creates a retargetable design

Creating Custom Instructions

• Candidate discovery
  – Identify customization opportunities
    • Examine program DFG
    • Partition DFG at:
      – Memory operations
      – Unprofitable edges
    • Enumerate candidate subgraphs within each partition
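The discovery steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the small DFG, the opcode names, and the size limit `max_size` are invented for the example, and the sketch cuts only at memory operations because the slides do not say how an edge is judged unprofitable.

```python
from itertools import combinations

# Hypothetical DFG: node id -> opcode, plus (producer, consumer) edges.
OPS = {1: "load", 2: "shl", 3: "and", 4: "add",
       5: "store", 6: "load", 7: "xor", 8: "or"}
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8)]
MEMORY_OPS = {"load", "store"}  # partition points used in this sketch


def partition(ops, edges):
    """Remove memory operations and return the remaining connected components."""
    keep = {n for n, op in ops.items() if op not in MEMORY_OPS}
    adj = {n: set() for n in keep}
    for a, b in edges:
        if a in keep and b in keep:
            adj[a].add(b)
            adj[b].add(a)
    parts, seen = [], set()
    for start in sorted(keep):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        parts.append(comp)
    return parts, adj


def is_connected(nodes, adj):
    """True if the given nodes form one connected piece of the DFG."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in adj[n] & nodes:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == nodes


def candidates(part, adj, max_size=3):
    """Enumerate connected subgraphs of 2..max_size nodes inside one partition."""
    found = []
    for k in range(2, max_size + 1):
        for combo in combinations(sorted(part), k):
            if is_connected(combo, adj):
                found.append(combo)
    return found


parts, adj = partition(OPS, EDGES)
for part in parts:
    print(sorted(part), candidates(part, adj))
```

Brute-force enumeration grows quickly with partition size, so a real tool would need pruning; the size cap here only keeps the example small.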

Grouping and Selection

• Group candidate subgraphs with same structure

[Figure: candidate subgraphs labeled Group 1 through Group 4, then gathered by group.]

• Estimate performance and cost for each group
  – Cost: 0.5 adders, gain: 1,000 cycles
  – Cost: 1 adder, gain: 2,500 cycles
  – Cost: 2 adders, gain: 10,000 cycles
  – Cost: 1 adder, gain: 1,500 cycles
• Greedily select groups to implement in hardware subject to budget
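A minimal sketch of grouping and greedy selection follows. The cost and gain figures are the ones on the slide, but their pairing with the names A to D is an assumption, as is ranking by gain per unit cost (the slide says only "greedily"). Using the opcode sequence as the structure key is also a simplification; matching real subgraphs would need a proper graph comparison.

```python
from collections import defaultdict


def group_by_structure(candidates):
    """Group candidates that share an opcode sequence (simplified structure key)."""
    groups = defaultdict(list)
    for name, opcodes in candidates:
        groups[tuple(opcodes)].append(name)
    return dict(groups)


def greedy_select(groups, budget):
    """Take groups in descending gain-per-cost order while they fit the budget."""
    chosen, spent, gained = [], 0.0, 0
    ranked = sorted(groups, key=lambda g: g[2] / g[1], reverse=True)
    for name, cost, gain in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
            gained += gain
    return chosen, spent, gained


# (name, cost in adders, gain in cycles); the names A-D are hypothetical.
GROUPS = [("A", 0.5, 1000), ("B", 1.0, 2500), ("C", 2.0, 10000), ("D", 1.0, 1500)]

print(greedy_select(GROUPS, budget=3.0))
print(greedy_select(GROUPS, budget=1.5))
```

With a budget of 3 adders this ordering picks the 2-adder group first and then the 2,500-cycle group; with a budget of 1.5 the 2-adder group no longer fits, which shows how the budget changes the selection.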

Proactively Generalize Groups

• Cost-effectively extend group functionality to enable reuse
• Wildcard – multiple functionality at nodes
• Subsumed – configurable interconnect to bypass nodes

[Figure: a custom unit with Input 1, Input 2, constants 0xFF and 0x8, operations >>, |, +, and Output, shown beside its generalized form, in which the shift constant may be 0x8 or 0x4, the logical node | or &, and the arithmetic node + or -.]
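The two generalizations can be modeled in software as extra configuration inputs on the unit. The sketch below is illustrative only: the wiring `((in1 >> shift) LOGIC 0xFF) ARITH in2`, the 32-bit width, and the parameter names are assumptions, since the extracted figure preserves the node labels but not the edges.

```python
import operator

MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath
LOGIC = {"|": operator.or_, "&": operator.and_}
ARITH = {"+": operator.add, "-": operator.sub}


def cfu(in1, in2, shift=0x8, logic="|", arith="+", bypass_logic=False):
    """Model of a generalized custom unit under the assumed wiring.

    Wildcard: `shift`, `logic`, and `arith` select among the node's functions.
    Subsumed: `bypass_logic` routes around the logical node entirely.
    """
    if shift not in (0x8, 0x4):
        raise ValueError("shift amount not supported by this unit")
    x = (in1 & MASK32) >> shift
    if not bypass_logic:
        x = LOGIC[logic](x, 0xFF)
    return ARITH[arith](x, in2) & MASK32


# Original (ungeneralized) behavior is the default configuration.
print(hex(cfu(0x1234, 1)))
# Wildcard configuration: different constant and operators on the same hardware.
print(hex(cfu(0x1234, 1, shift=0x4, logic="&", arith="-")))
# Subsumed configuration: the logical node is bypassed.
print(hex(cfu(0x1234, 1, bypass_logic=True)))
```

Each extra option costs a multiplexer or a wider operator in hardware, which is why the slide stresses extending functionality only where it is cost-effective.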

Native Speedups

[Bar chart: speedup (y-axis, 0 to 9) for 3Des, AES, Blowfish, Md5, Rc4, and SHA. Series: ARM 926EJ (200 MHz); OptimoDE (333 MHz, 5 issue: 3 ALU, 1 Mem, 1 Brn); OptimoDE + 1 CFU (333 MHz).]

Importance of Generalization

[Bar chart: speedup (y-axis, 0 to 7) for all 25 pairings of 3des, Blowfish, Rc4, AES, and Sha, from 3des-3des to Sha-Sha. Series: Base, Generalized.]

Key: application run – application designed for

Designing for a Domain

[Chart: fraction of native speedup attained (y-axis, 0 to 1) versus apps designed for (x-axis: Rc4; Rc4, Sha; Blowfish, Rc4, Sha; 3des, Blowfish, Rc4, Sha; 3des, AES, Blowfish, Rc4, Sha). Series: 3des, AES, Blowfish, Rc4, Sha, Md5, Mean.]

Case Study - Md5

[Chart: speedup (y-axis, 0 to 10) versus CFU cost budget relative to a 32-bit ripple-carry adder (x-axis, 0 to 10).]

OptimoDE Design for this Point

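[Figure: two custom function unit subgraphs. The first takes Input 1 through Input 4 and contains three + nodes, << with constant 0x5, >> with constant 0x1B, and |, producing Output. The second takes Input 1 through Input 3 and contains ^, &, and ^ nodes, producing Output. Below them is the datapath: ALU 1, ALU 2, CFU, SRAM, ACU, register files (RF), and Control Memory.]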
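The shift constants in the first subgraph, 0x5 and 0x1B (27), add up to 32, so if both shifts read the same 32-bit word, OR-ing their results is a rotate-left by 5. The sketch below shows that reading, plus one plausible reading of the second subgraph as a three-operation bitwise select. Both wirings are assumptions, because the extracted figure keeps only the node labels; the function names are invented for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word


def rotl5(x):
    """(x << 0x5) | (x >> 0x1B) on a 32-bit word: rotate left by 5."""
    x &= MASK32
    return ((x << 0x5) | (x >> 0x1B)) & MASK32


def select(x, y, z):
    """z ^ (x & (y ^ z)): picks bits of y where x is 1 and bits of z where x is 0."""
    return (z ^ (x & (y ^ z))) & MASK32


print(hex(rotl5(0x80000000)))                        # top bit wraps around to bit 4
print(hex(select(0xFFFF0000, 0x12345678, 0x9ABCDEF0)))
```

The select form uses one XOR, one AND, and one XOR, which matches the three operators listed for the second subgraph, but the slides do not confirm that this is the function implemented.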

Die Area Breakdown

[Figure: datapath with ALU 1, ALU 2, CFU, SRAM, ACU, register files (RF), and Control Memory.]

Die area breakdown:

• Baseline Design: 78%
• Control Memory: 11%
• Reg. File and Interconnect: 9%
• CFU: 2%

Die impact is artificially large because of naïve implementation

OptimoDE = 5.5 mm² in 0.13 µm; ARM 926EJ = 5.0 mm² in 0.13 µm
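For a sense of scale, the percentages can be converted to absolute areas. This is a back-of-the-envelope sketch that assumes the breakdown applies to the 5.5 mm² OptimoDE figure; the slide does not state that explicitly, and the names below are illustrative.

```python
TOTAL_MM2 = 5.5     # OptimoDE die area from the slide
ARM_MM2 = 5.0       # ARM 926EJ die area from the slide

# Percentage shares from the breakdown chart.
SHARES = {
    "Baseline Design": 78,
    "Control Memory": 11,
    "Reg. File and Interconnect": 9,
    "CFU": 2,
}


def area_mm2(shares, total):
    """Convert percentage shares into absolute areas in mm^2."""
    return {name: total * pct / 100 for name, pct in shares.items()}


areas = area_mm2(SHARES, TOTAL_MM2)
for name, mm2 in areas.items():
    print(f"{name}: {mm2:.3f} mm^2")
print(f"OptimoDE / ARM 926EJ area ratio: {TOTAL_MM2 / ARM_MM2:.2f}")
```

Under that assumption the CFU itself is the smallest share, and most of the customization overhead sits in the control memory and in the register file and interconnect.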

Conclusions

• OptimoDE
  – Configurable VLIW-style data engine architecture
  – Automated tools for implementing embedded signal and data processing solutions
• Automatic retargetable customization
  – Customized design combined with cost-effective generalization
  – Performance programmability: performance stability across a family of similar applications