Towards Optimal Custom Instruction Processors. Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer. Department of Computing, Imperial College London. HOT CHIPS 18


Towards Optimal Custom Instruction Processors

Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer

Department of Computing, Imperial College London

HOT CHIPS 18


Overview

1. background: extensible processors

2. design flow: C to custom processor silicon

3. instruction selection: bandwidth/area constraints

4. application-specific processor synthesis

5. results: 3x area delay product reduction

6. current and future work + summary


1. Instruction-set extensible processors

● base processor + custom logic

– partition data-flow graphs into custom instructions

[Figure: block diagram with Register File, ALU and custom logic; labels: data in, data out]


Previous work

● many techniques, e.g.

– Atasu et al. (DAC 03)

– Goodwin and Petkov (CASES 03)

– Clark et al. (MICRO 03, HOT CHIPS 04)

● current challenges

– optimality and robustness of heuristics

– complete tool chain: application to silicon

– research infrastructure for custom processor design


2. Custom processor research at Imperial

● focus on effective optimization techniques

– e.g. Integer Linear Programming (ILP)

● complete tool-chain

– high-level descriptions to custom processor silicon

● open infrastructure for research in

– custom processor synthesis

– automatic customization techniques

● current tools

– optimizing compiler (Trimaran) for custom CPUs

– custom processor synthesis tool


Application to custom processor flow

[Figure: flow diagram. Stages: Application Source (C), Template Generation, Template Selection, Generate Custom Unit, Generate Base CPU, Processor Description, ASIC Tools. Input: Area Constraint. Outputs: Area, Timing.]


Custom instruction model

[Figure: custom instruction model. Labels: Register File, input ports, Input Register, Pipeline Register, Output Register, output ports.]


3. Optimal instruction identification

● minimize schedule length of program data flow graphs (DFGs)

● subject to constraints

– convexity: ensure feasible schedules

– fixed processor critical path: pipeline for multi-cycle instructions

– fixed data bandwidth: limited by register file ports

● steps: based on Integer Linear Programming (ILP)

a. template generation

b. template selection
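
The convexity and data bandwidth constraints above can be illustrated with a small check on a candidate template. This is a minimal sketch, not the ILP formulation used in the talk: the graph encoding (a dict from node to successor list), the function names and the example DFG are all assumptions made for illustration. A template is treated as convex when no path leaves it and later re-enters it, and its inputs and outputs are counted so they can be compared with the register file port limits.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # node -> list of successor nodes (hypothetical encoding)


def reachable(graph: Graph, start: str) -> Set[str]:
    """Return the nodes reachable from `start` by following edges."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def is_convex(graph: Graph, template: Set[str]) -> bool:
    """True if no path leaves the template and then re-enters it."""
    for node in template:
        for succ in graph.get(node, []):
            # A successor outside the template must not lead back inside.
            if succ not in template and reachable(graph, succ) & template:
                return False
    return True


def io_counts(graph: Graph, template: Set[str]) -> Tuple[int, int]:
    """Count outside producers feeding the template and template nodes read outside."""
    inputs = {src for src, succs in graph.items()
              if src not in template and any(s in template for s in succs)}
    outputs = {node for node in template
               if any(s not in template for s in graph.get(node, []))}
    return len(inputs), len(outputs)


# Illustrative DFG: a feeds b and c, both feed d, d feeds e.
EXAMPLE_DFG: Graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}

print(is_convex(EXAMPLE_DFG, {"b", "d"}))   # path b -> d stays inside
print(is_convex(EXAMPLE_DFG, {"a", "d"}))   # path a -> b -> d leaves and re-enters
print(io_counts(EXAMPLE_DFG, {"b", "d"}))   # (inputs, outputs)
```

A non-convex template such as {a, d} cannot be scheduled as one instruction, because node b would have to run both after the instruction starts and before it finishes.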


a. Template generation


1. Solve ILP for DFG to generate a template

2. Collapse template to a single DFG node

3. Repeat while (objective > 0)
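
The generate, collapse, repeat loop in the three steps above can be sketched as follows. This sketch swaps the ILP solver for an exhaustive search over node subsets, which is only practical for tiny graphs, and it uses a hypothetical gain model (template size minus one, as if every template ran in a single cycle). The `forbidden` set, the function names and the example graph are assumptions for illustration; memory operations are excluded here because the talk lists memory access in custom instructions as future work.

```python
from itertools import combinations


def reaches(graph, start, targets):
    """True if any node in `targets` is reachable from `start`."""
    stack, seen = list(graph[start]), set()
    while stack:
        node = stack.pop()
        if node in targets:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False


def feasible(graph, tmpl, max_in, max_out):
    """Check convexity and the input/output port limits for one template."""
    for node in tmpl:
        for succ in graph[node]:
            if succ not in tmpl and reaches(graph, succ, tmpl):
                return False  # a path leaves the template and re-enters it
    inputs = {p for p in graph if p not in tmpl and any(s in tmpl for s in graph[p])}
    outputs = {n for n in tmpl if any(s not in tmpl for s in graph[n])}
    return len(inputs) <= max_in and len(outputs) <= max_out


def best_template(graph, candidates, max_in, max_out):
    """Exhaustive stand-in for the ILP: first largest feasible subset."""
    nodes = sorted(candidates)
    for size in range(len(nodes), 1, -1):
        for combo in combinations(nodes, size):
            if feasible(graph, set(combo), max_in, max_out):
                return set(combo)
    return set()


def collapse(graph, tmpl, name):
    """Replace all template nodes by one node called `name`."""
    new = {n: sorted({name if s in tmpl else s for s in succs})
           for n, succs in graph.items() if n not in tmpl}
    new[name] = sorted({s for n in tmpl for s in graph[n] if s not in tmpl})
    return new


def generate_templates(graph, forbidden, max_in, max_out):
    """Repeat: find a template, collapse it, stop when the gain is not positive."""
    templates, candidates = [], set(graph) - set(forbidden)
    while True:
        tmpl = best_template(graph, candidates, max_in, max_out)
        if len(tmpl) - 1 <= 0:  # hypothetical objective: nodes saved
            return templates, graph
        graph = collapse(graph, tmpl, f"T{len(templates)}")
        templates.append(tmpl)
        candidates -= tmpl  # collapsed nodes are not reconsidered


# Illustrative DFG: loads (ld*) and stores (st*) may not join a template.
DFG = {"ld0": ["xor"], "ld1": ["xor"], "xor": ["shl"], "shl": ["add"],
       "ld2": ["add"], "add": ["st"], "st": [],
       "ld3": ["not"], "not": ["neg"], "neg": ["st2"], "st2": []}
MEM = {"ld0", "ld1", "ld2", "ld3", "st", "st2"}

found, collapsed = generate_templates(DFG, MEM, max_in=2, max_out=1)
print(found)
```

With two inputs and one output allowed, the loop runs twice on this graph and then stops because the only remaining candidate is a single node, which offers no gain under the assumed objective.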


b. Template selection

● determine isomorphism classes

– find templates that can be implemented using the same instruction

– calculate speed-up potential of each class

● solve Knapsack problem using ILP

– maximize speedup within area constraint
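
The selection step above is a 0/1 knapsack: each isomorphism class has an area cost and a speed-up potential, and the chosen classes must fit the area budget. The talk solves this with ILP; the sketch below uses the textbook dynamic programming formulation instead, which gives the same optimum for integer areas. The class names, areas and gains are made-up illustrative values, with area measured in ripple carry adders as on the results plots.

```python
from typing import List, Tuple


def select_templates(classes: List[Tuple[str, int, int]],
                     area_budget: int) -> Tuple[int, List[str]]:
    """Pick classes maximizing total gain within `area_budget`.

    Each class is (name, area, gain). Returns (total gain, chosen names).
    """
    # best[a] holds the best (gain, chosen names) using at most area a.
    best: List[Tuple[int, List[str]]] = [(0, [])] * (area_budget + 1)
    for name, area, gain in classes:
        # Walk the budget downwards so each class is used at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = best[a - area][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - area][1] + [name])
    return best[area_budget]


# Hypothetical isomorphism classes: (name, area, cycles saved).
CLASSES = [("sbox_mix", 30, 600), ("rotate_xor", 12, 250),
           ("add_chain", 20, 300), ("mask", 5, 40)]

print(select_templates(CLASSES, 40))
```

In this example the largest class plus the smallest one beats the three cheaper classes combined, which is the kind of trade-off a greedy gain-per-area heuristic can miss.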


Optimizing compilation flow

[Figure: compilation flow diagram. Stages: Application in C/C++, Impact Front-end, CDFG Formation, a) Template Generation, b) Template Selection, MDES Generation, Instruction Replacement, Scheduling and Reg. Allocation, Elcor Backend, Assembly Code and Statistics. Other labels: Data Bandwidth Constraints, Area Constraints, Gain, Area, VHDL, Synopsys Synthesis.]


4. Application-specific processor synthesis

● design space exploration framework

– Processor Component Library

– specialized structural description

● prototype: MIPS integer instruction set

– custom instructions

– flexible micro-architecture

● evaluate using actual implementation

– timing and area


Processor synthesis flow

[Figure: processor synthesis flow, from custom data paths (from compiler) and the Processor Component Library to the Custom Processor. Pipeline stage labels: FE, EX, M, W.]

● merging

● add state registers

● processor interface

● pipeline description

● parameters

interface

● data in/out

● stall control


Implementation

● based on Python scripts

– structural meta-language for processors

– combine RTL (Verilog/VHDL) IP blocks

– module generators for custom units

● generate 100s of designs automatically

– ASIC processor cores

– complete system on FPGA: CPU + memory + I/O
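
As a rough illustration of how a script can produce hundreds of design points, the sketch below enumerates a parameter sweep and emits one description per design. It does not reproduce the structural meta-language from the talk: the dictionary fields, the naming scheme and the choice of swept parameters are all hypothetical. The benchmark names, the 4/5 stage pipeline options and the 0 to 70 area range are borrowed from the results slides.

```python
from itertools import product

# Assumed sweep axes (illustrative only).
BENCHMARKS = ["aesdec", "aesenc", "des", "md5", "sha"]
PIPELINE_STAGES = [4, 5]
AREA_BUDGETS = range(0, 71, 5)  # area constraint in ripple carry adders


def design_points():
    """Yield one hypothetical processor description per parameter combination."""
    for bench, stages, budget in product(BENCHMARKS, PIPELINE_STAGES, AREA_BUDGETS):
        yield {
            "name": f"{bench}_p{stages}_a{budget}",
            "base_isa": "mips_integer",
            "pipeline_stages": stages,
            "custom_unit_area_budget": budget,
            "benchmark": bench,
        }


points = list(design_points())
print(len(points), points[0]["name"])
```

Each description would then be handed to a generator that assembles RTL blocks and custom units; that step is outside the scope of this sketch.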


5. Results

● cryptography benchmarks: C source

– AES decrypt, AES encrypt, DES, MD5, SHA

● 4/5 stage pipelined MIPS base processor

– 0.225 mm² area, 200 MHz clock speed

– single issue processor

– register file with 2 input ports, 1 output port

● processors synthesized to 130nm library

– Synopsys DC and Cadence SoC Encounter

– also synthesize to Xilinx FPGA for testing


AES Decryption Processor

130nm CMOS, 200 MHz, 0.307 mm²

35% area cost (mostly one instruction)

76% cycle reduction



Execution time

[Figure: normalised number of cycles (0 to 120) against area constraint in ripple carry adders (0 to 70), one curve each for AES decrypt, AES encrypt, DES, MD5 and SHA. Curve annotations: 4 inputs, 1 output; 4 inputs, 1 output; 4 inputs, 4 outputs; 4 inputs, 2 outputs; 4 inputs, 1 output. Annotated reductions: 76%, 63%, 43%.]

Register file in all cases: 2 input ports, 1 output port


Timing

● 48% of designs meet timing at 200 MHz without manual optimization

[Figure: slack in ns (-3.5 to 0.5) against cell area (axis labelled "Cell area/mm2", 50000 to 120000).]


Area (for maximum speedup)

[Figure: cell area and chip area (axis labelled "Area/mm2", 0 to 500000) for aesdec, aesenc, des, md5, sha and basecpu. Annotated percentages, in the order they appear: 35%, 28%, 42%, 93%, 23%.]


6. Current and future work

● support memory access in custom instructions

– automate data partitioning for memory access

– automate SIMD load/store instructions for state registers

● use architectural techniques, e.g. shadow registers

– improve bandwidth without additional register file ports

● study trade-offs for VLIW style

– multiple register file ports

– multiple issue and custom instructions

● extend compiler: e.g. ILP model for cyclic graphs

– adapt software pipelining for hardware


Summary

● complete flow from C to custom processor

● automatic instruction set extension

– based on integer linear programming

– optimize schedule length under constraints

● application-specific processor synthesis

– complete flow: permits real hardware evaluation

● up to 76% reduction in execution cycles

– 3x area delay product reduction

● max speedup: 23% to 93% area overhead
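
The 3x area delay product figure can be checked against the AES decryption numbers quoted in the results. The sketch below assumes that the 3x claim refers to that design and that delay is proportional to cycle count, since both the base processor and the AES decryption processor are reported at 200 MHz. The function name is hypothetical.

```python
def area_delay_improvement(base_area: float, custom_area: float,
                           cycle_reduction: float) -> float:
    """Return base ADP divided by custom ADP, assuming equal clock speed."""
    area_ratio = custom_area / base_area     # custom area relative to base
    delay_ratio = 1.0 - cycle_reduction      # custom cycles relative to base
    return 1.0 / (area_ratio * delay_ratio)


# 0.225 mm² base, 0.307 mm² AES decryption processor, 76% fewer cycles.
improvement = area_delay_improvement(0.225, 0.307, 0.76)
print(f"area ratio: {0.307 / 0.225:.2f}, ADP improvement: {improvement:.2f}x")
```

The area ratio comes out at about 1.36, close to the quoted 35% area cost, and the resulting improvement is a little over 3x, in line with the summary.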