Automatic Compilation for Domain Specific Accelerators · 2020. 8. 5. · Mem Tile PEak-Specified...

Automatic Compilation for Domain Specific Accelerators

Ross Daly Caleb Donovick

Jackson Melchert

Golden Age of Computer Architecture!

• Architecture Specifications change frequently
• Compiler is the (often overlooked) key component!
• Waterfall methodology:


Application Analysis

Architectural Specification

RTL Design and Test

Physical Design

Software / Compiler Design

• Agile methodology:


Base Hardware Accelerator v0

Compiler Toolchain v0

Application 1, Application 2

Power, Performance, Area



Incremental Updates

Application 2.1, Application 3

• Automatically generate compiler for every spec change

CPU

• Compile to IR (LLVM)
• Common Optimizations
• Instruction Selection
• Peephole Optimization
• Instruction Scheduling
• Register Allocation
• Assembly

CGRA/FPGA

• Compile to IR (CoreIR)
• Common Optimizations
• Mapping
• Packing
• Placement
• Routing
• Bitfile generation

CGRA Mapping

Lower

Application Halide Program

CoreIR Graph

Map PE and Memory

Mapped CoreIR Graph

CGRA Bitstream

Our DSL-based Hardware Generation and Software Compilation Flow

PEak Compiler

PE HW in Magma

CGRA Verilog

PEak Program (PE spec)

Halide Compiler

CoreIR Graph

PE and MEM Mapper

Mapped CoreIR Graph

CGRA Bitstream

Place & Route Engine


Magma Compiler

Compiler Collateral

Our DSL-based Hardware Generation and Software Compilation Flow

Lake Compiler

PEak Compiler

PE HW in Magma

CGRA Verilog

Lake Program (MEM spec)


Halide Compiler

CoreIR Graph

PE and MEM Mapper

Mapped CoreIR Graph

CGRA Bitstream



Magma Compiler

MEM HW in Magma

Compiler Collateral

Output of Halide Compiler

Unified Buffer

Unified Buffer

Computation Kernel

Computation Kernel

CoreIR Graph

From Global Buffer

To Global Buffer

Desired Output of Mapper

From Global Buffer

To Global Buffer

Lake-Specified Mem Tile

PEak-Specified PE Tile

Mapped CoreIR Graph

To Buffer/IO

Kernels are composed of CoreIR Primitives

CoreIR Primitives

add

add

sub

ashr

div

mul

mul

Computational Kernel

From Buffer/IO

CoreIR has SMT QF BitVector Semantics

In0 In1

Out

CoreIR.Sub Out = In0 - In1
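As a rough illustration of what quantifier-free bitvector semantics means for a primitive such as CoreIR.Sub, the sketch below models values as Python integers that wrap modulo 2^16. The function name `coreir_sub` and the 16-bit width are assumptions made for this example; this is not the CoreIR API.

```python
WIDTH = 16                 # assumed word width for this sketch
MASK = (1 << WIDTH) - 1    # keeps results inside WIDTH bits


def coreir_sub(in0: int, in1: int) -> int:
    """Out = In0 - In1, wrapping modulo 2**WIDTH as a fixed-width bitvector does."""
    return (in0 - in1) & MASK


print(hex(coreir_sub(5, 3)))  # 0x2
print(hex(coreir_sub(3, 5)))  # 0xfffe (wraps around instead of going negative)
```

The wrap-around case is the point: the primitive has no notion of a negative result, only of a 16-bit pattern.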

Mapping

[Figure: a Kernel built from CoreIR Primitives is mapped to a Mapped Kernel.]




Instruction Selection

[Figure: the Instruction Selection Algorithm takes a Kernel of CoreIR Primitives and the Rewrite Rule Table (Rewrite Rule 1, Rewrite Rule 2, Rewrite Rule 3, Rewrite Rule 4, …) and produces a Mapped Kernel. Example rule patterns: div; mul add; sub; ashr add. Each rule has a Cost; the values shown are 4.3, 6.0, 3.1, 1.2.]
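One way to picture cost-driven instruction selection over a rewrite rule table is a minimum-cost tree cover. The sketch below is a generic dynamic-programming cover, not necessarily the algorithm used in this flow. The `Node` class, the `RULES` table, and all cost values are made up for illustration, and patterns are limited to a single primitive or a producer/consumer pair.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    op: str                        # primitive name, or "in" for a kernel input
    args: Tuple["Node", ...] = ()


# Hypothetical rewrite rule table: pattern -> cost.
# (op,) covers one primitive; (producer, consumer) covers a fused pair.
RULES = {
    ("add",): 1.0,
    ("sub",): 1.0,
    ("mul",): 3.0,
    ("mul", "add"): 3.5,
}


def select(node):
    """Return (cost, patterns) for the cheapest cover of the tree rooted at node."""
    if node.op == "in":
        return 0.0, []
    candidates = []
    # Option 1: cover this node with a single-primitive rule.
    if (node.op,) in RULES:
        cost, used = RULES[(node.op,)], [(node.op,)]
        for arg in node.args:
            sub_cost, sub_used = select(arg)
            cost += sub_cost
            used += sub_used
        candidates.append((cost, used))
    # Option 2: fuse this node with one of its producers.
    for i, child in enumerate(node.args):
        pattern = (child.op, node.op)
        if pattern in RULES:
            cost, used = RULES[pattern], [pattern]
            remaining = child.args + node.args[:i] + node.args[i + 1:]
            for arg in remaining:
                sub_cost, sub_used = select(arg)
                cost += sub_cost
                used += sub_used
            candidates.append((cost, used))
    if not candidates:
        raise ValueError(f"no rewrite rule covers {node.op}")
    return min(candidates, key=lambda c: c[0])


a, b, c, d = (Node("in") for _ in range(4))
kernel = Node("sub", (Node("add", (Node("mul", (a, b)), c)), d))  # (a*b + c) - d
cost, patterns = select(kernel)
print(cost, patterns)  # 4.5 [('sub',), ('mul', 'add')]
```

With these made-up costs the fused mul/add rule (3.5) beats separate mul and add rules (4.0), so the selector picks it. Memoization is omitted for brevity; a real kernel is a graph rather than a tree and would need more care.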

PEak Compiler generates a table of Rewrite Rules

PEak Compiler


Halide Compiler

CoreIR Graph

PE and MEM Mapper

Mapped CoreIR Graph

CGRA Bitstream



Rewrite Rule 1

Rewrite Rule 2

Rewrite Rule 3

Rewrite Rule 4

…

Rewrite Rule Table

div

mul add

sub

ashr add

PEak: PE DSL

PE Functional Specification

    class PE(Peak):
        def __call__(self, inst: Const(Instruction), A: Word, B: Word, C: Word) -> {"res": Word, "flag": Bit}:
            # Optionally invert the first operand
            if inst.invert_A:
                A = ~A

            if inst.op == Opcode.Add:
                res, c_out = A.add(B, inst.c_in)
                flag = c_out
            elif inst.op == Opcode.Mul:
                res = A * B
                flag = (res == 0)
            elif ... :
                ...
            return res, flag

PE ISA Specification

    class Opcode(Enum):
        Add = 0
        Mul = 1
        …

    # Define Instruction
    class Instruction(Product):
        op = Opcode
        invert_A = Bit
        c_in = Bit

    # Define Word
    Word = UnsignedBitVector[16]

Specific types (or composition of types) for operands and instructions

Subtract?

[Figure: the PE takes inputs inst, A, B, C and produces outputs res, flag.]

inst = Instruction(op=Add, invert_A=1, c_in=1)

res = ~A + B + 1 = B - A
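The identity above is two's-complement negation: ~A + 1 equals -A modulo 2^16, so adding B gives B - A. A minimal check in plain Python follows, assuming a 16-bit word; `pe_add` is a hypothetical stand-in for the Add path of the PE, not PEak code.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def pe_add(a: int, b: int, invert_a: int, c_in: int) -> int:
    """Model of the Add opcode with optional inversion of A and a carry-in."""
    if invert_a:
        a = ~a & MASK
    return (a + b + c_in) & MASK


# Spot-check the identity on boundary values and a few arbitrary ones.
samples = [0, 1, 2, 0x1234, 0x7FFF, 0x8000, 0xFFFF]
for a in samples:
    for b in samples:
        assert pe_add(a, b, invert_a=1, c_in=1) == (b - a) & MASK

print(pe_add(3, 10, invert_a=1, c_in=1))  # 7
```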

    class RISCV(Peak):
        def __init__(self):
            self.rf = RegisterFile(32, Word)
            self.PC = Register(Data)

        def __call__(self, inst: Instruction) -> {"next_PC": Word}:
            # ID
            rs1_idx, rs2_idx, rd_idx, … = decode(inst)
            rs1_val, rs2_val = self.rf.read(rs1_idx, rs2_idx)
            # EX
            ...
            # MEM
            ...
            # WB
            self.rf.write(rd_val)

Define sub-components and state

RiscV Peak Specification

RiscV ISA Specification with Algebraic Data Types


    class Register(Product):
        funct7 = Funct7Enum
        rs2 = BitVector[5]
        rs1 = BitVector[5]
        funct3 = Funct3Enum
        rd = BitVector[5]
        opcode = Opcode

    class Immediate(Product): ...
    class UImmediate(Product): ...
    class Store(Product): ...
    class Branch(Product): ...
    class Jump(Product): ...

    Instruction = Sum[Register, Immediate, UImmediate, Store, Branch, Jump]
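To make the product/sum structure concrete without the PEak type library, here is a small sketch using standard-library dataclasses and `typing.Union`. Only two of the six formats are modelled, plain `int` fields stand in for the enum and BitVector types, and the `encode` helper is hypothetical. The bit positions follow the standard RISC-V R-type and I-type layouts.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Register:          # product type: every field is present at once
    funct7: int
    rs2: int
    rs1: int
    funct3: int
    rd: int
    opcode: int


@dataclass(frozen=True)
class Immediate:
    imm: int
    rs1: int
    funct3: int
    rd: int
    opcode: int


Instruction = Union[Register, Immediate]   # sum type: exactly one variant


def encode(inst: Instruction) -> int:
    """Pack an instruction into its 32-bit encoding, dispatching on the variant."""
    if isinstance(inst, Register):
        return (inst.funct7 << 25 | inst.rs2 << 20 | inst.rs1 << 15
                | inst.funct3 << 12 | inst.rd << 7 | inst.opcode)
    if isinstance(inst, Immediate):
        return ((inst.imm & 0xFFF) << 20 | inst.rs1 << 15
                | inst.funct3 << 12 | inst.rd << 7 | inst.opcode)
    raise TypeError(f"unknown instruction variant: {inst!r}")


sub_x3_x1_x2 = Register(funct7=0b0100000, rs2=2, rs1=1, funct3=0, rd=3, opcode=0b0110011)
addi_x1_x0_5 = Immediate(imm=5, rs1=0, funct3=0, rd=1, opcode=0b0010011)
print(hex(encode(sub_x3_x1_x2)))  # 0x402081b3
print(hex(encode(addi_x1_x0_5)))  # 0x500093
```

Because each variant is a distinct type, a tool can enumerate the variants and their fields mechanically, which is what makes an ADT-based ISA description convenient to analyze.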

Multiple Interpretations of PEak Specification

• PEak program uses abstract types provided by the PEak DSL such as Bit, BitVector etc.
• Each component of the PEak compiler provides a separate concrete implementation of these abstract types
• Multiple interpretations of a PEak specification in different contexts:
  • Python Context: PEak Program with BitVector gives the Functional Model
  • Magma Context: PEak Program with Bits gives the RTL
  • SMT Context: PEak Program with SMTBitVector gives the Symbolic Representation (for Rewrite Rules)

SINGLE SOURCE OF TRUTH: PEak Program
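The idea of one specification with several interpretations can be sketched in plain Python by writing the computation once against operators and supplying different value classes. `ConcreteBV` and `SymbolicBV` below are hypothetical stand-ins for the per-context implementations; they are not the PEak, Magma, or SMT types named on the slide.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


class ConcreteBV:
    """Functional-model interpretation: operators compute 16-bit integers."""
    def __init__(self, value):
        self.value = value & MASK

    def __invert__(self):
        return ConcreteBV(~self.value)

    def __add__(self, other):
        return ConcreteBV(self.value + other.value)


class SymbolicBV:
    """Symbolic interpretation: operators build an SMT-LIB style expression."""
    def __init__(self, expr):
        self.expr = expr

    def __invert__(self):
        return SymbolicBV(f"(bvnot {self.expr})")

    def __add__(self, other):
        return SymbolicBV(f"(bvadd {self.expr} {other.expr})")


def spec(a, b, one):
    """Written once; its meaning depends on the types passed in."""
    return ~a + b + one


concrete = spec(ConcreteBV(3), ConcreteBV(10), ConcreteBV(1))
symbolic = spec(SymbolicBV("A"), SymbolicBV("B"), SymbolicBV("#x0001"))
print(concrete.value)  # 7
print(symbolic.expr)   # (bvadd (bvadd (bvnot A) B) #x0001)
```

The same `spec` body yields a number in one context and a formula in the other, which is why a single program can serve as the functional model and as the input to rewrite rule synthesis.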

In0 In1

Out

CoreIR.Sub

Discovering a Rewrite Rule

res flag

A B C

PE

inst

In0 In1

Out

CoreIR.Sub

Input/Output Bindings

res flag

A B C

PE

inst

In0 In1

Out

CoreIR.Sub

Input/Output Bindings

res flag

A B C

PE

inst

Constant

In0 In1

Out

CoreIR.Sub

Setting Constants

res flag

A B C

PE

inst = Instruction(op=Add, invert_A=1, c_in=1)

In0 In1

Out

CoreIR.Sub

res flag

A B C

PE

inst

CoreIR.Sub(in0, in1) == PE(inst, input_binding(in0, in1))

∃(input_binding, inst) s.t. ∀(in0, in1): CoreIR.Sub(in0, in1) == PE(inst, input_binding(in0, in1))

Out

In0 In1

CoreIR.Sub

res flag

A B C

PE

inst

∃(input_binding, inst) s.t. ∀(in0, in1): CoreIR.Sub(in0, in1) == PE(inst, input_binding(in0, in1))['res']

∃(input_binding, inst) s.t. ∀(in0, in1, other): CoreIR.Sub(in0, in1) == PE(inst, input_binding(in0, in1, other))['res']
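The exists/forall query can be illustrated without an SMT solver by brute force on a tiny PE. The sketch below swaps the solver for exhaustive enumeration: it tries every instruction and every input binding (the ∃ part) and checks each candidate on all input values (the ∀ part). The 4-bit width, the two-input `pe` model, and the dictionary format of the result are all assumptions for this example; the third input C and the `other` variables are left out.

```python
from itertools import permutations, product

WIDTH = 4                      # small enough to check every input pair
MASK = (1 << WIDTH) - 1
OPS = ("Add", "Mul")


def pe(op, invert_a, c_in, a, b):
    """Toy PE: optional inversion of A, then Add (with carry-in) or Mul."""
    if invert_a:
        a = ~a & MASK
    if op == "Add":
        return (a + b + c_in) & MASK
    return (a * b) & MASK


def coreir_sub(in0, in1):
    return (in0 - in1) & MASK


def find_rewrite_rule(spec):
    """Search for (inst, input_binding) such that the PE equals spec on all inputs."""
    values = range(1 << WIDTH)
    for op, invert_a, c_in in product(OPS, (0, 1), (0, 1)):       # exists inst
        for to_a, to_b in permutations(("in0", "in1")):           # exists binding
            def agrees(in0, in1):
                env = {"in0": in0, "in1": in1}
                return spec(in0, in1) == pe(op, invert_a, c_in, env[to_a], env[to_b])

            if all(agrees(x, y) for x in values for y in values):  # for all inputs
                return {"op": op, "invert_A": invert_a, "c_in": c_in,
                        "A": to_a, "B": to_b}
    return None


rule = find_rewrite_rule(coreir_sub)
print(rule)
# {'op': 'Add', 'invert_A': 1, 'c_in': 1, 'A': 'in1', 'B': 'in0'}
```

The search recovers the subtract encoding shown earlier, with In1 bound to A and In0 bound to B. Enumeration only works at toy sizes; at 16 bits and with a realistic instruction space the quantified query is what calls for an SMT solver.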

Out

In0 In1

CoreIR.Sub

res flag

A B C

PE

inst

How to Handle State?

res flag

A B C

PE

inst

State

How to Handle State?

res flag

A B C

PE

inst

State

res flag

A B C

PE

inst

State

Transform

Floating Point?

res flag

A B C

PE

inst

Floating Point

res flag

A B C

PE

inst

Transform

Floating Point

Performance of Rewrite Rule Generator

• Problem: Universally quantified SMT queries can take a long time
• Solutions:
  • It is okay to be slightly slow (unless doing design space exploration, DSE!)
  • Different ways to encode the final formula
  • Different techniques for solving quantified expressions
• Recent results:
  • ~1 minute to solve 20 rewrite rules on our current CGRA.

What patterns to use in the rewrite rule table?

PEak Compiler


Halide Compiler

CoreIR Graph

PE and MEM Mapper

Mapped CoreIR Graph

CGRA Bitstream



Rewrite Rule 1

Rewrite Rule 2

Rewrite Rule 3

Rewrite Rule 4

…

Rewrite Rule Table

??

div

mul add

sub

ashr add

Which Patterns?

• Enumerate all possible patterns up to a size
  • Lots of uncommon patterns
  • Bloated Rewrite Rule Table
  • Slower instruction selection
• Analyze target domain’s applications for common subgraphs
  • Approach used for our upcoming DSE paper
• Only very basic patterns
  • Use peephole optimization/packing after instruction selection
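The second option, mining applications for common subgraphs, can be sketched in its simplest form by counting two-node patterns (producer feeding consumer) across a set of dataflow graphs. The application names and edge lists below are invented for illustration, and real subgraph mining would consider larger patterns than single edges.

```python
from collections import Counter

# Hypothetical applications, each given as dataflow edges (producer_op, consumer_op).
APPLICATIONS = {
    "app_a": [("mul", "add"), ("mul", "add"), ("add", "ashr")],
    "app_b": [("mul", "add"), ("sub", "mul"), ("add", "ashr")],
    "app_c": [("mul", "add"), ("add", "sub")],
}


def common_patterns(apps, min_apps):
    """Return patterns found in at least min_apps applications, with total counts."""
    totals = Counter()     # how often each pattern occurs overall
    presence = Counter()   # in how many applications each pattern occurs
    for edges in apps.values():
        totals.update(edges)
        presence.update(set(edges))
    return {p: totals[p] for p in totals if presence[p] >= min_apps}


print(common_patterns(APPLICATIONS, min_apps=2))
# {('mul', 'add'): 4, ('add', 'ashr'): 2}
```

Patterns that pass the threshold are candidates for the rewrite rule table; rare ones are dropped, which keeps the table small.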

CPU Instruction Selection

Unified Buffer

Unified Buffer

Computation Kernel

Computation Kernel

CoreIR Graph

From Global Buffer

To Global Buffer

CGRA Compilation

Basic Block

Basic Block

Basic Block

Basic Block

R2 <- Sub(R0, R1)
R3 <- M[R2]
M[R3] <- R1
R4 <- Add(R1, 0x50)
…

Control Flow Graph

Basic Block (Machine independent)

In0 In1

Out

Out <- Sub(In0, In1)

Compiling WebAssembly to RiscV?

RISCV

inst

Register File

WebAssembly Subtract

Transform RiscV to remove Register File

RISCV

inst

Transform

Register File

Register File

RISCV

inst

rs1 rs2

rd

In0 In1

Out

Out <- Sub(In0, In1)

Discovering Subtract

RISCV

inst rs1 rs2

rd

RISCV

inst rs1 rs2

rd

Branch/Memory Instructions?

PC MemRead

Next PC

Mem Addr

Mem Write

The Future

• Goal: Fully Automatic compiler generation for Accelerator Architectures

Thank You
