CAVA: Open Source Infrastructure for System-On-a-Chip Designs

Peter Hsu, Ph.D.

Peter Hsu Consulting, Inc., 43551 Mission Blvd., Fremont, CA 94539

email: peterhsu@cs.wisc.edu

Presented 14 March 2002 at the University of Wisconsin in Madison


2

Design Infrastructure

Hardware "Intellectual Properties"
– Processor Core(s)
– Multiprocessor Bus, External Interfaces

Software Development Environment
– GNU Compiler (gcc, as, ld), Utilities (glibc, gdb, …)
– Linux Operating System: Kernel, Drivers

Analysis Tools
– Performance Simulator(s)
– Architecture Verification Programs


3

Multiprocessor Organization

[Figure: block diagram. A Memory Controller is shared by several nodes, each a CPU with L1$ and an IO interface. One node is a CPU with New Insns and a Complex IO interface; another attaches Application Specific Logic.]


4

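Cost-Performance Evolution

[Figure: die area, cost, and clock frequency across process generations]

Process    Die Area   Per Wafer   Yield   Cost
0.18 µm    60 mm²     500         70%     $10
0.13 µm    30 mm²     1000        85%     $4
0.09 µm    15 mm²                 90%     $2

Other labels in the figure: 30; 2-4 CPUs; 8MB DRAM, $5; +30%; clock frequencies 400 MHz, 600 MHz, 800 MHz, 1 GHz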
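
As a rough cross-check of the figure, the sketch below multiplies out the numbers for the first two process generations. It assumes that "500/wafer" and "1000/wafer" mean dies per wafer and that the dollar figures are costs per good die. Neither reading is stated on the slide, and the function name is hypothetical.

```python
# (process, dies per wafer, yield, cost per good die in dollars), read off the figure labels.
# Interpreting the labels this way is an assumption, not something the slide states.
GENERATIONS = [
    ("0.18um", 500, 0.70, 10.0),
    ("0.13um", 1000, 0.85, 4.0),
]


def implied_wafer_cost(dies_per_wafer, yield_fraction, cost_per_good_die):
    """Wafer cost that would produce the quoted cost per good die."""
    return dies_per_wafer * yield_fraction * cost_per_good_die


for name, dies, yield_fraction, cost in GENERATIONS:
    print(name, round(implied_wafer_cost(dies, yield_fraction, cost)))
```

Under these assumptions both generations imply a similar wafer cost (roughly $3,500 and $3,400), which fits the slide's picture of die cost falling mainly through smaller dies and better yield.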


5

Author's Perspective

Maybe Boring…
– "Decent Computer Using Crummy Technology"
  • Example: Seymour Cray's vs. IBM's Approach
– "Get More out of Existing Technology"
  • Hybrid Petro/Electric Car

Cost Drives Application Breadth
– "900 Mips Make A Better Light Bulb"
– "Earring PDA Helps You Remember Dates"
– "Silicon Sequin Dress Adapts to Weather"


6

Efficiency Matters: F × C × V²

Frequency (F)
– Fewer Instructions, Lower Latencies

Capacitance (C)
– Fewer Gates, Narrower Bus, …

Voltage (V)
– Slower Gates, Less Logic per Cycle

Same Design
– 600 MHz, 1.2 V, 4 W → ½ W, 250 MHz, 0.6 V
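
The "Same Design" numbers can be checked against the F × C × V² relation. This is a minimal sketch that assumes power scales exactly with frequency times voltage squared and that capacitance is unchanged. The function name is hypothetical.

```python
def scaled_power(p_ref_w, f_ref_hz, v_ref, f_new_hz, v_new):
    """Scale a reference power by (F_new / F_ref) * (V_new / V_ref)^2, capacitance unchanged."""
    return p_ref_w * (f_new_hz / f_ref_hz) * (v_new / v_ref) ** 2


# Reference point from the slide: 4 W at 600 MHz and 1.2 V.
low_power_w = scaled_power(4.0, 600e6, 1.2, 250e6, 0.6)
print(f"{low_power_w:.2f} W")
```

The result is about 0.42 W, consistent with the slide's rounded ½ W at 250 MHz and 0.6 V.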


7

Why Open Source?

Embedded Market
– High Volume, Low Prices, Little Profit, Small Engineering Budget, Few Inventions…
– Most SoC Use 1970's Architecture
  • Motorola 6800, x86 Family (e.g. Z80)
– Royalty Payment Based Innovation
  • ARM, MIPS, Tensilica, … Future Unclear

Industry Ripe for New Market Dynamics
– Cava: Set New Cost-Performance Standard
– No Royalties: New Players, Opportunities


8

Key Decisions

"RTL" Design
– Integration Effort, Process Migration, Accessibility
  • More Important Than Pure Performance
– Logic Synthesis, Standard Cells, P&R
  • No Dynamic Circuits, No Custom Layout

New ISA
– Companies: "Open" Architectures Feel "Unsafe"
  • Originator Company Big and Mean
– Neutrality: "Perception is Reality"
  • Ex: Linux Very x86-centric but Perceived Otherwise


9

Why ASIC CPU Usually Slow

Design Issue
– C-MOS vs. Dynamic Circuits
  • High Fan-In Gate: Cache Hit Detection, TLB
  • ASIC SRAM Very Fast ("Hard Macro")
– Fruitful Area for Innovations…

Manufacturing Issue
– Make Few Wafers: Worst-Case Design
– Make Many Wafers: Typically 50% Faster
  • 2GHz Pentium 4 ISSCC Chips
– SoC: "Make Cheaper, Goes Faster"


10

Ex: Tag Match Critical Path

Dynamic vs. Static
– Adder: 20, 30
– Tag: 15, 25
– Data: 20, 25
– Match: 5, 20

2-Cycle Load
– D: (20+15+5)/2 = 20
– S: (30+25+20)/2 = 38
– 800 vs. 400 MHz
  • Inv Delay = 63ps

[Figure: load path through the Adder, then the Tag and Data arrays, each followed by an "=" comparator; delays labeled 30, 25, 20]
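
The slide's arithmetic can be replayed in a few lines. This sketch assumes the stage numbers are in units of inverter delays, which the slide does not say explicitly, although the quoted 63 ps figure and the resulting clock rates suggest it. The function and constant names are hypothetical.

```python
INV_DELAY_PS = 63.0  # inverter delay quoted on the slide

# Stage delays as (dynamic, static) pairs, assumed to be in inverter delays.
STAGES = {"adder": (20, 30), "tag": (15, 25), "data": (20, 25), "match": (5, 20)}


def cycle_in_inverter_delays(style):
    """Adder -> tag -> match path split over two cycles, following the slide's sums."""
    i = 0 if style == "dynamic" else 1
    path = STAGES["adder"][i] + STAGES["tag"][i] + STAGES["match"][i]
    return path / 2


def clock_mhz(inverter_delays):
    """Clock frequency in MHz for a cycle of the given length."""
    period_ps = inverter_delays * INV_DELAY_PS
    return 1e6 / period_ps


for style in ("dynamic", "static"):
    cycle = cycle_in_inverter_delays(style)
    print(style, cycle, round(clock_mhz(cycle)), "MHz")
```

With these inputs the dynamic design comes out near 794 MHz and the static design near 423 MHz, which the slide rounds to 800 and 400 MHz.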


11

Recurrence Forward Substitution

Replicate ALU
– (30+25)/2 = 27.5
– 600 MHz
  • vs. 800, 400 MHz
– 5K gates, 0.1mm²

Opportunity
– Most CPU Knowledge from Custom Designs
– Experiment Using Logic Synthesis, Place&Route

[Figure: the load path with a replicated Adder feeding the Tag and Data arrays and their "=" comparators; delays labeled 30, 20, 25, 25]
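
The 600 MHz figure follows from the same kind of arithmetic as the tag match example. The worked step below assumes the delays are in inverter delays and reuses the 63 ps inverter delay quoted on the tag match slide.

```latex
\[
t_{\text{cycle}} = \frac{30 + 25}{2} \times 63\,\text{ps}
                 = 27.5 \times 63\,\text{ps}
                 = 1732.5\,\text{ps},
\qquad
f = \frac{1}{t_{\text{cycle}}} \approx 577\,\text{MHz}
\]
```

That is close to the slide's rounded 600 MHz and sits between the 800 MHz dynamic and 400 MHz static designs.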


12

Instruction Set

RISC Lessons: Many Registers, Everything Else

Code Density
– Silicon: CPU ≈ 3 MBytes of DRAM
– Applications: Millions of Lines of Code

24-bit Instructions
– Two 64-reg Specifiers

[Figure: instruction format diagrams built from x, y, op, imm, and disp fields with widths 6, 4, 6, 8, for the forms x = x op y, x = x op imm, and x = (y+disp); further labels: 32-bit address, 24-bit offset]


13

Variable Length Instructions

Superscalar Paradigm
– Speculatively Decode, Ignore Some
  • Middle vs. Always Instructions At End
– Idiom Acceleration
  • Set high 16bit + load with 16bit displacement

Key Features
– Lengths In Multiples (Seymour Cray's Parcel)
  • Else Many Wasted Decode Stations
– Opcode, Register Fields in 1st Parcel
  • Single-Issue Design Conserve Gates, Power


14

Why Multiprocessor Architecture?

Efficiency
– Area
  • More Mips per Gate
– Power
  • Less Synchronous
  • Effective Clock Gating

Ease of Use
– Real Time
  • vs. Fancy Scheduling
– Customization

[Figure: Conventional Integration (a Superscalar Uniprocessor with MemCtrl, Eth1, Eth2, PCI 1, PCI 2, and USB blocks) compared with CAVA-1 (three "CPU + Thin IO" nodes sharing a MemCtrl)]


15

IO Architecture Evolution

Typical IO Controller Chip
– Mixed Analog/Digital Physical Media Interface
– Protocol Encapsulation, Error Handling
– Flow Control, Interrupts, DMA, Buffering

Market Forces
– Complex Controller Chips are Cost Effective
  • Economy of Scale Outweighs Gate Utilization
– Integrate Existing Chips into SoC
  • Minimize Engineering, Less "Risk"

Is "Holistic" Solution Better?


16

CAVA IO Architecture

Processor Doubles As IO Controller
– "Background" Context
  • 64 Registers, Overlapped Execution, Speculative
– "Foreground" Context
  • 4 Registers, De-pipelined (1 Instruction / 4 Clocks)
  • Branch on IO Register Bit (e.g. Error, Data Ready…)
– Dedicated Gates Only Physical Media Interface

Open Questions
– Efficiency: Dedicated Logic, Interference
– Need Good Abstraction


17

CAVA-1 Memory System

Initial Target
– No L2 Cache
  • If 6-T Memory Cell: Bigger L1, Same Cycle Time
– Not Rambus
  • Royalty Payment, RAC Process Migration Difficulties

DRAM
– FCRAM: "Slightly Better DRAM"
  • Cache Miss Penalty 50ns = 30 Cycles @ 600 MHz
  • 64-bit Interface: 2.4GB/s, 1 Byte/Cycle/CPU
– Flexibility Very Important
  • Also Use Standard DDR SDRAM
  • 64-, 32- and 16-bit wide (4, 2 or 1 chips)
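
The DRAM figures on this slide are mutually consistent, as the sketch below shows. The CPU count of 4 is an assumption, taken as the upper end of the "2-4 CPUs" label on the cost-performance slide. The variable names are hypothetical.

```python
CPU_CLOCK_HZ = 600e6               # target clock from the slide
MISS_PENALTY_S = 50e-9             # FCRAM cache miss penalty from the slide
DRAM_BANDWIDTH_BYTES_PER_S = 2.4e9 # 64-bit interface bandwidth from the slide
INTERFACE_BYTES = 8                # 64-bit interface
NUM_CPUS = 4                       # assumption: not stated on this slide

miss_penalty_cycles = MISS_PENALTY_S * CPU_CLOCK_HZ
transfers_per_s = DRAM_BANDWIDTH_BYTES_PER_S / INTERFACE_BYTES
bytes_per_cycle_per_cpu = DRAM_BANDWIDTH_BYTES_PER_S / CPU_CLOCK_HZ / NUM_CPUS

print(round(miss_penalty_cycles), "cycles per miss")
print(round(transfers_per_s / 1e6), "million transfers per second")
print(round(bytes_per_cycle_per_cpu, 2), "bytes per cycle per CPU")
```

Under the four-CPU assumption, 2.4 GB/s works out to 4 bytes per 600 MHz cycle in total, or 1 byte per cycle per CPU, matching the slide.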


18

Multi-Thread Support

Memory Latency Impact
– 5% miss × 30 cycles = 1.5 cpi: Half Idle
– Significant Improvement, Lots of Silicon Area

2-Port SRAM as Register File
– 64 × 64 → 256 × 64: +50% Area
– CAVA-1: 3 Background, 16 Foreground

Interesting Questions
– "Helper Thread" for Serial Program
– Quantify Speculation: Area, Clock Speed, Perf
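
Two of the numbers above can be reproduced directly. The sketch assumes the 5% miss rate is per instruction, and it uses a hypothetical base CPI of 1.5 to illustrate "Half Idle", since the slide gives no base CPI. The registers per context (64 background, 4 foreground) come from the CAVA IO Architecture slide.

```python
MISS_RATE = 0.05          # misses per instruction (assumed meaning of "5% miss")
MISS_PENALTY_CYCLES = 30  # from the memory system slide

stall_cpi = MISS_RATE * MISS_PENALTY_CYCLES


def idle_fraction(base_cpi, stall_cpi):
    """Fraction of cycles spent waiting on memory."""
    return stall_cpi / (base_cpi + stall_cpi)


# Register file sizing: contexts times registers per context.
BACKGROUND_CONTEXTS, BACKGROUND_REGS = 3, 64
FOREGROUND_CONTEXTS, FOREGROUND_REGS = 16, 4
total_registers = (BACKGROUND_CONTEXTS * BACKGROUND_REGS
                   + FOREGROUND_CONTEXTS * FOREGROUND_REGS)

print(stall_cpi, idle_fraction(1.5, stall_cpi), total_registers)
```

Three background contexts and sixteen foreground contexts add up to 256 registers, which matches the 256 × 64 register file on the slide.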


19

Project Status/Roadmap

Phase 1: Running (Limping :-)
– gcc, as, ld, glibc, gdb, ISA emulator

Phase 2: Fall 2002
– Linux Kernel, IO Controller Programs
  • Standalone PC Emulator, FPGA PCI Card for IO
  • Key Decisions re: Verification Strategy
– Quantitative Analysis of Clock Cycle
  • Target 600 MHz, 0.18µm (typ. process, w.c. env.)

Phase 3: Spring 2003
– Design RTL, Tapeout, Party, Debug, …


20

ISA Research Tool

Instruction Descriptions
– Syntax in:
  • Lengths, Field Definitions, Opcode Encodings
      iR   imm[8] [6] x[6] [4=0]   Immediate with Register
      bR   [8] y[6] x[6] [4=F]     Binary Register Operator
– Semantics in:
  • Gcc "md" Patterns, Emulator C Statement
      iRia   addi   x += imm   Add Immediate
      (set (match_operand:DI "register_operand" "=r")
           (plus:DI (match_operand:DI "register_operand" "0")
                    (match_operand:DI "immediate_operand" "I")))

Consistent gcc, as, ld, gdb, emulator, doc
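
To make the field definitions concrete, here is a minimal sketch of packing and unpacking the 24-bit "iR" format (8 + 6 + 6 + 4 bits). It is not the CAVA tool itself. It assumes that fields are packed most-significant first in the order listed and that the unnamed 6-bit field is an opcode. The names `pack`, `unpack`, `op`, and `fmt`, and the opcode value used, are hypothetical.

```python
# Field widths for the "iR" format as listed on the slide: imm[8] [6] x[6] [4=0].
# Assumed: most-significant-first packing, and that the unnamed 6-bit field is an opcode.
IR_FIELDS = [("imm", 8), ("op", 6), ("x", 6), ("fmt", 4)]


def pack(fields, values):
    """Pack named field values into one instruction word, first field in the top bits."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Recover the named field values from an instruction word."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


# Hypothetical "addi x3 += 5"; opcode value 1 is made up for illustration.
word = pack(IR_FIELDS, {"imm": 5, "op": 1, "x": 3, "fmt": 0})
print(f"{word:06x}", unpack(IR_FIELDS, word))
```

Driving both directions from one field table mirrors the slide's goal of keeping gcc, as, ld, gdb, the emulator, and the documentation consistent from a single description.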


21

Conclusion

New Technology
– Creation Only Beginning
– Adoption: Critical Mass Understand, Can Modify
  • Necessary to Evangelize, to Share

Vision
– Modular Architectural Enhancements by Many
– Invent Novel SoC's Uninhibited by Lawyers

It's Fun
– Periodic Open Releases of Compiler, RTL
– Participate: Build Computer System from Scratch

