© Derek Chiou

RAMP-White

Derek Chiou and Hari Angepat

The University of Texas at Austin

Supported in part by DOE, NSF, IBM, Intel, and Xilinx

04/18/23 Derek Chiou, RAMP-White Tutorial, FCRC 2007

RAMP-White

Requirements
  • Coherent shared memory experimental platform
      - Configurable coherence protocol, engine
  • Scalable to the same level as other RAMP machines
      - 1K eventual target
      - Down to 2
  • Full system (OS, I/O, etc.)

Intentions
  • ISA/Architecture independent (like all RAMP efforts)
      - Use different cores
  • Integrate components from other RAMP participants
      - A test-bed for sharing IP

Texas Modifications to RAMP-White
  • New code in Bluespec rather than Verilog/VHDL
      - Many advantages including interfaces, configurability
      - My group’s hardware development is exclusively Bluespec
      - Free/low cost for academics (www.bluespec.com)
  • Start with XUP board
      - We had XUP before BEE2
  • Embedded PowerPC is starting core
      - It’s a free, fast core with real (incoherent) 16KB caches
      - No space issues on XUP
          - 2 Leons + MMU + memory controller barely fits (no space for our stuff)
      - RAMP is core independent
      - My research needs fast cores
      - Can then use synthesizable 405s
  • Multi-OS shared space
      - Processors map to shared global space
      - May try SMP OS, but unlikely to scale well to 1K processors

High-Level Architecture Philosophy
  • Flexibility
      - Avoid wasted work
      - Easy changes
      - Module-agnostic
          - Processors, network, I/O, etc.
  • Interfaces
      - Complete set of necessary interfaces
      - All communication via messages
          - Fixed fields, but fields are configurable
      - “Shims” connect components to White infrastructure
          - Use existing IP
  • Building one instance to confirm interface completeness

32b Address in Shared Memory Machine??
  • 4GB possible per BEE2 FPGA
      - Need more than 32b
  • Eventually, hope for 64b soft-core processors
  • For now two options:
      - Live with 4GB space
      - Or, provide one more layer of translation
          - Physical address in certain region is global virtual address
          - Translated by hardware to node + physical address
  • Also useful for multiple OSs in single memory
      - OSs tend to assume they own physical address 0
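The extra translation layer described on this slide can be sketched in software. This is only an illustration of the idea (a window of the 32-bit physical address space is treated as a global virtual address and split into node + physical address). The slides do not give an address layout, so every constant and name below (`GLOBAL_BASE`, `NODE_WINDOW`, `translate`, and so on) is a hypothetical choice.

```python
from typing import NamedTuple, Optional

# All constants are hypothetical; the slides do not specify a layout.
GLOBAL_BASE = 0x8000_0000      # start of the window treated as global virtual addresses
GLOBAL_SIZE = 0x4000_0000      # 1 GB window
NODE_WINDOW = 0x0010_0000      # each node exports 1 MB into the window (1024 nodes)
NODE_LOCAL_BASE = 0x0400_0000  # where the exported memory sits in each node's DRAM


class GlobalTarget(NamedTuple):
    node: int        # node that owns the memory
    phys_addr: int   # physical address within that node


def translate(addr: int) -> Optional[GlobalTarget]:
    """Map a 32-bit processor physical address to (node, physical address).

    Returns None for addresses outside the global window; those would be
    handled as ordinary local accesses.
    """
    if not GLOBAL_BASE <= addr < GLOBAL_BASE + GLOBAL_SIZE:
        return None
    offset = addr - GLOBAL_BASE
    node, node_offset = divmod(offset, NODE_WINDOW)
    return GlobalTarget(node, NODE_LOCAL_BASE + node_offset)
```

Because each OS only sees its own low physical memory plus the global window, every OS image can still assume it owns physical address 0, which is the second benefit the slide mentions.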

RAMP-White Block Diagram

[Figure: block diagram. Labeled blocks: Processor, Coherent $, Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Network Interface (NIU), Network Router. One region of the diagram is marked "Proc dependent".]

Three Phase Approach to Hardware
  • Phase 1: Incoherent shared memory
      - No hardware global cache, just global shared memory support
          - Optional cache for local memory
      - However, software can maintain coherence if necessary
          - Network virtual memory
          - Run a simulator on top of the processor
      - Ring network
  • Phase 2: Ring-based coherence (scalable bus)
      - Requires a coherent cache, IU awareness
          - Running what is essentially a snoopy protocol
      - True coherence engine not required
          - But, very restricted communication
      - Sufficient for testing, modeling many targets
  • Phase 3: General network-based coherence
      - Requires general coherence engine, general network

[Figure: node diagrams accompanying the three phases. Each node is drawn from P, $$, IU, MC, and I/O blocks; some nodes also include a C $ block.]

Intersection Unit
  • Processor interface
      - Slave
      - Snoop
  • Network interface
      - Master (send)
      - Slave (receive)
  • Memory interface
      - Master (issue memory requests)
  • Hooks for coherency engine
      - Bluespec is a good fit for specifying the coherence engine
      - Incoherent version is a special case
  • Programmable memory regions
      - Global (local and remote)
      - Local translation

Intersection Unit Internals

[Figure: internals of the Intersection Unit. Labels: Intersection Unit Controller, Memory Controller & DRAM, Controller BRAMs, two sets of Proc / IO / Net ports, Global Address Translation hardware.]

Network Interface Unit
  • Currently two virtual channels
  • Split into two components
      - Msg composition/queuing
      - Net transmit/receive
          - Insert/extract for ring
          - Intended to permit other net-specific transmit/receive
  • One input/one output
      - Creates a simple unidirectional ring
      - Can interface to more advanced fabrics
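The insert/extract behavior on a unidirectional ring can be illustrated with a small software model. This is a simplified sketch, not the Bluespec NIU: it ignores virtual channels and flow control, and the names (`RingNode`, `step`, `run`) and the dictionary message format are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RingNode:
    node_id: int
    inject_q: deque = field(default_factory=deque)  # messages waiting to enter the ring
    delivered: list = field(default_factory=list)   # messages extracted at this node


def step(nodes, links):
    """Advance the ring by one hop.

    links[i] holds the message arriving at node i (or None). A node
    extracts a message addressed to it and forwards anything else; when
    its output slot is free it may insert one of its own messages.
    """
    n = len(nodes)
    new_links = [None] * n
    for i, node in enumerate(nodes):
        msg = links[i]
        if msg is not None and msg["dest"] == node.node_id:
            node.delivered.append(msg)      # extract
            msg = None
        if msg is None and node.inject_q:
            msg = node.inject_q.popleft()   # insert
        new_links[(i + 1) % n] = msg        # one output, feeding the next node
    return new_links


def run(nodes, max_steps=100):
    """Step the ring until it drains or max_steps is reached."""
    links = [None] * len(nodes)
    for _ in range(max_steps):
        links = step(nodes, links)
        if all(m is None for m in links) and not any(n.inject_q for n in nodes):
            break
    return nodes
```

For example, with four nodes, a message queued at node 0 for node 2 passes through node 1 untouched and is extracted at node 2. Swapping `step` for a different transmit/receive function is the software analogue of the "other net-specific transmit/receive" the slide mentions.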

IU Internal Message

Format:
  | PRI | CMD | PERM | SIZE | TAG |
  | GADDR                        |
  | DATA                         |

Defaults
  • PRI: High priority, Low priority
  • CMD: Read, Write, Coherence, …
  • PERM: Modified, Exclusive, Shared, Invalid
  • SIZE: Byte, word, double word, cache-line
  • GADDR: global address (translated by IU)
  • DATA: dependent on size

Bluespec permits easy modification for your protocol
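As a rough software picture of the message format, the fields can be modeled as a record with enumerated defaults. This is a sketch only: the real definition is in Bluespec, the enum encodings are arbitrary, and the byte counts attached to SIZE (4-byte word, 8-byte double word, 32-byte cache line) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Pri(Enum):
    HIGH = 0
    LOW = 1


class Cmd(Enum):
    READ = 0
    WRITE = 1
    COHERENCE = 2   # the slide lists further commands without naming them


class Perm(Enum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3


class Size(Enum):
    # Values are payload lengths in bytes (assumed, not from the slides).
    BYTE = 1
    WORD = 4
    DOUBLE_WORD = 8
    CACHE_LINE = 32


@dataclass(frozen=True)
class IUMessage:
    pri: Pri
    cmd: Cmd
    perm: Perm
    size: Size
    tag: int
    gaddr: int          # global address, translated by the IU
    data: bytes = b""   # empty for messages that carry no payload

    def __post_init__(self):
        # DATA is "dependent on size": a payload, when present, must match SIZE.
        if self.data and len(self.data) != self.size.value:
            raise ValueError(
                f"expected {self.size.value} data bytes, got {len(self.data)}"
            )
```

Adapting the format to a different protocol then amounts to editing the enums, for example adding commands to `Cmd` or states to `Perm`.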

Network Message

Fields: PRI, DEST, SRC, SIZE, NETTAG, CMD, MESSAGE

  • PRI: High and Low
  • DEST, SRC: destination, source of message
  • SIZE: Total message size
  • NETTAG: network tag (optional)
  • CMD: network command (optional)
  • MESSAGE: data
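One way to picture the network message is as a fixed header followed by the payload, with SIZE covering the whole message. The sketch below packs and unpacks such a message. The field widths, their order on the wire, and the names `HEADER` and `NetMessage` are hypothetical, since the slides list the fields but not their encoding.

```python
import struct
from dataclasses import dataclass

# Hypothetical header layout, big-endian, no padding:
# PRI (1 byte), DEST (2), SRC (2), SIZE (2), NETTAG (1), CMD (1)
HEADER = struct.Struct(">BHHHBB")


@dataclass(frozen=True)
class NetMessage:
    pri: int
    dest: int
    src: int
    message: bytes
    nettag: int = 0   # optional field, 0 when unused
    cmd: int = 0      # optional field, 0 when unused

    @property
    def size(self) -> int:
        """Total message size: header plus payload, in bytes."""
        return HEADER.size + len(self.message)

    def pack(self) -> bytes:
        header = HEADER.pack(self.pri, self.dest, self.src,
                             self.size, self.nettag, self.cmd)
        return header + self.message

    @classmethod
    def unpack(cls, raw: bytes) -> "NetMessage":
        pri, dest, src, size, nettag, cmd = HEADER.unpack_from(raw)
        if size != len(raw):
            raise ValueError("SIZE field does not match message length")
        return cls(pri, dest, src, raw[HEADER.size:], nettag, cmd)
```

In this picture an IU internal message would travel as the MESSAGE payload, with the NIU adding DEST and SRC for routing.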

Programmer View
  • Sequential consistency
      - PowerPC
          - Global addresses labeled as uncached
          - Ordered accesses from PowerPC 405
          - Coherent global cache still uncached
      - Soft cores can be weaker
  • User interface
      - Terminal per core/OS if desired
      - Mmap to map shared memory
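From user space, mapping a shared region with mmap might look roughly like the following. The slides do not say which file or device RAMP-White exposes for the global region, so a temporary file stands in for it here, and `REGION_SIZE`, `open_shared`, and `demo` are hypothetical names for the example.

```python
import mmap
import os
import struct
import tempfile

REGION_SIZE = mmap.PAGESIZE   # hypothetical size of the shared region


def open_shared(path: str) -> mmap.mmap:
    """Map REGION_SIZE bytes of `path` as shared, read/write memory."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, REGION_SIZE, access=mmap.ACCESS_WRITE)
    finally:
        os.close(fd)   # the mapping stays valid after the descriptor is closed


def demo() -> int:
    # A temporary file stands in for whatever a real system would expose.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * REGION_SIZE)
        path = f.name
    try:
        writer = open_shared(path)
        reader = open_shared(path)
        struct.pack_into("<I", writer, 0, 0xCAFE)        # store through one mapping
        value = struct.unpack_from("<I", reader, 0)[0]   # load through the other
        writer.close()
        reader.close()
        return value
    finally:
        os.unlink(path)
```

The two mappings here play the role of two OS images that have each mapped the same global region: a store through one is visible to a load through the other.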

Operating System
  • Issues with SMP OS on embedded PowerPC
      - Incoherent cache
      - Load-reservation/store-conditional instructions not MP capable
      - Also missing TLB invalidation & OpenPIC (interprocessor interrupts, bring-up)
      - How scalable anyway? (1K processors)
  • Therefore, separate OS per core
      - Region of memory is global
          - Mmap
      - Locks implemented using regular loads/stores + sequential consistency
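A lock built only from ordinary loads and stores is possible when memory is sequentially consistent. The slides do not name the algorithm used, so the sketch below uses Peterson's two-processor algorithm as a stand-in. It is a simulation, not code for the hardware: each processor is a generator, every `yield` marks the end of one load or store, and a seeded random scheduler interleaves the two at that granularity. The names `worker` and `run` are assumptions for the example.

```python
import random


def worker(me, mem, iterations):
    """One processor: take lock, increment counter, release lock.

    Each `yield` follows exactly one load or store, so the scheduler can
    interleave the two processors at memory-operation granularity, as a
    sequentially consistent memory would.
    """
    other = 1 - me
    for _ in range(iterations):
        mem[f"flag{me}"] = 1              # store: I want the lock
        yield
        mem["turn"] = other               # store: let the other side go first
        yield
        while True:                       # spin using plain loads
            other_wants = mem[f"flag{other}"]
            yield
            if not other_wants:
                break
            turn = mem["turn"]
            yield
            if turn != other:
                break
        value = mem["counter"]            # critical section: load ...
        yield
        mem["counter"] = value + 1        # ... then store
        yield
        mem[f"flag{me}"] = 0              # store: release the lock
        yield


def run(iterations=200, seed=0):
    """Interleave two workers pseudo-randomly and return the final counter."""
    mem = {"flag0": 0, "flag1": 0, "turn": 0, "counter": 0}
    rng = random.Random(seed)
    procs = [worker(0, mem, iterations), worker(1, mem, iterations)]
    while procs:
        p = rng.choice(procs)
        try:
            next(p)                       # execute one memory operation
        except StopIteration:
            procs.remove(p)
    return mem["counter"]
```

With the lock in place, the final counter always equals the total number of increments, regardless of the interleaving chosen by the seed. On hardware with a weaker memory model the same code would need barriers, which is why the sequential consistency guarantee matters here.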

Status: Phase 1 RAMP-White
  • Hari Angepat did the work
  • Components
      - Written in Bluespec
      - NIU code complete and tested
          - 2 processor ring
      - IU code complete and tested
          - Processor slave (no coherence right now)
          - PLB master/slave interface (I/O)
          - NIU interface
      - Hardware intended to target different ISAs
          - PLB master and slave shims written
  • Some preliminary OS work
      - Multi-image mmap interface running

Current RAMP-White Phase 1

[Figure: current Phase 1 system. Labels: Linux (x2), PPC 405 (x2), Intersection Unit (IU) (x2), Network Interface (NIU) (x2), PLB shim, Memory Controller (MC), IO & Platform Devices.]

Phase 1 Demo on XUP
  • Configuration
  • See both processors boot and run (top, cpu_info)
  • Run a simple “take-lock, increment counter, release lock”

Our Long Term Plans
  • Phase 1, XUP just started to work
      - With multi-OS, limited device support
      - Limited alpha release at the end of 3Q07
  • Phase 2
      - Coherent cache, IU forwarding modifications
      - Better OS support (ProtoFlex?)
      - Limited alpha release 1Q08
  • Phase 3
      - Arbitrary network, cache coherency engine
          - Getting network from Washington, Berkeley RDL? Leon?
      - Release depends on ease of integration

Conclusions
  • RAMP-White architecture
      - Phased approach minimizes wasted work
      - Designed to be easy to modify for your purpose
          - Many architectures only require modified coherence engine, maybe cache
      - ISA/implementation agnostic
          - Care taken to not be specific
  • RAMP-White Phase 1 works
      - Running on XUP
  • We will be our own customer
      - Building cycle-accurate x86 CMP simulator on top

Extra slides

Node Architecture

[Figure: node architecture variants. Each node is drawn from P, $$, IU, MC, and I/O blocks; some variants also include a C $ block.]

Generalized Architecture

[Figure: generalized node. Labels: Proc, $, IU (Intersection Unit), NIU (Network Interface Unit), MC, Mem, PLB, OPB bridge. The diagram is divided into "Proc dependent" and "Proc independent" regions.]

Sharing IP: Some Preliminary Experience
  • We looked at RAMP-Red XUP
      - Used some code (PLB master)
      - Red-BEE is not ready to distribute
  • Looking for switch code
      - Berkeley’s code on CVS repository
          - But, we can’t use memory controller because we don’t have BEE2 board yet
      - Bluespec
          - We are spinning almost all of our own code right now
  • Would like to steal software
      - OS (kernel proxy)
      - SMP OS port
  • Naming
      - MPI reference design in BEE2 repository
          - Is that RAMP-Blue?
  • A central CVS repository for RAMP code?

Sharing Over the Long Term
  • Processor is shared
      - Leon
      - PowerPC
      - MicroBlaze
      - Everything else
  • MC is shared
      - Xilinx or Berkeley
  • Coherent cache can be shared
      - Transactional/traditional
      - Borrow Stanford’s?
  • Coherency engine can be shared
      - CMU/Stanford
  • IU functionality can be shared
      - Trying to make ours general
  • NIU can be shared
      - Borrow half from Berkeley?
  • Network can be shared
      - Borrow Berkeley’s?

[Figure: node diagram. Labels: Proc, $, IU, NIU, MC, Mem, Peripherals, CCE.]