© Derek Chiou
RAMP-White
Derek Chiou and Hari Angepat
The University of Texas at Austin
Supported in part by DOE, NSF, IBM, Intel, and Xilinx
04/18/23 Derek Chiou, RAMP-White Tutorial, FCRC 2007
RAMP-White
Requirements
  Coherent shared memory experimental platform
    Configurable coherence protocol, engine
  Scalable to the same level as other RAMP machines
    1K eventual target
    Down to 2
  Full system (OS, I/O, etc.)
Intentions
  ISA/Architecture independent (like all RAMP efforts)
    Use different cores
  Integrate components from other RAMP participants
    A test-bed for sharing IP
Texas Modifications to RAMP-White
New code in Bluespec rather than Verilog/VHDL
  Many advantages, including interfaces and configurability
  My group’s hardware development is exclusively Bluespec
  Free/low cost for academics (www.bluespec.com)
Start with XUP board
  We had XUP before BEE2
Embedded PowerPC is starting core
  It’s a free, fast core with real (incoherent) 16KB caches
  No space issues on XUP
    2 Leons + MMU + memory controller barely fits (no space for our stuff)
  RAMP is core independent
  My research needs fast cores
  Can then use synthesizable 405s
Multi-OS shared space
  Processors map to shared global space
  May try SMP OS, but unlikely to scale well to 1K processors
High-Level Architecture Philosophy
Flexibility
  Avoid wasted work
  Easy changes
  Module-agnostic
    Processors, network, I/O, etc.
Interfaces
  Complete set of necessary interfaces
  All communication via messages
    Fixed fields, but fields are configurable
  “Shims” connect components to White infrastructure
    Use existing IP
Building one instance to confirm interface completeness
32b Address in Shared Memory Machine??
4GB possible per BEE2 FPGA
  Need more than 32b
Eventually, hope for 64b soft-core processors
For now, two options:
  Live with 4GB space
  Or, provide one more layer of translation
    Physical address in certain region is global virtual address
    Translated by hardware to node + physical address
    Also useful for multiple OSs in single memory
      OSs tend to assume they own physical address 0
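The extra translation layer can be sketched in software as follows. This is only an illustration under stated assumptions: the window base, window size, per-node share, and the block distribution of the global space across nodes are all hypothetical and do not come from the slides, which only say that a physical address in a certain region is treated as a global virtual address and translated to node + physical address.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlobalWindow:
    """Hypothetical description of the region treated as global address space."""
    base: int            # first processor physical address inside the window (assumed)
    size: int            # window size in bytes (assumed)
    bytes_per_node: int  # memory each node contributes (assumed block distribution)
    local_offset: int    # where the shared region starts in each node's memory (assumed)

def translate(window, paddr):
    """Return (node, node physical address) for a global access, or None if local."""
    if not (window.base <= paddr < window.base + window.size):
        return None  # outside the window: an ordinary local access
    gaddr = paddr - window.base  # global virtual address
    node, offset = divmod(gaddr, window.bytes_per_node)
    return node, window.local_offset + offset

window = GlobalWindow(base=0x8000_0000, size=0x4000_0000,
                      bytes_per_node=0x1000_0000, local_offset=0x1000_0000)
print(translate(window, 0x9000_0010))  # an address that falls in node 1's share
print(translate(window, 0x0000_1000))  # a local address, not translated
```

Because every node applies the same per-node offset, each OS can keep its own private memory starting at physical address 0 while the shared region lives elsewhere.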
RAMP-White Block Diagram
[Figure: block diagram. Labels: Processor, Coherent $, Intersection Unit (IU), Memory Controller (MC), Network Interface (NIU), Network Router, IO & Platform Devices; part of the diagram is marked "Proc dependent"]
Three Phase Approach to Hardware
Phase 1: Incoherent shared memory
  No hardware global cache, just global shared memory support
    Optional cache for local memory
  However, software can maintain coherence if necessary
    Network virtual memory
    Run a simulator on top of the processor
  Ring network
Phase 2: Ring-based coherence (scalable bus)
  Requires a coherent cache, IU awareness
    Running what is essentially a snoopy protocol
  True coherence engine not required
    But, very restricted communication
  Sufficient for testing, modeling many targets
Phase 3: General network-based coherence
  Requires general coherence engine, general network
[Figure: node diagrams for the three phases, built from P, $$, IU, MC, I/O, and C $ blocks]
Intersection Unit
Processor interface
  Slave
  Snoop
Network interface
  Master (send)
  Slave (receive)
Memory interface
  Master (issue memory requests)
Hooks for coherency engine
  Bluespec is well suited to specifying the coherence engine
  Incoherent version is a special case
Programmable memory regions
  Global (local and remote)
  Local translation
[Figure: RAMP-White block diagram. Labels: Processor, Coherent $, Intersection Unit (IU), Memory Controller (MC), Network Interface (NIU), IO & Platform Devices]
Intersection Unit Internals
[Figure: Intersection Unit internals. Labels: Intersection Unit Controller; Memory Controller & DRAM; Controller BRAMs; Proc, IO, and Net ports (shown twice); Global Address Translation hardware]
Network Interface Unit
Currently two virtual channels
Split into two components
  Msg composition/queuing
  Net transmit/receive
    Insert/extract for ring
    Intended to permit other net-specific transmit/receive
One input/one output
  Creates a simple unidirectional ring
  Can interface to more advanced fabrics
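The insert/extract behavior on a unidirectional ring can be sketched roughly as below. This is a simplified model, not the Bluespec NIU: the class and function names are hypothetical, the two virtual channels are not modeled, and giving ring traffic priority over local insertion is an assumption made for the sketch.

```python
from collections import deque

class RingNode:
    """One NIU on the ring: a single input link and a single output link."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.send_queue = deque()  # locally composed messages waiting to be inserted
        self.received = []         # messages extracted for this node

    def step(self, incoming):
        """Consume the ring input for this cycle and return the ring output."""
        if incoming is not None:
            if incoming["dest"] == self.node_id:
                self.received.append(incoming)  # extract: the message stops here
            else:
                return incoming                 # forward (assumed to beat insertion)
        if self.send_queue:
            return self.send_queue.popleft()    # insert into the free slot
        return None

def run_ring(nodes, cycles):
    """Advance the ring; the output of node i becomes the input of node i+1."""
    links = [None] * len(nodes)  # links[i] is the message arriving at node i
    for _ in range(cycles):
        outputs = [node.step(links[i]) for i, node in enumerate(nodes)]
        links = [outputs[(i - 1) % len(nodes)] for i in range(len(nodes))]
    return nodes

nodes = [RingNode(i) for i in range(4)]
nodes[0].send_queue.append({"dest": 2, "src": 0, "message": "read"})
nodes[3].send_queue.append({"dest": 1, "src": 3, "message": "write"})
run_ring(nodes, cycles=8)
print(nodes[2].received, nodes[1].received)
```

Keeping message composition separate from the `step` logic mirrors the split described on the slide: only `step` would need to change to target a different fabric.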
[Figure: RAMP-White block diagram. Labels: Processor, Coherent $, Intersection Unit (IU), Memory Controller (MC), Network Interface (NIU), IO & Platform Devices]
IU Internal Message
Defaults
  PRI: High priority, Low priority
  CMD: Read, Write, Coherence, …
  PERM: Modified, Exclusive, Shared, Invalid
  SIZE: Byte, word, double word, cache-line
  GADDR: global address (translated by IU)
  DATA: dependent on size
Bluespec permits easy modification for your protocol
[Message layout: PRI | CMD | PERM | SIZE | TAG, then GADDR, then DATA]
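The idea of fixed fields with configurable widths can be illustrated with a small packing routine. The field names follow the slide, but the numeric encodings, the bit widths, and the function names are assumptions made for this sketch; the actual design is written in Bluespec and its encodings are not given in the slides.

```python
from enum import IntEnum

class Pri(IntEnum):
    HIGH = 0
    LOW = 1

class Cmd(IntEnum):
    READ = 0
    WRITE = 1
    COHERENCE = 2

class Perm(IntEnum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3

class Size(IntEnum):
    BYTE = 0
    WORD = 1
    DOUBLE_WORD = 2
    CACHE_LINE = 3

# Assumed widths in bits; changing this table reconfigures the header layout.
DEFAULT_WIDTHS = {"pri": 1, "cmd": 4, "perm": 2, "size": 2, "tag": 8, "gaddr": 40}

def pack_header(fields, widths=DEFAULT_WIDTHS):
    """Pack header fields into one integer, first field in the highest bits."""
    word = 0
    for name, width in widths.items():
        value = int(fields[name])
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack_header(word, widths=DEFAULT_WIDTHS):
    """Inverse of pack_header for the same width table."""
    fields = {}
    for name, width in reversed(list(widths.items())):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields

header = {"pri": Pri.LOW, "cmd": Cmd.WRITE, "perm": Perm.SHARED,
          "size": Size.WORD, "tag": 5, "gaddr": 0x12_3456_7890}
print(unpack_header(pack_header(header)))
```

A protocol that needs, say, a wider tag would pass a modified width table to both functions rather than editing the packing logic.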
Network Message
PRI: High and Low
DEST, SRC: destination, source of message
SIZE: Total message size
NETTAG: network tag (optional)
CMD: network command (optional)
MESSAGE: data
[Message layout fields: PRI, DEST, SRC, NETTAG, CMD, SIZE, MESSAGE]
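As a rough illustration of the network message fields, the sketch below models a message as a record and builds a reply by swapping DEST and SRC. The header size, the use of bytes for the payload, and the `reply_to` helper are hypothetical and chosen only for the example; the slides do not specify them.

```python
from dataclasses import dataclass
from typing import Optional

HEADER_BYTES = 8  # assumed header size; not given in the slides

@dataclass(frozen=True)
class NetworkMessage:
    pri: int                      # 0 = high, 1 = low (assumed encoding)
    dest: int                     # destination node
    src: int                      # source node
    message: bytes                # payload, e.g. an encoded IU internal message
    nettag: Optional[int] = None  # optional network tag
    cmd: Optional[int] = None     # optional network command

    @property
    def size(self) -> int:
        """Total message size: assumed header plus payload."""
        return HEADER_BYTES + len(self.message)

def reply_to(request, payload):
    """Build a reply that travels back to the requester and keeps its tag."""
    return NetworkMessage(pri=request.pri, dest=request.src, src=request.dest,
                          message=payload, nettag=request.nettag)

request = NetworkMessage(pri=0, dest=1, src=0, message=b"\x00" * 4, nettag=7)
reply = reply_to(request, b"abcdefgh")
print(reply.dest, reply.src, reply.nettag, reply.size)
```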
Programmer View
Sequential consistency
  PowerPC
    Global addresses labeled as uncached
    Ordered accesses from PowerPC 405
    Coherent global cache still uncached
  Soft cores can be weaker
User interface
  Terminal per core/OS if desired
  Mmap to map shared memory
Operating System
Issues with SMP OS on embedded PowerPC
  Incoherent cache
  Load-reservation/store-conditional instructions not MP capable
  Also missing TLB invalidation & OpenPIC (interprocessor interrupts, bring-up)
  How scalable anyway? (1K processors)
Therefore, separate OS per core
  Region of memory is global
    Mmap
  Locks implemented using regular loads/stores + sequential consistency
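One well-known way to build a lock from plain loads and stores under sequential consistency is Peterson's algorithm for two participants. The slides do not say which algorithm RAMP-White uses, so the sketch below is only an illustration of the idea. It simulates two cores as Python generators, where each `yield` marks the end of one shared-memory access and a random scheduler picks the interleaving; the `holders` field is bookkeeping for the check and not part of the lock.

```python
import random

def core(me, mem, iterations):
    """One core running Peterson's lock around a shared counter increment."""
    other = 1 - me
    for _ in range(iterations):
        mem["flag"][me] = True       # store: announce intent to take the lock
        yield
        mem["turn"] = other          # store: let the other core go first
        yield
        while True:                  # spin using plain loads only
            other_wants = mem["flag"][other]
            yield
            if not other_wants:
                break
            turn = mem["turn"]
            yield
            if turn != other:
                break
        mem["holders"] += 1          # bookkeeping only: count cores inside
        assert mem["holders"] == 1, "mutual exclusion violated"
        value = mem["counter"]       # load the shared counter
        yield
        mem["counter"] = value + 1   # store the incremented value
        yield
        mem["holders"] -= 1
        mem["flag"][me] = False      # store: release the lock
        yield

def run_demo(iterations, seed):
    """Interleave two cores one memory access at a time; return the final counter."""
    mem = {"flag": [False, False], "turn": 0, "counter": 0, "holders": 0}
    cores = [core(0, mem, iterations), core(1, mem, iterations)]
    rng = random.Random(seed)
    while cores:
        running = rng.choice(cores)
        try:
            next(running)
        except StopIteration:
            cores.remove(running)
    return mem["counter"]

print(run_demo(iterations=20, seed=0))
```

The simulation executes accesses one at a time in program order, which is exactly the sequential consistency the lock depends on; on hardware with weaker ordering, the same code would need memory barriers.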
Status: Phase 1 RAMP-White
Hari Angepat did the work
Components
  Written in Bluespec
  NIU code complete and tested
    2 processor ring
  IU code complete and tested
    Processor Slave (no coherence right now)
    PLB Master/slave interface (I/O)
    NIU interface
  Hardware intended to target different ISAs
    PLB master and slave shims written
Some preliminary OS work
  Multi-image mmap interface running
Current RAMP-White Phase 1
[Figure: two nodes, each with a PPC 405 running Linux, an Intersection Unit (IU), and a Network Interface (NIU); other labels: PLB shim, Memory Controller (MC), IO & Platform Devices]
Phase 1 Demo on XUP
Configuration
See both processors boot and run (top, cpu_info)
Run a simple “take-lock, increment counter, release lock”
Our Long Term Plans
Phase 1, XUP just started to work
  With multi-OS, limited device support
  Limited alpha release end of 3Q07
Phase 2
  Coherent cache, IU forwarding modifications
  Better OS support (ProtoFlex?)
  Limited alpha release 1Q08
Phase 3
  Arbitrary network, cache coherency engine
    Getting network from Washington, Berkeley RDL? Leon?
  Release depends on ease of integration
Conclusions
RAMP-White architecture
  Phased approach minimizes wasted work
  Designed to be easy to modify for your purpose
    Many architectures only require modified coherence engine, maybe cache
  ISA/implementation agnostic
    Care taken to not be specific
RAMP-White Phase 1 works
  Running on XUP
We will be our own customer
  Building cycle-accurate x86 CMP simulator on top
Node Architecture
[Figure: node diagrams built from P, $$, IU, MC, I/O, and C $ blocks]
Generalized Architecture
[Figure: generalized node. Labels: Proc, $, PLB, Intersection Unit (IU), Network Interface Unit (NIU), MC, Mem, OPB bridge; components are marked as either "Proc dependent" or "Proc independent"]
Sharing IP: Some Preliminary Experience
We looked at RAMP-Red
  XUP
    Used some code (PLB master)
  Red-BEE is not ready to distribute
Looking for switch code
  Berkeley’s code on CVS repository
    But, we can’t use memory controller because we don’t have BEE2 board yet
  Bluespec
    We are spinning almost all of our own code right now
Would like to steal software
  OS (kernel proxy)
  SMP OS port
Naming
  MPI reference design in BEE2 repository
    Is that RAMP-Blue?
A central CVS repository for RAMP code?
Sharing Over the Long Term
Processor is shared
  Leon
  PowerPC
  MicroBlaze
Everything else
  MC is shared
    Xilinx or Berkeley
  Coherent cache can be shared
    Transactional/traditional
    Borrow Stanford’s?
  Coherency engine can be shared
    CMU/Stanford
  IU functionality can be shared
    Trying to make ours general
  NIU can be shared
    Borrow half from Berkeley?
  Network can be shared
    Borrow Berkeley’s?
[Figure: node diagram. Labels: Proc, $, IU, NIU, MC, Mem, Peripherals, CCE]