GEMS Tutorial ISCA05

(C) 2005 Multifacet Project http://www.cs.wisc.edu/gems ISCA Tutorial, June 5th, 2005. Mike Marty, Brad Beckmann, Luke Yen, Alaa Alameldeen, Min Xu, Kevin Moore. Please Ask Questions

Transcript of GEMS Tutorial ISCA05

Page 1: GEMS Tutorial ISCA05

(C) 2005 Multifacet Project http://www.cs.wisc.edu/gems

ISCA Tutorial, June 5th, 2005

Mike Marty, Brad Beckmann, Luke Yen, Alaa Alameldeen, Min Xu, Kevin Moore

Please Ask Questions

Page 2: GEMS Tutorial ISCA05

Slide 2 http://www.cs.wisc.edu/gems

What do you want to simulate?

[Figure: example target systems: Uniprocessor, Symmetric Multiprocessor, Glueless Multiprocessor, Chip Multiprocessor (CMP), and Multiple-CMP]

Page 3: GEMS Tutorial ISCA05


Open Source Release of GEMS

• GEMS v1.1 released as GPL software http://www.cs.wisc.edu/gems

• Contributors

Alaa Alameldeen

Brad Beckmann

Ross Dickson

Pacia Harper

Milo Martin

Mike Marty

Carl Mauer

Kevin Moore

Manoj Plakal

Dan Sorin

Min Xu

Luke Yen

• Multifacet Project directed by Mark Hill & David Wood

Page 4: GEMS Tutorial ISCA05

GEMS Requirements

• Virtutech Simics 2.0.x or 2.2.x
– Personal academic licenses available
– http://www.virtutech.com

• Host Machine
– x86 (32 or 64-bit) Linux or Sparc/Solaris host machine
– > 1 GB Memory

• Workload Checkpoints YOU Create
– License issues w/ releasing checkpoints

Page 5: GEMS Tutorial ISCA05

GEMS From 50,000 Feet

[Figure: GEMS components: Simics; Opal (detailed processor model); Random Tester; Microbenchmarks (deterministic, contended locks, trace file)]

Page 6: GEMS Tutorial ISCA05

GEMS From 50,000 Feet

[Figure: GEMS components: Simics; Opal (detailed processor model); Random Tester; Microbenchmarks (deterministic, contended locks, trace file)]

Full-System Functional Simulator

• Boots unmodified Solaris 9
• BUT, each instruction 1-cycle
• www.virtutech.com

Page 7: GEMS Tutorial ISCA05

GEMS From 50,000 Feet

[Figure: GEMS components: Simics; Opal (detailed processor model); Random Tester; Microbenchmarks (deterministic, contended locks, trace file)]

Memory System Model

• Flexible multiprocessor memory hierarchy
• Includes domain-specific language

Page 8: GEMS Tutorial ISCA05

GEMS From 50,000 Feet

[Figure: GEMS components: Simics; Opal (detailed processor model); Random Tester; Microbenchmarks (deterministic, contended locks, trace file)]

OoO Processor Model

• Implements partial SPARC v9 ISA
• Modeled after MIPS R10000

Page 9: GEMS Tutorial ISCA05

GEMS From 50,000 Feet

[Figure: GEMS components: Simics; Opal (detailed processor model); Random Tester; Microbenchmarks (deterministic, contended locks, trace file)]

Other Drivers

• Testing independent of Simics
• Microbenchmarks

Page 10: GEMS Tutorial ISCA05


Outline

• Introduction and Motivation

• Demo: Simulating a Multiple-CMP System with GEMS

• Ruby: Memory system model

• BREAK

• Opal: Out-of-order processor model

• Demo: Two gems are better than one

• GEMS Source Code Tour and Extending Ruby

• Building Workloads

Page 11: GEMS Tutorial ISCA05

Full-System Simulation with GEMS

• Steps:
– Choosing a Ruby protocol
– Building Ruby and Opal
– Starting and configuring Simics
– Loading and configuring Ruby
– Loading and configuring Opal
– Running simulation
– Getting results

Demo

Page 12: GEMS Tutorial ISCA05

Choosing the Ruby System/Protocol

• Included with GEMS release v1.1

– CMP protocols
  • MOESI_CMP_token: M-CMP token coherence
  • MSI_MOSI_CMP_directory: 2-level Directory
  • MOESI_CMP_directory: higher performing 2-level Directory

– SMP protocols
  • MOSI_SMP_bcast: snooping on ordered interconnect
  • MOSI_SMP_directory
  • MOSI_SMP_hammer: based on AMD Hammer
  • And more

Demo

Page 13: GEMS Tutorial ISCA05

Building Ruby and Opal

• Ruby module

cd $GEMS_ROOT/ruby

– Set compile-time defaults

vi config/rubyconfig.defaults

– Build module, choosing protocol and destination dir

make PROTOCOL=MOESI_CMP_token DESTINATION=MOESI_CMP_token

– SLICC runs, generates HTML and additional C++ files
– Ruby module built and moved to $GEMS_ROOT/simics/home/MOESI_CMP_token

• Build Opal

cd $GEMS_ROOT/opal
make module DESTINATION=MOESI_CMP_token

Demo

Page 14: GEMS Tutorial ISCA05

Starting Simics

• Start non-GUI Simics

maya(9)% cd $GEMS_ROOT/simics/home/MOESI_CMP_token/
maya(10)% ./simics
Checking out a license... done: academic license.
Looking for additional Simics modules in ./modules
+----------------+ Copyright 1998-2004 by Virtutech, All Rights Reserved
| Virtutech      | Version: simics-2.0.23
| Simics         | Compiled: Thu Oct 14 20:27:36 CEST 2004
+----------------+ www.simics.com
"Virtutech" and "Simics" are trademarks of Virtutech AB

Type 'copyright' for details on copyright.
Type 'license' for details on warranty, copying, etc.
Type 'readme' for further information about this version.
Type 'help help' for info on the on-line documentation.

simics>

Demo

Page 15: GEMS Tutorial ISCA05

Checkpoint and Configuration

• Checkpoints should be created first
– Simics-only process

simics> read-configuration ../../checkpoints-u3/jbb/jbb-16p.check

– SpecJBB checkpoint loaded

• Load python scripts

simics> @sys.path.append("../../../gen-scripts")
simics> @import mfacet

• Configure Simics

simics> istc-disable
Turning I-STC off and flushing old data
simics> dstc-disable
Turning D-STC off and flushing old data
simics> instruction-fetch-mode instruction-fetch-trace
simics> magic-break-enable

Demo

Page 16: GEMS Tutorial ISCA05

Load and Configure Ruby

Load module

simics> load-module ruby

Setting # processors is required

simics> ruby0.setparam g_NUM_PROCESSORS 16

Create an M-CMP system (4 chips, 4 procs/chip)

simics> ruby0.setparam g_PROCS_PER_CHIP 4

Override compile-time defaults

simics> ruby0.setparam g_NUM_L2_BANKS 32
simics> ruby0.setparam L2_CACHE_ASSOC 4
simics> ruby0.setparam L2_CACHE_NUM_SETS_BITS 16
simics> ruby0.setparam NETWORK_LINK_LATENCY 50

Initialize

simics> ruby0.init
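The parameters on this slide imply a system shape that is easy to check by hand. The sketch below only does that arithmetic; `derive_system_shape` and the 64-byte block size are assumptions made for illustration and are not part of GEMS, and the slide does not say whether the L2 geometry applies per bank or to the whole cache.

```python
def derive_system_shape(params, block_bytes=64):
    """Derive chip count and L2 geometry from Ruby-style parameters.

    block_bytes is an assumed cache block size; the slide does not state it.
    """
    procs = params["g_NUM_PROCESSORS"]
    per_chip = params["g_PROCS_PER_CHIP"]
    if procs % per_chip != 0:
        raise ValueError("processor count must be a multiple of procs per chip")
    # A "sets bits" parameter is the log2 of the number of sets.
    sets = 1 << params["L2_CACHE_NUM_SETS_BITS"]
    return {
        "chips": procs // per_chip,
        "l2_sets": sets,
        "l2_bytes": sets * params["L2_CACHE_ASSOC"] * block_bytes,
    }


params = {
    "g_NUM_PROCESSORS": 16,
    "g_PROCS_PER_CHIP": 4,
    "L2_CACHE_ASSOC": 4,
    "L2_CACHE_NUM_SETS_BITS": 16,
}
shape = derive_system_shape(params)
print(shape)
```

With the values from the slide this gives 4 chips, which matches the "4 chips, 4 procs/chip" comment above.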

Demo

Page 17: GEMS Tutorial ISCA05

Optionally Load and Configure Opal

Load module

simics> load-module opal

Initialize default processor

simics> opal0.init
simics> opal0.listparam

Start opal (but do not start simulating)

simics> opal0.sim-start "output.opal"

Demo

Page 18: GEMS Tutorial ISCA05

Running simulation

• Set up transaction-based simulation
– “magic breakpoints”
– Five JBB transactions

simics> @mfacet.setup_run_for_n_transactions(5,1)

• Start simulating
– Ruby only (Simics drives Ruby):

simics> c

– Opal is loaded (Opal steps Simics):

simics> opal0.sim-step 9999999999

Demo

Page 19: GEMS Tutorial ISCA05


Dumping Some Output

• Opal stats

simics> opal0.stats

• Ruby stats

simics> ruby0.dump-stats ruby.stats

• Ruby short stats

simics> ruby0.dump-short-stats

– Ruby_cycles is a good runtime metric

Demo

Page 20: GEMS Tutorial ISCA05

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
– Overview (Drivers & Memory System)
– Event-driven simulation
– Interconnection network
– SLICC: Specifying the logic of the system
– Simple example: SMP MI protocol
– Limitations
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Page 21: GEMS Tutorial ISCA05

High-Level Infrastructure Map

[Figure: two groups, Drivers and Memory System. Internal drivers (Ruby testers): Random Tester and Microbenchmarks (deterministic, contended locks, trace file). External drivers (CPU models): Simics and Opal (detailed processor model).]

Page 22: GEMS Tutorial ISCA05

Ruby Driver: Random Tester

• “Verifying a Multiprocessor Cache Controller Using Random Test Generation” [Wood et al. 90]
• Purpose: Excite cache coherency bugs
• Competing actions performed then checked
• Utilizes false sharing
– Multiple writers - action
– Single read - check
• Randomly inserted delay
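The action/check pattern on this slide can be sketched in a few lines. This is a toy under stated assumptions, not the GEMS tester: `ToyCoherentMemory` and `random_test` are hypothetical names, the memory is a plain byte array standing in for the cache hierarchy, and the randomly inserted delays are left out.

```python
import random


class ToyCoherentMemory:
    """Stand-in for a memory system; a real tester would drive Ruby instead."""

    def __init__(self, size):
        self._bytes = bytearray(size)

    def store(self, proc, addr, value):
        self._bytes[addr] = value

    def load(self, proc, addr):
        return self._bytes[addr]


def random_test(memory, num_procs, block_addr, rounds, seed=0):
    """Return the number of check failures over `rounds` action/check pairs."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(rounds):
        # Action: every processor writes its own byte of one block,
        # in random order, so the writers falsely share the block.
        expected = {}
        writers = list(range(num_procs))
        rng.shuffle(writers)
        for proc in writers:
            value = rng.randrange(256)
            memory.store(proc, block_addr + proc, value)
            expected[proc] = value
        # Check: a single reader must observe every writer's byte.
        reader = rng.randrange(num_procs)
        for proc, value in expected.items():
            if memory.load(reader, block_addr + proc) != value:
                failures += 1
    return failures


failures = random_test(ToyCoherentMemory(64), num_procs=16, block_addr=0, rounds=100)
print("check failures:", failures)
```

A protocol bug that lost or reordered a write would show up as a non-zero failure count; the toy memory here is trivially correct, so the count stays at zero.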

Page 23: GEMS Tutorial ISCA05

Ruby Driver: Microbenchmarks

• Deterministic tester
– Simple sequence of requests
– Sanity checking and performance tuning
– DeterministicDriver.C
  • GETX, SeriesGETS, Inv
• Contended locks
– Compare and swap atomic op.
– RequestGenerator.C / SyntheticDriver.C
• Trace file
– Issues requests one at a time
– Similar to cache warmup mechanism
– ‘-z <trace_file.gz>’

Page 24: GEMS Tutorial ISCA05

Ruby Driver: In-order Processor Model

• Simics blocking interface (in-order processor)
– Single issue, non-pipelined processor
– Only one outstanding request per CPU
• SIMICS_RUBY_MULTIPLIER > 1
– Estimates a higher performance processor
– Multiple simics processor cycles == one ruby cycle
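The multiplier is a simple ratio between the two clocks. As a rough sketch of the arithmetic (the function name and the example latency of 80 Ruby cycles are assumptions, not GEMS code or measured values):

```python
def ruby_to_simics_cycles(ruby_cycles, simics_ruby_multiplier):
    """Convert a latency in Ruby cycles into Simics processor cycles."""
    if simics_ruby_multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    # Several Simics processor cycles elapse during one Ruby cycle.
    return ruby_cycles * simics_ruby_multiplier


stall = ruby_to_simics_cycles(80, 4)
print(stall)
```

With a multiplier of 4, a miss that takes 80 Ruby cycles stalls the in-order processor for 320 of its own cycles, which is how a larger multiplier approximates a faster core relative to memory.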

Page 25: GEMS Tutorial ISCA05

Ruby Driver: In-order Processor Model

• Implements Simics’ mh_memorytracer_possible_cache_miss()
• “Callback” Simics with SIM_stall_cycle(proc_ptr, 0)

[Figure: SIMICS box with the Simics in-order processor model (P0, P1, P2, P3, each with stall()/unstall()) and the Simics time queue; an arrow labeled "instructions"; the Ruby Memory System Model]

Page 26: GEMS Tutorial ISCA05

Ruby Driver: Out-of-order Processor Model

• Opal (out-of-order processor)
– Super-scalar pipelined processor
– Multiple outstanding requests per CPU
• OPAL_RUBY_MULTIPLIER > 1
– Faster processor core frequency than memory
– Simulation execution optimization

What are they driving?

Page 27: GEMS Tutorial ISCA05

Ruby Multiprocessor Memory System

• Physical Components
– Caches
– Memory
– System Interconnect
• Determines the timing of memory requests
– Driver issues memory request to Ruby
– Ruby simulates the requests
– Ruby eventually calls back the driver with the latency
• Ruby’s purpose: Return memory latency


Page 29: GEMS Tutorial ISCA05

Discrete Event-driven Simulation

• Discrete event-driven simulation
– Events change system state
– Series of scheduled events
• Global EventQueue
– Heart of Ruby
– Priority heap of event/time pairs
  • Not a true queue - not in FIFO order
  • Self-sorting queue
– Events scheduled for the same cycle occur in arbitrary order
– All events must be scheduled at least one unit of time ahead

GlobalEventQueue

Event | Time
*Event G | 7
*Event B | 5
*Event J | 3
*Event S | 3
*Event A | 4
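A priority heap of event/time pairs can be sketched with Python's standard `heapq` module. This is a minimal illustration of the idea rather than Ruby's EventQueue: the class and method names are assumptions, and the insertion counter used to break ties is just one valid choice, since the slide only says that same-cycle events fire in arbitrary order.

```python
import heapq
import itertools


class EventQueue:
    """Toy global event queue: a min-heap ordered by wakeup time."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps heap entries comparable on ties
        self.now = 0

    def schedule(self, consumer, delay):
        if delay < 1:
            raise ValueError("events must be at least one time unit away")
        heapq.heappush(self._heap, (self.now + delay, next(self._tie), consumer))

    def run(self):
        """Pop events in time order and return the (time, consumer) sequence."""
        fired = []
        while self._heap:
            self.now, _, consumer = heapq.heappop(self._heap)
            fired.append((self.now, consumer))
        return fired


queue = EventQueue()
# Same events and times as the table on the slide.
for name, time in [("G", 7), ("B", 5), ("J", 3), ("S", 3), ("A", 4)]:
    queue.schedule(name, time)
fired = queue.run()
print(fired)
```

The events come out sorted by time even though they were inserted out of order, which is what "self-sorting, not FIFO" means here.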

Page 30: GEMS Tutorial ISCA05

Events and Consumers

• Event = Consumer Wakeup
– Consumer determines event type
– Consumer changes system state
• Typical event
– Consumer wakes up to observe its input ports
– Consumer acts upon the incoming message(s)
  • Change system state
  • Enqueue outgoing messages
– Consumer pops the incoming message(s)
– Consumer schedules the consumers of the outgoing message(s)

[Figure: a Consumer with an Input Port and two Output Ports, each Output Port feeding another Consumer]

Page 31: GEMS Tutorial ISCA05

Events and Consumers

• Stalled event
– Consumer wakes up to observe its input ports
– Consumer encounters a stall
– Consumer schedules itself again
  • Doesn’t pop incoming queue

[Figure: a Consumer with an Input Port and two Output Ports, each Output Port feeding another Consumer]
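The difference between a typical wakeup and a stalled one is whether the incoming message gets popped. The sketch below shows that under simplifying assumptions: `Consumer`, `busy_until`, and `run` are hypothetical names, there is a single consumer, and the stall condition is just "a resource is busy until some cycle".

```python
import heapq
from collections import deque


class Consumer:
    def __init__(self, busy_until):
        self.in_port = deque()
        self.busy_until = busy_until  # hypothetical stall condition
        self.handled = []             # (cycle, message) pairs

    def wakeup(self, now, schedule):
        if not self.in_port:
            return
        if now < self.busy_until:
            # Stalled: leave the message in the queue and retry next cycle.
            schedule(now + 1)
            return
        # Typical event: act on the message, then pop it.
        message = self.in_port.popleft()
        self.handled.append((now, message))
        if self.in_port:
            schedule(now + 1)


def run(consumer, first_wakeup):
    """Drive one consumer from a tiny time-ordered wakeup list."""
    pending = [first_wakeup]
    while pending:
        now = heapq.heappop(pending)
        consumer.wakeup(now, lambda t: heapq.heappush(pending, t))


consumer = Consumer(busy_until=5)
consumer.in_port.extend(["GETX", "DATA"])
run(consumer, first_wakeup=3)
print(consumer.handled)
```

The wakeups at cycles 3 and 4 stall without touching the queue, so both messages are still handled in order once the stall clears at cycle 5.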


Page 33: GEMS Tutorial ISCA05

Interconnection Network

• A single flexible infrastructure
– Point-to-point links and switches: Consumers
– Both intra-chip and inter-chip networks
• Dynamic network creation
– Routing tables created at runtime
– Utilizes input parameters
• Two ways to generate topologies
1. Auto-generated
– Intra-chip network: Single on-chip switch
– Inter-chip network: 4 included (next slide)
2. Customized
– TopologyType_FILE_SPECIFIED
– Adjust individual link latency and bandwidth
– Specify one link per line

[Figure labels: Link (Throttle.C), Switch (PerfectSwitch.C)]

Page 34: GEMS Tutorial ISCA05


Auto-generated Inter-chip Network Topologies

TopologyType_TORUS_2D

TopologyType_CROSSBAR

TopologyType_HIERARCHICAL_SWITCH

TopologyType_PT_TO_PT

Page 35: GEMS Tutorial ISCA05

Network Characteristics

• Link latency
1. Auto-generated
– ON_CHIP_LINK_LATENCY
– NETWORK_LINK_LATENCY
2. Customized
– ‘link_latency:’

• Link bandwidth
– Bandwidth specified in thousandths of a byte
1. Auto-generated
– On-chip = 10 x g_endpoint_bandwidth
– Off-chip = g_endpoint_bandwidth
2. Customized
– Individual link bandwidth = ‘bw_multiplier:’ x g_endpoint_bandwidth

• Buffer size
1. Infinite by default
2. Customized network supports finite buffering
  • Prevent 2D-mesh network deadlock through e-cube restrictive routing
  • ‘link_weight’

1. Perfect switch bandwidth
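The bandwidth rules above reduce to one multiplication. The following sketch spells that out; `link_bandwidth` is a hypothetical helper, and the `g_endpoint_bandwidth` value of 10000 is an assumed example rather than a GEMS default.

```python
def link_bandwidth(endpoint_bandwidth, on_chip=False, bw_multiplier=None):
    """Return link bandwidth in thousandths of a byte.

    bw_multiplier models the per-link setting of a customized network;
    without it the auto-generated rule (10x on-chip, 1x off-chip) applies.
    """
    if bw_multiplier is not None:
        return bw_multiplier * endpoint_bandwidth
    return (10 if on_chip else 1) * endpoint_bandwidth


g_endpoint_bandwidth = 10000  # assumed example value

off_chip = link_bandwidth(g_endpoint_bandwidth)
on_chip = link_bandwidth(g_endpoint_bandwidth, on_chip=True)
custom = link_bandwidth(g_endpoint_bandwidth, bw_multiplier=64)

for label, value in [("off-chip", off_chip), ("on-chip", on_chip), ("custom", custom)]:
    print(label, value, "thousandths =", value / 1000, "bytes")
```

Dividing by 1000 converts the configured unit back to whole bytes, which makes it easier to sanity-check a topology file against the link widths you intend to model.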


Page 37: GEMS Tutorial ISCA05

Specification Language for Implementing Cache Coherence (SLICC)

• Domain-specific language
– Designed to specify state machines for cache coherence
– Syntactically similar to C/C++/Java
– Constrained to hardware-like structures (e.g. no loops)
– Generates C++ tightly coupled to Ruby
• Two purposes
1. Specify system coherence
– Per-memory-block State Machines
– I.e. cache and memory controller logic
2. Glue components together
– Caches with transaction buffers
– Network ports with controllers

[Figure: a SLICC State Machine between Network In-ports and Network Out-ports]

Page 38: GEMS Tutorial ISCA05

System Flexibility via SLICC

• Substantial portion of Ruby code generated
– In combination with dynamic network creation
– Permits a tremendously flexible simulation infrastructure
• protocols/<protocol_name>.slicc
– Indicates the SLICC files needed by the protocol
– Specifies the necessary generated objects
  • Controller state machines
  • Network messages
    – Snooping protocol: requests and response messages
    – Directory protocol: requests, forwarded requests, and responses
– Allocates only C++ objects needed by the particular protocol
  • Ex. Shadow tags for an exclusive two-level cache
  • Ex. Persistent Request Table for Token coherence

Page 39: GEMS Tutorial ISCA05

Inside a SLICC State Machine

• Network buffers
– Outgoing and incoming ports
• States
– Base and transient states
• Events
– Internal events that cause state transitions
• Ruby Structures
– Caches, transaction buffers… etc.
• Trigger events
– Incoming messages trigger internal events
• Actions
– Operations performed on structures
• Transitions
– Cross-product of possible states and events
– Performs atomic sequence of actions

[Figure: layout of <controller_name>.sm: network ports, states, events, ruby structures, trigger events, actions, transitions]


Page 41: GEMS Tutorial ISCA05

Creating a protocol with SLICC

• MI-example protocol
– Simple, SMP directory protocol
– Cache and directory/memory controller
– Assume ordered interconnect (for simplicity)

Demo

[Figure: caches ($) and directories (dir) attached to the Ruby interconnect; state diagram with states M and I and edges labeled GETS/GETX and Fwd]

Page 42: GEMS Tutorial ISCA05

MI Cache Controller – States and Events

// STATES
enumeration(State, desc="Cache states") {
  // stable states
  I, desc="Not Present/Invalid";
  M, desc="Modified";

  // transient states
  MI, desc="Modified, issued PUT";
  II, desc="Not Present/Invalid, issued PUT";
  IS, desc="Issued request for IFETCH/GETX";
  IM, desc="Issued request for STORE/ATOMIC";
}

// EVENTS
enumeration(Event, desc="Cache events") {
  // from processor
  Load, desc="Load request from processor";
  Ifetch, desc="Ifetch request from processor";
  Store, desc="Store request from processor";

  Data, desc="Data from network";
  Fwd_GETX, desc="Forward from network";

  Replacement, desc="Replace a block";
  Writeback_Ack, desc="Ack from the directory for a writeback";
  Writeback_Nack, desc="Nack from the directory for a writeback";
}

Demo

Page 43: GEMS Tutorial ISCA05

MI Cache Controller – Network Ports

// NETWORK BUFFERS
MessageBuffer requestFromCache, network="To", virtual_network="0", ordered="true";
MessageBuffer responseFromCache, network="To", virtual_network="1", ordered="true";
MessageBuffer forwardToCache, network="From", virtual_network="2", ordered="true";
MessageBuffer responseToCache, network="From", virtual_network="1", ordered="true";

// NETWORK PORTS
out_port(requestNetwork_out, RequestMsg, requestFromCache);
out_port(responseNetwork_out, ResponseMsg, responseFromCache);

in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
  if (forwardRequestNetwork_in.isReady()) {
    peek(forwardRequestNetwork_in, RequestMsg) {
      if (in_msg.Type == CoherenceRequestType:GETX) {
        trigger(Event:Fwd_GETX, in_msg.Address);
      } else if (in_msg.Type == CoherenceRequestType:WB_ACK) {
        trigger(Event:Writeback_Ack, in_msg.Address);
      } else {
        error("Unexpected message");
      }
    }
  }
}

Demo

Page 44: GEMS Tutorial ISCA05

MI Cache Controller – Structures

// CacheEntry
structure(Entry, desc="...", interface="AbstractCacheEntry") {
  State CacheState, desc="cache state";
  bool Dirty, desc="Is the data dirty (different than memory)?";
  DataBlock DataBlk, desc="data for the block";
}

external_type(CacheMemory) {
  bool cacheAvail(Address);
  Address cacheProbe(Address);
  void allocate(Address);
  void deallocate(Address);
  Entry lookup(Address);
  void changePermission(Address, AccessPermission);
  bool isTagPresent(Address);
}

CacheMemory cacheMemory, template_hack="<L1Cache_Entry>",
  constructor_hack='L1_CACHE_NUM_SETS_BITS, L1_CACHE_ASSOC, MachineType_L1Cache, int_to_string(i)+"_L1"',
  abstract_chip_ptr="true";

Demo

Page 45: GEMS Tutorial ISCA05

MI Cache Controller – “Mandatory Queue”

// Mandatory Queue
in_port(mandatoryQueue_in, CacheMsg, mandatoryQueue, desc="...") {
  if (mandatoryQueue_in.isReady()) {
    peek(mandatoryQueue_in, CacheMsg) {
      if (cacheMemory.isTagPresent(in_msg.Address) == false &&
          cacheMemory.cacheAvail(in_msg.Address) == false) {
        // make room for the block
        trigger(Event:Replacement, cacheMemory.cacheProbe(in_msg.Address));
      } else {
        trigger(mandatory_request_type_to_event(in_msg.Type), in_msg.Address);
      }
    }
  }
}

Demo

Page 46: GEMS Tutorial ISCA05

MI Cache Controller – Transitions

transition(I, Store, IM) {
  v_allocateTBE;
  i_allocateL1CacheBlock;
  a_issueRequest;
  m_popMandatoryQueue;
}

transition(IM, Data, M) {
  u_writeDataToCache;
  s_store_hit;
  w_deallocateTBE;
  n_popResponseQueue;
}

transition(M, Fwd_GETX, I) {
  e_sendData;
  o_popForwardedRequestQueue;
}

transition(M, Replacement, MI) {
  v_allocateTBE;
  b_issuePUT;
  x_copyDataFromCacheToTBE;
  h_deallocateL1CacheBlock;
}

Atomic sequence of actions
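One way to read these transitions is as a lookup table keyed by (state, event). The sketch below encodes only the four transitions shown on this slide; it is an illustration of the table idea, not the C++ that SLICC generates, and `TRANSITIONS` and `step` are hypothetical names.

```python
# (current state, event) -> (next state, actions performed atomically)
TRANSITIONS = {
    ("I", "Store"): ("IM", ["v_allocateTBE", "i_allocateL1CacheBlock",
                            "a_issueRequest", "m_popMandatoryQueue"]),
    ("IM", "Data"): ("M", ["u_writeDataToCache", "s_store_hit",
                           "w_deallocateTBE", "n_popResponseQueue"]),
    ("M", "Fwd_GETX"): ("I", ["e_sendData", "o_popForwardedRequestQueue"]),
    ("M", "Replacement"): ("MI", ["v_allocateTBE", "b_issuePUT",
                                  "x_copyDataFromCacheToTBE",
                                  "h_deallocateL1CacheBlock"]),
}


def step(state, event):
    """Return (next_state, actions) or raise if the pair is not defined."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for state {state} on event {event}") from None


state = "I"
trace = []
for event in ["Store", "Data", "Fwd_GETX"]:
    state, actions = step(state, event)
    trace.append(state)
    print(event, "->", state, actions)
```

A store miss walks a block from I through the transient IM state to M, and a forwarded GETX from another cache returns it to I. Any (state, event) pair missing from the table raises an error.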

Demo

Page 47: GEMS Tutorial ISCA05

MI Cache Controller – Actions

action(a_issueRequest, "a", desc="Issue a request") {
  enqueue(requestNetwork_out, RequestMsg, latency="ISSUE_LATENCY") {
    out_msg.Address := address;
    out_msg.Type := CoherenceRequestType:GETX;
    out_msg.Requestor := machineID;
    out_msg.Destination.add(map_Address_to_Directory(address));
    out_msg.MessageSize := MessageSizeType:Control;
  }
}

action(e_sendData, "e", desc="Send data from cache to requestor") {
  peek(forwardRequestNetwork_in, RequestMsg) {
    enqueue(responseNetwork_out, ResponseMsg, latency="CACHE_RESPONSE_LATENCY") {
      out_msg.Address := address;
      out_msg.Type := CoherenceResponseType:DATA;
      out_msg.Sender := machineID;
      out_msg.Destination.add(in_msg.Requestor);
      out_msg.DataBlk := cacheMemory[address].DataBlk;
      out_msg.MessageSize := MessageSizeType:Response_Data;
    }
  }
}

Demo

Page 48: GEMS Tutorial ISCA05

SLICC-generated HTML tables

Demo

• http://www.cs.wisc.edu/gems/MI_example_html/

Page 49: GEMS Tutorial ISCA05

Testing MI_example

Demo

Build Protocol

cd $GEMS_ROOT/ruby
make PROTOCOL=MI_example

Random test
– stresses protocol with simultaneous false-sharing requests
– 16 processors (-p), 10000 requests (-l)

./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -l 10000

Deterministic test with transition trace
– use a trace, requests handled one at a time
– input trace (-z), compressed or non-compressed
– transition debug (-s) starting at cycle 1

./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -z ruby.trace.gz -s 1

Page 50: GEMS Tutorial ISCA05

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
– Overview
– Pipeline
– Example: Load instruction
– Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Page 51: GEMS Tutorial ISCA05

Overview

• What is OPAL?
– Out-of-Order SPARC processor simulator
  • (modeled after MIPS R10K)
– Uses Timing-First design
– Realized as a Simics module, like RUBY
– Does NOT use Simics’ MAI interface
• Goal of this section
– Starting point for hacking Opal
• Learning approaches
– Code review / summarization (using Control Flow Graphs)
– Example: a load instruction
– Analogies to SimpleScalar… pay attention to the differences

Page 52: GEMS Tutorial ISCA05

Ruby Driver: In-order Processor Model

• Implements Simics’ mh_memorytracer_possible_cache_miss()
• “Callback” Simics with SIM_stall_cycle(proc_ptr, 0)

[Figure: SIMICS box with the Simics in-order processor model (P0, P1, P2, P3, each with stall()/unstall()) and the Simics time queue; an arrow labeled "instructions"; the Ruby Memory System Model]

Page 53: GEMS Tutorial ISCA05

Preview: OPAL & Simics

• Use opal’s opal0.sim-step command

[Figure: Opal pipeline for P0 (fetch, decode, schedule/execute, retire, check) interacting with Simics (Phy_mem, instruction step) and Ruby (IFETCH hit, LOAD hit); steps numbered 1-8]

Page 54: GEMS Tutorial ISCA05

Timing-First Simulation [Mauer Sigmetrics 02]

• Timing Simulator (Opal)
– functional execution of user/supervisor operations
– speculative, OoO multiprocessor timing simulation
– does NOT implement full ISA or any devices
• Functional Simulator (Simics)
– full-system multiprocessor simulation
– does NOT model detailed micro-architectural timing

KEY: Reload state if Opal state != Simics state
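The reload rule can be illustrated with two tiny interpreters. In this sketch the "timing" side deliberately lacks a `mul` instruction to stand in for the parts of the ISA Opal does not implement; the instruction format, register names, and function names are all assumptions made for the example, and no timing is modeled.

```python
def functional_step(regs, instr):
    """Complete reference model (plays the role of Simics)."""
    op, dst, a, b = instr
    out = dict(regs)
    if op == "add":
        out[dst] = regs[a] + regs[b]
    elif op == "mul":
        out[dst] = regs[a] * regs[b]
    return out


def timing_step(regs, instr):
    """Partial model (plays the role of Opal): 'mul' is not implemented."""
    op, dst, a, b = instr
    out = dict(regs)
    if op == "add":
        out[dst] = regs[a] + regs[b]
    return out


def run_timing_first(program, regs):
    functional, timing, deviations = dict(regs), dict(regs), 0
    for instr in program:
        timing = timing_step(timing, instr)
        functional = functional_step(functional, instr)
        if timing != functional:
            # KEY: reload the timing model's state from the functional model.
            timing = dict(functional)
            deviations += 1
    return timing, deviations


program = [("add", "r3", "r1", "r2"), ("mul", "r4", "r3", "r3"), ("add", "r5", "r4", "r1")]
final, deviations = run_timing_first(program, {"r1": 2, "r2": 3, "r3": 0, "r4": 0, "r5": 0})
print(final, deviations)
```

The unimplemented `mul` causes one deviation, and the reload lets the following `add` still compute the right value, so the timing model never has to be functionally complete.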

Page 55: GEMS Tutorial ISCA05


Measured Deviations

• Less than 20 deviations per 100,000 instructions (0.02%)

Worst case performance error: 2.4% (assuming deviation latency is pipeline flush)

additional timing slides

Page 56: GEMS Tutorial ISCA05

Opal and UltraSparc

• Functionally simulates 103 of 183 UltraSparc ISA instructions (99.99% of all dynamic instrs in workloads)
• Sample of unimplemented instrs: ARRAY, FEXPAND, FPADD, RDSOFTINT, EDGE, FMUL8x16, FPMERGE, RDSTICK, SHUTDOWN, SIAM, SIR, WRSOFTINT, WRSTICK
• Does not functionally simulate devices or any I/O instructions
– SCSI controllers and disks
– PCI and SBUS interfaces
– interrupt and DMA controllers
– temperature sensors

Correctness type | % error
Functional | 0
Performance | 2.4 (worst case)

Page 57: GEMS Tutorial ISCA05

Simulation Control (system.[C h])

system_t::simulate(int instrs)

[Flowchart, reconstructed from the slide labels]
1. Disable all simics procs
2. Simulated enough instrs? Yes: return. No: continue. (For MP sims: P0’s instrs counted here)
3. Forall seq->advanceCycle() (pipeline is modeled here)
4. ruby->advanceTime()
5. global_cycle++, then back to step 2
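The control loop on this slide can be paraphrased as follows. This is a loose sketch, not Opal's `system_t::simulate`: the `Sequencer` and `MemorySystem` classes are stand-ins, each sequencer retires a fixed number of instructions per cycle purely for illustration, and the step that disables the Simics processors is omitted.

```python
class Sequencer:
    """Stand-in for one processor's pipeline model."""

    def __init__(self, retire_per_cycle):
        self.retire_per_cycle = retire_per_cycle
        self.retired = 0

    def advance_cycle(self):
        self.retired += self.retire_per_cycle


class MemorySystem:
    """Stand-in for Ruby: only keeps its own notion of time."""

    def __init__(self):
        self.time = 0

    def advance_time(self):
        self.time += 1


def simulate(instrs, sequencers, ruby):
    global_cycle = 0
    # Progress is measured on processor 0 only, as the slide notes for MP sims.
    while sequencers[0].retired < instrs:
        for seq in sequencers:
            seq.advance_cycle()
        ruby.advance_time()
        global_cycle += 1
    return global_cycle


seqs = [Sequencer(2), Sequencer(1)]
ruby = MemorySystem()
cycles = simulate(10, seqs, ruby)
print(cycles, ruby.time, [s.retired for s in seqs])
```

Every processor and the memory system advance in lockstep each cycle, while the stopping condition looks at P0 alone, so other processors may have retired fewer (or more) instructions when the loop exits.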


Page 59: GEMS Tutorial ISCA05

What’s done in a cycle?

• SimpleScalar uses a reverse order, why?

pseq::advanceCycle()
– FetchInstructions()
– DecodeInstructions()
– ScheduleInstructions()
– RetireInstructions()
– Scheduler->execute()
– return

Uses separate queues (finitecycle.h) to record how many instructions are available for each stage.

The order is in fact not important here.
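A per-stage delay queue of instruction counts might look like the sketch below. It is written from the one-line description on the slide, not from `finitecycle.h`, so the class name, the methods, and the ring-of-counters layout are all assumptions.

```python
from collections import deque


class DelayQueue:
    """Counts of instructions that become available `stages` cycles after insertion."""

    def __init__(self, stages):
        if stages < 1:
            raise ValueError("a stage delay must be at least one cycle")
        self._slots = deque([0] * stages)

    def insert(self, count):
        # New instructions enter at the far end of the delay line.
        self._slots[-1] += count

    def advance(self):
        """Advance one cycle and return how many instructions exit the delay."""
        ready = self._slots.popleft()
        self._slots.append(0)
        return ready


fetch_to_decode = DelayQueue(stages=3)
released = []
for fetched in [4, 2, 0, 0]:   # instructions fetched in cycles 0..3
    fetch_to_decode.insert(fetched)
    released.append(fetch_to_decode.advance())
print(released)
```

Because a stage only sees what its queue has already released, work inserted in the current cycle cannot be consumed in the same cycle, which is one plausible reading of why the stage order within a cycle does not matter.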

Page 60: GEMS Tutorial ISCA05

Pipeline Model (pseq.[C h])

• Instructions stored/tracked in a RUU-like structure (iwindow.[C h])
• Flexible multi-stage pipeline
– Delay modeled with separate queues (finitecycle.h)
• Models fully-pipelined FUs
– Types: CONFIG_ALU_MAPPING
– Number: CONFIG_NUM_ALUS

[Figure: pipeline diagram with fetch (F), decode (D), scheduler (Sched), functional units (FU0, FU1), and retire (R). Depths: FETCH_STAGES, DECODE_STAGES, RETIRE_STAGES; FU depth determined by CONFIG_ALU_LATENCY. Widths: MAX_FETCH, MAX_DECODE, MAX_DISPATCH, MAX_EXECUTE, MAX_RETIRE.]

Page 61: GEMS Tutorial ISCA05

Instructions ({dynamic,statici,memop,controlop}.[C h])

[Figure: instruction classes. Dynamic instruction types: Control (controlop.[C h]), Memory (memop.[C h]), ALU (dynamic.[C h]); decoded instr (statici.[C h]). Field labels shown: Traps, Registers, Event Times, Seq #, Wait List ptr, Predicted Addr, Actual Addr, Taken/Not Taken, Virtual/Phys Addr, LSQ index.]


Page 63: GEMS Tutorial ISCA05

Fetch

[Flowchart, reconstructed from the slide labels]
– Is fetch ready? No: stall fetch
– Address translation; I-TLB miss? Yes: emit NOP / stall fetch
– Read instruction: pseq::getInstr()
– Invoke Ruby to simulate Ifetch timing
– Create dynamic instr (load_inst_t)

Page 64: GEMS Tutorial ISCA05

Decode

[Flowchart, reconstructed from the slide labels]
– Get load instr from instr window
– dynamic_inst_t::decode()
– Insert decoded load inst in decode queue
– Get current source operand mappings: arf::readDecodeMap() (regmap.[C h], arf.[C h])
– Rename dest reg: arf::allocateRegister() (regmap.[C h], arf.[C h])

Page 65: GEMS Tutorial ISCA05

Schedule

[Flowchart, reconstructed from the slide labels]
– Get load instr from instr window
– Exceeded scheduling window? Yes: stop scheduling
– TestSourceReadiness(); source not ready: WAIT_XX_STAGE
– Wakeup: source is ready
– All sources ready? (Yes / No); Scheduler->schedule()

Page 66: GEMS Tutorial ISCA05

Execute

[Flowchart, reconstructed from the slide labels]
– Read port avail? No: reschedule
– D-TLB address translate (memory_inst_t::addresstranslate())
– TLB miss? Yes: raise TLB miss exception
– Invoke Ruby to simulate load timing (rubycache_t::access())
– Cache miss? Yes: CACHE_MISS_STAGE
– Read value from Simics memory (pseq->readPhysicalMemory())
– pseq->complete()

Page 67: GEMS Tutorial ISCA05

Retire

[Flowchart, reconstructed from the slide labels]
– Get completed LD inst
– Traps? Yes: takeTrap() (set trap state, squash pipeline)
– No: step Simics (pseq->advanceSimics())
– checkCriticalstate(): (PC, NPC, regs)
– checkChangedState() (verify load value)
– On FAIL: FullSquash() (reload state & refetch from instr following LD)
– On Match: retire LD
– Memory Consistency (label on the slide)


Page 69: GEMS Tutorial ISCA05

Opal-Ruby Interface

[Figure: asynchronous Opal-Ruby interface for a load (LD), steps numbered 1-8]
OPAL: load_inst_t::Execute(), Complete(); rubycache_t: access(), complete(); pseq_t: completedRequest(); system_t: rubyCompletedRequest()
RUBY: OpalInterface: isReady(), makeRequest(), hitCallback()

Page 70: GEMS Tutorial ISCA05

Branch Prediction

pseq_t::createInstruction { … s_instr->nextPC() … }

dynamic_inst_t::nextPC_call(), nextPC_predicated_branch(), nextPC_predict_branch(), nextPC_indirect()

Branch predictor (fetch/{yags.[C h], …}): Predict(), Update()

Controlop_t::Execute() { (check prediction and flush if mispredict) }

Retire() { … Bpred->Update() … }

[Figure: arrows labeled Predict() and Update() connect these calls to the branch predictor]

Page 71: GEMS Tutorial ISCA05

Common Config Parameters

Processor Width: MAX_FETCH, MAX_DECODE, MAX_DISPATCH, MAX_EXECUTE, MAX_RETIRE

Pipeline Stages: FETCH_STAGES, DECODE_STAGES, RETIRE_STAGES

Register File Sizes: CONFIG_IREG_PHYSICAL (int), CONFIG_FPREG_PHYSICAL (fp), CONFIG_CCREG_PHYSICAL (cond code)

ROB Size: IWINDOW_ROB_SIZE

Scheduling Window Size: IWINDOW_WIN_SIZE

Page 72: GEMS Tutorial ISCA05

Opal : Present and Future

• Implements Sparc instructions
– Simulating additional Sparc instructions easy task
– Porting to x86 substantial code rewrite
• Simulates timing of weaker memory consistency models
– Add SC checks in Opal
– Add write buffers for weaker models (like TSO)
• No functional simulation of I/O
– Plug in disk simulator that interacts with Opal
• Not currently using MAI interface
– Possible to replace Opal w/ MAI module that interacts with Ruby
• Aggressive micro-architectural techniques not modeled
– Add support for trace caches, mem. dependence pred., etc.

Page 73: GEMS Tutorial ISCA05

Slide 73 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
  – Breakdown network stats
  – Example: Network contention with and without Opal
  – Simulation runtimes
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Page 74: GEMS Tutorial ISCA05

Slide 74 http://www.cs.wisc.edu/gems

Breaking Down Ruby Stats Files

• Ruby system config print
  – Values of all ruby config parameters
• Overall runtime
  – Target and host machine runtimes, IPC, etc.
• Cache profiling: L1I, L1D, L2…etc.
• Structure occupancy
  – Demand for cache ports, transaction buffers
• Latency breakdown
• Request vs. system state (optional)
• Message delay cycles (optional)
• Network stats
  – Link and switch utilization
• CC event / transition counts

[Figure: layout of a <system_config>.stats file, top to bottom: Ruby config, overall runtime, cache profiling, structure occupancy, latency breakdown, request vs. system state, message delay cycles, network stats, event/transition counts]

Demo

Page 75: GEMS Tutorial ISCA05

Slide 75 http://www.cs.wisc.edu/gems

Two GEMS are Better than One

• Network behavior with and without Opal
• 8 processor CMP
• SPLASH benchmark: ocean
• 8 byte-wide links between CPUs & L2 cache banks
• Two runs using a customized network
  1. Ruby only
     • Allows only one outstanding request per processor
     • Maximum 8 outstanding requests
     • Low network utilization
     • Little network contention
  2. Ruby & Opal
     • Allows multiple outstanding requests
     • Maximum 128 outstanding requests
     • Higher network utilization
     • Noticeable network contention

Demo

Page 76: GEMS Tutorial ISCA05

Slide 76 http://www.cs.wisc.edu/gems

Two GEMS are Better than One

Ruby Only

Demo

Message Delayed Cycles
----------------------
Total_delay_cycles: [binsize: 16 max: 553 count: 22892759 average: 0.534205 | standard deviation: 4.18656 | 22855760 20077 1945 325 309 175 105 3935 7681 518 338 254 397 273 166 130 142 33 41 25 26 29 15 10 10 2 0 2 0 4 4 10 7 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]

Network Stats
-------------
links_utilized_percent_switch_0_link_3: 4.38966 bw: 8000 base_latency: 1
links_utilized_percent_switch_0_link_4: 4.36838 bw: 8000 base_latency: 1

Ruby_cycles: 41361869

Page 77: GEMS Tutorial ISCA05

Slide 77 http://www.cs.wisc.edu/gems

Two GEMS are Better than One

Ruby & Opal

Demo

(Values in parentheses are the corresponding Ruby-only results.)

Message Delayed Cycles
----------------------
Total_delay_cycles: [binsize: 16 max: 703 count: 22893122 average: 1.35992 (0.534205) | standard deviation: 6.55126 | 22608266 220366 29575 9084 4686 3248 2009 1687 6018 1798 1143 828 625 516 384 272 271 288 398 319 299 228 203 161 92 51 41 26 12 9 30 39 48 43 25 20 3 0 0 1 0 2 4 4 0 0 0 0 0 0 ]

Network Stats
-------------
links_utilized_percent_switch_0_link_3: 7.81863 (4.38966) bw: 8000 base_latency: 1
links_utilized_percent_switch_0_link_4: 7.64388 (4.36838) bw: 8000 base_latency: 1

Ruby_cycles: 72550169 (41361869)
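To make the comparison between the two runs concrete, the following sketch computes with-Opal to Ruby-only ratios from the numbers printed on the two slides. The struct and function names are hypothetical; only the numeric values come from the stats output above.

```cpp
#include <cassert>

// Values copied from the two stats dumps; field names are illustrative only.
struct RunStatsSketch {
    double avg_delay_cycles;           // Total_delay_cycles average
    double link3_utilization_percent;  // links_utilized_percent_switch_0_link_3
    double ruby_cycles;                // Ruby_cycles
};

inline RunStatsSketch rubyOnlyRun()    { return {0.534205, 4.38966, 41361869.0}; }
inline RunStatsSketch rubyAndOpalRun() { return {1.35992, 7.81863, 72550169.0}; }

// Ratio of a with-Opal statistic to its Ruby-only counterpart.
inline double ratio(double with_opal, double ruby_only) {
    assert(ruby_only > 0.0);
    return with_opal / ruby_only;
}
```

With these inputs the average message delay grows by roughly 2.5x, link utilization by roughly 1.8x, and Ruby_cycles by roughly 1.75x when Opal is added.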

Page 78: GEMS Tutorial ISCA05

Slide 78 http://www.cs.wisc.edu/gems

Simulation Time Comparison

• Comparisons of Runtimes
  – Progressively add more simulation fidelity
    • Simics only
    • Simics + Ruby
    • Simics + Ruby + Opal
  – Accuracy vs. simulation time tradeoff
• Target Machine
  – 8 UltraSPARC™ III processor SMP (1 GHz)
  – 4 GB of memory
• Host Machine
  – AMD Opteron™ uniprocessor (2.2 GHz)
  – 4 GB of memory

Page 79: GEMS Tutorial ISCA05

Slide 79 http://www.cs.wisc.edu/gems

Simulation Slowdown

2000 JBB Transactions

Configuration          Time         Slowdown    Slowdown / CPU
Target                 20 ms        1           1
Simics                 1 minute     3000 x      380 x
Simics + Ruby          15 minutes   45000 x     5600 x
Simics + Ruby + Opal   45 minutes   140000 x    17000 x

CAVEAT: These performance numbers may not reflect the optimal configuration of Virtutech Simics. For example, running Simics in “fast mode” (or emulation-only mode) can reduce the slowdown (per CPU) of Simics, compared to real hardware, to less than 10x.
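The slowdown figures follow from simple arithmetic, sketched below. The sketch assumes that the per-CPU column divides the total slowdown by the 8 target processors and that the slide rounds to about two significant figures; both are inferences from the numbers, not statements from the slides. Function and constant names are hypothetical.

```cpp
#include <cassert>

constexpr double kTargetSeconds = 0.020;  // 20 ms on the real target machine
constexpr int kTargetCpus = 8;            // 8-processor target (assumed divisor)

// Total slowdown: simulated wall-clock time divided by real target time.
inline double slowdown(double sim_seconds) {
    return sim_seconds / kTargetSeconds;
}

// Slowdown per simulated CPU.
inline double slowdownPerCpu(double sim_seconds) {
    return slowdown(sim_seconds) / kTargetCpus;
}
```

For example, 1 minute gives 3000x and 375x per CPU (380x on the slide), 15 minutes gives 45000x and 5625x per CPU (5600x), and 45 minutes gives 135000x and 16875x per CPU (140000x and 17000x after rounding).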

Page 80: GEMS Tutorial ISCA05

Slide 80 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
  – GEMS software structure
  – Directory Tour
  – Demo: Extending Ruby and a CMP Protocol
• Building Workloads

Page 81: GEMS Tutorial ISCA05

http://www.cs.wisc.edu/gems

GEMS Software Structure

[Diagram: GEMS software structure. A legend distinguishes components with multiple instantiations from components with one instantiation.]

System
  Driver (common/Driver.h)
    Internal Drivers
      Deterministic Tester (tester/DeterministicDriver.h)
      Contended Locks (tester/SyntheticDriver.h)
      Random Tester (tester/Tester.h)
    Simics Interface (simics/SimicsInterface.h)
    Opal Interface (interface/OpalInterface.h)
  Chip (generated/<protocol>/Chip.h)
  Profiler (profiler/Profiler.h)
  Network (network/simple)
    Topology (network/simple/Topology.h)

Page 82: GEMS Tutorial ISCA05

http://www.cs.wisc.edu/gems

Ruby Software Structure

[Diagram: Ruby software structure. Each component is labeled as either hand-written Ruby code or SLICC-generated code.]

Chip (generated/<protocol>/Chip.h, SLICC)
  Sequencer (system/Sequencer.h, Ruby)
  Caches (system/CacheMemory.h, Ruby)
    Cache Line State (generated/<protocol>/L1Cache_Entry.h)
  Directory (system/DirectoryMemory.h, Ruby)
    Directory State (generated/<protocol>/Directory_Entry.h)
  Cache Controllers (SLICC: generated/<protocol>/L1Cache_Controller.h, generated/<protocol>/L2Cache_Controller.h)
  Directory Controller (SLICC: generated/<protocol>/Directory_Controller.h)
  Network Ports (buffer/MessageBuffer.h)

Page 83: GEMS Tutorial ISCA05

Slide 83 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
  – GEMS software structure
  – Directory Tour
  – Demo: Extending Ruby and a CMP Protocol
• Building Workloads

Page 84: GEMS Tutorial ISCA05

Slide 84 http://www.cs.wisc.edu/gems

Map of Directories: Top-Level

Top-Level Directory

ruby              Memory System Components
opal              Processor Components
slicc             Generator Code
protocols         Protocol Specification Files
common            Common GEMS C++ code
gen-scripts       Generated Simics Interface Scripts
scripts           Common GEMS scripts
microbenchmarks   Separate Microbenchmark Executables
results           Simulation Output
LICENSE, KNOWN_ISSUES, README

Page 85: GEMS Tutorial ISCA05

Slide 85 http://www.cs.wisc.edu/gems

Map of Directories: ruby

ruby

buffers            MessageBuffer between consumers
common             Common Ruby C++ structs
config             Ruby config files for module and tester
eventqueue         Global eventqueue
interfaces         Ruby → Opal & Simics
module             The ruby simics module
network            Simple network code
profiler           Profiling code
recorder           Cache and trace recorders
simics             Simics → Ruby
slicc_interface    Abstract classes that interface with different protocols
system             Physical memory components
tester             Random tester & microbenchmarks
platform           Object files & executables
generated          SLICC generated C++ files
html               Protocol tables
ruby.trace.gz      Example trace file
init.h/.C          Ruby initializer & destroyer
README.debugging   Ruby debug flag info
Makefile

Page 86: GEMS Tutorial ISCA05

Slide 86 http://www.cs.wisc.edu/gems

Map of Directories: ruby/system

ruby/system

CacheMemory.h               cache template data structure
DirectoryMemory.h/C         memory data structure
MachineID.h                 object that uniquely identifies all ruby machines
NodeID.h                    object that identifies a unique chip or machine instantiation
NodePersistentTable.h/C     specific to token protocol
PerfectCacheMemory.h        a fully associative, unbounded cache memory template
PersistentArbiter.h/C       specific to token protocol
PersistentTable.h/C         specific to token protocol
Sequencer.h/C               manages memory requests between the driver and L1 cache controller
StoreBuffer.h/C             used to simulate TSO-like timing
StoreCache.h/C              used to simulate TSO-like timing
System.h/C                  top-level object of the ruby memory system; all ruby objects can be accessed via g_system_ptr
TBETable.h/C                transaction buffer entry table used by cache controllers for transient requests
TimerTable.h                specific to token protocol

Page 87: GEMS Tutorial ISCA05

Slide 87 http://www.cs.wisc.edu/gems

Map of Directories: ruby/slicc_interface

ruby/slicc_interface

AbstractCacheEntry.h/C             ruby abstract class for the protocol specific cache entries
AbstractChip.h/C                   ruby abstract class for the protocol specific chip object
AbstractProtocol.h/C               contains booleans to define protocol characteristics to ruby
Message.h                          parent class of all messages; messages are communicated between consumers via MessageBuffers
NetworkMessage.h                   parent class of all network messages; each protocol implements unique network message objects to communicate between controllers
RubySlicc_ComponentMapping.h       all address manipulation to determine location and set mapping is here
RubySlicc_Profiler_interface.h/C   interface between the generated protocol logic and the ruby profiler code
RubySlicc_Util.h                   miscellaneous ruby functions used by the generated controllers
RubySlicc_includes.h               wrapper for the RubySlicc interface files

Page 88: GEMS Tutorial ISCA05

Slide 88 http://www.cs.wisc.edu/gems

Map of Directories: slicc

slicc

ast              Abstract Syntax Tree code
doc              contains some old but useful documentation
parser           contains the lexer and parser that construct a protocol’s AST
symbols          contains SLICC objects created during the first pass of the AST; majority of code generated by these symbols
generator        file, html and MIF generator code
platform         Object files & executables
generated        generated lexer and parser files
main.h/C         main function of the SLICC executable
slicc_global.h   defines typedef, namespaces, etc.
Makefile         Makefile for the SLICC code generator executable
README           Summary of how SLICC works

Page 89: GEMS Tutorial ISCA05

Slide 89 http://www.cs.wisc.edu/gems

Map of Directories: opal

opal

benchmark                   Micro-architecture benchmarks
bypassing                   Misc. proc structs
common                      Global Opal structs
config                      Module and tester config files
design                      Helpful informal design docs
fetch                       Predictors (branch, Trap, RAS)
module                      Code for Opal module
python                      Misc test and graphing scripts
regression                  Golden results for tester
sparc                       Implementation-specific defines
system                      Pipeline model
tester                      Opal tester files
trace                       Files for branch, memory traces
platform                    Object files & executables
generated                   Files for parsing config params
TODO                        Todo wish list
README                      Describes building & running Opal
README.memory_consistency   Opal handling of mem. consistency
Makefile

Page 90: GEMS Tutorial ISCA05

Slide 90 http://www.cs.wisc.edu/gems

Map of Directories: opal/system (1)

opal/system

actor.[C h]       General micro-arch. structure class
arf.[C h]         Register file interface
cache.[C h]       Opal’s built-in cache structures
chain.[C h]       Used to analyze mem dependencies
checkresult.h     Structs used in validation w/ Simics
config.include    Type defines for config params
controlop.[C h]   Branch instr type class
decode.[C h]      Per opcode stats collector class
dtlb.[C h]        TLB implementation for stand-alone sims
dx.[C h i]        Code for execution of dynamic instrs
dynamic.[C h]     Top-level class for all dynamic instrs
flatarf.[C h]     Non-renamed register file interface
flow.[C h]        CFG class
hfa.C             Opal-Simics interface
hfa_init.h        Opal-Simics interface externs
histogram.[C h]   Histogram stats class

Page 91: GEMS Tutorial ISCA05

Slide 91 http://www.cs.wisc.edu/gems

Map of Directories: opal/system (2)

opal/system

ipage.[C h]       Instruction page class
ipagemap.[C h]    Instruction page cache class
iwindow.[C h]     RUU-like struct for storing/tracking instrs
ix.[C h]          Code to execute CFG instructions
lockstat.[C h]    Stats on locks in system
lsq.[C h]         LSQ structure
memop.[C h]       Memory instr class
memstat.[C h]     Memory addr stats class
mf_api.h          Symlink to Opal-Ruby interface
mshr.[C h]        MSHR structure (used in Opal cache hierarchy only)
pipepool.[C h]    Wait-list object. Used to model MSHR when running w/ Ruby
pipestate.[C h]   Single waiter object for pipepool
pseq.[C h]        Top-level proc sequencer
pstate.[C h]      Functions used for API calls to Simics
ptrace.[C h]      Used for analyzing memory traces
regbox.[C h]      Contains interface ptrs to registers

Page 92: GEMS Tutorial ISCA05

Slide 92 http://www.cs.wisc.edu/gems

Map of Directories: opal/system (3)

opal/system

regfile.[C h]      Models the register file itself
regmap.[C h]       Rename map structure
rubycache.[C h]    Handles all Opal-Ruby memory transactions
scheduler.[C h]    Global event queue
simdist12.C        Dummy Simics functions for tester
sparx.C            Several includes
sstat.[C h]        Stats class for static insts
statici.[C h]      Decoded instr class
stopwatch.[C h]    Timer class, used to collect time stats
sysstat.[C h]      Stats class for dynamic insts
system.[C h]       Top-level class for manipulating sim
threadstat.[C h]   Stats class for tracking per-thread stats
wait.[C h]         Wait-list object for dynamic insts

Page 93: GEMS Tutorial ISCA05

Slide 93 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
  – GEMS software structure
  – Directory Tour
  – Demo: Extending Ruby and a CMP Protocol
• Building Workloads

Page 94: GEMS Tutorial ISCA05

Slide 94 http://www.cs.wisc.edu/gems

Extending Ruby

• Goal:
  – Add new functionality to Ruby and interface to SLICC
• DemoPrefetcher
  – Simple, L2->memory next-line prefetcher
  – Module implemented as C++ object (DemoPrefetcher.C)
  – New type added to SLICC
  – Observes L1 GETS requests via function call
  – Triggers event for prefetch in next cycle
    • Object is connected to an in_port
  – Not the only way (or the right way) of implementing a prefetcher

Demo

Page 95: GEMS Tutorial ISCA05

Slide 95 http://www.cs.wisc.edu/gems

Implementing DemoPrefetcher

• Creating an object that can “wakeup” a controller

DemoPrefetcher.h

class DemoPrefetcher {
public:
  // An object in a SLICC controller will be passed a Chip*
  DemoPrefetcher(Chip* chip_ptr);

  // Allow an in_port to be attached
  void setConsumer(Consumer* consumer_ptr) { m_consumer_ptr = consumer_ptr; }

  // When wakeup() is called, ensure it should do something
  bool isReady() const;

  // functions to implement simple next-line prefetching
  const Address& popNextPrefetch();
  const Address& peekNextPrefetch() const;
  void cancelNextPrefetch();
  void observeL1Request(const Address& address);

  // (remaining members, including m_consumer_ptr and m_prefetch_queue,
  //  are not shown on the slide)
};

Demo

Page 96: GEMS Tutorial ISCA05

Slide 96 http://www.cs.wisc.edu/gems

Implementing DemoPrefetcher

DemoPrefetcher.C

void DemoPrefetcher::observeL1Request(const Address& address)
{
  // next-line prefetch address
  Address prefetch_addr = address;
  prefetch_addr.makeNextStrideAddress(1);

  // add to prefetch queue
  m_prefetch_queue.push(prefetch_addr);

  // when to wake up: choose 1 cycle later
  Time ready_time = g_eventQueue_ptr->getTime() + 1;

  // schedule a wakeup() so that the L2 controller can trigger
  g_eventQueue_ptr->scheduleEventAbsolute(m_consumer_ptr, ready_time);
}
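For readers without the Ruby sources at hand, the same next-line idea can be tried in isolation. The sketch below is a stand-alone approximation, not the GEMS class: the Addr alias, the 64-byte block size, the explicit `now` cycle argument and all names are assumptions that replace Ruby's Address, event queue and Consumer wakeup.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-ins for Ruby types; these are not the GEMS classes.
using Addr = std::uint64_t;
constexpr Addr kBlockBytes = 64;  // assumed cache-line size

// Block-aligned address of the line following the one that holds addr.
inline Addr nextLineAddress(Addr addr) {
    return (addr / kBlockBytes + 1) * kBlockBytes;
}

class NextLinePrefetcherSketch {
public:
    // Observe a demand request at cycle `now` and queue the next line,
    // to become ready one cycle later (mirrors the "+ 1" on the slide).
    void observeRequest(Addr addr, std::uint64_t now) {
        m_queue.push(Entry{nextLineAddress(addr), now + 1});
    }

    // True when the oldest queued prefetch is due at cycle `now`.
    bool isReady(std::uint64_t now) const {
        return !m_queue.empty() && m_queue.front().ready_time <= now;
    }

    Addr peekNextPrefetch() const {
        assert(!m_queue.empty());
        return m_queue.front().addr;
    }

    // Returns by value, so the result stays valid after the pop.
    Addr popNextPrefetch() {
        Addr addr = peekNextPrefetch();
        m_queue.pop();
        return addr;
    }

    void cancelNextPrefetch() {
        assert(!m_queue.empty());
        m_queue.pop();
    }

private:
    struct Entry {
        Addr addr;
        std::uint64_t ready_time;
    };
    std::queue<Entry> m_queue;
};
```

A request to 0x1008 observed at cycle 10 queues a prefetch of 0x1040 that becomes ready at cycle 11.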

Demo

Page 97: GEMS Tutorial ISCA05

Slide 97 http://www.cs.wisc.edu/gems

Interfacing DemoPrefetcher to SLICC

external_type(DemoPrefetcher, inport="yes") {
  bool isReady();
  Address popNextPrefetch();
  void cancelNextPrefetch();
  Address peekNextPrefetch();
  void observeL1Request(Address);
}

DemoPrefetcher prefetcher;

// wakeup logic
in_port(prefetcher_in, Null, prefetcher) {
  if (prefetcher_in.isReady()) {
    if (L2cacheMemory.cacheAvail(prefetcher.peekNextPrefetch()) ||
        L2cacheMemory.isTagPresent(prefetcher.peekNextPrefetch())) {
      if (getState(prefetcher.peekNextPrefetch()) == State:I ||
          getState(prefetcher.peekNextPrefetch()) == State:NP) {
        trigger(Event:Prefetch, prefetcher.popNextPrefetch());
      } else {
        // tag is already present in a non-invalid state
        prefetcher.cancelNextPrefetch();
      }
    } else {
      trigger(Event:L2_Replacement,
              L2cacheMemory.cacheProbe(prefetcher.peekNextPrefetch()));
    }
  }
}

Demo

Page 98: GEMS Tutorial ISCA05

Slide 98 http://www.cs.wisc.edu/gems

Implementing DemoPrefetcher

• Nice property of TokenCMP: no tracking of prefetch
  – A tag is allocated and a request issued to memory
  – keeps received tokens/data if tag allocated

MOESI_CMP_tokenDEMO-L2cache.sm

transition(NP, Prefetch, I) {
  vv_allocateL2CacheBlock;
  a_issuePrefetch;
}

transition(I, Prefetch) {
  a_issuePrefetch;
}

transition({S,O,M,I_L,S_L}, Prefetch) {
  // do nothing
}

Demo

Page 99: GEMS Tutorial ISCA05

Slide 99 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Page 100: GEMS Tutorial ISCA05

Slide 100 http://www.cs.wisc.edu/gems

Workloads for Simics/GEMS

• Unfortunately, we cannot release our workloads (legal reasons)
• Steps for Workload Development
  – Simple Example: Barnes-Hut
  – What about more complex applications?
• Workload Simulation Methodology
  – Simulating transactions/requests
  – Coping with workload variability

Page 101: GEMS Tutorial ISCA05

Slide 101 http://www.cs.wisc.edu/gems

Workload Setup

• Simple Example: Barnes-Hut (Splash2 suite)
  – Commands not to be taken literally! (might be different in different versions)
• Main Steps:
  – Build OS checkpoint
  – Copy application source or binary to simulation
  – Create initial (cold) application checkpoint in Simics
  – Create warm application checkpoint with Simics/Ruby

Page 102: GEMS Tutorial ISCA05

Slide 102 http://www.cs.wisc.edu/gems

Build OS Checkpoint

• Use Simics to boot your OS and get a checkpoint (assuming 16 processor serengeti target machine)
  – cd simics/home/sarek
  – ./simics -x sarek-16p.simics
    • Script loads configuration and boots Solaris
    • Scripts should be provided with your Simics distribution, assuming you have a Solaris license (contact Virtutech Simics Forum)
    • Modify scripts to fit your target configuration (e.g., memory, disk, network)
  – At the end of your script, take a system snapshot (checkpoint):
      simics> write-configuration CHKPT_DIR/sarek-16p.check
      simics> quit
  – Use this checkpoint to build all your workloads’ 16 processor checkpoints

Page 103: GEMS Tutorial ISCA05

Slide 103 http://www.cs.wisc.edu/gems

Copy Barnes Source or Binary

• Develop benchmark on real machine (if available)
  – Use Simics “magic” instructions after initialization
    • See Simics reference manual for magic instruction use
  – Compile benchmark with such instructions before running in Simics
• Load from your OS checkpoint
  – ./simics
      simics> read-configuration CHKPT_DIR/sarek-16p.check
      simics> magic-break-enable
• Copy binary into simulated machine (or copy source and compile)
  – Console commands:
      mount /host
      cp -r /host/workloads/splash2/codes/apps/barnes/BARNES .
    • See Simics reference manual on the use of the /host filesystem

Page 104: GEMS Tutorial ISCA05

Slide 104 http://www.cs.wisc.edu/gems

Obtain Initial Barnes Checkpoint

• Warm up application in Simics
  – Console commands:
      ./BARNES < input-warm
    • input-warm specifies Barnes parameters
      ./BARNES < input-warm
    • Use this second run to warm up cache (see next slide)
      ./BARNES < input-run > output; magic_call break
• After initial run, write checkpoint
      simics> write-configuration CHKPT_DIR/barnes-cold-16p.check
      simics> quit
• Checkpoint is ready for GEMS run

Page 105: GEMS Tutorial ISCA05

Slide 105 http://www.cs.wisc.edu/gems

Obtain Warm Barnes Checkpoint

• Load initial checkpoint
  – setenv CHECKPOINT_AT_END yes
  – setenv TRANSACTIONS 1
  – setenv PROCESSORS 16
  – setenv CHECKPOINT CHKPT_DIR/barnes-cold-16p.check
  – ./simics -no-win -x GEMS_ROOT/gen-scripts/go.simics
• Script (provided in release) should load ruby and run till the end of the warmup run
  – Also writes checkpoint at the end
• Edit checkpoint to remove ruby object
  – Modify script to suit your needs

Page 106: GEMS Tutorial ISCA05

Slide 106 http://www.cs.wisc.edu/gems

What About More Complex Applications?

• Setup on real hardware
  – Tune workload, OS parameters
  – Scale-down for PC memory limits
  – Re-tune
  – For details, [Alameldeen et al., IEEE Computer, Feb’03]
• What if we don’t have access to real hardware?
  – Install applications and setup in Simics
  – Checkpoint often
  – Not optimal for large scale applications!

Page 107: GEMS Tutorial ISCA05

Slide 107 http://www.cs.wisc.edu/gems

Simulating Transactions/Requests

• Throughput-based applications
  – Work-based unit to compare configurations
  – IPC not always meaningful
• Counting Transactions during Simulation
  – Enable magic breaks in Simics
  – Benchmark traps to Simics on every magic instruction
  – Count magic breaks until we reach required number of transactions
  – Cope with benchmark variability
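The counting step above amounts to a small piece of bookkeeping, sketched here. The class is hypothetical and is not part of Simics or GEMS; it assumes that each magic break marks exactly one completed transaction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter: stop the simulation after `target` magic breaks.
class TransactionCounterSketch {
public:
    explicit TransactionCounterSketch(std::uint64_t target) : m_target(target) {
        assert(target > 0);
    }

    // Call once per magic break; returns true once the target is reached.
    bool onMagicBreak() {
        if (m_seen < m_target) {
            ++m_seen;
        }
        return m_seen == m_target;
    }

    std::uint64_t seen() const { return m_seen; }

private:
    std::uint64_t m_target;
    std::uint64_t m_seen = 0;
};
```

Measuring the cycles needed to reach a fixed transaction count gives a work-based metric that stays comparable across configurations even when IPC does not.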

Page 108: GEMS Tutorial ISCA05

Slide 108 http://www.cs.wisc.edu/gems

Why Consider Variability?

OLTP

Page 109: GEMS Tutorial ISCA05

Slide 109 http://www.cs.wisc.edu/gems

Workload Variability

• How can slower memory lead to faster workload?
• Answer: Multithreaded workload takes different paths
  – Different lock race outcomes
  – Different scheduling decisions
  → Runs from same initial conditions can be different

This can lead to wrong conclusions for deterministic simulations

• Solution with deterministic simulation
  – Add pseudo-random delay on memory accesses (MEMORY_LATENCY)
  – Simulate base (and enhanced) system multiple times
  – Use simple or complex statistics [Alameldeen and Wood, HPCA 2003]
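As an example of the "simple statistics" option, the sketch below summarizes several perturbed runs of a base and an enhanced system and checks whether their means differ by more than the combined 95% half-widths. It uses a normal approximation (z = 1.96), which is an assumption of this sketch; with only a handful of runs a Student's t interval would be wider. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of per-run results (for example cycles per transaction).
inline double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (n - 1 in the denominator).
inline double sampleStdDev(const std::vector<double>& xs) {
    assert(xs.size() >= 2);
    const double m = mean(xs);
    double sum_sq = 0.0;
    for (double x : xs) sum_sq += (x - m) * (x - m);
    return std::sqrt(sum_sq / static_cast<double>(xs.size() - 1));
}

// Half-width of an approximate 95% confidence interval for the mean.
inline double ci95HalfWidth(const std::vector<double>& xs) {
    return 1.96 * sampleStdDev(xs) / std::sqrt(static_cast<double>(xs.size()));
}

// True when the two intervals do not overlap.
inline bool clearlyDifferent(const std::vector<double>& base,
                             const std::vector<double>& enhanced) {
    return std::fabs(mean(base) - mean(enhanced)) >
           ci95HalfWidth(base) + ci95HalfWidth(enhanced);
}
```

If the intervals overlap, a single pair of runs could easily have shown the "enhanced" system as slower, which is the wrong-conclusion risk the slide describes.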

Page 110: GEMS Tutorial ISCA05

Slide 110 http://www.cs.wisc.edu/gems

The End

• Download and Subscribe to Mailing Lists
  http://www.cs.wisc.edu/gems
• We encourage your contributions
  – Workloads
  – Additional timing fidelity

Page 111: GEMS Tutorial ISCA05

Slide 111 http://www.cs.wisc.edu/gems

Additional Opal Slides

Page 112: GEMS Tutorial ISCA05

Slide 112 http://www.cs.wisc.edu/gems

Sensitivity Analysis


Page 113: GEMS Tutorial ISCA05

Slide 113 http://www.cs.wisc.edu/gems

Sensitivity Results


Page 114: GEMS Tutorial ISCA05

Slide 114 http://www.cs.wisc.edu/gems

Opal and Memory Consistency

• Designed to be aggressive OoO processor
• Our use of Simics is sequentially consistent execution
• Models the performance of weaker models (such as TSO) for only SC memory interleavings
• Violations of SC in Opal:
  – Identical MSHR entry for memory requests with same addr
  – Executes Ld/St out of program order
  – No snooping of LSQ for external stores

Page 115: GEMS Tutorial ISCA05

Slide 115 http://www.cs.wisc.edu/gems

Implemented UltraSparc Instructions (1)

add addc addcc addccc alignaddr alignaddrl and andcc andn andncc ba bcc bcs be bg bge bgu bl ble bleu bmask bn bne bneg bpa bpcc bpcs

bpe bpg bpge bpgu bpl bple bpleu bpn bpne bpneg bpos bppos bpvc bpvs brgez brgz brlez brlz brnz brz bshuffle bvc bvs call casa casxa cmp done retry fabsd fabsq fabss

faddd faddq fadds faligndata fba fbe fbg fbge fbl fble fblg fbn fbne fbo fbpa fbpe fbpg fbpge fbpl fbple fbplg fbpn fbpne fbpo fbpu fbpue fbpug fbpuge fbpul fbpule fbu fbue fbug

fbuge fbul fbule fcmpd fcmped fcmpeq fcmpeq16 fcmpeq32 fcmpes fcmpgt16 fcmpgt32 fcmple16 fcmple32 fcmpne16 fcmpne32 fcmpq fcmps fdivd fdivq fdivs fdmulq fdtoi fdtoq fdtos fdtox fitod fitoq fitos flush flushw fmovd fmovda fmovdcc fmovdcs fmovde

fmovdg fmovdge fmovdgu fmovdl fmovdle fmovdleu fmovdn fmovdne fmovdneg fmovdpos fmovdvc fmovdvs fmovfda fmovfde fmovfdg fmovfdge fmovfdl fmovfdle fmovfdlg fmovfdn fmovfdne fmovfdo fmovfdu fmovfdue fmovfdug fmovfduge fmovfdul fmovfdule fmovfqa fmovfqe fmovfqg fmovfqge fmovfql fmovfqle

fmovfqlg fmovfqn fmovfqne fmovfqo fmovfqu fmovfque fmovfqug fmovfquge fmovfqul fmovfqule fmovfsa fmovfse fmovfsg fmovfsge fmovfsl fmovfsle fmovfslg fmovfsn fmovfsne fmovfso fmovfsu fmovfsue fmovfsug fmovfsuge fmovfsul fmovfsule fmovq fmovqa fmovqcc fmovqcs fmovqe fmovqg fmovqge fmovqgu

fmovql fmovqle fmovqleu fmovqn fmovqne fmovqneg fmovqpos fmovqvc fmovqvs fmovrdgez fmovrdgz fmovrdlez fmovrdlz fmovrdnz fmovrdz fmovrqgez fmovrqgz fmovrqlez fmovrqlz fmovrqnz fmovrqz fmovrsgez fmovrsgz fmovrslez fmovrslz fmovrsnz fmovrsz fmovs fmovsa fmovscc fmovscs fmovse fmovsg fmovsge

fmovsgu fmovsl fmovsle fmovsleu fmovsn fmovsne fmovsneg fmovspos fmovsvc fmovsvs fmuld fmulq fmuls fnegd fnegq fnegs fqtod fqtoi fqtos fqtox fsmuld fsqrtd fsqrtq fsqrts fsrc1 fstod fstoi fstoq fstox fsubd fsubq fsubs fxtod fxtoq

Page 116: GEMS Tutorial ISCA05

Slide 116 http://www.cs.wisc.edu/gems

Implemented UltraSparc Instructions (2)

fxtos fzero fzeros ill impdep1 impdep2 jmp jmpl ldblk ldd ldda lddf lddfa ldf ldfa ldfsr ldqa ldqf ldqfa ldsb ldsba ldsh ldsha ldstub ldstuba ldsw ldswa ldub lduba lduh lduha lduw lduwa ldx

ldxa ldxfsr membar mov mova movcc movcs move movfa movfe movfg movfge movfl movfle movflg movfn movfne movfo movfu movfue movfug movfuge movful movfule movg movge movgu movl movle movleu movn movne movneg movpos

movrgez movrgz movrlez movrlz movrnz movrz movvc movvs mulscc mulx nop not or orcc orn orncc popc prefetch prefetcha rd rdcc rdpr restore restored retrn save saved sdiv sdivcc sdivx sethi sll sllx smul

smulcc sra srax srl srlx stb stba stbar stblk std stda stdf stdfa stf stfa stfsr sth stha stqf stqfa stw stwa stx stxa stxfsr sub subc subcc subccc swap swapa ta taddcc taddcctv

tcc tcs te tg tge tgu tl tle tleu tn tne tneg tpos trap tsubcc tsubcctv tvc tvs udiv udivcc udivx umul umulcc wr wrcc wrpr xnor xnorcc xor xorcc

Page 117: GEMS Tutorial ISCA05

Slide 117 http://www.cs.wisc.edu/gems

TLB Misses

• ITLB Misses
  – emit special NOP instruction: STATIC_INSTR_MOP; stall fetch
  – does NOT update PC, NPC
  – fetch resumes whenever any instr (including special NOP) squashes
• DTLB Misses
  – Set DTLB miss trap for instruction (setTrapType()) in Execute()
  – In retireInstruction(), retrieve trap and call takeTrap() to set trap state for DTLB handler
  – refetch from DTLB handler

Page 118: GEMS Tutorial ISCA05

Slide 118 http://www.cs.wisc.edu/gems

Example: Load instruction

• In dynamic_inst_t::Schedule(), load waits until all operands ready (WAIT_XX_STAGE cases)
• Scheduler gets invoked when all operands ready
• Load waits until read port to L1 is available
• load_inst_t::Execute() gets called
  – Generates virtual address
  – Performs D-TLB address translation
  – Inserts entry in LSQ
  – Initiates cache access (via Ruby or Opal’s built-in simple cache hierarchy)
  – If cache miss -> put on wait list (CACHE_MISS_STAGE) and is woken up by rubycache_t::complete()
• Invokes Simics to read actual memory value in load_inst_t::Complete()
• Retirement check of load value & squash if value deviates from Simics

Page 119: GEMS Tutorial ISCA05

Slide 119 http://www.cs.wisc.edu/gems

Modifying Opal-Ruby Interface

• Ruby->Opal interface defined in mf_opal_api object (ruby/interfaces/mf_api.h)
• Opal->Ruby interface defined in mf_ruby_api object
• To create new Ruby->Opal callback (ex: hitCallback())
  – Define function in ruby/interfaces/OpalInterface.C
  – Add new function pointer to mf_opal_api object
  – Create a new function handler in opal/system/system.C and assign m_opal_api object’s new function pointer to this function handler
• To create new Opal->Ruby callback (ex: makeRequest())
  – Define function in ruby/interfaces/OpalInterface.C
  – Add new function pointer to mf_ruby_api object
  – Assign function pointer to new function in OpalInterface::installInterface()
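The steps above follow a common pattern: each side exposes a struct of function pointers, and the other side fills in or calls through them. The sketch below shows that pattern in miniature. It does not reproduce mf_api.h; the struct layouts, signatures and function names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical Ruby -> Opal callback table (stands in for mf_opal_api).
struct OpalApiSketch {
    void (*hitCallback)(int cpu, std::uint64_t addr) = nullptr;
};

// Hypothetical Opal -> Ruby call table (stands in for mf_ruby_api).
struct RubyApiSketch {
    void (*makeRequest)(int cpu, std::uint64_t addr) = nullptr;
};

// Shared state for the sketch; the real modules keep this in their own objects.
struct SketchState {
    OpalApiSketch opal_api;
    RubyApiSketch ruby_api;
    std::uint64_t pending_addr = 0;
    bool has_pending = false;
    int completed = 0;
};

inline SketchState& state() {
    static SketchState s;
    return s;
}

// Opal side: handler that Ruby invokes when a request completes.
void opalHitHandler(int /*cpu*/, std::uint64_t /*addr*/) {
    ++state().completed;
}

// Ruby side: record the request; completion happens later (asynchronously).
void rubyMakeRequest(int /*cpu*/, std::uint64_t addr) {
    state().pending_addr = addr;
    state().has_pending = true;
}

// Ruby side: finish the pending request by calling back into Opal.
void rubyCompletePending(int cpu) {
    SketchState& s = state();
    if (s.has_pending && s.opal_api.hitCallback != nullptr) {
        s.has_pending = false;
        s.opal_api.hitCallback(cpu, s.pending_addr);
    }
}

// Each side assigns its functions to the other side's table.
void installInterfaces() {
    state().opal_api.hitCallback = &opalHitHandler;
    state().ruby_api.makeRequest = &rubyMakeRequest;
}
```

Adding a new callback in this scheme means adding one pointer to the relevant table, defining the function on the implementing side, and assigning it during installation, which matches the three steps listed on the slide.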