Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load...

49
2007-11-14 1 Overview of SimpleScalar Mafijul Islam Department of Computer Science and Engineering

Transcript of Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load...

Page 1: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 1

Overview of SimpleScalar

Mafijul Islam

Department of Computer Science and Engineering

Page 2: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 2

Acknowledgement

• SimpleScalar Tutorial available at www.simplescalar.com

• http://hpc5.cs.tamu.edu/docs/SimplescalarOverview2006.ppt

Page 3: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 4

• What is an architectural simulator?q tool that reproduces the behavior of a computing device

• Why use a simulator?q leverage faster, more flexible S/W development cycle

q permits more design space exploration

q facilitates validation before H/W becomes available

q level of abstraction can be throttled to design task

q possible to increase/improve system instrumentation

A Computer Architecture Simulator Primer

DeviceSimulator

SystemInputs

System Outputs

System Metrics

Page 4: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 4

Simulators

• Almost 40 simulators listed at http://www.cs.wisc.edu/arch/www/tools.html

• 1st simulator of the list

– SimpleScalar (uni-processor, superscalar)

– Developed by Todd Austin (U Michigan) while

in U of Wisconsin-Madison

– Still widely used in the academia and industry

Page 5: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

Page 7SimpleScalar Tutorial

A Taxonomy of Simulation Tools

• shaded tools are part of SimpleScalar

Architectural simulators

Cycle timers

Performance

Inst. schedulersExec-driven

Functional

Trace-driven

Interpreters Direct execution

Page 6: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

Page 9SimpleScalar Tutorial

Functional vs. performance simulators

• Functional simulators implement the architecture

- Perform the actual execution

- Implement what programmers see

• Performance (or timing) simulators implement the microarch.

- Model system resources/internals

- Measure time

- Implement what programmers do not see

Page 7: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 8

Trace Driven vs. Execution Driven

Simulators• Trace-Driven

– Simulator reads a ‘trace’ of the instructions captured during a previous execution

– Easy to implement

– No functional components necessary

– No feedback to trace (eg. mis-prediction)

• Execution-Driven– Simulator runs the program (trace-on-the-fly)

– Hard to implement

– Advantages• Faster than tracing

• No need to store traces

• Register and memory values usually are not in trace

• Support mis-speculation cost modeling

Page 8: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 9

Instruction Schedulers vs. Cycle Timers

• Instruction Schedulers

– Simulator schedules instruction when resources are available

– Instructions proceeded one at a time

– Simpler, but less detailed

• Cycle Timers

– Simulator tracks microarchitecture state each cycle

– Simulator state == microarchitecture state

– Perfect for microarchitecture simulation

Page 9: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 10

The SimpleScalar Tool Set• computer architecture research test bed

q compilers, assembler, linker, libraries, and simulators

q targeted to the virtual SimpleScalar PISA architecture

q hosted on most any Unix-like machine

• developed during Austin's dissertation work at UW-Madisonq third generation simulation tool (Sohi → Franklin → SimpleScalar)

q in development since ‘94

q first public release (1.0) in July ‘96

q second public release (2.0) testing completed in January ‘97

• available with source and docs from SimpleScalar LLC http://www.simplescalar.com

Page 10: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 12

SimpleScalar Tool Set Overview

• compiler chain is GNU tools ported to SimpleScalar

• Fortran codes are compiled with AT&T’s f2c

• libraries are GLIBC ported to SimpleScalar

F2C GCC

GAS

GLDlibf77.a

libm.alibc.a

Simulators

Bin Utils

Fortran code C code

Assembly code

object files

Executables

Page 11: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

Page 4SimpleScalar Tutorial

Advantages of SimpleScalar• Extensible

- source for compiler, libraries, simulators

- user-extensible instruction format

• Portable

- runs on NT and most UNIX platforms

- target can support multiple ISAs

• Detailed

- Interfaces support simulators of arbitrary detail

- Multiple simulators included with distribution

• Fast (millions of instructions per second)

Page 12: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 9

The Zen of Simulator Design

• design goals will drive which aspects are optimized

• the SimpleScalar Tool Setq optimizes performance and flexibility

q in addition, provides portability and varied detail

Performance

Detail Flexibility

PickTwo

Performance: speeds design cycle

Flexibility: maximizes design scope

Detail: minimizes risk

Page 13: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 13

Simulation Suite Overview

Performance

Detail

Sim-Fast Sim-SafeSim-Cache/

Sim-Cheetah/Sim-BPred

Sim-Profile Sim-Outorder

- 420 lines- functional- 4+ MIPS

- 350 lines- functional w/ checks

- < 1000 lines- functional- cache stats- pred stats

- 900 lines- functional- lot of stats

- 3900 lines- performance- OoO issue- branch pred.- mis-spec.- ALUs- cache- TLB- 200+ KIPS

Page 14: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 12

Sim-Fast

• Functional simulation

• Optimized for speed

• Assumes no cache

• Assumes no instruction checking

• Does not support Dlite!

• Does not allow command line arguments

• <300 lines of code

Page 15: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 13

Sim-Safe

• Functional simulation

• Checks for instruction errors

• Optimized for speed

• Assumes no cache

• Supports Dlite!

• Does not allow command line arguments

Page 16: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 14

Sim-Cache

• Cache simulation

• Ideal for fast simulation of caches (if the effect of cache

performance on execution time is not necessary)

• Accepts command line arguments for:

– level 1 & 2 instruction and data caches

– TLB configuration (data and instruction)

– Flush and compress

– and more

• Ideal for performing high-level cache studies that don’t

take access time of the caches into account

Page 17: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 15

Sim-Bpred

• Simulate different branch prediction mechanisms

• Generate prediction hit and miss rate reports

• Does not simulate the effect of branch prediction on total

execution time

nottaken

taken

perfect

bimod bimodal predictor

2lev 2-level adaptive predictor

comb combined predictor (bimodal and 2-level)

Page 18: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 16

Sim-Profile

● Program Profiler

● Generates detailed profiles, by symbol and by address

● Keeps track of and reports

● Dynamic instruction counts

● Instruction class counts

● Branch class counts

● Usage of address modes

● Profiles of the text & data segment

Page 19: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 17

Sim-Outorder

• Most complicated and detailed simulator

• Supports out-of-order issue and execution

• Provides reports

– branch prediction

– cache

– external memory

– various configuration

Page 20: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 16

Generating SimpleScalar Binaries• compiling a C program, e.g.,

ssbig-na-sstrix-gcc -g -O -o foo foo.c -lm

• compiling a Fortran program, e.g.,ssbig-na-sstrix-f77 -g -O -o foo foo.f -lm

• compiling a SimpleScalar assembly program, e.g.,ssbig-na-sstrix-gcc -g -O -o foo foo.s -lm

• running a program, e.g.,sim-safe [-sim opts] program [-program opts]

• disassembling a program, e.g.,ssbig-na-sstrix-objdump -x -d -l foo

• building a library, use:ssbig-na-sstrix-{ar,ranlib}

Page 21: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 17

Global Simulator Options• supported on all simulators:

-h - print simulator help message-d - enable debug message-i - start up in DLite! debugger-q - quit immediately (use w/ -dumpconfig)-config <file> - read config parameters from <file>-dumpconfig <file>- save config parameters into <file>

• configuration files:q to generate a configuration file:

q specify non-default options on command line

q and, include “-dumpconfig <file>” to generate configuration file

q comments allowed in configuration files, all after “#” ignored

q reload configuration files using “-config <file>”

Page 22: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 18

The SimpleScalar Instruction Set• clean and simple instruction set architecture:

q MIPS/DLX + more addressing modes - delay slots

• bi-endian instruction set definitionq facilitates portability, build to match host endian

• 64-bit inst encoding facilitates instruction set researchq 16-bit space for hints, new insts, and annotations

q four operand instruction format, up to 256 registers

16-annote 16-opcode 8-ru 8-rt 8-rs 8-rd

16-imm

081624324863

Page 23: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 19

SimpleScalar InstructionsControl:j - jumpjal - jump and linkjr - jump registerjalr - jump and link registerbeq - branch == 0bne - branch != 0blez - branch <= 0bgtz - branch > 0bltz - branch < 0bgez - branch >= 0bct - branch FCC TRUEbcf - branch FCC FALSE

Load/Store:lb - load bytelbu - load byte unsignedlh - load half (short)lhu - load half (short) unsignedlw - load worddlw - load double wordl.s - load single-precision FPl.d - load double-precision FPsb - store bytesbu - store byte unsignedsh - store half (short)shu - store half (short) unsignedsw - store worddsw - store double words.s - store single-precision FPs.d - store double-precision FP

addressing modes: (C) (reg + C) (w/ pre/post inc/dec) (reg + reg) (w/ pre/post inc/dec)

Integer Arithmetic:add - integer addaddu - integer add unsignedsub - integer subtractsubu - integer subtract unsignedmult - integer multiplymultu - integer multiply unsigneddiv - integer dividedivu - integer divide unsignedand - logical ANDor - logical ORxor - logical XORnor - logical NORsll - shift left logicalsrl - shift right logicalsra - shift right arithmeticslt - set less thansltu - set less than unsigned

Page 24: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 20

SimpleScalar Instructions

Floating Point Arithmetic:add.s - single-precision addadd.d - double-precision addsub.s - single-precision subtractsub.d - double-precision subtractmult.s - single-precision multiplymult.d - double-precision multiplydiv.s - single-precision dividediv.d - double-precision divideabs.s - single-precision absolute valueabs.d - double-precision absolute valueneg.s - single-precision negationneg.d - double-precision negationsqrt.s - single-precision square rootsqrt.d - double-precision square rootcvt - integer, single, double conversionc.s - single-precision comparec.d - double-precision compare

Miscellaneous:nop - no operationsyscall - system callbreak - declare program error

Page 25: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 21

SimpleScalar Architected StateVirtual Memory

0x00000000

0x7fffffff

Unused

Text(code)

Data(init)(bss)

StackArgs & Env

0x00400000

0x10000000

0x7fffc000

.

.

r0 - 0 source/sink

r1 (32 bits)

r2

r31

Integer Reg File

.

.

f0 (32 bits)

f1

f2

f31

FP Reg File (SP and DP views)

r30

f30

f1

f3

f31

PC

HI

LO

FCC

Page 26: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 22

Simulator I/O

• a useful simulator must implement some form of I/Oq I/O implemented via SYSCALL instruction

q supports a subset of Ultrix system calls, proxied out to host

• basic algorithm (implemented in syscall.c):q decode system call

q copy arguments (if any) into simulator memory

q perform system call on host

q copy results (if any) into simulated program memory

write(fd, p, 4)

Simulated Program Simulator

sys_write(fd, p, 4)

args in

results out

Page 27: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 23

Simulator S/W Architecture• interface programming style

q all “.c” files have an accompanying “.h” file with same base

q “.h” files define public interfaces “exported” by moduleq mostly stable, documented with comments, studying these files

q “.c” files implement the exported interfacesq not as stable, study these if you need to hack the functionality

• simulator modulesq sim-*.c files, each implements a complete simulator core

• reusable S/W components facilitate “rolling your own”q system components

q simulation components

q “really useful” components

Page 28: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 24

Simulator S/W Architecture

• most of performance core is optional

• most projects will enhance on the “simulator core”

BPred SimulatorCore

Machine DefinitionFunctional

Core

SimpleScalar ISA POSIX System Calls

Proxy Syscall Handler

Dlite!

Cache MemoryRegsLoader

Resource

Stats

PerformanceCore

Prog/SimInterface

SimpleScalar Program BinaryUserPrograms

Page 29: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 41

SIM-OUTORDER: H/W Architecture

• implemented in sim-outorder.c and components

Fetch DispatchRegister

Scheduler

MemoryScheduler

Writeback CommitExec

Mem

D-Cache(DL1)

I-Cache(IL1)

Virtual Memory

D-TLBI-TLB

I-Cache(IL2)

D-Cache(DL2)

Page 30: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 18

Sim-Outorder HW Architecture

Fetch DispatchRegister

SchedulerExe Writeback Commit

I-Cache

Memory

SchedulerMem

Virtual Memory

D-Cache D-TLBI-TLB

ruu_fetch ruu_dispatch ruu_issue

lsq_refresh

ruu_writeback ruu_commit

Page 31: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 19

Sim-Outorder (Main Loop) • sim_main() in sim-outorder.c

ruu_init();

for(;;){

ruu_commit();

ruu_writeback();

lsq_refresh();

ruu_issue();

ruu_dispatch();

ruu_fetch();

}

• Executed once for each simulated machine cycle

• Walks pipeline from Commit to Fetch– Reverse traversal handles inter-stage latch synchronization by only

one pass

Page 32: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 20

Sim-Outorder (RUU/LSQ)

• RUU (Register Update Unit)

– Handles register synchronization/communication

– Serves as reorder buffer and reservation stations

• LSQ (Load/Store Queue)

– Handles memory synchronization/communication

– Contains all loads and stores in program order

• Relationship between RUU and LSQ

– Memory dependencies are resolved by LSQ

– Load/Store effective address calculated in RUU

Page 33: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 46

Fetch

misprediction (from Writeback)

to instruction fetch queue (IFQ)

Fetch Stage Implementation

• models machine fetch bandwidth

• implemented in ruu_fetch()

• inputs:q program counter

q predictor state (see bpred.[hc])

q misprediction detection from branch execution unit(s)

• outputs:q fetched instructions sent to instruction fetch queue (IFQ)

Page 34: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 47

Fetch

misprediction (from Writeback)

to instruction fetch queue (IFQ)

Fetch Stage Implementation

• procedure (once per cycle):q fetch instructions from one I-cache line, block until I-cache or

I-TLB misses are resolved

q queue fetched instructions to instruction fetch queue (IFQ)

q probe branch predictor for cache line to access in next cycle

Page 35: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 48

Dispatch to RUU or LSQinstructionsfrom IFQ

Dispatch Stage Implementation

• models machine decode, rename, RUU/LSQ allocationbandwidth, implements register renaming

• implemented in ruu_dispatch()

• inputs:q instructions from IFQ, from Fetch stage

q RUU/LSQ occupancy

q rename table (create_vector)

q architected machine state (for execution)

• outputs:q updated RUU/LSQ, rename table, machine state

Page 36: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 49

Dispatch Stage Implementation

• procedure (once per cycle):q fetch insts from IFQ

q decode and execute instructionsq permits early detection of branch mis-predicts

q facilitates simulation of “oracle” studies

q if branch misprediction occurs:q start copy-on-write of architected state to speculative state buffers

q enter instructions into RUU and LSQ (load/store queue)q link to sourcing instruction(s) using RS_LINK structure

q loads/stores are split into two insts: ADD + Load/Storeq improves performance of memory dependence checking

Dispatch to RUU or LSQinstructionsfrom IFQ

Page 37: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 50

RegisterScheduler

MemoryScheduler

RUU, LSQ to functional units

Scheduler Stage Implementation

• models instruction wakeup, selection, and issueq separate schedulers track register and memory dependencies

• implemented in ruu_issue()and lsq_refresh()

• inputs:q RUU/LSQ

• outputs:q updated RUU/LSQ

q updated functional unit state

Page 38: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 51

RegisterScheduler

MemoryScheduler

RUU, LSQ to functional units

Scheduler Stage Implementation

• procedure (once per cycle):q locate instructions with all register inputs ready

q in ready queue, inserted when dependent insts enter Writeback

q locate loads with all memory inputs readyq determined by walking the load/store queue

q if load addr unknown, then stall issue (and poll again next cycle)

q if earlier store w/ unknown addr, then stall issue (and poll again)

q if earlier store w/ matching addr, then forward store data

q else, access D-cache

Page 39: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 52

insts issued by Scheduler completed insts to WritebackExec

Mem

requests to memory hierarchy

Execute Stage Implementation

• models functional units and D-cacheq access port bandwidths, issue and execute latencies

• implemented in ruu_issue()

• inputs:q instructions ready to execute, issued by Scheduler stage

q functional unit and D-cache state

• outputs:q updated functional unit and D-cache state, Writeback events

Page 40: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 53

Execute Stage Implementation

• procedure (once per cycle):q get ready instructions (as many as supported by issue B/W)

q find free functional unit and access port

q reserve unit for entire issue latency

q schedule writeback event using operation latency offunctional unitq for loads satisfied in D-cache, probe D-cache for access latency

q also probe D-TLB, stall future issue on a miss

q D-TLB misses serviced in Commit with fixed latency

insts issued by Scheduler completed insts to WritebackExec

Mem

requests to memory hierarchy

Page 41: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 54

detected mispredictions to Fetch

Writebackfinished insts from Execute insts ready to Commit

Writeback Stage Implementation

• models writeback bandwidth, wakes up ready insts,detects mispredictions, initiated misprediction recovery

• implemented in ruu_writeback()

• inputs:q completed instructions as indicated by event queue

q RUU/LSQ state (for wakeup walks)

• outputs:q updated event queue, RUU/LSQ, ready queue

q branch misprediction recovery updates

Page 42: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 55

detected mispredictions to Fetch

Writebackfinished insts from Execute insts ready to Commit

Writeback Stage Implementation

• procedure (once per cycle):q get finished instructions (specified by event queue)

q if mispredicted branch, recover state:q recover RUU

q walk newest instruction to mispredicted branch

q unlink instructions from output dependence chains (tag increment)

q recover architected stateq roll back to checkpoint (copy-on-write bits reset, spec mem freed)

q wakeup walk: walk output dependence chains of finished instsq mark dependent instruction’s input as now ready

q if deps satisfied, wake up inst (memory checked in lsq_refresh())

Page 43: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 56

Commitinsts ready to Commit

Commit Stage Implementation

• models in-order retirement of instructions, storecommits to the D-cache, and D-TLB miss handling

• implemented in ruu_commit()

• inputs:q completed instructions in RUU/LSQ that are ready to retire

q D-cache state (for store commits)

• outputs:q updated RUU, LSQ, D-cache state

Page 44: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 57

Commit Stage Implementation

• procedure (once per cycle):q while head of RUU/LSQ is ready to commit (in-order

retirement)q if D-TLB miss, then service it

q if store, attempt to retire store into D-cache, stall commitotherwise

q commit instruction result to the architected register file, updaterename table to point to architected register file

q reclaim RUU/LSQ resources (adjust head pointer)

Commitinsts ready to Commit

Page 45: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

2007-11-14 27

Sim-Outorder parameters

• Instruction fetch queue size, decode and issue bandwidth

• Capacity of RUU and LSQ

• Branch mis-prediction latency

• Number of functional units– integer ALU, integer multipliers/dividers

– FP ALU, FP multipliers/dividers

• Latency of I-cache/D-cache, memory and TLB

• Record statistic by text address

Page 46: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 109

Specifying Cache Configurations• all caches and TLB configurations specified with same format:

<name>:<nsets>:<bsize>:<assoc>:<repl>

• where:<name> - cache name (make this unique)<nsets> - number of sets<assoc> - associativity (number of “ways”)<repl> - set replacement policy

l - for LRUf - for FIFOr - for RANDOM

• examples:il1:1024:32:2:l 2-way set-assoc 64k-byte cache, LRUdtlb:1:4096:64:r 64-entry fully assoc TLB w/ 4k pages,

random replacement

Page 47: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 110

Specifying Cache Hierarchies• specify all cache parameters in no unified levels exist, e.g.,

-cache:il1 il1:128:64:1:l -cache:il2 il2:128:64:4:l

-cache:dl1 dl1:256:32:1:l -cache:dl2 dl2:1024:64:2:l

• to unify any level of the hierarchy, “point” an I-cache level into thedata cache hierarchy:

-cache:il1 il1:128:64:1:l -cache:il2 dl2

-cache:dl1 dl1:256:32:1:l -cache:dl2 ul2:1024:64:2:l

il1 dl1

il2 dl2

il1 dl1

ul2

Page 48: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 115

Specifying the Branch Predictor• specifying the branch predictor type:

-bpred <type>

the supported predictor types are:nottaken always predict not takentaken always predict takenperfect perfect predictorbimod bimodal predictor (BTB w/ 2 bit counters)2lev 2-level adaptive predictor

• configuring the bimodal predictor (only useful when “-bpred bimod” isspecified):

-bpred:bimod <size> size of direct-mapped BTB

Page 49: Overview of SimpleScalar - Chalmers ·  · 2009-10-21Overview of SimpleScalar ... load single-precision FP l.d - load double-precision FP sb - store byte sbu - store byte unsigned

SimpleScalar TutorialSimpleScalar TutorialPage 116

Specifying the Branch Predictor (cont.)• configuring the 2-level adaptive predictor (only useful when “-bpred

2lev” is specified):

-bpred:2lev <l1size> <l2size> <hist_size>

where:

<l1size> size of the first level table<l2size> size of the second level table<hist_size> history (pattern) width

l1size

patternhistory

hist_size

branchaddress

l2size

2-bitpredictors

branchprediction