Parallel SimOS: Scalability and Performance for Large System...

Parallel SimOS: Scalability and Performance for Large System Simulation Ph.D. Oral Defense Robert E. Lantz Computer Systems Laboratory Stanford University 1


Page 1: Parallel SimOS: Scalability and Performance for Large System Simulationrlantz/papers/lantz-orals-talk... · 2008. 12. 8. · The SimOS System • Complete Machine Simulator developed

Parallel SimOS: Scalability and Performance for Large System Simulation

Ph.D. Oral Defense
Robert E. Lantz

Computer Systems Laboratory
Stanford University

1


Overview

• This work develops methods to simulate large computer systems with practical performance

• We use smaller machines to simulate larger machines

• We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors

2


Outline

• Background and Motivation

• Parallel SimOS Investigation

• Design Issues and Experiences

• Performance Evaluation

• Usability Evaluation

• Related Work

• Future Work and Conclusions

3


Why large systems?

• Large applications!

• Biology, Chemistry, Physics, Engineering

• From large systems (e.g. Earth’s climate) to small systems (e.g. cells, DNA)

• Web applications, search, databases

• Simulation, visualization (and games!)

4


Why simulate large systems?

• Compare alternative designs

• Verify a system before building it

• Predict behavior and performance

• Debug a system during bring-up

• Write software when the system is not available (or before it exists!)

• Avoid expensive mistakes

5


The SimOS System

• Complete Machine Simulator developed in CSL

• Simulates complete hardware of computer system: CPU, memory, devices

• Enough speed and detail to run full operating system, system software, application programs

• Multiple CPU and memory models for fast or detailed performance and behavioral modeling

Figure: SimOS software stack. The target workload and target OS run on SimOS simulated hardware (CPU model: P P P P; memory model: M M M M; device models: disk, network, other), which runs on the host OS and host hardware.

6


Using SimOS

Figure: Using SimOS. SimOS takes config/control scripts and a disk image (OS, system software, user applications, application data), communicates through external I/O, and produces modeled performance and event statistics, program output, and simulator statistics.

7


Performance Terminology

• Execution time is the most meaningful measurement of simulator performance

• Slowdown = Real Time/Simulated Time

• Slowdown tells you how much longer it will take to simulate a workload compared to running it on actual hardware

• Self-relative slowdown compares a simulator with the machine it is running on
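As a small worked illustration of the slowdown definition above, the following sketch converts between slowdown and wall-clock time. The numbers are hypothetical, not measurements from this work, and the function names `slowdown` and `real_time_minutes` are ours.

```python
def slowdown(real_time_s: float, simulated_time_s: float) -> float:
    """Slowdown = real (wall-clock) time / simulated (virtual) time."""
    if simulated_time_s <= 0:
        raise ValueError("simulated time must be positive")
    return real_time_s / simulated_time_s


def real_time_minutes(simulated_minutes: float, slowdown_factor: float) -> float:
    """Wall-clock minutes needed to simulate the given amount of virtual time."""
    return simulated_minutes * slowdown_factor


# Hypothetical run: 600 s of wall-clock time to simulate 60 s of workload.
print(slowdown(600.0, 60.0))             # 10.0

# At a slowdown of 10,000, one simulated minute costs 10,000 real minutes,
# which is just under one week (a week is 10,080 minutes).
print(real_time_minutes(1.0, 10_000.0))  # 10000.0
```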

8


Speed/Detail Trade-off

SimOS CPU and Memory Models

SimOS CPU Model | Detail | Approximate KIPS (225 MHz R10K) | Self-relative slowdown
MXS | Dynamic, superscalar microarchitecture model; non-blocking memory system | 12 | 2000+
Mipsy | Sequential interpreter; blocking memory system | 800 | 300+
Embra w/caches | Single-cycle CPU model; simplified cache model | 12000 | ~20
Embra | Single-cycle CPU and memory model | 25000 | ~10

9


Benefits of fast simulation

• Makes it possible to simulate complex workloads

• Real OS, system software, large applications

• Many billions of cycles

• Positioning before more detailed simulation

• Allows software development, debugging

• interactive usability

• Enables exploration of large design space

• Provides rough estimate of performance, trends

10

SimOS Applications

• Used in design, development, debugging of Stanford FLASH multiprocessor throughout its life cycle

• Enabled numerous studies of OS and application performance

• Research platform for operating systems, virtual machines, visualization

11


SimOS Limitations

• As we simulate larger machines, slowdown increases

Figure: slowdown (real time / simulated time, 0 to 15,000) versus simulated processors (1, 128, 1024) for Barnes, FFT, Radix, and LU.

12


SimOS Limitations

• ...resulting in longer simulation times

Figure: time (minutes, 0 to 15,000) to simulate one minute of virtual time versus simulated processors (1, 128, 1024). Annotations: 10 minutes, 23 hours, > 1 week.

13


Problem: Simulator Slowdown

• What causes simulator slowdown?

• Intrinsic Slowdown

• Resource Exhaustion

• Linear slowdown

• Overall multiplicative slowdown:

Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
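The multiplicative model above can be evaluated directly. The sketch below uses hypothetical parameter values, and it assumes that the linear slowdown equals the number of simulated processors time-shared on one host processor; that reading is ours, not stated on the slide.

```python
def simulation_time(workload_time: float,
                    intrinsic_slowdown: float,
                    resource_exhaustion_penalty: float,
                    linear_slowdown: float) -> float:
    """Simulation Time = Workload Time
       * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown."""
    return (workload_time
            * (intrinsic_slowdown + resource_exhaustion_penalty)
            * linear_slowdown)


# Hypothetical: a 60 s workload, 10x intrinsic slowdown, no resource
# exhaustion, 32 simulated CPUs multiplexed on one host CPU (assumption).
print(simulation_time(60.0, 10.0, 0.0, 32.0))   # 19200.0

# Same workload with a hypothetical resource exhaustion penalty of 5.
print(simulation_time(60.0, 10.0, 5.0, 32.0))   # 28800.0
```

Because the terms multiply, reducing the linear factor or the resource exhaustion penalty pays off across the whole run, which is the motivation for the parallel design that follows.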

14


Solution: Parallel SimOS

• Use increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown

• Extend speed/detail trade-off with fast, parallel mode of simulation

• Goal: eliminate slowdown due to parallelism and increase scalability to enable large system simulation with practical performance

15


Outline

• Background and Motivation

• Parallel SimOS Investigation

• Design Issues and Experiences

• Embra background

• Parallel Embra Design

• Performance Evaluation

• Usability Evaluation

• Related Work

• Future Work and Conclusions

16


Embra: SimOS’ fastest simulation mode

• Binary translation CPU and memory simulator

• Translation Cache (TC)

• Callouts to handle events, MMU operations, exceptions and annotations

• CPU multiplexing

• ~10x base slowdown

Figure: Embra block diagram. Blocks: Translation Cache (TC) with MMU/glue code, kernel TC, user TC, and TC index; MMU cache; MMU handler; statistics reporting; translator; decoder; callout and exception handlers; event handlers; SimOS interface.

17


Embra: sources of slowdown

• Binary translation overhead

• Multiplexing overhead

• Resource Exhaustion

ST = WT * (Slowdown(I) + Slowdown(R)) * M

18


Binary translation overhead

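Figure: binary translation. The decoder and translator use the PC and the TC index to map simulated instructions in simulator memory to translated code in the Translation Cache (TC).

Simulated code (simulator memory):

    lw r1, (r2)
    lw r3, (r4)
    add r5, r1, r3

Translated code (Translation Cache):

    lw SIM_T1, R2(cpu_base)
    jal mem_read_addr
    lw SIM_T2, (SIM_T1)
    sw SIM_T2, R1(cpu_base)
    lw SIM_T1, R4(cpu_base)
    jal mem_read_addr
    lw SIM_T3, (SIM_T1)
    sw SIM_T3, R3(cpu_base)
    add.w SIM_T1, SIM_T2, SIM_T3
    sw SIM_T1, R5(cpu_base)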
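To make the translate-once, execute-many idea concrete, here is a toy sketch of a translation cache keyed by PC. It is not Embra's implementation: the two-instruction toy ISA, the `TranslationCache` class, and the use of Python closures as "translated code" are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Instr = Tuple[str, int, int, int]          # (op, dest, src1, src2), toy ISA
Block = Callable[[List[int], Dict[int, int]], None]


class TranslationCache:
    """Toy translation cache: translate a basic block once, then reuse it."""

    def __init__(self, program: Dict[int, List[Instr]]) -> None:
        self.program = program             # pc -> basic block of toy instructions
        self.cache: Dict[int, Block] = {}  # pc -> translated block (the "TC index")
        self.translations = 0              # how many blocks were translated

    def _translate(self, pc: int) -> Block:
        self.translations += 1
        ops = []
        for op, d, a, b in self.program[pc]:
            if op == "lw":                 # regs[d] = mem[regs[a]]
                ops.append(lambda r, m, d=d, a=a: r.__setitem__(d, m.get(r[a], 0)))
            elif op == "add":              # regs[d] = regs[a] + regs[b]
                ops.append(lambda r, m, d=d, a=a, b=b: r.__setitem__(d, r[a] + r[b]))
            else:
                raise ValueError(f"unsupported op: {op}")

        def block(regs: List[int], mem: Dict[int, int]) -> None:
            for f in ops:
                f(regs, mem)

        return block

    def execute(self, pc: int, regs: List[int], mem: Dict[int, int]) -> None:
        block = self.cache.get(pc)
        if block is None:                  # TC miss: translate and remember
            block = self._translate(pc)
            self.cache[pc] = block
        block(regs, mem)                   # TC hit path: run translated code


# The three-instruction sequence from the slide, as one basic block.
program = {0x1000: [("lw", 1, 2, 0), ("lw", 3, 4, 0), ("add", 5, 1, 3)]}
tc = TranslationCache(program)
regs = [0] * 8
regs[2], regs[4] = 100, 104
mem = {100: 7, 104: 35}
tc.execute(0x1000, regs, mem)
tc.execute(0x1000, regs, mem)              # second run reuses the translation
print(regs[5], tc.translations)            # 42 1
```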

19


CPU multiplexing overhead

• CPU State array

• Context switching with variable timeslice

• large for low overhead

• small for better responsiveness

• minimal: MPinUP mode

Figure: CPU state array. Each simulated CPU (CPU 0, CPU 1, CPU 2) has its own registers, FPU, MMU, and other state; three processors (P) are shown alongside.
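The timeslice trade-off above can be seen in a toy round-robin scheduler over a CPU state array. This is a sketch under our own assumptions (a `CpuState` record holding only a cycle counter, and a fixed rather than variable timeslice), not the SimOS scheduler.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CpuState:
    """Stand-in for one entry of the CPU state array (registers etc. omitted)."""
    cpu_id: int
    cycles: int = 0


def run_multiplexed(cpus: List[CpuState], cycles_per_cpu: int,
                    timeslice: int) -> Tuple[int, int]:
    """Round-robin the simulated CPUs on one host CPU.

    Returns (timeslices executed, maximum cycle skew observed between CPUs).
    """
    if timeslice <= 0:
        raise ValueError("timeslice must be positive")
    slices = 0
    max_skew = 0
    while any(c.cycles < cycles_per_cpu for c in cpus):
        for cpu in cpus:
            remaining = cycles_per_cpu - cpu.cycles
            if remaining <= 0:
                continue
            cpu.cycles += min(timeslice, remaining)   # "run" this CPU
            slices += 1                               # one context switch each
            skew = max(c.cycles for c in cpus) - min(c.cycles for c in cpus)
            max_skew = max(max_skew, skew)
    return slices, max_skew


# Small timeslice: many switches, little skew between simulated CPUs.
print(run_multiplexed([CpuState(i) for i in range(3)], 1000, 100))   # (30, 100)
# Large timeslice: few switches, but CPUs drift far apart in virtual time.
print(run_multiplexed([CpuState(i) for i in range(3)], 1000, 1000))  # (3, 1000)
```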

20


A new, faster mode: Parallel Embra

• Use parallelism and memory system of shared-memory multiprocessor

• Decimation-in-space approach

• Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion:

ST = WT * (Slowdown(I) + Slowdown(R)) * M

Figure: simulated nodes mapped onto simulator threads.
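One way to picture decimation in space is as a partition of simulated nodes across simulator threads. The round-robin mapping below is our assumption for illustration; the slides do not say how Parallel Embra assigns nodes to threads. The helper names `assign_nodes` and `multiplexing_factor` are hypothetical.

```python
from typing import List


def assign_nodes(num_nodes: int, num_threads: int) -> List[List[int]]:
    """Entry t lists the simulated node ids run by simulator thread t."""
    if num_threads <= 0:
        raise ValueError("need at least one simulator thread")
    return [list(range(t, num_nodes, num_threads)) for t in range(num_threads)]


def multiplexing_factor(groups: List[List[int]]) -> int:
    """Largest number of simulated nodes any single thread must time-share."""
    return max(len(g) for g in groups)


print(assign_nodes(8, 4))                             # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(multiplexing_factor(assign_nodes(8, 1)))        # 8
print(multiplexing_factor(assign_nodes(8, 4)))        # 2
print(multiplexing_factor(assign_nodes(1024, 8)))     # 128
```

Under this reading, adding threads shrinks the per-thread multiplexing factor, which corresponds to the M term in the formula above.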

21


Design Evolution

• We started with a baseline design and evolved it to achieve scalable performance

• Baseline: thread-based parallelism, shared memory

• Critical design features:

• Mirroring hardware in software

• Replication, fine-grained parallelism

• Unsynchronized execution speed

22


Design: Software should mirror Hardware

• Shared Translation Cache to reduce overhead?

• Problem: contention and serialization; chaining and cache conflicts

• Fuses hardware, breaks parallelism

• Solution: mirror hardware in software with replicated Translation Caches

Figure: Parallel Embra block diagram with three replicated Translation Cache (TC) blocks, each with MMU/glue code, kernel TC, user TC, and TC index, alongside the MMU cache, MMU handler, statistics reporting, translator, decoder, callout and exception handlers, event handlers, and SimOS interface.

23

Design: Software should mirror Hardware

Figure: Parallel Embra block diagram.

• Shared Event Queue for global ordering? Events are rare!

• Problem: event frequency increases with parallelism

• Solution: replicated event queues to mirror hardware in software

24


Design: Software should mirror Hardware

• 90% of time is spent in the TC, so why not parallelize the TC only?

• Problem: Amdahl’s law

• Problem: frequent callouts, contention everywhere

• Result: critical region expansion and serialization
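The Amdahl's law point can be checked with a short calculation. Taking the 90% figure above as the parallelizable fraction, and picking 32 workers as an illustrative value, the achievable speedup is well under 32 and can never exceed 10:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# 90% of time in the TC, TC perfectly parallelized over 32 host CPUs.
print(round(amdahl_speedup(0.9, 32), 2))   # 7.8
```

Even with unlimited workers the serial 10% caps the speedup at 1 / 0.1 = 10, before counting any contention in the callouts.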

Figure: Parallel Embra block diagram.

25


Critical Region Expansion

Figure: timeline showing critical regions, contention and descheduling, and the resulting expansion and serialization over time.

26


Design: Software should mirror Hardware

• Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra

• OS and apps require parallel callouts from Translation Cache

• Parallel statistics reporting is also a good idea, but happens infrequently

Figure: Parallel Embra block diagram.

27


Design: flexible virtual time synchronization

• Problem: cycle skew between fast, slow processors

• Solution: configurable barrier synchronization

• fast processors wait for slow processors

• fine-grain (like MPinUP mode)

• loose grain (reduce sync overhead)

• variable interval for flexibility
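A minimal sketch of configurable barrier synchronization follows, using Python's `threading.Barrier`. It assumes every simulated CPU executes the same number of cycles and treats "executing" as simply advancing a counter; the function `run_with_barrier` and its parameters are hypothetical, not Parallel SimOS code.

```python
import threading
from typing import List, Tuple


def run_with_barrier(num_cpus: int, total_cycles: int,
                     sync_interval: int) -> Tuple[List[int], int]:
    """Run one thread per simulated CPU, synchronizing every sync_interval cycles.

    Returns (final per-CPU cycle counts, maximum cycle skew observed).
    """
    barrier = threading.Barrier(num_cpus)
    lock = threading.Lock()
    clocks = [0] * num_cpus
    max_skew = [0]

    def cpu_thread(cpu_id: int) -> None:
        while clocks[cpu_id] < total_cycles:
            step = min(sync_interval, total_cycles - clocks[cpu_id])
            with lock:
                clocks[cpu_id] += step          # "execute" up to the sync point
                skew = max(clocks) - min(clocks)
                max_skew[0] = max(max_skew[0], skew)
            barrier.wait()                      # fast CPUs wait for slow CPUs

    threads = [threading.Thread(target=cpu_thread, args=(i,))
               for i in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, max_skew[0]


clocks, skew = run_with_barrier(num_cpus=4, total_cycles=1000, sync_interval=100)
print(clocks, skew <= 100)   # [1000, 1000, 1000, 1000] True
```

The barrier bounds the skew by the synchronization interval: a small interval keeps CPUs close in virtual time at the cost of more barrier waits, while a large interval reduces synchronization overhead.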

28


Design: synchronization causes slowdown

Figure: 32-processor slowdown relative to a large synchronization interval (0 to 4) versus synchronization interval (500,000; 1,000,000; 10,000,000 cycles) for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water.

29


Design: unsynchronized execution

• For performance, the best synchronization interval is longer than the workload, i.e. never synchronize

• We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew

• This is because every thread sees a consistent ordering of memory and synchronization events

30


Design conclusions

• Parallelism increases contention for: callouts, event system, TC, clock, MMU, interrupt controllers, any shared subsystem

• Contention cascades, resulting in critical region expansion and serialization

• Mirroring hardware in software preserves parallelism, avoids contention effects

• Fine-grained synchronization is required to permit correct and highly parallel access to simulator data

• Time synchronization across processors is unnecessary for correctness and undesirable for speed

• Performance depends on combination of all parallel performance features

31


Outline

• Background and Motivation

• Parallel SimOS Investigation

• Design Issues and Experiences

• Performance Evaluation

• Usability Evaluation

• Related Work

• Future Work

• Conclusions

32


Performance: Test Configuration

Workload

Benchmark | Description
Barnes | Hierarchical Barnes-Hut method for N-body problem
FFT | Fast Fourier Transform
LU | Lower/Upper matrix factorization
MP3D | Particle-based hypersonic wind tunnel simulation
Radix | Integer radix sort
Raytrace | Ray tracer
Ocean | Ocean currents simulation
Water | Water molecule simulation
pmake | Compile phase of Modified Andrew Benchmark
ptest | Simple benchmark for sanity check/peak performance

Machine

Stanford FLASH Multiprocessor

• 64 nodes

• MIPS R10000, 225 MHz

• 220 MB DRAM/node (14 GB total)

• flash1, flash32, flash64, etc.

33


Performance: Peak and actual MIPS

Figure: MIPS over time (y-axis 0 to 1100, x-axis 0 to 1200; source log: -vpc-suite/flash-32-suite.log). Panels: Flash32: ptest and Flash32: SPLASH-2. Annotations: 1600 MIPS, 1000 MIPS.

Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware

34


Performance: Hardware self-relative slowdown

Figure: self-relative slowdown (0 to 60) versus simulated machine size (1, 2, 4, 8, 16, 32, 64) for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big.

~10x slowdown regardless of machine size

35


Performance: benchmark phases

Figure panels: Barnes-Flash32 and LU-Flash32.

36


Performance: benchmark phases

Figure panel: MP3D-Flash32.

37

Large Scale Performance

• 1024-processor simulation, 16x64p cluster, 8-way parallel

38


Large Scale Performance

Figure: slowdown (real time / simulated time, 0 to 15,000) for SimOS and Parallel SimOS on Radix/Flash32 and LU/Flash64. Bar labels, in extraction order: 442; 9,409; 772; 10,323.

Hours or days rather than weeks

39


Speed/Detail Trade-off, revisited

Parallel SimOS CPU and Memory Models

Parallel SimOS CPU Model | Detail | Approximate KIPS (225 MHz R10K) | Self-relative slowdown
MXS | Dynamic, superscalar microarchitecture model; non-blocking memory system | 12 | 2000+
Mipsy | Sequential interpreter; blocking memory system | 800 | 300+
Embra w/caches | Single-cycle CPU model; simplified cache model | 12000 | ~20
Embra | Single-cycle CPU and memory model | 25000 | ~10
Parallel Embra | Non-deterministic, single-cycle CPU and memory model | > 1,000,000 | ~10

40


Performance Conclusions

• Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS

• Parallel SimOS simulates a multiprocessor with performance analogous to serial SimOS simulating a uniprocessor

• Parallel SimOS extends scalability of complete machine simulation to 1024 processor systems

41


Usability Study

• Study of large, complex parallel program: Parallel SimOS itself

• Self-hosting capability of orthogonal simulators

• Performance debugging of Parallel SimOS, and test of functionality and usability

• Self-hosting architecture (top of stack to bottom):

Benchmark (Radix)
Inner Irix 6.5
Inner SimOS
Outer Irix 6.5
Outer SimOS
Irix 6.5
Hardware (SGI Origin)

42


Phase profile

Figure: computation intervals for self-hosted radix, plotted as CPU (0 to 4) versus time (s), in two panels labeled Serial SimOS and Parallel SimOS; the time axes span 40 to 80 s and 11 to 25 s.

Bugs: Excessive TLB misses, interrupt storms

Limitation: system imbalance effects

43


Usability Conclusions

• Parallel SimOS worked correctly on itself

• Revealed bugs and limitations of Parallel SimOS

• Speed/detail trade-off enabled with checkpoints

• Detailed mode too slow - ended up scaling down workload

• Need for faster detailed simulation modes

44


Limitations

• Virtual time depends on real time

• Loss of determinism, repeatability

• but can use checkpoints!

• System Imbalance Effects

• Memory Limits

• Need for fast detailed mode

• future work

45


Related Work

• Parallel SimOS uses shared-memory multiprocessors and decimation in space

• Other approaches to improving performance using parallelism include:

• Decimation in time

• Cluster-based simulation

46


Related Work: Decimation in Time

ST = WT * (Slowdown(I) + Slowdown(R)) * N

Figure: decimation in time, in three stages: initial serial execution (Segments 1 to 4, with checkpoints), subsequent parallel execution (Segments 1 to 4 started from checkpoints, with overlap), and serial reconstruction (Segments 1 to 4).
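The checkpoint-and-replay structure of decimation in time can be sketched with a toy deterministic "machine". Everything here is an illustrative assumption: the `step` function stands in for a simulator, the fast pass and the detailed pass use the same step function, and segment overlap is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def step(state: int, t: int) -> int:
    """Toy deterministic machine step (stand-in for simulating one interval)."""
    return (state * 31 + t) % 1_000_003


def fast_pass(initial: int, total_steps: int,
              segment_len: int) -> List[Tuple[int, int]]:
    """Initial serial execution: record a checkpoint at each segment start."""
    checkpoints = []
    state = initial
    for t in range(total_steps):
        if t % segment_len == 0:
            checkpoints.append((t, state))
        state = step(state, t)
    return checkpoints


def detailed_segment(job: Tuple[int, int, int]) -> List[int]:
    """Replay one segment from its checkpoint, recording every state."""
    start, state, length = job
    trace = []
    for t in range(start, start + length):
        state = step(state, t)
        trace.append(state)
    return trace


def decimate_in_time(initial: int, total_steps: int, segment_len: int,
                     workers: int = 4) -> List[int]:
    checkpoints = fast_pass(initial, total_steps, segment_len)
    jobs = [(start, state, min(segment_len, total_steps - start))
            for start, state in checkpoints]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        traces = list(pool.map(detailed_segment, jobs))   # parallel execution
    return [s for trace in traces for s in trace]         # serial reconstruction


# The stitched trace matches one long serial run of the same toy machine.
serial = detailed_segment((0, 1, 100))
print(decimate_in_time(1, 100, 25) == serial)   # True
```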

47


Parallel SimOS: Decimation in Space

Figure: simulated nodes mapped onto simulator threads.

ST = WT * (Slowdown(I) + Slowdown(R)) * M

48

Related Work: Cluster-based Simulation

• Most common means of parallel simulation: Shaman, BigSim, others

• Fast (?) LAN = high-latency communication

• Software-based shared memory = low performance

• Reduced flexibility

Figure: cluster nodes connected through a switch.

49


Parallel SimOS: Flexible Simulation

• Tightly and loosely coupled machines

• From clusters to multiprocessors and everything in between

• Parallelism across multiprocessor nodes

Figure: three example target systems. (1) Workstation cluster (“Sweet Hall”): workstations on a network. (2) NUMA shared-memory multiprocessor (Stanford FLASH machine): nodes, each with CPUs, caches, and a memory controller, on a multi-level bus/interconnect. (3) Multiprocessor cluster: multiprocessors, each with several nodes and a network interface, connected by a network.

50


Related Work Summary

• Decimation in Time achieves good speedup at the expense of interactivity

• synergistic with Parallel SimOS

• Cluster-based simulation addresses needs of loosely-coupled systems, generally without shared memory

• Parallel SimOS approach achieves programmability and performance - for larger design space that includes tightly-coupled and hybrid systems

51


Future Work

• Faster detailed simulation

• Parallel detailed mode with flexible memory, pipeline models

• Try to recapture determinism

• Global memory ordering in virtual time

• Faster less-detailed simulation

• Revisit direct execution, using virtual machine monitors, user-mode OS, etc.

52


Conclusion: Thesis Contributions

• Developed design and implementation of scalable, parallel complete machine simulation

• Eliminated slowdown due to resource exhaustion and multiplexing

• Scaled complete machine simulation up by an order of magnitude - 1024 processor machines on our hardware

• Developed flexible simulator capable of simulating large, tightly-coupled systems with interactive performance

53