NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Page 1: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Presentation on

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Authors: DANIEL SANCHEZ, CHRISTOS KOZYRAKIS

Presented By: Vaibhav Ashtikar (13IS24F), Govind Dhonddev (13IS06F)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Page 2: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Architectural Simulator

A piece of software that models a computer system or its components: given an input, it predicts the output and the performance.

• Evaluating different hardware designs without building costly physical hardware systems.

• Providing access to computer components or systems that do not exist yet.

• Obtaining detailed performance metrics: a single execution of a simulator can often generate a large set of performance data.

• Debugging: debugging on real hardware typically requires rebooting and re-running the code to reproduce a problem. In contrast, some simulators offer a fully controlled environment and allow software developers to run code backward once an error is detected.

Page 3: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

IDEAL Architectural Simulator

A piece of software that models a computer system or its components: given an input, it predicts the output and the performance.

• Fast
• Accurate
• Executes a wide range of workloads
• Easy to use, easy to modify

Page 4: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Page 5: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Problem: Architectural simulation is time-consuming

• Current detailed simulators are slow (~200 KIPS)

• Time to simulate 1000 cores @ 2 GHz for 1 second:
  at 200 KIPS: 4 months
  at 200 MIPS: 3 hours

Simulation performance wall
• More complex targets (multicore, memory hierarchy, …)
• Hard to parallelize
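The two simulation times quoted above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes each simulated core retires one instruction per cycle (IPC = 1); that assumption and the helper name `host_seconds` are not from the slides.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def host_seconds(cores, freq_hz, target_seconds, sim_ips, ipc=1.0):
    """Host time needed to simulate `target_seconds` of the target system.

    ipc=1.0 is an assumption: one instruction per cycle per simulated core.
    """
    instructions = cores * freq_hz * target_seconds * ipc
    return instructions / sim_ips


# 1000 cores at 2 GHz, simulated for 1 second of target time
slow = host_seconds(1000, 2e9, 1.0, sim_ips=200e3)  # 200 KIPS simulator
fast = host_seconds(1000, 2e9, 1.0, sim_ips=200e6)  # 200 MIPS simulator

print(f"200 KIPS: {slow / SECONDS_PER_DAY:.0f} days")    # roughly 4 months
print(f"200 MIPS: {fast / SECONDS_PER_HOUR:.1f} hours")  # roughly 3 hours
```

Under that assumption the target executes 2 × 10^12 instructions, which is where the three-orders-of-magnitude gap between the two figures comes from.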

Page 6: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Architectural Simulator

Sequential simulation
• The more cores are simulated, the slower sequential simulation becomes.

Parallel simulation
• Scales poorly due to excessive synchronization, or
• sacrifices accuracy by allowing event reordering (a tradeoff).

Page 7: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

[Figure: tradeoff between speed and accuracy for sequential and parallel simulators. Labels: Speed, Accuracy, Scaling, Performance measures, OOO]

Page 8: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

ZSIM

• Speed up detailed core models
• Bound-Weave
• Lightweight user-level virtualization

Page 9: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

ZSIM: Speed up detailed core models → FAST

Uses instruction-driven timing models built on DBT (Dynamic Binary Translation).

Page 10: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

ZSIM: Bound-Weave → ACCURACY

A two-phase parallelization technique that scales parallel simulation on multicore hosts.

Page 11: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

ZSIM: Lightweight user-level virtualization → Wide range of workloads

• To support complex workloads, e.g. multiprogrammed workloads, client-server applications, etc.

• To bridge the user-level/full-system gap.

Page 12: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Page 13: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Dynamic Binary Translation
• Simulate a basic block using host instructions

Binary code:

    Load r3, 16(r1)
    Add r4, r3, r2
    Jump 0x48074

Translated code:

    Load t1, simRegs[1]
    Load t2, 16(t1)
    Store t2, simRegs[3]

    Load t1, simRegs[2]
    Load t2, simRegs[3]
    Add t3, t1, t2
    Store t3, simRegs[4]

    Store 0x48074, simPc
    J dispatch_loop
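The translation idea on this slide can be sketched in Python. This is an illustrative toy and not zsim's mechanism: Python closures stand in for host instructions, and the tuple-based guest ISA, `translate`, and `run_block` are hypothetical names introduced here. Only `simRegs` and `simPc` come from the slide.

```python
# Toy guest ISA (an assumption for this sketch):
#   ("load", rd, offset, rbase)   rd <- mem[rbase + offset]
#   ("add", rd, ra, rb)           rd <- ra + rb
#   ("jump", target)              simPc <- target


def translate(block):
    """Translate a guest basic block once into a list of host operations."""
    host_ops = []
    for ins in block:
        if ins[0] == "load":
            _, rd, off, rbase = ins

            def op(st, rd=rd, off=off, rbase=rbase):
                addr = st["simRegs"][rbase] + off
                st["simRegs"][rd] = st["mem"].get(addr, 0)
        elif ins[0] == "add":
            _, rd, ra, rb = ins

            def op(st, rd=rd, ra=ra, rb=rb):
                st["simRegs"][rd] = st["simRegs"][ra] + st["simRegs"][rb]
        elif ins[0] == "jump":
            _, target = ins

            def op(st, target=target):
                st["simPc"] = target
        else:
            raise ValueError(f"unknown guest instruction: {ins!r}")
        host_ops.append(op)
    return host_ops


def run_block(host_ops, state):
    """Execute the translated block; no per-instruction decoding happens here."""
    for op in host_ops:
        op(state)
    return state


# The three-instruction block from the slide
block = [("load", 3, 16, 1), ("add", 4, 3, 2), ("jump", 0x48074)]
state = {"simRegs": [0] * 8, "mem": {0x1010: 5}, "simPc": 0}
state["simRegs"][1] = 0x1000
state["simRegs"][2] = 7

translated = translate(block)
run_block(translated, state)
print(state["simRegs"][3], state["simRegs"][4], hex(state["simPc"]))
```

The point of the design is that decoding cost is paid once in `translate`, while `run_block` can be called many times on the same translated block.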

Page 14: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Dynamic Binary Translation
• Parallelization alone is not sufficient to build a thousand-core simulator.
• An instrumentation-based approach eliminates the need for functional modeling of x86.

Timing models
• Simple Core Model
• OOO Core Model

Page 15: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Simple Core Model (SCM)
• Instruments load and store instructions.
• SCM counts cycles and instructions, and drives the memory hierarchy.
• Simulates up to 90 MIPS per simulated core.

• Pitfalls:
  • Does not represent the OOO cores used in desktop and server processor chips.
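A minimal sketch of what such a simple core model might look like follows. It assumes an IPC of 1 for non-memory work and a single direct-mapped cache with fixed hit and miss latencies; the class names `ToyCache` and `SimpleCore` and every latency value are hypothetical, chosen only to keep the example small.

```python
class ToyCache:
    """Hypothetical direct-mapped cache with fixed hit and miss latencies."""

    def __init__(self, num_lines=64, line_bytes=64, hit_cycles=4, miss_cycles=100):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        self.tags = [None] * num_lines
        self.misses = 0

    def access(self, addr):
        """Return the latency of one access and update the cache state."""
        line = addr // self.line_bytes
        index = line % self.num_lines
        if self.tags[index] == line:
            return self.hit_cycles
        self.tags[index] = line
        self.misses += 1
        return self.miss_cycles


class SimpleCore:
    """Keeps a cycle count and an instruction count, and drives the cache."""

    def __init__(self, cache):
        self.cache = cache
        self.cycles = 0
        self.instrs = 0

    def basic_block(self, num_instrs):
        # Assumption: one cycle per instruction (IPC = 1).
        self.instrs += num_instrs
        self.cycles += num_instrs

    def load(self, addr):
        self.cycles += self.cache.access(addr)

    def store(self, addr):
        self.cycles += self.cache.access(addr)


# Callbacks as an instrumented 4-instruction block might issue them
core = SimpleCore(ToyCache())
core.basic_block(4)
core.load(0x1000)   # cold miss
core.store(0x1008)  # same 64-byte line, so a hit
print(core.instrs, core.cycles, core.cache.misses)
```

The model does very little work per instruction, which is why this style of core is fast, and also why it cannot capture out-of-order effects.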

Page 16: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

OOO Core Model
• Models:
  • Branch prediction
  • Instruction length pre-decoder
  • Instruction decoding
  • Issue stalls
  • Register renaming

• Conventional simulators with an OOO model execute around 100 KIPS.
• ZSIM accelerates the OOO core model by pushing most of the work into the instrumentation phase.

Page 17: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

OOO core modeling
• Basic block → instrumented basic block + basic block descriptor

• Instruction-driven approach: simulate all stages at once for each instruction / μop.
  • The schedule of a given μop must not depend on future μops.
  • The execution time of every μop must be known in advance.

Instrumented basic block:

    BasicBlock(DecodedBBL)
    Load(addr = -0x38(%rbp))
    mov -0x38(%rbp),%rcx
    lea -0x2040(%rbp),%rdx
    add %rax,%rdx
    mov %rdx,-0x2068(%rbp)
    Store(addr = -0x2068(%rbp))
    cmp $0x1fff,%rax
    jne 10840530a

Additional instruction listing:

    mov (%rbp),%rcx
    add %rax,%rbx
    mov %rdx,(%rbp)
    ja 40530a

Basic block descriptor:
• Instruction → μop decoding
• μop dependencies, functional units, latency
• Front-end delays
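The idea of pushing work into the instrumentation phase can be illustrated with a small descriptor cache. This is a sketch under stated assumptions, not zsim's code: the `Uop` and `BblDescriptor` types, the latency table, and the choice to sum latencies serially (which ignores out-of-order overlap entirely) are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    kind: str     # "load", "store", "alu" or "branch"
    latency: int  # cycles; assumed fixed and known in advance


# Hypothetical per-kind latencies
UOP_LATENCY = {"load": 4, "store": 1, "alu": 1, "branch": 1}


@dataclass(frozen=True)
class BblDescriptor:
    uops: tuple
    static_cycles: int  # precomputed once per basic block


_descriptor_cache = {}
decode_calls = 0


def decode_bbl(addr, instr_kinds):
    """Instrumentation time: decode a basic block once and cache the result."""
    global decode_calls
    if addr not in _descriptor_cache:
        decode_calls += 1
        uops = tuple(Uop(k, UOP_LATENCY[k]) for k in instr_kinds)
        total = sum(u.latency for u in uops)
        _descriptor_cache[addr] = BblDescriptor(uops, total)
    return _descriptor_cache[addr]


def simulate_bbl(descriptor, cycle):
    """Simulation time: consume only precomputed data."""
    return cycle + descriptor.static_cycles


# A hot loop body executed 1000 times is decoded only once
kinds = ["load", "alu", "alu", "store", "alu", "branch"]
cycle = 0
for _ in range(1000):
    cycle = simulate_bbl(decode_bbl(0x40530A, kinds), cycle)
print(decode_calls, cycle)
```

Because the descriptor is computed per static basic block rather than per dynamic execution, its cost is amortized over every time the block runs.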

Page 18: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Parallelism and Interference
• Path-altering interference
  • Two accesses, if simulated out of order, change their paths through the memory hierarchy.
  • Root cause:
    • The two accesses address the same line (except when both are reads), or
    • the second access, if executed out of order, causes the first access to miss (for example, by evicting its line).

Page 19: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Parallelism and Interference
• Path-preserving interference
  • Two accesses, if simulated out of order, change their timing, but their paths through the memory hierarchy remain unaffected.
  • Example: two accesses to different cache sets in the same bank.

• In small intervals (1-10K cycles), path-altering interference is very rare (<1 in 10K accesses).
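The distinction between the two kinds of interference can be made concrete with a small classifier. This is a simplified sketch: it covers only the same-line rule and the same-bank example from the slides, it does not model the eviction case, and the `Access` type and `classify` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    line: int       # cache line address
    is_write: bool
    bank: int       # cache bank that serves this line


def classify(a, b):
    """Classify the interference between two accesses from different cores."""
    # Same line and at least one write: reordering changes hits and misses.
    if a.line == b.line and (a.is_write or b.is_write):
        return "path-altering"
    # Same bank otherwise: only the timing (bank contention) can change.
    if a.bank == b.bank:
        return "path-preserving"
    return "none"


write_x = Access(line=0x40, is_write=True, bank=0)
read_x = Access(line=0x40, is_write=False, bank=0)
read_y = Access(line=0x80, is_write=False, bank=0)
read_z = Access(line=0xC1, is_write=False, bank=1)

print(classify(write_x, read_x))  # path-altering
print(classify(read_x, read_y))   # path-preserving
print(classify(read_y, read_z))   # none
```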

Page 20: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Bound-weave Algorithm
• Accuracy is needed on path-altering interference

1. Bound phase
• Each core is simulated in parallel for a specific small interval.
• Zero-load latency is assumed for memory accesses during the interval.

2. Weave phase
• The same interval is simulated again in parallel, this time as an event-driven simulation.
• The events are known in advance from the bound phase, so the simulation scales efficiently.
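The two phases can be sketched as a toy model. This sketch makes strong assumptions that are not from the slides: the only shared resources are memory banks with a fixed occupancy, the weave phase runs sequentially over a single priority queue rather than in parallel, and all names and latency values are hypothetical.

```python
import heapq

ZERO_LOAD_LATENCY = 10  # hypothetical uncontended memory latency (cycles)
BANK_OCCUPANCY = 4      # hypothetical cycles a bank stays busy per access


def bound_phase(core_traces, interval):
    """Simulate each core independently, assuming zero-load latencies.

    core_traces maps core id -> list of (cycles since previous access, bank).
    Returns, per core, the lower-bound (cycle, bank) of each recorded access.
    """
    events = {}
    for core, trace in core_traces.items():  # independent, so parallelizable
        cycle, recorded = 0, []
        for gap, bank in trace:
            cycle += gap
            if cycle >= interval:
                break
            recorded.append((cycle, bank))
            cycle += ZERO_LOAD_LATENCY
        events[core] = recorded
    return events


def weave_phase(events):
    """Replay the recorded accesses in time order, adding bank contention.

    Returns the extra cycles each core accumulates on top of its bound.
    """
    bank_free = {}                        # bank -> cycle at which it is free
    delay = {core: 0 for core in events}  # extra cycles per core so far
    heap = [(evs[0][0], core, 0) for core, evs in events.items() if evs]
    heapq.heapify(heap)
    while heap:
        t, core, i = heapq.heappop(heap)
        bank = events[core][i][1]
        start = max(t, bank_free.get(bank, 0))
        delay[core] += start - t
        bank_free[bank] = start + BANK_OCCUPANCY
        if i + 1 < len(events[core]):
            next_t = events[core][i + 1][0] + delay[core]
            heapq.heappush(heap, (next_t, core, i + 1))
    return delay


traces = {0: [(5, 0), (5, 0)], 1: [(5, 0), (50, 1)]}
recorded = bound_phase(traces, interval=1000)
delays = weave_phase(recorded)
print(recorded)
print(delays)
```

In this example both cores hit bank 0 at cycle 5; the bound phase lets them proceed unaware of each other, and the weave phase then charges core 1 the four cycles it would have waited for the bank.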

Page 21: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Bound Phase
• Limit the skew between simulated cores
  • Threads execute in parallel and synchronize on an interval barrier.

• Moderate parallelism
  • Allow only as many threads to run concurrently as there are host hardware threads.
  • Example: in a 1024-core simulation on a host with 32 hardware threads, the barrier wakes up only 32 threads at a time in each interval.

• Avoiding systematic bias
  • At the end of each interval, the barrier shuffles the thread wake-up order to avoid consistently prioritizing a few threads.
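The barrier behavior described above can be sketched as follows. This is a simplification: it groups threads into fixed waves instead of waking the next thread whenever one finishes, and the function names are hypothetical.

```python
import random


def wake_up_order(sim_threads, rng):
    """Shuffle the wake-up order so no thread is consistently favored."""
    order = list(sim_threads)
    rng.shuffle(order)
    return order


def run_interval(sim_threads, host_threads, rng):
    """Return the waves of simulated threads for one interval.

    At most `host_threads` simulated threads are awake at the same time.
    """
    order = wake_up_order(sim_threads, rng)
    return [order[i:i + host_threads] for i in range(0, len(order), host_threads)]


# 1024 simulated cores on a host with 32 hardware threads
rng = random.Random(0)
waves = run_interval(range(1024), 32, rng)
print(len(waves), len(waves[0]))
```

Calling `run_interval` again with the same `rng` produces a different order, which is what prevents a fixed subset of threads from always running first.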

Page 22: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Bound-Weave Example
• 2-core host simulating a 4-core system
• 1000-cycle intervals
• Components divided among 2 domains

Page 23: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Complex Workloads
• Multi-process simulation
• Scheduler: a simple round-robin scheduler
• Avoiding simulator-OS deadlock
• Time virtualization: virtualizes the rdtsc counter.
• System virtualization: pregenerated virtual instruction.
• Fast-forwarding: uses DBT to run the pre-processing phase quickly, at close to native speed.

• Challenge: accounting for OS execution time.

Page 24: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Accuracy

• For 18 of the 29 benchmarks, zsim is within 10% of the real system (average performance error around 9.7%).

[Figure: absolute performance error, simulated vs. real]

Page 25: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Accuracy: Cache Level

• MPKI error

• Error increases along the memory hierarchy.
• TLB misses are currently not modeled.

Page 26: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Thousand Core performance

• Simulation in sequence on a single system:
  • 1.32 trillion instructions take from 1.8 hours (IPC1-NC) to 8.9 hours (OOO-C).
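The run times above translate into average simulation speeds with a short calculation. This is only a unit conversion of the two figures on the slide; the helper name `mips` is hypothetical, and the results are averages over the whole run rather than per-workload speeds.

```python
def mips(instructions, hours):
    """Average simulation speed in millions of instructions per second."""
    return instructions / (hours * 3600) / 1e6


fastest = mips(1.32e12, 1.8)  # IPC1-NC
slowest = mips(1.32e12, 8.9)  # OOO-C

print(f"IPC1-NC: {fastest:.0f} MIPS")
print(f"OOO-C:   {slowest:.0f} MIPS")
```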

Page 27: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Speed Up
• Simulation performance of single-threaded ZSIM on SPEC2006 using 4 models: IPC1 or OOO cores, with and without contention

Page 28: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Speed Up
• Average ZSIM speedup on workloads as the number of host threads increases from 1 to 32 (16 cores with 2 hardware threads per core)

Page 29: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Comparison with Other Simulators
• Parallel simulators report 1-10 MIPS.

• Many factors (host, workloads, memory-intensive applications) lead to potential differences.

• ZSIM is 2-3 orders of magnitude faster than other simulators.

Page 30: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

Conclusion

• New techniques to achieve speed and accuracy:
  • DBT-based timing models
  • Bound-weave parallelization
  • Lightweight virtualization of user processes

• Leads to simulation speeds of up to 1,500 MIPS on thousand-core simulations.

Page 31: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL

References
• [1] Daniel Sanchez and Christos Kozyrakis, “ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems,” ISCA 2013.
