PINTOS: An Execution Phase Based Optimization and Simulation Tool
Wei Hsu, Jinpyo Kim, Sreekumar Kodak
Computer Science Department, University of Minnesota
October 9, 2004
PIN Tutorial at ASPLOS'04

Outline
• What is Pintos?
• What can Pintos do?
• Phase detection for optimization and simulation
• Optimization (instruction prefetching)
• Fast Simulation
• Summary

What is Pintos?
• PINTOS is a PIN-based Tool for Optimization and Simulation
• A research framework that supports adaptive object code optimization
– Supports deep analysis of run-time program behavior for object code optimization (e.g. instruction and data prefetching)
– Integrates HPM performance monitoring (pfmon) with dynamic instrumentation (PIN)
• Also supports fast performance simulation
– Identifies program phases (with coarse and fine granularity)
– Generates simulation strings that capture representative program behaviors

Pintos Framework
[Figure: framework diagram with two paths.
Optimization: program → pfmon profile → profile analysis → optimization targets → PIN-based analysis (control flow, cache simulation) → filtered optimization targets.
Simulation: program → pfmon profile → profile analysis → phase targets → PIN-based phase detection (phase info) → simulation string generation → simulation strings.]

Our Background
• ADORE dynamic optimization system
[Figure: ADORE architecture. Components shown: main thread, dynamic optimization thread (phase detection, trace selection, optimization, deployment), code cache, kernel / pfmon, and the hardware performance monitoring unit.]

ADORE Performance: Speedup of ORC2.1 +O2 Compiled SPEC2000 Benchmarks
[Figure: bar chart of speedup per benchmark, y-axis 0% to 30%. Bar labels in order (benchmark names not recoverable): 0.00%, 1.59%, 8.75%, 1.32%, 6.18%, 4.97%, 0.00%, 1.00%, 0.71%, 4.14%, 18.63%, 0.83%, 0.00%, 7.02%, 0.06%, 0.00%, 8.66%, 115.25%, 22.40%.]

ADORE Performance at Different Sampling Rates
[Figure: dynopt overhead and net speedup (y-axis 0% to 10%) at sampling rate settings of 100000, 200000, 400000, 800000, 1000000, 2000000, 4000000, and 8000000.]

Future Enhancements to ADORE
• I-cache prefetching
• Helper thread based optimizations
• Value prediction based optimizations
• Dynamically undo aggressive optimizations (e.g. control/data speculation, indirect array prefetches)
• Software branch prediction

What can Pintos do for us?
• Pintos uses pfmon to identify high-level performance problems (e.g. I-cache misses) and to locate target code (phases) for optimization.
• Pintos then uses a PIN-based analysis tool to focus on the target code (phases) and conduct deep analysis.
• Pintos provides a framework to support deep analysis of program behavior so that we can experiment with new object code optimization techniques and feed them to ADORE.
• Simulation strings can be generated by Pintos and used for more efficient micro-architecture simulations.

Phase based Optimization and Simulation
• In Pintos, a phase is a sequence of code that consistently exhibits certain performance behaviors, for example
– Gzip shows consistent and repeated data cache miss patterns
– Crafty exhibits consistent I-cache misses
• A repeating phase can serve as a unit for dynamic and adaptive optimization, or for fast performance simulations.
– An optimization unit can be a basic block, trace, procedure, or region (loops and loop nests, including complex control transfers)
– A simulation unit can be an extended code sequence

Phase Detection
• One phase detection method doesn’t fit all needs.
– Dynamic data cache prefetching requires coarse-grain phases (e.g. loops) while dynamic I-cache prefetching requires fine-grain phases (e.g. frequent calling paths).
• A phase tuple is used to determine the current point of execution in PIN instrumentation
– Phase tuple: (phase ID #, ip addr, # of retired insts)
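The phase tuple can be pictured as a small record plus a lookup by retired-instruction count. The following is a minimal sketch in Python, not the Pintos implementation: the names `PhaseTuple` and `current_phase`, the sorted-marker assumption, and all sample values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class PhaseTuple(NamedTuple):
    """One phase marker: (phase ID #, ip addr, # of retired insts)."""
    phase_id: int
    ip_addr: int
    retired_insts: int


def current_phase(markers: Sequence[PhaseTuple], retired: int) -> Optional[PhaseTuple]:
    """Return the latest marker whose retired-instruction count is <= `retired`.

    Assumes `markers` is sorted by `retired_insts` (an assumption of this sketch).
    """
    counts = [m.retired_insts for m in markers]
    idx = bisect_right(counts, retired) - 1
    return markers[idx] if idx >= 0 else None


# Illustrative values only, not measurements from the talk.
markers = [
    PhaseTuple(phase_id=1, ip_addr=0x40001000, retired_insts=0),
    PhaseTuple(phase_id=2, ip_addr=0x40002F40, retired_insts=5_000_000),
    PhaseTuple(phase_id=1, ip_addr=0x40001000, retired_insts=9_000_000),
]
phase_now = current_phase(markers, 6_000_000)
```

Here the same phase ID appears twice because a phase can repeat; the retired-instruction count is what distinguishes the two occurrences.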

Pintos for Optimization (I-Prefetch)
• Many applications still suffer from significant I-cache misses (e.g. database applications, some SPEC CPU2000 benchmarks, etc.)

| Benchmark | L1I miss rate (%) | L1I Prefetch miss rate | L2I miss rate |
|---|---|---|---|
| 176.gcc | 5.4 | 16.8 | 7.4 |
| 186.crafty | 28.7 | 44.5 | 0.2 |
| 252.eon | 16.9 | 40.8 | 0.0 |
| 253.perlbmk | 27.6 | 42.8 | 7.2 |
| 255.vortex | 15.0 | 26.8 | 5.3 |

• Complex control flows cause a high miss rate for streaming prefetches
• A predictable call sequence results in a relatively low miss rate

I-Cache Miss Analysis (pfmon)
• Miss address based info – Crafty (2125/4760000)
  • 25%: 30 (1.41%)
  • 50%: 91 (4.28%)
  • 75%: 228 (10.73%)
  • 90%: 442 (20.80%)
  Each top miss PC was caused by 10-40 different paths.
• Path based info – Crafty (8016/4760000)
  • 25%: 28 (0.34%)
  • 50%: 126 (1.57%)
  • 75%: 436 (5.43%)
  • 90%: 1118 (13.94%)
  Each top path leading to an I-cache miss has 1-2 possible prefetch targets.
  The data show that we can reduce the points of interest for instruction prefetching.
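The figures above look like a cumulative-coverage breakdown: how many of the hottest miss PCs (or paths) are needed to cover a given share of the sampled misses. That reading is an assumption, although the percentages in parentheses do match the item counts divided by 2125 and 8016. A minimal Python sketch of such a computation follows; the function name `top_n_for_coverage` and the sample counts are hypothetical.

```python
from typing import Dict, List


def top_n_for_coverage(miss_counts: Dict[int, int], targets_pct: List[int]) -> Dict[int, int]:
    """For each coverage target (in percent), return how many of the hottest
    keys are needed for their misses to reach that share of all misses."""
    total = sum(miss_counts.values())
    ranked = sorted(miss_counts.values(), reverse=True)
    pending = sorted(targets_pct)
    result: Dict[int, int] = {}
    running = 0
    for n, count in enumerate(ranked, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the boundary.
        while pending and running * 100 >= pending[0] * total:
            result[pending.pop(0)] = n
    return result


# Hypothetical per-PC miss samples, not data from the talk.
sample_counts = {0x1000: 40, 0x2000: 30, 0x3000: 20, 0x4000: 5, 0x5000: 5}
coverage = top_n_for_coverage(sample_counts, [25, 50, 75, 90])
```

Under this reading, the first Crafty row says that 30 of 2125 miss PCs, about 1.41%, account for 25% of the sampled misses, which is why only a small set of code locations needs deeper analysis.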

Exploring prospective points of instruction prefetching (PIN)
[Figure: control flow graph with basic blocks B1 to B8 feeding an instruction cache simulator; the paths that frequently cause I-cache misses are highlighted.]
• Pintos generates prospective paths leading to frequent I-cache misses by analyzing the pfmon profile
• The PIN instrumentation routine constructs the control flow graph and simulates the instruction cache during execution
• It inserts I-cache prefetching instructions for the prospective paths based on control flow edge weight and estimated cache replacement

Exploring prospective points of instruction prefetching (PIN)
[Figure: control flow graph with basic blocks B1 to B8 feeding an instruction cache simulator; the paths that frequently cause I-cache misses are highlighted.]
• Key observation
– Most I-cache misses happen in the cache lines that follow the entry or the return of a function call.
– L1I cache misses are mostly capacity misses. We need to estimate how prefetching affects the incoming instruction stream.
• Key idea
– Run ahead by exploring the CFG and the I-cache simulator
– Evaluate the prospective paths given by Pintos
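To make the role of the instruction cache simulator concrete, here is a minimal set-associative LRU cache model in Python. It is a sketch under stated assumptions, not the simulator used in Pintos: the class name `ICacheSim`, the helper `count_misses`, the default geometry, and the addresses are all hypothetical.

```python
from collections import OrderedDict
from typing import Iterable


class ICacheSim:
    """Set-associative instruction cache with LRU replacement (illustrative)."""

    def __init__(self, size_bytes: int = 16 * 1024, line_bytes: int = 64, ways: int = 4):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def _touch(self, addr: int) -> bool:
        """Bring the line holding `addr` into the cache; return True on a hit."""
        line = addr // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)      # mark as most recently used
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict the least recently used line
        cache_set[line] = True
        return False

    def fetch(self, addr: int) -> bool:
        """Demand fetch: counted in the access and miss statistics."""
        self.accesses += 1
        hit = self._touch(addr)
        if not hit:
            self.misses += 1
        return hit

    def prefetch(self, addr: int) -> None:
        """Prefetch: fills the cache (and may evict) without counting a miss."""
        self._touch(addr)


def count_misses(cache: ICacheSim, addrs: Iterable[int]) -> int:
    """Number of demand misses incurred by fetching `addrs` in order."""
    return sum(0 if cache.fetch(a) else 1 for a in addrs)


# Hypothetical callee occupying four consecutive cache lines.
callee = [0x2000 + 64 * i for i in range(4)]

cold = ICacheSim()
misses_without_prefetch = count_misses(cold, callee)

warm = ICacheSim()
for addr in callee:
    warm.prefetch(addr)                      # issued earlier on the path
misses_with_prefetch = count_misses(warm, callee)
```

Because `prefetch` can evict lines just as a demand fetch does, running a candidate path through such a model gives a rough estimate of whether a prefetch helps or merely displaces instructions that are about to be used, which matters when most misses are capacity misses.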

Pintos for Fast Simulation
• Execution-driven micro-architectural simulation is commonly used for evaluating new micro-architecture features and the corresponding code optimizations.
• A complete simulation often takes too long. New methods for fast simulation, such as SimPoint and SMARTS, have been proposed.
• PASS (Phase Aware Stratified Sampling) is a different way to generate representative and customized traces for targeted simulations

Fast Simulation Techniques
• Truncated Execution
– Run Z, FastForward-W-R
• Sampling
– SMARTS
– SimPoint
– Stratified Sampling
• Reduced Input Sets
– MinneSPEC

Problems of Previous Work
• Truncated execution gives very inaccurate results
• Reduced input sets do not always behave the same as reference inputs, so performance estimates based on reduced input sets may be misleading.

Mechanism of SMARTS
[Figure: program run time divided into repeating sampling periods, each made of W, U, and (K-1)*U segments.]
W: Warm-up time (fixed to 2000 instructions for SPEC2000)
U: Detailed simulation (fixed to 1000 instructions for SPEC2000)
(K-1)*U: Functional simulation with functional warming (the tool gives the value of K for which the IPC will be within ±3% of the actual value with 99.7% confidence)
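The relationship between the confidence target and K can be sketched with the standard sample-size estimate from sampling theory, n ≥ (z · cv / ε)², where cv is the coefficient of variation of the measured metric across sampling units, ε is the relative error bound (3% here), and z ≈ 3 corresponds to 99.7% confidence. The Python sketch below is an illustration under those assumptions, not the SMARTS tool itself; the function names and the example instruction count are hypothetical.

```python
import math


def required_samples(cv: float, rel_error: float = 0.03, z: float = 3.0) -> int:
    """Sampling units needed so the estimate is within `rel_error` of the true
    mean at the confidence level implied by `z` (z = 3 is about 99.7%)."""
    return math.ceil((z * cv / rel_error) ** 2)


def sampling_interval(total_insts: int, unit_insts: int, n_samples: int) -> int:
    """K: one unit of U instructions is simulated in detail out of every K units."""
    total_units = total_insts // unit_insts
    return max(1, total_units // n_samples)


# Hypothetical workload: 10 billion instructions, U = 1000, cv = 1.0.
n = required_samples(cv=1.0)
k = sampling_interval(total_insts=10_000_000_000, unit_insts=1000, n_samples=n)
detailed_fraction = 1 / k
```

With these illustrative numbers, roughly ten thousand sampling units are needed and only about one unit in a thousand is simulated in detail; a workload with lower variation needs fewer samples and therefore a larger K.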

Issues in Previous Work
SMARTS
• The values of U and W are fixed for the SPEC2000 suite. They have to be identified again for every new benchmark suite (very time consuming)
• Oversampling in steady phases. Does not effectively exploit the existence of phases in programs
SimPoint
• The user chooses the length of a simulation point (100 million, 10 million, or 1 million instructions)
• Provides simulation points based on clustering of basic block profiles, which are generated using sim-fast or ATOM

Phase Aware Stratified Sampling (PASS)
• Deploy a hierarchical method to detect coarse and fine grain program phases
(1) Tracking the calling stack (stable bottom = coarse grain phase): inter-procedure
(2) Detecting loops within the procedure: intra-procedure
(3) Tracking data access patterns such as stride within loops (fine grain phases)
• Select stratified samples from each phase until reaching high statistical confidence
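The last bullet can be illustrated with a generic stratified-sampling loop: each phase is a stratum, samples are drawn from it until the confidence interval around its mean is tight enough, and the per-phase means are combined with weights proportional to phase size. This Python sketch assumes phases are already detected and that a per-interval IPC value is available for each; it is not the PASS implementation, and the function name, thresholds, and data are hypothetical.

```python
import math
import random
import statistics
from typing import Dict, List, Tuple


def stratified_ipc_estimate(
    phases: Dict[str, List[float]],
    rel_error: float = 0.03,
    z: float = 3.0,
    min_samples: int = 5,
    seed: int = 0,
) -> Tuple[float, Dict[str, int]]:
    """Estimate overall IPC from per-phase samples.

    `phases` maps a phase name to the IPC of every interval in that phase
    (each list must be non-empty). Returns the weighted estimate and the
    number of samples drawn from each phase.
    """
    rng = random.Random(seed)
    total = sum(len(values) for values in phases.values())
    estimate = 0.0
    used: Dict[str, int] = {}
    for name, values in phases.items():
        order = values[:]
        rng.shuffle(order)                       # sample without replacement
        n = min(max(2, min_samples), len(order))
        while True:
            sample = order[:n]
            mean = statistics.fmean(sample)
            if n == len(order):
                break                            # phase exhausted
            half_width = z * statistics.stdev(sample) / math.sqrt(n)
            if half_width <= rel_error * mean:
                break                            # confident enough for this phase
            n += 1
        used[name] = n
        estimate += mean * len(values) / total   # weight by phase size
    return estimate, used


# Hypothetical data: a long steady phase and a short, noisier one.
example_phases = {
    "steady": [1.0] * 1000,
    "noisy": [0.9, 1.1] * 100,
}
ipc_estimate, samples_used = stratified_ipc_estimate(example_phases)
```

The steady phase stops at the minimum sample count because its variance is zero, which is the behavior that avoids oversampling steady phases; the sampling effort goes to the phases whose behavior actually varies.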

IPC vs SimPoint (cc1-166, 1 million insts)
[Figure: series labeled simpoint and IPC.]

IPC vs Phase Classification on PASS (cc1-166, 1 million insts)
[Figure]

IPC vs SimPoint (cc1-166, 250 million insts)
[Figure: series Selected Simpoints, Simpoint Clusters, and IPC from Itanium2 machine; x-axis intervals 1 to 211; vertical axes 0 to 10 and 0 to 2.5.]

IPC vs SimPoint (gzip-source, 1 million insts)
[Figure: series labeled simpoint and IPC.]

IPC vs Phase Classification on PASS (gzip-source, 1 million insts)
[Figure]

IPC vs SimPoint (gzip-source, 250 million insts)
[Figure: series Selected Simpoints, Simpoint Clusters, and IPC data from Itanium2; x-axis intervals 1 to 460; vertical axes 0 to 9 and 0 to 2.]

IPC vs SimPoint (mcf-ref, 1 million insts)
[Figure: series labeled simpoint and IPC.]

IPC vs Phase Classification on PASS (mcf-ref)
[Figure]

IPC vs SimPoint (mcf-ref, 250 million insts)
[Figure: series Selected Simpoints, Simpoint Clusters, and IPC data from Itanium2; x-axis intervals 1 to 426; vertical axes 0 to 8 and 0 to 1.8.]

IPC vs Phase Classification on PASS (gap-ref, 1 million insts)
[Figure]

IPC vs SimPoint (gap-ref, 250 million insts)
[Figure: series Selected Simpoints, Simpoint Clusters, and IPC data from Itanium2; x-axis intervals 1 to 1471; vertical axes 0 to 10 and 0 to 2.]

Summary
• We show how HPM sampling (pfmon) and dynamic instrumentation (PIN) are combined in our research framework (Pintos) for adaptive object code optimization and micro-architectural simulation.
• PASS (Phase Aware Stratified Sampling) may offer a more efficient way to simulate the interaction between compiler optimizations and new micro-architectural features.