Accuracy of Performance Monitoring Hardware

Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia
University of Texas-El Paso
and
Shirley Moore
University of Tennessee-Knoxville
PCAT - The University of Texas at El Paso
PCAT Team
Dr. Patricia Teller
Alonso Bayona - Undergraduate
Alexander Sainz - Undergraduate
Trevor Morgan - Undergraduate
Leonardo Salayandia - M.S. Student
Michael Maxwell - Ph.D. Student
Credits (Financial)
DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment
Motivation
Facilitate performance-tuning efforts that employ aggregate event counts
When possible, provide calibration data
Identify unexpected results and errors
Clarify misunderstandings of processor functionality
Road Map
Scope of Research
Methodology
Results
Future Work and Conclusions
Processors Under Study
MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events
Linux/IA-64: 4 counters, 150 events
Linux/Pentium: 2 counters, 80+ events
Events Studied So Far
Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions
PAPI Overhead
Extra instructions: read counter before and after workload
Processing of counter overflow interrupts
Cache pollution
TLB pollution
Methodology
Validation micro-benchmark
Configuration micro-benchmark
Prediction via tool, mathematical model, and/or simulation
Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
Comparison/analysis
Report findings
Validation Micro-benchmark
Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Validation Micro-benchmark
Basic types: array, loop, in-line, floating-point
Scalable w.r.t. granularity, i.e., number of generated events
Example - Loop Validation Micro-benchmark

for (i = 0; i < number_of_loops; i++)
{
    /* sequence of 100 instructions with data dependencies
       that prevent compiler reordering or optimization */
}

Used to stress a particular functional unit, e.g., the load/store unit
Configuration Micro-benchmark
Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
Examples:
Page size used - for TLB miss counts
Cache prefetch algorithm
Branch prediction buffer size/organization
Some Results
Reported Event Counts: Expected, Consistent, and Quantifiable Results
Overhead related to PAPI and other sources is consistent and quantifiable
Reported Event Count - Predicted Event Count = Overhead
Example 1: Number of Loads - Itanium, Power3, and R12K
[Figure: Load data using the loop benchmark; % error (0.0-0.8) vs. expected value (0-250,000); series: Itanium, Power3, R12K.]
Example 2: Number of Stores - Itanium, Power3, and R12K
[Figure: Store count results; % difference (0.0-1.4) vs. expected value (0-800,000); series: Itanium, Power3, R12K.]
Example 2: Number of Stores - Power3 and Itanium

Platform         Loads   Stores
MIPS R12K        46      31
IBM Power3       28      129
Linux/IA-64      86      N/A
Linux/Pentium    N/A     Multiplicative
Example 3: Total Number of Floating-Point Operations - Pentium II, R10K and R12K, and Itanium
For each processor (Pentium II; MIPS R10K, R12K; Itanium), reported counts are accurate and consistent.
Even when counters overflow.
No overhead due to PAPI.
Reported Event Counts: Unexpected and Consistent Results - Errors?
The hardware-reported counts are multiples of the predicted counts
Reported Event Count / Multiplier = Predicted Event Count
Cannot identify overhead for calibration
Example - Total Number of Floating-Point Operations: Power3 (Floating-Point Adds)
[Figure: % error (0-120) vs. expected value (6,600 to 3x10^7); series: Itanium, Power3, R12K, Pentium; annotation: Accurate, Consistent.]
Reported Counts: Expected (Not Quantifiable) Results
Predictions: only possible under special circumstances
Reported event counts seem reasonable
But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts
Example 2: L1 D-Cache Misses
Number of misses is relatively constant as the number of array references increases
[Figure: L1 D-cache misses using sequential access; % difference (-200 to 2000) vs. data accesses (100 to 5x10^5); series: Itanium, Power3, R12K, Pentium.]
Example 2 Enlarged
[Figure: L1 D-cache misses using sequential access; % difference (-200 to 400) vs. data accesses (100 to 5x10^5); series: Power3, R12K, Pentium.]
Example 3: L1 D-Cache Misses with Random Access (Foils the Prefetch Scheme Used by Stream Buffers)
[Figure: L1 D-cache misses as a function of % of cache filled (0-300); % error (-150 to 400); series: Power3, R12K, Pentium.]
Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses
[Figure: Cycles per data access; cycles (0 to 1.8x10^6) vs. data accesses (0-10,000); series: Itanium, Power3, R12K, Pentium.]
total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
Reported Event Counts: Unexpected but Consistent Results
Predicted counts and reported counts differ significantly but in a consistent manner
Is this an error? Are we missing something?
Example: Compulsory Data TLB Misses
% difference per number of references
Reported counts are consistent
They vary between platforms
[Figure: Data TLB misses; % difference (0-600) vs. number of references (1-10,000); series: MIPS R10K, Itanium, Power3.]
Reported Event Counts: Unexpected Results
Outliers
Puzzles
Example 1: Outliers - L1 D-Cache Misses for Itanium
[Figure: L1 D-cache misses using sequential access; % difference (-200 to 2000) vs. data accesses (100 to 5x10^5); series: Itanium, Power3, R12K, Pentium.]
Example 1: Supporting Data

Itanium L1 D-Cache Misses    Mean      Standard Deviation
90% of 1M data accesses      1,290     170
10% of 1M data accesses      782,891   566,370
Example 2: L1 I-Cache Misses and Instructions Retired - Itanium
[Figure: L1 I-cache misses; % error (-80 to 80) vs. expected value (0-12,000); series: Itanium, Power3, R12K.]
[Figure: Total instructions retired; % error (0-20) vs. expected value (0 to 3.5x10^6); series: Itanium, Power3, R12K, Pentium.]
Both are about 17% more than expected.
Future Work
Extend events studied to include multiprocessor events
Extend processors studied to include Power4
Study sampling on Power4; IBM collaboration on workload characterization/system resource usage using sampling
Conclusions
Performance counters provide informative data that can be used for performance tuning
Expected frequency of an event may determine the usefulness of its counts
Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
The usefulness of some event counts, as well as our research, could be enhanced with vendor collaboration
The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?

CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval to others?
Invitation to Vendors
Help us understand what's going on, when to attach the "warning," and when to attach the "seal of approval." Application programmers will appreciate your efforts, and so will we!
Question to You
On-board performance counters: What do they really tell you?
With all the caveats, are they useful nonetheless?
Example 1: Total Compulsory Data TLB Misses for R10K
% difference per number of references
Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased number of references
[Figure: % difference (3%-15%) vs. number of references (1-10,000).]
Example 1: Compulsory Data TLB Misses for Itanium
% difference per number of references
Reported counts consistently ~5 times greater than predicted
[Figure: % difference (399%-404%) vs. number of references (1-10,000).]
Example 3: Compulsory Data TLB Misses for Power3
% difference per number of references
Reported counts consistently ~5/~2 times greater than predicted for small/large counts
[Figure: Total TLB misses (Power3); % discrepancy (150%-550%) vs. number of references (1-10,000).]
Example 3: L1 D-Cache Misses with Random Access - Itanium
Outlier occurs only at array size = 10x cache size
[Figure: L1 D-cache misses as a function of % of cache filled (0-300); % error (-200 to 1600); series: Itanium, Power3, R12K, Pentium.]
Example 2: L1 D-Cache Misses
On some of the processors studied, as the number of accesses increased, the miss rate approached 0
Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word
What's going on?
Example 2: R10K Floating-Point Division Instructions

Variant 1 (1 FP instruction counted):
a = init_value;
b = init_value;
c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

Variant 2 (3 FP instructions counted):
a = init_value;
b = init_value;
c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;
Example 2: Assembler Code Analysis
No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for different FP counts

Assembly for both variants (identical opcodes):
l.d  s.d
l.d  s.d
l.d  s.d
l.d  l.d  div.d  s.d
l.d  l.d  div.d  s.d
l.d  l.d  div.d  s.d