Accuracy of Performance Monitoring Hardware

Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia
University of Texas-El Paso
and
Shirley Moore
University of Tennessee-Knoxville
PCAT - The University of Texas at El Paso
PCAT Team
Dr. Patricia Teller
Alonso Bayona - Undergraduate
Alexander Sainz - Undergraduate
Trevor Morgan - Undergraduate
Leonardo Salayandia - M.S. Student
Michael Maxwell - Ph.D. Student
Credits (Financial)
DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment
Motivation
Facilitate performance-tuning efforts that employ aggregate event counts
When possible, provide calibration data
Identify unexpected results and errors
Clarify misunderstandings of processor functionality
Road Map
Scope of Research
Methodology
Results
Future Work and Conclusions
Processors Under Study
MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events
Linux/IA-64: 4 counters, 150 events
Linux/Pentium: 2 counters, 80+ events
Events Studied So Far
Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions
PAPI Overhead
Extra instructions: read counter before and after workload
Processing of counter overflow interrupts
Cache pollution
TLB pollution
Methodology
Validation micro-benchmark
Configuration micro-benchmark
Prediction via tool, mathematical model, and/or simulation
Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
Comparison/analysis
Report findings
Validation Micro-benchmark
Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Validation Micro-benchmark
Basic types: array, loop, in-line, floating-point
Scalable w.r.t. granularity, i.e., number of generated events
Example - Loop Validation Micro-benchmark

for (i = 0; i < number_of_loops; i++)
{
    /* sequence of 100 instructions with data dependencies
       that prevent compiler reordering or optimization */
}

Used to stress a particular functional unit, e.g., the load/store unit
Configuration Micro-benchmark
Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
Examples:
Page size used - for TLB miss counts
Cache prefetch algorithm
Branch prediction buffer size/organization
Some Results
Reported Event Counts: Expected, Consistent, and Quantifiable Results
Overhead related to PAPI and other sources is consistent and quantifiable
Reported Event Count - Predicted Event Count = Overhead
Example 1: Number of Loads - Itanium, Power3, and R12K
[Figure: Load data using the loop benchmark; % error (0.0-0.8) vs. expected value (0-250,000); series: Itanium, Power3, R12K.]
Example 2: Number of Stores - Itanium, Power3, and R12K
[Figure: Store count results; % difference (0.0-1.4) vs. expected value (0-800,000); series: Itanium, Power3, R12K.]
Example 2: Number of Stores - Power3 and Itanium

Platform         Loads   Stores
MIPS R12K        46      31
IBM Power3       28      129
Linux/IA-64      86      N/A
Linux/Pentium    N/A     Multiplicative
Example 3: Total Number of Floating-Point Operations - Pentium II, R10K and R12K, and Itanium
For each processor (Pentium II; MIPS R10K, R12K; Itanium), reported counts are accurate and consistent.
Even when counters overflow.
No overhead due to PAPI.
Reported Event Counts: Unexpected and Consistent Results - Errors?
The hardware-reported counts are multiples of the predicted counts
Reported Event Count / Multiplier = Predicted Event Count
Cannot identify overhead for calibration
Example - Total Number of Floating-Point Operations: Power3 (Floating-Point Adds)
[Figure: % error (0-120) vs. expected value (6,600 to 3x10^7); series: Itanium, Power3, R12K, Pentium; annotation: Accurate, Consistent.]
Reported Counts: Expected (Not Quantifiable) Results
Predictions: only possible under special circumstances
Reported event counts seem reasonable
But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts
Example 2: L1 D-Cache Misses
Number of misses is relatively constant as the number of array references increases
[Figure: L1 D-cache misses using sequential access; % difference (-200 to 2000) vs. data accesses (100 to 5x10^5); series: Itanium, Power3, R12K, Pentium.]
Example 2 Enlarged
[Figure: L1 D-cache misses using sequential access; % difference (-200 to 400) vs. data accesses (100 to 5x10^5); series: Power3, R12K, Pentium.]
Example 3: L1 D-Cache Misses with Random Access (Foils the Prefetch Scheme Used by Stream Buffers)
[Figure: L1 D-cache misses as a function of % of cache filled (0-300); % error (-150 to 400); series: Power3, R12K, Pentium.]
Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses
[Figure: Cycles per data access; cycles (0 to 1.8x10^6) vs. data accesses (0-10,000); series: Itanium, Power3, R12K, Pentium.]
total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
Reported Event Counts: Unexpected but Consistent Results
Predicted counts and reported counts differ significantly but in a consistent manner
Is this an error? Are we missing something?
Example: Compulsory Data TLB Misses
% difference per number of references
Reported counts are consistent
They vary between platforms
[Figure: Data TLB misses; % difference (0-600) vs. number of references (1-10,000); series: MIPS R10K, Itanium, Power3.]
Reported Event Counts: Unexpected Results
Outliers
Puzzles
Example 1: Outliers - L1 D-Cache Misses for Itanium
[Figure: L1 D-cache misses using sequential access; % difference (-200 to 2000) vs. data accesses (100 to 5x10^5); series: Itanium, Power3, R12K, Pentium.]
Example 1: Supporting Data

Itanium L1 D-Cache Misses    Mean      Standard Deviation
90% of 1M data accesses      1,290     170
10% of 1M data accesses      782,891   566,370
Example 2: L1 I-Cache Misses and Instructions Retired - Itanium
[Figure: L1 I-cache misses; % error (-80 to 80) vs. expected value (0-12,000); series: Itanium, Power3, R12K.]
[Figure: Total instructions retired; % error (0-20) vs. expected value (0 to 3.5x10^6); series: Itanium, Power3, R12K, Pentium.]
Both are about 17% more than expected.
Future Work
Extend events studied to include multiprocessor events
Extend processors studied to include Power4
Study sampling on Power4; IBM collaboration on workload characterization/system resource usage using sampling
Conclusions
Performance counters provide informative data that can be used for performance tuning
Expected frequency of an event may determine the usefulness of its counts
Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
The usefulness of some event counts, as well as our research, could be enhanced with vendor collaboration
The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?

CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval to others?
Invitation to Vendors
Help us understand what's going on, when to attach the "warning," and when to attach the "seal of approval." Application programmers will appreciate your efforts, and so will we!
Question to You
On-board performance counters: What do they really tell you?
With all the caveats, are they useful nonetheless?
Example 1: Total Compulsory Data TLB Misses for R10K
% difference per number of references
Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased number of references
[Figure: % difference (3%-15%) vs. number of references (1-10,000).]
Example 1: Compulsory Data TLB Misses for Itanium
% difference per number of references
Reported counts consistently ~5 times greater than predicted
[Figure: % difference (399%-404%) vs. number of references (1-10,000).]
Example 3: Compulsory Data TLB Misses for Power3
% difference per number of references
Reported counts consistently ~5/~2 times greater than predicted for small/large counts
[Figure: Total TLB misses (Power3); % discrepancy (150%-550%) vs. number of references (1-10,000).]
Example 3: L1 D-Cache Misses with Random Access - Itanium
Outlier occurs only at array size = 10x cache size
[Figure: L1 D-cache misses as a function of % of cache filled (0-300); % error (-200 to 1600); series: Itanium, Power3, R12K, Pentium.]
Example 2: L1 D-Cache Misses
On some of the processors studied, as the number of accesses increased, the miss rate approached 0
Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word
What's going on?
Example 2: R10K Floating-Point Division Instructions

Variant 1 (1 FP instruction counted):
a = init_value;
b = init_value;
c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

Variant 2 (3 FP instructions counted):
a = init_value;
b = init_value;
c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;
Example 2: Assembler Code Analysis
No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for different FP counts

Assembly for both variants (identical opcodes):
l.d  s.d
l.d  s.d
l.d  s.d
l.d  l.d  div.d  s.d
l.d  l.d  div.d  s.d
l.d  l.d  div.d  s.d