Session 2: Tracing and Characterization
Optimizing UNIX for OLTP on CC-NUMA
Darrell Suggs, Data General Corporation
Tracing and Characterization of NT-based System Workloads
Jason Casmira, David Kaeli (Northeastern University); David Hunter (DEC Software Partners Engineering Group)
Analysis of Commercial and Technical Workloads on AlphaServer Platforms
Zarka Cvetanovic, Digital Equipment Corporation
Characterizing TPC-D on a MIPS R10K Architecture
Qiang Cao, Pedro Trancoso, and Josep Torrellas, University of Illinois at Urbana Champaign
Data General
02/01/98 Page 1 Prepared By: Darrell Suggs
Optimizing UNIX for OLTP on CC-NUMA
Darrell Suggs, PhD
Performance Architect
Data General Corp.
Overview
• In late ‘96 our challenge was ...
• Tune software today for “future” architecture
  - no reasonable prototypes available
  - software development lead time is significant
  - design issues very complex, exceed our intuitive abilities
• Specific target
  - Architecture: 16-32 Intel Pentium Pro, CC-NUMA
  - Operating System: DG/UX, commercial/enterprise UNIX
  - Application: Oracle RDBMS
  - Workload: TPC-C
Product Status
• AV20000 Product Shipped for Revenue in ‘97
• Demonstrated industry leading performance
• First in a line of CC-NUMA products
Basic Approach to SW Scaling
• Construct advanced analysis environment
  - Obtain architecture independent traces
  - Construct detailed cache simulation of target platform
  - Simulate with model/traces looking for SW scaling issues
• Use analysis environment to:
  - Prototype changes in OS and App
  - Re-trace prototype software, verify increased scaling
  - Work with OS/APP developers to implement changes
• Repeat until no more high leverage scaling issues found
NUMA Building Block Architecture
[Diagram labels: SHV; four P6 CPUs with L2 caches; OPB; OMC; Memory; PCI; SCI Board; Snoop RAM; PIU-A; PIU-D; Far Memory Cache; SCC; SCI Directory; two Link Chips; Dual SCI Rings]
Fabric Interface
• High throughput coherent bridge between P6 bus and SCI bus
• Far memory cache for reduced average latencies
• HA diagnostic features
• Provides for globally viewable and uniformly addressed cc memory (GCM)
CC-NUMA Architecture
• Platform Characteristics
  - 16 Intel/Ppro with 1MB L2 Cache
  - Full service local memory controller (OMC)
  - Far Memory Cache controller, 128MB Direct mapped (OMC like)
  - Distributed coherent memory -- single image
  - SCI directory based cache coherency at interconnect
  - Local access latency: ~300ns
  - Remote access latency: ~3 to 5 microsecs
• Key scaling issue
  - Number of interconnect operations per unit of work
  - Interconnect operation demand per second
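The two latency figures above suggest how quickly average memory latency degrades as more accesses leave the local node. The following is a minimal sketch, assuming a simple weighted-average model that the slides do not state. Only the ~300 ns and ~3 to 5 microsecond figures come from the slide; the remote-access fractions and the function names are hypothetical.

```python
# Latency figures taken from the slide.
LOCAL_NS = 300.0                     # local access latency (~300 ns)
REMOTE_NS_RANGE = (3000.0, 5000.0)   # remote access latency (~3 to 5 microseconds)


def average_latency_ns(remote_fraction, remote_ns, local_ns=LOCAL_NS):
    """Weighted-average access latency (an assumed model, not from the slides)."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be in [0, 1]")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def latency_table(fractions=(0.0, 0.05, 0.10, 0.25)):
    """Return (fraction, best case, worst case) rows for hypothetical remote fractions."""
    rows = []
    for fraction in fractions:
        best = average_latency_ns(fraction, REMOTE_NS_RANGE[0])
        worst = average_latency_ns(fraction, REMOTE_NS_RANGE[1])
        rows.append((fraction, best, worst))
    return rows


if __name__ == "__main__":
    for fraction, best, worst in latency_table():
        print(f"remote {fraction:5.0%}: {best:7.1f} ns to {worst:7.1f} ns")
```

Under this assumed model, even a modest share of remote accesses dominates the average, which is consistent with the slide naming interconnect operations as the key scaling issue.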
CC-NUMA Architecture
[Diagram: four nodes, each with four P6/L2 processors on a Local Bus together with Near Mem, a Far Cache, and an SCI InterConnect; the four nodes are joined by the SCI Ring]
CC-NUMA Architecture Simulation
• Construct detailed discrete event simulation
  - Models all system cache contents and protocols
  - Models all busses/interconnects and associated protocols
    - E.g. full simulation of SCI protocol
  - Specifically,
    - 16 L2’s, 4 system busses/far memory caches, 4 SCI directories
• Model driven by physical, pre-L2, address traces
  - flexibility to change all cache geometries (except L1)
  - can examine impact of various protocol optimizations
• Simulation Tool - SES Workbench
  - Scientific and Engineering Software
  - Mature and flexible tool for commercial grade simulation
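To illustrate what "model driven by address traces" means at the smallest scale, here is a minimal sketch of a single direct-mapped cache fed by a list of physical addresses. It is not the SES Workbench model described above: it has no coherence protocol, no busses, and no timing. The class name, the 32-byte line size, and the tiny cache size in the usage note are assumptions chosen for illustration.

```python
class DirectMappedCache:
    """Trace-driven direct-mapped cache that only counts hits and misses."""

    def __init__(self, size_bytes, line_bytes=32):
        if size_bytes <= 0 or line_bytes <= 0 or size_bytes % line_bytes != 0:
            raise ValueError("size_bytes must be a positive multiple of line_bytes")
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines   # one tag slot per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one physical address; return True on a hit."""
        line_number = address // self.line_bytes
        index = line_number % self.num_lines   # which slot the line maps to
        tag = line_number // self.num_lines    # identifies the line within that slot
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag                 # fill (and evict any previous line)
        self.misses += 1
        return False


def simulate(trace, cache):
    """Feed every address in the trace to the cache; return (hits, misses)."""
    for address in trace:
        cache.access(address)
    return cache.hits, cache.misses
```

For example, `simulate([0x00, 0x04, 0x20, 0x80, 0x00], DirectMappedCache(128))` exercises a hit within a line and a conflict eviction. Changing `size_bytes` or `line_bytes` corresponds to the "change all cache geometries" flexibility listed on the slide.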
Architecture Independent Traces
• Objective: Capture traces on existing HW to be
  - extensible to different architectures (diff L2’s, bus structs, cpu counts, etc.)
  - physical addresses for both user and kernel
  - long, contiguous traces for large cache simulation & continuity
  - representative, but manageable, sample (30 to 60 secs)
  - pre-L2, post-L1
Architecture Independent Traces
• Technique Overview
  - Use largest available SMP (quad P6)
  - Start with well balanced OLTP configuration (TPC-C)
  - Trace all processes executing on SMP
    - annotate traces to identify individual PID’s
  - Process traces to identify independent process address streams
  - Simulate HW by assigning processes to simulated CPU’s
Trace Environment
[Diagram labels: four P6/L1 processors; four L2 caches; System Bus; Mem Ctl; I/O Ctl; Database; Pod; Logic Analyzer with "Buffer Full" Feedback to the CPUs; Network; NFS Server; Trace Storage]
Trace Operation
- Run workload to steady state
- Capture all system bus accesses, filling analyzer buffer
- Logic analyzer triggers CPU feedback to “halt” cpu
- Logic analyzer dumps trace buffer to disk, no cpu activity occurs
- Analyzer frees cpu’s to resume work, captures addresses til full
- Repeat the start/capture/stop cycle
- Results in long, contiguous traces: 30 to 60 system secs, hundreds of millions of accesses captured
Process Simulation
Traced Data
CPU  PID  Address
0    128  0x1230
0    128  0x1240
1    321  0x8820
3    161  0x4210
3    161  0x4220
2    421  0x0500
1    321  0x8830
1    006  0x0070
2    421  0x0510
3    161  0x4230
1    006  0x0080
Post Processed Data
PID 006  PID 128  PID 161  PID 321  PID 421
-------  -------  -------  -------  -------
0x0070   0x1230   0x4210   0x8820   0x0500
0x0080   0x1240   0x4220   0x8830   0x0510
                  0x4230
Hardware Simulations
[Diagram: the per-PID address streams feed the simulated caches (L2 0, L2 1, ... L2 N) of the architecture of choice]
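The post-processing step shown on this slide can be sketched as follows. This is a minimal illustration using the eleven records from the slide; the function names and the round-robin assignment of PIDs to simulated CPUs are assumptions, since the slide does not say how processes are assigned.

```python
# (cpu, pid, address) records exactly as listed under "Traced Data".
# PID "006" on the slide is written as the integer 6 here.
TRACED = [
    (0, 128, 0x1230), (0, 128, 0x1240), (1, 321, 0x8820),
    (3, 161, 0x4210), (3, 161, 0x4220), (2, 421, 0x0500),
    (1, 321, 0x8830), (1, 6, 0x0070), (2, 421, 0x0510),
    (3, 161, 0x4230), (1, 6, 0x0080),
]


def split_by_pid(records):
    """Turn an interleaved trace into one ordered address stream per PID."""
    streams = {}
    for _cpu, pid, address in records:
        streams.setdefault(pid, []).append(address)
    return dict(sorted(streams.items()))   # sorted by PID for stable output


def assign_round_robin(streams, num_cpus):
    """Assign each PID's stream to a simulated CPU (hypothetical policy)."""
    assignment = {cpu: [] for cpu in range(num_cpus)}
    for position, pid in enumerate(streams):
        assignment[position % num_cpus].append(pid)
    return assignment


if __name__ == "__main__":
    streams = split_by_pid(TRACED)
    for pid, addresses in streams.items():
        print(f"PID {pid:03d}:", " ".join(f"0x{a:04x}" for a in addresses))
    print(assign_round_robin(streams, num_cpus=2))
```

Because the original CPU number is discarded, the same per-PID streams can be replayed on a simulated machine with a different processor count, which is the sense in which the traces are architecture independent.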
Architecture Independent Traces
• Issues with trace technique
  - Post-L1 data is filtered
    - pre-L1 data is too dense to handle 30 second sample (100’s of GB)
    - compensate for L1 filter by flushing L1’s on context switch
    - capture all addresses accessed, not every access to each address
  - Increased process count for number of CPU’s
    - overloaded scheduler has high context switch rate
    - compensate by configuring “run to block” scheduling
  - I/O Service times skewed due to start/stop
  - Start/stop perturbs environment
    - minimal impact on sequence of physical addresses per process
CC-NUMA Software Scaling Issues
• Motivating Issues
  - Major HW issue: high interconnect latency
  - Major SW issue: long access time for shared data
  - Key scaling leverage: ** Interconnect operations **
  - Basic NUMA optimizations were already applied
• Classes of shared data
  - True sharing: locks, write shared data
  - False sharing: write shared data on cache line with read-only data
  - “Partner data”: data should be on same cache line
    - e.g. a lock structure and the data that it guards
• Approach
  - Find & fix all high frequency false sharing/partner data
  - Develop algorithmic changes to minimize true sharing
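One way to look for the sharing classes listed above in a trace is to group accesses by cache line and check which CPUs touch which addresses. The sketch below is an illustrative assumption, not the analysis the author used: the record format `(cpu, op, address)`, the 32-byte line size, the exact-address comparison, and the category names are all hypothetical.

```python
from collections import defaultdict

LINE_BYTES = 32  # assumed cache line size for this sketch


def classify_lines(accesses, line_bytes=LINE_BYTES):
    """Classify each touched cache line.

    accesses: iterable of (cpu, op, address) with op either "R" or "W".
    Returns {line_number: category}.
    """
    # line number -> address -> which CPUs touched it and whether it was written
    per_line = defaultdict(
        lambda: defaultdict(lambda: {"cpus": set(), "written": False})
    )
    for cpu, op, address in accesses:
        info = per_line[address // line_bytes][address]
        info["cpus"].add(cpu)
        if op == "W":
            info["written"] = True

    result = {}
    for line, words in per_line.items():
        line_cpus = set().union(*(w["cpus"] for w in words.values()))
        line_written = any(w["written"] for w in words.values())
        if len(line_cpus) < 2 or not line_written:
            # Touched by one CPU only, or never written: no coherence traffic.
            result[line] = "private or read-only"
        elif any(w["written"] and len(w["cpus"]) > 1 for w in words.values()):
            # Some written address is itself used by more than one CPU.
            result[line] = "true sharing"
        else:
            # Written data and other CPUs' data merely share the line.
            result[line] = "false sharing candidate"
    return result
```

Lines reported as false sharing candidates are the ones where moving data to a different line could remove interconnect operations without any algorithmic change; lines reported as true sharing would need the algorithmic work named in the approach.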
Interconnect Operation Trends
• Initial SW interconnect ops
  - 15,000/TPC-C (new order)
• Reduced (via simulation/analysis/prototype) to
  - 6,700/TPC-C (as measured via simulation)
• Actual system measurement
  - 6,600/TPC-C (with prototype changes productized)
• System performance improvement
  - 35% increase in TPM
• Areas where performance problems persisted:
  - I/O device drivers, controllers, etc
  - The main area ignored in simulation
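The figures on this slide can be turned into relative changes with a few lines of arithmetic. This is a small worked sketch; the three operation counts come from the slide, while the function names are introduced here for illustration.

```python
INITIAL_OPS = 15_000     # initial interconnect ops per TPC-C new-order transaction
SIMULATED_OPS = 6_700    # after tuning, as measured via simulation
MEASURED_OPS = 6_600     # after tuning, measured on the actual system


def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


def percent_difference(predicted, actual):
    """Percentage by which a prediction differs from the measured value."""
    return 100.0 * (predicted - actual) / actual


if __name__ == "__main__":
    print(f"simulated reduction: {percent_reduction(INITIAL_OPS, SIMULATED_OPS):.1f}%")
    print(f"measured reduction:  {percent_reduction(INITIAL_OPS, MEASURED_OPS):.1f}%")
    print(f"simulation vs. measurement: {percent_difference(SIMULATED_OPS, MEASURED_OPS):.2f}%")
```

The interconnect operation count fell by roughly 55 to 56 percent, and the simulated figure is within about 1.5 percent of the measured one.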
Additional Benefits of Techniques
• Simulation/analysis feedback to HW design
  - cache geometries
  - protocol optimizations
  - HW buffers and other low-level resource tuning
• Framework for studying advanced architecture design
  - Supporting coarse grain block-diagram tradeoffs
  - Early positioning of product performance
  - Understanding other OS/SW issues with CC-NUMA and high processor count SMP
Tracing and Characterization of NT-based System Workloads
J. Casmira, D. Kaeli Northeastern University
D. Hunter Digital Equipment Corp.
Outline
• Overview
• Workloads
• Results
• Conclusions
Overview
• Issues with trace-driven simulation
  – results only as good as input trace (GIGO)
  – typically only capture application behavior
• Existing trace tools
  – Shade
  – ATOM
  – SimOS
Current Technology
• Trace driven studies using OS-rich traces
  – ISCA96 27% ; ISCA97 16%
  – HPCA96 0% ; HPCA97 8%
Workload Instruction Counts
[Figure: bar chart. x-axis: Applications (idea, cdplay, OQ1 to OQ5); y-axis: Instruction Count (0 to 12,000,000); series: App Only, App & DLL, App & OS]
What is PatchWrx?
• Dynamic Execution Tracing Tool Suite
  – system instrumentation
  – trace capture
  – stream reconstruction
• DEC Alpha 21064 Windows NT platforms
• Low overhead with minimum slowdown
  – 2X when instrumented; 4X while tracing
How Does PWX Work?
• Instrument NT binary images
• Using DEC Alpha PALcalls
  – reserve trace buffer at boot time
  – log branch instruction trace entries
• Using instrumented images & trace log, reconstruct original stream
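The reconstruction step can be illustrated with a heavily simplified sketch. It is not PatchWrx's actual log format or algorithm: it assumes each log entry is simply the start address of the next executed basic block, and that a table of basic-block lengths has already been extracted from the images. `block_lengths`, `branch_log`, and both function names are hypothetical.

```python
INSTR_BYTES = 4  # Alpha instructions are fixed-width 32-bit words


def reconstruct_stream(block_lengths, branch_log):
    """Expand a log of executed basic blocks into a full instruction address stream.

    block_lengths: {block start address: number of instructions in the block}
    branch_log:    start addresses of the blocks in execution order
    """
    stream = []
    for start in branch_log:
        length = block_lengths[start]
        # Instructions inside a basic block execute sequentially.
        stream.extend(start + i * INSTR_BYTES for i in range(length))
    return stream


def average_block_size(block_lengths, branch_log):
    """Dynamic average basic block size, in instructions."""
    if not branch_log:
        raise ValueError("branch_log is empty")
    return sum(block_lengths[start] for start in branch_log) / len(branch_log)
```

The point of the technique is that only control transfers need to be logged at run time; the sequential instructions between them are recovered offline from the binary image, which keeps the trace buffer small.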
Workloads
• BYTEmark benchmarks
  – typical “industry standard” benchmark
• MS Internet Explorer
  – web-browser application
• MS CD Player
  – NT packaged utility/application
• Oracle 7.3
  – 3rd party NT database
Characteristics
• Instruction Counts and Basic Block Sizes
• Instruction cache performance
• Instruction mix
• Application only
• Application and DLLs
• Application, DLLs, and OS
Average Basic Block Sizes
[Figure: bar chart. x-axis: Applications (idea, string, neural, float, assign, cdplay, OQ1 to OQ5); y-axis: Avg. Size in Instructions (0 to 30); series: App Only, App & DLL, App & OS]
Cache Miss Rates
[Figure: bar chart. x-axis: Applications, Cache Sizes (neural, cdplay, OQ2, OQ5, each at 8k and 128k); y-axis: Miss Rate (0 to 14); series: App Only, App & DLL, App & OS]
Workload Instruction Composition
[Figure: stacked bar chart. Bars: App & OS and App Only for each of neural, cdplay, OQ2, OQ5; axis: Percent Composition (0% to 100%); series: BSR/JSR, BR, BRXX, LD/ST, OTHER]
Summary
• OS can dominate execution in commercial applications
• OS reduces the average basic block length
• OS can dramatically change the cache behavior
• OS can significantly alter the instruction mix
• OS must be included in trace-driven simulations to provide an accurate picture of application execution
Future Work
• Full D-Stream Reconstruction
• FX!32
• Multiprocessor Traces
• Microsoft Windows NT 5.0
• DEC Alpha 21164
Analysis of Commercial and Technical Workloads on AlphaServer Platforms
Zarka Cvetanovic
Digital Equipment Corporation
February 1, 1998
Goals
◆ highlight differences between commercial and technical workloads on AlphaServers
◆ identify architectural components that are important for commercial performance
Introduction
◆ systems: AlphaServer 4100, 8400
◆ tools: CPU/platform performance counters
◆ workloads:
  ◆ commercial: TPC-C, SPECweb96, Laddis
  ◆ technical: SPEC95 (rates, parallel), NAS Parallel, Streams
Cycles Per Instruction (CPI)
◆ CPI higher in commercial than the majority of technical
◆ several technical (tomcatv, hydro2d) have as high CPI as commercial
[Figure: RH466 CPI. Horizontal bar chart, axis 0 to 4. Workloads: SPECweb96, TPC-C, Laddis; SPECfp95_para (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5); SPECrate_int95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex); SPECrate_fp95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5)]
Issuing and Stall Time
◆ issuing time
  ◆ comparable single and dual issuing time
  ◆ no triple/quad issuing in commercial (no fp)
◆ stall time
  ◆ higher in commercial than SPECint95
  ◆ SPECfp95: comparable
◆ frozen stalls (Dstream) higher than dry (Istream)
[Figure: RH466 Percentage Stall/Issuing Time. Stacked horizontal bars, axis 0 to 110. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: frozen stall, dry stall, quad issue, triple issue, dual issue, single issue]
Memory Barrier Time
◆ MB time high in commercial
◆ MBs have little effect on SPEC95
  ◆ except parallel: still lower than commercial
[Figure: RH466 Percentage Memory Barrier Cycles. Horizontal bars, axis 0 to 14. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs]
Cache Misses
◆ high SC misses in commercial (Bcache bandwidth important)
◆ other caches:
  ◆ IC misses higher in commercial (and int95)
  ◆ DC misses higher in SPECfp95 than commercial
  ◆ BC misses higher in SPECfp95 than commercial
[Figure: RH466 Cache Misses per 1K instructions. Stacked horizontal bars, axis 0 to 180. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: BC miss, SC miss, DC miss, IC miss]
Replay Traps and Mispredicts
◆ Replays:
  ◆ LDU replays high in commercial (and SPECint95)
  ◆ WB_MAF_FULL replays higher in SPECfp95 than commercial
◆ branch/PC mispredicts
  ◆ higher in SPECint95 than commercial
[Figure: RH466 Traps/Mispredicts per 1K instructions. Stacked horizontal bars, axis 0 to 180. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: litmus.trap, PC.mispr, branch.mispr, WB_MAF.replay, LDU.replay]
Branch Mispredicts
◆ branch mispredicts not crucial for commercial performance:
  ◆ number of branches and mispredicts in commercial is comparable to SPECint95
[Figure: Branch and Branch-Mispredict per 1K instructions. Horizontal bars, axis 0 to 300. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: branch, branch.mispr]
TB Misses
◆ TB misses not crucial for commercial performance:
  ◆ several technical workloads have higher DTB misses than commercial
  ◆ ITB misses low
[Figure: RH466 TB Misses Per 1K Instructions. Horizontal bars, axis 0 to 5. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: ITB.miss, DTB.miss]
Instruction profiles
◆ commercial profiles comparable to SPECint95:
  ◆ no fp instructions
  ◆ ~25% loads
  ◆ ~10% stores
  ◆ ~50% integer
  ◆ ~15% branches
[Figure: RH466 Instruction Types. Stacked horizontal bars, axis 0 to 110. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series labels (some truncated in the source): ldlk.i, jsr.re, branch, float., int.op, loads, stores]
System Requests
◆ commercial: high sharing (ReadDirty and Invalidate)
◆ parallel: high sharing in several workloads
◆ rates:
  ◆ no sharing
  ◆ high bus bandwidth requirements
[Figure: RH466 System Requests per 1K instructions. Horizontal bars, axis 0 to 13. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: BC miss, SYS.read_dirty+set_shared, SYS.invalidate]
Memory Bus Bandwidth
◆ AlphaServer 8400:
  ◆ 12 CPUs
◆ commercial:
  ◆ lower bus traffic than technical multistream
  ◆ not affected significantly by bank conflicts (technical affected profoundly)
[Figure: TLSB Bandwidth (MB/s). Horizontal bars, axis 0 to 1400. Workloads: TPC-C, Linpack, BT, LU, SP, EP, tomcatv, swim, hydro2d, tomcatv_p, swim_p, hydro2d_p, Streams]
Bus Requests
◆ commercial:
  ◆ high shared traffic on the bus (Shared Writes)
  ◆ Read/Victim traffic lower than technical
◆ technical
  ◆ parallel: high shared traffic
  ◆ multistream: high bandwidth (no sharing)
[Figure: Bus Requests (M/s). Stacked horizontal bars, axis 0 to 22. Workloads: TPC-C, Linpack, BT, LU, SP, EP, tomcatv, swim, hydro2d, tomcatv_par, swim_par, hydro2d_par, Streams. Series: Write, Write CSR, Victim, Read, Read CSR]
Time-Allocation Model
◆ model derived from measured events
◆ high stall components in commercial:
  ◆ S-to-Bcache
  ◆ B-to-memory
  ◆ MB
[Figure: Percentage Stall Components. Stacked horizontal bars, axis 0 to 100. Workloads: SPECweb96, TPC-C, Laddis, and the SPECfp95_para, SPECrate_int95, and SPECrate_fp95 programs. Series: Others (reg conflict + unit busy), MB, B-cache miss to memory, S-cache miss to B-cache, D-cache miss to S-cache, I-cache miss to S-cache, Litmus, WB/MAF replay traps, LDU replay traps, Branch + PC mispredicts]
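A time-allocation model of this kind can be sketched by charging each measured event a fixed cycle penalty and expressing the result as a share of total cycles. The slides do not give the formula or the penalties, so everything below is an assumption for illustration: the function name, the event names, the counts, the penalty values, and the CPI are all hypothetical.

```python
def stall_breakdown(events_per_1k_instr, penalty_cycles, cpi):
    """Attribute a percentage of total cycles to each event type.

    events_per_1k_instr: {event name: events per 1000 instructions}
    penalty_cycles:      {event name: assumed stall cycles per event}
    cpi:                 measured cycles per instruction
    """
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    total_cycles_per_1k = cpi * 1000.0
    breakdown = {}
    for name, count in events_per_1k_instr.items():
        breakdown[name] = 100.0 * count * penalty_cycles[name] / total_cycles_per_1k
    # Whatever is not explained by the listed events (issue time, other stalls).
    breakdown["unattributed"] = 100.0 - sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the arithmetic.
    events = {"scache_miss": 20, "bcache_miss": 5}
    penalties = {"scache_miss": 10, "bcache_miss": 80}
    for name, share in stall_breakdown(events, penalties, cpi=2.0).items():
        print(f"{name:14s} {share:5.1f}%")
```

A fixed penalty per event ignores overlap between outstanding misses, so a model like this is best read as an upper bound on each component rather than an exact attribution.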
Summary/Conclusions
◆ Key factors for commercial performance:
  ◆ high Bcache latency/bandwidth
    ◆ 96KB cache not sufficient
  ◆ low latency data sharing
    ◆ (ReadDirty/Invalidate) on the bus
  ◆ efficient Memory Barriers
    ◆ efficient locks implementation
  ◆ low CPU time per I/O
Acknowledgments
◆ Thanks to John Shakshober, Huy Phan, Dave Wilson, Paula Smith, Judy Piantedosi for help with profiling data collection
Characterizing TPC-D on a MIPS R10K Architecture
Qiang Cao, Pedro Trancoso, Josep Lluis Larriba-Pey, Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya
2/1/98
Topics Covered
❚ TPC-D Benchmark, R10K processor
❚ Query cache misses
❚ Scaling
❚ Operation Cost
❚ Indexing
TPC-D Benchmark
❚ Decision support benchmark
❚ [??] queries, including two update queries
❚ Complex queries:
  ❙ multi-table joins
  ❙ extensive sorting, grouping and aggregation
  ❙ sequential scans
❚ Running on Postgres95
R10K processor
❚ Four-issue superscalar processor
❚ Two performance counters measure up to [??] events (cycles, L1/L2 instruction/data cache misses, etc.)
❚ Events are measured per process
❚ Save time over simulation
SGI Origin 200
❚ Scalable Shared memory
❚ [?] processors
❚ [???] MB main memory
❚ 32 KB L1 instruction cache and 32 KB L1 data cache
❚ [?] MB unified L2 instruction/data cache
Query Cache Misses
❚ Some queries (Q[?], Q[?]) have more misses
[Figure: Total Number of Misses. Stacked bars per query (Q1 to Q11, Q13, Q16); y-axis 0.0E+00 to 1.6E+09; series: L2 Data, L1 Data, L2 Inst, L1 Inst]
Query Cache Misses
❚ L1 instruction misses dominate by far
[Figure: Normalized Total Number of Misses. Stacked bars per query (Q1 to Q11, Q13, Q16); y-axis 0% to 100%; series: L2 Data, L1 Data, L2 Inst, L1 Inst]
Query Cache Misses
❚ Cache penalty has significant effect on total execution time
[Figure: Normalized Execution Time. Stacked bars per query (Q1 to Q11, Q13, Q16); y-axis 0% to 100%; series: Cache, Non-Cache]
Scaling
❚ TPC-D specifies a scale factor of 1 to 1000 (1 GB to 1 TB database)
❚ Demanding space and time requirement for each run
❚ Most research studies use scale factor [???]
Scaling
❚ Cache misses of some queries increase proportionally with the scale factor. Example: Q1
Q1 misses, normalized to the 10MB database
             10MB Instr  10MB Data  100MB Instr  100MB Data
L1 Misses    1.0         1.0        6.4          8.7
L2 Misses    1.0         1.0        7.3          7.2
Scaling
❚ Other queries demonstrate much higher misses than the scale factor. Example: Q11
Q11 misses, normalized to the 10MB database
             10MB Instr  10MB Data  100MB Instr  100MB Data
L1 Misses    1.0         1.0        66.5         62.4
L2 Misses    1.0         1.0        75.0         22.7
❚ Queries behave differently with the data size change. Hard to scale down accurately
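The scaling charts can be summarized by dividing each miss growth factor by the growth in data size. The sketch below uses the chart values for Q1 and Q11 (100MB misses normalized to the 10MB run); the dictionary layout and function name are introduced here for illustration.

```python
DATA_GROWTH = 10.0   # database grows from 10MB to 100MB

# Miss counts at 100MB relative to the 10MB run, as read from the charts.
MISS_GROWTH = {
    "Q1":  {"L1 Instr": 6.4,  "L1 Data": 8.7,  "L2 Instr": 7.3,  "L2 Data": 7.2},
    "Q11": {"L1 Instr": 66.5, "L1 Data": 62.4, "L2 Instr": 75.0, "L2 Data": 22.7},
}


def growth_relative_to_data(miss_growth, data_growth=DATA_GROWTH):
    """Ratio of miss growth to data growth; 1.0 means misses scale with the data."""
    return {
        query: {kind: factor / data_growth for kind, factor in kinds.items()}
        for query, kinds in miss_growth.items()
    }


if __name__ == "__main__":
    for query, kinds in growth_relative_to_data(MISS_GROWTH).items():
        summary = ", ".join(f"{kind} {ratio:.2f}x" for kind, ratio in kinds.items())
        print(f"{query}: {summary}")
```

Q1 stays within 0.64x to 0.87x of the data growth, while Q11 grows between roughly 2.3x and 7.5x faster than the data, which is why a single scaled-down database cannot stand in for every query.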
Operation Cost
❚ In some queries, the cost of scan is small
Operation Misses
         L1 Inst  L2 Inst  L1 Data  L2 Data
Q1       100%     100%     100%     100%
Q1_1     66%      94%      62%      56%
Q1_2     19%      26%      25%      22%
[Diagram: Q1 query tree with operators Sort, Aggre, Group, Sort, SeqScan; the labels Q1, Q1_1, and Q1_2 mark portions of the tree]
Operation Cost
❚ In some queries, the cost of scan dominates
❚ Conclusion: Need to simulate whole query tree
Operation Misses
         L1 Inst  L2 Inst  L1 Data  L2 Data
Q6       100%     100%     100%     100%
Q6_1     85%      66%      71%      64%
[Diagram: Q6 query tree with operators Aggre, SeqScan; the labels Q6 and Q6_1 mark portions of the tree]
Indexing
❚ How does the index structure affect the cache misses?
❚ Complicated indexing structure causes the index scan to suffer more cache misses than the sequential scan
[Figure: L1 Cache Misses. Bars for L1 Inst and L1 Data; y-axis 0.0E+00 to 1.0E+06; series: SeqScan, IndxScan]
[Figure: L2 Cache Misses. Bars for L2 Inst and L2 Data; y-axis 0.0E+00 to 2.5E+04; series: SeqScan, IndxScan]
Indexing
❚ A Modified Q[?] with higher selectivity shows fewer data cache misses for index scan
❚ Optimizer needs to use selectivity factor to choose optimal access method for cache misses
[Figure: L1 Cache Misses. Bars for L1 Inst and L1 Data; y-axis 0.0E+00 to 2.0E+05; series: SeqScan, IndxScan]
[Figure: L2 Cache Misses. Bars for L2 Inst and L2 Data; y-axis 0.0E+00 to 5.0E+03; series: SeqScan, IndxScan]
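The idea of letting the optimizer pick an access method from the selectivity factor can be sketched with a toy cost model. This is not the Postgres optimizer and not a model given in the slides: it assumes a sequential scan costs a fixed number of misses per row over the whole table, while an index scan costs a larger number of misses per qualifying row only. The function name and all parameters are hypothetical.

```python
def choose_scan(total_rows, selected_fraction,
                seq_misses_per_row, index_misses_per_lookup):
    """Pick the access method with the lower estimated cache-miss count.

    selected_fraction: fraction of rows that satisfy the predicate (0 to 1).
    Returns (method name, estimated misses).
    """
    if not 0.0 <= selected_fraction <= 1.0:
        raise ValueError("selected_fraction must be in [0, 1]")
    # Sequential scan touches every row regardless of the predicate.
    seq_cost = total_rows * seq_misses_per_row
    # Index scan pays a higher per-row price, but only for qualifying rows.
    index_cost = total_rows * selected_fraction * index_misses_per_lookup
    if index_cost < seq_cost:
        return "index scan", index_cost
    return "sequential scan", seq_cost
```

With these assumed costs, `choose_scan(1000, 0.5, 1.0, 4.0)` prefers the sequential scan and `choose_scan(1000, 0.01, 1.0, 4.0)` prefers the index scan, mirroring the two indexing slides: the index scan only pays off when the predicate is selective enough.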
Conclusions
❚ Instruction misses (L1 especially) dominate
  ❙ Instruction misses should not be neglected in simulation
❚ Different queries have different scaling behavior
  ❙ Hard to scale down the data accurately
❚ Operation other than scan can cause many misses
  ❙ Simulation of the whole tree is necessary
❚ Index Scan can increase cache misses
  ❙ Selectivity factor should be used to choose optimal scan method
Future Work
❚ More experiments on larger data size
❚ The effect of initial data allocation
❚ Interaction of multiple queries