1
Cache Performance for Multimedia Applications
Nathan Slingerland
nslingerland@apple.com
Apple Computer
Alan Jay Smith
smith@cs.berkeley.edu
University of California at Berkeley
2
Introduction
• Few studies of multimedia cache behavior, but often characterized as:
  • High instruction reference locality; small, tight loops
  • Very large data sets ("streaming")
  • Poor data cache performance due to non-locality of data references
• Measurements?
[Figure: MPEG frame sequence I B B P B B I]

Format   Resolution   MB per video frame
DVD      720x480      1.0
720P     1280x720     2.6
1080I    1920x1080    5.9
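The MB-per-frame figures on this slide are consistent with uncompressed frames at 3 bytes per pixel, counted in units of 2^20 bytes. The slide does not state the pixel format, so both of those are assumptions; under them, a quick check looks like this:

```python
# Assumptions (not stated on the slide): 24-bit color, i.e. 3 bytes per pixel,
# and "MB" meaning 2**20 bytes.
BYTES_PER_PIXEL = 3
MIB = 2 ** 20

FORMATS = {
    "DVD": (720, 480),
    "720P": (1280, 720),
    "1080I": (1920, 1080),
}


def frame_size_mib(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Uncompressed size of one video frame, in units of 2**20 bytes."""
    return width * height * bytes_per_pixel / MIB


for name, (w, h) in FORMATS.items():
    print(f"{name:6s} {w}x{h}: {frame_size_mib(w, h):.1f} MB per frame")
```

Even a single DVD frame is about as large as a 1 MB cache, which is why video is usually described as a streaming workload.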
3
Overview
• Berkeley Multimedia Workload
• Analysis - Cache Simulation
  • Capacity: 1KB - 2MB
  • Line Size: 16B - 256B
  • Associativity: 1, 2, 4 and 8-way
• Result
  • When compared to other types of workloads, multimedia has comparable instruction and data cache miss ratios.
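To get a feel for the size of the design space, the sketch below enumerates the cache configurations implied by the ranges above. It assumes capacities and line sizes step in powers of two, which matches the axis ticks and table columns later in the talk but is not stated on this slide.

```python
from itertools import product

# Assumption: capacities and line sizes are swept in powers of two.
capacities = [2 ** k for k in range(10, 22)]   # 1KB .. 2MB
line_sizes = [2 ** k for k in range(4, 9)]     # 16B .. 256B
associativities = [1, 2, 4, 8]

# Every (capacity, line size, associativity) combination in the sweep.
configs = list(product(capacities, line_sizes, associativities))

print(len(capacities), "capacities x", len(line_sizes), "line sizes x",
      len(associativities), "associativities =", len(configs), "configurations")
```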
4
Berkeley Multimedia Workload

Name          Description
Rasta         Speech recognition
Rsynth        Klatt speech synthesizer
GSM           European GSM 06.10 speech compression
JPEG          DCT based lossy image compression
DjVu          AT&T IW44 wavelet image compression
Ghostscript   Postscript document viewing/rendering
POVray        Persistence of Vision ray tracer
Mesa          OpenGL 3D rendering API clone - demos
Doom          Classic first person shooter video game
ADPCM         IMA ADPCM audio compression
LAME          MPEG-1 Layer III (MP3) audio encoder
mpg123        MPEG-1 Layer III (MP3) audio decoder
Timidity      MIDI music synthesis with GUS instruments
MPEG-2        MPEG-2 video decoding and encoding

Domains: 3D Graphics, Document, Audio, Speech, Video
5
Other Workloads
• SPEC95*
  • Uniprogrammed
• SPEC92 [Gee93]
  • Uniprogrammed
• Multiprogramming Workload [Borg90]
  • Very long (up to 12 billion references) traces from Titan RISC architecture
• Design Target Miss Ratios [Smith85]
  • Synthesized from hardware monitor and trace simulation measurements
• VAX 11/780, VAX 8800 [Clark83], [Clark88]
  • Hardware monitor miss ratio measurements for time shared engineering workload.
• Mul3 [Agarwal88]
  • Sampled and stitched traces (originally ~400K references) from ATUM tracing tool.
• Amdahl 470 [Smith82]
  • Hardware monitor measurements taken at Amdahl on a 470V for a standard internal benchmark.
6
Methodology
• Execution driven cache simulation
  • Modified version of LibCheetah simulator
  • DEC's ATOM toolkit used to instrument multimedia application binaries
• Very long traces
  • Each application run to completion.
  • Traces of 50 million to 100+ billion instruction references
• Multiprogramming simulated for multimedia workload (cache flushing)
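The idea of simulating multiprogramming by flushing can be illustrated with a small set-associative LRU cache model. This is a minimal sketch, not the modified LibCheetah simulator: the class and function names are hypothetical, and the flush is triggered every `flush_interval` references, whereas the study's intervals are expressed in instructions.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """LRU set-associative cache model that only counts hits and misses."""

    def __init__(self, capacity, line_size, assoc, flush_interval=None):
        self.line_size = line_size
        self.assoc = assoc
        self.num_sets = capacity // (line_size * assoc)
        self.flush_interval = flush_interval  # in references (an assumption)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.refs = 0
        self.misses = 0

    def flush(self):
        """Invalidate every line, as a stand-in for a context switch."""
        for s in self.sets:
            s.clear()

    def access(self, address):
        """Reference one byte address; return True on a hit."""
        if (self.flush_interval and self.refs
                and self.refs % self.flush_interval == 0):
            self.flush()
        self.refs += 1
        line = address // self.line_size
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)        # mark as most recently used
            return True
        self.misses += 1
        if len(s) >= self.assoc:
            s.popitem(last=False)      # evict the least recently used line
        s[line] = True
        return False

    @property
    def miss_ratio(self):
        return self.misses / self.refs if self.refs else 0.0


def sweep_miss_ratio(flush_interval):
    """Sweep a 512-byte buffer four times through a 1KB, 32B-line, 2-way cache."""
    cache = SetAssociativeCache(1024, 32, 2, flush_interval)
    for _ in range(4):
        for address in range(0, 512, 4):
            cache.access(address)
    return cache.miss_ratio


print("no flushing:       ", sweep_miss_ratio(None))
print("flush every 128 refs:", sweep_miss_ratio(128))
```

In this toy trace the buffer fits in the cache, so without flushing only the first sweep misses; flushing at every sweep boundary makes each sweep pay the misses again.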
7
Long Traces
• Cache behavior varies during execution
• Cold start effects
  • All initial accesses are compulsory misses
  • These can dominate if traces are too short
[Figure panels: MPEG-2 DVD Encode; POVray]
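The cold start effect can be seen with a cache that never evicts, so that every miss is a compulsory (first-touch) miss. The trace below is hypothetical and purely illustrative; it is not one of the workload traces.

```python
def cumulative_miss_ratio(trace, line_size=32):
    """Miss ratio after each reference for an infinite cache,
    where every miss is a compulsory (first-touch) miss."""
    seen = set()
    misses = 0
    ratios = []
    for i, address in enumerate(trace, start=1):
        line = address // line_size
        if line not in seen:
            seen.add(line)
            misses += 1
        ratios.append(misses / i)
    return ratios


# Hypothetical trace: sweep the same 4 KB buffer 100 times, 4 bytes at a time.
trace = [addr for _ in range(100) for addr in range(0, 4096, 4)]
ratios = cumulative_miss_ratio(trace)

for n in (1_000, 10_000, len(trace)):
    print(f"first {n:>7,} refs: miss ratio {ratios[n - 1]:.5f}")
```

The number of compulsory misses is fixed by the working set, so their share of the miss ratio shrinks only as the trace gets longer; a short trace reports a miss ratio dominated by them.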
8
Average Context Switch Intervals
• Berkeley Multimedia Workload scheduling behavior not realistic.
• Windows NT and 2000 maintain a variety of system event performance counters †. Modified version of PDHTest tool used for our measurements.
• Thread Counter Events:
  • Privileged Time, User Time (Cycles)
  • Context Switch Count
  • Priority
  • State
$t_{context} = \frac{t_{user} + t_{system}}{\text{context switches}}$
† Counters are described in Microsoft Systems Journal, March 1996, April 1996, March 1998, May 1998 issues.
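The average context interval is the total time a thread ran (user plus privileged) divided by the number of context switches. A minimal sketch follows; the function name and the counter values in the example are hypothetical, not measurements from the study.

```python
def average_context_interval(user_cycles, privileged_cycles, context_switches):
    """Average number of cycles a thread runs between context switches."""
    if context_switches <= 0:
        raise ValueError("need at least one context switch")
    return (user_cycles + privileged_cycles) / context_switches


# Hypothetical counter readings for one thread (not measured values).
interval = average_context_interval(
    user_cycles=9_000_000_000,
    privileged_cycles=1_000_000_000,
    context_switches=2_000,
)
print(f"{interval:,.0f} cycles per context interval")
```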
9
Context Switch Intervals

Name                                  Context Interval (Cycles)
MediaPlayer GSM 06.10                   297,641
DjVushop Document Compression         4,754,521
Audio Compositor MIDI Synthesizer     3,675,086
Audio Catalyst v2.1 MP3 Encoder       3,358,692
Dragon Naturally Speaking Preferred   2,560,537
Ghostscript Postscript Previewer      1,227,194
3D Maze OpenGL Screen Saver           5,930,096
POVray v3.1g Raytracer                5,928,433
Avi2Mpg2 MPEG-2 Encoder               5,339,432
Quake III Arena (Demo)                4,284,671
Irfanview v3.15 Image Viewer          3,821,284
PowerDVD v2.55 DVD Player             1,189,234
WinDVD v2.0 DVD Player                  921,510
MediaPlayer IMA ADPCM                   708,037
Narrator Text to Speech                 594,438
3D Pipes OpenGL Screen Saver            567,080
K-Jofol 2000 MP3 Player v1.0            360,336
Real Jukebox v1.0.0.488 MP3 Player       58,399
RealPlayer v7.0 Real Audio Player        40,396
3D Flowerbox OpenGL Screen Saver         23,653

500 MHz AMD Athlon system, 256 MB RAM, Windows 2000 v5.00.2195
Domains: 3D Graphics, Document, Audio, Speech, Video
10
Cache Flush Intervals

Name            Cache Flush Interval (Instructions)
MPEG-2 Encode   5,339,432
Rasta           2,560,537
Rsynth            594,438
GSM               297,641
JPEG            3,821,284
DjVu            4,754,521
Ghostscript     1,227,194
POVray          5,928,433
Mesa            2,173,610
Doom            4,284,671
ADPCM             708,037
LAME            3,358,692
mpg123          1,554,505
Timidity        3,675,086
MPEG-2 Decode   1,055,372

• Simulation cache flush intervals based on average of measured context intervals for similar Windows applications
• Cycles converted to µOps to correspond more closely to DEC Alpha RISC instructions
Domains: 3D Graphics, Document, Audio, Speech, Video
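The averaging step can be checked against the measured context intervals from the previous slide. Which Windows applications were grouped under each benchmark is not stated in the talk; the two groupings below are an inference from the numbers and should be read as an assumption.

```python
from statistics import mean

# Measured context intervals (cycles) from the Context Switch Intervals slide.
measured = {
    "PowerDVD v2.55 DVD Player": 1_189_234,
    "WinDVD v2.0 DVD Player": 921_510,
    "3D Maze OpenGL Screen Saver": 5_930_096,
    "3D Pipes OpenGL Screen Saver": 567_080,
    "3D Flowerbox OpenGL Screen Saver": 23_653,
}

# Assumed groupings of "similar" applications (inferred, not given in the talk).
similar_apps = {
    "MPEG-2 Decode": ["PowerDVD v2.55 DVD Player", "WinDVD v2.0 DVD Player"],
    "Mesa": [
        "3D Maze OpenGL Screen Saver",
        "3D Pipes OpenGL Screen Saver",
        "3D Flowerbox OpenGL Screen Saver",
    ],
}


def flush_interval(benchmark):
    """Average the measured intervals of the applications grouped under a benchmark."""
    return round(mean(measured[app] for app in similar_apps[benchmark]))


for name in similar_apps:
    print(f"{name}: {flush_interval(name):,}")
```

Under these groupings the averages reproduce the MPEG-2 Decode and Mesa entries of the flush interval table.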
11
Capacity: Unified Cache
[Figure: miss ratio (0.00 to 0.10) vs. cache size (1K to 2M bytes); 32B lines, 2-way associativity. Series: Multimedia, Agarwal Mul3 [Agar88], DTMR [Smit87], SPEC92 [Gee93], SPEC95, 470 User [Smit82], 470 Supervisor [Smit82], VAX 780 [Clark83], VAX 8800 [Clark88]]
12
Capacity: Instruction Cache
[Figure: miss ratio (0.00 to 0.10) vs. cache size (1K to 2M bytes); 32B lines, 2-way associativity. Series: Multimedia, Mult [Borg90], DTMR [Smit87], SPEC92 [Gee93], SPEC95]
13
Capacity: Data Cache
[Figure: miss ratio (0.00 to 0.25) vs. cache size (1K to 2M bytes); 32B lines, 2-way associativity. Series: Multimedia, Mult [Borg90], DTMR [Smit87], SPEC92 [Gee93], SPEC95]
14
Why?
• Many building blocks (e.g. DCT, FFT) internally re-reference the same data
• Even if an array is simply traversed in memory order there is a benefit from long cache line "prefetch" effect
• Multimedia data types are narrow, so more elements fit in a cache line.
• Some of the comparison studies are older and were done on machines with much longer cycle times → shorter context intervals. We expect newer studies to exhibit lower miss ratios.
15
Capacity: Multimedia Domains
[Figure: miss ratio (0.00 to 0.12) vs. cache size (1K to 2M bytes); data cache, 32B lines, 2-way associativity. Series: Audio, Speech, Document, Video, 3D GFX]
16
Line Size - Uniprocessor
• Optimal line size on UP system minimizes avg memory delay, t_avg
  • m(L) - miss ratio for line size, L (bytes)
  • t_line - time to fetch a cache line
  • d - data path width (bytes)
  • t_latency - memory transaction delay (sec)
  • r_xfer - bus bandwidth (bytes per sec)
• Instruction Cache Lines: as large as 256B
• Data Cache Lines: dependent on capacity - 128B for 32KB cache
• MP issues not considered (total memory traffic, bus busy periods)
$t_{line} = t_{latency} + \frac{L}{r_{xfer}}$
Multimedia - average memory delay (ns), by Instruction Cache Block Size (bytes)

Size    16        32        64        128       256
1K      6.22630   3.79536   2.68253   1.96086   1.74648
2K      3.13298   1.94226   1.40646   1.01556   0.93899
4K      1.67616   1.08187   0.81106   0.60246   0.57495
8K      0.95800   0.64909   0.46229   0.35912   0.33620
16K     0.43464   0.28182   0.19453   0.15618   0.15525
32K     0.16759   0.10355   0.07412   0.05721   0.04810
64K     0.09868   0.05709   0.03657   0.02492   0.01902
128K    0.07714   0.04281   0.02534   0.01516   0.01016
256K    0.07514   0.04126   0.02407   0.01393   0.00897
512K    0.07496   0.04110   0.02392   0.01379   0.00883
1M      0.07496   0.04110   0.02392   0.01379   0.00882
2M      0.07496   0.04110   0.02392   0.01379   0.00882

Multimedia - average memory delay (ns), by Data Cache Block Size (bytes)

Size    16         32        64         128        256
1K      10.70698   9.09775   10.17477   14.66781   24.87838
2K       8.04009   6.20969    6.31052    8.56975   14.26147
4K       6.15572   4.35004    3.83097    4.40862    6.87190
8K       4.64852   3.06934    2.40616    2.44560    3.27708
16K      3.48517   2.24199    1.61893    1.51314    1.86139
32K      2.81276   1.77501    1.24335    1.05964    1.14475
64K      2.44197   1.49988    1.03259    0.82706    0.78846
128K     2.30867   1.38347    0.91999    0.71591    0.65241
256K     2.23803   1.31936    0.85758    0.64236    0.57225
512K     2.18862   1.27134    0.80799    0.58706    0.50072
1M       1.94165   1.02288    0.55434    0.31764    0.19900
2M       1.93021   1.01452    0.54805    0.31211    0.19121
t_latency = 109.7 ns, r_xfer = 1182.92 MB/s, d = 8 bytes
500 MHz AMD Athlon system, 2-way associativity

$t_{avg} = t_{line} \cdot m(L)$
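Choosing the line size then amounts to picking, for each capacity, the column with the smallest average memory delay. The sketch below does this for four rows copied from the data cache delay table on this slide; the helper name is hypothetical.

```python
LINE_SIZES = (16, 32, 64, 128, 256)  # bytes

# Average memory delay (ns) per data cache capacity, one value per line size.
# Rows copied from the data cache table on this slide.
DATA_CACHE_DELAY_NS = {
    "1K": (10.70698, 9.09775, 10.17477, 14.66781, 24.87838),
    "4K": (6.15572, 4.35004, 3.83097, 4.40862, 6.87190),
    "32K": (2.81276, 1.77501, 1.24335, 1.05964, 1.14475),
    "64K": (2.44197, 1.49988, 1.03259, 0.82706, 0.78846),
}


def best_line_size(delays):
    """Return the line size (bytes) with the lowest average memory delay."""
    index = min(range(len(delays)), key=lambda i: delays[i])
    return LINE_SIZES[index]


for capacity, delays in DATA_CACHE_DELAY_NS.items():
    print(f"{capacity:>4} data cache: best line size {best_line_size(delays)}B")
```

For the 32K row the minimum falls at 128B, matching the bullet above, and the preferred line size grows with capacity.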
17
Associativity
• Miss ratio spread
  • measures the benefit of increasing associativity.
  • m(A=n) - miss ratio for n-way associativity, A
• Increased associativity is more useful for instruction caches than for data caches.
• 2-way or 4-way associativity offers the greatest relative benefit.
$\text{MissRatioSpread} = \frac{m(A=n) - m(A=2n)}{m(A=2n)}$
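As a worked example, the miss ratio spread can be computed for each doubling of associativity, taking it to be the relative drop in miss ratio measured against the more associative cache. The miss ratios below are hypothetical values chosen for illustration, not results from the study.

```python
def miss_ratio_spread(miss_ratio_n, miss_ratio_2n):
    """Relative change in miss ratio when associativity doubles from n to 2n."""
    return (miss_ratio_n - miss_ratio_2n) / miss_ratio_2n


# Hypothetical miss ratios keyed by associativity (illustrative only).
miss_ratio = {1: 0.040, 2: 0.030, 4: 0.028, 8: 0.027}

for n in (1, 2, 4):
    spread = miss_ratio_spread(miss_ratio[n], miss_ratio[2 * n])
    print(f"{2 * n}_to_{n}: {spread:.3f}")
```

A spread near zero means doubling the associativity buys almost nothing; with these made-up numbers most of the gain comes from going from direct mapped to 2-way.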
18
Multimedia Trends for Caches
• Audio/Speech
  • Already at the limits of human perceivable fidelity. Under the least pressure for change.
• 3D Graphics
  • No obvious limit to texture sizes and desired number of vertices - tremendous potential for growth.
• Video
  • DVD, HDTV 720P, HDTV 1080I resolutions
  • Instruction Miss Ratios: not significantly affected
  • Data Miss Ratios: strongly influenced for capacities under 32 KB; levels off for larger caches
19
Video Trends: Instruction Cache
[Figure: relative miss ratio (0.00 to 2.50) vs. cache size (1K to 2M bytes); 32B lines, 2-way associativity. Series: DVD→720P (Encode), 720P→1080I (Encode), DVD→720P (Decode), 720P→1080I (Decode). Resolutions: DVD 720x480, HDTV 720P 1280x720, HDTV 1080I 1920x1080]
20
Video Trends: Data Cache
[Figure: relative miss ratio (0.00 to 2.50) vs. cache size (1K to 2M bytes); 32B lines, 2-way associativity. Series: DVD→720P (Encode), 720P→1080I (Encode), DVD→720P (Decode), 720P→1080I (Decode). Resolutions: DVD 720x480, HDTV 720P 1280x720, HDTV 1080I 1920x1080]
21
Summary
• Comparable instruction and data cache miss ratios compared to other workloads.
• Capacity
  • Instruction: 32 KB sufficient for all apps
  • Data: 32 KB (audio, speech, video); larger for document and 3D
• Uniprocessor Line Size
  • Instruction: as large as 256B
  • Data: depends on capacity - 128B for 32KB cache
• Associativity
  • Similar behavior to other workloads.
• For workload and full simulation results: http://www.cs.berkeley.edu/~slingn/research
22
Questions
23
Uncached Performance Slowdown
[Figure: two panels, User Space Slowdown and System Space Slowdown, on a log scale from 1x to 1000x, with L1 & L2 disabled. Benchmarks: ADPCM Encode, ADPCM Decode, DJVU Encode, DJVU Decode, Doom, Ghostscript, GSM Encode, GSM Decode, JPEG Encode, JPEG Decode, LAME, MESA Gears, MESA Morph3D, MESA Reflect, MPEG2 DVD Encode, MPEG2 720P Encode, MPEG2 1080I Encode, MPEG2 DVD Decode, MPEG2 720P Decode, MPEG2 1080I, mpg123, POVray, Rasta, Rsynth, Timidity]
500 MHz AMD Athlon, 256 MB RAM, Windows 2000
Average: 72.6x, Geo Mean: 68.6x
Average: 11.2x, Geo Mean: 7.1x
24
Associativity: Unified Cache
[Figure: miss ratio (0.00 to 2.50) vs. cache size (1K to 2M bytes); 32B lines. Series: 2_to_1, 4_to_2, 8_to_4]
25
Associativity: Instruction Cache
[Figure: miss ratio (0.00 to 2.50) vs. cache size (1K to 2M bytes); 32B lines. Series: 2_to_1, 4_to_2, 8_to_4]
26
Associativity: Data Cache
[Figure: miss ratio (0.00 to 2.50) vs. cache size (1K to 2M bytes); 32B lines. Series: 2_to_1, 4_to_2, 8_to_4]
27
Current L1 Cache Parameters

Name                 $I Size (KB)  $I Assoc  $I Line Size (B)  $D Size (KB)  $D Assoc  $D Line Size (B)
Sun UltraSPARC III   32            4         32                64            4         32
Motorola 7450        32            8         32                32            8         32
MIPS R12000          32            2         32                32            2         32
Intel Pentium IV     96♦           8         µOp               8             4         64
HP PA-8500           512           4         32/64             1024          4         32/64
DEC Alpha 21264B     64            2         16                64            2         64
AMD Athlon           64            2         64                64            2         64

♦ trace cache, capacity estimated based on die area
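The three parameters in each row determine the number of sets in the cache: capacity divided by (associativity × line size). A minimal sketch for a few of the data caches listed above, with the values as read from this slide's table and a hypothetical helper name:

```python
def num_sets(size_kb, assoc, line_size):
    """Number of sets in a cache of size_kb KB with the given ways and line size (bytes)."""
    return (size_kb * 1024) // (assoc * line_size)


# (size in KB, associativity, line size in bytes) for some L1 data caches,
# as read from the table on this slide.
DATA_CACHES = {
    "Sun UltraSPARC III": (64, 4, 32),
    "MIPS R12000": (32, 2, 32),
    "Intel Pentium IV": (8, 4, 64),
    "AMD Athlon": (64, 2, 64),
}

for name, (size_kb, assoc, line) in DATA_CACHES.items():
    print(f"{name}: {num_sets(size_kb, assoc, line)} sets")
```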