
Cache Performance for Multimedia Applications

Nathan Slingerland, nslingerland@apple.com
Apple Computer

Alan Jay Smith, smith@cs.berkeley.edu
University of California at Berkeley


Introduction
• Few studies of multimedia cache behavior, but multimedia is often characterized as having:
  • High instruction reference locality; small, tight loops
  • Very large data sets ("streaming")
  • Poor data cache performance due to non-locality of data references
• Measurements?

[Figure: MPEG frame sequence I B B P B B I; MB per video frame: 1920x1080 (1080I) 5.9, 1280x720 (720P) 2.6, 720x480 (DVD) 1.0]
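The per-frame sizes in the chart can be reproduced as a rough check by assuming uncompressed 24-bit color (3 bytes per pixel) and MB = 2^20 bytes. Neither assumption is stated on the slide, so treat this as a sketch rather than the authors' calculation.

```python
# Assumptions (not stated on the slide): 3 bytes per pixel (24-bit color)
# and 1 MB = 2**20 bytes. With these, the slide's 5.9 / 2.6 / 1.0 MB figures
# come out after rounding to one decimal place.
BYTES_PER_PIXEL = 3
MIB = 2 ** 20

RESOLUTIONS = {
    "HDTV 1080I": (1920, 1080),
    "HDTV 720P": (1280, 720),
    "DVD": (720, 480),
}


def frame_mb(width: int, height: int, bytes_per_pixel: int = BYTES_PER_PIXEL) -> float:
    """Size of one uncompressed video frame in MB (2**20 bytes)."""
    return width * height * bytes_per_pixel / MIB


if __name__ == "__main__":
    for name, (w, h) in RESOLUTIONS.items():
        print(f"{name:<10} {w}x{h}: {frame_mb(w, h):.1f} MB per frame")
```

Under these assumptions a single 1080I frame (about 5.9 MB) is larger than the biggest cache simulated in this study (2 MB).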


Overview
• Berkeley Multimedia Workload
• Analysis: cache simulation
  • Capacity: 1 KB to 2 MB
  • Line size: 16 B to 256 B
  • Associativity: 1-, 2-, 4- and 8-way
• Result
  • Compared with other types of workloads, multimedia has comparable instruction and data cache miss ratios.
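The simulated design space can be enumerated directly. The sketch below (helper names are hypothetical) lists every capacity, line size and associativity combination in the stated ranges and derives the number of sets, skipping the single combination (1 KB, 256 B lines, 8-way) in which one set would exceed the cache capacity; whether the study simulated that point is not stated.

```python
from itertools import product

KB = 1024

# Ranges from the Overview slide.
CAPACITIES = [KB * 2 ** i for i in range(12)]  # 1 KB, 2 KB, ..., 2 MB
LINE_SIZES = [16, 32, 64, 128, 256]            # bytes
ASSOCIATIVITIES = [1, 2, 4, 8]                 # ways


def num_sets(capacity: int, line_size: int, ways: int) -> int:
    """Number of sets = capacity / (line size * associativity)."""
    return capacity // (line_size * ways)


# Keep only geometries where at least one full set fits in the cache.
configs = [
    (cap, line, ways, num_sets(cap, line, ways))
    for cap, line, ways in product(CAPACITIES, LINE_SIZES, ASSOCIATIVITIES)
    if line * ways <= cap
]

if __name__ == "__main__":
    print(len(configs), "configurations")
    print("32 KB, 32 B lines, 2-way ->", num_sets(32 * KB, 32, 2), "sets")
```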


Berkeley Multimedia Workload

Name          Description
Rasta         Speech recognition
Rsynth        Klatt speech synthesizer
GSM           European GSM 06.10 speech compression
JPEG          DCT-based lossy image compression
DjVu          AT&T IW44 wavelet image compression
Ghostscript   PostScript document viewing/rendering
POVray        Persistence of Vision ray tracer
Mesa          OpenGL 3D rendering API clone (demos)
Doom          Classic first-person shooter video game
ADPCM         IMA ADPCM audio compression
LAME          MPEG-1 Layer III (MP3) audio encoder
mpg123        MPEG-1 Layer III (MP3) audio decoder
Timidity      MIDI music synthesis with GUS instruments
MPEG-2        MPEG-2 video decoding and encoding

Domains: 3D Graphics, Document, Audio, Speech, Video


Other Workloads
• SPEC95*
  • Uniprogrammed
• SPEC92 [Gee93]
  • Uniprogrammed
• Multiprogramming Workload [Borg90]
  • Very long (up to 12 billion references) traces from the Titan RISC architecture
• Design Target Miss Ratios [Smith85]
  • Synthesized from hardware monitor and trace simulation measurements
• VAX 11/780, VAX 8800 [Clark83], [Clark88]
  • Hardware monitor miss ratio measurements for a time-shared engineering workload
• Mul3 [Agarwal88]
  • Sampled and stitched traces (originally ~400K references) from the ATUM tracing tool
• Amdahl 470 [Smith82]
  • Hardware monitor measurements taken at Amdahl on a 470V for a standard internal benchmark


Methodology
• Execution-driven cache simulation
  • Modified version of the LibCheetah simulator
  • DEC's ATOM toolkit used to instrument the multimedia application binaries
• Very long traces
  • Each application run to completion
  • Traces of 50 million to 100+ billion instruction references
• Multiprogramming simulated for the multimedia workload (cache flushing)
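A toy version of trace-driven simulation with cache flushing might look like the following. It is not LibCheetah or ATOM; it is a minimal single-configuration LRU model, and the `simulate` helper, the flush-every-N-references policy and the example trace are illustrative assumptions (the study counts its flush intervals in instructions, not in references).

```python
from collections import OrderedDict


class SetAssocCache:
    """Set-associative cache with LRU replacement (an illustrative sketch)."""

    def __init__(self, capacity: int, line_size: int, ways: int):
        self.line_size = line_size
        self.ways = ways
        self.num_sets = capacity // (line_size * ways)
        # One OrderedDict per set; key order is LRU (first) to MRU (last).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def flush(self) -> None:
        """Invalidate every line, modelling pollution by other processes."""
        for s in self.sets:
            s.clear()

    def access(self, addr: int) -> bool:
        """Look up one byte address; return True on a hit."""
        line = addr // self.line_size
        lru = self.sets[line % self.num_sets]
        self.accesses += 1
        if line in lru:
            lru.move_to_end(line)      # mark as most recently used
            return True
        self.misses += 1
        if len(lru) >= self.ways:
            lru.popitem(last=False)    # evict the least recently used line
        lru[line] = None
        return False

    @property
    def miss_ratio(self) -> float:
        return self.misses / self.accesses if self.accesses else 0.0


def simulate(trace, cache: SetAssocCache, flush_interval: int) -> float:
    """Replay a trace, flushing the cache every `flush_interval` references."""
    for i, addr in enumerate(trace):
        if i and i % flush_interval == 0:
            cache.flush()
        cache.access(addr)
    return cache.miss_ratio


if __name__ == "__main__":
    # Hypothetical trace: sweep a 1 KB array of 4-byte words 100 times.
    trace = [4 * i for i in range(256)] * 100
    no_flush = simulate(trace, SetAssocCache(4 * 1024, 32, 2), flush_interval=10**9)
    flushed = simulate(trace, SetAssocCache(4 * 1024, 32, 2), flush_interval=1024)
    print(f"no flush: {no_flush:.5f}  flush every 1024 refs: {flushed:.5f}")
```

Flushing turns a working set that would otherwise stay resident into a fresh set of cold misses after every interval, which is why the choice of flush interval matters for the simulated miss ratios.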


Long Traces
• Cache behavior varies during execution
• Cold-start effects
  • All initial accesses are compulsory misses
  • These can dominate if traces are too short

[Figure panels: MPEG-2 DVD Encode, POVray]
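The cold-start effect can be seen with an even simpler model: an infinitely large cache, in which the only misses are compulsory (first-touch) misses. In the hypothetical sweep below (buffer size and line size are assumptions), the miss ratio falls in proportion to trace length, so a short trace is dominated by cold misses.

```python
def compulsory_miss_ratio(trace, line_size: int = 32) -> float:
    """Miss ratio of an infinite cache: each line misses only on first touch."""
    seen = set()
    misses = 0
    for addr in trace:
        line = addr // line_size
        if line not in seen:
            seen.add(line)
            misses += 1
    return misses / len(trace)


# Hypothetical kernel: repeated sweeps over a 4 KB buffer of 4-byte words.
sweep = [4 * i for i in range(1024)]

if __name__ == "__main__":
    for passes in (1, 10, 100):
        print(f"{passes:>3} passes: miss ratio {compulsory_miss_ratio(sweep * passes):.5f}")
```

This prints miss ratios of 0.12500, 0.01250 and 0.00125: the 128 cold misses are fixed, so each tenfold increase in trace length cuts the miss ratio tenfold.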


Average Context Switch Intervals
• The Berkeley Multimedia Workload's scheduling behavior is not realistic.
• Windows NT and 2000 maintain a variety of system event performance counters†. A modified version of the PDHTest tool was used for our measurements.
• Thread counter events:
  • Privileged time, user time (cycles)
  • Context switch count
  • Priority
  • State

$$t_{context} = \frac{t_{user} + t_{system}}{\text{context switches}}$$

† Counters are described in Microsoft Systems Journal, March 1996, April 1996, March 1998 and May 1998 issues.
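The average context interval is the thread's user plus privileged (system) time divided by its context-switch count. The sketch below applies that ratio to one thread; the counter values are invented for illustration, whereas in the study they came from the Windows performance counters read with a modified PDHTest.

```python
def avg_context_interval(user_cycles: int, privileged_cycles: int,
                         context_switches: int) -> float:
    """Average cycles between context switches:
    (user time + privileged/system time) / number of context switches."""
    if context_switches <= 0:
        raise ValueError("context_switches must be positive")
    return (user_cycles + privileged_cycles) / context_switches


if __name__ == "__main__":
    # Hypothetical counter readings for a single thread (not measured data).
    interval = avg_context_interval(user_cycles=2_800_000_000,
                                    privileged_cycles=200_000_000,
                                    context_switches=1_000)
    print(f"average context interval: {interval:,.0f} cycles")
```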


Context Switch Intervals

Name                                   Context Interval (Cycles)
MediaPlayer GSM 06.10                                    297,641
DjVushop Document Compression                          4,754,521
Audio Compositor MIDI Synthesizer                      3,675,086
Audio Catalyst v2.1 MP3 Encoder                        3,358,692
Dragon Naturally Speaking Preferred                    2,560,537
Ghostscript PostScript Previewer                       1,227,194
3D Maze OpenGL Screen Saver                            5,930,096
POVray v3.1g Raytracer                                 5,928,433
Avi2Mpg2 MPEG-2 Encoder                                5,339,432
Quake III Arena (Demo)                                 4,284,671
Irfanview v3.15 Image Viewer                           3,821,284
PowerDVD v2.55 DVD Player                              1,189,234
WinDVD v2.0 DVD Player                                   921,510
MediaPlayer IMA ADPCM                                    708,037
Narrator Text to Speech                                  594,438
3D Pipes OpenGL Screen Saver                             567,080
K-Jofol 2000 MP3 Player v1.0                             360,336
Real Jukebox v1.0.0.488 MP3 Player                        58,399
RealPlayer v7.0 Real Audio Player                         40,396
3D Flowerbox OpenGL Screen Saver                          23,653

500 MHz AMD Athlon system, 256 MB RAM, Windows 2000 v5.00.2195
Domains: 3D Graphics, Document, Audio, Speech, Video


Cache Flush Intervals

Name            Cache Flush Interval (Instructions)
MPEG-2 Encode   5,339,432
Rasta           2,560,537
Rsynth            594,438
GSM               297,641
JPEG            3,821,284
DjVu            4,754,521
Ghostscript     1,227,194
POVray          5,928,433
Mesa            2,173,610
Doom            4,284,671
ADPCM             708,037
LAME            3,358,692
mpg123          1,554,505
Timidity        3,675,086
MPEG-2 Decode   1,055,372

• Simulation cache flush intervals are based on the average of the measured context intervals for similar Windows applications.
• Cycles were converted to µOps to correspond more closely to DEC Alpha RISC instructions.

Domains: 3D Graphics, Document, Audio, Speech, Video
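The flush intervals are described as averages of the measured context intervals of similar Windows applications. Treating the three OpenGL screen savers as "similar" to Mesa, and the two DVD players as "similar" to MPEG-2 Decode, reproduces those two table entries exactly. This grouping is inferred from the numbers, not stated on the slides, and the names below are hypothetical.

```python
from statistics import mean

# Context intervals (cycles) from the Context Switch Intervals table.
CONTEXT_INTERVALS = {
    "3D Maze OpenGL Screen Saver": 5_930_096,
    "3D Pipes OpenGL Screen Saver": 567_080,
    "3D Flowerbox OpenGL Screen Saver": 23_653,
    "PowerDVD v2.55 DVD Player": 1_189_234,
    "WinDVD v2.0 DVD Player": 921_510,
}

# Assumed mapping from benchmark to "similar" Windows applications (inferred).
SIMILAR_APPS = {
    "Mesa": ["3D Maze OpenGL Screen Saver",
             "3D Pipes OpenGL Screen Saver",
             "3D Flowerbox OpenGL Screen Saver"],
    "MPEG-2 Decode": ["PowerDVD v2.55 DVD Player",
                      "WinDVD v2.0 DVD Player"],
}


def flush_interval(benchmark: str) -> int:
    """Round the mean context interval of the benchmark's similar apps."""
    return round(mean(CONTEXT_INTERVALS[app] for app in SIMILAR_APPS[benchmark]))


if __name__ == "__main__":
    for name in SIMILAR_APPS:
        print(f"{name}: {flush_interval(name):,}")
```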


Capacity: Unified Cache

[Figure: miss ratio (0.00 to 0.10) vs. cache size (1 KB to 2 MB), 32B lines, 2-way associativity. Series: Multimedia, Agarwal Mul3 [Agar88], DTMR [Smit87], SPEC92 [Gee93], SPEC95, 470 User [Smit82], 470 Supervisor [Smit82], VAX 780 [Clark83], VAX 8800 [Clark88].]


Capacity: Instruction Cache

[Figure: miss ratio (0.00 to 0.10) vs. cache size (1 KB to 2 MB), 32B lines, 2-way associativity. Series: Multimedia, Mult [Borg90], DTMR [Smit87], SPEC92 [Gee93], SPEC95.]


Capacity: Data Cache

[Figure: miss ratio (0.00 to 0.25) vs. cache size (1 KB to 2 MB), 32B lines, 2-way associativity. Series: Multimedia, Mult [Borg90], DTMR [Smit87], SPEC92 [Gee93], SPEC95.]


Why?
• Many building blocks (e.g., DCT, FFT) internally re-reference the same data.
• Even if an array is simply traversed in memory order, there is a benefit from the "prefetch" effect of long cache lines.
• Multimedia data types are narrow, so more elements fit in a cache line.
• Some of the comparison studies are older and were done on machines with much longer cycle times → shorter context intervals (fewer instructions between context switches). We expect newer studies to exhibit lower miss ratios.
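The narrow-data-type argument can be made concrete with a small calculation. For a single cold, sequential pass over a large array aligned to line boundaries (both assumptions), each line costs one miss, so the miss ratio is the element size divided by the line size. The element types listed are typical examples, not measurements from the workload.

```python
LINE_BYTES = 32  # line size used in most of the capacity plots

# Illustrative element widths (bytes); not measured from the workload.
ELEMENT_TYPES = {
    "8-bit pixel component": 1,
    "16-bit audio sample": 2,
    "32-bit float": 4,
    "64-bit double": 8,
}


def elements_per_line(line_bytes: int, element_bytes: int) -> int:
    """How many elements of this width fit in one cache line."""
    return line_bytes // element_bytes


def streaming_miss_ratio(line_bytes: int, element_bytes: int) -> float:
    """One cold sequential pass: one miss per line, hits for the rest of it."""
    return element_bytes / line_bytes


if __name__ == "__main__":
    for name, width in ELEMENT_TYPES.items():
        print(f"{name:<22} {elements_per_line(LINE_BYTES, width):>2} per line, "
              f"miss ratio {streaming_miss_ratio(LINE_BYTES, width):.4f}")
```

With 32-byte lines, a stream of 8-bit pixel components misses once every 32 references, while a stream of doubles misses once every 4.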


Capacity: Multimedia Domains

[Figure: data cache miss ratio (0.00 to 0.12) vs. cache size (1 KB to 2 MB), 32B lines, 2-way associativity. Series: Audio, Speech, Document, Video, 3D GFX.]


Line Size - Uniprocessor
• The optimal line size on a uniprocessor (UP) system minimizes the average memory delay, t_avg
  • m(L): miss ratio for line size L (bytes)
  • t_line: time to fetch a cache line
  • d: data path width (bytes)
  • t_latency: memory transaction delay (sec)
  • r_xfer: bus bandwidth (bytes per sec)
• Instruction cache lines: as large as 256B
• Data cache lines: dependent on capacity; 128B for a 32KB cache
• MP issues not considered (total memory traffic, bus busy periods)

$$t_{line} = t_{latency} + \frac{L}{d \cdot r_{xfer}}$$

Multimedia: average memory delay (ns), instruction cache, by block size (bytes)
Size         16        32        64       128       256
1K      6.22630   3.79536   2.68253   1.96086   1.74648
2K      3.13298   1.94226   1.40646   1.01556   0.93899
4K      1.67616   1.08187   0.81106   0.60246   0.57495
8K      0.95800   0.64909   0.46229   0.35912   0.33620
16K     0.43464   0.28182   0.19453   0.15618   0.15525
32K     0.16759   0.10355   0.07412   0.05721   0.04810
64K     0.09868   0.05709   0.03657   0.02492   0.01902
128K    0.07714   0.04281   0.02534   0.01516   0.01016
256K    0.07514   0.04126   0.02407   0.01393   0.00897
512K    0.07496   0.04110   0.02392   0.01379   0.00883
1M      0.07496   0.04110   0.02392   0.01379   0.00882
2M      0.07496   0.04110   0.02392   0.01379   0.00882

Multimedia: average memory delay (ns), data cache, by block size (bytes)
Size         16        32        64       128       256
1K     10.70698   9.09775  10.17477  14.66781  24.87838
2K      8.04009   6.20969   6.31052   8.56975  14.26147
4K      6.15572   4.35004   3.83097   4.40862   6.87190
8K      4.64852   3.06934   2.40616   2.44560   3.27708
16K     3.48517   2.24199   1.61893   1.51314   1.86139
32K     2.81276   1.77501   1.24335   1.05964   1.14475
64K     2.44197   1.49988   1.03259   0.82706   0.78846
128K    2.30867   1.38347   0.91999   0.71591   0.65241
256K    2.23803   1.31936   0.85758   0.64236   0.57225
512K    2.18862   1.27134   0.80799   0.58706   0.50072
1M      1.94165   1.02288   0.55434   0.31764   0.19900
2M      1.93021   1.01452   0.54805   0.31211   0.19121

Parameters (500 MHz AMD Athlon system, 2-way associativity): t_latency = 109.7 ns, r_xfer = 1182.92 MB/s, d = 8 bytes

$$t_{avg} = m(L) \cdot t_{line}$$
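Because the table already gives t_avg for every capacity and line size, choosing the uniprocessor-optimal line size is an argmin over each row. The sketch below copies the table values (dictionary and function names are hypothetical) and recovers the slide's conclusions: 256 B is best for every instruction-cache size, while the best data-cache line size grows with capacity (128 B at 32 KB).

```python
LINE_SIZES = [16, 32, 64, 128, 256]  # bytes

# Average memory delay t_avg (ns) from the table above.
T_AVG_INSTR = {
    "1K":   [6.22630, 3.79536, 2.68253, 1.96086, 1.74648],
    "2K":   [3.13298, 1.94226, 1.40646, 1.01556, 0.93899],
    "4K":   [1.67616, 1.08187, 0.81106, 0.60246, 0.57495],
    "8K":   [0.95800, 0.64909, 0.46229, 0.35912, 0.33620],
    "16K":  [0.43464, 0.28182, 0.19453, 0.15618, 0.15525],
    "32K":  [0.16759, 0.10355, 0.07412, 0.05721, 0.04810],
    "64K":  [0.09868, 0.05709, 0.03657, 0.02492, 0.01902],
    "128K": [0.07714, 0.04281, 0.02534, 0.01516, 0.01016],
    "256K": [0.07514, 0.04126, 0.02407, 0.01393, 0.00897],
    "512K": [0.07496, 0.04110, 0.02392, 0.01379, 0.00883],
    "1M":   [0.07496, 0.04110, 0.02392, 0.01379, 0.00882],
    "2M":   [0.07496, 0.04110, 0.02392, 0.01379, 0.00882],
}
T_AVG_DATA = {
    "1K":   [10.70698, 9.09775, 10.17477, 14.66781, 24.87838],
    "2K":   [8.04009, 6.20969, 6.31052, 8.56975, 14.26147],
    "4K":   [6.15572, 4.35004, 3.83097, 4.40862, 6.87190],
    "8K":   [4.64852, 3.06934, 2.40616, 2.44560, 3.27708],
    "16K":  [3.48517, 2.24199, 1.61893, 1.51314, 1.86139],
    "32K":  [2.81276, 1.77501, 1.24335, 1.05964, 1.14475],
    "64K":  [2.44197, 1.49988, 1.03259, 0.82706, 0.78846],
    "128K": [2.30867, 1.38347, 0.91999, 0.71591, 0.65241],
    "256K": [2.23803, 1.31936, 0.85758, 0.64236, 0.57225],
    "512K": [2.18862, 1.27134, 0.80799, 0.58706, 0.50072],
    "1M":   [1.94165, 1.02288, 0.55434, 0.31764, 0.19900],
    "2M":   [1.93021, 1.01452, 0.54805, 0.31211, 0.19121],
}


def best_line_size(row):
    """Line size (bytes) with the smallest average memory delay in this row."""
    return LINE_SIZES[row.index(min(row))]


if __name__ == "__main__":
    for size in T_AVG_INSTR:
        print(f"{size:>5}: I-cache {best_line_size(T_AVG_INSTR[size]):>3} B, "
              f"D-cache {best_line_size(T_AVG_DATA[size]):>3} B")
```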


Associativity
• Miss ratio spread measures the benefit of increasing associativity.
  • m(A=n): miss ratio for n-way associativity, A
• Increased associativity is more useful for instruction caches than for data caches.
• 2-way or 4-way associativity offers the greatest relative benefit.

$$\text{Miss Ratio Spread} = \frac{m(A=n) - m(A=2n)}{m(A=2n)}$$
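The miss ratio spread compares each associativity with the next doubling, relative to the more associative cache's miss ratio. A minimal sketch follows; the miss ratios are hypothetical stand-ins for simulation output, and the `2_to_1`-style keys are chosen to mirror the legend of the associativity plots.

```python
def miss_ratio_spread(m_n: float, m_2n: float) -> float:
    """Spread = (m(A=n) - m(A=2n)) / m(A=2n)."""
    if m_2n <= 0:
        raise ValueError("m(A=2n) must be positive")
    return (m_n - m_2n) / m_2n


def spreads(miss_by_ways: dict) -> dict:
    """Spread for each doubling present, keyed like '2_to_1'."""
    return {
        f"{2 * n}_to_{n}": miss_ratio_spread(miss_by_ways[n], miss_by_ways[2 * n])
        for n in sorted(miss_by_ways)
        if 2 * n in miss_by_ways
    }


if __name__ == "__main__":
    # Hypothetical miss ratios for one cache size (not results from the study).
    example = {1: 0.060, 2: 0.040, 4: 0.036, 8: 0.035}
    for key, value in spreads(example).items():
        print(f"{key}: {value:.3f}")
```

A spread of 0.5 for `2_to_1` would mean the direct-mapped cache has 50% more misses than the 2-way cache of the same size.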


Multimedia Trends for Caches
• Audio/Speech
  • Already at the limits of human-perceivable fidelity; under the least pressure for change.
• 3D Graphics
  • No obvious limit to texture sizes and desired number of vertices; tremendous potential for growth.
• Video
  • DVD, HDTV 720P, HDTV 1080I resolutions
  • Instruction miss ratios: not significantly affected
  • Data miss ratios: strongly influenced for capacities under 32 KB; level off for larger caches


Video Trends: Instruction Cache

[Figure: relative miss ratio (0.00 to 2.50) vs. cache size (1 KB to 2 MB), 32B lines, 2-way associativity. Series: DVD→720P (Encode), 720P→1080I (Encode), DVD→720P (Decode), 720P→1080I (Decode). Resolutions: DVD 720x480, HDTV 720P 1280x720, HDTV 1080I 1920x1080.]


Video Trends: Data Cache

[Figure: relative miss ratio (0.00 to 2.50) vs. cache size (1 KB to 2 MB), 32B lines, 2-way associativity. Series: DVD→720P (Encode), 720P→1080I (Encode), DVD→720P (Decode), 720P→1080I (Decode). Resolutions: DVD 720x480, HDTV 720P 1280x720, HDTV 1080I 1920x1080.]


Summary
• Instruction and data cache miss ratios comparable to those of other workloads.
• Capacity
  • Instruction: 32 KB sufficient for all apps
  • Data: 32 KB for audio, speech and video; more for document and 3D
• Uniprocessor line size
  • Instruction: as large as 256B
  • Data: depends on capacity; 128B for a 32KB cache
• Associativity
  • Similar behavior to other workloads.
• For workload and full simulation results: http://www.cs.berkeley.edu/~slingn/research


Questions


Uncached Performance Slowdown

[Figure: two panels, User Space Slowdown and System Space Slowdown, log scale 1x to 1000x, with L1 & L2 caches disabled (500 MHz AMD Athlon, 256 MB RAM, Windows 2000). Benchmarks: ADPCM Encode, ADPCM Decode, DJVU Encode, DJVU Decode, Doom, Ghostscript, GSM Encode, GSM Decode, JPEG Encode, JPEG Decode, LAME, MESA Gears, MESA Morph3D, MESA Reflect, MPEG2 DVD Encode, MPEG2 720P Encode, MPEG2 1080I Encode, MPEG2 DVD Decode, MPEG2 720P Decode, MPEG2 1080I, mpg123, POVray, Rasta, Rsynth, Timidity.]

Average: 72.6x, Geo Mean: 68.6x
Average: 11.2x, Geo Mean: 7.1x
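The slowdown summaries report both an arithmetic and a geometric mean. The geometric mean is never larger than the arithmetic mean and is less sensitive to a few very large ratios, so the two diverge more when a handful of benchmarks slow down far more than the rest. A short sketch using Python's `statistics` module, with made-up slowdown factors:

```python
from statistics import fmean, geometric_mean


def summarize(slowdowns):
    """Return (arithmetic mean, geometric mean) of per-benchmark slowdowns."""
    if not slowdowns or min(slowdowns) <= 0:
        raise ValueError("slowdowns must be positive")
    return fmean(slowdowns), geometric_mean(slowdowns)


if __name__ == "__main__":
    # Hypothetical slowdown factors, not values from the figure.
    avg, geo = summarize([2.0, 8.0, 32.0])
    print(f"average {avg:.1f}x, geometric mean {geo:.1f}x")
```

Both functions require Python 3.8 or later.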


Associativity: Unified Cache

[Figure: y-axis "Miss Ratio" (0.00 to 2.50) vs. cache size (1 KB to 2 MB), 32B lines. Series: 2_to_1, 4_to_2, 8_to_4.]


Associativity: Instruction Cache

[Figure: y-axis "Miss Ratio" (0.00 to 2.50) vs. cache size (1 KB to 2 MB), 32B lines. Series: 2_to_1, 4_to_2, 8_to_4.]


Associativity: Data Cache

[Figure: y-axis "Miss Ratio" (0.00 to 2.50) vs. cache size (1 KB to 2 MB), 32B lines. Series: 2_to_1, 4_to_2, 8_to_4.]


Current L1 Cache Parameters

Processor            $I Size (KB)  $I Assoc  $I Line Size (B)  $D Size (KB)  $D Assoc  $D Line Size (B)
AMD Athlon                     64         2                64            64         2                64
DEC Alpha 21264B               64         2                16            64         2                64
HP PA-8500                    512         4             32/64          1024         4             32/64
Intel Pentium IV              96♦         8               µOp             8         4                64
MIPS R12000                    32         2                32            32         2                32
Motorola 7450                  32         8                32            32         8                32
Sun UltraSPARC III             32         4                32            64         4                32

♦ Trace cache; capacity estimated based on die area.