
The Impact of Increasing Memory System Diversity on Applications

Gwen Voskuilen, Arun Rodrigues, Mike Frank, Si Hammond

4/25/2017

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. SAND2017-4325 C

Diversity in the memory hierarchy

§ New memory technologies: stacked DRAM, non-volatile
  § Advantages: higher bandwidth, persistence, etc.
  § Disadvantages: expensive, higher latencies
§ Solution: multi-level memory (MLM)
  § Mix memories to expose advantages, hide disadvantages
  § Really hard in practice
§ Must carefully place data based on its characteristics
  § Which data goes into which memory? Who decides?

              Automatic              Manual
Hardware      Hardware decides
Software      OS/runtime decides     Application decides

Managing MLM

§ Trinity KNL: small stacked DRAM + large DDR DRAM
  § Application data footprint >> stacked DRAM capacity
§ Management requires identifying data that needs bandwidth and selectively allocating it into stacked DRAM (see the allocation sketch below)
§ Study 1: Software management policies
  § Ranging from simple to complex
  § Developed an analysis tool, MemSieve, to evaluate memory behavior
§ Study 2: Hardware caching
  § Insertion/eviction policies
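As a concrete illustration of directing bandwidth-critical data into stacked DRAM, here is a minimal sketch. The mlm_alloc()/MemKind names are hypothetical stand-ins, not an API from this work; on a real system the HMC pool would be reached through something like the memkind/hbwmalloc library or NUMA-based placement.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical two-pool allocation wrapper (illustration only); here both
// kinds fall back to the normal heap.
enum class MemKind { HMC, DDR };

void* mlm_alloc(std::size_t bytes, MemKind kind) {
    (void)kind;  // a real implementation would route to the stacked-DRAM or DDR pool
    return std::malloc(bytes);
}

int main() {
    const std::size_t n = 1 << 20;

    // Bandwidth-critical, frequently streamed array -> stacked DRAM (HMC).
    double* flux = static_cast<double*>(mlm_alloc(n * sizeof(double), MemKind::HMC));

    // Large, rarely touched array -> capacity memory (DDR).
    double* history = static_cast<double*>(mlm_alloc(8 * n * sizeof(double), MemKind::DDR));

    std::printf("flux=%p history=%p\n", static_cast<void*>(flux), static_cast<void*>(history));
    std::free(flux);
    std::free(history);
    return 0;
}
```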


Outline

§ Methodology

§ Software / manually managed MLM (malloc()-based)

§ Hardware / automatically managed MLM

§ Conclusions

Simulation Methodology

§ Mini-apps: HPCG, PENNANT, SNAP
  § Memory footprints of 1-8 GB → sampling required
§ Simulated on two architectures using SST
  § Lightweight: 72 small cores, mesh, private L1, semi-private L2
  § Heavyweight: 8 big cores, ring, private L1 + L2, shared L3
  § Both: DDR and HMC memories

[Figure: block diagrams of the two simulated systems. Lightweight: tiled cores with private L1/L2, connected to HMC and DDR. Heavyweight: cores with private L1/L2, shared L3 slices on a ring, and multiple memory controllers.]

Outline

§ Methodology

§ Software / manually managed MLM

§ Hardware / automatically managed MLM

§ Conclusions

Software approaches

Software management:

§ OS / Runtime
  § Static
    § Greedily insert pages into HMC
    § Greedily insert mallocs into HMC
  § Dynamic (future work)
§ Programmer
  § Static
    § Direct "best" mallocs to HMC
  § Dynamic
    § Migrate to put the current "best" in HMC

(The original figure asks "Increased performance?" moving from the OS approaches toward the programmer approaches.)

Tradeoff

§ Automatic (OS)
  § Easier for the programmer
  § Able to capture allocations not under programmer control
    § Library allocations, pre-program-start allocations, etc.
  § Page-table complexity; potentially expensive re-mapping
  § No program knowledge → worse performance?
    § Could use programmer hints or runtime profiling, but that is more work

§ Manual (programmer)
  § More work for the programmer; pervasive (?) changes
  § Not able to handle all allocations
  § Possible conflicts between application and library allocations
    § What if libraries decide to manage allocation for their internal structures too?
  § Knowledge of program behavior → better performance?


Analysis tool: MemSieve

§ Captures an application's memory accesses and correlates them to the application's memory allocations
  § Filters out cache hits
  § Without simulating the full memory hierarchy → 2.5x+ faster
§ Key measurement: malloc density
  § density = #accesses / size
§ Hypothesis: dense mallocs should be put in HMC
  § Assuming similar latencies between HMC and DDR
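A minimal sketch of how the density metric can be used (illustrative, not MemSieve's implementation): rank allocation call sites by accesses per byte and greedily fill an HMC capacity budget with the densest ones.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative per-call-site profile record.
struct CallSite {
    std::string trace;     // allocation call trace (location)
    uint64_t    accesses;  // cache-filtered accesses observed for this site
    uint64_t    bytes;     // total bytes allocated at this site
    double density() const { return bytes ? double(accesses) / double(bytes) : 0.0; }
};

// Sort call sites by density and greedily mark the densest ones for HMC
// until the capacity budget is exhausted.
std::vector<CallSite> pickForHMC(std::vector<CallSite> sites, uint64_t hmcBytes) {
    std::sort(sites.begin(), sites.end(),
              [](const CallSite& a, const CallSite& b) { return a.density() > b.density(); });
    std::vector<CallSite> chosen;
    uint64_t used = 0;
    for (const auto& s : sites) {
        if (used + s.bytes <= hmcBytes) {
            chosen.push_back(s);
            used += s.bytes;
        }
    }
    return chosen;
}
```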


Malloc analysis

                                     Pennant    HPCG       Snap*     MiniPIC
Malloc count                         8 B        23 M       1 B       438 K
Malloc size                          32.1 TB    7.43 GB    30 GB     7.9 GB
Distinct traces                      248        612        323       39043
Accessed traces                      140        146        90        10794
Size of accessed traces (% total)    89.7%      99.987%    89.6%     84%

* Iterations from beginning & middle only

§ Many mallocs but few distinct malloc call traces (locations)
§ Reasons mallocs are not accessed:
  § Same address malloc'd repeatedly → cache-resident
  § Malloc was not accessed in the profiled section of the application

Ideal malloc behavior
§ Good: a few, small, very dense mallocs
§ Bad: many, equally dense mallocs; the densest are big

[Figure: sketch plots of ideal vs. non-ideal behavior, showing density and cumulative % of accesses vs. mallocs ordered from more dense to less dense, and cumulative % of accesses vs. size. Big density variation means less work to manage; in the ideal case most accesses fall in a very small region.]

Malloc density

[Figure: HPCG and PENNANT malloc density profiles. Top panels: density (accesses/byte) and cumulative % of total accesses vs. malloc call sites, ordered from most to least dense. Bottom panels: cumulative size (GB for HPCG, TB for PENNANT) and cumulative % of total accesses over the same ordering.]

HMC Potential Performance
§ Max: 8x
§ Trends do not change with data set size (1-8 GB)

[Figure: speedup over all-DDR when all data is placed in HMC, for MiniPIC (charge, field, move), SNAP (p0, p1/2), HPCG, and PENNANT, on the heavyweight and lightweight architectures.]

Manual allocation: PENNANT

§ Large performance jump from 25% to 50% HMC
§ Dynamic migration is necessary
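A sketch of what programmer-directed dynamic migration could look like, assuming per-region "heat" statistics from profiling (the names and the plain malloc/memcpy stand-ins are illustrative rather than the study's implementation): at each phase boundary, copy the currently hottest regions into the HMC pool until its budget is spent.

```cpp
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <vector>

// Illustrative phase-boundary migration loop.
struct Region {
    void*       ptr;
    std::size_t bytes;
    double      heat;   // recent accesses per byte, e.g. from profiling counters
    bool        inHMC;
};

void migrateHottest(std::vector<Region>& regions, std::size_t hmcBudget) {
    // Hottest regions first.
    std::sort(regions.begin(), regions.end(),
              [](const Region& a, const Region& b) { return a.heat > b.heat; });
    std::size_t used = 0;
    for (auto& r : regions) {
        bool wantHMC = (used + r.bytes <= hmcBudget);
        if (wantHMC) used += r.bytes;
        if (wantHMC != r.inHMC) {
            // Copy to the other pool and update the region's pointer.
            // (Stand-in: a real version would allocate from the HMC or DDR pool.)
            void* dst = std::malloc(r.bytes);
            std::memcpy(dst, r.ptr, r.bytes);
            std::free(r.ptr);
            r.ptr = dst;
            r.inHMC = wantHMC;
        }
    }
}
```

In a real code the application's own pointers into each migrated region would also have to be updated, which is part of why the granularity of migration matters.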

[Figure: PENNANT speedup over DDR-only for the greedy-page, greedy-malloc, static, and dynamic policies at 12.5%, 25%, 50%, and 100% HMC.]

Manual allocation: HPCG

§ Again, a large jump from 25% to 50% HMC
§ Page-based & static perform similarly
§ Dynamic is not better
  § But the granularity of migration is large

[Figure: HPCG speedup over DDR-only for the same policies at 12.5%, 25%, 50%, and 100% HMC.]

Manual allocation: SNAP

§ Greedy-page performs the best
§ Two large mallocs in SNAP, each 42% of the total
  § Medium/low density
  § Once they don't fit, HMC size and malloc strategy don't matter
§ Suggested code change: break up large mallocs to improve HMC utilization (sketched below)
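One way to act on that suggested change, sketched with illustrative names: replace a single huge allocation with a chunked array, so individual chunks (rather than a whole 42%-of-footprint block) can be placed in HMC or DDR. Chunks here still come from the normal heap; a real version would route hot chunks through an HMC allocator.

```cpp
#include <cstdlib>
#include <vector>

// Illustrative chunked array: each chunk is a separate allocation, so the
// placement decision can be made per chunk instead of for one huge malloc.
struct ChunkedArray {
    std::vector<double*> chunks;
    std::size_t          chunkElems;

    ChunkedArray(std::size_t totalElems, std::size_t elemsPerChunk)
        : chunkElems(elemsPerChunk) {
        std::size_t n = (totalElems + chunkElems - 1) / chunkElems;
        for (std::size_t i = 0; i < n; ++i)
            chunks.push_back(static_cast<double*>(std::malloc(chunkElems * sizeof(double))));
    }
    ChunkedArray(const ChunkedArray&) = delete;
    ChunkedArray& operator=(const ChunkedArray&) = delete;
    ~ChunkedArray() { for (double* c : chunks) std::free(c); }

    double& operator[](std::size_t i) { return chunks[i / chunkElems][i % chunkElems]; }
};
```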

[Figure: SNAP speedup over DDR-only for the same policies at 12.5%, 25%, 50%, and 100% HMC.]

Topology comparison

§ Similar trend

[Figure: PENNANT speedup over DDR-only by policy and HMC size (12.5%-100%) on the heavyweight vs. lightweight architectures.]

Outline

§ Methodology

§ Software / manually managed MLM

§ Hardware / automatically managed MLM

§ Conclusions

Hardware management

§ Hardware management of MLM at the page level
  § Cache pages in HMC; the page still resides in DDR
  § Compared to block level: lower tracking overhead but higher add/remove overhead
§ Focus was hardware caching
  § Caching via the OS is also possible
  § But usually with less information available (hits, misses, etc.)

Automatic Page-Level Swapping

§ Addition policies
§ Replacement policies

[Figure: the MLM unit sits between the directory controller and the DDR / fast memory; it contains a mapping table, a policy dispatcher, and a DMA engine.]

Addition policies:
• addT: Simple Threshold
• addMFU: Most Frequently Used
• addRAND: 1:8192 chance
• addMRPU: More Recent Previous Use
• addMFRPU: More Frequent + More Recent Previous Use
• addSC: Deprioritize streams
• addSCF: as addSC + More Frequent

Replacement policies:
• FIFO: First-in, First-out
• LRU: Least Recently Used
• LFU: Least Frequently Used
• LFU8: LFU w/ 8-bit counter
• BiLRU: BiModal LRU
• SCLRU: Deprioritize streams
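To make one addition/replacement pairing concrete, here is a minimal software model (not the simulated hardware) of a simple-threshold addition policy in the spirit of addT combined with LRU replacement over the HMC-resident pages; class names and parameters are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class PageCache {
    std::size_t capacityPages;
    uint64_t    addThreshold;   // accesses a DDR page must see before it is added
    std::list<uint64_t> lru;    // HMC-resident pages, front = most recently used
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> inHMC;
    std::unordered_map<uint64_t, uint64_t> touches;  // access counts for DDR-only pages

public:
    PageCache(std::size_t pages, uint64_t threshold)
        : capacityPages(pages), addThreshold(threshold) {}

    // Returns true if the access hits the HMC copy of the page.
    bool access(uint64_t page) {
        auto it = inHMC.find(page);
        if (it != inHMC.end()) {                    // hit: move to MRU position
            lru.splice(lru.begin(), lru, it->second);
            return true;
        }
        if (++touches[page] >= addThreshold) {      // addition policy: simple threshold
            if (inHMC.size() >= capacityPages) {    // replacement policy: LRU
                inHMC.erase(lru.back());
                lru.pop_back();
            }
            lru.push_front(page);
            inHMC[page] = lru.begin();
            touches.erase(page);
        }
        return false;
    }
};
```

Swapping LRU for FIFO or LFU only changes which resident page is evicted, which is consistent with the observation below that the addition policy matters far more than the replacement policy.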

Performance vs. Policy

[Figure: Lulesh and MiniFE MLM performance vs. addition policy (addMFRPU, addMFU, addMRPU, addRAND, addSC, addT, addSCF), with one series per replacement policy (BiLRU, FIFO, LRU, SCLRU, LFU8, LFU).]

Addition policy: big variation

Replacement policy: little variation

“What you put in matters more than what you take out”

Larger data sets

§ Looked at the highest-performing addition policies
  § Variants of most-frequently-used
  § Baseline: random
  § LRU replacement

[Figure: Pennant-b, Snap-p0, and HPCG performance (1 = no fast memory) vs. HMC size in pages (1024, 8192, 65536) for the addMFRPU, addRAND, addSCF, and addMFU addition policies, compared against an all-fast-memory configuration.]

Fine Tuning
1. Thresholds
2. Page size
3. Throttling
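For item 3, a sketch of one plausible throttling mechanism (the interval and cap knobs are illustrative assumptions, not values from the study): cap how many page swaps the MLM unit may start per fixed window of cycles, and defer additions once the cap is hit.

```cpp
#include <cstdint>

// Illustrative swap throttle: at most maxSwapsPerInterval page swaps may
// begin within each window of intervalCycles cycles.
class SwapThrottle {
    uint64_t intervalCycles;
    uint64_t maxSwapsPerInterval;
    uint64_t intervalStart = 0;
    uint64_t swapsThisInterval = 0;

public:
    SwapThrottle(uint64_t interval, uint64_t maxSwaps)
        : intervalCycles(interval), maxSwapsPerInterval(maxSwaps) {}

    // Ask before starting a swap; returns false when the swap should be deferred.
    bool allowSwap(uint64_t nowCycle) {
        if (nowCycle - intervalStart >= intervalCycles) {
            intervalStart = nowCycle;        // start a new window
            swapsThisInterval = 0;
        }
        if (swapsThisInterval >= maxSwapsPerInterval)
            return false;
        ++swapsThisInterval;
        return true;
    }
};
```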

[Figures: Pennant performance vs. addition threshold; Pennant and snap-p0 page-size effects (page sizes of 2^9 to 2^15 bytes; 128M and 256M series); MLM performance vs. threshold for addT/LRU on CoMD, LAMMPS, Lulesh, and miniFE; and swap throttling vs. no throttling for the same applications.]

Conclusions

§ Application behavior varies significantly
§ Software management is feasible
  § Small to moderate number of dense call sites
  § Static allocation is sufficient in some cases; dynamic is necessary in others
§ For hardware management, the addition policy matters most
§ For both automatic and manual management, profiling is instrumental
  § Application managed: helps identify high-bandwidth data
  § Automatically managed: helps identify places where application changes will improve performance
