Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. SAND2017-4325 C
The Impact of Increasing Memory System Diversity on Applications
Gwen Voskuilen, Arun Rodrigues, Mike Frank, Si Hammond
4/25/2017
Diversity in memory hierarchy
§ New memory technologies: stacked DRAM, non-volatile
  § Advantages: higher bandwidth, persistence, etc.
  § Disadvantages: expensive, higher latencies
§ Solution: multi-level memory (MLM)
  § Mix memories to expose advantages, hide disadvantages
  § Really hard in practice
§ Must carefully locate data based on data characteristics
  § Which data goes into which memory? Who decides?
              Automatic             Manual
Hardware      Hardware decides
Software      OS/runtime decides    Application decides
Managing MLM
§ Trinity KNL: small stacked DRAM + large DDR DRAM
  § Application data footprint >> stacked DRAM capacity
§ Management requires identifying and selectively allocating data needing bandwidth into stacked DRAM (see the sketch below)
§ Study 1: Software management policies
  § Ranging from simple to complex
  § Developed an analysis tool, MemSieve, to evaluate memory behavior
§ Study 2: Hardware caching
  § Insertion/eviction policies
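As a rough illustration of what "selectively allocating data needing bandwidth into stacked DRAM" can look like on a Trinity-class KNL node, the sketch below uses the memkind hbwmalloc API to place one bandwidth-critical array in fast memory; the array name and size are hypothetical, and this is not the exact mechanism used in the studies.

// Illustrative sketch only: put one bandwidth-critical array in stacked DRAM
// via memkind's hbwmalloc API, falling back to DDR when no high-bandwidth
// memory is present. The array name and size are made up.
#include <hbwmalloc.h>
#include <cstdlib>

int main() {
    const size_t n = 1ull << 27;                       // ~1 GB of doubles (hypothetical)
    const bool have_hbw = (hbw_check_available() == 0);

    // Dense, frequently accessed data -> stacked DRAM; everything else -> DDR.
    double *flux = static_cast<double *>(
        have_hbw ? hbw_malloc(n * sizeof(double)) : std::malloc(n * sizeof(double)));
    if (!flux) return 1;

    for (size_t i = 0; i < n; ++i) flux[i] = 0.0;      // ... bandwidth-bound kernel ...

    have_hbw ? hbw_free(flux) : std::free(flux);
    return 0;
}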
Outline
§ Methodology
§ Software / manually managed MLM (malloc()-based)
§ Hardware / automatically managed MLM
§ Conclusions
Simulation Methodology
§ Mini-apps: HPCG, PENNANT, SNAP
  § Memory footprints of 1-8 GB → sampling required
§ Simulated on two architectures using SST
  § Lightweight: 72 small cores, mesh, private L1, semi-private L2
  § Heavyweight: 8 big cores, ring, private L1+L2, shared L3
  § Both: DDR and HMC memories
[Diagram: lightweight tiled architecture (cores with private L1, per-tile L2, mesh of tiles) and heavyweight architecture (cores with private L1/L2, ring of shared L3 slices), each attached to HMC and DDR memory controllers.]
Outline
§ Methodology
§ Software / manually managed MLM
§ Hardware / automatically managed MLM
§ Conclusions
Software approaches
Software management
§ OS / Runtime
  § Static
    § Greedily insert pages into HMC
    § Greedily insert mallocs into HMC
  § Dynamic (future work)
§ Programmer
  § Static
    § Direct "best" mallocs to HMC
  § Dynamic
    § Migrate to put current "best" in HMC
(OS/runtime → programmer control: increased performance?)
Tradeoff
§ Automatic (OS)
  § Easier for programmer
  § Able to capture allocations not under programmer control
    § Library, pre-program start, etc.
  § Page-table complexity; potentially expensive re-mapping
  § No program knowledge → worse performance?
    § Could use programmer hints or runtime profiling, but more work
§ Manual (programmer)
  § More work for programmer, pervasive(?) changes
  § Not able to handle all allocations
    § Possible conflicts between application and library allocations
    § What if libraries decide to manage allocation for internal structures too?
  § Knowledge of program behavior → better performance?
Analysis tool: MemSieve
§ Captures an application's memory accesses and correlates them to the application's memory allocations
  § Filters out cache hits
  § Without simulating the full memory hierarchy → 2.5X+ faster
§ Key measurement: malloc density
  § # accesses / size
§ Hypothesis: dense mallocs should be put in HMC (see the sketch below)
  § Assuming similar latencies between HMC and DDR
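To make the density metric concrete, here is a minimal sketch (assumed data structures and made-up numbers, not MemSieve's actual code) that accumulates filtered accesses and allocated bytes per malloc call site and ranks sites by accesses per byte:

// Minimal sketch of the key MemSieve measurement: per-call-site malloc
// density = accesses surviving the cache-hit filter / bytes allocated.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct SiteStats {
    uint64_t bytes = 0;      // total bytes malloc'd at this call site
    uint64_t accesses = 0;   // accesses that survived the cache-hit filter
    double density() const { return bytes ? double(accesses) / double(bytes) : 0.0; }
};

int main() {
    std::map<std::string, SiteStats> sites;            // keyed by malloc call trace

    // Hypothetical records; a real run would fill these from the profiler.
    sites["hydro.cc:42"] = {64ull << 20, 900000000};   // small and dense
    sites["mesh.cc:118"] = {8ull << 30, 50000000};     // large and sparse

    std::vector<std::pair<std::string, SiteStats>> ranked(sites.begin(), sites.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto &a, const auto &b) {
        return a.second.density() > b.second.density();
    });

    // Densest call sites first: these are the HMC candidates.
    for (const auto &[trace, s] : ranked)
        std::printf("%-12s %10.4f accesses/byte\n", trace.c_str(), s.density());
    return 0;
}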
Malloc analysis
                                    Pennant    HPCG       Snap*     MiniPIC
Malloc count                        8 B        23 M       1 B       438 K
Malloc size                         32.1 TB    7.43 GB    30 GB     7.9 GB
Distinct traces                     248        612        323       39043
Accessed traces                     140        146        90        10794
Size of accessed traces (% total)   89.7%      99.987%    89.6%     84%
*Iterations from beginning & middle only
§ Many mallocs but few distinct malloc call traces (locations)
§ Reasons mallocs are not accessed:
  § Same address malloc'd repeatedly → cache-resident
  § Malloc was not accessed in the profiled section of the application
Ideal malloc behavior
§ Good: a few, small, very dense mallocs
§ Bad: many, equally dense mallocs; densest are big
[Plots, good vs. bad malloc behavior: density per malloc (ordered from more dense to less dense) and cumulative % of accesses vs. size. Annotations: "Big density variation: less work to manage"; "Lots of accesses in a very small region".]
Malloc density
[Plots for PENNANT and HPCG: malloc call sites ordered from most to least dense, showing density (accesses/byte), cumulative % of total accesses, and cumulative size (TB for PENNANT, GB for HPCG).]
HMC Potential Performance
§ Max: 8X
§ Trends do not change with data set size (1-8 GB)
[Bar charts, heavyweight and lightweight architectures: speedup over all-DDR with all data in HMC for MiniPIC (charge, field, move), SNAP (p0, p1/2), HPCG, and PENNANT.]
Manual allocation: PENNANT
§ Large performance jump from 25% to 50% HMC
§ Dynamic migration necessary (see the sketch below)
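The "dynamic" strategy amounts to moving the currently best allocations into the limited HMC between phases. A hedged sketch, again using memkind's hbwmalloc API as a stand-in for the fast-memory allocator (the Buffer bookkeeping and the helper are hypothetical):

// Hypothetical sketch of dynamic migration: copy a buffer into HMC before a
// bandwidth-bound phase and back to DDR afterwards, preserving its contents.
// Real migration also has to fix up every pointer into the buffer, which is
// part of what makes the manual approach pervasive.
#include <hbwmalloc.h>
#include <cstdlib>
#include <cstring>

struct Buffer { void *ptr; size_t bytes; bool in_hmc; };

static void migrate(Buffer &b, bool to_hmc) {
    if (b.in_hmc == to_hmc) return;
    void *dst = to_hmc ? hbw_malloc(b.bytes) : std::malloc(b.bytes);
    if (!dst) return;                                  // out of space: leave it where it is
    std::memcpy(dst, b.ptr, b.bytes);
    b.in_hmc ? hbw_free(b.ptr) : std::free(b.ptr);
    b.ptr = dst;
    b.in_hmc = to_hmc;
}

// Usage idea: migrate(zone_fields, true); run_hot_phase(); migrate(zone_fields, false);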
[Bar chart: PENNANT speedup over DDR-only for greedy-page, greedy-malloc, static, and dynamic allocation with HMC sized at 12.5%, 25%, 50%, and 100% of the footprint.]
Manual allocation: HPCG
§ Again, large jump from 25% to 50% HMC
§ Page-based & static perform similarly
§ Dynamic not better
  § But granularity of migration is large
[Bar chart: HPCG speedup over DDR-only for greedy-page, greedy-malloc, static, and dynamic allocation with HMC at 12.5%, 25%, 50%, and 100%.]
Manual allocation: SNAP
§ Greedy-page performs the best
§ Two large mallocs in SNAP, each 42% of total
  § Medium/low density
  § Once they don't fit, HMC size / malloc strategy doesn't matter
§ Suggested code change (see the sketch below)
  § Break up large mallocs to improve HMC utilization
[Bar chart: SNAP speedup over DDR-only for greedy-page, greedy-malloc, static, and dynamic allocation with HMC at 12.5%, 25%, 50%, and 100%.]
Topology comparison
§ Similar trend
[Bar charts: PENNANT speedup over DDR-only on the heavyweight and lightweight architectures for greedy-page, greedy-malloc, static, and dynamic allocation with HMC at 12.5%, 25%, 50%, and 100%.]
Outline
§ Methodology
§ Software / manually managed MLM
§ Hardware / automatically managed MLM
§ Conclusions
Hardware management
§ Hardware management of MLM at the page level (see the sketch below)
  § Cache pages in HMC; the page still resides in DDR
  § Compared to block level: lower tracking overhead but higher add/remove overhead
§ Focus was hardware caching
  § But also possible to do caching via the OS
  § Usually with less information (hits, misses, etc.)
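A simplified model of what page-level caching asks of the hardware (assumed structures, not the simulator's implementation): a mapping table from DDR page to HMC frame redirects hits, while the page remains valid in DDR.

// Simplified model of inclusive page-level caching in the MLM unit.
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageBytes = 4096;                  // assumed page size

struct MLMUnit {
    std::unordered_map<uint64_t, uint64_t> mapping;    // DDR page number -> HMC frame number

    // Translate a physical address; returns true if serviced from HMC.
    bool translate(uint64_t addr, uint64_t &out_addr) const {
        const uint64_t page = addr / kPageBytes;
        const auto it = mapping.find(page);
        if (it == mapping.end()) { out_addr = addr; return false; }   // miss: go to DDR
        out_addr = it->second * kPageBytes + addr % kPageBytes;       // hit: redirect to HMC
        return true;
    }
};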
Automatic Page-Level Swapping
§ Addition policies
§ Replacement policies
[Diagram: MLM unit between the directory controller and the DDR and fast (HMC) memories, containing a mapping table, policy logic, dispatcher, and DMA engine.]
Addition policies:
§ addT: simple threshold
§ addMFU: most frequently used
§ addRAND: 1:8192 chance
§ addMRPU: more recent previous use
§ addMFRPU: more frequent + more recent previous use
§ addSC: deprioritize streams
§ addSCF: as addSC + more frequent

Replacement policies:
§ FIFO: first-in, first-out
§ LRU: least recently used
§ LFU: least frequently used
§ LFU8: LFU with 8-bit counter
§ BiLRU: bimodal LRU
§ SCLRU: deprioritize streams
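As one concrete pairing from the lists above, here is a minimal sketch of addT (insert a page once its access count crosses a threshold) combined with LRU replacement; the class and interface are illustrative, not the simulator's code.

// Minimal sketch: addT addition policy + LRU replacement for the HMC page cache.
#include <cstdint>
#include <list>
#include <unordered_map>

class AddTPolicy {
public:
    static constexpr uint64_t kNone = UINT64_MAX;

    AddTPolicy(size_t hmc_pages, uint32_t threshold)
        : capacity_(hmc_pages), threshold_(threshold) {}

    // Called on every access. Returns the page evicted from HMC, or kNone.
    uint64_t access(uint64_t page) {
        auto pos = lru_pos_.find(page);
        if (pos != lru_pos_.end()) {                    // already in HMC: refresh LRU order
            lru_.splice(lru_.begin(), lru_, pos->second);
            return kNone;
        }
        if (++counts_[page] < threshold_) return kNone; // addT: not hot enough yet

        uint64_t victim = kNone;
        if (lru_.size() >= capacity_) {                 // LRU replacement
            victim = lru_.back();
            lru_pos_.erase(victim);
            lru_.pop_back();
        }
        lru_.push_front(page);                          // add the new page to HMC
        lru_pos_[page] = lru_.begin();
        counts_.erase(page);
        return victim;
    }

private:
    size_t capacity_;
    uint32_t threshold_;
    std::unordered_map<uint64_t, uint32_t> counts_;     // access counts for pages not yet in HMC
    std::list<uint64_t> lru_;                           // front = most recently used
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> lru_pos_;
};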
Performance vs. Policy
[Bar charts: Lulesh and MiniFE MLM performance vs. addition policy (addMFRPU, addMFU, addMRPU, addRAND, addSC, addT, addSCF), with one bar per replacement policy (BiLRU, FIFO, LRU, SCLRU, LFU8, LFU).]
Addition policy: big variation
Replacement policy: little variation
“What you put in matters more than what you take out”
Larger data sets
§ Looked at the highest-performing addition policies
  § Variants of most-frequently-used
  § Baseline: random
  § LRU replacement
[Plots: Pennant-b, Snap-p0, and HPCG performance (1 = no fast memory) vs. number of HMC pages (1024, 8192, 65536) for addMFRPU, addRAND, addSCF, addMFU, and all-fast-memory.]
Fine Tuning
1. Thresholds
2. Page size
3. Throttling
[Plots: Pennant performance vs. addT threshold; Pennant and snap-p0 page-size effects (relative performance vs. page size 2^x B; series 128M, 256M); MLM performance vs. threshold (addT/LRU) for CoMD, lammps, lulesh, and miniFE; and swap throttling (throttle vs. no throttle) for the same applications.]
Conclusions
§ Application behavior varies significantly
§ Software management is feasible
  § Small to moderate number of dense call sites
  § Static allocation sufficient in some cases, dynamic necessary in others
§ For hardware management, addition policy matters most
§ For automatic & manual, profiling is instrumental
  § Application managed: helps identify high-bandwidth data
  § Automatically managed: helps identify places where application changes will improve performance