Post on 19-Jan-2016
EKIVOLOS: Self-Contained, Accurate Precomputation Prefetching
Islam Atta
Xin Tong
Andreas Moshovos
Viji Srinivasan
Ioana Baldini
[Figure: growth of the digital universe, from 4.4 ZB in 2013 to 44 ZB in 2020. Source: EMC Digital Universe Study. Graphic credit: www.editeddaily.com]
Prefetching is the traditional remedy
Unconventional Data Sources
Unstructured & Semi-Structured: logs, XML, sparse matrices, graphs
Memory-Bound
Hardware Prefetchers
History of Accesses + Current State → Predict Future Accesses
History-based predictions may not be sufficient!
Non-Repetitive, Irregular Accesses
Precomputation-Based Prefetching (in contrast to Prediction-Based Prefetching)
Precomputation Prefetchers
Program Slice + Current State → Precompute Future Accesses
[Figure: timeline with two hardware contexts. Context 0 runs the Main Thread; Context 1 runs the Precomputation Slice (P-Slice). Labels: Prefetch, Target Load, Load, Hit, Shared Cache (LLC), Memory.]
Delinquent Load: a problematic load that accounts for a significant share of memory stalls.
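The definition of a delinquent load above can be made concrete with a small sketch. It assumes a per-load profile of memory stall cycles is already available; the function name, the `coverage` threshold, and the profile numbers are hypothetical and are not taken from the slides.

```python
def delinquent_loads(stall_cycles_by_pc, coverage=0.9):
    """Pick the loads that together account for `coverage` of all stall cycles.

    stall_cycles_by_pc maps a load's PC to the memory stall cycles attributed
    to it. The 0.9 default is an illustrative assumption, not a value from
    the talk.
    """
    total = sum(stall_cycles_by_pc.values())
    if total == 0:
        return []
    picked, covered = [], 0
    # Visit loads from the most to the least stall-heavy.
    ranked = sorted(stall_cycles_by_pc.items(), key=lambda kv: kv[1], reverse=True)
    for pc, cycles in ranked:
        if covered >= coverage * total:
            break
        picked.append(pc)
        covered += cycles
    return picked


# Hypothetical profile: one load dominates the stall time.
profile = {0x9576: 9000, 0x956E: 600, 0x9572: 300, 0x9588: 100}
top = delinquent_loads(profile)
```

With this made-up profile a single load covers 90% of the stalls, so it would be the only prefetch target.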
Yet Another Precomputation Prefetcher?
Past work constructed P-slices…
• Manually: burdensome task
• At compile time: requires source code, dense P-slices
• From binary traces: inaccurate P-slices
P-slices ought to be accurate and fast.
Re-design binary-based implementations to prioritize accuracy.
Conventional Binary-based Methods Over-Simplify P-slices
Correctness: Do not modify the state of the main thread.
Fast: Aggressively optimize a p-slice.
• Ignore Control Flow
• Ignore Memory Dependencies
Potential Inaccuracy → “Monitor & Correct” Mechanisms (Abort & Restart), Variable Run-ahead Distance
[Figure: iterations vs. time. Curves: Main Thread; Inaccurate, Light P-slice. Labels: α, Abort & Restart.]
Applications with intense code divergence or memory dependencies foil “Monitor & Correct” mechanisms.
Paradigm shift – Accuracy-First P-slice
Memory Dependencies: p-slice uses a local store buffer.
Control Flow: Merge multiple traces, instead of the single dominant trace.
All data dependencies can be maintained.
• No monitoring
• Accurate
• Maybe slightly slower, but can still run ahead
Accurately replicates the main thread’s execution path.
Eventually higher run-ahead distance.
[Figure: iterations vs. time. Curves: Main Thread; Inaccurate, Light P-slice; More Accurate, Denser P-slice. Label: α.]
EKIVOLOS: “Slow and Steady Wins the Race”
Sparse Matrices
Web Graphs, Circuit Simulation, DNA Analysis, Social Networks, Graph Partitioning, Clustering, Fluid Dynamics
Example of Hard-to-Predict Accesses
SpVM: Sparse-Vector Sparse-Matrix Multiplication
[Figure: V[] x M[][] = RV[]. Arrays: V_val, V_idx, M_val, M_idx, M_begin. Access-pattern labels: Linear; Linear, Fragmented.]
Outer: Scan over V_idx[], find the corresponding row.
Inner: Scan over M_idx[], find the corresponding RV[].
RV Accesses: History does not entail Future
Random!
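The two loops above can be sketched as follows. This is a minimal model, assuming a CSR-like layout in which `m_begin[row]` and `m_begin[row + 1]` bracket the entries of a row in `m_idx`/`m_val`, and assuming RV is a dense result vector; the slides do not spell out the layout, and the example data is made up.

```python
def spvm(v_idx, v_val, m_begin, m_idx, m_val, n_cols):
    """Sparse-vector times sparse-matrix; also records which RV slots are touched."""
    rv = [0.0] * n_cols
    rv_touched = []
    for k in range(len(v_idx)):            # outer: scan over V_idx[]
        row = v_idx[k]                     # find the corresponding row
        for j in range(m_begin[row], m_begin[row + 1]):  # inner: scan over M_idx[]
            col = m_idx[j]                 # data-dependent index into RV[]
            rv[col] += v_val[k] * m_val[j]
            rv_touched.append(col)
    return rv, rv_touched


# Hypothetical 3x4 matrix with rows {3, 0}, {2}, {1, 3} and a vector
# that has nonzeros in positions 0 and 2.
m_begin = [0, 2, 3, 5]
m_idx = [3, 0, 2, 1, 3]
m_val = [1.0, 2.0, 3.0, 4.0, 5.0]
rv, touched = spvm([0, 2], [10.0, 0.5], m_begin, m_idx, m_val, n_cols=4)
```

The sequence of RV slots in `touched` is dictated by the contents of `m_idx`, not by any stride, which is why a history-based prefetcher has little to learn from it.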
Binary-based P-Slice Construction
Goal: Pre-Compute RV Addresses
Steps, in order over time:
0. Identify Delinquent Load
1. Execute on the CPU
2. Collect Instruction Trace
3. Identify Dominant Loop
4. Apply Backward Slicing
[Instruction trace: repeated iterations of the loop at 0x9566 to 0x9590, with occasional passes through the outer-loop code at 0x9592 to 0x95a2 and 0x9538 to 0x9564.]
Dominant loop:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9584 cmp r4, r7
0x9586 blt 0x95b4
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956a cmp r3, r4
0x956c bge 0x95d0
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
SpVM P-Slice: Backward Slicing
Dominant loop (the last instruction is the Delinquent Load):
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9584 cmp r4, r7
0x9586 blt 0x95b4
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956a cmp r3, r4
0x956c bge 0x95d0
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
1. Eliminate Control Flow:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
2. Eliminate Stores:
0x957a mla r5, r6, ip, r5
0x9582 ldr r4, [r0, #16]
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
3. Retain Only Register Dependencies:
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
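The three slicing steps above can be sketched as a single backward walk over the dominant loop. This is an illustrative reconstruction of the conventional dominant-path method, not the authors' tool. The `Insn` record, the hand-written destination/source annotations, and the function name are assumptions made for this example.

```python
from collections import namedtuple

# kind is one of "alu", "load", "store", "control"; dest/srcs are register names.
Insn = namedtuple("Insn", "addr text kind dest srcs")

# The dominant loop from the slide, annotated by hand (annotations are assumptions).
LOOP = [
    Insn(0x957A, "mla r5, r6, ip, r5", "alu", "r5", ("r6", "ip", "r5")),
    Insn(0x957E, "str r5, [r2,r4,lsl#2]", "store", None, ("r5", "r2", "r4")),
    Insn(0x9582, "ldr r4, [r0, #16]", "load", "r4", ("r0",)),
    Insn(0x9584, "cmp r4, r7", "control", None, ("r4", "r7")),
    Insn(0x9586, "blt 0x95b4", "control", None, ()),
    Insn(0x9588, "ldr r5, [sl]", "load", "r5", ("sl",)),
    Insn(0x958C, "add r3, #1", "alu", "r3", ("r3",)),
    Insn(0x958E, "cmp r3, r5", "control", None, ("r3", "r5")),
    Insn(0x9590, "blt 0x9566", "control", None, ()),
    Insn(0x9566, "ldr r4, [r0, #12]", "load", "r4", ("r0",)),
    Insn(0x9568, "add r1, #4", "alu", "r1", ("r1",)),
    Insn(0x956A, "cmp r3, r4", "control", None, ("r3", "r4")),
    Insn(0x956C, "bge 0x95d0", "control", None, ()),
    Insn(0x956E, "ldr r4, [r9,r1]", "load", "r4", ("r9", "r1")),
    Insn(0x9572, "ldr r6, [r8,r1]", "load", "r6", ("r8", "r1")),
    Insn(0x9576, "ldr r5, [r2,r4,lsl#2]", "load", "r5", ("r2", "r4")),
]


def backward_slice(trace, target_addr):
    """Return (slice in program order, live-in registers) for the last target."""
    idx = max(i for i, ins in enumerate(trace) if ins.addr == target_addr)
    target = trace[idx]
    live = set(target.srcs)          # registers the slice still needs
    kept = [target]
    for ins in reversed(trace[:idx]):
        if ins.kind in ("control", "store"):
            continue                 # eliminate control flow and stores
        if ins.dest in live:         # retain only register dependencies
            live.discard(ins.dest)
            live.update(ins.srcs)
            kept.append(ins)
    kept.reverse()
    return kept, live


p_slice, live_in = backward_slice(LOOP, 0x9576)
```

On this loop the walk keeps the same three instructions as the slide (0x9568, 0x956e, 0x9576). The registers left in `live_in` are the values the p-slice would need to copy from the main thread before it starts.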
Inner-most Dominant Loop
[Figure: V[] x M[][] = RV[]. Arrays: V_val, V_idx, M_val, M_idx, M_begin.]
Dominant-Path P-slice:
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
Fails to Pre-Compute RV Addresses for Multiple Rows
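One way to see the failure is to model the dominant-path p-slice as a plain linear walk over M_idx, since its only address update is `add r1, #4`, and to compare it with the rows the main thread actually visits. This is a simplified sketch under that assumption; the function names and data are hypothetical.

```python
def actual_rv_indices(v_idx, m_begin, m_idx):
    """RV slots the main thread touches: one M row per nonzero of V."""
    out = []
    for row in v_idx:
        out.extend(m_idx[m_begin[row]:m_begin[row + 1]])
    return out


def dominant_path_rv_indices(start, count, m_idx):
    """RV slots an inner-loop-only slice computes: a linear walk over M_idx.

    The slice has no outer-loop code, so it cannot jump to the next row
    selected by V_idx; it simply keeps striding from `start`.
    """
    return m_idx[start:min(start + count, len(m_idx))]


m_begin = [0, 2, 3, 5]
m_idx = [3, 0, 2, 1, 3]
v_idx = [0, 2]                      # the vector selects rows 0 and 2, skipping row 1

actual = actual_rv_indices(v_idx, m_begin, m_idx)
predicted = dominant_path_rv_indices(m_begin[v_idx[0]], len(actual), m_idx)
matches = sum(a == p for a, p in zip(actual, predicted))
```

In this made-up case the walk is correct only for the first row; once the main thread skips to row 2, the slice computes RV slots that are never used.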
EKIVOLOS
• Memory Dependencies: Local Store Buffer
• Keep Control Flow: Merge Multiple Traces
Maintains All Data Dependencies
Accurately Replicates Main Thread’s Execution Path
Outer trace:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9592 ldr r3, [sp,#16]
0x9594 ldr r5, [r3,#8]
0x9596 ldr r1, [sp,#4]
0x9598 add r1, #1
0x959a str r1, [sp,#4]
0x959c cmp r1, r5
0x959e bge 0x95c2
0x95a2 ldr r3, [r1,#4]!
0x9538 add r7, r3, #1
0x953a ldr r3, [fp,r3,lsl#2]
0x9546 add sl, fp, r7, lsl#2
0x9554 ldr r1, [r0,#4]
0x9556 mov r8, r3, lsl#2
0x955a ldr r4, [r0,#0]
0x955c add r9, r1, r8
0x9560 mov r1, #0
0x9562 add r8, r4
0x9564 b 0x956e
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
Inner trace:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9584 cmp r4, r7
0x9586 blt 0x95b4
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956a cmp r3, r4
0x956c bge 0x95d0
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
[Figure: Prefetch Core with a Local Store Buffer (LSB), L1 and L2.]
Simple Algorithm, Much Better Accuracy
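The role of the local store buffer can be sketched in a few lines. The outer trace stores a counter with `str r1, [sp,#4]` and later reloads it with `ldr r1, [sp,#4]`, so a p-slice that keeps such stores needs somewhere private to put them. The class below is a behavioural model only; the capacity, the oldest-first eviction, and all names are assumptions, since the slides do not describe the LSB's organization.

```python
from collections import OrderedDict


class LocalStoreBuffer:
    """Private store buffer for a p-slice: stores never reach shared memory."""

    def __init__(self, memory, capacity=8):
        self._memory = memory            # main-thread state; only ever read here
        self._buf = OrderedDict()        # addr -> value written by the p-slice
        self._capacity = capacity        # assumed size, not from the slides

    def store(self, addr, value):
        if addr in self._buf:
            self._buf.move_to_end(addr)
        elif len(self._buf) >= self._capacity:
            self._buf.popitem(last=False)    # evict the oldest entry (assumed policy)
        self._buf[addr] = value

    def load(self, addr):
        # A p-slice load first checks its own earlier stores, then memory.
        if addr in self._buf:
            return self._buf[addr]
        return self._memory[addr]


memory = {0x1000: 3}                     # hypothetical stack slot holding a counter
lsb = LocalStoreBuffer(memory)
before = lsb.load(0x1000)                # read the main thread's value
lsb.store(0x1000, before + 1)            # the p-slice advances its private copy
after = lsb.load(0x1000)
```

The p-slice sees its own update on the next load while `memory` is left untouched, which is what lets it follow memory dependencies without modifying the main thread's state.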
Evaluation – Methodology
System Setup
ESESC Simulator, ARM ISA
Main core: Out-of-Order, 3 GHz
Prefetch core: In-Order, 3 GHz
Area & Energy: McPAT 1.2
Evaluated Workloads Represent
• Computational Biology
• Data Mining
• Floating Point Differential
• Graph Search
• Hash Table Joins
• Image Processing
• Optimization
• Scheduling
• Simulation
• Sorting
• Sparse Matrix Multiplication
• Support Vector Machines
Key Results
[Figure: two charts, each with an arrow marked “Better”. LLC Misses: Normalized MPKI, axis 0 to 1. Speedup: Relative Speedup, axis 1 to 2.8.]
Configurations: Dominant-Path Precomputation Prefetcher; Ekivolos (Control Flow Only); Ekivolos (Control Flow and Memory Dependencies)
Callout labels: Energy 10%; Control Flow; Memory Dependencies; 70%; 267% (0-12X)
Compared prefetchers:
• SMS: Spatial Address Correlation
• AMPM: Pattern Matching
• PC/AC: Address Correlation with PC-Localization
• Ekivolos+ASP: Adding Simple Stream Prefetcher
Limitations of Ekivolos
Currently Requires Offline Profiling
Effectiveness Depends on Profiling Input
Targets Only Delinquent Loads
Future Work Directions
• Enhancements: Online Profiling
• P-Core Architecture: “In-Memory” or “In-Cache” Processing
• Benchmarks: Suites Suitable for Architectural Studies; Big Data; Diverse Memory Access Patterns
What We Learned
P-slices Need Not be Aggressively Optimized
Simple Algorithm: Control Flow & Memory Dependencies
Emerging Algorithms not Studied Before
Prefetch-cores can be Simplified