ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch,
Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry

ASPLOS '10
Software Errors & Analysis Tools
• Errors are abundant in parallel software
  – Program crashes/vulnerabilities, limited performance
• Three main categories of analysis tools
  – Checking before, during, or after program execution
• Instruction-grain lifeguards
  – Online, detailed analysis, but with high overhead
  – Several tools available, but mostly supporting single-threaded code
© Evangelos Vlachos
ParaLog: a framework for efficient analysis of parallel applications
Lifeguards and Parallel Applications
[Diagram: application threads over time under three monitoring schemes]
• Timesliced Execution & Analysis – the DBI tools available today
  + Software-based
  – High overhead due to serialization
• Parallel Execution & Analysis with windows of uncertainty – Butterfly Analysis (previous talk)
  + Software-based
  – Some false positives
• Parallel Execution & Analysis with precise application order – ParaLog (this talk)
  + No false positives
  + Even better performance
  – New hardware required
Low-Overhead Instruction-level Analysis
• Online monitoring platform [Chen et al., ISCA '08]
  – Application core: runs the application thread; event capturing produces an event stream
  – Lifeguard core: event delivery feeds the lifeguard thread, which updates the metadata
  – Lifeguard accelerators: IT, IF, M-TLB
• Example: for the application instruction "add r1, r2, r4", the lifeguard runs

    add_handler() {
      i = load_state(r2);
      j = load_state(r4);
      if (check(i, j)) upd_state(r1);
      else error();
    }
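The handler above can be modeled in software. Below is a minimal Python sketch of a TaintCheck-style add handler; the function names mirror the slide's pseudocode, but the taint-union propagation policy and the always-true check are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal software model of an instruction-grain lifeguard handler.
# Names (load_state, upd_state, check) mirror the slide's pseudocode;
# the taint-union policy is an illustrative assumption.

metadata = {}  # register/location -> taint state (0 = clean, 1 = tainted)

def load_state(loc):
    return metadata.get(loc, 0)

def upd_state(loc, value):
    metadata[loc] = value

def check(i, j):
    return True  # illustrative: a real lifeguard validates operand states

def error():
    raise RuntimeError("lifeguard check failed")

def add_handler(dst, src1, src2):
    """Mirrors 'add r1, r2, r4': derive dst's metadata from the sources."""
    i = load_state(src1)
    j = load_state(src2)
    if check(i, j):
        upd_state(dst, i | j)  # dst is tainted if either source is
    else:
        error()

upd_state("r2", 1)               # r2 holds tainted input
add_handler("r1", "r2", "r4")    # models the event for 'add r1, r2, r4'
print(metadata["r1"])            # -> 1 (taint propagated to r1)
```

Per-instruction handlers like this are exactly what the accelerators (M-TLB, IF, IT) exist to speed up.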
Challenges in Parallel Monitoring
• Online parallel monitoring platform [ParaLog]
  – Application threads 1..k, each with event capturing feeding a per-thread event stream
  – Event delivery to lifeguard threads 1..k, each with its own accelerators (IT, IF, M-TLB)
  – The lifeguard threads now share global metadata
Addressing the Challenges
1. Application event ordering
2. Ensuring metadata access atomicity efficiently
3. Parallelizing hardware accelerators
• In ParaLog, each application thread's event capturing also performs application-only order capturing; dependence arcs travel with the event streams, and each lifeguard thread's event delivery performs order enforcing before touching the global metadata
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
  1. Capturing & enforcing application event ordering
  2. Ensuring metadata access atomicity
  3. Parallelizing hardware accelerators
• Evaluation
• Conclusions
Event Ordering: the Problem
• Case study: information-flow analysis (e.g., TaintCheck)
• Application (time flows down): thread j executes store(A); later, thread k executes load(A)
• Lifeguard: without ordering, thread k may run ld_handler(A) before thread j runs st_handler(A), so the load is checked against stale metadata
• Goal: expose happens-before information to lifeguards
Event Ordering: the Solution (1/2)
• Coherence-based ordering of application events
  – Similar to FDR, but online, and focusing on application-only events
• Each application event carries a timestamp: thread j's store(A) at time tj happens-before thread k's load(A) at time tk, and the coherence protocol captures this arc
• The lifeguard enforces the arc: before running ld_handler(A), lifeguard thread k waits while progressj < tj, where progressj is lifeguard thread j's published progress
• Is monitoring coherence enough?
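The ordering mechanism can be sketched in a few lines; the event tuples and the progress table below are illustrative stand-ins for the hardware's per-thread timestamps and captured dependence arcs.

```python
# Sketch of order enforcement via per-thread progress counters.
# Dependence arcs (captured from coherence traffic) make a consumer's
# handler wait until the producer's handler has finished; in hardware
# this is a stall in event delivery, not a software loop.

progress = {"j": 0, "k": 0}  # last event timestamp handled per thread
log = []

def try_handle(thread, ts, deps, name):
    """Run a handler only if all dependence arcs are satisfied."""
    if all(progress[t] >= needed for t, needed in deps):
        log.append(name)       # the handler body would run here
        progress[thread] = ts  # publish progress
        return True
    return False

# Thread j's store(A) at t=1 happens-before thread k's load(A),
# so k's event carries the arc ("j", 1):
pending = [
    ("k", 1, [("j", 1)], "ld_handler(A)"),
    ("j", 1, [],         "st_handler(A)"),
]
while pending:
    pending = [e for e in pending if not try_handle(*e)]

print(log)  # -> ['st_handler(A)', 'ld_handler(A)']
```

Even though the load's event is available first here, its handler runs second, matching the application's happens-before order.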
Event Ordering: the Solution (2/2)
• Previous work has not solved the problem of logical races
• Both logical races and system calls are resolved with Conflict Alert messages
• Example: thread j executes free(A) while thread k executes load(A), with no coherence arc between them, yet both handlers touch Metadata(A) (a logical race)
  – The free(A) handler issues a "free(A) start" Conflict Alert before updating Metadata(A) and a "free(A) end" alert after, creating a dependence that orders thread k's ld_handler(A) against the whole update
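A logical race like the free(A)/load(A) pair above can be sketched as follows; the in_conflict set stands in for the Conflict Alert start/end bracketing, and the assert models the platform stalling delivery (all names are illustrative).

```python
# Sketch of Conflict Alert bracketing for a logical race: free(A)'s
# handler marks Metadata(A) busy between "start" and "end" alerts, so
# a racing ld_handler(A) is ordered before or after the whole update
# instead of observing it half-done.

metadata = {"A": "allocated"}
in_conflict = set()  # addresses whose metadata is mid-update

def free_handler(addr):
    in_conflict.add(addr)      # Conflict Alert: free(A) start
    metadata[addr] = "freed"
    in_conflict.discard(addr)  # Conflict Alert: free(A) end

def ld_handler(addr):
    assert addr not in in_conflict  # the platform would stall delivery
    if metadata[addr] == "freed":
        return "error: use-after-free"
    return "ok"

free_handler("A")
print(ld_handler("A"))  # -> error: use-after-free
```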
Metadata Atomicity
• Frequent use of locking is too expensive
  – # of instructions added & synchronization cost
• Dependence arcs handle the majority of the cases
  – Sufficient conditions:
    1. One-to-one data-to-metadata mapping
    2. Application reads don't become metadata writes
  – Enforcing dependence arcs => race-free operation
• Rest of the cases handled by acquiring a lock
  – Lock used only in the load_handler(); other handlers are safe
(more details in the paper)
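The fast-path/slow-path split can be sketched as below. Which handler locks is taken from the slide (only the load_handler()); the metadata contents and the read-side update are illustrative.

```python
# Sketch of "lock only in the load_handler()": store-side handlers are
# already serialized by enforced dependence arcs (fast path, no lock);
# a load handler that must also WRITE metadata has no arc ordering it
# against other concurrent loads, so it takes a lock (slow path).

import threading

meta_lock = threading.Lock()
metadata = {}

def st_handler(addr, state):
    # Fast path: ordered by dependence arcs; lock-free.
    metadata[addr] = state

def ld_handler(addr):
    # Slow path: an application read whose handler updates metadata.
    with meta_lock:
        state = metadata.get(addr, 0)
        metadata[addr] = state  # illustrative read-side metadata write
        return state

st_handler("A", 7)
print(ld_handler("A"))  # -> 7
```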
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
  – Metadata-TLB: fast metadata address calculation
  – Idempotent Filters: filter out redundant checking
  – Inheritance Tracking: fast tracking of dataflow paths
• Accelerators have only a local view of the analysis
  – They cache analysis information locally (e.g., frequent events)
  – Important events have application-wide effects (e.g., free())
  – Coherence-like issues with accelerators' local state
• Important events are accompanied by Conflict Alerts
  – Use Conflict Alerts to flush accelerators' state
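The first accelerator's job, fast metadata address calculation, amounts to a fixed linear mapping; a sketch follows, where the shadow base address and the data-to-metadata ratio are illustrative assumptions, not the paper's values.

```python
# Sketch of Metadata-TLB-style translation: metadata lives at a fixed
# linear mapping from the application address, so the common case is a
# shift-and-add instead of a software page-table walk. SHADOW_BASE and
# BYTES_PER_META are illustrative constants.

SHADOW_BASE = 0x20000000
BYTES_PER_META = 4  # 4 application bytes share 1 metadata byte

def meta_addr(app_addr):
    return SHADOW_BASE + app_addr // BYTES_PER_META

print(hex(meta_addr(0x1000)))  # -> 0x20000400
```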
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
  – Capturing & enforcing application event ordering
  – Ensuring metadata access atomicity
  – Parallelizing hardware accelerators
• Evaluation
• Conclusions
Experimental Framework
• Log-Based Architectures framework
  – Simics full-system simulation
  – CMP system with {2, 4, 8, 16} cores
  – {1, 2, 4, 8} application and lifeguard threads
  – Sequentially consistent memory model
• Benchmarks and multithreaded lifeguards used
  – SPLASH-2 and PARSEC
  – TaintCheck: information-flow tracking; accelerated by M-TLB, IT
  – AddrCheck: memory-access checking; accelerated by M-TLB, IF
• Comparison with Timesliced Monitoring
Performance Results: AddrCheck
[Chart: normalized execution time for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, RADIOSITY, and the geometric mean, with 1/2/4/8 threads; bars compare No Monitoring, Timesliced Monitoring, and ParaLog]
• 8 application/lifeguard threads, 16 cores total
• Normalized to sequential, unmonitored execution
Performance Results: AddrCheck
[Same chart; off-scale Timesliced Monitoring bars reach 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4]
• Timesliced Monitoring is not scalable
• On average 15x slowdown over No Monitoring (8 threads)
Performance Results: AddrCheck
[Same chart]
• Highest overhead with 8 threads: SWAPTIONS, 6x
• Lowest overhead with 8 threads: < 5%
• Average overhead with 8 threads: 26%
Performance Results: TaintCheck
[Chart: normalized execution time for the same benchmarks and thread counts; bars compare No Monitoring, Timesliced Monitoring, and ParaLog]
Performance Results: TaintCheck
[Same chart; off-scale Timesliced Monitoring bars reach 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7]
• Timesliced Monitoring is not scalable
• On average 23x slowdown over No Monitoring (8 threads)
Performance Results: TaintCheck
[Same chart]
• Highest overhead with 8 threads: BARNES, 2.6x
• Lowest overhead with 8 threads: LU, 5%
• Average overhead with 8 threads: 48%
Other Results in the Paper
• Order capturing and order enforcing under TSO
• Performance impact of lifeguard accelerators
  – AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]
• A less expensive order-capturing mechanism gets similar performance results
  – 1 timestamp per core vs. 1 timestamp per cache block
Conclusions
• ParaLog: fast and precise parallel monitoring
• Components of event ordering
  – Normal memory accesses: monitor coherence activity
  – Logical races: use Conflict Alert messages
• Metadata atomicity
  – Enforcing dependence arcs ensures atomicity (most cases)
• Parallel hardware accelerators
  – Flush local state on remote events (Conflict Alert)
• Average overhead is relatively low
  – AddrCheck: 26% and TaintCheck: 48% (8 threads)
Questions?

Backup Slides
Metadata Atomicity
• Synchronization-free fast path vs. slow path
  – Concurrent application reads: no ordering available!
    • Concurrent metadata reads: follow the fast path
    • Concurrent metadata writes: follow the slow path, acquiring a lock
    • Concurrent metadata read and write: the read may get either value
  – In any other case dependence arcs are available
[Table: application event type (R, W) vs. the metadata action it triggers (R, W) for AddrCheck, TaintCheck, MemCheck, and LockSet]
Parallel Hardware Accelerators
• Accelerators have only a local view of the analysis
  – Important events have system-wide effects
  – Case study: Idempotent Filters and AddrCheck
[Diagram: lifeguards LG 0 and LG 1 each receive streams of R(A), R(B), R(C) events through their Idempotent Filters; first accesses are delivered to the lifeguard, repeats are redundant and discarded. A free(A) event flushes both the local and remote IF filters, building on Remote Conflict Messages.]
• Details for parallel M-TLB and IT can be found in the paper
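The filter behavior in the case study can be sketched as a per-lifeguard cache that a free(A) Conflict Alert flushes everywhere; the class and method names below are illustrative, not the hardware interface.

```python
# Sketch of an Idempotent Filter: a per-lifeguard cache of already
# checked addresses discards redundant check events. A free(A) event
# arrives with a Conflict Alert that flushes BOTH the local and remote
# filters, killing the stale "already checked" entry for A.

class IdempotentFilter:
    def __init__(self):
        self.seen = set()       # local view of what was checked
        self.delivered = []     # events actually delivered to lifeguard

    def event(self, addr):
        if addr in self.seen:
            return False        # redundant; discarded
        self.seen.add(addr)
        self.delivered.append(addr)
        return True

    def flush(self):
        self.seen.clear()

lg0, lg1 = IdempotentFilter(), IdempotentFilter()
lg0.event("A"); lg0.event("A")   # second R(A) is filtered out
lg1.event("A")

# free(A): the Conflict Alert flushes the local AND remote filters
for f in (lg0, lg1):
    f.flush()

print(lg0.event("A"))  # -> True: A must be re-checked after the free
```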
Performance Impact of Lifeguard Accelerators
[Chart: TaintCheck slowdown (8 threads), not accelerated vs. accelerated, per benchmark. Not accelerated: 5.7, 9.4, 6.8, 4.2, 4.3, 5.4, 7.3, 11.3; accelerated: 2.6, 1.0, 1.1, 1.3, 1.4, 2.2, 1.4, 1.5]
• Accelerators provide a major speedup [2x – 9x]
Performance Impact of Lifeguard Accelerators
[Chart: AddrCheck slowdown (8 threads), not accelerated vs. accelerated, per benchmark; bar values: 3.9, 3.2, 1.0, 1.4, 1.1, 8.4, 1.0, 1.4, 1.1, 1.0, 1.0, 1.0, 1.0, 6.0, 1.0, 1.0]
• Accelerators provide a major speedup [1.13x – 3.4x]
Transitive Reduction Sensitivity Study
[Chart: TaintCheck slowdown (8 threads) per benchmark. Limited (1 timestamp per core): 2.9, 1.1, 1.2, 1.3, 1.5, 3.1, 1.5, 1.6; Ideal (1 timestamp per cache block): 2.6, 1.0, 1.1, 1.3, 1.4, 2.2, 1.4, 1.5]
• Limited transitive reduction
  – No major performance impact; savings in chip area
Supporting Total Store Order (TSO)
• Cycle of dependencies in relaxed memory models
  – TSO relaxes the RAW ordering
  – Previous work (RTR): maintain versions of data
    • Identify SC-offending instructions; save the loaded value
• This paper: maintain versions of metadata
• Example: Thread 0 executes Wr(A) then Rd(B); Thread 1 executes Wr(B) then Rd(A); under TSO both reads can commit early, creating a cycle in the commit order
  – Each log records the version a load consumed: Log 0 holds Wr(A), Rd(B, v0); Log 1 holds Wr(B), Rd(A, v1)
  – Lifeguard 0 runs store_handler(A) with produce_version(v1, A), then wait_until_available(v0, B) before load_handler(B, v0)
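The versioned-metadata idea can be sketched as an (address, version) store; the version numbering and the taint values below are illustrative, not the slide's exact trace.

```python
# Sketch of metadata versioning under TSO: a store handler produces a
# new metadata version, and a load handler consumes the version its
# application load actually read, breaking the dependence cycle.

versions = {}  # (addr, version) -> metadata value

def store_handler(addr, ver, meta):
    versions[(addr, ver)] = meta   # models produce_version(ver, addr)

def load_handler(addr, ver):
    # the real platform would wait_until_available(ver, addr) first
    return versions[(addr, ver)]

store_handler("B", 0, "untainted")  # initial version v0 of B
store_handler("B", 1, "tainted")    # Wr(B) produces v1

# A load ordered before Wr(B) in memory order consumes v0, even though
# its handler may run after v1 has already been produced:
print(load_handler("B", 0))  # -> untainted
```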
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
  – Fast metadata address calculation – Metadata-TLB
  – Fast tracking of data-flow paths – Inheritance Tracking
  – Filter out redundant checking – Idempotent Filters
    • Per-instruction checking gives the same result; cache the event
• Accelerators have only a local view of the analysis
  – Important events have system-wide effects (e.g., free())
  – Coherence-like issues with accelerators' local state
• Important events are accompanied by Conflict Alerts
  – Use Conflict Alerts to flush state and deliver pending events
Experimental Framework
Benchmarks and inputs:
  barnes – 16K bodies
  ocean – Grid: 258 x 258
  lu – Matrix: 1024 x 1024
  fmm – 32768 particles
  radiosity – base problem
  blackscholes – simlarge
  fluidanimate – simlarge
  swaptions – simlarge
Simulation parameters:
  Cores – {2, 4, 8, 16}, 1 GHz, in-order scalar x86
  L1I & L1D (private) – 64KB, 64B line, 4-way assoc.
  L2 (shared) – {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
  Memory – 90-cycle latency
  Log buffer – 64KB per thread
Multithreaded lifeguards:
  TaintCheck – information-flow tracking; accelerated by M-TLB and IT
  AddrCheck – memory-access checking; accelerated by M-TLB and IF
Relative Slowdown – TaintCheck
[Chart: TaintCheck slowdown per benchmark with 1/2/4/8 threads, broken down into Useful Work, Waiting for Dependence, and Waiting for Application]
Relative Slowdown – AddrCheck
[Chart: AddrCheck slowdown per benchmark with 1/2/4/8 threads, broken down into Useful Work, Waiting for Dependence, and Waiting for Application; off-scale bars reach 3.0 and 6.0]
Performance Results – AddrCheck
[Chart: same normalized execution time data as the earlier "Performance Results: AddrCheck" slides; off-scale Timesliced bars: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4]
Performance Results – TaintCheck
[Chart: same normalized execution time data as the earlier "Performance Results: TaintCheck" slides; off-scale Timesliced bars: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7]
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
  – Metadata-TLB & Inheritance Tracking (discussed in the paper)
  – Idempotent Filters: identify and filter out redundant checking
    • Per-instruction checking gives the same result
    • Cache the incoming event and local state to identify redundancy
• Accelerators have only a local view of the analysis
  – Important events have application-wide effects (e.g., free())
  – Coherence-like issues with accelerators' local state
• Important events are accompanied by Conflict Alerts
  – Use Conflict Alerts to flush accelerators' state