HOMME Trace Analysis
Fabrice Mizero
Mentor: Dr. John Dennis
Collaborators: Prof. Malathi Veeraraghavan (University of Virginia)
Prof. Robert D. Russell (University of New Hampshire)
Qian Liu (University of New Hampshire)
Aug 1, 2014
2
Roadmap
• Motivation
• Background
• Methodology
• Results
• Conclusion and Solutions
• Future Work
3
Big Picture
• Understanding the causes of poor performance of CESM on Yellowstone: a 5-step approach
  1. Experimental execution and data collection
  2. HOMME trace analysis
  3. IBMgtSim: routing study
  4. Network simulation
  5. Integrated simulation
4
[Figure: panels labeled 2-hop, 4-hop, and 6-hop. Credit: Dr. John Dennis, Zhengyang Liu]
5
Suspected Causes
• Network Congestion
  - Head of Line Blocking
  - Credit-Based Flow Control
• OS Jitter
  - Kernel Interrupts
• Application Interference
  - Self-Interference
  - Interference with others (Neighborhood Effect)
“…OS noise, shape of the allocated partition, and interference from other jobs.” Abhinav Bhatele et al., SC13
6
Congestion: Head of Line Blocking (HOL)
Worst-case scenario: congestion spreading due to HOL
[Figure: hosts H1 to H7 attached to switches S1 and S2. Labels: "Stuck!!!", "Out of Buffer Space!!" (twice), "Victim Flow"]
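To make the victim-flow idea concrete, here is a minimal toy sketch of head-of-line blocking. It is not taken from the study: the function names, the packet list, and the choice of H4 as the congested destination are all hypothetical. It compares a single shared FIFO with one queue per destination.

```python
from collections import deque


def drain_single_fifo(packets, blocked, ticks):
    """Forward packets from one shared FIFO; only the head packet may leave.

    `packets` is a list of destination names (hypothetical, e.g. "H4").
    `blocked` is the set of destinations that cannot accept data.
    """
    queue = deque(packets)
    delivered = []
    for _ in range(ticks):
        if not queue:
            break
        if queue[0] in blocked:
            continue  # head is stuck, so every packet behind it waits too
        delivered.append(queue.popleft())
    return delivered


def drain_per_destination(packets, blocked, ticks):
    """Forward packets from one queue per destination, one packet per tick."""
    queues = {}
    for dest in packets:
        queues.setdefault(dest, deque()).append(dest)
    delivered = []
    for _ in range(ticks):
        for dest, q in queues.items():
            if q and dest not in blocked:
                delivered.append(q.popleft())
                break  # at most one packet leaves per tick
    return delivered


if __name__ == "__main__":
    packets = ["H4", "H7", "H7"]  # H7 traffic queued behind H4 traffic
    print(drain_single_fifo(packets, {"H4"}, ticks=5))
    print(drain_per_destination(packets, {"H4"}, ticks=5))
```

In the shared FIFO the H7 packets never leave even though their destination is free, which is the victim-flow situation; separating traffic by destination lets them through.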
7
OS Jitter
• Each compute node runs its own OS (RHEL)
• Interference caused by OS routines
  - Timer interrupts
  - OS daemons
  - Hardware interrupts
• Competition for CPU resources. Example: Line Printer Daemon
8
3 Questions
• How does congestion impact network latency?
• How important is OS Jitter to network latency?
• What has a bigger impact on message latency: OS Jitter or Congestion?
9
Experimental Set-Up
• Congestion: 2 platforms
  - Jellystone: non-production machine
  - Yellowstone: production machine
  - Different message sizes and hop distances
• OS Jitter:
  - Linux Transparent Huge Pages (THP)
10
Methodology
[Figure: analysis pipeline. Labels: Extrae Trace Collection; Clock Skew Correction; Hop, Size (shown twice); Wilcoxon Rank Sum Test]
11
Extrae
• Tracing tool developed at BSC
• Chronological event, state, and communication records
• One-way communication delays; visuals with Paraver
[Figure: timeline of an MPI-Isend call with Start and End marked on the Time axis]
12
Clock Skew
[Figure: Host A with clock reading $C_a(t_1)$; Host B with clock reading $C_b(t_2)$]
Ideally, $C_{AB} = C_b(t_2) - C_a(t_1)$
In reality:
  Offset $= C_a(t) - C_b(t) \neq 0$
  Skew $= C_a'(t) - C_b'(t) \neq 0$
• Same size, same hop count, host-pair level
  - Min delay: best approximation of the offset
  - Corrected delay: $C_{AB}(t) - \min(C_{AB}(t)) + \min_{\text{pingpong}}$
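The correction expression can be sketched in a few lines. This is an illustration under assumptions, not the code used in the study: the record layout `(src, dst, size, hops, raw_delay)`, the function name `correct_delays`, and a single `min_pingpong` value supplied by the caller are all hypothetical.

```python
from collections import defaultdict


def correct_delays(records, min_pingpong):
    """Apply C_AB(t) - min(C_AB(t)) + min_pingpong per group.

    records: iterable of (src, dst, size, hops, raw_delay), where raw_delay
             is the receive timestamp at dst minus the send timestamp at src
             (it may be negative when the two clocks are offset).
    min_pingpong: minimum delay taken from a ping-pong measurement
                  (assumed here to be given for the group being corrected).
    """
    groups = defaultdict(list)
    for src, dst, size, hops, raw_delay in records:
        # Group at host-pair level, for the same size and the same hop count.
        groups[(src, dst, size, hops)].append(raw_delay)

    corrected = {}
    for key, delays in groups.items():
        offset_estimate = min(delays)  # min delay approximates the offset
        corrected[key] = [d - offset_estimate + min_pingpong for d in delays]
    return corrected


if __name__ == "__main__":
    sample = [
        ("A", "B", 488, 4, -5.0),
        ("A", "B", 488, 4, -4.0),
        ("A", "B", 488, 4, -2.5),
    ]
    print(correct_delays(sample, min_pingpong=1.5))
```

After correction the smallest delay in each group equals `min_pingpong`, and the remaining delays keep their spacing relative to it.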
13
Statistical Methods
• Wilcoxon Rank Sum Test: non-parametric significance test
  - Compares two independent samples by rank (tests for a shift between their distributions, not a difference in means)
  - Tests:
• OS Jitter? Jellystone: no THP <=> with THP
• Congestion?
  - Yellowstone: 0-Hop delays <=> 4-Hop delays
  - Jellystone: THP <=> Yellowstone: THP
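As a rough illustration of the test, the sketch below computes a rank-sum statistic with a normal approximation (mid-ranks for ties, no continuity correction) using only the standard library. The function name `rank_sum_test` and the returned keys are hypothetical, and the study may well have used a statistics package instead.

```python
import math


def rank_sum_test(x, y):
    """Wilcoxon rank-sum test of sample x against sample y.

    Returns approximate p-values for three alternatives:
    two_sided (distributions differ), less (x tends to be smaller than y),
    greater (x tends to be larger than y).
    Assumes the pooled values are not all identical.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this tie group
        mid_rank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_x - mean) / math.sqrt(var)

    return {
        "two_sided": math.erfc(abs(z) / math.sqrt(2)),
        "less": 0.5 * math.erfc(-z / math.sqrt(2)),
        "greater": 0.5 * math.erfc(z / math.sqrt(2)),
    }


if __name__ == "__main__":
    fast = list(range(1, 31))    # made-up delays, smaller values
    slow = list(range(31, 61))   # made-up delays, larger values
    print(rank_sum_test(fast, slow))
```

A small `less` p-value together with a `greater` p-value near 1 indicates that the first sample tends to have the shorter delays.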
14
Perfquery
• Perfquery: IB performance counters query tool
• PortXmitWait: port congestion monitoring
Credit-Based Flow Control
[Figure: Host A and TOR switch; decision "Credits?" with branches "Yes" and "No"; PortXmitWait]
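The following toy model shows why a wait counter is a useful congestion signal under credit-based flow control. It is a simplified sketch, not a model of real InfiniBand hardware: `simulate_link`, `buffer_slots`, and `drain_every` are hypothetical names, and the `xmit_wait` counter only mimics the idea behind PortXmitWait (ticks in which data was ready but could not be sent).

```python
def simulate_link(ticks, buffer_slots, drain_every):
    """Toy credit-based link with a sender that always has data to send.

    buffer_slots: receiver buffer size, which is also the initial credit count.
    drain_every:  the receiver frees one buffer slot (returning one credit)
                  on ticks that are multiples of this value.
    Returns (packets_sent, xmit_wait_ticks).
    """
    credits = buffer_slots
    buffered = 0      # packets currently held in the receiver buffer
    sent = 0
    xmit_wait = 0     # ticks with data ready but no credit available

    for tick in range(ticks):
        # Receiver side: drain one packet and hand its credit back.
        if buffered > 0 and tick % drain_every == 0:
            buffered -= 1
            credits += 1
        # Sender side: transmit only while holding at least one credit.
        if credits > 0:
            credits -= 1
            buffered += 1
            sent += 1
        else:
            xmit_wait += 1
    return sent, xmit_wait


if __name__ == "__main__":
    print(simulate_link(ticks=100, buffer_slots=4, drain_every=1))
    print(simulate_link(ticks=100, buffer_slots=4, drain_every=4))
```

When the receiver drains as fast as the sender transmits, the wait counter stays at zero; a slower receiver exhausts the credits and the counter grows, which is the behavior a rising PortXmitWait value points to.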
15
Results
• How important is OS Jitter to network latency?
  - Comparing: Jellystone::0-Hop::NoTHP vs. Jellystone::0-Hop::THP

Msg size   Sample size      p-Value              Interpretation
488B       54624::45727     <0.001, <0.001, 1    NoTHP is faster than with THP
1952B      9503::7950       <0.001, <0.001, 1    NoTHP is faster than with THP
2440B      102120::85468    <0.001, <0.001, 1    NoTHP is faster than with THP
2928B      47504::39764     <0.001, <0.001, 1    NoTHP is faster than with THP

Intranode communication delays are longer with THP enabled than without THP.
16
Results
• What has a bigger impact on message latency: OS Jitter or Congestion?
  - Comparing: Yellowstone 0-Hop delays vs. 4-Hop delays

Msg size   Sample size      p-Values             Interpretation
488B       54325::23621     <0.001, <0.001, 1    4-Hop is faster than 0-Hop
2440B      101581::16529    <0.001, <0.001, 1    4-Hop is faster than 0-Hop
2928B      47243::21259     <0.001, <0.001, 1    4-Hop is faster than 0-Hop
4880B      49603::4720      <0.001, <0.001, 1    4-Hop is faster than 0-Hop

For all considered message sizes, intranode communication delays can exceed internode delays.
17
Conclusion
• OS Jitter can cause performance degradation or variability.
• Inter-job interference can lead to application performance variability.
Solutions
• Congestion:
  - Dynamic allocation of Virtual Lanes to redirect victim flows around congested ports.
• OS Jitter:
  - Linux tickless kernel
  - MPI-3 for better control over shared-memory communications.
18
Future Work
• Further study of the Dynamic Virtual Lanes assignment solution
• Plan and collect new HOMME traces with PortXmitWait monitored and LSF logs saved
• Study intra-job interference
• A more efficient algorithm for correcting clock skew
Thank You
Fabrice Mizero