Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre •...
![Page 1: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/1.jpg)
ORNL is managed by UT-Battelle for the US Department of Energy
Robust Health Monitoring
• Lustre
• InfiniBand
• Storage Arrays
Blake Caldwell
HPC Operations
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
March 3rd, 2015
![Page 2: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/2.jpg)
2 Robust Monitoring
Just a filesystem, right?
[Diagram: five Compute nodes, IB Fabric, Lustre]
High-level view: a compute cluster connected to a parallel file system
![Page 3: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/3.jpg)
Just a filesystem, right?
[Diagram: five Compute nodes, IB Fabric, Lustre]
High-level view: focus in on just the filesystem infrastructure
![Page 4: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/4.jpg)
Not too bad…
[Diagram: two Storage Arrays, eight OSS nodes, IB Fabric]
View of PFS: composed of both servers and storage arrays
![Page 5: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/5.jpg)
Not too bad…
[Diagram: two Storage Arrays, eight OSS nodes, IB Fabric]
View of PFS: organized into scalable units
![Page 6: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/6.jpg)
More than we thought…
[Diagram: two Storage Controllers, four OSS nodes, IB Links]
View of Scalable Unit: a mesh of IB links for throughput and high availability
![Page 7: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/7.jpg)
More than we thought…
[Diagram: two Storage Controllers, four OSS nodes, IB Links]
View of Scalable Unit: the storage array has individually monitorable components
![Page 8: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/8.jpg)
Another layer?
[Diagram: two Storage Controllers, five Enclosures, SAS Links]
View of Storage Array: redundant SAS links connect disk enclosures
![Page 9: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/9.jpg)
Another layer?
[Diagram: two Storage Controllers, five Enclosures, SAS Links]
View of Storage Array: more complexity still within enclosure units
![Page 10: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/10.jpg)
We need some help!
[Diagram: five Enclosures with ten IOMs and ten PSUs; three rows of ten Disks, each row labeled Pool and VD]
View of Storage Enclosures: they contain pools, VDs, IO modules, power supplies, and hundreds of disks, all of which can fail (so we monitor them)
![Page 11: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/11.jpg)
This talk will cover…
• A monitoring infrastructure for Lustre
• Tools used for monitoring layers
  – DDN SFA check (block)
  – IB health check (network)
  – Custom scripts (Lustre)
• How common kernel LustreError log messages correlate with monitored events
![Page 12: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/12.jpg)
Monitoring Infrastructure
• Nagios: for alerting
• Splunk: for information to be investigated
  – Send out snippets of syslog to filesystem admins
  – Interesting DDN SFA logs
    • SCSI sense errors (predict PD failure)
    • RAID parity mismatches
  – Rebooted OSS/MDS
  – Read-only LUN
  – Memory errors
• LMT/others: for performance monitoring
![Page 13: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/13.jpg)
Nagios Infrastructure
• All Lustre hosts and DDN controllers are defined as Nagios hosts
• Service checks (e.g. sfa_check, ib_health_check): 13,000 in OLCF!!
  – Check commands:
    • check_snmp_sfa_health.sh
    • check_snmp_extend.pl
• Hosts run scripts via snmp extend (snmpd.conf):
    extend monitor_ib_health /opt/bin/monitor_ib_health.sh
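Whatever transport carries the result (here, snmp extend), a Nagios check command ends up producing a one-line status text plus an exit code, where the plugin convention is 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN. The Python sketch below shows only that last mapping step. It is not the code of check_snmp_extend.pl; the helper name `to_nagios` and the choice to report out-of-range codes as UNKNOWN are assumptions made for illustration.

```python
# Exit codes defined by the Nagios plugin convention.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def to_nagios(service, return_code, output):
    """Turn a script's return code and output into (status line, exit code).

    Hypothetical helper: codes outside 0-3 are reported as UNKNOWN, and
    only the first output line is used for the one-line status text.
    """
    exit_code = return_code if return_code in NAGIOS_STATES else 3
    lines = output.strip().splitlines()
    summary = lines[0] if lines else "no output"
    return f"{service} {NAGIOS_STATES[exit_code]} - {summary}", exit_code


status_line, exit_code = to_nagios("monitor_ib_health", 0, "All Checks OK")
print(status_line)  # a real plugin would then call sys.exit(exit_code)
```

Treating unexpected codes as UNKNOWN rather than OK keeps a broken remote script from silently looking healthy.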
![Page 14: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/14.jpg)
1st layer: backend storage arrays
• Hardware failure events:
  – Disk failures
  – Enclosure power supplies
  – Inter-controller links
• Assess the impact on:
  – Redundancy
  – Performance
  – Cache status
![Page 15: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/15.jpg)
SFA Check
• Periodic execution of API commands on all DDN arrays
  – Asynchronous from Nagios polling
• Python multiprocessing library
  – Manages a pool of workers (one per SFA couplet)
  – Times out stuck workers and propagates error to SNMP
• Modular design
  – All component checks perform doHealthCheck() for overall component status (OK, NON_CRITICAL, CRITICAL)
  – Additional component-specific checks (e.g. ICLChannelCheck)
    • InfinibandPortState == ACTIVE
    • InfinibandCurrentWidth == 4
    • ErrorStatisticCounts[SymbolErrorCounter] < 20
• Checks configuration of pools (caching, parity/integrity checks)
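The worker-pool idea (one worker per couplet, with a timeout so one stuck array cannot stall the rest) can be sketched as follows. This is an illustration, not the ornl_sfa_check code: `run_checks` and its arguments are hypothetical names, and the sketch uses `ThreadPool` from the multiprocessing package so that it runs without pickling concerns. A process-based `Pool` has the same `apply_async`/`get(timeout=...)` interface and, unlike threads, lets stuck workers be killed.

```python
import time
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool

CRITICAL = 2  # same numeric convention as the Nagios exit codes


def run_checks(check_fn, couplets, timeout_s):
    """Run check_fn(couplet) for every couplet, one worker each.

    Returns {couplet: (return_code, message)}. A worker that has not
    finished by the shared deadline is reported as CRITICAL instead of
    blocking the whole run.
    """
    results = {}
    pool = ThreadPool(processes=len(couplets))
    try:
        pending = {c: pool.apply_async(check_fn, (c,)) for c in couplets}
        deadline = time.monotonic() + timeout_s
        for couplet, handle in pending.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[couplet] = handle.get(timeout=remaining)
            except TimeoutError:
                results[couplet] = (CRITICAL, "check timed out")
            except Exception as exc:  # one crashed check must not hide the others
                results[couplet] = (CRITICAL, f"check failed: {exc}")
    finally:
        pool.terminate()
    return results
```

Reporting a timeout as a CRITICAL result, rather than raising, is what allows the error to be published through SNMP like any other check outcome.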
![Page 16: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/16.jpg)
SFA Check (2)
Classes that return health status (OK, NON_CRITICAL, CRITICAL):
• Physical disks (PD)
• Virtual disk (VD) objects (made up of PDs)
• Pools (made up of VDs)
• Inter-controller links (ICL)
• Power supplies
• Host channels (HCAs)
• Internal configuration disks
• SAS expanders
• SAS expander processor (SEP)
• UPS units (external)
• IO Controllers (IOC)
• Fans
• RAID processors
• Voltage sensors
• Temperature sensors
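As an illustration of how one of these classes might fold several conditions into a single status, here is a minimal Python sketch of the three ICLChannelCheck conditions from the SFA Check slide (port state ACTIVE, link width 4, fewer than 20 symbol errors). The input is a plain dict standing in for whatever object the DDN API returns, and the severity assigned to each failed condition is an assumption; the talk does not say which ones are CRITICAL.

```python
OK, NON_CRITICAL, CRITICAL = 0, 1, 2  # ordered so max() yields the worst status
STATUS_NAMES = {OK: "OK", NON_CRITICAL: "NON_CRITICAL", CRITICAL: "CRITICAL"}
SYMBOL_ERROR_LIMIT = 20  # threshold quoted in the talk


def icl_channel_check(channel):
    """Return (status name, messages) for one inter-controller link channel.

    `channel` is a hypothetical dict keyed by the attribute names shown
    on the slide; severities below are illustrative choices.
    """
    status, messages = OK, []

    def flag(level, text):
        nonlocal status
        status = max(status, level)
        messages.append(text)

    if channel["InfinibandPortState"] != "ACTIVE":
        flag(CRITICAL, f"port state is {channel['InfinibandPortState']}")
    if channel["InfinibandCurrentWidth"] != 4:
        flag(NON_CRITICAL, f"link width is {channel['InfinibandCurrentWidth']}, expected 4")
    errors = channel["ErrorStatisticCounts"]["SymbolErrorCounter"]
    if errors >= SYMBOL_ERROR_LIMIT:
        flag(NON_CRITICAL, f"{errors} symbol errors (limit {SYMBOL_ERROR_LIMIT})")
    return STATUS_NAMES[status], messages
```

Keeping the statuses as ordered integers means the overall result for a component, or for a whole array, is simply the maximum of its parts.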
![Page 17: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/17.jpg)
poolCheck() example
• Top-level generic check: doHealthCheck()
• Specific checks:
  – State: degraded, no redundancy, critical
  – Rebuild state
  – Ownership by controller
  – Bad block count (*)
(*) only in CLI extended mode (-x)
![Page 18: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/18.jpg)
SFA check design
[Diagram: SFA check design]
• Hosts: Nagios Host, SFA Monitoring Host
• Components: Nagios, check_sfa, snmpd, snmppass_persist, ornl_sfa_check_pp_daemon
• Workers: ornl_sfa_check (DDN 1), ornl_sfa_check (DDN 2), ornl_sfa_check (DDN 3), each shown with the Python API of DDN 1, DDN 2, DDN 3
• OIDs:
  – base_oid.ddn1.1, base_oid.ddn1.2, base_oid.ddn1.3
  – base_oid.ddn2.1, base_oid.ddn2.2, base_oid.ddn2.3
  – base_oid.ddn3.1, base_oid.ddn3.2, base_oid.ddn3.3
• Arrow labels: Every 5min (twice), Once
![Page 19: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/19.jpg)
SNMP OID structure
# snmpwalk -v2c -c public monitor_host .1.3.6.1.4.1.341.49.1
SNMPv2-SMI::enterprises.341.49.1.12.97...98.49.1 = INTEGER: 0
SNMPv2-SMI::enterprises.341.49.1.12.97...98.49.2 = STRING: "All Checks OK"
SNMPv2-SMI::enterprises.341.49.1.12.97...98.49.3 = STRING: "1424574807"
SNMPv2-SMI::enterprises.341.49.1.12.97...98.50.1 = INTEGER: 0
SNMPv2-SMI::enterprises.341.49.1.12.97...98.50.2 = STRING: "All Checks OK"
SNMPv2-SMI::enterprises.341.49.1.12.97...98.50.3 = STRING: "1424574807"
SNMPv2-SMI::enterprises.341.49.1.12.97...99.49.1 = INTEGER: 0
SNMPv2-SMI::enterprises.341.49.1.12.97...99.49.2 = STRING: "All Checks OK"
SNMPv2-SMI::enterprises.341.49.1.12.97...99.49.3 = STRING: "1424574807"
Fields per array (last OID component): .1 = return code, .2 = Nagios string, .3 = timestamp
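A consumer of this OID layout needs to group the three values that belong to each array. The Python sketch below parses snmpwalk text of the form shown above, keyed by everything before the last OID component; the elided middle of the OID (`...`) is kept exactly as printed. The function names are hypothetical. The staleness test is an added assumption rather than something the slide states: since results are published asynchronously, it seems reasonable for the reader to reject a timestamp older than a couple of 5-minute cycles.

```python
import re

# The last OID component selects the field.
FIELDS = {"1": "return_code", "2": "message", "3": "timestamp"}

LINE_RE = re.compile(r"^(?P<prefix>\S+)\.(?P<field>[123]) = \w+: (?P<value>.*)$")


def parse_walk(text):
    """Group snmpwalk lines into {oid_prefix: {field_name: value}}."""
    targets = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not an OID = TYPE: value line
        name = FIELDS[match["field"]]
        value = match["value"].strip().strip('"')
        if name != "message":
            value = int(value)  # return code and epoch timestamp are numeric
        targets.setdefault(match["prefix"], {})[name] = value
    return targets


def is_stale(entry, now, max_age_s=600):
    """True if the result is older than max_age_s (assumed: two 5-minute cycles)."""
    return now - entry["timestamp"] > max_age_s


SAMPLE = '''SNMPv2-SMI::enterprises.341.49.1.12.97...98.49.1 = INTEGER: 0
SNMPv2-SMI::enterprises.341.49.1.12.97...98.49.2 = STRING: "All Checks OK"
SNMPv2-SMI::enterprises.341.49.1.12.97...98.49.3 = STRING: "1424574807"'''

for prefix, entry in parse_walk(SAMPLE).items():
    print(prefix, entry)
```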
![Page 20: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/20.jpg)
SFA Check Output
[fpc@or-mgmt01 ~]$ /opt/lustre/bin/ornl_sfa_check.py or-ddn-a1 or-ddn-b1 or-ddn-c1 or-ddn-d1
POOL: 2 Checks WARNING - Index 53 DEGRADED
POWER SUPPLY: CRITICAL
All Checks OK
All Checks OK
[fpc@or-mgmt01 ~]$ /opt/lustre/bin/ornl_sfa_check.py -x or-ddn-a1 or-ddn-b1 or-ddn-c1 or-ddn-d1
or_ddn_a Check Summary:
-------------------------
Messages from check POOL
Messages from check VIRTUAL DISK
VIRTUAL DISK: 2 Checks WARNING
VIRTUAL DISK Health: OK Index: 37; Child Health: NON_CRITICAL
VIRTUAL DISK Health: OK Index: 53; Child Health: NON_CRITICAL
POOL: 2 Checks WARNING
POOL Health: NON_CRITICAL Index: 37
POOL Health: NON_CRITICAL Index: 53
or_ddn_b Check Summary:
-------------------------
All Checks OK
![Page 21: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/21.jpg)
SFA Check Output (2)
or-ddn-c Check Summary:
-------------------------
Messages from check POOL
Messages from check VIRTUAL DISK
VIRTUAL DISK: 1 Check WARNING
VIRTUAL DISK Index: 4; BadBlocks: 1
POOL: 1 Check WARNING
Pool Index: 4; BadBlocks: 1
or-ddn-d Check Summary:
-------------------------
Messages from check POWER SUPPLY
Messages from check CONTROLLER
CONTROLLER: 2 Checks WARNING
CONTROLLER Health: OK Name: A Index: 0; Child Health: CRITICAL
CONTROLLER Health: OK Name: B Index: 1; Child Health: CRITICAL
POWER SUPPLY: 1 Check CRITICAL
POWER SUPPLY Health: CRITICAL EnclosureIndex: 10 Location: PSU 2 Index: 2
![Page 22: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/22.jpg)
Other block-level OSS checks
• Multipath
  – 2 paths to each LUN
• SRP daemon
  – 2 processes (for each IB port)
• Block tuning
  – Udev rules correctly set for each block AND dm device
    • max_sectors_kb = max_hw_sectors_kb
    • scheduler = “noop”
    • nr_requests, read_ahead_kb
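The block-tuning items can be verified by reading the queue attributes the kernel exposes under /sys/block/<device>/queue/. The Python sketch below is one way to do that and is not taken from the OLCF scripts: the function names are hypothetical, and the expected values for nr_requests and read_ahead_kb are left as parameters because the slide does not give them.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed entry of a sysfs scheduler line, e.g. 'noop'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return text.strip()  # some devices list a single unbracketed value


def check_block_tuning(device, expected, sys_block=Path("/sys/block")):
    """Return a list of tuning problems for one block or dm device.

    `expected` maps nr_requests / read_ahead_kb to site-chosen integers
    (hypothetical inputs); an empty list means the device is tuned.
    """
    queue = Path(sys_block) / device / "queue"

    def read(name):
        return (queue / name).read_text().strip()

    problems = []
    if read("max_sectors_kb") != read("max_hw_sectors_kb"):
        problems.append(
            f"{device}: max_sectors_kb={read('max_sectors_kb')} "
            f"!= max_hw_sectors_kb={read('max_hw_sectors_kb')}"
        )
    scheduler = active_scheduler(read("scheduler"))
    if scheduler != "noop":
        problems.append(f"{device}: scheduler is {scheduler}, expected noop")
    for name in ("nr_requests", "read_ahead_kb"):
        if name in expected and int(read(name)) != expected[name]:
            problems.append(f"{device}: {name}={read(name)}, expected {expected[name]}")
    return problems
```

Running the same function over both the underlying block devices and their dm devices covers the "block AND dm" requirement, since a udev rule that matches only one of the two is easy to miss.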
![Page 23: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/23.jpg)
2nd layer: network interconnect (IB)
What we’re looking for:
• Faulty host IB links to:
  – Storage arrays
  – Top of rack switches
• Fabric health (switch ports and inter-switch links)
  – Error counters, degraded/failed links
  – IB topology/forwarding routing
![Page 24: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/24.jpg)
2nd layer: network interconnect (IB)
What are we looking for:
• Faulty host IB links to:
  – Storage arrays
  – Top of rack switches
• Fabric health (switch ports and inter-switch links)
  – Error counters, degraded/failed links
  – IB topology/SM routing
TODO
![Page 25: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/25.jpg)
IB Health Check
• HCA and local link health
  – Local errors check (HCA port)
  – Remote errors check (switch port)
• PCI width/speed of each HCA
  – Identify failed hardware or firmware issues
  – Appropriate slot placement
• Port in up/active state
• Link speed/width matches capability
• SM lid is set
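The last three items (port state, link rate, SM lid) can be read from the per-port files that the Linux InfiniBand stack exposes under /sys/class/infiniband/<hca>/ports/<port>/. The Python sketch below covers only those three and leaves out the error-counter and PCI checks. It is not the monitor_ib_health.sh script; `check_ib_port` is a hypothetical name, and the expected rate is passed in by the caller because the port's capability is not read here.

```python
from pathlib import Path


def check_ib_port(hca, port, expected_rate, root=Path("/sys/class/infiniband")):
    """Return a list of problems for one HCA port (empty list if healthy).

    expected_rate is a prefix of the sysfs `rate` text, e.g. "40 Gb/sec",
    chosen by the caller (an assumption of this sketch).
    """
    base = Path(root) / hca / "ports" / str(port)

    def read(name):
        return (base / name).read_text().strip()

    problems = []
    state = read("state")            # formatted like "4: ACTIVE"
    if not state.endswith("ACTIVE"):
        problems.append(f"{hca}/{port}: state is {state}")
    phys_state = read("phys_state")  # formatted like "5: LinkUp"
    if not phys_state.endswith("LinkUp"):
        problems.append(f"{hca}/{port}: physical state is {phys_state}")
    rate = read("rate")              # formatted like "40 Gb/sec (4X QDR)"
    if not rate.startswith(expected_rate):
        problems.append(f"{hca}/{port}: rate is {rate}, expected {expected_rate}")
    if int(read("sm_lid"), 0) == 0:  # hex text such as "0x1"; 0 means no SM seen
        problems.append(f"{hca}/{port}: SM lid is not set")
    return problems
```

A link that trained at a lower width or speed still reports ACTIVE, which is why the rate comparison is a separate check from the state check.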
![Page 26: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/26.jpg)
IB Health Check Output
![Page 27: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/27.jpg)
Fabric Monitoring
• Issues to resolve:
  – Scaling health checks to 2000 ports
  – Discover new trends, not thresholds crossed
  – Storage and retrieval
  – Selective presentation of information
    • Alerting interface
    • Performance monitoring interface
![Page 28: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/28.jpg)
Nagios scripts for IB Fabric and SFA checks
• https://github.com/bacaldwell/scalable-monitoring
  – DDN SFA checks
    • sfa_check
  – Monitor IB fabric
    • monitor_ib_health
![Page 29: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/29.jpg)
3rd layer: Lustre monitoring
• /proc/fs/lustre/health_check
• /proc/mounts
• /proc/fs/lustre/devices
  – osd, osp, mgc, mds, mdt in UP state
• /proc/sys/lnet/stats
  – queued LNET messages
• /proc/sys/lnet/peers
  – connection state and queued messages to other servers
• lfs check servers
• llstat
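Two of these sources lend themselves to a very small check: the health_check file, which reads "healthy" on a healthy server, and the devices file, where every listed device should be in the UP state. The Python sketch below works on the text of those files so it can be tried without a Lustre server. The function names are hypothetical, and the assumed column order of the devices file (index, state, type, name, ...) should be verified against the Lustre version in use.

```python
def parse_devices(text):
    """Parse the devices listing into dicts.

    Assumed column order: index, state, type, name, then further fields.
    """
    devices = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        devices.append(
            {"index": int(parts[0]), "state": parts[1], "type": parts[2], "name": parts[3]}
        )
    return devices


def lustre_problems(health_text, devices_text):
    """Return a list of problems found in health_check and devices text."""
    problems = []
    if health_text.strip() != "healthy":
        problems.append(f"health_check reports: {health_text.strip() or 'nothing'}")
    for dev in parse_devices(devices_text):
        if dev["state"] != "UP":
            problems.append(f"{dev['type']} device {dev['name']} is {dev['state']}, not UP")
    return problems
```

On a server, the two arguments would come from reading /proc/fs/lustre/health_check and /proc/fs/lustre/devices; a script built this way could then be exposed to Nagios in the same manner as the other host-side checks.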
![Page 30: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/30.jpg)
Lustre monitoring tools
• Capturing monitoring metrics in database for analysis
  – Robinhood (changelogs)
  – LMT (collectl)
• Scripts to collect using llstat, plot with gnuplot or matplotlib
  – See “Monitoring the Lustre file system to maintain optimal performance,” by Gabriele Paciucci, LAD 2013
![Page 31: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/31.jpg)
How monitoring helps with these kernel messages
Oct 25 14:35:09 oss1b8 kernel: [ 726.179459] LDISKFS-fs (dm-25): Remounting filesystem read-only
Oct 25 14:35:09 oss1b8 kernel: [ 726.445822] Remounting filesystem read-only
• Read-only OST
  – Trigger lustre_health, Splunk alerts
![Page 32: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/32.jpg)
More Lustre kernel messages
• Bulk I/O failure and LNET timeouts
  – Check IB_health for errors
  – Probably useful syslog messages
• After determining root cause, can a Splunk alert be written?
Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110
![Page 33: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/33.jpg)
More Lustre kernel messages
• OST unreachable
  – Health checks for OSS on which OST resides
  – Next check IB fabric (are messages getting lost?)
Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.1.0.145@o2ib was lost; in progress operations using this service will wait for recovery to complete.
![Page 34: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/34.jpg)
More Lustre kernel messages
• Client eviction/reconnection cycle
  – This was caused by an MTU mismatch on the IB fabric
  – Lesson learned: ensure lower layers are healthy
Mar 3 11:14:09 atlas-oss4e1.ccs.ornl.gov kernel: [2428963.716071] Lustre: atlas2-OST0068: Client c618c441-fb87-27e9-21ae-6c345ddc40c8 (at 10.1.0.151@o2ib) reconnecting
Mar 3 11:14:09 atlas-oss4e1.ccs.ornl.gov kernel: [2428963.740253] Lustre: Skipped 6 previous similar messages
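The pairing of kernel messages with a next step, as walked through on these slides, can be written down as a small rule table. The Python sketch below is an illustration only: the regular expressions are derived from the sample log lines in this talk, the follow-up texts paraphrase the slide bullets, and `classify` is a hypothetical helper, not a Splunk alert definition.

```python
import re

# (pattern, short label, follow-up suggested on the slides)
RULES = [
    (re.compile(r"Remounting filesystem read-only"),
     "read-only OST",
     "lustre_health and Splunk alerts should trigger"),
    (re.compile(r"Bulk IO (?:read|write) error"),
     "bulk I/O failure",
     "check IB health for errors"),
    (re.compile(r"Connection to service \S+ via nid \S+ was lost"),
     "OST unreachable",
     "check health of the OSS serving the OST, then the IB fabric"),
    (re.compile(r"Client \S+ \(at \S+\) reconnecting"),
     "client reconnecting",
     "verify lower layers are healthy (e.g. MTU consistency on the IB fabric)"),
]


def classify(line):
    """Return (label, follow_up) for the first matching rule, else None."""
    for pattern, label, follow_up in RULES:
        if pattern.search(line):
            return label, follow_up
    return None


sample = ("Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: "
          "atlas2-OST009b: Bulk IO read error with "
          "3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), "
          "client will retry: rc -110")
print(classify(sample))
```

A table like this is also a convenient place to record root causes once they are known, which is the step the talk suggests before writing a Splunk alert.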
![Page 35: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/35.jpg)
Conclusion
• OLCF monitoring best practices
• Tools used at each layer
  – DDN SFA check (block)
  – IB health check (network)
  – Custom scripts (Lustre)
• Applicability to common filesystem problems
![Page 36: Robust Health Monitoring - ORNL Lustre Activities · Robust Health Monitoring • Lustre • InfiniBand • Storage Arrays Blake Caldwell HPC Operations Oak Ridge Leadership Computing](https://reader033.fdocuments.us/reader033/viewer/2022042221/5ec7db4be2b32a13705ad57e/html5/thumbnails/36.jpg)
Thank You
Blake Caldwell [email protected]
Monitoring and visualization tools: https://github.com/bacaldwell