Using XDMoD to Facilitate XSEDE Operations, Planning and Analysis

Using XDMoD to Facilitate XSEDE Operations, Planning and Analysis. Tom Furlani, PhD, Director - Center for Computational Research, University at Buffalo, SUNY. XSEDE13, JULY 22 – 25, 2013. Thomas R. Furlani 1, Barry I. Schneider 2, Matthew D. Jones 1, John Towns 3, David L. Hart 4, Steven M. Gallo 1, Robert L. DeLeon 1, Charng-Da Lu 1, Amin Ghadersohi 1, Ryan J. Gentner 1, Abani K. Patra 5, Gregor von Laszewski 6, Fugang Wang 6, Jeffrey T. Palmer 1, Nikolay Simakov 1. 1 Center for Computational Research, University at Buffalo, SUNY; 2 CISE - Advanced Computing Infrastructure, National Science Foundation; 3 NCSA - University of Illinois; 4 National Center for Atmospheric Research; 5 Mech. & Aerospace Eng. Dept., University at Buffalo, SUNY; 6 Pervasive Technology Institute - Indiana University


Page 1: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis

Using XDMoD to Facilitate XSEDE Operations, Planning and Analysis

Tom Furlani, PhD
Director - Center for Computational Research
University at Buffalo, SUNY

XSEDE13, JULY 22 – 25, 2013

Thomas R. Furlani 1, Barry I. Schneider 2, Matthew D. Jones 1, John Towns 3, David L. Hart 4, Steven M. Gallo 1, Robert L. DeLeon 1, Charng-Da Lu 1, Amin Ghadersohi 1, Ryan J. Gentner 1, Abani K. Patra 5, Gregor von Laszewski 6, Fugang Wang 6, Jeffrey T. Palmer 1, Nikolay Simakov 1

1 Center for Computational Research, University at Buffalo, SUNY
2 CISE - Advanced Computing Infrastructure, National Science Foundation
3 NCSA - University of Illinois
4 National Center for Atmospheric Research
5 Mech. & Aerospace Eng. Dept., University at Buffalo, SUNY
6 Pervasive Technology Institute - Indiana University

 

Page 2: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis

T E C H N O L O G Y   A U D I T   S E R V I C E

Outline
• Overview of Technology Audit Service (XDMoD)
• XDMoD Case Studies
  – Data Driven CI Planning for XSEDE
  – System Operation and Maintenance
  – Interpreting XDMoD Data
• Future XDMoD Functionality
  – SUPReMM (Lightning Talk – Wed, 3PM, Marina Ballroom F&G)
  – PEAK (NICS) (Optimizing Utilization Across XSEDE – Thurs, 8:30AM, Marina Ballroom G)
  – Scientific Impact and Open Source Version (XDMoD TAS BOF – Wed, 6PM, Palomar)

Page 3: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


CoAuthors
• Barry I. Schneider (NSF)
• Matthew D. Jones (UB)
• John Towns (NCSA)
• David L. Hart (NCAR)
• Steven M. Gallo (UB)
• Robert L. DeLeon (UB)
• Charng-Da Lu (UB)
• Amin Ghadersohi (UB)
• Ryan J. Gentner (UB)
• Abani K. Patra (UB)
• Gregor von Laszewski (Indiana)
• Fugang Wang (Indiana)
• Jeffrey T. Palmer (UB)
• Nikolay Simakov (UB)

Page 4: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Motivation
• Measuring utilization of CI provides an understanding of how resources are being utilized
• HPC systems are a complex combination of software, processors, memory, networks, and storage systems - difficult to know if optimal performance is being realized, or even if all subcomponents are functioning properly

[Figure: Log Size As Of 9/12/2011. x-axis: Node Number (0 to 1000); y-axis: Log Size (Bytes), 0 to 40,000,000. Annotated outliers: job scheduler error (node #126), loose cable (node #348)]

Example: Log File Analysis Discovers Two Malfunctioning Nodes
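
The slide does not say how the two nodes were identified beyond plotting log size per node, so the following is only a minimal sketch of one way such outliers could be flagged automatically. It assumes a robust rule based on the median absolute deviation (MAD); the function name `flag_outlier_nodes`, the threshold, and all byte counts are hypothetical and are not taken from XDMoD.

```python
from statistics import median


def flag_outlier_nodes(log_sizes, threshold=5.0):
    """Return node numbers whose log size is far from the cluster median.

    log_sizes: dict mapping node number -> log size in bytes.
    Uses the modified z-score, 0.6745 * |x - median| / MAD, which is
    less sensitive to the outliers themselves than mean/std dev.
    """
    sizes = list(log_sizes.values())
    med = median(sizes)
    mad = median(abs(s - med) for s in sizes)
    if mad == 0:
        # Every typical node has the same size: any deviation stands out.
        return sorted(n for n, s in log_sizes.items() if s != med)
    return sorted(
        n for n, s in log_sizes.items()
        if 0.6745 * abs(s - med) / mad > threshold
    )


# Hypothetical data: 400 nodes with roughly 1 MB of logs each,
# plus two nodes with abnormally large logs (values are made up).
log_sizes = {n: 1_000_000 + (n % 7) * 10_000 for n in range(1, 401)}
log_sizes[126] = 38_000_000
log_sizes[348] = 21_000_000

print(flag_outlier_nodes(log_sizes))
```

A median-based rule is used here because a few very large logs would inflate a mean and standard deviation and could mask themselves.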

Page 5: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


XSEDE Technology Audit Service (TAS)
• Provide Auditing and Quality of Service (QoS) Metrics
• Primary components of TAS
  – XDMoD: XSEDE Metrics on Demand Portal
    • Analytics Framework for XSEDE
    • Display results of all metrics (utilization, wait time, etc.)
    • Easy to use
  – Application Kernel Framework
    • Measure performance of XSEDE infrastructure
    • Diagnostic set of tools – early identification of system problems
• Broader Impact
  – Open source framework for academic HPC centers
• Organizations
  – Buffalo, Indiana (Laszewski), Michigan (Finholt), UT-NICS (You)

   

Page 6: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


XDMoD Data Sources

Page 7: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


XDMoD: XD Metrics on Demand Portal
• Display metrics, Role Based, Custom Report Builder

Page 8: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


XDMoD Case Studies

• Data Driven CI Planning for XSEDE
• System Operation and Maintenance
• Interpreting XDMoD Data

Page 9: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Data Driven CI Planning for XSEDE

• Largest, average and total SU allocations on XSEDE over time. Average and largest allocations have increased by more than a factor of 10 over the time period


Page 10: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Data Driven CI Planning for XSEDE
• Total service unit usage by parent science – Molecular Bioscience usage has grown over time and now rivals that of Physics


Page 11: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Data Driven CI Planning for XSEDE
• However, average core count varies widely by parent science – molecular bioscience jobs tend to use a relatively small number of processors


Page 12: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


CI System Operation and Maintenance
• Application kernels help detect user environment anomalies at CCR
• Example: Performance variation of NWChem due to a bug in a commercial parallel file system that was subsequently fixed by the vendor

Page 13: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


CI System Operation and Maintenance
• Sudden decrease in file system performance on TACC Lonestar4 as measured by 3 different application kernels (IOR, MPI-Tile-IO, and IMB)

Page 14: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


CI System Operation and Maintenance
• Application kernel control process automatically detects underperforming application kernels. The red zone indicates an application kernel that is underperforming
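
The slide does not describe the control algorithm itself, so the following is only a sketch of a generic process-control check, assuming control limits are derived from a baseline period of known-good runs. The function names, the choice of k = 3 standard deviations, and all measurements are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits as mean -/+ k sample std devs of baseline."""
    m = mean(baseline)
    s = stdev(baseline)
    return m - k * s, m + k * s


def flag_underperforming(runs, baseline, k=3.0, higher_is_better=True):
    """Return indices of runs that fall on the 'bad' side of the control limits.

    For throughput-like metrics (higher is better) a run is flagged when it
    drops below the lower limit; for wall-time-like metrics the logic flips.
    """
    lower, upper = control_limits(baseline, k)
    if higher_is_better:
        return [i for i, v in enumerate(runs) if v < lower]
    return [i for i, v in enumerate(runs) if v > upper]


# Hypothetical kernel throughput (arbitrary units) from a known-good period.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
# Hypothetical later runs; the third and fourth are degraded.
runs = [101.0, 99.5, 72.0, 70.5, 100.2]

print(flag_underperforming(runs, baseline))
```

In this sketch the flagged indices would correspond to points falling in the red zone of a control chart.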

Page 15: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Interpreting XDMoD Data
• As with any analysis system, care must be exercised in interpreting data from XDMoD
• Ex. Distribution of job sizes for all parent science Physics jobs in XSEDE resources for the period 2008-2012

Page 16: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Interpreting XDMoD Data
• Mean core count for Physics jobs in XSEDE resources for the period 2008-2012, including (blue line) and excluding (red line) serial runs

[Figure annotation: High Throughput Jobs Start at Purdue]

[Figure panel: Number of Serial Physics Jobs by Resource]
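
A small worked example may help show why including serial runs changes the mean so much. This is a sketch with made-up job counts, not XSEDE data, and `mean_core_count` is a hypothetical helper name.

```python
def mean_core_count(core_counts, include_serial=True):
    """Mean cores per job, optionally dropping serial (1-core) jobs."""
    selected = [c for c in core_counts if include_serial or c > 1]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


# Hypothetical mix: 900 serial high-throughput jobs and 100 parallel
# jobs of 256 cores each (numbers chosen only for illustration).
jobs = [1] * 900 + [256] * 100

print(mean_core_count(jobs))                        # 26.5
print(mean_core_count(jobs, include_serial=False))  # 256.0
```

Because the mean is taken per job, a surge of serial jobs on a single resource can pull the overall figure down sharply even when parallel job sizes are unchanged.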

Page 17: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Future XDMoD Functionality: SUPReMM
• SUPReMM (Lightning Talk – Wed, 3PM)
  – Collaboration with TACC and U Texas at Austin
  – Comprehensive job level resource use measurement for large clusters
  – Will supply XDMoD with some missing job usage data – application run, memory, local I/O, network, file-system, and CPU usage
  – Sample application report for Lonestar4

Page 18: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Future XDMoD Functionality: PEAK
• NICS – PEAK (Thursday, 8:30AM)
  – Optimizing Utilization Across XSEDE (Dr. Haihang You)
  – Performance Environment Autoconfiguration FrameworK
  – UT-NICS project to automatically tune key libraries and application kernels
  – Ex. Performance of Amber on Kraken – Amber built with PGI much faster

Page 19: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Future XDMoD Functionality
Open Source XDMoD & Scientific Impact

• Open Source Version: (XDMoD BOF - Wed, 6PM)
  – XDMoD functionality for non-XSEDE HPC centers
  – Installation by system administrators
    • Programming not required
    • Guided textual installation process
    • Installation support provided by TAS Team
  – Pre-existing central database not required
    • Aggregate data from available sources
    • Resource manager log files or existing database
  – Currently recruiting for beta-testing program
• Scientific Impact
  – Preliminary XSEDE-based H-Index
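
The slide does not define how the XSEDE-based H-Index is constructed or which publications are attributed to XSEDE usage. For reference only, the standard h-index on which such a metric would presumably build can be sketched as follows; the citation counts are invented.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for publications tied to one allocation.
print(h_index([25, 8, 5, 3, 3, 1]))  # 3
```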

Page 20: Using  XDMoD  to Facilitate XSEDE Operations, Planning and  Analysis


Acknowledgement

• This work was sponsored by NSF under grant number OCI 1025159 for the development of Technology Audit Service for XSEDE. 

• Contact Info
  – [email protected]
  – XDMoD https://xdmod.ccr.buffalo.edu/
  – [email protected]