S T A M P E D E STAMPEDE: A Framework for Monitoring and Troubleshooting … · 2016-06-14 ·...

1
Scientific Workflow STAMPEDE: A Framework for Monitoring and Troubleshooting of Large-Scale Applications on National Cyberinfrastructure Transform and Archive Real-time Analysis Provides end-to-end system performance view to users Presents data from processes, hosts, and network elements using a scalable analysis and presentation framework STAMPEDE database schema Represents both the abstract workflow plan and running workflow, including the associations between the two Handles parent and child workflows Stores data at high granularity STAMPEDE is funded by the National Science Foundation under the OCI program grant #0943705 Failure Analysis Algorithms Machine learning algorithms predict workflow failures based on behavior patterns Hard failures can be easily determined using database queries Soft failures are often stochastic, should be detected early for quick error recovery End-user Tools Stampede Python Analysis API: Simple, uniform access to back-end database Stampede-analyzer: Quickly debug a workflow after execution is completed Stampede-statistics: Generates statistics about a running or finished workflow Number of tasks/jobs/sub workflows ran/succeeded/failed/retried, ... Job execution site, scheduler queuing time, execution delay, ... Stampede-plots: Interactive graphs and charts for workflow visualization Scalable collection and analysis of performance and troubleshooting data from distributed workflows S T A M P E D E Online detection and analysis of workflow failures and anomalies Access via streaming subscription-based services !"#$%&&"’ Real-time subscription Stampede DB Plan and Execute Monitoring data Fabio Silva 3 , Christopher Brooks 5 , Ewa Deelman 3 , Monte Goode 1 , Dan Gunter 1 , Gideon Juve 2 , Gaurang Mehta 3 , Priscilla Moraes 4 , Taghrid Samak 1 , Martin Swany 4 , Prasanth Thomas 3 and Karan Vahi 3 1 Lawrence Berkeley National Laboratory, 2 University of Southern California, 3 University of Southern California Information Sciences Institute, 4 University of Delaware, Newark, 5 University of San Francisco NetLogger Tools nl-loader stores data in a database, such as SQLite, MySQL, etc. Data goes to a broker, where it can then be sent to many subscribers Log Collection Tools monitord parses data in real-time Collects workflow execution logs Large-Scale Workflows Composed of thousands to millions of coordinated tasks Executed in complex distributed environments Difficult to track failures, search through thousands of files

Transcript of S T A M P E D E STAMPEDE: A Framework for Monitoring and Troubleshooting … · 2016-06-14 ·...

Page 1: S T A M P E D E STAMPEDE: A Framework for Monitoring and Troubleshooting … · 2016-06-14 · STAMPEDE database schema Represents both the abstract workflow plan and running workflow,

Scientific Workflow

STAMPEDE: A Framework for Monitoring and Troubleshooting of Large-Scale Applications on National Cyberinfrastructure

Transform and Archive Real-time Analysis

 Provides end-to-end system performance view to users

 Presents data from processes, hosts, and network elements using a scalable analysis and presentation framework

STAMPEDE database schema  Represents both the abstract

workflow plan and running workflow, including the associations between the two

 Handles parent and child workflows

 Stores data at high granularity

STAMPEDE is funded by the National Science Foundation under the OCI program grant #0943705

Failure Analysis Algorithms  Machine learning algorithms predict workflow

failures based on behavior patterns  Hard failures can be easily determined using

database queries  Soft failures are often stochastic, should be

detected early for quick error recovery

End-user Tools  Stampede Python Analysis API: Simple, uniform access to back-end database  Stampede-analyzer: Quickly debug a workflow after execution is completed  Stampede-statistics: Generates statistics about a running or finished workflow

 Number of tasks/jobs/sub workflows ran/succeeded/failed/retried, ...  Job execution site, scheduler queuing time, execution delay, ...

 Stampede-plots: Interactive graphs and charts for workflow visualization

Scalable collection and analysis of performance and troubleshooting data from distributed workflows

S T A M P E D E

Online detection and analysis of workflow failures and anomalies

Access via streaming subscription-based services

!"#$%&&"'

Real-time subscription

Stampede DB

Plan and Execute

Monitoring data

Fabio Silva3, Christopher Brooks5, Ewa Deelman3, Monte Goode1, Dan Gunter1, Gideon Juve2, Gaurang Mehta3, Priscilla Moraes4, Taghrid Samak1, Martin Swany4, Prasanth Thomas3 and Karan Vahi3 1Lawrence Berkeley National Laboratory, 2University of Southern California, 3University of Southern California Information Sciences Institute, 4University of Delaware, Newark, 5University of San Francisco

NetLogger Tools  nl-loader stores data in a

database, such as SQLite, MySQL, etc.

 Data goes to a broker, where it can then be sent to many subscribers

Log Collection Tools  monitord parses data in

real-time  Collects workflow

execution logs

Large-Scale Workflows  Composed of thousands to millions

of coordinated tasks  Executed in complex distributed

environments  Difficult to track failures, search

through thousands of files