Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory




Outline

- Motivation / Why do we care?
- Related Work / What have others done?
- NetLogger’s Objective / What would we like to do?
- Background / What is NetLogger?
- How does NetLogger address the problems?
- What are the results / costs of the solution?


Motivation

- Large-scale applications are widely used in science and business.
  - Astronomy, Biology, Weather Models, etc.
- Large-scale apps are complex and difficult to debug and optimize.
  - Large number of concurrent operations
  - Distributed resources
  - Hard to find bottlenecks


Related Work

- Applications can be “tightly coupled”, “loosely coupled” or “uncoupled”.
- Tools have mostly focused on tightly coupled applications.
  - Profiling and tracing of code segments (TAU, Paraver, FPMPI, Intel Trace Collector)
- Tools extended to loosely coupled apps
  - SvPablo – Automatic code instrumentation, with statistics collected for sections of source code.
  - Prophesy – Automatic code instrumentation and a database of performance information. Tunable granularity.
  - Paradyn – Dynamic insertion of instrumentation at runtime. Designed for message-passing and pthreads programs.


End Objective

- Focus on loosely coupled and uncoupled applications.
- We would like a tool that can combine performance information from multiple resources and application components and expose their interactions.


NetLogger Background

- Log Generation – Calls to logger libraries are added to source code at critical points to create event logs.
- Log Management – The various logs are collected and merged based on event timestamps.
- Visualization and Analysis – Events, system statistics and “lifelines” are displayed.
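The log management step above, merging per-host logs on event timestamps, can be sketched as follows. This is only an illustration, not NetLogger's own code: the event dictionaries, the field names `ts`, `event` and `host`, and the function `merge_logs` are all hypothetical, and each input log is assumed to be sorted by timestamp already.

```python
import heapq

def merge_logs(*logs):
    """Merge per-host event lists into one timeline.

    Assumes every input list is already sorted by its "ts" field
    (a hypothetical field name chosen for this sketch).
    """
    return list(heapq.merge(*logs, key=lambda event: event["ts"]))

# Hypothetical events from two hosts.
host_a = [
    {"ts": 1.0, "event": "job.start", "host": "a"},
    {"ts": 4.0, "event": "job.end", "host": "a"},
]
host_b = [
    {"ts": 2.5, "event": "xfer.start", "host": "b"},
    {"ts": 3.0, "event": "xfer.end", "host": "b"},
]

merged = merge_logs(host_a, host_b)
for event in merged:
    print(event["ts"], event["host"], event["event"])
```

Because `heapq.merge` consumes its inputs lazily, the same approach works on iterators over large files without loading them fully into memory, provided clocks on the hosts are reasonably synchronized.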


Extensions to NetLogger

- Scaling NetLogger to large-scale systems (hundreds of machines)
  - Collecting distributed log files
  - Evaluating large log data sets
- Addition of workflow identifiers
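To show why a workflow identifier helps, the sketch below groups events that share an identifier into one time-ordered lifeline. The field names `wf_id`, `ts` and `event` and the function `build_lifelines` are assumptions made for this example, not NetLogger's actual log format or API.

```python
from collections import defaultdict

def build_lifelines(events, id_field="wf_id"):
    """Group events by workflow identifier, ordered by timestamp.

    `id_field` is a hypothetical field name; the real identifier
    field in a NetLogger log may be called something else.
    """
    lifelines = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        lifelines[event[id_field]].append(event["event"])
    return dict(lifelines)

# Hypothetical interleaved events from two workflows.
events = [
    {"ts": 1.0, "wf_id": "wf1", "event": "start"},
    {"ts": 1.2, "wf_id": "wf2", "event": "start"},
    {"ts": 2.0, "wf_id": "wf1", "event": "end"},
    {"ts": 2.4, "wf_id": "wf2", "event": "end"},
]

lifelines = build_lifelines(events)
print(lifelines)
```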


Log Collection and Management

- Netlogd
  - Collection daemon which accepts logs across the network (UDP or TCP)
- Nlforward
  - For finer-grain instrumentation, events can be written to local disk and forwarded in batches
- Nldemux
  - Server-side tool to scan incoming logs
  - Splits events into separate files
  - Allows for log file rollovers
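The idea of forwarding events in batches rather than one at a time can be sketched as below. This is not how Nlforward is implemented: the class `BatchForwarder`, its `send` callback and the `batch_size` parameter are hypothetical, and the buffer here lives in memory, whereas the slide describes events being written to local disk first.

```python
class BatchForwarder:
    """Buffer events and hand them to `send` in fixed-size batches.

    Hypothetical illustration only; an in-memory list stands in for
    the local disk file described on the slide.
    """

    def __init__(self, send, batch_size=3):
        self.send = send            # callable that receives a list of events
        self.batch_size = batch_size
        self.buffer = []

    def log(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, even a partial batch.
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()

sent = []                           # stands in for the network collector
forwarder = BatchForwarder(sent.append, batch_size=3)
for i in range(7):
    forwarder.log(f"event-{i}")
forwarder.flush()                   # push the final partial batch
print([len(batch) for batch in sent])
```

Batching trades a little latency for far fewer network sends, which matters when instrumentation is fine-grained.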


Sifting Through the Data

- The huge amount of log data from just 5 nodes obscures important events.


Anomalous Workflow Detection Tool

- Define a linear sequence of events in a configuration file.
- Mark any workflow lifeline that is missing these events.
- Problems:
  - We would like some context for normal behavior. (Solved by an option to include neighbors of anomalous lifelines.)
  - There are too many events to keep them all in memory for scanning.
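The detection rule above can be sketched in a few lines. Everything named here is an assumption for illustration: the expected event names, the function `find_anomalies`, and in particular the reading of "neighbors" as the lifelines adjacent in log order. This simple version also holds all lifelines in memory, which is exactly the limitation noted above.

```python
# Hypothetical expected sequence; a real tool would read it from a config file.
EXPECTED = ["start", "process", "end"]

def find_anomalies(lifelines, expected=EXPECTED, context=1):
    """Return (anomalous ids, neighboring normal ids).

    A lifeline is anomalous if it is missing any expected event.
    "Neighbors" are assumed to be the `context` lifelines on each
    side in the order the lifelines appear.
    """
    ids = list(lifelines)
    bad = [i for i, wf in enumerate(ids)
           if any(name not in lifelines[wf] for name in expected)]
    anomalous = {ids[i] for i in bad}
    neighbors = set()
    for i in bad:
        for j in range(max(0, i - context), min(len(ids), i + context + 1)):
            if ids[j] not in anomalous:
                neighbors.add(ids[j])
    return anomalous, neighbors

lifelines = {
    "wf1": ["start", "process", "end"],
    "wf2": ["start", "process"],          # never finished
    "wf3": ["start", "process", "end"],
    "wf4": ["start", "process", "end"],
}
anomalous, neighbors = find_anomalies(lifelines)
print(sorted(anomalous), sorted(neighbors))
```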


Solutions

- Solution 1
  - Create a histogram with 100 bins for normal workflow execution times.
  - Time out a workflow once it runs past the 99th percentile.
  - Runs in a fixed memory footprint.
  - Supports additional parameters (min time, max time, etc.)
- Solution 2
  - Calculate a running mean and standard deviation of workflow runtimes.
  - Assumes a statistically normal distribution of times.
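Both solutions can be sketched with constant-memory data structures. The code below is an illustration under stated assumptions, not the authors' implementation: the class names, the default bin range, and the three-standard-deviation cutoff are all hypothetical, and the running statistics use Welford's online algorithm, which the slides do not name.

```python
import math

class RuntimeHistogram:
    """Solution 1 sketch: fixed-size histogram of workflow runtimes."""

    def __init__(self, min_time=0.0, max_time=100.0, bins=100):
        self.min_time, self.max_time, self.bins = min_time, max_time, bins
        self.width = (max_time - min_time) / bins
        self.counts = [0] * bins        # memory use never grows
        self.total = 0

    def add(self, runtime):
        idx = int((runtime - self.min_time) / self.width)
        idx = min(max(idx, 0), self.bins - 1)   # clamp out-of-range values
        self.counts[idx] += 1
        self.total += 1

    def timeout(self, percentile=99.0):
        """Upper edge of the bin where the percentile is reached."""
        if self.total == 0:
            return self.max_time
        target = self.total * percentile / 100.0
        seen = 0
        for idx, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.min_time + (idx + 1) * self.width
        return self.max_time

class RunningStats:
    """Solution 2 sketch: running mean and standard deviation (Welford)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x, k=3.0):
        # k = 3 is an assumed cutoff; it presumes roughly normal runtimes.
        return self.n > 1 and abs(x - self.mean) > k * self.stddev()

hist = RuntimeHistogram()
for i in range(100):
    hist.add(i + 0.5)                   # one hypothetical runtime per bin
cutoff = hist.timeout()

stats = RunningStats()
for runtime in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(runtime)
print(cutoff, stats.mean, stats.stddev())
```

The histogram makes no assumption about the shape of the runtime distribution but needs a sensible bin range up front; the running statistics need no range but rely on the normality assumption noted above.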


NetLogger Workflow-logging Architecture


New Log Visualization

- The 3 incomplete events from the previous picture are shown in blue, with context events shown in red.
- The tool was able to detect several errors in the SNFactory workflow application.


Key Differences in NetLogger

- Use of “Lifelines” to trace a sequence of actions.
- Workflow anomaly detection.
- Facilitates log collection from multiple locations.
- Manual instrumentation of source code.
  - Must have the source code and understand it.


The End.

Questions? Comments?