UC Berkeley

Online System Problem Detection by Mining Console Logs

Wei Xu* Ling Huang†

Armando Fox* David Patterson* Michael Jordan*

*UC Berkeley † Intel Labs Berkeley


Why console logs?

• Detecting problems in large scale Internet services often requires detailed instrumentation

• Instrumentation can be costly to insert & maintain
  – High code churn
  – Systems often combine open-source building blocks that are not all instrumented

• Can we use console logs in lieu of instrumentation?

+ Easy for developers, so nearly all software has them

– Imperfect: not originally intended for instrumentation

Problems we are looking for

• The easy case – rare messages

• Harder but useful – abnormal sequences

NORMAL:  receiving blk_1 → received blk_1
ERROR:   receiving blk_2 → … what is wrong with blk_2?


Overview and Contributions

* Large-scale system problem detection by mining console logs (SOSP ’09)

• Accurate online detection with small latency


[Pipeline: free-text logs from 200 nodes → parsing* → frequent-pattern-based filtering (dominant cases → OK) → PCA detection on the remaining non-pattern events (normal cases → OK, real anomalies → ERROR)]

Constructing event traces from console logs

• Parse: message type + variables
• Group messages by identifiers (automatically discovered)
  – Group ≈ event trace (a minimal sketch follows the example below)

Example – interleaved messages grouped into per-block event traces:

  blk_1: receiving blk_1, receiving blk_1, received blk_1, received blk_1, reading blk_1, reading blk_1
  blk_2: receiving blk_2, receiving blk_2, received blk_2
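Conceptually, the two steps look like the following minimal sketch (Python), assuming a toy "verb blk_N" message format; the regex and helper names are illustrative stand-ins, not the actual parser, which extracts message templates rather than relying on a hand-written pattern.

```python
import re
from collections import defaultdict

# Toy template "<message type> blk_<id>"; stands in for a real template parser.
LOG_RE = re.compile(r"^(?P<etype>\w+) (?P<block>blk_\d+)$")

def parse(line):
    """Split one console message into (message type, identifier variable)."""
    m = LOG_RE.match(line.strip())
    return (m.group("etype"), m.group("block")) if m else None

def group_by_identifier(lines):
    """Group parsed events by identifier; each group is one event trace."""
    traces = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed:
            etype, block = parsed
            traces[block].append(etype)
    return traces

log = ["receiving blk_1", "receiving blk_1", "received blk_1",
       "received blk_1", "reading blk_1", "receiving blk_2"]
print(dict(group_by_identifier(log)))
# {'blk_1': ['receiving', 'receiving', 'received', 'received', 'reading'],
#  'blk_2': ['receiving']}
```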


Online detection: when to make a detection decision?

• Cannot wait for the entire trace
  – A trace can last an arbitrarily long time
• How long do we have to wait?
  – Long enough to keep correlations
  – A wrong cut = a false positive
• Difficulties
  – No obvious boundaries
  – Inaccurate message ordering
  – Variations in session duration


[Timeline for blk_1: receiving → received → reading → reading → deleting → deleted → receiving → received]

Frequent patterns help determine session boundaries

• Key insight: most messages/traces are normal
  – Strong patterns
• “Make common paths fast”
• Tolerate noise


Two-stage detection overview

[Two-stage pipeline diagram, as above: free-text logs → parsing → frequent-pattern-based filtering → PCA detection]

Stage 1 - Frequent patterns (1): Frequent event sets


[Timeline of blk_1 events: receiving, received, reading, deleting, deleted, receiving, received, with a leftover reading blk_1 / error blk_1]

• Coarse cut by time
• Find frequent item set
• Refine time estimation
• Repeat until all patterns found
• Remaining non-pattern events (e.g. error blk_1) go to PCA detection
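A rough sketch of this loop under simplifying assumptions: events arrive as (timestamp, identifier, message type) tuples, coarse sessions are cut wherever the time gap exceeds a window, and a candidate pattern is simply the most common resulting event multiset. The function names, window, and support threshold are illustrative, not the algorithm verbatim.

```python
from collections import Counter, defaultdict

def coarse_sessions(events, window):
    """Cut each identifier's event stream wherever the time gap exceeds `window` seconds."""
    per_id = defaultdict(list)
    for ts, ident, etype in sorted(events):
        per_id[ident].append((ts, etype))
    sessions = []
    for evs in per_id.values():
        cur = [evs[0]]
        for prev, nxt in zip(evs, evs[1:]):
            if nxt[0] - prev[0] > window:
                sessions.append(cur)
                cur = []
            cur.append(nxt)
        sessions.append(cur)
    return sessions

def most_frequent_event_set(sessions, min_support=2):
    """Take the most common event multiset among coarse sessions as a candidate pattern."""
    counts = Counter(tuple(sorted(e for _, e in s)) for s in sessions)
    pattern, support = counts.most_common(1)[0]
    return pattern if support >= min_support else None

# Illustrative usage: in the full loop, sessions matching the pattern would be
# removed, the time window refined from their observed durations, and the
# mining repeated until no frequent event set remains; leftovers go to PCA.
events = [(0, "blk_1", "receiving"), (1, "blk_1", "received"),
          (2, "blk_1", "reading"),  (50, "blk_1", "deleting"),
          (51, "blk_1", "deleted"), (0, "blk_2", "receiving"),
          (1, "blk_2", "received"), (2, "blk_2", "reading")]
print(most_frequent_event_set(coarse_sessions(events, window=10)))
# ('reading', 'received', 'receiving')
```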


Stage 1 - Frequent patterns (2): Modeling session duration time

• Assuming Gaussian?
  – 99.95th percentile estimate is off by half
  – 45% more false alarms
• Mixture distribution
  – Power-law tail + histogram head

[Plots: session-duration histogram (count vs. duration) and tail distribution Pr(X >= x) vs. duration]
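A sketch of how such a mixture could be used to estimate the 99.95th-percentile cutoff: keep the empirical head of the duration distribution and fit a power-law (Pareto) tail beyond a chosen cutoff xmin. The Hill-style estimator and the xmin parameter here are assumptions for illustration, not the fitting procedure from the paper.

```python
import numpy as np

def fit_tail(durations, xmin):
    """Hill/MLE estimate of the power-law exponent for durations >= xmin."""
    tail = np.array([d for d in durations if d >= xmin], dtype=float)
    alpha = 1.0 + len(tail) / np.log(tail / xmin).sum()
    return alpha, len(tail) / len(durations)        # exponent and tail mass

def percentile_estimate(durations, xmin, q=0.9995):
    """q-th percentile from an empirical head plus a power-law tail beyond xmin."""
    durations = np.asarray(durations, dtype=float)
    alpha, tail_mass = fit_tail(durations, xmin)
    if (1.0 - q) >= tail_mass:                      # percentile falls in the head
        return float(np.quantile(durations, q))
    # In the tail: Pr(X > x) = tail_mass * (x / xmin) ** -(alpha - 1) = 1 - q
    return float(xmin * ((1.0 - q) / tail_mass) ** (-1.0 / (alpha - 1.0)))
```

The estimated percentile then serves as the session timeout: a session that runs past it is cut and handed to the next stage instead of being waited on indefinitely.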


Stage 2 - Handling noise with PCA detection

• More tolerant to noise
• Principal Component Analysis (PCA) based detection

[Pipeline diagram, highlighting the PCA detection stage on non-pattern events]
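A compact sketch of subspace-based PCA detection on per-session event-count vectors (one row per session, one column per message type). The number of components k and the percentile-based threshold are simple stand-ins for the statistical threshold used in practice.

```python
import numpy as np

def pca_detector(train_counts, k=3, quantile=0.999):
    """Build an anomaly test from per-session event-count vectors.

    train_counts: (n_sessions, n_message_types) matrix of event counts,
    assumed to contain mostly normal sessions.
    """
    mean = train_counts.mean(axis=0)
    X = train_counts - mean
    # Top-k principal directions span the "normal" subspace.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k].T                                   # (n_message_types, k) basis
    residual = X - X @ P @ P.T                     # projection onto residual subspace
    spe = (residual ** 2).sum(axis=1)              # squared prediction error per session
    threshold = np.quantile(spe, quantile)         # stand-in for a formal threshold

    def is_anomaly(count_vec):
        y = np.asarray(count_vec, dtype=float) - mean
        r = y - P @ (P.T @ y)
        return float(r @ r) > threshold
    return is_anomaly
```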

Frequent pattern matching filters most of the normal events

[Pipeline with event volumes: 100% of events enter parsing*; frequent-pattern filtering marks 86% as dominant cases (OK); the remaining 14% non-pattern events go to PCA detection, of which 13.97% are normal (OK) and 0.03% are real anomalies (ERROR)]


Evaluation setup

• Hadoop file system (HDFS)
• Experiment on Amazon’s EC2 cloud
  – 203 nodes × 48 hours
  – Running standard MapReduce jobs
  – ~24 million lines of console logs
• 575,000 traces
  – ~680 distinct ones
• Manual labels from previous work
  – Normal/abnormal + why it is abnormal
  – “Eventually normal” – did not consider time
  – For evaluation only


Frequent patterns in HDFS

Frequent pattern               99.95th-percentile duration   % of messages
Allocate, begin write          13 sec                        20.3%
Done write, update metadata    8 sec                         44.6%
Delete                         –                             12.5%
Serving block                  –                             3.8%
Read exception                 –                             3.2%
Verify block                   –                             1.1%
Total                                                        85.6%

• Covers most messages
• Short durations

(Total events: ~20 million)


Detection latency

• Detection latency is dominated by the wait time

[Latency distribution by event category: single-event pattern, frequent pattern (matched), frequent pattern (timed out), non-pattern events]


Detection accuracy

          True Positives   False Positives   False Negatives   Precision   Recall
Online    16,916           2,748             0                 86.0%       100.0%
Offline   16,808           1,746             108               90.6%       99.3%

(Total traces = 575,319)
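As a quick sanity check, the online precision and recall follow directly from the counts above (a hypothetical helper, not code from the paper):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Online detection row: 16,916 TP, 2,748 FP, 0 FN
print(precision_recall(16916, 2748, 0))  # -> (0.860..., 1.0), i.e. 86.0% / 100.0%
```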

• Ambiguity on “abnormal”
  – Manual labels mark sessions as “eventually normal”
  – >600 FPs in online detection are due to very long latency
  – E.g. a write session takes >500 sec to complete (the 99.99th percentile is 20 sec)

Future work

• Distributed log stream processing
  – Handle large-scale clusters + partial failures
• Clustering alarms
• Allowing feedback from operators
• Correlating logs from multiple applications / layers


Summary

http://www.cs.berkeley.edu/~xuw/
Wei Xu <xuw@cs.berkeley.edu>

          Precision   Recall
Online    86.0%       100.0%

