Sustainable Logging – SplunkLive! 2014
Copyright © 2014 Splunk Inc.
Sustainable Logging: SUCCEEDING WITH SPLUNK
2
Paul Gilowey Foundation Technology Specialist
@paulcgt
Words and thoughts expressed herein are my own, and not those of Santam.
3
www.dan-dare.org
4
My technology background
5
The evolution that led to Splunk
6
In the beginning there was ONE.
depotwallpaper.com
7
Then things got really complex.
8
9
10
In 2012, a new project
11
A big decision
It’s time to say goodbye…
12
Highly distributed and integrated
13
A brand new world
Claims Finance Docs B2B Portal Legacy
Reverse Proxies
Load-balancers IDM Integration ESM Virtualisation
New Policy Administration
MDM
14
James Wheeler souvenirpixels.com
Too many logs to monitor
15 capetownstockphotos.com
So little time to trace problems
16
Not only in production
https://www.flickr.com/photos/wsdot/
17
On a tight timeline
18 https://www.flickr.com/photos/usnavy/
December 2013 Production and Non-Production
20GB
19
Now what?
So we’re collecting log events.
20
Developers like doing things the old way
21
tail -f ./catalina.out
22
We like this. It’s comforting.
23
Effecting change
24
CTO’s Office
Splunk users (dev, ops, etc.)
Choosing your champion
25
Your champion should…
• have influence across departments
• act as product owner
• be fanatical
• be hands-on
• have a development background
• be an architect
Dave Keeshan - https://www.flickr.com/photos/spudmurphy/
26
Tips to help your champion
27
Help developers
troubleshoot (even in dev)
Ed Yourdon https://www.flickr.com/photos/yourdon/
28
Change how developers think
about log events
29
Police
lazy logging
[INFO ] Got here
[INFO ] finished loop 420
[INFO ] JDE…
[INFO ] >>>>>>>>AAAAAAAA
[INFO ] BBBBBBBBBBBBBBB
[ERROR] It failed!!!!!!
30
Ops might as well be blindfolded.
https://www.flickr.com/photos/foxtongue
31
Do you really want to be called at 2am?
32
Demonstrate thoughtful logging
[DEBUG] TxId=328, Counting invoice line items…
[INFO ] TxId=328, Invoice LineItemsTotal=420
[DEBUG] TxId=328, Calling remote service JDE…
[TRACE] TxId=328, JDE Request: {"TxID":"328",
"Items":[{"desc":"Motor Vehicle","prem":305.24},…
[WARN ] TxId=328, Timed out while calling remote service
JDE… target system may be down. Will retry in 30s.
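A minimal Python sketch of the same idea, for teams that do not log from Java: every event carries the transaction id, and the level reflects how much detail the event holds. The logger name `invoice-demo`, the `TxAdapter` class and the `count_line_items` function are hypothetical names invented for this illustration; only the message layout is taken from the slide.

```python
import io
import logging


class TxAdapter(logging.LoggerAdapter):
    """Prefix every message with the transaction id held in `extra`."""

    def process(self, msg, kwargs):
        return f"TxId={self.extra['tx_id']}, {msg}", kwargs


def build_logger(stream):
    """Return a logger that writes '[LEVEL] message' lines to `stream`."""
    logger = logging.getLogger("invoice-demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)-5s] %(message)s"))
    logger.addHandler(handler)
    return logger


def count_line_items(log, items):
    """Hypothetical business step: say what is happening, then the result."""
    log.debug("Counting invoice line items...")
    total = len(items)
    log.info("Invoice LineItemsTotal=%d", total)
    return total


buffer = io.StringIO()
log = TxAdapter(build_logger(buffer), {"tx_id": 328})
count_line_items(log, ["Motor Vehicle"] * 420)
print(buffer.getvalue(), end="")
```

Binding the id once in the adapter means a developer cannot forget to add it to an individual message, which is what makes a later search on `TxId=328` return the whole transaction.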
33
Show the benefit of structured log events
[INFO] Purchase complete - total=42 currency=ZAR language=en_ZA priority=13
"Purchase complete" priority<4 |
stats sum(total) as currencyTotal by currency |
table currency, currencyTotal
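One way to make key=value events cheap to produce is a small helper that developers call instead of building strings by hand. The sketch below is an assumption about how such a helper could look in Python; `format_event` and `_quote` are hypothetical names, and the quoting of values that contain spaces relies on Splunk's usual handling of `key="quoted value"` pairs, which is worth verifying against your own field extractions.

```python
def _quote(value):
    """Wrap values containing spaces or quotes so they stay one field."""
    text = str(value)
    if " " in text or '"' in text:
        return '"' + text.replace('"', '\\"') + '"'
    return text


def format_event(message, **fields):
    """Render 'message - key=value key=value ...' in keyword order."""
    pairs = " ".join(f"{key}={_quote(value)}" for key, value in fields.items())
    return f"{message} - {pairs}" if pairs else message


event = format_event(
    "Purchase complete",
    total=42,
    currency="ZAR",
    language="en_ZA",
    priority=13,
)
print(event)
```

With the values from the slide, the helper reproduces the event shown above, so the same `stats sum(total) ... by currency` search works without any custom extraction.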
34
11 Sep 2014 15:05:27,960 [Thread-428] [DEBUG] [stm.amx.communication.outboundcommunicationmanager] za.co.santam.communication.outboundcommunicationmanager.RunnableStatusReceiver - btid=77320d33-5f8c-4178-b13e-c594816463d8, cmpid=za.co.santam.communication.outboundcommunicationmanager.RunnableStatusReceiver, uid=System, za.co.santam.communication.outboundcommunicationmanager.RunnableStatusReceiver.processStatusMessage : Status [STATUS_PROCESSING_COMPLETED = 6], will act on [STATUS_FINISHED = 1], for now only GENERATE_DIGITAL_DOCUMENT.
11 Sep 2014 15:05:36,272 [Thread-428] [DEBUG] [stm.amx.communication.outboundcommunicationmanager] za.co.santam.communication.outboundcommunicationmanager.RunnableReceiver - btid=e76665e2-e876-455a-a087-aeb5ba97d5a8, cmpid=za.co.santam.communication.outboundcommunicationmanager.RunnableStatusReceiver, uid=System, za.co.santam.communication.outboundcommunicationmanager.RunnableStatusReceiver.processMessages : Blocking(2000) read storage until message arrives...
11 Sep 2014 15:05:36,472 [Thread-427] [DEBUG] [stm.amx.communication.outboundcommunicationmanager] za.co.santam.communication.outboundcommunicationmanager.RunnableReceiver - btid=e76665e2-e876-455a-a087-aeb5ba97d5a8, cmpid=za.co.santam.communication.outboundcommunicationmanager.RunnableStorageReceiver, uid=System, za.co.santam.communication.outboundcommunicationmanager.RunnableStorageReceiver.processMessages : message received.
11 Sep 2014 15:05:36,475 [Thread-427] [TRACE] [com.tibco.amx.platform] com.tibco.governance.amxagent.msginterceptor.component.AMXGovMsgInterceptorComponent - Target URI : urn:amx:env2/stm.amx.communication.outboundcommunicationmanager/StatusReceiver_1.2.0.v2014-09-10-1604#reference(StatusReceiver_ContentManagerProxyAsync_v4_Int).
Change this…
35
… into this.
36
Formalise stacktrace logging policy
Function call ->
Function call ->
Function call ->
Function call
<- Log stacktrace
<- Log stacktrace
<- Log stacktrace
<- Log stacktrace
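The slide does not spell the policy out, so the following is only one possible reading: if every level of the call chain logs the stacktrace on the way back, the same failure appears several times in the index. A sketch of a "log once, at the boundary" policy in Python, with every function name hypothetical:

```python
import io
import logging


class ServiceError(Exception):
    """Hypothetical application-level error."""


def call_remote():
    # Innermost call: fail without logging anything here.
    raise TimeoutError("remote service did not answer")


def build_invoice():
    try:
        call_remote()
    except TimeoutError as exc:
        # Add context and re-raise; the cause stays chained, nothing is logged.
        raise ServiceError("could not build invoice") from exc


def handle_request(log):
    """Outermost boundary: the only place the stacktrace is logged."""
    try:
        build_invoice()
    except ServiceError:
        log.exception("TxId=328, request failed")  # TxId reused from the slides
        return False
    return True


buffer = io.StringIO()
logger = logging.getLogger("stacktrace-demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.handlers.clear()
logger.propagate = False
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("[%(levelname)-5s] %(message)s"))
logger.addHandler(handler)

ok = handle_request(logger)
print(buffer.getvalue(), end="")
```

Because the inner exception is chained with `raise ... from`, the single ERROR event still contains both the `TimeoutError` and the `ServiceError` traces, so nothing is lost by logging only once.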
37
Avoid filtering events.
[DEBUG] TxId=328, Real important debug statement.
[INFO ] TxId=328, This would have been useful to see...
[DEBUG] TxId=328, Useful when we really need it.
[TRACE] TxId=328, Oh man, I need this event so bad.
[DEBUG] TxId=328, Flippin’ important debug message.
[INFO ] TxId=328, This would have been useful to see...
[WARN ] TxId=328, Why am I logging at all?
38
Avoid filtering events.
[WARN ] TxId=328, Real important debug statement.
[WARN ] TxId=328, This would have been useful to see...
[WARN ] TxId=328, Useful when we really need it.
[WARN ] TxId=328, Oh man, I need this event so bad.
[WARN ] TxId=328, Flippin’ important debug message.
[WARN ] TxId=328, Cummon, I *really* wanna see this!
[WARN ] TxId=328, Why am I logging at all?
39
tail -f ./catalina.out
40
Why developer buy-in matters
41
“A fool with a tool is still a fool.” Grady Booch
42
• Laughable deadlines
• Long days, longer nights
• Management pressure
43
If we log excessively…
44
Bob B. Brown - https://www.flickr.com/photos/beleaveme
45
tail -f ./catalina.out
46
Nope, no fires today, folks.
Robert du Bois https://www.flickr.com/photos/lordisgood
47
No value, no money.
Neubie - https://www.flickr.com/photos/neubie/
48
Shelfware.
Robert Couse-Baker https://www.flickr.com/photos/29233640@N07/
49
8 steps to successful implementation
50
1. Start small (but plan to grow big)
Pewstruck.com - https://www.flickr.com/photos/canoodlepets/
51
2. Start with a clean slate
52
3. Take a smart approach
• Learn
• Implement
• Stabilise
• Spread the word
• Refine
53
4. Build alerts and set policy
• Dashboards are pretty, alerts are king
• Reactive becomes proactive
• Register defects (ERROR = defect)
• Filter, don’t flood mailboxes
54
5. Receive all alerts yourself
• Get a feel for the pain
• Make sure filtering is working
• Police false positives
55
6. Help managers look good
• Mine their data yourself: find what’s difficult to show, then build dashboards to showcase their solutions
• Broaden their minds: complement traditional BI by using log events
56
7. Apply the Goldilocks Principle
“Not too hot, not too cold, just right!”
“Meh – too sloooow…”
“Too expensive!”
57
8. Monitor licence usage by source or source type
index=_internal source=*metrics.log
group="per_sourcetype_thruput"
| stats sum(kb) as KB by series
| where KB > 20000
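For readers less familiar with the search language, the search above sums the `kb` field per `series` (the source type) and keeps only the series whose total exceeds 20000. The Python sketch below mirrors that logic on a handful of made-up records; the sample values and the function name `heavy_sourcetypes` are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict


def heavy_sourcetypes(events, threshold_kb=20000):
    """Sum kb per series for per_sourcetype_thruput events, keep totals above the threshold."""
    totals = defaultdict(float)
    for event in events:
        if event.get("group") == "per_sourcetype_thruput":
            totals[event["series"]] += float(event["kb"])
    return {series: kb for series, kb in totals.items() if kb > threshold_kb}


# Made-up sample records shaped like the fields the search relies on.
sample = [
    {"group": "per_sourcetype_thruput", "series": "log4j", "kb": 15000.0},
    {"group": "per_sourcetype_thruput", "series": "log4j", "kb": 9000.0},
    {"group": "per_sourcetype_thruput", "series": "access_combined", "kb": 1200.0},
    {"group": "per_index_thruput", "series": "main", "kb": 50000.0},
]

result = heavy_sourcetypes(sample)
print(result)
```

In this sample only `log4j` crosses the threshold; the `per_index_thruput` record is ignored because the search filters on the `per_sourcetype_thruput` group.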
58
Wrapping up
59
To be successful:
• Encourage thoughtful logging
• Promote good logging practices
• Police bad behaviour
• Be intimately involved
• Adopt a helpful attitude
• Make sure you show value