Data Science - SplunkConf 2013

Page 1

Copyright © 2013 Splunk Inc.

Ron Naken, Principal Engineer
#splunkconf

Data Science

Page 2

Legal Notices
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

 

Splunk, Splunk>, Splunk Storm, Listen to Your Data, SPL and The Engine for Machine Data are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners.

© 2013 Splunk Inc. All rights reserved.

Page 3

About Me

- Developer and security professional for 14+ years
- Best known for integration work with AS/400, NetApp, and ServiceNow
- Studied psychology at the University of California, Irvine, and now helps customers envision creative ways to apply operational intelligence, using mathematics to evoke "human thought" from data

Page 4

About My Company

splunk> take the SH out of IT…

Page 5

Agenda

Introduction: Data Science
- Why Learn It?
- What Should We Focus on During This Course?
- What Is It?

Abnormal Behavior
- Detect Abnormal Behavior
- Calculate Dynamic Thresholds

Standardizing (ab)normal
- Correlate Seemingly Unrelated Data Sources
- Calculate Probability Without Complex Formulas

Page 6

Agenda  

Temporal Proximity
- Automatically Correlate Issues to Root Cause
- Splunk Just Did Your Job for You!

Relative Volume
- Detect Abnormal Data Volumes
- Find Questions When Our Data Is Full of Answers

Page 7

Introduction: Data Science
Why do we want to learn about data science?

 

It allows Splunk to THINK LIKE A HUMAN

and because Splunk can do our thinking, Splunk can do our work:
- Automatically correlate root cause to incidents
- Automatically find abnormal errors or warnings
- Automatically find people doing abnormal things

This means we can ask Splunk to think things through for us, such as when CPU looks abnormal even though normal CPU levels may vary by hour of the day or day of the week. (This is one example we will see later in the chapter.)

Page 8

Introduction: Data Science
Keep this in mind as we go through the course:

Splunk is EASY:
- If something seems difficult or complex, we're probably overthinking it
- Complexity lies in remembering not to overthink a problem

Statistics is EASY:
- Because Splunk does it for you!
- So while we're going to cover a moderate amount of it, just focus on understanding the benefits of what the formulas accomplish

Page 9

Introduction: Data Science
What is data science?

Among its many meanings, we will focus on the following:

…build on techniques and theories - from many areas of study (e.g. mathematics, statistics, pattern recognition) - to extract meaning from data…

Page 10

Abnormal Behavior

Page 11

Introduction: Statistics

What is normal?...

[Figure from http://www.mathsisfun.com/data/standard-deviation.html]

Page 12

Introduction: Statistics
Standard deviation can be used to detect 'normality':

Standard deviation (σ) = √variance
Variance (σ²) = ((distance from mean)² + …) / n

    σ² = ( Σ_{i ∈ S} i² ) / n,  where i = the distance of each point from the mean

Two types of standard deviation and variance:

- Population - the dataset represents the entire relevant 'world': stdevp(), varp()
- Sample - the dataset is a 'sample' from a larger relevant 'world': stdev(), var()

Sample-based variance and sample-based standard deviation are calculated with (n - 1) in place of (n).
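As a minimal SPL sketch of that distinction (the sourcetype and field names here are hypothetical), the population and sample variants map directly onto built-in statistical functions:

    sourcetype=web:perf
    | stats avg(response_time) AS mean
            stdevp(response_time) AS sd_population
            stdev(response_time) AS sd_sample

With large datasets the two standard deviations converge, so the choice mostly matters for small samples.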

Page 13

Introduction: Statistics

What is normal?...

[Figure from http://www.mathsisfun.com/data/standard-deviation.html, annotated with the mean, mean + σ, and mean - σ]

Page 14

(ab)normal Human Behavior

Calculating abnormal heights within a population of dogs is the same as calculating "abnormal" human behavior.

Look how simple the search is that identifies "botting" in a popular MMORPG (Massively Multiplayer Online Role-Playing Game):

    sourcetype=mmo:actions
    | bucket _time span=10m
    | stats count AS c BY _time toon
    | eventstats avg(c) AS mean stdevp(c) AS sdev
    | where c > (mean + sdev)
    | stats count sparkline BY toon
    | sort - count

This search exemplifies the simplicity with which abnormal behavior can be detected by Splunk. It assumes that "abnormal" is defined as outside of 1 standard deviation. Later in the chapter, we will investigate z-values, which further simplify the data and allow us to make some universal assumptions.

Page 15

(ab)normal Machine Behavior

Calculating abnormal CPU utilization for a Windows machine is also the same.

    sourcetype="WMI:CPUTime"
    | bucket _time span=10m
    | stats max(PercentProcessorTime) AS cpu_max BY _time date_wday date_hour
    | eventstats avg(cpu_max) AS max_avg stdevp(cpu_max) AS sd BY date_wday date_hour
    | eval cpu_ceil=max_avg + sd
    | eval earliest=relative_time(now(), "-1d@d")
    | eval latest=relative_time(now(), "@d")
    | where cpu_max > cpu_ceil AND _time > earliest AND _time < latest
    | fields - earliest latest

This example takes the maximum CPU utilization for every 10-minute window of a day and compares it against what we expect to be "normal" during the specific hour of the given day; we will see how to exclude today's data from the baseline in a later section on z-values and storage.

- Calculate max CPU utilization for each 10-minute window for each day of the week. Note the split by _time, day, and hour.
- Add standard deviation and mean to the dataset. Note the split here is by hour and day, not including _time.

The table represents each time the CPU went to abnormal levels - this is our upcoming example; the chart uses the same search, with a modified "where" clause and "fields" command.


Page 17

Standardizing (ab)normal

Page 18

Introduction: Statistics
Normalization: z-value | z-score | standard score

- Much of the mathematics assumes a normal distribution, where data points form a bell curve or Gaussian curve
  – We can normalize data (normal or not) into a standard normal distribution, or z-distribution
- z-value | z-score | standard score
  – Unitless - compare across technologies
  – Empirical rule - the 68-95-99.7 rule
  – Access to a z-table for pre-calculated probability
  – z = (x – μ) / σ  =  (data point – mean) / (standard deviation)
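As a minimal SPL sketch (sourcetype and field names are hypothetical), eventstats can attach the mean and standard deviation to every event so the z-value is a one-line eval:

    sourcetype=web:perf
    | eventstats avg(response_time) AS mean stdevp(response_time) AS sd
    | eval z=(response_time - mean) / sd
    | where z > 2

The same pattern appears in the volume searches later in this deck.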

Page 19

CPU to Bandwidth: Malware

EXAMPLE Z-CORRELATION SEARCH: CPU % utilization in contrast to network volume, with a worm spewing on the wire.

In this example, we compare CPU utilization (%) to bandwidth consumption (MBps); by converting the distributions to z-values, we have a normalized, unitless measure to compare.
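The slide presented this as a chart; a rough sketch of the kind of search that could drive it (the network sourcetype and its mbps field are hypothetical) normalizes each metric to a z-value so both series can share one axis:

    sourcetype="WMI:CPUTime" OR sourcetype="net:throughput"
    | bucket _time span=10m
    | stats max(PercentProcessorTime) AS cpu max(mbps) AS bw BY _time
    | eventstats avg(cpu) AS cpu_m stdevp(cpu) AS cpu_sd avg(bw) AS bw_m stdevp(bw) AS bw_sd
    | eval cpu_z=(cpu - cpu_m) / cpu_sd
    | eval bw_z=(bw - bw_m) / bw_sd
    | table _time cpu_z bw_z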

Page 20

Introduction: Statistics
Terminology often used:

- nth percentile - n% of the data points fall beneath this data point
  – 95th percentile: 95% of the data points are below this value
  – For instance, the Empirical Rule (68-95-99.7) states that 95% of data points fall within 2 standard deviations of the mean; on a standard normal distribution, +2 standard deviations therefore sits near the 97.5th percentile

- Quartiles Q1, Q2, Q3 - the most common percentiles
  – Q1 = first quartile = 25th percentile
  – Q2 = second quartile = 50th percentile
  – Q3 = third quartile = 75th percentile

If you score 2100 on the SAT, did you do well? While this may represent the 96th percentile for 2012, how does it compare for this year?
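These map directly onto SPL percentile functions; a minimal sketch (sourcetype and field names hypothetical):

    sourcetype=web:perf
    | stats perc25(response_time) AS q1
            median(response_time) AS q2
            perc75(response_time) AS q3
            perc95(response_time) AS p95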

Page 21

Introduction: Statistics

With our data standardized into z-values, forming a standard normal distribution, the z-table provides a universal matrix of probability - the chance a value will be below a given point.

- Download one from the internet and use it as a lookup table!

What is the chance my CPU will be above 90%? μ (mean) = 75, σ (standard deviation) = 13.4

    z = (90 – 75) / 13.4 = 1.119  (closest z-table match = 1.12)

How to use the z-table: match x.x in the left column, and the 2nd decimal place in the top row.

    z-table value for 1.12 = 0.8686 → 86.86% chance to be below 90%
    1 – 0.8686 = 0.1314 → 13.14% chance to be above 90%!
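A minimal sketch of the lookup-table idea, assuming a hypothetical lookup definition named ztable backed by a CSV with columns z and p (cumulative probability):

    sourcetype=web:perf
    | eventstats avg(response_time) AS mean stdevp(response_time) AS sd
    | eval z=round((response_time - mean) / sd, 2)
    | lookup ztable z OUTPUT p
    | eval prob_above=1 - p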

 

 

 


Page 22

Introduction: Statistics
For the following example, which exam shows better performance?

    Math: 91%      (class μ = 90, σ = 1.1)
    History: 62%   (class μ = 60, σ = 1.5)

When we are graded on a scale from 1-100, this is an easy answer; however, if we are graded on a curve, we don't know how well we did without more information.

RESULTS:
    Math:    z = (91 – 90) / 1.1 = 0.909
    History: z = (62 – 60) / 1.5 = 1.333

The History score has the higher z-value, so relative to its class it is the better performance.


Page 24

Introduction: Statistics

Can I do this without a z-table lookup or complex math?

[Figure: z-Table]

Page 25

Introduction: Statistics

Empirical Rule - applies to any normal distribution
- 68% of data points are within 1 standard deviation
- 95% of data points are within 2 standard deviations
- 99.7% of data points are within 3 standard deviations

This becomes important once our data is normalized to z-values, because we now have a standard normal distribution and can make assumptions:
- Only 5% of the data points have a z-value outside 2 standard deviations (|z| > 2)
- Only 0.3% of the data points have a z-value outside 3 standard deviations (|z| > 3)

percentile - x% of data points fall below the value

* To create general alerts that find the "needle in the haystack", or rare outliers, it is simple to look for z-values (based on the volume of an event) that are near or greater than 3 standard deviations.
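A minimal sketch of such an alert (the `errors` macro is the one used elsewhere in this deck; the hourly span is an arbitrary choice), counting events per sourcetype per hour and flagging hours whose volume is at or beyond 3 standard deviations:

    `errors`
    | bucket _time span=1h
    | stats count AS c BY _time sourcetype
    | eventstats avg(c) AS m stdevp(c) AS sd BY sourcetype
    | eval z=(c - m) / sd
    | where abs(z) >= 3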

Page 26

Needle in a Haystack

EXAMPLE ERROR ANALYSIS: Alert on the needle in the haystack from an overgeneralized search; use the Empirical Rule to find outliers.

Page 27

Temporal Proximity

Page 28

Temporal Proximity
There are many ways we can ask Splunk to slice-and-dice, correlate, or map time; in this chapter, we're going to investigate a technique that is so easy, "Even a caveman can do it!"

- Given events that contain a timestamp, the _time field represents an epoch timestamp - seconds since 01/01/1970 - of when the event occurred
  – We can slice this into time windows and use it for correlation between events
- Here are some examples:
  – | eval minute_window = round(_time / 60, 0)
  – | eval ten_minute_window = round(_time / (60 * 10), 0)
  – | eval hour_window = round(_time / (60 * 60), 0)
  – | eval day_window = round(_time / (60 * 60 * 24), 0)
- Using one of these windows is like saying, "Show me everything that happened on Tuesday"; we can say, "correlate all errors that occurred within this ten_minute_window", or "on this day_window", etc.
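As a tiny illustration (sourcetype hypothetical), any of these windows can then serve as a split-by field, which is exactly what the next page's search does with a 5-minute window:

    sourcetype=app:errors
    | eval ten_minute_window = round(_time / (60 * 10), 0)
    | stats values(host) AS hosts count BY ten_minute_window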

Page 29

Time-Correlated Errors

    (sourcetype="cdn:generic" complete_percent < 100) OR `errors`
    | eval epoch=round(_time / (60 * 5), 0)
    | eval correlated_issues=if(sourcetype == "cdn:generic", null(), sourcetype + " | " + _raw)
    | eval error_time=if(sourcetype == "cdn:generic", strftime(_time, "%m-%d-%y %H:%M"), null())
    | stats list(error_time) AS error_time list(product) AS product list(correlated_issues) AS correlated_issues BY epoch
    | search error_time=* product=* correlated_issues=*
    | sort - epoch
    | fields - epoch

Search for all incomplete CDN downloads and infrastructure errors for the time period, and calculate a 5-minute epoch time window on each event.

Combine the events using the stats command with a split BY our epoch time window.

WOAH! Splunk just did my job for me! The two left columns represent a point in time where our CDN did not complete delivery to a customer; the right column represents the back-end infrastructure issues that occurred during the same time window.


Page 31

Relative Volume

Page 32

Relative Volume
We already discussed how we could use z-values to define abnormality, and this applies to data volumes as well; in this chapter, we will discuss other methods that can be used to identify abnormal volumes of data.

- z-values - these provide a normalized method of identifying outliers in data volumes
- Relative ratios - these can help us visually identify abnormal data volumes
- The cluster command - a search command that groups similar events and calculates quantity
  – Use the t=<0-1> parameter to determine sensitivity
  – Use match=(termlist | termset | ngramset)
- Abnormal behavior in our data can help us identify the right questions to ask

Page 33

z-value: 3-Week Volume vs. Today

    `infrastructure_data` earliest=-1mon@d latest=@h
    | bucket _time span=1h
    | eval day=if(_time >= relative_time(now(), "-7d@d"), "this", "past")
    | stats count AS c BY _time date_wday date_mday date_hour sourcetype day
    | eval c_tmp=c
    | eval c=if(day == "this", null(), c)
    | eventstats mean(c) AS m stdevp(c) AS sd BY date_wday date_hour sourcetype
    | rename c_tmp AS c
    | where (day == "this") AND (date_mday == tonumber(strftime(now(), "%d")))
    | eval z=(c - m) / sd
    | xyseries _time sourcetype z
    | eval Hour=strftime(_time, "%H")
    | fields - _time

This search calculates z-values for data source volumes, contrasting today against the rest of the past month.

NOTE: The last 7 days are excluded from the baseline calculations.

On the resulting chart, any value far outside the expected range represents a clear issue; remember that z-values are standardized, and according to the Empirical Rule, 99.7% of our data points should be within 3 standard deviations.


Page 35

Relative Ratios

Relative ratios can help to identify abnormal behavior in the volumes of our data sources. The following search exemplifies the magnitude of volume changes when plotted on a chart.

    `infrastructure_data` earliest=-1d@d latest=@h
    | where date_hour < tonumber(strftime(now(), "%H"))
    | eval day=if(_time < relative_time(now(), "@d"), "yesterday", "today")
    | stats count AS c BY date_hour sourcetype day
    | eval yesterday=if(day == "yesterday", c, 0)
    | eval today=if(day == "today", c, 0)
    | stats max(yesterday) AS yesterday max(today) AS today BY date_hour sourcetype
    | eval ratio=if(today >= yesterday, today / yesterday, -yesterday / today)
    | fields - yesterday today
    | rename date_hour AS Hour
    | xyseries Hour sourcetype ratio
    | sort + Hour

Columns above 0 represent the magnitude of additional volume for today, whereas columns below 0 represent the magnitude of yesterday's data compared to today.

Yesterday's volume was considerably higher than today's for a number of devices; this is a possible indicator that there was an outage today.


Page 37

The Cluster Command

The cluster command groups similar events and allows for a quick-and-dirty discovery of rare events; sorting by cluster_count ascending surfaces the rarest groups first.

    `infra_ops`
    | cluster t=.8 field=_raw match=termset
    | table _raw cluster_count
    | sort + cluster_count

Page 38

Summary
We learned how to:

- Identify abnormal human behavior
- Identify abnormal machine behavior and calculate dynamic thresholds
- Correlate abnormality across dissimilar data types using z-values
- Automatically correlate root cause to incidents

In summary, we learned:
- How to make Splunk think like a human
- How to find questions when our data is full of answers

Most importantly, we learned:
- Splunk is EASY
- Statistics is HARD, but it's EASY in Splunk!

Page 39

Next Steps

1. Download the .conf2013 Mobile App; if not iPhone, iPad, or Android, use the Web App.
2. Take the survey & WIN A PASS FOR .CONF2014… or one of these bags!
3. Go to "How to Use Dynamic Drilldown" - Nolita 2, Level 4 - Today, 3-4pm.

Page 40

THANK YOU

Page 41

Introduction: Statistics
Resistance to Outliers

median - the value separating the higher and lower halves of a sample
- If the set contains an even number of values, we normally average the two middle values

mode - the most common value in a sample

- Sample set: 1, 2, 2, 2, 3, 20
- median = 2, mode = 2, mean (average) = 5
  – Sometimes it makes sense to use the median in place of the mean in order to account for outliers. This can be important when the sample is small.

SIDE NOTE: mode can be important in retail analytics to understand things like what shirt/pant size is most common (restocking).
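A minimal SPL sketch of the retail idea (sourcetype and field names hypothetical), since stats computes mean, median, and mode directly:

    sourcetype=retail:sales
    | stats avg(pant_size) AS mean median(pant_size) AS median mode(pant_size) AS mode BY region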