Project Presentation CIS 660 Ankur · eecs.csuohio.edu/.../Project_Presentation_CIS_660_Ankur.pdf

ANOMALY DETECTION ON MACHINE LOG
Data Mining
Prof. Sunnie S Chung
Ankur Pandit | 2619650

Transcript of Project_Presentation_CIS_660_Ankur.pdf · 2015-12-05



Raw Data:

NASA HTTP access logs: two months of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

Format:

The logs are an ASCII file with one line per request, with the following columns:

1. Host making the request: a hostname when possible, otherwise the Internet address if the name could not be looked up.
2. Timestamp in the format "DAY MON DD HH:MM:SS YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400. (The sample lines below use the bracketed form [DD/Mon/YYYY:HH:MM:SS -0400].)
3. Request, given in quotes.
4. HTTP reply code.
5. Bytes in the reply.

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985

Total Number of Records: 1.8 million
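As an illustration of the column layout, one log line can be split into its fields with a regular expression. This is only a sketch: the class name LogLineParser and the pattern are assumptions made for this example and are not taken from the project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the project's code: splits one access-log line into its fields.
class LogLineParser {
    // host, two unused "-" fields, [timestamp zone], "request", reply code, bytes
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]\\s]+) ([^\\]]+)\\] \"(.*)\" (\\d{3}) (\\S+)$");

    /** Returns {host, timestamp, zone, request, status, bytes}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(4), m.group(5), m.group(6), m.group(7), m.group(8)
        };
    }

    public static void main(String[] args) {
        String[] fields = parse(
            "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245");
        System.out.println(String.join(" | ", fields));
    }
}
```

The bytes field is matched as any non-space token rather than digits only, so a line whose byte count is not numeric would still be split instead of rejected.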


Data Cleaning:

- For convenience, the space-separated logs were converted into a CSV file.
- A simple Java program was used for the conversion. (The link can be found in the References section.)
- Special characters were removed by the program:
  o double quotes (")
  o comma (,)
  o square brackets ([])
- 199.72.81.55,-,-,01/Jul/1995:00:00:01,-0400,GET,/history/apollo/,HTTP/1.0,200,6245
- unicomp6.unicomp.net,-,-,01/Jul/1995:00:00:06,-0400,GET,/shuttle/countdown/,HTTP/1.0,200,3985
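The conversion described above can be sketched in a few lines of Java. This is an assumed reconstruction from the description on this slide, not the project's converter (which is linked in the references); the class and method names are made up for the example.

```java
// Hypothetical sketch of the cleaning step; names are illustrative only.
class LogToCsv {
    /** Removes double quotes, commas and square brackets, then turns runs of spaces into commas. */
    static String toCsvRow(String logLine) {
        // Drop the special characters listed on the slide: " , [ ]
        String cleaned = logLine.replaceAll("[\"\\[\\],]", "");
        // Split on whitespace and re-join the fields with commas.
        return String.join(",", cleaned.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow(
            "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245"));
    }
}
```

Run on the first sample log line, this yields the first CSV row shown above. Because the quotes are dropped before splitting, the request is spread over three columns (method, resource, protocol), which matches the ten-column rows on the slide.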


Importing data in R:

- Set up the working directory first using the setwd() command.
- Import the CSV data using read.csv().
- Make sure you set header = TRUE, since the headers are needed to access the columns by name.

 

                                               


Outlier Detection:

- Once we have imported the data, we can start detecting outliers.
- Cluster plot for the entire imported data:
- clusplot(data, data$col10, color=TRUE, shade=TRUE, labels=2, lines=0)

   

           


   

- For sample data containing only two columns: IP address and number of bytes received.

- These graphs show that some outliers are present, but they do not tell us exactly which values are the outliers. An algorithm must therefore be applied to find them.

             


Grubbs Test:

- Performs the Grubbs test to detect whether the sample data set contains one outlier.
- The test is based on calculating the outlier score G (the suspect value minus the mean, divided by the standard deviation) and comparing it to the appropriate critical values.
- Usage: grubbs.test(<data_set_name>)
- Expects a numeric vector as input.

- Perform the Grubbs test on the highest and lowest values together (two outliers on opposite tails).
- Usage: grubbs.test(<data_set_name>, type=11)

- There is another type available (type=20, two outliers in the same tail), but it can be used only when the data set contains fewer than 30 rows.
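To make the outlier score G concrete, the arithmetic can be sketched in Java. This is an illustrative assumption of how G is computed from the definition above (largest absolute deviation from the mean, divided by the sample standard deviation); it does not reproduce the critical values or p-value that grubbs.test() reports, and the class name is hypothetical.

```java
// Hypothetical sketch: computes only the Grubbs statistic G, not the test decision.
class GrubbsStatistic {
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double sampleSd(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return Math.sqrt(ss / (x.length - 1));
    }

    /** G = max |x_i - mean| / sd */
    static double g(double[] x) {
        double m = mean(x);
        double maxDev = 0.0;
        for (double v : x) {
            maxDev = Math.max(maxDev, Math.abs(v - m));
        }
        return maxDev / sampleSd(x);
    }

    public static void main(String[] args) {
        double[] bytes = {10, 12, 11, 13, 12, 50}; // made-up values for illustration
        System.out.println("G = " + g(bytes));
    }
}
```

A larger G means the most extreme value sits further from the rest of the sample; the test itself then compares G with a critical value that depends on the sample size.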

                 


Chi Square Test:

- This function performs a simple test for one outlier, based on the chi-squared distribution of squared differences between the data and the sample mean.
- Usage: chisq.out.test(<data_set_name>): gives the outlier with the highest value.
- Usage: chisq.out.test(<data_set_name>, opposite=TRUE): gives the outlier with the lowest value.

Outlier Test:

- Finds the value with the largest difference between it and the sample mean, which can be an outlier.
- Usage: outlier(<data_set_name>): gives the outlier with the highest value.
- Usage: outlier(<data_set_name>, opposite=TRUE): gives the outlier with the lowest value.
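Both the chi-square test and the outlier test start from the value that lies farthest from the sample mean. A minimal Java sketch of that idea follows, assuming that opposite=TRUE selects the extreme at the other end and that the sample variance is used as the variance estimate. It is not the R package's code, it computes no p-value, and all names are hypothetical.

```java
// Hypothetical sketch of the "farthest from the mean" idea and the chi-square statistic.
class ExtremeValue {
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sample variance (n - 1 in the denominator). */
    static double sampleVariance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return ss / (x.length - 1);
    }

    /** Returns the max or the min, whichever is farther from the mean; opposite=true returns the other end. */
    static double farthestFromMean(double[] x, boolean opposite) {
        double min = x[0];
        double max = x[0];
        for (double v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double m = mean(x);
        boolean pickMin = (max - m) < (m - min);
        if (opposite) {
            pickMin = !pickMin;
        }
        return pickMin ? min : max;
    }

    /** (suspect value - mean)^2 / variance */
    static double chiSquareStatistic(double[] x, boolean opposite) {
        double d = farthestFromMean(x, opposite) - mean(x);
        return d * d / sampleVariance(x);
    }

    public static void main(String[] args) {
        double[] bytes = {1, 2, 3, 10}; // made-up values for illustration
        System.out.println(farthestFromMean(bytes, false));   // 10.0, the high extreme
        System.out.println(farthestFromMean(bytes, true));    // 1.0, the low extreme
        System.out.println(chiSquareStatistic(bytes, false));
    }
}
```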

       


Limitations:

- The tests do not work well with a complex data set (more than two columns).
- They do not return other information about an outlier, such as the requester's IP, the resource accessed, or the date and time when the request was made.
- Problems with large data sets.
- Just by calling the functions we learn nothing about how the algorithm works, which gives us less control over the output.

Using Custom Java Program:

- Uses the z-score to detect outliers.
- Uses the difference between the value and the mean of the data set.
- The difference is compared with the standard deviation to find the outliers.
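The z-score approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's program (which is linked in the references): the class name, the choice of the sample standard deviation, and the threshold parameter are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of z-score outlier detection; names and threshold are illustrative.
class ZScoreOutliers {
    /** Returns the indices of the values whose |z| exceeds the threshold. */
    static List<Integer> findOutliers(double[] x, double threshold) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / x.length;

        double ss = 0.0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(ss / (x.length - 1)); // sample standard deviation

        List<Integer> outliers = new ArrayList<>();
        if (sd == 0.0) {
            return outliers; // all values identical, nothing can stand out
        }
        for (int i = 0; i < x.length; i++) {
            double z = (x[i] - mean) / sd; // difference from the mean in units of sd
            if (Math.abs(z) > threshold) {
                outliers.add(i);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        double[] bytes = {10, 10, 10, 10, 10, 10, 10, 10, 10, 100}; // made-up values
        System.out.println(findOutliers(bytes, 2.0)); // prints [9]
    }
}
```

Returning row indices rather than values is a deliberate choice in this sketch: an index can be used to look up the full log record (IP, resource, date and time), which is the context the R functions did not provide.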

   

                 


Output of Program:

   

   

Lessons Learned:

- Data mining pipeline: data gathering, preprocessing and analysis.
- Various outlier detection techniques and algorithms.
- Using R for outlier detection.
- Implementing an outlier detection algorithm.

     


     

       

Thank you


References:

1. http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
2. https://github.com/Ankur-Pandit/CSVConverter
3. https://cran.r-project.org/web/packages/outliers/outliers.pdf
4. https://www.siam.org/meetings/sdm10/tutorial3.pdf
5. https://github.com/Ankur-Pandit/OutlierDetection