Mining SQL Injection and Cross Site Scripting Vulnerabilities using Hybrid Program Analysis

Shar Lwin Khin, Tan Hee Beng Kuan (Information Engineering, Nanyang Technological University, Singapore)
Lionel Briand (Interdisciplinary Centre for ICT Security, Reliability, and Trust, University of Luxembourg, Luxembourg)


Presentation at ICSE 2013, San Francisco


Page 2: Motivation

• Increasing number of vulnerabilities
• Developers lack security awareness
• Manual vulnerability audit is effort intensive

Page 3: Related Work

Method                     Granularity  Accuracy  Scalability
Vuln. prediction           ×            √         √
Static taint analysis      √            ×         √
Static & dynamic analysis  √            √         ×
???                        √            √         √

Page 4: Problem Definition 1/2

• Input validation and sanitization are two common defense methods used in web applications
• Static attributes have been shown to be indicators of vulnerabilities, though not accurate enough
• Can we use static and dynamic attributes together, characterizing the implementations of these defense methods, as indicators?
• Use machine learning to predict vulnerabilities based on these attributes

Page 5: Problem Definition 2/2

• Typical prediction models are classification-based
    • Being supervised learning, their effectiveness depends on the availability of sufficient training data tagged with class labels
• Cluster analysis (CA) is a type of unsupervised learning method
• CA may be used if vulnerable instances can be distinguished from non-vulnerable instances based on the proposed attributes

Page 6: Vulnerability Distributions

[Figure: vulnerability distributions. © Web Hacking Incident Database]

Page 7: SQL Injection

• Cause: Inadequate validation and sanitization of user inputs used in queries

[Figure: Hacker sends input to login.php, which queries the Database; result: unauthorized user information (SQLI!)]

Query built in login.php:

    $q = "select * from user where name='" . $name . "' and pw='" . $pw . "'";

Attack input:

    $name = ' or 1=1 --

Resulting query:

    select * from user where name='' or 1=1--' and pw=''
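The attack on this slide can be reproduced in a few lines. The sketch below is not from the talk: it uses Python's `sqlite3` module with an in-memory database, and the table contents and function names (`make_db`, `login_unsafe`, `login_safe`) are assumptions made for illustration. It contrasts the concatenated query with a parameterized one.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a small, hypothetical user table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table user (name text, pw text)")
    conn.executemany("insert into user values (?, ?)",
                     [("alice", "a1"), ("bob", "b2")])
    return conn

def login_unsafe(conn, name, pw):
    # Same pattern as the slide: user input is concatenated into the query.
    q = "select * from user where name='" + name + "' and pw='" + pw + "'"
    return conn.execute(q).fetchall()

def login_safe(conn, name, pw):
    # Placeholders keep the input as data, so it cannot change the query.
    q = "select * from user where name=? and pw=?"
    return conn.execute(q, (name, pw)).fetchall()

conn = make_db()
attack = "' or 1=1 --"
print(len(login_unsafe(conn, attack, "")))  # number of rows leaked
print(len(login_safe(conn, attack, "")))    # number of rows matched
```

With the concatenated query the `--` comments out the password check and `1=1` matches every row, while the parameterized query treats the whole input as a literal name.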

Page 8: Cross Site Scripting

• Cause: No sanity check of input before it is used in HTML documents

[Figure: Hacker posts to travelerTip.php; Victim visits the page]

Inject script:

    <script>alert('xss!');</script>

Visit:

    http://travelingForum/travelerTip.php?Action=Post&Place=Greece&Tip=<Script>document.location='http://hackerSite/stealCookie.jsp?cookie='+document.cookie;</Script>

Injected script executed on victim's browser: XSS!
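As a minimal illustration of the missing sanity check, the sketch below renders a posted tip with and without escaping. It is an assumption-laden Python analogue, not code from the talk: the function names and the HTML template are hypothetical, and `html.escape` stands in for whatever sanitization the application would apply.

```python
import html

def render_tip_unsafe(tip):
    # Input is placed into the HTML document without any check.
    return "<p>Tip: " + tip + "</p>"

def render_tip_safe(tip):
    # html.escape turns <, >, & and quotes into HTML entities.
    return "<p>Tip: " + html.escape(tip) + "</p>"

payload = "<script>alert('xss!');</script>"
print(render_tip_unsafe(payload))
print(render_tip_safe(payload))
```

In the first output the browser would see a real script element; in the second the payload is displayed as text.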

Page 9: Vulnerability Prediction Principles 1/2

• Use hybrid code attributes to predict vulnerabilities
    • Based on both static and dynamic program analyses
• Input validation checks and sanitization operations are mainly based on string operations
    • e.g., preg_replace("/<script/", "", $data)
• Classify the types of string operations applied according to their potential effects on the inputs before their use in security-sensitive statements (sinks)
    • e.g., echo $data; mysql_query($data)
• Such validation checks and operations can be identified by analyzing data dependence graphs

Page 10: Vulnerability Prediction Principles 2/2

• Given the data dependence graph of a sink: by extracting the number of inputs, and the numbers and types of validation and sanitization functions from the graph, can we predict the sink's vulnerability?
• E.g., if a sink uses five different inputs, there should be at least five input validation or sanitization functions.

[Figure: data dependence graph of a sink]
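The counting idea can be sketched as a backward walk over a data dependence graph. The graph below, its node names, and the category assigned to each node are hypothetical; only the category labels (Client, Propagate, SQLI-sanitization) are borrowed from the attribute names used in the talk. This is a sketch of the principle, not the authors' extraction algorithm.

```python
from collections import Counter

# Hypothetical data dependence graph: node -> (category, nodes it depends on).
DDG = {
    "sink": ("sink", ["q"]),
    "q":    ("Propagate", ["name", "pw"]),
    "name": ("SQLI-sanitization", ["in1"]),
    "pw":   ("Propagate", ["in2"]),
    "in1":  ("Client", []),
    "in2":  ("Client", []),
}

def extract_attributes(ddg, sink):
    """Count node categories among all nodes the sink depends on."""
    counts, seen, stack = Counter(), set(), list(ddg[sink][1])
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        category, deps = ddg[node]
        counts[category] += 1
        stack.extend(deps)
    return counts

attrs = extract_attributes(DDG, "sink")
print(dict(attrs))
# Two client inputs but only one sanitization node: a hint, not a verdict.
print(attrs["SQLI-sanitization"] < attrs["Client"])
```

The resulting counts are the kind of values that would fill a sink's attribute vector; the final comparison mirrors the five-inputs example only as a rough heuristic.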

Page 11: Static and Dynamic Classification

• From the language built-in functions that have specific security purposes, the language operators, and the predefined language parameters used, a node is classified statically.
    • e.g., addslashes($input), $_GET, $a = $b . $c
• But it is classified dynamically if the node invokes user-defined functions or some built-in functions such as string replacement.
    • e.g., $sanitized = preg_replace("/<+/", "", $input)
• The function code is executed using a set of predefined test inputs, and the final values of the test input variables are searched for malicious characters.
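The dynamic step can be pictured as follows: run the function on test inputs and inspect what survives. The test inputs, the helper `removes_chars`, and the Python analogue of the `preg_replace` call are all assumptions for illustration; the slides do not list the actual test inputs or the characters searched for.

```python
import re

# Hypothetical test inputs; the actual test set is not given in the slides.
TEST_INPUTS = ["<script>alert(1)</script>", "' or 1=1 --", "<<img src=x>"]

def removes_chars(func, chars, inputs=TEST_INPUTS):
    """Run func on each test input; True if none of chars survive."""
    return all(not any(c in func(s) for c in chars) for s in inputs)

def strip_angle_open(s):
    # Python analogue of preg_replace("/<+/", "", $input)
    return re.sub(r"<+", "", s)

print(removes_chars(strip_angle_open, "<"))  # does '<' ever survive?
print(removes_chars(strip_angle_open, "'"))  # does a single quote survive?
```

A function that reliably removes a class of characters on all test inputs could then be counted under the matching dynamic attribute, while one that leaves them in place would not.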

Page 12: Hybrid Code Attributes

ID  Attribute Name      Description

Static attributes
1   Client              The number of nodes that access data from HTTP request parameters
2   File                The number of nodes that access data from files
3   Database            The number of nodes that access data from a database
4   Text-database       Boolean value ‘TRUE’ if there is any text-based data accessed from a database; ‘FALSE’ otherwise
5   Other-database      Boolean value ‘TRUE’ if there is any data except text-based data accessed from a database; ‘FALSE’ otherwise
6   Session             The number of nodes that access data from persistent data objects
7   Uninit              The number of nodes that reference uninitialized program variables
8   SQLI-sanitization   The number of nodes that apply standard sanitization functions for preventing SQLI issues
9   XSS-sanitization    The number of nodes that apply standard sanitization functions for preventing XSS issues
10  Numeric-casting     The number of nodes that type-cast data into a numeric type
11  Numeric-type-check  The number of nodes that perform a numeric data type check
12  Encoding            The number of nodes that encode data into a certain format
13  Un-taint            The number of nodes that return predefined information or information not influenced by external users
14  Boolean             The number of nodes which invoke functions that return a Boolean value
15  Propagate           The number of nodes that propagate the partial or complete value of an input

Dynamic attributes
16  Numeric             The number of nodes which invoke functions that return only numeric, mathematical, or dash characters
17  LimitLength         The number of nodes that invoke string-length limiting functions
18  URL                 The number of nodes that invoke path-filtering functions
19  EventHandler        The number of nodes that invoke event-handler filtering functions
20  HTMLTag             The number of nodes that invoke HTML-tag filtering functions
21  Delimiter           The number of nodes that invoke delimiter filtering functions
22  AlternateEncode     The number of nodes that invoke alternate-character-encoding filtering functions

Target attribute
23  Vulnerable?         Indicates a class label: Vulnerable or Not-Vulnerable

Page 13: Sample Attribute Vectors

• Each sink is represented by a 23-dimensional attribute vector.
• Sample attribute vectors (Session, XSS-sanit, Un-taint, Delimiter, Propagate, …, Vulnerable?):
    (2, 4, 0, 0, 2, …, Not-Vulnerable)
    (1, 0, 1, 1, 7, …, Vulnerable)

Page 14: Supervised Vulnerability Prediction

• Data preprocessing
    • Normalization
    • Principal Component Analysis
• Classifiers
    • Logistic Regression (regression analysis)
    • Multi-Layer Perceptron (neural network analysis)
• Training & testing: 10-fold cross validation
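To make the pipeline concrete, here is a dependency-free sketch of three of the listed steps: min-max normalization, logistic regression, and 10-fold cross validation. It is not the authors' setup: principal component analysis and the multi-layer perceptron are omitted, the training loop is plain stochastic gradient descent, and the two-attribute data set (number of inputs, number of sanitization nodes) is synthetic.

```python
import math
import random

def min_max_normalize(rows):
    """Scale every attribute column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def score(w, xi):
    # Weighted sum of the attributes plus a bias term (last weight).
    return sigmoid(sum(a * b for a, b in zip(w, xi + [1.0])))

def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = score(w, xi)
            for j, xj in enumerate(xi + [1.0]):
                w[j] += lr * (yi - p) * xj
    return w

def cross_validate(X, y, k=10, seed=0):
    """Return the fraction of correct predictions over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w = train_logistic([X[i] for i in train], [y[i] for i in train])
        correct += sum((score(w, X[i]) >= 0.5) == bool(y[i]) for i in fold)
    return correct / len(X)

# Synthetic sinks: few sanitization nodes -> vulnerable (1), many -> not (0).
rows, labels = [], []
for n_inputs in range(1, 6):
    for n_sanit, vulnerable in [(0, 1), (1, 1), (4, 0), (5, 0)]:
        rows.append([n_inputs, n_sanit])
        labels.append(vulnerable)

X = min_max_normalize(rows)
print(cross_validate(X, labels, k=10))
```

Each fold is held out once while the model is trained on the remaining nine, so every sink is predicted by a model that never saw it during training.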

Page 15: Unsupervised Vulnerability Prediction

• Use the same data preprocessing activities as the supervised models
• K-means cluster analysis based on two assumptions
    • non-vulnerable sinks are much more frequent than vulnerable sinks
    • vulnerable sinks have different characteristics from non-vulnerable sinks
• Label clusters as Vulnerable or Non-Vulnerable:
    • K=4: Maximum number of clusters
    • %Normal=12: Minimum size of non-vulnerable cluster
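One plausible reading of the labeling step is that a cluster holding at least %Normal percent of all sinks is treated as non-vulnerable and any smaller cluster is flagged as vulnerable. The sketch below follows that reading, which is an assumption; the slides do not spell out the rule. The data points, the explicit initial centers, and the use of two clusters (the slide gives K=4 as a maximum) are also chosen only for illustration.

```python
from collections import Counter

def kmeans(points, centers, iters=10):
    """Plain k-means from the given initial centers; returns cluster indices."""
    centers = [list(c) for c in centers]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            assign[i] = dists.index(min(dists))
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def label_clusters(assign, pct_normal=12):
    """Flag sinks in clusters smaller than pct_normal % of all sinks."""
    sizes, n = Counter(assign), len(assign)
    return ["Not-Vulnerable" if sizes[a] * 100 >= pct_normal * n
            else "Vulnerable" for a in assign]

# Synthetic, already normalized sinks: 18 similar ones and 2 outliers.
points = [[1.0, 1.0]] * 9 + [[0.9, 1.0]] * 9 + [[0.0, 0.0]] * 2
assign = kmeans(points, centers=[points[0], points[-1]])
labels = label_clusters(assign, pct_normal=12)
print(labels.count("Vulnerable"))
```

The rule only works when the first assumption on the slide holds: if vulnerable sinks are common, they form large clusters and are labeled non-vulnerable.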

Page 16: Case Study

• Six open source web applications (PHP):
    • Known vulnerable
    • Functionalities: school admin, forum, news, content, database management
    • Sizes: from 2k to 44k LOC
• Vulnerability identification: manual & vuln. databases (Bugtraq, CVE)

Page 17: Prototype Tool

[Figure: Architecture of PhpMiner; labeled component: Weka]

Page 18: Experiment & Result 1/2

Classification results of predictors built from hybrid attributes (measures in %).

Data              Classifier  Recall  False alarm  Precision
schmate-html      LR          99      3            98
                  MLP         99      0            100
faqforge-html     LR          89      5            94
                  MLP         91      5            94
utopia-html       LR          94      1            94
                  MLP         94      2            89
phorum-html       LR          78      1            70
                  MLP         33      0            100
cutesite-html     LR          68      9            61
                  MLP         78      8            67
myadmin-html      LR          85      1            89
                  MLP         75      1            83
Average (XSS)     LR          86      3            84
                  MLP         78      3            89
schmate-sql       LR          97      8            98
                  MLP         96      35           92
faqforge-sql      LR          88      4            94
                  MLP         88      4            94
phorum-sql        LR          100     3            63
                  MLP         0       1            0
cutesite-sql      LR          91      14           89
                  MLP         89      18           86
Average (SQLI)    LR          94      7            86
                  MLP         68      15           68
Overall average   LR          90      5            85
                  MLP         74      8            81

• LR performs better than MLP
• Maximum analysis time: 2 hours, average ½ hour
• Accuracy
    • Shin et al. (TSE'11) [3] achieved recall > 80 and pf < 25
    • Pixy (S&P'06) [1] reported pf > 20. Too many false positives!
    • Ardilla (ICSE'09) [4] reported up to 50% of paths left unexplored.... False negatives?
    • Our result: recall = 90, pf = 5
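For readers unfamiliar with the three measures, the sketch below computes them from confusion-matrix counts using their standard definitions: recall = TP / (TP + FN), false alarm (pf) = FP / (FP + TN), precision = TP / (TP + FP). The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def measures(tp, fp, tn, fn):
    """Return (recall, false alarm, precision) in percent; None if undefined."""
    def pct(num, den):
        return None if den == 0 else 100.0 * num / den
    recall = pct(tp, tp + fn)        # share of vulnerable sinks that were found
    false_alarm = pct(fp, fp + tn)   # share of safe sinks that were flagged
    precision = pct(tp, tp + fp)     # share of flagged sinks that are vulnerable
    return recall, false_alarm, precision

# Hypothetical counts for one data set.
print(measures(tp=8, fp=2, tn=38, fn=2))
# A predictor that flags nothing has no defined precision.
print(measures(tp=0, fp=0, tn=10, fn=5))
```

Precision has no value when a predictor flags no sink at all, because the denominator TP + FP is zero.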

Page 19: Experiment & Result 2/2

k-means clustering analysis results on the datasets which have < 40% vulnerable sinks (measures in %):

Data           Recall  False alarm  Precision
utopia-html    100     13           65
phorum-html    56      11           16
cutesite-html  70      20           41
myadmin-html   55      8            33
phorum-sql     100     7            38
Average        76      12           39

k-means clustering analysis results on the datasets which have ≥ 40% vulnerable sinks (measures in %):

Data           Recall  False alarm  Precision
schmate-html   9       0            100
faqforge-html  26      0            100
schmate-sql    3       32           29
faqforge-sql   0       0            undefined
cutesite-sql   0       0            undefined
Average        8       6            undefined

When assumptions are not met, clustering does not work!

Page 20: Limitations

• Supervised learning requires sufficient labeled data for training
• Unsupervised learning relies on some assumptions, which are not always true: Applicable for most commercial systems?
• For unsupervised learning, tuning the parameters is required:
    • K: Maximum number of clusters
    • %Normal: Minimum size of non-vulnerable cluster

 

Page 21: Conclusion

• Supports security auditing by providing probabilistic alerts about vulnerable code statements.
• Proposes hybrid (static and dynamic) code attributes for vulnerability prediction using machine learning
• Attributes characterize common input validation and sanitization code patterns, without expensive analysis
• Scalability: < 2 hours on a regular PC
• Both supervised learning and unsupervised learning methods were used
    • Supervised learning accuracy: 90% recall, 85% precision
    • Unsupervised learning: Lower accuracy, applicability?

Page 22: Future Work

• Semi-supervised learning
• Combining data dependency information with control dependency information
• Address other types of similar vulnerabilities by considering other types of code patterns

Page 23: The End!

Thank You!

Questions?

http://sharlwinkhin.com

Page 24: References

1. N. Jovanovic, C. Kruegel, and E. Kirda, “Pixy: a static analysis tool for detecting web application vulnerabilities,” in IEEE Symposium on Security and Privacy, 2006, pp. 258-263.
2. D. Balzarotti et al., “Saner: composing static and dynamic analysis to validate sanitization in web applications,” in IEEE Symposium on Security and Privacy, 2008, pp. 387-401.
3. Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities,” IEEE Transactions on Software Engineering, vol. 37 (6), pp. 772-787, 2011.
4. A. Kieżun, P. J. Guo, K. Jayaraman, and M. D. Ernst, “Automatic creation of SQL injection and cross-site scripting attacks,” in Proceedings of the 31st International Conference on Software Engineering, Vancouver, BC, 2009, pp. 199-209.
5. RSnake. http://ha.ckers.org, accessed March 2012.
6. I. H. Witten and E. Frank, Data Mining, 2nd ed., Morgan Kaufmann, 2005.