Page 1: Reporter: Jing Chiu; Advisor: Yuh-Jye Lee; Email: D9815013@mail.ntust.edu.tw; 2011/3/17; Data Mining and Machine Learning Lab.

Page 2:

Authors: Anh Le, Athina Markopoulou (University of California, Irvine); Michalis Faloutsos (University of California, Riverside)

Source: to appear in IEEE INFOCOM 2011 Mini Conference, Shanghai, China, April 10-15, 2011. (poster, tech report)

Page 3:

Introduction
Dataset and Feature Extraction
Classification Algorithms
Evaluation Results
System Deployment
Conclusion

Page 4:

“How well can one detect phishing URLs using only lexical features compared to using full features?”

PhishDef Properties:
  High accuracy: 96%-97%
  Light-weight: low latency; imposes only a modest overhead
  Proactive approach: as opposed to reactively relying on blacklists
  Resilience to noise: 95%-86% accuracy when 5%-45% of the training data is noisy

Page 5:

Dataset
  Malicious URLs: PhishTank, MalwarePatrol
  Legitimate URLs: Yahoo Directory, Open Directory (DMOZ)
External Feature Collection
  WHOIS
  Team Cymru

Page 6:

Feature Extraction
Automatically selected features
  Delimiters: '/', '?', '.', '=', '_', '&' and '-'
  Four parts: Domain Name, Directory, File Name, Argument
Obfuscation-resistant lexical features
  Four different URL obfuscation techniques
  Five categories of hand-selected lexical features
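The automatic feature selection above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`split_url`, `tokenize`, `bag_of_words`) and the (part, token) encoding are assumptions; only the delimiter set and the four URL parts come from the slide.

```python
import re
from urllib.parse import urlsplit

# Delimiters listed on the slide: / ? . = _ & -
DELIMITERS = re.compile(r"[/?.=_&-]")


def split_url(url):
    """Split a URL into domain name, directory, file name and argument."""
    # Assumption: URLs without a scheme are treated as http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.netloc,
        "directory": directory,
        "file": file_name,
        "argument": parts.query,
    }


def tokenize(text):
    """Break one URL part into non-empty tokens on the delimiters."""
    return [tok for tok in DELIMITERS.split(text) if tok]


def bag_of_words(url):
    """Return the set of (part, token) pairs found in a URL."""
    return {
        (part, tok)
        for part, text in split_url(url).items()
        for tok in tokenize(text)
    }
```

Keeping the part name next to each token lets a classifier treat "login" in the file name differently from "login" in the domain name.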

Page 7:

(I) Obfuscating the host with an IP address

(II) Obfuscating the host with another domain

(III) Obfuscating with large host names

(IV) Domain unknown or misspelled

Page 8:

Features related to the full URL
  Length of the URL (Type II)
  Number of dots in the URL (Type II)
  Blacklisted words (Type IV): confirm, account, banking, secure, ebayisapi, webscr, login and signin; paypal, free, lucky and bonus

Features related to the domain name
  Length of the domain name (Type III)
  IP or port number is used in the domain name (Type I)
  Number of tokens of the domain name (Type III)
  Number of hyphens used in the domain name (Type III)
  The length of the longest token (Type III)

Features related to the directory
  Length of the directory (Type II)
  Number of sub-directory tokens (Type II)
  Length of the longest sub-directory token (Type II)
  Maximum number of dots and other delimiters used in a sub-directory token (Type II)
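A rough sketch of how the full-URL, domain-name and directory features listed above could be computed. The function name, the dictionary keys and the exact tokenization rules (domain split on dots, directory split on slashes, substring matching for blacklisted words) are assumptions made for illustration. The sketch also assumes the URL includes its scheme.

```python
import re
from urllib.parse import urlsplit

BLACKLIST = ("confirm", "account", "banking", "secure", "ebayisapi", "webscr",
             "login", "signin", "paypal", "free", "lucky", "bonus")
DELIMS = re.compile(r"[/?.=_&-]")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_domain_directory_features(url):
    """Compute hand-selected lexical features for one URL (with scheme)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    directory = parts.path.rpartition("/")[0]
    host_tokens = [t for t in host.split(".") if t]
    dir_tokens = [t for t in directory.split("/") if t]
    lower = url.lower()
    return {
        # Full URL
        "url_length": len(url),
        "url_dots": url.count("."),
        "blacklisted_words": sum(word in lower for word in BLACKLIST),
        # Domain name
        "domain_length": len(host),
        "domain_is_ip_or_has_port": bool(IPV4.match(host)) or parts.port is not None,
        "domain_tokens": len(host_tokens),
        "domain_hyphens": host.count("-"),
        "domain_longest_token": max(map(len, host_tokens), default=0),
        # Directory
        "dir_length": len(directory),
        "dir_tokens": len(dir_tokens),
        "dir_longest_token": max(map(len, dir_tokens), default=0),
        "dir_max_delims": max((len(DELIMS.findall(t)) for t in dir_tokens), default=0),
    }
```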

Page 9:

Features related to the file name
  Length of the file name (Type II)
  Number of dots and other delimiters used in the file name (Type II)

Features related to the argument part
  Length of the argument part
  Number of variables
  Length of the longest variable value
  The maximum number of delimiters used in a value
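The file-name and argument features can be sketched in the same spirit. The function name and dictionary keys are hypothetical, and the sketch assumes that "variables" means `name=value` pairs separated by `&`, and that the URL includes its scheme.

```python
import re
from urllib.parse import urlsplit

DELIMS = re.compile(r"[/?.=_&-]")


def file_and_argument_features(url):
    """Compute file-name and argument features for one URL (with scheme)."""
    parts = urlsplit(url)
    file_name = parts.path.rpartition("/")[2]
    # Split the raw query by hand so percent-encoding does not change lengths.
    pairs = [p.partition("=") for p in parts.query.split("&") if p]
    values = [value for _, _, value in pairs]
    return {
        "file_length": len(file_name),
        "file_delims": len(DELIMS.findall(file_name)),
        "arg_length": len(parts.query),
        "arg_variables": len(pairs),
        "arg_longest_value": max(map(len, values), default=0),
        "arg_max_value_delims": max((len(DELIMS.findall(v)) for v in values), default=0),
    }
```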

Summary of dataset

Page 10:

Batch Learning
  Support Vector Machine (SVM)
Online Learning
  Online Perceptron (OP)
  Confidence Weighted (CW)
  Adaptive Regularization of Weights (AROW)
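For readers unfamiliar with AROW, the sketch below follows the standard published update rule with a full covariance matrix, written in pure Python. It is an illustration under assumptions (dense feature vectors, labels +1/-1, regularization parameter `r`), not the implementation used in PhishDef.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class AROW:
    """Binary AROW learner; labels are +1 or -1."""

    def __init__(self, dim, r=1.0):
        self.w = [0.0] * dim  # mean weight vector
        # Covariance starts as the identity matrix.
        self.sigma = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.r = r  # regularization parameter

    def predict(self, x):
        return 1 if dot(self.w, x) >= 0 else -1

    def update(self, x, y):
        margin = dot(self.w, x)
        if margin * y >= 1:
            return  # margin already large enough, no update
        sx = [dot(row, x) for row in self.sigma]  # Sigma x
        v = dot(x, sx)                            # confidence x^T Sigma x
        beta = 1.0 / (v + self.r)
        alpha = (1.0 - y * margin) * beta
        # Mean update: w += alpha * y * Sigma x
        self.w = [wi + alpha * y * s for wi, s in zip(self.w, sx)]
        # Covariance update: Sigma -= beta * (Sigma x)(Sigma x)^T
        self.sigma = [[sij - beta * sx[i] * sx[j] for j, sij in enumerate(row)]
                      for i, row in enumerate(self.sigma)]
```

Because the covariance shrinks along directions that have been seen often, later (possibly mislabeled) examples move well-established weights less, which is the intuition behind its tolerance to noise.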

Page 11:

Batch-based vs. Online algorithms: SVM vs. AROW (Yahoo-Phish)
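Online learners are commonly scored by cumulative error rate: each URL is classified before its label is revealed, and the model is updated afterwards. The sketch below assumes that protocol (the slide does not state it) and uses a plain perceptron as a stand-in learner; the class and function names are hypothetical.

```python
class Perceptron:
    """Minimal online perceptron; labels are +1 or -1."""

    def __init__(self, dim):
        self.w = [0.0] * dim

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score >= 0 else -1

    def update(self, x, y):
        # Update the weights only when the current prediction is wrong.
        if self.predict(x) != y:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]


def cumulative_error_rates(learner, stream):
    """Predict first, then update; return the running error rate."""
    mistakes, rates = 0, []
    for n, (x, y) in enumerate(stream, start=1):
        if learner.predict(x) != y:
            mistakes += 1
        learner.update(x, y)
        rates.append(mistakes / n)
    return rates
```

A batch learner such as an SVM would instead be retrained on a collected window of URLs and scored on the URLs that follow it.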

Page 12:

Lexical Features vs. Full Features: OP, CW and AROW (Yahoo-Phish)

Page 13:

Obfuscation-Resistant Lexical Features: performance of AROW with/without OR features after the last URL

Page 14:

The resilience of AROW to noisy data: AROW and CW (Yahoo-Phish)
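One way to reproduce a noise experiment of this kind is to flip a fixed fraction of the training labels before training. The sketch assumes that "noise" means mislabeled training URLs; the function name and the seeding scheme are hypothetical.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of +1/-1 labels with a noise_rate fraction flipped."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [-y if i in flipped else y for i, y in enumerate(labels)]
```

Sweeping `noise_rate` from 0.05 to 0.45 and retraining each time would give an accuracy-versus-noise curve for each learner.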

Page 15:

Minimum/Maximum URL Similarity Distance distribution

Page 17:

Proposed PhishDef, a proactive defense scheme against phishing attacks

PhishDef detects phishing URLs on-the-fly

PhishDef uses only lexical features
  High accuracy (97%)
  Low overhead
  Resilient to noisy training data
  Firefox and Chrome add-on implementations

Page 18:

Q&A?
