Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
2011/3/17, Data Mining and Machine Learning Lab.
Authors: Anh Le, Athina Markopoulou (University of California, Irvine); Michalis Faloutsos (University of California, Riverside)
Source: to appear in IEEE INFOCOM 2011 Mini Conference, Shanghai, China, April 10-15, 2011. (poster, tech report)
Introduction
Dataset and Feature Extraction
Classification Algorithms
Evaluation Results
System Deployment
Conclusion
“How well can one detect phishing URLs using only lexical features compared to using full features?”
PhishDef Properties:
  High accuracy: 96%-97%
  Light-weight: low latency, imposes only a modest overhead
  Proactive approach, as opposed to reactively relying on blacklists
  Resilience to noise: accuracy falls only from 95% to 86% as noise rises from 5% to 45%
Dataset
  Malicious URLs: PhishTank, MalwarePatrol
  Legitimate URLs: Yahoo Directory, Open Directory (DMOZ)
External Feature Collection
  WHOIS
  Team Cymru
Feature Extraction
  Automatically selected features
    Delimiters: '/', '?', '.', '=', '_', '&' and '-'
    Four parts: Domain Name, Directory, File Name, Argument
  Obfuscation-resistant lexical features
    Four different URL obfuscation techniques
    Five categories of hand-selected lexical features
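The automatically selected features can be illustrated with a minimal sketch. It assumes each URL carries a scheme (so the standard library can locate the host), that the four parts map onto host, path directory, last path segment and query string, and that each token becomes a binary "part:token" feature. The function names and the feature-key format are hypothetical; the slides give only the delimiters and the four parts.

```python
import re
from urllib.parse import urlsplit

# The seven delimiters listed on the slide.
DELIMITER_RE = re.compile(r"[/?.=_&\-]")


def split_url(url):
    """Split a URL into the four parts named on the slide.

    Assumption: the URL includes a scheme such as http://, otherwise
    urlsplit cannot tell the host apart from the path.
    """
    parts = urlsplit(url)
    # Everything before the last '/' is the directory, the rest the file name.
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.netloc,
        "directory": directory,
        "file": file_name,
        "argument": parts.query,
    }


def tokenize(text):
    """Break a URL part into tokens at the delimiters, dropping empty tokens."""
    return [token for token in DELIMITER_RE.split(text) if token]


def bag_of_words(url):
    """Binary features keyed by (part, token); the key format is illustrative."""
    features = {}
    for part, text in split_url(url).items():
        for token in tokenize(text):
            features[f"{part}:{token}"] = 1.0
    return features


example = "http://www.example.com/a/b/login.php?user=bob&id=1"
parts = split_url(example)
features = bag_of_words(example)
```

Keeping the part name in the feature key means that the same token, for example "login", is treated differently in the domain name than in the file name.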
(I) Obfuscating the host with an IP address
(II) Obfuscating the host with another domain
(III) Obfuscating with large host names
(IV) Domain unknown or misspelled
Features related to the full URL
  Length of the URL (Type II)
  Number of dots in the URL (Type II)
  Blacklisted words (Type IV): confirm, account, banking, secure, ebayisapi, webscr, login and signin; Paypal, free, lucky and bonus
Features related to the domain name
  Length of the domain name (Type III)
  IP or port number is used in the domain name (Type I)
  Number of tokens of the domain name (Type III)
  Number of hyphens used in the domain name (Type III)
  The length of the longest token (Type III)
Features related to the directory
  Length of the directory (Type II)
  Number of sub-directory tokens (Type II)
  Length of the longest sub-directory token (Type II)
  Maximum number of dots and other delimiters used in a sub-directory token (Type II)
Features related to the file name
  Length of the file name (Type II)
  Number of dots and other delimiters used in the file name (Type II)
Features related to the argument part
  Length of the argument part
  Number of variables
  Length of the longest variable value
  The maximum number of delimiters used in a value
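The five categories of hand-selected features could be computed roughly as follows. This is a sketch under assumptions, not the authors' extractor: the feature names are hypothetical, blacklisted words are matched as case-insensitive substrings of the whole URL, an IP host is recognized only in dotted IPv4 form, and URLs are assumed to carry a scheme.

```python
import re
from urllib.parse import urlsplit

# Word list and delimiters as given on the slides.
BLACKLIST = ("confirm", "account", "banking", "secure", "ebayisapi", "webscr",
             "login", "signin", "paypal", "free", "lucky", "bonus")
DELIMITERS = "/?.=_&-"


def count_delimiters(text):
    """Total number of delimiter characters in a string."""
    return sum(text.count(d) for d in DELIMITERS)


def lexical_features(url):
    """Hand-selected lexical features; every key name is illustrative."""
    parts = urlsplit(url)
    domain = parts.netloc
    host = parts.hostname or ""
    directory, _, file_name = parts.path.rpartition("/")
    query = parts.query

    domain_tokens = [t for t in host.split(".") if t]
    subdirs = [t for t in directory.split("/") if t]
    variables = [v for v in query.split("&") if v]
    values = [v.partition("=")[2] for v in variables]

    # Assumption: only dotted IPv4 hosts and explicit numeric ports are detected.
    has_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    has_port = re.search(r":\d+$", domain) is not None
    lowered = url.lower()

    return {
        # Full URL
        "url_length": len(url),
        "url_dots": url.count("."),
        "blacklisted_words": sum(word in lowered for word in BLACKLIST),
        # Domain name
        "domain_length": len(domain),
        "domain_has_ip_or_port": int(has_ip or has_port),
        "domain_tokens": len(domain_tokens),
        "domain_hyphens": host.count("-"),
        "domain_longest_token": max(map(len, domain_tokens), default=0),
        # Directory
        "dir_length": len(directory),
        "dir_tokens": len(subdirs),
        "dir_longest_token": max(map(len, subdirs), default=0),
        "dir_max_delims": max(map(count_delimiters, subdirs), default=0),
        # File name
        "file_length": len(file_name),
        "file_delims": count_delimiters(file_name),
        # Argument part
        "arg_length": len(query),
        "arg_variables": len(variables),
        "arg_longest_value": max(map(len, values), default=0),
        "arg_max_value_delims": max(map(count_delimiters, values), default=0),
    }


url = "http://192.168.0.1:8080/paypal.com/secure/login.php?user=bob&token=a.b-c"
f = lexical_features(url)
```

In this made-up example the host is an IP address (Type I) and a well-known brand name appears in the directory rather than in the host (Type II), so both the domain and the directory features carry signal.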
Summary of dataset
Batch Learning
  Support Vector Machine (SVM)
Online Learning
  Online Perceptron (OP)
  Confidence Weighted (CW)
  Adaptive Regularization of Weights (AROW)
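To make the online learners concrete, here is a minimal sketch of two of them on sparse feature dictionaries, with labels +1 (phishing) and -1 (legitimate). The perceptron is the textbook mistake-driven update. The AROW class keeps only per-feature variances, a diagonal simplification chosen here for brevity; whether the authors used full or diagonal covariance, and the value of the regularization parameter r, are not stated on the slides. The toy feature names are made up.

```python
class OnlinePerceptron:
    """Mistake-driven linear classifier over sparse feature dicts."""

    def __init__(self):
        self.w = {}

    def score(self, x):
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        # Update only when the example is misclassified (or on the boundary).
        if y * self.score(x) <= 0:
            for k, v in x.items():
                self.w[k] = self.w.get(k, 0.0) + y * v


class DiagonalAROW(OnlinePerceptron):
    """AROW with a diagonal covariance (a simplifying assumption)."""

    def __init__(self, r=1.0):
        super().__init__()
        self.r = r        # regularization parameter; 1.0 is an arbitrary default
        self.sigma = {}   # per-feature variance, implicitly 1.0 at the start

    def update(self, x, y):
        margin = self.score(x)
        if y * margin >= 1:
            return  # margin already satisfied, no update
        # Confidence of the prediction: x^T Sigma x under the diagonal Sigma.
        v = sum(self.sigma.get(k, 1.0) * val * val for k, val in x.items())
        beta = 1.0 / (v + self.r)
        alpha = (1.0 - y * margin) * beta
        for k, val in x.items():
            s = self.sigma.get(k, 1.0)
            # Features with high variance (rarely seen) move the most.
            self.w[k] = self.w.get(k, 0.0) + alpha * y * s * val
            # Shrink the variance of every feature that was just observed.
            self.sigma[k] = s - beta * (s * val) ** 2


# Toy stream with hypothetical features.
stream = [
    ({"has_ip": 1.0, "tok:login": 1.0}, 1),
    ({"tok:news": 1.0}, -1),
    ({"tok:login": 1.0}, 1),
    ({"tok:news": 1.0, "tok:sports": 1.0}, -1),
]
op, arow = OnlinePerceptron(), DiagonalAROW(r=1.0)
for x, y in stream:
    op.update(x, y)
    arow.update(x, y)
```

Unlike the perceptron, AROW also updates on correctly classified examples whose margin is below 1, and its step size shrinks for features it has already seen often.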
Batch-based vs. Online algorithms
  SVM vs. AROW
  Yahoo-Phish
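One common way to score an online learner, in contrast to a batch train/test split, is to predict each URL before learning from it and track the cumulative error rate. The slides do not spell out the evaluation protocol, so the sketch below is an assumption; the function name, its signature and the toy "repeat the last label" learner are hypothetical.

```python
def cumulative_error_rate(stream, predict, update):
    """Predict-then-update evaluation of an online learner.

    stream  : iterable of (features, label) pairs in arrival order
    predict : callable mapping features to a label
    update  : callable taking (features, label) and adjusting the model
    Returns the error rate measured after each example.
    """
    mistakes = 0
    rates = []
    for n, (x, y) in enumerate(stream, start=1):
        if predict(x) != y:      # test on the example first
            mistakes += 1
        update(x, y)             # then learn from it
        rates.append(mistakes / n)
    return rates


# Toy learner that always predicts the most recent label it has seen.
state = {"last": 1}
rates = cumulative_error_rate(
    [("u1", 1), ("u2", -1), ("u3", -1)],
    predict=lambda x: state["last"],
    update=lambda x, y: state.update(last=y),
)
```

The last entry of the returned list is the error rate after the final URL in the stream.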
Lexical Features vs. Full Features
  OP, CW and AROW
  Yahoo-Phish
Obfuscation-Resistant Lexical Features
  Performance of AROW with/without OR features after the last URL
The resilience of AROW to noisy data
  AROW and CW
  Yahoo-Phish
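A noise experiment of this kind can be set up by corrupting the training labels before they reach the learner. The sketch below assumes that "noise" means each label is flipped independently with a fixed probability; the slides do not say how the noise was injected, and the function name and seed parameter are hypothetical.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of +1/-1 labels, each flipped with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [-y if rng.random() < noise_rate else y for y in labels]


clean = [1, -1, 1, -1, 1]
untouched = flip_labels(clean, 0.0)
inverted = flip_labels(clean, 1.0)
noisy = flip_labels(clean, 0.25)
```

Only the training labels would be corrupted in such a setup; the error would still be measured against the clean labels.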
Minimum/Maximum URL Similarity Distance distribution
Proposed PhishDef, a proactive defense scheme against phishing attacks
PhishDef detects phishing URLs on-the-fly
PhishDef uses only lexical features
  High accuracy (97%)
  Low overhead
  Resilient to noisy training data
  Firefox and Chrome add-on implementations
Q&A