
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
2011/3/17
Data Mining and Machine Learning Lab.

Authors: Anh Le, Athina Markopoulou (University of California, Irvine); Michalis Faloutsos (University of California, Riverside)

Source: to appear in IEEE INFOCOM 2011 Mini Conference, Shanghai, China, April 10-15, 2011. (poster, tech report)


Introduction
Dataset and Feature Extraction
Classification Algorithms
Evaluation Results
System Deployment
Conclusion


“How well can one detect phishing URLs using only lexical features compared to using full features?”

PhishDef Properties:
  High accuracy: 96%-97%
  Light-weight: low latency; imposes a modest overhead
  Proactive approach, as opposed to reactively relying on blacklists
  Resilience to noise: accuracy falls from 95% to 86% as noise in the training data rises from 5% to 45%


Dataset
  Malicious URLs: PhishTank, MalwarePatrol
  Legitimate URLs: Yahoo Directory, Open Directory (DMOZ)
  External Feature Collection: WHOIS, Team Cymru


Feature Extraction
  Automatically selected features
    Delimiters: '/', '?', '.', '=', '_', '&' and '-'
    Four parts: Domain Name, Directory, File Name, Argument
  Obfuscation-resistant lexical features
    Four different URL obfuscation techniques
    Five categories of hand-selected lexical features
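
The automatic feature selection described on this slide (cut the URL into its four parts, then tokenize each part on the listed delimiters) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `split_url` and `bag_of_words`, the use of `urllib.parse`, and the choice to tag each token with the part it came from are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Delimiters listed on the slide: / ? . = _ & -
DELIMITERS = "/?.=_&-"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]")


def split_url(url):
    """Split a URL into the four parts named on the slide (hypothetical helper)."""
    parts = urlsplit(url)
    # Everything before the last '/' of the path is the directory,
    # everything after it is the file name.
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.netloc,
        "directory": directory,
        "file": file_name,
        "argument": parts.query,
    }


def bag_of_words(url):
    """Return a set of (part, token) pairs usable as binary features."""
    features = set()
    for part, text in split_url(url).items():
        for token in _SPLIT_RE.split(text):
            if token:  # skip empty strings produced by adjacent delimiters
                features.add((part, token.lower()))
    return features


# Hypothetical URL, used only to show the shape of the output.
example = "http://www.example.com/secure/login/index.php?user=bob&id=7"
print(split_url(example))
print(sorted(bag_of_words(example)))
```

Tagging tokens with their part keeps, for instance, "login" in the directory distinct from "login" in the domain name; whether the original system makes that distinction is not stated on the slide.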


(I) Obfuscating the host with an IP address

(II) Obfuscating the host with another domain

(III) Obfuscating with large host names

(IV) Domain unknown or misspelled


Features related to the full URL
  Length of the URL (Type II)
  Number of dots in the URL (Type II)
  Blacklisted words (Type IV):
    confirm, account, banking, secure, ebayisapi, webscr, login and signin
    Paypal, free, lucky and bonus

Features related to the domain name
  Length of the domain name (Type III)
  IP or port number is used in the domain name (Type I)
  Number of tokens of the domain name (Type III)
  Number of hyphens used in the domain name (Type III)
  The length of the longest token (Type III)

Features related to the directory
  Length of the directory (Type II)
  Number of sub-directory tokens (Type II)
  Length of the longest sub-directory token (Type II)
  Maximum number of dots and other delimiters used in a sub-directory token (Type II)


Features related to the file name
  Length of the file name (Type II)
  Number of dots and other delimiters used in the file name (Type II)

Features related to the argument part
  Length of the argument part
  Number of variables
  Length of the longest variable value
  The maximum number of delimiters used in a value
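
The five categories of hand-selected features listed above can be computed with a few lines of string handling. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `lexical_features`, the dictionary keys, the IPv4 regular expression, and the decision to count blacklisted words (rather than emit one flag per word) are all choices made for this example.

```python
import re
from urllib.parse import urlsplit

# Blacklisted words from the slide, lower-cased for matching.
BLACKLIST = ("confirm", "account", "banking", "secure", "ebayisapi", "webscr",
             "login", "signin", "paypal", "free", "lucky", "bonus")
DELIMS = re.compile(r"[/?.=_&-]")             # delimiters listed on the slide
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")  # assumed test for "IP as host"


def longest(tokens):
    """Length of the longest token, or 0 if there are none."""
    return max((len(t) for t in tokens), default=0)


def lexical_features(url):
    parts = urlsplit(url)
    host = parts.hostname or ""
    directory, _, file_name = parts.path.rpartition("/")
    domain_tokens = [t for t in host.split(".") if t]
    dir_tokens = [t for t in directory.split("/") if t]
    # Values of the query variables, e.g. "id=12&x=y" -> ["12", "y"].
    values = [pair.partition("=")[2] for pair in parts.query.split("&") if pair]
    lowered = url.lower()
    return {
        # Full URL
        "url_length": len(url),
        "url_dots": url.count("."),
        "blacklisted_words": sum(word in lowered for word in BLACKLIST),
        # Domain name
        "domain_length": len(host),
        "domain_has_ip_or_port": bool(IPV4.match(host)) or parts.port is not None,
        "domain_tokens": len(domain_tokens),
        "domain_hyphens": host.count("-"),
        "domain_longest_token": longest(domain_tokens),
        # Directory
        "dir_length": len(directory),
        "dir_tokens": len(dir_tokens),
        "dir_longest_token": longest(dir_tokens),
        "dir_max_delims": max((len(DELIMS.findall(t)) for t in dir_tokens), default=0),
        # File name
        "file_length": len(file_name),
        "file_delims": len(DELIMS.findall(file_name)),
        # Argument part
        "arg_length": len(parts.query),
        "arg_variables": len(values),
        "arg_longest_value": longest(values),
        "arg_max_delims": max((len(DELIMS.findall(v)) for v in values), default=0),
    }


# Hypothetical URL combining several obfuscation types.
url = "http://192.168.0.1:8080/paypal.com/secure-login/update.account.php?id=12&token=a.b.c"
for name, value in lexical_features(url).items():
    print(f"{name}: {value}")
```

All of these features depend only on the URL string, so they can be computed without any network lookup.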

Summary of dataset


Batch Learning
  Support Vector Machine (SVM)

Online Learning
  Online Perceptron (OP)
  Confidence Weighted (CW)
  Adaptive Regularization of Weights (AROW)
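
To make the online-learning setting concrete, here is a minimal sketch of an AROW-style learner that processes one labeled example at a time. It follows the general AROW update (weights plus a per-feature confidence that shrinks as features are seen), but it is not taken from the slides: the diagonal covariance approximation, the class name `DiagonalAROW`, the regularization parameter `r=1.0`, and the toy data are all assumptions of this example.

```python
class DiagonalAROW:
    """AROW with a diagonal covariance matrix (an assumed simplification)."""

    def __init__(self, n_features, r=1.0):
        self.w = [0.0] * n_features      # mean weight vector
        self.sigma = [1.0] * n_features  # per-feature variance (confidence)
        self.r = r                       # regularization parameter, r > 0

    def margin(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def update(self, x, y):
        """Process one example; y must be +1 (phishing) or -1 (legitimate)."""
        m = self.margin(x)
        if y * m >= 1.0:
            return  # margin already large enough: no update
        # Confidence term v = x^T Sigma x for a diagonal Sigma.
        v = sum(s * xi * xi for s, xi in zip(self.sigma, x))
        beta = 1.0 / (v + self.r)
        alpha = (1.0 - y * m) * beta  # hinge loss times beta
        for i, xi in enumerate(x):
            if xi == 0:
                continue
            sx = self.sigma[i] * xi
            self.w[i] += alpha * y * sx      # move the mean toward the label
            self.sigma[i] -= beta * sx * sx  # shrink variance of seen features


# Toy stream with two hypothetical binary features.
model = DiagonalAROW(n_features=2)
stream = [([1.0, 0.0], -1), ([0.0, 1.0], 1)] * 3
for x, y in stream:
    model.update(x, y)
print(model.predict([1.0, 0.0]), model.predict([0.0, 1.0]))
```

Because each example is seen once and then discarded, a learner of this kind can keep adapting as new URLs arrive, whereas a batch SVM has to be retrained on the whole dataset.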


Batch-based vs. Online algorithms
  SVM vs. AROW
  Yahoo-Phish


Lexical Features vs. Full Features
  OP, CW and AROW
  Yahoo-Phish


Obfuscation-Resistant Lexical Features
  Performance of AROW with/without OR features after the last URL


The resilience of AROW to noisy data
  AROW and CW
  Yahoo-Phish
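
A noise experiment of this kind needs a way to corrupt a controlled fraction of the training labels. The sketch below assumes that "noise" means flipping the label of a randomly chosen fraction of training examples; the slides do not spell out the exact procedure, and the function name `flip_labels` and its `seed` parameter are hypothetical.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of `labels` (+1/-1) with a fraction `noise_rate` flipped."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    n_flip = round(len(labels) * noise_rate)
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [-y if i in flipped else y for i, y in enumerate(labels)]


# 100 toy labels; sweep the noise range quoted in the talk (5% to 45%).
clean = [1] * 50 + [-1] * 50
for rate in (0.05, 0.25, 0.45):
    noisy = flip_labels(clean, rate)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"noise rate {rate:.2f}: {changed} labels flipped")
```

Training on the noisy labels while measuring accuracy against the clean ones separates the effect of label noise from ordinary classification error.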


Minimum/Maximum URL Similarity Distance distribution



Proposed PhishDef, a proactive defense scheme against phishing attacks

PhishDef detects phishing URLs on-the-fly

PhishDef uses only lexical features
  High accuracy (97%)
  Low overhead
  Resilient to noisy training data
  Firefox and Chrome add-on implementations

Q&A?

