Phishing Detection Using Probabilistic Latent Semantic Analysis
Venkatesh Ramanathan & Dr. Harry Wechsler
Department of Computer Science, George Mason University, Fairfax, VA 22030
Quantitative Methods in Defense and National Security 2010, George Mason University
May 25-26, 2010
OUTLINE
- Motivation and Goals
- Background
- Methodology
- Implementation
- Experimental Design and Results
- Conclusions and Future Research
Motivation and Goals
- In phishing, an attacker tricks users into divulging personal and financial data.
- More than 50,000 unique attacks are detected per month by an anti-phishing group.
[Figure: fishing vs. phishing analogy. Attacker (fisherman, in a boat), bait, victim (fish).]
Goal and Method
- Goal:
  - Develop a self-managing computing system that detects and prevents phishing attacks, predicts future attacks, and automatically adapts to new attacks.
- Presentation focus:
  - Development of a content filtering module to detect phishing attacks.
- Module employs:
  - Probabilistic Latent Semantic Analysis (PLSA) for topic identification and classification.
Background
Attacker’s Motivation & Mode of Operation
- Motivation
  - Exploit security holes and consumers’ trust to gather personal information.
  - Use stolen information for financial benefit, identity theft, and other fraud.
- Mode of Operation
  - Find a domain name that resembles the current target.
  - Find similarities that would make the phisher’s site look legitimate.
  - Register the domain.
  - Find a reliable anonymous host (offshore).
  - Design a web page similar to the source web site, with added code to collect and post the user’s credentials to the attacker’s server.
Phishing Examples: Collect User Credentials
[Figure: attack flow between Attacker, SMTP Mail Server, Mail Database, IMAP Mail Server, Victim, Phishing Web Site, and Real Web Site.]
1. Attacker composes a “phish” email and sends it.
2. Mail server processes it and stores it in the victim’s account.
3. Victim’s PC fetches the “phish” email.
4. Victim opens the “phish” email.
5. Victim clicks on a link.
6. Link launches a browser and takes the victim to a phishing web site.
7. Victim, thinking it is the real web site, enters username/password.
8. Attacker collects the credentials, puts up an invalid-login page, redirects to the real web site, and takes down the phishing site.
9. Victim enters credentials again and logs into the real site.
Phishing Examples: Email To Collect User Credentials

Received: from 101.212.50.151 by ; Tue, 15 Nov 2005 16:27:51 -0700
From: "Credit Union" [email protected]   [PRETENDS TO BE FROM REAL ACCOUNT]
Reply-To: "Credit Union" [email protected]   [PHISHING EMAIL ADDRESS/DOMAIN]
   [TWO DIFFERENT ADDRESSES: From vs. Reply-To]
To: [email protected], [email protected], [email protected]
Subject: FCU: Account update
Date: Tue, 15 Nov 2005 19:28:51 -0400
Content-Type: text/html;   [HTML EMAIL (EASIER TO DISGUISE/HIDE LINKS)]
<html>
<head></head>
<body>
Credit Union is constantly working to ensure security by regularly screening the accounts in our system. We recently reviewed your account, and we need more information to help us provide you with secure service.   [CONVINCING CONTENT]
Until we can collect this information, your access to sensitive account features will be limited. We would like to restore your access as soon as possible, and we apologize for the inconvenience.   [RISK, IF NO ACTION]
<br>...
How can I restore my account access? Please confirm your identity here: Restore <a href=3D"http://200.73.81.212/.CREDIT-UNION/update.php">My Online Banking</a> and complete the "Steps to Remove Limitations." Completing all of the checklist items will automatically restore your account access.   [PHISHING WEBSITE LINK]
</html>
Protection Techniques
[Figure: mail flow (Attacker, SMTP Mail Server, Mail Database, IMAP Mail Server, Victim, Phishing Web Site, Real Web Site) annotated with the protection techniques below.]
1. Network Level
2. Authentication
3. Filters/Classifiers
4. User Profile Filters, Toolbars
5. Mimic Prevention
6. User Education
Methodology
Methodology
- Self-managing computing system that adapts to unpredictable changes whilst hiding intrinsic complexity from operators and users
  - Self-Protection: System detects malicious activities and protects against them without human involvement.
  - Self-Optimization: System optimizes itself in response to changes in load and environment.
  - Self-Configuration: System configures itself upon initialization, restart, and changes in environment and load.
  - Self-Healing: System recovers automatically from hardware and software failures.
[Figure: Autonomic Computing. A System surrounded by Self-Protection, Self-Optimization, Self-Configuration, and Self-Healing.]
System Architecture
[Figure: system architecture block diagram. Labels recovered from the diagram:]
- Modules: Phishing Detection Module, Adaptive Module, Predictive Module, Optimization Module, Repair Module, Round Robin Module
- Engines: Self Protection Engine, Self Optimization Engine, Self Configuration Engine, Self Healing Engine
- Detection components: Content Parser, Feature Extractor, Content Classifier (PLSA), Link Analyzer, Fold In Technique, Watch List
- Other components: Repository Updater, Time Slice Control, Web Crawler, Event Manager, Event Collector, Event Analyzer, Event Publisher, Prediction Modeler, Watch List Monitor, Watch List Repository, New Attack Monitor, Workload Monitor, Performance Monitor, Software Updater, System Configuration, Content Cache, External Data Sources
Phishing Detection Module
- To detect phishing attacks, the module will:
  - employ a filtering model that separates good email from phishing email;
  - store suspicious emails for use by other modules;
  - employ PLSA, a natural language processing technique.
[Figure: filter sorting email into GOOD EMAIL, PHISHING EMAIL, and SUSPICIOUS (WATCHLIST).]
Phishing Detection Module
- PLSA makes use of “context” and term co-occurrences.
- Handles topics containing polysemous words
  - words that mean different things in different contexts
  - Bank: river bank vs. financial banking system
- Based on the likelihood principle and has a solid statistical foundation.
- Standard statistical techniques can be applied for model fitting.
PLSA Introduction
- Given:
  - a set of documents, D = {d1, d2, …}
  - from a vocabulary, W = {w1, w2, …}
- Find:
  - latent (hidden) topics, Z = {z1, z2, …}
- Topic/concept probabilities are estimated based on all documents that deal with a concept.
- No prior knowledge about concepts is required.
[Figure: Terms (Username, loginname, logonname, Account), Topics/Latent Concepts (ACCOUNT), Documents.]
$$P(w_i \mid d_j) = \sum_{k=1}^{K} P(w_i \mid z_k)\, P(z_k \mid d_j)$$
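As a small illustration of the mixture decomposition P(w_i|d_j) = Σ_k P(w_i|z_k) P(z_k|d_j), the sketch below combines a word-per-topic table with a topic-per-document table. The numbers and the names `p_w_given_z`, `p_z_given_d`, and `word_given_doc` are made up for illustration and do not come from the presentation.

```python
# Hypothetical toy values: 3 words, 2 topics, 2 documents.
# p_w_given_z[k][i] = P(w_i | z_k); each row sums to 1.
p_w_given_z = [
    [0.7, 0.2, 0.1],   # topic z1
    [0.1, 0.3, 0.6],   # topic z2
]
# p_z_given_d[j][k] = P(z_k | d_j); each row sums to 1.
p_z_given_d = [
    [0.9, 0.1],        # document d1
    [0.2, 0.8],        # document d2
]


def word_given_doc(p_w_given_z, p_z_given_d):
    """Return rows of P(w_i | d_j) = sum_k P(w_i | z_k) * P(z_k | d_j)."""
    n_topics = len(p_w_given_z)
    n_words = len(p_w_given_z[0])
    return [
        [sum(p_w_given_z[k][i] * doc[k] for k in range(n_topics))
         for i in range(n_words)]
        for doc in p_z_given_d
    ]


p_w_given_d = word_given_doc(p_w_given_z, p_z_given_d)
for j, row in enumerate(p_w_given_d, start=1):
    print(f"d{j}:", [round(p, 3) for p in row])
```

Each resulting row is itself a probability distribution over words, which is what lets the model explain an observed document as a blend of topics.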
PLSA Introduction
[Figure: PLSA example. Observed word distributions, word distributions per topic, and topic distributions per document.]
PLSA Algorithm
- Step 1: Build term-document matrix:
  - W = {w1, …, wj, f1, …, fj}; wi: words; fi: features
- Step 2: Initialize probabilities P(z), P(d|z), P(w|z) randomly.
[Figure: Document-Term Matrix. Rows are documents d1, …, di, …, dI (set D); columns are words w1, …, wJ and features f1, …, fJ (set W).]
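Step 1 can be sketched as follows, assuming the documents have already been tokenized and stemmed. The function name `build_term_document_matrix` and the two toy documents are hypothetical; only word counts are built here, with no extra features.

```python
from collections import Counter


def build_term_document_matrix(tokenized_docs):
    """Return (vocabulary, counts) where counts[j][i] = n(d_j, w_i)."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    counts = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        counts.append(row)
    return vocabulary, counts


# Hypothetical, already preprocessed (stemmed) documents.
docs = [
    ["account", "password", "confirm", "account"],
    ["movi", "music", "dvd"],
]
vocabulary, counts = build_term_document_matrix(docs)
print(vocabulary)
for row in counts:
    print(row)
```

Rows correspond to documents and columns to vocabulary entries, matching the document-term layout shown on the slide.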
PLSA: EM Algorithm
- Step 3: E-step: compute posterior probabilities for the latent variables.
- Step 4: M-step: maximize the expected complete-data log-likelihood.
- Step 5: Compute the log-likelihood.
- Step 6: Iterate until the desired threshold is reached.
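Steps 2 through 6 can be sketched in plain Python as below. This is a minimal illustration, not the presenters' implementation: it assumes the asymmetric parameterization P(w|z), P(z|d), uses plain EM rather than a tempered variant, and the names `plsa_em`, `normalize`, `tol`, and the toy count matrix are all hypothetical.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def plsa_em(counts, n_topics, max_iter=200, tol=1e-6, seed=0):
    """Fit PLSA by plain EM on counts[d][w] = n(d, w).

    Returns (p_w_z, p_z_d, history) with p_w_z[z][w] = P(w|z),
    p_z_d[d][z] = P(z|d), and history holding the log-likelihood per iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Step 2: random initialization, normalized into valid distributions.
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    history = []
    for _ in range(max_iter):
        # Tiny constant keeps every probability strictly positive.
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # Step 3 (E-step): P(z | d, w) is proportional to P(w|z) P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                # Step 5: log-likelihood under the current parameters.
                log_lik += n_dw * math.log(p_w_d)
                for z in range(n_topics):
                    # Step 4 (M-step): accumulate expected counts n(d, w) P(z | d, w).
                    expected = n_dw * joint[z] / p_w_d
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_lik)
        # Step 6: stop once the improvement falls below the threshold.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return p_w_z, p_z_d, history


# Hypothetical toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [
    [4, 3, 0, 0],
    [5, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 2, 5],
]
p_w_z, p_z_d, history = plsa_em(counts, n_topics=2)
print("iterations:", len(history))
print("P(z|d):", [[round(p, 3) for p in row] for row in p_z_d])
```

On data with this block structure the two topics typically end up aligned with the two groups of documents, but EM only finds a local optimum and the result depends on the random initialization.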
PLSA – Folding-In Technique
- Estimate probabilities for an ‘unseen’ document (dnew).
- P(w|z) is unchanged; use the estimates from training.
- E-step: compute the posteriors P(z|dnew, w).
- M-step: update only P(z|dnew).
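A minimal sketch of folding-in, assuming a trained P(w|z) table is already available: the E-step computes P(z|dnew, w) proportional to P(w|z) P(z|dnew), and the M-step re-estimates only P(z|dnew). The function name `fold_in`, the uniform starting point, the iteration count, and the toy probabilities are assumptions made for illustration.

```python
def fold_in(doc_counts, p_w_z, n_iter=50):
    """Estimate P(z | d_new) for one unseen document with P(w|z) held fixed.

    doc_counts[w] = n(d_new, w); p_w_z[z][w] = P(w|z) from training.
    """
    n_topics = len(p_w_z)
    p_z_d = [1.0 / n_topics] * n_topics       # uniform start (an assumption)
    if sum(doc_counts) == 0:
        return p_z_d                           # no known words: stay uniform
    for _ in range(n_iter):
        new_p = [0.0] * n_topics
        for w, n_w in enumerate(doc_counts):
            if n_w == 0:
                continue
            # E-step: P(z | d_new, w) is proportional to P(w|z) P(z|d_new).
            joint = [p_w_z[z][w] * p_z_d[z] for z in range(n_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue                       # word impossible under every topic
            for z in range(n_topics):
                # M-step: only P(z | d_new) is re-estimated.
                new_p[z] += n_w * joint[z] / norm
        mass = sum(new_p)
        if mass == 0.0:
            break
        p_z_d = [v / mass for v in new_p]
    return p_z_d


# Hypothetical trained word distributions over the vocabulary
# ["account", "password", "movi", "music"].
p_w_z = [
    [0.5, 0.4, 0.05, 0.05],   # z1: phishing-like topic
    [0.05, 0.05, 0.5, 0.4],   # z2: non-phishing-like topic
]
p_z_new = fold_in([3, 2, 0, 1], p_w_z)
print([round(p, 3) for p in p_z_new])
```

Because P(w|z) never changes, folding-in is cheap enough to run per incoming document, which is what makes it usable for classifying new content.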
Implementation
Phishing Detection Module
- To separate the good, the bad, and the suspicious, the module employs these functional components:
  - Content Parser: Parses content and removes noise.
  - Feature Extractor: Extracts a rich feature set for content classification.
  - Link Analyzer: Analyzes interdependency between email and web content (not done yet).
  - Content Classifier (PLSA): Employs filters to separate the good, the bad, and the suspicious.
  - Fold In Technique: Computes topic distribution probabilities of a new document.
  - Watch List Repository: External module that stores suspicious items for use by other components.
Phishing Detection Module
- Training corpus is used to build filters.
- Filters are built using PLSA.
- Folding-In technique is used to classify new content.
[Figure: data flow with labels Training Repository, Validation Repository, Filter Design (PLSA), New Content Repository, Fold-In Technique.]
Experimental Design and Results
Experimental Design
- Design
  - PhishingCorpus data set: 750 good emails, 250 phishing emails
  - K-fold cross validation (k = 10)
  - Classification using PLSA
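The 10-fold split over the 1,000 emails can be sketched as below. This is a plain shuffled split; whether the original experiment stratified folds by class is not stated, so that is left out. The helper name `k_fold_indices`, the seed, and the placeholder labels are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test


# 750 good + 250 phishing emails, as in the experiment; labels are placeholders.
labels = [0] * 750 + [1] * 250
splits = list(k_fold_indices(len(labels), k=10))
for fold, (train, test) in enumerate(splits, start=1):
    n_phish = sum(labels[i] for i in test)
    print(f"fold {fold}: train={len(train)} test={len(test)} phishing in test={n_phish}")
```

Every email appears in exactly one test fold, so the averaged score uses each message once for evaluation.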
- Content Pre Processing
  - Removed extraneous characters and HTML tags.
  - Removed stop words.
  - Employed Porter’s stemming.
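The preprocessing steps above can be sketched with the standard library alone. Note the substitutions: `crude_stem` is a simplified suffix stripper standing in for Porter's stemmer (it is not equivalent to it), and `STOP_WORDS` is a deliberately tiny, hypothetical list.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is",
              "we", "your", "you", "our", "by"}


def crude_stem(word):
    """Simplified suffix stripping; a stand-in for Porter's stemmer, not equivalent to it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(raw_text):
    """Strip HTML tags and extraneous characters, drop stop words, then stem."""
    text = re.sub(r"<[^>]+>", " ", raw_text)      # remove HTML tags
    text = html.unescape(text)                     # decode entities such as &amp;
    tokens = re.findall(r"[a-z]+", text.lower())   # keep alphabetic tokens only
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("<p>We recently <b>reviewed</b> your accounts.</p>")
print(tokens)
```

A real Porter implementation would produce stems such as "ebai" and "movi" seen in the results; the stand-in here only shows where stemming sits in the pipeline.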
Experimental Design
- Content Parsing
  - Email data:
    - Implemented MIME message parsing.
    - Parsed email headers, email text, hyperlinks, and code.
    - Removed HTML tags.
- Feature Extraction
  - Email feature: words only
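MIME parsing of this kind can be sketched with Python's standard `email` package. The message below is a hypothetical one modeled on the example slide, with placeholder addresses, and the helper name `parse_email` and the simple `href` regular expression are assumptions rather than the presenters' parser.

```python
import re
from email import message_from_string
from email.policy import default


def parse_email(raw_message):
    """Return selected headers, the decoded body text, and any hyperlinks found."""
    msg = message_from_string(raw_message, policy=default)
    headers = {name: msg[name] for name in ("From", "Reply-To", "Subject")}
    # Prefer the HTML part, falling back to plain text.
    body_part = msg.get_body(preferencelist=("html", "plain"))
    body = body_part.get_content() if body_part is not None else ""
    links = re.findall(r'href="([^"]+)"', body, flags=re.IGNORECASE)
    return headers, body, links


# Hypothetical message; the addresses are placeholders.
raw = (
    'From: "Credit Union" <service@example.com>\n'
    'Reply-To: "Credit Union" <reply@example.net>\n'
    "Subject: FCU: Account update\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body>Please confirm here: "
    '<a href="http://200.73.81.212/.CREDIT-UNION/update.php">Restore</a>'
    "</body></html>\n"
)
headers, body, links = parse_email(raw)
print(headers["Subject"])
print(links)
```

Only the words of the body feed the classifier in the reported experiment; the extracted headers and links are the kind of material a richer feature set could draw on.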
Experimental Design – Topic Identification (Training)
INPUT: Phishing Corpus training dataset
1. Preprocess (parse, remove stop words, stemming)
2. Build term-document frequency matrix
3. Initialize probabilities P(d), P(w|z), P(z|d)
4. Run TEM (tempered EM)
OUTPUT: Topics (z1 Phishing, z2 Non-Phishing), P(w|z1), P(w|z2)
Experimental Design – Phishing Detection (Testing)
INPUT: Phishing Corpus test dataset
1. Preprocess (parse, remove stop words, stemming)
2. Build term-document frequency matrix
3. Fix z and P(w|z); initialize P(dnew), P(z|dnew)
4. Run TEM (tempered EM)
5. Is P(z1|dnew) > threshold?
   - YES: PHISHING EMAIL
   - NO: NON-PHISHING EMAIL
Results – PLSA Topic Identification

Topic: Phishing              Topic: Non-Phishing
w            P(w|z)          w          P(w|z)
ebai         0.028786        video      0.000225
confirm      0.000352        game       0.000102
usernam      0.145218        playsta    0.000203
bill         0.000549        movi       0.000179
issue        0.000104        music      0.00022
account      0.012317        disnei     0.000226
password     0.013333        dvd        0.000223
failur       0.000405        ashle      0.000182
suspend      0.000523        simpson    0.000209
indefinit    0.000385        star       0.000223
thank        0.002679        war        0.000221
cooper       0.000302        truck      0.000205
click        0.00267         handbag    0.000171
compromise   0.000246        diesel     0.000218
…                            …
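Reading a topic off P(w|z) amounts to ranking words by probability. The sketch below does this for a subset of the phishing-topic values listed in the table above; the helper name `top_words` is hypothetical.

```python
def top_words(p_w_given_z, vocabulary, n=3):
    """Return the n highest-probability (word, probability) pairs for one topic."""
    ranked = sorted(zip(vocabulary, p_w_given_z),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Subset of the phishing-topic column from the results table.
vocabulary = ["ebai", "confirm", "usernam", "account", "password", "click"]
p_w_z1 = [0.028786, 0.000352, 0.145218, 0.012317, 0.013333, 0.00267]

top = top_words(p_w_z1, vocabulary)
for word, prob in top:
    print(f"{word:10s} {prob:.6f}")
```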
Results
[Figure: ROC curve of phishing email classification. x-axis: False Positive (0 to 0.8); y-axis: True Positive (0.64 to 0.73).]
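An ROC curve of this kind is traced by sweeping the decision threshold on P(z1|d) and recording the false-positive and true-positive rates at each setting. The sketch below assumes both classes are present; the function name `roc_points`, the scores, and the thresholds are hypothetical and are not the experiment's data.

```python
def roc_points(scores, labels, thresholds):
    """Return (false_positive_rate, true_positive_rate) for each threshold.

    scores[i] = P(z1 | d_i); labels[i] = 1 for phishing and 0 for good email.
    An email is flagged as phishing when its score exceeds the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores, not the experiment's data.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
points = roc_points(scores, labels, thresholds=[0.1, 0.5, 0.85])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold catches more phishing email at the cost of flagging more good email, which is the trade-off the curve makes visible.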
Results – Performance Comparison
- Accuracy:
  - SpamAssassin: 89%
  - PILFER (using SVM): 92%
  - PLSA: 71%
- Even though PLSA’s accuracy is lower than that of the other techniques, PLSA handles different word usage and polysemy while the others do not.
- Additional features could not be used by PLSA due to the age of the dataset (phishing domains are short-lived).
- Performance improvement is expected when email data, website data, and their interdependencies from a recent data set are used for detection with PLSA.
Conclusions and Future Research
- Conclusions:
  - Results show that PLSA identifies hidden topics (phishing versus others).
  - An accuracy of 71% was achieved using a very limited feature set (words only) from email content.
- Future Research:
  - Expand the PLSA framework to include a richer feature set (email + website + interdependencies).
  - Employ a prediction modeler to predict future attacks from events generated by internal and external components.
  - Employ an adaptive modeler to adapt to changes in attack strategies and modes of operation.
  - Employ the framework in an autonomous environment for automatic identification, prevention, and protection of email servers from phishing attacks.
THANK YOU !