Graph-based KNN Algorithm for Spam SMS Detection
Uploaded by so-yeon-kim. Category: Data & Analytics.
![Page 1: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/1.jpg)
Tran Phuc Ho, Ho-Seok Kang, Sung-Ryul Kim. Journal of Universal Computer Science, vol. 19, no. 16 (2013)
![Page 2: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/2.jpg)
* Spam SMS: advertisements from commercial companies, and hacking messages meant to deceive recipients and steal personal information.
* Content-based approach: graph-based text representation + KNN algorithm
![Page 3: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/3.jpg)
* Labeled small message groups: spam / normal
* 5 messages per group (in real time, only 1 message)
* Tokenize them by whitespace and punctuation
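The tokenization step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `tokenize` is hypothetical, and lowercasing the text first is an added assumption that the slide does not mention.

```python
import re


def tokenize(message: str) -> list[str]:
    """Split a message on runs of whitespace and punctuation."""
    # [\W_]+ matches any run of characters that are not letters or digits,
    # which covers whitespace, punctuation and the underscore.
    # Lowercasing is an assumption of this sketch, not part of the slide.
    return [tok for tok in re.split(r"[\W_]+", message.lower()) if tok]


print(tokenize("WINNER!! Call 0900-123 now"))
# ['winner', 'call', '0900', '123', 'now']
```

Empty strings produced by leading or trailing punctuation are dropped by the `if tok` filter.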
![Page 4: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/4.jpg)
* Feature selection: remove the noisy features and select the good ones
* Methods: mutual information (MI), χ² statistic (CHI)
![Page 5: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/5.jpg)
* Mutual information (MI): the dependence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
P(t ∧ c): the probability that t and c co-occur
P(t | c): the conditional probability of t in c
P(t): the probability of t
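The three quantities labelled on this slide match the usual pointwise form of mutual information used for text feature selection. Assuming that is the form intended here (the slide's own equation is not in the transcript), it can be written as:

```latex
\[
  MI(t, c) \;=\; \log \frac{P(t \wedge c)}{P(t)\,P(c)}
           \;=\; \log \frac{P(t \mid c)}{P(t)}
\]
```

The two forms are equal because P(t ∧ c) = P(t | c) P(c). The score is 0 when t and c are independent and grows as t becomes more concentrated in class c.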
![Page 6: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/6.jpg)
* χ² statistic (CHI): the lack of independence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
P(t): the probability of t
P(t ∧ c): the probability that t and c co-occur
P(c): the probability that the text belongs to c
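A χ² score of this kind is commonly computed from a 2x2 table of message counts. The sketch below uses that standard count-based formula, which may differ in notation from the equation on the slide; the function and parameter names are hypothetical.

```python
def chi_square(n_t_c: int, n_t_other: int, n_not_t_c: int, n_not_t_other: int) -> float:
    """Chi-square score of a token t for a class c.

    n_t_c:         messages in class c that contain t
    n_t_other:     messages outside class c that contain t
    n_not_t_c:     messages in class c that do not contain t
    n_not_t_other: messages outside class c that do not contain t
    """
    n = n_t_c + n_t_other + n_not_t_c + n_not_t_other
    denominator = (
        (n_t_c + n_not_t_c)            # messages in class c
        * (n_t_other + n_not_t_other)  # messages outside class c
        * (n_t_c + n_t_other)          # messages containing t
        * (n_not_t_c + n_not_t_other)  # messages not containing t
    )
    if denominator == 0:
        return 0.0  # an empty row or column leaves the score undefined; use 0
    return n * (n_t_c * n_not_t_other - n_not_t_c * n_t_other) ** 2 / denominator


print(chi_square(10, 10, 10, 10))  # token spread evenly over both classes: 0.0
print(chi_square(10, 0, 0, 10))    # token appears only in class c: 20.0
```

The score is 0 when the token is independent of the class and increases with the strength of the association in either direction.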
![Page 7: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/7.jpg)
* Calculate the weight of each feature
* Use the highest-weighted words to construct the graphs
* Weighting methods: CHI (χ² statistic), MI (mutual information)
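As an illustration of weighting and then keeping the highest-weighted words, the sketch below scores tokens with MI(t, c) = log(P(t | c) / P(t)), estimated from message counts, and keeps the top k. The function names, the toy data and the choice of k are all assumptions made for the example.

```python
import math
from collections import Counter


def mi_weights(messages, labels, target="spam"):
    """MI weight of every token that occurs in at least one `target` message."""
    n_total = len(messages)
    n_target = sum(1 for label in labels if label == target)
    df_all = Counter()     # number of messages containing each token
    df_target = Counter()  # the same count, restricted to the target class
    for tokens, label in zip(messages, labels):
        for tok in set(tokens):
            df_all[tok] += 1
            if label == target:
                df_target[tok] += 1
    return {
        tok: math.log((df_target[tok] / n_target) / (df_all[tok] / n_total))
        for tok in df_target
    }


def top_k_features(weights, k):
    """Return the k tokens with the highest weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


messages = [
    ["free", "prize", "call"],
    ["free", "call", "now"],
    ["see", "you", "now"],
    ["call", "me", "later"],
]
labels = ["spam", "spam", "ham", "ham"]
weights = mi_weights(messages, labels)
print(top_k_features(weights, 2))  # "free" and "prize", in either order
```

In this toy data "now" gets a weight of 0 because it is equally common in both classes, so it would be among the first words dropped.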
![Page 8: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/8.jpg)
G = (V, E, FWN)
V: set of nodes. Each node is a unique word (a token selected by feature selection).
E: set of weighted edges linking the nodes. An edge captures the order and co-occurrence relationship between two feature words (if feature words co-occur within a step length, assign an edge).
FWN: feature weight matrix (weights of edges and nodes)
![Page 9: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/9.jpg)
G = (V, E, FWN)
FWN: feature weight matrix holding the weights of edges and nodes
* Edge weight W_ij: co-occurrence frequency of two feature words f_i and f_j within a step length
* Only W_ij with i > j is calculated, so word order matters: of "scientific paper" and "paper scientific", one order is counted and the other is zero
* Node weight: probability of the token represented by the node (frequency of single words)
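The graph construction can be sketched with two counters, one for nodes and one for ordered edges. This is an illustrative reading of the slides rather than the paper's code. It assumes that the step length is measured over the sequence of feature words (after dropping non-feature tokens), that node weights are raw frequencies, and that an edge is keyed by the (earlier word, later word) pair; `build_graph` is a hypothetical name.

```python
from collections import Counter


def build_graph(tokens, features, step=2):
    """Build a word graph from one tokenized message.

    Returns (node_weight, edge_weight):
      node_weight[w]      - how often feature word w occurs
      edge_weight[(a, b)] - how often b follows a within `step` positions
    """
    kept = [tok for tok in tokens if tok in features]
    node_weight = Counter(kept)
    edge_weight = Counter()
    for i, first in enumerate(kept):
        # Look ahead at most `step` positions. Only the (earlier, later)
        # order is stored, so the reversed pair keeps a weight of zero.
        for second in kept[i + 1 : i + 1 + step]:
            if second != first:
                edge_weight[(first, second)] += 1
    return node_weight, edge_weight


nodes, edges = build_graph(
    ["read", "this", "scientific", "paper"], {"scientific", "paper"}
)
print(edges[("scientific", "paper")])  # 1
print(edges[("paper", "scientific")])  # 0
```

A `Counter` returns 0 for missing keys, which mirrors the idea that only one order of a word pair carries weight.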
![Page 10: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/10.jpg)
* KNN: among the K nearest neighbors of the text T to be classified, the class of T is the class that appears most frequently in this collection
1. Build sample graphs (elements)
2. New message comes in
3. Build a testing graph
* Similarity of two graphs -> feature weight (FW): weights of the edges + weight of the edge itself, for edges that appear in both graphs
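The slide gives the similarity only in outline, so the sketch below is one plausible reading and not the paper's exact formula. It assumes each graph is a pair of dictionaries (node weights, edge weights keyed by ordered word pairs), and that every edge shared by both graphs contributes the sample graph's edge weight plus the weights of its two end nodes. The name `feature_weight` is hypothetical.

```python
def feature_weight(test_graph, sample_graph):
    """Similarity of a testing graph and a sample graph (assumed form).

    Each graph is (node_weight, edge_weight), where edge_weight maps an
    ordered word pair (a, b) to a number.
    """
    _, test_edges = test_graph
    sample_nodes, sample_edges = sample_graph
    total = 0.0
    for edge, weight in sample_edges.items():
        if edge in test_edges:  # the edge appears in both graphs
            first, second = edge
            total += weight + sample_nodes.get(first, 0) + sample_nodes.get(second, 0)
    return total


sample = ({"free": 2, "call": 1}, {("free", "call"): 1})
test = (
    {"free": 1, "call": 1, "now": 1},
    {("free", "call"): 1, ("call", "now"): 1},
)
print(feature_weight(test, sample))  # 4.0
```

Here the only shared edge is ("free", "call"), which contributes 1 for the edge plus 2 and 1 for its nodes.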
![Page 11: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/11.jpg)
Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)
* Nfp: how many nodes in the sample graph with weights larger than 0 also appear in the testing graph
* If Nfp > threshold, calculate FW(tg, sg_1)

List (RL):

| # | Feature weight | Class |
|---|---|---|
| 1 | FW(tg,sg1)=2 | Spam |
| 2 | FW(tg,sg2)=3 | Spam |
| … | FW(tg,sg3)=4 | Normal |
| K | FW(tg,sg4)=5 | Spam |
0.0001
3
![Page 12: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/12.jpg)
Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)
* If Nfp > threshold, calculate FW(tg, sg_5) = 6

List (RL):

| # | Feature weight | Class |
|---|---|---|
| 1 | FW(tg,sg5)=6 | Spam |
| 2 | FW(tg,sg2)=3 | Spam |
| … | FW(tg,sg3)=4 | Normal |
| K | FW(tg,sg4)=5 | Spam |

Result: spam message
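The two slides above can be read as a loop that skips samples with too little node overlap, keeps the K best feature weights in a list, and takes a majority vote. The sketch below follows that reading under stated assumptions: graphs are reduced to node-weight dictionaries, `shared_weight` is a simple stand-in for FW, and the names `classify`, `k` and `nfp_threshold` are hypothetical.

```python
import heapq
from collections import Counter


def shared_weight(test_nodes, sample_nodes):
    """Stand-in for FW: sum of sample node weights for words in both graphs."""
    return sum(w for word, w in sample_nodes.items() if word in test_nodes)


def classify(test_nodes, samples, similarity, k=4, nfp_threshold=1):
    """Majority vote over the k most similar samples that pass the Nfp check.

    samples is a list of (sample_nodes, label) pairs.
    """
    ranked = []  # min-heap of (score, index, label); weakest entry is ranked[0]
    for index, (sample_nodes, label) in enumerate(samples):
        # Nfp: sample nodes with weight > 0 that also appear in the test graph
        nfp = sum(1 for word, w in sample_nodes.items() if w > 0 and word in test_nodes)
        if nfp <= nfp_threshold:
            continue  # too little overlap, so the similarity is not computed
        entry = (similarity(test_nodes, sample_nodes), index, label)
        if len(ranked) < k:
            heapq.heappush(ranked, entry)
        elif entry > ranked[0]:
            heapq.heapreplace(ranked, entry)  # the weakest entry gives way
    if not ranked:
        return None  # no sample passed the Nfp check
    votes = Counter(label for _, _, label in ranked)
    return votes.most_common(1)[0][0]


test_nodes = {"free": 1, "call": 1}
samples = [
    ({"free": 1, "call": 1}, "spam"),    # score 2
    ({"free": 2, "call": 1}, "spam"),    # score 3
    ({"free": 2, "call": 2}, "normal"),  # score 4
    ({"free": 3, "call": 2}, "spam"),    # score 5
    ({"free": 3, "call": 3}, "spam"),    # score 6, replaces the score-2 entry
]
print(classify(test_nodes, samples, shared_weight))  # spam
```

The toy scores mirror the slide: the list first holds 2, 3, 4 and 5, then the entry with score 6 replaces the one with score 2, leaving three spam votes against one normal vote.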
![Page 13: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/13.jpg)
* NUS SMS Corpus (5,574 messages): 4,827 normal (86.6%), 747 spam (13.4%)
* [Uysal and Yildiz] SMS collection (875 messages): 450 normal, 425 spam
![Page 14: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/14.jpg)
![Page 15: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/15.jpg)
[Results chart: values reported in (%) and (seconds)]
![Page 16: Graph-based KNN Algorithm for Spam SMS Detection](https://reader033.fdocuments.us/reader033/viewer/2022051315/55ac231d1a28ab58298b462b/html5/thumbnails/16.jpg)
* Spam SMS messages keep evolving, so keywords are hard to capture.
* ex) obfuscated Korean keywords such as 대★출 ("loan"), 이ㅈr ("interest"), <<통>> / <<장>> ("bankbook"); no spaces or punctuation; no specific keyword; the same content sent from other phone numbers; messages with no words, only an image …
* Graph patterns of communication between sender and receiver should be combined with the content-based approach.