Graph-based KNN Algorithm for Spam SMS Detection

Tran Phuc Ho, Ho-Seok Kang, Sung-Ryul Kim. Journal of Universal Computer Science, vol. 19, no. 16 (2013)

Upload: so-yeon-kim

Post on 20-Jul-2015


TRANSCRIPT

Page 1: Graph-based KNN Algorithm for Spam SMS Detection


Page 2: Graph-based KNN Algorithm for Spam SMS Detection

* Spam SMS: advertisements from commercial companies, and fraudulent messages intended to cheat recipients and steal personal information.

* Content-based approach: graph-based text representation + KNN algorithm

Page 3: Graph-based KNN Algorithm for Spam SMS Detection

* Labeled small message groups: spam / normal
* 5 messages (in real time, only 1 message)
* Tokenize them by white space and punctuation
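The tokenization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tokenize` and the lowercasing are assumptions added here, and the slides only state that messages are split on white space and punctuation.

```python
import re


def tokenize(message: str) -> list[str]:
    """Split a message on runs of white space and punctuation.

    Lowercasing is an assumption of this sketch; the slides do not mention it.
    """
    # [\W_]+ matches any run of characters that are neither letters nor digits.
    parts = re.split(r"[\W_]+", message.lower())
    # Drop the empty strings produced by leading or trailing separators.
    return [tok for tok in parts if tok]


tokens = tokenize("Free entry!! Call 0800-123 now, WIN cash.")
```

For the sample message above, the separators "!!", "-", "," and "." are all discarded and only the word and digit runs remain as tokens.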

Page 4: Graph-based KNN Algorithm for Spam SMS Detection

* Remove the noisy features and select the good ones
* Methods: Mutual Information (MI), χ² statistic (CHI)

Page 5: Graph-based KNN Algorithm for Spam SMS Detection

* Mutual information (MI): the dependence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
Terms in the formula:
- the probability that t and c co-occur
- the conditional probability of t in c
- the probability of t
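The slide's formula is not preserved in the transcript. As a hedged sketch, the commonly used text-categorization form of mutual information, log(P(t ∧ c) / (P(t) · P(c))), which equals log(P(t | c) / P(t)), can be estimated from message counts. The count names `a`, `b`, `c`, `n` and the function name are assumptions of this example, not taken from the paper.

```python
import math


def mutual_information(a: int, b: int, c: int, n: int) -> float:
    """Estimate MI(t, c) = log(P(t and c) / (P(t) * P(c))) from counts.

    a: messages of class c that contain token t
    b: messages of other classes that contain t
    c: messages of class c that do not contain t
    n: total number of messages
    (This count-based layout is an assumption of the sketch.)
    """
    p_tc = a / n          # probability that t and c co-occur
    p_t = (a + b) / n     # probability of t
    p_c = (a + c) / n     # probability of class c
    if p_tc == 0:
        return float("-inf")  # t never appears in class c
    return math.log(p_tc / (p_t * p_c))


score = mutual_information(a=30, b=10, c=20, n=100)  # log(0.3 / (0.4 * 0.5))
```

A score of zero means t and c are independent; larger positive values mean the token is more strongly tied to the class.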

Page 6: Graph-based KNN Algorithm for Spam SMS Detection

* χ² statistic (CHI): the lack of independence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
Terms in the formula:
- the probability of t
- the probability that t and c co-occur
- the probability that the text belongs to c
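The slide's formula is not preserved here either. A minimal sketch of the χ² statistic in its usual two-by-two contingency form is given below; the count names `a`, `b`, `c`, `d` and the function name are assumptions of this example, and the paper may state the formula in terms of probabilities instead.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for token t and class c from a 2x2 table.

    a: messages in class c that contain t
    b: messages outside c that contain t
    c: messages in class c that do not contain t
    d: messages outside c that do not contain t
    (This count-based layout is an assumption of the sketch.)
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of dependence
    return n * (a * d - c * b) ** 2 / denom


score = chi_square(a=30, b=10, c=20, d=40)
```

The statistic is zero when t and c are independent and grows as the token's distribution across classes departs from independence.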

Page 7: Graph-based KNN Algorithm for Spam SMS Detection

* Calculate the weight of each feature
* Use the high-weighted words for constructing the graphs
* Weighting methods: CHI (χ² statistic), MI (Mutual Information)
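Keeping only the high-weighted words can be sketched as a simple top-k selection. The function name `select_features`, the `top_k` parameter, and the sample weights are hypothetical; the slides do not say how many words are kept or whether a cutoff value is used instead.

```python
def select_features(weights: dict[str, float], top_k: int) -> list[str]:
    """Return the top_k words with the highest feature weight.

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch).
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Illustrative weights only, not values from the paper.
example_weights = {"free": 16.7, "call": 9.2, "the": 0.1, "win": 12.0}
features = select_features(example_weights, top_k=2)
```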

Page 8: Graph-based KNN Algorithm for Spam SMS Detection

* Nodes: tokens selected by feature selection (each node is a unique word)

G = (V, E, FWN)
V: set of nodes
E: set of weighted edges linking the nodes
FWN: feature weight matrix (weights of edges and nodes)

* Edges: the order and co-occurrence relationship between two feature words (if feature words co-occur within a step length, assign an edge)
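A minimal sketch of this graph construction follows. It assumes that the step length is measured in token positions of the original message, that edges are ordered pairs (first word, later word), and that a word is not linked to itself; none of these details is spelled out on the slide, and the name `build_graph` is hypothetical.

```python
from collections import Counter


def build_graph(tokens: list[str], features: set[str], step: int = 2):
    """Build node and edge counts for one message.

    nodes: how often each feature word occurs
    edges: how often an ordered pair of feature words co-occurs
           within `step` positions (assumption of this sketch)
    """
    nodes = Counter(tok for tok in tokens if tok in features)
    edges = Counter()
    for pos, first in enumerate(tokens):
        if first not in features:
            continue
        # Look ahead at most `step` positions for a second feature word.
        for second in tokens[pos + 1 : pos + 1 + step]:
            if second in features and second != first:
                edges[(first, second)] += 1
    return nodes, edges


nodes, edges = build_graph(
    ["win", "free", "cash", "now", "free", "cash"],
    features={"win", "free", "cash"},
    step=1,
)
```

With a step length of 1, only adjacent feature words are linked, so "now" (not a feature word) breaks the chain between "cash" and the second "free".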

Page 9: Graph-based KNN Algorithm for Spam SMS Detection

* FWN holds the weights of edges and the probabilities of the tokens represented by nodes
* W_ij: co-occurrence frequency of two feature words f_i and f_j within a step length
* Only the weight W_ij (i > j) is calculated. Ex) scientific paper
* The reverse order is left at zero. Ex) paper scientific
* Node weights: frequency of single words
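One possible reading of the feature weight matrix is sketched below. It assumes that the diagonal holds single-word frequencies, that W_ij counts "f_i followed by f_j within the step length", and that only entries with i > j are filled. The index order of the vocabulary, the raw counts (rather than probabilities), and the name `build_fwn` are assumptions of this example.

```python
def build_fwn(tokens: list[str], vocab: list[str], step: int = 1) -> list[list[int]]:
    """Build a lower-triangular feature weight matrix for one message.

    fwn[i][i]: frequency of the single word vocab[i]
    fwn[i][j]: how often vocab[i] is followed by vocab[j] within `step`
               positions, kept only when i > j (assumption of this sketch)
    """
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    fwn = [[0] * size for _ in range(size)]
    for pos, tok in enumerate(tokens):
        if tok not in index:
            continue
        i = index[tok]
        fwn[i][i] += 1  # node weight: frequency of the single word
        for nxt in tokens[pos + 1 : pos + 1 + step]:
            j = index.get(nxt)
            if j is not None and i > j:
                fwn[i][j] += 1  # edge weight, lower triangle only
    return fwn


vocab = ["paper", "scientific"]  # index 0 and 1, chosen for illustration
counted = build_fwn(["scientific", "paper"], vocab)
zero = build_fwn(["paper", "scientific"], vocab)
```

Under these assumptions "scientific paper" produces an edge weight at row 1, column 0, while "paper scientific" leaves every off-diagonal entry at zero, which mirrors the two examples on the slide.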

Page 10: Graph-based KNN Algorithm for Spam SMS Detection

* KNN: among the K nearest neighbors of the text T to be classified, the class of T is the most frequently appearing class in this collection

1. Build sample graphs (elements)
2. New message comes in
3. Build a testing graph

* Similarity of two graphs -> Feature Weight: weights of the edges + weight of the edge itself (for edges that appear in the two graphs)
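The slide gives the similarity only in outline, so the following is a sketch under a stated assumption: for every edge that appears in both graphs, the edge's weight in the testing graph and its weight in the sample graph are added up. The exact formula in the paper may differ, and the name `feature_weight` is hypothetical.

```python
Edge = tuple[str, str]


def feature_weight(tg_edges: dict[Edge, float], sg_edges: dict[Edge, float]) -> float:
    """Similarity of a testing graph and a sample graph.

    Sums the weights of the edges shared by both graphs
    (an assumed reading of the slide, not the paper's exact formula).
    """
    shared = tg_edges.keys() & sg_edges.keys()
    return sum(tg_edges[edge] + sg_edges[edge] for edge in shared)


tg = {("free", "cash"): 1, ("win", "free"): 1}
sg = {("free", "cash"): 2, ("call", "now"): 3}
fw = feature_weight(tg, sg)  # only ("free", "cash") is shared
```

Graphs with no edge in common get a similarity of zero, so they never rank among the nearest neighbors.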

Page 11: Graph-based KNN Algorithm for Spam SMS Detection

Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)

Nfp: how many nodes in the sample graph with weights larger than 0 also appear in the test graph

If Nfp > threshold, calculate FW(tg, sg_1)

List (RL):
Rank  Feature weight   Class
1     FW(tg,sg1)=2     Spam
2     FW(tg,sg2)=3     Spam
…     FW(tg,sg3)=4     Normal
K     FW(tg,sg4)=5     Spam

(Other values shown on the slide: 0.0001, 3)

Page 12: Graph-based KNN Algorithm for Spam SMS Detection

Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)

If Nfp > threshold, calculate FW(tg, sg_5) = 6

List (RL):
Rank  Feature weight   Class
1     FW(tg,sg5)=6     Spam
2     FW(tg,sg2)=3     Spam
…     FW(tg,sg3)=4     Normal
K     FW(tg,sg4)=5     Spam

Result: spam message
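The ranked-list procedure on these two slides can be sketched as below. The graph layout (dictionaries with "nodes", "edges", "label"), the shared-edge similarity, and the function names are assumptions of this example; only the Nfp filter, the list of K best matches, and the majority vote come from the slides.

```python
import heapq
from collections import Counter


def nfp(tg: dict, sg: dict) -> int:
    """Count sample-graph nodes with weight > 0 that also appear in the test graph."""
    return sum(1 for word, w in sg["nodes"].items() if w > 0 and word in tg["nodes"])


def shared_edge_weight(tg: dict, sg: dict) -> float:
    """Assumed similarity: sum of both weights over edges present in both graphs."""
    shared = tg["edges"].keys() & sg["edges"].keys()
    return sum(tg["edges"][e] + sg["edges"][e] for e in shared)


def classify(tg: dict, samples: list[dict], k: int, threshold: int):
    """Keep the K samples with the largest FW and return the majority class."""
    rl = []  # min-heap of (FW, sample index, label); rl[0] is the weakest entry
    for idx, sg in enumerate(samples):
        if nfp(tg, sg) <= threshold:
            continue  # too few common nodes, FW is not calculated
        entry = (shared_edge_weight(tg, sg), idx, sg["label"])
        if len(rl) < k:
            heapq.heappush(rl, entry)
        elif entry > rl[0]:
            heapq.heapreplace(rl, entry)  # a better match replaces the weakest
    if not rl:
        return None  # no sample passed the Nfp filter
    votes = Counter(label for _, _, label in rl)
    return votes.most_common(1)[0][0]


# Toy data chosen so that FW = 2, 3, 4, 5, 6, as in the slide's list.
tg = {"nodes": {"a": 1, "b": 1}, "edges": {("a", "b"): 1}}
labels = ["spam", "spam", "normal", "spam", "spam"]
samples = [
    {"nodes": {"a": 1, "b": 1}, "edges": {("a", "b"): w}, "label": lab}
    for w, lab in zip([1, 2, 3, 4, 5], labels)
]
result = classify(tg, samples, k=4, threshold=1)
```

In this toy run the fifth sample (FW = 6) pushes the first one (FW = 2) out of the list, and the remaining entries vote three to one for spam.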

Page 13: Graph-based KNN Algorithm for Spam SMS Detection

* NUS SMS Corpus (5,574 messages): 4,827 normal (86.6%), 747 spam (13.4%)
* [Uysal and Yildiz] SMS collection (875 messages): 450 normal, 425 spam

Page 14: Graph-based KNN Algorithm for Spam SMS Detection


Page 15: Graph-based KNN Algorithm for Spam SMS Detection

[Figure: results, with axes in (%) and (seconds)]

Page 16: Graph-based KNN Algorithm for Spam SMS Detection

* Spam SMS messages are evolving, so keywords are hard to capture.
* Ex) 대★출 ("loan" with a symbol inserted), 이ㅈr ("interest" in obfuscated spelling), <<통>> / <<장>> ("bankbook" split into parts), no spaces or punctuation, no specific keyword, the same content sent from other phone numbers, image-only messages with no words …
* Graph patterns of communication between sender and receiver should be combined with the content-based approach.