[ieee 2008 ieee international workshop on semantic computing and applications (iwsca) - incheon,...

5
Semantic Analysis of User Behaviors for Detecting Spam Mail Asung Han 1 , Hyun-Jun Kim 1 , Inay Ha 1 and Geun-Sik Jo 2 1 Intelligent E-Commerce Systems Lab., Inha University, Incheon, Korea 2 School of Computer Science & Engineering, Inha University, Incheon, Korea {tinypree, inay}@eslab.inha.ac.kr, [email protected], [email protected] Abstract According to continuous increasing of spam email, 92.6% of recent total email is known spam email. In this research, we will show an adaptive learning system that filter spam emails based on user’s action pattern as time goes by. In this paper, we consider relationship between user‘s actions such as what action is took after one action and how long does it take. They analyze that each action has how much meaning, and that it has an effect on filtering spam emails. And that in turn determines weight for each email. In experimentation, we will compare results of system of this research and weighted Bayesian classifier using real email dataset. Also, we will show how to handle personalization for concept drift and adaptive learning. 1. Introduction Recently, rate of spam email is increasing continuously and its increasing rate is 20% per year. According to a major internet security company, 92.6% of total emails in last year were spam. For solving this problem, variety email filtering methods such as machine learning or statistical technology have been proposing. For instance, Bayesian classification, SVM (Support vector machine), Neural network classification, case-based reasoning, etc. are used to prevent spam emails. In this research, we will show an adaptive learning system that filter spam emails based on user’s action pattern as time goes by. User’s interest change continuously, so filtering system is required to learn and reflect this. This is concept of “Concept Drift”. This means that user may show his or her interest after some time to email concerned spam. In similarly, user’s interest about some subject of email may decrease gradually. In this paper, we consider relationship between user’s actions such as what action is taken after one action and how long does it take. In addition, we make efforts to reduce inaccuracy of criterion for category of actions and to analyze user’s intention of each action more clearly. Therefore, we expect increased learning rate and better performance of learning by applying meanings for additional actions to learned email. 2. Related Work 2.1. Various Methods for Filtering Spam Mail Spam filtering based on rule or message is a method that identifies spam features from email and decide whether the email were spam or not. This approach is used by many filtering system because of its simplicity for implementation. But it has a limitation to update rule continuously every time features of spam change. Next, statistical approach to overcome rule or message- based system’s limitations is proposed. The most representative method is Bayesian classification which analyzes terms of email and classifies spam email statistically. So it is better than other systems for adaptation to changing environment, show a good performance. Proved Bayesian classification-based systems such as Bayesian Network, Naïve Bayesian classification, Weighted Bayesian classification have been using. Machine learning based approach such as SVM, LSI (Latent semantic indexing) with Bayesian classification is a method that prevent spam emails read once by user. But it is difficult to use in real world, because it is complex to calculate, and difficult to management. 2.2. Concept Drift and Adaptive Learning Domain of spam change rapidly, and user interests and concerns change with time also. So we need recognize that change and reflect to learning. This is called Concept Drift or Interest Drift. Required concept for this need is Adaptive Learning. It means that learn IEEE International Workshop on Semantic Computing and Applications 978-0-7695-3317-9/08 $25.00 © 2008 IEEE DOI 10.1109/IWSCA.2008.38 91

Upload: geun-sik

Post on 27-Mar-2017

216 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: [IEEE 2008 IEEE International Workshop on Semantic Computing and Applications (IWSCA) - Incheon, Korea (South) (2008.07.10-2008.07.11)] 2008 IEEE International Workshop on Semantic

Semantic Analysis of User Behaviors for Detecting Spam Mail

Asung Han1, Hyun-Jun Kim1, Inay Ha1 and Geun-Sik Jo2

1 Intelligent E-Commerce Systems Lab., Inha University, Incheon, Korea 2 School of Computer Science & Engineering, Inha University, Incheon, Korea

{tinypree, inay}@eslab.inha.ac.kr, [email protected], [email protected]

Abstract

According to continuous increasing of spam email,

92.6% of recent total email is known spam email. In this research, we will show an adaptive learning system that filter spam emails based on user’s action pattern as time goes by. In this paper, we consider relationship between user‘s actions such as what action is took after one action and how long does it take. They analyze that each action has how much meaning, and that it has an effect on filtering spam emails. And that in turn determines weight for each email. In experimentation, we will compare results of system of this research and weighted Bayesian classifier using real email dataset. Also, we will show how to handle personalization for concept drift and adaptive learning. 1. Introduction

Recently, rate of spam email is increasing continuously and its increasing rate is 20% per year. According to a major internet security company, 92.6% of total emails in last year were spam. For solving this problem, variety email filtering methods such as machine learning or statistical technology have been proposing. For instance, Bayesian classification, SVM (Support vector machine), Neural network classification, case-based reasoning, etc. are used to prevent spam emails. In this research, we will show an adaptive learning system that filter spam emails based on user’s action pattern as time goes by. User’s interest change continuously, so filtering system is required to learn and reflect this. This is concept of “Concept Drift”. This means that user may show his or her interest after some time to email concerned spam. In similarly, user’s interest about some subject of email may decrease gradually. In this paper, we consider relationship between user’s actions such as what action is taken after one action and how long does it take. In

addition, we make efforts to reduce inaccuracy of criterion for category of actions and to analyze user’s intention of each action more clearly. Therefore, we expect increased learning rate and better performance of learning by applying meanings for additional actions to learned email. 2. Related Work 2.1. Various Methods for Filtering Spam Mail

Spam filtering based on rule or message is a method

that identifies spam features from email and decide whether the email were spam or not. This approach is used by many filtering system because of its simplicity for implementation. But it has a limitation to update rule continuously every time features of spam change. Next, statistical approach to overcome rule or message-based system’s limitations is proposed. The most representative method is Bayesian classification which analyzes terms of email and classifies spam email statistically. So it is better than other systems for adaptation to changing environment, show a good performance. Proved Bayesian classification-based systems such as Bayesian Network, Naïve Bayesian classification, Weighted Bayesian classification have been using. Machine learning based approach such as SVM, LSI (Latent semantic indexing) with Bayesian classification is a method that prevent spam emails read once by user. But it is difficult to use in real world, because it is complex to calculate, and difficult to management. 2.2. Concept Drift and Adaptive Learning

Domain of spam change rapidly, and user interests

and concerns change with time also. So we need recognize that change and reflect to learning. This is called Concept Drift or Interest Drift. Required concept for this need is Adaptive Learning. It means that learn

IEEE International Workshop on Semantic Computing and Applications

978-0-7695-3317-9/08 $25.00 © 2008 IEEE

DOI 10.1109/IWSCA.2008.38

91

Page 2: [IEEE 2008 IEEE International Workshop on Semantic Computing and Applications (IWSCA) - Incheon, Korea (South) (2008.07.10-2008.07.11)] 2008 IEEE International Workshop on Semantic

user interest and feature of spam with track changing concepts. It reflects changing interests and features to learning in real time.

3. Spam Filtering with User Actions and Time

Fig.1. shows a system architecture proposed in this

paper. The system learns terms of email with weights calculated by user action patterns and time delay for emails. And the system permits a re-classification for terms to update a weight as new action is added to email. At the test phase, terms of email are parameter to Weighted Bayesian Classifier, and email is filtered. Besides, when action pattern for this email is taken, learn the email with calculated weight for it circularly. Therefore learning can be updated continuously.

Fig. 1. System Architecture

3.1. User action based adaptive learning with weighted Bayesian classification

For filtering spam mail before describing the

algorithms, some definitions of the matrices are introduced. A lot of existing spam email filtering systems cause inconvenience that user should give feedback for rating for spam email to systems directly. To solve this limitation, a system that user action to email gives a feedback to the system automatically was introduced. This method isn’t needs to check user preference for email so that user can save a trouble and reduce waste of time. Especially, base actions for emails is divided into six actions as ‘open’, ‘delete’, ‘save’, ‘reply’, ‘block’, ‘nothing’. And combined two or three actions of them is one sequence. So a weight for email is calculated, and that email is learned. This method use Weighted Bayesian Classification to filter spam emails. Weighted Bayesian Classification shows a better performance than base Bayesian Classification

and [UTF-8?] Naïve Bayesian Classification. In next research, we will subdivide user actions to discriminate a weight of each action and will discriminate grade of weight as time delay. And finally we will show a better performance of learning. 3.2. Categorization for user action Each action is included into positive category or negative category. Positive category includes ‘open’, ‘save’, ‘reply’, and negative category includes ‘delete’, ‘block’. Each action has a default weight, so the system calculates a final weight based on it. We determine default weights by a heuristic that consider user’s intention and its importance for each action

Table 1. Default weight for actions

Action Default Weight Category Save 0.6

Positive Reply 0.4 Open 0.2 Delete 0.4 Negative Block 0.8

3.3. Grouping Action Patterns for Weighting Action recognizer saves email ID, action type, start

time, end time to action profile when user takes an action for email. A system gives a weight to each action of profile, and learns email with an updated weight. A weighting method is based on additional operation basically. To calculate weights, we group action pattern, that is, relation between previous action and current action into three cases, and give different functions to each case separately. Computing weight of each action is based on default weight of it. Above mentioned, all action patterns have different meanings because of relation between actions, time delay, action frequency, and etc. For instance, when user deletes an email after open it, there are two possibilities that delete it right after read and that after a good while. It can cause reduced accuracy of filtering. Therefore we design a method that gives a differentiated weight to email with time delay between ‘ open’ and ‘delete’ action. First of all, we group relations between previous and current action to three cases to calculate a weight of action. Let’s assume that there is i th action ia , previous action 1−ia , set of

positive action P , and we give as follows.

case 1. Pai ∈−1 and Pai ∈

92

Page 3: [IEEE 2008 IEEE International Workshop on Semantic Computing and Applications (IWSCA) - Incheon, Korea (South) (2008.07.10-2008.07.11)] 2008 IEEE International Workshop on Semantic

case 2. Pai ∈−1 and ''blockai = (1)

case 3. Pai ∈−1 and ''deleteai = Case1 has much more possibilities that email is not spam than when positive action is taken after taking positive action. So there is as follows. Let’s weight for non-spam category Nw , weight for spam category

Sw .

As a result, weight of || SN ww − is learned for non-spam category.

SN ww >>

And if we assume that there is iw , weight for ia ,

default weight idw , then id ww

i< . So we could

improve performance as decreasing iw with passing time gradually. Case2 has more possibilities of spam than possibilities of non-spam comparatively. So in this case, email is learned for spam category.

SN ww < A weight is start to 0.8, default weight of ‘block’ action, and change up to 0.4, default weight ‘delete’ action. The reason is to get bigger weight than weight of ‘delete’ action at least, lower level of ‘block’, even though time delay become longer unlimitedly. In case3, there are all possibilities of spam or non-spam. So a different approach is needed to this case as compared with above weighting approaches. Let’s time interval between actions t, weights for two categories are as follows.

NS ww > - > NS ww <

0)(lim0

<−>− SNt

ww , 0)(lim >−∞>− SNt

ww

Therefore, in case3, email is classified to spam or non-spam depending on degree of time delay. 3.4. Weighting function To make a weight function of Eq.(1), we can apply

time delay. Measured time delay between 1−ia and ia

with by the second goes into weighting function )(tf . For )(tf , we use logarithmic function. That is

)1log()( += ttf . The reason of using a logarithm is a fading of importance or meaning of action. When current action is taken after previous action

immediately, it has much more possibilities of nonspam than taking current action after a good while for taking a previous action. So weight is in inverse proportion to time delay, and retardation by time delay decrease gradually.

Therefore we calculate a weight with reflection a logarithmic property. Weighting function using logarithm is as follows.

case 1. ii ddi w

ttww *

)log(max)1log(2.0 +−+=

case 2. ii ddi w

ttww *

)log(max)1log( +−= (2)

case 3. ii ddi w

ttww *

)log(max)1log(1.0 +−−=

Maxt is the threshold of t , it is to prevent getting

longer unlimitedly. From this formulation, the system can calculate final two weights of categories for email,

Nw and Sw as follows. And the system learns email for two categories both. In result, the email is learned for category that has larger weight than another category with a weight of || SN ww − . If we assume

that 1−ma is learned for email until now, na is last action for learning, and U is a set of all actions, then we calculate weights for two categories, positive and negative category as follows.

∑=

+=n

miiN ww 1 , Pam ∈ , Pan ∈

∑=

+=n

miiS ww 1 , Uam ∈ , Uan ∈ (3)

3.5. Weighted Bayesian Classification Weights for two categories are given to each term

through learning process. When new email is arrived, Weighted Bayesian Classifier filters it as spam or nonspam using weighted terms. Naïve Bayesian Classifier handle all terms as same weight, but Weighted Bayesian Classifier handle each term as differentiated weight. So it has a high speed performance of learning. Formulation of Weighted Bayesian Classifier is as follows.

∏= ⎟

⎜⎜

∈=

n

kik

tj

ti

i

ctPw

wcP

Ccc

j

k

1

)|(max

)(maxarg (4)

93

Page 4: [IEEE 2008 IEEE International Workshop on Semantic Computing and Applications (IWSCA) - Incheon, Korea (South) (2008.07.10-2008.07.11)] 2008 IEEE International Workshop on Semantic

The system calculates weights of new email for both of spam and nonspam categories, classify the email to category that has a larger weight than another. To normalize weights, each weight of terms is divided to the largest weight of all terms. 4. Experimental Evaluation

In this section, experimental results of the proposed method are presented. All ex-periments are performed on a 2.79GHz Pentium PC with 1.49GB RAM, running on a Microsoft Window XP Server. The proposed system is implemented using Java, My-SQL.

4.1. Data Set and Evaluation Metrics Dataset for experiments is a set of LingSpam emails,

and we group it to spam email and nonspam email to use. Spam emails can be duplicated; tested email once can be used for learning again.

Table 2. Dataset

Training Data Test Data Spam Mail 1,328 1,301

Nonspam Mail 147 152 Total 1,475 1,453

Like table2, we assume we take fifty emails a day

during 20 days. The proportion of spam emails to nonspam emails is fixed. Learning is performed one time at a day. We observe accuracy and a speedup of accuracy and a result of filtering. To evaluate the system’s performance, we used the most commonly used evaluation measures – precision, recall and F1-measure as defined below.

spam as classified are which Emailsspam as classified are which emails SpamPr =ecision

emails spam Allspam as classified are which emails SpamRe =call

callecisioncallecisionmeasureF

RePrRe*Pr*21

+=− (5)

4.2. Experimental Result Figure.2 show accuracy of filtering as time goes by.

0

0.2

0.4

0.6

0.8

1

1.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Number of Train ing (Day )

F1-

measure

Fig. 2. Accuracy of filtering by proposed system

0

0.2

0.4

0.6

0.8

1

1.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Number of Training (Day)

F1-m

easure

Fig. 3. Accuracy of filtering by “User action based adaptive

learning with WBC for filtering spam mail”

From Fig.2, the proposed system shows correct performance comparatively at first. And speedup of accuracy is rating that was average 95 percent above original Bayesian Classifier’s average, although the ups and downs is high comparatively. Comparing of Fig.2 and Fig.3 shows that our system’s learning speedup is higher comparatively. 6. Conclusion and Future Work In this paper, we proposed a system of spam email

filtering using differentiated weight as user action pattern and time delay between actions. This system enhanced a performance of learning as compare with Bayesian Classifier and basic Weighted Bayesian Classifier, and especially, improved learning speed in short time comparatively. It would seem to be a result that applies adaptive learning that when action is added, the system update and learning weight of it every times to the system. This paper deals with an aspect of personalization also. Users can give more

94

Page 5: [IEEE 2008 IEEE International Workshop on Semantic Computing and Applications (IWSCA) - Incheon, Korea (South) (2008.07.10-2008.07.11)] 2008 IEEE International Workshop on Semantic

correct weights to emails by their interest, and can reflect individual interest flow continuously. For the future work, we will increase the performance by analyzing relation between actions correctly. And also we will study the method that reflects means of user action to the filtering system correctly. 7. References [1] ZDNet Korea, http://zdnet.co.kr [2] Digital Times,

http://www.dt.co.kr/contents.html?article_no=2006 122802011457730008

[3] H. J. Kim, J. Shrestha, H. N. Kim, G. S. Jo, "User Action Based Adaptive Learning with Weighted Bayesian Classification for Filtering Spam Mail", AI 2006, LNAI 4304, pp. 790-798, 2006 [4] Cohen, W. W., “Learning Rules that Classify E-Mail”, AAAI Spring Symposium, 1996 [5] Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., Spyropoulos, C. D., “An Evaluation of Naive Bayesian Anti-Spam Filtering”, ECML 2000, pp. 9-17, 2000 [6] Thomas, G., Peter, A. F., “Weighted Bayesian Classification based on Support Vector Machine”, International Conference on Machine Learning, pp. 207-209, 2001 [7] Koychev, I., Schwab, I., “Adaption to Drifting User’s Interests”, ECML200/MLnet Workshop, 2000 [8] Dwi H. Widyantoro, Thomas R. loerger, John Yen, “An Adaptive Algorithm for Learning Changes in User Interests”,

International conference on Information and knowledge management, pp. 405-412, 1999 [9] M. Albanese, A. Picariello, C. Sansone, L. Sansone, “Web Personalization Based on Static Information and Dynamic User Behavior”, ACM international workshop, pp. 80-87, 2004 [10] M. Morita, Y. Shinoda, “Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval”, ACM SIGIR conference, pp. 272-281, 1994 [11] Y. W. Seo, B. T. Zhang, “Learning User’s Preferences by Analyzing Web-Browsing Behaviors”, International conference on Autonomous agents, pp. 381-387, 2000 [12] Sahami, M., Dumais, S., Heckerman, D., Horvitz, E., “A Bayesian Approach to Filtering Junk E-Mail”, AAAI, pp. 55-62, 1998 [13] Burklen, S., Marron, P.J., Fritsch, S., Rothermel, K., "User Centric Walk: An Integrated Approach for Modeling the Browsing Behavior of Users on the Web", Annual Simulation Symposium, 2005 [14] E. Agichtein, Z. Zheng, ”Identifying ‘Best Bet’ Web Search Results by Mining Past User Behavior”, ACM SIGKDD international conference, pp. 902-908, 2006 [15] Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., Spyropoulos, C. D., “An Evaluation of Naive Bayesian Anti-Spam Filtering”, ECML2000, pp. 9-17, 2000 [16] Y. Yang, X. Liu, “A Re-examination of Text Categorization Methods”, ACM SIGIR’99, 1999

95