Adding Semantics to Email Clustering Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang Microsoft Research Asia, Beijing, P.R.China Department of Computer Science and Engineering, Hong Kong University of Science and Technology (ICDM 2006) Date: 2008/10/16 Speaker: Cho Chin Wei Advisor: Dr. Koh, JiaLing




Abstract

This paper presents a novel algorithm to cluster emails according to their contents and the sentence styles of their subject lines. The algorithm aims to improve the clustering performance and to provide meaningful descriptions of the resulting clusters.


Introduction

Email has become an important medium of communication.

A user may receive tens or even hundreds of emails every day. Handling these emails takes much time.

It is necessary to provide some automatic approaches to relieve the burden of processing the emails.


Introduction

Supervised methods need a predefined taxonomy. They require users to update the taxonomy manually, and preparing training data is time-consuming and expensive.

An unsupervised technique such as clustering is an attractive alternative.


Introduction

Conventionally, email clustering is based on a bag-of-words representation.

This simplistic approach cannot take full advantage of valuable linguistic features inherent in the semi-structured emails, which may result in unsatisfactory performance.


Method

In this paper, we present a novel technique to cluster emails according to the sentence patterns discovered from the subject lines of the emails.

Each subject line is treated as a sentence and parsed through natural language processing techniques.


Method

Automatically generate meaningful generalized sentence patterns (GSPs) from the subjects of emails, using natural language processing techniques and frequent itemset mining techniques.

Then, put forward a novel unsupervised approach that treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved.


What is GSP?

A GSP (Generalized Sentence Pattern) helps summarize the subjects of a large number of similar emails, which results in a semantic representation of the subject lines.

Ex: the GSP is {“person”, “seminar”, “date”}, which means that someone (“person”) gives a “seminar” on some day (“date”).


Generalization of Terms - NLPWin

Microsoft’s NLPWin tool, a natural language parser, takes a sentence as input and builds a syntactic tree for the sentence.


Generalization of Terms - NLPWin

The NLPWin tool can generate factoids for noun phrases, e.g. person names, dates and places.


Mine GSP

For each email subject:

1. Remove stop words.
2. Use NLPWin to generate its syntactic tree.
3. Add the factoids of the nodes to the email subject.

The resulting email subjects are called generalized sentences.

Subsets of GSPs are still GSPs.

Ex: {welcome, Bruce, Lee, person} and {welcome, Bob, Brill, person} are generalized sentences.
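The three steps above can be sketched roughly as follows. NLPWin is not publicly available, so this sketch assumes a tiny hand-written lookup table (`FACTOIDS`, hypothetical) in place of its factoid output, a made-up stop-word list (`STOP_WORDS`), and lowercases all terms for simplicity; none of these choices come from the paper.

```python
# Hypothetical stand-ins: a fixed stop-word list and a lookup table that
# plays the role of NLPWin's factoid output for noun phrases.
STOP_WORDS = {"to", "the", "a", "an", "of", "on", "by"}
FACTOIDS = {"bruce": "person", "lee": "person", "bob": "person", "brill": "person"}


def generalize(subject):
    """Turn a subject line into a generalized sentence (a set of terms)."""
    words = [w.lower() for w in subject.split()]
    # Step 1: remove stop words.
    terms = {w for w in words if w not in STOP_WORDS}
    # Steps 2-3: add the factoid of every term that has one.
    terms |= {FACTOIDS[w] for w in terms if w in FACTOIDS}
    return terms


s1 = generalize("Welcome to Bruce Lee")
s2 = generalize("Welcome Bob Brill")
```

Both subjects end up sharing the terms `welcome` and `person`, which is what lets a pattern such as {welcome, person} emerge in the mining step.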


Mine GSP

Definition: Generalized Sentence Pattern (GSP)

Given a set of generalized sentences $S = \{s_1, s_2, \ldots, s_n\}$ and a generalized sentence $p$, the support of $p$ in $S$ is defined as

$$\mathrm{sup}(p) = |\{\, s_i \mid s_i \in S \text{ and } p \subseteq s_i \,\}|$$

Given a minimum support min_sup, if $\mathrm{sup}(p) > \mathrm{min\_sup}$, then $p$ is called a generalized sentence pattern, or GSP for short.

Ex: {welcome, person} and {interview, date}. Only closed GSPs are mined in our experiment.
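To make the definition concrete, here is a minimal brute-force sketch of support counting and closed-pattern mining. It is not the miner used in the paper (which relies on a frequent closed itemset mining technique); the function names and the toy data are assumptions for illustration, and enumerating all subsets is only practical for very short subject lines.

```python
from itertools import combinations


def support(p, sentences):
    """Number of generalized sentences that contain every term of p."""
    return sum(1 for s in sentences if p <= s)


def mine_closed_gsps(sentences, min_sup):
    """Enumerate subsets of each sentence and keep the closed frequent ones."""
    candidates = set()
    for s in sentences:
        for r in range(1, len(s) + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(s), r))
    counts = {p: support(p, sentences) for p in candidates}
    # Keep patterns with sup(p) > min_sup, as in the definition above.
    frequent = {p: n for p, n in counts.items() if n > min_sup}
    # A pattern is closed if no proper superset has the same support.
    return {
        p: n
        for p, n in frequent.items()
        if not any(p < q and n == m for q, m in frequent.items())
    }


sentences = [
    {"welcome", "bruce", "lee", "person"},
    {"welcome", "bob", "brill", "person"},
    {"interview", "date", "monday"},
    {"interview", "date", "friday"},
]
gsps = mine_closed_gsps(sentences, min_sup=1)
```

On this toy input, {welcome} and {person} are frequent but not closed, because {welcome, person} has the same support; only the two longer patterns survive.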


GSPs grouping and selection

Grouping similar GSPs together to reduce the number of GSPs.

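GSPs in the same group represent the same cluster. The similarity between two GSPs $p$ and $q$ is defined based on their subset-superset relationship and support:

$$\mathrm{Sim}(p, q) = \begin{cases} 1, & \text{if } p \supset q \text{ and } \mathrm{sup}(p) / \mathrm{sup}(q) > \mathrm{min\_conf} \\ 0, & \text{otherwise} \end{cases}$$

Experimental results suggest setting min_conf between 0.5 and 0.8. If it is too low, more GSPs are grouped and noise is introduced; if it is too high, similar GSPs fail to be grouped.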
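The similarity rule can be written as a small function. In this sketch the support values are invented for illustration, and the default `min_conf=0.6` is simply one value inside the 0.5 to 0.8 range mentioned above.

```python
def sim(p, q, sup, min_conf=0.6):
    """Return 1 if p is a proper superset of q and sup(p)/sup(q) > min_conf, else 0."""
    if p > q and sup[p] / sup[q] > min_conf:
        return 1
    return 0


# Illustrative (made-up) supports for three GSPs.
welcome_person = frozenset({"welcome", "person"})
welcome = frozenset({"welcome"})
person = frozenset({"person"})
sup = {welcome_person: 8, welcome: 10, person: 40}

grouped = sim(welcome_person, welcome, sup)    # 8 / 10 = 0.8 > 0.6
not_grouped = sim(welcome_person, person, sup)  # 8 / 40 = 0.2 <= 0.6
```

Here {welcome} almost always occurs together with "person", so it joins the group of {welcome, person}; {person} occurs in many other contexts and stays separate.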


GSPs grouping and selection

The number of GSP groups can be several times larger than the actual number of clusters.

We select representative GSP groups using heuristic rules:

1. Sort the GSP groups in descending order of length.
2. Sort them in descending order of support.
3. Select the first sp_num GSP groups for clustering.

The parameter sp_num controls how many GSP groups are selected for clustering.

A longer GSP is more confident than a shorter one.
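One possible reading of the selection heuristic is sketched below. The slides do not say how the two sorts combine, so this sketch assumes that length is the primary key and support breaks ties, and that each group is represented by a single (pattern, support) pair; both are assumptions.

```python
def select_groups(groups, sp_num):
    """groups: list of (pattern, support) pairs, one representative per GSP group."""
    # Assumed ordering: longer patterns first, higher support first among equals.
    ranked = sorted(groups, key=lambda g: (len(g[0]), g[1]), reverse=True)
    return ranked[:sp_num]


groups = [
    (frozenset({"welcome"}), 10),
    (frozenset({"welcome", "person"}), 8),
    (frozenset({"interview", "date"}), 9),
]
selected = select_groups(groups, sp_num=2)
```

With this ordering the single-term pattern is dropped even though it has the highest support, which matches the remark that a longer GSP is more confident than a shorter one.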


GSP as pseudo class label

Based on GSPs, we propose a novel clustering algorithm that forms a pseudo class for the emails matching the same GSP group, and then uses a discriminative variant of the Classification Expectation Maximization (CEM) algorithm to get the final clusters.

When CEM is applied to document clustering, the high dimensionality usually causes inaccurate model estimation and degrades efficiency. Here a linear SVM is used as the underlying classifier.


GSP as pseudo class label

The proposed GSP-PCL algorithm uses GSP groups to construct initial pseudo classes.

Only the emails not matching any GSP group are classified.


GSP as pseudo class label

A threshold is defined to control whether an email is put into a pseudo class. An email is put into the class with the maximal posterior probability only when that probability is greater than the given threshold.

Otherwise the email is put into a special class $D_{other}$.
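The threshold rule amounts to an argmax followed by a comparison. A minimal sketch, assuming the posteriors are already available as a dictionary (the class names and probabilities here are invented):

```python
def assign(posteriors, min_class_prob):
    """posteriors: dict mapping pseudo-class name -> P(class | email)."""
    best = max(posteriors, key=posteriors.get)
    # Accept the best class only if it clears the threshold.
    return best if posteriors[best] > min_class_prob else "D_other"


confident = assign({"seminar": 0.7, "welcome": 0.2}, min_class_prob=0.5)
uncertain = assign({"seminar": 0.4, "welcome": 0.35}, min_class_prob=0.5)
```

The second email has a best class but not a convincing one, so it is deferred to the special class rather than forced into a pseudo class.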


Algorithm: GSP-PCL

GSP-PCL($k$, GSP groups $G_1, G_2, \ldots, G_{sp\_num}$, email set $D$)

1. Construct sp_num pseudo classes using GSP groups: $D_i^0 = \{d \mid d \in D \text{ and } d \text{ matches } G_i\}$, $i = 1, 2, \ldots, sp\_num$.

2. $D' = D - \bigcup_{i=1}^{sp\_num} D_i^0$

3. Iterate until convergence. For the $j$-th iteration, $j > 0$:
   a) Train an SVM classifier based on $D_i^{j-1}$, $i = 1, \ldots, sp\_num$.
   b) For each email $d \in D'$, classify $d$ into $D_i^{j-1}$ if $P(D_i^{j-1} \mid d)$ is the maximal posterior probability and $P(D_i^{j-1} \mid d) > \mathrm{min\_class\_prob}$.

4. $D_{other} = D - \bigcup_{i=1}^{sp\_num} D_i^j$

5. Use basic k-means to partition $D_{other}$ into $(k - sp\_num)$ clusters.
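Steps 1 to 4 can be sketched in plain Python as below. This is an illustration under several assumptions, not the authors' implementation: a term-frequency centroid with normalised cosine similarity stands in for the linear SVM and its posterior estimates, an email "matches" a group when it contains any pattern of that group, an email matching several groups goes to the first one, and step 5 (k-means on the leftover emails) is left to the caller. All function names are hypothetical.

```python
import math
from collections import Counter


def matches(email, group):
    """Assumed matching rule: the email's term set contains some pattern of the group."""
    return any(p <= email for p in group)


def train_centroids(classes):
    """Stand-in for SVM training: one term-frequency centroid per pseudo class."""
    return [Counter(t for e in c for t in e) for c in classes]


def posteriors(email, centroids):
    """Cosine similarity to each centroid, normalised to sum to 1 (crude posterior)."""
    sims = []
    for c in centroids:
        dot = sum(c[t] for t in email)
        norm = math.sqrt(sum(v * v for v in c.values())) * math.sqrt(len(email))
        sims.append(dot / norm if norm else 0.0)
    total = sum(sims)
    return [s / total if total else 0.0 for s in sims]


def gsp_pcl(groups, emails, min_class_prob=0.6, max_iter=10):
    if not groups:
        return [], list(emails)
    # Step 1: seed pseudo classes D_i^0 from the GSP groups.
    seeds, unmatched = [[] for _ in groups], []
    for e in emails:
        hit = next((i for i, g in enumerate(groups) if matches(e, g)), None)
        (seeds[hit] if hit is not None else unmatched).append(e)  # Step 2: D'
    classes = [list(s) for s in seeds]
    labels = [None] * len(unmatched)
    # Step 3: retrain and reclassify D' until the assignment stops changing.
    for _ in range(max_iter):
        centroids = train_centroids(classes)
        new_labels = []
        for e in unmatched:
            probs = posteriors(e, centroids)
            best = max(range(len(probs)), key=probs.__getitem__)
            new_labels.append(best if probs[best] > min_class_prob else None)
        if new_labels == labels:
            break
        labels = new_labels
        classes = [list(s) for s in seeds]
        for e, lab in zip(unmatched, labels):
            if lab is not None:
                classes[lab].append(e)
    # Step 4: whatever was never accepted forms D_other.
    d_other = [e for e, lab in zip(unmatched, labels) if lab is None]
    return classes, d_other


groups = [[frozenset({"welcome", "person"})], [frozenset({"interview", "date"})]]
emails = [
    {"welcome", "bruce", "person"},
    {"welcome", "bob", "person"},
    {"interview", "date", "monday"},
    {"interview", "date", "friday"},
    {"welcome", "party"},
    {"lunch", "menu"},
]
classes, d_other = gsp_pcl(groups, emails)
```

In the toy run, the "welcome party" email shares a term with the first pseudo class and is absorbed into it, while the "lunch menu" email resembles neither class and is left in `d_other` for the k-means step.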


Experiments

1. Enron email dataset: an email archive from many of the senior management of Enron Co.

2. Private email dataset


Experiments

Let $C = \{C_1, \ldots, C_m\}$ be a set of clusters generated by a clustering algorithm on a data set $X$, and $B = \{B_1, \ldots, B_n\}$ be a set of predefined classes on $X$.

Each pair of two objects $(X_i, X_j)$ from the data set $X$ belongs to one of the four possible cases:


Experiments

Precision, recall and F1-measure are calculated as follows:
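The four cases and the formulas were shown on the slides and are not preserved in this transcript. The sketch below therefore assumes the standard pair-counting definitions: a pair in the same cluster and same class is a true positive, same cluster but different class is a false positive, and different cluster but same class is a false negative. Whether the paper uses exactly these definitions is an assumption.

```python
from itertools import combinations


def pairwise_prf(clusters, classes):
    """clusters, classes: parallel lists giving each object's cluster id and class id."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        # Pairs separated in both partitions (the fourth case) are not counted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = pairwise_prf([0, 0, 1, 1], ["a", "a", "a", "b"])
```

For this four-object example there is one true-positive pair, one false-positive pair and two false-negative pairs, giving a precision of 0.5, a recall of 1/3 and an F1 of 0.4.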


Cluster Naming

In GSP-PCL, cluster names are generated as follows.

If emails in one cluster match one or more GSP groups, the cluster is named by the GSP with the highest support. Otherwise it is named by the top five words, ranked by a score defined over the following quantities:

$C_j$ denotes the cluster, $d_i$ is an email, $t_k$ is a word, and $t_{ki}$ is the weight of word $t_k$ in the email $d_i$.
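The naming rule can be sketched as below. The score formula itself is not preserved in this transcript, so the sketch assumes the simplest candidate, the sum of the weights $t_{ki}$ over the emails $d_i$ in the cluster $C_j$; the actual formula in the paper may differ. The function name and the data layout are also assumptions.

```python
from collections import defaultdict


def name_cluster(cluster, matched_gsps=None, top_n=5):
    """cluster: list of emails, each a dict word -> weight (t_ki).
    matched_gsps: optional dict mapping a GSP (tuple of terms) -> support."""
    if matched_gsps:
        # Name the cluster by the matched GSP with the highest support.
        return list(max(matched_gsps, key=matched_gsps.get))
    scores = defaultdict(float)
    for email in cluster:
        for word, weight in email.items():
            scores[word] += weight  # assumed score: summed weight in the cluster
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


by_words = name_cluster(
    [{"lunch": 1.0, "menu": 0.5}, {"lunch": 0.5, "friday": 0.2}], top_n=2
)
by_gsp = name_cluster([], {("welcome", "person"): 8, ("interview", "date"): 3})
```

A GSP-based name such as ("welcome", "person") reads as a pattern of the subject lines, whereas the fallback name is only a list of heavily weighted words.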


Conclusions

In this paper, we proposed a novel approach to automatically extract embedded knowledge from email subjects to help improve email clustering.

Natural language processing and frequent closed itemset mining techniques are employed to generate generalized sentence patterns (GSPs for short) from email subjects, which can be used to assist clustering as well as serve as good cluster descriptors.