enron corpus: a new dataset for email classification by bryan klimt and yiming yang ceas 2004...

Enron Corpus: A New Dataset for Email Classification

By Bryan Klimt and Yiming Yang

CEAS 2004

Presented by Will Lee

Introduction

Outline: Motivation, Related Works, The Enron Corpus, Methods, Evaluation, Thread Information, Conclusion

Motivation

Other corpora focus on newsgroups or personal email data

No common data set exists for evaluating the performance of email classification
  Previous research uses different personal data sets

Actual email use within a company is difficult to find
  Companies do not like to share their internal emails
  Privacy concerns for people working for the company

Related Works

Other corpora
  20 Newsgroups
  http://people.csail.mit.edu/people/jrennie/20Newsgroups/

Related papers
  Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00)
  I. Androutsopoulos et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00)
  T. Payne, Learning Email Filtering Rules with Magi (Thesis 1994)

20 Newsgroups

Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups

Sample newsgroups: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.

Originally used in Ken Lang’s paper, Newsweeder: Learning to filter netnews (ICML 1995)

Built from newsgroup data, so probably not very useful for research in personal information management

Enron Dataset

619,446 messages (200,399 after cleaning) by 158 users

Average 757 messages per user

Shows that most users do use folders to organize emails

Folder information can be used to evaluate the effectiveness of folder classification

Enron Corpus’ Characteristics

Number of messages per user varies from a few messages to 10K+ messages

Upper bound on the number of folders seems to correlate with log(# of messages)

Number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)

Question: how can we use this kind of information?

Email Classification Features

Constructive text
  BOW approach, the feature used the most
  Some fields are more important than others
  Stemming and stop word removal are used, but their effectiveness is not proven

Categorical text
  “to” and “from” fields
  BOW; useful for classification, but not as useful as constructive text

Numeric data
  Size of message, number of replies, number of words, etc.
  Not very useful

Thread information
  Indicates how messages relate to each other
  Not fully exploited

Email Features (Example)

From: Mark Hills <[email protected]>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <[email protected]>
References: <[email protected]>
In-Reply-To: <[email protected]>

Joshua Blatt wrote:

> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark

Categorical text

Contextual text

Numeric data

Thread information
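The four feature groups labeled above can be pulled out of a raw message with Python's standard `email` module. This is a minimal sketch, not the extraction used in the paper: the `example.com` addresses, the `To` header, and the `extract_features` helper are hypothetical stand-ins, since the addresses in the slide's example are redacted and the example has no `To` field.

```python
from email import message_from_string

# Hypothetical message modeled on the slide's example (addresses are made up).
RAW = """\
From: Alice Example <alice@example.com>
To: bob@example.com
Subject: Re: When is the first lecture?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Message-ID: <reply-1@example.com>
In-Reply-To: <question-1@example.com>
References: <question-1@example.com>

The first lecture was today, during the normally scheduled time.
"""


def extract_features(raw):
    """Split one raw message into the four feature groups from the slides."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    return {
        # Categorical text: sender and recipient fields
        "categorical": {"from": msg["From"], "to": msg["To"]},
        # Free text: subject and body, later tokenized as a bag of words
        "text": {"subject": msg["Subject"], "body": body.strip()},
        # Numeric data: simple size measures
        "numeric": {"num_words": len(body.split()), "num_chars": len(body)},
        # Thread information: headers that link a reply to its ancestors
        "thread": {
            "message_id": msg["Message-ID"],
            "in_reply_to": msg["In-Reply-To"],
            "references": (msg["References"] or "").split(),
        },
    }


features = extract_features(RAW)
```

Keeping the groups separate makes it possible to weight or classify each field on its own, which is what the per-field SVM runs later in the talk rely on.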

Classification Method

Vector space model with SVM

Vector weight wi is evaluated using “ltc” (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means:
  l: new-tf = ln (tf) + 1.0
  t: new-wt = new-tf * log (num-docs/coll-freq-of-term)
  c: divide each new-wt by sqrt (sum of (new-wts squared))

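$$w_i = \frac{(\ln(tf_i) + 1.0)\,\log(N / n_i)}{\sqrt{\sum_{k \in \text{vector}} \left[(\ln(tf_k) + 1.0)\,\log(N / n_k)\right]^2}}$$

where $tf_i$ is the frequency of term $i$ in the message, $N$ is the number of documents (num-docs), and $n_i$ is the collection frequency of term $i$ (coll-freq-of-term).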
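The "ltc" weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: `ltc_weights` is a hypothetical helper, "coll-freq-of-term" is interpreted here as the number of documents containing the term, and the three toy documents are invented.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per document using ltc weighting.

    docs is a list of token lists. Assumption: coll-freq-of-term is taken
    to be the number of documents that contain the term.
    """
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # l and t steps: (ln(tf) + 1.0) * log(num-docs / coll-freq-of-term)
        raw = {
            term: (math.log(count) + 1.0) * math.log(num_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # c step: divide by the Euclidean length of the weighted vector
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Invented toy corpus: "meeting" occurs in every document.
docs = [
    ["budget", "budget", "meeting"],
    ["lunch", "meeting"],
    ["gas", "pipeline", "meeting"],
]
vectors = ltc_weights(docs)
```

In this toy corpus "meeting" appears in every document, so its log factor is log(3/3) = 0 and it contributes nothing, while the remaining terms share the unit length of each vector.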

Classification Method (Cont.)

Sort messages in chronological order, then split them into train and test sets

Run SVM on term-weighted vectors of:
  From
  Subject
  Body
  To, CC
  All fields

Linear regression on all fields seems to have the best performance
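The chronological split can be sketched as follows. The slides do not state the train/test proportion, so the `train_fraction` parameter and its default of 0.5 are assumptions, as are the `chronological_split` helper and the message dictionaries.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.5):
    """Sort messages by date and cut once: older for training, newer for testing.

    train_fraction is an assumed parameter; the slides do not give the ratio.
    """
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Invented messages, deliberately out of order.
messages = [
    {"id": "m3", "date": datetime(2001, 5, 3)},
    {"id": "m1", "date": datetime(2001, 5, 1)},
    {"id": "m4", "date": datetime(2001, 5, 4)},
    {"id": "m2", "date": datetime(2001, 5, 2)},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random mirrors how a filing assistant would be used: it only ever sees past mail when predicting the folder of a new message.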

Clustering Effectiveness

Number of Messages vs. F1

The number of messages does not directly correlate with accuracy

Question: What about the case where the user has only one folder, which makes classification trivial?
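For reference, F1 for a single folder is the harmonic mean of precision and recall. The sketch below uses hypothetical helpers (`folder_f1`, `macro_f1`) and invented folder labels; the slides do not say which averaging was used for the plots, so the macro average here is an assumption.

```python
def folder_f1(true_folders, predicted_folders, folder):
    """F1 for one folder: harmonic mean of precision and recall."""
    pairs = list(zip(true_folders, predicted_folders))
    tp = sum(1 for t, p in pairs if t == folder and p == folder)
    fp = sum(1 for t, p in pairs if t != folder and p == folder)
    fn = sum(1 for t, p in pairs if t == folder and p != folder)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true_folders, predicted_folders):
    """Unweighted mean of per-folder F1 over the folders that actually occur."""
    folders = sorted(set(true_folders))
    scores = [folder_f1(true_folders, predicted_folders, f) for f in folders]
    return sum(scores) / len(scores)


# Invented labels for five messages.
true_folders = ["inbox", "inbox", "deals", "deals", "deals"]
predicted = ["inbox", "deals", "deals", "deals", "inbox"]
score = macro_f1(true_folders, predicted)
```

With a single folder every prediction is necessarily correct, so `macro_f1(["inbox", "inbox"], ["inbox", "inbox"])` evaluates to 1.0, which is the trivial case raised in the question above.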

Number of Folders vs. F1

There is a correlation between the number of folders and the F1 score.

Question: Is this trivial as well? Some elements in the messages are not modeled, since the SVM has more messages to train on.

Thread Information

200,399 messages, 101,786 threads, 71,696 threads with only one message

61.63% of the messages in the corpus are in a thread.

Average thread size is 4.1 messages

Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder)

Question: It is not clear how threads are detected. How can we use this information?

More Thread

D. Lewis et al., Threading Electronic Mail: A Preliminary Study (1997)

Lewis studied finding the parent message using a BOW, TF/IDF-weighted vector space approach on constructive text

Document weight

Query weight

Similarity
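The general idea of content-based parent finding can be sketched as a nearest-neighbor search: treat the reply as a query and pick the earlier message with the highest cosine similarity. The weighting below (raw term count times log((N + 1) / document frequency)) is an assumption chosen for illustration and is not the document or query weighting from Lewis' study; the helper names and the toy messages are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Assumed weighting: term count * log((N + 1) / document frequency)."""
    n = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return [
        {t: c * math.log((n + 1) / doc_freq[t]) for t, c in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_likely_parent(reply_index, token_lists):
    """Index of the earlier message most similar to the reply, or None."""
    vectors = tfidf_vectors(token_lists)
    return max(
        range(reply_index),  # only messages that came before the reply
        key=lambda i: cosine(vectors[reply_index], vectors[i]),
        default=None,
    )


# Invented messages, already tokenized and in chronological order.
mailbox = [
    ["budget", "meeting", "friday"],
    ["lunch", "menu", "today"],
    ["budget", "meeting", "moved", "monday"],
]
parent = most_likely_parent(2, mailbox)
```

Here the third message shares "budget" and "meeting" with the first and nothing with the second, so the first message is chosen as its parent.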

More Thread (Cont.)

Lewis’ work assumes that the thread information in the message header is incomplete. This may not be the case.

The algorithm by Jamie Zawinski, used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively. http://www.jwz.org/doc/threading.htm
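When the headers are intact, threads can be recovered directly from the References chain. The sketch below is a deliberately simplified illustration of header-based grouping and is not Zawinski's full algorithm, which also handles steps such as pruning empty containers and grouping by subject. The `group_threads` helper and the message dictionaries are hypothetical.

```python
from collections import defaultdict


def group_threads(messages):
    """Group message ids by thread root using each message's last reference.

    messages: list of dicts with "id" and "references" (ancestor ids,
    oldest first). Returns {root_id: [message ids in input order]}.
    """
    parent = {}
    for m in messages:
        refs = m.get("references", [])
        # The last entry in References is the direct parent.
        parent[m["id"]] = refs[-1] if refs else None

    def root(message_id):
        seen = set()
        # Walk up until a message with no known parent; guard against cycles.
        while parent.get(message_id) is not None and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    threads = defaultdict(list)
    for m in messages:
        threads[root(m["id"])].append(m["id"])
    return dict(threads)


# Invented mailbox: a, b, c form one conversation; d stands alone.
mailbox = [
    {"id": "a", "references": []},
    {"id": "b", "references": ["a"]},
    {"id": "c", "references": ["a", "b"]},
    {"id": "d", "references": []},
]
threads = group_threads(mailbox)
```

A reply whose parent is missing from the mailbox is still grouped under the missing ancestor's id, so siblings of a deleted message end up in the same thread.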

Questions

How can we leverage the thread information in email messages more effectively?

Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?

Conclusion

Pros
  Introduces a new corpus that can be useful for evaluating classification performance on a large collection of personal mail
  Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company

Cons
  Details on performing SVM and the linear weights for the various fields are missing
  Not clear how threads are detected
