Enron Corpus: A New Dataset for Email Classification
By Bryan Klimt and Yiming Yang
CEAS 2004
Presented by Will Lee
Introduction
Motivation
Related Works
The Enron Corpus
Methods
Evaluation
Thread Information
Conclusion
Motivation
Other corpora focus on newsgroups or personal email data
There is no common data set for evaluating the performance of email classification: previous research uses different personal data sets
It is difficult to find data on actual email use within a company:
Companies do not like to share their internal emails
Privacy concerns for people working for the company
Related Works
Other corpora:
20 Newsgroups, http://people.csail.mit.edu/people/jrennie/20Newsgroups/
Related papers:
Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00)
I. Androutsopoulos et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00)
T. Payne, Learning Email Filtering Rules with Magi (Thesis, 1994)
20 Newsgroups
Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups
Sample newsgroups: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.
Used originally in Ken Lang’s paper Newsweeder: Learning to filter netnews (ICML 1995)
Built from newsgroup data, so probably not very useful for research in personal information management
Enron Dataset
619,446 messages (200,399 after cleaning) by 158 users
Average of 757 messages per user
Shows that most users do use folders to organize their emails
Folder information can be used to evaluate the effectiveness of folder classification
Enron Corpus’ Characteristics
Number of messages per user varies from a few messages to 10K+ messages
The upper bound on the number of folders seems to correlate with log(# of messages)
The number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)
Question: how can we use this kind of information?
Email Classification Features
Constructive text: BOW approach, the feature used the most. Some fields are more important than others. Stemming and stop word removal are used, but their effectiveness is not proven.
Categorical text: “to” and “from” fields. BOW; useful for classification, but not as useful as constructive text.
Numeric data: size of message, number of replies, number of words, etc. Not very useful.
Thread information: indicates how messages relate to each other. Not fully exploited.
Email Features (Example)
From: Mark Hills <[email protected]>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <[email protected]>
References: <[email protected]>
In-Reply-To: <[email protected]>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark
Regions labeled in the example: categorical text, contextual text, numeric data, thread information
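As a rough illustration of how the four feature groups could be pulled out of a message like the one above, here is a minimal sketch using Python's standard `email` module. The sample addresses and Message-IDs are invented placeholders (the real ones are not readable in the transcript), and the grouping into `categorical`, `text`, `numeric`, and `thread` is an assumption about how the slide's categories map onto header fields, not the paper's code.

```python
from email import message_from_string

# Hypothetical message modeled on the slide's example; addresses and IDs are made up.
RAW = """\
From: Mark Hills <mark@example.org>
Subject: Re: When is the first lecture?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Message-ID: <reply-1@example.org>
In-Reply-To: <question-1@example.org>
References: <question-1@example.org>

Joshua Blatt wrote:
> When is the first lecture?

The first lecture was today, during the normally scheduled time.
"""


def extract_features(raw):
    """Split one raw message into the four feature groups named on the slide."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    return {
        # Categorical text: sender and recipient fields
        "categorical": {"from": msg.get("From", ""), "to": msg.get("To", "")},
        # Free text used for bag-of-words features
        "text": {"subject": msg.get("Subject", ""), "body": body},
        # Numeric data: simple size measures
        "numeric": {"size_chars": len(raw), "num_words": len(body.split())},
        # Thread information: header fields linking to earlier messages
        "thread": {
            "message_id": msg.get("Message-ID", ""),
            "in_reply_to": msg.get("In-Reply-To", ""),
            "references": msg.get("References", "").split(),
        },
    }


features = extract_features(RAW)
```

Here `features["thread"]["in_reply_to"]` holds the parent's Message-ID, which is the piece of information the later thread slides rely on.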
Classification Method
Vector space model with SVM
Vector weight w_i is evaluated using “ltc” (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means:
l: new-tf = ln(tf) + 1.0
t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
c: divide each new-wt by sqrt(sum of (new-wts squared))
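$$w_i = \frac{(\ln(tf_i) + 1.0)\,\log(N / n_i)}{\sqrt{\sum_k \left[(\ln(tf_k) + 1.0)\,\log(N / n_k)\right]^2}}$$
where tf_i is the frequency of term i in the message, N is the number of documents, and n_i is the collection frequency of term i.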
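The three “ltc” steps can be written out directly. The sketch below is a plain-Python illustration, not the SMART implementation: `ltc_vectors` is a hypothetical helper name, messages are assumed to be already tokenized, and “coll-freq-of-term” is taken to mean the number of documents containing the term.

```python
import math
from collections import Counter


def ltc_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    num_docs = len(docs)
    # Assumption: coll-freq-of-term = number of documents that contain the term
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # l: ln(tf) + 1.0, then t: multiply by log(num-docs / coll-freq-of-term)
        raw = {
            term: (math.log(count) + 1.0) * math.log(num_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # c: divide by the Euclidean length of the weighted vector
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


docs = [
    ["meeting", "budget", "budget"],
    ["meeting", "lunch"],
    ["gas", "pipeline", "meeting"],
]
vecs = ltc_vectors(docs)
```

A term that occurs in every document (here `meeting`) gets weight 0 from the `t` step, so only the distinguishing terms contribute to each normalized vector.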
Classification Method (Cont.)
Sort messages in chronological order, then split them into training and test sets
Run SVM on term-weighted vectors of each field: From; Subject; Body; To, CC; all fields
Linear regression on all fields seems to give the best performance
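The chronological split can be sketched as follows. This is only an illustration: the `train_fraction` parameter and the message dictionaries are assumptions, since the slides do not state the split ratio or the data layout.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.5):
    """Order messages by date; the earliest part trains, the rest tests."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical messages for one user; folder names are invented.
messages = [
    {"date": datetime(2001, 3, 5), "folder": "deals"},
    {"date": datetime(2001, 1, 10), "folder": "inbox"},
    {"date": datetime(2001, 4, 20), "folder": "legal"},
    {"date": datetime(2001, 2, 1), "folder": "deals"},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random keeps the evaluation closer to real use, where a classifier only ever sees mail that arrived before the message it has to file.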
Clustering Effectiveness
Number of Messages vs. F1
The number of messages does not directly correlate with accuracy
Question: What about the case where the user has only one folder, which makes classification trivial?
Number of Folders vs. F1
There is a correlation between the number of folders and the F1 score.
Question: Is this trivial as well? Some elements of the messages may not be modeled, since the SVM has more messages to train on.
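For reference, F1 is the harmonic mean of precision and recall, F1 = 2TP / (2TP + FP + FN). The sketch below computes it per folder and then averages across folders; the averaging choice and the folder names are assumptions, as the slides do not say which variant was plotted.

```python
def f1_per_folder(true_folders, predicted_folders):
    """Return {folder: F1} for every folder seen in either list."""
    pairs = list(zip(true_folders, predicted_folders))
    scores = {}
    for folder in set(true_folders) | set(predicted_folders):
        tp = sum(1 for t, p in pairs if t == folder and p == folder)
        fp = sum(1 for t, p in pairs if t != folder and p == folder)
        fn = sum(1 for t, p in pairs if t == folder and p != folder)
        denom = 2 * tp + fp + fn
        scores[folder] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical gold folders and classifier output for four messages
true_folders = ["inbox", "deals", "deals", "legal"]
predicted = ["inbox", "deals", "legal", "legal"]

scores = f1_per_folder(true_folders, predicted)
macro_f1 = sum(scores.values()) / len(scores)
```

With a single folder every prediction is trivially correct and F1 is 1.0, which is the degenerate case raised in the question on the previous slide.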
Thread Information
200,399 messages, 101,786 threads, 71,696 threads with only one message
61.63% of the messages in the corpus are in a thread.
Average thread size is 4.1 messages
Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder)
Question: Not clear how threads are detected. How can we use this information?
More Thread
D. Lewis et al., Threading Electronic Mail: A Preliminary Study (1997)
Lewis studied finding the parent message using a BOW, TF/IDF-weighted vector space approach on constructive text
Formulas shown on the slide: document weight, query weight, similarity
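The general idea of picking a parent by text similarity can be sketched as below. This is not Lewis's weighting scheme, whose formulas did not survive in the transcript; it uses a generic TF-IDF with cosine similarity, and `tfidf_vectors`, `cosine`, and `find_parent` are hypothetical helper names.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Generic TF-IDF weights (assumed form: tf * (log(N / df) + 1))."""
    num_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return [
        {t: c * (math.log(num_docs / doc_freq[t]) + 1.0) for t, c in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_parent(index, vectors):
    """Return the index of the most similar earlier message, or None."""
    best, best_sim = None, 0.0
    for earlier in range(index):
        sim = cosine(vectors[index], vectors[earlier])
        if sim > best_sim:
            best, best_sim = earlier, sim
    return best


# Hypothetical tokenized messages, listed in chronological order
messages = [
    ["when", "is", "the", "first", "lecture"],
    ["gas", "pipeline", "contract", "draft"],
    ["the", "first", "lecture", "was", "today"],
    ["contract", "draft", "looks", "fine"],
]
vectors = tfidf_vectors(messages)
parents = [find_parent(i, vectors) for i in range(len(messages))]
```

The reply is treated as the query and every earlier message as a candidate document; a message with no textual overlap with anything earlier is left without a parent.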
More Thread (Cont.)
Lewis’ work assumes that the thread information in the message header is incomplete. This may not be the case.
The algorithm by Jamie Zawinski, used in Netscape Mail and News 2.0 and 3.0 (maybe in recent Mozilla as well?), can group threaded messages effectively. http://www.jwz.org/doc/threading.htm
Questions
How can we leverage the thread information in email messages more effectively?
Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?
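When the headers are intact, grouping by References and In-Reply-To is straightforward. The sketch below is a heavily simplified illustration of header-based threading, not Zawinski's full algorithm (which also handles missing messages, subject-based grouping, and pruning); `build_threads` and the message dictionaries are hypothetical.

```python
from collections import defaultdict


def build_threads(messages):
    """Group message IDs by thread root using References / In-Reply-To links."""
    parent = {}
    for msg in messages:
        chain = list(msg.get("references", []))
        reply_to = msg.get("in_reply_to")
        if reply_to and reply_to not in chain:
            chain.append(reply_to)
        chain.append(msg["id"])
        # Each ID in the chain is treated as the parent of the next one
        for older, newer in zip(chain, chain[1:]):
            if older != newer:
                parent.setdefault(newer, older)  # the first link seen wins

    def root_of(message_id):
        seen = set()  # guards against cycles in malformed headers
        while message_id in parent and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    threads = defaultdict(list)
    for msg in messages:
        threads[root_of(msg["id"])].append(msg["id"])
    return dict(threads)


# Hypothetical messages; the IDs are invented
messages = [
    {"id": "<q1>", "in_reply_to": None, "references": []},
    {"id": "<r1>", "in_reply_to": "<q1>", "references": ["<q1>"]},
    {"id": "<r2>", "in_reply_to": "<r1>", "references": ["<q1>", "<r1>"]},
    {"id": "<solo>", "in_reply_to": None, "references": []},
]
threads = build_threads(messages)
```

In this example the three related messages end up under the root `<q1>`, and `<solo>` forms a single-message thread of the kind counted in the corpus statistics.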
Conclusion
Pros
Introduces a new corpus that can be useful for evaluating classification performance on a large collection of personal mail
Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company
Cons
Details on how the SVM was run and on the linear weights for the various fields are missing
Not clear how threads are detected