Enron email datasets
LING 575
Fei Xia
01/04/2011
History of Enron
• Enron was formed in 1985 under the direction of Kenneth Lay.
• In 1999, Enron officials began using “special purpose entities” (SPEs) to move losses off the company's books.
• In Dec 2000, Jeffrey Skilling took over the position of CEO from Kenneth Lay.
• In Aug 2001, Skilling unexpectedly resigned and Lay became CEO again. Sherron Watkins wrote an anonymous letter to Lay about possible fraud.
• In Oct 2001, the losses transferred from Enron to SPEs totaled over $618 million. The SEC started an inquiry into Enron.
• In Jan 2002, Lay resigned as chairman and CEO. Enron collapsed in the same year.
• In 2003, Enron emerged from bankruptcy as two separate companies. Most creditors would receive about 1/5 of the $67 billion they were owed.
History of Enron email dataset
• Made public by the Federal Energy Regulatory Commission during its investigation in May 2002
• Later collected and prepared by SRI for the CALO project
• William Cohen of CMU put the dataset on the web for researchers (the CMU dataset) in March 2004
• ISI cleaned the CMU dataset and created a MySQL database (the ISI database)
• Various teams did data cleaning and annotation
Several corpora
• Raw data: emails between 1998 and 2002
  – the CMU dataset
  – the ISI database
  – …
• Annotated data
  – Personal vs. business
  – Email zoning
  – …
The CMU dataset
• Paper: (B. Klimt and Y. Yang, 2004)
• Available at http://www.cs.cmu.edu/~enron/
• Stored on patas under /corpora/enron_email_dataset/cmu/
CMU dataset
• Raw corpus:
  – 619,446 messages from 158 users
• Cleanup:
  – remove folders such as “discussion_threads”
  – remove duplicates
• Cleaned corpus:
  – 200,399 messages from 158 users
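The cleanup steps can be sketched in a few lines. This is only an illustration: the slides do not say how duplicates were detected, so the sketch assumes that two messages are duplicates when sender, date, subject, and whitespace-normalized body all match. The dict field names, the helper names message_key and dedupe, and the toy messages are hypothetical.

```python
import hashlib

def message_key(msg):
    """Fingerprint a message by fields that copies in different folders share.

    Assumption: sender, date, subject and whitespace-normalized body identify a message.
    """
    parts = (msg["from"], msg["date"], msg["subject"], " ".join(msg["body"].split()))
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()

def dedupe(messages, skip_folders=("discussion_threads",)):
    """Drop messages in skipped folders, then keep the first copy of each message."""
    seen = set()
    kept = []
    for msg in messages:
        if msg["folder"] in skip_folders:
            continue
        key = message_key(msg)
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept

# Toy data, not taken from the corpus.
messages = [
    {"folder": "inbox", "from": "a@enron.com", "date": "2001-05-01",
     "subject": "Q2", "body": "See attached."},
    {"folder": "all_documents", "from": "a@enron.com", "date": "2001-05-01",
     "subject": "Q2", "body": "See  attached."},
    {"folder": "discussion_threads", "from": "b@enron.com", "date": "2001-05-02",
     "subject": "Re: Q2", "body": "Thanks."},
]
cleaned = dedupe(messages)
```

Here the second message is dropped as a copy of the first, and the third is dropped because of its folder.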
Messages per user
A few people sent out a lot of messages
Correlation of folders and messages
Most users do use folders to organize their emails, but their usage of folders varies a lot.
Distribution of thread sizes
• Thread: same subject line among the same users.
• Out of 200,399 messages, 61.6% of emails are in threads (123,501 emails in 30,091 threads).
• Most threads are small.
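The thread definition above (same subject line among the same users) can be sketched as a grouping step. This is an illustration under assumptions, not the procedure used to produce the counts above: it assumes reply and forward prefixes (Re:, Fw:, Fwd:) are stripped before comparing subjects, that "the same users" means the same set of sender and recipients, and that a thread needs at least two messages. The field names, helper names, and toy messages are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: leading "Re:", "Fw:" and "Fwd:" markers are not part of the subject.
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def thread_key(msg):
    """Key a message by its normalized subject and its set of participants."""
    subject = PREFIX.sub("", msg["subject"]).strip().lower()
    users = frozenset([msg["from"], *msg["to"]])
    return subject, users

def find_threads(messages, min_size=2):
    """Group messages by thread key and keep groups with at least min_size messages."""
    groups = defaultdict(list)
    for msg in messages:
        groups[thread_key(msg)].append(msg)
    return [group for group in groups.values() if len(group) >= min_size]

# Toy data, not taken from the corpus.
messages = [
    {"from": "a@enron.com", "to": ["b@enron.com"], "subject": "Gas prices"},
    {"from": "b@enron.com", "to": ["a@enron.com"], "subject": "RE: Gas prices"},
    {"from": "a@enron.com", "to": ["b@enron.com"], "subject": "Re: RE: Gas prices"},
    {"from": "c@enron.com", "to": ["a@enron.com"], "subject": "Gas prices"},
    {"from": "a@enron.com", "to": ["b@enron.com"], "subject": "Lunch"},
]
threads = find_threads(messages)
in_threads = sum(len(t) for t in threads)
share = in_threads / len(messages)  # fraction of messages that belong to a thread
```

The same ratio on the reported figures is 123,501 / 200,399, or about 61.6%, and 123,501 / 30,091 gives an average of about 4.1 messages per thread.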
The ISI database
• Paper: Shetty and Adibi’s report
• Report and data are available at http://www.isi.edu/~adibi/Enron/Enron.htm
• Stored on patas under $data_dir/isi/
• Stored on capuchin as a MySQL database called “enron”.
Data cleaning
• Start from the CMU dataset
• Remove duplicate emails
• Remove folders such as “discussion_threads”, “all documents”, and “sent_mail”
• …
Cleaned Enron email dataset
• 252,759 emails
• from 151 employees
• distributed in about 3,000 user-defined folders
• The dataset has been used by many research groups.
MySQL database: four tables
rtype: TO, CC, or BCC
rvalue: recipient email value
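To make the rtype and rvalue fields concrete, here is a small sketch of how a recipient table with those two columns could be queried. Only rtype and rvalue come from the slide; the table name recipientinfo, the rid and mid columns, and the sample rows are assumptions for illustration. An in-memory SQLite database stands in for the MySQL server so that the example runs anywhere.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL "enron" database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recipientinfo ("
    " rid INTEGER PRIMARY KEY,"  # assumed: one row per (message, recipient) pair
    " mid INTEGER,"              # assumed: id of the message
    " rtype TEXT,"               # TO, CC, or BCC
    " rvalue TEXT)"              # recipient email value
)
rows = [  # toy rows, not taken from the database
    (1, 100, "TO", "a@enron.com"),
    (2, 100, "CC", "b@enron.com"),
    (3, 101, "TO", "a@enron.com"),
    (4, 101, "BCC", "c@enron.com"),
]
conn.executemany("INSERT INTO recipientinfo VALUES (?, ?, ?, ?)", rows)

# Count how often each address appears under each recipient type.
counts = conn.execute(
    "SELECT rvalue, rtype, COUNT(*) FROM recipientinfo "
    "GROUP BY rvalue, rtype ORDER BY rvalue, rtype"
).fetchall()
conn.close()
```

Against the real database the same GROUP BY query should carry over, provided the actual table and column names are substituted.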
Distribution of sent emails per user
A few employees sent out a lot of messages.
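A distribution like this one can be computed by counting messages per sender and checking how much of the total the top senders account for. The sketch below is illustrative only; the helper names sent_counts and top_share and the toy sender list are hypothetical.

```python
from collections import Counter

def sent_counts(senders):
    """Return (sender, count) pairs sorted from most to fewest sent messages."""
    return Counter(senders).most_common()

def top_share(ranked, k):
    """Fraction of all messages sent by the k most active senders."""
    total = sum(n for _, n in ranked)
    return sum(n for _, n in ranked[:k]) / total

# Toy data: one sender address per message, not taken from the corpus.
senders = ["a@enron.com"] * 6 + ["b@enron.com"] * 2 + ["c@enron.com", "d@enron.com"]
ranked = sent_counts(senders)
share = top_share(ranked, 1)
```

A large value of top_share for a small k is what "a few employees sent out a lot of messages" looks like numerically.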
Distribution of email over time
Notice the spike around Nov 2001
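One way to produce such a distribution is to bucket message timestamps by month and look for the largest bucket. This is a minimal sketch with made-up timestamps; the helper names monthly_counts and peak_month are hypothetical, and the toy data is only shaped to mimic a November 2001 peak.

```python
from collections import Counter
from datetime import datetime

def monthly_counts(timestamps):
    """Count messages per calendar month, keyed as 'YYYY-MM'."""
    return Counter(ts.strftime("%Y-%m") for ts in timestamps)

def peak_month(counts):
    """Return the (month, count) pair with the most messages."""
    return max(counts.items(), key=lambda kv: kv[1])

# Toy timestamps, not taken from the corpus.
timestamps = [
    datetime(2001, 10, 30),
    datetime(2001, 11, 2),
    datetime(2001, 11, 15),
    datetime(2001, 11, 28),
    datetime(2001, 12, 3),
]
counts = monthly_counts(timestamps)
month, n = peak_month(counts)
```

Sorting the keys of counts gives the months in chronological order, which is the x-axis of a plot like the one described.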
Social network