Enron email datasets LING 575 Fei Xia 01/04/2011

Enron email datasets

LING 575, Fei Xia

01/04/2011

History of Enron

• Enron was formed in 1985 under the direction of Kenneth Lay.

• In 1999, Enron officials began using “special purpose entities” (SPEs) to keep debt and losses off the company's books.

• In Dec 2000, Jeffrey Skilling was named to succeed Kenneth Lay as CEO.

• In Aug 2001, Skilling unexpectedly resigned and Lay became CEO again. Sherron Watkins wrote an anonymous letter to Lay about possible fraud.

• In Oct 2001, the losses transferred from Enron to SPEs totaled over $618 million. The SEC started an inquiry into Enron.

• In Jan 2002, Lay resigned as chairman and CEO. Enron collapsed in the same year.

• In 2003, Enron emerged from bankruptcy as two separate companies. Most creditors would receive about 1/5 of the $67 billion they were owed.

History of Enron email dataset

• Made public by the Federal Energy Regulatory Commission during its investigation in May 2002

• Later collected and prepared by SRI for the CALO project

• William Cohen from CMU posted the dataset on the web for researchers (the CMU dataset) in March 2004

• ISI cleaned the CMU dataset and created a MySQL database (the ISI database)

• Various teams did data cleaning and annotation

Several corpora

• Raw data: emails between 1998 and 2002
  – the CMU dataset
  – the ISI database
  – …

• Annotated data
  – Personal vs. business
  – Email zoning
  – …

The CMU dataset

• Paper: (B. Klimt and Y. Yang, 2004)

• Available at http://www.cs.cmu.edu/~enron/

• Stored on patas under /corpora/enron_email_dataset/cmu/

CMU dataset

• Raw corpus:
  – 619,446 messages from 158 users

• Cleanup:
  – remove folders such as “discussion_threads”
  – remove duplicates

• Cleaned corpus:
  – 200,399 messages from 158 users
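The two cleanup steps above (dropping certain folders, then removing duplicates) can be sketched as follows. This is only an illustration, not the procedure Klimt and Yang actually used: the duplicate test (a hash over sender, recipients, date, subject, and body) and the names `SKIPPED_FOLDERS`, `fingerprint`, and `clean` are assumptions made for this example.

```python
import hashlib
from email import message_from_string

# Assumption: only the folder named on the slide is listed here.
SKIPPED_FOLDERS = {"discussion_threads"}


def fingerprint(raw_message: str) -> str:
    """Hash the fields that (by assumption) identify a duplicate message."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart message: fall back to raw text
        body = raw_message
    fields = [
        msg.get("From", ""),
        msg.get("To", ""),
        msg.get("Date", ""),
        msg.get("Subject", ""),
        body.strip(),
    ]
    return hashlib.sha1("\n".join(fields).encode("utf-8")).hexdigest()


def clean(messages):
    """Filter (folder_name, raw_message) pairs.

    Skips messages in SKIPPED_FOLDERS and keeps only the first copy
    of each message, judged by its fingerprint.
    """
    seen = set()
    kept = []
    for folder, raw in messages:
        if folder in SKIPPED_FOLDERS:
            continue
        key = fingerprint(raw)
        if key in seen:
            continue
        seen.add(key)
        kept.append((folder, raw))
    return kept
```

A stricter or looser duplicate test (for example, ignoring the date header) would change the size of the cleaned corpus, so the counts on the slide should not be expected to fall out of this sketch.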

Messages per user

A few people sent out a lot of messages

Correlation of folders and messages

Most users do use folders to organize their emails, but their usage of folders varies a lot.

Distribution of thread sizes

• Thread: a set of messages with the same subject line exchanged among the same users.

• Out of 200,399 messages, 61.6% are in threads (123,501 emails in 30,091 threads).

• Most threads are small.
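The thread definition above (same subject line, same users) can be turned into a small grouping routine. This is a sketch under assumptions that are not on the slide: reply and forward prefixes such as "Re:" are stripped before subjects are compared, a thread needs at least two messages, and the input is a list of dicts with hypothetical keys `subject`, `sender`, and `recipients`.

```python
import re
from collections import defaultdict


def normalize_subject(subject):
    """Lower-case the subject and drop leading Re:/Fw:/Fwd: prefixes (an assumption)."""
    stripped = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()


def group_threads(messages):
    """Group messages that share a normalized subject and the same set of users."""
    groups = defaultdict(list)
    for m in messages:
        participants = frozenset([m["sender"], *m["recipients"]])
        groups[(normalize_subject(m["subject"]), participants)].append(m)
    # Assumption: a single message on its own is not counted as a thread.
    return [g for g in groups.values() if len(g) >= 2]


def threaded_fraction(messages):
    """Fraction of all messages that belong to some thread."""
    messages = list(messages)
    in_threads = sum(len(t) for t in group_threads(messages))
    return in_threads / len(messages)
```

The figures on the slide are consistent with each other: 123,501 / 200,399 ≈ 0.616, and 123,501 / 30,091 gives an average of roughly four messages per thread.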

The ISI database

• Paper: Shetty and Adibi’s report

• Report and data are available at http://www.isi.edu/~adibi/Enron/Enron.htm

• Stored on patas under $data_dir/isi/

• Stored on capuchin as a MySQL database called “enron”.

Data cleaning

• Start from the CMU dataset

• Remove duplicate emails

• Remove folders such as “discussion_threads”, “all documents”, and “sent_mail”

• …

Cleaned Enron email dataset

• 252,759 emails

• from 151 employees

• distributed in about 3,000 user-defined folders

• The dataset has been used by many research groups.

MySQL database: four tables

rtype: TO, CC, or BCC

rvalue: recipient email value
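To make the `rtype` and `rvalue` columns concrete, here is a small sketch that counts recipients by type. The slide's table diagram is not reproduced in this transcript, so the table name `recipientinfo` and the `rid` and `mid` columns are assumptions, the rows are invented toy data, and an in-memory SQLite database stands in for the MySQL server.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory stand-in for the recipient table (schema is assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recipientinfo ("
        "rid INTEGER PRIMARY KEY, "  # hypothetical row id
        "mid INTEGER, "              # hypothetical message id
        "rtype TEXT, "               # TO, CC, or BCC
        "rvalue TEXT)"               # recipient email address
    )
    rows = [
        (1, 10, "TO", "a@example.com"),
        (2, 10, "CC", "b@example.com"),
        (3, 11, "TO", "a@example.com"),
        (4, 11, "BCC", "c@example.com"),
    ]
    conn.executemany("INSERT INTO recipientinfo VALUES (?, ?, ?, ?)", rows)
    return conn


def recipient_type_counts(conn):
    """Return a dict mapping each rtype to its number of recipient rows."""
    query = "SELECT rtype, COUNT(*) FROM recipientinfo GROUP BY rtype"
    return dict(conn.execute(query))


if __name__ == "__main__":
    print(recipient_type_counts(build_toy_db()))
```

The same `GROUP BY` query should carry over to the MySQL database with little change, provided the real table and column names match these assumed ones.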

Distribution of sent emails per user

A few employees sent out a lot of messages.

Distribution of email over time

Notice the spike around Nov 2001
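A distribution like this one can be computed by bucketing message dates by month. The sketch below assumes the dates have already been parsed into `datetime` objects; the function names and the sample dates in the usage note are illustrative only, not values from the dataset.

```python
from collections import Counter
from datetime import datetime


def messages_per_month(dates):
    """Count messages per calendar month, keyed as 'YYYY-MM'."""
    return Counter(d.strftime("%Y-%m") for d in dates)


def peak_month(dates):
    """Return the 'YYYY-MM' key with the most messages."""
    counts = messages_per_month(dates)
    return max(counts, key=counts.get)
```

For example, calling `peak_month` on `[datetime(2001, 10, 3), datetime(2001, 11, 1), datetime(2001, 11, 15)]` returns `"2001-11"`, since two of the three toy dates fall in that month.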

Social network