vitor r. carvalho & william w. cohen carnegie mellon university

14
1 1 st st Conference on Email and Anti-Spam, CEAS 2004 Conference on Email and Anti-Spam, CEAS 2004 Learning to Extract Learning to Extract Signature and Reply Lines Signature and Reply Lines from Email from Email Vitor R. Carvalho & William W. Cohen Vitor R. Carvalho & William W. Cohen Carnegie Mellon University Carnegie Mellon University

Upload: joshua-savage

Post on 31-Dec-2015

21 views

Category:

Documents


1 download

DESCRIPTION

1 st Conference on Email and Anti-Spam, CEAS 2004 Learning to Extract Signature and Reply Lines from Email. Vitor R. Carvalho & William W. Cohen Carnegie Mellon University. Idea. Reply lines. Sig Lines. Motivation. Names, Dates, Times, etc. Preprocessing for: - PowerPoint PPT Presentation

TRANSCRIPT

11stst Conference on Email and Anti-Spam, CEAS 2004 Conference on Email and Anti-Spam, CEAS 2004

Learning to Extract Learning to Extract Signature and Reply Lines Signature and Reply Lines

from Emailfrom Email

Vitor R. Carvalho & William W. CohenVitor R. Carvalho & William W. CohenCarnegie Mellon UniversityCarnegie Mellon University

Sig Lines

Reply lines

Idea

Motivation

Email Text-To-Speech Systems

Automatic personal Automatic personal

address address managementmanagement

Anonymization of Anonymization of email corporaemail corpora

Preprocessing for:Preprocessing for: *email information *email information

extractionextraction*content-based email *content-based email

classifiersclassifiers

Names, Dates,

Times, etc

“Speech Act”, Topic, etc

Related work:Related work: Sproat, Chen & Hu;Sproat, Chen & Hu; “ “Emu: An e-mail preprocessor for Emu: An e-mail preprocessor for

text-to-speech”, “geometrical and linguistic analysis text-to-speech”, “geometrical and linguistic analysis

for e-mail signature”for e-mail signature”

Our work:Our work: 3 tasks: 3 tasks:

Sig detection ( has a signature?)Sig detection ( has a signature?)Sig line extraction (in which lines?)Sig line extraction (in which lines?)Reply line extractionReply line extraction

Compare state-of-the-art learning Compare state-of-the-art learning algorithms algorithms

Supervised learningSupervised learning

DataData

20 Newsgroups dataset

Searched for pairs of messages from the same sender, whose last K lines were identical.

K ≤ 1Unlikely to have a sig

K ≥ 6Likely to have a sig

Manually checked:586 Messages without

Sig lines

Manually Checked+

Sig and Reply-to Lines Annotated

617 Messages

Total: 33013 lines (3321 sig lines, 5587 reply-to lines)

Sig Detection TaskSig Detection Task

Last K lines of the email messageLast K lines of the email message Example: if URL pattern is detected in each of Example: if URL pattern is detected in each of

the last 3 lines, then the msg representation the last 3 lines, then the msg representation contains the features contains the features url1url1, , url2url2 and and url3url3

Sig Detection ResultsSig Detection Results

5-fold cross-validation on 1203 labeled messages 5-fold cross-validation on 1203 labeled messages (617 positive, 586 negative)(617 positive, 586 negative)

Sproat Sproat et al.et al. (1999): “SIG fields are rarely longer (1999): “SIG fields are rarely longer than ten lines”. than ten lines”.

Typical mistakes: ASCII drawing only, only the Typical mistakes: ASCII drawing only, only the nickname of the sender, or only a few quoted nickname of the sender, or only a few quoted sentences. sentences.

Learning Algorithm K = 5 K = 10 K = 15

F1 Precision Recall F1 Precision Recall F1 Precision Recall

Naïve Bayes 89.67 81.81 99.18 87.31 77.58 99.83 83.49 71.66 100

Maximum Entropy 95.11 97.28 93.03 97.40 97.56 97.24 96.98 97.54 96.43

SVM 94.87 96.79 93.03 97.55 98.03 97.08 97.39 97.87 96.92

VotedPerceptron 95.19 97.45 93.03 96.39 97.35 95.46 95.59 96.22 94.97

AdaBoost 95.16 96.19 94.16 96.76 96.45 97.08 96.56 97.36 95.78

Signature Extraction TaskSignature Extraction Task

Some of the line features (used to extract signature and reply lines)

On current line

Onprevious line

Onnextline

Blank line X X X

Email pattern X X X

URL pattern X X X

A line with a sequence of 10 or more special characters, as in the following regular expression: "^[\s]*([\*]|#|[\+]|[\^]|-|[\~]|[\&]|[///]|[\$]|_|[\!]|[\/]|[\%]|[\:]|[\=]){10,}[\s]*$"

X X X

Lines ending with quote symbol, as in regular expression: "\"$" X

The Name of the email sender, Surname, or Both (If it can be extracted from the email header) X

The number of tabs (as in regular expression “\t”) equals 1 X X X

The number of tabs equals 2 X X X

The number of tabs is equal or greater than 3 X X X

Percentage of punctuation symbols (as in regular expression “\p{Punct}”) is larger than 20% X X X

Percentage of punctuation symbols in a line is larger than 50% X X X

Percentage of punctuation symbols in a line is larger than 90% X X X

Typical reply marker (as in regular expression "^\>") X X X

Line starts with a punctuation symbol X X X

Next line begins with same punctuation symbol as current line X

Email message represented as a sequence of linesEmail message represented as a sequence of lines Each line is a set of features (sequential classification)Each line is a set of features (sequential classification)

Signature Extraction ResultsSignature Extraction Results (5-fold cross-validation) (5-fold cross-validation)

LearningAlgorithm

Without Features from Previousand Next Lines

With Features from Previousand Next Lines

Accuracy (%)

F1 Precision Recall Accuracy (%)

F1 Precision Recall

Non-Sequential

Naïve Bayes 94.13 73.88 66.80 82.65 91.03 68.60 52.95 97.38

Maximum Entropy 96.26 80.16 86.07 75.00 99.11 95.56 96.38 94.76

SVM 96.41 80.39 89.41 73.02 99.12 95.62 96.10 95.15

VotedPerceptron 96.10 80.23 81.88 78.65 98.96 94.73 96.32 93.19

AdaBoost 96.53 82.12 85.44 79.04 99.11 95.55 96.21 94.91

Sequential

CPerceptron(5, 25) 97.01 83.62 93.02 75.94 99.37 96.82 98.20 95.48

CMM(MaxEnt, 5) 87.11 57.24 42.94 85.84 98.65 93.58 89.99 97.47

CRF 98.13 90.97 88.05 94.09 99.17 95.97 94.27 97.74

Reply Lines Extraction ResultsReply Lines Extraction Results (5-fold cross-validation) (5-fold cross-validation)

Learning Algorithm

Without Features from Previous and Next Lines

With Features from Previous and Next Lines

Accuracy (%)

F1 Precision Recall Accuracy (%)

F1 Precision Recall

Non-Sequential

Starts with “>” 95.10 83.08 99.92 71.09 n/a

Naïve Bayes 97.97 93.98 94.47 93.50 93.86 84.37 74.03 98.06

MaximumEntropy 98.23 94.57 98.11 91.28 98.74 96.22 97.64 94.84

SVM 98.32 94.90 97.96 92.03 98.83 96.52 97.25 95.81

VotedPerceptron 98.19 94.38 99.19 90.03 98.48 95.36 98.90 92.07

AdaBoost 98.46 95.33 97.77 93.00 98.73 96.20 96.72 95.68

Sequential

CPerceptron(5, 18) 98.05 94.19 95.32 93.09 98.73 96.20 97.62 94.82

CMM(MaxEnt,5) 97.71 93.13 94.77 91.55 98.78 96.33 97.85 94.86

CRF 98.10 94.31 95.55 93.10 99.04 97.15 98.17 96.15

Sig & Reply Extraction: ResultsSig & Reply Extraction: Results(5-fold cross-validation)(5-fold cross-validation)

Multi-class SequentialLearning Algorithm

Without Features from Previous and Next Lines

With Features from Previous and Next Lines

Accuracy(%)

Confusion-Matrix Accuracy(%)

Confusion-Matrix

CPerceptron(5, 38) 95.35 98.91

CRF 96.71 98.48

% Sig Rep Other

Sig 8.27 0.17 1.61

Rep 0.05 15.22 1.65

Other 0.37 0.78 71.85

% Sig Rep Other

Sig 9.85 0.06 0.15

Rep 0.14 16.39 0.38

Other 0.09 0.26 72.65

% Sig Rep Other

Sig 9.42 0.03 0.61

Rep 0.04 15.87 1.00

Other 1.36 0.24 71.41

% Sig Rep Other

Sig 9.85 0.05 0.16

Rep 0.06 16.32 0.54

Other 0.51 0.20 72.30

Last LinesLast Lines

Effective method to extract signature and Effective method to extract signature and reply lines in email messagesreply lines in email messages

Sequence of lines representation (+ Sequence of lines representation (+ neighbor lines features)neighbor lines features)

Comparison of state-of-the-art learning Comparison of state-of-the-art learning algorithmsalgorithms

Implementation available on the Implementation available on the MinorthirdMinorthird package (Cohen, 2004)package (Cohen, 2004)

Complete Set of Features for Line ExtractionComplete Set of Features for Line Extraction