ACM email corpus annotation analysis

Andrew Rosenberg, 2/26/2004

Overview

• Motivation
• Corpus Description
• Kappa Shortcomings
• Kappa Augmentation
• Classification of messages
• Corpus annotation analysis
• Next step: Sharpening method
• Summary

Motivation

• The ACM email corpus annotation raises two problems.
  – By allowing annotators to assign a message one or two labels, there is no clear way to calculate an annotation statistic.
    • An augmentation to the kappa statistic is proposed.
  – Interannotator reliability is low (K < .3).
    • Annotator reeducation and/or annotation material redesign are most likely necessary.
    • Available annotated data can be used, hypothetically, to improve category assignment.

Corpus Description

• 312 email messages exchanged among members of the Columbia chapter of the ACM.

• Annotated by 2 annotators with one or two of the following 10 labels:
  – question, answer, broadcast, attachment transmission, planning, planning scheduling, planning-meeting scheduling, action item, technical discussion, social chat

Kappa Shortcomings

• Before running ML procedures, we need confidence in assigning labels to the messages.

• In order to compute kappa (below), we need to count the number of agreements.

• How do you determine agreement with an optional secondary label?
  – Ignore the secondary label?

$$K = \frac{p(A) - p(E)}{1 - p(E)}$$

Kappa Shortcomings (ctd.)

• Ignoring the secondary label isn’t acceptable for two reasons.
  – It is inconsistent with the annotation guidelines.
  – It ignores partial agreements.
    • {a,ba} - singleton matches secondary
    • {ab,ca} - primary matches secondary
    • {ab,cb} - secondary matches secondary
    • {ab,ba} - secondary matches primary, and vice versa

• Note: The purpose is not to inflate the kappa value, but to accurately assess the data.

Kappa Augmentation

• When a labeler employs a secondary label, consider it as a single annotation divided between two categories.

• Select a value of p, where 0.5 ≤ p ≤ 1.0, based on how heavily to weight the secondary label.
  – Singleton annotations are assigned a score of 1.0
  – Primary label: p
  – Secondary label: 1-p
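The weighting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `label_row` and `annotation_matrix` and the category list are assumptions made for the example, and the labels used are Judge A's from the five-message example in these slides.

```python
# Hypothetical category set for illustration (the slides' example uses a-d).
CATEGORIES = ["a", "b", "c", "d"]


def label_row(labels, p=0.6, categories=CATEGORIES):
    """Return one annotator's weight vector for one message.

    labels: list of one or two category names, primary label first.
    A singleton gets weight 1.0; otherwise primary gets p, secondary 1 - p.
    """
    if not 0.5 <= p <= 1.0:
        raise ValueError("p must lie in [0.5, 1.0]")
    if len(labels) not in (1, 2):
        raise ValueError("expected one or two labels")
    row = [0.0] * len(categories)
    if len(labels) == 1:
        row[categories.index(labels[0])] = 1.0
    else:
        primary, secondary = labels
        row[categories.index(primary)] += p
        row[categories.index(secondary)] += 1.0 - p
    return row


def annotation_matrix(annotations, p=0.6, categories=CATEGORIES):
    """Build the messages-by-categories weight matrix for one annotator."""
    return [label_row(labels, p, categories) for labels in annotations]


# Judge A's labels for messages 1-5 in the example.
judge_a = [["a", "b"], ["b", "a"], ["b"], ["c"], ["b", "c"]]
matrix_a = annotation_matrix(judge_a, p=0.6)
totals_a = [round(sum(col), 2) for col in zip(*matrix_a)]
print(totals_a)  # [1.0, 2.6, 1.4, 0.0]
```

Each row sums to 1, so every message still counts as a single annotation regardless of whether a secondary label was used.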

Kappa Augmentation example

Annotator labels

Message  A    B
1        a,b  b,d
2        b,a  a,b
3        b    b
4        c    a,d
5        b,c  c

Annotation matrices with p = 0.6

Judge A  a    b    c    d    Total
1        0.6  0.4  0    0
2        0.4  0.6  0    0
3        0    1    0    0
4        0    0    1    0
5        0    0.6  0.4  0
Total    1    2.6  1.4  0    5

Judge B  a    b    c    d    Total
1        0    0.6  0    0.4
2        0.6  0.4  0    0
3        0    1    0    0
4        0.6  0    0    0.4
5        0    0    1    0
Total    1.2  2    1    0.8  5

Kappa Augmentation example (ctd.)

Agreement matrix

Message  a     b     c    d    Total
1        0     0.24  0    0
2        0.24  0.24  0    0
3        0     1.0   0    0
4        0     0     0    0
5        0     0     0.4  0
Total    0.24  1.48  0.4  0    2.12

$$p(A) = \frac{2.12}{5} = 0.424$$
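The entries of the agreement matrix are consistent with an element-wise product of the two judges' annotation matrices (for message 1, category b: 0.4 × 0.6 = 0.24). Under that assumption, p(A) can be reproduced with the sketch below; the function names are illustrative only.

```python
# Annotation matrices from the example (p = 0.6), columns a, b, c, d.
judge_a = [
    [0.6, 0.4, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
]
judge_b = [
    [0.0, 0.6, 0.0, 0.4],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.0, 0.0, 0.4],
    [0.0, 0.0, 1.0, 0.0],
]


def agreement_matrix(m1, m2):
    """Element-wise product: per-message, per-category agreement mass."""
    return [[x * y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def observed_agreement(m1, m2):
    """p(A): total agreement mass divided by the number of messages."""
    agree = agreement_matrix(m1, m2)
    return sum(sum(row) for row in agree) / len(agree)


p_a = observed_agreement(judge_a, judge_b)
print(round(p_a, 3))  # 0.424
```

With p = 1.0 and singleton labels only, each product is 0 or 1, so this reduces to the usual count of agreements.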


Kappa Augmentation example (ctd.)

• To calculate p(E), use the relative frequencies of each annotator's label usage.

P(Topic)  Judge A  Judge B  P(A)*P(B)
a         0.2      0.24     0.048
b         0.52     0.4      0.208
c         0.28     0.2      0.056
d         0        0.16     0

p(E) = 0.312

• Kappa is then computed as originally:

$$K' = \frac{p(A) - p(E)}{1 - p(E)} = \frac{0.424 - 0.312}{1 - 0.312} = 0.163$$
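As a check on the arithmetic, the following sketch recomputes p(E) and K' from the column totals of the two annotation matrices in the example. The helper names are assumptions for illustration, and p(A) = 0.424 is taken from the agreement matrix.

```python
def expected_agreement(totals_1, totals_2, n_messages):
    """p(E): sum over categories of the product of relative frequencies."""
    freq_1 = [t / n_messages for t in totals_1]
    freq_2 = [t / n_messages for t in totals_2]
    return sum(f1 * f2 for f1, f2 in zip(freq_1, freq_2))


def kappa(p_a, p_e):
    """Standard kappa form: agreement above chance, normalised."""
    return (p_a - p_e) / (1.0 - p_e)


# Column totals (a, b, c, d) for Judge A and Judge B at p = 0.6.
totals_a = [1.0, 2.6, 1.4, 0.0]
totals_b = [1.2, 2.0, 1.0, 0.8]

p_e = expected_agreement(totals_a, totals_b, n_messages=5)
k_prime = kappa(0.424, p_e)
print(round(p_e, 3), round(k_prime, 3))  # 0.312 0.163
```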

Classification of messages

• This augmentation allows us to classify messages based on their individual kappa’ values at different values of p.
  – Class 1: high kappa’ at all values of p.
  – Class 2: low kappa’ at all values of p.
  – Class 3: high kappa’ at p = 1.0
  – Class 4: high kappa’ at p = 0.5

• Note: mathematically kappa’ needn’t be monotonic w.r.t. p, but with 2 annotators it is.
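Because kappa’ is monotonic in p for two annotators, checking the two endpoints p = 1.0 and p = 0.5 is enough to assign a class. The sketch below assumes a cutoff for "high" agreement; the slides do not state one, so `threshold=0.5` and the function name `classify_message` are purely hypothetical.

```python
def classify_message(kappa_at_p1, kappa_at_p05, threshold=0.5):
    """Assign a message to Class 1-4 from its kappa' at p=1.0 and p=0.5.

    threshold is an assumed cutoff for "high"; it is not from the slides.
    """
    high_at_p1 = kappa_at_p1 >= threshold
    high_at_p05 = kappa_at_p05 >= threshold
    if high_at_p1 and high_at_p05:
        return 1  # constant high
    if not high_at_p1 and not high_at_p05:
        return 2  # constant low
    if high_at_p1:
        return 3  # high only at p = 1.0 (low to high as p grows)
    return 4      # high only at p = 0.5 (high to low as p grows)


print(classify_message(0.9, 0.8))  # 1
print(classify_message(0.7, 0.2))  # 3
```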

Corpus Annotation Analysis

• Agreement is low at all values of p.
  – K’(p=1.0) = 0.299
  – K’(p=0.5) = 0.281

• Other views of the data will provide some insight into how to revise the annotation scheme.
  – Category distribution
  – Category co-occurrence
  – Category confusion
  – Class distribution
  – Category by class distribution

Corpus Annotation Analysis: Category Distribution

Category                     total  gr   db
Question                     175    86   89
Answer                       169    90   79
Broadcast                    132    23   109
Attachment Transmission      3      1    2
Planning Meeting Scheduling  63     32   31
Planning Scheduling          27     22   5
Planning                     92     76   16
Action Item                  19     10   9
Technical Discussion         31     22   9
Social Chat                  36     29   7

Corpus Annotation Analysis: Category Co-occurrence

                             Q   A   B   A.T.  P.M.S.  P.S.  P.  A.I.  T.D.  S.C.
Question                     x   19  12  1     8       6     17  1     6     7
Answer                       x   x   2   0     15      3     4   1     7     2
Broadcast                    x   x   x   0     2       2     8   0     0     1
Attachment Transmission      x   x   x   x     0       0     0   0     0     0
Planning Meeting Scheduling  x   x   x   x     x       2     1   0     0     0
Planning Scheduling          x   x   x   x     x       x     0   0     0     0
Planning                     x   x   x   x     x       x     x   3     2     0
Action Item                  x   x   x   x     x       x     x   x     1     0
Technical Discussion         x   x   x   x     x       x     x   x     x     1
Social Chat                  x   x   x   x     x       x     x   x     x     x

Corpus Annotation Analysis: Category Confusion

                             Q   A   B   A.T.  P.M.S.  P.S.  P.  A.I.  T.D.  S.C.
Question                     62  36  21  0     18      13    47  7     13    10
Answer                       x   60  15  0     24      7     19  5     17    3
Broadcast                    x   x   14  0     12      13    52  3     8     22
Attachment Transmission      x   x   x   0     0       0     1   0     0     1
Planning Meeting Scheduling  x   x   x   x     13      6     3   2     0     0
Planning Scheduling          x   x   x   x     x       2     4   1     1     0
Planning                     x   x   x   x     x       x     7   5     5     0
Action Item                  x   x   x   x     x       x     x   1     2     1
Technical Discussion         x   x   x   x     x       x     x   x     2     1
Social Chat                  x   x   x   x     x       x     x   x     x     4

Corpus Annotation Analysis: Class Distribution

Class                    Messages  Fraction
Constant High (Class 1)  82        0.262821
Constant Low (Class 2)   150       0.480769
Low to High (Class 3)    40        0.128205
High to Low (Class 4)    40        0.128205
Total Messages           312

Corpus Annotation Analysis: Category by Class Distribution (1/2)

Class 1: const. high

Category                     Num messages  Class : Total
Question                     52            0.29714
Answer                       62            0.36686
Broadcast                    16            0.12121
Attachment Transmission      0             0
Planning Meeting Scheduling  18            0.28571
Planning Scheduling          2             0.07407
Planning                     8             0.08695
Action Item                  0             0
Technical Discussion         2             0.06451
Social Chat                  4             0.11111

Class 2: const. low

Category                     Num messages  Class : Total
Question                     37            0.21142
Answer                       42            0.24852
Planning                     60            0.65217
Action Item                  14            0.73684

Corpus Annotation Analysis: Category by Class Distribution (2/2)

Class 3: low to high

Category                     Num messages  Class : Total
Question                     46            0.26285
Answer                       40            0.23668
Broadcast                    6             0.04545
Planning                     5             0.05434

Class 4: high to low

Category                     Num messages  Class : Total
Question                     40            0.22857
Answer                       25            0.14792
Planning                     19            0.20652

Next step: Sharpening method

• In determining interannotator agreement with kappa, etc., two available pieces of information are overlooked:
  – Some annotators are “better” than others
  – Some messages are “easier to label” than others

• By limiting the contribution of known poor annotators and difficult messages, we gain confidence in the final category assignment of each message.

• How do we rank annotators? Messages?

Sharpening Method (ctd.)

• Ranking annotators
  – Calculate kappa between each annotator and the rest of the group.
  – “Better” annotators have a higher agreement with the group.

• Ranking messages
  – Variance (or entropy, -p*log(p)) of the label vector summed over annotators.
  – Messages with high variance (or low entropy) are more consistently annotated.
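One possible reading of the message-ranking bullets is sketched below: sum the annotators' weight vectors for a message, then take the variance across categories, or the entropy of the normalised vector. The function names are assumptions for illustration, and the two messages are messages 3 and 4 from the earlier five-message example (p = 0.6).

```python
import math


def summed_label_vector(rows):
    """Sum the per-annotator weight vectors for a single message."""
    return [sum(col) for col in zip(*rows)]


def variance(vec):
    """Population variance of the summed label vector across categories."""
    mean = sum(vec) / len(vec)
    return sum((x - mean) ** 2 for x in vec) / len(vec)


def entropy(vec):
    """Entropy of the summed vector after normalising it to sum to 1."""
    total = sum(vec)
    return -sum((x / total) * math.log(x / total) for x in vec if x > 0)


# Message 3: both judges chose b. Message 4: Judge A chose c, Judge B chose a,d.
message_3 = summed_label_vector([[0, 1, 0, 0], [0, 1, 0, 0]])
message_4 = summed_label_vector([[0, 0, 1, 0], [0.6, 0, 0, 0.4]])

# The agreed-upon message has the more peaked vector.
print(variance(message_3) > variance(message_4))  # True
print(entropy(message_3) < entropy(message_4))    # True
```

When the annotators agree, the mass concentrates in one category, which raises the variance and lowers the entropy; disagreement spreads the mass out and does the opposite.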

Sharpening Method (ctd.)

• How do we use these ranks?
  – Weight the annotators based on their rank.
  – Recompute the message matrix with weighted annotator contributions.
  – Weight the messages based on their rank.
  – Recompute the kappa values with weighted message contributions.
  – Repeat these steps until the change in the weights falls below a threshold.

Summary

• The ACM email corpus annotation raises two problems.
  – By allowing annotators to assign a message one or two labels, there is no clear way to calculate an annotation statistic.
    • An augmentation to the kappa statistic is proposed.
  – Interannotator reliability is low (K < .3).
    • Annotator reeducation and/or annotation material redesign are most likely necessary.
    • Available annotated data can be used, hypothetically, to improve category assignment.
