Active Annotation of Corpora Kepa J. Rodriguez Text Analysis Seminar at the Göttingen Center of Digital Humanities 02.05.2012

Upload: kepa-j-rodriguez

Post on 21-Jun-2015

TRANSCRIPT

Page 1: Active Annotation of Corpora

Active Annotation of Corpora
Kepa J. Rodriguez
Text Analysis Seminar at the Göttingen Center of Digital Humanities
02.05.2012

Page 2: Active Annotation of Corpora

Outline

• Goal of the presentation.
• The LUNA corpus.
• Active annotation.
  – Concept.
  – Algorithm.
  – Evaluation.
• Potential use of Active Annotation in humanities projects.

Page 3: Active Annotation of Corpora

Goal of the presentation

• Introduce the concepts of:
  – Active Learning
  – Active Annotation.
• Present their use in the annotation of the LUNA corpus.
• Discuss the utility of Active Annotation in humanities projects.

Page 4: Active Annotation of Corpora

The LUNA Corpus (1)

• The corpus consists of:
  – 3000 Human-Human and 8100 WOZ (Wizard of Oz) dialogues
  – Multiple annotation levels: POS, entities, coreference, predicate structure, dialogue acts, etc.
  – in French, Italian and Polish.
• French subcorpus:
  – Application domains: travel information and reservation, IT help desk, telecom customer care and financial information transaction
  – Human-Machine dialogues: 7100
• Italian subcorpus:
  – Application domain: IT help desk
  – 2500 Human-Human and 500 WOZ dialogues
• Polish subcorpus:
  – Application domain: public transportation information
  – 500 Human-Human and 500 WOZ dialogues

More information about the annotation scheme and levels: http://www.ist-luna.eu/pdf/schemepresentationPdm.pdf

Page 5: Active Annotation of Corpora

The LUNA Corpus (2)

[Operator:] allora m'ha detto che [non riusciva]c1 ad [accedere]c2 [al computer]c3 e [le manca]c4 [la procedura]c5
so, you have told me that you cannot access the computer, and that you need the procedure

  c1 trouble : unable_to
  c2 action : access
  c3 computer-hardware : pc
  c4 trouble : lack_of
  c5 computer-software : procedure

[Caller:] esatto
exactly

[Operator:] allora avrei bisogno [dell' RWS]c6 [del PC]c7
so I need the RWS of the computer

  c6 code-identificationCode : rws
  c7 computer-hardware : pc

[Caller:] si allora [tredici zero ottantasei]c8
yes, 13 0 86

  c8 code-identificationCode-rws : 13086
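The concept annotations in this example pair a bracketed span with a "concept : value" label. A minimal sketch of how such a record might be held in memory follows; the class name `ConceptSpan` and its fields are illustrative assumptions, not the LUNA annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptSpan:
    """One annotated span (hypothetical structure, not the LUNA file format)."""
    span_id: str   # e.g. "c3"
    text: str      # surface text inside the brackets
    concept: str   # e.g. "computer-hardware"
    value: str     # e.g. "pc"

    @classmethod
    def parse(cls, span_id: str, text: str, annotation: str) -> "ConceptSpan":
        # Split "concept : value" at the first colon and trim whitespace.
        concept, value = (part.strip() for part in annotation.split(":", 1))
        return cls(span_id, text, concept, value)


c3 = ConceptSpan.parse("c3", "al computer", "computer-hardware : pc")
c8 = ConceptSpan.parse("c8", "tredici zero ottantasei", "code-identificationCode-rws : 13086")
print(c3)
print(c8)
```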

Page 6: Active Annotation of Corpora

Active annotation (1)

Components of active annotation:
• Active learning paradigm
  – Selection of examples for annotation.
• Potential error detection
  – Cases in which the manual annotation seems to be ambiguous or contradictory.

Page 7: Active Annotation of Corpora

Active annotation (2)

• Active learning paradigm:
  – A paradigm based on statistical learning.
  – A first small set is randomly chosen and manually annotated.
  – This set is used to train a model and annotate the rest of the samples.
  – The most informative examples are selected to update the statistical model.
    • Most informative = lower confidence score
• Use of active learning:
  – Speed up annotation.
  – Support annotators in their work.
  – Select examples to be annotated: which examples from a large amount of data will be useful for my purposes?
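The "most informative = lower confidence score" rule can be sketched in a few lines. This is a minimal illustration, assuming the model already returns one confidence score per unannotated example; the function name `least_confident` and the dialogue ids are hypothetical.

```python
from typing import Dict, List


def least_confident(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k examples with the lowest confidence score."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [example_id for example_id, _ in ranked[:k]]


# Hypothetical confidence scores for four unannotated dialogues.
scores = {"dlg_01": 0.93, "dlg_02": 0.41, "dlg_03": 0.77, "dlg_04": 0.35}
selected = least_confident(scores, k=2)
print(selected)
```

The selected dialogues are the ones the model is least sure about, so a human annotation of them is expected to change the model most.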

Page 8: Active Annotation of Corpora

Active annotation (3)

Learning curve comparison: active vs. random learning (Riccardi and Hakkani-Tür, 2005)

Page 9: Active Annotation of Corpora

Active annotation (4)

• Likely error detection:
  – Re-annotate the training data using the statistical model.
  – Extract the examples in which manual annotation and automatic annotation differ.
  – Send them to human supervision.
• Use of the likely error detection:
  – If the manual annotation is correct, the example is hard to learn:
    • Analyze which new features can be implemented to enrich the model.
  – If the annotation is erroneous:
    • Correct it.
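The comparison step of likely error detection can be sketched as follows, assuming manual and automatic labels are available as two aligned sequences. The function name and the sample labels are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def likely_errors(manual: Sequence[str], automatic: Sequence[str]) -> List[Tuple[int, str, str]]:
    """Return (position, manual label, automatic label) wherever the two differ."""
    if len(manual) != len(automatic):
        raise ValueError("manual and automatic annotations must be aligned")
    return [(i, m, a) for i, (m, a) in enumerate(zip(manual, automatic)) if m != a]


# Hypothetical labels for four spans of the training data.
manual_labels = ["trouble", "action", "computer-hardware", "trouble"]
model_labels = ["trouble", "action", "computer-software", "trouble"]

flagged = likely_errors(manual_labels, model_labels)
print(flagged)
```

Each flagged item would then go to a human, who decides whether it is an annotation error or a hard-to-learn example.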

Page 10: Active Annotation of Corpora

Annotation algorithm

1. Randomly select a small number of dialogues and annotate them manually from scratch (SL).
2. Train a model M using SL.
3. while (labeler/data available)
   a) Use M to automatically annotate the unannotated part of the corpus (Su).
   b) Rank the automatically annotated examples of Su according to the confidence measure given by M.
   c) Select a batch of k dialogues with the lowest score (Sk).
   d) Ask for human control/correction on Sk.
   e) Use M to automatically annotate SL and produce SaL.
   f) Look at the differences between SL and SaL:
      i. HARD TO LEARN EXAMPLE: add new features when training M.
      ii. ANNOTATION AMBIGUITIES: hire human annotators to disambiguate SL.
   g) SL = SL + Sk
   h) Train a new model M with SL.
   i) Go to 3a.
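The loop can be sketched end to end with a toy model. Everything below is an assumption made for illustration: `ToyModel` is a stand-in for the statistical model M (it only counts labels per first character), `oracle` plays the human annotator, and the loop runs a fixed number of rounds instead of checking labeler availability. It is not the model used for the LUNA corpus.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


class ToyModel:
    """Hypothetical stand-in for M: label frequencies keyed by first character."""

    def __init__(self, labeled: Dict[str, str]) -> None:
        self.counts: Dict[str, Counter] = defaultdict(Counter)
        for example, label in labeled.items():
            self.counts[example[0]][label] += 1

    def predict(self, example: str) -> Tuple[str, float]:
        """Return (label, confidence); unseen features get confidence 0.0."""
        seen = self.counts.get(example[0])
        if not seen:
            return "unknown", 0.0
        label, hits = seen.most_common(1)[0]
        return label, hits / sum(seen.values())


def active_annotation(pool: Sequence[str], oracle: Callable[[str], str],
                      seed_size: int, k: int, rounds: int, rng: random.Random):
    items = list(pool)
    rng.shuffle(items)
    labeled = {x: oracle(x) for x in items[:seed_size]}   # step 1: SL
    unlabeled = items[seed_size:]                          # Su
    model = ToyModel(labeled)                              # step 2
    disagreements: List[List[str]] = []
    for _ in range(rounds):                                # step 3
        if not unlabeled:
            break
        # a, b: annotate Su with M and rank by confidence (lowest first)
        ranked = sorted(unlabeled, key=lambda x: model.predict(x)[1])
        batch = ranked[:k]                                 # c: Sk
        corrected = {x: oracle(x) for x in batch}          # d: human correction
        # e, f: re-annotate SL with M and keep the items where M disagrees
        disagreements.append(
            [x for x, gold in labeled.items() if model.predict(x)[0] != gold]
        )
        labeled.update(corrected)                          # g: SL = SL + Sk
        unlabeled = [x for x in unlabeled if x not in corrected]
        model = ToyModel(labeled)                          # h: retrain M
    return model, labeled, disagreements


words = ["access", "computer", "procedure", "operator", "caller",
         "italian", "error", "password", "update", "desk"]
oracle = lambda w: "vowel" if w[0] in "aeiou" else "consonant"
model, labeled, diffs = active_annotation(
    words, oracle, seed_size=2, k=2, rounds=3, rng=random.Random(0)
)
print(len(labeled), "examples annotated;", sum(map(len, diffs)), "disagreements")
```

In a real setting the disagreement lists would be passed to annotators (step f), and the features of the model would be revised when the manual annotation turns out to be correct.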

Page 11: Active Annotation of Corpora

Evaluation (2)

• Annotator point of view:
  – Annotation from scratch: 80-90 min/file.
  – Supervision after the 3rd active annotation loop: 20-25 min/file.
  – Annotators concentrated more on:
    • Difficult/interesting issues.
    • Giving feedback about the model.
• Error detection: no statistics.
  – Most of the reported feedback requests were annotation errors.
  – Some of the reported feedback requests were caused by ambiguities and helped to add features to enrich the model.
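The reported times imply a relative saving per file that can be worked out directly. A small sketch, assuming the two ranges (80-90 min from scratch, 20-25 min for supervision) can be compared end to end; the helper name `saving_range` is hypothetical.

```python
from typing import Tuple


def saving_range(before: Tuple[float, float], after: Tuple[float, float]) -> Tuple[float, float]:
    """Return (smallest, largest) relative saving in percent for two (min, max) ranges."""
    smallest = 100 * (1 - after[1] / before[0])   # slowest supervision vs. fastest scratch
    largest = 100 * (1 - after[0] / before[1])    # fastest supervision vs. slowest scratch
    return smallest, largest


low, high = saving_range(before=(80, 90), after=(20, 25))
print(f"time saved per file: {low:.1f}% to {high:.1f}%")
```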

Page 12: Active Annotation of Corpora

Evaluation (1)

• Wizard of Oz dialogues

  Act-turn   Size in turns   Error rate
  1          200             59.2%
  2          400             44.4%
  3          600             39.3%
  4          800             6.4%
  5          1200            0.0%

• Human-human dialogues

  Act-turn   Size in dialogues   Error rate
  1          10                  71.2%
  2          20                  59.5%
  3          30                  54.0%
  4          40                  51.1%
  5          60                  45.7%
  6          80                  42.4%

Page 13: Active Annotation of Corpora

Discussion

• Questions
• Annotation tasks in the GCDH:
  – Corpus of Coptic Texts.
  – …..

Page 14: Active Annotation of Corpora

References

• LUNA project: http://www.ist-luna.eu
• Raymond, Rodriguez and Riccardi (2008): Active Annotation in the LUNA Italian Corpus of Spontaneous Dialogues. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.
• Riccardi, G. and Hakkani-Tür, D. (2005): Active learning: theory and applications to automatic speech recognition. In IEEE Transactions on Speech and Audio Processing.

Page 15: Active Annotation of Corpora

Text Analysis Seminar at the Göttingen Center of Digital Humanities

Thanks!!!