Annotation Procedure in Building the Prague Czech-English Dependency Treebank

Marie Mikulová and Jan Štěpánek
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University, Prague

Slovko 2009

Introduction

division of the annotation into several phases
system for annotation quality checking
ways of evaluation of the annotation and annotators

a large corpus with a rich linguistic annotation

an elaborated organization of the annotation process

Prague Dependency Treebanks:

Prague Dependency Treebank 2.0 (2006)

Prague Czech-English Dependency Treebank (2010)

Prague Dependency Treebanks

Introduction

Prague Czech-English Dependency Treebank (PCEDT)
texts from the Penn Treebank: mostly economic articles from the Wall Street Journal
for the Czech part, the texts were translated into Czech
2312 documents, 49 208 sentences
Ready for publication by the end of 2010!

Prague Dependency Treebank 2.0 (PDT 2.0)
Czech written texts
3165 documents, 49 431 sentences
Published in 2006.

System of annotation layers in Prague Dependency Treebanks

Word layer: "raw text", tokens
Morphological layer: lemmas, tags
Analytical layer: surface syntax; dependencies, relations
Tectogrammatical layer: deep syntax; dependencies, relations (detailed)

Tectogrammatical layer
in Prague Dependency Treebanks

as an example of a rich linguistic annotation

deep syntax
dependencies, relations: 70 functors
valency and ellipsis
grammatemes: semantic counterparts of morphological categories
coreference
topic-focus, deep word order

39 different attributes
8.42 attributes filled on average for a node in PDT 2.0
The annotation manual has more than 1000 pages.

What can we do? Three organizational aspects of building a large corpus with a rich annotation

Division of the annotation into several phases

Annotation quality checking

Motivating evaluation of the annotators

[Figure: schematic relating rules, errors, and corrections]

Division of the annotation into several phases

The division of the annotation process into several steps is desirable for the quality of the output data, even though some phenomena had to be reconsidered repeatedly by different annotators in various phases.

How to divide the annotation when the information attached is mostly very complex?

An annotation of one attribute requires an annotation of another attribute.

“working value” of an attribute

Annotation phases on the tectogrammatical layer
in the Prague Czech-English Dependency Treebank

1. building a tree structure, dealing with ellipsis included; assignment of functors and valency frames, links to lower layers (10 attributes),
2. annotation of subfunctors (fine-grained classification of functors, 1 attribute),
3. annotation of coreference (4 attributes),
4. annotation of topic-focus articulation, rhematizers and deep word order (3 attributes),
5. annotation of grammatemes, final form of tectogrammatical lemmata (17 attributes),
6. annotation of remaining phenomena (quotation, named entities etc.).

First phase: 9.2 sentences per hour

Example of “working value”
in the Prague Czech-English Dependency Treebank

First phase: building a tree structure.
Ellipsis: a new node is added.
Each node requires a lemma. The lemma of an added node signifies the type of the elision.

#Gen stands for a general participant,

#PersPron for a subject,

#Cor for a controlee in control constructions,

#Rcp for ellipses because of reciprocation etc.

BUT for building the tree structure, the type of elision is not essential.

Adding a new node is necessary!

The annotator adds a node with the “working value” of the lemma and assigns only its syntactical function.

“Working value”: #NewNode
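
The two-step handling of added nodes can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's tooling: the `Node` class, the function names, and the `ACT` functor used in the example are assumptions, and the set of final lemmas contains only the values named on this slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WORKING_LEMMA = "#NewNode"  # the working value named on the slide
# Final lemmas listed on the slide; the slide ends with "etc.", so this set is incomplete.
ELISION_LEMMAS = {"#Gen", "#PersPron", "#Cor", "#Rcp"}


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a tectogrammatical node."""
    lemma: str
    functor: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def add_elided_node(parent: Node, functor: str) -> Node:
    """First phase: add a node for an ellipsis and decide only its syntactic function."""
    node = Node(lemma=WORKING_LEMMA, functor=functor)
    parent.children.append(node)
    return node


def resolve_lemma(node: Node, lemma: str) -> None:
    """Later phase: replace the working value with the actual type of elision."""
    if node.lemma != WORKING_LEMMA:
        raise ValueError("node does not carry the working value")
    if lemma not in ELISION_LEMMAS:
        raise ValueError(f"unknown elision lemma: {lemma}")
    node.lemma = lemma


def unresolved(root: Node) -> List[Node]:
    """Collect all nodes in the tree that still carry the working value."""
    found = [root] if root.lemma == WORKING_LEMMA else []
    for child in root.children:
        found.extend(unresolved(child))
    return found


# Toy usage: a predicate with one elided participant.
root = Node(lemma="read", functor="PRED")
added = add_elided_node(root, "ACT")
pending_before = len(unresolved(root))
resolve_lemma(added, "#Gen")
pending_after = len(unresolved(root))
```

A function like `unresolved` also shows why a dedicated working value is convenient: any node still carrying it can be found mechanically before the data are released.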

Annotation quality checking

Usually: parallel annotation of the same data

expensive; for a large corpus with a rich annotation: impossible!

PCEDT (first phase): one annotator can annotate 9.2 sentences in one hour. Annotation of the whole treebank (49,000 sentences) by one annotator would take 5326 hours. If an annotator worked for 20 hours a week (half-time job), the whole treebank would take 5 years.
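
The estimate above is plain arithmetic and can be reproduced as follows. The figures come from the slide; the assumption of 52 working weeks per year is ours.

```python
SENTENCES = 49_000        # size of the treebank, as rounded on the slide
SENTENCES_PER_HOUR = 9.2  # first-phase annotation speed
HOURS_PER_WEEK = 20       # half-time job
WEEKS_PER_YEAR = 52       # assumption: no holidays or other breaks


def annotation_effort(sentences, rate, hours_per_week, weeks_per_year=WEEKS_PER_YEAR):
    """Return (hours, weeks, years) a single annotator would need."""
    hours = sentences / rate
    weeks = hours / hours_per_week
    years = weeks / weeks_per_year
    return hours, weeks, years


hours, weeks, years = annotation_effort(SENTENCES, SENTENCES_PER_HOUR, HOURS_PER_WEEK)
print(f"{hours:.0f} hours, {weeks:.0f} weeks, {years:.1f} years")
# prints "5326 hours, 266 weeks, 5.1 years"
```

Annotating the same data twice in parallel would double this effort, which is the reason given for relying on automatic checking instead.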

System for the automatic quality checking of data

It was developed during the building of PDT 2.0. The actual checking took place only after all the annotation had been finished. The checking and fixing phase was quite complex and time-consuming.

Now: fully integrated into the annotation process

Annotation quality checking
Design of the automatic checking procedures

103 checking procedures:
programmed manually (in Perl),
based on annotation rules,
return a list of erroneous positions in the data,
run periodically.

They improve the quality of the data:
by fixing the present errors,
by providing feedback to the annotators.

Annotation quality checking
Example of the checking procedure

coord002: every coordination has at least two members
struct001.1: the root of a tree has only a limited set of possible functors: PRED for a predicate, DENOM for a nominative clause, PARTL for an interjection clause, etc.
struct001.2: no dependent node has the PRED functor

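#!btred -N -T -t PML_T -e coord()
# Report every coordination node that has fewer than two members.
package PML_T;

$NAME = 'coord002';

## Every coordination has at least two members.
sub coord {
    # $this is the node currently visited by btred; the output line holds
    # the check name, the error type, and the node's address (tab-separated).
    writeln("$NAME\tmembers\t" . ThisAddress($this))
        if IsCoord($this)
        and scalar(grep $_->{is_member}, $this->children) < 2;
} # coord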
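
For readers without access to btred, the three checks named on this slide can be approximated on a toy tree in Python. This is a simplified sketch, not the project's code: the dictionary-based node format, the `id` values, and the `check_tree` function are assumptions, the set of root functors contains only the three named on the slide, and special cases that the actual procedures may handle are ignored.

```python
# Root functors named on the slide; the slide says "etc.", so the real set is larger.
ROOT_FUNCTORS = {"PRED", "DENOM", "PARTL"}


def walk(node, parent=None):
    """Yield (node, parent) pairs in depth-first, left-to-right order."""
    yield node, parent
    for child in node.get("children", []):
        yield from walk(child, node)


def check_tree(root):
    """Return a list of (check_name, node_id) pairs, one per violation found."""
    errors = []
    for node, parent in walk(root):
        # struct001.1: the root may carry only one of a limited set of functors.
        if parent is None and node["functor"] not in ROOT_FUNCTORS:
            errors.append(("struct001.1", node["id"]))
        # struct001.2: no dependent node has the PRED functor.
        if parent is not None and node["functor"] == "PRED":
            errors.append(("struct001.2", node["id"]))
        # coord002: every coordination has at least two members.
        if node.get("is_coord"):
            members = [c for c in node.get("children", []) if c.get("is_member")]
            if len(members) < 2:
                errors.append(("coord002", node["id"]))
    return errors


# Toy tree with two deliberate errors (node ids and functors are illustrative).
tree = {
    "id": "t1", "functor": "PRED", "children": [
        {"id": "t2", "functor": "ACT"},
        {"id": "t3", "functor": "CONJ", "is_coord": True, "children": [
            {"id": "t4", "functor": "PAT", "is_member": True},
        ]},
        {"id": "t5", "functor": "PRED"},
    ],
}
errors = check_tree(tree)
for name, node_id in errors:
    print(f"{name}\t{node_id}")
```

As in the Perl procedure, the output is a list of erroneous positions, so the same list can serve both for fixing the data and for counting errors per annotator.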

Evaluation of the annotators

A system for the evaluation of the annotation and annotators is an integral part of any annotation project.

Three ways of evaluation:
inter-annotator agreement,
error rate,
performance of the annotators.

Inter-annotator agreement

The structure to be compared is very complex. Aligning two tectogrammatical trees is not an easy task.

Since there is no “golden” annotation, we just measure the agreement of all the pairs of annotators.

As a baseline, we use the output of an automatic procedure with which the annotators start their work.
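
A minimal sketch of pairwise agreement without a gold standard is given below. It sidesteps the hard part, the tree alignment, by assuming that nodes are already aligned and identified by shared ids. The data format, the function names, and the choice to average each annotator's score over all pairs are assumptions, not the measure actually used in the project; the labels are invented toy data.

```python
from itertools import combinations


def pair_agreement(a, b):
    """Share of aligned nodes on which two annotations carry the same label.

    `a` and `b` map node ids to labels (for example functors). A node that
    is missing from one annotation counts as a disagreement.
    """
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


def agreement_table(annotations):
    """Return (scores per pair, mean score per annotator) over all pairs."""
    names = sorted(annotations)
    if len(names) < 2:
        raise ValueError("at least two annotations are needed")
    pairs = {
        (x, y): pair_agreement(annotations[x], annotations[y])
        for x, y in combinations(names, 2)
    }
    mean = {}
    for name in names:
        scores = [score for pair, score in pairs.items() if name in pair]
        mean[name] = sum(scores) / len(scores)
    return pairs, mean


# Invented toy data: two annotators and the automatic baseline, three aligned nodes.
annotations = {
    "A": {1: "ACT", 2: "PAT", 3: "PRED"},
    "B": {1: "ACT", 2: "ADDR", 3: "PRED"},
    "baseline": {1: "ACT", 2: "PAT", 3: "DENOM"},
}
pairs, mean = agreement_table(annotations)
for name, score in mean.items():
    print(f"{name}\t{score:.2%}")
```

Treating the automatic output as one more "annotator" puts the baseline on the same scale as the human scores.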

Inter-annotator agreement
Example

Overall         Structure       Functor
K   94.08%      A   88.62%      K   85.70%
Ma  94.01%      Ma  88.60%      Ma  85.67%
A   93.83%      O   87.92%      O   85.57%
O   93.78%      K   87.88%      A   85.13%
Z   84.58%      Z   69.28%      Z   66.80%

Error rate

Using the list of errors generated by the checking procedures we count how often the annotators make errors:

the number of errors the annotator made is divided by the number of sentences or nodes s/he annotated.
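
The rate itself is a simple ratio, scaled to 100 units as in the example table. The sketch below uses invented counts; the function name and the numbers are illustrative only.

```python
def error_rate(errors, units, per=100):
    """Errors per `per` annotated units (sentences or nodes)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return per * errors / units


# Hypothetical counts for one annotator: 12 reported errors
# in 400 annotated sentences containing 6000 nodes.
per_100_sentences = error_rate(12, 400)
per_100_nodes = error_rate(12, 6000)
print(f"{per_100_sentences:.4f} errors per 100 sentences")
print(f"{per_100_nodes:.4f} errors per 100 nodes")
```

Normalizing by nodes as well as by sentences keeps annotators who happened to receive longer sentences from being penalized.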

Error rate
Example

      December 2007                    July 2009
Who   Errors per      Errors per       Errors per      Errors per
      100 sentences   100 nodes        100 sentences   100 nodes
K     29.7851         1.6241           1.5103          0.0806
O     39.6699         2.0624           4.0331          0.2067
Ma    61.4087         3.2707           8.4670          0.4533
A     63.2318         3.3498           6.3583          0.3265
L     -               -                15.0668         0.8010
Mi    -               -                16.2241         0.8460
J     -               -                19.0476         1.0971

Performance of the annotators

In the annotation process, the time the annotators spent working is measured.

For each month, we compute each annotator's performance over the month and the overall performance.

Performance of the annotators
Example

Who   Hours    Sentences   Sentences per hour   Minutes per sentence
A     114.25   963         8.4289               7.1184
I     827.00   7006        8.4716               7.0825
J     105.70   1001        9.4702               6.3357
K     107.00   1430        13.3645              4.4895
L     266.41   1716        6.4412               9.3150
Ma    78.00    615         7.8846               7.6098
Mi    169.98   1655        9.7364               6.1624
O     289.02   3211        11.1100              5.4006
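
The two derived columns follow directly from the measured hours and sentence counts. The short sketch below recomputes them for annotator A from the table; the function name is ours.

```python
def performance(hours, sentences):
    """Return (sentences per hour, minutes per sentence)."""
    if hours <= 0 or sentences <= 0:
        raise ValueError("hours and sentences must be positive")
    return sentences / hours, 60 * hours / sentences


# Annotator A from the table: 114.25 hours, 963 sentences.
rate, minutes = performance(114.25, 963)
print(f"{rate:.4f} sentences per hour, {minutes:.4f} minutes per sentence")
# prints "8.4289 sentences per hour, 7.1184 minutes per sentence"
```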

Conclusion

The organizational aspects of building a large treebank:
divide the annotation process into several phases,
a system for checking the correctness of the annotation,
three ways to evaluate the annotation and annotators.

We believe that, having published PDT 2.0 with 50,000 sentences and being halfway through the PCEDT project with more than half of the data already annotated (33,500 sentences, 68% of the corpus), our proposals are sufficiently backed by our experience and practice.

Grants: Centrum komputační lingvistiky LC 356; PIRE (NSF, USA, 2005-2010); MŠMT KONTAKT (2006-2010); GAČR 405/06/0589 (2006-2008); GAUK 22908/2008; EU FP6 Euromatrix (2006-2008); EU FP7 EuromatrixPlus FP7-ICT-2007-3-231720.

Thank you for your attention.

http://ufal.mff.cuni.cz