Annotation Procedure in Building the Prague Czech-English
Dependency Treebank
Marie Mikulová and Jan Štěpánek
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics Charles University
Prague
Slovko 2009 [email protected] 2
Introduction
A large corpus with a rich linguistic annotation calls for an elaborated organization of the annotation process:
division of the annotation into several phases,
a system for annotation quality checking,
ways of evaluating the annotation and the annotators.
Prague Dependency Treebanks:
Prague Dependency Treebank 2.0 (2006)
Prague Czech-English Dependency Treebank (2010)
Prague Dependency Treebanks
Introduction
Prague Czech-English Dependency Treebank (PCEDT)
texts from the Penn Treebank: mostly economic articles from the Wall Street Journal
for the Czech part, the texts were translated into Czech
2312 documents, 49 208 sentences
Ready for publication by the end of 2010!
Prague Dependency Treebank 2.0 (PDT 2.0)
Czech written texts
3165 documents, 49 431 sentences
Published in 2006.
System of annotation layers in Prague Dependency Treebanks
Word layer: "raw text", tokens
Morphological layer: lemmas, tags
Analytical layer: surface syntax; dependencies, relations
Tectogrammatical layer: deep syntax; dependencies, relations (detailed)
Tectogrammatical layer in Prague Dependency Treebanks
as an example of a rich linguistic annotation
deep syntax dependencies, relations: 70 functors
valency and ellipsis
grammatemes: semantic counterparts of morphological categories
coreference
topic-focus articulation, deep word order
39 different attributes
8.42 attributes filled on average for a node in PDT 2.0
The annotation manual has more than 1000 pages.
What can we do? Three organizational aspects of building a large corpus with a rich annotation
Division of the annotation into several phases
Annotation quality checking
Motivating evaluation of the annotators
(Diagram: rules, errors and corrections.)
Division of the annotation into several phases
The division of the annotation process into several steps is desirable for the quality of the output data, even though some phenomena have to be reconsidered repeatedly by different annotators in various phases.
How to divide the annotation when the attached information is mostly very complex?
Problem: the annotation of one attribute requires the annotation of another attribute.
Solution: a “working value” of an attribute.
Annotation phases on the tectogrammatical layer
in the Prague Czech-English Dependency Treebank
1. building a tree structure, dealing with ellipsis included; assignment of functors and valency frames, links to lower layers (10 attributes),
2. annotation of subfunctors (fine-grained classification of functors, 1 attribute),
3. annotation of coreference (4 attributes),
4. annotation of topic-focus articulation, rhematizers and deep word order (3 attributes),
5. annotation of grammatemes, final form of tectogrammatical lemmata (17 attributes),
6. annotation of remaining phenomena (quotation, named entities etc.).
First phase: 9.2 sentences per hour
Example of a “working value”
in the Prague Czech-English Dependency Treebank
First phase: building a tree structure.
Ellipsis: a new node is added.
Each node requires a lemma. The lemma of an added node signifies the type of the ellipsis:
#Gen stands for a general participant,
#PersPron for a subject,
#Cor for a controllee in control constructions,
#Rcp for ellipses due to reciprocity, etc.
BUT for building the tree structure, the type of ellipsis is not essential.
Adding a new node is necessary!
The annotator adds a node with the “working value” of the lemma and assigns only its syntactic function.
“Working value”: #NewNode
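The idea can be sketched in a few lines of Python. This is only an illustration, not the treebank tooling: the `Node` class, the flat list standing in for a tree, and all function names are hypothetical, and the set of final lemmas contains only the four types named on the slide.

```python
from dataclasses import dataclass
from typing import List

WORKING_LEMMA = "#NewNode"                              # "working value" from the slide
FINAL_LEMMAS = {"#Gen", "#PersPron", "#Cor", "#Rcp"}    # incomplete: the slide says "etc."


@dataclass
class Node:
    lemma: str
    functor: str
    generated: bool = False   # True for nodes added because of an ellipsis


def add_elided_node(tree: List[Node], functor: str) -> Node:
    """Phase 1: add a node for an ellipsis, deciding only its syntactic function."""
    node = Node(lemma=WORKING_LEMMA, functor=functor, generated=True)
    tree.append(node)
    return node


def unresolved(tree: List[Node]) -> List[Node]:
    """Nodes that still carry the working value and await a later phase."""
    return [n for n in tree if n.lemma == WORKING_LEMMA]


def resolve(node: Node, lemma: str) -> None:
    """Later phase: replace the working value with the type of the ellipsis."""
    if lemma not in FINAL_LEMMAS:
        raise ValueError(f"unknown ellipsis lemma: {lemma}")
    node.lemma = lemma


tree = [Node("přijít", "PRED")]          # toy sentence with one overt node
added = add_elided_node(tree, "ACT")     # phase 1: structure and functor only
print(len(unresolved(tree)))             # one node still has the working value
resolve(added, "#PersPron")              # a later phase decides the type
print(len(unresolved(tree)))             # nothing is left unresolved
```

A check such as `unresolved` also gives a natural completion test for the later phase: no node may keep the working value in the final data.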
Annotation quality checking
Usually: parallel annotation of the same data.
This is expensive; for a large corpus with a rich annotation: impossible!
PCEDT (first phase): one annotator can annotate 9.2 sentences in one hour. Annotation of the whole treebank (49,000 sentences) by one annotator would take 5326 hours. If an annotator worked for 20 hours a week (half-time job), the whole treebank would take 5 years.
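The estimate can be reproduced with a short calculation. The figures are those given above; the 52 working weeks per year and the doubling for parallel annotation are assumptions made here for illustration.

```python
SENTENCES = 49_000          # approximate size of the treebank (from the slide)
SENTENCES_PER_HOUR = 9.2    # first-phase annotation speed (from the slide)
HOURS_PER_WEEK = 20         # half-time job (from the slide)
WEEKS_PER_YEAR = 52         # assumption: no holidays or other breaks

hours = SENTENCES / SENTENCES_PER_HOUR
weeks = hours / HOURS_PER_WEEK
years = weeks / WEEKS_PER_YEAR

print(f"{hours:.0f} hours, {weeks:.0f} weeks, {years:.1f} years")
# prints "5326 hours, 266 weeks, 5.1 years"

# Assumption: parallel annotation means every sentence is annotated twice.
print(f"{2 * years:.1f} years with double annotation")
# prints "10.2 years with double annotation"
```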
System for the automatic quality checking of data
It was developed during the building of PDT 2.0. The actual checking took place only after all the annotation had finished, so the checking and fixing phase was quite complex and time-consuming.
Now: fully integrated into the annotation process
Annotation quality checking
Design of the automatic checking procedures
103 checking procedures:
programmed manually (in Perl),
based on annotation rules,
return a list of erroneous positions in the data,
run periodically.
They improve the quality of the data:
by fixing the present errors,
by providing feedback to the annotators.
Annotation quality checking
Example of a checking procedure
coord002: every coordination has at least two members
struct001.1: the root of a tree has only a limited set of possible functors: PRED for a predicate, DENOM for a nominative clause, PARTL for an interjection clause etc.
struct001.2: no dependent node has the PRED functor
#!btred -N -T -t PML_T -e coord()
package PML_T;
$NAME = 'coord002';
## Every coordination has at least two members.
sub coord {
    # Report the address of the current node if it is a coordination
    # with fewer than two children marked as members.
    writeln("$NAME\tmembers\t" . ThisAddress($this))
        if IsCoord($this)
        and scalar(grep $_->{is_member}, $this->children) < 2;
} # coord
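For readers without the btred environment, the three rules listed above can be sketched on a toy tree in Python. This is not the project's code: the `Node` class and its `is_coord` flag are hypothetical, the set of root functors contains only those named on the slide, and the rules are applied literally, without the finer distinctions the real procedures may make.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Only the functors named on the slide; the real list is longer ("etc.").
ROOT_FUNCTORS = {"PRED", "DENOM", "PARTL"}


@dataclass
class Node:
    ident: str
    functor: str
    is_coord: bool = False      # hypothetical flag: node heads a coordination
    is_member: bool = False     # node is a member of its parent's coordination
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[Node, int]]:
    """Yield (node, depth) for the whole subtree, root first."""
    yield node, depth
    for child in node.children:
        yield from walk(child, depth + 1)


def check_tree(root: Node) -> List[Tuple[str, str]]:
    """Return (check name, node id) for every rule violation found."""
    errors = []
    for node, depth in walk(root):
        members = [c for c in node.children if c.is_member]
        if node.is_coord and len(members) < 2:
            errors.append(("coord002", node.ident))
        if depth == 0 and node.functor not in ROOT_FUNCTORS:
            errors.append(("struct001.1", node.ident))
        if depth > 0 and node.functor == "PRED":
            errors.append(("struct001.2", node.ident))
    return errors


# A coordination with a single member, and a dependent wrongly labelled PRED.
tree = Node("n1", "PRED", children=[
    Node("n2", "CONJ", is_coord=True, children=[
        Node("n3", "ACT", is_member=True),
    ]),
    Node("n4", "PRED"),
])
for name, ident in check_tree(tree):
    print(f"{name}\t{ident}")
```

As in the Perl procedure, the output is a list of erroneous positions, which can be handed back to the annotators.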
Evaluation of the annotators
A system for the evaluation of the annotation and the annotators is an integral part of any annotation project. Three ways of evaluation:
inter-annotator agreement,
error rate,
performance of the annotators.
Inter-annotator agreement
The structure to be compared is very complex. Aligning two tectogrammatical trees automatically is not an easy task.
Since there is no “golden” annotation, we just measure the agreement of all the pairs of annotators.
As a baseline, we use the output of an automatic procedure with which the annotators start their work.
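A minimal sketch of pairwise agreement in Python, under strong assumptions: the nodes of the compared trees are taken as already aligned (the hard part mentioned above is skipped), only one attribute is compared, and each annotator's score is the average over all pairs he or she takes part in. The slides do not say how the reported figures are aggregated, so the averaging is an assumption, and the data are invented.

```python
from itertools import combinations
from statistics import mean
from typing import Dict

Annotation = Dict[str, str]   # node id -> attribute value (e.g. a functor)


def pair_agreement(a: Annotation, b: Annotation) -> float:
    """Share of nodes, out of all nodes either side has, with equal values."""
    nodes = set(a) | set(b)
    same = sum(1 for n in nodes if n in a and n in b and a[n] == b[n])
    return same / len(nodes) if nodes else 1.0


def agreement_per_annotator(data: Dict[str, Annotation]) -> Dict[str, float]:
    """Average each annotator's agreement over all pairs he or she is part of."""
    scores = {name: [] for name in data}
    for x, y in combinations(data, 2):
        score = pair_agreement(data[x], data[y])
        scores[x].append(score)
        scores[y].append(score)
    return {name: mean(vals) for name, vals in scores.items()}


# Invented toy data: three annotators and the automatic pre-annotation.
data = {
    "ann1": {"n1": "PRED", "n2": "ACT", "n3": "PAT", "n4": "TWHEN"},
    "ann2": {"n1": "PRED", "n2": "ACT", "n3": "PAT", "n4": "TWHEN"},
    "ann3": {"n1": "PRED", "n2": "ACT", "n3": "ADDR", "n4": "TWHEN"},
    "auto": {"n1": "PRED", "n2": "PAT", "n3": "ADDR", "n4": "TWHEN"},
}
for name, score in agreement_per_annotator(data).items():
    print(f"{name}: {score:.2%}")
```

Treating the automatic output as one more "annotator" places the baseline on the same scale as the human scores.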
Inter-annotator agreement
Example
Overall        Structure      Functor
K   94.08%     A   88.62%     K   85.70%
Ma  94.01%     Ma  88.60%     Ma  85.67%
A   93.83%     O   87.92%     O   85.57%
O   93.78%     K   87.88%     A   85.13%
Z   84.58%     Z   69.28%     Z   66.80%
Error rate
Using the list of errors generated by the checking procedures we count how often the annotators make errors:
the number of errors the annotator made is divided by the number of sentences or nodes s/he annotated.
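As a sketch in Python, with the rate scaled to 100 units as in the reported figures. The function name and the counts are invented for illustration and are not taken from the treebank.

```python
def errors_per_100(errors: int, units: int) -> float:
    """Errors per 100 annotated units (sentences or nodes)."""
    if units <= 0:
        raise ValueError("the annotator has not annotated anything")
    return 100 * errors / units


# Invented counts for one annotator; not taken from the treebank.
errors, sentences, nodes = 45, 300, 5400
print(f"{errors_per_100(errors, sentences):.4f} errors per 100 sentences")
print(f"{errors_per_100(errors, nodes):.4f} errors per 100 nodes")
```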
Error rate
Example
      December 2007                 July 2009
Who   Errors per     Errors per     Errors per     Errors per
      100 sentences  100 nodes      100 sentences  100 nodes
K     29.7851        1.6241         1.5103         0.0806
O     39.6699        2.0624         4.0331         0.2067
Ma    61.4087        3.2707         8.4670         0.4533
A     63.2318        3.3498         6.3583         0.3265
L     -              -              15.0668        0.8010
Mi    -              -              16.2241        0.8460
J     -              -              19.0476        1.0971
Performance of the annotators
In the annotation process, the time the annotators spend working is measured.
For each month we compute each annotator's performance over that month and the overall performance.
Performance of the annotators
Example
Who   Hours    Sentences  Sentences per hour  Minutes per sentence
A     114.25   963        8.4289              7.1184
I     827.00   7006       8.4716              7.0825
J     105.70   1001       9.4702              6.3357
K     107.00   1430       13.3645             4.4895
L     266.41   1716       6.4412              9.3150
Ma    78.00    615        7.8846              7.6098
Mi    169.98   1655       9.7364              6.1624
O     289.02   3211       11.1100             5.4006
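The two derived columns follow directly from the hours and the sentence counts. A short Python sketch, using the Hours and Sentences values from the table above; the function and variable names are chosen here for illustration.

```python
from typing import Dict, Tuple

# (hours, sentences) per annotator, copied from the table above.
WORK: Dict[str, Tuple[float, int]] = {
    "A": (114.25, 963),
    "I": (827.00, 7006),
    "J": (105.70, 1001),
    "K": (107.00, 1430),
    "L": (266.41, 1716),
    "Ma": (78.00, 615),
    "Mi": (169.98, 1655),
    "O": (289.02, 3211),
}


def performance(hours: float, sentences: int) -> Tuple[float, float]:
    """Return (sentences per hour, minutes per sentence)."""
    return sentences / hours, 60 * hours / sentences


for who, (hours, sentences) in WORK.items():
    per_hour, minutes = performance(hours, sentences)
    print(f"{who:<4}{per_hour:9.4f}{minutes:9.4f}")
```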
Conclusion
The organizational aspects of building a large treebank:
divide the annotation process into several phases,
use a system for checking the correctness of the annotation,
use three ways to evaluate the annotation and the annotators.
We believe that, having published PDT 2.0 with 50,000 sentences and being halfway through the PCEDT project with more than half of the data already annotated (33,500 sentences, 68% of the corpus), our proposals are sufficiently backed by our experience and practice.
Grants: Centrum komputační lingvistiky LC 356; PIRE (NSF, USA, 2005-2010); MŠMT KONTAKT (2006-2010); GAČR 405/06/0589 (2006-2008); GAUK 22908/2008; EU FP6 Euromatrix (2006-2008); EU FP7 EuromatrixPlus FP7-ICT-2007-3-231720.
Thank you for your attention.
http://ufal.mff.cuni.cz