Session 4: Annotation - University of Oxford
20th February, 2013
Session 4: Annotation
http://tinyurl.com/669o4zt
Corpus Linguistics
More than the Text:
Annotation - what, why, how?
Ylva Berglund Prytz and Martin Wynne
IT Services
CC BY-SA unless otherwise stated
Remember:
Corpus = text + metadata + annotation
STARTING POINT:
You can only find what is in the corpus...
so unless someone has included it, you cannot (easily) find it.
What can you find automatically?
• All instances of ‘work’
• All instances of ‘WORK’ (lemma)
• All instances of ‘work’ as a verb
• All instances of ‘work’ in fiction
• All instances of ‘work’ spoken by women
• All instances of ‘work’ at end of clause
• All instances of ‘work’ in jokes
Example: borrow (word form) = 1,423; {borrow} (lemma) = 3,000
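The gap between a word-form search and a lemma search can be sketched in a few lines of Python. This is a toy illustration only: the token list, its tags and the helper names `count_form` and `count_lemma` are invented here and are not taken from the BNC or from any corpus tool.

```python
# Toy corpus: each token is (word form, lemma, word class).
# All tokens and tags here are invented for illustration.
TOKENS = [
    ("She", "she", "PRON"), ("borrowed", "borrow", "VERB"),
    ("a", "a", "ART"), ("book", "book", "SUBST"),
    ("They", "they", "PRON"), ("borrow", "borrow", "VERB"),
    ("money", "money", "SUBST"),
    ("He", "he", "PRON"), ("borrows", "borrow", "VERB"),
    ("tools", "tool", "SUBST"),
    ("Hard", "hard", "ADJ"), ("work", "work", "SUBST"),
    ("We", "we", "PRON"), ("work", "work", "VERB"),
]


def count_form(tokens, form):
    """Count tokens whose surface form matches (case-insensitive)."""
    return sum(1 for word, _, _ in tokens if word.lower() == form.lower())


def count_lemma(tokens, lemma, pos=None):
    """Count tokens with the given lemma, optionally restricted by word class."""
    return sum(
        1 for _, lem, tag in tokens
        if lem == lemma and (pos is None or tag == pos)
    )


print(count_form(TOKENS, "borrow"))         # the word form 'borrow' only
print(count_lemma(TOKENS, "borrow"))        # borrow, borrows, borrowed
print(count_lemma(TOKENS, "work", "VERB"))  # 'work' as a verb, not as a noun
```

The lemma and word-class searches only work because that information has been added to each token; with plain text alone, only the first search is possible.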
1. The practice of adding (interpretative) (linguistic) information to a corpus.
2. The result of (1)
By Jean-Etienne Poirrier (Physical tagging on tree) [CC BY-SA 2.0
(http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons
Annotation, mark-up, tagging…
What and how?
By Helgen KM, Portela Miguez R, Kohen J, Helgen L [CC BY 3.0
(http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
EXAMPLE: it is true that he was …
it_PP3 is_BEZ true_JJ that_CS he_PP3A was_BEDZ .... (LOB G 28:95)
<w PNP>it <w VBZ>is <w AJ0>true <w CJT>that <w PNP>he <w VBD>was ... (BNC ABU:1683)
• it = PNP / PP3 (third person singular pronoun)
• is = VBZ / BEZ (third person present tense form of BE)
• that = CJT (the subordinating conjunction ‘that’) / CS (subordinating conjunction)
• he = PNP (personal pronoun) / PP3A (personal pronoun, 3rd pers sing nom (he, she))
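As a rough sketch, the two formats in the example can be read into the same (word, tag) structure and compared side by side. The function names `parse_underscore` and `parse_bnc_sgml` are made up for this illustration, and the regular expression only covers the simple `<w TAG>word` pattern shown on the slide, not the full BNC mark-up.

```python
import re

# The two tagged versions of the example, without the source references.
lob_line = "it_PP3 is_BEZ true_JJ that_CS he_PP3A was_BEDZ"
bnc_line = "<w PNP>it <w VBZ>is <w AJ0>true <w CJT>that <w PNP>he <w VBD>was"


def parse_underscore(line):
    """Split word_TAG tokens on the last underscore."""
    return [tuple(token.rsplit("_", 1)) for token in line.split()]


def parse_bnc_sgml(line):
    """Extract (word, tag) pairs from '<w TAG>word' sequences."""
    return [(word, tag) for tag, word in re.findall(r"<w (\w+)>([^<\s]+)", line)]


lob = parse_underscore(lob_line)
bnc = parse_bnc_sgml(bnc_line)

# Print each word with its LOB tag and its BNC tag.
for (word, lob_tag), (_, bnc_tag) in zip(lob, bnc):
    print(f"{word:<6}{lob_tag:<6}{bnc_tag}")
```

The words are the same in both versions; only the tag-set and the way the tags are attached differ.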
<u who="PS01V">
<s n="4188">
<w c5="CJC" hw="and" pos="CONJ">And </w>
<w c5="UNC" hw="erm" pos="UNC">erm </w>
<pause/>
<w c5="CJC" hw="and" pos="CONJ">and </w>
<w c5="AV0" hw="then" pos="ADV">then </w>
<w c5="PNP" hw="we" pos="PRON">we </w>
…
<shift new="laughing"/>
<w c5="AV0" hw="so" pos="ADV">so </w>
…
<shift/>
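A minimal sketch of how annotation of this kind might be queried with Python's standard `xml.etree.ElementTree` module. The fragment below is an assumption: it leaves out the elided parts of the example and adds closing `</s>` and `</u>` tags so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the example. Elided material is left out, and closing
# tags for <s> and <u> are added here so that the fragment is well-formed.
fragment = """
<u who="PS01V">
  <s n="4188">
    <w c5="CJC" hw="and" pos="CONJ">And </w>
    <w c5="UNC" hw="erm" pos="UNC">erm </w>
    <pause/>
    <w c5="CJC" hw="and" pos="CONJ">and </w>
    <w c5="AV0" hw="then" pos="ADV">then </w>
    <w c5="PNP" hw="we" pos="PRON">we </w>
    <shift new="laughing"/>
    <w c5="AV0" hw="so" pos="ADV">so </w>
    <shift/>
  </s>
</u>
"""
root = ET.fromstring(fragment.strip())

# Every word with its lemma (hw), C5 tag and simplified word class (pos).
words = [(w.text.strip(), w.get("hw"), w.get("c5"), w.get("pos"))
         for w in root.iter("w")]

# Lemma search: both 'And' and 'and' have hw="and".
and_hits = [form for form, hw, _, _ in words if hw == "and"]

# Words spoken while laughing: <shift new="..."/> switches a voice quality
# on, and an empty <shift/> switches it off again.
voice, laughing = None, []
for element in root.find("s"):
    if element.tag == "shift":
        voice = element.get("new")
    elif element.tag == "w" and voice == "laughing":
        laughing.append(element.text.strip())

# The text without any annotation.
plain_text = "".join(w.text for w in root.iter("w")).strip()

print(and_hits, laughing, plain_text)
```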
Types of linguistic annotation
– Morphosyntactic / word class / part-of-speech (POS)
– Syntactic (e.g. phrase, clause, mood...)
– Semantic
– Pragmatic
– Discourse
– Phonetic
– Phonological
– …
Mountain lion tagging 02. By Mountain-Prairie Region. US Fish and Wildlife
Service. US Department of the Interior. [Public domain], via Wikimedia Commons
How do you tag a corpus?
Decide
what to add
how to add
how to use the result
Pinza decides. [Public domain], via Wikimedia Commons
What to add, how to add, how to use?
• What are you marking up? (POS, lemma, clause?)
• How are you annotating? (manually/automatically?)
• With which tag-set? (CLAWS, Penn Treebank?)
• Format of annotation? (HTML, XML, CHAT?)
• Whose linguistic analysis? (mine or established standard?)
• How are you going to use the annotations for your analysis? (tools?)
• (How are your annotations going to be shared with others?)
Hands-on
Explore annotation:
Exercise 1.5 ‘More search features’ (borrow)
Annotate a text:
Try some online taggers
Some online taggers
Example sentence: “John was very offended by her remarks”
Free CLAWS WWW trial service (http://ucrel.lancs.ac.uk/claws/trial.html)
C5: John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._.
C7: John_NP1 was_VBDZ very_RG offended_JJ by_II her_APPGE remarks_NN2 ._.
CST's Part-Of-Speech tagger (http://www.cst.dk/online/pos_tagger/uk/)
John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./.
Infogistics tTAG (http://www.infogistics.com/posdemo.htm)
([ John_NNP ]) <: was_VBD :> very_RB offended_VBN by_IN ([ her_PRP$ remarks_NNS ])._.
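To compare the four outputs word by word, they can be read into a common (word, tag) form. The sketch below is one possible way to do that: the regular expression `TOKEN` is written for exactly these four lines (it ignores tTAG's chunk brackets and accepts both `_` and `/` as separators) and is not a general parser for any of the taggers.

```python
import re

outputs = {
    "C5": "John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._.",
    "C7": "John_NP1 was_VBDZ very_RG offended_JJ by_II her_APPGE remarks_NN2 ._.",
    "CST": "John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./.",
    "tTAG": "([ John_NNP ]) <: was_VBD :> very_RB offended_VBN by_IN "
            "([ her_PRP$ remarks_NNS ])._.",
}

# word, then '_' or '/', then tag; brackets and chunk markers are skipped.
TOKEN = re.compile(r"([^\s_/()\[\]<>:]+)[_/]([^\s_/()\[\]<>:]+)")

parsed = {name: TOKEN.findall(line) for name, line in outputs.items()}

# One row per word, one column per tagger.
print("word".ljust(10) + "".join(name.ljust(7) for name in parsed))
for i, (word, _) in enumerate(parsed["C5"]):
    print(word.ljust(10) + "".join(parsed[name][i][1].ljust(7) for name in parsed))
```

In the resulting table, ‘offended’ is an adjective for CLAWS (AJ0, JJ) but a past participle for the other two taggers (VBN), and the label PRP marks a preposition in C5 while PRP$ marks a possessive pronoun in the Penn-style output.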
Potential problems with annotation
It can:
• be incorrect
• be inconsistent
• follow the ‘wrong’ theory
• have the ‘wrong’ level of granularity
• use the ‘wrong’ tag-set
• introduce subjective interpretations
Good practice in annotation
• Annotations should be separable
• Detailed and explicit documentation should be provided
• Annotation practices should be linguistically consensual
• Annotation should observe standards
• Annotation should observe standards
(Leech 2005)
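One way to read ‘separable’ is that the plain text can always be recovered from the annotated version. The sketch below illustrates that idea under assumptions of its own: the functions `separate` and `recombine` and the (start, end, tag) layout are invented for this example and are not taken from Leech or from any particular standard.

```python
def separate(tagged):
    """Split word_TAG text into plain text and a list of (start, end, tag)."""
    words, annotations, position = [], [], 0
    for token in tagged.split():
        word, tag = token.rsplit("_", 1)
        annotations.append((position, position + len(word), tag))
        words.append(word)
        position += len(word) + 1  # +1 for the space that follows the word
    return " ".join(words), annotations


def recombine(text, annotations):
    """Rebuild the word_TAG version from plain text plus annotations."""
    return " ".join(f"{text[start:end]}_{tag}" for start, end, tag in annotations)


tagged = "John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._."
text, annotations = separate(tagged)

print(text)             # the text with no tags in it
print(annotations[:2])  # tags kept apart, pointing back into the text
```

Because the tags are kept apart from the text, the same text could carry a second set of annotations (a different tag-set, or someone else's analysis) without being changed.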
Next week: Creating a corpus
Same time, same place
Please register via IT Services webpage
Reading tip:
Developing Linguistic Corpora: a Guide to Good Practice,
edited by Martin Wynne
http://ota.ox.ac.uk/documents/creating/dlc/index.htm