Session 4: Annotation - University of Oxford

20th February, 2013

Session 4: Annotation

http://tinyurl.com/669o4zt

Corpus Linguistics

More than the Text:

Annotation - what, why, how?

Ylva Berglund Prytz and Martin Wynne

IT Services

CC BY-SA unless otherwise stated

Remember:

Corpus = text + metadata + annotation

STARTING POINT:

You can only find what is in the corpus... so unless someone has included it, you cannot (easily) find it.

What can you find automatically?

• All instances of ‘work’

• All instances of ‘WORK’ (lemma)

• All instances of ‘work’ as a verb

• All instances of ‘work’ in fiction

• All instances of ‘work’ spoken by women

• All instances of ‘work’ at end of clause

• All instances of ‘work’ in jokes
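
The list above can be read as a set of queries, each of which needs a different layer of information in the corpus. The following is a minimal sketch, assuming a toy corpus invented purely for illustration (the `Token` record, its field names and the five tokens are hypothetical, not taken from any real corpus). A plain word-form search needs only the text; the lemma, word-class and genre searches work only because that information has been added. Nothing here marks clause ends or jokes, so those two queries cannot be answered.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # headword (annotation)
    pos: str    # simplified word class (annotation)
    genre: str  # text-level metadata


# Hypothetical toy corpus, invented for illustration only.
CORPUS = [
    Token("work", "work", "VERB", "fiction"),
    Token("worked", "work", "VERB", "news"),
    Token("work", "work", "NOUN", "news"),
    Token("works", "work", "NOUN", "fiction"),
    Token("working", "work", "VERB", "fiction"),
]


def find(corpus, **criteria):
    """Return the tokens whose fields match every given criterion."""
    return [
        tok for tok in corpus
        if all(getattr(tok, field) == value for field, value in criteria.items())
    ]


by_form = find(CORPUS, form="work")                # needs text only
by_lemma = find(CORPUS, lemma="work")              # needs lemma annotation
as_verb = find(CORPUS, form="work", pos="VERB")    # needs POS annotation
in_fiction = find(CORPUS, form="work", genre="fiction")  # needs metadata

print(len(by_form), len(by_lemma), len(as_verb), len(in_fiction))
```

Asking `find` for a field that was never recorded (for example a hypothetical `joke=True`) fails with an `AttributeError`, which is the programmatic version of "you can only find what is in the corpus".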

Example: borrow = 1,423 hits; {borrow} (lemma) = 3,000 hits

Annotation, mark-up, tagging…

1. The practice of adding (interpretative) (linguistic) information to a corpus.

2. The result of (1).

By Jean-Etienne Poirrier (Physical tagging on tree) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

What and how?

By Helgen KM, Portela Miguez R, Kohen J, Helgen L [CC BY 3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons

EXAMPLE: it is true that he was …

it_PP3 is_BEZ true_JJ that_CS he_PP3A was_BEDZ ... (LOB G 28:95)

<w PNP>it <w VBZ>is <w AJ0>true <w CJT>that <w PNP>he <w VBD>was ... (BNC ABU:1683)

• it = PNP / PP3 (third person singular pronoun)

• is = VBZ / BEZ (third person present tense form of BE)

• that = CJT (the subordinating conjunction ‘that’) / CS (subordinating conjunction)

• he = PNP (personal pronoun) / PP3A (personal pronoun, 3rd pers sing nom (he, she))
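
The two encodings above carry the same kind of information in different surface formats. As a rough sketch (the function names `parse_lob` and `parse_bnc_sgml` are made up for this example, and the patterns assume input exactly like the two lines quoted above), both can be reduced to (word, tag) pairs and compared:

```python
import re


def parse_lob(line):
    """Read 'word_TAG' items, as in the LOB example, into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in line.split() if "_" in item]


def parse_bnc_sgml(line):
    """Read '<w TAG>word' items, as in the BNC example, into (word, tag) pairs."""
    return [(word, tag) for tag, word in re.findall(r"<w (\w+)>(\S+)", line)]


lob = parse_lob("it_PP3 is_BEZ true_JJ that_CS he_PP3A was_BEDZ")
bnc = parse_bnc_sgml(
    "<w PNP>it <w VBZ>is <w AJ0>true <w CJT>that <w PNP>he <w VBD>was"
)

# Print the two analyses side by side, one word per row.
for (word, lob_tag), (_, bnc_tag) in zip(lob, bnc):
    print(f"{word:<6}{lob_tag:<6}{bnc_tag}")

# Count how many different tags each scheme needs for these six words.
lob_tagset = {tag for _, tag in lob}
bnc_tagset = {tag for _, tag in bnc}
```

For these six words the LOB line uses six different tags and the BNC line five, because the BNC tags give 'it' and 'he' the same label (PNP) where LOB separates them (PP3, PP3A). That is a small instance of two tag-sets differing in granularity.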

<u who="PS01V">

<s n="4188">

<w c5="CJC" hw="and" pos="CONJ">And </w>

<w c5="UNC" hw="erm" pos="UNC">erm </w>

<pause/>

<w c5="CJC" hw="and" pos="CONJ">and </w>

<w c5="AV0" hw="then" pos="ADV">then </w>

<w c5="PNP" hw="we" pos="PRON">we </w>

<shift new="laughing"/>

<w c5="AV0" hw="so" pos="ADV">so </w>

<shift/>
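
One way to work with markup like this is through a standard XML parser. The sketch below uses Python's built-in `xml.etree.ElementTree`. The excerpt above stops mid-utterance, so the closing `</s>` and `</u>` tags in `SNIPPET` are an assumption added here only so that the fragment parses; they are not part of the slide.

```python
import xml.etree.ElementTree as ET

# The excerpt from the slide, closed off so that it is well-formed XML.
SNIPPET = """
<u who="PS01V">
<s n="4188">
<w c5="CJC" hw="and" pos="CONJ">And </w>
<w c5="UNC" hw="erm" pos="UNC">erm </w>
<pause/>
<w c5="CJC" hw="and" pos="CONJ">and </w>
<w c5="AV0" hw="then" pos="ADV">then </w>
<w c5="PNP" hw="we" pos="PRON">we </w>
<shift new="laughing"/>
<w c5="AV0" hw="so" pos="ADV">so </w>
<shift/>
</s>
</u>
"""

root = ET.fromstring(SNIPPET.strip())

# Metadata about the utterance: the speaker identifier.
speaker = root.get("who")

# Word-level annotation: form, C5 tag and headword for every <w> element.
words = [(w.text.strip(), w.get("c5"), w.get("hw")) for w in root.iter("w")]

# The text with all annotation stripped away.
plain_text = " ".join(form for form, _, _ in words)

# Non-word events (pauses, voice-quality shifts) in document order.
events = [el.tag for el in root.iter() if el.tag in ("pause", "shift")]

print(speaker, "|", plain_text, "|", events)
```

Because the words, the tags and the events are held in separate elements and attributes, the plain text can be recovered without losing or corrupting anything.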

Types of linguistic annotation

– Morphosyntactic / wordclass / part-of-speech (POS)

– Syntactic (e.g. phrase, clause, mood...)

– Semantic

– Pragmatic

– Discourse

– Phonetic

– Phonological

– …

How do you tag a corpus?

Mountain lion tagging 02. By Mountain-Prairie Region. US Fish and Wildlife Service. US Department of the Interior. [Public domain], via Wikimedia Commons

Decide:

• what to add

• how to add

• how to use the result

Pinza decides. [Public domain], via Wikimedia Commons

What to add, how to add, how to use?

• What are you marking up? (POS, lemma, clause?)

• How are you annotating? (manually/automatically?)

• With which tag-set? (CLAWS, Penn Treebank?)

• Format of annotation? (HTML, XML, CHAT?)

• Whose linguistic analysis? (mine or established standard?)

• How are you going to use the annotations for your analysis? (tools?)

• (How are your annotations going to be shared with others?)

Hands-on

Explore annotation:

Exercise 1.5 'More search features' (borrow)

Annotate a text:

Try some online taggers

Some online taggers

Example sentence: “John was very offended by her remarks”

Free CLAWS WWW trial service (http://ucrel.lancs.ac.uk/claws/trial.html)

C5: John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._.

C7: John_NP1 was_VBDZ very_RG offended_JJ by_II her_APPGE remarks_NN2 ._.

CST's Part-Of-Speech tagger (http://www.cst.dk/online/pos_tagger/uk/)

John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./.

Infogistics tTAG (http://www.infogistics.com/posdemo.htm)

([ John_NNP ]) <: was_VBD :> very_RB offended_VBN by_IN ([ her_PRP$ remarks_NNS ])._.
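
The outputs above can be lined up programmatically to see where the analyses differ. This is a small sketch, assuming the C5 and CST output lines exactly as quoted on the slide; `parse_tagged` is a helper written for this example, not part of any of the taggers.

```python
def parse_tagged(line, sep):
    """Split 'word<sep>TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, found, tag = item.rpartition(sep)
        if found:  # skip anything that carries no tag
            pairs.append((word, tag))
    return pairs


c5 = parse_tagged(
    "John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._.", "_"
)
cst = parse_tagged(
    "John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./.", "/"
)

# Align the two analyses token by token; both taggers split this
# sentence into the same eight tokens.
table = [(w1, t1, t2) for (w1, t1), (w2, t2) in zip(c5, cst) if w1 == w2]

# Tokens where both outputs happen to use the same label.
same_label = [word for word, t1, t2 in table if t1 == t2]

for word, t1, t2 in table:
    print(f"{word:<10}{t1:<6}{t2}")
```

Two things stand out. 'offended' is AJ0 in the C5 output and VBN in the CST output, which looks like a real difference of analysis (adjective versus past participle) and not just of labelling. Conversely, 'was' is VBD in both, but the label does not mean the same in each scheme: in C5 it is reserved for past tense forms of BE, while the CST output appears to use Penn Treebank-style tags, where VBD covers any past tense verb. Identical labels are therefore no guarantee of identical categories.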

Potential problems with annotation

It can:

• be incorrect

• be inconsistent

• follow the ‘wrong’ theory

• have the ‘wrong’ level of granularity

• use the ‘wrong’ tag-set

• introduce subjective interpretations

Good practice in annotation

• Annotations should be separable

• Detailed and explicit documentation should be provided

• Annotation practices should be linguistically consensual

• Annotation should observe standards

(Leech 2005)
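
To make the first point concrete: one possible way of keeping annotations separable is to store them apart from the text and point back into it by character position (often called stand-off annotation). The sketch below is only an illustration of that idea under simple assumptions (single spaces between words, tags attached with an underscore); `to_standoff` and the dictionary keys are names invented for this example.

```python
def to_standoff(tagged, sep="_"):
    """Split inline 'word_TAG' text into plain text plus stand-off annotations.

    Each annotation records the character span it applies to, so the
    plain text can be kept, shared or re-annotated on its own.
    """
    words, annotations = [], []
    offset = 0
    for item in tagged.split():
        word, found, tag = item.rpartition(sep)
        if not found:  # untagged item: keep the word, record no tag
            word, tag = item, None
        annotations.append(
            {"start": offset, "end": offset + len(word), "tag": tag}
        )
        words.append(word)
        offset += len(word) + 1  # +1 for the space that joins the words
    return " ".join(words), annotations


text, annotations = to_standoff(
    "John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2"
)

# The raw text survives untouched ...
print(text)

# ... and every annotation can still be matched to the stretch it describes.
for ann in annotations:
    print(text[ann["start"]:ann["end"]], ann["tag"])
```

With this layout a second tag-set, or a corrected version of the first, can be added as another list of spans without editing the text itself.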

Next week: Creating a corpus

Same time, same place

Please register via IT Services webpage

Reading tip:

Developing Linguistic Corpora: a Guide to Good Practice, edited by Martin Wynne

http://ota.ox.ac.uk/documents/creating/dlc/index.htm