Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra
![Page 1: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/1.jpg)
Foundations of Language Science and Technology
- Corpus Linguistics -
Silvia Hansen-Schirra
![Page 2: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/2.jpg)
Outline
Why corpora, why interpreted corpora
Many types of annotation: linguistic annotation, non-linguistic annotation
New developments
![Page 3: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/3.jpg)
Why corpora?
● Cognition: models of human language processing
● Engineering: language technology applications
● Linguistics: linguistic theory
![Page 4: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/4.jpg)
Empirical linguistics
● corpus data
● experimental psycholinguistic data
● introspective data
DB of relevant data
research
![Page 5: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/5.jpg)
Engineering motivation
● information extraction
● question-answering
● statistical machine translation
● parser training and evaluation
=> increased need for deeply annotated corpora
![Page 6: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/6.jpg)
Cognitive motivation
● experience-oriented frequency-based models
● models of gradient grammaticality
● metrics of complexity
![Page 7: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/7.jpg)
Resource description metadata
language: Spanish, English, German
sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
text sort(s): newspaper articles, wire news, political speech, control commands
subject domain: stock rates, flight reservations
type of producers: professional journalist, student, radiologist
mode of production: spoken, written, signed, morsed
medium of production: pencil, PC with MS Word, dictaphone
conditions of production: spontaneous, carefully composed, produced under time pressure
transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
medium of transmission: telephone, WWW, CB radio
storage encoding: raw ASCII code, HTML, AIFF
medium of storage: DAT tape, CD ROM, hard disk
mode of presentation: spoken, written, signed
medium of presentation: newspaper, radio, book, TV show, theater performance
type of intended recipients: newspaper reader, booking agent, theater audience
number of intended recipients: point-to-point, multicast, broadcast
synchronicity of discourse: synchronous dialogue, asynchronous
direction: one-way, two-way
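As a rough illustration, a few of the descriptors above could be stored as a structured record and used to filter resources. This is only a sketch: the class name `ResourceMetadata`, the field names, and the helper `matches` are hypothetical, and only a subset of the listed descriptors is modelled.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ResourceMetadata:
    """Hypothetical record holding a few of the descriptors listed above."""
    language: str
    register: str
    text_sort: str
    subject_domain: str
    mode_of_production: str
    storage_encoding: str


# Illustrative values picked from the examples on the slide.
sample = ResourceMetadata(
    language="German",
    register="professional jargon",
    text_sort="newspaper articles",
    subject_domain="stock rates",
    mode_of_production="written",
    storage_encoding="HTML",
)


def matches(record: ResourceMetadata, **criteria: str) -> bool:
    """Return True if every requested field has the requested value."""
    fields = asdict(record)
    return all(fields.get(name) == value for name, value in criteria.items())
```

A call such as `matches(sample, language="German", mode_of_production="written")` would then select written German resources.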
![Page 8: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/8.jpg)
Linguistic annotation
● part-of-speech tags
● word sense information
● morphosyntactic features of words
● constituent structures for phrases or sentences
● coreference markers
● dependency structures
● predicate-argument structures
● reference identifications for term phrases
● information structures within sentences
● intonation contours
● speech acts
● discourse relations - discourse structures
![Page 9: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/9.jpg)
Other annotations
● judgements of native speakers on the acceptability or appropriateness of the utterance
● information on speaker(s)
● information on hearer(s) or intended audience
● information on the utterance situation (time, place, circumstances)
● information on the published source
● typographic information
● layout and document structure
● textual transcriptions of spoken utterances
● transcription of pauses
● error tagging
![Page 10: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/10.jpg)
Raw vs. linguistically interpreted corpora
search term: word=form
...play a significant part in determining growth and form.
...each molecule can form four hydrogen bonds...
vs.
search term: word=form & pos=N
...play a significant part in determining growth and form.
search term: word=form & pos=V
...each molecule can form four hydrogen bonds...
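The difference between the two kinds of query can be sketched in a few lines. The toy corpus below reuses the two example sentences; the tags other than N and V, the hand-assigned taggings, and the function name `search` are assumptions made for illustration, not the query syntax of any particular corpus tool.

```python
# Toy corpus: each token is a (word, pos) pair, tagged by hand for illustration.
tagged = [
    [("growth", "N"), ("and", "CONJ"), ("form", "N")],
    [("each", "DET"), ("molecule", "N"), ("can", "MOD"), ("form", "V"),
     ("four", "NUM"), ("hydrogen", "N"), ("bonds", "N")],
]


def search(corpus, word, pos=None):
    """Return (sentence index, token index) pairs for every matching token.

    With pos=None the search behaves like a raw-text query (word form only);
    with a pos value it also requires the part-of-speech tag to match.
    """
    hits = []
    for i, sentence in enumerate(corpus):
        for j, (form, tag) in enumerate(sentence):
            if form.lower() == word and (pos is None or tag == pos):
                hits.append((i, j))
    return hits


raw_hits = search(tagged, "form")        # both sentences match
noun_hits = search(tagged, "form", "N")  # only the noun reading
verb_hits = search(tagged, "form", "V")  # only the verb reading
```

Without tags the query cannot separate the noun from the verb; with tags each reading is retrieved on its own.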
![Page 11: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/11.jpg)
Raw vs. linguistically interpreted corpora
search term: is *ed
Alpha interferon is produced by white blood cells...
search term: were *ed
In the late 1970s interferons were hailed as "wonder drugs"...
vs.
search term: pos=VB {0,1} pos=VVN
Gamma is not induced by viruses at all...
So interferons could be described as the antibiotics of the virus...
Only two of these have yet been identified...
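A tag-based passive query of this kind might be approximated as follows. This is a sketch under assumptions: the sentences are shortened and tagged by hand, `pos=VB` is read as "any tag starting with VB", `{0,1}` as "at most one intervening token", and the function names are hypothetical.

```python
import re

# Hand-tagged toy sentences; the tags imitate the VB / VVN notation of the query.
sentences = [
    [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN"),
     ("by", "PRP"), ("viruses", "NN2")],
    [("interferons", "NN2"), ("could", "VM0"), ("be", "VBI"), ("described", "VVN")],
    [("two", "CRD"), ("have", "VHB"), ("yet", "AV0"), ("been", "VBN"),
     ("identified", "VVN")],
]


def find_passives(sentence, max_gap=1):
    """Find a form of 'be' followed by a past participle, with up to max_gap tokens between."""
    hits = []
    for i, (_, tag) in enumerate(sentence):
        if not tag.startswith("VB"):
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(sentence) and sentence[j][1] == "VVN":
                hits.append((i, j))
                break
    return hits


def raw_hits(sentence):
    """Raw-text counterpart of the wildcard queries 'is *ed' and 'were *ed'."""
    text = " ".join(word for word, _ in sentence)
    return re.findall(r"\b(?:is|were) \w*ed\b", text)
```

On these three sentences the tag pattern finds one passive each, while the raw wildcard pattern finds none, because the auxiliary is either not "is"/"were" or is separated from the participle.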
![Page 12: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/12.jpg)
Syntactically annotated corpora: treebanks
• German treebank project: TiGer Treebank
• English reference treebank: Penn Treebank
• Treebank + semantic information: Prague Dependency Bank
![Page 13: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/13.jpg)
TiGer Treebank
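[Figure: syntax tree for the sentence "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen."]

| word | part-of-speech | morphology | lemma |
|---|---|---|---|
| Im | APPRART | Dat | in |
| nächsten | ADJA | Sup.Dat.Sg.Neut | nahe |
| Jahr | NN | Dat.Pl.Neut | Jahr |
| will | VMFIN | 3.Sg.Pres.Ind | wollen |
| die | ART | Nom.Sg.Fem | die |
| Regierung | NN | Nom.Sg.Fem | Regierung |
| ihre | PPOSAT | Acc.Pl.Masc | ihr |
| Reformpläne | NN | Acc.Pl.Masc | Plan |
| umsetzen | VVINF | Inf | umsetzen |
| . | $. | | |

Node labels: S, VP, NP, NP, PP
Edge labels: HD, SB, OC; HD, OA, MO; AC, NK, NK, NK, NK, NK, NK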
![Page 14: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/14.jpg)
TiGer Treebank
annotation on word level: part-of-speech, morphology, lemmata
![Page 15: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/15.jpg)
TiGer Treebank
node labels: phrase categories
![Page 16: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/16.jpg)
TiGer Treebank
edge labels: syntactic functions
![Page 17: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/17.jpg)
TiGer Treebank
crossing branches for discontinuous constituency types
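Crossing branches arise when the words of a constituent are not adjacent in the sentence. The sketch below checks this for the TiGer example sentence. The assignment of tokens to constituents is an assumed reading of the tree (the VP is taken to cover the fronted PP, the object NP, and the infinitive), and the names are hypothetical.

```python
tokens = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen"]

# Assumed constituents, each given as the set of token positions it covers.
constituents = {
    "PP": {0, 1, 2},               # Im nächsten Jahr
    "NP-SB": {4, 5},               # die Regierung
    "NP-OA": {6, 7},               # ihre Reformpläne
    "VP": {0, 1, 2, 6, 7, 8},      # PP + object NP + umsetzen
    "S": set(range(len(tokens))),  # the whole clause
}


def is_discontinuous(positions):
    """A constituent is discontinuous if its token positions have a gap."""
    return max(positions) - min(positions) + 1 != len(positions)


discontinuous = [label for label, span in constituents.items()
                 if is_discontinuous(span)]
```

Under this reading only the VP is discontinuous, since the finite verb and the subject stand between its parts; drawing it as a tree therefore requires crossing branches.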
![Page 18: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/18.jpg)
Penn Treebank
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))
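The bracketed notation can be read with a small recursive parser. The following is a minimal sketch in plain Python rather than the tooling distributed with the treebank; the function names `parse`, `tagged_words`, and `function_tags` are hypothetical.

```python
import re

BRACKETED = (
    "( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
    "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
    "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"
)


def parse(text):
    """Turn a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read_node()


def tagged_words(node):
    """Collect (word, part-of-speech) pairs from the preterminal nodes."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


def function_tags(node):
    """Collect phrase labels that carry a function tag such as -SBJ."""
    label, children = node
    found = [label] if "-" in label else []
    for child in children:
        if not isinstance(child, str):
            found.extend(function_tags(child))
    return found


tree = parse(BRACKETED)
```

`tagged_words(tree)` yields the word-level annotation, and `function_tags(tree)` the phrase labels that are extended with a syntactic function.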
![Page 19: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/19.jpg)
Penn Treebank
annotation on word level: part-of-speech
![Page 20: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/20.jpg)
Penn Treebank
phrase categories
![Page 21: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/21.jpg)
Penn Treebank
syntactic functions
![Page 22: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/22.jpg)
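Prague Dependency Bank
[Figure: dependency tree; each node is listed as word (English gloss) with the labels shown next to it]
● chce (wants): Sb
● Kdo (who): Sb, ACT.T
● investovat (to-invest): Obj, ACT.VOL.T
● sto (hundred): Obj, RESTR.F
● korun (crowns): Atr, PAT.F
● do (to): AuxP
● automobilu (car): Adv, DIR.F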
![Page 23: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/23.jpg)
Prague Dependency Bank
annotation on word level: lemmata, morphology
![Page 24: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/24.jpg)
Prague Dependency Bank
syntactic functions
![Page 25: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/25.jpg)
Prague Dependency Bank
dependency structure
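A dependency structure can be stored as one head pointer per word. The sketch below does this for the Czech example. Both the word order ("Kdo chce investovat sto korun do automobilu") and the head assignments are assumptions reconstructed from the figure, not data taken from the treebank itself.

```python
words = ["Kdo", "chce", "investovat", "sto", "korun", "do", "automobilu"]

# heads[i] is the index of the governor of words[i]; -1 marks the root.
# Assumed attachments: Kdo -> chce, investovat -> chce, sto -> investovat,
# korun -> sto, do -> investovat, automobilu -> do.
heads = [1, -1, 1, 2, 3, 2, 5]


def dependents(head_index):
    """Return the words directly governed by the word at head_index."""
    return [words[i] for i, h in enumerate(heads) if h == head_index]


def path_to_root(index):
    """Follow head pointers from a word up to the root of the tree."""
    path = [words[index]]
    while heads[index] != -1:
        index = heads[index]
        path.append(words[index])
    return path


root = words[heads.index(-1)]
```

Unlike the constituent trees of the TiGer and Penn treebanks, this representation has no phrase nodes: every node is a word, and every edge links a word to its governor.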
![Page 26: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/26.jpg)
Prague Dependency Bank
semantic information on constituent roles, theme/rheme, etc.
![Page 27: Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra](https://reader035.vdocuments.net/reader035/viewer/2022062521/56814bb1550346895db88571/html5/thumbnails/27.jpg)
New developments
● historical dimension (e.g., Corpus of the History of German Language)
● multilayer stand-off linguistic markup
● multimodal markup/interpretation
● new types of treebanks:
  ● CS treebanks with dependency links (NEGRA, TIGER)
  ● machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
  ● Dependency (Tree)Banks (Prague, PARC)
  ● Grammatical Relation (Tree)Banks (Briscoe & Carroll)
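The multilayer stand-off markup listed among the new developments keeps the primary text untouched and stores each annotation layer separately, pointing into the text by offsets. The sketch below illustrates the idea with character offsets; the layer names, the tags, and the function `resolve` are hypothetical and do not follow any specific stand-off standard.

```python
# Primary data: never modified by annotation.
text = "each molecule can form four hydrogen bonds"

# Each layer is an independent list of (start, end, label) character spans.
layers = {
    "pos": [
        (0, 4, "DET"), (5, 13, "N"), (14, 17, "MOD"), (18, 22, "V"),
        (23, 27, "NUM"), (28, 36, "N"), (37, 42, "N"),
    ],
    "phrase": [
        (0, 13, "NP"), (23, 42, "NP"),
    ],
}


def resolve(layer_name):
    """Pair each annotation of a layer with the stretch of text it points to."""
    return [(text[start:end], label) for start, end, label in layers[layer_name]]


def labels_at(offset):
    """Return, per layer, the labels of all spans covering a character offset."""
    return {
        name: [label for start, end, label in spans if start <= offset < end]
        for name, spans in layers.items()
    }
```

Because the layers only reference the text, further layers (for example coreference or discourse relations) can be added, or existing ones replaced, without touching the other annotations.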