folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that...

12
27/10/2015 1 Language Documentation and Corpus Linguistics Ulrike Mosel ISFAS, University of Kiel, [email protected] Bielefeld, 04.11.2015 Teop area, Tinputz, Bougainville, Papua New Guinea 1 1. Corpus typology What is special about corpora of endangered languages? 2. Corpus building in language documentation 3. Corpora for language maintenance and linguistics 4. Corpus exploitation 4.1 Annotation 4.2 Searching 5. Corpus size and grammatical analysis 6. Recommendations for corpus compilation and exploitation References Content 2 Basic concepts text any piece of natural speech (story, proverb, conversation, sermon, ....) corpus “collection of sampled texts, written or spoken, in machine readable form which may be annotated with various forms of linguistic information” (McEnery et al. 2006:4) sampling frame “specifies how samples are to be chosen from the population of text” (McEnery & Hardie 2012:250) 1 Corpus typology 3 Corpus typology (McEnery & Hardy 2012:8-12) 1. Monitor corpus “grows in size over time” 2. Sample corpus seeks “to be balanced and representative within a particular sampling frame “Balance, representativeness and comparability are ideals which corpus builders strive for but rarely , if ever, attain.” (p. 10) Balancedness and representativeness are a matter of degree . But how are they measured? What’s the standard? (see below) 4 3. Opportunistic corpora “represent nothing more or less than the data that it was possible to gather for a specific task.”

Upload: others

Post on 01-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

1

Language Documentation and Corpus Linguistics Ulrike Mosel

ISFAS, University of Kiel, [email protected]

Bielefeld, 04.11.2015

Teop area, Tinputz, Bougainville, Papua New Guinea 1

1. Corpus typology

What is special about corpora of endangered languages?

2. Corpus building in language documentation

3. Corpora for language maintenance and linguistics

4. Corpus exploitation

4.1 Annotation

4.2 Searching

5. Corpus size and grammatical analysis

6. Recommendations for corpus compilation and exploitation

References

Content

2

Basic concepts

text any piece of natural speech

(story, proverb, conversation, sermon, ....)

corpus “collection of sampled texts, written or spoken,

in machine readable form

which may be annotated with various forms

of linguistic information” (McEnery et al. 2006:4)

sampling frame “specifies how samples are to be chosen

from the population of text”

(McEnery & Hardie 2012:250)

1 Corpus typology

3

Corpus typology (McEnery & Hardy 2012:8-12)

1. Monitor corpus “grows in size over time” 2. Sample corpus seeks “to be balanced and representative within a particular sampling frame

“Balance, representativeness and comparability are ideals which corpus builders strive for but rarely , if ever, attain.” (p. 10)

Balancedness and representativeness are a matter of degree. But how are they measured? What’s the standard? (see below)

4

3. Opportunistic corpora “represent nothing more or less than the data that it was possible to gather for a specific task.”

Page 2: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

2

Against the notion of representativeness Teubert & Čermáková (2007:61, 69)

To claim “that a given corpus is representative of a discourse” in principle presupposes “access to all the texts the discourse consists of.”

“But if this utopia came true, we could well do without the corpus. ... it does not make much sense to talk about representativeness.”

5

Typical corpus LD corpus

language well-researched European language

unresearched, non-European

texts monolingual printed recorded, transcribed, translated

corpus builder

team of native speakers single non-native speaker

size millions of words below 1 mio

purpose linguistic research language preservation, linguistic and interdisciplinary research

compilation specified by a sampling frame

opportunistic monitor corpus

6

2 Corpus building in language documentation

Classification of texts (Biber & Conrad 2009:Ch. 1 & 2)

Linguistic variation is systematic Selection of linguistic features depends on non-linguistic factors. Certain linguistic features are more frequent in some speech situations than in others.

Register analysis 1. Identify the situational characteristics of the texts. 2. Identify typical (pervasive) linguistic features. 3. Interprete the relationship between situational characteristics

and pervasive linguistic features.

7

Situational characteristics of text varieties (abbrev. from Biber & Conrad 2009:40)

Participants and their social characteristics

Mode (speech /writing), Medium (taped, radio, handwritten ...)

Production circumstances (real time, planned, edited, ...)

Setting (private / public; ...)

Communicative purposes (narrate, report, describe, ...)

Topic

The situational chatacteristics should be documented in the metadata of each text.

8

Page 3: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

3

Corpus building with the Teop people

Bougainville, Papua New Guinea

9

Corpus building with the Teop people

Language: Teop genetic affiliation: Austronesian, Oceanic location: Bougainville, Papua New Guinea Sociolinguistic background: number of speakers: ca 6000 subsistence farming many literate people civil war 1988-1999 Project: 1994 first recordings by Ruth Saovana Spriggs (funded by Australian Research Council) DOBES Project 2000-2007 DFG Project 2009-2011 2011- funded by private resources

10

Collaborative language documentation

11

The Teop (opportunistic) corpus

free access restricted access

http://corpus1.mpi.nl/ds/imdi_browser/

structure and access needs to be revised as soon as possible!

12

Page 4: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

4

Revised corpus structure of the Teop Language Corpus (on my computer)

Metadata also need revision

13

3. Corpora for language maintenance and linguistics

The aims of the LD are set by the linguist & indigenous documenters

create something useful for the speech community (Pear-stories and Frog-stories are not useful!)

trust that this will be useful for lingustics

Production of educational materials substantial part of the language documentation

(Mosel 2012b)

14

New genres: 1. edited versions of legends

2. autobiographical narratives 3. procedural texts 4. encyclopedic descriptions of

plants, animals, artefacts

“mini-dictionaries” (Mosel 2011)

15

A INU

The Teop-English Dictionary of House Building

Mark Mahaka, Enoch Horai Magum, Joyce Maion,

Naphtaly Maion, Ruth Siimaa Rigamu, Ruth Saovana

Spriggs, and Jeremiah Vaabero

with Ulrike Mosel, Marcia Schwartz and Yvonne Thiesen,

drawings by Neville Vitahi, photographs by Ulrike Mosel

16

Page 5: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

5

3.1 Different modes: recorded speech vs. edited texts

1. original recordings with annotations

2. edited versions of the recordings

The comparative corpus

• gives a fuller picture of the expressive potential

of the language;

shows alternative ways of expressing the same content

• provides a new type of data for research on

what speakers actually do

when they put an oral text into writing

17 (Mosel 2015)

3.2.1 Legends

- may contain archaic expressions

-are situated in imaginary worlds

where animals talk and

things can change into living beings

> interesting for noun classification

- contain direct speech

interjections, swearing ....

18

3.2 Different genres

3.2.2. Encyclopedic descriptions

non-verbal clauses in definitions

(1) SUBJ.NP PRED.NP QUALIFICATIVE ATTRIBUTIVE AP The bokua (is) a fish a big (one) (2) SUBJ.NP PRED.NP POSSESSIVE AP The booboo (is) a fish (with) a strong skin (3) SUBJ.NP PRED.NP RELATIVE CLAUSE The shelf (is) a thing that we put things (on).

19

Definitions of “thing”-words

supply excellent examples for:

1. non-verbal clauses

2. topicalisation

3. various kinds of modifiers

(1) adjectival phrases (‘big’)

(2) possessive adjectival phrases

(‘having a thick skin’)

(3) relative clauses

20

Page 6: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

6

Definitions of “action”- words NMLZ DEM COMPLEMENT CLAUSE ‘The tearing... this (is) when we remove ....’

A siri atovo ART tear sago.palm.leaf 'The tearing of the sago palm leaf, .

ei be- ara gono kahi o paka DEM when- 1PL.INCL get from ART leaf this (is) when we get from the leaf.' bono sikiri nae. ART midrib 3SG.POSS the midrib 21

3.2.3 Narratives vs procedural texts

Narratives Procedural texts

Paratactic clauses Coordinate clauses

Adverbial clause constructions: ‘when ..., then...’

Sequence of past events Regular fixed order of actions

> create a corpus of comparative narrative and procedural texts minimise variables

22

Buy a chicken and ...

23

procedural text: 40 clauses, 12 adverbial clause constr. narrative text: 53 clauses, no adverbial clauses 13 paratactic clauses

Make series of photographs and use them as stimuli for 1. the description of how to butcher a chicken 2. the narrative of how the twins helped

their father butchering a chicken

24

Page 7: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

7

3.3 Different themes – different grammatical phenomena

3.3.1 Tropical fishes are colourful

Do colour-words behave like beera ‘big’ and mataa ‘good’?

-do they have the same morphology? - do they have the same syntactic functions? - do they enter comparative constructions?

25

NPSUBJECT VC NPOBJECT The sinarona [is redding passing] the aranavi. The sinarona is redder than the aranavi

A sinarona na gogooravi oha nana bona aranavi. ART sinarona TAM red pass TAM ART aranavi

TAM = tense/ aspect/ mood marker

26

Teop clause structure:

intransitive: SUBJ VC

transitive: SUBJ VC OBJ

agent patient/ recipient / theme

ditransitive: SUBJ VC OBJ1 OBJ2

agent recipient theme

agent patient instrument

3.3.2 What trees are good for

Translation equivalents of Teop ditransitive clauses: The man gave the child a coconut. The man made the canoe from wood,

27

Descriptions of trees and what the parts of trees are used for:

OBJ2 VC SUBJ OBJ1 ‘The putty-nut plaster they the canoe. ‘The putty-nut, they use for plastering (the knotholes of) the canoe’

[-hum] NPs = topic

28

Page 8: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

8

The more tree descriptions, the more clauses with OBJ2 VC SUBJ OBJ1 wordorder! It makes only sense to speak of a dominant constituent order with respect to a particular type of genre.

What is the dominant word order in Teop?

>> quantitative analysis of constituent orders in different genres >> for each clause you need to annotate the constituent order

General problem of LD corpora: annotation is very time-consuming.

29 30

4. Corpus exploitation

ELAN : corpus building and expoitation tool

facilitates:

1. time aligned annotations of audio and video recordings

2. annotations on several tiers

3. simultaneous searches on several tiers

4. searches with the query language Regular Expressions

5. concordances and statistics of search results

6. export to Toolbox, Praat, printable files

31

4.1 Annotation

(McEnery et al. 2006:30-45, 75)

1. gives additional information on the form of the texts;

2. “records a linguistic analysis explicitly “;

3. “imposes a linguistic analysis upon a corpus user”

but

4. this record of an explicit analysis is “open for scrutiny”

32

Page 9: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

9

Annotation in LD corpora in ELAN

Research independent annotation 1. minimal annotation (DoBes) 2. gramm_units (Erfurt Referentiality Project) 3. interlinear morphological glossing

(Leipzig glossing rules)

33

Research dependent annotation 1. GRAID (Haig & Schnell 2011) 2. Coreference annotation (McEnery et al. 2006:38-40)

Minimal annotation 3 Tiers: ref: numbered segments t: transcription f: free translation

34

35

4.2 Searches in ELAN

structured search on multiple files – single tier

concordance view – keyword in context - kwic

36

4.2 Searches in ELAN : collocations

Is old the only collocation of woman?

Examine 840 examples???

Page 10: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

10

37

Search with regular Expressions

38

Search for: a/an/the followed by a word that does not start with o followed by woman/women

39

Multiple layer & column search

peha indefiniteness marker bona anaphoric determiner

If the protagonist is introduced by a NP with peha in the first utternance/clause of the text, how is it referred to in the second sentence?

Alignment view

40

Multiple layer search

Page 11: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

11

Frequency counts are not sufficient for describing grammar, however. they point to interesting phenomena that deserve further investigation and interpretation

(Conrad 2010:228)

5 Corpus size and grammatical analysis

Does the corpus supply enough data for the research question? How big is big enough?

41

The smaller the corpus, the more restricted are the research topics, and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic item .

42

6. 1 Recommendations for corpus compilation:

Focus on a few registers/genres.

The more diversified the corpus is,

the smaller are the subcorpora, and

the smaller the probability that

you can adequately identify regular patterns of language usuage.

43

6.2 Recommendations for corpus exploitation # Use ELAN or a similar tool with an implemented powerful query language. # Keep in mind that extended annotation is very time consuming. Don’t get lost in annotations. # Document your annotation rules and your search methods. # Make your corpus and the metadata accessible. # Aim at scientific research that is replicable and falsifiable.

44

Page 12: Folie 1 - uni-bielefeld.de · and the more restricted the situational variables must be. so that the corpus presumably covers the full range of variability of the researched linguistic

27/10/2015

12

45 46