Towards portability and interoperability for linguistic
annotation and language-specific ontologies
Robert Munro & David Nathan
Endangered Languages Archive, School of Oriental and African Studies
Outline
1. Introduction and motivation
2. Linguistic ontologies and markups
3. Representing knowledge
4. Supporting fieldworkers
5. Supporting speakers
6. Conclusions
1. Introduction and motivation
Introduction
The main goal of this paper: how does GOLD meet the requirements of portability for language documentation and description (Bird & Simons, 2003)?
Road-testing: GOLD’s ability to meet the needs of archive users and contributors
Motivation
The Endangered Languages Archive (ELAR) is part of the Hans Rausing Endangered Languages Project (HRELP)
HRELP supports:
  the archive
  grants for documentation projects
  postgraduate programs focussing on language documentation
Motivation
We (ELAR):
  support a digital archive (preserve data and provide access to it)
We also train students and grantees in:
  markup strategies
  data management strategies
  multimedia development
  choice of recording equipment
Motivation
There is concern that cataloguing metadata (IMDI / OLAC) has not yet been sufficiently extended (Nathan and Austin, 2004)
  rich linguistic and contextual information is not being recorded in well-formed portable formats/structures
Common ontologies present a solution to this
How does GOLD meet our needs?
We find GOLD to be the most suitable ontology for supporting data portability
GOLD’s focus has been on ‘data analysis sets’
Summary
We suggest extending the focus to:
  data acquisition
  data access
Key extensions:
  formalising the definitions of concepts by representing them as a set of formal properties
  explicitly capturing the conventions and constraints for presentation (rendering)
  modelling features that are inherently indeterminate and/or complex structures
2. Linguistic ontologies and markups
Linguistic ontologies and markups
Ontology: strictly, what we agree exists
Markup: strictly, what we are certain about
Ontology and markup converge:
  only with consensus and complete confidence
  but there is rarely full confidence in the classification of new hard-to-classify phenomena in little-studied endangered languages
Indeterminacy
Builders of ontologies outside of linguistics have been reluctant to accept inherent indeterminacy:
In some cases, the incompatibilities [between ontologies] can be smoothed over by tweaking definitions of concepts or formalizations of axioms; in other cases, wholesale theoretical revision may be required. (Niles & Pease, 2001)
If we can identify the incompatibilities, we can model them
Supporting linguistics
A theory-neutral model of linguistics is not possible:
  Theories are poly-centric
  They will change
We need a pan-theory model of linguistics
Formalising definitions
Each concept in GOLD should be represented by a set of properties that describe that concept
Three possible values for a given property: ‘Yes’, ‘No’, or ‘Undefined’ (default)
To accurately represent variance: include enough properties to distinguish terms
For portability: include as many properties as possible
Formalising definitions
‘Yes’ can potentially be expanded:
  whether the property is mandatory or optional for the concept
  dependencies between properties for a concept
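The three-valued property scheme described on these slides can be sketched roughly as follows. This is only an illustration, not GOLD's actual data model: the `Value` and `Concept` names and the property names `heads_noun_phrase` and `takes_tense_marking` are hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Value(Enum):
    YES = "yes"
    NO = "no"
    UNDEFINED = "undefined"  # the default for any property not stated


@dataclass
class Concept:
    name: str
    # Only explicitly stated properties are stored; all others are UNDEFINED.
    properties: dict = field(default_factory=dict)
    # Properties whose 'Yes' is mandatory rather than optional for the concept.
    mandatory: set = field(default_factory=set)

    def value(self, prop: str) -> Value:
        """Look up a property, falling back to UNDEFINED."""
        return self.properties.get(prop, Value.UNDEFINED)


# Hypothetical property names, for illustration only.
noun = Concept(
    name="Noun",
    properties={
        "heads_noun_phrase": Value.YES,
        "takes_tense_marking": Value.NO,
    },
    mandatory={"heads_noun_phrase"},
)

print(noun.value("heads_noun_phrase"))  # explicitly stated
print(noun.value("verb_suffix"))        # never stated, so UNDEFINED
```

Storing only the stated properties keeps ‘Undefined’ as the default without having to list every possible property for every concept.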
Example
‘Noun’ in GOLD:
  Noun Definition: A noun is a broad classification of parts of speech which include substantives and nominals (Crystal 1997:371; Mish et al. 1990:1176).
  (http://emeld.org/gold-ns/description.html#Noun, last checked 23/05/2003)
How do I know if my definition is the same as that of Crystal or Mish et al.?
Is it both definitions, or the common ground?
Example
Will future users of GOLD have the same definition?
  the core of ‘noun’ may have longevity
  the boundaries with other concepts will not
COPEs can define extensions in terms of sets of properties, and add those properties to GOLD
Example
GOLD: NOUN
COPEs: GerundNOUN, NomVerbNOUN
Can’t formally identify the similarities
Example
GOLD: NOUN
COPEs: GerundNOUN (+ property: verb suffix), NomVerbNOUN (+ property: verb suffix)
Can formally identify the similarities
Definition of NOUN can grow
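A minimal sketch of how shared properties could make the similarity between two COPE extensions formally identifiable. The dictionary representation and the `shared_properties` helper are assumptions made for illustration; `verb_suffix` comes from the slide, while `takes_object` is a hypothetical extra property added only to show that non-shared properties are filtered out.

```python
def shared_properties(a: dict, b: dict) -> dict:
    """Return the properties that both concepts define with the same value."""
    return {prop: val for prop, val in a.items() if prop in b and b[prop] == val}


# Properties not listed are treated as undefined.
gerund_noun = {"verb_suffix": "yes", "takes_object": "yes"}  # takes_object is hypothetical
nom_verb_noun = {"verb_suffix": "yes"}

print(shared_properties(gerund_noun, nom_verb_noun))
```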
3. Representing knowledge
Rendering
Separating form from content:
  ideal for flexibility
  not possible for some materials (esp. video)
Rendering conventions / constraints
Some are well known:
  italicize part-of-speech in dictionaries
  align interlinear transcriptions
Some are not:
  representation of language-specific kinship systems, ethnobotanical ontologies etc
Solution 1
Include a (written) description and/or example of the rendering conventions and constraints:
  hard-code the interface
Solution 2
Include formal representations of the conventions within the data:
  interface takes instructions from the data
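As a rough sketch of Solution 2, the data below carries its own rendering rules and a generic interface simply follows them. The entry layout, the `render` key, the style names and the HTML-like output are all assumptions for illustration; the only convention taken from the slides is italicising the part of speech in a dictionary entry.

```python
# A dictionary entry that carries its own rendering instructions.
entry = {
    "fields": {
        "headword": "example",
        "pos": "n.",
        "gloss": "an illustrative entry",
    },
    "render": [
        {"field": "headword", "style": "bold"},
        {"field": "pos", "style": "italic"},
        {"field": "gloss", "style": "plain"},
    ],
}

# The interface only knows how to apply named styles, not which field gets which.
STYLES = {
    "bold": lambda s: f"<b>{s}</b>",
    "italic": lambda s: f"<i>{s}</i>",
    "plain": lambda s: s,
}


def render(entry: dict) -> str:
    """Render an entry by following the instructions stored with the data."""
    parts = []
    for rule in entry["render"]:
        text = entry["fields"][rule["field"]]
        parts.append(STYLES[rule["style"]](text))
    return " ".join(parts)


print(render(entry))
```

Changing the order or style of fields then means editing the data, not the interface code.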
Solutions
These are two extremes:
  hard-coded and language specific
  data driven and language independent
Database architectures and linguistic ontologies:
  not designed for navigation
  ‘transparent’ access to such structures – who does it support?
4. Supporting fieldworkers
Supporting indeterminacy
There are two kinds of indeterminacy in linguistics:
  confidence in assigning a category (uncertainty)
  phenomena that are inherently variable, probabilistic, gradient or continuous
The most valuable information
The most valuable information that a field linguist learns may be the least likely to be annotated
Example: 7uhch in Lakandon Maya:
  A temporal-modal deictic expressing participant frames and speaker's footings (Bergqvist 2005)
  This term has been given the most thought by the researcher, but it is still not completely understood
  The uncertainty (or the extent of certainty) should be recorded: all the properties we do know
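One way to record what is known about a term without forcing a category on it might look like the sketch below. The `Annotation` structure and its field names are hypothetical; the three properties listed for 7uhch are taken only from the description on the slide (temporal, modal, deictic), and nothing further is claimed about the term.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    form: str
    category: Optional[str] = None      # None: no category assigned yet
    confidence: Optional[float] = None  # analyst's confidence in `category`, 0.0 to 1.0
    known_properties: dict = field(default_factory=dict)
    note: str = ""


def is_uncategorised(annotation: Annotation) -> bool:
    """True when the analyst has recorded properties but no category."""
    return annotation.category is None


token = Annotation(
    form="7uhch",
    known_properties={"temporal": True, "modal": True, "deictic": True},
    note="Expresses participant frames and speaker's footings (Bergqvist 2005); "
         "not yet completely understood.",
)

print(is_uncategorised(token))
```

Because the known properties are stored even when `category` is empty, the annotation stays searchable and can be revisited once a category is agreed.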
5 reasons for modelling uncertainty
1. To record the extent of our knowledge
  For example, we want everything known about 7uhch in Lakandon Maya to be recorded, even if we don’t yet have a category for it
5 reasons for modelling uncertainty
2. For searchability
  If an archive implementing an ontology with uncertain categories exists, then we can more easily find existing solutions to a problem
  If a problem is truly new, then we can allow future researchers to find it
5 reasons for modelling uncertainty
3. To reach certainty
  Even an indeterminate markup can allow a corpus analysis that can inform a decision about assigning the appropriate category
5 reasons for modelling uncertainty
4. To highlight problems with descriptive frameworks
A feature may only appear to belong to multiple (or no) categories because the descriptive framework does not yet account for it
5 reasons for modelling uncertainty
5. Because the concept is inherently indeterminate
The concept may be inherently fuzzy but not previously encountered as a continuous / contiguous phenomenon
Inherently indeterminate features
E.g.: cline, gradience, squish, continuities, contiguities, vague, fuzzy, probabilistic
Many prosodic, semantic and discourse features are inherently continuous
There are growing arguments for probabilities to be part of our formal linguistic models for morphological and syntactic structures (Aarts, 2004; Baayen, 2003; Manning, 2003)
Inherently indeterminate features
Representing categories by formal properties meets the current requirements of modelling gradience (Aarts, 2004)
Perhaps the “ContinuousObject” concept of SUMO (Niles & Pease, 2001) could also be used?
The problem is, currently, largely unresolved
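One simple reading of how formal properties could express gradience is to score an item by the share of a concept's properties it matches. This is a sketch under that assumption and is not presented as Aarts's formulation; the `membership` function and the property names `p1` to `p4` are hypothetical.

```python
def membership(item: dict, concept: dict) -> float:
    """Fraction of the concept's defined properties that the item matches."""
    if not concept:
        return 0.0
    matches = sum(1 for prop, val in concept.items() if item.get(prop) == val)
    return matches / len(concept)


# Hypothetical properties: the item agrees with the concept on three of four.
concept = {"p1": "yes", "p2": "yes", "p3": "no", "p4": "yes"}
item = {"p1": "yes", "p2": "yes", "p3": "no", "p4": "no"}

print(membership(item, concept))
```

A score between 0 and 1 lets an item sit partway into a category instead of being forced in or out, although it treats every property as equally important, which may not hold in practice.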
Incorporating new categories
How do we know that a given category is not the same as another one identified elsewhere?
Formal properties for concepts give us another means for comparison
Incorporating structures
As well as inherently discrete phenomena and inherently indeterminate ones, there is a third kind:
  concepts that are complex structures
  common in syntax and discourse semantics
How do we model a structure in an ontology?
5. Supporting speakers
Users of EL archives
The largest (and growing) user group for endangered languages materials is the speakers of endangered languages
Rarely interested in linguistic categories or navigating a corpus or archive via them
Supporting language-specific ontologies means supporting information-rich structures for both navigation and analysis
Case Study: Yolngu kinship
The Yolngu languages have an extensive kinship terminology called Gurrutu
  27 terms that identify individuals and sets of individuals in terms of moiety, generation, gender, and patriline or matriline.
The terms extend infinitely through cyclicity
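The idea that a finite set of terms can extend indefinitely through cyclicity can be sketched with modular arithmetic. This is a generic illustration only: the labels `term_0` to `term_3`, the cycle length of four and the single generation dimension are placeholders, and none of them describe the actual Gurrutu system.

```python
# Placeholder labels, NOT real Gurrutu terms; the cycle length is an assumption.
CYCLE = ["term_0", "term_1", "term_2", "term_3"]


def term_for_generation(offset: int) -> str:
    """Map any generation offset (ascending or descending) onto the finite cycle."""
    return CYCLE[offset % len(CYCLE)]


print(term_for_generation(1))
print(term_for_generation(5))   # one full cycle later, the same term recurs
print(term_for_generation(-1))  # negative offsets wrap around as well
```

A navigation interface built on such a mapping would never run out of terms, however far a relationship is traced.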
Case Study: Yolngu kinship
Speakers draw from the same sets of kinship relations to describe their relationship to the Yolngu lands
We cannot always annotate well-known linguistic concepts independently of language-specific ontologies
6. Conclusions
Conclusions
Ontology building for endangered languages can be very different to other ontology projects
  The uncertain is often more valuable than the certain
  The local is often more interesting than the universal
  … but will still need interoperability
We suggest extending the focus of GOLD to:
  data acquisition
  data access
Conclusions
Current GOLD does not need to be altered to incorporate our suggestions
  except to remove assumptions of invariability
Key extensions:
  formalising the definitions of concepts by representing them as a set of formal properties
  explicitly capturing the conventions and constraints for presentation (rendering)
  modelling features that are inherently indeterminate and/or complex structures
References
Aarts, B 2004 Modelling linguistic gradience. Studies in Language, 28(1):1–49.
Bateman, J 1992 The theoretical status of ontologies in natural language processing. In Text Representation and Domain Modelling – ideas from linguistics and AI, Technische Universität Berlin
Baayen, H 2003 Probabilistic Approaches to Morphology. In Bod, R., Hay J. and Jannedy, S. (eds). Probabilistic Linguistics. MIT Press.
Bergqvist, H 2005 Semantics of temporal deictics in Lakandon Maya. Presentation given at the ELAP-ELAR seminar series, SOAS, London.
Bird, S & G Simons. 2003. Seven Dimensions of Portability for Language Documentation and Description, Language 79/3: 557-582.
Christie, M & W Gaykamangu 2003. “Kinship, moiety, land & language in Arnhem Land”. In literacy link. Australian Council for Adult Literacy, vol 23, no 5 Oct 2003.
Christie, M, W Gaykamangu & D Nathan. 2001. Yolngu Languages and Culture: Gupapuyngu. Faculty of Aboriginal and Torres Strait Islander Studies, NTU [Multimedia CD-ROM]
Crystal, D. 1997 A dictionary of linguistics and phonetics. 4th edition. Cambridge, MA: Blackwell
Cysouw, M, J Good, M Albu & HJ Bibiko 2005 Can GOLD “cope” with WALS? Retrofitting an ontology onto the World Atlas of Language Structures. Proceedings of the E-MELD 2005
Farrar, S. & D. T. Langendoen. 2003. A linguistic ontology for the Semantic Web. GLOT International 7 (3), 97-100.
Farrar, S. 2003a Markup and the GOLD ontology. Proceedings of the EMELD 2003
Farrar, S. 2003b An ontological account of linguistics: extending SUMO with GOLD. Proceedings of the 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering. Beijing
Foley, W A 2003 Genre, register and language documentation in literate and preliterate communities. In Peter K Austin (ed.) Language Documentation and Description vol 1
Grinevald, C 2003 Speakers and documentation of endangered languages. In Peter K Austin (ed.) Language Documentation and Description volume 1
Gruber, T R. 1993 A translation approach to portable ontologies. Knowledge Acquisition, 5(2), 199-220
Himmelmann, N P 1998 Documentary and descriptive linguistics. Linguistics 36. 161-195. Berlin: de Gruyter.
Holton, G 2003 Approaches to digitization and annotation: A survey of language documentation materials in the Alaska Native Language Center Archive. Proceedings of the EMELD 2003
Manning, C. 2003 Probabilistic Syntax. In Bod, R., Hay J. and Jannedy, S. (eds). Probabilistic Linguistics. MIT Press.
Nathan, D. (ed) 1996. Australia’s Indigenous Languages. Adelaide: SSABSA
Nathan, D and P K Austin (2004) Reconceiving metadata: language documentation through thick and thin. In Peter K Austin (ed.) Language Documentation and Description Volume 2.
Niles, I & A Pease. 2001. Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001)
Penton, D, C Bow, S Bird & B Hughes. 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004