Closing the Gap: Data Models for Documentary Linguistics
Baden Hughes
Department of Computer Science and Software Engineering
The University of [email protected]
Latrobe Uni - Linguistics Seminar - 20050505 2
Overview
- Overall Context
- The Electronic Data Format Challenge
- Common Problems
- Data Encoding Models
  - Lexicons, interlinear texts, paradigms, syntactic trees, annotation standards, query languages
- Linguistic Motivations vs Computational Interests
- New Types of Data Exploration
- Effects on Linguistic Analysis
- New Tools
- Conclusions
Overall Context
- Large amounts of human language data continue to be managed in electronic form and analysed in fieldwork-driven linguistic documentation
- An increasing focus on acquisition-centric methodologies has vastly increased the rate of growth of linguistic data
- Basic linguistic data structures remain reasonably static and are largely grounded in the print domain
The Electronic Data Format Challenge
- The methods used for the digital encoding of linguistic data are often disparate
  - At best, they are often reduced to the native formats supported by widely used tools such as Shoebox
  - Conversion is typically complex and lossy
    - Sometimes the loss can’t be predicted in advance
  - Many utility manipulation functions are required to move data between analytical applications and outputs
  - These functions are largely external to analytical environments, with some notable exceptions (e.g. regular expression manipulation)
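The point about lossy conversion can be illustrated with a minimal sketch. It assumes backslash-coded records of the kind Shoebox stores, with `\lx`, `\ps` and `\ge` as field markers; the entries themselves are invented for illustration, and the helper names `parse_sf_records` and `to_flat_dict` are hypothetical, not part of any tool mentioned in the slides.

```python
import re

# A field line is a backslash, a marker name, then an optional value.
MARKER_LINE = re.compile(r"^\\(\w+)\s*(.*)$")


def parse_sf_records(text, record_marker="lx"):
    """Split backslash-coded text into records: lists of (marker, value) pairs."""
    records = []
    current = None
    for line in text.splitlines():
        match = MARKER_LINE.match(line.strip())
        if not match:
            continue  # this sketch skips blank and continuation lines
        marker, value = match.group(1), match.group(2).strip()
        if marker == record_marker:
            current = []  # the record marker opens a new record
            records.append(current)
        if current is not None:
            current.append((marker, value))
    return records


def to_flat_dict(record):
    """Naive conversion to one value per marker; repeated markers are overwritten."""
    return dict(record)


# Invented sample data (not from any real lexicon).
SAMPLE = """\
\\lx kala
\\ps n
\\ge fish
\\ge seafood

\\lx ruma
\\ps n
\\ge house
"""

records = parse_sf_records(SAMPLE)
flat = [to_flat_dict(record) for record in records]

print(records[0])  # keeps both \ge fields, in order
print(flat[0])     # keeps only the last \ge field
```

Flattening each record to a dictionary looks harmless, yet it silently drops the first of the two glosses. That is the kind of loss that is hard to predict without knowing in advance which fields repeat.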
Common Problems
- Despite diversity of language and analytical approach, many documentary and descriptive linguists face a common challenge: the interoperability and longevity of electronic data generated in fieldwork settings.
  - Repurposing data
  - Publishing data on the web
  - Publishing in papers
  - New analysis tools
  - New generation formats
The Emergence of Abstract Language Data Encoding Models
- Recently, a number of formal data encoding models for linguistic data types have emerged from projects investigating "best practice" methods for preserving linguistic data.
- We will briefly consider models for
  - lexicons
  - interlinear texts
  - paradigms
  - syntactic trees
  - annotation standards
  - query languages
Data Models (1)
- Lexicons
  - Bell & Bird (2001)
- Interlinear Text
  - Bow, Hughes & Bird (2003)
  - Hughes, Bird & Bow (2003)
- Linguistic Paradigms
  - Penton, Bow, Bird & Hughes (2004)
  - Penton & Bird (2004)
Data Models (2)
- Syntactic Trees
  - Lai & Bird (2004)
- Annotation Standards
  - Farrar, Lewis & Langendoen (2002)
  - Farrar & Langendoen (2003)
- Query Languages
  - Bird, Chen, Davidson, Lee & Zheng (2005)
  - Cassidy & Bird (2000)
  - Taylor (2004)
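To give a flavour of what querying an XML-encoded syntactic tree can look like, here is a minimal sketch using the path expressions supported by Python's standard `xml.etree.ElementTree` module. The tree, its element names and the helper `yield_words` are assumptions made for illustration; the sketch uses only standard path syntax and does not reproduce any of the models or query languages in the works cited above.

```python
import xml.etree.ElementTree as ET

# Invented toy tree; element names stand for syntactic categories.
TREE_XML = """
<S>
  <NP><N>dogs</N></NP>
  <VP>
    <V>chase</V>
    <NP><Det>the</Det><N>cats</N></NP>
  </VP>
</S>
"""

root = ET.fromstring(TREE_XML)

# Every NP anywhere below the root.
all_nps = root.findall(".//NP")

# Only NPs that are immediate children of a VP.
object_nps = root.findall(".//VP/NP")


def yield_words(node):
    """Return the terminal strings dominated by a node, in document order."""
    return [leaf.text for leaf in node.iter() if len(leaf) == 0 and leaf.text]


print(len(all_nps))
print([yield_words(np) for np in object_nps])
```

Dominance relations such as "an NP directly under a VP" map naturally onto path steps, which is one reason XML path languages are an attractive starting point for linguistic queries.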
Linguistic Motivations
- Data models: so what?
  - It is the combined utility of these models that makes them attractive to documentary linguists
  - The challenge is to lower the barrier to use of these technologies in fieldwork and analytical contexts
  - Linguists (mostly) don’t care about the technology, they just want to do linguistics!
  - Computer scientists are generally not interested in linguistics …
Computational Interests
- The development of such models may be inherently interesting to computationally inclined researchers
  - Human language data encoding and annotation is genuinely interesting in computer science terms; unfortunately basic data modelling isn't
- Technologists have a bad habit of providing advice which is well intended but lacks traction in non-technical communities (e.g. “use XML”)
- Many of the solutions are XML-based, but contain many more components than just XML-encoded data
New Types of Data Exploration (1)
- Open implemented solutions for a range of manipulations are available
  - Lexicons
    - Generation of different types of lexicons
  - Interlinear Text (see following examples …)
    - Generation of different types of interlinear text
    - Induction of morphosyntactic glossing from lexicons
    - Generation of lexicons from interlinear text
    - Enrichment of lexicons from interlinear text
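Two of the manipulations listed above, generating a lexicon from interlinear text and inducing glosses from a lexicon, can be sketched in a few lines once the data is held in a structured form. This is a minimal sketch under stated assumptions: the phrase structure (parallel `morphemes` and `glosses` lists), the invented data and the function names are all hypothetical and are not taken from the implemented solutions the slide refers to.

```python
from collections import defaultdict

# Invented data: each phrase aligns morphemes with their glosses.
INTERLINEAR = [
    {"morphemes": ["kala", "-t"], "glosses": ["fish", "PL"]},
    {"morphemes": ["ruma", "-t"], "glosses": ["house", "PL"]},
    {"morphemes": ["kala"], "glosses": ["fish"]},
]


def lexicon_from_interlinear(phrases):
    """Count how often each morpheme is paired with each gloss."""
    counts = defaultdict(lambda: defaultdict(int))
    for phrase in phrases:
        for morpheme, gloss in zip(phrase["morphemes"], phrase["glosses"]):
            counts[morpheme][gloss] += 1
    return {morpheme: dict(glosses) for morpheme, glosses in counts.items()}


def induce_glosses(morphemes, lexicon, unknown="???"):
    """Gloss each morpheme with its most frequent gloss; flag unknown morphemes."""
    glossed = []
    for morpheme in morphemes:
        candidates = lexicon.get(morpheme)
        if candidates:
            glossed.append(max(candidates, key=candidates.get))
        else:
            glossed.append(unknown)
    return glossed


lexicon = lexicon_from_interlinear(INTERLINEAR)
print(lexicon)
print(induce_glosses(["ruma", "-t", "suur"], lexicon))
```

Picking the most frequent gloss is a deliberately crude choice. A real workflow would leave ambiguous morphemes for the linguist to resolve, which is why unknown items are flagged rather than guessed.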
Nenets Interlinear (1)
Nenets Interlinear (2)
New Types of Data Exploration (2)
- Open implemented solutions for a range of manipulations are available
  - Syntactic Trees
    - Induction of trees from interlinear text
    - Creation of interlinear text from syntactic tree drawing
    - Creation of lexicons from syntactic trees
  - Paradigms (see following examples …)
    - Generation of different types of paradigms
    - Induction of paradigms from interlinear text
    - Annotation of interlinear text from paradigms
    - Enrichment of lexicons from paradigms
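Inducing a paradigm from glossed text amounts to grouping word forms by the feature values they carry and laying them out on two axes. The following is a minimal sketch under stated assumptions: the word forms and feature names are invented, and `induce_paradigm` and `render` are hypothetical helpers, not the paradigm model of the cited works.

```python
from collections import defaultdict

# Invented glossed word forms with their morphosyntactic features.
GLOSSED_WORDS = [
    {"form": "madan", "lemma": "mada", "person": "1", "number": "sg"},
    {"form": "madat", "lemma": "mada", "person": "2", "number": "sg"},
    {"form": "madame", "lemma": "mada", "person": "1", "number": "pl"},
    {"form": "tulen", "lemma": "tule", "person": "1", "number": "sg"},
]


def induce_paradigm(words, lemma, row_feature, col_feature):
    """Map (row value, column value) to the set of attested forms of one lemma."""
    cells = defaultdict(set)
    for word in words:
        if word["lemma"] == lemma:
            cells[(word[row_feature], word[col_feature])].add(word["form"])
    return dict(cells)


def render(cells, rows, cols):
    """Lay the cells out as a tab-separated table; unattested cells show '-'."""
    lines = ["\t".join([""] + cols)]
    for row in rows:
        line = [row]
        for col in cols:
            line.append("/".join(sorted(cells.get((row, col), {"-"}))))
        lines.append("\t".join(line))
    return "\n".join(lines)


paradigm = induce_paradigm(GLOSSED_WORDS, "mada", "person", "number")
table = render(paradigm, rows=["1", "2"], cols=["sg", "pl"])
print(table)
```

Because the cells are keyed by feature values rather than by position, swapping the row and column features yields a differently oriented paradigm from the same data, and empty cells point to forms that still need to be elicited.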
Kanarese Paradigm (1)
Kanarese Paradigm (2)
Effects on Linguistic Analysis
- Integrated encoding standards for linguistic data affect the practice of linguistic analysis
  - Some analysis types are now easier
  - New possibilities emerge
  - New analytical challenges are discovered
  - Data linkage/integration is certainly one of the improvements
New Tools
- The next generation of tools which support these data models natively is emerging, e.g. FIELD, ELAN, Toolbox (almost)
- “Middleware” which allows the translation of legacy formats to and from these models is reasonably widely available
- Analytical tools are increasingly being implemented with web-grounded technologies and using web-derived models
- Open source/open data approaches are becoming pervasive
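The "middleware" idea, translating a legacy format to and from a structured model, can be sketched as a round trip through XML. This is a minimal sketch, assuming legacy records are already available as ordered (marker, value) pairs; the element names `lexicon`, `entry` and `field`, the function names and the sample data are all hypothetical and do not correspond to any published schema or tool.

```python
import xml.etree.ElementTree as ET

# Invented legacy records: ordered (marker, value) pairs, repeats allowed.
RECORDS = [
    [("lx", "kala"), ("ps", "n"), ("ge", "fish"), ("ge", "seafood")],
    [("lx", "ruma"), ("ps", "n"), ("ge", "house")],
]


def records_to_xml(records):
    """Encode each record as an <entry> holding one <field> per pair."""
    root = ET.Element("lexicon")
    for record in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in record:
            field = ET.SubElement(entry, "field", {"marker": marker})
            field.text = value
    return root


def xml_to_records(root):
    """Decode the XML back into ordered (marker, value) pairs."""
    return [
        [(field.get("marker"), field.text or "") for field in entry.findall("field")]
        for entry in root.findall("entry")
    ]


xml_text = ET.tostring(records_to_xml(RECORDS), encoding="unicode")
round_tripped = xml_to_records(ET.fromstring(xml_text))

print(xml_text)
print(round_tripped == RECORDS)
```

Keeping field order and repeated fields in the XML is what makes the round trip lossless here. A converter that cannot reproduce its input this way is a warning sign that information is being dropped.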
Conclusion
- The aim is to reduce the gap between computationally tractable representations, on which a high degree of functionality can be built, and the simple underlying formats driven by fieldwork-oriented tools
- Reducing the intermediate data-munging steps, which require technical rather than linguistic knowledge, is advantageous to all parties
- While we are not quite “there yet”, the light at the end of the tunnel is definitely visible
- There is a growing community of philosophically aligned computer scientists and linguists
References
- Bell & Bird, 2001. A Preliminary Study of the Structure of Lexicon Entries. Proceedings of the Workshop on Web-Based Language Documentation and Description.
- Bow, Hughes & Bird, 2003. Towards a General Model for Interlinear Text. Proceedings of EMELD 2003.
- Farrar, Lewis & Langendoen, 2002. A Common Ontology for Linguistic Concepts. Proceedings of the Knowledge Technologies Conference.
- Farrar & Langendoen, 2003. A Linguistic Ontology for the Semantic Web. GLOT International 7(3).
- Hughes, Bird & Bow, 2003. Encoding and Presenting Interlinear Text Using XML Technologies. Proceedings of ALTW 2003.
- Lai & Bird, 2004. Querying and Updating Treebanks: A Critical Survey and Requirements Analysis. Proceedings of ALTW 2004.
- Penton, Bow, Bird & Hughes, 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004.
- Penton & Bird, 2004. Representing and Rendering Linguistic Paradigms. Proceedings of ALTW 2004.
- Bird, Chen, Davidson, Lee & Zheng, 2005. Extending XPath to Support Linguistic Queries. Proceedings of PLANX 2005.
- Cassidy & Bird, 2000. Querying Databases of Annotated Speech. Proceedings of the Eleventh Australasian Database Conference.
- Taylor, 2004. XSLT as a Linguistic Query Language. BSc(Hons) Thesis, University of Melbourne.
Questions? Comments?