atomic: an open-source software platform for multi-level ... ?· atomic: an open-source software...
Post on 18-Sep-2018
Embed Size (px)
Atomic: an open-source software platform for multi-level corpusannotation
Stephan Druskat, Lennart Bierkandt%, Volker Gast, Christoph Rzymski, Florian ZipserFriedrich Schiller University Jena, Dept. of English and American Studies, Jena/GermanyHumboldt University of Berlin, Dept. of German Studies and Linguistics, Berlin/Germany
This paper1 presents Atomic, an open-source2 platform-independent desktop ap-plication for multi-level corpus annotation.Atomic aims at providing the linguisticcommunity with a user-friendly annotationtool and sustainable platform through itsfocus on extensibility, a generic data model,and compatibility with existing linguisticformats. It is implemented on top of theEclipse Rich Client Platform, a pluggableJava-based framework for creating clientapplications. Atomic - as a set of plug-ins for this framework - integrates with theplatform and allows other researchers to de-velop and integrate further extensions to thesoftware as needed. The generic graph-based meta model Salt serves as Atomicsdomain model and allows for unlimitedannotation levels and types. Salt is alsoused as an intermediate model in the Pep-per framework for conversion of linguisticdata, which is fully integrated into Atomic,making the latter compatible with a widerange of linguistic formats. Atomic pro-vides tools for both less experienced andexpert annotators: graphical, mouse-driveneditors and a command-line data manipula-tion language for rapid annotation.
1This work is licensed under a Creative Commons Attri-bution 4.0 International License (CC BY 4.0). Page numbersand proceedings footer are added by the organizers. Licensedetails: http://creativecommons.org/licenses/by/4.0/.
2Atomic is open source under the Apache License 2.0.
Over the last years, a number of tools for the an-notation and analysis of corpora have been de-veloped in linguistics. Many of these tools havebeen created in the context of research projectsand designed according to the specific require-ments of their research questions. Some toolshave not been further developed after the end ofthe project, which often precludes their installa-tion and use today.
Some of the available annotation tools, suchas MMAX2 (Muller and Strube, 2006), @nno-tate (Plaehn, 1998) and EXMARaLDA (Schmidt,2004), have been designed to work on specific an-notation types and/or levels (e.g., token-based orspan-based; coreference annotations, constituentstructures, grid annotations). However, the anal-ysis of some linguistic phenomena greatly prof-its from having access to data with annotationson more than one or a restricted set of levels.For instance, an empirical, corpus-based analy-sis of scope relations and polarity sensitivity can-not be conducted without richly annotated cor-pora, with token-based annotation levels such aspart-of-speech and lemma, and additionally syn-tactic and sentence-semantic annotations. Andanalyses of learner corpora, for example, benefitfrom an annotation level with target hypotheses,and a level that records discrepancies betweentarget hypothesis and learner production.3 Simi-larly, in order to analyze the information structureof utterances, annotations on the levels of syn-tax, phonology and morphology are essential, as
3For an overview see Ludeling and Hirschmann (forth-coming).
information structure is interweaved with vari-ous [. . . ] linguistic levels (Dipper et al., 2007).4
And finally, typological questions such as thosethat have been the topic of the research projectTowards a corpus-based typology of clause link-age5 in the context of which Atomic hasbeen developed require corpora that have fine-grained annotations, at various levels.6 Conse-quentially, any software designed for unrestrictedmulti-level corpus annotation should meet the fol-lowing requirements.
(1) The data model has to be generic andable to accommodate various types of annotationbased on the requirements of different annotationschemes. (2) The architecture needs to exhibit ahigh level of compatibility, as the corpora avail-able for a specific research question may come indifferent formats. (3) The software must be ex-tensible, since new types of annotation will haveto be accommodated, and some may require newfunctionalities. (4) As the software should be us-able by a wide variety of annotators e.g., experi-enced researchers as well as students and special-ists of various languages including field workers it needs to provide support and accessibility forusers with different levels of experience in corpusannotation.
There has been a trend in corpus linguistics to-wards the creation of multi-level corpora for anumber of years now, which has of course hadan effect on tool development as well. There-fore, there already exist some annotation toolswhich can handle annotations on more than onelevel, e.g., TrED (Pajas and Stepanek, 2004)and MATE (Dybkjr et al., 1999). TrED hasbeen developed mainly for the annotation of tree-like structures and therefore handles only limitedtypes of annotations. These do not fully coversome of the required levels for, e.g., typologi-cal research questions. MATE, a web-based toolsplatform, is specifically aimed at multi-level an-notation of spoken dialogue corpora in XML for-mats. Atomics architecture avoids these limita-
4See also Ludeling et al. (forthcoming).5Cf. http://linktype.iaa.uni-jena.de.6Annotation levels should minimally include morphol-
ogy, syntax and information structure, ideally also semantics(sentence semantics and reference) and phonology (includ-ing prosody).
tions by using a generic graph-based model ca-pable of handling potentially unlimited annota-tion types in corpora of spoken and written texts(cf. 2.2). WebAnno (Yimam et al., 2013) isanother multi-level annotation tool currently un-der development, designed with a focus on dis-tributed annotations. It has a web-based archi-tecture partly based on brat.7 In contrast to this,Atomic is built as a rich client for the desktopbased on the Eclipse Rich Client Platform (RCP,McAffer et al. (2010)), which offers a number ofadvantages over the implementation as a web ap-plication, most importantly ease of extensibility,platform-independence, data security and stabil-ity, and network-independence.
The Eclipse RCP comes with a mature, stan-dardised plug-in framework,8 something whichvery few web frameworks are able to provide.And while it is a non-trivial task in itself todevelop a truly browser-independent web ap-plication, the Eclipse RCP provides platform-independence out-of-the-box, as software devel-oped against it run on the Java Virtual Ma-chine.9 Desktop applications also inherently of-fer a higher grade of data security than web-basedtools, as sensitive data such as personal datapresent in the corpus or its metadata does nothave to leave the users computer. Additionally,access to the corpus data itself will be more sta-ble when it is stored locally rather than remotely,as desktop applications are immune to server fail-ures and the unavailability of server administra-tors. And finally, a desktop tool such as Atomic,which is self-contained in terms of business logicand data sources, can be used without an internetconnection, which is important not only to fieldworkers but also to anyone who wants to work ina place with low or no connectivity, e.g. on publictransport.10
In terms of interoperability and sustainabil-ity, Atomic has been specifically designed tocomplement the existing interoperable softwareecosystem of ANNIS (Zeldes et al., 2009),the search and visualisation tool for multilayer
7Cf. http://brat.nlplab.org/.8Eclipse Equinox, cf. 2.1.9Ibid.
10Nevertheless it is of course possible to extend Atomic touse remote logic and/or data sources, such as databases.
corpora, Pepper (Zipser et al., 2011), theconverter framework for corpus formats, andLAUDATIO (Krause et al., 2014), the long-termarchive for linguistic data. Thus, it seamlesslyintegrates the compilation and annotation of re-sources with their analysis and long-term accessi-bility.
In order to fulfill the above-mentioned require-ments, Atomics architecture is particularly con-cerned with extensibility, a generic data model,and a feature set which focuses on accessibility.
It should be possible for other research groupsto build extensions for Atomic for their specificneeds (e.g., new editors, access to remote datasources) and integrate them into the platform.This would increase the sustainability of both theplatform and the extensions.
We have therefore decided to develop Atomicon top of the Eclipse RCP, an open-source Javaapplication platform which operates on sets ofplugins. The RCP itself is also a set of plug-ins running on a module runtime for the widelydistributed Java Virtual Machine (JVM).11 Henceapplications developed on top of it run on any sys-tem for which a JVM is available.12 It enablesthe implementation of Atomic as a set of plug-ins and their integration into the application plat-form, which in turn makes it possible for Atomicto interact with, and benefit from, the vast num-ber of plugins available in the Eclipse ecosys-tem. These include version control system inter-faces (for, e.g., git and Subversion), plugins fordistributed real-time collaboration, an R develop-ment environment, a TEX editor, and many more.By adding one of the version control system in-terfaces to Atomic, for example, the corpus dataitself, the annotations and all metadata can be ver-sioned in atomic detail, which can be utilised forcollaborative corpus annotation.13
11Eclipse Equinox is an implementation of the OSGi spec-ification for a dynamic component model for the J