Atomic: an open-source software platform for multi-level ... ?· Atomic: an open-source software platform…

Download Atomic: an open-source software platform for multi-level ... ?· Atomic: an open-source software platform…

Post on 18-Sep-2018




0 download

Embed Size (px)


<ul><li><p>Atomic: an open-source software platform for multi-level corpusannotation</p><p>Stephan Druskat, Lennart Bierkandt%, Volker Gast, Christoph Rzymski, Florian ZipserFriedrich Schiller University Jena, Dept. of English and American Studies, Jena/GermanyHumboldt University of Berlin, Dept. of German Studies and Linguistics, Berlin/Germany</p><p>{firstname.lastname}</p><p></p><p>Abstract</p><p>This paper1 presents Atomic, an open-source2 platform-independent desktop ap-plication for multi-level corpus annotation.Atomic aims at providing the linguisticcommunity with a user-friendly annotationtool and sustainable platform through itsfocus on extensibility, a generic data model,and compatibility with existing linguisticformats. It is implemented on top of theEclipse Rich Client Platform, a pluggableJava-based framework for creating clientapplications. Atomic - as a set of plug-ins for this framework - integrates with theplatform and allows other researchers to de-velop and integrate further extensions to thesoftware as needed. The generic graph-based meta model Salt serves as Atomicsdomain model and allows for unlimitedannotation levels and types. Salt is alsoused as an intermediate model in the Pep-per framework for conversion of linguisticdata, which is fully integrated into Atomic,making the latter compatible with a widerange of linguistic formats. Atomic pro-vides tools for both less experienced andexpert annotators: graphical, mouse-driveneditors and a command-line data manipula-tion language for rapid annotation.</p><p>1This work is licensed under a Creative Commons Attri-bution 4.0 International License (CC BY 4.0). Page numbersand proceedings footer are added by the organizers. Licensedetails:</p><p>2Atomic is open source under the Apache License 2.0.</p><p>1 Introduction</p><p>Over the last years, a number of tools for the an-notation and analysis of corpora have been de-veloped in linguistics. Many of these tools havebeen created in the context of research projectsand designed according to the specific require-ments of their research questions. Some toolshave not been further developed after the end ofthe project, which often precludes their installa-tion and use today.</p><p>Some of the available annotation tools, suchas MMAX2 (Muller and Strube, 2006), @nno-tate (Plaehn, 1998) and EXMARaLDA (Schmidt,2004), have been designed to work on specific an-notation types and/or levels (e.g., token-based orspan-based; coreference annotations, constituentstructures, grid annotations). However, the anal-ysis of some linguistic phenomena greatly prof-its from having access to data with annotationson more than one or a restricted set of levels.For instance, an empirical, corpus-based analy-sis of scope relations and polarity sensitivity can-not be conducted without richly annotated cor-pora, with token-based annotation levels such aspart-of-speech and lemma, and additionally syn-tactic and sentence-semantic annotations. Andanalyses of learner corpora, for example, benefitfrom an annotation level with target hypotheses,and a level that records discrepancies betweentarget hypothesis and learner production.3 Simi-larly, in order to analyze the information structureof utterances, annotations on the levels of syn-tax, phonology and morphology are essential, as</p><p>3For an overview see Ludeling and Hirschmann (forth-coming).</p><p>228</p></li><li><p>information structure is interweaved with vari-ous [. . . ] linguistic levels (Dipper et al., 2007).4</p><p>And finally, typological questions such as thosethat have been the topic of the research projectTowards a corpus-based typology of clause link-age5 in the context of which Atomic hasbeen developed require corpora that have fine-grained annotations, at various levels.6 Conse-quentially, any software designed for unrestrictedmulti-level corpus annotation should meet the fol-lowing requirements.</p><p>(1) The data model has to be generic andable to accommodate various types of annotationbased on the requirements of different annotationschemes. (2) The architecture needs to exhibit ahigh level of compatibility, as the corpora avail-able for a specific research question may come indifferent formats. (3) The software must be ex-tensible, since new types of annotation will haveto be accommodated, and some may require newfunctionalities. (4) As the software should be us-able by a wide variety of annotators e.g., experi-enced researchers as well as students and special-ists of various languages including field workers it needs to provide support and accessibility forusers with different levels of experience in corpusannotation.</p><p>There has been a trend in corpus linguistics to-wards the creation of multi-level corpora for anumber of years now, which has of course hadan effect on tool development as well. There-fore, there already exist some annotation toolswhich can handle annotations on more than onelevel, e.g., TrED (Pajas and Stepanek, 2004)and MATE (Dybkjr et al., 1999). TrED hasbeen developed mainly for the annotation of tree-like structures and therefore handles only limitedtypes of annotations. These do not fully coversome of the required levels for, e.g., typologi-cal research questions. MATE, a web-based toolsplatform, is specifically aimed at multi-level an-notation of spoken dialogue corpora in XML for-mats. Atomics architecture avoids these limita-</p><p>4See also Ludeling et al. (forthcoming).5Cf. levels should minimally include morphol-</p><p>ogy, syntax and information structure, ideally also semantics(sentence semantics and reference) and phonology (includ-ing prosody).</p><p>tions by using a generic graph-based model ca-pable of handling potentially unlimited annota-tion types in corpora of spoken and written texts(cf. 2.2). WebAnno (Yimam et al., 2013) isanother multi-level annotation tool currently un-der development, designed with a focus on dis-tributed annotations. It has a web-based archi-tecture partly based on brat.7 In contrast to this,Atomic is built as a rich client for the desktopbased on the Eclipse Rich Client Platform (RCP,McAffer et al. (2010)), which offers a number ofadvantages over the implementation as a web ap-plication, most importantly ease of extensibility,platform-independence, data security and stabil-ity, and network-independence.</p><p>The Eclipse RCP comes with a mature, stan-dardised plug-in framework,8 something whichvery few web frameworks are able to provide.And while it is a non-trivial task in itself todevelop a truly browser-independent web ap-plication, the Eclipse RCP provides platform-independence out-of-the-box, as software devel-oped against it run on the Java Virtual Ma-chine.9 Desktop applications also inherently of-fer a higher grade of data security than web-basedtools, as sensitive data such as personal datapresent in the corpus or its metadata does nothave to leave the users computer. Additionally,access to the corpus data itself will be more sta-ble when it is stored locally rather than remotely,as desktop applications are immune to server fail-ures and the unavailability of server administra-tors. And finally, a desktop tool such as Atomic,which is self-contained in terms of business logicand data sources, can be used without an internetconnection, which is important not only to fieldworkers but also to anyone who wants to work ina place with low or no connectivity, e.g. on publictransport.10</p><p>In terms of interoperability and sustainabil-ity, Atomic has been specifically designed tocomplement the existing interoperable softwareecosystem of ANNIS (Zeldes et al., 2009),the search and visualisation tool for multilayer</p><p>7Cf. Equinox, cf. 2.1.9Ibid.</p><p>10Nevertheless it is of course possible to extend Atomic touse remote logic and/or data sources, such as databases.</p><p>229</p></li><li><p>corpora, Pepper (Zipser et al., 2011), theconverter framework for corpus formats, andLAUDATIO (Krause et al., 2014), the long-termarchive for linguistic data. Thus, it seamlesslyintegrates the compilation and annotation of re-sources with their analysis and long-term accessi-bility.</p><p>2 Architecture</p><p>In order to fulfill the above-mentioned require-ments, Atomics architecture is particularly con-cerned with extensibility, a generic data model,and a feature set which focuses on accessibility.</p><p>2.1 Extensibility</p><p>It should be possible for other research groupsto build extensions for Atomic for their specificneeds (e.g., new editors, access to remote datasources) and integrate them into the platform.This would increase the sustainability of both theplatform and the extensions.</p><p>We have therefore decided to develop Atomicon top of the Eclipse RCP, an open-source Javaapplication platform which operates on sets ofplugins. The RCP itself is also a set of plug-ins running on a module runtime for the widelydistributed Java Virtual Machine (JVM).11 Henceapplications developed on top of it run on any sys-tem for which a JVM is available.12 It enablesthe implementation of Atomic as a set of plug-ins and their integration into the application plat-form, which in turn makes it possible for Atomicto interact with, and benefit from, the vast num-ber of plugins available in the Eclipse ecosys-tem. These include version control system inter-faces (for, e.g., git and Subversion), plugins fordistributed real-time collaboration, an R develop-ment environment, a TEX editor, and many more.By adding one of the version control system in-terfaces to Atomic, for example, the corpus dataitself, the annotations and all metadata can be ver-sioned in atomic detail, which can be utilised forcollaborative corpus annotation.13</p><p>11Eclipse Equinox is an implementation of the OSGi spec-ification for a dynamic component model for the JVM.</p><p>12This includes all major operating systems includingWindows, Mac OS X, and Linux.</p><p>13In Atomics current iteration, this would be achieved byversioning a corpus documents SaltXML files, cf. 2.2.</p><p>Eclipse is tried and tested technology whichhas originally been developed by IBM in 2001and is now under the aegis of the Eclipse Foun-dation. Due to its long existence and high im-pact it is supported by a very large community.14</p><p>Eclipse is used by a wide spectrum of disciplines,mostly from IT and the sciences.15 These param-eters make Eclipse a highly sustainable technol-ogy, more so than any single research project canhope to achieve.</p><p>Atomics extensibility is further enhancedthrough its use of the Salt data model, whose fea-ture set (cf. 2.2) in combination with the capabil-ities of the Eclipse RCP allow for the creation ofvery diverse extensions for Atomic, such as forthe annotation of historic text, or of speech data.Salt has successfully been used, for example, asan intermediate model for the data used in theRIDGES16 project, which was possible becausethe model allows for different text segmentationsover the same tokens in the context of the repre-sentation of different transcription and annotationlevels. And as Salt supports audio and video datasources in addition to textual sources, it has alsobeen used as an intermediate model for the dia-logue data of the BeMaTaC17 project.</p><p>2.2 Data model</p><p>Atomics domain model is Salt (Zipser and Ro-mary, 2010), a graph-based metamodel for lin-guistic data. Salts generic nature and general lackof semantics makes it independent of specific lin-guistic analyses, tagsets, annotation schemes andtheories, and its graph-based nature allows for themodeling of nearly all conceivable kinds of lin-guistic structures as nodes and edges. Tokens,spans, hierarchies, and primary texts are all repre-sented as nodes. There can be an unlimited num-ber of edges between nodes, which permits thecreation of very diverse types of structures, suchas co-reference chains, dependencies, constituenttrees, and simpler morphological token annota-</p><p>14Eclipse has over 200 open-source projects and millionsof users under its umbrella (Eclipse Foundation, 2011).</p><p>15The Eclipse science community is organized in theScience Working Group at the Eclipse Foundation, cf.</p><p>16Register in Diachronic German Science,cf.</p><p>17Berlin Map Task Corpus,</p><p>230</p></li><li><p>tions. An annotation in Salt is represented asan attribute-value pair with an additional optionalnamespace tag, and is therefore not restrictedto specific tagsets. Salt has originally been de-signed as a main memory model, but could alsobe mapped to graph-based persistence technolo-gies such as Neo4j.18 Salt provides a Java APIwhich is open source under the Apache License2.0. It was designed using the Eclipse ModelingFramework (EMF, Steinberg et al. (2009)). Mod-els can be persisted via an XMI serialisation ofthe EMF model as SaltXML, a stand-off formatwhich bundles annotations in one file per docu-ment. SaltXML is used as Atomics default per-sistence format.</p><p>A graph-based domain model such as Salt canbe mapped to a graphical representation model ofannotation graphs for Atomic relatively easily, asthis is supported by the underlying technology:Solutions for the creation of editors and visual-izations of EMF-based domain models are avail-able in the Eclipse ecosystem. Projects like theEclipse Graphical Modeling Framework (GMF)and the Eclipse Graphical Editing Framework(GEF (Rubel et al., 2011), used for Atomics an-notation graph editor, cf. 2.3) have been specifi-cally designed for this purpose.</p><p>2.3 Usability and features</p><p>User-friendliness starts at the acquisition and in-stallation of software: Atomic is provided asa single zip archive file on the Atomic web-site,19 and is available for Linux, Mac OS X,and Windows. No installation as such is neces-sary, simply extracting the archive to a directoryof choice suffices. Unlike other tools includ-ing locally installed web applications , Atomicis self-contained inasmuch as no further depen-dencies such as databases, server backends, etc.have to be installed, and the Eclipse RCP pluginsare included in the distributed zip file.</p><p>At its heart, Atomic consists of a workspace-driven navigator; a document editor for overview,basic annotation and segmentation of corpusdocuments; a graphical editor for mouse-and-keyboard-based annotation; a command-line shellfor rapid annotation with the native annotation</p><p>18Cf.</p><p>language AtomicAL (Figure 1). Additionally, thecurrent version of Atomic includes a dedicatededitor for co-reference annotations.</p><p>The navigation view provides an interface tothe users workspaces as well as the usual projectmanagement features.</p><p>The document editor is a simple, text-basedoverview of a corpus documents primary text.It can be used for token-based annotation, seg-mentation of a corpus document into processableunits, and navigation of a document. The editor isthe initial entry point for annotation work. Sub-sequently, the user is forwarded to the annotationgraph edi...</p></li></ul>