

EACL 2012

Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

April 23 - 27 2012
Avignon, France


©2012 The Association for Computational Linguistics

ISBN 978-1-937284-19-0

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org


Organizers:

Frédérique Segond, XEROX


Table of Contents

Language Resources Factory: case study on the acquisition of Translation Memories
Marc Poch, Antonio Toral and Núria Bel . . . . . 1

Harnessing NLP Techniques in the Processes of Multilingual Content Management
Anelia Belogay, Diman Karagyozov, Svetla Koeva, Cristina Vertan, Adam Przepiórkowski, Dan Cristea and Polivios Raxis . . . . . 6

Collaborative Machine Translation Service for Scientific texts
Patrik Lambert, Jean Senellart, Laurent Romary, Holger Schwenk, Florian Zipser, Patrice Lopez and Frédéric Blain . . . . . 11

TransAhead: A Writing Assistant for CAT and CALL
Chung-chi Huang, Ping-che Yang, Mei-hua Chen, Hung-ting Hsieh, Ting-hui Kao and Jason S. Chang . . . . . 16

SWAN - Scientific Writing AssistaNt. A Tool for Helping Scholars to Write Reader-Friendly Manuscripts
Tomi Kinnunen, Henri Leisma, Monika Machunik, Tuomo Kakkonen and Jean-Luc LeBrun . . . . . 20

ONTS: “Optima” News Translation System
Marco Turchi, Martin Atkinson, Alastair Wilcox, Brett Crawley, Stefano Bucci, Ralf Steinberger and Erik Van der Goot . . . . . 25

Just Title It! (by an Online Application)
Cédric Lopez, Violaine Prince and Mathieu Roche . . . . . 31

Folheador: browsing through Portuguese semantic relations
Hugo Gonçalo Oliveira, Hernâni Costa and Diana Santos . . . . . 35

A Computer Assisted Speech Transcription System
Alejandro Revuelta-Martínez, Luis Rodríguez and Ismael García-Varea . . . . . 41

A Statistical Spoken Dialogue System using Complex User Goals and Value Directed Compression
Paul A. Crook, Zhuoran Wang, Xingkun Liu and Oliver Lemon . . . . . 46

Automatically Generated Customizable Online Dictionaries
Enikő Héja and Dávid Takács . . . . . 51

MaltOptimizer: An Optimization Tool for MaltParser
Miguel Ballesteros and Joakim Nivre . . . . . 58

Fluid Construction Grammar: The New Kid on the Block
Remi van Trijp, Luc Steels, Katrien Beuls and Pieter Wellens . . . . . 63

A Support Platform for Event Detection using Social Intelligence
Timothy Baldwin, Paul Cook, Bo Han, Aaron Harwood, Shanika Karunasekera and Masud Moshtaghi . . . . . 69

NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools
Giuseppe Rizzo and Raphaël Troncy . . . . . 73

Automatic Analysis of Patient History Episodes in Bulgarian Hospital Discharge Letters
Svetla Boytcheva, Galia Angelova and Ivelina Nikolova . . . . . 77


ElectionWatch: Detecting Patterns in News Coverage of US Elections
Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas and Nello Cristianini . . . . . 82

Query log analysis with LangLog
Marco Trevisan, Eduard Barbu, Igor Barsanti, Luca Dini, Nikolaos Lagos, Frédérique Segond, Mathieu Rhulmann and Ed Vald . . . . . 87

A platform for collaborative semantic annotation
Valerio Basile, Johan Bos, Kilian Evang and Noortje Venhuizen . . . . . 92

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce
Andrea Gesmundo and Nadi Tomeh . . . . . 97

brat: a Web-based Tool for NLP-Assisted Text Annotation
Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun’ichi Tsujii . . . . . 102


Conference Program

Wednesday, April 25, 2012

Session 1: (16:10 - 16:40)

Language Resources Factory: case study on the acquisition of Translation Memories
Marc Poch, Antonio Toral and Núria Bel

Harnessing NLP Techniques in the Processes of Multilingual Content Management
Anelia Belogay, Diman Karagyozov, Svetla Koeva, Cristina Vertan, Adam Przepiórkowski, Dan Cristea and Polivios Raxis

Collaborative Machine Translation Service for Scientific texts
Patrik Lambert, Jean Senellart, Laurent Romary, Holger Schwenk, Florian Zipser, Patrice Lopez and Frédéric Blain

Session 2: (16:50 - 17:20)

TransAhead: A Writing Assistant for CAT and CALL
Chung-chi Huang, Ping-che Yang, Mei-hua Chen, Hung-ting Hsieh, Ting-hui Kao and Jason S. Chang

SWAN - Scientific Writing AssistaNt. A Tool for Helping Scholars to Write Reader-Friendly Manuscripts
Tomi Kinnunen, Henri Leisma, Monika Machunik, Tuomo Kakkonen and Jean-Luc LeBrun

ONTS: “Optima” News Translation System
Marco Turchi, Martin Atkinson, Alastair Wilcox, Brett Crawley, Stefano Bucci, Ralf Steinberger and Erik Van der Goot

Just Title It! (by an Online Application)
Cédric Lopez, Violaine Prince and Mathieu Roche


Wednesday, April 25, 2012 (continued)

Session 3: (17:30 - 18:00)

Folheador: browsing through Portuguese semantic relations
Hugo Gonçalo Oliveira, Hernâni Costa and Diana Santos

A Computer Assisted Speech Transcription System
Alejandro Revuelta-Martínez, Luis Rodríguez and Ismael García-Varea

A Statistical Spoken Dialogue System using Complex User Goals and Value Directed Compression
Paul A. Crook, Zhuoran Wang, Xingkun Liu and Oliver Lemon

Automatically Generated Customizable Online Dictionaries
Enikő Héja and Dávid Takács

Thursday, April 26, 2012

Session 4: (16:10 - 16:40)

MaltOptimizer: An Optimization Tool for MaltParser
Miguel Ballesteros and Joakim Nivre

Fluid Construction Grammar: The New Kid on the Block
Remi van Trijp, Luc Steels, Katrien Beuls and Pieter Wellens

A Support Platform for Event Detection using Social Intelligence
Timothy Baldwin, Paul Cook, Bo Han, Aaron Harwood, Shanika Karunasekera and Masud Moshtaghi

NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools
Giuseppe Rizzo and Raphaël Troncy


Thursday, April 26, 2012 (continued)

Session 5: (16:50 - 17:20)

Automatic Analysis of Patient History Episodes in Bulgarian Hospital Discharge Letters
Svetla Boytcheva, Galia Angelova and Ivelina Nikolova

ElectionWatch: Detecting Patterns in News Coverage of US Elections
Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas and Nello Cristianini

Query log analysis with LangLog
Marco Trevisan, Eduard Barbu, Igor Barsanti, Luca Dini, Nikolaos Lagos, Frédérique Segond, Mathieu Rhulmann and Ed Vald

Session 6: (17:30 - 18:00)

A platform for collaborative semantic annotation
Valerio Basile, Johan Bos, Kilian Evang and Noortje Venhuizen

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce
Andrea Gesmundo and Nadi Tomeh

brat: a Web-based Tool for NLP-Assisted Text Annotation
Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun’ichi Tsujii


Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–5, Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics

Language Resources Factory: case study on the acquisition of Translation Memories∗

Marc Poch
UPF, Barcelona, Spain
[email protected]

Antonio Toral
DCU, Dublin, Ireland
[email protected]

Núria Bel
UPF, Barcelona, Spain
[email protected]

Abstract

This paper demonstrates a novel distributed architecture to facilitate the acquisition of Language Resources. We build a factory that automates the stages involved in the acquisition, production, updating and maintenance of these resources. The factory is designed as a platform where functionalities are deployed as web services, which can be combined in complex acquisition chains using workflows. We show a case study, which acquires a Translation Memory for a given pair of languages and a domain using web services for crawling, sentence alignment and conversion to TMX.

1 Introduction

A fundamental issue for many tasks in the field of Computational Linguistics and Language Technologies in general is the lack of Language Resources (LRs) to tackle them successfully, especially for some languages and domains. It is the so-called LRs bottleneck.

Our objective is to build a factory of LRs that automates the stages involved in the acquisition, production, updating and maintenance of LRs required by Machine Translation (MT), and by other applications based on Language Technologies. This automation will significantly cut down the required cost, time and human effort. These reductions are the only way to guarantee the continuous supply of LRs that Language Technologies demand in a multilingual world.

∗ We would like to thank the developers of Soaplab, Taverna, myExperiment and Biocatalogue for solving our questions and attending our requests. This research has been partially funded by the EU project PANACEA (7FP-ICT-248064).

2 Web Services and Workflows

The factory is designed as a platform of web services (WSs) where the users can create and use these services directly or combine them in more complex chains. These chains are called workflows and can represent different combinations of tasks, e.g. “extract the text from a PDF document and obtain the Part of Speech (PoS) tagging” or “crawl this bilingual website and align its sentence pairs”. Each task is carried out using NLP tools deployed as WSs in the factory.

Web Service Providers (WSPs) are institutions (universities, companies, etc.) who are willing to offer services for some tasks. WSs are services made available from a web server to remote users or to other connected programs. WSs are built upon protocols, servers and programming languages. Their massive adoption has contributed to make this technology rather interoperable and open. In fact, WSs allow computer programs distributed in different locations to interact with each other.

WSs introduce a completely new paradigm in the way we use software tools. Before, every researcher or laboratory had to install and maintain all the different tools that they needed for their work, which has a considerable cost in both human and computing resources. In addition, it makes it more difficult to carry out experiments that involve other tools, because the researcher might hesitate to spend time and resources on installing new tools when there are other alternatives already installed.

The paradigm changes considerably with WSs, as in this case only the WSP needs to have a deep knowledge of the installation and maintenance of the tool, thus allowing all the other users to benefit from this work. Consequently, researchers think about tools from a high level and solely regarding their functionalities; thus they can focus on their work and be more productive, as the time and resources that would have been spent to install software are freed. The only tool that the users need to install in order to design and run experiments is a WS client or a workflow editor.
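To make this concrete, calling a deployed SOAP WS from a script requires nothing beyond a generic client library. A minimal sketch in Python with the zeep library follows; the WSDL URL and the operation name are placeholders, not the actual PANACEA endpoints:

```python
# Hypothetical SOAP client call. The WSDL URL and the "run" operation
# are placeholders for illustration; real endpoints are listed in the
# project registry.
from zeep import Client

client = Client("http://registry.example.org/services/tokeniser?wsdl")
# Operations declared in the WSDL become methods on client.service.
result = client.service.run(text="Le chat dort.")
print(result)
```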

3 Choosing the tools for the platform

During the design phase several technologies were analyzed to study their features, ease of use, installation and maintenance needs, as well as the estimated learning curve required to use them. Interoperability between components and with other technologies was also taken into account, since one of our goals is to reach as many providers and users as possible. After some deliberation, a set of technologies that have proved to be successful in the Bioinformatics field were adopted to build the platform. These tools are developed by the myGrid1 team. This group aims to develop a suite of tools for researchers that work with e-Science. These tools have been used in numerous projects as well as in research fields as diverse as astronomy, biology and social science.

3.1 Web Services: Soaplab

Soaplab (Senger et al., 2003)2 allows a WSP to deploy a command line tool as a WS just by writing a metadata file that describes the parameters of the tool. Soaplab takes care of the typical issues regarding WSs automatically, including temporary files, protocols, the WSDL file and its parameters, etc. Moreover, it creates a Web interface (called Spinet) where WSs can be tested and used with input forms. All these features make Soaplab a suitable tool for our project. Moreover, its numerous success stories make it a safe choice; e.g., it has been used by the European Bioinformatics Institute3 to deploy their tools as WSs.
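For illustration, the sketch below hand-rolls what Soaplab generates automatically from such a metadata file: an HTTP-accessible wrapper around a command line tool. It is our own sketch (using Flask), not Soaplab code, and the tool name sent_align is hypothetical:

```python
# Illustrative only: the kind of plumbing Soaplab automates. A command
# line tool (hypothetical "sent_align") is exposed as a web service.
import subprocess
import tempfile
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/align", methods=["POST"])
def align():
    # Write the uploaded source/target texts to temporary files, as
    # Soaplab does transparently for file-typed parameters.
    with tempfile.NamedTemporaryFile("w", suffix=".src", delete=False) as src, \
         tempfile.NamedTemporaryFile("w", suffix=".trg", delete=False) as trg:
        src.write(request.form["source"])
        trg.write(request.form["target"])
    out = subprocess.run(["sent_align", src.name, trg.name],
                         capture_output=True, text=True, check=True)
    return jsonify({"alignment": out.stdout})

if __name__ == "__main__":
    app.run(port=8080)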

3.2 Registry: Biocatalogue

Once the WSs are deployed by WSPs, some means to find them becomes necessary. Biocatalogue (Belhajjame et al., 2008)4 is a registry where WSs can be shared, searched for, annotated with tags, etc. It is used as the main registration point for WSPs to share and annotate their WSs and for users to find the tools they need. Biocatalogue is a user-friendly portal that monitors the status of the WSs deployed and offers multiple metadata fields to annotate WSs.

1 http://www.mygrid.org.uk
2 http://soaplab.sourceforge.net/soaplab2/
3 http://www.ebi.ac.uk
4 http://www.biocatalogue.org/

3.3 Workflows: Taverna

Now that users can find WSs and use them, the next step is to combine them to create complex chains. Taverna (Missier et al., 2010)5 is an open source application that allows the user to create high-level workflows that integrate different resources (mainly WSs in our case) into a single experiment. Such experiments can be seen as simulations which can be reproduced, tuned and shared with other researchers.

An advantage of using workflows is that the researcher does not need to have background knowledge of the technical aspects involved in the experiment. The researcher creates the workflow based on functionalities (each WS provides a function) instead of dealing with technical aspects of the software that provides the functionality.

3.4 Sharing workflows: myExperiment

MyExperiment (De Roure et al., 2008)6 is a social network used by workflow designers to share workflows. Users can create groups and share their workflows within the group or make them publicly available. Workflows can be annotated with several types of information such as description, attribution, license, etc. Users can easily find examples that will help them during the design phase, being able to reuse workflows (or parts of them) and thus avoiding reinventing the wheel.

4 Using the tools to work with NLP

All the aforementioned tools were installed, used and adapted to work with NLP. In addition, several tutorials and videos have been prepared7 to help partners and other users to deploy and use WSs and to create workflows.

Soaplab has been modified (a patch has been developed and distributed)8 to limit the amount of data being transferred inside the SOAP message in order to optimize the network usage. Guidelines that describe how to limit the amount of concurrent users of WSs as well as to limit the maximum size of the input data have been prepared.9

5 http://www.taverna.org.uk/
6 http://www.myexperiment.org/
7 http://panacea-lr.eu/en/tutorials/
8 http://myexperiment.elda.org/files/5

Regarding Taverna, guidelines and workflow examples have been shared among partners showing the best way to create workflows for the project. The examples show how to benefit from useful features provided by this tool, such as “retries” (to execute a WS up to a certain number of times when it fails) and “parallelisation” (to run WSs in parallel, thus increasing throughput). Users can view intermediate results and parameters using the provenance capture option, a useful feature while designing a workflow. In case of any WS error in one of the inputs, Taverna will report the error message produced by the WS or processor component that causes it. However, Taverna will be able to continue processing the rest of the input data if the workflow is robust (i.e. makes use of retry and parallelisation) and the error is confined to a WS (i.e. it does not affect the rest of the workflow).
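For illustration, the effect of Taverna's “retries” setting can be mimicked in plain Python as a wrapper that re-invokes a failing service call a bounded number of times. This is a sketch of the behaviour, not Taverna's implementation:

```python
# Sketch of per-processor "retries": re-invoke a flaky WS call up to
# `retries` times, pausing between attempts, before giving up.
import time

def with_retries(call, retries=3, delay=1.0):
    def wrapped(*args, **kwargs):
        for attempt in range(1, retries + 1):
            try:
                return call(*args, **kwargs)
            except Exception:
                if attempt == retries:
                    raise           # exhausted: propagate the WS error
                time.sleep(delay)   # back off, then retry
    return wrapped
```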

An instance of Biocatalogue and one of myExperiment have been deployed to be the Registry and the portal to share workflows and other experiment-related data. Both have been adapted by modifying relevant aspects of the interface (layout, colours, names, logos, etc.). The categories that make up the classification system used in the Registry have been adapted to the NLP field. At the time of writing there are more than 100 WSs and 30 workflows registered.

5 Interoperability

Interoperability plays a crucial role in a platform of distributed WSs. Soaplab deploys SOAP10 WSs and handles automatically most of the issues involved in this process, while Taverna can combine SOAP and REST11 WSs. Hence, we can say that communication protocols are being handled by the tools. However, parameters and data interoperability need to be addressed.

5.1 Common Interface

To facilitate interoperability between WSs and to easily exchange WSs, a Common Interface (CI) has been designed for each type of tool (e.g. PoS-taggers, aligners, etc.). The CI establishes that all WSs that perform a given task must have the same mandatory parameters. That said, each tool can have different optional parameters. This system eases the design of workflows as well as the exchange of tools that perform the same task inside a workflow. The CI has been developed using an XML schema.12

9 http://myexperiment.elda.org/files/4
10 http://www.w3.org/TR/soap/
11 http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

5.2 Travelling Object

A goal of the project is to facilitate the deployment of as many tools as possible in the form of WSs. In many cases, tools performing the same task use in-house formats. We have designed a container, called “Travelling Object” (TO), as the data object that is being transferred between WSs. Any tool that is deployed needs to be adapted to the TO; this way we can interconnect the different tools in the platform regardless of their original input/output formats.

We have adopted for the TO the XML Corpus Encoding Standard (XCES) format (Ide et al., 2000) because it was the already existing format that required the minimum transduction effort from the in-house formats. The XCES format has been used successfully to build workflows for PoS tagging and alignment.

Some WSs, e.g. dependency parsers, require a more complex representation that cannot be handled by the TO. Therefore, a more expressive format has been adopted for these. The Graph Annotation Format (GrAF) (Ide and Suderman, 2007) is an XML representation of a graph that allows different levels of annotation using a “feature–value” paradigm. This system allows different in-house formats to be easily encapsulated in this container-based format. On the other hand, GrAF can be used as a pivot format between other formats (Ide and Bunt, 2010); e.g., there is software to convert GrAF to UIMA and GATE formats (Ide and Suderman, 2009) and it can be used to merge data represented in a graph.

Both TO and GrAF address syntactic interoperability, while semantic interoperability is still an open topic.

12 http://panacea-lr.eu/en/info-for-professionals/documents/


6 Evaluation

The evaluation of the factory is based on its features and usability requirements. A binary scheme (yes/no) is used to check whether each requirement is fulfilled or not. The quality of the tools is not altered, as they are deployed as WSs without any modification. According to the evaluation of the current version of the platform, most requirements are fulfilled (Aleksic et al., 2012).

Another aspect of the factory that is being evaluated is its performance and scalability. These do not depend on the factory itself but on the design of the workflows and WSs. WSPs with robust WSs and powerful servers will provide a better and faster service to users (considering that the service is based on the same tool). This is analogous to the user installing tools on a computer: if the user develops a fragile script to chain the tools, the execution may fail, while if the computer does not provide the required computational resources, the performance will be poor.

Following the example of the Bioinformatics field, where users can benefit from powerful WSPs, the factory is used as a proof of concept that these technologies can grow and scale to benefit many users.

7 Case study

We introduce a case study in order to demonstrate the capabilities of the platform. It regards the acquisition of a Translation Memory (TM) for a language pair and a specific domain. This is deemed to be very useful for translators when they start translating documents for a new domain. As at that early stage they still do not have any content in their TM, having the automatically acquired TM can be helpful in order to get familiar with the characteristic bilingual terminology and other aspects of the domain. Another obvious potential use of this data would be to use it to train a Statistical MT system.

Three functionalities are needed to carry out this process: acquisition of the data, its alignment and its conversion into the desired format. These are provided by WSs available in the registry.

First, we use a domain-focused bilingual crawler13 in order to acquire the data. Given a pair of languages, a set of web domains and a set of seed terms that define the target domain for these languages, this tool will crawl the webpages in the domains and gather pairs of web documents in the target languages that belong to the target domain. Second, we apply a sentence aligner.14 It takes as input the pairs of documents obtained by the crawler and outputs pairs of equivalent sentences. Finally, we convert the aligned data into a TM format. We have picked TMX15 as it is the most common format for TMs. The export is done by a service that receives as input sentence-aligned text and converts it to TMX.16

13 http://registry.elda.org/services/127

The “Bilingual Process, Sentence Alignment of bilingual crawled data with Hunalign and export into TMX”17 is a workflow built using Taverna that combines the three WSs in order to provide the functionality needed. The crawling part is omitted because data only needs to be crawled once; crawled data can be processed with different workflows, but it would be very inefficient to crawl the same data each time. A set of screenshots showing the WSs and the workflow, together with sample input and output data, is available.18
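To illustrate the last step of this workflow, the sketch below builds a minimal TMX 1.4 document from aligned sentence pairs using Python's standard library. It mimics what the export service does, but the function is ours, not the actual PANACEA WS:

```python
# Minimal sketch of TMX export: turn aligned sentence pairs into TMX 1.4.
import xml.etree.ElementTree as ET

def to_tmx(pairs, src_lang="en", trg_lang="fr"):
    """Build a TMX document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "demo", "segtype": "sentence", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext", "o-tmf": "none"})
    body = ET.SubElement(tmx, "body")
    for src, trg in pairs:
        tu = ET.SubElement(body, "tu")   # one unit per aligned pair
        for lang, text in ((src_lang, src), (trg_lang, trg)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)

to_tmx([("The system is fast.", "Le système est rapide.")]).write(
    "memory.tmx", encoding="utf-8", xml_declaration=True)
```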

8 Demo and Requirements

The demo aims to show the web portals and tools used during the development of the case study: first, the Registry19 to find WSs, the Spinet Web client to easily test them, and Taverna to finally build a workflow combining the different WSs. For the live demo, the workflows will be already designed because of the time constraints. However, there are videos on the web that illustrate the whole process. It will also be interesting to show the myExperiment portal,20 where all public workflows can be found. Videos of workflow executions will also be available.

Regarding the requirements, a decent internet connection is critical for an acceptable performance of the whole platform, especially for remote WSs and workflows. We will use a laptop with Taverna installed to run the workflow presented in Section 7.

14 http://registry.elda.org/services/92
15 http://www.gala-global.org/oscarStandards/tmx/tmx14b.html
16 http://registry.elda.org/services/219
17 http://myexperiment.elda.org/workflows/37
18 http://www.computing.dcu.ie/~atoral/panacea/eacl12_demo/
19 http://registry.elda.org
20 http://myexperiment.elda.org


References

Vera Aleksic, Olivier Hamon, Vassilis Papavassiliou, Pavel Pecina, Marc Poch, Prokopis Prokopidis, Valeria Quochi, Christoph Schwarz, and Gregor Thurmair. 2012. Second evaluation report. Evaluation of PANACEA v2 and produced resources (PANACEA project Deliverable 7.3). Technical report.

Khalid Belhajjame, Carole Goble, Franck Tanoh, Jiten Bhagat, Katherine Wolstencroft, Robert Stevens, Eric Nzuobontane, Hamish McWilliam, Thomas Laurent, and Rodrigo Lopez. 2008. Biocatalogue: A curated web service registry for the life science community. In Microsoft eScience conference.

David De Roure, Carole Goble, and Robert Stevens. 2008. The design and realisation of the myExperiment virtual research environment for social sharing of workflows. Future Generation Computer Systems, 25:561–567, May.

Nancy Ide and Harry Bunt. 2010. Anatomy of annotation schemes: mapping to GrAF. In Proceedings of the Fourth Linguistic Annotation Workshop, LAW IV '10, pages 247–255, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nancy Ide and Keith Suderman. 2007. GrAF: A Graph-based Format for Linguistic Annotations. In Proceedings of the Linguistic Annotation Workshop, pages 1–8, Prague, Czech Republic, June. Association for Computational Linguistics.

Nancy Ide and Keith Suderman. 2009. Bridging the Gaps: Interoperability for GrAF, GATE, and UIMA. In Proceedings of the Third Linguistic Annotation Workshop, pages 27–34, Suntec, Singapore, August. Association for Computational Linguistics.

Nancy Ide, Patrice Bonhomme, and Laurent Romary. 2000. XCES: An XML-based encoding standard for linguistic corpora. In Proceedings of the Second International Language Resources and Evaluation Conference. Paris: European Language Resources Association.

Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Wei Tan, Aleksandra Nenadic, Ian Dunlop, Alan Williams, Thomas Oinn, and Carole Goble. 2010. Taverna, reloaded. In M. Gertz, T. Hey, and B. Ludaescher, editors, SSDBM 2010, Heidelberg, Germany, June.

Martin Senger, Peter Rice, and Thomas Oinn. 2003. Soaplab - a unified Sesame door to analysis tools. In All Hands Meeting, September.


Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 6–10, Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics

Harnessing NLP Techniques in the Processes of Multilingual Content Management

Anelia Belogay, Tetracom IS Ltd., [email protected]
Diman Karagyozov, Tetracom IS Ltd., [email protected]
Svetla Koeva, Institute for Bulgarian Language, [email protected]
Cristina Vertan, Universität Hamburg, [email protected]
Adam Przepiórkowski, Instytut Podstaw Informatyki Polskiej Akademii Nauk, [email protected]
Polivios Raxis, Atlantis Consulting SA, [email protected]
Dan Cristea, Universitatea Alexandru Ioan Cuza, [email protected]

Abstract

The emergence of the WWW as the main source of distributing content opened the floodgates of information. The sheer volume and diversity of this content necessitate an approach that will reinvent the way it is analysed. The quantitative route to processing information, which relies on content management tools, provides structural analysis. The challenge we address is to evolve from the process of streamlining data to a level of understanding that assigns value to content.

We present an open-source multilingual platform, ATLAS, that incorporates human language technologies in the process of multilingual web content management. It complements a content management software-as-a-service component, i-Publisher, used for creating, running and managing dynamic content-driven websites, with a linguistic platform. The platform enriches the content of these websites with revealing details and reduces the manual work of classification editors by automatically categorising content. The platform ASSET supports six European languages. We expect ASSET to serve as a basis for future development of deep analysis tools capable of generating abstractive summaries and training models for decision making systems.

Introduction

The advent of the Web revolutionized the way in which content is manipulated and delivered. As a result, digital content in various languages has become widely available on the Internet, and its sheer volume and language diversity have presented an opportunity for embracing new methods and tools for content creation and distribution. Although significant improvements have been made in the field of web content management lately, there is still a growing demand for online content services that incorporate language-based technology.

Existing software solutions and services such as Google Docs, Slingshot and Amazon implement some of the linguistic mechanisms addressed in the platform. The most used open-source multilingual web content management systems (Joomla, Joom!Fish, TYPO3, Drupal)1 offer a low level of multilingual content management, providing abilities for building multilingual sites. However, the available services are narrowly focused on meeting the needs of very specific target groups, thus leaving unmet the rising demand for a comprehensive solution for multilingual content management addressing the issues posed by the growing family of languages spoken within the EU.

We are going to demonstrate the open-source content management platform ATLAS and, as proof of concept, a multilingual library, i-Librarian, driven by the platform. The demonstration aims to prove that people reading websites powered by ATLAS can easily find documents, kept in order via the automatic classification, find context-sensitive content, find similar documents in a massive multilingual data collection, and get short summaries in different languages that help the users to discern essential information with unparalleled clarity.

The “Technologies behind the system” chapter describes the implementation and the integration approach of the core linguistic processing framework and its key sub-components – the categorisation, summarisation and machine-translation engines. The chapter “i-Librarian – a case study” outlines the functionalities of an intelligent web application built with our system and the benefits of using it. The chapter “Evaluation” briefly discusses the user evaluation of the new system. The last chapter, “Conclusion and Future Work”, summarises the main achievements of the system and suggests improvements and extensions.

Technologies behind the system

The linguistic framework ASSET employs diverse natural language processing (NLP) tools, technologically and linguistically, in a platform based on UIMA2. The UIMA pluggable component architecture and software framework are designed to analyse content and to structure it. The ATLAS core annotation schema, as a uniform representation model, normalizes and harmonizes the heterogeneous nature of the NLP tools3.

1 http://www.joomla.org/, http://www.joomfish.net/, http://typo3.org/, http://drupal.org/
2 http://uima.apache.org/
3 The system exploits heterogeneous NLP tools, for the supported natural languages, implemented in Java, C++ and Perl. Examples are: OpenNLP (http://incubator.apache.org/opennlp/), RASP (http://ilexir.co.uk/applications/rasp/), Morfeusz (http://sgjp.pl/morfeusz/), Pantera (http://code.google.com/p/pantera-tagger/), ParsEst (http://dcl.bas.bg/), TnT Tagger (http://www.coli.uni-saarland.de/~thorsten/tnt/).
4 http://lucene.apache.org/

The processing of text in the system is split into three sequentially executed tasks.

Firstly, the text is extracted from the input source (text or binary documents) in the “pre-processing” phase.

Secondly, the text is annotated by several NLP tools, chained in a sequence, in the “processing” phase. The language processing tools are integrated in a language processing chain (LPC), so that the output of a given NLP tool is used as an input for the next tool in the chain. The baseline LPC for each of the supported languages includes a sentence and paragraph splitter, tokenizer, part of speech tagger, lemmatizer, word sense disambiguation, noun phrase chunker and named entity extractor (Cristea and Pistol, 2008). The annotations produced by each LPC, along with additional statistical methods, are subsequently used for detection of keywords and concepts, generation of a summary of the text, multi-label text categorisation and machine translation.

Finally, the annotations are stored in a fusion data store, comprising a relational database and high-performance Lucene4 indexes.
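As a toy illustration of the LPC idea, each tool simply consumes the previous tool's output. The stand-in functions below are ours, not the actual ATLAS (UIMA-based) components:

```python
# Sketch of a language processing chain: the output of each tool is the
# input of the next. Real LPCs wrap UIMA components; these are toys.
def tokenize(text):
    return text.split()

def pos_tag(tokens):
    # toy tagger: mark capitalised tokens as nouns, everything else as X
    return [(t, "NOUN" if t[:1].isupper() else "X") for t in tokens]

def lemmatize(tagged):
    return [(tok.lower(), pos) for tok, pos in tagged]

LPC = [tokenize, pos_tag, lemmatize]

def run_chain(text, chain=LPC):
    data = text
    for tool in chain:          # chain tools in sequence
        data = tool(data)
    return data

print(run_chain("Content Management needs NLP"))
```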

The architecture of the language processing framework is depicted in Figure 1.

Figure 1. Architecture and communication channels in our language processing framework.

The system architecture, shown in Figure 2, is based on asynchronous message processing patterns (Hohpe and Woolf, 2004) and thus allows the processing framework to be easily scaled horizontally.

Figure 2. Top-level architecture of our CMS and its major components.

Text Categorisation

We implemented a language independent text categorisation tool, which works for user-defined and controlled classification hierarchies. The NLP framework converts the texts to a series of natural numbers prior to sending the texts to the categorisation engine. This conversion allows high-level compression of the feature space. The categorisation engine employs different algorithms, such as Naïve Bayesian, relative entropy, Class-Feature Centroid (CFC) (Guan et al., 2009), and SVM. New algorithms can be easily integrated because of the chosen OSGi-based architecture (OSGi Alliance, 2009). A tailored voting system for multi-label multi-class tasks consolidates the results of each of the categorisation algorithms.
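The paper does not spell out the voting formula; as a simple sketch of the idea, a label could be kept when a minimum number of the classifiers propose it:

```python
# Sketch of consolidating multi-label predictions by voting. The actual
# ATLAS scheme is tailored and weighted; this majority vote only
# illustrates the principle.
from collections import Counter

def vote(predictions, min_votes=2):
    """predictions: one set of proposed labels per classifier."""
    counts = Counter(label for labels in predictions for label in labels)
    return {label for label, c in counts.items() if c >= min_votes}

nb = {"politics", "economy"}    # Naive Bayes output
cfc = {"economy"}               # Class-Feature Centroid output
svm = {"economy", "sports"}     # SVM output
print(vote([nb, cfc, svm]))     # {'economy'}
```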

Summarisation (prototype phase)

The chosen implementation approach for coherent text summarisation combines the well-known LexRank algorithm (Erkan and Radev, 2004) with semantic graphs and word-sense disambiguation techniques (Plaza and Diaz, 2011). Furthermore, we have automatically built thesauri for the top-level domains in order to produce domain-focused extractive summaries. Finally, we apply clause-boundary splitting in order to truncate the irrelevant or subordinating clauses in the sentences in the summary.
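For illustration, the core LexRank computation reduces to a power iteration over a row-normalised sentence-similarity matrix. The similarity values below are toy numbers; the actual system layers semantic graphs and WSD on top:

```python
# Sketch of LexRank: rank sentences by centrality in a similarity graph
# via a damped power iteration (PageRank-style update).
import numpy as np

def lexrank(sim, damping=0.85, iters=50):
    n = len(sim)
    m = sim / sim.sum(axis=1, keepdims=True)   # make rows sum to 1
    scores = np.ones(n) / n
    for _ in range(iters):
        scores = (1 - damping) / n + damping * m.T @ scores
    return scores

# toy symmetric similarity matrix for 3 sentences
sim = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
print(lexrank(sim).round(3))   # highest score = most central sentence
```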

Machine Translation (prototype phase)

The machine translation (MT) sub-component implements the hybrid MT paradigm, combining an example-based (EBMT) component and a Moses-based statistical approach (SMT). Firstly, the input is processed by the example-based MT engine; if the whole input or important chunks of it are found in the translation database, then the translation equivalents are used and, if necessary, combined (Gavrila, 2011). In all other cases the input is processed by the categorisation sub-component in order to select the top-level domain and, respectively, the most appropriate SMT domain- and POS-translation model (Niehues and Waibel, 2010).

The translation engine in the system, based on MT Server Land (Federmann and Eisele, 2010), is able to accommodate and use different third-party translation engines, such as the Google, Bing, Lusy or Yahoo translators.

Case Study: Multilingual Library

i-Librarian5 is a free online library that assists authors, students, young researchers, scholars, librarians and executives to easily create, organise and publish various types of documents in English, Bulgarian, German, Greek, Polish and Romanian. Currently, a sample of the publicly available library contains over 20,000 books in English.

On uploading a new document to i-Librarian, the system automatically provides the user with an extraction of the most relevant information (concepts and named entities, keywords). Later on, the retrieved information is used to generate suggestions for classification in the library catalogue, containing 86 categories, as well as a list of similar documents. Finally, the system compiles a summary and translates it into all supported languages. Among the supported formats are Microsoft Office documents, PDF, OpenOffice documents, books in various electronic formats, HTML pages and XML documents. Users have exclusive rights to manage content in the library at their discretion.

The current version of the system supports English and Bulgarian. In early 2012 the Polish, Greek, German and Romanian languages will be in use.

5 The i-Librarian web site is available at http://www.i-librarian.eu/. One can access the i-Librarian demo content using “demo@i-librarian.eu” for username and “sandbox” for password.


Evaluation

The technical quality and performance of the system is being evaluated, as well as its appraisal by prospective users. The technical evaluation uses indicators that assess the following key technical elements:

• overall quality and performance attributes (MTBF6, uptime, response time);
• performance of specific functional elements (content management, machine translation, cross-lingual content retrieval, summarisation, text categorisation).

The user evaluation assesses the level of satisfaction with the system. We measure non-functional elements such as:

• user friendliness and satisfaction, clarity in responses and ease of use;
• adequacy and completeness of the provided data and functionality;
• impact on certain user activities and the degree of fulfilment of common tasks.

We have planned for three rounds of user evaluation; all users are encouraged to try the system online, freely, or by following the provided base-line scenarios and accompanying exercises. The main instrument for collecting user feedback is an online interactive electronic questionnaire7.

The second round of user evaluation is scheduled for Feb-March 2012, while the first round took place in Q1 2011, with the participation of 33 users. The overall user impression was positive, and the mean value of each indicator (on a 5-point Likert scale) was AVERAGE or ABOVE AVERAGE.

Figure 3. User evaluation – UI friendliness and ease of use (“The user interface is friendly and easy to use”: Excellent 28%, Good 35%, Average 28%, Below Average 9%).

6 Mean Time Between Failures
7 The electronic questionnaire is available at http://ue.atlasproject.eu

Figure 4. User evaluation – user satisfaction with the available functionalities in the system (“I am satisfied with the functionalities”: Excellent 31%, Good 28%, Average 38%, Below Average 3%).

Figure 5. User evaluation – perceived productivity increase (“The system increases your productivity”: Good 47%, Average 31%, Excellent 13%, Below Average 9%).

Acknowledgments

ATLAS (Applied Technology for Language-Aided CMS) is a European project funded under the CIP ICT Policy Support Programme, Grant Agreement 250467.

Conclusion and Future Work

The abundance of knowledge allows us to widen the application of NLP tools developed in a research environment. The tailor-made voting system maximizes the use of the different categorisation algorithms. The novel summary approach adopts state-of-the-art techniques, and the automatic translation is provided by a cutting-edge hybrid machine translation system.

The content management platform and the linguistic framework will be released as open-source software. The language processing chains for Greek, Romanian, Polish and German will be fully implemented by the end of 2011. The summarisation engine and machine translation tools will be fully integrated in mid 2012.

We expect this platform to serve as a basis for future development of tools that directly support decision making and situation awareness. We will use categorical and statistical analysis in order to recognise events and patterns, and to detect opinions and predictions, while processing extremely large volumes of disparate data resources.

Demonstration websites

The multilingual content management platform is available for testing at http://i-publisher.atlasproject.eu/atlas/i-publisher/demo. One can access the CMS demo content using “demo” for username and “sandbox2” for password.

The multilingual library web site is available at http://www.i-librarian.eu/. One can access the i-Librarian demo content using “demo@i-librarian.eu” for username and “sandbox” for password.

References

Dan Cristea and Ionut C. Pistol. 2008. Managing Language Resources and Tools using a Hierarchy of Annotation Schemas. In Proceedings of the workshop 'Sustainability of Language Resources and Tools for Natural Language Processing', LREC 2008.

Gregor Hohpe and Bobby Woolf. 2004. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Professional.

Hu Guan, Jingyu Zhou and Minyi Guo. 2009. A Class-Feature-Centroid Classifier for Text Categorization. WWW 2009, Madrid, Track: Data Mining / Session: Learning, pages 201-210.

OSGi Alliance. 2009. OSGi Service Platform, Core Specification, Release 4, Version 4.2.

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22 (2004), pages 457-479.

Laura Plaza and Alberto Diaz. 2011. Using Semantic Graphs and Word Sense Disambiguation Techniques to Improve Text Summarization. Procesamiento del Lenguaje Natural, Revista nº 47, septiembre de 2011 (SEPLN 2011), pages 97-105.

Monica Gavrila. 2011. Constrained Recombination in an Example-based Machine Translation System. In Proceedings of EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium, pages 193-200.

Jan Niehues and Alex Waibel. 2010. Domain adaptation in statistical machine translation using factored translation models. In EAMT 2010: Proceedings of the 14th Annual Conference of the European Association for Machine Translation, 27-28 May 2010, Saint-Raphaël, France.

Christian Federmann and Andreas Eisele. 2010. MT Server Land: An Open-Source MT Architecture. The Prague Bulletin of Mathematical Linguistics, Number 94, 2010, pages 57-66.


Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 11–15, Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics

Collaborative Machine Translation Service for Scientific Texts

Patrik Lambert
University of Le Mans
[email protected]

Jean Senellart
Systran SA
[email protected]

Laurent Romary
Humboldt Universität Berlin / INRIA Saclay - Île de France
[email protected]

Holger Schwenk
University of Le Mans
[email protected]

Florian Zipser
Humboldt Universität Berlin
[email protected]

Patrice Lopez
Humboldt Universität Berlin / INRIA Saclay - Île de France
[email protected]

Frédéric Blain
Systran SA / University of Le Mans
frederic.blain@lium.univ-lemans.fr

Abstract

French researchers are required to frequently translate into French the description of their work published in English. At the same time, the need for French people to access articles in English, or for international researchers to access theses or papers in French, is incorrectly resolved via the use of generic translation tools. We propose the demonstration of an end-to-end tool integrated in the HAL open archive for enabling efficient translation of scientific texts. This tool can give translation suggestions adapted to the scientific domain, improving by more than 10 points the BLEU score of a generic system. It also provides a post-edition service which captures user post-editing data that can be used to incrementally improve the translation engines. Thus it is helpful for users who need to translate or to access scientific texts.

1 Introduction

Due to the globalisation of research, the English language is today the universal language of scientific communication. In France, regulations require the use of the French language in progress reports, academic dissertations and manuscripts, and French is the official educational language of the country. This situation forces researchers to frequently translate their own articles, lectures, presentations, reports, and abstracts between English and French. In addition, students and the general public are also challenged by language when it comes to finding published articles in English or understanding these articles. Finally, international scientists do not even consider looking for French publications (for instance PhD theses) because they are not available in their native languages. This problem, incorrectly resolved through the use of generic translation tools, actually reveals an interesting generic problem where a community of specialists is regularly performing translation tasks on a very limited domain. At the same time, other communities of users seek translations for the same type of documents. Without appropriate tools, the expertise and time spent on translation activity by the first community is lost and does not benefit the translation requests of the other communities.

We propose the demonstration of an end-to-end tool for enabling efficient translation of scientific texts. This system, developed for the COSMAT ANR project,1 is closely integrated into the HAL open archive,2 a multidisciplinary open-access archive which was created in 2006 to archive publications from all the French scientific community. The tool deals with the handling of the source document format, generally a PDF file, specialised translation of the content, and a user-friendly interface allowing to post-edit the output. Behind the scene, the post-editing tool captures user post-editing data which are used to incrementally improve the translation engines. The only equipment required by this demonstration is a computer with an Internet browser installed and an Internet connection.

1 http://www.cosmat.fr/
2 http://hal.archives-ouvertes.fr/?langue=en

In this paper, we first describe the complete work-flow from data acquisition to final post-editing. Then we focus on the text extraction procedure. In Section 4, we give details about the translation system. Then in Section 5, we present the translation and post-editing interface. We finally give some concluding remarks.

The system will be demonstrated at EACL in its tight integration with the HAL paper deposit system. If the organizers agree, we would like to offer the use of our system during the EACL conference. It would automatically translate all the abstracts of the accepted papers and also offer the possibility to correct the outputs. The resulting data would be made freely available.

2 Complete Processing Work-flow

The entry point for the system are “ready to publish” scientific papers. The goal of our system was to extract content keeping as much meta-information as possible from the document, to translate the content, to allow the user to perform post-editing, and to render the result in a format as close as possible to the source format. To train our system, we collected from the HAL archive more than 40,000 documents in physics and computer science, including articles, PhD theses and research reports (see Section 4). This material was used to train the translation engines and to extract domain bilingual terminology.

The user scenario is the following:

• A user uploads an article in PDF format3 on the system.

• The document is processed by the open-source Grobid tool (see Section 3) to extract the content. The extracted paper is structured in the TEI format, where title, authors, references, footnotes and figure captions are identified with a very high accuracy.

• An entity recognition process is performed for markup of domain entities such as: chemical compounds for chemical papers; mathematical formulas, pseudo-code and object references in computer science papers; but also miscellaneous acronyms commonly used in scientific communication.

• Specialised terminology is then recognised using the Termsciences4 reference terminology database, completed with terminology automatically extracted from the training corpus. The actual translation of the paper is performed using adapted translation as described in Section 4.

• The translation process generates a bilingual TEI format preserving the source structure and integrating the entity annotation, multiple terminology choices when available, and the token alignment between source and target sentences.

• The translation is proposed to the user for post-editing through a rich interactive interface described in Section 5.

• The final version of the document is then archived in TEI format and available for display in HTML using dedicated XSLT style sheets.

3 The commonly used publishing format is PDF files, while the authoring format is principally a mix of Microsoft Word files and LaTeX documents using a variety of styles. The originality of our approach is to work on the PDF file and not on these source formats. The rationale is that 1/ the source format is almost never available, 2/ even if we had access to the source format, we would need to implement a filter specific to each individual template required by such or such conference for a good quality content extraction.

3 The Grobid System

Based on state-of-the-art machine learning techniques, Grobid (Lopez, 2009) performs reliable bibliographic data extraction from scholarly articles combined with multi-level term extraction. These two types of extraction present synergies and correspond to complementary descriptions of an article.

This tool parses and converts scientific articles in PDF format into a structured TEI document5 compliant with the good practices developed within the European PEER project (Bretel et al., 2010). Grobid is trained on a set of annotated scientific articles and can be re-trained to fit templates used for a specific conference or to extract additional fields.

4 http://www.termsciences.fr
5 http://www.tei-c.org

4 Translation of Scientific Texts

The translation system used is a Hybrid Machine Translation (HMT) system from French to English and from English to French, adapted to translate scientific texts in several domains (so far physics and computer science). This system is composed of a statistical engine, coupled with rule-based modules to translate special parts of the text such as mathematical formulas, chemical compounds and pseudo-code, and enriched with domain bilingual terminology (see Section 2). Large amounts of monolingual and parallel data are available to train an SMT system between French and English, but not in the scientific domain. In order to improve the performance of our translation system in this task, we extracted in-domain monolingual and parallel data from the HAL archive. All the PDF files deposited in HAL in computer science and physics were made available to us. These files were then converted to plain text using the Grobid tool, as described in the previous section. We extracted text from all the documents from HAL that were made available to us to train our language model. We built a small parallel corpus from the abstracts of the PhD theses from French universities, which must include both an abstract in French and in English. Table 1 presents statistics of these in-domain data.

The data extracted from HAL were used to adapt a generic system to the scientific literature domain. The generic system was mostly trained on data provided for the shared task of the Sixth Workshop on Statistical Machine Translation6 (WMT 2011), described in Table 2.

Table 3 presents results showing, in the English–French direction, the impact on the statistical engine of introducing the resources extracted from HAL, as well as the impact of domain adaptation techniques. The baseline statistical engine is a standard PBSMT system based on Moses (Koehn et al., 2007) and the SRILM toolkit (Stolcke, 2002). It was trained and tuned only on WMT11 data (out-of-domain). Incorporating the HAL data into the language model and tuning the system on the HAL development set yielded a gain of more than 7 BLEU points in both domains (computer science and physics). Including the theses abstracts in the parallel training corpus, a further gain of 2.3 BLEU points is observed for computer science, and 3.1 points for physics. The last experiment performed aims at increasing the amount of in-domain parallel texts by automatically translating in-domain monolingual data, as suggested by Schwenk (2008). The synthesised bitext does not bring new words into the system, but increases the probability of in-domain bilingual phrases. By adding a synthetic bitext of 12 million words to the parallel training data, we observed a gain of 0.5 BLEU point for computer science, and 0.7 points for physics.

Parallel data
Set    Domain    Lg   Sent.    Words    Vocab.
Train  cs+phys   En   55.9 k   1.41 M   43.3 k
                 Fr   55.9 k   1.63 M   47.9 k
Dev    cs        En   1100     25.8 k   4.6 k
                 Fr   1100     28.7 k   5.1 k
       phys      En   1000     26.1 k   5.1 k
                 Fr   1000     29.1 k   5.6 k
Test   cs        En   1100     26.1 k   4.6 k
                 Fr   1100     29.2 k   5.2 k
       phys      En   1000     25.9 k   5.1 k
                 Fr   1000     28.8 k   5.5 k

Monolingual data
Train  cs        En   2.5 M    54 M     457 k
                 Fr   761 k    19 M     274 k
       phys      En   2.1 M    50 M     646 k
                 Fr   662 k    17 M     292 k

Table 1: Statistics for the parallel training, development, and test data sets extracted from thesis abstracts contained in HAL, as well as monolingual data extracted from all documents in HAL, in computer science (cs) and physics (phys). The following statistics are given for the English (En) and French (Fr) sides (Lg) of the corpus: the number of sentences, the number of running words (after tokenisation) and the number of words in the vocabulary (M and k stand for millions and thousands, respectively).

6 http://www.statmt.org/wmt11/translation-task.html

Although not shown here, similar results were obtained in the French–English direction. The French–English system is actually slightly better than the English–French one, as it is an easier translation direction.


Translation Model    Language Model   Tuning   CS: words (M)   CS: BLEU   PHYS: words (M)   PHYS: BLEU
wmt11                wmt11            wmt11    371             27.3       371               27.1
wmt11                wmt11+hal        hal      371             36.0       371               36.2
wmt11+hal            wmt11+hal        hal      287             38.3       287               39.3
wmt11+hal+adapted    wmt11+hal        hal      299             38.8       307               40.0

Table 3: Results (BLEU score) for the English–French systems. The type of parallel data used to train the translation model or language model is indicated, as well as the set (in-domain or out-of-domain) used to tune the models. Finally, the number of words in the parallel corpus and the BLEU score on the in-domain test set are indicated for each domain: computer science and physics.

Figure 1: Translation and post-editing interface.

Corpus                     English   French
Bitexts:
  Europarl                 50.5M     54.4M
  News Commentary          2.9M      3.3M
  Crawled (10^9 bitexts)   667M      794M
Development data:
  newstest2009             65k       73k
  newstest2010             62k       71k
Monolingual data:
  LDC Gigaword             4.1G      920M
  Crawled news             2.6G      612M

Table 2: Out-of-domain development and training data used (number of words after tokenisation).

5 Post-editing Interface

The collaborative aspect of the demonstrated machine translation service is based on a post-editing tool, whose interface is shown in Figure 1. This tool provides the following features:

• WYSIWYG display of the source and target texts (Zones 1+2)

• Alignment at the sentence level (Zone 3)

• Zone to review the translation with alignment of source and target terms (Zone 4) and terminology reference (Zone 5)

• Alternative translations (Zone 6)

The tool allows the user to perform sentence-level post-editing and records details of the post-editing activity, such as keystrokes, terminology selection, actual edits and a time log for the complete action.
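For concreteness, such activity could be stored as one structured record per post-edited sentence; the schema below is a hypothetical illustration, not the service's actual log format.

# Sketch: a per-sentence post-editing log record (hypothetical schema).
import json, time

def log_postedit(sentence_id, mt_output, final_text,
                 keystrokes, term_choices, start_time):
    record = {
        "sentence_id": sentence_id,
        "mt_output": mt_output,        # system translation shown
        "final_text": final_text,      # text after post-editing
        "keystrokes": keystrokes,      # raw key events
        "terminology": term_choices,   # terms picked in Zone 5
        "seconds": round(time.time() - start_time, 2),
    }
    return json.dumps(record, ensure_ascii=False)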

6 Conclusions and Perspectives

We proposed the demonstration of an end-to-end tool integrated into the HAL archive and enabling efficient translation of scientific texts. This tool consists of a high-accuracy PDF extractor, a hybrid machine translation engine adapted to the scientific domain and a post-editing tool. Thanks to in-domain data collected from HAL, the statistical engine was improved by more than 10 BLEU points with respect to a generic system trained on WMT11 data.

Our system was deployed for a physics conference organised in Paris in September 2011. All accepted abstracts were translated into the authors' native languages (around 70% of them) and proposed for post-editing. The experience was promoted by the organising committee and 50 scientists volunteered (34 finally performed their post-editing). The same experience will be proposed to authors of the LREC conference. We would like to offer a complete demonstration of the system at EACL. The goal of these experiments is to collect and distribute detailed "post-editing" data to enable research on this activity.

Acknowledgements

This work has been partially funded by the French Government under the project COSMAT (ANR-09-CORD-004).

References

Foudil Bretel, Patrice Lopez, Maud Medves, Alain Monteil, and Laurent Romary. 2010. Back to meaning – information structuring in the PEER project. In TEI Conference, Zadar, Croatia.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the 45th Annual Meeting of the Association for Computational Linguistics (Demo and Poster Sessions), pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of ECDL 2009, 13th European Conference on Digital Library, Corfu, Greece.

Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In IWSLT, pages 182–189.

A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Proc. of the Int. Conf. on Spoken Language Processing, pages 901–904, Denver, CO.


TransAhead: A Writing Assistant for CAT and CALL

*Chung-chi Huang  ++Ping-che Yang  *Mei-hua Chen  *Hung-ting Hsieh  +Ting-hui Kao  +Jason S. Chang
*ISA, NTHU, HsinChu, Taiwan, R.O.C.  ++III, Taipei, Taiwan, R.O.C.  +CS, NTHU, HsinChu, Taiwan, R.O.C.
{u901571, maciaclark, chen.meihua, vincent732, maxis1718, jason.jschang}@gmail.com

Abstract

We introduce a method for learning to predict the following grammar and text of an ongoing translation given a source text. In our approach, predictions are offered with the aim of reducing users' burden on lexical and grammar choices, and improving productivity. The method involves learning syntactic phraseology and translation equivalents. At run-time, the source and its translation prefix are sliced into ngrams to generate subsequent grammar and translation predictions. We present a prototype writing assistant, TransAhead1, that applies the method to where computer-assisted translation and language learning meet. The preliminary results show that the method has great potential in CAT and CALL (a significant boost in translation quality is observed).

1. Introduction

More and more language learners use the MT systems on the Web for language understanding or learning. However, web translation systems typically suggest a single one-best translation, usually far from perfect, and hardly interact with the user.

Language learning and sentence translation could be achieved more interactively and appropriately if a system recognized translation as a collaborative sequence in which the user learns from and chooses among machine-generated predictions of the next-in-line grammar and text, while the machine adapts to the user's accepting or overriding of its suggestions.

Consider the source sentence "我們在結束這個交易上扮演重要角色" (We play an important role in closing this deal). The best learning environment is probably not the one solely providing the automated translation. A good learning environment might comprise a writing assistant that gives the user direct control over the target text and offers text and grammar predictions following the ongoing translations.

1 Available at http://140.114.214.80/theSite/TransAhead/ which, for the time being, only supports Chrome browsers.

We present a new system, TransAhead, that automatically learns to predict/suggest the grammatical constructs and lexical translations expected to immediately follow the current translation given a source text, and adapts to the user's choices. Example TransAhead responses to the source "我們在結束這個交易上扮演重要角色" and the ongoing translations "we" and "we play an important role" are shown in Figure 1(a) and (b), respectively.2 TransAhead determines the probable subsequent grammatical constructions with constituents lexically translated, shown in pop-up menus (e.g., Figure 1(b) shows a prediction "IN[in] VBG[close, end, …]" due to the history "play role", where lexical items in square brackets are lemmas of potential translations). TransAhead learns these constructs and translations during training.

At run-time, TransAhead starts with a source sentence, and iteratively collaborates with the user: by making predictions on the successive grammar patterns and lexical translations, and by adapting to the user's translation choices to reduce source ambiguities (e.g., word segmentation and senses). In our prototype, TransAhead mediates between users and automatic modules to boost users' writing/translation performance (e.g., productivity).

2. Related Work

CAT has been an area of active research. Our work addresses an aspect of CAT focusing on language learning. Specifically, our goal is to build a human-computer collaborative writing assistant: helping the language learner with in-text grammar and translation and at the same time updating the system's segmentation/translation options through the user's word choices. Our intended users are different from those of previous research focusing on what professional translators can bring to MT systems (e.g., Brown and Nirenburg, 1990).

2 Note that grammatical constituents (in all-capitalized words) are represented using Penn part-of-speech tags, and the history based on the user input is shown shaded.

Figure 1. Example TransAhead responses to a source text under the translation (a) "we" and (b) "we play an important role". Note that the grammar/text predictions of (a) and (b) are not placed directly under the current input focus due to space limits. (c) and (d) depict the predominant grammar constructs which follow, and (e) summarizes the translations for the source's character-based ngrams.

More recently, interactive MT (IMT) systems have begun to shift the user's role from analyses of the source text to the formation of the target translation. The TransType project (Foster et al., 2002) describes such a pioneering system that supports next-word predictions. Koehn (2009) develops caitra, which displays one phrase translation at a time and offers alternative translation options. Both systems are similar in spirit to our work. The main difference is that we do not expect the user to be a professional translator, and we provide translation hints along with grammar predictions to avoid the generalization issue facing phrase-based systems.

Recent work has been done on using fully-fledged statistical MT systems to produce target hypotheses completing user-validated translation prefixes in the IMT paradigm. Barrachina et al. (2008) investigate the applicability of different MT kernels within the IMT framework. Nepveu et al. (2004) and Ortiz-Martinez et al. (2011) further exploit user feedback for better IMT systems and user experience. Instead of being triggered by user corrections, our method is triggered by the word delimiter and assists in target-language learning.

In contrast to the previous CAT research, we present a writing assistant that suggests subsequent grammar constructs with translations and interactively collaborates with learners, with a view to reducing users' burden on grammar and word choice and enhancing their writing quality.

3. The TransAhead System

3.1 Problem Statement

For CAT and CALL, we focus on predicting a set of grammar patterns with lexical translations likely to follow the current target translation, given a source text. The predictions will be examined by a human user directly. So as not to overwhelm the user, our goal is to return a reasonably sized set of predictions that contain suitable word choices and correct grammar to choose and learn from. Formally speaking:

Problem Statement: We are given a target-language reference corpus Ct, a parallel corpus Cst, a source-language text S, and its target translation prefix Tp. Our goal is to provide a set of predictions based on Ct and Cst likely to further translate S in terms of grammar and text. For this, we transform S and Tp into sets of ngrams such that the predominant grammar constructs with suitable translation options following Tp are likely to be acquired.

3.2 Learning to Find Pattern and Translation

We attempt to find syntax-based phraseology and translation equivalents beforehand, in four stages, so that a real-time system is achievable.

Firstly, we syntactically analyze the corpus Ct. In light of the phrases in grammar books (e.g., one's in "make up one's mind"), we resort to parts-of-speech for syntactic generalization. Secondly, we build up inverted files of the words in Ct for the next stage (i.e., pattern grammar generation). Apart from sentence and position information, a word's lemma and part-of-speech (POS) are also recorded.
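As a rough illustration of this stage, an inverted file of that kind can be built in one pass over a POS-tagged corpus; the sketch below assumes the tagged corpus is already available as (word, lemma, POS) triples per sentence, with hypothetical names throughout.

# Sketch: building inverted files over a POS-tagged corpus. Each
# posting records sentence number and position plus lemma and POS,
# which is what the pattern-generation stage needs.
from collections import defaultdict

def build_inverted_files(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of
    (word, lemma, pos) triples."""
    index = defaultdict(list)
    for sent_no, sentence in enumerate(tagged_sentences):
        for position, (word, lemma, pos) in enumerate(sentence):
            index[word.lower()].append(
                {"SentNo": sent_no, "pos": position,
                 "lemma": lemma, "POS": pos})
    return index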

[Figure 1 contents, recovered from the page: the input prompt "Input your source text and start to interact with TransAhead!" with the source text "我們在結束這個交易上扮演重要角色". (a) Pop-up predictions for the prefix "we": "we MD VB[play, act, ..]", "we VBP[play, act, ..] DT", "we VBD[play, act, ..] DT", … (b) Pop-up predictions for "we play an important role": "play role IN[in] VBG[close, end, ..]", "important role IN[in] VBG[close, end, ..]", "role IN[in] VBG[close, end, ..]", … (c, d) Predominant grammar patterns for "we" and for "we play an important role". (e) Translations for the source's character-based ngrams, e.g. "我們": we; "結束": close, end; "扮演": play; "重要": critical; "扮": act; "重": heavy; "要": will, wish; "角": cents; "色": outstanding.]


We then leverage the procedure in Figure 2 to generate grammar patterns for any given sequence of words (contiguous or not).

Figure 2. Automatically generating pattern grammar:

procedure PatternFinding(query, N, Ct)
(1) interInvList = findInvertedFile(w1 of query)
    for each word wi in query except for w1
(2)     InvList = findInvertedFile(wi)
(3a)    newInterInvList = ∅; i = 1; j = 1
(3b)    while i <= length(interInvList) and j <= length(InvList)
(3c)        if interInvList[i].SentNo == InvList[j].SentNo
(3d)            Insert(newInterInvList, interInvList[i], InvList[j])
            else
(3e)            Move i, j accordingly
(3f)    interInvList = newInterInvList
(4) Usage = ∅
    for each element in interInvList
(5)     Usage += PatternGrammarGeneration(element, Ct)
(6) Sort patterns in Usage in descending order of frequency
(7) return the N patterns in Usage with highest frequency

The algorithm first identifies the sentences containing the given sequence of words, query. Iteratively, Step (3) performs an AND operation on the inverted file, InvList, of the current word wi and interInvList, the previously intersected result.

Afterwards, we analyze the query's syntax-based phraseology (Step (5)). For each element of the form ([wordPos(w1), …, wordPos(wn)], sentence number), denoting the positions of the query's words in the sentence, we generate a grammar pattern by replacing words with their POS tags (and the words at the positions wordPos(wi) with their lemmas) and extracting fixed-window3 segments surrounding the query from the transformed sentence. The result is a set of grammatical, contextual patterns.

The procedure finally returns the top N predominant syntactic patterns associated with the query. Such patterns, characterizing the query's word usages, follow the notion of pattern grammar in (Hunston and Francis, 2000) and are collected across the target language.
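To make the pattern-generation step concrete, a minimal sketch is given below; it reuses the tagged-sentence representation from the indexing sketch above, the window size is an assumed parameter, and the whole function is only an approximation of the procedure described here.

# Sketch: turning one matched sentence into a contextual grammar
# pattern: POS tags everywhere, lemmas at the query positions,
# and a fixed context window around the query.
def pattern_from_match(sentence, query_positions, window=2):
    """sentence: list of (word, lemma, pos) triples;
    query_positions: positions of the query words in the sentence."""
    tokens = []
    for i, (word, lemma, pos) in enumerate(sentence):
        # Query words keep their lemma; all other words are
        # generalized to their part-of-speech tag.
        tokens.append(lemma if i in query_positions else pos)
    start = max(0, min(query_positions) - window)
    end = min(len(tokens), max(query_positions) + window + 1)
    return " ".join(tokens[start:end])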

In the fourth and final stage, we exploit Cst for bilingual phrase acquisition, rather than a manual dictionary, to achieve better translation coverage and variety. We obtain phrase pairs by leveraging IBM models to word-align the bitexts, "smoothing" the directional word alignments via grow-diag-final, and extracting translation equivalents using (Koehn et al., 2003).

3.3 Run-Time Grammar and Text Prediction

Once translation equivalents and phraseological tendencies are learned, TransAhead predicts/suggests the following grammar and text of a translation prefix given the source text, using the procedure in Figure 3.

We first slice the source text S and its translation prefix Tp into character-level and word-level ngrams, respectively.3 Steps (3) and (4) retrieve the translations and patterns learned as described in Section 3.2. Step (3) acquires the active target-language vocabulary that may be used to translate the source text. To alleviate the word boundary issue in MT raised by Ma et al. (2007), TransAhead non-deterministically segments the source text using character ngrams and proceeds in collaboration with the user to obtain the segmentation for MT and to complete the translation. Note that a user vocabulary of preference (due to the user's domain of knowledge or errors of the system) may be exploited for better system performance. On the other hand, Step (4) extracts the patterns associated with the history ngrams of tj.

3 Inspired by (Gamon and Leacock, 2010).
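One possible reading of the slicing step, assuming character ngrams on the source side and word ngrams on the target side, is sketched below (hypothetical helper, not the authors' code).

# Sketch: slicing text into ngrams, character-level for the Chinese
# source and word-level for the English translation prefix.
def slice_ngrams(units, max_n=4):
    """Return all ngrams of length 1..max_n over a list of units."""
    return [tuple(units[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(units) - n + 1)]

source_ngrams = slice_ngrams(list("我們在結束這個交易上扮演重要角色"))
prefix_ngrams = slice_ngrams("we play an important role".split())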

Figure 3. Predicting pattern grammar and translations:

procedure MakePrediction(S, Tp)
(1) Assign sliceNgram(S) to si
(2) Assign sliceNgram(Tp) to tj
(3) TransOptions = findTranslation(si, Tp)
(4) GramOptions = findPattern(tj)
(5) Evaluate translation options in TransOptions and incorporate them into GramOptions
(6) Return GramOptions

In Step (5), we first evaluate and rank the translation candidates using a linear combination:

λ1 × ( P1(t|si) + P1(si|t) ) + λ2 × P2(t|Tp)

where the λi are combination weights, P1 and P2 are the translation and language model respectively, and t is one of the translation candidates under S and Tp. Subsequently, we incorporate the lemmatized translation candidates into the grammar constituents in GramOptions. For example, we would include "close" in the pattern "play role IN[in] VBG" as "play role IN[in] VBG[close]".
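Assuming P1 and P2 are available as scoring functions, the ranking step can be sketched as follows; the weights and helper signatures are illustrative only.

# Sketch: scoring translation candidates with the linear combination
# lambda1 * (P1(t|s) + P1(s|t)) + lambda2 * P2(t|Tp).
def rank_candidates(candidates, source, prefix, p1, p2,
                    lambda1=0.7, lambda2=0.3):
    """p1(x, y): directional translation-model score of x given y;
    p2(t, prefix): language-model score of t given the prefix."""
    scored = [(lambda1 * (p1(t, source) + p1(source, t))
               + lambda2 * p2(t, prefix), t)
              for t in candidates]
    return [t for score, t in sorted(scored, reverse=True)]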

At last, the algorithm returns the representative grammar patterns with confident translations expected to follow the ongoing translation and further translate the source. This algorithm is triggered by the word delimiter to provide an interactive environment where CAT and CALL meet.

4. Preliminary Results

To train TransAhead, we used the British National Corpus and the Hong Kong Parallel Text, and deployed the GENIA tagger for POS analyses.

To evaluate TransAhead in CAT and CALL, we introduced it to a class of 34 (Chinese) first-year college students learning English as a foreign language. Designed to be intuitive to the general public, especially language learners, the presentational tutorial lasted only a minute. After the tutorial, the participants were asked to translate 15 Chinese texts from (Huang et al., 2011a) one by one (half with TransAhead assistance, and the other half without). Encouragingly, the experimental group (i.e., with the help of our system) achieved much better translation quality than the control group in BLEU (Papineni et al., 2002) (35.49 vs. 26.46) and significantly reduced the performance gap between language learners and the automatic decoder of Google Translate (44.82). We noticed that, for the source "我們在結束這個交易上扮演重要角色", 90% of the participants in the experimental group finished with more grammatical and fluent translations (see Figure 4) than (less interactive) Google Translate ("We conclude this transaction plays an important role"). In comparison, 50% of the translations of the source from the control group were erroneous.

Figure 4. Example translations with TransAhead assistance: (1) "we play(ed) a critical role in closing/sealing this/the deal"; (2) "we play(ed) an important role in ending/closing this/the deal".

Post-experiment surveys indicate that: (a) the participants found TransAhead intuitive enough to collaborate with in writing/translation; (b) the participants found TransAhead's suggestions satisfying, accepted them, and learned from them; (c) interactivity made translation and language learning more fun, and the participants found TransAhead very recommendable and would like to use the system again in future translation tasks.

5. Future Work and Summary

Many avenues exist for future research and improvement. For example, in the linear combination, the patterns' frequencies could be considered and the feature weights could be better tuned. Furthermore, interesting directions to explore include leveraging user input as in (Nepveu et al., 2004) and (Ortiz-Martinez et al., 2011), and serially combining a grammar checker (Huang et al., 2011b). Yet another direction would be to investigate the possibility of using human-computer collaborated translation pairs to re-train word boundaries suitable for MT.

In summary, we have introduced a method for learning to offer grammar and text predictions expected to assist the user in translation and writing (or even language learning). We have implemented and evaluated the method. The preliminary results are encouragingly promising, prompting us to further qualitatively and quantitatively evaluate our system in the near future (i.e., learners' productivity, typing speed, keystroke ratios of "del" and "backspace" (possibly reflecting hesitation over grammar and lexical choices), and human-computer interaction, among others).

Acknowledgement

This study is conducted under the "Project Digital Convergence Service Open Platform" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.

References

S. Barrachina, O. Bender, F. Casacuberta, J. Civera, E. Cubel, S. Khadivi, A. Lagarda, H. Ney, J. Tomas, E. Vidal, and J.-M. Vilar. 2008. Statistical approaches to computer-assisted translation. Computational Linguistics, 35(1):3-28.

R. D. Brown and S. Nirenburg. 1990. Human-computer interaction for semantic disambiguation. In Proceedings of COLING, pages 42-47.

G. Foster, P. Langlais, E. Macklovitch, and G. Lapalme. 2002. TransType: text prediction for translators. In Proceedings of ACL Demonstrations, pages 93-94.

M. Gamon and C. Leacock. 2010. Search right and thou shalt find … using web queries for learner error detection. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 37-44.

C.-C. Huang, M.-H. Chen, S.-T. Huang, H.-C. Liou, and J. S. Chang. 2011a. GRASP: grammar- and syntax-based pattern-finder in CALL. In Proceedings of ACL.

C.-C. Huang, M.-H. Chen, S.-T. Huang, and J. S. Chang. 2011b. EdIt: a broad-coverage grammar checker using pattern grammar. In Proceedings of ACL.

S. Hunston and G. Francis. 2000. Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins.

P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.

P. Koehn. 2009. A web-based interactive computer aided translation tool. In Proceedings of ACL.

Y. Ma, N. Stroppa, and A. Way. 2007. Bootstrapping word alignment via word packing. In Proceedings of ACL.

L. Nepveu, G. Lapalme, P. Langlais, and G. Foster. 2004. Adaptive language and translation models for interactive machine translation. In Proceedings of EMNLP.

F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

D. Ortiz-Martinez, L. A. Leiva, V. Alabau, I. Garcia-Varea, and F. Casacuberta. 2011. An interactive machine translation system with online learning. In Proceedings of ACL System Demonstrations, pages 68-73.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318.



SWAN – Scientific Writing AssistaNt: A Tool for Helping Scholars to Write Reader-Friendly Manuscripts

http://cs.joensuu.fi/swan/

Tomi Kinnunen∗, Henri Leisma, Monika Machunik, Tuomo Kakkonen, Jean-Luc Lebrun

Abstract

Difficulty of reading scholarly papers is significantly reduced by reader-friendly writing principles. Writing reader-friendly text, however, is challenging due to the difficulty of recognizing problems in one's own writing. To help scholars identify and correct potential writing problems, we introduce the SWAN (Scientific Writing AssistaNt) tool. SWAN is a rule-based system that gives feedback based on various quality metrics, based on years of experience from scientific writing classes including 960 scientists of various backgrounds: life sciences, engineering sciences and economics. According to our first experiences, users have perceived SWAN as helpful in identifying problematic sections in text and increasing the overall clarity of manuscripts.

1 Introduction

A search on "tools to evaluate the quality of writing" often gets you to sites assessing only one of the qualities of writing: its readability. Measuring ease of reading is indeed useful to determine if your writing meets the reading level of your targeted reader, but with scientific writing, the statistical formulae and readability indices such as Flesch-Kincaid lose their usefulness.

In a way, readability is subjective and dependent on how familiar the reader is with the specific vocabulary and the written style. Scientific papers are targeting an audience at ease with a more specialized vocabulary, an audience expecting sentence-lengthening precision in writing. The readability index would require recalibration for such a specific audience. But the need for readability indices is not questioned here. "Science is often hard to read" (Gopen and Swan, 1990), even for scientists.

∗ T. Kinnunen, H. Leisma, M. Machunik and T. Kakkonen are with the School of Computing, University of Eastern Finland (UEF), Joensuu, Finland, e-mail: [email protected]. Jean-Luc Lebrun is an independent trainer of scientific writing and can be contacted at [email protected].

Science is also hard to write, and finding fault with one's own writing is even more challenging since we understand ourselves perfectly, at least most of the time. To gain objectivity, scientists turn away from silent readability indices and find more direct help in checklists such as the peer review form proposed by Bates College1, or scoring sheets to assess the quality of a scientific paper. These organise a systematic and critical walk through each part of a paper, from its title to its references, in peer-review style. They integrate readability criteria that far exceed those covered by statistical lexical tools. For example, they examine how the text structure frames the contents under headings and subheadings that are consistent with the title and abstract of the paper. They test whether or not the writer fluidly meets the expectations of the reader. Written by expert reviewers (and readers), they represent them, their needs and concerns, and act as their proxy. Such manual tools effectively improve writing (Chuck and Young, 2004).

Computer-assisted tools that support manual assessment based on checklists require natural language understanding. Due to the complexity of language, today's natural language processing (NLP) techniques mostly enable computers to deliver shallow language understanding when the vocabulary is large and highly specialized – as is the case for scientific papers. Nevertheless, they are mature enough to be embedded in tools assisted by human input to increase depth of understanding. SWAN (Scientific Writing AssistaNt) is such a tool (Fig. 1). It is based on metrics tested on 960 scientists working for the research institutes of the Agency for Science, Technology and Research (A*STAR) in Singapore since 1997.

1 http://abacus.bates.edu/~ganderso/biology/resources/peerreview.html

The evaluation metrics used in SWAN are described in detail in a book written by the designer of the tool (Lebrun, 2011). In general, SWAN focuses on the areas of a scientific paper that create the first impression on the reader. Readers, and in particular reviewers, will always read these particular sections of a paper: title, abstract, introduction, conclusion, and the headings and subheadings of the paper. SWAN does not assess the overall quality of a scientific paper. SWAN assesses its fluidity and cohesion, two of the attributes that contribute to the overall quality of the paper. It also helps identify other types of potential problems such as lack of text dynamism, overly long sentences and judgmental words.

Figure 1: Main window of SWAN.

2 Related Work

Automatic assessment of student-authored texts is an active area of research. Hundreds of research publications related to this topic have been published since Page's (Page, 1966) pioneering work on automatic grading of student essays. The research on using NLP in support of writing scientific publications has, however, gained much less attention in the research community.

Amadeus (Aluisio et al., 2001) is perhaps the system most similar to the work outlined in this system demonstration. However, the focus of the Amadeus system is mostly on non-native speakers of English who are learning to write scientific publications. SWAN is targeted at a more general audience of users.

Helping Our Own (HOO) is an initiative that could in the future spark new interest in research on using NLP to support scientific writing (Dale and Kilgarriff, 2010). As the name suggests, the shared task (HOO, 2011) focuses on supporting non-native English speakers in writing articles related specifically to NLP and computational linguistics. The focus of this initiative is on what the authors themselves call "domain-and-register-specific error correction", i.e. correction of grammatical and spelling mistakes.

Some NLP research has been devoted to applying NLP techniques to scientific articles. Paquot and Bestgen (2009), for instance, extracted keywords from research articles.

3 Metrics Used in SWAN

We outline the evaluation metrics used in SWAN. A detailed description of the metrics is given in (Lebrun, 2011). Rather than focusing on the English grammar or spell-checking included in most modern word processors, SWAN gives feedback on the core elements of any scientific paper: title, abstract, introduction and conclusions. In addition, SWAN gives feedback on fluidity of writing and paper structure.

SWAN includes two types of evaluation metrics, automatic and manual ones. Automatic metrics are implemented solely as text analysis of the original document using NLP tools. An example would be locating judgemental word patterns such as suffers from, or locating sentences with passive voice. The manual metrics, in turn, require the user's input for tasks that are difficult – if not impossible – to automate. An example would be highlighting title keywords that reflect the core contribution of the paper, or highlighting in the abstract the sentences that cover the relevant background.

Many of the evaluation metrics are strongly inter-connected with each other, such as the following (a minimal sketch of such a consistency check is given after the list):

• Checking that the abstract and title are consistent; for instance, frequently used abstract keywords should also be found in the title, and the title should not include keywords absent in the abstract.

• Checking that all title keywords are also found in the paper structure (in headings or subheadings) so that the paper structure is self-explanatory.
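A minimal sketch of the title–abstract consistency check referenced above might look as follows; the tokenization, stopword list and frequency threshold are assumptions for illustration, not SWAN's actual implementation.

# Sketch: flagging title/abstract keyword inconsistencies.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "on", "to"}

def keyword_issues(title, abstract, min_freq=3):
    def tokens(text):
        return [w for w in re.findall(r"[a-z]+", text.lower())
                if w not in STOPWORDS]
    title_words = set(tokens(title))
    abstract_counts = Counter(tokens(abstract))
    # Frequent abstract keywords that never appear in the title.
    missing_in_title = [w for w, c in abstract_counts.items()
                        if c >= min_freq and w not in title_words]
    # Title keywords absent from the abstract.
    missing_in_abstract = [w for w in title_words
                           if abstract_counts[w] == 0]
    return missing_in_title, missing_in_abstract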

An important part of the paper quality metrics is assessing text fluidity. By fluidity we mean the ease with which the text can be read. This, in turn, depends on how much the reader needs to memorize about what they have read so far in order to understand new information. This memorizing need is greatly reduced if consecutive sentences do not contain rapid changes in topic. The aim of the text fluidity module is to detect possible topic discontinuities within and across paragraphs, and to suggest ways of improving these parts, for example, by rearranging the sentences. The suggestions, while already useful, will improve in future versions of the tool with a better understanding of word meanings thanks to WordNet and lexical semantics techniques.

Fluidity evaluation is difficult to fully automate. Manual fluidity evaluation relies on the reader's understanding of the text. It is therefore superior to the automatic evaluation, which relies on a set of heuristics that endeavor to identify text fluidity based on the concepts of topic and stress developed in (Gopen, 2004). These heuristics require the analysis of the sentence, for which the Stanford parser is used. These heuristics are perfectible, but they already allow the identification of sentences disrupting text fluidity. More fluidity problems would be revealed through the manual fluidity evaluation.

Simply put, here topic refers to the main focus of the sentence (e.g. the subject of the main clause) while stress stands for the secondary sentence focus, which often becomes one of the following sentences' topics. SWAN compares the position of topic and stress across consecutive sentences, as well as their position inside the sentence (i.e. among its subclauses). SWAN assigns each sentence to one of four possible fluidity classes (a rough sketch of the assignment follows the list):

1. Fluid: the sentence maintains a connection with the previous sentences.

2. Inverted topic: the sentence is connected to a previous sentence, but that connection only becomes apparent at the very end of the sentence ("The cropping should preserve all critical points. Images of the same size should also be kept by the cropping").

3. Out-of-sync: the sentence is connected to a previous one, but there are disconnected sentences in between the connected sentences ("The cropping should preserve all critical points. The face features should be normalized. The cropping should also preserve all critical points").

4. Disconnected: the sentence is not connected to any of the previous sentences, or there are too many sentences in between.
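The rough sketch referenced above reduces each sentence to a topic word set and a stress word set and assigns a class from their overlap; this representation and the gap limit are assumptions for illustration, not SWAN's actual heuristics.

# Sketch: assigning fluidity classes from topic/stress overlap.
def fluidity_class(sentences, index, max_gap=2):
    """sentences: list of dicts with "topic" and "stress" word sets;
    index: position of the sentence to classify."""
    current = sentences[index]["topic"] | sentences[index]["stress"]
    for back in range(1, index + 1):
        previous = sentences[index - back]
        if current & (previous["topic"] | previous["stress"]):
            if back == 1:
                # Connected to the immediately preceding sentence;
                # "inverted topic" if the link only surfaces via the
                # current sentence's stress (i.e., at its very end).
                if sentences[index]["topic"] & (previous["topic"]
                                                | previous["stress"]):
                    return "fluid"
                return "inverted topic"
            if back <= max_gap:
                return "out-of-sync"
            break
    return "disconnected"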

The tool also alerts the writer when transition words such as in addition, on the other hand, or even the familiar however are used. Even though these expressions are effective when correctly used, they often betray the lack of a logical or semantic connection between consecutive sentences ("The cropping should preserve all critical points. However, the face features should be normalized"). SWAN displays all the sentences which could potentially break the fluidity (Fig. 2) and suggests ways of rewriting them.

Figure 2: Fluidity evaluation result in SWAN.

4 The SWAN Tool

4.1 Inputs and outputs

SWAN operates in two possible evaluation modes: simple and full. In the simple evaluation mode, the input to the tool is the title, abstract, introduction and conclusions of a manuscript. These sections can be copy-pasted as plain text into the input fields.

In the full evaluation mode, which generally provides more feedback, the user provides a full paper as input. This includes semi-automatic import of the manuscript from certain standard document formats such as TeX, MS Office and OpenOffice, as well as semi-automatic structure detection of the manuscript. For the well-known Adobe portable document format (PDF) we use the state-of-the-art, freely available PDFBox extractor2. Unfortunately, the PDF format was originally designed for layout and printing, not for structured text interchange. Most of the time, a simple copy & paste from a source document into the simple evaluation fields is sufficient.

When the text sections have been input to the tool, clicking the Evaluate button triggers the evaluation process. This has been observed to complete in, at most, a minute or two on a modern laptop. The evaluation metrics in the tool are straightforward; most of the processing time is spent in the NLP tools. After the evaluation is complete, the results are shown to the user.

SWAN provides constructive feedback on the evaluated sections of your paper. The tool also highlights problematic words or sentences in the manuscript text and generates graphs of sentence features (see Fig. 2). The results can be saved and reloaded into the tool, or exported to HTML format for sharing. The feedback includes tips on how to maintain authoritativeness and how to convince the scientist reader. Use of powerful and precise sentences is emphasized, together with strategic and logical placement of key information.

In addition to these two main evaluation modes, the tool also includes a manual fluidity assessment exercise where the writer goes through a given text passage, sentence by sentence, to see whether the next sentence can be predicted from the previous sentences.

4.2 Implementation and External Libraries

The tool is a desktop application written in Java. It uses external libraries for natural language processing from Stanford, namely the Stanford POS Tagger (Toutanova et al., 2003) and the Stanford Parser (Klein and Manning, 2003). This is one of the most accurate and robust parsers available and implemented in Java, as is the rest of our system. Other external libraries include Apache Tika3, which we use for extracting textual content from files. JFreeChart4 is used for generating graphs and XStream5 for saving and loading inputs and results.

2 http://pdfbox.apache.org/
3 http://tika.apache.org/
4 http://www.jfree.org/jfreechart/
5 http://xstream.codehaus.org/

5 Initial User Experiences of SWAN

Since its release in June 2011, the tool has been used in scientific writing classes in doctoral schools in France, Finland, and Singapore, as well as in 16 research institutes of A*STAR (Agency for Science, Technology and Research). Participants in the classes routinely enter into SWAN either parts of, or the whole of, a paper they wish to immediately evaluate. SWAN is designed to work on multiple platforms and it relies completely on freely available tools. The feedback given by the participants after the course reveals the following benefits of using SWAN:

1. Identification and removal of the inconsistencies that make clear identification of the scientific contribution of the paper difficult.

2. Applicability of the tool across vast domains of research (life sciences, engineering sciences, and even economics).

3. Increased clarity of expression through the identification of text fluidity problems.

4. Enhanced paper structure leading to a more readable paper overall.

5. More authoritative, more direct and more active writing style.

Novice writers already appreciate SWAN's functionality, and even senior writers do, although evidence remains anecdotal. At this early stage, SWAN's capabilities are narrow in scope. We continue to enhance the existing evaluation metrics, and we are eager to include a new and already tested metric that reveals problems in how figures are used.

Acknowledgments

The work of T. Kinnunen and T. Kakkonen was supported by the Academy of Finland. The authors would like to thank Arttu Viljakainen, Teemu Turunen and Zhengzhe Wu for implementing various parts of SWAN.

References

[Aluisio et al.2001] S.M. Aluisio, I. Barcelos, J. Sampaio, and O.N. Oliveira Jr. 2001. How to learn the many "unwritten rules" of the game of the academic discourse: a hybrid approach based on critiques and cases to support scientific writing. In Proc. IEEE International Conference on Advanced Learning Technologies, Madison, Wisconsin, USA.

[Chuck and Young2004] Jo-Anne Chuck and Lauren Young. 2004. A cohort-driven assessment task for scientific report writing. Journal of Science, Education and Technology, 13(3):367–376, September.

[Dale and Kilgarriff2010] R. Dale and A. Kilgarriff. 2010. Text massaging for computational linguistics as a new shared task. In Proc. 6th Int. Natural Language Generation Conference, Dublin, Ireland.

[Gopen and Swan1990] George D. Gopen and Judith A. Swan. 1990. The science of scientific writing. American Scientist, 78(6):550–558, November-December.

[Gopen2004] George D. Gopen. 2004. Expectations: Teaching Writing From The Reader's Perspective. Longman.

[HOO2011] 2011. HOO – Helping Our Own. Webpage, September. http://www.clt.mq.edu.au/research/projects/hoo/.

[Klein and Manning2003] Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. 41st Meeting of the Association for Computational Linguistics, pages 423–430.

[Lebrun2011] Jean-Luc Lebrun. 2011. Scientific Writing 2.0 – A Reader and Writer's Guide. World Scientific Publishing Co. Pte. Ltd., Singapore.

[Page1966] E. Page. 1966. The imminence of grading essays by computer. In Phi Delta Kappan, pages 238–243.

[Paquot and Bestgen2009] M. Paquot and Y. Bestgen. 2009. Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In A.H. Jucker, D. Schreier, and M. Hundt, editors, Corpora: Pragmatics and Discourse, pages 247–269. Rodopi, Amsterdam, Netherlands.

[Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. HLT-NAACL, pages 252–259.


ONTS: “Optima” News Translation System

Marco Turchi∗, Martin Atkinson∗, Alastair Wilcox+, Brett Crawley, Stefano Bucci+, Ralf Steinberger∗ and Erik Van der Goot∗

European Commission - Joint Research Centre (JRC), IPSC - GlobeSec, Via Fermi 2749, 21020 Ispra (VA) - Italy

∗[name].[surname]@jrc.ec.europa.eu, +[name].[surname]@ext.jrc.ec.europa.eu

[email protected]

Abstract

We propose a real-time machine translation system that allows users to select a news category and to translate the related live news articles from Arabic, Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese, Spanish and Turkish into English. The Moses-based system was optimised for the news domain and differs from other available systems in four ways: (1) News items are automatically categorised on the source side, before translation; (2) Named entity translation is optimised by recognising and extracting them on the source side and by re-inserting their translation in the target language, making use of a separate entity repository; (3) News titles are translated with a separate translation system which is optimised for the specific style of news titles; (4) The system was optimised for speed in order to cope with the large volume of daily news articles.

1 Introduction

Being able to read news from other countries, written in other languages, allows readers to be better informed. It allows them to detect national news bias and thus improves transparency and democracy. Existing online translation systems such as Google Translate and Bing Translator1 are thus a great service, but the number of documents that can be submitted is restricted (Google will even entirely stop their service in 2012) and submitting documents means disclosing the users' interests and their (possibly sensitive) data to the service-providing company.

1 http://translate.google.com/ and http://www.microsofttranslator.com/

For these reasons, we have developed our in-house machine translation system ONTS. Its translation results will be publicly accessible as part of the Europe Media Monitor family of applications (Steinberger et al., 2009), which gather and process about 100,000 news articles per day in about fifty languages. ONTS is based on the open-source phrase-based statistical machine translation toolkit Moses (Koehn et al., 2007), trained mostly on freely available parallel corpora and optimised for the news domain, as stated above. The main objective of developing our in-house system is thus not to improve translation quality over the existing services (this would be beyond our possibilities), but to offer our users a rough translation (a "gist") that allows them to get an idea of the main contents of the article and to determine whether the news item at hand is relevant for their field of interest or not.

A similar news-focused translation service is "Found in Translation" (Turchi et al., 2009), which gathers articles in 23 languages and translates them into English. "Found in Translation" is also based on Moses, but it categorises the news after translation and the translation process is not optimised for the news domain.

2 Europe Media Monitor

Europe Media Monitor (EMM)2 gathers a daily average of 100,000 news articles in approximately 50 languages, from about 3,400 hand-selected web news sources, from a couple of hundred specialist and government websites, as well as from about twenty commercial news providers. It visits the news web sites up to every five minutes to search for the latest articles. When news sites offer RSS feeds, it makes use of these; otherwise, it extracts the news text from the often complex HTML pages. All news items are converted to Unicode. They are processed in a pipeline structure, where each module adds additional information. Independently of how files are written, the system uses the UTF-8-encoded RSS format.

2 http://emm.newsbrief.eu/overview.html

Inside the pipeline, different algorithms are implemented to produce monolingual and multilingual clusters and to extract various types of information such as named entities, quotations, categories and more. ONTS uses two modules of EMM: the named entity recognition and categorization parts.

2.1 Named Entity Recognition and Variant Matching

Named Entity Recognition (NER) is performed using manually constructed language-independent rules that make use of language-specific lists of trigger words such as titles (president), professions or occupations (tennis player, playboy), references to countries, regions, ethnic or religious groups (French, Bavarian, Berber, Muslim), age expressions (57-year-old), verbal phrases (deceased), modifiers (former) and more. These patterns can also occur in combination, and patterns can be nested to capture more complex titles (Steinberger and Pouliquen, 2007). In order to be able to cover many different languages, no other dictionaries and no parsers or part-of-speech taggers are used.
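As an illustration of this rule style (not the actual EMM grammar), a trigger-word rule can be thought of as a regular expression instantiated from the language-specific lists; the lists and pattern below are invented for the example.

# Sketch: a trigger-word NER rule in the spirit described above.
import re

TITLES = ["president", "prime minister", "minister"]
MODIFIERS = ["former", "acting"]

# Optional modifier + trigger word + a short capitalized name sequence.
NAME_RULE = re.compile(
    r"(?:(?i:%s)\s+)?(?i:%s)\s+((?:[A-Z][\w-]+\s?){1,3})"
    % ("|".join(MODIFIERS), "|".join(TITLES)))

text = "Former President Jacques Chirac met the prime minister."
print([m.strip() for m in NAME_RULE.findall(text)])
# -> ['Jacques Chirac']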

To identify which of the names newly found every day are new entities and which ones are merely variant spellings of entities already contained in the database, we apply a language-independent name similarity measure to decide which name variants should be automatically merged; for details see (Pouliquen and Steinberger, 2009). This allows us to maintain a database containing over 1.15 million named entities and 200,000 variants. The major part of this resource can be downloaded from http://langtech.jrc.it/JRC-Names.html

2.2 Category Classification across Languages

All news items are categorized into hundreds of categories. Category definitions are multilingual, created by humans, and they include geographic regions such as each country of the world, organizations, themes such as natural disasters or security, and more specific classes such as earthquake, terrorism or tuberculosis.

Articles fall into a given category if they satisfy the category definition, which consists of Boolean operators with optional vicinity operators and wild cards. Alternatively, cumulative positive or negative weights and a threshold can be used. Uppercase letters in the category definition only match uppercase words, while lowercase words in the definition match both uppercase and lowercase words. Many categories are defined with input from the users themselves. This method to categorize the articles is rather simple and user-friendly, and it lends itself to dealing with many languages (Steinberger et al., 2009).
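A toy version of the weighted variant of such a matcher could look as follows; the definition format, weights and threshold are invented for illustration.

# Sketch: cumulative-weight category matching (illustrative format).
# A category definition maps terms to positive/negative weights; an
# article matches when the summed weights reach the threshold.
def matches_category(text, definition, threshold):
    words = set(text.split())
    lowered = {w.lower() for w in words}
    score = 0
    for term, weight in definition.items():
        # Uppercase terms must match exactly; lowercase terms match
        # case-insensitively (mirroring the rule described above).
        if term[0].isupper():
            hit = term in words
        else:
            hit = term.lower() in lowered
        if hit:
            score += weight
    return score >= threshold

quake = {"earthquake": 3, "magnitude": 2, "aftershock": 2,
         "football": -3}
print(matches_category("An earthquake of magnitude 6 struck", quake, 4))
# -> True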

3 News Translation System

In this section, we describe our statistical machine translation (SMT) service, based on the open-source toolkit Moses (Koehn et al., 2007), and its adaptation to the translation of news items.

Which is the most suitable SMT system for our requirements? The main goal of our system is to help the user understand the content of an article. This means that a translated article is evaluated positively even if it is not perfect in the target language. Dealing with such a large number of source languages and articles per day, our system should take into account the translation speed, and try to avoid using language-dependent tools such as part-of-speech taggers.

Inside the Moses toolkit, three different statistical approaches have been implemented: phrase-based statistical machine translation (PBSMT) (Koehn et al., 2003), hierarchical phrase-based statistical machine translation (Chiang, 2007) and syntax-based statistical machine translation (Marcu et al., 2006). To identify the most suitable system for our requirements, we ran a set of experiments training the three models with Europarl V4 German-English (Koehn, 2005) and optimizing and testing on the News corpus (Callison-Burch et al., 2009). For all of them, we used their default configurations and they were run under the same conditions on the same machine to better evaluate translation time. For the syntax model we used linguistic information only on the target side. According to our experiments, in terms of performance the hierarchical model performs better than PBSMT and syntax (18.31, 18.09 and 17.62 BLEU points), but in terms of translation speed PBSMT is better than hierarchical and syntax (1.02, 4.5 and 49 seconds per sentence). Although the hierarchical model has the best BLEU score, we prefer to use the PBSMT system in our translation service, because it is four times faster.

Which training data can we use? It is known in statistical machine translation that more training data implies better translation. Although the number of parallel corpora has been growing in recent years, the amounts of training data vary from language pair to language pair. To train our models we use the freely available corpora (when possible): Europarl (Koehn, 2005), JRC-Acquis (Steinberger et al., 2006), DGT-TM3, Opus (Tiedemann, 2009), SE-Times (Tyers and Alperen, 2010), the Tehran English-Persian Parallel Corpus (Pilevar et al., 2011), the News Corpus (Callison-Burch et al., 2009), the UN Corpus (Rafalovitch and Dale, 2009), CzEng 0.9 (Bojar and Zabokrtsky, 2009), the English-Persian parallel corpus distributed by ELRA4 and two Arabic-English datasets distributed by LDC5. This results in some language pairs with a large coverage (more than 4 million sentences), and others with a very small coverage (less than 1 million). The language models are trained using 12 million sentences for the content model and 4.7 million for the title model. Both sets are extracted from English news.

For less-resourced languages such as Farsi and Turkish, we tried to extend the available corpora. For Farsi, we applied the methodology proposed by (Lambert et al., 2011), where we used a large language model and an English-Farsi SMT model to produce new sentence pairs. For Turkish we added the Movie Subtitles corpus (Tiedemann, 2009), which allowed the SMT system to increase its translation capability, but included several slang words and spoken phrases.

How to deal with Named Entities in translation? News articles relate to the most important events, and these names need to be translated correctly in order to understand the content of an article. From an SMT point of view, two main issues are related to Named Entity translation: (1) such a name is not in the training data, or (2) part of the name is a common word in the target language and it is wrongly translated, e.g. the French name "Bruno Le Maire", which risks being translated into English as "Bruno Mayor". To mitigate both effects we use our multilingual named entity database. In the source language, each news item is analysed to identify possible entities; if an entity is recognised, its correct translation into English is retrieved from the database and suggested to the SMT system by enriching the source sentence using the XML markup option6 in Moses. This approach allows us to complement the training data, increasing the translation capability of our system.

3 http://langtech.jrc.it/DGT-TM.html
4 http://catalog.elra.info/
5 http://www.ldc.upenn.edu/
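For illustration, a source sentence enriched in this way might look like the snippet below; the tag name is arbitrary in Moses' XML input markup, and the exact decoder flag (e.g. -xml-input) should be checked against the Moses documentation.

# Sketch: wrapping recognised entities in XML markup so the decoder
# is offered the database translation for those spans.
def mark_entities(sentence, entities):
    """entities: dict mapping source-language entity strings to their
    English translations retrieved from the entity database."""
    for surface, english in entities.items():
        sentence = sentence.replace(
            surface, '<ne translation="%s">%s</ne>' % (english, surface))
    return sentence

print(mark_entities("Bruno Le Maire a rencontre le president .",
                    {"Bruno Le Maire": "Bruno Le Maire"}))
# The wrapped span keeps the decoder from producing "Bruno Mayor".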

How to deal with different language styles in the news? The news title writing style contains more gerund verbs and fewer (or no) linking verbs, prepositions and adverbs than normal sentences, while content sentences include more prepositions, adverbs and different verbal tenses. Starting from this assumption, we investigated whether this phenomenon can affect the translation performance of our system.

We trained two SMT systems, SMTcontent and SMTtitle, using the Europarl V4 German-English data as training corpus, and two different development sets: one made of content sentences, News Commentaries (Callison-Burch et al., 2009), and the other made of news titles in the source language which were translated into English using a commercial translation system. With the same strategy we also generated a Title test set. SMTtitle used a language model created using only English news titles. The News and Title test sets were translated by both systems. Although the performances obtained translating the News and Title corpora are not comparable, we were interested in analysing how the same test set is translated by the two systems. We noticed that translating a test set with a system that was optimized with the same type of data resulted in an improvement of almost 2 BLEU points: Title-TestSet: 0.3706 (SMTtitle), 0.3511 (SMTcontent); News-TestSet: 0.1768 (SMTtitle), 0.1945 (SMTcontent). This behaviour was also present in different language pairs. According to these results, we decided to use two different translation systems for each language pair, one optimized using title data and the other using normal content sentences. Even though this implementation choice requires more computational power to run two Moses servers in memory, it allows us to mitigate the workload of each single instance, reducing the translation time of each article, and to improve translation quality.

6 http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc4

3.1 Translation Quality

To evaluate the translation performance of ONTS, we ran a set of experiments in which we translate a test set for each language pair using our system and Google Translate. The lack of human-translated parallel titles obliges us to test only the content-based model. For German, Spanish and Czech we use the news test sets proposed in (Callison-Burch et al., 2010); for French and Italian the news test sets presented in (Callison-Burch et al., 2008); for Arabic, Farsi and Turkish, sets of 2,000 news sentences extracted from the Arabic-English and English-Persian datasets and the SE-Times corpus. For the other languages we use 2,000 sentences which are not news but a mixture of JRC-Acquis, Europarl and DGT-TM data. It is not guaranteed that our test sets are not part of the training data of Google Translate.

Each test set is translated by Google Translate – Translator Toolkit, and by our system. The BLEU score is used to evaluate the performance of both systems. Results (see Table 1) show that Google Translate produces better translations for those languages for which large amounts of data are available, such as French, German, Italian and Spanish. Surprisingly, for Danish, Portuguese and Polish, ONTS has better performance; this depends on the choice of the test sets, which are not made of news data but of data that is fairly homogeneous in terms of style and genre with the training sets.

The impact of the named entity module is evident for Arabic and Farsi, where each suggested English entity results in a larger coverage of the source language and better translations. For highly inflected and agglutinative languages such as Turkish, the output proposed by ONTS is poor. We are working on gathering more training data from the news domain and on the possibility of applying linguistic pre-processing to the documents.

Source L.    ONTS    Google T.
Arabic       0.318   0.255
Czech        0.218   0.226
Danish       0.324   0.296
Farsi        0.245   0.197
French       0.26    0.286
German       0.205   0.25
Italian      0.234   0.31
Polish       0.568   0.511
Portuguese   0.579   0.424
Spanish      0.283   0.334
Turkish      0.238   0.395

Table 1: Automatic evaluation.

4 Technical Implementation

The translation service is made of two components: the connection module and the Moses server. The connection module is a servlet implemented in Java. It receives the RSS files, isolates each single news article, identifies its source language and pre-processes it. Each news item is split into sentences; each sentence is tokenized, lowercased, passed through a statistical compound-word splitter (Koehn and Knight, 2003), and through the named entity annotator module. For language modelling we use the KenLM implementation (Heafield, 2011).
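The per-sentence pre-processing chain can be summarized as a simple composition of the steps just listed; the function names below are placeholders, not the service's actual API.

# Sketch: the connection module's per-article pre-processing chain.
def preprocess_article(text, split_sentences, tokenize,
                       split_compounds, annotate_entities):
    """split_sentences, tokenize, split_compounds and
    annotate_entities stand in for the components described above."""
    prepared = []
    for sentence in split_sentences(text):
        tokens = [t.lower() for t in tokenize(sentence)]
        tokens = split_compounds(tokens)   # statistical splitter
        prepared.append(annotate_entities(tokens))
    return prepared  # ready to be sent to the Moses servers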

According to the language, the correct Moses servers, title and content, are fed in a multi-threaded manner. We use the multi-threaded version of Moses (Haddow, 2010). When all the sentences of an article have been translated, the inverse process is run: they are detokenized, recased, and untranslated/unknown words are listed. The translated title and content of each article are uploaded into the RSS file, which is passed to the next modules.
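The exact feeding mechanism is not described in detail; below is a minimal sketch of sending sentences to per-language title and content Moses servers over XML-RPC, the interface mosesserver typically exposes. Host and port values are hypothetical.

```python
# Route each sentence to the title or content Moses server of its language.
import xmlrpc.client

SERVERS = {  # hypothetical endpoints, one pair per source language
    ("de", "title"):   "http://localhost:8080/RPC2",
    ("de", "content"): "http://localhost:8081/RPC2",
}

def translate(sentence, language, kind):
    proxy = xmlrpc.client.ServerProxy(SERVERS[(language, kind)])
    return proxy.translate({"text": sentence})["text"]

# e.g. translate("frau merkel hat konkrete schritte versprochen",
#                "de", "content")
```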

The full system, including the translation modules, runs on a 2x Quad-Core machine with Intel Hyper-Threading Technology processors and 48GB of memory. It is our intention to locate the Moses servers on different machines. This is possible thanks to the high modularity and customizability of the connection module. At the moment, translation models are available for the following source languages: Arabic, Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese, Spanish and Turkish.


Figure 1: Demo Web site.

4.1 Demo

Our translation service is currently presented on a demo web site, see Figure 1, which is available at http://optima.jrc.it/Translate/. News articles can be retrieved by selecting one of the topics and the language. All topics are assigned to each article using the methodology described in 2.2. These articles are shown in the left column of the interface. When the button "Translate" is pressed, the translation process starts and the translated articles appear in the right column of the page.

The translation system can be customized from the interface by enabling or disabling the named entity, compound, recaser, detokenizer and unknown word modules. Each translated article is enriched with the translation time in milliseconds per character and, if enabled, the list of unknown words. The interface is linked to the connection module and data is transferred using the RSS structure.

5 Discussion

In this paper we presented the Optima News Translation System and how it is connected to the Europe Media Monitor application. Different strategies are applied to increase translation performance, taking advantage of the document structure and of other resources available in our research group. We believe that the experiments described in this work can prove very useful for the development of other similar systems. Translations produced by our system will soon be available as part of the main EMM applications.

The performance of our system is encouraging, but not as good as that of web services such as Google Translate, mostly because we use less training data and have reduced computational power. On the other hand, our in-house system can be fed with a large number of articles per day and with sensitive data, without including third parties in the translation process. Performance and translation time vary according to the number and complexity of sentences and to the language pairs.

The domain of news articles changes dynamically according to the main events in the world, while existing parallel data is static and usually associated with governmental domains. It is our intention to investigate how to adapt our translation system by updating the language model with the English articles of the day.

Acknowledgments

The authors thank the JRC's OPTIMA team for its support during the development of ONTS.

References

O. Bojar and Z. Zabokrtsky. 2009. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 92.

C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz and J. Schroeder. 2008. Further Meta-Evaluation of Machine Translation. Proceedings of the Third Workshop on Statistical Machine Translation, pages 70–106. Columbus, US.

C. Callison-Burch, P. Koehn, C. Monz and J. Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 1–28. Athens, Greece.

C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki and O. Zaidan. 2010. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Uppsala, Sweden.

D. Chiang. 2005. Hierarchical phrase-based translation. Computational Linguistics, 33(2): pages 201–228. MIT Press.

B. Haddow. 2010. Adding multi-threaded decoding to Moses. The Prague Bulletin of Mathematical Linguistics, 93(1): pages 57–66. Versita.

K. Heafield. 2011. KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK.


P. Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of the Machine Translation Summit X, pages 79–86. Phuket, Thailand.

P. Koehn, F. J. Och and D. Marcu. 2003. Statistical phrase-based translation. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54. Edmonton, Canada.

P. Koehn and K. Knight. 2003. Empirical methods for compound splitting. Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, pages 187–193. Budapest, Hungary.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, demonstration session, pages 177–180. Prague, Czech Republic.

P. Lambert, H. Schwenk, C. Servan and S. Abdul-Rauf. 2011. Investigations on Translation Model Adaptation Using Monolingual Data. Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293. Edinburgh, Scotland.

D. Marcu, W. Wang, A. Echihabi and K. Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 48–54. Sydney, Australia.

M. Pilevar, H. Faili and A. Pilevar. 2011. TEP: Tehran English-Persian Parallel Corpus. Computational Linguistics and Intelligent Text Processing, pages 68–79. Springer.

B. Pouliquen and R. Steinberger. 2009. Automatic construction of multilingual name dictionaries. Learning Machine Translation, pages 59–78. MIT Press - Advances in Neural Information Processing Systems Series (NIPS).

A. Rafalovitch and R. Dale. 2009. United Nations General Assembly resolutions: A six-language parallel corpus. Proceedings of the MT Summit XII, pages 292–299. Ottawa, Canada.

R. Steinberger and B. Pouliquen. 2007. Cross-lingual named entity recognition. Lingvisticæ Investigationes, 30(1): pages 135–162. John Benjamins Publishing Company.

R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufis and D. Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 2142–2147. Genova, Italy.

R. Steinberger, B. Pouliquen and E. van der Goot. 2009. An Introduction to the Europe Media Monitor Family of Applications. Information Access in a Multilingual World: Proceedings of the SIGIR 2009 Workshop, pages 1–8. Boston, USA.

J. Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. Recent Advances in Natural Language Processing V: Selected Papers from RANLP 2007, pages 237–248.

M. Turchi, I. Flaounas, O. Ali, T. De Bie, T. Snowsill and N. Cristianini. 2009. Found in translation. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 746–749. Bled, Slovenia.

F. Tyers and M.S. Alperen. 2010. South-East European Times: A parallel corpus of Balkan languages. Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages, Valletta, Malta.


Just Title It! (by an Online Application)

Cedric Lopez, Violaine Prince, and Mathieu Roche
LIRMM, CNRS, University of Montpellier 2
161, rue Ada, Montpellier, France
lopez,prince,[email protected]

Abstract

This paper deals with an application of automatic titling. The aim of such an application is to attribute a title to a given text. Our application relies on three very different automatic titling methods. The first one extracts relevant noun phrases for use as a heading, the second one automatically constructs headings by selecting words appearing in the text, and the third one uses nominalization in order to propose informative and catchy titles. Experiments based on 1048 titles have shown that our methods provide relevant titles.

1 Introduction

The amount of textual documents is in perpetual growth and requires robust applications. Automatic titling is an essential task for several applications: automatic titling of e-mails without subjects, text generation, summarization, and so forth. Furthermore, a system able to title HTML documents, and thus to respect one of the W3C standards for Web site accessibility, is quite useful. The goal of the titling process is to provide a relevant representation of a document's content. It might use metaphors, humor, or emphasis, thus distinguishing the titling task from a summarization process and proving the importance of the rhetorical status in both tasks.

This paper presents an original application consisting in titling all kinds of texts. For that purpose, our application offers three main methods. The first one (called POSTIT) extracts noun phrases to be used as headings, the second one (called CATIT) automatically builds titles by selecting words appearing in the text, and, finally, the third one (called NOMIT) uses nominalization in order to propose relevant titles. Morphological and semantic treatments are applied to obtain titles close to real titles. In particular, titles have to respect two characteristics: relevance and catchiness.

2 Text Titling Application

The application presented in this paper was developed with PHP, and it is available on the Web1. It is based on several methods using Natural Language Processing (NLP) and Information Retrieval (IR) techniques. The input is a text and the output is a set of titles based on different kinds of strategies.

A single automatic titling method is not sufficient to title texts. Actually, it cannot respect the diversity noticed in real titles, which vary according to the writer's personal interests and/or his/her writing style. With the aim of getting closer to this variety, the user can choose the most relevant title according to his personal criteria among a list of titles automatically proposed by our system.

A few other applications have focused on titling. One of the most typical, (Banko, 2000), consists in generating coherent summaries that are shorter than a single sentence. These summaries are called "headlines". The main difficulty is to adjust the threshold (i.e., the headline length) in order to obtain syntactically correct titles, whereas our methods create titles which are intrinsically correct, both syntactically and semantically.

In this section, we present the POSTIT, CATIT, and NOMIT methods. These three methods run in parallel, without interaction with each other. Three very different titles are thus determined for every text. For each method, an example of the produced title is given for the following sample text: "In her speech, Mrs Merkel has promised concrete steps towards a fiscal union - in effect close integration of the tax-and-spend policies of individual eurozone countries, with Brussels imposing penalties on members that break the rules. [...]". Even if the examples are in English, the application is actually in French (but easily reproducible in English). The POS tagging was performed with Sygfran (Chauche, 1984).

1. https://www2.lirmm.fr/~lopez/Titrage_general/

2.1 POSTIT

(Jin, 2001) implemented a set of title generation methods and evaluated them: the statistical approach based on TF-IDF obtains the best results. In the same way, the POSTIT (Titling using Position and Statistical Information) method uses statistical information. Related works have shown that verbs are not as widely spread as nouns, named entities, and adjectives (Lopez, 2011a). Moreover, it was noticed that elements appearing in the title are often present in the body of the text (Zajic et al., 2002). (Zhou and Hovy, 2003) support this idea and show that the covering rate of those words present in titles is very high in the first sentences of a text. So, the main idea is to extract noun phrases from the text and to select the most relevant one for use as a title. The POSTIT approach is composed of the following steps:

1. Candidate Sentence Determination. We assume that any text contains only a few relevant sentences for titling. The goal of this step consists in recognizing them. Statistical analysis shows that, very often, terms useful for titling are located in the first sentences of the text.

2. Extracting Candidate Noun Phrases for Titling. This step uses syntactical filters relying on the statistical studies previously led. For that purpose, texts are tagged with Sygfran. Our syntactical patterns allowing noun phrase extraction are also inspired by (Daille, 1996).

3. Selecting a Title. Last, candidate noun phrases (t) are ranked according to a score based on TF-IDF and on information about noun phrase position (NPPOS) (see Lopez, 2011a). With λ = 0.5, this method obtains good results (see Formula 1 and the sketch below):

NPscore(t) = λ × NPPOS(t) + (1 − λ) × NPTF-IDF(t)    (1)
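A minimal sketch of this ranking over pre-extracted candidates; the candidate phrases and their normalized position and TF-IDF scores are hypothetical inputs, not output of the actual system.

```python
# Sketch of the POSTIT ranking (Formula 1) over hypothetical candidates,
# whose NPpos and NPtf-idf scores are assumed normalized to [0, 1].
LAMBDA = 0.5

def np_score(np_pos, np_tfidf, lam=LAMBDA):
    return lam * np_pos + (1 - lam) * np_tfidf

candidates = [  # (noun phrase, NPpos, NPtf-idf)
    ("concrete steps towards a fiscal union", 0.9, 0.7),
    ("individual eurozone countries", 0.5, 0.6),
]
best = max(candidates, key=lambda c: np_score(c[1], c[2]))
print(best[0])  # -> concrete steps towards a fiscal union
```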

Example of title with POSTIT: Concrete steps towards a fiscal union.

On the one hand, this method proposes titles which are syntactically correct. On the other hand, the provided titles cannot be considered as original. The next method, called CATIT, enables the generation of more 'original' titles.

2.2 CATIT

CATIT (Automatic Construction of Titles) is an automatic process that constructs short titles. Titles have to show coherence with both the text and the Web, as well as with their dynamic context (Lopez, 2011b). This process is based on a global approach consisting of three main stages:

1. Generation of Candidate Titles. The purpose is to extract relevant nouns (using the TF-IDF criterion) and adjectives (using the TF criterion) from the text. Potentially relevant couples (candidate titles) are built respecting the "Noun Adjective" and/or "Adjective Noun" syntactical patterns.

2. Coherence of Candidate Titles. Among the list of candidate titles, which ones are grammatically and semantically consistent? The produced titles are supposed to be consistent with the text through the use of TF-IDF. To reinforce coherence, we set up a distance coefficient between a noun and an adjective, which constitutes a new coherence criterion for candidate titles. Besides, the frequency of appearance of candidate titles on the Web (with the Dice measure; see the sketch after this list) is used in order to measure the dependence between the noun and the adjective composing a candidate title. This method thus automatically favours well-formed candidates.

3. Dynamic Contextualization of Candidate Titles. To determine the most relevant candidate title, the text context is compared with the context in which these candidates are met on the Web. Both are modeled as vectors, according to Salton's vector model.
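As a rough illustration of the Web-based dependence measure used in stage 2, the sketch below computes the Dice coefficient from hit counts; the counts are hypothetical stand-ins for search engine results.

```python
# Dice coefficient between a noun and an adjective from (hypothetical) Web
# hit counts: 2 * hits(noun adjective) / (hits(noun) + hits(adjective)).
def dice(hits_pair, hits_noun, hits_adj):
    total = hits_noun + hits_adj
    return 2.0 * hits_pair / total if total else 0.0

# A well-formed couple should score higher than an implausible one.
print(dice(12000, 800000, 150000))  # plausible "Noun Adjective" couple
print(dice(3, 800000, 90000))       # implausible couple
```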

Example of title with CATIT: Fiscal penalties.

The automatic generation of titles is a complex task because titles have to be coherent, grammatically correct, informative, and catchy. These criteria are a brake on the generation of longer titles (being studied). That is why we suggest a new approach consisting in reformulating relevant phrases in order to determine informative and catchy "long" titles.

2.3 NOMIT

Based on statistical analysis, NOMIT (Nominalization for Titling) provides original titles relying on several rules to transform a verbal phrase into a noun phrase.

1. Extracting Candidates. The first step consists in extracting segments of phrases which contain a past participle (in French). For example: In her speech, Mrs Merkel has promised "concrete steps towards a fiscal union" - in effect close integration of the tax-and-spend policies of individual eurozone countries, with Brussels imposing penalties on members that break the rules.

2. Linguistic Treatment. The linguistic treatment of the segments retained in the previous step consists of two steps aiming at nominalizing the "auxiliary + past participle" form (very frequent in French). The first step consists in associating a noun with each past participle. The second step uses transformation rules in order to obtain nominalized segments. For example: has promised ⇒ promise (see the sketch after this list).

3. Selecting a Title. The selection of the most relevant title relies on a Web validation. The interest of this validation is twofold. On the one hand, the objective is to validate the connection between the nominalized past participle and its complement. On the other hand, the interest is to eliminate semantically incorrect or unpopular constituents (e.g., "annunciation of the winners"), preferring those which are more popular on the Web (e.g., "announcement of the winners").
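A minimal sketch of the nominalization in step 2, using English words for readability; the participle-to-noun lexicon is a tiny hypothetical sample, and the actual rules operate on French.

```python
# Sketch of NOMIT's step 2: an "auxiliary + past participle" segment is
# turned into a noun phrase. The lexicon below is a hypothetical sample.
PARTICIPLE_TO_NOUN = {"promised": "promise", "announced": "announcement"}

def nominalize(segment):
    _aux, participle, *rest = segment.split()
    noun = PARTICIPLE_TO_NOUN[participle]
    return " ".join([noun, "of"] + rest) if rest else noun

print(nominalize("has promised concrete steps towards a fiscal union"))
# -> promise of concrete steps towards a fiscal union
```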

Figure 1: Screenshot of Automatic Titling Evaluation

Example of title with NOMIT: Mrs Merkel: Promise of a concrete step towards a fiscal union.

This method enables us to obtain even more original titles than the previous one (i.e., CATIT). A positive aspect is that new transformation rules can easily be added in order to respect the morpho-syntactical patterns of real titles.

3 Evaluations

3.1 Protocol Description

An online evaluation has been set up, accessible to everyone (cf. Figure 1)2. The benefit of such an evaluation is to compare different automatic methods according to several judgements. For each text proposed to the human user, several titles are presented, each one resulting from one of the automatic titling methods presented in this paper (POSTIT, CATIT, and NOMIT). Furthermore, random titles stemming from the CATIT and POSTIT methods are also evaluated (CATIT-R and POSTIT-R), i.e., candidate titles built by our methods but not selected because of their bad score. The idea is to measure the efficiency of our ranking functions.

This evaluation was run on French articles stemming from the daily newspaper 'Le Monde'. We retained the first article published every day of the year 1994, up to a total of 200 journalistic articles. 190 people participated in the online experiment, evaluating a total of 1048 titles. On average, every person evaluated 41 titles. Every title was evaluated by several people (between 2 and 10). The total number of obtained evaluations is 7764.

2. URL: http://www2.lirmm.fr/~lopez/Titrage_general/evaluation_web2/


3.2 Results

The results of this evaluation indicate that the most adapted titling method for articles is NOMIT. It enables the titling of 82.7% of texts in a relevant way (cf. Table 1). However, NOMIT does not determine titles for all the texts (in this evaluation, NOMIT determined a title for 58 texts). Indeed, if no past participle is present in the text, no title is returned by this method. It is thus essential to consider the other methods, which ensure a title for every text. POSTIT enables the titling of 70% of texts in a relevant way. It is interesting to note that the combined POSTIT and NOMIT methods provide at least one relevant title for 74% of texts (cf. Table 2). Finally, even if CATIT obtains a weak score, this method provides a relevant title where POSTIT and NOMIT are silent. So, these three combined methods propose at least one relevant title for 81% of journalistic articles.

Concerning catchiness, the three methods seem equivalent, proposing catchy titles for approximately 50% of texts. The three combined methods propose at least one catchy title for 78% of texts. Real titles (RT) obtain a close score (80.5%).

%                   POSTIT  POSTIT-R  CATIT  CATIT-R  NOMIT   RT
Very relevant (VR)   39.1    16.4     15.7    10.3    60.3   71.4
Relevant (R)         30.9    22.3     21.3    14.5    22.4   16.4
(VR) and (R)         70.0    38.7     37.0    24.8    82.7   87.8
Not relevant         30.0    61.4     63.0    75.2    17.2   12.3
Catchy               49.1    30.9     47.2    32.2    53.4   80.5
Not catchy           50.9    69.1     52.8    67.8    46.6   19.5

Table 1: Average scores of our application.

%             POSTIT & NOMIT  POSTIT & CATIT  NOMIT & CATIT  POSTIT, CATIT, & NOMIT
(VR)                47              46              28                 54
(R) or (VR)         74              78              49                 81
Catchy              57              73              55                 78

Table 2: Results of the combined methods.

Also, let us note that our ranking functions are relevant, since CATIT-R and POSTIT-R obtain weak results compared with CATIT and POSTIT.

4 Conclusions

In this paper, we have compared the efficiency of three methods using various techniques. POSTIT uses noun phrases extracted from the text, CATIT consists in constructing short titles, and NOMIT uses nominalization. We proposed three different methods to approach the real context. Two persons can propose different titles for the same text, depending on personal criteria and on their own interests. That is why automatic titling is a complex task, as much as the evaluation of catchiness, which remains subjective. Evaluation shows that our application provides relevant titles for 81% of texts and catchy titles for 78% of texts. These results are very encouraging because real titles obtain close results.

A future work will consist in taking into account a context defined by the user. For example, the generated titles could depend on a political context if the user chooses to select a given thread. Furthermore, an "extended" context, automatically determined from the user's choice, could enhance or refine the user's desiderata.

A next step will consist in adapting this application to English.

References

Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. ACL'00. p. 318–325.

Jacques Chauche. 1984. Un outil multidimensionnel de l'analyse du discours. COLING'84. p. 11-15.

Beatrice Daille. 1996. Study and implementation of combined techniques for automatic extraction of terminology. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. p. 29-36.

Rong Jin and Alexander G. Hauptmann. 2001. Automatic title generation for spoken broadcast news. Proceedings of the First International Conference on Human Language Technology Research. p. 1–3.

Cedric Lopez, Violaine Prince, and Mathieu Roche. 2011. Automatic Titling of Articles Using Position and Statistical Information. RANLP'11. p. 727-732.

Cedric Lopez, Violaine Prince, and Mathieu Roche. 2011. Automatic Generation of Short Titles. LTC'11. p. 461-465.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24. p. 513-523.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. International Conference on New Methods in Language Processing. p. 44-49.

Franck Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1). p. 1-38.

David Zajic, Bonnie Dorr, and Richard Schwartz. 2002. Automatic headline generation for newspaper stories. ACL 2002. Philadelphia.

Liang Zhou and Eduard Hovy. 2003. Headline summarization at ISI. DUC 2003. Edmonton, Alberta, Canada.


Folheador: browsing through Portuguese semantic relations

Hugo Goncalo Oliveira
CISUC, University of Coimbra
[email protected]

Hernani Costa
FCCN, Linguateca & CISUC, University of Coimbra, Portugal
[email protected]

Diana Santos
FCCN, Linguateca & University of Oslo, Norway
[email protected]

Abstract

This paper presents Folheador, an online service for browsing through Portuguese semantic relations acquired from different sources. Besides facilitating the exploration of Portuguese lexical knowledge bases, Folheador is connected to services that access Portuguese corpora, which provide authentic examples of the semantic relations in context.

1 Introduction

Lexical knowledge bases (LKBs) hold information about the words of a language and their interactions, according to their possible meanings. They are typically structured on word senses, which may be connected by means of semantic relations. Besides being important resources for language studies, LKBs are key resources in the achievement of natural language processing tasks, such as word sense disambiguation (see e.g. Agirre et al. (2009)) or question answering (see e.g. Pasca and Harabagiu (2001)).

Given the complexity of most knowledge bases, their data formats are generally not suited to being read by humans. User interfaces have thus been developed to provide easier ways of exploring a knowledge base and assessing its contents. For instance, for LKBs, in addition to information on words and semantic relations, it is important that these interfaces provide usage examples where the semantic relations hold, or at least where the related words co-occur.

In this paper, we present Folheador1, an online browser for Portuguese LKBs. Besides an interface for navigating through semantic relations acquired from different sources, Folheador is linked to two services that provide access to Portuguese corpora, thus allowing the observation of related words co-occurring in authentic contexts of use, some of them even evaluated by humans.

1. See http://www.linguateca.pt/Folheador/

After introducing several well-known LKBs and their interfaces, we present Folheador and its main features, also detailing the contents of the knowledge base currently browsable through this interface, which contains information acquired from public domain lexical resources of Portuguese. Then, before concluding, we discuss additional features planned for the future.

2 Related Work

Here, we mention a few interfaces that ease the exploration of well-known knowledge bases. Regarding the knowledge base structure, some of these interfaces are significantly different.

Princeton WordNet (Fellbaum, 1998) is the most widely used LKB to date. In addition to other alternatives, the creators of WordNet provide online access to their resource through the WordNet Search interface (Princeton University, 2010)2. As WordNet is structured around synsets (groups of synonymous lexical items), querying for a word prompts all synsets containing that word to be presented. For each synset, its part-of-speech (PoS), a gloss and a usage example are provided. Synsets can also be expanded to access the semantic relations they are involved in.

As a resource also organised in synsets, the Brazilian Portuguese thesaurus TeP3 has a similar interface (Maziero et al., 2008). Nevertheless, since TeP does not contain relations besides antonymy, its interface is simpler and provides only the synsets containing a queried word and their part-of-speech.

2. http://wordnetweb.princeton.edu/perl/webwn

MindNet (Vanderwende et al., 2005) is a LKB extracted automatically, mainly from dictionaries, and structured on semantic relations connecting word senses to words. Its authors provide MNEX4, an online interface for MindNet. After querying for a pair of words, MNEX provides all the semantic relation paths between them, established by a set of links that connect one word to the other, directly or indirectly. It is also possible to view the definitions that originated the path.

FrameNet (Baker et al., 1998) is a manually built knowledge base structured on semantic frames that describe objects, states or events. There are several means for exploring FrameNet easily, including FrameSQL (Sato, 2003)5, which allows searching for frames, lexical units and relations in an integrated interface, and FrameGrapher6, a graphical interface for the visualization of frame relations. For each frame, in both interfaces, a textual definition, annotated sentences of the frame elements, lists of the frame relations, and lists with the lexical units in the frame are provided.

ReVerb (Fader et al., 2011) is a Web-scale information extraction system that automatically acquires binary relations from text. Using ReVerb Search7, a web interface for ReVerb extractions, it is possible to obtain sets of relational triples where the predicate and/or the arguments contain given strings. As each of the former is optional, it is possible, for instance, to search for all triples with the predicate loves and first argument Portuguese. Search results include the matching triples, organised according to the name of the predicate, as well as the number of times each triple was extracted. The sentences each triple was extracted from are provided as well.

3. http://www.nilc.icmc.usp.br/tep2
4. http://stratus.research.microsoft.com/mnex/
5. http://framenet2.icsi.berkeley.edu/frameSQL/fn2_15/notes/
6. https://framenet.icsi.berkeley.edu/fndrupal/FrameGrapher
7. http://www.cs.washington.edu/research/textrunner/reverbdemo.html

Finally, Visual Thesaurus (Huiping et al., 2006)8 is a proprietary graphical interface that provides an alternative way of exploring a knowledge base structured on word senses, synonymy, antonymy and hypernymy relations. It presents a graph centered on a queried word, connected to its senses, as well as semantic relations between the senses and other words. Nodes and edges have a different color or look, respectively according to the PoS of the sense or to the type of semantic relation. If a word is clicked, a new graph, centered on that word, is drawn.

3 Folheador

Folheador, shown in Figure 1, is an online service for browsing through instances of semantic relations, represented as relational triples.

Folheador was originally designed as an interface for PAPEL (Goncalo Oliveira et al., 2010), a public domain lexical-semantic network, automatically extracted from a proprietary dictionary. It was soon expanded to other (public) resources for Portuguese as well (see Santos et al. (2010) for an overview of Portuguese LKBs).

The current version of Folheador browses through a LKB that, besides PAPEL, integrates semantic triples from the following sources: (i) synonymy acquired from two handcrafted thesauri of Portuguese9, TeP (Dias-Da-Silva and de Moraes, 2003; da Silva et al., 2002) and OpenThesaurus.PT10; (ii) relations extracted automatically in the scope of the project Onto.PT (Goncalo Oliveira and Gomes, 2010; Goncalo Oliveira et al., 2011), which include triples extracted from Wiktionary.PT11 and from Dicionario Aberto (Simoes and Farinha, 2011), both public domain dictionaries.

The underlying relation triples in Folheador are thus of the form x RELATED-TO y, where x and y are lexical items and RELATED-TO is a predicate. Their interpretation is as follows: one sense of x is related to one sense of y by means of a relation whose type is identified by RELATED-TO.

8. http://www.visualthesaurus.com/
9. We converted the thesauri to triples x synonym-of y, where x and y are lexical items in the same synset.
10. http://openthesaurus.caixamagica.pt/
11. http://pt.wiktionary.org/


Figure 1: Folheador’s interface.

3.1 Navigation

It is possible to use Folheador to search for all relations with one, two, or no fixed arguments, and with one or no types (relation names). Combining these options, Folheador can be used, for instance, to obtain: all lexical items related to a particular item; all relations between two lexical items; or a sample of relations involving a particular type.

The matching triples are listed and may be filtered according to the resource they were extracted from. For each triple, the PoS of the arguments is shown, as well as a list identifying the resources from which it was acquired. The arguments of each triple are also links that make navigation easier. When clicked, Folheador behaves the same way as if it had been queried with the clicked word as argument. Also, since the queried lexical item may occur as the first or the second argument of a triple, when it occurs as the second, Folheador inverts the relation, so that the item always appears as the first argument. Therefore, there is no need to store both the direct and the inverse triples.

Consider the example in Figure 1: it shows the triples retrieved after searching for the word computador (computer, in English). In most of the retrieved triples, computador is a noun (e.g. computador HIPONIMO DE maquina), but there are relations where it is an adjective (e.g. computador PROPRIEDADE DO QUE computar). Moreover, as hypernymy relations are stored in the form x HIPERONIMO DE y, some of the presented triples, such as computador HIPONIMO DE maquina and computador HIPONIMO DE aparelho, have been inverted on the fly.
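As an illustration of this storage choice, the sketch below inverts triples on the fly at query time; the tiny triple store and the inverse-name mapping are hypothetical simplifications, not Folheador's actual code.

```python
# Query stored triples so the queried word always appears as the first
# argument, inverting relations on the fly instead of storing both
# directions. Triple store and inverse names are hypothetical.
TRIPLES = [
    ("maquina", "HIPERONIMO DE", "computador"),
    ("computador", "PROPRIEDADE DO QUE", "computar"),
]
INVERSE = {"HIPERONIMO DE": "HIPONIMO DE", "HIPONIMO DE": "HIPERONIMO DE"}

def query(word):
    for x, rel, y in TRIPLES:
        if x == word:
            yield (x, rel, y)
        elif y == word:
            yield (y, INVERSE.get(rel, rel), x)

print(list(query("computador")))
# -> [('computador', 'HIPONIMO DE', 'maquina'),
#     ('computador', 'PROPRIEDADE DO QUE', 'computar')]
```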

Furthermore, for each triple, Folheador presents two confidence values: one based on the mere co-occurrence of the words in corpora, and another based on the co-occurrence of the related words instantiating discriminating patterns of the particular relation.

3.2 Graph visualization

Currently, Folheador contains a very simple visualization tool, which draws the semantic relation graph established by the search results on a page, as in Figure 2. In the future, we aim to provide an alternative for navigation based on textual links, which would be made through the graph.

3.3 The use of corpora

Figure 2: Graph for the results in Figure 1.

One of the problems of most lexical resources is that they do not integrate or contain frequency information. This is especially true when one is not simply listing words but going deeper into meaning, listing semantic properties like word senses or relationships between senses.

So, a list of relations among words can conflate a number of highly specialized and obsolete words (or word senses) that co-occur with important and productive relations in everyday use, which is not a good thing for human and automatic users alike. On the other hand, using corpora allows one to add frequency information to both participants in the relation and to the triples themselves, and thus provide another axis to the description of words.

In addition, it is always interesting to observe language use in context, especially in cases where the user is not sure whether the relation is correct or still in use (and the user can and should be fairly suspicious when s/he is browsing automatically compiled information). A corpus check therefore provides illustration and confirmation to a user facing an unusual or surprising relation, in addition to evaluation data for the relation curator or lexicographer. If these checks have been done before by a set of human beings (as is the case of VARRA (Freitas et al., forthcoming)), one can have much more confidence in the browsed data, something that is important for users.

Having this in mind, besides allowing queries for stored relational triples, Folheador is connected to AC/DC (Santos and Bick, 2000; Santos, 2011), an online service that provides access to a large set of Portuguese corpora. In just one click, it is possible to query for all the sentences in the AC/DC corpora connecting the arguments of a retrieved triple. Figure 3 shows some of the results for the words computador (computer) and aparelho (apparatus). While some of the returned sentences might contain the related words co-occurring almost by chance or without a clear semantic relation, other sentences validate the triple (e.g. sentence par=saude16727 in Figure 3). Sometimes, the sentences might invalidate the triple as well.

Furthermore, for some of the relation types, it is possible to connect to another online service, VARRA (Freitas et al., forthcoming), which is based on a set of patterns that express some of the relation types in corpora text. After clicking on the VARRA link, this service is queried for occurrences of the corresponding triple in AC/DC. The presented sentences (a subset of those returned by the previous service) will thus contain the related words connected by a discriminating pattern for the relation they hold. Figure 4 shows two sentences returned for the relation computador HIPONIMO DE maquina.

These patterns, like those proposed by Hearst (1992) and used in many projects since, may not be 100% reliable. So, VARRA was designed to allow human users to classify the sentences according to whether they validate the relation, are just compatible with it, or neither.
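For illustration, a discriminating pattern of the kind proposed by Hearst (1992) can be matched as in the sketch below; the English "such as" pattern and the sentence are hypothetical stand-ins, since VARRA's patterns are defined for Portuguese.

```python
# Hearst-style hypernymy pattern ("Y such as X") matched with a regular
# expression; pattern and sentence are illustrative, not VARRA's own.
import re

PATTERN = re.compile(r"(\w+) such as (\w+)")

sentence = "machines such as computers are everywhere"
match = PATTERN.search(sentence)
if match:
    hypernym, hyponym = match.groups()
    print(hyponym, "HIPONIMO DE", hypernym)  # computers HIPONIMO DE machines
```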

In fact, people do not usually write definitions, especially when using common sense terms in ordinary discourse. Thus, the co-occurrence of semantically related terms frequently indicates a particular relation only implicitly. The choice of assessing sentences as good validators of a semantic relation is related to the task of automatically finding good illustrative examples for dictionaries, which is a surprisingly complex task (Rychly et al., 2008).

This kind of information, amassed with the help of VARRA, is much more difficult to create, but is of great value to Folheador, since it provides good illustrative contexts for the related lexical items.

4 Further work and concluding remarks

Figure 3: AC/DC: some sentences returned for the related words computador and aparelho.

Figure 4: VARRA: sentences that exemplify the relation computador hyponym-of maquina.

We have shown that, as it is, Folheador is very useful, as it enables browsing for triples with fixed arguments, identifies the source of the triples, and, in one click, provides real sentences where the related lexical items co-occur. Still, we are planning to implement new basic features, such as the suggestion of words when the searched word is not in the LKB. Also, while currently Folheador only directly connects to AC/DC and VARRA, in order to increase its usability we plan to connect it automatically to online definitions and other services available on the Web. We intend as well to crosslink Folheador from the AC/DC interface, in the sense that one can invoke Folheador also by just one click (Santos, forthcoming).

Currently, Folheador gives access to 169,385 lexical items: 93,612 nouns, 38,409 verbs, 33,497 adjectives and 3,867 adverbs, in a total of 722,589 triples, and it can browse through the following types of semantic relations: synonymy, hypernymy, part-of, member-of, causation, producer-of, purpose-of, place-of, and property-of. However, as the underlying resources, especially those created automatically, will continue to be updated, one important challenge is to create a service that does not become outdated, accompanying the progress of these resources, ideally through an automatic update every month. Furthermore, we believe that quantitative studies on the comparison and the aggregation of the integrated resources should be made, deeper than what is presented in Goncalo Oliveira et al. (2011).

We would like to end by emphasizing that we are aware that the proper interpretation of the semantic relations may vary in the different resources, even disregarding possible mistakes in the automatic harvesting. It is enough to consider the (regular morphological) relation between a verb and an adjective/noun ending in -dor in Portuguese (which can be paraphrased by one who Vs). For instance, in relations such as sofrer - sofredor, correr - corredor, roer - roedor, the kind of verb defines the kind of temporal relation conveyed: a rodent (roedor) is essentially roendo (gnawing), while a sofredor (sufferer) hopefully suffers only in a particular situation and can stop suffering, and a corredor (runner) runs as a job or as a role.

The source code of Folheador is open source12, so it may be used by other authors to explore their knowledge bases. Technical information about Folheador may be found in Costa (2011).

Acknowledgements

Folheador was developed under the scope of Linguateca, throughout the years jointly funded by the Portuguese Government, the European Union (FEDER and FSE), UMIC, FCCN and FCT. Hugo Goncalo Oliveira is supported by the FCT grant SFRH/BD/44955/2008, co-funded by FSE.

12. Available from http://code.google.com/p/folheador/

References

Eneko Agirre, Oier Lopez De Lacalle, and Aitor Soroa. 2009. Knowledge-based WSD on specific domains: performing better than generic supervised WSD. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, pages 1501–1506, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90, Morristown, NJ, USA. ACL Press.

Hernani Costa. 2011. O desenho do novo Folheador. Technical report, Linguateca.

Bento C. Dias da Silva, Mirna F. de Oliveira, and Helio R. de Moraes. 2002. Groundwork for the Development of the Brazilian Portuguese Wordnet. In Nuno Mamede and Elisabete Ranchhod, editors, Advances in Natural Language Processing (PorTAL 2002), LNAI, pages 189–196, Berlin/Heidelberg. Springer.

Bento Carlos Dias-Da-Silva and Helio Roberto de Moraes. 2003. A construcao de um thesaurus eletronico para o portugues do Brasil. ALFA, 47(2):101–115.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, Edinburgh, Scotland, UK.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.

Claudia Freitas, Diana Santos, Hugo Goncalo Oliveira, and Violeta Quental. forthcoming. VARRA: Validacao, Avaliacao e Revisao de Relacoes semanticas no AC/DC. In Atas do IX Encontro de Linguística de Corpus, ELC 2010.

Hugo Goncalo Oliveira and Paulo Gomes. 2010. Onto.PT: Automatic Construction of a Lexical Ontology for Portuguese. In Proceedings of the 5th European Starting AI Researcher Symposium (STAIRS 2010), pages 199–211. IOS Press.

Hugo Goncalo Oliveira, Diana Santos, and Paulo Gomes. 2010. Extraccao de relacoes semanticas entre palavras a partir de um dicionario: o PAPEL e a sua avaliacao. Linguamatica, 2(1):77–93.

Hugo Goncalo Oliveira, Leticia Anton Perez, Hernani Costa, and Paulo Gomes. 2011. Uma rede lexico-semantica de grandes dimensoes para o portugues, extraída a partir de dicionarios electronicos. Linguamatica, 3(2):23–38.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539–545, Morristown, NJ, USA. ACL Press.

Du Huiping, He Lin, and Hou Hanqing. 2006. Thinkmap Visual Thesaurus: a new kind of knowledge organization system. Library Journal, 12.

Erick G. Maziero, Thiago A. S. Pardo, Ariani Di Felippo, and Bento C. Dias-da-Silva. 2008. A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletronico para o Portugues do Brasil. In VI Workshop em Tecnologia da Informacao e da Linguagem Humana, TIL 2008, pages 390–392.

Marius Pasca and Sanda M. Harabagiu. 2001. The informative role of WordNet in open-domain question answering. In Proceedings of the NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, pages 138–143, Pittsburgh, USA.

Princeton University. 2010. Princeton University "About WordNet". http://wordnet.princeton.edu.

Pavel Rychly, Milos Husak, Adam Kilgarriff, Michael Rundell, and Katy McAdam. 2008. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX International Congress, pages 425–432, Barcelona. Institut Universitari de Linguística Aplicada.

Diana Santos and Eckhard Bick. 2000. Providing Internet access to Portuguese corpora: the AC/DC project. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, LREC'2000, pages 205–210. ELRA.

Diana Santos, Anabela Barreiro, Claudia Freitas, Hugo Goncalo Oliveira, Jose Carlos Medeiros, Luís Costa, Paulo Gomes, and Rosario Silva. 2010. Relacoes semanticas em portugues: comparando o TeP, o MWN.PT, o Port4NooJ e o PAPEL. In A. M. Brito, F. Silva, J. Veloso, and A. Fieis, editors, Textos seleccionados. XXV Encontro Nacional da Associacao Portuguesa de Linguística, pages 681–700. APL.

Diana Santos. 2011. Linguateca's infrastructure for Portuguese and how it allows the detailed study of language varieties. OSLa: Oslo Studies in Language, 3(2):113–128. Volume edited by J. B. Johannessen, Language variation infrastructure.

Diana Santos. forthcoming. Corpora at Linguateca: vision and roads taken. In Tony Berber Sardinha and Telma Sao Bento Ferreira, editors, Working with Portuguese corpora.

Hiroaki Sato. 2003. FrameSQL: A software tool for FrameNet. In Proceedings of Asialex 2003, pages 251–258, Tokyo. Asian Association of Lexicography.

Alberto Simoes and Rita Farinha. 2011. Dicionario Aberto: Um novo recurso para PLN. Vice-Versa, pages 159–171.

Lucy Vanderwende, Gary Kacmarcik, Hisami Suzuki, and Arul Menezes. 2005. MindNet: an automatically-created lexical resource. In Proceedings of HLT/EMNLP 2005 Interactive Demonstrations, pages 8–9, Vancouver, British Columbia, Canada. ACL Press.


A Computer Assisted Speech Transcription System

Alejandro Revuelta-Martínez, Luis Rodríguez, Ismael García-Varea
Computer Systems Department
University of Castilla-La Mancha, Albacete, Spain
Alejandro.Revuelta,Luis.RRuiz,[email protected]

Abstract

Current automatic speech transcription systems can achieve high accuracy, although they still make mistakes. In some scenarios high quality transcriptions are needed and, therefore, fully automatic systems are not suitable for them. These high accuracy tasks require a human transcriber. However, we consider that automatic techniques could improve the transcriber's efficiency. With this idea we present an interactive speech recognition system, integrated with a word processor, that assists users when transcribing speech. This system automatically recognizes speech while allowing the user to interactively modify the transcription.

1 Introduction

Speech has been the main means of communication for thousands of years and, hence, is the most natural human interaction mode. For this reason, Automatic Speech Recognition (ASR) has been one of the major research interests within the Natural Language Processing (NLP) community.

Although current speech recognition approaches (which are based on statistical learning theory (Jelinek, 1998)) are speaker independent and achieve high accuracy, ASR systems are not perfect, and transcription errors rise drastically when considering large vocabularies, noisy environments or spontaneous speech. In tasks where perfect recognition results are required (for example, the automatic transcription of parliamentary proceedings), ASR cannot be fully reliable so far, and a human transcriber has to check and supervise the automatically generated transcriptions.

In recent years, cooperative systems, where a human user and an automatic system work together, have gained growing attention. Here we present a system that interactively assists a human transcriber using ASR software. The proposed tool is fully embedded into a widely used, open source word processor and relies on an ASR system that proposes suggestions to the user in the form of practical transcriptions of the input speech. The user is allowed to introduce corrections at any moment of the discourse and, each time an amendment is performed, the system will take it into account in order to propose a new transcription (always preserving the decisions made by the user, as can be seen in Fig. 1). The rationale behind this idea is to reduce the human user's effort and increase efficiency.

Our proposal's main contribution is that it carries out an interactive ASR process, continually proposing new transcriptions that take user amendments into account to increase their usefulness. To our knowledge, no current transcription package provides such an interactive process.

2 Theoretical Background

Computer Assisted Speech Transcription (CAST) can be addressed by extending the statistical approach to ASR. Specifically, we have an input signal to be transcribed, x, and the user feedback in the form of a fully correct transcription prefix, p (an example of a CAST session is shown in Fig. 1). From this, the recognition system has to search for the optimal completion (suffix) s as:

ŝ = arg max_s Pr(s | x, p) = arg max_s Pr(x | p, s) · Pr(s | p)    (1)


where, as in traditional ASR, we have an acoustic model Pr(x | p, s) and a language model Pr(s | p). The main difference is that, here, part of the correct transcription (the prefix) is available, and we can use this information to improve the suffix recognition. This can be achieved by properly adapting the language model to account for the user-validated prefix, as detailed in (Rodríguez et al., 2007; Toselli et al., 2011).
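To illustrate the search in Equation 1, the sketch below picks the best completion among scored hypotheses consistent with the validated prefix; the toy n-best list stands in for the word lattices the real system uses, and the scores are hypothetical log-probabilities.

```python
# Toy prefix-constrained completion: among hypotheses starting with the
# user-validated prefix, suggest the suffix of the best-scoring one.
def best_completion(prefix, nbest):
    consistent = [(h, s) for h, s in nbest if h.startswith(prefix)]
    if not consistent:
        return None
    best, _ = max(consistent, key=lambda pair: pair[1])
    return best[len(prefix):].strip()

nbest = [  # hypothetical scored hypotheses
    ("nine extrasolar planets have been discovered this year", -12.3),
    ("nine extra solar planets have been discovered these years", -13.1),
]
print(best_completion("nine extrasolar", nbest))
# -> planets have been discovered this year
```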

As commented before, the main goal of this approach is to improve the efficiency of the transcription process by saving user keystrokes. Off-line experiments have shown that this approach can save about 30% of the typing effort when compared to the traditional approach of post-editing the output of an ASR system.

3 Prototype Description

A fully functional prototype, which implements the CAST techniques described in Section 2, has been developed. The main goal is to provide a completely usable tool. To this end, we have implemented a tool that easily allows for organizing and accessing different transcription projects. Besides, the prototype has been embedded into a widely used office suite. This way, the transcribed document can be properly formatted, since all the features provided by a word processor are available during the transcription process.

3.1 Implementation Issues

The system has been implemented following a modular architecture consisting of several components:

• User interface. Manages the graphical features of the prototype user interface.

• Project management: Allows the user to define and deal with transcription projects. These projects are stored in XML files containing parameters such as input files to be transcribed, output documents, etc.

• System controller. Manages communication among all the components.

• OpenOffice integration: This subsystem provides an appropriate integration between the CAST tool and the OpenOffice1 software suite. The transcriber therefore has full access to the word processor functionality.

1. www.openoffice.org

• Speech manager: Implements audio playback and synchronization with the ASR outcomes.

• CAST engine: Provides the interactive ASR suggestion mechanism.

This architecture is oriented to be flexible and portable, so that different scenarios, word processor software or ASR engines can be adopted without requiring big changes in the current implementation. Although this initial prototype works as a standalone application, the design followed should allow for a future "in the cloud" tool, where the CAST engine is located on a server and the user can employ a mobile device to carry out the transcription process.

With the purpose of providing real-time system response, CAST is actually performed over a set of word lattices. A lattice, representing a huge set of hypotheses for the current utterance, is initially used to parse the user-validated prefix and then to search for the best completion (suggestion).

3.2 System Interface and Usage

The prototype has been designed to be intuitive for professional speech transcribers and general users; we expect most users to quickly get used to the system without any previous experience or external assistance.

The prototype's features and operation mode are described in the following items:

• The initial screen (Fig. 2) guides the user on how to address a transcription project. Here, the transcriber can select one of the three main tasks that have to be performed to obtain the final result.

• In the project management screen (Fig. 3), the user can interact with the current projects or create a new one. A project is a set of input audio files to be transcribed, along with the partial transcription achieved and some other related parameters.

• Once the current project has been selected, a transcription session is started (Fig. 4). During this session, the application looks like a standard OpenOffice word processor incorporating CAST features. Specifically, the user can perform the following operations:


ITER-0   prefix:     ( )

ITER-1   suffix:     (Nine extra soul are planned half beam discovered these years)
         validated:  (Nine)
         correction: (extrasolar)
         prefix:     (Nine extrasolar)

ITER-2   suffix:     (planets have been discovered these years)
         validated:  (planets have been discovered)
         correction: (this)
         prefix:     (Nine extrasolar planets have been discovered this)

FINAL    suffix:     (year)
         validated:  (#)
         prefix:     (Nine extrasolar planets have been discovered this year)

Figure 1: Example of a CAST session. In each iteration, the system suggests a suffix based on the input utterance and the previous prefix. After this, the user can validate part of the suggestion and type a correction to generate a new prefix that can be used in the next iteration. This process is iterated until the full utterance is transcribed.

The user can move between audio segments by pressing the "fast forward" and "rewind" buttons. Once the segment to be transcribed has been chosen, the "play" button starts the audio replay and transcription. The system produces the text in synchrony with the audio, so that the user can check the proposed transcription in "real time". As soon as a mistake is produced, the transcriber can use the "pause" button to interrupt the process. Then, the error is corrected and, by pressing "play" again, the process continues. At this point, the CAST engine will use the user's amendment to improve the rest of the transcription.

• When all the segments have been transcribed, the final task in the initial screen allows the user to open OpenOffice's PDF export dialog to generate the final document.

A video showing the prototype's operation mode can be found at: www.youtube.com/watch?v=vc6bQCtYVR4.

4 Conclusions and Future Work

In this paper we have presented a CAST system which has been fully implemented and integrated into the OpenOffice word processing software. The implemented techniques have been tested offline and the prototype has been presented to a reduced number of real users.

Preliminary results suggest that the system could be useful for transcribers when high quality transcriptions are needed. It is expected to save effort, increase efficiency and allow inexperienced users to take advantage of ASR systems all along the transcription process. However, these results should be corroborated by performing a formal usability evaluation.

Currently, we are in the process of carrying out a formal usability evaluation with real users, designed following the ISO/IEC 9126-4 (2004) standard according to the efficiency, effectiveness and satisfaction characteristics.

As future work, it will be interesting to consider concurrent collaborative work at both the project and transcription levels. Another promising line is to consider a multimodal user interface in order to allow users to control the playback and transcription features using their own speech. This has been explored in the literature (Rodríguez et al., 2010) and would allow the system to be used on devices with constrained interfaces, such as mobile phones or tablet PCs.

Acknowledgments

Work supported by the EC (ERDF/ESF) and the Spanish government under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018), and the Spanish Junta de Comunidades de Castilla-La Mancha regional government under projects PBI08-0210-7127 and PPII11-0309-6935.


Figure 2: Main window prototype. The three stages of a transcription project are shown.

Figure 3: Screenshot of the project management window showing a loaded project. A project consists of several audio segments, each of them stored in a file so that the user can easily add or remove files when needed. In this screen the user can choose the current working segments.

Figure 4: Screenshot of a transcription session. This shows the process of transcribing one audio segment. In this figure, all the text but the last incomplete sentence has already been transcribed and validated. The last partial sentence, shown in italics, is being produced by the ASR system while the transcriber listens to the audio. As soon as an error is detected the user momentarily interrupts the process to correct the mistake. Then, the system will continue transcribing the audio according to the new user feedback (prefix).


References

ISO/IEC 9126-4. 2004. Software engineering — Product quality — Part 4: Quality in use metrics.

F. Jelinek. 1998. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts, USA.

Luis Rodríguez, Francisco Casacuberta, and Enrique Vidal. 2007. Computer assisted transcription of speech. In Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis, Part I, IbPRIA '07, pages 241–248, Berlin, Heidelberg. Springer-Verlag.

Luis Rodríguez, Ismael García-Varea, and Enrique Vidal. 2010. Multi-modal computer assisted speech transcription. In Proceedings of the 12th International Conference on Multimodal Interfaces and the 7th International Workshop on Machine Learning for Multimodal Interaction, ICMI-MLMI.

A. H. Toselli, E. Vidal, and F. Casacuberta. 2011. Multimodal Interactive Pattern Recognition and Applications. Springer.


A Statistical Spoken Dialogue System using Complex User Goals and Value Directed Compression

Paul A. Crook, Zhuoran Wang, Xingkun Liu and Oliver Lemon
Interaction Lab
School of Mathematical and Computer Sciences (MACS)
Heriot-Watt University, Edinburgh, UK
p.a.crook, zhuoran.wang, x.liu, [email protected]

Abstract

This paper presents the first demonstration of a statistical spoken dialogue system that uses automatic belief compression to reason over complex user goal sets. Reasoning over the power set of possible user goals allows complex sets of user goals to be represented, which leads to more natural dialogues. The use of the power set results in a massive expansion in the number of belief states maintained by the Partially Observable Markov Decision Process (POMDP) spoken dialogue manager. A modified form of Value Directed Compression (VDC) is applied to the POMDP belief states, producing a near-lossless compression which reduces the number of bases required to represent the belief distribution.

1 Introduction

One of the main problems for a spoken dialogue system (SDS) is to determine the user's goal (e.g. plan suitable meeting times or find a good Indian restaurant nearby) under uncertainty, and thereby to compute the optimal next system dialogue action (e.g. offer a restaurant, ask for clarification). Recent research in statistical SDSs has successfully addressed aspects of these problems through the application of Partially Observable Markov Decision Process (POMDP) approaches (Thomson and Young, 2010; Young et al., 2010). However, POMDP SDSs are currently limited by the representation of user goals adopted to make systems computationally tractable.

Work in dialogue system evaluation, e.g. Walker et al. (2004) and Lemon et al. (2006), shows that real user goals are generally sets of items, rather than a single item. People like to explore possible trade-offs between the attributes of items.

Crook and Lemon (2010) identified this as a central challenge for the field of spoken dialogue systems, proposing the use of automatic compression techniques to allow for extended accurate representations of user goals. This paper presents a proof of concept of these ideas in the form of a complete, working spoken dialogue system. The POMDP dialogue manager (DM) of this demonstration system uses a compressed belief space that was generated using a modified version of the Value Directed Compression (VDC) algorithm as originally proposed by Poupart (2005). This demonstration system extends work presented by Crook and Lemon (2011) in that it embeds the compressed complex user goal belief space into a working system and demonstrates planning (and acting) in the compressed space.

2 Complex User Goals

The type of SDS task that we focus on is a limited-domain query-dialogue, also known as a "slot filling" task. The spoken dialogue system has knowledge about some set of objects where these objects have attributes and these attributes can take several values. An object can thus be described by a conjunction of attribute-value pairs. A dialogue progresses with the system obtaining requirements from the user which are specified in terms of attribute values. The system should eventually present objects (search results) based upon its understanding of the user's requirement. The dialogue ends when the user accepts one of the domain objects.

Prior work on POMDP SDSs has assumed that a user has a narrowly constrained goal (as specified in terms of the domain objects) and thus the role of the DM is one of reducing uncertainty until its belief is strongly focused on a particular domain object. This has the unfortunate effect of forcing users to select one domain object in order to progress the dialogue, see Table 1. Note that the example given is problematic not only because the user wants two different food types but because they have different requirements associated with each, i.e. Thai restaurants should be in the centre and cheap, while any French restaurants should be expensive (interpreted as non-budget by the system, since its database only classifies restaurants as budget or non-budget) and can be located anywhere. To our knowledge such a combination of goals with different attribute values cannot be straightforwardly handled by comparable state-of-the-art statistical SDSs which appear in the literature.

User goal: cheap central Thai or expensive French
S: Hello, how can I help you?
U: I'm looking for a cheap central Thai restaurant.
S: You're looking for budget, city centre, Thai restaurants. —
U: Or an expensive French restaurant anywhere.
S: You're looking for non-budget French restaurants. What —
U: And cheap, city centre, Thai.
S: You're looking for budget, city centre, Thai restaurants. What about —
U: (SIGH) And French.
S: You're looking for budget, city centre, French restaurants. What about —
U: (HANG UP)

Table 1: Example dialogue using simple user goals. S=System, U=User. '—' indicates the user barged in.

Crook and Lemon (2011) suggest that rather than the DM assuming that the user has a single narrowly constrained goal in mind, it should assume that they want any possible sub-set of the complete set of domain objects. Thus, instead of maintaining the POMDP belief over individual domain objects, it should be maintained over the power set of domain objects. As an example see Table 3, which is the power set for a domain with three objects that can take two attributes u, v with associated values u1, u2, u3 and v1 respectively.

The power set representation allows the demonstration system to straightforwardly handle previously problematic dialogues, see Table 2. Of course this approach significantly expands the state space of possible user goals, with the number of goal sets being equal to 2^|domain objects|.

User goal: cheap central Thai or expensive French
S: Hello, how can I help you?
U: I'm looking for a cheap central Thai restaurant.
S: You're looking for budget, city centre, Thai restaurants. —
U: Or an expensive French restaurant anywhere.
S: You're looking for budget, city centre, Thai restaurants or non-budget French. What about Chang Thai in Old Town†? Chang Thai's set lunch price is £7.95 —
U: What alternatives do you have?
S: What about the Plumed Horse in Leith‡? The Plumed Horse is a Michelin starred French ...

Table 2: Example dialogue using complex user goals. S=System, U=User. †Old Town is part of the city centre. ‡Leith is outside the centre.
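To make the state construction concrete, the following sketch (our illustration, not the system's code) enumerates the power set of domain objects that serves as the POMDP state space, as in Table 3 further below; the object names are placeholders.

    from itertools import chain, combinations

    def goal_states(objects):
        """All subsets of the domain objects, from the empty set upwards."""
        return list(chain.from_iterable(
            combinations(objects, r) for r in range(len(objects) + 1)))

    states = goal_states(["object1", "object2", "object3"])
    print(len(states))  # 8 = 2**3; with N objects the belief has 2**N states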

2.1 Automatic Compression

Even considering limited domains, POMDP state spaces for SDSs grow very quickly. Thus the current state-of-the-art in POMDP SDSs uses a variety of handcrafted compression techniques, such as making several types of independence assumption as discussed above.

Crook and Lemon (2010) propose replacing handcrafted compressions with automatic compression techniques. The idea is to use principled statistical methods for automatically reducing the dimensionality of belief spaces, but which preserve useful distributions from the full space, and thus can more accurately represent real users' goals.

2.2 VDC Algorithm

The VDC algorithm (Poupart, 2005) uses Krylov iteration to compute a reduced state space. It finds a set of linear basis vectors that can reproduce the value (the sum of discounted future rewards obtained through following some series of actions) of being in any of the original POMDP states. Where a lossless VDC compression is possible, the number of basis vectors is less than the original number of POMDP states.

The intuition here is that if the value of taking an action in a given state has been preserved, then planning is equally as reliable in the compressed space as in the full space.

The VDC algorithm requires a fully specified POMDP, i.e. ⟨S, A, O, T, Ω, R⟩, where S is the set of states, A is the set of actions, O is the set of observations, T the conditional transition probabilities, Ω the conditional observation probabilities, and R the reward function. Since it iteratively projects the rewards associated with each state and action using the state transition and observation probabilities, the compression found is dependent on structures and regularities in the POMDP model.

state  goal set                                      meaning: user's goal is
s1     ∅ (empty set)                                 none of the domain objects
s2     u=u1∧v=v1                                     domain object 1
s3     u=u2∧v=v1                                     domain object 2
s4     u=u3∧v=v1                                     domain object 3
s5     (u=u1∧v=v1) ∨ (u=u2∧v=v1)                     domain objects 1 or 2
s6     (u=u1∧v=v1) ∨ (u=u3∧v=v1)                     domain objects 1 or 3
s7     (u=u2∧v=v1) ∨ (u=u3∧v=v1)                     domain objects 2 or 3
s8     (u=u1∧v=v1) ∨ (u=u2∧v=v1) ∨ (u=u3∧v=v1)       any of the domain objects

Table 3: Example of complex user goal sets.

The set of basis vectors found can be used to project the POMDP reward, transition, and observation probabilities into the reduced state space, allowing the policy to be learnt and executed in this state space.
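As a rough illustration of the idea, the following numpy sketch builds a Krylov-style basis from the reward columns, pushing each accepted direction through the transition matrices and keeping only numerically independent vectors. It deliberately ignores observations and is not the algorithm of Poupart (2005) in full; all names and the tolerance are our assumptions.

    import numpy as np

    def krylov_basis(R, Ts, tol=1e-8):
        """R: |S| x |A| rewards; Ts: list of |S| x |S| transition matrices."""
        basis, frontier = [], [R[:, a] for a in range(R.shape[1])]
        while frontier:
            v = frontier.pop(0)
            if basis:                      # remove the part already representable
                B = np.column_stack(basis)
                v = v - B @ np.linalg.lstsq(B, v, rcond=None)[0]
            if np.linalg.norm(v) > tol:    # genuinely new direction: keep it
                basis.append(v / np.linalg.norm(v))
                frontier.extend(T @ basis[-1] for T in Ts)
        return (np.column_stack(basis) if basis
                else np.zeros((R.shape[0], 0)))   # |S| x k, ideally k << |S|

With such a basis matrix, moving between the full and compressed spaces is a matter of matrix products, which is also what makes the back-projection used for visualisation in Section 3.3 cheap.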

Although the VDC algorithm (Poupart, 2005) produces compressions that are lossless in terms of the states' values, the set of basis vectors found (when viewed as a transformation matrix) can be ill-conditioned. This results in numerical instability and errors in the belief estimation. The compression used in this demonstration was produced using a modified VDC algorithm that improves the matrix condition by approximately selecting the most independent basis vectors, thus improving numerical stability. It achieves near-lossless state value compression while allowing belief estimation errors to be minimised and traded off against the amount of compression. Details of this algorithm are to appear in a forthcoming publication.

3 System Description

3.1 Components

Input and output to the demonstration system use standard open source and commercial components. FreeSWITCH (Minessale II, 2012) provides a platform for accepting incoming Voice over IP calls, routing them (using the Media Resource Control Protocol (MRCP)) to a Nuance 9.0 Automatic Speech Recogniser (Nuance, 2012).

Output is similarly handled by FreeSWITCH routing system responses via a CereProc Text-to-Speech MRCP server (CereProc, 2012) in order to respond to the user.

The heart of the demonstration system consists of a State-Estimator server which estimates the current dialogue state using the compressed state space previously produced by VDC, a Policy-Executor server that selects actions based on the compressed estimated state, and a template-based Natural Language Generator server. These servers, along with FreeSWITCH, use ZeroC's Internet Communications Engine (Ice) middleware (ZeroC, 2012) as a common communications platform.

3.2 SDS Domain

The demonstration system provides a restaurant finder system for the city of Edinburgh (Scotland, UK). It presents search results from a real database of over 600 restaurants. The search results are based on the attributes specified by the user, currently: location, food type and budget/non-budget.

3.3 Interface

The demonstration SDS is typically accessed over the phone network. For debugging and demonstration purposes it is possible to visualise the belief distribution maintained by the DM as dialogues progress. The compressed version of the belief distribution is not a conventional probability distribution (the values associated with the basis vectors are not confined to the range [0, 1]) and its visualisation is uninformative. Instead we take advantage of the reversibility of the VDC compression and project the distribution back onto the full state space. For an example of the evolution of the belief distribution during a dialogue see Figure 1.


[Figure 1: three bar-chart panels (not reproduced here) — (a) Initial uniform distribution over the power set; (b) Distribution after user responds to greet; (c) Distribution after second user utterance.]

Figure 1: Evolution of the belief distribution for the example dialogue in Table 2. The horizontal length of each bar corresponds to the probability of that complex user goal state. Note that the x-axis uses a logarithmic scale to allow low probability values to be seen. The y-axis is the set of complex user goals ordered by probability. Lighter shaded (green) bars indicate complex user goal states corresponding to "cheap, central Thai" and "cheap, central Thai or expensive French anywhere" in figures (b) and (c) respectively. The count '#' indicates the number of states in those groups.

4 Conclusions

We present a demonstration of a statistical SDS that uses automatic belief compression to reason over complex user goal sets. Using the power set of domain objects as the states of the POMDP DM allows complex sets of user goals to be represented, which leads to more natural dialogues. To address the massive expansion in the number of belief states, a modified form of VDC is used to generate a compression. It is this compressed space which is used by the DM for planning and acting in response to user utterances. This is the first demonstration of a statistical SDS that uses automatic belief compression to reason over complex user goal sets.

VDC and other automated compression techniques reduce the human design load by automating part of the current POMDP SDS design process. This reduces the knowledge required when building such statistical systems and should make them easier for industry to deploy.

Such compression approaches are not only applicable to SDSs but should be equally relevant for multi-modal interaction systems where several modalities are being combined in user-goal or state estimation.

5 Future Work

The current demonstration system is a proof of concept and is limited to a small number of attributes and attribute-values. Part of our ongoing work involves investigation of scaling. For example, increasing the number of attribute-values should produce more regularities across the POMDP space. Does VDC successfully exploit these?

We are in the process of collecting corpora for the Edinburgh restaurant domain mentioned above, with the aim that the POMDP observation and transition statistics can be derived from data.

As part of this work we have launched a long term, public facing outlet for testing and data collection, see http://www.edinburghinfo.co.uk. It is planned to make future versions of the demonstration system discussed in this paper available via this public outlet.

Finally, we are investigating the applicability of other automatic belief (and state) compression techniques for SDSs, e.g. E-PCA (Roy and Gordon, 2002).


Acknowledgments

The research leading to these results was funded by the Engineering and Physical Sciences Research Council, UK (EPSRC) under project no. EP/G069840/1 and was partially supported by the EC FP7 projects Spacebook (ref. 270019) and JAMES (ref. 270435).

References

CereProc. 2012. http://www.cereproc.com/.

Paul A. Crook and Oliver Lemon. 2010. Representing uncertainty about complex user goals in statistical dialogue systems. In Proceedings of SIGdial.

Paul A. Crook and Oliver Lemon. 2011. Lossless Value Directed Compression of Complex User Goal States for Statistical Spoken Dialogue Systems. In Proceedings of the Twelfth Annual Conference of the International Speech Communication Association (Interspeech).

Oliver Lemon, Kallirroi Georgila, and James Henderson. 2006. Evaluating Effectiveness and Portability of Reinforcement Learned Dialogue Strategies with real users: the TALK TownInfo Evaluation. In IEEE/ACL Spoken Language Technology.

Anthony Minessale II. 2012. FreeSWITCH. http://www.freeswitch.org/.

Nuance. 2012. Nuance Recognizer. http://www.nuance.com.

P. Poupart. 2005. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Ph.D. thesis, Dept. Computer Science, University of Toronto.

N. Roy and G. Gordon. 2002. Exponential Family PCA for Belief Compression in POMDPs. In NIPS.

B. Thomson and S. Young. 2010. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech and Language, 24(4):562–588.

Marilyn Walker, S. Whittaker, A. Stent, P. Maloor, J. Moore, M. Johnston, and G. Vasireddy. 2004. User tailored generation in the MATCH multimodal dialogue system. Cognitive Science, 28:811–840.

S. Young, M. Gasic, S. Keizer, F. Mairesse, B. Thomson, and K. Yu. 2010. The Hidden Information State model: a practical framework for POMDP based spoken dialogue management. Computer Speech and Language, 24(2):150–174.

ZeroC. 2012. The Internet Communications Engine (Ice). http://www.zeroc.com/ice.html.


Automatically Generated Customizable Online Dictionaries

Enikő Héja
Dept. of Language Technology
Research Institute for Linguistics, HAS
P.O. Box 360, H-1394 Budapest
[email protected]

Dávid Takács
Dept. of Language Technology
Research Institute for Linguistics, HAS
P.O. Box 360, H-1394 Budapest
[email protected]

Abstract

The aim of our software presentation is to demonstrate that corpus-driven bilingual dictionaries generated fully by automatic means are suitable for human use. Previous experiments have proven that bilingual lexicons can be created by applying word alignment on parallel corpora. Such an approach, especially the corpus-driven nature of it, yields several advantages over more traditional approaches. Most importantly, automatically attained translation probabilities are able to guarantee that the most frequently used translations come first within an entry. However, the proposed technique has to face some difficulties, as well. In particular, the scarce availability of parallel texts for medium density languages imposes limitations on the size of the resulting dictionary. Our objective is to design and implement a dictionary building workflow and a query system that is apt to exploit the additional benefits of the method and overcome the disadvantages of it.

1 Introduction

The work presented here is part of the pilot project EFNILEX (financed by EFNIL), launched in 2008. The project objective was to investigate to what extent LT methods are capable of supporting the creation of bilingual dictionaries. Need for such dictionaries shows up specifically in the case of lesser used languages, where it does not pay off for publishers to invest in the production of dictionaries due to the low demand. The targeted size of the dictionaries is between 15,000 and 25,000 entries. Since the completely automatic generation of clean bilingual resources is not possible according to the state of the art, we have decided to provide lexicographers with bilingual resources that can facilitate their work. These kinds of lexical resources will be referred to as proto-dictionaries henceforward.

After investigating some alternative approaches, e.g. the hub-and-spoke model (Martin, 2007) and the alignment of WordNets, we have decided to use word alignment on parallel corpora. Former experiments (Héja, 2010) have proven that word alignment is not only able to help the dictionary creation process itself, but the proposed technique also yields some definite advantages over more traditional approaches. The main motivation behind our choice was that the corpus-driven nature of the method decreases the reliance on human intuition during lexicographic work. Although the careful investigation of large monolingual corpora might have the same effect, being tedious and time-consuming it is not affordable in the case of lesser used languages.

In spite of the fact that word alignment has been widely used for more than a decade within the NLP community to produce bilingual lexicons, e.g. Wu and Xia (1994), and several experts claimed that such resources might also be useful for lexicographic purposes, e.g. Bertels et al. (2009), as far as we know, this technique has not been exploited in large-scale lexicographic projects yet, e.g. Atkins and Rundell (2008).

Earlier experiments have shown that although word alignment has definite advantages over more traditional approaches, there are also some difficulties that have to be dealt with: the method in itself does not handle multi-word expressions, and the proto-dictionaries comprise incorrect translation candidates, as well. In fact, in a given parallel corpus the number of incorrect translation candidates strongly depends on the size of the proto-dictionary, as there is a trade-off between precision and recall.

Accordingly, our objective is to design and implement a dictionary query system that is apt to exploit the benefits of the method and overcome the disadvantages of it. Hopefully, such a system renders the proto-dictionaries helpful not only for lexicographers, but also for ordinary dictionary users.

In Section 2 the basic generation process is introduced along with the difficulties we have to deal with. The various features of the Dictionary Query System are detailed in Section 3. Finally, a conclusion is given and future work is listed in Section 4.

The proto-dictionaries are available at: http://efnilex.efnil.org

2 Generating Proto-Dictionaries – One-Token Translation Pairs

2.1 Input data

Since the amount of available parallel data is crucial for this approach, in the first phase of the project we have experimented with two different language pairs. The Dutch-French language pair represents well-resourced languages while the Hungarian-Lithuanian language pair represents medium density languages. As for the former, we have exploited the French-Dutch parallel corpus which forms a subpart of the Dutch Parallel Corpus (Macken et al., 2007). It consists of 3,606,000 French tokens, 3,215,000 Dutch tokens and 186,945 translation units (TUs; corpus size is given in terms of translation units instead of sentence pairs, because many-to-many alignment was allowed, too). As for Hungarian and Lithuanian, we have built a parallel corpus comprising 4,189,000 Hungarian and 3,544,000 Lithuanian tokens and 262,423 TUs. Because our original intention is to compile dictionaries covering every-day language, we have decided to focus on literature while collecting the texts. However, due to the scarce availability of parallel texts we made some concessions that might be questionable from a translation point of view. First, we did not confine ourselves purely to the literary domain: the parallel corpus also comprises philosophical works. Secondly, instead of focusing on direct translations between Lithuanian and Hungarian we have relied mainly on translations from a third language. Thirdly, we have treated every parallel text alike, regardless of the direction of the translation, although the DPC contains that information.

2.2 The Generation Process

As already mentioned in Section 1, word alignment in itself deals only with one-token units. A detailed description of the generation process of such proto-dictionaries has been given in previous papers, e.g. Héja (2010). In the present paper we confine ourselves to a schematic overview. In the first step the lemmatized versions of each input text have been created by means of morphological analysis and disambiguation (the Lithuanian texts were analysed by the Lithuanian Centre of Computational Linguistics (Zinkevičius et al., 2005); the Hungarian texts were annotated with the tool-chain of the Research Institute for Linguistics, HAS (Oravecz and Dienes, 2002)).

In the second step parallel corpora have been created. We used Hunalign (Varga et al., 2005) for sentence alignment.

In the next step word alignment has been performed with GIZA++ (Och and Ney, 2003). During word alignment GIZA++ builds a dictionary file that stores translation candidates, i.e. source and target language lemmata along with their translation probabilities. We used this dictionary file as the starting point to create the proto-dictionaries.

In the fourth step the proto-dictionaries have been created. Only the most likely translation candidates were kept on the basis of some suitable heuristics, which were developed while evaluating the results manually.

Finally, the relevant example sentences were provided in a concordance to give hints on the use of the translation candidates.
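As a rough illustration of the fourth step, the sketch below filters a lexicon of weighted lemma pairs. The one-triple-per-line file format, the frequency dictionaries and the default thresholds are assumptions made for this example; the actual heuristics are the frequency-dependent ones discussed in Section 2.3.

    def load_candidates(path, src_freq, tgt_freq, min_prob=0.1, min_freq=5):
        """Keep likely translation pairs from a 'source target prob' file."""
        candidates = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                src, tgt, prob = line.split()
                prob = float(prob)
                if (prob >= min_prob and src_freq.get(src, 0) >= min_freq
                        and tgt_freq.get(tgt, 0) >= min_freq):
                    candidates.append((src, tgt, prob))
        # Rank the translations of each source lemma by probability (cf. 3.2).
        candidates.sort(key=lambda c: (c[0], -c[2]))
        return candidates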

2.3 Trade-off between Precision and Recall

At this stage of the workflow some suitable heuristics need to be introduced to find the best translation candidates without the loss of too many correct pairs. Therefore, several evaluations were carried out.



It is important to note that throughout the manual evaluation we have focused on lexicographically useful translation candidates instead of perfect translations. The reason behind this is that translation synonymy is rare in general language, e.g. Atkins and Rundell (2008, p. 467), thus other semantic relations, such as hyponymy or hyperonymy, were also considered. Moreover, since the word alignment method does not handle MWEs in itself, partial matching between SL and TL translation candidates occurs frequently. In either case, the example sentences provided make it possible to find the right translation.

We considered three parameters when searching for the best translations: translational probability, source language lemma frequency and target language lemma frequency (ptr, Fs and Ft, respectively).

The lemma frequency had to be taken into account for at least two reasons. First, a minimal amount of data was necessary for the word alignment algorithm to be able to estimate the translational probability. Secondly, in the case of rarely used TL lemmas the alignment algorithm might assign high translational probabilities to incorrect lemma pairs if the source lemma occurs frequently in the corpus and both members of the lemma pair recurrently show up in aligned units.

Results of the first evaluation showed that translation pairs with relatively low frequency and with a relatively high translational probability yielded ca. 85% lexicographically useful translation pairs. Although the precision was rather convincing, it also turned out that the size of the resulting proto-dictionaries might be a serious bottleneck of the method (Héja, 2010). Whereas the targeted size of the dictionaries is between 15,000 and 25,000 entries, the proto-dictionaries comprised only 5,521 Hungarian-Lithuanian and 7,007 French-Dutch translation candidates with the predefined parameters. Accordingly, the coverage of the proto-dictionaries should be augmented.

According to our hypothesis, in the case of more frequent source lemmata even lower values of translation probability might yield the same result in terms of precision as in the case of lower frequency source lemmata. Hence, different evaluation domains need to be determined as a function of source lemma frequency. That is:

1. The refinement of the parameters yields approximately the same proportion of correct translation candidates as the basic parameter setting,

2. The refinement of the parameters ensures a greater coverage.

Detailed evaluation of the French-Dutch translation candidates confirmed the first part of our hypothesis. We have chosen a parameter setting in accordance with (1) (see Table 1). 6,934 French-Dutch translation candidates met the given conditions. 10% of the relevant pairs were manually evaluated. The results are presented in Table 1. 'OK' denotes the lexicographically useful translation candidates. For instance, the first evaluation range (1st row of Table 1) comprised translation candidates where the source lemma occurs at least 10 times and at most 20 times in the parallel corpus. With these parameters only those pairs were considered where the translation probability was at least 0.4. As the 1st and 2nd rows of Table 1 show, using different ptr values as cut-off parameters gives similar results (87%), if the two source lemma frequencies also differ.

Fs                 ptr        OK
10 ≤ LF ≤ 20       p ≥ 0.4    83%
100 ≤ LF ≤ 200     p ≥ 0.06   87%
500 ≤ LF           p ≥ 0.02   87.5%

Table 1: Evaluation results of the refined French-Dutch proto-dictionary.

The manual evaluation of the Hungarian-Lithuanian translation candidates yielded the same result. We have used this proto-dictionary to confirm the 2nd part of our hypothesis, i.e. that the refinement of these parameters may increase the size of the proto-dictionary. Table 2 presents the results. Expected refers to the expected number of correct translation candidates, estimated on the basis of the evaluation sample. 800 translation candidates were evaluated altogether, 200 from each evaluation domain. As Table 2 shows, it is possible to increase the size of the dictionary through refining the parameters: with fine-tuned parameters the estimated number of useful translation candidates was 13,605 instead of 5,521.


Fs                ptr       OK    Expected
5 ≤ LF < 30       p > 0.3   64%   4,296
30 ≤ LF < 90      p > 0.1   80%   4,144
90 ≤ LF < 300     p > 0.07  89%   3,026
300 ≤ LF          p > 0.04  79%   2,139
Total                             13,605

Table 2: Evaluation results of the refined Hungarian-Lithuanian proto-dictionary.
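The frequency-banded cut-offs of Table 2 amount to a small predicate. In the sketch below the band boundaries are taken directly from the table; the function itself is our illustration, not the project's actual code.

    # Cut-offs from Table 2: rarer source lemmata need a higher probability.
    BANDS = [(5, 30, 0.3), (30, 90, 0.1), (90, 300, 0.07),
             (300, float("inf"), 0.04)]

    def keep_pair(source_freq, trans_prob):
        """True if a candidate survives the banded cut-off of Table 2."""
        for lo, hi, min_prob in BANDS:
            if lo <= source_freq < hi:
                return trans_prob > min_prob
        return False      # source lemma seen fewer than 5 times: discard

    print(keep_pair(120, 0.08), keep_pair(12, 0.08))  # True False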

However, we should keep in mind when searching for the optimal values for these parameters that while we aim at including as many translation candidates as possible, we also expect the generated resource to be as clean as possible. That is, in the case of proto-dictionaries there is a trade-off between precision and recall: the size of the resulting proto-dictionaries can be increased only at the cost of more incorrect translation candidates.

This leads us to the question of what parameter settings are useful for what usage scenarios. We think that the proto-dictionaries generated by this method with various settings match well different user needs. For instance, when the settings are strict, so that the minimal frequencies and probabilities are set high, the dictionary will contain fewer translation pairs, resulting in high precision and relatively low coverage, with only the most frequently used words and their most frequent translations. Such a dictionary is especially useful for a novice language learner. Professional translators are able to judge whether a translation is correct or not. They might be rather interested in special uses of words, lexicographically useful but not perfect translation candidates, and more subtle cross-language semantic relations, while at the same time, looking at the concordance provided along with the translation pairs, they can easily catch wrong translations which are a side-effect of the method. This kind of work may be supported by a proto-dictionary with increased recall, even at the cost of lower precision.

Thus, the Dictionary Query System, described in Section 3 in more detail, should support various user needs.

However, user satisfaction has to be evaluated in order to confirm this hypothesis. It forms part of our future tasks.

Figure 1: The customized dictionary: the distribution of the Lithuanian-Hungarian translation candidates. Logarithmic frequency of the source words on the x-axis, translation probability on the y-axis.

3 Dictionary Query System

As mentioned earlier, the proposed method has several benefits compared to more traditional approaches:

1. A parallel corpus of appropriate size guarantees that the most relevant translations are included in the dictionary.

2. Based on the translational probabilities it is possible to rank translation candidates, ensuring that the most likely used translation variants go first within an entry.

3. All the relevant example sentences from the parallel corpora are easily accessible, facilitating the selection of the most appropriate translations from possible translation candidates.

Accordingly, the Dictionary Query System presents some novel features. On the one hand, users can select the best proto-dictionary for their purposes on the Cut Board Page. On the other hand, the innovative representation of the generated bilingual information helps to find the best translation for a specific user in the Dictionary Browser Window.

3.1 Customizable proto-dictionaries: the Cut Board Page

The dictionary can be customized on the Cut Board Page. Two different charts are displayed here showing the distribution of all word pairs of the selected proto-dictionary.

Figure 2: The customized dictionary: the distribution of the candidates. Logarithmic frequency ratio of the source and target words on the x-axis, translation probability on the y-axis.

1. Plot 1 visualizes the distribution of the logarithmic frequency of the source words and the relevant translation probability for each word pair, selected by the given custom criteria.

2. Plot 2 visualizes the distribution of the logarithmic frequency ratio of the target and source words and the corresponding translation probability for each word pair, selected by the given custom criteria.

Proto-dictionaries are customizable by the following criteria:

1. Maximum and minimum ratio of the relative frequencies of the source and target words (left and right boundary on Plot 1).

2. Overall minimum frequency of either the source and the target words (left boundary on Plot 2).

3. Overall minimum translation probability (bottom boundary on both plots).

4. Several more cut-off intervals can be defined in the space represented by Plot 2: word pairs falling in rectangles given by their left, right and top boundaries are cut off.

After submitting the given parameters the charts are refreshed, giving feedback to the user, and the parameters are stored for the session, i.e. the dictionary page shows only word pairs fitting the selected criteria. A sketch of such a filter predicate is given below.
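This is a minimal sketch of the four criteria above, assuming each word pair carries its source frequency, target frequency and translation probability; the rectangle representation and all names are ours, not the system's actual API.

    import math

    def passes(pair, min_ratio, max_ratio, min_freq, min_prob, rects=()):
        """Check one word pair against the four Cut Board criteria."""
        ratio = pair["tgt_freq"] / pair["src_freq"]
        if not (min_ratio <= ratio <= max_ratio):               # criterion 1
            return False
        if min(pair["src_freq"], pair["tgt_freq"]) < min_freq:  # criterion 2
            return False
        if pair["prob"] < min_prob:                             # criterion 3
            return False
        x = math.log(ratio)                                     # Plot 2 x-axis
        return not any(left <= x <= right and pair["prob"] <= top
                       for left, right, top in rects)           # criterion 4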

3.2 Dictionary Browser

The Dictionary Browser displays four different types of information.

1. List of the translation candidates ranked by their translation probabilities. This guarantees that most often used translations come first in the list (from top to bottom). Absolute corpus frequencies are also displayed.

2. A plot displaying the distribution of the possible translations of the source word according to translation probability and the ratio of corpus frequency between the source word and the corresponding translation candidate.

3. Word cloud reflecting semantic relations between source and target lemmata. Words in the word cloud vary in two ways (a sketch of this mapping is given after the list).

First, their size depends on their translation probabilities: the higher the probability of the target word, the bigger the font size is.

Secondly, colours are assigned to target words according to their frequency ratios relative to the source word: less frequent target words are cool-coloured (dark blue and light blue) while more frequent target words are warm-coloured (red, orange). Target words with a frequency close to that of the source word get a gray colour.

4. Provided example sentences with the source and target words highlighted, displayed by clicking one of the translation candidates.
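The size and colour rules of point 3 can be pinned down in a few lines. The concrete thresholds and the size formula below are invented for the illustration; only the direction of the mapping follows the description above.

    import math

    def cloud_style(trans_prob, tgt_freq, src_freq):
        """Font size from probability, colour from the frequency ratio."""
        size = 10 + int(30 * trans_prob)          # more probable = bigger
        log_ratio = math.log10(tgt_freq / src_freq)
        if log_ratio < -0.5:                      # rarer than the source: cool
            colour = "dark blue" if log_ratio < -1.0 else "light blue"
        elif log_ratio > 0.5:                     # more frequent: warm
            colour = "red" if log_ratio > 1.0 else "orange"
        else:
            colour = "gray"                       # similar frequency
        return size, colour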

According to our hypothesis, the frequency ratios provide the user with hints about the semantic relations between source and target words, which might be particularly important when creating texts in a foreign language. For instance, the Lithuanian lemma karieta has four Hungarian equivalents: "kocsi" (word with general meaning, e.g. 'car', 'railway wagon', 'horse-drawn vehicle'), "hintó" ('carriage'), "konflis" ('a horse-drawn vehicle for public hire'), "jármű" ('vehicle'). The various colours of the candidates indicate different semantic relations: the red colour of "kocsi" marks that the meaning of the target word is more general than that of the source word. Conversely, the dark blue colour of "konflis" shows that the meaning of the target word is more special. However, this hypothesis should be tested in the future, which makes part of our future work.

Figure 3: The Dictionary Browser

3.3 Implementation

The online research tool is based on the LAMP web architecture. We use a relational database to store all the data: the multilingual corpus text, sentences and their translations, the word forms and lemmata and all the relations between them. The implementation of such a data structure and the formulation of the queries is straightforward and efficient. The data displayed in the dictionary browser as well as the distributional dataset presented on the charts is selected on-the-fly. The size of the database is log-linear with the size of the corpus and the dictionary.

4 Conclusions and Future Work

Previous experiments have proven that corpus-driven bilingual resources generated fully by automatic means are apt to facilitate lexicographic work when compiling bilingual dictionaries.

We think that the proto-dictionaries generated by this technique with various settings match well different user needs, and consequently, besides lexicographers, they might also be useful for end users, both for language learners and for professional translators. A possible future work is to further evaluate the dictionaries in real world use cases.

Some new assumptions can be formulated which connect the statistical properties of the translation pairs, e.g. their frequency ratios, and the cross-language semantic relations between them. Based on the generated dictionaries such hypotheses may be further examined in the future.

In order to demonstrate the generated proto-dictionaries, we have designed and implemented an online dictionary query system, which exploits the advantages of the data-driven nature of the applied technique. It provides different visualizations of the possible translations based on their translation probabilities and frequencies, along with their relevant contexts in the corpus. By presetting different selection criteria the contents of the dictionaries are customizable to suit various usage scenarios.

The dictionaries are publicly available at http://efnilex.efnil.org.


References

Beryl T. Sue Atkins and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. OUP Oxford.

Ann Bertels, Cédrick Fairon, Jörg Tiedemann, and Serge Verlinde. 2009. Corpus parallèles et corpus ciblés au secours du dictionnaire de traduction. In Cahiers de lexicologie, number 94 in Revues, pages 199–219. Classiques Garnier.

Enikő Héja. 2010. The role of parallel corpora in bilingual lexicography. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Lieve Macken, Julia Trushkina, Hans Paulussen, Lidia Rura, Piet Desmet, and Willy Vandeweghe. 2007. Dutch parallel corpus: a multilingual annotated corpus. In Proceedings of Corpus Linguistics 2007.

Willy Martin. 2007. Government policy and the planning and production of bilingual dictionaries: The Dutch approach as a case in point. International Journal of Lexicography, 20(3):221–237.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Csaba Oravecz and Péter Dienes. 2002. Efficient stochastic part-of-speech tagging for Hungarian. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 710–717, Las Palmas.

Dániel Varga, László Németh, Péter Halácsy, András Kornai, Viktor Trón, and Viktor Nagy. 2005. Parallel corpora for medium density languages. In Recent Advances in Natural Language Processing (RANLP 2005), pages 590–596.

Dekai Wu and Xuanyin Xia. 1994. Learning an English-Chinese lexicon from a parallel corpus. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, pages 206–213.

Vytautas Zinkevičius, Vidas Daudaravičius, and Erika Rimkutė. 2005. The Morphologically annotated Lithuanian Corpus. In Proceedings of The Second Baltic Conference on Human Language Technologies, pages 365–370.


MaltOptimizer: An Optimization Tool for MaltParser

Miguel Ballesteros
Complutense University of Madrid
[email protected]

Joakim Nivre
Uppsala University
[email protected]

Abstract

Data-driven systems for natural language processing have the advantage that they can easily be ported to any language or domain for which appropriate training data can be found. However, many data-driven systems require careful tuning in order to achieve optimal performance, which may require specialized knowledge of the system. We present MaltOptimizer, a tool developed to facilitate optimization of parsers developed using MaltParser, a data-driven dependency parser generator. MaltOptimizer performs an analysis of the training data and guides the user through a three-phase optimization process, but it can also be used to perform completely automatic optimization. Experiments show that MaltOptimizer can improve parsing accuracy by up to 9 percent absolute (labeled attachment score) compared to default settings. During the demo session, we will run MaltOptimizer on different data sets (user-supplied if possible) and show how the user can interact with the system and track the improvement in parsing accuracy.

1 Introduction

In building NLP applications for new languages and domains, we often want to reuse components for tasks like part-of-speech tagging, syntactic parsing, word sense disambiguation and semantic role labeling. From this perspective, components that rely on machine learning have an advantage, since they can be quickly adapted to new settings provided that we can find suitable training data. However, such components may require careful feature selection and parameter tuning in order to give optimal performance, a task that can be difficult for application developers without specialized knowledge of each component.

A typical example is MaltParser (Nivre et al., 2006), a widely used transition-based dependency parser with state-of-the-art performance for many languages, as demonstrated in the CoNLL shared tasks on multilingual dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007). MaltParser is an open-source system that offers a wide range of parameters for optimization. It implements nine different transition-based parsing algorithms, each with its own specific parameters, and it has an expressive specification language that allows the user to define arbitrarily complex feature models. Finally, any combination of parsing algorithm and feature model can be combined with a number of different machine learning algorithms available in LIBSVM (Chang and Lin, 2001) and LIBLINEAR (Fan et al., 2008). Just running the system with default settings when training a new parser is therefore very likely to result in suboptimal performance. However, selecting the best combination of parameters is a complicated task that requires knowledge of the system as well as knowledge of the characteristics of the training data.

This is why we present MaltOptimizer, a tool for optimizing MaltParser for a new language or domain, based on an analysis of the training data. The optimization is performed in three phases: data analysis, parsing algorithm selection, and feature selection. The tool can be run in "batch mode" to perform completely automatic optimization, but it is also possible for the user to manually tune parameters after each of the three phases. In this way, we hope to cater for users without specific knowledge of MaltParser, who can use the tool for black box optimization, as well as expert users, who can use it interactively to speed up optimization. Experiments on a number of data sets show that using MaltOptimizer for completely automatic optimization gives consistent and often substantial improvements over the default settings for MaltParser.

The importance of feature selection and parameter optimization has been demonstrated for many NLP tasks (Kool et al., 2000; Daelemans et al., 2003), and there are general optimization tools for machine learning, such as Paramsearch (Van den Bosch, 2004). In addition, Nilsson and Nugues (2010) have explored automatic feature selection specifically for MaltParser, but MaltOptimizer is the first system that implements a complete customized optimization process for this system.

In the rest of the paper, we describe the optimization process implemented in MaltOptimizer (Section 2), report experiments (Section 3), outline the demonstration (Section 4), and conclude (Section 5). A more detailed description of MaltOptimizer with additional experimental results can be found in Ballesteros and Nivre (2012).

2 The MaltOptimizer System

MaltOptimizer is written in Java and implements an optimization procedure for MaltParser based on the heuristics described in Nivre and Hall (2010). The system takes as input a training set, consisting of sentences annotated with dependency trees in CoNLL data format (http://ilk.uvt.nl/conll/#dataformat), and outputs an optimized MaltParser configuration together with an estimate of the final parsing accuracy. The evaluation metric that is used for optimization by default is the labeled attachment score (LAS) excluding punctuation, that is, the percentage of non-punctuation tokens that are assigned the correct head and the correct label (Buchholz and Marsi, 2006), but other options are available. For efficiency reasons, MaltOptimizer only explores linear multiclass SVMs in LIBLINEAR.
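For concreteness, here is a minimal sketch of the default metric (our simplification, not MaltOptimizer's code); in particular, the punctuation test is a stand-in for the real tokenization-dependent check.

    def las(tokens):
        """Labeled attachment score, excluding punctuation tokens.

        tokens: (form, gold_head, gold_label, pred_head, pred_label) tuples.
        """
        scored = [t for t in tokens if any(ch.isalnum() for ch in t[0])]
        correct = sum(1 for form, gh, gl, ph, pl in scored
                      if gh == ph and gl == pl)
        return 100.0 * correct / max(len(scored), 1)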

2.1 Phase 1: Data Analysis

After validating that the data is in valid CoNLL format, using the official validation script from the CoNLL-X shared task (http://ilk.uvt.nl/conll/software.html#validate), the system checks the minimum Java heap space needed given the size of the data set. If there is not enough memory available on the current machine, the system informs the user and automatically reduces the size of the data set to a feasible subset. After these initial checks, MaltOptimizer checks the following characteristics of the data set:

1. Number of words/sentences.

2. Existence of "covered roots" (arcs spanning tokens with HEAD = 0).

3. Frequency of labels used for tokens with HEAD = 0.

4. Percentage of non-projective arcs/trees.

5. Existence of non-empty feature values in the LEMMA and FEATS columns.

6. Identity (or not) of feature values in the CPOSTAG and POSTAG columns.

Items 1–3 are used to set basic parameters in the rest of phase 1 (see below); 4 is used in the choice of parsing algorithm (phase 2); 5 and 6 are relevant for feature selection experiments (phase 3).
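Check 4, for example, can be computed with the standard crossing-arcs criterion. The sketch below is our simplification (root attachments are ignored) and assumes a sentence given as (id, head) pairs taken from the ID and HEAD columns of the CoNLL format.

    def nonprojective_arc_percentage(sentence):
        """sentence: list of (token_id, head_id), 1-based ids, head 0 = root."""
        arcs = [(min(d, h), max(d, h)) for d, h in sentence if h != 0]
        crossing = set()
        for i, (l1, r1) in enumerate(arcs):
            for l2, r2 in arcs[i + 1:]:
                if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:  # spans cross
                    crossing.update([(l1, r1), (l2, r2)])
        return 100.0 * len(crossing) / max(len(arcs), 1)

    # Example: heads 3, 4, 0, 3 give the crossing arcs (1,3) and (2,4),
    # i.e. two of the three non-root arcs are non-projective.
    print(nonprojective_arc_percentage([(1, 3), (2, 4), (3, 0), (4, 3)]))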

If there are covered roots, the system checks whether accuracy is improved by reattaching such roots in order to eliminate spurious non-projectivity. If there are multiple labels for tokens with HEAD = 0, the system tests which label is best to use as default for fragmented parses.

Given the size of the data set, the system suggests different validation strategies during phase 1. If the data set is small, it recommends using 5-fold cross-validation during subsequent optimization phases. If the data set is larger, it recommends using a single development set instead. But the user can override either recommendation and select either validation method manually.

When these checks are completed, MaltOptimizer creates a baseline option file to be used as the starting point for further optimization. The user is given the opportunity to edit this option file and may also choose to stop the process and continue with manual optimization.

2.2 Phase 2: Parsing Algorithm Selection

MaltParser implements three groups of transition-based parsing algorithms (recent versions of MaltParser contain additional algorithms that are currently not handled by MaltOptimizer): (i) Nivre's algorithms (Nivre, 2003; Nivre, 2008), (ii) Covington's algorithms (Covington, 2001; Nivre, 2008), and (iii) Stack algorithms (Nivre, 2009; Nivre et al., 2009). Both the Covington group and the Stack group contain algorithms that can handle non-projective dependency trees, and any projective algorithm can be combined with pseudo-projective parsing to recover non-projective dependencies in post-processing (Nivre and Nilsson, 2005).

Figure 1: Decision tree for best projective algorithm.

Figure 2: Decision tree for best non-projective algorithm (+PP for pseudo-projective parsing).

In phase 2, MaltOptimizer explores the parsing algorithms implemented in MaltParser, based on the data characteristics inferred in the first phase. In particular, if there are no non-projective dependencies in the training set, then only projective algorithms are explored, including the arc-eager and arc-standard versions of Nivre's algorithm, the projective version of Covington's algorithm and the projective Stack algorithm. The system follows a decision tree considering the characteristics of each algorithm, which is shown in Figure 1.

On the other hand, if the training set contains a substantial amount of non-projective dependencies, MaltOptimizer instead tests the non-projective versions of Covington's algorithm and the Stack algorithm (including a lazy and an eager variant), and projective algorithms in combination with pseudo-projective parsing. The system then follows the decision tree shown in Figure 2.

If the number of trees containing non-projective arcs is small but not zero, the system tests both projective algorithms and non-projective algorithms, following the decision trees in Figure 1 and Figure 2 and picking the algorithm that gives the best results after traversing both.

Once the system has finished testing each of the algorithms with default settings, MaltOptimizer tunes some specific parameters of the best performing algorithm and creates a new option file for the best configuration so far. The user is again given the opportunity to edit the option file (or stop the process) before optimization continues.

2.3 Phase 3: Feature Selection

In the third phase, MaltOptimizer tunes the feature model given all the parameters chosen so far (especially the parsing algorithm). It starts with backward selection experiments to ensure that all features in the default model for the given parsing algorithm are actually useful. In this phase, features are omitted as long as their removal does not decrease parsing accuracy. The system then proceeds with forward selection experiments, trying potentially useful features one by one. In this phase, a threshold of 0.05% is used to determine whether an improvement in parsing accuracy is sufficient for the feature to be added to the model. Since an exhaustive search for the best possible feature model is impossible, the system relies on a greedy optimization strategy using heuristics derived from proven experience (Nivre and Hall, 2010). The major steps of the forward selection experiments are the following (for an explanation of the different feature columns such as POSTAG, FORM, etc., see Buchholz and Marsi (2006) or http://ilk.uvt.nl/conll/#dataformat):

1. Tune the window of POSTAG n-grams over the parser state.

2. Tune the window of FORM features over the parser state.

3. Tune DEPREL and POSTAG features over the partially built dependency tree.

4. Add POSTAG and FORM features over the input string.

5. Add CPOSTAG, FEATS, and LEMMA features if available.

6. Add conjunctions of POSTAG and FORM features.

These six steps are slightly different depending on which algorithm has been selected as the best in phase 2, because the algorithms have different parsing orders and use different data structures, but the steps are roughly equivalent at a certain level of abstraction. After the feature selection experiments are completed, MaltOptimizer tunes the cost parameter of the linear SVM using a simple stepwise search. Finally, it creates a complete configuration file that can be used to train MaltParser on the entire data set. The user may now continue to do further optimization manually.

Language     Default  Phase 1  Phase 2  Phase 3  Diff
Arabic       63.02    63.03    63.84    65.56    2.54
Bulgarian    83.19    83.19    84.00    86.03    2.84
Chinese      84.14    84.14    84.95    84.95    0.81
Czech        69.94    70.14    72.44    78.04    8.10
Danish       81.01    81.01    81.34    83.86    2.85
Dutch        74.77    74.77    78.02    82.63    7.86
German       82.36    82.36    83.56    85.91    3.55
Japanese     89.70    89.70    90.92    90.92    1.22
Portuguese   84.11    84.31    84.75    86.52    2.41
Slovene      66.08    66.52    68.40    71.71    5.63
Spanish      76.45    76.45    76.64    79.38    2.93
Swedish      83.34    83.34    83.50    84.09    0.75
Turkish      57.79    57.79    58.29    66.92    9.13

Table 1: Labeled attachment score per phase and with comparison to default settings for the 13 training sets from the CoNLL-X shared task (Buchholz and Marsi, 2006).
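Schematically, the forward part of this search is a greedy loop (our sketch, not MaltOptimizer's actual code); train_and_score is a hypothetical callback that trains MaltParser with a candidate feature model and returns the validation LAS.

    THRESHOLD = 0.05   # minimum LAS gain (percentage points) to keep a feature

    def forward_selection(base_model, candidates, train_and_score):
        """Greedy phase 3 search: add a feature only if it pays for itself."""
        model, best = list(base_model), train_and_score(base_model)
        for feature in candidates:
            score = train_and_score(model + [feature])
            if score - best >= THRESHOLD:   # enough gain: keep the feature
                model, best = model + [feature], score
        return model, best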

3 Experiments

In order to assess the usefulness and validity of the optimization procedure, we have run all three phases of the optimization on all 13 data sets from the CoNLL-X shared task on multilingual dependency parsing (Buchholz and Marsi, 2006). Table 1 shows the labeled attachment scores with default settings and after each of the three optimization phases, as well as the difference between the final configuration and the default. (Note that these results are obtained using 80% of the training set for training and 20% as a development test set, which means that they are not comparable to the test results from the original shared task, which were obtained using the entire training set for training and a separate held-out test set for evaluation.)

The first thing to note is that the optimization improves parsing accuracy for all languages without exception, although the amount of improvement varies considerably from about 1 percentage point for Chinese, Japanese and Swedish to 8–9 points for Dutch, Czech and Turkish. For most languages, the greatest improvement comes from feature selection in phase 3, but we also see significant improvement from phase 2 for languages with a substantial amount of non-projective dependencies, such as Czech, Dutch and Slovene, where the selection of parsing algorithm can be very important. The time needed to run the optimization varies from about half an hour for the smaller data sets to about one day for very large data sets like the one for Czech.

4 System Demonstration

In the demonstration, we will run MaltOptimizer on different data sets and show how the user can interact with the system while keeping track of improvements in parsing accuracy. We will also explain how to interpret the output of the system, including the final feature specification model, for users that are not familiar with MaltParser. By restricting the size of the input data set, we can complete the whole optimization procedure in 10–15 minutes, so we expect to be able to complete a number of cycles with different members of the audience. We will also let the audience contribute their own data sets for optimization, provided that they are in CoNLL format.6

5 Conclusion

MaltOptimizer is an optimization tool for MaltParser, which is primarily aimed at application developers who wish to adapt the system to a new language or domain and who do not have expert knowledge about transition-based dependency parsing. Another potential user group consists of researchers who want to perform comparative parser evaluation, where MaltParser is often used as a baseline system and where the use of suboptimal parameter settings may undermine the validity of the evaluation. Finally, we believe the system can also be useful for expert users of MaltParser as a way of speeding up optimization.

Acknowledgments

The first author is funded by the Spanish Ministry of Education and Science (TIN2009-14659-C03-01 Project), Universidad Complutense de Madrid and Banco Santander Central Hispano (GR58/08 Research Group Grant). He is supported by the NIL Research Group (http://nil.fdi.ucm.es) at the same university.

6 The system is available for download under an open-source license at http://nil.fdi.ucm.es/maltoptimizer


References

Miguel Ballesteros and Joakim Nivre. 2012. MaltOptimizer: A System for MaltParser Optimization. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC).

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 149–164.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Michael A. Covington. 2001. A fundamental algorithm for dependency parsing. In Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Walter Daelemans, Veronique Hoste, Fien De Meulder, and Bart Naudts. 2003. Combined optimization of feature selection and algorithm parameters in machine learning of language. In Nada Lavrac, Dragan Gamberger, Hendrik Blockeel, and Ljupco Todorovski, editors, Machine Learning: ECML 2003, volume 2837 of Lecture Notes in Computer Science. Springer.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Anne Kool, Jakub Zavrel, and Walter Daelemans. 2000. Simultaneous feature selection and parameter optimization for memory-based natural language processing. In A. Feelders, editor, BENELEARN 2000: Proceedings of the Tenth Belgian-Dutch Conference on Machine Learning, pages 93–100. Tilburg University, Tilburg.

Peter Nilsson and Pierre Nugues. 2010. Automatic discovery of feature sets for dependency parsing. In COLING, pages 824–832.

Joakim Nivre and Johan Hall. 2010. A quick guide to MaltParser optimization. Technical report, maltparser.org.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99–106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task of EMNLP-CoNLL 2007, pages 915–932.

Joakim Nivre, Marco Kuhlmann, and Johan Hall. 2009. An improved oracle for dependency parsing with online reordering. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 73–76.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34:513–553.

Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP), pages 351–359.

Antal Van den Bosch. 2004. Wrapped progressive sampling search for optimizing learning algorithm parameters. In Proceedings of the 16th Belgian-Dutch Conference on Artificial Intelligence.


Fluid Construction Grammar: The New Kid on the Block

Remi van Trijp1, Luc Steels1,2, Katrien Beuls3, Pieter Wellens3

1 Sony Computer Science Laboratory Paris, 6 Rue Amyot, 75005 Paris (France), [email protected]
2 ICREA Institute for Evolutionary Biology (UPF-CSIC), PRBB, Dr Aiguader 88, 08003 Barcelona (Spain), [email protected]
3 VUB AI Lab, Pleinlaan 2, 1050 Brussels (Belgium), katrien|[email protected]

Abstract

Cognitive linguistics has reached a stage of maturity where many researchers are looking for an explicit formal grounding of their work. Unfortunately, most current models of deep language processing incorporate assumptions from generative grammar that are at odds with the cognitive movement in linguistics. This demonstration shows how Fluid Construction Grammar (FCG), a fully operational and bidirectional unification-based grammar formalism, caters for this increasing demand. FCG features many of the tools that were pioneered in computational linguistics in the 70s-90s, but combines them in an innovative way. This demonstration highlights the main differences between FCG and related formalisms.

1 Introduction

The “cognitive linguistics enterprise” (Evans et al., 2007) is a rapidly expanding research discipline that has so far avoided rigorous formalizations. This choice was wholly justified in the 70s-90s when the foundations of this scientific movement were laid (Rosch, 1975; Lakoff, 1987; Langacker, 1987), and it remained so during the past two decades while the enterprise worked on getting its facts straight through empirical studies in various subfields such as language acquisition (Tomasello, 2003; Goldberg et al., 2004; Lieven, 2009), language change and grammaticalization (Heine et al., 1991; Barðdal and Chelliah, 2009), and corpus research (Boas, 2003; Stefanowitsch and Gries, 2003). However, with numerous textbooks on the market (Lee, 2001; Croft and Cruse, 2004; Evans and Green, 2006), cognitive linguistics has by now established itself as a serious branch in the study of language, and many cognitive linguists are looking for ways of explicitly formalizing their work through computational models (McClelland, 2009).

Unfortunately, it turns out to be very difficult to adequately formalize a cognitive linguistic approach to grammar (or “construction grammar”) using the tools for precision grammars developed in the 70s-90s, such as unification (Kay, 1979; Carpenter, 1992), because these tools are typically incorporated in a generative grammar (such as HPSG; Ginzburg and Sag, 2000) whose assumptions are incompatible with the foundations of construction grammar. First, cognitive linguistics blurs the distinction between ‘competence’ and ‘performance’, which means giving up the sharp distinction between declarative and procedural representations. Next, construction grammarians argue for a usage-based approach (Langacker, 2000), so the constraints on features may change, and features may emerge or disappear from a grammar at any given time.

This demonstration introduces Fluid Construction Grammar (FCG; Steels, 2011, 2012a), a novel unification-based grammar formalism that addresses these issues, and which is available as open-source software at www.fcg-net.org. After more than a decade of development, FCG is now ready to handle sophisticated linguistic issues. FCG revisits many of the technologies developed by computational linguists and introduces several key innovations that are of interest to anyone working on deep language processing. The demonstration illustrates these innovations through FCG's interactive web interface.



Figure 1: FCG allows the implementation of efficient and strongly reversible grammars. Left: In production, conditional units of the semantic pole of a construction are matched against a transient structure, before additional semantic constraints and the syntactic pole are merged with the structure. Right: In parsing, the same algorithm applies but in the opposite direction.

2 Strong and Efficient Reversibility

Reversible or bidirectional grammar formalisms can achieve both production and parsing (Strzalkowski, 1994). Several platforms, such as the LKB (Copestake, 2002), already achieve bidirectionality, but they do so through separate algorithms for parsing and production (mainly for efficiency reasons). One problem with this approach is that there may be a loss of coherence in grammar engineering. For instance, the LKB parser can handle a wider variety of structures than its generator.

FCG uses one core engine that handles both parsing and production with a single linguistic inventory (see Figure 1). When processing, the FCG-system builds a transient structure that contains all the information concerning the utterance that the system has to parse or produce, divided into a semantic and a syntactic pole (both of which are feature structures). Grammar rules or “constructions” are coupled feature structures as well and thus also contain a semantic and a syntactic pole.

When applying constructions, the FCG-system goes through three phases. In production, FCG first matches all feature-value pairs of the semantic pole of a construction with the semantic pole of the transient structure, except fv-pairs that are marked as being attributed by the construction (De Beule and Steels, 2005). Matching is a stricter form of unification that resembles a subsumption test (see Steels and De Beule, 2006). If matching is successful, all the marked fv-pairs of the semantic pole are merged with the transient structure in a first merge phase, after which the whole syntactic pole is merged in a second phase. FCG-merge is equivalent to “unification” in other formalisms. The same three-phase algorithm is applied in parsing as well, but this time in the opposite direction: if the syntactic pole of the construction matches with the transient structure, the attributable syntactic fv-pairs and the semantic pole are merged.
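
As a rough illustration of this three-phase application in production, consider the following toy Python rendering (not FCG's actual implementation; feature structures are flattened to plain dictionaries for brevity):

    def matches(conditional, pole):
        # Subsumption-style matching: every conditional fv-pair must already
        # be present in the transient structure's pole.
        return all(pole.get(f) == v for f, v in conditional.items())

    def merge(fv_pairs, pole):
        # Unification-style merging: add fv-pairs, failing on a conflict.
        for f, v in fv_pairs.items():
            if f in pole and pole[f] != v:
                return None
            pole[f] = v
        return pole

    def produce(construction, transient):
        sem, syn = transient["sem"], transient["syn"]
        if not matches(construction["sem-conditional"], sem):    # phase 1: matching
            return None
        if merge(construction["sem-attributed"], sem) is None:   # phase 2: first merge
            return None
        if merge(construction["syn"], syn) is None:              # phase 3: second merge
            return None
        return transient

In parsing, the same code would be run with the roles of the two poles swapped.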

3 WYSIWYG Grammar Engineering

Most unification grammars use non-directional linguistic representations that are designed to be independent of any model of processing (Sag and Wasow, 2011). Whereas this may be desirable from a ‘mathematical’ point of view, it puts the burden of efficient processing on the shoulders of computational linguists, who have to find a balance between faithfulness to the hand-written theory and computational efficiency (Melnik, 2005). For instance, there is no HPSG implementation, but rather several platforms that support the implementation of ‘HPSG-like’ grammars: ALE (Carpenter and Penn, 1995), ALEP (Schmidt et al., 1996), CUF (Dörre and Dorna, 1993), LIGHT (Ciortuz, 2002), LKB (Copestake, 2002), ProFIT (Erbach, 1995), TDL (Krieger and Schäfer, 1994), TFS (Emele, 1994), and others (see Bolc et al., 1996, for a survey). Unfortunately, the optimizations and technologies developed within these platforms are often considered by theoretical linguists as engineering solutions rather than scientific contributions.

Figure 2: FCG comes equipped with an interactive web interface for inspecting the linguistic inventory, construction application and search. This figure shows an example construction where two units are opened up for closer inspection of their feature structures.

FCG, on the other hand, adheres to the cognitive linguistics assumption that linguistic performance is equally important as linguistic competence, hence processing becomes a central notion in the formalism. FCG representations therefore offer a ‘what you see is what you get’ approach to grammar engineering, where the representations have a direct impact on processing and vice versa. For instance, a construction's division between a semantic and a syntactic pole is informative with respect to how the construction is applied.

Some grammarians may object that this design choice forces linguists to worry about processing, but that is entirely the point. It has already been demonstrated in other unification-based formalisms that different grammar representations have a significant impact on processing efficiency (Flickinger, 2000). Moreover, FCG-style representations can be directly implemented and tested without having to compromise on either faithfulness to a theory or computational efficiency.

Since writing grammars is highly complex, however, FCG also features a ‘design level’ on top of its operational level (Steels, 2012b). On this level, grammar engineers can use templates that build detailed constructions. The demonstration shows how to write a grammar in FCG, switching between its design level, its operational level and its interactive web interface (see Figure 2). The web interface allows FCG-users to inspect the linguistic inventory, the search tree in processing, and so on.

4 Robustness and Learning

Unification-based grammars have the reputation of being brittle when it comes to processing novelty or ungrammatical utterances (Tomuro, 1999). Since cognitive linguistics adheres to a usage-based view on language (Langacker, 2000), however, an adequate formalization must be robust and open-ended.

A first requirement is that there can be different degrees of ‘entrenchment’ in the grammar: while some features might still be emergent, others are already part of well-conventionalized linguistic patterns. Moreover, new features and constructions may appear in (or disappear from) a grammar at any given time. These requirements are hard to reconcile with the type hierarchy approach of other formalisms, so FCG does not implement typed feature structures. The demonstration shows how FCG can nevertheless prevent over-licensing of linguistic structures through its matching phase and how it captures generalizations through its templates – two benefits typically associated with type hierarchies.

Secondly, FCG renders linguistic processing fluid and robust through a meta-level architecture, which consists of two layers of processing, as shown in Figure 3 (Beuls et al., 2012). There is a routine layer in which constructional processing takes place. At the same time, a meta-layer is active that runs diagnostics for detecting problems in routine processing, and repairs for solving those problems. The demonstration shows how the meta-layer is used for solving common problems such as missing lexical entries and coercion (Steels and van Trijp, 2011), and how its architecture offers a uniform way of implementing the various solutions for robustness already pioneered in the aforementioned grammar platforms.

Figure 3: There are two layers of processing in FCG. On the routine level, constructional processing takes place. At the same time, a meta-layer of diagnostics and repairs try to detect and solve problems that occur in the routine layer.
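
A self-contained toy rendering of this two-layer loop is sketched below; all names and the repair strategy (adding a stub lexical entry for an unknown word) are illustrative, not FCG's:

    LEXICON = {"ballon": "ball", "rouge": "red"}

    def routine_parse(words):
        # Routine layer: succeeds only if every word is covered by the lexicon.
        unknown = [w for w in words if w not in LEXICON]
        if unknown:
            return None, unknown[0]          # report a problem to the meta-layer
        return [LEXICON[w] for w in words], None

    def parse_with_repairs(words):
        while True:
            meaning, problem = routine_parse(words)
            if problem is None:
                return meaning
            # Meta-layer repair: coerce a stub entry for the unknown word so
            # that routine processing can continue.
            LEXICON[problem] = "<unknown:%s>" % problem

    print(parse_with_repairs(["le", "ballon", "rouge"]))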

5 Efficiency

Unification is computationally expensive, and many technical solutions have been proposed for efficient processing of rich and expressive feature structures (Tomuro, 1999; Flickinger, 2000; Callmeier, 2001). In FCG, however, research on efficiency takes a different dimension because performance is considered to be an integral part of the linguistic theory that needs to be operationalized. The demonstration allows conference participants to inspect the following research results on the interplay between grammar and efficiency:

• In line with construction grammar, there is no distinction between the lexicon and the grammar. Based on language usage, the linguistic inventory can nevertheless organize itself in the form of dependency networks that regulate which construction should be considered when in processing (Wellens and De Beule, 2010; Wellens, 2011).

• There is abundant psycholinguistic evidence that language usage contains many ready-made language structures. FCG incorporates a chunking mechanism that is able to create such canned phrases for faster processing (Stadler, 2012).

• Morphological paradigms, such as the German case system, can be represented in the form of ‘feature matrices’, which reduce syntactic and semantic ambiguity and hence speed up processing efficiency and reliability (van Trijp, 2011).

• Many linguistic domains, such as spatial language, are known for their high degree of polysemy. By distinguishing between actual and potential values, such polysemous structures can be processed smoothly (Spranger and Loetzsch, 2011).

6 Conclusion

With many well-developed unification-based grammar formalisms available to the community, one might wonder whether any ‘new kid on the block’ can still claim relevance today. With this demonstration, we hope to show that Fluid Construction Grammar allows grammar engineers to explore uncharted territory, most notably in the relation between linguistic competence and performance, and in modeling usage-based approaches to language.


References

Johanna Barðdal and Shobhana Chelliah, editors. The Role of Semantic, Pragmatic and Discourse Factors in the Development of Case. John Benjamins, Amsterdam, 2009.

Katrien Beuls, Remi van Trijp, and Pieter Wellens. Diagnostics and repairs in Fluid Construction Grammar. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Hans C. Boas. A Constructional Approach to Resultatives. Stanford Monographs in Linguistics. CSLI, Stanford, 2003.

Leonard Bolc, Krzysztof Czuba, Anna Kupsc, Malgorzata Marciniak, Agnieszka Mykowiecka, and Adam Przepiórkowski. A survey of systems for implementing HPSG grammars. Research Report 814 of IPI PAN (Institute of Computer Science, Polish Academy of Sciences), 1996.

Ulrich Callmeier. Efficient parsing with large-scale unification grammars. Master's thesis, Universität des Saarlandes, 2001.

Bob Carpenter. The Logic of Typed Feature Structures. Cambridge UP, Cambridge, 1992.

Bob Carpenter and Gerald Penn. The Attribute Logic Engine (Version 2.0.1). Pittsburgh, 1995.

Liviu Ciortuz. LIGHT – a constraint language and compiler system for typed-unification grammars. In Proceedings of the 25th German Conference on Artificial Intelligence (KI 2002), volume 2479 of LNAI, pages 3–17, Berlin, 2002. Springer-Verlag.

Ann Copestake. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, 2002.

William Croft and D. Alan Cruse. Cognitive Linguistics. Cambridge Textbooks in Linguistics. Cambridge University Press, Cambridge, 2004.

J. De Beule and L. Steels. Hierarchy in Fluid Construction Grammar. In U. Furbach, editor, Proceedings of the 28th Annual German Conference on Artificial Intelligence, volume 3698 of Lecture Notes in Artificial Intelligence, pages 1–15, Berlin, Germany, 2005. Springer Verlag.

Jochen Dörre and Michael Dorna. CUF – a formalism for linguistic knowledge representation. In Jochen Dörre, editor, Computational Aspects of Constraint Based Linguistic Descriptions, volume I, pages 1–22. DYANA-2 Project, Amsterdam, 1993.

Martin C. Emele. The typed feature structure representation formalism. In Proceedings of the International Workshop on Sharable Natural Language Resources, Ikoma, Nara, 1994.

Gregor Erbach. ProFIT: Prolog with features, inheritance and templates. In Proceedings of EACL-95, 1995.

Vyvyan Evans and Melanie Green. Cognitive Linguistics: An Introduction. Lawrence Erlbaum Associates / Edinburgh University Press, Hillsdale, NJ / Edinburgh, 2006.

Vyvyan Evans, Benjamin K. Bergen, and Jörg Zinken. The cognitive linguistics enterprise: An overview. In V. Evans, B. K. Bergen, and J. Zinken, editors, The Cognitive Linguistics Reader. Equinox Publishing, London, 2007.

Daniel P. Flickinger. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28, 2000.

Jonathan Ginzburg and Ivan A. Sag. Interrogative Investigations: the Form, the Meaning, and Use of English Interrogatives. CSLI Publications, Stanford, 2000.

Adele E. Goldberg, Devin M. Casenhiser, and Nitya Sethuraman. Learning argument structure generalizations. Cognitive Linguistics, 15(3):289–316, 2004.

Bernd Heine, Ulrike Claudi, and Friederike Hünnemeyer. Grammaticalization: A Conceptual Framework. University of Chicago Press, Chicago, 1991.

Martin Kay. Functional grammar. In Proceedings of the Fifth Annual Meeting of the Berkeley Linguistics Society, pages 142–158. Berkeley Linguistics Society, 1979.

Hans-Ulrich Krieger and Ulrich Schäfer. TDL – a type description language for HPSG. Part 1: Overview. In Proceedings of the 15th International Conference on Computational Linguistics, pages 893–899, Kyoto, 1994.

George Lakoff. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. The University of Chicago Press, Chicago, 1987.

Ronald W. Langacker. Foundations of Cognitive Grammar: Theoretical Prerequisites. Stanford University Press, Stanford, 1987.

Ronald W. Langacker. A dynamic usage-based model. In Michael Barlow and Suzanne Kemmer, editors, Usage-Based Models of Language, pages 1–63. Chicago University Press, Chicago, 2000.

David Lee. Cognitive Linguistics: An Introduction. Oxford University Press, Oxford, 2001.

Elena Lieven. Developing constructions. Cognitive Linguistics, 20(1):191–199, 2009.

James L. McClelland. The place of modeling in cognitive science. Topics in Cognitive Science, 1:11–38, 2009.

Nurit Melnik. From “hand-written” to computationally implemented HPSG theories. In Stefan Müller, editor, Proceedings of the HPSG05 Conference, Stanford, 2005. CSLI Publications.

Eleanor Rosch. Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104:192–233, 1975.

Ivan A. Sag and Thomas Wasow. Performance-compatible competence grammar. In Robert D. Borsley and Kersti Börjars, editors, Non-Transformational Syntax: Formal and Explicit Models of Grammar, pages 359–377. Wiley-Blackwell, 2011.

Paul Schmidt, Sibylle Rieder, Axel Theofilidis, and Thierry Declerck. Lean formalisms, linguistic theory, and applications. Grammar development in ALEP. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pages 286–291, Copenhagen, 1996.

Michael Spranger and Martin Loetzsch. Syntactic indeterminacy and semantic ambiguity: A case study for German spatial phrases. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Kevin Stadler. Chunking constructions. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Luc Steels, editor. Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Luc Steels, editor. Computational Issues in Fluid Construction Grammar. Springer, Berlin, 2012a.

Luc Steels. Design methods for Fluid Construction Grammar. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012b.

Luc Steels and Joachim De Beule. Unify and merge in Fluid Construction Grammar. In P. Vogt, Y. Sugita, E. Tuci, and C. Nehaniv, editors, Symbol Grounding and Beyond, LNAI 4211, pages 197–223, Berlin, 2006. Springer.

Luc Steels and Remi van Trijp. How to make construction grammars fluid and robust. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar, pages 301–330. John Benjamins, Amsterdam, 2011.

Anatol Stefanowitsch and Stefan Th. Gries. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8(2):209–243, 2003.

Tomek Strzalkowski, editor. Reversible Grammar in Natural Language Processing. Kluwer Academic Publishers, Boston, 1994.

Michael Tomasello. Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press, 2003.

Noriko Tomuro. Left-Corner Parsing Algorithm for Unification Grammars. PhD thesis, DePaul University, Chicago, 1999.

Remi van Trijp. Feature matrices and agreement: A case study for German case. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar, pages 205–236. John Benjamins, Amsterdam, 2011.

Pieter Wellens. Organizing constructions in networks. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Pieter Wellens and Joachim De Beule. Priming through constructional dependencies: A case study in Fluid Construction Grammar. In A. Smith, M. Schouwstra, Bart de Boer, and K. Smith, editors, The Evolution of Language (EVOLANG8), pages 344–351, Singapore, 2010. World Scientific.


A Support Platform for Event Detection using Social Intelligence

Timothy Baldwin, Paul Cook, Bo Han, Aaron Harwood, Shanika Karunasekera and Masud Moshtaghi

Department of Computing and Information Systems
The University of Melbourne
Victoria 3010, Australia

Abstract

This paper describes a system designed to support event detection over Twitter. The system operates by querying the data stream with a user-specified set of keywords, filtering out non-English messages, and probabilistically geolocating each message. The user can dynamically set a probability threshold over the geolocation predictions, and also the time interval to present data for.

1 Introduction

Social media and micro-blogs have entered the mainstream of society as a means for individuals to stay in touch with friends, for companies to market products and services, and for agencies to make official announcements. The attractions of social media include their reach (either targeted within a social network or broadly across a large user base), ability to selectively publish/filter information (selecting to publish certain information publicly or privately to certain groups, and selecting which users to follow), and real-time nature (information “push” happens immediately at a scale unachievable with, e.g., email). The serendipitous takeoff in mobile devices, and widespread support for social media across a range of devices, have been significant contributors to the popularity and utility of social media.

While much of the content on micro-blogs describes personal trivialities, there is also a vein of high-value content ripe for mining. As such, organisations are increasingly targeting micro-blogs for monitoring purposes, whether it is to gauge product acceptance, detect events such as traffic jams, or track complex unfolding events such as natural disasters.

In this work, we present a system intended to support real-time analysis and geolocation of events based on Twitter. Our system consists of the following steps: (1) user selection of keywords for querying Twitter; (2) preprocessing of the returned queries to rapidly filter out messages not in a pre-selected set of languages, and optionally normalise language content; (3) probabilistic geolocation of messages; and (4) rendering of the data on a zoomable map via a purpose-built web interface, with facility for rich user interaction.

Our starting point in the development of this system was the Ushahidi platform,1 which has high uptake for social media surveillance and information dissemination purposes across a range of organisations. The reasons for choosing to implement our own platform were: (a) ease of integration of back-end processing modules; (b) extensibility, e.g. to visualise probabilities of geolocation predictions, and allow for dynamic thresholding; (c) code maintainability; and (d) greater logging facility, to better capture user interactions.

2 Example System Usage

A typical user session begins with the user specifying a disjunctive set of keywords, which are used as the basis for a query to the Twitter Streaming API.2 Messages which match the query are dynamically rendered on an OpenStreetMap mash-up, indexed based on (grid cell-based) location. When the user clicks on a location marker, they are presented with a pop-up list of messages matching the location. The user can manipulate a time slider to alter the time period over which to present results (e.g. in the last 10 minutes, or over the last hour), to gain a better sense of report recency. The user can further adjust the threshold of the prediction accuracy for the probabilistic message locations to view a smaller number of messages with higher-confidence locations, or more messages that have lower-confidence locations.

1 http://ushahidi.com/
2 https://dev.twitter.com/docs/streaming-api

Figure 1: A screenshot of the system, with a pop-up presentation of the messages at the indicated location.

A screenshot of the system for the followingquery is presented in Figure 1:

study studying exam “end of semester”examination test tests school exams uni-versity pass fail “end of term” snowsnowy snowdrift storm blizzard flurryflurries ice icy cold chilly freeze freez-ing frigid winter

3 System Details

The system is composed of a front-end, which provides a GUI interface for query parameter input, and a back-end, which computes a result for each query. The front-end submits the query parameters to the back-end via a servlet. Since the result for the query is time-dependent, the back-end regularly re-evaluates the query, generating an up-to-date result at regular intervals. The front-end regularly polls the back-end, via another servlet, for the latest results that match its submitted query. In this way, the front-end and back-end are loosely coupled and asynchronous.

Below, we describe details of the various modules of the system.

3.1 Twitter Querying

When the user inputs a set of keywords, this is issued as a disjunctive query to the Twitter Streaming API, which returns a streamed set of results in JSON format. The results are parsed, and piped through to the language filtering, lexical normalisation, and geolocation modules, and finally stored in a flat file, which the GUI interacts with.
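
A minimal sketch of this step, assuming the (since retired) v1.1 statuses/filter endpoint of the Streaming API, whose track parameter is interpreted disjunctively; authentication setup is omitted:

    import json
    import requests

    def stream_tweets(keywords, auth):
        resp = requests.post(
            "https://stream.twitter.com/1.1/statuses/filter.json",
            data={"track": ",".join(keywords)},  # disjunctive keyword query
            auth=auth,
            stream=True,
        )
        for line in resp.iter_lines():
            if line:                    # the stream interleaves keep-alive newlines
                yield json.loads(line)  # one JSON-encoded tweet per line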

3.2 Language Filtering

For language identification, we use langid.py, a language identification toolkit developed at The University of Melbourne (Lui and Baldwin, 2011).3 langid.py combines a naive Bayes classifier with cross-domain feature selection to provide domain-independent language identification. It is available under a FOSS license as a stand-alone module pre-trained over 97 languages. langid.py has been developed specifically to be able to keep pace with the speed of messages through the Twitter “garden hose” feed on a single-CPU machine, making it particularly attractive for this project. Additionally, in an in-house evaluation over three separate corpora of Twitter data, we have found langid.py to be overall more accurate than other state-of-the-art language identification systems such as lang-detect4 and the Compact Language Detector (CLD) from the Chrome browser.5

langid.py returns a monolingual prediction of the language content of a given message, and is used to filter out all non-English tweets.

3 http://www.csse.unimelb.edu.au/research/lt/resources/langid
4 http://code.google.com/p/language-detection/
5 http://code.google.com/p/chromium-compact-language-detector/
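
The filtering step thus reduces to a call to langid.classify, which returns a (language, score) pair; for example:

    import langid

    def keep_english(tweets):
        for tweet in tweets:
            lang, _score = langid.classify(tweet["text"])
            if lang == "en":
                yield tweet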

3.3 Lexical Normalisation

The prevalence of noisy tokens in microblogs (e.g. yr “your” and soooo “so”) potentially hinders the readability of messages. Approaches to lexical normalisation—i.e., replacing noisy tokens by their standard forms in messages (e.g. replacing yr with your)—could potentially overcome this problem. At present, lexical normalisation is an optional plug-in for post-processing messages.

A further issue related to noisy tokens is that it is possible that a relevant tweet might contain a variant of a query term, but not that query term itself. In future versions of the system we therefore aim to use query expansion to generate noisy versions of query terms to retrieve additional relevant tweets. We subsequently intend to perform lexical normalisation to evaluate the precision of the returned data.

The present lexical normalisation used by our system is the dictionary lookup method of Han and Baldwin (2011), which normalises noisy tokens only when the normalised form is known with high confidence (e.g. you for u). Ultimately, however, we are interested in performing context-sensitive lexical normalisation, based on a reimplementation of the method of Han and Baldwin (2011). This method will allow us to target a wider variety of noisy tokens such as typos (e.g. earthquak “earthquake”), abbreviations (e.g. lv “love”), phonetic substitutions (e.g. b4 “before”) and vowel lengthening (e.g. goooood “good”).
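
A sketch of the high-confidence dictionary lookup; the (token, standard form) entries shown are illustrative, not the actual dictionary of Han and Baldwin (2011):

    NORM_DICT = {"u": "you", "yr": "your", "soooo": "so"}

    def normalise(message):
        return " ".join(NORM_DICT.get(tok, tok) for tok in message.split())

    print(normalise("yr exam is soooo close"))  # -> "your exam is so close"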

3.4 Geolocation

A vital component of event detection is the determination of where the event is happening, e.g. to make sense of reports of traffic jams or floods. While Twitter supports device-based geotagging of messages, less than 1% of messages have geotags (Cheng et al., 2010). One alternative is to return the user-level registered location as the event location, based on the assumption that most users report on events in their local domicile. However, only about one quarter of users have registered locations (Cheng et al., 2010), and even when there is a registered location, there is no guarantee of its quality. A better solution would appear to be the automatic prediction of the geolocation of the message, along with a probabilistic indication of the prediction quality.6

Geolocation prediction is based on the terms used in a given message, based on the assumption that it will contain explicit mentions of local place names (e.g. London) or use locally-identifiable language (e.g. jawn, which is characteristic of the Philadelphia area). By including a probability with the prediction, we can give the system user control over what level of noise they are prepared to see in the predictions, and hopefully filter out messages where there is insufficient or conflicting geolocating evidence.

We formulate the geolocation prediction problem as a multinomial naive Bayes classification problem, based on its speed and accuracy over the task. Given a message m, the task is to output the most probable location loc_max ∈ {loc_1, ..., loc_n} for m. User-level classification can be performed based on a similar formulation, by combining the total set of messages from a given user into a single combined message.

Given a message m, the task is to find argmax_i P(loc_i | m), where each loc_i is a grid cell on the map. Based on Bayes' theorem and standard assumptions in the naive Bayes formulation, this is transformed into:

    argmax_i  P(loc_i) ∏_{j=1}^{v} P(w_j | loc_i)

where w_1, ..., w_v are the token features of m.

To avoid zero probabilities, we only consider tokens that occur at least twice in the training data, and ignore unseen words. A probability is calculated for the most probable location by normalising over the scores for each loc_i.
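
A minimal scorer for this formulation might look as follows, assuming prior (P(loc)) and cond (P(w|loc)) have been estimated from the geotagged training data and vocab holds the tokens occurring at least twice in it; these names are hypothetical stand-ins for the system's internals:

    import math

    def predict_location(tokens, prior, cond, vocab):
        tokens = [w for w in tokens if w in vocab]   # ignore unseen words
        scores = {
            loc: math.log(p) + sum(math.log(cond[loc][w]) for w in tokens)
            for loc, p in prior.items()
        }
        best = max(scores, key=scores.get)
        # Normalise the best score into a probability, shifting by the
        # maximum log-score for numerical stability.
        m = scores[best]
        total = sum(math.exp(s - m) for s in scores.values())
        return best, 1.0 / total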

We employ the method of Ritter et al. (2011) to tokenise messages, and use token unigrams as features, including any hashtags, but ignoring Twitter mentions, URLs and purely numeric tokens. We also experimented with including the named entity predictions of the Ritter et al. (2011) method in our system, but found that they had no impact on predictive accuracy. Finally, we apply feature selection to the data, based on information gain (Yang and Pedersen, 1997).

6 Alternatively, we could consider a hybrid approach of user- and message-level geolocation prediction, especially for users where we have sufficient training data, which we plan to incorporate into a future version of the system.

Figure 2: Accuracy of geolocation prediction, for varying numbers of features based on information gain.

To evaluate the geolocation prediction module, we use the user-level geolocation dataset of Cheng et al. (2010), based on the lower 48 states of the USA. The user-level accuracy of our method over this dataset, for varying numbers of features based on information gain, can be seen in Figure 2. Based on these results, we select the top 36,000 features in the deployed version of the system.

In the deployed system, the geolocation prediction model is trained over one million geotagged messages collected over a 4-month period from July 2011, resolved to 0.1-degree latitude/longitude grid cells (covering the whole globe, excepting grid locations where there were fewer than 8 messages). For any geotagged messages in the test data, we preserve the geotag and simply set the probability of the prediction to 1.0.
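
For illustration, resolving a geotag to such a cell can be as simple as integer division of the coordinates (a plausible binning scheme; the paper does not spell out its exact one):

    import math

    def grid_cell(lat, lon, size=0.1):
        # Returns integer cell indices for a 0.1-degree grid.
        return (math.floor(lat / size), math.floor(lon / size))

    print(grid_cell(48.8566, 2.3522))  # -> (488, 23)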

3.5 System Interface

The final output of the various pre-processing modules is a list of tweets that match the query, in the form of an 8-tuple as follows:

• the Twitter user ID

• the Twitter message ID

• the geo-coordinates of the message (either provided with the message, or automatically predicted)

• the probability of the predicted geolocation

• the text of the tweet

In addition to specifying a set of keywords for a given session, system users can dynamically select regions on the map, either via the manual specification of a bounding box, or by zooming the map in and out. They can additionally change the time scale to display messages over, specify the refresh interval, and adjust the threshold on the geolocation predictions, so as not to render any messages which have a predictive probability below the threshold. The size of each place marker on the map is rendered proportional to the number of messages at that location, and a square is superimposed over the box to represent the maximum predictive probability for a single message at that location (to provide user feedback on both the volume of predictions and the relative confidence of the system at a given location).

References

Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: a content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), pages 759–768, Toronto, ON, Canada. ACM.

Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 368–378, Portland, USA.

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), pages 412–420, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.


NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools

Giuseppe Rizzo
EURECOM / Sophia Antipolis, France
Politecnico di Torino / Turin, Italy
[email protected]

Raphaël Troncy
EURECOM / Sophia Antipolis, France
[email protected]

Abstract

Named Entity Extraction is a mature task in the NLP field that has yielded numerous services gaining popularity in the Semantic Web community for extracting knowledge from web documents. These services are generally organized as pipelines, using dedicated APIs and different taxonomies for extracting, classifying and disambiguating named entities. Integrating one of these services in a particular application requires implementing an appropriate driver. Furthermore, the results of these services are not comparable due to different formats. This prevents the comparison of the performance of these services as well as their possible combination. We address this problem by proposing NERD, a framework which unifies 10 popular named entity extractors available on the web, and the NERD ontology which provides a rich set of axioms aligning the taxonomies of these tools.

1 Introduction

The web hosts millions of unstructured documents such as scientific papers, news articles as well as forum and archived mailing list threads or (micro-)blog posts. This information usually has a rich semantic structure which is clear to the human being but remains mostly hidden to computing machinery. Natural Language Processing (NLP) tools aim to extract such structure from those free texts. They provide algorithms for analyzing atomic information elements which occur in a sentence and identify Named Entities (NE) such as names of people or organizations, locations, time references or quantities. They also classify these entities according to predefined schemas, increasing discoverability (e.g. through faceted search) and reusability of information.

Recently, research and commercial communities have spent effort to publish NLP services on the web. Besides the common task of identifying POS and of reducing this set to NEs, they provide more and more disambiguation facilities with URIs that describe web resources, leveraging the web of real-world objects. Moreover, these services classify such information using common ontologies (e.g. the DBpedia ontology1 or YAGO2), exploiting the large amount of knowledge available from the web of data. Tools such as AlchemyAPI3, DBpedia Spotlight4, Evri5, Extractiv6, Lupedia7, OpenCalais8, Saplo9, Wikimeta10, Yahoo! Content Extraction11 and Zemanta12 represent a clear opportunity for the web community to increase the volume of interconnected data. Although these extractors share the same purpose - extract NEs from text, classify and disambiguate this information - they make use of different algorithms and provide different outputs.

This paper presents NERD (Named Entity Recognition and Disambiguation), a framework that unifies the output of 10 different NLP extractors publicly available on the web. Our approach relies on the development of the NERD ontology, which provides a common interface for annotating elements, and a web REST API which is used to access the unified output of these tools. We compare 6 different systems using NERD and we discuss some quantitative results. The NERD application is accessible online at http://nerd.eurecom.fr. It requires as input the URI of a web document that will be analyzed and optionally an identification of the user for recording and sharing the analysis.

1 http://wiki.dbpedia.org/Ontology
2 http://www.mpi-inf.mpg.de/yago-naga/yago
3 http://www.alchemyapi.com
4 http://dbpedia.org/spotlight
5 http://www.evri.com/developer
6 http://extractiv.com
7 http://lupedia.ontotext.com/
8 http://www.opencalais.com
9 http://www.saplo.com/
10 http://www.wikimeta.com
11 http://developer.yahoo.com/search/content/V2/contentAnalysis.html
12 http://www.zemanta.com

2 Framework

NERD is a web application plugged on top of various NLP tools. Its architecture follows the REST principles and provides a web HTML access for humans and an API for computers to exchange content in JSON or XML. Both interfaces are powered by the NERD REST engine. Figure 1 shows the workflow of an interaction among clients (humans or computers), the NERD REST engine and various NLP tools which are used by NERD for extracting NEs, and for providing a type and disambiguation URIs pointing to real-world objects as they could be defined in the Web of Data.

2.1 NERD interfaces

The web interface13 is developed in HTML/Javascript. It accepts any URI of a web document, which is analyzed in order to extract its main textual content. Starting from the raw text, it drives one or several tools to extract the list of Named Entities, their classification and the URIs that disambiguate these entities. The main purpose of this interface is to enable a human user to assess the quality of the extraction results collected by those tools (Rizzo and Troncy, 2011a). At the end of the evaluation, the user sends the results, through asynchronous calls, to the REST API engine in order to store them. This set of evaluations is further used to compute statistics about precision scores for each tool, with the goal of highlighting strengths and weaknesses and comparing them (Rizzo and Troncy, 2011b). The comparison aggregates all the evaluations performed and, finally, the user is free to select one or more evaluations to see the metrics that are computed for each service in real time. Finally, the application contains a help page that provides guidance and details about the whole evaluation process.

13 http://nerd.eurecom.fr

The API interface14 is developed following the REST principles and aims to enable programmatic access to the NERD framework. GET, POST and PUT methods manage the requests coming from clients to retrieve the list of NEs, classification types and URIs for a specific tool or for the combination of them. They take as input the URI of the document to process and a user key for authentication. The output sent back to the client can be serialized in JSON or XML depending on the content type requested. The output follows the schema described below (in the JSON serialization):

"entities": [
  {
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_Berners-Lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
  }
]
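
A client consuming this serialization only needs standard JSON parsing; the sketch below assumes the fragment above is wrapped in a top-level JSON object, with field names as in the example:

    import json

    def entity_spans(payload):
        # payload: the JSON document returned by the API
        for e in json.loads(payload)["entities"]:
            yield e["startChar"], e["endChar"], e["nerdType"]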

2.2 NERD REST engine

The REST engine runs on Jersey15 and Grizzly16 technologies. Their extensible framework allows the development of several components, so NERD is composed of 7 modules, namely: authentication, scraping, extraction, ontology mapping, store, statistics and web. The authentication module enables logging in with an OpenID provider and subsequently attaches all analyses and evaluations performed by a user to his profile. The scraping module takes as input the URI of an article and extracts its main textual content. Extraction is the module designed to invoke the external service APIs and collect the results. Each service provides its own taxonomy of named entity types it can recognize. We therefore designed the NERD ontology which provides a set of mappings between these various classifications. The ontology mapping module is in charge of mapping the classification type retrieved to the NERD ontology. The store module saves all evaluations according to the schema model we defined in the NERD database. The statistics module enables the extraction of data patterns from the user interactions stored in the database and the computation of statistical scores such as Fleiss' Kappa and precision/recall analysis. Finally, the web module manages the client requests, the web cache and generates the HTML pages.

14 http://nerd.eurecom.fr/api/application.wadl
15 http://jersey.java.net
16 http://grizzly.java.net

Figure 1: A user interacts with NERD through a REST API. The engine drives the extraction to the NLP extractor. The NERD REST engine retrieves the output, unifies it and maps the annotations to the NERD ontology. Finally, the output result is sent back to the client using the format reported in the initial request.

3 NERD ontology

Although these tools share the same goal, they use different algorithms and their own classification taxonomies, which makes their comparison hard. To address this problem, we have developed the NERD ontology, which is a set of mappings established manually between the schemas of the Named Entity categories. Concepts included in the NERD ontology are collected from different schema types: ontology (for DBpedia Spotlight and Zemanta), lightweight taxonomy (for AlchemyAPI, Evri and Wikimeta) or simple flat type lists (for Extractiv, OpenCalais and Wikimeta). A concept is included in the NERD ontology as soon as there are at least two tools that use it. The NERD ontology thus becomes a reference ontology for comparing the classification task of NE tools. In other words, NERD is a set of axioms useful to enable comparison of NLP tools. We consider the DBpedia ontology exhaustive enough to represent all the concepts involved in a NER task. Concepts that do not appear in the NERD namespace are simply sub-classes of parents that end up in the NERD ontology. This ontology is available at http://nerd.eurecom.fr/ontology.

We provide the following example mapping among those tools, which defines the City type: the nerd:City class is considered as being equivalent to alchemy:City, dbpedia-owl:City, extractiv:CITY, opencalais:City and evri:City, while being more specific than wikimeta:LOC and zemanta:location.

nerd:City a rdfs:Class ;
    rdfs:subClassOf wikimeta:LOC ;
    rdfs:subClassOf zemanta:location ;
    owl:equivalentClass alchemy:City ;
    owl:equivalentClass dbpedia-owl:City ;
    owl:equivalentClass evri:City ;
    owl:equivalentClass extractiv:CITY ;
    owl:equivalentClass opencalais:City .

4 Ontology alignment results

We conducted an experiment to assess the alignment of the NERD framework according to the ontology we developed. For this experiment, we collected 1000 news articles of The New York Times from 09/10/2011 to 12/10/2011 and we performed the extraction of named entities with the tools supported by NERD. The goal is to explore the NE extraction patterns in this dataset and to assess commonalities and differences of the classification schemas used. We propose the alignment of the 6 main types recognized by all tools using the NERD ontology. To conduct this experiment, we used the default configuration of all the tools involved. We define the following variables: the number nd of evaluated documents, the number nw of words, the total number ne of entities, the total number nc of categories, and the total number nu of URIs. Moreover, we compute the following metrics: word detection rate r(w, d), i.e. the number of words per document; entity detection rate r(e, d), i.e. the number of entities per document; entity detection rate per word r(e, w), i.e. the ratio between entities and words; category detection rate r(c, d), i.e. the number of categories per document; and URI detection rate r(u, d), i.e. the number of URIs per document. The evaluation we performed concerned nd = 1000 documents that amount to nw = 620,567 words. The word detection rate per document r(w, d) is equal to 620.57 and the total number of recognized entities ne is 164,12, with r(e, d) equal to 164.17. Finally, r(e, w) is 0.0264, r(c, d) is 0.763 and r(u, d) is 46.287.

               AlchemyAPI  DBpedia Spotlight   Evri  Extractiv  OpenCalais  Zemanta
Person              6,246                 14  2,698      5,648       5,615    1,069
Organization        2,479                  -    900         81       2,538      180
Country             1,727                  2  1,382      2,676       1,707      720
City                2,133                  -    845      2,046       1,863        -
Time                    -                  -      -        123           1        -
Number                  -                  -      -      3,940           -        -

Table 1: Number of axioms aligned for all the tools involved in the comparison according to the NERD ontology for the sources collected from The New York Times from 09/10/2011 to 12/10/2011.

Table 1 shows the classification comparison results. DBpedia Spotlight recognizes very few classes. Zemanta increases classification performance significantly with respect to DBpedia Spotlight, obtaining a number of recognized Person instances that is two orders of magnitude larger. AlchemyAPI has a strong ability to recognize Person and City while obtaining significant scores for Organization and Country. OpenCalais shows good results in recognizing the class Person and a strong ability to classify NEs with the label Organization. Extractiv holds the best score for classifying Country and it is the only extractor capable of extracting the classes Time and Number.

5 Conclusion

In this paper, we presented NERD, a framework developed following REST principles, and the NERD ontology, a reference ontology to map several NER tools publicly accessible on the web. We proposed preliminary comparison results where we investigate the importance of a reference ontology in order to evaluate the strengths and weaknesses of the NER extractors. We will investigate whether the combination of extractors may overcome the performance of a single tool or not. We will demonstrate more live examples of what NERD can achieve during the conference. Finally, with the increasing interest in interconnecting data on the web, a lot of research effort is spent on aggregating the results of NLP tools. The importance of having a system able to compare them is under investigation in the NIF17 (NLP Interchange Format) project. NERD has recently been integrated with NIF (Rizzo and Troncy, 2012) and the NERD ontology is a milestone for creating a reference ontology for this task.

Acknowledgments

This paper was supported by the French Ministry of Industry (Innovative Web call) under contract 09.2.93.0966, "Collaborative Annotation for Video Accessibility" (ACAV).

References

Rizzo G. and Troncy R. 2011. NERD: A Framework for Evaluating Named Entity Recognition Tools in the Web of Data. 10th International Semantic Web Conference (ISWC'11), Demo Session, Bonn, Germany.

Rizzo G. and Troncy R. 2011. NERD: Evaluating Named Entity Recognition Tools in the Web of Data. Workshop on Web Scale Knowledge Extraction (WEKEX'11), Bonn, Germany.

Rizzo G., Troncy R., Hellmann S. and Bruemmer M. 2012. NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. 5th International Workshop on Linked Data on the Web (LDOW'12), Lyon, France.


Automatic Analysis of Patient History Episodes in Bulgarian Hospital Discharge Letters

Svetla Boytcheva
State University of Library Studies and Information Technologies
and IICT-BAS
[email protected]

Galia Angelova, Ivelina Nikolova
Institute of Information and Communication Technologies (IICT),
Bulgarian Academy of Sciences (BAS)
galia,[email protected]

Abstract

This demo presents Information Extraction from discharge letters in the Bulgarian language. The Patient history section is automatically split into episodes (clauses between two temporal markers); then drugs, diagnoses and conditions are recognised within the episodes with accuracy higher than 90%. The temporal markers, which refer to absolute or relative moments of time, are identified with precision 87% and recall 68%. The direction of time for the episode starting point - backwards or forward (with respect to a certain moment orienting the episode) - is recognised with precision 74.4%.

1 Introduction

Temporal information processing is a challenge in medical informatics (Zhou and Hripcsak, 2007; Hripcsak et al., 2005). There is no agreement about which features of the temporal models can be extracted automatically from free text. Some sophisticated approaches aim at adapting TimeML-based tags to clinically important entities (Savova et al., 2009), while others identify dates and prepositional phrases containing temporal expressions (Angelova and Boytcheva, 2011). Most NLP prototypes for automatic temporal analysis of clinical narratives deal with discharge letters.

This demo presents a prototype for the automatic splitting of the Patient history into episodes and the extraction of important patient-related events that occur within these episodes. We process Electronic Health Records (EHRs) of diabetic patients. In Bulgaria, due to centralised regulations on medical documentation (which date back to the 1960s), hospital discharge letters have a predefined structure (Agreement, 2005). Using the section headers, our Information Extraction (IE) system automatically identifies the Patient history (Anamnesis). It contains a summary, written by the medical expert who hospitalises the patient, and documents the main phases in the development of the diabetes, the main interventions and their effects. The splitting algorithm is based on the assumption that the Patient history text can be represented as a structured sequence of adjacent clauses which are positioned between two temporal markers and report about important events happening in the designated period. The temporal markers are usually accompanied by words signaling the direction of time (backward or forward). Thus we assume that the episodes have the following structure: (i) reference point, (ii) direction, (iii) temporal expression, (iv) diagnoses, (v) symptoms, syndromes, conditions, or complaints; (vi) drugs; (vii) treatment outcome. The demo will show how our IE system automatically fills in the seven slots enumerated above. Among all symptoms and conditions, which are complex phrases and paraphrases, the extraction of features related to polyuria and polydipsia, weight change and blood sugar value descriptions will be demonstrated. Our present corpus contains 1,375 EHRs.
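As an illustration of this slot structure, a minimal sketch of the record the IE system fills in could look as follows (the field names are ours, for illustration, not the system's):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Episode:
    """One patient-history episode, delimited by two temporal markers."""
    reference: Optional[str] = None   # (i) reference point, e.g. "August 2003"
    direction: Optional[str] = None   # (ii) "forward" or "backward"
    expression: Optional[str] = None  # (iii) temporal expression, e.g. "since then"
    diagnoses: List[str] = field(default_factory=list)   # (iv) diagnoses
    conditions: List[str] = field(default_factory=list)  # (v) symptoms, complaints
    drugs: List[str] = field(default_factory=list)       # (vi) drugs
    outcome: Optional[str] = None     # (vii) treatment outcome
```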

2 Recognition of Temporal Markers

Temporal information is very important in clinical narratives: the corpus contains 8,248 temporal markers and 8,249 words/phrases signaling the direction backwards or forward (while there are 7,108 drug name occurrences and 7,565 diagnoses).


In the hospital information system, there are two explicitly fixed dates: the patient's birth date and the hospitalisation date. Both of them are used as antecedents of temporal anaphora:

• the hospitalisation date is a reference point for 37.2% of all temporal expressions (e.g. 'since 5 years', '(since) last March', '3 years ago', 'two weeks ago', 'diabetes duration 22 years', 'during the last 3 days', etc.). For 8.46% of them, the expression allows the calculation of the particular date when the corresponding event occurred;

• the age (calculated using the birth date) is a reference point for 2.1% of all temporal expressions (e.g. 'diabetes diagnosed at the age of 22 years').

Some 28.96% of the temporal markers refer to an explicitly specified year, which we consider an absolute reference. Another 15.1% of the markers contain a reference to day, month and year; in this way, 44.06% of the temporal expressions explicitly refer to dates. Adding to these 44.06% the above-listed referential citations of the hospitalisation date and the birth date, we see that 83.36% of the temporal markers refer to explicitly specified moments of time and can be seen as absolute references. We note that diabetes is a chronic disease, and references like 'diabetes diagnosed 30 years ago' are sufficiently precise to be counted as explicit temporal pointers.

The anaphoric expressions refer to events described in the Patient history section: these expressions are 2.63% of the temporal markers (e.g. '20 days after the operation', '3 years after diagnosing the diabetes', 'about 1 year after that', 'with the same duration', etc.). We call these expressions relative temporal markers and note that much of our temporal knowledge is relative and cannot be described by a date (Allen, 1983).

The remaining 14% of the temporal markers are undetermined, like 'many years ago', 'before the puberty', 'in young age', 'long-duration diabetes'. About one third of these markers refer to periods, e.g. 'for a period of 3 years', 'with duration of 10-15 years', and need to be interpreted inside the episode where they occur.

When we identify a temporal expression in a sentence of the Patient history, we consider it a signal for a new episode. Thus it is very important to recognise automatically the time anchors of the events described in the episode: whether they happen at the moment designated by the marker, after it, or before it. The temporal markers are accompanied by words signaling the time direction backwards or forward, as follows (a lookup-table sketch follows the list):

• the preposition 'since' (ot) unambiguously designates the episode startpoint and the time interval in which the events happen. It occurs in 46.78% of the temporal markers;

• the preposition 'in' (prez) designates the episode startpoint with probability 92.14%. It points to a moment of time and often marks the beginning of a new period. But the events happening after 'in' might refer backwards to past moments, e.g. 'diabetes diagnosed in 2004, (as the patient) lost 20 kg in 6 months with reduced appetite'. So there can be past events embedded in 'in'-started episodes which should be considered separate episodes (but these are really difficult to identify automatically);

• the preposition 'after' (sled) unambiguously identifies a relative time moment oriented to the immediately preceding event, e.g. 'after that', with the synonym 'later', e.g. 'one year later'. Another kind of reference is an explicit event specification, e.g. 'after the Maninil has been stopped';

• the preposition 'before' or 'ago' (predi) is included in 11.2% of all temporal markers in our corpus. In 97.4% of its occurrences it is associated with a number of years/months/days and refers to the hospitalisation date, e.g. '3 years ago', 'two weeks ago'. In 87.6% of its occurrences it denotes a starting point in the past after which some events happen. However, there are cases when 'ago' marks an endpoint, e.g. 'Since 1995 the hypertony 150/100 was treated by Captopril 25mg, later by Enpril 10mg, but two years ago the therapy has been stopped because of hypotony';

• the preposition 'during, throughout' (v prodalzhenie na) occurs relatively rarely, in only 1.02% of all markers. It is usually associated with an explicit time period.
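These direction heuristics can be approximated by a simple lookup table; in the sketch below, the anchor types and confidences are our own simplification of the corpus percentages quoted above:

```python
# Marker preposition -> (anchor type, confidence). The confidences reuse
# the corpus frequencies reported above; treating them as rule weights
# is a simplification for illustration only.
DIRECTION_RULES = {
    "since":  ("startpoint", 1.00),   # unambiguous episode startpoint
    "in":     ("startpoint", 0.9214), # may still embed past events
    "after":  ("relative",   1.00),   # oriented to the preceding event
    "before": ("startpoint", 0.876),  # 'ago' usually opens a past period
    "during": ("period",     1.00),   # explicit time period
}

def guess_direction(preposition: str):
    """Return (anchor_type, confidence) for a temporal marker."""
    return DIRECTION_RULES.get(preposition, ("undetermined", 0.0))
```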


3 Recognition of Diagnoses and Drugs

We have developed high-quality extractors of diagnoses, drugs and dosages from EHRs in the Bulgarian language. These two extraction components are integrated in our IE system, which processes the Patient history episodes.

Phrases designating diagnoses are juxtaposed to ICD-10 codes (ICD, 10). Major difficulties in matching ICD-10 diseases to text units are due to (i) numerous Latin terms written in the Latin or Cyrillic alphabet; (ii) a large variety of abbreviations; (iii) descriptions which are hard to associate with ICD-10 codes; and (iv) various types of ambiguity, e.g. text fragments that might be juxtaposed to many ICD-10 labels.

The drug extractor finds in the EHR texts 1,850 brand names of drugs and their daily dosages. Drug extraction is based on algorithms using regular expressions to describe linguistic patterns. The variety of textual expressions, as well as absent or partial dosage descriptions, impede the extraction performance. Drug names are juxtaposed to ATC codes (ATC, 11).
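As a rough illustration only (the actual patterns target Bulgarian text and are considerably richer), a regular expression of this kind might look like:

```python
import re

# Hypothetical simplified pattern: a capitalised brand name followed by
# an optional intake frequency and a dosage, e.g. "Captopril 25mg".
DRUG_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+)"          # drug brand name
    r"(?:\s+(?P<times>\d+)\s*x)?"     # optional frequency, e.g. "2x"
    r"\s*(?P<dose>\d+(?:\.\d+)?)\s*"  # dose value
    r"(?P<unit>mg|g|ml|t)\b"          # dose unit (t = tablet)
)

m = DRUG_PATTERN.search("put on Captopril 25mg in the morning")
if m:
    print(m.group("name"), m.group("dose"), m.group("unit"))
```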

4 IE of symptoms and conditions

Our aim is to identify diabetes symptoms and conditions in the free text of the Patient history. The main challenge is to recognise automatically phrases and paraphrases for which no "canonical forms" exist in any dictionary. Symptom extraction is done over a corpus of 1,375 discharge letters. We analyse certain dominant factors in diagnosing diabetes - values of blood sugar, body weight change, and polyuria and polydipsia descriptions. Some examples follow:

(i) Because of polyuria-polydipsia syndrome, blood sugar was - 19 mmol/l.
(ii) ... on the background of obesity - 117 kg ...

The challenge in this task is not only to identify the sentences or phrases containing such expressions, but also to determine correctly the borders of each description, recognise the values and the direction of change - increased or decreased - and check whether the expression is negated or not.

The extraction of symptoms is a hybrid method which combines document classification and rule-based pattern recognition. It is done by a 6-step algorithm as follows: (i) manual selection of symptom descriptions from a training corpus; (ii) compiling a list of keyterms per symptom; (iii) compiling probability vocabularies for left- and right-border tokens per symptom description, according to the frequencies of the left- and right-most tokens in the list of symptom descriptions; (iv) compiling a list of features per symptom (these are all tokens available in the keyterms list, excluding stop words); (v) performing document classification to select the documents containing the symptom of interest, based on the features selected in the previous step; and (vi) selecting symptom descriptions by consecutively applying rules employing the keyterms vocabulary and the left- and right-border token vocabularies. To overcome the inflexion of the Bulgarian language, we use stemming.

The last step can itself be segmented into five subtasks: focusing on the expressions which contain the terms; determining the scope of the expressions; deciding on the condition worsening - increased or decreased values; and identifying the values - interval values, simple values, measurement units, etc. The final subtask is to determine whether the expression is negated or not.
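A skeletal version of step (vi), under the simplifying assumption that the border vocabularies map stems to border probabilities, could be:

```python
def extract_descriptions(tokens, keyterms, left_prob, right_prob, threshold=0.5):
    """Expand around each keyterm occurrence until probable border tokens.

    tokens     -- stemmed tokens of one sentence
    keyterms   -- set of symptom keyterm stems
    left_prob  -- stem -> probability of opening a symptom description
    right_prob -- stem -> probability of closing a symptom description
    """
    for i, tok in enumerate(tokens):
        if tok not in keyterms:
            continue
        start = i
        while start > 0 and left_prob.get(tokens[start - 1], 0.0) < threshold:
            start -= 1  # absorb tokens unlikely to be a left border
        end = i
        while end < len(tokens) - 1 and right_prob.get(tokens[end + 1], 0.0) < threshold:
            end += 1    # absorb tokens unlikely to be a right border
        yield tokens[start:end + 1]
```

Value extraction and negation detection would then run over the yielded spans.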

5 Evaluation results

The evaluation of all linguistic modules is performed in close cooperation with medical experts who assess the methodological feasibility of the approach and its practical usefulness.

The temporal markers, which refer to absolute or relative moments of time, are identified with precision 87% and recall 68%. The direction of time for the episode events - backwards or forward (with respect to a certain moment orienting the episode) - is recognised with precision 74.4%.

ICD-10 codes are associated with phrases with precision 84.5%. This component was developed in a previous project, where it was run on 6,200 EHRs and extracted 26,826 phrases from the EHR section Diagnoses; correct ICD-10 codes were assigned to 22,667 phrases. In this way, the ICD-10 extractor uses a dictionary of 22,667 phrases which designate 478 ICD-10 disease names occurring in diabetic EHRs (Boytcheva, 2011a).

Drug names are juxtaposed to ATC codes with f-score 98.42%; the drug dosage is recognised with f-score 93.85% (Boytcheva, 2011b). This result is comparable to the accuracy of the best


systems, e.g. MedEx, which extracts medication events with 93.2% f-score for drug names, 94.5% for dosage, 93.9% for route and 96% for frequency (Xu et al., 2010). We also identify the drugs taken by the patient at the moment of hospitalisation. This is evaluated on 355 drug names occurring in the EHRs of diabetic patients. The extraction is done with f-score 94.17% for drugs in the Patient history (over-generation 6%) (Boytcheva et al., 2011).

In the separate phases of symptom description extraction the f-score goes up to 96%. Complete blood sugar descriptions are identified with 89% f-score, complete weight change descriptions with 75%, and complete polyuria and polydipsia descriptions with 90%. These figures are comparable to the success in extracting conditions reported by Harkema et al. (2009).

6 Demonstration

The demo presents: (i) the extractors of diagnoses, drugs and conditions within episodes, and (ii) their integration within a framework for temporal segmentation of the Patient history into episodes, with identification of temporal markers and time direction. Thus the prototype automatically recognises the time period in which some events of interest have occurred.

Example 1. (April 2004) Diabetes diagnosed last August with blood sugar values 14mmol/l. Since then put on a diet but without following it too strictly. Since December follows the diet but the blood sugar decreases to 12mmol/l. This makes it necessary to prescribe Metfodiab in the morning and at noon 1/2t. since 15.I. Since then the body weight has been reduced by about 6 kg. Complains of formication in the lower limbs.

This history is broken down into episodes, imposed by the time markers (Table 1). Please note that we suggest no order for the episodes; this should be done by a temporal reasoner.

However, it is hard to cope with expressions like the ones in Examples 2-5, where more than one temporal marker occurs in the same sentence, possibly with diverse orientations. This requires semantic analysis of the events happening within the sentences. Example 2: Since 1.5 years with growing swelling of the feet which became permanent and massive since the summer of 2003. Example 3: Diabetes type 2 with duration 2 years, diagnosed due to gradual body weight reduction

Ep 1  reference    August 2003
      direction    forward
      expression   last August
      condition    blood sugar 14mmol/l

Ep 2  reference    August 2003
      direction    forward
      expression   Since then

Ep 3  reference    December 2003
      direction    forward
      expression   Since December
      condition    blood sugar 12mmol/l

Ep 4  reference    15.I
      direction    forward
      expression   since 15.I
      treatment    Metfodiab A10BA02, 1/2t. morning and noon

Ep 5  reference    15.I
      direction    forward
      expression   Since then
      condition    body weight reduced 6 kg

Table 1: A patient history broken down into episodes.

during the last 5-6 years. Example 4: Secondary amenorrhoea after a childbirth 12 months ago, after the birth with ceased menstruation and without lactation. Example 5: Now hospitalised 3 years after a radioiodine therapy of a nodular goiter which had been treated before that by thyreostatic medication for about a year.

In conclusion, this demo presents one step in the temporal analysis of clinical narratives: decomposition into fragments that can be considered as happening in the same period of time. The system integrates various components which extract important patient-related entities. The relative success is partly due to the very specific text genre. Further effort is needed for ordering the episodes in timelines, which is on our research agenda for the future. These results will be integrated into a research prototype extracting conceptual structures from EHRs.

Acknowledgments

This work is supported by grant DO/02-292 EVTIMA, funded by the Bulgarian National Science Fund in 2009-2012. The anonymised EHRs are delivered by the University Specialised Hospital of Endocrinology, Medical University - Sofia.


References

Allen, J. 1983. Maintaining Knowledge about Temporal Intervals. Comm. ACM, 26(11), pp. 832-843.

Angelova, G. and S. Boytcheva. 2011. Towards Temporal Segmentation of Patient History in Discharge Letters. In Proceedings of the Second Workshop on Biomedical Natural Language Processing, associated to RANLP-2011, September 2011, pp. 11-18.

Boytcheva, S. 2011a. Automatic Matching of ICD-10 Codes to Diagnoses in Discharge Letters. In Proceedings of the Second Workshop on Biomedical Natural Language Processing, associated to RANLP-2011, September 2011, pp. 19-26.

Boytcheva, S. 2011b. Shallow Medication Extraction from Hospital Patient Records. In Patient Safety Informatics - Adverse Drug Events, Human Factors and IT Tools for Patient Medication Safety, IOS Press, Studies in Health Technology and Informatics series, Volume 166, May 2011, pp. 119-128.

Boytcheva, S., D. Tcharaktchiev and G. Angelova. 2011. Contextualization in automatic extraction of drugs from Hospital Patient Records. In A. Moen et al. (Eds.), User Centred Networked Health Care, Proceedings of MIE-2011, IOS Press, Studies in Health Technology and Informatics series, Volume 169, August 2011, pp. 527-531.

Harkema, H., J. N. Dowling, T. Thornblade, and W. W. Chapman. 2009. ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. J. Biomedical Informatics, 42(5), pp. 839-851.

Hripcsak, G., L. Zhou, S. Parsons, A. K. Das, and S. B. Johnson. 2005. Modeling electronic discharge summaries as a simple temporal constraint satisfaction problem. JAMIA (J. of Amer. MI Assoc.), 12(1), pp. 55-63.

Savova, G., S. Bethard, W. Styler, J. Martin, M. Palmer, J. Masanz, and W. Ward. 2009. Towards Temporal Relation Discovery from the Clinical Narrative. In Proc. AMIA Annual Symposium 2009, pp. 568-572.

Xu, H., S. P. Stenner, S. Doan, K. Johnson, L. Waitman, and J. Denny. 2010. MedEx: a medication information extraction system for clinical narratives. JAMIA, 17, pp. 19-24.

Zhou, L. and G. Hripcsak. 2007. Temporal reasoning with medical data - a review with emphasis on medical natural language processing. J. Biomedical Informatics, 40(2), pp. 183-202.

Agreement fixing the sections of Bulgarian hospital discharge letters. Bulgarian Parliament, Official State Gazette 106 (2005), Article 190(3).

ICD v.10: International Classification of Diseases, http://www.nchi.government.bg/download.html.

ATC (Anatomical Therapeutic Chemical Classification System), http://who.int/classifications/atcddd/en.


ElectionWatch: Detecting Patterns in News Coverage of US Elections

Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini
Intelligent Systems Laboratory, University of Bristol
(saatviga.sudhahar, Thomas.Lansdall-Welfare, ilias.flaounas, nello.cristianini)@bristol.ac.uk

Abstract

We present a web tool that allows users to explore news stories concerning the 2012 US Presidential Elections via an interactive interface. The tool is based on concepts of "narrative analysis", where the key actors of a narration are identified, along with their relations, in what are sometimes called "semantic triplets" (one example of a triplet of this kind is "Romney Criticised Obama"). The network of actors and their relations can be mined for insights about the structure of the narration, including the identification of the key players, of the network of political support of each of them, a representation of the similarity of their political positions, and other information concerning their role in the media narration of events. The interactive interface allows users to retrieve the news reports supporting the relations of interest.

1 Introduction

U.S. presidential elections are major media events, following a fixed calendar, where two or more public relations "machines" compete to send out their message. From the point of view of the media, this event is often framed as a race, with contenders, front runners, and complex alliances. By the end of the campaign, which lasts for about one year, two line-ups are created in the media, one for each major party. This event provides researchers an opportunity to analyse the narrative structures found in the news coverage, the amount of media attention devoted to the main contenders and their allies, and other patterns of interest.

We propose to study the U.S. Presidential Elections with the tools of (quantitative) narrative analysis, identifying the key actors and their political relations, and using this information to infer the overall structure of the political coalitions. We are also interested in how the media covers such an event, that is, which role is attributed to each actor within this narration.

Quantitative Narrative Analysis (QNA) is an approach to the analysis of news content that requires the identification of the key actors and of the kind of interactions they have with each other (Franzosi, 2010). It usually requires a significant amount of manual labour for "coding" the news articles, and this limits the analysis to small samples. We claim that the most interesting relations come from analysing large networks resulting from tens of thousands of articles, and therefore that QNA needs to be automated.

Our approach is to use a parser to extract simple SVO triplets, forming a semantic graph; to identify the noun phrases with actors; and to classify the verbal links between actors into three simple categories: those expressing political support, those expressing political opposition, and the rest. By identifying the most important actors and triplets, we form a large weighted and directed network which we analyse for various types of patterns.

In this paper we demonstrate an automated system that can identify articles relevant to the 2012 US Presidential Election from 719 online news outlets, and can extract information about the key players, their relations, and the role they play in the electoral narrative. The system refreshes its information every 24 hours and has already analysed tens of thousands of news articles. The tool allows the user to browse the growing set of news articles by the relations between actors, for example retrieving all articles where Mitt Romney


praises Obama (Barack Obama and Mitt Romney are the two main opposing candidates in the 2012 U.S. Presidential Elections).

A set of interactive plots allows users to explore the news data by following specific candidates and specific types of relations; to see a spectrum of all key actors sorted by their political affinity, a network representing relations of political support between actors, and a two-dimensional space where proximity again represents political affinity; and to access information about the role mostly played by a given actor in the media narrative: that of a subject or that of an object.

The ElectionWatch system is built on top of our infrastructure for news content analysis, which has been described elsewhere. It also has access to named-entity information, with which it can generate timelines and activity maps. These are also available through the web interface.

2 Data Collection

Our system collects news articles from 719 English-language news outlets. We monitor both U.S. and international media. A detailed description of the underlying infrastructure has been presented in our previous work (Flaounas, 2011).

In this demo we use only articles related to the US Elections. We detect those articles using a topic detector based on Support Vector Machines (Chang, 2011). We trained and validated our classifier using the specialised election news feed from Yahoo!. The performance of the classifier reached 83.46% precision and 73.29% recall, validated on unseen articles.

While the main focus of the paper is to present narrative patterns in election stories, the system also presents timelines and activity maps generated from the Named Entities detected in connection with the election process.

3 Methodology

We apply a series of methods for narrative analysis. Figure 1 illustrates the main components used to analyse the news and create the website.

Preprocessing. First, we perform co-reference and anaphora resolution on each U.S. Election article. This is based on the ANNIE plugin in GATE (Cunningham, 2002). Next, we extract Subject-Verb-Object (SVO) triplets using the Minipar parser output (Lin, 1998). An extracted triplet is denoted, for example, as "Obama(S)–Accuse(V)–Republicans(O)". We found that news media contain less than 5% passive sentences, so passives are ignored. We store each triplet in a database, annotated with a reference to the article from which it was extracted. This allows us to track the background information of each triplet in the database.

Key Actors. From the extracted triplets, we build a list of actors, defined as the subjects and objects of triplets. We rank actors by frequency and consider the top 50 subjects and objects the key actors.

Polarity of Actions. The verb elements in triplets are defined as actions. We map actions to two specific action types: endorsement and opposition. We obtained the endorsement/opposition polarity of verbs using the VerbNet data (Kipper et al., 2006).

Extraction of Relations. We retain all triplets that have a) the key actors as subjects or objects; and b) an endorse/oppose verb. To extract relations we introduced a weighting scheme. Each endorsement relation between actors a and b is weighted by w_{a,b}:

$w_{a,b} = \frac{f_{a,b}(+) - f_{a,b}(-)}{f_{a,b}(+) + f_{a,b}(-)}$   (1)

where f_{a,b}(+) denotes the number of triplets between a and b with a positive relation and f_{a,b}(-) the number with a negative relation. This way, actors with an equal number of positive and negative relations are eliminated.
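A direct transcription of Eq. 1 (a sketch; the triplet counts are assumed to come from the database described above) could be:

```python
from typing import Optional

def endorsement_weight(pos: int, neg: int) -> Optional[float]:
    """Eq. 1: weight w_{a,b} from the counts of positive (endorsing) and
    negative (opposing) triplets between two actors a and b."""
    if pos + neg == 0:
        return None  # no signed triplets between these actors
    return (pos - neg) / (pos + neg)

# Example: 12 endorsing and 4 opposing triplets give weight 0.5;
# balanced counts give 0.0, and such actor pairs are eliminated.
print(endorsement_weight(12, 4))
```

The subject/object bias of Eq. 2 below has the same normalised-difference form, so an analogous helper applies with subject and object counts.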

Endorsement Network. We generate a triplet network with the weighted relations, where actors are the nodes and the weights calculated by Eq. 1 are the links. This network reveals endorse/oppose relations between key actors. The network on the main page of the ElectionWatch website, illustrated in Fig. 2, is a typical example of such a network.

Network Partitioning. By using graph partitioning methods we can analyse the allegiance of actors to a party, and therefore their role in the political discourse. The Endorsement Network is a directed graph. To perform its partitioning we first omit directionality by calculating the graph B = A + A^T, where A is the adjacency matrix of the Endorsement Network. We compute the eigenvectors of B and select the eigenvector that


Figure 1: The Pipeline

corresponds to the highest eigenvalue. The elements of the eigenvector represent actors. We sort them by their magnitude and obtain a sorted list of actors. On the website we display only the actors that are most politically polarised, at the two ends of the list. These two sets of actors correlate well with the left-right political ordering in our experiments on past US elections. Since in the first phase of the campaign there are more than two sides, we added a scatter plot using the first two eigenvectors.
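Under this construction the spectral step is a few lines of linear algebra; the sketch below (our own simplification, using numpy, with the actor list assumed to be aligned with the adjacency matrix) sorts actors along the principal eigenvector:

```python
import numpy as np

def spectral_order(A: np.ndarray, actors: list) -> list:
    """Sort actors along the principal eigenvector of B = A + A^T."""
    B = A + A.T                     # omit directionality
    vals, vecs = np.linalg.eigh(B)  # B is symmetric, so eigh applies
    principal = vecs[:, np.argmax(vals)]
    return [actors[i] for i in np.argsort(principal)]
```

The most polarised actors then sit at the two ends of the returned list.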

Subject/Object Bias of Actors. The Subject/Object bias S_a of actor a reveals the role it plays in the news narrative. It is computed as:

$S_a = \frac{f_{Subj}(a) - f_{Obj}(a)}{f_{Subj}(a) + f_{Obj}(a)}$   (2)

A positive value of S_a for actor a indicates that the actor is used more often as a subject, and a negative value indicates that the actor is used more often as an object.

4 The Website

We analyse news related to the 2012 U.S. Elections every day, automatically, and the results of our analysis are presented, integrated, on a publicly available website (http://electionwatch.enm.bris.ac.uk). Figure 2 illustrates the homepage of ElectionWatch. Here we list the key features of the site:

Triplet Graph – The main network in Fig. 2 is created using the weighted relations. A positive sign on an edge indicates an endorsement relation and a negative sign an opposition relation. By clicking on an edge in the network, the user is shown the triplets and articles that support the relation.


Actor Spectrum – The left side of Fig. 2 shows the Actor Spectrum, coloured from blue for Democrats to red for Republicans. The actor spectrum was obtained by applying spectral graph partitioning methods to the triplet network. Note that currently more than two campaigns run in parallel between the key actors that dominate the election news coverage. Nevertheless, we still find that the two main opposing candidates in each party were on either side of the list.

Relations – On the right-hand side of the website we show the endorsement/opposition relations between key actors, for example "Republicans Oppose Democrats". When clicking on a relation, the webpage displays the news articles that support it.

Actor Space – The tab labelled 'Actor Space' plots the first and second eigenvector values for all actors in the actor spectrum.

Actor Bias – The tab labelled 'Actor Bias' plots the subject/object bias of actors against the first eigenvector in a two-dimensional space.

Pie Chart – The pie chart at the bottom left of the webpage shows the share of each actor in the total number of articles mentioning an endorse/oppose relation.

Map – The map geo-locates articles that are related to the US Elections and refer to US locations.

Bar Chart – The bar chart tab, illustrated in Fig. 3, plots the number of articles in which actors were involved in an endorse/oppose relation. The height of each column reflects this frequency. The default plot focuses on only the first five actors in the actor spectrum.

Timelines & Activity Map – We track the activity of each named entity in the actor spectrum within the United States and present it in a timeline. The activity map monitors the media attention for Presidential candidates in each state of the United States. At present we monitor this activity for Mitt Romney, Rick Perry, Michele Bachmann, Herman Cain and Barack Obama.

Figure 2: Screenshot of the home page of ElectionWatch

Figure 3: Bar chart showing endorse/oppose article frequencies for actor "Obama" with other top actors.

5 Discussion

We have demonstrated the system ElectionWatch, which presents the key actors in U.S. election news articles and their role in the political discourse. This builds on various recent contributions from the field of Pattern Analysis, such as (Trampus, 2011), augmenting them with multiple analysis tools that respond to the needs of social science investigations.

We acknowledge that the triplets extracted by the system are not very clean. This noise can be tolerated since we perform the analysis only on filtered triplets containing key actors and specific types of actions, and since the triplets are extracted from a huge amount of data.

We have tested this system on data from all six previous elections, using the New York Times corpus as well as our own database. We use only support/criticism relations, which reveal a strong polarisation among actors, and this seems to correspond to the left/right political dimension. Evaluation is an issue due to the lack of data, but results on the past six election cycles on the New York Times always separated the two competing candidates along the eigenvector spectrum. This is not so easy in the primary part of the elections, when multiple candidates compete with each other for the role of contender. To cover this case, we also generate a two-dimensional plot using the first two eigenvectors of the adjacency matrix, which seems to capture the main groupings in the political narrative.

Future work will include making better use of the information coming from the parser, which goes well beyond the simple SVO structure of sentences, and developing more sophisticated methods for the analysis of the large and complex networks that can be inferred with the methodology we have developed.

Acknowledgments

I. Flaounas and N. Cristianini are supported by FP7 CompLACS; N. Cristianini is supported by a Royal Society Wolfson Merit Award; the members of the Intelligent Systems Laboratory are supported by the 'Pascal2' Network of Excellence. The authors would like to thank Omar Ali and Roberto Franzosi.

References

Chang C.C. and Lin C.J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27.

Cunningham H., Maynard D., Bontcheva K. and Tablan V. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proc. of the 40th Anniversary Meeting of the Association for Computational Linguistics, 168–175.

Earl J., Martin A., McCarthy J.D. and Soule S.A. 2004. The Use of Newspaper Data in the Study of Collective Action. Annual Review of Sociology, 30:65–80.

Flaounas I., Ali O., Turchi M., Snowsill T., Nicart F., De Bie T. and Cristianini N. 2011. NOAM: News Outlets Analysis and Monitoring system. Proc. of the 2011 ACM SIGMOD International Conference on Management of Data, 1275–1278.

Franzosi R. 2010. Quantitative Narrative Analysis. Sage Publications Inc, Quantitative Applications in the Social Sciences, 162–200.

Kipper K., Korhonen A., Ryant N. and Palmer M. 2006. Extensive Classifications of English verbs. 12th EURALEX International Congress, Turin, Italy.

Lin D. 1998. Dependency-Based Evaluation of Minipar. Text, Speech and Language Technology, 20:317–329.

Sandhaus E. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium.

Trampus M. and Mladenic D. 2011. Learning Event Patterns from Text. Informatica, 35.


Query log analysis with GALATEAS LangLog

Marco Trevisan and Luca Dini
CELI
[email protected], [email protected]

Eduard Barbu
Universita di Trento
[email protected]

Igor Barsanti
Gonetwork
[email protected]

Nikolaos Lagos
Xerox Research Centre Europe
[email protected]

Frederique Segond and Mathieu Rhulmann
Objet Direct
[email protected], [email protected]

Ed Vald
Bridgeman Art Library
[email protected]

Abstract

This article describes GALATEAS LangLog, a system performing Search Log Analysis. LangLog illustrates how NLP technologies can be a powerful support tool for market research even when the source of information is a collection of queries, each consisting of a few words. We push standard Search Log Analysis forward by taking into account the semantics of the queries. The main innovation of LangLog is the implementation of two highly customizable components that cluster and classify the queries in the log.

1 Introduction

Transaction logs become increasingly important for studying user interaction with systems like Web search engines, digital libraries, intranet servers and others (Jansen, 2006). Various service providers keep log files recording the user interaction with their search engines. Transaction logs are useful for understanding user search strategies, but also for improving query suggestion (Wen and Zhang, 2003) and for enhancing the retrieval quality of search engines (Joachims, 2002). The process of analyzing transaction logs to understand user behaviour and assess system performance is known as Transaction Log Analysis (TLA). Transaction Log Analysis is concerned with the analysis of both browsing and searching activity inside a website; the analysis of transaction logs that focuses on search activity only is known as Search Log Analysis (SLA). According to Jansen (2008), both TLA and SLA have three stages: data collection, data preparation and data analysis. In the data collection stage, one collects data describing the user interaction with the system. Data preparation is the process of loading the collected data into a relational database; the loaded data gives a transaction log representation independent of the particular log syntax. In the final stage, the data prepared in the previous step is analyzed. One may notice that this traditional three-level log analysis gives a syntactic view of the information in the logs: counting terms, measuring the logical complexity of queries, or the simple procedures that associate queries with sessions in no way access the semantics of the queries. The LangLog system addresses the semantic problem by performing clustering and classification on real query logs. Clustering the queries in the logs allows the identification of meaningful groups of queries; classifying the queries according to a relevant list of categories permits assessing how well the search engine meets user needs. In addition, the LangLog system addresses problems like automatic language identification, Named Entity Recognition, and automatic query translation. The rest of the paper is organized as follows: the next section briefly reviews some systems performing SLA; then we present the data sources, the architecture and the analysis process of the LangLog system; finally, the conclusion summarizes the work and presents some possible enhancements of LangLog.


2 Related work

The information in log files is useful in many ways, but its extraction raises many challenges and issues. Facca and Lanzi (2005) offer a survey of the topic. There are several commercial systems to extract and analyze this information, such as Adobe web analytics1, SAS Web Analytics2, Infor Epiphany3 and IBM SPSS4. These products are often part of a customer relationship management (CRM) system. None of them includes any form of linguistic processing. On the other hand, Web queries have been the subject of linguistic analysis aimed at improving the performance of information retrieval systems: for example, one study (Monz and de Rijke, 2002) experimented with shallow morphological analysis, and another (Li et al., 2006) analyzed queries to remove spelling mistakes. These works encourage our belief that linguistic analysis could be beneficial for Web log analysis systems.

3 Data sources

LangLog requires the following information from the Web logs: the time of the interaction, the query, click-through information and possibly more. LangLog processes log files which conform to the W3C extended log format; no other formats are supported. The system prototype is based on query logs spanning one month of interactions recorded at the Bridgeman Art Library5. The Bridgeman Art Library holds a large repository of images coming from 8,000 collections and representing more than 29,000 artists.

4 Analyses

LangLog organizes the search log data into units called queries and hits. In a typical searching scenario, a user submits a query to the content provider's site search engine and clicks on some (or none) of the search results. From now on we will refer to a clicked item as a hit, and to the text typed by the user as the query. This information alone is valuable to the content provider because it allows discovering which queries were served with results that satisfied the user, and which queries were not.

1 http://www.omniture.com/en/products/analytics
2 http://www.sas.com/solutions/webanalytics/index.html
3 http://www.infor.com
4 http://www-01.ibm.com/software/analytics/spss/
5 http://www.bridgemanart.com

LangLog extracts queries and hits from the log files, and performs the following analyses on the queries:

• language identification

• tokenization and lemmatization

• named entity recognition

• classification

• cluster analysis

Language information may help the content provider decide whether to translate the content into new languages.

Lemmatization is especially important in languages like German and Italian that have a rich morphology. Frequency statistics of keywords help understand what users want, but they are biased towards items associated with words with lesser orthographic and morpho-syntactic variation. For example, two thousand queries for "trousers", one thousand queries for "handbag" and another thousand queries for "handbags" mean that handbags are as popular as trousers, although statistics based on raw words would say otherwise.

Named entity extraction helps the content provider for the same reasons lemmatization does. Named entities are especially important because they identify real-world items that the content provider can relate to, which lemmas less often do. The named entities and the most important concepts can afterwards be linked to resources like Wikipedia, which offer a rich specification of their properties.

Both classification and clustering allow the content provider to understand what kinds of things the users look for and how this information is targeted by means of queries.

Classification consists of classifying queries into categories drawn from a classification schema. When the schema used to classify is different from the schema used in the content provider's website, classification may provide hints as to what kinds of queries are not matched by items in the website. In a similar way, cluster analysis can be used to identify new market segments or new trends in the users' behaviour. Cluster analysis provides more flexibility than classification, but the information it produces is less precise. Many trials and errors may be necessary before finding interesting results. One hopes that the final clustering solution will give insights into the patterns of users' searches. For example, an online book store may discover that one cluster contains many software-related terms, although none of those terms is popular enough to be noticeable in the statistics.

5 Architecture

LangLog consists of three subsystems: log acquisition, log analysis and log disclosure. Periodically the log acquisition subsystem gathers new data, which it passes to the log analysis component. The results of the analyses are then made available through the log disclosure subsystem.

Log acquisition deals with the acquisition, normalization and anonymization of the data contained in the content provider's log files. The data flows from the content provider's servers to LangLog's central database. This process is carried out by a series of Pentaho Data Integration6 procedures.

Log analysis deals with the analysis of the data. The analyses proper are executed by NLP systems provided by third parties and accessible as Web services. LangLog uses NLP Web services for language identification, morpho-syntactic analysis, named entity recognition, classification and clustering. The analyses are stored in the database along with the original data.

Log disclosure is actually a collection of independent systems that allow the content providers to access their information and the analyses. Log disclosure systems are also concerned with access control and protection of privacy. The content provider can access the output of LangLog using AWStats, QlikView or JPivot.

• AWStats7 is a widely used log analysis system for websites. The logs gathered from the websites are parsed by AWStats, which generates a complete report about visitors, visit duration, visitors' countries and other data, disclosing useful information about visitor behavior.

6 http://kettle.pentaho.com
7 http://awstats.sourceforge.net

• QlikView8 is a business intelligence (BI) platform. A BI platform provides historical, current, and predictive views of business operations. Usually such tools are used by companies to have a clear view of their business over time. In LangLog, QlikView does not display the evolution of sales or costs over time; instead, it displays queries on the content provider's website over time. A dashboard with many elements (input selections, tables, charts, etc.) provides a wide range of tools to visualize the data.

• JPivot9 is a front-end for Mondrian. Mondrian10 is an Online Analytical Processing (OLAP) engine, a system capable of handling and analyzing large quantities of data. JPivot allows the user to explore the output of LangLog by slicing the data along many dimensions. JPivot allows the user to display charts, export results to Microsoft Excel or CSV, and use custom OLAP MDX queries.

8 http://www.qlikview.com
9 http://jpivot.sourceforge.net
10 http://mondrian.pentaho.com


5.1 Language Identification

The system uses a language identification system (Bosca and Dini, 2010) which offers language identification for English, French, Italian, Spanish, Polish and German. The system uses four different strategies (a toy sketch of the first strategy follows the list):

• N-gram character models: uses the distance between the character-based models of the input and of a reference corpus for the language (Wikipedia).

• Word frequency: looks up the frequency of the words in the query with respect to a reference corpus for the language.

• Function words: searches for particles highly connoting a specific language (such as prepositions and conjunctions).


89

Page 100: Proceedings of the 13th Conference of the European Chapter

• Prior knowledge: provides a default guess based on a set of hypotheses and heuristics like region/browser language.
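As a toy illustration of the first strategy only (this is not the system of Bosca and Dini (2010); real profiles would be built from Wikipedia dumps):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_score(query: str, reference: Counter) -> float:
    """Cosine similarity between a query profile and a language profile."""
    q = char_ngrams(query)
    dot = sum(c * reference.get(g, 0) for g, c in q.items())
    norm = (sum(c * c for c in q.values()) ** 0.5
            * sum(c * c for c in reference.values()) ** 0.5)
    return dot / norm if norm else 0.0

def identify_language(query: str, profiles: dict) -> str:
    """Pick the language whose reference profile best matches the query."""
    return max(profiles, key=lambda lang: ngram_score(query, profiles[lang]))
```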

5.2 Lemmatization

To perform lemmatization, LangLog uses general-purpose morpho-syntactic analysers based on the Xerox Incremental Parser (XIP), a deep robust syntactic parser (Ait-Mokhtar et al., 2002). The system has been adapted with domain-specific part-of-speech disambiguation grammar rules, according to the results of a linguistic study of the development corpus.

5.3 Named entity recognition

LangLog uses the Xerox named entity recognition web service (Brun and Ehrmann, 2009) for English and French. XIP also includes a named entity detection component, based on a combination of lexical information and hand-crafted contextual rules. For example, the named entity recognition system was adapted to handle titles of portraits, which were frequent in our dataset. While for the other NLP tasks LangLog uses the same system for every content provider, named entity recognition produces better analyses when it is tailored to the domain of the content. Because LangLog uses a NER Web service, it is easy to replace the default NER system with a different one. So if the content provider is interested in the development of a NER system tailored to a specific domain, LangLog can accommodate this.

5.4 Clustering

We developed two clustering systems: one performs hierarchical clustering, the other performs soft clustering (a soft-clustering sketch follows the list).

• CLUTO: the hierarchical clustering system relies on CLUTO11, a clustering toolkit. To understand the main ideas CLUTO is based on, one might consult Zhao and Karypis (2002). The clustering process proceeds as follows. First, the set of queries to be clustered is partitioned into k groups, where k is the number of desired clusters. To do so, the system uses a partitional clustering algorithm which finds the k-way clustering solution by making repeated bisections. Then the system arranges the clusters in a hierarchy by successively merging the most similar clusters into a tree.

11 http://glaros.dtc.umn.edu/gkhome/views/cluto

• MALLET: the soft clustering system we developed relies on MALLET (McCallum, 2002), a Latent Dirichlet Allocation (LDA) toolkit (Steyvers and Griffiths, 2007). Our MALLET-based system considers each query a document and builds a topic model describing the documents. The resulting topics are the clusters. Each query is associated with each topic with a certain strength. Unlike the system based on CLUTO, this system produces soft clusters, i.e. each query may belong to more than one cluster.

5.5 Classification

LangLog allows the same query to be classified several times, using different classification schemas and different classification strategies. The result of the classification of an input query is always a map that assigns each category a weight: the higher the weight, the more likely the query belongs to the category. While NER merely performs better when tailored to a specific domain, classification is hardly useful without any customization: we need a different classification schema for each content provider. We developed two classification systems, an unsupervised one and a supervised one (a sketch of the unsupervised scoring step follows the list).

• Unsupervised: this system requires neither training data nor a domain-specific corpus. The output weight of each category is computed as the cosine similarity between the vector models of the most representative Wikipedia article for the category and of the collection of Wikipedia articles most relevant to the input query. Our evaluation on the KDD-Cup 2005 dataset results in 19.14 precision and 22.22 F-measure. For comparison, the state of the art in the competition achieved a 46.1 F-measure. Our system could not achieve a similar score because it is unsupervised and therefore cannot make use of the KDD-Cup training dataset. In addition, it uses only the query to perform classification, whereas the KDD-Cup systems were also able to access the result sets associated with the queries.


• Supervised: this system is based on the Weka framework and can therefore use any machine learning algorithm implemented in Weka. It uses features derived from the queries and from Bridgeman metadata. We trained a Naive Bayes classifier on a set of 15,000 queries annotated with 55 categories and hits, and obtained an F-measure of 0.26. The results obtained for classification are encouraging but not yet at the level of the state of the art. The main reason for this is the use of only in-house metadata in the feature computation. In the future we will improve both components by providing them with features from large resources like Wikipedia or by exploiting the results returned by Web search engines.
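A minimal sketch of the unsupervised scoring step (assuming bag-of-words vectors have already been derived from the relevant Wikipedia articles) could be:

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(query_vector: dict, category_vectors: dict) -> dict:
    """Weight each category by its similarity to the query representation.

    query_vector     -- vector built from Wikipedia articles relevant to the query
    category_vectors -- category -> vector of its most representative article
    """
    return {cat: cosine(query_vector, vec)
            for cat, vec in category_vectors.items()}
```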

6 Demonstration

Our demonstration presents:

• The setting of our case study: the Bridgeman Art Library website, a typical user search, and what is recorded in the log file.

• The conceptual model of the results of the analyses: search episodes, queries, lemmas, named entities, classification, clustering.

• The data flow across the parts of the system, from the content provider's servers to the front-end, through databases, NLP Web services and data marts.

• The result of the analyses via QlikView.

7 Conclusion

In this paper we presented the LangLog system, a customizable system for analyzing query logs. LangLog performs language identification, lemmatization, NER, classification and clustering for query logs. We tested the LangLog system on queries from the Bridgeman Art Library. In the future we will test the system on query logs from different domains (e.g. pharmaceutical, hardware and software, etc.), thus increasing the coverage and the significance of the results. Moreover, we will incorporate session information into our system, which should increase the precision of both the clustering and the classification components.

References

Salah Ait-Mokhtar, Jean-Pierre Chanod and Claude Roux. 2002. Robustness Beyond Shallowness: Incremental Deep Parsing. Journal of Natural Language Engineering, 8(2-3):121-144.

Alessio Bosca and Luca Dini. 2010. Language Identification Strategies for Cross Language Information Retrieval. CLEF 2010 Working Notes.

C. Brun and M. Ehrmann. 2007. Adaptation of a Named Entity Recognition System for the ESTER 2 Evaluation Campaign. In Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering.

F. M. Facca and P. L. Lanzi. 2005. Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering, 53(3):225-241.

Jansen, B. J. 2006. Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3):407-432.

Jansen, B. J. 2008. The methodology of search log analysis. In B. J. Jansen, A. Spink and I. Taksa (eds.), Handbook of Web Log Analysis, 100-123. Hershey, PA: IGI.

Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 133-142.

M. Li, Y. Zhang, M. Zhu, and M. Zhou. 2006. Exploring distributional similarity based models for query spelling correction. In Proceedings of ACL 06: the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, 1025-1032.

Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.

C. Monz and M. de Rijke. 2002. Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian. In Proceedings of CLEF 2001. Springer.

M. Steyvers and T. Griffiths. 2007. Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis and W. Kintsch (eds.), Handbook of Latent Semantic Analysis. Psychology Press.

J. R. Wen and H. J. Zhang. 2003. Query Clustering in the Web Context. In Wu, Xiong and Shekhar (eds.), Information Retrieval and Clustering, 195-226. Kluwer Academic Publishers.

Y. Zhao and G. Karypis. 2002. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the ACM Conference on Information and Knowledge Management.


A platform for collaborative semantic annotation

Valerio Basile, Johan Bos, Kilian Evang and Noortje Venhuizen
v.basile,johan.bos,k.evang,[email protected]
Center for Language and Cognition Groningen (CLCG)
University of Groningen, The Netherlands

Abstract

Data-driven approaches in computational semantics are not common because there are only few semantically annotated resources available. We are building a large corpus of public-domain English texts and annotating them semi-automatically with syntactic structures (derivations in Combinatory Categorial Grammar) and semantic representations (Discourse Representation Structures), including events, thematic roles, named entities, anaphora, scope, and rhetorical structure. We have created a wiki-like Web-based platform on which a crowd of expert annotators (i.e. linguists) can log in and adjust linguistic analyses in real time, at various levels of analysis, such as boundaries (tokens, sentences) and tags (part of speech, lexical categories). The demo will illustrate the different features of the platform, including navigation, visualization and editing.

1 Introduction

Data-driven approaches in computational semantics are still rare because there are not many large annotated resources that provide empirical information about anaphora, presupposition, scope, events, tense, thematic roles, named entities, word senses, ellipsis, discourse segmentation and rhetorical relations in a single formalism. This is not surprising, as it is challenging and time-consuming to create such a resource from scratch.

Nevertheless, our objective is to develop a large annotated corpus of Discourse Representation Structures (Kamp and Reyle, 1993), comprising most of the aforementioned phenomena: the Groningen Meaning Bank (GMB). We aim to reach this goal by:

1. Providing a wiki-like platform supporting collaborative annotation efforts;

2. Employing state-of-the-art NLP software for bootstrapping semantic analysis;

3. Giving real-time feedback on annotation adjustments and their effect on the resulting syntactic and semantic analysis;

4. Ensuring kerfuffle-free dissemination of our semantic resource by considering only public-domain texts for annotation.

We have developed the wiki-like platform from scratch simply because existing annotation systems, such as GATE (Dowman et al., 2005), NITE (Carletta et al., 2003), or UIMA (Hahn et al., 2007), do not offer the functionality required for deep semantic annotation combined with crowdsourcing.

In this description of our platform, we motivate our choice of data and explain how we manage it (Section 2), we describe the complete toolchain of NLP components employed in the annotation-feedback process (Section 3), and we introduce the Web-based interface itself, describing how linguists can adjust boundaries of tokens and sentences and revise tags of named entities, parts of speech and lexical categories (Section 4).

2 Data

The goal of the Groningen Meaning Bank is to provide a widely available corpus of texts with deep semantic annotations. The GMB only comprises texts from the public domain, whose distribution is not subject to copyright restrictions. Moreover, we include texts from various genres and sources, resulting in a rich, comprehensive corpus appropriate for use in various disciplines within NLP.

The documents in the current version of the GMB are all in English and originate from four main sources: (i) Voice of America (VOA), an online newspaper published by the US Federal Government; (ii) the Manually Annotated Sub-Corpus (MASC) from the Open American National Corpus (Ide et al., 2010); (iii) country descriptions from the CIA World Factbook (CIA) (Central Intelligence Agency, 2006), in particular the Background and Economy sections; and (iv) a collection of Aesop's fables (AF). All these documents are in the public domain and are thus redistributable, unlike, for example, the WSJ data used in the Penn Treebank (Miltsakaki et al., 2004).

Each document is stored with a separate file containing metadata. This may include the language the text is written in, the genre, date of publication, source, title, and terms of use of the document. This metadata is stored as a simple feature-value list.
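For illustration, a metadata file in this style might look as follows (the field names here are hypothetical, as the paper does not fix an exact inventory):

language: en
genre: newswire
source: VOA
date_published: 2010-07-15
title: <document title>
terms_of_use: public domain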

The documents in the GMB are categorized with different statuses. Initially, newly added documents are labeled as uncategorized. As we manually review them, they are relabeled as either accepted (the document will be part of the next stable version, which will be released at regular intervals), postponed (there is some difficulty with the document that can possibly be solved in the future) or rejected (something is wrong with the document form, e.g., character encoding, or with the content, e.g., it contains offensive material).

Currently, the GMB comprises 70K English text documents (Table 1), corresponding to 1.3 million sentences and 31.5 million tokens.

Table 1: Documents in the GMB, as of March 5, 2012

Documents       VOA    MASC   CIA    AF     All
Accepted       4,651     34   515     0   5,200
Uncategorized 61,090      0     0   834  61,924
Postponed      2,397    339     3     1   2,740
Rejected         184     27     4     0     215
Total         68,322    400   522   835  70,079

3 The NLP Toolchain

The process of building the Groningen Meaning Bank takes place in a bootstrapping fashion. A chain of software is run, taking the raw text documents as input. The output of this automatic process is in the form of several layers of stand-off annotations, i.e., files with links to the original, raw documents.
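As a sketch of the idea (the concrete GMB file format is not specified here), a stand-off record can be modelled as a tuple that points into the raw text by character offsets:

from typing import NamedTuple

class StandoffAnnotation(NamedTuple):
    doc_id: str   # which raw document the record refers to
    begin: int    # character offset where the span starts
    end: int      # character offset where the span ends
    layer: str    # e.g. "token", "pos", "namedentity"
    value: str    # the tag or label assigned to the span

# e.g. a POS layer entry for the first six characters of a document:
ann = StandoffAnnotation("d00001", 0, 6, "pos", "NNP")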

We employ a chain of NLP components that carry out, respectively, tokenization and sentence boundary detection, POS tagging, lemmatization, named entity recognition, supertagging, parsing using the formalism of Combinatory Categorial Grammar (Steedman, 2001), and semantic and discourse analysis using the framework of Discourse Representation Theory (DRT) (Kamp and Reyle, 1993) with rhetorical relations (Asher, 1993).

The lemmatizer used is morpha (Minnen et al., 2001); the other steps are carried out by the C&C tools (Curran et al., 2007) and Boxer (Bos, 2008).

3.1 Bits of Wisdom

After each step in the toolchain, the intermediate result may be automatically adjusted by auxiliary components that apply annotations provided by expert users or other sources. These annotations are represented as "Bits of Wisdom" (BOWs): tuples of information regarding, for example, token and sentence boundaries, tags, word senses or discourse relations. They are stored in a MySQL database and can originate from three different sources: (i) explicit annotation changes made by experts using the Explorer Web interface (see Section 4); (ii) an annotation game played by non-experts, similar to 'games with a purpose' like Phrase Detectives (Chamberlain et al., 2008) and Jeux de Mots (Artignan et al., 2009); and (iii) external NLP tools (e.g., for word sense disambiguation or co-reference resolution).

Since BOWs come from various sources, they may contradict each other. In such cases, a judge component resolves the conflict, currently by preferring the most recent expert BOW. Future work will involve the application of different judging techniques.
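A minimal sketch of this judging rule, assuming BOWs carry source and timestamp fields (the real component's interface is not described in the paper):

def judge(conflicting_bows):
    # Resolve a conflict among BOWs targeting the same annotation:
    # prefer the most recent expert BOW. Falling back to the most
    # recent BOW overall when no expert BOW exists is our assumption.
    expert = [b for b in conflicting_bows if b["source"] == "expert"]
    pool = expert if expert else conflicting_bows
    return max(pool, key=lambda b: b["timestamp"])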

3.2 Processing Cycle

Figure 1: A screenshot of the web interface, displaying a tokenised document.

The widely known open-source tool GNU make is used to orchestrate the toolchain while avoiding unnecessary reprocessing. The need to rerun the toolchain for a document arises in three situations: a new BOW for that document is available; a new, improved version of one of the components is available; or reprocessing is forced by a user via the "reprocess" button in the Web interface. A continually running program, the 'updating daemon', is responsible for calling make for the right document at the right time. It checks the database for new BOWs or manual reprocessing requests at very short intervals to ensure immediate response to changes experts make via the Web interface. It also updates and rebuilds the components at longer intervals and continuously loops through all documents, remaking them with the newest versions of the components. The number of make processes that can run in parallel is configurable; standard techniques of concurrent programming are used to prevent more than one make process from working simultaneously on the same document.

4 The Expert Interface

We developed a wiki-like Web interface, called the GMB Explorer, that provides users access to the Groningen Meaning Bank. It fulfills three main functions: navigation and search through the documents, visualization of the different levels of annotation, and manual correction of the annotations. We will discuss these functions below.

4.1 Navigation and Search

The GMB Explorer allows navigation through the documents of the GMB with their stand-off annotations (Figure 1). The default order of documents is based on their size in terms of number of tokens. It is possible to apply filters to restrict the set of documents to be shown: showing only documents from a specific subcorpus, or specifically showing documents with or without warnings generated by the NLP toolchain.

The Explorer interface comes with a built-in search engine. It allows users to pose single- or multi-word queries. The search results can then be restricted further by looking for a specific lexical category or part of speech. A more advanced search system that is based on a semantic lexicon with lexical information about all levels of annotation is currently under development.

4.2 Visualization

The different visualization options for a document are placed in tabs: each tab corresponds to a specific layer of annotation or additional information. Besides the raw document text, users can view its tokenized version, an interactive derivation tree per sentence, and the semantic representation of the entire discourse in graphical DRS format. There are three further tabs in the Explorer: a tab containing the warnings produced by the NLP pipeline (if any), one containing the Bits of Wisdom that have been collected for the document, and a tab with the document metadata.

The sentences view allows the user to show or hide sub-trees per sentence and additional information such as POS tags, word senses, supertags and partial, unresolved semantics. The derivations are shown using the CCG notation, generated by XSLT stylesheets applied to Boxer's XML output. An example is shown in Figure 2.

The discourse view shows a fully resolved semantic representation in the form of a DRS with rhetorical relations. Clicking on discourse units switches the visualization between text and semantic representation. Figure 3 shows how DRSs are visualized in the Web interface.

Figure 2: An example of a CCG derivation as shown in GMB Explorer.

Figure 3: An example of the semantic representations in the GMB, with DRSs representing discourse units.

4.3 Editing

Some of the tabs in the Explorer interface have an "edit" button. This allows registered users to manually correct certain types of annotations. Currently, the user can edit in the tokenization view and in the derivation view. Clicking "edit" in the tokenization view gives an annotator the possibility to add and remove token and sentence boundaries in a simple and intuitive way, as Figure 4 illustrates. This editing is done in real time, following the WYSIWYG strategy, with tokens separated by spaces and sentences separated by new lines. In the derivation view, the annotator can change part-of-speech tags and named entity tags by selecting a tag from a drop-down list (Figure 5).

Figure 4: Tokenization edit mode. Clicking on the red '×' removes a sentence boundary after the token; clicking on the green '+' adds a sentence boundary.

Figure 5: Tag edit mode, showing a derivation with partial DRSs and illustrating how to adjust a POS tag.

As the updating daemon is running continually, the document is immediately reprocessed after editing, so that the user can directly view the new annotation with his BOW taken into account. Re-analyzing a document typically takes a few seconds, although for very large documents it can take longer. It is also possible to directly rerun the NLP toolchain on a specific document via the "reprocess" button, in order to apply the most recent version of the software components involved. The GMB Explorer shows a timestamp of the last processing for each document.

We are currently working on developing new editing options, which will allow users to change different aspects of the semantic representation, such as word senses, thematic roles, co-reference and scope.

5 Demo

In the demo session we show the functionality of the various features in the Web-based user interface of the GMB Explorer, which is available online via: http://gmb.let.rug.nl.

We show (i) how to navigate and search through all the documents, including the refinement of search on the basis of the lexical category or part of speech, (ii) the operation of the different view options, including the raw, tokenized, derivation and semantics views of each document, and (iii) how adjustments to annotations can be realised in the Web interface. More concretely, we demonstrate how boundaries of tokens and sentences can be adapted, and how different types of tags can be changed (and how that affects the syntactic, semantic and discourse analysis).

In sum, the demo illustrates innovation in the way changes are made and how they improve the linguistic analysis in real time. Because it is a web-based platform, it paves the way for a collaborative annotation effort. Currently it is actively in use as a tool to create a large semantically annotated corpus for English texts: the Groningen Meaning Bank.


References

Guillaume Artignan, Mountaz Hascoet, and Mathieu Lafourcade. 2009. Multiscale visual analysis of lexical networks. In 13th International Conference on Information Visualisation, pages 685–690, Barcelona, Spain.

Nicholas Asher. 1993. Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.

Johan Bos. 2008. Wide-Coverage Semantic Analysis with Boxer. In J. Bos and R. Delmonte, editors, Semantics in Text Processing. STEP 2008 Conference Proceedings, volume 1 of Research in Computational Semantics, pages 277–286. College Publications.

J. Carletta, S. Evert, U. Heid, J. Kilgour, J. Robertson, and H. Voormann. 2003. The NITE XML toolkit: flexible annotation for multi-modal language data. Behavior Research Methods, Instruments, and Computers, 35(3):353–363.

Central Intelligence Agency. 2006. The CIA World Factbook. Potomac Books.

John Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2008. Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts. In Johan Bos and Rodolfo Delmonte, editors, Semantics in Text Processing. STEP 2008 Conference Proceedings, volume 1 of Research in Computational Semantics, pages 375–380. College Publications.

James Curran, Stephen Clark, and Johan Bos. 2007. Linguistically Motivated Large-Scale NLP with C&C and Boxer. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume: Proceedings of the Demo and Poster Sessions, pages 33–36, Prague, Czech Republic.

Mike Dowman, Valentin Tablan, Hamish Cunningham, and Borislav Popov. 2005. Web-assisted annotation, semantic indexing and search of television and radio news. In Proceedings of the 14th International World Wide Web Conference, pages 225–234, Chiba, Japan.

U. Hahn, E. Buyko, K. Tomanek, S. Piao, J. McNaught, Y. Tsuruoka, and S. Ananiadou. 2007. An annotation type system for a data-driven NLP pipeline. In Proceedings of the Linguistic Annotation Workshop, pages 33–40, Prague, Czech Republic, June. Association for Computational Linguistics.

Nancy Ide, Christiane Fellbaum, Collin Baker, and Rebecca Passonneau. 2010. The manually annotated sub-corpus: a community resource for and by the people. In Proceedings of the ACL 2010 Conference Short Papers, pages 68–73, Stroudsburg, PA, USA.

Hans Kamp and Uwe Reyle. 1993. From Discourse to Logic; An Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and DRT. Kluwer, Dordrecht.

Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2004. The Penn Discourse Treebank. In Proceedings of LREC 2004, pages 2237–2240.

Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Journal of Natural Language Engineering, 7(3):207–223.

Mark Steedman. 2001. The Syntactic Process. The MIT Press.


Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 97–101, Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics

HadoopPerceptron: a Toolkit for Distributed Perceptron Training and Prediction with MapReduce

Andrea Gesmundo
Computer Science Department
University of Geneva
Geneva, Switzerland
[email protected]

Nadi Tomeh
LIMSI-CNRS and Université Paris-Sud
Orsay, France
[email protected]

Abstract

We propose a set of open-source software modules to perform structured Perceptron training, prediction and evaluation within the Hadoop framework. Apache Hadoop is a freely available environment for running distributed applications on a computer cluster. The software is designed within the MapReduce paradigm. Thanks to distributed computing, the proposed software substantially reduces execution times while handling huge data sets. The distributed Perceptron training algorithm preserves convergence properties, and thus guarantees the same accuracy performance as the serial Perceptron. The presented modules can be executed as stand-alone software or easily extended or integrated into complex systems. The execution of the modules applied to specific NLP tasks can be demonstrated and tested via an interactive web interface that allows the user to inspect the status and structure of the cluster and interact with the MapReduce jobs.

1 Introduction

The Perceptron training algorithm (Rosenblatt, 1958; Freund and Schapire, 1999; Collins, 2002) is widely applied in the Natural Language Processing community for learning complex structured models. The non-probabilistic nature of the perceptron parameters makes it possible to incorporate arbitrary features without the need to calculate a partition function, which is required for its discriminative probabilistic counterparts such as CRFs (Lafferty et al., 2001). Additionally, the Perceptron is robust to approximate inference in large search spaces.

Nevertheless, Perceptron training is proportional to inference, which is frequently non-linear in the input sequence size. Therefore, training can be time-consuming for complex model structures. Furthermore, for an increasing number of tasks it is fundamental to leverage huge sources of data such as the World Wide Web. Such difficulties render the scalability of the Perceptron a challenge.

In order to improve scalability, Mcdonald et al. (2010) propose a distributed training strategy called iterative parameter mixing, and show that it has similar convergence properties to the standard perceptron algorithm: it finds a separating hyperplane if the training set is separable; it produces models with comparable accuracies to those trained serially on all the data; and it reduces training times significantly by exploiting computing clusters.

With this paper we present the HadoopPerceptron package. It provides a freely available open-source implementation of the iterative parameter mixing algorithm for training the structured perceptron on generic sequence labeling tasks. Furthermore, the package provides two additional modules for prediction and evaluation. The three software modules are designed within the MapReduce programming model (Dean and Ghemawat, 2004) and implemented using the Apache Hadoop distributed programming framework (White, 2009; Lin and Dyer, 2010). The presented HadoopPerceptron package reduces execution time significantly compared to its serial counterpart while maintaining comparable performance.


PerceptronIterParamMix(T = {(x_t, y_t)} for t = 1..|T|)
1. Split T into S pieces T = {T_1, ..., T_S}
2. w = 0
3. for n : 1..N
4.    w^(i,n) = OneEpochPerceptron(T_i, w)
5.    w = Σ_i μ_(i,n) · w^(i,n)
6. return w

OneEpochPerceptron(T_i, w*)
1. w^(0) = w*; k = 0
2. for t : 1..|T_i|
3.    Let y' = argmax_y' w^(k) · f(x_t, y')
4.    if y' ≠ y_t
5.       w^(k+1) = w^(k) + f(x_t, y_t) − f(x_t, y')
6.       k = k + 1
7. return w^(k)

Figure 1: Distributed perceptron with the iterative parameter mixing strategy. Each w^(i,n) is computed in parallel. μ_n = {μ_(1,n), ..., μ_(S,n)}, with μ_(i,n) ≥ 0 for all i, and Σ_i μ_(i,n) = 1 for all n.

2 Distributed Structured Perceptron

The structured perceptron (Collins, 2002) is an online learning algorithm that processes training instances one at a time during each training epoch. In sequence labeling tasks, the algorithm predicts a sequence of labels (an element from the structured output space) for each input sequence. Prediction is determined by linear operations on high-dimensional feature representations of candidate input-output pairs and an associated weight vector. During training, the parameters are updated whenever the prediction that employed them is incorrect.

Unlike many batch learning algorithms that can easily be distributed through the gradient calculation, perceptron online training is more subtle to parallelize. However, Mcdonald et al. (2010) present a simple distributed training method through a parameter mixing scheme.

The Iterative Parameter Mixing algorithm is given in Figure 1 (Mcdonald et al., 2010). First the training data is divided into disjoint splits of example pairs (x_t, y_t), where x_t is the observation sequence and y_t is the associated label sequence. The algorithm proceeds to train a single epoch of the perceptron algorithm for each split in parallel, and mixes the local model weights w^(i,n) to produce the global weight vector w. The mixed model is then passed to each split to reset the perceptron local weights, and a new iteration is started. Mcdonald et al. (2010) provide a bound analysis for the algorithm and show that it is guaranteed to converge and find a separating hyperplane if one exists.
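To make the procedure concrete, the following is a small serial simulation of iterative parameter mixing in Python (our illustration, not part of the HadoopPerceptron package; uniform mixing coefficients are assumed, and the feature map f and candidate enumeration are placeholder helpers):

import numpy as np

def one_epoch_perceptron(split, w, f):
    # One structured-perceptron epoch over a data split. Each item is
    # (x, y_gold, candidates); f(x, y) returns a feature vector and
    # `candidates` enumerates candidate outputs (assumed helpers; real
    # systems decode with Viterbi-style search instead of enumeration).
    w = w.copy()
    for x, y_gold, candidates in split:
        y_pred = max(candidates, key=lambda y: w.dot(f(x, y)))
        if y_pred != y_gold:
            w = w + f(x, y_gold) - f(x, y_pred)
    return w

def perceptron_iter_param_mix(splits, feature_dim, f, epochs=10):
    # Iterative parameter mixing with uniform coefficients mu_(i,n) = 1/S.
    w = np.zeros(feature_dim)
    for _ in range(epochs):
        # In HadoopPerceptron each call below runs on its own map worker
        # in parallel; here the epochs over the splits run serially.
        local = [one_epoch_perceptron(s, w, f) for s in splits]
        w = sum(local) / len(splits)
    return w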

3 MapReduce and Hadoop

Many algorithms need to iterate over a number of records and 1) perform some calculation on each of them and then 2) aggregate the results. The MapReduce programming model implements a functional abstraction of these two operations, called respectively Map and Reduce. The Map function takes a key-value pair and produces a list of key-value pairs: map(k, v) → (k', v')*; while the input of the Reduce function is a key with all the associated values produced by all the mappers: reduce(k', (v')*) → (k'', v'')*. The model requires that all values with the same key are reduced together.
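As a toy illustration of these signatures (our example, unrelated to the perceptron modules), a word-count job can be simulated serially as follows:

from collections import defaultdict

def map_fn(doc_id, text):
    # map(k, v) -> (k', v')*: emit (word, 1) for every token
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce(k', (v')*) -> (k'', v'')*: sum the counts per word
    yield word, sum(counts)

def run_job(records):
    # Minimal serial simulation of the MapReduce contract: group all
    # mapper outputs by key ("shuffle"), then reduce each group.
    grouped = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            grouped[k2].append(v2)
    return [out for k2, vs in grouped.items() for out in reduce_fn(k2, vs)]

print(run_job([("d1", "to be or not to be")]))
# -> [('to', 2), ('be', 2), ('or', 1), ('not', 1)]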

Apache Hadoop is an open-source implementation of the MapReduce model on clusters of computers. A cluster is composed of a set of computers (nodes) connected into a network. One node is designated as the Master, while other nodes are referred to as Worker Nodes. Hadoop is designed to scale out to large clusters built from commodity hardware and achieves seamless scalability. To allow rapid development, Hadoop hides system-level details from the application developer. The MapReduce runtime automatically schedules worker assignment to mappers and reducers; handles synchronization required by the programming model, including gathering, sorting and shuffling of intermediate data across the network; and provides robustness by detecting worker failures and managing restarts. The framework is built on top of the Hadoop Distributed File System (HDFS), which allows the data to be distributed across the cluster nodes. Network traffic is minimized by moving the process to the node storing the data. In Hadoop terminology an entire MapReduce program is called a job, while individual mappers and reducers are called tasks.

4 HadoopPerceptron Implementation

In this section we give details on how the training, prediction and evaluation modules are implemented for the Hadoop framework using the MapReduce programming model.¹

Figure 2: HadoopPerceptron in MapReduce.

Our implementation of the iterative parameter mixing algorithm is sketched in Figure 2. At the beginning of each iteration, the training data is split and distributed to the worker nodes. The set of training examples in a data split is streamed to map workers as pairs (sentence-id, (x_t, y_t)). Each map worker performs a standard perceptron training epoch and outputs a pair (feature-id, w_(i,f)) for each feature. The set of such pairs emitted by a map worker represents its local weight vector. After the map workers have finished, the MapReduce framework guarantees that all local weights associated with a given feature are aggregated together as input to a distinct reduce worker. Each reduce worker produces as output the average of the associated feature weights. At the end of each iteration, the reduce workers' outputs are aggregated into the global averaged weight vector. The algorithm iterates N times or until convergence is achieved. At the beginning of each iteration the weight vector of each distinct model is initialized with the global averaged weight vector resulting from the previous iteration. Thus, for all iterations except the first, the global averaged weight vector from the previous iteration needs to be provided to the map workers. In Hadoop it is possible to pass this information via the Distributed Cache System.
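In mapper/reducer terms, one training iteration can be sketched as below. This is a hedged simplification of what the package's Java code does; the function names and the way the number of splits reaches the reducers are illustrative assumptions:

# Illustrative sketch (not the package's actual Java API): a map
# worker trains one perceptron epoch on its split, starting from the
# globally mixed weights, then emits its local weights feature by
# feature as (feature-id, weight) pairs.
def perceptron_mapper(split, w_global):
    w_local = dict(w_global)                 # start from the mixed model
    for sentence_id, (x, y) in split:
        perceptron_update(w_local, x, y)     # assumed helper: one online update
    for feature_id, weight in w_local.items():
        yield feature_id, weight

# A reduce worker receives all local weights for one feature and
# averages them. In the real system the number of splits S would be
# supplied through the job configuration (an assumption here).
def perceptron_reducer(feature_id, local_weights, num_splits):
    yield feature_id, sum(local_weights) / num_splits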

In addition to the training module, the HadoopPerceptron package provides separate modules for prediction and evaluation, both of which are designed as MapReduce programs. The evaluation module outputs the accuracy measure computed against provided gold standards. The prediction and evaluation modules are independent from the training module; the weight vector given as input could have been computed with any other system using any other training algorithm, as long as it employs the same features.

¹ The HadoopPerceptron toolkit is available from https://github.com/agesmundo/HadoopPerceptron.

The implementation is in Java, and we interface with the Hadoop cluster via the native Java API. It can be easily adapted to a wide range of NLP tasks. Incorporating new features by modifying the extensible feature extractor is straightforward. The package includes the implementation of the basic feature set described in Suzuki and Isozaki (2008).

5 The Web User Interface

Hadoop is bundled with several web interfaces that provide concise tracking information for jobs, tasks, data nodes, etc., as shown in Figure 3. These web interfaces can be used to demonstrate the HadoopPerceptron running phases and monitor the distributed execution of the training, prediction and evaluation modules for several sequence labeling tasks, including part-of-speech tagging and named entity recognition.

6 Experiments

We investigate HadoopPerceptron training time and prediction accuracy on a part-of-speech (POS) tagging task using the Penn Treebank corpus (Marcus et al., 1994). We use sections 0-18 of the Wall Street Journal for training, and sections 22-24 for testing.

We compare the regular perceptron trained serially on all the training data with the distributed perceptron trained with iterative parameter mixing with a variable number of splits S ∈ {10, 20}. For each system, we report the prediction accuracy measure on the final test set to determine if any loss is observed as a consequence of distributed training.

For each system, Figure 4 plots accuracy results computed at the end of every training epoch against consumed wall-clock time. We observe that iterative parameter mixing achieves comparable performance to its serial counterpart while converging orders of magnitude faster.

Figure 3: Hadoop interfaces for HadoopPerceptron.

Figure 4: Accuracy vs. training time. Each point corresponds to a training epoch.

Furthermore, we note that the distributed algorithm achieves a slightly higher final accuracy than serial training. Mcdonald et al. (2010) suggest that this is due to the bagging effect of distributed training, and to the parameter mixing, which is similar to the averaged perceptron.

We note also that increasing the number of splits increases the number of epochs required to attain convergence, while reducing the time required per epoch. This implies a trade-off between slower convergence and quicker epochs when selecting a larger number of splits.

7 Conclusion

The HadoopPerceptron package provides the first freely available open-source implementation of iterative parameter mixing Perceptron training, prediction and evaluation for a distributed MapReduce framework. It is a versatile piece of stand-alone software or a building block that can be easily extended, modified, adapted, and integrated in broader systems.

HadoopPerceptron is a useful tool for the increasing number of applications that need to perform large-scale structured learning. It is the first freely available implementation of an approach that has already been applied with success in the private sector (e.g., at Google Inc.), making it possible for everybody to fully leverage huge data sources such as the World Wide Web and to develop structured learning solutions that scale, while keeping execution times feasible and cluster-network usage to a minimum.

Acknowledgments

This work was funded by Google and The Scottish Informatics and Computer Science Alliance (SICSA). We thank Keith Hall, Chris Dyer and Miles Osborne for help and advice.


References

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP '02: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, USA.

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, USA.

Yoav Freund and Robert E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, Williamstown, MA, USA.

Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.

Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Ryan Mcdonald, Keith Hall, and Gideon Mann. 2010. Distributed training strategies for the structured perceptron. In NAACL '10: Proceedings of the 11th Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA.

Frank Rosenblatt. 1958. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.

Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In ACL '08: Proceedings of the 46th Conference of the Association for Computational Linguistics, Columbus, OH, USA.

Tom White. 2009. Hadoop: The Definitive Guide. O'Reilly Media Inc.


Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107, Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics

BRAT: a Web-based Tool for NLP-Assisted Text Annotation

Pontus Stenetorp¹* Sampo Pyysalo²,³* Goran Topić¹
Tomoko Ohta¹,²,³ Sophia Ananiadou²,³ and Jun'ichi Tsujii⁴

¹Department of Computer Science, The University of Tokyo, Tokyo, Japan
²School of Computer Science, University of Manchester, Manchester, UK
³National Centre for Text Mining, University of Manchester, Manchester, UK
⁴Microsoft Research Asia, Beijing, People's Republic of China

pontus, smp, goran, [email protected]
[email protected]
[email protected]

Abstract

We introduce the brat rapid annotation tool (BRAT), an intuitive web-based tool for text annotation supported by Natural Language Processing (NLP) technology. BRAT has been developed for rich structured annotation for a variety of NLP tasks and aims to support manual curation efforts and increase annotator productivity using NLP techniques. We discuss several case studies of real-world annotation projects using pre-release versions of BRAT and present an evaluation of annotation assisted by semantic class disambiguation on a multi-category entity mention annotation task, showing a 15% decrease in total annotation time. BRAT is available under an open-source license from: http://brat.nlplab.org

1 Introduction

Manually-curated gold standard annotations are a prerequisite for the evaluation and training of state-of-the-art tools for most Natural Language Processing (NLP) tasks. However, annotation is also one of the most time-consuming and financially costly components of many NLP research efforts, and it can place heavy demands on human annotators for maintaining annotation quality and consistency. Yet modern annotation tools are generally technically oriented, and many offer little support to users beyond the minimum required functionality. We believe that intuitive and user-friendly interfaces, as well as the judicious application of NLP technology to support, not supplant, human judgements, can help maintain the quality of annotations, make annotation more accessible to non-technical users such as subject domain experts, and improve annotation productivity, thus reducing both the human and financial cost of annotation. The tool presented in this work, BRAT, represents our attempt to realise these possibilities.

* These authors contributed equally to this work.

Figure 1: Visualisation examples. Top: named entity recognition; middle: dependency syntax; bottom: verb frames.

2 Features

2.1 High-quality Annotation Visualisation

BRAT is based on our previously released open-source STAV text annotation visualiser (Stenetorp et al., 2011b), which was designed to help users gain an understanding of complex annotations involving a large number of different semantic types, dense, partially overlapping text annotations, and non-projective sets of connections between annotations. Both tools share a vector graphics-based visualisation component, which provides scalable detail and rendering. BRAT integrates PDF and EPS image format export functionality to support use in, for example, figures in publications (Figure 1).


Figure 2: Screenshot of the main BRAT user interface, showing a connection being made between the annotations for "moving" and "Citibank".

2.2 Intuitive Annotation Interface

We extended the capabilities of STAV by implementing support for annotation editing. This was done by adding functionality for recognising standard user interface gestures familiar from text editors, presentation software, and many other tools.

In BRAT, a span of text is marked for annotation simply by selecting it with the mouse by "dragging" or by double-clicking on a word. Similarly, annotations are linked by clicking with the mouse on one annotation and dragging a connection to the other (Figure 2).

BRAT is browser-based and built entirely using standard web technologies. It thus offers a familiar environment to annotators, and it is possible to start using BRAT simply by pointing a standards-compliant modern browser to an installation. There is thus no need to install or distribute any additional annotation software or to use browser plug-ins. The use of web standards also makes it possible for BRAT to uniquely identify any annotation using Uniform Resource Identifiers (URIs), which enables linking to individual annotations for discussions in e-mail, documents and on web pages, facilitating easy communication regarding annotations.

2.3 Versatile Annotation Support

BRAT is fully configurable and can be set up to support most text annotation tasks. The most basic annotation primitive identifies a text span and assigns it a type (or tag or label), marking, for example, POS-tagged tokens, chunks or entity mentions (Figure 1, top). These base annotations can be connected by binary relations, either directed or undirected, which can be configured for e.g. simple relation extraction or verb frame annotation (Figure 1, middle and bottom). n-ary associations of annotations are also supported, allowing the annotation of event structures such as those targeted in the MUC (Sundheim, 1996), ACE (Doddington et al., 2004), and BioNLP (Kim et al., 2011) Information Extraction (IE) tasks (Figure 2). Additional aspects of annotations can be marked using attributes: binary or multi-valued flags that can be added to other annotations. Finally, annotators can attach free-form text notes to any annotation.

In addition to information extraction tasks, these annotation primitives allow BRAT to be configured for use in various other tasks, such as chunking (Abney, 1991), Semantic Role Labeling (Gildea and Jurafsky, 2002; Carreras and Màrquez, 2005), and dependency annotation (Nivre, 2003) (see Figure 1 for examples). Further, both the BRAT client and server implement full support for the Unicode standard, which allows the tool to support the annotation of text using e.g. Chinese or Devanagari characters. BRAT is distributed with examples from over 20 corpora for a variety of tasks, involving texts in seven different languages and including examples from corpora such as those introduced for the CoNLL shared tasks on language-independent named entity recognition (Tjong Kim Sang and De Meulder, 2003) and multilingual dependency parsing (Buchholz and Marsi, 2006).

Figure 3: Incomplete TRANSFER event indicated to the annotator.

BRAT also implements a fully configurable system for checking detailed constraints on annotation semantics, for example specifying that a TRANSFER event must take exactly one of each of GIVER, RECIPIENT and BENEFICIARY arguments, each of which must have one of the types PERSON, ORGANIZATION or GEO-POLITICAL ENTITY, as well as a MONEY argument of type MONEY, and may optionally take a PLACE argument of type LOCATION (LDC, 2005). Constraint checking is fully integrated into the annotation interface and feedback is immediate, with clear visual effects marking incomplete or erroneous annotations (Figure 3).
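In brat's plain-text configuration, a constraint of this kind could be written roughly as follows. This is a sketch in the spirit of brat's annotation.conf syntax ('|' for alternative argument types, '?' for an optional argument); the exact notation may differ between brat versions:

[entities]
Person
Organization
Geo-political-entity
Money
Location

[events]
Transfer  Giver:Person|Organization|Geo-political-entity, Recipient:Person|Organization|Geo-political-entity, Beneficiary:Person|Organization|Geo-political-entity, Money:Money, Place?:Location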

2.4 NLP Technology Integration

BRAT supports two standard approaches for integrating the results of fully automatic annotation tools into an annotation workflow: bulk annotation imports can be performed by format conversion tools distributed with BRAT for many standard formats (such as in-line and column-formatted BIO), and tools that provide standard web service interfaces can be configured to be invoked from the user interface.

However, human judgements cannot be replaced by, or based on, a completely automatic analysis without some risk of introducing bias and reducing annotation quality. To address this issue, we have been studying ways to augment the annotation process with input from statistical and machine learning methods that support annotators while still involving human judgement for each annotation.

As a specific realisation based on this approach, we have integrated a recently introduced machine learning-based semantic class disambiguation system capable of offering multiple outputs with probability estimates, which was shown to be able to reduce ambiguity on average by over 75% while retaining the correct class in on average 99% of cases over six corpora (Stenetorp et al., 2011a). Section 4 presents an evaluation of the contribution of this component to annotator productivity.
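The interaction pattern can be sketched as follows: the disambiguation component's probability estimates prune the list of types shown to the annotator, who still makes the final choice. The thresholding scheme below is our assumption, not a detail taken from Stenetorp et al. (2011a):

def suggest_types(span_text, classifier, mass=0.99):
    # Return a pruned, ranked shortlist of candidate types for a span.
    # `classifier(span_text)` is assumed to return {type: probability}.
    # We keep the smallest top-ranked set covering `mass` of the
    # probability; the annotator picks from this shortlist instead of
    # the full inventory (54 types in the paper's experiment).
    ranked = sorted(classifier(span_text).items(),
                    key=lambda kv: kv[1], reverse=True)
    shortlist, covered = [], 0.0
    for label, p in ranked:
        shortlist.append(label)
        covered += p
        if covered >= mass:
            break
    return shortlist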

2.5 Corpus Search Functionality

BRAT implements a comprehensive set of search functions, allowing users to search document collections for text span annotations, relations, event structures, or simply text, with a rich set of search options definable using a simple point-and-click interface (Figure 4). Additionally, search results can optionally be displayed using keyword-in-context concordancing and sorted for browsing using any aspect of the matched annotation (e.g., type, text, or context).

Figure 4: The BRAT search dialog.

3 Implementation

BRAT is implemented using a client-server architecture with communication over HTTP using JavaScript Object Notation (JSON). The server is a RESTful web service (Fielding, 2000) and the tool can easily be extended or adapted to switch out the server or client. The client user interface is implemented using XHTML and Scalable Vector Graphics (SVG), with interactivity implemented using JavaScript with the jQuery library. The client communicates with the server using Asynchronous JavaScript and XML (AJAX), which permits asynchronous messaging.

Figure 5: Example annotation from the BioNLP Shared Task 2011 Epigenetics and Post-translational Modifications event extraction task.

BRAT uses a stateless server back-end implemented in Python and supports both the Common Gateway Interface (CGI) and FastCGI protocols, the latter allowing response times far below the 100 ms boundary for a "smooth" user experience without noticeable delay (Card et al., 1983). For server-side annotation storage BRAT uses an easy-to-process file-based stand-off format that can be converted from or into other formats; there is no need to perform database import or export to interface with the data storage. The BRAT server installation requires only a CGI-capable web server, and the set-up supports any number of annotators who access the server using their browsers, on any operating system, without separate installation.
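As an illustration of such a file-based stand-off format, a brat-style .ann file pairs with a raw .txt document along the following lines (the document, offsets and IDs are invented for this example; fields are tab-separated):

document.txt:   Citibank transferred $1M to Smith.

document.ann:
T1	Organization 0 8	Citibank
T2	Transfer 9 20	transferred
T3	Money 21 24	$1M
T4	Person 28 33	Smith
E1	Transfer:T2 Giver:T1 Recipient:T4 Money:T3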

Client-server communication is managed so that all user edit operations are immediately sent to the server, which consolidates them with the stored data. There is no separate "save" operation and thus a minimal risk of data loss, and as the authoritative version of all annotations is always maintained by the server, there is no chance of conflicting annotations being made which would need to be merged to produce an authoritative version. The BRAT client-server architecture also makes real-time collaboration possible: multiple annotators can work on a single document simultaneously, seeing each other's edits as they appear in a document.

4 Case Studies

4.1 Annotation Projects

BRAT has been used throughout its development during 2011 in the annotation of six different corpora by four research groups, in efforts that have in total involved the creation of well over 50,000 annotations in thousands of documents comprising hundreds of thousands of words.

These projects include structured event annotation for the domain of cancer biology, Japanese verb frame annotation, and gene-mutation-phenotype relation annotation. One prominent effort making use of BRAT is the BioNLP Shared Task 2011,¹ in which the tool was used in the annotation of the EPI and ID main task corpora (Pyysalo et al., 2012). These two information extraction tasks involved the annotation of entities, relations and events in the epigenetics and infectious diseases subdomains of biology. Figure 5 shows an illustration of shared task annotations.

Many other annotation efforts using BRAT are still ongoing. We refer the reader to the BRAT website² for further details on current and past annotation projects using BRAT.

¹ http://2011.bionlp-st.org
² http://brat.nlplab.org

Mode     Total           Type Selection
Normal   45:28           13:49
Rapid    39:24 (-6:04)   09:35 (-4:14)

Table 1: Total annotation time, the portion spent selecting annotation types, and the absolute improvement in rapid mode.

4.2 Automatic Annotation Support

To estimate the contribution of the semantic class disambiguation component to annotation productivity, we performed a small-scale experiment involving an entity and process mention tagging task. The annotation targets were of 54 distinct mention types (19 physical entity and 35 event/process types), marked using the simple typed-span representation. To reduce confounding effects from annotator productivity differences and learning during the task, annotation was performed by a single experienced annotator with a Ph.D. in biology in a closely related area who was previously familiar with the annotation task.

The experiment was performed on publication abstracts from the biomolecular science subdomain of glucose metabolism in cancer. The texts were drawn from a pool of 1,750 initial candidates using stratified sampling to select pairs of 10-document sets with similar overall statistical properties.³ Four pairs of 10 documents (80 in total) were annotated in the experiment, with 10 in each pair annotated with automatic support and 10 without, in alternating sequence to prevent learning effects from favouring either approach.

The results of this experiment are summarized in Table 1 and Figure 6. In total, 1,546 annotations were created in normal mode and 1,541 annotations in rapid mode; the sets are thus highly comparable.

³ Document word count and expected annotation count were estimated from the output of NERsuite, a freely available CRF-based NER tagger: http://nersuite.nlplab.org

Figure 6: Allocation of annotation time. GREEN signifies time spent on selecting annotation type and BLUE the remaining annotation time.

We observe a 15.4% reduction in total annotation time and, as expected, this is almost exclusively due to a reduction in the time the annotator spent selecting the type to assign to each span, which is reduced by 30.7%; annotation time is otherwise stable across the annotation modes (Figure 6). The reduction in the time spent on type selection is explained by the limiting of the number of candidate types exposed to the annotator, which decreased from the original 54 to an average of 2.88 thanks to the semantic class disambiguation component (Stenetorp et al., 2011a).

Although further research is needed to establish the benefits of this approach in various annotation tasks, we view the results of this initial experiment as promising regarding the potential of our approach of using machine learning to support annotation efforts.

5 Related Work and Conclusions

We have introduced BRAT, an intuitive and user-friendly web-based annotation tool that aims to enhance annotator productivity by closely integrating NLP technology into the annotation process. BRAT has been and is being used for several ongoing annotation efforts at a number of academic institutions and has so far been used for the creation of well over 50,000 annotations. We presented an experiment demonstrating that integrated machine learning technology can reduce the time for type selection by over 30% and overall annotation time by 15% for a multi-type entity mention annotation task.

The design and implementation of BRAT was informed by experience from several annotation tasks and research efforts spanning more than a decade. A variety of previously introduced annotation tools and approaches also served to guide our design decisions, including the fast annotation mode of Knowtator (Ogren, 2006), the search capabilities of the XConc tool (Kim et al., 2008), and the design of web-based systems such as MyMiner (Salgado et al., 2010) and GATE Teamware (Cunningham et al., 2011). Using machine learning to accelerate annotation by supporting human judgements is well documented in the literature for tasks such as entity annotation (Tsuruoka et al., 2008) and translation (Martínez-Gómez et al., 2011), efforts which served as inspiration for our own approach.

BRAT, along with conversion tools and extensive documentation, is freely available under the open-source MIT license from its homepage at http://brat.nlplab.org

Acknowledgements

The authors would like to thank the early adopters of BRAT who have provided us with extensive feedback and feature suggestions. This work was supported by Grant-in-Aid for Specially Promoted Research (MEXT, Japan), the UK Biotechnology and Biological Sciences Research Council (BBSRC) under the project Automated Biological Event Extraction from the Literature for Drug Discovery (reference number: BB/G013160/1), and the Royal Swedish Academy of Sciences.


References

Steven Abney. 1991. Parsing by chunks. Principle-based parsing, 44:257–278.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164.

Stuart K. Card, Thomas P. Moran, and Allen Newell. 1983. The psychology of human-computer interaction. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic Role Labeling. In Proceedings of the 9th Conference on Natural Language Learning, pages 152–164. Association for Computational Linguistics.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6).

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The Automatic Content Extraction (ACE) program: Tasks, data, and evaluation. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 837–840.

Roy Fielding. 2000. REpresentational State Transfer (REST). In Architectural Styles and the Design of Network-based Software Architectures. University of California, Irvine, page 120.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. 2008. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9(1):10.

Jin-Dong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan Nguyen, and Jun'ichi Tsujii. 2011. Overview of BioNLP Shared Task 2011. In Proceedings of BioNLP Shared Task 2011 Workshop, pages 1–6, Portland, Oregon, USA, June. Association for Computational Linguistics.

LDC. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events. Technical report, Linguistic Data Consortium.

Pascual Martínez-Gómez, Germán Sanchis-Trilles, and Francisco Casacuberta. 2011. Online learning via dynamic reranking for computer assisted translation. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 6609 of Lecture Notes in Computer Science, pages 93–105. Springer Berlin / Heidelberg.

Joakim Nivre. 2003. An Efficient Algorithm for Projective Dependency Parsing. In Proceedings of the 8th International Workshop on Parsing Technologies, pages 149–160.

Philip V. Ogren. 2006. Knowtator: A Protégé plug-in for annotated corpus construction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Companion Volume: Demonstrations, pages 273–275, New York City, USA, June. Association for Computational Linguistics.

Sampo Pyysalo, Tomoko Ohta, Rafal Rak, Dan Sullivan, Chunhong Mao, Chunxia Wang, Bruno Sobral, Jun'ichi Tsujii, and Sophia Ananiadou. 2012. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinformatics, 13(suppl. 8):S2.

David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, and Ashish V. Tendulkar. 2010. MyMiner system description. In Proceedings of the Third BioCreative Challenge Evaluation Workshop 2010, pages 157–158.

Pontus Stenetorp, Sampo Pyysalo, Sophia Ananiadou, and Jun'ichi Tsujii. 2011a. Almost total recall: Semantic category disambiguation using large lexical resources and approximate string matching. In Proceedings of the Fourth International Symposium on Languages in Biology and Medicine.

Pontus Stenetorp, Goran Topić, Sampo Pyysalo, Tomoko Ohta, Jin-Dong Kim, and Jun'ichi Tsujii. 2011b. BioNLP Shared Task 2011: Supporting Resources. In Proceedings of BioNLP Shared Task 2011 Workshop, pages 112–120, Portland, Oregon, USA, June. Association for Computational Linguistics.

Beth M. Sundheim. 1996. Overview of results of the MUC-6 evaluation. In Proceedings of the Sixth Message Understanding Conference, pages 423–442. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2008. Accelerating the annotation of sparse named entities by dynamic sentence selection. BMC Bioinformatics, 9(Suppl 11):S8.


Author Index

Ananiadou, Sophia, 102
Angelova, Galia, 77
Atkinson, Martin, 25

Baldwin, Timothy, 69
Ballesteros, Miguel, 58
Barbu, Eduard, 87
Barsanti, Igor, 87
Basile, Valerio, 92
Bel, Núria, 1
Belogay, Anelia, 6
Beuls, Katrien, 63
Blain, Frédéric, 11
Bos, Johan, 92
Boytcheva, Svetla, 77
Bucci, Stefano, 25

Chang, Jason S., 16
Chen, Mei-hua, 16
Cook, Paul, 69
Costa, Hernani, 35
Crawley, Brett, 25
Cristea, Dan, 6
Cristianini, Nello, 82
Crook, Paul A., 46

Dini, Luca, 87

Evang, Kilian, 92

Flaounas, Ilias, 82

García-Varea, Ismael, 41
Gesmundo, Andrea, 97
Gonçalo Oliveira, Hugo, 35

Han, Bo, 69
Harwood, Aaron, 69
Héja, Enikő, 51
Hsieh, Hung-ting, 16
Huang, Chung-chi, 16

Kakkonen, Tuomo, 20
Kao, Ting-hui, 16
Karagyozov, Diman, 6
Karunasekera, Shanika, 69
Kinnunen, Tomi, 20
Koeva, Svetla, 6

Lagos, Nikolaos, 87
Lambert, Patrik, 11
Lansdall-Welfare, Thomas, 82
LeBrun, Jean-Luc, 20
Leisma, Henri, 20
Lemon, Oliver, 46
Liu, Xingkun, 46
Lopez, Cédric, 31
Lopez, Patrice, 11

Machunik, Monika, 20
Moshtaghi, Masud, 69

Nikolova, Ivelina, 77
Nivre, Joakim, 58

Ohta, Tomoko, 102

Poch, Marc, 1
Prince, Violaine, 31
Przepiórkowski, Adam, 6
Pyysalo, Sampo, 102

Raxis, Plovios, 6
Revuelta-Martínez, Alejandro, 41
Rhulmann, Mathieu, 87
Rizzo, Giuseppe, 73
Roche, Mathieu, 31
Rodríguez, Luis, 41
Romary, Laurent, 11

Santos, Diana, 35
Schwenk, Holger, 11
Segond, Frédérique, 87
Senellart, Jean, 11
Steels, Luc, 63
Steinberger, Ralf, 25
Stenetorp, Pontus, 102
Sudhahar, Saatviga, 82

Takács, Dávid, 51
Tomeh, Nadi, 97
Topić, Goran, 102
Toral, Antonio, 1
Trevisan, Marco, 87
Troncy, Raphaël, 73
Tsujii, Jun'ichi, 102
Turchi, Marco, 25

Vald, Ed, 87
Van der Goot, Erik, 25
van Trijp, Remi, 63
Venhuizen, Noortje, 92
Vertan, Cristina, 6

Wang, Zhuoran, 46
Wellens, Pieter, 63
Wilcox, Alastair, 25

Yang, Ping-che, 16

Zipser, Florian, 11