
CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes

Sameer Pradhan
Raytheon BBN Technologies, Cambridge, MA 02138, USA
[email protected]

Alessandro Moschitti
University of Trento, 38123 Povo (TN), Italy
[email protected]

Nianwen Xue
Brandeis University, Waltham, MA 02453, USA
[email protected]

Olga Uryupina
University of Trento, 38123 Povo (TN), Italy
[email protected]

Yuchen Zhang
Brandeis University, Waltham, MA 02453, USA
[email protected]

Abstract

The CoNLL-2012 shared task involved predicting coreference in English, Chinese, and Arabic, using the final version, v5.0, of the OntoNotes corpus. It was a follow-on to the English-only task organized in 2011. Until the creation of the OntoNotes corpus, resources in this sub-field of language processing were limited to noun phrase coreference, often on a restricted set of entities, such as the ACE entities. OntoNotes provides a large-scale corpus of general anaphoric coreference that is not restricted to noun phrases or to a specified set of entity types, and covers multiple languages. OntoNotes also provides additional layers of integrated annotation, capturing shallow semantic structure. This paper describes the OntoNotes annotation (coreference and the other layers), describes the parameters of the shared task, including the data format, pre-processing information, and evaluation criteria, and presents and discusses the results achieved by the participating systems. The task of coreference has had a complex evaluation history: the many possible evaluation conditions have, in the past, made it difficult to judge the improvement of new algorithms over previously reported results. Having a standard test set and standard evaluation parameters, all based on a resource that provides multiple integrated annotation layers (parses, semantic roles, word senses, named entities and coreference) in multiple languages, could support joint models and should help ground and energize ongoing research in the task of entity and event coreference.

1 Introduction

The importance of coreference resolution for the entity/event detection task, namely identifying all mentions of entities and events in text and clustering them into equivalence classes, has been well recognized in the natural language processing community. Automatic identification of coreferring entities and events in text has been an uphill battle for several decades, partly because it can require world knowledge, which is not well-defined, and partly owing to the lack of substantial annotated data. Aside from the fact that resolving coreference in text is a very hard problem, two further hindrances have slowed its progress:

(i) Corpus size and scope: smaller-sized corpora such as MUC covered coreference across all noun phrases, whereas corpora such as ACE were larger in size but covered a smaller set of entities and attempted to cover multiple coreference phenomena that are not all annotatable with equally high agreement; and

(ii) Complex evaluation: multiple evaluation metrics and multiple possible evaluation scenarios, complicated by varying training and test partitions, led many researchers to report only one or a few of the metrics, and only under a subset of evaluation scenarios, making it harder to gauge improvements in algorithms over the years (Stoyanov et al., 2009) or to determine which particular areas require further attention. The particular numbers reported in the literature can greatly affect the perceived difficulty of the task: it can seem to be a very hard problem (Soon et al., 2001) or one that is somewhat easier (Culotta et al., 2007).

A step in the right direction for the benefit of the community was to:

(i) Create a large corpus with high annotator agreement, possibly by restricting the coreference annotation to phenomena that can be annotated with high agreement, while covering an unrestricted set of entities and events; and

(ii) Create a standard evaluation scenario with an official evaluation setup, and possibly several ablation settings to capture the range of performance. This can then be used as a standard benchmark by various researchers.

Creation of the OntoNotes corpus automatically addressed the first issue. The SemEval-2010 coreference task was the first attempt at addressing the second issue. Among other corpora, a small subset (∼120k words) of the English portion of OntoNotes was used for this purpose. However, it did not receive enough participation, which prevented the organizers from reaching any strong conclusions. The CoNLL-2011 shared task was another attempt to address the second issue. It was well received, but was limited to the English portion of OntoNotes. In addition, the coreference portion of OntoNotes did not have a concrete baseline prior to the 2011 evaluation, making it challenging for participants to gauge the performance of their algorithms in the absence of an established state of the art on this flavor of annotation. The closest comparison was to the results reported by Pradhan et al. (2007b) on the newswire portion of OntoNotes. Since the corpus also covers Chinese and Arabic, it provided a great opportunity to hold a follow-up task in 2012 covering all three languages. As we will see later, peculiarities of each of these languages had to be considered in creating the evaluation framework.

In the language processing community, the field of speech recognition probably has the longest history of shared evaluations, held primarily by NIST1 (Pallett, 2002). In the past decade, Machine Translation has also been a topic of shared evaluations by NIST2. There are many phenomena of syntax and semantics that are not quite amenable to such continued evaluation efforts. CoNLL shared tasks over the past 15 years have tried to fill that gap, helping to establish benchmarks and to promote the advancement of the state of the art in various sub-fields within NLP. The importance of shared tasks has now permeated the domain of clinical NLP (Chapman et al., 2011), and recently a coreference task was organized as part of the i2b2 workshop (Uzuner et al., 2012).

The rest of the paper is organized as follows: Section 2 gives a quick overview of the research in coreference. Section 3 presents an overview of the OntoNotes corpus. Section 4 describes the range of phenomena annotated in OntoNotes, along with language-specific issues. Section 5 describes the shared task data and the evaluation parameters, with Section 5.4.2 looking at the performance of state-of-the-art tools on the intermediate layers of annotation. Subsequent sections briefly describe the approaches taken by the participating systems and present the system results with some analysis. Section 10 concludes the paper.

1 http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html
2 http://www.itl.nist.gov/iad/mig/tests/mt/

2 Background

Early work on corpus-based coreference resolution dates back to the mid-90s, when McCarthy and Lehnert (1995) experimented with decision trees and hand-written rules. A systematic study using decision trees was then conducted by Soon et al. (2001). Significant improvements have been made in the field of language processing in general, and improved learning techniques have been developed to push the state of the art in coreference resolution forward (Morton, 2000; Harabagiu et al., 2001; McCallum and Wellner, 2004; Culotta et al., 2007; Denis and Baldridge, 2007; Rahman and Ng, 2009; Haghighi and Klein, 2010). Researchers have continued to find novel ways of exploiting ontologies such as WordNet, and various knowledge sources, from shallow semantics to encyclopedic knowledge, are being exploited (Ponzetto and Strube, 2005; Ponzetto and Strube, 2006; Versley, 2007; Ng, 2007). Given that WordNet is a static ontology and as such has limited coverage, there have more recently been successful attempts to utilize information from much larger, collaboratively built resources such as Wikipedia (Ponzetto and Strube, 2006). More recently [??] researchers have used graph-based algorithms. For a detailed treatment of progress in this field, we refer the reader to a recent survey article (Ng, 2010) and a tutorial (Ponzetto and Poesio, 2009) dedicated to this subject.

In spite of all the progress, current techniques still rely primarily on surface-level features such as string match, proximity, and edit distance; syntactic features such as apposition; and shallow semantic features such as number, gender, named entities, semantic class, Hobbs' distance, etc. Corpora to support supervised learning of this task date back to the Message Understanding Conferences (MUC) (Hirschman and Chinchor, 1997; Chinchor, 2001; Chinchor and Sundheim, 2003). The de facto standard datasets for current coreference studies are the MUC and the ACE3 (Doddington et al., 2004) corpora. These corpora were tagged with coreferring entities identified by noun phrases in the text. The MUC corpora cover all noun phrases in text, but represent small training and test sets. The ACE corpora, on the other hand, have much more annotation, but are restricted to a small subset of entities. They are also less consistent in terms of inter-annotator agreement (ITA) (Hirschman et al., 1998). This lessens the reliability of statistical evidence, in the form of lexical coverage and semantic relatedness, that could be derived from the data and used by a classifier to generate better predictive models. The importance of a well-defined tagging scheme and consistent ITA has been well recognized and studied in the past (Poesio, 2004; Poesio and Artstein, 2005; Passonneau, 2004). There is a growing consensus that, in order for these resources to be most useful for language understanding applications such as question answering or distillation, both of which seek to take information access technology to the next level, we need more consistent annotation of larger amounts of broad-coverage data for training better automatic techniques for entity and event identification. Identification and encoding of richer knowledge, possibly linked to knowledge sources, and the development of learning algorithms that can effectively incorporate it, is a necessary next step towards improving the current state of the art. The computational learning community in general is also witnessing a move towards evaluations based on joint inference, with two previous CoNLL tasks (Surdeanu et al., 2008; Hajic et al., 2009) devoted to joint learning of syntactic and semantic dependencies. A principal ingredient for joint learning is the presence of multiple layers of semantic information.

One fundamental question still remains: what would it take to improve the state of the art in coreference resolution that has not been attempted so far? Many different algorithms have been tried in the past 15 years, but one thing that was lacking until now was a corpus comprehensively tagged on a large scale with consistent, rich semantic information. One of the many goals of the OntoNotes project4 (Hovy et al., 2006; Weischedel et al., 2011) was to explore whether it could fill this void and help push progress further, not only in coreference, but also in the various layers of semantics that it tries to capture. As one of its layers, it has created a corpus for general anaphoric coreference that covers entities and events, not limited to noun phrases or a limited set of entity types. As mentioned earlier, the coreference layer in OntoNotes constitutes just one part of a multi-layered, integrated annotation of shallow semantic structure in text with high inter-annotator agreement, which also provides a unique opportunity for performing joint inference over a substantial body of data.

3 http://projects.ldc.upenn.edu/ace/data/
4 http://www.bbn.com/nlp/ontonotes

3 The OntoNotes Corpus

The OntoNotes project has created a corpus of large-scale, accurate, and integrated annotation of multiple levels of the shallow semantic structure in text. The English and Chinese language portions comprise roughly one million words each, drawn from newswire, magazine articles, broadcast news, broadcast conversations, web data and conversational speech. The English corpus also contains a further 200k words of an English translation of the New Testament. The Arabic portion is smaller, comprising 300k words of newswire articles. The idea is that this rich, integrated annotation covering many layers will allow for richer, cross-layer models, enabling significantly better automatic semantic analysis. In addition to coreference, this data is also tagged with syntactic trees, high-coverage verb and some noun propositions, partial verb and noun word senses, and 18 named entity types. Over the years of the development of this corpus, various priorities came into play, and therefore not all the data in the corpus is annotated with all the different layers of annotation. However, such multi-layer annotation, with complex cross-layer dependencies, demands a robust, efficient, scalable mechanism for storing it while providing efficient, convenient, integrated access to the underlying structure. To this effect, OntoNotes uses a relational database representation that captures both the inter- and intra-layer dependencies and also provides an object-oriented API for efficient, multi-tiered access to the data (Pradhan et al., 2007a). This should facilitate the creation of cross-layer features in integrated predictive models that make use of these annotations.

OntoNotes comprises the following layers of annotation:

• Syntax – A syntactic layer representing a revised Penn Treebank (Marcus et al., 1993; Babko-Malaya et al., 2006).

• Propositions – The proposition structure of verbs in the form of a revised PropBank (Palmer et al., 2005; Babko-Malaya et al., 2006), along with the corresponding Chinese and Arabic PropBanks.

• Word Sense – Coarse-grained word senses are tagged for the most frequent polysemous verbs and nouns, in order to maximize coverage. The word sense granularity is tailored to achieve 90% inter-annotator agreement, as demonstrated by Palmer et al. (2007). These senses are defined in the sense inventory files, and each individual sense has been connected to multiple WordNet senses, giving users direct access to the WordNet semantic structure. For the English portion of OntoNotes, there is also a mapping from the word senses to the PropBank frames and to VerbNet (Kipper et al., 2000) and FrameNet (Fillmore et al., 2003). Unfortunately, owing to the lack of comparably comprehensive resources for Chinese and Arabic, neither language has such inter-resource mappings available.

• Named Entities – The corpus was tagged with a set of 18 proper named entity types that were well-defined and well-tested for inter-annotator agreement by Weischedel and Brunstein (2005).

• Coreference – This layer captures general anaphoric coreference that covers entities and events not limited to noun phrases or a limited set of entity types (Pradhan et al., 2007b). It considers all pronouns (PRP, PRP$), noun phrases (NP) and heads of verb phrases (VP) as potential mentions. Unlike English, Chinese and Arabic have dropped subjects and objects, which were also considered during coreference annotation. We will take a look at this in detail in the next section.

4 Coreference in OntoNotes

General anaphoric coreference that spans a rich set of entities and events, not restricted to a few types as has been characteristic of most coreference data available until now, has been tagged with a high degree of consistency. Two different types of coreference are distinguished in the OntoNotes data: Identity (IDENT) and Appositive (APPOS). Identity coreference (IDENT) is used for anaphoric coreference, meaning links between pronominal, nominal, and named mentions of specific referents. It does not include mentions of generic, underspecified, or abstract entities. Appositives (APPOS) are treated separately because they function as attributions, as described further below. Coreference is annotated for all specific entities and events. There is no limit on the semantic types of NP entities that can be considered for coreference, and in particular, coreference is not limited to ACE types.

4.1 Noun Phrases

The mentions over which IDENT coreference applies are typically pronominal, named, or definite nominal. The annotation process begins by automatically extracting all of the NP mentions from the Penn Treebank, though the annotators can also add additional mentions when appropriate. (A small sketch of such NP extraction from parse trees appears after example (2).) In the following two examples (and later ones), the phrases notated in bold form the links of an IDENT chain.

(1) She had a good suggestion and it was unanimously accepted by all.

(2) Elco Industries Inc. said it expects net income in the year ending June 30, 1990, to fall below a recent analyst's estimate of $1.65 a share. The Rockford, Ill. maker of fasteners also said it expects to post sales in the current fiscal year that are "slightly above" fiscal 1989 sales of $155 million.
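As a rough illustration of the automatic extraction of NP mentions from Penn Treebank parses described above, the following minimal Python sketch collects NP and pronoun spans from a bracketed parse using NLTK. The choice of NLTK, the helper name, and the toy parse string are assumptions for illustration; this is not the tool that was used to prepare the OntoNotes data.

    from nltk.tree import Tree

    def spans_by_label(tree, labels):
        """Return (start, end, text) token spans for every subtree of
        `tree` whose label is in `labels` (e.g. NP, PRP, PRP$)."""
        results = []
        def walk(node, start):
            if isinstance(node, str):          # a leaf token
                return start + 1
            end = start
            for child in node:
                end = walk(child, end)
            if node.label() in labels:
                results.append((start, end, " ".join(node.leaves())))
            return end
        walk(tree, 0)
        return results

    # Toy parse for example (1); a real pipeline would read Treebank files.
    parse = Tree.fromstring(
        "(S (NP (PRP She)) (VP (VBD had) (NP (DT a) (JJ good) (NN suggestion))))")
    print(spans_by_label(parse, {"NP", "PRP", "PRP$"}))
    # -> [(0, 1, 'She'), (0, 1, 'She'), (2, 5, 'a good suggestion')]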

Noun phrases (NPs) in Chinese can be complex noun phrases or bare nouns (nouns that lack a determiner such as "the" or "this"). Complex noun phrases contain structures modifying the head noun, as in the following examples:

(3) [他担任 总统 任内 最后 一 次 的 [亚太经济合作会议 [高峰会]]]。
[[His last APEC [summit meeting]] as the President]

(4) [越南 统一 后 [第一 位 前往 当地 访问 的 [美国总统]]]
[[The first [U.S. president]] who went to visit Vietnam after its unification]

In these examples, the smallest phrase in square brackets is the bare noun. The longer phrase in square brackets includes modifying structures. All the expressions in square brackets, however, share the same head noun, i.e., "高峰会 (summit meeting)" and "美国总统 (U.S. president)" respectively. Nested noun phrases, or nested NPs, are contained within longer noun phrases. In the above examples, "summit meeting" and "U.S. president" are nested NPs. Wherever NPs are nested, the largest logical span is used in coreference.


4.2 Verbs

Verbs are added as single-word spans if they can be coreferenced with a noun phrase or with another verb. The intent is to annotate the VP, but the single-word head is marked for convenience. This includes morphologically related nominalizations (5) and noun phrases that refer to the same event, even if they are lexically distinct from the verb (6). In the following two examples, only the chains related to the growth event are shown. The Arabic translation of the same example identifies mentions through a line on top of the tokens.

(5) Sales of passenger cars grew 22%. The strong growth followed year-to-year increases.

[Arabic translation of example (5), with the coreferent mentions marked by a line over the tokens.]
(6) Japan's domestic sales of cars, trucks and buses in October rose 18% from a year earlier to 500,004 units, a record for the month, the Japan Automobile Dealers' Association said. The strong growth followed year-to-year increases of 21% in August and 12% in September.

4.3 Pronouns

All pronouns and demonstratives are linked to anything that they refer to, and pronouns in quoted speech are also marked. Expletive or pleonastic pronouns (it, there) are not considered for tagging, and generic you is not marked. In the following example, the pronouns you and it would not be marked. (In this and the following examples, an asterisk (*) before a boldface phrase identifies entity/event mentions that would not be tagged as coreferent.)

(7) Senate majority leader Bill Frist likes to tell a story from his days as a pioneering heart surgeon back in Tennessee. A lot of times, Frist recalls, *you'd have a critical patient lying there waiting for a new heart, and *you'd want to cut, but *you couldn't start unless *you knew that the replacement heart would make *it to the operating room.

In Chinese, if the subject or object can be recovered from the context, or is of little interest to the reader/listener, it can be omitted. In the Chinese Treebank, the position where a subject or object is omitted is marked with *pro*. A *pro* can be linked with overt NPs if they refer to the same entity or event, and the *pro* and the overt NPs do not have to be in the same sentence. Exactly what a *pro* stands for is determined by the linguistic context in which it appears.

(8) 吉林省主管经贸工作的副省长全哲洙说:"[*pro*] 欢迎国际社会同[我们] 一道,共同推进图门江开发事业, 促进区域经济发展,造福东北亚人民。"
Quan Zhezhu, Vice Governor of Jilin Province who is in charge of economics and trade, said: "[*pro*] Welcome international societies to join [us] in the development of Tumen Jiang, so as to promote regional economic development and benefit people in Northeast Asia."

Sometimes, *pro*s cannot be recovered in the text, i.e., they cannot be replaced by an overt NP in the text. For example, a *pro* in an existential sentence usually cannot be recovered or linked in the annotation, as in the following case:

(9) [*pro*] 有二十三顶高新技术项目进区开发。
There are 23 high-tech projects under development in the zone.

Also, if a *pro* does not refer to a specific entity or event, it is considered a generic *pro* and is not linked. Finally, *pro*s in idiomatic expressions are not linked.

(10) 肯德基 、 麦当劳 等 速食店 全 大陆 都 推出 了 [*pro*] 买 套餐赠送 布质 或 棉质 圣诞老人玩具的促销.
In Mainland China, fast food restaurants such as Kentucky Fried Chicken and McDonald's have launched their promotional packages by providing free cotton Santa toys for each combo [*pro*] purchased.

Similar to Chinese, Arabic null subjects and objects are also eligible for coreference. In the Arabic Treebank, these are marked with just an "*".

4.4 Generic mentions

Generic nominal mentions can be linked with referring pronouns and other definite mentions, but are not linked to other generic nominal mentions.

This would allow linking of the bracketed mentions in (11) and (12), but not (13).

(11) Officials said they are tired of making the same statements.


(12) Meetings are most productive when they are held in the morning. Those meetings, however, generally have the worst attendance.

(13) Allergan Inc. said it received approval to sell the PhacoFlex intraocular lens, the first foldable silicone lens available for *cataract surgery. The lens' foldability enables it to be inserted in smaller incisions than are now possible for *cataract surgery.

Bare plurals, as in (11) and (12), are always considered generic. In example (14) below, there are two generic instances of parents. These are marked as distinct IDENT chains (with separate chains distinguished by the subscripts X, Y and Z), each containing a generic and the related referring pronouns.

(14) ParentsX should be involved with theirX children's education at home, not in school. TheyX should see to it that theirX kids don't play truant; theyX should make certain that the children spend enough time doing homework; theyX should scrutinize the report card. ParentsY are too likely to blame schools for the educational limitations of theirY children. If parentsZ are dissatisfied with a school, theyZ should have the option of switching to another.

In (15) below, the verb "halve" cannot be linked to "a reduction of 50%", since "a reduction" is indefinite.

(15) Argentina said it will ask creditor banks to *halve its foreign debt of $64 billion – the third-highest in the developing world. Argentina aspires to reach *a reduction of 50% in the value of its external debt.

4.5 Pre-modifiers

Proper pre-modifiers can be coreferenced, but proper nouns that are in a morphologically adjectival form are treated as adjectives and are not coreferenced. For example, adjectival forms of GPEs, such as Chinese in "the Chinese leader", would not be linked. Thus we could coreference United States in "the United States policy" with another referent, but not American in "the American policy." GPE and nationality acronyms (e.g., U.S.S.R. or U.S.) are also considered adjectival. Pre-modifier acronyms can be coreferenced unless they refer to a nationality. Thus, in the examples below, FBI can be coreferenced with other mentions, but U.S. cannot.

(16) FBI spokesman

(17) *U.S. spokesman

Dates and monetary amounts can be considered part of a coreference chain even when they occur as pre-modifiers.

(18) The current account deficit on France's balance of payments narrowed to 1.48 billion French francs ($236.8 million) in August from a revised 2.1 billion francs in July, the Finance Ministry said. Previously, the July figure was estimated at a deficit of 613 million francs.

(19) The company's $150 offer was unexpected. The firm balked at the price.

4.6 Copular verbs

Attributes signaled by copular structures are not marked; these are attributes of the referent they modify, and their relationship to that referent will be captured through word sense and propositional argument tagging.

(20) JohnX is a linguist. PeopleY are nervous around JohnX, because heX always corrects theirY grammar.

Copular (or 'linking') verbs are those verbs that function as a copula and are followed by a subject complement. Some common copular verbs are: be, appear, feel, look, seem, remain, stay, become, end up, get. Subject complements following such verbs are considered attributes and are not linked. Since Called is copular, neither IDENT nor APPOS coreference is marked in the following case.

(21) Called Otto's Original Oat Bran Beer, the brew costs about $12.75 a case.

4.7 Small clauses

Like copulas, small clause constructions are not marked. The following example is treated as if the copula were present ("John considers Fred to be an idiot"):

(22) John considers *Fred *an idiot.

4.8 Temporal expressions

Temporal expressions such as the following are linked:

(23) John spent three years in jail. In that time...


Deictic expressions such as now, then, today, tomorrow, yesterday, etc. can be linked, as can other temporal expressions that are relative to the time of the writing of the article and which may therefore require knowledge of the time of writing to resolve the coreference. Annotators were allowed to use knowledge from outside the text in resolving these cases. In the following example, the end of this period and that time can be coreferenced, as can this period and from three years to seven years.

(24) The limit could range from three years to seven yearsX, depending on the composition of the management team and the nature of its strategic plan. At (the end of (this period)X)Y, the poison pill would be eliminated automatically, unless a new poison pill were approved by the then-current shareholders, who would have an opportunity to evaluate the corporation's strategy and management team at that timeY.

In multi-date temporal expressions, embedded dates are not separately connected to other mentions of that date. For example, in Nov. 2, 1999, Nov. would not be linked to another instance of November later in the text.

4.9 Appositives

Because they logically represent attributions, appositives are tagged separately from identity coreference. They consist of a head, or referent (a noun phrase that points to a specific object/concept in the world), and one or more attributes of that referent. An appositive construction contains a noun phrase that modifies an immediately adjacent noun phrase (separated only by a comma, colon, dash, or parenthesis). It often serves to rename or further define the first mention. Marking appositive constructions allows us to capture the attributed property even though there is no explicit copula.

(25) John[head], a linguist[attribute]

The head of each appositive construction is distinguished from the attribute according to the following heuristic specificity scale, in decreasing order from top to bottom:

Type                     Example
Proper noun              John
Pronoun                  He
Definite NP              the man
Indefinite specific NP   a man I know
Non-specific NP          man

This leads to the following cases:

(26) John[head], a linguist[attribute]

(27) A famous linguist[attribute], he[head] studied at ...

(28) a principal of the firm[attribute], J. Smith[head]

In cases where the two members of the appositive are equivalent in specificity, the left-most member of the appositive is marked as the head/referent. Definite NPs include NPs with a definite marker (the) as well as NPs with a possessive adjective (his). Thus the first element is the head in all of the following cases:

(29) The chairman, the man who never gives up

(30) The sheriff, his friend

(31) His friend, the sheriff

In the specificity scale, specific names of diseases and technologies are classified as proper names, whether they are capitalized or not.

(32) A dangerous bacteria, bacillium, is found
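The head-selection heuristic just described (rank the two members on the specificity scale and, on a tie, pick the left-most member) is simple enough to sketch in code. In the minimal Python sketch below, the type labels and the assumption that each member's type is already known are illustrative only; they are not part of the OntoNotes annotation tooling.

    # Specificity scale from Section 4.9, most specific first.
    SPECIFICITY = ["proper_noun", "pronoun", "definite_np",
                   "indefinite_specific_np", "non_specific_np"]
    RANK = {label: i for i, label in enumerate(SPECIFICITY)}

    def appositive_head(left, right):
        """Pick the head of a two-member appositive.
        `left` and `right` are (text, type) pairs in surface order; the
        more specific member is the head, and the left-most wins a tie."""
        return right if RANK[right[1]] < RANK[left[1]] else left

    # Examples (26)-(28): the head is John, he, and J. Smith respectively.
    print(appositive_head(("John", "proper_noun"),
                          ("a linguist", "indefinite_specific_np")))
    print(appositive_head(("A famous linguist", "indefinite_specific_np"),
                          ("he", "pronoun")))
    print(appositive_head(("a principal of the firm", "indefinite_specific_np"),
                          ("J. Smith", "proper_noun")))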

When the entity to which an appositive refers is also mentioned elsewhere, only the single span containing the entire appositive construction is included in the larger IDENT chain. None of the nested NP spans are linked. In the example below, the entire span can be linked to later mentions of Richard Godown.

The sub-spans are not included separately in the IDENT chain.

(33) Richard Godown, president of the Industrial Biotechnology Association

Ages are tagged as attributes (as if they were ellipses of, for example, a 42-year-old):

(34) Mr. Smith[head], 42[attribute]

Similar rules apply for Chinese and Arabic.

4.10 Special Issues

In addition to the ones above, there are some special cases, such as:

• No coreference is marked between an organization and its members.

• GPEs are linked to references to their governments, even when the references are nested NPs, or the modifier and head of a single NP.


Type                Description

Annotator Error     An annotator error. This is a catch-all category for cases of errors that do not fit in the other categories.
Genuine Ambiguity   This is just genuinely ambiguous. Often the case with pronouns that have no clear antecedent (especially this and that).
Generics            One person thought this was a generic mention, and the other person didn't.
Guidelines          The guidelines need to be clear about this example.
Callisto Layout     Something to do with the usage/design of Callisto.
Referents           Each annotator thought this was referring to two completely different things.
Possessives         One person did not mark this possessive.
Verb                One person did not mark this verb.
Pre Modifiers       One person did not mark this pre-modifier.
Appositive          One person did not mark this appositive.
Extent              Both people marked the same entity, but one person's mention was longer.
Copula              Disagreement arose because this mention is part of a copular structure:
                    a) either each annotator marked a different half of the copula, or
                    b) one annotator unnecessarily marked both.

Figure 1: Description of the various disagreement types

Copulae             2%
Appositives         3%
Pre Modifiers       3%
Verbs               3%
Possessives         4%
Referents           7%
Callisto Layout     8%
Guidelines          8%
Generics           11%
Genuine Ambiguity  25%
Annotator Error    26%

Figure 2: The distribution of disagreements across the various types in Figure 1, for a sample of 15K disagreements in the English portion of the corpus.

• In extremely rare cases, metonymic mentions can be coreferenced. This is done only when the two mentions clearly and without a doubt refer to the same entity. For example:

(35) In a statement released this afternoon, [10 Downing Street] called the bombings in Casablanca "a strike against all peace-loving people."

(36) In a statement, [Britain] called the Casablanca bombings "a strike against all peace-loving people."

In this case, it is obvious that "10 Downing Street" and "Britain" are being used interchangeably in the text. If there is any ambiguity, however, these terms are not coreferenced with each other.

• In Arabic, verbal inflections are not considered pronominal and are not coreferenced.

4.11 Annotator Agreement and Analysis

Table 1 shows the inter-annotator and annotator-adjudicator agreement on all the genres of OntoNotes. We also analyzed about 15K disagreements in various parts of the English data, and grouped them into the categories shown in Figure 1. Figure 2 shows the distribution of these different types in that sample. It can be seen that genuine ambiguity and annotator error are the biggest contributors; the latter is usually caught during adjudication, which explains the increased agreement between the adjudicated version and the individual annotator versions.

5 CoNLL-2012 Coreference Task

The CoNLL-2012 shared task was held across all three languages of the OntoNotes 5.0 data, which come from quite different language families: English, Chinese and Arabic. The task was to automatically identify mentions of entities and events in text and to link the coreferring mentions together to form entity/event chains.

Language  Genre                    A1-A2   A1-ADJ   A2-ADJ

English   Newswire                  80.9     85.2     88.3
          Broadcast News            78.6     83.5     89.4
          Broadcast Conversation    86.7     91.6     93.7
          Magazine                  78.4     83.2     88.8
          Web                       85.9     92.2     91.2
          Call Home                 81.3     94.1     84.7
          New Testament             89.4     96.0     92.0

Chinese   Newswire                  73.6     84.8     75.1
          Broadcast News            80.5     86.4     91.6
          Broadcast Conversation    84.1     90.7     91.2
          Magazine                  74.9     81.2     80.0
          Web                       87.6     92.3     93.5
          Call Home                 65.6     86.6     77.1

Arabic    Newswire                  73.8     88.1     75.6

Table 1: Inter-annotator and adjudicator agreement for the coreference layer in OntoNotes, measured in terms of the MUC score (A1, A2: the two annotators; ADJ: the adjudicated version).
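Since Table 1 reports agreement in terms of the MUC score, the following minimal Python sketch shows the link-based MUC recall/precision computation of Vilain et al. (1995) on chains represented as sets of mention identifiers. The chain representation is an assumption for illustration, and this is not the official scorer used for the task.

    def muc_recall(key_chains, response_chains):
        """MUC link-based recall: sum(|S| - |p(S)|) / sum(|S| - 1), where
        p(S) partitions each key chain S by the response chains and every
        uncovered mention becomes a singleton part.  Precision is the same
        computation with the roles of key and response swapped."""
        num = den = 0
        for chain in key_chains:
            parts = [chain & r for r in response_chains if chain & r]
            covered = set().union(*parts) if parts else set()
            partition_size = len(parts) + len(chain - covered)
            num += len(chain) - partition_size
            den += len(chain) - 1
        return num / den if den else 0.0

    def muc_f1(key_chains, response_chains):
        r = muc_recall(key_chains, response_chains)
        p = muc_recall(response_chains, key_chains)   # precision, by symmetry
        return 2 * p * r / (p + r) if p + r else 0.0

    # Toy example: one key chain split across two response chains.
    key = [{"a", "b", "c"}]
    response = [{"a", "b"}, {"c", "d"}]
    print(muc_recall(key, response))   # (3 - 2) / (3 - 1) = 0.5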

The coreference decisions had to be made using automatically predicted information on the other structural and semantic layers, including the parses, semantic roles, word senses, and named entities. Given various factors, such as the lack of resources, state-of-the-art tools, and time constraints, we could not provide some layers of information for the Chinese and Arabic portions of the data. The morphology of these languages is quite different: Arabic has a complex morphology, English has limited morphology, whereas Chinese has no morphology. English word segmentation amounts to rule-based tokenization and is close to perfect; for Chinese and Arabic, although segmentation is not as good as for English, the accuracies are in the high 90s. Syntactically, there are many dropped subjects and objects in Arabic and Chinese, whereas English has none. Another difference is the amount of resources available for each language: English probably has the most resources at its disposal, whereas Chinese and Arabic lag significantly behind, Arabic more so than Chinese. Given this, plus the fact that the CoNLL format cannot handle multiple segmentations, and that alternative segmentations would complicate scoring since we are using exact token boundaries (as discussed later in Section xxx), we decided to allow the use of gold treebank segmentation for all languages. In the case of Chinese, the words themselves are lemmas, so no additional information needs to be provided. For Arabic, we decided to also provide correct, gold lemmas, along with the correct vocalized version of the tokens. Table 2 summarizes the layers of information provided for each language.

As is customary for CoNLL tasks, there were two primary tracks: closed and open.

Layer             English   Chinese   Arabic

Segmentation        •         •         •
Lemma               √         –         •
Parse               √         √         √5
Proposition         √         √         ×
Predicate Frame     √         ×         ×
Word Sense          √         √         √
Named Entities      √         ×         ×
Speaker             •         •         –

Table 2: Summary of the annotation layers provided for each language. A "•" indicates gold annotation and a "√" indicates predicted annotation.

For the closed track, systems were limited to using the distributed resources, in order to allow a fair comparison of algorithm performance, while the open track allowed almost unrestricted use of external resources in addition to the provided data. Within each of the closed and open tracks, we had an optional supplementary track which allowed us to run some ablation studies over a few different input conditions. This allowed us to evaluate the systems given: i) gold parses; ii) gold mention boundaries; and iii) gold mentions.

5.1 Primary Evaluation

The primary evaluation comprises the closed and open tracks, where predicted information is provided for all layers of the test set other than coreference. As mentioned earlier, we provide gold lemma and vocalization information for Arabic, and we use gold treebank segmentation for all three languages.

5.1.1 Closed Track

In the closed track, systems were limited to the provided data. For the training and test data, in addition to the underlying text, predicted versions of all the supplementary layers of annotation were provided using off-the-shelf tools (parsers, semantic role labelers, named entity taggers, etc.) retrained on the training portion of the OntoNotes v5.0 data, as described in Section 5.4.2. For the training data, however, in addition to predicted values for the other layers, we also provided the manual gold-standard annotations for all the layers. Participants were allowed to use either the gold-standard or the predicted annotation for training their systems. They were also free to use the gold-standard data to train their own models for the various layers of annotation, if they judged that those would provide either more accurate predictions or alternative predictions for use as multiple views, or if they wished to use a lattice of predictions.

More so than in previous CoNLL tasks, coreference predictions depend on world knowledge, and many state-of-the-art systems use information from external resources such as WordNet, which can add a layer of information that helps a system recognize semantic connections between the various lexicalized mentions in the text. Therefore, in the case of English, as in the previous year's task, we allowed the use of WordNet for the closed track. Since word senses in OntoNotes are predominantly6 coarse-grained groupings of WordNet senses, systems could also map from the predicted or gold-standard word senses to the sets of underlying WordNet senses. Another significant piece of knowledge that is particularly useful for coreference but is not available in the layers of OntoNotes is number and gender. There are many different ways of predicting these values, with differing accuracies, so in order to ensure that participants in the closed track were working from the same data, thus allowing clearer algorithmic comparisons, we specified a particular table of number and gender predictions generated by Bergsma and Lin (2006), for use during both training and testing. Unfortunately, neither Arabic nor Chinese has comparable resources available that we could allow participants to use.
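The number and gender data of Bergsma and Lin (2006) is commonly distributed as a plain-text table mapping a phrase to counts of masculine, feminine, neuter, and plural contexts. The file layout, names, and lookup in the Python sketch below are assumptions for illustration, not a specification of the resource that was distributed for the task.

    from collections import namedtuple

    GenderCounts = namedtuple("GenderCounts", "masculine feminine neuter plural")

    def load_gender_table(path):
        """Load a table assumed to have lines of the form
        '<phrase> TAB <masc> <fem> <neut> <plural>'."""
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                phrase, counts = line.rstrip("\n").split("\t")
                table[phrase] = GenderCounts(*(int(c) for c in counts.split()))
        return table

    def likely_class(table, head_word):
        """Return the majority gender/number class for a mention head,
        or 'unknown' if the head is not in the table."""
        counts = table.get(head_word.lower())
        if counts is None:
            return "unknown"
        return max(counts._fields, key=lambda field: getattr(counts, field))

    # Hypothetical usage (file name assumed):
    # table = load_gender_table("gender.data")
    # likely_class(table, "president")   # e.g. "masculine"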

5.1.2 Open Track

In addition to the resources available in the closed track, systems in the open track were allowed to use external resources such as Wikipedia, gazetteers, etc. This track is mainly intended to give an idea of a performance ceiling on the task, at the cost of not providing a comparison across all systems. Another advantage of the open track is that it might reduce the barriers to participation by allowing participants to field existing research systems that already depend on external resources, especially if there are hard dependencies on those resources, so that they can participate in the task with minimal or no modification to their existing systems.

5.2 Supplementary Evaluation

In addition to the option of selecting between the official closed and open tracks, participants also had the option to run their systems on some ablation settings.

Gold Parses In this case, for each language, we replaced the predicted parses in the closed track data with manual, gold parses.

6 There are a few instances of novel senses introduced in OntoNotes which were not present in WordNet, and so lack a mapping back to the WordNet senses.

Gold Mention Boundaries In this case, we provided all possible correct mention boundaries in the test data. This essentially entails all NPs and PRPs extracted from the gold parse trees, as well as the mentions that do not align with any parse constituent, for example verb mentions and some named entities.

Gold Mentions In this dataset, we provided all and only the correct mentions for the test sets, thereby reducing the task to one of pure coreference linking, and eliminating the task of mention detection and anaphoricity determination7.

5.3 Train, Development and Test Splits

We used the same algorithm as in CoNLL-2011 to create the train/development/test partitions for English, Chinese and Arabic. We tried to reuse previously used training/development/test partitions for Chinese and Arabic, but they either were not in the selection used for OntoNotes or were partially overlapping; unfortunately, unlike the English WSJ partitions, there was no clean way of reusing them. Algorithm 1 details the procedure. The lists of training, development and test document IDs can be found on the task webpage8. Following recent CoNLL tradition, participants were allowed to use both the training and the development data for training their final model.

5.4 Data Preparation

This section gives details of the different annotation layers, including the automatic models that were used to predict them, and describes the formats in which the data were provided to the participants.

5.4.1 Manual Annotation Gold Layers

We will take a look at the manually annotated, or gold, layers of information that were made available for the training data.

Coreference The manual coreference annotation is stored as chains of linked mentions connecting multiple mentions of the same entity.

7 Mention detection interacts with anaphoricity determination, since the corpus does not contain any singleton mentions.

8 http://conll.bbn.com/download/conll-train.id, http://conll.bbn.com/download/conll-dev.id, http://conll.bbn.com/download/conll-test.id. These are more general lists which include documents that had at least one of the layers of annotation; in other words, they also include documents that do not have any coreference annotation.


Algorithm 1 Procedure used to create the OntoNotes training, development and test partitions.
Procedure: Generate_Partitions(OntoNotes) returns Train, Dev, Test

 1: Train ← ∅
 2: Dev ← ∅
 3: Test ← ∅
 4: for all Source ∈ OntoNotes do
 5:   if Source = Wall Street Journal then
 6:     Train ← Train ∪ Sections 02–21
 7:     Dev ← Dev ∪ Sections 00, 01, 22, 24
 8:     Test ← Test ∪ Section 23
 9:   else
10:     if Number of files in Source ≥ 10 then
11:       Train ← Train ∪ File IDs ending in 1–8
12:       Dev ← Dev ∪ File IDs ending in 0
13:       Test ← Test ∪ File IDs ending in 9
14:     else
15:       Dev ← Dev ∪ File IDs ending in 0
16:       Test ← Test ∪ File ID ending in the highest number
17:       Train ← Train ∪ Remaining File IDs for the Source
18:     end if
19:   end if
20: end for
21: return Train, Dev, Test
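A minimal Python rendering of Algorithm 1 is sketched below. The representation of sources and document IDs (a digit-final ID string, with WSJ sections readable from the ID) is an assumption for illustration; real OntoNotes IDs would have to be parsed from the release file names.

    WSJ_TRAIN = {f"{s:02d}" for s in range(2, 22)}   # sections 02-21
    WSJ_DEV, WSJ_TEST = {"00", "01", "22", "24"}, {"23"}

    def generate_partitions(sources):
        """`sources` maps a source name to a list of document IDs whose
        final character is a digit (e.g. 'cnn_0001'); WSJ IDs are assumed
        to carry their section as the two digits before the file number."""
        train, dev, test = [], [], []
        for source, doc_ids in sources.items():
            if source == "Wall Street Journal":
                for d in doc_ids:
                    section = d[-4:-2]                 # e.g. 'wsj_2303' -> '23'
                    bucket = (train if section in WSJ_TRAIN else
                              dev if section in WSJ_DEV else
                              test if section in WSJ_TEST else None)
                    if bucket is not None:
                        bucket.append(d)
            elif len(doc_ids) >= 10:
                for d in doc_ids:
                    (dev if d.endswith("0") else
                     test if d.endswith("9") else train).append(d)
            else:
                highest = max(doc_ids)                 # ID ending in the highest number
                for d in doc_ids:
                    (dev if d.endswith("0") else
                     test if d == highest else train).append(d)
        return train, dev, test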

Coreference is the only document-level phenomenon in OntoNotes, and the complexity of annotation increases non-linearly with the length of a document. Unfortunately, some of the documents, especially ones in the broadcast conversation, weblog, and telephone conversation genres, are very long, which prohibited us from efficiently annotating them in their entirety. These had to be split into smaller parts. We conducted a few passes to join some adjacent parts, but since some documents had as many as 17 parts, there are still multi-part documents in the corpus. Since the coreference chains are coherent only within each of these document parts, for this task each such part is treated as a separate document. Another thing to note is that there were some cases of sub-token annotation in the corpus, owing to the fact that tokens were not split at hyphens. In cases such as pro-WalMart, the sub-span WalMart was linked with another instance of Walmart. The recent Treebank revision, which split tokens at most hyphens, made a majority of these sub-token annotations go away. There were still some residual sub-token annotations. Since sub-token annotations cannot be represented in the CoNLL format, and they were a very small quantity, much less than even half a percent, we decided to ignore them. Unlike English, Chinese and Arabic have coreference annotation on elided subjects/objects. Recovering these entities in text is a hard problem; the most recently reported numbers in the literature for Chinese are around an F-score of 50 [cite], and for Arabic are around mention-typical scores. Considering the prediction accuracy for these tokens and their relative frequency, plus the fact that the CoNLL tabular format is not amenable to a variable number of tokens, we decided not to consider them as part of the task. In other words, we removed the manually identified traces (*pro* and *, respectively) in the Chinese and Arabic Treebanks. We also do not consider the links formed by these tokens in the gold evaluation key.

For various reasons, not all the documents in OntoNotes have been annotated with all the different layers of annotation with full coverage.11 There is a core portion, however, of roughly 1.6M English words, 950K Chinese words, and 300K Arabic words, which has been annotated with all the layers. This is the portion that we used for the shared task.

The number of documents in the corpus for this task, for each of the different languages, is shown in Table 3.

11 Given the nature of word sense annotation and changes in project priorities, we could not annotate all the low-frequency verbs and nouns in the corpus. Furthermore, PropBank annotation currently covers mostly verb predicates and only a few noun predicates.


                                       Words                               Documents
Corpora            Language    Total    Train    Dev     Test      Total          Train          Dev        Test

MUC-6              English     25K      12K      –       13K       60             30             –          30
MUC-7              English     40K      19K      –       21K       67             30             –          37
ACE9 (2000-2004)   English     960K     745K     –       215K      –              –              –          –
                   Chinese     615K     455K     –       150K      –              –              –          –
                   Arabic      500K     350K     –       150K      –              –              –          –
OntoNotes10        English     1.6M     1.3M     160K    170K      2,384 (3,493)  1,940 (2,802)  222 (343)  222 (348)
                   Chinese     950K     750K     110K    90K       1,729 (2,280)  1,391 (1,810)  172 (252)  166 (218)
                   Arabic      300K     240K     30K     30K       447 (447)      359 (359)      44 (44)    44 (44)

Table 3: Number of documents in the OntoNotes data, and some comparison with the MUC and ACE data sets. The numbers in parentheses for the OntoNotes corpus indicate the total number of parts that correspond to the documents. Each part was considered a separate document for evaluation purposes.

                        Train             Development       Test
Language  Syntactic     Count      %      Count      %      Count      %
          category

English   NP             81,866   52.89   10,274   53.71   10,357   52.51
          PRP            65,529   42.33    7,705   40.28    8,157    0.41
          NNP             2,902    1.87      478    2.50      503    0.03
          V               2,493    1.61      295    1.54      342    0.02
          Other N         1,024    0.66      226    1.18      191    0.01
          Other             978    0.63      150    0.78      175    0.01

Chinese   NP            101,049   98.94   13,876   98.63   12,593   99.01
          PRP               467    0.46      107    0.76       70    0.55
          NR                 71    0.07        8    0.06        8    0.06
          Other N            75    0.07       11    0.08        3    0.02
          V                 187    0.18       37    0.26       13    0.10
          Other             282    0.28       30    0.21       32    0.25

Arabic    NP             23,157   85.79    2,828   86.94    2,673   84.86
          PRP             2,977   11.03      344   10.57      410   13.02
          NNP               608    2.25       49    1.51       36    1.14
          NN                 71    0.26       10    0.31        8    0.25
          V                  25    0.09        4    0.12        0    0.00
          Other             154    0.57       18    0.55       23    0.73

Table 4: Distribution of mentions in the data by their syntactic category.

Tables 4 and 5 show, respectively, the distribution of mentions by syntactic category and the counts of entities, links and mentions in the corpus. All of this data has been Treebanked and PropBanked, either as part of the OntoNotes effort or of some preceding effort.

For comparison purposes, Table 3 also lists the number of documents in the MUC-6, MUC-7, and ACE (2000-2004) corpora. The MUC-6 data was taken from the Wall Street Journal, whereas the MUC-7 data was from the New York Times. The ACE data spanned many different languages and genres, similar to the ones in OntoNotes.

Parse Trees This represents the syntactic layer, a revised version of the treebanks in English, Chinese and Arabic. The Arabic treebank has probably seen the most revision over the past few years, to increase consistency.

Language  Type               Train    Development     Test       All

English   Entities/Chains    35,143        4,546      4,532    44,221
          Links             120,417       14,610     15,232   150,259
          Mentions          155,560       19,156     19,764   194,480

Chinese   Entities/Chains    28,257        3,875      3,559    35,691
          Links              74,597       10,308      9,242    94,147
          Mentions          102,854       14,183     12,801   129,838

Arabic    Entities/Chains     8,330          936        980    10,246
          Links              19,260        2,381      2,255    23,896
          Mentions           27,590        3,313      3,235    34,138

Table 5: Number of entities, links and mentions in the OntoNotes 5.0 data.

For the purposes of this task, traces were removed from the syntactic trees, since the CoNLL-style data format, being indexed by tokens, does not provide any good means of conveying that information. As mentioned in the previous section, these include the traces in Chinese and Arabic which are legitimate dropped subjects/objects. Function tags were also removed, since the parsers that we used for the predicted syntax layer did not provide them. One thing that needs to be dealt with in conversational data is the presence of disfluencies (restarts, etc.). Tokens that were part of disfluencies were removed from the English portion of the data, but kept in the Chinese portion. Since the Arabic portion of the corpus is all newswire, this had no impact on it. In the English OntoNotes parses, the disfluencies are marked using a special EDITED12 phrase tag, as was the case for the Switchboard Treebank.

12 There is another phrase type, EMBED, in the telephone conversation genre, which is similar to the EDITED phrase type and sometimes identifies insertions, but sometimes contains the logical continuation of phrases, so we decided not to remove it from the data.


Given the frequency of disfluencies and the performance with which one can identify them automatically,13 a probable processing pipeline would filter them out before parsing. Since we did not have a readily available tagger for tagging disfluencies, we decided to remove them using the oracle information available in the Treebank, and the coreference chains were remapped to the trees without disfluencies. Owing to various constraints, we decided to retain the disfluencies in the Chinese data.

Propositions The propositions in OntoNotes constitute PropBank semantic roles. Most of the verb predicates in the corpus have been annotated with their arguments. Recent enhancements to the PropBank, made to synchronize it better with the Treebank (Babko-Malaya et al., 2006), have enriched the propositions with two types of LINKs that represent pragmatic coreference (LINK-PCR) and selectional preferences (LINK-SLC). More details can be found in the addendum to the PropBank guidelines14 in the OntoNotes 5.0 release. Since the community is not used to this representation, which relies heavily on the trace structure in the Treebank that we are excluding, we decided to unfold the LINKs back to their original representation as in Proposition Bank release 1.0. This functionality is part of the OntoNotes DB Tool.15

Word Sense Gold word sense annotation was supplied using sense numbers as specified in the OntoNotes list of senses for each lemma.16

Named Entities Named entities in the OntoNotes data are specified using a catalog of 18 name types.

Other Layers Discourse plays a vital role in coreference resolution. In the case of broadcast conversation or telephone conversation data, it partially manifests in the form of the speaker of a given utterance, whereas in weblogs or newsgroups it does so as the writer or commenter of a particular article or thread. This information provides an important clue for correctly linking anaphoric pronouns with the right antecedents.

13 A study by Charniak and Johnson (2001) shows that one can identify and remove edits from transcribed conversational speech with an F-score of about 78, with roughly 95 precision and 67 recall.

14 doc/propbank/english-propbank.pdf
15 http://cemantix.org/ontonotes.html
16 It should be noted that word sense annotation in OntoNotes is not complete, so only some of the verbs and nouns have word sense tags specified.

This information could be deduced automatically, but since that would add complexity to an already complex task, we decided to provide oracle information for this metadata during both training and testing. In other words, speaker and author identification was not treated as an annotation layer that needed to be predicted. This information was provided in the form of another column in the .conll table. There were some cases of interruptions and interjections that ideally would associate parts of a sentence with two different speakers, but since the frequency of this was quite small, we made the assumption of one speaker/writer per sentence.
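As an illustration of how a system might consume this tabular format, here is a minimal Python sketch that reads the coreference column of a CoNLL-2011/2012-style file into chains. The assumed line and tag conventions are stated in the docstring; this is not the official task reader or scorer.

    import re
    from collections import defaultdict

    def read_conll_chains(path):
        """Collect coreference chains from a CoNLL-2011/2012-style file.
        Assumes '#begin document'/'#end document' comment lines, one token
        per line with whitespace-separated columns, and the coreference tags
        (e.g. '(12', '(5)', '12)', '-') in the last column; nested spans of
        the same entity are not handled.  Check the official task
        documentation before relying on exact column positions."""
        chains = defaultdict(list)      # entity id -> list of (sentence, start, end)
        open_spans = {}                 # entity id -> start token index
        sent_idx = tok_idx = 0
        with open(path, encoding="utf-8") as f:
            for raw in f:
                line = raw.strip()
                if not line:            # blank line ends a sentence
                    sent_idx, tok_idx = sent_idx + 1, 0
                    continue
                if line.startswith("#"):
                    continue
                coref = line.split()[-1]
                for tag in coref.split("|"):
                    m = re.match(r"^(\()?(\d+)(\))?$", tag)
                    if not m:           # '-' means no coreference tag here
                        continue
                    eid = int(m.group(2))
                    if m.group(1):      # '(' opens a span
                        open_spans[eid] = tok_idx
                    if m.group(3):      # ')' closes a span
                        start = open_spans.pop(eid, tok_idx)
                        chains[eid].append((sent_idx, start, tok_idx))
                tok_idx += 1
        return chains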

5.4.2 Predicted Annotation Layers

The predicted annotation layers were derived using automatic models trained using cross-validation on other portions of OntoNotes data. As mentioned earlier, there are some portions of the OntoNotes corpus that have not been annotated for coreference but that have been annotated for other layers. For training models for each of the layers, where feasible, we used all the data that we could for that layer from the training portion of the entire OntoNotes release.

Parse Trees Predicted parse trees for English were produced using the Charniak parser (Charniak and Johnson, 2005).17 Some additional tag types used in the OntoNotes trees were added to the parser's tagset, including the NML tag that has recently been added to capture internal NP structure, and the rules used to determine head words were appropriately extended. Chinese and Arabic parses were generated using the Berkeley parser. In the case of Arabic, the parsing community uses a mapping from rich Arabic part-of-speech tags to Penn-style part-of-speech tags. We used that mapping, which is included with the Arabic treebank. The parser was then re-trained on the training portion of the release 5.0 data using 10-fold cross-validation. Table 6 shows the performance of the re-trained parsers on the CoNLL-2012 test set. We did not get a chance to re-train the re-ranker available for English, and since the stock re-ranker crashes when run on n-best parses containing NMLs, because it has not seen that tag in training, we could not make use of it.

Word Sense This year we used the IMS (It Makes Sense) word sense tagger (Zhong and Ng, 2010).18 It was trained on all the word sense data that is

17 http://bllip.cs.brown.edu/download/reranking-parserAug06.tar.gz

18 We offer special thanks to Hwee Tou Ng and his student Zhong Zi for training IMS models and providing output for the development and test sets.


                                 All Sentences                        Sentence len < 40
                                 N       POS    R      P      F       N       R      P      F
English
 Broadcast Conversation (BC)     2,194   95.93  84.30  84.46  84.38   2,124   85.83  85.97  85.90
 Broadcast News (BN)             1,344   96.50  84.19  84.28  84.24   1,278   85.93  86.04  85.98
 Magazine (MZ)                   780     95.14  87.11  87.46  87.28   736     87.71  88.04  87.87
 Newswire (NW)                   2,273   96.95  87.05  87.45  87.25   2,082   88.95  89.27  89.11
 Telephone Conversation (TC)     1,366   93.52  79.73  80.83  80.28   1,359   79.88  80.98  80.43
 Weblogs and Newsgroups (WB)     1,658   94.67  83.32  83.20  83.26   1,566   85.14  85.07  85.11
 New Testament                   1,217   96.87  92.48  93.66  93.07   1,217   92.48  93.66  93.07
 Overall                         9,615   96.03  85.25  85.43  85.34   9,145   86.86  87.02  86.94
Chinese
 Broadcast Conversation (BC)     885     94.79  79.35  80.17  79.76   824     80.92  81.86  81.38
 Broadcast News (BN)             929     93.85  80.13  83.49  81.78   756     81.82  84.65  83.21
 Magazine (MZ)                   451     97.06  83.85  88.48  86.10   326     85.64  89.80  87.67
 Newswire (NW)                   481     94.07  77.28  82.26  79.69   406     79.06  83.84  81.38
 Telephone Conversation (TC)     968     92.22  69.19  71.90  70.52   942     69.59  72.24  70.89
 Weblogs and Newsgroups (WB)     758     92.37  78.92  82.57  80.70   725     79.30  83.10  81.16
 Overall                         4,472   94.12  78.93  82.23  80.55   3,979   79.80  82.79  81.27
Arabic
 Newswire (NW)                   1,003   94.12  75.67  74.71  75.19   766     77.44  74.99  76.19
 Overall                         1,003   94.12  75.67  74.71  75.19   766     77.44  74.99  76.19

Table 6: Parser performance on the CoNLL-2012 test set

present in the training portion of the OntoNotes corpus, using cross-validated predictions on the input layers, as with the proposition tagging. During testing, for English, IMS first uses the automatic POS tagger to identify the nouns and verbs in the test data, and then assigns senses to the automatically identified nouns and verbs. Since automatic POS tagging is not perfect, IMS does not always output a sense for all word tokens that need to be sense tagged, owing to wrongly predicted POS tags. As such, recall is not the same as precision on the English test data. For Chinese, sense tags are defined with respect to a Chinese lemma. Since we provide gold word segmentation, IMS attempts to sense tag all correctly segmented Chinese words, so recall and precision are the same, as is F1. Note that in Chinese, the word senses are defined against lemmas and are independent of the part of speech. For Arabic, sense tags are defined with respect to a lemma. Since we used gold standard lemmas, IMS attempts to sense tag all correctly determined lemmas, so in this case also the recall and precision are identical, as is F1. Table 7 gives the number of lemmas covered by the word sense inventory in the English, Chinese and Arabic portions of OntoNotes.

Table 8 shows the performance of this classifier over both the verbs and nouns in the CoNLL-2012 test set.

For English, the pt and tc genres, and for Chinese the tc and wb genres, had no gold standard senses available, and so their accuracies could not be computed.

                     English          Chinese   Arabic
Layer                Verb    Noun     All       Verb    Noun
Sense Inventories    2702    2194     763       150     111
Frames               5672    1335     20134     2743    532

Table 7: Number of senses defined for English, Chinese and Arabic in the OntoNotes v5.0 corpus

Propositions To predict propositional structure, ASSERT19 (Pradhan et al., 2005) was used, also re-trained on all of the training portion of the OntoNotes 5.0 data using cross-validated predicted parses. Given time constraints, we had to perform two modifications: i) instead of a single model that predicts all arguments including NULL arguments, we had to use the two-stage mode where the NULL arguments are first filtered out and the remaining NON-NULL arguments are classified into one of the argument types, and ii) the argument identification module used an ensemble of ten classifiers – each trained on a tenth of the training data – and performed an unweighted vote among them. This should still give close to state-of-the-art performance, given that argument identification performance tends to become asymptotic around 10k training instances. The CoNLL-2005 scorer was used to compute the scores. At first glance, the performance on the newswire genre is much lower than what has been reported for WSJ Section 23. This could be attributed to two factors: i) the fact that we had to compromise on the training method, but more importantly

19 http://cemantix.org/assert.html


                                Accuracy
                                R      P      F
English
 Broadcast Conversation (BC)    81.2   81.3   81.2
 Broadcast News                 82.0   81.5   81.7
 Magazine                       79.1   78.8   79.0
 Newswire                       85.7   85.7   85.7
 Weblogs                        77.5   77.6   77.5
 All                            82.5   82.5   82.5
Chinese
 Broadcast Conversation         -      -      80.5
 Broadcast News                 -      -      85.4
 Magazine                       -      -      82.4
 Newswire                       -      -      89.1
 All                            -      -      84.3
Arabic
 Newswire                       -      -      77.6
 All                            -      -      77.6

Table 8: Word sense performance over both verbs and nouns in the CoNLL-2012 test set

                                 Frameset   Total       Total          % Perfect      Argument ID + Class
                                 Accuracy   Sentences   Propositions   Propositions   P       R       F
English
 Broadcast Conversation (BC)     0.92       2,037       5,021          52.18          82.55   64.84   72.63
 Broadcast News (BN)             0.91       1,252       3,310          53.66          81.64   64.46   72.04
 Magazine (MZ)                   0.89       780         2,373          47.16          79.98   61.66   69.64
 Newswire (NW)                   0.93       1,898       4,758          39.72          80.53   62.68   70.49
 Telephone Conversation (TC)     0.90       1,366       1,725          45.28          79.60   63.41   70.59
 Weblogs and Newsgroups (WB)     0.92       929         2,174          39.19          81.01   60.65   69.37
 Pivot Corpus (PT)               0.92       1,217       2,853          50.54          86.40   72.61   78.91
 Overall                         0.91       9,479       24,668         44.69          81.47   61.56   70.13
Chinese
 Broadcast Conversation (BC)     -          885         2,323          31.34          53.92   68.60   60.38
 Broadcast News (BN)             -          929         4,419          35.44          64.34   66.05   65.18
 Magazine (MZ)                   -          451         2,620          31.68          65.04   65.40   65.22
 Newswire (NW)                   -          481         2,210          27.33          69.28   55.74   61.78
 Telephone Conversation (TC)     -          968         1,622          32.74          48.70   59.12   53.41
 Weblogs and Newsgroups (WB)     -          758         1,761          35.21          62.35   68.87   65.45
 Overall                         -          4,472       14,955         32.62          61.26   64.48   62.83

Table 9: Performance on the propositions and framesets in the CoNLL-2012 test set.


Framesets   Lemmas
1           2,722
2           321
> 2         181

Table 10: Frameset polysemy across lemmas

because ii) the newswire in OntoNotes not only contains WSJ data, but also Xinhua news. One could try to verify this using just the WSJ portion of the data, but it would be hard, as it is not only a subset of the documents on which performance has been reported previously, but the annotation has also been significantly revised; it includes propositions for be verbs missing from the original PropBank, and the training data is a subset of the original data as well. The newly added New Testament data shows very good performance. This is not surprising, since the same is the trend for the automatic parses. Table 9 shows the detailed performance numbers. In addition to automatically predicting the arguments, we also trained a classifier to tag PropBank frameset IDs for the English data. Table 7 lists the number of framesets available across the three languages.20 An overwhelming number of them are monosemous, but the more frequent verbs tend to be polysemous. Table 10 gives the distribution of the number of framesets per lemma in the PropBank layer of the English OntoNotes 5.0 data. During automatic processing of the data, we tagged all the tokens that were assigned a part of speech VBx. This means that there would be cases where the wrong token would be tagged with propositions.
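The unweighted voting scheme described above for argument identification (ten classifiers, each trained on a different tenth of the training data) can be sketched roughly as follows. The use of scikit-learn and the helper names are assumptions made for illustration; ASSERT itself is an SVM-based system implemented quite differently.

```python
# Rough sketch of an unweighted-vote ensemble for argument identification.
# Illustrative only; not the actual ASSERT implementation.
import numpy as np
from sklearn.svm import LinearSVC

def train_ensemble(X, y, n_parts=10, seed=0):
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), n_parts)
    models = []
    for fold in folds:
        clf = LinearSVC()
        clf.fit(X[fold], y[fold])       # each model sees one tenth of the data
        models.append(clf)
    return models

def predict_by_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_examples)
    # a constituent is kept as a candidate argument if more than half of the
    # models vote for it (labels assumed to be 0/1)
    return (votes.mean(axis=0) > 0.5).astype(int)
```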

Named Entities BBN's IdentiFinder™ system was used to predict the named entities. Given the time constraints, we could not re-train it on the Chinese and Arabic data, so we only re-trained it on the English portion of the OntoNotes training data. Table 11 shows the overall performance of the tagger on the CoNLL-2012 English test set, as well as the performance broken down by individual name types.

Other Layers As noted earlier, systems were allowed to make use of gender and number predictions for NPs using the table from Bergsma and Lin (2006), and

20 The number of lemmas for English in Table 10 does not add up to this number because not all of them have examples in the training data, where the total number of instantiated senses amounts to 4229.

the speaker metadata for broadcast conversations and telephone conversations, and the author or poster metadata for weblogs and newsgroups.

5.4.3 Data Format

In order to organize the multiple, rich layers of annotation, the OntoNotes project has created a database representation for the raw annotation layers along with a Python API to manipulate them (Pradhan et al., 2007a). In the OntoNotes distribution the data is organized as one file per layer, per document. The API requires a certain hierarchical structure with documents at the leaves inside a hierarchy of language, genre, source and section. It comes with various ways of cleanly querying and manipulating the data and allows convenient access to the sense inventory and PropBank frame files instead of having to interpret the raw .xml versions. However, maintaining format consistency with earlier CoNLL tasks was deemed convenient for sites that already had tools configured to deal with that format. Therefore, in order to distribute the data so that one could make the best of both worlds, we created a new file type called .conll which logically served as another layer in addition to the .parse, .prop, .name and .coref layers. Each .conll file contained a merged representation of all the OntoNotes layers in the CoNLL-style tabular format, with one line per token, multiple columns for each token specifying the input annotation layers relevant to that token, and the final column specifying the target coreference layer. Because OntoNotes is not authorized to distribute the underlying text, and many of the layers contain inline annotation, we had to provide a skeletal form (.skel) of the .conll file, which was essentially the .conll file but with the word column replaced with a dummy string. We provided an assembly script that participants could use to create a .conll file taking as input the .skel file and the top-level directory of the OntoNotes distribution that they had separately downloaded from the LDC.21 Once the .conll file is created, it can be used to create the individual layers such as .parse, .name, .coref, etc. In the CoNLL-2011 task there were a few issues where some teams accidentally used the test data during training. To prevent this, this year we distributed the data in two installments. The first one was for training and development and the other for testing. The test data release from LDC did not contain the coreference layer. Unlike previous CoNLL tasks, this test data contained some truly

21 OntoNotes is deeply grateful to the Linguistic Data Consortium for making the source data freely available to the task participants.


                  All      BC      BN      MZ      NW      TC      WB
                  F        F       F       F       F       F       F
English
 Cardinal         68.76    58.52   75.34   72.57   83.62   32.26   57.14
 Date             78.60    73.46   80.61   71.60   84.12   63.89   65.48
 Event            44.63    30.77   50.00   36.36   50.00   0.00    66.67
 Facility         47.29    64.20   43.14   40.00   54.17   0.00    28.57
 GPE              89.77    89.40   93.83   92.87   92.56   81.19   91.36
 Language         47.06    -       75.00   50.00   33.33   22.22   66.67
 Law              48.00    0.00    100.00  0.00    50.98   0.00    100.00
 Location         59.00    54.55   61.36   54.84   67.10   -       44.44
 Money            75.45    33.33   63.64   77.78   79.12   92.31   58.18
 NORP             88.58    94.55   93.92   94.87   90.70   78.05   85.15
 Ordinal          71.39    74.16   80.49   79.07   74.34   84.21   55.17
 Organization     76.00    60.90   78.57   69.97   84.76   48.98   51.08
 Percent          89.11    100.00  83.33   75.00   91.41   83.33   72.73
 Person           78.75    93.35   94.36   87.47   85.80   73.39   76.49
 Product          52.76    0.00    77.65   0.00    42.55   0.00    0.00
 Quantity         50.00    17.14   66.67   62.86   81.82   0.00    30.77
 Time             60.65    66.13   67.33   66.67   64.29   27.03   55.56
 Work of Art      34.03    42.42   35.62   28.57   54.24   0.00    8.70
 All NE           77.95    77.02   84.95   80.33   84.73   62.17   69.47

Table 11: Named Entity performance on the CoNLL-2012 test set

Column   Type                    Description
1        Document ID             This is a variation on the document filename.
2        Part number             Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
3        Word number             This is the word index in the sentence.
4        Word                    The word itself.
5        Part of Speech          Part of speech of the word.
6        Parse bit               This is the bracketed structure broken before the first open parenthesis in the parse, with the
                                 word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the
                                 asterisk with the ([pos] [word]) string (or leaf) and concatenating the items in the rows of
                                 that column.
7        Predicate lemma         The predicate lemma is mentioned for the rows for which we have semantic role information.
                                 All other rows are marked with a -.
8        Predicate Frameset ID   This is the PropBank frameset ID of the predicate in Column 7.
9        Word sense              This is the word sense of the word in Column 3.
10       Speaker/Author          This is the speaker or author name where available. Mostly in Broadcast Conversation and
                                 Web Log data.
11       Named Entities          This column identifies the spans representing various named entities.
12:N     Predicate Arguments     There is one column of predicate argument structure information for each predicate
                                 mentioned in Column 7.
N        Coreference             Coreference chain information encoded in a parenthesis structure.

Table 12: Format of the .conll file used on the shared task
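The substitution rule described for the parse-bit column (column 6) can be sketched as follows. This is an illustration of the reconstruction procedure in Table 12, not part of the official tooling; the example sentence and its bracketing are invented for the demonstration.

```python
# Sketch: rebuild a bracketed parse from per-token "parse bit" values by
# replacing each "*" with "(POS word)" and concatenating the pieces.
def rebuild_parse(rows):
    """rows: list of (word, pos, parse_bit) triples for one sentence."""
    pieces = []
    for word, pos, bit in rows:
        pieces.append(bit.replace("*", f"({pos} {word})"))
    return " ".join(pieces)

# Invented example in the spirit of Figure 3:
sentence = [("They", "PRP", "(TOP(S(NP*)"),
            ("ran",  "VBD", "(VP*"),
            ("home", "NN",  "(NP*))"),
            (".",    ".",   "*))")]
print(rebuild_parse(sentence))
# (TOP(S(NP(PRP They)) (VP(VBD ran) (NP(NN home))) (. .)))
```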


#begin document (nw/wsj/07/wsj_0771); part 000
   [sample token rows omitted: one line per token, with the columns described in Table 12]
#end document

Figure 3: Sample portion of the .conll file.

unseen documents, which made it easier to spot potential training errors such as the ones identified in the CoNLL-2011 task. Since the propositions and word sense layers are inherently standoff annotation, they were provided as is, and did not require that extra merging step. One thing that made the English data creation process a bit tricky was the fact that we had dissected some of the trees for the conversation data to remove the EDITED phrases. Table 12 describes the data provided in each of the columns of the .conll format. Figure 3 shows a sample from a .conll file.
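The parenthesis encoding in the final coreference column can be decoded into per-entity token spans roughly as follows. The helper names are illustrative assumptions; the official scorer has its own reader.

```python
# Sketch: decode the coreference column of a .conll sentence (values such as
# "(8", "8)", "(15)" or "(8|(0)") into entity -> list of (start, end) spans.
from collections import defaultdict

def read_coref_column(tags):
    open_spans = defaultdict(list)      # entity id -> stack of open start indices
    entities = defaultdict(list)        # entity id -> list of (start, end) spans
    for i, tag in enumerate(tags):
        if tag == "-":
            continue
        for part in tag.split("|"):
            eid = part.strip("()")
            if part.startswith("("):
                open_spans[eid].append(i)
            if part.endswith(")"):
                start = open_spans[eid].pop()
                entities[eid].append((start, i))
    return dict(entities)

# Simplified example in the spirit of Figure 3:
print(read_coref_column(["-", "(8|(0)", "-", "8)", "-"]))
# {'0': [(1, 1)], '8': [(1, 3)]}
```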

5.5 Evaluation

This section describes the evaluation criteria used. Unlike for propositions, word sense and named entities, where it is simply a matter of counting the correct answers, or for parsing, where there are several established metrics, evaluating the accuracy of coreference continues to be contentious. Various alternative metrics have been proposed, as mentioned below, which weight different features of a proposed coreference pattern differently. The choice is not clear in part because the value of a particular set of coreference predictions is integrally tied to the consuming application. A further issue in defining

a coreference metric concerns the granularity of the mentions, and how closely the predicted mentions are required to match those in the gold standard for a coreference prediction to be counted as correct. Our evaluation criterion was in part driven by the OntoNotes data structures. OntoNotes coreference distinguishes between identity coreference and appositive coreference, treating the latter separately because it is already captured explicitly by other layers of the OntoNotes annotation. Thus we evaluated systems only on the identity coreference task, which links all categories of entities and events together into equivalence classes. The situation with mentions for OntoNotes is also different than it was for MUC or ACE. OntoNotes data does not explicitly identify the minimum extents of an entity mention, but it does include hand-tagged syntactic parses. Thus, for the official evaluation, we decided to use the exact spans of mentions for determining correctness. The NP boundaries for the test data were pre-extracted from the hand-tagged Treebank for annotation, and events triggered by verb phrases were tagged using the verbs themselves. This choice means that scores for the CoNLL-2012 coreference task are likely to be lower than for coreference evaluations based on MUC,


where the mention spans are specified in the input,22

or those based on ACE data, where an approximate match is often allowed based on the specified head of the NP mention.

5.5.1 Metrics

As noted above, the choice of an evaluation metric for coreference has been a tricky issue and there does not appear to be any silver bullet approach that addresses all the concerns. Three metrics have been proposed for evaluating coreference performance over an unrestricted set of entity types: i) the link-based MUC metric (Vilain et al., 1995), ii) the mention-based B-CUBED metric (Bagga and Baldwin, 1998) and iii) the entity-based CEAF (Constrained Entity Aligned F-measure) metric (Luo, 2005). Very recently the BLANC (BiLateral Assessment of Noun-Phrase Coreference) measure (Recasens and Hovy, 2011) has been proposed as well. Each of these metrics tries to address the shortcomings or biases of the earlier metrics. Given a set of key entities K and a set of response entities R, with each entity comprising one or more mentions, each metric generates its own variation of a precision and recall measure. The MUC measure is the oldest and most widely used. It focuses on the links (or, pairs of mentions) in the data.23 The number of common links between entities in K and R divided by the number of links in K represents the recall, whereas precision is the number of common links between entities in K and R divided by the number of links in R. This metric prefers systems that have more mentions per entity – a system that creates a single entity out of all the mentions will get 100% recall without significant degradation in its precision. And it ignores recall for singleton entities, or entities with only one mention. The B-CUBED metric tries to address MUC's shortcomings by focusing on the mentions, and computes recall and precision scores for each mention. If K is the key entity containing mention M, and R is the response entity containing mention M, then recall for mention M is computed as |K ∩ R| / |K| and precision for mention M is computed as |K ∩ R| / |R|. Overall recall and precision are the average of the individual mention scores. CEAF aligns every response entity with at most one key entity by finding the best one-to-one mapping between the entities using an entity similarity metric. This is a maximum bipartite matching problem and can be solved by the Kuhn-Munkres algorithm.

22 As is the case in this evaluation with gold mentions.
23 The MUC corpora did not tag single-mention entities.

This is thus an entity-based measure. Depending on the similarity, there are two variations – an entity-based CEAF (CEAFe) and a mention-based CEAF (CEAFm). Recall is the total similarity divided by the number of mentions in K, and precision is the total similarity divided by the number of mentions in R. Finally, BLANC uses a variation on the Rand index (Rand, 1971) suitable for evaluating coreference. There are a few other measures – one being the ACE value – but since this is specific to a restricted set of entities (ACE types), we did not consider it.
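The link-based MUC and the mention-based B-CUBED definitions above translate into code fairly directly. The following is a minimal sketch assuming each entity is a set of mention identifiers; it follows the formulas in the text and omits the additional handling (for example, the Cai and Strube (2010) adjustments) done by the official scorer.

```python
# Sketch of the MUC and B-CUBED computations; not the official scorer.
def muc(key, response):
    def score(gold, system):
        num = den = 0.0
        for entity in gold:
            den += len(entity) - 1
            partitions, covered = set(), 0
            for other in system:
                inter = entity & other
                if inter:
                    partitions.add(id(other))
                    covered += len(inter)
            # mentions not covered by any system entity count as singleton parts
            parts = len(partitions) + (len(entity) - covered)
            num += len(entity) - parts
        return num / den if den else 0.0
    return score(key, response), score(response, key)   # (recall, precision)

def b_cubed(key, response):
    def score(gold, system):
        num = den = 0.0
        for entity in gold:
            for mention in entity:
                den += 1
                other = next((e for e in system if mention in e), set())
                num += len(entity & other) / len(entity)
        return num / den if den else 0.0
    return score(key, response), score(response, key)   # (recall, precision)
```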

5.5.2 Official Evaluation Metric

In order to determine the best performing system in the shared task, we needed to associate a single number with each system. This could have been one of the metrics above, or some combination of more than one of them. The choice was not simple, and while we consulted various researchers in the field, hoping for a strong consensus, their conclusion seemed to be that each metric had its pros and cons. We settled on the MELA metric of Denis and Baldridge (2009), which takes a weighted average of three metrics: MUC, B-CUBED, and CEAF. The rationale for the combination is that each of the three metrics represents a different important dimension, the MUC measure being based on links, the B-CUBED on mentions, and the CEAF on entities. We decided to use CEAFe instead of CEAFm. For a given task, a weighted average of the three might be optimal, but since we do not have an end task in mind, we decided to use the unweighted mean of the three metrics as the score on which the winning system was judged. This still gives us a score for only one language. We wanted to encourage researchers to run their systems on all three languages. Therefore, we decided to compute the final score that would determine the winning submission as the average of the MELA metric across all three languages. We decided to give a MELA score of zero for every language that a particular group did not run its system on.
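The ranking rule just described reduces to a few lines of arithmetic. A small sketch, assuming the MUC, B-CUBED and CEAFe F-scores per language are already available (the example numbers are made up):

```python
# Sketch of the official ranking score: per-language score is the unweighted
# mean of MUC, B-CUBED and CEAFe F-scores; the final score averages over the
# three languages, counting a language a team did not attempt as zero.
LANGUAGES = ("english", "chinese", "arabic")

def language_score(muc_f, bcubed_f, ceafe_f):
    return (muc_f + bcubed_f + ceafe_f) / 3.0

def official_score(per_language):
    """per_language: dict language -> (muc_f, bcubed_f, ceafe_f); missing
    languages contribute zero."""
    total = sum(language_score(*per_language[lang])
                for lang in LANGUAGES if lang in per_language)
    return total / len(LANGUAGES)

# A hypothetical team that ran only English and Chinese:
print(official_score({"english": (65.0, 60.0, 45.0),
                      "chinese": (62.0, 58.0, 44.0)}))
```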

5.5.3 Scoring Metrics Implementation

We used the same core scorer implementation24 that was used for the SEMEVAL-2010 task, which implemented all the different metrics. There were a couple of modifications made to this scorer after it was used for the SEMEVAL-2010 task.

1. Only exact matches were considered correct. Previously, for SEMEVAL-2010, non-

24 http://www.lsi.upc.edu/~esapena/downloads/index.php?id=3


exact matches were judged partially correct, with a 0.5 score, if the heads were the same and the mention extent did not exceed the gold mention.

2. The modifications suggested by Cai and Strube (2010) were incorporated in the scorer.

Since there are differences between the version used for CoNLL and the one available on the download site, and it is possible that the latter will be revised in the future, we have archived the version of the scorer on the CoNLL-2012 task webpage.25

6 Participants

A total of 41 different groups demonstrated interest in the shared task by registering on the task webpage. Of these, 16 groups from 6 countries submitted system outputs on the test set during the evaluation week. 15 groups participated in at least one language in the closed task, and only one group participated solely in the open track. One participant did not submit a final task paper. Tables 13 and 14 list the distribution of the participants by country and the participation by language and task type.

Country        Participants
Brazil         1
China          8
Germany        3
Italy          1
Switzerland    1
USA            2

Table 13: Participation by Country

           Closed   Open   Combined
English    1        15     16
Chinese    3        13     14
Arabic     1        7      8

Table 14: Participation across languages and tracks

7 Approaches

Tables 15 and 16 summarize the approaches taken by the participating systems along some important dimensions. Most of the systems divided the problem into the typical two phases – first identifying the potential mentions in the text, and then linking the mentions to form coreference chains, or entities. Many systems used rule-based approaches for

25 http://conll.bbn.com/download/scorer.v4.tar.gz

mention detection, though one, yang, did use trained models, and li used a hybrid approach by adding mentions from a trained model to the ones identified using rules. All systems ran a post-processing stage, after linking potential mentions together, to delete the remaining unlinked mentions. It was common for the systems to represent the markables (mentions) internally in terms of the parse tree NP constituent span, but some systems used a shared attribute model, where the attributes of the merged entity are determined collectively by heuristically merging the attribute types and values of the different constituent mentions. Various types of trained models were used for predicting coreference. For a learning-based system, the generation of positive and negative examples is very important. The participating systems used a range of sentence windows surrounding the anaphor in generating these examples. Among the systems that used trained models, many used the approach described in Soon et al. (2001) for selecting the positive and negative training examples, while others used some of the alternative approaches that have been introduced in the literature more recently. Following on the success of the rule-based linking model in the CoNLL-2011 shared task, many systems used a completely rule-based linking model, or used it as an initializing or intermediate step in a learning-based system. A hybrid approach seems to be a central theme of many high-scoring systems. Taking a cue from last year's systems, almost all systems trained pleonastic it classifiers, and used speaker-based constraints/features for the conversation genre. Many systems used the Arabic POS tags that were mapped down to Penn-style POS tags, but stamborg used a heuristic to convert them back to the complex POS type using the more frequent mapping, to get better performance for Arabic. The fernandes system uses feature templates defined on mention pairs. bjorkelund mentions that disallowing transitive closures gave a performance improvement of 0.6 and 0.4 points for English and Chinese/Arabic respectively. bjorkelund also mentions seeing a considerable increase in performance after adding features that correspond to the Shortest Edit Script (Myers, 1986) between surface forms and unvocalised Buckwalter forms, respectively. These could be better at capturing the differences in gender and number signaled by certain morphemes than hand-crafted rules. chen builds upon the sieve architecture proposed by Raghunathan et al. (2010), adding one sieve – head match – for Chinese and modifying two sieves. Some participants tried to incorporate some peculiarities of the corpus in their systems. For example, martschat excluded


Participant | Track | Languages | Syntax | Learning Framework | Markable Identification | Verb | Feature Selection | # Features | Train

fernandes | C | A,C,E | P | Latent Structure Perceptron | All noun phrases, pronouns and named entities | × | Latent feature induction and feature templates | 196 templates (E); 197 (C) and 223 (A) | T+D

bjorkelund | C | A,C,E | P | LIBLINEAR for linking, and Maximum Entropy (Mallet) for anaphoricity | NP, PRP and PRP$ in all languages; PN and NR in Chinese; all NE in English. Classifier to exclude non-referential pronouns in English (with a probability of 0.95). | × | × | – | T+D

chen | C,O | A,C,E | P | Hybrid – sieve approach followed by language-specific heuristic pruning and language-independent learning-based pruning; genre-specific models | NP, PRP and PRP$ and selected NE in English. NP and QP in Chinese. Exclude Chinese interrogative pronouns what and where. NP and selected NE in Arabic. Learning to prune non-referential mentions. | × | Backward elimination | – | T

stamborg | C | A,C,E | D | Logistic Regression (LIBLINEAR) | NP, PRP and PRP$ in all languages; PN in Chinese; all NE in English. Exclude pleonastic it in English. Prunes smaller mentions with the same head. | × | Forward + Backward starting from the CoNLL-2011 feature set for English and the Soon feature set for Chinese and Arabic | 18–37 | T+D

martschat | C | A,C | D | Directed multigraph representation where the weights are learned over the training data (on top of BART (Versley et al., 2008)) | Eight different mention types for English, and adjectival use for nations and a few NEs are filtered, as well as embedded mentions and pleonastic pronouns. Four mention types in Chinese. Copulas are also handled appropriately. | × | × | In the form of negative and positive relations | 20% of T (E); 15% of T (C)

chang | C | E,C | P | Latent Structure Learning | All noun phrases, pronouns and named entities | × | Chang et al., 2011 | – | T+D

uryupina | C | A,C,E | P | Modification of BART using multi-objective optimization. Domain-specific classifiers for the nw and bc genres. | Standard rules for English, and a classifier to identify markable NPs in Chinese and Arabic. | × | × | ~45 | T+D

zhekova | C | A,C,E | P | Memory-based learning (TiMBL) | NP, PRP and PRP$ in English, and all NP in Chinese and Arabic. Singleton classifier. | × | × | 33 | T+D

li | C | A,C,E | P | MaxEnt | All phrase types that are mentions in training are considered as mentions and a classifier is trained to identify potential mentions. | √ | × | 11 feature groups | T+D

yuan | C,O | E,C | P | C4.5 and deterministic rules | All noun phrases, pronouns and named entities | × | × | – | T+D

xu | C | E,C | P | Decision tree classifier and deterministic rules | All noun phrases, pronouns and selected named entities; selected and overlapping mentions are considered when they are second-level NPs inside an NP, for example coordinating NPs | × | × | 51 (E) and 61 (C) | T

chunyang | C | E,C | P | Rule-based (adaptation of Lee et al. 2011's sieve structure) | Chinese NPs and pronouns using part of speech PN, and names using part of speech NR, excluding measure words and certain names | × | × | – | –

yang | C | E | P | MaxEnt (OpenNLP) | Mention detection classifier | × | Same feature set, but per classifier | 40 | T

xinxin | C | E,C | P | MaxEnt | NP, PRP and PRP$ in English and Chinese | × | Greedy forward backward | 71 | T+D

shou | C | E | P | Modified version of the Lee et al., 2011 sieve system | | | | |

xiong | O | A,C,E | P | Lee et al., 2011 system | | | | |

Table 15: Participating system profiles – Part I. In the Task column, C/O represents whether the system participated in the closed, open or both tracks. In the Syntax column, a P represents that the system used a phrase structure grammar representation of syntax, whereas a D represents that it used a dependency representation. In the Train column T represents training data and D represents development data.


Participant | Positive Training Examples | Negative Training Examples | Decoding

fernandes | Identify likely mentions with an aim to generate high recall using the sieve method proposed in (Lee et al., 2011). Create directed arcs between mention pairs using a set of rules. | – | A constrained latent predictor finds the maximum scoring document tree among possible candidates.

bjorkelund | Closest Antecedent (Soon, 2001) | Negative examples in between anaphor and closest antecedent (Soon, 2001) | Stacked resolvers – i) best first, ii) pronoun closest first (closest first for pronouns and best first for other mentions) and iii) cluster-mention; disallow transitive nesting; proper noun mentions processed first, followed by other nouns and pronouns.

chen | Rule-based sieve approach followed by heuristic and learning-based pruning. | |

stamborg | Closest Antecedent (Soon, 2001) | Negative examples in between anaphor and closest antecedent (Soon, 2001) | Chinese and Arabic – closest-first clustering for pronouns and best-first clustering otherwise. English – closest-first for pronouns and averaged best-first clustering otherwise.

martschat | Weights are trained on part of the training data. | – | Greedy clustering for English; spectral clustering followed by greedy clustering for Chinese to reduce the number of candidate antecedents.

chang | Closest Antecedent (Soon, 2001) | All preceding mentions in a union of gold and predicted mentions. Mentions where the first is a pronoun and the other is not are not considered. | Best links strategy; separate classifiers for pronominal and non-pronominal mentions for English. Single classifier for Chinese.

uryupina | Closest Antecedent (Soon, 2001) | Negative examples in between anaphor and closest antecedent (Soon, 2001) | Mention pair model without ranking, as in Soon 2001.

zhekova | Closest Antecedent (Soon, 2001) | Negative examples in between anaphor and closest antecedent (Soon, 2001) | All definite phrases used to create a pair for each anaphor with each mention preceding it within a window of 10 (English, Chinese) or 7 (Arabic) sentences.

li | — | — | Best-first clustering.

yuan | — | — | Deterministic NP-NP followed by PP-NP.

xu | A window of sentences is used to determine positive and negative examples. For English a window of 5 sentences is used, whereas for Chinese a window of 10 sentences is used. | | All-pair linking followed by pruning or correction using a set of rules for NE-NE and NP-NP mentions for sentences outside of a 5/10 sentence window in English and Chinese respectively.

chunyang | Lee et al., 2011 system | |

yang | Pre-cluster pair models, separate for each pair NP-NP, NP-PRP and PRP-PRP. | | Pre-clusters, with singleton pronoun pre-clusters, and closest-first clustering. Different link models based on the type of linking mentions – NP-PRP, PRP-PRP and NP-NP.

xinxin | Closest Antecedent (Soon, 2001) | Negative examples in between anaphor and closest antecedent (Soon, 2001) | Best-first clustering method.

shou | Modified version of the Lee et al., 2011 coreference system | |

xiong | Lee et al., 2011 system | |

Table 16: Participating system profiles – Part II. This focuses on the way positive and negative examples were generated and the decoding strategy used.


adjectival nation names. Unlike English, and especially in the absence of an external resource, it is hard to make a gender distinction in Arabic and Chinese. martschat used the information that 先生 (sir) and 女士 (lady) often suggest gender information. bo and martschat used the plurality marker 们 to identify plurals, e.g. 同学 (student) is singular and 同学们 (students) is plural. bo also used a heuristic that if the word 和 (and) appears in the middle of a mention A, and the two parts separated by 和 are sub-mentions of A, then mention A is considered to be plural. Other words which have a similar meaning to 和, such as 同, 与 and 跟, are also considered. uryupina used the rich POS tags to classify pronouns into subtype, person, number and gender. Chinese and Arabic do not have definite noun phrase markers like "the" in English. In contrast to English, there is no strict enforcement of using definite noun phrases when referring to an antecedent in Chinese. Both 这次演说 (the talk) and 演说 (talk) can corefer with the antecedent 克林顿在河内大选的演说 (Clinton's talk during the Hanoi election). This makes it very difficult to distinguish generic expressions from referential ones. martschat checks whether the phrase starts with a definite/demonstrative indicator (e.g. 这 (this) or 那 (that)) in order to identify demonstrative and definite noun phrases. For Arabic, uryupina considers as definite all mentions with definite head nouns (prefixed with "Al") and all the idafa constructs with a definite modifier. chang uses training data to identify inappropriate mention boundaries. They perform a relaxed matching between predicted mentions and gold mentions, ignoring punctuation marks and mentions that start with one of the following: adverb, verb, determiner, and cardinal number. At the other extreme, xiong translated Chinese and Arabic to English, ran an English system, and projected mentions back to the source languages. This did not work very well by itself. One issue that they faced was that many instances of pronouns did not have a corresponding mention in the source language (since we do not consider mentions formed by dropped subjects/objects). Nevertheless, using this in addition to performing coreference resolution in these languages could be useful. Similar to last year, most participants appear not to have focused much on eventive coreference, those coreference chains that build off verbs in the data. This usually meant that mentions that should have linked to the eventive verb were instead linked with some other entity, or remained unlinked. Participants may have chosen not to focus on events because they pose unique challenges while making up only a small portion of the

data. Roughly 91% of the mentions in the data are NPs and pronouns. Many of the trained systems were also able to improve their performance by using feature selection, though results varied somewhat depending on the example selection strategy and the classifier used.

8 Results

In this section we give an overview of the performance of the various systems and then look at the performance for each language in various settings separately. For the official test, beyond the raw source text, coreference systems were provided only with the predictions (from automatic engines) of the other annotation layers (parses, semantic roles, word senses, and named entities). While referring to the participating systems, as a convention, we will use the last name of the contact person from the participating team. This is almost always the last name of the first author of the system paper, or the first name in case of conflicting last names (xinxin26). The only exception is chunyang, which is the first name of the second author for that system. A high-level summary of the results for the systems on the primary evaluation for both open and closed tracks is shown in Table 17. The scores under the columns for each language are the average of MUC, BCUBED and CEAFe for that language. The column Official Score is the average of those per-language averages, but only for the closed track. If a participant did not participate in all three languages, then they got a score of zero for the languages that were not attempted. The systems are sorted in descending order of this final Official Score. The last two columns indicate whether the systems used only the training data or both the training and development data for the final submissions. Note that all the results reported here still used the same, predicted information for all input layers. Most top performing systems used both training and development data for training the final system.

It can be seen that the fernandes system got the highest combined score (58.69) across all three languages and metrics. While this is lower than the figures cited for other corpora, it is as expected, given that the task here includes predicting the underlying mentions and mention boundaries, the insistence on exact match, and given that the relatively easier appositive coreference cases are not included in this measure. The combined score across all languages is purely for ranking purposes, and does not really tell much about each individual language. Looking at

26 They did not submit a final system description paper.


              Open                        Closed                      Official   Final model
Participant   English  Chinese  Arabic    English  Chinese  Arabic    Score      Train  Dev
fernandes                                 63.37    58.49    54.22     58.69      √      √
bjorkelund                                61.24    59.97    53.55     58.25      √      √
chen          63.53                       59.69    62.24    47.13     56.35      √      ×
stamborg                                  59.36    56.85    49.43     55.21      √      √
uryupina                                  56.12    53.87    50.41     53.47      √      √
zhekova                                   48.70    44.53    40.57     44.60      √      √
li                                        45.85    46.27    33.53     41.88      √      √
yuan          61.02                       58.68    60.69              39.79      √      √
xu                                        57.49    59.22              38.90      √      ×
martschat                                 61.31    53.15              38.15      √      ×
chunyang                                  59.24    51.83              37.02      –      –
yang                                      55.29                       18.43      √      ×
chang                                     60.18    45.71              35.30      √      ×
xinxin                                    48.77    51.76              33.51      √      √
shou                                      58.25                       19.42      √      ×
xiong         59.23    44.35    44.37                                  0.00      √      √

Table 17: Performance on primary open and closed tracks using all predicted information

              Open                        Closed                      Suppl.     Final model
Participant   English  Chinese  Arabic    English  Chinese  Arabic    Score      Train  Dev
fernandes                                 63.16    61.48    53.90     59.51      √      √
bjorkelund                                60.75    62.76    53.50     59.00      √      √
chen          70.00                       60.33    68.55    47.27     58.72      √      ×
stamborg                                  57.35    54.30    49.59     53.75      √      √
zhekova                                   49.30    44.93    40.24     44.82      √      √
li                                        43.04    43.28    31.46     39.26      √      √
yuan                                      59.50    64.42              41.31      √      √
xu                                        56.47    64.08              40.18      √      ×
chang                                     60.89                       20.30      √      √

Table 18: Performance on supplementary open and closed tracks using all predicted information, given gold mention boundaries

              Open                        Closed                      Suppl.     Final model
Participant   English  Chinese  Arabic    English  Chinese  Arabic    Score      Train  Dev
fernandes                                 69.35    66.36    63.49     66.40      √      √
bjorkelund                                68.20    69.92    59.14     65.75      √      √
chen          78.98                       70.46    77.77    52.26     66.83      √      ×
stamborg                                  68.66    66.97    53.35     62.99      √      √
zhekova                                   59.06    51.44    55.72     55.41      √      √
li                                        51.40    59.93    40.62     50.65      √      √
yuan                                      69.88    76.05              48.64      √      √
xu                                        63.46    69.79              44.42      √      ×
chang                                     77.22                       25.74      √      √

Table 19: Performance on supplementary open and closed tracks using all predicted information, given gold mentions


the English performance, we can see that the fernandes system gets the best average across the three selected metrics (MUC, BCUBED and CEAFe). The next best system for English is martschat (61.31), followed very closely by bjorkelund (61.24) and then chang (60.18). Owing to the ordering based on official score, not all the best performing systems for a particular language are in sequential order. Therefore, for easier reading, the scores of the top ranking system are in red, and the top four systems are underlined in the table. The performance differences between the better-scoring systems were not large, with only about three points separating the top four systems, and only 5 out of a total of 16 systems got a score lower than 58 points.27

In the case of Chinese, it is seen that the chen system performs the best, with a score of 62.24. This is then followed by yuan (60.69), and then bjorkelund (59.97) and xu (59.22). It is interesting to note that the scores for the top performing systems for both English and Chinese are very close. For all we know, this is just a coincidence. Also, for both English and Chinese, the top performing system is almost 2 points higher than the second best system.

On the Arabic language front, once again, fernandes has the highest score of 54.22, followed closely by bjorkelund (53.55) and then uryupina (53.47).

Tables 18 and 19 show similar information for the two supplementary tasks – one given gold mention boundaries and one given correct, gold mentions. We have, however, kept the same relative ordering of the system participants as in Table 17 for ease of reading. Looking at Table 18 carefully, we can see that for English and Arabic the relative ranking of the systems remains almost the same, except for a few outliers. The chang system performs the best given gold mentions – by almost 7 points over the next best performing system. In the case of Chinese, the chen system performs almost 6 points better than the official performance when given gold boundaries, and another 9 points better when given gold mentions, and almost 8 points better than the next best system in the case of the latter. We will look at more details in the following sections.

Figure 4 shows a performance plot for eight participating systems that attempted both of the supplementary tasks – GB and GM, in addition to the main NB – for at least one of the three languages. These are all in the closed setting. At the bottom of the plot you can see dots that indicate what test condition to

27Which also happens to be the highest performing score lastyear. More precise comparison later in Section 9.

which a particular point refers. In most cases, for the hardest task – NB – the English and Chinese performances track each other quite closely. When provided with gold mention boundaries (GB), the chen, xu and yuan systems do significantly better for the Chinese language. There is almost no positive effect on the English performance across the board. In fact, the performance of the stamborg and li systems drops noticeably. There is also a drop in performance for the bjorkelund system, but the difference is probably not significant. Finally, when provided with gold mentions, the performance of all systems increases across all languages, with the chang system showing the highest gain for English, and the chen system showing the highest gain for Chinese.

Figure 5 is a box and whiskers plot of the performance of all the systems for each language and variation – NB, GB, and GM. The circle in the center indicates the mean of the performances. The horizontal line within the box indicates the median, and the bottom and top of the boxes indicate the first and third quartiles respectively, with the whiskers indicating the highest and lowest performance on that task. It can easily be seen that the English systems have the least divergence, with the divergence large for the GM case probably owing to the chang system. This is somewhat expected, as this is the second year for the English task, and so it does show a more mature, more stable performance. On the other hand, both the Chinese and Arabic plots show much more divergence, with the Chinese and Arabic GB cases showing the highest divergence. Also, except for the Chinese GM condition, there is some skewness in the score distribution one way or the other.

Some participants ran their systems on six of the twelve possible combinations for all three languages. Figure 6 shows a plot for these three participants – fernandes, bjorkelund, and chen. As in Figure 4, the dots at the bottom help identify which particular combination of parameters the point on the plot represents. In addition to the three test conditions related to mention quality, we now also have two more test conditions relating to the syntax. We can see that the fernandes and bjorkelund system performances track each other very closely. In other words, using gold standard parses during testing does not show much benefit in those cases. In the case of the chen system, however, using gold parses shows a significant jump in scores for the NB condition. It seems that somehow the chen system makes much better use of the gold parses. In fact, the performance is very close to that with the GB condition. It is not clear what this system is doing


[Figure 4 plots scores (20–80) for Arabic, Chinese and English under the three testing mention qualities (No Boundaries, Gold Boundaries, Gold Mention) for the systems fernandes, bjorkelund, chen, stamborg, xu, yuan, chang and li.]

Figure 4: Performance for eight participating systems for the three languages, across the three mention qualities.

differently that makes this possible. Adding more information, i.e., the GM condition, improves the performance by almost the same delta as going from NB to GB.

Finally, Figure 7 shows the plot for one system – bjorkelund – that was run on ten of the twelve different settings. As usual, the dots at the bottom help identify the conditions for a point on the plot. Now there is a condition related to the quality of syntax during training as well. For some reason, using gold syntax hurts performance – though slightly – in the NB and GB settings. Chinese does show some improvement when gold parses are used for training, but only when gold mentions are available during testing.

One point to note is that we cannot compare these results to the ones obtained in the SEMEVAL-2010 coreference task, which used a small portion of the OntoNotes data, because it used only nominal entities and had heuristically added singleton

mentions.28

In the following sections we will look at the

results for the three languages, in various settings, in more detail. It might help to describe the format of the tables first. Given that our choice of the official metric was somewhat arbitrary, it is also useful to look at the individual metrics. The tables are similar in structure to Table 20. Each table provides

28 The documentation that comes with the SEMEVAL data package from LDC (LDC2011T01) states: "Only nominal mentions and identical (IDENT) types were taken from the OntoNotes coreference annotation, thus excluding coreference relations with verbs and appositives. Since OntoNotes is only annotated with multi-mention entities, singleton referential elements were identified heuristically: all NPs and possessive determiners were annotated as singletons excluding those functioning as appositives or as pre-modifiers but for NPs in the possessive case. In coordinated NPs, single constituents as well as the entire NPs were considered to be mentions. There is no reliable heuristic to automatically detect English expletive pronouns, thus they were (although inaccurately) also annotated as singletons."


results across multiple dimensions. For completeness, the tables include the raw precision and recall scores from which the F-scores were derived. Each table shows the scores for a particular system for the task of mention detection and coreference resolution separately. The tables also include two additional scores (BLANC and CEAFm) that did not factor into the official score. Useful further analysis may be possible based on these results, beyond the preliminary results presented here. As you recall, OntoNotes does not contain any singleton mentions. Owing to this peculiar nature of the data, the mention detection scores cannot be interpreted independently of the coreference resolution scores. In this scenario, a mention is effectively an anaphoric mention that has at least one other mention coreferent with it in the document. Most systems removed singletons from the response as a post-processing step, so not only will they not get credit for the singleton entities that they incorrectly removed from the data, but they will be

penalized for the ones that they accidentally linked with another mention. What this number does indicate is the ceiling on recall that a system would have had in the absence of being penalized for making mistakes in coreference resolution. The tables are sub-divided into several logical horizontal sections separated by two horizontal lines. Each horizontal section can be categorized by a combination of three parameters. Two of these apply to the test set, and one to the training set. We have divided the parameters into two types: i) Syntax and ii) Mention Quality. Syntax can take two values – automatic or gold – and the mention quality can be of three types: i) No boundaries (NB), ii) Gold mention boundaries (GB) and iii) Gold mentions (GM). There are a total of 12 combinations that can be formed using these parameters. Out of these, we thought six were particularly interesting. This is the product of the three cases of mention quality – NB, GB and GM – and whether gold syntax (GS) or predicted


Figure 5: A box plot of the performance for the three languages across the three mention qualities.



Figure 6: Performance of the fernandes, bjorkelund and chen systems in six different settings.

syntax (PS) was used for the test set. Just as we used the dots below the graphs earlier to indicate the parameters that were chosen for a particular point on the plot, we use small black squares in the tables, after the participant name, to indicate the conditions chosen for the results on that particular row. Since there are many rows in each table, in order to facilitate finding which number we are referring to, we have added an ID column which uses the letters e, c, and a to refer to the three languages – English, Chinese and Arabic. This is followed by a decimal number, in which the number before the decimal identifies the logical block within the table that shares the same experiment parameters, and the one after the decimal indicates the index of a particular system in that block. Systems are sorted by the official score within each block. All the systems with NB are listed first, followed by GB, followed by GM. One participant (bjorkelund) ran more variations than we had originally planned, but since they fall under the general permutation and combination of the settings that we were considering, it makes sense to list those results here as well.

8.1 English Closed

Table 20 shows the performance for the English language in greater detail.

Official Setting Recall is quite important in the mention detection stage because the full coreference system has no way to recover if the mention detection stage misses a potentially anaphoric mention. The linking stage indirectly affects the final mention detection accuracy: after a complete pass through the system, some correct mentions may remain unlinked to any other mention and are then deleted, thereby lowering recall. Most systems achieve a close balance between recall and precision for the mention detection task. A few systems had a considerable gap between the final mention detection recall and precision (fernandes, xu, yang, li and xinxin). It is not clear why this might be the case. One commonality among those with much higher precision than recall is that they used machine-learned classifiers for mention detection. This could be because the training data for such a classifier will not normally contain singleton mentions (as none are annotated in the data) unless one explicitly adds them to the set of training examples (which is not mentioned in any of the respective system papers). A hybrid rule-based and machine-learned model (fernandes) performed the best. Apart from some local differences, the ranking of the systems is roughly the same irrespective of which metric is chosen.



Figure 7: Performance of the bjorkelund system in twelve different settings.

The CEAFe measure seems to penalize systems more harshly than the other measures. If CEAFe does indicate the accuracy of the entities in the response, this suggests that the fernandes system is better at producing coherent entities than any other system.

Gold Mention Boundaries One difficulty with this supplementary evaluation using gold mention boundaries is that those boundaries alone provide only very partial information. For the roughly 10% of mentions that the automatic parser did not correctly identify, while the systems knew the correct boundaries, they had no structural syntactic or semantic information, and they also had to further approximate the already heuristic head word identification. This incomplete data complicated the systems' task and also complicates interpretation of the results. While most systems did slightly better here in terms of raw scores, the performance was not much different from the official task, indicating that mention boundary errors resulting from problems in parsing do not contribute significantly to the final output.29

29 It would be interesting to measure the overlap between the entity clusters for these two cases, to see whether there was any substantial difference in the mention chains, besides the expected differences in boundaries for individual mentions.
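One simple way to carry out the comparison suggested in this footnote would be to compare the sets of coreference links (unordered mention pairs) produced under the two conditions. The sketch below is only illustrative and assumes mention spans have already been normalized (e.g. reduced to head words) so that boundary differences do not mask genuinely identical links.

    from itertools import combinations

    def links(chains):
        """Set of coreference links (unordered mention pairs) implied by a list of chains."""
        pairs = set()
        for chain in chains:
            pairs.update(combinations(sorted(set(chain)), 2))
        return pairs

    def link_overlap(chains_a, chains_b):
        a, b = links(chains_a), links(chains_b)
        union = a | b
        return len(a & b) / len(union) if union else 1.0   # Jaccard overlap of link sets

    # Mentions here are (sentence_id, start, end) spans; the two inputs are runs
    # of the same system under the official and the gold-boundary conditions.
    run_official = [[("s1", 0, 1), ("s1", 5, 5)], [("s2", 2, 3), ("s3", 0, 0)]]
    run_gold_bnd = [[("s1", 0, 1), ("s1", 5, 5), ("s2", 7, 8)], [("s2", 2, 3)]]
    print(link_overlap(run_official, run_gold_bnd))  # 0.25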

Gold Mentions Another supplementary condition that we explored was one in which the systems were supplied with the manually annotated spans for all and only those mentions that participate in the gold standard coreference chains. This supplies significantly more information than the previous case, where exact spans were supplied for all NPs, since the gold mentions also include verb headwords that are linked to event NPs, but do not include singleton mentions, which do not end up as part of any chain. The latter constraint makes this test seem artificial, since it directly reveals part of what the systems are designed to determine, but it still has some value in quantifying the impact that mention detection and anaphoricity determination have on the overall task, and what the results are if they are perfectly known. The results show that performance does go up significantly, indicating that it is markedly easier for the systems to generate better entities given gold mentions. Although one would ideally expect a perfect mention detection score, many of the systems did not get 100% recall. This could possibly be owing to unlinked singletons that were removed in post-processing. The chang and fernandes systems are the only ones that achieved a perfect 100% recall, most likely because they had a hard constraint to link all mentions with at least one other mention.


The chang system (77.22 [e7.00]) stands out in that it has a 7-point lead on the next best system in this category ([e7.01]). This indicates that the linking algorithm of this system is significantly superior to those of the other systems – especially since the performance of the only other system that gets a 100% mention score is much lower (69.35 [e7.03]).
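To summarize what the three mention-quality conditions supply to a system at test time, here is a small, purely illustrative sketch (the function and argument names are ours, not part of the task infrastructure):

    def candidate_mentions(condition, gold_np_spans, gold_chain_mentions):
        """gold_np_spans: correct boundaries of all NPs from the gold parse;
        gold_chain_mentions: spans of exactly the mentions that occur in the
        gold coreference chains (includes event verbs, excludes singletons)."""
        if condition == "NB":      # official setting: no extra information supplied
            return None
        if condition == "GB":      # gold mention boundaries
            return set(gold_np_spans)
        if condition == "GM":      # gold mentions
            return set(gold_chain_mentions)
        raise ValueError(f"unknown condition: {condition}")

Under GM, a system that simply links every supplied mention to at least one other mention, as chang and fernandes did, is guaranteed a perfect mention detection score.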

Gold Test Parses Looking at Table 20, it can be seen that there is a slight increase (∼1 point) in performance across all the systems when gold parses are used, across all settings – NB, GB, and GM. In the case of the bjorkelund system, for the NB setting, the overall performance improves by about a point when using gold parses during testing (61.24 [e0.02] vs 62.23 [e1.02]), but, strangely, if gold parses are used during training as well, the performance is slightly lower (61.71 [e3.00]), although this difference is probably not statistically significant.

8.2 Chinese Closed

Table 21 shows the performance for the Chinese language in greater detail.

8.2.1 Official Setting

In this case, it turns out that the chen system does about 2 points better than the next best system across all the metrics. We know that this system had some additional Chinese-specific improvements. It is strange that the fernandes system has a much lower mention recall, with a much higher precision, as compared to chen. As far as the system descriptions go, both systems seem to have used the same set of mentions – except for the chen system including QP phrases and not considering interrogative pronouns. One thing we found about the chen system is that it dealt with nested NPs differently in the case of the NW genre. This unfortunately seems to be addressing a quirk in the Chinese newswire data owing to a possible data inconsistency.

8.2.2 Gold Mention Boundaries

Unlike English, just the addition of gold mention boundaries significantly improves the performance of almost all the systems. The improvement for the fernandes system turns out to be small, but it does gain in mention recall compared to the NB case. It is not clear why this might be the case. One explanation could be that parser performance for the constituents that represent mentions – primarily NPs – is significantly worse for Chinese than for English. The mention recall of all the systems is boosted by roughly 10%.

8.2.3 Gold Mentions

Providing gold mention information further significantly boosts all systems. This is especially so for the chen system [c8.00], which gains another 8 points over the gold mention boundary condition, in spite of not having perfect recall. On the other hand, the fernandes system gets perfect mention recall and precision, but ends up with a performance 10 points lower [c8.05] than the chen system. Another thing to note is that, for the CEAFe metric, the incremental drop in performance from the best system to the next best and so on is substantial, with a difference of 17 points between the chen system and the fernandes system. It does seem that the linking algorithms of the chen and yuan systems are much better than the others.

8.2.4 Gold Test Parses

When provided with gold parses for the test set, there is a substantial increase in performance for the NB condition – numerically more so than in the case of English. The degree of improvement decreases for the GB and GM conditions.

8.3 Arabic Closed

Table 22 shows the performance for the Arabic language in greater detail.

8.3.1 Official Setting

Unlike English or Chinese, none of the systems was particularly tuned for Arabic. This gives us a unique opportunity to test the performance variation of a mostly statistical, roughly language-independent mechanism, although there could still be a significant bias that the Arabic data brings to the mix. The overall performance for Arabic seems to be about ten points below both English and Chinese. On the mention detection front, most of the systems have a balanced precision and recall, and the drop in performance seems quite steady. The bjorkelund system has a slight edge over the fernandes system on the MUC, BCUBED and BLANC metrics, but fernandes has a much larger lead on both CEAF metrics, putting it at the top on the official score. We have not reported the development set numbers here, but another thing to note, especially for Arabic, is that performance on the Arabic test set is significantly better than on the development set. This is probably because of the smaller size of the training set and therefore a higher relative increment over the training set.


Table 20: Performance of systems in the primary and supplementary evaluations for the closed track for English. [Table body not recoverable from this rendering.]


Table 21: Performance of systems in the primary and supplementary evaluations for the closed track for Chinese. [Table body not recoverable from this rendering.]


Table 22: Performance of systems in the primary and supplementary evaluations for the closed track for Arabic. [Table body not recoverable from this rendering.]


The size of the training set (which is roughly a third of that for either English or Chinese) could also be a factor explaining the lower performance, and Arabic performance might gain from more data. The chen system did not use the development data for its final models; using it could have increased its score.

8.3.2 Gold Mention Boundaries

The system performance given gold boundaries followed the English trend more than the Chinese one: there was not much improvement over the primary NB evaluation. Interestingly, the chen system, which uses gold boundaries so well for Chinese, does not get any performance improvement. This might indicate that the technique that helped that system in Chinese does not generalize well across languages.

8.3.3 Gold Mentions

Performance given gold mentions seems to be about ten points higher than in the NB case. The bjorkelund system does better on the BLANC metric than fernandes, even after taking a big hit in mention detection recall. In the absence of the chang system, the fernandes system is the only one that explicitly adds a constraint for the GM case and gets a perfect mention detection score. All other systems lose significantly on recall.

8.3.4 Gold Test Parses

Finally, providing gold parses during testing does not have much of an impact on the scores.

8.3.5 All Languages Open

Tables 23, 24 and 25 give the performance for the systems that participated in the open track. Not many systems participated in this track, so there is not a lot to observe. One thing to note is that the chen system modified its precise-constructs sieve to add named entity information in the open track, which gave it a one-point improvement in performance. With gold mentions and gold syntax during testing, the chen system's performance almost approaches an F-score of 80 (79.79).

8.3.6 Headword-based and Genre-specific Scores

Since last year's task showed only very local differences in ranking between systems scored using strict mention boundaries and those scored using head-word based matching, we did not compute the head-word based evaluation. Also, since there was no particular pattern in scores across genres for the CoNLL test set, we did not compute genre-specific scores either; we have in fact computed them, but there is no room for them in this paper, and they could be made available on the task web page.

9 Comparison with CoNLL-2011

Table 26 shows the performance of the systems on the CoNLL-2011 test set. This year's models use about 200k more words of English training data, but that is a small fraction of the total. Given that the CoNLL-2011 training data was more than 80% the size of the CoNLL-2012 training data (roughly 1M vs. 1.2M words), and that coreference scores have been shown to asymptote after a small fraction of the total training data, it can be inferred that the roughly 5% absolute gap between the best performing systems of last year and this year is most likely owing to algorithmic improvements and possibly better rules. Also, since the winning 2011 system was purely rule-based, it is unlikely that the extra 200k words of data would have contributed to its rules.

It is interesting to note that although the winning system in the CoNLL-2011 task was a completely rule-based one, a modified version of the same system, used by shou and xiong, ranked close to tenth this year. This does indicate that a hybrid approach has some advantage over a purely rule-based system. The improvement seems to come mostly from higher precision in mention detection, MUC and BCUBED, and higher recall in CEAFe.

10 Conclusions

In this paper we described the anaphoric coreference information and other layers of annotation in the OntoNotes corpus, across three languages – English, Chinese and Arabic – and presented the results of an evaluation of learning such unrestricted entities and events in text. The following are our conclusions on reviewing the results:

• Most top performing systems used a hybrid approach combining rule-based strategies with machine learning. A rule-based approach does seem to bring a system close to the best-performance region. The most significant advantage of the rule-based approach seems to be capturing the most confident links before considering less confident ones. Discourse information, when present, is quite helpful for disambiguating pronominal mentions. Using information from appositives and copular constructions seems beneficial for bridging across variously lexicalized mentions. It is not clear how much more can be gained using further strategies.


Table 23: Performance of systems in the primary and supplementary evaluations for the open track for English. [Table body not recoverable from this rendering.]

Table 24: Performance of systems in the primary and supplementary evaluations for the open track for Chinese. [Table body not recoverable from this rendering.]

Table 25: Performance of systems in the primary and supplementary evaluations for the open track for Arabic. [Table body not recoverable from this rendering.]

Table 26: Performance on the CoNLL-2011 test set. [Table body not recoverable from this rendering.]


The features for coreference prediction are certainly more complex than for many other language processing tasks, which makes it more challenging to generate effective feature combinations.

• Gold parses during testing do seem to help quite a bit. Gold boundaries are not of much significance for English (and Arabic), but seem to be very useful for Chinese. The reason probably has some roots in the parser performance gap for Chinese.

• It does seem that collecting information about an entity by merging information across the various attributes of the mentions that comprise it can be useful, though not all systems that attempted this achieved a benefit, and it has to be done carefully.

• It is noteworthy that systems did not seem to attempt the kind of joint inference that could make use of the full potential of the various layers available in OntoNotes, but this could well have been owing to the limited time available for the shared task.

• We had expected to see more attention paid to event coreference, which is a novel feature in this data, but again, given the time constraints and given that events represent only a small portion of the total, it is not surprising that most systems chose not to focus on it.

• Scoring coreference seems to remain a significant challenge. There does not seem to be an objective way to establish one metric in preference to another in the absence of a specific application. On the other hand, the system rankings do not seem terribly sensitive to the particular metric chosen. It is interesting that both versions of the CEAF metric – which tries to capture the goodness of the entities in the output – seem much lower than the other metrics, though it is not clear whether that means that our systems are doing a poor job of creating coherent entities or whether that metric is just especially harsh.
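As a small illustration of why the CEAF numbers can look harsher than MUC, the toy sketch below implements the two F-scores directly from their definitions (the MUC link-based score and Luo's entity-based CEAF with the phi4 similarity). It is not the official scorer, and the brute-force alignment is only adequate for toy-sized inputs.

    from itertools import permutations

    def muc_f1(key, response):
        """MUC link-based F-score for two lists of entity sets over mention ids."""
        def link_score(gold, pred):
            num = den = 0
            for k in gold:
                # Partition each gold entity by the predicted entities; mentions not
                # covered by any predicted entity count as their own singleton parts.
                parts = {next((i for i, r in enumerate(pred) if m in r), ("miss", m)) for m in k}
                num += len(k) - len(parts)
                den += len(k) - 1
            return num / den if den else 0.0
        r, p = link_score(key, response), link_score(response, key)
        return 2 * p * r / (p + r) if p + r else 0.0

    def ceafe_f1(key, response):
        """Entity-based CEAF with phi4 similarity and brute-force optimal alignment."""
        phi = lambda k, r: 2 * len(k & r) / (len(k) + len(r))
        small, large = (key, response) if len(key) <= len(response) else (response, key)
        best = max(sum(phi(small[i], large[j]) for i, j in enumerate(perm))
                   for perm in permutations(range(len(large)), len(small)))
        r = best / len(key) if key else 0.0
        p = best / len(response) if response else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    key      = [{1, 2, 3, 4}, {5, 6}]
    response = [{1, 2}, {3, 4}, {5, 6}]   # one gold entity split into two
    print(round(muc_f1(key, response), 3), round(ceafe_f1(key, response), 3))  # 0.857 0.667

On this toy example the split entity costs only one MUC link but drags CEAFe down much further, which is consistent with the pattern observed in the tables.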

Acknowledgments

We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022.

We would like to thank all the participants; without their hard work, patience and perseverance this evaluation would not have been a success. We would also like to thank the Linguistic Data Consortium for making the OntoNotes 5.0 corpus freely and timely available in training/development/test sets to the participants, and Emili Sapena, who graciously allowed the use of his scorer implementation. Finally, we would like to thank Hwee Tou Ng and his student Zhong Zhi for training the word sense models and providing outputs for the training/development and test sets.

References

Olga Babko-Malaya, Ann Bies, Ann Taylor, Szuting Yi, Martha Palmer, Mitch Marcus, Seth Kulick, and Libin Shen. 2006. Issues in synchronizing the English treebank and propbank. In Workshop on Frontiers in Linguistically Annotated Corpora 2006, July.

Amit Bagga and Breck Baldwin. 1998. Algorithms for Scoring Coreference Chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, pages 563–566.

Shane Bergsma and Dekang Lin. 2006. Bootstrapping path-based pronoun resolution. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Sydney, Australia, July.

Jie Cai and Michael Strube. 2010. Evaluation metrics for end-to-end coreference resolution systems. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL '10, pages 28–36.

Wendy W Chapman, Prakash M Nadkarni, Lynette Hirschman, Leonard W D'Avolio, Guergana K Savova, and Ozlem Uzuner. 2011. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of American Medical Informatics Association, 18(5), September.

Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Proceedings of the Second Meeting of the North American Chapter of the Association of Computational Linguistics, June.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, June.


Nancy Chinchor and Beth Sundheim. 2003. Messageunderstanding conference (MUC) 6. In LDC2003T13.

Nancy Chinchor. 2001. Message understanding confer-ence (MUC) 7. In LDC2001T02.

Aron Culotta, Michael Wick, Robert Hall, and AndrewMcCallum. 2007. First-order probabilistic models forcoreference resolution. In HLT/NAACL, pages 81–88.

Pascal Denis and Jason Baldridge. 2007. Joint de-termination of anaphoricity and coreference resolu-tion using integer programming. In Proceedings ofHLT/NAACL.

Pascal Denis and Jason Baldridge. 2009. Global jointmodels for coreference resolution and named entityclassification. Procesamiento del Lenguaje Natural,(42):87–96.

Charles Fillmore, Christopher Johnson, and Miriam R. L.Petruck. 2003. Background to framenet. Interna-tional Journal of Lexicography, 16(3).

G. G. Doddington, A. Mitchell, M. Przybocki,L. Ramshaw, S. Strassell, and R. Weischedel.2004. The automatic content extraction (ACE)program-tasks, data, and evaluation. In Proceedingsof LREC.

Aria Haghighi and Dan Klein. 2010. Coreference reso-lution in a modular, entity-centered model. In HumanLanguage Technologies: The 2010 Annual Conferenceof the North American Chapter of the Association forComputational Linguistics, pages 385–393, Los An-geles, California, June.

Jan Hajic, Massimiliano Ciaramita, Richard Johans-son, Daisuke Kawahara, Maria Antonia Martı, LluısMarquez, Adam Meyers, Joakim Nivre, SebastianPado, Jan Stepanek, Pavel Stranak, Mihai Surdeanu,Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependen-cies in multiple languages. In Proceedings of the Thir-teenth Conference on Computational Natural Lan-guage Learning (CoNLL 2009): Shared Task, pages1–18, Boulder, Colorado, June.

Sanda M. Harabagiu, Razvan C. Bunescu, and Steven J.Maiorano. 2001. Text and knowledge mining forcoreference resolution. In NAACL.

L. Hirschman and N. Chinchor. 1997. Coreference taskdefinition (v3.0, 13 jul 97). In Proceedings of the Sev-enth Message Understanding Conference.

Eduard Hovy, Mitchell Marcus, Martha Palmer, LanceRamshaw, and Ralph Weischedel. 2006. OntoNotes:The 90% solution. In Proceedings of HLT/NAACL,pages 57–60, New York City, USA, June. Associationfor Computational Linguistics.

Karin Kipper, Anna Korhonen, Neville Ryant, andMartha Palmer. 2000. A large-scale classification ofenglish verbs. Language Resources and Evaluation,42(1):21 – 40.

Heeyoung Lee, Yves Peirsman, Angel Chang, NathanaelChambers, Mihai Surdeanu, and Dan Jurafsky. 2011.Stanford’s multi-pass sieve coreference resolution sys-tem at the conll-2011 shared task. In Proceedingsof the Fifteenth Conference on Computational Natu-ral Language Learning: Shared Task, pages 28–34,Portland, Oregon, USA, June. Association for Com-putational Linguistics.

Xiaoqiang Luo. 2005. On coreference resolution perfor-mance metrics. In Proceedings of Human LanguageTechnology Conference and Conference on EmpiricalMethods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada, October.

Mitchell P. Marcus, Beatrice Santorini, and Mary AnnMarcinkiewicz. 1993. Building a large annotated cor-pus of English: The Penn treebank. ComputationalLinguistics, 19(2):313–330, June.

Andrew McCallum and Ben Wellner. 2004. Conditionalmodels of identity uncertainty with application to nouncoreference. In Advances in Neural Information Pro-cessing Systems (NIPS).

Joseph McCarthy and Wendy Lehnert. 1995. Using de-cision trees for coreference resolution. In Proceedingsof the Fourteenth International Conference on Artifi-cial Intelligence, pages 1050–1055.

Thomas S. Morton. 2000. Coreference for nlp applica-tions. In Proceedings of the 38th Annual Meeting ofthe Association for Computational Linguistics, Octo-ber.

Vincent Ng. 2007. Shallow semantics for coreferenceresolution. In Proceedings of the IJCAI.

Vincent Ng. 2010. Supervised noun phrase coreferenceresearch: The first fifteen years. In Proceedings of the48th Annual Meeting of the Association for Compu-tational Linguistics, pages 1396–1411, Uppsala, Swe-den, July.

David S. Pallett. 2002. The role of the National Insti-tute of Standards and Technology in DARPA’s Broad-cast News continuous speech recognition research pro-gram. Speech Communication, 37(1-2), May.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically.

R. Passonneau. 2004. Computing reliability for coreference annotation. In Proceedings of LREC.

Massimo Poesio and Ron Artstein. 2005. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky.


Massimo Poesio. 2004. The MATE/GNOME scheme for anaphoric annotation, revisited. In Proceedings of SIGDIAL.

Simone Paolo Ponzetto and Massimo Poesio. 2009. State-of-the-art NLP approaches to coreference resolution: Theory and practical recipes. In Tutorial Abstracts of ACL-IJCNLP 2009, page 6, Suntec, Singapore, August.

Simone Paolo Ponzetto and Michael Strube. 2005. Semantic role labeling for coreference resolution. In Companion Volume of the Proceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics, pages 143–146, Trento, Italy, April.

Simone Paolo Ponzetto and Michael Strube. 2006. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of HLT/NAACL, pages 192–199, New York City, N.Y., June.

Sameer Pradhan, Kadri Hacioglu, Valerie Krugler, Wayne Ward, James Martin, and Dan Jurafsky. 2005. Support vector learning for semantic argument classification. Machine Learning Journal, 60(1):11–39.

Sameer Pradhan, Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2007a. OntoNotes: A unified relational semantic representation. International Journal of Semantic Computing, 1(4):405–419.

Sameer Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007b. Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of the IEEE International Conference on Semantic Computing (ICSC), September 17-19.

Altaf Rahman and Vincent Ng. 2009. Supervised models for coreference resolution. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 968–977, Singapore, August. Association for Computational Linguistics.

W. M. Rand. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336).

Marta Recasens and Eduard Hovy. 2011. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering.

W. Soon, H. Ng, and D. Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff. 2009. Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 656–664, Suntec, Singapore, August. Association for Computational Linguistics.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159–177, Manchester, England, August.

Özlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R. South. 2012. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 19(5), September.

Y. Versley, S. P. Ponzetto, M. Poesio, V. Eidelman, A. Jern, J. Smith, X. Yang, and A. Moschitti. 2008. BART: A modular toolkit for coreference resolution. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May.

Yannick Versley. 2007. Antecedent selection techniques for high-recall coreference resolution. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A model theoretic coreference scoring scheme. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 45–52.

Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. LDC catalog no.: LDC2005T33. BBN Technologies.

Ralph Weischedel, Eduard Hovy, Mitchell Marcus, Martha Palmer, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue. 2011. OntoNotes: A large training corpus for enhanced processing. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83, Uppsala, Sweden.