

explainaboard.nlpedia.ai/ExplainaBoard.pdf · 2021-04-13

EXPLAINABOARD: An Explainable Leaderboard for NLP

Pengfei Liu1†, Jinlan Fu2, Yang Xiao2, Weizhe Yuan1, Shuaichen Chang3, Junqi Dai2, Yixin Liu1, Zihuiwen Ye1, Graham Neubig1‡

1Carnegie Mellon University, 2Fudan University, 3The Ohio State University
†[email protected], ‡[email protected]

Abstract

With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: the EXPLAINABOARD, which in addition to inheriting the functionality of the standard leaderboard, also allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g. what is the best-performing system bad at?), (ii) interpret relationships between multiple systems (e.g. where does system A outperform system B? What if we combine systems A, B, and C?), and (iii) examine prediction results closely (e.g. what are common errors made by multiple systems, and in what contexts do particular errors occur?). EXPLAINABOARD has been deployed at http://explainaboard.nlpedia.ai/, and we have additionally released our interpretable evaluation code at https://github.com/neulab/ExplainaBoard, along with output files1 from more than 300 systems, 40 datasets, and 9 tasks to motivate “output-driven” research in the future.

1 Introduction

Natural language processing (NLP) research has been and is making astounding strides forward. This is true both for classical tasks such as machine translation (Sutskever et al., 2014; Wu et al., 2016), as well as for new tasks (Lu et al., 2016; Rajpurkar et al., 2016), domains (Beltagy et al., 2019), and languages (Conneau and Lample, 2019). One way this progress is quantified is through leaderboards, which report and update performance numbers

1http://explainaboard.nlpedia.ai/download.html

Figure 1: Illustration of the EXPLAINABOARD concept. Compared to vanilla leaderboards, EXPLAINABOARD allows users to perform interpretable (single-system, pairwise analysis, data bias), interactive (system combination, fine-grained/common error analysis), and reliable analysis (confidence interval, calibration) on systems in which they are interested. “comb.” denotes “combination” and “Errs” represents “errors”. “PER, LOC, ORG” refer to different labels.

of state-of-the-art systems on one or more tasks. Some prototypical leaderboards include GLUE and SuperGLUE for natural language understanding (Wang et al., 2018, 2019), XTREME and XGLUE (Hu et al., 2020; Liang et al., 2020) for multilingual understanding, the WMT shared tasks (Barrault et al., 2020) for machine translation, and GEM and GENIE for natural language generation (Gehrmann et al., 2021; Khashabi et al., 2021), among many others.

These leaderboards serve an important purpose: they provide a standardized evaluation setup, often on multiple tasks, that eases reproducible model comparison across organizations. However, at the same time, due to the prestige imbued by a top, or high, place on a leaderboard, they can also result in a singular focus on raising evaluation numbers at the cost of deeper scientific understanding of


model properties (Ethayarajh and Jurafsky, 2020). In particular, we argue that, among others, the following are three major limitations of the existing leaderboard paradigm:

• Interpretability: Most existing leaderboards commonly use a single number to summarize system performance holistically. This is conducive to system ranking, but at the same time the results are opaque, making the strengths and weaknesses of systems less interpretable.

• Interactivity: Existing leaderboards are static and non-interactive, which limits the ability of users to dig deeper into the results. Thus, (1) they usually do not flexibly support more complex evaluation settings (e.g. multi-dataset, multi-metric, multi-language), and (2) users may miss opportunities to understand the relationships between different systems. For example, where does model A outperform model B? Would the performance be further improved if we combined the top-3 models?

• Reliability: Given the increasing role that leaderboards have taken in guiding NLP research, it is important that information expressed in them is reliable, especially on datasets with small sample sizes, but most current leaderboards do not give an idea of the reliability of system rankings.
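To make the small-sample reliability point concrete: the width of a bootstrap confidence interval around an accuracy number shrinks as the test set grows, and on a small test set the interval can easily span several leaderboard positions. The sketch below is a hypothetical `bootstrap_ci` helper (a minimal percentile bootstrap over per-example correctness flags, not code from the released toolkit):

```python
import random

def bootstrap_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 (or bool) indicators, one per test example.
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and re-score each resample.
    scores = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Same 80% accuracy, very different reliability:
small = [1] * 8 + [0] * 2          # 10 test examples
large = [1] * 800 + [0] * 200      # 1000 test examples
lo_s, hi_s = bootstrap_ci(small)
lo_l, hi_l = bootstrap_ci(large)
assert hi_s - lo_s > hi_l - lo_l   # smaller sample, wider interval
```

On the 10-example set the interval is wide enough that a ranking based on the point estimate alone would be largely uninformative.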

In this paper, we describe EXPLAINABOARD (see Fig. 1), a software package and hosted leaderboard that satisfies all of the above desiderata. It also serves as a prototype implementation of some desirable features that may be included in future leaderboards, even independent of the provided software itself. We have deployed EXPLAINABOARD for 9 different tasks and 41 different datasets, and demonstrate how it can be easily adapted to new tasks of interest.

We expect that EXPLAINABOARD will benefit different steps of the research process:

(i) System Development: EXPLAINABOARD provides more detailed information regarding the submitted systems (e.g. fine-grained results, confidence intervals), allowing system developers to diagnose successes and failures of their own systems, or compare their systems with baselines and understand where improvements of their proposed methods come from. This better understanding can lead to more efficient and effective system improvements. Additionally, EXPLAINABOARD can help system developers uncover their systems’ advantages over others, even when these systems have not achieved state-of-the-art performance holistically.

(ii) Leaderboard Organization: The EXPLAINABOARD software both provides a ready-made platform for easy leaderboard development over different NLP tasks, and helps upgrade traditional leaderboards to allow for more fine-grained analysis.

(iii) Broad Analysis and Understanding: Because EXPLAINABOARD encourages system developers to provide their system outputs in an easy-to-analyze format, these will also help researchers, particularly those just starting in a particular NLP sub-field, get a broad sense of what current state-of-the-art models can and cannot do. This not only helps them quickly track the progress of different areas, but also can allow them to understand the relative advantages of diverse systems, suggesting insightful ideas for what’s left and what’s next.

2 ExplainaBoard

As stated above, EXPLAINABOARD extends existing leaderboards, improving their interpretability, interactivity, and reliability. It does so by providing a number of functionalities that are applicable to a wide variety of NLP tasks (as illustrated in Tab. 1). Many of these functionalities are grounded in existing research on evaluation and fine-grained diagnostics.

2.1 Interpretability

Interpretable evaluation (Popovic and Ney, 2011; Stymne, 2011; Neubig et al., 2019; Fu et al., 2020a) is a research area that considers methods that break down the holistic performance of each system into different interpretable groups. For example, in a Named Entity Recognition (NER) task, we may examine the accuracy along different dimensions of a concerned entity (such as “entity frequency,” telling us how well the model does on entities that appear in the training data a certain number of times) or sentences (such as “sentence length,” telling us how well the model does on entities that appear in longer or shorter sentences) (Fu et al., 2020a). This makes it possible to understand where models do well and poorly, leading to further insights beyond those that can be gleaned by holistic evaluation numbers. Applying this to a new task involves the following steps: (i) Attribute definition: define attributes by which we can partition the test set into different groups. (ii) Bucketing: partition the test set into different buckets based on the defined attributes, and evaluate performance on each bucket separately.
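The attribute-definition and bucketing steps above can be sketched in a few lines. The `bucket_performance` helper below is a hypothetical illustration (equal-width buckets over a numeric attribute such as entity length), not the toolkit’s actual bucketing logic:

```python
from collections import defaultdict

def bucket_performance(examples, attribute, n_buckets=4):
    """Partition test examples by an attribute value and score each bucket.

    Each example is a dict carrying a numeric attribute (e.g. entity length
    or training-set frequency) and a boolean `correct` flag. Buckets are
    equal-width intervals over the observed attribute range.
    """
    values = [ex[attribute] for ex in examples]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1  # avoid zero width if all values equal
    buckets = defaultdict(list)
    for ex in examples:
        idx = min(int((ex[attribute] - lo) / width), n_buckets - 1)
        buckets[idx].append(ex["correct"])
    # Per-bucket accuracy, ordered by bucket index.
    return {i: sum(flags) / len(flags) for i, flags in sorted(buckets.items())}

examples = [
    {"entity_length": 1, "correct": True},
    {"entity_length": 2, "correct": True},
    {"entity_length": 5, "correct": False},
    {"entity_length": 6, "correct": False},
]
print(bucket_performance(examples, "entity_length", n_buckets=2))
# → {0: 1.0, 1: 0.0}: strong on short entities, weak on long ones
```

Plotting these per-bucket scores yields exactly the kind of performance histogram described in Tab. 1.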


Table 1:

Aspect: Interpretability
• Single-system Analysis. Input: one model. Output: Performance Histogram. The input model is good at dealing with short entities, while achieving lower performance on long entities.
• Pairwise Analysis. Input: two models (M1, M2). Output: Performance Gap Histogram (M1 − M2). M1 is better at dealing with short entities, while M2 is better at dealing with long entities.
• Data Bias Analysis. Input: multiple datasets. Output: Data Bias Chart. For the entity length attribute, the average entity length (we average the length of all test entities on a given dataset) of these datasets, in descending order, is BN > BC > CN03 > WB.

Aspect: Interactivity
• Fine-grained Error Analysis. Input: single-system or pairwise diagnostic results. Output: Error Table. Error analysis allows the user to print out the entities that are incorrectly predicted by the given model, as well as the true label of the entity, the mispredicted label, and the sentence where the entity is located.
• System Combination. Input: multiple models (M1, M2, M3). Output: Ensemble Chart. The combined result of models M1, M2, and M3 is shown by the histogram bar labeled “comb.”; the combined result is better than the single models.

Aspect: Reliability
• Confidence. Input: one model.

eLen<latexit sha1_base64="iEfE+gc8L5JynjQqD7tUKllhyIM=">AAACx3icjVHLSsNAFD2Nr1pfVZduQovgqiRudCMW3Ci4qGAfUIsk6bQdTDIhmRRLEfQH3Oof+EniH+hfeGeaglpEJyQ5c+45Z+bOuJHPE2lZbzljbn5hcSm/XFhZXVvfKG5uNRKRxh6re8IXcct1EubzkNUllz5rRTFzAtdnTffmRNWbQxYnXISXchSxTuD0Q97jniMVxc5ZeF0sWxVLD3MW2BkoH78Uju4B1ETxFVfoQsBDigAMISRhHw4SetqwYSEiroMxcTEhrusMdyiQNyUVI4VD7A19+zRrZ2xIc5WZaLdHq/j0xuQ0sUseQbqYsFrN1PVUJyv2t+yxzlR7G9HfzbICYiUGxP7lmyr/61O9SPRwqHvg1FOkGdWdl6Wk+lTUzs0vXUlKiIhTuEv1mLCnndNzNrUn0b2rs3V0/V0rFavmXqZN8aF2SRds/7zOWdDYr9hWxb6wytUSJiOPHZSwR/d5gCpOUUOdsgd4xBOejTNDGEPjdiI1cplnG9+G8fAJBP2SMA==</latexit><latexit sha1_base64="uuDIzhQVICXnC3BUxPGGowZ0ajQ=">AAACx3icjVHLSsNAFD2NrxpfVZduQovgqiRudCMW3Ci4qGAfUIsk6bQNzYvJpFiKC3/Are5d+EnFP9C/8M40BbWITkhy5txzzsydcWLfS4RpvuW0hcWl5ZX8qr62vrG5VdjeqSdRyl1WcyM/4k3HTpjvhawmPOGzZsyZHTg+aziDM1lvDBlPvCi8FqOYtQO7F3pdz7WFpNglC28LJbNsqmHMAysDpdNX/SR+mejVqDDBDTqI4CJFAIYQgrAPGwk9LVgwERPXxpg4TshTdYZ76ORNScVIYRM7oG+PZq2MDWkuMxPldmkVn15OTgP75IlIxwnL1QxVT1WyZH/LHqtMubcR/Z0sKyBWoE/sX76Z8r8+2YtAF8eqB496ihUju3OzlFSdity58aUrQQkxcRJ3qM4Ju8o5O2dDeRLVuzxbW9XflVKycu5m2hQfcpd0wdbP65wH9cOyZZatK7NUKWI68thDEQd0n0eo4BxV1Ci7j0c84Vm70CJtqN1NpVou8+zi29AePgE2S5Ok</latexit><latexit sha1_base64="uuDIzhQVICXnC3BUxPGGowZ0ajQ=">AAACx3icjVHLSsNAFD2NrxpfVZduQovgqiRudCMW3Ci4qGAfUIsk6bQNzYvJpFiKC3/Are5d+EnFP9C/8M40BbWITkhy5txzzsydcWLfS4RpvuW0hcWl5ZX8qr62vrG5VdjeqSdRyl1WcyM/4k3HTpjvhawmPOGzZsyZHTg+aziDM1lvDBlPvCi8FqOYtQO7F3pdz7WFpNglC28LJbNsqmHMAysDpdNX/SR+mejVqDDBDTqI4CJFAIYQgrAPGwk9LVgwERPXxpg4TshTdYZ76ORNScVIYRM7oG+PZq2MDWkuMxPldmkVn15OTgP75IlIxwnL1QxVT1WyZH/LHqtMubcR/Z0sKyBWoE/sX76Z8r8+2YtAF8eqB496ihUju3OzlFSdity58aUrQQkxcRJ3qM4Ju8o5O2dDeRLVuzxbW9XflVKycu5m2hQfcpd0wdbP65wH9cOyZZatK7NUKWI68thDEQd0n0eo4BxV1Ci7j0c84Vm70CJtqN1NpVou8+zi29AePgE2S5Ok</latexit>

� 4<latexit sha1_base64="7v4dR4K94Y+/+eZ5Zm5Ub0mqdao=">AAACynicjVHLSsNAFD2Nr1pfVZdugq3iqiRF0GXBjQsXFewD2iLJdFqHpkmcTIRS3PkDbvXDxD/Qv/DOmIJaRCckOXPuOXfm3uvHgUiU47zmrIXFpeWV/GphbX1jc6u4vdNMolQy3mBREMm27yU8ECFvKKEC3o4l98Z+wFv+6EzHW3dcJiIKr9Qk5r2xNwzFQDBPEdUqd4f89rh8XSw5Fccsex64GSghW/Wo+IIu+ojAkGIMjhCKcAAPCT0duHAQE9fDlDhJSJg4xz0K5E1JxUnhETui75B2nYwNaa9zJsbN6JSAXklOGwfkiUgnCevTbBNPTWbN/pZ7anLqu03o72e5xsQq3BD7l2+m/K9P16IwwKmpQVBNsWF0dSzLkpqu6JvbX6pSlCEmTuM+xSVhZpyzPtvGk5jadW89E38zSs3qPcu0Kd71LWnA7s9xzoNmteI6FfeyWqodZqPOYw/7OKJ5nqCGc9TRMFU+4gnP1oUlrYk1/ZRaucyzi2/LevgAIF6ROA==</latexit><latexit sha1_base64="7v4dR4K94Y+/+eZ5Zm5Ub0mqdao=">AAACynicjVHLSsNAFD2Nr1pfVZdugq3iqiRF0GXBjQsXFewD2iLJdFqHpkmcTIRS3PkDbvXDxD/Qv/DOmIJaRCckOXPuOXfm3uvHgUiU47zmrIXFpeWV/GphbX1jc6u4vdNMolQy3mBREMm27yU8ECFvKKEC3o4l98Z+wFv+6EzHW3dcJiIKr9Qk5r2xNwzFQDBPEdUqd4f89rh8XSw5Fccsex64GSghW/Wo+IIu+ojAkGIMjhCKcAAPCT0duHAQE9fDlDhJSJg4xz0K5E1JxUnhETui75B2nYwNaa9zJsbN6JSAXklOGwfkiUgnCevTbBNPTWbN/pZ7anLqu03o72e5xsQq3BD7l2+m/K9P16IwwKmpQVBNsWF0dSzLkpqu6JvbX6pSlCEmTuM+xSVhZpyzPtvGk5jadW89E38zSs3qPcu0Kd71LWnA7s9xzoNmteI6FfeyWqodZqPOYw/7OKJ5nqCGc9TRMFU+4gnP1oUlrYk1/ZRaucyzi2/LevgAIF6ROA==</latexit><latexit sha1_base64="7v4dR4K94Y+/+eZ5Zm5Ub0mqdao=">AAACynicjVHLSsNAFD2Nr1pfVZdugq3iqiRF0GXBjQsXFewD2iLJdFqHpkmcTIRS3PkDbvXDxD/Qv/DOmIJaRCckOXPuOXfm3uvHgUiU47zmrIXFpeWV/GphbX1jc6u4vdNMolQy3mBREMm27yU8ECFvKKEC3o4l98Z+wFv+6EzHW3dcJiIKr9Qk5r2xNwzFQDBPEdUqd4f89rh8XSw5Fccsex64GSghW/Wo+IIu+ojAkGIMjhCKcAAPCT0duHAQE9fDlDhJSJg4xz0K5E1JxUnhETui75B2nYwNaa9zJsbN6JSAXklOGwfkiUgnCevTbBNPTWbN/pZ7anLqu03o72e5xsQq3BD7l2+m/K9P16IwwKmpQVBNsWF0dSzLkpqu6JvbX6pSlCEmTuM+xSVhZpyzPtvGk5jadW89E38zSs3qPcu0Kd71LWnA7s9xzoNmteI6FfeyWqodZqPOYw/7OKJ5nqCGc9TRMFU+4gnP1oUlrYk1/ZRaucyzi2/LevgAIF6ROA==</latexit>

3<latexit sha1_base64="9GJBKIMKDbN/wGyhhAENs48J8N0=">AAACxnicjVHLTsJAFD3UF+ILdemmETSuSIsLXZK4YYlRHgkS0w4DTihtM51qCDHxB9zqpxn/QP/CO2NJVGJ0mrZnzr3nzNx7/TgQiXKc15y1sLi0vJJfLaytb2xuFbd3WkmUSsabLAoi2fG9hAci5E0lVMA7seTe2A942x+d6Xj7lstEROGlmsS8N/aGoRgI5imiLsrH5etiyak4ZtnzwM1ACdlqRMUXXKGPCAwpxuAIoQgH8JDQ04ULBzFxPUyJk4SEiXPco0DalLI4ZXjEjug7pF03Y0Paa8/EqBmdEtArSWnjgDQR5UnC+jTbxFPjrNnfvKfGU99tQn8/8xoTq3BD7F+6WeZ/dboWhQFOTQ2CaooNo6tjmUtquqJvbn+pSpFDTJzGfYpLwswoZ322jSYxteveeib+ZjI1q/csy03xrm9JA3Z/jnMetKoV16m459VS7TAbdR572McRzfMENdTRQJO8h3jEE56tuhVaqXX3mWrlMs0uvi3r4QO1LY92</latexit><latexit sha1_base64="9GJBKIMKDbN/wGyhhAENs48J8N0=">AAACxnicjVHLTsJAFD3UF+ILdemmETSuSIsLXZK4YYlRHgkS0w4DTihtM51qCDHxB9zqpxn/QP/CO2NJVGJ0mrZnzr3nzNx7/TgQiXKc15y1sLi0vJJfLaytb2xuFbd3WkmUSsabLAoi2fG9hAci5E0lVMA7seTe2A942x+d6Xj7lstEROGlmsS8N/aGoRgI5imiLsrH5etiyak4ZtnzwM1ACdlqRMUXXKGPCAwpxuAIoQgH8JDQ04ULBzFxPUyJk4SEiXPco0DalLI4ZXjEjug7pF03Y0Paa8/EqBmdEtArSWnjgDQR5UnC+jTbxFPjrNnfvKfGU99tQn8/8xoTq3BD7F+6WeZ/dboWhQFOTQ2CaooNo6tjmUtquqJvbn+pSpFDTJzGfYpLwswoZ322jSYxteveeib+ZjI1q/csy03xrm9JA3Z/jnMetKoV16m459VS7TAbdR572McRzfMENdTRQJO8h3jEE56tuhVaqXX3mWrlMs0uvi3r4QO1LY92</latexit><latexit sha1_base64="9GJBKIMKDbN/wGyhhAENs48J8N0=">AAACxnicjVHLTsJAFD3UF+ILdemmETSuSIsLXZK4YYlRHgkS0w4DTihtM51qCDHxB9zqpxn/QP/CO2NJVGJ0mrZnzr3nzNx7/TgQiXKc15y1sLi0vJJfLaytb2xuFbd3WkmUSsabLAoi2fG9hAci5E0lVMA7seTe2A942x+d6Xj7lstEROGlmsS8N/aGoRgI5imiLsrH5etiyak4ZtnzwM1ACdlqRMUXXKGPCAwpxuAIoQgH8JDQ04ULBzFxPUyJk4SEiXPco0DalLI4ZXjEjug7pF03Y0Paa8/EqBmdEtArSWnjgDQR5UnC+jTbxFPjrNnfvKfGU99tQn8/8xoTq3BD7F+6WeZ/dboWhQFOTQ2CaooNo6tjmUtquqJvbn+pSpFDTJzGfYpLwswoZ322jSYxteveeib+ZjI1q/csy03xrm9JA3Z/jnMetKoV16m459VS7TAbdR572McRzfMENdTRQJO8h3jEE56tuhVaqXX3mWrlMs0uvi3r4QO1LY92</latexit>

2<latexit sha1_base64="LQrcwPmSbu646Y4RuMVqVzI3V4Q=">AAACxnicjVHLSsNAFD3GV62vqks3wVZxVZJudFlw02VF+4BaJJlOa2heTCZKKYI/4FY/TfwD/QvvjFNQi+iEJGfOvefM3Hv9NAwy6TivC9bi0vLKamGtuL6xubVd2tltZ0kuGG+xJExE1/cyHgYxb8lAhrybCu5Ffsg7/vhMxTu3XGRBEl/KScr7kTeKg2HAPEnURaVWuS6Vnaqjlz0PXAPKMKuZlF5whQESMOSIwBFDEg7hIaOnBxcOUuL6mBInCAU6znGPImlzyuKU4RE7pu+Idj3DxrRXnplWMzolpFeQ0sYhaRLKE4TVabaO59pZsb95T7WnutuE/r7xioiVuCH2L90s8786VYvEEKe6hoBqSjWjqmPGJdddUTe3v1QlySElTuEBxQVhppWzPttak+naVW89HX/TmYpVe2Zyc7yrW9KA3Z/jnAftWtV1qu55rVw/MqMuYB8HOKZ5nqCOBppokfcIj3jCs9WwYiu37j5TrQWj2cO3ZT18ALLMj3U=</latexit><latexit sha1_base64="LQrcwPmSbu646Y4RuMVqVzI3V4Q=">AAACxnicjVHLSsNAFD3GV62vqks3wVZxVZJudFlw02VF+4BaJJlOa2heTCZKKYI/4FY/TfwD/QvvjFNQi+iEJGfOvefM3Hv9NAwy6TivC9bi0vLKamGtuL6xubVd2tltZ0kuGG+xJExE1/cyHgYxb8lAhrybCu5Ffsg7/vhMxTu3XGRBEl/KScr7kTeKg2HAPEnURaVWuS6Vnaqjlz0PXAPKMKuZlF5whQESMOSIwBFDEg7hIaOnBxcOUuL6mBInCAU6znGPImlzyuKU4RE7pu+Idj3DxrRXnplWMzolpFeQ0sYhaRLKE4TVabaO59pZsb95T7WnutuE/r7xioiVuCH2L90s8786VYvEEKe6hoBqSjWjqmPGJdddUTe3v1QlySElTuEBxQVhppWzPttak+naVW89HX/TmYpVe2Zyc7yrW9KA3Z/jnAftWtV1qu55rVw/MqMuYB8HOKZ5nqCOBppokfcIj3jCs9WwYiu37j5TrQWj2cO3ZT18ALLMj3U=</latexit><latexit sha1_base64="LQrcwPmSbu646Y4RuMVqVzI3V4Q=">AAACxnicjVHLSsNAFD3GV62vqks3wVZxVZJudFlw02VF+4BaJJlOa2heTCZKKYI/4FY/TfwD/QvvjFNQi+iEJGfOvefM3Hv9NAwy6TivC9bi0vLKamGtuL6xubVd2tltZ0kuGG+xJExE1/cyHgYxb8lAhrybCu5Ffsg7/vhMxTu3XGRBEl/KScr7kTeKg2HAPEnURaVWuS6Vnaqjlz0PXAPKMKuZlF5whQESMOSIwBFDEg7hIaOnBxcOUuL6mBInCAU6znGPImlzyuKU4RE7pu+Idj3DxrRXnplWMzolpFeQ0sYhaRLKE4TVabaO59pZsb95T7WnutuE/r7xioiVuCH2L90s8786VYvEEKe6hoBqSjWjqmPGJdddUTe3v1QlySElTuEBxQVhppWzPttak+naVW89HX/TmYpVe2Zyc7yrW9KA3Z/jnAftWtV1qu55rVw/MqMuYB8HOKZ5nqCOBppokfcIj3jCs9WwYiu37j5TrQWj2cO3ZT18ALLMj3U=</latexit>

1<latexit sha1_base64="H/pnWRR6sCifSSMGu54Gg3UaChE=">AAACxnicjVHLSsNAFD3GV62vqks3wVZxVZJudFlw02VF+4BaJJlO69C8mEyUUgR/wK1+mvgH+hfeGVNQi+iEJGfOvefM3Hv9JBCpcpzXBWtxaXlltbBWXN/Y3Nou7ey20ziTjLdYHMSy63spD0TEW0qogHcTyb3QD3jHH5/peOeWy1TE0aWaJLwfeqNIDAXzFFEXFbdyXSo7Vccsex64OSgjX8249IIrDBCDIUMIjgiKcAAPKT09uHCQENfHlDhJSJg4xz2KpM0oi1OGR+yYviPa9XI2or32TI2a0SkBvZKUNg5JE1OeJKxPs008M86a/c17ajz13Sb093OvkFiFG2L/0s0y/6vTtSgMcWpqEFRTYhhdHctdMtMVfXP7S1WKHBLiNB5QXBJmRjnrs200qald99Yz8TeTqVm9Z3luhnd9Sxqw+3Oc86Bdq7pO1T2vletH+agL2McBjmmeJ6ijgSZa5D3CI57wbDWsyMqsu89UayHX7OHbsh4+ALBrj3Q=</latexit><latexit sha1_base64="H/pnWRR6sCifSSMGu54Gg3UaChE=">AAACxnicjVHLSsNAFD3GV62vqks3wVZxVZJudFlw02VF+4BaJJlO69C8mEyUUgR/wK1+mvgH+hfeGVNQi+iEJGfOvefM3Hv9JBCpcpzXBWtxaXlltbBWXN/Y3Nou7ey20ziTjLdYHMSy63spD0TEW0qogHcTyb3QD3jHH5/peOeWy1TE0aWaJLwfeqNIDAXzFFEXFbdyXSo7Vccsex64OSgjX8249IIrDBCDIUMIjgiKcAAPKT09uHCQENfHlDhJSJg4xz2KpM0oi1OGR+yYviPa9XI2or32TI2a0SkBvZKUNg5JE1OeJKxPs008M86a/c17ajz13Sb093OvkFiFG2L/0s0y/6vTtSgMcWpqEFRTYhhdHctdMtMVfXP7S1WKHBLiNB5QXBJmRjnrs200qald99Yz8TeTqVm9Z3luhnd9Sxqw+3Oc86Bdq7pO1T2vletH+agL2McBjmmeJ6ijgSZa5D3CI57wbDWsyMqsu89UayHX7OHbsh4+ALBrj3Q=</latexit><latexit sha1_base64="H/pnWRR6sCifSSMGu54Gg3UaChE=">AAACxnicjVHLSsNAFD3GV62vqks3wVZxVZJudFlw02VF+4BaJJlO69C8mEyUUgR/wK1+mvgH+hfeGVNQi+iEJGfOvefM3Hv9JBCpcpzXBWtxaXlltbBWXN/Y3Nou7ey20ziTjLdYHMSy63spD0TEW0qogHcTyb3QD3jHH5/peOeWy1TE0aWaJLwfeqNIDAXzFFEXFbdyXSo7Vccsex64OSgjX8249IIrDBCDIUMIjgiKcAAPKT09uHCQENfHlDhJSJg4xz2KpM0oi1OGR+yYviPa9XI2or32TI2a0SkBvZKUNg5JE1OeJKxPs008M86a/c17ajz13Sb093OvkFiFG2L/0s0y/6vTtSgMcWpqEFRTYhhdHctdMtMVfXP7S1WKHBLiNB5QXBJmRjnrs200qald99Yz8TeTqVm9Z3luhnd9Sxqw+3Oc86Bdq7pO1T2vletH+agL2McBjmmeJ6ijgSZa5D3CI57wbDWsyMqsu89UayHX7OHbsh4+ALBrj3Q=</latexit>

0

Error Bars: the error bars represent 95% con-fidence intervals of the performance on the spe-cific bucket.

Calibration One model

OutputGap

Error: 28.6

0.0 0.5 1.0

Reliability Diagram: Confidence histograms(red) and reliability diagrams (blue). that indi-cate the accuracy of model probability estimates

Table 1: A graphical breakdown of the functionality of EXPLAINABOARD, with examples from an NER task.

calculate performance w.r.t. each bucket.

Generally, previous work on interpretable evaluation has been performed over single tasks, while EXPLAINABOARD allows for comprehensive evaluation of different types of tasks in a single software package. We concretely show several ways interpretable evaluation can be defined within EXPLAINABOARD below:

F1:2 Single-system Analysis: What is a system good or bad at? For an individual system as input, generate a performance histogram that highlights the buckets where it performs well or poorly. For example, in Tab. 1 we demonstrate an example from NER where the input system does worse in dealing with longer entities (eLen ≥ 4).
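To make the bucketing step concrete, here is a minimal sketch of grouping test samples by an attribute value and scoring each bucket. This is our own illustration rather than EXPLAINABOARD's actual implementation, and the sample entities and interval boundaries are hypothetical:

```python
def bucket_performance(samples, attribute, score, intervals):
    """Group test samples into attribute intervals and average a per-sample
    score (e.g. 1.0 for a correct prediction, 0.0 otherwise) within each bucket.

    intervals: list of (low, high) pairs; a sample falls into the first
    interval with low <= attribute(sample) <= high.
    """
    buckets = {iv: [] for iv in intervals}
    for s in samples:
        v = attribute(s)
        for low, high in intervals:
            if low <= v <= high:
                buckets[(low, high)].append(score(s))
                break
    # Buckets with no samples get None rather than a meaningless average.
    return {iv: sum(vals) / len(vals) if vals else None
            for iv, vals in buckets.items()}

# Hypothetical NER entities as (entity_length, predicted_correctly) pairs:
entities = [(1, 1), (1, 1), (2, 1), (3, 0), (5, 0), (6, 1)]
print(bucket_performance(
    entities,
    attribute=lambda e: e[0],   # entity length
    score=lambda e: e[1],       # 1 if correct, 0 otherwise
    intervals=[(1, 1), (2, 3), (4, 100)],
))  # → {(1, 1): 1.0, (2, 3): 0.5, (4, 100): 0.5}
```

The per-bucket averages are exactly what the performance histogram plots, one bar per interval.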

F2: Pairwise Analysis: Where is one system better (worse) than another? Given a pair of systems, interpret where the performance gap occurs.

2 “F” represents “Functionality”.

Researchers can flexibly choose two systems they are interested in (e.g. selecting two rows from the leaderboard), and EXPLAINABOARD will output a performance gap histogram describing how the performance differences change over different buckets of different attributes. Tab. 1 demonstrates how we can see that one system is better than the other at longer or shorter entities.
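The performance gap histogram reduces to a per-bucket subtraction of the two systems' scores. A small sketch, with made-up per-bucket F1 values chosen to mirror the M1/M2 example in Tab. 1:

```python
def performance_gap(bucket_scores_a, bucket_scores_b):
    """Per-bucket score difference between two systems (A minus B).
    Positive values mean system A is better on that bucket."""
    return {bucket: bucket_scores_a[bucket] - bucket_scores_b[bucket]
            for bucket in bucket_scores_a}

# Hypothetical per-bucket F1 for two systems, keyed by entity-length bucket:
m1 = {"short": 0.92, "medium": 0.88, "long": 0.70}
m2 = {"short": 0.89, "medium": 0.88, "long": 0.78}
gaps = performance_gap(m1, m2)
# gaps["short"] > 0 and gaps["long"] < 0: M1 leads on short entities,
# M2 on long ones, the pattern the gap histogram visualizes.
```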

F3: Data Bias Analysis: What are the characteristics of different evaluated datasets? The defined attributes not only help us interpret system performance, but also make it possible for users to take a closer look at the characteristics of diverse datasets. For example, Tab. 1 shows an example of analyzing differences in average entity length across several datasets.

2.2 Interactivity

EXPLAINABOARD also allows users to dig deeper, interacting with the results in more complex ways.


F4: Fine-grained Error Analysis: What are common mistakes that most systems make, and where do they occur? EXPLAINABOARD provides flexible fine-grained error analyses based on the above-described performance evaluation:

1. Users can choose multiple systems and see their common error cases, which can be useful to identify challenging samples or even annotation errors.

2. In single-system analysis, users can choose particular buckets in the performance histogram3 and see the corresponding error samples in that bucket (e.g. which long entities does the current system mispredict?).

3. In pairwise analysis, users can select a bucket, and the unique errors of the two models (e.g. cases where system A succeeds while B fails, and vice versa) will be displayed.

F5: System Combination: Is there potential complementarity between different systems? System combination (Ting and Witten, 1997; González-Rubio et al., 2011; Duh et al., 2011) is a technique to improve performance by combining the output from multiple existing systems. In EXPLAINABOARD, users can choose multiple systems and obtain combined results calculated by voting over multiple base systems.4 In practice, we use a majority voting method for system combination, except for the text summarization task, in which we use a state-of-the-art ensemble approach (Liu et al., 2021).
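A token-level majority vote of the kind described above might be sketched as follows. This is our own illustration; in particular, breaking ties in favor of the first submitted system is an assumption, not necessarily EXPLAINABOARD's behavior:

```python
from collections import Counter

def majority_vote(system_outputs):
    """Combine per-token label predictions from several systems by majority vote.

    system_outputs: list of label sequences, one per system, all the same length.
    Ties are broken in favor of the earliest-listed system.
    """
    combined = []
    for labels in zip(*system_outputs):
        counts = Counter(labels)
        top = max(counts.values())
        # First label (in system order) that achieves the top count wins ties.
        winner = next(l for l in labels if counts[l] == top)
        combined.append(winner)
    return combined

# Three hypothetical NER taggers disagreeing on the middle token:
print(majority_vote([
    ["B-ORG", "O",     "O"],
    ["B-ORG", "B-PER", "O"],
    ["O",     "B-PER", "O"],
]))  # → ['B-ORG', 'B-PER', 'O']
```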

2.3 Reliability

The experimental conclusions obtained from evaluation metrics are not necessarily statistically reliable, especially when the experimental results can be affected by many factors. EXPLAINABOARD also takes a step towards more reliable interpretable evaluation.

F6: Confidence Analysis: To what extent can we trust the results of our system? EXPLAINABOARD can perform confidence analysis over both holistic and fine-grained performance metrics. As shown in Tab. 1, for each bucket there is an error bar whose width reflects how reliable the performance value is. We claim this is an important

3 Each bin of the performance histogram is clickable, returning an error case table.

4 With the system combination button of ExplainaBoard, we observed that the state-of-the-art performance on some tasks (e.g., NER, chunking) can be further improved.

feature for fine-grained analysis, since the numbers of test samples in each bucket are imbalanced, and with the confidence interval one can know how much uncertainty there is. In practice, we use the bootstrapping method (Efron, 1992; Ye et al., 2021) to calculate the confidence intervals.
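A minimal percentile-bootstrap sketch for a per-bucket confidence interval follows. This is illustrative only; the paper's exact resampling procedure, number of resamples, and scoring function may differ:

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-sample scores.

    scores: per-sample correctness values (e.g. 1.0/0.0) within one bucket.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the bucket with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# A small bucket where 30 of 40 samples are predicted correctly:
lo, hi = bootstrap_ci([1.0] * 30 + [0.0] * 10)
# The interval around the 0.75 bucket accuracy is what the error bar draws.
```

Smaller buckets yield wider intervals, which is exactly why the error-bar widths vary across buckets.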

F7: Calibration Analysis: How well is the confidence of a prediction calibrated with its correctness? One commonly-cited issue with modern neural predictors is that their probability estimates are not accurate (i.e. they are poorly calibrated), often being over-confident in the correctness of their predictions (Guo et al., 2017). We also incorporate this feature into EXPLAINABOARD, allowing users to evaluate how well-calibrated their systems of interest are.
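Calibration is commonly quantified with the Expected Calibration Error of Guo et al. (2017). A compact sketch, assuming equal-width confidence bins as in that paper (EXPLAINABOARD's exact metric is not specified here):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: size-weighted gap between average
    confidence and accuracy within equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp conf == 1.0 into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# An over-confident model: 95% confidence, but only half the predictions correct.
print(expected_calibration_error([0.95, 0.95], [1, 0]))
```

A perfectly calibrated model (confidence matching accuracy in every bin) scores 0; over-confident models, as described above, score higher.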

3 Tasks, Datasets and Systems

We have already added to EXPLAINABOARD 9 NLP tasks, 41 datasets, and 301 models,5 which cover many or most of the top-scoring systems on these 9 tasks. We briefly describe them below, and show high-level statistics in Tab. 2.

Text Classification Prediction of one or multiple pre-defined label(s) for a given input text. The current interface includes datasets for sentiment classification (Pang et al., 2002), topic identification (Wang and Manning, 2012), and intention detection (Chen et al., 2013).

Text-Span Classification Prediction of a pre-defined class from the input of a text and a span, such as the aspect-based sentiment classification task (Pappas and Popescu-Belis, 2014). We collect top-performing system outputs from Dai et al. (2021).

Text Pair Classification Prediction of a class given two texts, such as the natural language inference task (Bowman et al., 2015).

Sequence Labeling Prediction of a label for each token in a sequence. EXPLAINABOARD currently includes four concrete tasks: named entity recognition (Tjong Kim Sang and De Meulder, 2003), part-of-speech tagging (Toutanova et al., 2003), text chunking (Ando and Zhang, 2005), and Chinese word segmentation (Chen et al., 2015).

5 Of these, 265 models are implemented by us, as unfortunately it is currently not standard in NLP to release the system outputs that EXPLAINABOARD needs.


Structure Prediction Prediction of a syntactic or semantic structure from text, where EXPLAINABOARD currently covers semantic parsing tasks (Berant et al., 2013; Yu et al., 2018).

Text Generation EXPLAINABOARD also considers text generation tasks, currently focusing mainly on conditional text generation, for example, text summarization (Rush et al., 2015; Liu and Lapata, 2019). System outputs on text summarization are expanded based on the collection of previous work (Bhandari et al., 2020).

Task                       Dataset            Data  Model  Attr.
Text Classification        Sentiment             8     40      2
                           Topics                4     18      2
                           Intention             1      3      2
Text-Span Classification   Aspect Sentiment      4     20      4
Text Pair Classification   NLI                   2      6      7
Sequence Labeling          NER                   3     74      9
                           POS                   3     14      4
                           Chunking              3     14      9
                           CWS                   7     64      7
Structure Prediction       Semantic Parsing      4     12      4
Text Generation            Summarization         2     36      7

Table 2: Brief descriptions of tasks, datasets and systems that EXPLAINABOARD currently supports. “Attr.” denotes Attribute.

4 Case Study

Here, we briefly showcase the actual EXPLAINABOARD interface through a case study analyzing state-of-the-art NER systems.

4.1 Experimental Setup

Attribute Definition We use the following attributes, introduced and detailed by Fu et al. (2020a): (1) entity length, (2) sentence length, (3) OOV density, (4) entity density, (5) entity frequency, (6) label consistency of entity. We additionally introduce two new ones: (7) capitalization of entity, (8) label of entity.
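For illustration, a few of these attributes can be computed per entity occurrence roughly as follows. This is a sketch under our own simplifying assumptions; the exact definitions in Fu et al. (2020a) may differ (e.g. for OOV density), and the sample sentence and vocabulary are hypothetical:

```python
def entity_attributes(sentence_tokens, entity_span, train_vocab):
    """Compute a few example attributes for one entity occurrence.

    entity_span: (start, end) token indices, end exclusive.
    train_vocab: set of tokens seen in training, used for OOV density.
    """
    start, end = entity_span
    entity_tokens = sentence_tokens[start:end]
    return {
        "entity_length": end - start,
        "sentence_length": len(sentence_tokens),
        # Fraction of the entity's tokens unseen in training.
        "oov_density": sum(t not in train_vocab for t in entity_tokens)
                       / (end - start),
        # True if every entity token starts with an uppercase letter.
        "capitalization": all(t[:1].isupper() for t in entity_tokens),
    }

attrs = entity_attributes(
    ["Bromwich", "v", "Bolton", "in", "the", "league"],
    entity_span=(2, 3),
    train_vocab={"the", "in", "v", "league"},
)
# attrs == {"entity_length": 1, "sentence_length": 6,
#           "oov_density": 1.0, "capitalization": True}
```

Each attribute value is then mapped to a bucket, and per-bucket performance is computed as in Section 2.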

Collection of System Outputs Currently, we collect system outputs either by implementing the systems ourselves or by collecting outputs from other researchers (Fu et al., 2020b; Schweter and Akbik, 2020; Yamada et al., 2020). Using these methods, we have gathered 74 models on six NER datasets with system output information.

4.2 Analysis using EXPLAINABOARD

Fig. 2 illustrates the different types of results driven by four functionality buttons6 over the top-3 NER systems: LUKE (Yamada et al., 2020), FLERT (Schweter and Akbik, 2020), and FLAIR (Akbik et al., 2019).

Box A breaks down the performance of the top-1 system over different attributes.7 We can intuitively observe that even the state-of-the-art system does worse on longer entities. Users can further print error cases in the longer-entity bucket by clicking the corresponding bin.

Box B shows the 1st system's (LUKE) performance minus the 2nd system's (FLERT) performance. We can see that although LUKE surpasses FLERT holistically, it performs worse when dealing with PERSON entities.

Box C identifies samples that all systems mispredict. Further analysis of these samples uncovers challenging patterns or annotation errors.

Box D examines potential complementarity among these top-3 systems. The result shows that, by a simple voting ensemble strategy, a new state of the art (94.65 F1) has been achieved on the CoNLL-2003 dataset.

5 Usage

Example Use-cases To show the practical utility of EXPLAINABOARD, we first present examples of how it has been used as an analysis tool in existing published research papers. Fu et al. (2020b) (Tab. 4) utilize single-system analysis with the label consistency attribute for the NER task, while Zhong et al. (2019) (Tab. 4-5) use it for text summarization with the density and compression attributes. Fig. 4 and Tab. 3 in Fu et al. (2020a) leverage the data bias analysis and pairwise system diagnostics to interpret top-performing NER systems, while Tab. 4 in Fu et al. (2020c) uses single and pairwise system analysis to investigate what's next for the Chinese word segmentation task. Liu et al. (2021) use the system combination functionality to perform ensemble analysis of summariza-

6 As it is relatively challenging to define calibration for structured prediction tasks, this feature is currently only provided for classification tasks. We will explore this further in the future.

7 Due to space limitations, we only show three.


Figure 2: An example of the actual EXPLAINABOARD interface for NER over three top-performing systems on the CoNLL-2003 dataset. Box A shows the single-system analysis results obtained when users select the top-1 system and click the “Single Analysis” button. Box B shows the pairwise analysis results when the top-2 systems are chosen and “Pair Analysis” is clicked. Users can click any bin of the histogram, which results in a fine-grained error case table. Box C represents a table with common errors of these top-3 systems. Box D illustrates the combined result of the top-3 systems.

tion systems, and Fig. 1 in Ye et al. (2021) uses the reliability analysis functionality to observe how confidence intervals change across different buckets of a performance histogram.

Using ExplainaBoard Researchers can use EXPLAINABOARD in different ways: (i) We maintain a website where each task-specific EXPLAINABOARD allows researchers to interact with it, interpreting systems and datasets that they are interested in from different perspectives. (ii) We also release our back-end code for different NLP tasks, so that researchers can flexibly use it to process their own system outputs, which can assist their research projects.

Contributing to ExplainaBoard The community can contribute to EXPLAINABOARD in several ways: (i) Submit system outputs of their implemented models. (ii) Add more informative attributes for different NLP tasks. (iii) Add new datasets or benchmarks for existing or new tasks.

6 Implications and Roadmap

EXPLAINABOARD presents a new paradigm in leaderboard development for NLP. This is just the beginning of its development, and there are many future directions.

Direction 1: Research Revolving around System Outputs8 Due to its ability to analyze, contrast, or combine results from many systems, EXPLAINABOARD incentivizes researchers to submit their results to ExplainaBoard to better understand them and showcase their systems' strengths. At the same time, EXPLAINABOARD will serve as a central repository for system outputs across many tasks, allowing for future avenues of research into cross-system analysis or system combination.

Direction 2: Attribute Engineering Attribute-aided interpretable evaluation has provided a framework where we can convert our understanding of an NLP task (i.e., which attributes matter for the NER task?) into evaluation aspects and apply them

8 We release the system outputs collected for EXPLAINABOARD at http://explainaboard.nlpedia.ai/download.html.


to acquire insights and make model improvements. EXPLAINABOARD platformizes this idea, and we hope it will provide an avenue for more research on other ways to effectively analyze prediction results.

Direction 3: Enriching ExplainaBoard with Glass-box Analysis EXPLAINABOARD currently performs black-box analysis, solely analyzing system outputs without accessing model internals. On the other hand, there are many glass-box interpretability tools that do look at model internals, such as AllenNLP Interpret (Wallace et al., 2019) and the Language Interpretability Tool (Tenney et al., 2020). Expanding leaderboards to glass-box analysis methods (see Lipton (2018) and Belinkov and Glass (2019) for surveys) is an interesting direction for future work.

Direction 4: Interpretable Architecture Search By using EXPLAINABOARD, we can perform meta-analysis to understand the characteristics of different systems (and datasets), which may be helpful for existing architecture search research (Zoph and Le, 2017; Pham et al., 2018), e.g., by reducing its search space.

In the future, we aim to improve the applicability and usefulness of EXPLAINABOARD through the following action items: (1) Collaborate with more leaderboard organizers of diverse tasks and set up corresponding EXPLAINABOARDs for them. (2) Cover more tasks, datasets, and models, as well as more functionality.

Acknowledgements

We thank all authors who shared their system outputs with us: Ikuya Yamada, Stefan Schweter, Colin Raffel, Yang Liu, and Li Dong. We also thank Vijay Viswanathan, Yiran Chen, and Hiroaki Hayashi for useful discussion and feedback about EXPLAINABOARD.

The work was supported in part by the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

ReferencesAlan Akbik, Tanja Bergmann, and Roland Vollgraf.

2019. Pooled contextualized embeddings for namedentity recognition. In Proceedings of the 2019 Con-ference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers), pages 724–728, Minneapolis, Minnesota. As-sociation for Computational Linguistics.

Rie Ando and Tong Zhang. 2005. A high-performancesemi-supervised learning method for text chunking.In Proceedings of the 43rd Annual Meeting of the As-sociation for Computational Linguistics (ACL’05),pages 1–9, Ann Arbor, Michigan. Association forComputational Linguistics.

Loïc Barrault, Magdalena Biesialska, Ondrej Bojar,Marta R. Costa-jussà, Christian Federmann, YvetteGraham, Roman Grundkiewicz, Barry Haddow,Matthias Huck, Eric Joanis, Tom Kocmi, PhilippKoehn, Chi-kiu Lo, Nikola Ljubešic, ChristofMonz, Makoto Morishita, Masaaki Nagata, Toshi-aki Nakazawa, Santanu Pal, Matt Post, and MarcosZampieri. 2020. Findings of the 2020 conference onmachine translation (WMT20). In Proceedings ofthe Fifth Conference on Machine Translation, pages1–55, Online. Association for Computational Lin-guistics.

Yonatan Belinkov and James Glass. 2019. Analysismethods in neural language processing: A survey.Transactions of the Association for ComputationalLinguistics, 7:49–72.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciB-ERT: A pretrained language model for scientific text.In Proceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computa-tional Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and PercyLiang. 2013. Semantic parsing on Freebase fromquestion-answer pairs. In Proceedings of the 2013Conference on Empirical Methods in Natural Lan-guage Processing, pages 1533–1544, Seattle, Wash-ington, USA. Association for Computational Lin-guistics.

Manik Bhandari, Pranav Narayan Gour, Atabak Ash-faq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. InProceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP),pages 9347–9359, Online. Association for Computa-tional Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts,and Christopher D. Manning. 2015. A large anno-tated corpus for learning natural language inference.In Proceedings of the 2015 Conference on Empiri-cal Methods in Natural Language Processing, pages

Page 8: EXPLAINABOARD An Explainable Leaderboard for NLPexplainaboard.nlpedia.ai/ExplainaBoard.pdf · 2021. 4. 13. · EXPLAINABOARD: An Explainable Leaderboard for NLP Pengfei Liu1y, Jinlan

632–642, Lisbon, Portugal. Association for Compu-tational Linguistics.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu,and Xuanjing Huang. 2015. Long short-term mem-ory neural networks for Chinese word segmentation.In Proceedings of the 2015 Conference on Empiri-cal Methods in Natural Language Processing, pages1197–1206, Lisbon, Portugal. Association for Com-putational Linguistics.

Zhiyuan Chen, Bing Liu, Meichun Hsu, Malu Castel-lanos, and Riddhiman Ghosh. 2013. Identifying in-tention posts in discussion forums. In Proceedingsof the 2013 Conference of the North American Chap-ter of the Association for Computational Linguis-tics: Human Language Technologies, pages 1041–1050, Atlanta, Georgia. Association for Computa-tional Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advancesin Neural Information Processing Systems 32: An-nual Conference on Neural Information ProcessingSystems 2019, NeurIPS 2019, December 8-14, 2019,Vancouver, BC, Canada, pages 7057–7067.

Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu, andXipeng Qiu. 2021. Does syntax matter? a strongbaseline for aspect-based sentiment analysis withroberta. In The 2021 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies.

Kevin Duh, Katsuhito Sudoh, Xianchao Wu, HajimeTsukada, and Masaaki Nagata. 2011. Generalizedminimum bayes risk system combination. In Pro-ceedings of 5th International Joint Conference onNatural Language Processing, pages 1356–1360.

Bradley Efron. 1992. Bootstrap methods: anotherlook at the jackknife. In Breakthroughs in statistics,pages 569–593. Springer.

Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is inthe eye of the user: A critique of NLP leaderboards.In Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP),pages 4846–4853, Online. Association for Computa-tional Linguistics.

Jinlan Fu, Pengfei Liu, and Graham Neubig. 2020a. In-terpretable multi-dataset evaluation for named entityrecognition. In Proceedings of the 2020 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP), pages 6058–6069, Online. Associa-tion for Computational Linguistics.

Jinlan Fu, Pengfei Liu, and Qi Zhang. 2020b. Rethinking generalization of neural models: A named entity recognition case study. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7732–7739. AAAI Press.

Jinlan Fu, Pengfei Liu, Qi Zhang, and Xuanjing Huang. 2020c. RethinkCWS: Is Chinese word segmentation a solved task? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5676–5686, Online. Association for Computational Linguistics.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, et al. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672.

Jesús González-Rubio, Alfons Juan, and Francisco Casacuberta. 2011. Minimum Bayes-risk system combination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1268–1277, Portland, Oregon, USA. Association for Computational Linguistics.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel S. Weld. 2021. GENIE: A leaderboard for human-in-the-loop evaluation of text generation. arXiv preprint arXiv:2101.06561.

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, et al. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint arXiv:2004.01401.

Zachary C. Lipton. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yixin Liu, Zi-Yi Dou, and Pengfei Liu. 2021. RefSum: Refactoring neural summarization. In The 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 289–297.

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. compare-mt: A tool for holistic comparison of language generation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 35–41, Minneapolis, Minnesota. Association for Computational Linguistics.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 79–86. Association for Computational Linguistics.

Nikolaos Pappas and Andrei Popescu-Belis. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 455–466, Doha, Qatar. Association for Computational Linguistics.

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4092–4101. PMLR.

Maja Popović and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4):657–688.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.

Stefan Schweter and Alan Akbik. 2020. FLERT: Document-level features for named entity recognition. arXiv preprint arXiv:2011.06993.

Sara Stymne. 2011. Blast: A tool for error analysis of machine translation output. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 56–61, Portland, Oregon. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, et al. 2020. The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models. arXiv preprint arXiv:2008.05122.

Kai Ming Ting and Ian H. Witten. 1997. Stacked generalizations: When does it work? In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI 97, Nagoya, Japan, August 23-29, 1997, 2 Volumes, pages 866–873. Morgan Kaufmann.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 252–259.

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12, Hong Kong, China. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.


Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Sida Wang and Christopher Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 90–94, Jeju Island, Korea. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Zihuiwen Ye, Pengfei Liu, Jinlan Fu, and Graham Neubig. 2021. Towards more fine-grained and reliable NLP performance prediction. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

Ming Zhong, Danqing Wang, Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2019. A closer look at data bias in neural extractive summarization models. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 80–89, Hong Kong, China. Association for Computational Linguistics.

Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.