Saliency Learning: Teaching the Model Where to Pay Attention
Reza Ghaeini, Xiaoli Fern, Hamed Shahbazi, Prasad Tadepalli
Oregon State University
NAACL 2019

Motivation

• Do deep models make the right prediction for the right reason? How reliable are deep models?
• Running example (The Office, S04E03): we recognize the pizza delivery guy. Is it because he is carrying pizzas, or merely because he is wearing jeans? Only the former, carrying pizzas, is the right reason.
• Attempts toward interpretation and explanation.
• Teach the model to make the right prediction for the right reason.
Saliency Learning

• Contributory Words (Z): words whose occurrence in a sample suggests the gold prediction. The prediction should be made by focusing on them.
• Saliency: an explanation method that determines the impact of a unit on a prediction, i.e. the gradient of the prediction with respect to the unit.
• Goal: align the behavior of the model with the expected and desired behavior.
• Methodology: teach the model to assign positive saliency to contributory words.
Saliency Learning

• Proposing a penalization term (explanation loss) to enforce positive saliency for the contributory words:

    C(θ, X, y, Z) = L(θ, X, y) + λ Σ_{i=1}^{n} max(0, −Z_i S(X_i))

where:
• L(θ, X, y): traditional loss function
• θ: model parameters; λ: hyper-parameter
• X: input (sentence); X_i: a word; n: sentence length
• y: gold label; Z: gold explanation (contributory words)
• S(X_i): word saliency
• λ Σ_{i=1}^{n} max(0, −Z_i S(X_i)): explanation loss
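The cost function above can be sketched in plain Python. This is a minimal illustration of the penalty term, not the authors' implementation; the saliency values below are stand-in numbers for the gradients a real model would compute.

```python
def explanation_loss(saliency, z):
    """Sum of max(0, -Z_i * S(X_i)): penalizes contributory words (Z_i = 1)
    whose saliency S(X_i) is negative; words with Z_i = 0 contribute nothing."""
    return sum(max(0.0, -zi * si) for zi, si in zip(z, saliency))

def cost(task_loss, saliency, z, lam):
    """C = L + lambda * explanation loss."""
    return task_loss + lam * explanation_loss(saliency, z)

# A contributory word with negative saliency (-0.4) is penalized;
# one with positive saliency (0.9) and all non-contributory words are not.
s = [0.2, -0.7, 0.9, -0.4]   # toy saliency values S(X_i)
z = [0,    0,   1,    1]     # contributory-word mask Z
print(explanation_loss(s, z))        # 0.4
print(cost(1.0, s, z, lam=0.5))      # 1.0 + 0.5 * 0.4 = 1.2
```

Note that the hinge max(0, ·) only pushes contributory saliencies toward being positive; it never rewards them for growing without bound.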
Saliency Learning: Example

Sentence: An unknown man had [broken into] a house last November.

• Event Detection annotations: Event Mention: broken into; Event Type: Attack.
• Modified Event Detection annotations: Contributory Words: broken into; Label: Positive.

X = X_1, X_2, X_3, X_4, X_5, X_6, ..., X_n
Z = 0, 0, 1, 1, 0, 1, ..., 0

Scanning the sentence word by word, each position where Z_i = 0 contributes 0 to the explanation loss, and each position where Z_i = 1 contributes max(0, −S(X_i)). Accumulating over this sentence:

Explanation Loss = max(0, −S(X_3)) + max(0, −S(X_4)) + max(0, −S(X_6)) + ...
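Tracing that accumulation in code: only positions with Z_i = 1 add a hinge term, reproducing the running sum word by word. The saliency numbers here are invented purely for illustration.

```python
# Z from the example: positions 3, 4, and 6 (1-indexed) are contributory.
z = [0, 0, 1, 1, 0, 1, 0]
s = [0.1, -0.2, -0.5, 0.3, 0.8, -0.1, 0.4]   # hypothetical saliencies S(X_i)

terms = []
for zi, si in zip(z, s):
    if zi == 1:                      # positions with Z_i = 0 contribute nothing
        terms.append(max(0.0, -si))  # hinge on the contributory word's saliency
# max(0, 0.5) + max(0, -0.3) + max(0, 0.1) = 0.5 + 0.0 + 0.1
print(sum(terms))   # 0.6
```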
Tasks and Dataset

• Event Detection:
• ACE 2005 → Annotations: Event Mentions
• Rich ERE 2015 → Annotations: Event Mentions
• Cloze-Style Question Answering:
• CBT-NE → Annotations: Gold Replacement (Candidate)
• CBT-CN → Annotations: Gold Replacement (Candidate)
Event Detection
• Event Detection: Given a sentence, find the event mention and determine its event type.
• Modified Event Detection: Given a sentence, determine if it contains an event.
Cloze-Style QA
• Cloze-Style QA: Given a document, a query with a blank, and a set of possible entities for filling the blank, find the right entity.
• Modified Cloze-Style QA: Given a sentence and a query with a blank, determine whether the sentence contains the right entity for the blank in the query.
Dataset Statistics
Saliency Learning: Teaching the Model Where to Pay Attention (Appendix)
A Background: Saliency

The concept of saliency was first introduced in vision for visualizing the spatial support on an image for a particular object class (Simonyan et al., 2013). Consider a deep model's prediction as a differentiable function f parameterized by θ with input X ∈ R^{n×d}. Such a model can be described using the Taylor series as follows:

    f(x) = f(a) + f'(a)(x − a) + (f''(a) / 2!)(x − a)^2 + ...    (1)

Approximating the deep model as a linear function, we can use just the first-order Taylor expansion:

    f(x) ≈ f'(a) x + b    (2)

According to Equation 2, the first derivative of the model's prediction with respect to the input (f'(a), i.e. ∂f/∂x evaluated at x = a) describes the model's behaviour near the input. In other words, a larger derivative (gradient) indicates more impact and contribution toward the model's prediction. Consequently, the large-magnitude derivative values identify the units of the input that would most affect f(x) if changed.
B Task and Dataset

Here, we first describe the main (unmodified) Event Extraction and Cloze-Style Question Answering tasks. Next, we provide data statistics of the modified versions of the ACE, ERE, CBT-NE, and CBT-CN datasets in Table 1.

• Event Extraction: Given a set of ontologized event types (e.g. Movement, Transaction, Conflict, etc.), the goal of event extraction is to identify the mentions of different events, along with their types, from natural text.

• Cloze-Style Question Answering: Documents in CBT consist of 20 contiguous sentences from the body of a popular children's book, and queries are formed by replacing a token in the 21st sentence with a blank. Given a document, a query, and a set of candidates, the goal is to find the correct replacement for the blank in the query among the given candidates. To avoid having too many negative examples in our modified datasets, we only keep sentences that contain at least one candidate. That is, each sample from the CBT dataset is split into at most 20 samples, one per sentence of the original sample, as long as that sentence contains one of the candidates.

Table 1: Dataset statistics of the modified tasks and datasets (P. = positive sample count, N. = negative sample count).

Dataset | Train P. | Train N. | Test P. | Test N.
ACE     | 3.2K     | 15K      | 293     | 421
ERE     | 3.1K     | 4K       | 2.7K    | 1.91K
CBT-NE  | 359K     | 1.82M    | 8.8K    | 41.1K
CBT-CN  | 256K     | 2.16M    | 5.5K    | 44.4K
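The CBT splitting rule can be sketched as follows. This is a hypothetical helper, not the authors' preprocessing code; the exact tokenization and candidate matching they used may differ.

```python
def split_cbt_sample(sentences, candidates, answer):
    """Split one CBT document (its sentences) into binary samples:
    keep only sentences containing at least one candidate, and label a
    sentence positive iff it contains the correct answer."""
    samples = []
    for sent in sentences:
        tokens = sent.split()
        if not any(c in tokens for c in candidates):
            continue                      # drop candidate-free sentences
        label = int(answer in tokens)
        samples.append((sent, label))
    return samples

doc = ["Alice met Bob .", "It rained all day .", "Carol waved ."]
out = split_cbt_sample(doc, candidates={"Alice", "Bob", "Carol"}, answer="Carol")
# The candidate-free sentence is dropped; only the sentence with Carol is positive.
print(out)   # [('Alice met Bob .', 0), ('Carol waved .', 1)]
```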
C Training

All hyper-parameters are tuned on the development set. We use pre-trained 300-D GloVe 840B vectors to initialize our word embedding vectors. All hidden states and feature sizes are 300 dimensions (d = 300). The weights are learned by minimizing the cost function on the training data with the Adam optimizer. The initial learning rate is 0.0001, and λ = 0.5, 0.7, 0.4, and 0.35 for ACE, ERE, CBT-NE, and CBT-CN respectively.
Toy Models

[Figure: High-level view of the toy models. (a) Event Detection: Sentence → Conv-W3 and Conv-W5 → Max-Pooling → Dim & Seq Max-Pooling. (b) Cloze-Style Question Answering: the same sentence branch plus a query branch: Query → Conv-W3 and Conv-W5 → Max-Pooling → Max-Pooling. Legend: W = Word Representation, I = Intermediate Representation, D = Decision Representation.]
Train Cost Function

Enforcing positive saliency at three levels: W (word representation), I (intermediate representation), and D (decision representation):

    C(θ, X, y, Z) = L(θ, X, y)
                  + λ Σ_{i=1}^{n} max(0, −Z_i S(W_i))
                  + λ Σ_{i=1}^{n} max(0, −Z_i S(I_i))
                  + λ Σ_{i=1}^{n} max(0, −Z_i S(D_i))
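The three-level cost is the single-level explanation loss applied at W, I, and D with the same annotation Z. A minimal sketch, again with stand-in saliency numbers rather than real gradients:

```python
def hinge_sum(saliencies, z):
    """Sum_i max(0, -Z_i * S(unit_i)) for one level of the model."""
    return sum(max(0.0, -zi * si) for zi, si in zip(z, saliencies))

def total_cost(task_loss, s_w, s_i, s_d, z, lam):
    """C = L + lambda * (word-level + intermediate-level + decision-level
    explanation losses); all three levels share the same annotation Z."""
    return task_loss + lam * (hinge_sum(s_w, z) +
                              hinge_sum(s_i, z) +
                              hinge_sum(s_d, z))

z   = [0, 1, 1]                 # contributory-word mask, shared across levels
s_w = [0.3, -0.2, 0.5]          # toy saliencies at the W, I, D levels
s_i = [0.1,  0.4, -0.3]
s_d = [-0.6, 0.2,  0.1]
print(total_cost(1.0, s_w, s_i, s_d, z, lam=0.5))   # 1.0 + 0.5*(0.2 + 0.3 + 0.0) = 1.25
```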
Results
Figure 1: A high-level view of the models used for event extraction (a) and question answering (b).

…to binary tasks. Note that for both tasks, if an example is negative, its explanation annotation will be all zeros. In other words, for negative examples we have C = L.

4 Model

We use simple CNN-based models to avoid complexity. Figure 1 illustrates the models used in this paper. Both models have a similar structure; the main difference is that the Q.A. model has two inputs (sentence and query). We first describe the event extraction model, followed by the Q.A. model.

Figure 1 (a) shows the event extraction model. Given a sentence W = [w_1, ..., w_n] where w_i ∈ R^d, we first pass the embeddings to two CNNs with feature size d and window sizes 3 and 5. Next we apply max-pooling to both CNN outputs. This gives us the representation I ∈ R^{n×d}, which we refer to as the intermediate representation. Then, we apply sequence-wise and dimension-wise max-pooling to I to capture D_seq ∈ R^d and D_dim ∈ R^n respectively. D_dim will be referred to as the decision representation. Finally, we pass the concatenation of D_seq and D_dim to a feed-forward layer for prediction.

Figure 1 (b) depicts the Q.A. model. The main difference is having the query as an extra input. To process the query, we use a structure similar to the main model. After the CNNs and max-pooling we end up with Q ∈ R^{m×d}, where m is the length of the query. To obtain a sequence-independent vector, we apply another max-pooling to Q, resulting in a query representation q ∈ R^d. We follow a similar approach on the sentence as in event extraction; the only difference is that we apply the dot product between the intermediate representations and the query representation (I_i = I_i · q).

As mentioned previously, we can apply saliency regularization at different levels of the model. In this paper, we apply it at the following three levels: 1) word embeddings (W); 2) intermediate representation (I); 3) decision representation (D_dim). Note that the aforementioned levels share the same annotation for training. For training details, please refer to Section C of the Appendix.¹

Table 1: Performance of the trained models on multiple datasets using the traditional method and saliency learning (S.L. = saliency learning, P. = precision, R. = recall, Acc. = accuracy).

Dataset | S.L. | P.   | R.   | F1   | Acc.
ACE     | No   | 66.0 | 77.5 | 71.3 | 74.4
ACE     | Yes  | 70.1 | 76.1 | 73.0 | 76.9
ERE     | No   | 85.0 | 86.6 | 85.8 | 83.1
ERE     | Yes  | 85.8 | 87.3 | 86.6 | 84.0
CBT-NE  | No   | 55.6 | 76.3 | 64.3 | 75.5
CBT-NE  | Yes  | 57.2 | 74.5 | 64.7 | 76.5
CBT-CN  | No   | 47.4 | 39.0 | 42.8 | 77.3
CBT-CN  | Yes  | 48.3 | 38.9 | 43.1 | 77.7

5 Experiments and Analysis

5.1 Performance

Table 1 shows the performance of the trained models on the ACE, ERE, CBT-NE, and CBT-CN datasets using the aforementioned models with and without saliency learning. The results indicate that saliency learning yields better accuracy and F1 on all four datasets. It is interesting to note that saliency learning consistently helps the models achieve noticeably higher precision without hurting F1 or accuracy. This observation suggests that saliency learning is effective in providing proper guidance for more accurate predictions (note that here we only have guidance for positive predictions). To verify the statistical significance of the observed improvement over traditionally trained models without saliency learning, we conducted the one-sided McNemar's test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE, ERE, CBT-NE, and CBT-CN respectively, indicating that the performance gain from saliency learning is statistically significant.

5.2 Saliency Accuracy and Visualization

In this section, we examine how well the saliency of the trained model aligns with the annotation. To this end, we define a metric called saliency accuracy (sacc), which measures what

¹ Code will be publicly available upon acceptance.
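The shape bookkeeping of the event-extraction model in Section 4 can be sketched with NumPy. This uses random, untrained weights and a toy size d = 8 (the paper uses d = 300 and trained CNNs); it only demonstrates how W → I → (D_seq, D_dim) flow and what shapes each step produces.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 7, 8                                # sentence length and feature size (toy; paper uses d = 300)

def conv1d(x, k, w):
    """Same-padded 1-D convolution over the sentence with window k; w maps each
    flattened window (k*d values) to d output features."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    windows = np.stack([xp[i:i + k].ravel() for i in range(x.shape[0])])
    return np.tanh(windows @ w)            # (n, d)

x  = rng.normal(size=(n, d))               # word embeddings W = [w_1, ..., w_n]
w3 = rng.normal(size=(3 * d, d)) * 0.1     # random (untrained) weights for the window-3 CNN
w5 = rng.normal(size=(5 * d, d)) * 0.1     # random (untrained) weights for the window-5 CNN

# Max-pooling over the two CNN outputs gives the intermediate representation I.
i_rep = np.maximum(conv1d(x, 3, w3), conv1d(x, 5, w5))   # (n, d)

d_seq = i_rep.max(axis=0)                  # sequence-wise max-pooling  -> (d,)
d_dim = i_rep.max(axis=1)                  # dimension-wise max-pooling -> (n,), decision representation
features = np.concatenate([d_seq, d_dim])  # (n + d,), fed to a feed-forward layer for prediction
print(features.shape)                      # (15,)
```

For the Q.A. variant, the query branch would produce q ∈ R^d the same way, and each row of I would be combined with q before the final pooling.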
NAACL 2019
Results
!30
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model. Given a sentence W = [w_1, ..., w_n] where w_i ∈ R^d, we first pass the embeddings to two CNNs with feature size d and window sizes 3 and 5. Next we apply max-pooling to both CNN outputs, which gives us the representation I ∈ R^{n×d}, which we refer to as the intermediate representation. Then we apply sequence-wise and dimension-wise max-pooling to I to capture D_seq ∈ R^d and D_dim ∈ R^n respectively; D_dim will be referred to as the decision representation. Finally, we pass the concatenation of D_seq and D_dim to a feed-forward layer for prediction.
Figure 1 (b) depicts the Q.A. model. The main difference is that the query is an extra input. To process the query, we use a structure similar to the main model. After the CNNs and max-pooling we end up with Q ∈ R^{m×d}, where m is the length of the query. To obtain a sequence-independent vector, we apply another max-pooling to Q, resulting in a query representation q ∈ R^d. We follow a similar approach on the sentence as in event extraction; the only difference is that we apply the dot product between the intermediate representations and the query representation (I_i = I_i · q).
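The event extraction forward pass above can be sketched as a minimal NumPy toy. All names and weights here are illustrative, and fusing the two pooled CNN outputs with an elementwise max is an assumption, since the text does not specify how the two pooled outputs are combined into I:

```python
import numpy as np

def conv1d(x, w):
    """Same-padded 1-D convolution over the sequence axis with ReLU.
    x: (n, d) token embeddings; w: (k, d, d) kernel -> (n, d) output."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    n, d = x.shape
    out = np.zeros((n, d))
    for i in range(n):
        window = xp[i:i + k]                      # (k, d) receptive field
        out[i] = np.einsum('kd,kde->e', window, w)
    return np.maximum(out, 0.0)                   # ReLU

def forward(W, rng):
    """Event-extraction sketch: two CNNs (windows 3 and 5) -> max-pool fusion
    -> intermediate rep I -> sequence/dimension max-pools -> concatenation."""
    n, d = W.shape
    w3 = rng.normal(0, 0.1, (3, d, d))
    w5 = rng.normal(0, 0.1, (5, d, d))
    I = np.maximum(conv1d(W, w3), conv1d(W, w5))  # (n, d) intermediate rep
    D_seq = I.max(axis=0)                         # (d,) sequence-wise max-pool
    D_dim = I.max(axis=1)                         # (n,) dimension-wise (decision rep)
    return np.concatenate([D_seq, D_dim])         # fed to a feed-forward layer

rng = np.random.default_rng(0)
W = rng.normal(size=(7, 16))    # toy sentence: 7 tokens, embedding size d = 16
features = forward(W, rng)
print(features.shape)           # (d + n,) = (23,)
```

The Q.A. variant would differ only in scaling each I_i by the pooled query vector before the final poolings.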
As mentioned previously, we can apply saliency regularization at different levels of the model. In this paper, we apply saliency regularization at the following three levels: 1) word embeddings (W); 2) the intermediate representation (I); 3) the decision representation (D_dim). Note that the aforementioned levels share the same annotation for training. For training details please refer to Section C of the Appendix¹.

Table 1: Performance of trained models on multiple datasets using the traditional method and saliency learning. (S.L. = saliency learning; P. = precision; R. = recall; Acc. = accuracy)

Dataset   S.L.   P.     R.     F1     Acc.
ACE       No     66.0   77.5   71.3   74.4
ACE       Yes    70.1   76.1   73.0   76.9
ERE       No     85.0   86.6   85.8   83.1
ERE       Yes    85.8   87.3   86.6   84.0
CBT-NE    No     55.6   76.3   64.3   75.5
CBT-NE    Yes    57.2   74.5   64.7   76.5
CBT-CN    No     47.4   39.0   42.8   77.3
CBT-CN    Yes    48.3   38.9   43.1   77.7
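Given the objective C(θ, X, y, Z) = L(θ, X, y) + λ Σ_i max(0, −Z_i S(X_i)), the explanation loss itself is a simple hinge penalty. A minimal NumPy sketch (the saliency values below are dummies standing in for real gradients of the prediction with respect to the regularized level):

```python
import numpy as np

def explanation_loss(saliency, Z, lam=1.0):
    """Hinge penalty lam * sum_i max(0, -Z_i * S(X_i)).
    Pushes the saliency S(X_i) to be positive wherever Z_i = 1.
    saliency: (n,) saliency of each unit toward the prediction
    Z:        (n,) 0/1 mask of contributory words."""
    return lam * np.maximum(0.0, -Z * saliency).sum()

S = np.array([0.5, -0.2, 0.1, -0.4])   # dummy saliency values
Z = np.array([1,    1,   0,   0])      # first two words are contributory
print(explanation_loss(S, Z))          # only the second unit is penalized -> 0.2
```

Consistent with the text, a negative example has Z all zero, so the penalty vanishes and C reduces to L.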
5 Experiments and Analysis
5.1 Performance

Table 1 shows the performance of the trained models on the ACE, ERE, CBT-NE, and CBT-CN datasets using the aforementioned models with and without saliency learning. The results indicate that saliency learning yields better accuracy and F1 on all four datasets. Notably, saliency learning consistently helps the models achieve higher precision without hurting F1 or accuracy. This observation suggests that saliency learning provides effective guidance toward more accurate predictions; note that here we only have guidance for positive predictions. To verify the statistical significance of the observed improvement over traditionally trained models without saliency learning, we conducted the one-sided McNemar's test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE, ERE, CBT-NE, and CBT-CN respectively, indicating that the performance gain from saliency learning is statistically significant.
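McNemar's test compares two classifiers on the discordant examples only (those exactly one model gets right). A stdlib-only exact one-sided version can be sketched as follows; the counts below are made up for illustration, not taken from the paper:

```python
from math import comb

def mcnemar_one_sided(b, c):
    """Exact one-sided McNemar test on discordant pairs.
    b: examples only model B (e.g. saliency learning) gets right
    c: examples only model A (the baseline) gets right
    Under H0 each discordant pair is a fair coin, so the p-value is
    P(X >= b) with X ~ Binomial(b + c, 0.5)."""
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# hypothetical counts: 30 wins for the saliency-trained model, 15 for the baseline
p = mcnemar_one_sided(30, 15)
print(p)
```

With these illustrative counts the p-value falls below 0.05, the kind of evidence behind the significance claims above.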
5.2 Saliency Accuracy and Visualization

In this section, we examine how well the saliency of the trained model aligns with the annotation. To this end, we define a metric called saliency accuracy (sacc), which measures what
¹ Code will be publicly available upon acceptance.
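The sentence defining sacc is truncated here; one plausible reading, assumed purely for illustration, is the fraction of annotated contributory units (Z_i = 1) whose saliency is positive. A minimal sketch under that assumption:

```python
import numpy as np

def saliency_accuracy(saliency, Z):
    """Assumed reading of sacc: the fraction of contributory units
    (Z_i = 1) whose saliency toward the prediction is positive."""
    mask = Z == 1
    return float((saliency[mask] > 0).mean())

S = np.array([0.5, -0.2, 0.1, -0.4])   # dummy saliency values
Z = np.array([1, 1, 1, 0])             # three annotated contributory units
print(saliency_accuracy(S, Z))         # 2 of the 3 contributory units are positive
```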
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
NAACL 2019
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
Without Explanation
Results
!31
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliency regularization at different levels of the model. In this paper, we apply saliency regularization at the following three levels: 1) word embeddings (W); 2) intermediate representation (I); 3) decision representation (D_dim). Note that the aforementioned levels share the same annotation for training. For training details, please refer to Section C of the Appendix.¹

Dataset   S.    P.    R.    F1    Acc.
ACE       No    66.0  77.5  71.3  74.4
          Yes   70.1  76.1  73.0  76.9
ERE       No    85.0  86.6  85.8  83.1
          Yes   85.8  87.3  86.6  84.0
CBT-NE    No    55.6  76.3  64.3  75.5
          Yes   57.2  74.5  64.7  76.5
CBT-CN    No    47.4  39.0  42.8  77.3
          Yes   48.3  38.9  43.1  77.7

S. = Saliency Learning; P. = Precision; R. = Recall; Acc. = Accuracy.

Table 1: Performance of trained models on multiple datasets using the traditional method and saliency learning.
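The regularizer applied at each of these levels is the explanation loss introduced earlier, C(θ, X, y, Z) = L(θ, X, y) + λ Σ_i max(0, −Z_i S(X_i)), which penalizes negative saliency on contributory units. A minimal sketch of the penalty term follows; the function name and the weight `lam` are illustrative, not the authors' code.

```python
import numpy as np

def explanation_loss(saliency, Z, lam=1.0):
    """Penalty term from the saliency-learning objective:
    lam * sum_i max(0, -Z_i * S_i), which pushes the saliency S_i of
    every contributory unit (Z_i = 1) toward being positive.
    saliency: per-unit saliency scores (e.g., gradient of the prediction
    w.r.t. each unit); Z: binary contributory-word mask."""
    return lam * np.maximum(0.0, -Z * saliency).sum()

# A contributory word with negative saliency (-0.3) is penalized;
# positive saliency and non-contributory words contribute nothing.
S = np.array([0.5, -0.3, 0.2])
Z = np.array([1.0, 1.0, 0.0])
loss = explanation_loss(S, Z)  # = max(0,-0.5) + max(0,0.3) + 0 = 0.3
```

For a negative example Z is all zeros, so the penalty vanishes and C = L, matching the note above.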
5 Experiments and Analysis
5.1 Performance

Table 1 shows the performance of the trained models on the ACE, ERE, CBT-NE, and CBT-CN datasets using the aforementioned models with and without saliency learning. The results indicate that saliency learning yields better accuracy and F1 measure on all four datasets. It is interesting to note that saliency learning consistently helps the models achieve noticeably higher precision without hurting the F1 measure or accuracy. This observation suggests that saliency learning is effective in providing proper guidance for more accurate predictions (note that here we only have guidance for positive predictions). To verify the statistical significance of the observed performance improvement over traditionally trained models without saliency learning, we conducted the one-sided McNemar's test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE, ERE, CBT-NE, and CBT-CN respectively, indicating that the performance gain from saliency learning is statistically significant.
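McNemar's test compares two classifiers on the same examples using only the discordant pairs. The paper does not say whether the exact or asymptotic variant was used, so the exact binomial form below is an assumption for illustration.

```python
from math import comb

def mcnemar_one_sided(b, c):
    """Exact one-sided McNemar test on discordant pairs.
    b: examples only the baseline classifies correctly;
    c: examples only the saliency-trained model classifies correctly.
    Under the null hypothesis (equal accuracy), c ~ Binomial(b + c, 0.5);
    the p-value is P(X >= c)."""
    n = b + c
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n
```

For example, if the saliency-trained model wins 8 of 10 discordant pairs, the one-sided p-value is 56/1024 ≈ 0.055.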
5.2 Saliency Accuracy and Visualization

In this section, we examine how well the saliency of the trained model aligns with the annotation. To this end, we define a metric called saliency accuracy (sacc), which measures what
¹ Code will be publicly available upon acceptance.
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
NAACL 2019
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
With Explanation
Results
!32
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used for event extraction (a) and question answering (b).
to binary tasks. Note that for both tasks, if an example is negative, its explanation annotation will be all zero. In other words, for negative examples we have C = L.
4 Model
We use simple CNN-based models to avoid complexity. Figure 1 illustrates the models used in this paper. Both models have a similar structure; the main difference is that the Q.A. model has two inputs (sentence and query). We first describe the event extraction model, followed by the Q.A. model.
Figure 1 (a) shows the event extraction model. Given a sentence W = [w_1, ..., w_n] where w_i ∈ ℝ^d, we first pass the embeddings to two CNNs with feature size d and window sizes of 3 and 5. Next we apply max-pooling to both CNN outputs. This gives us the representation I ∈ ℝ^{n×d}, which we refer to as the intermediate representation. Then, we apply sequence-wise and dimension-wise max-pooling to I to capture D_seq ∈ ℝ^d and D_dim ∈ ℝ^n respectively. D_dim will be referred to as the decision representation. Finally, we pass the concatenation of D_seq and D_dim to a feed-forward layer for prediction.
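The two pooling steps above can be sketched in plain Python, with nested lists standing in for tensors (the function names `seq_max_pool` and `dim_max_pool` are ours, introduced only for this illustration):

```python
# I is an n x d intermediate representation (n words, d features per word).
def seq_max_pool(I):
    """Sequence-wise max-pooling: max over the n words -> vector in R^d."""
    d = len(I[0])
    return [max(row[j] for row in I) for j in range(d)]

def dim_max_pool(I):
    """Dimension-wise max-pooling: max over the d features -> vector in R^n."""
    return [max(row) for row in I]

I = [[0.1, 0.9],
     [0.5, 0.2],
     [0.3, 0.4]]          # n = 3 words, d = 2 features

D_seq = seq_max_pool(I)   # [0.5, 0.9]: one value per feature dimension
D_dim = dim_max_pool(I)   # [0.9, 0.5, 0.4]: one value per word
decision_input = D_seq + D_dim  # concatenation fed to the feed-forward layer
```

Because D_dim keeps one value per word, it is the natural place to attach per-word saliency supervision, which is why the paper calls it the decision representation.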
Figure 1 (b) depicts the Q.A. model. The main difference is that it takes the query as an extra input. To process the query, we use a structure similar to the main model. After the CNNs and max-pooling we end up with Q ∈ ℝ^{m×d}, where m is the length of the query. To obtain a sequence-independent vector, we apply another max-pooling to Q, resulting in a query representation q ∈ ℝ^d. We follow a similar approach on the sentence as in event extraction; the only difference is that we apply the dot product between the intermediate representations and the query representation (I_i = I_i · q).
As mentioned previously, we can apply saliency regularization to different levels of the model. In this paper, we apply saliency regularization at the following three levels: 1) Word embeddings (W).
Dataset   S.a   P.b   R.c   F1    Acc.d
ACE       No    66.0  77.5  71.3  74.4
          Yes   70.1  76.1  73.0  76.9
ERE       No    85.0  86.6  85.8  83.1
          Yes   85.8  87.3  86.6  84.0
CBT-NE    No    55.6  76.3  64.3  75.5
          Yes   57.2  74.5  64.7  76.5
CBT-CN    No    47.4  39.0  42.8  77.3
          Yes   48.3  38.9  43.1  77.7

a: Saliency Learning. b: Precision. c: Recall. d: Accuracy.

Table 1: Performance of trained models on multiple datasets using the traditional method and saliency learning.
2) Intermediate representation (I). 3) Decision representation (D_dim). Note that the aforementioned levels share the same annotation for training. For training details, please refer to Section C of the Appendix.1
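The regularization applied at each of these levels is the paper's explanation loss, C(θ, X, y, Z) = L(θ, X, y) + λ Σ_i max(0, −Z_i S(X_i)): the penalty fires whenever a word annotated as contributory (Z_i = 1) receives negative saliency. A minimal sketch of the penalty term on toy numbers (in training, S would be the gradient of the prediction with respect to the unit, and this term is added to the usual loss L):

```python
def explanation_loss(Z, S, lam=1.0):
    """Saliency penalty lam * sum_i max(0, -Z_i * S_i).

    Z: 0/1 annotations marking contributory words.
    S: saliency values (gradient of the prediction w.r.t. each unit).
    Only contributory words with negative saliency are penalized.
    """
    return lam * sum(max(0.0, -z * s) for z, s in zip(Z, S))

Z = [1, 0, 1, 0]            # words 0 and 2 are contributory
S = [-0.3, -0.8, 0.5, 0.1]  # per-word saliency values
penalty = explanation_loss(Z, S)  # only word 0 violates: penalty = 0.3
```

Word 1 also has negative saliency, but since it is not annotated as contributory (Z_1 = 0) it contributes nothing; the loss never forces saliency onto non-contributory words.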
5 Experiments and Analysis
5.1 Performance

Table 1 shows the performance of the trained models on the ACE, ERE, CBT-NE, and CBT-CN datasets using the aforementioned models with and without saliency learning. The results indicate that using saliency learning yields better accuracy and F1 measure on all four datasets. It is interesting to note that saliency learning consistently helps the models achieve noticeably higher precision without hurting the F1 measure or accuracy. This observation suggests that saliency learning is effective in providing proper guidance for more accurate predictions (note that here we only have guidance for positive predictions). To verify the statistical significance of the observed performance improvement over traditionally trained models without saliency learning, we conducted the one-sided McNemar's test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE, ERE, CBT-NE, and CBT-CN respectively, indicating that the performance gain from saliency learning is statistically significant.
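McNemar's test compares two models on paired predictions, looking only at the discordant pairs: examples one model got right and the other got wrong. A minimal exact one-sided version (using the binomial tail directly rather than the chi-squared approximation; the counts below are invented for illustration, not taken from the paper):

```python
from math import comb

def mcnemar_one_sided(n01, n10):
    """Exact one-sided McNemar's test on paired predictions.

    n01: examples only the new model classified correctly.
    n10: examples only the baseline classified correctly.
    Under H0 the discordant pairs split 50/50, so the p-value is
    P(X >= n01) for X ~ Binomial(n01 + n10, 0.5).
    """
    n = n01 + n10
    return sum(comb(n, k) for k in range(n01, n + 1)) / 2 ** n

# Toy example: the new model wins 15 of 20 discordant pairs.
p = mcnemar_one_sided(15, 5)  # ~0.021, significant at the 0.05 level
```

Concordant pairs (both right or both wrong) carry no information about which model is better and are deliberately ignored, which is what makes the test appropriate for two models evaluated on the same test set.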
5.2 Saliency Accuracy and Visualization

In this section, we examine how well the saliency of the trained model aligns with the annotation. To this end, we define a metric called saliency accuracy (sacc), which measures what
1 Code will be publicly available upon acceptance.
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
NAACL 2019
Results
!34
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).
As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).
Dataset S.a P.b R.c F1 Acc.d
ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9
ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0
CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5
CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7
aSaliency Learning. bPrecision.cRecall. dAccuracy
Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.
2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.
5 Experiments and Analysis
5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.
5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what
1Code will be publicly available upon acceptance.
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).
to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.
4 Model
We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.
Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.
Figure 1 (b) depicts the Q.A. model. The main difference is having the query as an extra input. To process the query, we use a similar structure as the main model. After CNNs and max-pooling, we end up with $Q \in \mathbb{R}^{m \times d}$, where $m$ is the length of the query. To obtain a sequence-independent vector, we apply another max-pooling to $Q$, resulting in a query representation $q \in \mathbb{R}^d$. We follow a similar approach on the sentence as in event extraction. The only difference is that we apply the dot product between the intermediate representations and the query representation ($I_i = I_i \cdot q$).
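The event extraction branch described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the class name, the fixed sentence length `n` required by the final linear layer, and the reading of "max-pooling to both CNN outputs" as an element-wise max of the two feature maps are assumptions made for the example.

```python
import torch
import torch.nn as nn

class EventExtractionModel(nn.Module):
    """Sketch of the CNN event-extraction model (Figure 1a)."""
    def __init__(self, n: int, d: int, num_classes: int = 2):
        super().__init__()
        # Two CNNs with feature size d and window sizes 3 and 5;
        # padding keeps the sequence length n unchanged.
        self.conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.conv5 = nn.Conv1d(d, d, kernel_size=5, padding=2)
        self.out = nn.Linear(d + n, num_classes)

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        # W: (batch, n, d) word embeddings
        x = W.transpose(1, 2)                  # (batch, d, n) for Conv1d
        c3, c5 = self.conv3(x), self.conv5(x)
        # Element-wise max over the two CNN outputs -> intermediate repr. I
        I = torch.max(c3, c5).transpose(1, 2)  # (batch, n, d)
        D_seq = I.max(dim=1).values            # sequence-wise max-pool -> (batch, d)
        D_dim = I.max(dim=2).values            # dimension-wise max-pool -> (batch, n)
        return self.out(torch.cat([D_seq, D_dim], dim=-1))

model = EventExtractionModel(n=20, d=64)
logits = model(torch.randn(8, 20, 64))  # shape (8, 2)
```

For the Q.A. variant, the query would be encoded with the same CNN/pooling structure into a vector q and combined with the intermediate representations via the dot product before the pooling steps.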
As mentioned previously, we can apply saliency regularization to different levels of the model. In this paper, we apply saliency regularization at the following three levels: 1) Word embeddings ($W$).
Dataset  S.   P.    R.    F1    Acc.
ACE      No   66.0  77.5  71.3  74.4
         Yes  70.1  76.1  73.0  76.9
ERE      No   85.0  86.6  85.8  83.1
         Yes  85.8  87.3  86.6  84.0
CBT-NE   No   55.6  76.3  64.3  75.5
         Yes  57.2  74.5  64.7  76.5
CBT-CN   No   47.4  39.0  42.8  77.3
         Yes  48.3  38.9  43.1  77.7

S.: Saliency Learning. P.: Precision. R.: Recall. Acc.: Accuracy.

Table 1: Performance of trained models on multiple datasets using the traditional method and saliency learning.
2) Intermediate representation ($I$). 3) Decision representation ($D_{dim}$). Note that the aforementioned levels share the same annotation for training. For training details, please refer to Section C of the Appendix.¹
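The term being regularized at each of these levels is the explanation loss introduced earlier, $C(\theta, X, y, Z) = L(\theta, X, y) + \lambda \sum_{i=1}^{n} \max(0, -Z_i S(X_i))$, where $S$ is the gradient of the prediction with respect to the unit. A small PyTorch sketch of the penalty term follows; the helper name and the toy prediction are illustrative, not from the paper.

```python
import torch

def explanation_loss(saliency: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
    """Penalize negative saliency on contributory positions (Z_i = 1):
    sum_i max(0, -Z_i * S(X_i))."""
    return torch.clamp(-Z * saliency, min=0.0).sum()

# Saliency = gradient of the prediction w.r.t. the units, taken with
# create_graph=True so the penalty itself stays differentiable.
x = torch.randn(5, requires_grad=True)
pred = (x * torch.tensor([1.0, -2.0, 0.5, 3.0, -1.0])).sum()
S = torch.autograd.grad(pred, x, create_graph=True)[0]  # = [1, -2, 0.5, 3, -1]
Z = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0])             # contributory positions
loss = explanation_loss(S, Z)  # 2.0: only the annotated position with negative gradient is penalized
```

Adding this penalty to the prediction loss $L$ pushes the gradient at every annotated contributory position to be positive, which is exactly the alignment goal stated in Section 1.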
5 Experiments and Analysis
5.1 Performance

Table 1 shows the performance of the trained models on the ACE, ERE, CBT-NE, and CBT-CN datasets using the aforementioned models with and without saliency learning. The results indicate that using saliency learning yields better accuracy and F1 measure on all four datasets. It is interesting to note that saliency learning consistently helps the models achieve noticeably higher precision without hurting the F1 measure or accuracy. This observation suggests that saliency learning is effective in providing proper guidance for more accurate predictions; note that here we only have guidance for positive predictions. To verify the statistical significance of the observed performance improvement over traditionally trained models without saliency learning, we conducted the one-sided McNemar's test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE, ERE, CBT-NE, and CBT-CN respectively, indicating that the performance gain from saliency learning is statistically significant.
5.2 Saliency Accuracy and Visualization

In this section, we examine how well the saliency of the trained model aligns with the annotation. To this end, we define a metric called saliency accuracy (sacc), which measures what
¹ Code will be publicly available upon acceptance.
Dataset  S.   W.     I.     D.
ACE      No   61.60  66.05  63.27
         Yes  99.26  77.92  65.49
ERE      No   51.62  56.71  44.37
         Yes  99.77  77.45  51.78
CBT-NE   No   52.32  65.38  68.81
         Yes  98.17  98.34  95.56
CBT-CN   No   47.78  53.68  45.15
         Yes  99.13  98.94  97.06

W.: Word-level saliency accuracy. I.: Intermediate-level saliency accuracy. D.: Decision-level saliency accuracy.

Table 2: Saliency accuracies of different layers of our models trained on ACE, ERE, CBT-NE, and CBT-CN.
percentage of all positive positions of annotation $Z$ indeed obtain a positive gradient. Formally,

$$sacc = 100 \times \frac{\sum_i \delta(Z_i G_i > 0)}{\sum_i Z_i}$$

where $G_i$ is the gradient of unit $i$ and $\delta$ is the indicator function.

Table 2 shows the saliency accuracies at different layers of the trained model with and without saliency learning. According to Table 2, our method achieves a much higher saliency accuracy for all datasets, indicating that the learning was indeed effective in aligning the model saliency with the annotation. In other words, important words have positive contributions in the saliency-trained model, and as such, it learns to focus on the right part of the data. This claim can also be verified by visualizing the saliency; visualizations are provided in Section D of the Appendix.
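Under the definition above, saliency accuracy is a few lines of NumPy. The function name and toy values below are illustrative, not from the paper.

```python
import numpy as np

def saliency_accuracy(G: np.ndarray, Z: np.ndarray) -> float:
    """sacc = 100 * sum_i delta(Z_i * G_i > 0) / sum_i Z_i, with Z a 0/1 annotation."""
    return 100.0 * np.sum((Z * G) > 0) / np.sum(Z)

G = np.array([0.3, -0.1, 0.8, 0.05, -0.4])  # gradients of the units
Z = np.array([1,    1,    1,   0,    0])    # annotated contributory positions
sacc = saliency_accuracy(G, Z)  # 2 of 3 annotated positions have a positive gradient -> 66.67
```

Note that positions with $Z_i = 0$ never contribute to either the numerator or the denominator, so the metric measures alignment only on the annotated contributory words.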
5.3 Verification

Up to this point, we have shown that using saliency learning yields noticeably better precision, F1 measure, accuracy, and saliency accuracy. Here, we aim to verify our claim that saliency learning coerces the model to pay more attention to the critical parts. The annotation $Z$ describes the influential words toward the positive labels. Our hypothesis is that removing such words should have more impact on saliency-trained models, since by training they should be more sensitive to these words. We measure the impact as the percentage change of the model's true positive rate. This measure is chosen because negative examples do not have any annotated contributory words, and hence we are particularly interested in how removing the contributory words of positive examples would impact the model's true positive rate (TPR).
Dataset  S.   TPR_0  TPR_1  ΔTPR
ACE      No   77.5   52.2   32.6
         Yes  76.1   45.0   40.9
ERE      No   86.6   73.2   15.4
         Yes  87.3   70.6   19.1
CBT-NE   No   76.3   30.2   60.4
         Yes  74.5   28.5   61.8
CBT-CN   No   39.0   16.6   57.4
         Yes  38.9   15.4   60.4

TPR_0: true positive rate before removal. TPR_1: TPR after removing the critical word(s). ΔTPR: TPR change rate.

Table 3: True positive rate and true positive rate change of the trained models before and after removing the contributory word(s).

Table 3 shows the outcome of the aforementioned experiment, where the last column lists the TPR reduction rates. From the table, we see a consistently higher rate of TPR reduction for saliency-trained models compared to traditionally trained models, suggesting that saliency-trained models are more sensitive to the presence of the contributory words and confirming our hypothesis.
It is worth noting that we observe a less substantial change in the true positive rate for the event task. This is likely due to the fact that we use the trigger words as simulated explanations. While trigger words are clearly related to events, there are often other words in the sentence that relate to events but are not annotated as trigger words.
6 Conclusion
In this paper, we proposed saliency learning, a novel approach for teaching a model where to pay attention. We demonstrated the effectiveness of our method on multiple tasks and datasets using simulated explanations. The results show that saliency learning enables us to obtain better precision, F1 measure, and accuracy on these tasks and datasets. Further, it produces models whose saliency is more properly aligned with the desired explanation. In other words, saliency learning gives us more reliable predictions while delivering better performance than traditionally trained models. Finally, our verification experiments illustrate that saliency-trained models show higher sensitivity to the removal of contributory words in a positive example. For future work, we will extend our study to examine saliency learning on NLP tasks in an active learning setting where real explanations are requested and provided by humans.
sacc = 100⇥P
i �(ZiGi > 0)Pi Zi
<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>
NAACL 2019
Saliency Accuracy
!36
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Dataset S. W.a I.b D.c
ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49
ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78
CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56
CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06
aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.
Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.
percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100
Pi �(ZiGi>0)P
i Ziwhere Gi is the gradient
of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-
ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.
5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).
Table 3 shows the outcome of the aforemen-
Dataset S. TPRa0 TPRb
1 �TPRc
ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9
ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1
CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8
CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4
aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.
Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).
tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.
It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.
6 Conclusion
In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.
sacc = 100⇥P
i �(ZiGi > 0)Pi Zi
<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>
Saliency Accuracy
Gradient/SaliencyIndicator Function
NAACL 2019
Saliency Accuracy
!37
sacc = 100⇥P
i �(ZiGi > 0)Pi Zi
<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>
Saliency Accuracy
Gradient/SaliencyIndicator Function
NAACL 2019
Dataset S. W.a I.b D.c
ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49
ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78
CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56
CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06
aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Dataset S. W.a I.b D.c
ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49
ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78
CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56
CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06
aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.
Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.
percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100
Pi �(ZiGi>0)P
i Ziwhere Gi is the gradient
of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-
ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.
5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).
Table 3 shows the outcome of the aforemen-
Dataset S. TPRa0 TPRb
1 �TPRc
ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9
ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1
CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8
CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4
aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.
Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).
tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.
It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.
6 Conclusion
In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.
Other words are also impactful toward the occurrence of an event.
True Positive Rate
!38
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Dataset S. W.a I.b D.c
ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49
ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78
CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56
CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06
aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.
Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.
percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100
Pi �(ZiGi>0)P
i Ziwhere Gi is the gradient
of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-
ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.
5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).
Table 3 shows the outcome of the aforemen-
Dataset S. TPRa0 TPRb
1 �TPRc
ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9
ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1
CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8
CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4
aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.
Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).
tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.
It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.
6 Conclusion
In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.
�TPR = 100⇥ TPR0 � TPR1
TPR0<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>
NAACL 2019
True Positive Rate
!39
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Dataset S. W.a I.b D.c
ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49
ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78
CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56
CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06
aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.
Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.
percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100
Pi �(ZiGi>0)P
i Ziwhere Gi is the gradient
of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-
ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.
5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).
Table 3 shows the outcome of the aforemen-
Dataset S. TPRa0 TPRb
1 �TPRc
ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9
ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1
CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8
CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4
aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.
Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).
tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.
It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.
6 Conclusion
In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.
�TPR = 100⇥ TPR0 � TPR1
TPR0<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>
NAACL 2019
TPR before removing contributory words
TPR after removing contributory words
True Positive Rate
!40
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Dataset   S.    W.(a)   I.(b)   D.(c)
ACE       No    61.60   66.05   63.27
ACE       Yes   99.26   77.92   65.49
ERE       No    51.62   56.71   44.37
ERE       Yes   99.77   77.45   51.78
CBT-NE    No    52.32   65.38   68.81
CBT-NE    Yes   98.17   98.34   95.56
CBT-CN    No    47.78   53.68   45.15
CBT-CN    Yes   99.13   98.94   97.06

(a) Word-level saliency accuracy. (b) Intermediate-level saliency accuracy. (c) Decision-level saliency accuracy.

Table 2: Saliency accuracies of different layers of our models trained on ACE, ERE, CBT-NE, and CBT-CN.
percentage of all positive positions of annotation Z that indeed obtain a positive gradient. Formally,

sacc = 100 × ( Σ_i δ(Z_i G_i > 0) ) / ( Σ_i Z_i )

where G_i is the gradient of unit i and δ is the indicator function.

Table 2 shows the saliency accuracies at different layers of the trained model with and without saliency learning. According to Table 2, our method achieves a much higher saliency accuracy for all datasets, indicating that the learning was indeed effective in aligning the model saliency with the annotation. In other words, important words will have positive contributions in the saliency-trained model, and as such, it learns to focus on the right parts of the data. This claim can also be verified by visualizing the saliency; examples are provided in Section D of the Appendix.
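The saliency accuracy metric can be sketched in a few lines of Python. This is a minimal illustration, assuming Z is a binary 0/1 mask over units (1 marks an annotated contributory position) and G holds the corresponding gradients; the example values are illustrative, not from the paper.

```python
def saliency_accuracy(Z, G):
    """sacc = 100 * sum_i delta(Z_i * G_i > 0) / sum_i Z_i.

    For a binary mask Z, this is the percentage of annotated
    positions whose gradient is positive.
    """
    hits = sum(1 for z, g in zip(Z, G) if z * g > 0)
    return 100.0 * hits / sum(Z)

# 4 annotated positions, 3 of which receive a positive gradient
Z = [1, 1, 0, 1, 0, 1]
G = [0.5, -0.2, 0.9, 0.1, -0.3, 0.4]
print(saliency_accuracy(Z, G))  # 75.0
```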
5.3 Verification

Up to this point, we have shown that saliency learning yields noticeably better precision, F1 measure, accuracy, and saliency accuracy. Here, we aim to verify our claim that saliency learning coerces the model to pay more attention to the critical parts. The annotation Z describes the influential words toward the positive labels. Our hypothesis is that removing such words should have a greater impact on saliency-trained models, since training should make them more sensitive to these words. We measure the impact as the percentage change of the model's true positive rate. This measure is chosen because negative examples do not have any annotated contributory words; hence, we are particularly interested in how removing the contributory words of positive examples impacts the model's true positive rate (TPR).
Dataset   S.    TPR_0(a)   TPR_1(b)   ΔTPR(c)
ACE       No    77.5       52.2       32.6
ACE       Yes   76.1       45.0       40.9
ERE       No    86.6       73.2       15.4
ERE       Yes   87.3       70.6       19.1
CBT-NE    No    76.3       30.2       60.4
CBT-NE    Yes   74.5       28.5       61.8
CBT-CN    No    39.0       16.6       57.4
CBT-CN    Yes   38.9       15.4       60.4

(a) True positive rate (before removal). (b) TPR after removing the critical word(s). (c) TPR change rate.

Table 3: True positive rate and true positive rate change of the trained models before and after removing the contributory word(s).

Table 3 shows the outcome of the aforementioned experiment, where the last column lists the TPR reduction rates. From the table, we see a consistently higher rate of TPR reduction for saliency-trained models compared to traditionally trained models, suggesting that saliency-trained models are more sensitive to the presence of the contributory words, which confirms our hypothesis.
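The metric arithmetic behind this experiment, the TPR over positive examples and the reduction rate ΔTPR = 100 × (TPR_0 − TPR_1) / TPR_0, can be sketched as follows; the example reuses the ACE saliency-trained numbers from Table 3.

```python
def true_positive_rate(labels, preds):
    """Percentage of gold-positive examples that are predicted positive."""
    positives = [(y, p) for y, p in zip(labels, preds) if y == 1]
    return 100.0 * sum(p == 1 for _, p in positives) / len(positives)

def delta_tpr(tpr_before, tpr_after):
    """TPR reduction rate: 100 * (TPR_0 - TPR_1) / TPR_0."""
    return 100.0 * (tpr_before - tpr_after) / tpr_before

# ACE, saliency-trained model (Table 3): TPR drops from 76.1 to 45.0
print(round(delta_tpr(76.1, 45.0), 1))  # 40.9
```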
It is worth noting that we observe a less substantial change in the true positive rate for the event task. This is likely because we use the trigger words as simulated explanations. While trigger words are clearly related to events, there are often other words in the sentence that relate to events but are not annotated as trigger words.
6 Conclusion
In this paper, we proposed saliency learning, a novel approach for teaching a model where to pay attention. We demonstrated the effectiveness of our method on multiple tasks and datasets using simulated explanations. The results show that saliency learning enables us to obtain better precision, F1 measure, and accuracy on these tasks and datasets. Further, it produces models whose saliency is more properly aligned with the desired explanation. In other words, saliency learning gives us more reliable predictions while delivering better performance than traditionally trained models. Finally, our verification experiments illustrate that saliency-trained models show higher sensitivity to the removal of contributory words in positive examples. For future work, we will extend our study to examine saliency learning on NLP tasks in an active learning setting where real explanations are requested and provided by humans.
Saliency Visualization
1. "The judge at Hassan's extradition hearing said that he found the French handwriting report very problematic, very confusing, and with suspect conclusions." (Z: extradition, hearing, said; PB: 1; PS: 1)
2. "Solana said the EU would help in the humanitarian crisis expected to follow an attack on Iraq." (Z: attack; PB: 1; PS: 1)
3. "The trial will start on March 13, the court said." (Z: trial; PB: 1; PS: 1)
4. "India's has been reeling under a heatwave since mid-May which has killed 1,403 people." (Z: killed; PB: 1; PS: 1)
5. "Retired General Electric Co. Chairman Jack Welch is seeking work-related documents of his estranged wife in his high-stakes divorce case." (Z: Retired, divorce; PB: 1; PS: 1)
6. "The following year, he was acquitted in the Guatemala case, but the U.S. continued to push for his prosecution." (Z: acquitted, case; PB: 1; PS: 1)
7. "In 2011, a Spanish National Court judge issued arrest warrants for 20 men, including Montano, suspected of participating in the slaying of the priests." (Z: issued, slaying, arrest; PB: 1; PS: 1)
8. "Slobodan Milosevic's wife will go on trial next week on charges of mismanaging state property during the former president's rule, a court said Thursday." (Z: trial, charges, former; PB: 1; PS: 1)
9. "Iraqis mostly fought back with small arms, pistols, machine guns and rocket-propelled grenades." (Z: fought; PB: 1; PS: 1)
10. "But the Saint Petersburg summit ended without any formal declaration on Iraq." (Z: summit; PB: 1; PS: 1)

Table 2: Top 6 salient tokens visualization of samples in ACE and ERE for the baseline and saliency-based models. Z lists the annotated contributory words; PB and PS are the predictions of the baseline and saliency-based models, respectively.
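A "top 6 salient tokens" view like the one above can be produced by ranking tokens by the magnitude of their gradient-based saliency scores. The sketch below is illustrative: the scores are made-up values, not actual model gradients.

```python
def top_k_salient(tokens, scores, k=6):
    """Return the k tokens with the largest saliency magnitude, in sentence order."""
    ranked = sorted(range(len(tokens)), key=lambda i: abs(scores[i]), reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]

tokens = ["The", "trial", "will", "start", "on", "March",
          "13", ",", "the", "court", "said", "."]
scores = [0.1, 0.9, 0.2, 0.4, 0.05, 0.3, 0.25, 0.0, 0.1, 0.5, 0.6, 0.0]
print(top_k_salient(tokens, scores))
# ['trial', 'start', 'March', '13', 'court', 'said']
```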
NAACL 2019
id Baseline Model Saliency-based Model Z PB PS
1 The judge at Hassan’s The judge at Hassan ’s extradition 1 1extradition hearing said extradition hearing said hearingthat he found the French that he found the French said
handwriting report very handwriting report veryproblematic, very confusing, problematic, very confusing,and with suspect conclusions. and with suspect conclusions.
2 Solana said the EU would help Solana said the EU would help attack 1 1in the humanitarian crisis in the humanitarian crisisexpected to follow an expected to follow anattack on Iraq. attack on Iraq .
3 The trial will start on The trial will start on trial 1 1March 13 , the court said . March 13 , the court said.
4 India ’s has been reeling India ’s has been reeling killed 1 1under a heatwave since under a heatwave sincemid-May which has mid-May which haskilled 1,403 people. killed 1,403 people .
5 Retired General Electric Co. Retired General Electric Co. Retired 1 1Chairman Jack Welch is Chairman Jack Welch is divorce
seeking work-related seeking work-relateddocuments of his estranged documents of his estranged
wife in his high-stakes wife in his high-stakesdivorce case . divorce case .
6 The following year, he was The following year, he was acquitted 1 1
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
id Baseline Model Saliency-based Model Z PB PS
1 The judge at Hassan’s The judge at Hassan ’s extradition 1 1extradition hearing said extradition hearing said hearingthat he found the French that he found the French said
handwriting report very handwriting report veryproblematic, very confusing, problematic, very confusing,and with suspect conclusions. and with suspect conclusions.
2 Solana said the EU would help Solana said the EU would help attack 1 1in the humanitarian crisis in the humanitarian crisisexpected to follow an expected to follow anattack on Iraq. attack on Iraq .
3 The trial will start on The trial will start on trial 1 1March 13 , the court said . March 13 , the court said.
4 India ’s has been reeling India ’s has been reeling killed 1 1under a heatwave since under a heatwave sincemid-May which has mid-May which haskilled 1,403 people. killed 1,403 people .
5 Retired General Electric Co. Retired General Electric Co. Retired 1 1Chairman Jack Welch is Chairman Jack Welch is divorce
seeking work-related seeking work-relateddocuments of his estranged documents of his estranged
wife in his high-stakes wife in his high-stakesdivorce case . divorce case .
6 The following year, he was The following year, he was acquitted 1 1acquitted in the Guatemala acquitted in the Guatemala casecase, but the U.S. continued case , but the U.S. continuedto push for his prosecution. to push for his prosecution .
7 In 2011, a Spanish National In 2011, a Spanish National issued 1 1Court judge issued arrest Court judge issued arrest slayingwarrants for 20 men , warrants for 20 men, arrestincluding Montano,suspected including Montano,suspectedof participating in the of participating in the
slaying of the priests. slaying of the priests.8 Slobodan Milosevic’s wife will Slobodan Milosevic’s wife will trial 1 1
go on trial next week on go on trial next week on chargescharges of mismanaging state charges of mismanaging state formerproperty during the former property during the formerpresident’s rule, a court said president ’s rule, a court saidThursday. Thursday .
9 Iraqis mostly fought back Iraqis mostly fought back fought 1 1with small arms, pistols, with small arms, pistols,machine guns and machine guns and
rocket-propelled grenades . rocket-propelled grenades.10 But the Saint Petersburg But the Saint Petersburg summit 1 1
summit ended without any summit ended without anyformal declaration on Iraq . formal declaration on Iraq .
Table 2: Top 6 salient tokens visualization of samples in ACE and ERE for baseline and saliency-based models.
Saliency Visualization
!42 NAACL 2019
id Baseline Model Saliency-based Model Z PB PS
1 The judge at Hassan’s The judge at Hassan ’s extradition 1 1extradition hearing said extradition hearing said hearingthat he found the French that he found the French said
handwriting report very handwriting report veryproblematic, very confusing, problematic, very confusing,and with suspect conclusions. and with suspect conclusions.
2 Solana said the EU would help Solana said the EU would help attack 1 1in the humanitarian crisis in the humanitarian crisisexpected to follow an expected to follow anattack on Iraq. attack on Iraq .
3 The trial will start on The trial will start on trial 1 1March 13 , the court said . March 13 , the court said.
4 India ’s has been reeling India ’s has been reeling killed 1 1under a heatwave since under a heatwave sincemid-May which has mid-May which haskilled 1,403 people. killed 1,403 people .
5 Retired General Electric Co. Retired General Electric Co. Retired 1 1Chairman Jack Welch is Chairman Jack Welch is divorce
seeking work-related seeking work-relateddocuments of his estranged documents of his estranged
wife in his high-stakes wife in his high-stakesdivorce case . divorce case .
6 The following year, he was The following year, he was acquitted 1 1
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
id | Sentence (identical in the Baseline and Saliency-based columns; the boldface marking each model's top salient tokens is not recoverable in plain text) | Z | PB | PS
1 | The judge at Hassan's extradition hearing said that he found the French handwriting report very problematic, very confusing, and with suspect conclusions. | extradition, hearing, said | 1 | 1
2 | Solana said the EU would help in the humanitarian crisis expected to follow an attack on Iraq. | attack | 1 | 1
3 | The trial will start on March 13, the court said. | trial | 1 | 1
4 | India's has been reeling under a heatwave since mid-May which has killed 1,403 people. | killed | 1 | 1
5 | Retired General Electric Co. Chairman Jack Welch is seeking work-related documents of his estranged wife in his high-stakes divorce case. | Retired, divorce | 1 | 1
6 | The following year, he was acquitted in the Guatemala case, but the U.S. continued to push for his prosecution. | acquitted, case | 1 | 1
7 | In 2011, a Spanish National Court judge issued arrest warrants for 20 men, including Montano, suspected of participating in the slaying of the priests. | issued, slaying, arrest | 1 | 1
8 | Slobodan Milosevic's wife will go on trial next week on charges of mismanaging state property during the former president's rule, a court said Thursday. | trial, charges, former | 1 | 1
9 | Iraqis mostly fought back with small arms, pistols, machine guns and rocket-propelled grenades. | fought | 1 | 1
10 | But the Saint Petersburg summit ended without any formal declaration on Iraq. | summit | 1 | 1
Table 2: Top 6 salient tokens visualization of samples in ACE and ERE for baseline and saliency-based models. Z lists the annotated contributory words; PB and PS are the predictions of the baseline and saliency-based models, respectively.
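The "top 6 salient tokens" in the table come from ranking per-token saliency scores. A minimal illustrative sketch follows; the function name and the example scores are assumptions for illustration, not from the paper. In practice, a token's score would be the gradient of the prediction with respect to that token's embedding.

```python
def top_k_salient(tokens, saliency, k=6):
    """Return the k tokens with the highest saliency scores
    (e.g., the gradient of the prediction w.r.t. each token)."""
    ranked = sorted(zip(tokens, saliency), key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

# Sentence 3 from the table, with made-up saliency scores:
tokens = ["The", "trial", "will", "start", "on", "March", "13", ",", "the", "court", "said", "."]
scores = [0.01, 0.90, 0.05, 0.10, 0.02, 0.20, 0.15, 0.01, 0.02, 0.30, 0.40, 0.01]
print(top_k_salient(tokens, scores))
# → ['trial', 'said', 'court', 'March', '13', 'start']
```

With these hypothetical scores the contributory word "trial" ranks first, matching the Z column for that row.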
Saliency Visualization
!43 NAACL 2019
Irrelevant helpful data
Conclusion
• Proposing saliency learning, a novel approach for teaching a model where to pay attention.
• Experimenting on multiple tasks and datasets.
• Achieving better precision, F1 measure, and accuracy.
• Obtaining models whose saliency is more properly aligned with the desired explanation (more reliable).
• Future work: using saliency learning in a semi-supervised framework.
!44 NAACL 2019
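The explanation loss at the heart of saliency learning, C(θ, X, y, Z) = L(θ, X, y) + λ Σ_i max(0, −Z_i S(X_i)), can be sketched in a few lines of plain Python. This is an illustrative version only (the names `explanation_loss` and `total_cost` are mine, not from the paper): `saliency[i]` stands for S(X_i), the gradient of the prediction with respect to token i, and `contributory[i]` for Z_i ∈ {0, 1}.

```python
def explanation_loss(saliency, contributory, lam=1.0):
    """Hinge-style penalty lam * sum_i max(0, -Z_i * S(X_i)).

    Penalizes contributory words (Z_i = 1) whose saliency S(X_i) is
    negative; words with Z_i = 0 contribute nothing to the penalty.
    """
    return lam * sum(max(0.0, -z * s) for s, z in zip(saliency, contributory))

def total_cost(task_loss, saliency, contributory, lam=1.0):
    """C(theta, X, y, Z) = L(theta, X, y) + explanation loss."""
    return task_loss + explanation_loss(saliency, contributory, lam)
```

For example, with saliency scores [0.8, -0.5, 0.2] and contributory mask [1, 1, 0], only the second token is penalized, adding λ · 0.5 to the task loss; once every contributory word has positive saliency, the penalty vanishes and only the original task loss remains.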
Thank You