Learning and Knowledge Transfer with Memory Networks for Machine Comprehension
Mohit Yadav, Lovekesh Vig, Gautam Shroff
TCS Research, New Delhi
Presented by Kyo Kim
April 24, 2018
Mohit Yadav, Lovekesh Vig, Gautam Shroff (TCS Research, New Delhi), presented by Kyo Kim, April 24, 2018
Overview
1 Motivation
2 Background
3 Proposed Method
4 Dataset and Experiment Results
5 Summary
Motivation
Problem
Obtaining high performance in "machine comprehension" requires an abundant human-annotated dataset.
Performance is measured by question-answering accuracy.
In real-world datasets with small amounts of data, a wider range of vocabulary is observed and the grammatical structure is often complex.
High-level Overview of Proposed Method
1 Curriculum-based training procedure.
2 Knowledge transfer to increase performance on datasets with less abundant labeled data.
3 Pre-trained memory network on a small dataset.
Background
End-to-end Memory Networks
1 Vectorize the problem tuple.
2 Retrieve the corresponding memory attention vector.
3 Use the retrieved memory to answer the question.
End-to-end Memory Networks Cont.
Vectorize the problem tuple
Problem tuple: (q, C, S, s)
q: question
C: context text
S: set of answer choices
s: correct answer (s ∈ S)
Question and context embedding matrix A ∈ ℝ^(p×d)
Query vector: ~q = A Φ(q), where Φ is a bag-of-words representation
Memory vectors: ~m_i = A Φ(c_i) for i = 1, …, n, where n = |C| and c_i ∈ C
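The vectorization step above can be sketched in NumPy. This is a minimal illustration, not the paper's code: the toy vocabulary, the dimension d = 8, and the random embedding matrix A are assumptions.

```python
import numpy as np

# Toy vocabulary and dimensions (illustrative assumptions).
vocab = {"the": 0, "cat": 1, "sat": 2, "where": 3, "mat": 4}
p, d = len(vocab), 8
rng = np.random.default_rng(0)
A = rng.normal(size=(p, d))   # question/context embedding matrix A in R^(p x d)

def phi(tokens):
    """Bag-of-words count vector over the vocabulary (the slide's Phi)."""
    v = np.zeros(len(vocab))
    for t in tokens:
        v[vocab[t]] += 1.0
    return v

# Query vector ~q = A Phi(q) and memory vectors ~m_i = A Phi(c_i).
q_vec = phi("where sat the cat".split()) @ A
context = ["the cat sat".split(), "the mat".split()]
M = np.stack([phi(c) @ A for c in context])
```

Each context sentence becomes one d-dimensional memory row in M; the query lives in the same space so dot products between them are meaningful.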
End-to-end Memory Networks Cont.
Retrieve the corresponding memory attention vector
Attention distribution: a_i = softmax(~m_i^T ~q)
Second memory vector: ~r_i = B Φ(c_i), where B is another embedding matrix similar to A
Aggregated vector: ~r_o = Σ_{i=1}^{n} a_i ~r_i
Prediction vector: â_i = softmax((~r_o + ~q)^T U Φ(s_i)), where U is the embedding matrix for the answers
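The retrieval and prediction steps above can be sketched as follows, assuming the query, memory, and answer embeddings are already computed (all tensors here are random placeholders, not real trained values):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 8, 3, 4                 # embedding dim, number of memories, answer choices
q = rng.normal(size=d)            # query vector ~q (assumed precomputed)
M = rng.normal(size=(n, d))       # memory vectors ~m_i from embedding A
R = rng.normal(size=(n, d))       # second memory vectors ~r_i from embedding B
S = rng.normal(size=(k, d))       # embedded answer choices U Phi(s_i)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

a = softmax(M @ q)                # attention a_i = softmax(~m_i^T ~q)
r_o = a @ R                       # aggregated vector ~r_o = sum_i a_i ~r_i
pred = softmax(S @ (r_o + q))     # prediction: softmax((~r_o + ~q)^T U Phi(s_i))
answer = int(np.argmax(pred))     # pick the answer choice with the highest score
```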
End-to-end Memory Networks Cont.
Answer the question
Pick the s_i that corresponds to the highest â_i.

Cross-entropy Loss
L(P, D) = −(1/N_D) Σ_{n=1}^{N_D} [ a_n · log(â_n(P, D)) + (1 − a_n) · log(1 − â_n(P, D)) ]
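A minimal NumPy version of this cross-entropy loss. The clipping epsilon is an implementation detail added here for numerical safety, not part of the slide:

```python
import numpy as np

def binary_cross_entropy(targets, preds, eps=1e-12):
    """Mean binary cross-entropy over the dataset, matching the slide's
    L(P, D): averages the per-example log-likelihood and negates it."""
    preds = np.clip(preds, eps, 1 - eps)
    return -np.mean(targets * np.log(preds)
                    + (1 - targets) * np.log(1 - preds))

y = np.array([1.0, 0.0, 1.0])       # a_n: target answer indicators
p_hat = np.array([0.9, 0.2, 0.7])   # predicted probabilities
loss = binary_cross_entropy(y, p_hat)
```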
Curriculum Learning
First proposed by Bengio et al. (2009).
Introduce samples of increasing "difficulty".
Reaches better local minima even under a non-convex loss.
Pre-training and Joint-training
Pre-training
Use a pre-trained model from a similar domain to initially guide the training process.
Joint-training
Exploit the similarity between two different domains by training the model on both domains simultaneously.
Proposed Method
Curriculum Inspired Training (CIT)
Difficulty Measurement
SF(q, S, C, s) = ( Σ_{word ∈ q ∪ S ∪ C} log(Freq(word)) ) / #{q ∪ S ∪ C}

Partition the dataset into a fixed number of chapters of increasing difficulty.
Chapter k consists of ∪_{i=1}^{k} partition[i].
The model is trained for a fixed number of epochs per chapter.
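The SF measure and chapter partitioning can be sketched as follows. The toy corpus frequencies and example tuples are illustrative assumptions; in the paper, Freq(word) would come from the training corpus:

```python
import math
from collections import Counter

# Toy word frequencies (illustrative; real Freq comes from the corpus).
freq = Counter("the cat sat on the mat the dog ran after the cat".split())

def sf(q, S, C):
    """Average log word frequency over q union S union C: higher SF means
    more frequent words, i.e. a presumed-easier example."""
    words = set(q) | set(S) | set(C)
    return sum(math.log(freq[w]) for w in words if freq[w] > 0) / len(words)

# Hypothetical (q, S, C) tuples for two examples.
examples = [
    (["where", "sat", "the", "cat"], ["mat", "dog"],
     ["the", "cat", "sat", "on", "the", "mat"]),
    (["who", "ran"], ["dog", "cat"], ["dog", "ran", "after"]),
]
# Order easiest-first and split into a fixed number of chapters.
num_chapters = 2
ordered = sorted(examples, key=lambda ex: -sf(*ex))
size = math.ceil(len(ordered) / num_chapters)
partition = [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

Training then proceeds chapter by chapter, with chapter k drawing from the union of partitions 1 through k.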
CIT Cont.
Loss Function
L(P, D, en) = −(1/N_D) Σ_{n=1}^{N_D} [ a_n · log(â_n(P, D)) + (1 − a_n) · log(1 − â_n(P, D)) ] · 1[en ≥ c(n) · epc]

en: current epoch
c(n): chapter number that example n is assigned to
epc: epochs per chapter
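A sketch of this curriculum-masked loss, assuming the indicator gates each example's entire loss term (the extraction of the slide makes the grouping ambiguous):

```python
import numpy as np

def cit_loss(targets, preds, chapters, en, epc, eps=1e-12):
    """Cross-entropy where example n contributes only once the current
    epoch en has reached its chapter: indicator 1[en >= c(n) * epc]."""
    preds = np.clip(preds, eps, 1 - eps)
    mask = (en >= chapters * epc).astype(float)
    per_ex = targets * np.log(preds) + (1 - targets) * np.log(1 - preds)
    return -np.mean(per_ex * mask)

y = np.array([1.0, 0.0, 1.0])
p_hat = np.array([0.9, 0.2, 0.7])
chapters = np.array([0, 1, 2])   # c(n): chapter index of each example
# With en=5 and epc=3, only examples in chapters 0 and 1 are active.
loss = cit_loss(y, p_hat, chapters, en=5, epc=3)
```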
Joint-Training
General Joint Loss Function
L(P, TD, SD) = 2λ · L(P, TD) + 2(1 − λ) · L(P, SD) · F(N_TD, N_SD)

TD: target dataset
SD: source dataset
N_D: number of examples in dataset D
λ: tunable weight parameter
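The general joint loss and its special cases can be written as one small function; the numeric loss values and dataset sizes below are placeholders:

```python
def joint_loss(loss_td, loss_sd, n_td, n_sd, lam=0.5, weighted=False):
    """General joint loss from the slide:
    2*lam * L(P, TD) + 2*(1 - lam) * L(P, SD) * F(N_TD, N_SD),
    with F = N_TD / N_SD for the weighted variant and F = 1 otherwise."""
    F = n_td / n_sd if weighted else 1.0
    return 2 * lam * loss_td + 2 * (1 - lam) * loss_sd * F

# With lam = 1/2 and F = 1 this reduces to plain joint-training L_TD + L_SD.
total = joint_loss(0.4, 0.6, n_td=1000, n_sd=10000)
```

The weighted variant down-weights the (typically larger) source dataset by the size ratio, so the target dataset is not drowned out.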
Loss Functions
Joint-training: λ = 1/2 and F(N_TD, N_SD) = 1
L(P, TD, SD) = L(P, TD) + L(P, SD)

Weighted joint-training: λ ∈ (0, 1) and F(N_TD, N_SD) = N_TD / N_SD
L(P, TD, SD) = 2λ · L(P, TD) + 2(1 − λ) · L(P, SD) · N_TD / N_SD
Loss Functions Cont.
Curriculum joint-training: λ = 1/2 and F(N_TD, N_SD) = 1
L(P, TD, SD) = L(P, TD, en) + L(P, SD, en)

Weighted curriculum joint-training: λ ∈ (0, 1) and F(N_TD, N_SD) = N_TD / N_SD
L(P, TD, SD) = 2λ · L(P, TD, en) + 2(1 − λ) · L(P, SD, en) · N_TD / N_SD

Source only: λ = 0 and c ∈ ℝ+
L(P, TD, SD) = c · L(P, SD)
Dataset and Experiment Results
Dataset
Figure: Dataset used for experiments.
Experiment Results
Figure: The table has two major row groups. The upper rows are models that used only the target dataset; the lower rows are models that used both the target and source datasets.
Experiment Results
Figure: Categorical performance measurement on CNN-11K. The table has two major row groups. The upper rows are models that used only the target dataset; the lower rows are models that used both the target and source datasets.
Experiment Results
Figure: Knowledge transfer performance result.
Experiment Results
Figure: Loss convergence comparison between models trained with and without CIT.
Summary
Memory networks (MemNNs) are widely used for QA.
Ordering the training samples by difficulty leads to better local minima.
Joint-training helps obtain better performance on small target datasets.
Using a pre-trained model improves performance.
The End