FaDA: Fast document aligner with word embedding Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones ADAPT Centre, School of Computing, Dublin City University


Post on 14-Jan-2017


Page 1: FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones

FaDA: Fast document aligner with word embedding

Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones

ADAPT Centre, School of Computing, Dublin City University


Contents

• Objective

• Introduction to FaDA

• Methodology used

• Word vector-based similarity

• Architecture of the whole system

• Experiments

• Results

• Conclusions and future work


Objective

• To align documents written in two different languages within a large collection of comparable documents.

• The alignment procedure should be fast, with less than quadratic time complexity.


Example of comparable documents

• The same news published in two languages


Introduction to FaDA

• FaDA (Fast Document Aligner) is a free/open-source tool for aligning bilingual documents.

• It is a fast alignment tool with linear time complexity.


Methodology used

• A cross-lingual information retrieval (CLIR)-based document-alignment system with word vector-based similarity measurements.


Why word vector-based similarity?

• A purely CLIR-based approach takes into account only text-based similarity, without addressing the underlying semantic match between the words.

• The word vector-based approach considers the semantic similarity between the words.


Word vectors

• Example:


Word vector-based similarity

• Query likelihood

where q1, q2 and q3 are the query terms and the dots are the words of a document in 2D space. The centroid of the document in Figure (a) is closer to the query terms than the centroid of the document in Figure (b).
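The centroid comparison described on this slide can be illustrated with a small sketch. It assumes that "closeness" is measured by cosine similarity between each query term vector and the document centroid, averaged over the query terms; the slide does not state the exact measure, and the function names and toy 2D vectors below are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid_similarity(query_vecs: Sequence[Vector],
                        doc_vecs: Sequence[Vector]) -> float:
    """Mean cosine similarity between each query vector and the document centroid.

    Assumption: this averaging scheme is illustrative, not taken from the slides.
    """
    c = centroid(doc_vecs)
    return sum(cosine(q, c) for q in query_vecs) / len(query_vecs)


# Toy 2D vectors (made up for illustration).
query = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]       # q1, q2, q3
doc_a = [(1.0, 0.8), (1.1, 1.2), (0.8, 1.0)]       # words cluster near the query terms
doc_b = [(-1.0, 0.5), (-0.8, 1.0), (-1.2, 0.2)]    # words lie far from the query terms

print(centroid_similarity(query, doc_a) > centroid_similarity(query, doc_b))
```

With these toy vectors, the document whose words cluster around the query terms receives the higher score, mirroring the contrast between Figures (a) and (b).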


Combination of word vector-based and text-based similarity

• α is the linear interpolation parameter that sets the relative contributions of the text-overlap similarity and the word vector-based similarity.
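The formula itself did not survive in the transcript. A linear interpolation of two scores usually has the shape sketched below; which of the two similarities α multiplies is an assumption here (α on the text-based score, 1 − α on the word vector-based score), and the function name is hypothetical.

```python
def combined_similarity(text_sim: float, wvec_sim: float, alpha: float) -> float:
    """Linearly interpolate a text-based and a word vector-based similarity.

    Assumption: alpha weights the text-based score and (1 - alpha) the
    word vector-based score; the slides do not state the direction.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * text_sim + (1.0 - alpha) * wvec_sim


# alpha = 1 uses only the text score, alpha = 0 only the word vector score.
print(combined_similarity(0.4, 0.8, 1.0))
print(combined_similarity(0.4, 0.8, 0.0))
```

Values of α between 0 and 1 trade off the two signals, so a single parameter can be tuned on held-out data.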


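System architecture of FaDA

• The bilingual documents are split into source documents and target documents.

• Indexing produces a source index and a target index.

• A pseudo query is built from each document.

• The pseudo query is translated by dictionary, giving the translated query terms.

• The translated query terms are compared against the index, and the top n documents are retrieved.

• Word-vector and text similarity are combined, and the document with the best score is selected.

• Output: the retrieved target document.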
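The stages of the architecture can be sketched end to end as a toy pipeline. Everything below is a simplified assumption rather than FaDA's implementation: the pseudo query is taken to be the most frequent terms, the dictionary is a plain mapping, the shortlist is built by scanning all target documents (a real system would query an inverted index instead), and `overlap_score` stands in for the combined similarity.

```python
from collections import Counter
from typing import Callable, Dict, List


def pseudo_query(doc_terms: List[str], size: int) -> List[str]:
    """Hypothetical term selection: the `size` most frequent terms."""
    return [term for term, _ in Counter(doc_terms).most_common(size)]


def translate(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each term by its dictionary translations; unknown terms are dropped."""
    translated: List[str] = []
    for term in terms:
        translated.extend(dictionary.get(term, []))
    return translated


def overlap_score(query: List[str], doc_terms: List[str]) -> float:
    """Fraction of query terms found in the document (stand-in scoring function)."""
    if not query:
        return 0.0
    doc_set = set(doc_terms)
    return sum(1 for term in query if term in doc_set) / len(query)


def align(source_doc: List[str],
          target_docs: Dict[str, List[str]],
          dictionary: Dict[str, List[str]],
          rescore: Callable[[List[str], List[str]], float] = overlap_score,
          n: int = 2,
          query_size: int = 3) -> str:
    """Return the id of the best-scoring target document for one source document."""
    query = translate(pseudo_query(source_doc, query_size), dictionary)
    # First pass: shortlist the top n candidates. This toy version scans every
    # target document; an inverted index would avoid that scan.
    shortlist = sorted(target_docs,
                       key=lambda doc_id: overlap_score(query, target_docs[doc_id]),
                       reverse=True)[:n]
    # Second pass: re-score only the shortlist and keep the best document.
    return max(shortlist, key=lambda doc_id: rescore(query, target_docs[doc_id]))


# Toy German-to-English example (made-up data).
dictionary = {"haus": ["house"], "garten": ["garden"], "hund": ["dog"]}
targets = {
    "t1": ["house", "garden", "dog"],
    "t2": ["car", "road", "dog"],
    "t3": ["sky", "sea"],
}
best = align(["haus", "haus", "garten", "hund"], targets, dictionary)
print(best)
```

The two-pass shape is the point of the sketch: a cheap retrieval step narrows the candidates to n documents, so the more expensive combined score is computed only n times per source document.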


Experiments

• Dataset


Baseline

• Based on the Jaccard similarity coefficient, which measures the term overlap between document pairs.

• Cosine similarity-based and named entity matching-based approaches did not work well and hence were not used as baselines.
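For reference, the Jaccard coefficient of two term sets is the size of their intersection divided by the size of their union. The sketch below also shows why an all-pairs baseline is quadratic: every source document is compared with every target document. It assumes the two documents' terms are already in a shared vocabulary; the slides do not say how the baseline bridges the two languages, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def all_pairs_alignment(sources: Dict[str, List[str]],
                        targets: Dict[str, List[str]]) -> Dict[str, str]:
    """Align each source to its best target by exhaustive comparison.

    This performs len(sources) * len(targets) comparisons, i.e. quadratic
    work when both collections grow together.
    """
    return {
        src_id: max(targets, key=lambda tgt_id: jaccard(src_terms, targets[tgt_id]))
        for src_id, src_terms in sources.items()
    }


# Made-up example with terms in a shared vocabulary.
sources = {"s1": ["a", "b", "c"], "s2": ["x", "y"]}
candidates = {"t1": ["b", "c", "d"], "t2": ["x", "y", "z"]}
print(all_pairs_alignment(sources, candidates))
```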


Tuning (Euronews data):

Optimal parameter settings:

i. λ = 0.9

ii. Number of translation terms M = 7

iii. Query-to-document ratio τ = 0.6
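How M and τ might enter the pipeline can be sketched as follows. Both readings are assumptions, since the slides only name the parameters: τ is taken to be the fraction of a document's distinct terms kept in the pseudo query, and M the maximum number of dictionary translations kept per query term. The function names and the scored translation pairs are hypothetical, and λ is left out because the slides do not say what it controls.

```python
from typing import List, Tuple

TAU = 0.6   # query-to-document ratio reported on the slide
M = 7       # number of translation terms reported on the slide


def pseudo_query_length(num_distinct_terms: int, tau: float = TAU) -> int:
    """Number of terms in the pseudo query, assuming it is a tau-fraction of the document."""
    return max(1, round(tau * num_distinct_terms))


def top_translations(scored: List[Tuple[str, float]], m: int = M) -> List[str]:
    """Keep the m highest-scoring translations of one query term (assumed reading of M)."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:m]]


print(pseudo_query_length(10))
print(top_translations([("house", 0.7), ("home", 0.2), ("building", 0.1)], m=2))
```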


Result on WMT test data:


Conclusions and future work

FaDA uses a CLIR-based approach, which is much faster than the baseline (which has quadratic time complexity).

The performance is further enhanced by the word vector embedding-based approach.

In future, we would like to apply our approach to other language pairs.


Thank you


Questions?

and/or

Suggestions!