fada: fast document aligner with word embedding - pintu lohar, debasis ganguly, haithem afli, andy...

FaDA: Fast document aligner with word embedding

Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones

ADAPT Centre, School of Computing, Dublin City University

Contents

• Objective

• Introduction to FaDA

• Methodology used

• Word vector-based similarity

• Architecture of the whole system

• Experiments

• Results

• Conclusions and future work

Objective

• To align documents in two different languages within a large collection of comparable documents.

• The alignment procedure should be fast, with less than quadratic time complexity.

Example of comparable documents

• The same news published in two languages

Introduction to FaDA

• FaDA (Fast Document Aligner) is a free/open-source tool for aligning bilingual documents.

• It is a fast alignment tool with linear time complexity.

Methodology used

• Cross-lingual information retrieval (CLIR)-based document-alignment system with word vector-based similarity measurements.

Why word vector-based similarity?

• The CLIR-based approach takes into account only text-based similarity, without addressing the underlying semantic match between words.

• The word vector-based approach considers the semantic similarity between words.

Word vectors

• Example:

Word vector-based similarity

• Query likelihood

Here q1, q2 and q3 are the query terms, and the dots are the words of a document in 2D space. The centroid of the document in Figure (a) is closer to the query terms than the centroid of the document in Figure (b).
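The centroid comparison described above can be sketched in a few lines. This is only an illustration, not the FaDA implementation: it assumes cosine similarity as the measure of closeness, averages it over the query terms, and uses made-up 2D vectors. The names `centroid`, `cosine` and `wvec_similarity` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def wvec_similarity(query_vectors, doc_vectors):
    """Mean cosine similarity between each query term and the document centroid.

    Assumption: cosine is used as the closeness measure; the slides only
    say that the centroid is "closer" to the query terms.
    """
    c = centroid(doc_vectors)
    return sum(cosine(q, c) for q in query_vectors) / len(query_vectors)


# Toy 2D vectors (invented for illustration only).
query = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]      # q1, q2, q3
doc_a = [[1.0, 0.8], [0.8, 1.2], [1.1, 1.0]]      # words near the query terms
doc_b = [[-1.0, 0.5], [-0.8, 1.0], [-1.2, 0.2]]   # words far from the query terms

score_a = wvec_similarity(query, doc_a)
score_b = wvec_similarity(query, doc_b)
```

With these toy vectors, `score_a` is larger than `score_b`, which mirrors the situation in Figures (a) and (b).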

Combination of word vector-based and text-based similarity

• α is the linear interpolation parameter denoting the relative contributions of the text-overlap and word vector-based similarities.
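A linear interpolation of two scores can be written as a one-line function. The sketch below assumes that α weights the text-based score and (1 − α) weights the word vector-based score; the slide does not state which side α is attached to, so that orientation is an assumption. The function name `combined_similarity` is hypothetical.

```python
def combined_similarity(text_sim, wvec_sim, alpha):
    """Linearly interpolate a text-based and a word vector-based score.

    Assumption: alpha weights the text-based score and (1 - alpha) the
    word vector-based score; the opposite convention is equally possible.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * text_sim + (1.0 - alpha) * wvec_sim
```

Setting `alpha` to 1.0 falls back to text similarity alone, and setting it to 0.0 uses only the word vector-based similarity.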

System architecture of FaDA

• [Figure: system architecture diagram. Stages shown, in order:] Bilingual documents → source documents and target documents → indexing (source index, target index) → pseudo query of each document → translate by dictionary → translated query terms → compare with the top n documents → combine word-vector and text similarity → select the document with the best score → retrieved target document.
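The retrieval part of this pipeline can be illustrated with a toy sketch. It is not the FaDA code: the pseudo query is built here by keeping the most frequent terms, and candidates are ranked by the number of matching translated terms, both of which are stand-ins chosen for brevity. The parameter names `tau`, `m` and `n` loosely follow the slides (query-to-document ratio, number of translation terms, top n documents); all function names, the dictionary and the documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def pseudo_query(doc_terms, tau):
    """Keep the most frequent distinct terms, about tau of them (at least one).

    Assumption: term frequency is used to pick query terms.
    """
    counts = Counter(doc_terms)
    k = max(1, int(tau * len(counts)))
    return [term for term, _ in counts.most_common(k)]


def translate(query_terms, dictionary, m):
    """Replace each term by up to m dictionary translations; skip unknown terms."""
    translated = []
    for term in query_terms:
        translated.extend(dictionary.get(term, [])[:m])
    return translated


def retrieve_top_n(translated_terms, target_index, n):
    """Rank target documents by matching query terms (ties broken by id).

    Only documents sharing at least one term with the query are touched,
    so the source document is never compared with every target document.
    """
    scores = Counter()
    for term in translated_terms:
        for doc_id in target_index.get(term, ()):
            scores[doc_id] += 1
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:n]]


# Toy data, invented for illustration.
source_doc = ["chat", "noir", "chat", "maison"]
dictionary = {"chat": ["cat"], "noir": ["black"], "maison": ["house", "home"]}
target_docs = {
    "t1": ["the", "black", "cat", "house"],
    "t2": ["stock", "market", "news"],
    "t3": ["a", "cat", "video"],
}

target_index = build_index(target_docs)
query = translate(pseudo_query(source_doc, tau=1.0), dictionary, m=2)
candidates = retrieve_top_n(query, target_index, n=2)
```

In the full pipeline, the candidates returned here would then be re-scored by combining word-vector and text similarity, and the best-scoring one selected as the aligned target document.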

Experiments

• Dataset

Baseline

• Based on the Jaccard similarity coefficient, which measures the term overlap between document pairs.

• Cosine similarity-based and named entity matching-based approaches did not work well and were therefore not used as baselines.
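As a rough illustration of such a baseline, the Jaccard coefficient of two term sets is the size of their intersection divided by the size of their union. The sketch below pairs each source document with the target document of highest Jaccard score by comparing every pair, which is where the quadratic cost comes from. It assumes both sides are already expressed in a shared vocabulary, and the function names and toy documents are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def align_by_jaccard(source_docs, target_docs):
    """Pair each source document with its best-scoring target document.

    Every source document is compared with every target document, so the
    number of comparisons is len(source_docs) * len(target_docs).
    """
    alignment = {}
    for s_id, s_terms in source_docs.items():
        alignment[s_id] = max(
            target_docs,
            key=lambda t_id: jaccard(s_terms, target_docs[t_id]),
        )
    return alignment


# Toy documents, invented for illustration.
source_docs = {
    "s1": ["election", "vote", "result"],
    "s2": ["football", "match", "goal"],
}
target_docs = {
    "t1": ["football", "goal", "league"],
    "t2": ["election", "result", "party"],
}

alignment = align_by_jaccard(source_docs, target_docs)
```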

Tuning (Euronews data):

Optimal parameter settings:

i. λ = 0.9

ii. Number of translation terms M = 7

iii. Query-to-document ratio τ = 0.6

Result on WMT test data:

Conclusions and future work

FaDA uses a CLIR-based approach, which is much faster than the baseline (which has quadratic time complexity).

The performance is further enhanced by the word vector embedding-based approach.

In future, we would like to apply our approach to other language pairs.

Thank you

Questions and/or suggestions?
