
FaDA: Fast document aligner with word embedding

Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones

ADAPT Centre, School of Computing, Dublin City University

Contents

• Objective

• Introduction to FaDA

• Methodology used

• Word vector-based similarity

• Architecture of the whole system

• Experiments

• Results

• Conclusions and future work

Objective

• To align the documents in two different languages within a large collection of comparable documents.

• The alignment procedure should be fast, with less than quadratic time complexity.

Example of comparable documents

• The same news published in two languages

Introduction to FaDA

• FaDA (Fast Document Aligner) is a free/open-source tool for aligning bilingual documents.

• It is a fast alignment tool with linear time complexity.

Methodology used

• Cross-lingual information retrieval (CLIR)-based document-alignment system with word vector-based similarity measurements.

Why word vector-based similarity?

• The CLIR-based approach takes into account only text-based similarity, without addressing the underlying semantic match between the words.

• The word vector-based approach considers the semantic similarity between the words.

Word vectors

• Example:
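As an illustration of the idea on this slide, here is a toy Python sketch with made-up 2-dimensional vectors (not FaDA's actual embeddings), showing that semantically related words get a higher cosine similarity than unrelated ones:

import math

# Toy 2-dimensional word vectors (illustrative values only).
word_vectors = {
    "car":        (0.90, 0.10),
    "automobile": (0.85, 0.20),
    "banana":     (0.10, 0.95),
}

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(word_vectors["car"], word_vectors["automobile"]))  # high (close to 1)
print(cosine(word_vectors["car"], word_vectors["banana"]))      # much lower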

Word vector-based similarity

• Query likelihood

where q1, q2, q3 → query terms and the dots → words of a document in 2D space. The centroid of the document in Figure (a) is closer to the query terms than that of the document in Figure (b).
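The query-likelihood formula itself is not shown on this slide. A minimal sketch, assuming the standard Jelinek–Mercer smoothed language model commonly used in LM-based CLIR (the exact formulation used by FaDA may differ):

P(d \mid q) \;\propto\; \prod_{i=1}^{|q|} \Bigl(\lambda\, P(q_i \mid d) \;+\; (1-\lambda)\, P(q_i \mid C)\Bigr)

where P(q_i | d) is the probability of the translated query term q_i in document d, P(q_i | C) its probability in the whole collection, and λ a smoothing weight. One plausible reading of the figure (an assumption, not stated on the slide) is a centroid-based word-vector score:

\mathrm{sim}_{wv}(q, d) \;=\; \frac{1}{|q|} \sum_{i=1}^{|q|} \cos\bigl(v(q_i), \mu_d\bigr), \qquad \mu_d \;=\; \frac{1}{|d|} \sum_{w \in d} v(w)

where v(·) maps a word to its embedding and μ_d is the centroid of the document's word vectors, matching the figure's depiction of the query terms lying closer to one document's centroid than to the other's.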

Combination of word vector-based and text-based similarity

• α is the linear interpolation parameter denoting the relative contributions from the text overlap and word vector-based similarities.
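The combination formula is not shown in this transcript; a plausible reconstruction consistent with the description of α (an assumption, not a verbatim copy of the slide) is:

\mathrm{sim}(d_s, d_t) \;=\; \alpha \,\mathrm{sim}_{\text{text}}(d_s, d_t) \;+\; (1-\alpha)\,\mathrm{sim}_{wv}(d_s, d_t)

where sim_text is the text overlap-based (CLIR) score, sim_wv is the word vector-based score, and α ∈ [0, 1] controls their relative contributions.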

System architecture of FaDA

• The bilingual documents are separated into source documents and target documents.

• Indexing: a source index and a target index are built.

• A pseudo-query is extracted from each document and translated using a bilingual dictionary.

• The translated query terms are compared against the target index to retrieve the top-n documents.

• The word-vector and text similarities of each retrieved candidate are combined, and the one with the best score is selected as the retrieved target document.
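The flow above can be made concrete with a small self-contained Python sketch. Everything in it (the toy documents, the toy dictionary, the made-up 2-dimensional word vectors, and the parameter alpha) is an illustrative assumption, not FaDA's actual data or implementation; a real system would also use an inverted index to retrieve only the top-n candidates instead of scoring every target document.

import math
from collections import Counter

# --- Toy data (illustrative only; not FaDA's data or code) ---------------
source_docs = {"s1": "la voiture rouge roule vite".split(),
               "s2": "la banane jaune est mûre".split()}
target_docs = {"t1": "the red car drives fast".split(),
               "t2": "the yellow banana is ripe".split()}
dictionary = {"la": "the", "voiture": "car", "rouge": "red", "roule": "drives",
              "vite": "fast", "banane": "banana", "jaune": "yellow",
              "est": "is", "mûre": "ripe"}
# Toy 2-d word vectors for the target language (made-up values).
vectors = {"the": (0.5, 0.5), "red": (0.8, 0.3), "car": (0.9, 0.1),
           "drives": (0.7, 0.2), "fast": (0.75, 0.25), "yellow": (0.2, 0.8),
           "banana": (0.1, 0.9), "is": (0.5, 0.5), "ripe": (0.15, 0.85)}

def centroid(words):
    # Mean of the word vectors of the words that have an embedding.
    vs = [vectors[w] for w in words if w in vectors]
    return tuple(sum(c) / len(vs) for c in zip(*vs))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def text_sim(query_terms, doc_terms):
    # Simple term-overlap score standing in for the CLIR retrieval score.
    q, d = Counter(query_terms), Counter(doc_terms)
    return sum((q & d).values()) / max(len(query_terms), 1)

def align(alpha=0.5):
    alignments = {}
    for sid, src in source_docs.items():
        # Pseudo-query: the source terms, translated with the dictionary.
        query = [dictionary[w] for w in src if w in dictionary]
        # Score each target document by interpolating text and word-vector similarity.
        best = max(target_docs,
                   key=lambda tid: alpha * text_sim(query, target_docs[tid])
                       + (1 - alpha) * cosine(centroid(query),
                                              centroid(target_docs[tid])))
        alignments[sid] = best
    return alignments

print(align())   # expected: {'s1': 't1', 's2': 't2'}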

Experiments

• Dataset: Euronews data (used for tuning) and the WMT test data (used for evaluation).

Baseline

• Based on the "Jaccard similarity coefficient", which measures the term overlap between document pairs (see the sketch below).

• "Cosine similarity-based" and "Named Entity matching-based" approaches did not work well and hence were not used as baselines.
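A minimal Python sketch of the Jaccard baseline (illustrative code, not the authors' implementation):

def jaccard(doc_a_terms, doc_b_terms):
    # Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| over the term sets.
    a, b = set(doc_a_terms), set(doc_b_terms)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Example: term overlap between two documents.
print(jaccard("the red car drives fast".split(),
              "the red car is fast".split()))   # 4 / 6 ≈ 0.67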

Tuning (Euronews data):

Optimal parameter settings:

i. λ = 0.9

ii. Number of translation terms M = 7

iii. Query-to-document ratio τ = 0.6
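Collected as a configuration sketch (the parameter names are illustrative assumptions, not FaDA's actual option names; the values are from the slide):

# Illustrative configuration; names are assumptions, values from the tuning slide.
FADA_CONFIG = {
    "lambda": 0.9,  # λ (assumed retrieval smoothing/interpolation weight)
    "M": 7,         # number of translation terms
    "tau": 0.6,     # query-to-document ratio
}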

Result on WMT test data:

Conclusions and future work

FaDA uses a CLIR-based approach, which is much faster than the baseline (which has quadratic time complexity).

The performance is further enhanced by the word vector embedding-based approach.

In future work, we would like to apply our approach to other language pairs.

Thank you

Questions?

and/or

Suggestions!
