Recovering Diacritics using Wikipedia and Google

Adrian Iftene 1, Diana Trandabăţ 1,2; {adiftene, dtrandabat}@info.uaic.ro; 1 Faculty of Computer Science, “Al. I. Cuza” University of Iasi; 2 Romanian Academy, Iasi Branch; 2 July, KEPT 2009, Cluj-Napoca


Page 1: Recovering Diacritics using Wikipedia and Google

Adrian Iftene1, Diana Trandabăţ1,2

{adiftene, dtrandabat}@info.uaic.ro

1 Faculty of Computer Science, “Al. I. Cuza” University of Iasi

2 Romanian Academy, Iasi Branch

2 July, KEPT 2009, Cluj-Napoca

Page 2: Recovering Diacritics using Wikipedia and Google

Motivation

The system

Steps performed

Results

Conclusions

Page 3: Recovering Diacritics using Wikipedia and Google

Ro-Wikipedia was used in CLEF 2007
◦ 1.43 GB
◦ 121,832 files

Iftene, Trandabăţ, KEPT 2009

Page 4: Recovering Diacritics using Wikipedia and Google


Page 5: Recovering Diacritics using Wikipedia and Google

Step 1 - Initial text is split into sentences and then sentences are further split into words
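Step 1 can be illustrated with a minimal sketch. The slides do not say which splitter the system uses, so the regular expressions and the function names `split_sentences` and `split_words` below are assumptions made for illustration only.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Abbreviations are not handled.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def split_words(sentence):
    # \w is Unicode-aware in Python 3, so letters with diacritics
    # stay inside the word they belong to.
    return re.findall(r"\w+", sentence)

text = "Scoala incepe sambata. Fata merge la scoala."
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]
print(sentences)
print(words)
```

A real system would likely need a more careful sentence splitter, since this one breaks on abbreviations such as "dr." or "nr.".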

Step 2 - For every word without diacritics, we search DBFP for its possible forms
◦ If the current word contains none of the letters “a, i, s, t”, we search for the word itself in DBFP or in Ro-Wikipedia
◦ If the current word contains one or more of the letters “a, i, s, t”, we search DBFP or Ro-Wikipedia using a pattern obtained from the initial word, in which every letter that can carry a diacritic (a, i, s, t) is replaced with its alternatives (“a” is replaced by (ă|â|a), “i” by (î|i), “s” by (ş|s), “t” by (t|ţ))
◦ For example, for word = “fata” the pattern is “f(ă|â|a)(t|ţ)(ă|â|a)”
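The pattern construction in Step 2 can be sketched as follows. The replacement table comes straight from the slide; the function names and the small `lexicon` list, which stands in for DBFP or Ro-Wikipedia, are hypothetical. The sketch handles lowercase input only.

```python
import re

# Alternatives for each letter that may carry a diacritic (from the slide).
ALTERNATIVES = {"a": "(ă|â|a)", "i": "(î|i)", "s": "(ş|s)", "t": "(t|ţ)"}

def build_pattern(word):
    # Letters outside the table are kept literally (escaped for safety).
    return "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word)

def candidate_forms(word, lexicon):
    # Keep only lexicon entries that match the whole pattern.
    regex = re.compile(build_pattern(word))
    return [form for form in lexicon if regex.fullmatch(form)]

# Hypothetical stand-in for the forms stored in DBFP or Ro-Wikipedia.
lexicon = ["fata", "fată", "faţa", "faţă", "fapta"]

print(build_pattern("fata"))             # f(ă|â|a)(t|ţ)(ă|â|a)
print(candidate_forms("fata", lexicon))  # ['fata', 'fată', 'faţa', 'faţă']
```

A word with several matching forms, such as “fata” here, is exactly the kind of word that the later steps have to disambiguate from context.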


Page 6: Recovering Diacritics using Wikipedia and Google

Step 3 - We build a query in order to search for web pages that contain similar sentences (at this step we only handle sentences that contain words with multiple forms in DBFP)


Page 7: Recovering Diacritics using Wikipedia and Google

Step 4 - We download from the web the first 10 relevant pages returned by Google

Step 5 - From the downloaded sites we select only pages with text and ignore files containing images, fonts, or configuration settings. In the selection process we identify the “correct” files, those with diacritics, and concatenate them into one file
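A minimal sketch of the Step 5 selection is given below. The slides do not define how a file is judged “correct”, so this sketch assumes two simple tests: the file name has a text-like extension, and the content contains at least one Romanian diacritic. The extension list, the function names, and the sample `files` mapping are all hypothetical.

```python
# Romanian diacritics in both the cedilla and the comma-below encodings.
ROMANIAN_DIACRITICS = set("ăâîşţșțĂÂÎŞŢȘȚ")

# Assumed set of extensions that count as "pages with text".
TEXT_EXTENSIONS = (".html", ".htm", ".txt")

def has_diacritics(text):
    return any(ch in ROMANIAN_DIACRITICS for ch in text)

def build_context(files):
    """files maps a file name to its already decoded text content."""
    kept = [
        text
        for name, text in files.items()
        if name.lower().endswith(TEXT_EXTENSIONS) and has_diacritics(text)
    ]
    # Concatenate the selected files into one context string.
    return "\n".join(kept)

files = {
    "index.html": "Şcoala începe sâmbătă.",
    "style.css": "body { font-family: serif; }",
    "plain.txt": "Scoala incepe sambata.",
}
context = build_context(files)
print(context)
```

Here only `index.html` survives: `style.css` is not a text page, and `plain.txt` has no diacritics and so cannot help with restoration.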


Page 8: Recovering Diacritics using Wikipedia and Google

Step 6 - Using the file built at Step 5, we identify the most appropriate form for words with multiple forms. We build the same kind of patterns as in Step 2 and identify, for every word, the possible forms and their relative positions in the concatenated file
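Locating the possible forms of a word in the concatenated file could look like the sketch below. It rebuilds the Step 2 pattern and scans the context with it; the word-boundary anchors, the case-insensitive matching, and the names `form_positions` and `context` are assumptions, not details taken from the slides.

```python
import re

ALTERNATIVES = {"a": "(ă|â|a)", "i": "(î|i)", "s": "(ş|s)", "t": "(t|ţ)"}

def build_pattern(word):
    return "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word)

def form_positions(word, context):
    # Match whole words only, ignoring case, and record where each
    # matching form starts in the concatenated context.
    regex = re.compile(r"\b" + build_pattern(word.lower()) + r"\b", re.IGNORECASE)
    return [(m.group(0), m.start()) for m in regex.finditer(context)]

# Hypothetical concatenated file from Step 5.
context = "Şcoala începe sâmbătă. Altă şcoală începe luni."

for form, pos in form_positions("Scoala", context):
    print(form, pos)
```

Each word of the input sentence gets its own list of (form, position) pairs; these lists are the layers used when searching for a full path.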


Page 9: Recovering Diacritics using Wikipedia and Google

Let the sentence S consist of the words $w_1, w_2, \ldots, w_n$

We denote by $f_i$ the current form of word $w_i$ and by $p_{i1}, p_{i2}, \ldots, p_{it_i}$ the positions in its associated layer

With these notations, a full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) can be written as

$FP = (p_{1i_1}, p_{2i_2}, \ldots, p_{ni_n})$


Page 10: Recovering Diacritics using Wikipedia and Google

From now on, our goal is to find a full path across the current layers that has minimal length

For that we build
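The construction used on this slide is not preserved in the transcript. As an illustration only, the sketch below assumes that the length of a full path is the sum of the absolute distances between positions chosen in consecutive layers, and finds the shortest one with a simple dynamic program. This length definition, the function name, and the sample positions are all assumptions. Every layer is expected to contain at least one candidate.

```python
def shortest_full_path(layers):
    """layers[k] is a list of (form, position) candidates for word k.

    Returns (length, path), where path holds one candidate per layer.
    """
    # best[j] = (cost, path) of the cheapest path ending at candidate j
    # of the layer processed so far.
    best = [(0, [cand]) for cand in layers[0]]
    for layer in layers[1:]:
        new_best = []
        for form, pos in layer:
            # Extend the cheapest predecessor path to this candidate.
            cost, path = min(
                ((c + abs(pos - p[-1][1]), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost, path + [(form, pos)]))
        best = new_best
    return min(best, key=lambda item: item[0])

# Hypothetical layers for "Scoala incepe sambata".
layers = [
    [("Şcoala", 0), ("şcoală", 120)],
    [("începe", 7), ("începe", 200)],
    [("sâmbăta", 310), ("sâmbătă", 14)],
]
length, path = shortest_full_path(layers)
print(length, [form for form, _ in path])
```

With these made-up positions the three forms that occur next to each other in the context win, which matches the intuition that a short path corresponds to the words appearing together in a real sentence.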


Page 11: Recovering Diacritics using Wikipedia and Google

An example is presented below for the sentence “Scoala incepe sambata”, which has two possible solutions:

Şcoala începe sâmbătă. (School starts this Saturday.)

Şcoala începe sâmbăta. (School usually starts on Saturdays.)


Page 12: Recovering Diacritics using Wikipedia and Google

Step 7 - Context improvement:
◦ The backward rule
◦ The forward rule
◦ The maximization rule


Page 13: Recovering Diacritics using Wikipedia and Google

In order to evaluate the system's performance, we used a large file containing the Calimera Guidelines (14,148 sentences).
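The slides do not state how accuracy was measured. One common setup, assumed here purely for illustration, is to strip the diacritics from a correctly written text, run the restoration on the stripped version, and count how many words come back identical to the original. The function names and the sample strings below are hypothetical.

```python
# Map Romanian diacritics (cedilla and comma-below variants) to base letters.
STRIP_TABLE = str.maketrans("ăâîşţșțĂÂÎŞŢȘȚ", "aaiststAAISTST")

def strip_diacritics(text):
    return text.translate(STRIP_TABLE)

def word_accuracy(restored, gold):
    # Fraction of word positions where the restored word equals the gold word.
    restored_words, gold_words = restored.split(), gold.split()
    if len(restored_words) != len(gold_words):
        raise ValueError("restored and gold texts must have the same number of words")
    correct = sum(r == g for r, g in zip(restored_words, gold_words))
    return correct / len(gold_words)

gold = "Şcoala începe sâmbătă"
system_input = strip_diacritics(gold)   # what the system would receive
restored = "Şcoala începe sâmbăta"      # hypothetical system output
print(system_input)
print(word_accuracy(restored, gold))
```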


Page 14: Recovering Diacritics using Wikipedia and Google

The paper presents a method to restore diacritics using contexts found on the web

The system's accuracy is similar to that of existing systems, but its main advantage comes from the fact that it uses resources and tools that are available for free.

We also tested our algorithm on other languages, such as French and German, and the results are very promising.
