Post on 17-Dec-2015


1

Web-based Acquisition of Japanese Katakana Variants

Hiroshi Nakagawa (University of Tokyo, Japan)

Takeshi Masuyama (University of Tokyo, Japan)†

†(Now Yahoo! Japan)

2

• We are very sorry for the Katakana font printing problem in the proceedings. We could not check the final printing.

• Please read the English transliterations of the Katakana parts that appear like %-%c…..

3

Cooperation with

Satoshi Sekine, Computer Science, New York University

and Language Craft Co.

4

mew,mew

ニャア,ニャア( nyaa,nyaa )

ニャアー、ニャアー (nyah,nyah)

The mapping from sound to spelling differs from language to language

Katakana word variants

5

History of Katakana

Every country and every language has its own history of meanings, codes and fonts.

Phonogram vs. Ideogram

6

Kanji (Hanji) characters (= ideograms) were imported to Japan 1300 years ago

漢字( Hanji)

7

Almost 1000 years ago, women writers worked out phonograms (Hiragana and Katakana) from Kanji (ideograms) to express the Japanese people's mentality.

世 (Kanji): ideogram

せ (Hiragana), セ (Katakana): phonograms

紫式部 (Murasaki Shikibu)

8

Modern history of Katakana

Japanese Katakana and Hiragana have a one-to-one mapping.

After the Meiji revolution (1868), Japanese people used Katakana to express functional words and Hiragana to express words imported from western countries.

After World War II (1945), we exchanged them:
Hiragana came to be used to express functional words such as case markers.
Katakana came to be used to express words imported from western countries.

Thus the majority of Katakana words are transliterations of English words.

9

However, Japanese Katakana has only five vowels (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b, m, y, r, w, c, sh, ch, ny, my).

Pronunciations are always C+V or V. No C+C.

There is no distinction between (b, v), (h, f), (l, r), ...

There is no orthographic way to express English sounds with the Katakana character set.

Thus the Japanese language accepted several Katakana spellings for one English word: Katakana variants

11

An example of search results: hits for "spaghetti" with Google

The + and - search options are used to avoid overlap between distinct Katakana variants.

Katakana variant                    Hits of Google search (%)
スパゲッティ (supagettuthi)          187,000 (32.7%)
スパゲッティー (supagettuthii)       57,600 (10.1%)
スパゲッテイ (supagettutei)          6,850 (1.2%)
スパゲティ (supagetuthi)             240,000 (41.9%)
スパゲティー (supagethii)            77,400 (13.5%)
スパゲテイ (supagetei)               3,800 (0.7%)
total                               572,650 (100%)
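The percentages in the table can be recomputed from the raw hit counts. The following is a minimal sketch; the dictionary and the function name `shares` are illustrative only and are not part of the original slides.

```python
# Hit counts copied from the table above.
hits = {
    "スパゲッティ": 187_000,
    "スパゲッティー": 57_600,
    "スパゲッテイ": 6_850,
    "スパゲティ": 240_000,
    "スパゲティー": 77_400,
    "スパゲテイ": 3_800,
}

def shares(hit_counts):
    """Return the total number of hits and each variant's share in percent."""
    total = sum(hit_counts.values())
    return total, {w: round(100 * n / total, 1) for w, n in hit_counts.items()}

total, percent = shares(hits)
for word, share in percent.items():
    print(f"{word}\t{hits[word]:,}\t{share}%")
print(f"total\t{total:,}")
```

No single spelling reaches a majority, which is why a retrieval system that indexes only one spelling misses a large part of the matching pages.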

12

A Katakana variants extraction system is needed to enhance the cross-language ability of

Information Retrieval
Search engine
Machine translation
Information Extraction
Summarization
Question Answering

13

Previous research 1:

Manually constructed rewriting rules to generate and/or extract Katakana variants from a given Katakana word (Shishibori et al., 1993, 1994; Kubota, 1994)

Samples of rewrite rules:
ベ (Be) ⇔ ヴェ (Ve)
チ (chi) ⇔ ツィ (thi)

Input: ベネチア (Benechia)
Output: ベネツィア (Benethia), ヴェネチア (Venechia), ヴェネツィア (Venethia)
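The rule-based generation above can be sketched in a few lines. This is only an illustration of the idea, not the cited authors' implementation: the two rules are the samples from the slide, and the function name `generate_variants` and the exhaustive closure strategy are assumptions.

```python
# The two sample rewrite rules from the slide, applied in both directions.
RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]

def generate_variants(word, rules=RULES):
    """Return every spelling reachable from `word` by repeatedly applying the rules."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for left, right in rules:
            for src, dst in ((left, right), (right, left)):
                start = current.find(src)
                while start != -1:
                    candidate = current[:start] + dst + current[start + len(src):]
                    if candidate not in seen:
                        seen.add(candidate)
                        frontier.append(candidate)
                    start = current.find(src, start + 1)
    seen.discard(word)  # the input itself is not reported as a variant
    return sorted(seen)

print(generate_variants("ベネチア"))
```

For the input ベネチア this yields the three outputs listed on the slide. The coverage of such a generator is limited by the hand-written rule list.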

14

Previous research 2:

Extract Katakana variants with weighted edit distance (Magari et al., 2004), (Ohtake et al., 2004)

Edit distance is defined as the number of operations needed to transform one Katakana word into another Katakana word.
Operations: insert, delete, replace
Ex.: レポート (Repooto), リポート (Ripooto) → edit dist. = 1

Weighted edit distance: the weight of each operation is given manually.
Ex.: weight of edit dist. (レポート, リポート): 0.8
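A minimal sketch of edit distance with per-operation weights follows. It assumes that only replacements carry special weights and that insertions and deletions cost 1; the cited work may weight all three operations. The weight table holds just the single example value from the slide, and the function name is hypothetical.

```python
def weighted_edit_distance(a, b, replace_weights=None, default=1.0):
    """Dynamic-programming edit distance with manually given replacement weights."""
    replace_weights = replace_weights or {}
    previous = [j * default for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * default]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                cost = 0.0
            else:
                # Look the pair up in either order, fall back to the default weight.
                cost = replace_weights.get(
                    (char_a, char_b),
                    replace_weights.get((char_b, char_a), default),
                )
            current.append(min(
                previous[j] + default,      # delete char_a
                current[j - 1] + default,   # insert char_b
                previous[j - 1] + cost,     # replace char_a with char_b (or keep)
            ))
        previous = current
    return previous[-1]

print(weighted_edit_distance("レポート", "リポート"))
print(weighted_edit_distance("レポート", "リポート", {("レ", "リ"): 0.8}))
```

With an empty weight table the function reduces to the plain edit distance (1 for this pair); with the manually given weight it returns 0.8.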

15

Previous research 3: a more direct way

String penalty to extract Katakana variants (Masuyama et al., 2004)

String penalty: SP
Based on weighted edit distance, but extended to treat strings of two or three characters.
Weights are given manually to combinations of edit operations = string replacing operations.
Ex. SP(ボイス, ヴォイス) = 4 … replace and insert
(ボイス: Boisu, ヴォイス: Voisu)

16

Previous research 4: Combination method

(Masuyama, Nakagawa, Sekine, COLING 2004)

Combination of string penalty and context

String penalty: SP
The SP value is given by an expert.

Similarity of the contexts in which each Katakana variant appears
Vector space model (automatically calculated)
If the words around each Katakana word are similar, then the Katakana words are variants of each other.

17

Problems of previous research

Low coverage

Need for intellectual and intensive human work for
working out rewrite rules
determining the weights of weighted edit distance
determining the string penalty values of each Katakana string pair

Dependence on the specific corpus used to calculate the weights of weighted edit distance and the string penalties

18

Purpose of this work

The problems of manually given string penalties:
Labor intensive (even in the combination of SP and context)
Low coverage

Determine string penalties mechanically

and

automatically build Katakana variants for each Katakana word

19

Calculating string penalties Mechanically

For this, we need accurate and high quality Katakana variants database!

20

Process

English words and their Katakana variants: (idea, アイデア), (report, レポート)

→ WWW pages: … レポート … report … リポート …

→ Pairs of variant candidates: (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto'), …

→ Pairs of variants: (レポート 'repooto', リポート 'ripooto'), (レファレンス 'refarensu', リファレンス 'rifarensu'), (アーキテクト 'aakitekuto', アーキテクツ 'aakitekutu')

→ String penalty: レ 're' ⇔ リ 'ri' : 1, ト 'to' ⇔ ッ 'ttu' : 3

21

[Process diagram repeated from the previous slide, with one added label on the step from the English word list to the WWW: "Web search by"]

22

How to find candidates of Katakana variant pairs (1/3)

1. To collect English words and their Katakana variants, e.g. (vodka, ウォッカ),

we used four Web sites from which we collected a number of English words and their Japanese translations:
http://homepage2.nifty.com/katakanaEnglish/
http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
http://smalltown.ne.jp/~uasa/pub/distfiles/skk-extra-200307/SKK-JISYO.edit

Result: 14,958 distinct pairs of English words and their Katakana translations.

23

How to find candidates of Katakana variant pairs (2/3)

1. Extract many pairs of an English word and its Katakana variant

14,958 English-Katakana pairs

2. To collect more Katakana variants for each English word, we use Google search to get pages that include the English word and the Katakana word that translates it. Queries:

"English word" + (language = Japanese)
"English word" + 「英和」 ("English to Japanese"), in order to search English-Japanese dictionary sites

3. Gather Katakana words from the search results
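Step 3 can be approximated with a regular expression over the Katakana Unicode block. This is a sketch under assumptions: the slides do not say how Katakana words were tokenized, and the choice to keep the middle dot ・ inside a word and the name `gather_katakana_words` are mine.

```python
import re

# Katakana letters (U+30A1..U+30FA) and the long-vowel mark ー (U+30FC),
# optionally joined by the middle dot ・ (U+30FB) as in ラスト・シーン.
KATAKANA_WORD = re.compile(
    r"[\u30A1-\u30FA\u30FC]+(?:\u30FB[\u30A1-\u30FA\u30FC]+)*"
)

def gather_katakana_words(text):
    """Return all maximal Katakana strings found in a page of text."""
    return KATAKANA_WORD.findall(text)

page = "彼はウォッカ(vodka)とスパゲティを注文した。ラスト・シーンが良い。"
print(gather_katakana_words(page))
```

Hiragana, Kanji and Latin characters act as word boundaries, so no morphological analyzer is needed for this step.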

24

Google search with the English word "vodka" among pages written in Japanese

vodka

25

Add the query term 「英和」 (English-Japanese) and run the Google search

英和 'e-j' vodka

26

[Process diagram repeated. Added labels: the pages are collected by Web search with the query "英和 (e-j) report", and pairs of variant candidates are those with edit dist. = 1]

27

How to find candidates of Katakana variant pairs (3/3)

4. Extract promising candidates: Katakana word pairs whose edit distance = 1 are taken as Katakana variant candidates

Ex. (vodka, ウォッカ)
(ウォッカ 'Uottuka', ウォトカ 'Uotoka')
(ウォッカ 'Uottuka', ウオッカ 'UOttuka')
(ウォッカ 'Uottuka', ヴォッカ 'Vuottuka')
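The edit-distance-1 filter of step 4 does not need a full dynamic-programming table. The sketch below checks directly whether two words differ by a single insertion, deletion or replacement; the helper names are hypothetical and the word list is the vodka example from the slide.

```python
from itertools import combinations

def one_edit_apart(a, b):
    """True if a and b differ by exactly one insert, delete or replace."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equally long) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replacement at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i

def candidate_pairs(words):
    """All unordered pairs of gathered Katakana words with edit distance 1."""
    return [(a, b) for a, b in combinations(words, 2) if one_edit_apart(a, b)]

gathered = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ"]
print(candidate_pairs(gathered))
```

Each of the three spellings is one edit away from ウォッカ, but they are two edits away from each other, so only the three pairs shown on the slide are produced.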

28

[Process diagram repeated. Added labels: pages are collected by Web search with the query "英和 (e-j) report"; pairs of variant candidates are those with edit dist. = 1; pairs of variant candidates are then selected by context, with cosine sim > 0.00006]

29

How to extract documents in which context similarity is calculated

Run a Google search with the Katakana word that is a Katakana variant candidate as the query.

Extract the context of the Katakana variant from the search result pages.

30

Search "vodka" with Google

The query +ウォッカ ('+vodka') retrieves all pages including ウォッカ

[Screenshot of the search results, with ウオッカ's contexts marked]

31

1. Calculate the context similarity of a candidate Katakana variant pair

Context 1: drink vodka (Vuottka) with a main dish and plate of caviar in the restaurants
Context 2: eat some main dish plate after vodka (Uotoka) in that restaurants
→ cosine similarity of the two contexts

The 50 words around a Katakana variant candidate are used as its context

2. Identify and extract Katakana variants if the cosine similarity is greater than the threshold of 0.00006.
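A minimal sketch of this comparison, using the two English example contexts from the slide as stand-ins for real Japanese page text. The word weight log(freq(t)+1) is the one given in the detail slide of this talk; whitespace tokenization and the function names are assumptions.

```python
import math
from collections import Counter

def context_vector(words):
    """Map each context word t to the weight log(freq(t) + 1)."""
    return {t: math.log(freq + 1) for t, freq in Counter(words).items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[t] for t, weight in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Context words of the two candidate spellings (the candidates themselves removed).
context_1 = "drink with a main dish and plate of caviar in the restaurants".split()
context_2 = "eat some main dish plate after in that restaurants".split()

similarity = cosine(context_vector(context_1), context_vector(context_2))
THRESHOLD = 0.00006  # threshold stated on the slide
print(similarity, similarity > THRESHOLD)
```

The toy contexts share several words, so they clear the threshold easily; on real Web pages the contexts are much larger and sparser.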

32

Detail of context similarity calculation

context = 50 words around the Katakana word

Weight of word t in the context: log(freq(t)+1)

Context similarity = cosine

Selection from candidates by cosine similarity ≥ 0.00006 (threshold)

The threshold optimization: the threshold is the argmax of the F-value on positive pairs (347 pairs) and negative pairs (111 pairs)
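The threshold selection can be sketched as a simple grid search that maximizes the F-value on labelled pairs. The five scored pairs and the candidate thresholds below are made-up toy data, not the 347 positive and 111 negative pairs of the study, and the function names are hypothetical.

```python
def f_value(tp, fp, fn):
    """Harmonic mean of precision and recall (0 if there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored_pairs, thresholds):
    """scored_pairs: (cosine similarity, is_true_variant). Returns (threshold, F)."""
    best = None
    for threshold in thresholds:
        tp = sum(1 for s, label in scored_pairs if s >= threshold and label)
        fp = sum(1 for s, label in scored_pairs if s >= threshold and not label)
        fn = sum(1 for s, label in scored_pairs if s < threshold and label)
        score = f_value(tp, fp, fn)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best

# Toy data for illustration only.
toy_pairs = [(0.9, True), (0.8, True), (0.4, False), (0.3, True), (0.1, False)]
print(best_threshold(toy_pairs, [0.05, 0.2, 0.5]))
```

On the toy data the middle threshold wins: it rejects one negative pair without losing any positive pair.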

33

Results of context similarity vs cosine threshold

[Plot: F-value (%) against the threshold of cosine similarity; F-value axis from 80 to 90]

34

[Process diagram repeated: Web search by "英和 (e-j) report" → pairs of variant candidates (edit dist. = 1) → pairs of variants (cosine sim > 0.00006) → string penalties, e.g. レ 're' ⇔ リ 'ri' : SP=1, ト 'to' ⇔ ッ 'ttu' : SP=3]

The next step is to calculate SP based on statistics

35

2nd stage: Calculation of string penalty: SP

String penalty of operation x → y (x is replaced with y)

We focus on the high correlation between replaced strings and their character context, which is composed of several characters around the target string.

Example: (ウインブルドン, ウィンブルドン), (ウインドウズ, ウィンドウズ), (ウインク, ウィンク)

Replacing イ 'I' with ィ 'i' co-occurs with ウ 'U' and ン 'n'

36

Character level contexts CLC1..CLC5 used to calculate SP

x: target character; α, β, γ, δ: characters around x

CLC     String    Context around x
CLC1    αβx       preceding two characters of x
CLC2    βx        preceding one character of x
CLC3    xγ        succeeding one character of x
CLC4    xγδ       succeeding two characters of x
CLC5    βxγ       preceding and succeeding characters of x
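The five contexts can be read off a word once the position of the target character x is known. The sketch below represents each context as a (left, right) pair of strings around x; padding word boundaries with "#" is an assumption made here so that every context is defined, and the function name is hypothetical.

```python
def character_level_contexts(word, index, pad="#"):
    """Return CLC1..CLC5 for the target character word[index] as (left, right) pairs."""
    padded = pad * 2 + word + pad * 2
    pos = index + 2                      # position of x inside the padded word
    alpha_beta = padded[pos - 2:pos]     # two characters before x
    beta = padded[pos - 1:pos]           # one character before x
    gamma = padded[pos + 1:pos + 2]      # one character after x
    gamma_delta = padded[pos + 1:pos + 3]  # two characters after x
    return {
        "CLC1": (alpha_beta, ""),
        "CLC2": (beta, ""),
        "CLC3": ("", gamma),
        "CLC4": ("", gamma_delta),
        "CLC5": (beta, gamma),
    }

# Target character イ (index 1) in ウインク, the word from the ウインク/ウィンク pair.
for name, context in character_level_contexts("ウインク", 1).items():
    print(name, context)
```

For ウインク the CLC5 context is (ウ, ン), the same context that surrounds イ in ウインブルドン and ウインドウズ.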

37

Calculation of string penalty: SP

$$P(x \to y \mid CLC_i) = \frac{f(CLC_i,\ x \to y) + 1}{f(CLC_i) + 2}, \qquad i = 1, 2, 3, 4, 5$$

f(CLC_i) = freq. of pairs in which CLC_i occurs

f(CLC_i, x → y) = freq. of pairs in which both CLC_i and x → y occur

38

Calculation of string penalty: SP

Identify the character context CLC_i which most probably co-occurs with operation x → y:

$$\widehat{CLC} = \operatorname*{arg\,max}_{CLC_i\ (i = 1, \dots, 5)} P(x \to y \mid CLC_i)$$

Then

$$SP(x \to y) = \frac{1}{P(x \to y \mid \widehat{CLC})}$$

Zipf's law: Rank of occurrence ≈ C * (Prob. of occurrence)^{-1}
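The calculation on the last two slides can be sketched as follows: a smoothed conditional probability (f(CLC_i, x → y) + 1) / (f(CLC_i) + 2) per context, then the reciprocal of the largest one. The counts are invented for illustration. The slides report integer SP values, but the scaling or rounding that produces them is not stated, so this sketch returns the raw reciprocal. Function names are hypothetical.

```python
def conditional_probability(f_context_and_operation, f_context):
    """P(x -> y | CLC_i) with the +1 / +2 smoothing."""
    return (f_context_and_operation + 1) / (f_context + 2)

def string_penalty(counts):
    """counts: {CLC name: (f(CLC_i, x -> y), f(CLC_i))}. Returns (best CLC, SP)."""
    best_name, best_p = max(
        ((name, conditional_probability(joint, total))
         for name, (joint, total) in counts.items()),
        key=lambda item: item[1],
    )
    return best_name, 1.0 / best_p

# Invented counts for the operation イ -> ィ (illustration only).
toy_counts = {
    "CLC2": (3, 6),    # context "ウ x"
    "CLC3": (3, 10),   # context "x ン"
    "CLC5": (3, 3),    # context "ウ x ン"
}
print(string_penalty(toy_counts))
```

In the toy counts the two-sided context CLC5 always co-occurs with the replacement, so it gives the highest probability (0.8) and therefore the lowest penalty (1.25).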

39

Examples of string penalties

Operation                                 SP    Example
Insertion and deletion of '・'             1     ラストシーン, ラスト・シーン
Insertion and deletion of macron 'ー'      1     エネルギー, エネルギ
Replace オ 'O' and ォ 'o'                  1     ウオッカ, ウォッカ
Replace グ 'gu' and ク 'ku'                2     バック, バッグ
Replace ヴ 'vu' and ブ 'bu'                2     ジュネーヴ, ジュネーブ
Replace ヴ 'vu' and ウ 'U'                 3     ヴォッカ, ウォッカ

40

Comparison of SP by hand and SP by the proposed method

SP by hand: proposed by Masuyama et al. (2004); an expert worked out the SPs by hand

Gold standard Katakana variants:
682 pairs of Katakana variant candidates extracted from a newspaper corpus, whose string penalties are between 1 and 12
We found no correct variants whose SPs are between 10 and 12. Thus, the above gold standard probably covers all correct variants.

41

Comparison of SPs

SP    SP by hand          SP by proposed mechanical method
1     216/221 (97.7%)     262/286 (91.6%)
2     162/207 (78.3%)     133/148 (89.9%)
3     70/99 (70.7%)       51/90 (56.7%)
4     2/14 (14.3%)        2/26 (7.7%)
5     0/29 (0.0%)         0/16 (0.0%)
6     0/13 (0.0%)         2/34 (5.9%)
7     1/20 (5.0%)         1/39 (2.6%)
8     0/13 (0.0%)         1/15 (6.7%)
9     1/12 (8.3%)         0/8 (0.0%)
10    0/16 (0.0%)         0/5 (0.0%)
11    0/17 (0.0%)         0/12 (0.0%)
12    0/21 (0.0%)         0/3 (0.0%)

42

Comparison of SPs: correlation

Rows: SP by hand. Columns: SP by proposed mechanical method.

          1    2    3    4    5    6    7    8    9   10   11   12   Total
1       207    7    3    2    0    1    1    0    0    0    0    0     221
2        20  123   59    2    1    1    1    0    0    0    0    0     207
3        59   11   20    3    2    3    1    0    0    0    0    0      99
4         0    2    3    2    2    0    4    0    0    1    0    0      14
5         0    2    2    6    3    4    5    3    1    0    2    1      29
6         0    0    0    3    1    1    2    0    3    1    2    0      13
7         0    0    1    3    2    2    2    4    1    1    3    1      20
8         0    1    0    0    0    4    6    1    0    0    1    0      13
9         0    1    0    0    0    0    2    3    1    1    4    0      12
10        0    1    1    5    0    0    4    2    1    1    0    1      16
11        0    0    1    0    2   13    0    0    1    0    0    0      17
12        0    0    0    0    3    5   11    2    0    0    0    0      21
Total   286  148   90   26   16   34   39   15    8    5   12    3     682

correlation: 0.76
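A correlation coefficient can be computed directly from such a contingency table. The slide does not say which coefficient was used; the sketch below assumes the Pearson correlation over the 682 (SP by hand, SP by mechanical method) pairs, so its output need not reproduce the reported 0.76 exactly. The function name is hypothetical.

```python
import math

# Cell counts from the table: row = SP by hand (1..12), column = mechanical SP (1..12).
TABLE = [
    [207, 7, 3, 2, 0, 1, 1, 0, 0, 0, 0, 0],
    [20, 123, 59, 2, 1, 1, 1, 0, 0, 0, 0, 0],
    [59, 11, 20, 3, 2, 3, 1, 0, 0, 0, 0, 0],
    [0, 2, 3, 2, 2, 0, 4, 0, 0, 1, 0, 0],
    [0, 2, 2, 6, 3, 4, 5, 3, 1, 0, 2, 1],
    [0, 0, 0, 3, 1, 1, 2, 0, 3, 1, 2, 0],
    [0, 0, 1, 3, 2, 2, 2, 4, 1, 1, 3, 1],
    [0, 1, 0, 0, 0, 4, 6, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 2, 3, 1, 1, 4, 0],
    [0, 1, 1, 5, 0, 0, 4, 2, 1, 1, 0, 1],
    [0, 0, 1, 0, 2, 13, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 3, 5, 11, 2, 0, 0, 0, 0],
]

def pearson_from_table(table):
    """Pearson correlation of (row value, column value), weighted by cell counts."""
    cells = [(i + 1, j + 1, count)
             for i, row in enumerate(table)
             for j, count in enumerate(row)]
    n = sum(count for _, _, count in cells)
    mean_x = sum(x * c for x, _, c in cells) / n
    mean_y = sum(y * c for _, y, c in cells) / n
    cov = sum(c * (x - mean_x) * (y - mean_y) for x, y, c in cells)
    var_x = sum(c * (x - mean_x) ** 2 for x, _, c in cells)
    var_y = sum(c * (y - mean_y) ** 2 for _, y, c in cells)
    return cov / math.sqrt(var_x * var_y)

print(sum(map(sum, TABLE)), round(pearson_from_table(TABLE), 2))
```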

43

Building a Katakana variants DB automatically

44

Summary of comparison and next?

SP by hand (COLING 2004) + context similarity → extracted variants: accurate

SP by mechanical method (SIGIR 2005) + context similarity → extracted variants: accurate?

Correlation between SP by hand and SP by mechanical method: 0.76

45

Newspaper corpus: … レポート … ラポート … リポート … サポート …

→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート)

→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …

→ Katakana variants DB: (レポート, リポート), …

46

[Diagram repeated from the previous slide. Added label on the step from the newspaper corpus to the first candidate list: Extract Katakana words]

47

[Diagram repeated. Added label on the step between the two candidate lists: SP ≤ 3]

48

[Diagram repeated, now complete: newspaper corpus → Extract Katakana words → candidates of Katakana variants → SP ≤ 3 → candidates of Katakana variants → Context similarity ≥ 0.005 (optimized threshold) → Katakana variants DB]
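The two filters of the complete diagram can be sketched as one pass over scored candidate pairs. The thresholds are the ones shown on the slide; the SP and similarity values attached to the three example pairs are invented so that the outcome matches the diagram, and the function name is hypothetical.

```python
SP_MAX = 3                # string penalty threshold from the slide
SIMILARITY_MIN = 0.005    # context similarity threshold from the slide

# (word_a, word_b, string penalty, context similarity): illustrative values only.
candidates = [
    ("レポート", "ラポート", 2, 0.001),
    ("レポート", "リポート", 1, 0.12),
    ("レポート", "サポート", 5, 0.02),
]

def build_variants_db(pairs, sp_max=SP_MAX, similarity_min=SIMILARITY_MIN):
    """Keep pairs passing both filters and index them symmetrically by word."""
    db = {}
    for word_a, word_b, sp, similarity in pairs:
        if sp <= sp_max and similarity >= similarity_min:
            db.setdefault(word_a, set()).add(word_b)
            db.setdefault(word_b, set()).add(word_a)
    return db

print(build_variants_db(candidates))
```

As in the diagram, サポート is dropped by the SP filter, ラポート by the context filter, and only (レポート, リポート) reaches the DB.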

49

Comparison of variants DB

SP ≤ 3, context similarity ≥ 0.05

             SP by hand of an expert    SP by the proposed mechanical method
recall       417/420 (99.3%)            415/420 (98.8%)
precision    417/480 (86.9%)            415/480 (86.5%)
F-value      92.7%                      92.2%

cf. The whole DB contains 3 million Katakana variants for 1 million distinct Katakana words.
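The F-values in the table follow from the recall and precision fractions. A small sketch that recomputes them (the function name `evaluate` is illustrative):

```python
def evaluate(correct, gold_total, extracted_total):
    """Return recall, precision and F-value in percent, rounded to one decimal."""
    recall = correct / gold_total
    precision = correct / extracted_total
    f_value = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 1) for value in (recall, precision, f_value))

print("SP by hand:      ", evaluate(417, 420, 480))
print("SP by mechanical:", evaluate(415, 420, 480))
```

Both rows of the table are reproduced: (99.3, 86.9, 92.7) for SP by hand and (98.8, 86.5, 92.2) for the mechanical method.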

50

Conclusions

Mechanical method of calculating SP
Using a Web search engine to extract variant candidates
SP by character context
Almost the same accuracy as SP by hand of an expert

Katakana variants DB with SP by the mechanical method
recall: 98.8%, precision: 86.5%, F-value: 92.2%

51

Future of our research

Other languages, like German: Arbeit -- アルバイト

Application of our methodology (Web resource + statistical string penalty) to other language pairs:
Londres -- London
München -- Munich

Our hope is a cross-language automatic spelling variants generator for any language pair, based on the proposed method.

52

Thank you!

サンキュー( sankyuh)

サンキュウ (sankyuu)

Questions or comments are welcome.

53

Error analysis

grizzly bear: グリーズリーベア (gurihzurihbea) vs グリーズリー・ベア (gurihzurih・bea) are not regarded as variants.
Contexts: animal vs Norman Shwarzkov: totally different contexts!

sign pole / sign ball: サインポール (sainpohru) vs. サインボール (sainbohru) are regarded as variants.
Contexts: barber shop vs baseball, but both with customer, shop, sales (very similar contexts)

54

The threshold of SP vs. F-value

[Plot: F-value (%), 0 to 100, against the threshold of SP, 1 to 12]

55

cosine similarity vs. F-value

[Plot: F-value (%), 25 to 85, against a threshold from 0 to 0.4; the x-axis is labelled "Threshold of SP" on the slide]

56

If you search for some Katakana variant with Google, …

In the case of spaghetti

Katakana variant                      Found or not
スパゲッティ (spaghetti)               ○
スパゲッティー (supagettuthii)         ○
スパゲッテイ (supagettutei)            ×
スパゲティ (supagettuthi)              ○
スパゲティー (supagethii)              ○
スパゲテイ (supagetei)                 ×

57

How to find candidates of Katakana

How to extract documents in which context similarity is calculated

Google search with a query of Katakana word which is a candidate of Katakana variant.

Extract context of the Katakana variant from search result pages and calculate context similarity to identify Katakana variants.

58

Example of similarity calculation

(ウォッカ 'Uottuka', ウォトカ 'Uotoka')

ウォッカ: liquor: 1.1, strong: 1.4, alcohol: 1.6, western liquor: 0.7, ...
ウォトカ: liquor: 0.7, strong: 0.7, alcohol: 3.4, western liquor: 1.4, ...

$$\cos(\text{Uottuka}, \text{Uotoka}) = \frac{1.1 \cdot 0.7 + 1.4 \cdot 0.7 + \cdots}{\sqrt{1.1^2 + 1.4^2 + 1.6^2 + \cdots}\ \sqrt{0.7^2 + 0.7^2 + 3.4^2 + \cdots}} = 0.00157$$
