1
Web-based Acquisition of Japanese Katakana Variants
Hiroshi Nakagawa (University of Tokyo, Japan)
Takeshi Masuyama (University of Tokyo , Japan)†
†(Now Yahoo! Japan)
2
• We are very sorry for the Katakana font printing problem in the proceedings. We could not check the final printing.
• Please read the English transliterations of the Katakana parts, which were printed like %-%c…..
3
Cooperation with
Satoshi Sekine
Computer Science, New York University
and
Language Craft Co.
4
mew,mew
ニャア,ニャア( nyaa,nyaa )
ニャアー、ニャアー (nyah,nyah)
The mapping from sound to spelling differs from language to language
Katakana word variants
5
History of Katakana
Every Country and every language has its own history of meanings, codes and fonts.
Phonogram vs. Ideogram
6
Kanji (Hanji) characters (= ideograms) were imported to Japan 1300 years ago
漢字 (Hanji)
7
Almost 1000 years ago, women writers worked out phonograms (Hiragana and Katakana) from Kanji (ideograms) to express the Japanese people's mentality.
世 (Kanji): ideogram
せ (Hiragana), セ (Katakana): phonograms
紫式部
8
Modern history of Katakana
Japanese Katakana and Hiragana have a one-to-one mapping.
After the Meiji revolution (1868), Japanese people used Katakana to express function words and Hiragana to express words imported from western countries.
After World War II (1945), the two roles were exchanged:
Hiragana came to be used to express function words such as case markers.
Katakana came to be used to express words imported from western countries.
Thus the majority of Katakana words are transliterations of English words.
9
However, Japanese Katakana has only five vowels (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b, m, y, r, w, c, sh, ch, ny, my).
Pronunciations are always C+V or V. No C+C.
No distinction between (b, v), (h, f), (l, r), ...
There is no orthographic way to express English sounds with the Katakana character set.
Thus the Japanese language accepted several Katakana spellings for one English word: Katakana variants.
10
Cameron Diaz, transliterated into Katakana variants:
キャメロン・ディアス (kyameronn・dhiasu)
キャメロン・ディアズ (kyameronn・dhiazu)
キャメロンディアス (kyameronndhiasu)
detail, transliterated into Katakana variants:
ディテール (dhiteeru)
ディティール (dhithiiru)
ディテェール (dhitheeru)
ディテイル (dhiteiru)
11
An example of search results: hits for “spaghetti” with Google
The + and - options were used to avoid overlap between the counts of distinct Katakana variants.
Katakana variant | Hits of Google search | (%)
スパゲッティ (supagettuthi) | 187,000 | 32.7%
スパゲッティー (supagettuthii) | 57,600 | 10.1%
スパゲッテイ (supagettutei) | 6,850 | 1.2%
スパゲティ (supagetuthi) | 240,000 | 41.9%
スパゲティー (supagethii) | 77,400 | 13.5%
スパゲテイ (supagetei) | 3,800 | 0.7%
Total | 572,650 | 100%
12
A Katakana variants extraction system is needed to enhance the cross-language ability of:
Information retrieval
Search engines
Machine translation
Information extraction
Summarization
Question answering
13
Previous research 1:
Manually constructed rewriting rules to generate and/or extract Katakana variants from a given Katakana word (Shishibori et al., 1993, 1994; Kubota, 1994)
Samples of rewrite rules:
ベ (Be) ⇔ ヴェ (Ve)
チ (chi) ⇔ ツィ (thi)
Input: ベネチア (Benechia)
Output: ベネツィア (Benethia), ヴェネチア (Venechia), ヴェネツィア (Venethia)
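The rewrite-rule idea above can be sketched in a few lines of Python. This is only an illustration, assuming the two sample rules on the slide and treating each rule as bidirectional; the function name `generate_variants` and the exhaustive search strategy are assumptions, not the cited authors' implementation.

```python
# Sample rewrite rules from the slide; each rule is applied in both directions.
RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]


def generate_variants(word, rules=RULES):
    """Return every spelling reachable from `word` by applying the rules."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for left, right in rules:
            for src, dst in ((left, right), (right, left)):
                start = current.find(src)
                while start != -1:
                    # Rewrite one occurrence of src at position `start`.
                    candidate = current[:start] + dst + current[start + len(src):]
                    if candidate not in seen:
                        seen.add(candidate)
                        frontier.append(candidate)
                    start = current.find(src, start + 1)
    seen.discard(word)  # report only the variants, not the input itself
    return seen


if __name__ == "__main__":
    for variant in sorted(generate_variants("ベネチア")):
        print(variant)
```

For the slide's input ベネチア, this yields the three output spellings listed on the slide.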
14
Previous research 2:
Extract Katakana variants with weighted edit distance (Magari et al., 2004), (Ohtake et al., 2004)
Edit distance is defined as the number of operations needed to transform one Katakana word into another Katakana word
Operations: insert, delete, replace
Ex.: レポート (Repooto) → リポート (Ripooto): edit dist. = 1
Weighted edit distance: the weight of each operation is given manually
Ex.: weighted edit dist. (レポート, リポート) = 0.8
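As a concrete illustration of the two distances on this slide, here is a minimal Python sketch of the standard dynamic-programming edit distance with an optional table of replacement weights. The parameter name `replace_cost` and the weight table are assumptions for illustration; the value 0.8 for レ ⇔ リ is taken from the slide's example, and insert and delete are kept at cost 1.

```python
def edit_distance(a, b, replace_cost=None):
    """Edit distance between a and b with insert/delete cost 1.

    replace_cost optionally maps a character pair to a replacement weight;
    pairs not in the table cost 1.
    """
    def cost(x, y):
        if x == y:
            return 0.0
        if replace_cost is None:
            return 1.0
        return replace_cost.get((x, y), replace_cost.get((y, x), 1.0))

    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1.0,               # delete ca
                current[j - 1] + 1.0,            # insert cb
                previous[j - 1] + cost(ca, cb),  # replace ca with cb (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("レポート", "リポート"))
    # Hypothetical manually assigned weight, as in the slide's example.
    print(edit_distance("レポート", "リポート", {("レ", "リ"): 0.8}))
```

Note that サポート (support) is also at distance 1 from レポート, which is why edit distance alone cannot separate true variants from unrelated words.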
15
Previous research 3: a more direct way
String penalty to extract Katakana variants (Masuyama et al., 2004)
String penalty: SP
Based on weighted edit distance, but extended to treat strings of two or three characters
Manually given weights for combinations of edit operations = string replacing operations
Ex.: SP(ボイス, ヴォイス) = 4 … replace and insert
(Boisu, Voisu)
16
Previous research 4: Combination method
(Masuyama, Nakagawa, Sekine 2004 COLING)
Combination of string penalty and context
String penalty (SP): the SP value is given by an expert
Similarity of the contexts in which each Katakana variant appears: vector space model (automatically calculated)
If the words around each Katakana word are similar, then the Katakana words are variants of each other
17
Problems of previous research
Low coverage
Need for intensive human intellectual work for:
Working out rewrite rules
Determining the weights of weighted edit distance
Determining the values of string penalty for each Katakana string pair
Dependence on the specific corpus used to calculate the weights of weighted edit distance and string penalty
18
Purpose of this work
The problems of manually given string penalty:
Labor intensive (even in the combination of SP and context)
Low coverage
Determine string penalty mechanically
and
automatically build Katakana variants for each Katakana word
19
Calculating string penalties Mechanically
For this, we need accurate and high quality Katakana variants database!
20
Process (diagram)
WWW → English words and their Katakana variants: (idea, アイデア), (report, レポート), …
Web pages: … レポート … report … リポート …
→ Pairs of variant candidates: (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto'), …
→ Pairs of variants: (レポート 'repooto', リポート 'ripooto'), (レファレンス 'refarensu', リファレンス 'rifarensu'), (アーキテクト 'aakitekuto', アーキテクツ 'aakitekutu'), …
→ String penalty: レ 're' ⇔ リ 'ri' : 1, ト 'to' ⇔ ッ 'ttu' : 3, …
21
(Same process diagram as slide 20, with the first step labeled: Web search)
22
How to find candidates of Katakana variant pairs (1/3)
1. To collect English words and their Katakana variants, e.g. (vodka, ウォッカ),
we used four Web sites from which we collected a number of English words and their Japanese translations:
http://homepage2.nifty.com/katakanaEnglish/
http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
http://smalltown.ne.jp/~uasa/pub/distfiles/skk-extra-200307/SKK-JISYO.edit
Result: 14,958 distinct pairs of English words and their Katakana translations.
23
How to find candidates of Katakana variant pairs (2/3)
1. Extract many English words and their Katakana variants
14,958 English-Katakana pairs
2. To collect more Katakana variants for each English word, we use Google search to get pages that include the English word and the Katakana word of its translation
“English word + (language = Japanese)”
“English word + 「英和」 (“English to Japanese”)”, in order to search English-Japanese dictionary sites
3. Gather Katakana words from the search results
24
Google search with the English word “vodka” among pages written in Japanese
vodka
25
Add the query term 「英和」 (English-Japanese) and search with Google
英和 'e-j' vodka
26
(Same process diagram as slide 20, with added labels: Web search with the query “英和 (e-j) report”; pairs of variant candidates selected by Edit dist. = 1)
27
How to find candidates of Katakana variant pairs (3/3)
4. Extract promising candidates of Katakana word pairs whose edit distance = 1 as Katakana variants
Ex. (vodka, ウォッカ)
(ウォッカ 'Uottuka', ウォトカ 'Uotoka')
(ウォッカ 'Uottuka', ウオッカ 'UOttuka')
(ウォッカ 'Uottuka', ヴォッカ 'Vuottuka')
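Step 4 can be sketched as a simple filter over the Katakana words gathered for one English word. The sketch below is an assumption about how such a filter might look, not the authors' code: `within_one_edit` checks for exactly one insert, delete, or replace without building a full distance table, and `candidate_pairs` keeps the pairs that pass.

```python
from itertools import combinations


def within_one_edit(a, b):
    """True if a and b differ by exactly one insert, delete, or replace."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replaced character
    return a[i:] == b[i + 1:]          # one inserted character in b


def candidate_pairs(words):
    """All unordered pairs of distinct words at edit distance 1."""
    return [(a, b) for a, b in combinations(words, 2) if within_one_edit(a, b)]


if __name__ == "__main__":
    gathered = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ"]
    for pair in candidate_pairs(gathered):
        print(pair)
```

With the four spellings from the slide, only the three pairs that include ウォッカ pass the filter; the other spellings are two edits apart from each other.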
28
(Same process diagram as slide 20, with added labels: Web search by “英和 (e-j) report”; pairs of variant candidates selected by Edit dist. = 1; pairs of variant candidates by context selected by cosine sim > 0.00006)
29
How to extract documents in which context similarity is calculated
Google search with a query of Katakana word which is a candidate of Katakana variant.
Extract context of the Katakana variant from search result pages.
30
Search “Vodka” with Google
The query + ウォッカ ('+vodka') retrieves all pages including ウォッカ
(Search result pages, with ウオッカ's contexts marked)
31
1. Calculate the context similarity of a candidate Katakana variant pair
Context 1: drink vodka (Vuottka) with a main dish and plate of caviar in the restaurants
Context 2: eat some main dish plate after vodka (Uotoka) in that restaurants
Cosine similarity is calculated between the two contexts.
The 50 words around a candidate Katakana variant are used as its context.
2. Identify and extract Katakana variants if the cosine similarity is greater than the threshold of 0.00006.
32
Detail of context similarity calculation
Context = 50 words around the Katakana word
Weight of word t in a context = log(freq(t) + 1)
Context similarity = cosine
Selection from candidates by cosine similarity ≥ 0.00006 (threshold)
Threshold optimization: the threshold that maximizes the F-value (argmax) on positive pairs (347 pairs) and negative pairs (111 pairs)
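The context similarity described on this slide can be sketched as follows. This is a minimal illustration, assuming contexts are already tokenized into word lists and the target Katakana word has been removed; the function names are hypothetical, and the weight log(freq(t) + 1) and the threshold 0.00006 are taken from the slide.

```python
import math
from collections import Counter

THRESHOLD = 0.00006  # cosine threshold reported on the slide


def context_vector(words):
    """Weight each context word t by log(freq(t) + 1)."""
    return {t: math.log(f + 1) for t, f in Counter(words).items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy contexts adapted from the slide's example sentences.
    ctx_a = "drink with a main dish and plate of caviar in the restaurants".split()
    ctx_b = "eat some main dish plate after in that restaurants".split()
    sim = cosine_similarity(context_vector(ctx_a), context_vector(ctx_b))
    print(sim, sim > THRESHOLD)
```

In the described pipeline the contexts would be the 50 words around each candidate on the retrieved Web pages, not single sentences as in this toy run.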
33
Results of context similarity vs cosine threshold
[Figure: F-value (%) against the threshold of cosine similarity; y-axis from 80 to 90]
34
(Same process diagram as slide 20, with labels: Web search by “英和 (e-j) report”; Edit dist. = 1; cosine sim > 0.00006; string penalties レ 're' ⇔ リ 'ri' : SP = 1, ト 'to' ⇔ ッ 'ttu' : SP = 3)
Next to do is to calculate SP based on statistics
35
2nd stage: Calculation of string penalty: SP
String penalty of operation x → y (x is replaced with y)
We focus on the high correlation between replaced strings and their character context, which is composed of several characters around the target string.
Example: (ウインブルドン, ウィンブルドン), (ウインドウズ, ウィンドウズ), (ウインク, ウィンク)
Replacing イ 'I' with ィ 'i' co-occurs with a preceding ウ 'U' and a following ン 'n'
36
Character level context: CLC1..CLC5 used to calculate SP
x: target character; α, β, γ, δ: characters around x
CLC | String context around x | Description
CLC1 | αβx | preceding two characters of x
CLC2 | βx | preceding one character of x
CLC3 | xγ | succeeding one character of x
CLC4 | xγδ | succeeding two characters of x
CLC5 | βxγ | preceding and succeeding characters of x
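The five character-level contexts in the table above can be read off a word mechanically. The sketch below is an illustration under assumptions that are not in the slides: the target position is written as a literal `x`, and positions beyond the word boundary are filled with a padding character `#`.

```python
PAD = "#"  # assumed boundary marker; the slides do not say how word edges are handled


def character_level_contexts(word, pos, pad=PAD):
    """Return CLC1..CLC5 for the character at index `pos` of `word`."""
    padded = pad * 2 + word + pad * 2
    p = pos + 2  # index of the target character inside the padded string
    alpha, beta = padded[p - 2], padded[p - 1]
    gamma, delta = padded[p + 1], padded[p + 2]
    return {
        "CLC1": alpha + beta + "x",   # preceding two characters
        "CLC2": beta + "x",           # preceding one character
        "CLC3": "x" + gamma,          # succeeding one character
        "CLC4": "x" + gamma + delta,  # succeeding two characters
        "CLC5": beta + "x" + gamma,   # preceding and succeeding characters
    }


if __name__ == "__main__":
    # Target: イ in ウインク (index 1), as in the pair (ウインク, ウィンク).
    for name, context in character_level_contexts("ウインク", 1).items():
        print(name, context)
```

For the slide's example pairs, CLC5 for the replaced イ is ウxン in all three words, which is the regularity the method exploits.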
37
Calculation of string penalty: SP
$$P(x \to y \mid CLC_i) = \frac{f(CLC_i,\ x \to y) + 1}{f(CLC_i) + 2}, \qquad i = 1, 2, 3, 4, 5$$
f(CLC_i) = frequency of pairs in which CLC_i occurs
f(CLC_i, x → y) = frequency of pairs in which both CLC_i and x → y occur
38
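Calculation of string penalty: SP
Identify the character context CLC_i which most probably co-occurs with the operation x → y:
$$\widehat{CLC} = \operatorname*{argmax}_{CLC_i\ (i = 1, \dots, 5)} P(x \to y \mid CLC_i)$$
Then
$$SP(x \to y) = \frac{1}{P(x \to y \mid \widehat{CLC})}$$
Zipf's law: Rank of occurrence ≈ C × (Prob. of occurrence)^(-1)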
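The two steps above (smoothed probability per context, then the inverse of the best one) can be sketched as follows. The counts in the usage example are invented purely for illustration, the function names are hypothetical, and the slides do not say how the resulting value is mapped to the integer penalties shown later, so this sketch returns the raw reciprocal.

```python
def replacement_probability(f_clc_and_op, f_clc):
    """P(x -> y | CLC_i) = (f(CLC_i, x -> y) + 1) / (f(CLC_i) + 2)."""
    return (f_clc_and_op + 1) / (f_clc + 2)


def string_penalty(counts):
    """Pick the context with the highest probability and invert it.

    counts maps a context name to a pair (f(CLC_i, x -> y), f(CLC_i)).
    Returns (best context name, SP value).
    """
    best = max(counts, key=lambda name: replacement_probability(*counts[name]))
    probability = replacement_probability(*counts[best])
    return best, 1.0 / probability


if __name__ == "__main__":
    # Hypothetical frequencies for one operation x -> y.
    counts = {
        "CLC1": (3, 10),
        "CLC2": (5, 20),
        "CLC5": (8, 8),
    }
    print(string_penalty(counts))
```

Because of the +1 and +2 smoothing terms, the probability is always strictly between 0 and 1, so the penalty is always greater than 1 and never divides by zero.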
39
Examples of string penalties
Operation | SP | Example
Insertion and deletion of '・' | 1 | ラストシーン, ラスト・シーン
Insertion and deletion of macron 'ー' | 1 | エネルギー, エネルギ
Replace オ 'O' and ォ 'o' | 1 | ウオッカ, ウォッカ
Replace グ 'gu' and ク 'ku' | 2 | バック, バッグ
Replace ヴ 'vu' and ブ 'bu' | 2 | ジュネーヴ, ジュネーブ
Replace ヴ 'vu' and ウ 'U' | 3 | ヴォッカ, ウォッカ
40
Comparison of SP by hand and SP by the proposed method
SP by hand: proposed by Masuyama et al. (2004); an expert worked out the SPs by hand
Gold standard Katakana variants:
682 pairs of Katakana variant candidates extracted from a newspaper corpus, whose string penalties are between 1 and 12
We found no correct variants whose SPs are between 10 and 12. Thus, the above gold standard probably covers all correct variants.
41
Comparison of SPs
SP | SP by hand | SP by proposed mechanical method
1 | 216/221 (97.7%) | 262/286 (91.6%)
2 | 162/207 (78.3%) | 133/148 (89.9%)
3 | 70/99 (70.7%) | 51/90 (56.7%)
4 | 2/14 (14.3%) | 2/26 (7.7%)
5 | 0/29 (0.0%) | 0/16 (0.0%)
6 | 0/13 (0.0%) | 2/34 (5.9%)
7 | 1/20 (5.0%) | 1/39 (2.6%)
8 | 0/13 (0.0%) | 1/15 (6.7%)
9 | 1/12 (8.3%) | 0/8 (0.0%)
10 | 0/16 (0.0%) | 0/5 (0.0%)
11 | 0/17 (0.0%) | 0/12 (0.0%)
12 | 0/21 (0.0%) | 0/3 (0.0%)
42
Comparison of SPs: correlation
Rows: SP by hand. Columns: SP by proposed mechanical method.
         1    2    3    4    5    6    7    8    9   10   11   12  Total
1      207    7    3    2    0    1    1    0    0    0    0    0    221
2       20  123   59    2    1    1    1    0    0    0    0    0    207
3       59   11   20    3    2    3    1    0    0    0    0    0     99
4        0    2    3    2    2    0    4    0    0    1    0    0     14
5        0    2    2    6    3    4    5    3    1    0    2    1     29
6        0    0    0    3    1    1    2    0    3    1    2    0     13
7        0    0    1    3    2    2    2    4    1    1    3    1     20
8        0    1    0    0    0    4    6    1    0    0    1    0     13
9        0    1    0    0    0    0    2    3    1    1    4    0     12
10       0    1    1    5    0    0    4    2    1    1    0    1     16
11       0    0    1    0    2   13    0    0    1    0    0    0     17
12       0    0    0    0    3    5   11    2    0    0    0    0     21
Total  286  148   90   26   16   34   39   15    8    5   12    3    682
correlation: 0.76
43
Building Katakana variants DB automatically
44
Summary of comparison and next?
SP by hand (COLING 2004) vs. SP by mechanical method (SIGIR 2005): correlation 0.76
SP by hand + context similarity → extracted variants: accurate
SP by mechanical method + context similarity → extracted variants: accurate?
45
Building the variants DB (diagram)
Newspaper corpus: … レポート … ラポート … リポート … サポート …
→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート), …
→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
→ Katakana variants DB: (レポート, リポート), …
46
(Same diagram as slide 45, with the first step labeled: Extract Katakana words)
47
(Same diagram as slide 45, with steps labeled: Extract Katakana words; SP ≤ 3)
48
(Same diagram as slide 45, with steps labeled: Extract Katakana words; SP ≤ 3; Context similarity ≥ 0.005, optimized threshold)
49
Comparison of variants DB
Conditions: SP ≤ 3, context similarity ≥ 0.05
Measure | SP by hand (by an expert) | SP by the proposed mechanical method
recall | 417/420 (99.3%) | 415/420 (98.8%)
precision | 417/480 (86.9%) | 415/480 (86.5%)
F-value | 92.7% | 92.2%
cf. The whole DB contains 3 million Katakana variants for 1 million distinct Katakana words.
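The F-values in this comparison follow from the recall and precision fractions. A minimal sketch of that arithmetic, assuming the F-value is the usual harmonic mean of precision and recall (the function name is hypothetical):

```python
def precision_recall_f(correct, gold_total, extracted_total):
    """Return (precision, recall, F-value) from raw counts."""
    recall = correct / gold_total
    precision = correct / extracted_total
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


if __name__ == "__main__":
    # Counts taken from the slide: 420 gold variants, 480 extracted pairs.
    for label, correct in (("SP by hand", 417), ("SP by mechanical method", 415)):
        p, r, f = precision_recall_f(correct, 420, 480)
        print(f"{label}: precision {p:.1%}, recall {r:.1%}, F-value {f:.1%}")
```

Running it on the slide's counts reproduces the reported figures up to rounding.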
50
Conclusions
Mechanical method of calculating SP:
Using a Web search engine to extract variant candidates
SP by character context
Almost the same accuracy as SP by hand by an expert
Katakana variants DB with SP by the mechanical method:
recall: 98.8%, precision: 86.5%, F-value: 92.2%
51
Future of our research
Other languages, like German: Arbeit -- アルバイト
Application of our methodology (Web resource + statistical string penalty) to other language pairs: Londres ⇔ London, München ⇔ Munich
Our hope is: a cross-language automatic spelling variant generator for any language pair, based on the proposed method.
52
Thank you!
サンキュー( sankyuh)
サンキュウ (sankyuu)
Questions or comments are welcome.
53
Error analysis
grizzly bear: グリーズリーベア (gurihzurihbea) vs. グリーズリー・ベア (gurihzurih・bea) are not regarded as variants.
Contexts: animal vs. Norman Schwarzkopf (totally different contexts!)
sign pole vs. sign ball: サインポール (sainpohru) vs. サインボール (sainbohru) are regarded as variants.
Contexts: barber shop vs. baseball, sharing customer, shop, sales (very similar contexts)
54
The threshold of SP vs. F-value
[Figure: F-value (%) against the threshold of SP; x-axis from 1 to 12, y-axis from 0 to 100]
55
cosine similarity vs. F-value
[Figure: F-value (%) against the threshold; x-axis from 0 to 0.4 (labeled “Threshold of SP” on the slide), y-axis from 25 to 85]
56
If you search for a Katakana variant with Google, …
In case of spaghetti
Katakana Variants Found or not
スパゲッティ( spaghetti) ○
スパゲッティー( supagettuthii ) ○
スパゲッテイ( supagettutei ) ×
スパゲティ( supagettuthi ) ○
スパゲティー( supagethii ) ○
スパゲテイ( supagetei ) ×
57
How to find candidates of Katakana
How to extract document in which context similarity is calculated
Google search with a query of Katakana word which is a candidate of Katakana variant.
Extract context of the Katakana variant from search result pages and calculate context similarity to identify Katakana variants.
58
Example of similarity calculation
(ウォッカ 'Uottuka', ウォトカ 'Uotoka')
ウォッカ: liquor: 1.1, strong: 1.4, alcohol: 1.6, western liquor: 0.7, ···
ウォトカ: liquor: 0.7, strong: 0.7, alcohol: 3.4, western liquor: 1.4, ···
$$\cos(\mathrm{Uottuka}, \mathrm{Uotoka}) = \frac{1.1 \times 0.7 + 1.4 \times 0.7 + \cdots}{\sqrt{1.1^2 + 1.4^2 + 1.6^2 + \cdots}\ \sqrt{0.7^2 + 0.7^2 + 3.4^2 + \cdots}} = 0.00157$$