Aug. 14, 2012 2012 IASLOD Linking Korean Resources to LOD: Issues in Localization Mun Y. Yi


Page 1: Aug. 14, 2012 2012  IASLOD

Aug. 14, 2012, 2012 IASLOD

Linking Korean Resources to LOD: Issues in Localization

Mun Y. Yi

Page 2

Agenda

• Project Scope
• System Architecture
• Silk in Action
• Korean Traditional Knowledge Data
• Localization Issues

Page 3

LOD2 Work Packages

• The project is structured into twelve consecutively numbered work packages (WPs).
• WP1 to WP6 are concerned with development of the LOD2 Stack.
• WP7 to WP9 are designed to extensively validate and demonstrate the developed technology on the basis of a carefully selected and representative set of demonstrator applications with potentially great impact.
• WP10 (SWC) is devoted to training, awareness, and dissemination.
• WP11 is concerned with exploitation and standardization activities, as well as technical coordination with other projects.
• WP12 is designed for high-level project coordination, reporting to the EC, and activities related to the resolution of IPR and maintenance of the Consortium Agreement.

Page 4

Simplified LOD2 Stack High-Level Architecture

• The main result of LOD2 will be the LOD2 Stack, an integrated distribution of aligned tools that support the whole life cycle of Linked Data, from creation through enrichment, interlinking, and fusion to maintenance.

Page 5

Project Scope: Tasks & Deliverables

• In Task 4.1, a semi-automatic machine learning technique will be developed and implemented to simplify the creation of mappings between knowledge bases and the assessment of their quality.
• KAIST will contribute to this task by providing a platform for automatic linking with Korean, Chinese, and Japanese RDF resources.

Task 4.1 Semi-Automatic Data Interlinking
- University Leipzig
- Digital Enterprise Research Institute
- Free University Berlin
- KAIST

Deliverable 4.1.1 First Linking Assist Release
Due Date: M18 (2012-02)

Deliverable 4.1.3 Korean Resource Linking Assist Release
Due Date: M24 (2012-08)

Deliverable 4.1.4 Asian Resource Linking Assist Release
Due Date: M30 (2013-02)

Page 6

Project Scope: Tasks & Deliverables (Cont’d)

Task 4.5 Link Data Fusion
- University Leipzig
- Digital Enterprise Research Institute
- Free University Berlin
- KAIST

Deliverable 4.5.1 Initial Release of Data Fusion Component
Due Date: M24 (2012-08)

Deliverable 4.5.3 Korean Data Fusion Assistant
Due Date: M30 (2013-02)

Deliverable 4.5.4 Asian Data Fusion Assistant
Due Date: M36 (2013-08)

• In Task 4.5, methods for fusing data about a single concept from multiple different sources will be devised and implemented.
• KAIST will work on the fusion of multilingual DBpedia datasets, thus eliminating issues for other multilingual resources.

Page 7

Phased Approaches

1st Cycle (~Feb. 2012): Understanding of the Task Domain
• Semantic Web
• LOD2 Concept
• Software Architecture
• Data Model (Relational2RDF)
• Pilot Project
• Korean Traditional Recipe data

2nd Cycle (~July 2012): Implementation of Korean Resource Linking Assistant
• Silk Localization
• Linking with Silk Framework
• Internal publication

3rd Cycle (~Aug. 2012): Quality Enhancement
• Linking Quality
• Publish to the LOD2 cloud

• The project has been carried out in three iterative cycles.
• Each cycle focuses on specific tasks, and lessons learned are carried into the next cycle.
• In the 1st cycle, preliminary RDF data was generated. During the 2nd cycle, we localized Silk to support Korean resource linking. The last cycle focuses on enhancing data quality.

Page 8

Silk in Action
• URL: http://lod.kaist.ac.kr/silk-workbench/
• A file or a SPARQL endpoint can serve as a source or a target.

• Define a project
• Define a source & a target
• Define a task
• Define an output
• And then click Open

Page 9

Silk in Action (Cont’d)
• Multiple operators can be combined for complex tasks.
• Outputs can be displayed or written to a file.
• Interim results can be exported as final results or used as training data sets for machine learning.
• The learned algorithm can be used to generate the final links.

• Define a source & a target from Property Paths
• Define operator(s)
• Click GenerateLinks
• Click Start
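The operator-and-threshold idea behind link generation can be sketched outside Silk. The snippet below is only an illustration, not Silk's API: `generate_links`, `label_similarity`, the example.org URIs, and the 0.9 threshold are all assumptions made for this sketch. It compares every source label with every target label and emits an `owl:sameAs` link when the similarity reaches the threshold.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in for a comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_links(
    source: Dict[str, str],
    target: Dict[str, str],
    similarity: Callable[[str, str], float] = label_similarity,
    threshold: float = 0.9,  # hypothetical cut-off
) -> List[Tuple[str, str, str]]:
    """Return (source URI, owl:sameAs, target URI) for sufficiently similar labels."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if similarity(s_label, t_label) >= threshold:
                links.append((s_uri, OWL_SAME_AS, t_uri))
    return links


# Hypothetical data: URI -> label
source = {"http://example.org/food/1": "bibimbap"}
target = {
    "http://example.org/dbp/Bibimbap": "Bibimbap",
    "http://example.org/dbp/Kimchi": "Kimchi",
}
for link in generate_links(source, target):
    print(link)
```

Swapping in a different `similarity` function corresponds loosely to choosing a different comparator operator in the workbench.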

Page 10

Korean Traditional Knowledge Portal

Page 11

Korean Traditional Knowledge

Data includes
– Food (3,236 records)
  • Food name
  • Food type
  • Recipe, ingredients
  • Cooking process (images)
– Medicine, sickness, and treatment (38,121 records)
– Agriculture (2,775 units)
– Life (4,438 units)

Page 12

System Architecture

• Proprietary RDFgen for transforming the relational model to the RDF model
• Silk for link generation
• Virtuoso triple store for serving RDF

[Figure: architecture diagram. Stages: Transformation (RDFgen), Link Creation (Silk, new Korean similarity measures), Publication (Virtuoso triple store). Other labels: source data in relational DB, ontology, instances, RDF links, DBpedia.]

Page 13

Key Linking Issues

• Data Preprocessing
• Address Encoding: URI vs. IRI
• Korean String Similarity Measure
• Handling Transliterated Data

Page 14

Data Preprocessing: Mapping Relation to RDF
• Our goal is to make the recipes of Korean traditional food open.
• The original data from the relational database were transformed into tables by object-relational mapping.
• Related ontologies for recipes: LinkedRecipe.com, www.mindswap.org.
• Tool and IngredientPortion are not implemented at this phase.

Relational | RDF
Table name | Class name
PK column value | Subject
Non-PK column name | Predicate
FK column value | Object (used as URI; RDF link)
Non-FK column value | Object (used as string; literal triple)
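The mapping table above can be read as a small procedure. The following is a minimal sketch of that reading, not the proprietary RDFgen tool: the `http://example.org/` namespace, the `Food` table, its columns, and the function name `row_to_triples` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

BASE = "http://example.org/"  # hypothetical namespace, not from the slides
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

Triple = Tuple[str, str, str]


def row_to_triples(
    table: str,
    pk: str,
    row: Dict[str, object],
    fks: Optional[Dict[str, str]] = None,  # FK column -> referenced table
) -> List[Triple]:
    """Map one relational row to triples following the table above."""
    fks = fks or {}
    # PK column value -> subject; table name -> class
    subject = f"<{BASE}{table}/{row[pk]}>"
    triples = [(subject, RDF_TYPE, f"<{BASE}{table}>")]
    for column, value in row.items():
        if column == pk or value is None:
            continue
        # Non-PK column name -> predicate
        predicate = f"<{BASE}{table}#{column}>"
        if column in fks:
            # FK column value -> object used as URI (RDF link)
            obj = f"<{BASE}{fks[column]}/{value}>"
        else:
            # Non-FK column value -> object used as string literal
            text = str(value).replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{text}"'
        triples.append((subject, predicate, obj))
    return triples


food = {"id": 7, "name": "비빔밥", "type_id": 2}  # hypothetical row
for triple in row_to_triples("Food", "id", food, {"type_id": "FoodType"}):
    print(" ".join(triple), ".")
```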

Page 15

Handling Non-Latin Data
• Resources may be described in non-Latin characters.
• It is not known whether existing tools support non-Latin characters.

[Figure: Writing systems of the world today (source: Wikipedia)]

Page 16

Address Encoding
• The URI is a core component of linked data.
• URIs are used as names for things.
• A URI allows only US-ASCII characters in resource names.

W3C recommendation for URIs: UTF-8 character set & URI encoding
• Use the UTF-8 character set for URIs, and percent-encode special and non-Latin characters.
• ex) http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0
• But it’s hard to understand what it is…

Another W3C recommendation: IRI (Internationalized Resource Identifier)
• ex) http://ko.wikipedia.org/wiki/베를린
• Now we can understand what it means.
• But some characters look so similar that the chance of spoofing increases. ( ex) Å
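A short sketch of both points, using only Python's standard library. The helper name `encode_title` is an assumption for illustration. The look-alike pair (ANGSTROM SIGN U+212B vs. LATIN CAPITAL LETTER A WITH RING ABOVE U+00C5) is my own choice of example and is not necessarily the pair the slide had in mind.

```python
import unicodedata
from urllib.parse import quote

BASE = "http://ko.wikipedia.org/wiki/"


def encode_title(title: str) -> str:
    """Percent-encode a page title as UTF-8 bytes and append it to BASE."""
    return BASE + quote(title)


print(encode_title("베를린"))
# http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0

# Two code points that render alike but compare as different strings.
angstrom_sign = "\u212B"
a_with_ring = "\u00C5"
print(angstrom_sign == a_with_ring)  # False

# Unicode normalization (NFC) maps the Angstrom sign to U+00C5,
# which is one way to reduce this class of look-alike mismatch.
print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)  # True
```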

Page 17

Localization: Silk Workbench Address Encoding
• Silk Workbench is a GUI for generating links.
• Silk Workbench displays encoded URIs ‘as is’, which makes non-Latin datasets hard to read.
• Decoding the URIs lets a non-Latin dataset be displayed in its native language, so it is much easier to work with.
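The decoding step can be illustrated in a few lines. This is a sketch of the idea only, not the change made to Silk Workbench; `display_iri` is a hypothetical helper, and the decoded form is meant for display rather than for dereferencing, since decoding reserved characters can change what a URI identifies.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def display_iri(uri: str) -> str:
    """Return a human-readable form of a percent-encoded URI (display only)."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,            # host is left untouched
        unquote(parts.path),     # percent-escapes decoded as UTF-8
        unquote(parts.query),
        unquote(parts.fragment),
    ))


print(display_iri("http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0"))
# http://ko.wikipedia.org/wiki/베를린
```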

Page 18

Localization: Korean String Similarity Measures
• Two kinds of Korean resources exist: resources in Korean and resources in transliterated Korean.
• We need to calculate similarity distances for both of them.
• The Korean alphabet has 14 consonants and 10 vowels (together with consonant clusters and diphthongs).

For resources in Korean
• ‘비빔밥’, e.g., Korean DBpedia
• Most of the resources in Korea

For resources in transliterated Korean
• ‘bibimbap’, e.g., English DBpedia
• Most of the resources abroad

Most of the comparators in Silk are based on string comparison
• e.g., Levenshtein distance
• However, writing systems differ from language to language.
• So are comparators designed for the Latin alphabet appropriate for the Korean alphabet?

String Similarity Distance Measures for Korean
• KorED
• GrpSim
• OneDSim2
• KorPhoD (our approach) = (sD - 1) * 3 + min(pD), where sD: syllable distance, pD: phoneme distance
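To see why the unit of comparison matters for Korean, the sketch below decomposes precomposed Hangul syllables into their letters (jamo) using the standard Unicode arithmetic, and runs plain Levenshtein distance at either level. It is not an implementation of KorED, GrpSim, OneDSim2, or KorPhoD, whose weighting details are not given here; `to_jamo` and `levenshtein` are names chosen for this illustration.

```python
from typing import List, Sequence

# Jamo tables in Unicode order for precomposed syllables (U+AC00..U+D7A3).
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                  # 19 initials
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")             # 21 medials
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + none


def to_jamo(text: str) -> List[str]:
    """Split each Hangul syllable into initial, medial, and (if any) final."""
    out: List[str] = []
    for ch in text:
        idx = ord(ch) - 0xAC00
        if 0 <= idx < 11172:
            out.append(CHO[idx // 588])
            out.append(JUNG[(idx % 588) // 28])
            if idx % 28:
                out.append(JONG[idx % 28])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


pair = ("녹차", "모과차")
print("syllable-level distance:", levenshtein(*pair))
print("jamo-level distance:", levenshtein(to_jamo(pair[0]), to_jamo(pair[1])))
print(to_jamo(pair[0]), to_jamo(pair[1]))
```

Treating each syllable block as one character hides how many letters inside it actually changed, which is the gap the Korean-specific measures listed above try to close.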

Page 19

Localization: Korean String Similarity Measures (Cont’d)
• Several Korean similarity distances exist to reflect the characteristics of the Korean alphabet.
• We devised a new measure based on the distribution of phonemes (KorPhoD).
• We implemented a KoreanPhonemeDistance operator in Silk and used it to build links among Korean resources.

Application of Edit Distance to Korean Resources

Source | Target | Levenshtein Distance | Actual Edit Operation | Differences in phonemes | Differences in syllables
녹차 | 모과차 | 2 | 3 (ㅁ -> ㄴ, ㄱ -> add, 과 -> delete) | 4 | 2

Comparison of Similarity Measures for Korean

Columns: Source | Target | Levenshtein Distance | KorED | GrpSim | OneDSim2 | KorPhoD
우연히 망연이 2 + + *ws(‘ㅇ’ and ‘ㅎ’ are similar) + *w3 +
강낭콩 뿔난콩 2 + + *wd(‘ㅇ’ and ‘ㄴ’ are different) + *w4 +
일반통계학 일방통행 3 + 3 + *wd+*w+*wd 2 +
바람 보름 2 2 +
: syllable distance, : phoneme distance

Performance Comparison
• Precision: 1.28% vs. 17.78% (about a thirteen-fold improvement)
• F-score: 0.0223 vs. 0.0896 (four times more effective at finding correct links)

Page 20

Localization: Transliterated Korean Similarity Measures
• Two kinds of transliteration involve Korean: from English to Korean and from Korean to English.
• For now, we focus on transliteration from Korean to English to build links for resources in Korean.
• The biggest problem is that many different schemes have been used to transliterate Korean into English.

From English to Korean
• ‘Digital’ -> ‘디지털’, ‘디지틀’, ‘디지탈’, …

From Korean to English
• ‘칼국수’ -> ‘Kalguksu’, ‘Kalguksoo’, ‘Kalgugsoo’, …

Transliteration algorithms for Korean
• McCune-Reischauer (1937): official standard in the past (from 1984 to 2000)
  • Uses breves (˘: indicates a short vowel), apostrophes, and diereses (¨: a vowel is sounded in a separate syllable)
• Yale (1942)
• Revised Romanization (2000): current official standard
  • Is generally similar to MR, but uses no diacritics or apostrophes, and uses distinct letters for ㅌ/ㄷ (t/d), ㅋ/ㄱ (k/g), ㅊ/ㅈ (ch/j), ㅍ/ㅂ (p/b), etc.
• and probably many more…
• We found that many academic and government websites still favor MR.

Silk doesn’t have phonetic similarity measures though…
• e.g., Soundex
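For readers unfamiliar with Soundex, here is a compact sketch of the standard American Soundex coding applied to the romanization variants listed above. It is a standalone illustration, not a Silk operator and not the KoTlit measure; the function name `soundex` is my own.

```python
# Letter -> digit groups of American Soundex; vowels, H, W, Y have no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of an ASCII name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


for variant in ("Kalguksu", "Kalguksoo", "Kalgugsoo"):
    print(variant, soundex(variant))
```

All three spellings collapse to the same code, which illustrates how a phonetic measure can tolerate competing romanization schemes, at the price of also matching many unrelated names.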

Page 21

Localization: Transliterated Korean Similarity Measures (Cont’d)
• We compare performance from both a string similarity perspective and a phonetic similarity perspective.
• Levenshtein shows good precision, and Soundex shows good recall.
• KoTlit shows good performance for both precision and recall, and we are still optimizing the algorithms.

Performance Comparison

M.R. (McCune-Reischauer)
Measure | Relevant | Retrieved | Ret. & Rel. | Precision(%) | Recall(%)
Levenshtein* | 6669 | 2875 | 2770 | 96.35 | 41.54
Soundex | 6669 | 386432 | 5969 | 1.54 | 89.50
KoTlit | 6669 | 4469 | 4241 | 94.90 | 63.59
* threshold: 0

R.R. (Revised Romanization)
Measure | Relevant | Retrieved | Ret. & Rel. | Precision(%) | Recall(%)
Levenshtein* | 6669 | 5552 | 5237 | 94.33 | 78.53
Soundex | 6669 | 348187 | 6188 | 1.78 | 92.79
KoTlit | 6669 | 5977 | 5641 | 94.38 | 84.59
* threshold: 0
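The precision and recall columns follow directly from the counts. As a sanity check, the sketch below recomputes them for the M.R. rows, taking 6669 as the number of relevant links; the function name `precision_recall_f1` is an assumption for illustration, and F1 is included only because an F-score is reported for the Korean-string experiment earlier in the talk.

```python
from typing import Tuple


def precision_recall_f1(relevant: int, retrieved: int, hits: int) -> Tuple[float, float, float]:
    """precision = hits / retrieved, recall = hits / relevant, F1 = harmonic mean."""
    precision = hits / retrieved if retrieved else 0.0
    recall = hits / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


RELEVANT = 6669
MR_ROWS = {  # measure -> (retrieved, retrieved & relevant)
    "Levenshtein": (2875, 2770),
    "Soundex": (386432, 5969),
    "KoTlit": (4469, 4241),
}

for name, (retrieved, hits) in MR_ROWS.items():
    p, r, f1 = precision_recall_f1(RELEVANT, retrieved, hits)
    print(f"{name:12s} precision={p:.2%} recall={r:.2%} F1={f1:.4f}")
```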

Page 22

Concluding Remarks

• Localization issues are important for Asian and other countries that use non-Latin scripts.
• Each language needs its own similarity measures, covering both string similarity and phonetic similarity.
• Silk is likely to become a key linking assistant program for LOD.
• LOD is a major movement to define the next version of the Internet.

Page 23

Thank you!

Mun Yong Yi
KAIST, Department of Knowledge Service Engineering (지식서비스공학과)
http://kslab.kaist.ac.kr
