HanNanum Project Sangwon Park 2010.11.24


Page 1: Sangwon Park 2010.11.24.  The result of applying plug-in component based architecture ◦ Key differences with previous HanNanum (jhannanum ver.0.7.4)

HanNanum Project

Sangwon Park

2010.11.24

Page 2: Contents

The result of applying plug-in component based architecture
◦ Key differences from the previous HanNanum (jhannanum ver. 0.7.4)
◦ GUI demo

A measurement of the morphological analyzer
◦ Features of Korean morphological analysis
◦ Measurement 1. Strict criteria
◦ Measurement 2. Loose criteria

Page 3: The result of applying plug-in component based architecture

HanNanum ver. 0.8 was released
◦ Plug-in component based architecture
◦ Faster analysis speed (object-based communication, reduced overhead between components)
◦ More accurate results (several bugs were fixed)

GUI demo
◦ It helps people understand the concept of the HanNanum workflow
◦ People can test various workflows for their own purposes
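The plug-in idea above can be illustrated with a minimal Python sketch. It is not HanNanum's actual API: the names `Sentence`, `Plugin`, `Workflow`, and the two toy plug-ins are hypothetical. The point is only that components are interchangeable and hand the same object down the chain instead of re-serializing text between stages.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sentence:
    """Object passed between plug-ins (hypothetical structure)."""
    text: str
    eojeols: List[str] = field(default_factory=list)
    analyses: List[List[str]] = field(default_factory=list)


# A plug-in is any callable that takes a Sentence and returns a Sentence.
Plugin = Callable[[Sentence], Sentence]


def whitespace_splitter(s: Sentence) -> Sentence:
    """Toy pre-processing plug-in: split the text into eojeols."""
    s.eojeols = s.text.split()
    return s


def dummy_analyzer(s: Sentence) -> Sentence:
    """Toy analyzer plug-in: emit one placeholder analysis per eojeol."""
    s.analyses = [[f"{e}/unknown"] for e in s.eojeols]
    return s


class Workflow:
    """Runs plug-ins in order, handing the same object down the chain."""

    def __init__(self, plugins: List[Plugin]):
        self.plugins = list(plugins)

    def run(self, text: str) -> Sentence:
        sentence = Sentence(text)
        for plugin in self.plugins:
            sentence = plugin(sentence)
        return sentence


result = Workflow([whitespace_splitter, dummy_analyzer]).run("집에 가시는")
print(result.eojeols)
print(result.analyses)
```

Swapping a plug-in only means changing the list passed to `Workflow`, which is the kind of experiment the GUI demo is described as supporting.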

Page 4: GUI Demo

[Screenshot of the GUI demo. Labeled areas: Plug-in Pool, Workflow, Information of a plug-in, Workflow control, Input & Output]

Page 5: A measurement of the morphological analyzer

POS Tagger

Page 6: A measurement of the morphological analyzer

Features of Korean morphological analysis
◦ Ambiguity of part-of-speech
◦ Ambiguity of morpheme segmentation

Example: 가시는
◦ 가시/noun + 는/josa (thorn, prickle)
◦ 가시/verb + 는/eomi (leave, disappear)
◦ 가/verb + 시/eomi + 는/eomi (go)
◦ 갈/verb + 시/eomi + 는/eomi (grind, sharpen)

Page 7: Evaluation Metrics

Input: 집에 가시는

Output of the POS tagger
집에
◦ 집/pvg + 에/ecx
◦ 집/pvg + 에/jca
가시는
◦ 가시/ncn + 는/jxc
◦ 갈/pvg + 시/ep + 는/etm
◦ 가/pvg + 시/ep + 는/etm
◦ 가/px + 시/ep + 는/etm

Correct analysis
◦ 집에: 집/ncn + 에/jca
◦ 가시는: 가/pvg + 시/ep + 는/etm
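One reading of this slide, consistent with the counts reported later in the deck, is that an eojeol counts as correct when the reference analysis appears among the candidates the tagger generated. The sketch below applies that reading to the slide's example; the dictionary layout and the function name `count_matches` are assumptions made for illustration.

```python
# Candidate analyses produced for each eojeol (copied from the slide).
output = {
    "집에": ["집/pvg+에/ecx", "집/pvg+에/jca"],
    "가시는": [
        "가시/ncn+는/jxc",
        "갈/pvg+시/ep+는/etm",
        "가/pvg+시/ep+는/etm",
        "가/px+시/ep+는/etm",
    ],
}

# Reference (correct) analysis for each eojeol.
correct = {
    "집에": "집/ncn+에/jca",
    "가시는": "가/pvg+시/ep+는/etm",
}


def count_matches(output, correct):
    """Return (eojeols whose reference is among the candidates,
    number of input eojeols, number of generated candidates)."""
    matched = sum(1 for e, ref in correct.items() if ref in output.get(e, []))
    generated = sum(len(c) for c in output.values())
    return matched, len(correct), generated


matched, n_input, n_generated = count_matches(output, correct)
print(matched, n_input, n_generated)  # 1 2 6
```

Here 가시는 matches, but 집에 does not, because the tagger labelled 집 as pvg rather than ncn.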

Page 8: Evaluation Metrics

Measurement 1. Strict criteria
◦ An analysis result is counted as correct only when it is exactly the same as the corpus annotation.
◦ The measurement can be performed automatically on a large amount of test data.
◦ This criterion has not been used in papers on Korean morphological analyzers.

Measurement 2. Loose criteria
◦ There can be several correct answers for an input eojeol.
◦ Only a few tags, such as {N, P, M, I, J, E, X, F, S}, are considered.
◦ Most papers use these criteria and report that their analyzers show around 98% accuracy.
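The difference between the two criteria can be sketched in a few lines. The sketch assumes that the coarse tag of a morpheme is the upper-cased first letter of its detailed tag (for example ncn → N, jca → J); that mapping and the function names are assumptions, not something the slides state.

```python
def parse(analysis):
    """Split '집/ncn+에/jca' into [('집', 'ncn'), ('에', 'jca')]."""
    return [tuple(m.rsplit("/", 1)) for m in analysis.split("+")]


def strict_match(result, reference):
    """Strict criteria: morphemes and detailed tags must be identical."""
    return parse(result) == parse(reference)


def loose_match(result, references):
    """Loose criteria: compare coarse tags only and accept any of
    several reference analyses (coarse tag = first letter, assumed)."""
    def coarse(analysis):
        return [(m, t[0].upper()) for m, t in parse(analysis)]
    return any(coarse(result) == coarse(ref) for ref in references)


print(strict_match("집/ncn+에/jcm", "집/ncn+에/jca"))   # False
print(loose_match("집/ncn+에/jcm", ["집/ncn+에/jca"]))  # True
```

The same output can therefore fail the strict test and pass the loose one, which is why the two measurements are reported separately.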

Page 9: Measurement 1. Strict criteria

Input Data
◦ Test corpus: BORA Corpus 2, aligned morpheme analysis corpus
◦ Test set: 20 sentences with more than 10 eojeols each, from each of 68 documents
◦ # of sentences: 1360
◦ # of eojeols: 25515

Result
◦ # of generated eojeols: 74415
◦ # of eojeols which are restored and segmented correctly: 23605
◦ # of eojeols which are tagged correctly: 19147
◦ Precision: 19147/25515 (0.75)
◦ Recall: 19147/74415 (0.26)
◦ F-measure: 0.38

Precision by level
◦ Level 1: 22402/25515 (0.88)
◦ Level 2: 21174/25515 (0.83)
◦ Level 3: 19168/25515 (0.75)
◦ Level 4: 19147/25515 (0.75)
◦ Level 5: 19147/25515 (0.75)
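As a check on the arithmetic, the following sketch recomputes the reported figures from the raw counts, assuming the usual harmonic-mean F-measure. It follows the slide's labels, where precision divides by the number of test eojeols and recall by the number of generated eojeols; many texts assign the two names the other way round, which leaves the F-measure unchanged. The function name `scores` is hypothetical.

```python
def scores(correct, n_test, n_generated):
    """Precision, recall and F-measure as labelled on the slide."""
    precision = correct / n_test        # correct / eojeols in the test set
    recall = correct / n_generated      # correct / generated eojeols
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = scores(correct=19147, n_test=25515, n_generated=74415)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.75 0.26 0.38

# Precision at each level, from the per-level counts on this slide.
for level, count in enumerate([22402, 21174, 19168, 19147, 19147], start=1):
    print(level, round(count / 25515, 2))
```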

Page 10: Measurement 2. Loose criteria

Larger Morpheme Dictionary
◦ The morpheme dictionary was extended with the corpus
◦ 29098 morpheme+tag entries were added

Input
◦ Test corpus: BORA Corpus 2, aligned morpheme analysis corpus
◦ Test set: 2 sentences with more than 10 eojeols each, from each of 68 documents
◦ # of sentences: 136
◦ # of eojeols: 2527

Result
◦ # of generated eojeols: 30536
◦ # of eojeols which are restored and segmented correctly: 2340
◦ # of eojeols which are tagged correctly: 2041
◦ Precision: 2041/2527 (0.81)
◦ Recall: 2041/30536 (0.07)
◦ F-measure: 0.12
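Assuming the usual harmonic-mean definition, the F-measure on this slide follows directly from the three counts, because the number of correctly tagged eojeols cancels out of the ratio:

```latex
F = \frac{2PR}{P+R}
  = \frac{2 \cdot 2041}{2527 + 30536}
  = \frac{4082}{33063} \approx 0.12
```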

Page 11: Measurement 2. Loose criteria

# of correct segmentation = 2340
# of spacing error = 46
# of both are correct = 59 (subjective)

Precision columns
◦ (a) all
◦ (b) correct segmentation
◦ (c) correct seg. - spacing error
◦ (d) correct seg. - spacing error - both are correct

Level | Count | (a) | (b) | (c) | (d)
Precision on Level 1 | 2216 | 0.877 | 0.947 | 0.966 | 0.991
Precision on Level 2 | 2151 | 0.851 | 0.919 | |
Precision on Level 3 | 2042 | 0.808 | 0.873 | |
Precision on Level 4 | 2041 | 0.808 | 0.872 | |
Precision on Level 5 | 2041 | 0.808 | 0.872 | |
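The four precision columns appear to share one numerator, the per-level count, and differ only in the denominator. The sketch below reproduces the reported values under that assumption; the variable names are hypothetical, and the total of 2527 eojeols is taken from the previous slide.

```python
N_ALL = 2527      # eojeols in the test set (previous slide)
N_SEG = 2340      # eojeols segmented correctly
N_SPACING = 46    # spacing errors
N_BOTH = 59       # cases judged "both are correct" (subjective)

counts = {1: 2216, 2: 2151, 3: 2042, 4: 2041, 5: 2041}

# One denominator per precision column, in table order.
denominators = [
    N_ALL,
    N_SEG,
    N_SEG - N_SPACING,
    N_SEG - N_SPACING - N_BOTH,
]

for level, count in counts.items():
    row = [round(count / d, 3) for d in denominators]
    print(level, count, row)
```

For Level 1 this gives 0.877, 0.947, 0.966 and 0.991, matching the filled row of the table; the first two columns of the other rows match as well.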

Page 12: Appendix. Segmentation

Eojeol | Corpus | HanNanum
보문구조에 | 보문 + 구조 + 에 | 보문구조 + 에, 보문구조 + 이 + 에
정보화가 | 정보 + 화 + 가 | 정보화 + 가, 정보 + 화가
대중화가 | 대중 + 화 + 가 | 대중화 + 가, 대중 + 화가, 대중화 + 이 + 가
미래학자들의 | 미래 + 학자 + 들 + 의 | 미래학자 + 들 + 의
통해서 | 통하 + 어서 | 통해 + 서, 통 + 하 + 어서, 통 + 하 + 어 + 서

Page 13: Appendix. Spacing

Eojeol | Corpus | HanNanum | Note
과 | 과/jcj | 과/ncn, 과/ncn, 과/mag, 고/pvg+아/ecs, 고/pvg+아/ecx, … | Spacing error: “관형 관계절 (adnominal relative clause) 과”
로 | 로/jca | 로/ncn, 로/nq | Spacing error: “문장 관계절 (sentential relative clause) 로”
과의 | 과/jct+의/jcm | 과/ncn+의/jcm, 과/ncn+의/jcm, 과/ncn+의/jxc, 과/ncn+의/jca, 과/ncn+의/jct, 과/ncn+의/jco, … | Spacing error: “동격절 구성 (appositive) 과의”
의 | 의/jcm | 의/ncn, 의/ncn, 의/nq | Spacing error: “Toffler 의”

Page 14: Thank you

HAPPY CILAB