HanNanum Project
Sangwon Park
2010.11.24

Contents
The result of applying the plug-in component based architecture
◦ Key differences from the previous HanNanum (jhannanum ver.0.7.4)
◦ GUI demo
A measurement of the morphological analyzer
◦ Features of Korean morphological analysis
◦ Measurement 1. Strict criteria
◦ Measurement 2. Loose criteria

The result of applying the plug-in component based architecture
HanNanum ver.0.8 was released
◦ Plug-in component based architecture
◦ Faster analysis speed
  - Object based communication
  - Reduced overhead between components
◦ More accurate results
  - Several bugs were fixed.
GUI Demo
◦ It helps people understand the concept of the HanNanum workflow
◦ People can test various workflows for their own purposes
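To illustrate what "object based communication" between plug-ins can look like, here is a minimal sketch in Python. It is not HanNanum's API (HanNanum itself is a Java library): the names `Eojeol`, `toy_analyzer`, `first_candidate_tagger`, and `run_workflow` are hypothetical, and the tiny lexicon is made up for the example. The point is only that each plug-in receives and returns in-memory objects, so no text has to be re-serialized and re-parsed between components.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Eojeol:
    """One space-delimited unit plus its candidate analyses (hypothetical type)."""
    surface: str
    analyses: List[str] = field(default_factory=list)


# A plug-in is any callable that takes and returns a list of Eojeol objects.
Plugin = Callable[[List[Eojeol]], List[Eojeol]]


def toy_analyzer(eojeols: List[Eojeol]) -> List[Eojeol]:
    """Attach candidate analyses from a toy lexicon (illustrative data only)."""
    lexicon = {"가시는": ["가/pvg+시/ep+는/etm", "가시/ncn+는/jxc"]}
    for e in eojeols:
        e.analyses = list(lexicon.get(e.surface, []))
    return eojeols


def first_candidate_tagger(eojeols: List[Eojeol]) -> List[Eojeol]:
    """Stand-in for a POS tagger: keep only the first candidate."""
    for e in eojeols:
        e.analyses = e.analyses[:1]
    return eojeols


def run_workflow(text: str, plugins: List[Plugin]) -> List[Eojeol]:
    """Split on whitespace, then pass the same objects through each plug-in."""
    data = [Eojeol(token) for token in text.split()]
    for plugin in plugins:
        data = plugin(data)
    return data


result = run_workflow("집에 가시는", [toy_analyzer, first_candidate_tagger])
for e in result:
    print(e.surface, e.analyses)
```

Swapping, adding, or removing entries in the `plugins` list changes the workflow without touching the other components, which is the property the GUI demo lets users explore.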

GUI Demo
[Screenshot of the GUI demo. Labeled areas: Plug-in Pool, Workflow, Information of a plug-in, Workflow control, Input & Output]
A measurement of the morphological analyzer
POS Tagger

A measurement of the morphological analyzer
Features of Korean morphological analysis
◦ Ambiguity of part-of-speech
◦ Ambiguity of morpheme segmentation
Example: 가시는
◦ 가시/noun + 는/josa (thorn, prickle)
◦ 가시/verb + 는/eomi (leave, disappear)
◦ 가/verb + 시/eomi + 는/eomi (go)
◦ 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
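The two kinds of ambiguity in the 가시는 example can be made concrete with a small sketch. The four candidate analyses are taken from the slide; the function name `group_by_segmentation` and the data layout are assumptions made for illustration. Grouping the candidates by their morpheme sequence separates segmentation ambiguity (different groups) from part-of-speech ambiguity (several tag sequences inside one group).

```python
from collections import defaultdict

# Candidate analyses of the eojeol "가시는", as listed on the slide.
candidates = [
    [("가시", "noun"), ("는", "josa")],
    [("가시", "verb"), ("는", "eomi")],
    [("가", "verb"), ("시", "eomi"), ("는", "eomi")],
    [("갈", "verb"), ("시", "eomi"), ("는", "eomi")],
]


def group_by_segmentation(cands):
    """Map each morpheme sequence to the tag sequences proposed for it."""
    groups = defaultdict(list)
    for analysis in cands:
        segmentation = tuple(morpheme for morpheme, _ in analysis)
        tags = tuple(tag for _, tag in analysis)
        groups[segmentation].append(tags)
    return dict(groups)


groups = group_by_segmentation(candidates)
for segmentation, tag_sequences in groups.items():
    print(" + ".join(segmentation), "->", tag_sequences)
```

Here the segmentation 가시 + 는 carries two tag sequences (part-of-speech ambiguity), while 가 + 시 + 는 and 갈 + 시 + 는 are distinct segmentations of the same surface form.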

Evaluation Metrics
Input: 집에 가시는
POS Tagger output (candidate analyses)
◦ 집에
  - 집/pvg + 에/ecx
  - 집/pvg + 에/jca
◦ 가시는
  - 가시/ncn + 는/jxc
  - 갈/pvg + 시/ep + 는/etm
  - 가/pvg + 시/ep + 는/etm
  - 가/px + 시/ep + 는/etm
Correct analysis
◦ 집에: 집/ncn + 에/jca
◦ 가시는: 가/pvg + 시/ep + 는/etm

Evaluation Metrics
Measurement 1. Strict criteria
◦ An analysis is counted as correct only when it is exactly the same as the corpus annotation.
◦ The measurement can be performed automatically on a large amount of test data.
◦ This criterion has not been used in papers on Korean morphological analyzers.
Measurement 2. Loose criteria
◦ There can be several correct answers for one input eojeol.
◦ Only a few tags, such as {N, P, M, I, J, E, X, F, S}, are considered.
◦ Most papers use these criteria and report that their analyzers show around 98% accuracy.
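The difference between the two criteria can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used for the measurements: it assumes analyses are written as `morpheme/tag` pairs joined by `+`, that the coarse tag is the upper-cased first letter of the fine-grained tag (so `pvg` and `px` both map to `P`), and that the loose criterion accepts a match against any of several gold answers. The function names are hypothetical.

```python
def parse(analysis):
    """Turn '가/pvg+시/ep+는/etm' into [('가', 'pvg'), ('시', 'ep'), ('는', 'etm')]."""
    pairs = []
    for part in analysis.split("+"):
        morpheme, tag = part.strip().rsplit("/", 1)
        pairs.append((morpheme.strip(), tag.strip()))
    return pairs


def coarse(analysis):
    """Reduce each tag to its first letter (assumed mapping to N, P, M, ...)."""
    return [(morpheme, tag[0].upper()) for morpheme, tag in parse(analysis)]


def strict_match(candidate, gold):
    """Strict criterion: morphemes and full tags must equal the corpus answer."""
    return parse(candidate) == parse(gold)


def loose_match(candidate, golds):
    """Loose criterion: coarse tags only, and any listed gold answer may match."""
    return any(coarse(candidate) == coarse(gold) for gold in golds)


gold = "가/pvg+시/ep+는/etm"
print(strict_match("가/px+시/ep+는/etm", gold))   # tags differ in detail
print(loose_match("가/px+시/ep+는/etm", [gold]))  # same coarse tags P, E, E
```

Under these assumptions, the candidate 가/px + 시/ep + 는/etm fails the strict check but passes the loose one, while 집/pvg + 에/jca fails both against the gold 집/ncn + 에/jca because its first coarse tag is P rather than N.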

Measurement 1. Strict criteria
Input Data
◦ Test corpus: BORA Corpus 2 (aligned morpheme analysis corpus)
◦ Test set: 20 sentences, each with more than 10 eojeols, from each of 68 documents
◦ # of sentences: 1360
◦ # of eojeols: 25515
Result
◦ # of generated eojeols: 74415
◦ # of eojeols which are restored and segmented correctly: 23605
◦ # of eojeols which are tagged correctly: 19147
◦ Precision: 19147/25515 (0.75)
◦ Recall: 19147/74415 (0.26)
◦ F-measure: 0.38
Precision by level
◦ Level 1: 22402/25515 (0.88)
◦ Level 2: 21174/25515 (0.83)
◦ Level 3: 19168/25515 (0.75)
◦ Level 4: 19147/25515 (0.75)
◦ Level 5: 19147/25515 (0.75)
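The reported figures can be reproduced with a few lines of arithmetic. The sketch below follows the ratios exactly as the slide labels them (correctly tagged eojeols over input eojeols as precision, and over generated eojeols as recall) and assumes the F-measure is the usual harmonic mean of the two; the function name `ratio_metrics` is hypothetical.

```python
def ratio_metrics(correct, n_input, n_generated):
    """Return (precision, recall, f_measure) using the slide's ratios."""
    precision = correct / n_input        # slide: 19147/25515
    recall = correct / n_generated       # slide: 19147/74415
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


precision, recall, f_measure = ratio_metrics(
    correct=19147, n_input=25515, n_generated=74415
)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

Because the harmonic mean is dominated by the smaller ratio, the large number of generated candidates (74415 for 25515 input eojeols) pulls the F-measure well below the precision.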

Measurement 2. Loose criteria
Larger Morpheme Dictionary
◦ The morpheme dictionary was extended with the corpus
◦ 29098 morpheme+tag entries were added
Input
◦ Test corpus: BORA Corpus 2 (aligned morpheme analysis corpus)
◦ Test set: 2 sentences, each with more than 10 eojeols, from each of 68 documents
◦ # of sentences: 136
◦ # of eojeols: 2527
Result
◦ # of generated eojeols: 30536
◦ # of eojeols which are restored and segmented correctly: 2340
◦ # of eojeols which are tagged correctly: 2041
◦ Precision: 2041/2527 (0.81)
◦ Recall: 2041/30536 (0.07)
◦ F-measure: 0.12

Measurement 2. Loose criteria
Number of correct segmentations: 2340
Number of spacing errors: 46
Number of cases where both analyses are correct: 59 (subjective)

| Level | Count | Precision (all) | Precision (correct segmentation) | Precision (correct seg. - spacing error) | Precision (correct seg. - spacing error - both are correct) |
|---|---|---|---|---|---|
| Level 1 | 2216 | 0.877 | 0.947 | 0.966 | 0.991 |
| Level 2 | 2151 | 0.851 | 0.919 | | |
| Level 3 | 2042 | 0.808 | 0.873 | | |
| Level 4 | 2041 | 0.808 | 0.872 | | |
| Level 5 | 2041 | 0.808 | 0.872 | | |
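The four precision columns appear to share one numerator (the per-level count) and differ only in the denominator. The sketch below is an inference from the column names and the reported counts, not a description of the original evaluation script: it assumes the denominators are the 2527 test eojeols, the 2340 correctly segmented eojeols, that figure minus the 46 spacing errors, and that figure minus the 59 cases where both analyses are correct. These assumptions reproduce the Level 1 row.

```python
TOTAL_EOJEOLS = 2527     # test set size
CORRECT_SEG = 2340       # eojeols restored and segmented correctly
SPACING_ERRORS = 46      # corpus-side spacing errors
BOTH_CORRECT = 59        # cases judged correct either way (subjective)

level_counts = {1: 2216, 2: 2151, 3: 2042, 4: 2041, 5: 2041}


def precisions(count):
    """Return the four precision variants for one level's correct count."""
    return (
        count / TOTAL_EOJEOLS,
        count / CORRECT_SEG,
        count / (CORRECT_SEG - SPACING_ERRORS),
        count / (CORRECT_SEG - SPACING_ERRORS - BOTH_CORRECT),
    )


for level, count in level_counts.items():
    print(level, [round(p, 3) for p in precisions(count)])
```

Each step removes from the denominator a class of eojeols that the tagger could not reasonably have been expected to match, which is why the Level 1 figure rises from 0.877 to about 0.991.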

Appendix. Segmentation

| Eojeol | Corpus | HanNanum |
|---|---|---|
| 보문구조에 | 보문 + 구조 + 에 | 보문구조 + 에; 보문구조 + 이 + 에 |
| 정보화가 | 정보 + 화 + 가 | 정보화 + 가; 정보 + 화가 |
| 대중화가 | 대중 + 화 + 가 | 대중화 + 가; 대중 + 화가; 대중화 + 이 + 가 |
| 미래학자들의 | 미래 + 학자 + 들 + 의 | 미래학자 + 들 + 의 |
| 통해서 | 통하 + 어서 | 통해 + 서; 통 + 하 + 어서; 통 + 하 + 어 + 서 |

Appendix. Spacing

| Eojeol | Corpus | HanNanum | Note |
|---|---|---|---|
| 과 | 과/jcj | 과/ncn; 과/ncn; 과/mag; 고/pvg + 아/ecs; 고/pvg + 아/ecx; … | Spacing error: "관형 관계절 ( adnominal relative clause ) 과" |
| 로 | 로/jca | 로/ncn; 로/nq | Spacing error: "문장 관계절 ( sentential relative clause ) 로" |
| 과의 | 과/jct + 의/jcm | 과/ncn + 의/jcm; 과/ncn + 의/jcm; 과/ncn + 의/jxc; 과/ncn + 의/jca; 과/ncn + 의/jct; 과/ncn + 의/jco; … | Spacing error: "동격절 구성 ( appositive ) 과의" |
| 의 | 의/jcm | 의/ncn; 의/ncn; 의/nq | Spacing error: "Toffler 의" |
HAPPY CILAB
Thank you