KKAP: KAIST Korean Analysis Platform Morphological Analyzer, POS Tagger, Parser Sangwon Park January 20, 2011


Page 1: Sangwon  Park January  20,  2011

KKAP: KAIST Korean Analysis Platform Morphological Analyzer, POS Tagger, Parser

Sangwon Park

January 20, 2011

Page 2

Research Goal

• The goal of this research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis.

• KKAP will be flexible and easy to use so that it can be applied widely in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, and other components.

Page 3

KKAP: KAIST Korean Analysis Platform

Workflow for Korean Analysis

[Figure: workflow diagram. A Korean document passes through four phases in order and comes out as an analyzed Korean document. Each phase is assembled from a major plugin and supplement plugins taken from a plugin pool.]

• Phase 1. Text Preprocessing
• Phase 2. Morphological Analysis
• Phase 3. POS Tagging
• Phase 4. Parsing

Plugin Pool

• SentenceSegmentation
• InputFilter
• AutoSpacing
• NounExtraction
• TagMapper
• Unknown Term Processing
• Chart-based Morph Analyzer
• HMM-based POS Tagging
• Chart Parser
• Verb Phrase Extractor
• Noun Phrase Extractor

Korean Document

7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …

(English: The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, which is due to be announced on the evening of the 7th. AP reported that, among Nobel Prize observers in Sweden, the Korean poet Ko Un and the Syrian poet Adonis were mentioned most often as the candidates most likely to win this year's prize. …)

Analyzed Korean Document

7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca …
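The phase-and-plugin workflow on this slide can be pictured as a chain in which each phase runs its major plugin and then its supplement plugins. The following is only a minimal Python sketch under stated assumptions: the `Phase` and `Workflow` classes, the order "major first, supplements after", and the three toy plugin functions are all hypothetical illustrations, not the actual KKAP or HanNanum API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: a plugin is modeled as a plain function that maps one
# intermediate result to the next. The real plugin interface may differ.
Plugin = Callable[[object], object]


@dataclass
class Phase:
    name: str
    major: Optional[Plugin] = None                   # at most one major plugin
    supplements: List[Plugin] = field(default_factory=list)

    def run(self, data):
        # Assumed order: major plugin first, then supplements in sequence.
        if self.major is not None:
            data = self.major(data)
        for plugin in self.supplements:
            data = plugin(data)
        return data


@dataclass
class Workflow:
    phases: List[Phase]

    def analyze(self, document):
        data = document
        for phase in self.phases:
            data = phase.run(data)
        return data


# Toy stand-ins for real plugins (hypothetical, for illustration only).
def split_sentences(text):
    """Split on '.' and drop empty pieces."""
    return [s.strip() for s in text.split(".") if s.strip()]


def split_eojeols(sentences):
    """Split each sentence into whitespace-delimited eojeols."""
    return [s.split() for s in sentences]


def tag_placeholder(sentences):
    """Attach a dummy tag to every eojeol instead of real POS tagging."""
    return [[(e, "unk") for e in s] for s in sentences]


workflow = Workflow([
    Phase("Text Preprocessing", supplements=[split_sentences]),
    Phase("Morphological Analysis", major=split_eojeols),
    Phase("POS Tagging", major=tag_placeholder),
])
result = workflow.analyze("나는 간다. 너는 온다.")
```

Keeping every plugin behind the same function signature is what would let a phase swap one plugin for another from the pool without touching the rest of the chain.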

Page 4

Target Users

• The Korean parser can support other research that needs Korean analysis.

• The major goal is to make the parser useful for the following research projects.

[Figure: HanNanum and the parser supply Korean analysis to two projects.]

– Smart Calendar Project: Korean E-mail Analysis
– Multi-lingual Knowledge Sync. on Wikipedia: Korean Wikipedia Analysis

• I plan to work on a dependency parser so that I can follow up on and improve our laboratory's previous research and the existing parser.

Page 5

Korean Syntactic Tagged Corpus

• KAIST Syntactic Tagged Corpus
– http://bora.or.kr
– Corpus 5. Manual sentence analysis corpus
– 31,091 sentences from 97 different sources
– Length: 1 ~ 33 Eojeols (average 11.35 Eojeols)

• Related documents
– Kong joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (in Korean)
– Byung Gyu Chang, Kong joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421~429, 1997 (in Korean)

Page 6

Korean Syntactic Tagged Corpus

• KAIST Syntactic Tagged Corpus

[4226] ; 물론 꼭 필요할 땐 어디서든지 부르짖어야지요 .

((((((물론/mag)0Mag- ((((((꼭/mag)0Mag- ((필요/ncps)0Ncps+하/xsm)1Paa)MgPaa+ㄹ/etm)emPaa- (때/nbn)0Nbn)EmNbn+ㄴ/jxt)jtNbn- ((((어디/npd)0Npd+에서/jca)jaNpd+든지/jxc)jxNpd- (부르짖/pvg)0Pvg)JxPvg)JtPvg)MgPvg+어야지/ef)efPvg+요/jxf)jfPvg+(./sf)0Sf)sfPvg)S

0 : 0Mag -> mag
1 : 0Mag -> mag
2 : 0Ncps -> ncps
3 : 1Paa -> 0Ncps+xsm
4 : MgPaa -> 0Mag 1Paa
5 : emPaa -> MgPaa+etm
6 : 0Nbn -> nbn
7 : EmNbn -> emPaa 0Nbn
8 : jtNbn -> EmNbn+jxt
9 : 0Npd -> npd
10 : jaNpd -> 0Npd+jca
11 : jxNpd -> jaNpd+jxc
12 : 0Pvg -> pvg
13 : JxPvg -> jxNpd 0Pvg
14 : JtPvg -> jtNbn JxPvg
15 : MgPvg -> 0Mag JtPvg
16 : efPvg -> MgPvg+ef
17 : jfPvg -> efPvg+jxf
18 : 0Sf -> sf
19 : sfPvg -> jfPvg+0Sf
20 : S -> sfPvg
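The numbered entries that follow the bracketed tree (for example `3 : 1Paa -> 0Ncps+xsm` and `4 : MgPaa -> 0Mag 1Paa`) each name a parent node and its children. A minimal Python sketch of reading such entries is shown here. The `Rule` type and `parse_rule` function are hypothetical names, and the reading that `+` marks an attached morpheme while a space separates sibling constituents is an assumption inferred from this one example, not taken from the bracketing guidelines.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    index: int              # running number of the entry
    parent: str             # left-hand side, e.g. "MgPaa"
    children: List[str]     # right-hand side symbols in order
    joined_by_plus: bool    # True if the right-hand side uses "+"


# "<number> : <parent> -> <right-hand side>"
RULE_RE = re.compile(r"^\s*(\d+)\s*:\s*(\S+)\s*->\s*(.+?)\s*$")


def parse_rule(line: str) -> Rule:
    """Parse one entry such as '3 : 1Paa -> 0Ncps+xsm'."""
    match = RULE_RE.match(line)
    if match is None:
        raise ValueError(f"not a rule line: {line!r}")
    index, parent, rhs = match.groups()
    # Children are separated either by "+" or by whitespace.
    children = [c for c in re.split(r"[+\s]+", rhs) if c]
    return Rule(int(index), parent, children, "+" in rhs)


rules = [parse_rule(line) for line in (
    "3 : 1Paa -> 0Ncps+xsm",
    "4 : MgPaa -> 0Mag 1Paa",
    "20 : S -> sfPvg",
)]
```

With the entries in this form, counting how often each parent symbol expands to a given right-hand side is a short step away, which is the kind of statistic a chart parser would need from the corpus.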

Page 7

Korean Syntactic Tagged Corpus

• Sejong Syntactic Tagged Corpus
– I got the latest release from the National Institute of the Korean Language this week.
– Released in December 2010
  • 15 documents
  • 433,839 Eojeols / 43,828 sentences

; 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.

(S
  (NP_SBJ
    (NP
      (NP_MOD 프랑스/NNP + 의/JKG)
      (NP
        (VNP_MOD 세계/NNG + 적/XSN + 이/VCP + ᆫ/ETM)
        (NP
          (NP 의상/NNG)
          (NP 디자이너/NNG))))
    (NP_SBJ
      (NP 엠마누엘/NNP)
      (NP_SBJ 웅가로/NNP + 가/JKS)))
  (VP
    (NP_AJT
      (NP
        (NP
          (NP 실내/NNG)
          (NP 장식/NNG + 용/XSN))
        (NP 직물/NNG))
      (NP_AJT 디자이너/NNG + 로/JKB))
    (VP 나서/VV + 었/EP + 다/EF + ./SF)))
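The Sejong example uses a bracketed format in which every node is `(LABEL ...)` and each leaf holds one eojeol as `form/TAG` pairs joined by ` + `. A minimal Python sketch of reading that format follows. The function names `parse_tree`, `leaves`, and `split_morphemes` are hypothetical, the sample is an abridged piece of the sentence on this slide, and the code assumes that leaf text never contains a parenthesis.

```python
from typing import List, Tuple

# A node is (label, body), where body is either the leaf text (str)
# or a list of child nodes.
Tree = Tuple[str, object]


def parse_tree(text: str) -> Tree:
    """Parse '(LABEL child child ...)' or '(LABEL leaf text)'."""
    pos = 0

    def skip_ws() -> None:
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def node() -> Tree:
        nonlocal pos
        skip_ws()
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at offset {pos}")
        pos += 1
        start = pos
        while pos < len(text) and not text[pos].isspace() and text[pos] not in "()":
            pos += 1
        label = text[start:pos]
        skip_ws()
        if pos < len(text) and text[pos] == "(":
            body: object = []
            while pos < len(text) and text[pos] == "(":
                body.append(node())
                skip_ws()
        else:
            # Leaf: everything up to the closing parenthesis.
            start = pos
            while pos < len(text) and text[pos] != ")":
                pos += 1
            body = text[start:pos].strip()
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at offset {pos}")
        pos += 1
        return (label, body)

    tree = node()
    skip_ws()
    if pos != len(text):
        raise ValueError(f"unexpected text at offset {pos}")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Collect leaf texts (one per eojeol) from left to right."""
    _, body = tree
    if isinstance(body, str):
        return [body]
    result: List[str] = []
    for child in body:
        result.extend(leaves(child))
    return result


def split_morphemes(leaf: str) -> List[Tuple[str, str]]:
    """Turn '웅가로/NNP + 가/JKS' into [('웅가로', 'NNP'), ('가', 'JKS')]."""
    pairs = []
    for part in leaf.split(" + "):
        form, _, tag = part.rpartition("/")
        pairs.append((form.strip(), tag.strip()))
    return pairs


# Abridged sample built from the sentence above.
SAMPLE = ("(S (NP_SBJ (NP 엠마누엘/NNP)(NP_SBJ 웅가로/NNP + 가/JKS))"
          "(VP 나서/VV + 었/EP + 다/EF + ./SF))")
tree = parse_tree(SAMPLE)
```

Because `rpartition` splits on the last slash, a morpheme whose form is itself punctuation, such as `./SF`, still separates cleanly into form and tag.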

Page 8

Kookmin Univ. KLT version 2.2.0

Page 9

POSTECH: KoPA

Page 10

SNU KKMA

Page 11

Current Version

Page 12

Questions & Comments
