Notes about my studies of Information Engineering and Natural Language Processing by Changhua Yang, 04/09, 2003.


Page 1: Notes about my studies of Information Engineering and Natural Language Processing

Notes about my studies of Information Engineering and Natural Language Processing

by Changhua Yang, 04/09, 2003.

Page 2: Notes about my studies of Information Engineering and Natural Language Processing

Outline

• Knowledge Management
  – information from images

• Classification Problem
  – SVM

• A Chinese product
  – 漢字基因 (Chinese Character Genes)

• SVM Tool Demo

Page 4: Notes about my studies of Information Engineering and Natural Language Processing

• Data: a compressed JPEG file, a {(x, y, color)} bit mapping
  – Metadata: data describing data

• Information:
  – a dog (狐狸狗, a Pomeranian) on grassland

• Knowledge:
  – daytime photograph
  – an easy case for outlining the objects

Page 6: Notes about my studies of Information Engineering and Natural Language Processing

Problem Conversion

• Asking "is this a dog?"
  – form a temporary classifier {dog, !dog} from the Knowledge

• Asking "is there a dog in this picture?" (see the sketch below)
  – Phase 1: identify all objects
  – Phase 2: for each object, determine {dog, !dog}

• Asking "is there a 狐狸狗 (Pomeranian) in this picture?"
  – Option 1: a classifier for {狐狸狗, !狐狸狗} trained on the training sets of all objects
  – Option 2: the same, but trained only on the dogs
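A minimal sketch of the two-phase conversion for "is there a dog in this picture?"; segment_objects and dog_classifier are hypothetical placeholders, not components described in these notes.

# Sketch only: segment_objects() and dog_classifier() are hypothetical placeholders.
def contains_dog(image, segment_objects, dog_classifier):
    # Phase 1: identify all candidate objects in the image.
    objects = segment_objects(image)
    # Phase 2: run the binary {dog, !dog} classifier on each object.
    return any(dog_classifier(obj) == "dog" for obj in objects)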

Pages 7-10: (figure-only slides, no text to transcribe)
Page 11: Notes about my studies of Information Engineering and Natural Language Processing

Shallow Semantic Parsing using Support Vector Machines

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, and Daniel Jurafsky

HLT-NAACL 2004

Page 12: Notes about my studies of Information Engineering and Natural Language Processing

Using PropBank

• PropBank (Kingsbury et al., 2002)
  – a 300k-word corpus: the Wall Street Journal (WSJ) part of the Penn TreeBank (Marcus et al., 1994), with hand-corrected parses
  – predicate-argument relations are marked for part of the verbs

• The arguments of a verb are labeled ARG0 to ARG5 (see the illustrative example below)
  – ARG0 is the PROTO-AGENT (usually the subject)
  – ARG1 is the PROTO-PATIENT (usually its direct object)

• PropBank attempts to treat semantically related verbs consistently
  – in addition to these CORE ARGUMENTS, additional ADJUNCTIVE ARGUMENTS, referred to as ARGMs, are marked
  – some examples are ARGM-LOC for locatives and ARGM-TMP for temporals
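To make the label set concrete, here is an illustrative PropBank-style labeling for a single predicate; the sentence and its labels are my own example, not taken from the corpus.

# Hypothetical PropBank-style labeling for the predicate "bought" in
# "Yesterday, the company bought the shares in New York."
labeled_arguments = {
    "the company": "ARG0",      # PROTO-AGENT: the buyer
    "the shares":  "ARG1",      # PROTO-PATIENT: the thing bought
    "Yesterday":   "ARGM-TMP",  # temporal adjunct
    "in New York": "ARGM-LOC",  # locative adjunct
}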

Page 13: (figure-only slide, no text to transcribe)
Page 14: Notes about my studies of Information Engineering and Natural Language Processing

Problem Description: Shallow Semantic Parsing

• Argument Identification
  – the process of identifying parsed constituents in the sentence that represent semantic arguments of a given predicate

• Argument Classification
  – given constituents known to represent arguments of a predicate, assign the appropriate argument labels to them

• Argument Identification and Classification
  – a combination of the above two tasks (see the sketch below)
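A minimal sketch of how the three task variants relate; identify and classify below are hypothetical stand-ins for the SVM classifiers described on the later slides.

# Sketch only: identify() and classify() are hypothetical stand-ins for the
# NULL vs NON-NULL filter and the OVA label classifiers described later.
def argument_identification(constituents, predicate, identify):
    # Keep the parsed constituents that are arguments of the predicate.
    return [c for c in constituents if identify(c, predicate)]

def argument_classification(arguments, predicate, classify):
    # Assign an argument label (ARG0, ARG1, ..., ARGM-TMP, ...) to each argument.
    return [(c, classify(c, predicate)) for c in arguments]

def identification_and_classification(constituents, predicate, identify, classify):
    # The combined task: identify first, then classify what was kept.
    found = argument_identification(constituents, predicate, identify)
    return argument_classification(found, predicate, classify)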

Page 15: Notes about my studies of Information Engineering and Natural Language Processing

Baseline Features

• Predicate
• Path, e.g. NP↑S↓VP↓VBD
• Phrase Type (NP, PP, S)
• Position
• Voice
• Head Word
  – syntactic head
• Sub-categorization, e.g. VP->VBD-PP (see the feature sketch below)
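As an illustration of the feature set above, a baseline feature vector for one (constituent, predicate) pair might look like the following; the values are made up for illustration, not taken from the paper.

# Illustrative baseline features for one (constituent, predicate) pair;
# the values are invented for illustration only.
baseline_features = {
    "predicate":         "sold",
    "path":              "NP↑S↓VP↓VBD",  # tree path from constituent to predicate
    "phrase_type":       "NP",           # syntactic category of the constituent
    "position":          "before",       # before or after the predicate
    "voice":             "active",       # active or passive
    "head_word":         "executives",   # syntactic head of the constituent
    "subcategorization": "VP->VBD-PP",   # expansion of the predicate's parent node
}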

Page 16: Notes about my studies of Information Engineering and Natural Language Processing

Classifier and Implementation

• SVM: binary classifiers
  – One vs ALL (OVA) formalism
    • training n binary classifiers for an n-class problem
  – converts the multi-class problem into binary ones

• 80% of the nodes have NULL labels
  – a binary NULL vs NON-NULL classifier is trained first
  – the remaining data is used for training the OVA classifiers (see the sketch below)

• Tools
  – TinySVM
  – YamCha
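A minimal sketch of the NULL-filter plus one-vs-all scheme above, assuming scikit-learn as a stand-in for the TinySVM/YamCha tools; X is a NumPy feature matrix over candidate parse nodes and y holds labels such as "NULL", "ARG0", "ARGM-TMP". The degree-2 polynomial kernel mirrors the -t 1 -d 2 -c 1 settings used in the demo near the end of these notes.

# Sketch only: scikit-learn stands in for TinySVM/YamCha.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def train_two_stage(X, y):
    y = np.asarray(y)
    # Stage 1: binary NULL vs NON-NULL filter over all candidate nodes.
    null_filter = SVC(kernel="poly", degree=2, C=1.0)
    null_filter.fit(X, y == "NULL")
    # Stage 2: one-vs-all label classifiers trained on the NON-NULL examples only.
    non_null = y != "NULL"
    ova = OneVsRestClassifier(SVC(kernel="poly", degree=2, C=1.0))
    ova.fit(X[non_null], y[non_null])
    return null_filter, ova

def label_nodes(X, null_filter, ova):
    is_null = null_filter.predict(X)   # True where the filter says NULL
    labels = ova.predict(X)            # argument label for every node
    return np.where(is_null, "NULL", labels)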

Page 17: (figure-only slide, no text to transcribe)
Page 18: Notes about my studies of Information Engineering and Natural Language Processing

New Features

1. NE
2. Headword POS
3. Verb Clustering
4. Partial Path
5. Verb Sense Info
6. Head of PP
7. First and Last W/P
8. Ordinal position
9. Tree Distance
10. Relative Features
11. Temporal cue words
12. Dynamic class context

Page 19: Notes about my studies of Information Engineering and Natural Language Processing

Technology

• Culturecom (香港文化傳信), led by Chu Bong-Foo (朱邦復), the inventor of the Chinese Cangjie input method, is working with IBM to develop the Chinese embedded processor V-Dragon (飛龍). Combined with the Linux operating system, the aim is to bring personal-computer prices down to one third of today's level and break the Intel/Microsoft Wintel architecture.

• Culturecom's V-Dragon is a Chinese CPU (central processing unit) with 32,000 Chinese characters built in, and it runs the Linux-based operating system Midori Linux.

Page 20: Notes about my studies of Information Engineering and Natural Language Processing

UCLA Report Confirms

Culturecom Processor for Chinese Character Generation

• The SCS 1610 can generate about 32,000 characters in three fonts, at sizes ranging from 11x11 to 127x127 pixels.
• The display quality of the characters is optimized aesthetically for the sizes generated.
• The code and data for the generation algorithm and the character representations occupy no more than 256 KB.
• The speed of character generation is good.

This technology is the most effective solution for Chinese and other non-phonetic scripts.

Page 21: Notes about my studies of Information Engineering and Natural Language Processing

• Using a Chinese CPU in a fully Chinese environment, all Chinese glyphs are generated by the CPU as vectors; multiple typefaces can be produced and freely scaled up or down, with no Mask-ROM needed to store fonts. English is also fully supported.

Page 22: Notes about my studies of Information Engineering and Natural Language Processing

漢字基因 (Chinese Character Genes) (1/2)

• Chinese characters
  – ninety percent are phono-semantic compounds (形聲字)
  – besides the phonetic component, phono-semantic characters also have a "phonetic loan" (假借) function

• That is, the radical indicates the category, and the body of the character can serve as its definition
• For a character-lookup method, understanding the character's meaning is the first requirement

• The character-root concept is used to build a "vector glyph generator"
• The Chinese-character concept is found to have six major functions: character code (字碼), character order (字序), character form (字形), character recognition (字辨), character pronunciation (字音), and character meaning (字義)

Page 23: Notes about my studies of Information Engineering and Natural Language Processing

漢字基因 (Chinese Character Genes) (2/2)

• Character code (字碼): the 25 Cangjie codes
• Character order (字序): ordering by the 24 Cangjie character letters
• Character form (字形)
  – 9 vector stroke shapes and 64 character roots, used by the font engine to compose characters
  – occupies only 160 KB of system space and can compose nearly ten million character forms, with stepless scaling and a choice of known typeface variations; for composition speed, on a P450 a 16x16 glyph can be generated and displayed at 46,000 characters per second
• Character recognition (字辨): 73 (9+64) classes of glyph-gene features, converted into character codes
• Character pronunciation (字音): a waveform-tracking method based on the phono-semantic (形聲) principle of the Six Writings (六書)
• Character meaning (字義): 512 meaning genes
  – 1/3 from the Song Confucians' "substance, function, cause, effect" (體用因果)
  – 1/4 from "common-sense definitions" (常識定義)

Page 24: Notes about my studies of Information Engineering and Natural Language Processing

TinySVM

• Supports standard C-SVR and C-SVM

• Uses sparse vector representation

• Can handle tens of thousands of training examples and hundreds of thousands of feature dimensions

• Fast optimization algorithms stemming from SVM_light

Page 25: Notes about my studies of Information Engineering and Natural Language Processing

+1 1:0.5 2:0.5
+1 1:1 2:1
+1 1:2 2:2
+1 1:3 2:2
+1 1:4 2:2
-1 1:2 2:1
-1 1:2 2:1.5
-1 1:3 2:1.5
-1 1:4 2:1.5
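The lines above are in the sparse format shared by TinySVM and SVMlight: each line starts with a label (+1 or -1) followed by index:value pairs for the non-zero features. A small parsing sketch, just to spell the format out:

# Parse a TinySVM/SVMlight-style line:  <label> <index>:<value> <index>:<value> ...
def parse_sparse_line(line):
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features

# First line of the training set above: (1, {1: 0.5, 2: 0.5})
print(parse_sparse_line("+1 1:0.5 2:0.5"))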

Page 26: Notes about my studies of Information Engineering and Natural Language Processing

Steps

• Define the feature space

• Get feature values from the [training|testing] set

• Create a model from the feature values of the training set
  – svm_learn -t 1 -d 2 -c 1 news.train news_model

• Verify the testing set with its feature values
  – svm_classify -V news.test news_model
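As a rough equivalent of the two commands above, here is a sketch that uses scikit-learn in place of the TinySVM binaries; the kernel settings mirror -t 1 -d 2 -c 1 (degree-2 polynomial kernel, C=1), and news.train / news.test are assumed to be in the sparse format shown on the previous slide.

# Rough scikit-learn stand-in for:
#   svm_learn -t 1 -d 2 -c 1 news.train news_model
#   svm_classify -V news.test news_model
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

# The sparse TinySVM/SVMlight format can be read directly.
X_train, y_train = load_svmlight_file("news.train")
X_test, y_test = load_svmlight_file("news.test", n_features=X_train.shape[1])

model = SVC(kernel="poly", degree=2, C=1.0)   # mirrors -t 1 -d 2 -c 1
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))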

Page 27: Notes about my studies of Information Engineering and Natural Language Processing

My Trial

• 13 features are defined

• Training set: 4 articles
  – 2 are annotated as "advantage of the government"
  – 2 are annotated negative

• 2 test articles