Notes about my studies of Information Engineering and Natural Language Processing by Changhua Yang, 04/09, 2003.


Page 1: Notes about my studies of Information Engineering and Natural Language Processing

Notes about my studies of Information Engineering and Natural Language Processing

by Changhua Yang, 04/09, 2003.

Page 2: Notes about my studies of Information Engineering and Natural Language Processing

Outline

• Knowledge Management
  – information from images

• Classification Problem
  – SVM

• A Chinese product
  – 漢字基因 (Chinese Character Genes)

• SVM Tool Demo

Page 4: Notes about my studies of Information Engineering and Natural Language Processing

• Data: a compressed JPEG file, a {(x, y, color)} bit mapping
  – Metadata: data describing data

• Information:
  – a dog (狐狸狗, a Pomeranian) on grassland

• Knowledge:
  – daytime photograph
  – an easy case for outlining the objects

Page 6: Notes about my studies of Information Engineering and Natural Language Processing

Problem Conversion

• Asking "is this a dog?"
  – form a temporary classifier {dog, !dog} from the Knowledge

• Asking "is there a dog in this picture?" (see the sketch below)
  – Phase 1: identify all objects
  – Phase 2: for each object, determine {dog, !dog}

• Asking "is there a 狐狸狗 (Pomeranian) in this picture?"
  – Option 1: a classifier for {狐狸狗, !狐狸狗} trained on the training sets of all objects
  – Option 2: the same, but trained only on the dogs
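A minimal sketch of the two-phase conversion for "is there a dog in this picture?"; segment_objects and dog_classifier are hypothetical placeholders, not components described in these notes.

# Sketch only: segment_objects() and dog_classifier() are hypothetical placeholders.
def contains_dog(image, segment_objects, dog_classifier):
    # Phase 1: identify all candidate objects in the image.
    objects = segment_objects(image)
    # Phase 2: run the binary {dog, !dog} classifier on each object.
    return any(dog_classifier(obj) == "dog" for obj in objects)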

Pages 7-10: (figure-only slides, no text to transcribe)
Page 11: Notes about my studies of Information Engineering and Natural Language Processing

Shallow Semantic Parsing using Support Vector Machines

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, and Daniel Jurafsky

HLT-NAACL 2004

Page 12: Notes about my studies of Information Engineering and Natural Language Processing

Using PropBank

• PropBank (Kingsbury et al., 2002)
  – a 300k-word corpus: the Wall Street Journal (WSJ) part of the Penn TreeBank (Marcus et al., 1994), with hand-corrected parses
  – predicate-argument relations are marked for part of the verbs

• The arguments of a verb are labeled ARG0 to ARG5 (see the illustrative example below)
  – ARG0 is the PROTO-AGENT (usually the subject)
  – ARG1 is the PROTO-PATIENT (usually its direct object)

• PropBank attempts to treat semantically related verbs consistently
  – in addition to these CORE ARGUMENTS, additional ADJUNCTIVE ARGUMENTS, referred to as ARGMs, are marked
  – some examples are ARGM-LOC for locatives and ARGM-TMP for temporals
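To make the label set concrete, here is an illustrative PropBank-style labeling for a single predicate; the sentence and its labels are my own example, not taken from the corpus.

# Hypothetical PropBank-style labeling for the predicate "bought" in
# "Yesterday, the company bought the shares in New York."
labeled_arguments = {
    "the company": "ARG0",      # PROTO-AGENT: the buyer
    "the shares":  "ARG1",      # PROTO-PATIENT: the thing bought
    "Yesterday":   "ARGM-TMP",  # temporal adjunct
    "in New York": "ARGM-LOC",  # locative adjunct
}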

Page 13: (figure-only slide, no text to transcribe)
Page 14: Notes about my studies of Information Engineering and Natural Language Processing

Problem Description: Shallow Semantic Parsing

• Argument Identification
  – the process of identifying parsed constituents in the sentence that represent semantic arguments of a given predicate

• Argument Classification
  – given constituents known to represent arguments of a predicate, assign the appropriate argument labels to them

• Argument Identification and Classification
  – a combination of the above two tasks (see the sketch below)
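A minimal sketch of how the three task variants relate; identify and classify below are hypothetical stand-ins for the SVM classifiers described on the later slides.

# Sketch only: identify() and classify() are hypothetical stand-ins for the
# NULL vs NON-NULL filter and the OVA label classifiers described later.
def argument_identification(constituents, predicate, identify):
    # Keep the parsed constituents that are arguments of the predicate.
    return [c for c in constituents if identify(c, predicate)]

def argument_classification(arguments, predicate, classify):
    # Assign an argument label (ARG0, ARG1, ..., ARGM-TMP, ...) to each argument.
    return [(c, classify(c, predicate)) for c in arguments]

def identification_and_classification(constituents, predicate, identify, classify):
    # The combined task: identify first, then classify what was kept.
    found = argument_identification(constituents, predicate, identify)
    return argument_classification(found, predicate, classify)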

Page 15: Notes about my studies of Information Engineering and Natural Language Processing

Baseline Features

• Predicate
• Path, e.g. NP↑S↓VP↓VBD
• Phrase Type (NP, PP, S)
• Position
• Voice
• Head Word
  – syntactic head
• Sub-categorization, e.g. VP->VBD-PP (see the feature sketch below)
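As an illustration of the feature set above, a baseline feature vector for one (constituent, predicate) pair might look like the following; the values are made up for illustration, not taken from the paper.

# Illustrative baseline features for one (constituent, predicate) pair;
# the values are invented for illustration only.
baseline_features = {
    "predicate":         "sold",
    "path":              "NP↑S↓VP↓VBD",  # tree path from constituent to predicate
    "phrase_type":       "NP",           # syntactic category of the constituent
    "position":          "before",       # before or after the predicate
    "voice":             "active",       # active or passive
    "head_word":         "executives",   # syntactic head of the constituent
    "subcategorization": "VP->VBD-PP",   # expansion of the predicate's parent node
}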

Page 16: Notes about my studies of Information Engineering and Natural Language Processing

Classifier and Implementation

• SVM: binary classifiers
  – One vs ALL (OVA) formalism
    • training n binary classifiers for an n-class problem
  – converts the multi-class problem into binary ones

• 80% of the nodes have NULL labels
  – a binary NULL vs NON-NULL classifier is trained first
  – the remaining data is used for training the OVA classifiers (see the sketch below)

• Tools
  – TinySVM
  – YamCha
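A minimal sketch of the NULL-filter plus one-vs-all scheme above, assuming scikit-learn as a stand-in for the TinySVM/YamCha tools; X is a NumPy feature matrix over candidate parse nodes and y holds labels such as "NULL", "ARG0", "ARGM-TMP". The degree-2 polynomial kernel mirrors the -t 1 -d 2 -c 1 settings used in the demo near the end of these notes.

# Sketch only: scikit-learn stands in for TinySVM/YamCha.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def train_two_stage(X, y):
    y = np.asarray(y)
    # Stage 1: binary NULL vs NON-NULL filter over all candidate nodes.
    null_filter = SVC(kernel="poly", degree=2, C=1.0)
    null_filter.fit(X, y == "NULL")
    # Stage 2: one-vs-all label classifiers trained on the NON-NULL examples only.
    non_null = y != "NULL"
    ova = OneVsRestClassifier(SVC(kernel="poly", degree=2, C=1.0))
    ova.fit(X[non_null], y[non_null])
    return null_filter, ova

def label_nodes(X, null_filter, ova):
    is_null = null_filter.predict(X)   # True where the filter says NULL
    labels = ova.predict(X)            # argument label for every node
    return np.where(is_null, "NULL", labels)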

Page 17: (figure-only slide, no text to transcribe)
Page 18: Notes about my studies of Information Engineering and Natural Language Processing

New Features

1. NE
2. Headword POS
3. Verb Clustering
4. Partial Path
5. Verb Sense Info
6. Head of PP
7. First and Last W/P
8. Ordinal position
9. Tree Distance
10. Relative Features
11. Temporal cue words
12. Dynamic class context

Page 19: Notes about my studies of Information Engineering and Natural Language Processing

Technology

• Culturecom (香港文化傳信), led by Chu Bong-Foo (朱邦復), the inventor of the Chinese Cangjie input method, is working with IBM to develop the Chinese embedded processor V-Dragon (飛龍). Combined with the Linux operating system, the aim is to bring personal-computer prices down to one third of today's level and break the Intel/Microsoft Wintel architecture.

• Culturecom's V-Dragon is a Chinese CPU (central processing unit) with 32,000 Chinese characters built in, and it runs the Linux-based operating system Midori Linux.

Page 20: Notes about my studies of Information Engineering and Natural Language Processing

UCLA Report Confirms

Culturecom Processor for Chinese Character Generation

• The SCS 1610 can generate about 32,000 characters in three fonts, at sizes ranging from 11x11 to 127x127 pixels.
• The display quality of the characters is optimized aesthetically for the sizes generated.
• The code and data for the generation algorithm and the character representations occupy no more than 256 KB.
• The speed of character generation is good.

This technology is the most effective solution for Chinese and other non-phonetic scripts.

Page 21: Notes about my studies of Information Engineering and Natural Language Processing

• Using a Chinese CPU in a fully Chinese environment, all Chinese glyphs are generated by the CPU as vectors; multiple typefaces can be produced and freely scaled up or down, with no Mask-ROM needed to store fonts. English is also fully supported.

Page 22: Notes about my studies of Information Engineering and Natural Language Processing

漢字基因 (Chinese Character Genes) (1/2)

• Chinese characters
  – ninety percent are phono-semantic compounds (形聲字)
  – besides the phonetic component, phono-semantic characters also have a "phonetic loan" (假借) function

• That is, the radical indicates the category, and the body of the character can serve as its definition
• For a character-lookup method, understanding the character's meaning is the first requirement

• The character-root concept is used to build a "vector glyph generator"
• The Chinese-character concept is found to have six major functions: character code (字碼), character order (字序), character form (字形), character recognition (字辨), character pronunciation (字音), and character meaning (字義)

Page 23: Notes about my studies of Information Engineering and Natural Language Processing

漢字基因 (Chinese Character Genes) (2/2)

• Character code (字碼): the 25 Cangjie codes
• Character order (字序): ordering by the 24 Cangjie character letters
• Character form (字形)
  – 9 vector stroke shapes and 64 character roots, used by the font engine to compose characters
  – occupies only 160 KB of system space and can compose nearly ten million character forms, with stepless scaling and a choice of known typeface variations; for composition speed, on a P450 a 16x16 glyph can be generated and displayed at 46,000 characters per second
• Character recognition (字辨): 73 (9+64) classes of glyph-gene features, converted into character codes
• Character pronunciation (字音): a waveform-tracking method based on the phono-semantic (形聲) principle of the Six Writings (六書)
• Character meaning (字義): 512 meaning genes
  – 1/3 from the Song Confucians' "substance, function, cause, effect" (體用因果)
  – 1/4 from "common-sense definitions" (常識定義)

Page 24: Notes about my studies of Information Engineering and Natural Language Processing

TinySVM

• Supports standard C-SVR and C-SVM

• Uses sparse vector representation

• Can handle tens of thousands of training examples and hundreds of thousands of feature dimensions

• Fast optimization algorithms stemming from SVM_light

Page 25: Notes about my studies of Information Engineering and Natural Language Processing

+1 1:0.5 2:0.5
+1 1:1 2:1
+1 1:2 2:2
+1 1:3 2:2
+1 1:4 2:2
-1 1:2 2:1
-1 1:2 2:1.5
-1 1:3 2:1.5
-1 1:4 2:1.5
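The lines above are in the sparse format shared by TinySVM and SVMlight: each line starts with a label (+1 or -1) followed by index:value pairs for the non-zero features. A small parsing sketch, just to spell the format out:

# Parse a TinySVM/SVMlight-style line:  <label> <index>:<value> <index>:<value> ...
def parse_sparse_line(line):
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features

# First line of the training set above: (1, {1: 0.5, 2: 0.5})
print(parse_sparse_line("+1 1:0.5 2:0.5"))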

Page 26: Notes about my studies of Information Engineering and Natural Language Processing

Steps

• Define the feature space

• Get feature values from the [training|testing] set

• Create a model from the feature values of the training set
  – svm_learn -t 1 -d 2 -c 1 news.train news_model

• Verify the testing set with its feature values
  – svm_classify -V news.test news_model
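As a rough equivalent of the two commands above, here is a sketch that uses scikit-learn in place of the TinySVM binaries; the kernel settings mirror -t 1 -d 2 -c 1 (degree-2 polynomial kernel, C=1), and news.train / news.test are assumed to be in the sparse format shown on the previous slide.

# Rough scikit-learn stand-in for:
#   svm_learn -t 1 -d 2 -c 1 news.train news_model
#   svm_classify -V news.test news_model
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

# The sparse TinySVM/SVMlight format can be read directly.
X_train, y_train = load_svmlight_file("news.train")
X_test, y_test = load_svmlight_file("news.test", n_features=X_train.shape[1])

model = SVC(kernel="poly", degree=2, C=1.0)   # mirrors -t 1 -d 2 -c 1
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))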

Page 27: Notes about my studies of Information Engineering and Natural Language Processing

My Trial

• 13 features are defined

• Training set: 4 articles
  – 2 are annotated as "advantage of the government"
  – 2 are annotated negative

• 2 test articles