Using corpora to study classifiers in Mandarin Chinese

Richard Xiao

08/12/2006, Berlin COST Action A31 WG1 Meeting

Chinese corpus linguistics

• In relation to English, Chinese has a much shorter history of using corpora
  – Sinica Balanced Corpus of Chinese
    • The first annotated corpus of Mandarin
    • Freely accessible online since the mid-1990s
• Rapid progress over the last decade
  – Corpus building and exploration technology
  – Publicly available corpus resources

Chinese text processing

• Computational processing of Chinese text is more complex than that of English
• Chinese text is encoded in double-byte native encodings
  – Potential confusion of bytes in running text
  – GB2312 for Simplified Chinese (SC) and Big5 for Traditional Chinese (TC)
  – The advent of Unicode has facilitated Chinese computing
    • But most existing data and tools are based on native encodings
• Word tokenization is an essential first step in serious Chinese computing
  – Defining legitimate “words” in running text
  – Involving dictionary matching and the use of statistical models
• Part-of-speech tagging depends on the results of tokenization
  – Accuracy of tokenization: 98%
  – Accuracy of POS tagging: 96%
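The dictionary-matching step mentioned above can be sketched as forward maximum matching, one simple and widely taught approach. The slides do not say which algorithm was actually used, so the function name `fmm_tokenize`, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def fmm_tokenize(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (hypothetical); a real system would load a large dictionary.
LEXICON = {"研究", "研究生", "生命", "起源"}

# Intended reading: 研究 / 生命 / 起源 ("study the origin of life").
tokens = fmm_tokenize("研究生命起源", LEXICON)
print(tokens)
```

On this input, greedy matching picks 研究生 ‘graduate student’ first and strands 命, which is one reason dictionary matching is usually combined with statistical models.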

Concordancers for Chinese

• Many concordancers designed for English do not work well with Chinese data
• There are presently three types of tools for Chinese
  – Unicode-based tools
    • WordSmith version 4 (commercial product)
    • Xaira (open source freeware)
  – Concordancers dependent on language support packs (or in WinXP, default non-Unicode font set as Chinese)
    • AntConc (freeware)
    • ConcApp (freeware)
    • MonoConc Pro (commercial product)
    • Concordance (shareware)
  – Web-based query systems bundled with specific online corpora
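As a rough illustration of what a Unicode-based concordancer does with tokenized Chinese, here is a minimal key-word-in-context (KWIC) sketch. It is not how any of the tools listed above are implemented; the function name `kwic`, the `span` parameter, and the sample sentence are hypothetical.

```python
def kwic(tokens, node, span=5):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


# Hypothetical pre-tokenized sentence: "He bought three books and a photo."
TOKENS = ["他", "买", "了", "三", "本", "书", "和", "一", "张", "照片"]

for left, node, right in kwic(TOKENS, "本", span=2):
    print(f"{left:>10}  [{node}]  {right}")
```

Because Python 3 strings are Unicode, the same code works for Simplified and Traditional text once the file has been decoded from its native encoding.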

Chinese corpus resources

• Sinica Balanced Corpus
  – http://www.sinica.edu.tw/SinicaCorpus/
• Sinica Tagged Corpus of Early Mandarin
  – http://www.sinica.edu.tw/Early_Mandarin/
• Modern Chinese Language Corpus
  – http://219.238.40.213:8080/CpsQrySv.srf
• PKU-CCL Chinese Corpus
  – http://ccl.pku.edu.cn/YuLiao_Contents.Asp
• BLCU Modern Chinese Corpus
  – http://202.112.195.8:8089/ccir_login?input=*
• Chinese Internet Corpus
  – http://corpus.leeds.ac.uk/query-zh.html
• Lancaster Corpus of Mandarin Chinese
  – http://www.ling.lancs.ac.uk/corplang/lcmc/
• Lancaster Los Angeles Spoken Chinese Corpus
  – http://www.ling.lancs.ac.uk/corplang/llscc/
• More details of more corpora in more languages are on the handout

Lancaster Corpus of Mandarin Chinese (LCMC)

• Designed as a Chinese match for FLOB and Frown
• Representing written Mandarin as used in mainland China in the early 1990s
• A balanced corpus of one million words in 500 samples proportionally taken from 15 text categories
• Marked up in XML and encoded in Unicode
• Tokenized and POS tagged
• Freely searchable online
  – http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
• Released by ELRA and OTA free of charge for academic and educational purposes
• An indexed version for use with Xaira is available
• V1.2 incorporates validated details of classifier use

Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)

• One million words of spoken Mandarin
• Both dialogues (55%) and monologues (45%)
• Both spontaneous (57%) and scripted (43%) speech
• Seven spoken registers
  – face-to-face conversation, telephone conversation, play/movie scripts, TV talk show transcripts, formal debates, spontaneous oral narrative, edited oral narrative
• Marked up in XML and encoded in Unicode
• Tokenised and POS tagged
  – The Telephone Conversation part is tagged with details of classifier use
  – The unannotated version of this part is available from the LDC as CallHome Mandarin Transcripts
• More information
  – http://www.ling.lancs.ac.uk/corplang/llscc/

Annotation scheme for classifiers (q)

Tag   Gloss
qu    Unit classifier
ql    Collective classifier
qa    Arrangement classifier
qc    Container classifier
qm    Standard measure
qs    Species classifier
qt    Temporal classifier
qv    Verbal classifier
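Given text tagged with the q-tags above, counting classifiers by type is straightforward. The sketch below assumes the corpus has already been read into (word, tag) pairs; the helper `tally_classifiers`, the sample data, and the non-classifier tags in it are hypothetical placeholders, not the corpus's actual file format.

```python
from collections import Counter

# Tag glosses as listed in the annotation scheme.
CLASSIFIER_TAGS = {
    "qu": "Unit classifier",
    "ql": "Collective classifier",
    "qa": "Arrangement classifier",
    "qc": "Container classifier",
    "qm": "Standard measure",
    "qs": "Species classifier",
    "qt": "Temporal classifier",
    "qv": "Verbal classifier",
}


def tally_classifiers(tagged_tokens):
    """Count classifier tokens by type, ignoring all other tags."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        if tag in CLASSIFIER_TAGS:
            counts[CLASSIFIER_TAGS[tag]] += 1
    return counts


# Hypothetical tagged sample; "m", "n", "v" and "u" are placeholder tags.
SAMPLE = [
    ("三", "m"), ("本", "qu"), ("书", "n"),
    ("去", "v"), ("了", "u"), ("两", "m"), ("次", "qv"),
    ("一", "m"), ("双", "ql"), ("鞋", "n"),
]

print(tally_classifiers(SAMPLE))
```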

Why classifiers are necessary (1)

• Grammatically mandatory
    san ben shu          *san shu
    three CL book        three book
    three books          three books

• Distinguishing between word senses
    yi tiao xian         yi gen xian
    one CL line          one CL thread
    a line               a thread

Why classifiers are necessary (2)

• Resolving syntactic ambiguity
  – Example A)
      Ho laozong gei-le ta yi-ba shouqiang
      Ho general give-Asp him one-CL pistol
      General Ho gave him a pistol.
  – Example B)
      Ho laozong gei-le ta yi shouqiang
      Ho general give-Asp him one pistol (CL)
      General Ho shot him once with a pistol.

Use and name of classifiers

• The use of “classifiers” dates back more than 3,300 years
  – Oracle bone inscriptions excavated from the Yin Ruins (1300-1100 B.C.)
• Classifiers became established as a separate word class in Chinese only in the 1950s
  – Ding et al (1952): A Talk on Grammar in Modern Chinese
• Different terms had been used for classifiers
  – But mainly treated as a subclass of nouns

Syntactic features of classifiers

• Classifiers were the last of the 11 word classes in Chinese to be recognised, because they cannot be used independently as sentential constituents
• Typically following a numeral or a demonstrative pronoun: zhe (这) ‘this’, na (那) ‘that’, or na (哪) ‘which’
• Monosyllabic classifiers can be reduplicated to function as different sentential constituents, expressing a general grammatical meaning with different situational variants (Guo 1999)
  – Co-existence or repetition of entities or events
    • “All around”, “many”, “one by one”, “continuous”

Levels of grammaticalization

• Specialised classifiers
  – Fully grammaticalized
  – Functioning as classifiers only
  – Bleaching of lexical meaning, difficult to find translation equivalents in a non-classifier language
  – E.g. (n) 个, 件, 块, 颗, 辆, 枚, 匹, 幢; (v) 次, 遍, 场, 顿, 番, 回, 通, 趟, 下, 阵
• Concurrent classifiers
  – Mainly derived from nouns and verbs
  – Can be used as nouns/verbs and classifiers
  – The classifier use is semantically related to the lexical meaning of the original noun/verb
  – E.g. 口, 头, 台; 瓶, 碗; 包, 封, 卷, 捆, 束
• Temporary borrowings
  – Mainly borrowed from nouns, verbs, and adjectives
  – Functioning as classifiers only on an ad hoc basis
  – Full lexical meaning
  – E.g. 脸 (face), 屋子 (house); 刀 (knife), 枪 (gun), 脚 (foot), 拳 (fist)

Semantic types of classifiers (1)

• Nominal classifiers (6 types): quantifying nouns
  – Unit classifiers
    • Count individual entities
    • E.g. 个 (63.5% of unit classifiers, 38.8% of all classifiers), 位, 条, 张, 名, 件, 句, 家, 项, 封, 只, 片, 步, 块, 部, 份, 座, 届, 口, 支
  – Collective classifiers
    • Provide a collective reference for separate entities
    • E.g. 套 ‘set’, 批 ‘batch’, 双 ‘pair’, 系列 ‘series’, 副 ‘pair’, 群 ‘group’, 代 ‘generation’, 组 ‘group’, 对 ‘pair’, 队 ‘team’
  – Arrangement classifiers
    • Also refer to a collection, but focus on constellation aspect (shape), i.e. how entities are arranged or grouped together
    • E.g. 层 ‘layer’, 堆 ‘pile’, 团 ‘ball’, 沓 ‘pad’, 串 ‘string’, 丝 ‘thread’, 排 ‘row’, 把 ‘handful’, 滴 ‘drop’, 束 ‘bunch’, 缕 ‘thread’, 行 ‘row’

Semantic types of classifiers (2)

• Nominal classifiers: quantifying nouns
  – Standard measure classifiers
    • Express exact measures of various kinds, in local or international units
    • E.g. 元, 块, 米, 吨, 克, 美元, 里, 厘米, 亩, 度, 平方米, 斤, 公里, 公斤, 分, 尺, 升, 丈, ℃
  – Container classifiers
    • Denote types of containers, which are borrowed temporarily to provide an inexact measure of mass or entities usually associated with such containers
    • E.g. 杯, 碗, 盒, 袋, 桶, 脸, 瓶, 壶, 盆, 盘, 锅, 瓢, 箱, 筐, 包, 匙, 罐, 腔, 坛, 锹, 盅, 车, 斗, 肚子
      – Special container classifiers can only take yi, which then means ‘full’; they are more descriptive than quantifying
  – Species classifiers
    • Denote the type of entities grouped together
    • E.g. 种 (kind, over 90%), 类 (sort), 级 (grade), 样 (type), 等 (grade), 品 (class)

Semantic types of classifiers (3)

• Verbal classifiers: quantifying verbs
  – 9 specialised verbal classifiers
    • E.g. 次 (times, 40.8% of all verbal classifiers), 下 (stroke), 场 (course of action), 番 (once over), 阵 (step of action), 趟 (return journey), 回 (times), 遍 (once through), 顿 (criticising, abusing)
  – Borrowed verbal classifiers
    • An open set, mostly nouns denoting tools and related items
    • E.g. 声, 眼, 口, 刀, 脚, 拳, 巴掌, 枪, 棒
• Temporal classifiers: measuring time
  – Exact measures
    • 年, 天, 岁, 分钟, 小时, 夜, 周, 周年, 日, 周岁, 月, 载, 星期, 昼夜, 刻, 宿, 宵, 礼拜, 旬
  – Inexact measures
    • E.g. 会儿, 段, 辈子, 阵子, 会, 阵, 瞬间

Classifiers in writing and speech

• Unit classifiers are by far the most common, in speech and writing
• Because of the weight of the generalised classifier ge, unit classifiers are particularly frequent in speech
• Other common types: temporal, verbal
• Infrequent types: container, arrangement, collective

[Figure: bar chart of frequency per 100,000 tokens (y-axis, 0 to 2,500) by classifier type (x-axis: Arrangement, Container, Collective, Std measure, Species, Temporal, Unit, Verbal), comparing LCMC and CallHome]
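Comparing LCMC with the CallHome data requires normalising raw counts, since the two samples differ in size; the chart reports frequency per 100,000 tokens. A minimal sketch of that normalisation follows. The function name `per_100k` and every number in the example are hypothetical and are not figures from either corpus.

```python
def per_100k(raw_count, corpus_tokens):
    """Normalise a raw frequency to occurrences per 100,000 tokens."""
    return raw_count * 100_000 / corpus_tokens


# Hypothetical counts of unit classifiers in two samples of different size.
written = per_100k(20_000, 1_000_000)
spoken = per_100k(7_500, 300_000)

print(f"written: {written:.0f} per 100K, spoken: {spoken:.0f} per 100K")
```

With these made-up inputs the smaller sample has the lower raw count but the higher normalised frequency, which is the comparison the raw figures alone would hide.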

Variation across genres

• Apart from the speech-writing difference, various genres also differ in classifier use
• Most frequent in news reportage (A), humour (R), and speech (S): over 3,000 per 100K words
• Least common in news review (B), news editorial (C), religious writing (D), and academic prose (J): below 2,000 per 100K words
• Generally more common in imaginative writing (K-R) and speech (S) than in informative writing (A-J)

[Figure: bar chart of classifier frequency per 100K words (y-axis, 0 to 4,000) by genre (x-axis: A B C D E F G H J K L M N P R S)]

Distribution of classifier types

[Figure: stacked bar chart of the proportion (0% to 100%) of each classifier type by genre (x-axis: A B C D E F G H J K L M N P R S); series: Verbal, Unit, Temporal, Species, Std measure, Collective, Container, Arrangement]

• Distribution of different types of classifiers also varies across genres
• Unit classifier is the most common type in all genres (2/3 of all classifiers)
• Container, arrangement, and collective classifiers are relatively rare in all genres
• Std measure classifiers are most frequent in news reportage (A) and official docs (H)
• Species classifiers are more common in informative than imaginative writing

Cognitive basis of classifier use

• Allan (1977): number of dimensions
• Adams and Conklin (1973): elasticity, hardness, discreteness
• Shi (2001): ratio between different dimensions, and materiality
• Dimensions and use of classifiers
  – 0-D: point, e.g. yi dian (点) mo ‘a point of ink’
  – 1-D: line, e.g. yi xian (线) xiwang ‘a thread of hope’
  – 2-D: area (Y being the longer dimension)
    • Y/X >> 1 -> zhang (张): e.g. yi zhang zhaopian ‘a photo’
    • Y/X >> 0 -> tiao (条): e.g. yi tiao malu ‘a road’
  – 3-D: block (Q = Y/X)
    • Z/Q >> 0 -> pian (片): e.g. yi pian shuye ‘a leaf’
    • Z/Q >> 1 -> kuai (块): e.g. yi kuai tang ‘a lump of sugar’
    • Z/Q >> sufficiently large -> gen (根): e.g. yi gen dianxian ‘a cable’
• While the use of nominal classifiers is closely associated with shape, this is not the only criterion: nouns and classifiers co-select each other
  – Five co-selection criteria

Co-selection by similarity

• Classifiers are closely related to shapes which are historically associated with the nouns that have given rise to these classifiers, e.g. tiao (条)
  – tiao (条): ‘small branch/twig’ -> ‘long, narrow, flexible’: jie (街) ‘street’, tui (腿) ‘leg’, lu (路) ‘road’, xian (线) ‘line; thread’, he (河) ‘river’, yu (鱼) ‘fish’, etc; ‘bamboo slips for writing’ -> guiding (规定) ‘regulation’, jianyi (建议) ‘suggestion’, falu (法律) ‘law’, xinwen (新闻) ‘news’, etc
  – kuai (块): ‘soil lump/block’ -> something of a lumpy/blocky shape, e.g. a wrist watch; ‘territory soil’ -> something with a boundary, e.g. a scar

Co-selection by metonymy

• The original lexical meanings of classifiers refer to the most salient features of the entities being classified, e.g.
  – kou (口) ‘mouth’ (for pigs), tou (头) ‘head’ (for cattle), wei (尾) ‘tail’ (for fish), ding (顶) ‘top’ (for hats, sedan chairs etc)
• BUT long-term linguistic conventions are always important in language use
  – *tou: rabbit, cat
  – *wei: peacock, squirrel

Co-selection by relatedness

• The original lexical meanings of classifiers refer to actions closely related to entities being classified, e.g.
  – bao (包) ‘wrap -> pack (result of packing)’
  – chuan (串) ‘string together -> string, bunch’
  – kun (捆) ‘tie up, fasten -> bundle’
  – peng (捧) ‘hold in both hands -> a double handful’

Co-selection by association

• The original lexical meanings of classifiers refer to tools, containers, and places, etc closely associated with the entities being classified, e.g.
  – dao (刀) ‘knife -> a cut of (meat)’
  – wan (碗) ‘bowl -> a bowl of (rice)’
  – chuang (床) ‘bed -> a bed of (quilt/sheet etc)’
  – mu (幕) ‘curtain -> an act of (play)’

Co-selection by conventions

• Sometimes, co-selection has to be interpreted by following linguistic conventions, because it is not always possible to track the grammaticalization path of a classifier to ascertain the relationship between its original lexical meaning and the entities being classified
  – In what way is tiao historically related to renming ‘human life’?
  – Why is tou used for pigs and cattle but not rabbits or cats?
  – Why is wei used for fish but not for peacocks or squirrels, even though their tails are as salient as, if not more salient than, those of fish?
• Such missing links have to be accounted for by linguistic conventions of the speech community

Collocates

• Let’s now have a look at the noun collocates of some common classifiers in Chinese to see how well the proposed co-selection criteria work
• Defining collocates (in 2 million words)
  – Window span of L5-R5
  – z > 3.0
  – Minimum co-occurrence frequency of 5
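The three thresholds above can be applied as in the sketch below. The slides do not state which z-score formula or tool was used, so the formula here is an assumption: one common formulation, with p = F(collocate) / (N - F(node)), expected frequency E = p × F(node) × window size, and z = (O - E) / sqrt(E × (1 - p)). The function name `collocates` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def collocates(tokens, node, span=5, min_z=3.0, min_freq=5):
    """Return {word: (co-occurrence frequency, z-score)} for collocates of `node`."""
    n = len(tokens)
    freq = Counter(tokens)
    node_freq = freq[node]

    # Count every token that falls within `span` tokens left or right of the node.
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            cooc.update(tokens[max(0, i - span):i])
            cooc.update(tokens[i + 1:i + 1 + span])

    results = {}
    for word, observed in cooc.items():
        if word == node or observed < min_freq:
            continue
        p = freq[word] / (n - node_freq)  # chance of `word` at any non-node slot
        if p >= 1:
            continue  # degenerate case: z-score undefined
        expected = p * node_freq * 2 * span
        z = (observed - expected) / math.sqrt(expected * (1 - p))
        if z > min_z:
            results[word] = (observed, round(z, 1))
    return results


# Artificial toy corpus (hypothetical): 张 always precedes 照片; the rest is filler.
TOY = ["一", "张", "照片", "。"] * 6 + ["他", "来", "了", "。"] * 44

found = collocates(TOY, "张")
print(found)
```

In the toy data the full stop occurs near 张 often enough to pass the frequency threshold, but it is so frequent overall that its z-score is negative, which is why a significance threshold is applied alongside the raw frequency cut-off.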

Collocates of zhang (张 )

Collocate   Gloss                        Frequency   z-score
牌          playing card                 64          85.9
纸条        notepaper                    9           49.7
支票        cheque                       6           40.4
照片        photo                        7           26.6
票          ticket                       10          21.5
纸          paper                        13          21.2
脸          (thick/thin) face, cheek     12          17.2
皮          skin/leather                 7           17.2
画          drawing                      6           9.5
床          (prototypical) bed           6           9.3

Collocates of tiao (条 ) – 1

Collocate   Gloss             Frequency   z-score
规定        stipulation       51          41.9
条例        regulation        11          26.1
街          street            11          25.4
腿          leg               14          22.5
车道        (traffic) lane    6           20.4
路          road              23          19.7
直线        straight line     6           19.4
河          river             6           11.7

Collocates of tiao (条) – 2

Collocate   Gloss           Frequency   z-score
指令        instruction     6           10.7
建议        suggestion      7           9.2
鱼          fish            9           8.8
线          line; thread    7           7.4
原则        principle       8           7.0
意见        comment         6           5.3
新闻        news            6           4.3

Collocates of kuai (块 )

Collocate   Gloss          Frequency   z-score
平地        level ground   6           59.0
石头        stone          11          51.1
布          cloth          6           23.0
地          land, field    9           3.2

Collocates of ge (个)

• Generalised classifier ge (个): bamboo (竹) split into halves, initially as a counter for bamboos and arrows; when a bamboo chip is used for counting, it becomes a symbol of the entity being counted. In other words, the entity loses its shape, colour, function or any other attribute and becomes a unit of counting, ge.
• Ge can be used for any noun (people or things, large or small) that does not have a specific classifier, and it can be used to replace specific classifiers of many nouns.
• A total of 115 noun collocates
• 29 refer to human beings, 86 to non-human entities
• 66 refer to concrete entities, 49 to abstract entities
  – 12 related to time
• Top 20 noun collocates (z>8.8, F>5, in the order of z-scores)
  – 月 ‘month’, 星期 ‘week’, 人 ‘person’, 小时 ‘hour’, 电话 ‘phone call’, 礼拜 ‘week’, 字 ‘character’, 百分点 ‘percentage’, 地方 ‘place’, 角落 ‘corner’, 项目 ‘project’, 钟头 ‘hour’, 问题 ‘problem, question’, 电饭锅 ‘rice cooker’, 女人 ‘woman’, 字儿 ‘character’, 例子 ‘example’, 盒子 ‘box’, 照相机 ‘camera’, 东西 ‘stuff’

Classifiers for dongxi (东西)

• A noun with a rather general and vague referent; can refer to anything, but not human beings
  – It is an insult to say someone is a dongxi, or is not a dongxi
• The vagueness in reference makes it possible to use a nominal classifier of any type for dongxi
• Unit classifier
  – (General) ge (个), jian (件) ‘piece’, fen (份) ‘portion’
  – (Shape) tiao (条), zhang (张), and kuai (块)
  – (Book/paper) ben (本, for books), pian (篇, for a piece of writing)
• Collective classifier
  – tao (套) ‘set’
• Arrangement classifier
  – dui (堆) ‘pile’
• Container classifier
  – xiangzi (箱子) ‘box’, bao (包) ‘pack’
• Standard measure classifier
  – dun (吨) ‘ton’
• Species classifier
  – yang (样) ‘type’, zhong (种) ‘kind’, lei (类) ‘class’

Variations

• Not all instances of classifier use are in line with these co-selection criteria
  – Regional variation
    • dao (刀) ‘knife’
      – Mandarin: yi-ba (把) dao ‘a knife’
      – Cantonese: yi-zhang (张) dao ‘a knife’
    • niu (牛) ‘cattle’
      – Mandarin: yi-tou (头) niu ‘a cow’
      – Wu: yi-zhi (只) niu ‘a cow’
    • ren (人) ‘person’
      – Mandarin: yi-ge (个) ren
      – Fuzhou: yi-zhi (只) ren
  – Unconventional, creative use of classifiers often found in literary works
  – Diachronic variation

Thank you!