Chinese Information Processing (I): Basic Concepts and Practice


Upload: tawana

Post on 28-Jan-2016


TRANSCRIPT

Page 1: Chinese Information Processing (I): Basic Concepts and Practice

Chinese Information Processing (I): Basic Concepts and Practice

Unit 1: The Chinese Language, Chinese Script, and Software

Page 2: Chinese Information Processing (I): Basic Concepts and Practice

Nomenclature

• Mandarin - Guanhua, an official language used in the court, the language of officials

• Guoyu - National language

• Putonghua - Common Speech, Common Language

• Huayu or Huawen - used in Singapore or overseas

• Hanwen - used in Korea and Japan

• Zhongguohua - Languages in China

• Zhongwen - alternative to Hanyu, focusing on written language

Page 3: Chinese Information Processing (I): Basic Concepts and Practice

Chinese dialects

• Northern (Beijing) 647,000,000
• Wu (Shanghai) 77,000,000
• Yue (Cantonese, Guangzhou) 47,000,000
• Xiang (Hunan, Changsha) 46,000,000
• Min South (Southern Fujianese, Xiamen) 28,000,000
• Min North (Northern Fujianese, Fuzhou, Taiwan) 11,000,000
• Hakka (Mei Xian) 37,000,000
• Gan (Jiangxi, Nanchang) 22,000,000

Page 4: Chinese Information Processing (I): Basic Concepts and Practice
Page 5: Chinese Information Processing (I): Basic Concepts and Practice

Pronunciation

• The dialects are mutually unintelligible.

• Northern dialects do not have the voiced sounds b-, d-, g-, z-, v- or the entering-tone endings -p, -t, -k, -ʔ.

• Wu dialect has voiced sounds and entering tones, and makes no distinction between z, c, s and zh, ch, sh.

• Cantonese has entering tones, but no voiced sounds.

Page 7: Chinese Information Processing (I): Basic Concepts and Practice

Tonal differences

The number of tones varies from dialect to dialect.

Mandarin - 4 tones

Tone   Name          Contour   Example
1      Yīn Píng      55        媽
2      Yáng Píng     35        麻
3      Shǎng Shēng   214       馬
4      Qù Shēng      51        罵

Page 8: Chinese Information Processing (I): Basic Concepts and Practice

Tones in Wu and Cantonese

Wu dialect - 5 tones

Tone   Name        Contour   Example
1      Yīn Píng    53        詩
2      Yīn Qù      34        使
3      Yáng Qù     13        時
4      Yīn Rù      5         識
5      Yáng Rù     12        食

Page 9: Chinese Information Processing (I): Basic Concepts and Practice

Tones in Wu and Cantonese

Cantonese - 9 tones

Tone   Name          Contour   Examples
1      Yīn Píng      55, 53    詩, 夫
2      Yáng Píng     21, 11    時, 扶
3      Yīn Shàng     35        使, 苦
4      Yáng Shàng    13        市, 婦
5      Yīn Qù        33        試, 富
6      Yáng Qù       22        事, 父
7      Yīn Rù        55        識, 忽
8      Zhōng Rù      33        泄, 法
9      Yáng Rù       22        食, 佛

Page 10: Chinese Information Processing (I): Basic Concepts and Practice

Vocabulary differences

Dialect     sun            thing         clothing      wife             we            know
Putonghua   太陽 tàiyáng   東西 dōngxi   衣服 yīfu     妻子 qīzi        我們 wǒmen    知道 zhīdào
Shanghai    太陽 tayang    物事 mez      衣裳 yizang   家主婆 kazibu    阿拉 ala      曉得 xiaode
Cantonese   熱頭 yaktou    野 ye         衫 sam        老婆 lopo        我地 ngodi    知 ji

Page 11: Chinese Information Processing (I): Basic Concepts and Practice

Why are dialect issues related to Chinese information processing?

1. Characters are often input by their pronunciation. If a person's pronunciation is not standard, the Pinyin they type will be incorrect, and they may not be able to retrieve the intended character.

2. Since all educated people know the structure of characters, the stroke count, the character components, or the radicals may be used to input characters instead.

3. When voice recognition software is developed, dialect accents must be taken into consideration.

4. When OCR software is developed, character structure must be taken into consideration.
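
To make point 1 concrete, here is a minimal sketch of how a Pinyin input method might tolerate a dialect-influenced spelling. It assumes the z/zh, c/ch, s/sh merger mentioned earlier for Wu speakers. The tiny lookup table and the function names (`fuzzy_variants`, `candidates`) are hypothetical and for illustration only; a real input method would use a full lexicon.

```python
# Toy Pinyin-to-character table (hypothetical; a real lexicon is far larger).
PINYIN_TABLE = {
    "zhi": ["知", "之"],
    "zi": ["子", "字"],
    "shi": ["是", "詩"],
    "si": ["四", "思"],
}

# Pairs of initials that some speakers do not distinguish (assumption:
# only the z/zh, c/ch, s/sh merger is modelled here).
FUZZY_INITIALS = [("zh", "z"), ("ch", "c"), ("sh", "s")]


def fuzzy_variants(syllable):
    """Return the typed syllable plus the variant with the merged initial swapped."""
    variants = [syllable]
    for retroflex, dental in FUZZY_INITIALS:
        if syllable.startswith(retroflex):
            variants.append(dental + syllable[len(retroflex):])
        elif syllable.startswith(dental):
            variants.append(retroflex + syllable[len(dental):])
    return variants


def candidates(syllable):
    """Collect candidate characters for the typed syllable and its fuzzy variants."""
    found = []
    for variant in fuzzy_variants(syllable):
        for char in PINYIN_TABLE.get(variant, []):
            if char not in found:
                found.append(char)
    return found


# A speaker who types "zi" while meaning "zhi" still sees 知 among the candidates.
print(candidates("zi"))
```

Exact matches are listed first, so a user with standard pronunciation is not penalised by the extra candidates.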

Page 12: Chinese Information Processing (I): Basic Concepts and Practice

Chinese script: Issues Related to Chinese Information Processing

• Number of characters
• Structure of characters
• Character evolution
• Traditional vs. simplified characters

Page 13: Chinese Information Processing (I): Basic Concepts and Practice

Number of Chinese Characters

Date   Dynasty or period   Name of dictionary   Number of characters
--------------------------------------------------------------------------
100    Eastern Han         Shuowen Jiezi        9,353
1615   Ming                Zihui                33,179
1716   Qing                Kangxi Zidian        47,035
1916   Republic            Zhonghua Da Zidian   48,000

(Source: Norman, 1988)

Page 14: Chinese Information Processing (I): Basic Concepts and Practice

Number of Frequently Used Chinese Characters

The Language and Script Committee and the Education Commission have published “The Frequently Used Characters of Modern Chinese”, which includes 2,500 most frequently used characters and 1,000 secondary frequently used characters.

(Source: Li Xingjian and Fei Jinchang, People’s Daily 9/25/2001.)

Page 15: Chinese Information Processing (I): Basic Concepts and Practice

Character Structure

Character: 好

Components: 女 子

Strokes: 丶 一 丨 丿

Page 16: Chinese Information Processing (I): Basic Concepts and Practice

Important to remember:

• Single characters: 一, 乙
• Compound characters: 明, 海
• Radicals: 女, 人, 口

• Characters can be decomposed
• Characters have some basic components
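
As a rough illustration of why decomposition matters for software, the sketch below stores a small decomposition table and looks characters up by component, which is the idea behind component-based input and indexing. The table covers only a few characters from these slides, and the names `DECOMPOSITION`, `decompose`, and `find_by_component` are hypothetical.

```python
# Toy decomposition table: character -> list of components.
# Only a handful of characters are included; this is not a complete resource.
DECOMPOSITION = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "海": ["氵", "每"],
    "媽": ["女", "馬"],
}


def decompose(char):
    """Return the components of a character, or the character itself if it is not in the table."""
    return DECOMPOSITION.get(char, [char])


def find_by_component(component):
    """Return every character in the table that contains the given component."""
    return [char for char, parts in DECOMPOSITION.items() if component in parts]


print(decompose("好"))          # components of 好
print(find_by_component("女"))  # characters containing the component 女
```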

Page 17: Chinese Information Processing (I): Basic Concepts and Practice

Character Evolution

Source: library.thinkquest.org/C004203/art/chinese.jpg

Page 18: Chinese Information Processing (I): Basic Concepts and Practice

Seal (Zhuanshu)

Page 19: Chinese Information Processing (I): Basic Concepts and Practice

Clerical (Lishu)

Page 20: Chinese Information Processing (I): Basic Concepts and Practice

Standard (Kaishu)

Page 21: Chinese Information Processing (I): Basic Concepts and Practice

Running (Xingshu)

Page 22: Chinese Information Processing (I): Basic Concepts and Practice

Cursive (Caoshu)

Page 23: Chinese Information Processing (I): Basic Concepts and Practice

Print Typefaces

Page 24: Chinese Information Processing (I): Basic Concepts and Practice

Definition of Character, Glyph, Typeface and Font

Character - an abstract notion indicating a class of shapes declared to have the same meaning or form.

Glyph - a specific instance of a character, e.g., 囘 回

Typeface - the printed style of a glyph or character set, e.g., 中, 中, 中, 中

Font - a single instance of a typeface, such as a specific point size, e.g., 中, 中, 中

Page 25: Chinese Information Processing (I): Basic Concepts and Practice

Traditional vs. simplified characters

Simplification of characters has long been a disputed topic in China. Advocacy for character simplification began in the early years of the Republic, but simplification was truly implemented only after 1949. In 1956, the Committee on Language Reform promulgated a list of 515 simplified characters and 54 simplified components or parts.

Currently, simplified characters are used in mainland China and Singapore. Traditional characters are used in Taiwan and Hong Kong. In overseas Chinese communities, a mixed situation can be observed.

Page 26: Chinese Information Processing (I): Basic Concepts and Practice

Problems this dual system causes for Chinese computing

• The computer must store two sets of characters (for display and printing), which makes the storage space required huge.

• Input methods based on strokes or components may differ, because the radicals or components of traditional and simplified characters are different. 谢 and 謝 have different radicals; 后 and 後 are completely different glyphs.

Page 27: Chinese Information Processing (I): Basic Concepts and Practice

Problem of Conversion

Traditional   後來 皇后 心臟 骯髒 關係

Simplified    后来 皇后 心脏 肮脏 关系

Problems caused by conversion from simplified characters to traditional characters (the correct form is given in parentheses):

后来 => 后來 (後來)

心脏 => 心髒 (心臟)

关系 => 關系 (關係)
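
The errors above can be reproduced with a short sketch. It compares a naive one-to-one character mapping with a conversion that checks a phrase table first. Both tables contain only the examples from this slide, and the names `CHAR_MAP`, `PHRASE_MAP`, `convert_by_char`, and `convert_with_phrases` are hypothetical. Longest-match phrase lookup is one common way to handle the ambiguity and is shown here as an assumption, not as the method of any particular product.

```python
# One-to-one character table: each simplified character gets a single default
# traditional form (assumption: the defaults shown produce the slide's errors).
CHAR_MAP = {
    "后": "后", "来": "來", "皇": "皇", "心": "心",
    "脏": "髒", "肮": "骯", "关": "關", "系": "系",
}

# Phrase table that resolves ambiguous characters by context.
PHRASE_MAP = {"后来": "後來", "心脏": "心臟", "关系": "關係"}


def convert_by_char(text):
    """Naive conversion: map each character independently."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_with_phrases(text, max_len=4):
    """Try the longest matching phrase first, then fall back to single characters."""
    out = []
    i = 0
    while i < len(text):
        # Try phrase lengths from longest down to 2 characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASE_MAP:
                out.append(PHRASE_MAP[chunk])
                i += n
                break
        else:
            # No phrase matched at this position: convert one character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


for word in ["后来", "皇后", "心脏", "肮脏", "关系"]:
    print(word, convert_by_char(word), convert_with_phrases(word))
```

Words such as 皇后 and 肮脏 are absent from the phrase table and still convert correctly through the character fallback.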

Page 28: Chinese Information Processing (I): Basic Concepts and Practice

Chinese Software

• Chinese word processors
• Chinese systems
  - Chinese Windows
  - Third-party Chinese systems

Page 29: Chinese Information Processing (I): Basic Concepts and Practice

DOS-based software

• Byx: a simple DOS-based Chinese word processor. It handles simplified characters only (GB code) and has only one print font.

• NJSTAR: a DOS- and Windows-based Chinese word processor that handles both simplified and traditional characters.

• Kuochiao: a DOS-based Chinese system for traditional characters (Big5 code).

• Yitien: same as Kuochiao.

• CCDOS: a DOS-based Chinese system for simplified characters (GB code).
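
The GB and Big5 codes mentioned above are incompatible encodings, which is why these programs could not read each other's files directly. The following minimal sketch uses Python's standard `gb2312` and `big5` codecs to show the effect; the helper name `can_encode` is hypothetical.

```python
def can_encode(text, codec):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The same character is stored as different bytes in the two codes.
gb_bytes = "中".encode("gb2312")
big5_bytes = "中".encode("big5")
print(gb_bytes.hex(), big5_bytes.hex())

# Reading GB bytes as if they were Big5 yields the wrong text.
misread = gb_bytes.decode("big5", errors="replace")
print(misread)

# Each code covers only its own character set: the simplified 谢 is not in
# Big5, and the traditional 謝 is not in GB2312.
for char in ["谢", "謝"]:
    print(char, can_encode(char, "gb2312"), can_encode(char, "big5"))
```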

Page 30: Chinese Information Processing (I): Basic Concepts and Practice

Windows-based software

Twinbridge http://www.twinbridge.com

Chinese Star http://www.suntendyusa.com/

Unionway http://www.unionway.com/tea/html/0/1.html

Richwin http://richwin.sina.com.cn/

Microsoft Cwindows and Pwindows

Microsoft multilingual support 2000 and XP 5.02

Installing and using the IME from the Office 2000 multilanguage pack

(Mac OS with multilingual support)