Chinese Information Processing (I): Basic Concepts and Practice
Unit 1: The Chinese Language, Chinese Script, and Software
Nomenclature
Mandarin - Guanhua, the official language used at court; literally "the language of officials"
Guoyu - national language
Putonghua - common speech, common language
Huayu or Huawen - terms used in Singapore and overseas
Hanwen - term used in Korea and Japan
Zhongguohua - languages in China
Zhongwen - alternative to Hanyu, focusing on the written language
Chinese dialects
Dialect group                                   Speakers
--------------------------------------------------------------
Northern (Beijing)                              647,000,000
Wu (Shanghai)                                    77,000,000
Yue (Cantonese, Guangzhou)                       47,000,000
Xiang (Hunan, Changsha)                          46,000,000
Min South (Southern Fujianese, Xiamen)           28,000,000
Min North (Northern Fujianese, Fuzhou, Taiwan)   11,000,000
Hakka (Mei Xian)                                 37,000,000
Gan (Jiangxi, Nanchang)                          22,000,000
Pronunciation
The dialects are mutually unintelligible.
Northern dialects have neither the voiced initials b-, d-, g-, z-, v- nor the entering-tone endings -p, -t, -k, -ʔ.
Wu dialect has voiced initials and entering tones, and makes no distinction between z, c, s and zh, ch, sh.
Cantonese has entering tones, but no voiced initials.
Tonal differences
The number of tones varies from dialect to dialect.
Mandarin - 4 tones
Tone   Name          Contour   Example
1      Yīn Píng      55        媽
2      Yáng Píng     35        麻
3      Shǎng Shēng   214       馬
4      Qù Shēng      51        罵
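Because the four Mandarin tones are written in Pinyin as diacritics, software often needs to turn a tone-marked syllable into the plain "letters plus tone number" form used for typing and indexing. The sketch below is only an illustration, not part of the original slides: the function name `pinyin_to_numbered` is hypothetical, and it assumes the tone marks are the standard Unicode combining macron, acute, caron, and grave accents.

```python
import unicodedata

# Combining diacritics used by Hanyu Pinyin, mapped to Mandarin tone numbers.
TONE_MARKS = {
    "\u0304": 1,  # macron (Yin Ping, contour 55)
    "\u0301": 2,  # acute  (Yang Ping, contour 35)
    "\u030c": 3,  # caron  (Shang Sheng, contour 214)
    "\u0300": 4,  # grave  (Qu Sheng, contour 51)
}


def pinyin_to_numbered(syllable: str) -> str:
    """Convert one tone-marked syllable (hypothetical helper) to numbered form."""
    tone = 0
    letters = []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose what is left, so that u + diaeresis becomes a single "ü" again.
    base = unicodedata.normalize("NFC", "".join(letters))
    return base + (str(tone) if tone else "")


for marked in ["mā", "má", "mǎ", "mà"]:
    print(marked, "->", pinyin_to_numbered(marked))
```

A syllable without a tone mark is returned unchanged, which is one simple way to represent the neutral tone.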
Tones in Wu and Cantonese
Wu dialect - 5 tones
Tone   Name        Contour   Example
1      Yīn Píng    53        詩
2      Yīn Qù      34        使
3      Yáng Qù     13        時
4      Yīn Rù      5         識
5      Yáng Rù     12        食
Cantonese - 9 tones
Tone   Name          Contour   Examples
1      Yīn Píng      55, 53    詩, 夫
2      Yáng Píng     21, 11    時, 扶
3      Yīn Shàng     35        使, 苦
4      Yáng Shàng    13        市, 婦
5      Yīn Qù        33        試, 富
6      Yáng Qù       22        事, 父
7      Yīn Rù        55        識, 忽
8      Zhōng Rù      33        泄, 法
9      Yáng Rù       22        食, 佛
Vocabulary differences
Dialect      sun        thing     clothing   wife      we        know
Putonghua    太陽       東西      衣服       妻子      我們      知道
             tàiyáng    dōngxi    yīfu       qīzi      wǒmen     zhīdào
Shanghai     太陽       物事      衣裳       家主婆    阿拉      曉得
             tayang     mez       yizang     kazibu    ala       xiaode
Cantonese    熱頭       野        衫         老婆      我地      知
             yaktou     ye        sam        lopo      ngodi     ji
Why are dialect issues related to Chinese information processing?
1. Characters are often input by their pronunciation. When a user's pronunciation is not standard, the Pinyin typed will be incorrect, and the intended character may not be retrieved.
2. Since all educated people know the structure of characters, the stroke count, character components, or radicals can be used to input characters instead.
3. When voice recognition software is developed, dialect accents must be taken into consideration.
4. When OCR software is developed, character structure must be taken into consideration.
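Point 1 can be illustrated with a small "fuzzy Pinyin" lookup, which widens the search so that a speaker who does not distinguish z, c, s from zh, ch, sh can still reach the intended character. This is a minimal sketch under stated assumptions: the lexicon is a toy table invented for the example, and the names `LEXICON`, `fuzzy_variants`, and `lookup` are hypothetical rather than taken from any real input method.

```python
# Toy Pinyin-to-character table; a real input method would use a full lexicon.
LEXICON = {
    "shi": ["是", "時", "詩"],
    "si": ["四", "思"],
    "zhi": ["知", "紙"],
    "zi": ["字", "子"],
}

# Pairs of initials that speakers of some dialects (e.g. Wu) do not distinguish.
FUZZY_INITIALS = [("zh", "z"), ("ch", "c"), ("sh", "s")]


def fuzzy_variants(syllable):
    """Return the syllable plus the variant with its confusable initial swapped."""
    variants = [syllable]
    for retroflex, dental in FUZZY_INITIALS:
        if syllable.startswith(retroflex):
            variants.append(dental + syllable[len(retroflex):])
        elif syllable.startswith(dental):
            variants.append(retroflex + syllable[len(dental):])
    return variants


def lookup(syllable, fuzzy=False):
    """Return candidate characters, exact matches first."""
    keys = fuzzy_variants(syllable) if fuzzy else [syllable]
    candidates = []
    for key in keys:
        candidates.extend(LEXICON.get(key, []))
    return candidates


print(lookup("si"))              # strict lookup: only characters filed under "si"
print(lookup("si", fuzzy=True))  # also offers the characters filed under "shi"
```

Listing the exact matches first keeps the behaviour unchanged for users with standard pronunciation while still offering the extra candidates.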
Chinese script: Issues Related to Chinese Information Processing
Number of characters
Structure of characters
Character evolution
Traditional vs. simplified characters
Number of Chinese Characters
==========================================================================
Date    Dynasty or period    Name of dictionary    Number of characters
--------------------------------------------------------------------------
100     Eastern Han          Shuowen Jiezi          9,353
1615    Ming                 Zihui                 33,179
1716    Qing                 Kangxi Zidian         47,035
1916    Republic             Zhonghua Da Zidian    48,000
(Source: Norman, 1988)
Number of Frequently Used Chinese Characters
The Language and Script Committee and the Education Commission have published "The Frequently Used Characters of Modern Chinese", which lists 2,500 frequently used characters and 1,000 secondary frequently used characters.
(Source: Li Xingjian and Fei Jinchang, People’s Daily 9/25/2001.)
Character Structure
Character: 好
Components: 女 子
Strokes: 丶 一 丨 丿
Important to remember:
Single characters: 一, 乙
Compound characters: 明, 海
Radicals: 女, 人, 口
Characters can be decomposed.
Characters have some basic components.
Source: library.thinkquest.org/C004203/ art/chinese.jpg
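The idea that characters decompose into components can be sketched as a lookup table. The example below is an assumption-laden illustration, not something from the slides: it borrows the Unicode Ideographic Description Characters U+2FF0 (left-to-right) and U+2FF1 (above-to-below) to record layout, and the table contents and the names `DECOMPOSITION`, `components`, and `find_by_component` are hypothetical.

```python
# Unicode Ideographic Description Characters describing component layout.
LEFT_RIGHT = "\u2ff0"  # components placed side by side
TOP_BOTTOM = "\u2ff1"  # components stacked vertically

# Hypothetical hand-made table: layout marker followed by the components.
DECOMPOSITION = {
    "好": LEFT_RIGHT + "女子",
    "明": LEFT_RIGHT + "日月",
    "海": LEFT_RIGHT + "氵每",
    "字": TOP_BOTTOM + "宀子",
}


def components(char):
    """Return the immediate components of char, or [char] if it is not in the table."""
    entry = DECOMPOSITION.get(char)
    if entry is None:
        return [char]
    return list(entry[1:])  # drop the leading layout marker


def find_by_component(component):
    """Return all table characters containing the component, in code point order."""
    return sorted(ch for ch, entry in DECOMPOSITION.items() if component in entry[1:])


print(components("好"))
print(find_by_component("子"))
```

A component-based input method or an OCR post-processor could use a table of this shape, in one direction to break a character down and in the other to list candidates sharing a component.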
Character Evolution
Seal (Zhuanshu)
Clerical (Lishu)
Standard (Kaishu)
Running (Xingshu)
Cursive (Caoshu)
Print Typefaces
Definition of Character, Glyph, Typeface and Font
Character - an abstract notion indicating a class of shapes declared to have the same meaning or form.
Glyph - a specific instance of a character, e.g., 囘, 回.
Typeface - the printed style of a glyph or character set, e.g., 中, 中, 中, 中 (shown in different typefaces on the slide).
Font - a single instance of a typeface, such as a specific point size, e.g., 中, 中, 中 (shown in different sizes on the slide).
Traditional vs. simplified characters
Simplification of characters has long been a disputed topic in China. Advocacy for character simplification began in the early years of the Republic, but it was only after 1949 that simplification was actually implemented. In 1956, the Committee on Language Reform promulgated a list of 515 simplified characters and 54 simplified components.
Currently, simplified characters are used in mainland China and Singapore, while traditional characters are used in Taiwan and Hong Kong. In overseas Chinese communities, a mixed situation can be observed.
Problems this dual system causes for Chinese computing
A computer must store two sets of characters, which greatly increases the storage space needed (for display and printing).
Input methods based on strokes or components may differ, because the radicals or components of traditional and simplified characters are different. 谢 and 謝 have different radicals (讠 vs. 言); 后 and 後 are completely different glyphs.
Problem of Conversion
Traditional 後來 皇后 心臟 骯髒 關係
Simplified 后来 皇后 心脏 肮脏 关系
Problems caused by converting simplified characters to traditional characters (incorrect output shown first, correct form in parentheses):
后来 => 后來 (後來)
心脏 => 心髒 (心臟)
关系 => 關系 (關係)
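The errors above arise because one simplified character can correspond to more than one traditional character, so a character-by-character table has to pick a single answer. The sketch below contrasts that naive approach with a phrase-level, longest-match lookup. It is a toy illustration only: both tables cover just the examples on this slide, and the names `CHAR_MAP`, `PHRASE_MAP`, `convert_by_char`, and `convert_by_phrase` are hypothetical.

```python
# Toy one-to-one character table. Characters not listed map to themselves.
# It must commit to one target per character, which is the source of the errors.
CHAR_MAP = {"来": "來", "脏": "髒", "肮": "骯", "关": "關"}

# Toy phrase table that records the correct traditional form of whole words.
PHRASE_MAP = {
    "后来": "後來",
    "皇后": "皇后",
    "心脏": "心臟",
    "肮脏": "骯髒",
    "关系": "關係",
}


def convert_by_char(text):
    """Naive conversion: translate each character independently."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_by_phrase(text):
    """Greedy longest-match over PHRASE_MAP, falling back to CHAR_MAP."""
    max_len = max(len(phrase) for phrase in PHRASE_MAP)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible phrase starting at position i first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in PHRASE_MAP:
                out.append(PHRASE_MAP[chunk])
                i += n
                break
        else:
            # No phrase matched here, so convert a single character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


for word in ["后来", "皇后", "心脏", "肮脏", "关系"]:
    print(word, convert_by_char(word), convert_by_phrase(word))
```

Phrase-level lookup only helps for words that are in the table, so a practical converter would need a much larger dictionary and would still need word segmentation for ambiguous cases.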
Chinese Software
Chinese Word Processors
Chinese Systems
- Chinese Windows
- Third-party Chinese systems
DOS-based software
Byx: DOS-based simple Chinese word processor. It handles simplified characters only (GB code) and has only one print font.
NJSTAR: DOS- and Windows-based Chinese word processor; handles both simplified and traditional characters.
Kuochiao: DOS-based Chinese system; traditional characters, Big5 code.
Yitien: same as Kuochiao.
CCDOS: DOS-based Chinese system; simplified characters, GB code.
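The GB and Big5 codes mentioned above are separate double-byte encodings, which is why a system built around one could not directly read text prepared for the other. The sketch below uses Python's built-in `gb2312` and `big5` codecs to show this. The helper name `can_encode` is hypothetical, and treating the `gb2312` codec as a stand-in for the "GB code" of these DOS systems is an assumption.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The same character gets a different two-byte code in each encoding.
gb_bytes = "中".encode("gb2312")
big5_bytes = "中".encode("big5")
print(gb_bytes.hex(), big5_bytes.hex())

# Decoding with the wrong codec silently yields a different character or fails,
# which is what users saw as garbled text.
print(gb_bytes.decode("big5", errors="replace"))

# Each character set covers only its own script variant.
for ch in ["汉", "漢"]:
    print(ch, "GB2312:", can_encode(ch, "gb2312"), "Big5:", can_encode(ch, "big5"))
```

Unicode-based systems avoid this by assigning every simplified and traditional character its own code point in a single character set.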
Windows-based software
Twinbridge http://www.twinbridge.com
Chinese Star http://www.suntendyusa.com/
Unionway http://www.unionway.com/tea/html/0/1.html
Richwin http://richwin.sina.com.cn/
Microsoft Cwindows and Pwindows
Microsoft multilingual support 2000 and XP 5.02
Installing and using the IME from the Office 2000 multilanguage pack
(Mac OS with multilingual support)