Language Archiving: Document Annotation and Corpus Linguistics
Keh-Jiann Chen
Institute of Information Science
Academia Sinica
The goals of NDAP are (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”]):
Preserving national cultural collections.
Popularizing fine cultural holdings.
Strengthening cultural heritage as well as guiding cultural development.
Popularizing knowledge and improving information sharing.
Enhancing education and learning.
Bootstrapping cultural and value-added industries.
Improving literacy, creativity and quality of life.
Promoting international cooperation and resource sharing.
Space, Time and Language Coordinates for Digital Archives
[Figure: digital archives and TSL coordinates, with three axes: Language, Time, Space. Labels: Language in Time; Language in Space; Language in Text, in Speech...; Historical GIS; Language changes; Language variations; Digital Archives. Quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”].]
Language Archiving is a Collection of Linguistic Resources
Collection of a linguistic archive (such as a balanced corpus) is guided by a set of design criteria.
Design criteria define natural classes of texts in a collection.
Each criterion establishes a dimension for comparative studies.
www.sinica.edu.tw/SinicaCorpus
How to make a single archive more versatile
One corpus or many corpora?
Or how to make a balanced corpus biased?
With textual markup information (e.g. metadata):
genre, style, mode, topic, medium, etc.
word, part-of-speech, structure tags, semantic tags
Alignment for heterogeneous corpora
Creating Synergy from Uniform Resource Type
Each document is marked up with textual description features: topic, style, etc.
Each feature selects a subset of documents.
Sub-corpora (or new archives) can be created online according to the user's specification.
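The feature-based selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` class, the `select_subcorpus` function, and the three toy documents are hypothetical and are not part of the Sinica Corpus software.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """A text with its markup features (hypothetical structure)."""
    text_id: str
    features: dict  # e.g. {"genre": "reportage", "mode": "written"}
    text: str = ""


def select_subcorpus(corpus, **criteria):
    """Return the documents whose features match every given criterion."""
    return [
        doc for doc in corpus
        if all(doc.features.get(name) == value
               for name, value in criteria.items())
    ]


# Toy corpus; the feature values are taken from the classification scheme.
corpus = [
    Document("t1", {"genre": "reportage", "mode": "written", "medium": "newspaper"}),
    Document("t2", {"genre": "poetry", "mode": "written", "medium": "general book"}),
    Document("t3", {"genre": "conversation", "mode": "spoken", "medium": "interactive speech"}),
]

written = select_subcorpus(corpus, mode="written")
news = select_subcorpus(corpus, genre="reportage", medium="newspaper")
print([d.text_id for d in written], [d.text_id for d in news])
```

Each keyword argument narrows the selection, so combining features yields progressively more specialized sub-corpora from the same archive.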
Creating Synergy from Uniform Resource Type
Classical Chinese Corpora: http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html
Corpus of Formosan Austronesian Languages: under construction, part of the National Digital Archive Initiative
Lexical Databases of other Sino-Tibetan and Tibeto-Burmese Languages
Creating Synergy from Heterogeneous Resource Type
Bilingual or multilingual corpora
Text and speech aligned corpora
Synchronized corpora collected from different areas
How to create a balanced corpus?
Creation of the Sinica Corpus: a word-segmented modern Chinese corpus with POS tagging
Introduction
TEI: A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage.
It must be representative and balanced if it claims to faithfully represent the facts in that language or sublanguage [Sinclair 87].
Introduction
Sinica balanced corpus: texts are classified according to 5 different features: (1) Genre, (2) Style, (3) Mode, (4) Topic, (5) Medium.
Word segmentation standard: Segmentation standard for Chinese language processing, http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm
Part-of-speech tagging: 46 syntactic categories
Genre
  written: reportage, commentary, advertisement, letter, announcement, fiction, prose, biography & diary, poetry, analects, manual
  spoken: script, conversation, speech, meeting minutes
Style: narration, argumentation, exposition, description
Mode: written, written-to-be-read, written-to-be-spoken, spoken, spoken-to-be-written
Topic: philosophy, natural sciences, social sciences, fine arts, general/leisure, literature
Medium: newspaper, general magazine, academic journal, textbook, reference book, thesis, general book, audio/visual media, interactive speech
Sinica Corpus topic distribution: philosophy 10%, natural sciences 10%, social sciences 35%, fine arts 5%, general/leisure 20%, literature 20%
%% 文類 Genre= 報導 reportage
%% 文體 Style= 記敘 Narration
%% 語式 Mode= written
%% 主題 Topic= 訊息 Message
%% 媒體 Medium= 報紙 Newspaper
%% 姓名 Author’s name=
%% 性別 Gender= 男女
%% 國籍 Nationality= 中華民國 Chinese
%% 母語 Mother tongue= 中文 Chinese
%% 出版單位 Publisher= 中研院週報 Academia Sinica
%% 出版地 Place= 台北市台灣 Taipei Taiwan
%% 出版日期 date=1994
%% 版次 version=
%% 標題 Title= 國史研習會:中國宗教與社會
1. 。 (PERIODCATEGORY) 由 (P) 本 (Nes) 院 (Nc) 歷史 (Na) 語言 (Na) 研究所 (Nc) 主辦 (VC) , (COMMACATEGORY)
***********************************************
2. , (COMMACATEGORY) 台灣 (Nc) 大學 (Nc) 歷史系 (Nc) 暨 (Caa) 研究所 (Nc) 與 (Caa) 清華 (Nb) 大學 (Nc) 歷史系 (Nc) 暨 (Caa) 研究所 (Nc) 協辦 (VC) 之 (DE) 「 (PARENTHESISCATEGORY) 國史(Na) 研習會 (Na) : (COLONCATEGORY)
***********************************************
3. : (COLONCATEGORY) 中國 (Nc) 宗教 (Na) 與 (Caa) 社會 (Na) 」 (PARENTHESISCATEGORY) ,(COMMACATEGORY)
***********************************************
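As an illustration of how a file in the format shown above could be read, here is a minimal sketch. It assumes that header lines start with `%%` and contain one `=`, and that each token is written as a word followed by its tag in parentheses; the function names and the regular expression are assumptions made for this example, not part of the Sinica Corpus tools.

```python
import re

# A token is a run of non-space characters followed by "(TAG)".
# The space between word and tag is optional, as in "國史(Na)".
TOKEN_RE = re.compile(r"(\S+?)\s*\(([A-Za-z]+)\)")


def parse_header_line(line):
    """Split a '%% key= value' header line into (key, value)."""
    body = line.lstrip("%").strip()
    key, _, value = body.partition("=")
    return key.strip(), value.strip()


def parse_tagged_line(line):
    """Return the (word, tag) pairs found in one tagged line."""
    return TOKEN_RE.findall(line)


print(parse_header_line("%% 出版日期 date=1994"))
print(parse_tagged_line("國史(Na) 研習會 (Na)"))
```

Feature annotations in square brackets such as `[+nom]` are not handled by this sketch.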
Introduction
Motivations for designing a corpus management system:
It is hard to collect, maintain, classify, and tag a large amount of text without a management system.
Automate the word segmentation and tagging processes.
Maintain the precision and consistency of data collection.
Handle out-of-vocabulary words.
Database for Texts
[Figure: layout of the text database. Records (record 1, record 2, …) with three fields: Text Id, features, text. Other labels: Text database, Construction System, Tagged text.]
Construction Flow
[Figure: construction flow. Components: WWW (網路), Text Collection Module, text files, Unknown Word Identification Module, text and new words, Word Segmentation and POS-tagging Module, tagged text, Inspection System (New Word Editor, Tagged Text Editor), revised tagged text, revised new words, domain lexicons, Text Database (SQL).]
Text Collection Module
Purpose: semi-automatically collect various texts from the WWW.
Features: automatic feature extraction and document classification.
Unknown Word Identification Module
Identify new words before word segmentation.
Methods:
Detect the existence of unknown words.
Apply statistical rules and morphological rules to identify unknown words.
Word Segmentation & Tagging Module
Based on the word segmentation standard for information processing, the segmentation program segments the input text and tags the resulting words with their parts of speech.
Methods: word matching based on the lexicon and newly identified words.
Segmentation process: longest matching, plus heuristic rules to resolve segmentation ambiguities.
POS tagging: bi-gram model for resolving POS ambiguities.
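The longest-matching step can be illustrated with a short sketch. It implements plain forward longest matching against a lexicon and nothing more: the heuristic disambiguation rules and the bi-gram tagger mentioned above are not reproduced, and the function name and toy lexicon are assumptions made for this example.

```python
def segment_longest_match(text, lexicon, max_word_len=None):
    """Segment text greedily, always taking the longest lexicon word.

    Characters not covered by any lexicon entry become one-character words.
    """
    if max_word_len is None:
        max_word_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon: general entries plus one newly identified word (減重班).
general = {"台大", "本", "學期", "舉辦", "減重", "班"}
with_new_word = general | {"減重班"}

print(segment_longest_match("台大本學期舉辦減重班", general))
print(segment_longest_match("台大本學期舉辦減重班", with_new_word))
```

With only the general entries the last word is split into 減重 and 班; adding the new word to the lexicon keeps 減重班 intact, which is why unknown words are identified before segmentation.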
Word Segmentation & Tagging Module (cont)
Additional features: a user-defined dictionary or domain dictionary can be incorporated to improve word segmentation accuracy.
Domain dictionary: e.g. a medical dictionary or a dictionary of computing terminology.
Extracted unknown words: new words, such as personal names, keep appearing in text. The unknown word identification process extracts them, and they supplement the dictionary.
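[Figure: word segmentation and tagging. Inputs: text, general lexicon, domain lexicon, unknown words extracted from the text. Output: tagged text.]
Example input: 台大本學期舉辦減重班 ("National Taiwan University is holding a weight-loss class this semester")
Example output: 台大 (Nc) 本 (Nes) 學期 (Na) 舉辦 (VC) 減重班 (Na)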
Inspection System
Purpose: to assure the quality of the corpus, automatically processed texts need to be verified by human experts. An inspection system was therefore designed to speed up verification.
Major functions:
Editing functions: errors in word breaks, POS tags, features, and sentence breaks can be fixed with a mouse click.
Reminder functions: the system highlights common errors, prefixes, and suffixes in the text.
Short-term memory: the system recalls the most recent modifications and fixes errors of the same type automatically.
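Inspection System (cont)
Provide lexical information and examples.
Friendly user interface.
[Figure: system architecture. Labels: user (使用者), Web Server, SQL Server, corpus under construction (欲構建之語料庫), lexicon (詞典), previous version of the corpus (舊版本之語料庫).]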
J 塑膠 (Na) 皮 (Na)→ 塑膠皮 (Na)
J 公文 (Na) 包 (VC)→ 公文包 (Na)
J 村 (Nc) 上 (Ncd)→ 村上 (Nb)
J 毛利 (Na) 遜 (VH)→ 毛利遜 (Nb)
J 吉姆 (Nb) 毛利遜 (Nb)→ 吉姆毛利遜 (Nb)
D 世界級 (Na)→ 世界 (Nc) 級 (Na)
D 科學方法 (Na)→ 科學 (Na) 方法 (Na)
D 三代 (Nd)→ 三 (Neu) 代 (Na)
D 交互作用 (Na)→ 交互 (VH) 作用 (Na)
D 如一 (VH)→ 如 (P) 一 (Neu)
C 改變 (VC)→ 改變 (Na)
C 傳統 (VH)→ 傳統 (Na)
C 企畫 (VC)→ 企畫 (Na)
C 自然 (D)→ 自然 (VH)
C 起來 (VA)→ 起來 (Di)
F 反射 (VJ)→ 反射 (VJ)[+nom]
F 遮雨 (VA)→ 遮雨 (VA)[+nom]
F 保持 (VJ)[+nom]→ 保持 (VJ)
F 萊特班 (Na)→ 萊特班 (Na)[+prop]
F 感動 (VHC)→ 感動 (VHC)[+nom]
Corpus Management System
Advantages:
The corpus management system speeds up the construction process and reduces human effort.
It also increases the precision and consistency of word segmentation and POS tagging.
The database system facilitates searching, managing, retrieving, and reorganizing texts.
Reorganizing sub-corpora
Sub-corpora can be reorganized according to different features:
Sport corpus
Spoken corpus
Corpus of the most recent three months
News corpus
Corpus of poetry
Corpus Searching Tools
[Figure: search flow. Key word vectors; Key Word in Context (KWIC) search; KWIC file; filtering and sorting; display, print, or store; statistics; collocation.]
Corpus Searching Tools
KWIC search
Key word vector: what is matched
[代表, N, φ, φ]: every word 代表 daibiao tagged with the POS noun
[φ, VA, φ, 1]: all monosyllabic intransitive verbs (VA)
[φ, φ, +fw, φ]: all foreign words
[..化, V, φ, 3]: all tri-syllabic verbs with the suffix 化 hua '-ize'
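One way to read the four-slot key word vectors is as [word pattern, POS, feature, length], with φ meaning "no constraint". The sketch below follows that reading and makes several assumptions that the slides do not state: `None` stands for φ, a `.` in the word pattern matches any single character, the POS slot matches as a prefix (so `N` covers `Na`, `Nb`, and so on), and length is counted in characters. The token list is toy data.

```python
import re


def matches(token, vector):
    """Check one (word, pos, features) token against a key word vector."""
    word, pos, features = token
    pattern, pos_prefix, feature, length = vector
    if pattern is not None:
        # "." is a one-character wildcard; everything else is literal.
        regex = "".join("." if ch == "." else re.escape(ch) for ch in pattern)
        if re.fullmatch(regex, word) is None:
            return False
    if pos_prefix is not None and not pos.startswith(pos_prefix):
        return False
    if feature is not None and feature not in features:
        return False
    if length is not None and len(word) != length:
        return False
    return True


def kwic(tokens, vector, window=2):
    """Return (left context, key word, right context) for every match."""
    lines = []
    for i, token in enumerate(tokens):
        if matches(token, vector):
            left = " ".join(w for w, _, _ in tokens[max(0, i - window):i])
            right = " ".join(w for w, _, _ in tokens[i + 1:i + 1 + window])
            lines.append((left, token[0], right))
    return lines


tokens = [
    ("學生", "Na", set()), ("代表", "Na", set()), ("代表", "VC", set()),
    ("走", "VA", set()), ("現代化", "VH", set()), ("DNA", "Na", {"+fw"}),
]
print(kwic(tokens, ("代表", "N", None, None), window=1))
print(kwic(tokens, ("..化", "V", None, 3), window=1))
```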
Corpus Searching Tools
Filtering: the filtering methods include random sampling, removing redundant samples, and removing irrelevant samples by restricting the content in the window of the key words.
Displaying, printing, and storing: the resulting KWIC files can be displayed on screen, printed, or stored for future processing.
Corpus Searching Tools
Statistics: statistic functions provide statistical distributions of words and categories occurring within the context window of key words.
For instance, the category distribution of the word 把 ba:
Category          Tag    Frequency   %
preposition       P      2704        92.57
measure           Nf     211         7.22
transitive verb   Vc     3           0.10
determiner        Neqb   2           0.07
noun              Na     1           0.03
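A distribution like the one above can be computed from tagged tokens with a simple counter. The sketch below is illustrative only: the function name is hypothetical, and the token list is synthetic, built from the frequencies in the table rather than from the corpus itself.

```python
from collections import Counter


def category_distribution(tagged_tokens, keyword):
    """Return (tag, frequency, percentage) rows for one word, most frequent first."""
    counts = Counter(pos for word, pos in tagged_tokens if word == keyword)
    total = sum(counts.values())
    return [(pos, n, round(100 * n / total, 2)) for pos, n in counts.most_common()]


# Synthetic tokens that reproduce the frequencies listed for 把.
tokens = (
    [("把", "P")] * 2704 + [("把", "Nf")] * 211 + [("把", "Vc")] * 3
    + [("把", "Neqb")] * 2 + [("把", "Na")] * 1
)
for row in category_distribution(tokens, "把"):
    print(row)
```

The percentages are each tag's share of all 2921 occurrences of the word.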
Corpus Searching Tools
Collocation finding: the system finds collocations of the key words by computing the mutual information [Church & Hanks 90] of the key words with the words or parts-of-speech in a user-defined window.
Mutual information: $I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$
I(x,y) >> 0: x and y are strongly associated.
I(x,y) ≈ 0: x and y are unrelated.
I(x,y) << 0: x and y are mutually exclusive.
Examples
The top 16 collocations of ‘威脅’ ('threat') within a window of distance 10:
1. 飽受 2. 恫嚇 3. 綑綁 4. 構成 5. 嚴重 6. 崩坍 7. 恐怖 8. 恐嚇 9. 遭受 10. 刀槍 11. 滾滾 12. 安全 13. 尖刀 14. 健康 15. 成全 16. 備受
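The mutual information ranking can be sketched as follows. This is a simplified illustration, not the system's implementation: probabilities are estimated as raw counts divided by the total number of tokens, the logarithm is taken to base 2 (an assumption; the slides only say "Log"), and the function name and toy sentences are made up for the example.

```python
import math
from collections import Counter


def collocations(sentences, keyword, window=10, min_count=1):
    """Rank words co-occurring with keyword by mutual information."""
    word_freq = Counter()
    pair_freq = Counter()  # co-occurrences with the keyword inside the window
    total = 0
    for sent in sentences:
        total += len(sent)
        word_freq.update(sent)
        for i, word in enumerate(sent):
            if word != keyword:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_freq[sent[j]] += 1
    scores = {}
    for other, n_xy in pair_freq.items():
        if n_xy < min_count:
            continue
        p_xy = n_xy / total
        p_x = word_freq[keyword] / total
        p_y = word_freq[other] / total
        scores[other] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


sentences = [["x", "y"], ["x", "y"], ["z", "w"], ["z", "w"]]
print(collocations(sentences, "x", window=10))
```

Raising `min_count` filters out pairs seen only once or twice, whose scores are unreliable because mutual information overrates rare events.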
Corpus Linguistics
A corpus provides ample examples of word uses and syntactic patterns. It also reflects the real uses of the language and their frequency distribution.
Comparative studies can be made within KWIC results or between sub-corpora.
Automatic knowledge extraction techniques can be applied to a corpus to reduce manual effort.
Lexicography
A corpus provides ample examples of different word uses and syntactic patterns.
A corpus reflects the real uses of the language and their frequency distribution.
Collocations show idiomatic patterns, and they are the most important uses of a word.
Examples can be extracted from corpora.
Senses and syntactic functions can be ordered according to their frequencies.
CoBuild, Oxford, EDR, and the Collocation Dictionary of Noun and Measure Words are examples of using corpora for editing dictionaries.
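Language Modeling
Markov language model: the probabilities are estimated from corpora.
$P(W_1 W_2 \ldots W_m) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_m \mid W_1 W_2 \ldots W_{m-1})$
N-gram model: $P(W_1 W_2 \ldots W_m) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_m \mid W_{m-n+1}, \ldots, W_{m-1})$, where each word is conditioned on at most the n-1 preceding words.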
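For n = 2, estimating the model from a corpus amounts to counting word pairs. The sketch below uses plain relative frequencies with no smoothing, and it treats the first-word probability P(W1) as P(W1 | <s>) with a sentence-start marker; both choices, like the function names and toy sentences, are assumptions made for the illustration.

```python
from collections import Counter

START = "<s>"  # sentence-start marker (an assumption of this sketch)


def train_bigram(sentences):
    """Count contexts and bigrams over a list of tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = [START] + list(sent)
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams


def sentence_prob(sent, contexts, bigrams):
    """P(W1...Wm) as a product of relative-frequency bigram estimates."""
    prob = 1.0
    padded = [START] + list(sent)
    for prev, cur in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context; a real system would smooth here
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


contexts, bigrams = train_bigram([["a", "b"], ["a", "c"]])
print(sentence_prob(["a", "b"], contexts, bigrams))
```

Any sentence containing an unseen word pair gets probability zero under this estimate, which is why practical models add smoothing.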
Language Modeling
Applications of language modeling:
Input methods: speech recognition, character recognition, spelling check, phonetic input, …
Data compression: Huffman coding, arithmetic coding, …
Categorization: text classification, POS tagging, sense disambiguation, word segmentation, …
Machine Translation
IBM [Brown etc. 1990] used the bilingual Hansard corpus to build translation models.
To translate a French sentence F into an English sentence E is equivalent to finding the E that maximizes P(E)*P(F|E).
P(E) is estimated from a bi-gram model.
P(F|E) is estimated from the aligned bilingual corpus.
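For reference, the quantity being maximized follows from Bayes' rule. This is the standard noisy-channel derivation, written here with the symbols E and F used above; since P(F) does not depend on E, it drops out of the maximization:

```latex
\hat{E} = \arg\max_{E} P(E \mid F)
        = \arg\max_{E} \frac{P(E)\, P(F \mid E)}{P(F)}
        = \arg\max_{E} P(E)\, P(F \mid E)
```

The language model supplies P(E) and the translation model supplies P(F | E), which is why the aligned bilingual corpus is needed for the second factor only.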
Conclusion
A language archive is not only the most important cultural heritage but also the most important resource for language research.
Computer tools make archiving more efficient and manageable.
Everyone can access the archive through the WWW.
Websites: Corpora and Archives
Sinica Corpus (Academia Sinica Balanced Corpus of Modern Chinese)
www.sinica.edu.tw/SinicaCorpus
Academia Sinica Classical Chinese Corpora: Early Mandarin
www.sinica.edu.tw/Early_Mandarin
Academia Sinica Formosan Language Archive: Rukai(Mantauran)
www.ling.sinica.edu.tw/formosan
Websites: Digital Museums
Chinese Language KnowledgeNets
WenGuo: Adventures in Wen-Land
http://www.sinica.edu.tw/wen
SouWenJieZi
http://www.dmpo.sinica.edu.tw/~words
Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
5 million words, segmented and tagged
Direct WWW access: http://www.sinica.edu.tw/ftms-bin/kiwi.sh
License information: http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm
Sinica Treebank 1.0
38,725 trees
239,532 words
Direct WWW access (1000 sample trees): http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm
License information: http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm
Mandarin-Across-Taiwan (MAT) Speech Database
Speech files are collected through telephone networks. The content includes spontaneous speech (short answering statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences).
MAT-160 (160 speakers)
MAT-2000
http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm