Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)


Page 1: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Statistical NLP: Lecture 6
Corpus-Based Work (Ch 4)

Page 2: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Corpus-Based Work

• Text Corpora are usually big. They also need to be representative samples of the population of interest.

• Corpus-based work involves collecting a large number of counts from corpora, and these counts need to be accessed quickly (see the sketch below).

• There exists some software for processing corpora (see the useful links on the course homepage).
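A minimal sketch of count collection in Python, assuming a plain-text corpus in a hypothetical file corpus.txt and whitespace splitting as a placeholder for real tokenization; a hash-based Counter gives the fast access referred to above.

```python
from collections import Counter

# Collect token counts from a (hypothetical) plain-text corpus file.
# Whitespace splitting is only a stand-in for real tokenization.
counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Hash-based lookup makes individual counts cheap to retrieve.
print(counts["the"])
print(counts.most_common(10))
```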

Page 3: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Corpora

• Linguistically marked up or not
• Representative sample of the population of interest
  – American English vs. British English
  – Written vs. Spoken
  – Areas
• The performance of a system depends heavily on
  – the entropy
  – the task (e.g., text categorization)

• Balanced corpus vs. all text available

Page 4: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Software/Coding

• Software
  – Text editor
  – Regular expressions
  – Programming language
    • C/C++, Perl, awk, Python, Prolog, Java
• Coding
  – Mapping words to numbers (see the sketch below)
  – Hashing
  – CMU-Cambridge Statistical Language Modeling toolkit
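A minimal sketch of the word-to-number mapping via hashing, assuming the goal is compact integer codes for fast counting; the toy corpus below is made up.

```python
# Map each word type to a small integer ID using a hash table (dict),
# so counts can be kept in arrays indexed by ID rather than by string.
word_to_id = {}

def get_id(word):
    # Assign the next free integer the first time a word is seen.
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)
    return word_to_id[word]

tokens = "the cat sat on the mat".split()   # toy corpus
ids = [get_id(w) for w in tokens]
print(ids)          # [0, 1, 2, 3, 0, 4]
print(word_to_id)   # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
```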

Page 5: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Looking at Text (I): Low-Level Formatting Issues

• Mark-up of a text
  – formatting mark-up or explicit mark-up

• Junk formatting/content. Examples: document headers and separators, typesetter codes, tables and diagrams, garbled data in the computer file. Further problems arise if the data was obtained through OCR (unrecognized words). Often one needs a filter to remove junk content before any processing begins.

• Uppercase and lowercase: should we keep the case or not? "The", "the", and "THE" should all be treated the same, but "Brown" in "George Brown" and "brown" in "brown dog" should be treated separately (see the sketch below).
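A minimal sketch of such a filter and case policy, assuming one simple rule of thumb (lowercase only capitalized sentence-initial words); this is an illustration, not the chapter's actual procedure.

```python
import re

def strip_junk(line):
    # Remove non-printable control characters left behind by formatting codes.
    return re.sub(r"[\x00-\x08\x0b-\x1f]", " ", line)

def fold_case(token, sentence_initial):
    # Naive policy: lowercase a capitalized word only at the start of a sentence,
    # so "The" becomes "the" while "Brown" in "George Brown" keeps its case.
    if sentence_initial and token[:1].isupper() and token[1:].islower():
        return token.lower()
    return token

print(fold_case("The", sentence_initial=True))     # the
print(fold_case("Brown", sentence_initial=False))  # Brown
```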

Page 6: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Looking at Text (II): Tokenization. What is a Word?

• An early step of processing is to divide the input text into units called tokens, where each is either a word or something else such as a number or a punctuation mark (a tokenizer sketch follows after this list).

• Periods: haplology (one period doing double duty after an abbreviation) or end of sentence? e.g., "etc.", the Korean sentence-final forms 먹었다 ("ate") and 하였다 ("did") followed by ".", 6.7, 3.1 절 ("Section 3.1")
• Whitespace
• Single apostrophes: isn't, I'll (2 words or 1 word?)
• Hyphenation: text-based, co-operation, e-mail, A-1-plus paper, "take-it-or-leave-it", the 90-cent-an-hour raise; mark up vs. mark-up vs. mark(ed) up
• Homographs: one form, two lexemes, e.g., "saw"
• Other hard cases: 26.3$, www.hyowon.pusan.ac.kr, MicroSoft, :-), "책" ("book"), '그' 책 ("'that' book")
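A minimal regex tokenizer sketch, assuming one particular set of answers to the questions above (keep hyphenated words and clitic forms like isn't as single tokens, split off other punctuation); it only illustrates the decisions involved and is not the chapter's tokenizer.

```python
import re

# One alternation per token type; the more specific pattern comes first.
TOKEN_RE = re.compile(r"""
      \w+(?:[-.']\w+)*     # words, incl. hyphenated (text-based), clitics (isn't), 6.7
    | [.,!?;:"'()]         # punctuation as separate tokens
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("The take-it-or-leave-it offer isn't final, is it?"))
# ['The', 'take-it-or-leave-it', 'offer', "isn't", 'final', ',', 'is', 'it', '?']
```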

Page 7: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Looking at Text (III): Tokenization. What is a Word? (Cont'd)

• Word segmentation in other languages: no whitespace ==> word segmentation is hard (one baseline is sketched below)

• Whitespace not indicating a word break:
  – New York, data base
  – the New York-New Haven railroad

• Variant coding of information of a certain semantic type:
  – phone numbers: +45 43 48 60 60, (202) 522-2230, 33 1 34 43 32 26, (44.171) 830 1007
• Speech corpora:
  – filled pauses: er, um, ...
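The slide only notes that segmentation without whitespace is hard; as one common baseline, not taken from the chapter, here is a greedy maximum-matching sketch against a made-up dictionary.

```python
# Greedy left-to-right maximum matching: at each position take the longest
# dictionary entry that matches, falling back to a single character.
DICTIONARY = {"New", "York", "New York", "data", "base"}
MAX_LEN = max(len(w) for w in DICTIONARY)

def max_match(text, dictionary=DICTIONARY):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # unknown character kept as its own token
            i += 1
    return tokens

print(max_match("NewYorkdatabase"))   # ['New', 'York', 'data', 'base']
```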

Page 8: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Morphology

• Stemming: strips off affixes (a toy sketch follows after this list)
  – sit, sits, sat
• Lemmatization: transforms a word into its base form (lemma, lexeme)
  – Disambiguation
• Not always helpful in English (from an IR point of view), which has very little morphology
  – !! Stemming does not help the performance of classical IR
    • e.g., business / busy
  – Perhaps more useful in other contexts
• Multiple words or a single morpheme??? Richer inflectional and derivational systems:
  – Bantu language KiHaya: akabimu’ha (a-ka-bi-mu’-ha, 1SG-PAST-3PL-3SG-give) = "I gave them to him."
  – Finnish: millions of inflected forms for each verb
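A toy suffix-stripping stemmer, only to make the stemming vs. lemmatization contrast concrete; the suffix list is invented and this is far cruder than a real stemmer such as Porter's.

```python
# Toy suffix stripper: remove one known suffix if enough of the word remains.
SUFFIXES = ["ing", "ness", "es", "ed", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Regular affixes are handled, irregular forms are not:
print([stem(w) for w in ["sit", "sits", "sat"]])   # ['sit', 'sit', 'sat']
# A lemmatizer would instead map 'sat' to the lemma 'sit'.
```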

Page 9: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Sentences: What is a Sentence?

• Something ending with a ‘.’, ‘?’ or ‘!’. True in 90% of the cases.

• Sometimes, however, sentences are split up by other punctuation marks or quotes.

• Often, solutions involve heuristic methods. However, these solutions are hand-coded. Some efforts to automate sentence-boundary detection have also been made.

• “You remind me,” she remarked, “of your mother.”
• Korean is even harder!!!
  – the period may be missing; is the boundary simply after a sentence-final ending?
  – endings that are both connective and sentence-final
  – quotation marks

Page 10: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

End-of-Sentence Detection (I)

• Place a putative EOS after all occurrences of . ? ! (and maybe ; : -)

• Move EOS after quotation marks, if any

• Disqualify a period boundary if it is:
  – preceded by a known abbreviation that is not normally sentence-final and is followed by an upper-case letter (a proper name), e.g., Prof., vs., Mr.

Page 11: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

End-of-Sentence Detection (II)

  – preceded by a known abbreviation that is not followed by an upper-case letter, e.g., Jr., etc. (abbreviations that can be sentence-final or sentence-medial)

• Disqualify a boundary with ? or ! if it is followed by a lower-case letter (or a known name)

• Keep all the rest as EOS (a sketch of these rules follows below)
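A sketch of the rules above in Python; the abbreviation lists are tiny placeholders and the handling of quotation marks is simplified.

```python
# Placeholder abbreviation lists (a real system would use much larger ones).
ABBREV_BEFORE_NAME = {"Prof.", "Mr.", "Dr.", "vs."}   # not normally sentence-final
ABBREV_ANY_POSITION = {"Jr.", "etc."}                 # sentence-final or medial

def sentence_boundaries(tokens):
    """Return indices of tokens treated as end-of-sentence by the rules above."""
    eos = []
    for i, tok in enumerate(tokens):
        if not tok.endswith((".", "?", "!")):
            continue
        prev = tokens[i - 1] if tok in {".", "?", "!"} and i > 0 else tok
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Abbreviation followed by an upper-case word: not a boundary.
        if prev in ABBREV_BEFORE_NAME and nxt[:1].isupper():
            continue
        # Abbreviation not followed by upper case: not a boundary.
        if prev in ABBREV_ANY_POSITION and not nxt[:1].isupper():
            continue
        # ? or ! followed by lower case: not a boundary.
        if tok.endswith(("?", "!")) and nxt[:1].islower():
            continue
        # Move the boundary past a following quotation mark, if any.
        eos.append(i + 1 if nxt in {'"', "'"} else i)
    return eos

toks = ["Mr.", "Jones", "met", "Prof.", "Smith", ".", "He", "said", "hi", "."]
print(sentence_boundaries(toks))   # [5, 9]
```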

Page 12: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Marked-Up Data I: Mark-up Schemes

• Schemes developed to mark up the structure of text
• Different mark-up schemes:
  – COCOA format (older, and rather ad hoc)
  – SGML [other related encodings: HTML, TEI, XML]
    • DTD, XML Schema (a small parsing sketch follows below)
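As a small illustration of structural mark-up (an XML fragment written in the spirit of TEI; the fragment is made up for the example and does not claim to follow a real TEI schema), Python's standard xml.etree.ElementTree can read such a document:

```python
import xml.etree.ElementTree as ET

# A made-up fragment with TEI-style structural mark-up.
doc = """
<text>
  <div type="chapter" n="1">
    <p>This is the first paragraph.</p>
    <p>This is the second one.</p>
  </div>
</text>
"""

root = ET.fromstring(doc)
for div in root.iter("div"):
    print("chapter", div.get("n"))
    for p in div.iter("p"):
        print("  " + p.text.strip())
```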

Page 13: Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Marked-Up Data II: Grammatical Coding

• Tagging indicates the various conventional parts of speech. Tagging can be done automatically (we will talk about that in Week 9).

• Different tag sets have been used: e.g., the Brown tag set and the Penn Treebank tag set (a small tagged-text example follows below).

• Discussion of Tables 4.4 and 4.5
• The design of a tag set: Target Features versus Predictive Features
  – discussion of Korean tag sets
    • example: distinguishing auxiliary verbs from main verbs
    • ETRI, KAIST, …
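For concreteness, a tiny example of word/TAG-annotated text in the slash format commonly used with the Penn Treebank tag set (the sentence is made up; the tags shown are standard Penn Treebank tags):

```python
# Word/TAG pairs in the "slash" format used with the Penn Treebank tag set.
tagged = "The/DT dog/NN barked/VBD loudly/RB ./."

pairs = [token.rsplit("/", 1) for token in tagged.split()]
for word, tag in pairs:
    print(f"{word}\t{tag}")
# DT = determiner, NN = singular noun, VBD = past-tense verb, RB = adverb
```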