CSA2050 Introduction to Computational
Linguistics
Lecture 3
Examples
Mar 2005 -- MR CSA2050 - Lecture III: Examples 2
Course Contents
1 (MR) Overview
2 (RF) Chomsky Hierarchy
3 (MR) Examples
4 (RF) Grammatical Categories
5, 6 (MR) Tagging
7 (RF) Morphology
8, 9, 10 (MR) Comp Morphology
11 (RF) Syntax
12, 13, 14 (MR) Grammar Formalism
Outline
Examples in the areas of
• Tokenisation
• Morphological Analysis
• Tagging
• Syntactic Analysis
Information Extraction
raw text tokenisation
morphological analysis
named entity recognition
tagged text
syntactic analysis
Tokenisation
The basic idea of tokenisation is to identify the basic tokens that are present in a text.
Mostly, tokens are the same as words, but not always.
Why should this be a problem?
John’s car cost €10,000.00.
“And it’s worth every penny”, he exclaimed.
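To see why the two sentences above are awkward, compare a plain whitespace split with a slightly smarter pattern. This is only a minimal sketch: the pattern `TOKEN_RE` and the function `tokenise` are hypothetical names invented for illustration, and keeping John’s as a single token is one possible design decision, not the only one.

```python
import re

# Hypothetical pattern; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    [€$£]?\d+(?:[.,]\d+)*     # numbers, optionally with a currency sign
  | \w+(?:['’]\w+)*           # words, keeping internal apostrophes
  | \S                        # any other single non-space character
    """,
    re.VERBOSE,
)

def tokenise(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)

sentence = "John’s car cost €10,000.00."
print(sentence.split())    # whitespace only: the final period stays glued on
print(tokenise(sentence))  # the amount and the sentence-final period are separated
```

The whitespace split leaves the sentence-final period attached to the amount, while the pattern keeps the commas and period inside €10,000.00 and splits off only the last period.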
Tokenisation Problems: Punctuation
• novel forms: .net, Micro$oft, :-)
• hyphenation:
    • linebreaks vs word-internal: e-mail, 898-0587
    • multi-word: the 90-cent-an-hour raise
    • confusion with dash
• apostrophes in contractions: we'll
• periods
    • part of names: Amazon.com
    • numerical expressions: $1.99
    • abbreviations, end of sentence, haplology
• commas: 1,000,000
Other Problems
• Token-internal whitespace: 898 0464
• Interaction: the New York-New Haven railroad
• Mixed language tokens : u
• Automated language guesser
• Token equivalence (when are two tokens the same)?
• Case-normalization.
• Sentence boundary detection.
• Inconsistency: database, data-base, data base
• Demo: Xerox tokeniser
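Token equivalence and case-normalization can be illustrated with a normalisation key: two tokens count as "the same" when their keys match. The function `equivalence_key` below is a hypothetical helper, and the rule it applies (fold case, drop hyphens and internal whitespace) is an assumption chosen only to cover the database example.

```python
def equivalence_key(token):
    """Map a token to a key; tokens with equal keys are treated as equivalent."""
    # casefold() performs case-normalization; hyphens and whitespace are dropped
    return "".join(ch for ch in token.casefold() if ch not in "- \t")

for variant in ["database", "data-base", "data base", "Database"]:
    print(variant, "->", equivalence_key(variant))
```

A rule this blunt also merges forms that should stay apart (for example it would fold US and us together), which is why token equivalence is listed as a problem rather than a solved step.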
Morphology
Simple versus complex words: dog, dogs
Complex words formed by concatenation of morphemes.
Morpheme: The smallest unit in a word that bears some meaning, such as dog and s.
Morphological Analysis
Morphological analysis of a word involves a segmentation problem.
Segmentation: discovery of the component morphemes
dogs → dog + s
enlargement → en + large + ment
Possible ambiguities:
enlargement → enlarge + ment
            → en + largement
Role of lexicon
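The role of the lexicon in segmentation can be sketched as follows: enumerate every way of cutting the word into pieces, and keep only the cuts whose pieces are all lexicon entries. The toy `LEXICON` and the function `segmentations` are assumptions made for this illustration.

```python
# Hypothetical toy lexicon of morphemes
LEXICON = {"dog", "s", "en", "large", "enlarge", "ment"}

def segmentations(word, lexicon):
    """Return every way of splitting word into a sequence of lexicon entries."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results

print(segmentations("dogs", LEXICON))
print(segmentations("enlargement", LEXICON))
```

With this lexicon, enlargement receives the analyses en + large + ment and enlarge + ment. The split en + largement is not produced because largement is not a lexicon entry.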
Morphological Analysis
John has a couple of rabbits
rabbits → rabbit + s
s indicates plural of noun rabbit
Is this the only possibility?
Morphological Analysis
John rabbits on and on
rabbits → rabbit + s
s indicates 3rd person singular of verb rabbit
The suffix “s” is a realisation of two entirely different morphemes.
The morpheme is something more abstract than the string which realises it.
Morphological Analysis
[Figure: the morpheme world (+PL, +3S) set against the suffix world (-s, -a)]
Morphological Analysis
Morphological Parser
Input Word: rabbits
Output Analysis:
rabbit N PL
rabbit V 3S
• Output is a string of morphemes
• Morpheme is employed in a loose sense that is useful for further processing
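The parser box above can be imitated with two small tables: one listing the categories of each stem, and one mapping a suffix plus a category to an abstract morpheme. Both tables and the function `analyse` are hypothetical; a real analyser such as ENGTWOL is far richer, and this sketch ignores unsuffixed forms and spelling changes.

```python
# Hypothetical stem lexicon: stem -> possible categories
STEMS = {"rabbit": {"N", "V"}, "dog": {"N"}}

# Hypothetical suffix table: the same string "s" realises two morphemes
SUFFIX_FEATURES = {("s", "N"): "PL", ("s", "V"): "3S"}

def analyse(word):
    """Return all analyses of word as 'stem category feature' strings."""
    analyses = []
    for stem in sorted(STEMS):
        if word.startswith(stem):
            suffix = word[len(stem):]
            for cat in sorted(STEMS[stem]):
                feature = SUFFIX_FEATURES.get((suffix, cat))
                if feature:
                    analyses.append(f"{stem} {cat} {feature}")
    return analyses

print(analyse("rabbits"))
print(analyse("dogs"))
```

Because rabbit is listed as both noun and verb, rabbits receives two analyses, while dogs receives only the plural noun reading.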
Morphological Analysis: ENGTWOL & Xerox
Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993-1995
ENGTWOL demo
Xerox morphological analysis
Morphological Synthesis
Morphological Parser
Input:
rabbit N PL
rabbit V 3S
Output Word: rabbits
• Input is a string of morphemes
• Output is a word
Reversibility
Lookup
APPLY UP> left
left    leave+Verb+PastBoth+123SP
left    left+Adv
left    left+Adj
left    left+Noun+Sg
Lookdown
APPLY DOWN> leave+Adj
left
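Reversibility means that one and the same relation between surface forms and analyses serves both analysis (lookup) and synthesis (lookdown). The sketch below stores that relation as a plain list of pairs taken from the lookup output above. The names `PAIRS`, `apply_up` and `apply_down` are hypothetical, and a real system would compile the relation into a finite-state transducer instead of a list.

```python
# (surface form, analysis) pairs copied from the lookup output for "left"
PAIRS = [
    ("left", "leave+Verb+PastBoth+123SP"),
    ("left", "left+Adv"),
    ("left", "left+Adj"),
    ("left", "left+Noun+Sg"),
]

def apply_up(surface):
    """Analysis direction: surface form -> all analyses."""
    return [analysis for form, analysis in PAIRS if form == surface]

def apply_down(analysis):
    """Synthesis direction: analysis -> all surface forms."""
    return [form for form, a in PAIRS if a == analysis]

print(apply_up("left"))
print(apply_down("leave+Verb+PastBoth+123SP"))
```

Nothing in the data is specific to one direction; only the side of the pair that is matched changes.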
POS Tagging
In POS tagging, the task is to assign the most appropriate morphosyntactic label from amongst those listed in the lexicon, given the context.
John leaves presents.
Proper Names
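A minimal sketch of tagging in context: the lexicon lists the candidate tags of each word, and a table of tag-pair scores decides between the candidate sequences. All tags and numbers here are invented for illustration; they are not taken from any real tagger or corpus.

```python
from itertools import product

# Hypothetical lexicon: each word with the tags it may carry
LEXICON = {"John": ["PN"], "leaves": ["N", "V"], "presents": ["N", "V"]}

# Hypothetical scores for adjacent tag pairs (higher = more plausible)
TRANSITIONS = {
    ("START", "PN"): 1.0,
    ("PN", "V"): 1.0, ("PN", "N"): 0.2,
    ("V", "N"): 1.0, ("V", "V"): 0.1,
    ("N", "V"): 0.5, ("N", "N"): 0.3,
}

def score(tags):
    """Multiply the scores of all adjacent tag pairs, starting from START."""
    total = 1.0
    previous = "START"
    for tag in tags:
        total *= TRANSITIONS.get((previous, tag), 0.0)
        previous = tag
    return total

def tag(words):
    """Pick the highest-scoring tag sequence among all the lexicon allows."""
    candidates = product(*(LEXICON[w] for w in words))
    best = max(candidates, key=score)
    return list(zip(words, best))

print(tag(["John", "leaves", "presents"]))
```

Enumerating every sequence is only feasible for toy sentences; it is used here to keep the idea of "most appropriate label given the context" visible.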
Semantic Tagging
Named Entity Recognition
Basic idea is to recognise and tag named entities and classify them as being of type
• Persons
• Locations
• Organisations
Named Entity Recognition - Demo
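The simplest form of the idea is lookup in a list of known names (a gazetteer), preferring the longest match so that multi-word names are tagged as a unit. The gazetteer entries and the function `tag_entities` below are hypothetical, and real recognisers also rely on context, not just lists.

```python
# Hypothetical gazetteer: token sequences mapped to an entity type
GAZETTEER = {
    ("John",): "Person",
    ("New", "York"): "Location",
    ("Xerox",): "Organisation",
}

def tag_entities(tokens):
    """Label tokens with entity types, using longest match; 'O' means no entity."""
    max_len = max(len(name) for name in GAZETTEER)
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                tagged.append((" ".join(span), GAZETTEER[span]))
                i += n
                break
        else:
            tagged.append((tokens[i], "O"))
            i += 1
    return tagged

print(tag_entities("John visited Xerox in New York".split()))
```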
Syntactic Analysis
Problem: given a sentence and a grammar/lexicon, discover the tree structure that the grammar assigns to the sentence.
XIP Parser Demo
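As a small illustration of the problem statement, the sketch below uses a CKY-style chart to find a tree for John leaves presents. The grammar, the lexicon and the function `parse` are toy assumptions; this is not how the XIP parser works, and only the first tree found for each category is kept.

```python
# Hypothetical binary grammar: (left child, right child) -> parent
GRAMMAR = {("NP", "VP"): "S", ("V", "NP"): "VP"}

# Hypothetical lexicon: word -> possible categories
LEXICON = {"John": ["NP"], "leaves": ["V", "NP"], "presents": ["NP", "V"]}

def parse(tokens):
    """Return one tree rooted in S as nested tuples, or None if there is none."""
    n = len(tokens)
    if n == 0:
        return None
    chart = {}
    # Spans of width 1 come straight from the lexicon
    for i, tok in enumerate(tokens):
        chart[i, i + 1] = {cat: (cat, tok) for cat in LEXICON.get(tok, [])}
    # Wider spans are built by combining two adjacent narrower spans
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            cell = {}
            for mid in range(start + 1, end):
                for lcat, ltree in chart[start, mid].items():
                    for rcat, rtree in chart[mid, end].items():
                        parent = GRAMMAR.get((lcat, rcat))
                        if parent and parent not in cell:
                            cell[parent] = (parent, ltree, rtree)
            chart[start, end] = cell
    return chart[0, n].get("S")

print(parse("John leaves presents".split()))
```

Although leaves and presents are each ambiguous between noun and verb in this lexicon, only the reading with leaves as verb and presents as noun phrase yields a complete S under the toy grammar.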