Introduction to Computational Linguistics Lecture 3 Examples

Upload: buddy-oliver

Post on 02-Jan-2016


Page 1: CSA2050 Introduction to Computational Linguistics Lecture 3 Examples

CSA2050 Introduction to Computational Linguistics

Lecture 3

Examples


Mar 2005 -- MR CSA2050 - Lecture III: Examples 2

Course Contents

1 (MR) Overview

2 (RF) Chomsky Hierarchy

3 (MR) Examples

4 (RF) Grammatical Categories

5, 6 (MR) Tagging

7 (RF) Morphology

8, 9, 10 (MR) Comp Morphology

11 (RF) Syntax

12, 13, 14 (MR) Grammar Formalism


Outline

Examples in the areas of
• Tokenisation
• Morphological Analysis
• Tagging
• Syntactic Analysis


Information Extraction

[Pipeline diagram with the labels: raw text, tokenisation, morphological analysis, named entity recognition, tagged text, syntactic analysis]


Tokenisation

The basic idea of tokenisation is to identify the basic tokens that are present in a text.

Mostly, tokens are the same as words, but not always.

Why should this be a problem?

John’s car cost €10,000.00.

“And it’s worth every penny”, he exclaimed.
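The two sentences above can be run through a minimal regular-expression tokeniser to see where the decisions lie. This is only a sketch: the pattern, the function name `tokenise`, and the choice to split the clitic ’s from its host word are illustrative assumptions, not the method used in the lecture or in any particular tool.

```python
import re

# Illustrative token pattern (an assumption, not a standard).
# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(r"""
    [€$]?\d+(?:,\d{3})*(?:\.\d+)?   # numbers and amounts such as 10,000.00
  | [’']\w+                         # clitics introduced by an apostrophe
  | \w+                             # plain words
  | [^\w\s]                         # any other single punctuation mark
""", re.VERBOSE)

def tokenise(text):
    """Return the list of tokens found in `text` (whitespace is dropped)."""
    return TOKEN_RE.findall(text)

print(tokenise("John’s car cost €10,000.00."))
print(tokenise("“And it’s worth every penny”, he exclaimed."))
```

Note that the amount is kept as a single token even though it contains a comma and a period, while the sentence-final period becomes a token of its own.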


Tokenisation Problems: Punctuation

• novel forms: .net, Micro$oft, :-)
• hyphenation:
    • linebreaks vs word-internal: e-mail, 898-0587
    • multi-word: the 90-cent-an-hour raise
    • confusion with dash
• apostrophes in contractions: we'll
• periods:
    • part of names: Amazon.com
    • numerical expressions: $1.99
    • abbreviations, end of sentence, haplology
• commas: 1,000,000


Other Problems

• Token-internal whitespace: 898 0464
• Interaction: the New York-New Haven railroad
• Mixed language tokens: u
• Automated language guesser
• Token equivalence (when are two tokens the same?)
• Case-normalization
• Sentence boundary detection
• Inconsistency: database, data-base, data base
• Demo: Xerox tokeniser
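Token equivalence and case-normalization can be illustrated with a deliberately crude comparison key. The helper name `equivalence_key` is hypothetical, and the rule (lower-case, then drop hyphens and internal whitespace) is an assumption chosen for the examples on this slide; it would wrongly merge pairs such as "re-cover" and "recover".

```python
def equivalence_key(token):
    """Map a token to a crude comparison key: lower-case it and drop
    hyphens, spaces and tabs. Hypothetical helper for illustration."""
    return "".join(ch for ch in token.lower() if ch not in "- \t")

variants = ["database", "data-base", "data base", "Database"]
print({equivalence_key(v) for v in variants})
```

Two tokens are then treated as "the same" when their keys are equal, which is one possible answer to the equivalence question, not the only one.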


Morphology

Simple versus complex words: dog, dogs

Complex words formed by concatenation of morphemes.

Morpheme: The smallest unit in a word that bears some meaning, such as dog and s.


Morphological Analysis

Morphological analysis of a word involves a segmentation problem.

• Segmentation: discovery of the component morphemes
    dogs → dog + s
    enlargement → en + large + ment
• Possible ambiguities:
    enlargement → enlarge + ment
                → en + largement
• Role of lexicon
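The role of the lexicon in segmentation can be sketched by enumerating every split of a word into pieces that the lexicon knows. The lexicon below is a toy assumption made up for this example, and the function name `segmentations` is hypothetical.

```python
def segmentations(word, lexicon):
    """Return every way of splitting `word` into pieces found in `lexicon`."""
    if not word:
        return [[]]          # one way to segment the empty string
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results

# Toy lexicon (an assumption for illustration). It deliberately omits
# "largement", so the split en + largement is never proposed.
LEXICON = {"dog", "s", "en", "large", "enlarge", "ment"}

for word in ("dogs", "enlargement"):
    for seg in segmentations(word, LEXICON):
        print(word, "→", " + ".join(seg))
```

With this lexicon, "enlargement" still receives two segmentations (en + large + ment and enlarge + ment), so the lexicon narrows the ambiguity without removing it.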


Morphological Analysis

John has a couple of rabbits

rabbits → rabbit + s
s indicates plural of noun rabbit
Is this the only possibility?


Morphological Analysis

John rabbits on and on

rabbits → rabbit + s
s indicates 3rd person singular of verb rabbit

The suffix “s” is a realisation of two entirely different morphemes.
The morpheme is something more abstract than the string which realises it.


Morphological Analysis

[Diagram: the morphemes +PL and +3S (morpheme world) linked to the suffixes -s and -a (suffix world)]


Morphological Analysis

Input Word → Morphological Parser → Output Analysis

Input word:
    rabbits

Output analysis:
    rabbit N PL
    rabbit V 3S

• Output is a string of morphemes
• Morpheme is employed in a loose sense that is useful for further processing
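The parser box can be imitated with a toy analyser that returns both readings of "rabbits". The stem lexicon, the suffix table, and the function name `analyse` are assumptions for illustration; the sketch handles only the pattern stem + s and says nothing about how ENGTWOL or the Xerox tools work internally.

```python
# Toy stem lexicon: stem -> set of categories (assumed for illustration).
STEMS = {"rabbit": {"N", "V"}, "dog": {"N"}}

# The surface suffix "s" realises a different morpheme per category.
SUFFIX_S = {"N": "PL", "V": "3S"}

def analyse(word):
    """Return all analyses of `word` as 'stem CATEGORY FEATURE' strings.
    Only words of the form stem + s are handled in this sketch."""
    analyses = []
    if word.endswith("s") and word[:-1] in STEMS:
        stem = word[:-1]
        for cat in sorted(STEMS[stem]):
            analyses.append(f"{stem} {cat} {SUFFIX_S[cat]}")
    return analyses

print(analyse("rabbits"))
```

The output is a list because the analyser does not choose between readings; picking one in context is left to a later step such as tagging.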


Morphological Analysis: ENGTWOL & Xerox

Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993-1995

• ENGTWOL demo
• Xerox morphological analysis


Morphological Synthesis

Input → Morphological Parser → Output Word

Input:
    rabbit N PL
    rabbit V 3S

Output word:
    rabbits

• Input is a string of morphemes
• Output is a word


Reversibility

Lookup
    APPLY UP> left
    left    leave+Verb+PastBoth+123SP
    left    left+Adv
    left    left+Adj
    left    left+Noun+Sg

Lookdown
    APPLY DOWN> leave+Adj
    left
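Reversibility means that analysis and synthesis query one and the same relation in opposite directions. The sketch below stores the four pairs from the lookup output as a plain list; the real tools encode the relation as a finite-state transducer, so the list and the function names `apply_up` and `apply_down` are stand-ins chosen to echo the prompts.

```python
# One relation between surface forms and analyses, taken from the
# lookup output above. A list of pairs is a stand-in for a transducer.
RELATION = [
    ("left", "leave+Verb+PastBoth+123SP"),
    ("left", "left+Adv"),
    ("left", "left+Adj"),
    ("left", "left+Noun+Sg"),
]

def apply_up(surface):
    """Lookup: surface form -> all analyses."""
    return [a for s, a in RELATION if s == surface]

def apply_down(analysis):
    """Lookdown: analysis -> all surface forms."""
    return [s for s, a in RELATION if a == analysis]

print(apply_up("left"))
print(apply_down("left+Adj"))
```

Because both functions read the same data, every analysis returned by `apply_up` maps back to the original surface form under `apply_down`.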


POS Tagging

In POS tagging, the task is to assign the most appropriate morphosyntactic label from amongst those listed in the lexicon, given the context.

• John leaves presents.
• Proper Names
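For "John leaves presents", both "leaves" and "presents" are listed in a lexicon as noun and as verb, so context has to decide. The sketch below scores every candidate tag sequence with hand-set scores for adjacent tag pairs. The tag names, the lexicon entries and above all the numbers are invented for illustration; a real tagger would estimate such preferences from a tagged corpus.

```python
from itertools import product

# Candidate tags per word, as a lexicon would list them (toy entries).
LEXICON = {
    "John": ["PROPN"],
    "leaves": ["NOUN", "VERB"],
    "presents": ["NOUN", "VERB"],
}

# Hand-set scores for adjacent tag pairs. These numbers are invented
# for illustration only; they are not corpus statistics.
BIGRAM_SCORE = {
    ("PROPN", "VERB"): 3,
    ("PROPN", "NOUN"): 1,
    ("VERB", "NOUN"): 3,
    ("NOUN", "VERB"): 2,
    ("NOUN", "NOUN"): 1,
    ("VERB", "VERB"): 0,
}

def tag(words):
    """Pick the tag sequence with the highest total bigram score."""
    candidates = product(*(LEXICON[w] for w in words))

    def score(tags):
        return sum(BIGRAM_SCORE.get(pair, 0) for pair in zip(tags, tags[1:]))

    return list(zip(words, max(candidates, key=score)))

print(tag(["John", "leaves", "presents"]))
```

Enumerating all sequences is fine for three words but grows exponentially with sentence length, which is why practical taggers use dynamic programming instead.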


Semantic Tagging

• Named Entity Recognition
• Basic idea is to recognise and tag named entities and classify them as being of type
    • Persons
    • Locations
    • Organisations
• Named Entity Recognition - Demo


Syntactic Analysis

Problem: given a sentence and a grammar/lexicon, discover the tree structure that the grammar assigns to the sentence.
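The problem statement can be made concrete with a small top-down parser over a toy context-free grammar. The grammar, the lexicon, the category names and the function names are all assumptions made for this sketch; it is not how the XIP parser works, and it would loop forever on a left-recursive grammar.

```python
# Toy grammar and lexicon (assumed for illustration).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["PN"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"John": {"PN"}, "leaves": {"N", "V"}, "presents": {"N", "V"}}

def parse(symbol, words, start):
    """Yield (tree, end) for every way `symbol` can cover words[start:end]."""
    if symbol not in GRAMMAR:  # lexical category: consume one word
        if start < len(words) and symbol in LEXICON.get(words[start], ()):
            yield (symbol, words[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in parse_sequence(rhs, words, start):
            yield (symbol, *children), end

def parse_sequence(symbols, words, start):
    """Yield (list of trees, end) for a sequence of symbols."""
    if not symbols:
        yield [], start
        return
    for tree, mid in parse(symbols[0], words, start):
        for rest, end in parse_sequence(symbols[1:], words, mid):
            yield [tree] + rest, end

def parse_sentence(words):
    """Return all S trees that cover the whole sentence."""
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]

for tree in parse_sentence("John leaves presents".split()):
    print(tree)
```

Trees are nested tuples of the form (category, child, ...), with (category, word) at the leaves. The result is a list because a sentence may receive zero, one or several trees under a given grammar.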

XIP Parser Demo