

February 2007 CSA3050: Tagging I 1

CSA2050: Natural Language Processing

Tagging 1

• Tagging

• POS and Tagsets

• Ambiguities

• NLTK


Tagging 1 Lecture

• Slides based on Mike Rosner and Marti Hearst notes

• Diane Litman’s version of Steven Bird’s notes

• Additions from NLTK tutorials


Tagging

Mr. Sherlock Holmes, who was usually very X, …

What is the part of speech of X ?


Tagging

Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y

What is the part of speech of Y ?


Tagging

Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table


Tagging Terminology

• Tagging

– The process of associating labels with each token in a text

• Tags

– The labels

• Tag Set

– The collection of tags used for a particular task


Tagging Example

Typically a tagged text is a sequence of white-space separated base/tag tokens:

The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
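The base/tag format above can be read back into (word, tag) pairs with plain string operations. The following is a minimal sketch in Python; the helper name split_token is an assumption made for illustration, and the input is only the first ten tokens of the example.

```python
# First ten tokens of the tagged example, in base/tag format.
tagged_text = "The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/,"


def split_token(token):
    """Split one base/tag token at its LAST slash (hypothetical helper)."""
    word, _, tag = token.rpartition("/")
    return word, tag


# White-space separates the tokens; each token is then split into word and tag.
pairs = [split_token(token) for token in tagged_text.split()]

for word, tag in pairs:
    print(word, tag)
```

Splitting at the last slash rather than the first keeps words that themselves contain a slash in one piece.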


What does tagging do?

1. Collapses Some Distinctions

• Lexical identity may be discarded

• e.g. all personal pronouns tagged with PRP

2. … But Introduces Others

• Ambiguities may be removed

• e.g. deal tagged with NN or VB

• e.g. deal tagged with DEAL1 or DEAL2

3. Helps classification and prediction


Parts of Speech (POS)

• A word’s POS tells us a lot about the word and its neighbors:

– Limits the range of meanings (deal), pronunciation (noun OBject vs. verb obJECT) or both (wind)

– Helps in stemming

– Limits the range of following words for Speech Recognition

– Can help select nouns from a document for IR

– Basis for partial parsing (chunked parsing)

– Parsers can build trees directly on the POS tags instead of maintaining a lexicon


POS and Tagsets

• The choice of tagset greatly affects the difficulty of the problem

• Need to strike a balance between

– Getting better information about context (best: introduce more distinctions)

– Making it possible for classifiers to do their job (need to minimize distinctions)


Common Tagsets

• Brown corpus: 87 tags

• Penn Treebank: 45 tags

• Lancaster UCREL C5 (used to tag the British National Corpus - BNC): 61 tags

• Lancaster C7: 145 tags


Brown Corpus

• The first digital corpus (1961)

– Francis and Kucera, Brown University

• Contents: 500 texts, each 2000 words long

– From American books, newspapers, magazines

– Representing genres:

• Science fiction, romance fiction, press reportage, scientific writing, popular lore


Penn Treebank

• First syntactically annotated corpus

• 1 million words from Wall Street Journal

• Part of speech tags and syntax trees


Penn Treebank

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB    DT    NN      .
Book  that  flight  .

VBZ   DT    NN      VB     NN      ?
Does  that  flight  serve  dinner  ?


Penn Treebank


Penn Treebank – Important Tags


Penn Treebank – Verb Tags


Penn Treebank Example

(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))
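A bracketed tree like this one can be read with a few lines of recursive code. The sketch below is plain Python rather than the NLTK tree classes; the function names parse_tree and tagged_leaves are assumptions for illustration, and the tree string is retyped with balanced brackets.

```python
TREE = ("(S (NP-SBJ-1 (DT The) (NNP Senate)) (VP (VBZ plans) "
        "(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB take) (PRT (RP up)) "
        "(NP (DT the) (NN measure)) (ADV-TMP (RB quickly)))))) (. .))")


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def tagged_leaves(tree):
    """Collect (word, tag) pairs from the nodes that sit directly above a word."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_leaves(child))
    return pairs


pairs = tagged_leaves(parse_tree(TREE))
print(pairs)
```

The leaves recovered this way are exactly the part of speech tags; the syntax tree is the extra layer of annotation on top of them.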


Tagging

• Typically the set of tags is larger than basic parts of speech

• Tags often contain some morphological information

• Often referred to as “morphosyntactic labels”


Tagging Ambiguities

N      N-V    V-IN  DT  N
FRUIT  FLIES  LIKE  A   BANANA
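The number of candidate taggings grows multiplicatively with the ambiguous words. A small Python sketch, assuming a toy lexicon that holds only the tags shown on this slide, enumerates every combination:

```python
from itertools import product

# Toy lexicon: possible tags per word, taken from the slide (an assumption,
# not a real dictionary).
lexicon = {
    "fruit": ["N"],
    "flies": ["N", "V"],
    "like": ["V", "IN"],
    "a": ["DT"],
    "banana": ["N"],
}

sentence = ["fruit", "flies", "like", "a", "banana"]

# One candidate tagging per element of the Cartesian product of the tag lists.
candidates = [
    list(zip(sentence, tags))
    for tags in product(*(lexicon[word] for word in sentence))
]

for tagging in candidates:
    print(" ".join(f"{word}/{tag}" for word, tag in tagging))
```

Only two of the four printed sequences correspond to sensible readings of the sentence; a tagger has to pick among them.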


Interpretation 1

(S (NP (N FRUIT) (N FLIES))
   (VP (V LIKE)
       (NP (DT A) (N BANANA))))


Interpretation 2

(S (NP (N FRUIT))
   (VP (V FLIES)
       (PP (IN LIKE)
           (NP (DT A) (N BANANA)))))


Lots of ambiguities…

1. He can can a can.

2. I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.


Lots of ambiguities…

• In the Brown Corpus

– 11.5% of word types are ambiguous

– 40% of word tokens are ambiguous

• Most words in English are unambiguous.

• Many of the most common words are ambiguous.

• Typically ambiguous tags are not equally probable.
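The gap between the type and token figures can be seen on a tiny scale. The sketch below uses the sentence "He can can a can" with Penn Treebank tags assigned by hand (an assumption for illustration, not corpus data) and counts ambiguity both ways.

```python
from collections import defaultdict

# Hand-tagged toy data (assumed tags): modal "can", verb "can", noun "can".
tagged = [("He", "PRP"), ("can", "MD"), ("can", "VB"), ("a", "DT"), ("can", "NN")]

# Collect the set of tags observed for each word type.
tags_by_type = defaultdict(set)
for word, tag in tagged:
    tags_by_type[word.lower()].add(tag)

# A type is ambiguous if it was seen with more than one tag.
ambiguous_types = {word for word, tags in tags_by_type.items() if len(tags) > 1}

type_ratio = len(ambiguous_types) / len(tags_by_type)
token_ratio = sum(1 for word, _ in tagged if word.lower() in ambiguous_types) / len(tagged)

print(f"ambiguous types:  {type_ratio:.0%}")
print(f"ambiguous tokens: {token_ratio:.0%}")
```

One ambiguous type out of three accounts for three of the five tokens, which is the same effect, in miniature, as a few frequent ambiguous words dominating running text.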


Lots of ambiguities…

Brown Corpus (Table: DeRose, 1988)

Unambiguous (1 tag)     35,340 types
Ambiguous (2-7 tags)     4,100 types
    2 tags               3,760
    3 tags                 264
    4 tags                  61
    5 tags                  12
    6 tags                   2
    7 tags                   1
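As a quick arithmetic check on the table, the per-tag counts can be summed and compared with the stated total. This is a small Python sketch; the variable names are chosen here for illustration.

```python
# Counts copied from the table above.
unambiguous_types = 35340
types_by_tag_count = {2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

# The rows for 2-7 tags should add up to the "Ambiguous" total.
ambiguous_types = sum(types_by_tag_count.values())

# Share of all word types in this table that carry more than one tag.
ambiguous_share = ambiguous_types / (unambiguous_types + ambiguous_types)

print(ambiguous_types)
print(f"{ambiguous_share:.1%}")
```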


Approaches to Tagging

1. Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995)

2. Stochastic Tagger: HMM-based Tagger

3. Transformation-Based Tagger: Brill Tagger (Brill 1995)


NLTK

• Natural Language Toolkit (NLTK)

• http://nltk.sourceforge.net/

• Please download and install!

• Runs on Python


NLTK Introduction

• The Natural Language Toolkit (NLTK) provides:

– Basic classes for representing data relevant to natural language processing.

– Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.

– Standard implementations of each task, which can be combined to solve complex problems.

• Two versions: NLTK and NLTK-Lite


NLTK Modules

• nltk.token: processing individual elements of text, such as words or sentences.

• nltk.probability: modeling frequency distributions and probabilistic systems.

• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.

• nltk.parser: high-level interface for parsing texts.

• nltk.chartparser: a chart-based implementation of the parser interface.

• nltk.chunkparser: a regular-expression based surface parser.


Python for NLP

• Python is a great language for NLP:

– Simple

– Easy to debug:

• Exceptions

• Interpreted language

– Easy to structure

• Modules

• Object oriented programming

– Powerful string manipulation


Python Modules and Packages

• Python modules “package program code and data for reuse.” (Lutz)

– Similar to library in C, package in Java.

• Python packages are hierarchical modules (i.e., modules that contain other modules).

• Three commands for accessing modules:

1. import

2. from…import

3. reload


Import Command

• The import command loads a module:

# Load the regular expression module

>>> import re

• To access the contents of a module, use dotted names:

# Use the search function from the re module (text is any string)

>>> re.search(r'\w+', text)

• To list the contents of a module, use dir:

>>> dir(re)

['DOTALL', 'I', 'IGNORECASE', …]


from...import

• The from…import command loads individual functions and objects from a module:

# Load the search function from the re module

>>> from re import search

• Once an individual function or object is loaded with from…import, it can be used directly:

# Use the search function loaded from the re module (text is any string)

>>> search(r'\w+', text)


Import vs. from...import

import

• Keeps module functions separate from user functions.

• Requires the use of dotted names.

• Works with reload.

from…import

• Puts module functions and user functions together.

• More convenient names.

• Does not work with reload.
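The two styles can be compared side by side in a short script. This is a minimal sketch using the standard re module; the sample string is an assumption chosen for illustration.

```python
import re                # the module object: contents reached via dotted names
from re import search    # a single function copied into the current namespace

text = "Book that flight"

# Dotted name: it is always clear that search comes from re.
match_dotted = re.search(r"\w+", text)

# Bare name: shorter, but it now shares a namespace with user-defined names.
match_bare = search(r"\w+", text)

print(match_dotted.group(), match_bare.group())

# Both names refer to the very same function object.
print(search is re.search)
```

A user-defined function called search would silently replace the bare name, which is the risk behind "puts module functions and user functions together".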


Reload

• If you edit a module, you must use the reload command before the changes become visible in Python:

>>> import mymodule

...

>>> reload(mymodule)   # in Python 3, first run: from importlib import reload

• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
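The difference can be demonstrated with a throwaway module. The sketch below assumes Python 3, where reload lives in importlib; the module name reload_demo_mod and its greet function are hypothetical and exist only for this demonstration.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo free of cached bytecode

# Write a small module to a temporary directory and make it importable.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "reload_demo_mod.py")
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'old'\n")
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import reload_demo_mod
from reload_demo_mod import greet

# "Edit" the module on disk, then reload it.
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'new version'\n")
importlib.reload(reload_demo_mod)

via_module = reload_demo_mod.greet()  # dotted name sees the reloaded code
via_from_import = greet()             # name bound by from...import is still the old function

print(via_module, via_from_import)
```

To refresh the bare name as well, the from...import statement has to be run again after the reload.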



Next Sessions…

• Rule-Based Tagging

• Stochastic Tagging

• Hidden Markov Models (HMMs)

• N-Grams

• Read Jurafsky and Martin Chapter 4 (PDF)

• Install NLTK