SI 760 / EECS 597 / Ling 702 Language and Information Winter 2004 Handout #1


Page 1: SI 760 / EECS 597 / Ling 702 Language and Information

SI 760 / EECS 597 / Ling 702

Language and Information

Winter 2004

Handout #1

Page 2: SI 760 / EECS 597 / Ling 702 Language and Information

Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: TBA

• Course page:

http://www.si.umich.edu/~radev/LNI-winter2004/

• Class meets on Mondays, 1-4 PM in 412 WH

Page 3: SI 760 / EECS 597 / Ling 702 Language and Information

Readings

• Two introductions to statistical NLP

• Collocations paper

• Joshua Goodman’s language modeling tutorial

• Documentation for the CMU LM toolkit

Page 4: SI 760 / EECS 597 / Ling 702 Language and Information

N-gram Models

Page 5: SI 760 / EECS 597 / Ling 702 Language and Information

Word Prediction

• Example: “I’d like to make a collect …”

• “I have a gub”

• “He is trying to fine out”

• “Hopefully, all with continue smoothly in my absence”

• “They are leaving in about fifteen minuets to go to her house”

• “I need to notified the bank of [this problem]”

• Language model: a statistical model of word sequences

Page 6: SI 760 / EECS 597 / Ling 702 Language and Information

Counting Words

• Brown corpus (1 million words from 500 texts)

• Example: “He stepped out into the hall, was delighted to encounter a water brother” - how many words?

• Word forms vs. lemmas: “cat” and “cats” are distinct word forms that share the same lemma (compare also tokens vs. types)

• Shakespeare’s complete works: 884,647 word tokens and 29,066 word types

• Brown corpus: 61,805 types and 37,851 lemmas

• American Heritage 3rd edition has 200,000 “boldface forms” (including some multiword phrases)

Page 7: SI 760 / EECS 597 / Ling 702 Language and Information

Unsmoothed N-grams

• First approximation: each word has an equal probability of following any other. E.g., with 100,000 words, the probability of each of them at any given point is .00001

• “the” appears 69,971 times in the Brown corpus, while “rabbit” appears 11 times

• “Just then, the white …”

P(w1,w2,…, wn) = P(w1) P(w2 |w1) P(w3|w1w2) … P(wn |w1w2…wn-1)
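The chain rule above maps directly onto code. A minimal sketch, assuming a hypothetical cond_prob(word, history) estimator supplied by some model:

```python
def sentence_prob(words, cond_prob):
    """P(w1, ..., wn) via the chain rule: one conditional
    probability per word, multiplied together."""
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:i]))  # P(w_i | w_1 ... w_{i-1})
    return p
```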

Page 8: SI 760 / EECS 597 / Ling 702 Language and Information

A language model

• The sum of probabilities of all strings has to be 1.

• Bigram and trigram models

• How do you estimate the probabilities?

Bigram model: replace P(wn |w1w2…wn-1) with P(wn |wn-1)
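One way to estimate the bigram probabilities is by maximum likelihood from corpus counts; a minimal sketch (the toy corpus is purely illustrative):

```python
from collections import Counter

def train_bigram_mle(tokens):
    """Unsmoothed (maximum-likelihood) bigram estimates:
    P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = train_bigram_mle("the cat sat on the mat".split())
print(probs[("the", "cat")])  # 0.5: "the" occurs twice, once followed by "cat"
```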

Page 9: SI 760 / EECS 597 / Ling 702 Language and Information

Perplexity of a language model

• What is the perplexity of guessing a digit if all digits are equally likely?

• How about a letter?

• How about guessing A with a probability of 1/4, B with a probability of 1/2, and 10,000 other cases with a probability of 1/2 total (example modified from Joshua Goodman)?

Perp = 2^( -(1/N) Σi log2 P(Si) )
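A small sketch of this computation, treating the model's own distribution as the test data (so perplexity is 2 to the entropy). The skewed case assumes P(A) = P(B) = 1/4 with the 10,000 other cases sharing the remaining 1/2, which is one consistent reading of the example's numbers:

```python
import math

def perplexity(probs):
    """2^H for a distribution, where H = -sum p * log2 p."""
    return 2 ** -sum(p * math.log2(p) for p in probs if p > 0)

print(perplexity([0.1] * 10))    # 10 -- uniform over ten digits
print(perplexity([1/26] * 26))   # 26 -- uniform over letters
skewed = [0.25, 0.25] + [0.5 / 10000] * 10000
print(perplexity(skewed))        # ~283, far below uniform over 10,002 outcomes
```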

Page 10: SI 760 / EECS 597 / Ling 702 Language and Information

Perplexity across distributions

• What if the actual distribution is very different from the expected one?

• Example: all of the 10,000 other cases are equally likely but P(A) = P(B) = 0.

• Cross-entropy = log2 (perplexity), measured in bits

Page 11: SI 760 / EECS 597 / Ling 702 Language and Information

Smoothing techniques

• Imagine the following situation: you are looking at Web pages and you have to guess how many different languages they are written in.

• First one is in English, then French, then again English, then Korean, then Chinese, etc. Total: 5F, 3E, 1K, 1C

• Can you predict what the next language will be?

• What is a problem with the simplest approach to this problem?

Page 12: SI 760 / EECS 597 / Ling 702 Language and Information

Smoothing

• Why smoothing?

• How many parameters are there in the model (given 100,000 possible words)?

• What are the unsmoothed (ML) estimates for unigrams, bigrams, trigrams?

• Linear interpolation (mixture with λi).

• How to estimate λi?
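A minimal sketch of the interpolated estimate. The component estimators p_uni, p_bi, p_tri and the λ values here are hypothetical; in practice the λi are tuned on held-out data (e.g. with EM), which is one answer to the question above:

```python
def interp_prob(w, u, v, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mixture P(w | u v) = l1*P(w) + l2*P(w | v) + l3*P(w | u v),
    with l1 + l2 + l3 = 1 so the result is still a distribution."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, v) + l3 * p_tri(w, u, v)
```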

Page 13: SI 760 / EECS 597 / Ling 702 Language and Information

Example

• Consider the problem of estimating bigram probabilities from a training corpus.

• Probability mass must be 1.

• How to account for unseen events?

Page 14: SI 760 / EECS 597 / Ling 702 Language and Information

Common methods

• Add-1 smoothing (add one to each count in the numerator, add the vocabulary size V to the denominator; see the sketch after this list)

• Good-Turing smoothing

• Best: Kneser-Ney
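A minimal sketch of add-1 smoothing for bigrams; the tokens and vocabulary size are whatever the caller's corpus provides:

```python
from collections import Counter

def add_one_bigram(tokens, vocab_size):
    """Laplace-smoothed estimate:
    P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(w1, w2):
        # Unseen bigrams get a small but nonzero probability.
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
    return prob
```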

Page 15: SI 760 / EECS 597 / Ling 702 Language and Information

Markov Models

• Assumption: we can predict the probability of some future item on the basis of a short history

• Bigrams: first-order Markov models

• Bigram grammars: an N-by-N matrix of probabilities, where N is the size of the vocabulary being modeled.

Page 16: SI 760 / EECS 597 / Ling 702 Language and Information

Relative Frequencies

            a   aardvark  aardwolf  aback  …  zoophyte  zucchini
a           X   0         0         0      …  X         X
aardvark    0   0         0         0      …  0         0
aardwolf    0   0         0         0      …  0         0
aback       X   X         X         0      …  X         X
…           …   …         …         …      …  …         …
zoophyte    0   0         0         X      …  0         0
zucchini    0   0         0         X      …  0         0

Page 17: SI 760 / EECS 597 / Ling 702 Language and Information

Language Modeling and Statistical Machine Translation

Page 18: SI 760 / EECS 597 / Ling 702 Language and Information

The Noisy Channel Model

• Source-channel model of communication

• Parametric probabilistic models of language and translation

• Training such models

Page 19: SI 760 / EECS 597 / Ling 702 Language and Information

Statistics

• Given f, guess e

e → [encoder] → f → [decoder] → e′

e′ = argmax_e P(e|f) = argmax_e P(f|e) P(e)

translation model: P(f|e); language model: P(e)
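In code, decoding is just an argmax over candidate translations. A sketch where candidates, tm_logprob, and lm_logprob are hypothetical stand-ins for a real system's candidate generator, translation model, and language model (log probabilities avoid underflow):

```python
def decode(f, candidates, tm_logprob, lm_logprob):
    """e' = argmax_e P(f|e) * P(e), computed in log space."""
    return max(candidates(f), key=lambda e: tm_logprob(f, e) + lm_logprob(e))
```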

Page 20: SI 760 / EECS 597 / Ling 702 Language and Information

Parametric probabilistic models

• Language model (LM)

• Deleted interpolation

• Translation model (TM)

P(e) = P(e1, e2, …, eL) = P(e1) P(e2|e1) … P(eL|e1 … eL-1)

Deleted interpolation approximates the full history with a trigram: P(eL|e1 … eL-1) ≈ P(eL|eL-2, eL-1)

Alignment: P(f,a|e)

Page 21: SI 760 / EECS 597 / Ling 702 Language and Information

IBM’s EM trained models

1. Word translation

2. Local alignment

3. Fertilities

4. Class-based alignment

5. Non-deficient algorithm (avoid overlaps, overflow)

Page 22: SI 760 / EECS 597 / Ling 702 Language and Information

Bayesian formulas

• argmaxe P(e | f) = ?

• P(e|f) = P(e) * P(f | e) / P(f)

• argmaxe P(e | f) = argmaxe P(e) * P(f | e), since P(f) is constant with respect to e

The rest of the slides in this section are based on“A Statistical MT Tutorial Workbook” by Kevin Knight

Page 23: SI 760 / EECS 597 / Ling 702 Language and Information

N-gram model

• P(e) = ?

• P(how's it going?) = 76,413/1,000,000,000 = 0.000076413

• Bigrams: b(y|x) = count (xy)/count (x)

• P(“I like snakes that are not poisonous”) = P(“I”|start) * P(“like”|”I”) * …

• Trigrams: b(z|xy) = ??
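By analogy with the bigram case, presumably b(z|xy) = count(xyz)/count(xy); a minimal sketch:

```python
from collections import Counter

def train_trigram_mle(tokens):
    """Unsmoothed trigram estimates: b(z | x y) = count(xyz) / count(xy)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {(x, y, z): c / bigrams[(x, y)] for (x, y, z), c in trigrams.items()}
```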

Page 24: SI 760 / EECS 597 / Ling 702 Language and Information

Smoothing

• b(z | x y) = 0.95 * count (“xyz”) / count (“xy”) + 0.04 * count (“yz”) / count (“y”) + 0.008 * count (“z”) / totalwordsseen + 0.002
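The same interpolation as straight-line code, taking precomputed counts as arguments; zero denominators are guarded so unseen histories still receive the 0.002 floor:

```python
def smoothed_trigram_prob(c_xyz, c_xy, c_yz, c_y, c_z, total_words):
    """b(z|xy) = 0.95*c(xyz)/c(xy) + 0.04*c(yz)/c(y)
               + 0.008*c(z)/total + 0.002"""
    p = 0.002
    if c_xy:
        p += 0.95 * c_xyz / c_xy
    if c_y:
        p += 0.04 * c_yz / c_y
    if total_words:
        p += 0.008 * c_z / total_words
    return p
```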

Page 25: SI 760 / EECS 597 / Ling 702 Language and Information

Ordering words

(1) a a centre earthquake evacuation forced has historic Italian of of second southern strong the the village

(2) met Paul Wendy

Page 26: SI 760 / EECS 597 / Ling 702 Language and Information

Translation models

• Mary did not slap the green witch.

• Mary not slap slap slap the the green witch

• Mary no daba una bofetada a la verde bruja

• Mary no daba una bofetada a la bruja verde

Page 27: SI 760 / EECS 597 / Ling 702 Language and Information

Translation models

• Fertility

• Permutation

Page 28: SI 760 / EECS 597 / Ling 702 Language and Information

IBM model 3

• Fertility

• Spurious words (e0)

• Pick words

• Pick positions

Page 29: SI 760 / EECS 597 / Ling 702 Language and Information

Translation models

• Mary did not slap the green witch.

• Mary not slap slap slap the green witch

• Mary not slap slap slap NULL the green witch

• Mary no daba una bofetada a la verde bruja

• Mary no daba una bofetada a la bruja verde

Page 30: SI 760 / EECS 597 / Ling 702 Language and Information

Parameters

• N - fertility (x*x)

• T - translation (x)

• D - position (x)

• P - probability p1 of inserting a spurious word

Page 31: SI 760 / EECS 597 / Ling 702 Language and Information

Example

NULL And the program has been implemented

Le programme a été mis en application

Page 32: SI 760 / EECS 597 / Ling 702 Language and Information

Alignments

[Figure: two candidate word alignments between “The blue house” and “La maison bleue”: one aligning the words in order, and one crossing so that “blue” links to “bleue” and “house” to “maison”.]

• Needed: P(a|f,e)

Page 33: SI 760 / EECS 597 / Ling 702 Language and Information

Markov models

Page 34: SI 760 / EECS 597 / Ling 702 Language and Information

Motivation

• Sequence of random variables that aren’t independent

• Example: weather reports

Page 35: SI 760 / EECS 597 / Ling 702 Language and Information

Properties

• Limited horizon: P(Xt+1 = sk | X1, …, Xt) = P(Xt+1 = sk | Xt)

• Time invariant (stationary): P(Xt+1 = sk | Xt) = P(X2 = sk | X1) for all t

• Definition: in terms of a transition matrix A and initial state probabilities π.

Page 36: SI 760 / EECS 597 / Ling 702 Language and Information

Example

[Figure: state diagram of a Markov chain over the states h, a, p, e, t, i, with transition probabilities (0.3, 0.4, 0.6, 1.0, …) on the arcs; the start state enters t with probability 1.0, P(i|t) = 0.3, and P(p|i) = 0.6, the values used in the computation two slides below.]

Page 37: SI 760 / EECS 597 / Ling 702 Language and Information

Visible MM

P(X1,…,XT) = P(X1) P(X2|X1) P(X3|X1,X2) … P(XT|X1,…,XT-1)

= P(X1) P(X2|X1) P(X3|X2) … P(XT|XT-1)

= π_{X1} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}

P(t, i, p) = P(X1 = t) P(X2 = i | X1 = t) P(X3 = p | X2 = i)

= 1.0 x 0.3 x 0.6

= 0.18
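The same computation as code, with the initial and transition probabilities read off the state diagram above:

```python
init = {"t": 1.0}                           # pi: P(X1 = t)
trans = {("t", "i"): 0.3, ("i", "p"): 0.6}  # a_{X_t X_{t+1}}

def seq_prob(states):
    """P(X1, ..., XT) = pi_{X1} * product of transition probabilities."""
    p = init.get(states[0], 0.0)
    for s, t in zip(states, states[1:]):
        p *= trans.get((s, t), 0.0)
    return p

print(seq_prob(["t", "i", "p"]))  # 1.0 * 0.3 * 0.6 = 0.18
```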

Page 38: SI 760 / EECS 597 / Ling 702 Language and Information

Hidden MM

[Figure: hidden Markov model with states COLA and ICE TEA; from COLA the machine stays in COLA with probability 0.7 and moves to ICE TEA with probability 0.3; from ICE TEA it stays with probability 0.5 and moves to COLA with probability 0.5.]

Page 39: SI 760 / EECS 597 / Ling 702 Language and Information

Hidden MM

• P(Ot=k|Xt=si,Xt+1=sj) = bijk

          cola   icetea   lemonade
COLA      0.6    0.1      0.3
ICETEA    0.1    0.7      0.2

Page 40: SI 760 / EECS 597 / Ling 702 Language and Information

Example

• P(lemonade,icetea|COLA) = ?

• P = 0.7 x 0.3 x 0.7 x 0.1 + 0.7 x 0.3 x 0.3 x 0.1 + 0.3 x 0.3 x 0.5 x 0.7 + 0.3 x 0.3 x 0.5 x 0.7 = 0.084
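A brute-force check of the 0.084 figure, summing over all hidden state paths under a state-emission reading of the table above:

```python
trans = {"COLA": {"COLA": 0.7, "ICETEA": 0.3},
         "ICETEA": {"COLA": 0.5, "ICETEA": 0.5}}
emit = {"COLA": {"cola": 0.6, "icetea": 0.1, "lemonade": 0.3},
        "ICETEA": {"cola": 0.1, "icetea": 0.7, "lemonade": 0.2}}

def obs_prob(obs, state):
    """P(observation sequence | starting state), summed over all paths."""
    if not obs:
        return 1.0
    return emit[state][obs[0]] * sum(
        p * obs_prob(obs[1:], nxt) for nxt, p in trans[state].items())

print(obs_prob(["lemonade", "icetea"], "COLA"))  # 0.084 (up to float rounding)
```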

Page 41: SI 760 / EECS 597 / Ling 702 Language and Information

Hidden MM

• Part of speech tagging, speech recognition, gene sequencing

• Parameters: A = state transition probabilities, B = symbol emission probabilities, π = initial state probabilities

• Three tasks:

– Given μ = (A, B, π), find P(O|μ)

– Given O and μ, what is the most likely state sequence (X1, …, XT+1)?

– Given O and a space of all possible μ, find the model that best describes the observations

Page 42: SI 760 / EECS 597 / Ling 702 Language and Information

Collocations

Page 43: SI 760 / EECS 597 / Ling 702 Language and Information

Collocations

• Idioms

• Free word combinations

• Know a word by the company that it keeps (Firth)

• Common use

• No general syntactic or semantic rules

• Important for non-native speakers

Page 44: SI 760 / EECS 597 / Ling 702 Language and Information

Examples

Idioms: to kick the bucket, dead end, to catch up

Collocations: to trade actively, table of contents, orthogonal projection

Free-word combinations: to take the bus, the end of the road, to buy a house

Page 45: SI 760 / EECS 597 / Ling 702 Language and Information

Uses

• Disambiguation (e.g., “bank” co-occurring with “loan” vs. “river”)

• Translation

• Generation

Page 46: SI 760 / EECS 597 / Ling 702 Language and Information

Properties

• Arbitrariness

• Language- and dialect-specific

• Common in technical language

• Recurrent in context

• (see Smadja 83)

Page 47: SI 760 / EECS 597 / Ling 702 Language and Information

Arbitrariness

• Make an effort vs. *make an exertion

• Running commentary vs. *running discussion

• Commit treason vs. *commit treachery

Page 48: SI 760 / EECS 597 / Ling 702 Language and Information

Cross-lingual properties

• Régler la circulation = direct traffic

• Russian, German, Serbo-Croatian: direct translation is used

• American English (AE): set the table, make a decision

• British English (BE): lay the table, take a decision

• “semer le désarroi” - “to sow disarray” - “to wreak havoc”

Page 49: SI 760 / EECS 597 / Ling 702 Language and Information

Types of collocations

• Grammatical: come to, put on; afraid that, fond of, by accident, witness to

• Semantic (only certain synonyms)

• Flexible: find/discover/notice by chance

Page 50: SI 760 / EECS 597 / Ling 702 Language and Information

Base        Collocator     Example
noun        verb           set the table
noun        adjective      warm greetings
verb        adverb         struggle desperately
adjective   adverb         sound asleep
verb        preposition    put on

Page 51: SI 760 / EECS 597 / Ling 702 Language and Information

Extracting collocations

• Mutual information:

I(x;y) = log2 [ P(x,y) / (P(x) P(y)) ]

• What if I(x;y) = 0?

• What if I(x;y) < 0?
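A sketch computing I(x;y) from corpus counts; the count arguments are hypothetical inputs from whatever pair-extraction step precedes this:

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """I(x;y) = log2[P(x,y) / (P(x)P(y))]. Zero means x and y are
    independent; negative means they co-occur less often than chance."""
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))
```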

Page 52: SI 760 / EECS 597 / Ling 702 Language and Information

Yule’s coefficient

A - frequency of lemma pairs involving both Li and Lj

B - frequency of pairs involving Li only

C - frequency of pairs involving Lj only

D - frequency of pairs involving neither

YUL = (AD - BC) / (AD + BC),  with -1 ≤ YUL ≤ 1
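As code, with the four frequencies defined above:

```python
def yule(a, b, c, d):
    """YUL = (AD - BC) / (AD + BC); 1 when the lemmas only occur
    together (BC = 0), -1 when they never do (AD = 0)."""
    return (a * d - b * c) / (a * d + b * c)
```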

Page 53: SI 760 / EECS 597 / Ling 702 Language and Information

Specific mutual information

• Used in extracting bilingual collocations

I(e;f) = log2 [ p(e,f) / (p(e) p(f)) ]

• p(e,f) - probability of finding both e and f in aligned sentences

• p(e), p(f) - probabilities of finding the word in one of the languages

Page 54: SI 760 / EECS 597 / Ling 702 Language and Information

Example from the Hansard corpus (Brown, Lai, and Mercer)

French word    Mutual information
sein           5.63
bureau         5.63
trudeau        5.34
premier        5.25
résidence      5.12
intention      4.57
no             4.53
session        4.34

Page 55: SI 760 / EECS 597 / Ling 702 Language and Information

Flexible and rigid collocations

• Example (from Smadja): “free” and “trade”

Position:  Total  p-5  p-4  p-3  p-2   p-1  p+1  p+2  p+3  p+4  p+5
Count:      8031    7    6   13    5  7918    0   12   20   26   24

Page 56: SI 760 / EECS 597 / Ling 702 Language and Information

Xtract (Smadja)

• The Dow Jones Industrial Average

• The NYSE’s composite index of all its listed common stocks fell *NUMBER* to *NUMBER*

Page 57: SI 760 / EECS 597 / Ling 702 Language and Information

Translating Collocations

• Brush up a lesson = repasser une leçon

• Bring about = осуществлять (Russian)

• From the Hansards: late spring = fin du printemps; Atlantic Canada Opportunities Agency = Agence de promotion économique du Canada atlantique