Katrin Erk Distributional models

Page 1: Katrin Erk Distributional models. Representing meaning through collections of words Doc 1: Abdullah boycotting challenger commission dangerous election

Katrin Erk

Distributional models

Page 2:

Representing meaning through collections of words

Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

Doc 3: applications documents engines information iterated library metadata precision query statistical web

Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you

Page 3:

Representing meaning through collections of words

Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday

Source: Washington Post, Oct 24, 2009, on elections in Afghanistan

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

Source: Wikipedia (version Oct 24, 2009) on the movie “Where the Wild Things Are”

Doc 3: applications documents engines information iterated library metadata precision query statistical web

Source: Wikipedia (version Oct 24, 2009) on Information Retrieval

Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you

Source: garden.org: Planning a Vegetable Garden

Page 4:

Representing meaning through a collection of words

What parts of the meaning of a document can you capture through an unordered collection of words?

How can you make use of such collections?

Page 5:

Representing meaning through a collection of words

What parts of the meaning of a document can you capture through an unordered collection of words?

• General topic information: what is the document about?

• More specifically: things mentioned in the document

How can you make use of such collections?

• Documents on similar topics contain similar words

• Use in Information Retrieval (search)

Page 6:

Representing collections of words through tables of counts

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

word      count
film      24
wild      18
max       16
that      16
things    12
as        12
[edit]    11
him       9
jonze     9
released  6
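A table of counts like this can be produced mechanically. The following is a minimal sketch in Python, not the procedure used for the slides: the tokenizer (lowercase, keep runs of letters and apostrophes) and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def count_words(text):
    """Return a table of counts (word -> frequency) for a text."""
    # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Made-up sample text, only for illustration.
sample = "The film was released in 2009. The film follows Max."
table = count_words(sample)

# Show the most frequent words first.
for word, count in table.most_common(3):
    print(word, count)
```

Word order is discarded entirely; only the frequencies survive, which is exactly the "unordered collection of words" view.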

Page 7:

Representing collections of words through tables of counts

We can now compare documents by comparing tables of counts.

What can you tell about the second document below?

word      first document (Doc 2)   second document
film      24                       17
wild      18                       0
max       16                       0
that      16                       9
things    12                       0
as        12                       36
[edit]    11                       8
him       9                        7
jonze     9                        0
released  6                        3

Page 8:

The “second document”: a more extensive list of words

the 167, and 58, of 58, to 56, a 49, in 37, as 36, is 33, victor 30,

* 27, with 26, by 23, her 18, film 17, for 16, emily 15, was 15, corpse 14,

bride 13, victoria 13, his 13, on 13, from 11

What movie is this?

Page 9:

From tables to vectors

Interpret the table as a vector. Each entry is a dimension:

• “film” is a dimension. Document’s coordinate: 24

• “wild” is a dimension. Document’s coordinate: 18

• …

Then this document is a point in 10-dimensional space.

word      count
film      24
wild      18
max       16
that      16
things    12
as        12
[edit]    11
him       9
jonze     9
released  6
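The step from table to vector can be sketched as follows. This is an illustrative sketch only: the function name `to_vector` is hypothetical, and the fixed ordering of dimensions is an assumption needed so that every document uses the same coordinate for the same word.

```python
def to_vector(table, dimensions):
    """Turn a table of counts into a list of coordinates, one per dimension."""
    # Words missing from the table get coordinate 0.
    return [table.get(word, 0) for word in dimensions]

# The ten dimensions from the slide, in a fixed order.
dimensions = ["film", "wild", "max", "that", "things",
              "as", "[edit]", "him", "jonze", "released"]

# Counts for "Where the Wild Things Are", taken from the slide.
wild_things = {"film": 24, "wild": 18, "max": 16, "that": 16, "things": 12,
               "as": 12, "[edit]": 11, "him": 9, "jonze": 9, "released": 6}

vector = to_vector(wild_things, dimensions)
print(vector)       # [24, 18, 16, 16, 12, 12, 11, 9, 9, 6]
print(len(vector))  # 10
```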

Page 10:

Documents as points in vector space

Viewing “Wild Things” and “Corpse Bride” as vectors/points in vector space: similarity between them as proximity in space

[Figure: “Corpse Bride” and “Where the Wild Things Are” as points in vector space]

“Distributional model”, “vector space model”, “semantic space model” used interchangeably here

Page 11:

What have we gained?

• The representation of a document in vector space can be computed completely automatically: just count words

• Similarity in vector space is a good predictor of similarity in topic: documents that contain similar words tend to be about similar things

Page 12:

What do we mean by “similarity” of vectors?

Euclidean distance (a dissimilarity measure!): the straight-line distance between the two points

[Figure: “Corpse Bride” and “Where the Wild Things Are” as points in vector space]
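For two vectors x and y, the Euclidean distance is d(x, y) = sqrt(sum over i of (x_i - y_i)^2). A minimal sketch, applied to the two 10-dimensional count vectors from the earlier slides (the function name is hypothetical):

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Counts on the dimensions film, wild, max, that, things,
# as, [edit], him, jonze, released (from the slides).
wild_things = [24, 18, 16, 16, 12, 12, 11, 9, 9, 6]
corpse_bride = [17, 0, 0, 9, 0, 36, 8, 7, 0, 3]

distance = euclidean_distance(wild_things, corpse_bride)
print(round(distance, 2))  # 38.74
```

Larger values mean less similar documents, which is why this is a dissimilarity measure. Raw distances also grow with document length, since longer documents have larger counts.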

Page 13:

What do we mean by “similarity” of vectors?

Cosine similarity: the cosine of the angle between the two vectors

[Figure: “Corpse Bride” and “Where the Wild Things Are” as vectors in vector space]
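The cosine of the angle between x and y is cos(x, y) = (x · y) / (|x| |y|). A minimal sketch on the same two count vectors from the slides; the function name is hypothetical and both vectors are assumed to be non-zero:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Counts on the dimensions film, wild, max, that, things,
# as, [edit], him, jonze, released (from the slides).
wild_things = [24, 18, 16, 16, 12, 12, 11, 9, 9, 6]
corpse_bride = [17, 0, 0, 9, 0, 36, 8, 7, 0, 3]

similarity = cosine_similarity(wild_things, corpse_bride)
print(round(similarity, 2))  # 0.61
```

Unlike Euclidean distance, the cosine depends only on the direction of the vectors, so scaling all counts of a document by the same factor leaves it unchanged.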

Page 14:

What have we gained?

• We can compute the similarity of documents through their Euclidean distance or through their cosine

• We can also represent a query as a vector: just count the words in the query

• Now we can search for documents similar to the query
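Search by similarity to a query can be sketched like this. The sample words for the four documents come from the earlier slides; the query, the whitespace tokenization and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two tables of counts."""
    dot = sum(c1[w] * c2[w] for w in c1)  # a Counter returns 0 for missing words
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)

# Sample words of the four documents from the slides (lowercased).
documents = {
    "Doc 1": "abdullah boycotting challenger commission dangerous election "
             "interest karzai necessary runoff sunday",
    "Doc 2": "animatronics are children's igloo intimidating jonze kingdom "
             "smashing wild",
    "Doc 3": "applications documents engines information iterated library "
             "metadata precision query statistical web",
    "Doc 4": "cucumbers crops discoveries enjoyable fill garden leafless "
             "plotting vermont you",
}

def search(query, documents):
    """Rank documents by cosine similarity to the query, best first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.split())), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)

results = search("web search engines", documents)
print(results[0][1])  # Doc 3
```

The query is treated exactly like a very short document: it is counted, turned into a vector, and compared with every document vector.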

Page 15:

From documents to words

The same holds for words as for documents:

• Context words are a good indicator of meaning

• Similar words tend to occur in similar contexts

What is a context? How do we count here?

• Take all the occurrences of our target word in a large text

• Take a context window, e.g. 10 words either side

• Count all that occurs there

Page 16:

Representing the meaning of a word through a collection of context words

Emerging from the earth is Emily, the "Corpse Bride," a beautiful undead girl in a moldy bridal gown who declares Victor her husband.

word       count
a          2
the        2
corpse     1
emerging   1
from       1
is         1
undead     1
beautiful  1
moldy      1
bride      1
in         1
earth      1
girl       1

Counts for target “Emily”, 10 words context either side.

Page 17:

Representing the meaning of a word through a collection of context words

• Go through all occurrences of “Emily” in a large corpus

• Count words in 10-word window for each occurrence, sum up

word       count
a          2
the        2
corpse     1
emerging   1
from       1
is         1
undead     1
beautiful  1
moldy      1
bride      1
in         1
earth      1
girl       1
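The counting procedure can be sketched as below. The tokenizer (lowercase, runs of letters only) and the function name are assumptions; with them, the example sentence from the slides yields the same table of 13 context words.

```python
import re
from collections import Counter

def context_counts(text, target, window=10):
    """Count words within `window` tokens on either side of each
    occurrence of `target`, summed over all occurrences."""
    # Assumed tokenization: lowercase, keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

sentence = ('Emerging from the earth is Emily, the "Corpse Bride," a beautiful '
            'undead girl in a moldy bridal gown who declares Victor her husband.')

counts = context_counts(sentence, "emily")
print(counts["a"], counts["the"], counts["bride"])  # 2 2 1
```

“bridal” is the eleventh word to the right of “Emily”, so it falls just outside the 10-word window and is not counted.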

Page 18:

Some co-occurrences: “letter” in “Pride and Prejudice”

jane: 12, when: 14, by: 15, which: 16, him: 16, with: 16, elizabeth: 17, but: 17, he: 17, be: 18, s: 20, on: 20,

not: 21, for: 21, mr: 22, this: 23, as: 23, you: 25, from: 28, i: 28, had: 32, that: 33, in: 34,

was: 34, it: 35, his: 36, she: 41, her: 50, a: 52, and: 56, of: 72, to: 75, the: 102

This is not a large text! Large = something like 100 million words at least

Page 19:

From tables to vectors

Interpret the table as a vector. Each entry is a dimension:

• “admirer” is a dimension. Coordinate of “letter”: 1. Coordinate of “surprise”: 0

• “all” is a dimension. Coordinate of “letter”: 8. Coordinate of “surprise”: 7

Then each word is a point in n-dimensional space.

Counts for “letter” and “surprise” from Pride and Prejudice

Page 20:

What have we gained?

• The representation of a word in vector space can be computed completely automatically: just count co-occurring words in all contexts

• Similarity in vector space is a good predictor of meaning similarity: words that occur in similar contexts tend to be similar in meaning

• Synonyms are close together in vector space

• Antonyms too

Page 21:

Parameters of vector space models

W. Lowe (2001): “Towards a theory of semantic space”

A semantic space is defined as a tuple (A, B, S, M):

• B: base elements

• A: mapping from raw co-occurrence counts to something else, to correct for frequency effects

• S: similarity measure

• M: transformation of the whole space to different dimensions

Page 22:

B: base elements

We have seen: context words as base elements

Term x document matrix:

• Represent document as vector of weighted terms

• Represent term as vector of weighted documents
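A term x document matrix supports both views at once: a column is a document vector, a row is a term vector. The sketch below uses made-up documents and raw counts as the weights, which is an assumption; other weighting schemes would replace the raw counts.

```python
from collections import Counter

# Made-up toy documents, only for illustration.
documents = {
    "d1": "film wild film max",
    "d2": "film bride corpse",
    "d3": "garden crops garden",
}

counts = {name: Counter(text.split()) for name, text in documents.items()}
doc_names = list(documents)
terms = sorted({word for table in counts.values() for word in table})

# Rows are terms, columns are documents; raw counts serve as weights here.
matrix = [[counts[d][t] for d in doc_names] for t in terms]

def term_vector(term):
    """A term as a vector over documents (one row of the matrix)."""
    return matrix[terms.index(term)]

def document_vector(name):
    """A document as a vector over terms (one column of the matrix)."""
    j = doc_names.index(name)
    return [row[j] for row in matrix]

print(terms)                  # ['bride', 'corpse', 'crops', 'film', 'garden', 'max', 'wild']
print(term_vector("film"))    # [2, 1, 0]
print(document_vector("d1"))  # [0, 0, 0, 2, 0, 1, 1]
```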

Page 23:

B: base elements

Dimensions: not words in a context window, but dependency paths starting from the target word (Padó & Lapata 2007)

Page 24:

A: transforming raw counts

Problem with vectors of raw counts: distortion through frequency of target word

Weight the counts:

• The count on dimension “and” will not be as informative as that on the dimension “angry”

• For example, using Pointwise Mutual Information between target a and context word b
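Pointwise Mutual Information compares how often a and b occur together with how often they would if they were independent: PMI(a, b) = log( P(a, b) / (P(a) P(b)) ). The sketch below estimates the probabilities from a table of co-occurrence counts. The counts are made up, and the base-2 logarithm and the choice to skip zero counts are assumptions.

```python
import math

def pmi_table(cooc):
    """Replace raw co-occurrence counts by PMI values.
    `cooc` maps target word -> {context word -> count}."""
    total = sum(sum(row.values()) for row in cooc.values())
    target_totals = {a: sum(row.values()) for a, row in cooc.items()}
    context_totals = {}
    for row in cooc.values():
        for b, n in row.items():
            context_totals[b] = context_totals.get(b, 0) + n

    pmi = {}
    for a, row in cooc.items():
        pmi[a] = {}
        for b, n in row.items():
            if n > 0:  # PMI is undefined for zero counts; skip them
                p_ab = n / total
                p_a = target_totals[a] / total
                p_b = context_totals[b] / total
                pmi[a][b] = math.log2(p_ab / (p_a * p_b))
    return pmi

# Made-up counts: "and" is frequent everywhere, "angry" is rare but selective.
cooc = {
    "letter":   {"and": 50, "angry": 5},
    "surprise": {"and": 45, "angry": 0},
}

pmi = pmi_table(cooc)
print(round(pmi["letter"]["and"], 2))    # -0.06
print(round(pmi["letter"]["angry"], 2))  # 0.86
```

Although “and” has ten times the raw count of “angry” for “letter”, its PMI is close to zero because it co-occurs with everything; “angry” gets the higher weight.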

Page 25:

M: transforming the whole space

Dimensionality reduction:

• Principal Component Analysis (PCA)

• Singular Value Decomposition (SVD)

Latent Semantic Analysis, LSA (also called Latent Semantic Indexing, LSI): Do SVD on term x document representation to induce “latent” dimensions that correspond to topics that a document can be about

Landauer & Dumais 1997
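The idea of reducing a term x document matrix with SVD can be sketched with NumPy. This is an illustration under assumptions, not the LSA setup of Landauer & Dumais: the matrix is made up, the counts are unweighted, and keeping k = 2 dimensions is an arbitrary choice.

```python
import numpy as np

# Made-up term x document counts: rows are terms, columns are documents.
# Documents 0 and 1 share one vocabulary, documents 2 and 3 another.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep (assumed)
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document
term_coords = U[:, :k] * s[:k]            # one row per term

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(doc_coords.shape)                        # (4, 2)
print(cosine(doc_coords[0], doc_coords[1]))    # close to 1
print(cosine(doc_coords[0], doc_coords[2]))    # close to 0
```

In the original space documents 0 and 1 have cosine 0.8; in the reduced space they coincide, because the two kept dimensions behave like two “topics”.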

Page 26:

Using similarity in vector spaces

Search/information retrieval: given a query and a document collection, use the term x document representation:

• Each document is a vector of weighted terms

• Also represent the query as a vector of weighted terms

• Retrieve the documents that are most similar to the query

Page 27:

Using similarity in vector spaces

To find synonyms:

• Synonyms tend to have more similar vectors than non-synonyms: synonyms occur in the same contexts

• But the same holds for antonyms: in vector spaces, “good” and “evil” are the same (more or less)

So: vector spaces can be used to build a thesaurus automatically

Page 28:

Using similarity in vector spaces

In cognitive science, to predict

• human judgments on how similar pairs of words are (on a scale of 1-10)

• “priming”

Page 29:

An automatically extracted thesaurus

Dekang Lin 1998:

• For each word, automatically extract similar words

• Vector space representation based on syntactic context of target (dependency parses)

• Similarity measure: based on mutual information (“Lin’s measure”)

• Large thesaurus, used often in NLP applications

Page 30:

Vectors for word senses

Up to now: one vector per word

Vector for “bank” conflates

• financial contexts

• fishing contexts

How to get to vectors for word senses?

Page 31:

Automatically inducing word senses

Schütze 1998: one vector per sentence, or per occurrence (token), of “letter”

• She wrote an angry letter to her niece.

• He sprayed the word in big letters.

• The newspaper gets 100 letters from readers every day.

Make token vector by adding up the vectors of all other (content) words in the sentence.

Cluster token vectors

Clusters = induced word senses
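The token-vector step can be sketched as follows. The two-dimensional word vectors are invented for illustration (real vectors would come from corpus counts), the choice of content words per sentence is an assumption, and the sketch stops before the clustering step: it only shows that token vectors for similar uses end up close together.

```python
import math

# Hypothetical 2-dimensional word vectors, made up for this example.
word_vectors = {
    "wrote": [3, 2], "angry": [2, 0], "niece": [2, 0],
    "sprayed": [0, 3], "word": [1, 4], "big": [1, 2],
    "newspaper": [4, 1], "gets": [1, 0], "readers": [3, 1], "day": [1, 0],
}

def token_vector(content_words):
    """Add up the vectors of the other content words in the sentence."""
    total = [0, 0]
    for word in content_words:
        vec = word_vectors[word]
        total = [t + v for t, v in zip(total, vec)]
    return total

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

# One token vector per occurrence of "letter" in the three example sentences.
t1 = token_vector(["wrote", "angry", "niece"])
t2 = token_vector(["sprayed", "word", "big"])
t3 = token_vector(["newspaper", "gets", "readers", "day"])

print(t1, t2, t3)  # [7, 2] [2, 9] [9, 2]
print(cosine(t1, t3) > cosine(t1, t2))  # True
```

With these toy vectors the first and third occurrences (letters as mail) are closer to each other than to the second (letters as characters), so a clustering algorithm run over the token vectors would be expected to group them accordingly.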

Page 32:

A vector for an individual occurrence of a word

Avoid having to define word senses

• Sometimes hard to divide uses into senses: words like “leave”, or “paint”

Erk/Pado 2008: Modify vector of “bank” using its syntactic context:

[Figure: the vector of “bank” combined with “break” (relation: obj) and with “fish” (relation: on)]

Page 33:

Summary: vector space models

Representing meaning through counts

• Represent document through content words

• Represent word meaning through context words / parse tree snippets / documents

Context items as dimensions, target as vector/point in semantic space

Proximity in semantic space ~ similarity between words

Page 34:

Summary: vector space models

Uses:

• Search

• Inducing ontologies

• Modeling human judgments of word similarity

• Represent word senses: cluster sentence vectors, or compute vectors for individual occurrences