TRANSCRIPT
Katrin Erk
Distributional models
Representing meaning through collections of words
Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday
(Washington Post, Oct 24, 2009, on elections in Afghanistan)
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild
(Wikipedia, version of Oct 24, 2009, on the movie “Where the Wild Things Are”)
Doc 3: applications documents engines information iterated library metadata precision query statistical web
(Wikipedia, version of Oct 24, 2009, on Information Retrieval)
Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you
(garden.org: Planning a Vegetable Garden)
Representing meaning through a collection of words
What parts of the meaning of a document can you capture through an unordered collection of words?
General topic information: what is the document about?
More specifically: things mentioned in the document.
How can you make use of such collections?
Documents on similar topics contain similar words.
Use in Information Retrieval (search).
Representing collections of words through tables of counts
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild
Word counts:
film: 24, wild: 18, max: 16, that: 16, things: 12, as: 12, [edit]: 11, him: 9, jonze: 9, released: 6
Representing collections of words through tables of counts
We can now compare documents by comparing tables of counts.
What can you tell about the second document below?
“Where the Wild Things Are”:
film: 24, wild: 18, max: 16, that: 16, things: 12, as: 12, [edit]: 11, him: 9, jonze: 9, released: 6
Second document:
film: 17, wild: 0, max: 0, that: 9, things: 0, as: 36, [edit]: 8, him: 7, jonze: 0, released: 3
The “second document”: a more extensive list of words
the: 167, and: 58, of: 58, to: 56, a: 49, in: 37, as: 36, is: 33, victor: 30, *: 27, with: 26, by: 23, her: 18, film: 17, for: 16, emily: 15, was: 15, corpse: 14, bride: 13, victoria: 13, his: 13, on: 13, from: 11
What movie is this?
From tables to vectors
Interpret the table as a vector. Each entry is a dimension:
“film” is a dimension. The document’s coordinate: 24. “wild” is a dimension. The document’s coordinate: 18. …
Then this document is a point in 10-dimensional space.
film: 24, wild: 18, max: 16, that: 16, things: 12, as: 12, [edit]: 11, him: 9, jonze: 9, released: 6
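A minimal sketch of this table-to-vector step in Python (the dictionary, variable, and function names are illustrative, not from the lecture):

```python
# A count table is a word -> count mapping; fixing an order on the words
# turns it into a vector.
wild_things_counts = {"film": 24, "wild": 18, "max": 16, "that": 16,
                      "things": 12, "as": 12, "[edit]": 11, "him": 9,
                      "jonze": 9, "released": 6}

dimensions = sorted(wild_things_counts)  # one dimension per word

def to_vector(counts, dimensions):
    """Turn a count table into a point in len(dimensions)-dimensional space."""
    return [counts.get(word, 0) for word in dimensions]

print(to_vector(wild_things_counts, dimensions))
```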
Documents as points in vector space
Viewing “Where the Wild Things Are” and “Corpse Bride” as vectors/points in vector space: similarity between them is proximity in space.
[Figure: the two documents plotted as points in the space]
“Distributional model”, “vector space model”, and “semantic space model” are used interchangeably here.
What have we gained?
The representation of a document in vector space can be computed completely automatically: just count words.
Similarity in vector space is a good predictor of similarity in topic: documents that contain similar words tend to be about similar things.
What do we mean by “similarity” of vectors?
Euclidean distance (a dissimilarity measure!): euclid(x, y) = sqrt( sum_i (x_i - y_i)^2 )
[Figure: the “Corpse Bride” and “Where the Wild Things Are” points, with the distance between them]
What do we mean by “similarity” of vectors?
Cosine similarity: cos(x, y) = sum_i x_i y_i / ( sqrt(sum_i x_i^2) sqrt(sum_i y_i^2) ), the cosine of the angle between the two vectors
[Figure: the angle between the “Corpse Bride” and “Where the Wild Things Are” vectors]
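A minimal sketch of both measures, run on the two count vectors from the slides above (pure Python; function names are illustrative):

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: straight-line distance between two points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    """Similarity: cosine of the angle between two vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

# Counts for film, wild, max, that, things, as, [edit], him, jonze, released:
wild_things = [24, 18, 16, 16, 12, 12, 11, 9, 9, 6]
second_doc  = [17, 0, 0, 9, 0, 36, 8, 7, 0, 3]

print(euclidean_distance(wild_things, second_doc))  # larger = more dissimilar
print(cosine_similarity(wild_things, second_doc))   # closer to 1 = more similar
```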
What have we gained?
We can compute the similarity of documents through their Euclidean distance, or through their cosine.
We can also represent a query as a vector: just count the words in the query.
Now we can search for documents similar to the query.
From documents to words
The same holds for words as for documents: context words are a good indicator of meaning. Similar words tend to occur in similar contexts.
What is a context? How do we count here?
Take all the occurrences of our target word in a large text.
Take a context window, e.g. 10 words on either side.
Count everything that occurs there.
Representing the meaning of a word through a collection of context words
Emerging from the earth is Emily, the "Corpse Bride," a beautiful undead girl in a moldy bridal gown who declares Victor her husband.
a: 2, the: 2, corpse: 1, emerging: 1, from: 1, is: 1, undead: 1, beautiful: 1, moldy: 1, bride: 1, in: 1, earth: 1, girl: 1
Counts for target “Emily”, 10 words context either side.
Representing the meaning of a word through a collection of context words
Go through all occurrences of “Emily” in a large corpus. Count the words in a 10-word window around each occurrence, and sum up.
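A minimal sketch of this window-based counting, assuming the corpus comes as a plain list of lowercase tokens (the tokenization and the names are illustrative):

```python
from collections import Counter

def context_counts(tokens, target, window=10):
    """Sum context-word counts over all occurrences of the target,
    looking `window` words to either side."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# The single-sentence example from the slide:
sentence = ("emerging from the earth is emily the corpse bride a beautiful "
            "undead girl in a moldy bridal gown who declares victor her "
            "husband").split()
print(context_counts(sentence, "emily"))
```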
Some co-occurrences: “letter” in “Pride and Prejudice”
jane: 12, when: 14, by: 15, which: 16, him: 16, with: 16, elizabeth: 17, but: 17, he: 17, be: 18, s: 20, on: 20, not: 21, for: 21, mr: 22, this: 23, as: 23, you: 25, from: 28, i: 28, had: 32, that: 33, in: 34, was: 34, it: 35, his: 36, she: 41, her: 50, a: 52, and: 56, of: 72, to: 75, the: 102
This is not a large text! Large = something like 100 million words, at least.
From tables to vectors
Interpret the table as a vector. Each entry is a dimension:
“admirer” is a dimension. Coordinate of “letter”: 1. Coordinate of “surprise”: 0.
“all” is a dimension. Coordinate of “letter”: 8. Coordinate of “surprise”: 7.
…
Then each word is a point in n-dimensional space
Counts for “letter” and “surprise” from Pride and Prejudice
What have we gained?
The representation of a word in vector space can be computed completely automatically: just count co-occurring words in all contexts.
Similarity in vector space is a good predictor of meaning similarity: words that occur in similar contexts tend to be similar in meaning.
Synonyms are close together in vector space. Antonyms too.
Parameters of vector space models
W. Lowe (2001): “Towards a theory of semantic space”. A semantic space is defined as a tuple (A, B, S, M):
B: base elements.
A: a mapping from raw co-occurrence counts to something else, to correct for frequency effects.
S: a similarity measure.
M: a transformation of the whole space to different dimensions.
B: base elements
We have seen: context words as base elements.
Term x document matrix:
Represent a document as a vector of weighted terms.
Represent a term as a vector of weighted documents.
B: base elements
Dimensions: not words in a context window, but dependency paths starting from the target word (Padó & Lapata 2007).
A: transforming raw counts
Problem with vectors of raw counts: distortion through the frequency of the target word.
Weight the counts: the count on the dimension “and” will not be as informative as that on the dimension “angry”.
For example, use Pointwise Mutual Information between target a and context word b: PMI(a, b) = log( P(a, b) / (P(a) P(b)) )
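A minimal sketch of PMI weighting from raw counts. The co-occurrence numbers below are toy values chosen to show why “angry” is more informative than “and”; they are not counts from Pride and Prejudice:

```python
import math

def pmi(counts, target, context):
    """Pointwise mutual information between target a and context word b:
    log P(a,b) / (P(a) P(b)), estimated from co-occurrence counts."""
    total = sum(sum(row.values()) for row in counts.values())
    p_ab = counts[target][context] / total
    p_a = sum(counts[target].values()) / total
    p_b = sum(row.get(context, 0) for row in counts.values()) / total
    return math.log(p_ab / (p_a * p_b))

# Toy co-occurrence counts: target word -> context word -> count.
counts = {
    "letter":   {"and": 56, "angry": 3},
    "surprise": {"and": 40, "angry": 1},
}
# "and" co-occurs with everything, so its PMI is near zero;
# the rarer "angry" gets a higher weight.
print(pmi(counts, "letter", "and"))
print(pmi(counts, "letter", "angry"))
```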
M: transforming the whole space
Dimensionality reduction: Principal Component Analysis (PCA), Singular Value Decomposition (SVD).
Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI): do SVD on the term x document representation to induce “latent” dimensions that correspond to topics a document can be about (Landauer & Dumais 1997).
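A minimal sketch of the SVD truncation step using numpy; the matrix values are toy numbers and the choice of k is illustrative:

```python
import numpy as np

# Toy term x document matrix: rows = terms, columns = documents.
X = np.array([
    [24., 17., 0.],   # "film"
    [18., 0., 1.],    # "wild"
    [0., 14., 0.],    # "corpse"
    [0., 0., 12.],    # "query"
])

# SVD factors the matrix: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest "latent" dimensions (the LSA truncation).
k = 2
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in k-dim space
print(docs_latent)
```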
Using similarity in vector spaces
Search/information retrieval: given a query and a document collection, use the term x document representation:
Each document is a vector of weighted terms.
Also represent the query as a vector of weighted terms.
Retrieve the documents that are most similar to the query.
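A minimal sketch of retrieval by cosine similarity, with the query counted exactly like a document (raw counts, no weighting; the documents and names are illustrative):

```python
import math
from collections import Counter

def cosine(x, y):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(x[w] * y[w] for w in x if w in y)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

documents = {
    "doc1": "abdullah boycotting challenger commission election karzai runoff",
    "doc2": "animatronics children igloo jonze kingdom wild things",
    "doc3": "applications documents engines information library query web",
}

# Represent documents and query the same way: as word-count vectors.
doc_vectors = {name: Counter(text.split()) for name, text in documents.items()}
query_vector = Counter("wild things movie".split())

# Rank documents by similarity to the query.
ranked = sorted(doc_vectors.items(),
                key=lambda item: cosine(query_vector, item[1]),
                reverse=True)
for name, vec in ranked:
    print(name, cosine(query_vector, vec))
```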
Using similarity in vector spaces
To find synonyms: synonyms tend to have more similar vectors than non-synonyms, because synonyms occur in the same contexts.
But the same holds for antonyms: in vector spaces, “good” and “evil” are the same (more or less).
So: vector spaces can be used to build a thesaurus automatically.
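A minimal sketch of the thesaurus idea with toy word vectors: for each word, list its nearest neighbors in the space. (This is not Lin’s actual method, which uses dependency contexts and his own similarity measure; see below.)

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

# Toy word vectors over three context dimensions, e.g. ("angry", "write", "read").
vectors = {
    "letter":  [3, 9, 7],
    "note":    [2, 8, 6],
    "missive": [1, 7, 7],
    "good":    [5, 1, 2],
    "evil":    [6, 1, 1],
}

def neighbors(word, vectors, n=3):
    """The n most similar other words: automatic thesaurus entries."""
    others = [(cosine(vectors[word], v), w)
              for w, v in vectors.items() if w != word]
    return sorted(others, reverse=True)[:n]

print(neighbors("letter", vectors))
# Antonyms land in the thesaurus too: "evil" is a close neighbor of "good".
print(neighbors("good", vectors))
```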
Using similarity in vector spaces
In cognitive science: to predict human judgments of how similar pairs of words are (on a scale of 1-10), and to model “priming” effects.
An automatically extracted thesaurus
Dekang Lin 1998: for each word, automatically extract similar words.
Vector space representation based on the syntactic context of the target (dependency parses).
Similarity measure based on mutual information (“Lin’s measure”).
A large thesaurus, often used in NLP applications.
Vectors for word senses
Up to now: one vector per word. The vector for “bank” conflates financial contexts and fishing contexts.
How do we get to vectors for word senses?
Automatically inducing word senses
Schütze 1998: one vector per sentence, or per occurrence (token), of “letter”:
She wrote an angry letter to her niece.
He sprayed the word in big letters.
The newspaper gets 100 letters from readers every day.
Make a token vector by adding up the vectors of all other (content) words in the sentence.
Cluster the token vectors. Clusters = induced word senses.
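A minimal sketch of this sense-induction pipeline, assuming context-word vectors are already available and using scikit-learn’s KMeans for the clustering step (the vectors and names here are toy values; Schütze’s original setup differs in detail):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy word vectors over three context dimensions, e.g. ("mail", "paint", "read").
word_vectors = {
    "wrote": np.array([4., 0., 3.]), "angry": np.array([2., 0., 1.]),
    "niece": np.array([3., 0., 2.]), "sprayed": np.array([0., 5., 0.]),
    "word": np.array([1., 4., 1.]), "big": np.array([0., 3., 0.]),
    "newspaper": np.array([3., 0., 4.]), "readers": np.array([2., 0., 5.]),
}

# One token vector per occurrence of "letter": the sum of the vectors of
# the other content words in that sentence.
sentences = [
    ["wrote", "angry", "niece"],   # She wrote an angry letter to her niece.
    ["sprayed", "word", "big"],    # He sprayed the word in big letters.
    ["newspaper", "readers"],      # The newspaper gets 100 letters from readers.
]
token_vectors = np.array([sum(word_vectors[w] for w in s) for s in sentences])

# Cluster the token vectors: each cluster is an induced sense of "letter".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(token_vectors)
print(labels)  # occurrences with the same label share an induced sense
```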
A vector for an individual occurrence of a word
Avoid having to define word senses: it is sometimes hard to divide uses into senses, for words like “leave” or “paint”.
Erk & Padó 2008: modify the vector of “bank” using its syntactic context:
[Figure: the vector of “bank” combined with “break” (object relation) vs. with “fish” (“on” relation)]
Summary: vector space models
Representing meaning through counts: represent a document through its content words; represent word meaning through context words / parse tree snippets / documents.
Context items as dimensions, the target as a vector/point in semantic space.
Proximity in semantic space ~ similarity between words.
Summary: vector space models
Uses: search; inducing ontologies; modeling human judgments of word similarity; representing word senses (cluster sentence vectors, or compute vectors for individual occurrences).