Modern word embedding models: predict to learn word vectors

Andrey Kutuzov (andreku@ifi.uio.no)
Language Technology Group

04 October 2017


Contents

1 Introduction
   Simple demo
   Distributional hypothesis

2 Vector space models
   Word embeddings
   Count-based distributional models

3 Predictive distributional models: Word2Vec revolution
   The followers: GloVe and the others

4 Models evaluation

5 Practical aspects

6 Representing documents
   Can we do without semantics?
   Distributed models: composing from word vectors
   Distributed models: training document vectors

7 Summing up


Introduction

[Figure: number of publications on word embeddings in the Association for Computational Linguistics Anthology (http://aclanthology.info/)]


Mapping words in the brain

We want a machine to imitate the human brain and understand the meaning of words.
How can we design it?


- Vector space models of meaning (based on distributional semantics) have been around for several decades already [Turney et al., 2010].
- Recent advances in employing machine learning to train distributional models have made them state-of-the-art and let them all but conquer the computational linguistics landscape.
- They are now commonly used both in research and in large-scale industry projects (web search, opinion mining, event tracing, plagiarism detection, document collection management, etc.).
- All of this is based on the ability of such models to efficiently calculate semantic similarity between linguistic entities.


Introduction

Distributional semantic models for English and Norwegian online: you can try our WebVectors service at

http://vectors.nlpl.eu/explore/embeddings/


Introduction

Tiers of linguistic analysis

Computational linguistics can model the lower tiers of language comparatively easily:
- graphematics – how words are spelled
- phonetics – how words are pronounced
- morphology – how words inflect
- syntax – how words interact in sentences


Introduction

To model means to densely represent the important features of some phenomenon. For example, in the phrase ‘The judge sits in the court’, the word ‘judge’:

1. consists of 3 phonemes [ dʒ ʌ dʒ ];
2. is a singular noun in the nominative case;
3. functions as a subject in the syntactic tree of our sentence.

Such local representations describe many important features of the word ‘judge’. But not its meaning.


Introduction

But how to represent meaning?

- Semantics is difficult to represent formally.
- We need machine-readable word representations.
- Words which are similar in their meaning should possess mathematically similar representations.
- ‘Judge’ is similar to ‘court’ but not to ‘kludge’, even though their surface form suggests the opposite.
- Why so?


Introduction

Arbitrariness of a linguistic sign

Unlike road signs, words do not possess a direct link between form and meaning.
We have known this since Ferdinand de Saussure, and in fact structuralist theory strongly influenced the distributional approach.
The concept ‘lantern’ can be expressed by any sequence of letters or sounds:

- lantern
- lykt
- лампа
- lucerna
- гэрэл
- ...


Distributional hypothesis

Meaning is actually a sum of contexts, and distributional differences will always be enough to explain semantic differences:

- Words with similar typical contexts have similar meanings.
- The first to formulate this: Ludwig Wittgenstein (1930s) and [Harris, 1954].
- ‘You shall know a word by the company it keeps’ [Firth, 1957].
- Distributional semantic models (DSMs) are built upon lexical co-occurrences in a large training corpus.


Vector space models

The first and primary method of representing meaning in distributional semantics is the semantic vector.
It was first invented by Charles Osgood, an American psychologist, in the 1950s [Osgood et al., 1964], and then developed by many others.


Vector space models

In distributional semantics, the meanings of particular words are represented as vectors, or arrays of real values, derived from the frequencies of their co-occurrences with other words (or other entities) in the training corpus.

- Words (or, more often, their lemmas) are vectors, or points, in a multi-dimensional semantic space.
- At the same time, words are also the axes (dimensions) of this space (but we can use other types of contexts: documents, sentences, even characters).
- Each word A is represented by the vector ~A. The vector dimensions, or components, are the other words of the corpus lexicon (B, C, D ... N). The values of the components are the words' co-occurrence frequencies (see the sketch below).
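A minimal sketch of how such count vectors can be collected, assuming a plain symmetric two-word window over a tokenized corpus; the toy corpus, the window size and the function name are illustrative assumptions, not anything specified in the talk:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word co-occurs with the words around it."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus: each word's count vector is its row of co-occurrence frequencies
corpus = [["the", "judge", "sits", "in", "the", "court"],
          ["the", "judge", "reads", "the", "verdict"]]
vectors = cooccurrence_counts(corpus)
print(dict(vectors["judge"]))  # {'the': 3, 'sits': 1, 'in': 1, 'reads': 1}
```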


Vector space models

Similar words are close to each other in the space defined by their typical co-occurrences.


Word embeddings

Curse of dimensionality

- With large corpora, we can end up with millions of dimensions (axes, words).
- But the vectors are very sparse: most components are zero.
- One can reduce the vector sizes to some reasonable value and still retain meaningful relations between them.
- Such dense vectors are called ‘word embeddings’.


Vector space models

300-D vector of ‘tomato’


300-D vector of ‘cucumber’


300-D vector of ‘philosophy’

Can we prove that tomatoes are more similar to cucumbers than to philosophy?


Vector space models

Semantic similarity between words is measured by the cosine of the angle between their corresponding vectors (it takes values from -1 to 1).

- Similarity decreases as the angle between the word vectors grows.
- Similarity increases as the angle shrinks.

\cos(w_1, w_2) = \frac{\vec{V}(w_1) \cdot \vec{V}(w_2)}{|\vec{V}(w_1)| \times |\vec{V}(w_2)|} \qquad (1)

(the dot product of unit-normalized vectors)

cos(tomato, philosophy) = 0.09
cos(cucumber, philosophy) = 0.16
cos(tomato, cucumber) = 0.66

- Vectors point in the same direction: cos = 1
- Vectors are orthogonal: cos = 0
- Vectors point in opposite directions: cos = -1
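As a quick illustration of formula (1), here is a hedged NumPy sketch; the three-dimensional toy vectors are invented for this example and are not the real 300-D vectors behind the similarity scores above:

```python
import numpy as np

def cosine(v1, v2):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Toy 3-D count vectors, invented purely to illustrate the formula
tomato     = np.array([2.0, 8.0, 1.0])
cucumber   = np.array([3.0, 7.0, 0.5])
philosophy = np.array([9.0, 0.5, 6.0])

print(cosine(tomato, cucumber))    # high: similar contexts
print(cosine(tomato, philosophy))  # much lower: different contexts
```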


Vector space models

Embeddings reduced to 2 dimensions and visualized with t-SNE [Van der Maaten and Hinton, 2008]
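A sketch of how such a 2-D map can be produced with scikit-learn's t-SNE and matplotlib; the `embeddings` matrix and the `words` label list are assumed to come from a model you have already loaded, and the perplexity value is just a common default:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(embeddings, words, perplexity=30):
    """Project (N, D) word vectors to 2-D with t-SNE and plot them with their labels."""
    # perplexity must be smaller than the number of words being plotted
    coords = TSNE(n_components=2, perplexity=perplexity,
                  init="pca", random_state=0).fit_transform(np.asarray(embeddings))
    plt.figure(figsize=(10, 10))
    plt.scatter(coords[:, 0], coords[:, 1], s=10)
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y), fontsize=8)
    plt.show()
```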


Vector space models

Brief recap

- Distributional models are based on the distributions of word co-occurrences in large training corpora;
- they represent words as dense lexical vectors (embeddings);
- the models are also distributed: each word is represented as multiple activations (not a one-hot vector);
- particular vector components (features) are not directly related to any particular semantic ‘properties’;
- words occurring in similar contexts have similar vectors;
- one can find the nearest semantic associates of a given word by calculating the cosine similarity between vectors.


Vector space models

Nearest semantic associates

Brain (from a model trained on English Wikipedia):

1. cerebral 0.74
2. cerebellum 0.72
3. brainstem 0.70
4. cortical 0.68
5. hippocampal 0.66
6. ...
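Lists like this can be reproduced with the gensim library, assuming you have a pre-trained model in word2vec format; the file name below is a placeholder, not the actual model behind these numbers:

```python
from gensim.models import KeyedVectors

# Placeholder path: substitute any pre-trained word2vec-format model
model = KeyedVectors.load_word2vec_format("enwiki_model.bin", binary=True)

# Nearest semantic associates of 'brain' by cosine similarity
for word, similarity in model.most_similar("brain", topn=5):
    print(f"{word}\t{similarity:.2f}")
```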


Vector space models

Works with multi-word entities as well

Alan Turing (from a model trained on the Google News corpus (2013)):

1. Turing 0.68
2. Charles Babbage 0.65
3. mathematician Alan Turing 0.62
4. pioneer Alan Turing 0.60
5. On Computable Numbers 0.60
6. ...

Try it yourself at http://vectors.nlpl.eu/explore/embeddings/!


Vector space models

Main approaches to producing word embeddings

1. Pointwise mutual information (PMI) association matrices factorized by SVD (so-called count-based models) [Bullinaria and Levy, 2007];
2. Predictive models using artificial neural networks, introduced in [Bengio et al., 2003] and [Mikolov et al., 2013] (word2vec):
   - Continuous Bag-of-Words (CBOW),
   - Continuous Skip-Gram (skip-gram);
3. Global Vectors for Word Representation (GloVe) [Pennington et al., 2014];
4. fastText [Bojanowski et al., 2016];
5. etc.

The latter approaches have become hugely popular in recent years and have boosted almost all areas of natural language processing.
Their principal difference from previous methods is that they actively employ machine learning.
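As an illustration of the predictive family, a minimal sketch of training a word2vec model with the gensim library (gensim 4.x parameter names); the toy corpus and the hyperparameter values are placeholders, not the settings behind the models shown in this talk:

```python
from gensim.models import Word2Vec

# Corpus: an iterable of tokenized sentences (placeholder toy data)
sentences = [["the", "judge", "sits", "in", "the", "court"],
             ["tomatoes", "and", "cucumbers", "are", "vegetables"]]

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the embeddings
    window=5,          # context window size
    min_count=1,       # keep rare words in this toy example
    sg=1,              # 1 = Continuous Skip-Gram, 0 = CBOW
    epochs=10,
)

vector = model.wv["judge"]                       # the 300-D embedding for 'judge'
print(model.wv.most_similar("judge", topn=3))    # noisy on a toy corpus, meaningful on real data
```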


Count-based distributional models

Traditional distributional models are known as count-based.

How to construct a good count-based model

1. compile the full co-occurrence matrix on the whole corpus;
2. weight the absolute frequencies with the positive pointwise mutual information (PPMI) association measure;
3. factorize the matrix with singular value decomposition (SVD) to reduce dimensionality and move from sparse to dense vectors.

For more details, see [Bullinaria and Levy, 2007] and methods like Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA). A sketch of this pipeline follows below.
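A hedged sketch of that three-step recipe, using NumPy for the PPMI weighting and scikit-learn's truncated SVD for the factorization; the `counts` matrix is assumed to be a word-by-context co-occurrence matrix you have already collected (for instance with the counting sketch earlier in this transcript):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def ppmi(counts):
    """Turn raw co-occurrence counts into positive pointwise mutual information."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)       # word marginal frequencies
    col = counts.sum(axis=0, keepdims=True)       # context marginal frequencies
    expected = row * col / total                  # expected counts under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts get zero association
    return np.maximum(pmi, 0.0)                   # keep only positive associations

def count_based_embeddings(counts, dim=300):
    """Steps 2 and 3: PPMI weighting, then SVD down to `dim` dense dimensions."""
    weighted = ppmi(counts.astype(float))
    # dim must be smaller than the number of context columns
    svd = TruncatedSVD(n_components=dim, random_state=0)
    return svd.fit_transform(weighted)            # dense (vocabulary x dim) embeddings
```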


Predictive distributional models: Word2Vec revolution

Machine learning

I Some problems are so complex that we cannot formulate exact algorithms for them: we do not know ourselves how our brain does this.

I To solve such problems, one can use machine learning: attempts to build programs which learn to make correct decisions on some training material and improve with experience;

I One of the popular machine learning approaches to language modeling is artificial neural networks.


Predictive distributional models: Word2Vec revolution

I Machine learning based distributional models are often called predict models.

I In the count models we count co-occurrence frequencies and use them as word vectors; in the predict models it is vice versa:

I We try to find (to learn) for each word such a vector/embedding that it is maximally similar to the vectors of its paradigmatic neighbors and minimally similar to the vectors of the words which in the training corpus are not second-order neighbors of the given word.

When using artificial neural networks, such learned vectors are called neural embeddings.


Predictive distributional models: Word2Vec revolution

I In 2013, Google’s Tomas Mikolov et al. published a paper called ‘Efficient Estimation of Word Representations in Vector Space’;

I they also made available the source code of the word2vec tool, implementing their algorithms...

I ...and a distributional model trained on the large Google News corpus.

I [Mikolov et al., 2013]
I https://code.google.com/p/word2vec/


Predictive distributional models: Word2Vec revolution

I Mikolov modified already existing algorithms (especially from [Bengio et al., 2003] and work by R. Collobert)

I explicitly made learning good embeddings the final aim of the model training.

I word2vec turned out to be very fast and efficient.
I NB: it actually features two different algorithms:

1. Continuous Bag-of-Words (CBOW),
2. Continuous Skipgram.
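As an illustration (not the original C tool), the choice between the two algorithms is exposed, for example, in Gensim through the sg flag; the toy sentences below are made up, and parameter names follow Gensim 4.x.

from gensim.models import Word2Vec

sentences = [["royal", "king", "queen", "throne"],
             ["cat", "dog", "pet", "animal"]]

# sg=0 -> Continuous Bag-of-Words, sg=1 -> Continuous Skip-gram
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skip_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow_model.wv.most_similar("king", topn=3))
print(skip_model.wv.similarity("cat", "dog"))

Other training options (negative sampling, number of epochs, number of worker threads) are passed in the same way.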


Predictive distributional models: Word2Vec revolution

Model initializing

First, each word in the vocabulary receives two random initial vectors (as a word and as a context) of a pre-defined size. Thus, we have two weight matrices:

I input matrix WI with word vectors between the input and projection layers;

I output matrix WO with context vectors between the projection and output layers.

The first matrix has dimensionality vocabulary size × vector size, and the second is its transpose: vector size × vocabulary size.
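A minimal NumPy sketch of this initialization, assuming a toy vocabulary and vector size (actual implementations differ in details; for instance, some start the output matrix at zero):

import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]   # toy vocabulary
dim = 50                                      # pre-defined vector size

rng = np.random.default_rng(42)
# input matrix WI: word vectors, shape vocabulary size x vector size
WI = (rng.random((len(vocab), dim)) - 0.5) / dim
# output matrix WO: context vectors, shape vector size x vocabulary size
# (some implementations initialize WO with zeros instead of random values)
WO = (rng.random((dim, len(vocab))) - 0.5) / dim

print(WI.shape, WO.shape)                     # (5, 50) (50, 5)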

What happens next?


Predictive distributional models: Word2Vec revolution

Learning good vectors
During the training, we move through the training corpus with a sliding window. Each instance (word in running text) is a prediction problem: the objective is to predict the current word with the help of its contexts (or vice versa).

The outcome of the prediction determines whether we adjust the current word vector and in what direction. Gradually, vectors converge to (hopefully) optimal values.

It is important that prediction here is not an aim in itself: it is just a proxy to learn vector representations good for other downstream tasks.
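A small sketch of how such prediction instances can be generated with a sliding window (the sentence and window size are illustrative):

def training_instances(sentence, window=2):
    """Yield (current word, context words) pairs from one tokenized sentence."""
    for i, word in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield word, left + right

sentence = ["the", "cat", "sits", "on", "the", "mat"]
for target, context in training_instances(sentence):
    print(target, "<-", context)
# CBOW predicts `target` from `context`; Skip-gram predicts each context word from `target`.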


Predictive distributional models: Word2Vec revolution

I Continuous Bag-of-Words (CBOW) and Continuous Skip-Gram (skip-gram) are conceptually similar but differ in important details;

I Both were shown to outperform traditional count DSMs in various semantic tasks for English [Baroni et al., 2014].

At training time, CBOW learns to predict the current word based on its context, while Skip-Gram learns to predict the context based on the current word.


Predictive distributional models: Word2Vec revolution

Continuous Bag-of-Words and Continuous Skip-Gram: two algorithms in the word2vec paper


Predictive distributional models: Word2Vec revolution

I None of these algorithms is actually deep learning.

I The network is very simple: a single linear projection layer between the input and the output layers.

I The training objective maximizes the probability of observing the correct output word(s) wt given the context word(s) cw1...cwj, with regard to their current embeddings (sets of neural weights).

I The cost function C for CBOW is the negative log probability (cross-entropy) of the correct answer:

C = −log(p(wt | cw1...cwj)) (2)

...or for SkipGram:

C = −∑_{i=1}^{j} log(p(cwi | wt)) (3)

...and the learning itself is implemented with stochastic gradient descent and (optionally) an adaptive learning rate.
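A NumPy sketch of these two cost functions for a single training instance, using a full softmax over a toy vocabulary (all names and sizes are illustrative; real implementations replace the full softmax with hierarchical softmax or negative sampling):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cbow_cost(WI, WO, context_ids, target_id):
    h = WI[context_ids].mean(axis=0)       # projection layer: average of context input vectors
    probs = softmax(h @ WO)                # probabilities over the whole vocabulary
    return -np.log(probs[target_id])       # equation (2)

def skipgram_cost(WI, WO, target_id, context_ids):
    h = WI[target_id]                      # projection layer: the current word's input vector
    probs = softmax(h @ WO)
    return -sum(np.log(probs[c]) for c in context_ids)   # equation (3)

rng = np.random.default_rng(0)
WI = rng.normal(size=(5, 10))              # toy model: 5 words, 10 dimensions
WO = rng.normal(size=(10, 5))
print(cbow_cost(WI, WO, [0, 2], 1), skipgram_cost(WI, WO, 1, [0, 2]))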


Predictive distributional models: Word2Vec revolution

At each training instance, the input for the prediction is basically:
I CBOW: the average input vector of all context words. We check whether the current word's output vector is the closest to it among all vocabulary words.

I SkipGram: the current word's input vector. We check whether each of the context words' output vectors is the closest to it among all vocabulary words.

Reminder: this ‘closeness’ is calculated with the help of cosine similarity and then turned into probabilities using softmax.

During the training, we update two weight matrices: the input vectors (WI, from the input layer to the hidden layer) and the output vectors (WO, from the hidden layer to the output layer). As a rule, they share the same vocabulary, and only the input vectors are used in practical tasks.


Predictive distributional models: Word2Vec revolution

CBOW and SkipGram training algorithms

‘the vector of a word w is “dragged” back-and-forth by the vectors of w’s co-occurring words, as if there are physical strings between w and its neighbors... like gravity, or force-directed graph layout.’ [Rong, 2014]

Useful demo of word2vec algorithms: https://ronxin.github.io/wevi/


The followers: GloVe and the others

Lots of follow-up research after Mikolov’s 2013 paper:
I Christopher Manning and colleagues at Stanford released GloVe – a different approach to learning word embeddings [Pennington et al., 2014];

I Omer Levy and Yoav Goldberg from Bar-Ilan University showed that SkipGram implicitly factorizes a word-context matrix of PMI coefficients [Levy and Goldberg, 2014];

I The same authors showed that much of the amazing performance of SkipGram is due to the choice of hyperparameters, but it is still very robust and computationally efficient [Levy et al., 2015];

I Le and Mikolov proposed Paragraph Vector: an algorithm to learn distributed representations not only for words but also for paragraphs or documents [Le and Mikolov, 2014];

I After moving to Facebook, Mikolov released fastText, which learns embeddings for character sequences [Bojanowski et al., 2016];

I These approaches were implemented in third-party open-source software, for example, Gensim or TensorFlow.


Predictive distributional models: Word2Vec revolution

GlobalVectors: a global log-bilinear regression model for unsupervised learning of word embeddings
I GloVe is an attempt to combine the global count models and the local context window prediction models.

I It relies on global co-occurrence counts, analyzing the log-probability co-occurrence matrix;

I Non-zero elements are stochastically sampled from the matrix, and the model is iteratively trained on them.

I The objective is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.

I Code and pre-trained embeddings available at http://nlp.stanford.edu/projects/glove/.
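A sketch of the GloVe objective for a single co-occurrence cell, following the weighted least-squares formulation of [Pennington et al., 2014]; the weighting constants and toy numbers below are illustrative, not the reference code:

import numpy as np

def glove_cell_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one non-zero co-occurrence count x_ij."""
    f = min((x_ij / x_max) ** alpha, 1.0)            # weighting function f(x_ij)
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)      # dot product vs. log co-occurrence
    return f * diff ** 2

rng = np.random.default_rng(1)
w_i, w_j = rng.normal(size=10), rng.normal(size=10)  # word and context vectors
print(glove_cell_loss(w_i, w_j, b_i=0.0, b_j=0.0, x_ij=15.0))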


Predictive distributional models: Word2Vec revolution

Nowadays, word embeddings are widely used instead of discrete word tokens as an input to more complex neural network models:

I feedforward networks,
I convolutional networks,
I recurrent networks,
I LSTMs...


Models evaluation

How do we evaluate trained models? Subject to many discussions! The topic of special workshops at ACL-2016 and EMNLP-2017: https://repeval2017.github.io/

I Semantic similarity (what is the association degree?)
I Synonym detection (what is most similar?)
I Concept categorization (what groups with what?)
I Analogical inference (A is to B as C is to ?)
I Correlation with manually crafted linguistic features.
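For instance, simple similarity and analogy checks can be run against a trained model with Gensim; the model file name below is hypothetical:

from gensim.models import KeyedVectors

# hypothetical file with a model in the word2vec text format
wv = KeyedVectors.load_word2vec_format("model.txt", binary=False)

# semantic similarity / synonym detection
print(wv.similarity("car", "automobile"))
print(wv.most_similar("car", topn=5))

# analogical inference: man is to king as woman is to ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))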


Models evaluation

Model performance in semantic similarity task depending on context width and vector size.


Practical aspects

Main frameworks and toolkits
1. Dissect [Dinu et al., 2013] (http://clic.cimec.unitn.it/composes/toolkit/);
2. word2vec original C code [Le and Mikolov, 2014] (https://word2vec.googlecode.com/svn/trunk/);
3. Gensim framework for Python, including word2vec and fastText implementations and wrappers for other algorithms (http://radimrehurek.com/gensim/);
4. word2vec implementations in Google’s TensorFlow (https://www.tensorflow.org/tutorials/word2vec);
5. GloVe reference implementation [Pennington et al., 2014] (http://nlp.stanford.edu/projects/glove/).


Models can come in several formats:
1. Simple text format:
I words and sequences of values representing their vectors, one word per line;
I the first line gives the number of words in the model and the vector size.
2. The same in binary form.
3. Gensim binary format:
I uses NumPy matrices saved via Python pickles;
I stores a lot of additional information (input vectors, training algorithm, word frequency, etc.).

Gensim works with all of these formats.

42
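
A minimal loading sketch for each of these formats with Gensim's KeyedVectors; the file names are hypothetical, and the calls assume a reasonably recent Gensim release.

    from gensim.models import KeyedVectors

    # 1. Plain-text word2vec format (header line with counts, then one word + vector per line):
    wv_text = KeyedVectors.load_word2vec_format('model.txt', binary=False)

    # 2. The same data in the binary word2vec format:
    wv_bin = KeyedVectors.load_word2vec_format('model.bin', binary=True)

    # 3. Gensim's own format (pickled NumPy matrices plus extra metadata):
    wv_native = KeyedVectors.load('model.kv')

    print(wv_bin.most_similar('university', topn=5))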

A bunch of observations
- Wikipedia is not the best training corpus: results fluctuate wildly depending on hyperparameters. Perhaps its language is too specific.
- Normalize your data: lowercase, lemmatize, detect named entities.
- It helps to augment words with PoS tags before training ('boot_NOUN', 'boot_VERB'); as a result, your model becomes aware of morphological ambiguity (see the preprocessing sketch below).
- Remove stop words yourself: the statistical downsampling implemented in the word2vec algorithms can easily deprive you of valuable text data.
- Feed data to the model one paragraph per line (not one sentence per line).
- Shuffle your corpus before training.
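
As a rough illustration of this normalization advice, here is a preprocessing sketch using NLTK (the slides do not prescribe a particular tagger, so that choice is an assumption). It lowercases, PoS-tags with the coarse universal tagset and drops a tiny hand-made stop-word list, but skips lemmatization and named-entity detection for brevity; it also assumes the relevant NLTK data packages (tokenizer, tagger, universal tagset mapping) are already downloaded.

    import nltk  # assumes 'punkt', 'averaged_perceptron_tagger' and 'universal_tagset' data are available

    STOP_WORDS = frozenset({'the', 'a', 'an', 'of', 'in', 'for', 'to'})  # toy list; use a real one

    def preprocess(paragraph):
        """Lowercase, PoS-tag and filter one paragraph before feeding it to word2vec."""
        tokens = nltk.word_tokenize(paragraph.lower())
        tagged = nltk.pos_tag(tokens, tagset='universal')  # coarse tags: NOUN, VERB, ADJ, ...
        return ['{}_{}'.format(word, tag) for word, tag in tagged
                if word.isalpha() and word not in STOP_WORDS]

    print(preprocess('These boots were made for walking.'))
    # e.g. ['these_DET', 'boots_NOUN', 'were_VERB', 'made_VERB', 'walking_VERB']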

Representing documents

- Distributional approaches allow us to extract semantics from unlabeled data at the word level.
- But we also need to represent variable-length documents:
  - for classification,
  - for clustering,
  - for information retrieval (including web search).

- Can we detect semantically similar texts in the same way as we detect similar words?
- Yes we can!
- Nothing prevents us from representing sentences, paragraphs or whole documents (below, we use the term 'document' for all of these) as dense vectors.
- Once documents are represented as vectors, classification, clustering and other data processing tasks become trivial.

Note: this talk does not cover sequence-to-sequence sentence modeling approaches based on RNNs (LSTM, GRU, etc.). A good example of those is the Skip-Thought algorithm [Kiros et al., 2015]. We are concerned with comparatively simple algorithms that are conceptually similar to prediction-based distributional models for words.

Bag-of-words problems
Simple bag-of-words and TF-IDF approaches do not take into account semantic relationships between linguistic entities.
There is no way to detect semantic similarity between documents which do not share words:
- California saw mass protests after the elections.
- Many Americans were anxious about the elected president.
We need more sophisticated, semantically aware distributed methods, like neural embeddings (the sketch below makes the problem concrete).
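
A quick check of this claim, using scikit-learn's bag-of-words vectorizer purely for illustration (scikit-learn is not mentioned in the slides; this tooling choice is an assumption): once English stop words are removed, the two example sentences share no terms, so their bag-of-words cosine similarity is exactly zero.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ['California saw mass protests after the elections.',
            'Many Americans were anxious about the elected president.']

    # Bag-of-words counts with English stop words removed.
    bow = CountVectorizer(stop_words='english').fit_transform(docs)

    print(cosine_similarity(bow[0], bow[1]))  # [[0.]]: no shared terms, no similarity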

- Document meaning is composed of individual word meanings.
- We need to combine continuous word vectors into continuous document vectors.
- This combining operation is called a composition function.

Semantic fingerprints
- One of the simplest composition functions: an average vector \vec{S} over the vectors of all words w_1 \ldots w_n in the document:

  \vec{S} = \frac{1}{n} \times \sum_{i=1}^{n} \vec{w}_i    (4)

- We don't care about syntax and word order.
- If we already have a good word embedding model, this bottom-up approach is strikingly efficient and usually beats bag-of-words (a minimal sketch follows below).
- Let's call it a 'semantic fingerprint' of the document.
- It is very important to remove stop words beforehand!
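
A minimal sketch of such a semantic fingerprint, assuming a pre-trained model in the binary word2vec format (the file name is hypothetical) and pre-tokenized, stop-word-free input whose words are mostly covered by the model's vocabulary:

    import numpy as np
    from gensim.models import KeyedVectors

    def semantic_fingerprint(tokens, wv):
        """Average the vectors of all in-vocabulary tokens; None if nothing is known."""
        vectors = [wv[token] for token in tokens if token in wv]
        return np.mean(vectors, axis=0) if vectors else None

    wv = KeyedVectors.load_word2vec_format('model.bin', binary=True)
    doc_a = semantic_fingerprint('california mass protests elections'.split(), wv)
    doc_b = semantic_fingerprint('americans anxious elected president'.split(), wv)

    # Cosine similarity between the two document fingerprints:
    print(np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b)))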

- You don't even have to average: summing the vectors is enough, since cosine similarity is about angles, not magnitudes (see the small check below).
- However, averaging does make a difference with other distance metrics (Euclidean distance, etc.).
- It also helps to keep things tidy and normalized (representations do not depend on document length).
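
A small numerical check of this point, with made-up 4-dimensional 'word vectors': summing and averaging give exactly the same cosine similarity but different Euclidean distances.

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Two toy documents as stacks of word vectors (3 words vs. 2 words).
    doc1 = np.array([[0.1, 0.3, 0.0, 0.2], [0.2, 0.1, 0.1, 0.0], [0.0, 0.2, 0.3, 0.1]])
    doc2 = np.array([[0.2, 0.2, 0.1, 0.1], [0.1, 0.0, 0.2, 0.3]])

    sums = doc1.sum(axis=0), doc2.sum(axis=0)
    means = doc1.mean(axis=0), doc2.mean(axis=0)

    print(np.isclose(cosine(*sums), cosine(*means)))  # True: same angle either way
    print(np.linalg.norm(sums[0] - sums[1]),
          np.linalg.norm(means[0] - means[1]))        # the Euclidean distances differ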

Advantages of semantic fingerprints
- Semantic fingerprints are fast to compute and reuse already trained models.
- The generalized document representations do not depend on particular words.
- They take advantage of the 'semantic features' learned during model training.
- Topically connected words collectively increase or decrease the expression of the corresponding semantic components.
- Thus, topical words automatically become more important than noise words.

See more in [Kutuzov et al., 2016].

But...
However, for some problems such compositional approaches are not enough, and we need to generate real document embeddings.

But how?

Paragraph Vector
- [Le and Mikolov, 2014] proposed Paragraph Vector;
- it is primarily designed for learning sentence vectors;
- the algorithm takes as input sentences/documents tagged with (possibly unique) identifiers;
- it learns distributed representations for the sentences, such that similar sentences have similar vectors;
- so each sentence is represented with an identifier and a vector, like a word;
- these vectors serve as a sort of document memory or document topic.

See [Hill et al., 2016] and [Lau and Baldwin, 2016] for evaluation.

Paragraph Vector (aka doc2vec)
- implemented in Gensim under the name doc2vec;
- two training methods: Distributed Memory (DM) and Distributed Bag-of-Words (DBOW);
- PV-DM:
  - learn word embeddings in the usual way (shared by all documents);
  - randomly initialize document vectors;
  - use document vectors together with word vectors to predict the neighboring words within a pre-defined window;
  - minimize the error;
  - the trained model can infer a vector for any new document (the model itself remains intact).
- PV-DBOW:
  - no sliding window at all;
  - just predict all words in the current document from its vector.

There are contradicting reports on which method is better. A minimal Gensim training sketch follows below.
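
A minimal Doc2Vec training sketch with Gensim on a made-up two-document corpus; the parameter names follow Gensim 4.x (older releases expose model.docvecs instead of model.dv), and the corpus size and epoch count are purely illustrative.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each pre-tokenized document gets a (here unique) tag.
    corpus = [
        TaggedDocument(words=['california', 'saw', 'mass', 'protests', 'after', 'the', 'elections'],
                       tags=['doc_0']),
        TaggedDocument(words=['many', 'americans', 'were', 'anxious', 'about', 'the', 'elected', 'president'],
                       tags=['doc_1']),
    ]

    # dm=1 -> PV-DM (document vector + context words predict the target word);
    # dm=0 -> PV-DBOW (the document vector alone predicts words of the document).
    model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=1, epochs=40)

    print(model.dv['doc_0'][:5])  # trained vector of the first document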

Paragraph Vector - Distributed Memory (PV-DM): architecture figure.

Paragraph Vector - Distributed Bag-of-Words (PV-DBOW): architecture figure.

Paragraph Vector (aka doc2vec)
- You train the model, then infer embeddings for the documents you are interested in (a short inference sketch follows below).
- The resulting embeddings have been shown to perform very well on sentiment analysis and other document classification tasks, as well as in IR tasks.
- Very memory-hungry: each sentence gets its own vector, and real-life corpora contain many millions of sentences.
- It is possible to reduce the memory footprint by training a limited number of vectors: group sentences into classes.
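
Continuing the training sketch above, inference for a new, unseen document might look roughly like this; the new document is invented, and infer_vector runs extra gradient steps on a fresh document vector while the rest of the model stays frozen.

    # 'model' is the Doc2Vec model trained in the earlier sketch.
    new_doc = ['protests', 'erupted', 'after', 'the', 'presidential', 'elections']

    vector = model.infer_vector(new_doc, epochs=50)

    # Which training documents are closest to the new one?
    print(model.dv.most_similar([vector], topn=2))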

Summing up

Distributional semantic models (a.k.a. word embeddings) are now employed in practically every NLP task related to meaning (and even in some others). The models learn meaningful word representations: it is then your responsibility to use them to your ends.

Tools to play with
1. https://code.google.com/p/word2vec/
2. http://radimrehurek.com/gensim
3. https://github.com/facebookresearch/fastText
4. http://vectors.nlpl.eu/explore/embeddings/

Thanks for your attention!
Questions are welcome.

Modern word embedding models: predict to learn word vectors

Andrey Kutuzov
andreku@ifi.uio.no

References

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 238-247, Baltimore, USA.

Bengio, Y., Ducharme, R., and Vincent, P. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bullinaria, J. A. and Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510-526.

Dinu, G., Pham, T. N., and Baroni, M. (2013). Dissect - distributional semantics composition toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 31-36. Association for Computational Linguistics.

Firth, J. (1957). A synopsis of linguistic theory, 1930-1955. Blackwell.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-3):146-162.

Hill, F., Cho, K., and Korhonen, A. (2016). Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294-3302.

Kutuzov, A., Kopotev, M., Ivanova, L., and Sviridenko, T. (2016). Clustering comparable corpora of Russian and Ukrainian academic texts: Word embeddings and semantic fingerprints. In Proceedings of the Ninth Workshop on Building and Using Comparable Corpora, held at LREC-2016, pages 3-10. European Language Resources Association.

Lau, J. H. and Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78-86. Association for Computational Linguistics.

Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML, volume 14, pages 1188-1196.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177-2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26.

Osgood, C. E., Suci, G. J., and Tannenbaum, P. H. (1964). The measurement of meaning. University of Illinois Press.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Turney, P., Pantel, P., et al. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141-188.

Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85.