word2vec hendrik heuer from theory to practicehen-drik.de/pub/heuer - word2vec - from theory...

57
Hendrik Heuer Stockholm NLP Meetup word2vec From theory to practice

Upload: others

Post on 21-Jun-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Hendrik HeuerStockholm NLP Meetup

word2vec From theory to practice

Page 2: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

About me

Hendrik Heuer [email protected] http://hen-drik.de @hen_drik

Page 3: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Page 4: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Quoted after Socher

Page 5: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Quoted after Socher

Page 6: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Quoted after Socher

Page 7: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 8: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Vectors are directions in space Vectors can encode relationships

Page 9: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

man is to woman as king is to ?

Page 10: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 11: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 12: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 13: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 14: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec

• by Mikolov, Sutskever, Chen, Corrado and Dean at Google

• NAACL 2013

• takes a text corpus as input and produces the word vectors as output

Page 15: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Sweden Similar words

Page 16: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Harvard Similar words

Page 17: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec

• word meaning and relationships between words are encoded spatially

• two main learning algorithms in word2vec: continuous bag-of-words and continuous skip-gram

Page 18: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Goal

Page 19: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

continuous bag-of-words

• predicting the current word based on the context

• order of words in the history does not influence the projection

• faster & more appropriate for larger corpora

Mikolov et al.

Page 20: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

continuous skip-gram

• maximize classification of a word based on another word in the same sentence

• better word vectors for frequent words, but slower to train

Mikolov et al.

Page 21: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Why it is awesome

• there is a fast open-source implementation

• can be used as features for natural language processing tasks and machine learning algorithms

Page 22: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Machine Translation

Quoted after Mikolov

Page 23: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Sentiment Analysis

Quoted after SocherRecursive Neural Tensor Network

Page 24: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Image Descriptions

Quoted after Vinyals

Page 25: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Using word2vec

• Original: http://word2vec.googlecode.com/svn/trunk/

• C++11 version: https://github.com/jdeng/word2vec

• Python: http://radimrehurek.com/gensim/models/word2vec.html

• Java: https://github.com/ansjsun/word2vec_java

• Parallel java: https://github.com/siegfang/word2vec

• CUDAversion: https://github.com/whatupbiatch/cuda-word2vec

Quoted after Wang

Page 26: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Using it in Python

Page 27: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 28: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Usage

Page 29: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Training a model

Quoted after Řehůřek

Page 30: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Training a model with iterator

Quoted after Řehůřek

Page 31: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Doing it in C

• Download the code: git clone https://github.com/h10r/word2vec-macosx-maverics.git !

• Run 'make' to compile word2vec tool !

• Run the demo scripts: ./demo-word.sh and ./demo-phrases.sh

Page 32: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Doing it in C

• Download the code: git clone https://github.com/h10r/word2vec-macosx-maverics.git !

• Run 'make' to compile word2vec tool !

• Run the demo scripts: ./demo-word.sh and ./demo-phrases.sh

Page 33: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 34: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Testing a model

Quoted after Řehůřek

Page 35: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 36: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 37: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 38: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 39: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 40: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec t-SNE JSON

1. Find Word Embeddings

2. Dimensionality Reduction 3. Output

Page 41: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec t-SNE JSON

1. Find Word Embeddings

2. Dimensionality Reduction 3. Output

Page 42: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 43: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 44: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec t-SNE JSON

1. Find Word Embeddings

2. Dimensionality Reduction 3. Output

Page 45: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 46: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Should be in Macports py27-scikit-learn @0.15.2 (python, science)

Page 47: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 48: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec t-SNE JSON

1. Find Word Embeddings

2. Dimensionality Reduction 3. Output

Page 49: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 50: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 51: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 52: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013
Page 53: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

word2vec From theory to practice

Hendrik HeuerStockholm NLP Meetup

!

Discussion: Can anybody here think of ways

this might help her or him?

Page 54: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Further Reading

• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

• Tomas Mikolov, Wen-tau Yih,and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Page 55: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Further Reading

• Richard Socher - Deep Learning for NLP (without Magic)http://lxmls.it.pt/2014/?page_id=5

• Wang - Introduction to Word2vec and its application to find predominant word senses http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec%20and%20its%20application%20to%20wsd.pdf

• Exploiting Similarities among Languages for Machine Translation, Tomas Mikolov, Quoc V. Le, Ilya Sutskever, http://arxiv.org/abs/1309.4168

• Title Image by Hans Arp

Page 56: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Further Coding

• word2vechttps://code.google.com/p/word2vec/

• word2vec for MacOSX Maverics https://github.com/h10r/word2vec-macosx-maverics

• Gensim Python Library https://radimrehurek.com/gensim/index.html

• Gensim Tutorialshttps://radimrehurek.com/gensim/tutorial.html

• Scikit-Learn TSNEhttp://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Page 57: word2vec Hendrik Heuer From theory to practicehen-drik.de/pub/Heuer - word2vec - From theory to...word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NAACL 2013

Hendrik HeuerStockholm NLP Meetup

word2vec From theory to practice