word2vec durgesh kumar osint lab, cse department iit guwahati · featurized representation : word...

20
word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati

Upload: others

Post on 21-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

word2vec

Durgesh KumarOSINT LAB, CSE Department

IIT Guwahati

Page 2: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Table of contents

1 Overview

2 Background

3 Introduction

4 Training word2vec algorithmTerminologies

5 References

Durgesh Kumar word2vec 13th December 2019 1 / 16

Page 3: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Word RepresentationOne hot vector

V= [ a, aaron, ..., apple, ..., man, ..., woman, ..., king, ..., queen, ...,zula, <UNK>]

man

00...1...0

O5391

woman

00...1...0

O9853

king

00...1...0

O4914

queen

00...1...0

O7157

apple

00...1...0

O456

WeaknessIt treats each word as discrete thingit does not allow to utilize the inter word relationshipI want a glass of orange —— .

11This slide is borrowed from the lecture of deeplearning.aiDurgesh Kumar word2vec 13th December 2019 2 / 16

Page 4: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Featurized representation : wordembedding

a aaron ··· apple man woman king queen orange

Gender 0.00 0.004 · · · −1 1 −0.95 0.97 −21 46Royal 0.01 0.02 · · · 0.28 0.21 0.11 11 1.46 0.12Age 0.03 0.02 · · · −0.36 0.84 0.13 5.68 13 2.19Food 0.09 0.01 · · · 15 1.67 1.14 0.09 2.2 12noun 2.3 −2.4 · · · 0.01 0.05 8.10 1.4 1.2 1.6verb 2.3 −2.4 · · · 0.01 0.05 8.10 1.4 1.2 1.6

Durgesh Kumar word2vec 13th December 2019 3 / 16

Page 5: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Introduction to word2vecword2vec is one of the popular model to learn word embeddingword embedding is dense vector of fixed size representing a wordcapturing semantic and syntactic regularities

semantic regularities : Antonym, synonym, etcsyntactic regularities : language structure, verb, noun, etceach word is represented by a vector of fixed dimension varying from50 to 300.boy : [0.89461, 0.37758, 0.42067, -0.51334, -0.28298, 1.0012,0.18748, 0.21868, -0.030053, ... ]

word2vec is proposed by T. Milkov et. al. in 2013.paper [1]: Efficient estimation of word representations in vectorspace by T. Mikolov et. al. in ICLR Workshop 2013paper [2]: Distributed Representations of Words and Phrases andtheir Compositionality by T. Mikolov et. al. in NIPS 2013

Durgesh Kumar word2vec 13th December 2019 4 / 16

Page 6: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Interesting examples of semantic andsyntactic relationsExamples from paper [1]: Efficient estimation of word representationsin vector space by T. Mikolov et. al. in ICLR Workshop 2013

vector(“king”) - vector(“man”) + vector(“woman”) is closest tovector(“queen”)vector(“big”) : vector(“biggest”) : : vector(“small”) :vector(“smallest”)

Figure: Word pairs illustrating the gender relation and singular/plural relationsfrom paper [3]

Durgesh Kumar word2vec 13th December 2019 5 / 16

Page 7: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Interesting examples of semantic andsyntactic relationsExamples from paper [1]: Efficient estimation of word representationsin vector space by T. Mikolov et. al. in ICLR Workshop 2013

vector(“king”) - vector(“man”) + vector(“woman”) is closest tovector(“queen”)vector(“big”) : vector(“biggest”) : : vector(“small”) :vector(“smallest”)

Figure: Word pairs illustrating the gender relation and singular/plural relationsfrom paper [3]

Durgesh Kumar word2vec 13th December 2019 5 / 16

Page 8: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

More examples from Paper [1]Table: Examples of five types of semantic and nine types of syntactic wordrelationship

Type of relationship Word Pair 1 Word Pair 2Common capital city Athens Greece Oslo NorwayAll capital cities Astana Kazakhstan Harare ZimbabweCurrency Angola kwanza Iran rialCity-in-state Chicago Illinois Stockton CaliforniaMan-Woman brother sister grandson granddaughterAdjective to adverb apparent apparently rapid rapidlyOpposite possibly impossibly ethical unethicalComparative great greater tough tougherSuperlative easy easiest lucky luckiestPresent Participle think thinking read readingNationality adjective Switzerland Swiss Cambodia CambodianPast tense walking walked swimming swamPlural nouns mouse mice dollar dollarsPlural verbs work works speak speaks

Durgesh Kumar word2vec 13th December 2019 6 / 16

Page 9: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

More examples from Paper [1]

Table: Examples of the word pair relationships, using the best word vectors(Skip-gram model trained on 783M words with 300 dimensionality)

Relationship Example 1 Example 2 Example 3France - Paris Italy: Rome Japan: Tokyo Florida: Tallahasseebig - bigger small: larger cold: colder quick: quickerMiami - Florida Baltimore: Maryland Dallas: Texas Kona: HawaiiEinstein - scientist Messi: midfielder Mozart: violinist Picasso: painterSarkozy - France Berlusconi: Italy Merkel: Germany Koizumi: Japancopper - Cu zinc: Zn gold: Au uranium: plutoniumBerlusconi - Silvio Sarkozy: Nicolas Putin: Medvedev Obama: BarackMicrosoft - Windows Google: Android IBM: Linux Apple: iPhoneMicrosoft - Ballmer Google: Yahoo IBM: McNealy Apple: JobsJapan - sushi Germany: bratwurst France: tapas USA: pizza

Durgesh Kumar word2vec 13th December 2019 7 / 16

Page 10: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Few terminologies related to theword2vec modelTarget word, context word , sliding window

The yellow quick brown fox jumps over the lazy dogtarget word : foxcontext word : quick, brown, jumps, overwindow length: 5

The yellow quick brown fox jumps over the lazy dog

The yellow quick brown fox jumps over the lazy dog

Durgesh Kumar word2vec 13th December 2019 8 / 16

Page 11: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Few terminologies related to theword2vec modelTarget word, context word , sliding window

The yellow quick brown fox jumps over the lazy dogtarget word : foxcontext word : quick, brown, jumps, overwindow length: 5

The yellow quick brown fox jumps over the lazy dog

The yellow quick brown fox jumps over the lazy dog

Durgesh Kumar word2vec 13th December 2019 8 / 16

Page 12: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Few terminologies related to theword2vec modelTarget word, context word , sliding window

The yellow quick brown fox jumps over the lazy dogtarget word : foxcontext word : quick, brown, jumps, overwindow length: 5

The yellow quick brown fox jumps over the lazy dog

The yellow quick brown fox jumps over the lazy dog

Durgesh Kumar word2vec 13th December 2019 8 / 16

Page 13: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

One hot vector encoding and Embeddingmatrix

The yellow quick brown fox jumps over the lazy dogLet V = {the, yellow, quick, brown, fox, jumps, over, lazy, dog } ;Ordered dictionary of unique word in the corpus9 unique word in the corpusthe : [1,0,0,0,0,0,0,0,0]T −→ O1

yellow : [0,1,0,0,0,0,0,0,0]T −→ O2

brown : [0,0,0,1,0,0,0,0,0]T −→ O4

Durgesh Kumar word2vec 13th December 2019 9 / 16

Page 14: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

Embedding Matrix

E5∗9 =

the yellow quick brown fox jumps over lazy dog

d1 −1 1 0.04 1.24 1.12 1.21 4 −21 46d2 0.01 0.02 1.56 0.28 0.21 0.11 11 1.46 61d3 0.03 0.02 −0.36 0.84 0.13 5.68 13 2.19 72d4 0.09 0.01 0.09 1.67 1.14 0.09 2.2 3.8 49d5 2.3 −2.4 0.01 0.05 8.10 1.4 1.2 1.6 1.8

O1 = [ 1, 0, 0 , 0, 0, 0, 0, 0, 0]T

E5∗9 . O1(9∗1) = e1(5∗1) −→ [ -1, 0.01, 0.03, 0.09 , 2.3 ]T

Durgesh Kumar word2vec 13th December 2019 10 / 16

Page 15: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

CBOW and skipgram architectureThe yellow quick brown fox jumps over the lazy dog

Figure: The CBOW architecture predicts the current word based on the context,and the Skip-gram predicts surrounding words given the current word [1]

Durgesh Kumar word2vec 13th December 2019 11 / 16

Page 16: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

CBOW and skipgram architectureThe yellow quick brown fox jumps over the lazy dog

Figure: The CBOW architecture predicts the current word based on the context,and the Skip-gram predicts surrounding words given the current word [1]

Durgesh Kumar word2vec 13th December 2019 12 / 16

Page 17: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

CBOW simplified architectureThe yellow quick brown fox jumps over the lazy dog

Figure: The CBOW simplified architecture

Durgesh Kumar word2vec 13th December 2019 13 / 16

Page 18: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

CBOW vs SkipgramSkipgram is better at predicting syntactic relationshipCBOW is aprox 20 times faster than skipgramBoth CBOW and skipgram are good at predicting semanticrelationship

Durgesh Kumar word2vec 13th December 2019 14 / 16

Page 19: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

References ITomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781, 2013.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and JeffDean.Distributed representations of words and phrases and theircompositionality.In Advances in neural information processing systems, pages3111–3119, 2013.

Durgesh Kumar word2vec 13th December 2019 15 / 16

Page 20: word2vec Durgesh Kumar OSINT LAB, CSE Department IIT Guwahati · Featurized representation : word embedding 2 6 6 6 6 6 6 4 a aaron apple man woman king queen orange Gender 0.00 0.004

References IITomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.Linguistic regularities in continuous space word representations.In Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics: HumanLanguage Technologies, pages 746–751, Atlanta, Georgia, June2013. Association for Computational Linguistics.

Durgesh Kumar word2vec 13th December 2019 16 / 16