Word Vectorization (Embedding) with NNLM
TRANSCRIPT
Word Vectorization (Embedding) with NNLM
Intern 이현성
SungKyunKwan University, Data Mining Lab
Contents
• Brief intro to Keras
• Backgrounds: simple linear algebra
• Model description
• Discussion
• Go further
Brief intro to Keras
What is it?
• Deep learning library (wrapper) for Theano and TensorFlow
• High-level neural network API
Example: Multilayer perceptron
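Roughly the kind of MLP this slide shows, as a minimal Keras Sequential sketch (Keras 2-style API); the layer sizes and the dummy data are illustrative assumptions, not the slide's actual code:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Illustrative MLP: 20-dim inputs, two hidden layers, 10-class softmax output.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Dummy data, just to show the training call.
x_train = np.random.random((1000, 20))
y_train = to_categorical(np.random.randint(10, size=(1000,)), num_classes=10)
model.fit(x_train, y_train, epochs=5, batch_size=32)
```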
Example: Convolutional neural network
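Likewise, a minimal CNN sketch in the style of the Keras examples page (Keras 2-style layer names; the input shape and filter counts are assumptions):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Illustrative CNN for 28x28 grayscale images, 10 classes.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```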
If you want to dig deeper
• https://keras.io
• https://keras.io/getting-started/sequential-model-guide/#examples
Backgrounds: simple linear algebra
Backgrounds
Then, what happens to the one-hot encoded vectors?
So we can use the rows of C as dense vector representations of words (embeddings).
• How do we implement it?
• Does it work well?
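A tiny numpy sketch of that point: multiplying a one-hot vector by a matrix C simply selects one row of C, so the rows of C can serve as the dense word vectors (all names and sizes here are illustrative):

```python
import numpy as np

V, d = 5, 3                    # toy vocabulary of 5 words, 3-dim embeddings
C = np.random.randn(V, d)      # embedding matrix: one row per word

i = 2                          # id of some word
one_hot = np.zeros(V)
one_hot[i] = 1.0

# The matrix product just picks out row i of C.
assert np.allclose(one_hot @ C, C[i])
```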
Model description
Dataset Description
• Corpus of Contemporary American English: http://corpus.byu.edu/coca/
• The 1 million most frequent 5-grams in the total corpus
• No stemming or lemmatization done
• Approximately 25,000 words
Example of Dataset (preprocessed by me)

W0        W1        W2          W3       W4
Both      men       and         women    reported
i         wanted    something   that     was
the       hospital  when        he       was
to        have      a           baby     that
policies  of        the         clinton  administration
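A hedged sketch of how such 5-gram rows could be turned into training arrays; the slides do not show the actual preprocessing code, so the helper names and the id scheme below are my assumptions:

```python
import numpy as np

# Each line of the dataset is one 5-gram: four context words and a target word.
fivegrams = [
    "both men and women reported",
    "i wanted something that was",
]

# Hypothetical word -> id mapping built from the corpus vocabulary.
vocab = sorted({w for g in fivegrams for w in g.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

ids = np.array([[word_to_id[w] for w in g.split()] for g in fivegrams])
X, y = ids[:, :4], ids[:, 4]   # X: (n, 4) context ids, y: (n,) target ids
```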
Model architecture
• Goal: similar words should have similar vector representations.
• Input: N-gram word list
• Output: list of probabilities that word t is word i
Model description

[Architecture diagram] The four input words W0–W3 are each multiplied by the embedding matrix C, giving four vectors CW0–CW3 of dimension 30. Flattening concatenates them into one vector of dimension 120, which passes through a ReLU layer (also dimension 120) and then a softmax layer that outputs W4_hat, the prediction for W4. The difference between W4 and W4_hat (negative log likelihood, i.e. categorical cross-entropy) drives back-propagation.
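In symbols, a sketch consistent with the diagram above (the hidden- and output-layer parameters $W_h, b_h, U, b_o$ are not named on the slides):

```latex
x = [C_{w_0};\, C_{w_1};\, C_{w_2};\, C_{w_3}] \in \mathbb{R}^{120}
h = \mathrm{ReLU}(W_h x + b_h) \in \mathbb{R}^{120}
\hat{y} = \mathrm{softmax}(U h + b_o), \qquad L = -\log \hat{y}_{w_4}
```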
How loss is calculated
• V = {"Miku is so cute again today"}
• vector representation:
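A toy numeric sketch of that loss (the probabilities are made up): the softmax output ŷ is compared against the one-hot target with categorical cross-entropy, which reduces to the negative log likelihood of the true next word.

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])    # softmax output over a toy 3-word vocabulary
y_true = np.array([0.0, 1.0, 0.0])   # one-hot target: the true next word is word 1

# Categorical cross-entropy: -sum_i y_i * log(y_hat_i) = -log(0.7) ≈ 0.357
loss = -np.sum(y_true * np.log(y_hat))
```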
Model description
• # samples = 1 million
• Minibatch training, epoch = 1000
• # iterations = 50
Implementation with Keras
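A minimal Keras sketch of the model described above, assuming a ~25,000-word vocabulary, 4 context words, embedding dimension 30, and a 120-unit ReLU hidden layer; the optimizer and any settings not stated on the slides are assumptions:

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

VOCAB = 25000   # approximately 25,000 words in the corpus
CONTEXT = 4     # W0..W3 predict W4
EMB_DIM = 30    # each word maps to a 30-dim row of C

model = Sequential()
# The Embedding layer holds the matrix C; it looks up one 30-dim row per input word id.
model.add(Embedding(input_dim=VOCAB, output_dim=EMB_DIM, input_length=CONTEXT))
# Flatten the four 30-dim vectors into one 120-dim vector.
model.add(Flatten())
model.add(Dense(120, activation='relu'))
# Softmax over the vocabulary: P(W4 = word i | W0..W3).
model.add(Dense(VOCAB, activation='softmax'))

# Negative log likelihood of the true next word = categorical cross-entropy.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit(X, y, batch_size=1000, epochs=...)   # X: (n, 4) word ids, y: (n,) ids
```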
Discussion
Is this vector representation actually a 'vector representation'?
• Do similar vectors have similar meanings (syntactically and semantically)?
Results
• Find similar vectors using the trained feature vectors Ci
• KNN with a Euclidean metric was used (a minimal sketch follows the results tables below)
word 1st 2nd 3rd 4th 5th
Look Looks Looking Stared Peek glance
Run Ran Running term Pass Runs
Talk Talked Talking Story Bones Truth
Know Guess Thinking Knowing Knows sure
Boy Girl Woman Man Africa Doctor
Year Week Weeks Days Decade Month
Times Moment Day Nights Night Pause
Results
• Cases that didn't work well…?
word 1st 2nd 3rd 4th 5th
The Our United Your White Main
Japan Russia Slavery Terrorism Britain Sector
Indian Competitive Humanitarian Regulatory Canadian Investigative
New His Our Its My your
Your Our My His White Their
Gay Missile Reproductive Governmental Preventive Same-sex
A Presidential San Foreign The domestic
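As referenced above, a minimal sketch of the nearest-neighbor lookup; the lookup-table names are hypothetical, and C would be the trained embedding matrix (e.g. model.layers[0].get_weights()[0] in the Keras sketch):

```python
import numpy as np

def nearest_words(word, C, word_to_id, id_to_word, k=5):
    """Return the k words whose embeddings are closest to `word` in Euclidean distance."""
    v = C[word_to_id[word]]
    dists = np.linalg.norm(C - v, axis=1)   # distance from every row of C to v
    order = np.argsort(dists)[1:k + 1]      # skip position 0: the query word itself
    return [id_to_word[i] for i in order]
```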
Discussion
• Good syntactic similarity for most words.
• Good semantic (meaning) similarity for nouns and verbs.
• Bad semantic similarity for other words (adjectives, etc.).
• I think this is mainly because I skipped:
  • lemmatization (erasing unimportant words such as 'a', 'the', …)
  • stemming (hashing words like 'did', 'do', and 'done' into a single 'do')
Go further (planned for the next presentation)
• Use Skip-gram or CBOW
• Toward better word-to-vector representations
• Better efficiency
• Larger corpus size
• Visualization for word models
Use Skip-gram or CBOW
Proper visualization for word models
They actually don't look alike at all… ;;;