Word Vectorization (Embedding) with NNLM
TRANSCRIPT
Word Vectorization (Embedding) with NNLM
Intern 이현성
SungKyunKwan University, Data Mining Lab
Contents
• Brief intro to Keras
• Backgrounds: simple linear algebra
• Model description
• Discussion
• Go further
Brief intro to Keras
What is it?
• Deep learning library (wrapper) for Theano and TensorFlow
• High-level neural network API
Example: Multilayer perceptron
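Roughly the kind of MLP this slide shows, as a minimal Keras Sequential sketch (Keras 2-style API); the layer sizes and the dummy data are illustrative assumptions, not the slide's actual code:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Illustrative MLP: 20-dim inputs, two hidden layers, 10-class softmax output.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Dummy data, just to show the training call.
x_train = np.random.random((1000, 20))
y_train = to_categorical(np.random.randint(10, size=(1000,)), num_classes=10)
model.fit(x_train, y_train, epochs=5, batch_size=32)
```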
Example: Convolutional neural network
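Likewise, a minimal CNN sketch in the style of the Keras examples page (Keras 2-style layer names; the input shape and filter counts are assumptions):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Illustrative CNN for 28x28 grayscale images, 10 classes.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```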
If you want to dig deeper
• https://keras.io
• https://keras.io/getting-started/sequential-model-guide/#examples
Backgrounds: simple linear algebra
Backgrounds
Then, what happens to the one-hot encoded vectors?
So we can use the rows of C as dense vector representations of words (embeddings).
• How do we implement it?
• Does it work well?
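A tiny numpy sketch of that point: multiplying a one-hot vector by a matrix C simply selects one row of C, so the rows of C can serve as the dense word vectors (all names and sizes here are illustrative):

```python
import numpy as np

V, d = 5, 3                    # toy vocabulary of 5 words, 3-dim embeddings
C = np.random.randn(V, d)      # embedding matrix: one row per word

i = 2                          # id of some word
one_hot = np.zeros(V)
one_hot[i] = 1.0

# The matrix product just picks out row i of C.
assert np.allclose(one_hot @ C, C[i])
```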
Model description
Dataset Description
• Corpus of Contemporary American English: http://corpus.byu.edu/coca/
• The 1 million most frequent 5-grams in the total corpus
• No stemming or lemmatization done
• Approximately 25,000 words
Example of Dataset (preprocessed by me)

W0        W1        W2          W3       W4
Both      men       and         women    reported
i         wanted    something   that     was
the       hospital  when        he       was
to        have      a           baby     that
policies  of        the         clinton  administration
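A hedged sketch of how such 5-gram rows could be turned into training arrays; the slides do not show the actual preprocessing code, so the helper names and the id scheme below are my assumptions:

```python
import numpy as np

# Each line of the dataset is one 5-gram: four context words and a target word.
fivegrams = [
    "both men and women reported",
    "i wanted something that was",
]

# Hypothetical word -> id mapping built from the corpus vocabulary.
vocab = sorted({w for g in fivegrams for w in g.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

ids = np.array([[word_to_id[w] for w in g.split()] for g in fivegrams])
X, y = ids[:, :4], ids[:, 4]   # X: (n, 4) context ids, y: (n,) target ids
```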
Model architecture
• Goal: similar words should have similar vector representations.
• Input: N-gram word list
• Output: list of probabilities that word t is word i
Model description

[Architecture diagram] The four input words W0–W3 are each multiplied by the embedding matrix C, giving four vectors CW0–CW3 of dimension 30. Flattening concatenates them into one vector of dimension 120, which passes through a ReLU layer (also dimension 120) and then a softmax layer that outputs W4_hat, the prediction for W4. The difference between W4 and W4_hat (negative log likelihood, i.e. categorical cross-entropy) drives back-propagation.
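In symbols, a sketch consistent with the diagram above (the hidden- and output-layer parameters $W_h, b_h, U, b_o$ are not named on the slides):

```latex
x = [C_{w_0};\, C_{w_1};\, C_{w_2};\, C_{w_3}] \in \mathbb{R}^{120}
h = \mathrm{ReLU}(W_h x + b_h) \in \mathbb{R}^{120}
\hat{y} = \mathrm{softmax}(U h + b_o), \qquad L = -\log \hat{y}_{w_4}
```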
How loss is calculated
• V = {"Miku is so cute again today"}
• vector representation:
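A toy numeric sketch of that loss (the probabilities are made up): the softmax output ŷ is compared against the one-hot target with categorical cross-entropy, which reduces to the negative log likelihood of the true next word.

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])    # softmax output over a toy 3-word vocabulary
y_true = np.array([0.0, 1.0, 0.0])   # one-hot target: the true next word is word 1

# Categorical cross-entropy: -sum_i y_i * log(y_hat_i) = -log(0.7) ≈ 0.357
loss = -np.sum(y_true * np.log(y_hat))
```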
Model description
• # samples = 1 million
• Minibatch training, epoch = 1000
• # iterations = 50
Implementation with Keras
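A minimal Keras sketch of the model described above, assuming a ~25,000-word vocabulary, 4 context words, embedding dimension 30, and a 120-unit ReLU hidden layer; the optimizer and any settings not stated on the slides are assumptions:

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

VOCAB = 25000   # approximately 25,000 words in the corpus
CONTEXT = 4     # W0..W3 predict W4
EMB_DIM = 30    # each word maps to a 30-dim row of C

model = Sequential()
# The Embedding layer holds the matrix C; it looks up one 30-dim row per input word id.
model.add(Embedding(input_dim=VOCAB, output_dim=EMB_DIM, input_length=CONTEXT))
# Flatten the four 30-dim vectors into one 120-dim vector.
model.add(Flatten())
model.add(Dense(120, activation='relu'))
# Softmax over the vocabulary: P(W4 = word i | W0..W3).
model.add(Dense(VOCAB, activation='softmax'))

# Negative log likelihood of the true next word = categorical cross-entropy.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit(X, y, batch_size=1000, epochs=...)   # X: (n, 4) word ids, y: (n,) ids
```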
Discussion
Is this vector representation actually a 'vector representation'?
• Do similar vectors have similar meanings (syntactically and semantically)?
Results
• Find similar vectors using the trained feature vectors Ci
• KNN with a Euclidean metric was used (a minimal sketch follows the results tables below)
word 1st 2nd 3rd 4th 5th
Look Looks Looking Stared Peek glance
Run Ran Running term Pass Runs
Talk Talked Talking Story Bones Truth
Know Guess Thinking Knowing Knows sure
Boy Girl Woman Man Africa Doctor
Year Week Weeks Days Decade Month
Times Moment Day Nights Night Pause
Results
• Cases that didn't work well…?
word 1st 2nd 3rd 4th 5th
The Our United Your White Main
Japan Russia Slavery Terrorism Britain Sector
Indian Competitive Humanitarian Regulatory Canadian Investigative
New His Our Its My your
Your Our My His White Their
Gay Missile Reproductive Governmental Preventive Same-sex
A Presidential San Foreign The domestic
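As referenced above, a minimal sketch of the nearest-neighbor lookup; the lookup-table names are hypothetical, and C would be the trained embedding matrix (e.g. model.layers[0].get_weights()[0] in the Keras sketch):

```python
import numpy as np

def nearest_words(word, C, word_to_id, id_to_word, k=5):
    """Return the k words whose embeddings are closest to `word` in Euclidean distance."""
    v = C[word_to_id[word]]
    dists = np.linalg.norm(C - v, axis=1)   # distance from every row of C to v
    order = np.argsort(dists)[1:k + 1]      # skip position 0: the query word itself
    return [id_to_word[i] for i in order]
```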
Discussion
• Good syntactic similarity for most words.
• Good semantic (meaning) similarity for nouns and verbs.
• Bad semantic similarity for other words (adjectives, etc.).
• I think this is mainly because I skipped:
  • lemmatization (erasing unimportant words such as 'a', 'the', …)
  • stemming (hashing words like 'did', 'do', and 'done' into a single 'do')
Go further (planned for the next presentation)
• Use Skip-gram or CBOW
• Toward better word-to-vector representations
• Better efficiency
• Larger corpus size
• Visualization for word models
Use Skip-gram or CBOW
Proper visualization for word models
They actually don't look alike at all… ;;;