2016word embbed

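Computing Semantics, Syntax, and Metaphor with Word Embedding Models (tentative)

Shin Asakawa, Information Processing Center, Tokyo Woman's Christian University

2-6-1 Zempukuji, Suginami-ku, Tokyo 167-8585, Japan. E-mail: [email protected]

http://www.cis.twcu.ac.jp/~asakawa/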

Abstract

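This paper surveys language processing with word embedding models and offers an outlook. Here "word embedding models" refers collectively to word2vec (a program name), skip-gram (a project name), and GloVe. The literature sometimes abbreviates this family as SGNS (skip-gram with negative sampling). We outline word embedding models and introduce algorithms for computing metaphor and metonymy. The research behind them is mathematically related to language models, hidden Markov models, and conditional random fields. In addition, we would like to introduce seq2seq (a machine translation model), skip-thought, attention models, image captioning, models that generate images from language and, if time allows, how one could build the high-school-girl chatbot "Rinna".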

Key Words: word embedding models, word2vec, neural language models, LSTM, GRU, skip-thought

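1 Introduction

For literature in Japanese, there is Nishio [40]. The book is built around worked examples, so readers can learn the material hands-on.

The TensorFlow tutorial gives a well-organized introduction and is recommended reading (footnote 1). A Japanese translation also exists (footnote 2), but readers who are comfortable with English are better off reading the original. The passage quoted below motivates word embedding models.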

Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of the individual raw pixel-intensities for image data, or e.g. power spectral density coefficients for audio data. For tasks like object or speech recognition we know that all the information required to successfully perform the task is encoded in the data (because humans can perform these tasks from the raw data). However, natural language processing systems traditionally treat words as discrete atomic symbols, and therefore ’cat’ may be represented as Id537 and ’dog’ as Id143. These encodings are arbitrary, and provide no useful information to the system regarding the relationships that may exist between the individual symbols. This means that the model can leverage very little of what it has learned about ’cats’ when it is processing data about ’dogs’ (such that they are both animals, four-legged, pets, etc.). Representing words as unique, discrete ids furthermore leads to data sparsity, and usually means that we may need more data in order to successfully train statistical models. Using vector representations can overcome some of these obstacles. Vector space models (footnote 3) (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points (’are embedded nearby each other’). VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis (footnote 4), which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis (footnote 5)), and predictive methods (e.g. neural probabilistic language models (footnote 6)).

This distinction is elaborated in much more detail by Baroni et al. (footnote 7), but in a nutshell: Count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count-statistics down to a small, dense vector for each word. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram

1 https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html
2 http://media.accel-brain.com/tensorflow-vector-representations-of-words/
3 https://en.wikipedia.org/wiki/Vector_space_model
4 https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis
5 https://en.wikipedia.org/wiki/Latent_semantic_analysis
6 http://www.scholarpedia.org/article/Neural_net_language_models
7 http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf

2 Shin Asakawa

model (Chapter 3.1 and 3.2 in Mikolov et al. (footnote 8)). Algorithmically, these models are similar, except that CBOW predicts target words (e.g. ’mat’) from source context words (’the cat sits on the’), while the skip-gram does the inverse and predicts source context-words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. We will focus on the skip-gram model in the rest of this tutorial.

The second-to-last paragraph of the quotation may be hard to follow. According to Baroni et al., count-based methods are conventional models such as PCA, SVD, LSI, and NMF (TF-IDF also belongs here in a broad sense), whereas predictive models are word2vec (skip-gram, CBOW) and GloVe.

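2 The Mikolov Revolution

2.1 Handed Down from Ancient Times

The family of models known as word embedding models, or vector space models, may seem to have suddenly attracted attention in 2013, but it can be traced back to the 1990s. Theoretical analysis has also advanced recently, and the approach can be said to have achieved solid results and become widely known.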

pay attention to me!

Figure 1: Adapted from Fig. 1 of [7].

Figure 2: Tomas Mikolov. The photo on the right is from his talk at NIPS 2015.

We begin with the history [33, 34]. According to Hinton (footnote 9), the characteristics of backpropagation in the 1990s

8 https://arxiv.org/pdf/1301.3781.pdf
9 https://www.youtube.com/watch?v=EK61htlw8hY

Word Embeddings and Metaphor 3

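can be summarized in the following three points.

1. Too little data and too small a scale (too small)

2. Too little speed (too slow)

3. Too little optimization theory (too stupid): networks were initialized in stupid ways and used the wrong type of nonlinearity

As is well known, what broke through these limitations was Hinton's restricted Boltzmann machine (RBM) [15, 16] and, going further back, LeCun's LeNet-5 [23]. Mikolov's contribution [30] was to extend the classical simple recurrent neural network model [9, 10] into a practical language model.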

2.2 Mikolov's Language Model

EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL

Tomas Mikolov1,2, Stefan Kombrink1, Lukas Burget1, Jan “Honza” Cernocky1, Sanjeev Khudanpur2

1 Brno University of Technology, Speech@FIT, Czech Republic
2 Department of Electrical and Computer Engineering, Johns Hopkins University, USA

{imikolov,kombrink,burget,cernocky}@fit.vutbr.cz, [email protected]

ABSTRACT

We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is the computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both training and testing phases. Next, we show importance of using a backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end, we discuss possibilities how to reduce the amount of parameters in the model. The resulting RNN model can thus be smaller, faster both during training and testing, and more accurate than the basic one.

Index Terms— language modeling, recurrent neural networks,speech recognition

1. INTRODUCTION

Statistical models of natural language are a key part of many systems today. The most widely known applications are automatic speech recognition (ASR), machine translation (MT) and optical character recognition (OCR). In the past, there was always a struggle between those who follow the statistical way, and those who claim that we need to adopt linguistics and expert knowledge to build models of natural language. The most serious criticism of statistical approaches is that there is no true understanding occurring in these models, which are typically limited by the Markov assumption and are represented by n-gram models. Prediction of the next word is often conditioned just on two preceding words, which is clearly insufficient to capture semantics. On the other hand, the criticism of linguistic approaches was even more straightforward: despite all the efforts of linguists, statistical approaches were dominating when performance in real world applications was a measure.

Thus, there has been a lot of research effort in the field of statistical language modeling. Among models of natural language, neural network based models seemed to outperform most of the competition [1] [2], and were also showing steady improvements in state of the art speech recognition systems[3]. The main power of neural network based language models seems to be in their simplicity: almost the same model can be used for prediction of many types of signals, not just language. These models perform implicitly clustering of words in low-dimensional space. Prediction based on this compact representation of words is then more robust. No additional smoothing of probabilities is required.

This work was partly supported by European project DIRAC (FP6-027787), Grant Agency of Czech Republic project No. 102/08/0707, Czech Ministry of Education project No. MSM0021630528 and by BUT FIT grant No. FIT-10-S-2.

Fig. 1. Simple recurrent neural network.

Among many following modifications of the original model, the recurrent neural network based language model [4] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns itself from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, recurrent neural network (which can be considered as a deep architecture [5]) can perform clustering of similar histories. This allows for instance efficient representation of patterns with variable length.

In this work, we show the importance of the Backpropagation through time algorithm for learning appropriate short term memory. Then we show how to further improve the original RNN LM by decreasing its computational complexity. In the end, we briefly discuss possibilities of reducing the size of the resulting model.

2. MODEL DESCRIPTION

The recurrent neural network described in [4] is also called Elman network [6]. Its architecture is shown in Figure 1. The vector x(t) is formed by concatenating the vector w(t) that represents the current word while using 1 of N coding (thus its size is equal to the size of the vocabulary) and vector s(t − 1) that represents output values in the hidden layer from the previous time step. The network is trained by using the standard backpropagation and contains input, hidden and output layers. Values in these layers are computed as follows:

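$$ x(t) = \left[ w(t)^{T} \; s(t-1)^{T} \right]^{T} \tag{1} $$

$$ s_j(t) = f\left( \sum_i x_i(t)\, u_{ji} \right) \tag{2} $$

$$ y_k(t) = g\left( \sum_j s_j(t)\, v_{kj} \right) \tag{3} $$

5528 · 978-1-4577-0539-7/11/$26.00 ©2011 IEEE · ICASSP 2011

Figure 3: Mikolov's RNNLM. From Mikolov et al. (2011).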

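• The input layer w and the output layer y have the same dimensionality, which equals the vocabulary size (roughly 10,000 to 200,000 words).

• The hidden layer s is comparatively low-dimensional (50 to 1,000 neurons).

• U is the weight matrix from the input layer to the hidden layer, and V is the weight matrix from the hidden layer to the output layer.

• Without the recurrent weight matrix W, the model is equivalent to a bigram (2-gram) neural network language model.

• The outputs of the hidden-layer neurons and of the output-layer neurons are

$$ s(t) = f\left( U w(t) + W s(t-1) \right) \tag{1} $$

$$ y(t) = g\left( V s(t) \right) \tag{2} $$

where f(z) is the sigmoid function,

$$ f(z) = \frac{1}{1 + \exp(-z)} \tag{3} $$

and g(z) is the softmax function,

$$ g(z_m) = \frac{\exp(z_m)}{\sum_k \exp(z_k)} \tag{4} $$

Incidentally, the hardmax function is

$$ g(z_m) = \operatorname*{argmax}_m \; p(z_m) \tag{5} $$

• For training, the gradient of the output-layer error vector e_o(t) at time t is computed with the cross-entropy loss:

$$ e_o(t) = d(t) - y(t) \tag{6} $$

Here d(t) is the target vector representing the word to be output, that is, the input word w(t+1) at time t+1. Bishop [4] calls this encoding the 1-of-K representation, and Bengio calls it a one-hot vector.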


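• The hidden-to-output weight matrix V at time t is updated from the hidden-layer vector s(t) and the output-layer vector y(t), through the error e_o(t), as follows:

$$ V(t+1) = V(t) + \alpha\, s(t)\, e_o(t)^{\top} \tag{7} $$

where α is the learning rate.

• The error gradient vector of the hidden layer is computed from the error gradient vector of the output layer:

$$ e_h(t) = d_h\left( e_o(t)^{\top} V,\; t \right) \tag{8} $$

where the function d_h() is applied to each element of the vector:

$$ d_{hj}(x, t) = x\, s_j(t)\, \left( 1 - s_j(t) \right) \tag{9} $$

• The input-to-hidden weight matrix U at time t is updated as follows:

$$ U(t+1) = U(t) + \alpha\, w(t)\, e_h(t)^{\top} \tag{10} $$

The input-layer vector w(t) at time t is zero except for a single neuron. In the update above, the only weights that change are those attached to the one neuron that corresponds to the input word, because every other input is zero, so the computation can be made fast.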
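The forward computation of the hidden and output layers described above can be sketched in a few lines. This is a minimal illustration, not Mikolov's implementation: the function name `rnnlm_step`, the toy sizes (4 words, 3 hidden units), the weight values, and the layout of U (hidden x vocabulary) and V (vocabulary x hidden) are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rnnlm_step(U, W, V, word_id, s_prev):
    """One forward step: s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t))."""
    # w(t) is one-hot, so U w(t) is simply column `word_id` of U.
    u_col = [row[word_id] for row in U]
    rec = matvec(W, s_prev)
    s = [sigmoid(a + b) for a, b in zip(u_col, rec)]
    y = softmax(matvec(V, s))
    return s, y

# Hypothetical toy weights (illustrative values only).
U = [[0.1, -0.2, 0.3, 0.0],
     [0.0, 0.4, -0.1, 0.2],
     [-0.3, 0.1, 0.2, 0.5]]   # hidden x vocabulary
W = [[0.0, 0.1, 0.0],
     [0.2, 0.0, -0.1],
     [0.0, 0.3, 0.0]]         # hidden x hidden
V = [[0.2, -0.1, 0.0],
     [0.0, 0.3, 0.1],
     [-0.2, 0.0, 0.4],
     [0.1, 0.1, -0.3]]        # vocabulary x hidden

s0 = [0.0, 0.0, 0.0]
s1, y1 = rnnlm_step(U, W, V, word_id=2, s_prev=s0)
print(s1, y1)
```

Because the input is one-hot, the product U w(t) reduces to a column lookup, which is the projection of a word into the hidden-layer vector space.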

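2.3 word2vec

The key point of Mikolov's language model is that the weight matrix U in Figure 3 projects a one-hot vector into a vector space whose dimensionality equals the number of hidden-layer neurons. This opened the way to word2vec: the word2vec model proposed by Mikolov projects words into a vector space [27, 28, 29] (footnote 10).

[Figure: skip-gram architecture. The input word w(t) predicts the context words w(t-2), w(t-1), w(t+1), w(t+2).]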

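Skip-gram can be formulated as follows. Given a word sequence w_1, w_2, ..., w_T, it maximizes

$$ \ell = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left( w_{t+j} \mid w_t \right) \tag{11} $$

Hierarchical softmax. Let n(w, j) be the j-th node on the path from the root to w, and let L(w) be the length of that path, so that n(w, 1) = root and n(w, L(w)) = w. Let ch(n) be an arbitrary fixed child of node n, and let [[x]] be 1 if x is true and −1 otherwise. The hierarchical softmax is then

$$ p\left( w \mid w_I \right) = \prod_{j=1}^{L(w)-1} \sigma\left( [[\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,]] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right) \tag{12} $$

where σ(x) = [1 + exp(−x)]^{-1} is the sigmoid function. It is straightforward to verify that \sum_{w=1}^{W} p(w | w_I) = 1. The cost of computing ∇ log p(w_O | w_I) is proportional to L(w_O).

10 Recurrent Neural Network Language Model: http://www.fit.vutbr.cz/~imikolov/rnnlm/
Word2vec: https://github.com/dav/word2vec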
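The double sum in the skip-gram objective runs over every (center word, context word) pair inside a window of half-width c. The sketch below enumerates those pairs and evaluates the objective for a given probability function. The helper names `skipgram_pairs` and `objective`, the toy sentence, and the uniform probability used in the usage lines are assumptions for illustration; positions that fall outside the sentence are simply skipped.

```python
import math

def skipgram_pairs(tokens, c):
    """Return (center, context) pairs for all offsets -c <= j <= c, j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs

def objective(tokens, c, prob):
    """Average over positions t of the summed log p(context | center)."""
    total = sum(math.log(prob(context, center))
                for center, context in skipgram_pairs(tokens, c))
    return total / len(tokens)

tokens = "the cat sits on the mat".split()
pairs = skipgram_pairs(tokens, c=2)

# A placeholder model: uniform probability over the 5 distinct words.
uniform = lambda context, center: 1.0 / 5.0
print(len(pairs), objective(tokens, 2, uniform))
```

A trained model replaces the placeholder `uniform` with a softmax (or its hierarchical or negative-sampling approximation) over word vectors.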


2.4 Negative Sampling

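$$ \log \sigma\left( {v'_{w_O}}^{\top} v_{w_I} \right) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left( -{v'_{w_i}}^{\top} v_{w_I} \right) \right] \tag{13} $$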
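For a single (input, output) pair, the negative-sampling objective can be evaluated directly once K noise words have been drawn. The sketch below replaces the expectation by a sum over K already-sampled negative vectors, which is an assumption of this example; the function names and the two-dimensional toy vectors are hypothetical.

```python
import math

def log_sigmoid(x):
    """log(1 / (1 + exp(-x))), computed stably for either sign of x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(v_in, v_out_pos, v_out_negs):
    """Negative-sampling objective for one pair with K sampled negatives."""
    positive = log_sigmoid(dot(v_out_pos, v_in))
    negative = sum(log_sigmoid(-dot(v_neg, v_in)) for v_neg in v_out_negs)
    return positive + negative

v_in = [1.0, 0.0]                     # input vector of w_I
v_pos = [2.0, 0.0]                    # output vector of the observed word w_O
v_negs = [[-2.0, 0.0], [0.0, 1.0]]    # output vectors of K = 2 noise words
value = sgns_objective(v_in, v_pos, v_negs)
print(value)
```

The objective is always non-positive; it rises toward 0 as the observed pair gets a large inner product and the noise pairs get large negative ones.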

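[Figure: two-dimensional plot, both axes from -2 to 2, of the vectors for countries (China, Japan, France, Russia, Germany, Italy, Spain, Greece, Turkey, Poland, Portugal) and their capitals (Beijing, Tokyo, Paris, Moscow, Berlin, Rome, Madrid, Athens, Ankara, Warsaw, Lisbon).]

Figure 4: A sample of SGNS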

2.5 CBOW

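[Figure: the CBOW and skip-gram architectures, each with INPUT, PROJECTION, and OUTPUT layers. CBOW sums (SUM) the context words w(t-2), w(t-1), w(t+1), w(t+2) to predict w(t); skip-gram takes w(t) as input and predicts w(t-2), w(t-1), w(t+1), w(t+2).]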

Figure 5: CBOW

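From [27]: vector(“King”) - vector(“Man”) + vector(“Woman”) = vector(“Queen”)

Consider the analogy a : b = c : d, where d is unknown. The embedding vectors x_a, x_b, x_c are already normalized. We compute y = x_b − x_a + x_c. Because there is not necessarily a word at exactly that position, we search for the nearest word. In the RNNLM study [29], the search uses cosine similarity (also described as a correlation coefficient, since each vector is normalized):

$$ w^{*} = \operatorname*{argmax}_w \frac{x_w \cdot y}{\lVert x_w \rVert \, \lVert y \rVert} \tag{14} $$

$$ \mathrm{dist}(a, b) = \cos\theta_{ab} = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \tag{15} $$

The squared Euclidean distance, on the other hand, is

$$ \mathrm{dist}(a, b) = |a - b|^2 = |a|^2 + |b|^2 - 2\,|a|\,|b| \cos\theta_{ab} \tag{16} $$

$$ = |a|^2 + |b|^2 - 2\,(a \cdot b) \tag{17} $$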
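The analogy search can be sketched as follows: normalize the three known vectors, form y = x_b − x_a + x_c, and return the word whose vector has the highest cosine similarity to y. The embedding table below is a hand-made two-dimensional toy, not trained vectors, and excluding the three query words from the candidates is a common convention assumed here rather than something stated in the text.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def analogy(emb, a, b, c):
    """Solve a : b = c : ? by the nearest cosine neighbour of x_b - x_a + x_c."""
    xa, xb, xc = (normalize(emb[w]) for w in (a, b, c))
    y = [q - p + r for p, q, r in zip(xa, xb, xc)]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], y))

# Hypothetical toy embeddings (illustrative values only).
emb = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}
answer = analogy(emb, "man", "woman", "king")
print(answer)
```

For unit-length vectors the squared Euclidean distance equals 2 − 2 cos θ, so ranking by cosine similarity and ranking by Euclidean distance give the same nearest neighbour.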

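3 Results

3.1 Analogy Tasks

vec(“Berlin”) - vec(“Germany”) + vec(“France”) → vec(“Paris”)
vec(“quickly”) - vec(“quick”) + vec(“slow”) → vec(“slowly”)

Figure 6: Left: the gender relation shown for three word pairs. Right: the relation between singular and plural forms. Each word is embedded in a high-dimensional space.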

Table 1: Examples from the analogy task (n = 3218). The task is to find the fourth word (accuracy is roughly 72%).

Newspapers
New York | New York Times | Baltimore | Baltimore Sun
San Jose | San Jose Mercury News | Cincinnati | Cincinnati Enquirer

NHL ice hockey teams
Boston | Boston Bruins | Montreal | Montreal Canadiens
Phoenix | Phoenix Coyotes | Nashville | Nashville Predators

NBA basketball teams
Detroit | Detroit Pistons | Toronto | Toronto Raptors
Oakland | Golden State Warriors | Memphis | Memphis Grizzlies

Airlines
Austria | Austrian Airlines | Spain | Spainair
Belgium | Brussels Airlines | Greece | Aegean Airlines

Company executives
Steve Ballmer | Microsoft | Larry Page | Google
Samuel J. Palmisano | IBM | Werner Vogels | Amazon


Table 2: Examples of the word pair relationships, using the best word vectors from Table 4 (Skip-gram model trained on 783M words with 300 dimensionality). From Table 8 of [27].

Relationship | Example 1 | Example 2 | Example 3
France - Paris | Italy: Rome | Japan: Tokyo | Florida: Tallahassee
big - bigger | small: larger | cold: colder | quick: quicker
Miami - Florida | Baltimore: Maryland | Dallas: Texas | Kona: Hawaii
Einstein - scientist | Messi: midfielder | Mozart: violinist | Picasso: painter
Sarkozy - France | Berlusconi: Italy | Merkel: Germany | Koizumi: Japan
copper - Cu | zinc: Zn | gold: Au | uranium: plutonium
Berlusconi - Silvio | Sarkozy: Nicolas | Putin: Medvedev | Obama: Barack
Microsoft - Windows | Google: Android | IBM: Linux | Apple: iPhone
Microsoft - Ballmer | Google: Yahoo | IBM: McNealy | Apple: Jobs
Japan - sushi | Germany: bratwurst | France: tapas | USA: pizza

The dataset can be downloaded (footnote 11).

Table 3: Semantic relations (five) and syntactic relations (nine). From Table 1 of [27].

Type of relationship | Word Pair 1 | Word Pair 2
Common capital city | Athens, Greece | Oslo, Norway
All capital cities | Astana, Kazakhstan | Harare, Zimbabwe
Currency | Angola, kwanza | Iran, rial
City-in-state | Chicago, Illinois | Stockton, California
Man-Woman | brother, sister | grandson, granddaughter
Adjective to adverb | apparent, apparently | rapid, rapidly
Opposite | possibly, impossibly | ethical, unethical
Comparative | great, greater | tough, tougher
Superlative | easy, easiest | lucky, luckiest
Present participle | think, thinking | read, reading
Nationality adjective | Switzerland, Swiss | Cambodia, Cambodian
Past tense | walking, walked | swimming, swam
Plural nouns | mouse, mice | dollar, dollars
Plural verbs | work, works | speak, speaks

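[Figure: percent correct as a function of the number of training words (24M, 49M, 98M, 196M, 391M, 783M) for embedding dimensionalities of 50, 100, 300, and 600.]

Figure 7: Accuracy on the semantic and syntactic questions with the CBOW model. The horizontal axis is the size of the training set (number of training words), and the colors of the curves distinguish the dimensionality of the embedding layer (number of neurons). Adapted from Table 2 of [27].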

4 Relation to Other Models

Comparisons have been made with latent semantic analysis (LSA) [20, 21, 22], latent Dirichlet allocation (LDA) [6], and principal component analysis (PCA) [31].

11 http://2code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt

The comparison used LDC corpora with 320 million words in total, a vocabulary of 82,000 words, and a hidden layer of 640 dimensions [29].

The NNLM performed better than the RNN model (it has eight times as many parameters). The CBOW model was better than the NNLM on the syntactic questions but about the same on the semantic questions. The skip-gram model was slightly worse than CBOW on the syntactic questions, yet still better than the NNLM, and it was the best of all on the semantic questions.

[Figure: accuracy on the Semantic, Syntactic, and Total question sets for skip-gram (300/783M), CBOW (300/783M), Our NNLM (100/6B), Mikolov RNNLM, Huang NNLM (50/990M), and Collobert-Weston NNLM (50/660M).]

Figure 8: Comparison of publicly available word vectors on the Semantic-Syntactic Word Relationship test set, and word vectors from our models. Full vocabularies are used.

[Figure: bar chart, horizontal axis from 0 to 60, comparing Skip-gram+RNNLMs, Skip-gram, Log-bilinear model [24], RNNLMs [19], Average LSA similarity [32], and 4-gram [32].]

Figure 9: Comparison and combination of models on the Microsoft Sentence Completion Challenge.

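Skip-gram does no better than LSA. Incidentally, the state of the art is 58.9%.

[Figure: percent correct on the Semantic, Syntactic, and MSR Word Relatedness test sets for RNNLM, NNLM, CBOW, and skip-gram.]

Figure 10: Model comparison on semantic, syntactic, and relatedness tasks. Adapted from Table 4 of [27].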

5 Implementation

Pythonistas will probably end up using gensim (footnote 12).

$ pip install -U gensim

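gensim provides not only word2vec but also LSA, LSI, SVD, and LDA (Latent Dirichlet Allocation), and it has become the de facto standard for NLP practitioners. About the only method gensim does not support is NMF [24, 25] (footnote 13). Older literature says that Mikolov's original C++ code is available (footnote 14). As is well known, however, that site has ended its service.

Wikipedia has a detailed description of the vector space model (footnote 15). It differs from the description on the Japanese Wikipedia (footnote 16). I would like to do something about this state of affairs (or sorry state?).

The TensorFlow word2vec tutorial is practical (footnote 17).

GloVe is a vector embedding model [32] developed at the Stanford natural language processing group by Richard Socher, Christopher Manning, and others. Its full name is Global Vectors for Word Representation (footnote 18). The code is also available on GitHub (footnote 19).

There are also models called skip-thought [19] and doc2vec.

6 Rinna

According to the article "Rinna's 'brain' finally revealed: Microsoft discloses the natural language processing algorithms of its 'high-school-girl AI'" (footnote 20), Rinna is built from the following components: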

• Learning to Rank

• Word to Vector

• Term Frequency Inverse Document Frequency（TFIDF）

• Neural networks

The following are excerpts from the article:

• Learning to Rank ranks Rinna's candidate replies by their relevance to the user's utterance (the query).

12 https://github.com/RaRe-Technologies/gensim
13 What is NMF? Please visit and read http://www.cis.twcu.ac.jp/~asakawa/nmf/. Thank you in advance. lol
14 https://code.google.com/p/word2vec/
15 https://en.wikipedia.org/wiki/Vector_space_model
16 https://ja.wikipedia.org/wiki/%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E3%83%A2%E3%83%87%E3%83%AB
17 https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html
18 http://nlp.stanford.edu/projects/glove/
19 https://github.com/stanfordnlp/GloVe
20 http://www.itmedia.co.jp/news/articles/1605/27/news110.html

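• The concepts used for this ranking are Word to Vector and TFIDF.

• Word to Vector is a mechanism that computes the similarity between the huge number of words on the Web as vectors, so that the algorithm can weight them.

• TFIDF, on the other hand, is best thought of as two parts: TF (term frequency) and IDF (inverse document frequency).

• The fourth mechanism, the neural network, is a mathematical model that simulates the neurons of the human brain. In Rinna's case it is applied to learning natural language.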
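The split of TFIDF into a term-frequency part and an inverse-document-frequency part can be made concrete with a small sketch. It uses one common variant, tf = count / document length and idf = log(N / df); this choice, the function name `tfidf`, and the toy documents are assumptions for illustration and say nothing about how Rinna itself computes the weights.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["rinna", "likes", "chat"],
        ["chat", "bot", "chat"],
        ["bot", "replies"]]
w = tfidf(docs)
print(w[0])
```

A term that is frequent within one document but rare across the collection receives a high weight, while a term that occurs in every document receives weight log(1) = 0.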

Since りんな = rinna = rInnA, reading the name backwards reveals "AI". Moreover, removing A and I leaves RNN, that is, recurrent neural network.

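7 Recurrent Neural Networks

7.1 Achievements of RNNs

Achievements of recurrent neural networks (including state-of-the-art results):

1. Handwriting recognition [13]

2. Speech recognition [12, 14]

3. Handwriting generation [11]

4. Sequence learning [36]

5. Machine translation [2, 26]

6. Image captioning [18, 38]

7. Syntactic parsing [37]

8. Program code generation [39]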

Try https://www.captionbot.ai/ on your mobile phone.

7.2 Classical RNNs

Classical recurrent neural networks

[Figure: network diagram with input x(t) (one-hot vector, 1-of-k), hidden layer h(t), output layer, and context layer h(t-1); connections labeled U, W, V, and Z=I.]

Figure 11: Jordan network [17]


[Figure: network diagram with input x(t) (one-hot vector, 1-of-k), context layer h(t-1), hidden layer h(t), and output layer; connections labeled U, W, V, and Z=I.]

Figure 12: Elman network [8, 10]

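[Figure: a recurrent network (input, state, output; weights U, W, V) and its unrolled form with inputs t-1, t, t+1, states t-1, t, t+1, and outputs t-1, t, t+1, sharing the weights U, W, V at every step.]

Figure 13: A recurrent neural network unrolled in time

[Figure: inputs x-4, x-3, x-2, x-1, x0 at times t-4, t-3, t-2, t-1, t and the corresponding outputs y-4, y-3, y-2, y-1, y0.]

Figure 14: Long-range dependencies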

Figure 15: The various input/output configurations of recurrent neural networks. From http://karpathy.github.io/2015/05/21/rnn-effectiveness/

• 1 to 1: x_i → y_i, vanilla RNN

• many to 1: x_1, x_2, ..., x_n → y_j, sentiment analysis

• 1 to many: x_1 → y_1, y_2, ..., y_n, image captioning

• many to many: x_i → y_i, x_{i+1} → y_{i+1}, machine translation

• many to many: x_i, x_{i+1}, ..., x_{i+k} → y_{i+d}, y_{i+1+d}, ..., y_{i+d+k}, video classification

• many to many: x_1 → y_1, x_2 → y_2, ...


[Figure: a network unfolded in time, showing Output, State/hidden, and Input layers with the copies State/hidden (t-1), (t-2), (t-3) and Input (t-1), (t-2), connected by Weights U, Weights V, and Weights W. Original caption: "Figure 5: The effect of unfolding a network for BPTT (τ = 3)."]

Figure 16: Boden’s BPTT

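7.3 Motivation

• A statistical language model assigns probabilities to word sequences.

• A good language model gives high probability to meaningful sentences and low probability to ambiguous ones.

• Language modeling is an artificial intelligence problem.

• The Turing test can in principle be regarded as a language modeling problem.

• Given the history of a conversation, a good language model gives high probability to the correct response.

• Examples: P(Monday | What day of the week is it today?) = ? P(red | What color are roses?) = ? Seen as a language modeling problem, these are equivalent to a question such as P(red | The color of roses is) = ?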

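7.4 N-gram Language Models

• How can we build a "good language model"?

• The traditional answer is the N-gram language model: P(w | h) = C(h, w) / C(h)

• N-gram language model: count how many times the word w occurs in the context h, and normalize by all observations of the context h.

• For similar histories h, an N-gram language model requires the history h to match exactly.

• In practice, an N-gram language model represents patterns of N-word sequences.

• The number of parameters of an N-gram language model grows exponentially with the order N.

• The corpus size needed to estimate the parameters of a high-order N-gram language model rapidly becomes insufficient as the order increases.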
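The count-based estimate P(w | h) = C(h, w) / C(h) and its exact-match requirement can be seen in a short sketch. The function names `train_ngram` and `prob` and the toy corpus are assumptions made for this illustration, and no smoothing is applied.

```python
from collections import Counter

def train_ngram(tokens, n):
    """Count histories h (the n-1 preceding words) and events (h, w)."""
    hist, joint = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        h = tuple(tokens[i:i + n - 1])
        w = tokens[i + n - 1]
        hist[h] += 1
        joint[(h, w)] += 1
    return hist, joint

def prob(hist, joint, h, w):
    """P(w | h) = C(h, w) / C(h); 0.0 when the history was never observed."""
    h = tuple(h)
    return joint[(h, w)] / hist[h] if hist[h] else 0.0

tokens = "the rose is red the rose is pretty the sky is blue".split()
hist, joint = train_ngram(tokens, n=3)

print(prob(hist, joint, ["rose", "is"], "red"))   # history seen with "red"
print(prob(hist, joint, ["sky", "is"], "red"))    # history seen, never with "red"
print(prob(hist, joint, ["moon", "is"], "red"))   # history never seen
```

The histories "rose is" and "sky is" are similar, yet nothing learned from one transfers to the other. This is the sparsity problem that projecting histories into a low-dimensional space is meant to address.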

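7.5 Neural Network Language Models

• The sparse history h is projected into a low-dimensional space, where similar histories cluster together.

• Because similar histories are shared, a neural network language model is robust (fewer parameters have to be estimated from the training data).

[Figure: slide "Model Description - Feedforward NNLM", showing the feedforward neural network based LM used by Y. Bengio and H. Schwenk.]

Figure 17: Feedforward neural network language model (NNLM) [3, 35]

• The input layer w and the output layer y have the same dimensionality, which equals the vocabulary size (roughly 10,000 to 200,000 words).

• The hidden layer s is comparatively low-dimensional (50 to 1,000 neurons).

• U is the weight matrix from the input layer to the hidden layer, and V is the weight matrix from the hidden layer to the output layer.

• Without the recurrent weight matrix W, the model is equivalent to a bigram (2-gram) neural network language model.

The outputs of the hidden-layer neurons and of the output-layer neurons are, respectively:

$$ s(t) = f\left( U w(t) + W s(t-1) \right) \tag{18} $$

$$ y(t) = g\left( V s(t) \right) \tag{19} $$

Here f(z) is the sigmoid function and g(z) is the softmax function. As in most recent neural networks, the softmax function is used in the output layer, so that the outputs of all neurons sum to 1 and can be regarded as a probability distribution:

$$ f(z) = \frac{1}{1 + e^{-z}}, \qquad g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}} \tag{20} $$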

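7.6 Training the RNNLM

• Stochastic gradient descent (SGD)

• Iterate over all the training data repeatedly, learning the weight matrices U, V, and W online (sequentially, after each word)

• Run for several epochs (usually 5-10)

The gradient of the output-layer error vector e_o(t) at time t is computed with the cross-entropy error:

$$ e_o(t) = d(t) - y(t) \tag{21} $$

Here d(t) is the target vector representing the word to be output, that is, the input word w(t+1) at time t+1 (Bishop called this the 1-of-K representation in PRML [5], and Bengio calls it a one-hot vector). The hidden-to-output weight matrix V at time t is updated from the hidden-layer vector s(t) and the output-layer vector y(t), through the error e_o(t), as follows:

$$ V(t+1) = V(t) + \alpha\, s(t)\, e_o(t)^{\top} \tag{22} $$

where α is the learning rate. Next, the error gradient vector of the hidden layer is computed from the error gradient vector of the output layer:

$$ e_h(t) = d_h\left( e_o(t)^{\top} V,\; t \right) \tag{23} $$

where the function d_h() is applied to each element of the vector:

$$ d_{hj}(x, t) = x\, s_j(t)\, \left( 1 - s_j(t) \right) \tag{24} $$

The input-to-hidden weight matrix U at time t is updated as follows:

$$ U(t+1) = U(t) + \alpha\, w(t)\, e_h(t)^{\top} \tag{25} $$

The input-layer vector w(t) at time t is zero except for a single neuron. As equation (25) shows, the only weights that change are those attached to the one neuron that corresponds to the input word, because every other input is zero, so the computation can be made fast.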
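One online update of V and U, following the error vector, the outer-product update, and the sigmoid back-propagation described above, can be sketched as follows. This is an illustration under stated assumptions, not the RNNLM toolkit's code: U is stored as hidden x vocabulary and V as vocabulary x hidden, so the outer products are written in that layout; the recurrent matrix W is left untouched (its update belongs to BPTT); and the sizes, learning rate, and names `forward` and `sgd_step` are hypothetical.

```python
import math

def forward(U, W, V, word, s_prev):
    """s(t) = sigmoid(U w(t) + W s(t-1)), y(t) = softmax(V s(t))."""
    hidden = len(U)
    s = []
    for j in range(hidden):
        z = U[j][word] + sum(W[j][i] * s_prev[i] for i in range(hidden))
        s.append(1.0 / (1.0 + math.exp(-z)))
    z_out = [sum(V[k][j] * s[j] for j in range(hidden)) for k in range(len(V))]
    m = max(z_out)
    exps = [math.exp(z - m) for z in z_out]
    total = sum(exps)
    return s, [e / total for e in exps]

def sgd_step(U, W, V, word, target, s_prev, alpha):
    """Update V and U in place; return the cross-entropy loss before the update."""
    s, y = forward(U, W, V, word, s_prev)
    hidden, vocab = len(U), len(V)
    # Output error e_o = d - y, where d is the one-hot target.
    e_o = [(1.0 if k == target else 0.0) - y[k] for k in range(vocab)]
    # Hidden error: propagate e_o through V, then through the sigmoid derivative.
    e_h = [sum(e_o[k] * V[k][j] for k in range(vocab)) * s[j] * (1.0 - s[j])
           for j in range(hidden)]
    # Outer-product update of V.
    for k in range(vocab):
        for j in range(hidden):
            V[k][j] += alpha * e_o[k] * s[j]
    # w(t) is one-hot, so only the column of U for the input word changes.
    for j in range(hidden):
        U[j][word] += alpha * e_h[j]
    return -math.log(y[target])

vocab, hidden = 3, 2
U = [[0.0] * vocab for _ in range(hidden)]
W = [[0.0] * hidden for _ in range(hidden)]
V = [[0.0] * hidden for _ in range(vocab)]
s_prev = [0.0] * hidden

# Repeatedly present the same (input word 0 -> target word 1) pair.
losses = [sgd_step(U, W, V, word=0, target=1, s_prev=s_prev, alpha=0.5)
          for _ in range(20)]
print(losses[0], losses[-1])
```

Only one column of U is touched per step, which is the speed-up that the one-hot input makes possible.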


7.7 Backpropagation Through Time (BPTT)

[Figure: perplexity on the Penn corpus (vertical axis, 105 to 145) as a function of the BPTT step (horizontal axis, 1 to 8), with curves for the average over 4 models, the mixture of 4 models, and the KN5 baseline.]

Figure 18: Fig. 3 of [1]

• The recurrent weight matrix W is unrolled in time, and the network is trained as if it were a multilayer neural network.

• This procedure is called backpropagation through time (BPTT).

[Figure: slide "Training of RNNLM - Backpropagation Through Time". Recurrent neural network unfolded as a deep feedforward network, here for 3 time steps back in time, with inputs w(t-2), w(t-1), w(t), states s(t-3), s(t-2), s(t-1), s(t), output y(t), and weights U, W, V.]

Figure 19: The recurrent neural network is unrolled in time and regarded as a multilayer feedforward neural network. Three time steps are shown.

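Error propagation is computed recursively. The BPTT computation requires the hidden-layer states of the preceding time steps to be kept.

$$ e_h(t - \tau - 1) = d_h\left( e_h(t - \tau)^{\top} W,\; t - \tau - 1 \right) \tag{26} $$

As the time-unrolled figure shows, the gradient vector is computed by differentiating repeatedly (recursively) at each time step. In the process the gradient shrinks rapidly with every time step, which is the vanishing gradient problem. When the recursion over time is deep, this becomes a serious problem for BPTT: training may fail to converge and never finish. The recurrent weight matrix W is updated with

$$ W(t+1) = W(t) + \alpha \sum_{z=0}^{T} s(t - z - 1)\, e_h(t - z)^{\top} \tag{27} $$

The matrix W is not updated every time the error is propagated back one step; it is updated only once. For computational efficiency, training examples are also handled together in batches, which keeps down the complexity of unrolling the network over T time steps.

Figure: An example of a batch update. The red arrows show the error gradient travelling back through the time-unrolled recurrent neural network.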
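The shrinking gradient can be observed in a scalar recurrent network, where each step back in time multiplies the gradient by w · s(t)(1 − s(t)), the scalar analogue of the recursive error equation above. The scalar model, the function name `backprop_factors`, and the parameter values are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factors(w, u, inputs):
    """Run s(t) = sigmoid(u*x(t) + w*s(t-1)) and return |d s(T) / d s(T - tau)|
    for tau = 1, 2, ..., len(inputs)."""
    s, states = 0.0, []
    for x in inputs:
        s = sigmoid(u * x + w * s)
        states.append(s)
    factors, grad = [], 1.0
    for s_t in reversed(states):
        grad *= w * s_t * (1.0 - s_t)   # one step back through the sigmoid
        factors.append(abs(grad))
    return factors

factors = backprop_factors(w=1.0, u=1.0, inputs=[1.0] * 20)
for tau in (1, 5, 10, 20):
    print(tau, factors[tau - 1])
```

Since s(1 − s) is at most 1/4, with |w| = 1 the gradient shrinks by at least a factor of four per step, so contributions from distant time steps quickly become negligible.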

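8 Activation Functions

Logistic function:

$$ \sigma(x) = \frac{1}{1 + \exp(-x)} \tag{28} $$

$$ \frac{d}{dx}\sigma(x) = \sigma(x)\left( 1 - \sigma(x) \right) \tag{29} $$

Hyperbolic tangent:

$$ \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{30} $$

$$ \frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) \tag{31} $$

Rectified linear unit, ReLU (footnote 21):

$$ \mathrm{ReLU}(x) = \max(0, x) \tag{32} $$

$$ \frac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x < 0) \end{cases} \tag{33} $$

Strictly speaking, ReLU is not differentiable everywhere. At the origin x = 0 we consider the subdifferential instead. The gradient cannot be computed at x = 0, but because ReLU is convex, the slope at x = 0 is confined to a range. The values in this range are called subgradients, and for ReLU the range at x = 0 is [0, 1]. In other words, the subdifferential does not fix a single value but a range of slopes.

Softplus:

$$ \mathrm{softplus}(x) = \log\left( 1 + \exp(x) \right) \tag{34} $$

$$ \frac{d}{dx}\log\left( 1 + \exp(x) \right) = \frac{1}{1 + \exp(-x)} \tag{35} $$

Softplus can be regarded as a differentiable approximation of ReLU.

Softmax:

$$ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \tag{36} $$

$$ \frac{\partial}{\partial x_j}\mathrm{softmax}(x_i) = \mathrm{softmax}(x_i)\left( \delta_{ij} - \mathrm{softmax}(x_j) \right) \tag{37} $$

where δ_ij is the Kronecker delta (footnote 22):

$$ \delta_{ij} = \begin{cases} 1 & (i = j) \\ 0 & (i \ne j) \end{cases} \tag{38} $$

Softmax is increasingly the usual choice for multi-class classification. The sample below is Python code that plots softplus, its derivative, and ReLU.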

#!/usr/bin/env python
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    # max(0, x), written with a boolean mask
    return x * (x > 0)

def softplus(x):
    return np.log(1. + np.exp(x))

def dsoftplus(x, delta=1e-05):
    # central-difference approximation of the derivative of softplus
    a = softplus(x+delta)
    b = softplus(x-delta)
    return (a-b)/(2.* delta)

# one row per x: [softplus, derivative of softplus, ReLU]
a = []
for x in np.linspace(-5,5,300):
    a.append([softplus(x), dsoftplus(x), relu(x)])

plt.plot(a)
plt.show()
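The derivatives of the activation functions in this section can also be checked numerically, with each derivative written in terms of the function value itself. The sketch below compares analytic derivatives with central differences using only the standard library. The sample points, the step size, and the helper names are assumptions for illustration.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = softmax(z)_i * (delta_ij - softmax(z)_j)."""
    p = softmax(z)
    n = len(z)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

# (function, analytic derivative) pairs
checks = {
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    "softplus": (softplus, sigmoid),
}
points = (-2.0, -0.5, 0.3, 1.7)
errors = {name: max(abs(numeric_derivative(f, x) - df(x)) for x in points)
          for name, (f, df) in checks.items()}
jac = softmax_jacobian([0.5, -1.0, 2.0])
print(errors)
```

Each row of the softmax Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.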

References

[1] Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan “Honza” Cernocky, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.

21 Written in katakana it sounds like "reru" or "reruu", but for Japanese speakers, who cannot distinguish the sounds of r and l, pronouncing it is an ordeal.

22 https://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%AD%E3%83%8D%E3%83%83%E3%82%AB%E3%83%BC%E3%81%AE%E3%83%87%E3%83%AB%E3%82%BF

[3] Yoshua Bengio, Rejean Ducharme, and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[5] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer–Verlag, New York, NY, 2006.

[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, pages 993–1022, 2003.

[7] Jeffrey L. Elman. Incremental learning, or the importance of starting small. Technical report, University of California, San Diego, San Diego, CA, 1991.

[8] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.

[9] Jeffrey L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–225, 1991.

[10] Jeffrey L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48:71–99, 1993.

[11] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.

[12] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1764–1772, Beijing, China, 2014.

[13] Alex Graves, Marcus Liwicki, Santiago Fernandez, Roman Bertolami, Horst Bunke, and Jurgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, May 2009.

[14] Alex Graves, Abdel Rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Rabab Kreidieh Ward, editor, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, Vancouver, BC, Canada, 2013.

[15] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

[16] Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[17] Michael Irving Jordan. Serial order: A parallel distributed processing approach. Technical report, University of California, San Diego, San Diego, CA, May 1986.

[18] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539v1, Nov. 2014.

[19] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. arXiv:1506.06726, 2015.

[20] Barbara Landau, Linda B. Smith, and Susan Jones. Syntactic context and the shape bias in children’s and adults’ lexical learning. Journal of Memory and Language, 31:807–825, 1992.

[21] Barbara Landau, Linda B. Smith, and Susan S. Jones. The importance of shape in early lexical learning. Cognitive Development, 3:299–321, 1988.

[22] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse Processes, 25:259–284, 1998.

[23] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.

[24] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, pages 788–791, Oct. 1999.

[25] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. pages 556–562, 2001.

[26] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. arXiv:1410.8206, May 2015.

[27] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Scottsdale, Arizona, USA, May 2013.

[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.

[29] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Atlanta, GA, USA, June 2013.

[30] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan “Honza” Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Takao Kobayashi, Keiichi Hirose, and Satoshi Nakamura, editors, Proceedings of INTERSPEECH 2010, pages 1045–1048, Makuhari, Japan, September 2010.

[31] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.

[32] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, Oct. 2014.

[33] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructures of Cognition, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge, MA, 1986.

[34] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[35] Holger Schwenk. Continuous space language models. Computer Speech and Language, 21:492–518, July 2007.

[36] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), volume 27, pages 3104–3112, Montreal, QC, Canada, 2014.

[37] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Yoshua Bengio and Yann LeCun, editors, Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.

[38] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015.

[39] Wojciech Zaremba and Ilya Sutskever. Learning to execute. In Yoshua Bengio and Yann LeCun, editors, Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.

[40] Hirokazu Nishio (西尾泰和). word2vecによる自然言語処理 [Natural Language Processing with word2vec]. O'Reilly Japan, Tokyo, 2014.