pystan for nlp

Toward natural language processing with PyStan

2013/12/22 BUGS/Stan study group #2 @xiangze750

Agenda

● The appeal of Python

● What PyStan can do

● NLP (natural language processing)

– Topic models

– Libraries

– Mixture models

– LDA

– Dirichlet process, Chinese restaurant process

– Hierarchical Dirichlet process

● Neutral theory in ecology

●The appeal of Python

● A rich set of libraries

– Computer vision (PIL, OpenCV)

– Symbolic math (sympy)

– Audio processing (wave, Audiolab) and music analysis (music21)

– Visualization (matplotlib, networkx)

– NLP (natural language processing)

●What PyStan can do

● Data (the variables of the Stan data block) are passed in as arrays

http://pystan.readthedocs.org/en/latest/getting_started.html
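As a rough sketch of what "data as arrays" can look like, the helper below turns tokenized documents into a plain dict of integers and lists. The function name `build_lda_data` and the toy documents are hypothetical. The keys K, V, M, N, w, doc, alpha and beta are assumed to match the data block of the LDA example in the Stan manual, which the slides do not show.

```python
def build_lda_data(docs, num_topics, alpha=0.1, beta=0.01):
    """Turn tokenized documents into a dict of plain arrays for a Stan data block."""
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i + 1 for i, word in enumerate(vocab)}  # Stan indexes from 1
    w, doc_of_word = [], []
    for m, doc in enumerate(docs, start=1):
        for word in doc:
            w.append(word_id[word])   # word id of the n-th token
            doc_of_word.append(m)     # document id of the n-th token
    data = {
        "K": num_topics,              # number of topics
        "V": len(vocab),              # vocabulary size
        "M": len(docs),               # number of documents
        "N": len(w),                  # total number of tokens
        "w": w,
        "doc": doc_of_word,
        "alpha": [alpha] * num_topics,  # assumed symmetric topic prior
        "beta": [beta] * len(vocab),    # assumed symmetric word prior
    }
    return data, vocab


docs = [["stan", "python", "stan"], ["topic", "model", "python"]]
data, vocab = build_lda_data(docs, num_topics=2)
print(data)
# In PyStan 2 the dict would be handed over roughly as
#   fit = pystan.stan(model_code=lda_code, data=data, iter=1000, chains=4)
# where lda_code is a string holding the Stan program (not defined here).
```

Lists are used here to keep the sketch dependency-free; NumPy arrays can be passed the same way.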

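●Working with NLP libraries

● NLTK (Natural Language Toolkit)

– Gives access to a variety of corpora (collections of documents and words)

– N-gram extraction, frequency distributions, and more

● Gensim

– Implements topic models (discussed later)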

●NLP (natural language processing)

● Bag of words

– Information about word order and position is thrown away

http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png
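A minimal illustration of the bag-of-words idea, using only the standard library. The helper name `bag_of_words` and the two sentences are made up for this sketch, and tokenization is assumed to be plain whitespace splitting.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; word order is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("the cat chased the dog")
b = bag_of_words("the dog chased the cat")
print(a)
print(a == b)  # the two sentences differ only in word order
```

The two sentences mean different things but map to the same bag, which is exactly the information a bag-of-words model gives up.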

●NLP (natural language processing)

● Topic model

– Classifies documents into topics

– Treats the topic as a random variable

http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png

●Working with NLP libraries

● A Python implementation of topic models by shuyo

– https://github.com/shuyo/iir/blob/master/lda

– Reads corpora through NLTK

– Converts documents into bag-of-words form

– Also implements a hierarchical Dirichlet model (discussed later)

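●Mixture models

● Multinomial mixture model

– Word choice within each topic is modeled with a multinomial (categorical) distribution

– Multinomial distribution = a "loaded die"

Topic models: an overview http://sugiyama-www.cs.titech.ac.jp/~sugi/2007/Canon-MachineLearning27-jp.pdf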

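●Mixture models

● Polya mixture model

– Uses a Dirichlet distribution as the prior for the topics

– The Dirichlet distribution is the conjugate prior of the categorical and multinomial distributions

Topic models: an overview http://sugiyama-www.cs.titech.ac.jp/~sugi/2007/Canon-MachineLearning27-jp.pdf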

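●Dirichlet distribution

● Conjugate prior of the categorical and multinomial distributions

● Returns values on the simplex

● A capsule-toy machine (gachapon) that dispenses loaded dice

● In Stan:

    vector<lower=0>[V] alpha;  // concentration parameters
    simplex[V] x;              // a point on the simplex

    x ~ dirichlet(alpha);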
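To make the "machine that dispenses loaded dice" picture concrete, here is a small standard-library sketch. It draws a Dirichlet sample by normalizing independent Gamma variates, which is a standard construction rather than anything specific to Stan. The function name `sample_dirichlet` and the choice alpha = 0.5 are assumptions for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
die = sample_dirichlet([0.5] * 6, rng)                # one loaded six-sided die
rolls = rng.choices(range(1, 7), weights=die, k=20)   # roll that die 20 times
print([round(p, 3) for p in die])
print(rolls)
```

Each call to `sample_dirichlet` produces a different die. Small alpha values tend to give very uneven dice, large ones give nearly fair dice.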

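●LDA (latent Dirichlet allocation)

● Each word w_m,n has its own topic z_m,n.

● Each topic z_m,n has its own mixture component (a distribution over words).

[Figure: LDA graphical model. Labels: topic distribution (per document), word distribution (per topic), topic, word]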
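The generative story behind LDA can be sketched in a few lines of standard-library Python. This is only an illustration under assumed settings: the names `dirichlet` and `generate_corpus`, the toy vocabulary, and the symmetric priors alpha = beta = 0.5 are all hypothetical.

```python
import random


def dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha, beta, rng):
    # One word distribution phi[k] per topic.
    phi = [dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One topic distribution theta per document.
        theta = dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]  # topic of this word
            w = rng.choices(vocab, weights=phi[z])[0]             # word drawn from topic z
            words.append((z, w))
        corpus.append(words)
    return corpus


rng = random.Random(1)
vocab = ["stan", "python", "topic", "word", "model"]
corpus = generate_corpus(3, 8, vocab, num_topics=2, alpha=0.5, beta=0.5, rng=rng)
for doc in corpus:
    print(doc)
```

Inference runs this story backwards: only the words are observed, and theta, phi and the per-word topics z have to be recovered.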

●LDA (latent Dirichlet allocation)

● Stan code (Stan manual, p. 128)

● A categorical distribution over the latent variable z cannot be used directly, so z is summed out of the likelihood.

(http://xiangze.hatenablog.com/entry/2013/12/19/013557)

    // The data block (K, V, M, N, w, doc, alpha, beta) is omitted on the slide.
    parameters {
      simplex[K] theta[M];  // topic dist for doc m
      simplex[V] phi[K];    // word dist for topic k
    }
    model {
      for (m in 1:M)
        theta[m] ~ dirichlet(alpha);  // prior
      for (k in 1:K)
        phi[k] ~ dirichlet(beta);     // prior

      for (n in 1:N) {
        real gamma[K];
        for (k in 1:K)
          gamma[k] <- log(theta[doc[n],k]) + log(phi[k,w[n]]);
        increment_log_prob(log_sum_exp(gamma));  // marginalizes over the topic of word n
      }
    }
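The `log_sum_exp(gamma)` term is the log of the word's probability with its topic summed out, log Σ_k theta[k] * phi[k][w]. The sketch below checks this numerically in plain Python. The helper names and the toy values of theta and phi are assumptions, and indices are 0-based here whereas Stan counts from 1.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def word_log_likelihood(theta_m, phi, word):
    """log p(word | theta_m, phi) with the latent topic marginalized out."""
    gamma = [math.log(theta_m[k]) + math.log(phi[k][word])
             for k in range(len(theta_m))]
    return log_sum_exp(gamma)


theta_m = [0.7, 0.3]                        # topic distribution of one document
phi = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]    # word distribution of each topic
ll = word_log_likelihood(theta_m, phi, word=2)
direct = math.log(0.7 * 0.1 + 0.3 * 0.7)    # the same sum written out by hand
print(ll, direct)
```

Working in log space avoids underflow when the per-topic probabilities are tiny, which is why the Stan program accumulates `log_sum_exp(gamma)` instead of the raw sum.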

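●Dirichlet process

● Goal: make the number of topics variable (nonparametric)

– A Dirichlet distribution over infinitely many variables

– A probability distribution over probability distributions (Dirichlet distributions)

– Exchanging the variables leaves the distribution unchanged (cf. de Finetti's theorem)

[Figure: the space Θ divided into regions A_0, ..., A_n with areas G(A_0), G(A_1), ..., G(A_n)]

● If, for every finite partition $A_0, A_1, \dots, A_n$ of $\Theta$,

$$(G(A_0), G(A_1), \dots, G(A_n)) \sim \mathrm{Dir}(\alpha H(A_0), \alpha H(A_1), \dots, \alpha H(A_n))$$

then G is a Dirichlet process with base distribution H, where $\alpha > 0$ is the concentration parameter.

Dirichlet Processes (Teh 2010) http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/Teh2010a.pdf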

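●Chinese restaurant process

● Goal: represent infinitely many variables with a finite procedure

– The observed variables are finite

– Random variables are drawn one at a time (the result is invariant under reordering of the variables)

– A customer tends to go to a table that already seats many people (words)

● Customer: word / Dish: topic / Table: the correspondence between them

Dirichlet Processes (Teh 2010) http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/Teh2010a.pdf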

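●Chinese restaurant process

● The (n+1)-th customer

– sits at a new table with probability $\alpha / (n + \alpha)$

– sits at existing table k with probability $n_k / (n + \alpha)$, where $n_k$ is the number of customers already seated there, so crowded tables are more likely to be chosen

Hierarchical Dirichlet Processes (Teh 2006) http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/jasa2006.pdf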
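The seating rule can be simulated directly. In this sketch the (n+1)-th customer picks an existing table with weight equal to its head count and a new table with weight alpha, which gives the usual CRP probabilities after normalizing by n + alpha. The function name and alpha = 1.0 are assumptions chosen for illustration.

```python
import random


def chinese_restaurant_process(num_customers, alpha, rng):
    """Seat customers one by one; returns table sizes and each customer's table."""
    tables = []    # tables[k] = number of customers at table k
    seating = []   # seating[i] = table index of customer i
    for _ in range(num_customers):
        # Existing tables are weighted by occupancy, a new table by alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # join an existing table
        seating.append(k)
    return tables, seating


rng = random.Random(42)
tables, seating = chinese_restaurant_process(100, alpha=1.0, rng=rng)
print(len(tables), "tables with sizes", tables)
```

The number of tables is not fixed in advance. It grows slowly with the number of customers, which is what makes the process a natural prior when the number of topics is unknown.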

●Hierarchical Dirichlet process

● A Dirichlet process on top of a Dirichlet process: the base distribution is itself drawn from a Dirichlet process


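●CRP and the hierarchical Dirichlet process

● Chinese restaurant franchise

– Every restaurant in the franchise serves the same dishes

● Customer: word / Dish: topic / Table: the correspondence between them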


●CRP and the hierarchical Dirichlet process

● Chinese restaurant franchise

– Variables for the word distribution

– Variables for the topic distribution


●Chinese restaurant franchise

● Difficulty of implementing it in Stan

– Assignments to variables cannot be written inside the model block

“Infinite LDA” – Implementing the HDP with minimum code complexity http://www.arbylon.net/publications/ilda.pdf

●Chinese restaurant franchise

● Implementation in JAGS

– Work in progress (the code below is an unfinished sketch and does not run as written)

    model {
      x[1] ~ dnorm(0.0, 1.0E-4)
      k <- 1
      for (j in 1:M) {
        for (i in 2:N) {
          q ~ dunif(0,1)
          totm <- sum(m)
          if (q > gamma/(totm+gamma)) {
            ind ~ dmulti(m/sum(m), 1)
            th[i] <- th[ind]
            m[ind] <- m[ind]+1
          } else {
            th[i] ~ dunif(0,1)
            k <- k+1
            m[k] <- 1
          }
        }
      }
      for (j in 1:M) {
        for (i in 2:N) {
          q ~ dunif(0,1)
          if (q > alpha/(i-1+alpha)) {
            ind <- dmulti(n/sum(n), 1)
            phi[i] ~ th[j][ind]
            nj[i] <- nj[i]+1
          } else {
            phi[i] ~ th[docid[i]*N+]
            kn <- kn + 1
            n[kn] <- 1
          }
        }
      }
    }

●The underlying motivation

● "There is no general-purpose 'package' for nonparametric Bayes" (Nonparametric Bayes for Non-Bayesians)

● Stochastic processes on a variety of data structures

– Infinite Stochastic Tree

– Mondrian Process

“Nonparametric Bayes for Non-Bayesians” http://www.ism.ac.jp/~daichi/paper/ibis2008-npbayes-tutorial.pdf

[Figure: Mondrian Process, the thing I had wanted to implement]

●Stick-breaking process

● An alternative representation of the (hierarchical) Dirichlet process

[Figure: a stick broken into pieces π0, π1, π2, π3, ...]

● A truncated stick-breaking process...?
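A truncated stick-breaking construction can be sketched as follows. Each break takes a Beta(1, alpha) fraction of what is left of the stick, and the final piece receives the remainder so that the weights sum to one. The function name, the truncation level of 10 and alpha = 2.0 are assumptions for illustration.

```python
import random


def truncated_stick_breaking(alpha, truncation, rng):
    """Return `truncation` mixture weights pi_0, pi_1, ... that sum to 1."""
    weights = []
    remaining = 1.0                              # length of stick still unbroken
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, alpha)   # share of the remainder to break off
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)                    # last piece takes what is left
    return weights


rng = random.Random(7)
pi = truncated_stick_breaking(alpha=2.0, truncation=10, rng=rng)
print([round(p, 3) for p in pi])
```

Because the truncated weights form an ordinary finite simplex, this is the kind of approximation that can in principle be written with a fixed number of parameters, unlike the full infinite process.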

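●Aside: neutral theory in ecology

● Neutrality

– The abundance distribution of species that share the same ecological niche follows a fixed functional form

● Ewens distribution

– The distribution of the abundances of each species within a limited niche

– A special case of the Chinese restaurant process

[Photo: Prof. Stephen P. Hubbell]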

●Aside: neutral theory in ecology

● Ewens distribution

A unified theory of biogeography and relative species abundance and its application to tropical rain forests and coral reefs http://www3.botany.ubc.ca/vellend/COM_ECOL/Hubbell_CoralReefs97.pdf
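For reference, a short sketch of the standard Ewens sampling formula, assumed here to be the distribution this slide refers to. With a_j the number of species represented by exactly j individuals and n the total number of individuals, the probability is n! / (θ(θ+1)...(θ+n-1)) × Π_j θ^{a_j} / (j^{a_j} a_j!). The function name and θ = 2.0 are illustrative choices.

```python
from math import factorial


def ewens_probability(counts, theta):
    """Ewens sampling formula; counts[j-1] = number of species with exactly j individuals."""
    n = sum(j * a for j, a in enumerate(counts, start=1))
    rising = 1.0
    for i in range(n):
        rising *= theta + i                  # theta (theta+1) ... (theta+n-1)
    prob = factorial(n) / rising
    for j, a in enumerate(counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob


# All ways to split n = 3 individuals into species:
# three singletons, one singleton plus one pair, or a single species of three.
partitions = [[3, 0, 0], [1, 1, 0], [0, 0, 1]]
probs = [ewens_probability(p, theta=2.0) for p in partitions]
print(probs, sum(probs))
```

The same partition probabilities arise from the table sizes of a Chinese restaurant process with concentration θ, which is the link to the earlier slides.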

●Aside: neutral theory in ecology

● The untb package for R

– http://cran.r-project.org/web/packages/untb

    # example
    demo(untb)
    # rank-abundance distribution of the Saunders dataset and the estimated θ

●Summary

● With PyStan, Stan's LDA model is relatively easy to use.

● Under the constraints of Stan 2.0, nonparametric LDA cannot be implemented. It might be possible in JAGS.

● Ecology is amazing.

Reference

● Python implementations of LDA and HDP-LDA by shuyo (using NLTK)

– https://github.com/shuyo/iir/blob/master/lda

● An introduction to the introduction to nonparametric Bayes

– http://www.slideshare.net/shuyo/ss-15098006

● Mi manca qualche giovedi`? Implementing the hierarchical Dirichlet process (1): comparing the HDP-LDA and LDA models

– http://d.hatena.ne.jp/n_shuyo/20110608/hdplda

● Topic models: an overview

– http://sugiyama-www.cs.titech.ac.jp/~sugi/2007/Canon-MachineLearning27-jp.pdf

● Introduction to Nonparametric Bayesian Models (Ueda and Yamada 2007)

– http://www.kecl.ntt.co.jp/as/members/yamada/dpm_ueda_yamada2007.pdf

● On applying variational inference to Dirichlet process mixture models

– http://breakbee.hatenablog.jp/entry/2013/11/30/222553

● Hierarchical Dirichlet Processes (Teh 2006)

– http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/jasa2006.pdf

● Dirichlet Process (Teh 2010)

– http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/Teh2010a.pdf

● “Infinite LDA” – Implementing the HDP with minimum code complexity

– http://www.arbylon.net/publications/ilda.pdf

● A unified theory of biogeography and relative species abundance and its application to tropical rain forests and coral reefs

– http://www3.botany.ubc.ca/vellend/COM_ECOL/Hubbell_CoralReefs97.pdf
