
Page 1:

A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation

FRANK WOOD, YEE WHYE TEH

GATSBY COMPUTATIONAL NEUROSCIENCE UNIT
supported by the Gatsby Charitable Foundation

Page 2:

Statistical Natural Language Modelling (SLM)

• Learning distributions over sequences of discrete observations (words)

• Useful for
  – Automatic Speech Recognition (ASR): $P(\text{text} \mid \text{speech}) \propto P(\text{speech} \mid \text{text})\,P(\text{text})$
  – Machine Translation (MT): $P(\text{english} \mid \text{french}) \propto P(\text{french} \mid \text{english})\,P(\text{english})$

$w_{1:N} = [w_1\, w_2\, \ldots\, w_N], \qquad P(w_{1:N})$

Page 3:

Markov “n–gram” models

$P(w_{1:N}) \stackrel{\text{def}}{=} \prod_{n=1}^{N} P(w_n \mid w_{n-2:n-1}) = \prod_{n=1}^{N} G_{\{w_{n-2:n-1}\}}(w_n)$  (tri-gram)

e.g. mice | blind, three; league | premier, Barclay’s

Each $G_{\{w_{n-2:n-1}\}}$ is a discrete conditional distribution over the words that follow the given context.
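To make the tri-gram definition concrete, here is a minimal maximum-likelihood tri-gram estimator in Python (an illustrative sketch added to this transcript; class and method names are invented, and no smoothing is applied yet):

```python
from collections import defaultdict

class TrigramLM:
    """Maximum-likelihood tri-gram model: count each word following a
    two-word context, then normalize per context."""

    def __init__(self):
        self.context_counts = defaultdict(int)   # #{w_{n-2}, w_{n-1}}
        self.trigram_counts = defaultdict(int)   # #{w_{n-2}, w_{n-1}, w_n}

    def fit(self, words):
        for i in range(2, len(words)):
            ctx = (words[i - 2], words[i - 1])
            self.context_counts[ctx] += 1
            self.trigram_counts[ctx + (words[i],)] += 1

    def prob(self, word, ctx):
        """G_{ctx}(word) estimated by counting; zero for unseen events."""
        total = self.context_counts[ctx]
        return self.trigram_counts[ctx + (word,)] / total if total else 0.0

lm = TrigramLM()
lm.fit("three blind mice see how they run".split())
print(lm.prob("mice", ("three", "blind")))   # 1.0 in this toy corpus
```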

Page 4:

Domain Adaptation

[Figure: estimation yields a model from big-domain data and a model from small-domain data; adaptation carries information from the big-domain model over to the small-domain model.]

Page 5:

Hierarchical Bayesian Approach

[Graphical model: a corpus-independent model $\theta_{\text{latent}}$ sits above corpus-specific models $\theta_{\text{big}}$ and $\theta_{\text{small}}$; these generate the big and small corpora $Y$ and the test data $T$.]

Page 6:

Bayesian Domain Adaptation

• Estimation goal: $P(\theta \mid Y) \propto P(Y \mid \theta)\, P(\theta)$
• Inference objective: $P(T \mid Y) = \int P(T \mid \theta)\, P(\theta \mid Y)\, d\theta$
• Simultaneous estimation = automatic domain adaptation

[Graphical model: $\theta_{\text{latent}}$ sits above $\theta_{\text{big}}$ and $\theta_{\text{small}}$; the big and small corpora are the training data $Y$, and $T$ is the test data.]

Page 7:

Model Structure: The Likelihood

• One domain-specific n-gram (Markov) model for each domain, $D \in \{\text{big}, \text{small}\}$:

$P(D \mid \theta_D) \stackrel{\text{def}}{=} \prod_{n=1}^{N} G^{D}_{\{w^{D}_{n-2:n-1}\}}(w^{D}_n \mid \theta_D)$

Here $D$ indexes the corpus, $\{w^{D}_{n-2:n-1}\}$ is the context, $w^{D}_n$ is the word following the context, $\theta_D$ are the corpus-specific parameters, and $G^{D}_{\{\cdot\}}$ is a distribution over the words that follow the given context.

Page 8:

Model Structure: The Prior

• Doubly hierarchical Pitman-Yor process language model (DHPYLM)

Prior (parameters $\theta$):

$G^{D}_{\{\}} \sim \mathrm{PY}\big(d^{D}_0, \alpha^{D}_0,\; \lambda^{D}_0 U + (1-\lambda^{D}_0)\, G^{H}_{\{\}}\big)$

$G^{D}_{\{w_{t-1}\}} \sim \mathrm{PY}\big(d^{D}_1, \alpha^{D}_1,\; \lambda^{D}_1 G^{D}_{\{\}} + (1-\lambda^{D}_1)\, G^{H}_{\{w_{t-1}\}}\big)$

$\vdots$

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d^{D}_j, \alpha^{D}_j,\; \lambda^{D}_j G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda^{D}_j)\, G^{H}_{\{w_{t-j:t-1}\}}\big)$

Likelihood:

$x \mid w_{t-n+1:t-1} \sim G^{D}_{\{w_{t-n+1:t-1}\}}$

• A novel generalization of back-off: every line is conditioned on everything to its right

Page 9:

Pitman-Yor Process

• A distribution over distributions
• The base distribution is its "mean": $E[G(dx)] = G_0(dx)$
• A generalization of the Dirichlet process ($d = 0$)
  – Different (power-law) "clustering" properties
  – Better for text [Teh, 2006]

$G \sim \mathrm{PY}(d, \alpha, G_0), \qquad x \sim G$

with discount $d$, concentration $\alpha$, and base distribution $G_0$.

Page 10:

DHPYLM

$G^{D}_{\{\}} \sim \mathrm{PY}\big(d^{D}_0, \alpha^{D}_0,\; \lambda^{D}_0 U + (1-\lambda^{D}_0)\, G^{H}_{\{\}}\big)$

$G^{D}_{\{w_{t-1}\}} \sim \mathrm{PY}\big(d^{D}_1, \alpha^{D}_1,\; \lambda^{D}_1 G^{D}_{\{\}} + (1-\lambda^{D}_1)\, G^{H}_{\{w_{t-1}\}}\big)$

$\vdots$

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d^{D}_j, \alpha^{D}_j,\; \lambda^{D}_j G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda^{D}_j)\, G^{H}_{\{w_{t-j:t-1}\}}\big)$

$x \mid w_{t-n+1:t-1} \sim G^{D}_{\{w_{t-n+1:t-1}\}}$

• Observations are drawn from the distribution over words following the given context
• Each base distribution mixes the domain-specific distribution over words following a shorter (by one word) context with the non-specific distribution over words following the full context
• Focus on a "leaf"
• Imagine the recursion

Page 11:

Tree-Shaped Graphical Model

[Figure: tree of distributions spanning the latent LM, the small LM, and the big LM, e.g. $G^{\text{latent}}_{\{\text{the United}\}}$, $G^{\text{small}}_{\{\}}$, $G^{\text{small}}_{\{\text{United}\}}$, and $G^{\text{small}}_{\{\text{the United}\}}$, with word sets such as {Parcel, etc.} and {States, Kingdom, etc.} at the leaves.]

Page 12:

Intuition

• Domain = "The Times of London"
• Context = "the United"
• The distribution over subsequent words will look like:

$G_{\{\text{the United}\}}(\text{aa}) = .0000001,\; G_{\{\text{the United}\}}(\text{aah}) = .0000001,\; \ldots,\; G_{\{\text{the United}\}}(\text{States}) = .05,\; \ldots,\; G_{\{\text{the United}\}}(\text{Kingdom}) = .25,\; \ldots,\; G_{\{\text{the United}\}}(\text{zyzzyva}) = .0000001$

Page 13:

Problematic Example

• Domain = "The Times of London"
• Context = "quarterback Joe"
• The distribution over subsequent words should look like:

$G_{\{\text{quarterback Joe}\}}(\text{aa}) = .0000001,\; \ldots,\; G_{\{\text{quarterback Joe}\}}(\text{Namath}) = .15,\; \ldots,\; G_{\{\text{quarterback Joe}\}}(\text{Montana}) = .25,\; \ldots,\; G_{\{\text{quarterback Joe}\}}(\text{zyzzyva}) = .0000001$

• But where could this come from?

Page 14:

The Model Estimation Problem

• Domain = "The Times of London"
• Context = "quarterback Joe"
• No counts for American football phrases in UK publications! (hypothetical, but nearly true)

$G_{\{\text{quarterback Joe}\}}(\text{Montana}) \approx \frac{\#\{\text{quarterback Joe Montana}\}}{\#\{\text{quarterback Joe}\}} = 0$

• Not how we do estimation!

Page 15:

A Solution: In-Domain Back-Off

• Regularization ↔ smoothing ↔ backing-off
• In-domain back-off
  – [Kneser & Ney, Chen & Goodman, MacKay & Peto, Teh], etc.
  – Use counts from shorter contexts, roughly:

$G_{\{\text{quarterback Joe}\}}(\text{Montana}) \approx \pi_3 \frac{\#\{\text{quarterback Joe Montana}\}}{\#\{\text{quarterback Joe}\}} + \pi_2 \frac{\#\{\text{Joe Montana}\}}{\#\{\text{Joe}\}} + \pi_1 \frac{\#\{\text{Montana}\}}{N}$

$\approx \pi_3 \frac{\#\{\text{quarterback Joe Montana}\}}{\#\{\text{quarterback Joe}\}} + \pi_2\, G_{\{\text{Joe}\}}(\text{Montana})$  (recursive!)

• Due to zero counts, Times of London training data alone will still yield a skewed distribution

Page 16:

Our Approach: Generalized Back-Off

• Use inter-domain shared counts too, roughly:

$G^{TL}_{\{\text{quarterback Joe}\}}(\text{Montana}) \approx \pi_3 \frac{\#\{\text{quarterback Joe Montana}\}}{\#\{\text{quarterback Joe}\}} + \pi_2 \big(\lambda\, G^{TL}_{\{\text{Joe}\}}(\text{Montana}) + (1-\lambda)\, G_{\{\text{quarterback Joe}\}}(\text{Montana})\big)$

TL: Times of London. The left-hand side is the probability that "Montana" follows "quarterback Joe"; the first term uses in-domain counts; $G^{TL}_{\{\text{Joe}\}}$ is the in-domain back-off (shorter context); $G_{\{\text{quarterback Joe}\}}$ is the out-of-domain back-off (same context, no domain specificity).
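A rough procedural rendering of this generalized back-off (an illustrative sketch: the weights $\pi_3$, $\pi_2$, and $\lambda$ are fixed constants here, whereas in the DHPYLM they are effectively determined by posterior inference):

```python
def generalized_backoff(word, ctx, domain_counts, in_domain_model,
                        shared_model, pi3=0.6, pi2=0.4, lam=0.5):
    """Blend in-domain counts with an in-domain shorter-context back-off
    and an out-of-domain same-context back-off, as in the equation above."""
    ctx_counts = domain_counts.get(ctx, {})
    total = sum(ctx_counts.values())
    mle = ctx_counts.get(word, 0) / total if total else 0.0
    in_dom = in_domain_model(word, ctx[1:])   # shorter context, same domain
    out_dom = shared_model(word, ctx)         # same context, no domain-specificity
    return pi3 * mle + pi2 * (lam * in_dom + (1 - lam) * out_dom)

# Times of London has no "quarterback Joe" counts, but the shared model does:
counts_tl = {("quarterback", "Joe"): {}}
in_dom = lambda w, c: 1e-7                       # in-domain back-off stub
shared = lambda w, c: 0.25 if w == "Montana" else 1e-7
print(generalized_backoff("Montana", ("quarterback", "Joe"),
                          counts_tl, in_dom, shared))   # ~0.05, not 0
```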

Page 17:

DHPYLM Generalized Back-Off

• Desired intuitive form:

$G^{TL}_{\{\text{quarterback Joe}\}}(\text{Montana}) \approx \pi_3 \frac{\#\{\text{quarterback Joe Montana}\}}{\#\{\text{quarterback Joe}\}} + \pi_2 \big(\lambda\, G^{TL}_{\{\text{Joe}\}}(\text{Montana}) + (1-\lambda)\, G_{\{\text{quarterback Joe}\}}(\text{Montana})\big)$

• Generic Pitman-Yor single-sample posterior predictive distribution:

$P(x_{n+1} \mid x_{1:n}, \alpha, d) = \sum_{k=1}^{K} \frac{c_k - d}{\alpha + N}\, \delta(\phi_k - x_{n+1}) + \frac{\alpha + dK}{\alpha + N}\, G_0(x_{n+1})$

• Leaf layer of the DHPYLM:

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d^{D}_j, \alpha^{D}_j,\; \lambda^{D}_j G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda^{D}_j)\, G^{H}_{\{w_{t-j:t-1}\}}\big), \qquad x \mid w_{t-n+1:t-1} \sim G^{D}_{\{w_{t-n+1:t-1}\}}$

Page 18:

DHPYLM – Latent Language Model

$G^{H}_{\{\}} \sim \mathrm{PY}(d^{H}_0, \alpha^{H}_0, U)$

$G^{H}_{\{w_{t-1}\}} \sim \mathrm{PY}(d^{H}_1, \alpha^{H}_1, G^{H}_{\{\}})$

$\vdots$

$G^{H}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}(d^{H}_j, \alpha^{H}_j, G^{H}_{\{w_{t-j+1:t-1}\}})$

• Generates the latent language model
• … no associated observations

Page 19:

DHPYLM – Domain Specific Model

$G^{D}_{\{\}} \sim \mathrm{PY}\big(d^{D}_0, \alpha^{D}_0,\; \lambda^{D}_0 U + (1-\lambda^{D}_0)\, G^{H}_{\{\}}\big)$

$G^{D}_{\{w_{t-1}\}} \sim \mathrm{PY}\big(d^{D}_1, \alpha^{D}_1,\; \lambda^{D}_1 G^{D}_{\{\}} + (1-\lambda^{D}_1)\, G^{H}_{\{w_{t-1}\}}\big)$

$\vdots$

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d^{D}_j, \alpha^{D}_j,\; \lambda^{D}_j G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda^{D}_j)\, G^{H}_{\{w_{t-j:t-1}\}}\big)$

$x \mid w_{t-n+1:t-1} \sim G^{D}_{\{w_{t-n+1:t-1}\}}$

• Generates the domain-specific language model

Page 20:

A Generalization

The "Graphical Pitman-Yor Process":

$\Lambda_v \sim S_v, \qquad G_v \mid \{G_w : w \in \mathrm{Pa}(v)\} \sim \mathrm{PY}\Big(d_v,\; \alpha_v,\; \sum_{w \in \mathrm{Pa}(v)} \lambda_{w \to v}\, G_w\Big)$

• Inference in the graphical Pitman-Yor process is accomplished via a collapsed Gibbs auxiliary variable sampler

Page 21:

Graphical Pitman-Yor Process Inference

• Auxiliary variable collapsed Gibbs sampler
  – "Chinese restaurant franchise" representation
  – Must keep track of which parent restaurant each table comes from
  – The "multi-floor Chinese restaurant franchise"
• Every table has two labels
  – $\phi_k$: the parameter (here a word "type")
  – $s_k$: the parent restaurant from which it came

Page 22:

SOU / Brown

• Training corpora
  – Small: State of the Union (SOU), 1945-2006; ~370,000 words, 13,000 unique
  – Big: Brown, 1967; ~1,000,000 words, 50,000 unique
• Test corpus: Johnson's SOU Addresses, 1963-1969; ~37,000 words

Page 23:

SLM Domain Adaptation Approaches

• Mixture [Kneser & Steinbiss, 1993]: $\lambda\,(\text{small-domain model}) + (1-\lambda)\,(\text{big-domain model})$
• Union [Bellegarda, 2004]: estimate one model from the pooled small + big data
• MAP [Bacchiani, 2006]

[Figure: schematic of mixture (weighted combination of small and big models) and union (model estimated from the combined data).]

Page 24:

AMI / Brown

• Training corpora
  – Small: Augmented Multi-Party Interaction (AMI), 2007; ~800,000 words, 8,000 unique
  – Big: Brown, 1967; ~1,000,000 words, 50,000 unique
• Test corpus: AMI excerpt, 2007; ~60,000 words

Page 25:

Cost / Benefit Analysis

• "Realistic" scenario
  – Computational cost ($) of using more "big" corpus data (Brown corpus)
  – Data preparation cost ($$$) of gathering more "small" corpus data (SOU corpus)
  – Perplexity goal
• Same test data


Page 27:

Take Home

• Showed how to do SLM domain adaptation through hierarchical Bayesian language modelling
• Introduced a new type of model: the "Graphical Pitman-Yor Process"

Page 28:

Thank You

Page 29:

Select References

• [1] Bellegarda, J. R. (2004). Statistical language model adaptation: review and perspectives. Speech Communication, 42, 93-108.
• [2] Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation Journal, 41, 181-190.
• [3] Daumé III, H., & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 101-126.
• [4] Goldwater, S., Griffiths, T. L., & Johnson, M. (2007). Interpolating between types and tokens by estimating power law generators. NIPS 19 (pp. 459-466).
• [5] Iyer, R., Ostendorf, M., & Gish, H. (1997). Using out-of-domain data to improve in-domain language models. IEEE Signal Processing Letters, 4, 221-223.
• [6] Kneser, R., & Steinbiss, V. (1993). On the dynamic adaptation of stochastic language models. IEEE Conference on Acoustics, Speech, and Signal Processing (pp. 586-589).
• [7] Kucera, H., & Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press.
• [8] Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE (pp. 1270-1278).
• [9] Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. Proceedings of the 44th ACL (pp. 985-992).
• [10] Zhu, X., & Rosenfeld, R. (2001). Improving trigram language modeling with the World Wide Web. IEEE Conference on Acoustics, Speech, and Signal Processing (pp. 533-536).

Page 30:

HHPYLM Inference

• Every table at level n corresponds to a customer in a restaurant at level n-1
• All customers in the entire hierarchy are unseated and re-seated in each sampler sweep

$P(z = k \mid \text{everything else}) \propto (c_k - d)\, I(x = \phi_k)$

$P(z = K+1 \mid \text{everything else}) \propto (\alpha + dK)\, G_{\{\}}(x)$

[Figure: seating arrangements in the "here I" restaurants of the latent (H) and domain-specific (S) models, with customers "am" and "was".]

Page 31:

HHPYPLM Intuition

• Each G is still a weighted set of (repeated) atoms

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d_j, \alpha_j,\; \lambda\, G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda)\, G^{H}_{\{w_{t-j:t-1}\}}\big)$

$\Rightarrow\; G^{D}_{\{w_{t-j:t-1}\}} = \sum_{j \in \{D,H\}} \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k, s_j}, \qquad \phi_k \sim G_{\{w_{t-j+1:t-1}\}}, \qquad s_j \sim \{\lambda, 1-\lambda\}$

[Figure: a restaurant in domain D for a given {context}, with atoms $\phi_1, \phi_2, \ldots$ and floor indicators $s_1, s_2, \ldots$]

Page 32:

Sampling the "floor" indicators

• Switch variables drawn from a discrete distribution (mixture weights): $s_k \sim \{\lambda, 1-\lambda\}$
• Integrate out the $\lambda$'s: $\{\lambda, 1-\lambda\} \sim \mathrm{PY}(d, \alpha, U_2)$
• Sample this in the Chinese restaurant representation as well

Page 33:

Why PY vs DP?

• PY power-law characteristics match the statistics of natural language well
  – The number of unique words in a set of n words follows a power law [Teh 2006]

[Figure: number of unique words (types) vs. number of "words" drawn from a Pitman-Yor process with $d = .5$ and $\alpha = 100, 10, 1$; and types vs. tokens (up to $2.5 \times 10^7$ tokens, $8 \times 10^4$ types) for the European Parliamentary Proceedings corpus.]
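This power-law behaviour is easy to check by simulation: draw tokens sequentially from a Pitman-Yor urn (the predictive rule reconstructed on the "Posterior PY" slide below) and count the unique types. A sketch with illustrative parameter values:

```python
import random

def py_unique_types(n_tokens, d, alpha, seed=0):
    """Count unique 'words' (types) among n_tokens sequential draws from
    PY(d, alpha, .): a new type appears w.p. (alpha + d*K)/(alpha + n)."""
    rng = random.Random(seed)
    counts = []                  # counts[k] = tokens of type k so far
    n = K = 0
    for _ in range(n_tokens):
        if rng.random() < (alpha + d * K) / (alpha + n):
            counts.append(1)     # new type, drawn from the base distribution
            K += 1
        else:                    # existing type k w.p. proportional to c_k - d
            r = rng.random() * (n - d * K)
            acc = 0.0
            for k, c in enumerate(counts):
                acc += c - d
                if r < acc:
                    counts[k] += 1
                    break
        n += 1
    return K

for alpha in (100, 10, 1):
    print(alpha, py_unique_types(20_000, d=0.5, alpha=alpha))
```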

Page 34:

What does a PY draw look like?

• A draw from a PY process is a discrete distribution with infinite support [Sethuraman, 94]

$G \sim \mathrm{PY}(d, \alpha, H) \;\Rightarrow\; G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k}, \qquad \sum_{k=1}^{\infty} \pi_k = 1$

aa (.0001), aah (.00001), aal (.000001), aardvark (.001), …

Page 35:

How does one work with such a model?

• Truncation in the stick-breaking representation
• Or in a representation where G is analytically marginalized out, roughly:

$P(x_{n+1} \mid x_{1:n}, \alpha, d) = \int P(x_{n+1} \mid G)\, P(G \mid x_{1:n}, \alpha, d)\, dG$

Page 36:

Posterior PY

• This marginalization is possible and results in the Pitman-Yor Polya urn representation [Pitman, 02]:

$P(x_{n+1} \mid x_{1:n}, \alpha, d) = \sum_{k=1}^{K} \frac{c_k - d}{\alpha + n}\, \delta(\phi_k - x_{n+1}) + \frac{\alpha + dK}{\alpha + n}\, G_0(x_{n+1})$

• By invoking exchangeability one can easily construct Gibbs samplers using such representations
  – A generalization of the hierarchical Dirichlet process sampler of Teh [2006]
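The predictive distribution is mechanical to evaluate. A sketch, under the simplifying assumption that each distinct observed value occupies exactly one table (the full Chinese restaurant representation instead tracks seating arrangements explicitly):

```python
from collections import Counter

def py_predictive(x, observed, alpha, d, base_prob):
    """Pitman-Yor Polya-urn predictive P(x_{n+1} = x | x_{1:n}, alpha, d),
    with one table per distinct observed atom phi_k."""
    counts = Counter(observed)            # c_k for each distinct atom
    n, K = len(observed), len(counts)
    p = (alpha + d * K) / (alpha + n) * base_prob(x)
    if x in counts:
        p += (counts[x] - d) / (alpha + n)
    return p

uniform = lambda w: 1.0 / 50_000          # toy base distribution G_0
print(py_predictive("the", ["the", "the", "of"], 1.0, 0.5, uniform))
```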

Page 37:

Doubly-Hierarchical Pitman-Yor Process Language Model

• Finite, fixed vocabulary
• Two domains
  – One big ("general")
  – One small ("specific")
• "Backs off" to
  – the out-of-domain model with the same context,
  – or the in-domain model with one fewer word of context

Page 38:

Bayesian SLM Domain Adaptation

[Figure: a hyper SLM sits above the general SLM and the domain-specific SLM.]


Page 42:

Language Modelling

• Maximum likelihood
  – Smoothed empirical counts
• Hierarchical Bayes
  – Finite: MacKay & Peto, "Hierarchical Dirichlet Language Model," 1994
  – Nonparametric: Teh, "A hierarchical Bayesian language model based on Pitman-Yor processes," 2006
    • Comparable to the best smoothing n-gram models

Page 43:

Questions and Contributions

• How does one do this?
  – Hierarchical Bayesian modelling approach
• What does such a model look like?
  – Novel model architecture
• How does one estimate such a model?
  – Novel auxiliary variable Gibbs sampler
• Given such a model, does inference in the model actually confer application benefits?
  – Positive results

Page 44:

Review: Hierarchical Pitman-Yor Process Language Model [Teh, 2006]

• Finite, fixed vocabulary
• Infinite-dimensional parameter space (the G's, d's, and $\alpha$'s)
• Forms a suffix tree of depth n+1 (n as in n-gram) with one distribution at each node

$G_{\{\}} \sim \mathrm{PY}(d_0, \alpha_0, U)$

$G_{\{\text{of}\}} \sim \mathrm{PY}(d_1, \alpha_1, G_{\{\}})$

$\vdots$

$G_{\{\text{United States of}\}} \sim \mathrm{PY}(d_3, \alpha_3, G_{\{\text{States of}\}})$

$x \mid \{\text{United, States, of}\} \sim G_{\{\text{United States of}\}}$

In general notation:

$G_{\{\}} \sim \mathrm{PY}(d_0, \alpha_0, U)$

$G_{\{w_{t-1}\}} \sim \mathrm{PY}(d_1, \alpha_1, G_{\{\}})$

$\vdots$

$G_{\{w_{t-n+1:t-1}\}} \sim \mathrm{PY}(d_{n-1}, \alpha_{n-1}, G_{\{w_{t-n+2:t-1}\}})$

$w_t \mid w_{t-n+1:t-1} \sim G_{\{w_{t-n+1:t-1}\}}$

America (.00001), of (.01), States (.00001), the (.01), …
America (.0001), of (.000000001), States (.0001), the (.01), …
America (.8), of (.00000001), States (.0000001), the (.01), …

Page 45:

Review: HPYPLM General Notation

$G_{\{\}} \sim \mathrm{PY}(d_0, \alpha_0, U)$

$G_{\{w_{t-1}\}} \sim \mathrm{PY}(d_1, \alpha_1, G_{\{\}})$

$\vdots$

$G_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}(d_j, \alpha_j, G_{\{w_{t-j+1:t-1}\}})$

$x \mid w_{t-j:t-1} \sim G_{\{w_{t-j:t-1}\}}$

America (.00001), of (.01), States (.00001), the (.01), …
America (.0001), of (.000000001), States (.0001), the (.01), …
America (.8), of (.00000001), States (.0000001), the (.01), …

Page 46:

Review: Pitman-Yor / Dirichlet Process Stick-Breaking Representation

• Each G is a weighted set of atoms
• Atoms may repeat
  – Discrete base distribution

$G_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}(d_j, \alpha_j, G_{\{w_{t-j+1:t-1}\}}) \;\Rightarrow\; G_{\{w_{t-j:t-1}\}} = \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k}, \qquad \phi_k \sim G_{\{w_{t-j+1:t-1}\}}$

aa (.0001), aah (.00001), aal (.000001), aardvark (.001), …

Page 47:

Now you know…

• Statistical natural language modelling
  – Perplexity as a performance measure
• Domain adaptation
• Hierarchical Bayesian natural language models
  – HPYLM
  – HDP inference

Page 48:

Dirichlet Process (DP) Review

• Notation (G and H are measures over a space X): $G \sim \mathrm{DP}(\alpha, H)$, $\theta_i \sim G$
• Definition: for any fixed partition $(A_1, A_2, \ldots, A_K)$ of X,

$[G(A_1), G(A_2), \ldots, G(A_K)] \sim \mathrm{Dirichlet}(\alpha H(A_1), \alpha H(A_2), \ldots, \alpha H(A_K))$

Ferguson 1973, Blackwell & MacQueen 1973, Aldous 1985, Sethuraman 1994

Page 49:

Inference

• Quantities of interest:

$P(G \mid \theta_1) = \frac{P(\theta_1 \mid G)\, P(G)}{P(\theta_1)}, \qquad P(\theta_1) = \int P(\theta_1 \mid G)\, P(G)\, dG$

• Critical identities from multinomial / Dirichlet conjugacy (roughly):

$\theta_1 \sim H, \qquad G \mid \theta_1 \sim \mathrm{DP}\Big(\alpha + 1,\; \frac{\alpha H + \delta_{\theta_1}}{\alpha + 1}\Big)$

Page 50:

Posterior updating in one step

• With the following identifications:

$H_i = \frac{\alpha H + \sum_{j=1}^{i} \delta_{\theta_j}}{\alpha + i}, \qquad \alpha_i = \alpha + i$

• then:

$\theta_{i+1} \mid \theta_1, \ldots, \theta_i \sim H_i, \qquad G_{i+1} \mid \theta_1, \ldots, \theta_{i+1} \sim \mathrm{DP}(\alpha_{i+1}, H_{i+1})$

Page 51:

Don’t need G’s – “integrated out”

• Many $\theta$'s are the same

$\theta_{i+1} \mid \theta_1, \ldots, \theta_i \sim H_i, \qquad H_i = \frac{\alpha H + \sum_{j=1}^{i} \delta_{\theta_j}}{\alpha + i}$

• The "Chinese restaurant process" is a representation (data structure) of this posterior updating scheme that keeps track of only the unique $\theta$'s and the number of times each was drawn

Page 52:

Review: Hierarchical Dirichlet Process (HDP) Inference

• Starting with training data (a corpus), we estimate the posterior distribution through sampling:

$P(\Theta \mid w_1, w_2, \ldots, w_N) \propto P(w_1, w_2, \ldots, w_N \mid \Theta)\, P(\Theta)$

• HPYP inference is roughly the same as HDP inference

Page 53:

HDP Inference: Review

• Chinese restaurant franchise [Teh et al, 2006]
  – Seating arrangements of a hierarchy of Chinese restaurant processes are the state of the sampler
• Data are the observed tuples/n-grams ("here I am" (6), "here I was" (2))

$G_{\{\text{I, here}\}} \sim \mathrm{PY}(d, \alpha, G_{\{\text{I}\}}), \qquad x \mid \{\text{I, here}\} \sim G_{\{\text{I, here}\}}$

[Figure: seating arrangement for the "here I" restaurant with customers "am" and "was".]

Page 54:

HDP Inference: Review

• Induction over base distribution draws → Chinese restaurant franchise
• Every level is a Chinese restaurant process
• Auxiliary variables indicate from which table each n-gram came
  – A table's label is a draw from the base distribution
• State of the sampler: {am, 1}, {am, 3}, {was, 2}, … (in context "here I")

$P(z = k \mid \text{everything else}) \propto (c_k - d)\, I(x = \phi_k)$

$P(z = K+1 \mid \text{everything else}) \propto (\alpha + dK)\, G_I(x)$
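A sketch of one collapsed-Gibbs seating move implementing the two proportionalities above, for a single restaurant (illustrative only: the unseat step and the recursive draw in the parent restaurant are omitted, and the data structures are invented):

```python
import random

def seat_customer(tables, x, alpha, d, base_prob, rng=random):
    """Seat a customer with dish x at an existing table serving x
    (weight c_k - d) or at a new table (weight (alpha + d*K) * G(x)).
    `tables` is a list of (dish, count) pairs."""
    weights = [max(c - d, 0.0) if dish == x else 0.0 for dish, c in tables]
    weights.append((alpha + d * len(tables)) * base_prob(x))
    r = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            break
    new_table = (k == len(tables))
    if new_table:
        tables.append((x, 1))            # a draw from the base distribution
    else:
        dish, count = tables[k]
        tables[k] = (dish, count + 1)
    return new_table                     # caller would then recurse into the parent

tables = [("am", 3), ("was", 2), ("am", 1)]   # the "here I" restaurant
seat_customer(tables, "am", alpha=1.0, d=0.5, base_prob=lambda w: 1 / 50_000)
print(tables)
```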

Page 55:

HDP Inference: Review

• Every draw from the base distribution means that the resulting atom must (also) be in the base distribution's set of atoms

$P(z = k \mid \text{everything else}) \propto (c_k - d)\, I(x = \phi_k)$

$P(z = K+1 \mid \text{everything else}) \propto (\alpha + dK)\, G_{\{\}}(x)$

[Figure: the "here I" restaurant and its parent restaurant, with customers "am" and "was".]

Page 56:

HDP Inference: Review

• Every table at level n corresponds to a customer in a restaurant at level n-1
  – If a customer is seated at a new table, this corresponds to a draw from the base distribution
• All customers in the entire hierarchy are unseated and re-seated in each sweep

$P(z = k \mid \text{everything else}) \propto (c_k - d)\, I(x = \phi_k)$

$P(z = K+1 \mid \text{everything else}) \propto (\alpha + dK)\, G_{\{\}}(x)$

[Figure: the "here I" restaurant and its parent restaurant, as above.]

Page 57:

Parameter Infestation

• If V is the vocabulary size, then a full tri-gram model has (potentially) $V^3$ parameters

$P(w_n \mid w_{n-1}, w_{n-2})$

$P(\text{pepper} \mid \text{and}, \text{salt}),\; P(\text{lemon} \mid \text{and}, \text{salt}),\; \ldots,\; P(\text{snufalupagus} \mid \text{and}, \text{salt})$

Page 58:

Big models

• Assume a 50,000-word vocabulary
  – $1.25 \times 10^{14}$ parameters
• Not all must be realized or stored in a practical implementation
  – The maximum number of tri-grams in a billion-word corpus is $\sim 10^9$

Page 59:

Language Modelling

• Zero counts and smoothing:

$P(x \mid w_{n-1}, w_{n-2}) = \frac{\#\{w_{n-2}, w_{n-1}, x\}}{\#\{w_{n-2}, w_{n-1}\}}$

• Kneser-Ney:

$P_{KN}(x \mid w_{n-1}, w_{n-2}) = \frac{\max(0,\, \#\{w_{n-2}, w_{n-1}, x\} - d)}{\#\{w_{n-2}, w_{n-1}\}} + \frac{dK}{\#\{w_{n-2}, w_{n-1}\}}\, P_{KN}(x \mid w_{n-1})$
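A direct transcription of this interpolated recursion into Python (illustrative: full Kneser-Ney also replaces lower-order counts with distinct-continuation counts, a refinement glossed over here just as on the slide):

```python
def kn_prob(x, ctx, counts, d=0.75, vocab_size=50_000):
    """Interpolated discounting: subtract d from each observed count and
    give the freed mass d*K to the shorter-context model, recursively.
    `counts` maps a context tuple to a {word: count} dict."""
    if not ctx:
        return 1.0 / vocab_size               # uniform at the bottom
    ctx_counts = counts.get(ctx, {})
    total = sum(ctx_counts.values())
    if total == 0:
        return kn_prob(x, ctx[1:], counts, d, vocab_size)
    K = len(ctx_counts)                       # distinct words seen after ctx
    discounted = max(ctx_counts.get(x, 0) - d, 0) / total
    return discounted + (d * K / total) * kn_prob(x, ctx[1:], counts, d, vocab_size)

counts = {("and", "salt"): {"pepper": 8, "lemon": 1},
          ("salt",): {"pepper": 10, "lemon": 2, "water": 3}}
print(kn_prob("pepper", ("and", "salt"), counts))
```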

Page 60:

Multi-Floor Chinese Restaurant Process

$P(z_{jv} = k \mid \mathbf{z}_v \backslash z_{jv}, S, X) \propto \max\big(c^{-}_{kv} - d_v,\, 0\big)\, \delta(x_{jv} - \phi_{kv})$

$P(z_{jv} = K+1,\, s^{K+1}_v = w \mid \mathbf{z}_v \backslash z_{jv}, S, X) \propto (\alpha_v + d_v K^{-}_v)\, \lambda_{w \to v}\, G_w(x_{jv})$

Page 61:

Domain LM Adaptation Approaches

• Mixture [Kneser & Steinbiss, 1993]:

$P(x \mid w_{n-1}, w_{n-2}) = \lambda\, P_{D_1}(x \mid w_{n-1}, w_{n-2}) + (1-\lambda)\, P_{D_2}(x \mid w_{n-1}, w_{n-2})$

• Union [Bellegarda, 2004]:

$P(x \mid w_{n-1}, w_{n-2}) = \frac{\#\{w_{n-2}, w_{n-1}, x\}_{D_1} + \#\{w_{n-2}, w_{n-1}, x\}_{D_2}}{\#\{w_{n-2}, w_{n-1}\}_{D_1} + \#\{w_{n-2}, w_{n-1}\}_{D_2}}$

• MAP [Bacchiani, 2006]:

$x \mid w_{n-1}, w_{n-2} \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_N), \qquad \pi_1, \ldots, \pi_N \sim \mathrm{Dirichlet}\big(\#\{w_{n-2}, w_{n-1}, x\}_{D_1} + \#\{w_{n-2}, w_{n-1}, x\}_{D_2}, \ldots\big)$ over all words $x$
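The mixture and union baselines reduce to a few lines each; a sketch with invented data structures (the MAP approach additionally requires handling the Dirichlet posterior, so it is omitted here):

```python
def mixture_prob(x, ctx, p_d1, p_d2, lam=0.5):
    """Mixture adaptation: fixed interpolation of two trained domain models."""
    return lam * p_d1(x, ctx) + (1 - lam) * p_d2(x, ctx)

def union_prob(x, ctx, counts_d1, counts_d2):
    """Union adaptation: pool raw counts from both domains, then take the
    maximum-likelihood estimate of the pooled corpus."""
    num = counts_d1.get(ctx, {}).get(x, 0) + counts_d2.get(ctx, {}).get(x, 0)
    den = sum(counts_d1.get(ctx, {}).values()) + sum(counts_d2.get(ctx, {}).values())
    return num / den if den else 0.0

sou = {("great", "society"): {"here": 2}}
brown = {("great", "society"): {"here": 1, "now": 1}}
print(union_prob("here", ("great", "society"), sou, brown))   # 3/4
```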

Page 62:

HHPYLM Inference: Intuition

Page 63:

Estimation

• Counting:

$P(x \mid w_{n-1}, w_{n-2}) = \frac{\#\{w_{n-2}, w_{n-1}, x\}}{\#\{w_{n-2}, w_{n-1}\}} = \frac{\#\{w_{n-2}, w_{n-1}, x\}}{\sum_j \#\{w_{n-2}, w_{n-1}, j\}}$

• Zero counts are a problem
• Smoothing (or regularization) is essential
  – Kneser-Ney
  – Hierarchical Dirichlet Language Model (HDLM)
  – Hierarchical Pitman-Yor Process Language Model (HPYLM)
  – etc.

Page 64:

Justification

• Domain-specific training data can be costly to collect and process
  – e.g. manual transcription of customer service telephone calls
• Different domains are different!

Page 65:

Why Domain Adaptation?

• "… a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period"
  – Ronald Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go From Here?", 2000

Page 66:

SLM Evaluation

• Per-word perplexity is a model evaluation metric common in the NLP community
  – Computed using test data and the trained model
  – Related to log likelihood and entropy
  – Language modelling interpretation: the average size of the "replacement word" set

$\mathrm{perplexity}(P_{\text{trained}}, W_{\text{test}}) = 2^{-\frac{1}{N} \sum_i \log_2 P_{\text{trained}}(w^{\text{test}}_i \mid w^{\text{test}}_{i-1}, w^{\text{test}}_{i-2})}$
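Perplexity is then mechanical to compute from any trained model's conditional probabilities; a sketch (`log2_prob` stands in for a trained tri-gram model):

```python
import math

def perplexity(log2_prob, test_words):
    """2^{-(1/N) sum_i log2 P(w_i | w_{i-2}, w_{i-1})} over a test sequence."""
    log_sum, n = 0.0, 0
    for i in range(2, len(test_words)):
        log_sum += log2_prob(test_words[i], (test_words[i - 2], test_words[i - 1]))
        n += 1
    return 2.0 ** (-log_sum / n)

# Sanity check: a uniform model over a 13,000-word vocabulary scores 13,000.
uniform = lambda w, ctx: math.log2(1.0 / 13_000)
print(perplexity(uniform, "we choose to go to the moon".split()))
```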

Page 67:

SOU / Brown

• State of the Union Corpus, 1945-2006: ~370,000 words, 13,000 unique
  – Today, the entire world is looking to America for enlightened leadership to peace and progress. Truman, 1945
  – Today, an estimated 4 out of every 10 students in the 5th grade will not even finish high school - and that is a waste we cannot afford. Kennedy, 1963
  – In 1945, there were about two dozen lonely democracies in the world. Today, there are 122. And we're writing a new chapter in the story of self-government -- with women lining up to vote in Afghanistan, and millions of Iraqis marking their liberty with purple ink, and men and women from Lebanon to Egypt debating the rights of individuals and the necessity of freedom. Bush, 2006
• Brown Corpus, 1967: ~1,000,000 words, 50,000 unique
  – During the morning hours, it became clear that the arrest of Spencer was having no sobering effect upon the men of the Somers.
  – He certainly didn't want a wife who was fickle as Ann.
  – It is being fought, moreover, in fairly close correspondence with the predictions of the soothsayers of the think factories.
• Test: Johnson's SOU Addresses, 1963-1969: ~37,000 words
  – But we will not permit those who fire upon us in Vietnam to win a victory over the desires and the intentions of all the American people.
  – This Nation is mighty enough, its society is healthy enough, its people are strong enough, to pursue our goals in the rest of the world while still building a Great Society here at home.

Page 68:

AMI / Brown

• AMI, 2007: approx. 800,000 tokens, 8,000 types
  – Yeah yeah for example the l.c.d. you can take it you can put it put it back in or you can use the other one or the speech recognizer with the microphone yeah yeah you want a microphone to pu in the speech recognizer you don't you pay less for the system you see so

Page 69:

Bayesian Domain Adaptation

[Graphical model: a corpus-independent model $\theta_{\text{hyper}}$ sits above corpus-specific models $\theta_{\text{big}}$ and $\theta_{\text{small}}$, which generate the big corpus, the small corpus, and the test data.]

Page 70:

HPYPLM

$G^{D}_{\{\}} \sim \mathrm{PY}\big(d^{D}_0, \alpha^{D}_0,\; \lambda^{D}_0 U + (1-\lambda^{D}_0)\, G^{H}_{\{\}}\big)$

$G^{D}_{\{w_{t-1}\}} \sim \mathrm{PY}\big(d^{D}_1, \alpha^{D}_1,\; \lambda^{D}_1 G^{D}_{\{\}} + (1-\lambda^{D}_1)\, G^{H}_{\{w_{t-1}\}}\big)$

$\vdots$

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d^{D}_j, \alpha^{D}_j,\; \lambda^{D}_j G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda^{D}_j)\, G^{H}_{\{w_{t-j:t-1}\}}\big)$

$x \mid w_{t-n+1:t-1} \sim G^{D}_{\{w_{t-n+1:t-1}\}}$

Page 71:

Bayesian Domain Adaptation

• Domain Adaptation ↔ Hierarchical Modelling

[Graphical model: a hyper model $\phi$ sits above domain-specific models $\theta_1, \theta_2, \ldots, \theta_n$, each generating its own domain-specific observations $y_1, y_2, \ldots, y_n$.]

Page 72:

Doubly hierarchical Pitman-Yor Process Language Model

[Figure: layers of the model, from the uniform distribution over words, through the hyper language model, down to each domain.]

Page 73:

Hierarchical Pitman-Yor Process Language Model

• e.g.

$G_{\{\}} \sim \mathrm{PY}(d_0, \alpha_0, U)$ = America (.00001), of (.01), States (.00001), the (.01), …

$G_{\{\text{of}\}} \sim \mathrm{PY}(d_1, \alpha_1, G_{\{\}})$ = America (.0001), of (.000000001), States (.0001), the (.01), …

$\vdots$

$G_{\{\text{United States of}\}} \sim \mathrm{PY}(d_3, \alpha_3, G_{\{\text{States of}\}})$ = America (.8), of (.00000001), States (.0000001), the (.01), …

$x \mid \{\text{United, States, of}\} \sim G_{\{\text{United States of}\}}$

[Teh, Y. W., 2006], [Goldwater, S., Griffiths, T. L., & Johnson, M., 2007]

Page 74:

Pitman-Yor Process?

$G \sim \mathrm{PY}(d, \alpha, G_0), \qquad x \sim G$

with discount $d$, concentration $\alpha$, base distribution $G_0$, and observations $x$.

$G^{D}_{\{\}} \sim \mathrm{PY}\big(d^{D}_0, \alpha^{D}_0,\; \lambda^{D}_0 U + (1-\lambda^{D}_0)\, G^{H}_{\{\}}\big)$

$G^{D}_{\{w_{t-1}\}} \sim \mathrm{PY}\big(d^{D}_1, \alpha^{D}_1,\; \lambda^{D}_1 G^{D}_{\{\}} + (1-\lambda^{D}_1)\, G^{H}_{\{w_{t-1}\}}\big)$

$\vdots$

$G^{D}_{\{w_{t-j:t-1}\}} \sim \mathrm{PY}\big(d^{D}_j, \alpha^{D}_j,\; \lambda^{D}_j G^{D}_{\{w_{t-j+1:t-1}\}} + (1-\lambda^{D}_j)\, G^{H}_{\{w_{t-j:t-1}\}}\big)$

$x \mid w_{t-n+1:t-1} \sim G^{D}_{\{w_{t-n+1:t-1}\}}$

Page 75:

Hierarchical Bayesian Modelling

[Graphical model: a corpus-independent model $\theta_{\text{hyper}}$ sits above corpus-specific models $\theta_{\text{big}}$ and $\theta_{\text{small}}$, which generate the big and small corpora.]

Page 76:

What does G look like?

• G is a discrete distribution with infinite support [Sethuraman, 94]

$G \sim \mathrm{PY}(d, \alpha, H) \;\Rightarrow\; G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k}, \qquad \sum_{k=1}^{\infty} \pi_k = 1$

aa (.0001), aah (.00001), aal (.000001), aardvark (.001), …

Page 77:

What does G look like?

• Stick-breaking representation (GEM) [Sethuraman, 94]

$G \sim \mathrm{PY}(d, \alpha, H) \;\Rightarrow\; G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k}, \qquad \phi_k \sim H$

$\pi_k = w_k \prod_{\ell=1}^{k-1} (1 - w_\ell), \qquad w_k \sim \mathrm{Beta}(1 - d,\; \alpha + k d)$

The Dirichlet process arises when $d = 0$.
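A truncated stick-breaking draw following these equations (illustrative sketch; the truncation level and toy base distribution are arbitrary):

```python
import random

def stick_breaking_py(d, alpha, base_draw, trunc=1000, seed=0):
    """Truncated draw from PY(d, alpha, H): w_k ~ Beta(1-d, alpha+k*d),
    pi_k = w_k * prod_{l<k}(1-w_l), phi_k ~ H. Setting d = 0 gives the
    Dirichlet-process (GEM) special case."""
    rng = random.Random(seed)
    pis, phis, remaining = [], [], 1.0
    for k in range(1, trunc + 1):
        w = rng.betavariate(1 - d, alpha + k * d)
        pis.append(remaining * w)
        phis.append(base_draw(rng))
        remaining *= 1 - w
    return pis, phis

vocab = ["aa", "aah", "aal", "aardvark"]
pis, phis = stick_breaking_py(0.5, 10.0, lambda rng: rng.choice(vocab))
print(round(sum(pis), 4), list(zip(phis[:4], [round(p, 4) for p in pis[:4]])))
```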