Dirichlet Simplex Nest and Geometric Inference



Page 1: Dirichlet Simplex Nest and Geometric Inference

Mikhail Yurochkin^{1,2,*}, Aritra Guha^{3,*}, Yuekai Sun^3, XuanLong Nguyen^3

^1 IBM Research, ^2 MIT-IBM Watson AI Lab, ^3 University of Michigan

* authors contributed equally

ICML 2019, June 11th (Poster #230)

Page 2: Dirichlet Simplex Nest and Geometric Inference


Dirichlet Simplex Nest (DSN)

[Figure: a simplex with extreme points β_1, β_2, β_3 and an interior point θ_m, drawn over axes Word 1 through Word 4]

The parent nest: B = Conv(β_1, ..., β_K)

Page 3: Dirichlet Simplex Nest and Geometric Inference


Dirichlet Simplex Nest (DSN)

The parent nest: B = Conv(β_1, ..., β_K)

Admixture weights: θ_i ∼ Dir_K(α) ∈ Δ^{K−1}

Page 4: Dirichlet Simplex Nest and Geometric Inference


Dirichlet Simplex Nest (DSN)

The parent nest: B = Conv(β_1, ..., β_K)

Admixture weights: θ_i ∼ Dir_K(α) ∈ Δ^{K−1}

Offspring i is born: μ_i = Σ_{k=1}^K θ_{ik} β_k ∈ R^D

Page 5: Dirichlet Simplex Nest and Geometric Inference


Dirichlet Simplex Nest (DSN)

The parent nest: B = Conv(β_1, ..., β_K)

Admixture weights: θ_i ∼ Dir_K(α) ∈ Δ^{K−1}

Offspring i is born: μ_i = Σ_{k=1}^K θ_{ik} β_k ∈ R^D

Offspring i leaves the nest: x_i | μ_i ∼ F(· | μ_i) ∈ R^D such that E[x_i | μ_i] = μ_i

Page 6: Dirichlet Simplex Nest and Geometric Inference


Dirichlet Simplex Nest (DSN)

Dirichlet Simplex Nest examples:

• LDA: x_i | μ_i ∼ Multinomial(μ_i) (Blei et al., 2003)

• Gaussian: x_i | μ_i ∼ N(μ_i, Σ) (Schmidt et al., 2009)

• Poisson-Gamma: x_{i,d} | μ_{i,d} ∼ Pois(μ_{i,d}) independently over d (Cemgil, 2009)
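
A minimal NumPy sketch of this generative process for the Gaussian case (the helper name, the isotropic noise scale, and the toy vertices below are illustrative assumptions, not the paper's simulation setup):

import numpy as np

def sample_gaussian_dsn(betas, alpha, n, noise_std=0.5, seed=0):
    # betas: (K, D) array whose rows are the extreme points beta_k of the parent nest
    rng = np.random.default_rng(seed)
    K, D = betas.shape
    # admixture weights: theta_i ~ Dir_K(alpha)
    thetas = rng.dirichlet(alpha * np.ones(K), size=n)        # (n, K)
    # offspring i is born: mu_i = sum_k theta_ik * beta_k
    mus = thetas @ betas                                      # (n, D)
    # offspring leave the nest: x_i | mu_i ~ N(mu_i, noise_std^2 * I)
    xs = mus + noise_std * rng.standard_normal((n, D))
    return xs, thetas

# e.g. a K = D = 3 toy setting similar to the figures below
X, theta = sample_gaussian_dsn(betas=5.0 * np.eye(3), alpha=2.5, n=10000)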

Page 7: Dirichlet Simplex Nest and Geometric Inference


Centroidal Voronoi tessellation (CVT), Du et al. (1999)

• open set Ω

• distance d : Ω × Ω → R_+

• density ρ : Ω → R_+

• For z_1, ..., z_K ∈ Ω, the Voronoi region corresponding to z_k is
  V_k = {x ∈ Ω : d(x, z_k) < d(x, z_l) for all l ≠ k}

  ⋆ V_1, ..., V_K is a Voronoi tessellation of Ω
  ⋆ the z_k's are the generators of this Voronoi tessellation

• V_1, ..., V_K is a CVT iff the z_k's satisfy
  z_k = ∫_{V_k} x ρ(x) dx / ∫_{V_k} ρ(x) dx
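
A small Monte Carlo sketch of this definition with Euclidean d: draw samples from ρ, assign them to Voronoi regions, and check whether each generator matches the mass centroid of its region (the function name, sample handling, and tolerance are illustrative):

import numpy as np

def is_approx_cvt(z, samples, tol=1e-2):
    # z: (K, D) candidate generators; samples: (m, D) draws from the density rho
    # assign each sample to its Voronoi region (nearest generator in Euclidean distance)
    d2 = ((samples[:, None, :] - z[None, :, :]) ** 2).sum(-1)   # (m, K)
    labels = d2.argmin(axis=1)
    for k in range(len(z)):
        in_k = labels == k
        # centroidal condition: z_k should equal the centroid of its own region V_k
        if not in_k.any() or not np.allclose(z[k], samples[in_k].mean(axis=0), atol=tol):
            return False
    return True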

Page 8: Dirichlet Simplex Nest and Geometric Inference


Variational characterization of CVTs

• define the cost function
  J(z_1, ..., z_K) = Σ_{k=1}^K ∫_{V_k} d(x, z_k)^2 ρ(x) dx,
  where the V_k's are the Voronoi regions corresponding to the z_k's

• If (z_1, ..., z_K) ∈ argmin J, then the corresponding Voronoi tessellation is a CVT.

• estimate (z_1, ..., z_K) by minimizing the empirical counterpart of J:
  (1/n) Σ_{k=1}^K Σ_{i=1}^n d(x_i, z_k)^2 1{x_i ∈ V_k},   x_i iid∼ ρ

• If d is the Euclidean distance, this is the K-means cost function.
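
A minimal Lloyd-iteration sketch of minimizing this empirical cost in the Euclidean case, i.e. plain K-means (the function name, initialization, and iteration count are illustrative; in practice an off-the-shelf K-means solver does the same job):

import numpy as np

def lloyd_cvt(samples, K, n_iter=50, seed=0):
    # approximate CVT generators from x_i ~ rho by alternating
    # Voronoi assignment and centroid updates (Lloyd's algorithm / K-means)
    rng = np.random.default_rng(seed)
    z = samples[rng.choice(len(samples), size=K, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((samples[:, None, :] - z[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if (labels == k).any():
                z[k] = samples[labels == k].mean(axis=0)
    return z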

Page 9: Dirichlet Simplex Nest and Geometric Inference


CVT of equilateral simplices

• Ω is a scaled, rotated, translated copy of Δ^{K−1}

• equipped with the distance d(x_1, x_2) = ‖x_1 − x_2‖_2

• ρ is the Dir_K(α 1_K) density "transported" to Ω

• fact: the CVT generators z_k lie on the line segments between the centroid μ of the simplex and its extreme points

• idea: estimate the extreme points by estimating μ and the z_k's and searching along the ray from μ to z_k (Geometric Dirichlet Means (GDM), Yurochkin and Nguyen (2016))

Page 10: Dirichlet Simplex Nest and Geometric Inference


Toy example: Gaussian DSN with K = D = 3

Figure: GDM with equilateral B, α = 2.5

Page 11: Dirichlet Simplex Nest and Geometric Inference


Toy example: Gaussian DSN with K = D = 3

Figure: GDM with general B, α = 0.3

Page 12: Dirichlet Simplex Nest and Geometric Inference


Toy example: Gaussian DSN with K = D = 3

Figure: GDM with general B, α = 2.5

Page 13: Dirichlet Simplex Nest and Geometric Inference


Toy example: Gaussian DSN with K = D = 3

Figure: Xray (Kumar et al., 2013), separability is not satisfied

Page 14: Dirichlet Simplex Nest and Geometric Inference


Toy example: Gaussian DSN with K = D = 3

Figure: Stan-HMC (Carpenter et al., 2017), too slow (10 min)

Page 15: Dirichlet Simplex Nest and Geometric Inference


Estimating extreme points of non-equilateral simplices

• idea: transform the non-equilateral simplex into an equilateral one

• the map from the equilateral to the non-equilateral simplex is the linear map B^T

• CVT centroids in the ‖·‖_{(BB^T)^†} norm are the images of the CVT generators of Δ^{K−1} under B^T (for any concentration parameter α > 0)

• K-means clustering the x_i's in the ‖·‖_{(BB^T)^†} norm is equivalent to K-means clustering the (left) singular vectors of X̄ = X − 1_n μ^T

Page 16: Dirichlet Simplex Nest and Geometric Inference


Voronoi Latent Admixture (VLAD)

inputs: x_1, ..., x_n ∈ R^D, extension size γ > 0

outputs: β_1, ..., β_K ∈ R^D

1a. μ = (1/n) Σ_{i=1}^n x_i
1b. x_i ← x_i − μ
2. compute the top K singular factors of the centered data matrix: X̄ ≈ U Λ V^T
3. (w_1, ..., w_K) ← K-means(u_1, ..., u_n), where u_i ∈ R^K is the i-th row of U
4. z_k ← μ + V Λ w_k
5. β_k ← μ + γ (z_k − μ)
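
A minimal NumPy/scikit-learn sketch of these five steps (the function name, a user-supplied fixed γ, and the K-means settings are illustrative; choosing γ is discussed two slides below):

import numpy as np
from sklearn.cluster import KMeans

def vlad(X, K, gamma, seed=0):
    # 1a-1b. center the data
    mu = X.mean(axis=0)
    Xc = X - mu
    # 2. top-K singular factors of the centered data matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    U, S, Vt = U[:, :K], S[:K], Vt[:K, :]
    # 3. K-means on the rows of U
    W = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(U).cluster_centers_
    # 4. map the cluster centers back to data space: z_k = mu + V Lambda w_k
    Z = mu + W @ np.diag(S) @ Vt
    # 5. extend along the rays from mu through z_k
    return mu + gamma * (Z - mu)     # rows are the estimated beta_k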

Page 17: Dirichlet Simplex Nest and Geometric Inference


Toy example: Gaussian DSN with K = D = 3

Figure: Voronoi Latent Admixture (VLAD), under 1 sec

Page 18: Dirichlet Simplex Nest and Geometric Inference


Choosing extension size γ

• fact: the exact γ = ‖β_k − μ‖_2 / ‖z_k − μ‖_2 depends only on the concentration parameter α (and K); it does not depend on the geometry of the simplex

• choosing the extension size γ is equivalent to estimating the concentration parameter α

• tabulate γ(α) for all α > 0 in a particular simplex (e.g. Δ^{K−1}) by Monte Carlo

• estimate α by a method of moments approach:
  α̂ = argmin{ ‖ B(γ(α)) (I_K − (1/K) 1_K 1_K^T) B(γ(α))^T / (K(Kα + 1)) − Σ̂ ‖ : α > 0 }

  ⋆ B(γ) = 1_K μ^T + γ (Z − 1_K μ^T), where the rows of Z are the z_k's
  ⋆ (I_K − (1/K) 1_K 1_K^T) / (K(Kα + 1)) is the covariance matrix of a Dir_K(α 1_K) random vector
  ⋆ Σ̂ is the sample covariance matrix of the x_i's
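
A minimal Monte Carlo sketch of tabulating γ(α) on Δ^{K−1}, using K-means centers of Dirichlet samples as stand-ins for the CVT generators (the grid, sample size, and function name are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def tabulate_gamma(alphas, K, n_mc=20000, seed=0):
    # gamma(alpha) = ||beta_k - mu|| / ||z_k - mu|| on the standard simplex,
    # with vertices beta_k = e_k and centroid mu = (1/K, ..., 1/K)
    rng = np.random.default_rng(seed)
    mu = np.full(K, 1.0 / K)
    vertex_dist = np.linalg.norm(np.eye(K)[0] - mu)   # the same for every vertex
    gammas = []
    for a in alphas:
        theta = rng.dirichlet(a * np.ones(K), size=n_mc)
        z = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(theta).cluster_centers_
        gammas.append(vertex_dist / np.linalg.norm(z - mu, axis=1).mean())
    return np.array(gammas)

# e.g. gamma_table = tabulate_gamma(np.linspace(0.1, 5.0, 25), K=10)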

Page 19: Dirichlet Simplex Nest and Geometric Inference


Estimation error of VLAD

• Pollard's condition: the noiseless CVT cost function
  Σ_{k=1}^K ∫_{V_k} Dirichlet(θ; α 1_K) ‖B^T θ − z_k‖_2^2 dθ
  is λ-strongly convex

• quantify the noise level by σ^2 := sup{ ‖Cov[x_i | θ_i]‖_2 : θ_i ∈ Δ^{K−1} }

• there is a permutation π of [K] such that
  ‖(β̂_{π(1)}, ..., β̂_{π(K)}) − (β_1, ..., β_K)‖_2 ≤ O(σ^{1/4} / √λ) + O_P(1 / √n)

Page 20: Dirichlet Simplex Nest and Geometric Inference


Simulations: varying DSN geometry

• α = 2, D = 500 / 2000, K = 10, n = 10k

• β_k ∼ N(0, K) (Gaussian case) / Dir_D(0.1) (multinomial case), scaled towards the mean by c_k ∼ Unif(c_min, 1)

• small c_min → more DSN skewness → harder problem

Figure: Gaussian data. Minimum matching distance vs. c_min for VLAD-GDM, VLAD, Xray, Stan-HMC, SPA, MVES, GDM-MC.

Figure: Multinomial data (LDA). Minimum matching distance vs. c_min for VLAD-GDM, VLAD, RecoverKL, Gibbs, SVI, GDM-MC.
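
A small sketch of the evaluation metric on the y-axis, assuming "minimum matching distance" is the permutation-minimized average distance between estimated and true extreme points (the exact aggregation used in the paper may differ):

import numpy as np
from scipy.optimize import linear_sum_assignment

def minimum_matching_distance(beta_hat, beta_true):
    # pairwise Euclidean distances between estimated and true extreme points
    cost = np.linalg.norm(beta_hat[:, None, :] - beta_true[None, :, :], axis=-1)
    # best one-to-one matching (Hungarian algorithm), averaged over the K matched pairs
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()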

Page 21: Dirichlet Simplex Nest and Geometric Inference


Simulations: varying concentration parameter α

• varying α, D = 500 / 2000, K = 10, n = 10k

• β_k ∼ N(0, K) (Gaussian case) / Dir_D(0.1) (multinomial case), scaled towards the mean by c_k ∼ Unif(0.5, 1)

• large α → non-separable data → harder problem

Figure: Gaussian data. Minimum matching distance vs. α for VLAD-GDM, VLAD, Xray, Stan-HMC, SPA, MVES.

Figure: Multinomial data (LDA). Minimum matching distance vs. α for VLAD-GDM, VLAD, RecoverKL, Gibbs, SVI.

Page 22: Dirichlet Simplex Nest and Geometric Inference


Factor analysis of stock data

• D = 55 stocks

• n = 3k days

• 400 data points left out for validation

• VLAD factors
  ⋆ a growth component related to banks (e.g., Bank of America, Wells Fargo)
  ⋆ stocks of fuel companies (Valero Energy, Chevron) are inversely related to defense contractors (Boeing, Raytheon)

         Residual norm   Volume        Runtime
VLAD     0.300           0.14          1 sec
GDM      0.294           1499          1 sec
HMC      0.299           1.95          10 min
MVES     0.287           5.39 × 10^9   3 min
SPA      0.392           3.31 × 10^7   1 sec

Page 23: Dirichlet Simplex Nest and Geometric Inference


Topic discovery in New York Times corpus

• D = 5320 words in the vocabulary

• n ≈ 100k articles

• 25k articles left out for perplexity evaluation

• VLAD topics
  ⋆ ballot al-gore votes election recount florida voter court vote counties count county board hand bush george-bush official republican counted lawyer
  ⋆ cup sugar minutes cream butter teaspoon tablespoon chocolate egg pan add bowl mixture flour milk oven baking water serving cake

           Perplexity   Coherence   Runtime
VLAD       1767         0.86        6 min
GDM        1777         0.88        30 min
Gibbs      1520         0.80        5.3 hrs
RecoverKL  2365         0.70        17 min
SVI        1669         0.81        40 min

Page 24: Dirichlet Simplex Nest and Geometric Inference


Open problem: asymmetric concentration parameter α

Figure: VLAD, α = (0.5, 1.5, 2.5)

Page 25: Dirichlet Simplex Nest and Geometric Inference


THANK YOU!

Please come to poster #230

Questions?
