
Efficient Learning in High Dimensions with

Trees and Mixtures

Marina Meila, Carnegie Mellon University

· Multidimensional (noisy) data

· Learning tasks - intelligent data analysis
  · categorization (clustering)
  · classification
  · novelty detection
  · probabilistic reasoning

· Data is changing, growing
· Tasks change

need to make learning automatic, efficient

Combining probability and algorithms

[Diagram: multidimensional data -> learning]

· Automatic probability and statistics
· Efficient algorithms

· This talk: the tree statistical model

Talk overview

· Introduction: statistical models
· The tree model
· Mixtures of trees: learning, experiments
· Accelerated learning
· Bayesian learning
· Perspective: generative models and decision tasks

A multivariate domain

· Data: patient records over the variables Smoker, X ray, Cough, Bronchitis, Lung cancer

  Patient 1   . . .
  Patient 2   . . .
  . . .

· Queries
  · Diagnose a new patient: Lung cancer?
  · Is smoking related to lung cancer?
  · Understand the "laws" of the domain

Probabilistic approach

· Smoker, Bronchitis, ... are (discrete) random variables
· Statistical model (joint distribution)

  P( Smoker, Bronchitis, Lung cancer, Cough, X ray )

  summarizes knowledge about the domain

· Queries
  · inference
    e.g. P( Lung cancer = true | Smoker = true, Cough = false )
  · structure of the model
    · discovering relationships
    · categorization

Probability table representation

        v1 v2:   00    01    11    10
  v3 = 0        .01   .14   .22   .01
  v3 = 1        .23   .03   .33   .03

· Query:

  P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1)
                 = (.14 + .03) / (.14 + .03 + .22 + .33)  ≈  .23

· Curse of dimensionality: if v1, v2, ..., vn are binary variables, the table
  P(V1, V2, ..., Vn) has 2^n entries!

· How to represent? How to query? How to learn from data? Structure?
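As a minimal sketch (not part of the slides), the query above can be answered directly from the joint table with numpy; the array layout and variable names are illustrative.

```python
# Minimal sketch: answering P(v1=0 | v2=1) from an explicit joint table.
# The 2x2x2 table reproduces the example above (axes: v1, v2, v3).
import numpy as np

P = np.empty((2, 2, 2))          # P[v1, v2, v3]
P[0, 0, 0], P[0, 1, 0], P[1, 1, 0], P[1, 0, 0] = .01, .14, .22, .01
P[0, 0, 1], P[0, 1, 1], P[1, 1, 1], P[1, 0, 1] = .23, .03, .33, .03

joint_v1_v2 = P.sum(axis=2)       # marginalize out v3
p_v2_1 = joint_v1_v2[:, 1].sum()  # P(v2 = 1) = .72
p_cond = joint_v1_v2[0, 1] / p_v2_1
print(p_cond)                     # -> 0.2361..., the ".23" on the slide
```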

Graphical models

· Structure
  · vertices = variables
  · edges = "direct dependencies"
· Parametrization
  · by local probability tables

· compact parametric representation
· efficient computation
· learning parameters by a simple formula
· learning structure is NP-hard

[Figure: example graphical model over Galaxy type, distance, size, spectrum,
 Z (red-shift), dust, photometric measurement, observed size, observed spectrum]

The tree statistical model

· Structure: a tree (graph with no cycles) over the vertices 1, 2, ..., n
· Parameters: probability tables associated to the edges
  (e.g. the edge table T_34, or the node and conditional tables T_3, T_4|3)

· Directed parametrization:

    T(x) = ∏_{uv ∈ E} T_v|u(x_v | x_u)

· Equivalent undirected parametrization:

    T(x) = ∏_{uv ∈ E} T_uv(x_u, x_v)  /  ∏_{v ∈ V} T_v(x_v)^(deg v - 1)

· T(x) factors over the tree edges
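A minimal sketch of evaluating the undirected factored form above, assuming a small illustrative chain with hand-made edge and node marginals (none of these tables come from the slides):

```python
# Minimal sketch: T(x) = prod_uv T_uv(x_u, x_v) / prod_v T_v(x_v)^(deg v - 1)
import numpy as np

edges = [(0, 1), (1, 2)]                              # a 3-node chain 0 - 1 - 2
T_edge = {e: np.full((2, 2), 0.25) for e in edges}    # pairwise marginals T_uv
T_node = {v: np.array([0.5, 0.5]) for v in range(3)}  # node marginals T_v
deg = {v: sum(v in e for e in edges) for v in range(3)}

def tree_prob(x):
    num = np.prod([T_edge[(u, v)][x[u], x[v]] for (u, v) in edges])
    den = np.prod([T_node[v][x[v]] ** (deg[v] - 1) for v in range(3)])
    return num / den

print(tree_prob([0, 1, 0]))   # 0.25 * 0.25 / 0.5 = 0.125 (uniform distribution)
```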

Examples

· Premature babies' Broncho-Pulmonary Disease (BPD)
  [Figure: learned tree over the clinical variables Weight, Temperature,
   Thrombocyt., BPD, Neutropenia, Suspect Lipid, Gestation, Acidosis, HyperNa,
   PulmHemorrh, Coag, Hypertension]

· Splice junction domain
  [Figure: learned tree over the junction type and sequence positions -7 ... +8]

Trees - basic operations

  T(x) = ∏_{uv ∈ E} T_uv(x_u, x_v)  /  ∏_{v ∈ V} T_v(x_v)^(deg v - 1),    |V| = n

· Querying the model
  · computing the likelihood T(x)                       ~ n
  · conditioning T_{V-A|A} (junction tree algorithm)    ~ n
  · marginalization T_uv for arbitrary u, v             ~ n
  · sampling                                            ~ n

· Estimating the model
  · fitting to a given distribution                     ~ n^2
  · learning from data                                  ~ n^2 Ndata

· the tree is a simple model
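Sampling in ~ n is ancestral sampling once the tree is rooted: draw the root from its marginal, then each child from its conditional table given the parent. A minimal sketch with illustrative tables (assumed, not from the slides):

```python
# Minimal sketch of ancestral sampling from a tree rooted at node 0
# (3-node chain 0 - 1 - 2; all tables are made up for illustration).
import numpy as np

rng = np.random.default_rng(0)
T_root = np.array([0.5, 0.5])                    # T_0(x_0)
T_cond = {                                       # T_{v|parent}, rows = parent value
    (1, 0): np.array([[0.9, 0.1], [0.2, 0.8]]),
    (2, 1): np.array([[0.7, 0.3], [0.4, 0.6]]),
}
order = [(1, 0), (2, 1)]                         # children visited after parents

def sample_tree():
    x = {0: rng.choice(2, p=T_root)}
    for child, parent in order:
        x[child] = rng.choice(2, p=T_cond[(child, parent)][x[parent]])
    return [int(x[v]) for v in sorted(x)]

print(sample_tree())
```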

The mixture of trees

· h = "hidden" variable,  P( h = k ) = λ_k,   k = 1, 2, ..., m

    Q(x) = Σ_{k=1..m} λ_k T^k(x)

· NOT a graphical model
· computational efficiency preserved

(Meila '97)
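A minimal sketch of evaluating Q(x), assuming each component tree is available as a callable x -> T^k(x); the component used below is an illustrative placeholder, not a learned tree:

```python
# Minimal sketch: Q(x) = sum_k lambda_k T^k(x)
import numpy as np

def mixture_prob(x, lambdas, tree_evaluators):
    """lambdas sum to 1; tree_evaluators[k] returns T^k(x)."""
    return sum(lam * Tk(x) for lam, Tk in zip(lambdas, tree_evaluators))

uniform3 = lambda x: 0.125            # placeholder tree: uniform on 3 binary vars
lambdas = np.array([0.3, 0.7])
print(mixture_prob([0, 1, 0], lambdas, [uniform3, uniform3]))   # -> 0.125
```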

Learning - problem formulation

· Maximum Likelihood learning· given a data set D = { x1, . . . xN }· find the model that best predicts the data

Topt = argmax T(D)

· Fitting a tree to a distribution· given a data set D = { x1, . . . xN }

and distribution P that weights each data point, · find

Topt = argmin KL( P || T )

· KL is Kullbach-Leibler divergence· includes Maximum likelihood learning as a special case

Fitting a tree to a distribution

    T^opt = argmin_T KL( P || T )

· optimization over structure + parameters
· sufficient statistics
  · pairwise probability tables  P_uv = N_uv / N,   u, v ∈ V
  · mutual informations

      I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / ( P_u(x_u) P_v(x_v) ) ]

(Chow & Liu '68)
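A minimal sketch of the sufficient statistics, assuming binary variables and a 2x2 cooccurrence count table; the counts below are illustrative:

```python
# Minimal sketch: pairwise mutual information I_uv from a count table N_uv.
import numpy as np

def mutual_information(N_uv):
    P_uv = N_uv / N_uv.sum()                   # P_uv = N_uv / N
    P_u = P_uv.sum(axis=1, keepdims=True)
    P_v = P_uv.sum(axis=0, keepdims=True)
    mask = P_uv > 0                            # 0 log 0 = 0 convention
    return float((P_uv[mask] * np.log(P_uv[mask] / (P_u * P_v)[mask])).sum())

counts = np.array([[40, 10],
                   [10, 40]])
print(mutual_information(counts))              # positive: u and v are dependent
```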

Fitting a tree to a distribution - solution

· Structure

      E^opt = argmax_E Σ_{uv ∈ E} I_uv

  · found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv

· Parameters: copy the marginals of P

      T_uv = P_uv   for uv ∈ E

[Figure: spanning tree selected among candidate edges weighted by
 I_12, I_23, I_34, I_45, I_56, I_61, I_63]
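A minimal sketch of the structure step, using scipy's minimum_spanning_tree on negated weights to obtain a maximum weight spanning tree; the use of scipy and the example mutual-information matrix are assumptions, not the slides' own implementation:

```python
# Minimal sketch: Chow-Liu structure = maximum weight spanning tree over I_uv.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(I):
    """I: symmetric (n x n) matrix of mutual informations I_uv."""
    W = -(I + 1e-9)            # negate so a *minimum* spanning tree maximizes sum I_uv;
    np.fill_diagonal(W, 0.0)   # the small shift keeps zero-information edges represented
    mst = minimum_spanning_tree(W).tocoo()
    return list(zip(mst.row.tolist(), mst.col.tolist()))

I = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.5],
              [0.1, 0.5, 0.0]])
print(chow_liu_edges(I))       # edges {0-1, 1-2}: the two highest-I_uv edges
```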

Learning mixtures by the EM algorithm

· E step: which x_i come from T^k?  ->  distribution P^k(x)
· M step: fit T^k to its set of points:  min KL( P^k || T^k )

· Initialize randomly
· converges to a local maximum of the likelihood

(Meila & Jordan '97)
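A minimal sketch of the EM loop, assuming a weighted tree-fitting routine `fit_tree` (a stand-in for the Chow & Liu fit on the weighted distribution P^k) is supplied by the caller; everything here is illustrative, not the slides' code:

```python
# Minimal sketch of EM for a mixture of trees.
import numpy as np

def em_mixture_of_trees(data, trees, lambdas, fit_tree, n_iter=20):
    """trees: list of callables x -> T^k(x); lambdas: mixing weights;
    fit_tree(data, weights) returns a new callable fitted to weighted data."""
    for _ in range(n_iter):
        # E step: responsibility of tree k for each data point x_i
        lik = np.array([[lam * Tk(x) for lam, Tk in zip(lambdas, trees)]
                        for x in data])                      # shape (N, m)
        gamma = lik / lik.sum(axis=1, keepdims=True)
        # M step: re-estimate mixing weights and refit each tree
        lambdas = gamma.mean(axis=0)
        trees = [fit_tree(data, gamma[:, k]) for k in range(len(trees))]
    return trees, lambdas
```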

Remarks

· Learning a tree
  · solution is globally optimal over structures and parameters
  · tractable: running time ~ n^2 N

· Learning a mixture by the EM algorithm
  · both E and M steps are exact, tractable
  · running time
    · E step ~ m n N
    · M step ~ m n^2 N
  · assumes m known
  · converges to a local optimum

Finding structure - the bars problem

[Figure: example data (n = 25 variables) and the learned structure]

· Structure recovery: 19 out of 20 trials
· Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
· Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95

Experiments - density estimation

· Digits and digit pairs: Ntrain = 6000, Nvalid = 2000, Ntest = 5000
  · digits:       n = 64 variables   ( m = 16 trees )
  · digit pairs:  n = 128 variables  ( m = 32 trees )

[Figure: density estimation results for mixtures of trees (MixTrees)]

DNA splice junction classification

· n = 61 variables
· class = Intron/Exon, Exon/Intron, Neither

[Figure: classification results for Tree vs. TANB, NB, and supervised methods (DELVE)]

Discovering structure

· IE junction (Intron -> Exon)

  position  15  16  ...  25  26  27  28  29  30  31
  Tree      -   CT  CT   CT  -   -   CT  A   G   G
  True      CT  CT  CT   CT  -   -   CT  A   G   G

· EI junction (Exon -> Intron)

  position  28  29  30  31  32  33  34  35  36
  Tree      CA  A   G   G   T   AG  A   G   -
  True      CA  A   G   G   T   AG  A   G   T

  (Watson, "The molecular biology of the gene", '87)

[Figure: tree adjacency matrix, including the class variable]

Irrelevant variables

· 61 original variables + 60 "noise" variables

[Figure: learned structure on the original data vs. the data augmented with irrelevant variables]

Accelerated tree learning

· Running time of the tree learning algorithm ~ n^2 N
· Quadratic running time may be too slow. Example: document classification
  · document = data point  -->  N = 10^3 - 10^4
  · word = variable        -->  n = 10^3 - 10^4
  · sparse data            -->  # words per document is s, with s << n, N
· Can sparsity be exploited to create faster algorithms?

(Meila '99)

Sparsity

· assume a special value "0" that occurs frequently
· sparsity s = # non-zero variables in each data point, with s << n, N
· Idea: "do not represent / count the zeros"

· Sparse data, stored as linked lists of length s over the non-zero entries:

  0 1 0 0 0 0 1 0 0 0 0 1 0 0 0
  0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
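A minimal sketch of the idea, assuming each data point is stored as the list of its non-zero variable indices (the example rows are the three vectors above):

```python
# Minimal sketch: cooccurrence counts from sparse data in ~ s^2 per point
# instead of ~ n^2, by never touching the zeros.
from collections import Counter
from itertools import combinations

def sparse_counts(data_nonzeros):
    """data_nonzeros: list of sorted lists of non-zero variable indices."""
    N_v = Counter()    # how often each variable is non-zero
    N_uv = Counter()   # how often each pair is jointly non-zero
    for nz in data_nonzeros:
        N_v.update(nz)
        N_uv.update(combinations(nz, 2))
    return N_v, N_uv

data = [[1, 6, 11], [3, 9], [1, 14]]   # the three example rows above
print(sparse_counts(data))
```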

Presort mutual informations

· Theorem (Meila '99) If v, v' are variables that do not cooccur with u in V
  (i.e. N_uv = N_uv' = 0) then

      N_v > N_v'  ==>  I_uv > I_uv'

· Consequences
  · sort the N_v  =>  all edges uv with N_uv = 0 are implicitly sorted by I_uv
  · these edges need not be represented explicitly
  · construct a black box that outputs the next "largest" edge
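An illustrative numerical check of the theorem for binary variables; the counts N, N_u, N_v below are assumed, not from the slides:

```python
# Illustrative check: when u and v never cooccur (N_uv = 0), I_uv grows with N_v
# (N and N_u held fixed), so non-cooccurring edges can be presorted by N_v.
import numpy as np

def I_no_cooccurrence(N, N_u, N_v):
    counts = np.array([[N - N_u - N_v, N_v],   # rows: u = 0, 1; cols: v = 0, 1
                       [N_u,           0.0]])
    P = counts / N
    Pu = P.sum(axis=1, keepdims=True)
    Pv = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / (Pu * Pv)[mask])).sum())

N, N_u = 1000, 50
print([round(I_no_cooccurrence(N, N_u, N_v), 5) for N_v in (10, 50, 200)])
# the values increase with N_v, matching  N_v > N_v'  ==>  I_uv > I_uv'
```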

The black box data structure

· for each variable:
  · a list of the neighbors u with N_uv > 0, sorted by I_uv
  · a (virtual) list of the u with N_uv = 0, implicitly sorted by their counts
    (using the presorting theorem)
· an F-heap of size ~ n over the heads of these lists outputs the next edge uv

· Total running time:  n log n + s^2 N + n K log n
  (standard algorithm running time:  n^2 N)

Experiments - sparse binary data

· N = 10,000
· s = 5, 10, 15, 100

[Figure: running times, standard vs. accelerated algorithm]

Remarks

· Realistic assumption · Exact algorithm, provably efficient time bounds· Degrades slowly to the standard algorithm if data not sparse· General

· non-integer counts· multi-valued discrete variables

Bayesian learning of trees   (Meila & Jaakkola '00)

· Problem
  · given a prior distribution over trees P0(T) and data D = { x1, . . . xN }
  · find the posterior distribution P(T|D)

· Advantages
  · incorporates prior knowledge
  · regularization

· Solution: Bayes' formula

      P(T|D) = (1/Z) P0(T) ∏_{i=1..N} T(x_i)

· practically hard
  · a distribution over structures E and parameters θ_E is hard to represent
  · computing Z is intractable in general
  · exception: conjugate priors

Decomposable priors

· want priors that factor over the tree edges:

      P0(T) = ∏_{uv ∈ E} f( β_uv, θ_u|v )

· prior for the structure E

      P0(E) ∝ ∏_{uv ∈ E} β_uv

· prior for the tree parameters

      P0(θ | E) ∝ ∏_{uv ∈ E} D( θ_u|v ; N'_uv )

  · a (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v),  u, v ∈ V
  · the posterior is also Dirichlet, with hyper-parameters
    N_uv(x_u, x_v) + N'_uv(x_u, x_v),  u, v ∈ V

Decomposable posterior

· Posterior distribution

      P(T|D) ∝ ∏_{uv ∈ E} W_uv,    with   W_uv = β_uv D( θ_u|v ; N'_uv + N_uv )

· factored over edges
· same form as the prior
· it remains to compute the normalization constant

The Matrix tree theorem

· If the prior over structures is

      P0(E) = (1/Z) ∏_{uv ∈ E} β_uv,    β_uv ≥ 0,

  define the matrix M(β), indexed by the vertices with one fixed vertex removed:

      M(β)_uv = -β_uv   for u ≠ v,        M(β)_vv = Σ_{v' ≠ v} β_vv'

  Then   Z = det M(β)

· discrete case: graph theory; continuous case: Meila & Jaakkola '99
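A minimal sketch of computing Z via the weighted matrix tree theorem, with an illustrative 3-node β (assumed, not from the slides):

```python
# Minimal sketch: Z = sum over spanning trees of prod beta_uv
#               = det of a first minor of the weighted Laplacian.
import numpy as np

def tree_partition_function(beta):
    """beta: symmetric (n x n) matrix of non-negative edge weights, zero diagonal."""
    L = np.diag(beta.sum(axis=1)) - beta      # weighted Laplacian
    return float(np.linalg.det(L[1:, 1:]))    # remove one row/column, take det

beta = np.array([[0.0, 2.0, 3.0],
                 [2.0, 0.0, 5.0],
                 [3.0, 5.0, 0.0]])
# the three spanning trees on 3 nodes give Z = 2*3 + 2*5 + 3*5 = 31
print(tree_partition_function(beta))          # -> ~31.0
```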

Remarks on the decomposable prior

· It is a conjugate prior for the tree distribution
· It is tractable
  · defined by ~ n^2 parameters
  · computed exactly in ~ n^3 operations
  · posterior obtained in ~ n^2 N + n^3 operations
  · derivatives w.r.t. parameters, averaging, . . . ~ n^3

· Mixtures of trees with decomposable priors
  · MAP estimation with the EM algorithm is tractable

· Other applications
  · ensembles of trees
  · maximum entropy distributions over trees

So far . . .

· Trees and mixtures of trees are structured statistical models
· Algorithmic techniques enable efficient learning
  · mixtures of trees
  · accelerated algorithm
  · matrix tree theorem & Bayesian learning
· Examples of usage
  · structure learning
  · compression
  · classification

Generative models and discrimination

· Trees are generative models
  · descriptive
  · can perform many tasks, but suboptimally

· Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  · optimize for specific tasks
  · use generative models
  · combine simple models into ensembles
  · complexity control by an information-theoretic principle

· Discrimination tasks
  · detecting novelty
  · diagnosis
  · classification

Bridging the gap

[Diagram: descriptive learning <-> discriminative learning, linked through tasks]

Future . . .

· Tasks have structure
  · multi-way classification
  · multiple indexing of documents
  · gene expression data
  · hierarchical, sequential decisions

· Learn structured decision tasks
  · sharing information between tasks (transfer)
  · modeling dependencies between decisions