TRANSCRIPT

Introduction to Machine Learning: Probabilistic Graphical Models

Yifeng Tao
School of Computer Science, Carnegie Mellon University

Slides adapted from Eric Xing, Matt Gormley
Recap of Basic Probability Concepts
o Representation: the joint probability distribution on multiple binary variables?
o State configurations in total: 2^8
o Do they all need to be represented? Do we get any scientific/medical insight?
o Learning: where do we get all these probabilities? Maximum-likelihood estimation?
o Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence?
o Computing p(H|A) would require summing over all 2^6 configurations of the unobserved variables (written out below)
[Slide from Eric Xing.]
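The cost of naive inference can be written out explicitly. A minimal sketch in LaTeX, assuming the running example of eight binary variables, with H the queried latent variable, A the observed evidence, and X_1, ..., X_6 the remaining unobserved variables:

p(H \mid A) \;=\; \frac{p(H, A)}{p(A)}
\;=\; \frac{\sum_{x_1,\dots,x_6 \in \{0,1\}^6} p(H, A, x_1, \dots, x_6)}
           {\sum_{h} \sum_{x_1,\dots,x_6} p(h, A, x_1, \dots, x_6)}

The numerator alone sums over 2^6 = 64 joint configurations; with more variables this blows up exponentially, which motivates the structured representations that follow.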
Graphical Model: Structure Simplifies Representation
o Dependencies among variables
[Slide from Eric Xing.]
Probabilistic Graphical Models
o If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms; e.g., see the reconstruction below.
o Why might we favor a PGM?
o Incorporation of domain knowledge and causal (logical) structures
o 2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 in representation cost!
[Slide from Eric Xing.]
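A plausible form of the example factorization, assuming the standard eight-variable network whose CPT sizes add up to the quoted 2+2+4+4+4+8+4+8 = 36:

p(X_1, \dots, X_8) \;=\; p(X_1)\, p(X_2)\, p(X_3 \mid X_1)\, p(X_4 \mid X_2)\, p(X_5 \mid X_2)\, p(X_6 \mid X_3, X_4)\, p(X_7 \mid X_6)\, p(X_8 \mid X_5, X_6)

Each factor needs two entries per parent configuration, giving tables of sizes 2, 2, 4, 4, 4, 8, 4, 8, i.e. 36 numbers instead of the 2^8 = 256 needed for the full joint.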
Two types of GMs
o Directed edges give causality relationships (Bayesian network, or directed graphical model):
o Undirected edges simply give correlations between variables (Markov random field, or undirected graphical model):
[Slide from Eric Xing.]
Bayesian Network
o Definition: a Bayesian network consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution (see the factorization below):
o Qualitative specification: G
o Quantitative specification: P
[Slide from Eric Xing.]
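In symbols, the two parts combine into a single joint distribution: the graph G fixes which conditionals appear, and P supplies their values:

p(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} p\big(X_i \mid \mathrm{pa}(X_i)\big)

where pa(X_i) denotes the parents of X_i in G.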
Where does the qualitative specification come from?
o Prior knowledge of causal relationships
o Learning from data (i.e., structure learning)
o We simply prefer a certain architecture (e.g., a layered graph)
o …
[Slide from Matt Gormley.]
Quantitative Specification
o Example: conditional probability tables (CPTs) for discrete random variables
[Slide from Eric Xing.]
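As a concrete illustration, a CPT can be stored as one small table per parent configuration. A minimal sketch in Python, with toy numbers and hypothetical node names (not the slide's actual table):

# A minimal CPT sketch for a binary node X3 with one binary parent X1.
# Rows are parent configurations; each row sums to 1. (Toy numbers,
# chosen for illustration only.)
cpt_x3_given_x1 = {
    0: {0: 0.9, 1: 0.1},  # P(X3 | X1 = 0)
    1: {0: 0.3, 1: 0.7},  # P(X3 | X1 = 1)
}

# Sanity check: every conditional distribution normalizes.
for x1, row in cpt_x3_given_x1.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9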
Quantitative Specification
o Example: conditional probability density functions (CPDs) for continuous random variables
[Slide from Eric Xing.]
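One common parametric choice for a continuous node X with continuous parent Y is the linear-Gaussian CPD (used here as an illustrative assumption, not necessarily the slide's example):

p(x \mid y) \;=\; \mathcal{N}\big(x;\; a y + b,\; \sigma^2\big)
\;=\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - a y - b)^2}{2\sigma^2} \right)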
Observed Variables
o In a graphical model, shaded nodes are "observed", i.e., their values are given
[Slide from Matt Gormley.]
GMs are your old friends
o Density estimation: parametric and nonparametric methods
o Regression: linear, conditional mixture, nonparametric
o Classification: generative and discriminative approaches
o Clustering
[Slide from Eric Xing.]
What Independencies does a Bayes Net Model?
o Independence of X and Z given Y? P(X|Y) P(Z|Y) = P(X,Z|Y)
o Three cases of interest...
o Proof? (A derivation for the common-parent case is sketched below.)
[Slide from Matt Gormley.]
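For the common-parent case X ← Y → Z, the derivation is one line: the network factorizes as p(X, Y, Z) = p(Y) p(X|Y) p(Z|Y), so

p(X, Z \mid Y) \;=\; \frac{p(X, Y, Z)}{p(Y)}
\;=\; \frac{p(Y)\, p(X \mid Y)\, p(Z \mid Y)}{p(Y)}
\;=\; p(X \mid Y)\, p(Z \mid Y)

The cascade case X → Y → Z goes through similarly; in the v-structure X → Y ← Z the conclusion reverses (X and Z are marginally independent but become dependent given Y).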
The “Burglar Alarm” example
o Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
o Earth arguably doesn't care whether your house is currently being burgled.
o While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing.
[Slide from Matt Gormley.]
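This setup is the classic "explaining away" effect, and it can be checked numerically by brute-force enumeration. A minimal sketch with assumed toy CPT values (standard textbook-style numbers, not taken from the slide):

# Explaining away in the burglar-alarm network, checked by brute-force
# enumeration of the joint. All CPT values are assumed toy numbers.
P_B = 0.001                           # P(burglary = 1)
P_E = 0.002                           # P(earthquake = 1)
P_A = {(0, 0): 0.001, (0, 1): 0.29,   # P(alarm = 1 | b, e)
       (1, 0): 0.94,  (1, 1): 0.95}

def joint(b, e, a):
    pb = P_B if b else 1.0 - P_B
    pe = P_E if e else 1.0 - P_E
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    return pb * pe * pa

def p_burglary_given(alarm, earthquake=None):
    """P(B = 1 | A = alarm [, E = earthquake]) by summing the joint."""
    num = den = 0.0
    for b in (0, 1):
        for e in (0, 1):
            if earthquake is not None and e != earthquake:
                continue
            p = joint(b, e, alarm)
            den += p
            if b == 1:
                num += p
    return num / den

print(p_burglary_given(alarm=1))                 # ~0.37: the call raises belief in burglary
print(p_burglary_given(alarm=1, earthquake=1))   # ~0.003: the earthquake "explains it away"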
Markov Blanket
o Def: the co-parents of a node are the parents of its children
o Def: the Markov blanket of a node is the set containing the node's parents, children, and co-parents
o Thm: a node is conditionally independent of every other node in the graph given its Markov blanket
o Example: the Markov blanket of X6 is {X3, X4, X5, X8, X9, X10}
[Slide from Matt Gormley.]
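The theorem can be written compactly. For a node X_i, with X_{-i} denoting all other nodes:

p(X_i \mid X_{-i}) \;=\; p\big(X_i \mid \mathrm{MB}(X_i)\big)
\;\propto\; p\big(X_i \mid \mathrm{pa}(X_i)\big) \prod_{C \,\in\, \mathrm{ch}(X_i)} p\big(C \mid \mathrm{pa}(C)\big)

Only the factors mentioning X_i survive, and those involve exactly the parents, children, and co-parents. This is also what makes Gibbs sampling (later in the lecture) cheap.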
D-Separation
o Thm: if variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E
o Definition: variables X and Z are d-separated given a set of evidence variables E iff every path from X to Z is "blocked"
o A path is blocked if it passes through some node W such that either (a) W is in E and the path meets W in a cascade (→ W →) or common-parent (← W →) configuration, or (b) the path meets W head-to-head (→ W ←, a v-structure) and neither W nor any descendant of W is in E
[Slide from Matt Gormley.]
Machine Learning
[Slide from Matt Gormley.]
Recipe for Closed-form MLE
[Slide from Matt Gormley.]
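One standard version of the closed-form MLE recipe, with a Bernoulli variable as a worked example (assumed here for concreteness, not copied from the slide): (1) write the log-likelihood of the data; (2) differentiate with respect to each parameter; (3) set the derivatives to zero and solve in closed form.

\ell(\phi) = \sum_{n=1}^{N} \log\!\left[ \phi^{x_n} (1 - \phi)^{1 - x_n} \right], \qquad
\frac{d\ell}{d\phi} = \frac{\sum_n x_n}{\phi} - \frac{N - \sum_n x_n}{1 - \phi} = 0
\;\Rightarrow\; \hat{\phi}_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} x_n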
Learning Fully Observed BNs
o How do we learn these conditional and marginal distributions for a Bayes net?
[Slide from Matt Gormley.]
Learning Fully Observed BNs
o Learning this fully observed Bayesian network is equivalent to learning five (small/simple) independent networks from the same data (a count-and-normalize sketch follows below)
[Slide from Matt Gormley.]
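For discrete CPTs, each of those small problems has a closed-form MLE: count and normalize. A minimal sketch with hypothetical variable names and toy data:

from collections import Counter

# Fully observed toy samples of (parent P, child C).
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0)]

# The MLE of P(C = c | P = p) is N(p, c) / N(p).
pair_counts   = Counter(data)                 # N(p, c)
parent_counts = Counter(p for p, _ in data)   # N(p)

cpt = {(p, c): n / parent_counts[p] for (p, c), n in pair_counts.items()}
print(cpt)   # e.g. P(C=0 | P=0) = 2/3, P(C=1 | P=1) = 2/3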
Learning Partially Observed BNs
o Partially observed Bayesian network: maximum-likelihood estimation → incomplete log-likelihood
o The log-likelihood contains unobserved latent variables
o Solve with the EM algorithm
o Example: Gaussian mixture models (GMMs); the standard updates are sketched below
[Slide from Eric Xing.]
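For GMMs, the EM updates are standard and closed-form within each iteration. With responsibilities γ_{nk} for data point x_n and component k:

\text{E-step:}\quad \gamma_{nk} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

\text{M-step:}\quad N_k = \sum_n \gamma_{nk}, \qquad
\mu_k = \frac{1}{N_k} \sum_n \gamma_{nk}\, x_n, \qquad
\Sigma_k = \frac{1}{N_k} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad
\pi_k = \frac{N_k}{N}

Each iteration never decreases the incomplete log-likelihood, so EM climbs to a local optimum.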
Inference of BNs
o Suppose we already have the parameters of a Bayesian network...
[Slide from Matt Gormley.]
Approaches to inference
o Exact inference algorithms
o The elimination algorithm → message passing
o Belief propagation
o The junction tree algorithms
o Approximate inference techniques
o Variational algorithms
o Stochastic simulation / sampling methods
o Markov chain Monte Carlo methods
[Slide from Eric Xing.]
Marginalization and Elimination
[Slide from Eric Xing.]
o Step 8: Wrap-up
[Slide from Eric Xing.]
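A minimal elimination sketch on an assumed chain A → B → C → D → E with toy CPTs (the slide's worked example is not reproduced here): eliminating one variable at a time turns an exponential sum into a sequence of small matrix-vector products.

import numpy as np

rng = np.random.default_rng(0)

def random_cpt():
    """A toy 2x2 table T[a, b] = P(B = b | A = a); rows sum to 1."""
    t = rng.random((2, 2))
    return t / t.sum(axis=1, keepdims=True)

# Chain A -> B -> C -> D -> E with assumed toy CPTs.
p_a = np.array([0.6, 0.4])
p_b_a, p_c_b, p_d_c, p_e_d = (random_cpt() for _ in range(4))

# Eliminate A, B, C, D in turn; each intermediate "message" has length 2.
m_b = p_a @ p_b_a      # sum_a P(a) P(b|a)  = P(b)
m_c = m_b @ p_c_b      # sum_b P(b) P(c|b)  = P(c)
m_d = m_c @ p_d_c      # = P(d)
p_e = m_d @ p_e_d      # = P(e), computed with a few small sums

# Brute-force check: sum the full joint over all 2^5 configurations.
brute = np.zeros(2)
for a in range(2):
    for b in range(2):
        for c in range(2):
            for d in range(2):
                for e in range(2):
                    brute[e] += (p_a[a] * p_b_a[a, b] * p_c_b[b, c]
                                 * p_d_c[c, d] * p_e_d[d, e])
assert np.allclose(p_e, brute)
print(p_e)

On a chain of n variables with k states each, elimination costs O(n k^2) instead of the O(k^n) of the naive sum.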
Elimination algorithm
o Elimination on trees is equivalent to message passing along branches
o Message passing is consistent in trees
o Application: HMMs
[Slide from Eric Xing.]
Gibbs Sampling
o Full conditionals only need to condition on the Markov blanket
o Must be "easy" to sample from the conditionals
o Many conditionals are log-concave and are amenable to adaptive rejection sampling (a minimal Gibbs sketch follows below)
[Slide from Matt Gormley.]
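A minimal Gibbs sketch for a toy model of two binary variables with an assumed unnormalized joint: each step resamples one variable from its full conditional, which depends only on the current value of the other variable (its Markov blanket here).

import numpy as np

rng = np.random.default_rng(1)

# Assumed unnormalized joint phi[x, y] over two binary variables.
phi = np.array([[4.0, 1.0],
                [1.0, 4.0]])

def sample_x_given_y(y):
    p = phi[:, y] / phi[:, y].sum()   # full conditional P(x | y)
    return rng.choice(2, p=p)

def sample_y_given_x(x):
    p = phi[x, :] / phi[x, :].sum()   # full conditional P(y | x)
    return rng.choice(2, p=p)

x, y = 0, 0
counts = np.zeros((2, 2))
for t in range(20000):
    x = sample_x_given_y(y)   # resample each variable in turn,
    y = sample_y_given_x(x)   # conditioning on the current rest
    if t >= 1000:             # discard a burn-in prefix
        counts[x, y] += 1

print(counts / counts.sum())  # approaches phi / phi.sum() = [[0.4, 0.1], [0.1, 0.4]]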
Take home message
o Graphical models portray the sparse dependencies among variables
o Two types of graphical models: Bayesian networks and Markov random fields
o Conditional independence, Markov blankets, and d-separation
o Learning fully observed and partially observed Bayesian networks
o Exact inference and approximate inference of Bayesian networks
References
o Eric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701/
o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html