Machine Learning CMPT 726 Simon Fraser University CHAPTER 1: INTRODUCTION


Page 1: Machine Learning CMPT 726 Simon Fraser  University

Machine Learning
CMPT 726
Simon Fraser University

CHAPTER 1: INTRODUCTION

Page 2: Machine Learning CMPT 726 Simon Fraser  University

Outline

• Comments on general approach.
• Probability Theory.
  • Joint, conditional and marginal probabilities.
  • Random Variables.
  • Functions of R.V.s.
• Bernoulli Distribution (Coin Tosses).
  • Maximum Likelihood Estimation.
  • Bayesian Learning With Conjugate Prior.
• The Gaussian Distribution.
  • Maximum Likelihood Estimation.
  • Bayesian Learning With Conjugate Prior.
• More Probability Theory.
  • Entropy.
  • KL Divergence.

Page 3: Machine Learning CMPT 726 Simon Fraser  University

Our Approach

• The course generally follows statistics, but is very interdisciplinary.
• Emphasis on predictive models: guess the value(s) of target variable(s) ("Pattern Recognition").
• Generally a Bayesian approach, as in the text.
• Compared to standard Bayesian statistics:
  • more complex models (neural nets, Bayes nets)
  • more discrete variables
  • more emphasis on algorithms and efficiency

Page 4: Machine Learning CMPT 726 Simon Fraser  University

Things Not Covered

• Within statistics:
  • Hypothesis testing.
  • Frequentist theory, learning theory.
• Other types of data (not random samples):
  • Relational data.
  • Scientific data (automated scientific discovery).
• Action + learning = reinforcement learning.

Could be optional – what do you think?

Page 5: Machine Learning CMPT 726 Simon Fraser  University

Probability Theory

Apples and Oranges

Page 6: Machine Learning CMPT 726 Simon Fraser  University

Probability Theory

Marginal Probability

Conditional Probability

Joint Probability

Page 7: Machine Learning CMPT 726 Simon Fraser  University

Probability Theory

Sum Rule

Product Rule

Page 8: Machine Learning CMPT 726 Simon Fraser  University

The Rules of Probability

Sum Rule

Product Rule
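For reference, the standard rules from the text (PRML):

\[ p(X) = \sum_Y p(X, Y) \qquad \text{(sum rule)} \]
\[ p(X, Y) = p(Y \mid X)\, p(X) \qquad \text{(product rule)} \]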

Page 9: Machine Learning CMPT 726 Simon Fraser  University

Bayes’ Theorem

posterior ∝ likelihood × prior
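Written out, the standard form from the text is:

\[ p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_Y p(X \mid Y)\, p(Y) \]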

Page 10: Machine Learning CMPT 726 Simon Fraser  University

Bayes’ Theorem: Model Version

• Let M be model, E be evidence.

• P(M|E) is proportional to P(M) × P(E|M).

Intuition:
• prior = how plausible is the event (model, theory) a priori, before seeing any evidence?
• likelihood = how well does the model explain the data?

Page 11: Machine Learning CMPT 726 Simon Fraser  University

Probability Densities
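For reference, the standard definitions from the text:

\[ p\big(x \in (a, b)\big) = \int_a^b p(x)\, dx, \qquad p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1 \]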

Page 12: Machine Learning CMPT 726 Simon Fraser  University

Transformed Densities

[Figure note by Markus Svensén: the figure was taken from Solution 1.4 in the web edition of the solutions manual for PRML, available at http://research.microsoft.com/~cmbishop/PRML; a more thorough explanation of what the figure shows is provided in the text of that solution.]
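For reference, the change-of-variables formula the figure illustrates (the standard result from the text): if x = g(y), then

\[ p_y(y) = p_x\big(g(y)\big)\, \lvert g'(y) \rvert \]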
Page 13: Machine Learning CMPT 726 Simon Fraser  University

Expectations

Conditional Expectation (discrete)

Approximate Expectation (discrete and continuous)
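For reference, the standard definitions from the text:

\[ \mathbb{E}[f] = \sum_x p(x)\, f(x), \qquad \mathbb{E}[f] = \int p(x)\, f(x)\, dx \]
\[ \mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x), \qquad \mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) \]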

Page 14: Machine Learning CMPT 726 Simon Fraser  University

Variances and Covariances
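For reference, the standard definitions:

\[ \operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \]
\[ \operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y] \]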

Page 15: Machine Learning CMPT 726 Simon Fraser  University

The Gaussian Distribution
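For reference, the standard univariate form:

\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} \]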

Page 16: Machine Learning CMPT 726 Simon Fraser  University

Gaussian Mean and Variance

Page 17: Machine Learning CMPT 726 Simon Fraser  University

The Multivariate Gaussian
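For a D-dimensional vector x, the standard multivariate form:

\[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \]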

Page 18: Machine Learning CMPT 726 Simon Fraser  University

Reading exponential prob formulas

• In an infinite space we cannot just give every outcome the same probability: the sum Σx p(x) would grow to infinity.

• Instead, use an exponentially decaying form, e.g. p(n) = (1/2)^n.

• Suppose there is a relevant feature f(x) and I want to express that “the greater f(x) is, the less probable x is”.

• Use p(x) = exp(-f(x)).
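As an added check (not from the slide), the coin-toss example is properly normalized over the infinite space, and in general exp(-f(x)) needs a normalizing constant:

\[ \sum_{n=1}^{\infty} \left(\tfrac{1}{2}\right)^n = 1, \qquad p(x) = \frac{1}{Z} \exp\{-f(x)\}, \quad Z = \sum_x \exp\{-f(x)\} \]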

Page 19: Machine Learning CMPT 726 Simon Fraser  University

Example: exponential form sample size

• Fair coin: the longer the sample, the less likely that particular sample is.

• p(n) = 2^{-n}.

[Figure: plot of ln p(n) against the sample size n.]

Page 20: Machine Learning CMPT 726 Simon Fraser  University

Exponential Form: Gaussian mean

• The further x is from the mean, the less likely it is.

[Figure: plot of ln p(x) against (x − μ)².]

Page 21: Machine Learning CMPT 726 Simon Fraser  University

Smaller variance decreases probability

• The smaller the variance σ², the less likely x is (away from the mean).

[Figure: plot relating ln p(x) to the variance σ².]

Page 22: Machine Learning CMPT 726 Simon Fraser  University

Minimal energy = max probability

• The greater the energy (of the joint state), the less probable the state is.

[Figure: plot of ln p(x) against the energy E(x).]
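A minimal sketch of the energy-based (Boltzmann) form this slide alludes to, assuming the standard definition:

\[ p(\mathbf{x}) = \frac{1}{Z} \exp\{-E(\mathbf{x})\}, \qquad Z = \sum_{\mathbf{x}} \exp\{-E(\mathbf{x})\}, \qquad \ln p(\mathbf{x}) = -E(\mathbf{x}) - \ln Z \]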

Page 23: Machine Learning CMPT 726 Simon Fraser  University

Gaussian Parameter Estimation

Likelihood function

Page 24: Machine Learning CMPT 726 Simon Fraser  University

Maximum (Log) Likelihood
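For reference, the standard likelihood and maximum-likelihood estimates from the text:

\[ p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) \]
\[ \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \]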

Page 25: Machine Learning CMPT 726 Simon Fraser  University

Properties of μML and σ²ML
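The standard results from the text: the ML mean is unbiased, while the ML variance is biased low,

\[ \mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\, \sigma^2 \]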

Page 26: Machine Learning CMPT 726 Simon Fraser  University

Curve Fitting Re-visited

Page 27: Machine Learning CMPT 726 Simon Fraser  University

Maximum Likelihood

Determine wML by minimizing the sum-of-squares error E(w).
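The reason (the standard derivation in the text): up to terms independent of w, the log likelihood is the negative sum-of-squares error,

\[ \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) \]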

Page 28: Machine Learning CMPT 726 Simon Fraser  University

Predictive Distribution

Page 29: Machine Learning CMPT 726 Simon Fraser  University

Frequentism vs. Bayesianism

• Frequentists: probabilities are measured as the frequencies of repeatable events.
  • E.g., coin flips, snowfalls in January.

• Bayesians: in addition, allow probabilities to be attached to parameter values (e.g., P(μ = 0)).

• Frequentist model selection: give performance guarantees (e.g., 95% of the time the method is right).

• Bayesian model selection: choose a prior distribution over parameters and maximize the resulting posterior.

Page 30: Machine Learning CMPT 726 Simon Fraser  University

MAP: A Step towards Bayes

Determine wMAP by minimizing the regularized sum-of-squares error Ẽ(w).
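Presumably the regularized error is the standard one from the text, with α the precision of a zero-mean Gaussian prior on w:

\[ \widetilde{E}(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2}\, \mathbf{w}^{\mathsf T} \mathbf{w} \]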

Page 31: Machine Learning CMPT 726 Simon Fraser  University

Bayesian Curve Fitting

Page 32: Machine Learning CMPT 726 Simon Fraser  University

Bayesian Predictive Distribution

Page 33: Machine Learning CMPT 726 Simon Fraser  University

Model Selection

Cross-Validation
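A minimal sketch of S-fold cross-validation for choosing the polynomial order, written here as an illustration (the function and variable names are ours, not from the slides):

import numpy as np

def cross_validation_error(x, t, degree, n_folds=5, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial fit."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffle before splitting
    folds = np.array_split(indices, n_folds)   # S roughly equal folds
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        w = np.polyfit(x[train_idx], t[train_idx], degree)   # fit on the other S-1 folds
        pred = np.polyval(w, x[test_idx])                    # predict the held-out fold
        errors.append(np.mean((pred - t[test_idx]) ** 2))
    return float(np.mean(errors))

# Example usage: pick the polynomial order M with the lowest cross-validation error.
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.2 * np.random.default_rng(1).standard_normal(30)
scores = {M: cross_validation_error(x, t, M) for M in range(10)}
best_M = min(scores, key=scores.get)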

Page 34: Machine Learning CMPT 726 Simon Fraser  University

Curse of Dimensionality

Rule of Thumb: 10 datapoints per parameter.

Page 35: Machine Learning CMPT 726 Simon Fraser  University

Curse of Dimensionality

Polynomial curve fitting, M = 3

Gaussian Densities in higher dimensions

Page 36: Machine Learning CMPT 726 Simon Fraser  University

Decision Theory

Inference step: determine either p(x, Ck) or p(Ck | x).

Decision step: for a given x, determine the optimal t.

Page 37: Machine Learning CMPT 726 Simon Fraser  University

Minimum Misclassification Rate

Page 38: Machine Learning CMPT 726 Simon Fraser  University

Minimum Expected Loss

Example: classify medical images as ‘cancer’ or ‘normal’

[Loss matrix table: columns indexed by the decision, rows by the truth.]

Page 39: Machine Learning CMPT 726 Simon Fraser  University

Minimum Expected Loss

Regions are chosen to minimize
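The quantity being minimized is presumably the standard expected loss from the text, with each x assigned to the region Rj whose class minimizes the posterior-weighted loss:

\[ \mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x} \]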

Page 40: Machine Learning CMPT 726 Simon Fraser  University

Why Separate Inference and Decision?

• Minimizing risk (loss matrix may change over time).
• Unbalanced class priors.
• Combining models.

Page 41: Machine Learning CMPT 726 Simon Fraser  University

Decision Theory for Regression

Inference step: determine p(t | x).

Decision step: for a given x, make an optimal prediction y(x) for t.

Loss function:
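Presumably the loss function here is the standard expected loss for regression:

\[ \mathbb{E}[L] = \int\!\!\int L\big(t, y(\mathbf{x})\big)\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt \]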

Page 42: Machine Learning CMPT 726 Simon Fraser  University

The Squared Loss Function
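With squared loss, the standard results from the text: the expected loss is

\[ \mathbb{E}[L] = \int\!\!\int \{ y(\mathbf{x}) - t \}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt \]

and it is minimized by the conditional mean, y(x) = E_t[t | x].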

Page 43: Machine Learning CMPT 726 Simon Fraser  University

Generative vs Discriminative

Generative approach: model p(x | Ck) and p(Ck), then use Bayes' theorem to obtain p(Ck | x).

Discriminative approach: model p(Ck | x) directly.

Page 44: Machine Learning CMPT 726 Simon Fraser  University

Entropy

Important quantity in:
• coding theory
• statistical physics
• machine learning
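For reference, the standard definition:

\[ H[x] = -\sum_x p(x) \ln p(x) \]

(or with log base 2 when measuring information in bits).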

Page 45: Machine Learning CMPT 726 Simon Fraser  University

Entropy

Page 46: Machine Learning CMPT 726 Simon Fraser  University

Entropy

Coding theory: x discrete with 8 possible states; how many bits to transmit the state of x?

All states equally likely
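Worked out for this example: with 8 equally likely states,

\[ H[x] = -8 \times \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \text{ bits} \]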

Page 47: Machine Learning CMPT 726 Simon Fraser  University

Entropy

Page 48: Machine Learning CMPT 726 Simon Fraser  University

The Maximum Entropy Principle

• Commonly used principle for model selection: maximize entropy.

• Example: In how many ways can N identical objects be allocated to M bins?

Entropy is maximized when the objects are spread evenly, i.e. all pi = 1/M.
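Sketching the standard argument from the text: the multiplicity of an allocation with ni objects in bin i, and the corresponding entropy (via Stirling's approximation), are

\[ W = \frac{N!}{\prod_i n_i!}, \qquad H = \frac{1}{N} \ln W \simeq -\sum_i \frac{n_i}{N} \ln \frac{n_i}{N} \]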

Page 49: Machine Learning CMPT 726 Simon Fraser  University

Differential Entropy and the Gaussian

Put bins of width Δ along the real line.

Differential entropy is maximized (for a fixed variance σ²) when p(x) is the Gaussian N(x | μ, σ²),

in which case it takes the maximum value given below.
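Presumably the standard formulas from the text: the differential entropy, and its maximum value for fixed variance,

\[ H[x] = -\int p(x) \ln p(x)\, dx, \qquad H[x] = \tfrac{1}{2}\big\{ 1 + \ln(2\pi\sigma^2) \big\} \text{ for the Gaussian} \]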

Page 50: Machine Learning CMPT 726 Simon Fraser  University

Conditional Entropy
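For reference, the standard definition and the chain rule for entropy:

\[ H[\mathbf{y} \mid \mathbf{x}] = -\int\!\!\int p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y}\, d\mathbf{x}, \qquad H[\mathbf{x}, \mathbf{y}] = H[\mathbf{y} \mid \mathbf{x}] + H[\mathbf{x}] \]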

Page 51: Machine Learning CMPT 726 Simon Fraser  University

The Kullback-Leibler Divergence
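For reference, the standard definition and its key property:

\[ \mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \ge 0, \quad \text{with equality iff } p = q \]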

Page 52: Machine Learning CMPT 726 Simon Fraser  University

Mutual Information
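For reference, the standard definition, as the KL divergence between the joint and the product of the marginals:

\[ I[\mathbf{x}, \mathbf{y}] = \mathrm{KL}\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y}) \big) = H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}] \]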