Source: Bishop book chapter 1 with modifications by Christoph F. Eick
PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 1: INTRODUCTION
Polynomial Curve Fitting
What M should we choose? Model Selection
Given M, what w’s should we choose? Parameter Selection
Experiment: Given a function, create N training examples
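A minimal sketch of this experiment, assuming Bishop's running example (t = sin(2πx) plus Gaussian noise; N, the noise level, and the fitted order M are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Create n training examples: x uniform on [0, 1],
    t = sin(2*pi*x) plus Gaussian noise (noise scale 0.3 assumed)."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, t

x, t = make_data(10)

# Fit an M-th order polynomial to the sample by least squares.
M = 3
w = np.polyfit(x, t, deg=M)
print(np.poly1d(w))
```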
Sum-of-Squares Error Function
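For reference, the model and error function this slide refers to (Bishop, Eqs. 1.1-1.2):

```latex
y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j,
\qquad
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2
```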
0th Order Polynomial
How do M, the quality of the fit, and the ability to generalize relate to each other?
As N↑, E↓. As c(H)↑, first E↓ and then E↑. As c(H)↑, the training error
decreases for some time and then stays constant (frequently at 0).
(E: generalization error; c(H): model complexity.)
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error:
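The formula itself (Bishop, Eq. 1.3), where w* minimizes E(w); dividing by N allows comparison across data sets of different size, and the square root puts the error on the same scale as t:

```latex
E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\ast) / N}
```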
Polynomial Coefficients
Data Set Size: N = 15
9th Order Polynomial
Data Set Size: N = 100
9th Order Polynomial
Increasing the size of the data set alleviates the over-fitting problem.
Regularization
Penalize large coefficient values
Idea: penalize large weights, which contribute to high variance and sensitivity to outliers.
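The regularized error function (Bishop, Eq. 1.4); λ controls the trade-off between fitting the data and keeping the weights small:

```latex
\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2
+ \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2
```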
Regularization: 9th Order Polynomial
Regularization: ln λ = −18
Regularization: E_RMS vs. ln λ
The example demonstrated:
As N↑, E↓. As c(H)↑, first E↓ and then E↑. As c(H)↑, the training error decreases for
some time and then stays constant (frequently at 0).
Polynomial Coefficients
Weight of regularization increases
Probability Theory
Apples and Oranges
Probability Theory
Marginal Probability
Conditional Probability
Joint Probability
Probability Theory
Sum Rule
Product Rule
The Rules of Probability
Sum Rule
Product Rule
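Written out, the two rules these slides build on:

```latex
\text{sum rule:} \quad p(X) = \sum_{Y} p(X, Y)
\qquad
\text{product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X)
```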
Bayes’ Theorem
posterior ∝ likelihood × prior
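Spelled out, following directly from the sum and product rules:

```latex
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)},
\qquad
p(X) = \sum_{Y} p(X \mid Y)\, p(Y)
```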
Probability Densities
Cumulative Distribution Function
Usually in ML!
Transformed Densities
Expectations (f under p(x))
Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
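The definitions behind these three slides (Bishop §1.2.2): the expectation in the discrete and continuous cases, the conditional expectation, and the sample approximation from N points drawn from p(x):

```latex
\mathbb{E}[f] = \sum_x p(x)\, f(x) \quad\text{(discrete)},
\qquad
\mathbb{E}[f] = \int p(x)\, f(x)\, dx \quad\text{(continuous)}

\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x),
\qquad
\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)
```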
Variances and Covariances
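For reference, the corresponding definitions:

```latex
\operatorname{var}[f] = \mathbb{E}\bigl[(f(x) - \mathbb{E}[f(x)])^2\bigr]
= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2

\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\bigl[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\bigr]
= \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]
```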
The Gaussian Distribution
Gaussian Mean and Variance
The Multivariate Gaussian
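The density in one and D dimensions, as these slides show graphically:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
\exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}
```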
Gaussian Parameter Estimation
Likelihood function
Compare: for the observations 2, 2.1, 1.9, 2.05, 1.99, the likelihood under N(2,1) and N(3,1)
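A quick check of this comparison (assuming the second model is N(3,1)): since the points cluster around 2, N(2,1) assigns the sample a far higher likelihood.

```python
import numpy as np
from scipy.stats import norm

data = np.array([2.0, 2.1, 1.9, 2.05, 1.99])

# Likelihood = product of the density values at each observation.
lik_n21 = np.prod(norm.pdf(data, loc=2.0, scale=1.0))
lik_n31 = np.prod(norm.pdf(data, loc=3.0, scale=1.0))

print(lik_n21, lik_n31)  # roughly 0.010 vs. 0.0008
```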
Maximum (Log) Likelihood
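The resulting estimators (Bishop, Eqs. 1.54-1.56): maximizing the log likelihood over μ and σ² gives the sample mean and the (biased) sample variance.

```latex
\ln p(\mathbf{x} \mid \mu, \sigma^2)
= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2
- \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n,
\qquad
\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2
```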
Properties of μ_ML and σ²_ML
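The key facts behind this slide (Bishop, Eqs. 1.57-1.58): μ_ML is unbiased, while σ²_ML systematically underestimates the true variance.

```latex
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu,
\qquad
\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\, \sigma^2
```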
Curve Fitting Re-visited
Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares error E(w).
Predictive Distribution (skip initially)
Model Selection
Cross-Validation
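A minimal sketch of S-fold cross-validation for choosing the polynomial order M (the helper name, fold count, and grid of degrees are illustrative):

```python
import numpy as np

def cross_validate(x, t, degrees, n_folds=5, seed=0):
    """Return the mean held-out RMS error for each candidate degree."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    scores = {}
    for m in degrees:
        errs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Fit on the training folds, evaluate on the held-out fold.
            w = np.polyfit(x[train], t[train], deg=m)
            pred = np.polyval(w, x[test])
            errs.append(np.sqrt(np.mean((pred - t[test]) ** 2)))
        scores[m] = np.mean(errs)
    return scores

# Usage: pick the degree with the lowest cross-validated error.
# scores = cross_validate(x, t, degrees=range(10))
# best_m = min(scores, key=scores.get)
```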
Entropy
Important quantity in:
• coding theory
• statistical physics
• machine learning
Entropy
Coding theory: x discrete with 8 possible states; how many bits to transmit the state of x?
All states equally likely
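The definition and the 8-state example: with all states equally likely, H = 3 bits, matching a 3-bit fixed-length code.

```latex
H[x] = -\sum_x p(x) \log_2 p(x),
\qquad
H = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3 \text{ bits}
```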
Entropy
Entropy
In how many ways can N identical objects be allocated to M bins?
Entropy is maximized when all bins are equally likely, i.e. pᵢ = 1/M, giving H = ln M.
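The counting argument behind these slides (Bishop §1.6): the multiplicity W gives the entropy via (1/N) ln W, and Stirling's approximation recovers the usual formula in the limit.

```latex
W = \frac{N!}{\prod_i n_i!},
\qquad
H = \frac{1}{N} \ln W
\;\xrightarrow{\;N \to \infty\;}\;
-\sum_i p_i \ln p_i,
\quad p_i = \frac{n_i}{N}
```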
Entropy
Differential Entropy
Put bins of width Δ along the real line
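The limiting argument: each bin of width Δ carries mass p(xᵢ)Δ; the ln Δ term diverges as Δ → 0 and is discarded, leaving the differential entropy.

```latex
H_\Delta = -\sum_i p(x_i)\,\Delta\, \ln\bigl(p(x_i)\,\Delta\bigr)
= -\sum_i p(x_i)\,\Delta\, \ln p(x_i) \;-\; \ln \Delta
\quad\longrightarrow\quad
H[x] = -\int p(x) \ln p(x)\, dx
```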
Differential entropy is maximized (for fixed variance σ²) when p(x) = N(x | μ, σ²),
in which case H[x] = (1/2)(1 + ln(2πσ²)).
Conditional Entropy
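The definition and the chain rule for entropy (Bishop §1.6):

```latex
H[\mathbf{y} \mid \mathbf{x}]
= -\iint p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y}\, d\mathbf{x},
\qquad
H[\mathbf{x}, \mathbf{y}] = H[\mathbf{y} \mid \mathbf{x}] + H[\mathbf{x}]
```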
The Kullback-Leibler Divergence
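The definition; KL divergence is non-negative and vanishes only when the two distributions coincide:

```latex
\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x}
\;\ge\; 0,
\quad \text{with equality iff } p = q
```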
Mutual Information
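Mutual information as the KL divergence between the joint and the product of marginals, and its expression in terms of entropies:

```latex
I[\mathbf{x}, \mathbf{y}]
= \mathrm{KL}\bigl(p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y})\bigr)
= H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}]
= H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}]
```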