
INTRODUCTION TO PATTERN RECOGNITION

INSTRUCTOR: WEI DING


Pattern Recognition

• Automatic discovery of regularities in data through the use of computer algorithms.

• With the use of these regularities to take actions, such as classifying the data into different categories.


Example: Handwritten Digit Recognition

Each digit corresponds to a 28 × 28 pixel image. How would you formulate the problem?

Example: Problem Formulation

• Each digit corresponds to a 28 × 28 pixel image and so can be represented by a vector x comprising 784 real numbers.

• The goal is to build a machine that will take such a vector x as input and that will produce the identity of the digit 0, …, 9 as the output.

This is a nontrivial problem due to the wide variability of handwriting.

Using handcrafted rules? Heuristics for distinguishing the digits based on the shapes of strokes?
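As a concrete illustration of this formulation, here is a minimal sketch in Python (not part of the slides; the `classify` function is a hypothetical placeholder for the machine to be built):

```python
import numpy as np

# Flatten a 28 x 28 grayscale digit image into the input vector x.
image = np.random.rand(28, 28)   # stand-in for a real handwritten digit
x = image.reshape(-1)            # 28 * 28 = 784 real numbers
assert x.shape == (784,)

def classify(x: np.ndarray) -> int:
    """Hypothetical classifier: maps the 784-vector x to a label in 0..9."""
    return 0                     # placeholder; a learned model would go here
```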


Example: Training Set, Test Set & Generalization

• Training set: the categories of the digits in the training set are known in advance.
• We can express the category of a digit using a target vector t, which represents the identity of the corresponding digit.

• Test set: new, previously unseen digit images.
• Generalization: a machine learning approach learns a function y(x) that correctly categorizes new examples that differ from those used for training.

• In practical applications, the variability of the input vectors will be such that the training data can comprise only a tiny fraction of all possible input vectors, and so generalization is a central goal in pattern recognition.


Feature Extraction

• Find useful features that are fast to compute, and yet that also preserve useful discriminatory information.

• A form of dimensionality reduction.
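As an illustrative sketch (one possible hand-designed feature, not one prescribed by the slides), a 28 × 28 digit image could be reduced to 49 block-average features:

```python
import numpy as np

# Average each 4 x 4 block of pixels: 784 dimensions down to 49,
# fast to compute while preserving the coarse shape of the digit.
def block_average_features(image: np.ndarray, block: int = 4) -> np.ndarray:
    h, w = image.shape
    blocks = image.reshape(h // block, block, w // block, block)
    return blocks.mean(axis=(1, 3)).reshape(-1)   # 49-dimensional vector

features = block_average_features(np.random.rand(28, 28))
assert features.shape == (49,)
```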


Supervised Learning: Classification & Regression

• Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems.
• Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems.
• If the desired output consists of one or more continuous variables, then the task is called regression.


Polynomial Curve Fitting (Regression)

In a real data set, individual observations are often corrupted by random noise.

Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data.

Goal: to predict the value of t for some new value of x, without knowledge of the green curve.


Polynomial Curve Fitting

We fit the data using a polynomial function of the form

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

• M is the order of the polynomial.
• x^j denotes x raised to the power of j.
• The polynomial coefficients w_0, …, w_M are collectively denoted by the vector w.
• Although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w.
• Linear models: functions that are linear in the unknown parameters are called linear models.
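A minimal sketch of evaluating this model numerically (the helper name `poly` is my own):

```python
import numpy as np

# Evaluate y(x, w) = sum_{j=0}^{M} w[j] * x**j for coefficients w0..wM.
def poly(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # np.polyval expects the highest-order coefficient first, hence w[::-1].
    return np.polyval(w[::-1], x)

w = np.array([1.0, -2.0, 0.5])        # w0, w1, w2 (an M = 2 example)
print(poly(np.array([0.0, 1.0]), w))  # [ 1.  -0.5]
```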


Error Function

• The values of the coefficients will be determined by fitting the polynomial to the training data.
• This can be done by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w, and the training set data points.
• A simple choice is the sum of the squares of the errors between the predictions y(x_n, w) for each data point x_n and the corresponding target values t_n.


Sum-of-Squares Error Function

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2

When is E(w) = 0?
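(E(w) is nonnegative, and it is zero exactly when the polynomial passes through every training point.) A minimal sketch of this error function, reusing the conventions above:

```python
import numpy as np

# E(w) = 1/2 * sum_n ( y(x_n, w) - t_n )**2
def sum_of_squares_error(w, x, t):
    y = np.polyval(w[::-1], x)        # predictions y(x_n, w)
    return 0.5 * np.sum((y - t) ** 2)
```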


Solving the Curve Fitting Problem

• Choose the value of w for which E(w) is as small as possible.
• Because the error function is a quadratic function of the coefficients w, its derivatives with respect to the coefficients are linear in the elements of w, and so the minimization of the error function has a unique solution, denoted by w*, which can be found in closed form.
• The resulting polynomial is given by the function y(x, w*).
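A minimal sketch of that closed-form solution via the normal equations (the synthetic data here is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy targets

M = 3                                         # polynomial order
A = np.vander(x, M + 1, increasing=True)      # design matrix, A[n, j] = x_n**j
w_star = np.linalg.solve(A.T @ A, A.T @ t)    # normal equations: (A^T A) w* = A^T t
y_fit = A @ w_star                            # fitted values y(x_n, w*)
```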


0th Order Polynomial (M = 0)

Poor fitting


1st Order Polynomial (M = 1)

Poor fitting


3rd Order Polynomial (M = 3)

Excellent fitting


9th Order Polynomial (M = 9)

Over-fitting: the polynomial passes exactly through each data point, so E(w*) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx).


Root-Mean-Square (RMS) Error

E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^*) / N}

• For each choice of M, we can evaluate E_RMS for the test data set.
• Division by N allows us to compare different sizes of data sets on an equal footing.
• The square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
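A minimal sketch of this quantity, reusing the conventions above:

```python
import numpy as np

# E_RMS = sqrt(2 * E(w) / N), with E(w) the sum-of-squares error.
def rms_error(w, x, t):
    y = np.polyval(w[::-1], x)
    e = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2 * e / len(x))
```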


Over-fitting

Graphs of the training and test set RMS errors are shown for various values of M.

The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x.

Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx). A sketch of this experiment appears below.
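A minimal sketch of such an experiment (illustrative data and noise level, not the slides' exact figures):

```python
import numpy as np

rng = np.random.default_rng(1)
def make_data(n):
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in range(10):
    w = np.polyfit(x_train, t_train, deg=M)   # least-squares fit of order M
    for name, x, t in (("train", x_train, t_train), ("test", x_test, t_test)):
        e = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)
        print(M, name, round(np.sqrt(2 * e / len(x)), 3))
```

Typically the training error falls monotonically with M, while the test error rises again once M is large enough to over-fit.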


Over-Fitting

• For M=9, the training set error goes to zero. 

• This is because the polynomial contains 10 degrees of freedom corresponding to the 10 coefficients w_0, …, w_9, and so can be tuned exactly to the 10 data points in the training set.

• However, the test set error has become very large and the corresponding function y(x, w*) exhibits wild oscillations.


Paradox

• A polynomial of a given order contains all lower-order polynomials as special cases.
• The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial.
• The best predictor of new data would be the function sin(2πx) itself.
• We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.


Polynomial Coefficients

(Table of the coefficients w* for polynomials of various order M; not reproduced here.)

The more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.

• It is also interesting to examine the behavior of a given model as the size of the data set is varied…


Data Set Size: 9th Order Polynomial

(Plots of the M = 9 polynomial fitted to data sets of increasing size; not reproduced here.)

So what is your conclusion?


Over-fitting vs. Size of the Data Set

• For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
• The larger the data set, the more complex (in other words, more flexible) the model that we can afford to fit to the data.
• One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model, though not necessarily. For example, the M = 9 polynomial has 10 adaptive parameters, so this heuristic would call for roughly 50 to 100 training points.

There is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved.

Overfitting (later in this semester)

• Maximum likelihood. The over‐fitting problem can be understood as a general property of maximum likelihood.

• Bayesian approach. There is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. By adopting a Bayesian approach, the over-fitting problem can be avoided.


• For the moment, it is instructive to continue with the current approach and to consider how in practice we can apply it to data sets of limited size, where we may wish to use relatively complex and flexible models.


Regularization

One technique often used to control over-fitting is regularization, which adds a penalty term to the error function to discourage the coefficients from reaching large values:

\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

where \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2 (here the y(x_n, w) are the predictions and the t_n the ground-truth targets), and λ governs the relative importance of the regularization term compared with the sum-of-squares error term.
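Minimizing \tilde{E}(\mathbf{w}) also has a closed-form solution; a minimal sketch (the helper name `fit_regularized` is my own):

```python
import numpy as np

# Setting the gradient of E~(w) to zero gives (A^T A + lambda I) w* = A^T t.
def fit_regularized(x, t, M, lam):
    A = np.vander(x, M + 1, increasing=True)   # A[n, j] = x_n**j
    I = np.eye(M + 1)
    return np.linalg.solve(A.T @ A + lam * I, A.T @ t)
```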


Regularization: M = 9

(Plots of the M = 9 polynomial fitted to the same data for two different values of the regularization parameter λ; not reproduced here.)


Polynomial Coefficients

(Table of the coefficients w* for the regularized fits; not reproduced here.)

The regularization has the desired effect of reducing the magnitude of the coefficients.

Regularization: E_RMS vs. ln λ

The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error for both training and test sets against ln λ.

In effect, λ now controls the effective complexity of the model and hence determines the degree of over-fitting.


Model Complexity

• To solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity.
• What we have discussed suggests a simple way of achieving this: take the available data and partition it into a training set, used to determine the coefficients w, and a separate validation set, also called a hold-out set, used to optimize the model complexity (either M or λ).

This approach, however, can be too wasteful of valuable training data; a sketch of the idea follows below.
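A minimal sketch of hold-out validation for choosing λ with M fixed, reusing `fit_regularized` from above (the candidate grid is illustrative):

```python
import numpy as np

def rms(w, x, t):
    A = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((A @ w - t) ** 2))   # equals sqrt(2 E(w) / N)

def choose_lambda(x_train, t_train, x_val, t_val, M=9):
    # Fit on the training set, score on the held-out validation set.
    best = min((rms(fit_regularized(x_train, t_train, M, lam), x_val, t_val), lam)
               for lam in np.exp(np.arange(-20.0, 1.0)))
    return best[1]
```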

Probability Theory

• A key concept in the field of pattern recognition is that of uncertainty, arising through noise on measurements and the finite size of data sets.
• Probability theory provides a consistent framework for the quantification and manipulation of uncertainty, and forms one of the central foundations for pattern recognition.


Probability Theory (example 1/4)

Apples and Oranges

We randomly pick one of the boxes, and from that box we randomly select an item of fruit; having observed which sort of fruit it is, we replace it in the box from which it came.

Let us suppose that in so doing we pick the red box 40% of the time and the blue box 60% of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

Random Variables (example 2/4)

• The identity of the box that will be chosen is a random variable, which we shall denote by B. This random variable can take one of two possible values, namely r (red box) or b (blue box).
• The identity of the fruit is also a random variable and will be denoted by F. It can take either of the values a (for apple) or o (for orange).


Example 3/4

• We shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity.
• The probability of selecting the red box is 4/10 and the probability of selecting the blue box is 6/10: p(B = r) = 4/10 and p(B = b) = 6/10.

• By definition, probabilities must lie in the interval [0,1]. If the events are mutually exclusive and if they include all possible outcomes, then the probabilities for those events must sum to one. 


Example 4/4

• What is the overall probability that the selection procedure will pick an apple?
• Given that we have chosen an orange, what is the probability that the box we chose was the blue one?
• We need to learn the sum rule and the product rule in order to answer these questions (a worked sketch follows the rules below).


Probability Theory

Consider two random variables, X and Y, and a total number N of instances (trials) of these variables. Let n_{ij} denote the number of trials in which X = x_i and Y = y_j. The number of points in column i, corresponding to X = x_i, is denoted by c_i, and the number of points in row j, corresponding to Y = y_j, is denoted by r_j. Then:

Joint probability: p(X = x_i, Y = y_j) = n_{ij} / N
Marginal probability: p(X = x_i) = c_i / N
Conditional probability: p(Y = y_j \mid X = x_i) = n_{ij} / c_i

Here we are implicitly considering the limit N → ∞.
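A minimal sketch of these definitions on a small table of counts (the numbers are arbitrary):

```python
import numpy as np

n = np.array([[3, 1],   # n[i, j] = number of trials with X = x_i and Y = y_j
              [2, 4]])
N = n.sum()

p_joint = n / N                                  # p(X = x_i, Y = y_j) = n_ij / N
p_x = n.sum(axis=1) / N                          # p(X = x_i) = c_i / N
p_y_given_x = n / n.sum(axis=1, keepdims=True)   # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule check: p(X, Y) = p(Y | X) p(X)
assert np.allclose(p_joint, p_y_given_x * p_x[:, None])
```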

The Rules of Probability

Sum rule: p(X) = \sum_{Y} p(X, Y)

Product rule: p(X, Y) = p(Y \mid X) \, p(X)
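A worked sketch answering the two questions from Example 4/4. The box contents are an assumption here (taken from the corresponding example in Bishop's PRML, which these slides follow): red box = 2 apples and 6 oranges, blue box = 3 apples and 1 orange.

```python
p_box = {"r": 4 / 10, "b": 6 / 10}   # p(B), given in the slides
p_fruit_given_box = {                # p(F | B), from the assumed box contents
    "r": {"a": 2 / 8, "o": 6 / 8},
    "b": {"a": 3 / 4, "o": 1 / 4},
}

# Sum and product rules: p(F = a) = sum_B p(F = a | B) p(B)
p_apple = sum(p_fruit_given_box[b]["a"] * p_box[b] for b in p_box)
print(p_apple)   # 0.55, i.e. 11/20

# Reversing the product rule (Bayes' theorem):
# p(B = b | F = o) = p(F = o | B = b) p(B = b) / p(F = o)
p_orange = sum(p_fruit_given_box[b]["o"] * p_box[b] for b in p_box)
print(p_fruit_given_box["b"]["o"] * p_box["b"] / p_orange)   # 1/3
```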