
CS 231A Section 1: Linear Algebra & Probability Review

Jonathan Krause

9/28/2012


Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule


Linear classifiers

• Find a linear function (hyperplane) with parameters w, b to separate the positive and negative examples:
    x_i positive:  w · x_i + b ≥ 0
    x_i negative:  w · x_i + b < 0
• Which hyperplane is best?
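A linear classifier just thresholds w · x + b. A minimal MATLAB sketch (not from the slides; w, b, and the points are made up):

  % Minimal sketch: label points by the sign of w'*x + b (hypothetical w, b).
  w = [2; -1];                     % weight vector
  b = 0.5;                         % bias
  X = [1 0; 0 1; -1 -1; 2 2]';     % each column is one example x_i
  scores = w' * X + b;             % w . x_i + b for every example
  labels = sign(scores);           % +1 = positive side, -1 = negative side
  disp([scores; labels])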


Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
[Figure: maximum-margin hyperplane with the margin and support vectors labeled]


Support Vector Machines (SVM)
• Wish to perform binary classification, i.e. find a linear classifier
• Given data x_1, …, x_n and labels y_1, …, y_n, where y_i ∈ {−1, +1}
• When the data is linearly separable we can solve the optimization problem
    min_{w,b}  ½‖w‖²   subject to   y_i (w · x_i + b) ≥ 1  for all i
  to find our linear classifier


Nonlinear SVMs
• Datasets that are linearly separable work out great:
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space:
[Figures: 1D examples on the x axis, first separable, then not; mapping each point x to (x, x²) makes the hard dataset separable]
Slide credit: Andrew Moore


Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
    Φ: x → φ(x)   (the "lifting transformation")
Slide credit: Andrew Moore


SVM – l1 regularization
• What if the data is not linearly separable?
• Can use regularization (slack variables ξ_i) to solve this problem
• We solve a new optimization problem and “tune” our regularization parameter C:
    min_{w,b,ξ}  ½‖w‖² + C Σ_i ξ_i   subject to   y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0


Solving the SVM

• There are many different packages for solving SVMs

• In PS0 we have you use the liblinear package. This is an efficient implementation, but it can only use a linear kernel:
  http://www.csie.ntu.edu.tw/~cjlin/liblinear/
• If you wish to have more flexibility with your choice of kernel, you can use the LibSVM package:
  http://www.csie.ntu.edu.tw/~cjlin/libsvm/
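A rough sketch of what training with liblinear's MATLAB interface might look like, assuming the interface's train and predict functions are compiled and on your path (the toy data and the '-c 1' option string are illustrative, not from PS0):

  % Hypothetical liblinear usage: X is n-by-d (one example per row), y is n-by-1 in {+1,-1}.
  n = 200; d = 10;
  X = [randn(n/2, d) + 1; randn(n/2, d) - 1];    % toy, roughly separable data
  y = [ones(n/2, 1); -ones(n/2, 1)];
  model = train(y, sparse(X), '-c 1');           % -c sets the regularization parameter C
  pred  = predict(y, sparse(X), model);          % predicted labels on the training data
  fprintf('training accuracy: %.3f\n', mean(pred == y));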

Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule

Boosting
• It is a sequential procedure:
[Figure: a set of training points x_{t=1}, x_{t=2}, …]
• Each data point has a class label, y_t = +1 or −1, and a weight, w_t = 1
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.


Toy example
• Weak learners from the family of lines
• h ⇒ p(error) = 0.5: it is at chance
• Each data point has a class label, y_t = +1 or −1, and a weight, w_t = 1


Toy example
• This one seems to be the best
• This is a ‘weak classifier’: it performs slightly better than chance
• Each data point has a class label, y_t = +1 or −1, and a weight, w_t = 1


Toy example
• Each data point has a class label, y_t = +1 or −1
• We update the weights:  w_t ← w_t exp{−y_t H_t}


Toy example
• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers
[Figure: the four weak classifiers f_1, f_2, f_3, f_4 combine into a non-linear decision boundary]


Boosting
• Defines a classifier using an additive model:
    F(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + …
  where F is the strong classifier, each f_t is a weak classifier, α_t is its weight, and x is the features vector


Boosting
• Defines a classifier using an additive model:
    F(x) = α_1 f_1(x) + α_2 f_2(x) + α_3 f_3(x) + …
• We need to define a family of weak classifiers from which the f_t(x) are drawn


Why boosting?
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement, doesn’t require external optimization tools


Boosting - mathematics
• Weak learners:
    h_j(x) = 1 if f_j(x) > θ_j,  0 otherwise
  where f_j(x) is the value of a rectangle feature and θ_j is a threshold
• Final strong classifier:
    h(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ ½ Σ_{t=1..T} α_t,  0 otherwise
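A minimal sketch of one such weak learner and its weighted error, with made-up feature values, labels, and threshold (the variable names fvals, y, w, theta are not from the slides):

  % Weighted error of one threshold weak learner h_j(x) = (f_j(x) > theta).
  fvals = rand(100, 1);                 % made-up rectangle-feature values f_j(x_i)
  y     = double(rand(100, 1) > 0.5);   % made-up labels in {0, 1}
  w     = ones(100, 1) / 100;           % normalized example weights
  theta = 0.5;                          % a candidate threshold
  h   = double(fvals > theta);          % weak learner outputs
  err = sum(w .* abs(h - y));           % weighted error  Σ_i w_i |h(x_i) − y_i|
  fprintf('weighted error: %.3f\n', err);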


Weak classifier
• 4 kinds of rectangle filters
• Value = ∑ (pixels in white area) − ∑ (pixels in black area)
Slide credit: S. Lazebnik
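In practice a rectangle filter is evaluated with an integral image, so the sum over any box costs four lookups. A small illustrative sketch (the window, coordinates, and two-rectangle layout are made up):

  % Evaluate a two-rectangle filter using an integral image (illustrative only).
  img = rand(24);                                % a 24x24 grayscale window
  ii  = cumsum(cumsum(img, 1), 2);               % integral image
  ii  = [zeros(1, size(ii, 2) + 1); zeros(size(ii, 1), 1) ii];   % zero padding for easy indexing
  % sum of img(r1:r2, c1:c2) from four integral-image lookups
  boxsum = @(r1, c1, r2, c2) ii(r2+1, c2+1) - ii(r1, c2+1) - ii(r2+1, c1) + ii(r1, c1);
  white = boxsum(5, 5, 16, 10);                  % "white" rectangle
  black = boxsum(5, 11, 16, 16);                 % adjacent "black" rectangle
  value = white - black;                         % rectangle feature value f_j(x)
  fprintf('feature value: %.3f\n', value);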


Weak classifier
[Figure: a source face image and the result of applying a rectangle filter to it]
Slide credit: S. Lazebnik


Viola & Jones algorithm
1. Evaluate each rectangle filter on each example
[Figure: labeled training examples (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0), (x_6, 0), …, (x_n, y_n), with filter values 0.8, 0.7, 0.2, 0.3, 0.8, 0.1 on the first six]
• Weak classifier with threshold θ_j:  h_j(x) = 1 if f_j(x) > θ_j,  0 otherwise
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.


Viola & Jones algorithm
• For a 24x24 detection region, the set of possible rectangle features is very large (the paper below reports over 180,000)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.


Viola & Jones algorithm
2. Select the best filter/threshold combination:
   a. Normalize the weights:  w_{t,i} ← w_{t,i} / Σ_{j=1..n} w_{t,j}
   b. For each feature j, compute the weighted error  ε_j = Σ_i w_i |h_j(x_i) − y_i|,  where
        h_j(x) = 1 if f_j(x) > θ_j,  0 otherwise
   c. Choose the classifier h_t with the lowest error ε_t
3. Reweight the examples:
      w_{t+1,i} = w_{t,i} β_t^{1 − |h_t(x_i) − y_i|},   with  β_t = ε_t / (1 − ε_t)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
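A compact sketch of one round of this procedure on made-up data, with a matrix of precomputed feature values standing in for the rectangle-filter responses (one simplistic candidate threshold per feature; this is not the authors' implementation):

  % One boosting round, following steps 2-3 above (made-up data).
  n = 200; m = 50;
  F = rand(n, m);                           % F(i,j) = f_j(x_i), precomputed feature values
  y = double(rand(n, 1) > 0.5);             % labels in {0, 1}
  w = ones(n, 1) / n;                       % a. normalized weights
  best_err = inf;
  for j = 1:m                               % b. weighted error of each feature
      theta = median(F(:, j));              %    one candidate threshold per feature (simplification)
      h = double(F(:, j) > theta);
      err = sum(w .* abs(h - y));
      if err < best_err                     % c. keep the classifier with the lowest error
          best_err = err; best_h = h;
      end
  end
  beta = best_err / (1 - best_err);
  w = w .* beta .^ (1 - abs(best_h - y));   % 3. reweight: correct examples are downweighted
  w = w / sum(w);                           % normalize for the next round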


Viola & Jones algorithm
4. The final strong classifier is
      h(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ ½ Σ_{t=1..T} α_t,  0 otherwise,   where  α_t = log(1/β_t)
• The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training errors
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
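Given the per-round weights α_t and weak outputs, evaluating the strong classifier is a single comparison; a tiny sketch with made-up numbers:

  % Final strong classifier decision (made-up alphas and weak outputs).
  alpha = [1.2, 0.8, 0.5, 0.3];                      % alpha_t = log(1/beta_t) from each round
  h_t   = [1, 0, 1, 1];                              % weak classifier outputs on one window
  is_face = sum(alpha .* h_t) >= 0.5 * sum(alpha);   % threshold at half the total weight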


Boosting for face detection
• For each round of boosting:
  1. Evaluate each rectangle filter on each example
  2. Select the best filter/threshold combination
  3. Reweight the examples


The implemented system
• Training data
  – 5000 faces
    • All frontal, rescaled to 24x24 pixels
  – 300 million non-faces
    • 9500 non-face images
  – Faces are normalized
    • Scale, translation
• Many variations
  – Across individuals
  – Illumination
  – Pose
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.


System performance
• Training time: “weeks” on a 466 MHz Sun workstation
• 38 layers, total of 6061 features
• Average of 10 features evaluated per window on the test set
• “On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds”
  – 15 Hz
  – 15 times faster than previous detector of comparable accuracy (Rowley et al., 1998)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.


Output of Face Detector on Test Images


P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.


Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule

Linear Algebra in Computer Vision

• Representation
  – 3D points in the scene
  – 2D points in the image (images are matrices)
• Transformations
  – Mapping 2D to 2D
  – Mapping 3D to 2D

Notation

• We adopt the notation A ∈ R^{m×n} for a matrix, which is a real-valued matrix with m rows and n columns
• We adopt the notation x ∈ R^n for a column vector and x^T for a row vector

Notation

• To indicate the element in the ith row and jth column of a matrix we use A_ij
• Similarly, to indicate the ith entry in a vector we use x_i

Norms

• Intuitively the norm of a vector is the measure of its “length”

• The l2 norm is defined as
    ‖x‖_2 = sqrt( Σ_{i=1..n} x_i² )
  In this class we will use the l2 norm unless otherwise noted, so we drop the 2 subscript on the norm for convenience.
• Note that ‖x‖² = x^T x
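A quick MATLAB check of these identities on a random vector (just a sketch):

  % norm(x) is the l2 norm, and ||x||^2 equals x'*x.
  x = randn(5, 1);
  fprintf('norm(x)          = %.4f\n', norm(x));
  fprintf('sqrt(sum(x.^2))  = %.4f\n', sqrt(sum(x.^2)));
  fprintf('norm(x)^2 - x''*x = %g\n', norm(x)^2 - x'*x);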


Linear Independence and Rank

• A set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set

• The rank of a matrix is the maximal number of linearly independent columns or rows of the matrix

• For A ∈ R^{m×n}, rank(A) ≤ min(m, n); if rank(A) = min(m, n), A is said to be full rank
• rank(A) = rank(A^T)
• rank(AB) ≤ min(rank(A), rank(B))
• rank(A + B) ≤ rank(A) + rank(B)

Range and Nullspace

• The range of a matrix A is the span of the columns of the matrix, denoted by the set R(A) = {v : v = Ax, x ∈ R^n}
• The nullspace of a matrix A is the set of vectors that when multiplied by the matrix result in 0, given by the set N(A) = {x : Ax = 0}

Eigenvalues and Eigenvectors

• Given a square matrix A, λ and x ≠ 0 are said to be an eigenvalue and the corresponding eigenvector of the matrix if Ax = λx
• We can solve for the eigenvalues by solving for the roots of the polynomial generated by det(λI − A) = 0

Eigenvalue Properties

• The rank of a diagonalizable matrix is equal to the number of its non-zero eigenvalues
• Eigenvalues of a diagonal matrix are simply its diagonal entries
• A matrix A is said to be diagonalizable if we can write A = PDP^{-1}, where D is a diagonal matrix of eigenvalues and the columns of P are the corresponding eigenvectors

Eigenvalues & Eigenvectors of Symmetric Matrices

• Eigenvalues of symmetric matrices are real
• Eigenvectors of symmetric matrices can be chosen to be orthonormal
• Consider the optimization problem involving a symmetric matrix A:
    max_x  x^T A x   subject to   ‖x‖² = 1
  The maximizing x is the eigenvector corresponding to the largest eigenvalue of A
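A quick numerical check of this fact (a sketch on a random symmetric matrix, not from the slides):

  % For symmetric A, the max of x'*A*x over unit vectors is the largest eigenvalue.
  B = randn(5); A = (B + B') / 2;          % a random symmetric matrix
  [V, D] = eig(A);
  [lam_max, idx] = max(diag(D));           % largest eigenvalue and its index
  x_best = V(:, idx);                      % corresponding (unit-norm) eigenvector
  fprintf('largest eigenvalue: %.4f\n', lam_max);
  fprintf('x_best''*A*x_best:   %.4f\n', x_best' * A * x_best);
  x = randn(5, 1); x = x / norm(x);        % a random unit vector for comparison
  fprintf('random x''*A*x:      %.4f (<= largest eigenvalue)\n', x' * A * x);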


Generalized Eigenvalues

• Generalized eigenvalue problem:  Ax = λBx
• Generalized eigenvalues must satisfy  det(λB − A) = 0
• This reduces to the original eigenvalue problem (B^{-1}Ax = λx) when B^{-1} exists
• Generalized eigenvalues are used in Fisherfaces

Singular Value Decomposition (SVD)

• The SVD of a matrix A ∈ R^{m×n} is given by A = UΣV^T
• The columns u_i of U are called the left singular vectors; Σ is a diagonal matrix whose entries σ_1 ≥ σ_2 ≥ … ≥ 0 are called the singular values; the columns v_i of V are called the right singular vectors

SVD

• If the matrix A has rank r, then A has r non-zero singular values
• u_1, …, u_r are an orthonormal basis for the range R(A)
• v_{r+1}, …, v_n are an orthonormal basis for the nullspace N(A)
• Singular values of A are the square roots of the non-zero eigenvalues of A^T A or A A^T

Matlab

• [V,D] = eig(A)
  The eigenvectors of A are the columns of V. D is a diagonal matrix whose entries are the eigenvalues of A.
• [V,D] = eig(A,B)
  The generalized eigenvectors are the columns of V. D is a diagonal matrix whose entries are the generalized eigenvalues.
• [U,S,V] = svd(X)
  The columns of U are the left singular vectors of X. S is a diagonal matrix whose entries are the singular values of X. The columns of V are the right singular vectors of X. Recall X = U*S*V'.
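A small runnable script exercising these three calls on random matrices, checking the identities they should satisfy (the matrices are arbitrary; B is made symmetric positive definite so the generalized problem is well behaved):

  % Sanity checks for eig, eig(A,B), and svd on random matrices.
  A = randn(4); B = randn(4); B = B'*B + eye(4);       % B symmetric positive definite
  [V, D] = eig(A);
  fprintf('||A*V - V*D||_F   = %g\n', norm(A*V - V*D, 'fro'));
  [V, D] = eig(A, B);                                  % generalized problem: A*V = B*V*D
  fprintf('||A*V - B*V*D||_F = %g\n', norm(A*V - B*V*D, 'fro'));
  X = randn(5, 3);
  [U, S, W] = svd(X);                                  % X = U*S*W'
  fprintf('||X - U*S*W''||_F  = %g\n', norm(X - U*S*W', 'fro'));
  disp([diag(S)'; sqrt(sort(eig(X'*X), 'descend'))']); % singular values vs sqrt(eig(X'*X))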


Matrix Calculus -- Gradient

• Let f : R^{m×n} → R be a function of a matrix A; then the gradient ∇_A f(A) is given by the m×n matrix of partial derivatives, (∇_A f(A))_{ij} = ∂f(A)/∂A_{ij}
• ∇_A f(A) is always the same size as A, thus if we just have a vector x ∈ R^n the gradient is simply the vector with entries (∇_x f(x))_i = ∂f(x)/∂x_i

Gradients

• The gradient is built from partial derivatives:  (∇_x f(x))_i = ∂f(x)/∂x_i
• Some common gradients:
    ∇_x (b^T x) = b
    ∇_x (x^T A x) = (A + A^T) x   (= 2Ax when A is symmetric)
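A finite-difference check of the x^T A x gradient above (a sketch; the matrix, point, and step size are arbitrary):

  % Numerically verify that the gradient of f(x) = x'*A*x is (A + A')*x.
  A = randn(4); x = randn(4, 1); h = 1e-6;
  g_analytic = (A + A') * x;
  g_numeric  = zeros(4, 1);
  for i = 1:4
      e = zeros(4, 1); e(i) = h;                                   % perturb coordinate i
      g_numeric(i) = ((x+e)'*A*(x+e) - (x-e)'*A*(x-e)) / (2*h);    % central difference
  end
  fprintf('max gradient difference: %g\n', max(abs(g_analytic - g_numeric)));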


Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule

Probability in Computer Vision

• Foundation for algorithms to solve
  – Tracking problems
  – Human activity recognition
  – Object recognition
  – Segmentation

Probability Axioms

• Sample space: the set of all the outcomes of a random experiment, denoted by Ω
• Event space: a set whose elements A (called events) are subsets of Ω; the event space is denoted by F
• Probability measure: a function P : F → R that satisfies
  – P(A) ≥ 0, for all A ∈ F
  – P(Ω) = 1
  – If A_1, A_2, … are disjoint events, then P(∪_i A_i) = Σ_i P(A_i)

Basic Properties

• If A ⊆ B, then P(A) ≤ P(B)
• P(A ∩ B) ≤ min(P(A), P(B))
• (Union bound)  P(A ∪ B) ≤ P(A) + P(B)
• P(Ω \ A) = 1 − P(A)
• (Law of total probability)  If A_1, …, A_k are disjoint events with ∪_i A_i = Ω, then Σ_i P(A_i) = 1

Conditional Probability

• The conditional probability of A given B is  P(A | B) = P(A ∩ B) / P(B)
• Two events are independent if  P(A ∩ B) = P(A) P(B)
• Conditional independence: A and B are conditionally independent given C if  P(A ∩ B | C) = P(A | C) P(B | C)

Product Rule

• From the definition of conditional probability we can write  P(A ∩ B) = P(A | B) P(B)
• From the product rule we can derive the chain rule of probability:
    P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 ∩ A_2) ⋯ P(A_n | A_1 ∩ … ∩ A_{n−1})

Bayes Theorem

    P(A | B) = P(B | A) P(A) / P(B)
• P(A | B) is the posterior probability, P(B | A) is the likelihood, P(A) is the prior probability, and P(B) is the normalizing constant
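A tiny worked example of the theorem (the detector rates and prior below are made-up numbers, not from the slides):

  % Bayes' theorem with made-up numbers: a detector fires on 90% of faces,
  % on 5% of non-faces, and 1% of windows actually contain a face.
  p_face        = 0.01;                                            % prior P(face)
  p_fire_face   = 0.90;                                            % likelihood P(fire | face)
  p_fire_noface = 0.05;                                            % P(fire | not face)
  p_fire = p_fire_face * p_face + p_fire_noface * (1 - p_face);    % normalizing constant P(fire)
  p_face_fire = p_fire_face * p_face / p_fire;                     % posterior P(face | fire)
  fprintf('P(face | detector fires) = %.3f\n', p_face_fire);       % about 0.154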
