Machine Learning Lecture


Upload: eric-larson

Posted on 25-Dec-2014


DESCRIPTION

A lecture I gave for CSE/EE599 on the basics of machine learning and different toolkits.

TRANSCRIPT

Page 1: Machine Learning Lecture

Machine Learning

Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the “capacity” of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning.

(Vapnik, 1979)

Page 2: Machine Learning Lecture

What is machine learning?

Data (examples) → Model (training) → Output (predictions, classifications, clusters, ordinals)

Why: face recognition?

Page 3: Machine Learning Lecture

Categories of problems

By output: clustering, classification, regression, ordinal regression, prediction

By input: vector x, or time series x(t)

Page 4: Machine Learning Lecture

One size never fits all…

• Improving an algorithm:

– First option: better features

• Visualize classes

• Trends

• Histograms

– Next: make the algorithm smarter (more complicated)

• Interaction of features

• Better objective and training criteria

Tools: WEKA or GGobi

Page 5: Machine Learning Lecture

Categories of ML algorithms

By model: non-parametric (keeps the raw data only; e.g., kernel methods) vs. parametric (keeps model parameters only)

By training: supervised (labeled data) vs. unsupervised (unlabeled data)

[Figure: example regression fits of output vs. input for y = 1 + 0.5t + 4t^2 - t^3]

Page 6: Machine Learning Lecture

[Figures: example plots of output vs. input and a histogram]

Page 7: Machine Learning Lecture

Training a ML algorithm

• Choose data

• Optimize model parameters according to:

– Objective function

[Figure: regression example (mean square error objective) and two-class classification example (max margin objective)]
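As a concrete, hedged illustration of the mean-square-error objective from the regression example above (the curve is the one used elsewhere in these slides; the noise level and sample count are assumptions), a minimal numpy sketch:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-4, 6, 100)
y = 1 + 0.5 * t + 4 * t**2 - t**3 + rng.normal(scale=5.0, size=t.shape)  # noisy samples of the example curve

# Design matrix [1, t, t^2, t^3]; least squares minimizes the mean square error.
X = np.vander(t, N=4, increasing=True)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", w)               # close to [1, 0.5, 4, -1]
print("training MSE:", np.mean((X @ w - y) ** 2))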

Page 8: Machine Learning Lecture

Pitfalls of ML algorithms

• Clean your features:

– Training volume: more is better

– Outliers: remove them!

– Dynamic range: normalize it!

• Generalization

– Overfitting

– Underfitting

• Speed: parametric vs. non

• What are you learning? …features, features, features…

Page 9: Machine Learning Lecture

Outliers

[Figures: regression examples of output vs. input illustrating the effect of outlier points]

Keep a “good” percentile range! 5-95 or 1-99: depends on your data.
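A minimal numpy sketch of keeping a “good” percentile range; the specific percentiles are whatever suits your data, as the slide says:

import numpy as np

def clip_to_percentiles(x, lo=5, hi=95):
    """Clip a feature to its [lo, hi] percentile range (e.g., 5-95 or 1-99)."""
    low, high = np.percentile(x, [lo, hi])
    return np.clip(x, low, high)

x = np.concatenate([np.random.normal(size=1000), [50.0, -40.0]])  # a few gross outliers
x_clipped = clip_to_percentiles(x, lo=1, hi=99)

Dropping the offending rows instead of clipping them is the other common choice.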

Page 10: Machine Learning Lecture

Dynamic range

[Figures: scatter plots of feature f2 vs. f1 for two classes, shown at different dynamic ranges]
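A minimal numpy sketch of putting features on a common dynamic range (z-scoring each column is one standard choice; min-max scaling to [0, 1] is another):

import numpy as np

def zscore(X):
    """Standardize each column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12   # guard against constant features
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1000, 500),   # f1: large range
                     rng.uniform(0, 1, 500)])     # f2: small range
X_norm = zscore(X)                                # f1 and f2 now on comparable scales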

Page 11: Machine Learning Lecture

Overfitting and comparing algorithms

• Early stopping

• Regularization

• Validation sets
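A minimal numpy sketch of the regularization-plus-validation-set recipe (the polynomial degree, candidate lambda values, and split sizes are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-4, 6, 60)
y = 1 + 0.5 * t + 4 * t**2 - t**3 + rng.normal(scale=5.0, size=t.shape)

X = np.vander(t, N=10, increasing=True)   # deliberately over-flexible model
idx = rng.permutation(len(t))
train, val = idx[:40], idx[40:]           # held-out validation set

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: the penalty lam shrinks the weights."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Pick the regularization strength with the smallest validation error.
best_mse, best_lam = min(
    (np.mean((X[val] @ ridge_fit(X[train], y[train], lam) - y[val]) ** 2), lam)
    for lam in [1e-3, 1e-2, 0.1, 1.0, 10.0])
print("best lambda:", best_lam, "validation MSE:", best_mse)

Early stopping follows the same idea: monitor validation error during iterative training and stop when it starts to rise.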

Page 12: Machine Learning Lecture

Underfitting

Curse of dimensionality

Page 13: Machine Learning Lecture

Underfitting

Curse of dimensionality

Page 14: Machine Learning Lecture

K-Means clustering

• Planar decision boundaries (depending on the space you are in)

• Highly efficient

• Not always great (but usually pretty good)

• Needs good starting criteria (initialization)
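A minimal sketch using scikit-learn (an assumed toolkit choice, not one named in the lecture); k-means++ initialization addresses the “needs good starting criteria” point:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=4.0, size=(100, 2))])   # two synthetic blobs

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
labels = km.labels_             # cluster assignment for each point
centers = km.cluster_centers_   # the planar (Voronoi) boundaries sit between these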

Page 15: Machine Learning Lecture

K-Nearest Neighbor

• Arbitrary decision boundaries

• Not so efficient (every prediction scans the stored training data)

• With enough data in each class, approaches the optimal error rate

• Easy to train; known as a lazy classifier
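A minimal numpy sketch of the “lazy classifier” idea: training is just storing the data, and prediction does all the work:

import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """y_train is assumed to hold non-negative integer class labels."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
        nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
        preds.append(np.bincount(nearest).argmax())   # majority vote
    return np.array(preds)

The per-query scan over all training points is why the slide calls it not so efficient.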

Page 16: Machine Learning Lecture

Mixture of Gaussians

• Arbitrary decision boundaries, with enough components

• Efficient, depending on the number of models and Gaussians

• Can represent more than just Gaussian distributions

• Generative; sometimes tough to train

• Spurious singularities

• Can get a distribution for a specific class and feature(s), and so get a Bayesian classifier
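A minimal sketch with scikit-learn's GaussianMixture (an assumed toolkit choice): fit one mixture per class and compare class-conditional likelihoods, which is the Bayesian-classifier use the slide mentions (equal class priors assumed here):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(200, 2))   # training data for class 0
X1 = rng.normal(loc=3.0, size=(200, 2))   # training data for class 1

gmm0 = GaussianMixture(n_components=2, random_state=0).fit(X0)
gmm1 = GaussianMixture(n_components=2, random_state=0).fit(X1)

X_test = np.array([[0.5, 0.5], [3.2, 2.8]])
# score_samples returns per-point log-likelihoods; pick the likelier class.
pred = (gmm1.score_samples(X_test) > gmm0.score_samples(X_test)).astype(int)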

Page 17: Machine Learning Lecture

Components Analysis (principal or independent)

• Reduces dimensionality

• All other classifiers then work in a rotated space

• Remember eigenvalues and eigenvectors?
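A minimal numpy sketch of principal components analysis via the eigenvectors of the covariance matrix, echoing the slide's eigenvalue/eigenvector reminder:

import numpy as np

def pca_transform(X, n_components=2):
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]         # largest-variance directions first
    W = eigvecs[:, order[:n_components]]
    return Xc @ W                             # data in the rotated, reduced space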

Page 18: Machine Learning Lecture

Tree Classifiers

• Arbitrary decision boundaries

• Can be quite efficient (or not!)

• Needs good criteria for splitting

• Easy to visualize
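A minimal sketch with scikit-learn (an assumed toolkit choice); the splitting criterion ("gini" here, entropy being the usual alternative) is the "good criteria for splitting" the slide asks for, and export_text shows why trees are easy to visualize:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # prints the learned if/else splits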

Page 19: Machine Learning Lecture

Multi-Layer Perceptron

• Arbitrary decision boundaries (built up from layers of linear units plus nonlinearities)

• Can be quite efficient (or not!)

• What did it learn?
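A minimal sketch with scikit-learn's MLPClassifier (an assumed toolkit choice; the layer sizes and dataset are illustrative assumptions):

from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
# "What did it learn?" -- the weights live in mlp.coefs_, but they are hard to read.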

Page 20: Machine Learning Lecture

Support Vector Machines

• Arbitrary decision boundaries

• Efficiency depends on the number of support vectors and the feature size
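A minimal sketch with scikit-learn's SVC (an assumed toolkit choice); prediction cost grows with the number of support vectors and the feature dimension, which is the efficiency point above:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("support vectors per class:", svm.n_support_)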

Page 21: Machine Learning Lecture

Hidden Markov Models

• Arbitrary decision boundaries

• Efficiency depends on the state space and the number of models

• Generalizes to incorporate features that change over time
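A minimal numpy sketch of the HMM forward algorithm (the prior, transition, and emission values below are made up for illustration); the sequence likelihood it computes is what you compare across per-class models when classifying time series:

import numpy as np

def sequence_likelihood(pi, A, B, obs):
    """pi: initial state probs (S,), A: state transitions (S, S),
    B: emission probs (S, V), obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate through transitions, weight by emission
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(sequence_likelihood(pi, A, B, obs=[0, 1, 0]))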

Page 22: Machine Learning Lecture

More sophisticated approaches

• Graphical models (like an HMM)

– Bayesian networks

– Markov random fields

• Boosting

– AdaBoost

• Voting

• Cascading

• Stacking…
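A minimal sketch of AdaBoost with scikit-learn (an assumed toolkit choice); its default base learner is a depth-1 decision tree (a stump), and boosting combines many of them into a stronger classifier:

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", ada.score(X, y))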