
Introduction to Machine Learning

Laurent Risser Institut de Mathématiques de Toulouse

lrisser@math.univ-toulouse.fr


0) From Statistics to Machine Learning

As a starter: From Statistics to Machine Learning


1940–70: Classical statistics deals with tested hypotheses (e.g. are the students in class A significantly taller than the students in class B?). There are typically n ≈ 30 observations and p < 10 variables.

1970s: The use of computers becomes increasingly popular and larger volumes of data are explored in statistics. Expert systems also start making automatic decisions based on rules injected by human experts (e.g. if blood pressure < threshold and there are spots on the skin, then prescribe a specific medication).

1980s: Expert systems are made obsolete by machine learning and neural networks: decision rules are now automatically defined from observed data.

1990s: 1st paradigm shift: observed data are no longer collected for a planned study but stored in databases and then re-used, moving from Data Mining to Knowledge Discovery.

2000s: 2nd paradigm shift: the number of variables p becomes increasingly large (typically in omics data, where p >> n). Prediction quality matters more than model interpretability (black-box models). The curse of dimensionality makes regularization important.

2010s: 3rd paradigm shift: the number of observations n is now increasingly large (e-commerce, geo-localisation, …). Databases are structured in clouds and computations run on clusters (big data). Decisions also have to be almost immediate.

Today: … fast and robust interpretation of videos (autonomous vehicles) … explainability of black-box decision rules (social issues and certifiability) … complex data (small data).

See also the course of P. Besse (INSA Toulouse) on statistical learning: http://www.math.univ-toulouse.fr/~besse/enseignement.html


Talk overview

• From statistics to machine learning
• Introductory examples
   • Supervised learning
   • Unsupervised learning
• Classic algorithms in machine learning
   • K-means
   • Decision trees and random forests
   • SVM
   • Logistic regression
• Overfitting and cross-validation
   • Overfitting
   • Cross-validation
• High dimensionality and regularization
   • Modeling a real-life problem
   • Effect of regularization
   • Dimensionality reduction using PCA
• Supervised learning using neural networks
• Conclusion

1) Introductory examples

Two introductory examples


Supervised learning — Training Data

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Figure: scatter plot of the labelled training data, Variable 1 vs. Variable 2]

1.a) Introductory examples — Supervised learning

Example: Variable 1 = age, Variable 2 = monthly income, state = buys a product at Christmas.


1.a) Introductory examples — Supervised learning

Supervised learning — Prediction

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Figure: the training data plus a new, unlabelled observation "?"]

What is the most likely state of "?" ?

Supervised learning — Prediction

Supervised learning (here using a linear model), then prediction on new data.

[Figure: the linear decision boundary learned from the training data; the most likely state of "?" is given by the side of the boundary on which it falls]


To sum up:

1. We have labelled training data.
2. Choose a model to classify the data.
3. Optimise the model parameters (learning) with respect to a loss function (e.g. the prediction error).
4. Validate the model on a test set.
5. Predict on new observations.
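As an illustration, here is a minimal sketch of these five steps in Python with scikit-learn and synthetic data (the library choice and the numeric values are assumptions, not taken from the slides):

```python
# Minimal sketch of the supervised workflow (synthetic data, scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1. Labelled training data: n observations, p = 2 variables, labels with 2 states
X = np.vstack([rng.normal([30, 1500], [5, 300], size=(10, 2)),    # state -1
               rng.normal([50, 3500], [5, 300], size=(10, 2))])   # state +1
y = np.array([-1] * 10 + [1] * 10)

# 2. and 3. Choose a (linear) model and optimise its parameters on a training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# 4. Validate the model on the held-out test set
print("test accuracy:", model.score(X_test, y_test))

# 5. Predict the most likely state of a new observation "?"
print("prediction for '?':", model.predict([[42, 2500]]))
```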

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Input Data

n = 20 observations, p = 2 variables (problem dimension), no label

[Figure: scatter plot of the unlabelled observations, Variable 1 vs. Variable 2]

Is it reasonably possible to distinguish several sub-groups of observations?

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Distance between the observations

[Figure: for one observation xi, the distances dist(xi, xj), j = {1, …, i−1, i+1, …, n}, to all the other observations]

Distances between the observations: {dist(xi, xj)} for i, j ∈ {1, …, n}²


1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Learning

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

[Figure: the observations partitioned into Group 1, Group 2 and Group 3, Variable 1 vs. Variable 2]

From the distances between the observations {dist(xi, xj)}, i, j ∈ {1, …, n}², we define an energy to minimize with respect to the labels yi:

f(y1, …, yn, {dist(xi, xj)}i,j∈{1,…,n}²),   e.g.   f(…) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ 𝕀(yi = yj) ‖xi − xj‖₂²

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Link with graph clustering

[Figure: the observations seen as a graph, with strong links between nearby points; graph clustering recovers Group 1, Group 2 and Group 3]


To sum up:

1. We have training data with no known labels.
2. Choose a model to classify the data.
3. Optimise the model parameters (learning) based on an internal energy criterion, as in the sketch below.
4. Predict on new observations.
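A minimal sketch of the energy above in Python (numpy only; the helper name clustering_energy and the synthetic data are assumptions): it evaluates f(…) = Σᵢ Σⱼ 𝕀(yi = yj) ‖xi − xj‖² for a candidate labelling, and a labelling that matches the three groups gives a lower energy than a random one.

```python
# Evaluate the within-group energy f(...) for a candidate labelling (sketch only).
import numpy as np

def clustering_energy(X, labels):
    """Sum of squared distances between observations that share the same label."""
    diffs = X[:, None, :] - X[None, :, :]              # pairwise differences x_i - x_j
    sq_dists = (diffs ** 2).sum(axis=-1)               # ||x_i - x_j||^2
    same_group = labels[:, None] == labels[None, :]    # indicator 1[y_i == y_j]
    return float((sq_dists * same_group).sum())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(7, 2)) for c in ([0, 0], [3, 0], [1.5, 3])])

good = np.repeat([0, 1, 2], 7)                 # labels matching the three groups
bad = rng.integers(0, 3, size=len(X))          # arbitrary labels
print(clustering_energy(X, good), "<", clustering_energy(X, bad))
```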

2) Classic algorithms

Classic algorithms in Machine Learning


2.a) Classic algorithms - K-means

K-means algorithm

[Figure sequence: the unlabelled observations, Variable 1 vs. Variable 2, at successive steps of the algorithm]

• N seeds are randomly drawn (in this example N = 4).
• For each observation, we consider the nearest seed. (Remark: Euclidean distances are used in this example.)
• Seeds are then re-centered on their corresponding observations…
• … for each observation, we again consider the nearest seed …
• … and we re-iterate until convergence.
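A minimal from-scratch sketch of this loop in Python (numpy only; in practice one would typically use an existing implementation such as sklearn.cluster.KMeans):

```python
# K-means sketch: random seeds, nearest-seed assignment, re-centering, repeat.
import numpy as np

def kmeans(X, n_seeds=4, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=n_seeds, replace=False)]   # N random seeds
    for _ in range(n_iter):
        # for each observation, consider the nearest seed (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # seeds are then re-centered on their corresponding observations ...
        new_seeds = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else seeds[k] for k in range(n_seeds)])
        if np.allclose(new_seeds, seeds):   # ... and we re-iterate until convergence
            return labels, seeds
        seeds = new_seeds
    return labels, seeds

X = np.random.default_rng(1).normal(size=(20, 2))
labels, seeds = kmeans(X, n_seeds=4)
print(labels)
```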

2.b) Classic algorithms - decision trees

Decision trees

x1, x2, …, xN are the observations (here: xi is a 2D point coordinate)
y1, y2, …, yN are the labels (here: 1 and −1)

The domain is subdivided into sub-domains so as to minimize the variance in each sub-domain (CART).

[Figure sequence: the (Variable 1, Variable 2) domain is split recursively and the corresponding tree grows. All observations are first split on var2 (var2 < 3 and var2 > 3, giving sub-domains D1 and D2); one of these sub-domains is then split on var1 (var1 < 1 and var1 > 1, giving D3 and D4); and so on.]
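A minimal sketch of a CART tree on such 2D labelled points, with scikit-learn as an assumed implementation (the thresholds it learns will generally differ from the var2 > 3 / var1 > 1 splits drawn on the slides):

```python
# Fit a small CART tree on 2D points labelled in {1, -1} and print its splits.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 2))                     # observations x_i (2D points)
y = np.where((X[:, 1] > 3) | (X[:, 0] < 1), 1, -1)      # labels y_i in {1, -1}

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["var1", "var2"]))  # the learned sub-domains
print(tree.predict([[0.5, 4.0]]))                          # predict a new point
```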

2.c) Classic algorithms - random forests

Random forests

x1, x2, …, xN are the observations (here: xi is a 2D point coordinate)
y1, y2, …, yN are the labels (here: 1 and −1)

High dimension: p >> 1 (p = 2 in the former example)

Learning:
• Several trees are defined independently.
• Dimensions are randomly drawn.

Prediction:
• The label predicted at a given point is the one predicted by the majority of trees (bagging).
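A minimal companion sketch for the random forest, again with scikit-learn as an assumed implementation: many trees are fitted on random resamples and random subsets of the dimensions, and the prediction is their majority vote.

```python
# Random forest sketch: ensemble of trees, prediction by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 5, 1, -1)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)
print(forest.predict([[4.0, 4.0], [0.5, 0.5]]))     # majority vote of the 100 trees
```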

2.d) Classic algorithms - Support Vector Machine (SVM)

Support Vector Machine (SVM)

x1, x2, …, xN are the observations (here: xi is a 2D point coordinate)
y1, y2, …, yN are the labels (here: 1 and −1)

[Figure: two linearly separable classes, Variable 1 vs. Variable 2]

We estimate w and b such that:

yi (w · xi − b) ≥ 1   for all 1 ≤ i ≤ n

[Figure: the separating hyperplane (w · x − b) = 0 with its margin lines (w · x − b) = 1 and (w · x − b) = −1, and the normal vector w]

[Figure: a new dataset in which the two classes overlap]

We estimate w and b such that yi (w · xi − b) ≥ 1 for all 1 ≤ i ≤ n… but a full classification with a linear model is now impossible!

We therefore estimate w and b that minimize:

[ (1/n) Σᵢ₌₁ⁿ max( 0 , 1 − yi (w · xi − b) ) ] + λ ‖w‖²

where each term max(0, 1 − yi (w · xi − b)) is > 0 if yi is not well predicted.

Remark: It is possible (and common) to use non-linear separations by replacing the dot products with non-linear relations.
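A minimal sketch of this soft-margin linear SVM with scikit-learn as an assumed implementation (its parameter C plays the role of 1/λ; scikit-learn fits w · x + intercept, so b = −intercept in the slide's notation):

```python
# Linear soft-margin SVM on two overlapping classes (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(20, 2)),
               rng.normal([3, 3], 1.0, size=(20, 2))])   # overlapping classes
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = svm.coef_[0], -svm.intercept_[0]                   # hyperplane w . x - b = 0
margin_ok = np.mean(y * (X @ w - b) >= 1)                 # fraction with hinge term = 0
print("fraction of observations outside the margin:", margin_ok)
print(svm.predict([[1.5, 1.5]]))
```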

2.d) Classic algorithms - Support Vector Machine (SVM)

Support Vector Machine (SVM) — Kernel methods

How to handle this case using a linear model?

[Figure: a dataset that is not linearly separable in (Variable 1, Variable 2), and the same data lifted into a 3D space (Variable 1, Variable 2, Variable 3) where a plane separates the two classes]

We denote an observation xi = (x1i, x2i). We then classify the lifted points Φ(xi) = (x1i, x2i, (x1i)²) instead of the xi.

… there exists a huge literature on this topic.
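A minimal sketch of the lifting idea in Python (scikit-learn assumed; the data and the degree-2 kernel are illustrative): classes separated by a parabola become linearly separable after the explicit lift Φ, and a polynomial kernel achieves the same effect without building Φ by hand.

```python
# Kernel idea: classify the lifted points Phi(x) = (x1, x2, x1**2) with a linear SVM,
# or use a non-linear (polynomial) kernel directly on the original points.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
offset = np.where(rng.random(200) > 0.5, 1.0, -1.0)       # above / below a parabola
x2 = x1 ** 2 + offset + rng.normal(0, 0.2, size=200)
X = np.column_stack([x1, x2])
y = offset.astype(int)                                    # labels in {1, -1}

X_lifted = np.column_stack([x1, x2, x1 ** 2])             # explicit lift Phi(x_i)
print("linear SVM on lifted data:", SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))
print("polynomial-kernel SVM:   ", SVC(kernel="poly", degree=2).fit(X, y).score(X, y))
```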

2.e) Classic algorithms - Logistic regression

Logistic regression

For each observation i ∈ {1, …, n}:
• Explicative variables: xi = (x1i, x2i, …, xpi), possibly with p >> 1
• Response variable: yi ∈ {−1, 1}

Estimate (w = {w1, …, wp}, b) as the minimizer of the (negative) log-likelihood.

Ideally: yi = 1 when b + Σⱼ₌₁ᵖ wj xji > 0, and yi = −1 when b + Σⱼ₌₁ᵖ wj xji < 0.

Remarks:
• Linear classification, as with a linear SVM, but a different model with strong statistical insights
• Scales particularly well when n and/or p is very high
• Modelling constraints on w is the gold standard when p >> n
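A minimal sketch of such a logistic regression in Python (scikit-learn assumed; the data are synthetic):

```python
# Fit (w, b) by (penalized) maximum likelihood and check the learned decision rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                     # x_i = (x_i^1, ..., x_i^p)
w_true, b_true = rng.normal(size=p), 0.5
y = np.where(X @ w_true + b_true > 0, 1, -1)    # y_i in {-1, 1}

clf = LogisticRegression().fit(X, y)
print("learned w:", np.round(clf.coef_[0], 2), " b:", round(float(clf.intercept_[0]), 2))
print("training accuracy:", clf.score(X, y))
```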

3) Over-fitting and cross validation

Overfitting and cross-validation


3.a) Over-fitting and cross validation — overfitting

Overfitting

Training data classified with three strategies: a full decision tree, a truncated decision tree, and a linear SVM or logistic regression.

[Figure: the decision regions produced by each strategy on the same training data]

Training accuracy: 100% for the full decision tree, 95% for the truncated decision tree, 95% for the linear SVM or logistic regression.

Which strategy would you trust most to predict the label of new observations?
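A minimal sketch of this effect in Python (scikit-learn assumed, synthetic noisy data; the exact accuracies will differ from the slide's 100% / 95% / 95%):

```python
# A full tree memorizes the noisy training data; simpler models generalize better.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(0, 0.7, 200) > 0, 1, -1)   # noisy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("full decision tree", DecisionTreeClassifier()),
                    ("truncated decision tree", DecisionTreeClassifier(max_depth=3)),
                    ("logistic regression", LogisticRegression())]:
    model.fit(X_tr, y_tr)
    print(f"{name}: train {model.score(X_tr, y_tr):.2f}, test {model.score(X_te, y_te):.2f}")
```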

3.b) Over-fitting and cross validation — cross-validation

Cross-validation: the fundamental paradigm of Machine Learning for validating trained models.

• Start from all the available data to learn decision rules.
• Split the data into training data and test data.
• Learn the model parameters on the training data (93.75% training accuracy in this example).
• Evaluate the model quality on the test data, with no risk of overfitting (100% test accuracy in this example).

K-folds: K tests to be more robust and additionally evaluate the model stability.

[Figure: the data split into 4 folds; each fold is held out for testing in turn while the model learns on the other three; three folds give 100% test accuracy and one gives 80%]

95% average accuracy and stable parameters… good stuff!

Leave-1-out: n tests (recommended when n is small).

[Figure: each observation x1, x2, … is left out in turn; the model learns on the remaining observations and is tested on the left-out one; here the test succeeds for all observations but one]
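A minimal sketch of both protocols in Python (scikit-learn assumed; the fold accuracies are whatever the synthetic data give, not the slide's numbers):

```python
# K-fold and leave-one-out cross-validation of a simple classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = LogisticRegression()

# K-folds: K tests (here K = 4), mean accuracy and stability across folds
scores = cross_val_score(model, X, y, cv=KFold(n_splits=4, shuffle=True, random_state=0))
print("4-fold accuracies:", scores, " mean:", scores.mean())

# Leave-1-out: n tests, recommended when n is small
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())
```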

4) High dimensionality

High dimensionality… regularization, model selection, and/or dimensionality reduction


4.a) High dimensionality — Example of problem

Project context:
• Observations = MRI images of the brain at different acquisition times (ADNI*)
• Labels = patient state (MCI / AD)
• Prediction of Alzheimer's disease from the morphological evolution of the hippocampus?

[Figure: MRI slices showing the hippocampus at [Baseline] and at [Baseline + 12 months]]

Initial data:
• [Baseline]: the n = 103 patients are MCI
• [Baseline + 12 months]: 84 patients are MCI / 19 patients are AD

* http://adni.bmap.ucla.edu/

4.a) High dimensionality — Example of problem

For each of the n = 103 observations (patients):
• xi: evolution marker on the template, p = 20000 points
• yi: state AD or MCI

Questions:
• Is it possible to discriminate MCI and AD patients based on the shape evolution?
• How to learn the most discriminant markers?

[Figure: for each of the 103 subjects of the MCI and AD groups, the Baseline image is registered to the Baseline + 12 months image]

Treatment 1: estimate the deformations between [Baseline] and [Baseline + 12 months] [Ourselin et al., Im. Vis. Comp., 2001], [Vialard et al., IJCV, 2012]

Treatment 2: transport the evolution markers onto a template / average shape

4.a) High dimensionality — Example of problem

Logistic regression predictive model that defines the probability of the yi depending on the xi:

y = F( X w + b )

where:
• X ∈ ℝⁿˣᵖ: matrix of the n = 103 observations in dimension p = 20000 (# subjects × # points)
• y ∈ {−1, 1}ⁿ: state (AD = −1 / MCI = +1)
• (w, b) ∈ ℝᵖ × ℝ: parameters to estimate

Log-likelihood optimization, with a regularization parameter (mandatory since p > n).

Why the regularization parameter is mandatory when p > n: an analogy with linear systems.

• 2x = 3 → n = 1 equation, p = 1 unknown: OK
• 2x1 + 3x2 = 3 → n = 1, p = 2: KO (under-determined)
• 2x1 + 3x2 = 3 and 3x1 + 1x2 = 1 → n = 2, p = 2: OK
• 2x1 + 3x2 + 1x3 − x4 = 1 and 5x1 − x2 + 2x3 + x4 = 1 → n = 2, p = 4: KO (under-determined)

4.b) High dimensionality — Effect of regularization

Optimization of w using: Lewis & Overton, Nonsmooth optimization via quasi-Newton methods, Math. Programming, 2012.

Tested regularization models:
(1) Ridge
(2) LASSO
(3) Elastic net
(4) Sobolev semi-norm
(5) Total Variation
(6) Fused LASSO
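The formulas attached to each item did not survive the transcript; as a reminder, the standard textbook forms of these penalties are given below (assumed, not copied from the slide), where ∇w denotes the differences between the coefficients of neighbouring points of the shape:

```latex
\begin{align*}
\text{(1) Ridge:}             &\quad \lambda \, \|w\|_2^2 = \lambda \textstyle\sum_j w_j^2\\
\text{(2) LASSO:}             &\quad \lambda \, \|w\|_1 = \lambda \textstyle\sum_j |w_j|\\
\text{(3) Elastic net:}       &\quad \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2\\
\text{(4) Sobolev semi-norm:} &\quad \lambda \, \|\nabla w\|_2^2\\
\text{(5) Total Variation:}   &\quad \lambda \, \|\nabla w\|_1\\
\text{(6) Fused LASSO:}       &\quad \lambda_1 \|w\|_1 + \lambda_2 \|\nabla w\|_1
\end{align*}
```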

4.b) High dimensionality — Effect of regularization (and model selection)

[Figure: representation of w for three values of λ on a slice of the hippocampus, for each of the penalties (1) Ridge, (2) LASSO, (3) Elastic net, (4) Sobolev semi-norm, (5) Total Variation, (6) Fused LASSO. Blue and red: strong local influence; green: little or no local influence.]

4.b) High dimensionality — Effect of regularization

Results obtained using a cross-validation method (here leave-10%-out):
• Spec + Sens = 2: good prediction in 100% of the cases
• Spec + Sens = 1: coin flipping (heads or tails) has the same predictive power
• Spec + Sens = 0: good prediction in 0% of the cases

Best results obtained using a regularization pertinent with regard to the data:
• spatial distribution taken into account
• allows clear transitions

[Fiot J.B. et al., NeuroImage: Clinical, 2012]
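A minimal sketch of why the choice of penalty matters when p >> n, in Python with scikit-learn as an assumed implementation (synthetic data, not the ADNI study): an L1 (LASSO) penalty drives most coefficients of w exactly to zero, whereas an L2 (Ridge) penalty keeps them all non-zero but small.

```python
# Regularized logistic regression with p >> n: L2 vs. L1 penalties on w.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 103, 2000                               # many more variables than observations
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:10] = 2.0                              # only 10 variables actually matter
y = np.where(X @ w_true + rng.normal(0, 1.0, n) > 0, 1, -1)

ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients, ridge:", int(np.sum(ridge.coef_ != 0)),
      " lasso:", int(np.sum(lasso.coef_ != 0)))
```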

4.c) High dimensionality — Dimensionality reduction using SVD

National records (in seconds) of p = 9 athletic events for n = 26 countries:

Country          100m   200m   400m   800m    1500m   5000m   10000m   SemiMarathon  Marathon
Australie        9.93   20.06  44.38  104.40  211.96  775.76  1649.73  3602          7671
Belgique         10.02  20.19  44.78  103.86  214.13  769.71  1612.30  3605          7640
Brésil           10.00  19.89  44.29  101.77  213.25  799.43  1648.12  3573          7565
RoyaumeUni       9.87   19.87  44.36  101.73  209.67  780.41  1638.14  3609          7633
Canada           9.84   20.17  44.44  103.68  211.71  793.96  1656.01  3650          7809
Chine            10.17  20.54  45.25  106.44  216.49  805.14  1670.00  3635          7695
Croatie          10.25  20.76  45.64  104.07  213.30  817.76  1704.32  3827          8225
Ethiopie         10.50  21.08  45.89  106.08  211.13  757.35  1577.53  3535          7439
France           9.99   20.16  44.46  103.15  208.98  778.83  1642.78  3658          7596
Allemagne        10.06  20.20  44.33  103.65  211.58  774.70  1641.53  3634          7727
Inde             10.30  20.73  45.48  105.77  218.00  809.70  1682.89  3672          7920
Iran             10.29  21.11  46.37  104.74  218.80  833.40  1762.65  4103          8903
Italie           10.01  19.72  45.19  103.17  212.78  785.59  1636.50  3620          7642
Jamaïque         9.58   19.19  44.49  105.21  219.19  813.10  1712.44  3816          8199
Japon            10.00  20.03  44.78  106.18  217.42  793.20  1655.09  3625          7576
Kenya            10.26  20.43  44.18  102.01  206.34  759.74  1587.85  3513          7467
Lituanie         10.33  20.88  45.73  106.64  220.90  797.90  1651.50  3851          7955
NouvelleZélande  10.11  20.42  46.09  104.30  212.17  790.19  1661.95  3732          7815
Portugal         9.86   20.01  46.11  104.91  210.07  782.86  1632.47  3665          7596
Russie           10.10  20.23  44.60  102.47  212.28  791.99  1673.12  3675          7747
AfriqueduSud     10.06  20.11  44.59  102.69  213.56  794.16  1649.94  3678          7593
Espagne          10.14  20.59  44.96  103.83  208.95  782.54  1634.44  3592          7562
Suède            10.18  20.30  44.56  105.54  216.49  797.59  1675.74  3655          7838
Suisse           10.16  20.41  44.99  102.55  211.75  787.54  1673.16  3686          7643
Ukraine          10.07  20.00  45.11  105.08  210.33  790.78  1679.80  3711          7635
USA              9.69   19.32  43.18  102.60  209.30  776.27  1633.98  3583          7538

How to establish a general ranking between these countries ???

Weighted sum between the scores, then ranking of these sums.

4.c) High dimensionality — Dimensionality reduction using SVD

A weighted sum of the scores is equivalent to a matrix × vector multiplication:

Vector containing the scores = M · w

where M is the matrix of national records above.

4.c) High dimensionality — Dimensionality reduction using SVD

One can also look for a vector of norm 1 that maximizes the variability between the scores:

Optimal vector = 1st eigenvector (v1) of the SVD
Variability level = 1st eigenvalue (λ1) of the SVD

Vector of scores with the highest variability = M · v1

4.c) High dimensionality — Dimensionality reduction using SVD

One can now search for the vector of norm 1, orthogonal to v1, that maximizes the variability:

Optimal vector = 2nd eigenvector (v2) of the PCA
Variability level = 2nd eigenvalue (λ2) of the PCA

… and so on. These vectors can be calculated analytically.

4.c) High dimensionality — Dimensionality reduction using SVD

[Figure: in black, the projection of the data (countries) on PC2 and PC3; in red, the influence of the variables (events) in PC2 and PC3]

Fantastic tool to visualize and interpret high dimensional data…

4.c) High dimensionality — Dimensionality reduction using SVD

[Figure: scree plot of the eigenvalues λ1, λ2, …, λ8, i.e. the variability captured by each principal component]

… and a powerful tool to reduce the problem dimensionality before training a M.L. model: each country is re-described by its projections (proj.PC1, proj.PC2, proj.PC3) instead of its 9 original records.

Projection of the data from a 9D space to a 3D space preserves here 90% of their variability!
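A minimal sketch of this SVD/PCA pipeline in Python (numpy only; a random stand-in replaces the 26 × 9 records matrix, so the 90% figure is not reproduced; note also that the slides do not say whether the columns are additionally scaled to unit variance, which changes the result):

```python
# Center the n x p matrix, take its SVD, read off the captured variability,
# and project each country onto the first three principal components.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(26, 9))          # stand-in for the table of national records above

Xc = M - M.mean(axis=0)                              # center each event (column)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # principal directions = rows of Vt
explained = s ** 2 / np.sum(s ** 2)                  # captured variability per component
print("variability captured by PC1-PC3:", explained[:3].sum())

proj = Xc @ Vt[:3].T                  # (proj.PC1, proj.PC2, proj.PC3) for each country
print(proj.shape)                     # (26, 3)
```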

5) Supervised learning using Neural Networks

Supervised learning using Neural Networks


[Figure: a neural network diagram with known inputs, learned weights and predicted outputs; image from https://pythonprogramming.net/neural-networks-machine-learning-tutorial/]

Deep learning…

• Very efficient in important applications (signal, images).
• Computationally heavy learning phase but quick predictions.
• Very large number of parameters to learn.
• Requires large databases of annotated data (or wise network designs).

5) Supervised learning using Neural Networks — User point of view

Prediction: an input image I (RGB, 200×200) of a dog or a cat is fed to a classifier (black-box), which returns two outputs h1(I) and h2(I):
• If dog: h1(I) == 0; if cat: h1(I) == 1
• If nice: h2(I) == 0; if aggressive: h2(I) == 1

[Figure sequence: example images and the corresponding pairs of predicted outputs, such as (1, 0), (1, 1) and (0, 0)]

5) Supervised learning using Neural Networks — User point of view

Training phase: optimization of the classifier parameters to get the best predictions on average.
• Input training data: many images of dogs and cats
• Output training data: the labels of each image

5) Supervised learning using Neural Networks — Into the black-box

[Figure: a network with an input layer, hidden layers (Layer 1, Layer 2, Layer 3, …, Layer L) and an output layer]

The xi are typically the intensities of an RGB image I in each of its channels; the outputs are the predicted labels.

Minimize the expectation of the prediction error (≈ its average over the K training observations), i.e. the discrepancy between the predicted labels and the known labels.

Stochastic gradient descent: the parameters of each layer l are updated in the opposite direction of the gradient of the prediction error, where the gradient is:
• calculated on a subsample of the K observations at each iteration (batch);
• calculated analytically if l = L−1;
• back-propagated if l < L−1.

Optimization of the expectation (i.e. the average) of the prediction error.

In practice:
• various types of layers
• various types of architectures
• various strategies to perform the stochastic gradient descent

Important hidden properties of neural networks:
• prediction and training can be straightforwardly parallelized on GPUs
• the Nvidia cuDNN library is massively used by Keras, TensorFlow, Theano, PyTorch, …
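A minimal sketch of this black box in Python with Keras (one of the libraries named above; synthetic vectors stand in for the dog/cat images, and the architecture is an arbitrary choice):

```python
# A small fully-connected network trained by stochastic gradient descent on batches.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")          # stand-in inputs x_i
y = (X[:, 0] + X[:, 1] > 0).astype("float32")              # stand-in labels in {0, 1}

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),             # hidden layer 1
    keras.layers.Dense(32, activation="relu"),             # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"),           # output layer
])
# Minimize the average prediction error over the training observations
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=5, verbose=0)        # batches + backpropagation
print(model.predict(X[:3], verbose=0))                     # predicted labels for new inputs
```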

That’s all for now

MERCI !!! (Thank you!)
