
Introduction to Machine Learning

Laurent Risser Institut de Mathématiques de Toulouse

lrisser@math.univ-toulouse.fr


0) From Statistics to Machine Learning

As a starter: From Statistics to Machine Learning


1940–70: Classical statistics deals with tested hypotheses (e.g. are the students in class A significantly taller than the students in class B?). There are typically n ≈ 30 observations and p < 10 variables.

1970s: The use of computers becomes increasingly popular and larger volumes of data are explored in statistics. Expert systems also start making automatic decisions based on rules injected by human experts (e.g. if blood pressure < threshold and there are spots on the skin, then prescribe a specific medication).

1980s: Expert systems are made obsolete by machine learning and neural networks: decision rules are now automatically defined from observed data.

1990s: 1st paradigm shift: observed data are no longer collected for a planned study but stored in databases and then re-used, moving from Data Mining to Knowledge Discovery.

2000s: 2nd paradigm shift: the number of variables p becomes increasingly large (typically in omics data, where p >> n). Prediction quality matters more than model interpretability (black-box models). The curse of dimensionality makes regularization important.

2010s: 3rd paradigm shift: the number of observations n is now increasingly large (e-commerce, geo-localisation, …). Databases are structured in clouds and computations run on clusters (big data). Decisions also have to be almost immediate.

Today: … fast and robust interpretation of videos (autonomous vehicles) … explainability of black-box decision rules (social issues and certifiability) … complex data (small data).

See also the course of P. Besse (INSA Toulouse) on statistical learning: http://www.math.univ-toulouse.fr/~besse/enseignement.html


Talk overview

• From statistics to machine learning
• Introductory examples
   • Supervised learning
   • Unsupervised learning
• Classic algorithms in machine learning
   • K-means
   • Decision trees and random forests
   • SVM
   • Logistic regression
• Overfitting and cross-validation
   • Overfitting
   • Cross-validation
• High dimensionality and regularization
   • Modeling a real-life problem
   • Effect of regularization
   • Dimensionality reduction using PCA
• Supervised learning using neural networks
• Conclusion

1) Introductory examples

Two introductory examples


Supervised learning — Training Data

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Figure: scatter plot of the labelled training data, Variable 1 vs. Variable 2]

1.a) Introductory examples — Supervised learning

Example: Variable 1 = age, Variable 2 = monthly income, state = buys a product at Christmas.


1.a) Introductory examples — Supervised learning

Supervised learning — Prediction

n = 20 observations, p = 2 variables (problem dimension), label with 2 states

[Figure: the training data plus a new, unlabelled observation "?"]

What is the most likely state of "?" ?

Supervised learning — Prediction

Supervised learning (here using a linear model), then prediction on new data.

[Figure: the linear decision boundary learned from the training data; the most likely state of "?" is given by the side of the boundary on which it falls]


To sum up:

1. We have labelled training data.
2. Choose a model to classify the data.
3. Optimise the model parameters (learning) with respect to a loss function (e.g. the prediction error).
4. Validate the model on a test set.
5. Predict on new observations.
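As an illustration, here is a minimal sketch of these five steps in Python with scikit-learn and synthetic data (the library choice and the numeric values are assumptions, not taken from the slides):

```python
# Minimal sketch of the supervised workflow (synthetic data, scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1. Labelled training data: n observations, p = 2 variables, labels with 2 states
X = np.vstack([rng.normal([30, 1500], [5, 300], size=(10, 2)),    # state -1
               rng.normal([50, 3500], [5, 300], size=(10, 2))])   # state +1
y = np.array([-1] * 10 + [1] * 10)

# 2. and 3. Choose a (linear) model and optimise its parameters on a training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# 4. Validate the model on the held-out test set
print("test accuracy:", model.score(X_test, y_test))

# 5. Predict the most likely state of a new observation "?"
print("prediction for '?':", model.predict([[42, 2500]]))
```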

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Input Data

n = 20 observations, p = 2 variables (problem dimension), no label

[Figure: scatter plot of the unlabelled observations, Variable 1 vs. Variable 2]

Is it reasonably possible to distinguish several sub-groups of observations?

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Distance between the observations

[Figure: for one observation xi, the distances dist(xi, xj), j = {1, …, i−1, i+1, …, n}, to all the other observations]

Distances between the observations: {dist(xi, xj)} for i, j ∈ {1, …, n}²


1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Learning

n = 20 observations, p = 2 variables (problem dimension), no label

Is it reasonably possible to distinguish several sub-groups of observations?

[Figure: the observations partitioned into Group 1, Group 2 and Group 3, Variable 1 vs. Variable 2]

From the distances between the observations {dist(xi, xj)}, i, j ∈ {1, …, n}², we define an energy to minimize with respect to the labels yi:

f(y1, …, yn, {dist(xi, xj)}i,j∈{1,…,n}²),   e.g.   f(…) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ 𝕀(yi = yj) ‖xi − xj‖₂²

1.b) Introductory examples — Unsupervised learning

Unsupervised learning — Link with graph clustering

[Figure: the observations seen as a graph, with strong links between nearby points; graph clustering recovers Group 1, Group 2 and Group 3]


To sum up:

1. We have training data with no known labels.
2. Choose a model to classify the data.
3. Optimise the model parameters (learning) based on an internal energy criterion, as in the sketch below.
4. Predict on new observations.
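A minimal sketch of the energy above in Python (numpy only; the helper name clustering_energy and the synthetic data are assumptions): it evaluates f(…) = Σᵢ Σⱼ 𝕀(yi = yj) ‖xi − xj‖² for a candidate labelling, and a labelling that matches the three groups gives a lower energy than a random one.

```python
# Evaluate the within-group energy f(...) for a candidate labelling (sketch only).
import numpy as np

def clustering_energy(X, labels):
    """Sum of squared distances between observations that share the same label."""
    diffs = X[:, None, :] - X[None, :, :]              # pairwise differences x_i - x_j
    sq_dists = (diffs ** 2).sum(axis=-1)               # ||x_i - x_j||^2
    same_group = labels[:, None] == labels[None, :]    # indicator 1[y_i == y_j]
    return float((sq_dists * same_group).sum())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(7, 2)) for c in ([0, 0], [3, 0], [1.5, 3])])

good = np.repeat([0, 1, 2], 7)                 # labels matching the three groups
bad = rng.integers(0, 3, size=len(X))          # arbitrary labels
print(clustering_energy(X, good), "<", clustering_energy(X, bad))
```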

2) Classic algorithms

Classic algorithms in Machine Learning


2.a) Classic algorithms - K-means

K-means algorithm

[Figure sequence: the unlabelled observations, Variable 1 vs. Variable 2, at successive steps of the algorithm]

• N seeds are randomly drawn (in this example N = 4).
• For each observation, we consider the nearest seed. (Remark: Euclidean distances are used in this example.)
• Seeds are then re-centered on their corresponding observations…
• … for each observation, we again consider the nearest seed …
• … and we re-iterate until convergence.
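A minimal from-scratch sketch of this loop in Python (numpy only; in practice one would typically use an existing implementation such as sklearn.cluster.KMeans):

```python
# K-means sketch: random seeds, nearest-seed assignment, re-centering, repeat.
import numpy as np

def kmeans(X, n_seeds=4, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=n_seeds, replace=False)]   # N random seeds
    for _ in range(n_iter):
        # for each observation, consider the nearest seed (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # seeds are then re-centered on their corresponding observations ...
        new_seeds = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else seeds[k] for k in range(n_seeds)])
        if np.allclose(new_seeds, seeds):   # ... and we re-iterate until convergence
            return labels, seeds
        seeds = new_seeds
    return labels, seeds

X = np.random.default_rng(1).normal(size=(20, 2))
labels, seeds = kmeans(X, n_seeds=4)
print(labels)
```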

2.b) Classic algorithms - decision trees

Decision trees

x1, x2, …, xN are the observations (here: xi is a 2D point coordinate)
y1, y2, …, yN are the labels (here: 1 and −1)

The domain is subdivided into sub-domains so as to minimize the variance in each sub-domain (CART).

[Figure sequence: the (Variable 1, Variable 2) domain is split recursively and the corresponding tree grows. All observations are first split on var2 (var2 < 3 and var2 > 3, giving sub-domains D1 and D2); one of these sub-domains is then split on var1 (var1 < 1 and var1 > 1, giving D3 and D4); and so on.]
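A minimal sketch of a CART tree on such 2D labelled points, with scikit-learn as an assumed implementation (the thresholds it learns will generally differ from the var2 > 3 / var1 > 1 splits drawn on the slides):

```python
# Fit a small CART tree on 2D points labelled in {1, -1} and print its splits.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 2))                     # observations x_i (2D points)
y = np.where((X[:, 1] > 3) | (X[:, 0] < 1), 1, -1)      # labels y_i in {1, -1}

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["var1", "var2"]))  # the learned sub-domains
print(tree.predict([[0.5, 4.0]]))                          # predict a new point
```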

2.c) Classic algorithms - random forests

Random forests

x1, x2, …, xN are the observations (here: xi is a 2D point coordinate)
y1, y2, …, yN are the labels (here: 1 and −1)

High dimension: p >> 1 (p = 2 in the former example)

Learning:
• Several trees are defined independently.
• Dimensions are randomly drawn.

Prediction:
• The label predicted at a given point is the one predicted by the majority of trees (bagging).
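A minimal companion sketch for the random forest, again with scikit-learn as an assumed implementation: many trees are fitted on random resamples and random subsets of the dimensions, and the prediction is their majority vote.

```python
# Random forest sketch: ensemble of trees, prediction by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 5, 1, -1)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)
print(forest.predict([[4.0, 4.0], [0.5, 0.5]]))     # majority vote of the 100 trees
```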

2.d) Classic algorithms - Support Vector Machine (SVM)

Support Vector Machine (SVM)

x1, x2, …, xN are the observations (here: xi is a 2D point coordinate)
y1, y2, …, yN are the labels (here: 1 and −1)

[Figure: two linearly separable classes, Variable 1 vs. Variable 2]

We estimate w and b such that:

yi (w · xi − b) ≥ 1   for all 1 ≤ i ≤ n

[Figure: the separating hyperplane (w · x − b) = 0 with its margin lines (w · x − b) = 1 and (w · x − b) = −1, and the normal vector w]

[Figure: a new dataset in which the two classes overlap]

We estimate w and b such that yi (w · xi − b) ≥ 1 for all 1 ≤ i ≤ n… but a full classification with a linear model is now impossible!

We therefore estimate w and b that minimize:

[ (1/n) Σᵢ₌₁ⁿ max( 0 , 1 − yi (w · xi − b) ) ] + λ ‖w‖²

where each term max(0, 1 − yi (w · xi − b)) is > 0 if yi is not well predicted.

Remark: It is possible (and common) to use non-linear separations by replacing the dot products with non-linear relations.
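A minimal sketch of this soft-margin linear SVM with scikit-learn as an assumed implementation (its parameter C plays the role of 1/λ; scikit-learn fits w · x + intercept, so b = −intercept in the slide's notation):

```python
# Linear soft-margin SVM on two overlapping classes (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(20, 2)),
               rng.normal([3, 3], 1.0, size=(20, 2))])   # overlapping classes
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = svm.coef_[0], -svm.intercept_[0]                   # hyperplane w . x - b = 0
margin_ok = np.mean(y * (X @ w - b) >= 1)                 # fraction with hinge term = 0
print("fraction of observations outside the margin:", margin_ok)
print(svm.predict([[1.5, 1.5]]))
```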

2.d) Classic algorithms - Support Vector Machine (SVM)

Support Vector Machine (SVM) — Kernel methods

How to handle this case using a linear model?

[Figure: a dataset that is not linearly separable in (Variable 1, Variable 2), and the same data lifted into a 3D space (Variable 1, Variable 2, Variable 3) where a plane separates the two classes]

We denote an observation xi = (x1i, x2i). We then classify the lifted points Φ(xi) = (x1i, x2i, (x1i)²) instead of the xi.

… there exists a huge literature on this topic.
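A minimal sketch of the lifting idea in Python (scikit-learn assumed; the data and the degree-2 kernel are illustrative): classes separated by a parabola become linearly separable after the explicit lift Φ, and a polynomial kernel achieves the same effect without building Φ by hand.

```python
# Kernel idea: classify the lifted points Phi(x) = (x1, x2, x1**2) with a linear SVM,
# or use a non-linear (polynomial) kernel directly on the original points.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
offset = np.where(rng.random(200) > 0.5, 1.0, -1.0)       # above / below a parabola
x2 = x1 ** 2 + offset + rng.normal(0, 0.2, size=200)
X = np.column_stack([x1, x2])
y = offset.astype(int)                                    # labels in {1, -1}

X_lifted = np.column_stack([x1, x2, x1 ** 2])             # explicit lift Phi(x_i)
print("linear SVM on lifted data:", SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))
print("polynomial-kernel SVM:   ", SVC(kernel="poly", degree=2).fit(X, y).score(X, y))
```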

2.e) Classic algorithms - Logistic regression

Logistic regression

For each observation i ∈ {1, …, n}:
• Explicative variables: xi = (x1i, x2i, …, xpi), possibly with p >> 1
• Response variable: yi ∈ {−1, 1}

Estimate (w = {w1, …, wp}, b) as the minimizer of the (negative) log-likelihood.

Ideally: yi = 1 when b + Σⱼ₌₁ᵖ wj xji > 0, and yi = −1 when b + Σⱼ₌₁ᵖ wj xji < 0.

Remarks:
• Linear classification, as with a linear SVM, but a different model with strong statistical insights
• Scales particularly well when n and/or p is very high
• Modelling constraints on w is the gold standard when p >> n
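A minimal sketch of such a logistic regression in Python (scikit-learn assumed; the data are synthetic):

```python
# Fit (w, b) by (penalized) maximum likelihood and check the learned decision rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                     # x_i = (x_i^1, ..., x_i^p)
w_true, b_true = rng.normal(size=p), 0.5
y = np.where(X @ w_true + b_true > 0, 1, -1)    # y_i in {-1, 1}

clf = LogisticRegression().fit(X, y)
print("learned w:", np.round(clf.coef_[0], 2), " b:", round(float(clf.intercept_[0]), 2))
print("training accuracy:", clf.score(X, y))
```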

3) Over-fitting and cross validation

Overfitting and cross-validation


3.a) Over-fitting and cross validation — overfitting

Overfitting

Training data classified with three strategies: a full decision tree, a truncated decision tree, and a linear SVM or logistic regression.

[Figure: the decision regions produced by each strategy on the same training data]

Training accuracy: 100% for the full decision tree, 95% for the truncated decision tree, 95% for the linear SVM or logistic regression.

Which strategy would you trust most to predict the label of new observations?
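A minimal sketch of this effect in Python (scikit-learn assumed, synthetic noisy data; the exact accuracies will differ from the slide's 100% / 95% / 95%):

```python
# A full tree memorizes the noisy training data; simpler models generalize better.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(0, 0.7, 200) > 0, 1, -1)   # noisy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("full decision tree", DecisionTreeClassifier()),
                    ("truncated decision tree", DecisionTreeClassifier(max_depth=3)),
                    ("logistic regression", LogisticRegression())]:
    model.fit(X_tr, y_tr)
    print(f"{name}: train {model.score(X_tr, y_tr):.2f}, test {model.score(X_te, y_te):.2f}")
```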

3.b) Over-fitting and cross validation — cross-validation

Cross-validation: the fundamental paradigm of Machine Learning for validating trained models.

• Start from all the available data to learn decision rules.
• Split the data into training data and test data.
• Learn the model parameters on the training data (93.75% training accuracy in this example).
• Evaluate the model quality on the test data, with no risk of overfitting (100% test accuracy in this example).

K-folds: K tests to be more robust and additionally evaluate the model stability.

[Figure: the data split into 4 folds; each fold is held out for testing in turn while the model learns on the other three; three folds give 100% test accuracy and one gives 80%]

95% average accuracy and stable parameters… good stuff!

Leave-1-out: n tests (recommended when n is small).

[Figure: each observation x1, x2, … is left out in turn; the model learns on the remaining observations and is tested on the left-out one; here the test succeeds for all observations but one]
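A minimal sketch of both protocols in Python (scikit-learn assumed; the fold accuracies are whatever the synthetic data give, not the slide's numbers):

```python
# K-fold and leave-one-out cross-validation of a simple classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = LogisticRegression()

# K-folds: K tests (here K = 4), mean accuracy and stability across folds
scores = cross_val_score(model, X, y, cv=KFold(n_splits=4, shuffle=True, random_state=0))
print("4-fold accuracies:", scores, " mean:", scores.mean())

# Leave-1-out: n tests, recommended when n is small
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())
```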

4) High dimensionality

High dimensionality… regularization, model selection, and/or dimensionality reduction


4.a) High dimensionality — Example of problem

Project context:
• Observations = MRI images of the brain at different acquisition times (ADNI*)
• Labels = patient state (MCI / AD)
• Prediction of Alzheimer's disease from the morphological evolution of the hippocampus?

[Figure: MRI slices showing the hippocampus at [Baseline] and at [Baseline + 12 months]]

Initial data:
• [Baseline]: the n = 103 patients are MCI
• [Baseline + 12 months]: 84 patients are MCI / 19 patients are AD

* http://adni.bmap.ucla.edu/

4.a) High dimensionality — Example of problem

For each of the n = 103 observations (patients):
• xi: evolution marker on the template, p = 20000 points
• yi: state AD or MCI

Questions:
• Is it possible to discriminate MCI and AD patients based on the shape evolution?
• How to learn the most discriminant markers?

[Figure: for each of the 103 subjects of the MCI and AD groups, the Baseline image is registered to the Baseline + 12 months image]

Treatment 1: estimate the deformations between [Baseline] and [Baseline + 12 months] [Ourselin et al., Im. Vis. Comp., 2001], [Vialard et al., IJCV, 2012]

Treatment 2: transport the evolution markers onto a template / average shape

4.a) High dimensionality — Example of problem

Logistic regression predictive model that defines the probability of the yi depending on the xi:

y = F( X w + b )

where:
• X ∈ ℝⁿˣᵖ: matrix of the n = 103 observations in dimension p = 20000 (# subjects × # points)
• y ∈ {−1, 1}ⁿ: state (AD = −1 / MCI = +1)
• (w, b) ∈ ℝᵖ × ℝ: parameters to estimate

Log-likelihood optimization, with a regularization parameter (mandatory since p > n).

Why the regularization parameter is mandatory when p > n: an analogy with linear systems.

• 2x = 3 → n = 1 equation, p = 1 unknown: OK
• 2x1 + 3x2 = 3 → n = 1, p = 2: KO (under-determined)
• 2x1 + 3x2 = 3 and 3x1 + 1x2 = 1 → n = 2, p = 2: OK
• 2x1 + 3x2 + 1x3 − x4 = 1 and 5x1 − x2 + 2x3 + x4 = 1 → n = 2, p = 4: KO (under-determined)

4.b) High dimensionality — Effect of regularization

Optimization of w using: Lewis & Overton, Nonsmooth optimization via quasi-Newton methods, Math. Programming, 2012.

Tested regularization models:
(1) Ridge
(2) LASSO
(3) Elastic net
(4) Sobolev semi-norm
(5) Total Variation
(6) Fused LASSO
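The formulas attached to each item did not survive the transcript; as a reminder, the standard textbook forms of these penalties are given below (assumed, not copied from the slide), where ∇w denotes the differences between the coefficients of neighbouring points of the shape:

```latex
\begin{align*}
\text{(1) Ridge:}             &\quad \lambda \, \|w\|_2^2 = \lambda \textstyle\sum_j w_j^2\\
\text{(2) LASSO:}             &\quad \lambda \, \|w\|_1 = \lambda \textstyle\sum_j |w_j|\\
\text{(3) Elastic net:}       &\quad \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2\\
\text{(4) Sobolev semi-norm:} &\quad \lambda \, \|\nabla w\|_2^2\\
\text{(5) Total Variation:}   &\quad \lambda \, \|\nabla w\|_1\\
\text{(6) Fused LASSO:}       &\quad \lambda_1 \|w\|_1 + \lambda_2 \|\nabla w\|_1
\end{align*}
```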

4.b) High dimensionality — Effect of regularization (and model selection)

[Figure: representation of w for three values of λ on a slice of the hippocampus, for each of the penalties (1) Ridge, (2) LASSO, (3) Elastic net, (4) Sobolev semi-norm, (5) Total Variation, (6) Fused LASSO. Blue and red: strong local influence; green: little or no local influence.]

4.b) High dimensionality — Effect of regularization

Results obtained using a cross-validation method (here leave-10%-out):
• Spec + Sens = 2: good prediction in 100% of the cases
• Spec + Sens = 1: coin flipping (heads or tails) has the same predictive power
• Spec + Sens = 0: good prediction in 0% of the cases

Best results obtained using a regularization pertinent with regard to the data:
• spatial distribution taken into account
• allows clear transitions

[Fiot J.B. et al., NeuroImage: Clinical, 2012]
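A minimal sketch of why the choice of penalty matters when p >> n, in Python with scikit-learn as an assumed implementation (synthetic data, not the ADNI study): an L1 (LASSO) penalty drives most coefficients of w exactly to zero, whereas an L2 (Ridge) penalty keeps them all non-zero but small.

```python
# Regularized logistic regression with p >> n: L2 vs. L1 penalties on w.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 103, 2000                               # many more variables than observations
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:10] = 2.0                              # only 10 variables actually matter
y = np.where(X @ w_true + rng.normal(0, 1.0, n) > 0, 1, -1)

ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients, ridge:", int(np.sum(ridge.coef_ != 0)),
      " lasso:", int(np.sum(lasso.coef_ != 0)))
```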

4.c) High dimensionality — Dimensionality reduction using SVD

National records (in seconds) of p = 9 athletic events for n = 26 countries:

Country          100m   200m   400m   800m    1500m   5000m   10000m   SemiMarathon  Marathon
Australie        9.93   20.06  44.38  104.40  211.96  775.76  1649.73  3602          7671
Belgique         10.02  20.19  44.78  103.86  214.13  769.71  1612.30  3605          7640
Brésil           10.00  19.89  44.29  101.77  213.25  799.43  1648.12  3573          7565
RoyaumeUni       9.87   19.87  44.36  101.73  209.67  780.41  1638.14  3609          7633
Canada           9.84   20.17  44.44  103.68  211.71  793.96  1656.01  3650          7809
Chine            10.17  20.54  45.25  106.44  216.49  805.14  1670.00  3635          7695
Croatie          10.25  20.76  45.64  104.07  213.30  817.76  1704.32  3827          8225
Ethiopie         10.50  21.08  45.89  106.08  211.13  757.35  1577.53  3535          7439
France           9.99   20.16  44.46  103.15  208.98  778.83  1642.78  3658          7596
Allemagne        10.06  20.20  44.33  103.65  211.58  774.70  1641.53  3634          7727
Inde             10.30  20.73  45.48  105.77  218.00  809.70  1682.89  3672          7920
Iran             10.29  21.11  46.37  104.74  218.80  833.40  1762.65  4103          8903
Italie           10.01  19.72  45.19  103.17  212.78  785.59  1636.50  3620          7642
Jamaïque         9.58   19.19  44.49  105.21  219.19  813.10  1712.44  3816          8199
Japon            10.00  20.03  44.78  106.18  217.42  793.20  1655.09  3625          7576
Kenya            10.26  20.43  44.18  102.01  206.34  759.74  1587.85  3513          7467
Lituanie         10.33  20.88  45.73  106.64  220.90  797.90  1651.50  3851          7955
NouvelleZélande  10.11  20.42  46.09  104.30  212.17  790.19  1661.95  3732          7815
Portugal         9.86   20.01  46.11  104.91  210.07  782.86  1632.47  3665          7596
Russie           10.10  20.23  44.60  102.47  212.28  791.99  1673.12  3675          7747
AfriqueduSud     10.06  20.11  44.59  102.69  213.56  794.16  1649.94  3678          7593
Espagne          10.14  20.59  44.96  103.83  208.95  782.54  1634.44  3592          7562
Suède            10.18  20.30  44.56  105.54  216.49  797.59  1675.74  3655          7838
Suisse           10.16  20.41  44.99  102.55  211.75  787.54  1673.16  3686          7643
Ukraine          10.07  20.00  45.11  105.08  210.33  790.78  1679.80  3711          7635
USA              9.69   19.32  43.18  102.60  209.30  776.27  1633.98  3583          7538

How to establish a general ranking between these countries ???

Weighted sum between the scores, then ranking of these sums.

4.c) High dimensionality — Dimensionality reduction using SVD

A weighted sum of the scores is equivalent to a matrix × vector multiplication:

Vector containing the scores = M · w

where M is the matrix of national records above.

4.c) High dimensionality — Dimensionality reduction using SVD

One can also look for a vector of norm 1 that maximizes the variability between the scores:

Optimal vector = 1st eigenvector (v1) of the SVD
Variability level = 1st eigenvalue (λ1) of the SVD

Vector of scores with the highest variability = M · v1

4.c) High dimensionality — Dimensionality reduction using SVD

One can now search for the vector of norm 1, orthogonal to v1, that maximizes the variability:

Optimal vector = 2nd eigenvector (v2) of the PCA
Variability level = 2nd eigenvalue (λ2) of the PCA

… and so on. These vectors can be calculated analytically.

4.c) High dimensionality — Dimensionality reduction using SVD

[Figure: in black, the projection of the data (countries) on PC2 and PC3; in red, the influence of the variables (events) in PC2 and PC3]

Fantastic tool to visualize and interpret high dimensional data…

4.c) High dimensionality — Dimensionality reduction using SVD

[Figure: scree plot of the eigenvalues λ1, λ2, …, λ8, i.e. the variability captured by each principal component]

… and a powerful tool to reduce the problem dimensionality before training a M.L. model: each country is re-described by its projections (proj.PC1, proj.PC2, proj.PC3) instead of its 9 original records.

Projection of the data from a 9D space to a 3D space preserves here 90% of their variability!
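A minimal sketch of this SVD/PCA pipeline in Python (numpy only; a random stand-in replaces the 26 × 9 records matrix, so the 90% figure is not reproduced; note also that the slides do not say whether the columns are additionally scaled to unit variance, which changes the result):

```python
# Center the n x p matrix, take its SVD, read off the captured variability,
# and project each country onto the first three principal components.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(26, 9))          # stand-in for the table of national records above

Xc = M - M.mean(axis=0)                              # center each event (column)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # principal directions = rows of Vt
explained = s ** 2 / np.sum(s ** 2)                  # captured variability per component
print("variability captured by PC1-PC3:", explained[:3].sum())

proj = Xc @ Vt[:3].T                  # (proj.PC1, proj.PC2, proj.PC3) for each country
print(proj.shape)                     # (26, 3)
```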

5) Supervised learning using Neural Networks

Supervised learning using Neural Networks


[Figure: a neural network diagram with known inputs, learned weights and predicted outputs; image from https://pythonprogramming.net/neural-networks-machine-learning-tutorial/]

Deep learning…

• Very efficient in important applications (signal, images).
• Computationally heavy learning phase but quick predictions.
• Very large number of parameters to learn.
• Requires large databases of annotated data (or wise network designs).

5) Supervised learning using Neural Networks — User point of view

Prediction: an input image I (RGB, 200×200) of a dog or a cat is fed to a classifier (black-box), which returns two outputs h1(I) and h2(I):
• If dog: h1(I) == 0; if cat: h1(I) == 1
• If nice: h2(I) == 0; if aggressive: h2(I) == 1

[Figure sequence: example images and the corresponding pairs of predicted outputs, such as (1, 0), (1, 1) and (0, 0)]

5) Supervised learning using Neural Networks — User point of view

Training phase: optimization of the classifier parameters to get the best predictions on average.
• Input training data: many images of dogs and cats
• Output training data: the labels of each image

5) Supervised learning using Neural Networks — Into the black-box

[Figure: a network with an input layer, hidden layers (Layer 1, Layer 2, Layer 3, …, Layer L) and an output layer]

The xi are typically the intensities of an RGB image I in each of its channels; the outputs are the predicted labels.

Minimize the expectation of the prediction error (≈ its average over the K training observations), i.e. the discrepancy between the predicted labels and the known labels.

Stochastic gradient descent: the parameters of each layer l are updated in the opposite direction of the gradient of the prediction error, where the gradient is:
• calculated on a subsample of the K observations at each iteration (batch);
• calculated analytically if l = L−1;
• back-propagated if l < L−1.

Optimization of the expectation (i.e. the average) of the prediction error.

In practice:
• various types of layers
• various types of architectures
• various strategies to perform the stochastic gradient descent

Important hidden properties of neural networks:
• prediction and training can be straightforwardly parallelized on GPUs
• the Nvidia cuDNN library is massively used by Keras, TensorFlow, Theano, PyTorch, …
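A minimal sketch of this black box in Python with Keras (one of the libraries named above; synthetic vectors stand in for the dog/cat images, and the architecture is an arbitrary choice):

```python
# A small fully-connected network trained by stochastic gradient descent on batches.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")          # stand-in inputs x_i
y = (X[:, 0] + X[:, 1] > 0).astype("float32")              # stand-in labels in {0, 1}

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),             # hidden layer 1
    keras.layers.Dense(32, activation="relu"),             # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"),           # output layer
])
# Minimize the average prediction error over the training observations
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=5, verbose=0)        # batches + backpropagation
print(model.predict(X[:3], verbose=0))                     # predicted labels for new inputs
```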

That’s all for now

MERCI !!! (Thank you!)
