
Statistical Machine Learning: an overview

Andreas Svensson

Department of Information Technology, Uppsala University

[email protected]

Machine Learning: a very subjective view

‘Classical’ approach: first-principles modeling. Physical laws (Kirchhoff’s laws, the ideal gas law, Newton’s laws of motion, the Stefan–Boltzmann law, the drag equation, Stokes’ law, Graham’s law, ...) together with data give a model with unknown parameters; calibration turns this into a model used for simulations, predictions, ...

Statistical machine learning approach: flexible black-box modeling. Data together with a flexible model structure, via learning, give a model used for simulations, predictions, ...

Points of comparison:

- Level of detailed knowledge present?
- Model reduction: manual or automated?
- When to make approximations?
- Interpretation of the model?
- Representing uncertainty?


Outline

- Introduction
- Deep learning
- Gaussian processes
- Outlook


Introduction to neural networks (I/III)

What is a neural network?

A neural network (NN) is a nonlinear function y = g_\theta(x) from an input variable x to an output variable y, parameterized by \theta.

Linear regression models the relationship between the input x and the output y as a linear combination

y = \sum_{i=1}^{n} x_i \theta_i + \theta_0 = x^T \theta,

where \theta collects the “weights” \theta_i and the offset (“bias”) term \theta_0:

\theta = (\theta_0\ \theta_1\ \theta_2\ \cdots\ \theta_n)^T, \qquad x = (1\ x_1\ x_2\ \cdots\ x_n)^T.
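As a small, hedged illustration (not part of the original slides), the parameter vector \theta can be estimated from data by least squares; the synthetic data and all names below are my own:

```python
import numpy as np

# Synthetic data: y = 2*x1 - 1*x2 + 0.5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # 100 samples, n = 2 inputs
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=100)

# Prepend a column of ones so theta_0 acts as the offset ("bias") term
X_aug = np.column_stack([np.ones(len(X)), X])  # x = (1, x1, ..., xn)^T

# Least-squares estimate of theta = (theta_0, theta_1, ..., theta_n)^T
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(theta)  # approximately [0.5, 2.0, -1.0]
```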


Introduction to neural networks (II/III)

1. Form m_1 linear combinations of the input:

a_j^{(1)} = \sum_{i=1}^{n} \theta_{ji}^{(1)} x_i + \theta_{j0}^{(1)} = x^T \theta_j^{(1)}, \quad j = 1, \dots, m_1.

2. Apply a (simple) nonlinear transformation:

z_j^{(1)} = f\big(a_j^{(1)}\big), \quad j = 1, \dots, m_1.

(common choices: f(a) = 1/(1 + e^{-a}), f(a) = \tanh(a), or f(a) = \max(0, a))

3. Form a linear combination of the z_j’s:

y = \sum_{j=1}^{m_1} \theta_j^{(2)} z_j^{(1)} + \theta_0^{(2)} = z^{(1)T} \theta^{(2)}.
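A minimal sketch of this three-step forward pass in NumPy (illustrative only; the shapes, names, and the choice of tanh for f are my own):

```python
import numpy as np

def forward(x, Theta1, theta2, theta20, f=np.tanh):
    """One-hidden-layer network: x has shape (n,), Theta1 (m1, n+1), theta2 (m1,)."""
    x_aug = np.concatenate([[1.0], x])   # prepend 1 for the bias term
    a1 = Theta1 @ x_aug                  # step 1: m1 linear combinations
    z1 = f(a1)                           # step 2: elementwise nonlinearity
    return theta2 @ z1 + theta20         # step 3: linear output layer

rng = np.random.default_rng(1)
n, m1 = 3, 5
y = forward(rng.normal(size=n),
            rng.normal(size=(m1, n + 1)),
            rng.normal(size=m1),
            0.1)
```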


Introduction to neural networks (III/III)

y(\theta) = \sum_{j=1}^{m_1} \theta_j^{(2)} \underbrace{f\Big(\sum_{i=1}^{n} \theta_{ji}^{(1)} x_i + \theta_{j0}^{(1)}\Big)}_{z_j^{(1)}} + \theta_0^{(2)}

[Figure: network diagram. Inputs x_1, x_2, ..., x_n feed a hidden layer of units (each applying f) producing z^{(1)}, which feeds the output layer producing y; the edges carry the weights \theta^{(1)}_{11}, ..., \theta^{(1)}_{m_1 n} and \theta^{(2)}_1, ..., \theta^{(2)}_{m_1}. Labels: Inputs, Hidden layer, Output layer.]


Deep neural networks

[Figure: deep network. Inputs x_1, x_2, x_3, ..., x_n pass through hidden layers z^{(1)}_1, ..., z^{(1)}_{m_1} up to z^{(8)}_1, ..., z^{(8)}_{m_8}, producing the output y.]

How to learn the unknown parameters \theta^{(d)}_{ij}? By differentiating the squared error (\hat{y} - y)^2 with respect to \theta^{(d)}_{ij}; computing these gradients efficiently is known as backpropagation. A gradient-descent sketch is given below.
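As an illustrative sketch (my own, not from the slides), here is stochastic gradient descent with hand-derived backpropagation for the one-hidden-layer network from earlier; the data, learning rate, and tanh activation are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m1, lr = 1, 10, 0.01
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])                            # target function to learn

Theta1 = 0.5 * rng.normal(size=(m1, n + 1))    # hidden-layer weights (incl. bias)
theta2 = 0.5 * rng.normal(size=m1)             # output weights
theta20 = 0.0                                  # output bias

for epoch in range(1000):
    for x, t in zip(X, y):
        # Forward pass
        x_aug = np.concatenate([[1.0], x])
        a1 = Theta1 @ x_aug
        z1 = np.tanh(a1)
        y_hat = theta2 @ z1 + theta20
        # Backpropagate the gradient of (y_hat - t)^2 through the layers
        delta = 2.0 * (y_hat - t)
        grad_a1 = delta * theta2 * (1.0 - z1 ** 2)   # tanh'(a) = 1 - tanh(a)^2
        theta2 = theta2 - lr * delta * z1
        theta20 = theta20 - lr * delta
        Theta1 = Theta1 - lr * np.outer(grad_a1, x_aug)
```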


Deep learning: Image classification

Input: pixels of an image. Output: object identity. Each hidden layer extracts increasingly abstract features.

M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks, Computer Vision – ECCV.


Deep learning: A recent example

An AI defeated a human professional for the first time in the game of Go.

D. Silver et al. (2016) Mastering the game of Go with deep neural networks and tree search, Nature, vol 529.


Deep learning—Why now?

Neural networks have been around for more than fifty years. Why have they become so popular now (again)?

To solve really interesting problems you need:

1. Efficient learning algorithms

2. Efficient computational hardware

3. A lot of labeled data!

These three requirements were not met to a satisfactory level until the last 5–10 years.


Outline

- Introduction
- Deep learning
- Gaussian processes
- Outlook


GP: a probability distribution over functions

Gaussian distribution for f_1:

p(f_1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2\sigma^2}(f_1 - \mu)^2\Big)

Multivariate Gaussian distribution for [f_1\ f_2]:

p([f_1\ f_2]) = \frac{1}{2\pi|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}([f_1\ f_2] - \mu)\Sigma^{-1}([f_1\ f_2] - \mu)^T\Big)

Multivariate Gaussian distribution for [f_1\ f_2\ f_3\ f_4\ f_5] = f:

p(f) = \frac{1}{(2\pi)^{5/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(f - \mu)\Sigma^{-1}(f - \mu)^T\Big)

Gaussian process distribution for f(x), evaluated at any finite collection of d inputs x_\star:

p(f(x_\star)) = \frac{1}{(2\pi)^{d/2}|K(x_\star, x_\star)|^{1/2}} \exp\Big(-\frac{1}{2}(f(x_\star) - \mu(x_\star))^T K(x_\star, x_\star)^{-1}(f(x_\star) - \mu(x_\star))\Big)
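To make the “distribution over functions” concrete, here is a minimal sketch (my own addition, not from the slides) that draws samples from a zero-mean GP prior; the squared-exponential kernel is an assumed choice:

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') = v * exp(-(x - x')^2 / (2 l^2))."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(3)
x_star = np.linspace(0, 10, 200)
K = sq_exp_kernel(x_star, x_star)
# Draw three functions from the prior N(0, K); jitter keeps K positive definite
f_samples = rng.multivariate_normal(np.zeros(len(x_star)),
                                    K + 1e-9 * np.eye(len(x_star)), size=3)
```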


GP: a probability distribution over functions

[Figure: distribution for f(x) conditioned on one observation (orange dot).]
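A hedged sketch of the conditioning step itself, using the standard zero-mean, noise-free GP regression formulas; all names and values are illustrative:

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0):
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / lengthscale ** 2)

x_star = np.linspace(0, 10, 200)      # test inputs
x_obs = np.array([4.0])               # one observed input
f_obs = np.array([1.0])               # observed function value (noise-free)

K_ss = sq_exp_kernel(x_star, x_star)
K_oo = sq_exp_kernel(x_obs, x_obs) + 1e-9 * np.eye(1)   # jitter for stability
K_so = sq_exp_kernel(x_star, x_obs)

# Standard GP conditioning (zero prior mean):
#   mean = K_so K_oo^{-1} f_obs,   cov = K_ss - K_so K_oo^{-1} K_os
mean_post = K_so @ np.linalg.solve(K_oo, f_obs)
cov_post = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)
```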


Gaussian Processes: a flexible model

[Figure: four panels of f(x) versus x, showing GP fits with different hyperparameter settings.]

There are ‘tuning knobs’ (hyperparameters, such as the length scale) for the smoothness etc.


Gaussian Processes: Fault detection I

O. Samuelsson et al. (2016) Improved monitoring and fault detection of wastewater treatment processes with Gaussian process regression, manuscript.


Gaussian Processes: Fault detection II

[Figure: two panels showing dissolved oxygen (mg/l) versus time (hours), over 0–180 hours.]

A. Svensson, J. Dahlin and T. B. Schön (2015) Marginalizing Gaussian Process Hyperparameters using Sequential Monte Carlo, IEEE CAMSAP.


GP optimization

- Optimization of some parameters
- Expensive to evaluate the objective function (e.g., a lengthy computer simulation)

M. Osborne, R. Garnett and S. Roberts (2009) Gaussian Processes for Global Optimization, LION3

B. Shahriari et al. (2016) Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proceedings of the IEEE, vol 104
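As a hedged illustration of the idea (my own sketch, not the method of the cited papers): fit a GP to the evaluations gathered so far and choose the next evaluation point by maximizing an acquisition function such as expected improvement; the toy objective and all constants are assumptions:

```python
import numpy as np
from scipy.stats import norm

def sq_exp_kernel(xa, xb, ls=1.0):
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ls ** 2)

def objective(x):                      # expensive black box (toy stand-in)
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(4)
x_obs = rng.uniform(0, 5, size=3)      # initial design
y_obs = objective(x_obs)
x_grid = np.linspace(0, 5, 400)

for _ in range(10):
    # GP posterior mean and variance on the grid (zero mean, noise-free)
    K_oo = sq_exp_kernel(x_obs, x_obs) + 1e-6 * np.eye(len(x_obs))
    K_go = sq_exp_kernel(x_grid, x_obs)
    mu = K_go @ np.linalg.solve(K_oo, y_obs)
    var = 1.0 - np.sum(K_go * np.linalg.solve(K_oo, K_go.T).T, axis=1)
    sd = np.sqrt(np.maximum(var, 1e-12))
    # Expected improvement over the best value seen so far (minimization)
    best = y_obs.min()
    z = (best - mu) / sd
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)
    x_next = x_grid[np.argmax(ei)]     # evaluate where EI is largest
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
```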


Outline

- Introduction
- Deep learning
- Gaussian processes
- Outlook


Of course there is more...

- Regression
- Classification
  - Support vector machines
  - Decision trees
  - Boosting
- Clustering
- Reinforcement learning
- Probabilistic programming
- Computational methods
  - Markov chain Monte Carlo
  - Sequential Monte Carlo
  - Variational inference


Learn more

General

Z. Ghahramani (2015) Probabilistic machine learning and artificial intelligence, Nature, vol 521

G. James, D. Witten, T. Hastie and R. Tibshirani (2013) An introduction to statistical learning with applications in R, Springer

C. Bishop (2006) Pattern recognition and machine learning, Springer

Deep learning

I. Goodfellow, Y. Bengio and A. Courville (2016) Deep learning, Book in preparation for MIT Press, http://www.deeplearningbook.org/

Y. LeCun, Y. Bengio and G. Hinton (2015) Deep learning, Nature, vol 521

http://deeplearning.net/

Gaussian processes

C. Rasmussen and K. Williams (2006) Gaussian processes for machine learning, MIT Press

http://gaussianprocess.org/

Conferences: Neural Information Processing Systems (NIPS, http://nips.cc/), the International Conference on Machine Learning (ICML, http://icml.cc/)

Journals: Journal of Machine Learning Research (JMLR), IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)


New course

Statistical Machine Learning 5 hp, January-March 2017

- Classical and Bayesian linear regression
- Classification via logistic regression
- Linear discriminant analysis
- Gaussian processes and kernel methods
- Cross-validation and model selection techniques
- Regularization (ridge regression and the LASSO)
- Regression and classification trees
- Principal component analysis
- k-means clustering
- Neural networks

http://www.it.uu.se/edu/course/homepage/sml


Thank you!


Questions?

