Statistical Machine Learning: an overview
Andreas Svensson
Department of Information Technology, Uppsala University
[email protected] Department of Information Technology, Uppsala University
Machine Learning: a very subjective view
'Classical' approach: first-principles modeling, followed by calibration, simulations, and predictions. The first principles are physical laws such as Kirchhoff's laws, the ideal gas law, Newton's laws of motion, the Stefan-Boltzmann law, the drag equation, Stokes' law, Graham's law, . . . Data enters by calibrating a model with unknown parameters.

Statistical machine learning approach: flexible black-box modeling, followed by learning, simulations, and predictions. Data enters by learning a model from a flexible model structure.

- Level of detailed knowledge present?
- Model reduction: manual or automated?
- When to make approximations?
- Interpretation of the model?
- Representing uncertainty?
Outline
- Introduction
- Deep learning
- Gaussian processes
- Outlook
Introduction to neural networks (I/III)
What is a neural network?
A neural network (NN) is a nonlinear function y = g_\theta(x) from an input variable x to an output variable y, parameterized by \theta.
Linear regression models the relationship between the input x and the output y as a linear combination,

y = \sum_{i=1}^{n} x_i \theta_i + \theta_0 = x^T \theta,

where the parameter vector \theta is composed of the "weights" \theta_i and the offset ("bias") term \theta_0,

\theta = (\theta_0 \; \theta_1 \; \theta_2 \; \cdots \; \theta_n)^T, \quad x = (1 \; x_1 \; x_2 \; \cdots \; x_n)^T.
Introduction to neural networks (II/III)
1. Form m_1 linear combinations of the input,

   a_j^{(1)} = \sum_{i=1}^{n} \theta_{ji}^{(1)} x_i + \theta_{j0}^{(1)} = x^T \theta_j^{(1)}, \quad j = 1, \dots, m_1.

2. Apply a (simple) nonlinear transformation,

   z_j^{(1)} = f(a_j^{(1)}), \quad j = 1, \dots, m_1.

   (common choices: f(a) = 1/(1 + e^{-a}), f(a) = \tanh(a) or f(a) = \max(0, a))

3. Form a linear combination of the z_j^{(1)}s,

   y = \sum_{j=1}^{m_1} \theta_j^{(2)} z_j^{(1)} + \theta_0^{(2)} = z^{(1)T} \theta^{(2)}.
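A minimal sketch of these three steps as code, assuming f = tanh and randomly drawn weights purely for illustration:

```python
import numpy as np

def forward(x, Theta1, theta1_0, theta2, theta2_0, f=np.tanh):
    a1 = Theta1 @ x + theta1_0     # step 1: m1 linear combinations a_j^(1)
    z1 = f(a1)                     # step 2: elementwise nonlinearity z_j^(1)
    return theta2 @ z1 + theta2_0  # step 3: linear combination of the z_j^(1)s

# Randomly drawn weights, purely to show the shapes involved
rng = np.random.default_rng(1)
n, m1 = 4, 8
x = rng.normal(size=n)
y = forward(x, rng.normal(size=(m1, n)), rng.normal(size=m1),
            rng.normal(size=m1), rng.normal())
print(y)
```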
Introduction to neural networks (III/III)
y(\theta) = \sum_{j=1}^{m_1} \theta_j^{(2)} \underbrace{f\bigg( \sum_{i=1}^{n} \theta_{ji}^{(1)} x_i + \theta_{j0}^{(1)} \bigg)}_{z_j^{(1)}} + \theta_0^{(2)}

[Figure: the network as a graph, with inputs x_1, x_2, . . . , x_n, a hidden layer z^{(1)} of m_1 units f, and an output layer producing y; weights \theta_{11}^{(1)}, . . . , \theta_{m_1 n}^{(1)} connect the inputs to the hidden layer, and \theta_1^{(2)}, . . . , \theta_{m_1}^{(2)} connect the hidden layer to the output.]
Deep neural networks
[Figure: a deep neural network with inputs x_1, x_2, x_3, . . . , x_n, hidden layers z^{(1)}_1, . . . , z^{(1)}_{m_1} through z^{(8)}_1, . . . , z^{(8)}_{m_8}, and an output y.]
How to learn the unknown parameters \theta_{ij}^{(d)}?
By differentiating the squared error (\hat{y} - y)^2 with respect to \theta_{ij}^{(d)}: backpropagation.
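A minimal sketch of such learning, assuming the one-hidden-layer network from above with f = tanh and a toy sin(x) regression target; the hand-derived gradients below implement the backpropagation rules for the mean squared error, with all data and hyperparameters chosen purely for illustration.

```python
import numpy as np

# Toy setup: one hidden layer, tanh nonlinearity, squared-error loss.
rng = np.random.default_rng(2)
n, m1, N = 1, 10, 100
X = rng.uniform(-3, 3, size=(N, n))
y = np.sin(X[:, 0])                           # assumed regression target

Theta1 = 0.5 * rng.normal(size=(m1, n)); b1 = np.zeros(m1)   # b ~ theta_{j0}
theta2 = 0.5 * rng.normal(size=m1);      b2 = 0.0
lr = 0.01                                     # gradient-descent step size

for epoch in range(2000):
    A1 = X @ Theta1.T + b1                    # (N, m1) pre-activations
    Z1 = np.tanh(A1)
    y_hat = Z1 @ theta2 + b2                  # (N,) predictions

    # Backpropagation of the mean squared error
    e = 2 * (y_hat - y) / N                   # d(MSE)/d(y_hat)
    g_theta2 = Z1.T @ e
    g_b2 = e.sum()
    dA1 = np.outer(e, theta2) * (1 - Z1**2)   # tanh'(a) = 1 - tanh(a)^2
    g_Theta1 = dA1.T @ X
    g_b1 = dA1.sum(axis=0)

    Theta1 -= lr * g_Theta1; b1 -= lr * g_b1
    theta2 -= lr * g_theta2; b2 -= lr * g_b2

print("final MSE:", np.mean((y_hat - y) ** 2))
```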
Deep learning: Image classification
Input: pixels of an image. Output: object identity. Each hidden layer extracts increasingly abstract features.
M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks, Computer Vision - ECCV.
Deep learning: A recent example
An AI defeated a human professional for the first time in the game of Go.
D. Silver et al. (2016) Mastering the game of Go with deep neural networks and tree search, Nature, vol 529.
Deep learning—Why now?
Neural networks have been around for more than fifty years. Why have they become so popular now (again)?
To solve really interesting problems you need:
1. Efficient learning algorithms
2. Efficient computational hardware
3. A lot of labeled data!
These requirements were not met to a satisfactory level until the last 5-10 years.
Outline
- Introduction
- Deep learning
- Gaussian processes
- Outlook
GP: a probability distribution over functions
Gaussian distribution for f_1,

p(f_1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\frac{1}{2\sigma^2}(f_1 - \mu)^2\Big)

Multivariate Gaussian distribution for f = [f_1 \; f_2]^T,

p(f) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\!\Big(-\frac{1}{2}(f - \mu)^T \Sigma^{-1} (f - \mu)\Big)

Multivariate Gaussian distribution for f = [f_1 \; f_2 \; f_3 \; f_4 \; f_5]^T,

p(f) = \frac{1}{(2\pi)^{5/2} |\Sigma|^{1/2}} \exp\!\Big(-\frac{1}{2}(f - \mu)^T \Sigma^{-1} (f - \mu)\Big)

Gaussian process distribution for f(x): for any finite set of d test inputs x_\star,

p(f(x_\star)) = \frac{1}{(2\pi)^{d/2} |K(x_\star, x_\star)|^{1/2}} \exp\!\Big(-\frac{1}{2}(f(x_\star) - \mu(x_\star))^T K(x_\star, x_\star)^{-1} (f(x_\star) - \mu(x_\star))\Big)
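A sketch of what this means in practice: evaluate a covariance function on a grid of inputs and draw joint Gaussian samples of the function values. The squared-exponential kernel is one common choice; the grid, hyperparameters, and jitter term below are illustrative assumptions.

```python
import numpy as np

def k_se(x1, x2, ell=1.0, sf2=1.0):
    # Squared-exponential covariance k(x, x') = sf2 * exp(-(x - x')^2 / (2 ell^2))
    return sf2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

x_star = np.linspace(0, 5, 100)
mu = np.zeros_like(x_star)                             # mean function mu(x) = 0
K = k_se(x_star, x_star) + 1e-8 * np.eye(len(x_star))  # jitter for stability

rng = np.random.default_rng(3)
samples = rng.multivariate_normal(mu, K, size=5)       # five draws of f(x_star)
print(samples.shape)                                   # (5, 100)
```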
GP: a probability distribution over functions
[Figure: the GP distribution for f(x) conditioned on one observation (orange dot); axes x and f(x).]
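A sketch of this conditioning step, using the standard Gaussian conditioning formulas (cf. Rasmussen and Williams, 2006, Algorithm 2.1); the kernel, the noise variance sn2, and the data points are assumptions for illustration.

```python
import numpy as np

def k_se(x1, x2, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

X = np.array([1.0, 2.5, 4.0])      # observed inputs
y = np.array([0.3, -0.8, 0.5])     # observed function values
x_star = np.linspace(0, 5, 100)    # test inputs
sn2 = 1e-4                         # observation-noise variance

K = k_se(X, X) + sn2 * np.eye(len(X))
K_s = k_se(X, x_star)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

mean = K_s.T @ alpha                          # posterior mean of f(x_star)
v = np.linalg.solve(L, K_s)
cov = k_se(x_star, x_star) - v.T @ v          # posterior covariance
print(mean.shape, cov.shape)                  # (100,), (100, 100)
```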
Gaussian Processes: a flexible model
[Figure: four panels of samples of f(x) drawn from GPs with different hyperparameter settings; axes x and f(x).]

There are 'tuning knobs' (hyperparameters) for the smoothness etc.
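One such knob, illustrated in a short sketch under the squared-exponential kernel assumption: the lengthscale ell controls how rapidly samples vary, with small ell giving wiggly functions and large ell giving smooth ones.

```python
import numpy as np

def k_se(x1, x2, ell):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

x = np.linspace(0, 5, 200)
rng = np.random.default_rng(4)
for ell in (0.1, 0.5, 2.0):
    K = k_se(x, x, ell) + 1e-8 * np.eye(len(x))
    f = rng.multivariate_normal(np.zeros(len(x)), K)
    # Mean step-to-step change as a crude roughness proxy
    print(f"ell = {ell}: mean |delta f| = {np.mean(np.abs(np.diff(f))):.3f}")
```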
Gaussian Processes: Fault detection I
O. Samuelsson et al. (2016) Improved monitoring and fault detection of wastewater treatment processes with Gaussian process regression, manuscript.
Gaussian Processes: Fault detection II
[Figure: two panels of dissolved oxygen (mg/l) versus time (hours), over 0-180 hours.]
A. Svensson, J. Dahlin and T. B. Schön (2015) Marginalizing Gaussian Process Hyperparameters using Sequential Monte Carlo, IEEE CAMSAP.
GP optimization
- Optimization of some parameters
- Expensive to evaluate the objective function (e.g., a lengthy computer simulation); see the sketch below
M. Osborne, R. Garnett and S. Roberts (2009) Gaussian Processes for Global Optimization, LION3
B. Shahriari et al. (2016) Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proceedings of the IEEE, vol 104.
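A minimal Bayesian-optimization sketch in the spirit of these references: fit a GP to the evaluations gathered so far, then pick the next query point by expected improvement on a grid. The objective function, kernel, grid, and all settings are illustrative assumptions, not the specific methods of the cited papers.

```python
import numpy as np
from scipy.stats import norm

def k_se(x1, x2, ell=0.5):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

def objective(x):                      # stand-in for an expensive black box
    return np.sin(3 * x) + 0.3 * x**2

grid = np.linspace(-2, 2, 400)
X = np.array([-1.5, 0.0, 1.5])         # initial design
y = objective(X)

for _ in range(10):
    # GP posterior mean and variance on the grid (noise-free, jittered)
    K = k_se(X, X) + 1e-6 * np.eye(len(X))
    K_s = k_se(X, grid)
    alpha = np.linalg.solve(K, y)
    mu = K_s.T @ alpha
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    sd = np.sqrt(np.maximum(var, 1e-12))

    # Expected improvement over the best value so far (minimization)
    best = y.min()
    z = (best - mu) / sd
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)

    x_next = grid[np.argmax(ei)]       # query where EI is largest
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print("best found:", X[np.argmin(y)], y.min())
```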
Outline
- Introduction
- Deep learning
- Gaussian processes
- Outlook
Of course there is more...
- Regression
- Classification
  - Support vector machines
  - Decision trees
  - Boosting
- Clustering
- Reinforcement learning
- Probabilistic programming
- Computational methods
  - Markov chain Monte Carlo
  - Sequential Monte Carlo
  - Variational inference
Learn more
General
Z. Ghahramani (2015) Probabilistic machine learning and artificial intelligence, Nature, vol 521
G. James, D. Witten, T. Hastie and R. Tibshirani (2013) An introduction to statistical learning with applications in R, Springer
C. Bishop (2006) Pattern recognition and machine learning, Springer
Deep learning
I. Goodfellow, Y. Bengio and A. Courville (2016) Deep learning, Book in preparation for MIT Press, http://www.deeplearningbook.org/
Y. LeCun, Y. Bengio and G. Hinton (2015) Deep learning, Nature, vol 521
http://deeplearning.net/
Gaussian processes
C. Rasmussen and K. Williams (2006) Gaussian processes for machine learning, MIT Press
http://gaussianprocess.org/
Conferences: Neural Information Processing Systems (NIPS, http://nips.cc/), the International Conference on Machine Learning (ICML, http://icml.cc/)
Journals: Journal of Machine Learning Research (JMLR), IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)
New course
Statistical Machine Learning 5 hp, January-March 2017
- Classical and Bayesian linear regression
- Classification via logistic regression
- Linear discriminant analysis
- Gaussian processes and kernel methods
- Cross-validation and model selection techniques
- Regularization (ridge regression and the LASSO)
- Regression and classification trees
- Principal component analysis
- k-means clustering
- Neural networks
http://www.it.uu.se/edu/course/homepage/sml
Thank you!
Questions?