MODERN MACHINE LEARNING: PROBABILISTIC MODELING AND FUNCTIONAL PREDICTION
Tom Dietterich, Oregon State University
Environmental Health Summit
Machine Learning Basics
• Goal: program a computer to compute some function
• Given: training data $(x_1, y_1), \ldots, (x_N, y_N)$
• Find: a function $f$ such that $y_i \approx f(x_i)$
• Typical tasks:
  - Document classification
  - Predict jet engine failure
  - Predict customer behavior
[Figure: an example input mapped to the output label “2”]
Two Main Paradigms
• Probabilistic Modeling (“Declarative”)
• Function Learning (“Algorithmic”)
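Probabilistic Modeling
• Goal: predict $y$ from $x$
• Model the process that creates the data:
  - $y \sim P(y)$ (discrete)
  - $x \sim \mathcal{N}(x \mid \mu_y, \Sigma_y)$ (Gaussian)
• Learning = model fitting
• Classification requires probabilistic inference
  - $P(y \mid x) = \dfrac{P(y)\, \mathcal{N}(x \mid \mu_y, \Sigma_y)}{\sum_{y'} P(y')\, \mathcal{N}(x \mid \mu_{y'}, \Sigma_{y'})}$
[Figure: graphical model with nodes $y$ and $x$; example class label “2”]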
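A minimal Python sketch of this kind of generative classifier for a one-dimensional input. The toy data, the function names `fit` and `posterior`, and the use of a scalar variance in place of $\Sigma_y$ are illustrative assumptions, not taken from the slides.

```python
import math
from collections import defaultdict

def fit(xs, ys):
    """Model fitting: estimate P(y), mu_y and the variance for each class y."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    params = {}
    for y, vals in groups.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)  # maximum-likelihood variance
        params[y] = (len(vals) / len(xs), mu, var)
    return params

def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(params, x):
    """Probabilistic inference: P(y | x) by Bayes' rule."""
    joint = {y: prior * normal_pdf(x, mu, var) for y, (prior, mu, var) in params.items()}
    total = sum(joint.values())
    return {y: p / total for y, p in joint.items()}

# Hypothetical toy data: class 0 clusters near 1.0, class 1 near 3.0.
xs = [0.8, 1.0, 1.2, 2.8, 3.0, 3.2]
ys = [0, 0, 0, 1, 1, 1]
params = fit(xs, ys)
post = posterior(params, 1.1)  # dictionary mapping each class to P(y | x = 1.1)
```

Classifying a new point then amounts to picking the class with the largest posterior probability, for example `max(post, key=post.get)`.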
End-to-End Function Learning
• Define a space of parameterized functions $\mathcal{F}_\Theta$
• Define a loss function $L(\hat{y}, y)$
• Solve the optimization problem:
  $\hat{\theta} := \arg\min_\theta \sum_i L(f_\theta(x_i), y_i) + \lambda \lVert \theta \rVert$
• Classify new input $x_{\mathrm{new}}$ by evaluating $f_{\hat{\theta}}(x_{\mathrm{new}})$
[Figure credit: LeCun, Bottou, Bengio, & Haffner, 1998]
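The three steps (function space, loss, optimization) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function space is straight lines, the loss is squared error, the regularizer is the squared L2 norm, and the optimizer is plain gradient descent. The data and the names `fit` and `objective` are hypothetical.

```python
def f(theta, x):
    """Parameterized function f_theta(x) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def objective(theta, data, lam):
    """Sum of squared losses plus lam times the squared norm of theta."""
    loss = sum((f(theta, x) - y) ** 2 for x, y in data)
    return loss + lam * (theta[0] ** 2 + theta[1] ** 2)

def fit(data, lam=0.01, lr=0.01, steps=5000):
    """Minimize the objective by full-batch gradient descent."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        g0 = 2 * lam * theta[0]  # gradient of the regularizer
        g1 = 2 * lam * theta[1]
        for x, y in data:
            r = f(theta, x) - y  # residual on this training example
            g0 += 2 * r
            g1 += 2 * r * x
        theta = [theta[0] - lr * g0, theta[1] - lr * g1]
    return theta

# Hypothetical training data lying close to y = 1 + 2x.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
theta_hat = fit(data)
y_new = f(theta_hat, 4.0)  # prediction for a new input
```

Deep neural networks follow the same recipe with a much richer function space and stochastic rather than full-batch gradients.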
Programming Languages and Systems
• Both paradigms are now well supported by programming languages and systems
• Probabilistic programming
  - Bayesia, Stan, etc.
• Deep neural networks
  - PyTorch, TensorFlow, etc.
Outline
• Machine Learning: Two Paradigms
• Multi-Level Modeling in Stan
• Functional Prediction Methods
• Deep Neural Networks
Multilevel Modeling
Gelman & Hill (2006)
http://mc-stan.org/users/documentation/case-studies/radon.html
• Radon levels in homes (as a risk factor for lung cancer)
• Data
  - radon level measured in basement or first floor (if no basement)
  - county soil uranium level
• Goal:
  - Identify counties with high radon in homes
• Structure:
  - households are nested within counties
Plate Notation
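• $j$ indexes the county
• $i$ indexes the household within a county
• $u_j$: soil uranium level
• $x_{i,j}$: floor (0 or 1)
• $y_{i,j}$: log radon level
[Figure: plate diagram. The county plate ($j = 1, \ldots, J$) contains $u_j$; the nested home plate ($i = 1, \ldots, N_j$) contains $y_{i,j}$ and $x_{i,j}$]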
Alternative Models (1): Fully Pooled Model
• $y_i = \alpha + \beta x_i + \epsilon_i$
  - ignores the county uranium measurement
  - assumes each house has the same error distribution $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2)$
[Figure: log radon vs. floor. Annotations: high variance implies poor fit to the data; assumes all counties have the same radon level]
Alternative Models (2): No pooling
Separate intercept for each county
• $y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$
  - assumes each house has the same error distribution $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2)$
  - $j[i]$ means “the county where house $i$ is located”
[Figure: fitted county intercepts $\alpha_j$. Annotations: much lower variance within county; some counties have very high radon levels. Are these real?]
• County levels (“basement”) vary widely
• Are those high levels real?
• No, they reflect small sample sizes. Most counties suffer from small samples of either $x = 0$ or $x = 1$ (most houses in some counties have basements)
[Figure legend: fully pooled; no pooling]
Multilevel Model 1: Partially pooled intercepts
• Two-level model:
  $\alpha_j \sim \mathrm{Normal}(\mu_\alpha, \sigma_\alpha^2)$
  $\epsilon_i \sim \mathrm{Normal}(0, \sigma_y^2)$
  $y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$
• Combines model of $\alpha_j$ and model of $y_i$
• All counties affect $\mu_\alpha$, but counties with more data points have more influence
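The effect of partial pooling can be illustrated with the standard precision-weighted average for a normal model with known variances. The Python sketch below makes several simplifying assumptions that are not in the slides: the floor term is dropped, `sigma_y` and `sigma_a` are treated as known, and the overall mean of the data stands in for $\mu_\alpha$. The data and the function name `partial_pool` are hypothetical.

```python
def partial_pool(county_values, sigma_y, sigma_a):
    """Precision-weighted compromise between each county mean and the overall mean."""
    all_values = [v for vals in county_values.values() for v in vals]
    mu_a = sum(all_values) / len(all_values)  # fully pooled estimate
    estimates = {}
    for county, vals in county_values.items():
        n = len(vals)
        ybar = sum(vals) / n              # no-pooling estimate for this county
        w_data = n / sigma_y ** 2         # precision contributed by the county's own data
        w_prior = 1 / sigma_a ** 2        # precision contributed by the county-level model
        estimates[county] = (w_data * ybar + w_prior * mu_a) / (w_data + w_prior)
    return mu_a, estimates

# Hypothetical log-radon measurements: county "A" has 2 homes, county "B" has 20.
county_values = {"A": [2.4, 2.6], "B": [1.0] * 20}
mu_a, est = partial_pool(county_values, sigma_y=0.8, sigma_a=0.3)
```

County "A", with only two homes, is pulled strongly from its raw mean of 2.5 toward the overall mean, while county "B" stays close to its own mean of 1.0. A full Bayesian fit estimates the variances and $\mu_\alpha$ jointly rather than fixing them.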
• Note that the fit moves toward the fully pooled model for counties with few data points
• The variability in estimated radon levels across counties is now much lower
[Figure legend: fully pooled; partial pooling]
• Visualization of all of the fitted radon models
• Some counties have log radon levels near 2.0; others have log radon levels near 1.0
Multilevel Model 2: Include county uranium in the intercept model
  $\eta_j \sim \mathrm{Normal}(0, \sigma_\alpha^2)$
  $\alpha_j = \gamma_0 + \gamma_1 u_j + \eta_j$
  $\epsilon_i \sim \mathrm{Normal}(0, \sigma_y^2)$
  $y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$
• Final per-county radon estimates
• $u_j$ is a strong predictor
• But $\hat{\alpha}$ estimates are adjusted to reflect confounding effects of $x$
[Figure: $\hat{\alpha}_j$ plotted against $u_j$]
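Stan Code
data {
  int<lower=0> J;                          // number of counties
  int<lower=0> N;                          // number of households
  array[N] int<lower=1, upper=J> county;   // county of each household (current array syntax)
  vector[N] u;                             // county soil uranium level, one entry per household
  vector[N] x;                             // floor (0 or 1)
  vector[N] y;                             // log radon level
}
parameters {
  vector[J] a;                             // county-level effects, centered on mu_a
  vector[2] b;                             // b[1]: uranium slope, b[2]: floor slope
  real mu_a;
  real<lower=0, upper=100> sigma_a;
  real<lower=0, upper=100> sigma_y;
}
transformed parameters {
  vector[N] y_hat;
  vector[N] m;                             // county intercept including the uranium term
  for (i in 1:N) {
    // '=' replaces the deprecated '<-' assignment operator
    m[i] = a[county[i]] + u[i] * b[1];
    y_hat[i] = m[i] + x[i] * b[2];
  }
}
model {
  mu_a ~ normal(0, 1);
  a ~ normal(mu_a, sigma_a);
  b ~ normal(0, 1);
  y ~ normal(y_hat, sigma_y);
}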
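To check that a fitted model recovers known parameters, it can help to simulate data from the generative process that the Stan program describes. The Python sketch below is one way to do that, assuming standard-normal county uranium values and the same number of homes in every county. Both are illustrative choices, as are the function name `simulate` and all parameter values. Here `gamma0`, `gamma1`, and `beta` play the roles of `mu_a`, `b[1]`, and `b[2]`.

```python
import random

def simulate(J, homes_per_county, gamma0, gamma1, beta, sigma_a, sigma_y, seed=0):
    """Draw one synthetic data set with the same fields as the Stan data block."""
    rng = random.Random(seed)
    county, u, x, y = [], [], [], []
    for j in range(1, J + 1):
        u_j = rng.gauss(0.0, 1.0)  # hypothetical county uranium level
        a_j = gamma0 + gamma1 * u_j + rng.gauss(0.0, sigma_a)  # county intercept
        for _ in range(homes_per_county):
            x_i = rng.randint(0, 1)  # floor indicator
            y_i = a_j + beta * x_i + rng.gauss(0.0, sigma_y)  # log radon
            county.append(j)
            u.append(u_j)
            x.append(x_i)
            y.append(y_i)
    return {"J": J, "N": len(y), "county": county, "u": u, "x": x, "y": y}

data = simulate(J=3, homes_per_county=5, gamma0=1.5, gamma1=0.7,
                beta=-0.6, sigma_a=0.2, sigma_y=0.8)
```

The returned dictionary uses the same names as the `data` block (`J`, `N`, `county`, `u`, `x`, `y`), so it could be passed to a Stan interface as input data.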
Summary: Why Multilevel Modeling?
• Accounts for individual- and group-level variation when estimating group-level coefficients
• Models variation among individual-level coefficients
• Gives better estimates of regression coefficients for groups with small sample sizes by “borrowing strength” from other groups
Outline
• Machine Learning: Two Paradigms
• Multi-Level Modeling in Stan
• Functional Prediction Methods
• Deep Neural Networks
Functional Prediction Methods
• Random Forests
• Support Vector Machines
• Given:
  - Training data: $(x_1, y_1), \ldots, (x_N, y_N)$
    - $x_i$: $p$-dimensional vector of predictor variables
    - $y_i$: real or discrete response value
• Find:
  - Function $f$ that can predict $\hat{y} = f(x)$ for new points $x$
Decision Tree
• Let $x_{\cdot j}$ be the value of the $j$-th predictor variable for data point $x$
• A query $x$ traverses the tree until it reaches a leaf. The corresponding $\hat{y}$ value is $f(x)$
• The tree is “grown” top-down by choosing the most informative predictor/threshold combination at each step
• $\hat{y}_\ell$ is the mean response of the training points $x_i$ that arrive at leaf $\ell$
[Figure: binary decision tree with five internal nodes, each testing one predictor against a threshold ($x_{\cdot j} > \theta_k$) with yes/no branches, and six leaves holding predicted values $\hat{y}_\ell$]
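A small regression tree along these lines can be sketched in Python. The sketch assumes that “most informative” means the split that most reduces the squared error of the two children, which is one common choice but is not stated on the slide. The dictionary layout and the names `best_split`, `grow`, and `predict` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Return (predictor j, threshold) minimizing the children's squared error, or None."""
    best, best_cost = None, sse(y)
    for j in range(len(X[0])):
        # Skip the largest value so that both children are non-empty.
        for threshold in sorted({row[j] for row in X})[:-1]:
            yes = [yi for row, yi in zip(X, y) if row[j] > threshold]
            no = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            cost = sse(yes) + sse(no)
            if cost < best_cost - 1e-12:
                best, best_cost = (j, threshold), cost
    return best

def grow(X, y, max_depth=3):
    """Grow the tree top-down; a leaf stores the mean response of its training points."""
    split = best_split(X, y) if max_depth > 0 and len(y) > 1 else None
    if split is None:
        return {"leaf": mean(y)}
    j, threshold = split
    yes = [(row, yi) for row, yi in zip(X, y) if row[j] > threshold]
    no = [(row, yi) for row, yi in zip(X, y) if row[j] <= threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "yes": grow([r for r, _ in yes], [v for _, v in yes], max_depth - 1),
        "no": grow([r for r, _ in no], [v for _, v in no], max_depth - 1),
    }

def predict(tree, x):
    """Route the query x from the root to a leaf and return that leaf's value."""
    while "leaf" not in tree:
        tree = tree["yes"] if x[tree["feature"]] > tree["threshold"] else tree["no"]
    return tree["leaf"]

# Hypothetical data: the response depends only on the first predictor.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]]
y = [1.0, 1.0, 3.0, 3.0]
tree = grow(X, y)
```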
Randomized Tree
• When the tree is “grown”, only a randomly chosen subset of the predictor variables is considered at each node
[Figure: decision tree diagram with the same structure as on the Decision Tree slide]
Random Forest
• A random forest is a collection of $T$ randomized trees
• Each tree $f_t$ is “grown” on a bootstrap replicate of the training data
• The predicted value is the mean of the predictions of the individual trees:
  $\hat{y} = \frac{1}{T} \sum_{t=1}^{T} f_t(x)$
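The three ingredients (randomized trees, bootstrap replicates, averaging) fit in a short Python sketch. It is a simplified illustration, not the algorithm of any particular package: splits are chosen by squared error, trees are depth-limited, and the names `n_try`, `grow_forest`, and `predict_forest` as well as the data are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def sse(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v)

def grow(rows, rng, n_try, depth):
    """Randomized regression tree: each node considers only n_try random predictors."""
    ys = [y for _, y in rows]
    best, best_cost = None, sse(ys)
    if depth > 0:
        for j in rng.sample(range(len(rows[0][0])), n_try):
            for t in sorted({x[j] for x, _ in rows})[:-1]:
                yes = [y for x, y in rows if x[j] > t]
                no = [y for x, y in rows if x[j] <= t]
                cost = sse(yes) + sse(no)
                if cost < best_cost - 1e-12:
                    best, best_cost = (j, t), cost
    if best is None:
        return mean(ys)  # leaf: mean response of the points that reached it
    j, t = best
    return (j, t,
            grow([r for r in rows if r[0][j] > t], rng, n_try, depth - 1),
            grow([r for r in rows if r[0][j] <= t], rng, n_try, depth - 1))

def predict_tree(tree, x):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are floats
        j, t, yes, no = tree
        tree = yes if x[j] > t else no
    return tree

def grow_forest(X, y, n_trees=25, n_try=1, depth=3, seed=0):
    rng = random.Random(seed)
    rows = list(zip(X, y))
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]  # bootstrap replicate (sampling with replacement)
        forest.append(grow(boot, rng, n_try, depth))
    return forest

def predict_forest(forest, x):
    """Mean of the individual tree predictions."""
    return mean([predict_tree(tree, x) for tree in forest])

# Hypothetical data: the response is a step function of the first predictor;
# the second predictor carries no information about it.
X = [[i / 10, (7 * i) % 10] for i in range(20)]
y = [0.0 if row[0] < 1.0 else 10.0 for row in X]
forest = grow_forest(X, y)
low = predict_forest(forest, [0.0, 5])
high = predict_forest(forest, [1.9, 5])
```

Individual trees that happen to split on the uninformative predictor give noisy predictions, but averaging over the forest smooths these out.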
Random Forest Advantages
• Can work with a mix of discrete and continuous predictor variables
• Can handle missing values
• Makes no assumptions about the error distribution of $y$
• Considers high-order interactions among predictors
• Generally gives excellent predictive accuracy
Random Forest Disadvantages
• Cannot be usefully inspected (“black box”)
• However
  - Can provide estimates of variable importance (see “randomForest” R package)
  - Can be modified to support hypothesis tests and confidence intervals (see Mentch & Hooker, 2016, 2017)
Support Vector Machines
• Extension of Linear Classification Model
• $y = w_0 + \sum_j w_j x_j$
• New ideas:
  - Maximize the margin between the classes
  - Implicitly map to high-dimensional feature space using kernels
Classification (Iris Species)
Decision Boundaries: Which one is best?
SVM Finds the Boundary that Maximizes the Margin
Full Iris Data is Not Separable
SVM balances the sum of errors against the width of the margin
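One way to see the trade-off on non-separable data is to minimize a regularized hinge loss directly. The Python sketch below uses full-batch subgradient descent on a linear soft-margin objective. This is a simple stand-in chosen for clarity, not the solver used by LibSVM. The data, the step size, and the names `train_linear_svm` and `classify` are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by full-batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]  # gradient of the margin (regularization) term
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def classify(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical data: two clusters plus one point labeled -1 inside the +1 region,
# so no straight line separates the classes perfectly.
X = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
     [2.0, 1.0], [1.0, 2.0], [2.0, 2.0],
     [1.5, 1.5]]
y = [-1, -1, -1, 1, 1, 1, -1]
w, b = train_linear_svm(X, y)
```

The parameter `lam` controls the balance: larger values favor a wider margin, smaller values favor fewer training errors.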
SVMs can fit non-linear decision boundaries using “kernels”
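The idea behind kernels is that some functions of two inputs equal an inner product in a higher-dimensional feature space, so that space never has to be built explicitly. The short Python check below illustrates this for the degree-2 polynomial kernel on two-dimensional inputs. The function names and the example points are hypothetical.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map phi with phi(x) . phi(z) = (x . z)^2 for 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)  # computed in the original 2-D space
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))  # computed in feature space
```

Both routes give the same value (up to floating-point rounding), but the kernel only needs the original coordinates. A linear boundary in the feature space corresponds to a quadratic boundary in the original space.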
SVM Assessment
• Strengths:
  - Excellent performance on $p \gg N$ problems (many more predictors than training examples)
  - Good free implementations (libSVM wrapped for R, Python, etc.)
• Weaknesses:
  - Does not scale easily to large datasets
  - Requires tuning 2 hyperparameters
Outline
• Machine Learning: Two Paradigms
• Multi-Level Modeling in Stan
• Functional Prediction Methods
• Deep Neural Networks
ImageNet (1000 object classes): Top-5 Error Rate
[Figure: bar chart of top-5 classification error (%), y-axis from 0 to 30, for the years 2010 to 2014, with bars marked “Before” and “After”]
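For reference, top-5 error counts a prediction as correct when the true class is among the five classes the model scores highest. A minimal Python sketch, with hypothetical scores and labels for a six-class problem:

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            errors += 1
    return errors / len(labels)

# Hypothetical class scores for two images and six classes.
scores = [
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 has the lowest score
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # true class 3 has the second-highest score
]
labels = [5, 3]
top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

On this toy input the second image counts as an error under the top-1 criterion but not under top-5, which is why top-5 error rates are lower than top-1 error rates for the same model.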
Speech Recognition Results
[Figure: Google speech recognition word error, 2013 to 2015; labeled values 23% and 8%. Credit: Fernando Pereira & Matthew Firestone, Google; Protalinski]
DNN Practicalities
• The structure of each DNN must be carefully chosen for the task
• There are many, many hyperparameters
  - Auto-ML tools seek to automatically adjust the network structure and hyperparameters
• Generally require lots of data and lots of compute time
  - Many groups have had success with “fine tuning” of pre-trained networks
Environmental Health Applications of DNNs
• Analyzing medical images
• Analyzing EKG and other signal data
• Analyzing spectra
• Analyzing electronic health records
Summary
• For making inferences about environmental health, the probabilistic modeling paradigm is recommended
  - Interpretable models
  - Can draw causal inferences under some conditions
• For extracting data from sensors, EHRs, images
  - predictive models (random forests, SVMs, DNNs) excel
  - SVMs and DNNs require tuning hyperparameters
  - Tools are beginning to emerge to automate tuning
References
• Stan: http://mc-stan.org/
• Gelman & Hill (2006): Data analysis using regression and multilevel/hierarchical models.
• randomForest package in R
• Mentch, L., & Hooker, G. (2016). Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. Journal of Machine Learning Research, 17, 1–41.
• Mentch, L., & Hooker, G. (2017). Formal Hypothesis Tests for Additive Structure in Random Forests. Journal of Computational and Graphical Statistics, 26(3), 589–597.
• LibSVM package for fitting support vector machines
• Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press.
• Zoph, B., & Le, Q. V. (2016). Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578, 1–16.