
MODERN MACHINE LEARNING: PROBABILISTIC MODELING AND FUNCTIONAL PREDICTION

Tom Dietterich, Oregon State University

Environmental Health Summit

Machine Learning Basics

- Goal: program a computer to compute some function
- Given: training data $(x_1, y_1), \ldots, (x_N, y_N)$
- Find: a function $f$ such that $y_i \approx f(x_i)$
- Typical tasks:
  - Document classification
  - Predict jet engine failure
  - Predict customer behavior


[Figure: example input mapped to the label “2”]

Two Main Paradigms

- Probabilistic Modeling (“Declarative”)
- Function Learning (“Algorithmic”)


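Probabilistic Modeling

- Goal: predict $y$ from $x$
- Model the process that creates the data:
  - $y \sim P(y)$ (discrete)
  - $x \sim \mathcal{N}(x \mid \mu_y, \Sigma_y)$ (Gaussian)
- Learning = model fitting
- Classification requires probabilistic inference:

$$P(y \mid x) = \frac{P(y)\,\mathcal{N}(x \mid \mu_y, \Sigma_y)}{\sum_{y'} P(y')\,\mathcal{N}(x \mid \mu_{y'}, \Sigma_{y'})}$$

[Figure: graphical model with class node $y$ (here “2”) pointing to observation node $x$]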
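The inference step on this slide can be illustrated with a minimal sketch, assuming a single real-valued feature so that each class-conditional Gaussian has a scalar mean and variance. The function names and the toy data are hypothetical and not taken from the talk.

```python
import math
from collections import defaultdict

def fit_gaussian_classes(xs, ys):
    """Model fitting: estimate P(y) and a 1-D Gaussian N(x | mu_y, var_y) per class."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    model = {}
    for y, vals in groups.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)  # maximum-likelihood variance
        model[y] = (len(vals) / len(xs), mu, var)           # (prior, mean, variance)
    return model

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(model, x):
    """Bayes rule: P(y | x) = P(y) N(x | mu_y, var_y) / sum over y' of the same term."""
    joint = {y: prior * normal_pdf(x, mu, var) for y, (prior, mu, var) in model.items()}
    total = sum(joint.values())
    return {y: p / total for y, p in joint.items()}

# Hypothetical toy data: one feature, two classes.
xs = [1.0, 1.2, 0.8, 3.0, 3.3, 2.7]
ys = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_classes(xs, ys)
post = posterior(model, 1.1)  # posterior class probabilities for a new x
```

Learning here is only counting and averaging; classification is the separate inference step that normalizes the joint probabilities.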

End-to-End Function Learning

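- Define a space of parameterized functions $\mathcal{F}_\Theta$
- Define a loss function $L(\hat{y}, y)$
- Solve the optimization problem:

$$\hat{\theta} := \arg\min_\theta \sum_i L(f_\theta(x_i), y_i) + \lambda \lVert \theta \rVert$$

- Classify a new input $x_q$ by evaluating $f_{\hat{\theta}}(x_q)$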
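As a rough illustration of this recipe, the sketch below fits a one-dimensional linear function by gradient descent. Several choices are assumptions made for the example: squared-error loss, a squared L2 penalty, the mean rather than the sum of the losses (which only rescales the penalty weight), and the step size, step count, and toy data.

```python
def objective(theta, xs, ys, lam):
    """Mean squared loss of f_theta(x) = w * x + b plus an L2 penalty (assumed regularizer)."""
    w, b = theta
    data_loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return data_loss + lam * (w * w + b * b)

def fit(xs, ys, lam=0.001, lr=0.05, steps=5000):
    """Approximate theta_hat = argmin_theta objective(theta) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared loss plus the gradient of the penalty.
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n + 2 * lam * b
        w, b = w - lr * gw, b - lr * gb
    return w, b

# Hypothetical data lying roughly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]
theta_hat = fit(xs, ys)
y_new = theta_hat[0] * 5.0 + theta_hat[1]  # evaluate the fitted function on a new input
```

Deep networks follow the same pattern with a much larger function space and stochastic gradients in place of the full-batch gradient used here.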


[Figure credit: LeCun, Bottou, Bengio, Haffner, 1998]

Programming Languages and Systems

- Both paradigms are now well supported by programming languages and systems
- Probabilistic programming
  - Bayesia, Stan, etc.
- Deep neural networks
  - PyTorch, TensorFlow, etc.


Outline

- Machine Learning: Two Paradigms
- Multi-Level Modeling in Stan
- Functional Prediction Methods
- Deep Neural Networks


Multilevel Modeling

Gelman & Hill (2006)
http://mc-stan.org/users/documentation/case-studies/radon.html

- Radon levels in homes (as a risk factor for lung cancer)
- Data
  - radon level measured in the basement or on the first floor (if no basement)
  - county soil uranium level
- Goal:
  - Identify counties with high radon in homes
- Structure:
  - households are nested within counties



Plate Notation

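- $j$ indexes the county
- $i$ indexes the household within the county
- $u_j$: soil uranium level
- $x_{i,j}$: floor (0 or 1)
- $y_{i,j}$: log radon level

[Figure: plate diagram. Outer plate “county” ($j = 1, \ldots, J$) contains $u_j$; inner plate “home” ($i = 1, \ldots, N_j$) contains $y_{i,j}$ and $x_{i,j}$.]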

Alternative Models (1): Fully Pooled Model

$$y_i = \alpha + \beta x_i + \epsilon_i$$

- Ignores the county uranium measurement
- Assumes each house has the same error distribution $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2)$
- Assumes all counties have the same radon level
- High variance implies poor fit to the data

[Figure: log(radon) versus floor]

Alternative Models (2): No Pooling

Separate intercept for each county:

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

- Assumes each house has the same error distribution $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2)$
- $j[i]$ means “the county where house $i$ is located”
- Much lower variance within county
- Some counties have very high radon levels. Are these real?

[Figure: per-county intercepts $\alpha_j$]

- County levels (“basement”) vary widely
- Are those high levels real?
- No, they reflect small sample sizes. Most counties suffer from small samples of either $x = 0$ or $x = 1$ (most houses in some counties have basements)

[Figure legend: fully pooled, no pooling]

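Multilevel Model 1: Partially pooled intercepts

- Two-level model:

$$\alpha_j \sim \mathrm{Normal}(\mu_\alpha, \sigma_\alpha^2)$$
$$\epsilon_i \sim \mathrm{Normal}(0, \sigma_y^2)$$
$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

- Combines the model of $\alpha_j$ and the model of $y_i$
- All counties affect $\mu_\alpha$, but counties with more data points have more influence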


- Note that the fit moves toward the fully pooled model for counties with few data points
- The variability in the estimated county radon levels is now much smaller
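The pull toward the pooled estimate can be made concrete with the standard precision-weighted average for a varying-intercept model. This is a sketch under simplifying assumptions: no floor predictor, and both standard deviations treated as known. The function name and all numbers are hypothetical.

```python
def partial_pool(county_mean, n_j, overall_mean, sigma_y, sigma_a):
    """Approximate multilevel estimate of a county intercept.

    Weighted average of the county's own mean and the overall mean, with
    weights n_j / sigma_y^2 (data precision) and 1 / sigma_a^2 (between-county precision).
    """
    w_data = n_j / sigma_y ** 2
    w_prior = 1.0 / sigma_a ** 2
    return (w_data * county_mean + w_prior * overall_mean) / (w_data + w_prior)

# Same raw county mean, very different sample sizes (hypothetical values).
small = partial_pool(county_mean=2.5, n_j=2, overall_mean=1.3, sigma_y=0.8, sigma_a=0.3)
large = partial_pool(county_mean=2.5, n_j=100, overall_mean=1.3, sigma_y=0.8, sigma_a=0.3)
```

With only two houses the estimate stays close to the overall mean; with a hundred houses it stays close to the county's own mean.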


[Figure legend: fully pooled, partial pooling]

- Visualization of all of the fitted radon models
- Some counties have log radon levels near 2.0; others have log radon levels near 1.0

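Multilevel Model 2: Include county uranium in the intercept model

$$\zeta_j \sim \mathrm{Normal}(0, \sigma_\alpha^2)$$
$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$
$$\epsilon_i \sim \mathrm{Normal}(0, \sigma_y^2)$$
$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$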
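One way to read this model is as a recipe for generating data. The sketch below simulates from it. All parameter values, the standard-normal draw for uranium, and the equal number of homes per county are assumptions chosen for illustration, and counties are indexed from 0 here rather than from 1.

```python
import random

def simulate_radon(J=5, homes_per_county=4, g0=1.5, g1=0.7, beta=-0.6,
                   sigma_a=0.2, sigma_y=0.7, seed=0):
    """Draw (county, u_j, x_i, y_i) rows from the two-level model with a uranium predictor."""
    rng = random.Random(seed)
    rows = []
    for j in range(J):
        u_j = rng.gauss(0.0, 1.0)                          # county soil uranium (assumed scale)
        alpha_j = g0 + g1 * u_j + rng.gauss(0.0, sigma_a)  # county intercept
        for _ in range(homes_per_county):
            x_i = rng.randint(0, 1)                        # floor indicator, 0 or 1
            y_i = alpha_j + beta * x_i + rng.gauss(0.0, sigma_y)  # log radon for this home
            rows.append((j, u_j, x_i, y_i))
    return rows

rows = simulate_radon()
```

Fitting the model runs this process in reverse: given the rows, infer the intercepts, slopes, and the two standard deviations.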


- Final per-county radon estimates
- $u_j$ is a strong predictor
- But the $\alpha_j$ estimates are adjusted to reflect confounding effects of $x_i$

[Figure: estimated county intercepts $\alpha_j$ versus county uranium $u_j$]

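Stan Code

data {
  int<lower=0> J;                          // number of counties
  int<lower=0> N;                          // number of homes
  array[N] int<lower=1, upper=J> county;   // county index j[i] of each home
  vector[N] u;                             // county soil uranium, repeated per home
  vector[N] x;                             // floor indicator (0 or 1)
  vector[N] y;                             // log radon level
}
parameters {
  vector[J] a;                             // county-level intercept terms
  vector[2] b;                             // b[1]: uranium slope, b[2]: floor slope
  real mu_a;
  real<lower=0, upper=100> sigma_a;
  real<lower=0, upper=100> sigma_y;
}
transformed parameters {
  vector[N] y_hat;
  vector[N] m;
  for (i in 1:N) {
    m[i] = a[county[i]] + u[i] * b[1];     // intercept for the county of home i
    y_hat[i] = m[i] + x[i] * b[2];         // add the floor effect
  }
}
model {
  mu_a ~ normal(0, 1);
  a ~ normal(mu_a, sigma_a);
  b ~ normal(0, 1);
  y ~ normal(y_hat, sigma_y);
}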


Summary: Why Multilevel Modeling?

- Accounts for individual- and group-level variation when estimating group-level coefficients
- Models variation among individual-level coefficients
- Gives better estimates of regression coefficients for groups with small sample sizes by “borrowing strength” from other groups





Functional Prediction Methods

- Random Forests
- Support Vector Machines
- Given:
  - Training data: $(x_1, y_1), \ldots, (x_N, y_N)$
    - $x_i$: $d$-dimensional vector of predictor variables
    - $y_i$: real or discrete response value
- Find:
  - Function $f$ that can predict $\hat{y} = f(x)$ for new points $x$


Decision Tree

- Let $x_{\cdot j}$ be the value of the $j$-th predictor variable for data point $x$
- A query $x$ traverses the tree until it reaches a leaf. The corresponding $\hat{y}$ value is $f(x)$
- The tree is “grown” top-down by choosing the most informative predictor/threshold combination at each step
- $\hat{y}_\ell$ is the mean of the responses $y_i$ of the training points $x_i$ that arrive at leaf $\ell$


[Figure: binary decision tree. Each internal node tests one predictor $x_{\cdot j}$ against a threshold; “yes” and “no” branches lead down to leaves with values $\hat{y}_\ell$.]
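The growing and querying steps described on this slide might look roughly like the following for a regression tree. This is a sketch with assumptions: “most informative” is taken to mean the largest reduction in squared error, thresholds are midpoints between observed values, and the depth limit and toy data are arbitrary.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Return (j, threshold, error) with the lowest summed squared error, or None."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[2]:
                best = (j, t, err)
    return best

def grow(X, y, depth=0, max_depth=2):
    """Grow the tree top-down; each leaf stores the mean response of its training points."""
    split = best_split(X, y) if depth < max_depth and len(set(y)) > 1 else None
    if split is None:
        return {"leaf": mean(y)}
    j, t, _ = split
    yes = [(r, v) for r, v in zip(X, y) if r[j] > t]
    no = [(r, v) for r, v in zip(X, y) if r[j] <= t]
    return {"j": j, "t": t,
            "yes": grow([r for r, _ in yes], [v for _, v in yes], depth + 1, max_depth),
            "no": grow([r for r, _ in no], [v for _, v in no], depth + 1, max_depth)}

def predict(tree, x):
    """Send a query down the tree until it reaches a leaf."""
    while "leaf" not in tree:
        tree = tree["yes"] if x[tree["j"]] > tree["t"] else tree["no"]
    return tree["leaf"]

# Hypothetical data: the response depends only on the first predictor.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]]
y = [1.0, 1.0, 3.0, 3.0]
tree = grow(X, y)
```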

Randomized Tree

- When the tree is “grown”, only a randomly chosen subset of the predictor variables is considered at each node

[Figure: binary decision tree with predictor/threshold tests at the internal nodes and values $\hat{y}_\ell$ at the leaves]

Random Forest

- A random forest is a collection of $B$ randomized trees
- Each tree $f_b$ is “grown” on a bootstrap replicate of the training data
- The predicted value is the mean of the predictions of the individual trees:

$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$
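The three ingredients above (randomized trees, bootstrap replicates, averaging) can be combined in a compact sketch. It is a toy regression version, not the algorithm as implemented in any particular package; the depth limit, the number of trees, the subset size `m`, and the data are assumptions.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def grow(X, y, rng, m, depth=0, max_depth=3):
    """Randomized regression tree: each node considers only m randomly chosen predictors."""
    if depth == max_depth or len(set(y)) == 1:
        return mean(y)                              # leaf: mean response
    best = None
    for j in rng.sample(range(len(X[0])), m):
        for t in sorted(set(row[j] for row in X))[:-1]:
            yes = [yi for row, yi in zip(X, y) if row[j] > t]
            no = [yi for row, yi in zip(X, y) if row[j] <= t]
            err = sse(yes) + sse(no)
            if best is None or err < best[0]:
                best = (err, j, t)
    if best is None:                                # sampled predictors were constant here
        return mean(y)
    _, j, t = best
    yes = [(r, v) for r, v in zip(X, y) if r[j] > t]
    no = [(r, v) for r, v in zip(X, y) if r[j] <= t]
    return (j, t,
            grow([r for r, _ in yes], [v for _, v in yes], rng, m, depth + 1, max_depth),
            grow([r for r, _ in no], [v for _, v in no], rng, m, depth + 1, max_depth))

def predict_tree(tree, x):
    while isinstance(tree, tuple):                  # internal nodes are tuples, leaves are floats
        j, t, yes, no = tree
        tree = yes if x[j] > t else no
    return tree

def fit_forest(X, y, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]    # bootstrap replicate: N draws with replacement
        forest.append(grow([X[i] for i in idx], [y[i] for i in idx], rng, m))
    return forest

def predict_forest(forest, x):
    return mean([predict_tree(tree, x) for tree in forest])  # average over the trees

# Hypothetical data: the response steps up with the first predictor; the second is noise.
X = [[float(i), float((7 * i) % 20)] for i in range(20)]
y = [0.0 if i < 10 else 5.0 for i in range(20)]
forest = fit_forest(X, y)
y_hat = predict_forest(forest, [15.0, 3.0])
```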


Random Forest Advantages

- Can work with a mix of discrete and continuous predictor variables
- Can handle missing values
- Makes no assumptions about the error distribution of $y$
- Considers high-order interactions among predictors
- Generally gives excellent predictive accuracy


Random Forest Disadvantages

- Cannot be usefully inspected (“black box”)
- However:
  - Can provide estimates of variable importance (see the “randomForest” R package)
  - Can be modified to support hypothesis tests and confidence intervals (see Mentch & Hooker, 2016, 2017)


Support Vector Machines

- Extension of the linear classification model: $y = w_0 + \sum_j w_j x_j$
- New ideas:
  - Maximize the margin between the classes
  - Implicitly map to a high-dimensional feature space using kernels


Classification (Iris Species)


Decision Boundaries: Which one is best?


SVM Finds the Boundary that Maximizes the Margin


Full Iris Data is Not Separable: SVM balances sum of errors
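The trade-off between a wide margin and the sum of errors is commonly written as an L2 penalty plus the hinge loss. The sketch below minimizes that objective with plain subgradient descent, which is a stand-in chosen for brevity and not the quadratic-programming solver used by packages such as LIBSVM. The values of `lam`, `lr`, `epochs`, and the toy data are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (lam / 2) * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b)))."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0     # gradient of the penalty term
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                       # inside the margin or misclassified
                gw = [g - yi * xj / n for g, xj in zip(gw, xi)]
                gb -= yi / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical two-class data with labels -1 and +1.
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```

Points with margin below 1 are the only ones that push on the boundary, which is why the solution depends on a subset of the training data. A larger `lam` favors a wider margin at the cost of more margin violations.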


SVMs can fit non-linear decision boundaries using “kernels”
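A small numerical check shows what “implicitly map” means: for a degree-2 polynomial kernel on 2-D inputs, evaluating the kernel gives the same number as an inner product in an explicit 3-D feature space, without ever building that space. The RBF kernel is included as a second common choice; the `gamma` value and the test points are assumptions.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in feature space
implicit = poly_kernel(x, z)                           # same value, computed in input space
```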


SVM Assessment

- Strengths:
  - Excellent performance on $d \gg N$ problems
  - Good free implementations (LIBSVM, wrapped for R, Python, etc.)
- Weaknesses:
  - Does not scale to large datasets easily
  - Requires tuning 2 hyperparameters





ImageNet (1000 object classes): Top-5 Error Rate


[Figure: top-5 classification error (%) by year, 2010 to 2014; vertical axis from 0 to 30; labels “Before” and “After”]

Speech Recognition Results


[Figure: Google speech recognition word error, 2013 to 2015, with values 23% and 8%. Credit: Fernando Pereira & Matthew Firestone, Google; Protalinski, Google]

DNN Practicalities

- The structure of each DNN must be carefully chosen for the task
- There are many hyperparameters
  - Auto-ML tools seek to automatically adjust the network structure and hyperparameters
- Generally require lots of data and lots of compute time
  - Many groups have had success with “fine-tuning” of pre-trained networks


Environmental Health Applications of DNNs

- Analyzing medical images
- Analyzing EKG and other signal data
- Analyzing spectra
- Analyzing electronic health records


Summary

- For making inferences about environmental health, the probabilistic modeling paradigm is recommended
  - Interpretable models
  - Can draw causal inferences under some conditions
- For extracting data from sensors, EHRs, images:
  - Predictive models (random forests, SVMs, DNNs) excel
  - SVMs and DNNs require tuning hyperparameters
  - Tools are beginning to emerge to automate tuning


References

- Stan: http://mc-stan.org/
- Gelman & Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models.
- randomForest package in R
- Mentch, L., & Hooker, G. (2016). Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. Journal of Machine Learning Research, 17, 1–41.
- Mentch, L., & Hooker, G. (2017). Formal Hypothesis Tests for Additive Structure in Random Forests. Journal of Computational and Graphical Statistics, 26(3), 589–597.
- LIBSVM package for fitting support vector machines
- Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press. http://www.support-vector.net/
- Zoph, B., & Le, Q. V. (2016). Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578, 1–16.
