
MODERN MACHINE LEARNING: PROBABILISTIC MODELING AND FUNCTIONAL PREDICTION

Tom Dietterich, Oregon State University
Environmental Health Summit

Machine Learning Basics

• Goal: program a computer to compute some function
• Given: Training data (x_1, y_1), …, (x_N, y_N)
• Find: A function f such that y_i ≈ f(x_i)
• Typical Tasks:
  – Document classification
  – Predict jet engine failure
  – Predict customer behavior


[Figure: example input image labeled “2”]

Two Main Paradigms

• Probabilistic Modeling (“Declarative”)
• Function Learning (“Algorithmic”)


Probabilistic Modeling
• Goal: Predict y from x
• Model the process that creates the data:
  – y ∼ P(y) (discrete)
  – x ∼ N(x; μ_y, Σ_y) (Gaussian)
• Learning = Model Fitting
• Classification requires probabilistic inference (sketched in code below):
  P(y | x) = P(y) N(x; μ_y, Σ_y) / Σ_y′ P(y′) N(x; μ_y′, Σ_y′)


[Figure: generative model — class y generates observation x, illustrated with an image of the digit “2”]
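A minimal sketch of this generative classifier in Python (assuming the class priors P(y) and the per-class Gaussian parameters μ_y, Σ_y have already been fit to training data; the parameter values below are made up for illustration):

import numpy as np
from scipy.stats import multivariate_normal

def posterior(x, priors, means, covs):
    """P(y | x) when y ~ P(y) and x | y ~ N(mu_y, Sigma_y)."""
    joint = np.array([p * multivariate_normal.pdf(x, mean=m, cov=c)
                      for p, m, c in zip(priors, means, covs)])
    return joint / joint.sum()   # the denominator sums over all classes y'

# made-up parameters for a two-class problem in two dimensions
priors = [0.6, 0.4]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), np.eye(2)]
print(posterior(np.array([0.8, 0.9]), priors, means, covs))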

End-to-End Function Learning

• Define a space of parameterized functions ℱ(Θ)
• Define a loss function ℓ(ŷ, y)
• Solve the optimization problem (sketched in code below):
  θ̂ ≔ argmin_θ Σ_i ℓ(f_θ(x_i), y_i) + λ‖θ‖²
• Classify a new input x* by evaluating f_θ̂(x*)


[Figure credit: LeCun, Bottou, Bengio, Haffner, 1998]
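A minimal sketch of this recipe in Python/NumPy, assuming a linear function class f_θ(x) = θ·x, squared loss, and the λ‖θ‖² penalty above; plain gradient descent stands in for whatever optimizer would actually be used:

import numpy as np

def fit_linear(X, y, lam=0.1, lr=1e-3, steps=5000):
    """Gradient descent on  sum_i (theta.x_i - y_i)^2 + lam * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        resid = X @ theta - y                      # f_theta(x_i) - y_i for all i
        grad = 2 * X.T @ resid + 2 * lam * theta   # gradient of loss + penalty
        theta -= lr * grad
    return theta

# made-up training data: y is roughly linear in x plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta_hat = fit_linear(X, y)
y_star = np.array([0.2, 0.1, -0.3]) @ theta_hat    # evaluate f_theta_hat at a new input x*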

Programming Languages and Systems
• Both paradigms are now well-supported by programming languages and systems
• Probabilistic programming
  – Bayesia, Stan, etc.
• Deep neural networks
  – PyTorch, TensorFlow, etc.


Outline

• Machine Learning: Two Paradigms
• Multi-Level Modeling in Stan
• Functional Prediction Methods
• Deep Neural Networks


Multilevel Modeling
Gelman & Hill (2006)
http://mc-stan.org/users/documentation/case-studies/radon.html
• Radon levels in homes (as a risk factor for lung cancer)
• Data:
  – radon level measured in the basement or on the first floor (if there is no basement)
  – county soil uranium level
• Goal:
  – Identify counties with high radon in homes
• Structure:
  – households are nested within counties


Plate Notation

• j indexes the county
• i indexes the household within county j
• u_j: county soil uranium level
• x_{i,j}: floor (0 or 1)
• y_{i,j}: log radon level


[Plate diagram: an outer plate over counties j = 1, …, J containing u_j, and an inner plate over homes i = 1, …, n_j containing x_{i,j} and y_{i,j}]

Alternative Models (1): Fully Pooled Model
• y_i = α + β·x_i + ε_i
  – ignores the county uranium measurement
  – assumes each house has the same error distribution ε_i ∼ Normal(0, σ_y²)

[Figure: log(radon) vs. floor under the fully pooled fit. High variance implies a poor fit to the data; assumes all counties have the same radon level]

Alternative Models (2): No Pooling
Separate intercept for each county
• y_i = α_{j[i]} + β·x_i + ε_i
  – assumes each house has the same error distribution ε_i ∼ Normal(0, σ_y²)
  – j[i] means “the county where house i is located” (a code sketch of models (1) and (2) follows the figure below)

[Figure: the no-pooling fit. Much lower variance within county; some counties have very high radon levels! Are these real?]
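A minimal sketch of models (1) and (2) in Python/NumPy on made-up radon-style data (both fits reduce to ordinary least squares; only the design matrix changes):

import numpy as np

# made-up data: log radon, floor (0/1), and county index for each house
rng = np.random.default_rng(1)
n, J = 200, 10
county = rng.integers(J, size=n)
floor = rng.integers(2, size=n)
county_effect = rng.normal(scale=0.3, size=J)
log_radon = 1.5 - 0.6 * floor + county_effect[county] + rng.normal(scale=0.7, size=n)

# (1) fully pooled: one intercept alpha shared by every county
X_pooled = np.column_stack([np.ones(n), floor])
alpha, beta = np.linalg.lstsq(X_pooled, log_radon, rcond=None)[0]

# (2) no pooling: a separate intercept alpha_j per county (one-hot county columns)
X_sep = np.column_stack([np.eye(J)[county], floor])
coef = np.linalg.lstsq(X_sep, log_radon, rcond=None)[0]
alpha_j, beta_sep = coef[:J], coef[J]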

• County levels (“basement”) vary widely
• Are those high levels real?
• No, they reflect small sample sizes. Most counties have small samples of either x = 0 or x = 1 (most houses in some counties have basements)


[Figure: comparison of the fully pooled and no-pooling fits]

Multilevel Model 1: Partially Pooled Intercepts
• Two-level model:
  α_j ∼ Normal(μ_α, σ_α²)
  ε_i ∼ Normal(0, σ_y²)
  y_i = α_{j[i]} + β·x_i + ε_i
• Combines the model of α_j and the model of y_i
• All counties affect μ_α, but counties with more data points have more influence
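One way to make the last point concrete (a simplification, assuming no floor predictor, following Gelman & Hill): the partially pooled estimate of a county intercept is approximately a precision-weighted average of that county's sample mean ȳ_j and the shared mean μ_α,

  α̂_j ≈ [ (n_j/σ_y²)·ȳ_j + (1/σ_α²)·μ_α ] / [ n_j/σ_y² + 1/σ_α² ]

so as the number of houses n_j grows, the county's own data dominate, while counties with few houses are shrunk toward μ_α.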


• Note that the fit moves toward the fully pooled model for counties with few data points
• Now the variability in radon levels is much less


[Figure: comparison of the fully pooled and partial-pooling fits]

• Visualization of all of the fitted radon models
• Some counties have log radon levels near 2.0; others have log radon levels near 1.0


!"#

$

Multilevel Model 2: Include County Uranium in the Intercept Model
  η_j ∼ Normal(0, σ_α²)
  α_j = γ_0 + γ_1·u_j + η_j
  ε_i ∼ Normal(0, σ_y²)
  y_i = α_{j[i]} + β·x_i + ε_i


• Final per-county radon estimates
• u_j is a strong predictor
• But the α_j estimates are adjusted to reflect the confounding effects of the floor x

[Figure: estimated county intercepts α_j plotted against county uranium u_j]

Stan Code

data {
  int<lower=0> J;                          // number of counties
  int<lower=0> N;                          // number of houses
  array[N] int<lower=1, upper=J> county;   // county of each house
  vector[N] u;                             // county soil uranium for each house
  vector[N] x;                             // floor indicator (0 or 1)
  vector[N] y;                             // log radon
}
parameters {
  vector[J] a;                             // per-county intercepts
  vector[2] b;                             // coefficients on u and x
  real mu_a;
  real<lower=0, upper=100> sigma_a;
  real<lower=0, upper=100> sigma_y;
}
transformed parameters {
  vector[N] y_hat;
  vector[N] m;
  for (i in 1:N) {
    m[i] = a[county[i]] + u[i] * b[1];
    y_hat[i] = m[i] + x[i] * b[2];
  }
}
model {
  mu_a ~ normal(0, 1);
  a ~ normal(mu_a, sigma_a);               // partial pooling of county intercepts
  b ~ normal(0, 1);
  y ~ normal(y_hat, sigma_y);
}
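A minimal sketch of fitting this model from Python with CmdStanPy (the file name and the data arrays are hypothetical; the radon case study linked above provides the actual data):

from cmdstanpy import CmdStanModel

# hypothetical arrays prepared elsewhere: county_idx (1..J per house), u, x, log_radon
data = {"J": J, "N": N, "county": county_idx, "u": u, "x": x, "y": log_radon}

model = CmdStanModel(stan_file="radon_multilevel.stan")  # a file holding the code above
fit = model.sample(data=data, chains=4)
print(fit.summary())   # posterior summaries for a, b, mu_a, sigma_a, sigma_y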


Summary: Why Multilevel Modeling?
• Accounts for individual- and group-level variation when estimating group-level coefficients
• Models variation among individual-level coefficients
• Gives better estimates of regression coefficients for groups with small sample sizes by “borrowing strength” from other groups


Outline

• Machine Learning: Two Paradigms
• Multi-Level Modeling in Stan
• Functional Prediction Methods
• Deep Neural Networks


Functional Prediction Methods

• Random Forests
• Support Vector Machines
• Given:
  – Training data: (x_1, y_1), …, (x_N, y_N)
  – x_i: a p-dimensional vector of predictor variables
  – y_i: a real or discrete response value
• Find:
  – A function f that can predict ŷ = f(x) for new points x


Decision Tree

• Let x_{·j} be the value of the j-th predictor variable for data point x
• A query x traverses the tree until it reaches a leaf; the corresponding ŷ value is f(x) (traversal sketched in code below)
• The tree is “grown” top-down by choosing the most informative predictor/threshold combination at each step
• ŷ_ℓ is the mean of the y_i that arrive at leaf ℓ


[Figure: example regression tree — each internal node tests one predictor against a threshold (x_{·j} > θ), with yes/no branches leading to leaf predictions ŷ_1, …, ŷ_6]
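A minimal sketch of the traversal just described, with the tree stored as a nested dict (the features, thresholds, and leaf values are made up; growing the tree is not shown):

def predict(tree, x):
    """Route x down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, dict):
        j, theta = tree["feature"], tree["threshold"]
        tree = tree["yes"] if x[j] > theta else tree["no"]
    return tree   # leaf value: the mean of the training y_i that reached this leaf

# made-up tree with two internal nodes and three leaves
tree = {"feature": 2, "threshold": 0.5,
        "yes": {"feature": 0, "threshold": 1.3, "yes": 2.0, "no": 1.1},
        "no": 0.4}
print(predict(tree, [0.9, -0.2, 0.7]))   # takes the "yes" branch at the root, returns 1.1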

Randomized Tree

• When the tree is “grown”, only a randomly chosen subset of the predictor variables is considered at each node
[Figure: example randomized tree, same form as the tree above]

Random Forest
• A random forest is a collection of B randomized trees
• Each tree f_b is “grown” on a bootstrap replicate of the training data
• The predicted value is the mean of the predictions of the individual trees:
  ŷ = (1/B) Σ_{b=1}^{B} f_b(x)
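The slides point to the randomForest package in R; the sketch below uses scikit-learn's RandomForestRegressor instead (an assumption, not the tool named here) on made-up data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# made-up training data: 500 points, 8 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] - 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.3, size=500)

# B = 500 trees; max_features controls the random subset of predictors per split
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", oob_score=True)
forest.fit(X, y)

print(forest.oob_score_)            # out-of-bag R^2, a built-in accuracy estimate
print(forest.feature_importances_)  # per-variable importance estimates
y_new = forest.predict(rng.normal(size=(1, 8)))  # mean of the 500 trees' predictions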


Random Forest Advantages

• Can work with a mix of discrete and continuous predictor variables
• Can handle missing values
• Makes no assumptions about the error distribution of y
• Considers high-order interactions among predictors
• Generally gives excellent predictive accuracy


Random Forest Disadvantages

• Cannot be usefully inspected (“black box”)
• However:
  – Can provide estimates of variable importance (see the “randomForest” R package)
  – Can be modified to support hypothesis tests and confidence intervals (see Mentch & Hooker, 2016, 2017)


Support Vector Machines

• Extension of the linear classification model: y = β_0 + Σ_j β_j·x_j
• New ideas:
  – Maximize the margin between the classes
  – Implicitly map to a high-dimensional feature space using kernels
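A minimal sketch using scikit-learn's SVC (which wraps libSVM, mentioned later in the deck) on the iris data shown on the next slides; C and the RBF kernel width gamma are the two hyperparameters that need tuning:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# the two hyperparameters to tune: C (error penalty) and gamma (RBF kernel width)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))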


Classification (Iris Species)


Decision Boundaries: Which One Is Best?


SVM Finds the Boundary that Maximizes the Margin


Full Iris Data Is Not Separable
SVM balances the margin against the sum of errors


SVMs can fit non-linear decision boundaries using “kernels”


SVM Assessment

• Strengths:
  – Excellent performance on p ≫ N problems
  – Good free implementations (libSVM, wrapped for R, Python, etc.)
• Weaknesses:
  – Does not scale easily to large datasets
  – Requires tuning 2 hyperparameters


Outline

• Machine Learning: Two Paradigms
• Multi-Level Modeling in Stan
• Functional Prediction Methods
• Deep Neural Networks


ImageNet (1000 object classes): Top-5 Error Rate


[Figure: top-5 classification error (%) on ImageNet by year, 2010–2014, with results labeled “Before” and “After”; the error axis runs from 0 to 30%]

Speech Recognition Results


[Figure: Google speech recognition word error rate, 2013–2015, falling from 23% to 8%. Credit: Fernando Pereira & Matthew Firestone, Google; Protalinski, Google]

DNN Practicalities

• The structure of each DNN must be carefully chosen for the task
• There are many, many hyperparameters
  – Auto-ML tools seek to automatically adjust the network structure and hyperparameters
• Generally require lots of data and lots of compute time
  – Many groups have had success with “fine-tuning” of pre-trained networks (sketched in code below)
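A minimal sketch of that fine-tuning pattern in PyTorch/torchvision, assuming an ImageNet-pretrained ResNet-18 and a hypothetical 5-class task (data loading omitted):

import torch
import torch.nn as nn
from torchvision import models

# start from an ImageNet-pretrained network
net = models.resnet18(weights="IMAGENET1K_V1")

# freeze the pre-trained feature extractor
for p in net.parameters():
    p.requires_grad = False

# replace the final layer for a hypothetical 5-class problem; only it will be trained
net.fc = nn.Linear(net.fc.in_features, 5)

optimizer = torch.optim.Adam(net.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# one training step on a made-up batch of images and labels
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = loss_fn(net(images), labels)
loss.backward()
optimizer.step()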


Environmental Health Applications of DNNs
• Analyzing medical images
• Analyzing EKG and other signal data
• Analyzing spectra
• Analyzing electronic health records


Summary

• For making inferences about environmental health, the probabilistic modeling paradigm is recommended
  – Interpretable models
  – Can draw causal inferences under some conditions
• For extracting data from sensors, EHRs, and images, predictive models (random forests, SVMs, DNNs) excel
  – SVMs and DNNs require tuning hyperparameters
  – Tools are beginning to emerge to automate tuning


References

• Stan: http://mc-stan.org/
• Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
• randomForest package in R
• Mentch, L., & Hooker, G. (2016). Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. Journal of Machine Learning Research, 17, 1–41.
• Mentch, L., & Hooker, G. (2017). Formal Hypothesis Tests for Additive Structure in Random Forests. Journal of Computational and Graphical Statistics, 26(3), 589–597.
• LibSVM package for fitting support vector machines
• Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
• Zoph, B., & Le, Q. V. (2016). Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578.

