random forest and analytics
Post on 02-Jun-2018
8/11/2019 Random Forest And Analytics
Random Forest
Applied Multivariate Statistics, Spring 2012
Overview
Intuition of Random Forest
The Random Forest Algorithm
De-correlation gives better accuracy
Out-of-bag error (OOB-error)
Variable importance
Intuition of Random Forest
[Figure: three decision trees (Tree 1, Tree 2, Tree 3) that split on young/old, short/tall, female/male and working/retired, with leaves labeled healthy or diseased]
New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
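The majority rule in this example can be sketched in a few lines of Python. The function name `majority_vote` is an illustrative choice, not something from the slides, and the sketch assumes each tree has already produced its class label.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    # most_common(1) returns a list with one (label, count) pair.
    return Counter(predictions).most_common(1)[0][0]

# Predictions of Tree 1, Tree 2 and Tree 3 for the new sample.
tree_predictions = ["diseased", "healthy", "diseased"]
print(majority_vote(tree_predictions))  # diseased
```

For regression, the same aggregation step would average the tree outputs instead of counting votes.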
The Random Forest Algorithm
Differences to standard tree
Train each tree on a bootstrap resample of the data
(Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)
For each split, consider only m randomly selected variables
Don't prune
Fit B trees in this way and use averaging or majority voting to aggregate the results
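The two sources of randomness listed above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names `bootstrap_resample` and `candidate_variables` are hypothetical, and integer indices stand in for real data rows and variables. It does not fit any trees.

```python
import random

def bootstrap_resample(data, rng):
    """Draw len(data) samples from data with replacement."""
    return [rng.choice(data) for _ in data]

def candidate_variables(n_variables, m, rng):
    """Pick m distinct variable indices to consider at a single split."""
    return rng.sample(range(n_variables), m)

rng = random.Random(0)      # fixed seed so the sketch is reproducible
data = list(range(1000))    # indices 0..999 stand in for N = 1000 rows

resample = bootstrap_resample(data, rng)
unique_fraction = len(set(resample)) / len(data)

# The resample has N entries, but only a part of the original rows appears in it.
print(len(resample), round(unique_fraction, 2))
print(candidate_variables(n_variables=10, m=3, rng=rng))
```

For large N, the expected fraction of distinct original rows in a bootstrap resample is about 1 - 1/e, roughly 63%, so roughly a third of the rows are left out of each resample.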
Why Random Forest works 1/2
Mean Squared Error = Variance + Bias^2
If trees are sufficiently deep, they have very small bias
How could we improve the variance over that of a single
tree?
Why Random Forest works 2/2
Variance of the average of B trees (ρ: correlation between trees i ≠ j)
Decreases if the number of trees B increases (irrespective of ρ)
Decreases if ρ decreases, i.e., if m decreases
De-correlation gives better accuracy
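The formula these annotations refer to is not preserved in the transcript. A standard result that matches both annotations, assuming B identically distributed trees T_1, ..., T_B, each with variance σ² and pairwise correlation ρ (the symbols σ² and T_i are assumptions introduced here), is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{i=1}^{B} T_i(x)\right)
  = \frac{1}{B^2}\left(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as B grows, whatever the value of ρ. The first term does not depend on B and shrinks only when ρ shrinks, which is what considering fewer variables m per split aims for.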
Estimating generalization error: Out-of-bag (OOB) error
Similar to leave-one-out cross-validation, but with almost no additional computational burden
The OOB error is a random number, since it is based on random resamples of the data
[Figure: decision tree fit to the resampled data, splitting on young/old and short/tall, with leaves labeled healthy or diseased]
Data:
old, tall -> healthy
old, short -> diseased
young, tall -> healthy
young, short -> diseased
young, short -> healthy
young, tall -> healthy
old, short -> diseased
Resampled Data:
old, tall -> healthy
old, short -> diseased
young, tall -> healthy
young, tall -> healthy
Out of bag samples:
young, short -> diseased
young, short -> healthy
young, tall -> healthy
old, short -> diseased
Out of bag (OOB) error rate: 0.25
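The OOB computation for this example can be traced in Python. Two things in the sketch are assumptions, because the figure did not survive: the resample is taken to consist of rows 0, 1, 2 and row 2 again, and `hypothetical_tree` is one simple tree that fits those resampled rows, not necessarily the tree drawn on the slide.

```python
# Rows from the example: (age, height, label).
data = [
    ("old", "tall", "healthy"),
    ("old", "short", "diseased"),
    ("young", "tall", "healthy"),
    ("young", "short", "diseased"),
    ("young", "short", "healthy"),
    ("young", "tall", "healthy"),
    ("old", "short", "diseased"),
]

# Assumption: rows 0, 1, 2 were drawn, row 2 twice; rows 3..6 were never drawn.
resample_idx = [0, 1, 2, 2]
drawn = set(resample_idx)
oob_idx = [i for i in range(len(data)) if i not in drawn]

def hypothetical_tree(age, height):
    """Assumed tree that classifies all resampled rows correctly."""
    if age == "young":
        return "healthy"
    return "diseased" if height == "short" else "healthy"

# Evaluate the tree only on rows it never saw during training.
errors = sum(
    hypothetical_tree(age, height) != label
    for age, height, label in (data[i] for i in oob_idx)
)
oob_error = errors / len(oob_idx)
print(oob_idx, oob_error)  # [3, 4, 5, 6] 0.25
```

Under these assumptions, one of the four out-of-bag rows (young, short, diseased) is misclassified, which gives the error rate of 0.25 stated in the example.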
Variable Importance for variable i using Permutations
Data -> Resampled Dataset 1, ..., Resampled Dataset m, with OOB Data 1, ..., OOB Data m
Tree 1, ..., Tree m
OOB error e_1, ..., OOB error e_m
Permute values of variable i in OOB data set
OOB error p_1, ..., OOB error p_m
$d_1 = e_1 - p_1, \dots, d_m = e_m - p_m$
$\bar{d} = \frac{1}{m} \sum_{i=1}^{m} d_i$
$s_d^2 = \frac{1}{m-1} \sum_{i=1}^{m} (d_i - \bar{d})^2$
$v_i = \frac{\bar{d}}{s_d}$
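A small numeric sketch of these formulas, using the Python standard library. The error values below are made up purely for illustration, and the function name `permutation_importance` is a hypothetical choice. `statistics.stdev` uses the 1/(m-1) normalization, matching the definition of s_d.

```python
from statistics import mean, stdev

def permutation_importance(e, p):
    """Compute mean(d) / sd(d) with d = e - p, one difference per tree."""
    d = [e_k - p_k for e_k, p_k in zip(e, p)]
    return mean(d) / stdev(d)

# Made-up OOB errors of m = 4 trees, before and after permuting one variable.
e = [0.20, 0.25, 0.22, 0.18]   # original OOB errors
p = [0.35, 0.45, 0.32, 0.36]   # OOB errors after permutation

v = permutation_importance(e, p)
print(round(v, 2))
```

With the sign convention d = e - p, permuting an informative variable raises the OOB error, so d is negative; a value of v far below zero then indicates an important variable, and a value near zero indicates one the trees barely use.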
Trees vs. Random Forest
Trees:
+ Trees yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions of trees tend to have a high variance
Random Forest:
+ RF has smaller prediction variance and therefore usually a better general performance
+ Easy to tune parameters
- Rather slow
- Black box: rather hard to get insights into decision rules
Comparing runtime
(just for illustration)
[Figure: runtime of RF compared with a single Tree]
Up to thousands of variables
Problematic if there are categorical predictors with many levels (max: 32 levels)
RF: First predictor cut into 15 levels
RF vs. LDA
LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only on categorical response
- Needs CV for estimating prediction error
RF:
+ Can model nonlinear class boundaries
+ OOB error for free (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- Black box
- Slow
Concepts to know
Idea of Random Forest and how it reduces the prediction variance of trees
OOB error
Variable Importance based on Permutation
R functions to know
Functions randomForest and varImpPlot from package randomForest