RANDOM FOREST
WHAT IS RANDOM FOREST?
Random forest is an ensemble classifier that uses many decision tree models. It can be used for both classification and regression, and accuracy and variable importance information are provided with the result.
A random forest is a collection of unpruned CART-like trees following specific rules for tree growing, tree combination, self-testing, and post-processing.
Trees are grown using binary partitioning, i.e. each parent node is split into at most two children.
COMPARISON
Random forest is similar to a decision tree, with a few differences:
For each split point, the search is not over all variables but only over a random subset of them.
No pruning is necessary; trees can be grown until each node contains just a few observations.
Advantages over a single decision tree: better prediction (in general), and little parameter tuning is necessary with RF.
Terminology: training set size (N), total number of attributes (M), number of attributes used (m), total number of trees (n).
HOW DOES RANDOM FOREST WORK?
A random seed is chosen, which pulls out at random a collection of samples from the training dataset while maintaining the class distribution.
With this selected dataset, a random set of attributes from the original dataset is chosen based on user-defined values. Not all input variables are considered, because doing so would require enormous computation and carry a high chance of overfitting.
If M is the total number of input attributes in the dataset, only m attributes are chosen at random for each tree, where m < M.
From this set, the attribute giving the best possible split under the Gini index is used to grow the decision tree, and the process repeats for each branch until the termination condition is met: the leaves are nodes too small to split. A minimal sketch of this procedure follows.
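As a concrete illustration, here is a minimal sketch in Python using scikit-learn (the library and dataset are assumptions for illustration; the slides name no implementation). The parameters map onto the terminology above: n_estimators is n, max_features plays the role of m, and the bootstrap sample of the N training rows is drawn internally.

```python
# Minimal random forest sketch (scikit-learn assumed; not from the slides).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                 # N samples, M attributes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)            # stratify keeps the class distribution

forest = RandomForestClassifier(
    n_estimators=100,     # n: total number of trees
    max_features="sqrt",  # m < M attributes considered (per split in scikit-learn)
    criterion="gini",     # split quality measured by the Gini index
    bootstrap=True,       # each tree sees a random sample of the training data
    random_state=42,      # the "random seed" mentioned above
)
forest.fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))            # classification accuracy
print("variable importance:", forest.feature_importances_)  # variable importance
```

Note that scikit-learn draws the m attributes anew at each split rather than once per tree; both variants appear in the literature.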
MORE ON RANDOM FOREST - I
Information provided by a random forest: classification accuracy, variable importance, outliers (for classification), missing-data estimation, and error rates for the random forest object.
Advantages: no need to prune the trees; accuracy and variable importance are generated automatically; overfitting is rarely a problem; the model is not very sensitive to outliers in the training data; parameters are easy to set.
MORE ON RANDOM FOREST - II
Limitations: regression cannot predict beyond the range of values seen in the training data, so extreme values are not predicted accurately.
Applications include classification (land cover classification, cloud screening) and regression (continuous field mapping, biomass mapping).
MOTIVATION
Efficient use of multi-core technology: although this is OS-dependent, using Hadoop helps ensure that multiple cores are used efficiently.
WINNOWING ALGORITHM
Winnow is a machine learning technique for learning a linear classifier from labelled examples.
It is similar to the perceptron algorithm, but while the perceptron uses an additive weight-update scheme, winnow uses a multiplicative weight-update scheme.
It performs well when many of the features given to the learner turn out to be irrelevant.
During training, winnow is shown a sequence of positive and negative examples. From these it learns a decision hyperplane, which can then be used to classify novel examples as positive or negative.
Like the perceptron training algorithm, it uses a linear threshold function as its hypothesis and performs incremental updates to that hypothesis; the sketch below contrasts the two update schemes.
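To make the additive versus multiplicative distinction concrete, here is a minimal sketch of the two update rules (the function names, learning rate, and multiplier value are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    # Additive: shift each weight by a step in the direction of the error.
    # y is +1 or -1; called only when the example x was misclassified.
    return w + lr * y * x

def winnow_update(w, x, y, alpha=2.0):
    # Multiplicative: rescale only the weights of active features (x_i = 1),
    # up (times alpha) on a false negative, down (times 1/alpha) on a false positive.
    return w * np.where(x == 1, alpha ** y, 1.0)
```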
THE ALGORITHM
Initialize the weights w1, ..., wn to 1.
Both winnow and the perceptron use the same classification scheme: output 1 if w.x ≥ n, and 0 otherwise.
Winnow differs from the perceptron in its update scheme.
When misclassifying a positive training example x (i.e. the prediction was negative because w.x was too small), set wi = 2wi for every feature with xi = 1 (promotion).
When misclassifying a negative training example x (i.e. the prediction was positive because w.x was too large), set wi = wi/2 for every feature with xi = 1 (demotion).
THE WINNOW ALGORITHM: SPAM EXAMPLE
Each email is represented as a Boolean vector indicating which phrases appear and which don't.
An email is SPAM if at least one of the phrases in a set S is present, i.e. the target concept is a disjunction.
SIMPLE ALGORITHM FOR LEARNING A DISJUNCTION
EXAMPLE – WINNOWING ALGORITHM
Initialize the weights w1, ..., wn = 1 on the n variables.
Given an example x = (x1, ..., xn), output 1 if w1x1 + ... + wnxn ≥ n; else output 0.
If the algorithm makes a mistake:
On a positive example (it predicts 0 when f(x) = 1): for each xi equal to 1, double the value of wi.
On a negative example (it predicts 1 when f(x) = 0): for each xi equal to 1, cut the value of wi in half.
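Putting the pieces together, here is a minimal runnable sketch of winnow learning a disjunction; the target concept and the toy spam-style data are illustrative assumptions:

```python
import numpy as np

# Winnow for learning a monotone disjunction. X holds 0/1 feature vectors
# (e.g. phrase indicators for emails), y holds 0/1 labels (1 = spam).
def winnow_train(X, y, epochs=10):
    n = X.shape[1]
    w = np.ones(n)                          # initialize w1..wn to 1
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= n else 0   # output 1 iff w.x >= n
            if pred == 0 and label == 1:
                w[x == 1] *= 2.0            # positive mistake: double active weights
            elif pred == 1 and label == 0:
                w[x == 1] /= 2.0            # negative mistake: halve active weights
    return w

# Assumed toy data: 5 phrase indicators; an email is spam iff
# phrase 0 or phrase 3 (0-indexed) appears.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = np.maximum(X[:, 0], X[:, 3])

w = winnow_train(X, y)
print("learned weights:", w)
print("training accuracy:", ((X @ w >= X.shape[1]).astype(int) == y).mean())
```

Winnow's appeal here is its mistake bound: the number of mistakes grows only logarithmically with the number of irrelevant phrases, which is why it suits problems with many irrelevant features.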
MAXIMUM ENTROPY
The principle of maximum entropy states that, subject to precisely stated prior data, the probability distribution which best represents the current state of knowledge is the one with the largest entropy.
It is commonly used in natural language processing, speech processing, and information retrieval.
What is a maximum entropy classifier?
It is a probabilistic classifier belonging to the class of exponential models. It does not assume that the features are conditionally independent of each other. Based on the principle of maximum entropy, it considers all models that fit the training data and selects the one with the largest entropy.
TESTABLE INFORMATION
A piece of information is testable if it can be determined whether a given distribution is consistent with it. For example,
E[x] = 2.87
and
p2 + p3 > 0.6
are statements of testable information.
The maximum entropy procedure consists of seeking the probability distribution which maximizes the information entropy, subject to the constraints imposed by the information.
With no testable information, entropy maximization takes place under a single constraint: the probabilities must sum to one, which yields the uniform distribution. A numerical sketch follows.
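Below is a small numerical sketch of this procedure, using the two testable statements above as constraints (the six-outcome support, the 0-based indexing of p2 and p3, and the SLSQP solver are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

values = np.arange(1, 7)                     # assumed support: outcomes 1..6

def neg_entropy(p):
    return np.sum(p * np.log(p + 1e-12))     # minimizing -H(p) maximizes entropy

constraints = [
    {"type": "eq",   "fun": lambda p: p.sum() - 1.0},      # probabilities sum to one
    {"type": "eq",   "fun": lambda p: values @ p - 2.87},  # E[x] = 2.87
    {"type": "ineq", "fun": lambda p: p[1] + p[2] - 0.6},  # p2 + p3 >= 0.6 (approximates >)
]

p0 = np.full(6, 1 / 6)                       # start from the uniform distribution
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(0, 1)] * 6, constraints=constraints)
print("maximum entropy distribution:", res.x.round(4))
```

Dropping the last two constraints recovers the uniform distribution, the maximum entropy solution under normalization alone.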
APPLICATIONS
When to use maximum entropy?
Since it makes minimal assumptions, we use it when we know nothing about the prior distribution.
It is also used when we cannot assume conditional independence of the features.
The principle of maximum entropy is commonly applied in two ways to inferential problems.
Prior probabilities: it is often used to obtain the prior probability distribution for Bayesian inference.
Maximum entropy models: it underlies model specifications widely used in natural language processing, e.g. logistic regression; a minimal sketch follows.
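As a minimal sketch of a maximum entropy model in practice, here is logistic regression with scikit-learn (the toy texts, labels, and library choice are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win money now", "cheap pills online", "meeting at noon",
         "lunch with the team", "win a free prize", "project status update"]
labels = [1, 1, 0, 0, 1, 0]                  # 1 = spam, 0 = not spam

vec = CountVectorizer()                      # bag-of-words features
X = vec.fit_transform(texts)

# Logistic regression is the standard maximum entropy classifier: among all
# distributions matching the training feature expectations, it is the one
# with the largest entropy.
clf = LogisticRegression()
clf.fit(X, labels)

print(clf.predict(vec.transform(["free money prize", "team meeting today"])))
```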