World Agroforestry
Hands-on Soil Infrared Spectroscopy Training Course
Getting the best out of light
12 – 17th May 2014
R package randomForest
Erick Towett
2
Welcome
Outline
• Introduction
• Features of Random Forest (RF)
• How does RF work?
• Usage
• MIRS Random Forest prediction models for soil properties.
• Demo application of RF to MIRS calibration.
3
• A great variety of statistical procedures have been developed and tested for
building calibrations from IR spectroscopic data.
• This area includes:
• spectral pre-treatments such as derivatives or scatter correction;
• procedures used to derive a calibration from the resulting spectral
data, such as stepwise, PLS, or principal component regression,
and other methods such as neural networks (e.g. references 1 to 4).
1. Williams, P., Norris, K. (Eds.), 1987. Near-Infrared Technology in the Agricultural and Food Industries. Amer. Assoc. of Cereal Chemists, Inc., St. Paul, MN.
2. Williams, P., Norris, K. (Eds.), 2001. Near-Infrared Technology in the Agricultural and Food Industries, 2nd Edition. Amer. Assoc. of Cereal Chemists, Inc., St. Paul, MN.
3. Naes, T., Isaksson, T., Fearn, T., Davies, T., 2002. A User-Friendly Guide to Multivariate Calibration and Classification. NIR Publications, Chichester, West Sussex, UK.
4. Westerhaus, M., Workman Jr., J., Reeves III, J.B., Mark, H., 2004. Quantitative analysis. In: Roberts, C.A., Workman Jr., J., Reeves, J.B. (Eds.), Near-Infrared Spectroscopy in Agriculture. American Society of Agronomy, Madison, WI, pp. 133–174. Chapter 7.
Introduction I
4
• Minasny & McBratney (2008) examined three methods: PLS;
regression rules, which produce regression trees based on linear
regression; and TreeNet, which creates boosted regression trees.
• They concluded:
• results showed that, in comparison with PLS with spectra
pretreatment and boosted trees, the regression-rules model
provides greater accuracy, is simpler and produces
comprehensible equations, provides an optimal variable
selection, and respects the upper and lower limits of the data.
Minasny, B., McBratney, A.B., 2008. Regression rules as a tool for predicting soil properties
from infrared reflectance spectroscopy. Chemometrics and Intelligent Laboratory Systems
94, 72–79.
Introduction I
5
• randomForest (RF) implements Breiman’s random forest
algorithm for classification and regression based on a forest of
trees using random inputs.
• Version: 5.1
• Depends: R (>= 2.5.0)
• Description: Classification and regression based on a forest of
trees using random inputs.
• URL: http://stat-www.berkeley.edu/users/breiman/RandomForests
• Reference: Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
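To make "a forest of trees using random inputs" concrete, here is a minimal sketch in plain Python. It is not the randomForest package or its API: the function names are hypothetical, the data are synthetic, and each "tree" is reduced to a single split (a stump) so the whole thing stays short. Each stump sees its own bootstrap sample and a random subset of the predictors, and the forest prediction is the average over stumps.

```python
import random

def fit_stump(rows, targets, features):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to predicting the mean
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean

def fit_forest(rows, targets, n_trees, n_try, rng):
    """Each stump gets a bootstrap sample and a random subset of n_try predictors."""
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        feats = rng.sample(range(p), n_try)          # the "random inputs"
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Synthetic data (an assumption for illustration): only predictor 0 matters.
rng = random.Random(0)
rows = [[rng.random() for _ in range(3)] for _ in range(200)]
targets = [10.0 if r[0] > 0.5 else 0.0 for r in rows]
forest = fit_forest(rows, targets, n_trees=50, n_try=2, rng=rng)
high = predict_forest(forest, [0.9, 0.5, 0.5])
low = predict_forest(forest, [0.1, 0.5, 0.5])
print(high > low)
```

A real random forest grows deep trees and draws a fresh random subset of predictors at every split; with one split per tree the two schemes coincide, which is why stumps are used here.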
Introduction: RF
6
• RF is fast and easy to implement, and produces highly accurate
predictions.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable
deletion and without overfitting.
• It gives estimates of variable importance in the classification.
• RF handles complex data types well.
• It obviates the need to transform predictors to
approximate normal distributions.
• Generated forests can be saved for future use on other data.
Features of Random Forests
7
What are the challenges of RF?
• There are many possible alternative nodes;
• Reseeding (changing the random seed) will give different models.
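The reseeding point can be illustrated with the bootstrap step alone. The sketch below uses plain Python and a hypothetical helper name; it only shows that the random draws, and therefore any model built on them, are reproducible when the seed is fixed. In R the corresponding step is calling set.seed() before fitting.

```python
import random

def bootstrap_indices(n_samples, seed):
    """Draw n_samples row indices with replacement, using a fixed seed."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]

first = bootstrap_indices(10, seed=1)
repeat = bootstrap_indices(10, seed=1)
reseeded = bootstrap_indices(10, seed=2)

print(first == repeat)    # same seed: identical bootstrap sample
print(first == reseeded)  # different seed: almost certainly a different sample
```

Recording the seed alongside a saved model is a cheap way to make a calibration reproducible.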
How does RF work?
• The out-of-bag (OOB) error estimate
In RF, each tree is constructed using a different bootstrap sample from the
original data.
About 1/3 of the cases are left out of each bootstrap sample and are not used
in the construction of the kth tree.
These left-out (OOB) data are used to get a running unbiased estimate of the
classification error as trees are added to the forest.
They are also used to get estimates of variable importance.
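The "about 1/3" figure follows from sampling with replacement: the chance that a given case is never drawn in n draws is (1 - 1/n)^n, which approaches e^-1 ≈ 0.368 as n grows. A small simulation in plain Python (hypothetical function name, arbitrary sample size) checks this:

```python
import random

def oob_fraction(n_samples, rng):
    """Fraction of cases that never appear in one bootstrap sample."""
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples

rng = random.Random(0)
fractions = [oob_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
expected = (1 - 1 / 1000) ** 1000  # close to exp(-1)

print(round(mean_oob, 3), round(expected, 3))
```

Because every case is out-of-bag for roughly a third of the trees, each case can be predicted by the trees that never saw it, which is what makes the OOB error behave like a built-in validation.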
Features of RF
8
• RF can output a list of predictor variables that are important in
predicting the outcome.
• The randomForest package in R has two measures of importance.
One is the "total decrease in node impurities from splitting on the variable,
averaged over all trees". The other is based on a permutation test.
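The permutation idea can be sketched independently of any forest: shuffle one predictor, leave everything else alone, and see how much the prediction error grows. The example below is plain Python with a hypothetical fixed toy model and synthetic data; it is not the package's implementation, which (as far as I know) applies the permutation to the out-of-bag cases of each tree.

```python
import random

def mse(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, column, rng):
    """Increase in MSE after shuffling the values of one predictor column."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - baseline

def toy_model(row):
    # Hypothetical stand-in for a fitted model: uses predictor 0 only.
    return 3.0 * row[0]

rng = random.Random(42)
rows = [[rng.random(), rng.random()] for _ in range(500)]
targets = [3.0 * r[0] for r in rows]

imp0 = permutation_importance(toy_model, rows, targets, 0, rng)
imp1 = permutation_importance(toy_model, rows, targets, 1, rng)
print(imp0 > imp1)  # the predictor the model relies on matters more
```

A predictor the model ignores gets an importance of zero, since shuffling it cannot change any prediction.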
How does RF work?
9
Ongoing:
• Analysis of MIRS random forest prediction models for soil
properties.
An attempt to offer an in-depth analysis of random forest models for the
prediction of a number of soil properties using MIR spectroscopy.
Usage
10
Materials and Methods
• 1907 soil samples were scanned with an MIR spectrometer at a resolution
of 4 cm⁻¹.
• The 1st derivative of the spectra over the range 601.7–4001.6 cm⁻¹ was calculated
with a smoothing interval of 21 data points using the soil.spec package in R.
• RF models with OOB validation were built to predict the reference properties
from the MIRS 1st-derivative spectra using the entire data set.
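The derivative step can be sketched as a local least-squares slope over a moving 21-point window. This is an assumption made for illustration: the slides do not say which filter soil.spec applies, and the function name and test spectrum below are hypothetical.

```python
def smoothed_first_derivative(values, window=21, spacing=1.0):
    """Slope of a straight line fitted by least squares to each window of points.

    Returns len(values) - (window - 1) slopes; the edges are dropped.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    half = window // 2
    offsets = range(-half, half + 1)
    denom = spacing * sum(k * k for k in offsets)
    slopes = []
    for i in range(half, len(values) - half):
        numer = sum(k * values[i + k] for k in offsets)
        slopes.append(numer / denom)
    return slopes

# Hypothetical "spectrum": a straight line with slope 2, so the result is easy to check.
spectrum = [2 * x + 1 for x in range(30)]
derivative = smoothed_first_derivative(spectrum, window=21)
print(len(derivative), derivative[:3])  # 10 [2.0, 2.0, 2.0]
```

A wider window smooths more noise but also flattens narrow absorption features, so the window length is a tuning choice rather than a fixed rule.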
11
Preliminary Results
[Figures: band plot of raw spectra; band plot of first-derivative spectra (axes: response, wavenumber); importance plot]
Soil organic carbon: MIR-predicted vs. reference values for the AfSIS baseline, fitted using Random Forests; (a) calibration model results and (b) out-of-bag validation results. Validation samples lying far from the 1:1 line indicate soil types for which more samples need to be added to the calibration library.
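Agreement with the 1:1 line in a predicted-vs-reference plot is commonly summarised with RMSE, bias, and R². The sketch below is plain Python with a hypothetical function name and made-up numbers; it does not reproduce any result from the course data.

```python
import math

def one_to_one_stats(reference, predicted):
    """RMSE, mean bias, and R^2 of predictions relative to the 1:1 line."""
    n = len(reference)
    residuals = [p - r for r, p in zip(reference, predicted)]
    ss_res = sum(e * e for e in residuals)
    rmse = math.sqrt(ss_res / n)
    bias = sum(residuals) / n
    mean_ref = sum(reference) / n
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    r2 = 1 - ss_res / ss_tot
    return rmse, bias, r2

# Made-up values: every prediction is 0.5 too high.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
rmse, bias, r2 = one_to_one_stats(reference, predicted)
print(round(rmse, 3), round(bias, 3), round(r2, 3))  # 0.5 0.5 0.8
```

Samples with large individual residuals are the ones that would appear far from the 1:1 line in panel (b).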
12
Demo:
R package randomForest
13
R package randomForest
Thank you for your attention