Using Decision Trees with GIS Data for Modeling and Prediction
TRANSCRIPT
Decision tree in GIS using R environment
Omar F. Althuwaynee, Ph.D.
Evaluate and compare the results of applying different decision trees algorithms, to classify and understand landslide occurrence predictors distributions, using GIS and R environment.
Course objectives
Omar F. Althuwaynee, PhD in Geomatics engineering
You should first go through the following videos on my channel regarding data preparation in ArcGIS:
1. How to process Logistic regression in GIS: Prepare binary training data by ArcGIS?
2. How to easily produce testing binary data set for prediction mapping validation?
Course preparations
1. Create dichotomous (0,1) training and testing data.
2. Effectively set up your project environment and install the packages related to the current application.
3. Read spatial data in the R environment.
4. Run statistical analysis using various decision tree algorithms.
5. Run statistical tests and produce a decision tree.
By the end of this course, you will be able to:
Decision trees are data mining approaches based on successively dividing a problem into several sub-problems of lower dimensionality, until a solution can be found for each of the simpler problems.
Decision Trees philosophy
1. Mostly use supervised learning methods.
2. Predictive, with high accuracy and stability.
3. Capable of mapping non-linear relationships.
4. Used for both classification and regression problems.
5. Easy to understand: no analytical or statistical background needed (intuitive graphical representation).
6. Useful in data exploration: finding significant variables and their relations.
7. Less data cleaning required: fairly insensitive to outliers and missing values.
8. Data type is not a constraint: handles both numerical and categorical variables.
Why use tree-based learning algorithms?
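Points 7 and 8 above can be seen directly in R. A minimal sketch using rpart (the CART implementation that ships with R) on a toy landslide-style data set — the variable names and the slope/soil rule are invented for illustration:

```r
# Toy data mixing a numerical and a categorical predictor, with one missing value.
library(rpart)

set.seed(1)
n <- 200
d <- data.frame(
  slope     = runif(n, 0, 40),                                          # numerical
  soil_type = factor(sample(c("silt", "clay", "sand"), n, replace = TRUE))  # categorical
)
# Hypothetical rule: steep slopes on silt are landslide-prone
d$landslide <- factor(ifelse(d$slope > 20 & d$soil_type == "silt", "yes", "no"))
d$slope[5] <- NA  # trees tolerate missing predictor values (surrogate splits)

fit <- rpart(landslide ~ slope + soil_type, data = d, method = "class")
print(fit)  # the split rules are readable without a statistical background
```

Both variable types are handled in one formula, and the missing slope value does not stop the fit.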
1. Categorical variable decision tree: the target variable is categorical. Example: will a student play cricket or not, i.e. YES or NO; natural hazards susceptibility, Yes = 1, No = 0.
2. Continuous variable decision tree: the target variable is continuous. Example: student age classification, Age ≤ 10 and Age > 20; earthquake intensity ≤ x1 and > x2.
Types of decision tree
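In rpart the two types correspond to the `method` argument; a sketch with invented cricket-example data (the 0.3-hours-per-year relation is made up):

```r
# The same predictor can feed either tree type, depending on the target.
library(rpart)

set.seed(42)
d <- data.frame(age = sample(5:25, 100, replace = TRUE))
d$plays_cricket <- factor(ifelse(d$age > 10, "YES", "NO"))  # categorical target
d$hours_played  <- 0.3 * d$age + rnorm(100)                 # continuous target

class_tree <- rpart(plays_cricket ~ age, data = d, method = "class")  # classification
reg_tree   <- rpart(hours_played  ~ age, data = d, method = "anova")  # regression
```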
Regression trees vs. classification trees:
• Regression trees: the dependent variable is continuous; the value of a terminal node is the mean response of its observations (the tree predicts with the mean value).
• Classification trees: the dependent variable is categorical; the value of a terminal node is the mode response of its observations (the tree predicts with the mode value).
• Similarity: both divide the predictors (independent variables) into distinct, non-overlapping boxes.
Splitting divides the predictor space into two new branches down the tree, choosing the best variable available; the process is greedy, considering only the current split and not future splits.
Splitting continues until a user-defined stopping criterion is reached.
But the fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data; therefore, we need pruning.
Regression vs. Classification
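The mean-vs-mode rule for terminal nodes can be checked on simulated data (the step function used here is invented purely to give the tree an obvious split):

```r
# A regression tree predicts the mean response in a leaf;
# a classification tree predicts the mode (majority class).
library(rpart)

set.seed(7)
d <- data.frame(x = runif(300))
d$y_num <- ifelse(d$x > 0.5, 10, 2) + rnorm(300, sd = 0.1)  # continuous target
d$y_cat <- factor(ifelse(d$x > 0.5, "yes", "no"))           # categorical target

reg_fit <- rpart(y_num ~ x, data = d, method = "anova")
cls_fit <- rpart(y_cat ~ x, data = d, method = "class")

new <- data.frame(x = 0.9)
predict(reg_fit, new)                  # mean of y_num in that leaf (close to 10)
predict(cls_fit, new, type = "class")  # mode of y_cat in that leaf ("yes")
```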
The algorithm stops when any one of the following conditions is true:
• All the samples belong to the same class.
• There are no remaining attributes on which the samples may be further partitioned.
• There are no samples for the branch test attribute.
Stopping Criteria
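In rpart, the user-defined stopping criteria are set through `rpart.control`; a sketch using the built-in iris data (the particular threshold values are arbitrary examples):

```r
# User-defined stopping criteria for tree growth.
library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,   # do not attempt a split on fewer than 20 observations
  minbucket = 7,    # every terminal node must hold at least 7 observations
  maxdepth  = 5,    # limit the depth of the tree
  cp        = 0.01  # a split must improve the fit by at least cp, else stop
)
fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
```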
Reference: https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
Pruning removes anomalies in the training data due to noise or outliers, and tackles overfitting.
• Pruned trees are smaller and less complex (the algorithm looks a few steps ahead and makes a choice).
• Two variants: pre-pruning, and post-pruning (which removes a sub-tree from a fully grown tree).
Pruning
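Post-pruning in rpart works through the complexity parameter (cp) table: grow an overgrown tree, then cut it back at the cp with the lowest cross-validated error. A sketch on the built-in iris data:

```r
# Grow a deliberately overgrown tree, then post-prune it.
library(rpart)

set.seed(3)
full <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0, minsplit = 2))  # fully grown
printcp(full)  # cross-validated error for each cp value

best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned  <- prune(full, cp = best_cp)  # removes sub-trees from the fully grown tree
```

The pruned tree has at most as many nodes as the full one, and usually generalizes better to unseen data.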
What are the various decision tree algorithms, and how do they differ from each other?
Common algorithms: C4.5, ID3, CART, CHAID, Random Forest. They differ in:
1. Classification vs. regression.
2. Numeric (continuous) or categorical targets and factor data.
3. Whether tree pruning is used.
4. Amount of memory used.
5. Amount of information and outcomes provided.
6. Required statistical background.
7. Stand-alone vs. ensemble-learning-based classifiers.
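For orientation, these algorithms map onto R packages roughly as follows (package names are the usual choices; CHAID is not on CRAN, and only rpart, which ships with R, is actually run below):

```r
# Common decision tree algorithms and their usual R implementations:
#   CART          -> rpart::rpart()                 (classification & regression, cp-based pruning)
#   C4.5 / C5.0   -> C50::C5.0()                    (categorical targets, rule-based output)
#   ID3           -> predecessor of C4.5; rarely used directly in R
#   CHAID         -> CHAID::chaid()                 (chi-squared splits, categorical data; R-Forge)
#   Random Forest -> randomForest::randomForest()   (ensemble of CART-style trees)

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")  # a stand-alone CART tree
```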
To predict whether a landslide will happen in a certain area (yes/no).
• Slope angle is a significant variable, but we do not have enough details about all the related conditions of previous events.
• Since we know there are additional important variables, we can build a decision tree to predict landslides (or any target) based on: elevation, aspect, soil type, vegetation density, and various other variables.
Case study
[Figure: a decision tree built from 200 observations (root node: Yes = 100, No = 100). The first split is on slope angle (Slope ≤ 5°: Yes = 20, No = 50; Slope > 5°: Yes = 80, No = 50); lower branches split on NDVI = 0.5, soil type = silt, aspect = NE, and elevation ≥ 300 m, ending in leaf nodes such as Yes = 60/No = 10, Yes = 20/No = 40, Yes = 20/No = 10, and Yes = 40/No = 0.]
Typical Decision Tree
• To predict the probability of whether a landslide will occur in a particular place or not.
Data:
• Dependent factor: landslide training (75 observations) and testing (25 observations) data locations.
• Independent factors: elevation, slope, NDVI, curvature.
Note:
• The analysis depends on the number of observations; more training observations will increase the model's efficiency.
Current Application
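The 75/25 workflow described above can be sketched end to end. Since the course data set is downloaded separately, simulated values stand in here; the predictor names match the slide, but the slope/NDVI rule generating the labels is invented:

```r
# Sketch of the current application: 75 training and 25 testing observations,
# four predictors, binary landslide target (simulated stand-in for the course data).
library(rpart)

set.seed(11)
sim <- data.frame(
  Elevation = runif(100, 50, 500),
  Slope     = runif(100, 0, 40),
  NDVI      = runif(100, 0, 1),
  Curvature = rnorm(100, 0, 0.2)
)
sim$Landslide <- factor(ifelse(sim$Slope > 15 & sim$NDVI < 0.5, 1, 0))
train <- sim[1:75, ]    # 75 training observations
test  <- sim[76:100, ]  # 25 testing observations

fit  <- rpart(Landslide ~ Elevation + Slope + NDVI + Curvature,
              data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$Landslide)  # simple accuracy on the held-out points
```

With the real course data, `sim` would instead come from `read.csv()` on the tables exported from ArcGIS.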
1. Prepare the GIS data.
2. Resample to the same extent and resolution.
3. Check data quality.
4. Convert into a statistical data format such as .txt or .csv.
5. Check the data in the R environment, e.g. with summary, str, head, and plot.
Data input
Testing_points  Elevation  Curvature  Slope     NDVI
1               275        -0.0625    13.55703  0.516273
0               363         0.0625    16.73342  0.469728
0               267         0.1875    13.28190  0.435414
0                92         0.1250    10.01578  0.396327
…
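The rows above, rebuilt as a data frame and inspected with the checks listed under "Data input" (in practice the full table would come from `read.csv()`):

```r
# First rows of the testing set, entered directly for illustration.
testing <- data.frame(
  Testing_points = c(1, 0, 0, 0),
  Elevation      = c(275, 363, 267, 92),
  Curvature      = c(-0.0625, 0.0625, 0.1875, 0.125),
  Slope          = c(13.55703, 16.73342, 13.2819, 10.01578),
  NDVI           = c(0.516273, 0.469728, 0.435414, 0.396327)
)
str(testing)      # structure: variable names and types
summary(testing)  # per-column summaries
head(testing)     # first rows
```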
8,399 respondents to the survey; 57% = best customers (1), 43% = other (0).
Left side (Total lifetime visits):
• Females (F) are more likely to be best customers than males (M).
• 1st row: the difference between males and females is statistically significant: (59 − 54)% = 5%, i.e. females are 5% more likely to be best customers.
• Whether 5% is significant from a business point of view is a separate question (ask a business analyst).
Right side (Net sales):
• This suggests that net sales is a stronger, more relevant predictor of best-customer status for men than total lifetime visits (which was used to analyse females).
To conclude: female behaviour and male behaviour have different nuances.
Further illustration using Categorical Target (1–0)
1. Download the current course data.
2. Open RStudio and connect to the internet.
3. Let us begin!
Happy learning..!
• https://www.tutorialspoint.com/data_mining/dm_dti.html
• http://scikit-learn.org/stable/modules/tree.html
• https://goo.gl/uk6i3x
• https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
• https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
• https://cran.r-project.org/web/views/MachineLearning.html
• http://machinelearningmastery.com/non-linear-classification-in-r-with-decision-trees/
• http://dni-institute.in/blogs/random-forest-using-r-step-by-step-tutorial/
References