Using Decision Trees with GIS Data for Modeling and Prediction
TRANSCRIPT
Decision tree in GIS using R environment
Omar F. Althuwaynee, Ph.D.
Evaluate and compare the results of applying different decision trees algorithms, to classify and understand landslide occurrence predictors distributions, using GIS and R environment.
Course objectives
Omar F. Althuwaynee, PhD in Geomatics engineering
You should first go through the following videos on my channel regarding data preparation in ArcGIS:
1. How to process Logistic regression in GIS: Prepare binary training data by ArcGIS?
2. How to easily produce testing binary data set for prediction mapping validation?
Course preparations
1. Create dichotomous (0,1) training and testing data.
2. Effectively set up your project environment and install the packages related to the current application.
3. Read spatial data in the R environment.
4. Run statistical analysis using various decision tree algorithms.
5. Run statistical tests and produce a decision tree.
By the end of this course, you will be able to:
Decision trees are data mining approaches based on successively dividing a problem into several sub-problems of lower dimensionality, until a solution can be found for each of the simpler problems.
Decision Trees philosophy
1. Mostly use supervised learning methods.
2. Predictive, with high accuracy and stability.
3. Capable of mapping non-linear relationships.
4. Used for both classification and regression problems.
5. Easy to understand: no analytical or statistical background needed (intuitive graphical representation).
6. Useful in data exploration: finding significant variables and their relations.
7. Less data cleaning required: fairly insensitive to outliers and missing values.
8. Data type is not a constraint: handles both numerical and categorical variables.
Why use tree-based learning algorithms?
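Points 7 and 8 above can be seen directly in R. A minimal sketch using rpart (the CART implementation that ships with R) on a toy landslide-style data set — the variable names and the slope/soil rule are invented for illustration:

```r
# Toy data mixing a numerical and a categorical predictor, with one missing value.
library(rpart)

set.seed(1)
n <- 200
d <- data.frame(
  slope     = runif(n, 0, 40),                                          # numerical
  soil_type = factor(sample(c("silt", "clay", "sand"), n, replace = TRUE))  # categorical
)
# Hypothetical rule: steep slopes on silt are landslide-prone
d$landslide <- factor(ifelse(d$slope > 20 & d$soil_type == "silt", "yes", "no"))
d$slope[5] <- NA  # trees tolerate missing predictor values (surrogate splits)

fit <- rpart(landslide ~ slope + soil_type, data = d, method = "class")
print(fit)  # the split rules are readable without a statistical background
```

Both variable types are handled in one formula, and the missing slope value does not stop the fit.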
1. Categorical variable decision tree: the target variable is categorical. Example: will a student play cricket or not, i.e. YES or NO; natural hazards susceptibility, Yes = 1, No = 0.
2. Continuous variable decision tree: the target variable is continuous. Example: student age classification, Age ≤ 10 and Age > 20; earthquake intensity ≤ x1 and > x2.
Types of decision tree
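In rpart the two types correspond to the `method` argument; a sketch with invented cricket-example data (the 0.3-hours-per-year relation is made up):

```r
# The same predictor can feed either tree type, depending on the target.
library(rpart)

set.seed(42)
d <- data.frame(age = sample(5:25, 100, replace = TRUE))
d$plays_cricket <- factor(ifelse(d$age > 10, "YES", "NO"))  # categorical target
d$hours_played  <- 0.3 * d$age + rnorm(100)                 # continuous target

class_tree <- rpart(plays_cricket ~ age, data = d, method = "class")  # classification
reg_tree   <- rpart(hours_played  ~ age, data = d, method = "anova")  # regression
```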
Regression trees vs. classification trees:
• Regression trees: the dependent variable is continuous; the value of a terminal node is the mean response of its observations (the tree predicts with the mean value).
• Classification trees: the dependent variable is categorical; the value of a terminal node is the mode response of its observations (the tree predicts with the mode value).
• Similarity: both divide the predictors (independent variables) into distinct, non-overlapping boxes.
Splitting divides the predictor space into two new branches down the tree, choosing the best variable available; the process is greedy, considering only the current split and not future splits.
Splitting continues until a user-defined stopping criterion is reached.
But the fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data; therefore, we need pruning.
Regression vs. Classification
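The mean-vs-mode rule for terminal nodes can be checked on simulated data (the step function used here is invented purely to give the tree an obvious split):

```r
# A regression tree predicts the mean response in a leaf;
# a classification tree predicts the mode (majority class).
library(rpart)

set.seed(7)
d <- data.frame(x = runif(300))
d$y_num <- ifelse(d$x > 0.5, 10, 2) + rnorm(300, sd = 0.1)  # continuous target
d$y_cat <- factor(ifelse(d$x > 0.5, "yes", "no"))           # categorical target

reg_fit <- rpart(y_num ~ x, data = d, method = "anova")
cls_fit <- rpart(y_cat ~ x, data = d, method = "class")

new <- data.frame(x = 0.9)
predict(reg_fit, new)                  # mean of y_num in that leaf (close to 10)
predict(cls_fit, new, type = "class")  # mode of y_cat in that leaf ("yes")
```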
The algorithm stops when any one of the following conditions is true:
• All the samples belong to the same class.
• There are no remaining attributes on which the samples may be further partitioned.
• There are no samples for the branch test attribute.
Stopping Criteria
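In rpart, the user-defined stopping criteria are set through `rpart.control`; a sketch using the built-in iris data (the particular threshold values are arbitrary examples):

```r
# User-defined stopping criteria for tree growth.
library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,   # do not attempt a split on fewer than 20 observations
  minbucket = 7,    # every terminal node must hold at least 7 observations
  maxdepth  = 5,    # limit the depth of the tree
  cp        = 0.01  # a split must improve the fit by at least cp, else stop
)
fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
```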
Reference: https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
Pruning removes anomalies in the training data due to noise or outliers, and tackles overfitting.
• Pruned trees are smaller and less complex (the algorithm looks a few steps ahead and makes a choice).
• Two variants: pre-pruning, and post-pruning (which removes a sub-tree from a fully grown tree).
Pruning
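Post-pruning in rpart works through the complexity parameter (cp) table: grow an overgrown tree, then cut it back at the cp with the lowest cross-validated error. A sketch on the built-in iris data:

```r
# Grow a deliberately overgrown tree, then post-prune it.
library(rpart)

set.seed(3)
full <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0, minsplit = 2))  # fully grown
printcp(full)  # cross-validated error for each cp value

best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned  <- prune(full, cp = best_cp)  # removes sub-trees from the fully grown tree
```

The pruned tree has at most as many nodes as the full one, and usually generalizes better to unseen data.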
What are the various decision tree algorithms, and how do they differ from each other?
Common algorithms: C4.5, ID3, CART, CHAID, Random Forest. They differ in:
1. Classification vs. regression.
2. Numeric (continuous) or categorical targets and factor data.
3. Whether tree pruning is used.
4. Amount of memory used.
5. Amount of information and outcomes provided.
6. Required statistical background.
7. Stand-alone vs. ensemble-learning-based classifiers.
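For orientation, these algorithms map onto R packages roughly as follows (package names are the usual choices; CHAID is not on CRAN, and only rpart, which ships with R, is actually run below):

```r
# Common decision tree algorithms and their usual R implementations:
#   CART          -> rpart::rpart()                 (classification & regression, cp-based pruning)
#   C4.5 / C5.0   -> C50::C5.0()                    (categorical targets, rule-based output)
#   ID3           -> predecessor of C4.5; rarely used directly in R
#   CHAID         -> CHAID::chaid()                 (chi-squared splits, categorical data; R-Forge)
#   Random Forest -> randomForest::randomForest()   (ensemble of CART-style trees)

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")  # a stand-alone CART tree
```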
To predict whether a landslide will happen in a certain area (yes/no).
• Slope angle is a significant variable, but we do not have enough details about all the related conditions of previous events.
• Since we know there are additional important variables, we can build a decision tree to predict landslides (or any target) based on: elevation, aspect, soil type, vegetation density, and various other variables.
Case study
[Figure: a decision tree built from 200 observations (root node: Yes = 100, No = 100). The first split is on slope angle (Slope ≤ 5°: Yes = 20, No = 50; Slope > 5°: Yes = 80, No = 50); lower branches split on NDVI = 0.5, soil type = silt, aspect = NE, and elevation ≥ 300 m, ending in leaf nodes such as Yes = 60/No = 10, Yes = 20/No = 40, Yes = 20/No = 10, and Yes = 40/No = 0.]
Typical Decision Tree
• To predict the probability of whether a landslide will occur in a particular place or not.
Data:
• Dependent factor: landslide training (75 observations) and testing (25 observations) data locations.
• Independent factors: elevation, slope, NDVI, curvature.
Note:
• The analysis depends on the number of observations; more training observations will increase the model's efficiency.
Current Application
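The 75/25 workflow described above can be sketched end to end. Since the course data set is downloaded separately, simulated values stand in here; the predictor names match the slide, but the slope/NDVI rule generating the labels is invented:

```r
# Sketch of the current application: 75 training and 25 testing observations,
# four predictors, binary landslide target (simulated stand-in for the course data).
library(rpart)

set.seed(11)
sim <- data.frame(
  Elevation = runif(100, 50, 500),
  Slope     = runif(100, 0, 40),
  NDVI      = runif(100, 0, 1),
  Curvature = rnorm(100, 0, 0.2)
)
sim$Landslide <- factor(ifelse(sim$Slope > 15 & sim$NDVI < 0.5, 1, 0))
train <- sim[1:75, ]    # 75 training observations
test  <- sim[76:100, ]  # 25 testing observations

fit  <- rpart(Landslide ~ Elevation + Slope + NDVI + Curvature,
              data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$Landslide)  # simple accuracy on the held-out points
```

With the real course data, `sim` would instead come from `read.csv()` on the tables exported from ArcGIS.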
1. Prepare the GIS data.
2. Resample to the same extent and resolution.
3. Check data quality.
4. Convert into a statistical data format such as .txt or .csv.
5. Check the data in the R environment, e.g. with summary, str, head, and plot.
Data input
Testing_points  Elevation  Curvature  Slope     NDVI
1               275        -0.0625    13.55703  0.516273
0               363         0.0625    16.73342  0.469728
0               267         0.1875    13.28190  0.435414
0                92         0.1250    10.01578  0.396327
…
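The rows above, rebuilt as a data frame and inspected with the checks listed under "Data input" (in practice the full table would come from `read.csv()`):

```r
# First rows of the testing set, entered directly for illustration.
testing <- data.frame(
  Testing_points = c(1, 0, 0, 0),
  Elevation      = c(275, 363, 267, 92),
  Curvature      = c(-0.0625, 0.0625, 0.1875, 0.125),
  Slope          = c(13.55703, 16.73342, 13.2819, 10.01578),
  NDVI           = c(0.516273, 0.469728, 0.435414, 0.396327)
)
str(testing)      # structure: variable names and types
summary(testing)  # per-column summaries
head(testing)     # first rows
```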
8,399 respondents to the survey; 57% = best customers (1), 43% = other (0).
Left side (Total lifetime visits):
• Females (F) are more likely to be best customers than males (M).
• 1st row: the difference between males and females is statistically significant: (59 − 54)% = 5%, i.e. females are 5% more likely to be best customers.
• Whether 5% is significant from a business point of view is a separate question (ask a business analyst).
Right side (Net sales):
• This suggests that net sales is a stronger, more relevant predictor of best-customer status for men than total lifetime visits (which was used to analyse females).
To conclude: female behaviour and male behaviour have different nuances.
Further illustration using Categorical Target (1–0)
1. Download the current course data.
2. Open RStudio and connect to the internet.
3. Let us begin!
Happy learning..!
• https://www.tutorialspoint.com/data_mining/dm_dti.html
• http://scikit-learn.org/stable/modules/tree.html
• https://goo.gl/uk6i3x
• https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
• https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
• https://cran.r-project.org/web/views/MachineLearning.html
• http://machinelearningmastery.com/non-linear-classification-in-r-with-decision-trees/
• http://dni-institute.in/blogs/random-forest-using-r-step-by-step-tutorial/
References