TRANSCRIPT
4. MODELS FROM DATA
Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in data bases (KDD)
2. Data mining
• Data
• Patterns
• Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications:
• Equations
• Decision trees
• Rules
Knowledge discovery in data bases (KDD)
What is KDD?
Frawley et al., 1991: “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
How to find patterns in data?
Data mining (DM) – the central step in the KDD process, concerned with applying computational techniques to actually find patterns in the data.
- step 1: preparing data for DM (data preprocessing)
- step 2: data mining itself
- step 3: evaluating the discovered patterns (results of DM)
Knowledge discovery in data bases (KDD)
When can patterns be treated as knowledge?
Frawley et al. (1991): “A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge.”
Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).
Condition 2: The patterns should potentially lead to some useful actions (according to user-defined utility criteria).
Knowledge discovery in data bases (KDD)
What may KDD contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology, …)?
ES deal with complex, unpredictable natural systems (e.g. arable, forest and water ecosystems) in order to answer complex questions.
KDD was purposively designed to cope with such complex questions about complex systems, like:
- understanding the domain/system studied (e.g., gene flow, seed bank, life cycle, …)
- predicting future values of system variables of interest (e.g., rate of out-crossing with GM plants at location x at time y, seed bank dynamics, …)
- decision support for management (e.g., how to achieve co-existence with GM crops)
Data mining (DM)
What is data mining?
Data mining is the process of automatically searching large volumes of data for patterns using algorithms.
The most relevant notions of data mining:
1. Data
2. Patterns
3. Data mining algorithms
Data mining (DM) - data
1. What is data?
According to Fayyad et al. (1996): “Data is a set of facts, e.g., cases in a database.”
Data in DM is given in a single flat table:
- rows: objects or records (examples in ML)
- columns: properties of objects (attributes, features in ML)
which is then used as input to a data mining algorithm.

Properties of objects (columns), for a set of objects (rows):

Distance (m)   Wind direction (°)   Wind speed (m/s)   Out-crossing rate (%)
10             123                  3                  8
12             88                   4                  7
14             121                  6                  3
18             147                  2                  4
20             93                   1                  5
22             115                  3                  1
…              …                    …                  …
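The flat-table form above can be sketched in plain Python; a minimal sketch using only the standard library, with column names invented for illustration:

```python
import csv
import io

# The flat table from the slide, transcribed as comma-separated text.
raw = """distance_m,wind_direction_deg,wind_speed_ms,outcrossing_rate_pct
10,123,3,8
12,88,4,7
14,121,6,3
18,147,2,4
20,93,1,5
22,115,3,1
"""

# Rows are objects, columns are properties of objects; the last column
# is the target variable a data mining algorithm would try to predict.
rows = list(csv.DictReader(io.StringIO(raw)))
X = [{k: float(v) for k, v in r.items() if k != "outcrossing_rate_pct"}
     for r in rows]
y = [float(r["outcrossing_rate_pct"]) for r in rows]
print(len(rows), len(X[0]))   # -> 6 3
```

This objects-by-properties split into inputs X and target y is the usual input form for the algorithms discussed later.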
Data mining (DM) - pattern
2. What is a pattern?
A pattern is defined as: ”A statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset” (Frawley et al. 1991, Fayyad et al. 1996).
Classes of patterns considered in DM (depend on the data mining task at hand):
1. equations,
2. decision trees,
3. association, classification, and regression rules.
Data mining (DM) - pattern
1. Equations
To predict the value of a target (dependent) variable as a linear or non-linear combination of the input (independent) variables.
Linear equations involving:
- two variables: straight lines in a two-dimensional space
- three variables: planes in a three-dimensional space
- more variables: hyper-planes in multidimensional spaces
Nonlinear equations involving:
- two variables: curves in a two-dimensional space
- three variables: surfaces in a three-dimensional space
- more variables: hyper-surfaces in multidimensional spaces
Data mining (DM) - pattern
2. Decision trees
To predict the value of one target dependent variable (the class) from the values of other independent variables (attributes) by a decision tree.
A decision tree is a hierarchical structure, where:
- each internal node contains a test on an attribute,
- each branch corresponds to an outcome of the test,
- each leaf gives a prediction for the value of the class variable.
Data mining (DM) - pattern
A decision tree is called:
A classification tree: the class value in a leaf is discrete (finite set of nominal values), e.g., (yes, no), (spec. A, spec. B, …)
A regression tree: the class value in a leaf is a constant (infinite set of values), e.g., 120, 220, 312, …
A model tree: a leaf contains a linear model predicting the class value (piece-wise linear function), e.g.:
out-crossing rate = 12.3 distance - 0.123 wind speed + 0.00123 wind direction
Data mining (DM) - pattern
3. Rules
To describe relationships between attributes, expressed as rules discovered by rule induction.
The rule denotes patterns of the form: “IF conjunction of conditions THEN conclusion.”
For classification rules, the conclusion assigns one of the possible discrete values to the class (finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, spec. D)
For predictive rules, the conclusion gives a prediction for the value of the target (class) variable (infinite set of values): e.g., 120, 220, 312, …
Data mining (DM) - algorithm
3. What is a data mining algorithm?
An algorithm in general: a procedure (a finite set of well-defined instructions) for accomplishing some task, which will terminate in a defined end-state.
A data mining algorithm: a computational process defined by a Turing machine (Gurevich et al. 2000) for finding patterns in data.
Data mining (DM) - algorithm
Which algorithms do we use for discovering patterns?
It depends on the goals:
1. Equations → linear and multiple regression
2. Decision trees → top-down induction of decision trees
3. Rules → rule induction
Data mining (DM) - algorithm
1. Linear and multiple regression
• Bivariate linear regression: the predicted variable (C, the class in ML; may be continuous or discrete) can be expressed as a linear function of one attribute (A):
C = α + β×A
• Multiple regression: the predicted variable can be expressed as a linear function of a multi-dimensional attribute vector (Ai):
C = Σ(i=1..n) βi×Ai
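A minimal multiple-regression sketch, assuming NumPy is available; the data and coefficients are synthetic, chosen only to illustrate the fit:

```python
import numpy as np

# Synthetic, noise-free data following C = 2*A1 - 0.5*A2 + 1
# (hypothetical coefficients, for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))               # attribute vectors Ai
C = 2.0 * A[:, 0] - 0.5 * A[:, 1] + 1.0     # target variable

# Append a constant column so the intercept is estimated as well.
X = np.column_stack([A, np.ones(len(A))])
beta, *_ = np.linalg.lstsq(X, C, rcond=None)

print(np.round(beta, 3))   # approximately [2, -0.5, 1]
```

Least squares recovers the generating coefficients exactly here because the data carry no noise; with real measurements the fit is only approximate.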
Data mining (DM) - algorithm
2. Top-down induction of decision trees
A decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986).
Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree, and the current training set is split into subsets according to the values of the selected attribute.
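The recursive scheme above can be sketched as a simplified toy TDIDT: majority-class leaves, no pruning, and a naive attribute-selection step (the first remaining attribute) where real TDIDT would use, e.g., information gain. The weather-style data are hypothetical:

```python
from collections import Counter

def tdidt(examples, attributes):
    # examples: list of (attribute-dict, class) pairs.
    classes = [c for _, c in examples]
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]   # leaf: majority class
    attr = attributes[0]                               # naive selection
    tree = {"test": attr, "branches": {}}
    for value in {ex[attr] for ex, _ in examples}:
        subset = [(ex, c) for ex, c in examples if ex[attr] == value]
        tree["branches"][value] = tdidt(subset, attributes[1:])
    return tree

def predict(tree, example):
    # Follow tests from the root until a leaf is reached.
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["test"]]]
    return tree

data = [
    ({"wind": "low", "rain": "no"}, "yes"),
    ({"wind": "low", "rain": "yes"}, "yes"),
    ({"wind": "high", "rain": "no"}, "no"),
    ({"wind": "high", "rain": "yes"}, "no"),
]
tree = tdidt(data, ["wind", "rain"])
print(predict(tree, {"wind": "low", "rain": "no"}))   # -> yes
```

Each recursive call corresponds to one split of the current training set on the selected attribute, exactly as described above.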
Data mining (DM) - algorithm
3. Rule induction
A rule that correctly classifies some examples is constructed first.
The positive examples covered by the rule from the training set are removed and the process is repeated until no more examples remain.
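The covering strategy above can be sketched as a minimal sequential-covering loop; this is a simplified illustration, not the actual rule-induction system used in the case studies, and the data and attribute names are hypothetical:

```python
def covers(rule, example):
    # A rule is a list of (attribute, value) conditions, all of which must hold.
    return all(example.get(a) == v for a, v in rule)

def learn_one_rule(examples, target="yes"):
    # Greedily add the condition that best purifies the covered set.
    rule, current = [], examples
    while any(c != target for _, c in current):
        best = None
        for ex, _ in current:
            for cond in ex.items():
                covered = [(e, c) for e, c in current if covers([cond], e)]
                pos = sum(c == target for _, c in covered)
                if covered and (best is None or pos / len(covered) > best[0]):
                    best = (pos / len(covered), cond, covered)
        rule.append(best[1])
        current = best[2]
    return rule

def sequential_covering(examples, target="yes"):
    rules = []
    while any(c == target for _, c in examples):
        rule = learn_one_rule(examples, target)
        rules.append(rule)
        # Remove the positive examples the new rule covers, then repeat.
        examples = [(e, c) for e, c in examples
                    if not (covers(rule, e) and c == target)]
    return rules

data = [
    ({"wind": "low", "rain": "no"}, "yes"),
    ({"wind": "low", "rain": "yes"}, "yes"),
    ({"wind": "high", "rain": "no"}, "no"),
]
rules = sequential_covering(data)
print(rules)   # -> [[('wind', 'low')]]
```

The removal of covered positives before learning the next rule is exactly the step described in the text.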
DATA MINING – CASE STUDIES
Practical implementations
Each class of described patterns is illustrated with examples of applications:
1. Equations:
• Algebraic equations
• Differential equations
2. Decision trees:
• Classification trees
• Regression trees
• Model trees
3. Predictive rules
Applications – Algebraic equations
Algebraic equations: CIPER
Applications – Algebraic equations
Dataset
Hydrological conditions (HMS Lendava; monthly data on minimal, average and maximum values):
- Ledava River levels
- groundwater levels
Meteorological conditions (monthly data, HMS Lendava):
- time of solar radiation (h)
- precipitation (mm)
- ET (mm)
- number of days with white frost
- number of days with snow
- T: max, average, min
- cumulative T > 0 °C, > 5 °C, and > 10 °C
- number of days with: minT > 0 °C, minT < -10 °C, minT < -4 °C, minT > 25 °C, maxT > 10 °C, maxT > 25 °C
Measured radial increments:
- 8 trees, 69 years old
Management data (thinning; m³/y removed from the stand; Forestry Unit Lendava)
• Monthly data + aggregated data (AMJ, MJJ, JJA, MJJA etc.)
• Σ: 333 attributes; 35 years
Materials and methods
Applications – Algebraic equations
• 52 different combinations of attributes were tested. Σ: 124 models

Experiment   RRSE      # eq. elements
jnj3_2m      0,7282    6
jnj3_3s      0,7599    6
jnj3_1s      0,7614    6
jnj3_4m      0,76455   3
jnj2_2       0,7685    5
jly_4xl      0,7686    6
Applications – Algebraic equations
Model jnj3_2m:

RadialGrowthIncrement =
  - 0.0511025526922 minL8-10
  - 0.0291795197998 maxL8-10
  - 0.017479975134 t-sun4-7
  + 0.0346935385853 t-sun8-10
  - 1.950606536e-05 t-sun8-10^2
  - 2.01014710248 d-wf-4-7
  + 9.35586778387e-05 minL4-7 · t-sun4-7
  - 0.000179339939732 minL4-7 · t-sun8-10
  + 6.45688563611e-05 minL8-10 · t-sun8-10
  + 3.06551434164e-05 maxL8-10 · t-sun4-7
  + 0.00282485442386 t-sun4-7 · d-wf-4-7
  - 0.00141078675225 t-sun8-10 · d-wf-4-7
  + 7.91071710872

Relative Root Squared Error = 0.728229824611
Correlation between average measured (r-aver8) and modeled increments (linear regression): R² = 0.8771
Applications – Algebraic equations
Model jnj3_2m
Applications – Algebraic equations
Algebraic equations: Lagramge
Applications – Algebraic equations
Data source:
• Federal Biological Research Centre (BBA), Braunschweig, Germany (2000, 2001)
• Slovenian Agricultural Institute (KIS), Slovenia (2006)
Plants involved:
• BBA: transgenic maize (var. “Acrobat”, glufosinate-tolerant line) – donor; non-transgenic maize field (var. “Anjou”) – receptor
• KIS: yellow-kernel variety of maize (hybrid Bc462, simulating a transgenic maize variety) – donor; white-kernel variety of maize (variety Bc38W, simulating a non-GM variety) – receptor
[Field design 2000 figure: transgenic donor field and non-transgenic recipient field (220 m × 100 m), 2 m access paths, sampling points on an a–q × 1–6 coordinate system at 2, 3, 4.5, 7.5, 13.5, 25.5 and 49.5 m, direction of drilling, north arrow]
Experiment design: 96 points; 60 cobs – 2500 kernels; % of outcrossing
Receptors – non-transgenic corn; Donors – GMO corn
Measured: temperature, relative humidity, wind velocity, wind direction
Applications – Algebraic equations
Applications – Algebraic equations: outcrossing rate
Selected attributes:
• % of outcrossing
• Cardinal direction of the sampling point from the center of the donor field
• Visual angle between the sampling point and the donor field
• Distance between the point and the center of the donor field
• The shortest distance between the sampling point and the donor field
• % of appropriate wind direction (exposure time)
• Length of the wind ventilation route
• Wind velocity
Applications – Differential equations
Differential equations: Lagramge
[Aquatic ecosystem model diagrams]
Phosphorus: water in-flow and out-flow; growth; respiration
Phytoplankton: growth; respiration; sedimentation; grazing
Zooplankton: feeds on phytoplankton; respiration; mortality
Applications – Classification trees: habitat models
Classification trees: J48
Applications – Classification trees: habitat models
Population growth
Applications – Classification trees: habitat models
Observed locations of BBs
Applications – Classification trees: habitat models
The training dataset
• Positive examples:
- locations of bear sightings (Hunting association; telemetry)
- females only
- using home-range (HR) areas instead of “raw” locations
- narrower HR for optimal habitat, wider for maximal
• Negative examples:
- sampled from the unsuitable part of the study area
- stratified random sampling
- different land cover types equally accounted for
Applications – Classification trees: habitat models
Dataset: [screenshot of the flat data table; one comma-separated row of attribute values per location]
Present: 1
Absent: 0
Applications – Classification trees: habitat models
The model for optimal habitat
Applications – Classification trees: habitat models
The model for maximal habitat
Applications – Classification trees: habitat models
Map of optimal habitat (13% of SLO territory)
Applications – Classification trees: habitat models
Map of maximal habitat (39% SLO territory)
Applications – Multi-target classification : outcrossing rate
Multi-target classification model (Clus):
Modelling pollen dispersal of genetically modified oilseed rape within the field
Marko Debeljak, Claire Lavigne, Sašo Džeroski, Damjan Demšar
In: 90th ESA Annual Meeting [jointly with the] IX International Congress of Ecology, August 7-12, 2005, Montréal, Canada. Abstracts. [S.l.]: ESA, 2005, p. 152.
Experiment design:
Donors: MF transgenic oilseed rape “B004.oxy” (10×10 m)
Field for receptors (90×90 m)
3×3 m grid = 841 nodes
10 seeds of MS oilseed rape “FU58B004” planted per node
Field planted with MF oilseed rape “B004”
% MS outcrossing; % MF outcrossing
Applications – Multi-target classification : outcrossing rate
Selected attributes for modelling:
• Rate of outcrossing of MS and MF receptor plants [rate per 1000]
• Cardinal direction of the sampling point from the center of the donor field [rad]
• Visual angle between the sampling point and the donor field [rad]
• Distance between the point and the center of the donor field [m]
• The shortest distance between the sampling point and the donor field [m]
Applications – Multi-target classification: outcrossing rate
• Number of examples: 817
• Correlation coefficient: MF: 0.821; MS: 0.846
Applications – Multi-target classification: outcrossing rate
Applications – Regression and Model trees: outcrossing rate
Model trees – M5’
Modelling outcrossing of transgenes in maize between neighboring maize fields.
DEBELJAK Marko, DEMŠAR Damjan, DŽEROSKI Sašo, SCHIEMANN Joachim
In: HŘEBIČEK, Jiří, RÁČEK, Jaroslav (Eds.). Networking environmental information: EnviroInfo Brno 2005: proceedings of the 19th International Conference Informatics for Environmental Protection, September 7-9, 2005, Brno, Czech Republic. Brno: Masaryk University, 2005, pp. 610-614.
Data source: Federal Biological Research Centre (BBA), Braunschweig, Germany
Project title: Outcrossing of transgenes in maize between neighboring maize fields
Plants involved: corn (Zea mays) vs. GMO Corn
Brief description: Correlation of impact factors and outcrossing frequency in maize fields
Applications – Regression and Model trees: outcrossing rate
[Field design 2000 figure and experiment design, identical to the earlier algebraic-equations slide: 96 points; 60 cobs – 2500 kernels; % of outcrossing; receptors – non-transgenic corn; donors – GMO corn; measured: temperature, relative humidity, wind velocity, wind direction]
Applications – Regression and Model trees: outcrossing rate
Selected attributes:
• % of outcrossing
• Cardinal direction of the sampling point from the center of the donor field
• Visual angle between the sampling point and the donor field
• Distance between the point and the center of the donor field
• The shortest distance between the sampling point and the donor field
• % of appropriate wind direction (exposure time)
• Length of the wind ventilation route
• Wind velocity
Applications – Regression and Model trees: outcrossing rate
Model II (2001)
Applications – Regression and Model trees: outcrossing rate
Model tree (splits on min_distance, wind_tunnel_length and distance_center; thresholds 4.485, 3.71, 59.071 and 60.755):
LM1: outcrossing = -1.5976·min_distance - 0.1233·distance_center + 0.0107·visual_angle + 16.1174
LM2: outcrossing = -1.5976·min_distance - 0.1521·distance_center + 0.0107·visual_angle + 18.0174
LM3: outcrossing = -1.5976·min_distance - 0.1521·distance_center + 0.0107·visual_angle + 17.9795
LM4: outcrossing = -2.1123·min_distance - 0.1085·distance_center + 0.0107·visual_angle + 0.0108·appropriate_wind_proc + 16.065
LM5: outcrossing = -0.0021·angle - 0.0143·min_distance + 0.0161·visual_angle - 0.0125·appropriate_wind_proc + 0.0425
Applications – Regression and Model trees: outcrossing rate
Measured 2001 vs. predicted 2001 (corr. coeff. 0.50)
Applications – Regression and Model trees: soil seed bank
Regression and Model trees – J4.8 and M5’
Applications – Regression and Model trees: soil seed bank
Applications – Regression and Model trees: soil seed bank
1 – NW Scotland; 2 – N England; 3 – SE England; 4 – S middle England; 5 – SW England
Applications – Multi-target regression model: soil resilience
Soil resilience is a key component in the environmental sustainability of ecosystems. It defines the ability of soil to recover from external stresses that may occur through climate, soil management or industrial pollution.
Goal: uncover the underlying relationships between soil properties and the soil’s resilience and resistance capacity, via multi-target regression modelling (MTRT) on a soil dataset.
Applications – Multi-target regression trees: soil resilience
The dataset: soil samples taken at 26 locations throughout Scotland
The dataset: a flat table of 26 × 18 data entries
Applications – Multi-target regression trees: soil resilience
The dataset:
- physical properties – soil texture: sand, silt, clay
- chemical properties – pH, C, N, SOM (soil organic matter)
- FAO soil classification – order and suborder
- physical resilience – resistance to compression (1/Cc), recovery from compression (Ce/Cc), overburden stress (eg), recovery from overburden stress after two-day cycles (eg2dc)
- biological resilience – heat, copper
Applications – Multi-target regression trees: soil resilience
Applications – Multi-target regression trees: soil resilience
Different scenarios and multi-target regression models have been constructed:
A model predicting the resistance and resilience of soils to copper perturbation.
Applications – Multi-target regression trees: soil resilience
The increasing importance of mapping soil functions to advise on land use and environmental management motivates making a map of soil resilience for Scotland.
The models = filters for existing GIS datasets about physical and chemical properties of Scottish soils.
Applications – Multi-target regression trees: soil resilience
Macaulay Institute (Aberdeen): soils data – attributes and maps:
Approximately 13 000 soil profiles held in the database
Descriptions of over 40 000 soil horizons
Applications – Rules
The simulations were run with the first GENESYS version (published 2001, evaluated 2005, studied in sensitivity analyses 2004, 2005)
Only one field plan was used:
- maximising pollen and seed dispersal
Applications – Rules
Large-risk field pattern
Applications – Rules
Variables describing simulations
- simulation number
- genetic variables
- for each field (1 to 35), the cultivation techniques of years -3, -2, -1, 0
- for each field (1 to 35), the number of years since the last GM oilseed rape crop
- the number of years since the last non-GM oilseed rape crop
- proportion of GM seeds in non-GM oilseed rape of field 14 at year 0
- TOTAL NUMBER OF VARIABLES: 1899
Applications – Rules
Run of experiment
• simulation started with an empty seed bank
• lasted 25 years
• but only the last 4 years were kept in the files for data mining
• TOTAL NUMBER of simulations on the field pattern without the borders: 100 000
Applications – Rules
Non-aggregated data: CUBIST
Options: use 60% of data for training; each rule must cover >= 1% of cases; maximum of 10 rules.

Rule 1: [29499 cases, mean 0.0160539, range 0 to 0.9883608, est err 0.0160594]
if SowingDateY0F14 > 258
then PropGMinField14 = 0.0056903

Rule 2: [12726 cases, mean 0.0297134, range 7.62473e-07 to 0.9883608, est err 0.0388636]
if SowingDateY0F14 > 258
   SowingDateY0F14 <= 277
then PropGMinField14 = 0.0451188 - 0.0024 YearsSinceLastGMcrop_F14
Applications – Rules
Rule 4: [22830 cases, mean 0.0958018, range 0 to 0.9994726, est err 0.0884937]
if SowingDateY0F14 <= 258
   YearsSinceLastGMcrop_F14 > 2
then PropGMinField14 = 0.3722531 - 0.0013 SowingDateY0F14 - 0.00024 SowingDensityY0F14
Applications – Rules
Rule 10: [1911 cases, mean 0.5392408, est err 0.2179439]
if TillageSoilBedPrepY0F14 in {0, 2}
   SowingDateY0F14 <= 258
   SowingDensityY0F14 <= 55
   YearsSinceLastGMcrop_F14 <= 2
then PropGMinField14 = 2.8659117 - 0.3607 YearsSinceLastGMcrop_F14 - 0.0087 SowingDateY0F14 - 0.0032 SowingDensityY0F14 + 0.00073 SowingDateY-1F14 + 0.21 HarvestLossY-1F14 + 0.13 HarvestLossY-2F14 + 0.00017 SowingDensityY-2F14 - 0.09 HarvestLossY-1F8 - 0.07 EfficHerb2Y-3F27 - 7e-05 2cuttingY-3F23 + 7e-05 2cuttingY-2F12 - 0.00027 SowingDateY-1F35 + 0.08 HarvestLossY0F9 + 5e-05 1cuttingY-3F17 + 0.00012 SowingDensityY-2F18 - 6e-05 2cuttingY-2F24 - 6e-05 2cuttingY-2F15 + 6e-05 2cuttingY0F32 - 6e-05 2cuttingY-2F16 + 0.00022 SowingDateY0F11 - 4e-05 1cuttingY0F33 + 0.0001 SowingDensityY-1F7 - 0.05 EfficHerb2Y-3F16 + 0.06 HarvestLossY-1F22 - 5e-05 2cuttingY-2F25 + 0.04 EfficHerb2GMvolunY0F6 + 0.04 EfficHerb1GMvolunY-2F27 + 0.04 EfficHerb2GMvolunY0F5
Applications – Rules
Target attribute ‘PropGMinField14’

Evaluation on training data (60000 cases):
  Average |error|           0.0633466
  Relative |error|          0.47
  Correlation coefficient   0.77

Evaluation on test data (40000 cases):
  Average |error|           0.0657057
  Relative |error|          0.49
  Correlation coefficient   0.75
Conclusions
What can data mining do for you?
Knowledge discovered by analyzing data with DM techniques can help:
- understand the domain studied
- make predictions/classifications
- support decision processes in environmental management
Conclusions
What can data mining not do for you?
The law of information conservation (garbage in, garbage out):
The knowledge we are seeking to discover has to come from the combination of data and background knowledge.
If we have very little data of very low quality and no background knowledge, no form of data analysis will help.
Conclusions
Side-effects?
• Discovering problems with the data during analysis
– missing values
– erroneous values
– inappropriately measured variables
• Identifying new opportunities
– new problems to be addressed
– recommendations on what data to collect and how
1. Data preprocessing
DATA MINING
DATA MINING – data preprocessing
DATA FORMAT
• File extension .arff
• This is a plain-text format; files should be edited with editors such as Notepad, TextPad or WordPad (which do not add extra formatting information)
• A file consists of:
– Title: @relation NameOfDataset
– List of attributes: @attribute AttName AttType, where AttType can be ‘numeric’ or a list of categorical values, e.g., ‘{red, green, blue}’
– Data: @data (on a separate line), followed by the actual data in comma-separated value (.csv) format
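A minimal example of the format described above (the dataset name and attributes are hypothetical):

```
@relation outcrossing

@attribute distance numeric
@attribute wind_speed numeric
@attribute habitat {optimal, maximal, unsuitable}

@data
10,3,optimal
22,1,unsuitable
```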
DATA MINING – data preprocessing
• Excel
• Attributes (variables) in columns and cases in lines
• Use a decimal POINT and not a decimal COMMA for numbers
• Save the Excel sheet as a CSV file
DATA MINING – data preprocessing
• TextPad, Notepad
• Open the CSV file
• Delete “ ” at the beginning of lines and save (just save, don’t change the format)
• Change all ; to ,
• Numbers must have a decimal dot (.) and not a decimal comma (,)
• Save the file as a CSV file (don’t change the format)
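The manual clean-up steps above can be sketched in Python; file handling is omitted, and the input line is hypothetical, showing stray quotes, semicolon separators and a decimal comma being fixed:

```python
def clean_csv_line(line: str) -> str:
    # Strip stray quotes, split on the ';' separator, then turn
    # decimal commas inside each field into decimal dots.
    fields = line.strip().strip('"').split(";")
    return ",".join(f.replace(",", ".") for f in fields)

print(clean_csv_line('"10;3,5;optimal'))   # -> 10,3.5,optimal
```

Applying this to every line of an exported file reproduces the editor steps (delete quotes, change ; to ,, fix decimal commas) in one pass.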
DATA MINING – data preprocessing
• WEKA
• Open the CSV file in WEKA
• Select algorithm and attributes
• Perform data mining
http://www.cs.waikato.ac.nz/ml/weka/index.html