
4. MODELS FROM DATA


Outline

A) THEORETICAL BACKGROUND

1. Knowledge discovery in data bases (KDD)

2. Data mining

• Data

• Patterns

• Data mining algorithms

B) PRACTICAL IMPLEMENTATIONS

3. Applications:

• Equations

• Decision trees

• Rules


Knowledge discovery in data bases (KDD)

What is KDD?

Frawley et al., 1991: “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”

How to find patterns in data?

Data mining (DM) is the central step (step 2) of the KDD process, concerned with applying computational techniques to actually find patterns in the data. The other steps are:

- step 1: preparing data for DM (data preprocessing)

- step 3: evaluating the discovered patterns (the results of DM)


Knowledge discovery in data bases (KDD)

When can patterns be treated as knowledge?

Frawley et al. (1991): “A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user's criteria) is called knowledge.”

Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).

Condition 2: The patterns should potentially lead to some useful actions (according to user defined utility criteria).


Knowledge discovery in data bases (KDD)

What may KDD contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology, …)?

ES deal with complex, unpredictable natural systems (e.g., arable, forest and water ecosystems) and have to answer complex questions about them.

KDD was purposely designed to cope with such complex questions about complex systems, like:

- understanding the domain/system studied (e.g., gene flow, seed bank, life cycle, …)
- predicting future values of system variables of interest (e.g., rate of out-crossing with GM plants at location x at time y, seed bank dynamics, …)
- decision support for management (e.g., how to achieve co-existence with GM crops)


Data mining (DM)

What is data mining?

Data mining is the process of automatically searching large volumes of data for patterns using algorithms.

The most relevant notions of data mining:

1. Data

2. Patterns

3. Data mining algorithms


Data mining (DM) - data

1. What is data?

According to Fayyad et al. (1996): “Data is a set of facts, e.g., cases in a database.”

Data in DM is given in a single flat table:
- rows: objects or records (examples in ML)
- columns: properties of objects (attributes or features in ML)

which is then used as input to a data mining algorithm. For example (objects in rows, properties of objects in columns):

Distance (m)   Wind direction (°)   Wind speed (m/s)   Out-crossing rate (%)
10             123                  3                  8
12             88                   4                  7
14             121                  6                  3
18             147                  2                  4
20             93                   1                  5
22             115                  3                  1
…              …                    …                  …
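Loaded into a typical analysis environment, such a flat table is simply a data frame. A minimal sketch in Python (the file name and column names are hypothetical):

```python
import pandas as pd

# Rows are objects (examples); columns are their properties (attributes).
data = pd.read_csv("outcrossing.csv")  # hypothetical file name

X = data[["distance", "wind_direction", "wind_speed"]]  # input attributes
y = data["outcrossing_rate"]                            # target attribute
print(data.head())
```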


Data mining (DM) - pattern

2. What is a pattern?

A pattern is defined as: ”A statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset” (Frawley et al. 1991, Fayyad et al. 1996).

Classes of patterns considered in DM (depend on the data mining task at hand):

1. equations,
2. decision trees,
3. association, classification, and regression rules.


Data mining (DM) - pattern

1. Equations

To predict the value of a target (dependent) variable as a linear or nonlinear combination of the input (independent) variables.

Linear equations involving:
- two variables: straight lines in a two-dimensional space
- three variables: planes in a three-dimensional space
- more variables: hyper-planes in multidimensional spaces

Nonlinear equations involving:
- two variables: curves in a two-dimensional space
- three variables: surfaces in a three-dimensional space
- more variables: hyper-surfaces in multidimensional spaces


Data mining (DM) - pattern

2. Decision trees

To predict the value of one target (dependent) variable, the class, from the values of other independent variables (attributes) by means of a decision tree.

A decision tree is a hierarchical structure, where:
- each internal node contains a test on an attribute,
- each branch corresponds to an outcome of the test,
- each leaf gives a prediction for the value of the class variable.
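To make the structure concrete, here is a minimal sketch of such a tree and of prediction with it (the attributes, thresholds and class values are invented for illustration):

```python
# A decision tree as nested dictionaries: internal nodes test an attribute,
# branches are the test outcomes, leaves hold the class prediction.
tree = {
    "test": ("distance", 15.0),          # hypothetical attribute and threshold
    "le":   {"leaf": "high outcrossing"},
    "gt":   {"test": ("wind_speed", 3.0),
             "le": {"leaf": "low outcrossing"},
             "gt": {"leaf": "medium outcrossing"}},
}

def predict(node, example):
    """Follow tests from the root to a leaf and return its prediction."""
    while "leaf" not in node:
        attribute, threshold = node["test"]
        node = node["le"] if example[attribute] <= threshold else node["gt"]
    return node["leaf"]

print(predict(tree, {"distance": 20.0, "wind_speed": 2.0}))  # low outcrossing
```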


Data mining (DM) - pattern

A decision tree is called:

A classification tree, if the class value in a leaf is discrete (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, …)

A regression tree, if the class value in a leaf is a constant (from an infinite set of values): e.g., 120, 220, 312, …

A model tree, if a leaf contains a linear model predicting the class value (a piece-wise linear function), e.g.:
out-crossing rate = 12.3×distance − 0.123×wind speed + 0.00123×wind direction


Data mining (DM) - pattern

3. Rules

To capture associations between attributes, discovered in the form of rules.

A rule denotes a pattern of the form:
“IF Conjunction of conditions THEN Conclusion.”

For classification rules, the conclusion assigns one of the possible discrete values to the class (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, spec. D)

For predictive rules, the conclusion gives a prediction for the value of the target (class) variable (from an infinite set of values): e.g., 120, 220, 312, …
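Such IF-THEN rules are easy to represent and apply directly. A minimal sketch (the conditions, conclusions and the first-match strategy are illustrative assumptions, not a specific rule learner's semantics):

```python
# Each rule: a list of conditions (attribute, operator, value) and a conclusion.
rules = [
    ([("wind_speed", ">", 3), ("distance", "<=", 15)], "yes"),  # hypothetical
    ([("distance", ">", 15)], "no"),
]

def fires(conditions, example):
    """True if the example satisfies the whole conjunction of conditions."""
    ops = {">": lambda a, b: a > b, "<=": lambda a, b: a <= b}
    return all(ops[op](example[att], val) for att, op, val in conditions)

def classify(rules, example, default="no"):
    # Apply the first rule whose conjunction of conditions is satisfied.
    for conditions, conclusion in rules:
        if fires(conditions, example):
            return conclusion
    return default

print(classify(rules, {"wind_speed": 4, "distance": 10}))  # yes
```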


Data mining (DM) - algorithm

3. What is a data mining algorithm?

An algorithm in general is a procedure (a finite set of well-defined instructions) for accomplishing some task, which terminates in a defined end-state.

A data mining algorithm is a computational process, defined e.g. by a Turing machine (Gurevich et al. 2000), for finding patterns in data.


Data mining (DM) - algorithm

What kinds of algorithms do we use for discovering patterns?

It depends on the goal:

1. Equations – linear and multiple regression

2. Decision trees – top-down induction of decision trees

3. Rules – rule induction


Data mining (DM) - algorithm

1. Linear and multiple regression

• Bivariate linear regression: the predicted variable (the class C in ML terms, which may be continuous or discrete) can be expressed as a linear function of one attribute (A):

C = α + β×A

• Multiple regression: the predicted variable C can be expressed as a linear function of a multi-dimensional attribute vector (Ai):

C = Σ(i=1..n) βi×Ai
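A least-squares fit of such a multiple regression can be sketched in a few lines (using numpy; the data rows are the first examples from the table in the Data section, and fitting an intercept α as well is an assumption of this sketch):

```python
import numpy as np

# Attribute matrix A (one row per example) and target vector C.
A = np.array([[10, 123, 3], [12, 88, 4], [14, 121, 6], [18, 147, 2]], float)
C = np.array([8, 7, 3, 4], float)

# Append a column of ones so an intercept (alpha) is fitted as well.
A1 = np.column_stack([A, np.ones(len(A))])
coef, *_ = np.linalg.lstsq(A1, C, rcond=None)

print("betas:", coef[:-1], "intercept:", coef[-1])
print("fitted values:", A1 @ coef)
```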


Data mining (DM) - algorithm

2. Top-down induction of decision trees

A decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986).

Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.
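A minimal sketch of this recursive scheme, assuming nominal attributes and a deliberately simplified attribute-selection measure (counting misclassified examples, where Quinlan's ID3 uses information gain):

```python
from collections import Counter

def majority(examples):
    """Most common class among (attributes, class) pairs."""
    return Counter(cls for _, cls in examples).most_common(1)[0][0]

def split_by(examples, att):
    """Partition the examples by their value of attribute att."""
    by_value = {}
    for atts, cls in examples:
        by_value.setdefault(atts[att], []).append((atts, cls))
    return by_value

def tdidt(examples, attributes):
    """Recursively induce a decision tree from (attribute-dict, class) pairs."""
    classes = {cls for _, cls in examples}
    if len(classes) == 1 or not attributes:
        return majority(examples)                      # leaf: class prediction
    def misclassified(att):                            # simplified selection measure
        return sum(len(sub) - Counter(c for _, c in sub).most_common(1)[0][1]
                   for sub in split_by(examples, att).values())
    best = min(attributes, key=misclassified)          # select the root attribute
    rest = [a for a in attributes if a != best]
    return {best: {value: tdidt(subset, rest)          # one subtree per value
                   for value, subset in split_by(examples, best).items()}}
```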


Data mining (DM) - algorithm

3. Rule induction

A rule that correctly classifies some examples is constructed first.

The positive examples covered by the rule from the training set are removed and the process is repeated until no more examples remain.
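A minimal sketch of this covering loop; the construction of a single rule is abstracted into a helper, since that search is the algorithm-specific part:

```python
def sequential_covering(examples, learn_one_rule):
    """Repeatedly learn a rule, remove the positives it covers, and continue.

    examples: list of (attributes, class) with classes labelled "positive" /
    "negative" (an assumption of this sketch); learn_one_rule: a search
    procedure returning (rule, covers) where covers(attributes) -> bool.
    """
    rules = []
    remaining = list(examples)
    while any(cls == "positive" for _, cls in remaining):
        rule, covers = learn_one_rule(remaining)
        rules.append(rule)
        # Remove the positive examples covered by the new rule.
        remaining = [(atts, cls) for atts, cls in remaining
                     if not (cls == "positive" and covers(atts))]
    return rules
```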


DATA MINING – CASE STUDIES


Practical implementations

Each class of the described patterns is illustrated with examples of applications:

1. Equations:
• Algebraic equations
• Differential equations

2. Decision trees:
• Classification trees
• Regression trees
• Model trees

3. Predictive rules


Applications – Algebraic equations

Algebraic equations: CIPER


Applications – Algebraic equations

Dataset

Hydrological conditions (HMS Lendava; monthly data on minimal, average and maximum values):
- Ledava River levels
- groundwater levels

Meteorological conditions (monthly data, HMS Lendava):
- time of solar radiation (h)
- precipitation (mm)
- ET (mm)
- number of days with white frost
- number of days with snow
- T: max, average, min
- cumulative T > 0 °C, > 5 °C, and > 10 °C
- number of days with: minT > 0 °C, minT < −10 °C, minT < −4 °C, minT > 25 °C, maxT > 10 °C, maxT > 25 °C

Measured radial increments:
- 8 trees
- 69 years old

Management data (thinning; m³/y removed from the stand; Forestry Unit Lendava)

• Monthly data + aggregated data (AMJ, MJJ, JJA, MJJA etc.)
• Σ: 333 attributes; 35 years

Materials and methods


Applications – Algebraic equations

• 52 different combinations of attributes were tested. Σ: 124 models

Experiment   RRSE      # eq. elements
jnj3_2m      0.7282    6
jnj3_3s      0.7599    6
jnj3_1s      0.7614    6
jnj3_4m      0.76455   3
jnj2_2       0.7685    5
jly_4xl      0.7686    6


Applications – Algebraic equations

Model jnj3_2m:

RadialGrowthIncrement =
− 0.0511025526922 · minL8-10
− 0.0291795197998 · maxL8-10
− 0.017479975134 · t-sun4-7
+ 0.0346935385853 · t-sun8-10
− 1.950606536e-05 · t-sun8-10²
− 2.01014710248 · d-wf-4-7
+ 9.35586778387e-05 · minL4-7 · t-sun4-7
− 0.000179339939732 · minL4-7 · t-sun8-10
+ 6.45688563611e-05 · minL8-10 · t-sun8-10
+ 3.06551434164e-05 · maxL8-10 · t-sun4-7
+ 0.00282485442386 · t-sun4-7 · d-wf-4-7
− 0.00141078675225 · t-sun8-10 · d-wf-4-7
+ 7.91071710872

Relative Root Squared Error (RRSE) = 0.728229824611

Correlation between average measured (r-aver8) and modeled increments (linear regression): R² = 0.8771
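For reference, the RRSE reported above is commonly defined as the root squared error of the model relative to that of always predicting the mean of the target (assumed here to be the definition used):

```python
import numpy as np

def rrse(y_true, y_pred):
    """Relative Root Squared Error: model error relative to predicting the mean."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.sum((y_true - y_pred) ** 2)
                   / np.sum((y_true - y_true.mean()) ** 2))
```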


Applications – Algebraic equations

Model jnj3_2m


Applications – Algebraic equations

Algebraic equations: Lagramge


Data source:

• Federal Biological Research Centre (BBA), Braunschweig, Germany (2000, 2001)

• Slovenian Agricultural Institute (KIS), Slovenia (2006)

Plants involved:

• BBA:
- transgenic maize (var. “Acrobat”, glufosinate-tolerant line) – donor
- non-transgenic maize field (var. “Anjou”) – receptor

• KIS:
- yellow-kernel variety of maize (hybrid Bc462, simulating a transgenic maize variety) – donor
- white-kernel variety of maize (variety Bc38W, simulating a non-GM variety) – receptor

Applications – Algebraic equations


[Figure: Field design 2000 – transgenic donor field and non-transgenic recipient field (100 m × 220 m overall), with 2 m access paths, direction of drilling, and sampling points on an a–q × 1–6 coordinate grid at distances of 2, 3, 4.5, 7.5, 13.5, 25.5 and 49.5 m from the donor field; oriented to N.]

Experiment design: 96 sampling points; 60 cobs – 2500 kernels; % of outcrossing measured on receptors (non-transgenic corn) from donors (GMO corn). Recorded weather covariates: temperature, relative humidity, wind velocity, wind direction.

Applications – Algebraic equations


Selected attributes:
• % of outcrossing
• Cardinal direction of the sampling point from the center of the donor field
• Visual angle between the sampling point and the donor field
• Distance from the sampling point to the center of the donor field
• The shortest distance between the sampling point and the donor field
• % of appropriate wind direction (exposure time)
• Length of the wind ventilation route
• Wind velocity

Applications – Algebraic equations: outcrossing rate


Applications – Algebraic equations


Applications – Algebraic equations


Applications – Algebraic equations


Applications – Algebraic equations


Applications – Algebraic equations


Applications – Differential equations

Differential equations: Lagramge


Applications – Differential equations


Applications – Differential equations


Applications – Differential equations

[Diagram: Phosphorus dynamics – water in-flow and out-flow; growth and respiration fluxes.]


Applications – Differential equations

[Diagram: Phytoplankton dynamics – growth, respiration, sedimentation and grazing fluxes.]


Applications – Differential equations

[Diagram: Zooplankton dynamics – feeds on phytoplankton; respiration and mortality fluxes.]


Applications – Differential equations


Applications – Classification trees: habitat models

Classification trees: J48


Applications – Classification trees: habitat models

Population growth


Applications – Classification trees: habitat models

Observed locations of brown bears (BBs)



Applications – Classification trees: habitat models

The training dataset

• Positive examples:
- Locations of bear sightings (Hunting association; telemetry)
- Females only
- Using home-range (HR) areas instead of “raw” locations
- Narrower HR for optimal habitat, wider for maximal

• Negative examples:
- Sampled from the unsuitable part of the study area
- Stratified random sampling
- Different land cover types equally accounted for


Applications – Classification trees: habitat models

Dataset (a flat table: one row of attribute values per example; the class is coded Present: 1, Absent: 0). Excerpt:

1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,4155.1,62,37,0,0,2,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,1,53,0,3640,0,0,0,-1347,63,211,11,11,11,83,213,213,11,89,3858.2,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,10404,0,2074,-309,48,0,0,11,11,11,83,83,83,0,20,3862.2,0,100,0,0,1,76,0,16,71,0,12,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,7500,0,1661,-319,-942,0,0,11,11,11,0,0,0,0,20,4088.1,8,91,0,0,1,52,0,59,41,0,0,0,0,0,0,0,4,0,0,0,0,5,1,6,82,0,6500,0,1505,-166,879,9,57,11,11,11,281,281,281,0,20,3199.4,3,0,86,9,0,75,0,33,67,0,0,0,0,0,0,1,2,0,0,0,0,0,1,2,54,0,0,0,465,-66,-191,4,225,11,31,31,41,72,272,60,619,4013.1,34,65,0,0,2,51,9,76,9,5,1,4,1,0,1,0,29,0,0,0,0,0,1,2,54,0,3000,0,841,-111,-264,34,220,11,41,41,151,141,112,60,619,3897.1,100,0,0,0,3,52,0,86,6,3,5,9,6,7,38,40,0,0,0,0,0,0,1,17,64,0,8062,0,932,-603,-71,100,337,11,41,41,171,232,202,4,24,3732. …


Applications – Classification trees: habitat models

The model for optimal habitat


Applications – Classification trees: habitat models

The model for maximal habitat


Applications – Classification trees: habitat models

Map of optimal habitat (13% of Slovenian territory)


Applications – Classification trees: habitat models

Map of maximal habitat (39% of Slovenian territory)


Applications – Multi-target classification: outcrossing rate

Multi-target classification model (Clus):

Modelling pollen dispersal of genetically modified oilseed rape within the field

Marko Debeljak, Claire Lavigne, Sašo Džeroski, Damjan Demšar

In: 90th ESA Annual Meeting [jointly with the] IX International Congress of Ecology, August 7-12, 2005, Montréal, Canada. Abstracts. [S.l.]: ESA, 2005, p. 152.


Experiment design:

• Field for receptors: 90 m × 90 m
• Donors: MF transgenic oilseed rape “B004.oxy” (10 × 10 m)
• 3 × 3 m grid = 841 nodes
• 10 seeds of MS oilseed rape “FU58B004” planted per node
• Field planted with MF oilseed rape “B004”
• Measured: % MS outcrossing and % MF outcrossing

Applications – Multi-target classification: outcrossing rate


Selected attributes for modelling:
• Rate of outcrossing of MS and MF receptor plants [rate per 1000]
• Cardinal direction of the sampling point from the center of the donor field [rad]
• Visual angle between the sampling point and the donor field [rad]
• Distance from the sampling point to the center of the donor field [m]
• The shortest distance between the sampling point and the donor field [m]

Applications – Multi-target classification: outcrossing rate


• Number of examples: 817

• Correlation coefficient: MF: 0.821, MS: 0.846

[Scatter plots of measured vs. modelled outcrossing for MS and MF]

Applications – Multi-target classification: outcrossing rate


Applications – Regression and Model trees: outcrossing rate

Model trees – M5’

Modelling outcrossing of transgenes in maize between neighboring maize fields.

DEBELJAK Marko, DEMŠAR Damjan, DŽEROSKI Sašo, SCHIEMANN Joachim

In: HŘEBIČEK, Jiří (Ed.), RÁČEK, Jaroslav (Ed.). Networking environmental information: EnviroInfo Brno 2005: proceedings of the 19th International Conference Informatics for Environmental Protection, September 7-9, 2005, Brno, Czech Republic. Brno: Masaryk University, 2005, pp. 610-614.


Data source: Federal Biological Research Centre (BBA), Braunschweig, Germany

Project title: Outcrossing of transgenes in maize between neighboring maize fields

Plants involved: corn (Zea mays) vs. GMO corn

Brief description: Correlation of impact factors and outcrossing frequency in maize fields

Applications – Regression and Model trees: outcrossing rate


[Figure: Field design 2000 – the same layout as above: transgenic donor field and non-transgenic recipient field (100 m × 220 m overall), 2 m access paths, direction of drilling, sampling points on an a–q × 1–6 coordinate grid at 2, 3, 4.5, 7.5, 13.5, 25.5 and 49.5 m; oriented to N.]

Experiment design: 96 sampling points; 60 cobs – 2500 kernels; % of outcrossing measured on receptors (non-transgenic corn) from donors (GMO corn). Recorded weather covariates: temperature, relative humidity, wind velocity, wind direction.

Applications – Regression and Model trees: outcrossing rate


Selected attributes:
• % of outcrossing
• Cardinal direction of the sampling point from the center of the donor field
• Visual angle between the sampling point and the donor field
• Distance from the sampling point to the center of the donor field
• The shortest distance between the sampling point and the donor field
• % of appropriate wind direction (exposure time)
• Length of the wind ventilation route
• Wind velocity

Applications – Regression and Model trees: outcrossing rate


Model II (2001)

Applications – Regression and Model trees: outcrossing rate

The tree splits on min_distance (thresholds 4.485 and 3.71), wind_tunnel_length (threshold 59.071) and distance_center (threshold 60.755), with a linear model in each leaf:

LM1: outcrossing = -1.5976×min_distance - 0.1233×distance_center + 0.0107×visual_angle + 16.1174

LM2: outcrossing = -1.5976×min_distance - 0.1521×distance_center + 0.0107×visual_angle + 18.0174

LM3: outcrossing = -1.5976×min_distance - 0.1521×distance_center + 0.0107×visual_angle + 17.9795

LM4: outcrossing = -2.1123×min_distance - 0.1085×distance_center + 0.0107×visual_angle + 0.0108×appropriate_wind_proc + 16.065

LM5: outcrossing = -0.0021×angle - 0.0143×min_distance + 0.0161×visual_angle - 0.0125×appropriate_wind_proc + 0.0425


Applications – Regression and Model trees: outcrossing rate

Measured 2001 vs. predicted 2001 (corr. coeff. 0.50)


Applications – Regression and Model trees: soil seed bank

Regression and Model trees – J4.8 and M5’


Applications – Regression and Model trees: soil seed bank


Applications – Regression and Model trees: soil seed bank

1 – NW Scotland, 2 – N England, 3 – SE England, 4 – S middle England, 5 – SW England


Applications – Multi-target regression model: soil resilience


Soil resilience is a key component in the environmental sustainability of ecosystems. It defines the ability of soil to recover from external stresses that may occur through climate, soil management or industrial pollution.

Goal: uncover the underlying relationships between soil properties and the soil's resilience and resistance capacity, by multi-target regression tree (MTRT) modelling on a soil dataset.
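The models in this study were built with Clus; purely as an illustration of the idea, a multi-target regression tree can also be sketched with scikit-learn, whose regression trees accept a matrix of targets (the attribute and target names follow the slides, the numbers are invented):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Inputs: physical/chemical soil properties (sand, silt, clay, pH); invented values.
X = np.array([[60, 25, 15, 6.1], [40, 35, 25, 5.4], [55, 30, 15, 7.0]])
# Targets: several resilience measures at once, e.g. heat and copper; invented.
Y = np.array([[0.8, 0.4], [0.5, 0.7], [0.9, 0.3]])

# A single tree predicting all targets simultaneously (multi-target regression tree).
tree = DecisionTreeRegressor(max_depth=2).fit(X, Y)
print(tree.predict([[50, 30, 20, 6.0]]))  # one predicted value per target
```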

Applications – Multi-target regression trees: soil resilience


The dataset: soil samples taken at 26 locations throughout Scotland, giving a flat table of 26 × 18 data entries.

Applications – Multi-target regression trees: soil resilience


The dataset:

- physical properties: soil texture (sand, silt, clay)
- chemical properties: pH, C, N, SOM (soil organic matter)
- FAO soil classification: order and suborder
- physical resilience: resistance to compression (1/Cc), recovery from compression (Ce/Cc), overburden stress (eg), recovery from overburden stress after two-day cycles (eg2dc)
- biological resilience: heat, copper

Applications – Multi-target regression trees: soil resilience


Applications – Multi-target regression trees: soil resilience

Different scenarios and multi-target regression models have been constructed:

A model predicting the resistance and resilience of soils to copper perturbation.


Applications – Multi-target regression trees: soil resilience

Mapping soil functions is increasingly important for advising on land use and environmental management – hence the goal of making a map of soil resilience for Scotland.

The models are used as filters for existing GIS datasets on the physical and chemical properties of Scottish soils.


Applications – Multi-target regression trees: soil resilience

Macaulay Institute (Aberdeen): soils data – attributes and maps:

Approximately 13 000 soil profiles held in database

Descriptions of over 40 000 soil horizons


Application


Application


Application


Applications – Rules


Applications – Rules


Applications – Rules

The simulations were run with the first GENESYS version (published 2001, evaluated 2005, studied in sensitivity analyses 2004, 2005)

Only one field plan was used:

- maximising pollen and seed dispersal


Applications – Rules

Large-risk field pattern


Applications – Rules

Variables describing simulations

- simulation number
- genetic variables
- for each field (1 to 35), the cultivation techniques of years -3, -2, -1 and 0
- for each field (1 to 35), the number of years since the last GM oilseed rape crop
- the number of years since the last non-GM oilseed rape crop
- proportion of GM seeds in non-GM oilseed rape of field 14 at year 0

TOTAL NUMBER OF VARIABLES: 1899


Applications – Rules

Run of experiment

• simulation started with an empty seed bank

• lasted 25 years,

• but only the last 4 years were kept in the files for data mining

• TOTAL NUMBER of simulations on the field pattern without the borders: 100 000


Applications – Rules

Non-aggregated data: CUBIST

Options: use 60% of data for training; each rule must cover ≥ 1% of cases; maximum of 10 rules.

Rule 1: [29499 cases, mean 0.0160539, range 0 to 0.9883608, est err 0.0160594]
if SowingDateY0F14 > 258
then PropGMinField14 = 0.0056903

Rule 2: [12726 cases, mean 0.0297134, range 7.62473e-07 to 0.9883608, est err 0.0388636]
if SowingDateY0F14 > 258 and SowingDateY0F14 <= 277
then PropGMinField14 = 0.0451188 - 0.0024 YearsSinceLastGMcrop_F14
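Transcribed literally into code, rules of this kind are just guarded linear models. A minimal sketch of Rules 1 and 2 (note: CUBIST combines the predictions of all rules that cover a case, which the sketch approximates by averaging; that combination strategy is an assumption here):

```python
def rule1(x):
    # Rule 1: if SowingDateY0F14 > 258 then PropGMinField14 = 0.0056903
    return 0.0056903 if x["SowingDateY0F14"] > 258 else None

def rule2(x):
    # Rule 2: if 258 < SowingDateY0F14 <= 277 then a linear model applies
    if 258 < x["SowingDateY0F14"] <= 277:
        return 0.0451188 - 0.0024 * x["YearsSinceLastGMcrop_F14"]
    return None

def predict(x, rules=(rule1, rule2)):
    """Average the predictions of all rules whose conditions the case satisfies."""
    fired = [p for p in (r(x) for r in rules) if p is not None]
    return sum(fired) / len(fired) if fired else None

print(predict({"SowingDateY0F14": 260, "YearsSinceLastGMcrop_F14": 3}))
```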


Applications – Rules

Rule 4: [22830 cases, mean 0.0958018, range 0 to 0.9994726, est err 0.0884937]
if SowingDateY0F14 <= 258 and YearsSinceLastGMcrop_F14 > 2
then PropGMinField14 = 0.3722531 - 0.0013 SowingDateY0F14 - 0.00024 SowingDensityY0F14


Applications – Rules

Rule 10: [1911 cases, mean 0.5392408, est err 0.2179439]
if TillageSoilBedPrepY0F14 in {0, 2} and SowingDateY0F14 <= 258 and SowingDensityY0F14 <= 55 and YearsSinceLastGMcrop_F14 <= 2
then PropGMinField14 = 2.8659117 - 0.3607 YearsSinceLastGMcrop_F14 - 0.0087 SowingDateY0F14 - 0.0032 SowingDensityY0F14 + 0.00073 SowingDateY-1F14 + 0.21 HarvestLossY-1F14 + 0.13 HarvestLossY-2F14 + 0.00017 SowingDensityY-2F14 - 0.09 HarvestLossY-1F8 - 0.07 EfficHerb2Y-3F27 - 7e-05 2cuttingY-3F23 + 7e-05 2cuttingY-2F12 - 0.00027 SowingDateY-1F35 + 0.08 HarvestLossY0F9 + 5e-05 1cuttingY-3F17 + 0.00012 SowingDensityY-2F18 - 6e-05 2cuttingY-2F24 - 6e-05 2cuttingY-2F15 + 6e-05 2cuttingY0F32 - 6e-05 2cuttingY-2F16 + 0.00022 SowingDateY0F11 - 4e-05 1cuttingY0F33 + 0.0001 SowingDensityY-1F7 - 0.05 EfficHerb2Y-3F16 + 0.06 HarvestLossY-1F22 - 5e-05 2cuttingY-2F25 + 0.04 EfficHerb2GMvolunY0F6 + 0.04 EfficHerb1GMvolunY-2F27 + 0.04 EfficHerb2GMvolunY0F5


Applications – Rules

Non-aggregated data: CUBIST. Options: use 60% of data for training; each rule must cover ≥ 1% of cases; maximum of 10 rules.

Target attribute: ‘PropGMinField14’

Evaluation on training data (60,000 cases):
Average |error|: 0.0633466
Relative |error|: 0.47
Correlation coefficient: 0.77

Evaluation on test data (40,000 cases):
Average |error|: 0.0657057
Relative |error|: 0.49
Correlation coefficient: 0.75


Conclusions

What can data mining do for you?

Knowledge discovered by analyzing data with DM techniques can help:

- Understand the domain studied
- Make predictions/classifications
- Support decision processes in environmental management


Conclusions

What can data mining not do for you?

The law of information conservation (garbage-in-garbage-out)

The knowledge we are seeking to discover has to come from the combination of data and background knowledge.

If we have very little data of very low quality and no background knowledge, no form of data analysis will help.


Conclusions

Side-effects?

• Discovering problems with the data during analysis:
– missing values
– erroneous values
– inappropriately measured variables

• Identifying new opportunities:
– new problems to be addressed
– recommendations on what data to collect and how


1. Data preprocessing

DATA MINING


DATA MINING – data preprocessing

DATA FORMAT

• File extension: .arff
• This is a plain-text format; files should be edited with editors such as Notepad, TextPad or WordPad (which do not add extra formatting information)
• A file consists of:
– Title: @relation NameOfDataset
– List of attributes: @attribute AttName AttType, where AttType can be ‘numeric’ or a list of categorical values, e.g., ‘{red, green, blue}’
– Data: @data (on a separate line), followed by the actual data in comma-separated value (.csv) format
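A minimal example of such a file (the dataset name, attributes and values are invented for illustration):

```
@relation outcrossing

@attribute distance numeric
@attribute wind_speed numeric
@attribute rate {low, high}

@data
10,3,high
22,3,low
```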


DATA MINING – data preprocessing

• Excel
• Attributes (variables) in columns and cases in lines
• Use a decimal POINT, not a decimal COMMA, for numbers
• Save the Excel sheet as a CSV file


DATA MINING – data preprocessing

• TextPad, Notepad
• Open the CSV file
• Delete the quotation marks (“ ”) at the beginning of lines and save (just save, don't change the format)
• Change all ; to ,
• Numbers must have a decimal dot (.), not a decimal comma (,)
• Save the file as a CSV file (don't change the format)
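The same clean-up can be scripted. A minimal sketch in Python, assuming (as described above) that the export uses ';' as the field separator, decimal commas, and stray quotation marks (the file names are hypothetical):

```python
# Convert a semicolon-separated export with decimal commas into a
# comma-separated file with decimal points, ready for WEKA.
with open("export.csv", encoding="utf-8") as f:   # hypothetical input file
    lines = [line.strip().strip('"') for line in f]

cleaned = []
for line in lines:
    # Decimal commas -> dots within each field; ';' separators -> ','.
    fields = [field.replace(",", ".") for field in line.split(";")]
    cleaned.append(",".join(fields))

with open("weka_ready.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))
```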


DATA MINING – data preprocessing

• WEKA
• Open the CSV file in WEKA
• Select an algorithm and attributes
• Perform data mining

http://www.cs.waikato.ac.nz/ml/weka/index.html