Machine Learning and Inductive Inference
Hendrik Blockeel, 2001-2002
1 Introduction

Practical information
What is "machine learning and inductive inference"?
What is it useful for? (some example applications)
Different learning tasks
Data representation
Brief overview of approaches
Overview of the course
Practical information about the course

10 lectures (2h) + 4 exercise sessions (2.5h)
Audience with diverse backgrounds
Course material:
• Book: Machine Learning (Mitchell, 1997, McGraw-Hill)
• Slides & notes, http://www.cs.kuleuven.ac.be/~hendrik/ML/
Examination: oral exam (20') with written preparation (+/- 2h)
• 2/3 theory, 1/3 exercises
• only topics discussed in lectures / exercises
What is machine learning?

Study of how to make programs improve their performance on certain tasks based on their own experience
• "performance" = speed, accuracy, ...
• "experience" = set of previously seen cases ("observations")
For instance (a simple method):
• experience: taking action A in situation S yielded result R
• situation S arises again:
  – if R was undesirable: try something else
  – if R was desirable: try action A again
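The "repeat what worked" strategy on this slide can be sketched in a few lines of code. The situation and action names below are invented for illustration; they are not from the course.

```python
# Minimal sketch of the naive strategy: remember what each action yielded
# in each situation, and repeat actions whose result was desirable.

experience = {}   # situation -> (action, result_was_desirable)

def record(situation, action, desirable):
    experience[situation] = (action, desirable)

def choose_action(situation, default_action):
    """If action A gave a desirable result in situation S, try A again;
    otherwise (undesirable result, or S never seen) try something else."""
    if situation in experience:
        action, desirable = experience[situation]
        if desirable:
            return action
    return default_action

record("S", "A", True)
print(choose_action("S", "B"))    # prints A: repeat the successful action
print(choose_action("S2", "B"))   # prints B: an exact match is required
```

The last line illustrates the weakness discussed on the next slide: without generalisation, a merely similar situation gives no benefit.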
This is a very simple example:
• it only works if precisely the same situation is encountered again
• what if a similar situation arises?
  – need for generalisation
• how about choosing another action even if a good one is already known? (you might find a better one)
  – need for exploration
This course focuses mostly on generalisation, or inductive inference.
Inductive inference = reasoning from the specific to the general
e.g. statistics: from a sample, infer properties of the population
(figure: sample → population)
observation: "these dogs are all brown"
hypothesis: "all dogs are brown"
Note: inductive inference is more general than statistics
statistics mainly consists of numerical methods for inference
• infer the mean, probability distribution, … of a population
other approaches:
• find a symbolic definition of a concept ("concept learning")
• find laws with complicated structure that govern the data
• study induction from a logical, philosophical, … point of view
• …
Applications of inductive inference:
Machine learning
• "sample" of observations = experience
• generalizing to the population = finding patterns in the observations that generally hold and may be used for future tasks
Knowledge discovery (data mining)
• "sample" = database
• generalizing = finding patterns that hold in this database and can also be expected to hold on similar data not in the database
• discovered knowledge = comprehensible description of these patterns
...
What is it useful for?
Scientifically: for understanding learning and intelligence in humans and animals
• interesting for psychologists, philosophers, biologists, …
More practically: for building AI systems
• expert systems that improve automatically over time
• systems that help scientists discover new laws
Also useful outside "classical" AI-like applications:
• when we don't know how to program something ourselves
• when a program should adapt regularly to new circumstances
• when a program should tune itself to its user
Knowledge discovery
Scientific knowledge discovery
Some "toy" examples:
• BACON: rediscovered some laws of physics (e.g. Kepler's laws of planetary motion)
• AM: rediscovered some mathematical theorems
More serious recent examples:
• mining the human genome
• mining the web for information on genes, proteins, …
• drug discovery
  – context: robots perform lots of experiments at a high rate; this yields lots of data to be studied and interpreted by humans; try to automate this process (because humans can't keep up with the robots)
Example: given molecules that are active against some disease, find out what is common to them; this is probably the reason for their activity.
Data mining in databases: looking for "interesting" patterns
e.g. for marketing:
• based on data in the DB, who should be interested in this new product? (useful for direct mailing)
• study customer behaviour to identify typical customer profiles
• find out which products in the store are often bought together
e.g. in a hospital: help with the diagnosis of patients
Learning to perform difficult tasks

Difficult for humans…
• LEX system: learned how to perform symbolic integration of functions
… or "easy" for humans, but difficult to program
• humans can do it, but can't explain how they do it
e.g.:
• learning to play games (chess, go, …)
• learning to fly a plane, drive a car, …
• recognising faces
• …
Adaptive systems

A robot in a changing environment continuously needs to adapt its behaviour
Systems that adapt to the user, based on user modelling:
• observe the behaviour of the user
• build a model describing this behaviour
• use the model to make the user's life easier
e.g. adaptive web pages, intelligent mail filters, adaptive user interfaces (e.g. an intelligent Unix shell), …
Illustration: building a system that learns checkers

Learning = improving on task T, with respect to performance measure P, based on experience E
In this example:
• T = playing checkers
• P = % of games won in the world tournament
• E = games played against itself
• possible problem: is this experience representative for the real task?
Questions to be answered: exactly what is given, exactly what is learnt, and what representation & learning algorithm should we use?
What do we want to learn? Given a board situation, which move to make.
What is given? Direct or indirect evidence?
• direct: e.g., which moves were good, which were bad
• indirect: consecutive moves in a game, plus the outcome of the game
In our case: indirect evidence
• direct evidence would require a teacher
What exactly shall we learn?
Choose the type of target function:
• ChooseMove: Board → Move?
  – directly applicable
• V: Board → ℝ?
  – indicates the quality of a state
  – when playing, choose the move that leads to the best state
  – note: a reasonable definition for V is easy to give:
    » V(won) = 100, V(lost) = -100, V(draw) = 0, V(s) = V(e) with e the best state reachable from s when playing optimally
    » not feasible in practice (requires exhaustive minimax search)
• …
• Let's choose the V function here
Choose a representation for the target function:
• set of rules?
• neural network?
• polynomial function of numerical board features?
• …
Let's choose: V = w1·bp + w2·rp + w3·bk + w4·rk + w5·bt + w6·rt
• bp, rp: number of black / red pieces
• bk, rk: number of black / red kings
• bt, rt: number of black / red pieces threatened
• wi: constants to be learnt from experience
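The chosen representation is easy to write down in code. The weight and feature values below are made up for illustration; they are not values learnt by any actual checkers player.

```python
# The slide's linear evaluation function:
# V = w1*bp + w2*rp + w3*bk + w4*rk + w5*bt + w6*rt

def V(weights, features):
    """weights = (w1..w6), features = (bp, rp, bk, rk, bt, rt)."""
    return sum(w * f for w, f in zip(weights, features))

w = (1.0, -1.0, 3.0, -3.0, -0.5, 0.5)  # e.g. a king worth 3 pieces, threats half
board = (12, 12, 0, 0, 1, 2)           # a made-up early-game position
print(V(w, board))                     # -> 0.5
```

Learning then reduces to finding good values for the six constants wi, which is what the next two slides address.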
How to obtain training examples?
We need a set of examples [bp, rp, bk, rk, bt, rt, V]
bp etc. are easy to determine; but how to guess V?
• we have indirect evidence only!
Possible method:
• with V(s) the true target function, V'(s) the learnt function, and Vt(s) the training value for a state s:
  – Vt(s) ← V'(successor(s))
  – adapt V' using the Vt values (making V' and Vt converge)
  – hope that V' will converge to V
• intuitively: V for end states is known; propagate V values from later states to earlier states in the game
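The bootstrapping rule Vt(s) ← V'(successor(s)) can be sketched as follows. The game trace and the untrained estimate are hypothetical placeholders.

```python
# Training values for non-final states come from the current estimate V'
# of the state that follows; final states get their known value.

def training_values(trace, V_hat, final_value):
    """trace: the states of one game, ending in a final state."""
    vt = {s: V_hat(succ) for s, succ in zip(trace, trace[1:])}  # Vt(s) = V'(successor(s))
    vt[trace[-1]] = final_value   # V of the end state is known (e.g. +100 for a win)
    return vt

V_hat = lambda s: 0.0             # an untrained V': every state looks neutral
vt = training_values(["s0", "s1", "s2"], V_hat, 100)
print(vt)                         # only the final state carries signal at first
```

As V' improves over many games, the known end-state values propagate backwards to earlier states, which is exactly the intuition on the slide.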
Training algorithm: how to adapt the weights wi?
Possible method:
• look at the "error": error(s) = Vt(s) - V'(s)
• adapt the weights so that the error is reduced
• e.g. using a gradient descent method
  – for each feature fi: wi ← wi + c·fi·error(s), with c some small constant
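The update rule above (an LMS-style gradient step) can be exercised on made-up numbers. The feature vector, training value, and learning rate c below are all illustrative.

```python
# One gradient-descent step per call: error(s) = Vt(s) - V'(s),
# then for each feature fi: wi <- wi + c * fi * error(s).

def lms_update(w, features, v_train, c=0.001):
    v_hat = sum(wi * fi for wi, fi in zip(w, features))
    error = v_train - v_hat                      # error(s) = Vt(s) - V'(s)
    return [wi + c * fi * error for wi, fi in zip(w, features)]

w = [0.0] * 6                                    # start with all weights zero
features = [12, 12, 0, 0, 1, 2]                  # (bp, rp, bk, rk, bt, rt)
for _ in range(100):
    w = lms_update(w, features, v_train=10.0)    # repeatedly push V' toward Vt
v_hat = sum(wi * fi for wi, fi in zip(w, features))
print(round(v_hat, 2))                           # close to the training value 10.0
```

Note that c must be small relative to the feature magnitudes, otherwise the repeated updates overshoot and diverge instead of converging.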
Overview of design choices

• determine the type of training experience: games against self, games against an expert, a table of good moves, …
• determine the type of target function: Board → Move, Board → ℝ, …
• determine the representation: linear function of 6 features, …
• determine the learning algorithm: gradient descent, …
• … ready!
Some issues that influence the choices
Which algorithms are useful for what type of functions?
How is learning influenced by
• the number of training examples
• the complexity of the hypothesis (function) representation
• noise in the data
Theoretical limits of learning?
Can we help the learner with prior knowledge?
Could a system alter its representation itself?
…
Typical learning tasks

Concept learning
• learn a definition of a concept
• supervised vs. unsupervised
Function learning ("predictive modelling")
• discrete ("classification") or continuous ("regression")
• concept = function with a boolean result
Clustering
Finding descriptive patterns
Concept learning: supervised
Given positive (+) and negative (-) examples of a concept, infer the properties that cause instances to be positive or negative (= the concept definition)
(figure: instance space X with + and - examples; the learnt concept C: X → {true, false} separates them)
Concept learning: unsupervised
Given examples of instances:
• invent reasonable concepts (= clustering)
• find definitions for these concepts
(figure: instance space X with unlabelled examples, grouped into clusters C1, C2, C3)
Cf. a taxonomy of animals, identification of market segments, ...
Function learning
Generalises over concept learning
Learn a function f: X → S where
• S is a finite set of values: classification
• S is a continuous range of reals: regression
(figure: instance space X with examples mapped by f onto values such as 0.6, 0.9, 1.4, 2.1, 2.7 in the range 0-3)
Clustering
Finding groups of instances that are similar
May be a goal in itself (unsupervised classification) ...
... but is also used for other tasks:
• regression
• flexible prediction: when it is not known in advance which properties to predict from which other properties
(figure: instance space X with examples grouped into clusters)
Finding descriptive patterns
Descriptive patterns = any kind of patterns, not necessarily directly useful for prediction
Generalises over predictive modelling (= finding predictive patterns)
Examples of patterns:
• "fast cars usually cost more than slower cars"
• "people are never married to more than one person at the same time"
Representation of data
Numerical data: instances are points in ℝⁿ
• many techniques focus on this kind of data
Symbolic data (true/false, black/white/red/blue, ...)
• can be converted to numeric data
• some techniques work directly with symbolic data
Structural data
• instances have an internal structure (graphs, sets, …; cf. molecules)
• difficult to convert to a simpler format
• few techniques can handle these directly
Brief overview of approaches

Symbolic approaches:
• version spaces, induction of decision trees, induction of rule sets, inductive logic programming, …
Numeric approaches:
• neural networks, support vector machines, …
Probabilistic approaches ("Bayesian learning")
Miscellaneous:
• instance-based learning, genetic algorithms, reinforcement learning
Overview of the course

• Introduction (today) (Ch. 1)
• Concept learning: version spaces (Ch. 2 - brief)
• Induction of decision trees (Ch. 3)
• Artificial neural networks (Ch. 4 - brief)
• Evaluating hypotheses (Ch. 5)
• Bayesian learning (Ch. 6)
• Computational learning theory (Ch. 7)
• Support vector machines (brief)
• Instance-based learning (Ch. 8)
• Genetic algorithms (Ch. 9)
• Induction of rule sets & association rules (Ch. 10)
• Reinforcement learning (Ch. 13)
• Clustering
• Inductive logic programming
• Combining different models: bagging, boosting, stacking, …
2 Version Spaces

Recall of the basic principles from the AI course, stressing concepts that are important for later use
Difficulties with version space approaches
Inductive bias

Mitchell, Ch. 2
Basic principles

Concept learning as search
• given: a hypothesis space H and a data set S
• find: all h ∈ H consistent with S
• this set is called the version space, VS(H,S)
How to search in H
• enumerate all h in H: not feasible
• prune the search using a generality ordering
  – h1 is more general than h2 ⇔ (x ∈ h2 ⇒ x ∈ h1)
See Mitchell Chapter 2 for examples
An example

(figure: + and - examples scattered in the plane)
+ : belongs to the concept; - : does not
S = the set of these + and - examples
Assume the hypotheses are rectangles, i.e., H = the set of all rectangles
VS(H,S) = the set of all rectangles that contain all + and no -
(figure: a green rectangle enclosing all + examples and none of the - examples)
Example of a consistent hypothesis: the green rectangle
(figure: rectangles h1, h2, h3 drawn over the examples, with h2 totally inside h1)
h1 is more general than h2: h2 lies totally inside h1
h2 is more specific than h1
h3 is incomparable with h1
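For rectangle hypotheses the generality test is simple containment. A short sketch, with the rectangle coordinates invented for illustration:

```python
# h1 is more general than h2 iff h2 lies totally inside h1
# (every x in h2 is also in h1). Rectangles are (xmin, xmax, ymin, ymax).

def more_general(h1, h2):
    x1lo, x1hi, y1lo, y1hi = h1
    x2lo, x2hi, y2lo, y2hi = h2
    return x1lo <= x2lo and x2hi <= x1hi and y1lo <= y2lo and y2hi <= y1hi

h1 = (1, 6, 1, 6)
h2 = (2, 4, 2, 3)
h3 = (0, 3, 0, 3)                     # overlaps h1 without containment
print(more_general(h1, h2))           # True: h2 lies inside h1
print(more_general(h1, h3), more_general(h3, h1))   # False False: incomparable
```

Note the ordering is partial: two overlapping rectangles like h1 and h3 are incomparable, as in the figure.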
Version space boundaries
Bound the version space by giving its most specific (S) and most general (G) borders:
• S: rectangles that cannot become smaller without excluding some +
• G: rectangles that cannot become larger without including some -
Any hypothesis h consistent with the data
• must be more general than some element of S
• must be more specific than some element of G
Thus, G and S completely specify the VS.
Example, continued

So what are S and G here?
(figure: h1 is the most specific hypothesis; h2 and h3 are two most general hypotheses)
S = {h1}, G = {h2, h3}
Computing the version space

Computing G and S is sufficient to know the full version space
Algorithms in Mitchell's book:
• FindS: computes only the S set
  – S is always a singleton in Mitchell's examples
• Candidate Elimination: computes S and G
Candidate Elimination algorithm: demonstration with rectangles

Algorithm: see Mitchell
Representation:
• concepts are rectangles
• a rectangle is represented with 2 attributes: <Xmin-Xmax, Ymin-Ymax>
Graphical representation: a hypothesis is consistent with the data if
• all + are inside the rectangle
• no - is inside the rectangle
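The consistency criterion on this slide is easy to state as code. A sketch, using the example points from the walkthrough that follows:

```python
# A rectangle hypothesis is consistent with the data if it contains
# all + examples and no - examples. Rectangles are (xlo, xhi, ylo, yhi);
# examples are ((x, y), label) pairs.

def covers(h, point):
    xlo, xhi, ylo, yhi = h
    x, y = point
    return xlo <= x <= xhi and ylo <= y <= yhi

def consistent(h, examples):
    return all(covers(h, p) == (label == "+") for p, label in examples)

data = [((3, 2), "+"), ((5, 3), "+"), ((5, 4), "-"), ((2, 4), "-")]
print(consistent((3, 5, 2, 3), data))   # True: the final S boundary
print(consistent((1, 6, 1, 6), data))   # False: covers the - examples too
```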
• Start: S = {none}, G = {all}
S = {<,>}, G = {<1-6, 1-6>}
(figure: empty 6×6 grid on which S and G are drawn)
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
S = {<,>}, G = {<1-6, 1-6>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
S = {<3-3, 2-2>}, G = {<1-6, 1-6>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
S = {<3-3, 2-2>}, G = {<1-6, 1-6>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
• G is changed to avoid covering e2
  – note: G now consists of 2 parts
  – each part covers all + and no -
S = {<3-3, 2-2>}, G = {<1-4, 1-6>, <1-6, 1-3>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
• G is changed to avoid covering e2
• Example e3 = (2,4): - appears, covered by G
S = {<3-3, 2-2>}, G = {<1-4, 1-6>, <1-6, 1-3>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
• G is changed to avoid covering e2
• Example e3 = (2,4): - appears, covered by G
• One part of G is affected & reduced
S = {<3-3, 2-2>}, G = {<3-4, 1-6>, <1-6, 1-3>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
• G is changed to avoid covering e2
• Example e3 = (2,4): - appears, covered by G
• One part of G is affected & reduced
• Example e4 = (5,3): + appears, not covered by S
S = {<3-3, 2-2>}, G = {<3-4, 1-6>, <1-6, 1-3>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
• G is changed to avoid covering e2
• Example e3 = (2,4): - appears, covered by G
• One part of G is affected & reduced
• Example e4 = (5,3): + appears, not covered by S
• S is extended to cover e4
S = {<3-5, 2-3>}, G = {<3-4, 1-6>, <1-6, 1-3>}
• Start: S = {none}, G = {all}
• Example e1 = (3,2): + appears, not covered by S
• S is extended to cover e1
• Example e2 = (5,4): - appears, covered by G
• G is changed to avoid covering e2
• Example e3 = (2,4): - appears, covered by G
• One part of G is affected & reduced
• Example e4 = (5,3): + appears, not covered by S
• S is extended to cover e4
• The part of G not covering the new S is removed
S = {<3-5, 2-3>}, G = {<1-6, 1-3>}
The current version space contains all rectangles covering S and covered by G, e.g. h = <2-5, 2-3>

S = {<3-5, 2-3>}, G = {<1-6, 1-3>}
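The walkthrough above can be replayed in code. This is a compact, hedged sketch of Candidate Elimination specialised to the rectangle hypothesis space on an integer grid 1..6, not the book's exact pseudocode; hypotheses are (xlo, xhi, ylo, yhi) tuples and None stands for the empty (most specific) hypothesis.

```python
def covers(h, p):
    return h is not None and h[0] <= p[0] <= h[1] and h[2] <= p[1] <= h[3]

def contains(outer, inner):
    """outer is at least as general as inner (inner lies inside outer)."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])

def generalize(h, p):
    """Smallest rectangle covering both h and the point p."""
    if h is None:
        return (p[0], p[0], p[1], p[1])
    return (min(h[0], p[0]), max(h[1], p[0]), min(h[2], p[1]), max(h[3], p[1]))

def specialize(g, p, s):
    """Minimal shrinks of g that exclude p but still contain the S rectangle."""
    xlo, xhi, ylo, yhi = g
    cands = [(p[0] + 1, xhi, ylo, yhi), (xlo, p[0] - 1, ylo, yhi),
             (xlo, xhi, p[1] + 1, yhi), (xlo, xhi, ylo, p[1] - 1)]
    return [c for c in cands
            if c[0] <= c[1] and c[2] <= c[3] and s is not None and contains(c, s)]

def candidate_elimination(examples, domain=(1, 6, 1, 6)):
    S, G = None, [domain]
    for p, label in examples:
        if label == "+":
            if not covers(S, p):
                S = generalize(S, p)                  # extend S minimally
            G = [g for g in G if contains(g, S)]      # drop G parts not covering S
        else:
            new_G = []
            for g in G:
                new_G.extend(specialize(g, p, S) if covers(g, p) else [g])
            # prune G members that are more specific than another member
            G = [g for g in new_G
                 if not any(h != g and contains(h, g) for h in new_G)]
    return S, G

data = [((3, 2), "+"), ((5, 4), "-"), ((2, 4), "-"), ((5, 3), "+")]
S, G = candidate_elimination(data)
print(S, G)   # (3, 5, 2, 3) [(1, 6, 1, 3)] -- the S and G found on the slides
```

Running it on the four examples reproduces the boundaries of the walkthrough, including the intermediate split of G into two parts after the first negative example.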
Interesting points:
• we use an extended notion of generality here
  – in the book: ∅ < value < ?
  – here: e.g. ∅ < 2-3 < 2-5 < 1-5 < ?
• we still use a conjunctive concept definition
  – each concept is 1 rectangle
  – this could be extended as well (but it gets complicated)
Difficulties with version space approaches

The idea of the VS provides a nice theoretical framework
But it is not very useful for most practical problems
Difficulties with these approaches:
• not very efficient
  – the borders G and S may be very large (they may grow exponentially)
• not noise resistant
  – the VS "collapses" when no consistent hypothesis exists
  – often we would like to find the "best" hypothesis in that case
• in Mitchell's examples: only conjunctive definitions
We will compare with other approaches...
Inductive bias

After having seen a limited number of examples, we believe we can make predictions for unseen cases.
From seen cases to unseen cases = the inductive leap
Why do we believe this? Is there any guarantee that this prediction will be correct? What extra assumptions do we need to guarantee correctness?
Inductive bias: a minimal set of extra assumptions that guarantees the correctness of the inductive leap
Equivalence between inductive and deductive systems

inductive system: training examples + new instance → result (by inductive leap)
deductive system: training examples + new instance + inductive bias → result (by proof)
Definition of inductive bias

A more formal definition of inductive bias (Mitchell):
• L(x,D) denotes the classification assigned to instance x by learner L after training on D
• the inductive bias of L is any minimal set of assertions B such that, for any target concept c and corresponding training examples D,
  ∀x ∈ X: (B ∧ D ∧ x) ⊢ L(x,D)
Effect of inductive bias

Different learning algorithms give different results on the same dataset because each may have a different bias
A stronger bias means less learning: more is assumed in advance
Is learning possible without any bias at all? I.e., "pure" learning, without any assumptions in advance?
The answer is no.
Inductive bias of version spaces

The bias of the candidate elimination algorithm: the target concept is in H
• H typically consists of conjunctive concepts
• in our previous illustration: rectangles
• H could be extended towards disjunctive concepts
Is it possible to use version spaces with H = the set of all imaginable concepts, thereby eliminating all bias?
Unbiased version spaces
Let U be the example domain
Unbiased: the target concept C can be any subset of U
• hence, H = 2^U
Consider VS(H,D) with D a strict subset of U
Assume you see an unseen instance x (x ∈ U \ D)
For each h ∈ VS that predicts x ∈ C, there is an h' ∈ VS that predicts x ∉ C, and vice versa
• just take h = h' ∪ {x}: since x ∉ D, h and h' are exactly the same w.r.t. D; so either both are in the VS, or neither is
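The argument above can be checked exhaustively on a tiny domain. U, D, and x below are toy values chosen for illustration.

```python
# With an unbiased H = 2^U, the hypotheses consistent with D split exactly
# evenly on any unseen instance x, so the version space cannot predict.

from itertools import combinations

U = [1, 2, 3, 4]
D = {1: True, 2: False}     # labelled examples: 1 is in the concept, 2 is not
x = 3                       # an unseen instance (in U but not in D)

# enumerate H = 2^U and keep the hypotheses consistent with D
H = [set(c) for r in range(len(U) + 1) for c in combinations(U, r)]
VS = [h for h in H if all((e in h) == pos for e, pos in D.items())]

votes_in = sum(1 for h in VS if x in h)
votes_out = len(VS) - votes_in
print(len(VS), votes_in, votes_out)   # 4 2 2: an exact tie, no inductive leap
```

The tie holds for every choice of D and unseen x, because adding or removing x pairs the consistent hypotheses up exactly as the slide describes.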
Conclusion: version spaces without any bias do not allow generalisation
To be able to make an inductive leap, some bias is necessary.
We will see many different learning algorithms that all differ in their inductive bias.
When choosing one in practice, bias should be an important criterion
• unfortunately, it is not always well understood…
To remember
Definition of version space, importance of generality ordering for searching
Definition of inductive bias, practical importance, why it is necessary for learning, how it relates inductive systems to deductive systems
3 Induction of decision trees

What are decision trees?
How can they be induced automatically?
• top-down induction of decision trees
• avoiding overfitting
• converting trees to rules
• alternative heuristics
• a generic TDIDT algorithm

Mitchell, Ch. 3
What are decision trees?

They represent sequences of tests
• according to the outcome of a test, perform a new test
• continue until the result is known
Cf. guessing a person using only yes/no questions:
• ask some question
• depending on the answer, ask a new question
• continue until the answer is known
Example decision tree 1

Mitchell's example: play tennis or not? (depending on weather conditions)

Outlook?
├─ Sunny → Humidity?
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rainy → Wind?
    ├─ Strong → No
    └─ Weak → Yes
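The tree above can be written directly as code: a decision tree is just nested attribute tests ending in a classification.

```python
# The PlayTennis tree as nested tests (True = Yes, False = No).

def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny":
        return humidity == "Normal"   # High -> No, Normal -> Yes
    if outlook == "Overcast":
        return True                   # always Yes
    if outlook == "Rainy":
        return wind == "Weak"         # Strong -> No, Weak -> Yes

print(play_tennis("Sunny", "High", "Weak"))      # -> False (No)
print(play_tennis("Rainy", "Normal", "Weak"))    # -> True (Yes)
```

Note that Wind is only consulted on the Rainy branch and Humidity only on the Sunny branch: a tree need not test every attribute on every path.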
Example decision tree 2

Again from Mitchell: a tree for predicting whether a C-section is necessary
Leaves are not pure here; the ratio pos/neg is given

Fetal_Presentation?
├─ 1 → Previous_Csection?
│   ├─ 0 → Primiparous? → … …
│   └─ 1 → - [3+, 29-]  (.11+ .89-)
├─ 2 → - [8+, 22-]  (.27+ .73-)
└─ 3 → + [55+, 35-]  (.61+ .39-)
Representation power

Typically:
• examples are represented by an array of attributes
• 1 node in the tree tests the value of 1 attribute
• 1 child node for each possible outcome of the test
• leaf nodes assign a classification
Note:
• a tree can represent any boolean function
  – i.e., also disjunctive concepts (<-> the VS examples)
• a tree can allow for noise (non-pure leaves)
Representing boolean formulae

E.g., A ∧ B:

A?
├─ true → B?
│   ├─ true → true
│   └─ false → false
└─ false → false

Similarly (try yourself): A ∨ B, A xor B, (A ∧ B) ∨ (C ∧ D ∧ E), "M of N" (at least M out of N propositions are true)
What about the complexity of the tree vs. the complexity of the original formula?
Classification, regression and clustering trees

Classification trees represent a function X → C with C discrete (like the decision trees we just saw)
Regression trees predict numbers in their leaves
• could use a constant (e.g., the mean), or a linear regression model, or …
Clustering trees just group the examples in their leaves
Most (but not all) research in machine learning focuses on classification trees
![Page 70: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/70.jpg)
Example decision tree 3 (from study of river water quality)
"Data mining" application
Given: descriptions of river water samples
  biological description: occurrence of organisms in the water ("abundance", graded 0-5)
  chemical description: 16 variables (temperature, concentrations of chemicals (NH4, ...))
Question: characterize the chemical properties of the water using the organisms that occur
![Page 71: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/71.jpg)
Clustering tree (the leaf values are "standardized": the number of standard deviations above the mean):

  abundance(Tubifex sp., 5)?
    yes -> T = 0.357111, pH = -0.496808, cond = 1.23151, O2 = -1.09279, O2sat = -1.04837, CO2 = 0.893152, hard = 0.988909, NO2 = 0.54731, NO3 = 0.426773, NH4 = 1.11263, PO4 = 0.875459, Cl = 0.86275, SiO2 = 0.997237, KMnO4 = 1.29711, K2Cr2O7 = 0.97025, BOD = 0.67012
    no  -> abundance(Sphaerotilus natans, 5)?
             yes -> T = 0.0129737, pH = -0.536434, cond = 0.914569, O2 = -0.810187, O2sat = -0.848571, CO2 = 0.443103, hard = 0.806137, NO2 = 0.4151, NO3 = -0.0847706, NH4 = 0.536927, PO4 = 0.442398, Cl = 0.668979, SiO2 = 0.291415, KMnO4 = 1.08462, K2Cr2O7 = 0.850733, BOD = 0.651707
             no  -> abundance(...)
![Page 72: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/72.jpg)
Top-Down Induction of Decision Trees
Basic algorithm for TDIDT: (a more formal version later)
  start with the full data set
  find the test that partitions the examples as well as possible
    • "good" = examples with the same class, or otherwise similar examples, should be put together
  for each outcome of the test, create a child node
  move the examples to the children according to the outcome of the test
  repeat the procedure for each child that is not "pure"
Main question: how to decide which test is "best"?
![Page 73: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/73.jpg)
Finding the best test (for classification trees)
For classification trees: find test for which children are as “pure” as possible
Purity measure borrowed from information theory: entropy
  entropy is a measure of "missing information"; more precisely, the number of bits needed to represent the missing information, on average, using an optimal encoding
Given a set S with instances belonging to class i with probability p_i:

  Entropy(S) = - Σ_i p_i log2 p_i
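As a quick numeric check of this definition, a short sketch (the `entropy` helper is ours, not from the course):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution, given as a list of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # 0*log(0) is taken as 0
    return -sum(p * log2(p) for p in probs)

print(entropy([9, 5]))    # the [9+,5-] PlayTennis set, about 0.940 bits
print(entropy([7, 7]))    # maximally impure two-class set: exactly 1 bit
print(entropy([14, 0]))   # pure set: 0 bits
```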
![Page 74: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/74.jpg)
Entropy
Intuitive reasoning:
  use a shorter encoding for more frequent messages
  information theory: a message with probability p should get -log2(p) bits
    • e.g. A, B, C, D each with 25% probability: 2 bits for each (00, 01, 10, 11)
    • if some are more probable, it is possible to do better
  the average number of bits per message is then - Σ_i p_i log2 p_i
![Page 75: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/75.jpg)
Entropy
Entropy as a function of p, for 2 classes:

(figure: the entropy curve; 0 at p = 0 and p = 1, maximal (1 bit) at p = 0.5)
![Page 76: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/76.jpg)
Information gain
Heuristic for choosing a test in a node:
  choose the test that on average provides the most information about the class
  this is the test that, on average, reduces the class entropy the most
    • "on average": the class entropy reduction differs according to the outcome of the test
  expected reduction of entropy = information gain

  Gain(S,A) = Entropy(S) - Σ_v |S_v|/|S| · Entropy(S_v)

  (with v ranging over the possible outcomes of test A)
![Page 77: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/77.jpg)
Example

Assume S has 9 + and 5 - examples; partition according to the Humidity or the Wind attribute:

  Humidity:  S: [9+,5-], E = 0.940
    High   -> [3+,4-], E = 0.985
    Normal -> [6+,1-], E = 0.592

  Wind:  S: [9+,5-], E = 0.940
    Weak   -> [6+,2-], E = 0.811
    Strong -> [3+,3-], E = 1.0

  Gain(S, Humidity) = .940 - (7/14)·.985 - (7/14)·.592 = 0.151
  Gain(S, Wind)     = .940 - (8/14)·.811 - (6/14)·1.0  = 0.048
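The slide's numbers can be reproduced with a short script (helper names ours; `gain` implements the formula from the previous slide):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c/total * log2(c/total) for c in counts if c > 0)

def gain(parent, children):
    """Information gain of splitting `parent` (class counts) into `children`."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch)/n * entropy(ch) for ch in children)

g_humidity = gain([9, 5], [[3, 4], [6, 1]])
g_wind     = gain([9, 5], [[6, 2], [3, 3]])
print(round(g_humidity, 3), round(g_wind, 3))
```

This matches the slide's 0.151 and 0.048 up to rounding, and confirms Humidity is the better test.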
![Page 78: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/78.jpg)
Assume Outlook was chosen: continue partitioning in the child nodes

  Outlook  [9+,5-]
    Sunny    -> [2+,3-]  ?
    Overcast -> [4+,0-]  Yes
    Rainy    -> [3+,2-]  ?
![Page 79: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/79.jpg)
Hypothesis space search in TDIDT
Hypothesis space H = the set of all trees
H is searched in a hill-climbing fashion, from simple to complex
![Page 80: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/80.jpg)
Inductive bias in TDIDT
Note: for e.g. boolean attributes, H is complete: each concept can be represented!
  given n attributes, one can keep adding tests until all attributes are tested
So what about inductive bias?
  Clearly no "restriction bias" (H = 2^U, the power set of the instance space U), unlike candidate elimination
  Preference bias: some hypotheses in H are preferred over others
  In this case: a preference for short trees with informative attributes at the top
![Page 81: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/81.jpg)
Occam’s Razor
Preference for simple models over complex models is quite generally used in machine learning
Similar principle in science: Occam's Razor
  roughly: do not make things more complicated than necessary
Reasoning, in the case of decision trees: more complex trees have a higher probability of overfitting the data set
![Page 82: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/82.jpg)
Avoiding Overfitting
Phenomenon of overfitting:
  keep improving a model, making it better and better on the training set by making it more complicated …
  this increases the risk of modelling noise and coincidences in the data set
  and may actually harm the predictive power of the theory on unseen cases
Cf. fitting a curve with too many parameters
![Page 83: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/83.jpg)
Overfitting: example

(figure: a scatter of + and - examples; the learned decision boundary bends around a single + that lies deep inside the - region, creating an area with probably wrong predictions)
![Page 84: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/84.jpg)
Overfitting: effect on predictive accuracy

Typical phenomenon when overfitting:
  training accuracy keeps increasing
  accuracy on an unseen validation set starts decreasing

(figure: accuracy vs. size of tree; the "accuracy on training data" curve keeps rising, while the "accuracy on unseen data" curve peaks and then falls — overfitting starts at that peak)
![Page 85: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/85.jpg)
How to avoid overfitting when building classification trees?
Option 1: stop adding nodes to the tree when overfitting starts occurring
  need a stopping criterion
Option 2: don't bother about overfitting when growing the tree
  after the tree has been built, start pruning it
![Page 86: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/86.jpg)
Stopping criteria

How do we know when overfitting starts?
  a) use a validation set: data not considered for choosing the best test
    • when accuracy goes down on the validation set: stop adding nodes to this branch
  b) use some statistical test
    • significance test: e.g., is the change in class distribution still significant? (χ²-test)
    • MDL: minimal description length principle
      – a fully correct theory = tree + corrections for specific misclassifications
      – minimize size(f.c.t.) = size(tree) + size(misclassifications(tree))
      – cf. Occam's razor
![Page 87: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/87.jpg)
Post-pruning trees
After learning the tree: start pruning branches away
For all nodes in the tree:
  • estimate the effect of pruning the tree at this node on predictive accuracy
    – e.g. using accuracy on a validation set
Prune the node that gives the greatest improvement
Continue until no improvements
Note: this pruning constitutes a second search through the hypothesis space
![Page 88: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/88.jpg)
(figure: the same accuracy vs. size-of-tree curves as before; pruning moves the tree back from the overfitted region toward the size where accuracy on unseen data is maximal)
![Page 89: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/89.jpg)
Comparison
Advantage of Option 1: no superfluous work
But: tests may be misleading
  e.g., validation accuracy may go down briefly, then go up again
Therefore, Option 2 (post-pruning) is usually preferred (though it is more work, computationally)
![Page 90: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/90.jpg)
Turning trees into rules
From a tree, a rule set can be derived:
  a path from root to leaf in the tree = 1 if-then rule
Advantages of such rule sets:
  may increase comprehensibility
  can be pruned more flexibly
    • in 1 rule, 1 single condition can be removed
      – vs. tree: when removing a node, the whole subtree is removed
    • 1 rule can be removed entirely
![Page 91: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/91.jpg)
Rules from trees: example

  Outlook?
    Sunny    -> Humidity? (High -> No, Normal -> Yes)
    Overcast -> Yes
    Rainy    -> Wind? (Strong -> No, Weak -> Yes)

if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
…
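The path-to-rule conversion can be sketched with the tree as a nested dict (our own representation, not from the course):

```python
def tree_to_rules(tree, conditions=()):
    """Flatten a decision tree (nested dicts) into if-then rules.

    A node is {"attr": name, "branches": {value: subtree}}; a leaf is a class label.
    Each root-to-leaf path yields one (conditions, label) rule.
    """
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

play_tennis = {
    "attr": "Outlook",
    "branches": {
        "Sunny": {"attr": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy": {"attr": "Wind", "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

for conds, label in tree_to_rules(play_tennis):
    print("if " + " and ".join(f"{a} = {v}" for a, v in conds) + f" then {label}")
```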
![Page 92: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/92.jpg)
Pruning rules
Possible method:
  1. convert the tree to rules
  2. prune each rule independently
    • remove conditions that do not harm the accuracy of the rule
  3. sort the rules (e.g., most accurate rule first)
    • before pruning: each example is covered by 1 rule
    • after pruning, 1 example might be covered by multiple rules
    • therefore some rules might contradict each other
![Page 93: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/93.jpg)
Pruning rules: example

Tree representing A ∨ B:

  A?
    true  -> true
    false -> B?
               true  -> true
               false -> false

if A = true then true
if A = false and B = true then true
if A = false and B = false then false

The rules represent A ∨ (¬A ∧ B) ≡ A ∨ B
![Page 94: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/94.jpg)
Alternative heuristics for choosing tests
Attributes with continuous domains (numbers):
  cannot have a different branch for each possible outcome
  allow, e.g., a binary test of the form Temperature < 20
Attributes with many discrete values:
  unfair advantage over attributes with few values
    • cf. a question with many possible answers is more informative than a yes/no question
  to compensate: divide the gain by the "max. potential gain" SI
    Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A)
    • split information SI(S,A) = - Σ_i |S_i|/|S| · log2(|S_i|/|S|)
    • with i ranging over the different outcomes of test A
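A small sketch of the gain ratio (helper names ours). For the Humidity split of the [9+,5-] set, the two branches have 7 examples each, so SI = 1 bit and the gain ratio equals the plain gain:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c/total * log2(c/total) for c in counts if c > 0)

def split_info(children):
    """SI(S,A): the entropy of the partition sizes themselves."""
    return entropy([sum(ch) for ch in children])

def gain_ratio(parent, children):
    n = sum(parent)
    g = entropy(parent) - sum(sum(ch)/n * entropy(ch) for ch in children)
    return g / split_info(children)

# Humidity split of the [9+,5-] PlayTennis set: branches of 7 and 7 examples
print(split_info([[3, 4], [6, 1]]))        # 1 bit
print(gain_ratio([9, 5], [[3, 4], [6, 1]]))
```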
![Page 95: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/95.jpg)
Tests may have different costs
  e.g. medical diagnosis: blood tests, visual examination, … have different costs
  try to find a tree with low expected cost
    • instead of a low expected number of tests
  alternative heuristics, taking cost into account, have been proposed
![Page 96: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/96.jpg)
Properties of good heuristics

Many alternatives exist:
  ID3 uses information gain or gain ratio
  CART uses the "Gini criterion" (not discussed here)
Q: Why not simply use accuracy as a criterion?

  Split A1: [80-, 20+] -> [40-, 0+] and [40-, 20+]
  Split A2: [80-, 20+] -> [40-, 10+] and [40-, 10+]

How would accuracy and information gain rate these splits?
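Working the question out numerically (helper names ours) shows why plain accuracy is a poor splitting criterion: both splits leave the same 20 examples misclassified, yet A1 makes real progress (one pure child) while A2 changes nothing:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c/total * log2(c/total) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch)/n * entropy(ch) for ch in children)

def accuracy(children):
    """Accuracy when each child predicts its majority class."""
    return sum(max(ch) for ch in children) / sum(sum(ch) for ch in children)

a1 = [[40, 0], [40, 20]]   # split A1 of the [80-, 20+] node
a2 = [[40, 10], [40, 10]]  # split A2

print(accuracy(a1), accuracy(a2))              # identical
print(gain([80, 20], a1), gain([80, 20], a2))  # A1 gains information, A2 gains none
```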
![Page 97: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/97.jpg)
Heuristics compared
Good heuristics are strictly concave
![Page 98: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/98.jpg)
Why concave functions?

(figure: the entropy curve E as a function of p; a chord connects (p1, E1) and (p2, E2), and the gain is the vertical distance between the curve at p and the chord at p)

Assume a node with size n, entropy E and proportion of positives p is split into 2 nodes with (n1, E1, p1) and (n2, E2, p2).
We have p = (n1/n)·p1 + (n2/n)·p2,
and the new average entropy E' = (n1/n)·E1 + (n2/n)·E2 is therefore found by linear interpolation between (p1,E1) and (p2,E2) at p.
Gain = the difference in height between (p, E) and (p, E').
With a strictly concave heuristic, this gain is positive whenever p1 ≠ p2.
![Page 99: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/99.jpg)
Handling missing values

What if the result of a test is unknown for an example?
  e.g. because the value of the attribute is unknown
Some possible solutions, when training:
  guess the value: just take the most common value (among all examples, among examples in this node / class, …)
  assign the example partially to the different branches
    • e.g. it counts for 0.7 in the yes subtree, 0.3 in the no subtree
When using the tree for prediction:
  assign the example partially to the different branches
  combine the predictions of the different branches
![Page 100: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/100.jpg)
function TDIDT(E: set of examples) returns tree;
  T' := grow_tree(E);
  T := prune(T');
  return T;

function grow_tree(E: set of examples) returns tree;
  T := generate_tests(E);
  t := best_test(T, E);
  P := partition induced on E by t;
  if stop_criterion(E, P)
  then return leaf(info(E))
  else
    for all Ej in P: tj := grow_tree(Ej);
    return node(t, {(j, tj)});
Generic TDIDT algorithm
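A minimal runnable instantiation of this generic scheme, under our own simplifying assumptions (tests are attribute = value, best_test is information gain, stop when pure or out of attributes, no pruning):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def grow_tree(examples, attrs):
    """examples: list of (dict of attribute values, label). Returns a nested dict or a label."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attrs:            # stop_criterion: pure, or no tests left
        return Counter(labels).most_common(1)[0][0]   # info: most frequent class
    def gain(a):                                      # best_test: information gain
        parts = {}
        for x, y in examples:
            parts.setdefault(x[a], []).append(y)
        return entropy(labels) - sum(len(p)/len(labels) * entropy(p) for p in parts.values())
    best = max(attrs, key=gain)
    children = {}
    for v in {x[best] for x, _ in examples}:          # partition induced by the test
        subset = [(x, y) for x, y in examples if x[best] == v]
        children[v] = grow_tree(subset, [a for a in attrs if a != best])
    return {"attr": best, "branches": children}

# toy data: play tennis iff Outlook is not Rainy
data = [({"Outlook": "Sunny"}, "Yes"), ({"Outlook": "Overcast"}, "Yes"),
        ({"Outlook": "Rainy"}, "No"), ({"Outlook": "Rainy"}, "No")]
tree = grow_tree(data, ["Outlook"])
print(tree)
```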
![Page 101: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/101.jpg)
For classification...
  prune: e.g. reduced-error pruning, ...
  generate_tests: Attr = val, Attr < val, ...
    for numeric attributes: generate val
  best_test: Gain, Gain ratio, ...
  stop_criterion: MDL, significance test (e.g. χ²-test), ...
  info: most frequent class ("mode")
Popular systems: C4.5 (Quinlan 1993), C5.0 (www.rulequest.com)
![Page 102: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/102.jpg)
For regression... change:
  best_test: e.g. minimize average variance
  info: mean
  stop_criterion: significance test (e.g., F-test), ...

Example: two candidate splits of the examples with target values {1,3,4,7,8,12}:
  A1: {1,4,12} vs {3,7,8}
  A2: {1,3,7} vs {4,8,12}
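For these six values, the better split can be checked directly (helper names ours):

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def avg_variance(children):
    """Size-weighted average variance of the children of a split."""
    n = sum(len(ch) for ch in children)
    return sum(len(ch)/n * variance(ch) for ch in children)

a1 = [[1, 4, 12], [3, 7, 8]]
a2 = [[1, 3, 7], [4, 8, 12]]
print(avg_variance(a1), avg_variance(a2))
```

With these numbers A2 gives the lower average variance (about 8.44 vs. 13.11), so it would be preferred.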
![Page 103: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/103.jpg)
CART

Classification and regression trees (Breiman et al., 1984)
  Classification: info: mode, best_test: Gini
  Regression: info: mean, best_test: variance
  prune: "error complexity" pruning
    penalty α for each node
    the higher α, the smaller the tree will be
    the optimal α is obtained empirically (cross-validation)
![Page 104: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/104.jpg)
n-dimensional target spaces

Instead of predicting 1 number, predict a vector of numbers:
  info: mean vector
  best_test: variance (mean squared distance) in the n-dimensional space
  stop_criterion: F-test
What about mixed vectors (numbers and symbols)?
  use an appropriate distance measure -> "clustering trees"
![Page 105: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/105.jpg)
(the clustering tree from the river water quality study, shown earlier, is repeated here as an example: each leaf predicts a vector of 16 standardized chemical values)
![Page 106: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/106.jpg)
To Remember
Decision trees & their representational power
Generic TDIDT algorithm and how to instantiate its parameters
Search through the hypothesis space, bias, tree-to-rule conversion
For classification trees: details on heuristics, handling missing values, pruning, …
Some general concepts: overfitting, Occam's razor
![Page 107: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/107.jpg)
4 Neural networks
(Brief summary - studied in detail in other courses)
Basic principle of artificial neural networks
Perceptrons and multi-layer neural networks
Properties
Mitchell, Ch. 4
![Page 108: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/108.jpg)
Artificial neural networks
Modelled after biological neural systems
Complex systems built from very simple units
1 unit = neuron
  • has multiple inputs and outputs, connecting the neuron to other neurons
  • when the input signal is sufficiently strong, the neuron fires (i.e., propagates the signal)
![Page 109: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/109.jpg)
An ANN consists of:
  neurons
  connections between them
    • these connections have weights associated with them
  input and output
ANNs can learn to associate inputs to outputs by adapting the weights
For instance (classification):
  inputs = pixels of a photo
  outputs = classification of the photo (person? tree? …)
![Page 110: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/110.jpg)
Perceptrons
Simplest type of neural network
A perceptron simulates 1 neuron
Fires if the sum of (inputs × weights) exceeds some threshold
Schematically:

  inputs x1 … x5 with weights w1 … w5
  the unit computes X = Σ_i w_i·x_i
  threshold function: Y = -1 if X < t, Y = 1 otherwise
![Page 111: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/111.jpg)
2-input perceptron
represent the inputs in 2-D space
the perceptron learns a function of the following form:
  if aX + bY > c then +1, else -1
i.e., it creates a linear separation between the classes + and -
![Page 112: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/112.jpg)
n-input perceptrons
In general, perceptrons construct a hyperplane in an n-dimensional space
  one side of the hyperplane = +, the other side = -
Hence, classes must be linearly separable, otherwise the perceptron cannot learn them
E.g.: learning boolean functions
  encode true/false as +1, -1
  is there a perceptron that encodes 1. A and B? 2. A or B? 3. A xor B?
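A sketch with hand-picked weights (our own illustrative values): perceptrons for AND and OR exist, while no choice of weights and threshold handles xor, since it is not linearly separable:

```python
def perceptron(weights, threshold):
    """Build a threshold unit: fires (+1) iff sum(w_i * x_i) > threshold."""
    return lambda xs: 1 if sum(w * x for w, x in zip(weights, xs)) > threshold else -1

# inputs encoded as +1 (true) / -1 (false)
AND = perceptron([1, 1], 1.5)    # fires only when both inputs are +1
OR  = perceptron([1, 1], -1.5)   # fires unless both inputs are -1

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
print([AND(x) for x in inputs])  # [-1, -1, -1, 1]
print([OR(x) for x in inputs])   # [-1, 1, 1, 1]
```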
![Page 113: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/113.jpg)
Multi-layer networks
Increase representation power by combining neurons in a network:

(figure: inputs X and Y feed two neurons in a hidden layer, with weights +1 and -1; the hidden neurons feed a single neuron in the output layer, with weights -1; such a network can compute a function, e.g. xor, that a single perceptron cannot represent)
![Page 114: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/114.jpg)
"Sigmoid" function instead of a crisp threshold:
  changes continuously instead of in 1 step
  has advantages for training multi-layer networks

  f(x) = 1 / (1 + e^(-x))

(figure: the sigmoid curve, rising smoothly from 0 to 1 around x = 0)
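A minimal sketch of the function and a property that matters for training:

```python
from math import exp

def sigmoid(x):
    """Smooth threshold: maps any real x into (0, 1)."""
    return 1 / (1 + exp(-x))

print(sigmoid(0))           # 0.5: the midpoint of the transition
print(sigmoid(5) > 0.99)    # saturates toward 1 for large x
# handy identity used when training by gradient descent: f'(x) = f(x) * (1 - f(x))
```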
![Page 115: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/115.jpg)
The non-linear sigmoid function causes non-linear decision surfaces
  e.g., 5 areas for 5 classes a, b, c, d, e
Very powerful representation
![Page 116: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/116.jpg)
Note: the previous network had 2 layers of neurons
Layered feedforward neural networks:
  neurons organised in n layers
  each layer has the output of the previous layer as input
    • neurons fully interconnected
  successive layers = different representations of the input
2-layer feedforward networks are very popular…
… but many other architectures are possible!
  e.g. recurrent NNs
![Page 117: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/117.jpg)
Example: a 2-layer net representing the ID function
  8 input patterns, each mapped to the same pattern in the output
  the network converges to a binary representation in the hidden layer, for instance:

  pattern 1 -> 101    pattern 5 -> 000
  pattern 2 -> 100    pattern 6 -> 010
  pattern 3 -> 011    pattern 7 -> 110
  pattern 4 -> 111    pattern 8 -> 001
![Page 118: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/118.jpg)
Training neural networks
Trained by adapting the weights
Popular algorithm: backpropagation
  minimizes the error through gradient descent
  principle: the output error of a layer is attributed to
    • 1: the weights of the connections in that layer
      – adapt these weights
    • 2: the inputs of that layer (except if it is the first layer)
      – "backpropagate" the error to these inputs
      – now use the same principle to adapt the weights of the previous layer
Iterative process, may be slow
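For a single neuron there is no previous layer to propagate to, but the gradient-descent half can be sketched (our own toy example: a lone sigmoid unit learning OR by the delta rule on squared error):

```python
from math import exp

def sigmoid(x):
    return 1 / (1 + exp(-x))

# training data for OR, inputs in {0,1}
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0, 0.0]          # [w1, w2, bias]
lr = 0.5

for _ in range(2000):        # gradient descent on the squared error
    for (x1, x2), target in data:
        out = sigmoid(w[0]*x1 + w[1]*x2 + w[2])
        delta = (target - out) * out * (1 - out)   # error * derivative of sigmoid
        w[0] += lr * delta * x1
        w[1] += lr * delta * x2
        w[2] += lr * delta      # the bias input is always 1

predict = lambda x1, x2: sigmoid(w[0]*x1 + w[1]*x2 + w[2])
print(predict(0, 0) < 0.5, predict(0, 1) > 0.5)
```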
![Page 119: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/119.jpg)
Properties of neural networks
Useful for modelling complex, non-linear functions of numerical inputs & outputs
  symbolic inputs/outputs are representable using some encoding, cf. true/false = 1/-1
  2- or 3-layer networks can approximate a huge class of functions (if there are enough neurons in the hidden layers)
Robust to noise
  but: risk of overfitting! (because of the high expressiveness)
    • may happen when training for too long
  usually handled using e.g. validation sets
![Page 120: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/120.jpg)
All inputs have some effect
  cf. decision trees: selection of the most important attributes
Explanatory power of ANNs is limited:
  the model is represented as weights in the network
  no simple explanation why the network makes a certain prediction
    • contrast with e.g. trees: they can give a "rule" that was used
![Page 121: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/121.jpg)
Hence, ANNs are good when:
  input and output are high-dimensional (numeric or symbolic)
  interpretability of the model is unimportant
Examples:
  typical: image recognition, speech recognition, …
    • e.g. images: one input per pixel
    • see http://www.cs.cmu.edu/~tom/faces.html for an illustration
  less typical: symbolic problems
    • cases where e.g. trees would work too
    • the performance of networks and trees is then often comparable
![Page 122: Machine Learning and Inductive Inference](https://reader035.vdocuments.net/reader035/viewer/2022081721/5487093cb47959e20c8b53ec/html5/thumbnails/122.jpg)
To remember
Perceptrons, neural networks: inspiration, what they are, how they work, representation power, explanatory power