
Page 1: MAI0203 Lecture 13: Machine Learning

MAI0203 Lecture 13: Machine Learning

Methods of Artificial Intelligence, WS 2002/2003

Part III: Special Aspects

III.13 Machine Learning


Page 2: MAI0203 Lecture 13: Machine Learning

Learning

Remember “Deduction vs. Induction”. Induction: derive a general rule (axiom) from background knowledge and observations.

Socrates is a human. (background knowledge)
Socrates is mortal. (observation/example)
Therefore, I hypothesize that all humans are mortal. (generalization)

Most important approach to learning: inductive learning. Others: rote learning, adaptation/re-structuring, speed-up learning.

Deduction is knowledge extraction, induction is knowledge acquisition!


Page 3: MAI0203 Lecture 13: Machine Learning

Intelligent Systems must Learn

A flexible and adaptive organism cannot rely on a fixed set of behavioral rules but must learn (over its complete life-span)!

If an expert system – brilliantly designed, engineered and implemented – cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten.

(O. Selfridge)


Page 4: MAI0203 Lecture 13: Machine Learning

Learning vs. Knowledge Engineering

Technological advantage: learning avoids the “knowledge engineering bottleneck”!

Good analysis of data/observations can often be superior to explicit programming.

E.g., an expert system for medical diagnosis:
Describe each patient by his/her clinical data,
let physicians decide whether some disease is implied or not,
learn the classification rule (relevant features and their dependencies).

Epistemological problem: induced information can never be “sure” knowledge; it has hypothetical status and is open to revision!


Page 5: MAI0203 Lecture 13: Machine Learning

Classification/Concept Learning

Classification/concept learning is the best-researched area in machine learning.

Representation of a concept:
Extensional: the (infinite) set of all entities belonging to a class/concept.
Intensional: a finite characterization, e.g. a rule such as T(x) ← has-3/4-Legs(x) ∧ has-Top(x).

Learning: construction of a finite characterization from a subset of entities (training examples).

Typically, supervised learning: pre-classified positive and negative examples for the concept/class.

In the following: decision tree learning as a symbolic approach to inductive learning of concepts (Quinlan, ID3, C4.5).


Page 6: MAI0203 Lecture 13: Machine Learning

Decision Tree Learning/Training Examples

Fictional example: learn in which regions of the world the Tse-Tse fly is living.

Training examples can be presented sequentially (incremental learning) or all at once (batch learning).

Nr  Vegetation  Longitude   Humidity  Altitude  Tse-Tse fly
1   swamp       near eq.    high      high      occurs
2   grassland   near eq.    high      low       occurs not
3   forest      far f. eq.  low       high      occurs not
4   grassland   far f. eq.  high      high      occurs not
5   forest      near eq.    high      low       occurs
6   grassland   near eq.    low       low       occurs not
7   swamp       near eq.    low       low       occurs
8   swamp       far f. eq.  low       low       occurs not


Page 7: MAI0203 Lecture 13: Machine Learning

Feature Vectors

Examples are represented as feature vectors (x1, x2, x3, x4).

Attributes and their value ranges:
Vegetation: x1 ∈ {swamp, grassland, forest}
Longitude: x2 ∈ {near eq., far from eq.}
Humidity: x3 ∈ {high, low}
Altitude: x4 ∈ {high, low}
Class ∈ {occurs, occurs not}
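As a concrete illustration (not part of the original slides), the eight training examples can be written down as feature vectors in Python; the attribute names and value strings are taken directly from the table above:

```python
# Tse-Tse fly training examples as feature vectors (x1, x2, x3, x4) plus class label.
ATTRIBUTES = ["vegetation", "longitude", "humidity", "altitude"]  # x1 .. x4

TRAINING_EXAMPLES = [
    # (vegetation, longitude,    humidity, altitude, class)
    ("swamp",     "near eq.",    "high",   "high",   "occurs"),
    ("grassland", "near eq.",    "high",   "low",    "occurs not"),
    ("forest",    "far f. eq.",  "low",    "high",   "occurs not"),
    ("grassland", "far f. eq.",  "high",   "high",   "occurs not"),
    ("forest",    "near eq.",    "high",   "low",    "occurs"),
    ("grassland", "near eq.",    "low",    "low",    "occurs not"),
    ("swamp",     "near eq.",    "low",    "low",    "occurs"),
    ("swamp",     "far f. eq.",  "low",    "low",    "occurs not"),
]
```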


Page 8: MAI0203 Lecture 13: Machine Learning

DT Learning/Hypothesis

A possible decision tree: the root tests x1 (vegetation). The branch grassland leads directly to the leaf “NO Tse-Tse Fly”; the branches swamp and forest each lead to a test of x2 (longitude), where near eq. yields the leaf “Tse-Tse Fly” and far from eq. yields “NO Tse-Tse Fly”.

Hypotheses: IF-THEN classification rules. IF: a disjunction of conjunctions of constraints over attribute values. THEN: a decision for a class.

Each path is a rule:
IF grassland THEN no Tse-Tse fly
IF (swamp AND near equator) THEN Tse-Tse fly
...

Generalization: only relevant attributes are used, and the rules can be applied to classify unseen examples.
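The learned rules can be executed directly; a minimal sketch in Python (the function name and value strings are illustrative, following the tree above):

```python
def classify(vegetation, longitude):
    """Apply the IF-THEN rules read off the decision tree above."""
    if vegetation == "grassland":
        return "occurs not"       # IF grassland THEN no Tse-Tse fly
    if longitude == "near eq.":
        return "occurs"           # IF (swamp/forest AND near equator) THEN Tse-Tse fly
    return "occurs not"           # swamp/forest far from the equator: no Tse-Tse fly

print(classify("swamp", "near eq."))     # -> occurs
print(classify("forest", "far f. eq."))  # -> occurs not
```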


Page 9: MAI0203 Lecture 13: Machine Learning

DT Learning/Basic Algorithm

ID3(examples, attributes, target-class):

1. Create a root node.
2. IF all examples are positive, return the root with label “target-class”.
3. IF all examples are negative, return the root with label “¬ target-class”.
4. IF attributes is empty, return the root and use the most common value for the target class in the examples as label.
5. Otherwise:
(a) Select an attribute A from attributes.
(b) Assign A to the root.
(c) For each possible value v_i of A:
i. Add a new branch below the root, corresponding to the test A = v_i.
ii. Let examples_vi be the subset of examples that have value v_i for A.
iii. IF examples_vi is empty THEN add a leaf node and use the most common value for the target class in the examples as label, ELSE ID3(examples_vi, attributes − {A}, target-class).
6. Return the tree.
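A compact Python sketch of this basic algorithm (illustrative only; the attribute-selection strategy is passed in as a function, e.g. the information-gain criterion introduced below):

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(examples, attributes, value_ranges, select_attribute):
    """examples: list of (feature_dict, label) pairs.
    attributes: attribute names still available for splitting.
    value_ranges: dict mapping each attribute to its possible values.
    select_attribute: function choosing the split attribute (e.g. by information gain)."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                   # steps 2/3: pure node -> leaf
        return labels[0]
    if not attributes:                          # step 4: no attributes left -> majority class
        return majority(labels)
    a = select_attribute(examples, attributes)  # step 5(a)
    tree = {a: {}}                              # step 5(b): a becomes the root test
    for v in value_ranges[a]:                   # step 5(c): one branch per value
        subset = [(f, l) for f, l in examples if f[a] == v]
        if not subset:                          # empty branch -> majority class
            tree[a][v] = majority(labels)
        else:
            tree[a][v] = id3(subset, [b for b in attributes if b != a],
                             value_ranges, select_attribute)
    return tree                                 # step 6
```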


Page 10: MAI0203 Lecture 13: Machine Learning

Example

Splitting on x1 (vegetation) partitions the training examples into swamp → {1, 7, 8}, forest → {3, 5}, and grassland → {2, 4, 6}. The grassland subset is pure, so it becomes the leaf node “NO Tse-Tse Fly”; for the swamp and forest subsets the next attribute has to be selected. Splitting the swamp subset {1, 7, 8} on x2 (longitude) gives near eq. → {1, 7} (“Tse-Tse Fly”) and far from eq. → {8} (“NO Tse-Tse Fly”).


Page 11: MAI0203 Lecture 13: Machine Learning

Selecting an Attribute

Naive: from the list of attributes in some arbitrary order, always take the next one. Problem: the DT might be more complex than necessary and contain branches which are irrelevant.

Selection by information gain → DT with minimal depth. What is the contribution of an attribute to deciding whether an entity belongs to the class or not?

Entropy of a collection S:

$E(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$

with $p_i$ as the proportion (relative frequency) of objects belonging to category $i$.

Logarithm to base 2: information theory, expected encoding length measured in bits.


Page 12: MAI0203 Lecture 13: Machine Learning

Entropy/Illustration

S: flipping a fair coin, $P(\text{head up}) = P(\text{tail up}) = \tfrac{1}{2}$:

$E(\text{Coin}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit

Flipping a coin that always shows heads ($P(\text{head up}) = 1$): $E = 0$ bits. Flipping a biased coin: the entropy lies between 0 and 1 bit.

Entropy of a set of training examples: proportion of examples belonging to the class vs. not belonging to the class. For the Tse-Tse fly training set (3 positive, 5 negative examples):

$E(S) = -\tfrac{3}{8}\log_2\tfrac{3}{8} - \tfrac{5}{8}\log_2\tfrac{5}{8} \approx 0.954$

Without looking at any attribute, we have a slight advantage if guessing the more frequent class “no Tse-Tse fly”.
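As a sanity check (not part of the original slide), the entropy values above can be recomputed with a few lines of Python:

```python
from math import log2

def entropy(counts):
    """Entropy of a collection, given absolute class frequencies (0*log2(0) treated as 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([1, 1]))  # fair coin: 1.0 bit
print(entropy([3, 5]))  # Tse-Tse fly training set: ~0.954 bits
```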


Page 13: MAI0203 Lecture 13: Machine Learning

Information Gain

What does the knowledge of the value of an attribute A contribute to predicting the class?

$Gain(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)$

The weighted sum $\sum_{v} \frac{|S_v|}{|S|} E(S_v)$ is the “relative” entropy of A.

Example: attribute x2 “Longitude”. The subset with x2 = near eq. contains 3 positive and 2 negative examples; the subset with x2 = far from eq. contains 0 positive and 3 negative examples:

$E(S_{\text{near eq.}}) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.971$

$E(S_{\text{far f. eq.}}) = -\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0$

(Remark: $0 \cdot \log_2 0$ is undefined; we set it to 0.)

Weighting with the proportion of examples with x2 = near eq. (5/8) and x2 = far f. eq. (3/8):

$Gain(S, x_2) = 0.954 - \tfrac{5}{8} \cdot 0.971 - \tfrac{3}{8} \cdot 0 \approx 0.35$

Remark: In different subtrees, different attributes might give the best information gain.
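The example calculation can be reproduced with a small self-contained Python sketch (the class counts per attribute value are the ones listed above):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """parent_counts: class frequencies at the node; child_counts: one frequency list per attribute value."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

# Attribute x2 "Longitude": near eq. -> 3 occurs / 2 occurs not, far f. eq. -> 0 / 3
print(information_gain([3, 5], [[3, 2], [0, 3]]))  # ~0.348
```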

Page 14: MAI0203 Lecture 13: Machine Learning

Performance Evaluation

A learning algorithm is good if it produces hypotheses which work well when classifying unseen examples.

Validation: use a training set and a test set.
Construct a hypothesis using the training set.
Estimate the generalization error using the test set.

For decision trees working on linearly separable data, the training error is always 0.

Danger: overfitting of the data. If crucial information is missing (relevant features), the DT algorithm might nevertheless find a hypothesis consistent with the examples by using irrelevant attributes. E.g., predicting whether a coin comes up heads or tails based on the day of the week.


Page 15: MAI0203 Lecture 13: Machine Learning

How to Avoid Overfitting

Pruning (Ockham's razor – prefer simple hypotheses): do not use attributes with information gain near zero.

Cross-validation: run the algorithm k times, each time leaving out a different subset of roughly 1/k of the examples.
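A minimal sketch of the leave-out idea (hypothetical helper names; train and evaluate are placeholders for the learner and the error estimate):

```python
def k_fold_splits(examples, k):
    """Yield (training_set, held_out_fold) pairs for k-fold cross-validation."""
    folds = [examples[i::k] for i in range(k)]  # k disjoint subsets of the examples
    for i in range(k):
        held_out = folds[i]
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield training, held_out

# errors = [evaluate(train(tr), held) for tr, held in k_fold_splits(examples, k=4)]
```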


Page 16: MAI0203 Lecture 13: Machine Learning

Characteristics of DT Algorithms

Hypothesis language (DTs): restricted to separating hyperplanes which are parallel to the axes of the feature space.

Inductive biases (no learning without bias):
Each learning algorithm has a language bias: the hypothesis language restricts what can be learned.
Search bias: top-down search in the space of possible decision trees.

DT research was inspired by the psychological work of Bruner, Goodnow, & Austin (1956) on concept formation.

Advantage of the symbolic approach: verbalizable rules which can be directly used in knowledge-based systems.


Page 17: MAI0203 Lecture 13: Machine Learning

DTs vs. Perceptrons: XOR

Perceptron: $o(\vec{x}) = 1$ if $\sum_i w_i x_i > 0$, and $o(\vec{x}) = 0$ otherwise.

Output: calculated as a linear combination of the inputs (thresholded).

Learning: incremental; change the weights $w_i$ when a classification error occurs.

Works for linearly separable classes, e.g. not for XOR:

x1  x2  class
 0   0    0
 0   1    1
 1   0    1
 1   1    0

A decision tree, in contrast, represents XOR easily: the root tests x1; for x1 = 0 a test of x2 yields “no” for x2 = 0 and “yes” for x2 = 1, while for x1 = 1 the test of x2 yields “yes” for x2 = 0 and “no” for x2 = 1.
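To see the limitation concretely, here is a small perceptron sketch (a threshold unit with an explicit bias weight, which is an assumption beyond the formula above) trained with the classical error-correction rule; on the XOR data it never reaches zero training error, whereas the two-level decision tree above classifies all four cases correctly:

```python
# Perceptron with the error-correction rule on the XOR data.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, b, eta = [0.0, 0.0], 0.0, 0.1
for epoch in range(100):
    errors = 0
    for x, target in DATA:
        delta = target - predict(w, b, x)
        if delta != 0:                  # misclassified -> adjust weights and bias
            errors += 1
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            b += eta * delta
    if errors == 0:
        break

print(errors)  # stays > 0: XOR is not linearly separable
```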


Page 18: MAI0203 Lecture 13: Machine Learning

Other Learning Approaches

Theoretical foundations: algorithmic/computational/statistical learning theory. What kind of hypotheses can be learned from which data with which effort?

Three main approaches:
(1) Supervised learning: learning from examples

Decision tree algorithms
Inductive Logic Programming (ILP)
Genetic algorithms
Feedforward nets (backpropagation)
Support vector machines, statistical approaches


Page 19: MAI0203 Lecture 13: Machine Learning

Other Learning Approaches cont.

...
(2) Unsupervised learning: k-means clustering, self-organizing maps

Clustering of data into groups (e.g. customers with similar behavior). Application: data mining.

(3) Reinforcement learning: learning from problem-solving experience, positive/negative feedback


Page 20: MAI0203 Lecture 13: Machine Learning

Other Learning Approaches cont.

Inductive program synthesis: learn recursive rules, typically from small sets of only positive examples

ILP
Genetic programming
Functional approaches (recurrence detection in terms)
Grammar inference

Explanation-based learning

Learning from analogical/case-based reasoning (schema acquisition, abstraction)

Textbook:

Tom Mitchell, Machine Learning, McGraw-Hill, 1997


Page 21: MAI0203 Lecture 13: Machine Learning

The Running Gag of MAI 02/03

Question: How many AI people does it take to change a lightbulb? Answer: At least 67.
6th part of the solution: The Learning Group (4)

One to collect twenty lightbulbs

One to collect twenty "near misses"

One to write a concept learning program that learns to identify lightbulbs

One to show that the program found a local maximum in the space of lightbulb descriptions

Check the groups we did not consider in this lecture: The Statistical Group (1), The Robotics Group (7), The Lisp Hackers (7), The Connectionist Group (6), The Natural Language Group (5). (“Artificial Intelligence”, Rich & Knight)
