
Page 1: MAI0203 Lecture 13: Machine Learning

MAI0203 Lecture 13: Machine Learning

Methods of Artificial Intelligence, WS 2002/2003

Part III: Special Aspects

III.13 Machine Learning


Page 2: MAI0203 Lecture 13: Machine Learning

Learning

Remember “Deduction vs. Induction”. Induction: derive a general rule (axiom) from background knowledge and observations.

Socrates is a human. (background knowledge)
Socrates is mortal. (observation/example)
Therefore, I hypothesize that all humans are mortal. (generalization)

Most important approach to learning: inductive learning. Others: rote learning, adaptation/re-structuring, speed-up learning.

Deduction is knowledge extraction, induction is knowledge acquisition!


Page 3: MAI0203 Lecture 13: Machine Learning

Intelligent Systems must Learn

A flexible and adaptive organism cannot rely on a fixed set of behavioral rules but must learn (over its complete life-span)!

If an expert system – brilliantly designed, engineered and implemented – cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten.

(O. Selfridge)


Page 4: MAI0203 Lecture 13: Machine Learning

Learning vs. Knowledge Engineering

Technological advantage: learning avoids the “knowledge engineering bottleneck”!

Good analysis of data/observations can often be superior to explicit programming.

E.g., an expert system for medical diagnosis:
Describe each patient by his/her clinical data,
let physicians decide whether some disease is implied or not,
learn the classification rule (relevant features and their dependencies).

Epistemological problem: induced information can never be “sure” knowledge; it has hypothetical status and is open to revision!


Page 5: MAI0203 Lecture 13: Machine Learning

Classification/Concept Learning

Classification/concept learning is the best-researched area in machine learning.

Representation of a concept:
Extensional: the (infinite) set of all entities belonging to a class/concept.
Intensional: a finite characterization, e.g. a rule such as T(x) ← has-3/4-Legs(x) ∧ has-Top(x).

Learning: construction of a finite characterization from a subset of entities (training examples).

Typically, supervised learning: pre-classified positive and negative examples for the concept/class.

In the following: decision tree learning as a symbolic approach to inductive learning of concepts (Quinlan, ID3, C4.5).


Page 6: MAI0203 Lecture 13: Machine Learning

Decision Tree Learning/Training Examples

Fictional example: learn in which regions of the world the Tse-Tse fly is living.

Training examples can be presented sequentially (incremental learning) or all at once (batch learning).

Nr  Vegetation  Longitude   Humidity  Altitude  Tse-Tse fly
1   swamp       near eq.    high      high      occurs
2   grassland   near eq.    high      low       occurs not
3   forest      far f. eq.  low       high      occurs not
4   grassland   far f. eq.  high      high      occurs not
5   forest      near eq.    high      low       occurs
6   grassland   near eq.    low       low       occurs not
7   swamp       near eq.    low       low       occurs
8   swamp       far f. eq.  low       low       occurs not


Page 7: MAI0203 Lecture 13: Machine Learning

Feature Vectors

Examples are represented as feature vectors (x1, x2, x3, x4).

Attributes and their value ranges:
Vegetation: x1 ∈ {swamp, grassland, forest}
Longitude: x2 ∈ {near eq., far from eq.}
Humidity: x3 ∈ {high, low}
Altitude: x4 ∈ {high, low}
Class ∈ {occurs, occurs not}
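As a concrete illustration (not part of the original slides), the eight training examples can be written down as feature vectors in Python; the attribute names and value strings are taken directly from the table above:

```python
# Tse-Tse fly training examples as feature vectors (x1, x2, x3, x4) plus class label.
ATTRIBUTES = ["vegetation", "longitude", "humidity", "altitude"]  # x1 .. x4

TRAINING_EXAMPLES = [
    # (vegetation, longitude,    humidity, altitude, class)
    ("swamp",     "near eq.",    "high",   "high",   "occurs"),
    ("grassland", "near eq.",    "high",   "low",    "occurs not"),
    ("forest",    "far f. eq.",  "low",    "high",   "occurs not"),
    ("grassland", "far f. eq.",  "high",   "high",   "occurs not"),
    ("forest",    "near eq.",    "high",   "low",    "occurs"),
    ("grassland", "near eq.",    "low",    "low",    "occurs not"),
    ("swamp",     "near eq.",    "low",    "low",    "occurs"),
    ("swamp",     "far f. eq.",  "low",    "low",    "occurs not"),
]
```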


Page 8: MAI0203 Lecture 13: Machine Learning

DT Learning/Hypothesis

A possible decision tree: the root tests x1 (vegetation). The branch grassland leads directly to the leaf “NO Tse-Tse Fly”; the branches swamp and forest each lead to a test of x2 (longitude), where near eq. yields the leaf “Tse-Tse Fly” and far from eq. yields “NO Tse-Tse Fly”.

Hypotheses: IF-THEN classification rules. IF: a disjunction of conjunctions of constraints over attribute values. THEN: a decision for a class.

Each path is a rule:
IF grassland THEN no Tse-Tse fly
IF (swamp AND near equator) THEN Tse-Tse fly
...

Generalization: only relevant attributes are used, and the rules can be applied to classify unseen examples.
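The learned rules can be executed directly; a minimal sketch in Python (the function name and value strings are illustrative, following the tree above):

```python
def classify(vegetation, longitude):
    """Apply the IF-THEN rules read off the decision tree above."""
    if vegetation == "grassland":
        return "occurs not"       # IF grassland THEN no Tse-Tse fly
    if longitude == "near eq.":
        return "occurs"           # IF (swamp/forest AND near equator) THEN Tse-Tse fly
    return "occurs not"           # swamp/forest far from the equator: no Tse-Tse fly

print(classify("swamp", "near eq."))     # -> occurs
print(classify("forest", "far f. eq."))  # -> occurs not
```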


Page 9: MAI0203 Lecture 13: Machine Learning

DT Learning/Basic Algorithm

ID3(examples, attributes, target-class):

1. Create a root node.
2. IF all examples are positive, return the root with label “target-class”.
3. IF all examples are negative, return the root with label “¬ target-class”.
4. IF attributes is empty, return the root and use the most common value for the target class in the examples as label.
5. Otherwise:
(a) Select an attribute A from attributes.
(b) Assign A to the root.
(c) For each possible value v_i of A:
i. Add a new branch below the root, corresponding to the test A = v_i.
ii. Let examples_vi be the subset of examples that have value v_i for A.
iii. IF examples_vi is empty THEN add a leaf node and use the most common value for the target class in the examples as label, ELSE ID3(examples_vi, attributes − {A}, target-class).
6. Return the tree.
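A compact Python sketch of this basic algorithm (illustrative only; the attribute-selection strategy is passed in as a function, e.g. the information-gain criterion introduced below):

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(examples, attributes, value_ranges, select_attribute):
    """examples: list of (feature_dict, label) pairs.
    attributes: attribute names still available for splitting.
    value_ranges: dict mapping each attribute to its possible values.
    select_attribute: function choosing the split attribute (e.g. by information gain)."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                   # steps 2/3: pure node -> leaf
        return labels[0]
    if not attributes:                          # step 4: no attributes left -> majority class
        return majority(labels)
    a = select_attribute(examples, attributes)  # step 5(a)
    tree = {a: {}}                              # step 5(b): a becomes the root test
    for v in value_ranges[a]:                   # step 5(c): one branch per value
        subset = [(f, l) for f, l in examples if f[a] == v]
        if not subset:                          # empty branch -> majority class
            tree[a][v] = majority(labels)
        else:
            tree[a][v] = id3(subset, [b for b in attributes if b != a],
                             value_ranges, select_attribute)
    return tree                                 # step 6
```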


Page 10: MAI0203 Lecture 13: Machine Learning

Example

Splitting on x1 (vegetation) partitions the training examples into swamp → {1, 7, 8}, forest → {3, 5}, and grassland → {2, 4, 6}. The grassland subset is pure, so it becomes the leaf node “NO Tse-Tse Fly”; for the swamp and forest subsets the next attribute has to be selected. Splitting the swamp subset {1, 7, 8} on x2 (longitude) gives near eq. → {1, 7} (“Tse-Tse Fly”) and far from eq. → {8} (“NO Tse-Tse Fly”).


Page 11: MAI0203 Lecture 13: Machine Learning

Selecting an Attribute

Naive: from the list of attributes in some arbitrary order, always take the next one. Problem: the DT might be more complex than necessary and contain branches which are irrelevant.

Selection by information gain → DT with minimal depth. What is the contribution of an attribute to deciding whether an entity belongs to the class or not?

Entropy of a collection S:

$E(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$

with $p_i$ as the proportion (relative frequency) of objects belonging to category $i$.

Logarithm to base 2: information theory, expected encoding length measured in bits.


Page 12: MAI0203 Lecture 13: Machine Learning

Entropy/Illustration

S: flipping a fair coin, $P(\text{head up}) = P(\text{tail up}) = \tfrac{1}{2}$:

$E(\text{Coin}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit

Flipping a coin that always shows heads ($P(\text{head up}) = 1$): $E = 0$ bits. Flipping a biased coin: the entropy lies between 0 and 1 bit.

Entropy of a set of training examples: proportion of examples belonging to the class vs. not belonging to the class. For the Tse-Tse fly training set (3 positive, 5 negative examples):

$E(S) = -\tfrac{3}{8}\log_2\tfrac{3}{8} - \tfrac{5}{8}\log_2\tfrac{5}{8} \approx 0.954$

Without looking at any attribute, we have a slight advantage if guessing the more frequent class “no Tse-Tse fly”.
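As a sanity check (not part of the original slide), the entropy values above can be recomputed with a few lines of Python:

```python
from math import log2

def entropy(counts):
    """Entropy of a collection, given absolute class frequencies (0*log2(0) treated as 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([1, 1]))  # fair coin: 1.0 bit
print(entropy([3, 5]))  # Tse-Tse fly training set: ~0.954 bits
```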


Page 13: MAI0203 Lecture 13: Machine Learning

Information Gain

What does the knowledge of the value of an attribute A contribute to predicting the class?

$Gain(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)$

The weighted sum $\sum_{v} \frac{|S_v|}{|S|} E(S_v)$ is the “relative” entropy of A.

Example: attribute x2 “Longitude”. The subset with x2 = near eq. contains 3 positive and 2 negative examples; the subset with x2 = far from eq. contains 0 positive and 3 negative examples:

$E(S_{\text{near eq.}}) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.971$

$E(S_{\text{far f. eq.}}) = -\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0$

(Remark: $0 \cdot \log_2 0$ is undefined; we set it to 0.)

Weighting with the proportion of examples with x2 = near eq. (5/8) and x2 = far f. eq. (3/8):

$Gain(S, x_2) = 0.954 - \tfrac{5}{8} \cdot 0.971 - \tfrac{3}{8} \cdot 0 \approx 0.35$

Remark: In different subtrees, different attributes might give the best information gain.
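The example calculation can be reproduced with a small self-contained Python sketch (the class counts per attribute value are the ones listed above):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """parent_counts: class frequencies at the node; child_counts: one frequency list per attribute value."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

# Attribute x2 "Longitude": near eq. -> 3 occurs / 2 occurs not, far f. eq. -> 0 / 3
print(information_gain([3, 5], [[3, 2], [0, 3]]))  # ~0.348
```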

Page 14: MAI0203 Lecture 13: Machine Learning

Performance Evaluation

A learning algorithm is good if it produces hypotheses which work well when classifying unseen examples.

Validation: use a training set and a test set.
Construct a hypothesis using the training set.
Estimate the generalization error using the test set.

For decision trees working on linearly separable data, the training error is always 0.

Danger: overfitting of the data. If crucial information is missing (relevant features), the DT algorithm might nevertheless find a hypothesis consistent with the examples by using irrelevant attributes. E.g., predicting whether a coin comes up heads or tails based on the day of the week.


Page 15: MAI0203 Lecture 13: Machine Learning

How to Avoid Overfitting

Pruning (Ockham's razor – prefer simple hypotheses): do not use attributes with information gain near zero.

Cross-validation: run the algorithm k times, each time leaving out a different subset of roughly 1/k of the examples.
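A minimal sketch of the leave-out idea (hypothetical helper names; train and evaluate are placeholders for the learner and the error estimate):

```python
def k_fold_splits(examples, k):
    """Yield (training_set, held_out_fold) pairs for k-fold cross-validation."""
    folds = [examples[i::k] for i in range(k)]  # k disjoint subsets of the examples
    for i in range(k):
        held_out = folds[i]
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield training, held_out

# errors = [evaluate(train(tr), held) for tr, held in k_fold_splits(examples, k=4)]
```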


Page 16: MAI0203 Lecture 13: Machine Learning

Characteristics of DT Algorithms

Hypothesis language (DTs): restricted to separating hyperplanes which are parallel to the axes of the feature space.

Inductive biases (no learning without bias):
Each learning algorithm has a language bias: the hypothesis language restricts what can be learned.
Search bias: top-down search in the space of possible decision trees.

DT research was inspired by the psychological work of Bruner, Goodnow, & Austin (1956) on concept formation.

Advantage of the symbolic approach: verbalizable rules which can be directly used in knowledge-based systems.


Page 17: MAI0203 Lecture 13: Machine Learning

DTs vs. Perceptrons: XOR

Perceptron: $o(\vec{x}) = 1$ if $\sum_i w_i x_i > 0$, and $o(\vec{x}) = 0$ otherwise.

Output: calculated as a linear combination of the inputs (thresholded).

Learning: incremental; change the weights $w_i$ when a classification error occurs.

Works for linearly separable classes, e.g. not for XOR:

x1  x2  class
 0   0    0
 0   1    1
 1   0    1
 1   1    0

A decision tree, in contrast, represents XOR easily: the root tests x1; for x1 = 0 a test of x2 yields “no” for x2 = 0 and “yes” for x2 = 1, while for x1 = 1 the test of x2 yields “yes” for x2 = 0 and “no” for x2 = 1.
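To see the limitation concretely, here is a small perceptron sketch (a threshold unit with an explicit bias weight, which is an assumption beyond the formula above) trained with the classical error-correction rule; on the XOR data it never reaches zero training error, whereas the two-level decision tree above classifies all four cases correctly:

```python
# Perceptron with the error-correction rule on the XOR data.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, b, eta = [0.0, 0.0], 0.0, 0.1
for epoch in range(100):
    errors = 0
    for x, target in DATA:
        delta = target - predict(w, b, x)
        if delta != 0:                  # misclassified -> adjust weights and bias
            errors += 1
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            b += eta * delta
    if errors == 0:
        break

print(errors)  # stays > 0: XOR is not linearly separable
```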


Page 18: MAI0203 Lecture 13: Machine Learning

Other Learning Approaches

Theoretical foundations: algorithmic/computational/statistical learning theory. What kind of hypotheses can be learned from which data with which effort?

Three main approaches:
(1) Supervised learning: learning from examples

Decision tree algorithms
Inductive Logic Programming (ILP)
Genetic algorithms
Feedforward nets (backpropagation)
Support vector machines, statistical approaches


Page 19: MAI0203 Lecture 13: Machine Learning

Other Learning Approaches cont.

...
(2) Unsupervised learning: k-means clustering, self-organizing maps

Clustering of data into groups (e.g. customers with similar behavior). Application: data mining.

(3) Reinforcement learning: learning from problem-solving experience, positive/negative feedback


Page 20: MAI0203 Lecture 13: Machine Learning

Other Learning Approaches cont.

Inductive program synthesis: learn recursive rules, typically from small sets of only positive examples

ILP
Genetic programming
Functional approaches (recurrence detection in terms)
Grammar inference

Explanation-based learning

Learning from analogical/case-based reasoning (schema acquisition, abstraction)

Textbook:

Tom Mitchell, Machine Learning, McGraw-Hill, 1997


Page 21: MAI0203 Lecture 13: Machine Learning

The Running Gag of MAI 02/03

Question: How many AI people does it take to change a lightbulb? Answer: At least 67.
6th part of the solution: The Learning Group (4)

One to collect twenty lightbulbs

One to collect twenty "near misses"

One to write a concept learning program that learns to identify lightbulbs

One to show that the program found a local maximum in the space of lightbulb descriptions

Check the groups we did not consider in this lecture: The Statistical Group (1), The Robotics Group (7), The Lisp Hackers (7), The Connectionist Group (6), The Natural Language Group (5). (“Artificial Intelligence”, Rich & Knight)
