Intro to AI Learning Ruth Bergman Fall 2002


Page 1: Intro to AI Learning Ruth Bergman Fall 2002. Learning What is learning? Learning is a process by which the learner improves his/her/its predictive ability

Intro to AI Learning

Ruth Bergman

Fall 2002

Page 2:

Learning

• What is learning?
Learning is a process by which the learner improves his/her/its predictive ability using new experiences.

Page 3:

Machine Learning

• Learn a function
• Process: algorithms for improving the model of the function
• Improve predictive ability: less difference between the true function and the model
• Experiences: input examples or instances

Page 4:

A classification problem

• Predict whether a patient will experience heart disease
• Key steps:
– Data: what “past experience” do we have? What are the underlying assumptions? Medical records, exercise habits, smoking habits…
– Representation: how do we summarize a patient?
– Estimation: how do we construct a map from patients to presence of heart disease?
– Evaluation: how well does our model predict? Can we do better?

Page 5:

Heart-Disease Representation

• Fourteen attributes:
1. Age (in years)
2. Sex (male/female)
3. Chest pain type (normal angina, atypical angina, non-anginal pain, asymptomatic)
4. Resting blood pressure (in mm Hg on admission to the hospital)
5. Serum cholesterol (in mg/dl)
6. Fasting blood sugar (1 if > 120 mg/dl, 0 otherwise)
7. Resting electrocardiographic results (normal, ST-T wave abnormality, hypertrophy)
8. Maximum heart rate
9. Exercise-induced angina (yes/no)
10. ST depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment (upsloping, flat, downsloping)
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal (normal, fixed defect, reversible defect)
14. Diagnosis of heart disease (0 if < 50% diameter narrowing in any major vessel, 1 otherwise)

Page 6:

Heart-Disease Examples

age  sex     chest pain     resting BP  cholesterol  blood sugar  exercise angina  …  heart disease
63   male    typical        145         233          1            0                …  no
67   male    asymptomatic   160         286          0            1                …  yes
67   male    asymptomatic   120         229          0            1                …  yes
37   male    non-anginal    130         250          0            0                …  no
41   female  atypical       130         204          0            0                …  no

Page 7:

Learning Paradigms

• Supervised learning
– Given a set of examples and the correct results
– Instance: <feature vector, classification>
– Example: x → f(x)
• Unsupervised learning
– Capture the inherent organization in the data
– Instance: <feature vector>
– Example: x
• Reinforcement learning
– Given feedback (reward) for performing well (or badly), not what we should be doing
– Instance: <feature vector>
– Example: x → rewards based on performance

Page 8:

Inductive Learning

• Suppose the underlying problem domain is described by a function f
• Given pairs <x, f(x)>
• Compute a hypothesis h that approximates f as well as possible given the presented data
• In general the input under-constrains the function h, so we have to choose. The way that choice is performed is called bias.

Page 9:

Two Function Types

• Classification: the target function is a classification

f(x) = 1 if x ∈ C, 0 if x ∉ C

[Figure: + and − examples in the input space, with the + examples inside the class region C.]

• Regression

Page 10:

Representation of Classifiers

• We could use many representation models:
– Decision trees
– A set of rules
– A Prolog program (Horn clauses)
– Neural networks
– Belief networks
• Each representation has multiple learning algorithms

Page 11:

Decision Trees

[Figure: a decision tree whose root tests blood pressure (high/medium/low); the branches test cholesterol (high/low) or chest pain (yes/no), with leaves labeled 0 or 1.]

Page 12:

Inducing decision trees from examples

• Trivial solution: one path in the tree for each example
– Bad generalization

Page 13:

A Decision Tree Learning Algorithm

ID3(examples, attributes, default)
  if examples is empty, return default
  if all examples have the same classification, return that classification
  if attributes is empty, return majority-value(examples)
  best ← CHOOSE-ATTRIBUTE(attributes, examples)
  tree ← a new tree with root test best
  for each value vi of best
    examplesi ← elements of examples with best = vi
    subtree ← ID3(examplesi, attributes − best, majority-value(examples))
    add a branch to tree with label vi and subtree subtree
  return tree
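The ID3 pseudocode maps almost line-for-line onto Python. A minimal sketch; the dict-based example encoding and the (attribute, branches) tuple used for tree nodes are my own choices, not from the slides:

```python
from collections import Counter
import math

def majority_value(examples):
    # Most common label among (features, label) pairs.
    return Counter(label for _, label in examples).most_common(1)[0][0]

def entropy(examples):
    # Information content of the label distribution, in bits.
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def choose_attribute(attributes, examples):
    # Lowest remainder = highest information gain.
    def remainder(attr):
        sizes = Counter(x[attr] for x, _ in examples)
        return sum((n / len(examples)) *
                   entropy([(x, y) for x, y in examples if x[attr] == v])
                   for v, n in sizes.items())
    return min(attributes, key=remainder)

def id3(examples, attributes, default):
    if not examples:
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return majority_value(examples)
    best = choose_attribute(attributes, examples)
    branches = {}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best],
                          majority_value(examples))
    return (best, branches)
```

Here `choose_attribute` minimizes the remainder, which is equivalent to maximizing the information gain defined a few slides later.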

Page 14:

Possible Trees

[Figure: two candidate trees over the same data, one rooted at blood pressure (high/medium/low) with cholesterol tests below, the other rooted at chest pain (yes/no).]

Which is the best tree?

Page 15:

What Makes A Good Tree

• Ockham’s razor principle (assumption):The most likely hypothesis is the simplest one that is consistent with training examples.

Bias for short trees: minimize (on average) the # of questions we need to answer before reaching a decision

• Finding the smallest decision tree that matches training examples is NP-hard.

• Select test attributes using a heuristic from Information Theory– The attribute that provides the most information.

Page 16:

Information Content of Attributes

• A perfect attribute divides the examples into sets that are positive and negative.

• A useless attribute leaves the example sets with the same proportion of positive and negative examples as the original.

[Figure: a perfect attribute splits the examples into an all-positive set and an all-negative set; a useless attribute leaves each subset with the same mix of + and − as the original.]

Page 17:

Using Information Theory

• Suppose we have p positive and n negative examples
– The probability of a positive example is p/(p+n)
– The probability of a negative example is n/(p+n)
• Information content
– For possible values v1, …, vn with respective probabilities P(v1), …, P(vn):
  I(P(v1), …, P(vn)) = Σi −P(vi) log2 P(vi)
• Information content of an attribute:
• For example, an even split gives −½(−1) − ½(−1) = 1 bit
• Something that doesn’t divide at all yields 0 bits

I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
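The information content of a set with p positive and n negative examples can be computed directly, taking 0·log2(0) as 0 (the function name is my own):

```python
import math

def information(p, n):
    """I(p/(p+n), n/(p+n)): information content, in bits, of a set
    holding p positive and n negative examples."""
    total = p + n
    bits = 0.0
    for count in (p, n):
        if count:                 # treat 0 * log2(0) as 0
            q = count / total
            bits -= q * math.log2(q)
    return bits
```

An even split gives 1 bit; a pure set gives 0 bits, matching the bullets above.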

Page 18:

Using Information Theory

• Suppose test A divides the training set into sets E1, E2, …, Em, and subset i has pi positive and ni negative examples
– The uncertainty in the children is measured by Remainder(A)
– On average we will need Remainder(A) more bits of information
• The gain of test A is
  Gain(A) = I(p, n) − Remainder(A)
• For example, if A completely classifies the set, the gain is 1 (remainder 0)
• Gain(front_row) = 1 − [2/12 I(0,1) + 4/12 I(1,0) + 6/12 I(2/6,4/6)] = 0.541
• Gain(prev_grade) = 1 − [1 · I(1/2,1/2)] = 0

Remainder(A) = Σi=1..m [(pi + ni)/(p + n)] · I(pi/(pi+ni), ni/(pi+ni))
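Gain and Remainder follow from the per-subset counts; a small sketch (the encoding of each subset as a (p_i, n_i) pair is my own choice):

```python
import math

def information(p, n):
    # I(p/(p+n), n/(p+n)) in bits, with 0 * log2(0) taken as 0.
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def gain(p, n, subsets):
    """Gain of a test that splits p positives and n negatives into the
    given (p_i, n_i) subsets: I(p, n) - Remainder."""
    remainder = sum((pi + ni) / (p + n) * information(pi, ni)
                    for pi, ni in subsets)
    return information(p, n) - remainder
```

For the slide’s front_row-style split of 6 positives and 6 negatives, gain(6, 6, [(0, 2), (4, 0), (2, 4)]) evaluates to ≈ 0.541.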

Page 19:

Results of Learning

• ID3 algorithm
• 50 examples from the Cleveland heart disease database

N-Examples: 50. Choosing from: (att1 att2 att3 att4 att5 att6 att7 att8 att9 att10 att11 att12 att13)

Attribute: att1. Gain: 0.0741303
Attribute: att2. Gain: 0.041963935
Attribute: att3. Gain: 0.20732212
Attribute: att4. Gain: 0.06303495
Attribute: att5. Gain: 0.04215479
Attribute: att6. Gain: 0.12409884
Attribute: att7. Gain: 0.014267027
Attribute: att8. Gain: 0.12125653
Attribute: att9. Gain: 0.3067463
Attribute: att10. Gain: 0.18902457
Attribute: att11. Gain: 0.0412457
Attribute: att12. Gain: 0.23261213
Attribute: att13. Gain: 0.24738503
Selected attribute: att9

Page 20:

Results of Learning

N-Examples: 37. Choosing from: (att1 att2 att3 att4 att5 att6 att7 att8 att10 att11 att12 att13)

Attribute: att1. Gain: 0.13357532
Attribute: att2. Gain: 0.07225275
Attribute: att3. Gain: 0.06493038
Attribute: att4. Gain: 0.05581081
Attribute: att5. Gain: 0.053394675
Attribute: att6. Gain: 0.18130744
Attribute: att7. Gain: 0.020807564
Attribute: att8. Gain: 0.070365906
Attribute: att10. Gain: 0.08575398
Attribute: att11. Gain: 0.0064561963
Attribute: att12. Gain: 0.26021093
Attribute: att13. Gain: 0.20255792
Selected attribute: att12

Page 21:

Results of Learning

N-Examples: 6. Choosing from: (att1 att2 att3 att4 att5 att6 att7 att8 att10 att11 att13)

Attribute: att1. Gain: 0.31668913
Attribute: att2. Gain: 0.45914793
Attribute: att3. Gain: 0.45914793
Attribute: att4. Gain: 0.31668913
Attribute: att5. Gain: 0.91829586
Attribute: att6. Gain: 0.10917032
Attribute: att7. Gain: 0.0
Attribute: att8. Gain: 0.45914793
Attribute: att10. Gain: 0.10917032
Attribute: att11. Gain: 0.25162917
Attribute: att13. Gain: 0.25162917
Selected attribute: att5

Page 22:

Results of Learning

[Figure: the learned tree. The root tests exercise-induced angina (attribute 9); later tests include fluoroscopy (12), thal/defect (13), chest pain (3), blood pressure (4, threshold 120), and cholesterol (5, thresholds such as 239, 241, 245); leaves are labeled 0 or 1.]

Page 23:

Example Derived From 300 Tests

[Figure: the full decision tree derived from 300 examples; far too large to reproduce legibly here. Only fragments of its tests (thal, chest pain, cholesterol, fluoroscopy, and many numeric thresholds) survive in the transcript.]

Page 24:

How Do We Assess Performance?

• Collect a large set of samples

• Divide into training set and test set (no examples in common!)

• Use the learning algorithm to generate a hypothesis h (a decision tree)

• Measure the percentage of test-set examples that h classifies correctly

• Do this for lots of training sets of lots of different sizes
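The steps above can be sketched as a holdout evaluation; `learn` and `predict` are placeholders standing in for any learner (the names and the 70/30 split are my own choices):

```python
import random

def holdout_accuracy(data, learn, predict, train_fraction=0.7, seed=0):
    """Split `data` into disjoint train/test sets, fit on the training
    part, and report the fraction of test examples predicted correctly."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    model = learn(train)
    correct = sum(predict(model, x) == y for x, y in test)
    return correct / len(test)
```

Repeating this for many training sets of many sizes yields the learning curve on the next slide.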

Page 25:

The Learning Curve

Cleveland heart-disease learning curve

[Figure: % correct classification (y-axis, 0 to 90) plotted against the number of training examples (x-axis, 0 to 400).]

Page 26:

Application: GASOIL

GASOIL is an expert system for designing gas/oil separation systems stationed off-shore.

• System attributes: proportions of gas, oil and water, flow rate, pressure, density, viscosity, temperature, and others.
• To build by hand would take ~10 person-years
• Built by decision-tree induction in ~100 person-days
• At the time (1986), GASOIL was the biggest expert system in the world, containing ~2500 rules, and saved BP millions.

Page 27:

Application: Learning to Fly

• Learning to fly a Cessna on a flight simulator (1992).
– Three skilled pilots performed an assigned flight plan 30 times each.
– Each control action (e.g. on throttle, flaps) created an example.
– 90,000 examples
– Decision tree created.
– Converted into C and put into the simulator control loop.
• Program flies better than its teachers!
– Probably because generalization cleans up occasional mistakes

Page 28:

Representation power of decision trees

Any Boolean function can be written as a decision tree.

[Figure: a decision tree testing x1 and then x2, with Yes/No leaves.]

Cannot represent tests that refer to 2 or more objects, e.g. ∃r2 Nearby(r2, r) ∧ Price(r, p) ∧ Price(r2, p2) ∧ Cheaper(p2, p)

Page 29:

Representation with decision trees…

Parity problem

[Figure: the full parity tree: x1 at the root, x2 at the next level, x3 at every node below; leaf labels alternate Y and N.]

Exponentially large tree. Cannot be compressed.

• n features (aka attributes).
• 2^n rows in the truth table.
• Each row can take one of 2 values.
• So there are 2^(2^n) Boolean functions of n attributes.

Page 30:

Machine Learning Issues

• Unrepresentative examples
• Insufficient data
• Noise: incorrectly labelled examples
– If there are errors in our examples, then these will end up in the decision tree
– Overfitting
• Missing data
– Sometimes we don’t have all of the attributes
• Attributes with lots of values
– Tend to look good because each example has an almost unique value
• Continuous-valued attributes

Page 31:

Empty Leaves

• The examples do not represent all possible attribute values

• Pass a default value to subtrees when splitting
– Usually the majority classification at the parent node

Page 32:

Noisy Input

• Incorrectly labeled examples can result in leaves where the examples have conflicting labels and no split exists

• Select majority label

Page 33:

Many Valued Attributes

• Problem with information gain:
– Prefers attributes with many values
– Extreme cases:
  • Social Security numbers
  • Patient IDs
  • Integer/nominal attributes with many values (JulianDay)
– Use the gain-ratio splitting criterion:

GainRatio(A) = Gain(A) / SplitInfo(A)


SplitInfo(A) = −Σi [(pi + ni)/(p + n)] log2[(pi + ni)/(p + n)]
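GainRatio penalizes many-valued attributes through SplitInfo, the entropy of the split sizes themselves. A sketch (encoding each subset as a (p_i, n_i) pair is my own choice):

```python
import math

def information(p, n):
    # I(p/(p+n), n/(p+n)) in bits, with 0 * log2(0) taken as 0.
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def gain(p, n, subsets):
    # Gain(A) = I(p, n) - Remainder(A).
    return information(p, n) - sum((pi + ni) / (p + n) * information(pi, ni)
                                   for pi, ni in subsets)

def split_info(subsets):
    # Entropy of the split sizes alone: many tiny subsets -> large SplitInfo.
    total = sum(pi + ni for pi, ni in subsets)
    return -sum(((pi + ni) / total) * math.log2((pi + ni) / total)
                for pi, ni in subsets if pi + ni)

def gain_ratio(p, n, subsets):
    return gain(p, n, subsets) / split_info(subsets)
```

An ID-like attribute that puts each of 12 examples in its own pure subset still has gain of 1 bit, but its SplitInfo of log2(12) ≈ 3.58 drags its ratio well below that of a clean two-way split.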

Page 34:

Continuous Valued Attributes

• How to cluster into logical segments of values?
– Sort by value, then find the best threshold for a binary split
– Cluster into n intervals and do an n-way split
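The sort-then-threshold strategy for a binary split can be sketched as follows (labels as booleans; the helper names and the midpoint convention are my own assumptions):

```python
import math

def information(p, n):
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def best_threshold(points):
    """Sort (value, label) pairs, try the midpoint between each pair of
    adjacent distinct values as a binary split, and return the
    (threshold, gain) pair with the highest information gain."""
    pts = sorted(points)
    p = sum(1 for _, y in pts if y)
    n = len(pts) - p
    base = information(p, n)
    best_gain, best_t = -1.0, None
    left_p = 0
    for i in range(1, len(pts)):
        left_p += pts[i - 1][1]
        if pts[i - 1][0] == pts[i][0]:
            continue                     # no threshold between equal values
        left_n = i - left_p
        rem = (i / len(pts)) * information(left_p, left_n) + \
              ((len(pts) - i) / len(pts)) * information(p - left_p, n - left_n)
        g = base - rem
        if g > best_gain:
            best_gain, best_t = g, (pts[i - 1][0] + pts[i][0]) / 2
    return best_t, best_gain
```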

Page 35:

Missing Attribute Values

• Some data sets have many missing values
• Assume that the missing value is
– The same as the majority value for the attribute
– The same as the majority value for this attribute at this node
– The same as the majority value among examples with the same label
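The first strategy, using the attribute's overall majority value, is simple to apply; a sketch with a hypothetical dict-based example encoding of my own:

```python
from collections import Counter

def fill_missing(examples, attr, missing=None):
    """Replace a missing value of `attr` with the attribute's majority
    value over the examples where it is present."""
    present = [x[attr] for x, _ in examples if x[attr] is not missing]
    majority = Counter(present).most_common(1)[0][0]
    return [({**x, attr: majority}, y) if x[attr] is missing else (x, y)
            for x, y in examples]
```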

Page 36:

Overfitting

©Tom Mitchell, McGraw Hill, 1997

The tree is too large and has poor predictive power.

Page 37:

Pre-Pruning (Early Stopping)

• Evaluate splits before installing them:
– Don’t install splits that don’t look worthwhile
– When there are no worthwhile splits to install, stop
• Seems right, but:
– Hard to properly evaluate a split without seeing what splits would follow it (use lookahead?)
– Some attributes are useful only in combination with other attributes
– Suppose no single split looks good at the root node?

Page 38:

Post-Pruning

• Grow decision tree to full depth (no pre-pruning)

• Prune-back full tree by eliminating splits that do not appear to be warranted statistically

• Use train set, or an independent prune/test set, to evaluate splits

• Stop pruning when remaining splits all appear to be warranted

• Alternate approach: convert to rules, then prune rules

Page 39:

Converting Decision Trees to Rules

• Each path from root to a leaf is a separate rule:

if (fp=1 & ¬pc & primip & ¬fd & bw<3349) => 0,
if (fp=2) => 1,
if (fp=3) => 1.

fetal_presentation = 1: +822+116 (tree) 0.8759 0.1241 0
| previous_csection = 0: +767+81 (tree) 0.904 0.096 0
| | primiparous = 1: +368+68 (tree) 0.8432 0.1568 0
| | | fetal_distress = 0: +334+47 (tree) 0.8757 0.1243 0
| | | | birth_weight < 3349: +201+10.555 (tree) 0.9482 0.05176 0
fetal_presentation = 2: +3+29 (tree) 0.1061 0.8939 1
fetal_presentation = 3: +8+22 (tree) 0.2742 0.7258 1

Page 40:

Advantages of Decision Trees

• DT learning is relatively fast, even with large data sets (10^6 examples) and many attributes (10^3)
– Advantage of recursive partitioning: only process all cases at the root
• Small-to-medium size trees are usually intelligible
• Can be converted to rules
• The algorithm does feature selection
• The resulting model is often compact (Occam’s razor)
• Decision tree representation is understandable

Page 41:

Decision Trees are Intelligible

Page 42:

Not ALL Decision Trees Are Intelligible

Part of Best Performing C-Section Decision Tree

from Rich Caruana.

Page 43:

Disadvantages of Decision Trees

• Large or complex trees can be just as unintelligible as other models
• Trees don’t easily represent some basic concepts such as M-of-N, parity, non-axis-aligned classes…
• Don’t handle real-valued parameters as well as Booleans
• If the model depends on summing the contributions of many different attributes, DTs probably won’t do well
• DTs that look very different can be the same/similar
• Propositional (as opposed to 1st order)
• Recursive partitioning: runs out of data fast as you descend the tree

Page 44:

Instance-Based Learning

Inductive Assumption

• Similar inputs map to similar outputs
– If not true => learning is impossible
– If true => learning reduces to defining “similar”
• Not all similarities are created equal
– Predicting a person’s weight may depend on different attributes than predicting their IQ

Page 45:

Nearest Neighbor Classification

• Training: retain all examples
• Prediction: a new example is assigned the same classification as its nearest neighbor
• Similarity measure: a distance function in attribute space

[Figure: + and o examples scattered in a two-attribute space (attribute_1 vs attribute_2); a new point takes the label of the nearest stored example.]
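A minimal 1-NN sketch of the retain-then-lookup scheme above (the function names are my own):

```python
import math

def euclidean(a, b):
    # Distance between two feature vectors in attribute space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(train, query):
    """Training is just retention: `train` is the full list of
    (feature_vector, label) pairs. Prediction returns the label of
    the stored example closest to `query`."""
    best = min(train, key=lambda ex: euclidean(ex[0], query))
    return best[1]
```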

Page 46:

Similarity Measure

Euclidean distance:

D(c1, c2) = sqrt( Σi=1..N (ai(c1) − ai(c2))^2 )

[Figure: + and o examples in the two-attribute space.]

Page 47:

Booleans, Nominals, Ordinals, and Reals

• Consider attribute value differences:

• Reals: easy! full continuum of differences

• Integers: not bad: discrete set of differences

• Ordinals: not bad: discrete set of differences

• Booleans: less info: use hamming distance

• Nominals: less info: use hamming distance

• For reals, integers, and ordinals the difference is numeric: ai(c1) − ai(c2)
• For Booleans and nominals, use the Hamming distance:
  h(ai(c1), ai(c2)) = 0 if ai(c1) = ai(c2), 1 otherwise

Page 48:

k-Nearest Neighbor

• 1-NN works well if there is no attribute or class noise
• An average of k points is more reliable when there is:
– Noise in attributes
– Noise in class labels
– Partial overlap between classes
• Prediction: a new example is assigned the classification of the majority of its k nearest neighbors.

[Figure: the two-attribute example space; a query point near the class boundary is labeled by a vote of its k nearest + and o neighbors.]
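Extending the 1-NN idea to a majority vote over the k nearest stored examples (names are mine):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k stored examples nearest to `query`.
    `train` is a list of (feature_vector, label) pairs."""
    dist = lambda ex: math.sqrt(sum((a - b) ** 2
                                    for a, b in zip(ex[0], query)))
    nearest = sorted(train, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```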

Page 49:

How to choose “k”

• Large k:– less sensitive to noise (particularly class noise)– better probability estimates for discrete classes– larger training sets allow larger values of k

• Small k:– captures fine structure of space better– may be necessary with small training sets

• Balance must be struck between large and small k

Page 50:

Cross-Validation

• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• Leave-one-out cross-validation (LOOCV):
– “Remove” each case one at a time
– Use it as a test case, with the remaining cases as the training set
– Average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with instance-based methods!
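LOOCV is cheap for instance-based methods because "training" is just withholding one stored case; a sketch (the `predict(train, x)` interface is my own assumption):

```python
def loocv_accuracy(data, predict):
    """Leave-one-out cross-validation: hold out each example in turn,
    'train' on the rest, and average accuracy over all held-out cases.
    `predict(train, x)` can be any classifier, e.g. 1-NN."""
    correct = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        correct += (predict(rest, x) == y)
    return correct / len(data)
```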

Page 51:

Distance-Weighted kNN

• The tradeoff between small and large k can be difficult
– Use large k, but put more emphasis on nearer neighbors?

prediction(test) = Σi=1..k wi · classi / Σi=1..k wi,  where wk = 1 / D(ck, ctest)
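The weighted vote, sketched for numeric class values; the small floor on the distance, guarding against division by zero when a neighbor coincides with the query, is my own addition:

```python
def distance_weighted_knn(train, query, k, dist):
    """Prediction = sum(w_i * class_i) / sum(w_i) over the k nearest
    neighbors, with w_i = 1 / D(c_i, c_test)."""
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    weights = [1.0 / max(dist(x, query), 1e-12) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```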

Page 52:

Locally Weighted Averaging

• Let k = number of training points
• Let the weight fall off rapidly with distance:

wi = e^(−D(ci, ctest)/KernelWidth)

prediction(test) = Σi=1..k wi · classi / Σi=1..k wi

• KernelWidth controls size of neighborhood that has large effect on value (analogous to k)
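Locally weighted averaging differs from distance-weighted kNN only in using all training points with an exponential kernel; a sketch (names are mine):

```python
import math

def locally_weighted_average(train, query, kernel_width, dist):
    """Use every training point, but let the weight fall off
    exponentially with distance: w_i = e^(-D(c_i, c_test)/KernelWidth)."""
    weights = [math.exp(-dist(x, query) / kernel_width) for x, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)
```

A tiny kernel width makes the nearest point dominate; a huge one recovers the plain global mean.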

Page 53:

Similarity Measure

Euclidean distance:

D(c1, c2) = sqrt( Σi=1..N (attri(c1) − attri(c2))^2 )

• Gives all attributes equal weight?
– Only if the scales of the attributes and their differences are similar
– Scale attributes to equal range or equal variance
• Assumes spherical classes

[Figure: a roughly circular cluster of o examples beside a cluster of + examples in the two-attribute space.]

Page 54:

Euclidean Distance?

• Attributes on a larger range affect distance more than attributes on small range

• Some attributes are more/less important than other attributes

• Some attributes may have more/less noise

[Figure: two scatter plots of + and o examples; changing an attribute’s range or noise level changes which neighbors appear closest.]

Page 55:

Weighted Euclidean Distance

• Large weights => attribute is more important
• Small weights => attribute is less important
• Zero weights => attribute doesn’t matter

• Weights allow kNN to be effective with elliptical classes
– Use the weight to normalize for attribute range

D(c1, c2) = sqrt( Σi=1..N wi (ai(c1) − ai(c2))^2 ),  with wi = 1/(maxi − mini)

Page 56:

Curse of Dimensionality

• As the number of dimensions increases, distances between points become larger and more uniform
• If the number of relevant attributes is fixed, increasing the number of less-relevant attributes may swamp the distance
• When there are more irrelevant dimensions than relevant dimensions, distance becomes less reliable
• Solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions

D(c1, c2) = sqrt( Σi=1..relevant (attri(c1) − attri(c2))^2 + Σj=1..irrelevant (attrj(c1) − attrj(c2))^2 )

Page 57:

Advantages of Memory-Based Methods

• Lazy learning: don’t do any work until you know what you want to predict (and from what variables!)
– Never need to learn a global model
– Many simple local models taken together can represent a more complex global model
– Better-focused learning
– Handles missing values, time-varying distributions, ...
• Very efficient cross-validation
• Intelligible learning method to many users
• Nearest neighbors support explanation and training
• Can use any distance metric: string-edit distance, …
• Easy to implement an incremental learning version

Page 58:

Disadvantages of Memory-Based Methods

• Curse of dimensionality:
– Often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn’t seem to perform as well as other methods such as neural nets
• Predicted values for regression are not continuous

Page 59:

A Learning Problem

Assume a two dimensional space with positive and negative examples. Find a rectangle that includes the positive examples but not the negatives (input space is R2):

[Figure: + examples inside and − examples outside the true concept rectangle in R2.]

Page 60:

Definitions

Distribution D. Assume instances are generated at random from a distribution D.

Class of concepts C. Let C be a class of concepts that we wish to learn. In our example, C is the family of all rectangles in R2.

Class of hypotheses H. The hypotheses our algorithm considers while learning the target concept.

True error of a hypothesis h:
errorD(h) = PrD[c(x) ≠ h(x)]

Page 61:

True Error

[Figure: hypothesis h drawn over the true concept c; Region A covers the part of c outside h, Region B the part of h outside c.]

True error is the probability of regions A and B.
Region A: false negatives
Region B: false positives

Page 62:

Learning Algorithm Desiderata

The learning algorithm
• uses a small number of examples
• is computationally efficient
• outputs a hypothesis subject to:
1. The hypothesis does not need to be correct on every sample. The probability of failure is bounded by a constant δ.
2. We don’t require a hypothesis with zero error. There may be some error, as long as it is small (bounded by a constant ε).

A probably approximately correct (PAC) hypothesis.

Page 63:

PAC Learning

A concept class C is PAC-learnable if

• there is a learning algorithm L
• for all target concepts c in C
• for all δ > 0
• for all ε > 0
• and for all distributions D

L, given ε, δ, and a source of examples, produces with probability at least 1−δ a hypothesis h with true error less than ε, in time polynomial in 1/ε, 1/δ, and the size of C.

Page 64:

Example

[Figure: the most specific hypothesis h, the smallest rectangle covering the + examples, drawn inside the true concept rectangle c.]

The learning algorithm: output the smallest rectangle that covers the positive examples.

Is this class of problems (rectangles in R2) PAC learnable by this learning algorithm?

Page 65:

Example

[Figure: the most specific hypothesis h inside the true concept c; the error region is the area between the two rectangles.]

The error is the probability of the area between h and the true target rectangle c.

How many examples do we need to make this error less than ε?

Page 66:

Example: Analysis

In general, the probability that m independent examples have NOT fallen within the error region is (1 − ε)^m, which we want to be less than δ:

(1 − ε)^m ≤ δ

Since (1 − x) ≤ e^(−x), it suffices that

e^(−εm) ≤ δ,  i.e.,  m ≥ (1/ε) ln(1/δ)

The resulting bound grows linearly in 1/ε and logarithmically in 1/δ.
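The bound is easy to evaluate numerically (the function name is my own):

```python
import math

def pac_sample_bound(epsilon, delta):
    """Smallest integer m satisfying m >= (1/epsilon) * ln(1/delta)."""
    return math.ceil(math.log(1.0 / delta) / epsilon)
```

For ε = 0.1 and δ = 0.05 this gives m = 30 examples, and halving δ adds only ln(2)/ε ≈ 7 more.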

Page 67:

Computational Learning Theory

Provides a theoretical analysis of learning: shows when a learning algorithm can be expected to succeed, and when learning may be impossible.

Results due to theoretical analysis:

1. Sample complexity: how many examples do we need to find a good hypothesis?
2. Computational complexity: how much computational power do we need to find a good hypothesis?
3. Mistake bound: how many mistakes will we make before finding a good hypothesis?