Learning: Nearest Neighbor, Perceptrons & Neural Nets Artificial Intelligence CSPP 56553 February 4, 2004


TRANSCRIPT

Page 1: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Learning: Nearest Neighbor, Perceptrons & Neural Nets

Artificial Intelligence CSPP 56553

February 4, 2004

Page 2: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Nearest Neighbor Example II

• Credit Rating:
  – Classifier: Good / Poor
  – Features:
    • L = # late payments/yr
    • R = Income/Expenses

Name  L   R     G/P
A      0  1.20  G
B     25  0.40  P
C      5  0.70  G
D     20  0.80  P
E     30  0.85  P
F     11  1.20  G
G      7  1.15  G
H     15  0.80  P

Page 3: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Nearest Neighbor Example II

Name  L   R     G/P
A      0  1.20  G
B     25  0.40  P
C      5  0.70  G
D     20  0.80  P
E     30  0.85  P
F     11  1.20  G
G      7  1.15  G
H     15  0.80  P

[Figure: the eight instances A-H plotted in the (L, R) feature plane, L on one axis (10, 20, 30) and R on the other (up to about 1.2)]

Page 4: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Nearest Neighbor Example II

[Figure: the same (L, R) plot with three new query points I, J, K placed among the training instances]

Name  L   R     G/P (nearest-neighbor prediction)
I      6  1.15  → G
J     22  0.45  → P
K     15  1.20  → ??

Distance measure:
sqrt( (L1 - L2)^2 + [sqrt(10) * (R1 - R2)]^2 )
- Scaled distance (see the sketch below)
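A small Python sketch of the classifier on these slides; the training table and the sqrt(10) scaling factor come from the slides, while the function names are just illustrative:

```python
import math

# Training data from the slide above: (name, L = # late payments/yr, R = income/expenses, class)
TRAIN = [("A", 0, 1.20, "G"), ("B", 25, 0.40, "P"), ("C", 5, 0.70, "G"),
         ("D", 20, 0.80, "P"), ("E", 30, 0.85, "P"), ("F", 11, 1.20, "G"),
         ("G", 7, 1.15, "G"), ("H", 15, 0.80, "P")]

def scaled_distance(l1, r1, l2, r2):
    # sqrt((L1 - L2)^2 + [sqrt(10) * (R1 - R2)]^2), as on the slide
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def nearest_neighbor(l, r):
    # Predict the class of the single closest training instance
    _, _, _, label = min(TRAIN, key=lambda t: scaled_distance(l, r, t[1], t[2]))
    return label

for name, l, r in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.20)]:
    print(name, nearest_neighbor(l, r))  # I -> G, J -> P; K is the borderline "??" case
```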

Page 5: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Nearest Neighbor: Issues

• Prediction can be expensive if many features
• Affected by classification, feature noise
  – One entry can change the prediction
• Definition of the distance metric
  – How to combine different features
    • Different types, ranges of values
• Sensitive to feature selection

Page 6: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Efficient Implementations

• Classification cost:
  – Find nearest neighbor: O(n)
    • Compute the distance between the unknown and all instances
    • Compare distances
  – Problematic for large data sets
• Alternative:
  – Use binary search to reduce to O(log n)

Page 7: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Efficient Implementation: K-D Trees

• Divide instances into sets based on features
  – Binary branching: e.g. > value
  – 2^d leaves with d-split paths; 2^d = n => d = O(log n)
  – To split cases into sets:
    • If there is one element in the set, stop
    • Otherwise pick a feature to split on
      – Find the average position of the two middle objects on that dimension
      – Split the remaining objects based on that average position
      – Recursively split the subsets (see the sketch below)
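A rough Python sketch of the splitting procedure described above, assuming 2-D points paired with a class label; the alternating feature choice and tie handling are simplifications:

```python
def build_kd_tree(points, depth=0):
    """points: list of ((f0, f1), label); returns a nested dict, or a label at a leaf."""
    if len(points) == 1:                        # one element in the set: stop
        return points[0][1]
    axis = depth % 2                            # pick a feature to split on (alternate here)
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    # Split value = average position of the two middle objects on that dimension
    split = (points[mid - 1][0][axis] + points[mid][0][axis]) / 2.0
    return {"axis": axis, "split": split,
            "left": build_kd_tree(points[:mid], depth + 1),    # values <= split
            "right": build_kd_tree(points[mid:], depth + 1)}   # values > split

def kd_classify(tree, point):
    while isinstance(tree, dict):               # one comparison per level: O(log n) for balanced data
        tree = tree["left"] if point[tree["axis"]] <= tree["split"] else tree["right"]
    return tree
```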

Page 8: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

K-D Trees: Classification

[Decision-tree figure over the credit data: the root tests R > 0.825?; its branches test L > 17.5? and L > 9?; the next level tests R > 0.6?, R > 0.75?, R > 1.025?, and R > 1.175?; the leaves are labeled Good or Poor]

Page 9: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Efficient Implementation: Parallel Hardware

• Classification cost:
  – # distance computations
    • Constant time if O(n) processors
  – Cost of finding the closest
    • Compute pairwise minima, successively
    • O(log n) time

Page 10: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Nearest Neighbor: Analysis

• Issue:
  – What features should we use?
    • E.g. credit rating: many possible features
      – Tax bracket, debt burden, retirement savings, etc.
  – Nearest neighbor uses ALL of them
  – Irrelevant feature(s) could mislead
• Fundamental problem with nearest neighbor

Page 11: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Nearest Neighbor: Advantages

• Fast training:
  – Just record the set of feature vector / output value pairs
• Can model a wide variety of functions
  – Complex decision boundaries
  – Weak inductive bias

• Very generally applicable

Page 12: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Summary: Nearest Neighbor

• Nearest neighbor:
  – Training: record input vectors + output values
  – Prediction: closest training instance to the new data
• Efficient implementations
• Pros: fast training, very general, little bias
• Cons: distance metric (scaling), sensitivity to noise & extraneous features

Page 13: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Learning: Perceptrons

Artificial Intelligence CSPP 56553

February 4, 2004

Page 14: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Agenda

• Neural Networks:
  – Biological analogy
• Perceptrons: single-layer networks
• Perceptron training
• Perceptron convergence theorem
• Perceptron limitations

• Conclusions

Page 15: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neurons: The Concept

[Diagram of a neuron: dendrites, cell body, nucleus, axon]

Neurons: receive inputs from other neurons (via synapses).
When the input exceeds a threshold, the neuron "fires" and sends output along its axon to other neurons.
Brain: 10^11 neurons, 10^16 synapses

Page 16: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Artificial Neural Nets

• Simulated neuron:
  – Node connected to other nodes via links
    • Links model the axon + synapse + dendrite pathway
    • Links associated with a weight (like a synapse)
      – Multiplied by the output of the node
  – Node combines inputs via an activation function
    • E.g. sum of weighted inputs passed through a threshold
• Simpler than real neuronal processes

Page 17: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Artificial Neural Net

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding a unit that sums the weighted inputs and applies a threshold to produce the output]

Page 18: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptrons

• Single neuron-like element
  – Binary inputs
  – Binary outputs
• Fires when the weighted sum of inputs > threshold

Page 19: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Structure

[Diagram: inputs x0 = 1, x1, x2, x3, ..., xn with weights w0, w1, w2, w3, ..., wn feeding the output y]

y = 1 if Σ_{i=0..n} w_i x_i > 0, otherwise y = 0

x0 = 1, so w0 compensates for the threshold
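The decision rule above as a short Python function (x[0] is fixed at 1, so w[0] absorbs the threshold):

```python
def perceptron_output(w, x):
    # y = 1 if sum_i w_i * x_i > 0, else 0; x[0] = 1, so w[0] plays the role of the threshold
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
```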

Page 20: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Example

• Logical-OR: linearly separable
  – 00: 0; 01: 1; 10: 1; 11: 1

[Figure, shown twice on the slide: the OR points in the (x1, x2) plane, a single 0 at the origin and three + points, with a straight line separating them]

Page 21: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Convergence Procedure

• Straightforward training procedure
  – Learns linearly separable functions
• Until the perceptron yields the correct output for all samples:
  – If the perceptron is correct, do nothing
  – If the perceptron is wrong:
    • If it incorrectly says "yes", subtract the input vector from the weight vector
    • Otherwise, add the input vector to the weight vector

Page 22: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Convergence Example

• LOGICAL-OR:

Sample  x0  x1  x2  Desired output
1       1   0   0   0
2       1   0   1   1
3       1   1   0   1
4       1   1   1   1

• Initial: w = (0 0 0); after S2, w = w + s2 = (1 0 1)
• Pass 2: S1: w = w - s1 = (0 0 1); S3: w = w + s3 = (1 1 1)
• Pass 3: S1: w = w - s1 = (0 1 1)
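A sketch of the convergence procedure from the previous slide applied to these LOGICAL-OR samples; it reproduces the weight trace above and stops at w = (0, 1, 1):

```python
SAMPLES = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]  # ((x0, x1, x2), desired)

def train_perceptron(samples, w=(0, 0, 0)):
    w = list(w)
    while True:
        errors = 0
        for x, desired in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y == desired:
                continue                       # correct: do nothing
            errors += 1
            sign = 1 if desired == 1 else -1   # wrongly "no": add x; wrongly "yes": subtract x
            w = [wi + sign * xi for wi, xi in zip(w, x)]
        if errors == 0:
            return tuple(w)

print(train_perceptron(SAMPLES))  # -> (0, 1, 1), as in the trace above
```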

Page 23: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Convergence Theorem

• If there exists a weight vector that separates the data (i.e. the function is linearly separable), perceptron training will find one.
• Assume there is a unit vector v and a margin δ such that v · x >= δ for all positive examples x.
• After k mistakes the weight vector is w = x1 + x2 + ... + xk, so v · w >= kδ.
• ||w||^2 increases by at most ||x||^2 in each iteration:
  ||w + x||^2 <= ||w||^2 + ||x||^2 <= k ||x||^2 (taking ||x|| as the largest example norm).
• Since v · w / ||w|| <= 1, we get kδ / (k^(1/2) ||x||) <= 1.
• So training converges in k <= (||x|| / δ)^2 steps.

Page 24: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Learning

• Perceptrons learn linear decision boundaries
• E.g.:

[Figure: + and 0 points in the (x1, x2) plane, separable by a straight line]

But not XOR:

[Figure: XOR in the (x1, x2) plane; the two + points and the two 0 points sit on opposite diagonals, so no single line separates them]

x1  x2
-1  -1   w1*x1 + w2*x2 < 0
 1  -1   w1*x1 + w2*x2 > 0  => implies w1 > 0
-1   1   w1*x1 + w2*x2 > 0  => implies w2 > 0
 1   1   w1*x1 + w2*x2 > 0 then follows, but the output should be false

The constraints are contradictory, so no weights w1, w2 implement XOR (demonstrated in the sketch below).
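Running the same training loop on XOR never reaches zero errors, matching the argument above; this sketch simply caps the number of passes (the cap of 100 is an arbitrary choice):

```python
XOR = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]  # x0 = 1 bias input

def try_train(samples, max_passes=100):
    w = [0, 0, 0]
    for _ in range(max_passes):
        errors = 0
        for x, d in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y != d:
                errors += 1
                w = [wi + (xi if d == 1 else -xi) for wi, xi in zip(w, x)]
        if errors == 0:
            return w
    return None             # never converges: XOR is not linearly separable

print(try_train(XOR))       # -> None
```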

Page 25: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Example

• Digit recognition
  – Assume the display = 8 lightable bars
  – Inputs: on/off + threshold
  – 65 steps to recognize "8"

Page 26: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Perceptron Summary

• Motivated by neuron activation
• Simple training procedure
• Guaranteed to converge
  – IF the function is linearly separable

Page 27: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Nets

• Multi-layer perceptrons
  – Inputs: real-valued
  – Intermediate "hidden" nodes
  – Output(s): one (or more) discrete-valued

[Diagram: inputs X1-X4 feed two layers of hidden nodes, which feed outputs Y1 and Y2]

Page 28: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Nets

• Pro: more general than perceptrons
  – Not restricted to linear discriminants
  – Multiple outputs: one classification each
• Con: no simple, guaranteed training procedure
  – Use a greedy, hill-climbing procedure to train
  – "Gradient descent", "Backpropagation"

Page 29: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Solving the XOR Problem

[Network diagram: inputs x1, x2 feed hidden nodes o1 and o2 via weights w11, w21, w12, w22; o1 and o2 feed the output y via weights w13, w23; each node also has a -1 bias input with weight w01, w02, or w03]

Network topology: 2 hidden nodes, 1 output

Desired behavior:
x1  x2  o1  o2  y
0   0   0   0   0
1   0   0   1   1
0   1   0   1   1
1   1   1   1   0

Weights:
w11 = w12 = 1; w21 = w22 = 1
w01 = 3/2; w02 = 1/2; w03 = 1/2
w13 = -1; w23 = 1
(checked in the sketch below)
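A direct check of the weights on this slide using step-threshold units; with these values o1 behaves like AND, o2 like OR, and y reproduces the desired-behavior table:

```python
def step(z):
    return 1 if z > 0 else 0

# Weights from the slide; each unit subtracts its bias via the -1 input
w11 = w12 = 1; w21 = w22 = 1
w01, w02, w03 = 1.5, 0.5, 0.5
w13, w23 = -1, 1

def xor_net(x1, x2):
    o1 = step(w11 * x1 + w21 * x2 - w01)   # hidden node 1: acts like AND(x1, x2)
    o2 = step(w12 * x1 + w22 * x2 - w02)   # hidden node 2: acts like OR(x1, x2)
    return o1, o2, step(w13 * o1 + w23 * o2 - w03)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, *xor_net(x1, x2))        # matches the desired-behavior table above
```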

Page 30: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Applications

• Speech recognition

• Handwriting recognition

• NETtalk: Letter-to-sound rules

• ALVINN: Autonomous driving

Page 31: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

ALVINN

• Driving as a neural network
• Inputs:
  – Image pixel intensities
    • I.e. lane lines
• 5 hidden nodes
• Outputs:
  – Steering actions
    • E.g. turn left/right; how far
• Training:
  – Observe human behavior: sample images, steering

Page 32: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Backpropagation

• Greedy, hill-climbing procedure
  – The weights are the parameters to change
  – The original hill-climb changes one parameter per step
    • Slow
  – If the function is smooth, change all parameters per step
    • Gradient descent
  – Backpropagation: computes the current output, then works backward to correct the error

Page 33: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Producing a Smooth Function

• Key problem:
  – A pure step threshold is discontinuous
    • Not differentiable
• Solution:
  – Sigmoid (squashed 's' function): the logistic fn

z = Σ_{i=1..n} w_i x_i
s(z) = 1 / (1 + e^(-z))
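The logistic function written out in Python; sigmoid_deriv is included here only because the derivative s(z)(1 - s(z)) is used later in backpropagation:

```python
import math

def weighted_sum(w, x):
    # z = sum_{i=1..n} w_i * x_i
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    # s(z) = 1 / (1 + e^(-z)): a smooth, differentiable stand-in for the step threshold
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # Used later in backpropagation: ds/dz = s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1 - s)
```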

Page 34: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Training

• Goal:
  – Determine how to change the weights to get the correct output
    • A large change in a weight should produce a large reduction in error
• Approach:
  – Compute the actual output: o
  – Compare to the desired output: d
  – Determine the effect of each weight w on the error = d - o
  – Adjust the weights

Page 35: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Example

[Diagram: a 2-2-1 network; inputs x1, x2 feed hidden nodes (weighted sums z1, z2, outputs y1, y2) via weights w11, w21, w12, w22; the hidden outputs feed the output node (weighted sum z3, output y3) via w13, w23; each node has a -1 bias input with weight w01, w02, or w03]

Sum-of-squares error over the training samples:
E = 1/2 Σ_i (y_i* - F(x_i, w))^2
  x_i : ith sample input vector
  w   : weight vector
  y_i*: desired output for the ith sample

Full expression of the output in terms of the input and weights:
z1 = w11 x1 + w21 x2 - w01,  z2 = w12 x1 + w22 x2 - w02,  z3 = w13 s(z1) + w23 s(z2) - w03
y3 = F(x, w) = s(z3) = s( w13 s(w11 x1 + w21 x2 - w01) + w23 s(w12 x1 + w22 x2 - w02) - w03 )

From MIT 6.034 notes, Lozano-Perez
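The full expression above as a Python forward pass; the names (z1, z2, z3, y1, y2, y3, w11 ... w03) follow the slide, and the -1 bias inputs carry w01, w02, w03:

```python
import math

def s(z):
    # Logistic activation
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    # w: dict of the weights named on the slide; -1 bias inputs carry w01, w02, w03
    z1 = w["w11"] * x1 + w["w21"] * x2 - w["w01"]
    z2 = w["w12"] * x1 + w["w22"] * x2 - w["w02"]
    y1, y2 = s(z1), s(z2)
    z3 = w["w13"] * y1 + w["w23"] * y2 - w["w03"]
    return y1, y2, s(z3)            # y3 = F(x, w)

def sum_squares_error(samples, w):
    # E = 1/2 * sum_i (y_i* - F(x_i, w))^2
    return 0.5 * sum((y_star - forward(x1, x2, w)[2]) ** 2
                     for (x1, x2), y_star in samples)
```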

Page 36: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Gradient Descent

• Error: sum-of-squares error of the inputs with the current weights
• Compute the rate of change of the error w.r.t. each weight
  – Which weights have the greatest effect on the error?
  – Effectively, partial derivatives of the error w.r.t. the weights
    • These in turn depend on other weights => chain rule

Page 37: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Gradient Descent

• E = G(w)
  – Error as a function of the weights
• Find the rate of change of the error
  – Follow the steepest rate of change
  – Change the weights s.t. the error is minimized

[Figure: error curve G(w) plotted against w, showing the slope dG/dw, local minima, and successive weight values w0, w1]

Page 38: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Gradient of Error

[Diagram: the same 2-2-1 network, with inputs x1, x2, weighted sums z1, z2, z3, outputs y1, y2, y3, and -1 bias inputs]

E = 1/2 Σ_i (y_i* - F(x_i, w))^2

y3 = F(x, w) = s( w13 s(w11 x1 + w21 x2 - w01) + w23 s(w12 x1 + w22 x2 - w02) - w03 )
(z1 and z2 are the weighted inputs to the hidden nodes, z3 the weighted input to the output node)

∂E/∂w_j = -Σ_i (y_i* - y3) ∂y3/∂w_j

Output-layer weight w13:
∂y3/∂w13 = ∂s(z3)/∂w13 = [∂s(z3)/∂z3] [∂z3/∂w13] = s(z3)(1 - s(z3)) y1

Hidden-layer weight w11:
∂y3/∂w11 = [∂s(z3)/∂z3] [∂z3/∂w11]
         = s(z3)(1 - s(z3)) w13 ∂s(z1)/∂w11
         = s(z3)(1 - s(z3)) w13 s(z1)(1 - s(z1)) x1

Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))

From MIT 6.034 AI lecture notes, Lozano-Perez 2000

Page 39: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

From Effect to Update

• Gradient computation:
  – How each weight contributes to performance
• To train:
  – Need to determine how to CHANGE a weight based on its contribution to performance
  – Need to determine how MUCH change to make per iteration
    • Rate parameter 'r'
      – Large enough to learn quickly
      – Small enough to reach, but not overshoot, the target values

Page 40: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Backpropagation Procedure

• Pick a rate parameter 'r'
• Until performance is good enough:
  – Do a forward computation to calculate the output
  – Compute beta in the output node z with
      β_z = d_z - o_z
  – Compute beta in all other nodes j with
      β_j = Σ_k w_{j→k} o_k (1 - o_k) β_k
  – Compute the change for all weights with
      Δw_{i→j} = r o_i o_j (1 - o_j) β_j

[Diagram: a chain of nodes i → j → k with link weights w_{i→j}, w_{j→k}, node outputs o_i, o_j, and slope terms o_j(1 - o_j), o_k(1 - o_k)]

Page 41: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Backprop Example

[Diagram: the same 2-2-1 network, with inputs x1, x2, hidden outputs y1, y2, output y3, weights w11 ... w23, and -1 bias inputs carrying w01, w02, w03]

Forward prop: compute z_i and y_i given x_k, w_l

β3 = y3* - y3
β2 = y3 (1 - y3) w23 β3
β1 = y3 (1 - y3) w13 β3

w13 ← w13 + r y1 y3 (1 - y3) β3
w23 ← w23 + r y2 y3 (1 - y3) β3
w03 ← w03 + r (-1) y3 (1 - y3) β3

w11 ← w11 + r x1 y1 (1 - y1) β1
w21 ← w21 + r x2 y1 (1 - y1) β1
w01 ← w01 + r (-1) y1 (1 - y1) β1

w12 ← w12 + r x1 y2 (1 - y2) β2
w22 ← w22 + r x2 y2 (1 - y2) β2
w02 ← w02 + r (-1) y2 (1 - y2) β2
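A self-contained sketch of one online backpropagation step for this 2-2-1 network, following the betas and weight changes listed above; r is the rate parameter, and sweeping the training samples repeatedly with this step is the usual way to drive the error down:

```python
import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x1, x2, y3_star, w, r=1.0):
    # Forward prop: compute the node outputs given the inputs and current weights
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
    # Betas, as on this slide
    b3 = y3_star - y3
    b2 = y3 * (1 - y3) * w["w23"] * b3
    b1 = y3 * (1 - y3) * w["w13"] * b3
    # Weight change = r * (input to the link) * o_j * (1 - o_j) * beta_j
    w["w13"] += r * y1 * y3 * (1 - y3) * b3
    w["w23"] += r * y2 * y3 * (1 - y3) * b3
    w["w03"] += r * -1 * y3 * (1 - y3) * b3
    w["w11"] += r * x1 * y1 * (1 - y1) * b1
    w["w21"] += r * x2 * y1 * (1 - y1) * b1
    w["w01"] += r * -1 * y1 * (1 - y1) * b1
    w["w12"] += r * x1 * y2 * (1 - y2) * b2
    w["w22"] += r * x2 * y2 * (1 - y2) * b2
    w["w02"] += r * -1 * y2 * (1 - y2) * b2
    return w
```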

Page 42: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Backpropagation Observations

• The procedure is (relatively) efficient
  – All computations are local
    • Use the inputs and outputs of the current node
• What is "good enough"?
  – Rarely reach the target (0 or 1) outputs exactly
    • Typically, train until within 0.1 of the target

Page 43: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Summary

• Training:
  – Backpropagation procedure
    • Gradient descent strategy (usual problems)
• Prediction:
  – Compute outputs based on the input vector & weights
• Pros: very general, fast prediction
• Cons: training can be VERY slow (1000's of epochs), overfitting

Page 44: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Training Strategies

• Online training:
  – Update weights after each sample
• Offline (batch) training:
  – Compute the error over all samples
    • Then update the weights
• Online training is "noisy"
  – Sensitive to individual instances
  – However, it may escape local minima

Page 45: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Training Strategy

• To avoid overfitting:
  – Split data into: training, validation, & test sets
    • Also, avoid excess weights (fewer than # samples)
  – Initialize with small random weights
    • Small changes have a noticeable effect
  – Use offline training
    • Until the validation-set minimum
  – Evaluate on the test set
    • No more weight changes (see the sketch below)
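A schematic of this strategy in Python; train_epoch and error are hypothetical stand-ins for one offline backprop pass and an error measure, not functions defined in these slides:

```python
def train_with_early_stopping(train_set, valid_set, test_set, weights,
                              train_epoch, error, max_epochs=1000):
    # weights: small random initial weights; train_epoch: one offline (batch) pass over
    # the training set; error: evaluation function -- all hypothetical stand-ins
    best_w, best_valid = weights, error(valid_set, weights)
    for _ in range(max_epochs):
        weights = train_epoch(train_set, weights)     # batch update over all training samples
        v = error(valid_set, weights)
        if v < best_valid:                            # keep the weights at the validation minimum
            best_w, best_valid = weights, v
    # Evaluate once on the held-out test set; no more weight changes after this
    return best_w, error(test_set, best_w)
```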

Page 46: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Classification

• Neural networks are best for classification tasks
  – Single output -> binary classifier
  – Multiple outputs -> multiway classification
    • Applied successfully to learning pronunciation
  – The sigmoid pushes outputs toward a binary classification
• Not good for regression

Page 47: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Example

• NETtalk: letter-to-sound by net
• Inputs:
  – Need context to pronounce
    • 7-letter window: predict the sound of the middle letter
    • 29 possible characters: alphabet + space + ',' + '.'
  – 7 * 29 = 203 inputs
• 80 hidden nodes
• Output: generate 60 phones
  – Nodes map to 26 units: 21 articulatory, 5 stress/syllable
    • Vector quantization of acoustic space

Page 48: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Example: NETtalk

• Learning to talk:
  – 5 iterations / 1024 training words: bound/stress
  – 10 iterations: intelligible
  – 400 new test words: 80% correct
• Not as good as DECtalk, but automatic

Page 49: Learning: Nearest Neighbor,  Perceptrons & Neural Nets

Neural Net Conclusions

• Simulation based on neurons in the brain
• Perceptrons (single neuron)
  – Guaranteed to find a linear discriminant
    • IF one exists -> problem: XOR
• Neural nets (multi-layer perceptrons)
  – Very general
  – Backpropagation training procedure
    • Gradient descent: local minima, overfitting issues