Learning: Nearest Neighbor, Perceptrons & Neural Nets
Artificial Intelligence, CSPP 56553
February 4, 2004
Nearest Neighbor Example II
• Credit Rating:
  – Classifier: Good / Poor
  – Features:
    • L = # late payments/yr
    • R = Income/Expenses

Name  L   R     G/P
A     0   1.2   G
B     25  0.4   P
C     5   0.7   G
D     20  0.8   P
E     30  0.85  P
F     11  1.2   G
G     7   1.15  G
H     15  0.8   P
Nearest Neighbor Example II
(same table as above)
[Scatter plot of instances A–H in the (L, R) feature plane: L axis ticked at 10, 20, 30; R axis ticked at 1.]
Nearest Neighbor Example II
[Scatter plot as above, now with three query points added: I lands among the Good instances, J among the Poor instances, and K is marked "??".]

Name  L   R     G/P
I     6   1.15  G
J     22  0.45  P
K     15  1.2   ??
Distance Measure:
d = sqrt((L1 - L2)^2 + (sqrt(10) * (R1 - R2))^2)
– Scaled distance: the sqrt(10) factor rescales R so both features contribute comparably, as sketched below
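A minimal Python sketch of this classifier, using the table and the sqrt(10) scaling from the slides; the function names are illustrative:

```python
import math

# Training data from the slides: (name, L, R, rating)
DATA = [
    ("A", 0, 1.2, "G"), ("B", 25, 0.4, "P"), ("C", 5, 0.7, "G"),
    ("D", 20, 0.8, "P"), ("E", 30, 0.85, "P"), ("F", 11, 1.2, "G"),
    ("G", 7, 1.15, "G"), ("H", 15, 0.8, "P"),
]

def scaled_distance(l1, r1, l2, r2):
    """Scaled Euclidean distance: R is stretched by sqrt(10)
    so both features contribute comparably."""
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    """Predict the rating of the single nearest training instance."""
    nearest = min(DATA, key=lambda row: scaled_distance(l, r, row[1], row[2]))
    return nearest[3]

# Query points I, J, K from the slides (K is marked "??" there)
for name, l, r in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.2)]:
    print(name, classify(l, r))
```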
Nearest Neighbor: Issues
• Prediction can be expensive if many features
• Affected by classification and feature noise
  – One noisy entry can change the prediction
• Definition of the distance metric
  – How to combine different features
  – Different types and ranges of values
• Sensitive to feature selection
Efficient Implementations
• Classification cost:
  – Find nearest neighbor: O(n)
    • Compute distance between the unknown and all instances
    • Compare distances
  – Problematic for large data sets
• Alternative:
  – Use binary search to reduce to O(log n)
Efficient Implementation: K-D Trees
• Divide instances into sets based on features
  – Binary branching: e.g., feature > value
  – 2^d leaves; with d splits along a path and 2^d = n, d = O(log n)
• To split cases into sets (sketched below):
  – If there is one element in the set, stop
  – Otherwise pick a feature to split on
    • Find the average position of the two middle objects on that dimension
    • Split the remaining objects based on that average position
    • Recursively split the subsets
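A rough Python sketch of this construction, plus the corresponding classification walk; cycling the split feature by depth is an assumption (the slide only says "pick a feature"), and the names are illustrative:

```python
def build_kd(points, depth=0):
    """points: list of (features, label). Split recursively as the
    slide describes: stop at one element, otherwise pick a feature
    (cycled by depth here) and split at the average position of the
    two middle objects on that dimension."""
    if len(points) <= 1:
        return {"leaf": points[0][1]} if points else None
    axis = depth % len(points[0][0])
    pts = sorted(points, key=lambda p: p[0][axis])
    mid = len(pts) // 2
    split = (pts[mid - 1][0][axis] + pts[mid][0][axis]) / 2.0
    return {
        "axis": axis, "split": split,
        "left": build_kd(pts[:mid], depth + 1),
        "right": build_kd(pts[mid:], depth + 1),
    }

def classify_kd(tree, x):
    """Walk from the root: take one branch per '> value' test."""
    while "leaf" not in tree:
        tree = tree["right"] if x[tree["axis"]] > tree["split"] else tree["left"]
    return tree["leaf"]
```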
K-D Trees: Classification
[Decision-tree diagram: the root tests R > 0.825; the second level tests L > 17.5 (no branch) and L > 9 (yes branch); the third level tests R against 0.6, 0.75, 1.025, and 1.175; the leaves are labeled Good or Poor.]
Efficient Implementation: Parallel Hardware
• Classification cost: # of distance computations
  – Constant time with O(n) processors
• Cost of finding the closest:
  – Compute pairwise minima, successively (see the sketch below)
  – O(log n) time
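A sequential Python simulation of the pairwise-minimum reduction; real parallel hardware would perform each round's comparisons simultaneously, so only O(log n) rounds are needed:

```python
def tournament_min(dists):
    """Simulate the pairwise-minimum reduction: each round halves the
    list, so O(log n) rounds of parallel comparisons suffice."""
    dists = list(dists)
    while len(dists) > 1:
        nxt = [min(dists[i], dists[i + 1]) for i in range(0, len(dists) - 1, 2)]
        if len(dists) % 2:          # odd element passes through
            nxt.append(dists[-1])
        dists = nxt
    return dists[0]

print(tournament_min([4.0, 2.5, 7.1, 3.3, 0.9]))  # 0.9
```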
Nearest Neighbor: Analysis
• Issue: what features should we use?
  – E.g., credit rating has many possible features: tax bracket, debt burden, retirement savings, etc.
  – Nearest neighbor uses ALL of them
  – Irrelevant feature(s) could mislead the distance metric
• A fundamental problem with nearest neighbor
Nearest Neighbor: Advantages
• Fast training:
  – Just record each feature vector with its output value
• Can model a wide variety of functions
  – Complex decision boundaries
  – Weak inductive bias
• Very generally applicable
Summary: Nearest Neighbor
• Nearest neighbor:
  – Training: record input vectors + output values
  – Prediction: output of the closest training instance to the new data
• Efficient implementations
• Pros: fast training, very general, little bias
• Cons: distance metric (scaling), sensitivity to noise & extraneous features
Learning: Perceptrons
Artificial Intelligence, CSPP 56553
February 4, 2004
Agenda
• Neural networks: biological analogy
• Perceptrons: single-layer networks
• Perceptron training
• Perceptron convergence theorem
• Perceptron limitations
• Conclusions
Neurons: The Concept
[Diagram of a neuron: dendrites, cell body, nucleus, axon]
Neurons receive inputs from other neurons (via synapses). When the combined input exceeds a threshold, the neuron "fires" and sends output along its axon to other neurons.
Brain: ~10^11 neurons, ~10^16 synapses
Artificial Neural Nets
• Simulated neuron:
  – Node connected to other nodes via links
    • Links play the role of axon + synapse + dendrite
    • Each link carries a weight (like synaptic strength), multiplied by the output of the source node
  – Node combines inputs via an activation function
    • E.g., sum of weighted inputs passed through a threshold
• Much simpler than real neuronal processes
Artificial Neural Net
[Diagram: inputs x1..xn, each multiplied by a weight w1..wn, summed, and passed through a threshold to produce the output]
Perceptrons
• Single neuron-like element
  – Binary inputs
  – Binary outputs
  – Fires when the weighted sum of inputs > threshold
Perceptron Structure
[Diagram: inputs x0 = 1, x1, x2, ..., xn with weights w0, w1, ..., wn feeding a single output y]

y = 1 if Σ_{i=0}^{n} w_i x_i > 0, 0 otherwise

The fixed input x0 = 1 with weight w0 compensates for the threshold (w0 acts as a bias), as the sketch below shows.
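A minimal Python sketch of this unit (names illustrative):

```python
def perceptron_output(weights, inputs):
    """y = 1 if sum_i w_i * x_i > 0 else 0, with x0 = 1 prepended so
    that w0 plays the role of the (negated) threshold."""
    xs = [1] + list(inputs)
    total = sum(w * x for w, x in zip(weights, xs))
    return 1 if total > 0 else 0
```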
Perceptron Example
• Logical OR is linearly separable:
  – 00: 0; 01: 1; 10: 1; 11: 1
[Plot: the four OR inputs in the (x1, x2) plane; (0,0) is the only 0, and a straight line separates it from the three + points. Two different separating lines are shown.]
Perceptron Convergence Procedure
• Straightforward training procedure
  – Learns linearly separable functions
• Until the perceptron yields the correct output for all samples (sketched below):
  – If the perceptron is correct, do nothing
  – If the perceptron is wrong:
    • If it incorrectly says "yes", subtract the input vector from the weight vector
    • Otherwise, add the input vector to the weight vector
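A sketch of this procedure in Python, assuming each sample's input vector already includes the fixed x0 = 1 component; the names and the pass limit are illustrative:

```python
def train_perceptron(samples, passes=10):
    """samples: list of (input_vector_with_x0, desired_output).
    Implements the slide's rule: add the input vector on a false
    'no', subtract it on a false 'yes'."""
    w = [0] * len(samples[0][0])
    for _ in range(passes):
        changed = False
        for x, d in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y == d:
                continue
            changed = True
            w = [wi + xi if d == 1 else wi - xi for wi, xi in zip(w, x)]
        if not changed:          # correct on every sample: done
            return w
    return w
```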
Perceptron Convergence Example
• Logical OR:

Sample  x0  x1  x2  Desired Output
1       1   0   0   0
2       1   0   1   1
3       1   1   0   1
4       1   1   1   1

• Initial: w = (0,0,0); after S2, w = w + s2 = (1,0,1)
• Pass 2: S1: w = w − s1 = (0,0,1); S3: w = w + s3 = (1,1,1)
• Pass 3: S1: w = w − s1 = (0,1,1)
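Running the sketch above on these samples reproduces the trace:

```python
OR_SAMPLES = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(train_perceptron(OR_SAMPLES))   # -> [0, 1, 1], matching the trace above
```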
Perceptron Convergence Theorem
• If there exists a vector v (||v|| = 1) such that v·x ≥ δ for all positive examples x, perceptron training will find a separating weight vector
• Proof sketch: after k updates, w = x1 + x2 + ... + xk (treating a subtracted negative example as adding −x), so v·w ≥ kδ
• ||w||^2 increases by at most max ||x||^2 in each iteration, since w·x ≤ 0 whenever an update is made:
  ||w + x||^2 ≤ ||w||^2 + ||x||^2, so ||w||^2 ≤ k · max ||x||^2
• Since v·w / (||v|| ||w||) ≤ 1: kδ ≤ v·w ≤ ||w|| ≤ k^(1/2) · max ||x||
• Converges in k ≤ (max ||x|| / δ)^2 steps
Perceptron Learning
• Perceptrons learn linear decision boundaries
• E.g.:
[Plot: two classes (+ and 0) in the (x1, x2) plane, separated by a straight line]
• But not XOR:
[Plot: XOR in the (x1, x2) plane; the two + points and two 0 points sit on opposite diagonals, and no single line separates them]

x1  x2   Constraint
−1  −1   w1·x1 + w2·x2 < 0
 1  −1   w1·x1 + w2·x2 > 0  => implies w1 > 0
 1   1   w1·x1 + w2·x2 > 0  => but should be false
−1   1   w1·x1 + w2·x2 > 0  => implies w2 > 0

With w1 > 0 and w2 > 0, the (1, 1) case is forced positive, so no weights work: a contradiction.
Perceptron Example
• Digit recognition
  – Assume display = 8 lightable bars
  – Inputs: on/off + threshold
  – 65 steps to recognize "8"
Perceptron Summary
• Motivated by neuron activation
• Simple training procedure
• Guaranteed to converge, IF the data are linearly separable
Neural Nets
• Multi-layer perceptrons
  – Inputs: real-valued
  – Intermediate "hidden" nodes
  – Output(s): one (or more) discrete-valued
[Diagram: inputs X1–X4 feeding two hidden layers, which feed outputs Y1 and Y2]
Neural Nets
• Pro: more general than perceptrons
  – Not restricted to linear discriminants
  – Multiple outputs: one classification each
• Con: no simple, guaranteed training procedure
  – Use a greedy, hill-climbing procedure to train
  – "Gradient descent", "backpropagation"
Solving the XOR Problem
[Network diagram: inputs x1, x2 and bias inputs (−1) feeding two hidden nodes o1, o2, which together with a bias feed the output y; weights w11, w21, w01 into o1; w12, w22, w02 into o2; w13, w23, w03 into y]

Network topology: 2 hidden nodes, 1 output

Desired behavior:
x1  x2  o1  o2  y
0   0   0   0   0
1   0   0   1   1
0   1   0   1   1
1   1   1   1   0

Weights:
w11 = w12 = 1; w21 = w22 = 1
w01 = 3/2; w02 = 1/2; w03 = 1/2
w13 = −1; w23 = 1

With these weights, o1 computes AND, o2 computes OR, and y fires only when o2 is on and o1 is off, i.e., XOR (verified in the sketch below).
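A quick Python check of this construction, using a step threshold and the weight values from the slide (function names illustrative):

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """2 hidden nodes + 1 output with the weights from the slide.
    o1 computes AND, o2 computes OR, y = o2 AND NOT o1 = XOR."""
    o1 = step(1 * x1 + 1 * x2 - 1.5)          # w11=w21=1, w01=3/2
    o2 = step(1 * x1 + 1 * x2 - 0.5)          # w12=w22=1, w02=1/2
    return step(-1 * o1 + 1 * o2 - 0.5)       # w13=-1, w23=1, w03=1/2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # matches the desired-behavior table
```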
Neural Net Applications
• Speech recognition
• Handwriting recognition
• NETtalk: Letter-to-sound rules
• ALVINN: Autonomous driving
ALVINN
• Driving as a neural network
• Inputs:
  – Image pixel intensities, i.e., lane lines
• 5 hidden nodes
• Outputs:
  – Steering actions, e.g., turn left/right, and how far
• Training:
  – Observe human behavior: sample images + steering
Backpropagation
• Greedy, hill-climbing procedure
  – Weights are the parameters to change
  – The original hill-climbing changes one parameter per step: slow
  – If the function is smooth, change all parameters per step
• Gradient descent
  – Backpropagation: computes the current output, then works backward to correct the error
Producing a Smooth Function
• Key problem: the pure step threshold is discontinuous
  – Not differentiable
• Solution: sigmoid (squashed 's' function): the logistic function

z = Σ_{i=1}^{n} w_i x_i ;  s(z) = 1 / (1 + e^(−z))
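In Python (names illustrative):

```python
import math

def sigmoid(z):
    """Logistic function s(z) = 1 / (1 + e^-z): a smooth,
    differentiable stand-in for the step threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    """ds/dz = s(z) * (1 - s(z)), used later in backpropagation."""
    s = sigmoid(z)
    return s * (1 - s)
```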
Neural Net Training
• Goal:
  – Determine how to change the weights to get correct output
  – A large change in a weight should produce a large reduction in error
• Approach:
  – Compute the actual output: o
  – Compare to the desired output: d
  – Determine the effect of each weight w on the error = d − o
  – Adjust the weights
Neural Net Example
[Network diagram as in the XOR slide: inputs x1, x2 and bias inputs (−1); hidden nodes with pre-activations z1, z2 and outputs y1, y2; output node with pre-activation z3 and output y3; weights w11, w21, w01, w12, w22, w02, w13, w23, w03]

E = ½ Σ_i (y_i* − F(x_i, w))^2

xi: ith sample input vector; w: weight vector; yi*: desired output for the ith sample
Sum-of-squares error over the training samples

Full expression of the output in terms of inputs and weights (transcribed into Python below):
y3 = F(x, w) = s(w13·s(w11·x1 + w21·x2 − w01) + w23·s(w12·x1 + w22·x2 − w02) − w03)
             = s(w13·s(z1) + w23·s(z2) − w03) = s(z3)

From 6.034 notes, Lozano-Perez
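A direct transcription of this forward computation in Python, reusing sigmoid() from the earlier sketch; representing the weights as a dict keyed by the slide's names is an assumption for readability:

```python
def forward(x1, x2, w):
    """Forward pass of the 2-2-1 network above; w is a dict keyed by
    the slide's weight names. Returns the intermediate outputs needed
    later for the gradient."""
    z1 = w["w11"] * x1 + w["w21"] * x2 - w["w01"]
    z2 = w["w12"] * x1 + w["w22"] * x2 - w["w02"]
    y1, y2 = sigmoid(z1), sigmoid(z2)
    z3 = w["w13"] * y1 + w["w23"] * y2 - w["w03"]
    y3 = sigmoid(z3)
    return y1, y2, y3
```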
Gradient Descent
• Error: sum-of-squares error of the training inputs with the current weights
• Compute the rate of change of the error with respect to each weight
  – Which weights have the greatest effect on the error?
  – Effectively, partial derivatives of the error with respect to the weights
    • These in turn depend on other weights => chain rule
Gradient Descent
• E = G(w): error as a function of the weights
• Find the rate of change of the error
  – Follow the steepest rate of change
  – Change the weights so that the error is minimized
[Plot: error E vs. weight w for a curve G(w), with slope dG/dw marked at a point; local minima near w0 and w1 illustrate where descent can get stuck]
MIT AI lecture notes, Lozano-Perez 2000
Gradient of Error
[Same 2-2-1 network diagram: inputs x1, x2 and bias inputs (−1); hidden pre-activations z1, z2 with outputs y1, y2; output node z3, y3; weights as labeled above]

y3 = F(x, w) = s(w13·s(z1) + w23·s(z2) − w03) = s(z3)
E = ½ Σ_i (y_i* − F(x_i, w))^2

∂E/∂w_j = Σ_i (y3 − y_i*) · ∂y3/∂w_j

For a weight into the output node:
∂y3/∂w13 = ∂s(z3)/∂w13 = (∂s(z3)/∂z3) · (∂z3/∂w13) = s'(z3) · y1

For a weight into a hidden node, the chain rule continues through z1:
∂y3/∂w11 = s'(z3) · (∂z3/∂w11) = s'(z3) · w13 · (∂s(z1)/∂w11) = s'(z3) · w13 · s'(z1) · x1

Note: the derivative of the sigmoid is ds(z)/dz = s(z) · (1 − s(z))

From 6.034 notes, Lozano-Perez
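One way to sanity-check these chain-rule expressions is a finite-difference estimate against forward() from the earlier sketch; this check is illustrative, not part of the original notes:

```python
def numeric_grad(x1, x2, target, w, name, eps=1e-5):
    """Finite-difference estimate of dE/dw[name] for the one-sample
    error E = 0.5 * (target - y3)^2. Should closely match the
    chain-rule value, e.g. (y3 - target) * y3 * (1 - y3) * y1 for w13."""
    def err(wts):
        _, _, y3 = forward(x1, x2, wts)
        return 0.5 * (target - y3) ** 2
    hi = dict(w); hi[name] += eps
    lo = dict(w); lo[name] -= eps
    return (err(hi) - err(lo)) / (2 * eps)
```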
From Effect to Update
• The gradient computation tells us how each weight contributes to performance
• To train:
  – Need to determine how to CHANGE each weight based on its contribution to performance
  – Need to determine how MUCH change to make per iteration
    • Rate parameter 'r'
      – Large enough to learn quickly
      – Small enough to reach, but not overshoot, the target values
Backpropagation Procedure
• Pick a rate parameter 'r'
• Until performance is good enough:
  – Do a forward computation to calculate the outputs
  – Compute beta in the output node z with: β_z = d_z − o_z
  – Compute beta in all other nodes j with: β_j = Σ_k w_{j→k} · o_k · (1 − o_k) · β_k
  – Compute the change for every weight with: Δw_{i→j} = r · o_i · o_j · (1 − o_j) · β_j
[Diagram: nodes i → j → k with weights w_{i→j} and w_{j→k}; each node's output o contributes the factor o(1 − o) on the backward pass]
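A sketch of one iteration of this procedure for the 2-2-1 network above, reusing forward(); the learning rate and the dict representation are illustrative:

```python
def backprop_step(x1, x2, d, w, r=0.5):
    """One forward + backward pass for the 2-2-1 network, using the
    slide's beta formulation."""
    y1, y2, y3 = forward(x1, x2, w)
    b3 = d - y3                                  # output beta
    b1 = w["w13"] * y3 * (1 - y3) * b3           # hidden betas
    b2 = w["w23"] * y3 * (1 - y3) * b3
    # delta w(i->j) = r * o_i * o_j * (1 - o_j) * beta_j
    # (the bias inputs are the constant -1, as in the diagram)
    for name, o_i, o_j, b in [
        ("w13", y1, y3, b3), ("w23", y2, y3, b3), ("w03", -1, y3, b3),
        ("w11", x1, y1, b1), ("w21", x2, y1, b1), ("w01", -1, y1, b1),
        ("w12", x1, y2, b2), ("w22", x2, y2, b2), ("w02", -1, y2, b2),
    ]:
        w[name] += r * o_i * o_j * (1 - o_j) * b
    return w
```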
Backprop Example
[Same 2-2-1 network diagram: inputs x1, x2 and bias inputs (−1); hidden outputs y1, y2; output y3]

Forward prop: compute the z_i and y_i given the x_k and w_l

β3 = y3* − y3
β1 = y3·(1 − y3)·w13·β3
β2 = y3·(1 − y3)·w23·β3

Weight updates:
w03 ← w03 + r·(−1)·y3·(1 − y3)·β3
w02 ← w02 + r·(−1)·y2·(1 − y2)·β2
w01 ← w01 + r·(−1)·y1·(1 − y1)·β1
w13 ← w13 + r·y1·y3·(1 − y3)·β3
w23 ← w23 + r·y2·y3·(1 − y3)·β3
w11 ← w11 + r·x1·y1·(1 − y1)·β1
w12 ← w12 + r·x1·y2·(1 − y2)·β2
w21 ← w21 + r·x2·y1·(1 − y1)·β1
w22 ← w22 + r·x2·y2·(1 − y2)·β2
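Putting the pieces together: an illustrative online-training loop on XOR using backprop_step(). The epoch count and seed are arbitrary, and like any gradient descent run it can stall in a local minimum:

```python
import random

random.seed(0)  # illustrative seed
w = {k: random.uniform(-0.5, 0.5) for k in
     ("w01", "w02", "w03", "w11", "w12", "w13", "w21", "w22", "w23")}
XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# Online training: one backprop_step per sample per epoch.
for epoch in range(20000):
    for x1, x2, d in XOR:
        backprop_step(x1, x2, d, w)

for x1, x2, d in XOR:
    print(x1, x2, d, round(forward(x1, x2, w)[2], 3))
```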
Backpropagation Observations
• The procedure is (relatively) efficient
  – All computations are local: each uses only the inputs and outputs of the current node
• What is "good enough"?
  – Outputs rarely reach the targets (0 or 1) exactly
  – Typically, train until within 0.1 of the target
Neural Net Summary
• Training: backpropagation procedure
  – Gradient descent strategy (with the usual problems)
• Prediction: compute outputs based on the input vector & weights
• Pros: very general; fast prediction
• Cons: training can be VERY slow (1000's of epochs); overfitting
Training Strategies
• Online training: update weights after each sample
• Offline (batch) training: compute the error over all samples, then update the weights
• Online training is "noisy"
  – Sensitive to individual instances
  – However, it may escape local minima
(A batch-style sketch for contrast follows.)
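For contrast with the online loop shown earlier, a batch-style epoch under the same dict representation: it accumulates all per-sample weight changes, then applies one update at the end (a sketch, reusing forward()):

```python
def batch_epoch(samples, w, r=0.5):
    """Offline/batch training: sum the weight changes over all
    samples, then apply a single update per epoch."""
    total = {k: 0.0 for k in w}
    for x1, x2, d in samples:
        y1, y2, y3 = forward(x1, x2, w)
        b3 = d - y3
        b1 = w["w13"] * y3 * (1 - y3) * b3
        b2 = w["w23"] * y3 * (1 - y3) * b3
        for name, o_i, o_j, b in [
            ("w13", y1, y3, b3), ("w23", y2, y3, b3), ("w03", -1, y3, b3),
            ("w11", x1, y1, b1), ("w21", x2, y1, b1), ("w01", -1, y1, b1),
            ("w12", x1, y2, b2), ("w22", x2, y2, b2), ("w02", -1, y2, b2),
        ]:
            total[name] += r * o_i * o_j * (1 - o_j) * b
    for name in w:                 # one update per epoch
        w[name] += total[name]
```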
Training Strategy
• To avoid overfitting:
  – Split the data into training, validation, & test sets
  – Also, avoid excess weights (fewer weights than samples)
• Initialize with small random weights
  – So small changes have a noticeable effect
• Use offline training, until the error on the validation set reaches a minimum
• Evaluate on the test set, with no further weight changes
Classification
• Neural networks are best suited to classification tasks
  – Single output -> binary classifier
  – Multiple outputs -> multiway classification
  – Applied successfully to learning pronunciation
• The sigmoid pushes outputs toward binary classification
  – Not good for regression
Neural Net Example
• NETtalk: letter-to-sound by neural net
• Inputs:
  – Need context to pronounce a letter
    • 7-letter window: predict the sound of the middle letter
    • 29 possible characters: alphabet + space + ',' + '.'
  – 7 × 29 = 203 inputs
• 80 hidden nodes
• Output: generates 60 phones
  – Output nodes map to 26 units: 21 articulatory, 5 stress/syllable
  – Vector quantization of acoustic space
Neural Net Example: NETtalk
• Learning to talk:
  – 5 iterations / 1024 training words: gets word boundaries and stress
  – 10 iterations: intelligible
  – 400 new test words: 80% correct
• Not as good as DECtalk, but automatic
Neural Net Conclusions
• Simulation based on neurons in the brain
• Perceptrons (single neuron)
  – Guaranteed to find a linear discriminant, IF one exists (problem: XOR)
• Neural nets (multi-layer perceptrons)
  – Very general
  – Backpropagation training procedure
    • Gradient descent: local minima, overfitting issues