Neural Networks for Classification
TRANSCRIPT
Neural Networks for Classification
Andrei Alexandrescu
June 19, 2007
Introduction
Neural Networks: History
■ Modeled after the human brain
■ Experimentation and marketing predated theory
■ Considered the forefront of the AI spring
■ Suffered from the AI winter
■ Theory today still not fully developed and understood
What is a Neural Network?
■ Essentially: a network of interconnected functional elements, each with several inputs and one output

  y(x_1, \dots, x_n) = f(w_1 x_1 + w_2 x_2 + \dots + w_n x_n)    (1)

■ The w_i are parameters
■ f is the activation function
■ Crucial for learning that addition is used for integrating the inputs
Examples of Neural Networks
■ Logical functions with 0/1 inputs and outputs
■ Fourier series:

  F(x) = \sum_{i \ge 0} (a_i \cos(ix) + b_i \sin(ix))    (2)

■ Taylor series:

  F(x) = \sum_{i \ge 0} a_i (x - x_0)^i    (3)

■ Automata
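Both series fit the mold of equation (1): a weighted sum of fixed basis functions of the input. A minimal sketch of a truncated form of equation (2), with made-up coefficients (Python/NumPy is assumed throughout these sketches):

```python
import numpy as np

def truncated_fourier(x, a, b):
    """Weighted sum of cosine/sine basis functions: sum_i a_i*cos(i*x) + b_i*sin(i*x)."""
    i = np.arange(len(a))
    return np.sum(a * np.cos(i * x) + b * np.sin(i * x))

# Hypothetical coefficients for a 4-term approximation.
a = np.array([0.0, 1.0, 0.5, 0.25])
b = np.array([0.0, 0.5, 0.0, 0.1])
print(truncated_fourier(0.5, a, b))
```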
Elements of a Neural Network
■ The function performed by an element
■ The topology of the network
■ The method used to train the weights
Single-Layer Perceptrons
The Perceptron
■ n inputs, one output:

  y(x_1, \dots, x_n) = f(w_1 x_1 + \dots + w_n x_n)    (4)

■ Oldest activation function (McCulloch/Pitts):

  f(v) = \mathbf{1}_{v \ge 0}    (5)
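A minimal sketch of equations (4) and (5) (NumPy assumed; the weights are chosen by hand purely for illustration):

```python
import numpy as np

def perceptron(x, w):
    """n inputs, one output: step activation applied to the weighted sum (equations 4 and 5)."""
    return 1 if np.dot(w, x) >= 0 else 0

# Hand-picked weights: this unit outputs 1 exactly when x1 >= x2.
w = np.array([1.0, -1.0])
print(perceptron(np.array([3.0, 2.0]), w))   # 1
print(perceptron(np.array([0.5, 2.0]), w))   # 0
```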
Perceptron Capabilities
■ Advertised to be as capable as the brain itself
■ Can (only) distinguish between two linearly separable sets
■ Smallest function it cannot decide: XOR
■ Minsky's proof started the AI winter
■ It was not fully understood at the time what connected layers could do
Bias
■ Notice that the decision hyperplane must go through the origin
■ Could be achieved by preprocessing the input
■ Not always desirable or possible
■ Add a bias input:

  y(x_1, \dots, x_n) = f(w_0 + w_1 x_1 + \dots + w_n x_n)    (6)

■ Same as an extra input connected to the constant 1
■ We consider that ghost input implicit henceforth
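A small sketch of the bias trick, under the same assumptions as the previous snippet: the two formulations below agree because the bias weight w_0 is just one more weight attached to a constant-1 "ghost" input.

```python
import numpy as np

def perceptron_with_bias(x, w, w0):
    """Equation (6): explicit bias term w0."""
    return 1 if w0 + np.dot(w, x) >= 0 else 0

def perceptron_ghost_input(x, w_aug):
    """Same unit, with the bias folded into the weight vector as a constant-1 input."""
    return 1 if np.dot(w_aug, np.append(x, 1.0)) >= 0 else 0

x = np.array([0.3, -1.2])
w, w0 = np.array([1.0, 2.0]), -0.5
assert perceptron_with_bias(x, w, w0) == perceptron_ghost_input(x, np.append(w, w0))
```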
Training the Perceptron
■ Switch to vector notation:

  y(x) = f(w \cdot x) = f_w(x)    (7)

■ Assume we need to separate sets of points A and B. Define the error:

  E(w) = \sum_{x \in A} (1 - f_w(x)) + \sum_{x \in B} f_w(x)    (8)

■ Goal: E(w) = 0
■ Start from a random w and improve it
Algorithm
1. Start with random w, set t = 0
2. Select a vector x ∈ A ∪ B
3. If x ∈ A and w_t \cdot x ≤ 0, then w_{t+1} = w_t + x
4. Else if x ∈ B and w_t \cdot x ≥ 0, then w_{t+1} = w_t − x
5. Go back to step 2 while misclassified vectors remain

■ Guaranteed to converge iff A and B are linearly separable! (A runnable sketch follows.)
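A runnable sketch of the algorithm above (NumPy assumed; the data is a made-up linearly separable example, with the bias handled as a constant-1 input). The loop stops when the error E(w) of equation (8) reaches zero:

```python
import numpy as np

def train_perceptron(A, B, max_epochs=100, seed=0):
    """Perceptron learning rule: add misclassified A vectors, subtract misclassified B vectors."""
    w = np.random.default_rng(seed).normal(size=A.shape[1])   # step 1: random w
    for _ in range(max_epochs):
        errors = 0
        for x in A:                      # x in A should satisfy w.x > 0
            if np.dot(w, x) <= 0:
                w = w + x
                errors += 1
        for x in B:                      # x in B should satisfy w.x < 0
            if np.dot(w, x) >= 0:
                w = w - x
                errors += 1
        if errors == 0:                  # E(w) = 0: done
            return w
    return w                             # may never separate if A, B are not linearly separable

# Hypothetical separable data; the trailing 1.0 is the bias ("ghost") input.
A = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0]])
B = np.array([[-1.0, -0.5, 1.0], [-2.0, 0.5, 1.0]])
w = train_perceptron(A, B)
print([int(np.dot(w, x) >= 0) for x in A], [int(np.dot(w, x) >= 0) for x in B])  # [1, 1] [0, 0]
```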
Summary of Simple Perceptrons
■ Simple training
■ Limited capabilities
■ Reasonably efficient training
  ◆ Simplex and linear programming are better
Multi-Layer Perceptrons
Multi-Layer Perceptrons
■ Let's connect the output of a perceptron to the input of another
■ What can we compute with this horizontal combination?
■ (We already take vertical combination for granted)
A Misunderstanding of Epic Proportions
■ Some say “two-layered” network
  ◆ Two cascaded layers of computational units
■ Some say “three-layered” network
  ◆ There is one extra input layer that does nothing
■ Let's arbitrarily choose “three-layered”
  ◆ Input
  ◆ Hidden
  ◆ Output
Workings
■ The hidden layer maps inputs into a second space: “feature space,” “classification space”
■ This makes the job of the output layer easier
Capabilities
■ Each hidden unit computes a linear separation of the input space
■ Several hidden units can carve a polytope in the input space
■ Output units can distinguish polytope membership

⇓

Any union of polytopes can be decided
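To make this concrete, here is a hand-wired sketch (weights picked by hand, not trained): two hidden step units compute OR and AND of the inputs, and the output unit combines them into XOR, the function a single perceptron cannot decide.

```python
import numpy as np

def step(v):
    return (v >= 0).astype(float)

W_hidden = np.array([[1.0, 1.0, -0.5],    # hidden unit 1: OR  (x1 + x2 - 0.5 >= 0)
                     [1.0, 1.0, -1.5]])   # hidden unit 2: AND (x1 + x2 - 1.5 >= 0)
w_out = np.array([1.0, -1.0, -0.5])       # output: OR and not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2, 1.0])           # constant-1 bias input
    h = step(W_hidden @ x)                # hidden layer: maps the input into feature space
    y = step(np.dot(w_out, np.append(h, 1.0)))
    print(x1, x2, int(y))                 # prints the XOR truth table
```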
Training Prerequisite
■ The step function is bad for gradient descent techniques
■ Replace with a smooth step function, the sigmoid:

  f(v) = \frac{1}{1 + e^{-v}}    (9)

■ Notable fact: f'(v) = f(v)(1 − f(v))
■ Makes the function cycles-friendly (the derivative reuses f, so it is cheap to compute)
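A sketch of the smooth activation and its derivative, plus a forward pass through one hidden layer (NumPy assumed; weight shapes and names are illustrative):

```python
import numpy as np

def sigmoid(v):
    """Smooth step function, equation (9)."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    """The notable fact: f'(v) = f(v) * (1 - f(v)), so the derivative reuses f itself."""
    s = sigmoid(v)
    return s * (1.0 - s)

def mlp_forward(x, W_hidden, W_out):
    """Input -> hidden -> output; every unit is a weighted sum followed by the sigmoid."""
    h = sigmoid(W_hidden @ x)     # hidden layer maps the input into feature space
    return sigmoid(W_out @ h)     # output layer decides in that space
```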
Output Activation
■ Simple binary discrimination: zero-centered sigmoid

  f(v) = \frac{1 - e^{-v}}{1 + e^{-v}}    (10)

■ Probability distribution: softmax

  f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}}    (11)
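A minimal sketch of the two output activations of equations (10) and (11) (NumPy assumed):

```python
import numpy as np

def zero_centered_sigmoid(v):
    """Equation (10): output in (-1, 1); algebraically equal to tanh(v / 2)."""
    return (1.0 - np.exp(-v)) / (1.0 + np.exp(-v))

def softmax(v):
    """Equation (11): exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(v)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # a probability distribution over three classes
```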
The Backpropagation Algorithm
■ Works on any differentiable activation function
■ Gradient descent in weight space
■ Metaphor: a ball rolls on the error function's envelope
■ Condition: no flat portion
■ On a flat portion the ball would stop in indifferent equilibrium
■ Some add a slight pull term:

  f(v) = \frac{1 - e^{-v}}{1 + e^{-v}} + cv    (12)
The Task
■ Minimize the error function:

  E = \frac{1}{2} \sum_{i=1}^{p} \| o_i - t_i \|^2    (13)

  where:
  ◆ o_i are the actual outputs
  ◆ t_i are the desired outputs
  ◆ p is the number of patterns
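A one-function sketch of equation (13) (`outputs` and `targets` are hypothetical arrays of shape (p, number of output units)):

```python
import numpy as np

def squared_error(outputs, targets):
    """E = 1/2 * sum over the p patterns of ||o_i - t_i||^2 (equation 13)."""
    return 0.5 * np.sum((outputs - targets) ** 2)
```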
Training. The Delta Rule
■ Compute the gradient:

  \nabla E = \left( \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_l} \right)

■ Update the weights:

  \Delta w_i = -\gamma \frac{\partial E}{\partial w_i}, \quad i = 1, \dots, l    (14)

■ Expect to find a point where \nabla E = 0
■ Algorithm for computing \nabla E: backpropagation
■ Beyond the scope of this class
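A sketch of the update rule (14). Backpropagation itself is out of scope here, so this illustration estimates ∇E by finite differences instead; that is far too slow for real training but shows the delta rule at work. E is assumed to be any function mapping a weight vector to a scalar error:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Finite-difference estimate of dE/dw_i for each weight (a stand-in for backpropagation)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return grad

def delta_rule_step(E, w, gamma=0.1):
    """One gradient-descent step: w_i <- w_i - gamma * dE/dw_i (equation 14)."""
    return w - gamma * numerical_gradient(E, w)
```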
Gradient Locality
■ Only summation guarantees locality of backpropagation
■ Otherwise backpropagation would propagate errors due to one input to all inputs
■ Essential to use summation as input integration!
Regularization
■ Weights can grow uncontrollably
■ Add a regularization term that opposes weight growth:

  \Delta w_i = -\gamma \frac{\partial E}{\partial w_i} - \alpha w_i    (15)

■ Very important practical trick
■ Also avoids overspecialization
■ Forces a smoother output
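The same step with the weight-decay term of equation (15) added (a sketch reusing the hypothetical `numerical_gradient` above; α is a small constant such as 1e-4):

```python
def regularized_step(E, w, gamma=0.1, alpha=1e-4):
    """Equation (15): gradient step plus a decay term that pulls every weight toward zero."""
    return w - gamma * numerical_gradient(E, w) - alpha * w
```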
Local Minima
■ The gradient descent can get stuck in a local minimum
■ Biggest issue with neural networks
■ Overspecialization is the second biggest
■ Convergence is not guaranteed either, but regularization helps
Accommodating Discrete Inputs
Discrete Inputs
■ Many NLP applications foster discrete features
■ Neural nets expect real numbers
■ And they are smooth: similar outputs for similar inputs
■ Any two discrete inputs are “just as different” from each other
■ Treating them as integral numbers is undemocratic (it imposes an arbitrary ordering)
One-Hot Encoding
■ One discrete feature with n values → n real inputs
■ The ith feature value sets the ith input to 1 and all others to 0
■ The Hamming distance between any two distinct inputs is now constant!
■ Disadvantage: the input vector becomes much larger
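A minimal sketch of the encoding (the vocabulary is made up):

```python
import numpy as np

def one_hot(index, n):
    """Encode a discrete value, given as an index in 0..n-1, as an n-dimensional real vector."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

# Hypothetical feature with n = 4 possible values.
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
print(one_hot(vocab["cat"], len(vocab)))    # [0. 1. 0. 0.]
```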
Optimizing One-Hot Encoding
■ Each hidden unit has all inputs zero except the ith one
■ Even that one is just multiplied by 1
■ Regroup weights by discrete input, not by hidden unit!
■ Matrix w of size n × l
■ Input i just copies row i to the output (virtual multiplication by 1)
■ Cheap computation (sketched below)
■ Delta rule applies as usual
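A sketch of the optimization: multiplying an n-dimensional one-hot vector by an n × l weight matrix gives exactly row i, so the matrix product can be skipped entirely (shapes and names are illustrative; `one_hot` is the hypothetical helper above):

```python
import numpy as np

n, l = 4, 3                                          # number of discrete values, hidden-layer size
W = np.random.default_rng(0).normal(size=(n, l))     # one row of l weights per discrete value

i = 1                                                # index of the active feature value
slow = one_hot(i, n) @ W                             # full product with the one-hot input vector
fast = W[i]                                          # the optimization: just copy row i
assert np.allclose(slow, fast)                       # identical results, far cheaper
```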
One-Hot Encoding: Interesting Tidbits
■ The row w_i is a continuous representation of discrete feature i
■ Only one row is trained per sample
■ The size of the continuous representation can be chosen depending on the feature's complexity
■ Mix this continuous representation freely with “truly” continuous features, such as acoustic features
Outputs
Multi-Label Classification
■ n real outputs summing to 1
■ Normalization included in the softmax function:

  f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}} = \frac{e^{v_i - v_{\max}}}{\sum_j e^{v_j - v_{\max}}}    (16)

■ Train with 1 − ε for the known label and ε/(n − 1) for all others (avoids saturation)
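A sketch of the stabilized softmax of equation (16) (subtracting the maximum does not change the result but avoids overflow) together with the smoothed training targets:

```python
import numpy as np

def softmax_stable(v):
    """Equation (16): shift by max(v) before exponentiating to avoid overflow."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def smoothed_target(label, n, eps=0.05):
    """1 - eps for the known label, eps / (n - 1) for every other class (avoids saturation)."""
    t = np.full(n, eps / (n - 1))
    t[label] = 1.0 - eps
    return t

print(smoothed_target(label=2, n=4))   # [~0.017, ~0.017, 0.95, ~0.017]
```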
Soft Training
■ Maybe the targets are a known probability distribution
■ Or we want to reduce the number of training cycles
■ Train with the actual desired distributions as desired outputs
■ Example: for feature vector x, labels l1, l2, l3 are possible with equal probability
■ Train with (1 − ε)/3 for each of the three and ε/(n − 3) for all others
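Continuing the previous sketch, the example's soft targets could be built like this (indices and ε are hypothetical):

```python
import numpy as np

def soft_target(likely_labels, n, eps=0.05):
    """Spread 1 - eps evenly over the plausible labels and eps over all the others."""
    k = len(likely_labels)
    t = np.full(n, eps / (n - k))
    t[likely_labels] = (1.0 - eps) / k
    return t

print(soft_target([0, 2, 5], n=10))    # three entries of ~0.317, seven of ~0.007
```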
NLP Applications
Language Modeling
■ Input: n-gram context
■ May include arbitrary word features (cool!!!)
■ Output: probability distribution of the next word
■ Automatically figures out which features are important
Lexicon Learning
■ Input: Word-level features (root, stem, morph)
■ Input: Most frequent previous/next words
■ Output: Probability distribution of the word's possible POSs
Word Sense Disambiguation
■ Input: bag of words in context, local collocations
■ Output: Probability distribution over senses
Conclusions
Conclusions
■ Neural nets: a respectable machine learning technique
■ Theory not fully developed
■ Local optima and overspecialization are killers
■ Yet can learn very complex functions
■ Long training time
■ Short testing time
■ Small memory requirements