
Page 1: Neural Networks for Classification


Neural Networks for Classification

Andrei Alexandrescu

June 19, 2007

Page 2: Neural Networks for Classification

■ Introduction
◆ Neural Networks: History
◆ What is a Neural Network?
◆ Examples of Neural Networks
◆ Elements of a Neural Network
■ Single-Layer Perceptrons
■ Multi-Layer Perceptrons
■ Accommodating Discrete Inputs
■ Outputs
■ NLP Applications
■ Conclusions

Page 3: Neural Networks for Classification

Neural Networks: History


■ Modeled after the human brain
■ Experimentation and marketing predated theory
■ Considered the forefront of the AI spring
◆ Suffered from the AI winter
■ Theory today still not fully developed and understood

Page 4: Neural Networks for Classification

What is a Neural Network?


■ Essentially: a network of interconnected functional elements, each with several inputs and one output

y(x_1, …, x_n) = f(w_1 x_1 + w_2 x_2 + … + w_n x_n)   (1)

■ w_i are the parameters
■ f is the activation function
■ Crucial for learning that addition is used for integrating the inputs
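
As a rough illustration, one such element can be written in a few lines of Python (the weights, inputs, and the sigmoid used for f below are made-up examples, not values from the slides; activations are discussed later):

```python
import numpy as np

def unit(x, w, f):
    """One functional element: y = f(w1*x1 + ... + wn*xn)."""
    return f(np.dot(w, x))

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))  # one possible activation f
x = np.array([0.5, -1.0, 2.0])                # illustrative inputs
w = np.array([0.1, 0.4, -0.3])                # illustrative parameters
print(unit(x, w, sigmoid))
```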

Page 5: Neural Networks for Classification

Examples of Neural Networks


■ Logical functions with 0/1 inputs and outputs
■ Fourier series:

F(x) = ∑_{i≥0} (a_i cos(ix) + b_i sin(ix))   (2)

■ Taylor series:

F(x) = ∑_{i≥0} a_i (x − x_0)^i   (3)

■ Automata

Page 6: Neural Networks for Classification

Elements of a Neural Network


■ The function performed by an element
■ The topology of the network
■ The method used to train the weights

Page 7: Neural Networks for Classification

Single-Layer Perceptrons


Page 8: Neural Networks for Classification

The Perceptron


■ n inputs, one output:

y(x_1, …, x_n) = f(w_1 x_1 + … + w_n x_n)   (4)

■ Oldest activation function (McCulloch/Pitts):

f(v) = 1_{v≥0}(v)   (5)
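
A minimal sketch of this perceptron in Python, using the McCulloch/Pitts step activation of equation (5) (the sample weights and inputs are invented):

```python
import numpy as np

def step(v):
    """McCulloch/Pitts activation: 1 if v >= 0, else 0 (equation 5)."""
    return 1.0 if v >= 0 else 0.0

def perceptron(x, w):
    """n inputs, one output: y = f(w1*x1 + ... + wn*xn)."""
    return step(np.dot(w, x))

print(perceptron(np.array([1.0, -2.0]), np.array([0.5, 0.25])))  # 0.5 - 0.5 = 0, so output 1.0
```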

Page 9: Neural Networks for Classification

Perceptron Capabilities


■ Advertised to be as extensive as the brain itself
■ Can (only) distinguish between two linearly separable sets
■ Smallest undecidable function: XOR
■ Minsky’s proof started the AI winter
■ It was not fully understood what connected layers could do

Page 10: Neural Networks for Classification

Bias


■ Notice that the decision hyperplane must go through the origin
■ Could be achieved by preprocessing the input
■ Not always desirable or possible
■ Add a bias input:

y(x_1, …, x_n) = f(w_0 + w_1 x_1 + … + w_n x_n)   (6)

■ Same as an input connected to the constant 1
■ We consider that ghost input implicit henceforth

Page 11: Neural Networks for Classification

Training the Perceptron


■ Switch to vector notation:

y(x) = f(w · x) = f_w(x)   (7)

■ Assume we need to separate sets of points A and B.

E(w) = ∑_{x∈A} (1 − f_w(x)) + ∑_{x∈B} f_w(x)   (8)

■ Goal: E(w) = 0
■ Start from a random w and improve it

Page 12: Neural Networks for Classification

Algorithm


1. Start with random w, set t = 0
2. Select a vector x ∈ A ∪ B
3. If x ∈ A and w_t · x ≤ 0, then w_{t+1} = w_t + x
4. Else if x ∈ B and w_t · x ≥ 0, then w_{t+1} = w_t − x
5. Conditionally go to step 2

■ Guaranteed to converge iff A and B are linearly separable!
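
A sketch of the algorithm above in Python; instead of picking one vector at a time it sweeps over all points each pass, and the two toy point sets are invented for illustration:

```python
import numpy as np

def train_perceptron(A, B, max_steps=1000, seed=0):
    """Perceptron learning: find w with w.x > 0 for x in A and w.x < 0 for x in B,
    assuming A and B are linearly separable (bias handled by a constant-1 input)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=A.shape[1])     # start with a random w
    for _ in range(max_steps):
        done = True
        for x in A:                     # misclassified A point: pull w toward x
            if np.dot(w, x) <= 0:
                w = w + x
                done = False
        for x in B:                     # misclassified B point: push w away from x
            if np.dot(w, x) >= 0:
                w = w - x
                done = False
        if done:                        # every point classified: converged
            return w
    return w

# Points augmented with a constant 1 so the last weight acts as the bias w0.
A = np.array([[1.0, 2.0, 1.0], [2.0, 1.5, 1.0]])      # should map to 1
B = np.array([[-1.0, -1.0, 1.0], [-2.0, 0.5, 1.0]])   # should map to 0
print(train_perceptron(A, B))
```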

Page 13: Neural Networks for Classification

Summary of Simple Perceptrons


■ Simple training
■ Limited capabilities
■ Reasonably efficient training
◆ Simplex and linear programming are better

Page 14: Neural Networks for Classification

Multi-Layer Perceptrons


Page 15: Neural Networks for Classification

Multi-Layer Perceptrons


■ Let’s connect the output of a perceptron to the input of another
■ What can we compute with this horizontal combination?
■ (We already take vertical combination for granted)

Page 16: Neural Networks for Classification

A Misunderstanding of Epic Proportions


■ Some say “two-layered” network
◆ Two cascaded layers of computational units
■ Some say “three-layered” network
◆ There is one extra input layer that does nothing
■ Let’s arbitrarily choose “three-layered”
◆ Input
◆ Hidden
◆ Output

Page 17: Neural Networks for Classification

Workings


■ The hidden layer maps inputs into a second space: “feature space,” “classification space”
■ This makes the job of the output layer easier

Page 18: Neural Networks for Classification

Capabilities


■ Each hidden unit computes a linear separation of the input space
■ Several hidden units can carve a polytope in the input space
■ Output units can distinguish polytope membership
◆ Any union of polytopes can be decided

Page 19: Neural Networks for Classification

Training Prerequisite


■ The step function is bad for gradient descent techniques
■ Replace it with a smooth step function:

f(v) = 1 / (1 + e^{−v})   (9)

■ Notable fact: f′(v) = f(v)(1 − f(v))
■ Makes the function cycles-friendly: the derivative reuses the already-computed f(v)
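
A small sketch of the smooth step and its derivative; note how the gradient is computed from the forward output alone, with no extra exponentials:

```python
import numpy as np

def sigmoid(v):
    """Smooth step: f(v) = 1 / (1 + e^{-v})."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_grad(y):
    """Derivative expressed through the already-computed output y = f(v)."""
    return y * (1.0 - y)

v = np.array([-2.0, 0.0, 2.0])
y = sigmoid(v)
print(y, sigmoid_grad(y))
```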

Page 20: Neural Networks for Classification

Output Activation


■ Simple binary discrimination: zero-centered sigmoid

f(v) = (1 − e^{−v}) / (1 + e^{−v})   (10)

■ Probability distribution: softmax

f(v_i) = e^{v_i} / ∑_j e^{v_j}   (11)

Page 21: Neural Networks for Classification

The Backpropagation Algorithm


■ Works on any differentiable activation function
■ Gradient descent in weight space
■ Metaphor: a ball rolls on the error function’s envelope
■ Condition: no flat portion
■ The ball would stop in indifferent equilibrium
■ Some add a slight pull term:

f(v) = (1 − e^{−v}) / (1 + e^{−v}) + cv   (12)

Page 22: Neural Networks for Classification

The Task


■ Minimize the error function:

E = (1/2) ∑_{i=1}^{p} ‖o_i − t_i‖^2   (13)

where:

◆ o_i are the actual outputs
◆ t_i are the desired outputs
◆ p is the number of patterns

Page 23: Neural Networks for Classification

Training. The Delta Rule


■ Compute ∇E = (∂E/∂w_1, …, ∂E/∂w_l)
■ Update the weights:

∆w_i = −γ ∂E/∂w_i,   i = 1, …, l   (14)

■ Expect to find a point where ∇E = 0
■ Algorithm for computing ∇E: backpropagation
■ Beyond the scope of this class
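
A sketch of the update rule, assuming some routine grad_E that returns ∇E (backpropagation itself is not spelled out, matching the slide); the quadratic toy error is invented for illustration:

```python
import numpy as np

def gradient_step(w, grad_E, gamma=0.1):
    """Delta rule: w_i <- w_i - gamma * dE/dw_i for every weight."""
    return w - gamma * grad_E(w)

# Toy example: E(w) = ||w - target||^2 / 2, so grad E = w - target.
target = np.array([1.0, -2.0])
grad_E = lambda w: w - target
w = np.zeros(2)
for _ in range(100):
    w = gradient_step(w, grad_E)
print(w)  # approaches the minimum, where grad E = 0
```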

Page 24: Neural Networks for Classification

Gradient Locality


■ Only summation guarantees locality of backpropagation
■ Otherwise backpropagation would propagate errors due to one input to all inputs
■ Essential to use summation as input integration!

Page 25: Neural Networks for Classification

Regularization


■ Weights can grow uncontrollably
■ Add a regularization term that opposes weight growth:

∆w_i = −γ ∂E/∂w_i − α w_i   (15)

■ Very important practical trick
■ Also avoids overspecialization
■ Forces a smoother output
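
The same update with the weight-decay term of equation (15) added (the γ and α values below are illustrative, not recommendations):

```python
import numpy as np

def regularized_step(w, grad_E, gamma=0.1, alpha=0.01):
    """Delta rule plus weight decay: each step also shrinks the weights a little,
    which is equivalent to penalizing large ||w|| in the error function."""
    return w - gamma * grad_E(w) - alpha * w

# Toy gradient of E = ||w - t||^2 / 2 with t = [3, -3].
grad_E = lambda w: w - np.array([3.0, -3.0])
w = np.zeros(2)
for _ in range(200):
    w = regularized_step(w, grad_E)
print(w)  # settles slightly short of [3, -3]: the decay term opposes weight growth
```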

Page 26: Neural Networks for Classification

Local Minima


■ The gradient surf can stop in a local minimum
■ Biggest issue with neural networks
■ Overspecialization is the second biggest
■ Convergence not guaranteed either, but regularization helps

Page 27: Neural Networks for Classification

Accommodating Discrete Inputs


Page 28: Neural Networks for Classification

Discrete Inputs


■ Many NLP applications foster discrete features
■ Neural nets expect real numbers
■ Smooth: similar outputs for similar inputs
■ Any two discrete inputs are “just as different”
■ Treating them as integral numbers is undemocratic

Page 29: Neural Networks for Classification

One-Hot Encoding


■ One discrete feature with n values → n real inputs
■ The ith feature value sets the ith input to 1 and the others to 0
■ The Hamming distance between any two distinct inputs is now constant!
■ Disadvantage: the input vector is much larger
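
A sketch of the encoding; the three-value vocabulary is a made-up example:

```python
import numpy as np

def one_hot(index, n):
    """Encode discrete value `index` out of n possible values as n real inputs."""
    x = np.zeros(n)
    x[index] = 1.0
    return x

vocab = ["cat", "dog", "fish"]                   # illustrative discrete feature values
print(one_hot(vocab.index("dog"), len(vocab)))   # [0. 1. 0.]
```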

Page 30: Neural Networks for Classification

Optimizing One-Hot Encoding


■ Each hidden unit has all inputs zero except the ith one
■ Even that one is just multiplied by 1
■ Regroup weights by discrete input, not by hidden unit!
■ Matrix w of size n × l
■ Input i just copies row i to the output (virtual multiplication by 1)
■ Cheap computation
■ Delta rule applies as usual
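
A sketch of the regrouped weights: with a one-hot input, the full matrix product reduces to copying one row of the hypothetical n × l matrix W:

```python
import numpy as np

n, l = 5, 3                        # n discrete values, l hidden units (illustrative sizes)
W = np.random.randn(n, l)          # weights regrouped by discrete value: one row per value

def hidden_contribution(i):
    """For one-hot value i, the hidden layer's input is simply row i of W."""
    return W[i]

i = 2
x = np.zeros(n)
x[i] = 1.0                                           # the one-hot vector
assert np.allclose(x @ W, hidden_contribution(i))    # row copy equals the full product
print(hidden_contribution(i))
```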

Page 31: Neural Networks for Classification

One-Hot Encoding: Interesting Tidbits


■ The row w_i is a continuous representation of discrete feature i
■ Only one row is trained per sample
■ The size of the continuous representation can be chosen depending on the feature’s complexity
■ Mix this continuous representation freely with “truly” continuous features, such as acoustic features

Page 32: Neural Networks for Classification

Outputs


Page 33: Neural Networks for Classification

Multi-Label Classification


■ n real outputs summing to 1
■ Normalization is included in the softmax function:

f(v_i) = e^{v_i} / ∑_j e^{v_j} = e^{v_i − v_max} / ∑_j e^{v_j − v_max}   (16)

■ Train with 1 − ε for the known label and ε/(n − 1) for all others (avoids saturation)
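
A sketch of the normalized softmax and of the smoothed training targets (the ε value is illustrative):

```python
import numpy as np

def softmax(v):
    """Equation (16): subtracting the max changes nothing but avoids overflow."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def smoothed_target(label, n, eps=0.05):
    """1 - eps for the known label, eps/(n - 1) for every other output."""
    t = np.full(n, eps / (n - 1))
    t[label] = 1.0 - eps
    return t

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # large inputs, no overflow
print(smoothed_target(label=1, n=4))                  # sums to 1
```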

Page 34: Neural Networks for Classification

Soft Training


■ Maybe the targets are a known probability distribution
■ Or we want to reduce the number of training cycles
■ Train with the actual desired distributions as target outputs
■ Example: for feature vector x, labels l_1, l_2, l_3 are possible with equal probability
■ Train with (1 − ε)/3 for the three and ε/(n − 3) for all others
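
A sketch of the corresponding target vector, generalized to k equally likely labels (the label indices and ε are made up):

```python
import numpy as np

def soft_target(labels, n, eps=0.05):
    """(1 - eps)/k for the k possible labels, eps/(n - k) for all other outputs."""
    k = len(labels)
    t = np.full(n, eps / (n - k))
    t[list(labels)] = (1.0 - eps) / k
    return t

print(soft_target(labels=[0, 2, 5], n=8))  # three equally likely labels out of 8 outputs
```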

Page 35: Neural Networks for Classification

NLP Applications


Page 36: Neural Networks for Classification

Language Modeling


■ Input: n-gram context
■ May include arbitrary word features (cool!!!)
■ Output: probability distribution of the next word
■ Automatically figures out which features are important

Page 37: Neural Networks for Classification

Lexicon Learning


■ Input: Word-level features (root, stem, morph)
■ Input: Most frequent previous/next words
■ Output: Probability distribution of the word’s possible POSs

Page 38: Neural Networks for Classification

Word Sense Disambiguation


■ Input: bag of words in context, local collocations
■ Output: Probability distribution over senses

Page 39: Neural Networks for Classification

Conclusions


Page 40: Neural Networks for Classification

Conclusions


■ Neural nets respectable machine learningtechnique

■ Theory not fully developed■ Local optima and overspecialization are

killers■ Yet can learn very complex functions■ Long training time■ Short testing time■ Small memory requirements