Basics of Neural Nets and Past-Tense Model

Page 1: Basics of Neural Nets and Past-Tense model

Basics of Neural Nets and Past-Tense model

References to read

Chapter 10 in Copeland’s Artificial Intelligence.

Chapter 2 in Aleksander and Morton ‘Neurons and Symbols’.

Chapters 1, 3 and 5 in Beale and Jackson, ‘Neural Computing: An Introduction’.

Chapter 18 in Rich and Knight, ‘Artificial Intelligence’.

Page 2: Basics of Neural Nets and Past-Tense model
Page 3: Basics of Neural Nets and Past-Tense model

The Human Brain

Contains approximately ten thousand million basic units, called Neurons.

Each neuron is connected to many others.

Neuron is basic unit of the brain.

A stand-alone analogue logical processing unit.

Only basic details of neurons really understood.

Neuron accepts many inputs, which are all added up (in some fashion).

If enough active inputs received at once, neuron will be activated, and fire. If not, remains in inactive quiet state.

Soma is the body of neuron.

Page 4: Basics of Neural Nets and Past-Tense model

Attached to soma are long filaments: dendrites.

Dendrites act as connections through which all the inputs to the neuron arrive.

Axon: electrically active. Serves as output channel of neuron.

Axon is a non-linear threshold device. Produces a pulse, called an action potential, when the resting potential within the soma rises above some threshold level.

Axon terminates in synapse which couples axon with dendrite of another cell.

No direct linkage, temporary chemical one. Synapse releases neurotransmitters which chemically activate gates on dendrites.

These gates, when open, allow charged ions to flow.

Page 5: Basics of Neural Nets and Past-Tense model

These charged ions alter the dendritic potential, providing a voltage pulse on the dendrite which is conducted to the next neuron's body/soma.

A single neuron will have many synaptic inputs on its dendrites, and may have many synaptic outputs connecting it to other cells.

Page 6: Basics of Neural Nets and Past-Tense model

(Figure: diagram of a neuron, with the axon and synapse labelled.)

Page 7: Basics of Neural Nets and Past-Tense model
Page 8: Basics of Neural Nets and Past-Tense model

Learning: occurs when modifications made to effective coupling between one cell and another at the synaptic junction.

More neurotransmitters are released, which opens more gates in dendrite.

i.e. coupling is adjusted to favourably reinforce good connections.

Page 9: Basics of Neural Nets and Past-Tense model

The human brain: poorly understood, but capable of immensely impressive tasks.

For example: vision, speech recognition, learning etc. Also fault tolerant: distributed processing, with many simple processing elements sharing each job. Therefore can tolerate some faults without producing nonsense.

Graceful degradation: with continual damage, performance gradually falls from high level to reduced level, but without dropping catastrophically to zero.

(Computers do not exhibit graceful degradation: intolerant of faults).

Idea behind neural computing: by modelling major features of the brain and its operation, we can produce computers that exhibit many of the useful properties of the brain.

Page 10: Basics of Neural Nets and Past-Tense model
Page 11: Basics of Neural Nets and Past-Tense model

Modelling single neuron

Important features to model:

The output from a neuron is either on or off.

The output depends only on the inputs: a certain number must be on at any one time in order to make the neuron fire.

The efficiency of the synapses at coupling the incoming signal into the cell body can be modelled by having a multiplicative factor (I.e. weights) on each of the inputs to the neuron.

More efficient synapse has correspondingly larger weight.

Page 12: Basics of Neural Nets and Past-Tense model

Total input = (weight on line 1 × input on line 1) + (weight on line 2 × input on line 2) + … + (weight on line n × input on line n)

Basic model: performs weighted sum of inputs, compares this to internal threshold level, and turns on if this level exceeded.

This model of the neuron was proposed in 1943 by McCulloch and Pitts.
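As a minimal sketch (an illustration, not from the original slides; the function name and example weights are assumptions), the McCulloch-Pitts unit fits in a few lines of Python:

```python
# A minimal sketch of a McCulloch-Pitts style unit (illustrative names/values).
def mcp_neuron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Example: with both weights 1 and threshold 1.5, the unit computes logical AND.
print(mcp_neuron([1, 1], [1.0, 1.0], 1.5))  # -> 1
print(mcp_neuron([1, 0], [1.0, 1.0], 1.5))  # -> 0
```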

Model of neuron, not a copy: does not have complex patterns and timings of actual nervous activity in real neural systems.

Because it is simplified, can implement on digital computer (Logic gates and neural nets!)

Remember: it is only one more metaphor of the brain!

Page 13: Basics of Neural Nets and Past-Tense model

Learning in simple neurons

Training nets of interconnected units.

Essentially, if a neuron produces incorrect output, we want to reduce the chances of it happening again. If it gives correct output, do nothing.

For example, think of problem of teaching neural net to tell the difference between a set of handwritten As and a set of handwritten Bs.

In other words, to output a 1 when an A is presented, and a 0 when a B is presented.

Start up with random weights on input lines and present an A.

Neuron performs weighted sum of inputs and compares this to threshold.

Page 14: Basics of Neural Nets and Past-Tense model

If it exceeds threshold, output a 1, otherwise output of 0.

If correct, do nothing.

If it outputs a 0 (when A is presented) increase weighted sum so next time it will output a 1

Do this by increasing the weights.

If it outputs a 1 in response to a B, decrease the weights so next time output will be 0.

Page 15: Basics of Neural Nets and Past-Tense model

Summary

Set the weights and thresholds randomly.

Present an input.

Calculate the actual output by taking the threshold value of the weighted sum of the inputs.

Alter the weights to reinforce correct decisions, i.e. reduce the error.

Learning is guided by knowing what we want it to achieve = supervised learning.

The above shows the essentials of the early Perceptron learning algorithm.
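A toy Python sketch of this procedure (the function name, learning rate eta, and the OR example are illustrative assumptions, not the original implementation):

```python
# A toy sketch of the perceptron learning rule described above.
def train_perceptron(samples, n_inputs, eta=0.1, epochs=20):
    """samples: list of (inputs, target) pairs with binary inputs and targets."""
    weights = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            total = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if total > threshold else 0
            if output == 0 and target == 1:    # should have fired: raise weighted sum
                weights = [w + eta * x for w, x in zip(weights, inputs)]
                threshold -= eta
            elif output == 1 and target == 0:  # fired wrongly: lower weighted sum
                weights = [w - eta * x for w, x in zip(weights, inputs)]
                threshold += eta
    return weights, threshold

# Example: logical OR is linearly separable, so the procedure converges.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, t = train_perceptron(data, n_inputs=2)
print(w, t)
```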

This early history was called Cybernetics (rather than AI)

Page 16: Basics of Neural Nets and Past-Tense model

But there are limitations to Perceptron learning.

(Figure: the four input patterns 00, 01, 10 and 11 plotted as corners of a square, labelled with their XOR outputs 0, 1, 1, 0.)

Page 17: Basics of Neural Nets and Past-Tense model

Exclusive-OR truth table

Consider two propositions, either of which may be true or false

Exclusive-or is the relationship between them when JUST ONE OF THEM is true.

It EXCLUDES the case when both are true, so the exclusive-or of the two is…

False when both are true or both are false, and true in the other two cases.

Page 18: Basics of Neural Nets and Past-Tense model

But there are limitations to Perceptron learning.

Consider Perceptron trying to find straight line that separates classes.

In some cases, cannot draw straight line to separate classes.

E.g. XOR (if 0 is FALSE and 1 is TRUE):

Input 0 1 – 1
Input 1 0 – 1
Input 1 1 – 0
Input 0 0 – 0

(Figure: the same four patterns plotted on the plane; no straight line separates the 1s from the 0s.)

Page 19: Basics of Neural Nets and Past-Tense model

Cannot separate the two pattern classes by a straight line: They are linearly inseparable

This failure to solve apparently simple problems like XOR pointed out by Minsky and Papert in Perceptrons in 1969.

Stopped research in the area for the next 20 years!

During which time (non-neural) AI got under way.

Page 20: Basics of Neural Nets and Past-Tense model

1986: Rumelhart and McClelland: multi-layer perceptron.

(Figure: a feedforward net with two weight layers and three sets of units: input units, hidden units, and output units, with a bias unit feeding each of the two weight layers.)

Page 21: Basics of Neural Nets and Past-Tense model

Adapted perceptron, with units arranged in layers: an input layer, an output layer, and a hidden layer.

Adapted the threshold function, and altered the learning rule.

New learning rule: backpropagation (also a form of supervised learning)

Net is shown pattern, and output is compared to desired output (target).

Weights in the network adjusted, by calculating the value of the error function for a particular input, and then backpropagating the error from one layer to the previous one.

Output weights (weights connected to output layer) can adjust so that value of error function reduced.

Page 22: Basics of Neural Nets and Past-Tense model

Less obvious how to adjust weights for hidden units (not directly producing an output). Input weights adjusted in direct proportion to the error in the units to which it is connected.
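A minimal sketch of such a net learning XOR, assuming sigmoid units and squared-error loss (the 2-2-1 architecture, learning rate, seed, and epoch count are all illustrative; as noted below, convergence is not guaranteed):

```python
import math
import random

# A minimal backpropagation sketch: a 2-2-1 sigmoid net learning XOR
# (illustrative assumptions throughout; training may still hit a local minimum).
random.seed(0)
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

# Each hidden unit has weights for the two inputs plus a bias (the final entry).
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
eta = 0.5

for _ in range(20000):
    for (x1, x2), target in data:
        x = (x1, x2, 1.0)                                    # inputs plus bias input
        h = [sig(sum(w * v for w, v in zip(ws, x))) for ws in w_hid]
        hb = h + [1.0]                                       # hidden outputs plus bias
        y = sig(sum(w * v for w, v in zip(w_out, hb)))
        # Output-layer delta, then backpropagate it to the hidden layer:
        d_out = (target - y) * y * (1 - y)
        d_hid = [d_out * w_out[j] * h[j] * (1 - h[j]) for j in range(2)]
        w_out = [w + eta * d_out * v for w, v in zip(w_out, hb)]
        for j in range(2):
            w_hid[j] = [w + eta * d_hid[j] * v for w, v in zip(w_hid[j], x)]

for (x1, x2), t in data:
    h = [sig(sum(w * v for w, v in zip(ws, (x1, x2, 1.0)))) for ws in w_hid]
    y = sig(sum(w * v for w, v in zip(w_out, h + [1.0])))
    print((x1, x2), round(y, 2), "target:", t)
```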

Page 23: Basics of Neural Nets and Past-Tense model

A solution to the XOR problem

(Figure: each input feeds both hidden units with weight 1; the left hidden unit has threshold 0.5, the right 1.5. The hidden units feed the output unit, threshold 0.5, with weights +1 and -1 respectively. Inputs shown: 00, 01, 10, 11.)

Right-hand hidden unit detects when both inputs are on; through its negative weight it ensures the output unit then gets a net input of zero. With only one of the two inputs on, the right-hand unit's threshold of 1.5 is never met.

Page 24: Basics of Neural Nets and Past-Tense model

When only one of the inputs on, left-hand hidden unit is on, turning on output unit.

When both inputs are off, hidden units are inactive, and output unit is off
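A quick sketch verifying this hand-wired net (step units with the thresholds and weights just described):

```python
# Hand-wired XOR net from the figure: step units, weights as described above.
step = lambda total, threshold: 1 if total > threshold else 0

def xor_net(x1, x2):
    left = step(x1 + x2, 0.5)       # on when at least one input is on
    right = step(x1 + x2, 1.5)      # on only when both inputs are on
    return step(left - right, 0.5)  # right-hand unit vetoes via its -1 weight

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # prints 0, 1, 1, 0
```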

BUT learning rule not guaranteed to produce convergence: can fall into a situation where it cannot learn the correct output (a local minimum).

BUT training requires repeated presentations.

Training multi-layer perceptrons is an inexact science: no guarantee that the net will converge on a solution (i.e. that it will learn to produce the required output in response to inputs).

Can involve long training times.

Page 25: Basics of Neural Nets and Past-Tense model

Little guidance about a number of parameters, including the number of hidden units needed for a particular task.

Also need to find a good input representation of a problem. Often need to search for a good preprocessing method.

Page 26: Basics of Neural Nets and Past-Tense model

Generalisation

Main feature of neural networks: ability to generalise and to go beyond the patterns they have been trained on.

Unknown pattern will be classified with others that have same features.

Therefore learning by example is possible; net trained on representative set of patterns, and through generalisation similar patterns will also be classified.

Page 27: Basics of Neural Nets and Past-Tense model

Fault Tolerance

Multi-layer perceptrons are fault-tolerant because each node contributes to final output. If node or weights lost, only slight deterioration.

Ie graceful degradation

Page 28: Basics of Neural Nets and Past-Tense model

Brief History of Neural Nets

Connectionism/ Neural Nets/ Parallel Distributed Processing.

McCulloch and Pitts (1943) Brain-like mechanisms – showing how artificial neurons could be used to compute logical functions.

Simplification 1: Neural communication thresholded – neuron is either active enough to fire, or it is not. Thus can be thought of as binary computing device (ON or OFF).

Simplification 2: Synapses – equivalent to weighted connections. So can add up weighted inputs to an element, and use binary threshold as output function.

1949: Donald Hebb showed how neural nets could form a system that exhibits memory and learning.

Page 29: Basics of Neural Nets and Past-Tense model

Learning – a process of altering the strengths of connection between neurons.

Reinforcing active connections.

Rosenblatt (1962) and the Perceptron. Single layer net, which can learn to produce an output given an input.

But connectionist research was almost killed off by Minsky and Papert, with their book ‘Perceptrons’.

Argument: Perceptrons computationally weak. (certain problems which cannot be solved by a 1-layer net, and no learning mechanism for 2 layer net).

But resurgence of interest in neural computation.

- the result of new neural network architectures and new learning algorithms, i.e. backpropagation and 2-layer nets.

Page 30: Basics of Neural Nets and Past-Tense model

Rumelhart and McClelland and PDP Research group (1986) 2 books on Parallel Distributed Processing.

Presented variety of NN models – including Past-tense model (see below).

Huge impact of these volumes partly because they contain cognitive models, i.e. models of some aspect of human cognition.

Cognition: thinking, understanding language, memory.

Human abilities that imply our ability to represent the world.

Best contrasted to behaviour-based approach.

Page 31: Basics of Neural Nets and Past-Tense model

Example applications of Neural Nets

NETtalk; Sejnowski and Rosenberg, 1987: network that learns to pronounce English text.

Takes text, maps text onto speech phonemes, and then produces sounds using electronic speech generator.

It is difficult to specify rules to govern translation of text into speech – many exceptions and complicated interactions between rules.

For example, ‘x’ pronounced as in ‘box’ or ‘axe’, but exceptions eg ‘xylophone’.

Connectionist approach: present words and their pronunciations, and see if net can discover mapping relationship between them.

Page 32: Basics of Neural Nets and Past-Tense model

203 input units, 80 hidden units, and 26 output units, corresponding to phonemes (the basic sounds of a language).

A window seven letters wide is moved over the text, and net learns to pronounce the middle letter.

Each character is 29 input units, one for each of 26 letters, and one for blanks, periods and other punctuation. (7 x 29 inputs=203)

Trained on 1024 word text, after 50 passes NETtalk learns to perform at 95% accuracy on training set. Able to generalise to unseen words at level of 78%.

Note here: training vs. test sets

Page 33: Basics of Neural Nets and Past-Tense model

So, for the string ‘SHOEBOX’:

The first of the seven inputs is 00000000000000000010000000, because S is the 19th letter. The output will be the zero-phoneme, because the E (the middle letter of the window) is silent in SHOEBOX; i.e. if the zero-phoneme is placed first:

10000000000000000000000000
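A sketch of this input encoding (the ordering of units within a slot and the choice of punctuation symbols are assumptions for illustration):

```python
# NETtalk-style one-hot encoding of a 7-letter window: 7 slots x 29 units = 203 inputs.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ".", ","]  # 29 symbols per slot

def encode_window(window):
    """One-hot encode a 7-character window into 203 binary inputs."""
    assert len(window) == 7
    inputs = []
    for ch in window.lower():
        slot = [0] * len(ALPHABET)
        slot[ALPHABET.index(ch)] = 1
        inputs.extend(slot)
    return inputs

# The target is the phoneme of the middle letter; for 'shoebox' the middle
# letter E is silent, so the target would be the zero-phoneme.
x = encode_window("shoebox")
print(len(x), sum(x))  # 203 inputs, exactly 7 of them set to 1
```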

Page 34: Basics of Neural Nets and Past-Tense model

Particularly influential example: tape-recorded NETtalk starting out with poor babbling speech and gradually improving until the output was intelligible. Sounds like a child learning to speak. Passed the Breakfast TV test of real AI.

Cottrell et al., 1987: image compression.

Gorman and Sejnowski 1988: classification of sonar signals (mines versus rocks)

Tesauro and Sejnowski, 1989: playing backgammon.

Le Cun et al, 1989: recognising handwritten postcodes

Pomerleau, 1989: navigation of car on winding road

Page 35: Basics of Neural Nets and Past-Tense model

Summary: What are Neural Nets?

Important characteristics:

Large number of very simple neuron-like processing elements.

Large number of weighted connections between these elements.

Highly parallel.

Graceful degradation and fault tolerance.

Key concepts

Multi-layer perceptron.

Backpropagation, and supervised learning.

Generalisation: nets trained on one set of data, and then tested on a previously unseen set of data. The percentage of the previously unseen set they get right shows their ability to generalise.

Page 36: Basics of Neural Nets and Past-Tense model

What does ‘brain-style computing’ mean?

Rough resemblance between units and weights in Artificial Neural Networks (ANNs) and neurons in the brain and the connections between them.

Individual units in a net are like real neurons.

Learning in the brain is similar to modifying connection strengths.

Nets and neurons operate in a parallel fashion.

ANNs store information in a distributed manner, as do brains.

ANNs and brains degrade gracefully.

BUT these structures still model logic gates as well, and are not a different kind of non-von Neumann machine.

Page 37: Basics of Neural Nets and Past-Tense model

BUT

The Artificial Neural Net account is simplified. Several aspects of ANNs don't occur in real brains. Similarly, the brain contains many different kinds of neurons, with different cells in different regions.

e.g. not clear that backpropagation has any biological plausibility. Training with backpropagation needs enormous numbers of cycles.

Often what is modelled is not the kinds of process that are likely to occur at neuron level.

For example, if modelling our knowledge of kinship relationships, unlikely that we have individual neurons corresponding to ‘Aunt’ etc.

Page 38: Basics of Neural Nets and Past-Tense model

Edelman, 1987 suggests that it may take units ‘in the order of several thousand neurons to encode stimulus categories of significance to animals’.

Better to talk of Neurally inspired or Brain-style computation.

Remember too that (as with Aunt) even the best systems have nodes pre-coded with artificial notions like the phonemes (corresponding to the phonetic alphabet). These cannot be precoded in the brain (as they are in Sejnowski's NETtalk) but must themselves be learned.

Page 39: Basics of Neural Nets and Past-Tense model

Getting closer to real intelligence?

Idea that intelligence is adaptive behaviour.

I.e. an organism that can learn about its environment is intelligent.

Can contrast this with approach that assumes that something like playing chess is an example of intelligent behaviour.

Connectionism still in its infancy: still not impressive compared to ants, earthworms or cockroaches.

But arguably still closer to computation that does occur in brain than is the case in standard symbolic AI. Though remember McCarthy’s definition of AI as common-sense reasoning (esp. of a prelinguistic child).

Page 40: Basics of Neural Nets and Past-Tense model

And might still be a better approach than the symbolic one.

Like analogy of climbing a tree to reach the moon – may be able to perform certain tasks in symbolic AI, but may never be able to achieve real intelligence.

Ditto with connectionism/ANNs ---both sides use this argument.

Page 41: Basics of Neural Nets and Past-Tense model

Past-tense learning model

references:

Chapter 18: On learning the past tenses of English verbs. In McClelland, J.L., Rumelhart, D.E. and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 2: Psychological and Biological Models, Cambridge, MA: MIT Press/Bradford Books.

Chapter 6: Two simulations of higher cognitive processes. Bechtel, W. and Abrahamsen, A. (1991) Connectionism and the mind: An introduction to parallel processing in networks. Basil Blackwell.

Page 42: Basics of Neural Nets and Past-Tense model

Past-tense model

A model of human ability to learn past-tenses of verbs.

Presented by Rumelhart and McClelland (1986) in their ‘PDP volumes’:

Main impact of these volumes: introduced and popularised the ideas of Multi-layer Perceptron, trained by means of Backpropagation

Page 43: Basics of Neural Nets and Past-Tense model

Children learning to speak:

Baby: DaDa

Toddler: Daddy

Very young child: Daddy home!!!!

Slightly older child: Daddy came home!

Older child: Daddy comed home!

Even older child: Daddy came home!

Page 44: Basics of Neural Nets and Past-Tense model

Stages of acquisition in children:

Stage 1: past tense of a few specific verbs, some regular (e.g. looked, needed), most irregular (came, got, went, took, gave). As if learned by rote (memorised).

Stage 2: evidence of a general rule for the past tense, i.e. add -ed to the stem of the verb. Often overgeneralise irregulars, e.g. camed or comed instead of came. Also (Berko, 1958) children can generate the past tense for an invented word: e.g. if they use rick to describe an action, they will tend to say ricked when using the word in the past tense.

Page 45: Basics of Neural Nets and Past-Tense model

Stage 3: produce correct forms for both regular and irregular verbs.

Table: Characteristics of 3 stages of past-tense acquisition

Verb Type     Stage 1   Stage 2       Stage 3
Early verbs   Correct   Regularised   Correct
Regular       -         Correct       Correct
Irregular     -         Regularised   Correct
Novel         -         Regularised   Regularised

Page 46: Basics of Neural Nets and Past-Tense model

U-shaped curve – correct past-tense form used for verbs in Stage 1, errors in Stage 2 (overgeneralising rule), few errors in Stage 3.

Suggests Stage 2 children have acquired rule, and Stage 3 children have acquired exceptions to rule.

Aim of Rumelhart and McClelland: to show that connectionist network could show many of same learning phenomena as children.

- same stages and same error patterns.

Page 47: Basics of Neural Nets and Past-Tense model

Overview of past-tense NN model

Not a full-blown language processor that learns past-tenses from full sentences heard in everyday experience.

Simplified: model presented with pairs, corresponding to root form of word, and phonological structure of correct past-tense version of that word.

Can test model by presenting root form of word, and looking at past-tense form it generates.

Page 48: Basics of Neural Nets and Past-Tense model

More detailed account

Input and Output Representation

To capture order information, they used the Wickelfeature method of encoding words.

460 inputs:

Wickelphones: represent target phoneme and immediate context.

e.g. came - #Ka, kAm, aM#

These are coarse-coded onto Wickelfeatures, where 16 wickelfeatures correspond to each wickelphone.

Input and output of net consist of 460 units.

Inputs are ‘standard present’ forms of verbs, outputs are the corresponding past forms, regular or irregular, and all are in the special ‘Wickel’ format.
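A sketch of wickelphone extraction (the notation, with the central phoneme capitalised, follows the slides; the helper itself is an illustrative assumption):

```python
# Extract wickelphones: each phoneme with its immediate left and right context,
# padded with '#' for word boundaries; the central phoneme is capitalised.
def wickelphones(phonemes):
    padded = "#" + phonemes + "#"
    return [padded[i - 1] + padded[i].upper() + padded[i + 1]
            for i in range(1, len(padded) - 1)]

print(wickelphones("kam"))  # ['#Ka', 'kAm', 'aM#'] for 'came'
```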

Page 49: Basics of Neural Nets and Past-Tense model

This is a good example of the need to find a good way of representing the input: you can't just present words to a net; you have to find a way of encoding those words so they can be presented as a set of inputs.

Assessing output: compare the pattern of output Wickelphone activations to the pattern that the correct response would have generated.

Hits: a 1 in output when a 1 in target and a 0 in output when a 0 in target.

False alarms: 1s in the output not in the target.

Misses: 0s in output, not in target.
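A sketch of this scoring scheme over binary patterns (the function name is an assumption):

```python
# Scoring a binary output pattern against its target, per the definitions above.
def score(output, target):
    hits = sum(1 for o, t in zip(output, target) if o == t)  # agreements, 1-1 or 0-0
    false_alarms = sum(1 for o, t in zip(output, target) if o == 1 and t == 0)
    misses = sum(1 for o, t in zip(output, target) if o == 0 and t == 1)
    return hits, false_alarms, misses

print(score([1, 0, 1, 1], [1, 0, 0, 1]))  # (3, 1, 0)
```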

Page 50: Basics of Neural Nets and Past-Tense model

Training and Testing

Verb is input, and propagated across weighted connections – will activate wickelfeatures in output that correspond to past-tense of verb.

Used perceptron-convergence procedure to train net.

(NB not multi-layer perceptron: no hidden layer, and not trained with backpropagation. Problem must be linearly separable).

Target tells output unit what value it should have. When actual output matches target, no weights adjusted. When computed output is 0, and target is 1, need to increase the probability that unit will be active the next time that pattern presented. All weights from all active input units increased by small amount eta. Also threshold reduced by eta.

Page 51: Basics of Neural Nets and Past-Tense model

When computed output is 1 and target is 0, we want to reduce the likelihood of this happening. All weights from active units are reduced by eta, and threshold increased by eta.

Perceptron convergence procedure will find a set of weights that allows the model to get each output unit correct, provided such a set of weights exists.

Before training: Divided 560 verbs into high frequency (regular and irregular), medium (regular and irregular) and low frequency (regular and irregular).

Page 52: Basics of Neural Nets and Past-Tense model

1. Train on 10 high frequency verbs (8 irregular)

Live – lived

Look – looked

Come – came

Get – got

Give – gave

Make – made

Take – took

Go – went

Have – had

Feel – felt

Page 53: Basics of Neural Nets and Past-Tense model

2. After 10 epochs, 410 medium frequency verbs added (76 irregular)

190 more epochs (training cycles)

Net showed dip in performance on irregular verbs which is like Stage 2 in children.

And when net made errors, these errors were like children‘s – I.e. adding ‘ed’.

e.g. for come – comed

3. Tested 86 low frequency verbs it had not been trained on.

Got 92% of regular verbs right, and 84% of irregular.

Page 54: Basics of Neural Nets and Past-Tense model

Results

With a simple network, and no explicit encoding of rules, it could simulate important characteristics of human children learning the English past tense. The same U-shaped curve was produced for irregular words.

Main point: past-tense forms can be described using a few general rules, but can be accounted for by connectionist net which has no explicit rules.

Both regular and irregular words handled by the same mechanism.

Objectives:

To show that past-tense formation could be carried out by a net, rather than by a rule system.

To capture U-shaped function.

Page 55: Basics of Neural Nets and Past-Tense model

Rule-system

Linguists: stress importance of rules in describing human behaviour.

We know the rules of language, in that we are able to speak grammatically, or even to make judgements of whether a sentence is or is not grammatical.

But this does not mean we know the rules like we know the rule ‘i before e except after c’: we may not be able to state them explicitly.

But has been held (e.g. Pinker, 1984 following Chomsky), that our knowledge of language is stored explicitly as rules. Only we cannot describe them verbally because they are written in a special code only the language processing system can understand:

Page 56: Basics of Neural Nets and Past-Tense model

Explicit inaccessible rule view

Alternative view: no explicit inaccessible rules. Our performance is characterisable by rules, but they are emergent from the system, and are not explicitly represented anywhere.

e.g. honeycomb: structure could be described by a rule, but this rule is not explicitly coded. Regular structure of honeycomb arises from interaction of forces that wax balls exert on each other when compressed.

Parallel distributed processing view: no explicit (albeit inaccessible) rules.

Page 57: Basics of Neural Nets and Past-Tense model

Advantages of using NNs to model aspects of human behaviour.

Neurally plausible, or at least ‘brain-style computing’.

Learned: not explicitly programmed.

No explicit rules; permits a new explanation of the phenomenon.

Model both produces the behaviour and fits the data: errors emerge naturally from the operation of the model.

Contrast to symbolic models in all 4 respects (above)

Page 58: Basics of Neural Nets and Past-Tense model

Rumelhart and McClelland:

‘…lawful behaviour and judgements may be produced by a mechanism in which there is no explicit representation of the rule. Instead, we suggest that the mechanisms that process language and make judgements of grammaticality are constructed in such a way that their performance is characterizable by rules, but that the rules themselves are not written in explicit form anywhere in the mechanism.’

Important counter-argument to linguists, who tend to think that people were applying syntactic rules.

Point: can have syntactic rules that describe language, but that doesn’t mean that when we speak syntactically (as if we were following those rules) that we literally are following rules.

Page 59: Basics of Neural Nets and Past-Tense model

Many philosophers have made a similar point against the reality of explicit rules --e.g. Wittgenstein.

The ANN approach provides a computational model of how that might be possible in practice: to have the same behavioural effect as rules, but without there being any rules anywhere in the system.

On the other hand, the standard model of science is of possible rule systems describing the same phenomenon; that also allows that the real rules (in a brain) could be quite different from the ones we invent to describe a phenomenon.

Some computer scientists (e.g. Charniak) refuse to accept incomprehensible explanations.

Page 60: Basics of Neural Nets and Past-Tense model

Specific criticisms of the past tense model:

Criticism 1

Performance of the model depends on use of the Wickelfeature representation: and this is an adaptation of standard linguistic featural analysis, i.e. it relies on a symbolic input representation (cf. phonemes in NETtalk).

Ie what’s the contribution of the architecture?

Criticism 2

Pinker and Prince (1988): role of input and U-shaped curve.

Model’s entry to Stage 2 due to addition of 410 medium frequency verbs.

This change is more abrupt than is the case with children; there may be no relation between this method of partitioning the training data and what happens to children.

Page 61: Basics of Neural Nets and Past-Tense model

But later research (Plunkett and Marchman, 1989) showed that U-shaped curves can be achieved without abrupt changes in input. Trained on all examples together (using a backpropagation net).

Presented more irregular verbs, but still found regularization, and other Stage 2 phenomena for certain verbs.

Criticism 3

Nets are not simply exposed to data, so that we can then examine what they learn.

They are programmed in a sense: decisions have to be made about several things, including:

Training algorithm to be used

Number of hidden units

How to represent the task in question

Page 62: Basics of Neural Nets and Past-Tense model

Input and output representation

Training examples, and manner of presentation

Criticism 4

At some point after or during learning this kind of thing, humans become able to articulate the rule.

Eg regular past tenses end in –ed.

Also can control and alter these rules: e.g. a child could pretend to be a younger child and say ‘runned’ even though she knows it is incorrect (cf. some use learned and some learnt; lit is UK and lighted is US).

Hard to see how such kind of behaviour would emerge from a set of interconnected neurons.

Page 63: Basics of Neural Nets and Past-Tense model

Conclusions

Although the Past-tense model can be criticised, it is best to evaluate it in the context of the time (1986) when it was first presented.

At the time, it provided a tangible demonstration that it is:

Possible to use a neural net to model an aspect of human learning

Possible to capture apparently rule-governed behaviour in a neural net