Neural Networks (Introduction to Neural Networks), by 胡興民老師
1) What is a neural network?
A machine designed to model the way in which the brain performs a particular task or function of interest; the network is implemented using electronic components or is simulated in software on a digital computer [1].
Viewed as an adaptive machine: a massively parallel, distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use [2].
[1] adapted from Neural Networks (2nd ed.) by Simon Haykin
[2] adapted from Aleksander and Morton, 1990
2) Significance of learning for a neural network
The ability to learn from its environment and to improve its performance through learning.
Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment [3]. The type of learning is determined by the manner in which the parameter changes take place.
[3] adapted from Mendel and McClaren, 1970
3) Five basic learning rules
- error-correction learning (rooted in optimum filtering)
- memory-based learning (memorizing the training data explicitly)
- Hebbian learning and competitive learning (these two are inspired by neurobiological considerations)
- Boltzmann learning (based on statistical mechanics)
4) Error-correction learning
[Figure: block diagram of a neural network, highlighting the only neuron k in the output layer. An input vector X(n) feeds a multilayer feedforward network (one or more layers of hidden neurons); output neuron k produces yk(n), which is subtracted from the desired response dk(n) at a summing junction to give the error signal ek(n).]
yk(n) is compared to a desired response, or target output, dk(n). Consequently, an error signal ek(n) is produced:

ek(n) = dk(n) − yk(n)

The error signal actuates adjustments to the synaptic weights so as to make the output signal come closer to the desired response in a step-by-step manner. The objective is achieved by minimizing a cost function (index of performance) ξ(n), the instantaneous value of the error energy:

ξ(n) = ½ ek(n)²

The adjustments continue until the system reaches a steady state. Minimization of ξ(n) is referred to as the delta rule, or Widrow-Hoff rule.
[Figure: signal-flow graph of the output neuron. Inputs x1(n), x2(n), …, xj(n), …, xm(n) of X(n) are weighted by wk1(n), wk2(n), …, wkj(n), …, wkm(n) and summed to the induced local field vk(n); the activation function φ(·) yields yk(n), and ek(n) = dk(n) − yk(n).]
Let wkj(n) denote the value of the synaptic weight of neuron k excited by element xj(n) of the signal vector X(n) at time step n. The adjustment Δwkj(n) is defined by

Δwkj(n) = η ek(n) xj(n)

where η is a positive constant that determines the rate of learning as we proceed from one step in the learning process to another. The updated value of the synaptic weight wkj is

wkj(n+1) = wkj(n) + Δwkj(n)
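The adjustment and update above amount to a few lines of code. The following Python sketch (not part of the slides; the target weights, learning rate, and random inputs are invented) runs the delta rule to convergence on a noiseless linear target:

```python
import numpy as np

# Delta (Widrow-Hoff) rule sketch for a single linear output neuron.
# The function name, toy target and learning rate are illustrative only.

def delta_rule_step(w, x, d, eta):
    """One adjustment: w(n+1) = w(n) + eta * e(n) * x(n)."""
    y = w @ x                # neuron output y_k(n)
    e = d - y                # error signal e_k(n) = d_k(n) - y_k(n)
    return w + eta * e * x, e

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])   # unknown target the neuron should emulate
w = np.zeros(2)
eta = 0.1
for n in range(500):
    x = rng.normal(size=2)
    d = true_w @ x               # desired response
    w, e = delta_rule_step(w, x, d, eta)

print(w)                          # approaches [2, -1]
```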
5) Memory-based learning
There are correctly classified input-output examples { (xi, di) }, i = 1, …, N, where xi is an input vector and di denotes the corresponding desired response.
When classification of a test vector xtest (not seen before) is required, the algorithm responds by retrieving and analyzing the training data in a "local neighborhood" of xtest.
The local neighborhood is defined by the training examples that lie in the immediate neighborhood of xtest (e.g., the network employs the nearest neighbor rule). For example, xN' ∈ { x1, x2, …, xN } is said to be the nearest neighbor of xtest if

min_i d(xi, xtest) = d(xN', xtest), i = 1, 2, …, N

where d(xi, xtest) is the Euclidean distance between the vectors xi and xtest. The class associated with the minimum distance, that is, with vector xN', is reported as the classification of xtest.
A variant is the k-nearest neighbor classifier, which proceeds as follows:
(i) Identify the k classified patterns that lie nearest to xtest, for some integer k.
(ii) Assign xtest to the class (hypothesis) that is most frequently represented among the k nearest neighbors of xtest (i.e., use a majority vote to make the classification and to discriminate against outliers), as illustrated in the following figure for k = 3.
[Figure: a cloud of class-0 points and a cloud of class-1 points; the point d corresponds to the test vector xtest. Its neighborhood includes two points belonging to class 1 and an outlier from class 0. With k = 3, the k-nearest neighbor classifier assigns class 1 to xtest even though it lies closest to the outlier.]
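As an illustration of steps (i) and (ii), here is a short Python sketch of the k-nearest neighbor classifier; the training points and test vector are invented, and the second call reproduces the outlier situation of the figure:

```python
import numpy as np
from collections import Counter

# k-nearest-neighbour classification sketch, mirroring steps (i) and (ii)
# above. The training points and the test vector are invented for illustration.

def knn_classify(X_train, labels, x_test, k=3):
    # (i) identify the k stored patterns nearest to x_test (Euclidean distance)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    # (ii) assign x_test to the class most frequent among those k neighbours
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y = [0, 0, 1, 1, 1]
print(knn_classify(X, y, np.array([0.8, 0.8])))         # -> 1

# With a class-0 outlier placed right next to the test point, k = 3 still
# returns class 1 by majority vote, as in the figure above.
X2 = np.vstack([X, [0.82, 0.78]])
y2 = y + [0]
print(knn_classify(X2, y2, np.array([0.8, 0.8]), k=3))  # -> 1
```

With k = 1 the same test point would be captured by the outlier, which is exactly the behavior the majority vote discriminates against.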
6) Hebbian learning
(i) The original Hebb rule: if two neurons on either side of a synapse (connection) are activated simultaneously (i.e., synchronously), then the strength of that synapse is selectively increased.
(ii) If two neurons on either side of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated [4].
[4] Stent, 1973; Changeux and Danchin, 1976
Let xj(n) and yk(n) denote the presynaptic and postsynaptic signals, respectively, at time step n. The adjustment (Hebb's hypothesis) is expressed by

Δwkj(n) = η yk(n) xj(n)

The correlational nature of a Hebbian synapse is why this is sometimes referred to as the activity product rule, shown in the top curve of the following figure, with the change Δwkj plotted versus the output signal (postsynaptic activity) yk.
[Figure: illustration of Hebb's hypothesis and the covariance hypothesis. Δwkj is plotted versus postsynaptic activity yk. Hebb's hypothesis is a line through the origin with slope η xj; the covariance hypothesis is a line with slope η(xj − Mx) crossing zero at the balance point yk = My, with maximum depression −η(xj − Mx)My at yk = 0.]
Covariance hypothesis [5]: xj(n) and yk(n) are replaced by xj − Mx and yk − My respectively, where Mx and My denote the time-averaged values of xj and yk. Hence, the adjustment is described by

Δwkj = η(xj − Mx)(yk − My)

The average values Mx and My constitute thresholds, which determine the sign of the synaptic modification.
[5] Sejnowski, 1977
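The contrast between the two rules can be seen numerically. In this Python sketch (the signal traces and η are made-up illustration values), the plain Hebbian increment is always positive for positive signals, while the covariance increment changes sign around the thresholds Mx and My:

```python
import numpy as np

# Sketch of the plain Hebbian rule versus the covariance variant above.
# eta and the short signal traces are invented illustration values.

eta = 0.05
x = np.array([1.0, 0.8, 1.2, 0.9, 1.1])   # presynaptic signal x_j(n) over time
y = np.array([0.9, 1.1, 1.0, 0.8, 1.2])   # postsynaptic signal y_k(n)

# Hebb's hypothesis: dw = eta * y(n) * x(n) -- always positive here,
# so the weight only ever grows
dw_hebb = eta * y * x

# Covariance hypothesis: dw = eta * (x - Mx) * (y - My), where Mx, My are
# the time-averaged values; the sign of the modification can now be negative
Mx, My = x.mean(), y.mean()
dw_cov = eta * (x - Mx) * (y - My)

print(dw_hebb)
print(dw_cov)
```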
7) Competitive learning
The output neurons of a neural network compete among themselves to become active (fired). In Hebbian learning several output neurons may be active simultaneously, whereas in competitive learning only a single output neuron is active at any one time.
[Figure: architectural graph of a simple competitive learning network — a layer of source nodes x1, x2, x3, x4 with feedforward connections to a single layer of output neurons, and lateral connections among the output neurons.]
The output signal yk of the winning neuron k is set to one; the outputs of all losing neurons are set to zero. That is,

yk = 1 if vk > vj for all j, j ≠ k; yk = 0 otherwise

where the induced local field vk represents the combined action of all the forward and feedback inputs to neuron k.
Suppose each neuron is allotted a fixed (positive) amount of synaptic weight, which is distributed among its input nodes; that is,

Σj wkj = 1 for all k (j = 1, …, N)

The change Δwkj applied to wkj is defined by

Δwkj = η(xj − wkj) if neuron k wins the competition; Δwkj = 0 if neuron k loses
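The winner-take-all update can be sketched as follows in Python. To keep the sketch deterministic (not from the slides), the winner is chosen by smallest Euclidean distance — equivalent to the largest induced local field vk when all vectors have equal norm — and the two weight vectors start near two invented input clusters:

```python
import numpy as np

# Winner-take-all competitive-learning sketch. The clusters, starting
# weights, eta and noise level are invented illustration values.

rng = np.random.default_rng(0)
eta = 0.1
W = np.array([[0.2, 0.8],
              [0.8, 0.2]])            # one row of synaptic weights per neuron

centers = np.array([[0.1, 0.9], [0.9, 0.1]])
for _ in range(1000):
    x = centers[rng.integers(2)] + 0.02 * rng.normal(size=2)
    k = np.argmin(np.linalg.norm(W - x, axis=1))  # winning neuron
    W[k] += eta * (x - W[k])          # Δw_kj = η(x_j − w_kj); losers unchanged

print(W)   # row 0 drifts to ~(0.1, 0.9), row 1 to ~(0.9, 0.1)
```

This reproduces the geometric picture of the next figure: each weight vector migrates into a cluster of input vectors.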
[Figure: geometric interpretation of the competitive learning process [6]. The o's represent the input vectors and the x's the synaptic weight vectors of three output neurons: (a) initial state of the network; (b) final state, with each weight vector moved into a cluster of inputs. It is assumed that all neurons in the network are constrained to have the same Euclidean length (norm), i.e., Σj wkj² = 1 for all k (j = 1, …, N).]
[6] Rumelhart and Zipser, 1985
8) Boltzmann learning
The machine is a recurrent structure operating in a binary manner: each neuron is either in an "on" state, denoted by +1, or an "off" state, denoted by −1. An energy function E is defined by

E = −½ Σj Σk wkj xk xj, j ≠ k

where xj is the state of neuron j, and j ≠ k means that none of the neurons has self-feedback.
The machine operates by choosing a neuron xk at random during the learning process, then flipping its state to −xk at some pseudotemperature T with probability

P(xk → −xk) = [1 + exp(−ΔEk / T)]⁻¹

where ΔEk is the energy change resulting from such a flip. If this rule is applied repeatedly, the machine reaches thermal equilibrium.
Let ρkj⁺ denote the correlation between the states of neurons j and k with the network in its clamped condition¹, and let ρkj⁻ denote the correlation in the free-running condition². The change Δwkj is defined by [7]

Δwkj = η(ρkj⁺ − ρkj⁻), j ≠ k

where both ρkj⁺ and ρkj⁻ range in value from −1 to +1.
Remark 1: Neurons providing an interface between the network and the environment are called visible. In the clamped condition, visible neurons are all clamped onto specific states, whereas hidden neurons always operate freely.
Remark 2: In the free-running condition, all neurons (visible and hidden) are allowed to operate freely.
[7] Hinton and Sejnowski, 1986
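The stochastic flip rule can be sketched in Python, following the probability formula above literally; the two-neuron weight matrix, initial state, and pseudotemperature are invented illustration values:

```python
import numpy as np

# Sketch of the stochastic state-flip rule P(x_k -> -x_k) = 1/(1+exp(-dE_k/T)).
# The weight matrix is symmetric with zero diagonal (no self-feedback),
# so the j != k condition in the energy function is handled automatically.

rng = np.random.default_rng(0)

def energy(W, x):
    return -0.5 * x @ W @ x            # E = -1/2 sum_j sum_k w_kj x_k x_j

def flip_step(W, x, T):
    k = rng.integers(len(x))           # choose a neuron at random
    x_new = x.copy()
    x_new[k] = -x_new[k]               # candidate flip x_k -> -x_k
    dE = energy(W, x_new) - energy(W, x)
    if rng.random() < 1.0 / (1.0 + np.exp(-dE / T)):
        return x_new                   # flip accepted
    return x                           # flip rejected

W = np.array([[0.0, 1.0],
              [1.0, 0.0]])             # symmetric, zero diagonal
x = np.array([1.0, -1.0])              # binary states in {+1, -1}
for _ in range(100):
    x = flip_step(W, x, T=1.0)
print(x)
```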
9) Supervised learning (learning with a teacher)
[Figure: block diagram of supervised learning. A vector describing the state of the environment is fed both to the teacher, which supplies the desired response, and to the learning system, which produces the actual response; a summing junction forms the error signal (desired response minus actual response), which is fed back to the learning system.]
The teacher has knowledge of the environment, with that knowledge represented by a set of input-output examples. The environment is, however, unknown to the network of interest. The learning process takes place under the tutelage of the teacher: the adjustment is carried out iteratively, in a step-by-step fashion, with the aim of eventually making the network emulate the teacher.
10) Learning without a teacher
No labeled examples of the function to be learned by the network are available, and two subdivisions are identified: (i) reinforcement learning / neurodynamic programming, and (ii) unsupervised learning.
(i) Reinforcement learning
The learning of an input-output mapping is performed through continued interaction with the environment, in order to minimize a scalar index of performance.
[Figure: block diagram of reinforcement learning. The environment sends a state (input) vector to a critic and to the learning system; the critic converts the primary reinforcement signal from the environment into a heuristic reinforcement signal for the learning system, whose actions feed back into the environment.]
(ii) Unsupervised / self-organized learning
There is no external teacher or critic; rather, provision is made for a task-independent measure of the quality of the representation that the network is required to learn, and the free parameters of the network are optimized with respect to that measure. Once the network is tuned to the statistical regularities of the input data, it develops the ability to form internal representations for encoding features, and thereby to create new classes automatically.
[Figure: block diagram of unsupervised learning. A vector describing the state of the environment is the only input to the learning system.]
11) Learning tasks
The choice of a particular learning algorithm is influenced by the learning task.
(i) Pattern association takes one of two forms: autoassociation or heteroassociation. In autoassociation, a network is required to store a set of patterns (vectors) by repeatedly presenting them to the network. The network is subsequently presented with a partial description or a distorted (noisy) version of an original pattern stored in it, and the task is to retrieve (recall) that particular pattern. Heteroassociation differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns. Autoassociation involves the use of unsupervised learning, whereas the type of learning involved in heteroassociation is supervised.
[Figure: input-output relation of pattern association — an input vector x enters the pattern associator, which produces an output vector y. The key pattern xk acts as a stimulus that not only determines the storage location of memorized pattern yk, but also holds the key (i.e., the address) for its retrieval.]
(ii) Pattern recognition: defined as the process whereby a received pattern/signal is assigned to one of a prescribed number of classes (categories).
[Figure: illustration of the approach to pattern classification; the network may take either form. (a) An unsupervised network for feature extraction followed by a supervised network for classification. (b) Feature extraction maps a point x in the m-dimensional observation space to a point y in a q-dimensional feature space; classification then maps into an r-dimensional decision space.]
Associative memory
12) An associative memory offers the following characteristics:
- The memory is distributed.
- Information is stored in memory by setting up a spatial pattern of neural activities.
- Information contained in a stimulus not only determines its storage location but also an address for its retrieval.
- There may be interactions between stored patterns, and so there is a distinct possibility of errors during the recall process.
The memory is said to perform a distributed mapping that transforms an activity pattern in the input space into another activity pattern in the output space, as illustrated in the diagram.
[Figure: (a) associative memory model as a component of a nervous system — an input layer of m neurons connected through synaptic junctions to an output layer of m neurons; (b) associative memory model using artificial neurons — an input layer of source nodes carrying xk1, xk2, …, xkm and an output layer of neurons producing yk1, yk2, …, ykm.]
The assumption: each neuron acts as a linear combiner, as depicted in the following graph. An activity pattern xk occurring in the input layer and the pattern yk occurring simultaneously in the output layer are written in expanded form as:

xk = [xk1, xk2, …, xkm]T
yk = [yk1, yk2, …, ykm]T
[Figure: signal-flow graph model of a linear neuron labeled i — inputs xk1, xk2, …, xkm are weighted by wi1(k), wi2(k), …, wim(k) and summed to give the output yki.]
The association of key vector xk with memorized vector yk is described in matrix form as

yk = W(k) xk, k = 1, 2, …, q

where the weight matrix W(k) is determined by the input-output pair (xk, yk). In detail,

yki = Σ(j=1..m) wij(k) xkj, i = 1, 2, …, m

Using matrix notation, yki is expressed in the equivalent form

yki = [wi1(k), wi2(k), …, wim(k)] [xk1, xk2, …, xkm]T, i = 1, 2, …, m

and stacking the m such rows gives

[yk1]   [w11(k) w12(k) … w1m(k)] [xk1]
[yk2] = [w21(k) w22(k) … w2m(k)] [xk2]
[ ⋮ ]   [  ⋮     ⋮          ⋮  ] [ ⋮ ]
[ykm]   [wm1(k) wm2(k) … wmm(k)] [xkm]
The individual presentations of the q pairs of associated patterns xk → yk, k = 1, 2, …, q, produce corresponding values of the individual matrices W(1), W(2), …, W(q). We define an m-by-m memory matrix that is the summation of the weight matrices:

M = Σ(k=1..q) W(k)

The matrix M defines the overall connectivity between the input and output layers of the associative memory. In effect, it represents the total experience gained by the memory, and it can be constructed by the recursion

Mk = Mk−1 + W(k), k = 1, 2, …, q …(A)

where the initial value M0 is zero (i.e., the synaptic weights in the memory are all initially zero), and the final value Mq is identically equal to M.
When W(k) is added to Mk−1, the increment W(k) loses its distinct identity among the mixture of contributions that form Mk, whereas the information about the stimuli may not have been lost.
Note: As the number q increases, the influence of a new pattern on the memory as a whole is progressively reduced.
Suppose the associative memory has learned M. We may postulate M̂, an estimate of M in terms of the stored patterns, as [1]:

M̂ = Σ(k=1..q) yk xkT …(B)

The term yk xkT represents the outer product of the patterns xk and yk, and is an estimate of the weight matrix W(k) that maps the input pattern xk onto the output pattern yk. Here xkj is the output of source node j in the input layer, and yki is the output of neuron i in the output layer. In the context of wij(k) for the kth association, source node j acts as a presynaptic node and neuron i acts as a postsynaptic node.
[1] Anderson, 1972, 1983; Cooper, 1970
The ‘local’ learning process in eq.(B) may be viewed as a generalization of Hebb’s postulate of learning. An associative memory so designed is called a correlation matrix memory. Correlation is the basis of learning, association, pattern recognition, and memory recall [2].
Eq. (B) may be reformulated as

M̂ = Y XT …(C)

where X = [x1, x2, …, xq] is called the key matrix and Y = [y1, y2, …, yq] is called the memorized matrix; that is,

M̂ = [y1, y2, …, yq] [x1T; x2T; …; xqT]

[2] Eggermont, 1990
Eq. (C) may also be constructed by the recursion

M̂k = M̂k−1 + yk xkT, k = 1, 2, …, q …(D)

Comparing eqs. (A) and (D), we see that the outer product yk xkT represents an estimate of the weight matrix W(k).
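The construction of M̂ by recursion (D), and recall from it, can be sketched in Python; the two pattern pairs are invented, with orthonormal keys so that recall is exact:

```python
import numpy as np

# Correlation-matrix-memory sketch: M_hat is built by the recursion of
# eq.(D), M_k = M_{k-1} + y_k x_k^T. The two pattern pairs are invented,
# with orthonormal key vectors so that recall is exact.

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])        # orthonormal key vectors
y1 = np.array([1.0, 2.0, 3.0])
y2 = np.array([-1.0, 0.0, 1.0])       # memorized vectors

M = np.zeros((3, 3))                  # M_0 = 0
for xk, yk in zip((x1, x2), (y1, y2)):
    M = M + np.outer(yk, xk)          # outer product y_k x_k^T estimates W(k)

print(M @ x1)   # -> [1. 2. 3.], i.e. y1 is recalled exactly
```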
Recall
The problem posed by an associative memory is the addressing and recall of the patterns stored in memory. Let the key xj be picked at random and reapplied as a stimulus to the memory, yielding the response

y = M̂ xj …(E)

Substituting eq. (B) into (E), we get

y = Σ(k=1..q) yk xkT xj = (xjT xj) yj + Σ(k=1, k≠j..q) (xkT xj) yk = yj + vj …(F)

where vj = Σ(k≠j) (xkT xj) yk, provided the keys have unit energy as assumed below.
The term yj in eq. (F) represents the "desired" response; it may be viewed as the "signal" component of the actual response y. The second term, vj, is a "noise vector" that is responsible for making errors on recall.
Let the key patterns x1, x2, …, xq be normalized to have unit energy, i.e.,

Ek = Σ(l=1..m) xkl² = xkT xk = 1, k = 1, 2, …, q

Then the cosine of the angle between two keys reduces to their inner product:

cos(xk, xj) = xkT xj / (‖xk‖ ‖xj‖) = xkT xj

⇒ cos(xk, xj) = 0 for k ≠ j if xk and xj are orthogonal
⇒ vj = Σ(k≠j) cos(xk, xj) yk = 0; i.e., in such a case, y = yj.
⇒ The memory associates perfectly if the key vectors form an orthonormal set, that is, if they satisfy the conditions

xkT xj = 1 if k = j; xkT xj = 0 if k ≠ j

Another question: what is the limit on the storage capacity of the associative memory? The answer lies in the rank (say r) of the memory matrix M̂. A rectangular matrix of dimensions l-by-m has r ≤ min(l, m); in the case of a correlation memory, M̂ is an m-by-m matrix, so r ≤ m ⇒ the number of patterns that can be stored reliably can never exceed the input space dimensionality.
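Conversely, when the keys have unit energy but are not orthogonal, the noise vector vj of eq. (F) appears in the recall. A Python sketch with invented vectors:

```python
import numpy as np

# Recall with unit-energy but NON-orthogonal keys: the response y = M_hat x_j
# splits into the signal y_j plus the noise vector v_j of eq.(F).
# All vectors here are invented illustration values.

x1 = np.array([1.0, 0.0])
x2 = np.array([np.sqrt(0.5), np.sqrt(0.5)])   # unit energy, not orthogonal to x1
y1 = np.array([1.0, 0.0])
y2 = np.array([0.0, 1.0])

M = np.outer(y1, x1) + np.outer(y2, x2)       # eq.(B)
y = M @ x1                                     # eq.(E): recall with key x1
v1 = (x2 @ x1) * y2                            # crosstalk (noise) term of eq.(F)

print(y)          # equals y1 + v1: the recall is contaminated by v1
```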
What if the key patterns are neither orthogonal nor highly separated? Then a correlation matrix memory may get confused and make errors.
First, define the community of a set of patterns as the lower bound on the inner products xkT xj of any two patterns in the set. Second, assume further that xkT xj ≥ λ for k ≠ j. If λ is large enough, the memory may fail to distinguish the response y from that of any other key pattern.
Say xj = x0 + v, where v is a stochastic vector. It is then likely that the memory will recognize x0 and associate with it a vector y0, rather than any of the actual pattern pairs used to train it; here x0 and y0 denote a pair of patterns never seen before.
Two MATLAB programming exercises (solutions on the following slides):
(1) In an unsupervised design, separate the inputs (0,0), (0,1), (1,0), (1,1) into two classes implementing the AND logic operation. Note: the weight and bias values are given at the start; observe the weight and bias values after the classification is complete.
(2) Separate the same inputs as in exercise (1) into two classes, but implement the OR logic operation with a supervised design. Note: the weight and bias values are not set at the start (i.e., both are zero; note the test in the program!). Check the weight and bias values after the classification is complete. Change the epoch value in the program yourself, and observe how the performance curve and the final weight and bias values change.
close all;clear;clc; % Exercise (1): AND operation, unsupervised design %
net=newp([-2 2;-2 2],1);
net.IW{1,1}=[2 2];
net.b{1}=[-3];
% input vectors %
p1=[0;0];
a1=sim(net,p1)
p2=[0;1];
a2=sim(net,p2)
% a sequence of two input vectors %
p3={[0;0] [0;1]};
a3=sim(net,p3)
% a sequence of four input vectors %
p4={[0;0] [0;1] [1;0] [1;1]};
% simulated output and final wt./bias %
a4=sim(net,p4)
net.IW{1,1}
net.b{1}
% boundary / weight %
xx=-.5:.1:2;
yy=-net.b{1}/2-xx;
x2=0:.1:1.2;
y2=x2;
plot(xx,yy,0,0,'o',0,1,'o',1,0,'o',1,1,'x',x2,y2);
legend('boundary')
gtext('W')
close all;clear;clc; % Exercise (2): OR operation, supervised design; also try other epoch values %
net=newp([-2 2;-2 2],1);
testp=[1 1 1 1];
net.IW{1,1}
net.b{1}
while testp ~= [0 0 0 0]
  net.trainparam.epochs=4;
  p=[[0;0] [0;1] [1;0] [1;1]];
  t=[0 1 1 1];
  net=train(net,p,t);
  % the new weights and bias are %
  net.IW{1,1}
  net.b{1}
  % simulate the trained network for each of the inputs %
  a = sim(net,p)
  % the errors for the various inputs are %
  error = [a(1)-t(1) a(2)-t(2) a(3)-t(3) a(4)-t(4)]
  testp=error;
end
xx=-1:.1:2;
yy=-net.b{1}-xx;
x2=xx-.2;y2=yy-.2;
x3=0:.01:.7;
y3=x3;
plot(x2,y2,0,0,'o',0,1,'x',1,0,'x',1,1,'x',x3,y3,'->')
legend('boundary')
gtext('weight')