
Hierarchical Temporal and Spatial Memory for Gait Pattern Recognition

by Jianghao Shen

B.A. in Electrical Engineering, December 2012, University of Wollongong

A Thesis submitted to

The Faculty of The School of Engineering and Applied Science

of The George Washington University in partial fulfillment of the requirements

for the degree of Master of Science

January 31, 2017

Thesis directed by

Murray H. Loew, Professor of Biomedical Engineering


© Copyright 2017 by Jianghao Shen. All rights reserved.


Abstract

Hierarchical Temporal and Spatial Memory for Gait Pattern Recognition

Many pattern recognition problems are inherently related to space and time. For example, when we want to recognize someone by observing his walking patterns, at every time step there is a spatial structure of his body parts' positions, and those positions tend to vary with time. Thus, a model that can exploit both the spatial and temporal structure of the visual world for classification is necessary for this kind of task. This research extends a new model, Hierarchical Temporal Memory (HTM), a biologically inspired algorithm that can perform spatial and temporal learning simultaneously. By assuming that patterns that occur close together in time are likely to be variations of the same thing, HTM is able to learn the invariant representation of an object by exploiting the temporal proximity of patterns; by assuming that the world has a hierarchical structure, HTM is able to learn complex objects in terms of simpler building blocks. The contribution of this research is the addition of a hierarchical temporal inference mechanism to the existing HTM model, so that the extended model is able to better exploit the temporal structure of a visual event and perform sequential inference. We evaluated the algorithm on the task of gait recognition, and the results show that our algorithm performs better than some current methods and no worse than others on a gait dataset of 151 individuals.


Table of Contents

Abstract
List of Figures
List of Tables
Chapter 1: Introduction
    1.1 Motivation
    1.2 Previous work
    1.3 New method - Hierarchical Temporal Memory
    1.4 Summary of thesis organization
Chapter 2: Methods
    2.1 Operations in a level-one node
        2.1.1 Training in a level-one node
            2.1.1.1 Memorizing patterns
            2.1.1.2 Building a transition graph
            2.1.1.3 Temporal grouping
        2.1.2 Inference in a level-one node
            2.1.2.1 Calculate the degree of match between input and memorized patterns
            2.1.2.2 Calculate the degree of membership of the input in each of the temporal groups
    2.2 Operations in a higher-level node
        2.2.1 Training in a higher-level node
        2.2.2 Inference in a higher-level node
    2.3 Extensions of the original HTM model
        2.3.1 Sequential inference for a single Markov Chain
        2.3.2 Implicit temporal hierarchy in the HTM model
        2.3.3 Hierarchical temporal inference mechanism
Chapter 3: Experiments and Results
    3.1 Experiment setting
    3.2 Parameter selection
    3.3 Results
Chapter 4: Conclusions
References
Appendix A: Description of the CASIA gait dataset C
Appendix B: Python code for the extended HTM


List of Figures

Fig. 1 Example three-level HTM network
Fig. 2 Memorized patterns for the first step
Fig. 3 Example of temporal proximity of spatial features
Fig. 4 Illustration of resulting transition graph
Fig. 5 Example of resulting temporal groups
Fig. 6 An example of the internal structure of a trained level-one node
Fig. 7 Example of inference process in a level-one node
Fig. 8 Example two-level network training
Fig. 9 Example two-level network inference
Fig. 10 Illustration of the temporal hierarchy in HTM
Fig. 11 Illustration of sequential inference in a temporal hierarchy
Fig. 12 Illustration of the temporal hierarchy we built for the gait sequence
Fig. 13 Illustration of the spatial hierarchy we built for the gait image
Fig. 14 Dynamic beliefs of the five individuals over time (extended HTM)
Fig. 15 Dynamic beliefs of the five individuals over time (original Zeta1 model)


List of Tables

Table 1 Comparison with other methods that use CASIA dataset C
Table 2 Performance with normal walking under noise
Table 3 Illustrative test result with noisy gait sequence (extended HTM)
Table 4 Illustrative test result with noisy gait sequence (original Zeta1 model)


Chapter 1 - Introduction

1.1 Motivation

We are living in a world that has both spatial and temporal structure, and many judgments must be made within a certain spatial and/or temporal context. For example, if we want to decide whether the weather of a certain area is normal, we first need to compare it to the average weather of the places around this area, or even of the whole country. Second, we need to consider the history of the weather of the area itself. Also, different problems may have different spatial and temporal scales. Thus, for a model to describe or recognize events in the world accurately, it must be able to perform both spatial and temporal analysis, and it must be able to adjust the problem space to different scales.

1.2 Previous work

Many existing methods focus on either spatial analysis or temporal analysis. For example, the Convolutional Neural Network (CNN) is good at exploiting the spatial structure of the visual world by constructing different kernels (feature detectors) at different hidden layers. However, there is relatively little work on applying CNNs to video classification. Video is more complex than images, since it has another dimension: time. Some extensions of CNNs into the video domain have been explored.


One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space [1][2]. Another way is to fuse the features of different convolutional neural networks responsible for the spatial and temporal streams [3][4]. Specifically, this approach places two separate networks with shared parameters a distance of n frames apart in the video sequence and then merges the two streams in a new network; neither single network alone can detect any temporal information, but the new network can compute the temporal variation characteristics by comparing the outputs of the two separate networks. Unsupervised learning schemes for training spatio-temporal features have also been introduced, based on Convolutional Gated Restricted Boltzmann Machines [5] and Independent Subspace Analysis [6]. However, those approaches lack an intuitive high-level spatio-temporal structure [7].

Hierarchical Hidden Markov Models (HHMMs) [8] are another class of hierarchical models, in which Markov Chains of different lengths are learned at different levels, and the Markov Chains at a lower level become a state at the higher level, making the HHMM easy to scale to problems with temporal sequences of different lengths. However, these models are hierarchical in only one dimension, usually the temporal dimension [9]. Thus, they are inadequate for vision tasks that include complex spatial structures.


1.3 New method - Hierarchical Temporal Memory

To address the limitations introduced above, this research extends and uses a new algorithm, Hierarchical Temporal Memory (HTM), a machine intelligence model that aims to mimic the way the human neocortex appears to perceive patterns [10]. The strength of HTM is that it has clear definitions of the roles of time and space in the task of pattern recognition. The first idea of HTM is that patterns that are temporally close are likely to be variations of the same thing; thus the role of time is that of an implicit supervisor. The second idea of HTM is that the world has a hierarchical structure, in which complex objects can be expressed in terms of simpler building blocks; thus HTM decomposes the spatial world into different levels of hierarchy based on the complexity of the building blocks.

A typical HTM model is a network of nodes, and each node has two stages: learning and inference. During learning, each node at the bottom level learns the variation of the local visual patterns within its receptive field, and the nodes at higher levels learn visual patterns over larger spatial areas in terms of the lower-level nodes' results. Similarly, during inference, the test image is decomposed into a series of small patches corresponding to the receptive fields of the nodes; each node calculates its own belief about what it is seeing within its receptive field, and the beliefs are then concatenated at a higher level to generate the recognition result over a larger spatial area.

There are two versions of HTM. The first, called Zeta1 [11], has a spatial hierarchy but can perform instantaneous inference only. The second, called the Cortical Learning Algorithm [12], concatenates both current and past inputs for sequential inference, but it does not have a clearly defined hierarchical structure. Thus, our goal is to combine the merits of both: a new HTM that has a hierarchical formation and is able to perform sequential learning.

1.4 Summary of thesis organization

The outline of this thesis is as follows. Chapter 2 introduces the HTM model; specifically, Sec. 2.1 and Sec. 2.2 introduce the learning and inference stages of the original Zeta1 model, and Sec. 2.3 introduces the extension we made to the HTM, which includes the modified dynamic programming algorithm that facilitates sequential inference in a temporal hierarchy. Chapter 3 contains the results of experiments on gait recognition, and the final chapter is the conclusion.


Chapter 2 - Methods

An HTM network is trained level by level (Fig. 1): level one is trained first, and it then switches into the inference stage while level two is being trained. This process is repeated until the top-level node is trained. The arrangement of the following sections is therefore: Sec. 2.1.1 - training of a level-one node, Sec. 2.1.2 - inference of a level-one node, Sec. 2.2.1 - training of a higher-level node, and Sec. 2.2.2 - inference of a higher-level node.

Fig. 1 [9]. An example three-level HTM network: the input image is of size 32x32. Level one of the network has 64 nodes arranged in an 8x8 grid; the receptive field of each level-one node is the 4x4 image patch under it. The input of a level-one node is the image patch within its visual area. The input of a level-two node comes from the inference outputs of four of its level-one child nodes. The inference outputs of all level-two nodes are sent to the top node, and the output of the top node is the recognition result.


2.1 Operations of a level-one node

2.1.1 Training of a level-one node

There are three steps in level-one training: first, memorize the input patterns; second, build a Markov transition graph of the memorized patterns based on their transition probabilities; third, learn the temporal groups by grouping the patterns that have high mutual transition probabilities. The details are introduced below.

2.1.1.1 Memorizing patterns

In this step, the level-one node memorizes all the visual patterns it sees within its receptive field during training, and those patterns become the features learned by the node. Since the images tend to be noisy, a nearest-neighbor method is applied to determine whether the input patch is a new pattern: if the minimum of the Euclidean distances between the current input and all of the memorized patterns is above a certain threshold, the current input is considered to be a newly discovered pattern and is added to the memory. Fig. 2 shows an illustrative result of the memorized spatial features.

2.1.1.2 Building a transition graph

One important requirement of the training images for the HTM is that they have to be a series of continuously changing patterns; an example is shown in Fig. 3. From t = 0 to t = 13, we can see a series of image patches representing a continuously moving left corner (t = 0 to t = 3) and right corner (t = 9 to t = 11). Thus, whenever the same left (right) corner moves across the receptive field of the level-one node, the same set of visual features (the features at times 0, 1, 2 or at times 9, 10, 11) will occur again, and this supports the idea that patterns occurring close by in time are likely to represent the same thing. Accordingly, we characterize the temporal relation of the features by building a transition graph of the features. Specifically, there are two steps: first, count the number of transitions between every pair of features during training, defining the features as the vertices of the graph, then draw links between features according to their transition relationships, defining each link's weight as the number of transitions; second, normalize the graph so that for every vertex, the weights of all its outgoing links sum to one. Fig. 4 is an illustration of the resulting transition graph.
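The normalization step is compact in NumPy; the following is an illustrative sketch (the thesis code in Appendix B performs the same operation with explicit loops):

import numpy as np

def normalize_rows(T):
    # after this step, T[i, j] approximates p(S[j] | S[i]);
    # rows with no outgoing transitions are left as zeros
    T = T.astype(float)
    row_sums = T.sum(axis=1, keepdims=True)
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums != 0)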

Fig. 2. Memorized patterns for the first step. We can see that the level-one node learned different features such as vertical bars, horizontal bars, left corners, and right corners [9].


Fig. 3. Example of temporal proximity of spatial features [9].

2.1.1.3 Temporal grouping

After building the transition graph, we form invariant groups of features based on their transition probabilities. Fig. 5 shows an example of the resulting temporal groups; the features within the same group have high mutual transition probabilities, making them more likely to be invariant representations of the same thing. It can be seen that g2 represents a horizontal bar and g4 represents a right corner. Fig. 6 shows the internal structure of a trained level-one node; it contains a memory of all the learned features and a memory of all the learned temporal groups.

Fig. 4. Illustration of the resulting transition graph, where the content of each cell is a spatial feature, the direction of a link represents the transition relation, and the weight of a link is the normalized transition probability [9].


Fig. 5. Example of resulting temporal groups; the groups are marked by red circles. We can see that g2 is the invariant representation of a horizontal bar, g4 is the invariant representation of a right corner, and g8 is the invariant representation of a left corner [9].

Fig. 6. An example of the internal structure of a trained level-one node, where "Memory" represents the set of learned features and "Temporal Groups" represents the temporal groups (invariant representations) of the features [9].


Pseudo code

The pseudo code for the three training steps is shown below, and we define the following terms:

input(t): 2-d array of the current input image patch at time t
S: a list of memorized spatial features (in 2-d array form), where S[i] represents the ith feature
N: variable indicating the current size of S
Distance: a list of the Euclidean distances between the current input and the memorized features, where Distance[i] = ||S[i] - input(t)||
Threshold: parameter that determines whether the current input is considered new; the criteria for choosing this parameter are discussed in Sec. 3.2
Counter: dictionary indicating how often a given feature has occurred; Counter[i] stores how many times the ith feature has occurred
T: matrix that records the transition counts between features, where T[i,j] records how many times the jth feature occurs immediately after the ith feature

Pseudo code for the first two steps:

K_star = argmin_{k=1,...,N}(Distance[k])
if Distance[K_star] > Threshold:
    N = N + 1
    S[N] = input(t)
    K_star = N        // K_star keeps track of the index of the current input
Counter[K_star] = Counter[K_star] + 1
T[K_prev, K_star] = T[K_prev, K_star] + 1   // update the transition matrix using the index of the current input and the index of the previous input K_prev
K_prev = K_star      // update K_prev to be the index of the current input, in preparation for the next time step


After this process, the level-one node has learned a memory of features S and the transition matrix T. Several processing steps are needed before we move into temporal grouping. First, compute the prior probabilities of the features:

P(S[i]) = Counter[i] / Σ_k Counter[k]    (1)

The second step is to perform a row-wise normalization of T:

for i in all rows of T:
    SUM = Σ_k T[i, k]
    for j in all columns of T:
        T[i, j] = T[i, j] / SUM

After this normalization, T[i,j] represents the transition probability p(S[j]|S[i]).

For temporal grouping, the level-one node initially begins with the first pattern in its memory, S[0], searches the rest of the memory for the pattern to which S[0] has the highest transition probability, and adds it to the group of S[0]; the process repeats starting from the newly added pattern, until the length of the group reaches a maximum length (a pre-set parameter). New temporal groups are built in the same manner from the patterns that have not been assigned to any other group. (The pseudo code of the third step is shown below.)

After temporal grouping, there is one more processing step required in preparation for the inference stage: we must build the conditional matrix K(S|G), where S is the set



of memorized visual patterns and G represents the temporal groups, and K[i,j] represents the conditional probability of the occurrence of the jth feature given the occurrence of the ith temporal group, p(S[j]|G[i]). The number of rows of the matrix equals the number of temporal groups of the node, and the number of columns equals the number of features of the node. The function of K will be introduced in more detail in the next section. The derivation of K(S|G) is shown immediately after the pseudo code of the third step.

Pseudo code of the third step:

Ng: index of the temporal group, initially set to 0
G: a dictionary of all the currently learned temporal groups; G[i] represents the ith temporal group; initially G is empty
maxsize: parameter that determines the maximum allowed size of a temporal group
unassigned: list of features that have not been assigned to any temporal group
top_neighbour(S[i]): a function that returns the index of the feature to which S[i] has the highest transition probability

while not all features are assigned:
    initialize G[Ng] to be an empty list
    add unassigned[0] to G[Ng]
    pointer = 0
    while size(G[Ng]) < maxsize:
        s_seed = G[Ng][pointer]
        s_new = top_neighbour(s_seed)
        add s_new to G[Ng]
        remove s_seed from unassigned
        remove s_new from unassigned
        pointer = pointer + 1
    Ng = Ng + 1    // start a new group for the remaining features

The derivation of the K(S|G) matrix:

P: a dictionary where P(S[i]) represents the prior probability of the ith feature, calculated in (1)

Constructing K:
    for i in all rows of K:
        for j in all columns of K:
            if S[j] belongs to G[i]:
                K[i, j] = P(S[j])

Normalization of K:
    for i in all rows of K:
        SUM = Σ_j K[i, j]
        for j in all columns of K:
            K[i, j] = K[i, j] / SUM

2.1.2 Inference of a level-one node

After training, the level-one node is switched into the inference state. The two steps of inference are introduced below.

2.1.2.1 Calculate the degree of match between the input and the memorized features

Since the input image patch will not perfectly match the stored patterns, we adopt the original HTM assumption [9] that a Gaussian probability density function describes the degree of match between an input pattern and a stored pattern:

p(input(t) | S[i]) = exp(−||S[i] − input(t)||² / Δ)

where Δ is a parameter of the node that decides how quickly the match probability decreases as the Euclidean distance grows.
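In code, this match is a one-liner; the sketch below assumes binary patches, for which the XOR count equals the squared Euclidean distance (the Appendix B code computes the distance the same way):

import numpy as np

def match_probability(patch, feature, delta):
    # p(input(t) | S[i]) = exp(-||S[i] - input(t)||^2 / delta)
    d2 = float(np.sum(patch ^ feature))   # squared distance for 0/1 arrays
    return float(np.exp(-d2 / delta))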



2.1.2.2 Calculate the degree of membership of the input in each of the temporal groups

The goal of this step is to propagate the belief over features to a belief over temporal groups. Here the K matrix calculated at the end of temporal grouping in Sec. 2.1.1 is used. Specifically, according to the definition of K, the belief that the current input belongs to the jth temporal group, p(input(t)|G[j]), is calculated as:

p(input(t) | G[j]) = Σ_i p(input(t) | S[i]) · p(S[i] | G[j])

We compute this likelihood for each of the temporal groups, and the final output of level-one inference is a vector indicating the likelihood that the current input belongs to each of the temporal groups. Since this inference mechanism is instantaneous (the inference depends on the current input only), we call it instantaneous inference, to distinguish it from the sequential inference that will be introduced in Sec. 2.3. The pseudo code is shown below:

Pseudo code of instantaneous inference for level one:

match: a dictionary where the ith element is the matching probability p(input(t)|S[i])
belief: a dictionary where the jth element is the likelihood that the current input belongs to G[j]

First step:
    for i in all memorized features:
        match[i] = exp(−||S[i] − input(t)||² / Δ)

Second step:
    for j in all temporal groups:
        belief[j] = Σ_l match[l] · K[j, l]

belief is the final output belief vector of the level-one node. Fig. 7 shows an example of level-one inference.

Fig. 7. Example of inference process in a level-one node: the image patch at the bottom is the input. According to the feature memory, the input is most likely to be c4, and c4 belongs to temporal group g2, so the output belief vector has its highest value for g2.

2.2 Operations of a higher-level node

2.2.1 Training of a higher-level node

After the level-one training is finished, all the level-one nodes are switched into the inference stage; at the same time the level-two nodes enter the learning stage. The operations performed in level-two training are generally similar to those of level one; the difference lies in the first step, memorizing features. Since the input to a level-two node is no longer an image patch but the concatenation of the inference outputs sent by its level-one child nodes, the features learned by the level-two node are expressed in terms of the level-one nodes' inference results. Fig. 8 explains the detailed operation of feature learning in a level-two node; after this step, the subsequent two steps (building the transition graph and temporal grouping)


are the same as those of level one.

2.2.2 Inference of a higher-level node

Inference in a level-two node is also similar to that of level one. The difference lies in the first step, calculating the degree of match between the input and the memorized features; an explanation of this first step of the inference process in a level-two node is shown in Fig. 9. After this step, the subsequent step, computing the degree of membership in the temporal groups, is the same as in level one.


Fig. 8. Example of two-level network training. The training image for the level-two node is the rectangular image patch; during training, it is first divided in two, and each half corresponds to the receptive field of one of the level-one child nodes. These two smaller image patches are sent to their corresponding level-one nodes first; each level-one node calculates the belief over its own temporal groups and sends the result to the level-two node. The strategy by which the level-two node concatenates its child nodes' beliefs is a winner-take-all principle: in this case, the maximum component of the belief sent by the left child node is the second temporal group, and the maximum component of the belief sent by the right child node is the fourth temporal group, so the level-two node remembers the combination of these two temporal groups, representing the "U" shape in its visual area. Thus the patterns memorized in a higher-level node are actually combinations of the most likely temporal groups of the lower-level nodes.
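The winner-take-all concatenation is easy to state in code; here is an illustrative sketch (the function name is ours, not from the thesis code):

import numpy as np

def level_two_pattern(child_beliefs):
    # child_beliefs: one belief vector per level-one child node;
    # the level-two 'feature' is the tuple of each child's winning
    # temporal group, e.g. (g2, g4) for the "U" shape in Fig. 8
    return tuple(int(np.argmax(b)) for b in child_beliefs)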


Fig. 9. Example of two-level network inference. The level-two node already has a memory of all the combinations and a memory of the temporal groups. After receiving the beliefs sent by the child nodes, the level-two node has to decide how closely its input matches its stored patterns. To do this, the level-two node concatenates the beliefs: since the first stored pattern is the combination of the second temporal group of the left child and the fourth temporal group of the right child, the probability that the input matches the first stored pattern is simply the second component of the left child's belief multiplied by the fourth component of the right child's belief: 0.7 × 0.2 = 0.14. Similarly, the closeness of match of the second stored pattern is calculated as 0.7 × 0.1 = 0.07. These probabilities can then be used to calculate the input's degree of membership in the level-two node's temporal groups, and the process is the same as that of level-one inference.
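A sketch of this matching step (again with illustrative names):

def match_stored_pattern(child_beliefs, stored):
    # stored: tuple of temporal-group indexes, one per child, as memorized
    # during training; the match is the product of the corresponding belief
    # components (0.7 * 0.2 = 0.14 for the first pattern in Fig. 9)
    p = 1.0
    for belief, g in zip(child_beliefs, stored):
        p *= belief[g]
    return p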


The training and inference for levels higher than two proceed similarly: the features learned by a higher-level node are combinations of the outputs sent by its child nodes, and the higher-level node's inference is also based on these outputs. Note that the indexes of the temporal groups of the top node are simply the labels of the objects to be recognized, and the inference output of the top node is the recognition belief over the objects.

2.3 Extensions of HTM

This section presents the details of the extension we made to the HTM model. Sec. 2.3.1 introduces the dynamic programming algorithm for sequential inference in a level-one Markov Chain, Sec. 2.3.2 introduces the implicit temporal hierarchy of the HTM, and Sec. 2.3.3 introduces the modified dynamic programming algorithm for sequential inference in a temporal hierarchy.

2.3.1 Dynamic programming algorithm for sequential inference in a level-one Markov Chain

A closer look at Fig. 5 tells us that during training, the temporal groups learned by the level-one node are small Markov Chains that describe the sequential relations between the patterns in the group. Thus, it is possible to use this memory of sequential information to accumulate belief as a series of patterns arrives. The process is explained below:


Initially, at time 0, input(0) arrives, and the level-one node calculates its matching probability with each of its memorized visual features. We then propagate these beliefs to the probability of temporal-group membership (which is the same as the instantaneous inference introduced in Sec. 2.1.2, since there is only one input and no sequential information is needed). At time 1, input(1) arrives, and the first step is again to calculate its matches with the memorized patterns, p(input(1)|S[x]). Then, since our goal is to compute the cumulative belief p(input(0), input(1)|G[x]), we must consider every possible transition between features from time 0 to time 1. That is where the sequential information stored in the temporal groups is used, and there are two steps to calculate the cumulative belief:

α(1)[i, j] = p(input(1), input(0), S[j] | G[i]) = p(input(1) | S[j]) · Σ_k T[k, j] · α(0)[i, k]

Then

p(input(1), input(0) | G[i]) = Σ_j p(input(1), input(0), S[j] | G[i])

where α represents the dynamic programming variable:

α(t)[i, k] = p(input(t), input(t−1), input(t−2), ..., input(0), S[k] | G[i])

From the deduction above, we can define a general rule for propagating the belief from t−1 to t: we keep a memory of the dynamic programming variable at t−1; then, when input(t) arrives, we accumulate the belief by combining the information stored in the T matrix and α(t−1).
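The recursion is compact when written with matrices. The following NumPy sketch of one update step uses our own names (the thesis implementation in Appendix B keeps a separate α array per node and loops explicitly):

import numpy as np

def update_alpha(alpha_prev, match, T):
    # alpha_prev[i, k] = p(input(t-1), ..., input(0), S[k] | G[i])
    # match[j]         = p(input(t) | S[j])
    # T[k, j]          = p(S[j] | S[k]), the row-normalized transition matrix
    # returns alpha[i, j] = match[j] * sum_k T[k, j] * alpha_prev[i, k]
    return match[np.newaxis, :] * (alpha_prev @ T)

def group_belief(alpha):
    # p(input(t), ..., input(0) | G[i]) = sum over features j of alpha[i, j]
    return alpha.sum(axis=1)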

2.3.2 Implicit temporal hierarchy of the HTM

Sec. 2.3.1 introduced sequential inference for a single Markov Chain. However, after the level-one node sends its output to level two, a temporal group of level one becomes a single state (or feature) at level two (Fig. 10 explains this relation), and level two further learns its own temporal groups by grouping its states (features). Thus, there is also a temporal hierarchy in the HTM model, in which a higher level learns longer sequences as combinations of the shorter sequences of the lower level.

Fig. 10. Illustration of the temporal hierarchy in HTM. s1-s6 are the features learned by the level-one node; g1 and g2 are temporal groups of those features. g1 and g2 in turn become the features of level two, F1 and F2; F1 and F2 form the temporal group G1 of level two, which becomes the feature P1 at level three.


2.3.3 Sequential inference in a temporal hierarchy

To explain how to perform sequential inference in a temporal hierarchy, we first distinguish two kinds of inference: the first is the instantaneous inference introduced in Sec. 2.1.2, which performs inference based on only the current input; the second is the sequential inference introduced in Sec. 2.3.1, which performs inference based on both the current and the past inputs.

We introduce the idea with the example of inference about a song, a sequence of musical notes. Imagine for Fig. 10 that P1 at level three represents a song, F1 and F2 at level two represent the two melodies that form the song, and s1-s6 at level one are the notes that form the melodies. Suppose we have memorized this temporal hierarchical structure of the song, and the song is now played from the first note. Fig. 11 shows an illustration of how our proposed inference method works. There are three stages. First, within the period of the first melody (time 0-1-2), level one performs sequential inference for the melody and sends its output to level two, and F1 receives the instantaneous input from level one to infer the song's name. The second period is from time step 2 to time step 3: level one resets its sequential memory and starts to infer the new incoming melody F2, since according to our memory the first melody has ended. For level two, we must consider both the instantaneous input sent by level one at time 3 and the transition between the previous melody and the new melody; thus level two performs sequential inference during this period. The third period is time 3-4-5, during which level one accumulates its belief about F2, and level two's inference again depends only on the instantaneous input sent from level one, since no new melody is arriving and thus no transition has to be considered. For level three, since the song remains the same during the whole process, the top node performs only instantaneous inference based on the input sent by level two.

Fig. 11. An illustration of sequential inference in a temporal hierarchy. Assume each note occupies one time step; α(t) represents the memory of level one, and t represents the cumulative memory length.

According to the deduction above, the essence of the idea is that a lower-level node resets its memory after a temporal group has ended; at the same time, since the ending of a lower-level temporal group indicates a transition between states at the higher level, the higher-level parent node performs sequential inference at that moment. If we assume, for a node at level x, that the average time length of its child node's temporal groups is a, and the average time length of its own temporal groups is b, then the pseudo code for our hierarchical temporal inference at level x is as follows:

if level x's temporal group just ended:
    reset memory
else if within the period of b:
    if within the period of a:
        do instantaneous inference
    else:
        do sequential inference
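When the temporal groups are aligned to multiples of their lengths, as in our experiments (a = 6 frames and b = 12 frames for a level-two node), the mode selection reduces to modular arithmetic. A minimal sketch, with illustrative names:

def inference_mode(t, a, b):
    # a: average temporal-group length of the child node (in time steps)
    # b: average temporal-group length of this node
    if t % b == 0:
        return "reset"          # own group ended: restart with instantaneous inference
    elif t % a == 0:
        return "sequential"     # a child group ended: account for the state transition
    else:
        return "instantaneous"  # within a child group: no transition to consider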


Chapter 3 - Experiments and Results

3.1 Experiment setting

We evaluate our method on the CASIA gait dataset C [13], which includes silhouette gait images of 151 individuals. We first transform all the images into binary-valued 2-d arrays, train our algorithm with 36 frames (three steps) of each individual walking normally (without a bag), and test it with 12 frames. For the temporal hierarchy, this sequence is divided into three one-step segments (12 frames each) at level two, which are further divided into six half-step segments (6 frames each) at level one (shown in Fig. 12); thus the average length of a temporal group (the maxsize parameter during training) is 6 for level one, 6 × 2 = 12 for level two, and 12 × 3 = 36 for level three. For the spatial hierarchy, we divide the gait image into three areas at level two, corresponding to the upper body, middle body, and lower body, which are further divided into six sub-areas (Fig. 13).

3.2 Parameter selection


For the Threshold parameter used in the first step of level-one training, we set it to zero, to let the algorithm memorize as many features as possible. For the Δ in the Gaussian probability function, since the image patches of different sub-areas have different sizes, Δ should be chosen separately for each sub-area. The criteria are that Δ should not be too large (which would make the algorithm insufficiently sensitive and unable to distinguish the true pattern) and not too small (which would make the algorithm too sensitive to noisy input). After trial and error, we found that the Δ values that are robust yet selective enough for the different sub-areas are 5 for the head region, 15 for the arm region, and 35 for the leg region.

Fig. 12. Illustration of the temporal hierarchy we built for the gait sequence, which corresponds to the model in Fig. 11. Abbreviations: seq. means sequential inference, and inst. means instantaneous inference.


Fig. 13. Spatial hierarchy we built for the gait image; at the first level, the sub-areas correspond to the head, the two arms, and the two legs.

3.3 Results

We tested the algorithm on three cases: recognition of a normally walking person, recognition of a normally walking person carrying a bag, and recognition from noisy normal-walking gait images. Table 1 compares our algorithm with other methods that use the same dataset on the normal test and the bag test. It shows that our method performs better than Tan's methods on the test with a bag: since HTM's belief is generated by concatenating the beliefs of all its sub-areas, and the difference between carrying a bag and not carrying one affects only one sub-area, carrying a bag does not significantly affect the overall result. Table 2 shows our results on the test under noise. Specifically, since the binary image is of size 130x100 (the extracted region contains foreground pixels only), to generate 1% noise, 130 pixels are randomly chosen and have their values inverted. We tested our method under 10 noise levels, ranging from 1% to 10%, and we can observe that the recognition results are essentially unaffected by the noise, since our algorithm is able to apply dynamic sequential inference to disambiguate the effect of noise at each time step. Table 3 shows an illustrative example of how the belief of our extended HTM over five individuals changes when it is tested with noisy images, and Fig. 14 shows the corresponding plot of the dynamic beliefs of the five persons over time. We observe that the algorithm's belief for persons 2, 3, 4, and 5 drops significantly with time, while the belief for the first person (the true answer) changes little, making the overall belief for the first person sharper. That is because, as more gait images arrive, more sequential information can be used to strengthen the belief. Table 4 shows the recognition result of the original Zeta1 model on the same noisy gait images, and Fig. 15 shows the corresponding plot of the dynamic beliefs. We can see that the belief distribution does not show an obvious change with time. That is because Zeta1's inference is based on the current input only, so there is no cumulative belief, and Zeta1 is more sensitive to noise than our extended HTM; for example, Zeta1's belief at time step 3 is distorted, because the instantaneous input at time step 3 is noisier than the previous inputs.
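The noise model described above is straightforward to reproduce; a minimal sketch, assuming 0/1 integer images:

import numpy as np

def add_noise(img, level, rng=np.random.default_rng()):
    # flip a fraction `level` of the pixels of a binary image;
    # for the 130x100 silhouettes, level = 0.01 inverts 130 pixels
    noisy = img.copy()
    n = int(round(level * img.size))
    idx = rng.choice(img.size, size=n, replace=False)
    noisy.reshape(-1)[idx] ^= 1
    return noisy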


Table 1. Comparison with other methods that use CASIA dataset C

Exp case   proposed method   Tan's method (CVPR 2007) [14]   Tan's method (ICPR 2006, a subset of 46 subjects) [15]
normal     80%               98.40%                          94%
bag        66%               42.70%                          51%

Exp case   SVM classifier [16]   Counting classifier [16]   Zhang's method (Signal Processing 2010) [17]
normal     99.84%                99.32%                     88.89%
bag        89.82%                88.41%                     79.74%


Table 2. Performance with normal walking under noise

noise level        1%   2%   3%   4%   5%   6%   7%   8%   9%   10%
recognition rate   80%  79%  79%  80%  81%  81%  78%  80%  81%  79%

Table 3. Illustrative test result with noisy gait sequence (extended HTM)

T \ Person   1        2        3        4        5
0            10^-30   10^-44   10^-45   10^-46   10^-58
1            10^-31   10^-73   10^-85   10^-75   10^-68
2            10^-31   10^-106  10^-117  10^-108  10^-97
3            10^-32   10^-149  10^-154  10^-140  10^-130
4            10^-32   10^-201  10^-203  10^-184  10^-173
5            10^-33   10^-287  10^-272  10^-232  10^-211

The numbers in the header row are the indexes of the different individuals, the column under 'T' lists the time steps, and the values in the other cells are the probabilities (shown as powers of ten) of the extended HTM's inference result.


Figure 14. Dynamic beliefs of the five individuals over time (extended HTM). The y axis shows the log-likelihoods, and the x axis shows the time instances. The red line indicates the beliefs for the first person, the blue line the second person, the black line the third person, the yellow line the fourth person, and the green line the fifth person.


Table 4. Illustrative test result with noisy gait sequence (original Zeta1 model)

T \ Person   1        2        3        4        5
0            10^-30   10^-44   10^-46   10^-44   10^-37
1            10^-25   10^-39   10^-40   10^-40   10^-34
2            10^-21   10^-32   10^-34   10^-31   10^-29
3            10^-47   10^-30   10^-31   10^-26   10^-34
4            10^-18   10^-33   10^-30   10^-30   10^-28
5            10^-23   10^-32   10^-34   10^-31   10^-28


Figure 15. Dynamic beliefs of the five individuals over time (original Zeta1 model). The meaning of the axes and the correspondence between colors and individuals are the same as in Figure 14.


Chapter 4 - Conclusion

In this research we presented an extended HTM and applied it to gait recognition. By incorporating a hierarchical dynamic programming algorithm, the extended HTM can infer a complex sequence in terms of the temporal memory of shorter sub-sequences. The results show that our method is superior to some existing methods and comparable to others. One strength of HTM is that it uses only the image patch as its feature, where other methods use gait-specific features, such as GAI (Gait Energy Accumulation Image) and AEI (Active Energy Image) [16], making HTM a more generic model for visual problems that does not require careful feature design. The application to gait recognition demonstrates the effectiveness of HTM, a nascent mechanism that we believe might be a good solution for medical problems with a spatial and temporal hierarchical structure, for example EEG and ECG analysis. As for future work, we believe several things can be improved. First, the segmentation of the gait image in the current version of HTM is done manually; to get better performance, an automatic method that can autonomously localize the positions of the head, arms, and legs is necessary. One possible approach is to measure the pairwise mutual information between the variations of different local regions of the gait image; the regions that have high mutual information are more likely to belong to the same body part. Second, the estimation of the gait period in the HTM is also done manually, but since the gait cycle tends to vary across individuals, an automatic method that can accurately estimate the temporal length of steps is needed. We believe the solution to this problem is to measure the periodicity of the movement of the body-part positions.


References

[1] Baccouche, Moez; Mamalet, Franck; Wolf, Christian; Garcia, Christophe; Baskurt, Atilla (2011). "Sequential Deep Learning for Human Action Recognition". In Salah, Albert Ali; Lepri, Bruno (eds.), Human Behavior Understanding. Lecture Notes in Computer Science 7065. Springer Berlin Heidelberg, pp. 29-39. doi:10.1007/978-3-642-25446-8_4.
[2] Ji, Shuiwang; Xu, Wei; Yang, Ming; Yu, Kai (2013). "3D Convolutional Neural Networks for Human Action Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1): 221-231. doi:10.1109/TPAMI.2012.59.
[3] Karpathy, Andrej, et al. (2014). "Large-scale Video Classification with Convolutional Neural Networks". IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Simonyan, Karen; Zisserman, Andrew (2014). "Two-Stream Convolutional Networks for Action Recognition in Videos". arXiv:1406.2199 [cs.CV].
[5] Taylor, Graham W.; Fergus, Rob; LeCun, Yann; Bregler, Christoph (2010). "Convolutional Learning of Spatio-temporal Features". Proceedings of the 11th European Conference on Computer Vision (ECCV'10), Part VI. Springer-Verlag, Berlin, Heidelberg, pp. 140-153.
[6] Le, Q. V.; Zou, W. Y.; Yeung, S. Y.; Ng, A. Y. (2011). "Learning Hierarchical Invariant Spatio-temporal Features for Action Recognition with Independent Subspace Analysis". Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11). IEEE Computer Society, Washington, DC, USA, pp. 3361-3368. doi:10.1109/CVPR.2011.5995496.
[7] A. Jain, A. R. Zamir, S. Savarese, A. Saxena (2016). "Structural-RNN: Deep Learning on Spatio-Temporal Graphs". CVPR 2016.
[8] Shai Fine, Yoram Singer, and Naftali Tishby (1998). "The Hierarchical Hidden Markov Model: Analysis and Applications". Machine Learning 32(1): 41-62.
[9] D. George (2008). "How the Brain Might Work: A Hierarchical and Temporal Model for Learning and Recognition". Ph.D. Thesis, Stanford University.
[10] J. Hawkins, D. George (2006). "Hierarchical Temporal Memory: Concepts, Theory, and Terminology". Numenta.
[11] D. Maltoni (2011). "Pattern Recognition by Hierarchical Temporal Memory". DEIS Technical Report, April 2011.
[12] Numenta (2011). "Hierarchical Temporal Memory including HTM Cortical Learning Algorithms". Numenta, September 12, 2011.
[13] CASIA Gait Database, http://www.cbsr.ia.ac.cn/english/Gait%20Databases.asp
[14] D. Tan, K. Huang, S. Yu, and T. Tan (2007). "Recognizing Night Walkers Based on One Pseudoshape Representation of Gait". IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007, pp. 1-8.
[15] D. Tan, K. Huang, S. Yu, and T. Tan (2006). "Efficient Night Gait Recognition Based on Template Matching". 18th International Conference on Pattern Recognition (ICPR 2006), vol. 3, pp. 1000-1003.
[16] Jianzhao Qin, T. Luo, W. Shao, R. H. Y. Chung and K. P. Chow (2012). "A Bag-of-Gait Model for Gait Recognition". WorldComp Proceedings.
[17] E. Zhang, Y. Zhao, and W. Xiong (2010). "Active Energy Image Plus 2DLPP for Gait Recognition". Signal Processing 90(7): 2295-2302. Available: http://www.sciencedirect.com/science/article/pii/S0165168410000411


Appendix A - Description of the CASIA gait dataset C

Dataset C was collected by an infrared (thermal) camera in July-August 2005. It contains 153 subjects and takes into account four walking conditions: normal walking, slow walking, fast walking, and normal walking with a bag. The videos were all captured at night. The format of a video filename in Dataset C is '01xxxmmnn.avi', where:

• xxx: subject id, from 001 to 153.
• mm: walking status, can be 'fn' (normal), 'fq' (fast walk), 'fs' (slow walk) or 'fb' (with a bag).
• nn: sequence number.


Appendix B - Python code

Level one training:

training_sequence: a sequence of continuously changing images
coincidence_dic: a memory of learned spatial features (in 2-d array form)
P: prior probabilities of the memorized spatial features, where P[i] represents the prior probability of the ith memorized feature
Temporal_connection: a transition matrix where T[i,j] records how many times spatial feature j immediately follows spatial feature i
Nc: a variable representing the current number of memorized features of the level-one node

import operator
import numpy as np

def level_one_training(training_sequence):
    Nc = -1
    coincidence_dic = {}
    coincidence_counter = {}
    Kprev = 0
    thrDist = 1
    T = np.zeros((5000, 5000), dtype=float)   # transition-count matrix
    for i in range(len(training_sequence)):
        if i == 0:
            Nc += 1
            coincidence_dic[Nc] = training_sequence[0]
            Kprev = Nc
            coincidence_counter[Nc] = 1
        else:
            # XOR count = squared Euclidean distance for binary patches
            LIST = [sum(sum(training_sequence[i] ^ f)) for f in coincidence_dic.values()]
            Kstar = LIST.index(min(LIST))
            if min(LIST) > thrDist:            # novel pattern: memorize it
                Nc += 1
                coincidence_dic[Nc] = training_sequence[i]
                Kstar = Nc
                coincidence_counter[Kstar] = 0
            coincidence_counter[Kstar] += 1
            if (i + 1) % 337 != 1:             # do not count transitions across sequence boundaries
                T[Kprev, Kstar] += 1
            Kprev = Kstar
    # prior probabilities of the memorized features
    P = {}
    values = [float(v) for v in coincidence_counter.values()]
    SUM = sum(values)
    for k in range(len(values)):
        P[k] = values[k] / SUM
    l = Nc + 1
    Temporal_connection = T[0:l, 0:l]
    return coincidence_dic, P, Temporal_connection, Nc

Row-wise normalization of the transition matrix:

def finalize_level_one_node_training(coincidence_dic, P, T):
    length = len(coincidence_dic.keys())
    # symmetrize the raw transition counts
    for i in range(length):
        for j in range(length):
            T[i, j], T[j, i] = 0.5 * (T[i, j] + T[j, i]), 0.5 * (T[j, i] + T[i, j])
    # row-wise normalization: each row of T becomes a probability distribution
    for i in range(length):
        SUM = sum(T[i, :])
        for j in range(length):
            if SUM != 0:
                T[i, j] = T[i, j] / SUM
            else:
                T[i, j] = 0
    return T

Temporal grouping:

G: learned temporal groups, where G[i] represents the ith temporal group

def Temporal_grouping(P, T):
    G = {}
    ng = -1
    length = len(P.keys())
    assigned = {}
    for i in range(length):
        assigned[i] = False
    Unassigned = [j for j in assigned.keys() if assigned[j] == False]
    while list(assigned.values()).count(True) != length:
        # pick the unassigned feature with the largest weighted incoming connection
        temporal_connection = {}
        LEN = len(Unassigned)
        for coince in Unassigned:
            temporal_connection[coince] = 0
        for coince in Unassigned:
            for l in range(LEN):
                temporal_connection[coince] += T[Unassigned[l], coince] * P[Unassigned[l]]
        Kstar = max(temporal_connection.items(), key=operator.itemgetter(1))[0]
        ohm = [Kstar]
        pos = 0
        groupMaxSize = 11
        # grow the group by repeatedly adding the strongest neighbors
        while (pos <= len(ohm) - 1) and (pos < groupMaxSize - 1):
            START = ohm[pos]
            Neighbor_connections = {}
            for neighbor in Unassigned:
                Neighbor_connections[neighbor] = T[START, neighbor]
            i = 0
            while i <= 3 and Neighbor_connections != {}:
                Top_neighbour = max(Neighbor_connections.items(), key=operator.itemgetter(1))[0]
                if Top_neighbour not in ohm:
                    ohm.append(Top_neighbour)
                    i += 1
                del Neighbor_connections[Top_neighbour]
            pos += 1
        ng += 1
        G[ng] = []
        for k in range(pos):
            G[ng].append(ohm[k])
            assigned[ohm[k]] = True
            Unassigned.remove(ohm[k])
    return G

Constructing the K matrix:

PCG is the K matrix introduced in the thesis, stored transposed: PCG[i,j] = p(S[i]|G[j])

def PCG_compute(P, G):
    LEN_1 = len(P.keys())   # number of features
    LEN_2 = len(G.keys())   # number of temporal groups
    PCG = np.zeros((LEN_1, LEN_2), dtype=float)
    for i in range(LEN_2):
        for j in range(LEN_1):
            if j in G[i]:
                PCG[j, i] = P[j]
    # column-wise normalization: each column becomes the distribution p(S | G[i])
    for i in range(LEN_2):
        SUM = sum(PCG[:, i])
        for j in range(LEN_1):
            PCG[j, i] = PCG[j, i] / SUM
    return PCG

Level one inference:

def level_one_node_inference(imagepatch, node_one, parameter, t,
                             internal_time, interval, mode_one):
    length = len(node_one.coincidence_dic.keys())
    a, b = node_one.alpha.shape   # a: number of temporal groups, b: features per group

    # degree of match with every memorized feature (Gaussian assumption)
    densities_over_coincidences = {}
    for j in range(length):
        dist = sum(sum(imagepatch ^ node_one.coincidence_dic[j]))
        densities_over_coincidences[j] = np.exp(-dist / parameter)

    index = max(densities_over_coincidences.items(), key=operator.itemgetter(1))[0]
    if t == 0:
        node_one.prev_index = index   # stored on the node so it persists between calls
        internal_time.time = t
    else:
        if mode_one == 0:
            interval.value = index - node_one.prev_index
        node_one.prev_index = index
        internal_time.time += 1       # advance the node's internal clock by one frame

    if internal_time.time % 6 == 0:
        # a new temporal group arrives: reset and do instantaneous inference
        for k in range(a):
            for v in range(b):
                node_one.alpha[k, v] = (densities_over_coincidences[k*b + v]
                                        * node_one.PCG[k*b + v, k])
                node_one.alpha_prev[k, v] = node_one.alpha[k, v]
    else:
        if interval.value == 1:
            # within the temporal group: sequential inference over one transition
            for k in range(a):
                for v in range(b):
                    node_one.alpha[k, v] = 0
                    for w in range(b):
                        node_one.alpha[k, v] += (densities_over_coincidences[k*b + v]
                                                 * node_one.T[k*b + w, k*b + v]
                                                 * node_one.alpha_prev[k, w])
            for c in range(a):
                for d in range(b):
                    node_one.alpha_prev[c, d] = node_one.alpha[c, d]
        elif interval.value == 2:
            # a frame was skipped: marginalize over the intermediate feature
            for k in range(a):
                for v in range(b):
                    node_one.alpha[k, v] = 0
                    for h in range(b):
                        factor = 0
                        for l in range(b):
                            factor += node_one.T[k*b + h, k*b + l] * node_one.T[k*b + l, k*b + v]
                        node_one.alpha[k, v] += (densities_over_coincidences[k*b + v]
                                                 * factor * node_one.alpha_prev[k, h])
            for c in range(a):
                for d in range(b):
                    node_one.alpha_prev[c, d] = node_one.alpha[c, d]

    group_activation = {}
    for l in range(a):
        group_activation[l] = sum(node_one.alpha[l, :])
    return group_activation


Level two training:

def level_two_node_training(training_sequence, coincidence_dic, PCG, G, label):
    coincidence_dic_leveltwo = {}
    coincidence_counter = {}
    Temporal_connection = np.zeros((5000, 5000), dtype=float)
    Nc_two = -1
    Kprev = 0
    thrDist = 0
    for i in range(len(training_sequence)):
        target = training_sequence[i]
        # split the patch into the four level-one receptive fields
        LIST = [target[0:4, 0:4], target[0:4, 4:8], target[4:8, 0:4], target[4:8, 4:8]]
        childwinning_group = []
        for j in range(len(LIST)):
            group_activation = level_one_inference(LIST[j], coincidence_dic, PCG, G)
            index = max(group_activation.items(), key=operator.itemgetter(1))[0]
            childwinning_group.append(index)   # winner-take-all learning principle
        childwinning_group = np.array(childwinning_group)
        if i == 0:
            Nc_two += 1
            coincidence_dic_leveltwo[Nc_two] = childwinning_group
            coincidence_counter[Nc_two] = 1
            Kprev = 0
        else:
            dist_list = [sum(abs(childwinning_group - f))
                         for f in coincidence_dic_leveltwo.values()]
            Kstar = dist_list.index(min(dist_list))
            if min(dist_list) > thrDist:
                Nc_two += 1
                coincidence_dic_leveltwo[Nc_two] = childwinning_group
                Kstar = Nc_two
                coincidence_counter[Nc_two] = 0
            coincidence_counter[Kstar] += 1
            if label[i] == False:   # no transition counted across sequence boundaries
                Temporal_connection[Kprev, Kstar] += 1
            Kprev = Kstar
    P = {}
    values = [float(v) for v in coincidence_counter.values()]
    SUM = sum(values)
    for k in range(len(values)):
        P[k] = values[k] / SUM
    l = Nc_two + 1
    Temporal_connection = Temporal_connection[0:l, 0:l]
    return coincidence_dic_leveltwo, P, Temporal_connection, Nc_two

Level two inference:

def level_two_node_inference(imagepatch, node_two, left_node, right_node,
                             parameter_left, parameter_right, t,
                             internal_time, interval, mode_two):
    length = len(node_two.coincidence_dic.keys())
    a, b = node_two.alpha.shape
    sub_patch_one = imagepatch[:, 0:46]
    sub_patch_two = imagepatch[:, 46:92]
    activation_left = level_one_node_inference(sub_patch_one, left_node, parameter_left,
                                               t, internal_time, interval, mode_two[0])
    activation_right = level_one_node_inference(sub_patch_two, right_node, parameter_right,
                                                t, internal_time, interval, mode_two[1])
    # degree of match of each stored combination: product of the children's beliefs
    densities_over_coincidences_leveltwo = {}
    for i in range(length):
        densities_over_coincidences_leveltwo[i] = activation_left[i] * activation_right[i]

    if internal_time.time % 12 == 0:
        # this node's own temporal group just ended: instantaneous inference
        for k in range(a):
            for v in range(b):
                node_two.alpha[k, v] = (densities_over_coincidences_leveltwo[k*b + v]
                                        * node_two.PCG[k*b + v, k])
                node_two.alpha_prev[k, v] = node_two.alpha[k, v]
    elif internal_time.time % 6 == 0:
        # a child temporal group ended: sequential inference over the transition
        for k in range(a):
            for v in range(b):
                node_two.alpha[k, v] = 0
                for w in range(b):
                    node_two.alpha[k, v] += (densities_over_coincidences_leveltwo[k*b + v]
                                             * node_two.T[k*b + w, k*b + v]
                                             * node_two.alpha_prev[k, w])
        for c in range(a):
            for d in range(b):
                node_two.alpha_prev[c, d] = node_two.alpha[c, d]
    else:
        # within a child temporal group: instantaneous inference
        for k in range(a):
            for v in range(b):
                node_two.alpha[k, v] = (densities_over_coincidences_leveltwo[k*b + v]
                                        * node_two.PCG[k*b + v, k])
                node_two.alpha_prev[k, v] = node_two.alpha[k, v]

    group_activation = {}
    for l in range(a):
        group_activation[l] = sum(node_two.alpha[l, :])
    return group_activation

Output node training:

def output_node_training(training_sequence, label_levelthree, PCG,
                         coincidence_dic, G, one, two, three, four):
    PCW = np.zeros((6000, 10), dtype=float)
    coincidence_dic_levelthree = {}
    Nc_three = -1
    thrDist = 0
    for l in range(len(training_sequence)):
        target = training_sequence[l]
        LIST = [target[0:8, 0:8], target[0:8, 8:16],
                target[8:16, 0:8], target[8:16, 8:16]]
        groupactivation_one = level_two_node_inference(LIST[0], one, coincidence_dic, PCG, G)
        index_one = max(groupactivation_one.items(), key=operator.itemgetter(1))[0]
        groupactivation_two = level_two_node_inference(LIST[1], two, coincidence_dic, PCG, G)
        index_two = max(groupactivation_two.items(), key=operator.itemgetter(1))[0]
        groupactivation_three = level_two_node_inference(LIST[2], three, coincidence_dic, PCG, G)
        index_three = max(groupactivation_three.items(), key=operator.itemgetter(1))[0]
        groupactivation_four = level_two_node_inference(LIST[3], four, coincidence_dic, PCG, G)
        index_four = max(groupactivation_four.items(), key=operator.itemgetter(1))[0]
        childwinning_group = np.array([index_one, index_two, index_three, index_four])
        if l == 0:
            coincidence_dic_levelthree[0] = childwinning_group
            PCW[0, label_levelthree[0]] += 1
            Nc_three += 1
        else:
            dist_list = [sum(abs(childwinning_group - f))
                         for f in coincidence_dic_levelthree.values()]
            Kstar = dist_list.index(min(dist_list))
            if min(dist_list) > thrDist:
                Nc_three += 1
                coincidence_dic_levelthree[Nc_three] = childwinning_group
                Kstar = Nc_three
            PCW[Kstar, label_levelthree[l]] += 1
    return coincidence_dic_levelthree, PCW

Output node inference:

def level_three_node_inference(imagepatch, node_three, upper_node, middle_node,
                               bottom_node, upper_left, upper_right, middle_left,
                               middle_right, bottom_left, bottom_right,
                               parameter_list, t, internal_time, interval, mode_three):
    length = len(node_three.coincidence_dic.keys())
    a, b = node_three.alpha.shape
    # the three level-two receptive fields: upper, middle, and bottom body
    sub_patch_one = imagepatch[0:35, :]
    sub_patch_two = imagepatch[35:80, :]
    sub_patch_three = imagepatch[80:152, :]
    activation_up = level_two_node_inference(sub_patch_one, upper_node, upper_left,
                                             upper_right, parameter_list[0], parameter_list[1],
                                             t, internal_time, interval, mode_three[0:2])
    activation_middle = level_two_node_inference(sub_patch_two, middle_node, middle_left,
                                                 middle_right, parameter_list[2], parameter_list[3],
                                                 t, internal_time, interval, mode_three[2:4])
    activation_bottom = level_two_node_inference(sub_patch_three, bottom_node, bottom_left,
                                                 bottom_right, parameter_list[4], parameter_list[5],
                                                 t, internal_time, interval, mode_three[4:6])
    densities_over_coincidences = {}
    for i in range(length):
        densities_over_coincidences[i] = (activation_up[i] * activation_middle[i]
                                          * activation_bottom[i])

    if internal_time.time % 36 == 0:
        # the top node's own temporal group just ended: instantaneous inference
        for k in range(a):
            for v in range(b):
                node_three.alpha[k, v] = (densities_over_coincidences[k*b + v]
                                          * node_three.PCG[k*b + v, k])
                node_three.alpha_prev[k, v] = node_three.alpha[k, v]
    elif internal_time.time % 12 == 0:
        # a level-two temporal group ended: sequential inference over the transition
        for k in range(a):
            for v in range(b):
                node_three.alpha[k, v] = 0
                for w in range(b):
                    node_three.alpha[k, v] += (densities_over_coincidences[k*b + v]
                                               * node_three.T[k*b + w, k*b + v]
                                               * node_three.alpha_prev[k, w])
        for c in range(a):
            for d in range(b):
                node_three.alpha_prev[c, d] = node_three.alpha[c, d]
    else:
        # within a level-two temporal group: instantaneous inference
        for k in range(a):
            for v in range(b):
                node_three.alpha[k, v] = (densities_over_coincidences[k*b + v]
                                          * node_three.PCG[k*b + v, k])
                node_three.alpha_prev[k, v] = node_three.alpha[k, v]

    group_activation = {}
    for l in range(a):
        group_activation[l] = sum(node_three.alpha[l, :])
    return group_activation