
CENTER FOR MACHINE PERCEPTION
CZECH TECHNICAL UNIVERSITY

RESEARCH REPORT
ISSN 1213-2365

Learning Finite Automaton from Noisy Observations – A Simple Instance of a Bidirectional Signal-to-symbol Interface
(The initial experiment)

Working document of the EU project COSPAL IST-004176

Vaclav Hlavac, Zdenek Kalal
{hlavac,kalalz1}@fel.cvut.cz

CTU–CMP–2004–13
December 31, 2004

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/hlavac/HlavacKalalTr2004-13.pdf

The authors were supported by the projects IST-004176 COSPAL, CONEX GZ 45.535 and GA CR 102/00/1679. Zdenek Kalal is an MSc student and has conducted the experiments reported here as a semestral project.

Research Reports of CMP, Czech Technical University in Prague, No. 13, 2004

Published by
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University
Technicka 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz


Contents

1 Motivation in the COSPAL project

2 Bidirectional signal-to-symbol interface

3 Goals of the reported work

4 Hidden Markov Models briefly

5 Three basic problems in Hidden Markov Models and plan for experiments

6 Experiment – Probability of the chosen statistical model from given observations leads to matrix multiplication

7 Experiment – Optimal sequence of hidden states from given observations

8 Experiment – Learning a discrete HMM from a crisp training set

9 Experiment – Learning a discrete HMM from a noisy training set

10 Experiment – How to choose the number of hidden states?

11 Conclusions and future work

12 Acknowledgements

13 Appendix with the MATLAB code


Abstract

This report investigates how to learn a finite automaton model of an activity observed in the real world. The related theory is reviewed, a solution is proposed, and experiments are conducted. Learning a finite automaton is similar to learning a discrete Hidden Markov Model (HMM) using a variant of the EM algorithm. We used J. Dupac's discrete HMM Toolbox in Matlab. The experimental part of this work deals with learning an HMM from a synthetic training set generated from a known model. This approach provides us with ground truth.

1 Motivation in the COSPAL project

This work has been motivated by the COSPAL project, an EU STREP project which belongs to the Cognitive Systems call and has been running since July 2004 for three years. The project is coordinated by Linköping University, Sweden (M. Felsberg, G. Granlund); the Christian-Albrechts-Universität Kiel, Germany (G. Sommer), the University of Surrey, UK (J. Kittler), and our Czech Technical University (V. Hlavac) teams participate.

The COSPAL project deals with artificial cognitive systems, in particular with system design, and seeks related learning strategies. The novelty of the approach lies in the interaction of continuous and symbolic perception and action. The learning strategy is based on the idea that perception is learned by incremental (online) learning of percept-action mappings.

Project achievements are to be shown on a demonstrator, for which a shape sorter puzzle was selected. This toy is used to help develop the perceptual and motor skills of children between one and two years of age. The toy consists of a box with holes of several shapes (circular, square, triangular, oval, star-like, etc.). There are blocks (prisms) whose bottoms match the holes one-to-one. The child's task is to put each block into a matching hole.

One of the COSPAL goals is to find out whether the perception-action capabilities can be learned from scratch. This is probably an overambitious goal even for the simple behavior which the shape sorter puzzle requires. Even so, another interesting question is how many innate capabilities the system has to have.

The COSPAL system architecture, if simplified, consists of three layers, see Fig. 1. Let us describe one possible instance of this architecture. The lower level, which learns perception-action associations, is shown on the left. The middle level is constituted by the bidirectional signal-to-symbol interface. The top level is responsible for symbolic manipulation and planning. The CMP Prague team is responsible for the middle level.

Consider a scenario with a robot and a shape sorter puzzle. A camera is attached to the external world and overlooks the scene with the puzzle and a robot arm manipulating blocks. Assume that the system runs in a supervisor-operated mode in which many examples of a block grasp and successful insertions are generated. The system is supposed to learn the activity from scratch.


Figure 1: COSPAL system architecture, simplified. (The diagram shows the interweaving of percepts and actions with combined, incremental learning at the lower level, the bidirectional signal-to-symbol interface in the middle, and symbolic manipulation, planning, and communication at the top, driven by reinforcement of purpose.)

Let us consider a trajectory of the robot gripper in 3D space and time. Percepts from a vision module somehow reflect the motion of the gripper. This simplifies to motion vectors parameterized by the vector origin, direction, and length. Clustering of these parameters, generated from the examples in the training set obtained in a teaching phase, leads to generalization. These naturally created clusters correspond to symbols.

Our other motivation was the interest in learning and recognizing behaviors of people in surveillance video sequences. This problem can also be treated as deriving symbolic information from signals (2D images and changes in them).

This report is organized as follows. Section 2 formulates the concept of a bidirectional signal-to-symbol interface and explains why the finite automaton was chosen as the representation of the observed behavior. Section 3 lists the goals of the research reported here. Section 4 introduces Hidden Markov Models (HMMs) very briefly and mentions the Baum-Welch algorithm which allows learning a Hidden Markov Model from real observations. Section 5 formulates three problems usually solved with Hidden Markov Models which also constitute the tasks for our experiments. Sections 6-10 constitute the experimental core of the report, in which we learned some practical aspects of the chosen behavior representation by the finite automaton and of the Hidden Markov statistical model which allows inferring the automaton from noisy observations. The last Section 11 draws conclusions and discusses potential future work.

2 Bidirectional signal-to-symbol interface

The relation between numeric and symbolic information has been of interest for a long time. A good introductory text from a robotic point of view is [3]. Here, the transition from numeric to symbolic information is seen as discretization with a variable discretization interval or, alternatively, as clustering in one dimension. The transition from symbols to signals is performed by providing a typical numerical value for the interval/cluster. The spatial reasoning point of view on the signal-to-symbol interface is given in [2].
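As an illustration of this view, here is a minimal MATLAB sketch (not part of the toolbox; the interval boundaries, typical values, and the signal are made up for the example). A 1-D signal is mapped to symbols by fixed discretization intervals, and each symbol is mapped back to a typical numeric value of its interval:

% Signal-to-symbol: discretize a 1-D signal into symbols and map the
% symbols back to representative numeric values.
edges   = [0 1 2 3];          % interval boundaries (illustrative)
centers = [0.5 1.5 2.5];      % typical value reported for each symbol
signal  = [0.2 1.7 2.9 0.8];  % example numeric observations

symbols = zeros(size(signal));
for i = 1:length(signal)
  symbols(i) = sum(signal(i) >= edges(1:end-1));   % index of the interval
end
reconstruction = centers(symbols);   % symbols back to numeric values

disp(symbols)          % 1 2 3 1
disp(reconstruction)   % 0.5 1.5 2.5 0.5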

The idea behind this work was to use the simplest possible high-level representation which can be learned from real behaviors. If the selected representation is a finite automaton, then the signal-to-symbol interface boils down to learning this automaton from examples.

There exists an algorithm in classical automata theory which allows inferring a finite state machine from crisp examples. However, this is not enough for our purposes. We would like the automaton to generalize.

The problem of learning a finite automaton can be formulated as an optimization task in which an automaton generating all expressions in the training set and a minimal number of others is created. The difficulty is that there is no tractable solution to the problem formulated in this manner. There is a way around it, namely to use a stochastic automaton.

Stochastic finite learning automata were studied in the past [6] and applied to the routing problem in both circuit- and packet-switched networks, task scheduling, process control, etc.

The requirement is that the automaton generates the expressions from the training set with the highest probability. The solution to this problem belongs to the class of EM algorithms (also called clustering). In particular, the Baum-Welch algorithm, known mainly from the theory of Hidden Markov chains, can be used to infer an automaton. There is a one-to-one mapping between the Hidden Markov chain and the stochastic regular automaton.

3 Goals of the reported work

Our work has the following goals:

• To gather practical experience in learning Hidden Markov models from examples.

• To learn a particular Hidden Markov model on both crisp and noisy training sets. Synthetically generated data are to be used in order to have a ground truth.

• To discover the practicalities of the approach for learning observed activities in the COSPAL project or in observing behaviors of humans in video sequences.

• To consider the problem of how to choose the number of hidden states.

• To perform experiments using the Hidden Markov Models toolbox in Matlab [5] which implements methods published in [9].

4 Hidden Markov Models briefly

Let us assume that the reader is familiar with the term Hidden Markov Model as discussed in [7] or [9]. In the experiments reported here, we are going to use decomposable discrete hidden Markov models.

Let an object be characterized by two sequences $X = (x_1, x_2, \ldots, x_n)$ and $K = (k_0, k_1, \ldots, k_n)$. The parameters $k_0, k_1, \ldots, k_n$ are hidden and the parameters $x_1, x_2, \ldots, x_n$ are observable. The sequences $X$ and $K$ are random and assume values from the sets $\mathcal{X}^n$ and $\mathcal{K}^{n+1}$, where $\mathcal{X}$ is the set of all possible values of each observable parameter $x_i$ and $\mathcal{K}$ is the set of values of each hidden parameter $k_i$. The connected subsequence $(x_{i_1}, x_{i_1+1}, \ldots, x_{i_2})$ will be denoted $x_{i_1}^{i_2}$ and the subsequence $(k_{i_1}, k_{i_1+1}, \ldots, k_{i_2})$ will be denoted $k_{i_1}^{i_2}$. The symbol $x_1^n$ means $X$, the symbol $x_i^i$ represents $x_i$, the symbol $k_0^n$ means $K$, and $k_i^i$ represents $k_i$.

The statistical model is determined by the function $\mathcal{X}^n \times \mathcal{K}^{n+1} \rightarrow \mathbb{R}$ which for each sequence $X$ and each sequence $K$ expresses the probability $p(X, K)$. With this probability, we will assume that for each $i = 1, 2, \ldots, n-1$, for each sequence $K = (k_0^{i-1}, k_i, k_{i+1}^n)$, and for each sequence $X = (x_1^i, x_{i+1}^n)$ the following holds

$$p(X, K) = p(k_i)\, p(x_1^i, k_0^{i-1} \mid k_i)\, p(x_{i+1}^n, k_{i+1}^n \mid k_i)\,. \quad (1)$$

This follows from the assumption that the probability $p(K)$ can be expressed in the form

$$p(K) = p(k_i)\, p(k_0^{i-1} \mid k_i)\, p(k_{i+1}^n \mid k_i)\,. \quad (2)$$

A discrete decomposable HMM can be described by three probability matrices:

Transition matrix $A$. The entries $a_{ij} = p(k_j \mid k_i)$ give the probability that a transition from state $k_i$ to state $k_j$ is performed, given that the current state is $k_i$.

Observation matrix $B$. The entries $b_{ij} = p(x_j \mid k_i)$ are the probabilities that the symbol $x_j$ is observed while being in the state $k_i$.

Initial state matrix $\Pi$. The entries $\pi_i = p(k_i)$ give the probability of being in the state $k_i$ at the beginning.

Matrices $A$ and $B$ can be joined into a single three-dimensional algebraic structure (matrix) $C$ and decomposed back into matrices $A$ and $B$ without loss of information if the decomposability condition (2) holds. By this calculation we obtain a three-dimensional structure (matrix)

$$C = \{c_{ijk}\}\,, \quad c_{ijk} = p(x_k, k_j \mid k_i)\,. \quad (3)$$

The entry $c_{ijk}$ is the probability that the transition from state $k_i$ to state $k_j$ was performed and the observed symbol was $x_k$. This matrix can be used for the direct computation of $p(X \mid \lambda)$, as will be seen later.
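For illustration, the following is a minimal MATLAB sketch of this joining and decomposition. It is independent of the toolbox's convert.m, and the numerical values of A and B are just illustrative (they are those of the small two-state model used later in Section 6):

% Join A (|K| x |K|) and B (|K| x |X|) into the 3D matrix C,
% c(i,j,k) = p(x_k, k_j | k_i) = p(x_k | k_j) * p(k_j | k_i).
A = [0.2 0.8; 0.7 0.3];     % a_ij = p(k_j | k_i)
B = [0.7 0.3; 0.4 0.6];     % b_jk = p(x_k | k_j)
[nK, nX] = size(B);

C = zeros(nK, nK, nX);
for k = 1:nX
  C(:,:,k) = A .* repmat(B(:,k)', nK, 1);   % element (i,j): a_ij * b_jk
end

% Decomposition back, possible when condition (2) holds:
A_back = sum(C,3);                                       % recovers A
B_back = squeeze(sum(C,1)) ./ repmat(sum(A,1)', 1, nX);  % recovers B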

5 Three basic problems in Hidden Markov Models and plan for experiments

Consider a training set θ which will be used for training the Hidden Markov model and the related finite automaton. Such a training set consists of many sequences Xi, θ = {Xi}, Xi = x1 x2 ... xn, i = 1, ..., m.


Let θ1 denote a crisp training set which is generated synthetically from a chosen statistical model. Let θ2 be a noisy training set which is created by artificially perturbing the crisp training set θ1.

Let us consider the statistical discrete Hidden Markov model λ = (Π, A, B). Two instances of it will be used. First, λ1 will be the known model (ground truth). The second statistical model λ2 is unknown and should be learned from the training set θ.

Having said that, three basic problems can be formulated:

Problem 1 – Probability of the chosen statistical model from given observations. The observation sequence X = x1 x2 ... xn and the statistical model λ1 are given. The problem is how to calculate p(X|λ1).

Problem 2 – Optimal hidden state sequence from given observations. The observation sequence X = x1 x2 ... xn and the statistical model λ1 are given. The problem is how to choose a corresponding sequence of hidden states which is optimal, K* = argmax_K p(K|X, λ1).

Problem 3 – Finding a statistical model from a given observation sequence. The training set θ of observation sequences Xi = x1 x2 ... xn and a randomly set model λ2 are given. The problem is how to adjust the parameters λ2 = (A2, B2, Π2) to maximize p(θ|λ2).

The solution to these three problems is known. Its treatment can be found in the literature, e.g., in Lecture 8 of our monograph [9] or in the very popular tutorial paper by Rabiner [7]. In our group, the Hidden Markov Models Toolbox was written in Matlab as a diploma thesis by Jan Dupac [5] and was put into the public domain. However, this tool needs some maintenance, especially in the documentation part, to be more useful to third parties.

The core of this report is in the experiments. The main aim was to assess whether the finite automaton, and learning the related Hidden Markov model by the Baum-Welch algorithm from real examples, is suitable for our practical tasks. The first task is to represent observed behavior in the COSPAL project. The second practical task stems from our interest in observing video sequences with a crowd of people and inferring/recognizing their behavior.

We also wanted to familiarize ourselves with the technique on simple experiments, learn how to use J. Dupac's Hidden Markov Model Toolbox in Matlab, and find the computational limits of the technique.

We decided to experiment with synthetic data which are also perturbed by artificial noise. There are two reasons for this choice. The first one is that we wanted to have access to ground truth. The second reason is that we have not yet preprocessed image sequences from practical experimentation which are long enough.

Five experiments were performed. Each of them is reported in one Section below. The first two experiments provide a solution to Problem 1 and Problem 2, which constitutes just preparatory work. More interesting issues from the point of view of the projects we are working on are in Experiments 3-5. There, we experimented with learning the finite automaton from observations.


6 Experiment – Probability of the chosen statistical model from given observations leads to matrix multiplication

Task formulation

Consider the given statistical model λ1 and the observed sequence X = x1 x2 ... xn. The task is to compute the probability p(X|λ1).

Solution, the backward-forward algorithm

Consider as an exercise a very simple hidden Markov model with hidden states K = {1, 2} and the set of observable parameters X = {1, 2}. Let the initial matrix Π, the transition matrix A and the observation matrix B be as follows:

$$\Pi = \begin{bmatrix} 0.2 & 0.8 \end{bmatrix}, \quad \pi_i = p(k_i)$$

$$A = \begin{bmatrix} 0.2 & 0.8 \\ 0.7 & 0.3 \end{bmatrix}, \quad a_{ij} = p(k_j \mid k_i)$$

$$B = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}, \quad b_{ij} = p(x_j \mid k_i)$$

Figure 2: Visual interpretation of a simple HMM

The visual interpretation of this statistical model is in Figure 2. Black circles represent hidden states, white circles correspond to observed parameters. The black circle with label X represents the state from which we make the first transition to the initial state.


Consider all possible observations X = x1 x2 ... xn from the model λ1 for n = 3. The HMM toolbox offers a function prob_x.m which returns the probability of a sequence X given the model λ1. See exp1.m in the Appendix for how to do it with the toolbox.

How many observed sequences of length n = 3 can be drawn from the set X = {1, 2}? It can be seen that there are only 2^3 = 8 possible observations:

X1 = [111]   p(X1|λ1) = 0.1548
X2 = [112]   p(X2|λ1) = 0.1340
X3 = [121]   p(X3|λ1) = 0.1676
X4 = [122]   p(X4|λ1) = 0.1216
X5 = [211]   p(X5|λ1) = 0.1184
X6 = [212]   p(X6|λ1) = 0.1108
X7 = [221]   p(X7|λ1) = 0.1072
X8 = [222]   p(X8|λ1) = 0.0836

As expected, $\sum_{i=1}^{8} p(X_i \mid \lambda_1) = 1$. These probabilities do not tell us anything about the hidden process. Obtaining the most probable hidden sequence given the observed sequence X is another task.

To see that these numbers are correct, we can calculate the probability by hand. There exists a very simple algorithm to do so [9].

First, we need to get one joint matrix C from the matrices A and B, as was already explained in Equation (3),

$$p(x_k, k_j \mid k_i) = p(x_k \mid k_j, k_i) \cdot p(k_j \mid k_i) = p(x_k \mid k_j) \cdot p(k_j \mid k_i)\,.$$

See exp2.m in the Appendix (Section 13) to check how the function convert.m works. The 3D matrix $C = \{c_{ijk}\}$, $c_{ijk} = p(x_k, k_j \mid k_i)$, was obtained:

$$C_1 = p(1, k_j \mid k_i) = \begin{bmatrix} 0.14 & 0.32 \\ 0.49 & 0.12 \end{bmatrix}, \qquad C_2 = p(2, k_j \mid k_i) = \begin{bmatrix} 0.06 & 0.48 \\ 0.21 & 0.18 \end{bmatrix}$$

Now it is easy to calculate the probability of the sequence X3 = [x1 x2 x3] = [121],

$$p(X) = \Pi \left( \prod_{i=1}^{3} C_{x_i} \right) F\,,$$

where $F = [1\ 1]^T$:

$$p(X) = p(121) = \Pi \cdot C_1 \cdot C_2 \cdot C_1 \cdot F = \begin{bmatrix} 0.2 & 0.8 \end{bmatrix} \begin{bmatrix} 0.14 & 0.32 \\ 0.49 & 0.12 \end{bmatrix} \begin{bmatrix} 0.06 & 0.48 \\ 0.21 & 0.18 \end{bmatrix} \begin{bmatrix} 0.14 & 0.32 \\ 0.49 & 0.12 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 0.1676$$
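The same computation takes a few lines of MATLAB. This is a stand-alone sketch using plain matrices rather than the toolbox model structure:

% p(X) by direct matrix multiplication: p(X) = Pi * C_{x1} * C_{x2} * C_{x3} * F
Pi = [0.2 0.8];
C1 = [0.14 0.32; 0.49 0.12];   % p(x=1, k_j | k_i)
C2 = [0.06 0.48; 0.21 0.18];   % p(x=2, k_j | k_i)
C  = cat(3, C1, C2);
F  = [1; 1];

X = [1 2 1];                   % the observed sequence X3 = [121]
p = Pi;
for i = 1:length(X)
  p = p * C(:,:,X(i));         % one step of the recursion
end
p = p * F                      % prints 0.1676, as computed by hand above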


Conclusion

The problem tackled in this Section 6 can be solved by matrix multiplication. In our simple example, we could do it by hand.

This procedure is called the backward-forward algorithm and its computational complexity is |K|^2 × n, where n is the sequence length.

7 Experiment – Optimal sequence of hidden states from given observations

Task formulation

The observation sequence X = x1 x2 ... xn and the statistical model λ1 are given. The task is to find a sequence of hidden states which is optimal, K* = argmax_K p(K|X, λ1).

Solution

Consider the simple model λ1 from the previous Section 6 and the observed sequence X3 = [121]. All possible hidden sequences can be evaluated directly in such a simple example. For X = {1, 2}, K = {1, 2} and n = 3, 16 sequences of hidden states are obtained.

For each of these sequences the probability p(X, K|λ1) is calculated using the function prob_kx.m of the HMM toolbox. The sequence with the highest probability is the one we are looking for.

It can be seen why it does not matter whether we maximize p(X, K|λ1) or p(K|X, λ1):

$$p(K \mid X, \lambda_1) = \frac{p(X, K \mid \lambda_1)}{p(X \mid \lambda_1)} = \frac{p(X, K \mid \lambda_1)}{0.1676}\,.$$

The length of the sequences K is n + 1 because the HMM generates no output in the initial state.

K01 = [1111]   p(X3, K01|λ1) = 0.00024
K02 = [1112]   p(X3, K02|λ1) = 0.00054
K03 = [1121]   p(X3, K03|λ1) = 0.00659
K04 = [1122]   p(X3, K04|λ1) = 0.00161
K05 = [1211]   p(X3, K05|λ1) = 0.00188
K06 = [1212]   p(X3, K06|λ1) = 0.00430
K07 = [1221]   p(X3, K07|λ1) = 0.00564
K08 = [1222]   p(X3, K08|λ1) = 0.00138
K09 = [2111]   p(X3, K09|λ1) = 0.00329
K10 = [2112]   p(X3, K10|λ1) = 0.00753
K11 = [2121]   p(X3, K11|λ1) = 0.09220
K12 = [2122]   p(X3, K12|λ1) = 0.02258
K13 = [2211]   p(X3, K13|λ1) = 0.00282
K14 = [2212]   p(X3, K14|λ1) = 0.00645
K15 = [2221]   p(X3, K15|λ1) = 0.00847
K16 = [2222]   p(X3, K16|λ1) = 0.00207

Look at the sequence K11. It has the highest probability. The HMM toolbox offers the function find_k.m which returns the most probable sequence K* directly. See exp3.m in the Appendix (Section 13) for the source code in MATLAB.
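The table above can also be checked by brute force without the toolbox. The following sketch simply scores every candidate hidden sequence K of length n + 1 against the model, using the convention of this report (the initial state emits no symbol):

% Enumerate all hidden sequences K = [k0 k1 k2 k3] and score p(X, K | lambda1).
Pi = [0.2 0.8];
A  = [0.2 0.8; 0.7 0.3];       % a_ij = p(k_j | k_i)
B  = [0.7 0.3; 0.4 0.6];       % b_ij = p(x_j | k_i)
X  = [1 2 1];                  % observed sequence X3
n  = length(X);

best_p = 0; best_K = [];
for idx = 0 : 2^(n+1)-1
  K = 1 + bitget(idx, n+1:-1:1);              % candidate sequence over {1,2}
  p = Pi(K(1));                               % initial state, no output
  for t = 1:n
    p = p * A(K(t),K(t+1)) * B(K(t+1),X(t));  % transition, then emission
  end
  if p > best_p, best_p = p; best_K = K; end
end
best_K   % [2 1 2 1]
best_p   % 0.0922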


Figure 3: The most probable sequence of hidden states given the observed sequence X3.

$$K^* = \arg\max_K\, p(K \mid X, \lambda_1) = [2121]$$

See Figure 3 to interpret this result. The initial state is 2 (80%); there is no output at this point. Next, the transition to the state 1 (70%) is performed and the symbol 1 is generated at the output (70%). Another transition is to the state 2 (80%) and the symbol 2 (60%) is produced at the output. The last transition is to the state 1 (70%) and the symbol 1 (70%) is generated.

All other hidden sequences given the sequence X = [121] and the statistical model λ1 are less probable than K*.

8 Experiment – Learning a discrete HMM from a crisp training set

Task formulation

Given are the training set θ of observed sequences Xi = x1 x2 ... xn and a randomly initialized model λ2. The task is to adjust the model parameters λ2 = (A2, B2, Π2) to maximize p(θ|λ2).

Solution

This task belongs among the unsupervised learning methods. The training set θ has no labels, as the hidden process is not directly observable. The solution is known under the name of the Baum-Welch [1] algorithm and belongs to the class of EM (Expectation Maximization) algorithms [8], [4].


The Baum-Welch algorithm takes an unlabelled data set and a randomly initialized model λ2 as the input. Learning alternates two steps, which guarantee that p(θ|λ2) increases in each iteration.

However, only a local optimum is found. Consequently, learning depends on the initial values of the model λ2.

Figure 4: Scheme of the experiment on learning from a crisp training set.

Figure 4 shows the basic scheme of this experiment. The model λ1 is known and the training set θ is generated randomly using it. In the HMM toolbox, the function gen_trs.m is used. The generation provides m observations of length n with no labels attached. Such a set of observed sequences consists of instances of a random process described by the statistical model λ1. We start with a randomly set model λ2 and pass it to the Baum-Welch algorithm. After completion, we compare p(θ|λ1) and p(θ|λ2), which give the probability that the training set θ was generated by λ1 and λ2, respectively.
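The generation step itself is conceptually simple. The sketch below samples one sequence from (Π, A, B); it is a hypothetical stand-in for what gen_trs.m does for every sequence of the training set, and it does not use the toolbox data structures:

% Sample one observation sequence of length n from a discrete HMM (Pi, A, B).
% sample_sequence is a hypothetical helper, not part of the HMM toolbox.
function x = sample_sequence(Pi, A, B, n)
  draw = @(p) find(cumsum(p) >= rand, 1);   % draw an index with probabilities p
  k = draw(Pi);                             % initial hidden state, no output
  x = zeros(1,n);
  for t = 1:n
    k    = draw(A(k,:));                    % hidden transition
    x(t) = draw(B(k,:));                    % emit an observable symbol
  end
end

Calling such a sampler m times with the matrices of λ1 yields an unlabelled set of sequences analogous to θ.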

Consider a hidden Markov model with hidden states drawn from the set K = {1, 2, 3} and observable parameters from the set X = {1, 2, 3}. The matrices describing the statistical behavior are as follows:

$$\Pi = \begin{bmatrix} 0.2 & 0.6 & 0.2 \end{bmatrix}, \quad \pi_i = p(k_i)$$

$$A = \begin{bmatrix} 0.2 & 0.7 & 0.1 \\ 0.2 & 0.2 & 0.6 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}, \quad a_{ij} = p(k_j \mid k_i)$$

$$B = \begin{bmatrix} 0.1 & 0.6 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}, \quad b_{ij} = p(x_j \mid k_i)$$

The visual interpretation of the statistical model defined by matrices A and B is provided in Figure 5.

The training set generated by the statistical model λ1 contains m = 100 sequences of length n = 10, e.g.,

X001 = [2133223333]
X002 = [1312332213]
. . .
X100 = [3323322231]

Figure 5: Visual interpretation of the statistical model defined by matrices A and B.

Figure 6: Progress over 100 iterations of the Baum-Welch algorithm; log10 p(θ|λ1) is a constant, log10 p(θ|λ2) grows.

Figure 6 shows the progress of p(θ|λ1) and p(θ|λ2). The weights of the statistical model λ2 are adjusted in the learning process; p(θ|λ2) increases after each iteration. Notice the first step of the Baum-Welch algorithm, in which the change is the biggest of all 100 iterations.

The initial values of the model λ2 are not important at this point. After 100 iterations of the Baum-Welch algorithm, the statistical model λ2 with correctly adjusted matrices is obtained:

$$\Pi = \begin{bmatrix} 0.38 & 0.15 & 0.46 \end{bmatrix}, \quad \pi_i = p(k_i)$$

$$A = \begin{bmatrix} 0.52 & 0.42 & 0.04 \\ 0.32 & 0.52 & 0.14 \\ 0.26 & 0.20 & 0.52 \end{bmatrix}, \quad a_{ij} = p(k_j \mid k_i)$$

$$B = \begin{bmatrix} 0.13 & 0.36 & 0.50 \\ 0.34 & 0.08 & 0.57 \\ 0.01 & 0.60 & 0.38 \end{bmatrix}, \quad b_{ij} = p(x_j \mid k_i)$$

Notice that the values in the matrices describing the statistical model λ2 differ from those of λ1. This is a natural result, as infinitely many matrices correspond to the same statistical behavior. The similarity of statistical models can be seen by comparing the probabilities

log10 p(θ|λ1) = -443.8
log10 p(θ|λ2) = -442.3

It can be seen from these values that the statistical model λ2 matches θ better than λ1 does. Of course, the resulting model λ2 is a realization of a random process and depends strongly on the initial values of λ2.

See exp4.m in the Appendix (Section 13) for the MATLAB code.

Histogram of p(θ|λ2) demonstrates convergence

As was mentioned above, p(θ|λ2) is a random variable depending on the initial conditions of λ2. The histogram of p(θ|λ2) provides some insight into the convergence process.

Figure 7: Scheme of the experiment: Histogram of p(θ|λ2).

Let the same statistical model λ1 be taken as before. A different training set θ is generated. 100 randomly initialized models λ2 were taken and provided to the Baum-Welch algorithm together with the training set θ. The scheme of the experiment is shown in Figure 7.

Figure 8: Histogram of p(θ|λ2) based on 100 models trained by the Baum-Welch algorithm with 0 and 1 iterations.

Figure 8 shows the distribution of p(θ|λ2). The upper part provides the histogram of p(θ|λ2) where the models λ2 are merely initialized. The bottom part of the figure shows p(θ|λ2) where λ2 are adjusted by one iteration of the Baum-Welch algorithm. The black vertical line depicts p(θ|λ1). It is obvious that most of the generated models are worse than the original statistical model λ1. Notice the big change in the probability after the first iteration.

Figure 9 shows the progress of the histograms in more detail. Notice the black vertical line again. All of the models which were learned in 100 iterations have a much bigger probability p(θ|λ2) than p(θ|λ1).

Figure 10 shows the final histogram after 100 iterations in more detail. There are two hills and a valley in the middle. This demonstrates the effect of getting trapped in a local optimum. Most of the trained models ended up in this local optimum. However, some models were able to escape from it. The fixed point of the convergence depends on the initial parameters of the statistical model λ2.

Figure 10 shows that about 5% of all models were able to overcome the local optimum. The obtained statistical models correspond to θ about 10 times better than those trapped in the local optimum.

See exp5.m in the Appendix (Section 13) for the MATLAB code.

Conclusion

The main point of the experiment in this Section was to understand the ambiguity of a statistical model given by the matrices A, B and Π. It was seen that even if the entries in these matrices are different, the model can still represent the same statistical behavior. If the similarity of two statistical models λ1, λ2 has to be studied, then the probabilities p(θ|λ1), p(θ|λ2) can be compared.

Figure 9: Histograms of p(θ|λ2) after 1, 10 and 100 iterations.

The biggest step in the adjustment of the weights occurs in the first iteration of the Baum-Welch algorithm. After 100 iterations the algorithm ends up facing the local optimum problem.

9 Experiment – Learning a discrete HMM from a noisy training set

Task formulation

Let the training set θ1 of observed sequences Xi = x1 x2 ... xn be generated by the statistical model λ1. After noise is added to this set, another training set θ2 is obtained. Actually, this process is repeated several times and perturbed training sets θ2^i are obtained.

The task is to adjust the parameters of the model λ2 = (Π2, A2, B2) to maximize p(θ2^i|λ2). The question is how well this new model corresponds to the original training set θ1.

Figure 10: The local optimum problem; histogram of p(θ|λ2) after 100 iterations.

Solution

Figure 11 shows the basic scheme of this experiment. Given the statistical model λ1, the training set θ1 is generated randomly. The statistical model λ1 has 10 hidden states and 10 observable parameters, so we need to adjust 10 · 10 · 2 + 10 = 210 parameters. The crisp training set θ1 contains 100 sequences of length 10, i.e., m = 100, n = 10. From this crisp training set θ1, new training sets are created by adding noise. In the HMM toolbox, the function crisp.m serves this purpose.

We added 10-40% of noise. This means that the observed symbol is changed randomly in 10-40% of the positions, and new training sets θ2^i are obtained. Each of these training sets is used for training another group of models. Each group contains 20 randomly initialized models. After 100 iterations of the Baum-Welch algorithm, five groups of trained models were obtained. Each of these groups is then compared to the initial crisp data set θ1.
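On a single sequence, the perturbation amounts to the following operation (an illustrative sketch with made-up values; the toolbox function crisp.m listed in the Appendix applies a variant of this to the whole training set):

% Randomly change a fraction 'percent' of the symbols of a sequence x,
% drawing the replacement symbols uniformly from 1..X.
x = [2 1 3 3 2 2 3 3 3 3];      % one observed sequence, symbols from {1,2,3}
X = 3; percent = 0.2;

pos = randperm(length(x));
pos = pos(1 : round(percent*length(x)));    % positions to perturb
x(pos) = floor(rand(1,numel(pos)) * X) + 1; % overwrite with random symbols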

Figure 12 shows the results of this experiment. Line 3 (black circles) shows how probable it is that the initial model λ1 generated the crisp training set, i.e., p(θ1|λ1). As can be seen, this line does not change along the increasing level of noise. Line 4 (white circles) shows p(θ2^i|λ1). There is a drop between 0-20% but after that the line is constant. Line 2 (white squares) is probably the most important. It shows p(θ1|λ2, θ2^i), which means the probability that the training set θ1 was generated by the model λ2 trained on the noisy training set θ2^i. Line 1 (black squares) shows p(θ2^i|λ2, θ2^i). All of these lines represent mean values of the mentioned probability over 20 models.

Figure 11: Scheme of the experiment: Learning an HMM from a noisy training set.

Mean values do not tell enough about the distribution of the probability values. The histogram in Figure 13 represents the progress of p(θ1|λ2, θ2^i) along different levels of noise. The white bars represent p(θ1|λ2, θ1), the darkest bars p(θ1|λ2, θ2^4). Notice that the variance of the histograms rises with an increased level of noise.

Conclusion

The previous figures show that the model λ2 can be learned from noisy data perturbed by up to 20%. The new model λ2 corresponds to the ground-truth data better than the model λ1 does. This phenomenon deserves further study.

10 Experiment – How to choose the number of hidden states?

Task formulation

Let the training set θ1 of observed sequences Xi = x1 x2 ... xn be generated by the known statistical model λ1. After the addition of 10% noise, another training set θ2 is obtained.

Let us randomly initialize 20 models λ2^i, i = 1 ... 20. Each of these models has a different internal structure, ranging from λ2^1: |K| = 1, |X| = 10, to λ2^20: |K| = 20, |X| = 10.

The task is to investigate how these models correspond to the ground-truth training set θ1 and to the perturbed training set θ2 after 100 iterations of the Baum-Welch learning algorithm.

Solution

The scheme of this experiment is shown in Figure 14. Consider a model λ1 with 10 hidden states and 10 observable parameters, |K| = 10, |X| = 10. Generate an unlabelled discrete training set θ1, containing 100 sequences of length 10, from this model using the function gen_trs.m from the HMM toolbox. Add noise to θ1 and obtain θ2.


Figure 12: Mean values of probability. Line 1: p(θ2^i|λ2, θ2^i), line 2: p(θ1|λ2, θ2^i), line 3: p(θ1|λ1), line 4: p(θ2^i|λ1).

Initialize randomly 20 models dissimilar in their internal structure. We are limited to |X| = 10 because our observed sequences contain 10 discrete symbols. However, we can change the number of hidden states |K|.

This group of models was trained in 100 iterations of the Baum-Welch algorithm on the training set θ2. We obtain the probabilities p(θ1|λ2^i) and p(θ2|λ2^i). These values are random variables depending on the initial conditions of the models λ2^i. That is why we repeated the same measurement 5 times and computed the mean values of those probabilities.
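The core of the loop is sketched below, assuming the crisp set trs1 and the noisy set trs2 have already been generated as described above, and using the toolbox functions matrigen, baum_welsh and prob_x in the same way as the scripts in the Appendix (see exp7.m there for the full version):

% For each number of hidden states |K| = 1..20, train a random model on the
% noisy set trs2 and evaluate it on both the noisy and the crisp set.
X = 10; BW = 100;
for k = 1:20
  trs2.max_k = k;               % number of hidden states, as set in exp7.m
  m2.type = 'DHSM';
  m2.pk0    = matrigen(1,k);
  m2.pp.pkk = matrigen(k,k);
  m2.pp.pxk = matrigen(k,X);
  m2 = baum_welsh(m2,trs2,BW);  % 100 Baum-Welch iterations on the noisy data

  logT2(k) = 0; logT1(k) = 0;
  for i = 1:size(trs2.tset,2)
    logT2(k) = logT2(k) + log10(prob_x(m2,trs2.tset{i}.x));  % fit to theta_2
    logT1(k) = logT1(k) + log10(prob_x(m2,trs1.tset{i}.x));  % fit to theta_1
  end
end

Averaging logT1 and logT2 over several runs gives the two curves plotted in Figure 15.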

Figure 15 shows the final result. The line labelled by circles represents the mean value of log10 p(θ2|λ2). It grows as the internal structure gets more complex. Models with a more complex structure have more degrees of freedom and can learn the training set better. This fact is counterbalanced by the second line labelled by crosses. It represents the mean value of log10 p(θ1|λ2), which is the probability that our models trained on the noisy set θ2 generated the crisp training set θ1. As can be seen, this probability does not grow with increasing model complexity.

Figure 15: Likelihood of models with a different number of hidden states.

Figure 16 plots the computation time as a function of growing |K|.

Figure 16: Computation time for 100 iterations of the Baum-Welch algorithm.

Conclusions

This experiment demonstrates the generally valid fact that generalization is lost with an increasing level of complexity. It is interesting that for |K| = 10 no change was observed. In other words, the structure of the original model λ1 is not in principle different from the other structures applied to the model λ2.

Figure 13: Histogram of the probability p(θ1|λ2, θ2^i) for 20 models trained by 100 iterations with noisy data.

11 Conclusions and future work

In the work reported, we have familiarized ourselves with how to learn a finite automaton from examples. This task is equivalent to learning a discrete Hidden Markov model using the Baum-Welch algorithm. It seems that the HMM toolbox contains all the procedures needed to perform the job.

We hope that learning of a finite automaton can be one instantiation of the bidirectional signal-to-symbol interface needed in the COSPAL project.

The experiments performed show that training HMMs (or, equivalently, stochastic finite automata) from crisp examples is feasible. Even when we added up to 20% of noise to the crisp training set, we could train a statistical model with higher similarity to the ground truth as compared to the crisp data. The question is how this phenomenon will work in a practical example.

It is likely that there will be a need to work with continuous hidden Markov models in our COSPAL demonstrator (the shape sorter puzzle). It is known from the literature that discretization or clustering can be used to convert a continuous hidden Markov model to a discrete one. Reversibility will naturally be lost.

The ideas for future work are the following:

• Experiments with real data are needed. In COSPAL, we have to wait until the partners deliver such data to us. We could find another 'real' toy problem and experiment with it.

• Perform experiments for the case in which the observations are vector variables. See how the HMM toolbox is prepared for this eventuality.


• Extend the statistical model within the HMM by a mixture of Gaussians.

Figure 14: Scheme of the experiment checking the influence of the chosen number of hidden states.

12 Acknowledgements

We would like to thank Prof. M.I. Schlesinger from the Ukrainian Academy of Sciences, International Research and Training Centre of Information Technologies and Systems, Kiev, for initial advice. We acknowledge J. Dupac's hints on how to use his HMM toolbox in Matlab for our task.

The authors were supported in this research by the Czech Science Foundation under project GACR 102/03/0440, by the Austrian Ministry of Education under project CONEX GZ 45.535, and by the European Commission under project IST-004176.

References

[1] L.E. Baum, T. Petrie, G. Soules, and M. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–167, 1970.

[2] A.G. Cohn and S.M. Hazarika. Qualitative spatial representation and reasoning: An overview. Fundamenta Informaticae, 46(1-2):2–32, 2001.

[3] S. Coradeschi and A. Saffiotti. An introduction to the anchoring problem. Robotics and Autonomous Systems, 43(2-3):85–96, January 2003.

[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1–38, 1977.

[5] J. Dupac and V. Hlavac. Toolbox for recognition of Markovian objects. In B. Likar, editor, Proceedings of the Computer Vision Winter Workshop, pages 244–254, Ljubljana, Slovenia, February 2001. Slovenian Pattern Recognition Society.

[6] K. Narendra and M.A.L. Thathachar. Learning automata – a survey. IEEE Transactions on Systems, Man, and Cybernetics, SMC-4(4):323–334, July 1974.


[7] L.R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

[8] M.I. Schlesinger. Vzaimosvjaz obuchenija i samoobuchenija v raspoznavaniji obrazov; in Russian (Relation between learning and self-learning in pattern recognition). Kibernetika, (2):81–88, 1968.

[9] M.I. Schlesinger and V. Hlavac. Ten lectures on statistical and structural pattern recognition, volume 24 of Computational Imaging and Vision. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

13 Appendix with the MATLAB code

This section provides listings of the MATLAB source code used in our experiments. However, the source code of J. Dupac's Hidden Markov models toolbox is too extensive to be shown here.

Individual m-files (both functions and scripts) are listed in alphabetical order.

function [model,development] = my_usl(model,trs,BW)
%Learns a Hidden Markov model using the Baum-Welch algorithm
% baum_welsh.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% Computes the adjusted model 2
% =====================================================
% model ... trained model
% trs   ... training set
% BW    ... number of Baum-Welch iterations
%
if BW == 0
  LogL = 0;
  for j = 1 : size(trs.tset,2)
    LogL = LogL + log10(prob_x(model,trs.tset{j}.x));
  end
  development = zeros(1,BW);
else
  i = 1;
  while i <= BW
    [alfa, L] = alfa_kkj(model,trs,0);
    [model]   = alfa2mod(alfa,trs,model.type,0);
    LogL = log10(L);
    development(i) = sum(LogL);
    disp(sprintf('Iteration %3d: %2.3f',i,development(i)));
    i = i + 1;
  end
end

function [cmp] = cmptrs(trs1,trs2)
%Compares 2 training sets. Returns percentage diff. param.
% cmptrs.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% Compares 2 training sets and returns the percentage of
% different observable parameters.

ok = 0; ko = 0;
for j = 1:size(trs1.tset,2)
  for i = 1:length(trs1.tset{1}.x)
    if trs1.tset{j}.x(i) == trs2.tset{j}.x(i)
      ok = ok + 1;
    else
      ko = ko + 1;
    end
  end
end
cmp = ko / (ok + ko);

function [trs] = crisp(trs,percent)
%Adds noise to a training set.
% crisp.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% Adds noise to a training set. It changes randomly
% the observable symbol in 'percent'% of positions.

X = trs.max_x;
for j = 1:size(trs.tset,2)
  for i = 1:double(int8(length(trs.tset{1}.x) * percent))
    inx = double(int8(length(trs.tset{1}.x) * rand)) + 1;
    x = double(int8(rand * X)) + 1;
    trs.tset{j}.x(inx) = x;
  end
end

% Experiment 1, script
% exp1.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% -----------------------------------------------------------------
% Given the observation sequence $X = x_1 x_2 \ldots x_n$ and model
% $\lambda_1$, how do we compute $p(X | \lambda_1)$?

clc; clear all;

K = 2; X = 2;

m1.type = 'DHSM'; m1.name = 'DHSM'; m1.pk0 = [0.2 0.8];
m1.pp.pkk = [0.2 0.8; 0.7 0.3]; m1.pp.pxk = [0.7 0.3; 0.4 0.6];

x1 = [1 1 1]; x2 = [1 1 2]; x3 = [1 2 1]; x4 = [2 1 1];
x5 = [2 1 2]; x6 = [2 2 1]; x7 = [2 2 2]; x8 = [1 2 2];

prob_x(m1,x1)

% Experiment 2
% exp2.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% --------------------------------------------------------------
% Converting matrices A,B into one 3D matrix C

clc; clear all;

K = 2; X = 2;
m1.type = 'DHSM'; m1.name = 'DHSM'; m1.pk0 = [0.2 0.8];
m1.pp.pkk = [0.2 0.8; 0.7 0.3]; m1.pp.pxk = [0.7 0.3; 0.4 0.6];

x1 = [1 1 1]; x2 = [1 1 2]; x3 = [1 2 1]; x4 = [2 1 1];
x5 = [2 1 2]; x6 = [2 2 1]; x7 = [2 2 2]; x8 = [1 2 2];

m2 = convert(m1,'HSM','HSM',1);

A = m1.pp.pkk
B = m1.pp.pxk
C = m2.pp

prob_x(m2,x1)

% Experiment 3
% exp3.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% --------------------------------------------------------------

clc; clear all;

K = 2; X = 2;
m1.type = 'DHSM'; m1.name = 'DHSM'; m1.pk0 = [0.2 0.8];
m1.pp.pkk = [0.2 0.8; 0.7 0.3]; m1.pp.pxk = [0.7 0.3; 0.4 0.6];

x1 = [1 1 1]; x2 = [1 1 2]; x3 = [1 2 1]; x4 = [2 1 1];
x5 = [2 1 2]; x6 = [2 2 1]; x7 = [2 2 2]; x8 = [1 2 2];

k_01 = [1 1 1 1]; k_02 = [1 1 1 2]; k_03 = [1 1 2 1]; k_04 = [1 1 2 2];
k_05 = [1 2 1 1]; k_06 = [1 2 1 2]; k_07 = [1 2 2 1]; k_08 = [1 2 2 2];
k_09 = [2 1 1 1]; k_10 = [2 1 1 2]; k_11 = [2 1 2 1]; k_12 = [2 1 2 2];
k_13 = [2 2 1 1]; k_14 = [2 2 1 2]; k_15 = [2 2 2 1]; k_16 = [2 2 2 2];

mo.x = x3; mo.k = k_11;

disp(sprintf('Probability p(x3,k_11 | m1) = %2.5f',prob_kx(m1,mo)))
disp('-------------------------------------')

[k,pxk] = find_k(m1,x3)

% Experiment 4
% exp4.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% --------------------------------------------------------------
% Given the training set $\theta$ of observation sequences $X_i =
% x_1x_2...x_n$ and randomly set model $\lambda_2$, how do we adjust
% the model parameters $\lambda_2 = (A_2,B_2,\Pi_2)$ to maximize
% $p(\theta | \lambda_2)$?

clc; clear all;

K = 3;    % hidden states
X = 3;    % observed symbols
n = 10;   % length of the sequence
m = 100;  % training set size
BW = 12;  % # of Baum-Welch iterations

model1.type = 'DHSM';          % model 1 setting
model1.name = 'model_1';
model1.pk0 = [0.1 0.6 0.3];
model1.pp.pkk = [0.2 0.7 0.1; 0.2 0.3 0.5; 0.1 0.1 0.8];
model1.pp.pxk = [0.1 0.6 0.3; 0.3 0.4 0.3; 0.2 0.2 0.6];

model2.type = 'DHSM';
model2.name = 'model_2';
model2.pk0 = matrigen(1,K);
model2.pp.pkk = matrigen(K,K);
model2.pp.pxk = matrigen(K,X);

trs = gen_trs(model1,n,m,'xxx');   % training set generation

[model2,development] = baum_welsh(model2,trs,BW);

LogLMod1 = 0;
for i = 1 : m
  LogLMod1 = LogLMod1 + log10(prob_x(model1,trs.tset{i}.x));
end

figure(1); hold on;
plot(ones(1,BW)*LogLMod1,'k');
plot(development,'k');

% Experiment 5
% exp5.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% --------------------------------------------------------------
% Take fixed model1 and generate about 100 different models
% (model2). Watch the difference in probability, that the training
% set (from model1) was generated by model1 and model2.
% Model2 is obtained by 0-100 iterations of the Baum-Welch algorithm.
% Visualize the result by a histogram.

clear all;

BW = [0 1 9 90];
K = 3;        % hidden states
X = 3;        % observed symbols
n = 10;       % sequence length
m = 10;       % training set size
NumMod = 2;   % number of models

model1.type = 'DHSM';
model1.pk0 = [0.1 0.6 0.3];
model1.pp.pkk = [0.2 0.7 0.1; 0.2 0.3 0.5; 0.1 0.1 0.8];
model1.pp.pxk = [0.1 0.6 0.3; 0.3 0.4 0.3; 0.2 0.2 0.6];

trs = gen_trs(model1,n,m,'xxx');

LogLMod1 = 0;
for i = 1 : m
  LogLMod1 = LogLMod1 + log10(prob_x(model1,trs.tset{i}.x));
end

for g = 1:NumMod
  clc;
  model2.type = 'DHSM';
  model2.pk0 = matrigen(1,K);
  model2.pp.pkk = matrigen(K,K);
  model2.pp.pxk = matrigen(K,X);

  for j = 1:size(BW,2)
    [model2,vyvoj] = baum_welsh(model2,trs,BW(j));
    LogLMod2 = 0;
    for i = 1 : m
      LogLMod2 = LogLMod2 + log10(prob_x(model2,trs.tset{i}.x));
    end
    iter(g,j) = LogLMod2;
  end
end

subplot(411); hist(iter(:,1)); title('0 iterations');
subplot(412); hist(iter(:,2)); title('1 iteration');
subplot(413); hist(iter(:,3)); title('10 iterations');
subplot(414); hist(iter(:,4)); title('100 iterations');

% iter(g,j)
% [ values of LogLMod2 after initialization ]
% [ values of LogLMod2 after 1 iteration ]
% [ values of LogLMod2 after 10 iterations ]
% [ values of LogLMod2 after 100 iterations ]

% Experiment 6
% exp6.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% --------------------------------------------------------------
% Learning HMM from noisy training set
% score{exp} ... structure containing results
%                for a certain level of noise
% trs1 ... training set 1 (crisp)
% trs2 ... noisy training set
% noise ... desired level of noise
% realnoise ... real noise level (% of different symbols
%               in all sequences)
% model1 ... original model
% model2 ... array of models trained by trs2
% logTrs1Mod1 ... Likelihood p(trs1 | model1)
% logTrs2Mod1 ... Likelihood p(trs2 | model1)
% logTrs1Mod2 ... Likelihood p(trs1 | model2, trs2)
% logTrs2Mod2 ... Likelihood p(trs2 | model2, trs2)

clear all; clc;

K = 10; X = 10; n = 10; m = 100;
noise = [0 .1 .2 .3 .4];
BW = 100; mod_num = 20;

model1.type = 'DHSM';
model1.pk0 = matrigen(1,K);
model1.pp.pkk = matrigen(K,K);
model1.pp.pxk = matrigen(K,X);
trs1 = gen_trs(model1,n,m,'x');

for noise_level = 1 : size(noise,2)
  trs2 = crisp(trs1,noise(noise_level));
  score{noise_level}.trs1 = trs1;
  score{noise_level}.trs2 = trs2;
  score{noise_level}.noise = noise(noise_level);
  % computes real noise level
  score{noise_level}.realnoise = cmptrs(trs1,trs2);
  score{noise_level}.model1 = model1;

  for i = 1:mod_num
    tic
    model2.type = 'DHSM';
    model2.pk0 = matrigen(1,K);
    model2.pp.pkk = matrigen(K,K);
    model2.pp.pxk = matrigen(K,X);
    [score{noise_level}.model2{i},xxx] = baum_welsh(model2,trs2,BW);
    disp(sprintf('%2d : %2d : %2.2f min left', noise_level, i, ...
        ((5-noise_level)*(mod_num)+(mod_num-i))*toc / 60 ))
  end
end

disp('== L I K E L I H O O D =========================');

minLogL = 0; maxLogL = -1000000;
for i = 1:size(noise,2)
  for j = 1:mod_num
    score{i}.logTrs1Mod1 = 0;
    score{i}.logTrs2Mod1 = 0;
    score{i}.logTrs1Mod2(j) = 0;
    score{i}.logTrs2Mod2(j) = 0;
    for k = 1 : m
      score{i}.logTrs1Mod1 = score{i}.logTrs1Mod1 + ...
          log10(prob_x(score{i}.model1,score{i}.trs1.tset{k}.x));
      score{i}.logTrs2Mod1 = score{i}.logTrs2Mod1 + ...
          log10(prob_x(score{i}.model1,score{i}.trs2.tset{k}.x));
      score{i}.logTrs1Mod2(j) = score{i}.logTrs1Mod2(j) + ...
          log10(prob_x(score{i}.model2{j},score{i}.trs1.tset{k}.x));
      score{i}.logTrs2Mod2(j) = score{i}.logTrs2Mod2(j) + ...
          log10(prob_x(score{i}.model2{j},score{i}.trs2.tset{k}.x));
    end
  end
  if minLogL >= min(score{i}.logTrs1Mod2)
    minLogL = min(score{i}.logTrs1Mod2)
  end
  if maxLogL <= max(score{i}.logTrs1Mod2)
    maxLogL = max(score{i}.logTrs1Mod2)
  end
end

disp('== D I S P L A Y ==============================');

minLogL = floor(minLogL); maxLogL = ceil(maxLogL);
dilek = (maxLogL - minLogL)/20

x = minLogL:dilek:maxLogL;
n = zeros(5,size(x,2));
for i = 1:5
  n(i,:) = hist(score{i}.logTrs1Mod2,x);
  mean_px1mod2(i) = mean(score{i}.logTrs1Mod2);
  mean_px2mod2(i) = mean(score{i}.logTrs2Mod2);
  mean_px2mod1(i) = score{i}.logTrs2Mod1;
  mean_px1mod1(i) = score{i}.logTrs1Mod1;
end

figure(1); bar3(n');
figure(2); hold on;
plot(mean_px1mod2,':k');
plot(mean_px2mod2,'k');
plot(mean_px2mod1,'--k');
plot(mean_px1mod1,'-.k');
title('Mean values of: p(\theta_1 | \lambda_2) (:k), p(\theta_2 | \lambda_2) (k), p(\theta_2 | \lambda_1) (--k), p(\theta_1 | \lambda_1) (-.k)');
xlabel('noise level of \theta_2 [0:0.1:0.4] [%]');
ylabel('mean probability [log]');

% Experiment 7
% exp7.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% Experiment checking the complexity of a model structure. Checks the
% Baum-Welch algorithm for a varying number of hidden parameters.

clear all; clc; close all;

K = 10;        % number of hidden states in model1
X = 10;        % number of observed parameters in model1
seq_len = 10;  % the length of a training sequence
seq_num = 100; % number of training sequences in the set theta
BW = 100;      % number of iterations of the Baum-Welch algorithm
mod_num = 20;  % number of different models

% generates model1 randomly
model1.type = 'DHSM';
model1.pk0 = matrigen(1,K);
model1.pp.pkk = matrigen(K,K);
model1.pp.pxk = matrigen(K,X);

% generates the training set by model1
trs1 = gen_trs(model1,seq_len,seq_num,'x');
trs2 = crisp(trs1,0.1);

% count probability p(trs1 | model1)
pT1M1 = 0;
for i = 1 : seq_num
  pT1M1 = pT1M1 + log10(prob_x(model1,trs1.tset{i}.x));
end

for j = 1:5
  for i = 1:mod_num
    tic
    trs2.max_k = i;
    model2{j}{i}.type = 'DHSM';
    model2{j}{i}.pk0 = matrigen(1,i);
    model2{j}{i}.pp.pkk = matrigen(i,i);
    model2{j}{i}.pp.pxk = matrigen(i,X);

    model2{j}{i} = my_usl(model2{j}{i},trs2,BW);

    pT2M2(j,i) = 0;
    for k = 1 : seq_num
      pT2M2(j,i) = pT2M2(j,i) + ...
          log10(prob_x(model2{j}{i},trs2.tset{k}.x));
    end
    pT1M2(j,i) = 0;
    for k = 1 : seq_num
      pT1M2(j,i) = pT1M2(j,i) + ...
          log10(prob_x(model2{j}{i},trs1.tset{k}.x));
    end
    cas(j,i) = toc;
  end

  hold on;
  figure(1);
  plot(pT2M2(j,:),'ok');
  plot(pT1M2(j,:),'ob');
  figure(2)
  plot(cas(j,:),'ok');
  drawnow;
end

save exp6_model1 model1
save exp6_trs1 trs1
save exp6_trs2 trs2
save exp6_model2 model2

figure(1); plot(mean(pT1M2),'b'); plot(mean(pT2M2),'k');
plot(ones(1,mod_num)*pT1M1,':k');
figure(2); plot(mean(cas),'k');

% Generates a matrix randomly
% matrigen.m
% Learning Finite Automaton
% Z. Kalal, V. Hlavac, Czech Technical University Prague
% [email protected]
% History:
% 26.08.2004 ZK Written.
% 31.12.2004 VH Modified for publication.
% Generates randomly a matrix (m,n) in which the sum of each row = 1.

function matrix = matrigen(m,n)
for i = 1:m
  hlp = rand(1,n);
  matrix(i,:) = hlp / sum(hlp);
end
