learning probabilistic finite automata

87
Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Upload: violet

Post on 19-Jan-2016

43 views

Category:

Documents


0 download

DESCRIPTION

Learning probabilistic finite automata. Colin de la Higuera University of Nantes. Acknowledgements. Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini,... - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Learning probabilistic finite automata

Zadar August 2010 1

Learning probabilistic finite automata

Colin de la HigueraUniversity of Nantes

Zadar August 2010 2

Acknowledgements Laurent Miclet Jose Oncina Tim Oates Rafael Carrasco

Paco Casacuberta Reacutemi Eyraud Philippe Ezequel Henning Fernau Thierry Murgue Franck Thollard Enrique Vidal Freacutedeacuteric Tantini

List is necessarily incomplete Excuses to those that have been forgotten

httppagespersolinauniv-nantesfr~cdlhslides

Chapters 5 and 16

Zadar August 2010 3

Outline

1 PFA2 Distances between distributions3 FFA4 Basic elements for learning PFA5 ALERGIA6 MDI and DSAI7 Open questions

Zadar August 2010 4

1 PFA

Probabilistic finite (state) automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisited

The data consists of positive strings laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 2: Learning probabilistic finite automata

Zadar August 2010 2

Acknowledgements Laurent Miclet Jose Oncina Tim Oates Rafael Carrasco

Paco Casacuberta Reacutemi Eyraud Philippe Ezequel Henning Fernau Thierry Murgue Franck Thollard Enrique Vidal Freacutedeacuteric Tantini

List is necessarily incomplete Excuses to those that have been forgotten

httppagespersolinauniv-nantesfr~cdlhslides

Chapters 5 and 16

Zadar August 2010 3

Outline

1 PFA2 Distances between distributions3 FFA4 Basic elements for learning PFA5 ALERGIA6 MDI and DSAI7 Open questions

Zadar August 2010 4

1 PFA

Probabilistic finite (state) automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisited

The data consists of positive strings laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 3: Learning probabilistic finite automata

Zadar August 2010 3

Outline

1 PFA2 Distances between distributions3 FFA4 Basic elements for learning PFA5 ALERGIA6 MDI and DSAI7 Open questions

Zadar August 2010 4

1 PFA

Probabilistic finite (state) automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisited

The data consists of positive strings laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 4: Learning probabilistic finite automata

Zadar August 2010 4

1 PFA

Probabilistic finite (state) automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisited

The data consists of positive strings laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 5: Learning probabilistic finite automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisited

The data consists of positive strings laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 6: Learning probabilistic finite automata

Zadar August 2010 6

The grammar induction problem revisited

The data consists of positive strings laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 7: Learning probabilistic finite automata

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 8: Learning probabilistic finite automata

Zadar August 2010 8

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 9: Learning probabilistic finite automata

Zadar August 2010 9

4

1

3

1

2

1

2

1

2

13

2

a

b

a

b

a

b

4

3

2

1

PrA(abab)=24

1

4

3

3

2

3

1

2

1

2

1

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 10: Learning probabilistic finite automata

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 11: Learning probabilistic finite automata

Zadar August 2010 11

4

1

3

1

2

1

2

1

2

13

2

b

b

a

a

a

b

4

3

2

1

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 12: Learning probabilistic finite automata

Zadar August 2010 12

4

1

3

1

2

1

2

1

2

13

2

b

a

b

4

3

2

1

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 13: Learning probabilistic finite automata

Zadar August 2010 13

How useful are these automata

They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 14: Learning probabilistic finite automata

Zadar August 2010 14

Basic references

The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 15: Learning probabilistic finite automata

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 16: Learning probabilistic finite automata

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01]

FP Q[01]

P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 17: Learning probabilistic finite automata

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i)

i=qi0ai1

qi1ai2

hellipainqin

Pr(i)=IP(qi0)FP(qin

) aij P (qij-1

aijqij

)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 18: Learning probabilistic finite automata

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 19: Learning probabilistic finite automata

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 20: Learning probabilistic finite automata

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1

x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 21: Learning probabilistic finite automata

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 22: Learning probabilistic finite automata

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMMhellip

But the HMM usually define distributions over each n

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 23: Learning probabilistic finite automata

Zadar August 2010 23

4

1

2

1

win draw lose win draw lose win draw lose

2

1

2

1

34

12

14

14

34

14

14

4

14

1

4

1

4

14

1

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 24: Learning probabilistic finite automata

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 25: Learning probabilistic finite automata

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 26: Learning probabilistic finite automata

Zadar August 2010 26

3

1

2

1

2

1

2

1

2

1

3

2a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 27: Learning probabilistic finite automata

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 28: Learning probabilistic finite automata

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 29: Learning probabilistic finite automata

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 30: Learning probabilistic finite automata

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 31: Learning probabilistic finite automata

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 32: Learning probabilistic finite automata

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)

(The probability of being in state qi after having read a1aj)

A[i0]=IP(qi)

A[ij+1]= k|Q|A[kj] P(qkaj+1qi)

Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 33: Learning probabilistic finite automata

Zadar August 2010 33

2 Distances

What forEstimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 34: Learning probabilistic finite automata

Zadar August 2010 34

21 Entropy

How many bits do we need to correct our model

Two distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 35: Learning probabilistic finite automata

Zadar August 2010 35

22 Perplexity

The idea is to allow the computation of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 36: Learning probabilistic finite automata

Zadar August 2010 36

wS PrD(w)-1S

=

1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 37: Learning probabilistic finite automata

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 38: Learning probabilistic finite automata

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 39: Learning probabilistic finite automata

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 40: Learning probabilistic finite automata

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 41: Learning probabilistic finite automata

Zadar August 2010 41

A learning sample

is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 42: Learning probabilistic finite automata

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 43: Learning probabilistic finite automata

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 44: Learning probabilistic finite automata

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27

b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 45: Learning probabilistic finite automata

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 46: Learning probabilistic finite automata

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 47: Learning probabilistic finite automata

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 48: Learning probabilistic finite automata

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4

b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 49: Learning probabilistic finite automata

Zadar August 2010 49

Red Blue and White states

a

a

ab

b

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 50: Learning probabilistic finite automata

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 51: Learning probabilistic finite automata

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 52: Learning probabilistic finite automata

Zadar August 2010 52

Merge and fold

60

9

10

11

64

Then fold

a26

a4

a6

a4a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 53: Learning probabilistic finite automata

Zadar August 2010 53

Merge and fold

60

10

9

10

11

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 54: Learning probabilistic finite automata

Zadar August 2010 54

State merging algorithm

A=FTA(S) Blue =(qIa) a

Red =qI

While Blue dochoose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small

then A = merge_and_fold(Apq) else Red = Red q

Blue = (qa) qRed ndash Red

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 55: Learning probabilistic finite automata

Zadar August 2010 55

The real question

How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 56: Learning probabilistic finite automata

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 57: Learning probabilistic finite automata

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 58: Learning probabilistic finite automata

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similar

a26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 59: Learning probabilistic finite automata

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 60: Learning probabilistic finite automata

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 61: Learning probabilistic finite automata

Zadar August 2010 61

Hoeffding bounds

1

1

n

f

2

2

1

1

n

f

n

f

2

ln2

1

11

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

n

f

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 62: Learning probabilistic finite automata

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 63: Learning probabilistic finite automata

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 64: Learning probabilistic finite automata

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 65: Learning probabilistic finite automata

Zadar August 2010 65

Can we merge and a

Compare and a a and aa b and ab

4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 66: Learning probabilistic finite automata

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 67: Learning probabilistic finite automata

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 68: Learning probabilistic finite automata

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 69: Learning probabilistic finite automata

Zadar August 2010 69

Can we merge and b

Compare and b a and ba b and bb

6601341 and 225340 are different (giving = 0162)

On the other hand

11102

ln2

1

11

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 70: Learning probabilistic finite automata

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 71: Learning probabilistic finite automata

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 72: Learning probabilistic finite automata

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 73: Learning probabilistic finite automata

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 74: Learning probabilistic finite automata

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096

b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 75: Learning probabilistic finite automata

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergiarsquos properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 76: Learning probabilistic finite automata

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 77: Learning probabilistic finite automata

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 78: Learning probabilistic finite automata

Zadar August 2010 78

(much more to DSAI)

D Ron Y Singer and N Tishby On the learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 79: Learning probabilistic finite automata

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic Criterion is does the reduction of the size of

the automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 80: Learning probabilistic finite automata

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 81: Learning probabilistic finite automata

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 82: Learning probabilistic finite automata

Zadar August 2010 82

Appendix

Stern Brocot treesIdentification of probabilities

If we were able to discover the structure how do we identify the

probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 83: Learning probabilistic finite automata

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a

15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 84: Learning probabilistic finite automata

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 85: Learning probabilistic finite automata

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 86: Learning probabilistic finite automata

Zadar August 2010 86

Idea

Instead of returning c(x)n search the Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 87: Learning probabilistic finite automata

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87