Grammatical inference vs grammar induction, London 21-22 June 2007, Colin de la Higuera
TRANSCRIPT
1
Grammatical inference vs Grammar induction
London 21-22 June 2007
Colin de la Higuera
2cdlh
Summary
1. Why study the algorithms and not the grammars
2. Learning in the exact setting
3. Learning in a probabilistic setting
3cdlh
1 Why study the process and not the result?
The usual approach in grammatical inference is to build a grammar (automaton), small and in some way adapted to the data from which we are supposed to learn.
4cdlh
Grammatical inference
Is about learning a grammar given information about a language.
5cdlh
Grammar induction
Is about learning a grammar given information about a language.
6cdlh
Difference?
[Diagram: Data → G. Grammar induction is concerned with the grammar G that comes out; grammatical inference with the process leading from the data to G.]
7cdlh
Motivating* example #1
Is 17 a random number? Is 17 more random than 25? Suppose I had a random number generator: would I convince you by showing how well it does on an example? On various examples?
*(and only slightly provocative)
8cdlh
Motivating example #2
Is 01101101101101010110001111 a random sequence?
What about aaabaaabababaabbba?
9cdlh
Motivating example #3
Let X be a sample of strings. Is grammar G the correct grammar for sample X? Or is it G'?
"Correct" meaning something like "the one we should learn".
10cdlh
Back to the definition
Grammar induction and grammatical inference are about finding a/the grammar from some information about the language.
But once we have done that, what can we say?
11cdlh
What would we like to say?
That the grammar is the smallest, or the best with respect to some score: a combinatorial characterisation.
What we really want to say is that, having solved some complex combinatorial question, we have an Occam / compression / MDL / Kolmogorov-like argument proving that what we have found is of interest.
12cdlh
What else might we like to say?
That in the near future, given some string, we can predict if this string belongs to the language or not.
It would be nice to be able to bet £100 on this.
13cdlh
What else would we like to say?
That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased).
Idea: blame the data, not the algorithm.
14cdlh
Suppose we cannot say anything of the sort?
Then that means that we may be terribly wrong even in a favourable setting.
15cdlh
Motivating example #4
Suppose we have an algorithm that 'learns' a grammar by iteratively applying the following two operations:
merge two non-terminals whenever some nice MDL-like rule holds;
add a new non-terminal and a rule corresponding to a substring when needed.
16cdlh
Two learning operators
Creation of non-terminals and rules:
NP → ART ADJ NOUN
NP → ART ADJ ADJ NOUN
becomes
NP → ART AP1
NP → ART ADJ AP1
AP1 → ADJ NOUN
17cdlh
Merging two non-terminals:
NP → ART AP1
NP → ART AP2
AP1 → ADJ NOUN
AP2 → ADJ AP1
becomes (merging AP2 into AP1)
NP → ART AP1
AP1 → ADJ NOUN
AP1 → ADJ AP1
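A minimal sketch of these two operators, using a toy encoding of my own (a grammar as a list of (lhs, rhs) productions); nothing here is from the talk itself:

```python
# Toy CFG: a list of productions (lhs, rhs), where rhs is a tuple of symbols.

def create_nonterminal(grammar, substring, new_nt):
    """Replace every occurrence of `substring` inside a right-hand side by the
    fresh non-terminal `new_nt`, and add the production new_nt -> substring."""
    k = len(substring)
    new_grammar = []
    for lhs, rhs in grammar:
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + k]) == tuple(substring):
                out.append(new_nt)
                i += k
            else:
                out.append(rhs[i])
                i += 1
        new_grammar.append((lhs, tuple(out)))
    new_grammar.append((new_nt, tuple(substring)))
    return new_grammar

def merge_nonterminals(grammar, keep, drop):
    """Merge non-terminal `drop` into `keep`: rename it everywhere, drop duplicates."""
    ren = lambda s: keep if s == drop else s
    return sorted({(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in grammar})

g = [("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))]
g = create_nonterminal(g, ("ADJ", "NOUN"), "AP1")
# g is now: NP -> ART AP1, NP -> ART ADJ AP1, AP1 -> ADJ NOUN

g2 = [("NP", ("ART", "AP1")), ("NP", ("ART", "AP2")),
      ("AP1", ("ADJ", "NOUN")), ("AP2", ("ADJ", "AP1"))]
g2 = merge_nonterminals(g2, "AP1", "AP2")
# g2 is now: NP -> ART AP1, AP1 -> ADJ NOUN, AP1 -> ADJ AP1
```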
18cdlh
What is bound to happen?
We will learn a context-free grammar that can only generate a regular language.
Brackets are not found.
This is a hidden bias.
19cdlh
But how do we say that a learning algorithm is good?
By accepting the existence of a target.
The question is that of studying the process of finding this target (or something close to this target). This is an inference process.
20cdlh
If you don’t believe there is a target?
Or that the target belongs to another class?
Then you will have to come up with another bias: for example, believing that simplicity (e.g. MDL) is the correct way to handle the question.
21cdlh
If you are prepared to accept there is a target but..
Either the target is known, and then what is the point of learning?
Or we don't know it in the practical case (with this data set), and then it is of no use…
22cdlh
Then you are doing grammar induction.
23cdlh
Careful
Some statements that are dangerous:
Algorithm A can learn {a^n b^n c^n : n ∈ ℕ}.
Algorithm B can learn this rule with just 2 examples.
Looks to me close to wanting a free lunch.
24cdlh
A compromise
You only need to believe there is a target while evaluating the algorithm.
Then, in practice, there may not be one!
25cdlh
End of provocative example
If I run my random number generator and get 999999, I can only keep this number if I believe in the generator itself.
26cdlh
Credo (1)
Grammatical inference is about measuring the convergence of a grammar learning algorithm in a typical situation.
27cdlh
Credo(2)
Typical can be:
In the limit: learning is always achieved, one day.
Probabilistic:
there is a distribution to be used (errors are measurably small);
there is a distribution to be found.
28cdlh
Credo(3)
Complexity theory should be used: the total or update runtime, the size of the data needed, the number of mind changes, the number and weight of errors… should be measured and limited.
29cdlh
2 Non-probabilistic setting
Identification in the limit
Resource-bounded identification in the limit
Active learning (query learning)
30cdlh
Identification in the limit
The definitions, presentations
The alternatives: order-free or not, randomised algorithms
31cdlh
A presentation is
a function f: ℕ → X, where X is any set.
yields: Presentations → Languages
If f(ℕ) = g(ℕ) then yields(f) = yields(g).
34cdlh
Learning function
Given a presentation f, fn is the set of the first n elements in f.
A learning algorithm a is a function that takes as input a set fn = {f(0), …, f(n-1)} and returns a grammar.
Given a grammar G, L(G) is the language generated/recognised/represented by G.
35cdlh
Identification in the limit
A class of languages L, a class of grammars G, presentations Pres ⊆ (ℕ → X).
The naming function yields: Pres → L, with f(ℕ) = g(ℕ) ⟹ yields(f) = yields(g).
The learner a maps finite presentations to grammars, and L(·) maps grammars to languages.
Identification in the limit: ∃n ∈ ℕ : ∀k > n, L(a(fk)) = yields(f).
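As a concrete illustration (my own example, not on the slides): for the class of finite languages and text presentations, the learner that conjectures exactly the set of strings seen so far identifies in the limit; the sketch below shows its hypotheses stabilising on a sample presentation.

```python
# Toy illustration of identification in the limit (finite languages, text presentation).
f = ["b", "aab", "b", "aaaba", "aab", "bbaba", "b", "aab"]     # a presentation f
hypotheses = [frozenset(f[:n]) for n in range(1, len(f) + 1)]  # a(f1), a(f2), ...
# From the 6th prefix onwards the hypothesis equals yields(f) and never changes:
assert all(h == frozenset(f) for h in hypotheses[5:])
```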
36cdlh
What about efficiency?
We can try to bound: global time, update time, errors before converging, mind changes, queries, good examples needed.
37cdlh
What should we try to measure?
The size of G? The size of L? The size of f? The size of fn?
38cdlh
Some candidates for polynomial learning
Total runtime polynomial in ║L║
Update runtime polynomial in ║L║
# mind changes polynomial in ║L║
# implicit prediction errors polynomial in ║L║
Size of characteristic sample polynomial in ║L║
39cdlh
[Diagram: the learner a maps successive prefixes to hypotheses, a(f1) = G1, a(f2) = G2, …, a(fn) = Gn, and for every k > n, a(fk) = Gn: from fn onwards it no longer changes its mind.]
40cdlh
Some selected results (1)
DFA          text  informant
Runtime      no    no
Update-time  no    yes
#IPE         no    no
#MC          no    ?
CS           no    yes
41cdlh
Some selected results (2)
CFG          text  informant
Runtime      no    no
Update-time  no    yes
#IPE         no    no
#MC          no    ?
CS           no    no
42cdlh
Some selected results (3)
Good balls   text  informant
Runtime      no    no
Update-time  yes   yes
#IPE         yes   no
#MC          yes   no
CS           yes   yes
43cdlh
3 Probabilistic setting
Using the distribution to measure error
Identifying the distribution
Approximating the distribution
44cdlh
Probabilistic settings
PAC learning
Identification with probability 1
PAC learning distributions
45cdlh
Learning a language from sampling
We have a distribution over Σ*.
We sample twice: once to learn, once to see how well we have learned.
The PAC setting: Probably Approximately Correct.
46cdlh
PAC learning (Valiant 84, Pitt 89)
L a set of languages, G a set of grammars, ε > 0 and δ > 0, m a maximal length over the strings, n a maximal size of grammars.
47cdlh
Polynomially PAC learnable
There is an algorithm that samples reasonably and, with probability at least 1-δ, returns a grammar that will make at most ε errors.
48cdlh
Results
Under cryptographic assumptions, we cannot PAC learn DFA.
Nor can we PAC learn NFA or CFGs, even with membership queries.
49cdlh
Learning distributions
No error
Small error
50cdlh
No error
This calls for identification in the limit with probability 1.
Means that the probability of not converging is 0.
51cdlh
Results
If the probabilities are computable, we can learn finite state automata with probability 1.
But not with bounded (polynomial) resources.
52cdlh
With error
PAC definition, but the error should be measured by a distance between the target distribution and the hypothesis.
L1, L2, L∞?
53cdlh
Results
Too easy with L∞.
Too hard with L1.
Nice algorithms for biased classes of distributions.
54cdlh
For those who are not convinced there is a difference
55cdlh
Structural completeness
Given a sample and a DFA: each edge is used at least once; each final state accepts at least one string.
Look only at DFA for which the sample is structurally complete!
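A small sketch of how such a check could be coded, under an assumed DFA encoding (initial state, a transition dictionary and a set of final states); this is illustrative, not the talk's definition:

```python
# Check structural completeness of a positive sample for a DFA.
# The DFA is encoded as: an initial state, a transition dict {(state, symbol): state},
# and a set of final states.  (This encoding is an assumption made for illustration.)

def is_structurally_complete(sample, initial, delta, finals):
    used_edges, used_finals = set(), set()
    for word in sample:
        state = initial
        for symbol in word:
            if (state, symbol) not in delta:
                return False              # the word is not even accepted by the DFA
            used_edges.add((state, symbol))
            state = delta[(state, symbol)]
        if state not in finals:
            return False
        used_finals.add(state)
    # Every edge used at least once, every final state ends at least one word.
    return used_edges == set(delta) and used_finals == finals
```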
56cdlh
not structurally complete… X+ = {aab, b, aaaba, bbaba}; add … and abba
[Figure: a DFA over {a, b} for which this sample is not structurally complete.]
57cdlh
Question
Why is the automaton structurally complete for the sample?
And not the sample structurally complete for the automaton?
58cdlh
Some of the many things I have not talked about
Grammatical inference is about new algorithms
Grammatical inference is applied to various fields: pattern recognition, machine translation, computational biology, NLP, software engineering, web mining, robotics…
59cdlh
And
Next ICGI in Brittany in 2008.
Some references in the 1-page abstract, others on the grammatical inference webpage.
60cdlh
Appendix, some technicalities
Size of G, size of L, size of f, runtimes, #MC, #IPE, #CS, PAC.
62cdlh
The size of L
If no grammar system is given: meaningless.
If G is the class of grammars, then ║L║ = min{║G║ : G ∈ G, L(G) = L}.
Example: the size of a regular language, when considering DFA, is the number of states of the minimal DFA that recognises it.
63cdlh
Is a grammar representation reasonable?
Difficult question: typical arguments are that NFA are better than DFA because you can encode more languages with fewer bits.
Yet redundancy is necessary!
64cdlh
Proposal
A grammar class is reasonable if it encodes sufficiently many different languages.
I.e. with n bits you have 2^(n+1) encodings, so optimally you should have 2^(n+1) different languages.
Allow for redundancy and syntactic sugar, so p(2^(n+1)) different languages.
65cdlh
But
We should allow for redundancy and for some strings that do not encode grammars.
Therefore a grammar representation is reasonable if there exists a polynomial p() such that for any n the number of different languages encoded by grammars of size n is at least p(2^n).
66cdlh
The size of a presentation f
Meaningless. Or at least no convincing definition comes up.
But when associated with a learner a we can define the convergence point Cp(f,a) which is the point at which the learner a finds a grammar for the correct language L and does not change its mind.
Cp(f,a) = the smallest n such that ∀m ≥ n, a(fm) = a(fn) and L(a(fn)) = L.
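On a finite prefix of a presentation, Cp can only be checked up to that prefix; a sketch, with assumed helpers `learn` (the learner a) and `lang` (mapping a grammar to its language):

```python
# Sketch of the convergence point Cp(f, a), computed on a finite prefix of f.
def convergence_point(f, learn, lang, target):
    hyps = [learn(f[:m]) for m in range(len(f) + 1)]
    for n in range(len(f) + 1):
        # smallest n with: for all m >= n (within the prefix), a(fm) = a(fn) and L(a(fn)) = L
        if lang(hyps[n]) == target and all(h == hyps[n] for h in hyps[n:]):
            return n
    return None   # the learner has not converged on this prefix
```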
67cdlh
The size of a finite presentation fn
An easy attempt is n.
But then this does not represent the quantity of information we have received to learn.
A better measure is Σ_{i<n} |f(i)|.
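Assuming the prefix is simply a list of strings, this measure is one line of code:

```python
# Size of the finite presentation fn measured as the total length of its elements.
def presentation_size(f, n):
    return sum(len(f[i]) for i in range(n))   # sum of |f(i)| for i < n
```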
68cdlh
Quantities associated with learner a
The update runtime: time needed to update hypothesis hn-1 into hn when presented with f(n).
The complete runtime: the time needed to build hypothesis hn from fn; also the sum of all update runtimes.
69cdlh
Definition 1 (total time)
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), Cp(f,a) ≤ p(║G║)
(or global-runtime(a) ≤ p(║G║)).
70cdlh
Impossible
Just take some presentation that stays useless until the bound is reached and then starts helping.
71cdlh
Definition 2 (update polynomial time)
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), update-runtime(a) ≤ p(║G║).
72cdlh
Doesn’t work
We can just defer identification.
Here we are only measuring the time it takes to build the next hypothesis.
73cdlh
Definition 4: polynomial number of mind changes
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that given any G in G, and given any presentation f such that yields(f)=L(G),
#{i : a(fi) ≠ a(fi+1)} ≤ p(║G║).
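Counting mind changes on a finite prefix is straightforward; a sketch with an assumed `learn` function standing for the learner a:

```python
# Number of mind changes of a learner on a prefix of a presentation f (a list of strings).
def mind_changes(f, learn):
    hyps = [learn(f[:m]) for m in range(1, len(f) + 1)]
    return sum(1 for h1, h2 in zip(hyps, hyps[1:]) if h1 != h2)
```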
74cdlh
Definition 5: polynomial number of implicit prediction errors
Say that G errs on an element x of the presentation if G is incorrect with respect to x (i.e. the algorithm producing G has made an implicit prediction error).
75cdlh
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), #{i : a(fi) errs on f(i+1)} ≤ p(║G║).
76cdlh
Definition 6: polynomial characteristic sample
G has polynomial characteristic samples for identification algorithm a if there exists a polynomial p() such that: given any G in G, there is a correct sample Y for G such that whenever Y ⊆ fn, L(a(fn)) = L(G) and ║Y║ ≤ p(║G║).
77cdlh
3 Probabilistic setting
Using the distribution to measure error
Identifying the distribution
Approximating the distribution
78cdlh
Probabilistic settings
PAC learning
Identification with probability 1
PAC learning distributions
79cdlh
Learning a language from sampling
We have a distribution over Σ*.
We sample twice: once to learn, once to see how well we have learned.
The PAC setting.
80cdlh
How do we consider a finite set?
[Figure: Σ* split into the strings of length ≤ m and the rest; we want PrD(|x| > m) < ε.]
By sampling 1/ε · ln(1/δ) examples we can find a safe m.
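A sketch of this sampling step, where `draw` is an assumed black-box sampler for the unknown distribution D:

```python
import math

def safe_max_length(draw, epsilon, delta):
    """Draw about (1/epsilon) * ln(1/delta) strings from D and return the largest
    length seen; with probability at least 1 - delta the mass of strings longer
    than the returned m is below epsilon."""
    n = math.ceil((1 / epsilon) * math.log(1 / delta))
    return max(len(draw()) for _ in range(n))
```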
81cdlh
PAC learning (Valiant 84, Pitt 89)
L a set of languages, G a set of grammars, ε > 0 and δ > 0, m a maximal length over the strings, n a maximal size of grammars.
82cdlh
H is ε-AC (ε-approximately correct) if
PrD[H(x) ≠ G(x)] < ε
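On the second (testing) sample this error is estimated empirically; a sketch with assumed helpers `draw` (sampler for D) and `G`, `H` (membership functions of the target and the hypothesis):

```python
def empirical_error(draw, G, H, n_test):
    """Estimate Pr_D[H(x) != G(x)] by drawing n_test strings from D."""
    disagreements = sum(G(x) != H(x) for x in (draw() for _ in range(n_test)))
    return disagreements / n_test
```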
83cdlh
[Figure: the languages L(G) and L(H); the errors are the strings on which they disagree.]
Errors: we want L1(D(G), D(H)) < ε.
84cdlh
[Figure: a probabilistic finite automaton over {a, b} with four states and fractional transition probabilities.]
85cdlh
[Figure: the same probabilistic automaton; Pr(abab) is computed as the product of the transition probabilities along the path followed by abab.]
86cdlh
[Figure: a probabilistic finite automaton over {a, b} with transition probabilities such as 0.1, 0.3, 0.35, 0.65, 0.7 and 0.9.]
87cdlh
[Figure: another probabilistic automaton over {a, b} with four states and fractional transition probabilities.]
88cdlh
[Figure: a further probabilistic automaton over {a, b}.]