Grammatical inference vs grammar induction, London 21-22 June 2007, Colin de la Higuera
TRANSCRIPT
1
Grammatical inference vs Grammar induction
London 21-22 June 2007
Colin de la Higuera
2cdlh
Summary
1. Why study the algorithms and not the grammars
2. Learning in the exact setting
3. Learning in a probabilistic setting
3cdlh
1 Why study the process and not the result?
The usual approach in grammatical inference is to build a grammar (automaton), small and in some way adapted to the data from which we are supposed to learn.
4cdlh
Grammatical inference
Is about learning a grammar given information about a language.
5cdlh
Grammar induction
Is about learning a grammar given information about a language.
6cdlh
Difference?
[Diagram: Data → G. Grammar induction is concerned with the grammar G that comes out; grammatical inference with the process leading from the data to G.]
7cdlh
Motivating* example #1
Is 17 a random number? Is 17 more random than 25? Suppose I had a random number generator: would I convince you by showing how well it does on an example? On various examples?
*(and only slightly provocative)
8cdlh
Motivating example #2
Is 01101101101101010110001111 a random sequence?
What about aaabaaabababaabbba?
9cdlh
Motivating example #3
Let X be a sample of strings. Is grammar G the correct grammar for sample X? Or is it G'?
"Correct" meaning something like "the one we should learn".
10cdlh
Back to the definition
Grammar induction and grammatical inference are about finding a/the grammar from some information about the language.
But once we have done that, what can we say?
11cdlh
What would we like to say?
That the grammar is the smallest, or the best with respect to some score: a combinatorial characterisation.
What we really want to say is that, having solved some complex combinatorial question, we have an Occam / compression / MDL / Kolmogorov-like argument proving that what we have found is of interest.
12cdlh
What else might we like to say?
That in the near future, given some string, we can predict if this string belongs to the language or not.
It would be nice to be able to bet £100 on this.
13cdlh
What else would we like to say?
That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased).
Idea: blame the data, not the algorithm.
14cdlh
Suppose we cannot say anything of the sort?
Then that means that we may be terribly wrong even in a favourable setting.
15cdlh
Motivating example #4
Suppose we have an algorithm that 'learns' a grammar by iteratively applying the following two operations:
merge two non-terminals whenever some nice MDL-like rule holds;
add a new non-terminal and a rule corresponding to a substring when needed.
16cdlh
Two learning operators
Creation of non-terminals and rules:
NP → ART ADJ NOUN
NP → ART ADJ ADJ NOUN
becomes
NP → ART AP1
NP → ART ADJ AP1
AP1 → ADJ NOUN
17cdlh
Merging two non-terminals:
NP → ART AP1
NP → ART AP2
AP1 → ADJ NOUN
AP2 → ADJ AP1
becomes (merging AP2 into AP1)
NP → ART AP1
AP1 → ADJ NOUN
AP1 → ADJ AP1
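A minimal sketch of these two operators, using a toy encoding of my own (a grammar as a list of (lhs, rhs) productions); nothing here is from the talk itself:

```python
# Toy CFG: a list of productions (lhs, rhs), where rhs is a tuple of symbols.

def create_nonterminal(grammar, substring, new_nt):
    """Replace every occurrence of `substring` inside a right-hand side by the
    fresh non-terminal `new_nt`, and add the production new_nt -> substring."""
    k = len(substring)
    new_grammar = []
    for lhs, rhs in grammar:
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + k]) == tuple(substring):
                out.append(new_nt)
                i += k
            else:
                out.append(rhs[i])
                i += 1
        new_grammar.append((lhs, tuple(out)))
    new_grammar.append((new_nt, tuple(substring)))
    return new_grammar

def merge_nonterminals(grammar, keep, drop):
    """Merge non-terminal `drop` into `keep`: rename it everywhere, drop duplicates."""
    ren = lambda s: keep if s == drop else s
    return sorted({(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in grammar})

g = [("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))]
g = create_nonterminal(g, ("ADJ", "NOUN"), "AP1")
# g is now: NP -> ART AP1, NP -> ART ADJ AP1, AP1 -> ADJ NOUN

g2 = [("NP", ("ART", "AP1")), ("NP", ("ART", "AP2")),
      ("AP1", ("ADJ", "NOUN")), ("AP2", ("ADJ", "AP1"))]
g2 = merge_nonterminals(g2, "AP1", "AP2")
# g2 is now: NP -> ART AP1, AP1 -> ADJ NOUN, AP1 -> ADJ AP1
```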
18cdlh
What is bound to happen?
We will learn a context-free grammar that can only generate a regular language.
Brackets are not found.
This is a hidden bias.
19cdlh
But how do we say that a learning algorithm is good?
By accepting the existence of a target.
The question is that of studying the process of finding this target (or something close to this target). This is an inference process.
20cdlh
If you don’t believe there is a target?
Or that the target belongs to another class?
Then you will have to come up with another bias: for example, believing that simplicity (e.g. MDL) is the correct way to handle the question.
21cdlh
If you are prepared to accept there is a target but..
Either the target is known, and then what is the point of learning?
Or we don't know it in the practical case (with this data set), and then it is of no use…
22cdlh
Then you are doing grammar induction.
23cdlh
Careful
Some statements that are dangerous:
Algorithm A can learn {a^n b^n c^n : n ∈ ℕ}.
Algorithm B can learn this rule with just 2 examples.
Looks to me close to wanting a free lunch.
24cdlh
A compromise
You only need to believe there is a target while evaluating the algorithm.
Then, in practice, there may not be one!
25cdlh
End of provocative example
If I run my random number generator and get 999999, I can only keep this number if I believe in the generator itself.
26cdlh
Credo (1)
Grammatical inference is about measuring the convergence of a grammar learning algorithm in a typical situation.
27cdlh
Credo(2)
Typical can be:
In the limit: learning is always achieved, one day.
Probabilistic:
there is a distribution to be used (errors are measurably small);
there is a distribution to be found.
28cdlh
Credo(3)
Complexity theory should be used: the total or update runtime, the size of the data needed, the number of mind changes, the number and weight of errors… should be measured and limited.
29cdlh
2 Non-probabilistic setting
Identification in the limit
Resource-bounded identification in the limit
Active learning (query learning)
30cdlh
Identification in the limit
The definitions, presentations
The alternatives: order-free or not, randomised algorithms
31cdlh
A presentation is
a function f: ℕ → X, where X is any set.
yields: Presentations → Languages
If f(ℕ) = g(ℕ) then yields(f) = yields(g).
34cdlh
Learning function
Given a presentation f, fn is the set of the first n elements in f.
A learning algorithm a is a function that takes as input a set fn = {f(0), …, f(n-1)} and returns a grammar.
Given a grammar G, L(G) is the language generated/recognised/represented by G.
35cdlh
Identification in the limit
A class of languages L, a class of grammars G, presentations Pres ⊆ (ℕ → X).
The naming function yields: Pres → L, with f(ℕ) = g(ℕ) ⟹ yields(f) = yields(g).
The learner a maps finite presentations to grammars, and L(·) maps grammars to languages.
Identification in the limit: ∃n ∈ ℕ : ∀k > n, L(a(fk)) = yields(f).
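As a concrete illustration (my own example, not on the slides): for the class of finite languages and text presentations, the learner that conjectures exactly the set of strings seen so far identifies in the limit; the sketch below shows its hypotheses stabilising on a sample presentation.

```python
# Toy illustration of identification in the limit (finite languages, text presentation).
f = ["b", "aab", "b", "aaaba", "aab", "bbaba", "b", "aab"]     # a presentation f
hypotheses = [frozenset(f[:n]) for n in range(1, len(f) + 1)]  # a(f1), a(f2), ...
# From the 6th prefix onwards the hypothesis equals yields(f) and never changes:
assert all(h == frozenset(f) for h in hypotheses[5:])
```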
36cdlh
What about efficiency?
We can try to bound: global time, update time, errors before converging, mind changes, queries, good examples needed.
37cdlh
What should we try to measure?
The size of G? The size of L? The size of f? The size of fn?
38cdlh
Some candidates for polynomial learning
Total runtime polynomial in ║L║
Update runtime polynomial in ║L║
# mind changes polynomial in ║L║
# implicit prediction errors polynomial in ║L║
Size of characteristic sample polynomial in ║L║
39cdlh
[Diagram: the learner a maps successive prefixes to hypotheses, a(f1) = G1, a(f2) = G2, …, a(fn) = Gn, and for every k > n, a(fk) = Gn: from fn onwards it no longer changes its mind.]
40cdlh
Some selected results (1)
DFA          text  informant
Runtime      no    no
Update-time  no    yes
#IPE         no    no
#MC          no    ?
CS           no    yes
41cdlh
Some selected results (2)
CFG          text  informant
Runtime      no    no
Update-time  no    yes
#IPE         no    no
#MC          no    ?
CS           no    no
42cdlh
Some selected results (3)
Good balls   text  informant
Runtime      no    no
Update-time  yes   yes
#IPE         yes   no
#MC          yes   no
CS           yes   yes
43cdlh
3 Probabilistic setting
Using the distribution to measure error
Identifying the distribution
Approximating the distribution
44cdlh
Probabilistic settings
PAC learning
Identification with probability 1
PAC learning distributions
45cdlh
Learning a language from sampling
We have a distribution over Σ*.
We sample twice: once to learn, once to see how well we have learned.
The PAC setting: Probably Approximately Correct.
46cdlh
PAC learning (Valiant 84, Pitt 89)
L a set of languages, G a set of grammars, ε > 0 and δ > 0, m a maximal length over the strings, n a maximal size of grammars.
47cdlh
Polynomially PAC learnable
There is an algorithm that samples reasonably and, with probability at least 1-δ, returns a grammar that will make at most ε errors.
48cdlh
Results
Under cryptographic assumptions, we cannot PAC learn DFA.
Nor can we PAC learn NFA or CFGs, even with membership queries.
49cdlh
Learning distributions
No error
Small error
50cdlh
No error
This calls for identification in the limit with probability 1.
Means that the probability of not converging is 0.
51cdlh
Results
If the probabilities are computable, we can learn finite state automata with probability 1.
But not with bounded (polynomial) resources.
52cdlh
With error
PAC definition, but the error should be measured by a distance between the target distribution and the hypothesis.
L1, L2, L∞?
53cdlh
Results
Too easy with L∞.
Too hard with L1.
Nice algorithms for biased classes of distributions.
54cdlh
For those who are not convinced there is a difference
55cdlh
Structural completeness
Given a sample and a DFA: each edge is used at least once; each final state accepts at least one string.
Look only at DFA for which the sample is structurally complete!
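A small sketch of how such a check could be coded, under an assumed DFA encoding (initial state, a transition dictionary and a set of final states); this is illustrative, not the talk's definition:

```python
# Check structural completeness of a positive sample for a DFA.
# The DFA is encoded as: an initial state, a transition dict {(state, symbol): state},
# and a set of final states.  (This encoding is an assumption made for illustration.)

def is_structurally_complete(sample, initial, delta, finals):
    used_edges, used_finals = set(), set()
    for word in sample:
        state = initial
        for symbol in word:
            if (state, symbol) not in delta:
                return False              # the word is not even accepted by the DFA
            used_edges.add((state, symbol))
            state = delta[(state, symbol)]
        if state not in finals:
            return False
        used_finals.add(state)
    # Every edge used at least once, every final state ends at least one word.
    return used_edges == set(delta) and used_finals == finals
```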
56cdlh
not structurally complete… X+ = {aab, b, aaaba, bbaba}; add … and abba
[Figure: a DFA over {a, b} for which this sample is not structurally complete.]
57cdlh
Question
Why is the automaton structurally complete for the sample?
And not the sample structurally complete for the automaton?
58cdlh
Some of the many things I have not talked about
Grammatical inference is about new algorithms
Grammatical inference is applied to various fields: pattern recognition, machine translation, computational biology, NLP, software engineering, web mining, robotics…
59cdlh
And
Next ICGI in Brittany in 2008.
Some references in the 1-page abstract, others on the grammatical inference webpage.
60cdlh
Appendix, some technicalities
Size of G, size of L, size of f, runtimes, #MC, #IPE, #CS, PAC.
62cdlh
The size of L
If no grammar system is given: meaningless.
If G is the class of grammars, then ║L║ = min{║G║ : G ∈ G, L(G) = L}.
Example: the size of a regular language, when considering DFA, is the number of states of the minimal DFA that recognises it.
63cdlh
Is a grammar representation reasonable?
Difficult question: typical arguments are that NFA are better than DFA because you can encode more languages with fewer bits.
Yet redundancy is necessary!
64cdlh
Proposal
A grammar class is reasonable if it encodes sufficiently many different languages.
I.e. with n bits you have 2^(n+1) encodings, so optimally you should have 2^(n+1) different languages.
Allow for redundancy and syntactic sugar, so p(2^(n+1)) different languages.
65cdlh
But
We should allow for redundancy and for some strings that do not encode grammars.
Therefore a grammar representation is reasonable if there exists a polynomial p() such that for any n the number of different languages encoded by grammars of size n is at least p(2^n).
66cdlh
The size of a presentation f
Meaningless. Or at least no convincing definition comes up.
But when associated with a learner a we can define the convergence point Cp(f,a) which is the point at which the learner a finds a grammar for the correct language L and does not change its mind.
Cp(f,a) = the smallest n such that ∀m ≥ n, a(fm) = a(fn) and L(a(fn)) = L.
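On a finite prefix of a presentation, Cp can only be checked up to that prefix; a sketch, with assumed helpers `learn` (the learner a) and `lang` (mapping a grammar to its language):

```python
# Sketch of the convergence point Cp(f, a), computed on a finite prefix of f.
def convergence_point(f, learn, lang, target):
    hyps = [learn(f[:m]) for m in range(len(f) + 1)]
    for n in range(len(f) + 1):
        # smallest n with: for all m >= n (within the prefix), a(fm) = a(fn) and L(a(fn)) = L
        if lang(hyps[n]) == target and all(h == hyps[n] for h in hyps[n:]):
            return n
    return None   # the learner has not converged on this prefix
```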
67cdlh
The size of a finite presentation fn
An easy attempt is n.
But then this does not represent the quantity of information we have received to learn.
A better measure is Σ_{i<n} |f(i)|.
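Assuming the prefix is simply a list of strings, this measure is one line of code:

```python
# Size of the finite presentation fn measured as the total length of its elements.
def presentation_size(f, n):
    return sum(len(f[i]) for i in range(n))   # sum of |f(i)| for i < n
```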
68cdlh
Quantities associated with learner a
The update runtime: time needed to update hypothesis hn-1 into hn when presented with f(n).
The complete runtime: the time needed to build hypothesis hn from fn; also the sum of all update runtimes.
69cdlh
Definition 1 (total time)
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), Cp(f,a) ≤ p(║G║)
(or global-runtime(a) ≤ p(║G║)).
70cdlh
Impossible
Just take some presentation that stays useless until the bound is reached and then starts helping.
71cdlh
Definition 2 (update polynomial time)
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), update-runtime(a) ≤ p(║G║).
72cdlh
Doesn’t work
We can just defer identification.
Here we are only measuring the time it takes to build the next hypothesis.
73cdlh
Definition 4: polynomial number of mind changes
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that given any G in G, and given any presentation f such that yields(f)=L(G),
#{i : a(fi) ≠ a(fi+1)} ≤ p(║G║).
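Counting mind changes on a finite prefix is straightforward; a sketch with an assumed `learn` function standing for the learner a:

```python
# Number of mind changes of a learner on a prefix of a presentation f (a list of strings).
def mind_changes(f, learn):
    hyps = [learn(f[:m]) for m in range(1, len(f) + 1)]
    return sum(1 for h1, h2 in zip(hyps, hyps[1:]) if h1 != h2)
```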
74cdlh
Definition 5: polynomial number of implicit prediction errors
Say that G errs on an element x of the presentation if G is incorrect with respect to x (i.e. the algorithm producing G has made an implicit prediction error).
75cdlh
G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), #{i : a(fi) errs on f(i+1)} ≤ p(║G║).
76cdlh
Definition 6: polynomial characteristic sample
G has polynomial characteristic samples for identification algorithm a if there exists a polynomial p() such that: given any G in G, there is a correct sample Y for G such that whenever Y ⊆ fn, L(a(fn)) = L(G) and ║Y║ ≤ p(║G║).
77cdlh
3 Probabilistic setting
Using the distribution to measure error
Identifying the distribution
Approximating the distribution
78cdlh
Probabilistic settings
PAC learning
Identification with probability 1
PAC learning distributions
79cdlh
Learning a language from sampling
We have a distribution over Σ*.
We sample twice: once to learn, once to see how well we have learned.
The PAC setting.
80cdlh
How do we consider a finite set?
[Figure: Σ* split into the strings of length ≤ m and the rest; we want PrD(|x| > m) < ε.]
By sampling 1/ε · ln(1/δ) examples we can find a safe m.
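A sketch of this sampling step, where `draw` is an assumed black-box sampler for the unknown distribution D:

```python
import math

def safe_max_length(draw, epsilon, delta):
    """Draw about (1/epsilon) * ln(1/delta) strings from D and return the largest
    length seen; with probability at least 1 - delta the mass of strings longer
    than the returned m is below epsilon."""
    n = math.ceil((1 / epsilon) * math.log(1 / delta))
    return max(len(draw()) for _ in range(n))
```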
81cdlh
PAC learning (Valiant 84, Pitt 89)
L a set of languages, G a set of grammars, ε > 0 and δ > 0, m a maximal length over the strings, n a maximal size of grammars.
82cdlh
H is ε-AC (ε-approximately correct) if
PrD[H(x) ≠ G(x)] < ε
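On the second (testing) sample this error is estimated empirically; a sketch with assumed helpers `draw` (sampler for D) and `G`, `H` (membership functions of the target and the hypothesis):

```python
def empirical_error(draw, G, H, n_test):
    """Estimate Pr_D[H(x) != G(x)] by drawing n_test strings from D."""
    disagreements = sum(G(x) != H(x) for x in (draw() for _ in range(n_test)))
    return disagreements / n_test
```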
83cdlh
[Figure: the languages L(G) and L(H); the errors are the strings on which they disagree.]
Errors: we want L1(D(G), D(H)) < ε.
84cdlh
[Figure: a probabilistic finite automaton over {a, b} with four states and fractional transition probabilities.]
85cdlh
[Figure: the same probabilistic automaton; Pr(abab) is computed as the product of the transition probabilities along the path followed by abab.]
86cdlh
[Figure: a probabilistic finite automaton over {a, b} with transition probabilities such as 0.1, 0.3, 0.35, 0.65, 0.7 and 0.9.]
87cdlh
[Figure: another probabilistic automaton over {a, b} with four states and fractional transition probabilities.]
88cdlh
[Figure: a further probabilistic automaton over {a, b}.]