
Formal Learning Theory: an Introduction

Roberto Bonato

roberto.bonato@labri.fr

LaBRI

Université de Bordeaux I

Università degli Studi di Verona


Contents

Basic notions of Learnability Theory

Gold's model of grammatical inference: identification in the limit

Some representative results of (un)learnability

Learning Categorial Grammars: the RG algorithm


The poverty of the stimulus paradox

How comes it that human beings, whose contacts with the world are brief and personal and limited, are nevertheless able to know as much as they do know?

Bertrand Russell, quoted by Chomsky (1975)

Its theoretical linguistics version

The learning paradox: how do children learn the syntax of their mother tongue (and rather quickly!) given that:

Natural language syntax is very complicated

Not so many examples are provided

Negative examples are of no use


Chomsky's solution

Universal Grammar, defined by Principles

(Binary) parameters that specify any given human language (example: SVO vs. SOV languages)

Categorial analogy: universal rules; only the types in the lexicon are language-specific

... MORE on this later


What is Learnability Theory?

Learnability refers to a set of mathematical models of how a human language can be acquired

Learnability is a constraint on Universal Grammar: the class of human languages must be learnable


Why should we care?

Mathematical precision is a good thing!

Learnability Theory can suggest different theories of Universal Grammar:

If one can show that some theory of UG can result in an unlearnable array of possible languages, that theory must be changed.

We can use learnability to constrain the observed set of languages, not just UG.


The acquisition framework

Innateness

Positive evidence

Learning = setting the parameters of a Universal Grammar


Innateness

A grammar is a finite specification of a language.

Innateness holds that the learner can only acquire certain kinds of grammars and not others.

Some types of language would therefore be impossible.

Positive evidence

In general, children do not learn from correction:

R. Brown and C. Hanlon, "Derivational Complexity and the Order of Acquisition of Child Speech", 1970

Effectively, the input to the learner only includes grammatical sentences:

Steven Pinker, The Language Instinct. Harper, 2000

The learning "algorithm"

The learner has a set of possible grammars to choose from.

The learner is presented with some finite set of sentences.

What grammar does the learner choose?

[Figure: the Chomsky hierarchy, RL ⊂ CFL ⊂ CSL ⊂ REL, with the question "Human Languages?"]


Let us play a game...

I think of a certain set of numbers, e.g. {x : x ≥ 10 and x is even}, and you have to guess it

I'll provide you with an infinite number of clues of the form "the number x belongs to the set", one at a time

After each clue, you make a guess

I will never tell you whether you're right or not

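To make the game concrete, here is a minimal Python sketch (an illustration, not from the slides) for one guessable class of sets, the multiples of a fixed m. The strategy "guess the gcd of all clues seen so far" eventually stabilizes on the right answer, even though the learner is never told it has won; the stream `clues` and the strategy are assumptions of this sketch.

```python
from math import gcd
from itertools import islice

def clues(m=6):
    """An infinite stream of clues enumerating {m, 2m, 3m, ...} in some order."""
    yield 4 * m                  # clues may arrive in any order
    yield 6 * m
    k = 1
    while True:                  # ... and with repetitions
        yield k * m
        k += 1

def play(stream, rounds=8):
    guess = None
    for x in islice(stream, rounds):
        # Guess the smallest class consistent with the clues: multiples of gcd.
        guess = x if guess is None else gcd(guess, x)
        print(f"clue {x:3d} -> guess: multiples of {guess}")

play(clues())                    # the guess reaches 6 by the third clue and never changes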

Some questions

What should count as winning this game?

What happens if I am allowed to select the set of all positive integers?


Who are the players?

SCIENTIFIC INDUCTION: Nature vs. Scientists

FIRST LANGUAGE ACQUISITION: Adults vs. Child


Learning in Gold's Framework

the learner is provided with an infinite stream of examples: s1, s2, …, si, …;

at each step i the learner makes a guess Gi compatible with the examples seen thus far;

the process is infinite:

s1, s2, s3, …, sn, …
G1, G2, G3, …, Gn, …

learning is successful when there is a certain point (even if we don't know which!) after which the guess made by the learner doesn't change and is correct

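A minimal Python sketch (not from the slides) of this protocol for one classically learnable class, the class of all finite languages: the learner's guess is always "the set of examples seen so far", and on any enumeration of a finite target the guess stabilizes and is correct after finitely many steps. The names `learner` and `presentation` are hypothetical.

```python
from itertools import islice

def learner(stream):
    """After each example s_i, emit the guess G_i = set of examples seen so far.

    For the class of all *finite* languages this identifies in the limit:
    once every element of the target has appeared, the guess never
    changes again and equals the target.
    """
    seen = set()
    for s in stream:
        seen.add(s)
        yield frozenset(seen)

def presentation():
    """An infinite enumeration (with repetitions) of the target {2, 3, 5}."""
    yield from [2, 3, 2, 5]
    while True:
        yield 3

guesses = list(islice(learner(presentation()), 8))
print(guesses[-1])        # frozenset({2, 3, 5}) -- stable from the 4th guess on
```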

Grammatical Inference

[Figure: the grammatical-inference setting. Ω is the hypothesis space of grammars and S the space of samples; a target grammar G ∈ Ω determines a language L(G) ⊆ S, sentences l1, l2, …, li are drawn from L(G), and the learning function φ maps each finite sample to a hypothesis G1, G2, …, Gi in Ω.]

More Formally...

Let ⟨Ω, S, L⟩ be a grammar system

let φ : finite sequences of sentences of S → Ω

let ⟨si⟩i∈ℕ = ⟨s0, s1, s2, …⟩ be an infinite sequence of sentences from S

let Gi = φ(⟨s0, …, si⟩)

φ converges to G on ⟨si⟩i∈ℕ if there exists an n ∈ ℕ such that for all i ≥ n, Gi = φ(⟨s0, …, si⟩) is defined and L(Gi) = L(G)


Towards Learnability

Convergence is about a function and a grammar

Learnability is about a (learning) function and classes of grammars

Learnability in the limit

Let ⟨Ω, S, L⟩ be given, and G ⊆ Ω. A learning function φ learns G if:

for every language L ∈ L(G)

for every infinite sequence ⟨si⟩i∈ℕ which enumerates the elements of L (i.e. {si | i ∈ ℕ} = L)

there exists some G ∈ G such that L(G) = L and φ converges to G on ⟨si⟩i∈ℕ


Initial Pessimism

Gold, 1967: A class G of grammars is not learnable if L(G) contains all finite languages and at least one infinite language.

just like regular languages!

and context-free grammars!

and many others...


More Generally: Limit Points

A class L of languages has a limit point if there exists an infinite sequence ⟨Ln⟩n∈ℕ of languages in L such that

L0 ⊂ L1 ⊂ … ⊂ Ln ⊂ …

AND there exists another language L in L such that L = ⋃n∈ℕ Ln

[Figure: the nested chain L0 ⊂ L1 ⊂ … ⊂ Ln ⊂ … inside its union L.]

A class whose languages contain a limit point is not learnable from positive data (see "Summing up" below).

A Renewed Interest

Gold, 1967 (pessimist!): neither regular nor context-free grammars are identifiable in the limit from positive examples.

Angluin, 1980: "pattern" languages are learnable

Shinohara, 1990: more non-trivial classes are learnable; k-rigid context-sensitive grammars are learnable!

Kanazawa, 1998: rigid and k-valued classical categorial grammars are learnable, both from structures and from strings (but that's NP-hard)


Pattern Languages

Σ = {a, b, c, …} is any finite alphabet

Var = {x1, x2, x3, …} is a set of variables

Σ ∩ Var = ∅

a pattern p over Σ is an element of (Σ ∪ Var)+

L(p) = {w | w is obtained from p by replacing variables with non-empty constant strings}

example: L(axbx) = {awbw | w ∈ Σ+}
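As a concrete illustration (not from the slides), here is a small Python sketch deciding membership w ∈ L(p) by backtracking over non-empty bindings for the variables; repeated variables must receive the same string, as in the pattern axbx. The encoding of patterns as tuples is an assumption of this sketch.

```python
def matches(pattern, w, env=None):
    """Decide whether w is in L(pattern).

    pattern is a tuple of symbols: single-letter strings are constants,
    longer names ('x1', 'x2', ...) are variables. Every occurrence of a
    variable must be replaced by the same non-empty string.
    """
    if env is None:
        env = {}
    if not pattern:
        return w == ""
    head, rest = pattern[0], pattern[1:]
    if len(head) == 1:                       # constant: match literally
        return w.startswith(head) and matches(rest, w[1:], env)
    if head in env:                          # bound variable: must repeat
        v = env[head]
        return w.startswith(v) and matches(rest, w[len(v):], env)
    for k in range(1, len(w) + 1):           # try binding each non-empty prefix
        env[head] = w[:k]
        if matches(rest, w[k:], env):
            return True
        del env[head]                        # backtrack
    return False

axbx = ("a", "x1", "b", "x1")
print(matches(axbx, "acbc"))     # True:  x1 = "c"
print(matches(axbx, "aabbab"))   # True:  x1 = "ab"
print(matches(axbx, "ab"))       # False: x1 must be non-empty
```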

Finite Elasticity

A class L of languages is said to have infinite elasticity if there exists an infinite sequence ⟨sn⟩n∈ℕ of sentences and an infinite sequence ⟨Ln⟩n∈ℕ of languages in L such that for all n ∈ ℕ:

sn ∉ Ln

{s0, …, sn} ⊆ Ln+1

A class L of languages is said to have finite elasticity if it doesn't have infinite elasticity
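A quick sanity check (not from the slides): taking sn = n and Ln = {0, …, n−1} witnesses that the class of all finite languages has infinite elasticity. Since that class is nonetheless learnable (see the learner sketch above), finite elasticity is a sufficient condition for learnability, not a necessary one.

```python
# Witnesses for the infinite elasticity of the class of all finite
# languages: s_n = n and L_n = {0, ..., n-1}.
def L(n):
    return set(range(n))

for n in range(100):
    assert n not in L(n)                        # s_n not in L_n
    assert set(range(n + 1)) <= L(n + 1)        # {s_0, ..., s_n} subset of L_{n+1}
print("both infinite-elasticity conditions hold for n = 0..99")
```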

A Theorem by Angluin (1979)

Any class G with finite elasticity is inferable from positive data

The class of pattern languages has finite elasticity, so...

The class of pattern languages is inferable from positive data


Summing up...

L(G) has a limit point ⇒ G is unlearnable ⇒ L(G) has infinite elasticity

L(G) has finite elasticity ⇒ G is learnable

Classical Categorial Grammars

CCGs = typed words + composition rules

G:

loves ↦ (np\s)/np, np\(s/np)
John ↦ np
Mary ↦ np
runs ↦ np\s

Composition rules:

A, A\B ⇒ B [\E]
B/A, A ⇒ B [/E]

[Derivations: "John runs" — runs: np\s combines with John: np by \E to yield s. "John likes Mary" — likes: (np\s)/np combines with Mary: np by /E to yield np\s, which combines with John: np by \E to yield s.]
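A minimal Python sketch (not from the slides) of the two elimination rules plus a brute-force reduction check over adjacent types. Types are atoms or triples, with A\B encoded as (A, '\\', B) and B/A as (B, '/', A); the lexicon follows the slide's derivations, and the helper names are assumptions.

```python
def backward(a, f):
    """\\E: from A and A\\B (argument on the left), derive B."""
    if isinstance(f, tuple) and f[1] == "\\" and f[0] == a:
        return f[2]
    return None

def forward(f, a):
    """/E: from B/A and A (argument on the right), derive B."""
    if isinstance(f, tuple) and f[1] == "/" and f[2] == a:
        return f[0]
    return None

def derives_s(types):
    """Reduce a sequence of types to s by trying /E and \\E on every
    adjacent pair (exhaustive search; fine for toy inputs)."""
    if types == ["s"]:
        return True
    for i in range(len(types) - 1):
        l, r = types[i], types[i + 1]
        for res in (forward(l, r), backward(l, r)):
            if res is not None and derives_s(types[:i] + [res] + types[i + 2:]):
                return True
    return False

lex = {
    "John": "np",
    "Mary": "np",
    "runs": ("np", "\\", "s"),
    "likes": (("np", "\\", "s"), "/", "np"),
}
print(derives_s([lex[w] for w in "John runs".split()]))        # True
print(derives_s([lex[w] for w in "John likes Mary".split()]))  # True
print(derives_s([lex[w] for w in "runs John".split()]))        # False
```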

The RG Algorithm (Buszkowski 1989)

Input: finite sets of functor-argument structures

Output: a rigid categorial grammar that generates them

[Figure: D = two functor-argument structures — "(a man) swims", with (a man) formed by /E and the whole combined by \E, and "(a fish) (swims fast)", with (a fish) formed by /E, (swims fast) formed by \E, and the whole combined by \E.]


RG runs

Assign a type to each node of the structures:

Assign s to each distinct root node

Assign distinct variables to argument nodes

Compute types for the functor nodes

[Worked example on D: both root nodes get s; the argument nodes get fresh variables — (a man): x1, man: x2, (a fish): x3, fish: x4, swims (in "swims fast"): x5 — and the functor types are then computed: a: x1/x2 and x3/x4, swims: x1\s, fast: x5\(x3\s).]

RG: Collecting Types

GF(D):

a ↦ x1/x2, x3/x4
fast ↦ x5\(x3\s)
fish ↦ x4
man ↦ x2
swims ↦ x1\s, x5

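A Python sketch of the typing pass (an illustration, not Buszkowski's original formulation): structures are nested triples marked /E (functor left) or \E (functor right); the root gets s, each argument node a fresh variable, and each functor node the computed type. On the two structures D from the slides it reproduces GF(D) exactly.

```python
from itertools import count

fresh = count(1)

def assign(node, t, lexicon):
    """Propagate type t down a functor-argument structure.

    node is a word (leaf) or a triple (op, left, right) with op '/E'
    (functor on the left) or '\\E' (functor on the right).
    """
    if isinstance(node, str):                    # leaf: record word -> type
        lexicon.setdefault(node, []).append(t)
        return
    op, left, right = node
    x = f"x{next(fresh)}"                        # fresh variable for the argument
    if op == "/E":
        assign(right, x, lexicon)                # argument node
        assign(left, (t, "/", x), lexicon)       # functor gets t/x
    else:
        assign(left, x, lexicon)                 # argument node
        assign(right, (x, "\\", t), lexicon)     # functor gets x\t

def GF(structures):
    """Type every root as s and collect the general-form lexicon."""
    lexicon = {}
    for T in structures:
        assign(T, "s", lexicon)
    return lexicon

# "(a man) swims" and "(a fish) (swims fast)" from the slides.
D = [("\\E", ("/E", "a", "man"), "swims"),
     ("\\E", ("/E", "a", "fish"), ("\\E", "swims", "fast"))]
print(GF(D))
# {'man': ['x2'], 'a': [('x1', '/', 'x2'), ('x3', '/', 'x4')],
#  'swims': [('x1', '\\', 's'), 'x5'], 'fish': ['x4'],
#  'fast': [('x5', '\\', ('x3', '\\', 's'))]}
```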

RG: Unifying Types

a ↦ x1/x2, x3/x4 ⇒ {x3 ↦ x1, x4 ↦ x2}

swims ↦ x1\s, x5 ⇒ {x5 ↦ x1\s}

σ = {x3 ↦ x1, x4 ↦ x2, x5 ↦ x1\s}

RG(D) = σ[GF(D)]:

a ↦ x1/x2
fast ↦ (x1\s)\(x1\s)
fish ↦ x2
man ↦ x2
swims ↦ x1\s
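And a matching sketch of the unification step (a simplification: the occurs check is omitted): all types collected for a word are unified, the resulting substitution σ is applied, and the rigid lexicon RG(D) = σ[GF(D)] falls out. Run on the slide's GF(D), inlined below for self-containment, it yields exactly the slide's result.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("x")

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(t1, t2, s):
    """Extend substitution s so that t1 and t2 become equal (no occurs check)."""
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if is_var(t2):
        s[t2] = t1
        return s
    if is_var(t1):
        s[t1] = t2
        return s
    if isinstance(t1, tuple) and isinstance(t2, tuple) and t1[1] == t2[1]:
        s = unify(t1[0], t2[0], s)
        return None if s is None else unify(t1[2], t2[2], s)
    return None                      # clash: no rigid grammar exists

def subst(t, s):
    t = walk(t, s)
    return (subst(t[0], s), t[1], subst(t[2], s)) if isinstance(t, tuple) else t

def RG(lexicon):
    """Unify every word's types; return the rigid lexicon, or None on failure."""
    s = {}
    for types in lexicon.values():
        for t in types[1:]:
            s = unify(types[0], t, s)
            if s is None:
                return None
    return {w: subst(ts[0], s) for w, ts in lexicon.items()}

GF_D = {"a": [("x1", "/", "x2"), ("x3", "/", "x4")],
        "fast": [("x5", "\\", ("x3", "\\", "s"))],
        "fish": ["x4"], "man": ["x2"],
        "swims": [("x1", "\\", "s"), "x5"]}
print(RG(GF_D))
# {'a': ('x1', '/', 'x2'), 'fast': (('x1', '\\', 's'), '\\', ('x1', '\\', 's')),
#  'fish': 'x2', 'man': 'x2', 'swims': ('x1', '\\', 's')}
```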

Properties of RG

φRG(⟨T0, …, Tn⟩) ⊒ RG({T0, …, Tn})

φRG learns G_rigid from structures

φRG is incremental

φRG can be implemented to run in linear time


Isn't there a contradiction?

Regular languages are not learnable from positive data

Context-free languages are not learnable from positive data

Rigid classical categorial languages are learnable, but they are "transversal" to Chomsky's hierarchy

[Figure: Venn diagram — RCCL cuts across the boundary between RL and CFL in the Chomsky hierarchy.]


Other results

Extensions to k-valued classical categorial grammars converge (Kanazawa 1998)

k-valued categorial grammars are even learnable from strings (Kanazawa 1998), but that's NP-hard (Costa-Florencio 2002)

Rigid Lambek grammars are learnable from structures (Bonato 2000)

Rigid Lambek grammars are not learnable from strings (Le Nir and Foret 2002)

Some classes of regular tree languages are learnable (Marion and Besombes 2001)


Open Issues

Learning WHAT? (CFG, Categorial, Minimalist, …)

Learning FROM WHAT? (sentences, structures, skeletons, …)

Learning HOW? (Gold, PAC, …)

