
Part III: Hierarchical Bayesian Models

[Figure: a hierarchy of linguistic knowledge. At the top, Universal Grammar; below it, a grammar (hierarchical phrase structure grammars, e.g., CFG, HPSG, TAG); below that, the phrase structure of a sentence (a parse tree built from rules such as S → NP VP and NP → Det Adj Noun RelClause); then the utterance; and finally the speech signal.]

The same layered picture appears in other domains:

Vision (Han and Zhu, 2006): principles → structure → data.

Word learning: principles such as the whole-object principle, shape bias, taxonomic principle, contrast principle, and basic-level bias.

Hierarchical Bayesian models

• Can represent and reason about knowledge at multiple levels of abstraction.
• Have been used by statisticians for many years.
• Have been applied to many cognitive problems:
– causal reasoning (Mansinghka et al., 2006)
– language (Chater and Manning, 2006)
– vision (Fei-Fei, Fergus, and Perona, 2003)
– word learning (Kemp, Perfors, and Tenenbaum, 2006)
– decision making (Lee, 2006)

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge

[The same linguistic hierarchy, annotated with the conditional distribution at each level:]

P(grammar | UG)
P(phrase structure | grammar)
P(utterance | phrase structure)
P(speech | utterance)

Hierarchical Bayesian model

[Figure: graphical model with Universal Grammar U at the top; a grammar G drawn from P(G | U); phrase structures s1 … s6 drawn from P(s | G); and utterances u1 … u6 drawn from P(u | s).]


A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:

P({ui}, {si}, G | U) = P({ui} | {si}) P({si} | G) P(G | U)

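To make the factorization concrete, here is a minimal sketch in Python. The two "grammars", the phrase-structure types, and every probability are invented purely to illustrate how the joint decomposes; they are not part of the tutorial.

```python
# Toy three-level hierarchy: P(u, s, G | U) = P(u | s) P(s | G) P(G | U).
# All names and numbers below are invented for illustration.
P_G = {"G1": 0.7, "G2": 0.3}                      # P(G | U)
P_s = {"G1": {"sA": 0.9, "sB": 0.1},              # P(s | G)
       "G2": {"sA": 0.2, "sB": 0.8}}
P_u = {"sA": {"u1": 0.8, "u2": 0.2},              # P(u | s)
       "sB": {"u1": 0.3, "u2": 0.7}}

def joint(u, s, G):
    """P(u, s, G | U) for one configuration of the hierarchy."""
    return P_u[s][u] * P_s[G][s] * P_G[G]

print(joint("u1", "sA", "G1"))   # 0.8 * 0.9 * 0.7 = 0.504
```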

Knowledge at multiple levels

1. Top-down inferences: how does abstract knowledge guide inferences at lower levels?
2. Bottom-up inferences: how can abstract knowledge be acquired?
3. Simultaneous learning at multiple levels of abstraction.


Top-down inferences

Given grammar G and a collection of utterances, construct a phrase structure for each utterance.


Infer {si} given {ui} and G:

P({si} | {ui}, G) ∝ P({ui} | {si}) P({si} | G)


Bottom-up inferences

Given a collection of phrase structures, learn a grammar G.


Infer G given {si} and U:

P(G | {si}, U) ∝ P({si} | G) P(G | U)


Simultaneous learning at multiple levels

Given a set of utterances {ui} and innate knowledge U, construct a grammar G and a phrase structure for each utterance.


• A chicken-or-egg problem:
– Given a grammar, phrase structures can be constructed.
– Given a set of phrase structures, a grammar can be learned.


Infer G and {si} given {ui} and U:

P(G, {si} | {ui}, U) ∝ P({ui} | {si}) P({si} | G) P(G | U)
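In a small discrete hierarchy this joint inference can be done by brute-force enumeration. The sketch below reuses the invented toy numbers from the earlier factorization example; it illustrates the computation, not any model from the tutorial.

```python
# Simultaneous inference over G and s by enumeration:
# P(G, s | u, U) ∝ P(u | s) P(s | G) P(G | U). All numbers invented.
from itertools import product

P_G = {"G1": 0.7, "G2": 0.3}
P_s = {"G1": {"sA": 0.9, "sB": 0.1}, "G2": {"sA": 0.2, "sB": 0.8}}
P_u = {"sA": {"u1": 0.8, "u2": 0.2}, "sB": {"u1": 0.3, "u2": 0.7}}

u = "u2"                                  # the observed utterance
scores = {(G, s): P_u[s][u] * P_s[G][s] * P_G[G]
          for G, s in product(P_G, ["sA", "sB"])}
Z = sum(scores.values())                  # normalizing constant
posterior = {gs: p / Z for gs, p in scores.items()}
print(max(posterior, key=posterior.get))  # jointly most probable (G, s)
```

Observing u2 simultaneously favors the state sB and the grammar G2 that makes sB likely: the chicken-or-egg problem is resolved in one joint posterior.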



Outline

• A high-level view of HBMs

• A case study: Semantic knowledge

Folk Biology

[Figure: the three-level framework. R (principles): the relationships between living kinds are well described by tree-structured representations. S (structure): a tree over species such as mouse, squirrel, chimp, and gorilla. D (data): observed statements such as "Gorillas have hands."]

Folk Biology

R (principles): structural form: tree. S (structure): a tree over the species. D (data): a species-by-feature matrix:

          F1 F2 F3 F4
mouse     ●  ○  ○  ●
squirrel  ●  ○  ○  ?
chimp     ○  ●  ●  ?
gorilla   ○  ●  ●  ?

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Property Induction

R (principles): structural form: tree; stochastic process: diffusion.
S (structure): the tree over the species.
D (data): a partially observed property, e.g., mouse ●, squirrel ?, chimp ?, gorilla ?

Approach: work with the distribution P(D | S, R)

Property Induction

Horses have T4 cells. Elephants have T4 cells.
→ All mammals have T4 cells.

Horses have T4 cells. Seals have T4 cells.
→ All mammals have T4 cells.

Previous approaches: Rips (1975), Osherson et al. (1990), Sloman (1993), Heit (1998)

Bayesian Property Induction

[Figure: a hypothesis space. Each column is one hypothesis about which species (horse, cow, chimp, gorilla, mouse, squirrel, dolphin, seal, rhino, elephant) have the novel property (● = has it, ○ = lacks it), weighted by a prior probability (0.04, 0.01, 0.02, 0.001, 0.04, …).]

Horses have T4 cells. Elephants have T4 cells. (the premises, D)
Cows have T4 cells. (the conclusion, C)

The premises rule out every hypothesis in which horses or elephants lack the property; the probability of the conclusion is the prior weight of the surviving hypotheses in which cows also have it, divided by the total weight of the surviving hypotheses.
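A minimal sketch of this hypothesis-averaging computation, with invented hypotheses and prior weights (the real model derives the prior from a structured representation, as described below):

```python
# Bayesian property induction by hypothesis averaging.
# Hypotheses and prior weights are invented for illustration.
species = ["horse", "cow", "chimp", "gorilla", "mouse",
           "squirrel", "dolphin", "seal", "rhino", "elephant"]

# Each hypothesis: the set of species with the property, plus a prior.
hypotheses = [
    ({"horse", "rhino", "elephant"}, 0.04),
    ({"horse", "cow"}, 0.01),
    ({"chimp", "gorilla"}, 0.02),
    ({"dolphin", "seal"}, 0.001),
    (set(species), 0.04),               # the "all mammals" hypothesis
]

def p_conclusion(premises, conclusion):
    """P(conclusion species has the property | premise species have it)."""
    p_d = sum(w for h, w in hypotheses if premises <= h)
    p_dc = sum(w for h, w in hypotheses if premises | {conclusion} <= h)
    return p_dc / p_d

# "Horses have T4 cells. Elephants have T4 cells. -> Cows have T4 cells."
print(p_conclusion({"horse", "elephant"}, "cow"))   # 0.5 with these weights
```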

Choosing a prior

Chimps have T4 cells. → Gorillas have T4 cells. (taxonomic similarity)
Poodles can bite through wire. → Dobermans can bite through wire. (jaw strength)
Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria. (food web relations)

Bayesian Property Induction

• A challenge: we have to specify the prior, which typically includes many numbers.
• An opportunity: the prior can capture knowledge about the problem.

Biological properties

• Structure: living kinds are organized into a tree.
• Stochastic process: nearby species in the tree tend to share properties.

[Figure: two property assignments over the same tree, one smooth, one not smooth.]

Stochastic process

• Nearby species in the tree tend to share properties.
• In other words, properties tend to be smooth over the tree.


p(y | S, R): Generating a property (Zhu, Lafferty, and Ghahramani, 2003)

[Figure: a continuous value yi at each node of the tree is thresholded to give the binary property, e.g., Horse 0.5 ●, Cow 0.4 ●, Chimp -0.7 ○, Gorilla -0.1 ○, Mouse -0.4 ○, Squirrel -0.4 ○, Dolphin -1.5 ○, Seal -1.5 ○, Rhino -0.1 ○, Elephant -0.1 ○.]

Let yi be the feature value at node i. Draw y from a zero-mean Gaussian whose covariance K encourages y to be smooth over the graph S, then threshold:

fi = Θ(yi), where Θ(yi) is 1 if yi ≥ 0 and 0 otherwise.
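A minimal sketch of this diffusion process, using a graph-Laplacian covariance in the spirit of Zhu, Lafferty, and Ghahramani (2003); the toy graph and the sigma parameterization are assumptions, and the tutorial's exact covariance may differ:

```python
# Sample a property that is smooth over a graph S: draw y from a
# Gaussian whose covariance is built from the graph Laplacian, then
# threshold at zero to get the binary feature.
import numpy as np

# Toy graph S: a chain mouse - squirrel - chimp - gorilla.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

W = np.zeros((n, n))                      # adjacency matrix
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W            # graph Laplacian

sigma = 1.0                               # assumed smoothness parameter
K = np.linalg.inv(L + np.eye(n) / sigma**2)   # covariance: smooth over S

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n), K)   # continuous values y_i
f = (y >= 0).astype(int)                      # thresholded binary property
print(y.round(2), f)
```

Neighboring nodes have highly correlated entries of y, so the thresholded property tends to hold for connected clusters of species rather than arbitrary subsets.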


Results

[Figure: model predictions vs. human judgments (Osherson et al.) for specific arguments such as:

Dolphins have property P. Seals have property P. → Horses have property P.
Cows have property P. Elephants have property P. → Horses have property P.]

Results

Cows have property P.Elephants have property P.Horses have property P.

All mammals have property P.

Gorillas have property P.Mice have property P.Seals have property P.

All mammals have property P.

Model

Human

Spatial model

The same framework with different principles: structural form: 2D space; stochastic process: diffusion. The species are embedded in a two-dimensional space instead of a tree, and properties are assumed to be smooth over that space.

Tree vs. 2D

[Figure: model-human fits for "horse" and "all mammals" arguments, comparing tree + diffusion against 2D + diffusion.]


Three inductive contexts

[Figure: three structures over classes A-G, one per context.]

S and R: tree + diffusion process → "has T4 cells"
S and R: chain + drift process → "can bite through wire"
S and R: network + causal transmission → "carries E. Spirus bacteria"

Threshold properties (Osherson et al.; Blok et al.)

• "can bite through wire"
• "has skin that is more resistant to penetration than most synthetic fibers"

[Figure: animals ordered along a single dimension: Hippo, Cat, Lion, Camel, Elephant; Poodle, Collie, Doberman.]

Threshold properties

• Structure: the categories can be organized along a single dimension.
• Stochastic process: categories towards one end of the dimension are more likely to have the novel property (see the sketch below).
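A hedged sketch of this idea: an underlying value rises, on average, toward one end of the dimension, so categories at that end are more likely to cross the threshold. The linear-mean-plus-noise form and all parameter values here are assumptions; the tutorial's drift process may be specified differently.

```python
# Threshold property over a 1D chain: categories toward one end of the
# dimension are more likely to exceed the threshold.
import numpy as np

positions = np.linspace(-1.0, 1.0, 8)    # categories along the dimension
drift = 1.5                              # assumed drift strength

rng = np.random.default_rng(1)
y = drift * positions + rng.normal(0.0, 0.5, size=positions.shape)
has_property = y >= 0                    # threshold at zero
print(list(zip(positions.round(2), has_property)))
```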

Results: "has skin that is more resistant to penetration than most synthetic fibers" (Blok et al.; Smith et al.)

[Figure: model-human fits comparing 1D + drift against 1D + diffusion.]


Causally transmitted properties (Medin et al.; Shafto and Coley)

Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria.

[Figure: a food web linking salmon to grizzly bear.]

Causally transmitted properties

• Structure: the categories can be organized into a directed network.
• Stochastic process: properties are generated by a noisy transmission process (sketched below).
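A minimal sketch of one plausible noisy-transmission process: each species can be infected from a background source, and infection passes along each prey-to-predator link with some probability. The food web, the parameter values, and the exact propagation scheme are assumptions for illustration.

```python
# Noisy causal transmission over a directed food web.
import numpy as np

species = ["salmon", "trout", "bear", "eagle"]
edges = [(0, 2), (1, 2), (1, 3)]   # prey -> predator (made-up web)

background = 0.1   # chance a species is infected from outside the web
transmit = 0.8     # chance infection passes along a given link

def sample_disease(rng):
    infected = list(rng.random(len(species)) < background)
    # Decide once per link whether it would carry the infection.
    active = [e for e in edges if rng.random() < transmit]
    # Propagate along active links until nothing changes.
    changed = True
    while changed:
        changed = False
        for prey, predator in active:
            if infected[prey] and not infected[predator]:
                infected[predator] = True
                changed = True
    return infected

rng = np.random.default_rng(2)
print(dict(zip(species, sample_disease(rng))))
```

Under this process a disease in salmon makes a disease in bears likely, while a disease in bears says much less about salmon; that directional asymmetry is exactly what the diffusion and drift processes cannot express.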

Experiment: disease properties (Shafto et al.)

[Figure: two food webs, one for an island ecosystem and one for mammals.]

Results: disease properties

[Figure: model-human fits for the island and mammal webs under web + transmission.]



Conclusions: property induction

• Hierarchical Bayesian models help to explain how abstract knowledge can be used for induction

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Structure learning

R (principles): structural form: tree; stochastic process: diffusion.
S (structure): ?
D (data): the species-by-feature matrix.

Goal: find S that maximizes P(S | D, R)

Goal: find S that maximizes

P(S | D, R) ∝ P(D | S, R) P(S | R)

Here P(D | S, R) is the distribution previously used for property induction.

Generating features over the tree

[Figure: each feature column F1-F4 is generated independently by the diffusion process over the tree, together producing the species-by-feature matrix.]


P(S | R): Generating structures

[Figure: graphs over mouse, squirrel, chimp, gorilla. A tree is consistent with R; a graph with cycles is inconsistent with R.]

[Figure: two trees over the same species, one complex (many nodes), one simple (few nodes).]

P(S | R): Generating structures

• Each structure is weighted by the number of nodes it contains:

P(S | R) = 0 if S is inconsistent with R
P(S | R) ∝ θ^|S| otherwise

where |S| is the number of nodes in S and θ < 1, so structures with fewer nodes receive higher prior probability.


Structure Learning

Aim: find S that maximizes P(S | D, R) ∝ P(D | S, R) P(S | R)

P(S | D, R) will be high when:
– The features in D vary smoothly over S.
– S is a simple graph (a graph with few nodes).

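A simplified sketch of this structure scoring. Two simplifications to note: the features are treated as continuous draws from the graph-smooth Gaussian (the full model thresholds them to produce binary features), and the candidate graphs have no latent internal nodes, although the structures in the tutorial do. The theta and sigma values are assumptions.

```python
# Compare candidate structures S by P(S | D, R) ∝ P(D | S, R) P(S | R).
import numpy as np
from scipy.stats import multivariate_normal

def covariance(edges, n, sigma=1.0):
    """Graph-smooth covariance K = (L + I / sigma^2)^-1."""
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.inv(L + np.eye(n) / sigma**2)

def log_score(edges, n, D, theta=0.5):
    K = covariance(edges, n)
    log_lik = sum(multivariate_normal.logpdf(f, np.zeros(n), K)
                  for f in D.T)               # log P(D | S, R)
    return log_lik + n * np.log(theta)        # + log P(S | R)

# Two features over mouse, squirrel, chimp, gorilla; they vary smoothly
# over structures that group {mouse, squirrel} and {chimp, gorilla}.
D = np.array([[ 1.0, -1.0],
              [ 0.8, -0.9],
              [-0.9,  1.1],
              [-1.0,  0.9]])

chain = [(0, 1), (1, 2), (2, 3)]   # mouse-squirrel-chimp-gorilla
star = [(0, 1), (0, 2), (0, 3)]    # everything attached to mouse
print(log_score(chain, 4, D), log_score(star, 4, D))
```

The chain keeps the two similar pairs adjacent, so its smoothness likelihood beats the star's; with candidates of different sizes, the θ^|S| prior would also penalize the larger one.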

Structure learning example (Osherson et al.)

• Participants rated the goodness of 85 features for 48 animals.
• E.g., elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, inactive, vegetation, grazer, oldworld, bush, jungle, ground, timid, smart, group.

[Figure: the animals-by-features matrix, and the tree recovered from it.]

Spatial model

R (principles): structural form: 2D space; stochastic process: diffusion.

[Figure: the 2D embedding recovered for the same animals-by-features data.]

Conclusions: structure learning

• Hierarchical Bayesian models provide a unified framework for the acquisition and use of structured representations

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Learning structural form

[Figure: the same framework, but the structural form itself is now in question. Given feature data for ostrich, robin, crocodile, snake, bat, orangutan, and turtle, which form is best?]

Structural forms

Partition, Order, Chain, Ring, Hierarchy, Tree, Grid, Cylinder

Learning structural form

R (principles): structural form F (could be a tree, 2D space, ring, …); stochastic process: diffusion.
S (structure): ?
D (data): the species-by-feature matrix.

Goal: find S, F that maximize P(S, F | D)

Aim: find S, F that maximize

P(S, F | D) ∝ P(D | S) P(S | F) P(F)

P(F): a uniform distribution on the set of forms.

P(D | S): the distribution used for property induction.

P(S | F): the distribution used for structure learning.

P(S | F): Generating structures from forms

P(S | F) = 0 if S is inconsistent with F
P(S | F) ∝ θ^|S| otherwise, where |S| is the number of nodes in S.

• Simpler forms are preferred: P(S | F) must sum to one over all the structures a form can generate, so a form that generates few structures (e.g., a chain) places more prior mass on each of them than a form that generates many (e.g., a grid).

[Figure: P(S | F) plotted over all possible graph structures S, for Chain vs. Grid.]
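A small self-contained illustration of this preference, using a uniform P(S | F) within each form (a simplification of the θ^|S| weighting) and simply counting labeled structures on four nodes:

```python
# Why simpler forms are preferred: a form that generates fewer
# structures gives each one more prior mass, since P(S | F) sums to 1.
from itertools import permutations

n = 4

def chain_edges(order):
    # Undirected edge set of the chain visiting nodes in this order.
    return frozenset(frozenset(e) for e in zip(order, order[1:]))

chains = {chain_edges(p) for p in permutations(range(n))}
num_trees = n ** (n - 2)     # Cayley's formula: labeled trees on n nodes

print(len(chains), num_trees)            # 12 chains vs 16 trees
print(1 / len(chains), 1 / num_trees)    # prior mass per structure
```

With four nodes there are 12 labeled chains but 16 labeled trees, so a chain-consistent structure starts with more prior mass than a tree-consistent one, and the gap widens quickly as the number of nodes grows.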


Learning structural form

Aim: find S, F that maximize P(S, F | D) ∝ P(D | S) P(S | F) P(F)

• P(S, F | D) will be high when:
– The features in D vary smoothly over S.
– S is a simple graph (a graph with few nodes).
– F is a simple form (a form that can generate only a few structures).


Form learning: results

• Biological data (33 animals, 110 features): the best form is a tree.
• Supreme Court (Spaeth; votes on 1,600 cases, 1987-2005): a chain.
• Color (Ekman): a ring.

[Figures: the structure recovered for each dataset.]

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Where do priors come from?

Each level of the hierarchy supplies the prior for the level below: the stochastic process and structure together provide the prior over properties, and the structural form provides the prior over structures (Partition, Order, Chain, Ring, Hierarchy, Tree, Grid, Cylinder, …).

Where do structural forms come from?

Each form is itself generated by a simple process: a node-replacement graph grammar.

Node-replacement graph grammars

[Figure: the production for the chain form, and a derivation that applies it repeatedly to grow a chain.]

The complete space of grammars

[Figure: the full space of 4096 node-replacement grammars of this family, from grammar 1 to grammar 4096.]

When can we stop adding levels?

• When the knowledge at the top level is simple or general enough that it can be plausibly assumed to be innate.

Conclusions

• Hierarchical Bayesian models provide a unified framework which can:
– Explain how abstract knowledge is used for induction.
– Explain how abstract knowledge can be acquired.

Learning abstract knowledge

Applications of hierarchical Bayesian models at this conference:

1. Semantic knowledge (Schmidt et al.): learning the M-constraint.
2. Syntax (Perfors et al.): learning that language is hierarchically organized.
3. Word learning (Kemp et al.): learning the shape bias.