
Part III: Hierarchical Bayesian Models

[Figure: a hierarchy of linguistic knowledge. At the top, Universal Grammar; below it, a grammar (hierarchical phrase structure grammars, e.g., CFG, HPSG, TAG); below that, the phrase structure of a sentence (a parse tree built from rules such as S → NP VP and NP → Det Adj Noun RelClause); then the utterance; and finally the speech signal.]

The same layered picture appears in other domains:

Vision (Han and Zhu, 2006): principles → structure → data.

Word learning: principles such as the whole-object principle, shape bias, taxonomic principle, contrast principle, and basic-level bias.

Hierarchical Bayesian models

• Can represent and reason about knowledge at multiple levels of abstraction.
• Have been used by statisticians for many years.
• Have been applied to many cognitive problems:
– causal reasoning (Mansinghka et al., 2006)
– language (Chater and Manning, 2006)
– vision (Fei-Fei, Fergus, and Perona, 2003)
– word learning (Kemp, Perfors, and Tenenbaum, 2006)
– decision making (Lee, 2006)

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge

[The same linguistic hierarchy, annotated with the conditional distribution at each level:]

P(grammar | UG)
P(phrase structure | grammar)
P(utterance | phrase structure)
P(speech | utterance)

Hierarchical Bayesian model

[Figure: graphical model with Universal Grammar U at the top; a grammar G drawn from P(G | U); phrase structures s1 … s6 drawn from P(s | G); and utterances u1 … u6 drawn from P(u | s).]


A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:

P({ui}, {si}, G | U) = P({ui} | {si}) P({si} | G) P(G | U)

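To make the factorization concrete, here is a minimal sketch in Python. The two "grammars", the phrase-structure types, and every probability are invented purely to illustrate how the joint decomposes; they are not part of the tutorial.

```python
# Toy three-level hierarchy: P(u, s, G | U) = P(u | s) P(s | G) P(G | U).
# All names and numbers below are invented for illustration.
P_G = {"G1": 0.7, "G2": 0.3}                      # P(G | U)
P_s = {"G1": {"sA": 0.9, "sB": 0.1},              # P(s | G)
       "G2": {"sA": 0.2, "sB": 0.8}}
P_u = {"sA": {"u1": 0.8, "u2": 0.2},              # P(u | s)
       "sB": {"u1": 0.3, "u2": 0.7}}

def joint(u, s, G):
    """P(u, s, G | U) for one configuration of the hierarchy."""
    return P_u[s][u] * P_s[G][s] * P_G[G]

print(joint("u1", "sA", "G1"))   # 0.8 * 0.9 * 0.7 = 0.504
```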

Knowledge at multiple levels

1. Top-down inferences: how does abstract knowledge guide inferences at lower levels?
2. Bottom-up inferences: how can abstract knowledge be acquired?
3. Simultaneous learning at multiple levels of abstraction.


Top-down inferences

Given grammar G and a collection of utterances, construct a phrase structure for each utterance.


Infer {si} given {ui} and G:

P({si} | {ui}, G) ∝ P({ui} | {si}) P({si} | G)


Bottom-up inferences

Given a collection of phrase structures, learn a grammar G.


Infer G given {si} and U:

P(G | {si}, U) ∝ P({si} | G) P(G | U)


Simultaneous learning at multiple levels

Given a set of utterances {ui} and innate knowledge U, construct a grammar G and a phrase structure for each utterance.


• A chicken-or-egg problem:
– Given a grammar, phrase structures can be constructed.
– Given a set of phrase structures, a grammar can be learned.


Infer G and {si} given {ui} and U:

P(G, {si} | {ui}, U) ∝ P({ui} | {si}) P({si} | G) P(G | U)
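In a small discrete hierarchy this joint inference can be done by brute-force enumeration. The sketch below reuses the invented toy numbers from the earlier factorization example; it illustrates the computation, not any model from the tutorial.

```python
# Simultaneous inference over G and s by enumeration:
# P(G, s | u, U) ∝ P(u | s) P(s | G) P(G | U). All numbers invented.
from itertools import product

P_G = {"G1": 0.7, "G2": 0.3}
P_s = {"G1": {"sA": 0.9, "sB": 0.1}, "G2": {"sA": 0.2, "sB": 0.8}}
P_u = {"sA": {"u1": 0.8, "u2": 0.2}, "sB": {"u1": 0.3, "u2": 0.7}}

u = "u2"                                  # the observed utterance
scores = {(G, s): P_u[s][u] * P_s[G][s] * P_G[G]
          for G, s in product(P_G, ["sA", "sB"])}
Z = sum(scores.values())                  # normalizing constant
posterior = {gs: p / Z for gs, p in scores.items()}
print(max(posterior, key=posterior.get))  # jointly most probable (G, s)
```

Observing u2 simultaneously favors the state sB and the grammar G2 that makes sB likely: the chicken-or-egg problem is resolved in one joint posterior.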



Outline

• A high-level view of HBMs

• A case study: Semantic knowledge

Folk Biology

[Figure: the three-level framework. R (principles): the relationships between living kinds are well described by tree-structured representations. S (structure): a tree over species such as mouse, squirrel, chimp, and gorilla. D (data): observed statements such as "Gorillas have hands."]

Folk Biology

R (principles): structural form: tree. S (structure): a tree over the species. D (data): a species-by-feature matrix:

          F1 F2 F3 F4
mouse     ●  ○  ○  ●
squirrel  ●  ○  ○  ?
chimp     ○  ●  ●  ?
gorilla   ○  ●  ●  ?

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Property Induction

R (principles): structural form: tree; stochastic process: diffusion.
S (structure): the tree over the species.
D (data): a partially observed property, e.g., mouse ●, squirrel ?, chimp ?, gorilla ?

Approach: work with the distribution P(D | S, R)

Property Induction

Horses have T4 cells. Elephants have T4 cells.
→ All mammals have T4 cells.

Horses have T4 cells. Seals have T4 cells.
→ All mammals have T4 cells.

Previous approaches: Rips (1975), Osherson et al. (1990), Sloman (1993), Heit (1998)

Bayesian Property Induction

[Figure: a hypothesis space. Each column is one hypothesis about which species (horse, cow, chimp, gorilla, mouse, squirrel, dolphin, seal, rhino, elephant) have the novel property (● = has it, ○ = lacks it), weighted by a prior probability (0.04, 0.01, 0.02, 0.001, 0.04, …).]

Horses have T4 cells. Elephants have T4 cells. (the premises, D)
Cows have T4 cells. (the conclusion, C)

The premises rule out every hypothesis in which horses or elephants lack the property; the probability of the conclusion is the prior weight of the surviving hypotheses in which cows also have it, divided by the total weight of the surviving hypotheses.
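A minimal sketch of this hypothesis-averaging computation, with invented hypotheses and prior weights (the real model derives the prior from a structured representation, as described below):

```python
# Bayesian property induction by hypothesis averaging.
# Hypotheses and prior weights are invented for illustration.
species = ["horse", "cow", "chimp", "gorilla", "mouse",
           "squirrel", "dolphin", "seal", "rhino", "elephant"]

# Each hypothesis: the set of species with the property, plus a prior.
hypotheses = [
    ({"horse", "rhino", "elephant"}, 0.04),
    ({"horse", "cow"}, 0.01),
    ({"chimp", "gorilla"}, 0.02),
    ({"dolphin", "seal"}, 0.001),
    (set(species), 0.04),               # the "all mammals" hypothesis
]

def p_conclusion(premises, conclusion):
    """P(conclusion species has the property | premise species have it)."""
    p_d = sum(w for h, w in hypotheses if premises <= h)
    p_dc = sum(w for h, w in hypotheses if premises | {conclusion} <= h)
    return p_dc / p_d

# "Horses have T4 cells. Elephants have T4 cells. -> Cows have T4 cells."
print(p_conclusion({"horse", "elephant"}, "cow"))   # 0.5 with these weights
```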

Choosing a prior

Chimps have T4 cells. → Gorillas have T4 cells. (taxonomic similarity)
Poodles can bite through wire. → Dobermans can bite through wire. (jaw strength)
Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria. (food web relations)

Bayesian Property Induction

• A challenge: we have to specify the prior, which typically includes many numbers.
• An opportunity: the prior can capture knowledge about the problem.

Biological properties

• Structure: living kinds are organized into a tree.
• Stochastic process: nearby species in the tree tend to share properties.

[Figure: two property assignments over the same tree, one smooth, one not smooth.]

Stochastic process

• Nearby species in the tree tend to share properties.
• In other words, properties tend to be smooth over the tree.


p(y | S, R): Generating a property (Zhu, Lafferty, and Ghahramani, 2003)

[Figure: a continuous value yi at each node of the tree is thresholded to give the binary property, e.g., Horse 0.5 ●, Cow 0.4 ●, Chimp -0.7 ○, Gorilla -0.1 ○, Mouse -0.4 ○, Squirrel -0.4 ○, Dolphin -1.5 ○, Seal -1.5 ○, Rhino -0.1 ○, Elephant -0.1 ○.]

Let yi be the feature value at node i. Draw y from a zero-mean Gaussian whose covariance K encourages y to be smooth over the graph S, then threshold:

fi = Θ(yi), where Θ(yi) is 1 if yi ≥ 0 and 0 otherwise.
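A minimal sketch of this diffusion process, using a graph-Laplacian covariance in the spirit of Zhu, Lafferty, and Ghahramani (2003); the toy graph and the sigma parameterization are assumptions, and the tutorial's exact covariance may differ:

```python
# Sample a property that is smooth over a graph S: draw y from a
# Gaussian whose covariance is built from the graph Laplacian, then
# threshold at zero to get the binary feature.
import numpy as np

# Toy graph S: a chain mouse - squirrel - chimp - gorilla.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

W = np.zeros((n, n))                      # adjacency matrix
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W            # graph Laplacian

sigma = 1.0                               # assumed smoothness parameter
K = np.linalg.inv(L + np.eye(n) / sigma**2)   # covariance: smooth over S

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n), K)   # continuous values y_i
f = (y >= 0).astype(int)                      # thresholded binary property
print(y.round(2), f)
```

Neighboring nodes have highly correlated entries of y, so the thresholded property tends to hold for connected clusters of species rather than arbitrary subsets.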


Results

[Figure: model predictions vs. human judgments (Osherson et al.) for specific arguments such as:

Dolphins have property P. Seals have property P. → Horses have property P.
Cows have property P. Elephants have property P. → Horses have property P.]

Results

Cows have property P.Elephants have property P.Horses have property P.

All mammals have property P.

Gorillas have property P.Mice have property P.Seals have property P.

All mammals have property P.

Model

Human

Spatial model

The same framework with different principles: structural form: 2D space; stochastic process: diffusion. The species are embedded in a two-dimensional space instead of a tree, and properties are assumed to be smooth over that space.

Tree vs. 2D

[Figure: model-human fits for "horse" and "all mammals" arguments, comparing tree + diffusion against 2D + diffusion.]


Three inductive contexts

[Figure: three structures over classes A-G, one per context.]

S and R: tree + diffusion process → "has T4 cells"
S and R: chain + drift process → "can bite through wire"
S and R: network + causal transmission → "carries E. Spirus bacteria"

Threshold properties (Osherson et al.; Blok et al.)

• "can bite through wire"
• "has skin that is more resistant to penetration than most synthetic fibers"

[Figure: animals ordered along a single dimension: Hippo, Cat, Lion, Camel, Elephant; Poodle, Collie, Doberman.]

Threshold properties

• Structure: the categories can be organized along a single dimension.
• Stochastic process: categories towards one end of the dimension are more likely to have the novel property (see the sketch below).
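A hedged sketch of this idea: an underlying value rises, on average, toward one end of the dimension, so categories at that end are more likely to cross the threshold. The linear-mean-plus-noise form and all parameter values here are assumptions; the tutorial's drift process may be specified differently.

```python
# Threshold property over a 1D chain: categories toward one end of the
# dimension are more likely to exceed the threshold.
import numpy as np

positions = np.linspace(-1.0, 1.0, 8)    # categories along the dimension
drift = 1.5                              # assumed drift strength

rng = np.random.default_rng(1)
y = drift * positions + rng.normal(0.0, 0.5, size=positions.shape)
has_property = y >= 0                    # threshold at zero
print(list(zip(positions.round(2), has_property)))
```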

Results: "has skin that is more resistant to penetration than most synthetic fibers" (Blok et al.; Smith et al.)

[Figure: model-human fits comparing 1D + drift against 1D + diffusion.]


Causally transmitted properties (Medin et al.; Shafto and Coley)

Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria.

[Figure: a food web linking salmon to grizzly bear.]

Causally transmitted properties

• Structure: the categories can be organized into a directed network.
• Stochastic process: properties are generated by a noisy transmission process (sketched below).
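A minimal sketch of one plausible noisy-transmission process: each species can be infected from a background source, and infection passes along each prey-to-predator link with some probability. The food web, the parameter values, and the exact propagation scheme are assumptions for illustration.

```python
# Noisy causal transmission over a directed food web.
import numpy as np

species = ["salmon", "trout", "bear", "eagle"]
edges = [(0, 2), (1, 2), (1, 3)]   # prey -> predator (made-up web)

background = 0.1   # chance a species is infected from outside the web
transmit = 0.8     # chance infection passes along a given link

def sample_disease(rng):
    infected = list(rng.random(len(species)) < background)
    # Decide once per link whether it would carry the infection.
    active = [e for e in edges if rng.random() < transmit]
    # Propagate along active links until nothing changes.
    changed = True
    while changed:
        changed = False
        for prey, predator in active:
            if infected[prey] and not infected[predator]:
                infected[predator] = True
                changed = True
    return infected

rng = np.random.default_rng(2)
print(dict(zip(species, sample_disease(rng))))
```

Under this process a disease in salmon makes a disease in bears likely, while a disease in bears says much less about salmon; that directional asymmetry is exactly what the diffusion and drift processes cannot express.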

Experiment: disease properties (Shafto et al.)

[Figure: two food webs, one for an island ecosystem and one for mammals.]

Results: disease properties

[Figure: model-human fits for the island and mammal webs under web + transmission.]



Conclusions: property induction

• Hierarchical Bayesian models help to explain how abstract knowledge can be used for induction

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Structure learning

R (principles): structural form: tree; stochastic process: diffusion.
S (structure): ?
D (data): the species-by-feature matrix.

Goal: find S that maximizes P(S | D, R)

Goal: find S that maximizes

P(S | D, R) ∝ P(D | S, R) P(S | R)

Here P(D | S, R) is the distribution previously used for property induction.

Generating features over the tree

[Figure: each feature column F1-F4 is generated independently by the diffusion process over the tree, together producing the species-by-feature matrix.]


P(S | R): Generating structures

[Figure: graphs over mouse, squirrel, chimp, gorilla. A tree is consistent with R; a graph with cycles is inconsistent with R.]

[Figure: two trees over the same species, one complex (many nodes), one simple (few nodes).]

P(S | R): Generating structures

• Each structure is weighted by the number of nodes it contains:

P(S | R) = 0 if S is inconsistent with R
P(S | R) ∝ θ^|S| otherwise

where |S| is the number of nodes in S and θ < 1, so structures with fewer nodes receive higher prior probability.


Structure Learning

Aim: find S that maximizes P(S | D, R) ∝ P(D | S, R) P(S | R)

P(S | D, R) will be high when:
– The features in D vary smoothly over S.
– S is a simple graph (a graph with few nodes).

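A simplified sketch of this structure scoring. Two simplifications to note: the features are treated as continuous draws from the graph-smooth Gaussian (the full model thresholds them to produce binary features), and the candidate graphs have no latent internal nodes, although the structures in the tutorial do. The theta and sigma values are assumptions.

```python
# Compare candidate structures S by P(S | D, R) ∝ P(D | S, R) P(S | R).
import numpy as np
from scipy.stats import multivariate_normal

def covariance(edges, n, sigma=1.0):
    """Graph-smooth covariance K = (L + I / sigma^2)^-1."""
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.inv(L + np.eye(n) / sigma**2)

def log_score(edges, n, D, theta=0.5):
    K = covariance(edges, n)
    log_lik = sum(multivariate_normal.logpdf(f, np.zeros(n), K)
                  for f in D.T)               # log P(D | S, R)
    return log_lik + n * np.log(theta)        # + log P(S | R)

# Two features over mouse, squirrel, chimp, gorilla; they vary smoothly
# over structures that group {mouse, squirrel} and {chimp, gorilla}.
D = np.array([[ 1.0, -1.0],
              [ 0.8, -0.9],
              [-0.9,  1.1],
              [-1.0,  0.9]])

chain = [(0, 1), (1, 2), (2, 3)]   # mouse-squirrel-chimp-gorilla
star = [(0, 1), (0, 2), (0, 3)]    # everything attached to mouse
print(log_score(chain, 4, D), log_score(star, 4, D))
```

The chain keeps the two similar pairs adjacent, so its smoothness likelihood beats the star's; with candidates of different sizes, the θ^|S| prior would also penalize the larger one.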

Structure learning example (Osherson et al.)

• Participants rated the goodness of 85 features for 48 animals.
• E.g., elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, inactive, vegetation, grazer, oldworld, bush, jungle, ground, timid, smart, group.

[Figure: the animals-by-features matrix, and the tree recovered from it.]

Spatial model

R (principles): structural form: 2D space; stochastic process: diffusion.

[Figure: the 2D embedding recovered for the same animals-by-features data.]

Conclusions: structure learning

• Hierarchical Bayesian models provide a unified framework for the acquisition and use of structured representations

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Learning structural form

[Figure: the same framework, but the structural form itself is now in question. Given feature data for ostrich, robin, crocodile, snake, bat, orangutan, and turtle, which form is best?]

Structural forms

Partition, Order, Chain, Ring, Hierarchy, Tree, Grid, Cylinder

Learning structural form

R (principles): structural form F (could be a tree, 2D space, ring, …); stochastic process: diffusion.
S (structure): ?
D (data): the species-by-feature matrix.

Goal: find S, F that maximize P(S, F | D)

Aim: find S, F that maximize

P(S, F | D) ∝ P(D | S) P(S | F) P(F)

P(F): a uniform distribution on the set of forms.

P(D | S): the distribution used for property induction.

P(S | F): the distribution used for structure learning.

P(S | F): Generating structures from forms

P(S | F) = 0 if S is inconsistent with F
P(S | F) ∝ θ^|S| otherwise, where |S| is the number of nodes in S.

• Simpler forms are preferred: P(S | F) must sum to one over all the structures a form can generate, so a form that generates few structures (e.g., a chain) places more prior mass on each of them than a form that generates many (e.g., a grid).

[Figure: P(S | F) plotted over all possible graph structures S, for Chain vs. Grid.]
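A small self-contained illustration of this preference, using a uniform P(S | F) within each form (a simplification of the θ^|S| weighting) and simply counting labeled structures on four nodes:

```python
# Why simpler forms are preferred: a form that generates fewer
# structures gives each one more prior mass, since P(S | F) sums to 1.
from itertools import permutations

n = 4

def chain_edges(order):
    # Undirected edge set of the chain visiting nodes in this order.
    return frozenset(frozenset(e) for e in zip(order, order[1:]))

chains = {chain_edges(p) for p in permutations(range(n))}
num_trees = n ** (n - 2)     # Cayley's formula: labeled trees on n nodes

print(len(chains), num_trees)            # 12 chains vs 16 trees
print(1 / len(chains), 1 / num_trees)    # prior mass per structure
```

With four nodes there are 12 labeled chains but 16 labeled trees, so a chain-consistent structure starts with more prior mass than a tree-consistent one, and the gap widens quickly as the number of nodes grows.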


Learning structural form

Aim: find S, F that maximize P(S, F | D) ∝ P(D | S) P(S | F) P(F)

• P(S, F | D) will be high when:
– The features in D vary smoothly over S.
– S is a simple graph (a graph with few nodes).
– F is a simple form (a form that can generate only a few structures).


Form learning: results

• Biological data (33 animals, 110 features): the best form is a tree.
• Supreme Court (Spaeth; votes on 1,600 cases, 1987-2005): a chain.
• Color (Ekman): a ring.

[Figures: the structure recovered for each dataset.]

Outline

• A high-level view of HBMs

• A case study: Semantic knowledge
– Property induction
– Learning structured representations
– Learning the abstract organizing principles of a domain

Where do priors come from?

Each level of the hierarchy supplies the prior for the level below: the stochastic process and structure together provide the prior over properties, and the structural form provides the prior over structures (Partition, Order, Chain, Ring, Hierarchy, Tree, Grid, Cylinder, …).

Where do structural forms come from?

Each form is itself generated by a simple process: a node-replacement graph grammar.

Node-replacement graph grammars

[Figure: the production for the chain form, and a derivation that applies it repeatedly to grow a chain.]

The complete space of grammars

[Figure: the full space of 4096 node-replacement grammars of this family, from grammar 1 to grammar 4096.]

When can we stop adding levels?

• When the knowledge at the top level is simple or general enough that it can be plausibly assumed to be innate.

Conclusions

• Hierarchical Bayesian models provide a unified framework which can:
– Explain how abstract knowledge is used for induction.
– Explain how abstract knowledge can be acquired.

Learning abstract knowledge

Applications of hierarchical Bayesian models at this conference:

1. Semantic knowledge (Schmidt et al.): learning the M-constraint.
2. Syntax (Perfors et al.): learning that language is hierarchically organized.
3. Word learning (Kemp et al.): learning the shape bias.