
Primitive probabilistic auditory scene analysis

Richard E. Turner (ret26@cam.ac.uk)

Computational and Biological Learning Lab, University of Cambridge

Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)

Gatsby Computational Neuroscience Unit, UCL, London

Motivation

A hierarchy of representations, illustrated in vision and in audition:

  level                  | vision                     | audition
  object                 | trees                      | bird song
  parts of objects       | bark                       | motif
  structural primitives  | edges                      | AM tones and noise
  sensory input          | photoreceptor activities   | auditory filter activities

Auditory Scene Analysis

The same hierarchy (object / parts of objects / structural primitives / sensory input; e.g. bird song / motif / AM tones and noise / auditory filter activities), linked by three grouping processes:

  sensory input → structural primitives : primitive grouping
  structural primitives → parts of objects : schema-based grouping
  parts of objects → object : super schema-based grouping

Generative model: Probabilistic Auditory Scene Synthesis

Read top-down, the hierarchy defines a generative model for sounds:

  p(source)
  p(parts | source)
  p(primitive | part)
  p(sound | primitives)

Recognition model: Probabilistic Auditory Scene Analysis

Analysis inverts the generative model, inferring each level of the hierarchy from the sound waveform:

  p(source | sound)
  p(part | sound)
  p(primitive | sound)


Primitive Probabilistic Auditory Scene Analysis

Part 1: Probabilistic generative model: primitive auditory scene synthesis

What are the important low-level statistics of natural sounds?

Part 2: Probabilistic recognition model: primitive auditory scene analysis

Provocative computational theory: grouping rules arise from inferences based on the statistics of natural sounds.


Heuristic Analysis: Fire sound

[Figure: a fire-sound waveform y is passed through a filter bank into sub-bands centred at 4.1 kHz, 2.4 kHz, 1 kHz and 0.4 kHz, shown over 0.5–0.62 s; two stages are shown, "filter" and "filter and demodulate", with the demodulated envelopes overlaid on each sub-band.]

Heuristic Analysis: Water

Heuristic Analysis: Rain

Heuristic Analysis: Speech

Summary

sound → filter bank → demodulate → transformed envelopes

• Important statistics include

  – energy in sub-bands (power spectrum)
  – patterns of co-modulation (local power-sharing)
  – time-scale of the modulation
  – depth of the modulation (sparsity)

• Josh's texture synthesis work tomorrow

• Formulate a probabilistic model to capture these statistics:

sound ← modulate carriers ← modulators ← transformed envelopes

Structural Primitives = Co-modulated narrow-band processes
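The filter → demodulate pipeline above can be sketched in a few lines. This is a minimal illustration, assuming a Butterworth filter bank and Hilbert-envelope demodulation; the function name, centre frequencies, and bandwidth are my own choices, not those used in the talk:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def demodulate_subbands(y, fs, centres, bandwidth_octaves=0.5):
    """Split a sound into sub-bands and demodulate each one.

    Returns (carriers, envelopes); the envelope is taken as the magnitude
    of the analytic signal (Hilbert transform), one common demodulator.
    """
    carriers, envelopes = [], []
    for fc in centres:
        lo = fc * 2 ** (-bandwidth_octaves / 2)
        hi = fc * 2 ** (bandwidth_octaves / 2)
        b, a_f = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        sub = filtfilt(b, a_f, y)                      # band-limited sub-band
        env = np.abs(hilbert(sub))                     # slow envelope
        carriers.append(sub / np.maximum(env, 1e-8))   # fast carrier, |c| ~ 1
        envelopes.append(env)
    return np.array(carriers), np.array(envelopes)

# Toy input: a 1 kHz tone, amplitude-modulated at 4 Hz
fs = 16000
t = np.arange(fs) / fs
y = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
c, a = demodulate_subbands(y, fs, centres=[400, 1000, 2400])
```

The envelope of the 1 kHz sub-band carries the 4 Hz modulation, while the off-frequency bands stay near zero.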


Model Schematic

[Figure: the sound y_t is assembled as a sum over sub-bands, y_t = Σ_d c_{d,t} × a_{d,t}: each slowly varying envelope a_{d,t} (derived from the transformed envelopes x_{d,t}) multiplies a quickly varying carrier c_{d,t}, and the products are summed; panels show example traces over 0–0.1 s.]

Probabilistic Model

• Transformed envelopes: slow and +/–

  p(x_{e,1:T} | \Gamma_{e,1:T,1:T}) = \mathrm{Norm}(x_{e,1:T}; 0, \Gamma_{e,1:T,1:T}),
  \Gamma_{e,t-t'} = \sigma_e^2 \exp\left(-\tfrac{1}{2\tau_e^2}(t-t')^2\right)

• Envelopes: positive mixture of transformed envelopes with controllable sparsity

  a_{d,t} = a\left(\sum_{e=1}^{E} g_{d,e}\, x_{e,t} + \mu_d\right), e.g. a(z) = \log(1 + \exp(z)).

• Carriers: band-limited Gaussian noise

  p(c_{d,t} | c_{d,t-1:t-2}, \theta) = \mathrm{Norm}\left(c_{d,t}; \sum_{t'=1}^{2} \lambda_{d,t'}\, c_{d,t-t'},\ \sigma_d^2\right)

• Observed sound: modulated carriers plus noise

  y_t = \sum_{d=1}^{D} c_{d,t}\, a_{d,t} + \sigma_y \epsilon_t
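One forward sample from this generative model can be sketched as follows. The parameter values are toy settings chosen for illustration, not those fitted in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_model(T=1000, D=2, E=1, tau=50.0, sigma_x=3.0, mu=-2.0,
                 lam=((1.7, -0.9), (1.2, -0.8)), sigma_c=0.1, sigma_y=0.01):
    """Draw one sound from the model (toy parameter values)."""
    ts = np.arange(T)
    # Transformed envelopes x_e: slow Gaussian processes (SE covariance)
    K = sigma_x**2 * np.exp(-0.5 * (ts[:, None] - ts[None, :])**2 / tau**2)
    x = rng.multivariate_normal(np.zeros(T), K + 1e-6 * np.eye(T), size=E)

    # Envelopes a_d: softplus of a weighted mixture of the x_e -> positive
    g = rng.standard_normal((D, E))            # co-modulation weights g_de
    a = np.log1p(np.exp(g @ x + mu))           # a(z) = log(1 + exp(z))

    # Carriers c_d: AR(2) band-limited Gaussian noise
    c = np.zeros((D, T))
    for d in range(D):
        for t in range(2, T):
            c[d, t] = (lam[d][0] * c[d, t - 1] + lam[d][1] * c[d, t - 2]
                       + sigma_c * rng.standard_normal())

    # Observed sound: sum of modulated carriers plus observation noise
    y = (c * a).sum(axis=0) + sigma_y * rng.standard_normal(T)
    return y, a, c

y, a, c = sample_model()
```

The softplus nonlinearity guarantees positive envelopes, and the AR(2) coefficients λ make each carrier band-limited noise, matching the three bullet points above.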

What do the parameters control?

• λ_d control the centre frequency and bandwidth of each sub-band
• σ_d^2 control the power in each sub-band
• τ_e controls the time-scale of the modulators
• μ_d and σ_e^2 control the modulation depth, skew and sparsity
• g_{d,e} control the patterns of modulation across sub-bands

[Figure: bandwidth plotted against centre frequency for the sub-bands.]
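To see concretely how λ_d sets centre frequency and bandwidth: an AR(2) process has a pair of complex poles r·e^{±iω}, with λ_1 = 2r cos ω and λ_2 = −r²; the pole angle ω sets the resonant frequency and the radius r the bandwidth. A sketch, where the bandwidth formula is a standard approximation rather than anything stated in the talk:

```python
import numpy as np

def ar2_coefficients(fc, bw, fs):
    """AR(2) coefficients (lambda_1, lambda_2) for a resonance at centre
    frequency fc with approximate bandwidth bw (both in Hz).

    Poles at r*exp(+/- i*w): l1 = 2*r*cos(w), l2 = -r**2, and
    bw ~ -fs*ln(r)/pi is the standard bandwidth approximation.
    """
    w = 2.0 * np.pi * fc / fs
    r = np.exp(-np.pi * bw / fs)
    return 2.0 * r * np.cos(w), -r**2

def ar2_peak_frequency(l1, l2, fs, n=4096):
    """Locate the spectral peak of the AR(2) filter on a frequency grid."""
    w = np.linspace(0.0, np.pi, n)
    H = 1.0 / np.abs(1.0 - l1 * np.exp(-1j * w) - l2 * np.exp(-2j * w))
    return w[np.argmax(H)] * fs / (2.0 * np.pi)

# lambda_d for a sub-band centred at 1 kHz, ~50 Hz wide, at fs = 16 kHz
l1, l2 = ar2_coefficients(fc=1000.0, bw=50.0, fs=16000.0)
peak = ar2_peak_frequency(l1, l2, 16000.0)
```

For narrow bands (r close to 1) the spectral peak sits essentially at ω, so the recovered peak frequency lands within a few Hz of the requested 1 kHz.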

Intuitions for role of model parameters: Generation

Generation: Increasing sparsity


Generation: Increasing co-modulation


Generation: Decreasing time-scale


Inference and Learning

• Key observation: with the envelopes fixed, the posterior over carriers is Gaussian.

• Leads to MAP estimation of the envelopes (like HMMs, ICA, NMF):

  X_{\mathrm{MAP}} = \arg\max_X p(X|Y)

  p(X|Y) = \tfrac{1}{Z}\, p(X,Y) = \tfrac{1}{Z} \int \mathrm{d}C\, p(X,Y,C) = \tfrac{1}{Z}\, p(X) \int \mathrm{d}C\, p(Y|A,C)\, p(C)

• Compute the integral efficiently using Kalman smoothing

• Leads to gradient-based optimisation for the transformed amplitudes

• Learning: approximate maximum likelihood, θ = arg max_θ p(Y|θ)

• See Turner and Sahani, ICASSP 2010.
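The key observation can be checked directly: with the envelopes A fixed, the model is linear-Gaussian in the carriers, so they integrate out in closed form. For short signals the resulting marginal p(Y|A) can be computed by naive O(T³) linear algebra rather than Kalman smoothing. A sketch under that assumption, with the helper names and parameter values being my own illustrative choices:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def ar2_covariance(l1, l2, sigma2, T):
    """Stationary covariance matrix of an AR(2) carrier over T steps."""
    F = np.array([[l1, l2], [1.0, 0.0]])       # companion form of the AR(2)
    Q = np.array([[sigma2, 0.0], [0.0, 0.0]])
    P = solve_discrete_lyapunov(F, Q)          # stationary state covariance
    gam = np.empty(T)                          # autocovariance gamma(k)
    M = P.copy()
    for k in range(T):
        gam[k] = M[0, 0]                       # E[c_t c_{t-k}] = (F^k P)[0,0]
        M = F @ M
    idx = np.arange(T)
    return gam[np.abs(idx[:, None] - idx[None, :])]

def log_lik_given_envelopes(y, A, ar_params, sigma_y):
    """log p(Y|A): with envelopes fixed, the Gaussian carriers integrate
    out, so y ~ N(0, sum_d diag(a_d) K_d diag(a_d) + sigma_y^2 I)."""
    T = len(y)
    S = sigma_y**2 * np.eye(T)
    for a_d, (l1, l2, s2) in zip(A, ar_params):
        K = ar2_covariance(l1, l2, s2, T)
        S += a_d[:, None] * K * a_d[None, :]
    _, logdet = np.linalg.slogdet(2.0 * np.pi * S)
    return -0.5 * (logdet + y @ np.linalg.solve(S, y))

# Toy check on a short random signal with one flat envelope
rng = np.random.default_rng(1)
y = rng.standard_normal(50)
ll = log_lik_given_envelopes(y, [np.ones(50)], [(1.2, -0.8, 0.1)], 0.05)
```

Gradients of this log likelihood with respect to the transformed amplitudes then drive the MAP optimisation; Kalman smoothing gives the same quantity in O(T).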


Synthetic Sounds

[Audio demos: original ("Orig") versus model-synthesised ("Model") versions of ten sounds — stream (×3), wind (×2), fire (×2), rain, twigs and speech.]

Primitive Probabilistic Auditory Scene Analysis

Part 1: Probabilistic generative model: primitive auditory scene synthesis

What are the important low-level statistics of natural sounds?

Part 2: Probabilistic recognition model: primitive auditory scene analysis

Provocative computational theory: grouping rules arise from inferences based on the statistics of natural sounds.

Introduction to the psychophysics

• Probe the model with stimuli used to study the psychophysics of auditory scene analysis

• Fix the carriers to be 'auditory filter'-like

• Assume perceptual access to:

  – the instantaneous frequency of the carriers (not the phase)
  – the modulators

How does the model 'hear' these stimuli?

Continuity Illusion

y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}

[Figure: a tone interrupted by bursts of noise; panels show the composite signal, its frequency content (Hz against time over 0–1.5 s), and the model's inferred decomposition into a tone component y_tone and a noise component y_noise.]
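A stimulus of this type is easy to generate. The sketch below is illustrative: the durations, levels, frequencies and the function name are my own choices, not the talk's actual stimulus parameters:

```python
import numpy as np

def continuity_stimulus(fs=8000, dur=1.5, f_tone=100.0, gap=(0.5, 0.75),
                        noise_level=5.0, tone_in_gap=False):
    """Tone interrupted by a loud noise burst: the classic continuity-
    illusion stimulus.  Listeners report hearing the tone continue
    through the noise even when it is physically absent in the gap
    (tone_in_gap=False)."""
    t = np.arange(int(fs * dur)) / fs
    tone = np.sin(2 * np.pi * f_tone * t)
    in_gap = (t >= gap[0]) & (t < gap[1])
    if not tone_in_gap:
        tone = tone * ~in_gap                  # silence the tone in the gap
    rng = np.random.default_rng(0)
    noise = noise_level * rng.standard_normal(t.size) * in_gap
    return tone + noise

y = continuity_stimulus()
```

Under the model, a loud broadband burst is consistent with the tone's envelope a_1 remaining high throughout, which is one reading of why inference reproduces the illusion.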

Common Amplitude Modulation

y_t = (a_{1,t} + a_{3,t}) c_{1,t} + (a_{2,t} + a_{3,t}) c_{2,t}

[Figure: two tones sharing a common modulator a_3; panels show the composite signal and its frequency content (100–200 Hz against time over 0–4 s).]

Old Plus New Heuristic

y_t = a_{1,t}(c_{1,t} + c_{2,t} + c_{3,t}) + a_{2,t}(c_{2,t} + c_{4,t}) + a_{3,t} c_{2,t} + a_{4,t}(c_{1,t} + c_{3,t})

[Figure: the composite signal and its frequency content (up to ~400 Hz against time over 0–4 s), together with the four inferred envelopes a_1–a_4.]

Comodulation Masking Release

y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}

[Figure: two stimulus conditions (signal 1, signal 2) with a tone in a noise masker around 80–120 Hz; panels show the inferred y_noise and y_tone components over 0–2 s for each condition.]

Good Continuation

y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}

[Figure: two stimulus conditions (signal 1, signal 2); panels show each signal and its frequency trajectories (roughly 50–100 Hz, 80–160 Hz and 160–320 Hz) over 0–5 s.]

Proximity

y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}

[Figure: tone-sequence stimuli (signal 1, signal 2) alternating in frequency (0–100 Hz scale) over 0–5 s, probing how grouping depends on the frequency proximity of successive tones.]

Conclusions

• Developed a model for natural sounds comprising quickly varying carriers and slowly varying modulators

• Inspired by natural sound statistics

• Inference replicates many of the characteristics of auditory scene analysis

Future work

• Train on a large corpus of natural sounds: quantitatively predict psychophysical results

• Trading between grouping principles learned from natural sounds

• Psychophysics: prediction of grouping on new stimuli

• Experimental stimulus generation: samples from the model are controlled, but natural sounding

• Model extensions: hierarchical (cascades of modulators), asymmetries, more general filter shapes

Additional Slides

Inference with fixed amplitudes, a_{d,t} = 1

[Figure: carrier spectra for three learned filters over 700–1600 Hz, with the corresponding filter outputs and the waveform y shown over 5.35–5.4 s.]
