Biomedical Applications and the Probabilistic Framework
Peter Sykacek
http://www.sykacek.net
Talk Overview
- Motivation of Probabilistic Concepts
- BCI, current practice & shortcomings
- Probabilistic Kalman Filter
- Adaptive BCI
- Gene Discovery
- DAG for Bayesian Marker Identification
- Gene Selection
- Discussion of Model Selection
Probabilistic Motivations

Thomas Bayes (1701 - 1763): learning from data using a decision theoretic framework.

First consequence: we must revise beliefs according to Bayes' theorem,

$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$, where $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$.

Second consequence: decisions by maximising expected utilities.
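The second consequence can be written out explicitly; a minimal sketch in standard decision-theoretic notation (the utility $U$ and action set $\mathcal{A}$ are generic placeholders, not defined on the slide):

```latex
% Decision by maximising expected utility under the posterior:
a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{p(\theta \mid D)}\!\left[ U(a, \theta) \right]
      \;=\; \arg\max_{a \in \mathcal{A}} \int U(a, \theta)\, p(\theta \mid D)\, d\theta .
```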
Brain Computer Interface
Computer is controlled directly by cortical activity.
Classification of BCIs

Neural activity: voluntary control (mostly motor tasks) vs. cognitive events: involuntary events (P300); recorded by scalp EEG or intracranial EEG.

- intracranial EEG ⇒ high spatial and temporal resolution; highly invasive! Allows 2-d control of an artificial limb.
- surface EEG ⇒ low spatial and temporal resolution; no permanent interference with the patient; slow! At most 20 bit per minute and task.

⇒ focus on BCIs based on scalp recordings.
⇒ low bit rates; last resort if no other communication is possible.
BCI with almost no adaptation

- P300 based (L. A. Farwell and E. Donchin): user intention is embedded within a sequence of symbols. The correct symbol leads to "surprise" and triggers a P300.
- Filter & threshold: N. Birbaumer et al. threshold slow cortical potentials; J. R. Wolpaw et al. threshold a moving average in an appropriate pass band, e.g. the µ-rhythm.

These principles rely mostly on user training.
BCI & static pattern recognition

- Extract a representation of EEG "waveforms" (e.g. low pass filtered time series; spectral representation).
- Parameterize supervised classification, implicitly assuming stationarity.

What if the technical setup changes during operation (e.g. electrolyte changes impedance)? The user learns from feedback? The user shows fatigue?

Assuming stationarity must be wrong!
⇒ Probabilistic method for "adaptive" BCI.
Probabilistic Kalman Filter

[Figure: DAG of the model: weights w_{n-1} with precision Λ_{n-1} drift to w_n via λI; hyperparameters α and β; observation n produces the target y_n.]

Key: get σ_λ right (it may be regarded as a learning rate).

Classification: non linear and non Gaussian, some eqns. (see the variational Kalman filter slide)

[Figure: two panels of simulations using σ_λ = 1e+003 for window sizes 1, 5, 10, 15 and 20, illustrating σ_λ and the "instantaneous" generalization error for B. D. Ripley's synthetic data with artificial non-stationarity (labels swapped after sample 500).]
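To make the picture concrete, here is a minimal sketch of such a random-walk Kalman filter for classifier weights, assuming a linear-Gaussian observation model; the function name, the noise precision beta and the drift parameterisation via sigma_lambda are illustrative assumptions, not the talk's exact equations:

```python
import numpy as np

def kalman_step(m, P, x, y, sigma_lambda=1.0, beta=1.0):
    """One predict/update step for weights w with posterior mean m, cov P.

    Assumed model (a sketch, not the talk's exact formulation):
      drift:       w_n = w_{n-1} + v_n,    v_n ~ N(0, sigma_lambda^2 I)
      observation: y_n = x_n^T w_n + e_n,  e_n ~ N(0, 1/beta)
    """
    d = len(m)
    # Predict: the drift inflates the covariance; this is why sigma_lambda
    # acts like a learning rate (larger drift -> faster forgetting).
    P_pred = P + sigma_lambda**2 * np.eye(d)
    # Update with the scalar observation y.
    S = x @ P_pred @ x + 1.0 / beta      # innovation variance
    K = P_pred @ x / S                   # Kalman gain
    m_new = m + K * (y - x @ m)
    P_new = P_pred - np.outer(K, x) @ P_pred
    return m_new, P_new
```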
Adaptive BCI by variational Kalman filtering

BCI: data driven prediction of the cognitive state from EEG measurements.

Working hypothesis: EEG dynamics during a cognitive task are subject to temporal variation (learning effects, fatigue, ...).

Represent EEG segments by z-transformed reflection coefficients.

Mutual information for the adaptive method and the identical "stationary" model:
[Figure: empirical cdfs of D_{P(y|z)||P(y)} in bit for navigation/auditory, navigation/movement, auditory/movement, navig./audit./move (3 class), rest/move without and with feedback, and move/math without and with feedback.]
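The divergence on the plot axes can be computed per trial; a minimal sketch for discrete class probabilities (the function name and the log2/bit convention are assumptions):

```python
import numpy as np

def kl_bits(p_post, p_prior):
    """D(P(y|z) || P(y)) in bit for discrete class distributions."""
    p_post = np.asarray(p_post, dtype=float)
    p_prior = np.asarray(p_prior, dtype=float)
    mask = p_post > 0                      # 0 * log 0 = 0 by convention
    return float(np.sum(p_post[mask] * np.log2(p_post[mask] / p_prior[mask])))

# e.g. a confident posterior against a uniform 2-class prior:
# kl_bits([0.9, 0.1], [0.5, 0.5]) ≈ 0.53 bit
```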
Communication Bandwidth

Bit rates in [bit/s]:

[Table: bit rates (vkf vs. vsi) for the tasks rest/move without and with feedback, move/math without and with feedback, nav./aud./move, audit./move, navig./move and navig./audit.; the numeric entries are not recoverable from the transcript.]
Conclusion: adaptive methods increase BCI bandwidths even on short time scales.
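The slide's bit-rate definition did not survive the transcript; as a hedged point of reference, the definition commonly used in the BCI literature (Wolpaw et al.) for N classes with accuracy P is sketched below (the function name is an assumption):

```python
import math

def wolpaw_bitrate(n_classes, accuracy, trial_seconds):
    """Bits per second for an N-class decision with accuracy P in (1/N, 1)."""
    N, P = n_classes, accuracy
    bits_per_trial = (math.log2(N)
                      + P * math.log2(P)
                      + (1 - P) * math.log2((1 - P) / (N - 1)))
    return bits_per_trial / trial_seconds

# e.g. wolpaw_bitrate(2, 0.8, 1.0) ≈ 0.28 bit/s
```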
Gene discovery

Discovering "important" genes (or proteins) from microarray datasets can be classified as

- identification of all differentially expressed genes;
- identification of reliable (sets of) marker genes.

Current practice for the first: classical methods (e.g. a t-test on differences of means) or probabilistic approaches with one indicator variable for each gene. The second is typically done by conventional feature subset selection. As a result we obtain a set of genes that was found by heuristic search.
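A minimal sketch of the classical baseline mentioned above: per-gene two-sample t-tests with ranking by p-value (all names are illustrative; this is the conventional approach, not the talk's Bayesian method):

```python
import numpy as np
from scipy import stats

def rank_genes_by_ttest(X_a, X_b):
    """X_a, X_b: expression matrices of shape (samples, genes), one per class."""
    t, p = stats.ttest_ind(X_a, X_b, axis=0)   # one test per gene column
    order = np.argsort(p)                      # most significant genes first
    return order, p[order]
```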
Bayesian Marker Identification

Missing in FSS: how good are other explanations?

Interpret microarray data as a classification problem of "genetic" regressors w.r.t. a discrete response.

⇒ Bayesian variable selection provides this information. However: hopeless, unless we constrain the dimensionality.

Simplified attempt: ⇒ find a distribution over individual genes. Probabilities result from the marginal likelihood of each model:

$P(g \mid D) \propto P(D \mid g)\, P(g)$, where $P(D \mid g) = \int p(D \mid \theta, g)\, p(\theta \mid g)\, d\theta$.
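A hedged sketch of that idea: score every single-gene classifier by an approximation to its marginal likelihood and normalize over genes. The BIC approximation and all names below are stand-ins for illustration; the talk's actual inference is variational or sampling based:

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.linear_model import LogisticRegression

def gene_posterior(X, y):
    """X: (samples, genes) expression matrix, y: binary labels.
    Returns an approximate distribution P(gene | D) over single-gene models."""
    n, n_genes = X.shape
    log_ev = np.empty(n_genes)
    for g in range(n_genes):
        x = X[:, [g]]
        clf = LogisticRegression(C=1e6).fit(x, y)     # ~ maximum likelihood
        p = np.clip(clf.predict_proba(x)[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        log_ev[g] = loglik - 0.5 * 2 * np.log(n)      # BIC: 2 parameters
    return np.exp(log_ev - logsumexp(log_ev))         # uniform model prior
```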
DAG for Marker Identification

[Figure: DAG of the model: prior precision Λ_0 on the weights w; the indicator Π selects the regressor x_n; latent z_n; observation n with label y_n.]

Latent variable probit GLM: z is a one dimensional Gaussian random variable with mean $w^T x_n$ and unit precision;

$y_n = 1$ if $z_n > 0$, and $y_n = 0$ otherwise.

Inference can be done by a variational method (systematic error) or by sampling (random error). The latter allows us to integrate over w analytically, so we draw from z and Π only.
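For a fixed choice of regressors, probit inference by sampling follows the data augmentation scheme of Albert & Chib (1993); a minimal sketch, and a simplified stand-in: it draws w explicitly rather than integrating it out as the talk's collapsed sampler does:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=1000, prior_prec=1.0, seed=0):
    """X: (n, d) regressors, y: 0/1 labels. Returns posterior draws of w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    V = np.linalg.inv(prior_prec * np.eye(d) + X.T @ X)  # cond. cov of w
    draws = []
    for _ in range(n_iter):
        # 1) latent z_i ~ N(x_i.w, 1), truncated to agree with the label:
        mu = X @ w
        lo = np.where(y == 1, -mu, -np.inf)   # y=1 forces z > 0
        hi = np.where(y == 1, np.inf, -mu)    # y=0 forces z <= 0
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # 2) w | z is Gaussian; draw it directly:
        w = rng.multivariate_normal(V @ (X.T @ z), V)
        draws.append(w)
    return np.array(draws)
```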
Asymptotic Behaviour

[Figure: four panels: cumulative posterior probability over the gene index (Bayes posterior) and ascending p-values (test based ranking), for Leukaemia with 3, 18 and 38 slides and for Colon cancer with 3, 30 and 62 slides.]
Comparison with ML

[Figure: cumulative probability over the gene index for Leukaemia and Colon cancer, Bayesian probabilities vs. maximum likelihood.]

Results differ since Bayesian model posteriors take "complexity" (cf. Hochreiter's "flat minima") into account.
Selection and Gen. Accuracy

Most probable regressors, selected at a 1% threshold (columns: accession no., description, posterior probability):

Colon Cancer (Alon et al.)
  Z50753  Uroguanylin      0.76
  R87126  Myosin           0.21
  M63391  desmin gene      0.01
  M36634  vasoact. pept.   0.01

Leukaemia (Golub et al.)
  X95735  Zyxin            0.93
  M55150  FAH Fumarylac.   0.05
  M27891  CST3 Cystatin C  0.01

Generalization accuracy:

  Dataset    B. probit  "indifference"
  Colon      84%        74% to 94%
  Leukaemia  88%        91% to 96%

No "better" results in the literature ⇒ confirms the model.
Biology confirms Uroguanylin (cell apoptosis) as important in colon cancer development.
But: what is the meaning of the probabilities?
Discussion

Quoting Bernardo & Smith: $P(m \mid D)$ is M-closed model selection with zero-one utility. Our approach should assume an M-open scenario. Under asymptotic normality, $P(m \mid D)$ degenerates on the model $m^{*}$ that minimizes the Kullback Leibler divergence to the data generating distribution.

If the predictive distribution of a new observation is of interest, B&S suggest using a logarithmic score function for M-open model comparison:

$E\left[\log p(y_{new} \mid D, m)\right]$

(e.g. a cross validation estimate, still to be done)
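A hedged sketch of how that score is typically estimated in practice, by leave-one-out cross validation (the notation $D_{-i}$ for the data without case $i$ is an assumption, not from the slide):

```latex
% Cross-validation estimate of the logarithmic score of model m:
\hat{S}(m) \;=\; \frac{1}{n} \sum_{i=1}^{n} \log p\!\left(y_i \,\middle|\, D_{-i},\, m\right)
```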
A simple idea: the world is one probabilistic model.

- Applications often require a hierarchical structure: a feature extraction part and a probabilistic model.
- Classical approach: treat both parts separately and thus regard features as a sufficient statistic of the data. ⇒ Features are deterministic variables.
- Our suggestion: treat such hierarchical settings as one probabilistic model. ⇒ Feature extraction is a representation in a latent space.
Bayes’ Consistent Models
ϕϕI
baI
a
a b
b
X X
t
Expected utility requiresto integrate over all un-known variables, includ-ing , , and thatrepresent a feature space.
−6 −5 −4 −3 −2 −1 0 1 20.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0.65
0.7
Precision of p(φa|X
a) on logarithmic scale
Pos
terio
r pr
obab
ilitie
s
Probabilistic sensor fusion
P(2) − latent feauturesP(2) − conditioning
Decisions depend on(un)certainty and maythus change.
jump 2 TOC Probabilistic Methods with Applications, Peter Sykacek, 2004 – p.19/23
Bayes’ Consistent Models
ϕϕI
baI
a
a b
b
X X
t
Expected utility requiresto integrate over all un-known variables, includ-ing ��� , ��� ,
�� and
�� thatrepresent a feature space.
−6 −5 −4 −3 −2 −1 0 1 20.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0.65
0.7
Precision of p(φa|X
a) on logarithmic scale
Pos
terio
r pr
obab
ilitie
s
Probabilistic sensor fusion
P(2) − latent feauturesP(2) − conditioning
Decisions depend on(un)certainty and maythus change.
jump 2 TOC Probabilistic Methods with Applications, Peter Sykacek, 2004 – p.19/23
Bayes’ Consistent Models
ϕϕI
baI
a
a b
b
X X
t
Expected utility requiresto integrate over all un-known variables, includ-ing ��� , ��� ,
�� and
�� thatrepresent a feature space.
−6 −5 −4 −3 −2 −1 0 1 20.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0.65
0.7
Precision of p(φa|X
a) on logarithmic scale
Pos
terio
r pr
obab
ilitie
s
Probabilistic sensor fusion
P(2) − latent feauturesP(2) − conditioning
Decisions depend on(un)certainty and maythus change.
jump 2 TOC Probabilistic Methods with Applications, Peter Sykacek, 2004 – p.19/23
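The integration over feature-space variables can be approximated by Monte Carlo; a minimal sketch (all names are illustrative) of averaging class probabilities over draws of a latent feature rather than conditioning on a point estimate:

```python
import numpy as np

def predictive_prob(classifier_prob, phi_samples):
    """Average class probabilities over posterior draws of latent features.

    classifier_prob: function mapping a feature vector to class probabilities;
    phi_samples:     array of draws from p(phi | X).
    """
    return np.mean([classifier_prob(phi) for phi in phi_samples], axis=0)
```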
Time Series Classification

ROC Curves
[Figure: ROC curves (sensitivity vs. 1 − specificity) for Navig. vs. Aud., Left vs. Right, Spindle and Synthetic; Bayes vs. cond.]

Kullback Leibler Divergence
[Figure: empirical cdfs of D_{P(t|x)||P(t)} in bit for the same four tasks; Bayes vs. cond.]
More Results

Expected feature values
[Figure: expected feature values for Spindle and Navig. vs. Aud. (dim. 1 vs. dim. 2) and Synthetic (dim. 2 vs. dim. 3), Bayes vs. cond.]

Kullback Leibler Divergence for "Artefacts"
[Figure: empirical cdfs of D_{P(t|x)||P(t)} in bit for correct and for wrong classifications, Bayes vs. cond.]
Variational Kalman Filter

The logarithmic model evidence for a window of size W is, schematically,

$\log p(y_{n-W+1}, \ldots, y_n) = \log \int p(w_{n-W}) \prod_{i=n-W+1}^{n} p(w_i \mid w_{i-1})\, p(y_i \mid w_i)\, dw_{n-W:n}$

This is not a probabilistic structure! (need the Rauch Tung Striebel smoother)

Plug in distributions and integrate over $w_{n-W:n}$:

[the expanded expression is not recoverable from the transcript]
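For the linear-Gaussian regression variant the windowed evidence factorizes over the filter's one-step predictive densities; a minimal sketch under the same assumed drift model as the earlier Kalman step (classification, being non linear and non Gaussian, instead needs the lower bounds on the next slide):

```python
import numpy as np

def window_log_evidence(xs, ys, m0, P0, sigma_lambda=1.0, beta=1.0):
    """Log evidence of a window of scalar targets ys with regressors xs,
    accumulated from the Kalman filter's one-step predictive densities."""
    m, P = np.array(m0, float), np.array(P0, float)
    log_ev = 0.0
    for x, y in zip(xs, ys):
        P_pred = P + sigma_lambda**2 * np.eye(len(m))   # drift step
        S = x @ P_pred @ x + 1.0 / beta                 # predictive variance
        r = y - x @ m                                   # innovation
        log_ev += -0.5 * (np.log(2 * np.pi * S) + r * r / S)
        K = P_pred @ x / S
        m, P = m + K * r, P_pred - np.outer(K, x) @ P_pred
    return log_ev
```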
Lower Bounds

For any distribution $q(w)$, Jensen's inequality gives the variational lower bound

$\log p(y) = \log \int q(w)\, \frac{p(y, w)}{q(w)}\, dw \;\geq\; \int q(w) \log \frac{p(y, w)}{q(w)}\, dw$

[the slide's expanded Gaussian expressions are not recoverable from the transcript]