
Page 1: STRATEGIES FOR VISUAL RECOGNITION

Donald GEMAN
Dept. of Applied Mathematics and Statistics
Center for Imaging Science
Johns Hopkins University

Page 2: Outline

General Orientation within Imaging
Semantic Scene Interpretation
Three Paradigms, with Examples:
  Generative
  Predictive
  Hierarchical

Critique and Conclusions

Page 3: Sensors to Images

Constructing images from measured data.

Examples:
  Ordinary visible-light cameras
  Computed tomography (CT, SPECT, PET, MRI)
  Ultrasound, molecular imaging, etc.

Mathematical Tools:
  Harmonic analysis
  Partial differential equations
  Poisson processes

Page 4: Images to Images/Surfaces

Transform images to more compact or informative data structures.

Examples:
  Restoration (de-noising, de-blurring, inpainting)
  Compression
  Shape-from-shading

Mathematical Tools:
  Harmonic analysis
  Regularization theory and variational methods
  Bayesian inference, graphical models, MCMC

Page 5: Image to Words

Semantic and structural interpretations of images.

Examples:
  Selective attention, figure/ground separation
  Object detection and classification
  Scene categorization

Mathematical Tools:
  Distributions on grammars, graphs, transformations
  Computational learning theory
  Shape spaces
  Geometric and algebraic invariants

Page 6: Semantic Scene Interpretation

Understanding how brains interpret sensory data, or how computers might do so, is a major challenge.

Here: a single greyscale image. No cues from color, motion, or depth, although these are likely crucial to biological learning.

There is an objective reality Y(I), at least at the level of keywords.

Page 7: Dreaming

A description machine $f : I \to Y$, from an image $I$ to a description $Y$ of the underlying scene.

Better yet: a sequence of increasingly fine interpretations, perhaps "nested": $Y = (Y_1, Y_2, \ldots)$.

Page 8: More Dreaming

ACCURACY: $\hat{Y}(I) = Y(I)$ for most images.

LEARNING: There is an explicit set of instructions for building $\hat{Y}$, involving samples from a learning set $L = (I_1, Y_1), \ldots, (I_n, Y_n)$.

EXECUTION: There is an explicit set of instructions for evaluating $\hat{Y}(I)$ with as little computation as possible.

ANALYSIS: There is "supporting theory" which guides construction and predicts performance.

Page 9: Detecting Boats

Page 10: Where Are the Faces? Whose?

Page 11: Within-Class Variability

Page 12: How Many Samples Are Necessary?

Page 13: Recognizing Context

Page 14: Many Levels of Description

Page 15: Confounding Factors

Local (but not global) ambiguity
Complexity: there are so many things to look for!
Arbitrary views and lighting
Clutter: the alternative hypothesis is not "white noise"
Knowledge: somehow quantify
  Domination of clutter
  Invariance of object names under transforms
  Regularity of the physical world

Page 16: Confounding Factors (cont)

Scene interpretation is an infinite-dimensional classification problem.

Is segmentation/grouping performed before, during or after recognition?

No advances in computers or statistical learning will overcome the small-sample dilemma.

Some organizational framework is unavoidable.

Page 17: Small-Sample Computational Learning

$L = (x_1, y_1), \ldots, (x_n, y_n)$: training set for inductive learning.
$x_i \in X$: measurement or feature vector.
$y_i \in Y$: true label or explanation of $x_i$.

Examples:
  $X$: acoustic speech signals; $Y$: transcription into words.
  $X$: natural images; $Y$: semantic description.

Common property: $n$ is very small relative to the effective dimensions of $X$ and $Y$.

Page 18: Three Paradigms

Generative: Centered on a joint statistical model for features X and interpretations Y.

Predictive: Proceed (almost) directly from data to decision boundaries.

Hierarchical: Exploit shared features among objects and interpretations.

Page 19: Generative Modeling

The world is very special: not all explanations and observations are equally likely. Capture regularities with stochastic models.

Learning and decision-making are based on P(X,Y), derived from:
  A prior distribution P(Y) on interpretations, accounting for a priori knowledge and expectations.
  A conditional data model P(X|Y), accounting for visual appearance.

Inference principle: given X, choose the interpretation Y which maximizes P(Y|X).
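
As a toy illustration of this inference principle (not from the talk), the sketch below scores each interpretation by log P(X|Y) + log P(Y) and returns the maximizer. The two interpretations, their prior, and the Gaussian data models are invented placeholders.

```python
# Toy MAP inference: choose Y maximizing P(Y|X), i.e. P(X|Y) P(Y).
# All distributions here are invented placeholders, not the talk's models.
import numpy as np
from scipy.stats import multivariate_normal

prior = {"object": 0.1, "background": 0.9}            # P(Y)
models = {                                            # P(X|Y)
    "object": multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2)),
    "background": multivariate_normal(mean=[0.0, 0.0], cov=4.0 * np.eye(2)),
}

def map_interpretation(x):
    scores = {y: models[y].logpdf(x) + np.log(p) for y, p in prior.items()}
    return max(scores, key=scores.get)

print(map_interpretation([1.8, 2.2]))    # -> object
print(map_interpretation([0.1, -0.3]))   # -> background
```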

Page 20: Generative Modeling: Examples

Deformable templates:
  Prior on transformations
  Template + noise data model

Hidden Markov models
Probabilities on grammars and production rules
Graphical models, e.g., Bayesian networks
LDA, etc.
Gaussian mixtures

Page 21: Gaussian Part/Appearance Model

Y: shape class, with prior p(y).
Z: locations of object "parts."
X = X(I): features whose components capture local topography (interest points, edges, wavelets).

Compound Gaussian model:
  p(z|y): multivariate normal with mean m(y) and covariance C(y).
  p(x|z,y): multivariate normal with mean m(z,y) and covariance C(z,y).

Estimate Y as arg max p(z,y|x) = arg max p(x|z,y) p(z|y) p(y).
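
A minimal sketch of MAP estimation in this compound model, maximizing log p(x|z,y) + log p(z|y) + log p(y) over a grid of candidate part locations. The classes, means, covariances, and grid below are invented for illustration.

```python
# MAP estimate of (z, y) in the compound Gaussian part/appearance model.
# Parameters below are invented placeholders.
import numpy as np
from scipy.stats import multivariate_normal

classes = ["face", "car"]
prior = {"face": 0.5, "car": 0.5}                         # p(y)
m_z = {"face": np.zeros(2), "car": np.array([3.0, 0.0])}  # p(z|y) ~ N(m(y), C(y))
C_z = {y: np.eye(2) for y in classes}

def appearance(z, y):
    # p(x|z,y) ~ N(m(z,y), C(z,y)); here m(z,y) = z for simplicity (assumption)
    return multivariate_normal(mean=z, cov=0.5 * np.eye(2))

def map_estimate(x, candidates):
    best = None
    for y in classes:
        p_z = multivariate_normal(mean=m_z[y], cov=C_z[y])
        for z in candidates:
            s = appearance(z, y).logpdf(x) + p_z.logpdf(z) + np.log(prior[y])
            if best is None or s > best[0]:
                best = (s, y, z)
    return best

grid = [np.array([i / 2, j / 2]) for i in range(-2, 9) for j in range(-4, 5)]
print(map_estimate(np.array([2.9, 0.1]), grid))  # -> ("car", z near (3, 0))
```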

Page 22: Generative Modeling: Critique

In principle, a very general framework. In practice:
  Diabolically hard to model P(Y).
  Intensive computation with P(Y|X).
  P(X|Y) alone amounts to "templates for everything," which lacks power and requires infinite computation.

Page 23: Predictive Learning

Do not solve a more difficult problem than is necessary; ultimately only a decision boundary is needed.

Representation and learning:
  Replace I by a fixed-length feature vector X.
  Quantize Y to a finite number of classes 1, 2, ..., C.
  Specify a family F of "classifiers" f(X).
  Induce f(X) directly from a training set L.

Often does require some modeling.

Page 24: Predictive Learning: Examples

Examples which, in effect, learn P(Y|X) directly and apply the Bayes rule:
  Artificial neural networks
  k-NN with smart metrics (e.g., "shape context")
  Decision trees
  Support vector machines (interpretation as the Bayes rule via logistic regression)
  Multiple classifiers (e.g., random forests)
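
A minimal sketch of the predictive route with one of the methods listed above, a random forest; the synthetic Gaussian features are stand-ins for real image measurements X.

```python
# Predictive learning in miniature: induce a classifier f(X) directly from
# a training set L, with no model of P(X,Y). Data are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(1, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)          # Y quantized to C = 2 classes

f = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
x_new = rng.normal(1, 1, (3, 8))
print(f.predict(x_new))                      # hard decisions
print(f.predict_proba(x_new))                # in effect, estimates of P(Y|X)
```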

Page 25: Support Vector Machines

Let $L = (x_1, y_1), \ldots, (x_n, y_n)$ be a training set generated i.i.d. according to P(X,Y), with labels $y_i \in \{-1, +1\}$.

A separating hyperplane $w^t x + b = 0$ has margin boundaries $w^t x + b = \pm 1$ and margin width $2 / \|w\|$.

Maximize the margin:
$$\min_{w,b} \tfrac{1}{2}\, w^t w \quad \text{s.t.} \quad y_i (w^t x_i + b) - 1 \ge 0 \ \ \forall i$$

Equivalently, in the dual:
$$\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \ge 0 \ \forall i \ \text{ and } \ \sum_i \alpha_i y_i = 0$$
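
A minimal sketch of this margin maximization, delegating the optimization to scikit-learn (a tooling choice, not part of the talk); $w = \sum_i \alpha_i y_i x_i$ is recovered from the fit and the geometric margin read off as $2/\|w\|$.

```python
# Hard-margin linear SVM on separable synthetic data; recover w, b and the
# margin 2/||w|| from the fitted model. Data and C value are placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w, b = svm.coef_[0], svm.intercept_[0]        # w = sum_i alpha_i y_i x_i
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("all constraints satisfied:", np.all(y * (X @ w + b) >= 1 - 1e-6))
```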

Page 26: SVM (cont)

The classification function:
$$f(x) = w^t x + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$

Data in the input space are mapped into a higher-dimensional space, where linear separability holds: $x \mapsto \Phi(x)$.

Page 27: SVM (cont)

The optimization problem and the classification function are similar to the linear case; the scalar product is replaced by a kernel $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$:
$$\max_{\alpha} \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \langle \Phi(x_i), \Phi(x_j) \rangle \quad \text{s.t.} \quad \alpha_i \ge 0 \ \forall i \ \text{ and } \ \sum_i \alpha_i y_i = 0$$

$$f(x) = w^t \Phi(x) + b = \sum_i \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle + b$$
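
A minimal sketch of the kernel trick in action: an RBF SVM is fit, and its decision value $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ is recomputed by hand from the stored dual coefficients and support vectors, confirming that $\Phi$ never has to be formed explicitly. Data and kernel parameters are invented.

```python
# Kernel SVM: f(x) = sum_i alpha_i y_i K(x_i, x) + b, evaluated without
# ever constructing Phi(x). Data and parameters are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # non-linear boundary

svm = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

x_new = np.array([[0.2, 0.1]])
K = rbf_kernel(svm.support_vectors_, x_new, gamma=1.0)  # K(x_i, x)
f = svm.dual_coef_ @ K + svm.intercept_                 # sum_i alpha_i y_i K + b
print(f.item(), svm.decision_function(x_new).item())    # identical values
```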

Page 28: Predictive Learning: Critique

In principle, universal learning machines which could mimic natural processes and "learn" invariance from enough examples.

In practice, they lack a global organizing principle to confront:
  A very large number of classes (say 30,000)
  The small-sample dilemma
  The complexity of clutter
  Excessive computation

Page 29: Hierarchical Modeling

The world is very special: vision is only possible due to its hierarchical organization into common parts and sub-interpretations.

Determine common visual structure by:
  Clustering images;
  Information-theoretic criteria (e.g., mutual information) to select common patches (see the sketch below);
  Building classifiers (e.g., decision trees or multi-class boosting);
  Constructing grammars.
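
As the sketch referenced above: one concrete way to apply such information-theoretic criteria is to rank binary patch-presence features by mutual information with the class label and keep the top scorers as shared parts. Everything here (the data, the single informative patch) is synthetic.

```python
# Rank candidate patches by mutual information with the class label.
# The data are synthetic; patch 0 is built to track the label.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, 500)                    # object class per image
patches = rng.integers(0, 2, (500, 20))             # 20 "patch present" bits
patches[:, 0] = labels ^ (rng.random(500) < 0.1)    # patch 0: noisy copy of label

mi = [mutual_info_score(labels, patches[:, j]) for j in range(20)]
print("most informative patches:", np.argsort(mi)[::-1][:3])   # 0 ranks first
```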

Page 30: Hierarchical: Examples

Compositional vision: a "theory of reusable parts."
Hierarchies of image patches or fragments.
Algorithmic modeling: a coarse-to-fine representation of the computational process.

Page 31: Hierarchical Indexing

Coarse-to-fine modeling of both the interpretations and the computational process:
  Unites representation and processing.
  Proceed from broad scope with low power to narrow scope with high power.
  Concentrate processing on ambiguous areas.
  Evidence that coarse information is conveyed earlier than fine information in neural responses to visual stimuli.

Page 32: Hierarchical Indexing (cont)

Estimate $Y$ by exploring a family of binary tests $\{X_A : A \in H\}$, where $H$ is a hierarchy of nested partitions of $Y$, and $X_A$ is a binary test for $Y \in A$ vs. $Y \notin A$.

Index $D$: the explanations $y \in Y$ not ruled out by any test:
$$D = \{\, y \in Y : X_A = 1 \ \text{for every} \ A \in H \ \text{with} \ y \in A \,\}$$

Page 33: Hierarchical Indexing (cont)

A recursive partitioning of Y with four levels; there is a binary test for each of the 15 cells.

(A): Positive tests are shown in black.
(B): The index is the union of leaves 3 and 4.
(C): The "trace" of coarse-to-fine search.
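
A minimal sketch of this search on the same four-level, 15-cell example: a cell's subtree is explored only while its binary test is positive, and the index D collects the surviving leaves. The oracle test below is invented so that, as in panel (B), the index comes out as the union of leaves 3 and 4.

```python
# Coarse-to-fine search over a hierarchy of nested partitions: prune a whole
# subtree as soon as a cell's binary test is negative. The test is an
# invented oracle chosen so the index D = {3, 4}, matching panel (B) above.
def ctf_index(cell, children, test):
    if not test(cell):
        return set()                        # cell ruled out: skip its subtree
    kids = children(cell)
    if not kids:
        return set(cell)                    # surviving leaf joins the index D
    return set().union(*(ctf_index(k, children, test) for k in kids))

def children(cell):                         # split each cell in half; 4 levels
    c = sorted(cell)
    return [] if len(c) == 1 else [frozenset(c[:len(c)//2]), frozenset(c[len(c)//2:])]

truth = {3, 4}
test = lambda cell: bool(cell & truth)
print(ctf_index(frozenset(range(1, 9)), children, test))   # -> {3, 4}
```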

Page 34: When is CTF Optimal?

c(A) = cost, p(A) = power of the test for cell A in H.
c* = cost of a perfect test for a single hypothesis.

The mean cost of a sequential testing strategy T is
$$EC(T) = \sum_{A} c(A)\, q_A(T) + c^*\, E|D|$$
where $q_A(T)$ is the probability of performing test $X_A$.

THEOREM (G. Blanchard/DG): CTF is optimal if
$$\frac{c(A)}{p(A)} \le \frac{c(B)}{p(B)} \quad \text{for all } B \in C(A),$$
where C(A) = the direct children of A in H.
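
A Monte Carlo sketch of the mean-cost formula: the expected cost of CTF search on a complete binary hierarchy, under the simplifying (invented) assumptions that every test fires independently with probability p, every cell costs c, and each surviving hypothesis in D pays the perfect-test cost c*.

```python
# Estimate EC(T) = sum_A c(A) q_A(T) + c* E|D| by simulating CTF search.
# Independence of tests, and all numbers, are simplifying assumptions.
import random

def mean_cost(depth=3, c=1.0, c_star=5.0, p=0.3, trials=20000):
    def visit(level):
        # pay c to run this cell's test; a negative prunes the whole subtree
        if random.random() >= p:
            return c, 0
        if level == depth:
            return c, 1                          # surviving leaf enters D
        cl, dl = visit(level + 1)
        cr, dr = visit(level + 1)
        return c + cl + cr, dl + dr
    random.seed(0)
    runs = [visit(0) for _ in range(trials)]
    return sum(cost + c_star * d for cost, d in runs) / trials

print(mean_cost())   # cheap on average: most subtrees are pruned near the root
```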

Page 35: Density of Work

Figure panels: original image; spatial concentration of processing.

Page 36: Modeling vs. Learning: Variations on the Bias-Variance Dilemma

Reduce variance (dependence on L) by introducing the "right" biases (a priori structure), or by introducing more complexity?

Is dimensionality a "curse" or a "blessing"?

Hard-wiring vs. tabula rasa.

|L| small vs. large:
  "Credit" for learning with small L?
  Is the interesting limit |L| going to infinity or to zero?

Page 37: Conclusions

Automatic scene interpretation remains elusive.

However, there is growing success with particular object categories (e.g., vehicles and faces) and in many industrial applications (e.g., wafer inspection).

No dominant mathematical framework, and the “right” one is unclear.

Few theoretical results outside classification.

Page 38: Naïve Bayes

Map I to a feature vector X:
  Boolean edges
  Wavelet coefficients
  Interest points

Assume the components of X are conditionally independent given Y.

Learn the marginal distributions under object and background hypotheses from data.

Uniform prior P(Y).

Perform a likelihood-ratio test to detect objects against background.
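
A minimal sketch of this detector with Boolean edge features: the marginals are learned as empirical frequencies under each hypothesis, and the log-likelihood ratio decides, with threshold 0 matching the uniform prior. The training data and edge probabilities are invented.

```python
# Naive-Bayes detection: conditionally independent Boolean edges, marginals
# learned from data, then a likelihood-ratio test. All data are synthetic.
import numpy as np

rng = np.random.default_rng(4)
obj = rng.random((500, 32)) < 0.7                 # edges on object examples
bg = rng.random((500, 32)) < 0.2                  # edges on background

p_obj = obj.mean(axis=0).clip(1e-3, 1 - 1e-3)     # P(X_j = 1 | object)
p_bg = bg.mean(axis=0).clip(1e-3, 1 - 1e-3)       # P(X_j = 1 | background)

def log_lr(x):
    # sum over components of log P(x_j | object) - log P(x_j | background)
    return np.sum(np.where(x, np.log(p_obj / p_bg),
                              np.log((1 - p_obj) / (1 - p_bg))))

x = rng.random(32) < 0.7                          # an object-like test window
print("detect object:", log_lr(x) > 0)            # 0 = uniform-prior threshold
```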