layered scene representations vision for graphics cse 590ss, winter 2001 richard szeliski

Layered Scene Representations

Vision for Graphics
CSE 590SS, Winter 2001

Richard Szeliski

2/12/2001 Vision for Graphics 2

Motion representations

How can we describe this scene?


Block-based motion prediction

• Break image up into square blocks
• Estimate translation for each block
• Use this to predict next frame, code difference

(MPEG-2)
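The three steps above can be sketched as follows. This is a minimal illustration using NumPy and an exhaustive sum-of-absolute-differences search; the function names, the 8-pixel block size and the ±4 pixel search range are assumptions made for the example, not MPEG-2 parameters.

```python
import numpy as np

def block_motion(prev, curr, block=8, search=4):
    """Estimate one integer translation (dy, dx) per block by exhaustive SAD search."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    h, w = curr.shape
    motion = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            target = curr[y0:y0 + block, x0:x0 + block]
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ys, xs = y0 + dy, x0 + dx
                    # Skip candidate blocks that fall outside the previous frame.
                    if ys < 0 or xs < 0 or ys + block > h or xs + block > w:
                        continue
                    sad = np.abs(target - prev[ys:ys + block, xs:xs + block]).sum()
                    if best is None or sad < best:
                        best = sad
                        motion[by, bx] = (dy, dx)
    return motion

def predict(prev, motion, block=8):
    """Build the predicted frame by copying each matched block from the previous frame."""
    pred = np.zeros_like(prev, dtype=float)
    for by in range(motion.shape[0]):
        for bx in range(motion.shape[1]):
            dy, dx = motion[by, bx]
            y0, x0 = by * block, bx * block
            pred[y0:y0 + block, x0:x0 + block] = \
                prev[y0 + dy:y0 + dy + block, x0 + dx:x0 + dx + block]
    return pred
```

The difference `curr - predict(prev, motion)` is the residual that a coder would then encode.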


Layered motion

Break image sequence up into “layers”:

=

Describe each layer’s motion


Outline

• Why layers?
• 2-D layers [Wang & Adelson 94; Weiss 97]
• 3-D layers [Baker et al. 98]
• Layered Depth Images [Shade et al. 98]
• Transparency [Szeliski et al. 00]
• Bayesian estimation [Torr et al. 99]


Layered motion

Advantages:
• can represent occlusions / disocclusions
• each layer’s motion can be smooth
• video segmentation for semantic processing

Difficulties:
• how do we determine the correct number of layers?
• how do we assign pixels?
• how do we model the motion?


Layers for video summarization


Background modeling (MPEG-4)

Convert masked images into a background sprite for layered video coding

+ + +

=


What are layers?

[Wang & Adelson, 1994]

• intensities
• alphas
• velocities


How do we composite them?
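One common answer is back-to-front compositing with the standard "over" operator. The sketch below assumes that choice, and assumes each layer has already been warped by its velocity into the current frame; the function name `composite` and the (intensity, alpha) pairing are illustrative.

```python
import numpy as np

def composite(layers):
    """Composite (intensity, alpha) layers ordered from back to front.

    Each layer is assumed to be already warped into the output frame.
    """
    out = np.zeros_like(layers[0][0], dtype=float)
    for intensity, alpha in layers:
        # Standard "over": the nearer layer hides what is behind it where alpha is 1.
        out = alpha * intensity + (1.0 - alpha) * out
    return out
```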


How do we form them?


How do we estimate the layers?

1. compute coarse-to-fine flow
2. estimate affine motion in blocks (regression)
3. cluster with k-means
4. assign pixels to best fitting affine region
5. re-estimate affine motions in each region…
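Steps 2 to 4 can be sketched as below, assuming a dense flow field (`flow_u`, `flow_v`) is already available from step 1. The six-parameter layout, the plain Lloyd-style k-means in parameter space and all function names are assumptions for illustration; the published method may differ in its details.

```python
import numpy as np

def fit_affine(xs, ys, u, v):
    """Least-squares affine motion: u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y."""
    A = np.column_stack([np.ones_like(xs), xs, ys])
    au, *_ = np.linalg.lstsq(A, u, rcond=None)
    av, *_ = np.linalg.lstsq(A, v, rcond=None)
    return np.concatenate([au, av])

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on the rows of `points` (here: per-block affine parameters)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(0)
    return centers, labels

def assign_pixels(flow_u, flow_v, models):
    """Label each pixel with the affine model that best predicts its flow vector."""
    h, w = flow_u.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    errors = []
    for a in models:
        pred_u = a[0] + a[1] * xs + a[2] * ys
        pred_v = a[3] + a[4] * xs + a[5] * ys
        errors.append((flow_u - pred_u) ** 2 + (flow_v - pred_v) ** 2)
    return np.argmin(errors, axis=0)
```

Step 5 would call `fit_affine` again on the pixels of each label and repeat the assignment.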


Layer synthesis

For each layer:
• stabilize the sequence with the affine motion
• compute median value at each pixel

Determine occlusion relationships
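The median step is short enough to show directly. The sketch assumes the frames have already been stabilized (warped) with the layer's affine motion; `synthesize_layer` is an illustrative name.

```python
import numpy as np

def synthesize_layer(stabilized_frames):
    """Per-pixel temporal median over frames already aligned to one layer."""
    return np.median(np.stack(stabilized_frames), axis=0)
```

The median is used because pixels belonging to other layers appear only briefly at any stabilized location and are rejected as outliers.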


Results


What if the motion is not affine?

Use a “regularized” (smooth) motion field [Weiss, CVPR’97]

A Layered Approach To Stereo Reconstruction

Simon Baker, Richard Szeliski and P. Anandan

CVPR’98


Volumetric Approaches to Stereo

Examples:
• Disparity-Spaces [Intille and Bobick, ‘94] [Scharstein and Szeliski, ‘96]
• Space-Coloring [Seitz and Dyer, ‘97]
• Maximum-Flow Stereo [Roy and Cox, ‘98]

Advantages:
• Modeling occlusions [Intille and Bobick, ‘94]
• Mixed pixels + transparency [Szeliski and Golland, ‘98]
• Equal treatment of many images [Collins, ‘96]

[Figure: Camera 1 and Camera 2 viewing a volume with axes x, y, z]


2.5-D Layered Approach

Additional advantages over volumetric approaches:
• Fewer degrees of freedom
• Fewer resampling artifacts
• Robustness of global model + local correction
  – cf. “Plane + Parallax” and “Model-Based Stereo”
• Output particularly suitable for certain applications
  – e.g. image-based rendering and interactive editing

[Figure: Camera 1 and Camera 2 viewing Layer 1, Layer 2, and Layer 3]


Layered Stereo

Use arbitrarily oriented sprites

Estimate 3D plane equation for each sprite

layers (“sprites”)


Layer Representation

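• Coordinate frame defined by $\mathbf{u} = (u, v, 1)^T = Q_l \mathbf{x}$
• Layer sprite $L_l = (\alpha \cdot r, \alpha \cdot g, \alpha \cdot b, \alpha)$
• Residual depth $Z_l$
• World point $\mathbf{x} = (x, y, z, 1)^T$
• Plane vector $n_l = (n_x, n_y, n_z, n_d)^T$
• Plane equation $n_l \cdot \mathbf{x} = 0$

[Figure labels: world origin; layer axes u, v]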


Image Formation

[Figure labels: Scene; Camera $P_k$; Image $I_k$; Boolean mask $B_{kl}$; Masked image $M_{kl}$; Layer $l$; image axes u, v]


Overview

Input: images $I_k$ and cameras $P_k$

Initialize layer assignment $B_{kl}$

Estimate plane vectors $n_l$

Estimate sprite images $L_l$

Estimate residual depth $Z_l$

Re-assign pixels to layers $B_{kl}$

Refine layer sprites $L_l$

Output: $n_l$, $L_l$, and $Z_l$


Layer Initialization Alternatives

Iterate dominant motion estimation
• e.g. [Irani et al., ‘95]

Apply simple stereo algorithm + fit planes

Color segmentation
• e.g. [Sawhney and Ayer, ‘94]

Human initialization
• e.g. [Debevec et al., ‘96]


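Estimation of Plane Equations

[Figure: cameras $P_i$, $P_j$, $P_k$ with masked images $M_{il}$, $M_{jl}$, $M_{kl}$ of layer $l$; homographies $H^l_{ij}$ and $H^l_{ik}$ produce the warped images $H^l_{ij} \circ M_{jl}$ and $H^l_{ik} \circ M_{kl}$]

Warped images $H^l_{ij} \circ M_{jl}$, $H^l_{ik} \circ M_{kl}$, … are functions of $n_l$ only

Minimize image variance using hierarchical gradient descent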
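To see why the warps depend only on the plane once the cameras are known, the sketch below builds the homography induced by a plane between two 3×4 cameras. It uses a standard construction (stack the source camera with the plane vector and invert), which is an assumption of this example rather than the formula from the slides; `plane_homography` and its argument names are illustrative, and the direction convention may differ from $H^l_{ij}$.

```python
import numpy as np

def plane_homography(P_src, P_dst, n):
    """Homography mapping pixels of the source camera to the destination camera
    for world points on the plane n . x = 0.

    P_src, P_dst: 3x4 camera matrices; n: plane 4-vector.
    Requires that the source camera center does not lie on the plane.
    """
    # For x on the plane, [P_src; n^T] x = (u, v, w, 0)^T, so x can be
    # recovered from the source pixel by inverting the stacked 4x4 matrix.
    stacked = np.vstack([P_src, n])
    back_project = np.linalg.inv(stacked)[:, :3]
    return P_dst @ back_project
```

Only `n` changes during plane estimation, so the image variance of the warped masked images can be minimized over the plane parameters alone.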


Estimation of Layer Sprites

[Figure: cameras $P_i$, $P_j$, $P_k$ with masked images $M_{il}$, $M_{jl}$, $M_{kl}$ and plane $n_l$]

“Blend” the masked images, warped onto the layer plane


Estimation of Residual Depth

Per-pixel residual depth estimation
• plane plus parallax [Anandan et al.]
• model-based stereo [Debevec et al.]
• better accuracy / fidelity
• makes forward warping more difficult


Estimation of Residual Depth

[Figure: cameras $P_i$, $P_j$, $P_k$ with masked images $M_{il}$, $M_{jl}$, $M_{kl}$ and the perturbed plane $n_l + (0, 0, 0, d)^T$]

Warp masked images onto perturbed plane

Compute variance image

For each pixel, choose d that minimizes variance

Smooth, incorporating confidence weighting [Szeliski & Golland, ‘98]

Recompute sprite using “Plane + Parallax” warp
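The per-pixel selection step can be sketched as follows. The warping onto each perturbed plane is assumed to have been done already, and the smoothing with confidence weighting is omitted; the array layout `(D, K, H, W)` and the name `select_residual_depth` are assumptions for this example.

```python
import numpy as np

def select_residual_depth(warped, candidates):
    """Pick, per pixel, the plane offset d whose warped views agree best.

    warped: array of shape (D, K, H, W), the K masked images warped onto
            each of the D perturbed planes n_l + (0, 0, 0, d)^T.
    candidates: the D offset values d.
    Returns the chosen offset per pixel and the variance achieved.
    """
    variance = warped.var(axis=1)      # variance across views: (D, H, W)
    best = variance.argmin(axis=0)     # index of the best offset: (H, W)
    return np.asarray(candidates)[best], variance.min(axis=0)
```

The returned minimum variance could serve as a per-pixel confidence in the later smoothing step.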


Pixel Assignment

[Figure: camera $P_i$, masked image $M_{il}$, plane $n_l$, sprite $L_l$, and the un-warped difference image]

• Warp masked image onto each layer plane
• Compute difference images
• Un-warp difference images
• For each pixel, choose the best difference across layers
• Smooth pixel assignment


Flower Garden Results

Initial Segmentation

Image 1 Image 9

Grey coded planar depth


Flower Garden Results

Recovered Sprite: Without residual depth estimation

Recovered Sprite: With residual depth estimation


Graphics Symposium Results

Image 1 of 5 Initial segmentation

Grey coded planar depth Residual depth



Resulting sprite collection



Original image 3 Re-synthesized image 3

Novel view without residual depth Novel view with residual depth


Layered Stereo Demo

SpriteViewer: renders sprites with depth


Discussion

Layer initialization:
• Can tolerate bad initial plane estimates
• Residual depth estimation:
  – Plane sweep algorithm, similar to [Szeliski and Golland, ‘98]

Pixel assignment:
• Combine color and residual depth estimates
• Currently under investigation


Summary

New approach to stereo matching:
• represent scene as collection of layers
• each layer has a 3-D plane equation, an alpha-matted color image, and an optional residual depth
• generalizes layered motion to 3-D

Computation:
• plane eqns. by warping mosaics of masked images
• residual depth by perturbing planes
• iteratively refine color values and pixel assignments

Layered Depth Images

Jonathan Shade, Steven Gortler, Li-wei He, Richard Szeliski

SIGGRAPH’98


How to render a layer + parallax?

Can’t use inverse warping [Laveau 94]


3D Sprites with Depth

3D sprite consists of:
• alpha-matted image $I_1(x_1, y_1)$
• 4×4 camera matrix $C_1$: $[w_1 x_1 \;\; w_1 y_1 \;\; w_1 d_1 \;\; w_1]^T = C_1 [X \;\; Y \;\; Z \;\; 1]^T$
• plane equation $AX + BY + CZ + D = 0$ (forms third row of $C_1$)
• optional per-pixel depth $d_1(x_1, y_1)$


Sprites with Depth

Store d1(x1,y1) (scaled displacement) along with each sprite image I1(x1,y1)

I1 d1 I1 d1


3D Sprites — Reprojection

sprites new view

use standard texture mapping (projective warp)


Forward Mapping

Mapping equation with per-pixel depth $d_1$:
$[w_2 x_2 \;\; w_2 y_2 \;\; w_2]^T = H_{1,2} [x_1 \;\; y_1 \;\; 1]^T + d_1 e_{1,2}$

I1 d1 (I2 ) I2

Problems: gaps and aliasing
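The mapping equation and the gap problem can be illustrated with a small forward warp. This is a minimal sketch with nearest-pixel splatting and no z-buffering or filtering, both simplifying assumptions; `forward_map` and `forward_warp` are illustrative names.

```python
import numpy as np

def forward_map(x1, y1, d1, H12, e12):
    """Apply [w2*x2, w2*y2, w2]^T = H12 [x1, y1, 1]^T + d1 * e12."""
    p = H12 @ np.array([x1, y1, 1.0]) + d1 * e12
    return p[0] / p[2], p[1] / p[2]

def forward_warp(image, depth, H12, e12):
    """Splat every source pixel to its nearest destination pixel.

    Returns the warped image and a boolean mask of destination pixels that
    received a sample; False entries inside the image are gaps.
    No z-buffering: later samples simply overwrite earlier ones.
    """
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    hit = np.zeros((h, w), dtype=bool)
    for y1 in range(h):
        for x1 in range(w):
            x2, y2 = forward_map(x1, y1, depth[y1, x1], H12, e12)
            xi, yi = int(round(x2)), int(round(y2))
            if 0 <= xi < w and 0 <= yi < h:
                out[yi, xi] = image[y1, x1]
                hit[yi, xi] = True
    return out, hit
```

With a homography that stretches the image, neighbouring source pixels land more than one pixel apart, which leaves unfilled destination pixels.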


Inverse Mapping

Reverse order of images 1 & 2:
$[w_1 x_1 \;\; w_1 y_1 \;\; w_1]^T = H_{2,1} [x_2 \;\; y_2 \;\; 1]^T + d_2 e_{2,1}$

I1 (I2) d2 I2

Problem: we don’t know d2!


Crude perspective map

How to map d1 → d2?

Simple idea: use perspective transform H2,1

I1 d1 d2 I2

Works well for small amounts of motion


Better forward map

How to map d1 → d2?

Better idea: use full $H_{1,2} x_1 + d_1 e_{1,2}$ forward map

I1 d1 d2 I2

Works better for moderate amounts of motion


2-pass Mapping

Why is 2-pass mapping (d1 → d2 forward followed by I1 → I2 backward) a good idea?
• can tolerate bigger errors in d1 mapping (since d1 is typically smooth)
• can store/process d1 at lower resolution
• can use better filtering on color image


Sprites with Depth — Demo

Demo


Refinements

Only forward map d1 with parallax component

Use affine approximation to parallax flow

Better gap filling

Forward map (u,v) flow instead of d1 depth


Layered Depth Images (LDIs)

Store multiple (color,z) values at each pixel

Similar to [sparse] volumetric representation

Render with forward warp (splat)
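A minimal data-structure sketch of the "multiple (color, z) values at each pixel" idea is shown below. The class and method names are hypothetical, smaller z is assumed to be nearer the camera, and the splat-based rendering is not shown.

```python
from dataclasses import dataclass, field

@dataclass
class LayeredDepthImage:
    width: int
    height: int
    pixels: list = field(init=False)

    def __post_init__(self):
        # One list of (z, color) samples per pixel, initially empty.
        self.pixels = [[[] for _ in range(self.width)] for _ in range(self.height)]

    def add(self, x, y, color, z):
        """Insert a sample and keep the pixel's samples sorted front to back."""
        samples = self.pixels[y][x]
        samples.append((z, color))
        samples.sort(key=lambda s: s[0])

    def front(self, x, y):
        """Return the nearest (z, color) sample, or None if the pixel is empty."""
        samples = self.pixels[y][x]
        return samples[0] if samples else None
```

Because most pixels hold only one or two samples, storage stays close to that of an ordinary image, which is the sense in which the representation is a sparse volume.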

Layer extraction from multiple images containing reflections and transparency

Richard Szeliski, Shai Avidan, P. Anandan

CVPR’2000


Transparent motion

Photograph (Lee) and reflection (Michael)


Previous work

Physics-based vision and polarization [Shafer et al.; Wolff; Nayar et al.]

Perception of transparency [Adelson…]

Transparent motion estimation [Shizawa & Mase; Bergen et al.; Irani et al.; Darrell & Simoncelli]

3-frame layer recovery [Bergen et al.]


Problem formulation

[Figure: each observed frame $i$ is Motion$_{X,i}$(X) + Motion$_{Y,i}$(Y), the sum of the two moved layers X and Y]


Image formation model

Pure additive mixing of positive signals

$m_k(x) = \sum_l W_{kl} f_l(x)$

or

$m_k = \sum_l W_{kl} f_l$

Assume motion is planar (perspective transform, aka homography)
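The additive model can be simulated directly. As a simplifying assumption, the sketch below replaces the planar (homography) warps $W_{kl}$ with cyclic integer translations via `np.roll`; `mix_layers` and the `shifts` layout are illustrative.

```python
import numpy as np

def mix_layers(layers, shifts):
    """Form m_k = sum_l W_kl f_l, with W_kl a cyclic integer shift (dy, dx).

    layers: list of 2-D non-negative arrays f_l.
    shifts: for each frame k, a list with one (dy, dx) per layer.
    """
    frames = []
    for frame_shifts in shifts:
        m = np.zeros_like(layers[0], dtype=float)
        for f, (dy, dx) in zip(layers, frame_shifts):
            m += np.roll(f, (dy, dx), axis=(0, 1))
        frames.append(m)
    return frames
```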


Two processing stages

Estimate the motions and initial layer estimates

Compute optimal layer estimates (for known motion)


Dominant motion estimation

Stabilize sequence by dominant motion

robust affine [Bergen et al. 92; Szeliski & Shum]


Dominant layer estimate

How do we form composite (estimate)?

[Plot: intensity vs. time]


Average?

[Plot: intensity vs. time]


Median?

Hint: all layers are non-negative

[Plot: intensity vs. time]


Min-composite

Smallest value is over-estimate of layer

[Plot: intensity vs. time]
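Because every layer is non-negative, the temporal minimum of the stabilized frames can never fall below the dominant layer. A minimal sketch, assuming the frames are already stabilized on the dominant motion (function names are illustrative):

```python
import numpy as np

def min_composite(stabilized_frames):
    """Per-pixel minimum over time: an over-estimate of the dominant layer."""
    return np.min(np.stack(stabilized_frames), axis=0)

def difference_sequence(stabilized_frames, composite):
    """Subtract the min-composite from each frame; the result is non-negative."""
    return [frame - composite for frame in stabilized_frames]
```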


Difference sequence

Subtract min-composite from original image

original - min composite = difference image


Min composite

[Plot: intensity vs. time]

(overestimate of background layer)


Difference sequence

[Plot: intensity vs. time]

(underestimate of foreground layer)


Stabilizing secondary motion

[Plot: intensity vs. time]

How do we form composite (estimate)?


Max-composite

[Plot: intensity vs. time]

Largest value is under-estimate of layer


Min-max alternation

Subtract secondary layer (under-estimate) from original sequence

Re-compute dominant motion and better min-composite

Iterate …

Does this process converge?
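The alternation can be sketched on 1-D signals. This is a toy version under strong assumptions: the frames are already stabilized on the dominant layer, the secondary layer moves by known cyclic integer shifts, and no motion re-estimation is performed; all names are illustrative.

```python
import numpy as np

def min_max_alternation(frames, fg_shifts, iterations=3):
    """Alternate min-composites (background) and max-composites (foreground).

    frames: 1-D arrays stabilized on the dominant (background) layer.
    fg_shifts: known integer cyclic shift of the secondary layer per frame.
    Returns (over-estimate of background, under-estimate of foreground).
    """
    frames = [np.asarray(f, dtype=float) for f in frames]
    fg_under = np.zeros_like(frames[0])
    for _ in range(iterations):
        # Subtract the current foreground under-estimate, then take the min.
        cleaned = [f - np.roll(fg_under, s) for f, s in zip(frames, fg_shifts)]
        bg_over = np.min(cleaned, axis=0)
        # Difference sequence, stabilized on the secondary motion, then the max.
        diffs = [np.roll(f - bg_over, -s) for f, s in zip(frames, fg_shifts)]
        fg_under = np.max(diffs, axis=0)
    return bg_over, fg_under
```

In this noise-free setting with exact integer shifts the bounds can only tighten from one iteration to the next.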


Min-max alternation

Does this process converge?
• in theory: yes
• each iteration reduces number of mis-estimated pixels (tightens the bounds); proof in paper


Min-max alternation

Does this process converge?
• in practice: no
• resampling errors and noise both lead to divergence; discussion in paper

[Figure panels: resampling error; noisy]


Two processing stages

Estimate the motions and initial layer estimates

Compute optimal layer estimates (for known motion)


Optimal estimation

Recall: additive mixing of positive signals

$m_k = \sum_l W_{kl} f_l$

Use constrained least squares (quadratic programming)

$\min \sum_k \left| \sum_l W_{kl} f_l - m_k \right|^2 \quad \text{s.t.} \quad f_l \ge 0$
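If the warps $W_{kl}$ are stacked into one matrix `A` and the observed images $m_k$ into one vector `b`, the problem becomes a non-negative least-squares problem in the stacked layer vector `f`. The sketch below solves it with projected gradient descent, which is a simple stand-in chosen for this example and not necessarily the quadratic-programming solver used in the paper; the function name and iteration count are assumptions.

```python
import numpy as np

def nonneg_least_squares(A, b, iterations=2000):
    """Minimize |A f - b|^2 subject to f >= 0 by projected gradient descent."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    # Step 1/L, where L = ||A||_2^2 bounds the curvature of 0.5 * |A f - b|^2.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    f = np.zeros(A.shape[1])
    for _ in range(iterations):
        gradient = A.T @ (A @ f - b)
        # Gradient step followed by projection onto the non-negative orthant.
        f = np.maximum(0.0, f - step * gradient)
    return f
```

For image-sized problems `A` is very large and sparse, so a practical solver would exploit that structure instead of forming dense matrices.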


Least squares example

[Figure panels: background; foreground]

blue: least squares

red: constrained LS


Uniqueness of solution

If any layer does not have a “black” region, i.e., if $f_l \ge c$ for some constant $c > 0$, then we can add this offset to another layer (and subtract it from $f_l$)

[Figure panels: background; foreground]


Degeneracies in solution

If motion is degenerate (e.g., horizontal), regions (scanlines) decouple (w/o MRF)

[Figure panels: mixed; recovered; scaled errors]


Noise sensitivity

In general, low-frequency components hard to recover for small motions

[Figure panels: mixed; recovered; scaled errors]


Three-layer example

3 layers with general motion work well

= + +


Complete algorithm

Dominant motion with min-composites

Difference (residual) images

Non-dominant motion on differences

Improve the motion estimates

Unconstrained least-squares problem

Constrained least-squares problem


Complete example

[Figure panels: original; stabilized; min-composite]


Complete example

[Figure panels: difference; stabilized; max-composite]


Final Results

= +


Another example

original stabilized min-comp. resid. 2


Results: Anne and books

[Figure: original = background + foreground (photo)]


Transparent layer recovery

Pure (additive) mixing of intensities
• simple constrained least squares problem
• degeneracies for simple or small motions

Processing stages
• dominant motion estimation
• min- and max-composites to initialize
• optimization of motion and layers


Future work

Mitigating degeneracies (regularization)

Opaque layers (α estimation)

Non-planar geometry (parallax)


Bibliography

J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing, 3(5):625--638, September 1994.

Y. Weiss and E. H. Adelson. A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'96), pages 321--326, San Francisco, California, June 1996.

Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'97), pages 520--526, San Juan, Puerto Rico, June 1997.

P. R. Hsu, P. Anandan, and S. Peleg. Accurate computation of optical flow by using layered motion representations. In Twelfth International Conference on Pattern Recognition (ICPR'94), pages 743--746, Jerusalem, Israel, October 1994. IEEE Computer Society Press


Bibliography

T. Darrell and A. Pentland. Cooperative robust estimation using layers of support. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5):474--487, May 1995.

S. X. Ju, M. J. Black, and A. D. Jepson. Skin and bones: Multi-layer, locally affine, optical flow and regularization with transparency. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'96), pages 307--314, San Francisco, California, June 1996.

M. Irani, B. Rousso, and S. Peleg. Computing occluding and transparent motions. International Journal of Computer Vision, 12(1):5--16, January 1994.

H. S. Sawhney and S. Ayer. Compact representation of videos through dominant multiple motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):814--830, August 1996.

M.-C. Lee et al. A layered video object coding system using sprite and affine motion model. IEEE Transactions on Circuits and Systems for Video Technology, 7(1):130--145, February 1997.


Bibliography

S. Baker, R. Szeliski, and P. Anandan. A layered approach to stereo reconstruction. In IEEE CVPR'98, pages 434--441, Santa Barbara, June 1998.

R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In IEEE CVPR'2000, volume 1, pages 246--253, Hilton Head Island, June 2000.

J. Shade, S. Gortler, L.-W. He, and R. Szeliski. Layered depth images. In Computer Graphics (SIGGRAPH'98) Proceedings, pages 231--242, Orlando, July 1998. ACM SIGGRAPH.

S. Laveau and O. D. Faugeras. 3-d scene representation as a collection of images. In Twelfth International Conference on Pattern Recognition (ICPR'94), volume A, pages 689--691, Jerusalem, Israel, October 1994. IEEE Computer Society Press.

P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. In Seventh ICCV'99, pages 983--990, Kerkyra, Greece, September 1999.
