passive stereo vision with deep learning

Passive Stereo Vision: From Traditional to Deep Learning-based Methods

YU HUANG

SUNNYVALE, CALIFORNIA

[email protected]

Outline

• Modeling from multiple views

• Stereo matching

• constraints in stereo vision

• difficulties in stereo vision

• pipeline of stereo matching

• state-of-the-art methods

• Quality metric of stereo matching

• census transform and hamming distance

• guided filter in cost aggregation (volume)

• semi-global matching

• ELAS: efficient large scale stereo

• stereo matching as energy minimization

• dynamic programming

• graph cut

• belief propagation

• phase matching for stereo vision

• disparity refinement

• Multiple cameras/views

• Learning sparse representations of depth maps

• Stereopsis via deep learning

• Deep learning of depth (and motion)

• Stereo matching by CNN

• Appendix A: Depth learning from an image

• Appendix B: Learning and optimization

Modeling from Multiple Views in Computer Vision

[Figure: modeling setups arranged by time and number of cameras: photograph, binocular stereo, trinocular stereo, multi-baseline stereo, camcorder, human vision, camera dome, two frames, ...]

Binocular Stereo

• Given a calibrated binocular stereo pair, fuse it to produce a depth image

[Figure: image 1, image 2 and the resulting dense depth map. Photograph: Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923]

Basic Stereo Matching Algorithm

• For each pixel in the first (left) image:

◦ Find the corresponding epipolar line in the right image

◦ Examine all pixels on the epipolar line and pick the best match

◦ Triangulate the matches to get depth information

• Simplest case: epipolar lines are corresponding scanlines;

• If necessary, rectify the two stereo images to transform epipolar lines into scanlines

Depth from Disparity

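[Figure: two cameras with optical centers O and O’, baseline B and focal length f; the scene point X at depth z projects to x and x’]

$$\text{disparity} = x - x' = \frac{B \cdot f}{z}$$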

Disparity is inversely proportional to depth!

• Stereo camera calibration (focal length and baseline are known)

• Image planes of cameras parallel to each other and to the baseline

• Camera centers are at same height

• Focal lengths are the same

• Then, epipolar lines fall along the horizontal scan lines of the images
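
Under the conditions listed above, depth follows from disparity by inverting the relation disparity = B·f / z. A minimal sketch, assuming the focal length is given in pixels and the baseline in meters; the function name and the rig values are hypothetical:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth z = f * B / d for a rectified stereo pair (d in pixels)."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 700-pixel focal length, 0.12 m baseline.
for d in (70.0, 35.0, 7.0):
    z = depth_from_disparity(d, focal_px=700.0, baseline_m=0.12)
    print(f"disparity {d:5.1f} px -> depth {z:.2f} m")
```

Halving the disparity doubles the depth, which is why depth resolution degrades quickly for distant points.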

Calibration

• Find the intrinsic and extrinsic parameters of a camera

◦ Extrinsic parameters: the camera’s location and orientation in the world.

◦ Intrinsic parameters: the relationships between pixel coordinates and camera coordinates.

• The work of Roger Tsai (3D calibration targets) and of Zhengyou Zhang (2D planar patterns) is influential

• Basic idea:

◦ Given a set of world points Pi and their image coordinates (ui,vi)

◦ find the projection matrix M

◦ and then find the intrinsic and extrinsic parameters.

• Calibration techniques

◦ Calibration using a 3D calibration object

◦ Calibration using a 2D planar pattern

◦ Calibration using a 1D object (line-based calibration)

◦ Self-calibration: no calibration objects

◦ Vanishing points for orthogonal directions

Calibration

• Calibration using 3D calibration object:

◦ Calibration is performed by observing a calibration object whose geometry in 3D space is known with very good precision.

◦ Calibration object usually consists of two or three planes orthogonal to each other, e.g. calibration cube

◦ Calibration can also be done with a plane undergoing a precisely known translation (Tsai approach)

◦ (+) most accurate calibration, simple theory

◦ (-) more expensive, more elaborate setup

• 2D plane-based calibration (Zhang approach)

◦ Requires observation of a planar pattern at a few different orientations

◦ No need to know the plane motion

◦ Set up is easy, most popular approach

◦ Seems to be a good compromise.

• 1D line-based calibration:

◦ Relatively new technique.

◦ Calibration object is a set of collinear points, e.g., two points with known distance, three collinear points with known distances, four or more…

◦ Camera can be calibrated by observing a moving line around a fixed point, e.g. a string of balls hanging from the ceiling!

◦ Can be used to calibrate multiple cameras at once. Good for network of cameras.

Fundamental Matrix

Let p be a point in the left image and p’ the corresponding point in the right image

Epipolar relation

◦ p maps to epipolar line l’

◦ p’ maps to epipolar line l

The epipolar mapping is described by a 3x3 matrix F:

$$l' = F p, \qquad l = F^\top p'$$

It follows that

$$p'^\top F p = 0$$

This matrix F is called

• the “Essential Matrix”

– when image intrinsic parameters are known

• the “Fundamental Matrix”

– more generally (uncalibrated case)

Can solve for F from point correspondences

• Each (p, p’) pair gives one linear equation in the entries of F

• 8 points are enough in the uncalibrated case (8-point algorithm); 5 points are enough in the calibrated, essential-matrix case (5-point algorithm)

Planar Rectification

• Bring the two views to the standard stereo setup (moves the epipole to infinity)

• Not possible when the epipole is in or close to the image

• Calibrated case: image size approximately preserved

• Uncalibrated case: distortion minimization

Polar Rectification

• Polar re-parameterization around the epipoles

• Requires only (oriented) epipolar geometry

• Preserves the length of epipolar lines

• Choose the angular step so that no pixels are compressed

• Works for all relative motions

• Guarantees minimal image size

• Determine the common region from the extremal epipolar lines and the location of the epipole: $e'^\top F = 0$

• Select half epipolar lines moving around the epipole

• Construct the rectified image line by line

[Figure: original image and rectified image]

Correspondence Search

[Figure: left and right images with a common scanline; plot of matching cost against disparity]

• Slide a window along the right scanline and compare contents of that window with the reference window in the left image

• Matching cost: SSD or normalized correlation
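
The window search described above can be sketched for a single pixel of one rectified scanline. This is a minimal illustration using SSD as the cost; the function names, the window size and the toy intensities are assumptions, and disparity is taken as x − x’ (the match lies to the left in the right image):

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_scanline(left, right, x, half_win, max_disp):
    """Best disparity for pixel x of the left scanline by SSD window search.

    Assumes x - half_win >= 0 and x + half_win < len(left).
    """
    ref = left[x - half_win : x + half_win + 1]
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        xr = x - d  # candidate position in the right scanline
        if xr - half_win < 0:
            break  # window would leave the image
        cand = right[xr - half_win : xr + half_win + 1]
        cost = ssd(ref, cand)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Toy scanlines: the left one is the right one shifted by 2 pixels.
right = [10, 10, 50, 80, 50, 10, 10, 10, 10, 10]
left = [10, 10, 10, 10, 50, 80, 50, 10, 10, 10]
print(match_scanline(left, right, x=5, half_win=1, max_disp=3))
```

Normalized correlation could replace `ssd` here when the two images differ in gain or offset.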

Constraints in Stereo Vision

• Color constancy

◦ Lambertian surface assumption;

• Epipolar geometry

◦ Scanline as epipolar line for a rectified pair;

• Uniqueness

◦ For any point in one image, there should be at most one matching point in the other image;

• Ordering

◦ Corresponding points should be in the same order in both views;

• Smoothness

◦ Disparities change slowly (for the most part).

[Figure: epipolar plane with the epipolar lines for p and p’; illustrations of the uniqueness and ordering constraints]

Difficulties in Stereo Vision

• Photometric distortions and noise;

• Foreshortening;

• Perspective distortions;

• Uniform/ambiguous regions;

• Repetitive/ambiguous patterns;

• Transparent objects;

• Occlusions and discontinuities.

Pipeline of Stereo Matching Methods

• Pre-processing: compensate for photometric distortion;

• LoG, census transform, phase-only (DCT or WT), histogram equalization/matching, isotropic diffusion, …

• Cost computation:

• Absolute difference, squared difference, weighted difference, SAD, SSD, SWD, ZMNCC, …

• Cost aggregation:

• Bilateral filter, guided filter, non local, segment tree,...

• Disparity computation/optimization

• Integral image, box filtering, …

• Local (fast), global (slow), semi-global, …

• Disparity refinement

• Sub pixel interpolation, median filter, cross check (left-right consistency check) and occlusion filling.

State-of-the-Art Stereo Matching Methods

• Local method

◦ Look at one image patch at a time

◦ Solve many small problems independently

◦ Faster, less accurate, usually works for high texture

◦ Needs enough texture in a patch to disambiguate

• Global method

◦ Look at the whole image

◦ Solve one large problem

◦ Slower, more accurate, works up to medium texture

◦ Propagates estimates from textured to untextured regions

• Sparse point-based method

◦ Still works for low textured regions, hard to handle ambiguous regions

• Semi-global method

◦ SGM (semi-global matching): reduces the 2D search to 1D searches along 8 or 16 directions.

Quality Metrics in Stereo Matching (Passive)

• General objective approaches:

• Compute error statistics w.r.t. some ground truth data;

• RMS (root-mean-squared) error (in disparity units) between the computed disparity dC (x, y) and the ground truth dT (x, y);

• Percentage of bad matching pixels;

• Select the following areas to support the analysis of matching results:

• textureless regions;

• occluded regions;

• depth discontinuity regions.

• Evaluate synthetic image by warping the reference with disparity map;

• Forward warp the reference image by the computed disparity map;

• Inverse warp a new view by the computed disparity map.

• Subjective evaluation
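
The two error statistics named above can be sketched as follows. This is a minimal illustration on flattened disparity maps; the function names, the sample values and the bad-pixel threshold of 1 disparity unit are assumptions:

```python
import math


def rms_error(d_c, d_t):
    """Root-mean-squared error between computed and ground-truth disparities."""
    return math.sqrt(sum((c - t) ** 2 for c, t in zip(d_c, d_t)) / len(d_c))


def bad_pixel_rate(d_c, d_t, threshold=1.0):
    """Fraction of pixels whose absolute disparity error exceeds the threshold."""
    bad = sum(1 for c, t in zip(d_c, d_t) if abs(c - t) > threshold)
    return bad / len(d_c)


computed = [10, 12, 15, 20]  # hypothetical computed disparities dC
truth = [10, 11, 18, 20]     # hypothetical ground truth dT
print(rms_error(computed, truth), bad_pixel_rate(computed, truth))
```

Restricting both lists to a mask (textureless, occluded or discontinuity pixels) gives the per-region variants.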

Census Transform and Hamming Distance

• Census transform encodes the relative intensity differences in a window as 0 or 1 and arranges them into a one-dimensional bit vector whose length is set by the census window size;

• Census transform produces data of size (image size * vector size).

• Modified CTW: pixels are compared with the window mean rather than the central pixel;

• Hamming distance between CT vectors within correlation windows is used to find matched patches;

• Advantage: robustness to radiometric distortion, vignetting, lighting, boundaries and noise.
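
The transform and its matching cost can be sketched on small windows. This is a minimal illustration, assuming a bit is 1 where a neighbor is brighter than the center pixel (some implementations use the opposite convention); the function names and sample windows are hypothetical:

```python
def census_bits(window):
    """Census bit vector of a square window given as a list of rows.

    Each neighbor contributes 1 if it is brighter than the center pixel,
    else 0; the center itself is skipped.
    """
    n = len(window)
    c = n // 2
    center = window[c][c]
    bits = []
    for r in range(n):
        for col in range(n):
            if r == c and col == c:
                continue
            bits.append(1 if window[r][col] > center else 0)
    return bits


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


w1 = [[90, 60, 20], [70, 50, 10], [80, 55, 30]]
# Same structure under a gain and offset change (v -> 2*v + 5).
w2 = [[2 * v + 5 for v in row] for row in w1]
# Different structure in the bottom row.
w3 = [[90, 60, 20], [70, 50, 10], [80, 40, 60]]

print(census_bits(w1))
print(hamming(census_bits(w1), census_bits(w2)))
print(hamming(census_bits(w1), census_bits(w3)))
```

Because only the ordering of intensities matters, any monotonically increasing radiometric change leaves the bit vector unchanged.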

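[Figure: census transform of a 5x5 intensity window. Each neighbor is compared with the center pixel X, giving the bit rows 11111, 11000, 00X11, 00011, 00011 and the bit string 111111100000110001100011. The census transform window (CTW) turns a Height x Width image into a Height x Width x ((square size of CTW) - 1) volume. Panels: CT on intensity and on gradient, respectively; original gradient.]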

Guided Filter in Cost Aggregation for Stereo Matching

• Idea: stereo matching as labeling, a spatially smooth labeling with label transitions aligned with color edges;

• Edge-preserving filters: WLS, anisotropic diffusion, bilateral filter, total variation filter, guided filter, ...

• Guided filter works better than bilateral filter;

• Each disparity slice of the cost volume C is filtered as $C'_{i,d} = \sum_j W_{i,j}(I)\, C_{j,d}$, where the filter weights $W_{i,j}$ depend on the guidance image $I$ and $C'$ is the filtered cost volume;

• Cost volume filtering with guided filter works like segmentation implicitly;

PatchMatch Stereo

• Idea: start from a random initialization of the disparity and plane parameters of each pixel, then update the estimates by propagating information from neighboring pixels;

• Spatial propagation: for each pixel, check the disparities and plane parameters of the left and upper neighbors and replace the current estimates if their matching costs are smaller;

• View propagation: warp the point into the other view and check the corresponding estimates in the other image; replace if the matching costs are lower;

• Temporal propagation: propagate the information analogously by considering the estimates for the same pixel in the preceding and following video frames;

• Plane refinement: the disparity and plane parameters of each pixel are refined by generating random samples within an interval and updating the estimates if the matching costs are reduced;

• Post-processing: remove outliers with left/right consistency checking and a weighted median filter; gaps are filled by propagating information from the neighborhood.


Semi-Global Matching for Stereo Computation

• Semi-global matching approximates a global optimization by combining several local optimization steps;

• Minimizing E(D) in a two-dimensional manner would be very costly, while SGM simplifies it by traversing one-dimensional paths and ensures the constraints with respect to these explicit directions;

• At least 8 paths (16 suggested), like horizontal, vertical and diagonal orientations;

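• For instance, cost aggregation along a path in direction r (e.g. horizontal) follows the recurrence

$$L_r(p, d) = C(p, d) + \min\big(L_r(p - r, d),\; L_r(p - r, d \pm 1) + P_1,\; \min_i L_r(p - r, i) + P_2\big) - \min_k L_r(p - r, k)$$

• Pixel-based matching cost C(p, d) computed from mutual information;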

• Left-right consistency check for occlusion detection and disparity propagation for hole filling.

• To accelerate the process, down-sampled image pairs are used for disparity estimation.

• P1 is a small penalty for disparity changes of one level; P2 is a large penalty for larger disparity changes.
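
The one-dimensional aggregation can be sketched for a single left-to-right path. This is a minimal illustration of the commonly stated SGM recurrence, not a full multi-path implementation; the function name, the toy cost table and the penalty values are assumptions:

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one 1D path.

    costs[x][d] is the pixel-wise matching cost at path position x and
    disparity d. Returns the aggregated costs L[x][d].
    """
    n_disp = len(costs[0])
    prev = list(costs[0])  # the path starts with the raw costs
    out = [prev]
    for x in range(1, len(costs)):
        min_prev = min(prev)
        cur = []
        for d in range(n_disp):
            best = prev[d]  # same disparity: no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)  # change of one level
            if d + 1 < n_disp:
                best = min(best, prev[d + 1] + p1)
            best = min(best, min_prev + p2)  # larger jump
            # Subtracting min_prev keeps the values from growing along the path.
            cur.append(costs[x][d] + best - min_prev)
        out.append(cur)
        prev = cur
    return out


# Toy costs: the last pixel is ambiguous (flat cost over all disparities).
costs = [[0, 5, 5], [5, 0, 5], [3, 3, 3]]
for row in aggregate_path(costs, p1=1, p2=4):
    print(row)
```

In the toy example the ambiguous last pixel inherits the disparity of its neighbor, which is the smoothing effect the penalties are meant to produce. A full SGM sums such aggregated costs over all path directions before picking the minimum per pixel.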

ELAS: Efficient Large Scale Stereo Matching

• Similar idea: seed-and-grow, however no dense map built yet;

• Build a prior for the dense disparity search space from sparse ‘support points’ $S = \{s_1, \ldots, s_M\}$ with $s_m = (u_m, v_m, d_m)^T$;

• Algorithm: observation $o_n = (u_n, v_n, f_n)^T$ with feature $f_n$

• Split image domain into support points S and dense pixels;

• Assume factorization of distribution over disparity, observations and support points into a graphical model;

• Prior: support point triangulation (Delaunay);

• Likelihood: Laplace distribution over 5x5 patches;

• Feature defined on horizontal/vertical gradient neighbor field.

ELAS: Efficient Large Scale Stereo Matching

• Prior model:

• Likelihood model:

• Posterior can be factorized by the Bayes rule as

• Likelihood calculated along the epipolar line as

• Disparity estimation as MAP:

• To minimize an energy function

A mean function linking the support points and the observations

Stereo as Energy Minimization

• Find disparities d that minimize an energy function

• Simple pixel / window matching: C(x, y, d) = SSD distance between windows I(x, y) and J(x, y + d(x,y))

[Figure: images I(x, y) and J(x, y) with the scanline y = 141, and C(x, y, d), the disparity space image (DSI), with axes x and d]

• Choose the minimum of each column in the DSI independently: $d(x, y) = \arg\min_{d'} C(x, y, d')$

Dynamic Programming (DP) in Stereo Matching

• Can minimize E(d) independently per scanline using dynamic programming (DP);

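[Figure: DP grid over the left and right scanlines (S_left, S_right), with left-occlusion and right-occlusion moves between the nodes s, p, q, t, the costs C_occl and C_corr, and the left and right images]

Three cases:

• Sequential – cost of match

• Left occluded – cost of no match

• Right occluded – cost of no match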

• DP yields the optimal path through grid, the best set of matches for the ordering constraint in scan-line stereo.
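
The three cases can be sketched as a dynamic program over one pair of scanlines. This is a minimal illustration, assuming absolute intensity difference as the match cost and a constant occlusion cost; the function name and the toy scanlines are hypothetical:

```python
def dp_scanline(left, right, occ_cost):
    """Ordered matching of two scanlines by dynamic programming.

    cost[i][j] is the best cost of explaining the first i left pixels and
    the first j right pixels. Returns (total cost, list of matched index pairs).
    """
    n, m = len(left), len(right)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], move[i][0] = i * occ_cost, "L"
    for j in range(1, m + 1):
        cost[0][j], move[0][j] = j * occ_cost, "R"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] + abs(left[i - 1] - right[j - 1])
            skip_left = cost[i - 1][j] + occ_cost   # left pixel has no match
            skip_right = cost[i][j - 1] + occ_cost  # right pixel has no match
            best = min(match, skip_left, skip_right)
            cost[i][j] = best
            move[i][j] = "M" if best == match else ("L" if best == skip_left else "R")
    # Backtrack from the end of both scanlines to recover the matches.
    matches, i, j = [], n, m
    while i > 0 or j > 0:
        if move[i][j] == "M":
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "L":
            i -= 1
        else:
            j -= 1
    matches.reverse()
    return cost[n][m], matches


left = [10, 20, 30, 40, 50]
right = [20, 30, 40, 50, 60]
print(dp_scanline(left, right, occ_cost=5))
```

The monotone path through the grid is what enforces the ordering constraint; each scanline is solved independently, which is also the source of the streaking artifacts typical of scanline DP.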

Energy Minimization via Graph Cuts

• What defines a good stereo correspondence?

• 1. Match quality: want each pixel to find a good match in the other image

• 2. Smoothness: if two pixels are adjacent, they should (usually) move about the same amount

• Energy = match cost + smoothness cost; choices for the smoothness cost include the L1 distance and the “Potts model”

• Graph cut:

• Delete enough edges so that each pixel is connected to exactly one label node

• Cost of a cut: sum of deleted edge weights

• Finding the min-cost cut is equivalent to finding the global minimum of the energy function

• Graph cut converts the multi-way cut into a sequence of binary cuts

[Figure: graph connecting pixel nodes to label (disparity) nodes d1, d2, d3, with edge weights]

Model Stereo Vision by MRF and Solution by Belief Propagation

• Allows rich probabilistic models for images.

• But built in a local, modular way. Learn local relationships, get global effects out.

[Figure: MRF with disparity nodes x and image observation nodes y; Ψ is the disparity-disparity compatibility function between neighboring disparity nodes, Φ is the image-disparity compatibility function for the local observations]

$$P(x, y) = \frac{1}{Z} \prod_{(i,j)} \Psi(x_i, x_j) \prod_i \Phi(x_i, y_i)$$

BELIEFS: Approximate posterior marginal distributions

neighborhood of node i

MESSAGES: Approximate sufficient statistics

I. Belief Update (Message Product)

II. Message Propagation (Convolution)

Hierarchical Belief Propagation (HBP) and Constant Space HBP

• HBP works in a coarse-to-fine manner;

• (a) initialize the messages at the coarsest level to all zeros;

• (b) apply BP at the coarsest level to iteratively refine the messages;

• (c) use refined messages from the coarser level to initialize the messages for the next level.

• Constant space HBP (CSBP) relies on the observation that only a small number of disparity levels and the corresponding message values are needed at each pixel to losslessly reconstruct the BP messages;

• Apply the coarse-to-fine (CTF) scheme to both the spatial and the depth domain, i.e. gradually reduce the number of disparity levels as the messages propagate in CTF;

• Re-computes the data term at each level (not each iteration);

• Slower (9/8), but memory does not grow with the maximum disparity;

• Energy computed only once at the finest level;

• Gradually reduce the disparity levels in CTF.

• The closer the messages are to the fixed points, the fewer the required disparity levels; CSBP then refines the messages hierarchically to approach the fixed points.

Phase Matching in Frequency or WT Domain

• Phase reflects the structure information of the signal and inhibits the HF noise effect;

• Phase singularity is a problem;

• Local phase information as the primitive;

• Wavelet transform builds a hierarchical framework for multi-level coarse-to-fine processing;

• Stereo matching (disparity) with phase separation and instantaneous frequency of signals:

• Dynamic programming (DP) used for global optimization (occlusion handling) in stereo matching ;

• Phase is not uniformly stable;

• Smoothness constraints;

• Discontinuities detection;

• Multiple resolution solution:

• 1. top level: control points with feature matching, apply DP;

• 2. middle level: interpolation, apply DP;

• 3. bottom level: sub-pixel precision.

[Figure: left/right images, local phase and disparity; result panels: original, phase matching, phase matching with DP]

Disparity/Depth Refinement

• Sub-pixel refinement: real-valued disparities may be obtained by approximating the cost function locally using a parabola;

• Left-Right Consistency Check: outlier detection by difference;

• By computing a disparity for every pixel of the left image (left to right);

• by computing a disparity for every pixel of the right image (right to left);

• Segmentation can be used for outlier identification.

• Occlusion filling:

• Occlusion detection;

• Background expansion;

• Inpainting.

• Discontinuities smoothing:

• Bilateral filtering.
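
Two of the refinement steps above can be sketched on a single scanline: the parabola fit for sub-pixel disparities and the left-right consistency check. This is a minimal illustration; the function names, the tolerance of one disparity level and the sample values are assumptions:

```python
def subpixel_offset(c_prev, c_min, c_next):
    """Offset of the vertex of the parabola through the costs at d-1, d, d+1.

    The refined disparity is d + offset, with offset in [-0.5, 0.5] when
    c_min is the smallest of the three costs.
    """
    denom = c_prev - 2.0 * c_min + c_next
    if denom <= 0:
        return 0.0  # flat or non-convex neighborhood: keep the integer value
    return 0.5 * (c_prev - c_next) / denom


def left_right_check(disp_left, disp_right, max_diff=1):
    """Flag left-scanline pixels whose disparity is confirmed by the right map.

    A left pixel x with integer disparity d should land on right pixel x - d,
    whose own disparity should agree within max_diff.
    """
    valid = []
    for x, d in enumerate(disp_left):
        xr = x - d
        ok = 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= max_diff
        valid.append(ok)
    return valid


print(subpixel_offset(4.0, 1.0, 2.0))
print(left_right_check([2, 2, 2, 2, 4], [2, 2, 2, 2, 2]))
```

Pixels that fail the check are the candidates for the occlusion filling steps listed above.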

Multiple Cameras

Multi-baseline stereo

use the third view to verify depth estimates

Spatial Temporal Video Disparity Estimation

• The important problem of extending to video is flickering;

• Typical methods:

• Spatial temporal consistency: smoothing in the space-time volume;

• Post-processing of disparity maps by applying a median filter along the flow fields;

• Spatial-temporal cost aggregation and solved by local/global optimization methods;

• Joint disparity and flow estimation;

• SGM-based, as an instance;

• Modeled with MRF and solved by global optimization.

• Scene flow: 2D motion field along with 1D disparity change field.

• Dense method is very computationally expensive;

• Sparse method relies heavily on the success of the initial sparse correspondences.

Sparse Coding

Sparse coding (Olshausen & Field, 1996).

Originally developed to explain early visual processing in the brain (edge detection).

Objective: Given a set of input data vectors learn a dictionary of bases such that:

Each data vector is represented as a sparse linear combination of bases.

Sparse: mostly zeros

Predictive Sparse Coding

Recall the objective function for sparse coding:

Modify by adding a penalty for prediction error:

◦ Approximate the sparse code with an encoder

PSD for hierarchical feature training

◦ Phase 1: train the first layer;

◦ Phase 2: use encoder + absolute value as 1st feature extractor

◦ Phase 3: train the second layer;

◦ Phase 4: use encoder + absolute value as 2nd feature extractor

◦ Phase 5: train a supervised classifier on top layer;

◦ Phase 6: optionally train the whole network with supervised BP.

Methods of Solving Sparse Coding

Greedy methods: projecting the residual on some atom;

◦ Matching pursuit, orthogonal matching pursuit;

L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);

◦ The residual is updated iteratively in the direction of the atom;

Gradient-based finding new search directions

◦ Projected Gradient Descent

◦ Coordinate Descent

Homotopy: a set of solutions indexed by a parameter (regularization)

◦ LARS (Least Angle Regression)

First order/proximal methods: Generalized gradient descent

◦ solving efficiently the proximal operator

◦ soft-thresholding for L1-norm

◦ Accelerated by the Nesterov optimal first-order method

Iterative reweighting schemes

◦ L2-norm: Chartrand and Yin (2008)

◦ L1-norm: Candès et al. (2008)
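
The proximal approach with soft-thresholding can be sketched as plain iterative soft-thresholding (ISTA). This is a minimal illustration, assuming the common objective 0.5·||x − D a||² + λ·||a||₁ (the slides do not spell out the objective); the function names and the toy dictionary are hypothetical, and the step size must not exceed the reciprocal of the largest eigenvalue of DᵀD:

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, x, lam, step, n_iter=200):
    """Sparse code a minimizing 0.5*||x - D a||^2 + lam*||a||_1.

    D is a list of rows (signal dimension x number of atoms).
    """
    n_dim, n_atoms = len(D), len(D[0])
    a = [0.0] * n_atoms
    for _ in range(n_iter):
        # Residual D a - x and gradient D^T (D a - x) of the quadratic term.
        residual = [sum(D[i][k] * a[k] for k in range(n_atoms)) - x[i]
                    for i in range(n_dim)]
        grad = [sum(D[i][k] * residual[i] for i in range(n_dim))
                for k in range(n_atoms)]
        # Gradient step followed by soft-thresholding.
        a = [soft_threshold(a[k] - step * grad[k], step * lam)
             for k in range(n_atoms)]
    return a


# With an identity dictionary the solution is the soft-thresholded signal.
D = [[1.0, 0.0], [0.0, 1.0]]
print(ista(D, x=[3.0, 0.2], lam=0.5, step=1.0))
```

The small coefficient is set exactly to zero, which is the sparsity-inducing effect of the L1 penalty; Nesterov-style acceleration of the same iteration gives FISTA.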

Strategy of Dictionary Selection

• What D to use?

• A fixed overcomplete set of basis functions: no adaptivity.

◦ Steerable wavelet;

◦ Bandlet, curvelet, contourlet;

◦ DCT basis;

◦ Gabor function;

◦ ….

• Data-adaptive dictionary – learn from data;

◦ K-SVD: a generalized K-means clustering process for Vector Quantization (VQ).

◦ An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary.

• Other methods of dictionary learning:

◦ non-negative matrix decompositions.

◦ sparse PCA (sparse dictionaries).

◦ fused-lasso regularizations (piecewise constant dictionaries)

• Extending the models: Sparsity + Self-similarity = Group Sparsity

Learning Sparse Representation in Depth Maps

• Sparse representations are learned from Middlebury database disparity maps;

• They are then exploited in a two-layer graphical model for inferring depth from stereo, by including a sparsity prior on the learned features;

• The first layer is solved using an existing MRF-based stereo matching algorithm;

• The second layer is solved using the non-stationary sparse coding algorithm.

[Figure: disparity results; panels include (c) Graph cut and (d) GC + Sparse coding]

Deep Learning

Representation learning attempts to automatically learn good features or representations;

Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features);

Become effective via unsupervised pre-training + supervised fine tuning;

◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks.

Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);

Semi-supervised: structure of manifold assumption;

◦ labeled data is scarce and unlabeled data is abundant.

Why Deep Learning?

Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization problem);

◦ Learn prior from unlabeled data;

Shallow models are not for learning high-level abstractions;

◦ Ensembles or forests do not learn features first;

◦ Graphical models could be deep net, but mostly not.

Unsupervised learning could be “local-learning”;

◦ Resemble boosting with each layer being like a weak learner

Learning is weak in directed graphical models with many hidden variables;

◦ Sparsity and regularizer.

Traditional unsupervised learning methods do not easily learn multiple levels of representation.

◦ Layer-wise unsupervised learning is the solution.

Multi-task learning (transfer learning and self-taught learning);

Other issues: scalability & parallelism with the burden from big data.

Multi Layer Neural Network

A neural network = running several logistic regressions at the same time;

◦ Neuron=logistic regression or…

Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule)

◦ Online learning: stochastic/incremental gradient descent

◦ Batch learning: conjugate gradient descent

Problems in MLPs

Multi Layer Perceptrons (MLPs), a kind of feed-forward neural network, were popularly used for decades.

Gradient is progressively getting more scattered

◦ Below the top few layers, the correction signal is minimal

Gets stuck in local minima

◦ Especially when starting far from ‘good’ regions (i.e., random initialization)

In usual settings, use only labeled data

◦ Almost all data is unlabeled!

◦ In contrast, the human brain can learn from unlabeled data.

Convolutional Neural Networks

CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input;

◦ local receptive fields (shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling;

◦ Related to generative MRF/discriminative CRF:

◦ CNN=Field of Experts MRF=ML inference in CRF;

◦ Generate ‘patterns of patterns’ for pattern recognition.

Each layer combines (merge, smooth) patches from previous layers

◦ Pooling/sampling (e.g., max or average) filter: compress and smooth the data.

◦ Convolution filters: (translation invariance) unsupervised;

◦ Local contrast normalization: increase sparsity, improve optimization/invariance.

[Figure: C layers are convolutions, S layers pool/sample]

Convolutional Neural Networks

Convolutional Networks are trainable architectures composed of multiple stages;

Input and output of each stage are sets of arrays called feature maps;

At output, each feature map represents a particular feature extracted at all locations on input;

Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;

A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;

◦ A fully connected layer: softmax transfer function for posterior distribution.

Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;

Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;

◦ In the rectified function, gi is a trainable gain parameter; it might be followed by a contrast normalization N;

Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;

Supervised training is performed using a form of SGD to minimize the prediction error;

◦ Gradients are computed with the back-propagation method.

Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.

* is discrete convolution operator

LeNet (LeNet-5)

A layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits;

Local receptive fields (5x5) with local connections;

Output via a RBF function, one for each class, with 84 inputs each;

Learning by Graph Transformer Networks (GTN);

AlexNet

A layered model composed of convolution and subsampling operations, followed by a holistic representation and, all in all, a landmark classifier;

Consists of 5 convolutional layers, some of which followed by max-pooling layers, 3 fully-connected layers with a final 1000-way softmax;

Fully-connected layers: linear classifiers/matrix multiplications;

ReLU are rectified-linear nonlinearities on layer output, can be trained several times faster;

Local (contrast) normalization scheme aids generalization;

Overlapping pooling slightly less prone to overfitting;

Data augmentation: artificially enlarge the dataset using label-preserving transformations;

Dropout: setting to zero output of each hidden neuron with prob. 0.5;

Trained by SGD with batch size 128, momentum 0.9, weight decay 0.0005.

The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

MattNet

Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;

Preprocessing: subtracting a per-pixel mean;

Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the image and randomly flipped horizontally to provide more views of each example;

SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting;

65M parameters trained for 12 days on a single Nvidia GPU;

Visualization by layered DeconvNets: project the feature activations back to the input pixel space;

◦ Reveal input stimuli exciting individual feature maps at any layer;

◦ Observe evolution of features during training;

◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important;

DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve structure;

Multiple such models were averaged together to further boost performance;

Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).

[Figure: Architecture of an eight-layer ConvNet model. Input: a 224 by 224 crop of an image (with 3 color planes). Layers 1-5 are convolutional; the first uses 96 filters of size 7x7 with a stride of 2 in both x and y. The resulting feature maps are (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized, giving 55x55 feature maps. Layers 6-7 are fully connected, taking the input in vector form (6x6x256 = 9216 dimensions). The final layer is a C-way softmax function, where C is the number of classes.]

[Figure, top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath.]

[Figure, bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.]

Belief Nets

A belief net is a directed acyclic graph composed of stochastic variables.

Can observe some of the variables and solve two problems:

◦ inference: Infer the states of the unobserved variables.

◦ learning: Adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: stochastic hidden causes connected to visible effects]

Use nets composed of layers of stochastic variables with weighted connections.

Boltzmann Machines

Energy-based models associate an energy with each configuration of the stochastic variables of interest (for example, MRF, Nearest Neighbor);

◦ Learning means adjusting the shape of the energy function so that desired configurations get low energy;

Boltzmann machine is a stochastic recurrent model with hidden variables;

◦ Markov chain Monte Carlo, i.e. MCMC sampling (appendix);

Restricted Boltzmann machine is a special case:

◦ Only one layer of hidden units;

◦ factorization of each layer’s neurons/units (no connections in the same layer);

Contrastive divergence: approximation of gradient (appendix).

probability

Energy Function

Learning rule

Deep Belief Networks

A hybrid model: can be trained as generative or discriminative model;

Deep architecture: multiple layers (learn features layer by layer);

◦ Multi layer learning is difficult in sigmoid belief networks.

◦ Top two layers are undirected connections, RBM;

◦ Lower layers get top down directed connections from layers above;

Unsupervised or self-taught pre-learning provides a good initialization;

◦ Greedy layer-wise unsupervised training for RBM

Supervised fine-tuning

◦ Generative: wake-sleep algorithm (Up-down)

◦ Discriminative: back propagation (bottom-up)

Deep Boltzmann Machine

Learning internal representations that become increasingly complex;

High-level representations built from a large supply of unlabeled inputs;

Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph);

Generative fine-tuning: different from DBN

◦ Positive and negative phase (appendix)

Discriminative fine-tuning: the same as DBN

◦ Back propagation.

Denoising Auto-Encoder

Multilayer NNs with target output=input;

Reconstruction=decoder(encoder(input));

◦ Perturbs the input x to a corrupted version;

◦ Randomly sets some of the coordinates of input to zeros.

◦ Recover x from encoded perturbed data.

Learns a vector field towards higher probability regions;

Pre-trained with DBN or regularizer with perturbed training data;

Minimizes variational lower bound on a generative model;

◦ corresponds to regularized score matching on an RBM;

PCA=linear manifold=linear Auto Encoder;

Auto-encoder learns the salient variation like a nonlinear PCA.

Stacked Denoising Auto-Encoder

Stack many (possibly sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning

◦ Drop the decode layer each time

◦ Performs better than stacking RBMs;

Supervised training on the last layer using final features;

(option) Supervised training on the entire network to fine- tune all weights of the neural net;

Empirically not quite as accurate as DBNs.

Stereopsis via Deep Learning

• Learn a binocular cross correlation model: use two quadrature pairs to detect disparity;

• Various filters correspond to phases, positions and frequencies;

• Disparity as latent variable: a pattern of matching filter responses;

• A joint probabilistic model over patch pairs and disparity is defined as a Boltzmann machine;

• Training amounts to finding the parameters that maximize the log probability of the training pairs;

• An RBM is used for this case;

• During inference, each latent variable receives activity from exactly two products of matched filter responses.

[Figure label: pooling]

Stereopsis via Deep Learning

[Figure: Example training data. Rows 1, 2 and 3 show rendered image planes for the left/right camera, where in row 3 the right camera has been rotated by 45 degrees around the z axis. Images are rendered from the depth maps shown in row 4 and a randomly selected texture map from the Berkeley Segmentation Database.]

[Figure: Example pairs from the NORB-cluttered dataset.]

[Figure: Learned binocular filter pairs.]

Unsupervised Learning of Depth (and Motion)

• Learning about the interrelations between images from multiple cameras, multiple frames in a video, or the combination of both;

• Depth and motion in a feature learning architecture based on the energy model;

• An AutoEncoder single-layer model uses multiplicative interactions to detect synchrony, and a pooling layer independently trained on the hidden responses to achieve content invariance;

• Depth as a latent variable in learning:

• Reconstruction error:

• Contraction as regularization:

• Complete objective function:

Note: there is no need for rectification, since the model can learn any transformation between the frames, not just horizontal shift.

Unsupervised Learning of Depth (and Motion)

• Extension to stereo sequences: both depth and motion;

Encoding depth:

Encoding motion:

Multiview disparity:

Representation of depth

Representation of motion

Representation of disparity

Products of frame responses



Unsupervised Learning of Depth (and Motion)

Filters learned on stereo patch pairs from KITTI dataset.

Example of a filter pair learned on sequences by the SAE-D model from the Hollywood3D dataset.

Stereo Matching by CNN

• Train a convolutional neural network on pairs of small image patches;

• The network output is used to initialize the matching cost between a pair of patches;

• Eight layers, L1 through L8, with 9x9 gray patches as input and the matching cost as output;

• The 1st layer is convolutional only; the other layers are fully connected.

• Rectified linear units follow each layer except L8, with NO pooling!

• Trained with SGD (batch size 128) on 45 million examples extracted from 194 image pairs.

• Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation;

• Smoothness constraints are enforced by semi-global matching (SGM) and a left-right consistency check is used to detect and eliminate errors in occluded regions;

• Sub-pixel enhancement, followed by a median filter and a bilateral filter, gives the final disparity map;

• Achieves an error rate of 2.61% on the KITTI stereo dataset (previous best: 2.83%).
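The left-right consistency check mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `left_right_check`, the tolerance of one pixel and the toy disparity maps are all assumptions made for the example.

```python
import numpy as np

def left_right_check(disp_left, disp_right, tol=1.0):
    """Mark pixels whose left and right disparities agree.

    disp_left[y, x]  = d means left pixel (x, y) matches right pixel (x - d, y).
    disp_right[y, x] = d means right pixel (x, y) matches left pixel (x + d, y).
    Returns a boolean mask that is True where the match is consistent.
    """
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))              # column index of every pixel
    x_right = np.rint(xs - disp_left).astype(int)   # matching column in the right image
    inside = (x_right >= 0) & (x_right < w)         # matches that fall inside the image
    x_safe = np.clip(x_right, 0, w - 1)
    rows = np.arange(h)[:, None]
    d_back = disp_right[rows, x_safe]               # disparity stored at the matched right pixel
    return inside & (np.abs(disp_left - d_back) <= tol)

# Toy example (assumed data): constant disparity 2, with one corrupted left pixel.
disp_l = np.full((3, 8), 2.0)
disp_r = np.full((3, 8), 2.0)
disp_l[1, 5] = 4.0                                  # simulated mismatch
mask = left_right_check(disp_l, disp_r)
```

Pixels where the mask is False (the corrupted pixel and the two leftmost columns, whose matches fall outside the right image) are the ones a refinement stage would treat as errors or occlusions and fill in from neighbours.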

Stereo Matching by CNN

Support region

Appendix A:

Learning Depth from Single Image

Learning-based Depth from Image

Initial over-segmentation (super pixels);

Markov Random Field (MRF) to infer patch’s orientation and location from image features (texture, color and gradient);

◦ Connected, co-planar or colinear as prior;

◦ Indication of occlusion boundaries/folds;

◦ Multi-conditional learning; solved by linear program;

MRF overlaid on “super pixels”

Occlusion/fold

Coplanarity and Colinearity

Single Image Depth Estimation From Predicted Semantic Labels

Semantic segmentation to guide the 3D reconstruction;

Works like holistic scene understanding:

◦ 1. Multi-class image labeling MRF for scene segmentation;

◦ 2. Depth estimation for each semantic class by learning (logistic regression);

◦ 3. Scene depth estimation by MRF (pixel or super-pixel) with potentials (learned boosted decision tree classifiers) and priors on geometry (horizon prediction, vertical objects), pixel smoothness, and super-pixel soft connectivity, co-planarity and orientation.

semantically derived geometric constraints

Smoothed per-pixel log-depth prior for each semantic class with horizon rotated to center of image

Image semantic overlay ground truth depth measurements

Learning Depth from Examples

Two similar images are likely to have similar 3D structure (depth).

Nearest-neighbor (kNN) search: finding k image+depth pairs that are most similar to the query (histograms of oriented gradients as feature);

Depth fusion: median filtering of the k depth fields;

Joint-bilateral depth filtering: smoothing of the median-fused depth.
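The nearest-neighbor search and median fusion steps can be sketched as below. This is a toy illustration under stated assumptions: generic feature vectors stand in for the histograms of oriented gradients, the function name `knn_depth_fusion` and the tiny database are hypothetical, and the joint-bilateral smoothing step is left out.

```python
import numpy as np

def knn_depth_fusion(query_feat, db_feats, db_depths, k=3):
    """Fuse the depth fields of the k database images closest to the query.

    query_feat : (d,) feature vector of the query image
    db_feats   : (n, d) feature vectors of the database images
    db_depths  : (n, h, w) depth fields paired with the database images
    """
    dists = np.linalg.norm(db_feats - query_feat, axis=1)   # Euclidean feature distance
    nearest = np.argsort(dists)[:k]                         # indices of the k best matches
    fused = np.median(db_depths[nearest], axis=0)           # per-pixel median of the k depths
    return fused, nearest

# Toy database (assumed data): five images, each with a constant 2x2 depth field.
db_feats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.1, 0.0], [9.0, 9.0]])
db_depths = np.stack([np.full((2, 2), v) for v in (1.0, 2.0, 50.0, 3.0, 90.0)])
fused, nearest = knn_depth_fusion(np.array([0.0, 0.0]), db_feats, db_depths, k=3)
```

The per-pixel median is what makes the fusion robust: a single badly matched candidate cannot pull the fused depth far from the other two.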

K-NN query

Depth Fusion and Smoothing

Depth output

Note: depth (disparity) warping via SIFT-flow in aligning with the query is omitted.

Depth Transfer for Monocular Video

K-NN search for candidates of query frames;

Depth changes are gradual frame-to-frame;

Moving objects are usually on the ground;

Warped with SIFT flow and regularized with smoothness and prior

Is the computational cost worth it?

Depth Inference with MRF

Form a basis (dictionary) over the RGB and depth spaces, and represent depth maps by sparse linear combinations of the basis elements.

A prediction function is estimated from weight vectors in RGB space to weight vectors in depth space, to recover depth maps from query images.

A final super-pixel post processor aligns depth maps with occlusion boundaries, creating physically plausible results.

Scalable Exemplar Based Depth Transfer

Images with similar global depth profiles are clustered together in 2D using RGB pairwise features (left) and sparse positive descriptors on depth (right), which are effective in grouping images with similar depth profiles.

Estimate a transformation T that maps points from one space to another.

Scalable Exemplar Based Depth Transfer

Learning to be a Depth Camera (Active Near-IR)

• Use hybrid classification-regression forests to learn how to map from near infrared intensity images to absolute, metric depth in real-time;

• Simplify the problem by dividing it into sub-problems in the first layer, and then applying models trained for these sub-problems in the second layer to solve the main problem efficiently;

• Restrict the depths of the object to a certain range for significant simplification;

• The first layer learns to infer a coarsely quantized depth range for each pixel, and optionally pools these predictions across all pixels to obtain a more reliable distribution over these depth ranges;

• The second layer then applies one or more expert regressors trained specifically on the inferred depth ranges.

• Note: the forests do not need to explicitly model scene illumination, surface geometry and reflectance, or complex inter-reflections, required by traditional SFS methods.

Learning to be a Depth Camera (Active Near-IR)

• Comparable to high-quality consumer depth cameras with a reduced cost, power consumption, and form-factor.

Learning to be a Depth Camera (Active Near-IR)

• Applied to specific objects: hands and faces.

Depth Prediction using a Multi-Scale Deep Network

• Two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally;

• Apply a scale-invariant error to help measure depth relations rather than scale;

• Augment training data with online random transformations (scale, rotation, translation, flips, color).

• Baseline for comparison: Make3D.
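A minimal sketch of a scale-invariant error, assuming the commonly used log-space form D = (1/n) Σ d_i² − (λ/n²) (Σ d_i)² with d_i = log(pred_i) − log(target_i). The function name, the default λ = 1 and the toy depth values are assumptions for illustration only.

```python
import math

def scale_invariant_error(pred, target, lam=1.0):
    """Log-space error that ignores a global scale factor when lam = 1."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]  # per-pixel log difference
    n = len(d)
    return sum(x * x for x in d) / n - lam * (sum(d) ** 2) / (n * n)

# Toy depths (assumed data).
target = [1.0, 2.0, 4.0, 8.0]
pred_scaled = [3.0 * t for t in target]      # correct relations, wrong global scale
pred_wrong = [1.0, 4.0, 4.0, 2.0]            # wrong depth relations
err_scaled = scale_invariant_error(pred_scaled, target)
err_wrong = scale_invariant_error(pred_wrong, target)
```

A prediction that differs from the ground truth only by a constant factor gets an error of (numerically) zero, while a prediction with wrong relative depths is penalized; a value of λ between 0 and 1 would mix this with an ordinary log-space squared error.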

Depth Prediction using a Multi-Scale Deep Network

(a) input, (b) output of coarse network, (c) refined

output of fine network, (d) ground truth.

Appendix B:

Machine Learning and Optimization

Graphical Models

• Graphical Models: Powerful framework for representing dependency structure between random variables.

• They describe the joint probability distribution over a set of random variables.

• The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables.

• The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables.

• Two types of graphical models:

◦ Directed (Bayesian networks)

◦ Undirected (Markov random fields, Boltzmann machines)

• Hybrid graphical models combine directed and undirected models, such as Deep Belief Networks and Hierarchical-Deep Models.

Generative Model: MRF

Random Field: F = {F1, F2, …, FM} is a family of random variables on a set S in which each Fi takes a value fi in a label set L.

Markov Random Field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property.

◦ Generative model for joint probability p(x)

◦ allows no direct probabilistic interpretation

◦ define potential functions Ψ on maximal cliques A

◦ map joint assignment to non-negative real number

◦ requires normalization

An MRF is an undirected graphical model.

A flow network G(V, E) is defined as a fully connected directed graph where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0;

The max-flow problem is to find the flow of maximum value on a flow network G;

An s-t cut, or simply a cut, of a flow network G is a partition of V into S and T = V-S, such that s is in S and t is in T;

A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network;

Methods for max-flow/min-cut:

◦ Ford-Fulkerson method;

◦ "Push-Relabel" method.
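As an illustration of the max-flow definitions above, here is a minimal sketch of the Edmonds-Karp variant of the Ford-Fulkerson method (augmenting paths chosen by breadth-first search). The function name `max_flow`, the adjacency-matrix representation and the four-node example network are assumptions made for this example.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: Ford-Fulkerson with BFS-chosen augmenting paths.

    capacity[u][v] is the capacity of edge (u, v); returns the max s-t flow value.
    """
    n = len(capacity)
    residual = [row[:] for row in capacity]   # residual capacities, updated in place
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:                   # no augmenting path left: flow is maximal
            return total
        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: decrease forward residuals, increase reverse residuals.
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Example network (assumed data): node 0 is the source s, node 3 is the sink t.
capacity = [
    [0, 3, 2, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
]
flow_value = max_flow(capacity, 0, 3)
```

By the max-flow/min-cut theorem the returned value equals the capacity of a minimum s-t cut; in this example the cut S = {0}, T = {1, 2, 3} has capacity 3 + 2 = 5.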

http://www.hindawi.com/journals/mpe/2012/814356/fig8/


Labeling is mostly solved as an energy minimization problem;

Two common energy models:

◦ Potts Interaction Energy Model;

◦ Linear Interaction Energy Model.

Graph G contains two kinds of vertices: p-vertices and i-vertices;

◦ all the edges in the neighborhood N, called n-links;

◦ edges between the p-vertices and the i-vertices called t-links.

In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex;

The minimum cost multi-way cut will minimize the energy function where the severed n-links would correspond to the boundaries of the labeled vertices;

The approximation algorithms to find this multi-way cut:

◦ "alpha-expansion" algorithm;

◦ "alpha-beta swap" algorithm.

A simplified Bayes net: it propagates information throughout a graphical model via a series of messages passed iteratively between neighboring nodes; it is likely to converge to a consensus that determines the marginal probabilities of all the variables;

Messages estimate the cost (or energy) of a configuration of a clique given all other cliques; the messages are then combined to compute a belief (marginal or maximum probability);

Two types of BP methods:

◦ max-product;

◦ sum-product.

BP provides an exact solution when there are no loops in the graph!

Equivalent to dynamic programming/Viterbi in these cases;

Loopy Belief Propagation: still provides an approximate (but often good) solution;
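The exactness of sum-product BP on a loop-free graph can be checked on a small chain. The sketch below is illustrative only: the function names, the shared pairwise table and the numeric potentials are assumptions, and the brute-force routine exists purely to verify the message-passing result.

```python
from itertools import product

def chain_marginals(unary, pairwise):
    """Sum-product BP on a chain x_0 - x_1 - ... - x_{n-1}.

    unary[i][a]    : local evidence for x_i = a
    pairwise[a][b] : compatibility of neighbouring states (a, b), shared by all edges
    Returns the marginal of every variable (exact, since the chain has no loops).
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]   # message arriving at node i from the left
    bwd = [[1.0] * k for _ in range(n)]   # message arriving at node i from the right
    for i in range(1, n):
        fwd[i] = [sum(fwd[i - 1][a] * unary[i - 1][a] * pairwise[a][b] for a in range(k))
                  for b in range(k)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(bwd[i + 1][b] * unary[i + 1][b] * pairwise[a][b] for b in range(k))
                  for a in range(k)]
    marginals = []
    for i in range(n):
        belief = [fwd[i][a] * unary[i][a] * bwd[i][a] for a in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])   # normalize the belief
    return marginals

def brute_force_marginals(unary, pairwise):
    """Enumerate every joint state; only feasible for tiny models."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for states in product(range(k), repeat=n):
        p = 1.0
        for i, a in enumerate(states):
            p *= unary[i][a]
        for i in range(n - 1):
            p *= pairwise[states[i]][states[i + 1]]
        for i, a in enumerate(states):
            totals[i][a] += p
    return [[v / sum(row) for v in row] for row in totals]

# Toy model (assumed numbers): three binary variables with a smoothness-favouring prior.
unary = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
pairwise = [[0.8, 0.2], [0.2, 0.8]]
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
```

Replacing the sums with maxima in the two message loops would give the max-product variant, which on a chain coincides with the Viterbi recursion.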

Generalized BP for pairwise MRFs

◦ Hidden variables xi and xj are connected through a compatibility function;

◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function;

The joint probability of {x} is given by
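For reference, a pairwise MRF of this kind is usually written in the form below. The symbols ψ_ij (compatibility function), φ_i (local evidence function) and Z (normalization constant) are notation assumed here, since the slide text does not name them.

```latex
p(\{x\}) = \frac{1}{Z} \prod_{(i,j)} \psi_{ij}(x_i, x_j) \prod_{i} \phi_i(x_i, y_i)
```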

Inference can be improved by taking into account higher-order interactions among the variables;

◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes;

◦ This is the intuition in Generalized Belief Propagation (GBP).

Stochastic Gradient Descent (SGD)

• The general class of estimators that arise as minimizers of sums are called M-estimators;

• In practice one often looks for stationary points of the likelihood function (or zeros of its derivative, the score function);

• Online gradient descent samples a subset of summand functions at every step;

• The true gradient is approximated by the gradient at a single example;

• The training set is shuffled at each pass.

• There is a compromise between batch and online gradient descent, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples.

• With suitably decreasing learning rates and under mild assumptions, SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
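The mini-batch compromise can be sketched on a small least-squares problem. Everything here is an assumption made for illustration: the function name `minibatch_sgd`, the constant learning rate, the batch size and the synthetic noiseless data.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=8, epochs=200, seed=0):
    """Minimize the mean of 0.5 * (x_i . w - y_i)^2 with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle the training set at each pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)   # gradient averaged over the mini-batch
            w -= lr * grad
    return w

# Synthetic data (assumed): 64 examples generated by a known weight vector.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(64, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true
w_hat = minibatch_sgd(X, y)
```

Setting `batch_size=1` gives the online form described above, and `batch_size=len(y)` gives full-batch gradient descent.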

Back Propagation

• Back propagation is a training method for multi-layer networks

• We want to find parameters W that minimize an error E

• For this we do iterative gradient descent:

$w(t) = w(t-1) - \lambda \, \frac{\partial E}{\partial w(t)}$

• Error propagation

◦ Forward propagation of a training pattern's input through the multilayer network to generate the output activations;

◦ Backward propagation of the output activations (logistic or soft-max) through the multilayer network, using the pattern target to generate the deltas of all output and hidden units (the chain rule);

• Weight update

◦ For each weight, multiply its output delta and input activation to get the weight gradient;

◦ Subtract a ratio (i.e. the learning rate) of the gradient from the weight.

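$\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l} \times \frac{\partial y_l(w, y_{l-1})}{\partial y_{l-1}}$

$\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l} \times \frac{\partial y_l(w, y_{l-1})}{\partial w_l}$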

E (f(x0,w),y0) = -log (f(x0,w)- y0).
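The two chain-rule steps can be checked numerically on a tiny network. This sketch makes several assumptions that are not in the slides: a single tanh hidden layer, a linear output, a squared-error loss (instead of the logistic or soft-max outputs mentioned above), and random toy data.

```python
import numpy as np

def forward(x, W1, W2):
    y1 = np.tanh(W1 @ x)      # hidden activations y_1
    y2 = W2 @ y1              # linear output y_2
    return y1, y2

def loss(x, target, W1, W2):
    _, y2 = forward(x, W1, W2)
    return 0.5 * np.sum((y2 - target) ** 2)

def backward(x, target, W1, W2):
    y1, y2 = forward(x, W1, W2)
    dE_dy2 = y2 - target                  # dE/dy_2 for the squared error
    dE_dW2 = np.outer(dE_dy2, y1)         # dE/dw_2 = dE/dy_2 * dy_2/dw_2
    dE_dy1 = W2.T @ dE_dy2                # dE/dy_1 = dE/dy_2 * dy_2/dy_1
    delta1 = dE_dy1 * (1.0 - y1 ** 2)     # through tanh: d tanh(a)/da = 1 - tanh(a)^2
    dE_dW1 = np.outer(delta1, x)          # delta times input activation
    return dE_dW1, dE_dW2

rng = np.random.default_rng(0)
x, target = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
g1, g2 = backward(x, target, W1, W2)

# Finite-difference check of one entry of each gradient.
eps = 1e-6
W1p = W1.copy()
W1p[1, 2] += eps
W2p = W2.copy()
W2p[0, 3] += eps
num_g1 = (loss(x, target, W1p, W2) - loss(x, target, W1, W2)) / eps
num_g2 = (loss(x, target, W1, W2p) - loss(x, target, W1, W2)) / eps

# One gradient-descent step: w(t) = w(t-1) - lr * dE/dw.
lr = 0.01
loss_before = loss(x, target, W1, W2)
loss_after = loss(x, target, W1 - lr * g1, W2 - lr * g2)
```

The finite-difference values `num_g1` and `num_g2` should agree with the back-propagated entries `g1[1, 2]` and `g2[0, 3]` up to the discretization error, which is a cheap sanity check for any hand-written backward pass.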

Variable Learning Rate

Too large a learning rate

◦ causes oscillation in searching for the minimal point

Too small a learning rate

◦ causes too slow convergence to the minimal point

Adaptive learning rate

◦ At the beginning, the learning rate can be large when the current point is far from the optimal point;

◦ Gradually, the learning rate will decay as time goes by.

Should not be too large or too small:

◦ annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)

◦ 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a constant.
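The annealing rule above is short enough to write out directly. The function name and the values of α(0) and T below are assumptions chosen only to show the shape of the schedule.

```python
def annealed_rate(t, alpha0=0.1, T=100.0):
    """Annealing schedule alpha(t) = alpha(0) / (1 + t / T)."""
    return alpha0 / (1.0 + t / T)

# Learning rate at a few iteration counts (assumed alpha0 = 0.1, T = 100).
rates = [annealed_rate(t) for t in (0, 10, 100, 1000)]
```

For t much smaller than T the rate stays close to α(0); at t = T it has halved, and it keeps decaying roughly like α(0)·T/t afterwards.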

Variable Momentum

Classical Momentum (CM) is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations: given the objective function f(θ),

$v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t), \quad \theta_{t+1} = \theta_t + v_{t+1},$

with ε > 0 as the learning rate, µ ∈ [0,1] as the momentum coefficient and ∇f(θ_t) as the gradient at θ_t;

Nesterov’s Accelerated Gradient (NAG) is also a first-order optimization method, with a better convergence rate guarantee than gradient descent;

$v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t), \quad \theta_{t+1} = \theta_t + v_{t+1},$

For convex objectives, momentum-based methods outperform SGD in the early or transient stages of optimization, but are equally effective in the final stage;

Hessian-free (HF) methods and truncated Newton methods work by optimizing a local quadratic model of the objective via the linear conjugate gradient (CG) algorithms;

◦ If CG terminated after just one step, HF becomes equivalent to NAG;
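The CM and NAG updates differ only in where the gradient is evaluated, which the following sketch makes explicit. The quadratic objective, its curvatures, the step size ε = 0.03, the momentum µ = 0.9 and the function names are all assumptions chosen for illustration.

```python
def grad(theta, curv=(1.0, 25.0)):
    """Gradient of the toy objective f(theta) = 0.5 * sum(c_i * theta_i^2)."""
    return [c * th for c, th in zip(curv, theta)]

def f(theta, curv=(1.0, 25.0)):
    return 0.5 * sum(c * th * th for c, th in zip(curv, theta))

def run(method, steps=200, eps=0.03, mu=0.9):
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        if method == "cm":
            g = grad(theta)                                         # gradient at theta_t
        else:
            g = grad([th + mu * vi for th, vi in zip(theta, v)])    # gradient at theta_t + mu * v_t
        v = [mu * vi - eps * gi for vi, gi in zip(v, g)]            # velocity update
        theta = [th + vi for th, vi in zip(theta, v)]               # parameter update
    return f(theta)

f_start = f([1.0, 1.0])
f_cm = run("cm")
f_nag = run("nag")
```

Both runs drive the objective from its starting value of 13 towards zero; the only code difference is the look-ahead point at which NAG evaluates the gradient.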

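AdaGrad/AdaDelta

AdaGrad: asymptotically sublinear regret; adapts the learning rate for each weight based on historical information:

$\Delta W_{ij}(t+1) = -\frac{\gamma}{\sqrt{\sum_{\tau=1}^{t+1} \left(\frac{\partial E}{\partial w_{ij}(\tau)}\right)^2}} \cdot \frac{\partial E}{\partial w_{ij}(t+1)}$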

◦ Normalizes each coordinate of gradient by the historical (previous iterations) magnitude of that coordinate;

◦ Frequently occurring features in the gradients get small learning rates and infrequent features get higher ones;

◦ Sensitive to initial conditions, continual decay of learning rate.

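AdaDelta: accumulate the denominator over the last k gradients (a sliding window):

$\alpha(t+1) = \sum_{\tau=t-k+1}^{t+1} \left(\frac{\partial E}{\partial w(\tau)}\right)^2$

$\Delta W(t+1) = -\frac{\gamma}{\sqrt{\alpha(t+1)}} \cdot \frac{\partial E}{\partial w(t+1)}$

◦ This requires keeping the last k gradients; instead, it uses a simpler formula:

$\beta(t+1) = \rho \, \beta(t) + (1-\rho) \left(\frac{\partial E}{\partial w(t+1)}\right)^2$

$\Delta W(t+1) = -\frac{\gamma}{\sqrt{\beta(t+1) + \epsilon}} \cdot \frac{\partial E}{\partial w(t+1)}$

◦ Avoids AdaGrad’s weakness.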
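A minimal per-weight sketch of the AdaGrad rule and of the running-average rule follows. It assumes a square root in both denominators, adds a small `eps` to the AdaGrad denominator purely to avoid division by zero, and uses a toy quadratic objective; the function names and hyper-parameters are illustrative assumptions.

```python
import math

def adagrad_step(w, g, hist, gamma=0.5, eps=1e-8):
    """AdaGrad: divide by the root of the sum of ALL past squared gradients."""
    hist = [h + gi * gi for h, gi in zip(hist, g)]
    w = [wi - gamma * gi / (math.sqrt(h) + eps) for wi, gi, h in zip(w, g, hist)]
    return w, hist

def running_average_step(w, g, beta, gamma=0.01, rho=0.9, eps=1e-8):
    """Running-average rule: exponentially decayed average of squared gradients."""
    beta = [rho * b + (1.0 - rho) * gi * gi for b, gi in zip(beta, g)]
    w = [wi - gamma * gi / math.sqrt(b + eps) for wi, gi, b in zip(w, g, beta)]
    return w, beta

def loss(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)   # badly scaled toy objective

def gradient(w):
    return [w[0], 100.0 * w[1]]

w_a, hist = [1.0, 1.0], [0.0, 0.0]
w_r, beta = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    w_a, hist = adagrad_step(w_a, gradient(w_a), hist)
    w_r, beta = running_average_step(w_r, gradient(w_r), beta)
loss_adagrad, loss_running = loss(w_a), loss(w_r)
```

Because each coordinate is normalized by its own gradient history, the two coordinates make the same progress despite the 100x difference in curvature. The AdaGrad accumulator only grows, which is the continual learning-rate decay noted above, whereas the running average forgets old gradients.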

Dropout and Maxout for Overfitting

Dropout: set the output of each hidden neuron to zero with probability 0.5.

◦ Motivation: combining many different models that share parameters reduces test errors by approximately averaging their predictions, which resembles bagging.

◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation.

◦ So every time an input is presented, the NN samples a different architecture, but all these architectures share weights.

◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units.

◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units.

◦ Without dropout, the network exhibits substantial overfitting.

◦ Dropout roughly doubles the number of iterations required to converge.

Maxout takes the maximum across multiple feature maps;
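Both operations are simple array manipulations, sketched below. The function names are hypothetical, and the test-time scaling by the keep probability is an assumption about how the shared-weight average is approximated; it is not stated on the slide.

```python
import numpy as np

def dropout_train(h, rng, p_drop=0.5):
    """Zero each hidden activation independently with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop     # True for the units that are kept
    return h * mask, mask

def dropout_test(h, p_drop=0.5):
    """At test time keep all units and scale by the keep probability (assumed convention)."""
    return h * (1.0 - p_drop)

def maxout(z, pieces):
    """z has shape (batch, units * pieces); take the max over each group of `pieces` responses."""
    batch, width = z.shape
    return z.reshape(batch, width // pieces, pieces).max(axis=2)

rng = np.random.default_rng(0)
h = np.ones((1000, 20))
dropped, mask = dropout_train(h, rng)

z = np.array([[1.0, 5.0, 2.0, 0.0],
              [3.0, -1.0, -2.0, 4.0]])
m = maxout(z, pieces=2)
```

Each call to `dropout_train` samples a fresh mask, so every presented input sees a different thinned network while all of them share the same weights.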

Weight Decay for Overfitting

Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification,

◦ Weight decay rescales the weights in the learning rule, while bias learning stays the same;

◦ Small weights are preferred, and large weights are allowed only if they improve the original cost function;

◦ A way of compromising between finding small weights and minimizing the original cost function;

In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;

L1 regularization: weights that are not really useful shrink by a constant amount toward zero;

◦ Act like a form of feature selection;

◦ Make the input filters cleaner and easier to interpret;

L2 regularization penalizes large values strongly, while L1 regularization penalizes all weights in proportion to their magnitude;

Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights and hyper-parameters;

Hybrid Monte Carlo: gradient and sampling.

Early Stopping for Overfitting

Steps in early stopping:

◦ Divide the available data into training and validation sets.

◦ Use a large number of hidden units.

◦ Use very small random initial values.

◦ Use a slow learning rate.

◦ Compute the validation error rate periodically during training.

◦ Stop training when the validation error rate "starts to go up".

Early stopping has several advantages:

◦ It is fast.

◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size.

◦ It requires only one major decision by the user: what proportion of validation cases to use.

Practical issues in early stopping:

◦ How many cases do you assign to the training and validation sets?

◦ Do you split the data into training and validation sets randomly or by some systematic algorithm?

◦ How do you tell when the validation error rate "starts to go up"?
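One common way to make "starts to go up" concrete is a patience rule: stop once the validation error has failed to improve for a fixed number of checks. The sketch below is only one such rule; the function name, the patience of 3 and the validation curve are assumptions.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (epoch, error) of the best validation error seen before stopping.

    Training is stopped once `patience` consecutive checks fail to improve
    on the best validation error seen so far.
    """
    best_err, best_epoch = float("inf"), -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch      # new best: remember these weights
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for `patience` checks
    return best_epoch, best_err

# Hypothetical validation-error curve, one value per periodic check.
val_curve = [0.40, 0.31, 0.25, 0.22, 0.23, 0.21, 0.22, 0.24, 0.27, 0.19]
stop_epoch, stop_err = early_stopping_epoch(val_curve)
```

With this curve the rule stops after check 8 and keeps the weights from check 5, never seeing the lower error at check 9, which illustrates why the last practical issue above has no universally correct answer.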

MCMC Sampling for Optimization

Markov Chain: a stochastic process in which future states are independent of past states given the present state.

◦ Markov chain will typically converge to a stable distribution.

Markov Chain Monte Carlo: sampling using ‘local’ information

◦ Devise a Markov chain whose stationary distribution is the target.

◦ Ergodic MC must be aperiodic, irreducible, and positive recurrent.

◦ Monte Carlo Integration to get quantities of interest.

Metropolis-Hastings method: sampling from a target distribution

◦ Create a Markov chain whose transition matrix does not depend on the normalization term.

◦ Make sure the chain has a stationary distribution and that it is equal to the target distribution (via the acceptance ratio).

◦ After a sufficient number of iterations, the chain will converge to the stationary distribution.

Gibbs sampling is a special case of M-H sampling.

◦ The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distributions.

Hybrid Monte Carlo: gradient sub step for each Markov chain.
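A minimal random-walk Metropolis-Hastings sampler shows why the normalization term is never needed: only the ratio of target densities enters the acceptance test. The function name, the Gaussian proposal, the step size and the unnormalized Gaussian target are assumptions for this sketch.

```python
import math
import random

def metropolis_hastings(log_target, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1-D target known up to a constant."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)                  # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)     # normalizer cancels here
        if rng.random() < math.exp(min(0.0, log_ratio)):     # accept with prob min(1, ratio)
            x = proposal
        samples.append(x)                                    # on rejection the old state repeats
    return samples

def log_unnormalized_gaussian(x):
    """Log-density of N(3, 1) with the normalization constant dropped."""
    return -0.5 * (x - 3.0) ** 2

chain = metropolis_hastings(log_unnormalized_gaussian, 20000)
kept = chain[2000:]                                          # discard burn-in samples
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

The sample mean and variance of the retained chain are Monte Carlo estimates of the target's mean (3) and variance (1), i.e. the "Monte Carlo integration" step listed above.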

Mean Field for Optimization

Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution;

Mean Field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (note: F is a disconnected graph);

◦ Density becomes a factorized product distribution in this sub-family.

◦ Objective: K-L divergence.

Mean field is a structured variational approximation approach:

◦ Coordinate ascent (deterministic);

Compared with stochastic approximation (sampling):

◦ Faster, but maybe not exact.

Contrastive Divergence for RBMs

Contrastive divergence (CD) was first proposed for training products of experts (PoE), and is also a quicker way to learn RBMs;

◦ Contrastive divergence as the new objective;

◦ Taking gradients and ignoring a term which is usually very small.

Steps:

◦ Start with a training vector on the visible units.

◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling);

CD learning is biased: it does not work as exact gradient descent;

Improved: Persistent CD explores more modes in the distribution

◦ Rather than starting from data samples, begin sampling from the model samples obtained at the last gradient update.

◦ Still suffers from divergence of the likelihood due to missing modes.

Score matching: the score function does not depend on the normalization factor, so it can be matched between the model and the empirical density.
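The CD steps listed above can be sketched for a binary RBM with a single Gibbs alternation (CD-1). The function names, the learning rate, the layer sizes and the two-pattern toy data are assumptions; using probabilities rather than samples for the hidden statistics is a common variance-reduction choice, not something the slides prescribe.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 step for a binary RBM on a batch v0 of shape (batch, n_visible)."""
    ph0 = sigmoid(v0 @ W + c)                            # P(h=1 | v0): start from the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # update all hidden units in parallel
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)     # update all visible units in parallel
    ph1 = sigmoid(v1 @ W + c)                            # hidden probabilities for the reconstruction
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n           # data statistics minus reconstruction statistics
    b = b + lr * (v0 - v1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def reconstruction_error(v, W, b, c):
    ph = sigmoid(v @ W + c)
    pv = sigmoid(ph @ W.T + b)
    return float(np.mean((v - pv) ** 2))

# Toy data (assumed): two complementary 6-bit patterns, 4 hidden units.
rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=200)]
W = 0.01 * rng.normal(size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
err_before = reconstruction_error(data, W, b, c)
for _ in range(1000):
    batch = data[rng.integers(0, len(data), size=20)]
    W, b, c = cd1_update(batch, W, b, c, rng)
err_after = reconstruction_error(data, W, b, c)
```

A persistent-CD variant would keep `v1` between calls and start the next negative phase from it instead of from the data batch.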

“Wake-Sleep” Algorithm for DBN

Pre-trained DBN is a generative model;

Do a stochastic bottom-up pass (wake phase)

◦ Get samples from the factorial distribution (visible first, then generate hidden);

◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.

Do a few iterations of sampling in the top-level RBM

◦ Adjust the weights in the top-level RBM.

Do a stochastic top-down pass (sleep phase)

◦ Get visible and hidden samples generated by the generative model using data coming from nowhere!

◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

◦ Any guarantee for improvement? No!

The “Wake-Sleep” algorithm tries to make the representation economical to describe (Shannon’s coding theory).

Greedy Layer-Wise Training

Deep networks tend to have more local minima problems than shallow networks during supervised training

Train the first layer using unlabeled data

◦ Supervised or semi-supervised: use more unlabeled data.

Freeze the first layer parameters and train the second layer

Repeat this for as many layers as desired

◦ Build more robust features

Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)

Fine-tune the full network with a supervised approach;

Avoids the problems of training a deep net in a supervised fashion.

◦ Each layer gets full learning

◦ Helps with ineffective early layer learning

◦ Helps with deep network local minima

Why Does Greedy Layer-Wise Training Work?

Take advantage of the unlabeled data;

Regularization Hypothesis

◦ Pre-training is “constraining” parameters in a region relevant to the unsupervised dataset;

◦ Better generalization (representations that better describe unlabeled data are more discriminative for labeled data);

Optimization Hypothesis

◦ Unsupervised training initializes lower-level parameters near localities of better minima than random initialization can.

Only need fine tuning in the supervised learning stage.

Two-Stage Pre-training in DBMs

Pre-training in one stage

◦ Positive phase: clamp observed, sample hidden, using variational approximation (mean-field)

◦ Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC)

Pre-training in two stages

◦ Approximating a posterior distribution over the states of hidden units (a simpler directed deep model such as DBNs or stacked DAE);

◦ Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of hidden units.

◦ Options (CAST, contrastive divergence, stochastic approximation…).
