Learning Correspondence from the Cycle-consistency of Time (CVPR 2019 Oral)
Xiaolong Wang (CMU), Allan Jabri (UC Berkeley), Alexei A. Efros (UC Berkeley)
Task: Visual Correspondence
A Young Student: “What are the three most important problems in computer vision?”
Takeo Kanade: “Correspondence, correspondence, correspondence!”
This paper: a self-supervised method for learning visual correspondence from unlabeled videos.
https://ajabri.github.io/timecycle/
“Correspondence is the glue that links disparate visual percepts into persistent entities and underlies visual reasoning in space and time”
The main idea is to use cycle-consistency in time as a free supervisory signal.
Motivation: Cycle-Consistency
The feature representation and the tracker are complementary:
Learn both the representation and the tracker simultaneously in a self-supervised manner. The learned representation can then be used at test time as a distance metric for correspondence.
In this example, the blue patch in frame t is tracked backward to frame t-2 and then forward back to frame t. The distance between the blue and red patches in frame t serves as the loss function.
In this self-supervised manner, the training data is unlimited.
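The backward-forward cycle can be sketched with toy features and hard nearest-neighbour matching (the feature sizes and positions here are illustrative; the paper's tracker is differentiable, so the loss can actually be backpropagated):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D "feature maps" for frames t-2, t-1, t: 8 positions, 4-dim features
frames = rng.normal(size=(3, 8, 4))
frames /= np.linalg.norm(frames, axis=-1, keepdims=True)

def match(query, feat_map):
    # hard nearest-neighbour matching by dot-product similarity
    return int(np.argmax(feat_map @ query))

start = 5                                  # patch position in frame t
p1 = match(frames[2, start], frames[1])    # track backward: t -> t-1
p0 = match(frames[1, p1], frames[0])       # t-1 -> t-2
q1 = match(frames[0, p0], frames[1])       # track forward: t-2 -> t-1
end = match(frames[1, q1], frames[2])      # t-1 -> t

cycle_error = abs(end - start)  # cycle-consistency signal: drive this to 0
```

Because any video frame can serve as the anchor of such a cycle, the supervision really is unlimited.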
Motivation: Challenges
1. Learning can take shortcuts (e.g. a static tracker) >>> Force re-localizing
2. The cycle may break (e.g. sudden changes in object pose, or occlusions) >>> Skip-cycles
3. Correspondence may be poor early in training (shorter cycles ease learning) >>> Cycles with different lengths
Method: Formulation
Feature encoder: used to find correspondence at test time.
Differentiable tracker: used only for training. It should be weak, so that we are forced to learn a strong representation.
Recurrent Tracking Formulation:
1. Encode the image sequence and the patch to track.
2. Find the most similar patch in the image features.
3. Iterative backward tracking.
4. Iterative forward tracking.
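Step 2 can be made differentiable by replacing the argmax with a softmax over similarities, giving a soft location that gradients can flow through (a sketch; the temperature value and toy features are illustrative, not from the paper):

```python
import numpy as np

def soft_localize(patch_feat, frame_feats, temperature=0.07):
    """Softmax over similarities gives a soft, differentiable location
    of the patch in a frame (the temperature value is illustrative)."""
    sims = frame_feats @ patch_feat              # similarity per position
    w = np.exp((sims - sims.max()) / temperature)
    w /= w.sum()                                 # attention over positions
    expected_pos = (w * np.arange(len(w))).sum() # soft localization
    return expected_pos, w

rng = np.random.default_rng(1)
frame = rng.normal(size=(10, 4))                 # 10 positions, 4-dim features
frame /= np.linalg.norm(frame, axis=1, keepdims=True)
pos, w = soft_localize(frame[7], frame)          # query with feature at position 7
```

The weights peak at the true match, so iterating this step backward and then forward yields the recurrent tracker.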
Method: Learning Objectives
Cycle-consistency (Full)
Cycle-consistency (Skip)
Patch Similarity
Cycles with different lengths, k = 4
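The three terms above are summed over cycles of length i = 1..k (k = 4 in the talk); schematically, with an illustrative weight on the patch-similarity term:

```python
def total_loss(cycle_losses, skip_losses, sim_losses, lam=0.1):
    # One entry per cycle length i = 1..k. lam weights the patch-similarity
    # term; the value 0.1 is illustrative, not taken from the paper.
    return sum(c + s + lam * p
               for c, s, p in zip(cycle_losses, skip_losses, sim_losses))
```

Short cycles give useful gradients early in training, while long and skip cycles supply the harder cases once tracking improves.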
Method: Encoder
The architecture of the encoder determines the type of correspondence.
A mid-level deep feature map is used, which is coarser than pixel space but with sufficient spatial resolution to support tasks that require localization.
• ResNet-50 architecture without the final 3 residual blocks
• Input frames are 240 × 240 pixels; spatial features are thus 30 × 30.
• Patches are randomly cropped to 80 × 80; spatial features are thus 10 × 10.
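The sizes quoted above imply an effective stride of 8 for the truncated encoder; a quick sanity check:

```python
def feature_size(pixels, stride=8):
    # Spatial extent of the mid-level feature map for a stride-8 encoder
    # (ResNet-50 truncated as described in the slide).
    return pixels // stride

assert feature_size(240) == 30   # full frames  -> 30 x 30 feature maps
assert feature_size(80) == 10    # 80x80 patches -> 10 x 10 feature maps
```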
Method: Tracker
Method: Sampler: Theta = (2, 1, 180)
Translation: (2, 1); Rotation: 180°

Image (5 × 5), indexed by (x, y) from zero:
1 2 3 4 3
3 2 0 3 0
0 2 4 4 5
1 1 3 2 2
2 2 2 3 1

Sampling grid (3 × 3) after translating the base grid by (2, 1):
x per row: 2 3 4; y per row: 1, 2, 3

After rotating the grid 180° about its centre:
x per row: 4 3 2; y per row: 3, 2, 1

Sampling coordinates (x, y):
(4,3) (3,3) (2,3)
(4,2) (3,2) (2,2)
(4,1) (3,1) (2,1)

Sampled patch (reading the image at each coordinate):
2 2 3
5 4 4
0 3 0
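The walk-through above can be reproduced with a tiny nearest-neighbour sampler (a sketch; the actual sampler is bilinear so that it is differentiable, but the grid values here are exactly those from the slide, zero-indexed):

```python
import numpy as np

# the 5x5 image from the slide
image = np.array([[1, 2, 3, 4, 3],
                  [3, 2, 0, 3, 0],
                  [0, 2, 4, 4, 5],
                  [1, 1, 3, 2, 2],
                  [2, 2, 2, 3, 1]])

# base 3x3 sampling grid of (x, y) coordinates, zero-indexed
ys, xs = np.meshgrid(np.arange(3), np.arange(3), indexing="ij")

# theta = (tx, ty, angle) = (2, 1, 180 degrees)
tx, ty, angle = 2, 1, np.deg2rad(180)
xs = xs + tx                      # translate: x -> {2, 3, 4}
ys = ys + ty                      # translate: y -> {1, 2, 3}

# rotate the grid about its own centre
cx, cy = xs.mean(), ys.mean()
xr = np.rint(cx + np.cos(angle) * (xs - cx) - np.sin(angle) * (ys - cy)).astype(int)
yr = np.rint(cy + np.sin(angle) * (xs - cx) + np.cos(angle) * (ys - cy)).astype(int)

# nearest-neighbour lookup (the real sampler interpolates bilinearly)
patch = image[yr, xr]
print(patch)
# [[2 2 3]
#  [5 4 4]
#  [0 3 0]]
```

This reproduces the sampled patch shown on the slide.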
Experiments: Setup
Training:
• VLOG dataset: 114K videos, 344 hours
• No annotation, no pre-training, no fine-tuning
Tasks: label propagation from the first frame:
• Video object segmentation (DAVIS 2017)
• Human pose keypoints (JHMDB)
• Instance-level and semantic-level masks (VIP)
Testing: propagation by k-NN
Compared to:
• Baselines: Identity Propagation, Optical Flow, SIFT Flow
• Other self-supervised methods: Video Colorization, Transitive Invariance, DeepCluster
• ImageNet pre-training
• Fully-supervised methods
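Test-time label propagation can be sketched as k-NN voting in the learned feature space (the toy features, labels, and choice of k here are illustrative):

```python
import numpy as np

def propagate_labels(ref_feats, ref_labels, query_feats, k=3):
    """Propagate per-position labels from a reference frame to a query
    frame by k-nearest-neighbour voting in feature space."""
    # cosine similarity between every query position and every ref position
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    qry = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = qry @ ref.T                       # shape (num_query, num_ref)
    out = []
    for row in sims:
        nn = np.argsort(row)[-k:]            # k most similar ref positions
        votes = ref_labels[nn]
        out.append(np.bincount(votes).argmax())  # majority vote
    return np.array(out)

ref = np.eye(4)                              # 4 reference positions
labels = np.array([0, 0, 1, 1])
query = ref + 0.05                           # near-copies of the references
print(propagate_labels(ref, labels, query, k=1))
# [0 0 1 1]
```

The same propagation mechanism serves all three tasks: segmentation masks, pose keypoints, and instance/semantic masks differ only in what the labels encode.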
Experiments: Example
Experiments: Video object segmentation
Experiments: Keypoints propagation
Experiments: Instance-level and semantic-level masks propagation
Experiments: Visualization
Thank you