Learning Correspondence from the Cycle-consistency of Time (CVPR 2019 Oral)
Xiaolong Wang (CMU), Allan Jabri (UC Berkeley), Alexei A. Efros (UC Berkeley)
Task: Visual Correspondence
A Young Student: “What are the three most important problems in computer vision?”
Takeo Kanade: “Correspondence, correspondence, correspondence!”
This paper: a self-supervised method for learning visual correspondence from unlabeled videos.
https://ajabri.github.io/timecycle/
“Correspondence is the glue that links disparate visual percepts into persistent entities and underlies visual reasoning in space and time”
The main idea is to use cycle-consistency in time as a free supervisory signal.
Motivation: Cycle-Consistency
The feature representation and the tracker are complementary:
Learn both the representation and the tracker simultaneously in a self-supervised manner. The learned representation can then be used at test time as a distance metric for correspondence.
In this example, the blue patch in frame t is tracked backward to frame t-2 and then forward back to frame t. The distance between the blue and red patches in frame t serves as the loss function.
In this self-supervised manner, the training data is unlimited.
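The backward-forward cycle can be sketched with toy features and hard nearest-neighbour matching (the feature sizes and positions here are illustrative; the paper's tracker is differentiable, so the loss can actually be backpropagated):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D "feature maps" for frames t-2, t-1, t: 8 positions, 4-dim features
frames = rng.normal(size=(3, 8, 4))
frames /= np.linalg.norm(frames, axis=-1, keepdims=True)

def match(query, feat_map):
    # hard nearest-neighbour matching by dot-product similarity
    return int(np.argmax(feat_map @ query))

start = 5                                  # patch position in frame t
p1 = match(frames[2, start], frames[1])    # track backward: t -> t-1
p0 = match(frames[1, p1], frames[0])       # t-1 -> t-2
q1 = match(frames[0, p0], frames[1])       # track forward: t-2 -> t-1
end = match(frames[1, q1], frames[2])      # t-1 -> t

cycle_error = abs(end - start)  # cycle-consistency signal: drive this to 0
```

Because any video frame can serve as the anchor of such a cycle, the supervision really is unlimited.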
Motivation: Challenges
1. Learning can take shortcuts (e.g. a static tracker) >>> Force re-localizing
2. The cycle may break (e.g. sudden changes in object pose, or occlusions) >>> Skip-cycles
3. Correspondence may be poor early in training (shorter cycles ease learning) >>> Cycles with different lengths
Method: Formulation
Feature encoder: used to find correspondence at test time.
Differentiable tracker: used only for training. It should be weak, so that we are forced to learn a strong representation.
Recurrent Tracking Formulation:
1. Encode the image sequence and the patch to track.
2. Find the most similar patch in the image features.
3. Iterative backward tracking.
4. Iterative forward tracking.
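Step 2 can be made differentiable by replacing the argmax with a softmax over similarities, giving a soft location that gradients can flow through (a sketch; the temperature value and toy features are illustrative, not from the paper):

```python
import numpy as np

def soft_localize(patch_feat, frame_feats, temperature=0.07):
    """Softmax over similarities gives a soft, differentiable location
    of the patch in a frame (the temperature value is illustrative)."""
    sims = frame_feats @ patch_feat              # similarity per position
    w = np.exp((sims - sims.max()) / temperature)
    w /= w.sum()                                 # attention over positions
    expected_pos = (w * np.arange(len(w))).sum() # soft localization
    return expected_pos, w

rng = np.random.default_rng(1)
frame = rng.normal(size=(10, 4))                 # 10 positions, 4-dim features
frame /= np.linalg.norm(frame, axis=1, keepdims=True)
pos, w = soft_localize(frame[7], frame)          # query with feature at position 7
```

The weights peak at the true match, so iterating this step backward and then forward yields the recurrent tracker.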
Method: Learning Objectives
Cycle-consistency (Full)
Cycle-consistency (Skip)
Patch Similarity
Cycles with different lengths, k = 4
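The three terms above are summed over cycles of length i = 1..k (k = 4 in the talk); schematically, with an illustrative weight on the patch-similarity term:

```python
def total_loss(cycle_losses, skip_losses, sim_losses, lam=0.1):
    # One entry per cycle length i = 1..k. lam weights the patch-similarity
    # term; the value 0.1 is illustrative, not taken from the paper.
    return sum(c + s + lam * p
               for c, s, p in zip(cycle_losses, skip_losses, sim_losses))
```

Short cycles give useful gradients early in training, while long and skip cycles supply the harder cases once tracking improves.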
Method: Encoder
The architecture of the encoder determines the type of correspondence.
A mid-level deep feature map is used, which is coarser than pixel space but with sufficient spatial resolution to support tasks that require localization.
• ResNet-50 architecture without the final 3 residual blocks
• Input frames are 240 × 240 pixels; spatial features are thus 30 × 30.
• Patches are randomly cropped to 80 × 80; spatial features are thus 10 × 10.
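The sizes quoted above imply an effective stride of 8 for the truncated encoder; a quick sanity check:

```python
def feature_size(pixels, stride=8):
    # Spatial extent of the mid-level feature map for a stride-8 encoder
    # (ResNet-50 truncated as described in the slide).
    return pixels // stride

assert feature_size(240) == 30   # full frames  -> 30 x 30 feature maps
assert feature_size(80) == 10    # 80x80 patches -> 10 x 10 feature maps
```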
Method: Tracker
Method: Sampler: Theta = (2, 1, 180)
Translation: (2, 1); Rotation: 180°

Image (5 × 5), indexed by (x, y) from zero:
1 2 3 4 3
3 2 0 3 0
0 2 4 4 5
1 1 3 2 2
2 2 2 3 1

Sampling grid (3 × 3) after translating the base grid by (2, 1):
x per row: 2 3 4; y per row: 1, 2, 3

After rotating the grid 180° about its centre:
x per row: 4 3 2; y per row: 3, 2, 1

Sampling coordinates (x, y):
(4,3) (3,3) (2,3)
(4,2) (3,2) (2,2)
(4,1) (3,1) (2,1)

Sampled patch (reading the image at each coordinate):
2 2 3
5 4 4
0 3 0
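The walk-through above can be reproduced with a tiny nearest-neighbour sampler (a sketch; the actual sampler is bilinear so that it is differentiable, but the grid values here are exactly those from the slide, zero-indexed):

```python
import numpy as np

# the 5x5 image from the slide
image = np.array([[1, 2, 3, 4, 3],
                  [3, 2, 0, 3, 0],
                  [0, 2, 4, 4, 5],
                  [1, 1, 3, 2, 2],
                  [2, 2, 2, 3, 1]])

# base 3x3 sampling grid of (x, y) coordinates, zero-indexed
ys, xs = np.meshgrid(np.arange(3), np.arange(3), indexing="ij")

# theta = (tx, ty, angle) = (2, 1, 180 degrees)
tx, ty, angle = 2, 1, np.deg2rad(180)
xs = xs + tx                      # translate: x -> {2, 3, 4}
ys = ys + ty                      # translate: y -> {1, 2, 3}

# rotate the grid about its own centre
cx, cy = xs.mean(), ys.mean()
xr = np.rint(cx + np.cos(angle) * (xs - cx) - np.sin(angle) * (ys - cy)).astype(int)
yr = np.rint(cy + np.sin(angle) * (xs - cx) + np.cos(angle) * (ys - cy)).astype(int)

# nearest-neighbour lookup (the real sampler interpolates bilinearly)
patch = image[yr, xr]
print(patch)
# [[2 2 3]
#  [5 4 4]
#  [0 3 0]]
```

This reproduces the sampled patch shown on the slide.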
Experiments: Setup
Training:
• VLOG dataset: 114K videos, 344 hours
• No annotation, no pre-training, no fine-tuning
Tasks: label propagation from the first frame:
• Video object segmentation (DAVIS 2017)
• Human pose keypoints (JHMDB)
• Instance-level and semantic-level masks (VIP)
Testing: propagation by k-NN
Compared to:
• Baselines: Identity Propagation, Optical Flow, SIFT Flow
• Other self-supervised methods: Video Colorization, Transitive Invariance, DeepCluster
• ImageNet pre-training
• Fully-supervised methods
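Test-time label propagation can be sketched as k-NN voting in the learned feature space (the toy features, labels, and choice of k here are illustrative):

```python
import numpy as np

def propagate_labels(ref_feats, ref_labels, query_feats, k=3):
    """Propagate per-position labels from a reference frame to a query
    frame by k-nearest-neighbour voting in feature space."""
    # cosine similarity between every query position and every ref position
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    qry = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = qry @ ref.T                       # shape (num_query, num_ref)
    out = []
    for row in sims:
        nn = np.argsort(row)[-k:]            # k most similar ref positions
        votes = ref_labels[nn]
        out.append(np.bincount(votes).argmax())  # majority vote
    return np.array(out)

ref = np.eye(4)                              # 4 reference positions
labels = np.array([0, 0, 1, 1])
query = ref + 0.05                           # near-copies of the references
print(propagate_labels(ref, labels, query, k=1))
# [0 0 1 1]
```

The same propagation mechanism serves all three tasks: segmentation masks, pose keypoints, and instance/semantic masks differ only in what the labels encode.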
Experiments: Example
Experiments: Video object segmentation
Experiments: Keypoints propagation
Experiments: Instance-level and semantic-level masks propagation
Experiments: Visualization
Thank you