learning graph representations for video...

64
Learning Graph Representations for Video Understanding Xiaolong Wang Carnegie Mellon University

Upload: others

Post on 14-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Learning Graph Representations for Video Understanding

Xiaolong WangCarnegie Mellon University

Page 2: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Computer Vision

Dog

He et al. Mask R-CNN. ICCV 2017.Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.

Page 3: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Deep Learning

Mushroom Dog Ant Jelly Fungus Nest

ImageNet

Image

Mushroom

Dog

Ant

Jelly Fungus

Nest

Train a Convolutional Neural Network

Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.

Page 4: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Convolutional Neural Networks

Figure credit: Van Den Oord et al.

• Convolution is local

• Long-range Pairwise relations are not modeled

Page 5: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Related Work: Relation Networks

[Santoro et al, 2017]

Page 6: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Related Work: Self-Attention

[Vaswani et al, 2017]

Page 7: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Related Work: Graph Convolution Networks

[Kipf et al, 2017]

Page 8: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

This Tutorial

• Perform connections on different graph/relation networks

• Under the application of video understanding

• Both supervised and self-supervised methods

Page 9: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Video Recognition

3D Conv

3D Conv

3D Conv

Playing Soccer

Page 10: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Reasoning for Action Recognition

X. Wang , R. Girshick , A. Gupta, and K. He. Non-local Neural Networks. CVPR 2018.

Long-rang explicit reasoning

Page 11: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Means

𝑝

𝑞2

𝑞1

𝑞3

Buades et al. A non-local algorithm for image denoising. CVPR, 2005.

Page 12: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

Operation in feature space

Can be embedded into any ConvNets

𝑥𝑖 𝑥𝑗

Page 13: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

𝑥𝑖 𝑥𝑗

Affinity Features

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

Page 14: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

14𝑥

𝜃: 1 × 1× 1

𝜙: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512

𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 𝑇𝐻𝑊

𝑇𝐻𝑊

512𝑇𝐻𝑊

512

𝑇𝐻𝑊

𝑇𝐻𝑊× =

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

𝑇 × 𝐻 × 𝑊 × 512

Page 15: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

15𝑥

𝜃: 1 × 1× 1

𝜙: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512

𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 𝑇𝐻𝑊

𝑇𝐻𝑊

512𝑇𝐻𝑊

512

𝑇𝐻𝑊

𝑇𝐻𝑊× =

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

𝑇 × 𝐻 × 𝑊 × 512

Page 16: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

16𝑥

𝜃: 1 × 1× 1

𝜙: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512

𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 𝑇𝐻𝑊

normalize

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

𝑇 × 𝐻 × 𝑊 × 512

Page 17: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

17𝑥

𝜃: 1 × 1× 1

𝜙: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512

𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 𝑇𝐻𝑊

normalize

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

𝑇 × 𝐻 × 𝑊 × 512

𝑓 𝑥𝑖 , 𝑥𝑗 = exp(𝑥𝑖𝑇𝑥𝑗)

𝐶(𝑥) = ∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗

𝑓 𝑥𝑖 , 𝑥𝑗

𝐶(𝑥)=

exp(𝑥𝑖𝑇𝑥𝑗)

∀𝑗 exp(𝑥𝑖𝑇𝑥𝑗)

Page 18: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

18𝑥

𝜃: 1 × 1× 1

𝜙: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512

𝑔: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512

𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 512

𝑇𝐻𝑊 × 512normalize

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

𝑇 × 𝐻 × 𝑊 × 512

Page 19: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator

19𝑥

𝜃: 1 × 1× 1

𝜙: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512

𝑔: 1 × 1× 1

𝑇 × 𝐻 × 𝑊 × 512

𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 𝑇𝐻𝑊

𝑇𝐻𝑊 × 512

𝑇𝐻𝑊 × 512normalize

𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

𝑇 × 𝐻 × 𝑊 × 512

Page 20: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Non-local Operator as A Residual Block

3D Conv

3DConv

Non-local3D

ConvNon-local

VideoAction Class

𝑧𝑖 = 𝑦𝑖𝑊 + 𝑥𝑖

Page 21: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Examples

Page 22: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Action Recognition in Daily Lives

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.

Charades Dataset: 157 classes, 9.8k videos, 30s per video

We let the people upload their own videos!

Page 23: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Action Recognition on Charades

Method mAP

3D Conv 31.8%

3D Conv + Non-local 33.5%

Page 24: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Opening A Book

24

Page 25: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Opening A Book

25

The Non-local Block

Page 26: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Opening A Book

Object states changes over time

Human-object, object-object interactions

X. Wang and A. Gupta. Video as Space-Time Region Graphs. ECCV 2018.

Page 27: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Opening A Book

27

A1

B1

A2

B2

A3

B3

A4

B4

Highly Correlated

Page 28: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Relations between Regions

Page 29: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Relations between Regions

𝑓 𝑥𝑖 , 𝑥𝑗 = 𝜙 𝑥𝑖𝑇 𝜙′(𝑥𝑗)

𝐺𝑖𝑗 =exp 𝑓 𝑥𝑖, 𝑥𝑗

∀𝑗 exp 𝑓 𝑥𝑖, 𝑥𝑗

Page 30: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Graph Convolutional Network

𝑍 = 𝐺𝑋𝑊

Kipf. Semi-Supervised Classification with Graph Convolutional Networks. 2017

𝑁

𝐺𝑁 × 𝑋𝑁

𝑑

𝑑

𝑑

𝑊×𝑍𝑁

𝑑

=

Page 31: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Graph Convolutional Network

31

Propagation

Page 32: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Connecting Non-local and GCN

𝑧𝑖 = 𝑦𝑖𝑊 + 𝑥𝑖𝑦𝑖 =1

𝐶(𝑥)

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)

=

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗

∀𝑗 𝑓 𝑥𝑖 , 𝑥𝑗𝑔(𝑥𝑗)

=

∀𝑗

𝐺𝑖𝑗 𝑔(𝑥𝑗)

=

∀𝑗

𝐺𝑖𝑗 𝑔(𝑥𝑗)𝑊 + 𝑥𝑖

𝑍 = 𝐺 𝑔 𝑋 𝑊 + 𝑋

The Non-local Operator:

The Graph Convolution

Page 33: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Action Recognition on Charades

33

Method mean AP

3D Conv 31.8%

3D Conv + Non-local 33.5%

3D Conv + Region Graph 36.2%

+4.4%

Page 34: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Action Recognition on Charades

34

30%

35%

40%

45%

Involves Objects ?

No Yes

3D Conv

3D Conv + Graph

Page 35: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Action Recognition on Charades

35

30%

35%

40%

45%

Pose Variances

3D Conv

3D Conv + Graph

Page 36: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Connection to Mean-Shift

𝑦𝑖 =

∀𝑗

𝑓 𝑥𝑖 , 𝑥𝑗

∀𝑗 𝑓 𝑥𝑖 , 𝑥𝑗𝑔(𝑥𝑗)

The Non-local Operator:

The Mean-Shift Clustering:

𝑚(𝑥) =

𝑥𝑗∈𝑁(𝑥)

𝐾 𝑥, 𝑥𝑗

𝑥𝑗∈𝑁(𝑥) 𝐾 𝑥, 𝑥𝑗𝑥𝑗

Converging to the same mean?

https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering

Page 37: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Recent Related Work

Actor-Centric Relation Network[Sun et al, 2018]

Video Action Transformer Network[Girdhar et al, 2019]

Long-Term Feature Banks for Detailed Video Understanding[Wu et al, 2019]

Page 38: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Learning Affinity with Semantic Supervision

Page 39: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Learn Correspondencewithout Human Supervision

Goal:

Page 40: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

The visual world exhibits continuity

Page 41: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Prior Work: Learning from Time

Predict Color in Time [Vondrick et al, 2018]

Inputs

Outputs

Predict Pixel in Time [Mathieu et al, 2015]

Predict Arrow of Time [Wei et al, 2018]

Page 42: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Using Tracking to Learn Features

CNN CNN

Similarity

Tracking → Similarity[Wang et al, 2015]

Page 43: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Using Tracking to Learn Features

CNN CNN

Similarity

Tracking → Similarity[Wang et al, 2015]

Limited by Off-the-shelf Trackers

Page 44: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Similarity requires tracking Tracking requires similarity

Let’s jointly learn both!

Page 45: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Learning to Track

How to obtain supervision?

ℱℱℱ

ℱ: a deep tracker

Page 46: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Supervision: Cycle-Consistency in Time

Track backwards

Track forwards, back to the future

ℱℱ

ℱℱℱ

Page 47: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Backpropagation through time along the cycle

Supervision: Cycle-Consistency in Time

ℱℱ

ℱℱℱ

Page 48: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Differentiable Tracking

48

Encoder

𝜙

Encoder

𝜙

transpose

Patch feature in time 𝑡: 𝑥𝑡𝑝

Image feature in time 𝑡 − 1: 𝑥𝑡−1𝐼

100

900× =900

𝑐100

𝑐

𝑥𝑡−1𝐼 𝑥𝑡

𝑝

Page 49: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Spatial

Transformer

𝜃Cropping

Differentiable Tracking

49

Encoder

𝜙

Encoder

𝜙

transpose

Patch feature in time 𝑡: 𝑥𝑡𝑝

Image feature in time 𝑡 − 1: 𝑥𝑡−1𝐼

Patch feature in time 𝑡 − 1: 𝑥𝑡−1𝑝

Page 50: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Spatial

Transformer

𝜃Cropping

Differentiable Tracking

50

Encoder

𝜙

Encoder

𝜙

transpose

𝑥𝑡−1𝑝

= ℱ(𝑥𝑡−1𝐼 , 𝑥𝑡

𝑝)

Page 51: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Recurrent Tracking

51

𝑥𝑡𝑝

𝑥𝑡𝑝

𝑡 − 1

𝑡

𝑡 − 1

𝑡 − 2

𝑡 − 2

𝑡 − 3

ℒ𝑐𝑦𝑐𝑙𝑒

Page 52: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Cycle-Consistency Loss Function

ℒ𝑐𝑦𝑐𝑙𝑒 = ||𝐿𝑜𝑐 𝑥𝑡𝑝

− 𝐿𝑜𝑐 𝑥𝑡𝑝

||22

𝑥𝑡𝑝

𝑥𝑡𝑝

ℱℱℱ

ℱℱ

Page 53: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Multiple Cycles

Sub-cycles: a natural curriculum

Page 54: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Skip Cycles

Skip-cycles: skipping occlusions

Page 55: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Visualization of Training

Page 56: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Test Time: Nearest Neighbors in Feature Space

𝑡 − 1 𝑡

Page 57: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

𝑡 − 1 𝑡

Test Time: Nearest Neighbors in Feature Space

Page 58: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Instance Mask Tracking

DAVIS Dataset

DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

Page 59: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Pose Keypoint Tracking

JHMDB Dataset

Page 60: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Comparison

Our Correspondence Optical Flow

Page 61: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Pose Keypoint Tracking

JHMDB Dataset

Method PCK @.1

Optical Flow 45%

Vondrick et al. 45%

Ours 58%

Vondrick et al. Tracking Emerges by Colorizing Videos. ECCV 2018.

Page 62: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Texture Tracking

DAVIS Dataset

DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

Page 63: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Semantic Masks Tracking

Video Instance Parsing Dataset

Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.

Page 64: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03  · Action Recognition in Daily Lives Gunnar A. Sigurdsson,

Questions?