End-to-end Learning of Action Detection from Frame Glimpse in Videos

Ezgi Pekşen Soysal

Hacettepe University

BIL722 - Advanced Topics in Computer Vision

Post on 25-May-2020

Olga Russakovsky: The human side of computer vision

Task: what is the person doing?

Input: a video from t = 0 to t = T

Output: the actions over time (Running, Talking)

Accuracy, Efficiency, Interpretability

Efficient video processing

Timeline: t = 0 to t = T, with the actions Talking and Running

[N. I. Badler. “Temporal Scene Analysis…” 1975]

“Knowing the output or the final state… there is no need to explicitly store many previous states”


Dominant paradigm: sliding windows

Used in all THUMOS challenge action detection entries

[OneVerSch 2014] [WanQiaTan 2014] [KarSeiBim 2014] [YuaPeiNiMouKas 2015]
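The sliding-window paradigm named above can be sketched as follows. This is a minimal illustration and not any particular THUMOS entry: the function name `sliding_windows`, the window lengths and the stride ratio are assumptions made for the example. It enumerates windows at several temporal scales and counts how many frame evaluations that implies.

```python
def sliding_windows(num_frames, window_lengths=(16, 32, 64), stride_ratio=0.5):
    """Enumerate (start, end) frame windows at several temporal scales.

    All parameter values are illustrative assumptions, not taken from the talk.
    """
    windows = []
    for length in window_lengths:
        stride = max(1, int(length * stride_ratio))
        # Slide a window of this length across the whole video.
        for start in range(0, max(1, num_frames - length + 1), stride):
            windows.append((start, min(start + length, num_frames)))
    return windows


windows = sliding_windows(num_frames=300)
# Total number of frame evaluations if every window is scored independently.
frames_scored = sum(end - start for start, end in windows)
print(len(windows), frames_scored)
```

With overlapping windows at several scales, each frame is scored several times, which is the cost that a glimpse-based model tries to avoid.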


“Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”

[N. I. Badler. “Temporal Scene Analysis…” 1975]


Our model for efficient action detection [YeuRusMorFei CVPR’16]

Frame model (repeated at each glimpsed frame between t = 0 and t = T)

Input: A frame

Convolutional neural network (frame information)

Recurrent neural network (time information)

Optional output: Detection instance [start, end]

Output: Next frame to glimpse
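The glimpse loop described on these slides can be sketched roughly as below. This is a toy under stated assumptions, not the authors' implementation: the weights are random and untrained, `frame_features` stands in for the convolutional network, a plain tanh recurrence stands in for the recurrent network, and every name and size (`FEAT`, `HID`, `run_glimpses`, `num_glimpses`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 8, 16  # hypothetical feature and hidden sizes

# Random, untrained weights: placeholders for learned parameters.
W_in = rng.normal(scale=0.1, size=(HID, FEAT + 1))
W_h = rng.normal(scale=0.1, size=(HID, HID))
W_det = rng.normal(scale=0.1, size=(2, HID))  # hidden state -> [start, end]
W_emit = rng.normal(scale=0.1, size=HID)      # hidden state -> emit probability
W_loc = rng.normal(scale=0.1, size=HID)       # hidden state -> next location


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frame_features(video, index):
    """Stand-in for the CNN: the toy video already stores per-frame features."""
    return video[index]


def run_glimpses(video, num_glimpses=6):
    """Observe a few frames; return glimpsed indices and emitted detections."""
    T = len(video)
    h = np.zeros(HID)
    index, glimpsed, detections = 0, [], []
    for _ in range(num_glimpses):
        glimpsed.append(index)
        # Input: one frame (plus its normalised position in the video).
        x = np.append(frame_features(video, index), index / (T - 1))
        h = np.tanh(W_in @ x + W_h @ h)            # recurrent update (time information)
        start, end = np.sort(sigmoid(W_det @ h))   # candidate [start, end] in [0, 1]
        if rng.random() < sigmoid(W_emit @ h):     # optional output: emit or skip
            detections.append((start * (T - 1), end * (T - 1)))
        # Output: next frame to glimpse, sampled around a predicted location.
        loc = np.clip(sigmoid(W_loc @ h) + rng.normal(scale=0.1), 0.0, 1.0)
        index = int(round(loc * (T - 1)))
    return glimpsed, detections


video = rng.normal(size=(100, FEAT))  # 100 frames of made-up features
glimpsed, detections = run_glimpses(video)
print(glimpsed, detections)
```

The emit decision and the next location are sampled rather than computed deterministically, which is why those two outputs cannot be trained by ordinary backpropagation.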

Training the detection instance output [YeuRusMorFei CVPR’16]

Training data: positive video and negative video, each on a timeline t = 0 to t = T, with ground-truth instances g1, g2

Detections: d1, d2, d3, d4

Reward for detection:

y1 = 1, y2 = 1, y3 = 2, y4 = 0: cross-entropy classification loss

L2 distance localization loss

Aside:

• effective video annotation [YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15]

• weakly supervised detection [YeuRamRusMorFei InPreparation]
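The two loss terms can be illustrated with a small numeric sketch. Several things here are assumptions rather than facts from the slides: that y_n is the index of the ground-truth instance matched to detection d_n (0 meaning no match), that the classification target is 1 for matched detections and 0 otherwise, that the L2 term applies only to matched detections, and the names `detection_loss` and `lam`. All interval values are made up.

```python
import numpy as np


def detection_loss(confidences, detections, matches, ground_truth, lam=1.0):
    """Cross-entropy on confidences plus L2 localization on matched detections.

    confidences:  (N,) predicted probability that each detection is correct
    detections:   (N, 2) predicted [start, end]
    matches:      (N,) assumed index y_n of the matched ground truth, 0 = none
    ground_truth: dict mapping index m to the (start, end) of instance g_m
    """
    confidences = np.asarray(confidences, dtype=float)
    detections = np.asarray(detections, dtype=float)
    matches = np.asarray(matches)
    targets = (matches > 0).astype(float)
    eps = 1e-12  # guards against log(0)
    cls = -(targets * np.log(confidences + eps)
            + (1 - targets) * np.log(1 - confidences + eps)).sum()
    loc = 0.0
    for d, y in zip(detections, matches):
        if y > 0:  # only matched detections are pulled towards a ground truth
            loc += np.linalg.norm(d - np.asarray(ground_truth[int(y)], dtype=float))
    return cls + lam * loc, cls, loc


ground_truth = {1: (10.0, 30.0), 2: (60.0, 80.0)}        # g1, g2 (made-up values)
detections = [(12.0, 29.0), (8.0, 33.0), (61.0, 78.0), (40.0, 50.0)]  # d1..d4
confidences = [0.9, 0.6, 0.8, 0.1]
matches = [1, 1, 2, 0]                                    # y1 = 1, y2 = 1, y3 = 2, y4 = 0
total, cls, loc = detection_loss(confidences, detections, matches, ground_truth)
print(total, cls, loc)
```

Both terms are differentiable in the network outputs, so this part of the model can be trained with standard backpropagation.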

Training the non-differentiable outputs [YeuRusMorFei CVPR’16]

Training data and detections on the timeline t = 0 to t = T

Model’s action sequence a: Frame 1, go to frame 8; Frame 8, go to frame 6; Frame 6, go to frame 15; Frame 15

(1) whether to predict a detection (d1, d2, d3 in the example)

(2) where to look next

Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]

Reward for an action sequence a: in the example, two of the predicted detections are marked bad and one good

Objective:

Gradient:

Monte-Carlo approximation:
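In the standard REINFORCE formulation [Williams 1992], the three quantities named above take the following form. This is a sketch in assumed notation, and the slide's own notation may differ: p_θ(a) is the probability of action sequence a under policy parameters θ, r(a) is its reward, and a^(i) is the i-th of K sequences sampled from the current policy.

```latex
% Objective: expected reward over action sequences
J(\theta) = \sum_{a} p_\theta(a)\, r(a)

% Gradient: score-function (log-derivative) identity
\nabla_\theta J(\theta) = \sum_{a} p_\theta(a)\, \nabla_\theta \log p_\theta(a)\, r(a)

% Monte-Carlo approximation with K sampled action sequences
\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} \nabla_\theta \log p_\theta\big(a^{(i)}\big)\, r\big(a^{(i)}\big)
```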

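A REINFORCE update can be tried out on a deliberately tiny version of the "where to look next" decision. The sketch below is an assumption-heavy toy, not the training procedure from the paper: the policy makes a single choice among 20 frames, the reward is 1 when the chosen frame falls inside a made-up action interval, and the batch-mean baseline, learning rate and function name `reinforce` are all choices made for the example.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def reinforce(reward, num_actions, iters=300, batch=64, lr=1.0, seed=0):
    """Monte-Carlo policy gradient for a one-step 'where to look next' policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_actions)  # policy parameters (softmax logits)
    for _ in range(iters):
        probs = softmax(theta)
        # Sample a batch of actions from the current policy.
        actions = rng.choice(num_actions, size=batch, p=probs)
        rewards = reward[actions]
        baseline = rewards.mean()  # variance reduction; an added assumption
        grad = np.zeros(num_actions)
        for a, r in zip(actions, rewards):
            grad_log_p = -probs.copy()
            grad_log_p[a] += 1.0   # gradient of log softmax w.r.t. the logits
            grad += (r - baseline) * grad_log_p
        theta += lr * grad / batch  # gradient ascent on the expected reward
    return softmax(theta)


# Toy video of 20 frames; glimpsing frames 12..15 (inside the action) is rewarded.
reward = np.zeros(20)
reward[12:16] = 1.0
policy = reinforce(reward, num_actions=20)
print(policy[12:16].sum())  # probability mass placed on rewarded frames
```

Only sampled actions and their rewards enter the update, so no gradient has to flow through the discrete choice itself.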

[YeuRusMorFei CVPR’16]

Accuracy

Detection AP at IOU 0.5

Dataset              State-of-the-art   Our result
THUMOS 2014          14.4               17.1
ActivityNet sports   33.2               36.7
ActivityNet work     31.1               39.9

Efficiency ✓ Glimpse only 2% of video frames

Detection AP at IOU 0.5

Sampling        AP
Uniform         9.3
Our glimpses    17.1

Interpretability ✓

[Figure: Javelin throw example showing ground truth, detections, glimpses and frames]
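The "AP at IOU 0.5" criterion rests on the temporal intersection-over-union between a detection and a ground-truth interval. The sketch below shows that overlap test and a simple greedy matching; the function names, the made-up intervals and the use of `>=` at the threshold are assumptions, and the full average-precision computation over a ranked list is not shown.

```python
def temporal_iou(a, b):
    """Intersection over union of two [start, end] intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(detections, ground_truth, threshold=0.5):
    """Greedily match detections (assumed sorted by confidence) to ground truth."""
    unmatched = list(ground_truth)
    hits = 0
    for det in detections:
        best = max(unmatched, key=lambda g: temporal_iou(det, g), default=None)
        if best is not None and temporal_iou(det, best) >= threshold:
            unmatched.remove(best)  # each ground-truth instance is matched once
            hits += 1
    return hits


ground_truth = [(10.0, 30.0), (60.0, 80.0)]
detections = [(12.0, 29.0), (8.0, 33.0), (55.0, 95.0)]
print(temporal_iou(detections[0], ground_truth[0]))
print(count_true_positives(detections, ground_truth))
```

In this example the second detection overlaps the first ground-truth instance well but counts as a false positive, because that instance has already been matched.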

[Figure: AI vision system. Training data (cars, dogs, people), Algorithm, Test data (Car, Person, Building)]

Reinforcement learning for human action detection [YeuRusMorFei CVPR’16]

Open-world probabilistic human-in-the-loop model [RusLiFei CVPR’15]

The human cost of developing computer vision expertise [RusDenSuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]

Takeaways

Data is critical

Beyond classification

Resource allocation

Broader context

[Figure: label matrix]

Table   Chair   Bowl   Dog   Cat   …
+       +       -      -     -     -
+       -       +      -     +     -
+       +       -      -     -     -
-       -       -      +     -     -

What next?

Open-world recognition

Challenging objects

Long-tail distributions

Designing open-world evaluation frameworks

Large-scale video analysis

Formulating the right temporal questions

Multi-task models

[Figure label: Playing tennis]

Visual fluents

Collaborative visual systems

Understanding human intention

Knowledge acquisition

Effective teaching

AI will change the world. Who will change AI?

Stanford Artificial Intelligence Laboratory’s Outreach Summer (SAILORS) program

24 high school girls, 2 weeks

Rigorous AI curriculum emphasizing humanistic applications

http://sailors.stanford.edu

[VacWuChaRusSomFei SIGCSE’16]

Acknowledgements

Jia Deng, Alexander Berg, Fei-Fei Li, Sean Ma, Jonathan Krause, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Jia Li, Yuanqing Lin, Ellen Klingbeil, Vittorio Ferrari, Amy Bearman, Blake Carpenter, Davide Modolo, Jenny Jin, Misha Andriluka, Hao Su, Sanjeev Satheesh, Andrew Ng, Kai Yu, Greg Mori, Serena Yeung, Sasha Vezhnevets, Deva Ramanan, Abhinav Gupta, Vignesh Ramanathan

http://cs.cmu.edu/~orussako

Questions?

[RusDenHuaBerFei ICCV’13] [KliCarRusNg ICRA’10]

Strongly supervised: [RusNg CVPR’10] [RusFei ECCVW’10] [RusGupRam UnderReview] [ModVezRusFei CVPR’15] [YeuRusJinAndMorFei UnderReview]

Weakly supervised: [RusLinYuFei ECCV’12] [BeaRusFerFei UnderReview]

[RusLiFei CVPR’15] [YeuRusMorFei CVPR’16] [RusDenSuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]