

Finding Time Together: Detection and Classification of Focused Interaction in Egocentric Video

Sophia Bano, Jianguo Zhang, Stephen J. McKenna, {s.bano, j.n.zhang, s.j.z.mckenna}@dundee.ac.uk

Computer Vision and Image Processing Group, Computing, School of Science and Engineering, University of Dundee, United Kingdom

International Conference on Computer Vision 2017

1. Introduction

Focused Interaction (FI)

• Co-present individuals, having mutual focus of attention, interact by establishing face-to-face engagement and direct conversation [1]

Hypothesis

• Fusion of multimodal features can improve overall FI detection

Challenges

• Face-to-face engagement often not maintained

• Conversational partner not always present in the video frame

• Varying illumination

• Varied scenes

Examples from our Focused Interaction Dataset

Existing methods

• Off-line processing of video clips or photo streams captured under fairly constrained conditions, with interacting people always in view [2, 3, 4]

2. Focused Interaction Detection using Multimodal Features
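A minimal sketch of this pipeline, assuming per-frame features have already been extracted. The random feature values, the 30-frame window length and the random labels below are illustrative stand-ins, not the poster's data; the real features come from the detectors named in the poster (HOG/KLT face tracking, VAD [6], HOOF [5]):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames, hoof_bins = 600, 32

# Hypothetical per-frame features (stand-ins for real extractor output)
face_score = rng.random((n_frames, 1))   # face tracker score
vad_score = rng.random((n_frames, 1))    # voice activity score
hoof_feat = rng.random((n_frames, hoof_bins))  # camera motion histogram
labels = rng.integers(0, 3, n_frames)    # 0: no FI, 1: FI-NW, 2: FI-W

# 1. Feature concatenation (multimodal fusion)
frame_feats = np.hstack([face_score, vad_score, hoof_feat])

# 2. Temporal windowing: average features over fixed-length windows
win = 30                                  # window length is an assumption
n_win = n_frames // win
X = frame_feats[: n_win * win].reshape(n_win, win, -1).mean(axis=1)
# majority label per window
y = np.array([np.bincount(labels[i * win:(i + 1) * win]).argmax()
              for i in range(n_win)])

# 3. SVM classification; RBF kernel here, linear via SVC(kernel="linear")
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
pred = clf.predict(X)
```

Swapping `kernel="rbf"` for `kernel="linear"` reproduces the poster's linear-vs-RBF comparison on the same windowed features.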

3. Evaluation

Focused Interaction Dataset

• 19 egocentric videos (378 mins) captured using a shoulder-mounted GoPro Hero4 at 18 different locations and with 16 conversational partners

Observations

• Fusion of multimodal features is useful for discriminating no FI and FI (walk) when using SVM with RBF kernel

• Face track and VAD scores are significant for discriminating FI (non-walk)

Limitations

• Sound from nearby surroundings influenced the VAD

• Low illumination scenarios affected the face tracker

HOG: Histogram of Oriented Gradients

KLT: Kanade-Lucas-Tomasi Tracker

HOOF: Histogram of Oriented Optical Flow [5]

VAD: Voice Activity Detection [6]
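As an illustration of the HOOF camera-motion feature, a simplified orientation histogram over a dense flow field can be computed as below. This is a sketch only: the published HOOF [5] additionally folds left/right-symmetric directions together, which is omitted here.

```python
import numpy as np

def hoof(flow, n_bins=32):
    """Simplified Histogram of Oriented Optical Flow for one frame.

    flow: (H, W, 2) array of per-pixel (dx, dy) flow vectors.
    Each vector votes into an orientation bin, weighted by its
    magnitude; the histogram is L1-normalised so it is invariant
    to the number of moving pixels.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                       # in [-pi, pi]
    idx = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)
    hist = np.bincount(idx, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Example: uniform rightward motion concentrates all mass in one bin
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                                 # dx = 1, dy = 0
h = hoof(flow, n_bins=8)
```

In the pipeline, one such histogram per frame serves as the camera-motion feature that is concatenated with the face-track and VAD scores.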

References

[1] E. Goffman. Encounters: Two studies in the sociology of interaction. Bobbs-Merrill, 1961.

[2] M. Aghaei, M. Dimiccoli, P. Radeva. With whom do I interact? Detecting social interactions in egocentric photostreams. IEEE ICPR, 2016.

[3] S. Alletto, G. Serra, S. Calderara, F. Solera, R. Cucchiara. From ego to nos-vision: Detecting social relationships in first-person views. IEEE CVPRW, 2014.

[4] A. Fathi, J. K. Hodgins, J. M. Rehg. Social interactions: A first-person perspective. IEEE CVPR, 2012.

[5] R. Chaudhry, A. Ravichandran, G. Hager, R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. IEEE CVPR, 2009.

[6] M. Van Segbroeck, A. Tsiartas, S. Narayanan. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice. INTERSPEECH, 2013.

4. Conclusion and Future Work

• Automatic online classification of Focused Interaction in continuous, egocentric videos

• Multimodal features: face track, VAD and camera motion profile

• Best performance with multimodal feature fusion and SVM with RBF kernel

• Future work involves using recurrent neural networks for classification and extending this work to identify conversational partners

Acknowledgements

This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/N014278/1: ACE-LP: Augmenting Communication using Environmental Data to drive Language Prediction.

[Figure: An outdoor night-time FI scenario with weak visual cues due to low illumination]

[Figure: Camera setup. Example frames of FI in which conversational partners are in the field of view of the camera, and of FI in which the conversational partner is no longer in the field of view as the interaction occurred while walking]

[Figure: Per-frame timelines of HOOF (bins), VAD scores and face tracker score over 0-200 sec, with ground-truth activities (computer work FI-NW, searching for documents, walk-turn around-walk FI-NW) and the Linear Kernel SVM and RBF Kernel SVM predictions over the classes No Focused Interaction, Focused Interaction (non-walk) and Focused Interaction (walk)]

[Pipeline: Input video → Video Stream and Audio Stream → Face detection (HOG) and tracking (KLT), Camera motion feature (HOOF), Audio-based feature (VAD) → Feature concatenation → Temporal windowing → Classification using SVM (Linear/RBF) → No FI / FI-NW (non-walk) / FI-W (walk)]

https://ace-lp.ac.uk/