
Attentive Action and Context Factorization

Yang Wang 1 ([email protected])
Vinh Tran 1,4 ([email protected])
Gedas Bertasius 2 ([email protected])
Lorenzo Torresani 3 ([email protected])
Minh Hoai 1,4 ([email protected])

1 Stony Brook University, Stony Brook, NY 11794, USA
2 University of Pennsylvania, Philadelphia, PA 19104, USA
3 Dartmouth College, Hanover, NH 03755, USA
4 VinAI Research, Hanoi, Vietnam

Abstract

We propose a method for human action recognition, one that can localize the spatiotemporal regions that 'define' the actions. This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements. To address this challenge, we utilize conjugate samples of human actions, which are video clips that are contextually similar to human action samples but do not contain the action. We introduce a novel attentional mechanism that can spatially and temporally separate human actions from the co-occurring contextual factors. The separation of action and context factors is weakly supervised, eliminating the need for laboriously detailed annotation of these two factors in training samples. Our method can build human action classifiers with higher accuracy and better interpretability. Experiments on several human action recognition datasets demonstrate the quantitative and qualitative benefits of our approach.

1 Introduction

Our work is concerned with human actions [4, 10, 18, 20, 28, 35, 36, 38, 41, 46] in video, and we consider the task of localizing the spatiotemporal regions of a video that define a human action. This task is useful for understanding human actions and for developing computational models of human actions with better accuracy and higher interpretability. This task, however, is very challenging due to the subtlety of human actions and the co-occurrence of other contextual elements in the video. Finding the constituent elements of a human action requires more than grounding the decisions of a human action classifier [3, 30, 33, 47, 48, 50], because the decisions of a human action classifier might be partially based on some contextual cues [6, 37, 39, 40, 45] such as the background scene and the camera motion.

What we actually need is a detector rather than a classifier, and theoretically, this detector can be learned if there is training data in which the video voxels that define a human action are delineated. Unfortunately, collecting manual annotation at this level of detail is notoriously difficult, if not impossible.

© 2020. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.


[Figure 1 (diagram): (a) test-time behavior; (b) learning framework. The diagram shows an action sample a and a conjugate sample a′, their feature maps F and F′, the attention maps S and C, the action and context features f(a) and g(a), the classifier h, the similarity and difference measures, and the losses L_c, L_s, and L_d that form the combined training objective.]

Figure 1: Attentive Action and Context Factorization. Attentive feature extraction and classifier learning are two stages of a joint learning framework. S(a) and C(a) are the probability maps for the action and context components. f(a) and g(a) are the action and context feature vectors, which are subsequently fed into the classifier h. The objective is to minimize the classification loss, the similarity between action elements across F and F′, and the difference between context elements across F and F′.

Scalability of the manual annotation procedure is obviously a major concern, but a bigger challenge comes from the inherent ambiguity of human action. In a video, which voxels define the action, which are part of the context, and which are noise? It is likely that we would get highly inconsistent annotations from different annotators.

In this paper, we propose a weakly-supervised method for localizing the spatiotemporal regions of a video that define a human action. We eliminate the need for laborious and ambiguous detailed annotation by leveraging conjugate samples [42] of human actions, which are video clips that are contextually similar to human action samples but do not contain the actions. We introduce a novel attentional mechanism that can spatially and temporally separate human actions from the co-occurring factors, and this attentional mechanism can be learned with the help of conjugate samples. Given an input video, this attentional mechanism computes two spatiotemporal attention maps, one for the action components and one for the context components, as illustrated in Fig. 1(a). Using these attention maps, we can identify and separate action voxels from context voxels. This allows us to selectively pool information from relevant regions of a video to compute feature vectors for the action and context parts of the video. This leads to an improved classifier with higher accuracy and better interpretability.

2 Related Work

2.1 Visual grounding

Given a query phrase or a referring expression, visual grounding requires a model to specify a region within the image or the video that corresponds to the query input. It is inspired by the top-down influences on selective attention in the human visual system (see [1] for a review). Various methods [3, 33, 47, 48, 50] have been proposed for grounding a CNN's prediction for images. However, visual grounding for video data or temporal network architectures is much less explored. Karpathy et al. [21] visualized LSTM cells that keep track of long-range dependencies in a character-based model. Selvaraju et al. [30] presented qualitative results on grounding image captioning and visual question answering using an RNN. Bargal et al. [2] proposed a top-down attention mechanism for CNN-RNN models to produce spatiotemporal saliency maps that can be used for action/caption localization in videos.

2.2 Disentanglement in image/video synthesis

In recent years, much effort [7, 24, 26, 31, 32] has been spent on the task of learning explicitly disentangled representations and subsequently leveraging them to control the image or video synthesis process. Early approaches used a bilinear model [34] and an encoder-decoder network [19] for separating content and style for images such as faces and text in various fonts. The disentanglement of content and style was further explored in [8, 23] for computer graphics applications. More recently, deep learning frameworks based on Variational Autoencoders [22, 27] and Generative Adversarial Networks [5, 13, 29] have become more popular because they are more powerful than classical methods at encoding and synthesizing images and videos. Besides the disentanglement of content and style, some research [16, 32] has also been conducted on separating the feature representation into translation-related and translation-invariant factors.

Most related to our work is the Action-Context Factorization (ACF) framework [42] for training a human action classifier that can explicitly factorize human actions from the co-occurring context. However, the objective in ACF was to learn factorized feature extraction networks for better classification performance and adaptability, while our goal here is to perform visual grounding of the two factors by automatically identifying the voxels associated with the action and those corresponding to the context. Incorporated in our method is a novel attentional mechanism for spatially and temporally separating human action from co-occurring context, a task that cannot be achieved by the ACF framework [42].

3 Attentive Factorization

3.1 Proposed Framework

We propose to train a human action classifier with both action and conjugate samples of human actions. Let the training data be {(l_i, a_i, a′_i)}_{i=1}^{n}, where l_i is an action label, a_i is the video sample for action l_i, and a′_i is a conjugate sample for a_i. Video a′_i is contextually similar to a_i but does not contain the action l_i.

Fig. 1(b) illustrates the proposed framework. Given the feature map F = F(a) ∈ ℜ^{T×H×W×D} for an action sample a, we compute two attention maps S and C ∈ [0,1]^{T×H×W} for the spatiotemporal distribution of the action and the context elements of the video, respectively. S_i is the probability that voxel i of the feature map F corresponds to an action component, while C_i is the probability that voxel i is a context element. Finally, the action extractor f(a) is defined as a function of the map F and the action map S, while the context extractor g(a) is a function of F and the context map C. To aid the factorization process, our framework uses a conjugate sample a′ and minimizes two additional losses, L_s and L_d. L_s measures the similarity between the action elements of a (defined based on S) and the action elements of a′. Similarly, L_d measures the difference between the context elements of a and a′, defined based on C. The combined training objective is to minimize:

∑_{i=1}^{n} L_c(h(f(a_i), g(a_i)), l_i) + ∑_{i=1}^{n} [ L_d(C(a_i), F(a_i), F(a′_i)) + L_s(S(a_i), F(a_i), F(a′_i)) ].   (1)
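To make the training objective concrete, here is a minimal PyTorch-style sketch of how Eq. (1) could be assembled. The callables feat_net, S_map, C_map, f_pool, g_pool, classifier, loss_sim, and loss_diff are placeholders for the components F, S, C, f, g, h, L_s, and L_d described above; cross entropy is used for L_c, as the paper allows any standard classification loss. All names and the per-sample loop are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F_nn

def combined_objective(batch, feat_net, S_map, C_map, f_pool, g_pool, classifier,
                       loss_sim, loss_diff):
    """Sketch of Eq. (1): classification loss on the action samples plus the
    similarity/difference losses computed against the conjugate samples."""
    total = 0.0
    for label, action_clip, conjugate_clip in batch:
        F_a  = feat_net(action_clip)       # feature map F(a)
        F_ap = feat_net(conjugate_clip)    # feature map F(a'), possibly a different length
        S = S_map(F_a)                     # action attention map S(a)
        C = C_map(F_a)                     # context attention map C(a)

        # classification loss L_c on the pooled action/context features
        logits = classifier(f_pool(F_a, S), g_pool(F_a, C))
        total = total + F_nn.cross_entropy(logits.unsqueeze(0), label.view(1))

        # factorization losses L_d and L_s against the conjugate sample
        total = total + loss_diff(C, F_a, F_ap) + loss_sim(S, F_a, F_ap)
    return total
```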


[Figure 2 (diagram): (a) comparing weakly-aligned feature maps; (b) computing factorized attention maps. Panel (a) shows the feature maps F and F′, the transformed vectors η(F, F) and η(F, F′), the maps S and C, and the losses L_s and L_d. Panel (b) shows the feature map F, the action map S_act, the attention map S_att, S = S_act ⊙ S_att, and the context attention map C = 1 − S_act.]

Figure 2: (a) To compare similarities and differences between weakly-aligned feature maps (F and F′), we first transform them, then compute the weighted average of the similarities and differences. The weights are based on the action map S and the context map C. (b) illustrates the computation of the action and context attention maps. F is a 4D tensor, F ∈ ℜ^{T×H×W×D}, while S_act, S_att, S, and C are 3D tensors in [0,1]^{T×H×W}.

This formulation uses conjugate samples in the learning process to aid the factorization between the action and context components f(a) and g(a). This task is difficult or even impossible to achieve using the standard approach of training the classifier by minimizing the classification loss only. One can also train f(a) and g(a) with detailed annotations, but collecting detailed annotation is a laborious and ambiguous process. The above formulation eliminates the need for detailed annotations by utilizing conjugate samples during training. During testing, conjugate samples are not needed.

There are three loss terms in the above objective function: L_c, L_s, and L_d. One technical challenge that we will address is the design of proper loss functions for L_s and L_d. While L_c can be defined based on any standard classification loss such as cross entropy, the design of L_s and L_d requires special consideration because there is no direct voxel-to-voxel correspondence between an action sample and a conjugate sample. This problem will be addressed in the next subsection. After that, we will describe the parameterization of the action and context attention maps.

3.2 Comparing weakly-aligned feature maps

We now describe the loss functions for measuring the similarity and difference between weakly-aligned action and conjugate video samples.

In general, it is not a good idea to measure the element-wise similarities and differences between F = F(a) and F′ = F(a′) due to the positional shift of action and context elements. Furthermore, an action sample and a conjugate sample might have different lengths, so the feature maps might even have different sizes. To address the problem of not having voxel-to-voxel correspondences, we propose the following mechanism. Consider a feature vector F_k at location k of the feature map F for the action sample. If F_k corresponds to an action element, it should be different from F′_j for all j's. On the other hand, if F_k corresponds to a context voxel, then there must be a j such that F′_j is similar to F_k. We propose to define a quantity that measures the amount of the feature vector F_k in a set of feature vectors F′ as follows:

η(F_k, F′) = ∑_j [ exp(γ F̂_k^⊤ F̂′_j) / ∑_l exp(γ F̂_k^⊤ F̂′_l) ] · F′_j,   (2)

where F̂_k and F̂′_j are unit vectors. Similarly, we can compute a vector η(F_k, F) to represent the amount of F_k in F. η(F_k, F) and η(F_k, F′) should be different or similar depending on whether F_k corresponds to an action or a context voxel. We define the similarity and difference loss functions based on the transformed feature maps as follows:

L_s = [ ∑_k S_k · sim(η(F_k, F), η(F_k, F′)) ] / ∑_k S_k,   (3)

L_d = [ ∑_k C_k · diff(η(F_k, F), η(F_k, F′)) ] / ∑_k C_k.   (4)

Here, sim and diff are two functions that measure the similarity and difference between two vectors based on their cosine distance. Specifically, in our experiments, sim(x, z) = max{0, cos⟨x, z⟩ − ξ} and diff(x, z) = 1 − cos⟨x, z⟩. Fig. 2(a) illustrates the procedure for feature map transformation and loss computation.
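A minimal PyTorch sketch of Eqs. (2)-(4), assuming the feature maps have been flattened to (N, D) and (M, D) matrices and the attention maps S and C to length-N vectors; the values of γ and ξ below are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F_nn

def eta(F, F_prime, gamma=5.0):
    """Sketch of Eq. (2): for each location k of F, a softmax-weighted average of the
    vectors of F_prime, weighted by the similarity of the unit-normalized features.
    F: (N, D) flattened feature map, F_prime: (M, D)."""
    F_hat = F_nn.normalize(F, dim=-1)           # unit vectors F̂_k
    Fp_hat = F_nn.normalize(F_prime, dim=-1)    # unit vectors F̂′_j
    attn = torch.softmax(gamma * F_hat @ Fp_hat.t(), dim=-1)   # (N, M) weights
    return attn @ F_prime                       # η(F_k, F′) for every k, shape (N, D)

def factorization_losses(F, F_prime, S, C, gamma=5.0, xi=0.5):
    """Sketch of Eqs. (3)-(4) with sim(x, z) = max{0, cos<x, z> - ξ} and
    diff(x, z) = 1 - cos<x, z>. S, C: flattened action/context maps of length N."""
    h_self = eta(F, F, gamma)                   # η(F_k, F)  for every k
    h_cross = eta(F, F_prime, gamma)            # η(F_k, F′) for every k

    cos = F_nn.cosine_similarity(h_self, h_cross, dim=-1)      # (N,)
    sim = torch.clamp(cos - xi, min=0.0)
    diff = 1.0 - cos

    L_s = (S * sim).sum() / S.sum().clamp(min=1e-6)             # Eq. (3)
    L_d = (C * diff).sum() / C.sum().clamp(min=1e-6)            # Eq. (4)
    return L_s, L_d
```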

3.3 Attentive Action & Context Factorization

A video is represented by a 4D feature map F ∈ ℜ^{T×H×W×D}, and we compute two attention maps S and C ∈ [0,1]^{T×H×W} for the spatiotemporal distribution of the action and the context elements of the video. We parameterize S and C based on one action map S_act ∈ [0,1]^{T×H×W} and one attention map S_att ∈ [0,1]^{T×H×W}, as illustrated in Fig. 2(b) and described below:

S = S_act ⊙ S_att,   C = 1 − S_act.   (5)

The operator ⊙ denotes element-wise multiplication. The action map S_act is the output of a per-location sigmoid function. Its values represent the probability of each location being an action component. Conversely, 1 − S_act represents the probability of each location being a contextual element that is common to both action and conjugate samples. We use S_att to further refine the action attention map S, because not all action elements are equally important for the action recognition task. S_att is a unit-sum tensor that allocates the percentage of attention for each location. It is produced by a softmax function instead of a sigmoid function, and should attend to more discriminative action components. More details on the parameterization of S_act and S_att are in the supplementary material.
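A small sketch of the composition in Eq. (5), assuming S_act and S_att have already been computed (see Section 3.4); the function name is illustrative.

```python
import torch

def compose_attention_maps(S_act, S_att):
    """Sketch of Eq. (5): combine the per-location action probability map S_act
    (sigmoid output) with the unit-sum attention map S_att (softmax output).
    Both inputs are (T, H, W) tensors."""
    S = S_act * S_att     # action attention map S = S_act ⊙ S_att
    C = 1.0 - S_act       # context attention map C = 1 − S_act
    return S, C
```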

With the convolutional feature map F ∈ ℜ^{T×H×W×D} that represents the video, and two spatiotemporal attention maps S and C ∈ [0,1]^{T×H×W} for the action and the context elements respectively, we propose to compute the action feature and the context feature as follows:

F_s = (1 / ∑_i s_i) ∑_i s_i · θ(F)_i   and   F_c = (1 / ∑_i c_i) ∑_i c_i · θ(F)_i.

F_s and F_c are the action and context feature representations for the input video. θ is a learnable transformation that is applied to the original feature maps. In this work, we define θ(F) = relu(F + conv3d(F)), which includes a residual connection to aid the optimization process [15]. Other transformation functions are also plausible. Finally, we apply the classifier h to obtain the class confidence scores. Specifically, we concatenate F_s and F_c and apply a fully-connected layer.
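A sketch of this attentive pooling and classification step in PyTorch. The paper specifies θ(F) = relu(F + conv3d(F)) and a fully-connected classifier on the concatenation of F_s and F_c; the kernel size, feature dimension, and number of classes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class AttentivePoolClassify(nn.Module):
    """Sketch of Section 3.3: residual 3D-conv transform θ, attention-weighted
    pooling into F_s and F_c, and a fully-connected classifier h."""
    def __init__(self, feat_dim=1024, num_classes=13):
        super().__init__()
        self.conv = nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.fc = nn.Linear(2 * feat_dim, num_classes)   # classifier h on [F_s, F_c]

    def forward(self, F, S, C):
        # F: (D, T, H, W) channel-first feature map; S, C: (T, H, W) attention maps
        theta_F = F_nn.relu(F + self.conv(F.unsqueeze(0)).squeeze(0))   # θ(F)
        F_s = (S.unsqueeze(0) * theta_F).sum(dim=(1, 2, 3)) / S.sum().clamp(min=1e-6)
        F_c = (C.unsqueeze(0) * theta_F).sum(dim=(1, 2, 3)) / C.sum().clamp(min=1e-6)
        return self.fc(torch.cat([F_s, F_c], dim=0))     # class confidence scores
```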

3.4 Implementation Details

We now provide more details on the parameterization of S_act and S_att in Eq. (5). We first apply two separate 3D convolutional layers φ and ψ to the extracted feature map F to compute the action confidence map φ(F) and the attention confidence map ψ(F). These are unnormalized confidence scores, and their range highly depends on the values of the convolution kernel weights of φ and ψ. To ensure consistent value ranges across different videos and time steps, we apply a normalization step, called AttNorm_dims. This function can be applied to a multidimensional input tensor; it subtracts the average value from the input tensor and then divides the result by the standard deviation. Both the average value and the standard deviation are computed along the specified dimensions dims. This AttNorm_dims function is similar to a Batch-Norm layer, with a key difference: the proposed normalization is performed over different spatial locations at different time steps, instead of over different samples in a single batch. We also use a Percent_ρ module to control the percentage of positive values, setting it to ρ ∈ [0, 100] in each confidence map. This can be done by subtracting the ρ-percentile value from every element of the confidence map. By explicitly controlling the percentage of positive values before the sigmoid function, as used in Eq. (6), we can regulate the percentage of action components in S_act within the entire video. Both AttNorm_dims and Percent_ρ are parameter-free. They enable efficient and stable network training in our experiments. After the normalization of the confidence scores, we apply the relaxed sigmoid or softmax function to obtain the intermediate attentive distribution maps:

S_act = sigmoid(α · Percent_ρ(AttNorm_{t,h,w}(φ(F)))),   S_att = softmax_{t,h,w}(β · AttNorm_{t,h,w}(ψ(F))).   (6)

Here, ρ, α, and β are tunable hyperparameters. ρ controls the percentage of action components within each video. α controls the sharpness of the boundary between action and context. When α is 0, S_act is 0.5 everywhere, and every location contains equal amounts of action and context values. If α = ∞, ρ percent of the values in S_act are 1's and the rest are 0's; that entails a hard spatiotemporal separation of action and context. β controls the balance between average pooling and max pooling for the attention map. When β is 0, S_att has uniform values, which is equivalent to average pooling. If β = ∞, only one location in the video has an attention weight of 1 while all other locations have weight 0, which is equivalent to max pooling.
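A sketch of AttNorm, Percent_ρ, and Eq. (6) in PyTorch. The interpretation of Percent_ρ as subtracting a percentile so that roughly ρ percent of values remain positive is our reading of the description above; the function names and default values are illustrative.

```python
import torch

def att_norm(x, dims=(0, 1, 2)):
    """Sketch of AttNorm_dims: zero-mean, unit-variance normalization over the
    specified spatiotemporal dimensions (not over batch samples)."""
    mean = x.mean(dim=dims, keepdim=True)
    std = x.std(dim=dims, keepdim=True).clamp(min=1e-6)
    return (x - mean) / std

def percent(x, rho=30.0):
    """Sketch of Percent_rho: shift the confidence map so that roughly rho percent
    of its values are positive (subtracting the (100 - rho)-percentile is one
    reading of the paper's description)."""
    threshold = torch.quantile(x.flatten(), 1.0 - rho / 100.0)
    return x - threshold

def attention_confidences(phi_F, psi_F, rho=30.0, alpha=1.0, beta=1.0):
    """Sketch of Eq. (6). phi_F and psi_F are the (T, H, W) confidence maps
    phi(F) and psi(F); rho, alpha, beta are the hyperparameters of Section 3.4."""
    S_act = torch.sigmoid(alpha * percent(att_norm(phi_F), rho))
    S_att = torch.softmax(beta * att_norm(psi_F).flatten(), dim=0).view_as(psi_F)
    return S_act, S_att
```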

The learnable parameters of the proposed action and context attentive mechanism are the kernel parameters of the functions φ, ψ, and θ. The functions AttNorm_dims, Percent_ρ, sigmoid, and softmax are all parameter-free. ρ, α, and β are hyperparameters that can be tuned using validation data; in our experiments, a good default starting point is ρ = 30.0, α = β = 1.

4 Experiments

We perform experiments on four action recognition datasets: ActionThread [17], Hollywood2 [25], HACS [49], and Pascal VOC [9] (presented in the supplementary material).

4.1 Separating action and context in video

Dataset. Our experiments in this section are performed on the ActionThread dataset [17]. It contains not only the video clips that include human actions but also the sequences right before and after the actions. These sequences are good candidates for conjugate samples of human actions. The human action samples in ActionThread were automatically located and extracted using script mining in 15 different TV series. ActionThread has 13 different actions and 3035 video clips, which are split into disjoint train and test subsets [17].


Method             RGB Only          RGB + Flow
                   #param    mAP     #param    mAP
I3D [4]            13.4M     46.7    28.8M     55.9
PoseMask           n/a       46.0    n/a       53.2
10Region           n/a       50.1    n/a       57.6
50Region           n/a       49.6    n/a       56.3
Attentional [11]   13.4M     53.0    28.8M     61.8
ACF [42]           13.4M     52.9    28.8M     60.4
ActX [w/o S_att]   13.4M     51.2    28.8M     58.8
ActX [Proposed]    13.4M     55.4    28.8M     63.1

Table 1: Action recognition results on the ActionThread dataset (average precision values are shown; higher is better). All methods are based on the same feature maps produced by the I3D backbone. I3D [4] uses global average pooling. All other methods use attention for weighted pooling of features. The numbers of parameters added by the attentional mechanisms are negligible compared to that of the I3D backbone. ActX achieves the best performance.

We consider the pre- and post-action sequences as the source for mining conjugate samples, and make sure each pair of action and conjugate samples is extracted from the same video thread.

Feature Extraction. We use the pre-trained two-stream I3D ConvNet [4] for feature extraction. Each video is represented as two 4D feature maps F_rgb, F_flow ∈ ℜ^{T×8×11×1024}, where T is the number of video frames divided by 8. The proposed attentive factorization network can take input feature maps with different temporal lengths. However, temporal cropping with a fixed length is still necessary during training because we need to put feature maps into mini-batches. We use a batch size of 36 videos, and the temporal cropping length is set to 10, equivalent to 80 frames or 3.2 seconds. At test time, temporal cropping is unnecessary: the entire feature maps are fed into the network to perform attentive action and context factorization as well as action recognition. More details are in the supplementary material.
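A sketch of the fixed-length temporal cropping used to batch the I3D feature maps during training; the random crop position and the padding of clips shorter than the crop length are assumptions not specified in the paper.

```python
import torch

def random_temporal_crop(feat_map, crop_len=10):
    """Sketch of temporal cropping for mini-batching. feat_map: (T, 8, 11, 1024);
    crop_len = 10 time steps corresponds to roughly 80 frames (3.2 seconds).
    At test time the full map would be used instead."""
    T = feat_map.shape[0]
    if T <= crop_len:
        pad = feat_map[-1:].repeat(crop_len - T, 1, 1, 1)  # repeat the last time step (assumption)
        return torch.cat([feat_map, pad], dim=0)
    start = torch.randint(0, T - crop_len + 1, (1,)).item()
    return feat_map[start:start + crop_len]
```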

Action Recognition Performance. Tab. 1 compares the action recognition performance of several methods on the ActionThread dataset. I3D is a baseline approach [4] in which the I3D feature vectors are averaged over the entire video. ActX is the proposed approach, in which attentive action and context factorization and human action classification are jointly optimized. As can be seen from Tab. 1, the proposed ActX significantly outperforms I3D, a direct baseline method without an attentional mechanism. The improvement brought by the attentional mechanism is most evident on the spatial stream, where RGB images are used for feature extraction.

We also compare our method with alternative approaches for weighting the I3D feature maps. ACF is the approach of [42], where L_sim^action and L_diff^context are defined based on the cosine distance between globally pooled feature vectors. Attentional is a state-of-the-art attentional pooling method [11] where both the top-down (category-specific) and the bottom-up (category-agnostic) attention maps are learned for weighting the predicted action scores. PoseMask is a method where human poses are used to obtain an attentive action map that focuses on regions of the human body joints. We use the Convolutional Pose Machine algorithm [44] to detect human pose keypoints within each video frame. The detected body keypoints include ears, eyes, and mouth regions, as well as shoulders, hips, limbs, wrists, and ankles.


[Figure 3 (images): (a) AnswerPhone, (b) DriveCar, (c) Eat, (d) Kiss; each panel shows an action attention map and a context attention map.]

Figure 3: Examples of attention maps for action and context. The action attention maps have higher weights on humans, but not all humans receive the same attention, and not all parts of a human subject receive attention. The weights of the context maps are lower on the human subjects. The context maps have a nonuniform distribution over background pixels.

We convert these keypoint heat maps into binary masks and use them for weighted pooling of the I3D features. The pose mask is expected to guide the classifier to attend to the human regions and achieve better action recognition results. However, the performance of PoseMask is worse than that of the average-pooled I3D features, indicating the importance of non-human regions. Because of this, we also consider k-Region, where object region proposals are used for attentive attribution. We use the Detectron software library [12] to obtain the top k object regions predicted by the Region Proposal Network of a Mask RCNN. We use the ROI-align module to obtain the feature representation for each region and average them into a single feature vector. We experiment with k = 10 and k = 50, both yielding action recognition performance that is slightly better than the average pooling method (I3D). However, these methods are still outperformed by the proposed method ActX. We also perform an ablation study where S_att is not used (i.e., S = S_act). The performance is not as good as when S_att is used to refine the action attention map. The numbers of parameters added by these attentional mechanisms are negligible compared to that of the I3D backbone network.

Visualizing Action and Context Components. Fig. 3 shows some examples of action and context attention maps. To recognize an action, it is important to attend to the human subjects. This explains why the action maps put higher weights on human subjects. However, not all humans in a video frame receive the same attention, and not all body parts of a human receive attention. On the contrary, the weights of the context maps are lower on the human subjects. The context maps emphasize the background regions with a nonuniform distribution.

4.2 Imperfect conjugate samples

Until now, we have assumed the availability of perfect conjugate samples that do not contain the actions under consideration. This assumption holds in general. For example, if a human action dataset has temporal annotations for the start and end times of the actions, we can consider the sequences right before and after the action boundaries as conjugate samples. When a new action dataset is being collected, we can save the pre- and post-action sequences for future mining of conjugate samples.


Method                  RGB Only    RGB + Flow
I3D [4]                 55.3        62.8
PoseMask                49.4        60.0
10Region                55.6        64.6
50Region                55.9        64.8
Attentional [11]        61.3        69.8
ACF [42]                61.3        71.2
ActX [Proposed]         62.5        73.2
EEP + I3D [43]          73.5        81.0
EEP + ActX [Proposed]   76.0        82.0

Table 2: Action recognition results on the Hollywood2 dataset (average precision values are shown; higher is better). All methods are based on the same feature maps produced by the I3D ConvNet. The proposed ActX achieves better performance than the other attention mechanisms, when used with or without Eigen Evolution Pooling (EEP [43]). We consider EEP because it has the current state-of-the-art performance on Hollywood2.

However, if we are forced to work with a specific dataset such as Hollywood2, where the pre- and post-action sequences are not available, we can use “imperfect” conjugate samples instead. In this scenario, the conjugate samples are not guaranteed to exclude all action elements that are shared with the action samples. With this in mind, we investigate whether or not a black-and-white separation between conjugate and action samples with respect to the action elements is necessary.

Experiments on Hollywood2. We first perform experiments on the Hollywood2 dataset [25], which contains 12 action classes and 1,707 videos collected from 69 Hollywood movies. To train the proposed method on Hollywood2, we use a batch size of 36 videos, and the temporal cropping length for I3D feature maps is set to 6, equivalent to 48 frames or around 2 seconds. The source for mining conjugate samples is the sequences right before and after the temporally cropped action samples. The extracted action and conjugate samples share the same context, and the key difference between them is the dynamic content of the human action at different action stages.

Tab. 2 compares the action recognition performance of several methods on the Hollywood2 dataset. ActX is the proposed approach, in which attentive action and context factorization and human action classification are jointly optimized. It significantly outperforms the baseline I3D [4] approach, where the I3D feature vectors are averaged. The performance gaps between the proposed ActX method and the other attention mechanisms on the Hollywood2 dataset are similar to those on the ActionThread dataset, presented in Tab. 1. More specifically, the proposed ActX significantly outperforms Attentional, where both the category-specific and category-agnostic attention maps are used, while PoseMask and k-Region achieve worse or similar action recognition performance compared to the average-pooled I3D features. The last two rows of Tab. 2 also compare the action recognition performance of the baseline I3D features and the proposed ActX features when they are temporally compressed using Eigen Evolution Pooling (EEP) [43]. EEP uses a set of basis functions learned from data using PCA to encode the evolution of features over time. It provides an effective way to capture the long-term and complex dynamics of human actions in video. ActX still outperforms I3D when combined with EEP.

Experiments on HACS-30. We also perform an experiment on the HACS dataset [49].


Method             mAP     mAcc
I3D [14]           57.8    83.1
Attention [11]     58.6    84.5
ACF [42]           58.3    84.3
ActX [Proposed]    59.1    85.3

Table 3: Average precision results on the HACS-30 validation set. All methods are based on the same feature maps produced by the backbone network (I3D-Res34-RGB [14]). The proposed ActX achieves the best performance.

This dataset provides both positive and negative samples for 200 action categories. Some positive and negative samples are extracted from the same video, so they have similar context, with the key difference being whether or not an action of interest is observed. Thus the negative samples are a great source for mining conjugate samples. More precisely, given a positive action sample, we consider its top three nearest neighbors in the negative data pool as its conjugate samples. To compute the distance between videos, we use feature embeddings extracted by the I3D [14] network (I3D-Res34-RGB).
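A sketch of this conjugate-sample mining step; the use of cosine distance between the I3D embeddings is an assumption, as the paper only refers to a distance between the embeddings.

```python
import torch
import torch.nn.functional as F_nn

def mine_conjugates(pos_embed, neg_embeds, k=3):
    """Sketch of conjugate-sample mining on HACS: for the I3D embedding of one
    positive sample, return the indices of its k nearest neighbours in the pool
    of negative-sample embeddings (cosine distance is an assumption)."""
    pos = F_nn.normalize(pos_embed.unsqueeze(0), dim=-1)   # (1, D)
    neg = F_nn.normalize(neg_embeds, dim=-1)               # (N, D)
    dist = 1.0 - (pos @ neg.t()).squeeze(0)                # (N,) cosine distances
    return torch.topk(dist, k, largest=False).indices      # k nearest negatives
```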

Limited by computational resources, we use a subset of the HACS dataset, which we will refer to as HACS-30. This subset contains 30 positive action classes and one negative background class. The 30 actions are those with the fewest positive samples, which makes the recognition task more challenging. After conjugate sample mining, the training set of HACS-30 contains 11,770 positive samples, and each positive sample has three corresponding conjugate samples. We evaluate the action recognition performance on the HACS-30 validation set, which contains 1,186 samples for the 30 actions and 1,712 samples for the negative background class. Given the imbalance between positive and negative data, we report the mean average precision (mAP) and the mean class accuracy (mAcc; obtained by averaging the per-class accuracies over the 31 classes). The experimental results are reported in Tab. 3. As can be observed, the proposed ActX outperforms the other attention mechanisms. However, the performance gain is smaller than those achieved on the ActionThread and Hollywood2 datasets. This might be due to the short duration of HACS video clips (all clips are only 2 seconds long); the added benefit of an attention mechanism is smaller for short videos.
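For reference, a small sketch of the mAcc metric as described (per-class accuracy averaged over the 31 classes); preds and labels are assumed to be integer class tensors.

```python
import torch

def mean_class_accuracy(preds, labels, num_classes=31):
    """Sketch of mAcc: per-class accuracy averaged over the 30 action classes
    plus the negative background class (31 classes in total)."""
    per_class = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():                                   # skip classes absent from the split
            per_class.append((preds[mask] == c).float().mean())
    return torch.stack(per_class).mean()
```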

5 Conclusions

We have presented a novel method for identifying action and context voxels of a video. Our method disentangles the action and the context components of a video with a novel attentional mechanism that can compute two spatiotemporal attention maps for action and context separately. Our method requires paired training data of action and conjugate samples of human actions, but this type of data can be collected easily. We have demonstrated the quantitative and qualitative benefits of using our method on the ActionThread, Hollywood2, HACS, and VOC Action datasets.

Acknowledgement. This project is supported by the National Science Foundation Award IIS-1566248, a Google Research Award, the SUNY 2020 Infrastructure Transportation Security Center, and VinAI Research.


References

[1] Farhan Baluch and Laurent Itti. Mechanisms of top-down attention. Trends in neurosciences, 34(4):210–224, 2011.

[2] Sarah Adel Bargal, Andrea Zunino, Donghyun Kim, Jianming Zhang, Vittorio Murino, and StanSclaroff. Excitation backprop for rnns. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, 2018.

[3] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang,Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visualattention with feedback convolutional neural networks. In Proceedings of the International Con-ference on Computer Vision, 2015.

[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinet-ics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,2017.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:Interpretable representation learning by information maximizing generative adversarial nets. InAdvances in Neural Information Processing Systems, 2016.

[6] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. Aˆ2-nets: Doubleattention networks. In Advances in Neural Information Processing Systems, 2018.

[7] Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, 2017.

[8] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[9] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. www.pascal-network.org/challenges/VOC/voc2012/workshop/, 2012.

[10] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In Advances in Neural Information Processing Systems, 2016.

[11] Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In Advances in Neural Information Processing Systems, 2017.

[12] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, 2014.

[14] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[16] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, 2011.

[17] Minh Hoai and Andrew Zisserman. Thread-safe: Towards recognizing human actions across shot boundaries. In Proceedings of the Asian Conference on Computer Vision, 2014.

[18] Minh Hoai and Andrew Zisserman. Improving human action recognition using score distribution and ranking. In Proceedings of the Asian Conference on Computer Vision, 2014.

[19] Fu Jie Huang, Y-Lan Boureau, Yann LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.


[20] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[21] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. In ICLR Workshop, 2016.

[22] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. 2014.

[23] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, 2015.

[24] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, 2017.

[25] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[26] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 2016.

[27] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the International Conference on Machine Learning, 2017.

[28] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the International Conference on Computer Vision, 2017.

[29] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.

[30] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the International Conference on Computer Vision, 2017.

[31] Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[32] Zhixin Shu, Mihir Sahasrabudhe, Alp Guler, Dimitris Samaras, Nikos Paragios, and Iasonas Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision, 2018.

[33] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.

[34] Joshua B Tenenbaum and William T Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.

[35] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the International Conference on Computer Vision, 2015.

[36] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[38] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, 2016.

[39] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision, 2018.

[40] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.


[41] Yang Wang and Minh Hoai. Improving human action recognition by non-action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[42] Yang Wang and Minh Hoai. Pulling actions out of context: Explicit separation for effective combination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[43] Yang Wang, Vinh Tran, and Minh Hoai. Eigen-evolution dense trajectory descriptors. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2018.

[44] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[45] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[46] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[47] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, 2014.

[48] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126:1084–1102, 2018.

[49] Hang Zhao, Zhicheng Yan, Lorenzo Torresani, and Antonio Torralba. HACS: Human action clips and segments dataset for recognition and temporal localization. arXiv preprint arXiv:1712.09374, 2019.

[50] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.