
Supplementary Material for TSM: Temporal Shift Module for Efficient Video Understanding

Ji Lin, MIT, [email protected]
Chuang Gan, MIT-IBM Watson AI Lab, [email protected]
Song Han, MIT, [email protected]

1. Uni-directional TSM for Online Video Detection

In this section, we provide more details on online video object detection with uni-directional TSM.

Object detection suffers from degraded object appearance due to motion blur, occlusion, defocus, etc. Video-based object detection offers the chance to correct such errors by aggregating and reasoning over temporal information.

Existing methods for video object detection [4] fuse information along the temporal dimension after features have been extracted by the backbone. Here we show that we can enable temporal fusion in online video object detection by injecting our uni-directional TSM into the backbone, as sketched below. Simply modifying the backbone with online TSM significantly improves video detection performance, without changing the design of the detection module or using optical flow features.
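To make the mechanism concrete, below is a minimal PyTorch sketch of the uni-directional (causal) shift, assuming activations are laid out as [N*T, C, H, W] over a clip of T frames as in the main paper. The function name and the fold_div hyper-parameter are illustrative, not the exact interface of our released code.

    import torch

    def unidirectional_shift(x, n_segment, fold_div=8):
        """Causal temporal shift: frame t borrows a fraction of its
        channels from frame t-1, never from the future, so the module
        remains usable for online (frame-by-frame) inference.

        x: tensor of shape [N*T, C, H, W], where T = n_segment.
        """
        nt, c, h, w = x.size()
        n_batch = nt // n_segment
        x = x.view(n_batch, n_segment, c, h, w)
        fold = c // fold_div  # number of channels to shift

        out = x.clone()
        # Channels [0, fold) at frame t are replaced by those of frame t-1.
        out[:, 1:, :fold] = x[:, :-1, :fold]
        out[:, 0, :fold] = 0  # the first frame has no past; zero-pad
        return out.view(nt, c, h, w)

Because no channel is ever shifted backward from a future frame, the operation imposes no look-ahead and can run on a live stream.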

We conducted experiments with the R-FCN [2] detector on the ImageNet-VID [3] dataset. Following the setting in [4], we used ResNet-101 [1] as the backbone of the R-FCN detector. For the TSM experiments, we inserted uni-directional TSM into the backbone while keeping all other settings the same. We used the official training code of [4] to conduct the experiments; the results are shown in Table 1. Compared to the 2D baseline R-FCN [2], our online TSM model significantly improves performance, especially on fast-moving objects, where TSM increases mAP by 4.6%. FGFA [4] is a strong baseline that uses optical flow to aggregate temporal information from 21 frames (the 10 past and 10 future frames) for offline video detection. Compared to FGFA, TSM achieves similar or higher performance while enabling online recognition (using information from past frames only) at a much smaller per-frame latency. The latency overhead of the TSM module itself is less than 1 ms per frame, making it a practical tool for real deployment.
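The low overhead is possible because, during frame-by-frame inference, each shifted layer only needs to keep the small slice of channels it will pass to the next frame. Below is a hypothetical sketch of such a per-layer cache; the class name and interface are our own illustration of the idea, not the released implementation.

    import torch

    class CausalShiftCache:
        """Online inference for one TSM layer: cache the shifted
        channel slice from the previous frame and splice it into the
        current frame. Only C // fold_div channels per layer are
        stored, which keeps the per-frame memory and latency overhead
        negligible."""

        def __init__(self, fold_div=8):
            self.fold_div = fold_div
            self.prev = None  # [N, C // fold_div, H, W] from frame t-1

        def __call__(self, x):
            # x: activation of a single frame, shape [N, C, H, W]
            fold = x.size(1) // self.fold_div
            out = x.clone()
            if self.prev is None:
                out[:, :fold] = 0          # no history at the first frame
            else:
                out[:, :fold] = self.prev  # splice in last frame's slice
            self.prev = x[:, :fold].detach()
            return out

Each residual block that contains a shift keeps its own cache, so a backbone with K shifted blocks carries K small buffers between consecutive frames.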


Table 1. Video detection results on ImageNet-VID.

Model        Online   Need Flow   Latency   mAP: Overall / Slow / Medium / Fast
R-FCN [2]      ✓          –         1×        74.7 / 83.6 / 72.5 / 51.4
FGFA [4]       –          ✓         2.5×      75.9 / 84.0 / 74.4 / 55.6
Online TSM     ✓          –         1×        76.3 / 83.4 / 74.8 / 56.0

We visualize two video clips in Figures 1 and 2. In Figure 1, the 2D baseline R-FCN generates false positives due to the glare of a car headlight on frames 2–4, while TSM suppresses them. In Figure 2, R-FCN generates false positives around the bus, which is occluded by a traffic sign, on frames 2–4; it also fails to detect the motorcycle on frame 4 due to occlusion. The TSM model addresses such issues with the help of temporal information.

2. Video Demo

We provide more video demos of our TSM model at the following project page: https://hanlab.mit.edu/projects/tsm/.

References

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[2] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 2016.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[4] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017.



[Figure 1: frames 1–5 of a video clip, with R-FCN detections in the left column and TSM detections in the right column.]

Figure 1. Comparing the results of the R-FCN baseline and the TSM model. The 2D baseline R-FCN generates false positives due to the glare of a car headlight on frames 2–4, while TSM avoids this issue by considering temporal information.


[Figure 2: frames 1–5 of a video clip, with R-FCN detections in the left column and TSM detections in the right column.]

Figure 2. Comparing the results of the R-FCN baseline and the TSM model. The 2D baseline R-FCN generates false positives around the bus, which is occluded by a traffic sign, on frames 2–4; it also fails to detect the motorcycle on frame 4 due to occlusion. The TSM model addresses these issues with the help of temporal information.