
Social Adaptive Module for Weakly-supervised Group Activity Recognition

Rui Yan1, Lingxi Xie2, Jinhui Tang1∗, Xiangbo Shu1, and Qi Tian2

1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

2 Huawei Inc., China
{ruiyan, jinhuitang, shuxb}@njust.edu.cn, [email protected], [email protected]

Abstract. This paper presents a new task named weakly-supervised group activity recognition (GAR), which differs from conventional GAR tasks in that only video-level labels are available, and the important persons within each frame are not provided even in the training data. This makes it easier for us to collect and annotate a large-scale NBA dataset, and it also raises new challenges for GAR. To mine useful information from weak supervision, we present a key insight that key instances are likely to be related to each other, and thus design a social adaptive module (SAM) to reason about key persons and frames from noisy data. Experiments show significant improvement on the NBA dataset as well as the popular volleyball dataset. In particular, our model trained on video-level annotation achieves accuracy comparable to prior algorithms which required strong labels.

Keywords: Group Activity Recognition, Video Analysis, and Scene Understanding

1 Introduction

Group activity recognition (GAR) has a variety of applications in video understanding, such as sports analysis, video surveillance, and public security. Compared with traditional individual actions [30,38,23,14,27], group activities (a.k.a. collective activities) [10,18,45,42] are performed by multiple persons cooperating with each other. Thus, models for GAR need to understand not only individual behaviors but also the relationships among persons.

Previous fully-supervised methods, which require person-level annotation (i.e., ground-truth bounding boxes and an individual action label for each person, and even interaction labels for person-person pairs), have achieved promising performance on group activity recognition. Typically, these methods [18,45,40,39,35,42,3,4,44] extract a feature for each person according to the corresponding bounding box, supervised by the individual action label, and then fuse the person-level features into a

∗Corresponding author


Three shot: 21 = 5 + 16 | Defense rebound: 12 = 6 + 6 | Two shot: 11 = 6 + 5

Fig. 1. Best viewed in color. Illustration of the uncertain-input issue under the weakly-supervised setting. For different activities, the off-the-shelf detector generates varying numbers of proposals, most of which (in red boxes) are useless for recognizing group activities. For instance, "Three shot: 21 = 5 + 16" means that the detector generates a total of 21 proposals, but only 5 of them are players and the other 16 proposals are outliers in a three-shot activity

single representation for each frame. However, previous methods are sensitive to the varying number of people in each frame and require their explicit locations, which limits their applicability in practice.

To this end, we investigate GAR in a weakly-supervised setting which only provides video-level labels for each video clip. This setting not only is practical for real-world scenarios but also provides a simpler and lower-cost way to annotate new benchmarks. Benefiting from it, we collect a larger and more challenging benchmark, NBA, consisting of 181 basketball games which involve more long-term temporal and fast-moving activities. Meanwhile, the weakly-supervised setting also brings an uncertain-input issue in each frame, as illustrated in Fig. 1. Under this setting, many useless proposals will be fed into the approach. Besides, numerous irrelevant frames will also appear in the video clip when the temporal structure of the activities (e.g., in NBA) is long.

To tackle these issues, we further propose a simple yet effective module, namely the Social Adaptive Module (SAM), which can adaptively select discriminative proposals and frames from the video for weakly-supervised GAR. SAM aims at assisting the weakly-supervised training by leveraging a social assumption that key instances (people/frames) are highly related to each other. Specifically, we first construct a dense relation graph on all possible input features to measure their relatedness to each other, and then pick the top ones according to their relatedness. Based on the selected features, a sparse relation graph is built to perform relational embedding on them. Benefiting from SAM, our approach trained without full supervision still obtains performance comparable to previous methods on the popular volleyball dataset [18].

Our contributions include: (a) The weakly-supervised setting that only provides video-level labels is introduced for GAR. (b) Thanks to this setting, a larger and more challenging benchmark, NBA, is collected from the web at a low cost. (c) To ease the weakly-supervised training, SAM is proposed to adaptively find effective person-level and frame-level representations based on the social assumption that key instances are usually closely related to each other.


2 Related Work

Group Activity Recognition. Initial approaches [18,45,40,42] for recognizing group activities adopted a two-stage pipeline. They pre-extracted features for each person from a set of patch images and then fused them into a single vector for each frame by various methods (e.g., pooling strategies [18,35], attention mechanisms [31,40,45], recurrent models [13,39,42,45], graphical models [2,25,24], and AND-OR grammar models [1,36]). Nevertheless, these two-stage methods separate feature aggregation from representation learning, which is not conducive to a deep understanding of group activities. To this end, Bagautdinov et al. [4] introduced an end-to-end framework to jointly detect multiple individuals, infer their individual actions, and estimate the group activity. Wu et al. [44] extended [4] by stacking multiple graph convolutional layers to infer the latent relations between persons. Azar et al. [3] constructed an activity map based on bounding boxes and explored the spatial relationships among people by iteratively refining the map. However, all of the above methods still require action-level supervision (action labels and bounding boxes for each person), which is time-consuming to tag. Ramanathan et al. [32] detected events and key actors in multi-person videos without individual action labels, but they still needed to annotate the bounding boxes of all the players in a subset of 9,000 frames to train a detector. This work introduces a more practical weakly-supervised setting that only provides video-level labels for group activity recognition.

Existing Datasets Related to GAR. Limited by the time-consuming tagging, there are currently only four datasets for understanding group activities, as shown in Table 1. Choi et al. [10] proposed the first dataset, the Collective Activity Dataset (CAD), consisting of real-world pedestrian sequences. Then, Choi et al. [11] extended CAD to CAED by adding two new actions (i.e., "Dancing" and "Jogging") and removing the ill-defined action (i.e., "Walking"). There is no specific group activity defined in CAD and CAED, in which the scenarios are assigned group activities by majority voting. Moreover, Choi and Savarese [9] collected Choi's New Dataset (CND), composed of many artificial pedestrian sequences. Recently, Ibrahim et al. [18] introduced a sports video dataset, the Volleyball Dataset (VD), which contains numerous volleyball games. However, as the largest and most popular dataset, VD contains quite a few wrong labels, which directly affect the evaluation of proposed approaches. In addition, Ramanathan et al. [32] released NCAA, but few researchers have used it for GAR since only YouTube video links are provided and many of them are dead now. Some activities (e.g., steal, slam dunk*, and free-throw*) in NCAA can be recognized using one key frame, which actually sidesteps some key challenges of GAR. Limited by the size and quality of the above datasets, recent studies of group activity recognition have encountered a bottleneck. In this work, we collect a larger and more challenging dataset from basketball games and do not provide any person-level information (i.e., the bounding boxes and action labels for each person), thanks to the weakly-supervised setting. Moreover, compared with previous benchmarks, our NBA contains more activities that involve long-term temporal structure and are fast-moving.


Relational Reasoning. Recently, relationships among entities (i.e., pixels, objects, or persons) have been widely leveraged in various computer vision tasks, such as Visual Question Answering [34,20,5], Scene Graph Generation [21,26,46], Object Detection [17,8], and Video Understanding [47,43,29]. Santoro et al. [34] presented a relational network module to infer the potential relationships among objects for improving the performance of visual question answering. Hu et al. [17] embedded a relation module into existing object detection systems to simultaneously detect a set of objects and model the interactions between their appearance and geometry. Besides the spatial relationships among objects in an image, some recent works also explored the temporal relational structure of video. Liu et al. [29] proposed a novel neural network to learn video representations by capturing potential correspondences for each feature point. Moreover, some recent methods [13,31,44] explored the spatial relationships among people in group activities. In this work, we apply relational reasoning to choose the most relevant people from a number of proposals for weakly-supervised GAR.

3 Weakly-supervised Group Activity Recognition

3.1 Weakly-supervised Setting

For a more practical group activity recognition, i) the number of people in the scene varies over different activities and even over time, and ii) person-level annotations cannot be provided in real-world applications. Therefore, we introduce a weakly-supervised setting in which only video-level labels are available, while the location and action label of each person are not provided.

In this work, the task of recognizing group activity under this setting is called weakly-supervised GAR, which aims to directly recognize the activity performed by multiple persons collectively from the video with only a video-level label during training. Apparently, weakly-supervised GAR can be applied to more complex real-world applications (e.g., real-time sports analysis and video surveillance) which cannot provide fine-grained supervision. Besides, the weakly-supervised setting eases the annotation of benchmarks for the task. Without annotating person-level supervision, we only require 1/(2K+1) of the tagging labor¹ as before, where K is the number of people in the scene.

3.2 The NBA Dataset for Weakly-supervised GAR

Under the weakly-supervised setting, we introduce a new video-based dataset, the NBA dataset. It describes the group activities that are common in basketball games. There is no annotation for each person; only a group activity label is assigned to each clip. To the best of our knowledge, it is currently the largest and most challenging benchmark for group activity analysis, as shown in Table 1.

¹ The fully-supervised setting requires K boxes, K actions, and 1 group activity, but the weakly-supervised setting only needs 1 group activity label. We roughly assume the same labor for each annotation.


Table 1. Comparison of the existing datasets for group activity recognition

Dataset     | # Videos | # Clips  | # Individual Actions | # Group Activities | Activity Speed | Camera Moving
CAD [10]    | 44       | ≈ 2,500  | 5                    | 5                  | slow           | N
CAED [11]   | 30       | ≈ 3,300  | 6                    | 6                  | slow           | N
CND [9]     | 32       | ≈ 2,000  | 3                    | 6                  | slow           | N
VD [18]     | 55       | 4,830    | 9                    | 8                  | medium         | Y
NBA (ours)  | 181      | 9,172    | -                    | 9                  | fast           | Y

We will introduce the NBA dataset from the following aspects: the source of the video data, the effective annotation strategy, and the statistics of this dataset.

Data Source. It is a natural choice to collect videos of team sports for studying group activity recognition. In this work, we collect a subset of 181 NBA games from 2019 from the web. Compared with the activities in volleyball games [18], the ones in basketball games have longer-term temporal structure and faster moving speed, which brings new challenges to group activity analysis. For one thing, the number of players may vary over different frames. For another, the activity is so fast that single-frame person-level annotation is useless for tracking these players. Therefore, it is difficult to label all people in these videos, which differs from volleyball games; thus we annotate this benchmark under the weakly-supervised setting. Due to copyright restrictions, this dataset is available upon request.

Annotation. Given a video, the goal of annotation is to assign group activities to the corresponding segments. It is time-consuming to manually label such a huge dataset with conventional annotation tools. To improve annotation efficiency, we take full advantage of the logs provided by the NBA's official website and design a simple, automatic pipeline to label our dataset. There are three steps: i) Filter out unwanted records in the log file corresponding to a video. ii) Identify the timer in each frame with Tesseract-OCR [37] and match it with the valid records generated in step i. iii) Save segments of a fixed length according to the time points obtained in step ii.
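
As a rough illustration, the sketch below mirrors this three-step pipeline. The log-record fields, the timer region, the helper names, and the use of OpenCV with pytesseract for the OCR step are our own assumptions for illustration, not the authors' released tooling.

```python
# Illustrative sketch of the three-step labeling pipeline (assumed helpers/format).
import cv2
import pytesseract

CLIP_LENGTH_SEC = 6  # fixed-length segments, matching the 6-second NBA clips

def filter_records(log_records, wanted_activities):
    """Step i: keep only the website log records describing activities of interest."""
    return [r for r in log_records if r["activity"] in wanted_activities]

def read_game_clock(frame, timer_roi):
    """Step ii: OCR the on-screen timer inside a fixed region (x, y, w, h)."""
    x, y, w, h = timer_roi
    crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(crop, config="--psm 7").strip()

def cut_segments(video_path, records, timer_roi):
    """Step iii: record a fixed-length segment for every frame whose OCR'd clock
    matches a valid log record."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    segments, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        clock = read_game_clock(frame, timer_roi)
        for rec in records:
            if clock == rec["game_clock"]:
                segments.append((rec["activity"], frame_idx,
                                 frame_idx + int(CLIP_LENGTH_SEC * fps)))
        frame_idx += 1
    cap.release()
    return segments
```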

Statistics. We collect a total of 181 videos with a high resolution of 1920 × 1080. We then divide each video into 6-second clips by the above-mentioned annotation method and sub-sample them to 12 fps. Besides, we remove some abnormal clips which contain close-up shots of players or instant replays. Ultimately, there are a total of 9,172 video clips, each of which belongs to one of 9 activities. Here, we drop some activities such as "dunk" and "turnover" due to their limited sample size, and do not use "free-throw", which is easy to distinguish. We randomly select 7,624 clips for training and 1,548 clips for testing. Table 2 shows the sample distribution across different categories of group activities and the corresponding average number of people in the scene.


Table 2. Statistics of the group activity labels in NBA. "2p", "3p", "succ", "fail", "def" and "off" are abbreviations of "two points", "three points", "success", "failure", "defensive rebound" and "offensive rebound", respectively

Group Activity  | 2p-succ. | 2p-fail.-off. | 2p-fail.-def. | 2p-layup-succ. | 2p-layup-fail.-off. | 2p-layup-fail.-def. | 3p-succ. | 3p-fail.-off. | 3p-fail.-def.
# clips (Train) | 798      | 434           | 1316          | 822            | 455                 | 702                 | 728      | 519           | 1850
# clips (Test)  | 163      | 107           | 234           | 172            | 89                  | 157                 | 183      | 83            | 360

4 Approach

4.1 Mining Key Instances via Social Relationship

In general, the key and difficult point in obtaining category information from visual input is to construct and learn an intermediate representation. For the task of group activity recognition, such an intermediate representation, made up of individual features and the underlying relationships among them, is referred to as the social-representation in this paper. The previous fully-supervised setting [18,42,45] provides a variety of extra fine-grained supervision (e.g., a ground-truth bounding box and action label for each person, and even an interaction label for each person-person pair) to ensure that the social-representation can be constructed and learned stably during training. However, under the weakly-supervised setting, which only provides video-level labels, it is difficult for models to define and learn a discriminative social-representation stably.

To this end, we propose a simple yet effective framework, as illustrated in Fig. 2, to stabilize weakly-supervised training for GAR. The core idea of our approach is to first construct all possible social-representations and then find the effective ones based on the social assumption that key instances (people/frames) are closely related to each other. Formally, given a sequence of frames (V1, V2, · · · , VT), our approach models them as follows:

O = O(F(V1; D(V1); W), F(V2; D(V2); W), · · · , F(VT; D(VT); W)).   (1)

Here, D(Vt) denotes detecting Np proposals from frame Vt. There are two choices to determine the value of Np: i) Quantity-aware: empirically select the top-Np boxes from the numerous proposals; ii) Probability-aware: choose the boxes whose probability is larger than a threshold θ.
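
A minimal sketch of these two strategies, assuming the detector returns boxes with confidence scores as PyTorch tensors (the function name and default values are illustrative):

```python
import torch

def select_proposals(boxes, scores, strategy="quantity", n_p=14, theta=0.9):
    """boxes: (M, 4) detector outputs; scores: (M,) confidences.
    'quantity'    -> Quantity-aware: keep the top-N_p highest-scoring boxes.
    'probability' -> Probability-aware: keep every box with score > theta,
                     so the number of proposals varies from frame to frame."""
    order = scores.argsort(descending=True)
    keep = order[:n_p] if strategy == "quantity" else order[scores[order] > theta]
    return boxes[keep], scores[keep]
```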

The spatial modeling function F(Vt; D(Vt); W) denotes that we i) adopt a CNN with parameters W to extract the convolutional feature map of frame Vt, ii) apply RoIAlign [15] to extract person-level features according to the corresponding proposals from D(Vt), and iii) fuse the person-level features into a single frame-level vector. However, without person-level annotation, it is unavoidable for D(·) to produce many useless proposals in each frame. Moreover, the number


[Figure 2 pipeline: Detector → Feature map → RoIAlign → Spatial SAM (Dense Relation Graph over N persons → Pruning → Sparse Relation Graph over K persons) → Pooling → Frame-level feature → Temporal SAM (Dense Relation Graph over N frames → Pruning → Sparse Relation Graph over K frames) → Video-level feature → FC → Activity Classification]

Fig. 2. Overview of our approach for weakly-supervised GAR. The inputs are a set of frames and the associated pre-detected bounding boxes for people. We apply SAM to concurrently select discriminative person-level features in the spatial domain and effective frame-level representations in the temporal domain (best viewed in color)

of proposals (Np) varies over samples in practical applications. Thus, F(·) needs to be able to choose Kp discriminative person-level features in the spatial domain. O(·) is a temporal modeling function that samples a set of Nf frames from the entire video sequence (T frames) as the input of our approach, according to the sampling strategy used in [41]. However, the long temporal structure of the activities in our NBA dataset brings numerous irrelevant frames that may affect the construction of the social-representation. Therefore, we also want O(·) to select Kf effective frame-level representations in the temporal domain.
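
For reference, a small sketch of the segment-based sampling we assume the strategy in [41] to be: split the clip into Nf equal segments and take one frame per segment, randomly during training and centrally during testing.

```python
import random

def sample_frames(num_frames_total, n_f=20, training=True):
    """Return n_f frame indices: one per equal-length segment of the clip."""
    seg_len = num_frames_total / n_f
    indices = []
    for k in range(n_f):
        start = int(k * seg_len)
        end = max(int((k + 1) * seg_len), start + 1)   # guarantee a non-empty range
        idx = random.randrange(start, end) if training else (start + end - 1) // 2
        indices.append(min(idx, num_frames_total - 1))
    return indices
```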

It is clear that F(·) and O(·) share a similar requirement: attending to effective person-level and frame-level features in the spatial and temporal domains, respectively. Therefore, this work aims at endowing the functions F(·) and O(·) with the ability of feature selection according to the social assumption that key instances are highly related to each other.

4.2 Social Adaptive Module (SAM)

Inspired by relational reasoning [34,43,47], we build a generic module, namely the Social Adaptive Module, to implement the idea of assisting weakly-supervised training with the social assumption. Specifically, we abstract F(·) and O(·) into a unified form as

Z = M(X) = { a | a ∈ {λ1 E(x1), λ2 E(x2), · · · , λN E(xN)}, a ≠ 0, xi ∈ X, λi ∈ {0, 1}, ‖λ‖1 = K },   (2)

where X ∈ RN×D and Z ∈ RK×D are the input and output of M(·), respectively, and K ≤ N. Put simply, M(·) aims to learn the parameter λ ∈ RN, a zero-one vector used to select K discriminative features from the N input features. E(·) is an optional embedding function for the input. We hold that λ will be effective for feature selection only if it is driven by X. Moreover, not only is N ≠ K, but the value of N also varies over samples. Therefore, directly replacing the functions F(·) and O(·) in Eq. (1) with M(·) makes our approach difficult to optimize.

In this work, we approximate the solution of λ by pruning a Dense Relation Graph with N nodes into a Sparse Relation Graph with K nodes. Specifically, we build a dense relation graph on the N input features to measure the relationships among them. During pruning, we aim to retain the top-K feature nodes of the graph according to their relatedness. Based on the K selected features, a sparse relation graph is built to perform relational embedding on them. The details are described as follows.

Dense Relation Graph. We first build dense relationships between the input nodes based on their visual features. More specifically, given a set of feature vectors {x1, x2, · · · , xN}, we compute the directional relation between them as rij = g(xi, xj), where i, j are indices and g(·, ·) is the relation function. There are several common implementations [29,34] of g(·, ·). For instance, we can measure the L2 distance between features, but this is not a data-driven, learnable method. Alternatively, we can feed the concatenation [xi, xj] into a multi-layer perceptron to obtain a relation score; however, as the number of pairs increases, this approach consumes a lot of memory and computation. In this work, we adopt a learnable and low-cost function to measure the relation between the i-th and j-th feature nodes as g(xi, xj) = Φ(xi)ᵀΨ(xj), where Φ(·) and Ψ(·) are two embeddings of the i-th and j-th feature nodes, respectively. Based on this formulation, the calculation of the relation matrix R = {rij} ∈ RN×N can be implemented with only two embedding processes and one matrix multiplication. We also apply a softmax along dimension j of the matrix R.
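
In PyTorch-like pseudocode, the dense relation graph reduces to two embeddings and one matrix product. The embedding dimension below is an arbitrary assumption, and we write Φ and Ψ as linear layers over node features (the paper implements them as 1×1 convolutions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseRelationGraph(nn.Module):
    """Sketch of r_ij = Phi(x_i)^T Psi(x_j) for all pairs, followed by a softmax
    along dimension j (assumed dims: 1024-d input, 256-d embedding)."""

    def __init__(self, dim=1024, embed_dim=256):
        super().__init__()
        self.phi = nn.Linear(dim, embed_dim)   # stand-in for the 1x1 conv embedding
        self.psi = nn.Linear(dim, embed_dim)

    def forward(self, x):                      # x: (N, dim) feature nodes
        r = self.phi(x) @ self.psi(x).t()      # (N, N) relation matrix R
        return F.softmax(r, dim=1)             # normalize along dimension j
```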

Pruning Operation. To approximate the solution of λ in Eq. (2), we select the K most relevant nodes from the above dense graph based on the social assumption that key instances are likely to be related to each other. Concretely, after obtaining the N × N relation matrix for all feature pairs, we compute the relatedness of each feature node as αi = Σ_{j=1}^{N} (rij + rji), where ri∗ and r∗i denote the out-edges and in-edges of the i-th feature node in the dense relation graph. Intuitively, nodes with strong connections can easily be retained in the graph. Thus, we hold that the sum of a node's connections can depict its importance (relatedness).

Based on the social assumption, we sort the values of α ∈ RN in descending order and select the top-K values, denoted as topk(α) ∈ RK. Thus, the desired λ can be expressed as

λi = 1 if αi ∈ topk(α), and λi = 0 otherwise.   (3)
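
A sketch of this pruning step under the same assumptions (x and r are torch tensors produced as in the previous snippet):

```python
def prune_top_k(x, r, k):
    """Keep the K nodes with the largest relatedness alpha_i = sum_j (r_ij + r_ji).
    Returns the selected (sparse) features and the K x K sub-matrix of relations."""
    alpha = r.sum(dim=1) + r.sum(dim=0)   # out-edges + in-edges of every node
    keep = alpha.topk(k).indices          # indices i with lambda_i = 1
    return x[keep], r[keep][:, keep]
```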

Sparse Relation Graph. According to λ, we can obtain the K selected features, X ∈ RK×D, namely the sparse features. However, λ is driven by R, while X is unrelated to it. Therefore, λ would be unlearnable if we directly regarded X as the output of this module. To tackle this problem, we construct a relational embedding E(·) for the sparse features X by combining them with the relation matrix R.

Similarly, we obtain a sparse relation matrix R = {rij} ∈ RK×K associated with the K selected features, and then perform relational embedding as

zi = E(xi) = Wz ( Σ_{j=1}^{K} rij Ω(xj) ) + xi.   (4)

Here "+" denotes a residual connection, Ω(·) is the embedding of the sparse feature xj, and Wz projects the relational feature to a new representation with the same dimension as the sparse feature xi.
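
A corresponding sketch of the relational embedding in Eq. (4), again with an assumed embedding dimension and linear layers in place of the 1×1 convolutions:

```python
import torch.nn as nn

class SparseRelationEmbedding(nn.Module):
    """z_i = W_z( sum_j r_ij * Omega(x_j) ) + x_i over the K selected nodes."""

    def __init__(self, dim=1024, embed_dim=256):
        super().__init__()
        self.omega = nn.Linear(dim, embed_dim)
        self.w_z = nn.Linear(embed_dim, dim)    # project back to the input dimension

    def forward(self, x, r):                    # x: (K, dim), r: (K, K)
        return self.w_z(r @ self.omega(x)) + x  # residual connection ("+")
```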

SAM is the first to introduce the social assumption, which helps a lot in the GAR scenario where many uncertain inputs are involved. More importantly, this makes our method more suitable for the weakly-supervised setting. In comparison: i) [45] and [40] only built pair-wise relationships between each player and the scene, whereas SAM captures the relationships among all people, which provides richer information for understanding complex scenes. ii) [13], [31], [39], and [44], as graph-based methods, indeed built relationships among different people, but they did not provide a mechanism to handle uncertain inputs. Therefore, we believe that SAM can also be used on top of these methods.

4.3 Implementation details

Person Detection & Feature Extraction. For each frame, we first adopt Faster R-CNN [33] pre-trained on MS-COCO [28] to detect possible persons in the scene, based on the MMDetection toolbox [7]. Then, we track them over all frames with the correlation tracker [12] implemented in Dlib [22]. After that, we adopt ResNet-18 [16] as the backbone to extract the convolutional feature map of each frame. Finally, we obtain the aligned feature of each proposal from the map by RoIAlign [15] with a crop size of 5 × 5 and embed it into a 1024-dimensional feature vector with a fully connected layer.
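
Putting the frame-level part together, a rough sketch using torchvision: the spatial_scale of 1/32 is our assumption that the crops are taken from ResNet-18's final feature map (stride 32), and treating the truncated ResNet as a Sequential is a simplification.

```python
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

backbone = nn.Sequential(*list(torchvision.models.resnet18(pretrained=True).children())[:-2])
embed = nn.Linear(512 * 5 * 5, 1024)             # 5x5 aligned crop -> 1024-d vector

def person_features(frame, boxes):
    """frame: (1, 3, H, W) tensor; boxes: (Np, 4) proposals in image coordinates."""
    fmap = backbone(frame)                       # (1, 512, H/32, W/32)
    crops = roi_align(fmap, [boxes], output_size=(5, 5), spatial_scale=1.0 / 32)
    return embed(crops.flatten(1))               # (Np, 1024) person-level features
```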

Social Adaptive Module. This module is designed to select K effective features from N input ones. The values of N and K depend on the situation and are specified in the experiments. If N varies over samples (e.g., different numbers of proposals are generated by the Probability-aware strategy mentioned in Section 4.1), we feed data into this module with a batch size of 1 but do not change the batch size of the entire framework. The Φ(·), Ψ(·), and Ω(·) used to embed input features are implemented by 1×1 convolutional layers.

Optimization. We adopt ADAM to optimize our approach with fixed hyper-parameters (β1 = β2 = 0.9, ε = 10−4) and train it for 30 epochs with an initial learning rate of 0.0001 that is reduced to 1/10 of its previous value every 5 epochs. Compared with SSU [4] and ARG [44], which require pre-training the CNN backbone and fine-tuning the top model separately, our approach, excluding detection, can be optimized in an end-to-end fashion.
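
These optimization settings translate directly into PyTorch; the placeholder model and the elided training loop are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 9)   # placeholder for the full network (9 NBA activities)

# Adam with the fixed hyper-parameters above; StepLR cuts the learning rate
# to 1/10 of its previous value every 5 epochs, for 30 epochs in total.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.9), eps=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(30):
    # ... one pass over the training data would go here ...
    scheduler.step()
```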


Table 3. Ablation studies on NBA. Quan-Np and Prob-Np are the two strategies for deciding the number of input proposals described in Section 4.1. θ is the probability threshold used in Prob-Np, Nf is the number of input frames, and K∗ denotes the number of features selected by our SAM

Type    | Options of Our Approach                                                  | Acc (%) | Mean Acc (%)
Quan-Np | B1: w/o SAM (Np = 8)                                                     | 44.6    | 39.5
Quan-Np | B2: w/ Spatial-SAM (Np = 14, Kp = 14)                                    | 46.8    | 41.3
Quan-Np | B3: w/ Spatial-SAM (Np = 14, Kp = 8)                                     | 50.3    | 43.6
Quan-Np | B4: w/ Spatial-SAM (Np = 8, Kp = 8)                                      | 47.4    | 41.4
Quan-Np | B5: w/ Spatial-SAM (Np = 14, Kp = 8) + w/ Temporal-SAM (Nf = 20, Kf = 6) | 49.1    | 47.5
Prob-Np | B6: w/ Spatial-SAM (θ = 0.9, Kp = 8)                                     | 47.5    | 42.6

5 Experiments

5.1 Quantitative Analysis on the NBA Dataset

We first evaluate our approach on the new benchmark by comparing it with several variants and baseline methods. For this dataset, we sample Nf = 20 frames from the entire video clip as the input for all methods and train them with a batch size of 16. Because of the fast speed of activities in this benchmark, we do not track pre-detected proposals over frames. Moreover, we do not apply any strategy to handle the class-imbalance issue in this benchmark.

Ablation Study. To evaluate the effectiveness of our SAM, different variants of our approach are evaluated on NBA, and the results are reported in Table 3. B1, which does not use the proposed SAM, achieves base accuracies of 44.6% and 39.5% on Acc and Mean Acc, respectively. Compared with B1, B2, which employs Spatial-SAM to build relational embeddings among Np = 14 proposals but does not prune useless ones, obtains only 2.2% and 1.8% improvement on Acc and Mean Acc. Similarly, B4, which directly adopts Spatial-SAM to generate relational representations from Np = 8 proposals, shows only a small improvement. However, by selecting Kp = 8 persons from Np = 14 proposals and modeling relationships among them, B3 improves Acc and Mean Acc by 5.7% and 4.1%, respectively, compared with B1. Moreover, our Prob-Np based approach (B6), which handles an uncertain number of proposals, also achieves a satisfactory Mean Acc of 42.6%. Based on B3, B5 obtains the best Mean Acc by applying SAM in the temporal domain as well. This demonstrates that SAM's ability to select features can also be used to capture the long temporal structure in our NBA dataset. Further analysis of the parameters N∗ and K∗ is presented in Section 5.2.

Comparison with the baselines. We also compare our approach with recent work in the video classification domain, including TSN [41], TRN [47], I3D [6], and I3D+NLN [43]. For a fair comparison, all these baseline methods are built on ResNet-18 and the input modality is RGB. The results are reported in Table 4. We see that


Table 4. Comparison on NBA. "Ours w/o SAM", "Ours w/ SAM (S)", and "Ours w/ SAM (S+T)" are B1, B3, and B5 reported in Table 3, respectively

Group Activity      | TSN [41] | TRN [47] | I3D [6] | I3D+NLN [43] | w/o SAM | w/ SAM (S) | w/ SAM (S+T)
2p-succ.            | 38.7     | 44.8     | 33.1    | 22.1         | 46.6    | 39.3       | 47.2
2p-fail.-off.       | 30.8     | 23.4     | 14.0    | 20.6         | 28.0    | 25.2       | 42.1
2p-fail.-def.       | 49.1     | 50.0     | 39.3    | 45.3         | 49.6    | 53.4       | 48.3
2p-layup-succ.      | 52.9     | 54.7     | 50.6    | 48.8         | 44.2    | 57.6       | 53.5
2p-layup-fail.-off. | 10.1     | 22.5     | 22.5    | 22.5         | 20.2    | 19.1       | 32.6
2p-layup-fail.-def. | 44.6     | 46.5     | 43.3    | 31.2         | 44.6    | 51.6       | 59.9
3p-succ.            | 39.3     | 37.7     | 31.1    | 26.8         | 39.9    | 41.0       | 30.1
3p-fail.-off.       | 10.8     | 20.5     | 4.8     | 12.0         | 24.1    | 38.6       | 55.4
3p-fail.-def.       | 63.9     | 62.8     | 55.3    | 61.7         | 58.6    | 66.9       | 58.1
Mean Acc (%)        | 37.8     | 40.3     | 32.7    | 32.3         | 39.5    | 43.6       | 47.5

(TSN, TRN, I3D, and I3D+NLN are frame-classification baselines; the last three columns are variants of our approach.)

"Ours w/o SAM" is hardly improved, or even worse, compared with methods ("TSN" and "TRN") that use only frame-level information, due to noisy input (irrelevant pre-detected proposals). By introducing SAM to select discriminative proposals in the spatial domain, "Ours w/ SAM (S)" achieves a significant improvement on Mean Acc but still overfits on some classes. As expected, "Ours w/ SAM (S+T)" outperforms all baselines by a good margin and obtains the best Mean Acc by simultaneously applying SAM to the spatial and temporal domains. Nevertheless, "Ours w/ SAM (S+T)" performs poorly on the "3p-succ." activity, which does not have long-term temporal structure. Moreover, "I3D" and "I3D+NLN", which depend on dense frames, perform poorly on this benchmark.

5.2 Qualitative Analysis on the NBA Dataset

Analysis of parameters. We first diagnose N, the number of nodes in the dense relation graph. Limited by computation resources, we only analyze the Np of Spatial-SAM, which indicates how many pre-detected proposals should be fed into our approach. It can be decided by the two strategies mentioned in Section 4.1. Thus, we first run our Quan-Np based approach on the NBA dataset by fixing Kp = 8 and changing Np from 8 to 64 with a step of 4. As shown in Fig. 3(a), although Np increases, the performance of our approach remains consistently higher than the baseline. Moreover, we also run our Prob-Np based approach on NBA with fixed Kp = 8, adjusting θ from 0.05 to 0.95 with a step of 0.05. As shown in Fig. 3(b), our approach achieves promising results when θ ≥ 0.3 and is most likely to reach high performance when θ is around 0.4. Overall, our Spatial-SAM is not sensitive to Np, whether it is decided by Quan-Np or Prob-Np.

[Figure 3 panels: (a) Np of Spatial-SAM, (b) θ of Spatial-SAM, (c) Kp of Spatial-SAM, (d) Kf of Temporal-SAM, (e) confusion matrix, (f) embeddings of "shot". Panels (a)-(d) plot Acc and Mean Acc of "Ours w/o SAM" against "Ours w/ Spatial-SAM" (or Temporal-SAM) as the corresponding parameter varies.]

Fig. 3. (a)-(d) Experimental analysis of parameters. (e) The confusion matrix of ours w/ Spatial-SAM and Temporal-SAM. (f) t-SNE visualization of embeddings of 2/3-point activities. These experiments are carried out on the NBA dataset

We also diagnose K, the number of nodes in the sparse relation graph, which decides how many feature nodes are selected for modeling. As shown in Fig. 3(c), the performance of Spatial-SAM stays above the baseline, and it obtains the best result at Kp = 1. Therefore, we hold that Spatial-SAM is not sensitive to Kp. By contrast, Temporal-SAM cannot achieve satisfactory performance when Kf is too small or too large, due to the varying temporal length of activities in NBA. However, our approach with Temporal-SAM significantly improves Mean Acc when 4 < Kf < 10.

Confusion matrix. To analyze the confusion between activities in the NBA dataset, we report the confusion matrix of our approach in Fig. 3(e). We can see that the activities involving "defense" and "offense" are easily confused, due to the class-imbalance issue between these two kinds of activities. However, it is relatively easy to distinguish 2-point and 3-point activities, as shown by the embeddings in Fig. 3(f), because 3-point players usually jump to shoot behind the 3-point line without being blocked, whereas 2-point players are often blocked by others.

Visualization. To further understand the discriminative learning process of SAM, we show some typical cases from NBA in Fig. 4. The group activities in NBA have long-term temporal structure, so the top-K proposals vary over time. Take the rightmost one as an example: a 3p-failure-defense has 3 parts: (1) preparation, (2) shooting, (3) defensive rebound. For (1) and (2), the players controlling the ball are the key instances, while for (3), the players that quickly turn back are the key instances. It is not hard to see that SAM focuses on the players who are controlling the basketball or close to it, and these people form a group semantically.


[Figure 4 panels, left to right: 2p-success, 3p-success, 3p-failure-defense]

Fig. 4. A visualization of the top-K proposals focused on by SAM over time on the NBA dataset, where K = 3. Each column shows three different frames of an activity. We highlight the top-K players (in cyan boxes) at three time steps of different activities. The people in red boxes are treated as noisy data by our model

5.3 Quantitative Analysis on the Volleyball Dataset

We also evaluate our approach on the largest and most common existing benchmark, the Volleyball Dataset (VD) [18], consisting of 4,830 volleyball game sequences. The middle frame of each sequence is labeled with 9 action labels (not used in our approach) and 8 group activity labels. However, we find that there are many wrong annotations between "pass" and "set", which seriously affect the evaluation of models, so we merge them into "pass-set". For a fair comparison, we follow the train/test split provided in [18] and sample Nf = 3 frames from each video clip, similar to [44]. Because the activities in VD always occur around the middle frame, we do not apply our SAM to the temporal domain for this benchmark.

Ablation Study. We also perform an ablation study on VD, and the experimental results are reported in Table 5(a). None of these variants uses the person-level supervision (bounding boxes and action labels) provided by [18], and all are built on ResNet-18. Compared with the baseline method B1, B2 and B4, which only apply SAM to generate relational embeddings for proposals but do not prune the irrelevant ones, improve accuracy by only 0.9% and 0.4%, respectively. In contrast, by using SAM to build relationships among N = 16 proposals and choosing K = 12 effective proposals from them, B3 and B5 improve accuracy by 1.6%, whether based on Quan-Np or Prob-Np. This observation again indicates that useless proposals affect weakly-supervised training and that SAM is effective at pruning them.


Table 5. Results on VD. (a) Ablation studies. (b) Comparison with the state of the art. "Ours" represents "Ours w/ Spatial-SAM" with Np = 16 and Kp = 12 based on Quan-Np

(a)
Type    | Our Approach                          | Acc (%)
Quan-Np | B1: w/o SAM                           | 91.5
Quan-Np | B2: w/ Spatial-SAM (Np = 16, Kp = 16) | 92.4
Quan-Np | B3: w/ Spatial-SAM (Np = 16, Kp = 12) | 93.1
Quan-Np | B4: w/ Spatial-SAM (Np = 12, Kp = 12) | 91.9
Prob-Np | B5: w/ Spatial-SAM (θ = 0.9, Kp = 12) | 93.1

(b)
Method  | Supervision | Acc (%)
HTDM    | Fully       | 89.7
PCTDM   | Fully       | 90.2
CCGL    | Fully       | 91.0
StagNet | Fully       | 90.0
‡ARG    | Fully       | 94.0
‡ARG    | Weakly      | 90.7
†Ours   | Weakly      | 93.1
‡Ours   | Weakly      | 94.0

† ResNet-18 backbone; ‡ Inception-v3 backbone

Comparison with the state of the art. Following [42], we report the results of HTDM [18,19], PCTDM [45], CCGL [39], and StagNet [31] by computing their corresponding confusion matrices. We reproduce the state-of-the-art method, ARG [44], under both the fully-supervised and weakly-supervised settings. As shown in Table 5(b), our weakly-supervised approach with a ResNet-18 backbone is superior to almost all previous fully-supervised methods, except ARG, which is built on Inception-v3. However, our approach goes far beyond ARG under the weakly-supervised setting, suggesting that useless pre-detected proposals seriously affect the construction of relation graphs in ARG. Furthermore, our approach with Inception-v3 achieves the best performance.

6 Conclusions

In this work, we introduce a weakly-supervised setting for GAR, which is more practical and friendly for real-world scenarios. To investigate this problem, we collect a larger and more challenging dataset from high-resolution NBA basketball videos. Furthermore, we propose a social adaptive module (SAM) for assisting weakly-supervised training by leveraging the social assumption that discriminative features are highly related to each other. SAM can be easily plugged into existing frameworks and optimized in an end-to-end fashion. As demonstrated on two datasets, our approach achieves state-of-the-art results while attending to key proposals/frames automatically.

This work reveals that social relationships among visual entities are helpful for high-level semantic understanding. We look forward to applying this method to more challenging scenarios, in particular for mining semantic knowledge from weakly-annotated or un-annotated visual data.


Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant 2018AAA0102002, and the National Natural Science Foundation of China under Grants 61732007, 61702265, and 61932020.

References

1. Amer, M.R., Xie, D., Zhao, M., Todorovic, S., Zhu, S.C.: Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In: ECCV (2012)
2. Amer, M.R., Lei, P., Todorovic, S.: HiRF: Hierarchical random field for collective activity recognition in videos. In: ECCV (2014)
3. Azar, S.M., Atigh, M.G., Nickabadi, A., Alahi, A.: Convolutional relational machine for group activity recognition. In: CVPR (2019)
4. Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In: CVPR (2017)
5. Cadene, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: Multimodal relational reasoning for visual question answering. In: CVPR (2019)
6. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR (2017)
7. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
8. Chen, X., Gupta, A.: Spatial memory for context reasoning in object detection. In: ICCV (2017)
9. Choi, W., Savarese, S.: A unified framework for multi-target tracking and collective activity recognition. In: ECCV (2012)
10. Choi, W., Shahid, K., Savarese, S.: What are they doing?: Collective activity classification using spatio-temporal relationship among people. In: ICCV Workshops (2009)
11. Choi, W., Shahid, K., Savarese, S.: Learning context for collective activity recognition. In: CVPR (2011)
12. Danelljan, M., Hager, G., Khan, F., Felsberg, M.: Accurate scale estimation for robust visual tracking. In: BMVC (2014)
13. Deng, Z., Vahdat, A., Hu, H., Mori, G.: Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In: CVPR (2016)
14. Gan, C., Wang, N., Yang, Y., Yeung, D.Y., Hauptmann, A.G.: DevNet: A deep event network for multimedia event detection and evidence recounting. In: CVPR (2015)
15. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
17. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: CVPR (2018)
18. Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: CVPR (2016)
19. Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: Hierarchical deep temporal models for group activity recognition. arXiv preprint arXiv:1607.02643 (2016)
20. Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In: CVPR (2017)
21. Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: CVPR (2015)
22. King, D.E.: Dlib-ml: A machine learning toolkit. JMLR (2009)
23. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A large video database for human motion recognition. In: ICCV (2011)
24. Lan, T., Sigal, L., Mori, G.: Social roles in hierarchical models for human activity recognition. In: CVPR (2012)
25. Lan, T., Wang, Y., Yang, W., Robinovitch, S.N., Mori, G.: Discriminative latent models for recognizing contextual group activities. TPAMI (2012)
26. Li, Y., Ouyang, W., Zhou, B., Wang, K., Wang, X.: Scene graph generation from objects, phrases and caption regions. In: ICCV (2017)
27. Lin, J., Gan, C., Han, S.: TSM: Temporal shift module for efficient video understanding. In: ICCV (2019)
28. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
29. Liu, X., Lee, J.Y., Jin, H.: Learning video representations from correspondence proposals. In: CVPR (2019)
30. Long, X., Gan, C., De Melo, G., Wu, J., Liu, X., Wen, S.: Attention clusters: Purely attention based local feature integration for video classification. In: CVPR (2018)
31. Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., Van Gool, L.: stagNet: An attentive semantic RNN for group activity recognition. In: ECCV (2018)
32. Ramanathan, V., Huang, J., Abu-El-Haija, S., Gorban, A., Murphy, K., Fei-Fei, L.: Detecting events and key actors in multi-person videos. In: CVPR (2016)
33. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NeurIPS (2015)
34. Santoro, A., Raposo, D., Barrett, D.G., Malinowski, M., Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural network module for relational reasoning. In: NeurIPS (2017)
35. Shu, T., Todorovic, S., Zhu, S.C.: CERN: Confidence-energy recurrent network for group activity recognition. In: CVPR (2017)
36. Shu, T., Xie, D., Rothrock, B., Todorovic, S., Chun Zhu, S.: Joint inference of groups, events and human roles in aerial videos. In: CVPR (2015)
37. Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR (2007)
38. Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
39. Tang, J., Shu, X., Yan, R., Zhang, L.: Coherence constrained graph LSTM for group activity recognition. TPAMI (2019)
40. Tang, Y., Wang, Z., Li, P., Lu, J., Yang, M., Zhou, J.: Mining semantics-preserving attention for group activity recognition. In: ACM MM (2018)
41. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV (2016)
42. Wang, M., Ni, B., Yang, X.: Recurrent modeling of interaction context for collective activity recognition. In: CVPR (2017)
43. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)
44. Wu, J., Wang, L., Wang, L., Guo, J., Wu, G.: Learning actor relation graphs for group activity recognition. In: CVPR (2019)
45. Yan, R., Tang, J., Shu, X., Li, Z., Tian, Q.: Participation-contributed temporal dynamic model for group activity recognition. In: ACM MM (2018)
46. Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Graph R-CNN for scene graph generation. In: ECCV (2018)
47. Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: ECCV (2018)