
Learning the Compositional Visual Coherence for Complementary Recommendations

Zhi Li1, Bo Wu2, Qi Liu1,3,∗, Likang Wu3, Hongke Zhao4 and Tao Mei5
1Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science, University of Science and Technology of China
2Columbia University
3School of Computer Science and Technology, University of Science and Technology of China
4The College of Management and Economics, Tianjin University
5JD AI Research
{zhili03, wulk}@mail.ustc.edu.cn, [email protected], [email protected], [email protected], [email protected]

Abstract

Complementary recommendations, which aim at providing users with product suggestions that are supplementary to and compatible with their obtained items, have become a hot topic in both academia and industry in recent years. Existing work mainly focuses on modeling the co-purchase relations between two items, but the compositional associations of item collections are largely unexplored. Actually, when a user chooses complementary items for purchased products, it is intuitive that she will consider the visual semantic coherence (such as color collocations and texture compatibilities) in addition to global impressions. Towards this end, in this paper, we propose novel Content Attentive Neural Networks (CANN) to model the comprehensive compositional coherence on both global contents and semantic contents. Specifically, we first propose a Global Coherence Learning (GCL) module based on multi-head attention to model the global compositional coherence. Then, we generate semantic-focal representations from different semantic regions and design a Focal Coherence Learning (FCL) module to learn the focal compositional coherence from the different semantic-focal representations. Finally, we optimize CANN with a novel compositional optimization strategy. Extensive experiments on large-scale real-world data clearly demonstrate the effectiveness of CANN compared with several state-of-the-art methods.

1 Introduction

Recommender systems are techniques that support users in various decision-making processes and capture their interests amid information overload. To enhance user satisfaction and recommendation performance, it is indispensable to understand how products relate to each other in recommender systems [McAuley et al., 2015a; Li et al., 2018].

    ∗Corresponding Author.


Figure 1: Illustration of complementary recommendations. (a) Recommendations based on co-purchased relations. (b) Recommendations based on compatible relations. (c) Recommendations based on compositional coherence.

Along this line, complementary recommendations [Yu et al., 2019], which aim at exploring compatible associations among items that enhance one another, have become a hot topic in both academia and industry in recent years.

In the literature, previous research can be clustered into two groups, i.e., unsupervised methods [Tan et al., 2004; Zheng et al., 2009] and supervised models [Zhao et al., 2017; He et al., 2016]. For a long time, researchers mainly used unsupervised methods to model association rules in the recommendation process [Tan et al., 2004]. Meanwhile, some studies proposed supervised approaches to learn complementary relationships, such as co-purchase [Zhao et al., 2017] and content compatibility [He et al., 2016]. Recently, many researchers have attempted to mine the compatibility of fashion items to better understand complementary relationships in clothing recommendations [Han et al., 2017; Hsiao and Grauman, 2018]. In spite of the importance of existing studies, the exploration of compositional associations in complementary item recommendations is still limited.

As a matter of fact, when a user chooses complementary items for her purchased products, it is intuitive that she will consider the visual coherence (such as color collocations and texture compatibilities) in addition to global impressions. For example, Figure 1 shows the case of a complementary recommendation process. If we only consider co-purchase relations, we may recommend a creamy-white shirt to the user,


as shown in Figure 1(a). That is because the shirt was co-purchased with each of the three query items by other users. When we take the compatible relations into consideration, as shown in Figure 1(b), we can find that the missing component of the user's purchased collection is a pair of shoes. Then, a pair of black booties may be a suitable recommendation. However, when considering compositional relationships and visual semantics, we can find that the mini skirt and the shoulder bag are both blue. Therefore, as Figure 1(c) shows, a pair of blue leather pumps is the best match. From this example, we can conclude that a good complementary recommender system should model the comprehensive compositional relationships on both global contents and semantic contents. Unfortunately, the exploration of this compositional coherence in complementary recommendations is still limited.

In this paper, we provide a focused study of the compositional coherence in item visual contents from two perspectives, i.e., the global visual coherence and the semantic-focal coherence. Along this line, we propose novel Content Attentive Neural Networks (CANN) to address complementary recommendations. More specifically, we first propose a Global Coherence Learning (GCL) module based on multi-head attention to model the global compositional coherence. Then, considering the importance of visual semantics (such as color and texture), we generate the content semantic-focal representations of color-focal, texture-focal and hybrid-focal contents, respectively. Next, we design a Focal Coherence Learning (FCL) module based on a hierarchical attention model to learn the focal coherence from the different semantic-focal representations. Finally, we optimize CANN with a novel compositional optimization strategy. We conduct extensive experiments on a real-world dataset, and the experimental results clearly demonstrate the effectiveness of CANN compared with several state-of-the-art methods.

2 Related Works

2.1 Visual Recommendations

With the rapid development of deep learning [Zhao et al., 2019; Wu et al., 2019] and computer vision, visual recommendations have attracted much interest in both academia and industry and have benefited many applications, such as image recommendations [McAuley et al., 2015b], movie recommendations [Zhao et al., 2016] and fashion recommendations [Han et al., 2017; Hou et al., 2019]. Some previous works treated visual recommendations as special content-aware recommendations incorporating the visual appearance of items [He and McAuley, 2016]. Along this line, many researchers directly extracted visual features with a pre-trained CNN model and used them to enhance traditional recommender systems, such as Matrix Factorization [He and McAuley, 2016; Hou et al., 2019]. For example, [He and McAuley, 2016] proposed a recommendation framework that incorporates the visual signal of items. Recently, many researchers have utilized deep learning methods to generate item recommendations [Kang et al., 2017]. To further explore item visual contents, some studies developed deep neural networks to extract the aesthetic information of images [Liu et al., 2017; Yu et al., 2018]. Although various studies have leveraged

visual features in recommendation tasks, they usually treat item visual features as side information, and item visual relations have been largely unexploited. In this paper, we propose the compositional visual coherence, learned by attention-based deep neural networks, to deeply model item complementary relations for recommender systems.

2.2 Complementary Recommendations

For enhancing recommendations, explicitly modeling the complex relations among items under domain-specific applications is an indispensable part [Liu et al., 2018]. Along this line, many researchers have focused on exploring item combination-effect relations, such as substitutable relations [McAuley et al., 2015a; McAuley et al., 2015b] and complementary relations [Rudolph et al., 2016; Yu et al., 2019]. For a long time, researchers mainly utilized unsupervised learning methods to explore co-occurrence relations for complementary recommendations [Tan et al., 2004; Zheng et al., 2009]. Recently, more and more studies based on supervised approaches have been proposed to model complementary relationships, which are mainly reflected by users' purchases of complements [Zhao et al., 2017; He et al., 2016] or item content similarities [McAuley et al., 2015a; Zhang et al., 2018]. Since fashion items naturally exhibit complementary relationships [Chang et al., 2017; Lo et al., 2019], there has been particular interest in understanding the compatibility [Song et al., 2017; Wang et al., 2019] of fashion items to generate complementary recommendations [Han et al., 2017; Hsiao and Grauman, 2018; Vasileva et al., 2018]. For example, [He et al., 2016] learned the complicated and heterogeneous relationships between items to enhance fashion recommendations. [Han et al., 2017] developed Bi-LSTMs to model the outfit completion process and generate complementary clothing recommendations. Although these works have considered co-reactions between items, the item compositional coherence in global and semantic contents cannot be depicted and captured well. In this paper, we present a focused study on exploring the item compositional relationships of visual contents to enhance complementary item recommendations.

3 CANN: Content Attentive Neural Networks

In this section, we introduce our proposed framework for addressing complementary recommendations.

In the real complementary item choosing process, people always want to buy items that are compatible with their purchased products. Suppose we have a set of compatible item collections S = {O1, O2, ..., ON}, where each collection Oi = {p1, p2, ..., pk} contains k compatible items p. Meanwhile, consider a scenario where a user has purchased a seed collection of items P = {p1, p2, ..., pi} but is unsure which complementary items P∗ would make the item collection compatible. To that end, in this paper, we aim to give a complementary suggestion to the target user to help her make the best-matched choice.

Along this line, we propose a content-based attention model, i.e., Content Attentive Neural Networks (CANN), to address complementary recommendations. As shown



Figure 2: Illustration of Content Attentive Neural Networks (CANN), which integrates two main components, i.e., A. Global Coherence Learning (GCL) and B. Focal Coherence Learning (FCL).

in Figure 2, CANN consists of a Global Coherence Learning (GCL) component and a Focal Coherence Learning (FCL) component to jointly learn the compositional relationships from global and semantic-focal contents. In order to simulate users' decision-making process on complementary items, we propose a novel compositional optimization strategy to train our proposed CANN.

3.1 Global Coherence Learning

Different from traditional recommendations, the visual contents of items are an essential part of complementary recommendations, e.g., colors should be compatible and styles should match. Along this line, we propose an attention-based module, i.e., Global Coherence Learning (GCL), to learn the compositional coherence of item global visual contents.

We adopt the Inception-V3 CNN model [Szegedy et al., 2016] as the feature extractor to transform item global visual contents (item images) into original feature vectors. Then, for each item collection P = {p1, p2, ..., pk}, we can generate the item visual representations P = {x1, x2, ..., xk} from the item images, where xi is the feature representation derived from the CNN model for the i-th item. We adopt a fully connected layer for all xi to reduce the feature dimension and allow the visual content representation to be fine-tuned in the model training stage. More formally, we generate the final visual representation as follows:

$$x_i^f = \sigma(x_i W_f^{(1)} + b_f^{(1)}) W_f^{(2)} + b_f^{(2)}, \quad (1)$$

where the weight matrices $W_f^{(1)}$, $W_f^{(2)}$ are of shape $\mathbb{R}^{d_c \times d_c}$ and $\mathbb{R}^{d_c \times d_f}$, respectively, $b_f^{(1)}$, $b_f^{(2)}$ are the bias terms, and $\sigma$ is the activation function (here, the ReLU function).
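For concreteness, a minimal sketch of this projection layer, assuming PyTorch; the class and argument names are ours, and the 2048-dimensional default only reflects the usual pooled Inception-V3 feature size:

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Two-layer projection of CNN features, a sketch of Eq. (1)."""

    def __init__(self, d_c: int = 2048, d_f: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(d_c, d_c)  # W_f^(1), b_f^(1)
        self.fc2 = nn.Linear(d_c, d_f)  # W_f^(2), b_f^(2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, d_c) CNN features of the k items in a collection
        return self.fc2(torch.relu(self.fc1(x)))
```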

After obtaining the visual representations of the item collection $\hat{P} = \{x_1^f, x_2^f, ..., x_k^f\}$, we can generate the item embedding matrix $M \in \mathbb{R}^{k \times d_f}$, where k is the maximum length of the item collections and $d_f$ is the latent dimension of the item visual features. If an item collection is shorter than k, we repeatedly use a 'padding' item until the length is k. Next, an attention mechanism is applied to model the item compositional coherence in multiple visual spaces. Inspired by the multi-head attention method [Vaswani et al., 2017], we use a stacked attention model to capture the global visual coherence. More specifically, we transform the item visual features into various visual spaces and perform a linear projection as $x_i^s = x_i W_s + b_s$, where $x_i^s$ is the representation vector of the i-th item in the s-th visual space, $W_s$ is the projection matrix of shape $\mathbb{R}^{d_f \times d_s}$, and $b_s$ is the bias term. Then we can model the global coherence between two items in the target collection as:

$$\alpha_{i,j}^s = \frac{W_q^s x_i^s \cdot (W_k^s x_j^s)^{T}}{\sqrt{d_s}}, \qquad \hat{\alpha}_{i,j}^s = \frac{\exp(\alpha_{i,j}^s)}{\sum_{j=1}^{k} \exp(\alpha_{i,j}^s)}, \quad (2)$$

where $\alpha_{i,j}^s$ is the visual coherence score of items $p_i$ and $p_j$, and the weight matrices $W_q^s$, $W_k^s$ are of shape $\mathbb{R}^{d_s \times d_s}$. Here, we use $\sqrt{d_s}$ to constrain the values of $\alpha_{i,j}^s$. Next, the coherence scores are normalized by a softmax function. Then we can generate the item representation $v_i^s$, incorporating the global coherence of the other items in the s-th visual space, as:

$$v_i^s = \sum_{j=1}^{k} \hat{\alpha}_{i,j}^s \otimes x_j^s, \quad (3)$$

where $\otimes$ represents the element-wise product. For each visual space, we can generate each item representation including the compositional information. Then, we fuse all the visual spaces to generate the global item visual representations, i.e., $v_i = \mathrm{Concate}(v_i^1, v_i^2, ..., v_i^S)$. Therefore, we model the global coherence from multiple visual spaces within an attention-based block, and the item visual representation $M_a = [v_1, v_2, ..., v_k]$ is generated.


Then, the global coherence learning process can be defined as $M_a = f_a(P)$. However, this process is a linear operation, and the interactions between different latent dimensions are largely unexploited. With this in mind, we develop a feed-forward network to endow the model with a nonlinear operation:

$$f_n(M_a) = \sigma(M_a W_n^{(1)} + b_n^{(1)}) W_n^{(2)} + b_n^{(2)}, \quad (4)$$

where the weight matrices $W_n^{(1)}$, $W_n^{(2)}$ are of shape $\mathbb{R}^{d_f \times d_f}$ and $b_n^{(1)}$, $b_n^{(2)}$ are the bias terms. After that, we obtain the output of one multi-head attention block. For modeling the sophisticated compositional coherence of global features, we stack multiple multi-head attention blocks to build a deeper network architecture. In order to prevent overfitting and an unstable training process, we apply a Batch Normalization (BN) layer [Ioffe and Szegedy, 2015] after the output of each attention block. More formally, the output of the b-th block can be defined as:

$$M_a^{(b)} = f_a(h^{(b-1)}) \oplus h^{(b-1)}, \quad (5)$$

$$h^{(b)} = \mathrm{BN}(W_{bn} f_n(M_a^{(b)}) + b_{bn}), \quad (6)$$

where $\oplus$ is the element-wise addition, and $f_a$, $f_n$ are the attention function and the feed-forward function, respectively. After the final attention block, we can generate the global representations of the seed collection $\mathcal{G}(P) = h^{(b^*)}$, where $h^{(b^*)}$ is the output of the last multi-head attention block.
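To make the block structure concrete, here is a minimal sketch of one such GCL block (Eqs. 2-6), assuming PyTorch. The class name is ours, and fusing the per-space projection with the query and key maps into single linear layers is our simplification, in the spirit of standard multi-head attention:

```python
import torch
import torch.nn as nn

class GCLBlock(nn.Module):
    """One stacked attention block of Global Coherence Learning, a sketch of Eqs. (2)-(6)."""

    def __init__(self, d_f: int = 512, n_spaces: int = 4):
        super().__init__()
        assert d_f % n_spaces == 0
        self.n_spaces, self.d_s = n_spaces, d_f // n_spaces
        self.space_proj = nn.Linear(d_f, d_f)            # x_i^s = x_i W_s + b_s, all spaces at once
        self.W_q = nn.Linear(d_f, d_f, bias=False)       # query maps W_q^s
        self.W_k = nn.Linear(d_f, d_f, bias=False)       # key maps W_k^s
        self.ffn = nn.Sequential(nn.Linear(d_f, d_f), nn.ReLU(), nn.Linear(d_f, d_f))  # Eq. (4)
        self.out = nn.Linear(d_f, d_f)                   # W_bn, b_bn in Eq. (6)
        self.bn = nn.BatchNorm1d(d_f)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, k, d_f) representations of the k items in a seed collection
        b, k, d_f = h.shape
        xs = self.space_proj(h).view(b, k, self.n_spaces, self.d_s).permute(0, 2, 1, 3)
        q = self.W_q(h).view(b, k, self.n_spaces, self.d_s).permute(0, 2, 1, 3)
        kk = self.W_k(h).view(b, k, self.n_spaces, self.d_s).permute(0, 2, 1, 3)
        alpha = torch.softmax(q @ kk.transpose(-1, -2) / self.d_s ** 0.5, dim=-1)  # Eq. (2)
        v = alpha @ xs                                    # Eq. (3): coherence-weighted sum
        v = v.permute(0, 2, 1, 3).reshape(b, k, d_f)      # concatenate the visual spaces
        m = v + h                                         # residual connection, Eq. (5)
        out = self.out(self.ffn(m))                       # feed-forward and linear map, Eqs. (4), (6)
        return self.bn(out.transpose(1, 2)).transpose(1, 2)  # BN over the feature dimension, Eq. (6)
```

Stacking $b^*$ such blocks and taking the output of the last one would yield the global representation $\mathcal{G}(P)$ described above.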

3.2 Focal Coherence Learning

As mentioned above, for learning the compositional coherence from item contents, the content semantic attributes are also important for understanding item characteristics. To that end, we propose a novel Focal Coherence Learning (FCL) module to model the compositional coherence from semantic-focal contents, i.e., color-focal contents, texture-focal contents and hybrid-focal contents. Firstly, we develop a Semantic-Focal Content Extraction (SFCE) based on region-based segmentation methods, as regions can yield much richer visual information than pixels [Uijlings et al., 2013]. Inspired by Selective Search [Uijlings et al., 2013], we first use a grouping algorithm to merge semantic regions based on color or texture similarity. Specifically, we first generate the initial semantic regions by a segmentation algorithm [Felzenszwalb and Huttenlocher, 2004]. Then we apply the Selective Search algorithm to group regions that are similar in color, in texture, or in both, respectively. The similarity is measured as $\mathrm{Sim}(r_i, r_j) = \sum_{k=1}^{n} \min(C_i^k, C_j^k)$, where $C_i$ is the characteristic feature histogram, i.e., the color histogram or texture histogram, and n is the feature dimensionality. More specifically, for each region, we extract three-dimensional color histograms. Each color channel is discretized into 25 bins, so n = 75 for the color feature in total. We obtain an n = 240 dimensional texture histogram by computing SIFT-like [Liu et al., 2010] measurements on 8 different orientations of each color channel, using 10 bins for each orientation of each channel. Both the color histogram and the texture histogram are normalized by the L1 norm.

After computing the semantic similarity of regions $r_i$ and $r_j$, we can group the similar regions, and the target histogram $C_t$ can be efficiently propagated by:

$$C_t = \frac{\mathrm{size}(r_i) \cdot C_i + \mathrm{size}(r_j) \cdot C_j}{\mathrm{size}(r_i) + \mathrm{size}(r_j)}. \quad (7)$$

The size of the target region can be simply computed as $\mathrm{size}(r_t) = \mathrm{size}(r_i) + \mathrm{size}(r_j)$. Different from Selective Search [Uijlings et al., 2013], we do not combine the color and texture similarities, because we want to deeply explore the compositional coherence of the different semantic contents. CANN extracts the semantic-focal contents $R_c$, $R_t$, $R_h$ based on color similarity, texture similarity and hybrid similarity, respectively, and for each semantic-focal content we choose three content regions. With a pre-trained Inception V3 [Szegedy et al., 2016], we can generate the feature vectors $F = [y_c, y_t, y_h]$ of all the semantic-focal contents. Next, we can learn the focal coherence among the content regions.
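A small sketch of the similarity computation and histogram propagation above, assuming NumPy; the function names are ours:

```python
import numpy as np

def region_similarity(c_i: np.ndarray, c_j: np.ndarray) -> float:
    """Histogram intersection Sim(r_i, r_j) = sum_k min(C_i^k, C_j^k).

    c_i, c_j are L1-normalized color (75-d) or texture (240-d) histograms.
    """
    return float(np.minimum(c_i, c_j).sum())

def merge_regions(c_i: np.ndarray, size_i: int, c_j: np.ndarray, size_j: int):
    """Size-weighted propagation of the merged region's histogram (Eq. 7)."""
    c_t = (size_i * c_i + size_j * c_j) / (size_i + size_j)
    return c_t, size_i + size_j
```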

Considering the computational complexity of applying the attention mechanism to all the regions of all the items, similar to [Ma et al., 2019], we propose a hierarchical attention module to model both the semantic-specific and cross-semantic dependencies in a distinguishable way. As shown in Figure 2, the input semantic-focal features $V \in \mathbb{R}^{k \times |F| \times |R| \times d_y}$ can be reshaped as input matrices $V_C \in \mathbb{R}^{|F| \times d_C}$ in the semantic-specific space and $V_S \in \mathbb{R}^{k \times d_S}$ in the cross-semantic space. For each seed item collection, k is the number of seed items, |F| is the number of semantic-focal features, |R| is the number of content regions of each semantic-focal feature, and $d_y$ is the dimension of each region representation. Afterwards, we can compute the focal coherence by hierarchical multi-head attention as:

$$V' = AV = \hat{A}_C V_S = \hat{A}_C \hat{A}_S V, \quad (8)$$

where $\hat{A}_C$ and $\hat{A}_S$ are the semantic-specific and cross-semantic attention matrices. Similar to GCL, the attention matrices are computed by Eq. 2. More specifically, we first model the compositional coherence of the different semantic-focal regions; then, we learn the attention map over the different items. For aligning the attention matrices with the semantic-focal features, we reshape them as $\hat{A}_C = A_C \cdot I_C$ and $\hat{A}_S = A_S \cdot I_S$, where $I_C$, $I_S$ are identity matrices. To that end, the semantic-focal representation of the item seed collection P can be formulated as $\mathcal{F}(V) = V'$.
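As a rough illustration of this two-level attention, the following hedged sketch in PyTorch first relates regions within each semantic type and then relates items; the helper, the class name and the mean-pooling between the two levels are our assumptions and do not reproduce the exact CDSA-style reshaping of [Ma et al., 2019]:

```python
import torch
import torch.nn as nn

def self_attend(x: torch.Tensor) -> torch.Tensor:
    # Scaled dot-product self-attention over the second-to-last dimension of x.
    d = x.size(-1)
    scores = torch.softmax(x @ x.transpose(-1, -2) / d ** 0.5, dim=-1)
    return scores @ x

class FocalCoherence(nn.Module):
    """Hierarchical (semantic-specific, then cross-semantic) attention, a sketch of Eq. (8)."""

    def forward(self, V: torch.Tensor) -> torch.Tensor:
        # V: (batch, k, |F|, |R|, d_y) region features: k items, |F| semantic types, |R| regions
        b, k, f, r, d = V.shape
        # 1) semantic-specific level: relate the regions within each item and semantic type
        v = self_attend(V.reshape(b * k * f, r, d)).reshape(b, k, f, r, d)
        # 2) cross-semantic level: relate the items after pooling regions and semantic types
        v = v.mean(dim=(2, 3))          # (batch, k, d_y)
        return self_attend(v)           # focal-coherent item representations
```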

3.3 Optimization Strategy

So far, from the GCL and FCL modules, we can generate the global and semantic-focal representations, respectively. Then, after a fully connected layer, we can obtain the representation $\hat{x}$ of the prediction item. Following [Han et al., 2017], we append a softmax layer on $\hat{x}$ to calculate the probability of complementary items conditioned on the target seed collection:

$$\Pr(\hat{x}|P) = \frac{\exp(\hat{x} \cdot x_c)}{\sum_{x_c \in N} \exp(\hat{x} \cdot x_c)}, \quad (9)$$

where N contains all the items from the seed collections. This can make the model learn compositional coherence by means of considering a diverse collection of candidates.


Algorithm 1: Compositional Optimization Strategy
Input: Initialized model f(P_i, C; θ); the length of the seed collection k; the complementary item database S; the number of epochs T; the batch size m.
Parameter: Model parameters θ.
1: for i = 1, 2, 3, ..., T do
2:   Randomly sample m seed collections O ∈ S
3:   Initialize the input mini-batch Input as ∅
4:   for O_i in the batch do
5:     Randomly choose an item p from the collection O_i
6:     Generate the seed collection P_i ← Mask(O_i, p)
7:     if |P_i| < k then
8:       Add paddings to the left of P_i until |P_i| = k
9:     end if
10:    Generate the input mini-batch Input ← Input ∪ P_i
11:  end for
12:  Build the training candidates N ← ∀x ∈ Input
13:  Update the model θ ← SGD(f(Input, C; θ), θ)
14: end for

For optimizing CANN, we propose a compositional training strategy that simulates users' decision-making process on complementary items, as shown in Algorithm 1. In the training stage, we use the Mask(O_i, p) operation to remove the prediction item p from the seed collection O_i. Moreover, we take all items in the mini-batch as the candidate collection N. In the inference stage, we instead use the provided candidate collection as the x_c. Actually, for some small datasets we could use all items as candidates, but this is not practical for large datasets because of the high-dimensional item representations.

With the compositional optimization strategy, CANN can be trained end-to-end. During training, we minimize the following objective function:

$$\mathcal{L}(P, N; \theta) = -\frac{1}{|N|} \sum_{t=1}^{|N|} \log \Pr(\hat{x}|P), \quad (10)$$

where θ denotes the model parameters. Similar to [Han et al., 2017], we add a visual-semantic embedding as a regularization term. We use Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951] to update the parameters, following Algorithm 1.
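A hedged sketch of one training step under this strategy (Algorithm 1 with Eqs. 9-10), assuming PyTorch; `model.predict` and `model.embed` are hypothetical interfaces, and the visual-semantic embedding regularization is omitted:

```python
import random
import torch
import torch.nn.functional as F

def compositional_step(model, optimizer, outfits, k=8, pad_item=None):
    """One step of the compositional optimization strategy, a sketch."""
    seeds, targets = [], []
    for outfit in outfits:                          # mini-batch of sampled collections O_i
        p = random.choice(outfit)                   # item to be predicted
        seed = [q for q in outfit if q is not p]    # Mask(O_i, p)
        seeds.append([pad_item] * (k - len(seed)) + seed)  # left-pad to length k
        targets.append(p)

    x_hat = model.predict(seeds)                    # (m, d) predicted complementary embeddings
    cand = model.embed(targets)                     # (m, d) in-batch candidate embeddings N
    logits = x_hat @ cand.t()                       # dot products x_hat . x_c for every candidate
    labels = torch.arange(len(outfits), device=logits.device)
    loss = F.cross_entropy(logits, labels)          # Eq. (10): mean negative log of Eq. (9)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Using the in-batch items as candidates corresponds to building N from the mini-batch, as in line 12 of Algorithm 1.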

4 Experiments

In this section, we first introduce the experimental settings and the compared methods. Then, we compare the performance of CANN against the compared approaches on the complementary recommendation task, followed by an ablation study on the focal coherence learning process. At last, we conduct a case study to visualize the compositional coherence of our proposed CANN.

4.1 Experimental Setups

Dataset. We evaluate our proposed method on a real-world dataset, i.e., the Polyvore dataset [Han et al., 2017; Vasileva et al., 2018]. It provides fashion outfits that are created and uploaded by experienced fashion designers. Indeed, outfit matching is a natural scenario for content-based complementary recommendations, because the clothes in an outfit are complementary to each other. We use the provided datasets [Han et al., 2017; Vasileva et al., 2018], which contain 90,634 outfits. For the reliability of the experimental results, we perform the following preprocessing. First, we merge the datasets and remove the noise samples whose seed sets contain more than 8 items, because a fashion outfit hardly contains more than 8 items in real-world scenarios. Then, we split the dataset into 59,212 outfits with 221,711 fashion items for training, 3,000 outfits for validation and 10,218 outfits for testing. Next, we construct a testing set (i.e., FITB Random) as fill-in-the-blank tasks [Han et al., 2017], in which we randomly choose candidates from the whole testing set as the negative samples. Meanwhile, in order to make our testing more difficult and more similar to users' decision-making process, we construct a more specific testing set, i.e., FITB Category. In this set, we remove the easily identifiable negative samples and replace them with other items of the same category as the ground truth.

Evaluation Metrics. We evaluate our model and all compared methods with two evaluation metrics, i.e., Accuracy (ACC) [Han et al., 2017; Vasileva et al., 2018] and Mean Reciprocal Rank (MRR) [Song et al., 2017; Jin et al., 2019].
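For concreteness, a small sketch of how the two metrics can be accumulated per fill-in-the-blank question, under our own naming assumptions:

```python
import numpy as np

def fitb_metrics(scores: np.ndarray, target: int):
    """Accuracy and reciprocal rank for one fill-in-the-blank question, a sketch.

    `scores` are the model's scores for the candidate items and `target` is the
    index of the ground-truth candidate.
    """
    ranking = np.argsort(-scores)                       # candidates by descending score
    rank = int(np.where(ranking == target)[0][0]) + 1   # 1-based rank of the ground truth
    return float(rank == 1), 1.0 / rank                 # (ACC contribution, reciprocal rank)
```

Averaging the two returned values over all questions gives the reported ACC and MRR.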

Implementation Details. We adopt the GoogleNet Inception V3 model [Szegedy et al., 2016], pretrained on ImageNet [Deng et al., 2009], to transform global visual contents (images) and semantic-focal visual contents (regions) into feature vectors. For fair comparisons, we set the image embedding size to df = 512 unless otherwise noted. The number of visual spaces is set to S = 4, and for each visual space we set ds = df/S = 128 and b∗ = 4 unless otherwise noted. Our model is trained with an initial learning rate of 0.2, which is decayed by a factor of 2 every 2 epochs. The batch size is set to 9, and the seed collection length k is set to 8. We stop the training process when the loss on the validation set stabilizes. Our model and all the compared methods are developed and trained on a Linux server with two 2.20 GHz Intel Xeon E5-2650 v4 CPUs and four TITAN Xp GPUs. The datasets and source code are available on our project page: https://data.bdaa.pro/BDAA Fashion/index.html.

4.2 Compared Approaches

To demonstrate the effectiveness of CANN, we compare it with the following alternative methods:

• SetRNN [Li et al., 2017]. This method treats the outfit data as a set and develops an RNN model to generate the complementary scores of target item sets.

• SiameseNet [Veit et al., 2015]. This model utilizes a Siamese CNN to project two items into a latent space to estimate their similarity. We use an L2 norm to normalize the item visual embeddings before calculating the Siamese loss and set the margin parameter to 0.8.

• VSE [Han et al., 2017]. The Visual-Semantic Embedding (VSE) method learns a multimodal item embedding based on images and texts. The resulting embeddings are used to measure the item recommendation scores.

• Bi-LSTM [Han et al., 2017]. This method builds a bidirectional LSTM to predict the complementary item conditioned on previously seen items in both directions. We use the full model by jointly learning from the multimodal inputs.


• CSN-Best [Vasileva et al., 2018]. This method learns a visual content embedding that respects item type, and jointly learns notions of item similarity and compatibility in an end-to-end model. We use the full components with a general embedding size of 512 dimensions in our experiments.

• NGNN [Cui et al., 2019]. This method learns item complementary relations by node-wise graph neural networks. It is the state-of-the-art method in fashion compatibility learning and clothing complementary recommendations.

Besides, for verifying the effectiveness of the components, we construct three variant implementations based on CANN:

• CANN-G. This is a variant of our model that only uses the Global Coherence Learning (GCL) module to generate complementary recommendations.

• CANN-F. This is a specific implementation that only uses the Focal Coherence Learning (FCL) module.

    • CANN. This is the model with all proposed components.

4.3 Recommendation Performances

We compare CANN with the other methods on content-based complementary recommendations. The number of candidates is set to 4 for both FITB Random and FITB Category, which is the same as in previous work [Han et al., 2017; Vasileva et al., 2018; Cui et al., 2019].

The results of all methods on both testing sets are shown in Table 1. We can make the following observations: 1) CANN outperforms all the compared methods on both testing sets, which indicates the superiority of the proposed model for content-based complementary recommendations. 2) SetRNN and VSE perform the worst on both testing sets, which indicates that complementary recommendation is not a trivial problem that can be handled directly by simple methods. 3) SiameseNet and CSN-Best are similar methods which model item pair-wise compatibility. These models work better than SetRNN and VSE, but worse than Bi-LSTM and CANN. This may be because SiameseNet and CSN-Best aim to learn the similarity between two fashion items but ignore the compositional process. 4) CANN-F performs worse than CANN-G, because CANN-F only considers the semantic-focal contents, which may ignore some information. Moreover, CANN-G outperforms all other methods except CANN. This enables us to safely draw the conclusion that it is advisable to model the compositional coherence of items on both global and semantic-focal contents.

4.4 Ablation Study on Focal Coherence

To further assess the robustness of the model and the necessity of the semantic-focal coherence, we set different semantic contents in the FCL module and evaluate the performances as the number of candidate collections increases. We set the FCL module with only color-focal, texture-focal and hybrid-focal contents, respectively, and compare these models with our full model CANN, which contains all the semantic-focal contents.

The results are shown in Figure 3, where the horizontal axis indicates the number of candidate collections.

Approaches   | FITB Random (Accuracy / MRR) | FITB Category (Accuracy / MRR)
SetRNN       | 29.6% / 48.1%                | 28.7% / 46.1%
SiameseNet   | 52.2% / 71.6%                | 54.0% / 72.8%
VSE          | 29.2% / 49.1%                | 30.2% / 53.2%
Bi-LSTM      | 83.6% / 91.1%                | 58.2% / 75.7%
CSN-Best     | 58.9% / 76.1%                | 56.1% / 74.2%
NGNN         | 87.3% / 93.2%                | 57.3% / 74.9%
CANN-G       | 88.8% / 94.1%                | 62.4% / 78.1%
CANN-F       | 71.9% / 84.1%                | 56.7% / 74.7%
CANN         | 90.7% / 95.1%                | 66.5% / 80.9%

Table 1: Performance comparisons of CANN with alternative methods on the two testing sets.

Color, Texture, Hybrid and ALL denote that CANN uses only color-focal, only texture-focal, only hybrid-focal and all contents, respectively. From Figure 3, we can make the following observations: 1) CANN with all the semantic-focal contents outperforms the others, which clearly demonstrates the effectiveness of all components in our proposed CANN. 2) CANN with hybrid-focal coherence learning performs better than the color-focal and texture-focal variants. Meanwhile, the color-focal model outperforms the texture-focal model. These observations imply that color is an important factor for content-based complementary item recommendations. 3) CANN with all semantic-focal contents outperforms the single semantic-focal models on FITB Category with a larger margin than on FITB Random. These observations imply that semantic-focal contents can help the model better understand the item compositional relationships and generate the best-matched complementary item suggestions.

4.5 Visualization of the Attention Mechanism

To further illustrate how the compositional visual coherence is learned and expressed in our model, we visualize the intermediate coherence scores α̂_{i,j} in Eq. 2. Figure 4 illustrates the item compositional relations in an example. For better visualization, we select one semantic region generated by our model for each semantic-focal content. The color in Figure 4 changes from light to dark as the value of the coherence score increases.

From Figure 4, it is worth noting that the coherence scores between the t-shirt and the shorts are higher than the others in all coherence spaces. Meanwhile, the scores between the shoes and the bracelet are also quite high. This is intuitive, as the black t-shirt and the dark blue shorts are similar in color, and the shoes and the bracelet are similar in their leopard print style. These observations imply that our proposed CANN provides a good way to capture the visual coherence of complementary items from both global and semantic-focal views.

Figure 3: Results of complementary recommendations over different candidate numbers. (a) Accuracy on FITB Random; (b) MRR on FITB Random; (c) Accuracy on FITB Category; (d) MRR on FITB Category. The horizontal axis is the number of candidates (4-10); the curves correspond to the Color, Texture, Hybrid and ALL variants.

Figure 4: Visualization of the compositional coherence scores between two items in both the global and semantic-focal contents. (a) Global coherence; (b) Color coherence; (c) Texture coherence; (d) Hybrid coherence.

5 Conclusion

In this paper, we proposed a novel framework, the Content Attentive Neural Networks (CANN), to address the problem of content-based complementary recommendations.



For generating complementary item recommendations, the global and semantic contents (such as color collocations and texture compatibilities) are indispensable for understanding the comprehensive compositional relationships among items. Along this line, we provided a focused study on the compositional coherence in item visual contents. More specifically, we first proposed a Global Coherence Learning (GCL) module based on multi-head attention to model the global compositional coherence. Then, we generated the content semantic-focal representations and designed a hierarchical attention module, i.e., Focal Coherence Learning (FCL), to learn the focal coherence from the different semantic-focal contents. Next, to simulate users' decision-making process on complementary items, we optimized CANN with a novel compositional optimization strategy. Finally, we conducted extensive experiments on a real-world dataset, and the experimental results clearly demonstrated the effectiveness of CANN compared with several state-of-the-art methods.

In the future, we would like to consider multi-modal information, such as item descriptions and categories, for deeper exploration of item complementary relationships. Moreover, we also plan to investigate domain knowledge about aesthetics assessment and make the recommender systems more explainable.

Acknowledgments

This research was supported by grants from the National Natural Science Foundation of China (Grants No. 61922073, 61672483, U1605251). Qi Liu acknowledges the support of the Youth Innovation Promotion Association of CAS (No. 2014299) and the USTC-JD joint lab. Bo Wu thanks the support of JD AI Research. We extend special thanks to all the first-line healthcare providers, physicians and nurses who are fighting the war against COVID-19 against time.

References

[Chang et al., 2017] Y. Chang, W. Cheng, B. Wu, and K. Hua. Fashion world map: Understanding cities through streetwear fashion. In ACM MM, pages 91-99, 2017.
[Cui et al., 2019] Z. Cui, Z. Li, S. Wu, X. Zhang, and L. Wang. Dressing as a whole: Outfit compatibility learning based on node-wise graph neural networks. In WWW, pages 307-317. ACM, 2019.
[Deng et al., 2009] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[Felzenszwalb and Huttenlocher, 2004] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167-181, 2004.
[Han et al., 2017] X. Han, Z. Wu, Y. Jiang, and L. S. Davis. Learning fashion compatibility with bidirectional LSTMs. In ACM MM, pages 1078-1086. ACM, 2017.
[He and McAuley, 2016] R. He and J. McAuley. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI, 2016.
[He et al., 2016] R. He, C. Packer, and J. McAuley. Learning compatibility across categories for heterogeneous item recommendation. In ICDM, pages 937-942, 2016.
[Hou et al., 2019] M. Hou, L. Wu, E. Chen, Z. Li, V. W. Zheng, and Q. Liu. Explainable fashion recommendation: A semantic attribute region guided approach. In IJCAI, pages 4681-4688. AAAI Press, 2019.
[Hsiao and Grauman, 2018] Wei-Lin Hsiao and Kristen Grauman. Creating capsule wardrobes from fashion images. In CVPR, pages 7161-7170, 2018.
[Ioffe and Szegedy, 2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448-456, 2015.
[Jin et al., 2019] B. Jin, E. Chen, H. Zhao, Z. Huang, Q. Liu, H. Zhu, and S. Yu. Promotion of answer value measurement with domain effects in community question answering systems. IEEE TSMC: Systems, 2019.
[Kang et al., 2017] W. Kang, C. Fang, Z. Wang, and J. McAuley. Visually-aware fashion recommendation and design with generative image models. In ICDM, pages 207-216. IEEE, 2017.
[Li et al., 2017] Y. Li, L. Cao, J. Zhu, and J. Luo. Mining fashion outfit composition using an end-to-end deep learning approach on set data. IEEE TMM, 19(8):1946-1955, 2017.
[Li et al., 2018] Z. Li, H. Zhao, Q. Liu, Z. Huang, T. Mei, and E. Chen. Learning from history and present: Next-item recommendation via discriminatively exploiting user behaviors. In ACM SIGKDD, pages 1734-1743, 2018.
[Liu et al., 2010] C. Liu, L. Sharan, E. H. Adelson, and R. Rosenholtz. Exploring features in a Bayesian framework for material recognition. In CVPR, pages 239-246. IEEE, 2010.
[Liu et al., 2017] Q. Liu, S. Wu, and L. Wang. DeepStyle: Learning user preferences for visual recommendation. In SIGIR, pages 841-844. ACM, 2017.
[Liu et al., 2018] Q. Liu, H. Zhao, L. Wu, Z. Li, and E. Chen. Illuminating recommendation by understanding the explicit item relations. JCST, 33(4):739-755, 2018.
[Lo et al., 2019] L. Lo, C. Liu, R. Lin, B. Wu, H. Shuai, and W. Cheng. Dressing for attention: Outfit based fashion popularity prediction. In ICIP, pages 3222-3226. IEEE, 2019.
[Ma et al., 2019] J. Ma, Z. Shou, A. Zareian, H. Mansour, A. Vetro, and S. Chang. CDSA: Cross-dimensional self-attention for multivariate, geo-tagged time series imputation. arXiv preprint arXiv:1905.09904, 2019.
[McAuley et al., 2015a] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substitutable and complementary products. In ACM SIGKDD, pages 785-794. ACM, 2015.
[McAuley et al., 2015b] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, pages 43-52. ACM, 2015.
[Robbins and Monro, 1951] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.
[Rudolph et al., 2016] M. Rudolph, F. Ruiz, S. Mandt, and D. Blei. Exponential family embeddings. In NIPS, pages 478-486, 2016.
[Song et al., 2017] X. Song, F. Feng, J. Liu, Z. Li, L. Nie, and J. Ma. NeuroStylist: Neural compatibility modeling for clothing matching. In ACM MM, pages 753-761. ACM, 2017.
[Szegedy et al., 2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, pages 2818-2826, 2016.
[Tan et al., 2004] P. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. Information Systems, 29(4):293-313, 2004.
[Uijlings et al., 2013] Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. IJCV, pages 154-171, 2013.
[Vasileva et al., 2018] M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, and D. Forsyth. Learning type-aware embeddings for fashion compatibility. In ECCV, pages 390-405, 2018.
[Vaswani et al., 2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998-6008, 2017.
[Veit et al., 2015] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In ICCV, pages 4642-4650, 2015.
[Wang et al., 2019] X. Wang, B. Wu, and Y. Zhong. Outfit compatibility prediction and diagnosis with multi-layered comparison network. In ACM MM, pages 329-337, 2019.
[Wu et al., 2019] L. Wu, Z. Li, H. Zhao, Z. Pan, Q. Liu, and E. Chen. Estimating early fundraising performance of innovations via graph-based market environment model. arXiv preprint arXiv:1912.06767, 2019.
[Yu et al., 2018] W. Yu, H. Zhang, X. He, X. Chen, L. Xiong, and Z. Qin. Aesthetic-based clothing recommendation. In WWW, pages 649-658, 2018.
[Yu et al., 2019] H. Yu, L. Litchfield, T. Kernreiter, S. Jolly, and K. Hempstalk. Complementary recommendations: A brief survey. In HPBD, pages 73-78, May 2019.
[Zhang et al., 2018] Y. Zhang, H. Lu, W. Niu, and J. Caverlee. Quality-aware neural complementary item recommendation. In RecSys, pages 77-85. ACM, 2018.
[Zhao et al., 2016] L. Zhao, Z. Lu, S. J. Pan, and Q. Yang. Matrix factorization+ for movie recommendation. In IJCAI, pages 3945-3951, 2016.
[Zhao et al., 2017] T. Zhao, J. McAuley, M. Li, and I. King. Improving recommendation accuracy using networks of substitutable and complementary products. In IJCNN, pages 3649-3655, 2017.
[Zhao et al., 2019] H. Zhao, B. Jin, Q. Liu, Y. Ge, E. Chen, X. Zhang, and T. Xu. Voice of charity: Prospecting the donation recurrence & donor retention in crowdfunding. IEEE TKDE, 2019.
[Zheng et al., 2009] J. Zheng, X. Wu, J. Niu, and A. Bolivar. Substitutes or complements: another step forward in recommendations. In ACM Conference on Electronic Commerce, pages 139-146. ACM, 2009.
