


Multi-Entity Dependence Learning with Rich Context via Conditional Variational Auto-encoder

Luming Tang∗
Tsinghua University, China
[email protected]

Yexiang Xue∗
Cornell University, USA
[email protected]

Di Chen
Cornell University, USA
[email protected]

Carla P. Gomes
Cornell University, USA
[email protected]

Abstract

Multi-Entity Dependence Learning (MEDL) explores conditional correlations among multiple entities. The availability of rich contextual information requires a nimble learning scheme that tightly integrates with deep neural networks and has the ability to capture correlation structures among exponentially many outcomes. We propose MEDL_CVAE, which encodes a conditional multivariate distribution as a generating process. As a result, the variational lower bound of the joint likelihood can be optimized via a conditional variational auto-encoder and trained end-to-end on GPUs. Our MEDL_CVAE was motivated by two real-world applications in computational sustainability: one studies the spatial correlation among multiple bird species using the eBird data, and the other models multi-dimensional landscape composition and human footprint in the Amazon rainforest with satellite images. We show that MEDL_CVAE captures rich dependency structures, scales better than previous methods, and further improves on the joint likelihood by taking advantage of very large datasets that are beyond the capacity of previous methods.

Introduction

Learning the dependencies among multiple entities is an important problem with many real-world applications. For example, in the sustainability domain, the spatial distribution of one species depends on other species due to their interactions in the form of mutualism, commensalism, competition and predation (MacKenzie, Bailey, and Nichols 2004). In natural language processing, the topics of an article are often correlated (Nam et al. 2014). In computer vision, an image may have multiple correlated tags (Wang et al. 2016).

The key challenge behind dependency learning is to capture correlation structures embedded among exponentially many outcomes. One classic approach is Conditional Random Fields (CRF) (Lafferty, McCallum, and Pereira 2001). However, to handle the intractable partition function resulting from multi-entity interactions, CRFs have to incorporate approximate inference techniques such as contrastive divergence (Hinton 2002). In a related application domain

∗ indicates equal contribution. The work was done when Tang, L. was a visiting student at Cornell University.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


Figure 1: Two computational sustainability related applications for MEDL_CVAE. The first application is to study the interactions among bird species in the crowdsourced eBird dataset and environmental covariates, including those from satellite images. The second application is to tag satellite images with a few potentially overlapping landscape categories and track human footprint in the Amazon rainforest.

called multi-label classification, the classifier chains (CC) approach (Read et al. 2009) decomposes the joint likelihood into a product of conditionals and reduces a multi-label classification problem to a series of binary prediction problems. However, as pointed out by (Dembczyński, Cheng, and Hüllermeier 2010), finding the joint mode of CC is also intractable, and to date only approximate search methods are available (Dembczyński et al. 2012).

The availability of rich contextual information, such as millions of high-resolution satellite images, as well as recent developments in deep learning, creates both opportunities and challenges for multi-entity dependence learning. In terms of opportunities, rich contextual information creates the possibility of improving predictive performance, especially when it is combined with highly flexible deep neural networks.

The challenge, however, is to design a nimble scheme that can both tightly integrate with deep neural networks and capture correlation structures among exponentially many outcomes. Deep neural nets are commonly used to extract features from contextual information sources, and can effectively use highly parallel infrastructure such as GPUs. However, classical approaches for structured output, such as sampling, approximate inference and search methods, typically cannot be easily parallelized.

Our contribution is an end-to-end approach to multi-entity dependence learning based on a conditional variational auto-encoder, which handles high-dimensional space effectively and can be tightly integrated with deep neural nets to take advantage of rich contextual information. Specifically, (i) we propose a novel generating process to encode the conditional multivariate distribution in multi-entity dependence learning, in which we bring in a set of hidden variables to capture the randomness in the joint distribution. (ii) The novel conditional generating process allows us to work with imperfect data, capturing noisy and potentially multi-modal responses. (iii) The generating process also allows us to encode the entire problem via a conditional variational auto-encoder, tightly integrated with deep neural nets and implemented end-to-end on GPUs. Using this approach, we are able to leverage rich contextual information to enhance the performance of MEDL beyond the capacity of previous methods.

We apply our Multi-Entity Dependence Learning via Conditional Variational Auto-encoder (MEDL_CVAE) approach to two sustainability-related real-world applications (Gomes 2009). In the first application, we study the interaction among multiple bird species with crowdsourced eBird data and environmental covariates, including those from satellite images. Because species distributions are an important indicator for sustainable development, studying how they change over time helps us understand the effects of climate change and conservation strategies. In our second application, we use high-resolution satellite imagery to study multi-dimensional landscape composition and track human footprint in the Amazon rainforest. See Figure 1 for an overview of the two problems. Both applications study the correlations of multiple entities and use satellite images as rich context information. We are able to show that our MEDL_CVAE (i) captures rich correlation structures among entities, therefore outperforming approaches that assume independence among entities given contextual information; (ii) trains in an order of magnitude less time than previous methods because of a full implementation on GPUs; (iii) achieves a better joint likelihood by incorporating deep neural nets to take advantage of rich context information, namely satellite images, which are beyond the capacity of previous methods.

Preliminaries

We consider modeling the dependencies among multiple entities on problems with rich contextual information. Our dataset consists of tuples $D = \{(x_i, y_i) \mid i = 1, \dots, N\}$, in which $x_i = (x_{i,1}, \dots, x_{i,k}) \in \mathbb{R}^k$ is a high-dimensional contextual feature vector, and $y_i = (y_{i,1}, \dots, y_{i,l}) \in \{0,1\}^l$ is a sequence of $l$ indicator variables, in which $y_{i,j}$ represents whether the $j$-th entity is observed in an environment characterized by covariates $x_i$. The problem is to learn a conditional joint distribution $\Pr(y \mid x)$ which maximizes the conditional joint log likelihood over $N$ data points:

$$\sum_{i=1}^{N} \log \Pr(y_i \mid x_i).$$

Figure 2: Leopards and dholes both live on steppes. Therefore, the probability that each animal occupies a steppe is high. However, due to the competition between the two species, the probability of their co-existence is low.

Multi-entity dependence learning is a general problem with many applications. For example, in our species distribution application, where we would like to model the relationships of multiple bird species, $x_i$ is the vector of environmental covariates of the observational site, which includes a remote sensing picture, the National Land Cover Dataset (NLCD) values (Homer et al. 2015), etc. $y_i = (y_{i,1}, \dots, y_{i,l})$ is a sequence of binary indicator variables, where $y_{i,j}$ indicates whether species $j$ is detected in the observational session of the $i$-th data point. In our application to analyze landscape composition, $x_i$ is the feature vector made up of the satellite image of the given site, and $y_i$ includes multiple indicator variables, such as atmospheric conditions (clear or cloudy) and land cover phenomena (agriculture or forest) of the site.

Our problem is to capture rich correlations between entities. For example, in our species distribution modeling application, the distributions of multiple species are often correlated, due to their interactions such as cooperation and competition for shared resources. As a result, we often cannot assume that the probabilities of each entity's existence are mutually independent given the feature vector, i.e.,

$$\Pr(y_i \mid x_i) \neq \prod_{j=1}^{l} \Pr(y_{i,j} \mid x_i). \qquad (1)$$

See Figure 2 for a specific instance. As a baseline, we call a model that adopts the assumption on the right-hand side of Equation (1) an independent probabilistic model.
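To make Equation (1) concrete, the following minimal sketch (ours, with illustrative numbers in the spirit of Figure 2, not taken from the paper) contrasts a correlated joint distribution over two competing species with the independent model that shares the same marginals:

    import numpy as np

    # Illustrative joint distribution over two competing species at one
    # site (hypothetical numbers). Rows index y1 in {0, 1}; columns
    # index y2 in {0, 1}.
    joint = np.array([[0.05, 0.45],
                      [0.45, 0.05]])

    p1 = joint[1, :].sum()   # marginal Pr(y1 = 1) = 0.5
    p2 = joint[:, 1].sum()   # marginal Pr(y2 = 1) = 0.5

    # The independent model with the same marginals (right-hand side of
    # Equation (1)) predicts:
    independent = np.outer([1 - p1, p1], [1 - p2, p2])

    print(joint[1, 1])        # true co-occurrence probability: 0.05
    print(independent[1, 1])  # independence predicts: 0.25

The marginals match exactly, yet the independent model overstates co-existence by a factor of five; this is precisely the correlation structure that the right-hand side of Equation (1) cannot capture.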

Our Approach

We propose MEDL_CVAE to address two challenges in multi-entity dependence learning.

The first challenge is the noisy and potentially multi-modal responses. For example, consider our species distribution modeling application. A bird watcher can make slightly different observations during two consecutive visits to the same forest location. He may be able to detect a song bird during one visit but not the other if, for example, the bird does not sing both times. This suggests that, even under the very best effort of bird watchers, there is still noise inherently associated with the observations. Another, perhaps more complicated, phenomenon is the multi-modal response, which results from intertwined ecological processes such as mutualism and commensalism. Consider territorial species such as the Red-winged and Yellow-headed Blackbirds, both of which live in open marshes in the Northwestern United States. However, the Yellowheads tend to chase the Redwings out of their territories. As a result, a bird watcher would see either Red-winged or Yellow-headed Blackbirds at an open marsh, but seldom both of them. This suggests that, conditioned on an open marsh environment, there are two possible modes in the response: seeing the Red-winged but not the Yellow-headed, or seeing the Yellow-headed but not the Red-winged.

The second challenge comes from the incorporation of rich contextual information, such as remote sensing imagery, which provides a detailed feature description of the underlying environment, especially in conjunction with the flexibility of deep neural networks. Nevertheless, previous multi-entity models, such as those in (Chen et al. 2017; Guo and Gu 2011; Sutton and McCallum 2012), rely on sampling approaches to estimate the partition function during training. It is difficult to incorporate such a sampling process into the back-propagation of deep neural networks. This limitation poses serious challenges to taking advantage of the rich contextual information.

To address the aforementioned two challenges, we propose a conditional generating model, which makes use of hidden variables to represent the noisy and multi-modal responses. MEDL_CVAE incorporates this model into an end-to-end training pipeline using a conditional variational auto-encoder, which optimizes a variational lower bound of the conditional likelihood function.

Conditional Generating Process Unlike classic approaches such as probit models (Chib and Greenberg 1998), which have a single mode, we use a conditional generating process, which models noisy and multi-modal responses using additional hidden variables. The generating process is depicted in Figure 3.

In the generating process, we are given contextual features $x_i$, which, for example, contain a satellite image. We then assume a set of hidden variables $z_i$, which are generated from a normal distribution conditioned on the values of $x_i$. The binary response variables $y_{i,j}$ are drawn from a Bernoulli distribution whose parameters depend on both the contextual features $x_i$ and the hidden variables $z_i$. The complete generating process becomes:

$$x_i : \text{contextual features}, \qquad (2)$$
$$z_i \mid x_i \sim \mathcal{N}\big(\mu_d(x_i), \Sigma_d(x_i)\big), \qquad (3)$$
$$y_{i,j} \mid z_i, x_i \sim \text{Bernoulli}\big(p_j(z_i, x_i)\big). \qquad (4)$$

Here, $\mu_d(x_i)$, $\Sigma_d(x_i)$ and $p_j(z_i, x_i)$ are general functions depending on $x_i$ and $z_i$, which are modeled as deep neural networks in our application and learned from data.


Figure 3: Our proposed conditional generating process. Given contextual features $x_i$ such as satellite images, we use hidden variables $z_i$, conditionally generated based on $x_i$, to capture noisy and multi-modal responses. The response $y_i$ depends on both the contextual information $x_i$ and the hidden variables $z_i$. See the main text for details.

We denote the parameters in these neural networks as $\theta_d$. The machine learning problem is to find the best parameters that maximize the conditional likelihood $\prod_i \Pr(y_i \mid x_i)$.

This generating process is able to capture noisy and potentially multi-modal distributions. Consider the Red-winged and Yellow-headed Blackbird example. We use $y_{i,1}$ to denote the occurrence of the Red-winged Blackbird and $y_{i,2}$ to denote the occurrence of the Yellow-headed Blackbird. Conditioned on the same environmental context $x_i$ of an open marsh, the outputs $(y_{i,1} = 0, y_{i,2} = 1)$ and $(y_{i,1} = 1, y_{i,2} = 0)$ should both have high probabilities. Therefore, there are two modes in the probability distribution. Notice that it is very difficult to describe this case in any probabilistic model that assumes a single mode.

Our generating process provides the flexibility to capture multi-modal distributions of this type. The high-level idea is similar to mixture models, where we use the hidden variables $z_i$ to denote which mode the actual probabilistic outcome is in. For example, we can have $z_i \mid x_i \sim \mathcal{N}(0, I)$ and two functions $p_1(z)$ and $p_2(z)$, where half of the $z_i$ values are mapped to $(p_1 = 0, p_2 = 1)$ and the other half to $(p_1 = 1, p_2 = 0)$. Figure 3 provides an example, where the $z_i$ values in the region with a yellow background are mapped to one value, and the remaining values are mapped to the other value. In this way, the model will have high probabilities of producing both outcomes $(y_{i,1} = 0, y_{i,2} = 1)$ and $(y_{i,1} = 1, y_{i,2} = 0)$.
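As a concrete illustration, here is a minimal sketch (ours; the layer sizes and stand-in networks are hypothetical, not the authors' code) of sampling one response from the generating process in Equations (2)-(4), with $\mu_d$, a diagonal $\Sigma_d$ and $p_j$ implemented as small PyTorch networks:

    import torch

    # Stand-ins for the learned networks mu_d, (diagonal) Sigma_d and p_j;
    # all sizes are hypothetical.
    k, l, latent_dim = 15, 2, 8

    mu_d = torch.nn.Linear(k, latent_dim)
    log_sigma_d = torch.nn.Linear(k, latent_dim)     # diagonal covariance
    decoder = torch.nn.Sequential(
        torch.nn.Linear(k + latent_dim, 32), torch.nn.ReLU(),
        torch.nn.Linear(32, l), torch.nn.Sigmoid())  # outputs p_j(z_i, x_i)

    x = torch.randn(1, k)                            # contextual features x_i
    z = mu_d(x) + log_sigma_d(x).exp() * torch.randn(1, latent_dim)  # Eq. (3)
    p = decoder(torch.cat([x, z], dim=1))            # Bernoulli parameters
    y = torch.bernoulli(p)                           # Eq. (4): one sampled response

Different draws of $z$ can push the Bernoulli parameters toward different modes, which is how the marsh example above assigns high probability to both responses.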

Conditional Variational Autoencoder Our training algorithm is to maximize the conditional likelihood $\Pr(y_i \mid x_i)$. Nevertheless, a direct method would result in the following optimization problem:

$$\max_{\theta_d} \sum_i \log \Pr(y_i \mid x_i) = \sum_i \log \int \Pr(y_i \mid x_i, z_i) \Pr(z_i \mid x_i) \, dz_i,$$


which is intractable because of a hard integral inside the logarithmic function. Instead, we turn to maximizing a variational lower bound of the conditional log likelihood. To do this, we use a variational function family $Q(z_i \mid x_i, y_i)$ to approximate the posterior $\Pr(z_i \mid x_i, y_i)$. In practice, $Q(z_i \mid x_i, y_i)$ is modeled using a conditional normal distribution:

$$Q(z_i \mid x_i, y_i) = \mathcal{N}\big(\mu_e(x_i, y_i), \Sigma_e(x_i, y_i)\big). \qquad (5)$$

Here, $\mu_e(x_i, y_i)$ and $\Sigma_e(x_i, y_i)$ are general functions, and are modeled using deep neural networks whose parameters are denoted as $\theta_e$. We assume $\Sigma_e$ is a diagonal matrix in our formulation. Following similar ideas in (Kingma and Welling 2013; Kingma et al. 2014), we can prove the following variational equality:

$$\log \Pr(y_i \mid x_i) - D\big[Q(z_i \mid x_i, y_i) \,\|\, \Pr(z_i \mid x_i, y_i)\big] = \mathbb{E}_{z_i \sim Q(z_i \mid x_i, y_i)}\big[\log \Pr(y_i \mid z_i, x_i)\big] - D\big[Q(z_i \mid x_i, y_i) \,\|\, \Pr(z_i \mid x_i)\big]. \qquad (6)$$
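For completeness (the text cites (Kingma and Welling 2013) rather than spelling out the proof), Equation 6 follows by expanding the KL divergence and applying Bayes' rule, $\Pr(z_i \mid x_i, y_i) = \Pr(y_i \mid z_i, x_i)\Pr(z_i \mid x_i)/\Pr(y_i \mid x_i)$:

$$\begin{aligned}
D\big[Q(z_i \mid x_i, y_i) \,\|\, \Pr(z_i \mid x_i, y_i)\big]
&= \mathbb{E}_{z_i \sim Q}\big[\log Q(z_i \mid x_i, y_i) - \log \Pr(z_i \mid x_i, y_i)\big] \\
&= \mathbb{E}_{z_i \sim Q}\big[\log Q(z_i \mid x_i, y_i) - \log \Pr(y_i \mid z_i, x_i) - \log \Pr(z_i \mid x_i)\big] + \log \Pr(y_i \mid x_i) \\
&= \log \Pr(y_i \mid x_i) - \mathbb{E}_{z_i \sim Q}\big[\log \Pr(y_i \mid z_i, x_i)\big] + D\big[Q(z_i \mid x_i, y_i) \,\|\, \Pr(z_i \mid x_i)\big],
\end{aligned}$$

where $\log \Pr(y_i \mid x_i)$ moves outside the expectation because it does not depend on $z_i$; rearranging gives Equation 6.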

On the left-hand side, the first term is the conditional likelihood, which is our objective function. The second term is the Kullback-Leibler (KL) divergence, which measures how close the variational approximation $Q(z_i \mid x_i, y_i)$ is to the true posterior $\Pr(z_i \mid x_i, y_i)$. Because $Q$ is modeled using a neural network, which captures a rich family of functions, we assume that $Q(z_i \mid x_i, y_i)$ approximates $\Pr(z_i \mid x_i, y_i)$ well, and therefore the second KL term is almost zero. And because the KL divergence is always non-negative, the right-hand side of Equation 6 is a tight lower bound of the conditional likelihood, known as the variational lower bound. We therefore directly maximize this value, and the training problem becomes:

$$\max_{\theta_d, \theta_e} \sum_i \mathbb{E}_{z_i \sim Q(z_i \mid x_i, y_i)}\big[\log \Pr(y_i \mid z_i, x_i)\big] - D\big[Q(z_i \mid x_i, y_i) \,\|\, \Pr(z_i \mid x_i)\big]. \qquad (7)$$

The first term of the objective function in Equation 7 can be directly formalized as two neural networks concatenated together, one encoder network and one decoder network, following the reparameterization trick, which is used to back-propagate the gradient inside neural nets. At a high level, suppose $r \sim \mathcal{N}(0, I)$ are samples from the standard Gaussian distribution; then $z_i \sim Q(z_i \mid x_i, y_i)$ can be generated from a "recognition network," which is part of the "encoder network": $z_i \leftarrow \mu_e(x_i, y_i) + \Sigma_e^{1/2}(x_i, y_i)\, r$. Notice that $z_i$ is a hidden variable. Its value depends on the parameters in $\mu_e$ and $\Sigma_e$, which are learned from data, without any human intervention. The "decoder network" takes the input $z_i$ from the encoder network and feeds it, together with $x_i$, to the neural network representing the function $\Pr(y_i \mid z_i, x_i) = \prod_{j=1}^{l} p_j(z_i, x_i)^{y_{i,j}} \big(1 - p_j(z_i, x_i)\big)^{1 - y_{i,j}}$. The second KL divergence term can be calculated in closed form. The entire neural network structure is shown in Figure 4. We refer to $\Pr(z \mid x)$ as the prior network, $Q(z \mid x, y)$ as the recognition network and $\Pr(y \mid x, z)$ as the decoder network. These three networks are all multi-layer fully connected neural networks. The fourth, the feature network, composed of multi-layer convolutional or fully connected layers, extracts high-level features from the contextual source.

[Figure: network diagram. The feature network extracts features from the context information $x$; the recognition network $Q(z \mid x, y)$ and the prior network $\Pr(z \mid x)$ produce latent codes compared via KL$(q \| p)$; the decoder network outputs $\Pr(y \mid x, z)$. Panel (a) shows training of MEDL_CVAE; panel (b) shows testing, where $z$ comes from the prior network.]

Figure 4: Overview of the neural network architecture of MEDL_CVAE for both training and testing stages. ⊕ denotes a concatenation operator.

All four neural networks are trained simultaneously using stochastic gradient descent. In our experiments, the convolutional neural networks used in the feature network are composed of multiple "Convolution → Batch Normalization → ReLU → Pooling" layers, which is a classical setting in many applications. The feature network for each context input has its own hyperparameter settings, such as the number of layers and the hidden layer size. According to our experiments, the performance of our algorithm is not sensitive to the network structure and hyperparameter settings.
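The sketch below (ours; the layer sizes and the MLP feature network are hypothetical stand-ins, not the authors' released code) shows how the four networks and the objective in Equation 7 fit together in PyTorch, using the reparameterization trick and the closed-form KL divergence between two diagonal Gaussians:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical sizes; the paper tunes these per context input.
    FEAT, LATENT, L = 64, 16, 100   # feature dim, latent dim, #entities

    class MEDL_CVAE(nn.Module):
        def __init__(self, x_dim):
            super().__init__()
            # Feature network: extracts high-level features from context x.
            self.feature = nn.Sequential(nn.Linear(x_dim, FEAT), nn.ReLU())
            # Prior network: Pr(z | x), a diagonal Gaussian.
            self.prior = nn.Linear(FEAT, 2 * LATENT)
            # Recognition network: Q(z | x, y), a diagonal Gaussian.
            self.recog = nn.Linear(FEAT + L, 2 * LATENT)
            # Decoder network: Pr(y | x, z), l Bernoulli parameters.
            self.decoder = nn.Sequential(
                nn.Linear(FEAT + LATENT, 256), nn.ReLU(), nn.Linear(256, L))

        def elbo(self, x, y):
            h = self.feature(x)
            mu_p, logvar_p = self.prior(h).chunk(2, dim=-1)
            mu_q, logvar_q = self.recog(torch.cat([h, y], -1)).chunk(2, dim=-1)
            # Reparameterization trick: z = mu + sigma * r, r ~ N(0, I).
            z = mu_q + (0.5 * logvar_q).exp() * torch.randn_like(mu_q)
            logits = self.decoder(torch.cat([h, z], -1))
            # First term of Eq. (7): reconstruction log-likelihood.
            rec = -F.binary_cross_entropy_with_logits(
                logits, y, reduction='none').sum(-1)
            # Second term: KL between diagonal Gaussians, in closed form.
            kl = 0.5 * (logvar_p - logvar_q
                        + (logvar_q - logvar_p).exp()
                        + (mu_q - mu_p) ** 2 / logvar_p.exp() - 1).sum(-1)
            return (rec - kl).mean()

    model = MEDL_CVAE(x_dim=15)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    x, y = torch.randn(32, 15), torch.bernoulli(torch.full((32, 100), 0.3))
    loss = -model.elbo(x, y)    # maximize the variational lower bound
    opt.zero_grad(); loss.backward(); opt.step()

Maximizing the lower bound is implemented as minimizing its negation, so the whole pipeline runs as ordinary GPU back-propagation with no sampling-based inference inside the training loop.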

Related Work

Multi-entity dependence learning has been studied extensively for prediction problems under the names of multi-label classification and structured prediction. Our applications, on the other hand, focus more on probabilistic modeling than on classification. Along this line of research, early methods include k-nearest neighbors (Zhang and Zhou 2005) and dimension reduction (Zhang and Zhou 2010; Li et al. 2016).

Classifier Chains (CC) First proposed by (Read et al. 2009), the CC approach decomposes the joint distribution into the product of a series of conditional probabilities, thereby reducing the multi-label classification problem to $l$ binary classification problems. As noted by (Dembczyński, Cheng, and Hüllermeier 2010), CC takes a greedy approach to find the joint mode, and the result can be arbitrarily far from the true mode. Hence, Probabilistic Classifier Chains (PCC) were proposed, which replaced the greedy strategy with exhaustive search (Dembczyński, Cheng, and Hüllermeier 2010), ε-approximate search (Dembczyński et al. 2012), beam search (Kumar et al. 2013) or A* search (Mena Waldo et al. 2015). To address the issue of error propagation in CC, Ensembles of Classifier Chains (ECC) (Liu and Tsang 2015) average several predictions by different chains to improve the prediction.

Page 5: Multi-Entity Dependence Learning with Rich Context via ...yexiang/publications/cvae_aaai2018_final.pdf · Multi-Entity Dependence Learning with Rich Context via Conditional Variational

Conditional Random Fields (CRF) (Lafferty, McCallum, and Pereira 2001) offer a general framework for structured prediction based on undirected graphical models. When used in multi-label classification, CRFs suffer from computational intractability. To remedy this issue, (Xu et al. 2011) applied ensemble methods and (Deng et al. 2014) proposed a special CRF for problems involving specific hierarchical relations. In addition, (Guo and Gu 2011) proposed using Conditional Dependency Networks, although their method also depends on Gibbs sampling for approximate inference.

Ecological Models Species distribution modeling has been studied extensively in ecology, and (Elith and Leathwick 2009) present a nice survey. For single-species models, (Phillips, Dudík, and Schapire 2004) proposed max-entropy methods to deal with presence-only data. Taking imperfect detection into account, (MacKenzie, Bailey, and Nichols 2004) proposed occupancy models, which were further improved with a stronger version of statistical inference (Hutchinson, Liu, and Dietterich 2011). Species distribution models have been extended to capture population dynamics using cascade models (Sheldon and Dietterich 2011) and non-stationary predictor response models (Fink, Damoulas, and Dave 2013).

Multi-species interaction models have also been proposed (Jun et al. 2011; Harris 2015). Recently, Deep Multi-Species Embedding (DMSE) (Chen et al. 2017) used a probit model coupled with a deep neural net to capture inter-species correlations. This approach is closely related to CRFs and also requires MCMC sampling during training.

Experiments

Datasets and Implementation Details

We evaluate our method on two computational sustainability related datasets. The first is a crowd-sourced bird observation dataset collected from the eBird citizen science project (Munson et al. 2012). Each record in this dataset is referred to as a checklist, in which the bird observer reports all the species he detects, together with the time and geographical location of the observational session. Crossed with the National Land Cover Dataset (NLCD) (Homer et al. 2015), we get a 15-dimensional feature vector for each location, which describes the nearby landscape composition with 15 different land types such as water, forest, etc. In addition, to take advantage of rich external context information, we also collected satellite images for each observation site by matching the longitude and latitude of the observational site to Google Earth¹. As the upper part of Figure 5 shows, the satellite images of different geographical locations are quite different; these images therefore contain rich geographical information. Each image covers an area of 12.3 km² near the observation site. For training and testing, we transform all this data into the form $(x_i, y_i)$, where $x_i$ denotes the contextual features, including NLCD and satellite images, and $y_i$ denotes the multi-species distribution.

¹ https://www.google.com/earth/. Google Earth has already conducted preprocessing, including cloud removal, on the satellite images.

[Figure: three satellite image chips labeled Forest, Lake and Suburb, each with its Red, Green and Blue channel histograms (Hist64), followed by sample chips with landscape composition tags.]

Figure 5: (Top) Three satellite images (left) of different landscapes contain rich geographical information. We can also see that the histograms of the RGB channels for each image (middle) contain useful information and are good summary statistics. (Bottom) Examples of sample satellite image chips and their corresponding landscape composition.

The whole dataset contains all the checklists from Bird Conservation Region (BCR) 13 (Committee and others 2000) in the last two weeks of May from 2004 to 2014, which amounts to 50,949 observations in total. Since May is a migration season and many non-native birds fly over BCR 13, this dataset provides a good opportunity to study these migratory birds. We choose the top 100 most frequently observed birds as the target species, which cover over 95% of the records in our dataset. A simple mutual information analysis reveals rich correlation structure among these species.

Our second application is the Amazon rainforest landscape analysis², in which we tag satellite images with a few landscape categories. Some categories are about atmospheric conditions, such as clear or cloudy, and others are about land cover, such as agricultural or forest land. Raw satellite images were derived from Planet's full-frame analytic scene products using 4-band satellites in the sun-synchronous orbit and the International Space Station orbit. The organizers at Kaggle used Planet's visual product processor to transform the raw images to 3-band JPG format. Each satellite image sample in this dataset contains an image chip covering a ground area of 0.9 km². The chips were analyzed using the CrowdFlower³ platform to obtain ground-truth landscape composition. There are 17 composition label entities; they represent a reasonable subset of phenomena of interest in the Amazon basin and can broadly be broken into three groups: atmospheric conditions, common land cover phenomena and rare land use phenomena. Each chip has one or more atmospheric label entities and zero or more common and rare label entities. Sample chips and their composition are demonstrated in the lower part of Figure 5.

² https://www.kaggle.com/c/planet-understanding-the-amazon-from-space.

³ https://www.crowdflower.com/


Dataset | Training Set Size | Test Set Size | # Entities
eBird   | 45,855            | 5,094         | 100
Amazon  | 30,383            | 4,048         | 17

Table 1: Statistics of the eBird and Amazon datasets.

There exist rich correlations between label entities; for instance, agriculture has a high probability of co-occurring with water and cultivation. We randomly choose 34,431 samples for training, validation and testing. The details of the two datasets are listed in Table 1.

We propose two different neural network architectures for the feature network to extract useful features from satellite images: a multi-layer fully connected neural network (MLP) and a convolutional neural network (CNN). We also rescale images into different resolutions: Image64 for 64×64 pixels and Image256 for 256×256 pixels. In addition, we experiment with summary statistics such as the histograms of an image's RGB channels (upper part of Figure 5) to describe an image (denoted as Hist). Inspired by (You et al. 2017), which assumes that permutation invariance holds and that only the number of pixels of each type in an image (pixel counts) is informative, we transform each image into a matrix $H \in \mathbb{R}^{d \times b}$, where $d$ is the number of bands and $b$ is the number of discrete range sections; thus $H_{i,j}$ indicates the percentage of pixels in the range of section $j$ for band $i$. We use RGB, so $d = 3$. We utilize histogram models with two different settings of $b$: Hist64 for $b = 64$ and Hist128 for $b = 128$.
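A minimal sketch (ours, not the authors' preprocessing code) of the Hist encoding just described, assuming 8-bit RGB chips:

    import numpy as np

    def rgb_histogram(image, bins=64):
        # Per-band pixel-count histogram H (d x b), as fractions of pixels.
        # `image` is an (H, W, 3) uint8 array; bins=64 gives Hist64 and
        # bins=128 gives Hist128.
        d = image.shape[-1]
        H = np.stack([
            np.histogram(image[..., band], bins=bins, range=(0, 256))[0]
            for band in range(d)
        ]).astype(float)
        return H / image[..., 0].size   # normalize to percentages of pixels

    # Example: a random 64x64 RGB chip -> a 3 x 64 summary matrix.
    chip = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(rgb_histogram(chip, bins=64).shape)   # (3, 64)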

All training and testing of our proposed MEDL_CVAE is performed on one NVIDIA Quadro 4000 GPU with 8 GB memory. The whole training process lasts 300 epochs, using a batch size of 512 and the Adam optimizer (Kingma and Ba 2014) with a learning rate of $10^{-4}$, and utilizing batch normalization (Ioffe and Szegedy 2015), a 0.8 dropout rate (Srivastava et al. 2014) and early stopping to accelerate the training process and to prevent overfitting.

Experimental Results

We compare the proposed MEDL_CVAE with two different groups of baseline models. The first group consists of models assuming independence structures among entities, i.e., the distributions of all entities are independent of each other conditioned on the feature vector. Within this group, we have tried different models with different feature inputs, including models that simply combine independent binary classifiers on each indicator $y_{i,j}$, as well as models with highly advanced deep neural net structures such as ResNet (He et al. 2016). The second group consists of previously proposed multi-entity dependence models which have the ability to capture correlations among entities. Within this group, we compare with the recently proposed Deep Multi-Species Embedding (DMSE) (Chen et al. 2017). This model is closely related to CRFs, representing a wide class of energy-based approaches. Moreover, it further improves on classic energy models by taking advantage of the flexibility of deep neural nets to obtain useful feature descriptions. Nevertheless, its training process uses classic MCMC sampling approaches and therefore cannot be fully integrated on GPUs.

Method | Neg. JLL | Time (min)
NLCD+MLP | 36.32 | 2
Image256+ResNet50 | 34.16 | 5.3 hrs
NLCD+Image256+ResNet50 | 34.48 | 5.7 hrs
NLCD+Hist64+MLP | 34.97 | 3
NLCD+Hist128+MLP | 34.62 | 4
NLCD+Image64+MLP | 33.73 | 9
NLCD+MLP+PCC | 35.99 | 21
NLCD+Hist128+MLP+PCC | 35.07 | 33
NLCD+Image64+MLP+PCC | 34.48 | 53
NLCD+DMSE | 30.53 | 20 hrs
NLCD+MLP+MEDL_CVAE | 30.86 | 9
NLCD+Hist64+MLP+MEDL_CVAE | 28.86 | 20
NLCD+Hist128+MLP+MEDL_CVAE | 28.71 | 22
NLCD+Image64+MLP+MEDL_CVAE | 28.24 | 48

Table 2: Negative joint log-likelihood and training time of models assuming independence (first section), previous multi-entity dependence models (second section) and our MEDL_CVAE (third section) on the eBird test set. MEDL_CVAE achieves lower negative log-likelihood than other methods with the same feature network structure and context input, while taking much less training time. Ours is also the only joint model (second and third sections) that achieves the best log-likelihood taking images as inputs, while other models must rely on summary statistics to get good but suboptimal results within a reasonable time limit.

We also compare with Probabilistic Classifier Chains (PCC) (Dembczyński, Cheng, and Hüllermeier 2010), which is a representative approach among a series of models proposed in multi-label classification.

Our baselines and MEDL_CVAE are all trained using different feature network architectures as well as satellite imagery with different resolutions and encodings. We use the Negative Joint Distribution Log-likelihood (Neg. JLL) as the main indicator of a model's performance:

$$-\frac{1}{N} \sum_{i=1}^{N} \log \Pr(y_i \mid x_i),$$

where $N$ is the number of samples in the test set. This indicator measures how likely the model is to produce the observations in the dataset. For MEDL_CVAE models, $\Pr(y_i \mid x_i) = \mathbb{E}_{z_i \sim \Pr(z_i \mid x_i)}\big[\Pr(y_i \mid z_i, x_i)\big]$. We obtain 10,000 samples from $\Pr(z_i \mid x_i)$, given by the prior network, to estimate the expectation. We also double-checked that the estimate is close to, and no smaller than, the variational lower bound in Equation 6. The sampling process can be performed on GPUs within a couple of minutes. The experimental results on the eBird and Amazon datasets are shown in Tables 2 and 3, respectively.
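A minimal sketch (ours, reusing the hypothetical MEDL_CVAE class from the training sketch above) of this Monte Carlo estimate of $-\log \Pr(y \mid x)$, with a logsumexp for numerical stability:

    import math
    import torch
    import torch.nn.functional as F

    def neg_joint_ll(model, x, y, n_samples=10000):
        # Estimate -log Pr(y|x) = -log E_{z ~ Pr(z|x)}[Pr(y|z,x)] by
        # sampling z from the prior network and averaging the decoder's
        # Bernoulli likelihoods. For large batches, sample in chunks.
        with torch.no_grad():
            h = model.feature(x)                               # (B, FEAT)
            mu_p, logvar_p = model.prior(h).chunk(2, dim=-1)
            # n_samples latent codes per example: (S, B, LATENT).
            z = mu_p + (0.5 * logvar_p).exp() * torch.randn(
                n_samples, *mu_p.shape)
            h_rep = h.expand(n_samples, *h.shape)
            logits = model.decoder(torch.cat([h_rep, z], -1))  # (S, B, L)
            log_py = -F.binary_cross_entropy_with_logits(
                logits, y.expand(n_samples, *y.shape),
                reduction='none').sum(-1)
            # log E_z[Pr(y|z,x)] via logsumexp over the S samples.
            log_pyx = torch.logsumexp(log_py, dim=0) - math.log(n_samples)
            return -log_pyx.mean()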

We can observe the following. (1) MEDL_CVAE significantly outperforms all independent models given the same feature network (CNN or MLP) and context information (Image or Hist), even when the independent models use highly advanced deep neural net structures such as ResNet. This shows that our method is able to capture rich dependency structures among entities, thereby outperforming approaches that assume independence among entities.


Method | Neg. JLL
Image64+MLP | 2.83
Hist128+MLP | 2.44
Image64+CNN | 2.16
Image64+MLP+PCC | 2.95
Hist128+MLP+PCC | 2.60
Image64+CNN+PCC | 2.45
Image64+MLP+MEDL_CVAE | 2.37
Hist128+MLP+MEDL_CVAE | 2.09
Image64+CNN+MEDL_CVAE | 2.03

Table 3: Performance of baseline models and MEDL_CVAE on the Amazon dataset. Our method clearly outperforms models assuming independence (first section) and previous multi-entity dependence models (second section) with various types of context input and feature network structures.

[Figure: Precision-Recall curves comparing NLCD+DMSE, NLCD+PCC, NLCD+MEDL_CVAE and NLCD+Image64+MEDL_CVAE; precision (0.70–1.00) on the vertical axis, recall (0.00–0.25) on the horizontal axis.]

Figure 6: Precision-Recall curves for a few multi-entity dependence models on the eBird dataset (better viewed in color). MEDL_CVAE utilizing images as input outperforms other joint methods without imagery.

(2) Compared with previous multi-entity dependence models, MEDL_CVAE trains in an order of magnitude less time. Using the low-dimensional context information NLCD, which is a 15-dimensional vector, PCC's training needs nearly twice the time of MEDL_CVAE, and DMSE needs over 130 times as long (20 hours). (3) In each model group in Table 2, it is clear that adding satellite images improves performance, which shows that rich context is informative. (4) Only our model, MEDL_CVAE, is able to take full advantage of the rich context in satellite images. Other models, such as DMSE, already suffer from a long training time with low-dimensional feature inputs such as NLCD, and cannot scale to using satellite images. It should be noted that NLCD+Image64+MLP+MEDL_CVAE achieves much better performance in only 1/25 of the training time of DMSE. PCC needs less training time than DMSE but does not perform well on joint likelihood. It is clear that, due to the end-to-end training process on GPUs, our method is able to take advantage of rich context input to further improve multi-entity dependence modeling, which is beyond the capacity of previous models.

To further demonstrate MEDL_CVAE's modeling power, we plot the precision-recall curves shown in Figure 6 for all dependency models on the eBird dataset. The precision and recall are defined on the marginals for predicting the occurrence of individual species, averaged over all 100 species in the dataset.

[Figure: t-SNE embedding of species vectors, with clusters labeled American Robin, Wilson Warblers, Green-winged Teal and Hairy Woodpeckers.]

Figure 7: Visualization of the vectors inside the decoder network's last fully connected layer gives a reasonable embedding of multiple bird species when our model is trained on the eBird dataset. Birds living in the same habitats are clustered together.

As we can see, our method outperforms the other models after taking rich context information into account.

Latent Space and Hidden Variables Analysis In order to qualitatively confirm that our MEDL_CVAE learns useful dependence structure between entities, we analyze the latent space formed by the hidden variables in the neural network. Inspired by (Chen et al. 2017), each vector of the decoder network's last fully connected layer can be treated as an embedding showing the relationships among species. Figure 7 visualizes the embedding using t-SNE (Maaten and Hinton 2008). We can observe that birds having similar environmental preferences tend to cluster together. In addition, previous work (Kingma and Welling 2013) has shown that the recognition network in a variational auto-encoder is able to cluster high-dimensional data. We therefore conjecture that the posterior of $z$ from the recognition network should carry meaningful information about the cluster groups. Figure 8 visualizes the posterior of $z \sim Q(z \mid x, y)$ in 2D space using t-SNE on the Amazon dataset. We can see that satellite images of similar landscape composition also cluster together.
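A minimal sketch (ours, again reusing the hypothetical MEDL_CVAE class from the training sketch) of how such species embeddings can be extracted and projected; each row of the decoder's last linear layer's weight matrix corresponds to one entity:

    from sklearn.manifold import TSNE

    # Each row of the decoder's last linear layer (shape: L x hidden) is
    # the learned weight vector for one entity (here, one bird species).
    W = model.decoder[-1].weight.detach().cpu().numpy()    # (100, 256)
    coords = TSNE(n_components=2, perplexity=30).fit_transform(W)
    # coords[j] is the 2-D position of species j; nearby points indicate
    # species with similar learned responses, as in Figure 7.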

Conclusion

In this paper, we propose MEDL_CVAE for multi-entity dependence learning, which encodes a conditional multivariate distribution as a generating process. As a result, the variational lower bound of the joint likelihood can be optimized via a conditional variational auto-encoder and trained end-to-end on GPUs. Tested on two real-world applications in computational sustainability, we show that MEDL_CVAE captures rich dependency structures, scales better than previous methods, and further improves the joint likelihood by taking advantage of very rich context information that is beyond the capacity of previous methods. Future directions include exploring the connection between the current formulation of MEDL_CVAE based on deep neural nets and classic multivariate response models in statistics.


[Figure: t-SNE visualization of the latent posterior, with clusters labeled Cloudy, Habitation and Primary Forest.]

Figure 8: Visualization of the posterior $z \sim Q(z \mid x, y)$ gives a good embedding of landscape composition when our model is trained on the Amazon dataset. Pictures with similar landscapes are clustered together.

Acknowledgements

Tang, L. would like to thank the Department of Physics, Tsinghua University for providing financial support for his stay at Cornell. The authors thank Bin Dai for discussions and all the reviewers for helpful suggestions. We are thankful to the Kaggle competition organizers and the Cornell Lab of Ornithology for providing data for this research. This research was supported by the National Science Foundation (Grant Numbers 0832782, 1522054, 1059284, 1356308), ARO grant W911NF-14-1-0498, the Leon Levy Foundation and the Wolf Creek Foundation.

References

Chen, D.; Xue, Y.; Fink, D.; Chen, S.; and Gomes, C. P. 2017. Deep multi-species embedding. In IJCAI.

Chib, S., and Greenberg, E. 1998. Analysis of multivariate probit models. Biometrika 85(2):347–361.

Committee, U. N., et al. 2000. North American Bird Conservation Initiative: Bird conservation region descriptions, a supplement to the North American Bird Conservation Initiative bird conservation regions map.

Dembczyński, K.; Waegeman, W.; Cheng, W.; and Hüllermeier, E. 2012. On label dependence and loss minimization in multi-label classification. Machine Learning 88(1-2):5–45.

Dembczyński, K.; Cheng, W.; and Hüllermeier, E. 2010. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML.

Deng, J.; Ding, N.; Jia, Y.; Frome, A.; Murphy, K.; Bengio, S.; Li, Y.; Neven, H.; and Adam, H. 2014. Large-scale object classification using label relation graphs. In ECCV.

Elith, J., and Leathwick, J. R. 2009. Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics 40(1):677.

Fink, D.; Damoulas, T.; and Dave, J. 2013. Adaptive spatio-temporal exploratory models: Hemisphere-wide species distributions from massively crowdsourced eBird data. In AAAI.

Gomes, C. P. 2009. Computational sustainability: Computational methods for a sustainable environment, economy, and society. The Bridge 39(4):5–13.

Guo, Y., and Gu, S. 2011. Multi-label classification using conditional dependency networks. In IJCAI.

Harris, D. J. 2015. Generating realistic assemblages with a joint species distribution model. Methods in Ecology and Evolution.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In ECCV.

Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771–1800.

Homer, C.; Dewitz, J.; Yang, L.; Jin, S.; Danielson, P.; Xian, G.; Coulston, J.; Herold, N.; Wickham, J.; and Megown, K. 2015. Completion of the 2011 National Land Cover Database for the conterminous United States – representing a decade of land cover change information. Photogrammetric Engineering & Remote Sensing.

Hutchinson, R. A.; Liu, L.-P.; and Dietterich, T. G. 2011. Incorporating boosted regression trees into ecological latent variable models. In AAAI.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.

Jun, Y.; Wong, W.-K.; Dietterich, T.; Jones, J.; Betts, M.; Frey, S.; Shirley, S.; Miller, J.; and White, M. 2011. Multi-label classification for species distribution. In Proceedings of the ICML 2011 Workshop on Machine Learning for Global Challenges.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 3581–3589.

Kumar, A.; Vembu, S.; Menon, A. K.; and Elkan, C. 2013. Beam search algorithms for multilabel learning. Machine Learning 92(1):65–89.

Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01.

Li, C.; Wang, B.; Pavlu, V.; and Aslam, J. 2016. Conditional Bernoulli mixtures for multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16.


Liu, W., and Tsang, I. 2015. On the optimality of classifier chain for multi-label classification. In Advances in Neural Information Processing Systems, 712–720.

Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.

MacKenzie, D. I.; Bailey, L. L.; and Nichols, J. D. 2004. Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology 73.

Mena Waldo, D.; Montañés Roces, E.; Quevedo Pérez, J. R.; and Coz Velasco, J. J. d. 2015. Using A* for inference in probabilistic classifier chains. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).

Munson, M. A.; Webb, K.; Sheldon, D.; Fink, D.; Hochachka, W. M.; Iliff, M.; Riedewald, M.; Sorokina, D.; Sullivan, B.; Wood, C.; et al. 2012. The eBird reference dataset, version 4.0.

Nam, J.; Kim, J.; Mencía, E. L.; Gurevych, I.; and Fürnkranz, J. 2014. Large-scale multi-label text classification – revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 437–452. Springer.

Phillips, S. J.; Dudík, M.; and Schapire, R. E. 2004. A maximum entropy approach to species distribution modeling. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04. New York, NY, USA: ACM.

Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2009. Classifier chains for multi-label classification. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009.

Sheldon, D. R., and Dietterich, T. G. 2011. Collective graphical models. In NIPS.

Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Sutton, C., and McCallum, A. 2012. An introduction to conditional random fields. Foundations and Trends in Machine Learning 4.

Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; and Xu, W. 2016. CNN-RNN: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2285–2294.

Xu, X.-S.; Jiang, Y.; Peng, L.; Xue, X.; and Zhou, Z.-H. 2011. Ensemble approach based on conditional random field for multi-label image and video annotation. In Proceedings of the 19th ACM International Conference on Multimedia.

You, J.; Li, X.; Low, M.; Lobell, D.; and Ermon, S. 2017. Deep Gaussian process for crop yield prediction based on remote sensing data. In AAAI, 4559–4566.

Zhang, M.-L., and Zhou, Z.-H. 2005. A k-nearest neighbor based algorithm for multi-label classification. In 2005 IEEE International Conference on Granular Computing, volume 2.

Zhang, Y., and Zhou, Z.-H. 2010. Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data 4(3).