
Variational Learning on Aggregate Outputs with Gaussian Processes

Ho Chung Leon Law*, University of Oxford
Dino Sejdinovic*, University of Oxford
Ewan Cameron†, University of Oxford
Tim CD Lucas†, University of Oxford
Seth Flaxman‡, Imperial College London
Katherine Battle†, University of Oxford
Kenji Fukumizu§, Institute of Statistical Mathematics

Abstract

While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations.

1 Introduction

A typical supervised learning setup assumes existence of a set of input-output examples {(x_ℓ, y_ℓ)}_ℓ from which a functional relationship or a conditional probabilistic model of outputs given inputs can be learned. A prototypical use-case is the situation where obtaining outputs y⋆ for new, previously unseen, inputs x⋆ is costly, i.e., labelling is expensive and requires human intervention, but measurements of inputs are cheap and automated. Similarly, in many applications, due to a much greater cost in acquiring labels, they are only available at a much coarser resolution than the level at which the inputs are available and at which we wish to make predictions. This is the problem of weakly supervised learning on aggregate outputs [14, 20], which has been studied in the literature in a variety of forms, with classification and regression notably being developed separately and without any unified treatment which can allow more flexible observation models. In this contribution, we consider a framework of observation models of aggregated outputs given bagged inputs, which reside in exponential families. While we develop a more general treatment, the main focus in the paper is on the Poisson likelihood for count data, which is motivated by the applications in spatial statistics.

In particular, we consider the important problem of fine-scale mapping of diseases. High resolution maps of infectious disease risk can offer a powerful tool for developing National Strategic Plans,

*Department of Statistics, Oxford, UK. <[email protected], [email protected]>
†Big Data Institute, Oxford, UK. <[email protected], [email protected], [email protected]>
‡Department of Mathematics and Data Science Institute, London, UK. <[email protected]>
§Tachikawa, Japan. <[email protected]>

Preprint.


allowing accurate stratification of intervention types to areas of greatest impact [5]. In low resource settings these maps must be constructed through probabilistic models linking the limited observational data to a suite of spatial covariates (often from remote-sensing images) describing social, economic, and environmental factors thought to influence exposure to the relevant infectious pathways. In this paper, we apply our method to the incidence of clinical malaria cases. Point incidence data of malaria is typically available at a high temporal frequency (weekly or monthly), but lacks spatial precision, being aggregated by administrative district or by health facility catchment. The challenge for risk modelling is to produce fine-scale predictions from these coarse incidence data, leveraging the remote-sensing covariates and appropriate regularity conditions to ensure a well-behaved problem.

Methodologically, the Poisson distribution is a popular choice for modelling count data. In the mapping setting, the intensity of the Poisson distribution is modelled as a function of spatial and other covariates. We use Gaussian processes (GPs) as a flexible model for the intensity. GPs are a widely used approach in spatial modelling but also one of the pillars of Bayesian machine learning, enabling predictive models which explicitly quantify their uncertainty. Recently, we have seen many advances in variational GP posterior approximations, allowing them to couple with more complex observation likelihoods (e.g. binary or Poisson data [21, 17]) as well as a number of effective scalable GP approaches [24, 30, 8, 9], extending the applicability of GPs to dataset sizes previously deemed prohibitive.

Contribution  Our contributions can be summarised as follows. A general framework is developed for aggregated observation models using exponential families and Gaussian processes. This is novel, as previous work on aggregation or bag models focuses on specific types of output models such as binary classification. Tractable and scalable variational inference methods are proposed for several instances of the aggregated observation models, making use of novel lower bounds on the model evidence. In experiments, it is demonstrated that the proposed methods can scale to dataset sizes of more than 1 million observations. We thoroughly investigate an application of the developed methodology to disease mapping from coarse measurements, where the observation model is Poisson, giving encouraging results. Uncertainty quantification, which is explicit in our models, is essential for this application.

Related Work  The framework of learning from aggregate data is believed to have been first introduced in [20], which considers the two regimes of classification and regression. However, while the task of classification of individuals from aggregate data (also known as learning from label proportions) has been explored widely in the literature [23, 22, 13, 18, 35, 34, 14], there has been little literature on the analogous regression regime in the machine learning community. Perhaps the closest literature available is [13], who considers a general framework for learning from aggregate data, but also only considers the classification case for experiments. In this work, we will appropriately adjust the framework in [13] and take this to be our baseline. A related problem arises in the spatial statistics community under the name of 'down-scaling', 'fine-scale modelling' or 'spatial disaggregation' [11, 10], in the analysis of disease mapping, agricultural data, and species distribution modelling, with a variety of proposed methodologies (cf. [33] and references therein), including kriging [6]. However, to the best of our knowledge, approaches making use of recent advances in scalable variational inference for GPs are not considered.

Another closely related topic is multiple instance learning (MIL), concerned with classification with max-aggregation over labels in a bag, i.e. a bag is positively labelled if at least one individual is positive, and it is otherwise negatively labelled. While the task in MIL is typically to predict labels of new unobserved bags, [7] demonstrates that individual labels of a GP classifier can also be inferred in the MIL setting with variational inference. Our work parallels that approach, considering bag observation models in exponential families and deriving new approximation bounds for some common generalized linear models. In deriving these bounds, we have taken an approach similar to [17], who considers the problem of Gaussian process-modulated Poisson process estimation using variational inference. However, our problem is made more complicated by the aggregation of labels. Other related research topics include distribution regression and set regression, as in [28, 15, 16] and [36]. In these regression problems, while the input data for learning is the same as in the current setup, the goal is to learn a function at the bag level, rather than the individual level; the application of these methods in our setting, naively treating single individuals as "distributions", may lead to suboptimal performance. An overview of some other approaches for classification using bags of instances is given in [4].

2 Bag observation model: aggregation in mean parameters

Suppose we have a statistical model p(y|η) for output y ∈ 𝒴, with parameter η given by a function of input x ∈ 𝒳, i.e., η = η(x). Although one can formulate p(y|η) in an arbitrary fashion, practitioners often only focus on interpretable simple models, hence we restrict our attention to p(y|η) arising from exponential families. We assume that η is the mean parameter of the exponential family.

Assume that we have a fixed set of points x^a_i ∈ 𝒳 such that x^a = {x^a_1, ..., x^a_{N_a}} is a bag of points with N_a individuals, and we wish to estimate the regression value η(x^a_i) for each individual. However, instead of the typical setup where we have a paired sample {(x_ℓ, y_ℓ)}_ℓ of individuals and their outputs to use as a training set, we observe only aggregate outputs y^a for each of the bags. Hence, our training data is of the form

$$ \left(\{x_i^1\}_{i=1}^{N_1}, y^1\right), \ldots, \left(\{x_i^n\}_{i=1}^{N_n}, y^n\right), \tag{1} $$

and the goal is to estimate parameters η(x^a_i) corresponding to individuals. To relate the aggregate y^a and the bag x^a = (x^a_i)_{i=1}^{N_a}, we use the following bag observation model:

$$ y^a \mid \mathbf{x}^a \sim p(y \mid \eta^a), \qquad \eta^a = \sum_{i=1}^{N_a} w_i^a\, \eta(x_i^a), \tag{2} $$

where w^a_i is an optional fixed non-negative weight used to adjust the scales (see Section 3 for an example). Note that the aggregation in the bag observation model is on the mean parameters for individuals, not necessarily on the individual responses y^a_i. This implies that each individual contributes to the mean bag response and that the observation model for bags belongs to the same parametric form as the one for individuals. For tractable and scalable estimation, we will use variational methods, as the aggregated observation model leads to intractable posteriors. We consider the Poisson, normal, and exponential distributions, but devote a special focus to the Poisson model in this paper; we refer readers to Appendix A for the other cases and to Appendix H.2 for experimental results for the normal model.

It is also worth noting that we place no restrictions on the collection of the individuals, with the bagging process possibly dependent on covariates x^a_i or any unseen factors. The bags can also be of different sizes, with potentially the same individuals appearing in multiple bags. After we obtain our individual model η(x), we can use it for prediction of in-bag individuals, as well as out-of-bag individuals.

3 Poisson bag model: Modelling aggregate counts

The Poisson distribution p(y|λ) = λ^y e^{−λ}/(y!) is considered for count observations, and this paper discusses Poisson regression with intensity λ(x^a_i) multiplied by a 'population' p^a_i, which is a constant assumed to be known for each individual (or 'sub-bag') in the bag. The population for a bag a is given by p^a = Σ_i p^a_i. An observed bag count y^a is assumed to follow

$$ y^a \mid \mathbf{x}^a \sim \mathrm{Poisson}(p^a \lambda^a), \qquad \lambda^a := \sum_{i=1}^{N_a} \frac{p_i^a}{p^a}\, \lambda(x_i^a). $$

Note that, by introducing unobserved counts y^a_i ∼ Poisson(y^a_i | p^a_i λ(x^a_i)), the bag observation y^a has the same distribution as Σ_{i=1}^{N_a} y^a_i, since the Poisson distribution is closed under convolutions.

If a bag and its individuals correspond to an area and its partition in geostatistical applications, as in the malaria example in Section 4.2, the population in the above bag model can be regarded as the population of an area or a sub-area. With this formulation, the goal is to estimate the basic intensity function λ(x) from the aggregated observations (1). Assuming independence given {x^a}_a, the negative log-likelihood (NLL) ℓ_0 across bags is

$$ -\log\left[\prod_{a=1}^{n} p(y^a \mid \mathbf{x}^a)\right] \overset{c}{=} \sum_{a=1}^{n} \left\{ p^a\lambda^a - y^a \log(p^a\lambda^a) \right\} \overset{c}{=} \sum_{a=1}^{n} \left[ \sum_{i=1}^{N_a} p_i^a \lambda(x_i^a) - y^a \log\left( \sum_{i=1}^{N_a} p_i^a \lambda(x_i^a) \right) \right], \tag{3} $$


where $\overset{c}{=}$ denotes an equality up to an additive constant. During training, this term will pass information from the bag-level observations {y^a} to the individual basic intensity λ(x^a_i). It is noted that once we have trained an appropriate model for λ(x^a_i), we will be able to make individual-level predictions, and also bag-level predictions if desired. We will consider baselines with (3) using penalized likelihoods inspired by manifold regularization in semi-supervised learning [2], presented in Appendix B. In the next section, we propose a model for λ based on GPs.
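Before moving to the GP model, note that the bag-level training signal in (3) is simple to compute; the following is a minimal NumPy sketch (the function and variable names, e.g. `lam_fn`, are illustrative and not from the paper's released code):

```python
import numpy as np

def poisson_bag_nll(bags, y, pops, lam_fn):
    """Bag-level Poisson NLL of eq. (3), up to the additive log(y^a!) constant.

    bags   : list of (N_a, d) arrays of covariates x^a_i
    y      : (n,) array of aggregated counts y^a
    pops   : list of (N_a,) arrays of populations p^a_i
    lam_fn : callable mapping an (N_a, d) array to non-negative rates lambda(x^a_i)
    """
    nll = 0.0
    for xa, ya, pa in zip(bags, y, pops):
        mu_a = np.dot(pa, lam_fn(xa))      # sum_i p^a_i * lambda(x^a_i)
        nll += mu_a - ya * np.log(mu_a)
    return nll
```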

3.1 VBAgg-Poisson: Gaussian processes for aggregate counts

Suppose now we model f as a Gaussian process (GP); then we have

$$ y^a \mid \mathbf{x}^a \sim \mathrm{Poisson}\left( \sum_{i=1}^{N_a} p_i^a \lambda_i^a \right), \qquad \lambda_i^a = \Psi(f(x_i^a)), \qquad f \sim \mathcal{GP}(\mu, k), \tag{4} $$

where μ is a mean function and k(x, x') is a covariance kernel. (For implementation, we consider a constant mean function.) Since the intensity is always non-negative, in all models we will need to use a transformation λ(x) = Ψ(f(x)), where Ψ is a non-negative valued function. We will consider the cases Ψ(f) = f² and Ψ(f) = e^f. A discussion of various choices of this link function in the context of Poisson intensities modulated by GPs is given in [17]. Modelling f with a GP allows us to propagate uncertainty on the predictions to λ^a_i, which is especially important in this weakly supervised problem setting, where we do not directly observe any individual output y^a_i. Since the total number of individuals in our target application of disease mapping is typically in the millions (see Section 4.2), we will approximate the posterior over λ^a_i := λ(x^a_i) using variational inference, with details found in Appendix E.

For scalability of the GP method, as in previous literature [7, 17], we use a set of inducing points {u_ℓ}_{ℓ=1}^m, which are given by the function evaluations of the Gaussian process f at landmark points W = {w_1, ..., w_m}; i.e., u_ℓ = f(w_ℓ). The distribution p(u|W) is thus given by

$$ u \sim \mathcal{N}(\mu_W, K_{WW}), \qquad \mu_W = (\mu(w_\ell))_\ell, \qquad K_{WW} = (k(w_s, w_t))_{s,t}. \tag{5} $$

The joint likelihood is given by

$$ p(y, f, u \mid X, W, \Theta) = \prod_{a=1}^{n} \mathrm{Poisson}(y^a \mid p^a\lambda^a)\; p(f \mid u)\, p(u \mid W), \qquad f \mid u \sim \mathcal{GP}(\tilde{\mu}_u, \tilde{K}), \tag{6} $$

$$ \tilde{\mu}_u(z) = \mu_z + k_{zW} K_{WW}^{-1}(u - \mu_W), \qquad \tilde{K}(z, z') = k(z, z') - k_{zW} K_{WW}^{-1} k_{Wz'}, \tag{7} $$

where k_{zW} = (k(z, w_1), ..., k(z, w_m))^⊤, with μ_W, μ_z denoting the respective evaluations of the mean function μ, and Θ being the parameters of the mean and kernel functions of the GP. Proceeding similarly to [17], which discusses (non-bag) Poisson regression with GPs, we obtain a lower bound on the marginal log-likelihood log p(y|Θ):

$$ \begin{aligned} \log p(y \mid \Theta) &= \log \int\!\!\int p(y, f, u \mid X, W, \Theta)\, df\, du \\ &\geq \int\!\!\int \log\left\{ p(y \mid f, \Theta)\, \frac{p(u \mid W)}{q(u)} \right\} p(f \mid u, \Theta)\, q(u)\, df\, du \qquad \text{(Jensen's inequality)} \\ &= \sum_a \int\!\!\int \left\{ y^a \log\left( \sum_{i=1}^{N_a} p_i^a \Psi(f(x_i^a)) \right) - \sum_{i=1}^{N_a} p_i^a \Psi(f(x_i^a)) \right\} p(f \mid u)\, q(u)\, df\, du \\ &\quad - \sum_a \log(y^a!) - \mathrm{KL}(q(u)\,\|\,p(u \mid W)) =: \mathcal{L}(q, \Theta), \end{aligned} \tag{8} $$

where q(u) is a variational distribution to be optimized. The general solution to the maximization over q of the evidence lower bound ℒ(q, Θ) above is given by the posterior of the inducing points p(u|y), which is intractable. We introduce a restriction to the class of q(u) to approximate the posterior p(u|y). Suppose that the variational distribution q(u) is Gaussian, q(u) = N(η_u, Σ_u). We then need to maximize the lower bound ℒ(q, Θ) over the variational parameters η_u and Σ_u. The resulting q(u) gives an approximation to the posterior p(u|y), which also leads to a Gaussian approximation q(f) = ∫ p(f|u) q(u) du to the posterior p(f|y), which we finally transform through Ψ to obtain the desired approximate posterior on each λ(x^a_i) (which is either log-normal or non-central χ², depending on the form of Ψ). The approximate posterior on λ will then allow us to make predictions for individuals while, crucially, taking into account the uncertainties in f (note that even the posterior predictive mean of λ will depend on the predictive variance in f due to the nonlinearity Ψ). We also want to emphasize that the use of inducing variables is essential for scalability in our model: we cannot directly obtain approximations to the posterior of λ(x^a_i) for all individuals, since the number of individuals is often large in our problem setting (Section 4.2).

As p(u|W) and q(u) are both Gaussian, the last term (the KL divergence) of (8) can be computed explicitly, with the exact form found in Appendix E.3. To consider the first two terms, let q_a(v^a) be the marginal normal distribution of v^a = (f(x^a_1), ..., f(x^a_{N_a})), where f follows the variational posterior q(f). The distribution of v^a is then N(m^a, S^a), where, using (7),

$$ m^a = \mu_{x^a} + K_{x^aW} K_{WW}^{-1} (\eta_u - \mu_W), \qquad S^a = K_{x^a x^a} - K_{x^aW} \left( K_{WW}^{-1} - K_{WW}^{-1} \Sigma_u K_{WW}^{-1} \right) K_{Wx^a}. \tag{9} $$
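As a concrete illustration, the marginals in (9) only require solves against K_WW; a hedged NumPy sketch, with a generic kernel function `k` and mean function `mu_fn` (both placeholders), is:

```python
import numpy as np

def bag_marginals(Xa, W, k, mu_fn, eta_u, Sigma_u):
    """Marginal mean m^a and covariance S^a of q(f) at bag locations Xa, eq. (9)."""
    Kww = k(W, W)                                   # (m, m)
    Kaw = k(Xa, W)                                  # (N_a, m)
    A = np.linalg.solve(Kww, Kaw.T).T               # K_{x^a W} K_{WW}^{-1}
    m_a = mu_fn(Xa) + A @ (eta_u - mu_fn(W))
    S_a = k(Xa, Xa) - A @ (Kww - Sigma_u) @ A.T     # equals eq. (9) after expanding
    return m_a, S_a
```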

In the first term of (8), each summand is of the form

$$ y^a \int \log\left( \sum_{i=1}^{N_a} p_i^a\, \Psi(v_i^a) \right) q_a(v^a)\, dv^a \;-\; \sum_{i=1}^{N_a} p_i^a \int \Psi(v_i^a)\, q_a(v^a)\, dv^a, \tag{10} $$

in which the second term is tractable for both Ψ(f) = f² and Ψ(f) = e^f. The integral in the first term, however, is not tractable when q_a is Gaussian. To solve this, we take different approaches for Ψ(f) = f² and Ψ(f) = e^f: for the former, an approximation by Taylor expansion is applied, while for the latter, a further lower bound is taken.

First consider the case Ψ(f) = f², and rewrite the first term of (8) as

$$ y^a\, \mathbb{E} \log \|V^a\|^2, \qquad V^a \sim \mathcal{N}(\tilde{m}^a, \tilde{S}^a), $$

with P^a = diag(p^a_1, ..., p^a_{N_a}), \tilde{m}^a = (P^a)^{1/2} m^a and \tilde{S}^a = (P^a)^{1/2} S^a (P^a)^{1/2}. By a Taylor series approximation for E log‖V^a‖² (similar to [29]) around E‖V^a‖² = ‖\tilde{m}^a‖² + tr \tilde{S}^a, we obtain

$$ \int \log\left( \sum_{i=1}^{N_a} p_i^a (v_i^a)^2 \right) q_a(v^a)\, dv^a \;\approx\; \log\left( m^{a\top} P^a m^a + \mathrm{tr}(S^a P^a) \right) - \frac{2\, m^{a\top} P^a S^a P^a m^a + \mathrm{tr}\!\left( (S^a P^a)^2 \right)}{\left( m^{a\top} P^a m^a + \mathrm{tr}(S^a P^a) \right)^2} =: \zeta_a, \tag{11} $$

with details in Appendix E.4. An alternative approach, which we use for the case Ψ(f) = e^f, is to take a further lower bound, which is applicable to a general class of Ψ (we provide further details for the analogous approach for Ψ(v) = v² in Appendix E.2). We use the following lemma (proof found in Appendix E.1):

Lemma 1. Let v = [v_1, ..., v_N]^⊤ be a random vector with probability density q(v) with marginal densities q_i(v_i), and let w_i ≥ 0, i = 1, ..., N. Then, for any non-negative valued function Ψ(v),

$$ \int \log\left( \sum_{i=1}^{N} w_i \Psi(v_i) \right) q(v)\, dv \;\geq\; \log\left( \sum_{i=1}^{N} w_i e^{\xi_i} \right), \qquad \text{where } \xi_i := \int \log \Psi(v_i)\, q_i(v_i)\, dv_i. $$

Hence we obtain that

$$ \int \log\left( \sum_{i=1}^{N_a} p_i^a e^{v_i^a} \right) q_a(v^a)\, dv^a \;\geq\; \log\left( \sum_{i=1}^{N_a} p_i^a e^{m_i^a} \right). \tag{12} $$
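Both approximations are cheap to evaluate once m^a and (the relevant parts of) S^a are available. A small sketch, assuming NumPy/SciPy and our own variable names:

```python
import numpy as np
from scipy.special import logsumexp

def zeta_taylor(m_a, S_a, p_a):
    """Taylor approximation (11) to E[log sum_i p^a_i f(x^a_i)^2], for Psi(f) = f^2."""
    P = np.diag(p_a)
    mPm = m_a @ P @ m_a
    trSP = np.trace(S_a @ P)
    num = 2.0 * m_a @ P @ S_a @ P @ m_a + np.trace((S_a @ P) @ (S_a @ P))
    return np.log(mPm + trSP) - num / (mPm + trSP) ** 2

def log_bound_exp(m_a, p_a):
    """Lemma 1 lower bound (12) to E[log sum_i p^a_i exp(f(x^a_i))], for Psi(f) = exp(f)."""
    return logsumexp(m_a + np.log(p_a))   # log sum_i p^a_i exp(m^a_i)
```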

Using the above two approximation schemes, our objective (up to constant terms) can be formulated as follows.

1) Ψ(v) = v²:

$$ \mathcal{L}_1^s(\Theta, \eta_u, \Sigma_u, W) := \sum_{a=1}^{n} y^a \zeta_a - \sum_{a=1}^{n} \sum_{i=1}^{N_a} p_i^a \left\{ (m_i^a)^2 + S_{ii}^a \right\} - \mathrm{KL}(q(u)\,\|\,p(u \mid W)), \tag{13} $$


Figure 1: Left: Random samples on the Swiss roll manifold. Middle, Right: Individual average NLL on the train set for varying number of training bags n and increasing N_mean, over 5 repetitions. Constant prediction within a bag gives an NLL of 2.22. The bag-pixel model gives NLL above 2.4 for the varying-number-of-bags experiment.

2) Ψ(v) = e^v:

$$ \mathcal{L}_1^e(\Theta, \eta_u, \Sigma_u, W) := \sum_{a=1}^{n} y^a \log\left( \sum_{i=1}^{N_a} p_i^a e^{m_i^a} \right) - \sum_{a=1}^{n} \sum_{i=1}^{N_a} p_i^a\, e^{m_i^a + S_{ii}^a/2} - \mathrm{KL}(q(u)\,\|\,p(u \mid W)). \tag{14} $$

Given these objectives, we can now optimise these lower bounds with respect to the variational parameters {η_u, Σ_u} and the parameters Θ of the mean and kernel functions, using stochastic gradient descent (SGD) on bags. Additionally, we might also learn W, the locations of the landmark points. In this form, we can also see that the bound for Ψ(v) = e^v has the added computational advantage of not requiring the full computation of the matrix S^a, but only its diagonal, while for Ψ(v) = v² the computation of ζ_a involves the full S^a, which may be problematic for extremely large bag sizes.
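To make the optimisation concrete, a minimal sketch of the objective (14) for Ψ(v) = e^v is given below; it assumes the per-bag marginals m^a and diag(S^a) from (9) have been computed, and uses the standard closed-form Gaussian KL. Names and the overall structure are illustrative, not the paper's TensorFlow implementation.

```python
import numpy as np

def gauss_kl(eta_u, Sigma_u, mu_w, Kww):
    """KL( N(eta_u, Sigma_u) || N(mu_w, Kww) ) for the inducing variables u."""
    m = len(eta_u)
    Kinv = np.linalg.inv(Kww)
    diff = mu_w - eta_u
    return 0.5 * (np.trace(Kinv @ Sigma_u) + diff @ Kinv @ diff - m
                  + np.linalg.slogdet(Kww)[1] - np.linalg.slogdet(Sigma_u)[1])

def objective_exp(y, pops, m_list, s_diag_list, eta_u, Sigma_u, mu_w, Kww):
    """Objective L^e_1 of eq. (14), to be maximised (e.g. by SGD over mini-batches of bags)."""
    val = 0.0
    for ya, pa, ma, sa in zip(y, pops, m_list, s_diag_list):
        val += ya * np.log(np.sum(pa * np.exp(ma)))       # lower bound (12)
        val -= np.sum(pa * np.exp(ma + 0.5 * sa))         # E_q[ p^a_i Psi(f(x^a_i)) ]
    return val - gauss_kl(eta_u, Sigma_u, mu_w, Kww)
```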

4 Experiments

We will now demonstrate various approaches: Variational Bayes with Gaussian processes (VBAgg), a MAP estimator of Bayesian Poisson regression with explicit feature maps (Nyström) and a neural network (NN), the latter two employing manifold regularisation with an RBF kernel (unless stated otherwise). For additional baselines, we consider a constant-within-bag model (constant), i.e. λ̂^a_i = y^a / p^a, and also consider creating 'individual' covariates by aggregation of the covariates within a bag (bag-pixel). For details of all these approaches, see Appendix B. We also denote Ψ(v) = e^v and Ψ(v) = v² as Exp and Sq respectively.

We implement our models in TensorFlow⁵ and use SGD with Adam [12] to optimise their respective objectives, and we split the dataset into 4 parts, namely train, early-stop, validation and test sets. Here the early-stop set is used for early stopping for the Nyström, NN and bag-pixel models, while the VBAgg approach ignores this partition as it optimises the lower bound to the marginal likelihood. The validation set is used for parameter tuning of any regularisation scaling, as well as learning rate, layer size and multiple initialisations. Throughout, VBAgg and Nyström have access to the same set of landmarks for fair comparison. It is also important to highlight that we perform early stopping and tuning based on bag-level performance on NLL only, as this is the only information available to us.

For the VBAgg model, there are two approaches to tuning: one is to choose parameters based on NLL on the validation bag sets, another is to select all parameters based on the training objective ℒ_1, the lower bound to the marginal likelihood. We denote the latter approach VBAgg-Obj and report its toy experiment results in Appendix H.1.1 for presentation purposes. In general, the results are relatively insensitive to this choice, especially when Ψ(v) = v². To make predictions, we use the mean of our approximated posterior (provided by a log-normal and non-central χ² distribution for Exp and Sq respectively). As an additional evaluation, we report mean squared error (MSE) and bag performance results in Appendix H.
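Concretely, with marginal posterior q(f(x)) = N(m, s²) at a test point, the posterior mean of λ(x) = Ψ(f(x)) used for prediction has a closed form in both cases (log-normal mean for Exp, mean of a non-central χ²-type variable for Sq); a small illustrative sketch:

```python
import numpy as np

def predict_rate(m, s2, link="sq"):
    """Posterior mean of lambda = Psi(f) under q(f) = N(m, s2), elementwise.

    link = "exp": E[exp(f)] = exp(m + s2 / 2)   (log-normal mean)
    link = "sq" : E[f**2]   = m**2 + s2         (non-central chi-squared type mean)
    """
    if link == "exp":
        return np.exp(m + 0.5 * s2)
    return m ** 2 + s2
```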

⁵Code will be available for use.


4.1 Poisson Model: Swiss Roll

We first demonstrate our method on the Swiss roll dataset⁶, illustrated in Figure 1 (left). To make this an aggregate learning problem, we first construct n bags with sizes drawn from a negative binomial distribution N_a ∼ NB(N_mean, N_std), where N_mean and N_std represent the respective mean and standard deviation of N_a. We then randomly select Σ_{a=1}^n N_a points from the Swiss roll manifold to be the locations, giving us a set of coloured locations in ℝ³. Ordering these random locations by their z-axis coordinate, we group them, filling up each bag in turn as we move along the z-axis. The aim of this is to simulate that in real life the partitioning of locations into bags is often not independent of covariates. Taking the colour of each location as the underlying rate λ^a_i at that location, we simulate y^a_i ∼ Poisson(λ^a_i), and take our observed outputs to be y^a = Σ_{i=1}^{N_a} y^a_i ∼ Poisson(λ^a), where λ^a = Σ_{i=1}^{N_a} λ^a_i. Our goal is then to predict the underlying individual rate parameter λ^a_i, given only bag-level observations y^a. To make this problem even more challenging, we embed the data manifold into ℝ¹⁸ by rotating it with a random orthogonal matrix. For the choice of k for VBAgg and Nyström, we use the RBF kernel, with the bandwidth parameter learnt. For landmark locations, we use the K-means++ algorithm, so that landmark points lie evenly across the data manifold.
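A sketch of this bag construction, using scikit-learn's Swiss roll sampler, is given below; the negative binomial parameterisation, the choice of which coordinate plays the role of the z-axis, and the omission of the random rotation into ℝ¹⁸ are our simplifications.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

def make_bags(n_bags, n_mean=150, n_std=50, seed=0):
    """Construct bags of Swiss roll points ordered along one axis, as in Section 4.1."""
    rng = np.random.default_rng(seed)
    # Negative binomial with mean n_mean and std n_std: var = mean + mean^2 / r.
    r = n_mean ** 2 / (n_std ** 2 - n_mean)
    p = r / (r + n_mean)
    sizes = rng.negative_binomial(r, p, size=n_bags)
    X, t = make_swiss_roll(n_samples=int(sizes.sum()), random_state=seed)
    order = np.argsort(X[:, 1])          # group along one axis (the paper orders by the z-axis)
    X, rates = X[order], t[order]        # colour parameter t plays the role of the rate lambda^a_i
    bags, start = [], 0
    for n_a in sizes:
        xa, la = X[start:start + n_a], rates[start:start + n_a]
        ya = rng.poisson(la.sum())       # equal in distribution to summing Poisson(lambda^a_i) draws
        bags.append((xa, la, ya))
        start += n_a
    return bags
```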

Varying number of bags: n  To see the effect of increasing the number of bags available for training, we fix N_mean = 150 and N_std = 50, and vary the number of bags n for the training set from 100 to 350, with the same number of bags for early stopping and validation. Each experiment is repeated for 5 runs, and results are shown in Figure 1 for individual NLL on the train set. Again we emphasise that the individual labels are not used in training. We see that all versions of VBAgg outperform all other models, in terms of MSE and NLL, with statistical significance confirmed by a signed rank permutation test (see Appendix H.1.1). We also observe that the bag-pixel model has poor performance, as a result of losing individual-level covariate information in training by simply aggregating covariates.

Varying number of individuals per bag: N_mean  To study the effect of increasing bag sizes (with larger bag sizes, we expect "disaggregation" to be more difficult), we fix the number of training bags to be 600 with early-stopping and validation sets of 150 bags each, while varying the number of individuals per bag through N_mean and N_std in the negative binomial distribution. To keep the relative scales between N_mean and N_std the same, we take N_std = N_mean/2. The results are shown in Figure 1, focusing on the best performing methods in the previous experiment. Here, we observe that VBAgg models again perform better than the Nyström and NN models, with statistical significance as reported in Appendix H.1.1, and with performance stable as N_mean increases.

Discussion  To gain more insight into the VBAgg model, we look at the calibration of our two different Bayesian models: VBAgg-Exp and VBAgg-Square. We compute their respective posterior quantiles and observe the proportion of times the true λ^a_i lies in these quantiles. We present these in Appendix H.1.1. The calibration plots reveal an interesting difference between using e^v and v² for Ψ(v). While experiments showed that the two models perform similarly in terms of NLL, the calibration of the models is very different: while VBAgg-Square is well calibrated in general, VBAgg-Exp suffers from poor calibration. This is not surprising, as VBAgg-Exp uses an additional lower bound on the model evidence. Thus, uncertainty estimates given by VBAgg-Exp should be treated with care.

4.2 Malaria Incidence Prediction

We now demonstrate the proposed methodology on an important real-life malaria prediction problem for an endemic country from the Malaria Atlas Project database⁷. In this problem, we would like to predict the underlying malaria incidence rate in each 1km by 1km region (referred to as a pixel), while having only observed aggregated incidences of malaria y^a at much larger regional levels, which are treated as bags of pixels. These bags are non-overlapping administrative units, with N_a pixels per bag ranging from 13 to 6,667, and with a total of 1,044,683 pixels. In total, data is available for 957 bags⁸.

⁶The Swiss roll manifold function (for sampling) can be found in the Python scikit-learn package.
⁷Due to confidentiality reasons, we do not report the country or plot the full map of our results.
⁸We consider 576 bags for train and 95 bags each for validation and early-stop, with 191 bags for testing, with different splits across different trials, selecting them to ensure distributions of labels are similar across sets.


Figure 2: Triangles denote the approximate start and end of the river location; crosses denote non-train-set bags. Malaria incidence rate λ^a_i is per 1000 people. Left, Middle: log(λ̂^a_i), with the constant model (Left) and VBAgg-Obj-Sq (tuned on ℒ^s_1) (Middle). Right: Standard deviation of the posterior of v in (9) with VBAgg-Obj-Sq.

Along with these pixels, we also have population estimates p^a_i (per 1000 people) for pixel i in bag a, spatial coordinates given by s^a_i, as well as covariates x^a_i ∈ ℝ¹⁸ collected by remote sensing. Some examples of covariates include accessibility, distance to water, mean land surface temperature and stable night lights. It is clear that rather than expecting the malaria incidence rate to be constant throughout an entire bag (as in Figure 2), we expect the pixel incidence rate to vary, depending on social, economic and environmental factors [32]. Our goal is therefore to build models that can predict malaria incidence rates at a pixel level.

We assume a Poisson model on each individual pixel, i.e. y^a ∼ Poisson(Σ_i p^a_i λ^a_i), where λ^a_i is the underlying pixel incidence rate of malaria per 1000 people that we are interested in predicting. We consider VBAgg, Nyström and NN as prediction models, and for the VBAgg and Nyström methods we use a kernel given as a sum of an ARD (automatic relevance determination) kernel on covariates and a Matérn kernel on spatial locations, learning all kernel parameters (the kernel expression is provided in Appendix G). We use the same kernel for manifold regularisation in the NN model. This kernel choice incorporates spatial information, while allowing feature selection amongst the other covariates. For the choice of landmarks, we ensure landmarks are placed evenly throughout space by using one landmark point per training bag (selected by k-means++). This is so that the uncertainty estimates we obtain are not too sensitive to the choice of landmarks. In this problem, no individual-level labels are available, so we report bag NLL and MSE (on observed incidences) on the test bags in Appendix G over 10 different re-splits of the data. Although we can see that Nyström is the best performing method, its improvement over the VBAgg models is not statistically significant. On the other hand, both VBAgg and Nyström models statistically significantly outperform NN, which also has some instability in its predictions, as discussed in Appendix G.1. However, caution should be exercised when using performance at the bag level as a surrogate for performance at the individual level: in order to perform well at the bag level, one can simply utilise spatial coordinates and ignore other covariates, as malaria intensity appears to vary smoothly between the bags (left of Figure 2). However, we do not expect this to be true at the individual level.

To further investigate this, we consider a particular region and look at the predicted individual malaria incidence rate, with results found in Figure 2 and in Appendix G.1 across 3 different data splits, where the behaviours of each of these models can be observed. While the Nyström and VBAgg methods both provide good bag-level performance, Nyström and VBAgg-Exp can sometimes provide overly-smooth spatial patterns, which does not seem to be the case for the VBAgg-Sq method (recall that VBAgg-Sq performed best in both prediction and calibration for the toy experiments). In particular, VBAgg-Sq consistently predicts higher intensity along rivers (a known factor [31]; indicated by triangles in Figure 2) using only coarse aggregated intensities, demonstrating that prediction of (unobserved) pixel-level intensities is possible using fine-scale environmental covariates, especially ones known to be relevant, such as the Topographic Wetness Index, a measure of wetness; see Appendix G.2 for more details.

In summary, by optimising the lower bound to the marginal likelihood, the proposed variational methods are able to learn useful relations between the covariates and pixel-level intensities, while avoiding the issue of overfitting to spatial coordinates. Furthermore, they also give uncertainty estimates (Figure 2, right), which are essential for problems like these, where validation of predictions is difficult, but they may guide policy and planning.


5 Conclusion

Motivated by the vitally important problem of malaria, which is the direct cause of around 187 million clinical cases [3] and 631,000 deaths [5] each year in sub-Saharan Africa, we have proposed a general framework of aggregated observation models using Gaussian processes, along with scalable variational methods for inference in those models, making them applicable to large datasets. The proposed method allows learning in situations where outputs of interest are available at a much coarser level than that of the inputs, while explicitly quantifying uncertainty of predictions. The recent uptake of digital health information systems offers a wealth of new data which is abstracted to the aggregate or regional levels to preserve patient anonymity. The volume of this data, as well as the availability of much more granular covariates provided by remote sensing and other geospatially tagged data sources, allows us to probabilistically disaggregate outputs of interest for finer risk stratification, e.g. assisting public health agencies to plan the delivery of disease interventions. This task demands new high-performance machine learning methods, and we see those that we have developed here as an important step in this direction.

Acknowledgement

We thank Kaspar Martens for useful discussions, and Dougal Sutherland for providing the code base on which this work was based. HCLL is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). HCLL and KF are supported by JSPS KAKENHI 26280009. EC and KB are supported by OPP1152978, TL by OPP1132415 and the MAP database by OPP1106023. DS is supported in part by the ERC (FP7/617071) and by The Alan Turing Institute (EP/N510129/1). The data were provided by the Malaria Atlas Project supported by the Bill and Melinda Gates Foundation.


References

[1] L. U. Ancarani and G. Gasaneo. Derivatives of any order of the confluent hypergeometric function 1F1(a, b, z) with respect to the parameter a or b. Journal of Mathematical Physics, 49(6):063508, 2008.
[2] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
[3] Samir Bhatt, D. J. Weiss, E. Cameron, D. Bisanzio, B. Mappin, U. Dalrymple, K. E. Battle, C. L. Moyes, A. Henry, P. A. Eckhoff, et al. The effect of malaria control on Plasmodium falciparum in Africa between 2000 and 2015. Nature, 526(7572):207, 2015.
[4] Veronika Cheplygina, David M. J. Tax, and Marco Loog. On classification with bags, groups and sets. Pattern Recognition Letters, 59:11–17, 2015.
[5] Peter W. Gething, Daniel C. Casey, Daniel J. Weiss, Donal Bisanzio, Samir Bhatt, Ewan Cameron, Katherine E. Battle, Ursula Dalrymple, Jennifer Rozier, Puja C. Rao, et al. Mapping Plasmodium falciparum mortality in Africa between 1990 and 2015. New England Journal of Medicine, 375(25):2435–2445, 2016.
[6] Pierre Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, 42(5):535–554, 2010.
[7] Manuel Haußmann, Fred A. Hamprecht, and Melih Kandemir. Variational Bayesian multiple instance learning with Gaussian processes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6570–6579, 2017.
[8] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), 2013.
[9] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360. PMLR, 2015.
[10] Richard Howitt and Arnaud Reynaud. Spatial disaggregation of agricultural production data using maximum entropy. European Review of Agricultural Economics, 30(3):359–387, 2003.
[11] Petr Keil, Jonathan Belmaker, Adam M. Wilson, Philip Unitt, and Walter Jetz. Downscaling of species distribution models: a hierarchical approach. Methods in Ecology and Evolution, 4(1):82–94, 2013.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. ACM, 2015.
[14] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In UAI, pages 332–339, 2005.
[15] H. C. L. Law, C. Yau, and D. Sejdinovic. Testing and learning on distributions with symmetric noise invariance. In NIPS, 2017.
[16] Ho Chung Leon Law, Dougal Sutherland, Dino Sejdinovic, and Seth Flaxman. Bayesian approaches to distribution regression. In International Conference on Artificial Intelligence and Statistics, pages 1167–1176, 2018.
[17] Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pages 1814–1822, 2015.
[18] Vitalik Melnikov and Eyke Hüllermeier. Learning to aggregate using uninorms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 756–771. Springer, 2016.
[19] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.


[20] David R. Musicant, Janara M. Christensen, and Jamie F. Olson. Supervised learning by training on aggregate outputs. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 252–261. IEEE, 2007.
[21] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
[22] Giorgio Patrini, Richard Nock, Tiberio Caetano, and Paul Rivera. (Almost) no label no cry. In NIPS, 2014.
[23] Novi Quadrianto, Alex J. Smola, Tiberio S. Caetano, and Quoc V. Le. Estimating labels from label proportions. JMLR, 10:2349–2374, 2009.
[24] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[25] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.
[26] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. 2006.
[27] Alex J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems, pages 619–625, 2001.
[28] Zoltán Szabó, Bharath K. Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
[29] Yee W. Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1353–1360, 2007.
[30] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574. PMLR, 2009.
[31] D. A. Warrel, T. Cox, J. Firth, and E. Benz Jr. Oxford Textbook of Medicine, 2017.
[32] Daniel J. Weiss, Bonnie Mappin, Ursula Dalrymple, Samir Bhatt, Ewan Cameron, Simon I. Hay, and Peter W. Gething. Re-examining environmental correlates of Plasmodium falciparum malaria endemicity: a data-intensive variable selection approach. Malaria Journal, 14(1):68, 2015.
[33] António Xavier, Maria de Belém Costa Freitas, Maria do Socorro Rosário, and Rui Fragoso. Disaggregating statistical data at the field level: An entropy approach. Spatial Statistics, 23:91–108, 2018.
[34] Felix X. Yu, Krzysztof Choromanski, Sanjiv Kumar, Tony Jebara, and Shih-Fu Chang. On learning from label proportions. arXiv preprint arXiv:1402.5902, 2014.
[35] Felix X. Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, and Shih-Fu Chang. ∝SVM for learning with label proportions. arXiv preprint arXiv:1306.0886, 2013.
[36] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. In NIPS, 2017.


A Aggregated Exponential Family Models

Consider an observation model of the form

$$ p(y \mid \theta) = \exp\left( \frac{y\theta - c(\theta)}{\tau} \right) h(y, \tau), \tag{15} $$

where the response y is one-dimensional, θ is a natural parameter corresponding to the statistic y, τ is a dispersion parameter, and h(y, τ) is a base measure. For simplicity, we will assume that natural parameters corresponding to the other parts of the sufficient statistic are fixed and folded into the base measure. Let η be the corresponding mean parameter, i.e.

$$ \eta = \mathbb{E}_\theta\, y = \int y\, p(y \mid \theta)\, dy, $$

and let θ = F(η) be the link function mapping from the mean to the natural parameters, with G(θ) its inverse. We wish to model the mean parameter η = η(x) using a Gaussian process on a domain 𝒳 together with a function Ψ which transforms the GP value to the mean parameter space, i.e.

$$ \eta(x) = \Psi(f(x)), \qquad f \sim \mathcal{GP}(\mu, k). \tag{16} $$

For example, the mean parameter for some models is restricted to the positive part of the real line, while the GP values cover the whole real line. We will consider the following examples:

• Normal (with fixed variance). F = G = identity, and Ψ can be the identity too, as there are no restrictions on the mean parameter space.

• Poisson. F(η) = log η, G(θ) = e^θ. Ψ should take positive values, so we consider Ψ(v) = e^v or Ψ(v) = v².

• Exponential. p(y|η) = exp(−y/η)/η, with F(η) = −1/η and G(θ) = −1/θ. Ψ should take positive values, so we consider Ψ(v) = e^v or Ψ(v) = v².

Note that the link function F is concave for all the examples above.

A.1 Bag model

We will consider the aggregation in the mean parameter space. Namely, let y^1, ..., y^n be n independent aggregate responses for each of the n bags of covariates x^a = {x^a_1, ..., x^a_{N_a}}, a = 1, ..., n. We assume the following aggregation model:

$$ y^a \sim p(y \mid \eta^a), \qquad \eta^a = \sum_{i=1}^{N_a} w_i^a \eta_i^a = \sum_{i=1}^{N_a} w_i^a\, \Psi(f(x_i^a)), \qquad a = 1, \ldots, n, \tag{17} $$

where w^a_i are fixed weights to adjust the scales among the individuals and the bag (e.g., adjusting for population size).

We can also model individual (unobserved) variables y^a_i (i = 1, ..., N_a), which follow

$$ y_i^a \sim p(y \mid \eta_i^a), \qquad \eta_i^a = \Psi(f(x_i^a)), \qquad i = 1, \ldots, N_a, \quad a = 1, \ldots, n. \tag{18} $$

Note that we consider aggregation in the mean parameters of responses, not in the responses themselves. If we consider a case where underlying individual responses y^a_i aggregate to y^a as a weighted sum, the form of the bag likelihood and the individual likelihood would be different unless we restrict attention to distribution families which are closed under both scaling and convolution. However, when aggregation occurs in the mean parameter space, the form of the bag likelihood and the individual likelihood is always the same. This corresponds to the following measurement process:

• Each individual has a mean parameter η^a_i: if it were possible to sample a response for that particular individual, we would obtain a sample y^a_i ∼ p(·|η^a_i).

• However, we cannot sample the individual and we can only observe a bag response. In that case, only a single bag response is taken and it depends on all individuals simultaneously. Each individual contributes in terms of an increase in the mean bag response, but this measurement process is different from the two-stage procedure by which we aggregate individual responses.


A.2 Marginal likelihood and ELBO

Let Y = (y^1, ..., y^n) be the bag observations. With the inducing variables u = f(W), the marginal likelihood is

$$ p(Y) = \int\!\!\int \prod_{a=1}^{n} p(y^a \mid \eta^a)\, p(f \mid u)\, p(u)\, du\, df. \tag{19} $$

The evidence lower bound can be derived as

$$ \begin{aligned} \log p(Y) &= \log \int\!\!\int \left\{ \prod_{a=1}^{n} p(y^a \mid \eta^a)\, \frac{p(u)}{q(u)} \right\} p(f \mid u)\, q(u)\, du\, df \\ &\geq \int\!\!\int \log\left\{ \prod_{a=1}^{n} p(y^a \mid \eta^a)\, \frac{p(u)}{q(u)} \right\} p(f \mid u)\, q(u)\, du\, df \\ &= \sum_{a=1}^{n} \left[ \frac{y^a}{\tau} \int F\!\left( \sum_i w_i^a \Psi(f(x_i^a)) \right) q(f)\, df - \frac{1}{\tau} \int c\!\left( F\!\left( \sum_i w_i^a \Psi(f(x_i^a)) \right) \right) q(f)\, df \right] - \int q(u) \log \frac{q(u)}{p(u)}\, du, \end{aligned} \tag{20} $$

where q(f) = ∫ p(f|u) q(u) du.

By setting the variational distribution q(u) to be Gaussian, the third term is tractable. The first and second terms are, however, tractable only in limited cases. The cases we develop are the Poisson bag model, described in the main text, as well as the normal bag model and the exponential bag model, described below.

A.3 Normal bag model

F is the identity and c(θ) = θ²/2, which makes both the first and second terms tractable with the choice Ψ(v) = v. Moreover, the viewpoints of aggregating in the mean parameters and in the individual responses are equivalent for this model, and we can also allow different variance parameters for different bags (and individuals).

Consider a bag a of items {x^a_i}_{i=1}^{N_a}. Each item x^a_i is assumed to have a weight w^a_i. At the individual level, we model the (unobserved) responses y^a_i as

$$ y_i^a \mid x_i^a \sim \mathcal{N}\left( w_i^a \mu_i^a,\; (w_i^a)^2 \tau_i^a \right), \tag{21} $$

where μ^a_i = μ(x^a_i); thus μ^a_i is a mean parameter per unit weight corresponding to the item x^a_i and is assumed to be a function of x^a_i. Similarly, τ^a_i is a variance parameter per unit weight. At the bag level, we consider the following model for the observed aggregate response y^a, assuming conditional independence of individual responses given covariates x^a = {x^a_1, ..., x^a_{N_a}}:

$$ y^a = \sum_{i=1}^{N_a} y_i^a, \quad \text{i.e.} \quad y^a \mid \mathbf{x}^a \sim \mathcal{N}\left( w^a \mu^a, (w^a)^2 \tau^a \right), \qquad \mu^a = \sum_{i=1}^{N_a} \frac{w_i^a}{w^a}\, \mu_i^a, \qquad \tau^a = \frac{\sum_{i=1}^{N_a} (w_i^a)^2 \tau_i^a}{(w^a)^2}, \tag{22} $$

where μ^a and τ^a are the mean and variance parameters per unit weight of the whole bag a, and w^a = Σ_{i=1}^{N_a} w^a_i is the total weight of bag a. Although we could take τ^a_i to also be a function of the covariates, here for simplicity we take τ^a_i = τ^a to be constant per bag (note the abuse of notation). We can now compute the negative log-likelihood (NLL) across bags (assuming conditional independence given the x^a):

$$ \ell_0 = -\log\left[ \prod_{a=1}^{n} p(y^a \mid \mathbf{x}^a) \right] = \frac{1}{2} \sum_{a=1}^{n} \left[ \log\left( 2\pi \tau^a \sum_{i=1}^{N_a} (w_i^a)^2 \right) + \frac{\left( y^a - \sum_{i=1}^{N_a} w_i^a \mu_i^a \right)^2}{\sum_{i=1}^{N_a} (w_i^a)^2\, \tau^a} \right], \tag{23} $$

where μ^a_i = f(x^a_i) is the function we are interested in, and the τ^a are the variance parameters to be learnt.


We can now consider the lower bound to the marginal likelihood as below (assuming w^a_i = 1 here to simplify notation; the analogous expression with non-uniform weights is straightforward):

$$ \begin{aligned} \log p(y \mid \Theta) &= \log \int\!\!\int p(y, f, u \mid X, W, \Theta)\, df\, du \\ &= \log \int\!\!\int \left( \prod_{a=1}^{n} \frac{1}{\sqrt{2\pi N_a \tau^a}} \exp\left( -\frac{\left( y^a - \sum_{i=1}^{N_a} f(x_i^a) \right)^2}{2 N_a \tau^a} \right) \right) \frac{p(u \mid W)}{q(u)}\, p(f \mid u)\, q(u)\, df\, du \\ &\geq \int\!\!\int \log\left\{ \prod_{a=1}^{n} \frac{1}{\sqrt{2\pi N_a \tau^a}} \exp\left( -\frac{\left( y^a - \sum_{i=1}^{N_a} f(x_i^a) \right)^2}{2 N_a \tau^a} \right) \frac{p(u \mid W)}{q(u)} \right\} p(f \mid u)\, q(u)\, df\, du \\ &= -\frac{1}{2} \sum_a \int\!\!\int \frac{(y^a)^2 - 2 y^a \sum_{i=1}^{N_a} f(x_i^a) + \left( \sum_{i=1}^{N_a} f(x_i^a) \right)^2}{N_a \tau^a}\, p(f \mid u)\, q(u)\, df\, du \\ &\quad - \frac{1}{2} \sum_a \log(2\pi N_a \tau^a) - \int q(u) \log \frac{q(u)}{p(u \mid W)}\, du. \end{aligned} \tag{24} $$

Using again a Gaussian distribution for q(u), we have q(f) = ∫ p(f|u) q(u) du, which is a normal distribution; let q_a(f^a) be its marginal normal distribution of f^a = (f(x^a_1), ..., f(x^a_{N_a})), with mean and covariance given by m^a and S^a as before in (9).

Then all expectations with respect to q(f) are tractable and the ELBO is simply

$$ \mathcal{L}(q, \Theta) = -\frac{1}{2} \sum_{a=1}^{n} \frac{(y^a)^2 - 2 y^a \mathbf{1}^\top m^a + \mathbf{1}^\top \left( S^a + m^a (m^a)^\top \right) \mathbf{1}}{N_a \tau^a} - \frac{1}{2} \sum_a \log(2\pi N_a \tau^a) - \mathrm{KL}(q(u)\,\|\,p(u \mid W)). \tag{25} $$
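For reference, (25) is straightforward to evaluate; a minimal sketch with the KL term supplied separately and unit weights, using our own names:

```python
import numpy as np

def elbo_normal(y, m_list, S_list, tau, kl_term):
    """Closed-form ELBO (25) for the normal bag model with unit weights w^a_i = 1.

    y       : (n,) aggregated bag responses
    m_list  : list of (N_a,) marginal means m^a
    S_list  : list of (N_a, N_a) marginal covariances S^a
    tau     : (n,) per-bag variance parameters tau^a
    kl_term : KL(q(u) || p(u|W)), computed separately
    """
    val = 0.0
    for ya, ma, Sa, ta in zip(y, m_list, S_list, tau):
        Na = len(ma)
        quad = ya ** 2 - 2.0 * ya * ma.sum() + np.sum(Sa) + ma.sum() ** 2
        val += -0.5 * quad / (Na * ta) - 0.5 * np.log(2.0 * np.pi * Na * ta)
    return val - kl_term
```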

A.4 Exponential bag model

In this case, we have F(η) = −1/η. We can apply a similar argument as in Lemma 1. For any α_i > 0 with Σ_i α_i = 1, by the concavity of F,

$$ \begin{aligned} \int F\left( \sum_i w_i \Psi(v_i) \right) q(v)\, dv &= \int F\left( \sum_i \alpha_i \frac{w_i}{\alpha_i} \Psi(v_i) \right) q(v)\, dv \\ &\geq \int \sum_i \alpha_i\, F\!\left( \frac{w_i}{\alpha_i} \Psi(v_i) \right) q(v)\, dv \\ &= \sum_i \alpha_i \int F\!\left( \frac{w_i}{\alpha_i} \Psi(v_i) \right) q_i(v_i)\, dv_i. \end{aligned} $$

For F(η) = −1/η, the last line is equal to

$$ -\sum_i \frac{\alpha_i^2}{w_i} \int \frac{1}{\Psi(v_i)}\, q_i(v_i)\, dv_i. $$

When using a normal q, this is tractable for several choices of Ψ, including e^v and v². If we let ξ_i := ∫ (1/Ψ(v_i)) q_i(v_i) dv_i and maximize

$$ -\sum_i \frac{\alpha_i^2\, \xi_i}{w_i} $$

under the constraint Σ_i α_i = 1, we obtain

$$ \alpha_i = \frac{w_i/\xi_i}{\sum_\ell (w_\ell/\xi_\ell)}. $$

Finally, we have the lower bound

$$ \int F\left( \sum_i w_i \Psi(v_i) \right) q(v)\, dv \;\geq\; -\frac{\sum_i (w_i/\xi_i)}{\left( \sum_i (w_i/\xi_i) \right)^2} = -\frac{1}{\sum_i (w_i/\xi_i)}, \tag{26} $$

where

$$ \xi_i = \int \frac{1}{\Psi(v_i)}\, q_i(v_i)\, dv_i, $$

which is tractable for a Gaussian variational family. Also, with an explicit form of Ψ, it is easy to take the derivatives of the resulting lower bound with respect to the variational parameters in q(v).
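For instance, with Ψ(v) = e^v and Gaussian marginals q_i = N(m_i, s_i²), each ξ_i = E[exp(−v_i)] = exp(−m_i + s_i²/2) is available in closed form, so the bound (26) is a one-liner; a hedged sketch:

```python
import numpy as np

def exponential_bag_bound(m, s2, w):
    """Lower bound (26) on E[ F(sum_i w_i Psi(v_i)) ] with F(eta) = -1/eta, Psi(v) = exp(v)."""
    xi = np.exp(-m + 0.5 * s2)     # xi_i = E[1 / Psi(v_i)] = E[exp(-v_i)] under N(m_i, s2_i)
    s = np.sum(w / xi)
    return -1.0 / s                # equals -sum_i(w_i/xi_i) / (sum_i(w_i/xi_i))^2
```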

B Alternative approaches

Constant  For the Poisson model, we can take λ^a_i = λ^a_c, a constant rate across the bag; then

$$ \hat{\lambda}_c^a = \frac{y^a}{p^a}, $$

hence the individual-level predictive distribution is of the form y^a_i ∼ Poisson(λ̂^a_c), and for an unseen bag r, λ̂^bag_c = (Σ_{a=1}^n y^a) / (Σ_{a=1}^n p^a), with predictive distribution given by y^r ∼ Poisson(p^r λ̂^bag_c).

bag-pixel: Bag as Individual  Another baseline is to train a model from the weighted average of the covariates, given by x̄^a = Σ_{i=1}^{N_a} (p^a_i / p^a) x^a_i in the Poisson case, and x̄^a = Σ_{i=1}^{N_a} (w^a_i / w^a) x^a_i in the normal case. The purpose of this baseline is to demonstrate that modelling at the individual level is important during training. Since we now have labels and covariates at the bag level, we can consider the following model:

$$ y^a \mid \mathbf{x}^a \sim \mathrm{Poisson}\left( p^a \lambda(\bar{x}^a) \right), \qquad \lambda(\bar{x}^a) = \Psi(f(\bar{x}^a)) $$

for the Poisson model. For the normal model, we have

$$ y^a \mid \mathbf{x}^a \sim \mathcal{N}\left( w^a \mu(\bar{x}^a), (w^a)^2 \tau \right), $$

where μ(x̄^a) = f(x̄^a) and τ is a parameter to be learnt (assumed constant across bags). Now we observe that these models are identical to the individual model, except for a difference in indexing. Hence, after learning the function f at the bag level, we can transfer the model to the individual level. Essentially, here we have created fake individual-level instances by aggregation of the individual covariates inside a bag.

Nyström: Bayesian MAP for Poisson regression on explicit feature maps  Instead of the posterior based on the model (6), we can also consider an explicit feature map in order to directly construct a MAP estimator. While this method does not provide posterior uncertainty over λ^a_i, it does provide an interesting connection to the settings we have considered and also to manifold-regularized neural networks, as discussed below. Let K_{zz} be the covariance function defined on covariates {z_1, ..., z_n}, and consider its low-rank approximation K_{zz} ≈ k_{zW} K_{WW}^{-1} k_{Wz} with landmark points W = {w_ℓ}_{ℓ=1}^m and k_{zW} = (k(z, w_1), ..., k(z, w_m))^⊤. By using landmark points W, we have avoided computation of the full kernel matrix, reducing computational complexity. Under this setup, we have that K_{zz} ≈ Φ_z Φ_z^⊤, with Φ_z = k_{zW} K_{WW}^{-1/2} being the explicit (Nyström) feature map. Using this explicit feature map Φ, we have the following model:

$$ f_i^a = \phi_i^a \beta, \qquad \beta \sim \mathcal{N}(0, \gamma^2 I), $$

$$ y^a \mid \mathbf{x}^a \sim \mathrm{Poisson}\left( \sum_{i=1}^{N_a} p_i^a \lambda(x_i^a) \right), \qquad \lambda(x_i^a) = \Psi(f_i^a), $$

where γ is a prior parameter and φ^a_i is the corresponding i-th row of Φ_{x^a}. We can then consider a MAP estimator of the model coefficients β:

$$ \hat{\beta} = \operatorname*{argmax}_\beta\; \log\left[ \prod_{a=1}^{n} p(y^a \mid \beta, \mathbf{x}^a) \right] + \log p(\beta). \tag{27} $$

This essentially recovers the same model as in (3), with the standard ℓ2 loss regularising the complexity of the function. This model can be thought of in several different ways, for example as a weight-space view of the GP (see [26] for an overview), or as a MAP estimate of the Subset of Regressors (SoR) approximation [27] of the GP when σ = 1. Additionally, we may include a manifold regulariser as part of the prior; see the discussion of the neural network below.
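A short sketch of the Nyström feature construction Φ_z = k_zW K_WW^{−1/2} (with a small jitter added for numerical stability; names are ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def nystrom_features(X, W, k, jitter=1e-6):
    """Explicit feature map Phi = k(X, W) K_WW^{-1/2}, so that Phi @ Phi.T approximates k(X, X)."""
    Kww = k(W, W) + jitter * np.eye(len(W))
    Kxw = k(X, W)
    return Kxw @ np.linalg.inv(np.real(sqrtm(Kww)))
```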


NN: Manifold-regularized neural networks  The next approach we consider is a parametric model for $f$, as in [13], where we search for the parameters that minimize the negative log-likelihood $\ell_0$ across bags. Here, $f$ is a neural network with parameters $\theta$, and back-propagation is used to learn $\theta$ and hence the individual-level model $f$. However, since we only have aggregated observations at the bag level, but abundant individual-level covariate information, it is useful to also incorporate this information by enforcing smoothness on the data manifold given by the unlabelled data. To do this, following [13] and [22], we pursue a semi-supervised view of the problem and include an additional manifold regularisation term [2] (rescaled by $N_{\text{total}}^2$ during implementation):
$$\ell_1 = \sum_{u=1}^{N_{\text{total}}} \sum_{w=1}^{N_{\text{total}}} (f_u - f_w)^2\, k_L(x_u, x_w) = f^\top L f \qquad (28)$$

where we have suppressed the bag index, $N_{\text{total}}$ represents the total number of individuals, $k_L(\cdot,\cdot)$ is some user-specified kernel9, $f = [f_1, \ldots, f_{N_{\text{total}}}]^\top$, and $L$ is the Laplacian defined as $L = \operatorname{diag}(K_L \mathbf{1}) - K_L$, where $\mathbf{1}$ is the all-ones vector and $K_L$ is the corresponding kernel matrix. Although this term involves the calculation of a kernel matrix across individuals, in practice we use stochastic gradient descent (SGD) together with random Fourier features [25] or a Nyström approximation (see Appendix C), with a scale parameter $\lambda_1$ controlling the strength of the regularisation. Similarly, one can also consider manifold regularisation at the bag level, if bag-level covariates/embeddings are available; for further details, see Appendix D.

In fact, the same regularisation can be applied to the MAP estimation with the explicit feature maps. This is equivalent to a data-dependent prior $\beta \sim \mathcal{N}\big(0,\, \sigma^2 I + (\lambda_1 \Phi^\top L \Phi)^{-1}\big)$ that incorporates the structure of the manifold10.

For implementation, we consider a neural network with one hidden layer and an output layer, for a fair comparison to the Nyström approach. As the activation function, we use the Rectified Linear Unit (ReLU).

MAP estimation of GP  We introduce $p(f, u) = p(f \mid u)\, p(u \mid W)$ and consider the posterior $p(f, u \mid y, X, W, \theta)$, where the conditional distribution $f \mid u$ is given by:

$$f \mid u \sim \mathcal{GP}(\tilde{\mu}_u, \tilde{K}), \qquad (29)$$
$$\tilde{\mu}(z) = \mu_z + k_{zW} K_{WW}^{-1}(u - \mu_W), \qquad \tilde{K}(z, z') = k(z, z') - k_{zW} K_{WW}^{-1} k_{Wz'},$$
where $k_{zW} = (k(z, w_1), \ldots, k(z, w_m))^\top$. Using Bayes' rule, we obtain:

$$\begin{aligned}
\log p(f, u \mid y, X, W) &= \log\big[p(y \mid f, u)\, p(f, u \mid X, W)\big] + \mathrm{const} \\
&= \log\big[p(y \mid f)\, p(f \mid u, X)\, p(u \mid W)\big] + \mathrm{const} \\
&= \sum_{a=1}^{n} y^a \log(p^a \lambda^a) - \sum_{a=1}^{n} p^a \lambda^a - \sum_{a=1}^{n} \log(y^a!) + \log p(f \mid u, X) + \log p(u \mid W) + \mathrm{const}
\end{aligned}$$

where $p(f \mid u, X) = \mathcal{N}(\tilde{\mu}_u, \tilde{K})$ as given above, and $p(u \mid W) = \mathcal{N}(\mu_W, \Sigma_{WW})$, i.e.
$$\log p(f \mid u, X) + \log p(u \mid W) = -\frac{1}{2}\Big(\log\big(|\tilde{K}|\,|\Sigma_{WW}|\big) + (f - \tilde{\mu}_u)^\top \tilde{K}^{-1}(f - \tilde{\mu}_u) + (u - \mu_W)^\top \Sigma_{WW}^{-1}(u - \mu_W)\Big) + \mathrm{const.} \qquad (30)$$
Here we cannot perform SGD, as the latter terms do not decompose into a sum over the data. More importantly, this requires the computation of $\tilde{K}$, which contains the full kernel matrix $K$ even after the use of landmarks. This direct approach is not feasible for the large number of individuals present in our target application, and hence we do not pursue this method, considering the Nyström and NN approaches as baselines instead.

C Random Fourier Features on Laplacian

Here we discuss using random Fourier features [25] to reduce the computational cost of calculating the Laplacian defined as $L = \operatorname{diag}(K\mathbf{1}) - K$, where $\mathbf{1}$ is the all-ones vector and $K$ is the corresponding kernel matrix.

9In practice, this does not have to be a positive semi-definite kernel; it can be derived from any notion of similarity between observations, including k-nearest neighbours.

10In order to guarantee positive definiteness of the Laplacian, one can add $\epsilon I$, where $\epsilon > 0$.


Suppose the kernel is stationary, i.e. $k_w(x - y) = k(x, y)$ (examples include the Gaussian and Matérn kernels). Then, using random Fourier features, we obtain $K \approx \Phi\Phi^\top$, where $\Phi \in \mathbb{R}^{b_N \times m}$, $b_N$ denotes the total number of individuals in the batch and $m$ denotes the number of frequencies. Now we have:
$$f^\top L f \approx f^\top \operatorname{diag}(\Phi\Phi^\top \mathbf{1}) f - f^\top \Phi\Phi^\top f = f^\top \operatorname{diag}(\Phi\Phi^\top \mathbf{1}) f - \|\Phi^\top f\|_2^2 \qquad (31)$$
In both terms, we can avoid computing the kernel matrix by carefully choosing the order of computation. Another option is the Nyström approximation with landmark points $\{z_1, \ldots, z_m\}$, giving $K \approx K_{nm} K_{mm}^{-1} K_{mn}$, where $K_{mm}$ denotes the kernel matrix on the landmark points and $K_{nm}$ the kernel matrix between the data and the landmarks. In this case $\Phi = K_{nm} K_{mm}^{-\frac{1}{2}}$.
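To make the order of computation explicit, here is a minimal NumPy sketch (illustrative only, assuming an RBF kernel; the paper's implementation is in TensorFlow) of evaluating $f^\top L f$ from random Fourier features without ever forming the $n \times n$ kernel matrix:

```python
import numpy as np

def rff_features(X, num_freqs, lengthscale, seed=0):
    """Random Fourier features for an RBF kernel k(x, y) = exp(-||x - y||^2 / (2 l^2)),
    so that K ≈ Phi Phi^T with Phi of shape (n, num_freqs)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    freqs = rng.normal(scale=1.0 / lengthscale, size=(d, num_freqs))  # spectral frequencies
    phases = rng.uniform(0.0, 2.0 * np.pi, size=num_freqs)            # random phases
    return np.sqrt(2.0 / num_freqs) * np.cos(X @ freqs + phases)

def laplacian_quadratic(f, Phi):
    """Approximates f^T L f with L = diag(K 1) - K, never instantiating K."""
    degrees = Phi @ (Phi.T @ np.ones(Phi.shape[0]))      # ≈ K 1, cost O(n m)
    return f @ (degrees * f) - np.sum((Phi.T @ f) ** 2)  # f^T diag(K1) f - ||Phi^T f||^2
```

Both matrix-vector products cost $\mathcal{O}(nm)$, so the full kernel matrix is never computed.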

D Bag Manifold regularisation

Suppose we have bag covariates $s^a$ (note these are for the entire bag), and also some summary statistics of a bag, e.g. mean embeddings [19] given by $H^a = \frac{1}{N^a}\sum_{i=1}^{N^a} h(x^a_i)$ for some user-defined $h$. Then, similarly to the individual-level manifold regularisation, we can consider manifold regularisation at the bag level (assuming a separable kernel for simplicity), i.e.

$$\ell_2 = \sum_{l=1}^n \sum_{m=1}^n (F^l - F^m)^2\, k_s(s^l, s^m)\, k_h(H^l, H^m) = F^\top L_{\text{bag}} F \qquad (32)$$
where $F^a = \frac{1}{N^a}\sum_{i=1}^{N^a} f^a_i$, $k_s$ is a kernel on the bag covariates $s^a$, $k_h$ is a kernel on $H^a$, $L_{\text{bag}}$ is the bag-level Laplacian with the corresponding kernel, and $F = [F^1, \ldots, F^n]^\top$. Combining all these terms, we have the following loss function to minimise:

$$\ell = \frac{1}{b}\ell_0 + \frac{\lambda_1}{b_N^2}\ell_1 + \frac{\lambda_2}{b_N^2}\ell_2 \qquad (33)$$
where $b$ is the mini-batch size in SGD, $b_N$ is the total number of individuals in each mini-batch, and $\lambda_1$ and $\lambda_2$ are parameters controlling the strength of the respective regularisation terms.
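As an illustration of how the terms in (33) are combined on a mini-batch (a schematic sketch with hypothetical names; the actual training code is in TensorFlow):

```python
def minibatch_loss(ell0, ell1, ell2, b, b_N, lam1, lam2):
    """Combine the bag NLL with the two manifold regularisers as in (33).

    ell0: summed negative log-likelihood over the b bags in the mini-batch.
    ell1: individual-level manifold penalty computed on the batch.
    ell2: bag-level manifold penalty computed on the batch.
    """
    return ell0 / b + lam1 * ell1 / b_N**2 + lam2 * ell2 / b_N**2
```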

E Additional details for Poisson variational derivation

E.1 Log-sum lemma

Lemma 2. Let $v = [v_1, \ldots, v_N]^\top$ be a random vector with probability density $q(v)$, and let $w_i \geq 0$, $i = 1, \ldots, N$. Then, for any non-negative valued function $\Psi(v)$,
$$\int \log\Big(\sum_{i=1}^N w_i \Psi(v_i)\Big)\, q(v)\, dv \;\geq\; \log\Big(\sum_{i=1}^N w_i e^{\xi_i}\Big),$$
where
$$\xi_i := \int \log \Psi(v_i)\, q_i(v_i)\, dv_i.$$

Proof. Let $\alpha_1, \ldots, \alpha_N$ be non-negative numbers with $\sum_{i=1}^N \alpha_i = 1$. It follows from Jensen's inequality that
$$\int \log\Big(\sum_{i=1}^N w_i \Psi(v_i)\Big) q(v)\, dv = \int \log\Big(\sum_{i=1}^N \alpha_i \frac{w_i}{\alpha_i}\Psi(v_i)\Big) q(v)\, dv \;\geq\; \sum_{i=1}^N \alpha_i \Big[\int \log\big(\Psi(v_i)\big)\, q_i(v_i)\, dv_i + \log\frac{w_i}{\alpha_i}\Big] = \sum_{i=1}^N \alpha_i \xi_i + \sum_{i=1}^N \alpha_i \log\frac{w_i}{\alpha_i}. \qquad (34)$$


By the Lagrange multiplier method, maximizing the last line with respect to $\alpha$ gives
$$\alpha_i = \frac{w_i e^{\xi_i}}{\sum_{j=1}^N w_j e^{\xi_j}}.$$
Plugging this into (34) completes the proof.
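For numerical work, the resulting bound $\log\big(\sum_i w_i e^{\xi_i}\big)$ is conveniently evaluated in log-space; a small illustrative NumPy/SciPy sketch:

```python
import numpy as np
from scipy.special import logsumexp

def lemma2_bound(w, xi):
    """Numerically stable evaluation of log(sum_i w_i * exp(xi_i))."""
    return logsumexp(np.log(w) + xi)
```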

E.2 A lower bound of the marginal likelihood for $\Psi(f) = e^f$ and $\Psi(f) = f^2$

Using Lemma 2, we obtain that
$$\int \log\Big(\sum_{i=1}^N p^a_i \Psi(v^a_i)\Big)\, q(v^a)\, dv^a \;\geq\; \log\Big(\sum_{i=1}^N p^a_i e^{\xi^a_i}\Big), \qquad (35)$$
where
$$\xi^a_i = \int \log \Psi(v^a_i)\, q^a_i(v^a_i)\, dv^a_i.$$

The above lower bound is tractable for the popular choices $\Psi(v) = v^2$ and $\Psi(v) = e^v$ under normal variational distributions $q^a(v^a) = \mathcal{N}(m^a, S^a)$. In particular,
$$\Psi(v) = e^v: \quad \xi^a_i = \int v^a_i\, q^a_i(v^a_i)\, dv^a_i = m^a_i,$$
$$\Psi(v) = v^2: \quad \xi^a_i = \int \log\big((v^a_i)^2\big)\, q^a_i(v^a_i)\, dv^a_i = -G\Big(-\frac{(m^a_i)^2}{2 S^a_{ii}}\Big) + \log\Big(\frac{S^a_{ii}}{2}\Big) - \gamma,$$
where $\gamma$ is the Euler constant and
$$G(t) = 2t \sum_{j=0}^\infty \frac{j!}{(2)_j\, (3/2)_j}\, t^j$$
is the partial derivative of the confluent hypergeometric function [17, 1]. However, in this work we focus on the Taylor series approximation for $\Psi(v) = v^2$, as implementing the above bound requires a large look-up table and linear interpolation. Furthermore, the experiments suggest that the secondary lower bound proposed above in Lemma 2 can lead to poor calibration; for more details, refer to Section 4.
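A small NumPy sketch of these two quantities (illustrative only; it assumes the argument of $G$ is $-(m^a_i)^2/(2S^a_{ii})$ as written above, and evaluates $G$ by a truncated series, which the paper avoids in favour of a look-up table and, ultimately, the Taylor approximation of Appendix E.4):

```python
import numpy as np

def G(t, n_terms=200):
    """Truncated series G(t) = 2t * sum_{j>=0} j! / ((2)_j (3/2)_j) t^j.
    For large |t| the alternating terms suffer from cancellation."""
    total, term = 0.0, 1.0                               # j = 0 term of the sum is 1
    for j in range(n_terms):
        total += term
        term *= (j + 1) * t / ((2 + j) * (1.5 + j))      # ratio of consecutive terms
    return 2.0 * t * total

EULER_GAMMA = 0.5772156649015329

def xi_exp(m):
    """xi for Psi(v) = exp(v): simply E[v] = m."""
    return m

def xi_square(m, s_ii):
    """xi for Psi(v) = v^2: E[log v^2] for v ~ N(m, s_ii); at m = 0 this is log(s_ii/2) - gamma."""
    return -G(-m**2 / (2.0 * s_ii)) + np.log(s_ii / 2.0) - EULER_GAMMA
```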

E.3 KL Term

Since $q(u)$ and $p(u \mid W)$ are both normal distributions, the KL divergence is tractable:
$$\mathrm{KL}\big(q(u)\,\|\,p(u \mid W)\big) = \frac{1}{2}\Big\{\mathrm{Tr}\big[K_{WW}^{-1}\Sigma_u\big] + \log\frac{|K_{WW}|}{|\Sigma_u|} - m + (\mu_W - \eta_u)^\top K_{WW}^{-1}(\mu_W - \eta_u)\Big\} \qquad (36)$$
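A direct NumPy/SciPy sketch of (36), assuming $q(u) = \mathcal{N}(\eta_u, \Sigma_u)$ and $p(u \mid W) = \mathcal{N}(\mu_W, K_{WW})$ (illustrative only, not the paper's TensorFlow implementation; Cholesky factorisations are used for numerical stability):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kl_gaussians(eta_u, Sigma_u, mu_W, K_WW):
    """KL( N(eta_u, Sigma_u) || N(mu_W, K_WW) ) as in (36)."""
    m = len(eta_u)
    cK = cho_factor(K_WW, lower=True)
    trace_term = np.trace(cho_solve(cK, Sigma_u))            # Tr[K_WW^{-1} Sigma_u]
    diff = mu_W - eta_u
    quad_term = diff @ cho_solve(cK, diff)                    # (mu_W - eta_u)^T K_WW^{-1} (mu_W - eta_u)
    logdet_K = 2.0 * np.sum(np.log(np.diag(cK[0])))           # log |K_WW|
    _, logdet_S = np.linalg.slogdet(Sigma_u)                  # log |Sigma_u|
    return 0.5 * (trace_term + logdet_K - logdet_S - m + quad_term)
```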

E.4 Taylor series approximation in the variational method

We consider the integral
$$\int \log\Big(\sum_{i=1}^N p^a_i (v^a_i)^2\Big)\, q^a(v^a)\, dv^a,$$
where $q^a = \mathcal{N}(m^a, S^a)$. We note that this can be written as $\mathbb{E}\log\|V^a\|^2$, where $V^a \sim \mathcal{N}(\tilde{m}^a, \tilde{S}^a)$, with $P^a = \operatorname{diag}(p^a_1, \ldots, p^a_{N^a})$, $\tilde{m}^a = (P^a)^{1/2} m^a$ and $\tilde{S}^a = (P^a)^{1/2} S^a (P^a)^{1/2}$. Note that $\|V^a\|^2$ follows a non-central chi-squared distribution. We now resort to a Taylor series approximation for


$\mathbb{E}\log\|V^a\|^2$ (similar to [29]) around $\mathbb{E}\|V^a\|^2 = \|\tilde{m}^a\|^2 + \operatorname{tr}\tilde{S}^a$, resulting in
$$\begin{aligned}
\mathbb{E}\log\big(\|V^a\|^2\big) &= \log\big(\mathbb{E}\|V^a\|^2\big) + \mathbb{E}\,\frac{\|V^a\|^2 - \mathbb{E}\|V^a\|^2}{\mathbb{E}\|V^a\|^2} - \mathbb{E}\,\frac{\big(\|V^a\|^2 - \mathbb{E}\|V^a\|^2\big)^2}{2\big(\mathbb{E}\|V^a\|^2\big)^2} + \mathcal{O}\Big(\big(\|V^a\|^2 - \mathbb{E}\|V^a\|^2\big)^3\Big) \\
&\approx \log\big(\|\tilde{m}^a\|^2 + \operatorname{tr}\tilde{S}^a\big) - \frac{2\,\tilde{m}^{a\top}\tilde{S}^a\tilde{m}^a + \operatorname{tr}\big((\tilde{S}^a)^2\big)}{\big(\|\tilde{m}^a\|^2 + \operatorname{tr}\tilde{S}^a\big)^2}.
\end{aligned}$$
As noted in [29], this approximation is very accurate when $\mathbb{E}\|V^a\|^2$ is large, but the caveat is that the Taylor series converges only for $\|V\|^2 \in (0,\, 2\,\mathbb{E}\|V\|^2)$, so this approach effectively ignores the tail of the non-central chi-squared distribution.
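The resulting approximation is straightforward to evaluate; a minimal NumPy sketch (illustrative names, not the released code):

```python
import numpy as np

def approx_E_log_norm_sq(m_a, S_a, p_a):
    """Second-order Taylor approximation of E[log ||V||^2], V ~ N(m_tilde, S_tilde),
    where m_tilde = P^{1/2} m and S_tilde = P^{1/2} S P^{1/2} with P = diag(p_a)."""
    sqrt_p = np.sqrt(p_a)
    m_t = sqrt_p * m_a                               # P^{1/2} m
    S_t = (sqrt_p[:, None] * S_a) * sqrt_p[None, :]  # P^{1/2} S P^{1/2}
    mean_sq = m_t @ m_t + np.trace(S_t)              # E ||V||^2
    correction = (2.0 * m_t @ S_t @ m_t + np.trace(S_t @ S_t)) / mean_sq**2
    return np.log(mean_sq) - correction
```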

F Code

All of our models were implemented in TensorFlow; the code will be published and made available for use.

G Additional Malaria Experimental Results

Here we provide additional experimental results for the malaria dataset. In Table 1, we report bag-level performance in terms of NLL and MSE over 10 different test sets (after re-running the experiments with the data re-split across train, early-stopping, validation and test sets). Statistical significance was not established for the best-performing Nyström method versus the VBAgg methods, as shown in Table 2. We further provide additional prediction/uncertainty patches for 3 different splits to highlight the general behaviour of the trained models, with further explanation and details below.

It is also noted that in all cases $\lambda^a_i$ is the incidence rate per 1000 people. For VBAgg and Nyström, we use an additive kernel, combining an ARD kernel and a Matérn-3/2 kernel:

$$k\big((x, s_x), (y, s_y)\big) = \gamma_1 \exp\Big(-\frac{1}{2}\sum_{k=1}^{18} \frac{1}{\ell_k}(x_k - y_k)^2\Big) + \gamma_2\Big(1 + \frac{\sqrt{3}\,\|s_x - s_y\|_2}{\rho}\Big)\exp\Big(-\frac{\sqrt{3}\,\|s_x - s_y\|_2}{\rho}\Big) \qquad (37)$$

where $x, y$ are covariates, and $s_x, s_y$ are their respective spatial locations. All scale parameters and weights are learnt during training. For the NN, we also use this kernel as part of the manifold regularisation; however, we use an RBF kernel instead of an ARD kernel, for parameter-tuning reasons (we can no longer learn these scales).
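A sketch of this additive kernel (an illustrative NumPy version of (37); the experiments use a TensorFlow implementation with learned parameters):

```python
import numpy as np

def additive_kernel(x, y, s_x, s_y, gamma1, gamma2, lengthscales, rho):
    """ARD-RBF kernel on covariates plus a Matern-3/2 kernel on spatial locations, as in (37)."""
    ard = gamma1 * np.exp(-0.5 * np.sum((x - y) ** 2 / lengthscales))  # 1/l_k per dimension
    r = np.sqrt(3.0) * np.linalg.norm(s_x - s_y) / rho
    matern = gamma2 * (1.0 + r) * np.exp(-r)
    return ard + matern
```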

For the constant model, bag rate predictions are computed as $p^a \hat{\lambda}^{\text{bag}}_c$, where $\hat{\lambda}^{\text{bag}}_c = \frac{\sum_{a=1}^n y^a}{\sum_{a=1}^n p^a}$. This essentially takes the population into account.

Table 1: Results for the Poisson model on the malaria dataset with 10 different re-splits into train, early-stopping, validation and test sets. Approximately 191 bags are used for the test set. Bag performance is measured on the test set, with MSE computed between $\log(y^a)$ and $\log\big(\sum_{i=1}^{N^a} p^a_i \hat{\lambda}^a_i\big)$. Brackets give the standard deviation.

                Bag NLL         Bag MSE (Log)
Constant        173.1 (31.2)    4.08 (0.13)
Nyström-Exp     88.1 (25.1)     1.31 (0.15)
VBAgg-Sq-Obj    94.1 (34.0)     1.21 (0.05)
VBAgg-Exp-Obj   97.2 (39.6)     1.04 (0.11)
VBAgg-Sq        97.6 (39.0)     1.38 (0.18)
VBAgg-Exp       99.2 (39.8)     1.21 (0.19)
NN-Exp          164.4 (127.8)   1.82 (0.29)


Table 2: p-values from a Wilcoxon signed-rank test for Nyström-Exp versus the methods below for Bag NLL and MSE on the malaria dataset. The null hypothesis is that Nyström-Exp performs equal to or worse than the considered method on test bag performance.

                NLL         MSE
Constant        0.0009766   0.0009766
NN-Exp          0.00293     0.0009766
VBAgg-Sq-Obj    0.1162      0.958
VBAgg-Sq        0.1377      0.1611
VBAgg-Exp-Obj   0.08008     1.0
VBAgg-Exp       0.09668     0.958

Table 3: p-values from a Wilcoxon signed-rank test for VBAgg-Sq versus the methods below for Bag NLL and MSE on the malaria dataset. The null hypothesis is that VBAgg-Sq performs equal to or worse than the considered method on test bag performance.

                NLL         MSE
Constant        0.0009766   0.0009766
NN-Exp          0.01855     0.001953
VBAgg-Sq-Obj    0.6234      0.9861
Nyström-Exp     0.8838      0.8623
VBAgg-Exp-Obj   0.6875      1.0
VBAgg-Exp       0.3477      0.9346

G.1 Predicted log malaria incidence rate for various models

Constant: Bag-level observed incidences  This is the baseline with $\hat{\lambda}^a_i$ constant throughout the bag, as shown in Figure 3. For training, we only use 60% of the data.

Figure 3: Predicted $\hat{\lambda}^a_i$ on the log scale using the constant model, for 3 different re-splits of the data. × denotes non-training-set bags.

VBAgg-Sq-Obj  This is the VBAgg model with $\Psi(v) = v^2$, where hyperparameter tuning is performed based on the training objective (the lower bound to the marginal likelihood); the early-stopping and validation sets are ignored here. The uncertainty of the model seems reasonable, and we also observe that, in general, the areas that are not in the training set have higher uncertainties. Furthermore, in all cases, malaria incidence was predicted to be higher near the river, as discussed in Section 4.2.


Figure 4: Top: Predicted $\hat{\lambda}^a_i$ on the log scale for VBAgg-Sq-Obj. Bottom: Standard deviation of the posterior $v$ in (9) with VBAgg-Sq-Obj.

VBAgg-Sq  This is the VBAgg model with $\Psi(v) = v^2$, where hyperparameter tuning is performed based on the NLL at the bag level. Predicted incidences are similar to the VBAgg-Sq-Obj model. The uncertainty of the model is less reasonable here; this is expected behaviour, as hyperparameters are tuned based on the NLL. In the first patch, the same parameters were chosen as for VBAgg-Sq-Obj.

Figure 5: Top: Predicted $\hat{\lambda}^a_i$ on the log scale for VBAgg-Sq. Bottom: Standard deviation of the posterior $v$ in (9) with VBAgg-Sq.

VBAgg-Exp-Obj  This is the VBAgg model with $\Psi(v) = e^v$, where hyperparameter tuning is performed based on the training objective (the lower bound to the marginal likelihood); the early-stopping and validation sets are ignored here. Predicted incidences seem to be stable in general, though some over-smoothing is observed. The uncertainty of the model is also not very reasonable here, but this behaviour was observed in the toy experiments, and is likely due to the additional lower bound.


Figure 6: Top: Predicted $\hat{\lambda}^a_i$ on the log scale for VBAgg-Exp-Obj. Bottom: Standard deviation of the posterior $v$ in (9) with VBAgg-Exp-Obj.

VBAgg-Exp  This is the VBAgg model with $\Psi(v) = e^v$, where hyperparameter tuning is performed based on the NLL. For details, see the discussion above for the VBAgg-Exp-Obj model.

Figure 7: Top: Predicted $\hat{\lambda}^a_i$ on the log scale for VBAgg-Exp. Bottom: Standard deviation of the posterior $v$ in (9) with VBAgg-Exp.

Nyström-Exp  This is the Nyström-Exp model. While it performs best in terms of bag NLL, its predictions are sometimes too smooth in pixel space, because it directly optimises the bag NLL. This pattern may be unrealistic, and may cause useful covariates to be neglected.


Figure 8: Predicted $\hat{\lambda}^a_i$ on the log scale for Nyström-Exp.

NN-Exp  We can see that this model is not very stable, potentially because it does not have an inbuilt spatial smoothness assumption, unlike the other methods; it only uses manifold regularisation during training. Also, the maximum predicted pixel-level intensity rate $\hat{\lambda}^a_i$ is over 1000 in some cases, which is clearly physically impossible given that $\lambda^a_i$ is a rate per 1000 people.

Figure 9: Predicted $\hat{\lambda}^a_i$ on the log scale for NN-Exp.

G.2 Remote sensing covariates that indicate the presence of a river

Here, we provide figures for some covariates that indicate the presence of a river, as marked by the triangles in Figure 2.

Figure 10: Topographic wetness index, which measures the wetness of an area; rivers are wetter than their surroundings, as clearly highlighted.


Figure 11: Land surface temperature at night; the river is hotter at night, as water retains heat better.

H Additional Toy Experimental Results

In this section, we provide additional experimental results for the Normal and Poisson models. In particular, we report test bag-level performance, as well as prediction, calibration and uncertainty plots.

For the VBAgg model, during the tuning process it is possible to choose tuning parameters (e.g. learning rate, multiple initialisations, landmark choices) based on the NLL on an additional validation set, or based on the objective $\mathcal{L}_1$ on the training set. To distinguish these, we denote the model tuned on the NLL as VBAgg and the model tuned on $\mathcal{L}_1$ as VBAgg-Obj. Intuitively, since VBAgg-Obj attempts to obtain as tight a bound to the marginal likelihood as possible, we would expect better performance in calibration, i.e. more accurate uncertainties.

For calibration plots, we compute the $\alpha$-quantiles of the approximate posterior distribution and consider the proportion of times the underlying rate parameter $\lambda^a_i$ (or $\mu^a_i$ for the Normal model) falls inside the corresponding posterior quantiles. If the model provides good uncertainties/calibration, the claimed coverage should match the observed proportion.
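As a sketch of this check (illustrative only; it assumes a Gaussian posterior over $v$ and applies to the $\Psi(v) = e^v$ case described below, where the posterior over the rate is log-normal), the empirical coverage error of central $\alpha$-intervals can be computed as:

```python
import numpy as np
from scipy import stats

def coverage_error(true_lambda, post_mean, post_sd, alpha):
    """Absolute error between claimed coverage alpha and observed coverage,
    for a log-normal posterior over the rate: log(lambda) ~ N(post_mean, post_sd^2) elementwise."""
    lower = stats.norm.ppf((1 - alpha) / 2, loc=post_mean, scale=post_sd)
    upper = stats.norm.ppf((1 + alpha) / 2, loc=post_mean, scale=post_sd)
    observed = np.mean((np.log(true_lambda) >= lower) & (np.log(true_lambda) <= upper))
    return abs(observed - alpha)
```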

In the case of $\Psi(v) = v^2$, the approximate posterior distribution is a non-central $\chi^2$ distribution, while for $\Psi(v) = e^v$ it is a log-normal distribution. For the Normal model, it is simply a normal distribution, as we do not apply any transformation. Calibration plots can be found in Figures 20 and 21 for the Normal model, and in Figures 14 and 15 for the Poisson model.

For uncertainty plots, we plot the standard deviation of the posterior $v^a \sim \mathcal{N}(m^a, S^a)$ (i.e. before transformation through $\Psi$), as this provides better interpretability. Uncertainty plots can be found in Figures 17 and 23.

To demonstrate the statistical significance of our results, we aggregate the repetitions in each experiment for each method and use a one-sided Wilcoxon signed-rank test to assess whether VBAgg is statistically significantly better than the other approaches in terms of individual NLL and MSE.

H.1 Poisson Model

H.1.1 Swiss Roll Dataset

We provide additional results here for the experimental settings that we consider.



Figure 12: Varying the number of bags, over 5 repetitions. Left column: individual average NLL and MSE on the train set. Right column: bag average NLL and MSE on a test set (of size 500). The constant prediction NLL and MSE are 2.23 and 0.85 respectively. The bag-pixel model's NLL is above 2.4 and its MSE above 3.0, hence it is not shown on the graph.

The results for the experiment varying the number of bags are shown in Figure 12, with the corresponding p-values in Tables 4 and 5 demonstrating the statistical significance of the VBAgg-Exp and VBAgg-Sq methods. Similarly, the results for varying the number of individuals per bag through $N_{\text{mean}}$ are shown in Figure 13, with the corresponding p-values in Tables 6 and 7. The comparison between VBAgg-Exp and VBAgg-Sq was found to be non-significant.

Table 4: p-values from a Wilcoxon signed-rank test for VBAgg-Sq versus the methods below for the varying-number-of-bags experiment for the Poisson model. The null hypothesis is that VBAgg-Sq performs equal to or worse than NN or Nyström in terms of individual NLL or MSE on the train set.

                NLL         MSE
NN-Exp          6.98e−06    0.00025
Nyström-Exp     0.00048     0.00015

Table 5: p-values from a Wilcoxon signed-rank test for VBAgg-Exp versus the methods below for the varying-number-of-bags experiment for the Poisson model. The null hypothesis is that VBAgg-Exp performs equal to or worse than NN or Nyström in terms of individual NLL or MSE on the train set.

                NLL         MSE
NN-Exp          2.48e−06    2.48e−05
Nyström-Exp     0.0005      0.00025



Figure 13: Varying the number of individuals per bag $N_{\text{mean}}$, over 5 repetitions. Left column: individual average NLL and MSE on the train set. Right column: bag average NLL and MSE on a test set (of size 500). The constant prediction NLL and MSE are 2.23 and 0.85 respectively.

Table 6: p-values from a Wilcoxon signed-rank test for VBAgg-Sq versus the methods below for the varying-number-of-individuals-per-bag experiment for the Poisson model. The null hypothesis is that VBAgg-Sq performs equal to or worse than NN or Nyström in terms of individual NLL or MSE on the train set.

                NLL         MSE
NN-Exp          1.81e−05    9.53e−06
Nyström-Exp     0.062       0.041

Table 7: p-values from a Wilcoxon signed-rank test for VBAgg-Exp versus the methods below for the varying-number-of-individuals-per-bag experiment for the Poisson model. The null hypothesis is that VBAgg-Exp performs worse than NN or Nyström in terms of individual NLL or MSE on the train set.

                NLL         MSE
NN-Exp          6.68e−05    0.00016
Nyström-Exp     0.049       0.062

Calibration plots for the Swiss roll dataset  In Figures 14 and 15, we provide calibration results for both experiments considered. See the beginning of Appendix H for further details. It is clear that while VBAgg-Sq-Obj and VBAgg-Sq provide good calibration in general, this is not the case for VBAgg-Exp-Obj and VBAgg-Exp. This is not surprising, as the VBAgg-Exp methods use an additional lower bound.



Figure 14: Absolute error in coverage from 70% to 95% for the increasing-number-of-bags experiment for the Poisson model. Shaded regions show the standard deviation. Perfect coverage would give a straight line at 0 error.


Figure 15: Absolute error in coverage from 70% to 95% for the increasing number of individuals per bag ($N_{\text{mean}}$ and $N_{\text{std}}$) for the Poisson model. Shaded regions show the standard deviation. Perfect coverage would give a straight line at 0 error.


Prediction and uncertainty plots  In Figures 16 and 17, we provide prediction plots for the different models, and uncertainty plots for the VBAgg models.


Figure 16: Individual predictions on the train set for the Swiss roll dataset with 150 bags for the NN and Nyström models. Here $N_{\text{mean}} = 150$ and $N_{\text{std}} = 50$.


Figure 17: Predictions and uncertainty on the Swiss roll dataset with 150 bags for the VBAgg-Obj models. Here $N_{\text{mean}} = 150$ and $N_{\text{std}} = 50$. For uncertainty, we plot the standard deviation of the posterior of $v$, coming from $v^a \sim \mathcal{N}(m^a, S^a)$ in (9).


H.2 Normal Model

H.2.1 Swiss Roll Dataset

In this section, we provide experimental results for the Normal model, where throughout we assume $\tau^a_i = \tau$, i.e. the same value for all individuals.

We consider the same Swiss roll dataset as in the Poisson model, taking the colour of each point to be the underlying mean $\mu^a_i$. We then consider $y^a_i \sim \mathcal{N}(\mu^a_i, \tau)$ with $\tau = 0.1$; hence the bag observations are given by $y^a = \sum_{i=1}^{N^a} y^a_i \sim \mathcal{N}(\mu^a, N^a\tau)$ with $\mu^a = \sum_{i=1}^{N^a} \mu^a_i$. Here, the goal is to predict $\mu^a_i$ and $\tau$, given the bag observations $y^a$ only. The results for the experiments are shown below in Figures 18 and 19, which show the VBAgg outperforming the NN and Nyström models. To show statistical significance, we also report the corresponding p-values in Tables 8 and 9. Furthermore, the VBAgg is well calibrated, as shown in Figure 20.


Figure 18: Varying the number of bags, over 5 repetitions for the Normal model. Left column: individual average NLL and MSE on the train set. Right column: bag average NLL and MSE on a test set (of size 500). The constant model's individual MSE is 0.04.

Table 8: p-values from a Wilcoxon signed-rank test for VBAgg versus the methods below for the varying-number-of-bags experiment for the Normal model. The null hypothesis is that VBAgg performs equal to or worse than NN or Nyström in terms of individual NLL or MSE on the train set.

            NLL         MSE
NN          5.96e−07    4.79e−09
Nyström     4.01e−08    6.52e−09

Table 9: p-values from a Wilcoxon signed-rank test for VBAgg versus the methods below for the varying-number-of-individuals-per-bag ($N_{\text{mean}}$) experiment for the Normal model. The null hypothesis is that VBAgg performs worse than NN or Nyström in terms of individual NLL or MSE on the train set.

            NLL         MSE
NN          4.77e−06    4.77e−06
Nyström     4.77e−06    4.77e−06

Calibration plots for the Swiss roll dataset  In Figures 20 and 21, we provide calibration results for both experiments considered. See the beginning of Appendix H for further details.



Figure 19: Varying the number of individuals per bag $N_{\text{mean}}$, over 5 repetitions. Left column: individual average NLL and MSE on the train set. Right column: bag average NLL and MSE on a test set (of size 500). The constant model's individual MSE is 0.039.

It is clear that VBAgg-Obj has better calibration in general; this is not surprising, as it is tuned based on the correct objective rather than the NLL.



Figure 20: Absolute error in coverage from 70% to 95% for the increasing-number-of-bags experiment for the Normal model. Shaded regions show the standard deviation. Perfect coverage would give a straight line at 0 error.


Figure 21: Absolute error in coverage from 70% to 95% for the increasing number of individuals per bag ($N_{\text{mean}}$ and $N_{\text{std}}$) for the Normal model. Shaded regions show the standard deviation. Perfect coverage would give a straight line at 0 error.


Prediction and uncertainty plots Here, we provide some prediction plots for different models.


Figure 22: Individual predictions on the train set for the Swiss roll dataset with 150 bags for the NN and Nyström models. Here $N_{\text{mean}} = 150$ and $N_{\text{std}} = 50$.


Figure 23: Predictions and uncertainty on the Swiss roll dataset with 150 bags for the VBAgg-Obj model. Here $N_{\text{mean}} = 150$ and $N_{\text{std}} = 50$. For uncertainty, we plot the standard deviation of the posterior of $v$, coming from $v^a \sim \mathcal{N}(m^a, S^a)$ in (9).

H.2.2 Elevators Dataset

For a real-dataset experiment, we consider the elevators dataset11, a large-scale regression dataset12 containing 16599 instances, each in $\mathbb{R}^{17}$. This dataset comes from the task of controlling an F16 aircraft, with the label $y \in \mathbb{R}$ being a particular action taken on the elevators of the aircraft. For the model formulation, we assume each label follows a normal distribution, i.e. $y_l \sim \mathcal{N}(\mu_l, \tau)$, where $\tau$ is a fixed quantity to be learnt. In practice, we can imagine that the action taken may differ according to the operator.

In order to formulate this dataset in an aggregate-data setting, we sample bag sizes from a negative binomial distribution as before, with $N_{\text{mean}} = 30$ and $N_{\text{std}} = 15$, and take $w^a_i = 1$. To place observations into bags, similarly to the Swiss roll dataset, we select a particular covariate and place instances into bags based on the ordering of that covariate. We now have the bag-level model $y^a \sim \mathcal{N}(\mu^a, N^a\tau)$, with individual model $y^a_i \sim \mathcal{N}(\mu^a_i, \tau)$, and our goal is to predict $\mu^a_i$ (and also infer $\tau$), given only $y^a$. After the bagging process, we obtain approximately 225 bags for training, and 33 bags each for early stopping, validation and testing (for bag-level performance). Further, in order to discount variables that do not provide signal, we use an ARD kernel for the VBAgg and

11This dataset is publicly available at http://sci2s.ugr.es/keel/dataset.php?cod=94
12We have removed one column that is almost completely sparse.


Table 10: Results for the Normal model on the elevators dataset with 50 repetitions. Indiv refers to individuals on the train set, while bag performance is measured on a test set. Numbers in brackets denote p-values from a Wilcoxon signed-rank test for VBAgg versus the method; the null hypothesis is that VBAgg performs equal to or worse than NN or Nyström in terms of individual NLL or MSE on the train set. Note that MSE is computed on the observed $y^a_i$ or $y^a$, rather than the unknown $\mu^a_i$ or $\mu^a$.

            Indiv NLL           Bag NLL     Indiv MSE           Bag MSE
Constant    N/A                 N/A         0.010               0.366
VBAgg       -1.69               0.003       0.0018              0.052
VBAgg-Obj   -1.71               -0.02       0.0018              0.052
Nyström     -1.57 (1.5e−13)     0.003       0.0024 (8.9e−16)    0.041
NN          -1.64 (0.0001258)   0.082       0.0021 (8.8e−10)    0.041

Nyström model, as below:
$$k_{\text{ard}}(x, y) = \gamma_{\text{scale}} \exp\Big(-\frac{1}{2}\sum_{k=1}^{d} \frac{1}{\ell_k}(x_k - y_k)^2\Big) \qquad (38)$$

and learn the kernel parameters $\gamma_{\text{scale}}$ and $\{\ell_k\}_{k=1}^d$. We repeat this bagging and splitting of the dataset 50 times and report individual NLL and MSE results in Table 10. From the results, we observe that the VBAgg model performs better than the Nyström and NN models, with statistical significance.
