

Long Horizon Forecasting With Temporal Point Processes

Prathamesh Deshpande∗ (IIT Bombay), Kamlesh Marathe (IIT Bombay), Abir De (IIT Bombay), Sunita Sarawagi (IIT Bombay)

ABSTRACT

In recent years, marked temporal point processes (MTPPs) have emerged as a powerful modeling machinery to characterize asynchronous events in a wide variety of applications. MTPPs have demonstrated significant potential in predicting event timings, especially for events arriving in the near future. However, due to current design choices, MTPPs often show poor predictive performance at forecasting event arrivals in the distant future. To ameliorate this limitation, in this paper we design DualTPP, which is specifically well-suited to long horizon event forecasting. DualTPP has two components. The first component is an intensity-free MTPP model, which captures microscopic event dynamics by modeling the time of future events. The second component takes a different dual perspective of modeling aggregated counts of events in a given time window, thus encapsulating macroscopic event dynamics. We then develop a novel inference framework jointly over the two models by solving a sequence of constrained quadratic optimization problems. Experiments with a diverse set of real datasets show that DualTPP outperforms existing MTPP methods on long horizon forecasting by substantial margins, achieving almost an order of magnitude reduction in Wasserstein distance between actual events and forecasts. The code and the datasets can be found at the following URL: https://github.com/pratham16cse/DualTPP

ACM Reference Format:

Prathamesh Deshpande, Kamlesh Marathe, Abir De, and Sunita Sarawagi. 2021. Long Horizon Forecasting With Temporal Point Processes. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM '21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3437963.3441740

1 INTRODUCTION

In recent years, marked temporal point processes (MTPPs) have emerged as a powerful tool for modeling asynchronous events in a diverse set of applications, such as information diffusion in social networks [8, 11, 12, 26, 27, 39, 57, 59], disease progression [37, 40, 41, 56], traffic flow [33], and financial transactions [15, 17, 19, 22, 30, 44]. MTPPs are realized using two quantities: (i) intensity functions, which characterize the probabilities of arrivals of subsequent events based on the history of previous events; and (ii) the distribution of marks, which captures extra information attached to each event, e.g., sentiment in a tweet, location in traffic flow, etc.

∗ Contact author, [email protected]



Over the myriad applications of MTPPs, we identify two modes in which MTPPs are used during prediction: (i) nowcasting, which implies prediction of only the immediate next event, i.e., one-step-ahead prediction; and (ii) forecasting, which requires prediction of events in the distant future, i.e., long-term forecasting. Forecasting continuous-time events with TPP models has a wide variety of use cases. For example, in emergency planning, it can assist resource allocation by anticipating demand; in transportation, it can help in congestion management; and in a social network, it can help to anticipate the rise of an orchestrated campaign. In this work, our goal is to develop a temporal point process model that is specifically suited for accurate forecasting of the arrival of events in the long term, given a history of events in the past.
Limitations of prior work. Predictive models of temporal point processes have been extensively researched in recent literature [3, 7, 13, 29, 31, 34, 45, 47, 52, 55, 58]. A predominant approach is to train an intensity function for the next event conditioned on historical events, and then, based on this estimated intensity function, forward-sample events to predict a sequence of events in the future. While these approaches have shown promise at predicting the arrival of events in the near future, they suffer from two major limitations:

I. Their modeling frameworks rest heavily on designing the intensity function, which in turn can sample only the next subsequent event. Such a design choice allows these models to be trained only for nowcasting rather than forecasting.

II. Over long time horizons, the forward sampling method accumulates cascading errors: we condition on predicted events to generate the next event, whereas during training we condition on true events. Existing approaches [51, 54] of handling this mismatch via sequence-level losses provide only modest gains.

Present work. Responding to the limitations of prior approaches, we develop DualTPP, which is specifically designed to forecast events over long time horizons. The DualTPP model consists of two components. The first component encapsulates the event dynamics at a microscopic scale, whereas the second component views the same event dynamics from a different perspective and at a higher, macroscopic scale. The first component is an intensity-free recurrent temporal point process, which models the time of events conditioned on all previous events along with marks. This model has sufficient predictive ability to capture the event arrival process in the immediate future, but like existing TPPs it is subject to cascading drift. The second component models the count of events over fixed time intervals in the long-term future. Together, this leads to an accurate modeling of both the short- and long-term behavior of the associated event arrival process.



Inference in DualTPP involves forecasting events while achieving consensus across predictions from both models. This presents new algorithmic challenges. We formulate a novel joint inference objective over the two models and show how to decompose it into a sequence of constrained concave quadratic maximization problems over continuous variables, combined with a binary search over discrete count variables. Our algorithm provides a significant departure from existing sampling-based inference methods, which are subject to gross inaccuracies.

Our model includes elements of both multi-scale modeling, as in hierarchies, and multi-view learning. We show that this form of multi-view, multi-scale modeling, coupled with our joint inference algorithm, provides more accurate long-term forecasting than multi-scale models alone [6, 46, 47]. We provide a comprehensive evaluation of our proposal across several real-world datasets. Our experiments show that the proposed model outperforms several state-of-the-art baselines in terms of forecasting accuracy, by a substantial margin.
Summary of contributions. Summarizing, we make the following contributions in this paper.
- Forecasting-aware modeling framework: We propose a novel forecasting-aware modeling framework for temporal point processes, which consists of two parts: the first part captures the low-level microscopic behavior, whereas the other part captures the high-level macroscopic signals from a different perspective. These two components complement each other's predictive ability, which helps the joint model accurately characterize the long horizon behavior of the event dynamics.
- Efficient inference protocol: We devise a novel inference method to forecast the arrival of events during an arbitrary time interval. In sharp contrast to expensive sampling procedures, the proposed inference method casts the forecasting task as a sequence of constrained quadratic optimization problems, which can be efficiently solved using standard tools.
- Comprehensive evaluation: Our proposal is not only theoretically principled, but also practically effective. We show superior predictive ability compared to several state-of-the-art algorithms. Our experiments are on practically motivated datasets spanning applications in social media, traffic, and emergency planning. The substantial gains we obtain over existing methods establish our practical impact on these applications.

2 RELATED WORK

Our work is related to temporal point processes, long-term forecasting in time series, and, peripherally, the area of multi-view learning.
Temporal point processes. Modeling continuous-time event streams with temporal point processes (TPPs) follows two predominant approaches. The first approach focuses on characterizing TPPs using fixed parameterizations, by means of linear or quasi-linear forms of intensity functions [5, 20, 21, 23, 32], e.g., the Hawkes process, the self-correcting process, etc. Such TPP models are designed to capture specific phenomena of interest. For example, the Hawkes process encapsulates the self-exciting nature of information diffusion in online social networks, whereas the Markov-modulated point process can accurately model online check-ins.

While such models provide interpretability, their fixed parameterizations often lead to model mis-specification and limited expressiveness, which in turn constrain their predictive power. The second approach overcomes such limitations by designing deep neural TPP models, guided by a recurrent neural network which captures the dependence of subsequent event arrivals on previous events. Du et al. [13] proposed the Recurrent Marked Temporal Point Process (RMTPP), a three-layer neural architecture for a TPP model, which relies on a vanilla RNN to capture the dependence between inter-event arrival times. Such a design is still the workhorse of many deep recurrent TPP models. The Neural Hawkes process [31] provides a robust nonlinear TPP model which can incorporate the effect of missing data. However, these models rest heavily on learning the arrival dynamics of one subsequent event and, as a consequence, show poor forecasting performance. Recently, a number of more powerful deep learning techniques have been borrowed to capture richer dependencies among events in TPPs. For example, Xiao et al. [54] propose a sequence-to-sequence encoder-decoder model for predicting the next k events; Xiao et al. [51] use Wasserstein GANs to generate an entire sequence of events; Vassøy et al. [47] deploy a hierarchical model; and Zuo et al. [60] apply a transformer architecture to capture the dependence among events via self-attention. We compare DualTPP against these methods in Section 5 and show substantial gains.
Long-term forecasting in time series. The topic of long-term forecasting has been explored more in the regular time-series setting than in the TPP setting. Existing time-series models are also auto-regressive and trained for one-step-ahead prediction [16], and are subject to a similar phenomenon of cascading errors when used for long-range forecasting. Efforts to fix the teacher-forcing training of these one-step-ahead models to adapt better to multi-step forecasting [49] have not been as effective as breaking the auto-regressive structure to directly predict each future time step [4, 9, 50]. Another idea is to use dilated convolutions, as successfully deployed in Wavenet [46] for audio generation, that connect each output to successively doubling hops into the past [6]. A hierarchical model that we compare with in Section 5.2 also uses dilated connections to past events. We found that this model provided much better long-range forecasts than existing TPP models; however, our hybrid event-count model surpassed it consistently. A third idea is to use a loss function [28] over the entire prediction range that preserves sequence-level properties, analogous to how the Wasserstein loss is used in [54] for the TPP setting.

A key difference of DualTPP from all previous work in both the TPP and time-series literature is that existing methods focus on the training stage, and during inference continue to deploy the same one-step event generation. Our key idea is to use a second model to output properties of the aggregated set of predicted events. We then solve an efficient joint optimization problem over the predicted sequence to achieve consensus between the predicted aggregate properties and the one-step generated events. This relates our approach to early work on multi-view learning in the traditional machine learning literature, which we discuss next.
Multi-view learning models. Inference in structured prediction tasks with aggregate potentials over a large number of predicted variables was studied in tasks like image segmentation [25, 38, 43] and information extraction [18].


In several NLP tasks too, enforcing constraints during inference via efficient optimization formulations has been found to be effective [10, 14, 36]. In this paper we demonstrate, for the first time, the application of these ideas to TPPs, which, due to their continuous nature, pose very different challenges than classical multi-view models over discrete labels.

3 MODEL FORMULATION

In this section, we formulate DualTPP, our two-component modeling framework for marked temporal point processes (MTPPs). We begin with an overview of MTPPs and then provide a detailed description of our proposed DualTPP.

3.1 Background on MTPP

An MTPP [13, 51, 60] is a stochastic process which is realized as a series of discrete events arriving in continuous time. Given a sequence of events {e_1 = (m_1, t_1), e_2 = (m_2, t_2), ...}, where m_i ∈ [K]¹ indicates the discrete mark and t_i ∈ R_+ indicates the arrival time of the i-th event, an MTPP is characterized by H_t = {e_i = (m_i, t_i) | t_i < t}, which gathers all events that arrived until time t. Equivalently, it can also be described using a counting process N(t), which counts the number of events arrived until time t, i.e., N(t) = |H_t|. The dynamics of N(t) are characterized using an intensity function λ*(t), which specifies the likelihood of the next event conditioned on the history of events H_t.² The intensity function λ*(t) computes the infinitesimal probability that an event will happen in the time window (t, t + dt], conditioned on the history H_t, as follows:

P(dN(t) = N(t + dt) − N(t) = 1 | H_t) = λ*(t) dt.    (1)

The intensity function is used to compute the expected time of the next event as:

E[t_i | H_{t_i}] = ∫_{t_{i−1}}^{∞} t · λ*(t) dt.    (2)

The marks are generated using some probability distribution q_m conditioned on the history of events, i.e.,

P(m_i = k | H_{t_i}) = q*_m(k).    (3)

Given the history of events H_T observed during the time interval (0, T], one typically learns the intensity function λ*(t) and the mark distribution q*_m by maximizing the following likelihood function:

L(H_T | λ*, q*_m) = Σ_{(m_i, t_i) ∈ H_T} ( log q*_m(m_i) + log λ*(t_i) ) − ∫_0^T λ*(τ) dτ.

Once the intensity function λ*(t) and the mark distribution q*_m are estimated, they are used to forecast events by means of thinning [32] or inverse sampling [45] mechanisms. Such mechanisms often suffer from poor time complexity. Moreover, such recursive sampling methods accumulate prediction errors caused by any model mis-specification. In the following, we aim to design a temporal point process model that is able to overcome this limitation.
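To make the thinning-based forecasting step concrete, the following is a minimal sketch of Ogata-style thinning for sampling the next arrival time, assuming the conditional intensity λ*(t) is available as a callable and that lam_max upper-bounds it over the sampled horizon (both names are illustrative, not from the paper):

import numpy as np

def sample_next_event_time(t_prev, intensity, lam_max, rng):
    """Ogata's thinning: sample the next arrival time after t_prev.

    `intensity(t)` is the conditional intensity (history assumed baked
    into the closure); `lam_max` must upper-bound it on the horizon.
    """
    t = t_prev
    while True:
        # Propose a candidate from a homogeneous Poisson(lam_max) process.
        t += rng.exponential(1.0 / lam_max)
        # Accept the candidate with probability intensity(t) / lam_max.
        if rng.uniform() <= intensity(t) / lam_max:
            return t

# Example: a decaying Hawkes-like intensity after an event at t = 0.
rng = np.random.default_rng(0)
lam = lambda t: 0.5 + 1.5 * np.exp(-t)
print(sample_next_event_time(0.0, lam, lam_max=2.0, rng=rng))

Each accepted sample requires repeated intensity evaluations, which illustrates the poor time complexity noted above when many future events must be generated recursively.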

¹ In the current work, we consider discrete marks which can take K labels; however, our method can easily be extended to continuous marks.
² * indicates the dependence on history.

3.2 Design of DualTPP

We now set about designing our proposed model DualTPP. At the very outset, DualTPP has two components to model the underlying MTPP: the event model, which captures the dynamics of individual events, and the count model, which provides an alternative count perspective over a set of events in the long-term future. Here we describe the model structure and training. In Section 4, we describe how we combine outputs from the two models during inference.
Event model. Our event model is a generative process which draws the next event (m, t) given the history of events H_t. In several applications [24, 53], the arrival times as well as the marks of subsequent events depend on the history of previous events. Therefore, we capture such inter-event dependencies by realizing our event model with a conditional density function p_θ(· | H_t). Following several existing MTPP models [13, 24, 31, 53], we model p_θ(· | H_t) by means of a recurrent neural network with parameters θ, which embeds the history of events in compact vectors h•. It has three layers: (i) an input layer, (ii) a hidden layer, and (iii) an output layer. In the following, we describe them in detail.
- Input layer. Upon arrival of the i-th event e = (m_i, t_i), the input layer transforms m_i into an embedding vector m_i and computes the inter-arrival gap δ_i, which are used by the subsequent layers. More specifically, it computes

π’Žπ‘– =π‘šπ‘–π’˜π‘š + π’ƒπ‘š, (4)𝛿𝑖 = 𝑑𝑖 βˆ’ π‘‘π‘–βˆ’1, (5)

whereπ’˜π‘š is embedding matrix and π’ƒπ‘š is bias.β€” Hidden layer. This layer embeds the history of events in thesequence of hidden state vectors (𝒉‒) using a recurrent neuralnetwork. More specifically, it takes three signals as input: (i) theembeddings π’Žπ‘– and (ii) the inter-arrival time duration 𝛿𝑖 , which iscomputed in the previous layer, as well as (iii) the hour of the eventas an additional feature 𝑓𝑖 ; and then updates the hidden state 𝒉‒using a gated recurrent unit as follows:

h_i = GRU_{w_h}(h_{i−1}; m_i, δ_i, f_i).    (6)

Note that h_i summarizes the history of the first i events.
- Output layer. Finally, the output layer computes the distribution of the mark m_{i+1} and the timing t_{i+1} of the next event as follows. We parameterize the distribution over marks as a softmax over the hidden states:

P(m_{i+1} = c) = exp(w_{y,c}ᵀ h_i + b_{y,c}) / Σ_{j=1}^{K} exp(w_{y,j}ᵀ h_i + b_{y,j}).    (7)

Similar to [42], we use a Gaussian distribution to model the gap δ_i to the next event:

δ_i ∼ N(μ(h_i), σ(h_i)),    t_{i+1} = t_i + δ_i.    (8)

Here, the mean gap μ(h_i) and its standard deviation σ(h_i) are computed from the hidden state as:

μ(h_i) = softplus(w_μᵀ h_i + b_μ),    (9)
σ(h_i) = softplus(w_σᵀ h_i + b_σ).    (10)

Here, θ = {w•, b•} is the set of trainable parameters. The Gaussian density provided more accurate long-term forecasts than existing intensity-based approaches such as RMTPP [13]; our ablation study in Table 3 shows this with a version of DualTPP.
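As a concrete reference, here is a minimal PyTorch sketch of the event model described by Eqs. 4–10. Layer sizes follow the architectural details reported in Section 5.3 (a 32-unit GRU, mark embeddings of size 8, 11 marks); the class and variable names are illustrative assumptions, not taken from the released code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EventModel(nn.Module):
    """Sketch of the recurrent event model p_theta (Eqs. 4-10).

    The extra time feature f_i (hour of the event) enters as a
    single scalar input alongside the mark embedding and gap.
    """
    def __init__(self, num_marks=11, embed_dim=8, hidden_dim=32):
        super().__init__()
        self.mark_embed = nn.Embedding(num_marks, embed_dim)   # Eq. 4
        self.gru = nn.GRUCell(embed_dim + 2, hidden_dim)       # Eq. 6
        self.mark_head = nn.Linear(hidden_dim, num_marks)      # Eq. 7
        self.mu_head = nn.Linear(hidden_dim, 1)                # Eq. 9
        self.sigma_head = nn.Linear(hidden_dim, 1)             # Eq. 10

    def forward(self, marks, gaps, feats, h=None):
        # marks: (B, L) long; gaps, feats: (B, L) float
        B, L = marks.shape
        if h is None:
            h = marks.new_zeros(B, self.gru.hidden_size, dtype=torch.float)
        mark_logits, mus, sigmas = [], [], []
        for i in range(L):
            x = torch.cat([self.mark_embed(marks[:, i]),
                           gaps[:, i:i+1], feats[:, i:i+1]], dim=-1)
            h = self.gru(x, h)
            mark_logits.append(self.mark_head(h))
            mus.append(F.softplus(self.mu_head(h)))      # mean gap
            sigmas.append(F.softplus(self.sigma_head(h)))  # std of gap
        return (torch.stack(mark_logits, 1), torch.stack(mus, 1),
                torch.stack(sigmas, 1), h)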


[Figure 1 appears here.]
Figure 1: Overview of the inference task in DualTPP. In this example, the time horizon is split into six bins. The eleven events in the first four bins comprise the known history H_T. The event model p_θ conditioned on H_T predicts five events until T_e, spanning two bins. The count model at the top predicts two Gaussians with mean 2 each. These are combined by DualTPP's joint inference algorithm (Algorithm 1) to get the revised event predictions shown on the right. Marks are omitted for clarity.


The above event model, via its auto-regressive structure, is effective at capturing the arrival process of events at a microscopic level. Indeed, it is sufficiently capable of accurately predicting events in the short-term future. However, its long-term predictions suffer from cascading errors when auto-regressing on predicted events. The count model is designed to contain this drift.
Count model. Here we aim to capture the number of events arriving in a sequence of time intervals. We partition time into equal time intervals, called bins, of size Δ, which is a hyper-parameter of our model (Figure 1 shows an example). Given a history of events H_T, we develop a simple distribution p_φ which generates the total count of events for the subsequent n bins. Let I_s be the time interval [T + (s−1)Δ, T + sΔ), and let C_s denote the number of events occurring within it. We factorize the distribution p_φ over the n future bins independently over each of the bins, while conditioning on the known history H_T and properties of the predicted bin:

π‘πœ™ (𝐢𝑛,πΆπ‘›βˆ’1, Β· Β· Β· ,𝐢1 |𝐻𝑇 , 𝐼𝑛, . . . , 𝐼1) =π‘›βˆπ‘—=1

π‘πœ™ (𝐢 𝑗 |𝐻𝑇 , 𝐼 𝑗 ) (11)

This conditionally independent model provided better accuracy than an auto-regressive model, which would require conditioning on future unknown counts. Similar observations have been made for time-series models in [9, 50].

Each π‘πœ™ (𝐢 𝑗 |𝐻𝑇 , 𝐼 𝑗 ) is modeled as a Gaussian distribution withmean a 𝑗,πœ™ and variance 𝜌 𝑗,πœ™ . Gaussian distribution provides ex-plicit control on variance. This is necessary for efficient inferencedescribed in Section 4. Although the domain of Gaussian distribu-tion is βˆ’βˆž to +∞, it is convenient for training and does not lead toany issues during inference. A feed-forward network with parame-ters πœ™ , learns these parameters as a function of features extractedfrom the history 𝐻𝑇 and current interval 𝐼 𝑗 as follows: From a timeinterval we extract time-features such as the hour-of-the-day inthe mid-point of the bin. Then from 𝐻𝑇 we extract the counts ofevents in the most recent π‘›βˆ’ bins before 𝑇 and time features fromtheir corresponding bins.

Learning the parameters θ and φ. Given a stream of observed events {e_i} during the time window (0, T], we learn the event model θ by maximizing the following likelihood function:

maximize_θ  Σ_{e_i ∈ H_T} log p_θ(e_i | H_{t_i}).    (12)

In order to train the count model, we first group the events into different bins of the same width Δ. Next, we sample them in different batches of n⁻ + n bins and then learn φ in the following manner:

maximize_φ  E_{H_s ∼ p_Data} [ Σ_{j=1}^{n} log p_φ(C_{s+j} | H_s, I_{s+j}) ]    (13)

Here, H_s denotes a history of events between time (s − n⁻)Δ and s, and C_{s+j} denotes the observed count of events in bin I_{s+j}.
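For reference, the following sketch shows one way to build such binned training instances from raw timestamps; make_count_instances is a hypothetical helper, and the window sizes mirror the setup in Section 5.3 (20 history bins, 3 future bins):

import numpy as np

def make_count_instances(event_times, delta, n_hist=20, n_future=3):
    """Bin event timestamps into Delta-sized intervals and emit
    (history-count, future-count) training pairs for the count model.
    """
    t0 = event_times.min()
    bins = np.floor((event_times - t0) / delta).astype(int)
    counts = np.bincount(bins)                  # events per bin
    X, Y = [], []
    for s in range(n_hist, len(counts) - n_future + 1):
        X.append(counts[s - n_hist:s])          # history counts
        Y.append(counts[s:s + n_future])        # target counts
    return np.array(X), np.array(Y)

times = np.sort(np.random.default_rng(0).uniform(0, 1000, size=5000))
X, Y = make_count_instances(times, delta=10.0)
print(X.shape, Y.shape)   # (n_instances, 20), (n_instances, 3)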

4 INFERENCE

In this section, we formulate our inference procedure over the trained models (p_θ, p_φ) for forecasting all events (marks and times) within a user-provided future time T_e, given the history H_T of events before T < T_e.
Inference with the event-only model. First, we review an existing method of solving this inference task using only the event model p_θ. Note that p_θ is an auto-regressive model that provides a distribution on the next event e_{i+1} given known historical events H_T before T and previously predicted events e_1, ..., e_i, that is, p_θ(e_{i+1} | H_T, e_1, ..., e_i). Let h_i denote the RNN state after inputting the events in the history H_T and the predicted events e_1, ..., e_i. Based on this state, we predict a distribution over the next gap via Eq. 8 and the next mark via Eq. 7. The predicted time and mark of the next event are simply the modes of the respective distributions: e_{i+1} = (m̂_{i+1} = argmax_m P(m_{i+1} = m), t̂_{i+1} = t_i + μ(h_i)). The predicted event is input to the RNN to obtain a new state, and we repeat the process until we predict an event with time > T_e.

As mentioned earlier, the events predicted by such a forward-sampling method on p_θ alone are subject to drift, particularly when T_e is far from T. We next describe how DualTPP contains this drift by generating an event sequence that jointly maximizes the probability under the event and count models.
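A minimal sketch of this event-only decoding loop, written against the EventModel sketched in Section 3.2, is shown below; feat_fn (mapping a time to its hour-of-day style feature as a 1-element tensor) is an assumed helper:

import torch

@torch.no_grad()
def forward_sample(model, h, last_mark, last_gap, t_last, T_e, feat_fn):
    """Event-only greedy decoding (the baseline reviewed above).

    `h` is the RNN state after encoding the known history H_T.
    Feeding modal predictions back into the RNN at each step is
    exactly the source of the cascading drift discussed in the text.
    """
    events, t = [], t_last
    mark = torch.tensor([last_mark])
    gap = torch.tensor([float(last_gap)])
    while True:
        logits, mu, _, h = model(mark.view(1, 1), gap.view(1, 1),
                                 feat_fn(t).view(1, 1), h)
        gap = mu[0, 0]                         # modal gap  (Eq. 8)
        mark = logits[0, 0].argmax().view(1)   # modal mark (Eq. 7)
        t = t + gap.item()
        if t > T_e:
            return events, h
        events.append((mark.item(), t))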


Joint inference objective of DualTPP. The event model gives a distribution over the next event given all previous events, whereas the count model p_φ imposes a distribution over the number of events that fall within the regular Δ-sized bins between times T and T_e. For simplicity of exposition, we assume that T_e aligns with a bin boundary, i.e., T_e = T + n_e Δ for a positive integer n_e. During inference we wish to determine the sequence of events that maximizes the product of their joint probability as follows:

max_{r, e_1, ..., e_r, C_1, ..., C_{n_e}}  [ Σ_{i=1}^{r} log p_θ(e_i | H_T, e_1, ..., e_{i−1}) + Σ_{b=1}^{n_e} log p_φ(C_b | H_T, I_b) ]    (14)

such that  t_r < T_e,   |{e_i | t_i ∈ I_b}| = C_b  ∀ b ∈ [n_e].    (15)

Unlike the number of bins, the number of events r is unknown and is part of the optimization. The constraints ensure that the last event ends before T_e and that there is consensus between the count and event models. Solving the above optimization problem exactly over all possible event sequences completing before T_e is intractable for several confounding reasons: the event model expresses the dependence of an event on all previous events, and does so via arbitrary non-linear functions. Also, it is not obvious how to enforce the integral constraint on the number of events in a bin, as expressed in Eq. 15.
Tractable decomposition of the joint objective. We propose two simplifications that decompose the above intractable objective into a sequence of optimization problems that are optimally solvable. First, we decompose the objective into n_e stages. In the b-th stage we infer the set of events whose times fall within the b-th bin I_b, assuming we have already predicted the set of all events before that bin. Call these E_b = e_1, ..., e_{r_b}, where r_b = |E_b| denotes the number of predicted events before the start of the b-th bin, i.e., to the left of I_b. Second, we fix the RNN state h_i for all potential events in I_b to their unconstrained values as follows: starting with the RNN state h_{r_b}, we perform forward sampling as in the event-only baseline until we sample an upper limit C_max of events likely to be in I_b (we discuss how to choose C_max later). Once the RNN state h_i is fixed, the distribution of the gap between the i-th and (i+1)-th events is a Gaussian N(μ(h_i), σ(h_i)), and the predicted mark m̂_{i+1} is also fixed. We can then rewrite the above inference problem for the events in the b-th bin as a double optimization problem as follows:

maxπ‘βˆˆ[𝐢max ]

[max𝑔1,...𝑔𝑐

𝐢maxβˆ‘οΈπ‘–=1

logN(𝑔𝑖 ; ` (π’‰π‘Ÿπ‘+𝑖 ), 𝜎 (π’‰π‘Ÿπ‘+𝑖 )) + logN(𝑐;a𝑏 , πœŒπ‘ )]

such that, 𝑔𝑖 β‰₯ 0,π‘βˆ‘οΈπ‘–=1

𝑔𝑖 ≀ Ξ”,𝑐+1βˆ‘οΈπ‘–=1

𝑔𝑖 > Ξ”, π‘‘π‘Ÿπ‘ + 𝑔1 ∈ 𝐼𝑏 (16)

In the above equation, the constraints in the inner optimization just ensure that exactly c events lie inside bin I_b. All constraints are linear in the g_i, unlike in Eq. 15. The optimization problem in Eq. 16 is amenable to efficient inference: for a fixed c, the inner maximization is over real-valued gap variables g_i with a concave quadratic objective and linear constraints. Thus, for a given c, the optimal gap values can be efficiently found using any off-the-shelf QP solver. The outer maximization is over integer values of c, but a simple binary search between 0 and C_max solves it in log(C_max) time.

Algorithm 1: Inference of events in [T, T_e)
1: Input: trained event model p_θ and count model p_φ, event history H_T, end time T_e = T + n_e Δ.
2: Output: forecast events {e | t ∈ [T, T_e)}
3: E ← ∅   /* predicted events so far */
4: for b in [n_e] do
5:   ν_b, ρ_b ← count distribution from p_φ(· | H_T, I_b)
6:   h, C_max ← RNNStates(p_θ, H_T, E, ν_b, b)   /* set h• */
7:   /* solve the optimization problem in Eq. 16 */
8:   E ← E + OptimizeInBin(h, ν_b, ρ_b, C_max, I_b)
9: end for
10: Return E

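To illustrate the inner problem of Eq. 16, here is a sketch using cvxpy. Dropping constants, maximizing the Gaussian log-likelihood of the gaps is equivalent to minimizing the sum of squared standardized residuals. The names, the strict-inequality slack eps, and the outer linear scan over c (shown instead of binary search, for clarity) are illustrative assumptions:

import numpy as np
import cvxpy as cp

def solve_gaps(mu, sigma, c, t_rb, bin_start, bin_end, eps=1e-6):
    """Inner maximization of Eq. 16 for a fixed count c
    (1 <= c < len(mu)): place exactly c events inside the bin while
    keeping the gaps close to their unconstrained means."""
    g = cp.Variable(len(mu), nonneg=True)
    obj = cp.Minimize(cp.sum_squares(cp.multiply(g - mu, 1.0 / sigma)))
    cons = [t_rb + g[0] >= bin_start,                   # first event inside I_b
            t_rb + cp.sum(g[:c]) <= bin_end,            # c-th event inside I_b
            t_rb + cp.sum(g[:c + 1]) >= bin_end + eps]  # (c+1)-th event outside
    prob = cp.Problem(obj, cons)
    prob.solve()
    return g.value, -0.5 * prob.value   # gap log-likelihood up to constants

def solve_bin(mu, sigma, nu_b, rho_b, c_max, t_rb, bin_start, bin_end):
    """Outer maximization over c; a linear scan for clarity (the paper
    exploits unimodality of the objective in c via binary search).
    Edge cases c = 0 and c = len(mu) are omitted for brevity."""
    best_c, best_g, best_ll = None, None, -np.inf
    for c in range(1, min(c_max, len(mu) - 1) + 1):
        g, ll_gaps = solve_gaps(mu, sigma, c, t_rb, bin_start, bin_end)
        ll = ll_gaps - 0.5 * (c - nu_b) ** 2 / rho_b   # count log-likelihood
        if ll > best_ll:
            best_c, best_g, best_ll = c, g, ll
    return best_c, best_g

For a given bin, mu and sigma come from the unrolled RNN states h_{r_b+1}, ..., h_{r_b+C_max}, and (ν_b, ρ_b) from the count model.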

Let π‘βˆ—, π‘”βˆ—1, . . . , π‘”βˆ—π‘βˆ— denote the optimal solution. Using these we

expand the predicted event sequence from π‘Ÿπ‘ by π‘βˆ— more eventsas (οΏ½Μ‚οΏ½π‘Ÿπ‘+1, π‘‘π‘Ÿπ‘ + π‘”βˆ—1), . . . (οΏ½Μ‚οΏ½π‘Ÿπ‘+π‘βˆ— , π‘‘π‘Ÿπ‘ +

βˆ‘π‘βˆ—π‘–=1 𝑔

βˆ—π‘–). We append these to

𝑬𝑏 to get the new history of predicted events 𝑬𝑏+1 conditionedon which we predict events for the (𝑏 + 1)-th bin. The final set ofpredicted events are obtained after 𝑛𝑒 stages in 𝐸𝑛𝑒+1Choosing𝐢max . Let𝐢𝐸 denote the count of events in bin 𝐼𝑏 wheneach gap 𝑔𝑖 is set to its unconstrained optimum value of ` (.). Weobtain this value as we perform forward sampling from RNN stateπ’‰π‘Ÿπ‘ . The optimum value of 𝑐 from the count-only model is a𝑏 . Dueto the unimodal nature of the count model π‘πœ™ , one can show thatthe optimal π‘βˆ— lies between a𝑏 and 𝐢𝐸 . Thus, we set the value𝐢max = max(a𝑏 + 1,𝐢𝐸 ). Also, to protect against degenerate event-models that do not advance time of events, we upper bound 𝐢maxto be a𝑏 + πœŒπ‘ since the count model is significantly more accurate,and the optimum π‘βˆ— is close to its mode a𝑏 .Overall Algorithm. Algorithm 1 summarizes DualTPP’s infer-ence method. An example is shown in Figure 1. To predict theevents in the 𝑏-th bin, we first invoke the count model π‘πœ™ and getmean count a𝑏 , variance πœŒπ‘ . We then forward step through theevent RNN 𝑝\ after conditioning on previous events 𝐻𝑇 , 𝑬 . Wethen continue forward sampling until bin end or `𝑏 + 1, and returnthe visited RNN states, and number of steps 𝐢max. Now, we invokethe optimization problem in Eq. 16 to get the predicted events inthe 𝑏th bin which we then append to 𝑬 .

5 EXPERIMENTS

In this section, we evaluate our method against five state-of-the-art existing methods on four real datasets.

5.1 Datasets

We use four real-world datasets that have diverse characteristics in terms of their application domains and temporal statistics. We summarize the details of these datasets in Table 1.
Election. [8] This dataset contains tweets related to the presidential election results in the United States, collected from 7th April to 13th April, 2016. Here, given a tweet e, the mark m indicates the user who posted it and the time t indicates the time of the post.
Taxi. [2] This contains the pickup and drop-off timestamps and the pickup and drop-off locations of taxis in New York City from 1st Jan 2019 to 28th Feb 2019.


Dataset      | Train Size | E[δ] | σ[δ] | Avg. #Events in [T, T_e) | Bin Size (Δ)
Elections    | 51859      | 7.0  | 5.8  | 203                      | 7 mins
Taxi         | 399433     | 8.0  | 25.8 | 1254                     | 1 hour
Traffic-911  | 115463     | 778  | 1517 | 281                      | 1 day
EMS-911      | 182845     | 492  | 601  | 275                      | 12 hours

Table 1: Statistics of the datasets used in our experiments. Train Size denotes the number of events in the training set. E[δ] and σ[δ] denote the mean and variance of the inter-event arrival time.

The dataset is categorized by zones; in our experiments we only consider the pickup zone with zone id 237. We consider each trip as an event e = (m, t), with the pickup time denoted by t and the drop-off zone as the mark m.
Traffic-911. [1] This dataset consists of emergency calls related to road traffic in the US, in which each event contains the timestamp of the call and the location of the caller, which we treat as the mark.
EMS-911. [1] This dataset consists of emergency calls related to medical services in the US, in which each event contains the timestamp of the call and the location of the caller, which we treat as the mark.

For all datasets, we rank marks based on their occurrence frequency and keep the top 10 marks. The rest of the marks are merged into a single mark. Hence, we have 11 marks in each dataset.
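This preprocessing step can be done in a few lines of numpy; bucket_marks is a hypothetical helper, not from the released code:

import numpy as np

def bucket_marks(marks, k=10):
    """Keep the k most frequent marks and merge the rest into one
    catch-all mark, yielding k + 1 mark ids."""
    vals, counts = np.unique(marks, return_counts=True)
    top = set(vals[np.argsort(-counts)[:k]])
    # Re-index: frequent marks -> 0..k-1, everything else -> k.
    remap = {v: i for i, v in enumerate(sorted(top))}
    return np.array([remap.get(m, k) for m in marks])

marks = np.random.default_rng(0).integers(0, 50, size=1000)
print(np.unique(bucket_marks(marks)))   # at most 11 distinct ids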

5.2 Methods Compared

We compare DualTPP against five other methods spanning a varied set of loss functions and architectures. The first two (RMTPP, THP) are trained to predict the next event via intensity functions using maximum likelihood (Sec 3.1). The next two (WGAN and Seq2Seq) are trained to predict a number of future events using a sequence-level Wasserstein loss and are better suited for long-term forecasting. The last uses a two-level hierarchy to capture long-term dynamics. We present more details below:

RMTPP. RMTPP [13] is one of the earliest neural point process models; it uses a three-layer recurrent neural network to model the intensity function and mark distribution of an MTPP.

Transformer Hawkes Process (THP). THP [60] is more recent and uses Transformers [48] instead of RNNs to model the intensity function of the next event. THP leverages the positional encoding of the transformer model to encode the timestamp.

WGAN. Wasserstein TPPs [51] train a generative adversarial network to generate an event sequence. A homogeneous Poisson process provides the input noise to the generator of future events, which is trained via a Wasserstein discriminator loss to resemble real events. Since our predicted events are conditioned on the input history, we initialize the generator by encoding the known history of events using an RNN.

Seq2Seq. This is a conditional generative model [54] in which an encoder-decoder model for sequence-to-sequence learning is trained by maximizing the likelihood of the output sequence. Also added is a Wasserstein loss computed via a CNN-based discriminator.

Hierarchical Generation. We designed this method to explore whether hierarchical models [6, 46, 47] could be just as effective as our count model at capturing macroscopic dynamics. We create a two-level hierarchy where the top-level events are compound events of τ consecutive events. We train a second event-only model p_ψ(·|H_t) over the compound events to replace the count model. Using the trained models (p_θ, p_ψ) we perform inference similar to Eq. 16. However, since the compound model p_ψ imposes a distribution over every τ-th event, we solve the following optimization problem for every j-th compound event:

\max_{g_1,\ldots,g_\tau;\; g_i \in \mathbb{R}^+} \Big[ \sum_{i=1}^{\tau} \log \mathcal{N}\big(g_i;\, \mu(\mathbf{h}_{j\tau+i}),\, \sigma(\mathbf{h}_{j\tau+i})\big) + \log \mathcal{N}\big(\textstyle\sum_{i=1}^{\tau} g_i;\, \mu(\mathbf{h}^c_j),\, \sigma(\mathbf{h}^c_j)\big) \Big] \qquad (17)

Similar to Eq. 16, the maximization is over positive real-valued gap variables g_i and has a concave quadratic objective. Here, the number of stages is not fixed to n_e; instead, we stop when the last predicted timestamp exceeds T_e. A sketch of this per-compound-event solve appears below.
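Since maximizing Eq. 17 is equivalent to minimizing a convex quadratic in the gaps, each compound event can be solved with an off-the-shelf bounded optimizer. The sketch below is our own illustration under that observation; the inputs mu_e, sig_e (per-gap parameters from the event model) and mu_c, sig_c (parameters of the compound-event model) are assumed given as 1-d arrays and scalars, respectively.

```python
import numpy as np
from scipy.optimize import minimize

def solve_compound_gaps(mu_e, sig_e, mu_c, sig_c):
    """Maximize Eq. 17 for one compound event: a Gaussian term on each of
    the tau gaps plus one Gaussian term on their sum, subject to g_i >= 0.
    Maximizing the log likelihoods equals minimizing this convex quadratic."""
    mu_e, sig_e = np.asarray(mu_e, float), np.asarray(sig_e, float)
    tau = len(mu_e)

    def neg_log_lik(g):
        per_gap = ((g - mu_e) ** 2 / sig_e ** 2).sum()   # event-model terms
        total = (g.sum() - mu_c) ** 2 / sig_c ** 2        # compound-model term
        return 0.5 * (per_gap + total)                    # constants dropped

    res = minimize(neg_log_lik, x0=np.maximum(mu_e, 1e-6),
                   bounds=[(0.0, None)] * tau)
    return res.x  # optimal gaps g_1 ... g_tau
```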

5.3 Evaluation protocol

We create train-validation-test splits for each dataset by selecting the first 60% of time-ordered events as the training set, the next 20% as the validation set, and the remaining 20% as the test set. We chose the value of the bin size Δ so that each bin has at least five events on average while aligning with standard time periodicity, as shown in Table 1. A test 'instance' starts at a random time T_s within the test period, includes all events up to T = T_s + 20Δ as the known history H_T, and treats the interval between T and T_e = T + 3Δ as the forecast horizon. The average number of events in the forecast horizon ranges between 200 and 1250 across the four datasets (shown in Table 1). For training the count model p_φ we created instances using the same scheme. The event model p_θ just trains for the next event using random event subsequences of length 80.

Architectural Details. For the event model, we use a single-layer recurrent network with a GRU cell of 32 units. We fixed the batch size to 32 and used the Adam optimizer with learning rate 1e-3. The size of the embedding vector of a mark is set to 8. We train the event model for 10 epochs, checkpoint the model at the end of each epoch, and select the model that gives the least validation error. The count model is a feed-forward network with three hidden layers of 32 units, all with ReLU activation. The input layer of the count model has 40 units, corresponding to the counts of 20 input bins and the hour-of-day at the mid-point of each bin. The output layer predicts the Gaussian parameters ν_j, ρ_j for each future bin j.

Evaluation Metrics. We use three metrics to measure performance. First, we measure the Wasserstein distance between predicted and actual event sequences to assess the microscopic dynamics between events. Given true event times H := {t_1, …, t_{|H|}} in an interval [T_st, T_e) and the corresponding predicted events Ĥ := {t̂_1, …, t̂_{|Ĥ|}}, assuming without loss of generality |Ĥ| < |H|, we compute the Wasserstein distance³ [51] between the two sequences of events as

\mathrm{WassDist}(H, \hat{H}) = \sum_{i=1}^{|\hat{H}|} |t_i - \hat{t}_i| + \sum_{i=|\hat{H}|+1}^{|H|} (T_e - t_i) \qquad (18)
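A direct NumPy transcription of Eq. 18 might look as follows (a sketch under the assumption that both inputs are 1-d arrays of timestamps; the function name is ours):

```python
import numpy as np

def wass_dist(true_times, pred_times, t_end):
    """Eq. 18: match the shorter sequence against the prefix of the longer
    one, and charge each unmatched event the distance to the horizon end."""
    a, b = np.sort(true_times), np.sort(pred_times)
    if len(a) < len(b):      # make `a` the longer sequence, as in Eq. 18
        a, b = b, a
    k = len(b)
    return np.abs(a[:k] - b).sum() + (t_end - a[k:]).sum()
```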

³Here, the term 'Wasserstein distance' is overloaded. However, as shown in [51], for distributions with point masses, the Wasserstein distance simplifies to Eq. 18.


[Figure 2 shows three panels from the Taxi dataset, each plotting counts in 10-minute intervals over time for the true counts, RMTPP, and DualTPP.]

Figure 2: Anecdotal examples of the variation of counts against time, collected from the Taxi dataset. They show that DualTPP can mimic the high-level trajectory more accurately than RMTPP. In the second example, we observe that RMTPP and DualTPP show similar nowcasting performance, whereas DualTPP shows more accurate forecasting performance than RMTPP.

We randomly sample several such intervals [T_st, T_e) and report the average WassDist over all intervals. Second, to assess the macroscopic modeling component of each method, we define CountMAE, which measures the relative error in the predicted count over randomly sampled time intervals:

\mathrm{CountMAE} = \frac{1}{M} \sum_{i=1}^{M} \frac{\big|\, |\{e \mid t \in I^{(i)}\}| - |\{\hat{e} \mid \hat{t} \in I^{(i)}\}| \,\big|}{|\{e \mid t \in I^{(i)}\}|} \qquad (19)

where I^{(i)} is randomly sampled within the test horizon and we sample M such intervals. Finally, to evaluate the accuracy of the predicted discrete mark sequence, we compare our generated mark sequence with the true mark sequence (which could be of a different length) using the BLEU score popular in the NLP community [35].
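For concreteness, a small sketch of Eq. 19 follows (the interval representation and the guard against empty intervals are our own illustrative choices):

```python
import numpy as np

def count_mae(true_times, pred_times, intervals):
    """Eq. 19: mean relative count error over sampled (start, end) intervals."""
    errs = []
    for lo, hi in intervals:
        n_true = np.sum((true_times >= lo) & (true_times < hi))
        n_pred = np.sum((pred_times >= lo) & (pred_times < hi))
        errs.append(abs(n_true - n_pred) / max(n_true, 1))  # guard empty bins
    return float(np.mean(errs))
```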

5.4 Results

In this section, we first compare DualTPP against the five methods of Sec 5.2, then analyze how accurately it can forecast events in the distant future, and finally provide a thorough ablation study of DualTPP.

Comparative analysis. Here we compare DualTPP against the five state-of-the-art methods. The WGAN and Seq2Seq papers do not model marks, hence their BLEU scores are omitted. Table 2 summarizes the results, which reveal the following observations.

Dataset       Model            Wass. dist   BLEU Score   Count MAE
Elections     RMTPP [13]       1231         0.684        26.7
              TransMTPP [60]   1458         0.579        31.8
              WGAN [51]        442          -            10.0
              Seq2Seq [54]     739          -            15.9
              Hierarchical     415          0.880        8.5
              DualTPP          267          0.882        5.0
Taxi          RMTPP [13]       9826         0.089        288
              WGAN [51]        4060         -            128
              Seq2Seq [54]     5105         -            161
              Hierarchical     8838         0.088        206
              DualTPP          1923         0.090        39
Traffic-911   RMTPP [13]       2406         0.248        41.7
              TransMTPP [60]   6096         0.081        110.0
              WGAN [51]        3892         -            69.0
              Seq2Seq [54]     4520         -            83.0
              Hierarchical     1853         0.211        33.1
              DualTPP          1700         0.221        29.1
EMS-911       RMTPP [13]       2674         0.162        20.9
              TransMTPP [60]   5792         0.070        50.0
              WGAN [51]        2432         -            19.3
              Seq2Seq [54]     9856         -            90.3
              Hierarchical     1639         0.163        11.8
              DualTPP          1419         0.163        10.1

Table 2: Comparative analysis of our method against all baselines across all datasets in terms of WassDist, BLEUScore, and CountMAE. It shows that DualTPP consistently outperforms all the baselines.

(1) DualTPP achieves significant accuracy gains over all five methods in terms of all three metrics, i.e., CountMAE, WassDist, and BLEUScore. For some datasets, e.g., Taxi, the gains of our method are particularly striking: our error in counts is 39, while the closest alternative has almost three times higher error. Even for microscopic inter-event dynamics, as measured by the Wasserstein distance, we achieved a factor-of-two reduction. Figure 2 shows three anecdotal sequences comparing counts of events in different time intervals for DualTPP (blue) against the actual counts (black) and the RMTPP baseline (red). Notice how RMTPP drifts away whereas DualTPP tracks the actual counts.

(2) The Hierarchical variant of our method is the second-best performer, but its performance is substantially worse than DualTPP's, establishing that the alternative count-based perspective is as important as viewing events at different scales for accurate long-term forecasting. More specifically, the Hierarchical variant aggregates a fixed number of events, which makes it oblivious to the heterogeneous counts that can arise in an arbitrary time interval. DualTPP overcomes these limitations by means of both the event and the count model, which characterize the short-term and long-term characteristics of an event sequence, respectively.

(3) Both RMTPP and TransMTPP perform much worse than DualTPP. WGAN and Seq2Seq deliver unreliable performance, with large variance across datasets.

Performance on long-term forecasting. Next, we analyze the performance differences further by looking at errors in different forecast time intervals in the future, shown in Figure 3. Here, each tick on the X-axis gives the average number of gold events since the known history T, and on the Y-axis we show the Wasserstein distance for events predicted between the times of two consecutive ticks. We observe the expected pattern that events further into the future have larger error than closer events for all methods. However, DualTPP shows only a modest deterioration, whereas both RMTPP and Hierarchical deteriorate significantly. For example, in the leftmost plot on the Election dataset, the Wasserstein distance increases from 500 for the first 68 events to almost 800 for RMTPP, but only from 200 to 270 for DualTPP.

[Figure 3 shows three panels (Election, Taxi, and Traffic-911) plotting WassDist against the number of events predicted for DualTPP, RMTPP-d, and Hierarchical.]

Figure 3: Long-term forecasting of DualTPP, RMTPP-d, and Hierarchical across three datasets in terms of WassDist. RMTPP-d is RMTPP with a Gaussian density instead of an intensity. The X-axis denotes the average number of gold events since the known history T, and the Y-axis denotes the Wasserstein distance between gold and predicted events.

[Figure 4 plots WassDist against bin sizes Δ ranging from 15 minutes to 6 hours for DualTPP, RMTPP-d, and Hierarchical on the Taxi dataset.]

Figure 4: Long-term forecasting comparison on the Taxi dataset. The X-axis denotes the bin size used to train the count model p_φ and the Y-axis denotes the Wasserstein distance between true and predicted events.

We measure the sensitivity of our results to bin sizes by varying the bin size Δ, and correspondingly the forecast horizon [T, T_e = T + 3Δ]. Figure 4 shows the Wasserstein distance between true and predicted events across different bin sizes on the Taxi dataset. We find that DualTPP continues to perform better than the competing methods across all bin sizes.

Ablation Study. We perform an ablation study using variants of DualTPP to analyze which elements of our design contributed most to the observed gains. We evaluate these variants using the Wasserstein distance metric and summarize the results in Table 3.

First, we examine the performance of our Event-only model. We observe that the Event-only model performs much worse than DualTPP, establishing the importance of the count model in correcting its drift. We next compare with the Count-only model, where we first predict the count of events for the b-th bin (ν_b) and then randomly generate ν_b events in the b-th bin. In this case, marks are ignored.

Dataset       Model                             Wass dist
Election      DualTPP                           267
              Event-only                        633
              Count-only                        310
              DualTPP-with-intensity            271
              DualTPP-without-count-variance    272
Taxi          DualTPP                           1923
              Event-only                        5679
              Count-only                        1923
              DualTPP-with-intensity            1790
              DualTPP-without-count-variance    1916
Traffic-911   DualTPP                           1700
              Event-only                        1767
              Count-only                        2098
              DualTPP-with-intensity            2211
              DualTPP-without-count-variance    1746
EMS-911       DualTPP                           1419
              Event-only                        1485
              Count-only                        2186
              DualTPP-with-intensity            2318
              DualTPP-without-count-variance    1423

Table 3: Ablation study: comparison of DualTPP and its variants in terms of the Wasserstein distance between true and predicted events.

We observe that the Count-only model is worse than DualTPP but performs much better than the Event-only method.

Next, we analyze other finer characteristics of our model. In DualTPP, the event model uses a Gaussian density, whereas most existing TPP models (e.g., RMTPP and THP discussed earlier) use an intensity function. We create a version of DualTPP called DualTPP-with-intensity, where we model the distribution p_θ(·|H_t) using the conditional intensity of RMTPP. Comparing the two methods, we observe that the choice of Gaussian density also contributes significantly to the gains observed with DualTPP.

In the DualTPP-without-count-variance model, we predict the events in the b-th bin by solving the inner optimization problem in Eq. 16 only for the mean ν_b, thereby treating p_φ as a point distribution. We observe a performance drop, highlighting the benefit of modeling the uncertainty of the count distribution.

6 CONCLUSIONS

In this paper, we propose DualTPP, a novel MTPP model specifically designed for long-term forecasting of events. It consists of two components: the Event-model, which captures the dynamics of the underlying MTPP at a microscopic scale, and the Count-model, which captures the macroscopic dynamics. Such a model demands a fresh approach to inferring future events. We design a novel inference method that solves a sequence of efficient constrained quadratic programs to achieve consensus across the two models. Our experiments show that DualTPP achieves substantial accuracy gains over five competing methods in terms of all three metrics: the Wasserstein distance, which measures microscopic inter-event dynamics; CountMAE, which measures macroscopic count error; and the BLEU score, which evaluates the sequence of generated marks. Future work in the area could include capturing other, richer aggregate statistics of event sequences. Another interesting direction is providing inference procedures for answering aggregate queries directly.


REFERENCES

[1] 911 dataset. URL https://www.kaggle.com/mchirico/montcoalert.
[2] Taxi dataset. URL https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.
[3] I. Apostolopoulou, S. Linderman, K. Miller, and A. Dubrawski. Mutually regressive point processes. In NeurIPS, pages 5115–5126, 2019.
[4] S. Ben Taieb and A. Atiya. A bias and variance analysis for multistep-ahead time series forecasting. IEEE Transactions on Neural Networks and Learning Systems, 27(3), 2015.
[5] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West. The Markov modulated Poisson process and Markov Poisson cascade with applications to web traffic modeling. Bayesian Statistics, 2003.
[6] A. Borovykh, S. Bohte, and C. W. Oosterlee. Conditional time series forecasting with convolutional neural networks, 2017.
[7] R. Cai, X. Bai, Z. Wang, Y. Shi, P. Sondhi, and H. Wang. Modeling sequential online interactive behaviors with temporal point process. In CIKM, pages 873–882, 2018.
[8] A. De, S. Bhattacharya, and N. Ganguly. Demarcating endogenous and exogenous opinion diffusion process on social networks. In WWW, pages 549–558, 2018.
[9] P. Deshpande and S. Sarawagi. Streaming adaptation of deep forecasting models using adaptive recurrent units. In ACM SIGKDD, 2019.
[10] D. Deutsch, S. Upadhyay, and D. Roth. A general-purpose algorithm for constrained sequential inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL 2019), pages 482–492. Association for Computational Linguistics, 2019.
[11] N. Du, L. Song, M. Yuan, and A. J. Smola. Learning networks of heterogeneous influence. In NeurIPS, pages 2780–2788. Curran Associates, Inc., 2012.
[12] N. Du, L. Song, H. Woo, and H. Zha. Uncover topic-sensitive information diffusion networks. In Artificial Intelligence and Statistics, pages 229–237, 2013.
[13] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555–1564, 2016.
[14] E. Fersini, E. Messina, G. Felici, and D. Roth. Soft-constrained inference for named entity recognition. Information Processing & Management, 50(5):807–819, 2014.
[15] V. Filimonov and D. Sornette. Apparent criticality and calibration issues in the Hawkes self-excited point process model: application to high-frequency financial data. Quantitative Finance, 15(8):1293–1314, 2015.
[16] V. Flunkert, D. Salinas, and J. Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017.
[17] K. Giesecke and G. Schwenkler. Filtered likelihood for point processes. Journal of Econometrics, 204(1):33–53, 2018.
[18] R. Gupta, S. Sarawagi, and A. A. Diwan. Collective inference for extraction MRFs coupled with symmetric clique potentials. JMLR, 11, Nov. 2010.
[19] B. Hambly and A. Søjmark. An SPDE model for systemic risk with endogenous contagion. Finance and Stochastics, 23(3):535–594, 2019.
[20] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society: Series B (Methodological), 33(3):438–443, 1971.
[21] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
[22] A. G. Hawkes. Hawkes jump-diffusions and finance: a brief history and review. The European Journal of Finance, pages 1–15, 2020.
[23] V. Isham and M. Westcott. A self-correcting point process. Stochastic Processes and their Applications, 8(3):335–347, 1979.
[24] H. Jing and A. J. Smola. Neural survival recommender. In WSDM, pages 515–524, 2017.
[25] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.
[26] Q. Kong, M.-A. Rizoiu, and L. Xie. Modeling information cascades with self-exciting processes via generalized epidemic models. In WSDM, pages 286–294, 2020.
[27] S. Lamprier. A recurrent neural cascade-based model for continuous-time diffusion. In ICML, volume 97, pages 3632–3641. PMLR, 2019.
[28] V. Le Guen and N. Thome. Shape and time distortion loss for training deep time series forecasting models. In Advances in Neural Information Processing Systems 32, 2019.
[29] G. Loaiza-Ganem, S. Perkins, K. Schroeder, M. Churchland, and J. P. Cunningham. Deep random splines for point process intensity estimation of neural population data. In NeurIPS, pages 13346–13356, 2019.
[30] M. Maciak, O. Okhrin, and M. Pešta. Infinitely stochastic micro forecasting. arXiv preprint, 2019.
[31] H. Mei and J. M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In NeurIPS, pages 6754–6764, 2017.
[32] Y. Ogata. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379–402, 1998.
[33] M. Okawa, T. Iwata, T. Kurashima, Y. Tanaka, H. Toda, and N. Ueda. Deep mixture point processes: Spatio-temporal event prediction with rich contextual information. In KDD, pages 373–383, 2019.
[34] T. Omi, K. Aihara, et al. Fully neural network based model for general temporal point processes. In NeurIPS, pages 2122–2132, 2019.
[35] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 2002.
[36] V. Punyakanok, D. Roth, W. Yih, and D. Zimak. Learning and inference over constrained output. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1124–1129, 2005.
[37] Z. Qian, A. M. Alaa, A. Bellot, J. Rashbass, and M. van der Schaar. Learning dynamic and personalized comorbidity networks from event data using deep diffusion processes. arXiv preprint arXiv:2001.02585, 2020.
[38] S. Ramalingam, P. Kohli, K. Alahari, and P. H. S. Torr. Exact inference in multi-label CRFs with higher order cliques. In CVPR, 2008.
[39] M.-A. Rizoiu and L. X. Xie. Online popularity under promotion: Viral potential, forecasting, and the economics of time. In Eleventh International AAAI Conference on Web and Social Media, 2017.
[40] M.-A. Rizoiu, S. Mishra, Q. Kong, M. Carman, and L. Xie. SIR-Hawkes: linking epidemic models and Hawkes processes to model diffusions in finite populations. In WWW, pages 419–428, 2018.
[41] A. Saichev and D. Sornette. Generating functions and stability study of multivariate self-excited epidemic processes. The European Physical Journal B, 83(2):271, 2011.
[42] O. Shchur, M. Biloš, and S. Günnemann. Intensity-free learning of temporal point processes. arXiv preprint arXiv:1909.12127, 2019.
[43] D. Tarlow, I. Givoni, and R. Zemel. HOP-MAP: Efficient message passing with high order potentials. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 812–819. JMLR: W&CP, 2010.
[44] M. Trinh. Non-stationary processes and their application to financial high-frequency data. PhD thesis, University of Sussex, 2018.
[45] U. Upadhyay, A. De, and M. G. Rodriguez. Deep reinforcement learning of marked temporal point processes. In NeurIPS, pages 3168–3178, 2018.
[46] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
[47] B. Vassøy, M. Ruocco, E. de Souza da Silva, and E. Aune. Time is of the essence: a joint hierarchical RNN and point process model for time and item predictions. In Web Search and Data Mining, pages 591–599, 2019.
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
[49] A. Venkatraman, M. Hebert, and J. Bagnell. Improving multi-step prediction of learned time series models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[50] R. Wen, K. Torkkola, and B. Narayanaswamy. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
[51] S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems, pages 3247–3257, 2017.
[52] S. Xiao, J. Yan, M. Farajtabar, L. Song, X. Yang, and H. Zha. Joint modeling of event sequence and time series with attentional twin recurrent neural networks. arXiv preprint arXiv:1703.08524, 2017.
[53] S. Xiao, J. Yan, X. Yang, H. Zha, and S. M. Chu. Modeling the intensity function of point process via recurrent neural networks. In AAAI, 2017.
[54] S. Xiao, H. Xu, J. Yan, M. Farajtabar, X. Yang, L. Song, and H. Zha. Learning conditional generative models for temporal point processes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[55] S. Xiao, J. Yan, M. Farajtabar, L. Song, X. Yang, and H. Zha. Learning time series associated event sequences with recurrent point process networks. IEEE Transactions on Neural Networks and Learning Systems, 30(10):3124–3136, 2019.
[56] A. S. Yang. Modeling the Transmission Dynamics of Pertussis Using Recursive Point Process and SEIR Model. PhD thesis, UCLA, 2019.
[57] S.-H. Yang and H. Zha. Mixture of mutually exciting processes for viral diffusion. In International Conference on Machine Learning, pages 1–9, 2013.
[58] Y. Zhong, B. Xu, G.-T. Zhou, L. Bornn, and G. Mori. Time perception machine: Temporal point processes for the when, where and what of activity prediction. arXiv preprint arXiv:1808.04063, 2018.
[59] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics, pages 641–649, 2013.
[60] S. Zuo, H. Jiang, Z. Li, T. Zhao, and H. Zha. Transformer Hawkes process. arXiv preprint arXiv:2002.09291, 2020.