arXiv:2007.13794v2 [cs.LG] 7 Dec 2020


Proceedings of Machine Learning Research 1–29, 2020. Machine Learning for Health (ML4H) 2020.

Neural Temporal Point Processes for Modelling Electronic Health Records

Joseph Enguehard∗ [email protected], Babylon Health
Dan Busbridge∗, Babylon Health
Adam Bozson, Babylon Health
Claire Woodcock, Babylon Health
Nils Hammerla, Babylon Health

Abstract

The modelling of Electronic Health Records (EHRs) has the potential to drive more efficient allocation of healthcare resources, enabling early intervention strategies and advancing personalised healthcare. However, EHRs are challenging to model due to their realisation as noisy, multi-modal data occurring at irregular time intervals. To address their temporal nature, we treat EHRs as samples generated by a Temporal Point Process (TPP), enabling us to model what happened in an event with when it happened in a principled way. We gather and propose neural network parameterisations of TPPs, collectively referred to as Neural TPPs. We perform evaluations on synthetic EHRs as well as on a set of established benchmarks. We show that TPPs significantly outperform their non-TPP counterparts on EHRs. We also show that an assumption of many Neural TPPs, that the class distribution is conditionally independent of time, reduces performance on EHRs. Finally, our proposed attention-based Neural TPP performs favourably compared to existing models, whilst aligning with real world interpretability requirements, an important step towards a component of clinical decision support systems.

∗ Equal contribution. Joseph Enguehard and Dan Busbridge led the research, implemented the models and performed the experiments. Adam Bozson created the synthetic EHR datasets and provided insights during project inception. Claire Woodcock strategically guided the project and contributed the broader impact statement. Nils Y. Hammerla provided technical guidance throughout.

1. Introduction

Healthcare systems today are under intense pressure. Costs of care are increasing, resources are constrained, and outcomes are worsening (Topol, 2019). Given these intense challenges, healthcare stands as one of the most promising applications of Machine Learning (ML). In particular, interest in modelling Electronic Health Records (EHRs) has recently increased (Islam et al., 2017; Shickel et al., 2018; Darabi et al., 2019; Li et al., 2019; Rodrigues-Jr et al., 2019). Better EHR modelling at an institutional level could enable more effective allocation of healthcare (Gijsberts et al., 2015).

© 2020 J. Enguehard, D. Busbridge, A. Bozson, C. Woodcock & N. Hammerla.


At a clinical level, accurate models could provision tools which improve patient outcomes by automating labour-intensive administrative tasks (Dhruva et al., 2020) and developing early and optimal intervention strategies for doctors (Komorowski, 2019). However, facets of health evolve at differing rates, and EHRs are typically realised as noisy, multi-modal data occurring at irregular time intervals. This makes EHRs difficult to model using common ML methods.

We propose to address the temporal nature of longitudinal EHRs¹ by treating them as samples generated by a Temporal Point Process (TPP), a probabilistic framework which can deal with data occurring at irregular time intervals. Specifically, we use a Neural Network (NN) to approximate its density, which specifies the probability of the next event happening at a given time. We call these models Neural TPPs. We propose jointly modelling times and labels to capture the varying evolution rates within EHRs.

1. Longitudinal EHRs consist of comprehensive records of patients' clinical experience, usually spanning several decades. This type of data does not include ICU datasets, such as MIMIC-III (Johnson et al., 2016), which focus on a shorter time frame.

To test this, aware of the need for transparency in research (Pineau et al., 2020) and for sensitive handling of health data (Kalkman et al., 2019), we perform an extensive study on synthetic EHR datasets generated using the open source Synthea simulator (Walonoski et al., 2018). For completeness, we also perform evaluations on benchmark datasets commonly used in the TPP literature. We find that:

• Our proposed Neural TPPs, where labels are jointly modelled with time, significantly outperform those that treat them as conditionally independent; a common simplification in the TPP literature,

• Particularly in the case of TPPs on labelled data, metrics that decompose the performance in terms of both label and temporal accuracy should be reported,

• Some datasets used for benchmarking TPPs are easily solved by a time-independent model. On other datasets, including synthetic EHRs, TPPs outperform their non-TPP counterparts.

Finally, we present a Neural TPP whose attention-based mechanism provides interpretable information without compromising on performance. This is an essential contribution in the development of research for clinical applications. The wider impact of pursuing this direction is detailed in Appendix E.

2. Background

2.1. Temporal point processes

A TPP is a random process that generates a sequence of $N$ events $\mathcal{H} = \{(t_i, \mathcal{M}_i)\}_{i=1}^{N}$ within a given observation window $t_i \in [w^-, w^+]$. Each event consists of a set of labels $\mathcal{M}_i \subseteq \{1, \dots, M\}$ localised at times $t_{i-1} < t_i$. Labels may be independent or mutually exclusive, depending on the task.

A TPP is fully characterised through its conditional intensity $\lambda^*_m(t)$, for $t_{i-1} < t \le t_i$:

$$\lambda^*_m(t)\,dt = \lambda_m(t\,|\,\mathcal{H}_t)\,dt \qquad (1)$$
$$\phantom{\lambda^*_m(t)\,dt} = \Pr\left(t^m_i \in [t, t + dt)\,\middle|\,\mathcal{H}_t\right), \qquad (2)$$

which specifies the probability that a label $m$ occurs in the infinitesimal time interval $[t, t + dt)$ given past events $\mathcal{H}_t = \{t^m_i \in \mathcal{H}\,|\,t_i < t\}$. We follow Daley and Vere-Jones (2003) in using $\lambda^*_m(t) = \lambda_m(t\,|\,\mathcal{H}_t)$ to indicate that $\lambda^*_m(t)$ is conditioned on past events, and $t^m_i$ to refer to a label of type $m$ at time $t_i$.


Given a specified conditional intensity $\lambda^*_m(t)$, the conditional density $p^*_m(t)$ is, for $t_{i-1} < t \le t_i$,

$$p^*_m(t) = \lambda^*_m(t)\exp\left[-\sum_{n=1}^{M}\Lambda^*_n(t)\right], \qquad (3)$$

where $\Lambda^*_m(t)$ is the cumulative intensity:

$$\Lambda^*_m(t) = \Lambda_m(t\,|\,\mathcal{H}_t) = \int_{t_{i-1}}^{t}\lambda^*_m(t')\,dt'. \qquad (4)$$

In a multi-class setting, where exactly one label (indicated by the indicator function $\mathbb{1}$) is present in any event, the log-likelihood of a sequence $\mathcal{H}$ is a form of categorical cross-entropy:

$$\log p_{\text{multi-class}}(\mathcal{H}) = \underbrace{\sum_{m=1}^{M}\sum_{i=1}^{N}\mathbb{1}_{i,m}\log p^*_m(t_i)}_{\text{events at } t_1,\dots,t_N} - \underbrace{\sum_{m=1}^{M}\int_{t_N}^{w^+}\lambda^*_m(t')\,dt'}_{\text{no events between } t_N \text{ and } w^+}, \qquad (5)$$

where $\mathcal{H}_{t_0} = \mathcal{H}_{t_1} = \{\}$, and $t_0 = w^-$ but does not correspond to an event.

In a multi-label setting, where at least one label is present in any event, the log-likelihood of a sequence $\mathcal{H}$ is a form of binary cross-entropy:

$$\log p_{\text{multi-label}}(\mathcal{H}) = \log p_{\text{multi-class}}(\mathcal{H}) + \sum_{m=1}^{M}\sum_{i=1}^{N}\left(1 - \mathbb{1}_{i,m}\right)\log\left[1 - p^*_m(t_i)\right]. \qquad (6)$$

This setting should be especially useful for modelling EHRs, as a single medical consultation usually includes various events, such as diagnoses or prescriptions, all happening at the same time.
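To make Equation (5) concrete, the following is a minimal sketch of the multi-class sequence NLL in PyTorch, assuming a model has already produced the per-event log densities and the tail integral; all tensor names are illustrative.

```python
import torch

def multi_class_tpp_nll(log_p_event, cumulative_tail, one_hot_labels):
    """Minimal sketch of Equation (5) for one sequence (names illustrative).

    log_p_event:     (N, M) log conditional densities log p*_m(t_i)
    cumulative_tail: (M,)   integral of lambda*_m over [t_N, w+]
    one_hot_labels:  (N, M) indicators 1_{i,m} of the observed labels
    """
    # First term of Eq. (5): log density of the observed label at each event.
    event_term = (one_hot_labels * log_p_event).sum()
    # Second term: no events occur between the last event t_N and w+.
    survival_term = cumulative_tail.sum()
    return -(event_term - survival_term)
```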

2.2. Neural temporal point processes

Encoder-decoder architectures have proved effective for Natural Language Processing (NLP) (Cho et al., 2014; Kiros et al., 2015; Hill et al., 2016; Vaswani et al., 2017). Existing Neural TPPs also exhibit this structure: the encoder creates event representations based only on information about other events; the decoder takes these representations and the decoding time to produce a new representation. The output of the decoder produces the conditional intensity and conditional cumulative intensity at that decoding time. For more detail, see Figure 1.

3. Previous work

While a Neural TPP encoder can be readily chosen from existing sequence models such as Recurrent Neural Networks (RNNs), choosing a decoder is much less straightforward due to the integral in Equation (4). Existing work can be categorised based on the relationship between the conditional intensity $\lambda^*(t)$ and the conditional cumulative intensity $\Lambda^*(t)$. We focus on three approaches:

• Closed form likelihood: the conditional density $p^*(t)$ is closed form,

• Analytic conditional intensity: only $\lambda^*(t)$ is closed form; $\Lambda^*(t)$ is estimated numerically,

• Analytic cumulative conditional intensity: only $\Lambda^*(t)$ is closed form; $\lambda^*(t)$ is computed through differentiation.

3.1. Closed form likelihood

The closed form likelihood approach implies that the contribution from each event to the likelihood, $p^*_m(t) = \lambda^*_m(t)\exp[-\sum_{n=1}^{M}\Lambda^*_n(t)]$, is closed form.

A well-known example is the Hawkes process (Hawkes, 1971; Nickel and Le, 2020), which models the conditional intensity as

$$\lambda^*_m(t) = \mu_m + \sum_{n=1}^{M}\alpha_{m,n}\sum_{i\,:\,t^n_i < t}\exp\left[-\beta_{m,n}(t - t^n_i)\right]$$

with learnable parameters $\mu \in \mathbb{R}^M_{>0}$, $\alpha \in \mathbb{R}^{M\times M}_{\ge 0}$, and $\beta \in \mathbb{R}^{M\times M}$. The closed form of $\Lambda^*_m(t;\theta)$ comes from the simple exponential linear $t$-dependence of $\lambda^*_m(t;\theta)$, which limits the model's flexibility.

[Figure 1 diagram: $t$ and $\mathcal{H}$ → $\mathcal{H}_t$ → Enc( · ; θEnc) → $\mathcal{Z}_t$ → Dec( · , · ; θDec) → $\Lambda(t|\mathcal{H}_t;\theta)$, $\lambda(t|\mathcal{H}_t;\theta)$.]

Figure 1: Encoder/decoder architecture of Neural TPPs. Given a query time $t$, the sequence $\mathcal{H}$ is filtered to the events $\mathcal{H}_t$ in the past of $t$. The encoder maps $\mathcal{H}_t$ to continuous representations $\mathcal{Z}_t = \{z_i\}_{i=1}^{|\mathcal{H}_t|} = \mathrm{Enc}(\mathcal{H}_t;\theta_{\text{Enc}})$. Each $z_i$ can be considered a contextualised representation of the event at $t_i$. Given $\mathcal{Z}_t$ and $t$, the decoder outputs $\mathrm{Dec}(t, \mathcal{Z}_t;\theta_{\text{Dec}}) \in \mathbb{R}^M$, from which the conditional intensity and conditional cumulative intensity are derived without any learnable parameters.
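Returning to the Hawkes process, a direct translation of its intensity into code might look as follows; this is a sketch, all names are ours, and the loop form is chosen for clarity over speed.

```python
import torch

def hawkes_intensity(t, event_times, event_labels, mu, alpha, beta):
    """Sketch of the Hawkes conditional intensity lambda*_m(t).

    event_times:  (N,) past event times t_i with t_i < t
    event_labels: (N,) integer label n of each past event
    mu:    (M,)   base rates, mu > 0
    alpha: (M, M) excitation weights, alpha >= 0
    beta:  (M, M) exponential decay rates
    """
    intensity = mu.clone()
    for t_i, n in zip(event_times.tolist(), event_labels.tolist()):
        # Each past event of type n excites every label m, with a
        # contribution decaying exponentially in the elapsed time.
        intensity += alpha[:, n] * torch.exp(-beta[:, n] * (t - t_i))
    return intensity
```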

Du et al. (2016) leveraged the same closed form, conditioning an exponential linear decoder on the output of an RNN encoder (RMTPP). As with the Hawkes process, the model can only capture exponential dependence in time. Additionally, it assumes labels are conditionally independent of time given a history, which we will show is a limiting assumption in domains like EHR modelling.

An alternative to taking $\lambda^*(t)$ and $\Lambda^*(t)$ closed form is to take $p^*(t)$ closed form. Shchur et al. (2020) directly approximate the conditional density using a log-normal mixture (LNM). With a sufficiently large mixture, $p^*(t)$ can approximate any conditional density, making this a more flexible model than the RMTPP. However, as in Du et al. (2016), labels are modelled as conditionally independent of decoding time.

While no decode-time-dependent Neural TPP methods have been applied to modelling EHRs, Weiss and Page (2013) and Lian et al. (2015) apply conditional Poisson processes to various EHR datasets, Islam et al. (2017) combine a conditional Poisson process with a Gaussian distribution to predict in-hospital mortality using ICU data, and Zhang et al. (2020) use an individual heterogeneous conditional Poisson model on longitudinal EHRs for adverse drug reaction discovery.

3.2. Analytic conditional intensity

The analytic conditional intensity approach approximates the conditional intensity with a NN whose output is positive,

$$\lambda^*_m(t;\theta) = \mathrm{Dec}(t, \mathcal{Z}_t;\theta_{\text{Dec}})_m \in \mathbb{R}_{\ge 0},$$

and whose $t$-integral must be approximated numerically.

The positivity constraint is satisfied by requiring the final activation function to be positive. Early Neural TPPs employed an exponential activation (Du et al., 2016). Recently, the scaled softplus activation

$$\sigma_+(x_m) = s_m\log\left(1 + \exp(x_m/s_m)\right) \in \mathbb{R}_{>0}$$

with learnable $s \in \mathbb{R}^M$ has gained popularity (Mei and Eisner, 2017). Ultimately, writing a neural approximator for $\lambda^*_m(t)$ is relatively simple as there is no constraint on the NN architecture itself. Of course, this does not mean it is easy to train.


To approximate the integral, Monte Carlo (MC) estimation can be employed given a sampling strategy (Mei and Eisner, 2017). Zhu et al. (2020) also apply this strategy for spatio-temporal point processes, jointly modelling times and labels, but only use an embedding layer as an encoder.
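For instance, a uniform-sampling MC estimate of Equation (4) over one inter-event interval could be sketched as below; the naming is ours, and intensity_fn stands in for the decoder.

```python
import torch

def mc_cumulative_intensity(intensity_fn, t_start, t_end, n_samples=1):
    """MC estimate of the integral of lambda* over [t_start, t_end].

    intensity_fn: assumed callable mapping a scalar time to a (M,)
                  tensor of intensities lambda*_m(t).
    """
    # Uniform samples make the average an unbiased estimate of the mean
    # intensity; multiplying by the interval length gives the integral.
    samples = t_start + (t_end - t_start) * torch.rand(n_samples)
    values = torch.stack([intensity_fn(t) for t in samples])
    return (t_end - t_start) * values.mean(dim=0)
```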

3.3. Analytic conditional cumulative intensity

The analytic conditional cumulative intensity approach approximates the conditional cumulative intensity with a NN whose output is positive,

$$\Lambda^*_m(t;\theta) = \mathrm{Dec}(t, \mathcal{Z}_t;\theta_{\text{Dec}})_m \in \mathbb{R}_{\ge 0},$$

and whose derivative is positive and approximates the conditional intensity,

$$\lambda^*_m(t;\theta) = \frac{d}{dt}\,\mathrm{Dec}(t, \mathcal{Z}_t;\theta_{\text{Dec}})_m \in \mathbb{R}_{\ge 0}.$$

This derivative can be computed using backpropagation. While the cumulative intensity approach removes the variance induced by MC estimation, the monotonicity requirement entails specific constraints on the NN.

Omi et al. (2019) use a Multi-Layer Perceptron (MLP) with positive weights in order to model a monotonic decoder. It also uses tanh activation functions, and a softplus final activation to ensure a positive output. However, this model does not handle labels, and has been criticised by Shchur et al. (2020) for violating $\lim_{t\to+\infty}\Lambda^*_m(t;\theta) = +\infty$.

4. Proposed models

Our goal is to model the joint distribution of EHRs, which contain long-term, non-sequential dependencies. This makes the successful application of RNNs challenging due to their recency bias (Bahdanau et al., 2016; Ravfogel et al., 2019). Additionally, RNNs do not meet the explainability standard required for real-world application; although there are methods for interpreting RNN outputs (Samek and Muller, 2019), when deployed in healthcare applications, models should be interpretable without relying on auxiliary models (Rudin, 2019).

To address these challenges, we propose the use of attention. Attention is capable of dealing with long-range context (Liu et al., 2018), and provides some model explainability² without requiring auxiliary models. In this section we develop the necessary building blocks to define attention-based Neural TPPs that directly model either the conditional intensity $\lambda^*_m(t)$ or the conditional cumulative intensity $\Lambda^*_m(t)$, alongside several benchmark Neural TPPs. In total, we present 2 encoders (Section 4.1) and 7 decoders (Section 4.2) whose combinations yield the 14 Neural TPPs we evaluate³.

2. This interpretability may be limited to the importance of events in a patient record (Serrano and Smith, 2019), which without auxiliary models provides much clinical utility. For example, rapidly provisioning important information about a patient to a doctor to facilitate greater focus on patient need (Overhage and McCallie, 2020).

3. Our implementation of these models, as well as the experimental setup, can be found at https://github.com/babylonhealth/neuralTPPs.

4.1. Encoders

As in RMTPPs (Du et al., 2016), we use a Gated Recurrent Unit (GRU) encoder (GRU). We also evaluate a continuous-time self-attention encoder (Vaswani et al., 2017; Xiong et al., 2020) (SA). Precise forms are presented in Appendix B.1.

4.2. Decoders

Precise forms for all decoders are presentedin Appendix B.2.

4.2.1. Closed form likelihood

We take three decoders from the literature: a conditional Poisson process (CP), a RMTPP (Du et al., 2016), and a log-normal mixture (LNM) (Shchur et al., 2020).

4.2.2. Analytic conditional intensity

To model a positive $\lambda^*_m(t)$ we follow Mei and Eisner (2017): we apply a softplus activation to the final layer, and use MC estimation with a single, uniformly distributed sample per time interval.

Using this strategy, we propose a decoder based on an MLP (MLP-MC), and one based on an attention mechanism (Attn-MC).

4.2.3. Analytic conditional cumulative intensity

Finally, we consider modelling $\Lambda^*_m(t) = \mathrm{Dec}(t, \mathcal{Z}_t;\theta)_m$. This is more difficult, as the decoder must satisfy four properties:

1. $\Lambda^*_m(t) > 0$,
2. $\Lambda^*_m(t_i) = 0$,
3. $\lim_{t\to\infty}\Lambda^*_m(t) = \infty$,
4. $d\Lambda^*_m(t)/dt > 0$.

Property 1 is solved using a positive final activation. Property 2 is solved by parameterising the decoder as $\mathrm{Dec}(t, \mathcal{Z}_t;\theta) = f(t, \mathcal{Z}_t;\theta) - f(t_i, \mathcal{Z}_t;\theta)$. Property 3 can be satisfied in two ways. The approach we take is to add a Poisson term to the conditional intensity (see Section 4.3). An alternative is to use a non-saturating activation function, which we investigate in Appendix A.

Property 4 is the most challenging. If $f$ represents an $L$-layer NN, where the output of each layer $f_i$ is fed into the next $f_{i+1}$, we can write $f(t) = (f_L \circ \cdots \circ f_1)(t)$. Provided each step produces an output that is monotonic in its input, i.e. $df_i/df_{i-1} \ge 0$ and $df_1/dt \ge 0$, then $f(t)$ is a monotonic function of $t$.

Given that softmax, layer normalisation, trigonometric encoding, and projection are inconsistent with monotonicity, many modifications were required in order to use MLPs and attention networks for cumulative modelling. Given the importance of $t$-derivatives, we use the adaptive Gumbel activation (Farhadi et al., 2019), allowing learnable adaptation of first derivatives. All proposed modifications are presented in Appendix A.

Using these modifications, we propose two cumulative decoders: an MLP (MLP-CM) and an attention mechanism (Attn-CM), which handle labels and address the criticisms of Shchur et al. (2020).
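A minimal sketch of this cumulative parameterisation, assuming a monotonic network f(t, z) with scalar output (M = 1; the per-label case requires the double backward trick of Appendix A.6):

```python
import torch

def cumulative_decoder(f, t, t_prev, z):
    """Sketch of the Section 4.2.3 parameterisation (names are ours).

    Property 2 is met by subtracting f at the previous event time, and
    lambda* is recovered from Lambda* by differentiation; Property 4
    (lam >= 0) holds provided f is monotonic in t.
    """
    t = t.clone().requires_grad_(True)
    big_lambda = f(t, z) - f(t_prev, z)  # Lambda*(t), zero at t = t_prev
    (lam,) = torch.autograd.grad(big_lambda.sum(), t, create_graph=True)
    return big_lambda, lam
```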

4.3. Base intensity

To every model⁴ we add a Poisson process⁵:

$$\lambda^*_{\text{Total}}(t) = \alpha_1\,\mu + \alpha_2\,\lambda^*(t), \qquad (7)$$

where $\alpha = \mathrm{softmax}(a)$, with $a$ learnable.

We found that initialising $a$ such that $\alpha_1 \sim e^3\,\alpha_2$, i.e. starting the combined TPP as mostly Poisson, aided convergence.

4. We do not add a Poisson term to the conditional Poisson process. Moreover, the original RMTPP does not include this Poisson term; however, we include it to provide a fairer comparison.

5. In the TPP literature, the Poisson term is often referred to as the exogenous intensity, and the time-dependent piece the endogenous impact.
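A sketch of Equation (7) as a module, assuming the initialisation a = (3, 0) so that α₁ ≈ e³ α₂ (mostly Poisson at the start of training); the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class BaseIntensityMix(nn.Module):
    """Sketch of Eq. (7): softmax-weighted mix of a constant Poisson
    rate and the model intensity (class and names are ours)."""
    def __init__(self, n_labels):
        super().__init__()
        self.mu = nn.Parameter(torch.ones(n_labels))      # exogenous rate
        self.a = nn.Parameter(torch.tensor([3.0, 0.0]))   # alpha_1 ~ e^3 alpha_2

    def forward(self, model_intensity):
        alpha = torch.softmax(self.a, dim=0)
        return alpha[0] * self.mu + alpha[1] * model_intensity
```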

5. Evaluation

5.1. Datasets

Dataset statistics are summarised in Table 1.

Hawkes Processes This synthetic data gives us access to a theoretically infinite amount of data. Moreover, as we know the intensity function, we were able to compare each of our modelled intensities against the true one. We designed one dataset consisting of two independent processes, Hawkes (independent), and a second allowing interactions between the two processes, Hawkes (dependent).



Table 1: Properties of datasets used for evaluation. Train/Valid/Test give the number of sequences in each split; Batch is the batch size.

Dataset          # classes  Task type    # events   Avg. length  Train   Valid  Test   Batch
Hawkes (ind.)    2          Multi-class  457,788    19           16,384  4,096  4,096  512
Hawkes (dep.)    2          Multi-class  607,512    25           16,384  4,096  4,096  512
MIMIC-II         75         Multi-class  2,419      4            585     NA     65     65
Stack Overflow   22         Multi-class  480,413    72           5,307   NA     1,326  32
Retweets         3          Multi-label  2,087,866  104          16,000  2,000  2,000  256
Ear infection    39         Multi-label  14,810     2            8,179   1,022  1,023  512
Synthea          357        Multi-label  496,625    43           10,524  585    585    64

Baselines We also compared our models on datasets commonly used to evaluate TPPs. We use: MIMIC-II⁶, a medical dataset of clinical visits to Intensive Care Units; Stack Overflow, which classifies users of this question-answering website; and Retweets, which consists of streams of retweet events by different types of Twitter users. Further details about the datasets and their preprocessing can be found in Du et al. (2016); Mei and Eisner (2017).

Synthetic EHRs We used the Synthea simulator (Walonoski et al., 2018), which generates patient-level EHRs using human-expert-curated Markov processes.

We created a dataset utilising all of Synthea's modules, Synthea (Full), as well as a dataset generated solely from the ear infection module, Synthea (Ear Infection)⁷. For each task, we generated approximately 10,000 patients, and we kept 10% of the data for validation and testing, using 5 different randomly chosen folds. Moreover, to optimise GPU utilisation, we truncated sequences longer than 400 events, which amounts to around 0.4% of the dataset.

We filtered out all events except conditions and medications. For each patient, we transformed event times by subtracting their birth date, then multiplying by $10^{-5}$. When provided, their death date is used as the end of the observation window $w^+$; otherwise we use the latest event in their history.

6. Although this dataset contains EHRs, they are not longitudinal. We only include it as a known TPP benchmark.

7. Ear infection was selected for a simpler benchmark as the generation process demonstrated temporal dependence.

5.2. Hyperparameters

Encoder and decoder were set to one layer each. The size of each layer was 8 for the small datasets (Hawkes and MIMIC-II), 32 for the medium ones (Ear infection, Stack Overflow and Retweets), and 64 for Synthea. Batch size was adjusted per dataset due to GPU memory availability (see Table 1). The Adam optimizer (Kingma and Ba, 2017) was used with the Noam learning rate scheduler (Vaswani et al., 2017). The scheduler had a maximum learning rate of 0.01 and 10 warm-up epochs, allowing the Poisson component to be adjusted before tuning the temporal piece. All our models were trained using Maximum Likelihood Estimation (MLE), and early stopping was employed with a patience of 100 epochs.

5.3. Metrics

Models were evaluated using a weighted F1 metric for multi-class datasets, and a weighted ROC-AUC metric for the multi-label ones. This enabled us to assess a model's ability to predict the label(s) of the next event given the time it occurred and the previous events.


Table 2: Evaluation on multi-class tasks. In both tables, (†) indicates a model newly presented in this work. Best performance and performances whose confidence intervals overlap the best are boldened.

                 MIMIC-II                    Stack Overflow              Hawkes (ind.)  Hawkes (dep.)
Decoder          F1 score     NLL/time       F1 score     NLL/time       NLL/time       NLL/time

Encoder: GRU
CP               .691 (.083)  6.78 (1.99)    .325 (.004)  .553 (.003)    .623 (.002)    .774 (.004)
RMTPP            .215 (.12)   11.4 (3.57)    .284 (.005)  .592 (.006)    .610 (.004)    .731 (.004)
LNM              .705 (.17)   6.33 (.37)     .314 (.003)  .548 (.004)    .605 (.001)    .727 (.001)
MLP-CM†          .166 (.083)  12.2 (3.57)    .305 (.016)  .573 (.004)    .609 (.005)    .728 (.003)
MLP-MC†          .665 (.050)  9.45 (3.87)    .333 (.008)  .540 (.006)    .605 (.001)    .727 (.001)
Attn-CM†         .189 (.10)   10.4 (2.34)    .286 (.017)  .589 (.012)    .613 (.003)    .732 (.002)
Attn-MC†         .641 (.131)  5.27 (2.05)    .314 (.006)  .560 (.005)    .612 (.005)    .729 (.002)

Encoder: SA
CP               .686 (.067)  7.26 (1.44)    .326 (.003)  .555 (.003)    .622 (.001)    .774 (.002)
RMTPP            .709 (.076)  4.24 (2.66)    .288 (.002)  .592 (.002)    .609 (.005)    .734 (.002)
LNM              .632 (.20)   7.76 (2.42)    .305 (.003)  .561 (.006)    .607 (.005)    .730 (.004)
MLP-CM†          .159 (.088)  12.7 (3.45)    .292 (.003)  .634 (.042)    .611 (.005)    .733 (.002)
MLP-MC†          .498 (.150)  10.82 (2.13)   .327 (.016)  .557 (.015)    .606 (.002)    .730 (.001)
Attn-CM†         .186 (.082)  11.4 (3.90)    .269 (.011)  .665 (.047)    .614 (.001)    .733 (.001)
Attn-MC†         .648 (.098)  4.61 (2.49)    .342 (.006)  .543 (.005)    .605 (.001)    .728 (.001)

[Figure 2: two panels of intensity functions, (a) GRU-Conditional-Poisson and (b) GRU-Attn-MC, each showing λ(t) for the labels "Oral tablet" and "Otitis media" against age (years).]

Figure 2: Intensity functions on several labels applied to the same EHR. We can see that the intensity of the GRU-Attn-MC decreases more smoothly than the GRU-CP on the Otitis media label and is able to spike on the recurring medicine prescription. This enables the GRU-Attn-MC to achieve a lower NLL than the GRU-CP.

Negative Log Likelihood (NLL) normalised by time⁸ (NLL/time) was also used to evaluate models. This metric enables us to assess a model's ability to predict the label(s) of the next event, as well as when that event will happen given the previous events.

8. Time normalisation makes the metric meaningful over sequences defined on different time intervals.

We want to stress that neither of these metrics is a perfect measure of performance in isolation. Indeed, F1 and ROC-AUC reward a model for accurate predictions given an event instance, but do not penalise a model for inaccurate predictions in the inter-event interval. This happens in Figure 2.

Alternatively, the NLL is prone to overfitting: if a Neural TPP accurately guesses when an event will happen, it has an incentive to endlessly increase the intensity at that time, reducing the overall NLL without improving the resulting model meaningfully. Ideally, a model should perform well on both metrics.


Table 3: Evaluation on multi-label tasks.

                 Retweets                    Synthea (Ear infection)     Synthea (Full)
Decoder          ROC-AUC      NLL/time       ROC-AUC      NLL/time       ROC-AUC      NLL/time

Encoder: GRU
CP               .611 (.001)  5.71 (1.02)    .786 (.015)  37.6 (7.76)    .850 (.014)  89.6 (3.15)
RMTPP            .532 (.003)  -9.05 (1.40)   .672 (.013)  36.7 (4.25)    .616 (.043)  62.2 (6.63)
LNM              .521 (.010)  -10.9 (2.11)   .765 (.008)  26.7 (5.35)    .770 (.010)  9.49 (6.37)
MLP-CM†          .533 (.001)  -10.4 (.301)   .742 (.050)  30.4 (7.03)    .692 (.051)  34.2 (14.5)
MLP-MC†          .535 (.001)  -10.3 (.186)   .838 (.024)  29.2 (9.09)    .789 (.024)  15.7 (9.57)
Attn-CM†         .526 (.001)  -10.0 (.535)   .799 (.028)  27.1 (4.95)    .508 (.000)  80.9 (2.51)
Attn-MC†         .532 (.001)  -9.14 (.358)   .857 (.005)  25.1 (4.45)    .822 (.006)  21.9 (2.05)

Encoder: SA
CP               .608 (.001)  4.85 (.021)    .792 (.009)  38.6 (8.96)    .825 (.020)  91.2 (4.24)
RMTPP            .535 (.001)  -8.70 (.087)   .675 (.068)  42.9 (9.01)    .593 (.029)  67.8 (6.13)
LNM              .528 (.007)  -10.9 (1.29)   .767 (.007)  26.2 (5.51)    .760 (.014)  27.5 (27.2)
MLP-CM†          .512 (.001)  -8.93 (.180)   .684 (.094)  33.8 (15.3)    .589 (.123)  62.8 (30.8)
MLP-MC†          .533 (.001)  -9.21 (.174)   .849 (.016)  24.9 (5.10)    .796 (.007)  18.5 (1.20)
Attn-CM†         .517 (.001)  -8.54 (.168)   .697 (.059)  31.4 (6.90)    .503 (.004)  85.0 (3.01)
Attn-MC†         .534 (.001)  -9.48 (.188)   .855 (.010)  24.5 (4.56)    .805 (.007)  27.1 (4.77)


6. Results

We report our results on multi-class tasks (Table 2) and on multi-label tasks (Table 3).

First, we note that the GRU-CP has a competitive NLL and F1 score on the MIMIC-II and Stack Overflow datasets. As the CP decoder is decode-time-independent, this result indicates that MIMIC-II and Stack Overflow may not be appropriate for benchmarking TPPs. We suggest caution for future work using these datasets. All other datasets appear suitable. We focus here on EHRs, and discuss the Hawkes and Retweets results in Appendix C.

On EHRs, we see that modelling time and labels jointly yields better results than assuming them independent given the history. Indeed, most of our models outperform the RMTPP on all metrics, and, while the LNM is the best in terms of NLL, it achieves a significantly lower ROC-AUC score than all of our models except those using an Attn-CM decoder. Overall, our MC-based models outperform the CP in terms of NLL/time, and the LNM in terms of ROC-AUC.

Despite Attn-MC decoders performing strongly across all baseline datasets, as well as on Synthea (Ear Infection), this performance gap is not as wide on Synthea (Full). Studies have found Synthea to be medically representative (Walonoski et al., 2018; Chen et al., 2019). However, it is possible that noise in Synthea (Full) may be obfuscating the temporal progression of diseases (non-clinical investigation of the records showed 36.5% received a diagnosis of prediabetes after being diagnosed with diabetes). Applying Neural TPPs to genuine EHRs would be a fruitful line of future enquiry.

Although the cumulative-based models have theoretical benefits compared with their Monte Carlo counterparts, they are in practice harder to train. This may explain their relatively poor performance on datasets such as MIMIC-II and Synthea (Full), and the high variance between F1/ROC-AUC scores on these datasets.

7. Conclusion

In this work we gathered and proposed several neural network parameterisations of TPPs, evaluating them on synthetic EHRs, as well as common benchmarks.


Given the significant out-performance of our models on synthetic EHRs, labels should be modelled jointly with time, rather than treated as conditionally independent; a common simplification in the TPP literature. We also note that common TPP metrics are not good performance indicators on their own. For a fair comparison we recommend using multiple metrics, each capturing distinct performance characteristics.

By employing a simple test that checks a dataset's validity for benchmarking TPPs, we demonstrated potential issues within several widely used benchmark datasets. We recommend caution when using those datasets, and that our test be run as a sanity check on any new evaluation task.

Finally, we have demonstrated that attention-based TPPs appear to transmit pertinent EHR information and perform favourably compared to existing models. This is an exciting line for further enquiry in EHR modelling, where human interpretability is essential. Future work should apply these models to real EHR data to investigate medical relevance.

Acknowledgements

We are grateful to Sheldon Hall, Jeremie Vallee and Max Wasylow for assistance with experimental infrastructure, and to Kristian Boda, Sunir Gohil, Kostis Gourgoulias, Claudia Schulz and Vitalii Zhelezniak for fruitful comments and discussion.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat], May 2016.

Ramnath Balasubramanian, Ari Libarikian, and Doug McElhaney. Insurance 2030 – The impact of AI on the future of insurance, 2018.

Karen L. Calderone. The influence of gender on the frequency of pain and sedative medication administered to postoperative patients. Sex Roles, 23(11):713–725, December 1990. ISSN 1573-2762. doi: 10/d87j9w.

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15, pages 1721–1730, Sydney, NSW, Australia, 2015. ACM Press. ISBN 978-1-4503-3664-2. doi: 10/gftgxk.

Junqiao Chen, David Chun, Milesh Patel, Epson Chiang, and Jesse James. The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures. BMC Medical Informatics and Decision Making, 19(1), March 2019. doi: 10.1186/s12911-019-0793-0. URL https://doi.org/10.1186/s12911-019-0793-0.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat], September 2014.

Daryl J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, New York, 2nd edition, 2003. ISBN 978-0-387-95541-4, 978-0-387-21337-8, 978-0-387-49835-5.

Sajad Darabi, Mohammad Kachuee, and Majid Sarrafzadeh. Unsupervised Representation for EHR Signals and Codes as Patient Status Vector. arXiv:1910.01803 [cs, stat], October 2019.

Claudia de Freitas and Graham Martin. Inclusive public participation in health: Policy, practice and theoretical contributions to promote the involvement of marginalised groups in healthcare. Social Science & Medicine (1982), 135:31–39, June 2015. ISSN 1873-5347. doi: 10/f7gh39.

Sanket S. Dhruva, Joseph S. Ross, Joseph G. Akar, Brittany Caldwell, Karla Childers, Wing Chow, Laura Ciaccio, Paul Coplan, Jun Dong, Hayley J. Dykhoff, Stephen Johnston, Todd Kellogg, Cynthia Long, Peter A. Noseworthy, Kurt Roberts, Anindita Saha, Andrew Yoo, and Nilay D. Shah. Aggregating multiple real-world data sources using a patient-centered health-data-sharing platform. npj Digital Medicine, 3(1), April 2020. doi: 10.1038/s41746-020-0265-z. URL https://doi.org/10.1038/s41746-020-0265-z.

Ezio di Nucci. Should we be afraid of medical AI? Journal of Medical Ethics, 45(8):556–558, August 2019. ISSN 0306-6800, 1473-4257. doi: 10/ggsqsf.

Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent Marked Temporal Point Processes: Embedding Event History to Vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, pages 1555–1564, San Francisco, California, USA, 2016. ACM Press. ISBN 978-1-4503-4232-2. doi: 10/ggntmn.

Farnoush Farhadi, Vahid Partovi Nia, and Andrea Lodi. Activation Adaptation in Neural Networks. arXiv:1901.09849 [cs, stat], September 2019.

Michael A. Gara, Shula Minsky, Steven M. Silverstein, Theresa Miskimen, and Stephen M. Strakowski. A Naturalistic Study of Racial Disparities in Diagnoses at an Outpatient Behavioral Health Clinic. Psychiatric Services, 70(2):130–134, December 2018. ISSN 1075-2730. doi: 10/drvz.

Crystel M. Gijsberts, Karlijn A. Groenewegen, Imo E. Hoefer, Marinus J. C. Eijkemans, Folkert W. Asselbergs, Todd J. Anderson, Annie R. Britton, Jacqueline M. Dekker, Gunnar Engstrom, Greg W. Evans, Jacqueline de Graaf, Diederick E. Grobbee, Bo Hedblad, Suzanne Holewijn, Ai Ikeda, Kazuo Kitagawa, Akihiko Kitamura, Dominique P. V. de Kleijn, Eva M. Lonn, Matthias W. Lorenz, Ellisiv B. Mathiesen, Giel Nijpels, Shuhei Okazaki, Daniel H. O'Leary, Gerard Pasterkamp, Sanne A. E. Peters, Joseph F. Polak, Jacqueline F. Price, Christine Robertson, Christopher M. Rembold, Maria Rosvall, Tatjana Rundek, Jukka T. Salonen, Matthias Sitzer, Coen D. A. Stehouwer, Michiel L. Bots, and Hester M. den Ruijter. Race/Ethnic Differences in the Associations of the Framingham Risk Factors with Carotid IMT and Cardiovascular Events. PloS One, 10(7):e0132321, 2015. ISSN 1932-6203. doi: 10/f3nj9f.

Mark L. Graber. The incidence of diagnostic error in medicine. BMJ Quality & Safety, 22(Suppl 2):ii21–ii27, October 2013. ISSN 2044-5415, 2044-5423. doi: 10/f5db55.

Felix Greaves, Indra Joshi, Mark Campbell, Samantha Roberts, Neelam Patel, and John Powell. What is an appropriate level of evidence for a digital health intervention? The Lancet, 392(10165):2665–2667, December 2018. ISSN 01406736. doi: 10/gfpfnn.

Ben Green and Yiling Chen. Disparate Interactions: An Algorithm-in-the-Loop Analysis of Fairness in Risk Assessments. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* '19, pages 90–99, Atlanta, GA, USA, 2019. ACM Press. ISBN 978-1-4503-6125-5. doi: 10/gftmqh.

Alan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971. ISSN 0006-3444, 1464-3510. doi: 10/fj27m2.

William H. Herman and Robert M. Cohen. Racial and Ethnic Differences in the Relationship between HbA1c and Blood Glucose: Implications for the Diagnosis of Diabetes. The Journal of Clinical Endocrinology and Metabolism, 97(4):1067–1072, April 2012. ISSN 0021-972X, 1945-7197. doi: 10.1210/jc.2011-1894.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning Distributed Representations of Sentences from Unlabelled Data. arXiv:1602.03483 [cs], February 2016.

Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. What do we need to build explainable AI systems for the medical domain? arXiv:1712.09923 [cs, stat], December 2017.

Kazi T. Islam, Christian R. Shelton, Juan I. Casse, and Randall Wetzel. Marked Point Process for Severity of Illness Assessment. In Finale Doshi-Velez, Jim Fackler, David Kale, Rajesh Ranganath, Byron Wallace, and Jenna Wiens, editors, Proceedings of the 2nd Machine Learning for Healthcare Conference, volume 68 of Proceedings of Machine Learning Research, pages 255–270, Boston, Massachusetts, August 2017. PMLR.

Gaurav Jetley and He Zhang. Electronic health records in IS research: Quality issues, essential thresholds and remedial actions. Decision Support Systems, 126:113137, November 2019. ISSN 0167-9236. doi: 10/ggfwd5.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

Tamara Jugov and Lea Ypi. Structural Injustice, Epistemic Opacity, and the Responsibilities of the Oppressed. Journal of Social Philosophy, 50(1):7–27, 2019. ISSN 1467-9833. doi: 10/ggx48b.

Shona Kalkman, Menno Mostert, Christoph Gerlinger, Johannes J. M. van Delden, and Ghislaine J. M. W. van Thiel. Responsible data sharing in international health research: A systematic review of principles and norms. BMC Medical Ethics, 20(1):21, March 2019. ISSN 1472-6939. doi: 10/ggx47x.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017.

Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-Thought Vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc., 2015.

E. Kleinpeter. Four Ethical Issues of "E-Health". IRBM, 38(5):245–249, October 2017. ISSN 1959-0318. doi: 10/ggx479.

Matthieu Komorowski. Clinical management of sepsis can be improved by artificial intelligence: yes. Intensive Care Medicine, 46(2):375–377, December 2019. doi: 10.1007/s00134-019-05898-2. URL https://doi.org/10.1007/s00134-019-05898-2.

Yikuan Li, Shishir Rao, Jose Roberto Ayala Solares, Abdelaali Hassaine, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. BEHRT: Transformer for Electronic Health Records. arXiv:1907.09538 [cs, stat], July 2019.

Li Shanshan, Fonarow Gregg C., Mukamal Kenneth J., Liang Li, Schulte Phillip J., Smith Eric E., DeVore Adam, Hernandez Adrian F., Peterson Eric D., and Bhatt Deepak L. Sex and Race/Ethnicity-Related Disparities in Care and Outcomes After Hospitalization for Coronary Artery Disease Among Older Adults. Circulation: Cardiovascular Quality and Outcomes, 9(2 suppl 1):S36–S44, February 2016. doi: 10/ggx476.

Wenzhao Lian, Ricardo Henao, Vinayak Rao, Joseph Lucas, and Lawrence Carin. A multitask point process predictive model. In International Conference on Machine Learning, pages 2030–2038, 2015.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.

Hongyuan Mei and Jason M. Eisner. The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6754–6764. Curran Associates, Inc., 2017.

Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, February 2019. ISSN 00043702. doi: 10/gfwcxw.

Brent Daniel Mittelstadt, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, and Luciano Floridi. The ethics of algorithms: Mapping the debate. Big Data & Society, 3(2):205395171667967, December 2016. ISSN 2053-9517. doi: 10/gcdx92.

Jessica Morley and Luciano Floridi. Enabling digital health companionship is better than empowerment. The Lancet Digital Health, 1(4):e155–e156, August 2019a. ISSN 2589-7500. doi: 10/ggx478.

Jessica Morley and Luciano Floridi. The Limits of Empowerment: How to Reframe the Role of mHealth Tools in the Healthcare Ecosystem. Science and Engineering Ethics, June 2019b. ISSN 1471-5546. doi: 10/gf3w62.

Jessica Morley and Luciano Floridi. An ethically mindful approach to AI for health care. The Lancet, 395(10220):254–255, January 2020. ISSN 01406736. doi: 10/ggx592.

Jessica Morley, Caio Machado, Christopher Burr, Josh Cowls, Mariarosaria Taddeo, and Luciano Floridi. The Debate on the Ethics of AI in Health Care: A Reconstruction and Critical Review. SSRN Scholarly Paper ID 3486518, Social Science Research Network, Rochester, NY, November 2019.

Maximilian Nickel and Matthew Le. Learning Multivariate Hawkes Processes at Scale. arXiv:2002.12501 [cs, stat], February 2020.

Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, October 2019. ISSN 0036-8075, 1095-9203. doi: 10/ggchqk.

Office for National Statistics. Avoidable mortality in the UK, 2017.

Takahiro Omi, Naonori Ueda, and Kazuyuki Aihara. Fully Neural Network based Model for General Temporal Point Processes. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2122–2132. Curran Associates, Inc., 2019.

J. Marc Overhage and David McCallie. Physician time spent using the electronic health record during outpatient encounters. Annals of Internal Medicine, 172(3):169, January 2020. doi: 10.7326/m18-3684. URL https://doi.org/10.7326/m18-3684.

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Lariviere, Alina Beygelzimer, Florence d'Alche-Buc, Emily Fox, and Hugo Larochelle. Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program). arXiv:2003.12206 [cs, stat], April 2020.

Wullianallur Raghupathi and Viju Raghupathi. An Empirical Study of Chronic Diseases in the United States: A Visual Analytics Approach to Public Health. International Journal of Environmental Research and Public Health, 15(3), March 2018. ISSN 1661-7827. doi: 10/gdc634.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. Studying the Inductive Biases of RNNs with Synthetic Variations of Natural Languages. arXiv:1903.06400 [cs], March 2019.

Jose F. Rodrigues-Jr, Gabriel Spadon, Bruno Brandoli, and Sihem Amer-Yahia. Patient trajectory prediction in the Mimic-III dataset, challenges and pitfalls. arXiv:1909.04605 [cs, stat], September 2019.

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, May 2019. ISSN 2522-5839. doi: 10/gf4tp9.

Wojciech Samek and Klaus-Robert Muller. Towards explainable artificial intelligence. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 5–22. Springer, 2019.

Sofia Serrano and Noah A. Smith. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1282. URL https://www.aclweb.org/anthology/P19-1282.

Oleksandr Shchur, Marin Bilos, and Stephan Gunnemann. Intensity-Free Learning of Temporal Point Processes. arXiv:1909.12127 [cs, stat], January 2020.

Benjamin Shickel, Patrick Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, September 2018. ISSN 2168-2194, 2168-2208. doi: 10.1109/JBHI.2017.2767063.

Joseph Sill. Monotonic Networks. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 661–667. MIT Press, 1998.

Harini Suresh and John V. Guttag. A Framework for Understanding Unintended Consequences of Machine Learning. arXiv:1901.10002 [cs, stat], February 2020.

Eric J. Topol. High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, January 2019. ISSN 1546-170X. doi: 10/gfsvzn.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs], December 2017.

Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott McLachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3):230–238, March 2018. ISSN 1067-5027. doi: 10/gb4mkg.

Jeremy C. Weiss and David Page. Forest-based point process for event prediction from electronic health records. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 547–562. Springer, 2013.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On Layer Normalization in the Transformer Architecture. arXiv:2002.04745 [cs, stat], February 2020.

Wei Zhang, Zhaobin Kuang, Peggy Peissig, and David Page. Adverse drug reaction discovery from electronic health records with deep neural networks. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 30–39, 2020.

Shixiang Zhu, Minghe Zhang, Ruyi Ding, and Yao Xie. Deep Attention Spatio-Temporal Point Processes. arXiv:2002.07281 [cs, stat], February 2020.

Appendix A. Positive monotonic building blocks

As discussed in Section 4.2.3, if $f(t)$ is given by an $L$-layer NN, where the output of each layer $f_i$ is fed into the next $f_{i+1}$, then we can write $f(t) = (f_L \circ \cdots \circ f_2 \circ f_1)(t)$. Application of the chain rule

$$\frac{df(t)}{dt} = \frac{df_L}{df_{L-1}}\,\frac{df_{L-1}}{df_{L-2}}\cdots\frac{df_2}{df_1}\,\frac{df_1}{dt} \qquad (8)$$

shows that a sufficient condition for achieving $df(t)/dt \ge 0$ is to enforce each step of processing to be a monotonic function of its input: $df_i/df_{i-1} \ge 0$ and $df_1/dt \ge 0$ imply that $f(t)$ is a monotonic function of $t$.

As well as the monotonicity constraint, since we are interested in intensities that can decay as well as increase, it must be possible for the second derivative of $f(t)$ to be negative,

$$\frac{d\lambda(t\,|\,\mathcal{H}_t)}{dt} \sim \frac{d^2 f(t\,|\,\mathcal{H}_t)}{dt^2} < 0 \qquad (9)$$

for some history $\mathcal{H}_t$.


Here we collect all of the modifications we employ to ensure our architectures comply with this restriction.

A.1. Linear projections

Linear projections are a core component of dense layers, and can be made monotonic by requiring every element of the projection matrix to be at least 0:

$$f(x) = Wx, \qquad (10)$$
$$df(x)_i/dx_j = W_{i,j} \ge 0 \quad \forall\, i, j. \qquad (11)$$

One solution to this problem is to parameterise $W$ as a positive transformation $g$ of some auxiliary parameters $V$ such that $W = g(V) \ge 0$. Taking $g$ as the Rectified Linear Unit (ReLU) leads to issues during training: any weight which reaches zero becomes frozen and is no longer updated by the network, harming convergence. We also experimented with softplus and sigmoid activations. These have the issue that the auxiliary weights $V$ can be pushed arbitrarily negative and, when the network needs them again, training needs to pull them very far back for little change in $W$, again harming convergence. This approach also carries additional computational cost.

The most effective approach, which we employ in our models, involves modifying the forward pass such that any weights below some small $\varepsilon > 0$ are set to $\varepsilon$. This $\varepsilon$ can be thought of as a lower bound on the derivative, has no computational overhead, and none of the issues discussed in the approaches above. In our experiments we take $\varepsilon = 10^{-30}$.
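One way to realise this clamped forward pass is sketched below; the class name is ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class PositiveLinear(nn.Linear):
    """Sketch of the rule described above: any weight below eps is
    replaced by eps when computing the forward pass, so every partial
    derivative df(x)_i/dx_j is at least eps (Equation (11))."""
    eps = 1e-30

    def forward(self, x):
        return F.linear(x, self.weight.clamp(min=self.eps), self.bias)
```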

A.2. Dense activation functions

ReLU The widely used ReLU activation clearly satisfies the monotonicity constraint; however,

$$\frac{d\,\mathrm{ReLU}(x)}{dx} = \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0, \end{cases} \qquad \frac{d^2\,\mathrm{ReLU}(x)}{dx^2} = 0.$$

As the second derivative is 0, then

$$\frac{d\lambda^*_m(t)}{dt} = 0, \qquad (12)$$

which means that any cumulative model using the ReLU activation is equivalent to the conditional Poisson process $\lambda(\tau\,|\,\mathcal{H}_t) = \mu(\mathcal{H}_t)$ introduced in Section 4.2.3. We cannot use ReLU in a cumulative model.

Tanh Omi et al. (2019) proposed to use the tanh activation function, which doesn't have the constraints of ReLU:

$$\frac{d\tanh(x)}{dx} = \mathrm{sech}^2(x) \in (0, 1),$$
$$\frac{d^2\tanh(x)}{dx^2} = -2\tanh(x)\,\mathrm{sech}^2(x) \in (c^-, c^+),$$

where $c^\pm = \log(2 \pm \sqrt{2})/2$, so tanh meets our requirements of a positive first derivative and a non-zero second derivative.

Gumbel An alternative to tanh is the adaptive Gumbel activation introduced in Farhadi et al. (2019):

$$\sigma(x_m) = 1 - \left[1 + s_m\exp(x_m)\right]^{-\frac{1}{s_m}}, \qquad (13)$$

where $\forall m: s_m \in \mathbb{R}_{>0}$, and $m$ is a dimension/activation index. For brevity, we will refer to this activation function as the Gumbel activation, and while discussing its analytic properties we will drop the dimension index $m$; since the activation is applied element-wise, the analytic properties discussed transfer directly to the vector case.
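A sketch of Equation (13) as a module; parameterising s_m through softplus to keep it positive is our assumption, not a detail given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGumbel(nn.Module):
    """Sketch of the adaptive Gumbel activation, Equation (13)."""
    def __init__(self, dim):
        super().__init__()
        self.raw_s = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        s = F.softplus(self.raw_s) + 1e-6  # learnable s_m > 0 per dimension
        return 1.0 - (1.0 + s * torch.exp(x)).pow(-1.0 / s)
```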


Its first and second derivatives match our positivity and negativity requirements:

$$\frac{d\sigma(x)}{dx} = \exp(x)\left[1 + s\exp(x)\right]^{-\frac{s+1}{s}} \in (0, 1/e), \qquad (14)$$

$$\frac{d^2\sigma(x)}{dx^2} = \exp(x)\left[1 - \exp(x)\right]\left[1 + s\exp(x)\right]^{-\frac{2s+1}{s}} \in (c^-, c^+), \qquad (15)$$

where $c^\pm = (\pm\sqrt{5} - 2)\exp[(\pm\sqrt{5} - 3)/2]$. The Gumbel activation shares many of the limiting properties of the tanh activation:

$$\lim_{x\to-\infty}\sigma(x) = 0, \qquad (16)$$
$$\lim_{x\to\infty}\sigma(x) = 1, \qquad (17)$$
$$\lim_{x\to\pm\infty}\frac{d\sigma(x)}{dx} = \lim_{x\to\pm\infty}\frac{d^2\sigma(x)}{dx^2} = 0. \qquad (18)$$

For our purposes, the key advantage of the Gumbel over tanh is the learnable parameters $s_m$. These parameters control the magnitude of the gradient through Equation (14) and the magnitude of the second derivative through Equation (15). The maximum value of the first derivative is obtained at $x = 0$, which corresponds to the mode of the Gumbel distribution, and for any value of $s$

$$\max_x \frac{d\sigma(x)}{dx} = (1 + s)^{-\frac{s+1}{s}}. \qquad (19)$$

By sending $s \to 0$ we get the largest maximum gradient of $1/e$, which occurs in a short window in $x$; by sending $s \to \infty$ we get the smallest maximum gradient of 0, which occurs for all $x$. This allows the activation function to control its sensitivity to changes in the input, and allows the NN to be selective about where it wants fine-grained control over its first and second derivatives (i.e. an output that changes slowly over a large range of input values), and where it needs the gradient to be very large in a small region of the input.

These properties are extremely beneficial for cumulative modelling, and we employ the Gumbel activation in all of these models.

Gumbel-Softplus Although tanh and adaptive Gumbel meet our positivity and negativity requirements, they share a further issue raised by Shchur et al. (2020): their saturation ($\lim_{x\to\infty}\sigma(x) = 1$) does not allow the property $\lim_{t\to\infty}\Lambda^*_m(t) = \infty$, leading to an ill-defined joint probability density $p^*_m(t)$. To solve this, we introduce the Gumbel-Softplus activation⁹:

$$\sigma(x_m) = \mathrm{gumbel}(x_m)\left[1 + \mathrm{softplus}(x_m)\right], \qquad (20)$$

where gumbel is defined in Equation (13), and softplus is the parametric softplus function:

$$\mathrm{softplus}(x_m) = \frac{\log\left(1 + s_m\exp(x_m)\right)}{s_m}. \qquad (21)$$

This activation function has the property $\lim_{t\to\infty}\sigma(t) = \infty$, in addition to satisfying the positivity and negativity constraints.

9. This modification could equally be applied to the tanh activation.

A.3. Attention activation functions

In addition to the activation functions used in the dense layers of NNs, the attention block requires another type of activation. This is indicated in Equation (63), as the attention logits $E_{i,j}$ are passed through a function $g$ to produce the attention coefficients $\alpha_{i,j}$. Generally, this activation function is the softmax (see Equation (64)); however, this function is not monotonic:

$$\frac{\partial\,\mathrm{softmax}(x_i)}{\partial x_j} = \mathrm{softmax}(x_i)\left[\delta_{ij} - \mathrm{softmax}(x_j)\right] < 0 \quad \forall\, i \neq j, \qquad (22)$$


where $\delta_{ij}$ is the Kronecker delta function. Consequently, we chose to use the sigmoid activation function instead of the softmax when the modelled function is required to be monotonic. Indeed, the sigmoid function is monotonic:

$$\frac{\partial\sigma(x)}{\partial x} = \sigma(x)\left[1 - \sigma(x)\right] > 0. \qquad (23)$$

The softmax activation function has the nice property of making the attention coefficients sum to one, thereby forcing the network to attend to a limited number of points. This can make the interpretation of the attention coefficients easier. However, using a softmax could potentially lead to a decrease in performance if the points shouldn't strictly compete for contribution to the intensity under the generating process. As a result, we believe that choosing between a sigmoid and a softmax is not straightforward. We use sigmoid for cumulative approximators, and softmax everywhere else.

A.4. Layer normalisation

The final layer of the Transformer requiring modification is layer normalisation, a layer which dramatically improves the convergence speed of these models.

Modifying this layer requires realising that any layer involving division by the sum of activations (for example, even L2 normalisation) will result in a negative derivative occurring somewhere. Indeed, this is the underlying reason for the negative elements of the softmax Jacobian in Equation (22).

It follows that any kind of normalisation cannot explicitly depend on the current set of activations. Taking inspiration from batch normalisation, we construct an exponential moving average form of layer normalisation during training. When performing a forward pass, the means and standard deviations are taken from these moving averages and are treated as constants. After the forward pass, the means and standard deviations are then updated in a similar fashion to batch normalisation. Finally, taking the gain parameter positive, as discussed in Appendix A.1, results in a monotonic form of layer normalisation.

We employ this form of layer normalisation in the cumulative self-attention decoder (SA-CM); otherwise we use the standard form.
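A sketch of such a moving-average layer normalisation under the stated constraints; the momentum-style update and all names are our assumptions.

```python
import torch
import torch.nn as nn

class MonotonicLayerNorm(nn.Module):
    """Sketch of the layer normalisation described above (our naming).
    Statistics come from exponential moving averages rather than the
    current activations, so the output stays monotonic in the input;
    the gain is clamped positive as in Appendix A.1."""
    def __init__(self, dim, momentum=0.1, eps=1e-5):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.momentum, self.eps = momentum, eps
        self.register_buffer("mean", torch.zeros(dim))
        self.register_buffer("std", torch.ones(dim))

    def forward(self, x):
        # Normalise with stored statistics, treated as constants.
        out = (x - self.mean) / (self.std + self.eps)
        out = out * self.gain.clamp(min=1e-30) + self.bias
        if self.training:
            with torch.no_grad():  # update after the forward pass
                m = self.momentum
                self.mean.mul_(1 - m).add_(m * x.mean(dim=0))
                self.std.mul_(1 - m).add_(m * x.std(dim=0))
        return out
```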

A.5. Encoding

Temporal encoding for monotonic approximators The temporal embedding of Equation (65) is not monotonic in $t$ and therefore cannot be used in any conditional cumulative approximator. Vaswani et al. (2017) noted that a learnable temporal encoding had similar performance to the one presented in Equation (66). In order to model the conditional cumulative intensity $\Lambda^*_m(t)$, we will instead use a MLP

$$\mathrm{ParametricTemporal}(t) = \mathrm{MLP}(t;\theta_{\ge\varepsilon}) \in \mathbb{R}^{d_{\text{Model}}}, \qquad (24)$$

where $\theta_{\ge\varepsilon}$ indicates that all projection matrices have positive values in all entries. Biases may be negative. If we choose monotonic activation functions for the MLP, then it is a monotonic function approximator (Sill, 1998).

The $d_{\text{Model}}$-dimensional temporal encoding of an event at $t_i$ with labels $\mathcal{M}_i$ is then

$$x_i = \mathrm{TemporalEncode}(t_i, \mathcal{M}_i) \qquad (25)$$
$$\phantom{x_i} = v_i(\mathcal{M}_i)\sqrt{d_{\text{Model}}} + \mathrm{ParametricTemporal}(t_i). \qquad (26)$$

A.6. Double backward trick

When using a marked neural network based on the cumulative intensity, one issue comes from the fact that neural networks accumulate gradients. Specifically, the way that autograd is implemented in commonly used deep learning frameworks means that it is not possible to compute ∂Λ*_m(t)/∂t for a single m. The value of the derivative will always be accompanied by the derivatives of all m due to gradient accumulation, i.e. one can only compute the sum Σ_{m=1}^M ∂Λ*_m(t)/∂t but not an individual ∂Λ*_m(t)/∂t.

One way of solving this issue is to split Λ*_m(t) into its M individual components, and to compute the gradient for each of them. However, this method is not applicable to high values of M due to computational overhead. Another way is to use the double backward trick, which allows this gradient to be computed without having to split the cumulative intensity.

This trick is based on the following: we define a B × L × M (batch size × sequence length × number of classes) tensor a filled with zeros, and compute the Jacobian vector product

jvp = (∂Λ*_m(t)/∂t)ᵀ a ∈ R^{B×L}.  (27)

We then define a second tensor b of shape B × L × M, filled with ones, and we compute the following Jacobian vector product:

(∂ jvp/∂a)ᵀ = {∂/∂a [(∂Λ*_m(t)/∂t)ᵀ a]}ᵀ,  (28)

(∂ jvp/∂a)ᵀ b = (∂Λ*_m(t)/∂t)ᵀ b  (29)
             = ∂Λ*_m(t)/∂t ∈ R^{B×L×M}.  (30)

This recovers the element-wise derivative of the cumulative intensity function as desired, with the number of derivative operations performed being independent of the number of labels M.
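A minimal PyTorch sketch of the trick, assuming a cumulative intensity function mapping (B, L) times to (B, L, M) values (the function name and shapes are illustrative, not the paper's API):

```python
import torch

def elementwise_time_derivative(cumulative_fn, t):
    """Double backward trick (sketch): recovers dΛ*_m/dt for every mark m
    in two backward passes, independent of M."""
    t = t.detach().requires_grad_(True)
    Lam = cumulative_fn(t)                                   # (B, L, M)
    a = torch.zeros_like(Lam, requires_grad=True)            # tensor a, Eq. (27)
    # First backward pass: jvp = (dΛ/dt)ᵀ a, of shape (B, L).
    (jvp,) = torch.autograd.grad(Lam, t, grad_outputs=a, create_graph=True)
    # Second backward pass w.r.t. a with b = 1 recovers dΛ/dt, Eqs. (28)-(30).
    b = torch.ones_like(jvp)
    (dLam_dt,) = torch.autograd.grad(jvp, a, grad_outputs=b, create_graph=True)
    return dLam_dt                                           # (B, L, M)
```

Since jvp is linear in a, its gradient with respect to a is exactly the element-wise derivative, regardless of a being filled with zeros.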

A.7. Modelling the log cumulativeintensity

We also investigated an alternative, which is to let the decoder directly approximate log Λ*_m(t). We then have:

∂ log Λ*_m(t)/∂t = λ*_m(t)/Λ*_m(t),  (31)

log λ*_m(t) = log Λ*_m(t) + log(∂ log Λ*_m(t)/∂t).  (32)

While we did not use this method to produce our results, we implemented it in our codebase. The advantage of this method is that the log intensity is directly modelled by the network; its disadvantage is that the subtraction of the exponential terms to compute Λ*_m(t) − Λ*_m(t_i) can lead to numerical instability.
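For illustration, a self-contained sketch combining Equation (32) with the double backward trick of Appendix A.6 (the function name, shapes and the numerical clamp are our assumptions):

```python
import torch

def log_intensity_from_log_cumulative(log_cumulative_fn, t, eps: float = 1e-12):
    """Sketch of Eqs. (31)-(32): recover log λ*_m(t) from a network that
    outputs log Λ*_m(t). `log_cumulative_fn` maps (B, L) times to (B, L, M)
    log cumulative intensities."""
    t = t.detach().requires_grad_(True)
    log_Lam = log_cumulative_fn(t)
    a = torch.zeros_like(log_Lam, requires_grad=True)
    # Element-wise ∂ log Λ / ∂t via the double backward trick (Appendix A.6).
    (jvp,) = torch.autograd.grad(log_Lam, t, grad_outputs=a, create_graph=True)
    (dlog_dt,) = torch.autograd.grad(
        jvp, a, grad_outputs=torch.ones_like(jvp), create_graph=True)
    # log λ = log Λ + log(∂ log Λ / ∂t)   (Eq. 32)
    return log_Lam + torch.log(dlog_dt.clamp_min(eps))
```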

Appendix B. Taxonomy

We ran our experiments using two encoders, SA and GRU, and seven decoders: CP, RMTPP, LNM, MLP-CM, MLP-MC, Attn-CM and Attn-MC. We combined these encoders and decoders to form models, each defined by linking an encoder with a decoder: for instance, GRU-MLP-CM denotes a GRU encoder with an MLP-CM (cumulative) decoder. We present these different components here.

For the implementations of these models, as well as instances trained on the tasks in our setup, please refer to the code base supplied in the supplementary material.

Label embeddings Although Neural TPP encoders differ in how they encode temporal information, they share a label embedding step. Given labels m ∈ M_i localised at time t_i, the D_Emb-dimensional embedding v_i is

v_i = f_pool(W_i = {w(m) | m ∈ M_i}) ∈ R^{D_Emb},  (33)

where w(m) is the learnable embedding for class m, and f_pool(W) is a pooling function, e.g. mean pooling: f_pool(W) = Σ_{w∈W} w / |W|, or max pooling: f_pool(W) = ⊕_{α=1}^{D_Emb} max({w_α | w ∈ W}).¹⁰

10. In the multi-class setting, only one label appears at each time t_i, and so v_i is directly the embedding for that label, and pooling has no effect.
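A minimal sketch of this embedding step (the class count, dimension and interface are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LabelEmbedding(nn.Module):
    """Sketch of Equation (33): pool learnable class embeddings w(m)
    over the set of labels present at one event."""

    def __init__(self, num_classes: int, d_emb: int, pool: str = "mean"):
        super().__init__()
        self.embed = nn.Embedding(num_classes, d_emb)
        self.pool = pool

    def forward(self, labels: torch.Tensor) -> torch.Tensor:
        w = self.embed(labels)               # (|W_i|, d_emb), W_i = {w(m)}
        if self.pool == "mean":
            return w.mean(dim=0)             # f_pool(W) = Σ_w w / |W|
        return w.max(dim=0).values           # element-wise max pooling
```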

B.1. Encoders

GRU We follow the standard equations of the GRU:

r_i = σ(W^{(1)} x_i + b^{(1)} + W^{(2)} h_{i−1} + b^{(2)}),  (34)
z_i = σ(W^{(3)} x_i + b^{(3)} + W^{(4)} h_{i−1} + b^{(4)}),  (35)
n_i = tanh(W^{(5)} x_i + b^{(5)} + r_i ◦ (W^{(6)} h_{i−1} + b^{(6)})),  (36)
h_i = (1 − z_i) ◦ n_i + z_i ◦ h_{i−1},  (37)

where h_i, r_i, z_i and n_i are the hidden state, and the reset, update and new gates, respectively, at time t_i. x_i is the input at time t_i, defined by Equation (65). σ denotes the sigmoid function and ◦ is the Hadamard product.
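Since Equations (34)-(37) are the standard GRU update, this encoder can be realised with a stock implementation; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

d_model = 64  # illustrative
gru = nn.GRU(input_size=d_model, hidden_size=d_model, batch_first=True)
x = torch.randn(8, 20, d_model)   # (B, L, d_model) event encodings, Eq. (65)
h, _ = gru(x)                     # h[:, i] corresponds to h_i in Eq. (37)
```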

Self-attention (SA) We follow Equation (63) and Equation (64) to form our attention block. The keys, queries and values are linear projections of the input: q = W_q x, k = W_k x, v = W_v x, with the input X = {x_1, x_2, . . .}, where each x_i is defined by Equation (66). We also apply a Layer Normalisation layer before each attention and feedforward block, as in Xiong et al. (2020):

Q′ = LayerNorm(Q),  (38)
Q′ = MultiHeadAttn(Q′, V),  (39)
Q′ = Q′ + Q,  (40)
Z = LayerNorm(Q′),  (41)
Z = {W^{(2)} ReLU(W^{(1)} z_i + b^{(1)}) + b^{(2)}},  (42)
Z = Z + Q′,  (43)

where MultiHeadAttn is defined by Equation (67). We summarise these equations by:

Z = Attn(Q = X, V = X) ≡ SA(X).  (44)
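A minimal sketch of this pre-layer-norm block (dimensions and head count are illustrative assumptions; the monotonic modifications of Appendix A are omitted):

```python
import torch
import torch.nn as nn

class PreLNAttentionBlock(nn.Module):
    """Sketch of Equations (38)-(43): the pre-layer-norm block of
    Xiong et al. (2020)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, q, v, attn_mask=None):
        q_norm = self.ln1(q)                                        # Eq. (38)
        attn_out, _ = self.attn(q_norm, v, v, attn_mask=attn_mask)  # Eq. (39)
        q = q + attn_out                                            # Eq. (40)
        z = self.ln2(q)                                             # Eq. (41)
        return q + self.ff(z)                                       # Eqs. (42)-(43)
```

For the SA encoder, the block is called with q = v = X, together with the causal mask described in Appendix D.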

B.2. Decoders

For every decoder, we use, when appropriate, the same MLP, composed of two layers: one from D_Hid to D_Hid, and another from D_Hid to M, where D_Hid is either 8, 32 or 64 depending on the dataset, and M is the number of marks.

We also define the following terms for every decoder: t, a query time; z_t ≡ z_{|H_t|}, the representation of the latest event in H_t, i.e. the past event closest in time to t; and t_i, the time of the previous event. In addition, we define q_t, the representation of the query time t. For the cumulative models, q_t = ParametricTemporal(t), and for the Monte-Carlo models, q_t = Temporal(t), following Equation (24) and Equation (65), respectively.

Conditional Poisson (CP) The conditional Poisson decoder returns:

λ*_m(t) = MLP(z_t),  (45)
Λ*_m(t) = MLP(z_t)(t − t_i),  (46)

where the MLP is the same in both equations.
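A minimal sketch of this decoder; the softplus output enforcing positivity is our assumption, not stated in Equations (45)-(46):

```python
import torch
import torch.nn as nn

class ConditionalPoissonDecoder(nn.Module):
    """Sketch of Equations (45)-(46): a per-interval constant intensity,
    so the cumulative intensity is linear in the elapsed time."""

    def __init__(self, d_hid: int, num_marks: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU(),
                                 nn.Linear(d_hid, num_marks), nn.Softplus())

    def forward(self, z_t, t, t_i):
        lam = self.mlp(z_t)                        # λ*_m(t), Eq. (45)
        Lam = lam * (t - t_i).unsqueeze(-1)        # Λ*_m(t), Eq. (46)
        return lam, Lam
```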


RMTPP Given the same elements as for the conditional Poisson, the RMTPP returns:

λ*_m(t) = exp[W^{(1)} z_t + w^{(2)}(t − t_i) + b^{(1)}]_m,  (47)

Λ*_m(t) = (1/w^{(2)}) [exp(W^{(1)} z_t + w^{(2)}(t − t_i) + b^{(1)}) − exp(W^{(1)} z_t + b^{(1)})]_m,  (48)–(49)

where W^{(1)} ∈ R^{D_Hid×M}, w^{(2)} ∈ R, and b^{(1)} ∈ R^M are learnable parameters. Note that the ordering of the exponential terms follows from λ*_m(t) = ∂Λ*_m(t)/∂t.
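A minimal sketch, with an illustrative initialisation of the scalar w^{(2)}:

```python
import torch
import torch.nn as nn

class RMTPPDecoder(nn.Module):
    """Sketch of Equations (47)-(49); the cumulative intensity has a
    closed form."""

    def __init__(self, d_hid: int, num_marks: int):
        super().__init__()
        self.proj = nn.Linear(d_hid, num_marks)    # W(1), b(1)
        self.w = nn.Parameter(torch.tensor(-0.1))  # scalar w(2), illustrative init

    def forward(self, z_t, t, t_i):
        a = self.proj(z_t)                         # W(1) z_t + b(1)
        dt = (t - t_i).unsqueeze(-1)
        lam = torch.exp(a + self.w * dt)           # Eq. (47)
        Lam = (lam - torch.exp(a)) / self.w        # Eqs. (48)-(49)
        return lam, Lam
```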

Log-normal mixture (LNM) Following Shchur et al. (2020), the log-normal mixture model returns:

p*(t) = Σ_{k=1}^K w_k · 1/((t − t_i) σ_k √(2π)) · exp[−(log(t − t_i) − μ_k)² / (2σ_k²)],  (50)

where p*_m(t) = p*(t) ρ_m(H_t), and w, σ, μ ∈ R^K are mixture weights, distribution means and standard deviations, and are outputs of the encoder. These parameters are defined by:

w = softmax(W^{(1)} z_t + b^{(1)}),  (51)
σ = exp(W^{(2)} z_t + b^{(2)}),  (52)
μ = W^{(3)} z_t + b^{(3)}.  (53)
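A minimal sketch of the mixture density (the number of components K and the head names are illustrative assumptions):

```python
import math
import torch
import torch.nn as nn

class LogNormalMixture(nn.Module):
    """Sketch of Equations (50)-(53)."""

    def __init__(self, d_hid: int, K: int = 8):
        super().__init__()
        self.w_head = nn.Linear(d_hid, K)   # W(1), b(1)
        self.s_head = nn.Linear(d_hid, K)   # W(2), b(2)
        self.m_head = nn.Linear(d_hid, K)   # W(3), b(3)

    def density(self, z_t: torch.Tensor, dt: torch.Tensor) -> torch.Tensor:
        """p*(t) for elapsed times dt = t - t_i > 0."""
        w = torch.softmax(self.w_head(z_t), dim=-1)   # Eq. (51)
        sigma = torch.exp(self.s_head(z_t))           # Eq. (52)
        mu = self.m_head(z_t)                         # Eq. (53)
        dt = dt.unsqueeze(-1)
        comp = torch.exp(-(torch.log(dt) - mu) ** 2 / (2 * sigma ** 2)) \
            / (dt * sigma * math.sqrt(2 * math.pi))   # log-normal densities
        return (w * comp).sum(dim=-1)                 # Eq. (50)
```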

MLP Cumulative (MLP-CM) The MLP-CM returns:

λ*_m(t) = ∂Λ*_m(t)/∂t,  (54)
Λ*_m(t) = MLP([q_t, z_t]; θ_{≥ε}),  (55)

where the square brackets indicate a concatenation.

MLP Monte Carlo (MLP-MC) The MLP-MC returns:

λ*_m(t) = MLP([q_t, z_t]),  (56)
Λ*_m(t) = MC[λ*_m(t′), t′, t_i, t],  (57)

where MC represents the estimation of the integral ∫_{t_i}^{t} λ*_m(t′) dt′ using a Monte-Carlo sampling method.
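A minimal sketch of such a Monte-Carlo estimate with uniform samples (names, shapes and the sample count are illustrative assumptions):

```python
import torch

def mc_cumulative(intensity_fn, t_i, t, n_samples: int = 100):
    """Sketch of Eqs. (57)/(61): estimate ∫_{t_i}^{t} λ*_m(t') dt'.
    `intensity_fn` maps times of shape (n_samples, B, L) to intensities
    of shape (n_samples, B, L, M)."""
    u = torch.rand(n_samples, *t.shape, device=t.device)
    t_prime = t_i + u * (t - t_i)              # uniform samples in (t_i, t)
    lam = intensity_fn(t_prime)                # λ*_m at the sampled times
    # Interval width times the sample mean of the intensity.
    return (t - t_i).unsqueeze(-1) * lam.mean(dim=0)   # (B, L, M)
```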

Attention Cumulative (Attn-CM) For the Attn-CM and Attn-MC, the attention block differs from the SA encoder: the queries W_q q are defined from the query representations, and the keys W_k z and values W_v z are defined from the encoder representations.

The Attn-CM returns:

λ*_m(t) = ∂Λ*_m(t)/∂t,  (58)
Λ*_m(t) = Attn(Q = {q_t}, V = Z),  (59)

where Attn refers to Equation (44), with x = q as input. Moreover, the components of this attention block are modified following Appendix A. In particular, the ReLU activation is replaced by a Gumbel activation.

Attention Monte Carlo (Attn-MC) The Attn-MC returns:

λ*_m(t) = Attn(Q = {q_t}, V = Z),  (60)
Λ*_m(t) = MC[λ*_m(t′), t′, t_i, t],  (61)

where MC represents the estimation of the integral ∫_{t_i}^{t} λ*_m(t′) dt′ using a Monte-Carlo sampling method.


Appendix C. Additional results

C.1. Hawkes datasets

The parameters of our Hawkes datasets are:

Independent:
μ = [0.1, 0.05]ᵀ,  α = [[0.2, 0.0], [0.0, 0.4]],  β = [[1.0, 1.0], [1.0, 1.0]];

Dependent:
μ = [0.1, 0.05]ᵀ,  α = [[0.2, 0.1], [0.2, 0.3]],  β = [[1.0, 1.0], [1.0, 2.0]].  (62)

C.2. Hawkes results

A useful property of these datasets is that we can visually compare the modelled intensities with the intensity of the underlying generating process. On each dataset, Hawkes (independent) and Hawkes (dependent), all models perform similarly, with the exception of the conditional Poisson. We present results for the SA-MLP-MC in Figure 3.

C.3. Retweets results

On the Retweets dataset, while the conditional Poisson model significantly outperforms TPP models in terms of ROC-AUC, the opposite is true in terms of NLL/time. This is probably because TPPs can model the intensity decay that occurs when a tweet is not retweeted over a certain period of time. We compare the intensity functions of an SA-CP and an SA-MLP-MC in Figure 4.

C.4. Attention coefficients

Figure 5 shows an example of attention coefficients for a sequence of Synthea (Full).

Appendix D. Transformer architecture

Attention block The key component of the Transformer is attention. This computes contextualised representations q′_i = Attention(q_i, {k_j}, {v_j}) ∈ R^{d_Model} of queries q_i ∈ R^{d_Model} from linear combinations of values v_j ∈ R^{d_Model} whose magnitude of contribution is governed by keys k_j ∈ R^{d_Model}:

Attention(q_i, {k_j}, {v_j}) = Σ_j α_{i,j} v_j,  α_{i,j} = g(E_{i,j}),  E_{i,j} = q_iᵀ k_j / √d_k,  (63)

where α_{i,j} are the attention coefficients, E_{i,j} are the attention logits, and g is an activation function that is usually taken to be the softmax:

softmax_j(E_{i,j}) = exp(E_{i,j}) / Σ_k exp(E_{i,k}).  (64)
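A minimal sketch of Equations (63)-(64) with a pluggable activation g, which also covers the sigmoid variant of Appendix A.3 (tensor shapes are illustrative assumptions):

```python
import math
import torch

def attention(q, k, v, g=lambda E: torch.softmax(E, dim=-1)):
    """Sketch of scaled dot-product attention; pass g=torch.sigmoid for
    the monotonic variant of Appendix A.3."""
    E = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # logits E_ij
    alpha = g(E)                                          # coefficients α_ij
    return alpha @ v                                      # Σ_j α_ij v_j
```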

Masking The masking in our encoder is such that events only attend to themselves and events that happened strictly before them in time. The masking in the decoder is such that the query only attends to events occurring strictly before the query time.

Temporal Encoding In the original transformer (Vaswani et al., 2017), absolute positional information is encoded into q_i, k_j and v_j. This was achieved by producing positional embeddings where the embeddings of relative positions were linearly related, x^{(i)} = R(i − j) x^{(j)} for some rotation matrix R(i − j). Our temporal embedding


[Figure 3 image: two panels of intensity λ(t) against t ∈ [0, 100]; (a) SA-MLP-MC on Hawkes (dependent), (b) SA-MLP-MC on Hawkes (independent).]

Figure 3: Intensity functions on the Hawkes datasets. The orange line represents the true intensity of the sequence, while the blue line represents the intensity modelled by the SA-MLP-MC.

[Figure 4 image: stacked panels of per-mark intensity λ(t) against time; (a) SA-CP on Retweets data, (b) SA-MLP-MC on Retweets data.]

Figure 4: Intensity functions of an SA-CP and an SA-MLP-MC on the Retweets dataset. The MLP-MC is able to model the time decay of the intensity, contrary to the CP.

method generalises this to the continuous domain:

Temporal(t) = ⊕_{k=0}^{d_Model/2−1} [sin(α_k t) ⊕ cos(α_k t)] ∈ R^{d_Model},  (65)

where α_k is proportional to 10000^{−2k/d_Model}, and β is a temporal rescaling parameter that plays the role of setting the shortest time scale the model is sensitive to. In practice we estimate β = E[(w_+ − w_−)/N] from the training set, so that a TPP with smaller average gaps between events is modelled at a higher granularity by the encoding. We experimented with making β a learnable parameter of the model; however, this made the model extremely difficult to optimise. Note that β = 1 for a language model. In addition, β does not change the relative frequency of the rotational subspaces in the temporal embedding from the form in Vaswani et al. (2017). The encoding with temporal information of an event at t_i with labels M_i is then

x_i = TemporalEncode(t_i, M_i) = v_i(M_i) √d_Model + Temporal(t_i) ∈ R^{d_Model}.  (66)
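A minimal sketch of Equation (65); the exact placement of the rescaling β inside α_k, and the concatenated (rather than interleaved) sin/cos layout, are our assumptions:

```python
import torch

def temporal_encoding(t: torch.Tensor, d_model: int, beta: float):
    """Sketch of Equation (65): continuous sinusoidal temporal encoding,
    with alpha_k = 10000^(-2k/d_model) / beta."""
    k = torch.arange(d_model // 2, dtype=t.dtype, device=t.device)
    alpha = (10000.0 ** (-2.0 * k / d_model)) / beta
    angles = t.unsqueeze(-1) * alpha                  # (..., d_model/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```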


Figure 5: Encoder attention coefficients for an EHR extracted from the Synthea (Full) dataset. Line thickness corresponds to attention strength; values below 0.5 are omitted for clarity. As stressed in Serrano and Smith (2019), while attention “is by no means a fail-safe indicator”, it still “noisily predicts input components’ overall importance to a model”.

[Figure 5 image: attention coefficients over an EHR sequence spanning ages 0–50 years; the y-axis lists event labels such as Amoxicillin 250 MG, Viral sinusitis (disorder), Normal pregnancy, Prediabetes, and Neuropathy due to type 2 diabetes mellitus (disorder).]

Multi-head attention Multi-head attention produces q′_i using H parallel attention layers (heads) in order to jointly attend to information from different subspaces at different positions:

MultiHead(q_i, {k_j}, {v_j}) = W^{(o)} [⊕_{h=1}^H Attn(W^{(q)}_h q_i, {W^{(k)}_h k_j}, {W^{(v)}_h v_j})],  (67)


where W^{(q)}_h, W^{(k)}_h ∈ R^{d_k×d_Model}, W^{(v)}_h ∈ R^{d_v×d_Model}, and W^{(o)} ∈ R^{d_Model×H d_v} are learnable projections.

Appendix E. Broader impactstatement

E.1. Overview

In this paper, we demonstrate that Temporal Point Processes perform favourably in modelling Electronic Health Records. We embarked on this project aware that, as high-impact systems, Clinical Decision Support tools must ultimately provide explanations (Holzinger et al., 2017) to ensure clinician adoption (Miller, 2019) and provide accountability (Mittelstadt et al., 2016), whilst being mindful of the call for researchers to build interpretable ML models (Rudin, 2019). Our results demonstrate that our attention-based model performs favourably against less interpretable models. As attention could carry medically interpretable information, our innovation contributes towards the development of medically useful AI tools.

Aware of both the need for transparency in the research community (Pineau et al., 2020) and for sensitive handling of health data (Kalkman et al., 2019), we opted to use open source synthetic EHRs. We provide all trained models, benchmarked datasets and a high-quality deterministic codebase to allow others to easily implement and benchmark their own models. We also make the important contribution of highlighting existing benchmark datasets as inappropriate; continuing to develop temporal models using these datasets may slow the development of useful technologies or cause poor outputs if applied to EHR data.

Turning to the potential impact of our technology, we begin by briefly discussing the benefits temporal EHR models could bring to healthcare before, with no less objectivity, analysing harms which could occur, raising questions for further reflection.

E.2. The potential benefits of EHR modelling

Healthcare systems today are under intense pressure. Chronic disease prevalence is rising (Raghupathi and Raghupathi, 2018), costs of care are increasing, resources are constrained and outcomes are worsening (Topol, 2019). 23% of UK deaths in 2017 were classed as avoidable (Office for National Statistics, 2017), clinician diagnostic error is estimated at 10–15% (Graber, 2013), and EHR data suffers from poor coding and incompleteness, which affects downstream tasks (Jetley and Zhang, 2019). Our research contributes techniques which could be used to:

• More accurately impute missing EHR data.

• Identify likely misdiagnoses.

• Predict future health outcomes.

• Identify higher risk patients for successful health interventions.

• Design optimal care pathways.

• Optimise resource management.

• Identify previously undiscovered links between health conditions.

Clearly, implemented well, temporal EHR modelling could provide immense benefit to our society. These benefits outweigh possible harms, provided those harms are well mitigated. Hence, we devote the rest of the discussion to evaluating potential harms.

E.3. Potential Harms of EHRModelling

There are a number of ways in which the temporal modelling of EHRs could create societal harm. To examine these risks, we evaluate them by topic, framed as a series of case studies.

E.3.1. Poor Health Outcomes Through Data Bias

Bias can mean many things when discussing the application of AI to healthcare. As such, we would encourage researchers to use Suresh and Guttag (2020)'s framework to better identify and address the specific challenges that arise when modelling health data. All are relevant to our technology, but we will specifically focus on issues arising from historical, representation and aggregation biases. Importantly, we anticipate these biases would be more easily identified due to the interpretability of our attention-based technique.

Scenario Temporal EHR modelling is deployed to predict future health, having been trained on historical health records.

Societal harm Societal groups who have historically been discriminated against continue to receive below-standard healthcare.

Examples of historical biases Historically, the majority of medical trials have been conducted on white males (Morley et al., 2019), providing greater health information on this population group. We also know that clinicians are fallible to human biases. For example: women are less likely than men to receive optimal care despite being more likely to present with hypertension and heart failure (Li Shanshan et al., 2016); they are often not diagnosed with diseases due to human bias (Suresh and Guttag, 2020) and are less likely than men to be given painkillers when they report they are in pain (Calderone, 1990); African Americans are more likely to be misdiagnosed with schizophrenia than white patients (Gara et al., 2018).


* * *

Scenario Temporal EHR modelling is deployed to predict future health and develop care plans. One model is used for a diverse population group.

Societal harm Demographics' varying symptom patterns and care plan needs are not accounted for, resulting in below-optimal health outcomes for the majority.

Examples of aggregation biases It has been shown that the risk factors for carotid artery disease differ by ethnicity (Gijsberts et al., 2015), yet prevention plans are developed on data from an almost exclusively white population. A common measurement for diabetes, widely used for diagnosis and monitoring, differs in value across ethnicities and genders (Herman and Cohen, 2012).

* * *

Scenario Researchers select proxies to represent metrics within their model without considering broader socioeconomic factors.


Societal harm Societal groups are subject to allocation biases whereby they do not receive the same standard of healthcare as advantaged groups.

Example of poor proxy selection Obermeyer et al. (2019) analysed a model in current commercial use which identifies high-risk patients for care interventions using medical costs as a proxy for illness. It was found that black people were much sicker than white people with the same risk score, as historical and socioeconomic factors mean that black people do not receive parity in healthcare and treatments.

* * *

Scenario Biased models are deployed with a clinician in the loop, with the intention of using a human to protect against model biases.

Societal harm The clinician cannot redress model biases, causing the above-mentioned harms to persist.

Example of human-in-the-loop failure Humans are subject to confirmation bias (Green and Chen, 2019), and Obermeyer et al. (2019) found that whilst clinicians can redress some disparity in care allocation, they do not adjust the balance to that of a fair algorithm.

* * *

Recommendations:

• Researchers must employ caution when using proxies such as "diagnosed with condition X" equals "has condition X".

• Models must be evaluated against the intended recipient populations.

• We call on all institutions deploying temporal EHR models to assess fair performance in the target population, and on policy makers to provide benchmarked patient datasets on which fair and safe performance of models must be demonstrated before deployment.

Questions for reflection:

• How do we best formulate the problems temporal EHR modelling is trying to solve to prevent discriminatory harms?

• If temporal EHR modelling were deployed, what techniques would be used to protect society from the reinforcement of structural inequalities and structural injustices (Jugov and Ypi, 2019)?

• How do you construct adequate benchmarking datasets to test the suitability of a temporal EHR model for a given population?

• What evidence is needed to demonstrate the clinical effectiveness of a temporal model (Greaves et al., 2018)?

E.3.2. Poor care provision through lack of contextual awareness

Scenario A temporal model is trained in a perfectly unbiased way and is deployed for use.

Societal harm As the model is not aware of emergent health outcomes, the information provided is out of sync with modern medical knowledge, potentially resulting in suboptimal care plan recommendations and predictions.

Example of lack of contextual awareness Consider training on a cohort of 100-year-old patients. They lived through the Second World War, ate rations, never took a contraceptive pill and were retiring as computers became commonplace. Individuals approaching old age in the future will likely have very different health trajectories. By their nature, models of EHR data will be context blind (Caruana et al., 2015). For example, if a new drug is found to prevent the development of Type 2 diabetes, model predictions will be out of sync with patient outcomes: the model would view the prescription of this better drug as an error.

* * *

Questions for reflection:

• How would we develop temporal EHR models to correct dynamically for non-stationary processes (changing medical knowledge)?

• How do we introduce contextual awareness to temporal EHR models?

E.3.3. Restrictions in health access

Scenario A perfect model is deployed to be used by health insurers to identify high-cost patients. Those who can be assisted by interventions are helped. The model identifies a set of individuals who are unavoidably costly, due to health issues beyond their control.

Societal harm This may cause a shift in the economic model of the society in which the EHR model is deployed (Balasubramanian et al., 2018).

Examples of economic impact Temporal models will make increasingly accurate predictions about how much each individual's healthcare will cost over a lifetime. In countries where healthcare is not universal, insurers and providers will adjust their actuarial models, moving away from pooled-risk healthcare costs. Health insurers increase premiums to an unfeasibly high amount for costly individuals, preventing their access to healthcare, and pricing models change for those who are able to access care. Governments are forced to determine whether to use social security to provide universal healthcare. If it is not provided, the divide between those who can access healthcare and those who cannot potentially widens, with ripple effects across the broader economy.

* * *

Questions for reflection:

• How do we as a society determine how to carry the cost of healthcare provision?

• What level of risk granularity should health insurance companies be privy to?

E.3.4. Changes to the Doctor/Patient relationship and agency

Scenario Temporal EHR modelling causes a paradigm shift towards solely data-centred medicine, reducing a patient solely to their datapoints. Theoretically, the model can predict a patient's death.

Societal harm Agency is removed from both the patient and the doctor. Knowledge of death raises existential questions and changes individuals' behaviours.

Example of solely data-centred medicine The doctor is left to evaluate symptoms solely on datapoints (Kleinpeter, 2017). A patient's ability to fully express what it means to inhabit their body and feel illness (or not) is removed. They may feel unable to decline or deviate from recommended treatment plans, impacting the patient's perception of bodily autonomy.

* * *

Scenario Temporal EHR models are deployed in consumer health and wellness applications with the current narrative of empowering an individual to take control of their health (Morley and Floridi, 2019a).


Societal harm The burden of disease prevention is shifted to the individual, who has proportionally little control over macro health issues, tantamount to victim blaming (Morley and Floridi, 2019a).

Example of empowerment ineffectiveness Many factors of health are not solvable with applications. For example, those on low incomes cannot afford healthy foods, and other socioeconomic factors prevent individuals from accessing healthcare (de Freitas and Martin, 2015).

* * *

Recommendations:

• Policy makers should contemplate how far into the future various institutions should be permitted to predict people's health.

• We encourage future researchers to view consumer applications that use temporal EHR modelling as digital companions (Morley and Floridi, 2019b).

Questions for reflection:

• How many years in advance should predictive health technologies be able to predict?

• How much clinical decision-making should we be delegating to AI-Health solutions (di Nucci, 2019)?

• How can we develop temporal EHR models to account for the socioeconomic and historical factors which contribute to poor health outcomes?

• How do we incorporate into temporal EHR modelling an ethical focus on the end user, and their expectations, demands, needs, and rights (Morley and Floridi, 2020)?

E.4. Wider reaching applications

The temporal modelling of irregular data could be applied to many domains, for example education and personalised learning, crime and predictive policing, social media interactions, and more. Each of these applications raises questions around the societal impact of improved temporal modelling. We hope that the above analysis provides stimulation when developing its beneficial usage.
