Kernel density estimation-based real-time prediction for respiratory motion. 2010 Phys. Med. Biol. 55 1311 (http://iopscience.iop.org/0031-9155/55/5/004)

IOP PUBLISHING PHYSICS IN MEDICINE AND BIOLOGY

Phys. Med. Biol. 55 (2010) 1311–1326 doi:10.1088/0031-9155/55/5/004

Kernel density estimation-based real-time prediction for respiratory motion

Dan Ruan

Department of Radiation Oncology, Stanford University, Stanford, CA, USA

E-mail: [email protected]

Received 16 April 2009, in final form 15 November 2009
Published 4 February 2010
Online at stacks.iop.org/PMB/55/1311

Abstract
Effective delivery of adaptive radiotherapy requires locating the target with high precision in real time. System latency caused by data acquisition, streaming, processing and delivery control necessitates prediction. Prediction is particularly challenging for highly mobile targets such as thoracic and abdominal tumors undergoing respiration-induced motion. The complexity of the respiratory motion makes it difficult to build and justify explicit models. In this study, we honor the intrinsic uncertainties in respiratory motion and propose a statistical treatment of the prediction problem. Instead of asking for a deterministic covariate–response map and a unique estimate value for the future target position, we aim to obtain a distribution of the future target position (response variable) conditioned on the observed historical sample values (covariate variable). The key idea is to estimate the joint probability distribution (pdf) of the covariate and response variables using an efficient kernel density estimation method. Then, the problem of identifying the distribution of the future target position reduces to identifying the section in the joint pdf based on the observed covariate. Subsequently, estimators are derived based on this estimated conditional distribution. This probabilistic perspective has some distinctive advantages over existing deterministic schemes: (1) it is compatible with potentially inconsistent training samples, i.e., when close covariate variables correspond to dramatically different response values; (2) it is not restricted by any prior structural assumption on the map between the covariate and the response; (3) the two-stage setup allows much freedom in choosing statistical estimates and provides a full nonparametric description of the uncertainty for the resulting estimate.
We evaluated the prediction performance on ten patient RPM traces, using the root mean squared difference between the prediction and the observed value, normalized by the standard deviation of the observed data, as the error metric. Furthermore, we compared the proposed method with two benchmark methods: the most recent sample and an adaptive linear filter. The kernel density estimation-based prediction results demonstrate universally significant improvement over the alternatives and are

0031-9155/10/051311+16$30.00 © 2010 Institute of Physics and Engineering in Medicine Printed in the UK 1311


especially valuable for long lookahead time, when the alternative methods fail to produce useful predictions.

(Some figures in this article are in colour only in the electronic version)

1. Introduction

Modern radiotherapy systems are capable of delivering the prescribed dose to a specified target position with high precision. However, target motion, if not properly accounted for, compromises the delivery accuracy. Intrafractional target motion during treatment is managed with passive gating (Keall et al 2002) or adaptive tracking (Nuyttens et al 2006). Furthermore, intrinsic system latency exists due to hardware limitation, software processing time and data communication. These considerations have motivated a large body of studies on prediction algorithms during treatment, especially for highly mobile targets, such as thoracic and abdominal tumors that are affected by respiratory motion.

The problem of predicting respiratory motion has been intensively studied, and the existing methods can be classified into two categories: (1) those that assume a specific inference model structure between the covariate, which is usually constructed from a finite sequence of historical samples, and the response; (2) those that are 'model free'. The general class of linear filters falls into the first category, even though its members may differ in adaptivity and specific formulations. The autoregressive moving average (ARMA) model is a straightforward generalization of the linear model (McCall and Jeraj 2007). Linear models in the statistical setting have given rise to the application of the Kalman filter and multiple models (Putra et al 2008, Zimmerman et al 2008, McMahon et al 2007). By estimating the regression coefficients, these models extrapolate the behavior between the training covariate and training response values, and apply the estimated inference map to the test covariate.

One potential drawback of these fixed-structure models is their incapability to learn the inference pattern from samples that are further away in space/time, but more similar in behavior, e.g., training samples that are obtained at a similar breathing phase. Changing the representation space, such as the sinusoidal (Vedam et al 2004) or scale space (Ernst et al 2007), is one way to 'pull' these samples together. An even more flexible way is to learn the covariate–response directly, as in neural networks (Isaksson et al 2005, Kakar et al 2005, Murphy and Dieterich 2006). Note that even in the neural network setup, a consistent and smooth inference map is assumed, and it will have difficulty if there exist multiple inconsistent training samples whose covariates are identical (or very close), but have very distinct response values.

In this study, we adopt a statistical perspective and consider the probability distribution of the test response upon observing the test covariate. Treating the prediction as a random variable not only honors the inconsistent training samples, but also provides a natural nonparametric description of the uncertainty associated with any prediction estimate. We adopt a kernel density estimator to approximate the joint probability distribution of the covariate and response variables from training samples. The distribution of the test response variable is obtained as the section of the joint distribution picked out by the observed test covariate value.

We present the basic theory of the proposed methodology in section 2. Section 3 describes the test data and the implementation procedure and reports the experimental results. Section 4 provides some structural discussion, and section 5 summarizes the study and discusses future work.


2. Methods

2.1. Basic setup

At the current time instant t, we are given a set of discrete samples s_i, i = 1, 2, ..., K, of the breathing trajectory, acquired at preceding times t_i < t. For simplicity, we discuss the formulation when scalar observations are acquired at uniform time intervals and the lookahead length is an integer multiple L of the sampling interval. Then, for any i ≤ K − L, we can construct a length-p covariate x_i = [s_{i−(p−1)Δ}, s_{i−(p−2)Δ}, ..., s_i] and response y_i = s_{i+L}. The parameter Δ is an integer that indicates the 'lag length' used to augment the covariate. It should be chosen properly to balance the effect of system dynamics and observation noise (Ruan et al 2007).
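The covariate–response construction above can be sketched in a few lines; `build_training_pairs`, its parameter defaults and the use of numpy are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def build_training_pairs(s, p=3, delta=12, L=30):
    """Build covariate/response pairs from a uniformly sampled 1D trace s:
    x_i = [s_{i-(p-1)*delta}, ..., s_{i-delta}, s_i] and y_i = s_{i+L},
    as in section 2.1 (p: covariate length, delta: lag, L: lookahead)."""
    s = np.asarray(s, dtype=float)
    X, Y = [], []
    # index i must have a full lagged history and an observed future sample s_{i+L}
    for i in range((p - 1) * delta, len(s) - L):
        X.append(s[i - (p - 1) * delta : i + 1 : delta])
        Y.append(s[i + L])
    return np.array(X), np.array(Y)
```

For example, for a 100-sample trace with p = 3, delta = 2 and L = 5, the first pair is x = [s_0, s_2, s_4] with response y = s_9.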

Let z_i = (x_i, y_i) ∈ ℝ^{p+1}, i = 1, 2, ..., M, denote the collection of training samples, where x_i and y_i are the covariate variable and the response variable, respectively. The goal of prediction is to obtain an estimate of the unknown y given a test covariate x.

The key idea of the proposed statistical method is to consider the probability distribution (pdf) of the random vector Z = (X, Y) ∈ ℝ^n (with n = p + 1), and regard each training sample z_i as a realization of Z. The prediction (or, more generally, inference) task is twofold: (1) estimate the conditional distribution p(y|x) for the observed covariate x and (2) obtain an estimate of y from this distribution. Rather than formulating the whole problem in a single optimization setting, we have chosen to present the estimation of the conditional pdf and the subsequent estimation of y in a decoupled manner, to honor the fact that each module is self-contained and that once the conditional distribution p(y|x) is obtained, the user has the freedom to choose any estimate based on the specific application. Yet the overall setting of constructing the distribution in the joint (X, Y) space, and the logic of obtaining an estimate based on the conditional probability, holds regardless.

We present the framework for kernel-based prediction in section 2.2, which provides a pdf of the response variable. Section 2.3 discusses some natural estimates derived from the resulting distribution. Section 2.4 provides a specific example of the proposed scheme, and section 2.5 explains in depth certain design considerations and describes some variations that could potentially improve the prediction performance.

2.2. Estimating the distribution of response variable with kernel density approximation

We consider the samples z_i = (x_i, y_i), i = 1, 2, ..., M, as independent samples of the random vector Z and obtain the kernel density approximation (Duda et al 2001) of the pdf p(z) as the superposition of (equally weighted) local kernel densities centered about each sample z_i:

    p(z) = (1/M) Σ_{i=1}^{M} κ(z | z_i),    (1)

where κ(z|z_i) is a local density kernel. The distribution of the response variable conditioned on the covariate, p(y|x), can be written as

    p(y|x) = p((x, y)|x) = p(z|x) = p(z)/p(x),    (2)

where p(x) is the marginal distribution p(x) = ∫ p((x, y)) dy. The conditional distribution p(y|x) is the (normalized) section of p(z) at X = x.

Equations (1) and (2) provide the principle for estimating the distribution of the response variable conditioned on the observed test covariate, by approximating the joint distribution


Figure 1. Schematic of the kernel density estimation-based prediction method: a single kernel density (a) is placed at each discrete training sample (b) to construct the joint pdf (c). Based on the observed test covariate, the corresponding section of the joint pdf is selected and renormalized to obtain the conditional distribution of the test response (d), where certain statistics, such as the mean, median and mode, could be used as the prediction, as illustrated in (d).

of covariate–response with kernel densities. Figure 1 illustrates the major components of this procedure.

A common choice of kernel density is the Gaussian kernel:

    κ(z | z_i, Σ_i) = (2π)^{−n/2} |Σ_i|^{−1/2} exp[−(z − z_i)^T Σ_i^{−1} (z − z_i)].

In the prediction application, the various coordinates in z correspond to different states in the covariate or response variable. Therefore, it is feasible to assume the covariance matrix to be block separable:

    Σ_i = [ Σ_{x,i}    0
            0          σ²_{y,i} ].    (3)

The covariance of the local Gaussian kernel about each sample z_i reflects the contribution of the sample to the local curvature of the overall probability distribution. If all training samples are obtained under a similar environment, then it is reasonable to assume Σ_{x,i} = Σ_x and σ_{y,i} = σ_y for all i. The block diagonal form of the covariance indicates a separable Gaussian kernel, which enables further simplification of (2) as follows:


    p(y|x) = (1/M) Σ_i κ((x, y) | z_i)
           = (1/C) Σ_i exp[−(x − x_i)^T Σ_x^{−1} (x − x_i) − ‖y − y_i‖²/σ_y²]
           = (1/C) Σ_i exp[−(x − x_i)^T Σ_x^{−1} (x − x_i)] exp[−‖y − y_i‖²/σ_y²].    (4)

The normalization parameter C is independent of both i and y, so one could delay the normalization until the last step by scaling the final conditional pdf to unity integral. Define the sample weights w_i as

    w_i = exp[−(x − x_i)^T Σ_x^{−1} (x − x_i)],    (5)

and (4) reduces to

    p(y|x) = (1/C) Σ_i w_i exp[−‖y − y_i‖²/σ_y²],    (6)

with normalization parameter C. This indicates that conditioned on the test covariate x, the distribution of the response variable y is a Gaussian mixture, with each Gaussian component centered at one of the sampled y_i's. The weights in the mixture are determined by how 'close' the test covariate x is to the covariates x_i in the training samples.

2.3. Natural estimates

Given the distribution of the response variable Y conditioned on the observed covariate state x, one could devise estimates of the response variable accordingly.

2.3.1. The mean estimate. The mean of a random variable minimizes the expected squared error:

    y_mean = arg min_y E[(y − Y)² | x] = E[Y | x].    (7)

Given the Gaussian mixture conditional distribution (6), the mean estimate reads

    y_mean = E[Y | x] = (1/C) ∫ y Σ_i w_i exp[−‖y − y_i‖²/σ_y²] dy = (Σ_i w_i y_i) / (Σ_i w_i).    (8)

It turns out that the mean estimate is the weighted sum of the training response values, where the weights are determined by the 'closeness' between the testing and training covariates. Also note that explicit normalization is no longer necessary due to cancellation.

2.3.2. The mode estimate. The mode can be considered as a maximum a posteriori probability (MAP) estimate, as it seeks a response value that maximizes the conditional distribution:

    y_mode = arg max_y p(y | x).    (9)


2.3.3. The median estimate. From the point of view of order statistics, one could also adopt the median estimate, which is defined as

    y_median = {y : P(Y ≤ y | x) = 1/2}.    (10)

Note that for the mixture Gaussian distribution in (6), the cumulative distribution can be obtained as

    P(Y ≤ y | x) = ∫_{−∞}^{y} (1/C) Σ_i w_i exp[−‖u − y_i‖²/σ_y²] du
                 = (1/(2 Σ_i w_i)) Σ_i w_i [1 + erf((y − y_i)/(√2 σ_y))],    (11)

where the Gauss error function is defined as

    erf(x) = (2/√π) ∫_0^x e^{−t²} dt.
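As a concrete sketch, the median can be located by bisection on the closed-form mixture CDF in (11); `median_estimate` and its tolerance are illustrative assumptions, using only the standard-library `math.erf`:

```python
import math

def median_estimate(weights, y_train, sigma_y, tol=1e-8):
    """Solve P(Y <= y | x) = 1/2 for the Gaussian-mixture conditional (6),
    using the erf-based cumulative distribution (11) and bisection."""
    wsum = sum(weights)

    def cdf(y):
        # (1 / (2 sum_i w_i)) sum_i w_i [1 + erf((y - y_i) / (sqrt(2) sigma_y))]
        return sum(
            w * (1.0 + math.erf((y - yi) / (math.sqrt(2.0) * sigma_y)))
            for w, yi in zip(weights, y_train)
        ) / (2.0 * wsum)

    lo = min(y_train) - 10.0 * sigma_y   # CDF is essentially 0 here
    hi = max(y_train) + 10.0 * sigma_y   # CDF is essentially 1 here
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cdf(mid) < 0.5 else (lo, mid)
    return 0.5 * (lo + hi)
```

Since the CDF is monotone, bisection always converges; for a symmetric two-component mixture the median lands midway between the two centers.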

2.4. An exemplary scheme for respiratory motion prediction

As an example, we provide the algorithmic flow chart using the Gaussian kernel and the mean estimate.

Algorithm 1 Predict y from (x_i, y_i) with Gaussian kernel and mean estimate.

1: Determine the covariances Σ_x and σ_y for the covariate and response variables.
2: Compute the weights w_i according to (5).
3: Compute the mean estimate y_mean = (Σ_i w_i y_i)/(Σ_i w_i), equivalent to (8).

Note that it is unnecessary to compute the conditional distribution explicitly when the mean estimate is used, as the order of taking the expectation to obtain the mean and computing the weighted sum for the conditional pdf can be interchanged. This is not true in general, as in the case of the mode and median estimates.
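Algorithm 1 translates directly into a short routine. The sketch below is illustrative (the function name and the use of numpy's `einsum` are my choices) and exploits the fact that σ_y cancels in the mean estimate:

```python
import numpy as np

def kde_mean_predict(x, X_train, Y_train, Sigma_x):
    """Algorithm 1: compute the Gaussian-kernel weights (5) for a test
    covariate x and return the mean estimate (8), i.e. the weighted sum
    of training responses. sigma_y cancels and is not needed."""
    Sinv = np.linalg.inv(Sigma_x)
    d = X_train - x                                     # rows are x_i - x
    w = np.exp(-np.einsum("ij,jk,ik->i", d, Sinv, d))   # weights (5)
    return float(np.dot(w, Y_train) / w.sum())          # mean estimate (8)
```

A test covariate far from every x_i receives exponentially small weights, so the prediction is pulled toward the responses of the most similar historical patterns.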

2.5. Design considerations and potential variations

We describe some design considerations and variations that could potentially improve the prediction performance.

2.5.1. Data-driven covariance/bandwidth selection. In introducing the principle of the kernel method, we assumed the covariance of the covariate and the response variables a priori. In practice, one needs to determine these values. Setting the covariance for a general kernel is a challenging problem and can be identified with the bandwidth selection problem studied in nonparametric density estimators. To this end, statistical methods in data-driven bandwidth selection (Sheather and Jones 1991, Botev 2006) may be applied.

In our application, the end goal is to obtain an estimate of the test response rather than estimating the pdf itself, and we expect the prediction performance to be less sensitive to the choice of covariance than in a general fitting problem. With the assumed separable Gaussian kernel and approximate spatial invariance (cf (3)), we estimate the covariate covariance Σ_x and response variance σ_y² from the training population as follows:


    x̄ = (1/N_train) Σ_{i ∈ training set} x_i,    ȳ = (1/N_train) Σ_{i ∈ training set} y_i;

    Σ_x = (1/(N_train − 1)) Σ_{i ∈ training set} (x_i − x̄)(x_i − x̄)^T,
    σ_y² = (1/(N_train − 1)) Σ_{i ∈ training set} (y_i − ȳ)².    (12)
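Equation (12) amounts to the usual unbiased sample moments, which numpy computes directly; a minimal sketch (function name assumed):

```python
import numpy as np

def estimate_kernel_covariances(X_train, Y_train):
    """Population estimates (12): covariate covariance Sigma_x and response
    variance sigma_y^2, both with the unbiased 1/(N-1) normalization."""
    Sigma_x = np.cov(X_train, rowvar=False)  # sum (x_i - xbar)(x_i - xbar)^T / (N-1)
    sigma_y2 = np.var(Y_train, ddof=1)       # sum (y_i - ybar)^2 / (N-1)
    return Sigma_x, sigma_y2
```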

2.5.2. Inhomogeneous Gaussian kernel with varying covariance. When the training samples differ in their uncertainty, which may be caused by varying local noise levels, one could assign different covariances to different samples. In particular, samples with higher uncertainty should use a kernel with higher covariance, corresponding to a flatter local kernel. With Σ_i being the covariance for the local kernel around sample z_i, the conditional distribution in (4) becomes

    p(y|x) = (1/M) Σ_i κ((x, y) | z_i)
           = (1/M) Σ_i (2π)^{−n/2} |Σ_{x,i}|^{−1/2} σ_{y,i}^{−1} exp[−(x − x_i)^T Σ_{x,i}^{−1} (x − x_i)] × exp[−‖y − y_i‖²/σ_{y,i}²].    (13)

We modify the definition of sample weights wi as

    w_i = (1/(|Σ_{x,i}|^{1/2} σ_{y,i})) exp[−(x − x_i)^T Σ_{x,i}^{−1} (x − x_i)],    (14)

and the conditional distribution p(y|x) maintains the Gaussian mixture form as in (6). Subsequently, the mean estimate y_mean is again a weighted sum of the sample responses y_i. The only difference incurred by allowing the local Gaussian kernels to have different covariances is that the local confidence level modifies the weights: less reliable training samples (with higher |Σ_{x,i}|^{1/2} σ_{y,i}) receive less weight. In data-driven approaches, neighborhoods with few samples have higher uncertainty, and it is desirable to 'flatten' out the local kernel. This approach is similar in spirit to the variable kernel method in Silverman (1986).

2.5.3. Robustness to outliers in training samples. Training samples during abrupt (and non-repetitive) changes, such as patient coughing, are less representative of the general covariate–response behavior and may be regarded as outliers. To increase the robustness of the kernel density estimator to such outliers, one should decrease the contribution of a training sample if its behavior differs significantly from other samples. We follow a similar philosophy of iterative weight assignment for robust local weighted regression (Cleveland 1979, Ruan et al 2007) and adjust the weight of a sample in the kernel density estimation as follows.

Let B(·) be a scalar function that satisfies

• B(x) > 0 for |x| < 1 and B(x) = 0 for |x| ≥ 1;
• B(x) = B(−x);
• B(x) is non-increasing for x ≥ 0.

Let I = {1, 2, ..., M} be the complete index set. Denote the kernel approximation of p(z) with samples z_j, for j ∈ I \ {i}, termed the leave-one-out density, as

    p_i(y|x) = (1/(M − 1)) Σ_{j ∈ I, j ≠ i} κ((x, y) | z_j).


Conditioned on X = x_i, one obtains a distribution p_i(y|x_i), and subsequently an estimate ŷ_i using any chosen estimate in section 2.3. Let e_i = y_i − ŷ_i be the residual of the observed sample response from the estimate obtained with the leave-one-out density. Let ρ be a scale parameter, which could be set as the median of the |e_i|'s for i = 1, 2, ..., M. We define the robust weight by

    δ_i = B((y_i − ŷ_i)/ρ).    (15)

The original kernel density approximation (1) can then be modified by

    p(z) = (1/Σ_{i=1}^{M} δ_i) Σ_{i=1}^{M} δ_i κ(z | z_i)    (16)

to incorporate the robust weighting.
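The leave-one-out reweighting can be sketched as follows; the tricube choice of B and the Cleveland-style scale 6ρ are assumptions for illustration (the text only requires B to satisfy the three listed conditions and ρ to be a residual-based scale):

```python
import numpy as np

def robust_weights(X_train, Y_train, Sigma_x, c=6.0):
    """Robust weights (15): e_i = y_i - yhat_i from the leave-one-out mean
    estimate, rho = median |e_i|, and delta_i = B(e_i / (c * rho)) with a
    tricube B (zero outside |u| < 1, symmetric, non-increasing in |u|)."""
    Sinv = np.linalg.inv(Sigma_x)
    M = len(Y_train)
    resid = np.empty(M)
    for i in range(M):
        d = np.delete(X_train, i, axis=0) - X_train[i]
        w = np.exp(-np.einsum("ij,jk,ik->i", d, Sinv, d))   # weights (5), sample i left out
        resid[i] = Y_train[i] - np.dot(w, np.delete(Y_train, i)) / w.sum()
    rho = np.median(np.abs(resid))
    u = np.abs(resid) / (c * rho)
    return np.where(u < 1.0, (1.0 - u ** 3) ** 3, 0.0)      # tricube B
```

A sample whose response disagrees strongly with its leave-one-out prediction, such as a cough spike, ends up with weight near zero and is effectively excluded from (16).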

2.5.4. Modified kernel weight to account for temporal correlation. By the same token as in section 2.5.3, one could incorporate the temporal correlation between the training samples and the test sample by modifying (16) further with

    p(z) = (1/Σ_{i=1}^{M} δ_i η_i) Σ_{i=1}^{M} δ_i η_i κ(z | z_i),    (17)

where η_i is a monotonically decreasing function of the temporal distance between sample z_i and z. Fading memory is often modeled with exponential discounting or windowed training.

• For exponential discounting,

    η_i = exp(−α|t_i − t|),    (18)

where t_i and t are the time tags associated with z_i and z, respectively. The positive constant α determines the decay rate of the influence of one sample on another as their temporal distance increases.

• For a moving window,

    η_i = 1 if |t − t_i| < τ, and η_i = 0 otherwise,    (19)

where τ is the window size. Here, only training samples close enough in time to the test sample are used to estimate the pdf. The window size τ should be chosen large enough to ensure a reasonable kernel approximation of the pdf.

2.6. Benchmark methods for comparison

Sampling rate and lookahead length are the two major prediction parameters in adaptive image-guided radiotherapy. We desire prediction algorithms that can handle low sampling rates and large lookahead lengths: a low sampling rate means less imaging dose, and a large lookahead length allows more time for observation acquisition, signal processing and delivery. We will study the performance of the proposed method with varying sampling rates and lookahead lengths, and compare the outcome with the following benchmarks.

• Most recent sample (no prediction):

    ŷ_k = s_k.    (20)


• Adaptive linear predictor:

    ŷ_k = β_k^T x_k + γ_k,    (21)

where the linear coefficients β_k and γ_k are obtained by solving the least-squares problem for the observed covariate–response pairs in a dynamically updated training set. More specifically, at each instant k, a training set of covariate–response pairs is constructed with the most recently observed samples, and the prediction coefficients are determined by

    (β_k, γ_k) = arg min Σ_{i ∈ training set for time k} (y_i − β_k^T x_i − γ_k)²,

which can be solved in closed form.
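The benchmark fit is ordinary least squares with an intercept, and numpy's solver returns the closed-form solution directly; a minimal sketch (function name assumed):

```python
import numpy as np

def adaptive_linear_predict(x, X_train, Y_train):
    """Benchmark (21): fit y ~ beta^T x + gamma by least squares on the
    current training window, then evaluate at the test covariate x."""
    A = np.hstack([X_train, np.ones((len(X_train), 1))])  # append intercept column
    coef, *_ = np.linalg.lstsq(A, Y_train, rcond=None)    # closed-form (beta, gamma)
    beta, gamma = coef[:-1], coef[-1]
    return float(np.dot(beta, x) + gamma)
```

In the adaptive setting, this routine would be called at every instant k with the covariate-response pairs from the most recent window.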

3. Materials and results

3.1. Material

We used the Real-time Position Management system (RPM system, Varian Medical, Palo Alto, CA) to obtain 1D traces of fiducial markers placed on the patient's chest wall. The RPM traces are believed to be highly correlated with respiratory motion and to sufficiently capture the temporal behavior of respiration. Moreover, the performance of respiratory prediction algorithms depends on the fundamental variation pattern rather than the amplitude, so the RPM traces are reasonable test subjects for algorithmic development.

To rid the adverse impact of the arbitrary scaling in the RPM amplitude, we adopt the normalized root mean squared error (nRMSE) as the performance measure for each trace, defined by the usual RMSE divided by the standard deviation of the observed sample values:

    nRMSE({ŷ_i}_{i=1}^{N}) = RMSE/std_y = √(E[(ŷ − y)²]) / √(E[(y − ȳ)²]).    (22)

The population nRMSE (across traces) is computed by taking the L2 average of the trace-wise nRMSE, i.e.,

    nRMSE = √( (1/number of traces) Σ_{i: trace id} nRMSE_i² ).
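Both error metrics follow directly from their definitions; the function names in this sketch are illustrative:

```python
import numpy as np

def nrmse(y_pred, y_obs):
    """Trace-wise metric (22): RMSE of the prediction, normalized by the
    standard deviation of the observed samples."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_obs) ** 2)) / np.std(y_obs))

def population_nrmse(per_trace):
    """Population nRMSE: the L2 average of the trace-wise values."""
    per_trace = np.asarray(per_trace, dtype=float)
    return float(np.sqrt(np.mean(per_trace ** 2)))
```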

We report the RPM data characteristics in table 1 and illustrate some traces in figure 2.

Table 1. RPM dataset information.

Subject ID    1     2     3     4     5     6     7     8     9     10
STD           0.49  0.50  0.30  0.20  0.32  0.59  0.07  0.23  0.28  0.11
P-P           2.54  2.36  1.27  1.12  1.87  0.97  0.29  0.88  1.22  0.43
P-P/STD       5.11  4.74  4.21  5.66  5.92  5.61  4.59  3.88  4.44  4.05
Duration (s)  140   79    113   165   165   117   150   165   160   162


Figure 2. Typical RPM traces used in this study (traces 1, 4, 5, 7, 9 and 10).

3.2. Experimental details and results

We have chosen to use a three-dimensional augmented covariate variable, so that x_i = [s_{i−2Δ}, s_{i−Δ}, s_i]. This low-dimensional decision was based on the consideration that we would like to obtain a reasonable kernel density estimate from limited samples and avoid the 'curse of dimensionality' in density estimation. This setup is also designed to ensure fairness in performance comparison with the benchmark methods.

To properly capture the motion dynamics, we choose Δ to correspond to a 0.4 s delay between consecutive coordinates of the covariate x. In the baseline setup, we used a sampling rate of 30 Hz, with Δ = 12. As mentioned in section 2.5.1, we are most interested in a predictor that could perform well after a short training stage under low sampling conditions, so we set the covariance in (3) with the estimated population covariance from the training samples as in (12): this covariance setup corresponds to an overly broad (flat) local kernel, but it provides numerical stability in most cases, an issue discussed in section 4. We investigated a lookahead length L = 30, corresponding to a 1 s prediction, which has been reported to be challenging for a wide spectrum of common prediction techniques (Vedam et al 2003, Sharp et al 2004, Murphy and Dieterich 2006).

For a fair comparison, the adaptive linear filter uses the same covariate variables, and the observed covariate–response pairs in the most recent 20 s are used to estimate the linear regression coefficients β_k ∈ ℝ³ and γ_k ∈ ℝ at each instant k.

3.2.1. Training samples in kernel density estimation. We studied three different schemes for choosing the training samples used in kernel density estimation. In the static scheme, the first 20 s of the trajectory was used to generate the training sample collection, and the estimated pdf was kept fixed thereafter for all predictions. In the expansive scheme, the training set is enriched as new covariate–response pairs are observed. This corresponds to a special case of fading memory (cf section 2.5.4) where all previous samples contribute equally to the kernel approximation (M = k − L, α = 0 in (17) and (18)). In the moving window update scheme, only training samples within the most recent 20 s temporal window were used to construct the kernel pdf, as formulated in (19). Table 2 reports the prediction performance of these three training schemes, and figure 3 illustrates some typical prediction trajectories with these training schemes.

Table 2. Comparison of prediction performance among static training, expansive training and moving window training.

Subject ID      1     2     3     4     5     6     7     8     9     10    Average

Root mean squared error (RMSE)
Static          1.85  0.81  0.73  0.91  0.99  1.03  0.83  0.54  1.05  0.85  1.02
Expansive       0.58  0.49  0.48  0.64  0.56  0.54  0.72  0.41  0.72  0.63  0.59
Moving window   0.37  0.37  0.39  0.58  0.45  0.42  0.69  0.35  0.60  0.56  0.49

Figure 3. Comparison among 1 s lookahead prediction results with different training schemes in kernel density estimation: (a) trace 1, major drift; (b) trace 5, transient changes; (c) trace 9, irregular; (d) trace 10, quasi-regular. Actual signal trajectory (solid blue line), prediction with the static training scheme (dashed green line), prediction with the expansive training scheme (dashed cyan line) and prediction with the moving window training scheme (dashed red line).

It is quite obvious that expanding the training samples, or even better, using a moving window to select the training sample collection, improves the prediction performance. This is no surprise, as the updated pdf drives the final prediction estimate to resemble the training samples that both behave similarly and are close in time. This is particularly true for trajectories with drifting, as shown in figure 4.
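The three training schemes differ only in which pair indices are admitted at prediction time; `training_indices` below is an illustrative sketch (index conventions assumed: pair i is usable once its response s_{i+L} has been observed at step k):

```python
import numpy as np

def training_indices(k, L, t, scheme="moving", t_static=20.0, window=20.0):
    """Indices of covariate-response pairs usable when predicting at time t[k]:
    'static' keeps pairs from the first t_static seconds, 'expansive' keeps
    all observed pairs, and 'moving' keeps pairs from the last `window` seconds."""
    observed = np.arange(max(0, k - L + 1))   # pairs whose response is already seen
    if scheme == "static":
        return observed[t[observed] <= t_static]
    if scheme == "expansive":
        return observed
    return observed[t[k] - t[observed] < window]   # moving window, cf (19)
```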


Figure 4. Comparison among 1 s lookahead prediction results with different training samples in kernel density estimation for a drifting trajectory. Top row: (a) time series of the trajectory with trendy mean drift (patient 1): actual signal (solid blue line), prediction with the static training scheme (dashed green line), prediction with the expansive training scheme (dashed cyan line) and prediction with the moving window training scheme (dashed red line). Bottom row: histogram of the prediction residual y_k − ŷ_k for (b) static training, (c) expansive training and (d) moving window training.

3.2.2. The effect of sampling rate and lookahead length. Since sampling rate and lookahead length are two of the most critical parameters in a prediction system, we compared the performance of the proposed method with the most recent sample and the adaptive linear filter prediction methods described in section 2.6, for various sampling rates and lookahead lengths. In particular, we tested prediction performance for lookahead lengths of 0.2 s, 0.6 s and 1 s when samples are acquired at 5 Hz, 10 Hz, 15 Hz and 30 Hz. Figure 5 reports the nRMSE for the three methods (MRS, Linear, Kernel) under different combinations of prediction parameters. In figure 5, the upper-left corner corresponds to the relatively easy case of short prediction (0.2 s) with dense samples (30 Hz). The sampling rate decreases as we move down the rows and the prediction length increases as we move to the columns on the right; both changes increase the difficulty of prediction.

In most cases, the adaptive linear filter yields a significant accuracy gain over MRS, usually reducing the nRMSE by about half. This performance is comparable to what is reported in current work on respiratory predictors, justifying our earlier argument that the adaptive linear predictor is a fair benchmark for this study. The only exception occurs when 5 Hz samples are used to predict 0.2 s ahead. This can be explained from two aspects: (1) a 200 ms delay is relatively short and does not incur too high an nRMSE even with MRS; (2) the low sampling rate effectively yields a down-sampled path, slowing the response of the adaptive linear filter to dynamic changes. Meanwhile, the proposed kernel-based method universally dominates the alternatives on a trace-by-trace basis, always yielding the lowest nRMSE value. Moreover, its improvement upon MRS and the adaptive linear predictor is quite significant.
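The two benchmarks are easy to mimic for intuition. The sketch below implements a most-recent-sample predictor and a normalized-LMS adaptive linear filter; the filter order, step size `mu` and the sinusoidal test trace are illustrative assumptions, not the paper's settings, and nRMSE is taken here as RMSE normalized by the signal's standard deviation (one common convention).

```python
import numpy as np

def mrs_predict(signal, L):
    """Most-recent-sample (MRS) baseline: the prediction for time index k
    is simply the sample observed L steps earlier."""
    preds = np.full(len(signal), np.nan)
    preds[L:] = signal[:-L]
    return preds

def nlms_predict(signal, L, order=10, mu=0.5, eps=1e-8):
    """Adaptive linear predictor sketch: a length-`order` FIR filter maps
    the window ending at time k - L to the sample at time k, with taps
    updated by normalized LMS.  For brevity this sketch updates on every
    past target; a strictly causal version would delay updates by L samples."""
    w = np.zeros(order)
    preds = np.full(len(signal), np.nan)
    for k in range(order + L, len(signal)):
        x = signal[k - L - order:k - L][::-1]   # most recent sample first
        preds[k] = w @ x                        # prediction made L steps earlier
        e = signal[k] - preds[k]                # target is now available
        w += mu * e * x / (eps + x @ x)         # normalized LMS tap update
    return preds

def nrmse(pred, actual):
    """RMSE normalized by the standard deviation of the actual signal
    over the predicted span."""
    m = ~np.isnan(pred)
    err = pred[m] - actual[m]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(actual[m]))

# Synthetic 4 s-period breathing-like trace sampled at 10 Hz,
# predicted 1 s (L = 10 samples) ahead.
fs, L = 10, 10
t = np.arange(0.0, 200.0, 1.0 / fs)
y = np.sin(2 * np.pi * t / 4.0)
```

On this clean trace the MRS baseline lags by a quarter period and its nRMSE lands near √2, while the adaptive filter converges to a much lower value, mirroring the roughly two-fold gap the text reports.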


[Figure 5 panels: nRMSE versus trace ID (1-10) for the MRS, Linear and Kernel predictors.]

Figure 5. Comparison of prediction performance in terms of nRMSE under various prediction parameters. Column-wise (left → right): lookahead = 0.2 s, 0.6 s, 1 s. Row-wise (top → bottom): sampling rate = 30 Hz, 15 Hz, 10 Hz, 5 Hz.

Figure 6 reports the nRMSE across all traces with varying sampling rates for lookahead lengths of 0.2 s, 0.6 s and 1 s, respectively, and figure 7 reports the performance subject to varying lookahead lengths for each given sampling rate. The advantage of the proposed method holds across all prediction lengths and all sampling rates, and is most dramatic for long lookahead lengths, where the alternative methods yield essentially useless predictions (nRMSE ≥ 1). The nonparametric nature and data-driven learning make the kernel-based prediction robust toward changes in the prediction parameters.


[Figure 6 panels: nRMSE versus sampling frequency (5-30 Hz) for the MRS, Linear and Kernel predictors, at lookahead lengths of (a) 0.2 s, (b) 0.6 s and (c) 1 s.]

Figure 6. nRMSE versus sampling frequency for 0.2, 0.6 and 1 s lookahead prediction across all traces.

[Figure 7 panels: nRMSE versus lookahead length (0.2-1 s) for the MRS, Linear and Kernel predictors, at sampling frequencies of (a) 5 Hz, (b) 10 Hz, (c) 15 Hz and (d) 30 Hz.]

Figure 7. nRMSE versus lookahead length for various sampling rates across all traces.

4. Discussion

• The Gaussian mixture structure in (6) offers an efficient form to store and process the pdf. It suffices to record the weights wi, the Gaussian centers yi and the variance σy to fully recover the probability distribution. Furthermore, the Gaussian kernel with separable covariance (with respect to the covariate and response variables), together with the linearity in the mean estimate, allows us to obtain the prediction ŷmean as a simple weighted sum of sample response values (8), where the weight of each training sample is determined by how close its covariate value is to the covariate value of the test sample.

• When only a few training samples (relative to the covariate–response space under study) are available, numerical instability may occur if each of the weights wi in (5) is small relative to the machine precision: a 'zero divided by zero' fault may arise from normalization. Geometrically, uniformly small weights indicate that the test sample is far from all training samples. One could utilize this observation to detect 'unseen' changes in the trajectory. From the perspective of obtaining a stable estimate, one could adjust the covariance Σx in (5) to 'flatten' the local kernel and increase its bandwidth, so that the support of the kernel density approximation from the same training samples covers a larger portion of the whole space, preventing uniformly vanishing p(y|x). As an alternative, one could use a finitely supported density kernel and determine a test-sample-dependent bandwidth by requiring the test sample to fall inside the kernel support of a certain number of training samples, as in Cleveland (1979), Silverman (1986) and Ruan et al (2007).
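Both points above can be sketched together: the weighted-sum mean prediction of (8) with a bandwidth-widening fallback for uniformly vanishing weights. The function name, the default bandwidth and the doubling schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kde_predict(X_train, y_train, x_test, sigma_x=0.1, tiny=1e-300):
    """Mean estimate under a Gaussian kernel density with separable
    covariance: the prediction is a weighted sum of training responses,
    each weight set by how close that sample's covariate is to the test
    covariate (Nadaraya-Watson form)."""
    X_train = np.atleast_2d(X_train)
    d2 = np.sum((X_train - x_test) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma_x ** 2))
    # Uniformly vanishing weights mean the test covariate is far from every
    # training sample (an 'unseen' regime).  Widen the bandwidth so the
    # kernel support covers the test point instead of dividing 0 by 0.
    while w.sum() < tiny:
        sigma_x *= 2.0
        w = np.exp(-d2 / (2 * sigma_x ** 2))
    w = w / w.sum()
    return float(w @ y_train), w
```

On a breathing trace one would take the covariate x to be a short vector of recent samples and y the displacement one lookahead length later; the returned weights also reveal when the predictor is extrapolating.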

5. Conclusion and future work

This paper reports a kernel density-based method to estimate the probability distribution of the future target position. We provide a general framework for estimating the conditional probability distribution and several options for constructing estimates based on the conditional distribution. In particular, we have discussed the use of the Gaussian density kernel and the mean estimate, leading to a simple yet intuitive estimate that is the weighted sum of the training response values. Our proposed method compares favorably with alternative benchmark methods, yielding a two- to threefold reduction in RMSE. Improvement from the proposed method is most noticeable for low sampling rates and/or long lookahead lengths.

We have discussed how numerical instability could be indicative of an 'unseen' scenario, and may be used for change detection. More generally, the marginal distribution of the test covariate p(x) indicates the likelihood of such an event, and the conditional distribution p(y|x) provides all the information needed to characterize the quality of the resulting estimate. This gives rise to an instantaneous quantification of the prediction, which is an interesting alternative to the conventional collective performance measures. We will study this performance quantification issue as a natural extension of this work.
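As a rough illustration of such instantaneous quantification, the sketch below reads an unnormalized marginal p(x) (a novelty indicator) and a conditional mean and variance off the same Gaussian kernel density estimate. The function name and bandwidths are hypothetical; the paper does not specify this computation.

```python
import numpy as np

def predictive_summary(X_train, y_train, x_test, sigma_x=0.2, sigma_y=0.1):
    """Instantaneous prediction quality from a Gaussian KDE: the
    (unnormalized) marginal p(x) flags how 'seen' the test covariate is,
    and the conditional mixture p(y|x) yields a per-prediction mean and
    variance."""
    d2 = np.sum((np.atleast_2d(X_train) - x_test) ** 2, axis=1)
    k = np.exp(-d2 / (2 * sigma_x ** 2))
    p_x = k.mean()                 # marginal density up to a constant factor
    w = k / k.sum()                # conditional mixture weights
    mean = w @ y_train
    # variance of a 1-D Gaussian mixture: within-component spread plus
    # between-component spread of the centers
    var = sigma_y ** 2 + w @ (y_train - mean) ** 2
    return p_x, mean, var
```

A small p_x would flag the prediction as extrapolation, while a large var would flag an ambiguous covariate even inside well-sampled territory.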

As we described in our methodology and observed in our experiments, the choice of the covariance for the kernel density is not particularly critical for the purpose of prediction. However, a good estimate of the joint covariate–response distribution is valuable in its own right. Related to the previous comment, change detection relies on a good estimate of the marginal distribution. We will study automatic schemes to 'optimally' choose the kernel covariance.

We have formulated the mode and median estimates in section 2.3, but have focused on the mean estimate due to its simplicity and the resulting intuitive weighted-superposition form for the predictor. However, the mode estimate corresponds to the maximum a posteriori (MAP) estimate, and the median estimate is robust toward outlier samples, so both have their own significance. In the future, we will analyze these estimates and investigate their applicability in various settings.
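For a 1-D Gaussian mixture p(y|x), all three estimates can be extracted directly from the weights and centers. The sketch below locates the mode by a dense grid search and the median by bisecting the mixture CDF; the grid resolution and iteration count are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def mixture_estimates(weights, centers, sigma_y, grid=None):
    """Mean, median and mode of the 1-D Gaussian mixture
    p(y) = sum_i w_i N(y; y_i, sigma_y^2)."""
    from math import erf
    w = np.asarray(weights, float)
    w = w / w.sum()
    c = np.asarray(centers, float)
    mean = float(w @ c)
    # mode (MAP estimate): evaluate the mixture density on a grid
    # spanning all components and take the argmax
    if grid is None:
        grid = np.linspace(c.min() - 3 * sigma_y, c.max() + 3 * sigma_y, 2001)
    dens = np.exp(-(grid[:, None] - c) ** 2 / (2 * sigma_y ** 2)) @ w
    mode = float(grid[np.argmax(dens)])
    # median: bisect the (monotone) mixture CDF at level 0.5
    cdf = lambda t: float(w @ np.array(
        [0.5 * (1 + erf((t - ci) / (sigma_y * np.sqrt(2)))) for ci in c]))
    lo, hi = float(grid[0]), float(grid[-1])
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return mean, 0.5 * (lo + hi), mode
```

On a skewed mixture the three estimates separate: the mode sits on the dominant component, the mean is pulled toward the minority component, and the median lies between them, which is exactly the robustness trade-off noted above.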


Acknowledgments

The author thanks Dr Per Poulsen and Dr Byung-Chul Cho for motivating this work and Dr Paul Keall for his support. She is grateful to the reviewers for their insightful comments that greatly improved the quality of this work.

References

Botev Z I 2006 A novel nonparametric density estimator Postgraduate Seminar Series, Department of Mathematics, The University of Queensland

Cleveland W S 1979 Robust locally weighted regression and smoothing scatterplots J. Am. Stat. Assoc. 74 829–36

Duda R O, Hart P E and Stork D G 2001 Pattern Classification (New York: Wiley)

Ernst F, Schlaefer A and Schweikard A 2007 Prediction of respiratory motion with wavelet-based multiscale autoregression Med. Image Comput. Comput. Assist. Intervention 10 668–75

Isaksson M, Jalden J and Murphy M J 2005 On using an adaptive neural network to predict lung tumor motion during respiration for radiotherapy applications Med. Phys. 32 3801–9

Kakar M, Nystrom H, Aarup L R, Nottrup T J and Olsen D R 2005 Respiratory motion prediction by using the adaptive neuro fuzzy inference system (ANFIS) Phys. Med. Biol. 50 4721–8

Keall P J, Kini V R, Vedam S S and Mohan R 2002 Potential radiotherapy improvements with respiratory gating Australas. Phys. Eng. Sci. Med. 25 1–6

McCall K C and Jeraj R 2007 Dual-component model of respiratory motion based on the periodic autoregressive moving average (periodic ARMA) method Phys. Med. Biol. 52 3455–66

McMahon R, Papiez L and Sandison G 2007 Addressing relative motion of tumors and normal tissue during dynamic MLC tracking delivery Australas. Phys. Eng. Sci. Med. 30 331–6

Murphy M J and Dieterich S 2006 Comparative performance of linear and nonlinear neural networks to predict irregular breathing Phys. Med. Biol. 51 5903–14

Nuyttens J J, Prevost J B, Praag J, Hoogeman M, Van Klaveren R J, Levendag P C and Pattynama P M 2006 Lung tumor tracking during stereotactic radiotherapy treatment with the CyberKnife: marker placement and early results Acta Oncol. 45 961–5

Putra D, Haas O C, Mills J A and Burnham K J 2008 A multiple model approach to respiratory motion prediction for real-time IGRT Phys. Med. Biol. 53 1651–63

Ruan D, Fessler J A and Balter J M 2007 Real-time prediction of respiratory motion based on nonparametric local regression methods Phys. Med. Biol. 52 7137–52

Sharp G C, Jiang S B, Shimizu S and Shirato H 2004 Prediction of respiratory tumour motion for real-time image-guided radiotherapy Phys. Med. Biol. 49 425–40

Sheather S J and Jones M C 1991 A reliable data-based bandwidth selection method for kernel density estimation J. R. Stat. Soc. B 53 683–90

Silverman B W 1986 Density Estimation for Statistics and Data Analysis (New York: Chapman and Hall)

Vedam S S, Kini V R, Keall P J, Ramakrishnan V, Mostafavi H and Mohan R 2003 Quantifying the predictability of diaphragm motion during respiration with a noninvasive external marker Med. Phys. 30 505–13

Vedam S S, Keall P J, Docef A, Todor D A, Kini V R and Mohan R 2004 Predicting respiratory motion for four-dimensional radiotherapy Med. Phys. 31 2274–83

Zimmerman J, Korreman S, Persson G, Cattell H, Svatos M, Sawant A, Venkat R, Carlson D and Keall P 2008 DMLC motion tracking of moving targets for intensity modulated arc therapy treatment—a feasibility study Acta Oncol. 1–6