Consistent and powerful graph-based change-point test for high-dimensional data

Xiaoping Shi a,1, Yuehua Wu b,1, and Calyampudi Radhakrishna Rao c,d,1

a Department of Mathematics and Statistics, Thompson Rivers University, Kamloops, BC, Canada V2C0C8; b Department of Mathematics and Statistics, York University, Toronto, ON, Canada M3J1P3; c Department of Biostatistics, University at Buffalo, The State University of New York, Buffalo, NY 14221-3000; and d C.R. Rao Advanced Institute of Mathematics, Statistics, and Computer Science, Hyderabad 500046, India

Contributed by Calyampudi Radhakrishna Rao, March 1, 2017 (sent for review February 17, 2017; reviewed by Venkata Krishna Jandhyala and Runze Li)

A change-point detection is proposed by using a Bayesian-type statistic based on the shortest Hamiltonian path, and the change-point is estimated by using ratio cut. A permutation procedure is applied to approximate the significance of Bayesian-type statistics. The change-point test is proven to be consistent, and an error probability in change-point estimation is provided. The test is very powerful against alternatives with a shift in variance and is accurate in change-point estimation, as shown in simulation studies. Its applicability in tracking cell division is illustrated.

Bayesian-type statistic | shortest Hamilton path | ratio cut | minimum spanning tree | cell division

Modeling high-dimensional time series is necessary in many fields such as neuroscience, signal processing, network evolution, text analysis, and image analysis. Such a time series may contain multiple unknown change-points. For example, the times of cell divisions can be assessed using an automatic embryo monitoring system by time-lapse observation (see ref. 1). When a cell divides at some time point, the distribution of pixel values in the corresponding frame will change, and hence the detection of cell divisions can be formulated as a multiple change-point problem. Sample frames of a particular mouse embryo from ref. 1 are shown in Fig. 1. The aim is to automatically detect multiple change-points: the time points of the first, second, and third division cycles (from one to two cells, from two to four cells, and from four to eight cells, respectively). Histograms are usually used to compare cell images. Their advantages are efficiency and insensitivity to cell movement (see, e.g., ref. 2). Assume that the pixel values are placed into $d$ bins, and let $h^0_{t,k}$ be the number of occurrences of pixel values in the $t$th image that are contained within the $k$th bin for $k = 1, \ldots, d$. A nonlinear scaling ($h_{t,k}$) of the count ($h^0_{t,k}$) can usually improve the performance of image segmentation, e.g., by using a square-root or logarithmic transformation (3, p. 88). The ultimate aim is to detect the multiple change-points in the $d$-dimensional vectors $H_t = (h_{t,1}, h_{t,2}, \ldots, h_{t,d})'$ for $t = 1, 2, \ldots, N$. As there are $321 \times 321$ pixel values in each image, it is possible to consider a large number of bins. In other words, $d$ may be very large for high-resolution images. Another example is the authorship debate given in Chen and Zhang (4), where $h^0_{t,k}$ represents the count of the $k$th word in the $t$th chapter. More examples can be found in Chen and Zhang (4) and Roy et al. (5), among others.
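As a minimal sketch of the feature construction described above (not the authors' code), the following Python function places pixel values into $d$ equal-width bins and applies a nonlinear scaling to the counts. The function name `histogram_features`, the 0 to 255 pixel range, and the equal-width binning are assumptions made for illustration only.

```python
import math

def histogram_features(pixels, d, max_value=255, transform=math.sqrt):
    """Return the scaled histogram (h_{t,1}, ..., h_{t,d}) of one image.

    pixels    : iterable of integer pixel values in 0..max_value (assumed range)
    d         : number of bins (equal-width binning is an assumption)
    transform : nonlinear scaling applied to each raw count h^0_{t,k}
    """
    counts = [0] * d
    for v in pixels:
        # Map the pixel value to a bin index in 0..d-1.
        k = min(v * d // (max_value + 1), d - 1)
        counts[k] += 1
    return [transform(c) for c in counts]

# Toy "image" with five pixels and d = 2 bins.
features = histogram_features([0, 0, 0, 0, 255], d=2)
```

For a logarithmic scaling, empty bins need care because the logarithm of zero is undefined; passing `transform=math.log1p` is one possible workaround, again an assumption rather than something stated in the text.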

A change-point detection can be built upon a two-sample test; Chen and Zhang (4) recently developed scan statistics for change-point detection by using the run test of Friedman and Rafsky (6), which is based on the minimal spanning tree (MST). Even though the test of Friedman and Rafsky (6) can be used in high-dimension, low-sample-size situations, as noted in ref. 7, it is no longer distribution-free and may not be consistent under some conditions (see ref. 7, theorem 2). As shown in the simulation studies of this paper, the power of Chen and Zhang's MST-based test tends to zero when the variance changes and $d$ is large.

Considering a two-sample testing problem with two independent $d$-dimensional samples of size $m_1$ and $m_2$, respectively, from distributions $F_X$ and $F_Y$, ref. 7 proposes a multivariate generalization of the Wald–Wolfowitz run test (8) using the shortest Hamiltonian path (SHP), where vertices are points in a Euclidean space and edge weights are Euclidean distances between points. They show that the generalized run test is distribution-free and consistent when $N = m_1 + m_2$ is finite and $d$ tends to infinity, which leads us to consider extending their method from a two-sample test to a change-point detection, and to investigate the properties of SHP-based tests for change-point detection.

Our contributions are as follows: (i) a Bayesian-type statistic for change-point detection and a change-point estimate by using ratio cut; (ii) a permutation procedure for approximations to the significance of Bayesian-type statistics; (iii) a theoretical analysis, respectively, on consistent tests for change-point detection and an error probability in change-point estimation; and (iv) a method for tracking cell division using the SHP-based statistics.

Recent alternative approaches for change-point analysis in high-dimensional time series can be found in Cho and Fryzlewicz (9), Jirak (10), and Roy et al. (5), among others.

Change-Point Detection Based on the Minimal Spanning Tree

A change-point is a location or time $t^*$ at which observations or data make a transition from one model (until $t^*$) to another model (after $t^*$). The null hypothesis is that there is no change-point, and the alternative hypothesis is that there exists a change-point $t^*$. We denote, respectively, $\mathrm{pr}_0$ and $\mathrm{pr}_1$ as the probabilities under the null hypothesis and the alternative hypothesis. To detect whether there is a change-point or not, we cut the whole sequence $\{H_j, j = 1, \ldots, N\}$ at an arbitrary point $t$ into two sequences $\{H_j, j = 1, \ldots, t\}$ (until $t$) and $\{H_j, j = t+1, \ldots, N\}$ (after $t$). As in ref. 4, we define

Significance

Change-point detection in high-dimensional time series is necessary in many areas of science and engineering, including neuroscience, signal processing, network evolution, image analysis, and text analysis. In terms of a multivariate generalization of the Wald–Wolfowitz run test using the shortest Hamiltonian path, this paper proposes a distribution-free, consistent graph-based change-point detection for high-dimensional data. Once a change-point is detected, its location is estimated by using ratio cut. The test is very powerful against alternatives with a shift in mean or variance and is accurate in change-point estimation. Its applicability is demonstrated in the example of tracking cell division.

Author contributions: X.S., Y.W., and C.R.R. designed research; X.S., Y.W., and C.R.R. performed research; X.S. analyzed data; and X.S., Y.W., and C.R.R. wrote the paper.

Reviewers: V.K.J., Washington State University; and R.L., Pennsylvania State University.

The authors declare no conflict of interest.

1 To whom correspondence may be addressed. Email: [email protected], [email protected], or [email protected].

www.pnas.org/cgi/doi/10.1073/pnas.1702654114 PNAS | April 11, 2017 | vol. 114 | no. 15 | 3873–3878


Fig. 1. Sample images with dimension 321 × 321 located at 1, 23, 195, and 259 in file folder E00 from celltracking.bio.nyu.edu/. The time points of the first, second, and third division cycles are, respectively, 22, 194, and 258.

\[
C_t^{G} = \sum_{(i,j) \in E(G)} I\{ I(i > t) \neq I(j > t) \}, \tag{1}
\]

where $G$ is an undirected finite graph with vertex set $V(G) = \{1, \ldots, N\}$, $E(G)$ is the edge set, and $I(x)$ is an indicator function that takes the value 1 if $x$ is true, and 0 otherwise.
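To make Eq. 1 concrete, here is a small illustrative sketch in Python. The function name `cut_count` and the toy edge list are hypothetical; the function simply counts the edges whose two endpoints fall on opposite sides of the cut point $t$.

```python
def cut_count(edges, t):
    """C_t^G of Eq. 1: number of edges (i, j) with exactly one endpoint > t."""
    return sum((i > t) != (j > t) for i, j in edges)

# Hypothetical path graph on vertices 1..5 that visits them in the order 1-2-3-5-4.
edges = [(1, 2), (2, 3), (3, 5), (5, 4)]
counts = [cut_count(edges, t) for t in range(1, 5)]
```

Only the last cut ($t = 4$) separates two edges of this toy path, because vertex 5 sits between vertices 3 and 4 in the visiting order.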

Given the minimum spanning tree (MST), ref. 4 proposes a test that uses a scan statistic based on a standardized version of $C_t^{\mathrm{MST}}$ in Eq. 1,

\[
C_N^{\mathrm{MST}} = \max_{n_0 \le t \le n_1} \; - \frac{C_t^{\mathrm{MST}} - E_0(C_t^{\mathrm{MST}})}{\sqrt{\mathrm{var}_0(C_t^{\mathrm{MST}})}}, \tag{2}
\]

where $n_0$ and $n_1$ are prespecified constraints, and $E_0(C_t^{\mathrm{MST}})$ and $\mathrm{var}_0(C_t^{\mathrm{MST}})$ are, respectively, the expectation and variance of $C_t^{\mathrm{MST}}$ under the permutation null. For brevity, we call this the CZ test. Ref. 4 obtains analytic approximations to the null distribution of the scan statistic for large sample size. In addition, ref. 4 applies a permutation procedure to find the critical value and p values for the scan statistic. Note that the permutation is based on a distribution of degrees in the MST of the original observations. If another new set of observations differs from the original ones, then the distribution of degrees in the corresponding MST may also differ, which causes the scan statistic based on the MST to be no longer distribution-free. Thus, in this paper, we aim to find a distribution-free change-point test that is also consistent when the dimension $d$ tends to infinity.

Bayesian-Type Statistic and a Change-Point Estimate Based on the SHP

In light of the Bayesian-type statistic for detecting mean shift in refs. 11 and 12, among others, we propose the following Bayesian-type statistic based on the SHP:

Fig. 2. An illustration of an SHP and the change-point estimate based on ratio cut for $N = 10$ and $d = 2$. (Left) SHP in a complete graph. (Right) $C_t^{\mathrm{SHP}}/D_t$, $1 \le t < N$, and the change-point estimate.

\[
S_N^{\mathrm{SHP}} = \frac{1}{N-1} \sum_{t=1}^{N-1} \frac{\{ C_t^{\mathrm{SHP}} - E_0(C_t^{\mathrm{SHP}}) \}^2}{\mathrm{var}_0(C_t^{\mathrm{SHP}})}, \tag{3}
\]

where $E_0(C_t^{\mathrm{SHP}}) = 2t(N-t)/N$ and $\mathrm{var}_0(C_t^{\mathrm{SHP}}) = 2t(N-t)\{2t(N-t) - N\}/(N^3 - N^2)$ (see ref. 8). If $S_N^{\mathrm{SHP}}$ is large, then the null hypothesis is rejected, and the change-point is estimated by using

\[
t_D = \arg\min_{1 \le t < N} \frac{C_t^{\mathrm{SHP}}}{D_t}, \tag{4}
\]

the ratio cut introduced in ref. 13, where $D_t = t(N - t)$. For brevity, we call this the SWR test.
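The statistic in Eq. 3 and the ratio-cut estimate in Eq. 4 can be sketched in a few lines of Python once a Hamiltonian path is available. In this illustrative sketch the path is assumed to be given as a visiting order of the time indices $1, \ldots, N$ (with $N \ge 3$); the function names are hypothetical and this is not the authors' implementation.

```python
def null_mean(t, n):
    """E_0(C_t^SHP) = 2 t (N - t) / N."""
    return 2.0 * t * (n - t) / n

def null_var(t, n):
    """var_0(C_t^SHP) = 2t(N-t){2t(N-t) - N} / (N^3 - N^2)."""
    d = 2.0 * t * (n - t)
    return d * (d - n) / (n ** 3 - n ** 2)

def cut_counts(path):
    """C_t^SHP for t = 1..N-1, where path lists the time indices in visiting order."""
    n = len(path)
    steps = list(zip(path, path[1:]))  # consecutive vertices form the path edges
    return [sum((a > t) != (b > t) for a, b in steps) for t in range(1, n)]

def swr_statistic(path):
    """Bayesian-type statistic S_N^SHP of Eq. 3."""
    n = len(path)
    c = cut_counts(path)
    total = sum((c[t - 1] - null_mean(t, n)) ** 2 / null_var(t, n)
                for t in range(1, n))
    return total / (n - 1)

def ratio_cut_estimate(path):
    """Change-point estimate t_D of Eq. 4, minimizing C_t / D_t with D_t = t(N - t)."""
    n = len(path)
    c = cut_counts(path)
    return min(range(1, n), key=lambda t: c[t - 1] / (t * (n - t)))

# Hypothetical path: the first five time points are visited first, then the last five.
toy_path = [1, 2, 3, 4, 5, 10, 9, 8, 7, 6]
estimate = ratio_cut_estimate(toy_path)
```

In the toy path, only one edge joins the block $\{1, \ldots, 5\}$ to the block $\{6, \ldots, 10\}$, so the ratio $C_t/D_t$ is smallest at $t = 5$.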

To illustrate the change-point estimate, we present a variance shift model with $N = 10$ and $d = 2$. Each entry of the data is from a normal distribution with mean zero, but its variance changes from 1 to 4 at the change-point 5. As finding an SHP in the complete graph with $N$ vertices is an NP-hard (nondeterministic polynomial time) problem, the efficient heuristic Kruskal algorithm (14) suggested by ref. 7 is applied here by using the msTreeKruskal function in the R package optrees (15).
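One way to picture a Kruskal-type heuristic for an approximate SHP is the greedy construction sketched below in Python: edges are scanned in increasing order of Euclidean length, and an edge is accepted only if both endpoints still have degree below two and no cycle is created. Whether this matches the exact heuristic of ref. 7 or the optrees routine is an assumption; the function name `greedy_hamiltonian_path` is hypothetical, and at least two points are assumed.

```python
import math
from itertools import combinations

def greedy_hamiltonian_path(points):
    """Approximate shortest Hamiltonian path; returns 1-based indices in visiting order."""
    n = len(points)
    parent = list(range(n))  # union-find forest used to detect cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = [0] * n
    adj = [[] for _ in range(n)]
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    for i, j in edges:
        # Accept the edge only if it keeps the graph a set of disjoint paths.
        if degree[i] < 2 and degree[j] < 2 and find(i) != find(j):
            parent[find(i)] = find(j)
            degree[i] += 1
            degree[j] += 1
            adj[i].append(j)
            adj[j].append(i)

    # Walk the single resulting path from one of its two endpoints.
    start = next(v for v in range(n) if degree[v] == 1)
    order, prev = [start], None
    while len(order) < n:
        cur = order[-1]
        nxt = next(v for v in adj[cur] if v != prev)
        prev = cur
        order.append(nxt)
    return [v + 1 for v in order]

# Five hypothetical one-dimensional observations forming two clusters.
path = greedy_hamiltonian_path([(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)])
```

Because the graph is complete, the greedy scan always ends with a single path through all points; it is a heuristic and is not guaranteed to return the shortest one.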

Fig. 2 shows the SHP in a complete graph with $N = 10$ and $d = 2$, and $C_t^{\mathrm{SHP}}/D_t$, $1 \le t < N$. The change-point estimate is determined by finding the value of $t$ that minimizes $C_t^{\mathrm{SHP}}/D_t$ as in Eq. 4, the ratio cut. It can be seen that the change-point estimate using ratio cut is exactly the same as the true change-point.

Permutation Procedure

For a random SHP and an observed SHP denoted as $\mathrm{SHP}_{\mathrm{obs}}$, one usually needs to compute the p value

\[
p = \mathrm{pr}_0\{ S_N^{\mathrm{SHP}} \ge S_N^{\mathrm{SHP}_{\mathrm{obs}}} \} \tag{5}
\]


Table 1. Permuted critical values $c_\alpha$ for Eq. 8 based on various $N$ and $\alpha$, and $B = 100{,}000$ replications

| N | α = 10% | α = 5% | α = 2.5% | α = 1% | α = 0.5% | α = 0.1% |
|---|---|---|---|---|---|---|
| 20 | 1.8 | 2.1 | 2.5 | 3.0 | 3.4 | 4.3 |
| 40 | 1.8 | 2.2 | 2.6 | 3.1 | 3.5 | 4.5 |
| 60 | 1.8 | 2.2 | 2.6 | 3.1 | 3.5 | 4.5 |
| 80 | 1.8 | 2.2 | 2.6 | 3.1 | 3.6 | 4.5 |
| 100 | 1.8 | 2.2 | 2.6 | 3.1 | 3.5 | 4.6 |
| 200 | 1.8 | 2.1 | 2.5 | 3.1 | 3.5 | 4.6 |
| 300 | 1.8 | 2.1 | 2.5 | 3.0 | 3.5 | 4.4 |

or the critical value $c_\alpha$ defined by

\[
\alpha = \mathrm{pr}_0\{ S_N^{\mathrm{SHP}} \ge c_\alpha \}, \tag{6}
\]

for a significance level $\alpha$.

A permutation method is applied to approximate the p value and the critical value $c_\alpha$ by

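\[
p = \frac{1}{B} \sum_{b=1}^{B} I\left( S_N^{\mathrm{PATH}_b} \ge S_N^{\mathrm{SHP}} \right), \tag{7}
\]

\[
c_\alpha = \inf\left\{ x \in \mathbb{R} : \frac{1}{B} \sum_{b=1}^{B} I\left( S_N^{\mathrm{PATH}_b} \ge x \right) \le \alpha \right\}, \tag{8}
\]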

where $\mathrm{PATH}_1, \ldots, \mathrm{PATH}_B$ are independent replicates of a path connecting points sampled without replacement from the set $\{1, \ldots, N\}$.
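The permutation approximations in Eqs. 7 and 8 can be sketched as follows in Python, under the assumption that each replicate $\mathrm{PATH}_b$ is a uniformly random ordering of $\{1, \ldots, N\}$. The function names, the seed, and the small number of replicates are illustrative choices, not the settings used for Table 1.

```python
import random

def swr_statistic(path):
    """S_N of Eq. 3 for a path given as a visiting order of 1..N (N >= 3)."""
    n = len(path)
    steps = list(zip(path, path[1:]))
    total = 0.0
    for t in range(1, n):
        c = sum((a > t) != (b > t) for a, b in steps)
        d = 2.0 * t * (n - t)
        mean = d / n
        var = d * (d - n) / (n ** 3 - n ** 2)
        total += (c - mean) ** 2 / var
    return total / (n - 1)

def permuted_statistics(n, reps, seed=0):
    """Statistics of `reps` random paths on {1, ..., n}."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        path = list(range(1, n + 1))
        rng.shuffle(path)  # sampling the vertices without replacement
        stats.append(swr_statistic(path))
    return stats

def permutation_p_value(observed, stats):
    """Eq. 7: share of permuted statistics that reach the observed value."""
    return sum(s >= observed for s in stats) / len(stats)

def critical_value(stats, alpha):
    """Eq. 8: smallest x with at most a fraction alpha of the statistics >= x."""
    ordered = sorted(stats)
    k = int(alpha * len(ordered))  # number of exceedances allowed
    return ordered[len(ordered) - k - 1]

stats = permuted_statistics(n=20, reps=2000)
c_05 = critical_value(stats, alpha=0.05)
p_sorted = permutation_p_value(swr_statistic(list(range(1, 21))), stats)
```

With only 2,000 replicates the estimated critical value is noisy, so it should be compared with the corresponding entry of Table 1 only roughly. The perfectly ordered path $1, 2, \ldots, 20$ is an extreme case with $C_t = 1$ for every $t$, and its permutation p value is very small.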

Table 1 presents the estimated critical values by using Eq. 8 under the permutation null hypothesis based on various $N$ and $\alpha$, and $B = 100{,}000$ permutations. Fig. 3 compares the distributions of permuted $S_N^{\mathrm{PATH}}$ in Eq. 3 for $N = 20$ and $N = 300$. As can be seen, the estimated critical values are very close for various $N$ because the sample size $N$ has little impact on the permuted null distributions of $S_N^{\mathrm{PATH}}$.

Theoretical Analysis

Consistency in Test. A theoretical investigation of asymptotic behaviors when $N$ is fixed and $d$ tends to infinity is established to show that the SWR test is consistent and that the change-point estimate $t_D$ in Eq. 4 provides an accurate estimate of the true change-point $t^*$.

Fig. 3. Histograms of permuted $S_{20}^{\mathrm{PATH}}$ and $S_{300}^{\mathrm{PATH}}$.

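Suppose that the $t^*$ independent observations on $\mathbf{X} = (X_1, \ldots, X_d)'$ are from distribution $F_{\mathbf{X}}$ and the $N - t^*$ independent observations on $\mathbf{Y} = (Y_1, \ldots, Y_d)'$ are from distribution $F_{\mathbf{Y}}$, $F_{\mathbf{X}} \neq F_{\mathbf{Y}}$, and the observations on $\mathbf{X}$ and $\mathbf{Y}$ are independent. Let $\mathbf{X}_1 = (X_{1,1}, \ldots, X_{1,d})'$ and $\mathbf{X}_2 = (X_{2,1}, \ldots, X_{2,d})'$ be two independent copies of $\mathbf{X}$, and let $\mathbf{Y}_1 = (Y_{1,1}, \ldots, Y_{1,d})'$ and $\mathbf{Y}_2 = (Y_{2,1}, \ldots, Y_{2,d})'$ be two independent copies of $\mathbf{Y}$.

Assumption 1. Suppose that $\|\mathbf{X}_1 - \mathbf{X}_2\|/\sqrt{d} \to \sigma_1\sqrt{2}$, $\|\mathbf{Y}_1 - \mathbf{Y}_2\|/\sqrt{d} \to \sigma_2\sqrt{2}$, and $\|\mathbf{X} - \mathbf{Y}\|/\sqrt{d} \to \sqrt{\sigma_1^2 + \sigma_2^2 + \nu^2}$ in probability as $d \to \infty$, where $\nu^2 = \lim_{d \to \infty} d^{-1} \sum_{q=1}^{d} \{E(X_q) - E(Y_q)\}^2$, $\sigma_1^2 = \lim_{d \to \infty} d^{-1} \sum_{q=1}^{d} \mathrm{var}(X_q)$, and $\sigma_2^2 = \lim_{d \to \infty} d^{-1} \sum_{q=1}^{d} \mathrm{var}(Y_q)$.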

Assumption 2. There exists an $N_\alpha$ such that

\[
\min \sum_{|t - t^*| \le N_\alpha} \frac{1}{N-1} \sum_{t=1}^{N-1} \frac{[\kappa_t - E_0(C_t^{\mathrm{SHP}})]^2}{\mathrm{var}_0(C_t^{\mathrm{SHP}})} > c_\alpha,
\]

where $\kappa_{t^*} \le 2$ and $|\kappa_t - \kappa_{t \pm 1}| \le 2$ for all $t$.

Assumption 1 is based on weak convergence, which is a modified version of three assumptions in theorem 1 of ref. 7. If the components of $\mathbf{X}$ and $\mathbf{Y}$ are independent and identically distributed, as for the normal examples in Eq. 10 with $\mu = 0.3$ and $\eta = 1.3$, then $\sigma_1^2 = 1$, $\sigma_2^2 = 1.3^2$, and $\nu^2 = 0.3^2$. Under Assumption 1, if $\nu^2 > 0$ or $\sigma_1^2 \neq \sigma_2^2$, then theorem 1 of ref. 7 implies that $C_{t^*}^{\mathrm{SHP}} \le 2$ in probability as $d \to \infty$. In Assumption 2, $E_0(C_t^{\mathrm{SHP}})$ and $\mathrm{var}_0(C_t^{\mathrm{SHP}})$ are defined in Eq. 3, and $c_\alpha$ can be estimated by Table 1 based on various $N$ and $\alpha$. For example, set $N = 100$ and $t^* = 50$. If $\alpha = 0.05$, Table 1 suggests choosing $N_\alpha = 1$, as the minimum in Assumption 2 is 2.7, which is greater than the estimated $c_\alpha$, 2.2. Further, for a smaller $\alpha$ such as 0.001, $N_\alpha$ can be 3 because the minimum is 5.7, which is greater than the estimated $c_\alpha$, 4.6. The following theorem shows that the SWR test is consistent.

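Theorem 1. Under Assumptions 1–2, for a predefined positive number $\alpha$, if $\nu^2 > 0$ or $\sigma_1^2 \neq \sigma_2^2$, the power of the SWR test of level $\alpha$ converges to 1 as $d \to \infty$.

Proof. We need to show that $\mathrm{pr}_1(S_N^{\mathrm{SHP}} \le c_\alpha) \to 0$ as $d \to \infty$. It is easy to see that

\[
\mathrm{pr}_1(S_N^{\mathrm{SHP}} \le c_\alpha) \le \mathrm{pr}_1(S_N^{\mathrm{SHP}} \le c_\alpha,\ C_{t^*}^{\mathrm{SHP}} \le 2) + \mathrm{pr}_1(C_{t^*}^{\mathrm{SHP}} > 2).
\]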


Table 2. Simulated type I errors for SWR test

| N | d = 10 | d = 50 | d = 100 | d = 500 | d = 1,000 | d = 5,000 |
|---|---|---|---|---|---|---|
| 20 | 0.041 | 0.050 | 0.043 | 0.043 | 0.051 | 0.050 |
| 40 | 0.060 | 0.038 | 0.044 | 0.035 | 0.053 | 0.049 |
| 60 | 0.052 | 0.038 | 0.057 | 0.056 | 0.037 | 0.048 |
| 80 | 0.048 | 0.039 | 0.052 | 0.047 | 0.059 | 0.042 |
| 100 | 0.044 | 0.049 | 0.041 | 0.050 | 0.043 | 0.042 |
| 200 | 0.043 | 0.055 | 0.056 | 0.058 | 0.047 | 0.050 |
| 300 | 0.054 | 0.053 | 0.054 | 0.056 | 0.050 | 0.056 |

By Assumption 1 and Biswas et al. (7, theorem 1), it follows that

\[
\mathrm{pr}_1(C_{t^*}^{\mathrm{SHP}} > 2) \to 0, \quad \text{as } d \to \infty.
\]

For any $t \in \{2, \ldots, N-1\}$, $|C_t^{\mathrm{SHP}} - C_{t \pm 1}^{\mathrm{SHP}}| \le 2$ due to the connected path. When $C_{t^*}^{\mathrm{SHP}} \le 2$, it can be shown that

\[
S_N^{\mathrm{SHP}} \ge \min \sum_{|t - t^*| \le N_\alpha} \frac{1}{N-1} \sum_{t=1}^{N-1} \frac{[\kappa_t - E_0(C_t^{\mathrm{SHP}})]^2}{\mathrm{var}_0(C_t^{\mathrm{SHP}})},
\]

where $\kappa_{t^*} \le 2$, $|\kappa_t - \kappa_{t \pm 1}| \le 2$, and hence $\mathrm{pr}_1(S_N^{\mathrm{SHP}} \le c_\alpha,\ C_{t^*}^{\mathrm{SHP}} \le 2) = 0$ by Assumption 2, which concludes the theorem.

The main two reasons for consistency of the SWR test are (i) $C_{t^*}^{\mathrm{SHP}} \le 2$ in probability as $d \to \infty$ and (ii) $|C_t^{\mathrm{SHP}} - C_{t \pm 1}^{\mathrm{SHP}}| \le 2$ for any $1 \le t < N$ due to the connected path. If the SHP is replaced by MST and the corresponding Bayesian-type statistic is denoted by $S_N^{\mathrm{MST}}$, then $C_{t^*}^{\mathrm{MST}}$ converges to 2 in probability as $d \to \infty$ under some conditions (7, theorem 2, i), but the difference $|C_t^{\mathrm{MST}} - C_{t \pm 1}^{\mathrm{MST}}|$ might be very large due to the tree where a vertex can have the largest degree $N - 1$. The possible large difference may result in the loss of power.

Error Probability in Change-Point Estimation. If $t < t^*$, let $R_t^*$ be the number of edge links from the two partitions $\{1, \ldots, t\}$ and $\{t+1, \ldots, t^*\}$; if $t > t^*$, let $R_t^*$ be the number of edge links from the two partitions $\{t^*+1, \ldots, t\}$ and $\{t+1, \ldots, N\}$. By the definition of $C_t^{\mathrm{SHP}}$, it follows that $C_t^{\mathrm{SHP}} \ge R_t^*$ when $|t - t^*| \ge M$ with $M \ge 1$. The following theorem provides an error bound for the change-point estimate.

Fig. 4. Comparison of powers of SWR and CZ tests with 200 replications, two parameter combinations ($\mu = 0.3$, $\eta = 1$) and ($\mu = 0$, $\eta = 1.3$), and two location combinations for sample sizes ($N = 40$, $N = 100$), and the change-point locations ($t^* = N/2$, $t^* = N/4$).

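Theorem 2. Under Assumption 1, if $\nu^2 > 0$ or $\sigma_1^2 \neq \sigma_2^2$, and $t_0 \le t^* \le t_1$, then

\[
\mathrm{pr}_1(|t_D - t^*| \ge M) \le \mathrm{pr}_1\{ C_{t^*}^{\mathrm{SHP}} > 2 \} + \sum_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \mathrm{pr}_0(R_t^* < 2 D_t / D_{t^*}). \tag{9}
\]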

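Proof. It can be seen that

\[
\mathrm{pr}_1(|t_D - t^*| \ge M) \le \mathrm{pr}_1(C_{t^*}^{\mathrm{SHP}} > 2) + \mathrm{pr}_1\left( \min_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \frac{C_t^{\mathrm{SHP}}}{D_t} < \frac{C_{t^*}^{\mathrm{SHP}}}{D_{t^*}},\ C_{t^*}^{\mathrm{SHP}} \le 2 \right).
\]

In light of Biswas et al. (7, theorem 1), $\mathrm{pr}_1(C_{t^*}^{\mathrm{SHP}} > 2) \to 0$, as $d \to \infty$. Because $C_t^{\mathrm{SHP}} \ge R_t^*$ for $|t - t^*| \ge M$,

\[
\begin{aligned}
\mathrm{pr}_1\left( \min_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \frac{C_t^{\mathrm{SHP}}}{D_t} < \frac{C_{t^*}^{\mathrm{SHP}}}{D_{t^*}},\ C_{t^*}^{\mathrm{SHP}} \le 2 \right)
&\le \mathrm{pr}_1\left( \min_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \frac{C_t^{\mathrm{SHP}}}{D_t} < \frac{2}{D_{t^*}} \right) \\
&\le \mathrm{pr}_0\left( \min_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \frac{R_t^*}{D_t} < \frac{2}{D_{t^*}} \right) \\
&\le \sum_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \mathrm{pr}_0(R_t^* < 2 D_t / D_{t^*}).
\end{aligned}
\]

Therefore, the theorem follows.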


Fig. 5. Comparison of box plots of the change-point estimates based on the proposed method and the method given in ref. 4 for each of the dimensions $d_1 = 10$, $d_2 = 50$, $d_3 = 100$, $d_4 = 500$, $d_5 = 1{,}000$, and $d_6 = 5{,}000$, with two parameter combinations ($\mu = 0.3$, $\eta = 1$) and ($\mu = 0$, $\eta = 1.3$), the respective change-point locations $t^* = N/2, N/4$, and two sample sizes $N = 40, 100$.

By Assumption 1, the first error probability $\mathrm{pr}_1\{C_{t^*}^{\mathrm{SHP}} > 2\}$ in Eq. 9 tends to 0 as $d \to \infty$. The second error probability $\mathrm{pr}_0(R_t^* < 2 D_t / D_{t^*})$ in Eq. 9 can be obtained from formulas 7 and 8 in ref. 8,

\[
\mathrm{pr}_0(R_t^* = 2k - 1) = 2 \binom{t-1}{k-1} \binom{t^*-t-1}{k-1} \Big/ \binom{t^*}{t}, \quad k = 1, \ldots, \min\{t, t^* - t\},\ t < t^*,
\]

\[
\mathrm{pr}_0(R_t^* = 2k - 2) = \left\{ \binom{t-1}{k-1} \binom{t^*-t-1}{k-2} + \binom{t-1}{k-2} \binom{t^*-t-1}{k-1} \right\} \Big/ \binom{t^*}{t}, \quad k = 2, \ldots, \min\{t, t^* - t\} + 1,\ t < t^*,
\]

\[
\mathrm{pr}_0(R_t^* = 2k - 1) = 2 \binom{t-t^*-1}{k-1} \binom{N-t-1}{k-1} \Big/ \binom{N-t^*}{t-t^*}, \quad k = 1, \ldots, \min\{t - t^*, N - t\},\ t > t^*,
\]

\[
\mathrm{pr}_0(R_t^* = 2k - 2) = \left\{ \binom{t-t^*-1}{k-1} \binom{N-t-1}{k-2} + \binom{t-t^*-1}{k-2} \binom{N-t-1}{k-1} \right\} \Big/ \binom{N-t^*}{t-t^*}, \quad k = 2, \ldots, \min\{t - t^*, N - t\} + 1,\ t > t^*.
\]

Fig. 6. $C_t^{\mathrm{SHP}}/D_t$ based on the respective square-root transformed 10-dimensional vectors (A) $\{H_t, t = 1, 2, \ldots, 285\}$, (B) $\{H_t, t = 1, 2, \ldots, 194\}$, and (C) $\{H_t, t = 195, 196, \ldots, 285\}$.


Table 3. Change-point estimates in cell images

| d | Square root | Logarithmic |
|---|---|---|
| 10 | (194, 22, 262) | (194, 22, 262) |
| 50 | (194, 21, 263) | (194, 21, 263) |

To show how to calculate the second error probability $\sum_{t_0 \le t \le t_1,\ |t - t^*| \ge M} \mathrm{pr}_0(R_t^* < 2 D_t / D_{t^*})$, consider $N = 100$ and $t_0 = 1 = N - t_1$; if $t^* = 50$, the error probabilities are, respectively, $2.2 \times 10^{-6}$ for $M = 5$ and $0.3 \times 10^{-2}$ for $M = 2$ after rounding. The second error probability will decrease if $M$ increases, but may increase if $t^*$ is near the beginning or end of the sequence. For instance, when $t^* = 40$, the second error probability increases to $1.6 \times 10^{-5}$ for $M = 5$ and $0.4 \times 10^{-1}$ for $M = 2$.
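The calculation of the second error probability can be sketched in Python directly from the run-count formulas of ref. 8 quoted above. The function names are hypothetical, and the comparison $R_t^* < 2D_t/D_{t^*}$ is evaluated in integer arithmetic to avoid rounding issues; this is an illustrative reading of Eq. 9, not the authors' code.

```python
from math import comb

def crossing_pmf(m, n):
    """Null distribution of R, the number of path edges linking two
    groups of sizes m and n (m, n >= 1), keyed by the value of R."""
    total = comb(m + n, m)
    pmf = {}
    for k in range(1, min(m, n) + 1):       # odd values R = 2k - 1
        pmf[2 * k - 1] = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1) / total
    for k in range(2, min(m, n) + 2):       # even values R = 2k - 2
        pmf[2 * k - 2] = (comb(m - 1, k - 1) * comb(n - 1, k - 2)
                          + comb(m - 1, k - 2) * comb(n - 1, k - 1)) / total
    return pmf

def second_error_bound(N, t_star, M, t0=1, t1=None):
    """Sum over t0 <= t <= t1 with |t - t*| >= M of pr0(R*_t < 2 D_t / D_t*)."""
    t1 = N - 1 if t1 is None else t1
    d_star = t_star * (N - t_star)
    bound = 0.0
    for t in range(t0, t1 + 1):
        if abs(t - t_star) < M:
            continue
        # Group sizes on either side of t within the same regime as t.
        m, n = (t, t_star - t) if t < t_star else (t - t_star, N - t)
        d_t = t * (N - t)
        bound += sum(p for r, p in crossing_pmf(m, n).items()
                     if r * d_star < 2 * d_t)
    return bound

bound_center = second_error_bound(N=100, t_star=50, M=5)
bound_offset = second_error_bound(N=100, t_star=40, M=5)
```

Under these assumptions the sketch gives values close to the $M = 5$ figures quoted in the text for $t^* = 50$ and $t^* = 40$.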

Data Examples

Simulations. Consider

\[
X_t = \begin{cases} u_t, & 1 \le t \le N/2, \\ \mu 1_d + \eta u_t, & N/2 < t \le N, \end{cases} \tag{10}
\]

where $u_t \sim N_d\{(0, \ldots, 0)', I_d\}$ with $I_d$ being the $d \times d$ identity matrix, and $1_d = (1, 1, \ldots, 1)'$.
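As an illustration of the simulation model in Eq. 10, the following Python sketch generates one sequence with a change after observation $t^*$ (defaulting to $N/2$). The function name `simulate_sequence`, the seed handling, and the optional `t_star` argument are assumptions added for convenience.

```python
import random

def simulate_sequence(N, d, mu, eta, t_star=None, seed=0):
    """Draw X_1, ..., X_N following Eq. 10; returns a list of N lists of length d."""
    rng = random.Random(seed)
    t_star = N // 2 if t_star is None else t_star
    data = []
    for t in range(1, N + 1):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]   # u_t ~ N_d(0, I_d)
        if t <= t_star:
            data.append(u)                            # before the change
        else:
            data.append([mu + eta * x for x in u])    # mu * 1_d + eta * u_t
    return data

# Variance-shift setting (mu = 0, eta = 1.3) with N = 40 and d = 10.
sample = simulate_sequence(N=40, d=10, mu=0.0, eta=1.3)
```

Setting `mu=0` and `eta=1` gives the null model used for the type I error study, since no change then occurs.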

In Table 2, simulated type I errors for the SWR test are compared based on 1,000 simulations, with $N = 20, 40, 60, 80, 100, 200$, and $300$, $\alpha = 0.05$, $\mu = 0$, $\eta = 1$, $d = 10, 50, 100, 500, 1{,}000$, and $5{,}000$ in Eq. 10, and estimated critical values $c_{0.05}$ in Table 1. It can be seen from Table 2 that the SWR test has a satisfactory accuracy.

To examine the powers of both SWR and CZ tests, 200 simulations for the two parameter combinations ($\mu = 0.3$, $\eta = 1$) and ($\mu = 0$, $\eta = 1.3$) suggested by section 2.4 of ref. 7 are carried out. To investigate the effect of a change-point location for different $N$, the locations $t^* = N/2, N/4$ are considered with $N = 40, 100$. The critical values for $\alpha = 0.05$ are based on Table 1 for $S_N^{\mathrm{SHP}}$ in Eq. 3 and Chen and Zhang's analytical approximation [R package gSeg (16)] for $C_N^{\mathrm{MST}}$ in Eq. 2. Fig. 4 shows the percentage of simulations in which the null hypothesis is rejected at the 0.05 level for each of the SWR and CZ tests.

It can be seen from Fig. 4 that the power of the SWR test increases monotonically as $d$ increases, which suggests that this test may be consistent. For the mean-shift model with $\mu = 0.3$, $\eta = 1$, and $t^* = N/2$, the CZ test is more powerful than the SWR test. However, for the other models with a shift in variance ($\mu = 0$, $\eta = 1.3$) or ($\mu = 0.3$, $\eta = 1.3$), the power of the CZ test converges to zero, which may be explained by Biswas et al. (7, theorem 2, ii).

Further comparisons are carried out for each of the three change-point estimates based on the ratio cut Eq. 4 and the scan statistic in Eq. 2, where the change-point estimate is determined by finding the value of $t$ that maximizes $C_N^{\mathrm{MST}}$, as in Eq. 2. Fig. 5 shows the box plots of these estimates in order for each of the dimensions $d_1 = 10$, $d_2 = 50$, $d_3 = 100$, $d_4 = 500$, $d_5 = 1{,}000$, and $d_6 = 5{,}000$.

It can be seen from Fig. 5 that the ratio cut tends to give accurate estimates when $d$ increases; CZ estimates (Eq. 2) have a comparable performance only for the mean shift model with $\mu = 0.3$ and $\eta = 1$, but tend to be biased when $d$ increases for other models with a shift in variance.

Cell Division Detection. To illustrate the application to some real data, we use the 321 × 321 cell images provided in ref. 1, on square-root or logarithmic transformed 10-dimensional vectors $H_t = (h_{t,1}, h_{t,2}, \ldots, h_{t,10})'$ for $t = 1, 2, \ldots, 285$.

For a square-root transformation, the SWR test statistic $S_{285}^{\mathrm{SHP}} = 199.5$, which suggests that there is a change-point. Fig. 6A displays $C_t^{\mathrm{SHP}}/D_t$ (Eq. 4), where the ratio cut yields the change-point estimate 194. Let us divide the data sequence at 194 and consider the first segment. As the SWR test statistic $S_{193}^{\mathrm{SHP}} = 121.5$, there exists a change-point there. The result is shown in Fig. 6B, where the ratio cut locates a change-point at 22. In the second segment, as the SWR test statistic $S_{90}^{\mathrm{SHP}} = 55.6$, there also exists a change-point. The result is shown in Fig. 6C, where the ratio cut gives the change-point estimate 262. After these three change-point estimates, corresponding to three division cycles, are obtained, the detection procedure stops, and no more segmentation of the data sequence is needed.

The logarithmic transformation produces the same results as the square-root transformation for $d = 10$. Table 3 shows the results for $d = 10$ and $d = 50$ based on the proposed method. Because the three change-point estimates match with the time points of the first, second, and third division cycles in the cell images displayed in Fig. 1, our method has a satisfactory performance.

Discussion and Conclusions

A graph-based method is developed for the detection and estimation of unknown change-points in high-dimensional data. It performs well when applied to the problem of identifying the time points at which cell division occurs in the monitoring of an embryo. By Theorem 1, the larger the dimension of the data, the more powerful the SWR test. Thus, to improve the tracking of cell division, one might input the green fluorescent protein, which may significantly improve the performance of the SWR test.

ACKNOWLEDGMENTS. We thank Dr. M. Cicconet for allowing us to use his cell image data. We also thank Dr. R. Brewster, Dr. R. Yu, and Dr. B. Crofoot for their helpful suggestions. The research is partially supported by the Natural Sciences and Engineering Research Council of Canada.

1. Cicconet M, Gutwein M, Gunsalus KC, Geiger D (2014) Label free cell-tracking and division detection based on 2D time-lapse images for lineage analysis of early embryo development. Comput Biol Med 51:24–34.

2. Pass G, Zabih R, Miller J (1996) Comparing images using color coherence vectors. Proceedings of the Fourth ACM International Conference on Multimedia (Assoc Comput Machinery, New York), pp 65–73.

3. Demant C, Streicher-Abel B, Garnica C (2013) Industrial Image Processing: Visual Quality Control in Manufacturing (Springer, Heidelberg), 2nd Ed.

4. Chen H, Zhang N (2015) Graph-based change-point detection. Ann Stat 43:139–176.

5. Roy S, Atchade Y, Michailidis G (2017) Change point estimation in high dimensional Markov random-field models. J R Stat Soc Series B Stat Methodol, 10.1111/rssb.12205.

6. Friedman JH, Rafsky LC (1979) Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann Stat 7:697–717.

7. Biswas M, Mukhopadhyay M, Ghosh AK (2014) A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101:913–926.

8. Wald A, Wolfowitz J (1940) On a test whether two samples are from the same distribution. Ann Math Stat 11:147–162.

9. Cho H, Fryzlewicz P (2015) Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. J R Stat Soc Series B Stat Methodol 77:475–507.

10. Jirak M (2015) Uniform change point tests in high dimension. Ann Stat 43:2451–2483.

11. Chernoff H, Zacks S (1964) Estimating the current mean of a normal distribution which is subjected to changes in time. Ann Math Stat 35:999–1018.

12. Gardner JA (1969) On detecting changes in the mean of normal variates. Ann Math Stat 40:116–126.

13. Wei YC, Cheng CK (1989) Towards efficient hierarchical designs by ratio cut partitioning. Computer-Aided Design (Inst Electr Electron Eng, New York), pp 298–301.

14. Kruskal JB (1956) On the shortest spanning subtree of a graph and the travelling salesman problem. Proc Am Math Soc 7:48–50.

15. Fontenla M (2014) Optrees: Optimal trees in weighted graphs. Available at https://CRAN.R-project.org/package=optrees, version 1.0. Accessed October 12, 2016.

16. Chen H, Zhang N (2014) gSeg: Graph-based change-point detection (g-segmentation). R package version 0.1. Available at CRAN.R-project.org/package=gSeg. Accessed October 27, 2015.
