11 elsner ch11 - florida state universitymyweb.fsu.edu/jelsner/book/01_proofs...


Elsner: “11˙ELSNER˙CH11” — 2012/9/24 — 17:14 — page 275 — #1

11 CLUSTER MODELS

“There are in fact two things, science and opinion; the former begets knowledge, the latter, ignorance.” -- Hippocrates

A cluster is a group of the same or similar events close together. Clusters arise in hurricane origin locations, tracks, and landfalls. In this chapter, we look at how to analyze and model clusters. We divide the chapter into time, space, and feature clustering. Of the three, feature clustering is best known to climatologists. We begin by showing you how to detect and model time clusters.

11.1 TIME CLUSTERS

Hurricanes form over certain regions of the ocean. Consecutive hurricanes originating in the same area often take similar paths. This grouping, or clustering, increases the potential for multiple landfalls above what you expect from random events. A statistical model for landfall probability captures clustering through covariates like the North Atlantic Oscillation (NAO), which relates a steering mechanism (position and strength of the subtropical high pressure) to coastal hurricane activity. But there could be additional time correlation not related to the covariates. A model that does not account for this extra variation will underestimate the potential for multiple hits in a season.

Following Jagger and Elsner (2006), you consider three coastal regions including the Gulf Coast, Florida, and the East Coast (Fig. 6.2). Regions are large enough to capture enough hurricanes, but not so large as to include many non-coastal strikes. Here you use hourly position and intensity data described in Chapter 6. For each hurricane, you note its wind speed maximum within each region. If the maximum wind exceeds 33 m s⁻¹, then you count it as a hurricane for the region. A tropical cyclone that affects more than one region at hurricane intensity is counted in each region. Because of this, the sum of the regional counts is larger than the total count.

Begin by loading annual.RData. These data were assembled in Chapter 6. Subset the data for years starting with 1866.

> load("annual.RData")

> dat = subset(annual, Year >= 1866)

The covariate Southern Oscillation Index (SOI) data begin in 1866. Next, extract all hurricane counts for the Gulf coast, Florida, and East coast regions.

> cts = dat[, c("G.1", "F.1", "E.1")]

11.1.1 Cluster Detection

You start by comparing the observed with the expected number of years for the two groups of hurricane counts. The groups include years with no hurricanes and years with three or more. The expected number is from a Poisson distribution with a constant rate. The idea is that for regions that show a cluster of hurricanes, the observed number of years with no hurricanes and years with three or more hurricanes should be greater than the corresponding expected number. Said another way, a Poisson model with a hurricane rate estimated from counts over all years in regions with clustering will underestimate the number of years with no hurricanes and years with many hurricanes.

For example, you find the observed number of years without a Florida hurricane and the number of years with more than two hurricanes by typing

> obs = table(cut(cts$F.1,

+ breaks=c(-.5, .5, 2.5, Inf)))

> obs

(-0.5,0.5] (0.5,2.5] (2.5,Inf]

70 62 13

And the expected numbers for these three groups by typing

> n = length(cts$F.1)

> mu = mean(cts$F.1)

> exp = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda=mu))

> exp

[1] 61.66 75.27 8.07

The observed and expected counts for the three regions are given in Table 11.1. In the Gulf and East coast regions, the observed numbers are relatively close to the expected numbers in each of the groups. By contrast, in the Florida region, you see the observed numbers exceed the expected numbers for the no-hurricane and the three-or-more hurricanes groups.
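If annual.RData is not at hand, the same comparison can be tried on simulated counts. The sketch below is only an illustration: the counts are generated from an assumed cluster process (the rate 0.7 and probability 0.3 are made-up values, not estimates from the hurricane record), and the name `expd` is used instead of `exp` to avoid masking the exponential function.

```r
# Simulated stand-in for annual regional counts (not the book's data).
set.seed(1)
n_years <- 145
n_clusters <- rpois(n_years, lambda = 0.7)            # clusters per year (assumed rate)
counts <- n_clusters + rbinom(n_years, size = n_clusters, prob = 0.3)

# Observed number of years with 0, 1-2, and 3 or more hurricanes.
obs <- table(cut(counts, breaks = c(-0.5, 0.5, 2.5, Inf)))

# Expected numbers under a Poisson distribution with a constant rate.
mu <- mean(counts)
expd <- n_years * diff(ppois(c(-Inf, 0, 2, Inf), lambda = mu))

round(rbind(observed = as.numeric(obs), expected = expd), 1)
```

Because the simulated counts come from a clustered process, you would typically expect the observed numbers in the first and last groups to exceed the Poisson expectation, although any single simulated record can differ.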


Table 11.1 Observed (O) and expected (E) number of hurricane years by count groups.

Region       O(= 0)  E(= 0)  O(≥ 3)  E(≥ 3)
Gulf Coast   63      66.1    7       6.6
Florida      70      61.7    13      8.1
East Coast   74      72.3    7       4.9

Table 11.2 Observed versus expected statistics. The Pearson and χ² test statistics along with the corresponding p-values are given for each coastal region.

Region       Pearson  p-value  χ²     p-value
Gulf Coast   135      0.6858   0.264  0.894
Florida      187      0.0092   6.475  0.039
East Coast   150      0.3440   1.172  0.557

The difference between the observed and expected numbers is used to assess the statistical significance of the clustering. This is done using Pearson residuals and the χ² statistic. The Pearson residual is the difference between the observed count and expected rate divided by the square root of the variance. The p-value is evidence in support of the null hypothesis of no clustering.

For example, to obtain the χ² statistic, type

> xis = sum((obs - exp)^2 / exp)

> xis

[1] 6.48

The p-value as evidence in support of the null hypothesis is given by

> pchisq(q=xis, df=2, lower.tail=FALSE)

[1] 0.0393

where df is the degrees of freedom equal to the number of groups minus one. The χ² and Pearson statistics for the three regions are shown in Table 11.2. The p-values for the Gulf and East coasts are greater than 0.05 indicating little support for the cluster hypothesis. By contrast, the p-value for the Florida region is 0.009 using the Pearson residuals and 0.039 using the χ² statistic. Those values provide evidence that hurricane occurrences in Florida are clustered in time.
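The Florida calculation can be reproduced without the data file by typing in the observed and expected numbers printed earlier. The sketch below does this and also shows how much each count group contributes to the statistic; the object names (`obs_F`, `exp_F`, `contrib`) are chosen here for illustration.

```r
# Observed and expected numbers of Florida years, copied from the printed output.
obs_F <- c(none = 70, one_or_two = 62, three_plus = 13)
exp_F <- c(none = 61.66, one_or_two = 75.27, three_plus = 8.07)

# Contribution of each group to the chi-squared statistic.
contrib <- (obs_F - exp_F)^2 / exp_F
xis <- sum(contrib)

# Upper-tail probability with (number of groups - 1) degrees of freedom.
pval <- pchisq(xis, df = length(obs_F) - 1, lower.tail = FALSE)

round(contrib, 2)
round(c(statistic = xis, p.value = pval), 3)
```

The statistic and p-value agree with the Florida row of Table 11.2 to rounding.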

11.1.2 Conditional Counts

What might be causing this clustering? The extra variation in annual hurricane counts might be due to variation in hurricane rates. You examine this possibility with a Poisson regression model (see Chapter 7). The model includes an index for the NAO and the SOI after Elsner and Jagger (2006).

This is a generalized linear model (GLM) approach using the Poisson family that includes a logarithmic link function for the rate. The annual hurricane count model is

\[
H_i \sim \mathrm{dpois}(\lambda_i) \tag{11.1}
\]
\[
\log(\lambda_i) = \beta_0 + \beta_{soi}\,\mathrm{SOI}_i + \beta_{nao}\,\mathrm{NAO}_i
\]

where H_i is the hurricane count in year i simulated (∼) from a Poisson distribution with a rate λ_i that depends on the year i. The logarithm of the rate depends in a linear way on the SOI and NAO covariates. The code to fit the model, determine the expected count, and table the observed versus expected counts for each region is given below.

> pfts = list(G.1 = glm(G.1 ~ nao + soi,
+   family="poisson", data=dat),
+   F.1 = glm(F.1 ~ nao + soi, family="poisson",
+   data=dat),
+   E.1 = glm(E.1 ~ nao + soi, family="poisson",
+   data=dat))

> prsp = sapply(pfts, fitted, type="response")

> rt = regionTable(cts, prsp, df=3)

The count model provides an expected number of hurricanes in each year. The expected is compared to the observed as before. Results indicate that clustering is somewhat ameliorated by conditioning the rates on the covariates. In particular, the Pearson residual reduces to 172.4 with an increase in the corresponding p-value to 0.042. However, the p-value remains near 0.15 indicating the conditional model, while an improvement, fails to capture all the extra (beyond Poisson) variation in Florida hurricane counts.
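The regionTable function is not listed in this section, so its internals are not known here. As a rough stand-in, the sketch below shows one common way to get a Pearson statistic from a Poisson regression: sum the squared Pearson residuals and compare the total with a χ² distribution on the residual degrees of freedom. The data are simulated and the coefficient values are assumptions made only for illustration.

```r
# Simulated covariates and clustered counts (stand-ins for nao, soi, and F.1).
set.seed(42)
n <- 145
sim <- data.frame(nao = rnorm(n), soi = rnorm(n))
r <- exp(-0.4 - 0.25 * sim$nao + 0.03 * sim$soi)   # assumed cluster rate
N <- rpois(n, r)
sim$H <- N + rbinom(n, size = N, prob = 0.3)       # one or two hurricanes per cluster

fit <- glm(H ~ nao + soi, family = poisson, data = sim)

# Pearson statistic: sum of squared Pearson residuals.
pearson <- sum(residuals(fit, type = "pearson")^2)

# Dispersion estimate and upper-tail p-value on the residual degrees of freedom.
dispersion <- pearson / df.residual(fit)
pval <- pchisq(pearson, df = df.residual(fit), lower.tail = FALSE)
c(pearson = pearson, dispersion = dispersion, p.value = pval)
```

A dispersion noticeably above one points to extra-Poisson variation that the covariates do not explain.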

11.1.3 Cluster Model

Having found evidence that Florida hurricanes arrive in clusters, next you model this process. In the simplest case, you assume the following. Each cluster has either one or two hurricanes and the annual cluster count is described by a Poisson distribution with a rate r. Note the difference. Earlier, you assumed that each hurricane was independent and the annual hurricane count followed a Poisson distribution. Furthermore, now you let p be the probability that a cluster will have two hurricanes.

Formally, your model can be expressed as follows. Let N be the number of clusters in a given year and X_i, i = 1, ..., N be the number of hurricanes in each cluster minus one. Then, the number of hurricanes in a given year is $H = N + \sum_{i=1}^{N} X_i$. Conditional on N, $M = \sum_{i=1}^{N} X_i$ is described by a binomial distribution since the X_i are independent Bernoulli variables and p is constant. That is, H = N + M, where the annual number of clusters N has a Poisson distribution with cluster rate r, and M has a binomial distribution with proportion p and size N. Here the binomial distribution describes the number of two-hurricane clusters among the N independent clusters, with each cluster having a probability p of containing a second hurricane.

In summary, your cluster model has the following properties:

1. The expected number of hurricanes is given by E(H) = r(1 + p).

2. The variance of H is given by

\[
\begin{aligned}
\mathrm{var}(H) &= E(\mathrm{var}(H \mid N)) + \mathrm{var}(E(H \mid N)) \\
&= E(N p (1 - p)) + \mathrm{var}((1 + p) N) \\
&= r p (1 - p) + r (1 + p)^2 = r (1 + 3p)
\end{aligned}
\]

3. The dispersion of H is given by var(H)/E(H) = φ = (1 + 3p)/(1 + p), which is independent of the cluster rate. Solving for p, you find p = (φ − 1)/(3 − φ).

4. The probability mass function for the number of hurricanes, H, is given by

\[
P(H = k \mid r, p) = \sum_{i=0}^{\lfloor k/2 \rfloor} \mathrm{dpois}(k - i, r)\,\mathrm{dbinom}(i, k - i, p), \quad k = 0, 1, \ldots
\]
\[
P(H = 0 \mid r, p) = e^{-r}
\]
\[
\mathrm{dpois}(k - i, r) = \frac{e^{-r} r^{k-i}}{(k - i)!}
\]
\[
\mathrm{dbinom}(i, k - i, p) = \binom{k - i}{i} p^{i} (1 - p)^{k - 2i}
\]

5. The model has two parameters r and p. A better parameterization is to use λ = r(1 + p) with p to separate the hurricane frequency from the cluster probability. The parameters do not need to be fixed and can be functions of the covariates.

6. When p = 0, H is Poisson, and when p = 1, H/2 is Poisson, the dispersion is two, and the probability that H is even is 1.

You need a way to estimate r and p.
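The properties listed above can be checked by simulation. The sketch below is a minimal illustration, not the book's code: `dclust_sketch` is a hypothetical helper written directly from the probability mass function in property 4, and the values r = 0.75 and p = 0.14 are chosen only for the example.

```r
# Hypothetical helper: probability mass function of the cluster model.
dclust_sketch <- function(k, r, p) {
  sapply(k, function(kk) {
    i <- 0:floor(kk / 2)                     # number of two-hurricane clusters
    sum(dpois(kk - i, r) * dbinom(i, size = kk - i, prob = p))
  })
}

r <- 0.75; p <- 0.14                         # illustrative values only

# Simulate H = N + M with N ~ Poisson(r) and M | N ~ Binomial(N, p).
set.seed(11)
N <- rpois(1e5, r)
H <- N + rbinom(length(N), size = N, prob = p)

# Theoretical versus simulated mean and variance (properties 1 and 2).
rbind(theory = c(mean = r * (1 + p), var = r * (1 + 3 * p)),
      simulated = c(mean = mean(H), var = var(H)))

# Theoretical versus simulated probabilities for k = 0, ..., 4 (property 4).
round(rbind(pmf = dclust_sketch(0:4, r, p),
            simulated = as.numeric(table(factor(H, levels = 0:4))) / length(H)), 3)
```

With a large number of simulated years the two rows of each table should agree closely, which is a quick way to catch an algebra slip in the mean, variance, or mass function.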

11.1.4 Parameter Estimation

Your goal is a hurricane count distribution for the Florida region. For that, you need to estimate the annual cluster rate (r) and the probability (p) that the cluster size is two. Continuing with the GLM approach, you separately estimate the annual hurricane frequency, λ, and the annual cluster rate r. The ratio of these two parameters minus one is an estimate of the probability p.

This is reasonable if p does not vary much, since the annual hurricane count variance is proportional to the expected hurricane count (i.e., var(H) = r(1 + 3p) ∝ r ∝ E(H)). You estimated the parameters of the annual count model using Poisson regression, which assumes that the variance of the count is proportional to the expected count. Thus under the assumption that p is constant, Poisson regression can be used to estimate λ in the cluster model.

Using a logarithmic link function, you regress the cluster rate onto the predictors NAO and SOI. The model is given by

\[
N_i \sim \mathrm{dpois}(r_i) \tag{11.2}
\]
\[
\log(r_i) = \alpha_0 + \alpha_1\,\mathrm{SOI}_i + \alpha_2\,\mathrm{NAO}_i
\]

The parameters of this annual cluster count model cannot be estimated directly, since the observed hurricane count does not furnish information about the number of clusters.

Consider the observed set of annual Florida hurricane counts. Since the annual frequency is quite small, the majority of years have either no hurricanes or a single hurricane. You can create a “reduced” data set by using an indicator of whether or not there was at least one hurricane. Formally, let I_i = I(H_i > 0) = I(N_i > 0), then I is an indicator of the occurrence of a hurricane cluster for each year. You assume that I has a binomial distribution with size parameter of one and a proportion equal to π. This leads to a logistic regression model (see Chapter 7) for I.

Note that since exp(−r) is the probability of no clusters, the probability of a cluster π is 1 − exp(−r). Thus the cluster rate is r = −log(1 − π). If you use a logarithmic link function on r, then log(r) = log(−log(1 − π)) = cloglog(π), where cloglog is the complementary log-log function. Thus, you model I using the cloglog function to obtain r.

Your cluster model is a combination of two models, one for the counts and another for the clusters. Formally, it is given by

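\[
I_i \sim \mathrm{dbern}(\pi_i) \tag{11.3}
\]
\[
\mathrm{cloglog}(\pi_i) = \alpha_0 + \alpha_1\,\mathrm{SOI}_i + \alpha_2\,\mathrm{NAO}_i
\]

where dbern is the Bernoulli distribution with mean π. The covariates are the same as those used in the cluster count model. Given these equations, you have the following relationships for r and p.

\[
\log(\hat{r}(1 + \hat{p})) = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{SOI} + \hat{\beta}_2\,\mathrm{NAO} \tag{11.4}
\]
\[
\log(\hat{r}) = \hat{\alpha}_0 + \hat{\alpha}_1\,\mathrm{SOI} + \hat{\alpha}_2\,\mathrm{NAO} \tag{11.5}
\]

By subtracting the coefficients in Eq. 11.5 for the annual cluster count model from those in Eq. 11.4 for the annual hurricane count model, you have a regression model for the probabilities given by

\[
\log(1 + \hat{p}) = \hat{\beta}_0 - \hat{\alpha}_0 + (\hat{\beta}_1 - \hat{\alpha}_1)\,\mathrm{SOI} + (\hat{\beta}_2 - \hat{\alpha}_2)\,\mathrm{NAO} \tag{11.6}
\]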
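The two-model estimate of p in Eqs. 11.3 through 11.6 can be sketched with base R's glm on simulated data. Everything specific below is an assumption made for illustration: the sample size, the coefficient values, and the true p of 0.2. Only the structure (a Poisson regression for the counts, a cloglog regression for the occurrence indicator, and the coefficient difference) follows the text.

```r
# Simulated data; the coefficient values below are assumptions for illustration.
set.seed(7)
n <- 20000
sim <- data.frame(nao = rnorm(n), soi = rnorm(n))
p_true <- 0.2
r_true <- exp(-0.4 - 0.25 * sim$nao + 0.05 * sim$soi)   # cluster rate
N <- rpois(n, r_true)
sim$H <- N + rbinom(n, size = N, prob = p_true)          # hurricane count
sim$I <- as.integer(sim$H > 0)                           # at least one hurricane

# Count model (Eq. 11.4): Poisson regression with a log link for lambda = r(1 + p).
fit_count <- glm(H ~ soi + nao, family = poisson, data = sim)

# Cluster model (Eqs. 11.3 and 11.5): Bernoulli regression with a cloglog link.
fit_clust <- glm(I ~ soi + nao, family = binomial(link = "cloglog"), data = sim)

# Eq. 11.6: the coefficient differences give log(1 + p).
d <- coef(fit_count) - coef(fit_clust)
p_hat <- exp(d["(Intercept)"]) - 1     # estimate of p when SOI = NAO = 0
round(d, 3)
round(unname(p_hat), 3)
```

Because p is held constant in the simulation, the slope differences should be near zero and the intercept difference near log(1.2). With only 145 years of data, as in the hurricane record, the same estimate would be much noisier.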

11.1.5 Model Diagnostics

You start by comparing fitted values from the count model with fitted values from the cluster model. Let H_i be the hurricane count in year i and λ̂_i and r̂_i be the fitted annual count and cluster rates, respectively. Then let τ0 be a test statistic given by

\[
\tau_0 = \frac{1}{n}\sum_{i=1}^{n}(H_i - \hat{r}_i) = \frac{1}{n}\sum_{i=1}^{n}(\hat{\lambda}_i - \hat{r}_i). \tag{11.7}
\]

The value of τ0 is greater than zero if there is clustering. You test the significance of τ0 by generating random samples of length n from a Poisson distribution with rate λ_i and computing τ_j for j = 1, ..., N, where N is the number of samples. A test of the null hypothesis that τ0 ≤ 0 is the proportion of simulated τ's that are at least as large as τ0.

You do this with the testfits function in the correlationfuns.R file by specifying the model formula, data, and number of random samples.

> source("correlationfuns.R")

> tfF = testfits(F.1 ~ nao + soi, data=dat, N=1000)

> tfF$test; tfF$testpvalue

[1] 0.104

[1] 0.026

For Florida hurricanes, the test statistic τ0 has a value of 0.104 indicating some difference in count and cluster rates. The proportion of 1,000 simulated τ's that are at least as large as this is 0.026 providing sufficient evidence to reject the no-cluster hypothesis. Repeating the simulation using Gulf Coast hurricanes

> tfG = testfits(G.1 ~ nao + soi, data=dat, N=1000)

> tfG$test; tfG$testpvalue

[1] -0.0617

[1] 0.787

you find that, in contrast to Florida, there is little evidence against the no-cluster hypothesis. The results are consistent with what you found previously using the χ² statistic.

A linear regression through the origin of the fitted count rate on the cluster rate under the assumption that p is constant yields an estimate for 1 + p. You plot the annual count and cluster rates and draw the regression line using the plotfits function.

> par(mfrow=c(1, 2), pty="s")

> ptfF = plotfits(tfF)

> mtext("a", side=3, line=1, adj=0, cex=1.1)

> ptfG = plotfits(tfG)

> mtext("b", side=3, line=1, adj=0, cex=1.1)

The regressions are shown in Figure 11.1 for Florida and the Gulf coast. The black line is the y = x axis, and cluster and hurricane rates align along this axis if there is no clustering. The red line is the regression of the fitted hurricane rate onto the fitted cluster rate with the intercept set to zero. The slope is an estimate of 1 + p.
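The testfits and plotfits functions come from correlationfuns.R, which is not reproduced here. The sketch below is an assumed reconstruction of the same idea on simulated data, not the authors' implementation: `tau_fit` is a hypothetical helper that fits the two models and returns τ, the null distribution is built by drawing Poisson counts from the fitted count rates, and the slope through the origin estimates 1 + p.

```r
# Hypothetical helper: fitted count rates, cluster rates, and tau for one data set.
tau_fit <- function(H, nao, soi) {
  count <- glm(H ~ nao + soi, family = poisson)
  clust <- glm(as.integer(H > 0) ~ nao + soi,
               family = binomial(link = "cloglog"))
  lambda_hat <- fitted(count)
  r_hat <- -log(1 - fitted(clust))           # r = -log(1 - pi)
  list(tau = mean(lambda_hat - r_hat), lambda = lambda_hat, r = r_hat)
}

# Simulated clustered counts (assumed coefficients, p = 0.3).
set.seed(3)
n <- 2000
nao <- rnorm(n); soi <- rnorm(n)
N <- rpois(n, exp(-0.4 - 0.25 * nao + 0.05 * soi))
H <- N + rbinom(n, size = N, prob = 0.3)

obs <- tau_fit(H, nao, soi)

# Null distribution: Poisson counts drawn with the fitted count rates.
n_sim <- 200
tau_null <- replicate(n_sim, tau_fit(rpois(n, obs$lambda), nao, soi)$tau)
pval <- mean(tau_null >= obs$tau)

# Slope of the count rate on the cluster rate through the origin estimates 1 + p.
lam <- obs$lambda; rr <- obs$r
slope <- unname(coef(lm(lam ~ rr - 1)))
c(tau0 = obs$tau, p.value = pval, slope = slope)
```

The number of simulations is kept small here so the example runs quickly. A larger value, such as the 1,000 used in the text, gives a finer p-value.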


Table 11.3 Coefficients of the count rate model.

             Estimate  Std. Error  z Value  Pr(>|z|)
(Intercept)  −0.27     0.11        −2.55    0.01
nao          −0.23     0.09        −2.50    0.01
soi           0.06     0.03         1.86    0.06

Figure 11.1 Count versus cluster rates for (a) Florida and (b) Gulf coast. [Two panels, a and b; x-axis: Cluster rate (/yr); y-axis: Count rate (/yr); both axes run from about 0.5 to 2.0.]

The regression slopes are printed by typing,

> coefficients(ptfF); coefficients(ptfG)

rate

1.14

rate

0.942

The slope is 1.14 for the Florida region giving 0.14 as an estimate for p (probability that the cluster will have two hurricanes). The regression slope is 0.94 for the Gulf coast region, which you interpret as a lack of evidence for hurricane clusters.

Your focus is now on Florida hurricanes only. You continue by looking at the coefficients from both models. Type

> summary(tfF$fits$poisson)$coef

> summary(tfF$fits$binomial)$coef

The coefficient tables (Tables 11.3 and 11.4) show that the NAO and SOI covariates are significant in the hurricane count model, but only the NAO is significant in the cluster rate model.

Table 11.4 Coefficients of the cluster rate model.

             Estimate  Std. Error  z Value  Pr(>|z|)
(Intercept)  −0.42     0.13        −3.11    0.00
nao          −0.27     0.12        −2.23    0.03
soi           0.02     0.04         0.52    0.60

The difference in coefficient values from the two models is an estimate of log(1 + p), where again p is the probability that a cluster will have two hurricanes. The difference in the NAO coefficient is 0.043 and the difference in the SOI coefficient is 0.035 indicating that the NAO increases the probability of clustering more than ENSO. Lower values of the NAO lead to a larger rate increase for the Poisson model relative to the binomial model.

11.1.6 Forecasts

It is interesting to compare forecasts of the distribution of Florida hurricanes using your Poisson and cluster models. Here you set p = 0.138 for the cluster model. You can use the same two-component formulation for your Poisson model by setting p = 0. You prepare the data using the lambdapclust function as follows:

> ctsF = cts[, "F.1", drop=FALSE]

> pars = lambdapclust(prsp[, "F.1", drop=FALSE],

+ p=.138)

> ny = nrow(ctsF)

> h = 0:5

Next, you compute the expected number of years with h hurricanes from the cluster and Poisson models and tabulate the observed number of years. You combine them in a data object.

> eCl = sapply(h, function(x)
+   sum(do.call('dclust',
+     c(x=list(rep(x, ny)), pars))))
> ePo = sapply(h, function(x)
+   sum(dpois(x=rep(x, ny),
+     lambda=prsp[, "F.1"])))
> o = as.numeric(table(ctsF))
> dat = rbind(o, eCl, ePo)
> colnames(dat) = h

Finally, you plot the observed versus the expected from your cluster and Poisson models using a bar plot where the bars are plotted side by side.

> barplot(dat, ylab="Number of Years",
+   xlab="Number of Florida Hurricanes",
+   names.arg=c(0:5),
+   col=c("black", "red", "blue"),
+   legend=c("Observed", "Cluster", "Poisson"),
+   beside=TRUE)

Figure 11.2 Observed versus expected number of Florida hurricane years. [Bar plot; x-axis: Number of Florida hurricanes (0 to 5); y-axis: Number of years (0 to 70); bars: Observed, Cluster, Poisson.]

Results are shown in Figure 11.2. The expected numbers are based on your cluster model (p = 0.138) and on your Poisson model (p = 0). The cluster model fits the observed counts better than does the Poisson model particularly at the low and high count years.

Florida had hurricanes in only 2 of the 11 years from 2000 through 2010. But these 2 years featured seven hurricanes. Seasonal forecast models that predict U.S. hurricane activity assume a Poisson distribution. You show here that this assumption applied to Florida hurricanes leads to a forecast that underpredicts both the number of years without hurricanes and the number of years with three or more hurricanes (Jagger and Elsner, 2012).
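The dclust and lambdapclust functions used above are not listed in this section. As a simplified illustration of the comparison, the sketch below ignores the covariates and uses a constant annual rate, backed out from the 61.66 expected zero-hurricane years and the 145-year record reported in Section 11.1.1. The helper `dclust_lambda` is hypothetical and is written directly from the probability mass function of the cluster model.

```r
# Hypothetical helper: cluster-model probabilities for counts k, given the
# hurricane rate lambda = r(1 + p) and the cluster probability p.
dclust_lambda <- function(k, lambda, p) {
  r <- lambda / (1 + p)
  sapply(k, function(kk) {
    i <- 0:floor(kk / 2)
    sum(dpois(kk - i, r) * dbinom(i, size = kk - i, prob = p))
  })
}

ny <- 145                         # years in the Florida record (70 + 62 + 13)
lambda <- -log(61.66 / ny)        # constant rate implied by 61.66 expected zero years
h <- 0:5

e_pois <- ny * dpois(h, lambda)                      # Poisson model (p = 0)
e_clus <- ny * dclust_lambda(h, lambda, p = 0.138)   # cluster model

round(rbind(h = h, Poisson = e_pois, Cluster = e_clus), 1)
```

Holding the mean rate fixed, the cluster model assigns more years to the zero-hurricane group than the Poisson model does, which is the direction of the mismatch seen for Florida. The covariate-conditioned version in the text will give somewhat different numbers.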

11.2 SPATIAL CLUSTERS

Is there a tendency for hurricanes to cluster in space? More specifically, given that a hurricane originates at a particular location, is it more (or less) likely that a later hurricane will form in the same vicinity? This question can be answered using spatial point pattern analysis. Here you consider models for analyzing and modeling events across space.

We begin with some definitions. An event is the occurrence of some phenomenon of interest. For example, an event can be a hurricane’s origin or its lifetime maximum intensity. An event location is the spatial location of the event, for example, the latitude and longitude of the maximum intensity event. A point is any location where an event could occur. A spatial point pattern is a collection of events and event locations together with a spatial domain. The spatial domain is defined as the region of interest over which events occur. For example, the North Atlantic basin is the spatial domain for hurricanes.


Complete spatial randomness (CSR) defines a situation where an event is equally likely to occur at any location within the study area regardless of the locations of other events. A spatial point pattern is said to be clustered if given an event at some location it is more likely than random that another event will occur nearby. Regularity is the opposite; given an event, if it is less likely than random that another event will occur nearby, then the spatial point pattern is regular.

A realization is a collection of spatial point patterns generated under a spatial point process model. To illustrate, consider four point patterns each consisting of events inside the unit square. You generate the event locations and plot them by typing

> par(mfrow=c(2, 2), mex=.9, pty="s")

> for(i in 1:4){

+ u = runif(n=30, min=0, max=1)

+ v = runif(n=30, min=0, max=1)

+ plot(u, v, pch=19, xlim=c(0, 1), ylim=c(0, 1))

+ }

The patterns of events are shown in Figure 11.3. The plots, which show events under CSR, illustrate that some amount of clustering occurs by chance.

The spatstat package (Baddeley and Turner, 2005) contains a large number of functions for analyzing and modeling point pattern data. To make the functions available and obtain a citation, type

> require(spatstat)

> citation(package="spatstat")

CSR lies on a continuum between clustered and regular spatial point patterns. You use simulation functions in spatstat to compare the CSR plots in Figure 11.3 with regular and clustered patterns. Examples are shown in Figure 11.4. Here you see two realizations of patterns more regular than CSR (top row), and two realizations of patterns more clustered than CSR (bottom row). The simulations are made using a spatial point pattern model. Spatial scale also plays a role. A set of events might be clustered at one scale and regular at another.

Figure 11.3 Point patterns exhibiting complete spatial randomness.

Figure 11.4 Regular (top) and clustered (bottom) point patterns.
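The spatstat calls used to draw Figure 11.4 are not shown in this section. To get a feel for the two kinds of pattern without them, the base R sketch below uses two simple stand-in mechanisms: sequential inhibition for a regular pattern and a parent-offspring scheme for a clustered one. The function names and all parameter values are assumptions made for illustration.

```r
set.seed(5)

# Regular pattern: sequential inhibition, rejecting candidates that fall
# within distance d_min of an already accepted event.
sim_regular <- function(n, d_min, max_tries = 1e5) {
  pts <- matrix(numeric(0), ncol = 2)
  tries <- 0
  while (nrow(pts) < n && tries < max_tries) {
    cand <- runif(2)
    tries <- tries + 1
    if (nrow(pts) == 0 ||
        all(sqrt((pts[, 1] - cand[1])^2 + (pts[, 2] - cand[2])^2) >= d_min)) {
      pts <- rbind(pts, cand)
    }
  }
  pts
}

# Clustered pattern: random parents, each with offspring scattered nearby.
sim_clustered <- function(n_parents, n_offspring, sd) {
  px <- runif(n_parents); py <- runif(n_parents)
  x <- rep(px, each = n_offspring) + rnorm(n_parents * n_offspring, sd = sd)
  y <- rep(py, each = n_offspring) + rnorm(n_parents * n_offspring, sd = sd)
  keep <- x >= 0 & x <= 1 & y >= 0 & y <= 1     # stay inside the unit square
  cbind(x[keep], y[keep])
}

reg <- sim_regular(n = 30, d_min = 0.1)
clu <- sim_clustered(n_parents = 5, n_offspring = 6, sd = 0.03)

par(mfrow = c(1, 2), pty = "s")
plot(reg, pch = 19, xlim = c(0, 1), ylim = c(0, 1), xlab = "u", ylab = "v")
plot(clu, pch = 19, xlim = c(0, 1), ylim = c(0, 1), xlab = "u", ylab = "v")
```

Changing d_min or sd changes the scale at which the regularity or clustering shows up, which is one way to explore the remark about spatial scale.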

11.2.1 Point Processes

Given a set of spatial events (spatial point pattern), your goal is to assess evidence for clustering (or regularity). Spatial cluster detection methods are based on statistical models for spatial point processes. The random variable in these models is the event locations. Here we provide some definitions useful for understanding point process models. Details on the underlying theory are available in Ripley (1981), Cressie (1993), and Diggle (2003).

A process generating events is stationary when it is invariant to translation across the spatial domain. This means that the relationship between two events depends only on their positions relative to one another and not on the event locations themselves. This relative position refers to (lag) distance and orientation between the events. A process is isotropic when orientation does not matter. Recall these same concepts applied to the variogram models used in Chapter 9.

Given a single realization, the assumptions of stationarity and isotropy allow for replication. Two pairs of events in the same realization of a stationary process that are separated by the same distance should have the same relatedness. This allows you to use relatedness information from one part of the domain as a replicate for relatedness in another part of the domain. The assumptions of stationarity and isotropy are the starting point but can be relaxed later on.


The Poisson distribution is used to define a model for CSR. A spatial point process is said to be homogeneous Poisson under the following two criteria:

• The number of events occurring within a finite region A is a random variable following a Poisson distribution with mean λ × |A| for some positive constant λ, with |A| denoting the area of A.

• Given the total number of events N, their locations are an independent random sample of points where each point is equally likely to be picked as an event location.

The first criterion is about the density of the spatial process. For a given domain, it answers the question, “how many events?” It is the number of events divided by the domain area. The density is an estimate of the rate parameter of the Poisson distribution.[1] The second criterion is about homogeneity. Events are scattered throughout the domain and are not clustered or regular.

It is instructive to consider how you would create a homogeneous Poisson point pattern. The procedure follows from the definition. Step 1: Generate the number of events using a Poisson distribution with mean equal to λ × |A|. Step 2: Place the events inside the domain using uniform distributions for the spatial coordinates.

For example, let the area of the domain be one and the density of events be 200, then type

> lambda = 200

> N = rpois(1, lambda)

> x = runif(N); y=runif(N)

> plot(x, y, pch=19)

Note that if your domain is not a rectangle, you can enclose it in a rectangle, then reject events sampled from inside the rectangle that fall outside the domain. This is called “rejection sampling.”

By definition, these point patterns are CSR. However, as noted earlier, you are typically in the opposite position. You observe a set of events and want to know if they are regular or clustered. Your null hypothesis is CSR and you need a test statistic that will guide your inference. CSR models are easy to construct, so you can use Monte Carlo methods.

In some cases, the homogeneous assumption is too restrictive. Consider hurricane genesis as an event process. Event locations may be random, but the probability of an event is higher away from the equator. The constant risk hypothesis of the homogeneous Poisson point pattern model requires a generalization to include a spatially varying density function.

To do this, you define the density λ(s), where s denotes spatial points. This is called an inhomogeneous Poisson model, and it is analogous to the universal kriging model used on field data (see Chapter 9). Inhomogeneity as defined by a spatially varying density implies nonstationarity as the number of events depends on location.
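One way to simulate an inhomogeneous Poisson pattern combines the two-step recipe for the homogeneous case with the rejection idea: generate a homogeneous pattern at the largest density, then keep each event with probability λ(s) divided by that maximum. This is often called thinning. The sketch below assumes a made-up density on the unit square that rises linearly with the second coordinate; it is an illustration, not a model of hurricane genesis.

```r
set.seed(9)

# Assumed density (events per unit area) on the unit square: it rises with
# the second coordinate, loosely mimicking more events away from the equator.
lambda_s <- function(x, y) 300 * y
lambda_max <- 300                     # upper bound of lambda_s on the domain

# Step 1: homogeneous Poisson pattern with the bounding density.
N <- rpois(1, lambda_max)
x <- runif(N); y <- runif(N)

# Step 2: keep each event with probability lambda(s) / lambda_max.
keep <- runif(N) < lambda_s(x, y) / lambda_max
xi <- x[keep]; yi <- y[keep]

plot(xi, yi, pch = 19, xlim = c(0, 1), ylim = c(0, 1))

# The expected number of events is the integral of lambda(s), here 150.
c(events = length(xi), upper_half = sum(yi > 0.5), lower_half = sum(yi <= 0.5))
```

The events are still independent of one another, yet the upper half of the square holds more of them, so an apparent concentration arises from the density alone.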

¹ In spatial statistics, this is often called the intensity. Here we use the term density instead so as not to confuse it with hurricane strength.


11.2.2 Spatial Density

Density is a first-order property of a spatial point pattern. That is, the density function estimates the mean number of events per unit area at any point in your domain. Events are independent of one another, but event clusters might appear because of the varying density.

Given an observed set of events, how do you estimate λ(s)? One approach is to use kernel densities. Consider again the set of hurricanes over the period 1944–2000 that were designated tropical only and baroclinically enhanced (Chapter 7). Input these data and create a spatial points data frame of the genesis locations by typing

> bh = read.csv("bi.csv", header=TRUE)

> require(sp)

> coordinates(bh) = c("FirstLon", "FirstLat")

> ll = "+proj=longlat +ellps=WGS84"

> proj4string(bh) = CRS(ll)

Next convert the geographic coordinates using the Lambert conformal conic projection true at parallels 10 and 40°N and a center longitude of 60°W. First save the projection definition as a character string, then wrap it in a CRS object when you call the spTransform function from the rgdal package.

> lcc = "+proj=lcc +lat_1=40 +lat_2=10 +lon_0=-60"

> require(rgdal)

> bht = spTransform(bh, CRS(lcc))

The spatial distance unit is meters. Use the map function (maps) and the map2SpatialLines function (maptools) to obtain country borders by typing

> require(maps)

> require(maptools)

> brd = map("world", xlim=c(-100, -30), ylim=c(10, 48),

+ plot=FALSE)

> brd_ll = map2SpatialLines(brd, proj4string=CRS(ll))

> brd_lcc = spTransform(brd_ll, CRS(lcc))

You use the same coordinate transformation on the map borders as you do on the hurricane locations.

Next you convert your S4 class objects (bht and brd_lcc) into S3 class objects for use with the functions in the spatstat package.

> require(spatstat)

> bhp = as.ppp(bht)

> clpp = as.psp(brd_lcc)

The spatial point pattern object bhp contains marks. Marks are attributes attached to the events. By default the marks are the columns from the original data that are not used for location. You are interested only in the hurricane type (either tropical only or baroclinic), so you reset the marks accordingly by typing

> marks(bhp) = bht$Type == 0

You summarize the object with the summary method. The summary includes an average density over the spatial domain. The density is per unit area. Your length unit is the meter from the Lambert projection, so your density is per square meter. You retrieve the average density in units of per (1,000 km)² by typing

> summary(bhp)$intensity * 1e+12

[1] 11.4

Thus, on average, each point in your spatial domain has slightly more than 10 hurricane origins per (1,000 km)². This is the mean spatial density. Some areas have more or less than the mean so you would like an estimate of λ(s).

You do this using the density method, which computes a kernel smoothed density function from your point pattern object. By default, the kernel is an isotropic Gaussian, and the bandwidth is its standard deviation. The default output is a pixel image containing local density estimates. You again convert the density values to units of per (1,000 km)².

> den = density(bhp)

> den$v = den$v * 1e+12

Finally you use the plot method to plot the densities, the country borders, and the event locations.

> plot(den)

> plot(unmark(clpp), add=TRUE)

> plot(unmark(bhp), pch=19, cex=.3, add=TRUE)
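To see what a kernel estimate of λ(s) amounts to, the following base R sketch computes one by hand for a simulated pattern. It is a simplified stand-in rather than the spatstat method: the helper name `kernel_density`, the bandwidth of 0.1, and the evaluation grid are assumptions, and no edge correction is applied.

```r
# Hypothetical helper: Gaussian kernel estimate of lambda(s) on a grid.
# Each event contributes a bivariate Gaussian bump with standard
# deviation sigma; the bumps are summed at every grid point.
kernel_density <- function(x, y, gx, gy, sigma) {
  grid <- expand.grid(gx = gx, gy = gy)
  lam <- sapply(seq_len(nrow(grid)), function(k) {
    d2 <- (grid$gx[k] - x)^2 + (grid$gy[k] - y)^2
    sum(exp(-d2 / (2 * sigma^2))) / (2 * pi * sigma^2)
  })
  matrix(lam, nrow = length(gx), ncol = length(gy))
}

set.seed(1)
x <- runif(200); y <- runif(200)           # simulated events on the unit square
gx <- seq(-0.5, 1.5, length.out = 81)      # grid extends past the domain
gy <- gx
lam <- kernel_density(x, y, gx, gy, sigma = 0.1)

# Because each bump integrates to one, the estimate integrates to the
# number of events.
cell <- diff(gx)[1] * diff(gy)[1]
total <- sum(lam) * cell
total
```

A larger sigma spreads each event over a wider area and gives a smoother map. A smaller sigma gives a map that follows the individual events more closely.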

Event density maps split by tropical-only and baroclinic hurricane genesis are shown in Figure 11.5. Here we converted the im object to a SpatialGridDataFrame object and used the spplot method. Regions of the central Atlantic extending westward through the Caribbean into the southern Gulf of Mexico have the greatest tropical-only density, generally exceeding 10 hurricane origins per (1,000 km)². By contrast, regions along the eastern coast of the United States extending eastward to Bermuda have the greatest baroclinic density. The amount of smoothing is determined by the bandwidth and the type of kernel.

Densities at the genesis locations are made using the argument at="points" in the density call. Also, the number of events in grid boxes across the spatial domain is obtained using the quadratcount or pixellate functions. For example, type

> plot(quadratcount(bhp))

> plot(pixellate(bhp, dimyx=5))


Figure 11.5 Genesis density for (a) tropical-only and (b) baroclinic hurricanes. [Both panels use a color scale of density in events per (1,000 km)², ranging from 0 to 30.]

11.2.3 Second-Order Properties

The density function describes the rate (mean) of hurricane genesis events locally. Second-order functions describe the variability of events. Ripley’s K function is an example. It is defined as

\[
\hat{K}(s) = \lambda^{-1} n^{-1} \sum_{i \neq j} I(d_{ij} < s), \tag{11.8}
\]

where $d_{ij}$ is the Euclidean distance between the ith and jth events in a data set of n events, and λ is the average density of events, estimated as n/|A|, where |A| is the area of the domain. I is an indicator function that equals 1 if the distance is less than s and 0 otherwise.

The definition assumes stationarity and isotropy for the point pattern and implies that under CSR, K̂(s) is approximately equal to πs². For point patterns more regular than CSR, you expect fewer events within distance s of a randomly chosen event than under CSR, so K̂(s) < πs². For point patterns more clustered than CSR, you expect more events within a given distance than under CSR, or K̂(s) > πs².

The function Kest is used to compute K̂(s). Here you save the results by typing

> k = Kest(bhp)

The function takes an object of class ppp and computes K̂(s) using a formula that corrects for edge effects to reduce bias arising from counting hurricanes near the borders of your spatial domain (Ripley, 1991; Baddeley et al., 2000).

The K function is defined such that λK̂(s) equals the expected number of additional hurricanes within a distance s of any other hurricane. You plot the expected number as a function of separation distance by typing

> lam = summary(bhp)$intensity

> m2km = .001

> plot(k$r * m2km, k$iso * lam, type="l", lwd=2,


+ xlab="Lag Distance (s) [km]",

+ ylab="Avg Number of Nearby Hurricanes")

You first save the value of λ and a conversion factor to go from meters to kilometers. The iso column refers to the K function computed using the isotropic correction for regular polygon domains.

The Kest function also returns theoretical values under the hypothesis of CSR. You add these to your plot by typing

> lines(k$r * m2km, k$theo * lam, col="red", lwd=2)

The empirical curve lies above the theoretical curve at all lag distances. For instance, at a lag distance of 471 km, there are on average about 16 additional hurricanes nearby. This compares with an expected number of eight additional hurricanes. This indicates that for a given hurricane, there are more nearby hurricanes than you would expect by chance under the assumption of CSR. This is indicative of clustering.

But as you have seen earlier, the clustering is related to the inhomogeneous distribution of hurricane events across the basin. So this result is not particularly useful. Instead you use the Kinhom function to compute a generalization of the K function for inhomogeneous point patterns (Baddeley et al., 2000). The results are shown in Figure 11.6. Black curves are the empirical estimates and red curves are the theoretical.

In the case of the inhomogeneous estimate, the empirical and theoretical curves match better. The curves depart at large lag distance, with the empirical curve falling below the theoretical curve, indicating regularity. The regularity is likely due to land areas having no genesis points. The match can be improved by removing land areas from the analysis domain.
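The contrast between the two K functions can be reproduced on simulated data. The base R sketch below is a simplified illustration, not the Kest or Kinhom implementation: it applies no edge correction, the density function `lam_fun` is hypothetical, and the true density is used as the weight, whereas in practice it would have to be estimated.

```r
set.seed(3)

# Simulate an inhomogeneous Poisson pattern on the unit square by
# thinning. The density function below is hypothetical.
lmax <- 2000
lam_fun <- function(x, y) lmax * exp(3 * (x - 1))
N <- rpois(1, lmax)
x <- runif(N); y <- runif(N)
keep <- runif(N) < lam_fun(x, y) / lmax
x <- x[keep]; y <- y[keep]
n <- length(x)
area <- 1

d <- as.matrix(dist(cbind(x, y)))
diag(d) <- Inf                     # exclude pairs with i == j
s <- c(0.05, 0.10)                 # lag distances

# Equation 11.8 with lambda estimated as n / area.
k_hom <- sapply(s, function(r) sum(d < r) * area / n^2)

# Inhomogeneous version: each pair of events is weighted by the inverse
# product of the densities at the two event locations.
w <- 1 / outer(lam_fun(x, y), lam_fun(x, y))
k_inhom <- sapply(s, function(r) sum(w[d < r]) / area)

round(cbind(s, k_hom, k_inhom, csr = pi * s^2), 4)
```

The unweighted estimate sits well above πs², which by itself would suggest clustering. Once the spatially varying density is accounted for, the estimate falls back toward the CSR value, with a small shortfall that comes from ignoring edge effects.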

Figure 11.6 2nd-order genesis statistics. (a) Ripley’s K and (b) generalization of K. [Both panels: lag distance (s) in km on the horizontal axis (0 to 800) and average number of nearby hurricanes on the vertical axis (0 to 50).]


11.2.4 Models

You can fit a parametric model to your point pattern. This allows you to predict the occurrence of hurricane events given covariates and the degree of clustering. The spatstat package contains options for fitting point pattern models with the ppm function.

Models can include spatial trend, dependence on covariates, and event interactions. Models are fit using the method of maximum pseudo-likelihood, or using an approximate maximum likelihood method. Model syntax is standard R. Typically, the null model is homogeneous Poisson without covariates.
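The idea of comparing a homogeneous null model with a model that includes a spatial trend can be illustrated without spatstat. The sketch below is a stand-in technique, not the ppm function: it counts simulated events in grid cells and fits Poisson regressions with glm. The simulated density, the 10 by 10 grid, and all object names are assumptions.

```r
set.seed(7)

# Simulated inhomogeneous pattern: log density increases linearly in x
# with slope 3 (hypothetical).
lmax <- 2000
N <- rpois(1, lmax)
x <- runif(N); y <- runif(N)
keep <- runif(N) < exp(3 * (x - 1))
x <- x[keep]; y <- y[keep]

# Count events in a 10 x 10 grid of cells.
nb <- 10
brk <- seq(0, 1, length.out = nb + 1)
cx <- cut(x, brk, include.lowest = TRUE)
cy <- cut(y, brk, include.lowest = TRUE)
mid <- (brk[-1] + brk[-(nb + 1)]) / 2
cells <- expand.grid(x = mid, y = mid)        # x varies fastest
cells$count <- as.vector(table(cx, cy))       # same ordering as cells
cells$area <- (1 / nb)^2

# Null model: constant density. Trend model: log-linear in x and y.
null <- glm(count ~ 1 + offset(log(area)), family = poisson, data = cells)
trend <- glm(count ~ x + y + offset(log(area)), family = poisson, data = cells)

coef(trend)
lr <- deviance(null) - deviance(trend)        # likelihood ratio statistic
p_value <- pchisq(lr, df = 2, lower.tail = FALSE)
p_value
```

The fitted coefficient on x should be near the slope used in the simulation, and the coefficient on y near zero. The large drop in deviance is evidence against the homogeneous Poisson null model.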

11.3 FEATURE CLUSTERS

In the first two sections of this chapter, you examined clustering in time and geographic space. It is also possible to cluster in feature space. Indeed, this is the best known of the cluster methods. Feature clustering is called cluster analysis.

Cluster analysis searches for groups in feature (attribute) data. Objects belonging to the same cluster have similar features. Objects belonging to different clusters have dissimilar features. In two or three dimensions, clusters can be visualized. In more than three, analytic assistance is helpful.

Note the difference from the two earlier cluster methods. With those methods your initial goal was cluster detection with the ultimate goal a model to help you make inferences. Here, your goal is descriptive. Cluster analysis is a data-reduction technique. You start with the assumption that your data can be grouped based on feature similarity.

Cluster analysis methods fall into two categories: partition and hierarchical. Partition clustering divides your data into a prespecified number of groups. The number of groups is usually chosen by trial. An index of cluster quality can make it easier for you to choose an optimal number.

Hierarchical clustering creates ordered sets of clustered groups from your data. Agglomerative hierarchical clustering starts with each object forming its own cluster. It then successively merges clusters until a single large cluster remains. Divisive hierarchical clustering is the opposite. It starts with a single cluster that includes all objects and successively splits the clusters into smaller groups.

To cluster in feature space, you need to

1. create a dissimilarity matrix from a set of objects, and
2. group the objects based on the values of the dissimilarity matrix.
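The two steps above can be sketched with the dist and hclust functions from the stats package. This is a minimal illustration of agglomerative hierarchical clustering on five made-up objects with two features. The choice of complete linkage (the hclust default) and of two groups is an assumption for the example.

```r
# Toy data: five objects described by two features.
x1 <- c(2, 1, -3, -2, -3)
x2 <- c(1, 2, -1, -2, -2)

# Step 1: dissimilarity matrix (Euclidean distance by default).
d <- dist(cbind(x1, x2))

# Step 2: agglomerative hierarchical clustering (complete linkage by
# default). Each object starts as its own cluster and clusters are
# merged one pair at a time.
hc <- hclust(d)
hc$merge     # negative entries are single objects, positive are earlier merges

# Cut the tree to obtain two groups.
groups <- cutree(hc, k = 2)
groups
```

Typing plot(hc) draws the dendrogram, which shows the order of the merges and the dissimilarity at which each merge occurs.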

11.3.1 Dissimilarity and Distance

The dissimilarity between two objects measures how different they are. The larger the dissimilarity, the greater the difference. Objects are considered vectors in attribute (feature) space, where vector length equals the number of data set attributes.

Consider, for instance, a data set with three attributes and four objects (Table 11.5). Object 1 is a vector consisting of the triplet $(x_{1,1}, x_{1,2}, x_{1,3})$, object 2 is a vector consisting of the triplet $(x_{2,1}, x_{2,2}, x_{2,3})$, and so on. Then the elements of a dissimilarity matrix are pairwise distances between the objects in feature space. As an example, an object might be a hurricane track with features that include genesis location, lifetime maximum intensity, and lifetime maximum intensification.

Table 11.5 Attributes and objects. A data set ready for cluster analysis.

            Feature 1   Feature 2   Feature 3
Object 1    x_{1,1}     x_{1,2}     x_{1,3}
Object 2    x_{2,1}     x_{2,2}     x_{2,3}
Object 3    x_{3,1}     x_{3,2}     x_{3,3}
Object 4    x_{4,1}     x_{4,2}     x_{4,3}

Although distance is an actual metric, the dissimilarity function need not be. The minimum requirements for a dissimilarity measure d are

• $d_{i,i} = 0$
• $d_{i,j} \geq 0$
• $d_{i,j} = d_{j,i}$.

The following axioms of a proper metric may be included but are not necessary:

• $d_{i,k} \leq d_{i,j} + d_{j,k}$ (triangle inequality)
• $d_{i,j} = 0$ implies $i = j$.

Before clustering, you arrange your data set as an n × p data matrix, where n is the number of rows (one for each object) and p is the number of columns (one for each attribute variable).

How you compute dissimilarity depends on your attribute variables. If the variables are numeric, you use Euclidean or Manhattan distance given as

\[
d^{E}_{i,j} = \sqrt{\sum_{f=1}^{p} (x_{i,f} - x_{j,f})^2} \tag{11.9}
\]

\[
d^{M}_{i,j} = \sum_{f=1}^{p} |x_{i,f} - x_{j,f}| \tag{11.10}
\]

The summation is over the number of attributes (features). With hierarchical clustering, the maximum distance norm, given by

\[
d^{\max}_{i,j} = \max(|x_{i,f} - x_{j,f}|,\; f = 1, \ldots, p) \tag{11.11}
\]

is sometimes used.

Measurement units on your features influence the distances in feature space, which affects your clusters. The feature with the largest variance will have the greatest impact. If all features are equally important to your grouping, then the data need to be standardized. You can do this with the scale function, whose default method centers and scales the columns of your data matrix.

The dist function computes the distance between objects using a variety of methods including Euclidean (default), Manhattan, and maximum. To begin, create two feature vectors each having five values and plot them with labels that correspond to the object number.

> x1 = c(2, 1, -3, -2, -3)

> x2 = c(1, 2, -1, -2, -2)

> plot(x1, x2, xlab="Feature 1", ylab="Feature 2")

> text(x1, x2, labels=1:5, pos=c(1, rep(4, 4)))

From the plotted points, you can group the five objects into two clusters by eye. There is an obvious distance separation between the clusters. To compute pairwise Euclidean distances between the five objects, type

> d = dist(cbind(x1, x2))

> d

1 2 3 4

2 1.41

3 5.39 5.00

4 5.00 5.00 1.41

5 5.83 5.66 1.00 1.00

The result is a vector of distances, but printed as a table with the object numbers listed as row and column names. The values are the pairwise distances so, for instance, the distance between object 1 and object 2 is 1.41 units. You can see two distinct groups of distances (dissimilarities): those less than or equal to 1.41 and those greater than or equal to 5. Objects 1 and 5 are the most dissimilar, followed by objects 2 and 5. On the other hand, objects 3 and 5 and objects 4 and 5 are the most similar. You can change the default Euclidean distance to the Manhattan distance by including method="man" in the dist function.

If the feature vectors contain values that are not continuous numeric (e.g., factors, ordered, binary) or if there is a mixture of data types (one feature is numeric and the other is a factor, for instance), then dissimilarities need to be computed differently. The cluster package contains functions for cluster analysis including daisy for dissimilarity matrix calculations, which by default uses Euclidean distance as the dissimilarity metric. To test, type

> require(cluster)

> d = daisy(cbind(x1, x2))

> d

Dissimilarities :

1 2 3 4

2 1.41

3 5.39 5.00


4 5.00 5.00 1.41

5 5.83 5.66 1.00 1.00

Metric : euclidean

Number of objects : 5

Note, the values that make up the dissimilarity matrix are the same as the earlier distance values, but additional information is saved including the metric used and the total number of objects in your data set. The function contains a logical flag called stand that when set to true standardizes the feature vectors before calculating dissimilarities. If some of the features are not numeric, the function computes a generalized coefficient of dissimilarity (Gower, 1971).
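Returning to Equations 11.9 through 11.11, the sketch below computes the three distances by hand for objects 1 and 3 of the example data and compares them with the dist function. It also illustrates the effect of measurement units. The helper `pair_dist` and the factor of 1,000 used to change units are made up for the example.

```r
x1 <- c(2, 1, -3, -2, -3)
x2 <- c(1, 2, -1, -2, -2)
X <- cbind(x1, x2)

# Hypothetical helper: distance between objects i and j of a data matrix.
pair_dist <- function(X, i, j, type = "euclidean") {
  a <- abs(X[i, ] - X[j, ])
  switch(type,
         euclidean = sqrt(sum(a^2)),   # Equation 11.9
         manhattan = sum(a),           # Equation 11.10
         maximum   = max(a))           # Equation 11.11
}

types <- c("euclidean", "manhattan", "maximum")
by_hand <- sapply(types, function(tp) pair_dist(X, 1, 3, tp))
by_dist <- sapply(types, function(tp) as.matrix(dist(X, method = tp))[1, 3])
rbind(by_hand, by_dist)

# Changing the units of one feature changes the raw distances, but not
# the distances computed on standardized features.
X2 <- X
X2[, 1] <- X2[, 1] * 1000
d_raw_same <- isTRUE(all.equal(as.vector(dist(X)), as.vector(dist(X2))))
d_std_same <- isTRUE(all.equal(as.vector(dist(scale(X))),
                               as.vector(dist(scale(X2)))))
c(raw = d_raw_same, standardized = d_std_same)
```

The Euclidean value for objects 1 and 3 agrees with the 5.39 printed earlier. The last comparison shows why standardizing with scale is needed when all features should carry equal weight.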

11.3.2 K-Means Clustering

Partition clustering requires you to specify the number of clusters beforehand. This number is denoted k. The algorithm allocates each object in your data frame to one and only one of the k clusters.

The k-means method is the most popular. Membership of an object is determined by its distance from the centroid of each cluster. The centroid is the multidimensional version of the mean. The method alternates between calculating the centroids based on the current cluster members and reassigning objects to clusters based on the new centroids.

For example, in deciding to which of two clusters an object belongs, the method computes the dissimilarity between the object and the centroid of cluster one and between the object and the centroid of cluster two. It then assigns the object to the cluster with the smallest dissimilarity and recomputes the centroid with this new object included. An object already in this cluster might now have greater dissimilarity due to the repositioning of the centroid, in which case it might get reassigned. Assignments and reassignments continue in this way until all objects have a stable membership.

The kmeans function from the base stats package performs k-means clustering. By default, it uses the algorithm of Hartigan and Wong (1979). The first argument is the data frame (not the dissimilarity matrix) and the second is the number of clusters (centers). To perform a k-means cluster analysis on the example data above, type

> dat = cbind(x1, x2)

> ca = kmeans(dat, centers=2)

> ca

K-means clustering with 2 clusters of sizes 3, 2

Cluster means:

x1 x2

1 -2.67 -1.67

2 1.50 1.50


Clustering vector:

[1] 2 2 1 1 1

Within cluster sum of squares by cluster:

[1] 1.33 1.00

(between_SS / total_SS = 93.4 %)

Available components:

[1] "cluster" "centers" "totss"

[4] "withinss" "tot.withinss" "betweenss"

[7] "size"

Initial centroids are chosen at random so it is recommended that you rerun the algorithm a few times to see if it arrives at the same groupings. While the algorithm minimizes within-cluster variance, it does not ensure that there is a global minimum across all clusters.

The first bit of output gives the number of clusters and the size of the clusters that result. Here cluster 1 has three members and cluster 2 has two. The next bit of output gives the cluster centroids. There are two features labeled x1 and x2. The rows list the centroids for clusters 1 and 2 as vectors in this two-dimensional feature space. The centroid of the first cluster is (−2.67, −1.67) and the centroid of the second cluster is (1.5, 1.5).

The centroid is the feature average using all objects in the cluster. You can see this by adding the centroids to your plot.

> points(ca$centers, pch=8, cex=2)
> text(ca$centers, pos=c(4, 1),
+ labels=c("Cluster 1", "Cluster 2"))

The next bit of output tags each object with cluster membership. Here you see that the first two objects belong to cluster 2 and the next three belong to cluster 1. The cluster number keeps track of distinct clusters, but the numerical order is irrelevant.

The within-cluster sum of squares is the sum of the squared distances between each object and the centroid of the cluster to which it belongs. From your earlier plot you can see that the centroids (stars) minimize the within-cluster distances while maximizing the between-cluster distances.

The function pam uses medoids rather than centroids. A medoid is the object in a cluster whose total dissimilarity to the other members is smallest, making it a multidimensional analog of the median. The method accepts a dissimilarity matrix, tends to be more robust (converges to the same result), and provides for additional graphical displays.
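The alternation between assigning objects and recomputing centroids, described at the start of this section, can be written out in a few lines. The sketch below is a simplified version of that alternation and is not the Hartigan and Wong algorithm that kmeans uses by default. The helper `simple_kmeans`, the fixed starting centroids, and the iteration limit are assumptions, and the helper does not handle a cluster that becomes empty.

```r
x1 <- c(2, 1, -3, -2, -3)
x2 <- c(1, 2, -1, -2, -2)
dat <- cbind(x1, x2)

# Hypothetical helper: alternate between assignment and centroid update.
simple_kmeans <- function(dat, centers, max_iter = 20) {
  cl <- rep(0L, nrow(dat))
  for (iter in seq_len(max_iter)) {
    # Squared distance from each object (row) to each centroid (column).
    d2 <- sapply(seq_len(nrow(centers)), function(k)
      rowSums(sweep(dat, 2, centers[k, ])^2))
    new_cl <- max.col(-d2, ties.method = "first")   # nearest centroid
    if (all(new_cl == cl)) break                    # memberships are stable
    cl <- new_cl
    for (k in seq_len(nrow(centers)))
      centers[k, ] <- colMeans(dat[cl == k, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}

# Start from objects 1 and 3 as the initial centroids (arbitrary choice).
fit <- simple_kmeans(dat, centers = dat[c(1, 3), ])
fit$cluster
fit$centers

# Within-cluster sum of squares: squared distances to the own centroid.
wss <- tapply(rowSums((dat - fit$centers[fit$cluster, ])^2),
              fit$cluster, sum)
wss
```

The centroids match the cluster means printed by kmeans earlier, up to the arbitrary numbering of the clusters, and the within-cluster sums of squares match as well.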


11.3.3 Track Clusters

Here your interest is to group hurricanes according to common track features. These features include location of hurricane origin, location of lifetime maximum intensity, location of final hurricane intensity, and lifetime maximum intensity. The three location features are further subdivided into latitude and longitude, making a total of seven attribute (feature) variables. Note that location is treated here as an attribute rather than as a spatial coordinate.

You first create a data frame containing the attributes you wish to cluster. Here you use the cyclones since 1950:

> load("best.use.RData")

> best = subset(best.use, Yr >= 1950)

You then split the data frame into separate cyclone tracks using Sid with each element in the list as a separate track. You are interested in identifying features associated with each track.

> best.split = split(best, best$Sid)

You then assemble a data frame that contains your attributes.

> best.c = t(sapply(best.split, function(x){

+ x1 = unlist(x[1, c("lon", "lat")])

+ x2 = unlist(x[nrow(x), c("lon", "lat")])

+ x3 = max(x$WmaxS)

+ x4 = unlist(x[rev(which(x3 == x$WmaxS))[1],

+ c("lon", "lat")])

+ return(c(x1, x2, x3, x4))

+ }))

> best.c = data.frame(best.c)

> colnames(best.c) = c("FirstLon", "FirstLat",

+ "LastLon", "LastLat", "WmaxS", "MaxLon", "MaxLat")
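To see what the split and sapply steps produce, here is the same logic applied to a small made-up data set of two tracks. The values in `toy` are invented for illustration. Only the column names Sid, lon, lat, and WmaxS follow the code above.

```r
# Two invented tracks with hourly-style records.
toy <- data.frame(
  Sid   = c(1, 1, 1, 2, 2, 2, 2),
  lon   = c(-60, -65, -70, -80, -82, -85, -88),
  lat   = c(12, 15, 20, 22, 24, 27, 30),
  WmaxS = c(40, 70, 55, 35, 90, 90, 50)
)

toy.split <- split(toy, toy$Sid)          # one data frame per track

toy.c <- t(sapply(toy.split, function(x) {
  first <- unlist(x[1, c("lon", "lat")])            # origin
  last <- unlist(x[nrow(x), c("lon", "lat")])       # final location
  wmax <- max(x$WmaxS)                              # lifetime maximum intensity
  # Location at the last time the maximum intensity occurs.
  at_max <- unlist(x[rev(which(x$WmaxS == wmax))[1], c("lon", "lat")])
  c(first, last, wmax, at_max)
}))
colnames(toy.c) <- c("FirstLon", "FirstLat", "LastLon", "LastLat",
                     "WmaxS", "MaxLon", "MaxLat")
toy.c
```

Each row of the result is one track and each column is one of the seven features. For the second track the maximum intensity occurs twice, and the location of the later occurrence is the one kept.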

The best.c data frame contains seven features from 667 cyclones.

Before clustering, you check the feature variances by applying the var function on the columns of the data frame using the sapply function.

> sapply(best.c, var)

FirstLon FirstLat LastLon LastLat WmaxS

499.2 57.0 710.4 164.8 870.7

MaxLon MaxLat

356.7 75.9

The variances range from a minimum of 57 degrees squared for the latitude of origin feature to a maximum of 870.7 knots squared for the maximum intensity feature. Because of this large range, you scale the features so that each has the same influence on the clustering.


> best.cs = scale(best.c)

> m = attr(best.cs, "scaled:center")

> s = attr(best.cs, "scaled:scale")

The function scale centers and scales the columns of your numeric data frame. The center and scale values are saved as attributes in the result. Here you save them to rescale the centroids after the cluster analysis.

You perform a k-means cluster analysis setting the number of clusters to three by typing

> k = 3

> ct = kmeans(best.cs, centers=k)

> summary(ct)

Length Class Mode

cluster 667 -none- numeric

centers 21 -none- numeric

totss 1 -none- numeric

withinss 3 -none- numeric

tot.withinss 1 -none- numeric

betweenss 1 -none- numeric

size 3 -none- numeric

The output is a list of length 7 containing the cluster vector, cluster means (centroids), the total sum of squares, the within-cluster sum of squares by cluster, the total within sum of squares, the between sum of squares, and the size of each cluster.

Your cluster means are scaled so they are not directly interpretable. The cluster vector gives the membership of each hurricane in the order in which it appears in your data set. The ratio of the between sum of squares to the total sum of squares is 44.7 percent. This ratio will increase with the number of clusters, but at the expense of having clusters that may not be interpretable. With four clusters, the increase in this ratio is smaller than the increase going from two to three clusters. So you are content with your three-cluster result.
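Two steps mentioned above are not shown in the text: comparing the ratio across different numbers of clusters, and rescaling the centroids to the original units with the saved center and scale values. The sketch below illustrates both on synthetic data. The data frame `feat`, the use of nstart=25, and the range of k are assumptions, so the numbers do not correspond to the hurricane tracks.

```r
set.seed(5)

# Synthetic stand-in for a feature table: three groups, two features
# measured on very different scales.
grp <- rep(1:3, each = 40)
feat <- data.frame(
  a = rnorm(120, mean = c(0, 50, 100)[grp], sd = 5),
  b = rnorm(120, mean = c(0, 1, 0)[grp], sd = 0.1)
)

fs <- scale(feat)
m <- attr(fs, "scaled:center")
s <- attr(fs, "scaled:scale")

# Ratio of between to total sum of squares for k = 2, ..., 6.
ks <- 2:6
ratio <- sapply(ks, function(k) {
  fit <- kmeans(fs, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})
round(rbind(k = ks, ratio = ratio), 3)

# Three-cluster solution with centroids returned to original units:
# multiply by the scale values, then add back the centers.
ct <- kmeans(fs, centers = 3, nstart = 25)
centers_orig <- sweep(sweep(ct$centers, 2, s, "*"), 2, m, "+")
centers_orig
```

The rescaled centroids equal the feature means of the unscaled data within each cluster. A plot of the ratio against k typically rises quickly and then flattens, and the point where it flattens is one informal guide to the number of clusters.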

11.3.4 Track Plots

Since six of the seven features are spatial coordinates it is tempting to plot the centroids on a map and connect them with a track line. This would be misleading since the combination of attributes is not a spatial feature. Instead, you plot examples of cyclones that resemble each cluster.

First, you add cluster membership and distance to your original data frame, then split the data by cluster member.

> ctrs = ct$centers[ct$cluster, ]
> cln = ct$cluster
> dist = rowMeans((best.cs - ctrs)^2)
> id = 1:length(dist)


> best.c = data.frame(best.c, id=id, dist=dist,

+ cln=cln)

> best.c.split = split(best.c, best.c$cln)

Next you subset your cluster data based on the tracks that come closest to the cluster centroids. That closeness occurs in feature space that includes latitude, longitude, and intensity features. Here you choose nine tracks for each centroid.

> te = 9

> bestid = unlist(lapply(best.c.split, function(x)

+ x$id[order(x$dist)[1:te]]))

> cinfo = subset(best.c, id %in% bestid)
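The selection of the objects closest to each centroid can be tried on synthetic data. The sketch below mirrors the steps above under stated assumptions: the matrix `fs` stands in for the scaled track features, five objects are kept per cluster instead of nine, and all object names other than those borrowed from the code above are made up.

```r
set.seed(9)

# Synthetic scaled features: 60 objects in three groups, 3 features.
fs <- scale(matrix(rnorm(180), ncol = 3) + rep(c(-3, 0, 3), each = 20))
ct <- kmeans(fs, centers = 3, nstart = 10)

# Repeat each object's own centroid so it lines up row by row with fs,
# then compute the mean squared distance to that centroid.
ctrs <- ct$centers[ct$cluster, ]
dist2 <- as.vector(rowMeans((fs - ctrs)^2))

info <- data.frame(id = seq_len(nrow(fs)), cln = ct$cluster, dist = dist2)

# Within each cluster, keep the ids of the te closest objects.
te <- 5
closest <- unlist(lapply(split(info, info$cln), function(x)
  x$id[order(x$dist)[1:te]]))
picked <- subset(info, id %in% closest)
picked[order(picked$cln, picked$dist), ]
```

Indexing the centroid matrix by the cluster vector is what makes the subtraction work: it produces one centroid row per object, in the same order as the data.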

Finally, you plot the tracks on a map. This requires a few steps to make the plot easier to read. Begin by setting the bounding box using latitude and longitude of your cluster data and renaming your objects.

> cyclones = best.split

> uselat = range(unlist(lapply(cyclones[cinfo$id],

+ function(x) x$lat)))

> uselon = range(unlist(lapply(cyclones[cinfo$id],

+ function(x) x$lon)))

Next, order the clusters by number and distance and set the colors for plotting. Use the brewer.pal function in the RColorBrewer package (Neuwirth, 2011). You use a single hue sequential color ramp that allows you to highlight the tracks that are closest to the centroids of the feature clusters.

> cinfo = cinfo[order(cinfo$cln, cinfo$dist), ]

> require(RColorBrewer)

> blues = brewer.pal(te, "Blues")

> greens = brewer.pal(te, "Greens")

> reds = brewer.pal(te, "Reds")

> cinfo$colors = c(rev(blues), rev(greens), rev(reds))

Next, reorder the tracks so that those farthest from the centroids get plotted first.

> cinfo = cinfo[order(-cinfo$dist, cinfo$cln), ]

Finally, plot the tracks and add world and country borders. Results are shown in Figure 11.7. Track color is based on attribute proximity to the cluster centroid using a color saturation that decreases with distance.

> plot(uselon, uselat, type="n", xaxt="n", yaxt="n",

+ ylab="", xlab="")

> for(i in 1:nrow(cinfo)){

+ cid = cinfo$id[i]

+ cyclone = cyclones[[cid]]

+ lines(cyclone$lon, cyclone$lat, lwd=2,
+ col=cinfo$colors[i])
+ }
> require(maps)
> map("world", add=TRUE)
> map("usa", add=TRUE)

Figure 11.7 Tracks by cluster membership.

The analysis splits the cyclones into a group over the Gulf of Mexico, a group at high latitudes, and a group that begins at low latitudes but ends at high latitudes generally east of the United States. Some of the cyclones that begin at high latitude are baroclinic (see Chapter 7).

Cluster membership can change depending on the initial random centroids, particularly for tracks that are farthest from a centroid. The two- and three-cluster tracks are the most stable. The approach can be extended to include additional track features including velocity, acceleration, and curvature representing more dimensions of the space and time behavior of hurricanes.

Model-based clustering, where the observations are assumed to be quantified by a finite mixture of probability distributions, is an attractive alternative to cluster analysis. That is especially true if you want to make predictions of future activity. It is good to keep in mind the advice from Bill Venables and Brian Ripley, who note that you should not assume cluster analysis is the best way to discover interesting groups. Indeed, visualization methods are often more effective in this task.

This chapter showed you how to detect, model, and analyze clusters in hurricane data. We began by showing you how to detect and model the arrival of hurricanes along the coast. We then showed you how to detect and analyze the presence of spatial clusters in hurricane genesis locations. We also looked at the first- and second-order statistical properties of hurricane genesis. Finally, we showed you how to apply cluster analysis to hurricane track features and map representative tracks.

In the next chapter, we look at several applications of Bayesian models.
