Spatial Data Analysis Areas I: Rate Smoothing and the MAUP

Gilberto Câmara, INPE, Brazil

Ifgi, Muenster, Fall School 2005

Areal data

The study region is partitioned into disjoint areas.

The region is the union of the areas.

Each area has one or more associated measures, treated as random variables.

Example: a map of Germany divided into municipalities. For each area, we measure the unemployment rate and the literacy rate.

Is unemployment correlated with years of school? What about Brazil?

Violence in Minas Gerais

Attributes in areal data

As a general rule, each measure is a sum, count or similar aggregate function over the whole area.

Each value is associated with the whole of its corresponding area.

If we need to choose a single location, usually we take the polygon centroid

There are no intermediate values

What is mapped in areal data?

Typical values are rates or proportions

Numerator = events

Denominator = population at risk

Log maps?
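As a minimal sketch of how a log rate per 100,000 could be computed from a numerator (events) and a denominator (population at risk): the function name and the sample areas are hypothetical, and the natural logarithm is an assumption. Areas with zero events have no defined logarithm, so they are returned as `None` and can be mapped as a separate class.

```python
import math

def log_rate_per_100k(events, population):
    """Return log(events / population * 100000), or None when events == 0."""
    if population <= 0:
        raise ValueError("population at risk must be positive")
    if events == 0:
        return None  # log(0) is undefined; treat such areas as their own class
    return math.log(events / population * 100_000)

# Hypothetical areas: name -> (events, population at risk)
areas = {"A": (12, 40_000), "B": (0, 1_500), "C": (3, 900)}
log_rates = {name: log_rate_per_100k(y, n) for name, (y, n) in areas.items()}
print(log_rates)
```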

[Map: log rate of motor vehicle accident deaths per 100,000 residents, 1990-92, by municipality, covering São Paulo, Minas Gerais, Espírito Santo and Rio de Janeiro, with state capitals marked. Legend classes (number of municipalities): 4.214 to 5.28 (35); 3.148 to 4.214 (287); 2.082 to 3.148 (536); 1.016 to 2.082 (253); -0.05 to 1.016 (23); 0 deaths (298).]

[Map: log ratio of homicide deaths of males aged 15-49 per 100,000 residents of the same age group, 1990-92, by municipality, covering São Paulo, Minas Gerais, Espírito Santo and Rio de Janeiro, with state capitals marked. Legend classes (number of municipalities): 0.95 to 1.906 (28); 1.906 to 2.862 (209); 2.862 to 3.818 (460); 3.818 to 4.774 (223); 4.774 to 5.73 (64); 0 deaths (448).]

Models of Discrete Spatial Variation

[Map: visceral leishmaniasis rates (1997/1998), cases per 100,000 inhabitants. Legend classes (class counts in parentheses): 200 to 250 (1); 150 to 200 (2); 100 to 150 (1); 50 to 100 (4); 10 to 50 (29); 5 to 10 (16); 1 to 5 (43); < 1 (19).]

Random variable in area $i$: $Y_i$, $Z_i$

• number of ill people

• number of newborn babies

• per capita income

Source: Renato Assunção (UFMG/Brasil)

When the study variable is a rate or a proportion, mapping those rates is the first obvious step in any analysis.

However, the use of raw observed rates might be misleading, since the variability of those rates will be a function of the population counts, which differ widely between the areas.

(Bailey, 1995)
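The dependence of rate variability on population size can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original slides: every area is given the same true rate (an assumed value of 0.02), events are drawn as independent Bernoulli trials, and the population sizes are hypothetical. Only the spread of the raw rate $y/n$ changes with the population.

```python
import random
import statistics

def simulate_raw_rates(population, true_rate, replications, rng):
    """Draw `replications` raw rates y/n for one area with a fixed true rate."""
    rates = []
    for _ in range(replications):
        events = sum(1 for _ in range(population) if rng.random() < true_rate)
        rates.append(events / population)
    return rates

rng = random.Random(42)    # fixed seed so the run is repeatable
true_rate = 0.02           # assumed identical risk in every area
populations = [100, 1_000, 10_000]

spread = {}
for n in populations:
    rates = simulate_raw_rates(n, true_rate, 200, rng)
    spread[n] = statistics.stdev(rates)
    # Binomial theory: the sd of the raw rate is sqrt(p * (1 - p) / n)
    theory = (true_rate * (1 - true_rate) / n) ** 0.5
    print(f"n={n:>6}  simulated sd={spread[n]:.4f}  theoretical sd={theory:.4f}")
```

Although the underlying risk is identical, the smallest area shows by far the widest range of observed rates, which is why extreme values on a raw rate map tend to come from sparsely populated areas.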

Dealing with rates and proportions

[Scatter plot, São Paulo Metropolitan Region: infant mortality rate (y-axis, 0 to 60) against population aged less than 1 year (x-axis, 0 to 25,000)]

Source: Fred Ramos (CEDEST/Brasil)

Model-Driven Approaches

Model of discrete spatial variation

Each subregion is described by a statistical distribution $Z_i$, e.g., homicide counts are Poisson distributed.

The main objective of the analysis is to estimate the joint distribution of the random variables $Z = \{Z_1, \ldots, Z_n\}$.

We use a model-driven approach to correct the missing data. It is called the “Empirical Bayes” method... We could also use the “Full Bayes” method (but that is another story...)

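In Bayesian statistics, the best estimate of the true and unknown rate $\theta_i$ is

$$\hat{\theta}_i = w_i r_i + (1 - w_i)\,\mu_i$$

where

$$w_i = \frac{\sigma_i^2}{\sigma_i^2 + \mu_i / n_i}, \qquad r_i = \frac{y_i}{n_i} \quad \text{(measured rate)}$$

with $y_i$ the number of events and $n_i$ the population at risk in area $i$.

Simplifying assumptions for estimating means and variances for all random variables of all areas (Marshall, 1991):

$$\hat{\mu} = \frac{\sum_i y_i}{\sum_i n_i}$$

$$\hat{\sigma}^2 = \frac{\sum_i n_i (r_i - \hat{\mu})^2}{\sum_i n_i} - \frac{\hat{\mu}}{\bar{n}}$$

$$\hat{\theta}_i = \hat{\mu} + \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + \hat{\mu} / n_i}\,(r_i - \hat{\mu})$$

where $\bar{n}$ is the mean population across areas.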
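A minimal sketch of a global empirical Bayes rate smoother in the spirit of the Marshall (1991) simplification: one overall mean and one overall variance are estimated by the method of moments, and each raw rate is shrunk towards the overall mean with a weight that grows with the area's population. The function name, the truncation of a negative variance estimate to zero, and the sample counts are assumptions made for illustration.

```python
def empirical_bayes_rates(events, populations):
    """Global empirical Bayes smoothing of rates (method-of-moments form)."""
    if len(events) != len(populations) or not events:
        raise ValueError("events and populations must be non-empty and equal length")
    total_pop = sum(populations)
    mu = sum(events) / total_pop              # overall rate
    n_bar = total_pop / len(populations)      # mean population per area
    raw = [y / n for y, n in zip(events, populations)]
    # Population-weighted variance of raw rates minus expected sampling noise
    var = sum(n * (r - mu) ** 2 for r, n in zip(raw, populations)) / total_pop
    var -= mu / n_bar
    var = max(var, 0.0)   # assumption: truncate a negative estimate to zero
    smoothed = []
    for r, n in zip(raw, populations):
        denom = var + mu / n
        w = var / denom if denom > 0 else 0.0   # weight given to the raw rate
        smoothed.append(mu + w * (r - mu))
    return raw, smoothed

# Hypothetical data: two small areas and two large ones
events = [0, 5, 40, 900]
populations = [50, 100, 2000, 20000]
raw, smoothed = empirical_bayes_rates(events, populations)
for n, r, s in zip(populations, raw, smoothed):
    print(f"n={n:>6}  raw={r:.4f}  smoothed={s:.4f}")
```

In this example the area with 50 people and no events is pulled almost all the way to the overall rate, while the area with 20,000 people keeps an estimate close to its raw rate.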

Empirical Bayes


[Scatter plots: municipalities of the São Paulo Metropolitan Region (RMSP) and districts of the municipality of São Paulo (MSP). First panel: infant mortality rate (y-axis, 0 to 60) against population under 1 year old (x-axis, 0 to 25,000). Second panel: estimated infant mortality rate (y-axis, 0 to 60) against population less than 1 year old (x-axis, 0 to 25,000).]


Infant Mortality Rate – São Paulo (Raw)


Infant Mortality Rate – São Paulo (Corrected)


Some Important Questions

How does scale matter?

How do the spatial partitions matter?

How does proximity matter?

What can we learn by studying how multiple data vary in space?

How many prior assumptions can we impose on our spatial data?

The Modifiable Areal Unit Problem (MAUP): A Question of Scale

A basic problem with areal data: the spatial definition of the boundaries of the areas affects the results.

Different results can be obtained just by changing the boundaries of these zones.

This problem is known as the “modifiable areal unit problem” (MAUP).
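The effect of zone boundaries on a statistic can be shown with a deliberately contrived example. The sketch below builds a synthetic 4 x 4 grid of cells carrying two attributes that share a trend across rows but have opposite patterns across columns; all values and names are hypothetical. The same cells are then aggregated in two ways (by rows and by columns) and the correlation between the two attributes is recomputed on the zone means.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb)

def zone_means(values, zones):
    """Average a {cell: value} mapping within each zone (a list of cells)."""
    return [sum(values[c] for c in zone) / len(zone) for zone in zones]

size = 4
cells = [(i, j) for i in range(size) for j in range(size)]
# Synthetic attributes: same trend across rows, opposite pattern across columns
x = {(i, j): i + 2 * (-1) ** j for i, j in cells}
y = {(i, j): i - 2 * (-1) ** j for i, j in cells}

rows = [[(i, j) for j in range(size)] for i in range(size)]   # zoning 1
cols = [[(i, j) for i in range(size)] for j in range(size)]   # zoning 2

r_cells = pearson([x[c] for c in cells], [y[c] for c in cells])
r_rows = pearson(zone_means(x, rows), zone_means(y, rows))
r_cols = pearson(zone_means(x, cols), zone_means(y, cols))
print(f"cells: {r_cells:.3f}  row zones: {r_rows:.3f}  column zones: {r_cols:.3f}")
```

With this construction the cell-level correlation is moderately negative, the row zoning gives a perfect positive correlation and the column zoning a perfect negative one, even though the underlying data never change. Real data are rarely this extreme, but the mechanism is the same.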

Scale Effects

[Two slides of map panels, each showing: per capita income; jobs / population; illiterate / population]


[Map panels: population > 60 years; illiterates; per capita income]

270 ZONES OD97

Scale Effects: Fighting the MAUP


96 DISTRICTS OF SÃO PAULO





96 INCOME-HOMOGENEOUS ZONES IN SÃO PAULO





Correlation matrices

VARIABLES

A) Percentage of population 60 years old or more

B) Percentage of illiterate population

C) Per capita individual income

[Figure: correlation matrices of variables A, B and C for three zonings: 270 zones (OD97); 96 districts; 96 income-aggregated zones]


Get census data

Identify inter-tract variation

Adaptation

Minimize the outlier effect

Reduce data variability

A Question of Scale

Regionalization

Reaggregate N small areas (the finest scale available) into M bigger regions to reduce scale effects.

A possible solution: constrained clustering

Regionalization: Maps as graphs

[Map panels: simple aggregation; population-constrained aggregation]
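Treating the map as a graph, with areas as nodes and shared borders as edges, constrained clustering can be sketched as a greedy merge of neighbouring regions. The code below is one simple illustrative variant, not necessarily the algorithm behind the slides: it repeatedly merges the two adjacent regions with the most similar population-weighted mean until at most `n_regions` remain, then keeps merging any region whose population is below `min_pop`. The function name, parameters and sample data are all hypothetical, and populations are assumed to be positive.

```python
def regionalize(values, populations, adjacency, n_regions, min_pop=0):
    """Greedy contiguity-constrained merging; returns a list of sets of area ids."""
    members = {a: {a} for a in values}     # region id -> member areas
    pop = dict(populations)                # region id -> total population
    mean = dict(values)                    # region id -> population-weighted mean
    edges = {tuple(sorted(e)) for e in adjacency}   # borders between regions

    while edges:
        undersized = {r for r, p in pop.items() if p < min_pop}
        if len(members) > n_regions:
            candidates = edges
        else:
            # Region count reached: only merges that fix an undersized region
            candidates = {e for e in edges if undersized.intersection(e)}
        if not candidates:
            break
        # Pick the neighbouring pair with the most similar means (ties by id)
        a, b = min(candidates, key=lambda e: (abs(mean[e[0]] - mean[e[1]]), e))
        total = pop[a] + pop[b]
        mean[a] = (mean[a] * pop[a] + mean[b] * pop[b]) / total
        pop[a] = total
        members[a] |= members.pop(b)
        del pop[b], mean[b]
        # Redirect b's borders to a and drop the now-internal border
        edges = {tuple(sorted((a if u == b else u, a if v == b else v)))
                 for u, v in edges}
        edges = {e for e in edges if e[0] != e[1]}
    return sorted(members.values(), key=min)

# Hypothetical chain of six areas: 1-2-3-4-5-6
values = {1: 10, 2: 11, 3: 30, 4: 31, 5: 50, 6: 52}
populations = {1: 500, 2: 500, 3: 50, 4: 400, 5: 400, 6: 400}
adjacency = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

simple = regionalize(values, populations, adjacency, n_regions=3)
constrained = regionalize(values, populations, adjacency, n_regions=6, min_pop=100)
print(simple)
print(constrained)
```

Because merges are only allowed along edges of the graph, every resulting region is spatially contiguous. The population threshold plays the role of the constraint in population-constrained aggregation: area 3, with only 50 inhabitants, is absorbed by its most similar neighbour.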
