geogra phical analys is

Geographical analysis

Overlay, cluster analysis, auto-correlation, trends, models, network

analysis, spatial data mining

Geographical analysis

• Combination of different geographic data sets or themes by overlay or statistics

• Discovery of patterns, dependencies• Discovery of trends, changes (time)• Development of models• Interpolation, extrapolation, prediction• Spatial decision support, planning• Consequence analysis (What if?)

Example overlay

• Two subdivisions with labeled regions

Soil type 1Soil type 2Soil type 3Soil type 4

Birch forestBeech forestMixed forest

Birch forest on soil type 2

soil vegetation

Kinds of overlay

• Two subdivisions with the same boundaries- nominal and nominal Religion and voting per municipality- nominal and ratio Voting and income per municipality- ratio and ratio Average income and age of employees

• Two subdivisions with different boundariesSoil type and vegetation

• Subdivision and elevation modelVegetation and precipitation

Kinds of overlay, cont’d

• Subdivision and point setquarters in city, occurrences of violence on the street

• Two elevation modelselevation and precipitation

• Elevation model and point setelevation and epicenters of earthquakes

• Two point setsmoney machines, street robbery locations

• Network and subdivision, other network, elevation model

Result of overlay

• New subdivision or map layer, e.g. for further processing

• Table with combined data• Count, surface area

Soil Vegetation Area #patches

Type 1 Beech 30 ha 2Type 2 Birch 15 ha 2Type 3 Mixed 8 ha 1Type 4 Beech 2 ha 1…. ….

Buffer and overlay

• Neighborhood analysis: data of a theme within a given distance (buffer) of objects of another theme

Sightings of nesting locations of the great blue heron (point set)Rivers; buffer with width 500 m of a river

Overlay Nesting locations great blue heron near river

Overlay: ways of combination

• Combination (join) of attributes• One layer as selection for the other

– Vegetation types only for soil type 2– Land use within 1 km of a river

Overlay in raster

• Pixel-wise operation, if the rasters have the same coordinate (reference) system

Forest Population increaseabove 2% per year

Pixel-wiseAND

Both

Overlay in vector

• E.g. the plane sweep algorithm as given in Computational Geometry (line segment intersection), to get the overlay in a topological structure

• Using R-trees as an indexing structure to find intersections of boundaries

Combined (multi-way) overlays

• Site planning, new construction sites depending on multiple criteria

• Another example (earth sciences):Parametric land classification: partitioning of the land based on chosen, classified themes

Elevation Annual precipitation

Types of rock Overlay: partitioning based on the three themes

Analysis point set

• Points in an attribute space: statistics, e.g. regression, principal component analysis, dendrograms

(area, population, #crimes)

#population

#crimes (12, 34.000, 34)(14, 45.000, 31)(15, 41.000, 14)(17, 63.000, 82)(17, 66.000, 79) …… ……

Analysis point set

• Points in geographical space without associated value: clusters, patterns, regularity, spread

Actual average nearestneighbor distance versus expected Av. NN. Dist. for this number of points in the region

For example: volcanoes in a region; crimes in a city

Analysis point set

• Points in geographical space with value: up to what distance are measured values “similar” (or correlated)?

2014

13

1012

11

16

18

2115

17

16

22

2119

12

Analysis point set

• Temperature at location x and 5 km away from x is expected to be nearly the same

• Elevation (in Switzerland) at location x and 5 km away from x is not expected to be related (even over 1 km), but it is expected to be nearly the same 100 meters away

• Other examples:– depth to groundwater– soil humidity– nitrate concentration in the soil

Analysis point set

• Points in geographical space with value:auto-correlation (~ up to what distance are measured values “similar”, or correlated)

2014

13

1012

11

16

18

2115

17

16

22

2119

12 n points (n choose 2) pairs;each pair has a distance and a difference in value

distance

difference

distance

Classify distances and determine average per class

Averagedifference observed expected difference

2

2

2

distance distance

sill

range

2σ

Observed variogram Model variogram (linear)

Smaller distances more correlation, smaller variance

Averagedifference observed expected difference

2

2

nugget

Importance auto-correlation

• Descriptive statistic of a data set: describes the distance-dependency of auto-correlation

• Interpolation based on data further away than the range is nonsense

20

14

13

101211

16

18

21 15

17

16

22

21

1912

range

??

Importance auto-correlation

• If the range of a geographic variable is small, more sample point measurements are needed to obtain a good representation of the geographic variable through spatial interpolation

influences cost of an analysis or decision procedure, and quality of the outcome of the analysis

Analysis subdivision

• Nominal subdivision: auto-correlation(~ clustering of equivalent classes)

• Ratio subdivision: auto-correlation

PvdA

CDA

VVD

Auto-correlation No auto-correlation

Auto-correlation nominal subdivision

• 22 neighbor relations (adjacencies) among 12 provinces

• Pr(province A = VVD and province B = VVD) = 4/12 * 3/11

• E(VVD adj. VVD) = 22 * 12/132 = 2• Reality: 4 times

• E(CDA adj. PvdA) = 5.33; reality once

PvdA

CDA

VVD

Join count statistic:

4/12 * 4/11 * 2 * 22

Geographical models

• Properties of (geographical) models:– selective (simplification, more ideal)– approximative– analogous (resembles reality)– structured (usable, analyzable, transformable)– suggestive – re-usable (usable in related situations)

Geographical models

• Functions of models:– psychological (for understanding, visualization)– organizational (framework for definitions)– explanatory– constructive (beginning of theories, laws)– communicative (transfer scientific ideas)– predictive

Example: forest fire

• Is the Kröller-Müller museum well enough protected against (forest)fire?

• Data: proximity fire dept., burning properties of land cover, wind, origin of fire

• Model for: fire spread

b * ws * (1- sh) * (0.2 + cos )

b = burn factorws = wind speed = angle wind – direction pixelsh = soil humidity

Time neighbor pixel on fire: [1.41 *]

Forest fire

Forest; burn factor 0.8Heath; burn factor 0.6Road; burn factor 0.2Museum

Soilhumidity

Origin< 3 minutes< 6 minutes< 9 minutes> 9 minutes

Wind, speed 3

Forest fire model

• Selective: only surface cover, humidity and wind; no temperature, seasonal differences, …

• Approximative: surface cover in 4 classes; no distinction in forest type, etc., pixel based so direction discretized

• Structured: pixels, simple for definition relations between pixels

• Re-usable: approach/model also applies to other locations (and other spread processes)

Network analysis

• When distance or travel time on a network (graph) is considered

• Dijkstra’s shortest path algorithm• Reachability measure for a destination:

potential value

j

ijjcwi )(potential w = weight origin j = distance decay parameterc = distance cost betweenorigin j and destination i

j

ij

Example reachability

• Law Ambulance Transport: every location must be reachable within 15 minutes (from origin of ambulance)

Example reachability

• Physician’s practice:- optimal practice size: 2350 (minimum: 800)- minimize distance to practice - improve current situation with as few changes as possible

Current situation: 16 practices, 30.000 people, average 1875 per practice

Computed, improved situation: 13 practices

Example in table

Original New

Number of practices 16 13

Number of practice locations 9 7

Number of practices < 800 size 2 0

Number of people > 3 km 3957 4624

Average travel distance (km) 0,9 1,2

Largest distance (km) 5,2 5,4

Analysis elevation model

• Landscape shape recognition:- peaks and pits- valleys and ridges- convexity, concavity

• Water flow, erosion,watershed regions,landslides, avalanches

Spatial data mining

• Finding spatial patterns in large spatial data sets– within one spatial data set– across two or more data sets

• With time: spatio-temporal data mining

Spatial data mining and computation

• “Geographic data mining involves the application of computational tools to revealinteresting patterns in objects and events distributed in geographic space and across time” (Miller & Han, 2001)

• Large data sets attempt to carefully define interesting patterns (to avoid finding non-interesting patterns) advanced algorithms neededfor efficiency

Clustering?

• Are the people clustered in this room? How do we define a cluster?

• In spatial data mining we have objects/ entities with a location given by coordinates

• Cluster definitions involve distance between locations

Clustering - options

• Determine whether clustering occurs• Determine the degree of clustering• Determine the clusters• Determine the largest cluster

• Determine the outliers

Co-location

• Are the men clustered?• Are the women clustered?

• Is there a co-location of men and women?

co-location pattern

Co-location

• Like before, we may be interested in– is there co-location?– the degree of co-location– the largest co-location– the co-locations themselves– the objects not involved in co-location

Spatio-temporal data

• Locations have a time stamp• Interesting patterns involve space and time• Example here: time-stamped point set

Trajectory data

• Entities with a trajectory (time-stamped motion path)

• Interesting patterns involve subgroupswith similar heading, expected arrival,joint motion, ...

• n entities = trajectories; n = 10 – 100,000• t time steps; t = 10 – 100,000

input size is nt

• m size subgroup (unknown); m = 10 – 100,000

Trajectory data

• Migration patterns of animals• Trajectories of tornadoes• Tracking of (suspect) individuals for security• Lifelines of people for social behavior

Example pattern in trajectories

• What is the location visited by most entities?

location = circular region of specified radius




4 entities




3 entities


• Compute buffer of each trajectory


• Compute buffer of each trajectory

0

1

2

1

11

• Compute the arrangement of the buffers and the cover count of each cell

1


• One trajectory has t time stamps; its buffer can be computed in O(t log t) time

• All buffers can be computed in O(nt log t) time• The arrangement can be computed in

O(nt log (nt) + k) time, where k = O((nt)2) is the complexity of the arrangement

• Cell cover counts are determined in O(k) time


• Total: O(nt log (nt) + k) time• If the most visited location is visited by

m entities, this is O(nt log (nt) + ntm)

• Note: input size is nt ;n entities, each with location at t moments

Patterns in entity data

Spatial data• n points (locations)• Distance is important

– clustering pattern

• Presence of attributes (e.g. male/female):– co-location patterns

Spatio-temporal data• n trajectories, each

has t time steps• Distance is time-

dependent– flock pattern– meet pattern

• Heading and speed are important and are also time-dependent

Patterns in trajectories

• n trajectories, each with t time steps n polygonal lines with t vertices


• Flock and meet patterns: large enough subset that has same “character” during a time interval– close to each other– same direction of motion– ...Flock: changing locationMeet: fixed location

• Determine the longest duration pattern

Patterns in trajectories• Longest flock: given a radius r and subset

size m, determine the longest time interval for which the m entities were within each other’s proximity (circle radius r)

Time = 0 1 65432 7 8

longest flock in [ 1.9 , 6.4 ]

m = 3


• Computing the longest flock is NP-hard• This remains true for radius cr approximations with

c < 2• A radius 2 approximation of the longest flock can

be computed in time O(n2 t log n)

... meaning: if the longest flock for radius rhas duration , then we surely find a flock ofduration for radius 2r


flock

meet

fixed subsetm = 3

fixed radius


• Go into 3D (space-time) for algorithms

time

0

1

2

4

3

flock meet


Exact radius results

flock

meet

NP-hard

O(n2t2 (n2 log n + t))


Approximate radius results

flock

meet

O(n2t log n)

O((n2t log n) / (m2))

factor 2

factor 1+


• Flock and meet patterns require algorithms in 3-dimensional space (space-time)

• Exact algorithms are inefficient only suitable for smaller data sets

• Approximation can reduce running time with an order of magnitude

Summary

• There are many types of geographical analysis, it is the main task of a GIS

• Overlay and buffer analysis are most important• Statistics is also very important• Spatial and spatio-temporal data mining gives

new types of analysis of geographic data

geogra phical analys is

Documents

type 2birch15 ha

type 4beech2 ha

elevation models elevation

network analysis

cityanalysis point setpoints

themesanalysis point

different boundaries

set elevation