

Exploring Spatial Datasets Using Discriminative Pattern Mining and Pattern Similarity Measure

Tomasz F. Stepinski, Lunar and Planetary Institute ([email protected])
Wei Ding, Dept. of Computer Science, Univ. of Massachusetts Boston ([email protected])
Josue Salazar, Lunar and Planetary Institute ([email protected])

Motivation

Complex multi-attribute spatial datasets hide knowledge that needs to be discovered by exploring their structure. We propose an association analysis-based strategy for exploring spatial datasets that possess a prior binary classification.

Innovation

[Workflow diagram: multi-attribute spatial dataset with prior binary classification (class 1, class 2) -> mining for discriminative patterns -> agglomerative clustering of patterns (cluster 1, cluster 2, cluster 3) -> segmentation of class 1 into clusters of similar patterns of exploratory attributes]

Each spatial element is a transaction containing values of exploratory attributes.
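The mining step can be illustrated with a minimal sketch. The poster does not state which criterion makes a pattern discriminative, so the growth-rate test below (support in class 1 divided by support in class 2), the thresholds, the brute-force enumeration, and all function names are assumptions made for illustration only.

```python
from itertools import combinations

def matches(pattern, transaction):
    """True if the transaction contains every (attribute, value) item of the pattern."""
    return all(transaction.get(a) == v for a, v in pattern)

def support(pattern, transactions):
    """Fraction of transactions that contain the pattern."""
    if not transactions:
        return 0.0
    return sum(matches(pattern, t) for t in transactions) / len(transactions)

def footprint(pattern, transactions):
    """Indices of the spatial elements (transactions) that contain the pattern."""
    return [i for i, t in enumerate(transactions) if matches(pattern, t)]

def discriminative_patterns(class1, class2, max_len=2, min_support=0.5, min_growth=2.0):
    """Brute-force search for patterns frequent in class 1 and rare in class 2.

    Assumed criterion: support in class 1 >= min_support and
    growth rate = support(class 1) / support(class 2) >= min_growth.
    """
    items = sorted({(a, v) for t in class1 for a, v in t.items()})
    found = []
    for n in range(1, max_len + 1):
        for pattern in combinations(items, n):
            # A pattern may use each attribute at most once.
            if len({a for a, _ in pattern}) < n:
                continue
            s1 = support(pattern, class1)
            if s1 < min_support:
                continue
            s2 = support(pattern, class2)
            growth = float("inf") if s2 == 0 else s1 / s2
            if growth >= min_growth:
                found.append((pattern, s1, s2))
    return found

# Toy transactions: each dict is one spatial element with ordinal attribute values.
class1 = [{"A": 1, "B": 2}, {"A": 1, "B": 2}, {"A": 1, "B": 1}, {"A": 2, "B": 2}]
class2 = [{"A": 2, "B": 1}, {"A": 2, "B": 2}, {"A": 1, "B": 1}, {"A": 2, "B": 1}]

patterns = discriminative_patterns(class1, class2)
for pattern, s1, s2 in patterns:
    print(pattern, s1, s2, footprint(pattern, class1))
```

Exhaustive enumeration is only practical for toy inputs; a real dataset would need a proper frequent-pattern miner.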

Algorithm

[Figure: grid of spatial objects labeled by class (1 or 2), with the footprint of pattern X (4 objects) and the footprint of pattern Y (2 objects) outlined. Each pattern is shown as a row of values over attributes A, B, C, D, where "_" marks an attribute that has no value in the pattern.]

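Pattern similarity

The similarity between patterns X and Y is a weighted sum of per-attribute similarities:

\[ S(X, Y) = \sum_{i=1}^{4} w_i \, S_i(X_i, Y_i) \]

The form of S_i depends on which of the two patterns has a value for attribute i ("-" denotes no value). P_X and P_Y are taken over the values of that attribute among the objects in the footprints of X and Y; the poster shows these footprint values for each attribute.

Attribute A (value in both X and Y):

\[ S_A(X_A, Y_A) = s(x_A, y_A) \]

Attribute B (value in X only):

\[ S_B(X_B, -) = \sum_{k=1}^{2} P_Y(y_k) \, s(y_k, x_B) \]

Attribute C (value in Y only):

\[ S_C(-, Y_C) = \sum_{k=1}^{2} P_X(x_k) \, s(x_k, y_C) \]

Attribute D (no value in either pattern):

\[ S_D(-, -) = \sum_{l=1}^{2} \sum_{k=1}^{2} P_X(x_l) \, P_Y(y_k) \, s(x_l, y_k) \]

In the value-level similarity s(x_i, y_i), z_1, z_2, ..., z_k are the ordinal values between x_i and y_i, such that z_1 = x_i + 1 and z_k = y_i - 1.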

Example: Analysis of 2008 presidential election

Input data :> 2008 election results + 13 socio-economic indicators from the US Census Bureau for 3108 counties.

Example 1 :> McCain voting block (red) and Obama voting block (blue) that are dissimilar in socio-economic sense and geographically apart.

Example 2 :> McCain voting block (red) and Obama voting block (green) that are dissimilar in socio-economic sense but geographically collocated.

Visual analytics :> Discriminative patterns are calculated for four groups (A, B, E, and F) of counties. In each group, patterns are ordered using agglomerative clustering. The clustering heat map is a distance matrix with rows ordered according to the clustering.

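Similarity between two ordinal values x_i and y_i of attribute i, where z_1, ..., z_k are the ordinal values between them:

\[ s(x_i, y_i) = \frac{2 \times \log P(x_i \lor z_1 \lor z_2 \lor \dots \lor z_k \lor y_i)}{\log P(x_i) + \log P(y_i)} \]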
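The value-level similarity above and the pattern-level measure S(X, Y) can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: P in s(x, y) is taken to be a prior distribution over the ordinal values of an attribute, P_X and P_Y are taken to be value frequencies within the footprints, and the toy patterns, footprints, uniform prior, and equal weights are all invented for the example.

```python
import math
from collections import Counter

def value_similarity(x, y, prior):
    """s(x, y) = 2 log P(x v z_1 v ... v z_k v y) / (log P(x) + log P(y)).

    `prior` maps each ordinal value to its probability (assumed to be a
    dataset-wide prior). The disjunction covers every value from x to y.
    """
    if x == y:
        return 1.0  # the ratio reduces to 1; also avoids 0/0 when P(x) == 1
    lo, hi = min(x, y), max(x, y)
    p_range = sum(prior[v] for v in range(lo, hi + 1))
    return 2 * math.log(p_range) / (math.log(prior[x]) + math.log(prior[y]))

def distribution(values):
    """Relative frequency of each value among the objects of a footprint."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def attribute_similarity(x_val, y_val, x_fp, y_fp, prior):
    """S_i for one attribute; None means the pattern has no value for it.

    A pattern value is treated as a distribution with all mass on that value,
    so one double sum covers all four cases (both, X only, Y only, neither).
    """
    px = {x_val: 1.0} if x_val is not None else distribution(x_fp)
    py = {y_val: 1.0} if y_val is not None else distribution(y_fp)
    return sum(p * q * value_similarity(a, b, prior)
               for a, p in px.items() for b, q in py.items())

def pattern_similarity(X, Y, fp_X, fp_Y, priors, weights):
    """S(X, Y) = sum_i w_i S_i(X_i, Y_i)."""
    return sum(w * attribute_similarity(X.get(a), Y.get(a), fp_X[a], fp_Y[a], priors[a])
               for a, w in weights.items())

# Invented toy data: X has values for A and B, Y has values for A and C.
X = {"A": 1, "B": 2}
Y = {"A": 1, "C": 1}
fp_X = {"A": [1, 1, 1, 1], "B": [2, 2, 2, 2], "C": [1, 1, 1, 2], "D": [1, 1, 2, 2]}
fp_Y = {"A": [1, 1], "B": [2, 1], "C": [1, 1], "D": [1, 2]}
priors = {a: {1: 0.5, 2: 0.5} for a in "ABCD"}   # assumed uniform prior
weights = {a: 0.25 for a in "ABCD"}              # assumed equal weights

S_XY = pattern_similarity(X, Y, fp_X, fp_Y, priors, weights)
print(S_XY)  # 0.6875
```

With only two ordinal values and a uniform prior, s(1, 2) is 0 because the disjunction covers the whole value range, so the result is simply the weighted share of matching values.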

[Figure: map of county sets A, B, C, D, E, F, G, H, and clustering heat maps for pattern sets A and B (Obama) and pattern sets E and F (McCain). Color-scale legends:]

pattern size: 3 - 12, 13 - 20, 21 - 27, 28 - 37, 38 - 58, 59 - 100
pattern length: 1 - 2, 3 - 4, 5 - 6, 7 - 8, 9 - 10, 11 - 13
pattern overlap: 0 - 0.25, 0.25 - 0.5, 0.5 - 1, 1 - 2, 2 - 3, 3 - 4, 4 - 13
pattern dissimilarity: 0 - 0.05, 0.05 - 0.18, 0.18 - 0.32, 0.32 - 0.46, 0.46 - 0.62, 0.62 - 0.82, 0.82 - 1
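Ordering the rows of a pattern distance matrix by agglomerative clustering, as done for the heat maps, can be sketched in a few lines. The poster does not name the linkage criterion, so average linkage is an assumption here, and the 4 x 4 dissimilarity matrix is invented for illustration.

```python
def agglomerative_order(dist):
    """Row order obtained by average-linkage agglomerative clustering.

    `dist` is a symmetric matrix of pairwise pattern dissimilarities.
    Clusters are merged closest-first; the members of the final cluster,
    read left to right, give the row order for the heat map.
    """
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        _, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

def reorder(dist, order):
    """Permute rows and columns of the distance matrix into clustering order."""
    return [[dist[i][j] for j in order] for i in order]

# Invented dissimilarities: patterns 0 and 2 are close, as are patterns 1 and 3.
dist = [
    [0.0, 0.9, 0.1, 0.9],
    [0.9, 0.0, 0.9, 0.2],
    [0.1, 0.9, 0.0, 0.9],
    [0.9, 0.2, 0.9, 0.0],
]

order = agglomerative_order(dist)
heat_map = reorder(dist, order)
print(order)  # [0, 2, 1, 3]
```

After reordering, similar patterns sit next to each other, so clusters show up as low-dissimilarity blocks along the diagonal of the heat map.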

[Figure: heat map of discriminative patterns (rows) against socio-economic indicators (columns).]

Socio-economic indicators: pop. dens., urban pop. %, female pop. %, foreign born %, per capita income, household income, HS edu., bachelor edu., white pop. %, poverty %, owned house %, soc. sec. recipient %, soc. sec. income

Indicator categories: lowest (1), low (2), average (3), high (4), highest (5), no value ( _ )

Row groups:
Obama block 1 (1 - 872)
Obama block 2 (928 - 3364)
Voted for Obama but not in discriminative patterns support (3365 - 3610)
McCain block (3611 - 6680)
Voted for McCain but not in discriminative patterns support (6681 - 6970)

set  description                                                               # of counties   population    # voted      winning %

won by Obama
A    won by Obama, IN footprint of Obama and NOT in footprint of McCain        495             153,611,411   67,040,847   62.14
B    won by Obama, NOT in footprint of Obama and NOT in footprint of McCain    361             16,696,346    9,568,427    56.24
C    won by Obama, NOT in footprint of Obama but IN footprint of McCain        9               199,478       88,945       51.07
D    won by Obama, IN footprint of Obama and IN footprint of McCain            1               210,554       61,494       52.90

won by McCain
E    won by McCain, IN footprint of McCain and NOT in footprint of Obama       1688            51,289,510    23,224,203   62.11
F    won by McCain, NOT in footprint of McCain and NOT in footprint of Obama   472             31,269,880    15,772,301   59.01
G    won by McCain, NOT in footprint of McCain but IN footprint of Obama       62              23,518,016    8,941,422    55.91
H    won by McCain, IN footprint of McCain and IN footprint of Obama           20              2,255,368     1,024,861    60.83
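The eight county sets follow mechanically from three facts per county: the winner, and whether the county lies in the footprint of the Obama patterns and of the McCain patterns. A minimal sketch of that assignment is shown here; the function name and argument names are hypothetical, and the county counts are copied from the table to check that they add up to the 3108 counties of the input data.

```python
def county_set(winner, in_obama_footprint, in_mccain_footprint):
    """Return the set label (A-H) for one county, following the table's definitions."""
    if winner == "Obama":
        if in_obama_footprint:
            return "D" if in_mccain_footprint else "A"
        return "C" if in_mccain_footprint else "B"
    if winner == "McCain":
        if in_mccain_footprint:
            return "H" if in_obama_footprint else "E"
        return "G" if in_obama_footprint else "F"
    raise ValueError(f"unexpected winner: {winner!r}")

# Number of counties per set, as listed in the table.
counties_per_set = {"A": 495, "B": 361, "C": 9, "D": 1,
                    "E": 1688, "F": 472, "G": 62, "H": 20}

print(county_set("Obama", True, False))   # A
print(county_set("McCain", False, True))  # G
print(sum(counties_per_set.values()))     # 3108
```

Sets A and E are the counties whose vote agrees with the discriminative patterns of their own class only, while C, D, G, and H are the counties covered by the patterns of the opposite class.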