two-step classification method for spatial decision tree

An Efficient Two-Step Method for Classification of Spatial Data

Authors: Krzysztof Koperski, Jiawei Han, Nebojsa Stefanovic

Presented at: Spatial Data Handling (SDH’98)

Reviewed by: Abhishek Agrawal

Introduction

• Spatial databases hold very large amounts of spatial data, collected for applications ranging from remote sensing and geographical information systems (GIS) to computer cartography and environmental assessment and planning.

• These databases contain many interesting implicit spatial relations and patterns that are not explicitly stored and therefore have to be extracted.

• One spatial data mining technique is the classification of the spatial objects stored in a spatial database: the objective is to label the objects by identifying a set of rules that describes the partition into classes.

Classification Approach : Spatial Decision Tree

❖ In this paper [1], the authors use a decision tree to classify spatial objects based on:

➢ Non-spatial properties of the classified objects (traditional)

➢ Spatial relations of the classified objects to other objects in the database

❖ The authors also analyze the classification of spatial objects in relation to thematic maps and to spatial relationships with other objects in the database.

❖ For this decision-tree approach to spatial classification, the authors report experimental results on both real and synthetic data, comparing performance and result quality with existing methods for the same problem.

Business Problem: Label local business units such as shopping malls or stores by their profit status, based on the influence of their trade area.

Problem Definition

Data Mining Problem: Classify spatial objects such as shopping malls or stores, each described by its attributes, into two classes Y and N. The class is given by the attribute high_profit, which takes the value Y for “yes” and N for “no”.

● In our example, objects OID1 and OID2 belong to class Y and objects OID3, OID4 and OID5 belong to class N.


● We want to build a decision tree classifying objects Oi based on two types of information:

➢ descriptions of the objects in the proximity of objects Oi


➢ non-spatial attributes of the thematic map

State of the Art

● Fayyad et al. [2] used decision tree methods to classify images of stellar objects in order to detect stars and galaxies. They used the low-level image processing system FOCAS to select and generate basic attributes. The method works on image databases and is tailored to astronomical attributes, so it is not suitable for vector data (GIS databases).

● Ester et al. [3] proposed another approach, based on the ID3 algorithm and the concept of neighbourhood graphs. This method does not analyze aggregate values of non-spatial attributes for neighbouring objects, and it does not perform relevance analysis to narrow its search space.

● Ng and Yu [4] described a method for extracting strong, common and discriminating characteristics of clusters based on thematic maps. They did not extend the resulting characteristics to the construction of decision trees.

Classification Algorithm

Building a decision tree to classify spatial objects based on spatial predicates, functions and thematic maps.

Input:

1. Spatial database containing:

a. classified objects Oc

b. other spatial objects with non-spatial attributes

2. Geo-mining query specifying:

a. objects to be used, predictive attributes, predicates and functions

b. attribute, predicate or function used as a class label

Output: Binary decision tree

Method: Spatial Decision Tree

1. Collect a set S of classified objects and other objects that are used for description.

2. For the sample of spatial objects Oc from S:

a. Build sets of predicates describing all objects using coarse predicates, functions and attributes.

b. Perform generalization of the sets of predicates based on concept hierarchies.

c. Find coarse predicates, functions, and relevant attributes using the RELIEF algorithm.

3. Find the best size for the buffer for aggregates of thematic map polygons. This is done by finding, for all relevant non-spatial aggregate attributes, the buffer size Xmax at which the information gain for the aggregated attribute is maximum.

4. Build sets of predicates using relevant fine predicates and generalize based on concept hierarchies.

5. Generate the decision tree.

Step 1 : Collect a set S of classified objects and other objects that are used for description




Step 1.a : Define the MBR (Minimum Bounding Rectangle) using the data distribution, with a confidence level as the threshold.




Step 2.a : Find a coarse description for the sample, listing its spatial attributes, functions, etc.
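A coarse description can be computed cheaply from minimum bounding rectangles instead of exact geometries. The sketch below is only an illustration under stated assumptions: MBRs are axis-aligned tuples (xmin, ymin, xmax, ymax), and coarse_close_to, coarse_description and the distance threshold are hypothetical names and parameters, not taken from the paper.

```python
from math import hypot

def mbr_distance(a, b):
    """Minimum distance between two axis-aligned rectangles (0 if they overlap)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)  # horizontal gap, 0 if the x-ranges overlap
    dy = max(by1 - ay2, ay1 - by2, 0.0)  # vertical gap, 0 if the y-ranges overlap
    return hypot(dx, dy)

def coarse_close_to(a, b, threshold):
    """Hypothetical coarse predicate: the MBRs lie within `threshold` of each other."""
    return mbr_distance(a, b) <= threshold

def coarse_description(target_mbr, neighbours, threshold):
    """Set of (predicate, object type) pairs that hold at the MBR level.

    `neighbours` is a list of (object type, MBR) pairs.
    """
    return {
        ("close_to", kind)
        for kind, mbr in neighbours
        if coarse_close_to(target_mbr, mbr, threshold)
    }

# Illustrative data only.
mall = (0, 0, 10, 10)
others = [("highway", (13, 14, 20, 20)), ("park", (50, 50, 60, 60))]
description = coarse_description(mall, others, threshold=5)
```

Objects that pass the coarse test are candidates for the more expensive fine predicates evaluated in Step 4.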





Step 2.b : Generalize the predicates using concept hierarchies
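Generalization replaces specific concepts in a predicate with their ancestors in a concept hierarchy, so that several low-level predicates collapse into one. A minimal sketch, assuming a hierarchy stored as a child-to-parent dictionary; the concept names and the function names are made up for illustration.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
HIERARCHY = {
    "elementary_school": "school",
    "high_school": "school",
    "school": "public_facility",
    "city_park": "park",
    "state_park": "park",
}

def generalize(concept, levels=1):
    """Climb up to `levels` steps in the hierarchy, stopping at the top."""
    for _ in range(levels):
        if concept not in HIERARCHY:
            break
        concept = HIERARCHY[concept]
    return concept

def generalize_predicates(predicates, levels=1):
    """Generalize the argument of every (predicate name, concept) pair."""
    return {(name, generalize(arg, levels)) for name, arg in predicates}

# Illustrative data only.
fine = {
    ("close_to", "elementary_school"),
    ("close_to", "high_school"),
    ("close_to", "city_park"),
}
coarse = generalize_predicates(fine)
```

Because the result is a set, predicates that become identical after generalization are merged automatically.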





Step 2.c : Find coarse predicates, functions, and relevant attributes using the RELIEF algorithm

RELIEF algorithm: find relevant attributes

For every object s in the sample, two nearest neighbours are found: one belongs to the same class (Y/N) as s (the nearest hit) and the other belongs to a different class than s (the nearest miss).




Step 2.c : Give weights to each predicate based on the predicate values of the neighbours:

➔ If the nearest hit has the same predicate value, the weight of this predicate increases ↑

➔ If the nearest hit has a different predicate value, the weight of this predicate decreases ↓

➔ If the nearest miss has the same predicate value, the weight of this predicate decreases ↓

➔ If the nearest miss has a different predicate value, the weight of this predicate increases ↑

The predicates whose weight exceeds a threshold are then selected as relevant.
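The four weight-update rules can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' implementation: each object is a tuple of 0/1 predicate values, neighbours are found by Hamming distance, every class has at least two objects, and weights are scaled into the range [-1, 1]. All function names are hypothetical.

```python
def hamming(a, b):
    """Number of predicates on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(sample, labels, i, same_class):
    """Index of the nearest other object with the same (or a different) class as object i."""
    candidates = [
        j for j in range(len(sample))
        if j != i and (labels[j] == labels[i]) == same_class
    ]
    return min(candidates, key=lambda j: hamming(sample[i], sample[j]))

def relief_weights(sample, labels):
    """Apply the hit/miss rules to every object and predicate."""
    m = len(sample)
    weights = [0.0] * len(sample[0])
    for i in range(m):
        hit = nearest(sample, labels, i, same_class=True)
        miss = nearest(sample, labels, i, same_class=False)
        for p in range(len(weights)):
            # Hit: same value raises the weight, different value lowers it.
            weights[p] += (1 if sample[hit][p] == sample[i][p] else -1) / (2 * m)
            # Miss: different value raises the weight, same value lowers it.
            weights[p] += (1 if sample[miss][p] != sample[i][p] else -1) / (2 * m)
    return weights

def relevant_predicates(sample, labels, threshold):
    """Indices of predicates whose weight exceeds the threshold."""
    return [p for p, w in enumerate(relief_weights(sample, labels)) if w > threshold]

# Illustrative data: predicate 0 follows the class, predicate 1 does not.
sample = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["Y", "Y", "N", "N"]
weights = relief_weights(sample, labels)
```

In this toy sample predicate 0 receives the maximum weight and predicate 1 the minimum, so only predicate 0 survives a threshold of 0.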







Step 3: Find the best size for the buffer for aggregates of thematic map polygons.

• Different criteria may be used for the shape of the buffer. The buffers may be based on rings or on customer penetration polygons.

• The rings have some advantages:

1. ease of use

2. no need to determine the trade area based on customer data

3. easy comparison between sites


• Buffers represent the areas that have an impact on the class label attribute of the classified objects.

• The buffer size is fixed by finding, for all relevant non-spatial aggregate attributes, the buffer size Xmax at which the information gain for the aggregated attribute is maximum.
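The buffer-size search can be sketched as a loop over candidate radii that keeps the radius with the highest information gain. The sketch assumes the aggregate values (for example, total population inside the ring) have already been computed for each radius, and it scores a numeric aggregate by its best binary threshold split; both choices, and all names, are assumptions made for illustration.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )

def best_split_gain(values, labels):
    """Highest information gain over all binary splits of the form value <= t."""
    n = len(labels)
    base = entropy(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # the largest value cannot split anything
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, gain)
    return best

def best_buffer_size(aggregates_by_radius, labels):
    """Radius whose aggregated attribute gives the maximum information gain."""
    return max(
        aggregates_by_radius,
        key=lambda r: best_split_gain(aggregates_by_radius[r], labels),
    )

# Illustrative data: one aggregate value per object for each candidate radius.
labels = ["Y", "Y", "N", "N", "N"]
aggregates = {
    1: [5, 1, 5, 1, 3],
    2: [9, 8, 2, 3, 1],
    3: [4, 4, 4, 4, 4],
}
x_max = best_buffer_size(aggregates, labels)
```

Here radius 2 separates the two classes perfectly, so it is returned as the buffer size; radius 3 gives a constant aggregate and therefore no gain.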

Step 4 : Build sets of predicates using relevant fine predicates and generalize based on concept hierarchies.

Step 5 : Build Decision Tree : Binary Split (based on information gain)
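A binary split on information gain can be sketched as a small recursive tree builder. This is a generic illustration, not the authors' code: each object is represented by the set of predicates that hold for it, each internal node tests one predicate, and the predicate names are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, pred):
    """Information gain of splitting on whether `pred` holds for an object."""
    yes = [l for r, l in zip(rows, labels) if pred in r]
    no = [l for r, l in zip(rows, labels) if pred not in r]
    if not yes or not no:
        return 0.0
    n = len(labels)
    return entropy(labels) - len(yes) / n * entropy(yes) - len(no) / n * entropy(no)

def build_tree(rows, labels, predicates):
    """Return a class label (leaf) or a dict with a test and two subtrees."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not predicates:
        return majority
    best = max(predicates, key=lambda p: info_gain(rows, labels, p))
    if info_gain(rows, labels, best) <= 0.0:
        return majority
    rest = [p for p in predicates if p != best]
    yes_idx = [i for i, r in enumerate(rows) if best in r]
    no_idx = [i for i, r in enumerate(rows) if best not in r]
    return {
        "test": best,
        "yes": build_tree([rows[i] for i in yes_idx], [labels[i] for i in yes_idx], rest),
        "no": build_tree([rows[i] for i in no_idx], [labels[i] for i in no_idx], rest),
    }

def classify(tree, row):
    """Follow the binary tests until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["test"] in row else tree["no"]
    return tree

# Illustrative data: five objects, classes as in the running example.
rows = [
    {"close_to_highway", "high_income_area"},
    {"close_to_highway"},
    {"high_income_area"},
    set(),
    {"close_to_park"},
]
labels = ["Y", "Y", "N", "N", "N"]
tree = build_tree(rows, labels, ["close_to_highway", "high_income_area", "close_to_park"])
```

On this toy data a single test on close_to_highway already separates the classes, so the tree has one internal node and two leaves.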

Complexity Analysis

Results & Performance Evaluation

• Experiments were performed on synthetic data merged with TIGER U.S. census data for Washington State.

• With real data, the best results were obtained with a threshold between 0 and 0.2, and accuracy increased drastically when relevance analysis was used.

Conclusion and Future Directions

• Classification of geographical objects enables researchers to explore interesting relations between spatial and non-spatial data.

• The algorithm uses less costly, approximate spatial computations and relevance analysis to produce smaller and more accurate decision trees.

• Pre-computed spatial indexes can be stored and used in regular spatial queries to find neighbourhood attributes.

• The authors plan to perform experiments using aggregate values for thematic maps and varying the distance for close_to spatial predicates.

• The authors also plan to integrate the method with their spatial data mining prototype, GeoMiner.

References

[1] Koperski, Krzysztof, Jiawei Han, and Nebojsa Stefanovic. "An efficient two-step method for classification of spatial data." Proceedings of the International Symposium on Spatial Data Handling (SDH’98). 1998.

[2] Fayyad, Usama M., S. George Djorgovski, and Nicholas Weir. "Automating the analysis and cataloging of sky surveys." Advances in knowledge discovery and data mining. American Association for Artificial Intelligence, 1996.

[3] Ester, Martin, Hans-Peter Kriegel, and Jörg Sander. "Spatial data mining: A database approach." Advances in spatial databases. Springer Berlin Heidelberg, 1997.

[4] Ng, R. T., and Y. Yu. "Discovering Strong, Common and Discriminating Characteristics of Clusters from Thematic Maps." Proc. of the 11th Annual Symp. on Geographic Information Systems. 1997.
