Creation of Spatial Databases for Data Mining
Dang Van Duc, GIS Lab
Vietnam Institute of Information Technology
Ho Tu Bao, KCM lab
Japan Advanced Institute of Science and Technology
15 November 2001
November 2001 2
Research Context
Joint Research on Spatial Knowledge Discovery and Data Mining
To create high-quality methods and tools for spatial data mining
GIS lab (IOIT): methods and tools to create primary spatial databases
PR&IP lab (IOIT+HUS): methods and tools to create secondary spatial databases
CSA lab (JAIST): spatial data mining methods and applications
KDC lab (JAIST): methods and tools for spatial data mining

Our Objectives
Finding high-quality methods to create spatial databases
Providing efficient spatial operations that are suited to spatial data mining

Outline
GIS Overview (a Case Study: PopMap)
An Approach to Spatial Data Mining from a GIS Point of View
Conclusions

GIS Overview
1. What is GIS?
  GIS definition
  Components and tasks of GIS
2. How to create spatial databases?
  Data types
  Data capture and data collection
  Data organization
3. Some spatial operations in GIS useful for spatial data mining
  Spatial relationships
  Buffer and overlay operations

GIS Definition
GIS stands for "geographic information system". It is a special kind of "information system" applied to geographically referenced data.
Information system: a set of processes executed on raw data to produce information that is useful in decision-making.
A GIS is a system for capturing, storing, checking, integrating, manipulating, analyzing and displaying data which are spatially referenced to the Earth (Chorley, 1987).

View of Information Systems (by Robert G. Cromley)
Information System
  Non-spatial Information System (Counting...)
  Spatial Information System
    GIS
      Land Information System (LIS)
        Land Use Information System
        Cadastral Information System
      Other GIS (Economic-Social, Population ...)
        PopMap
    Other Spatial Information System (CAD, CAM...)

GIS Components
GIS is:
  a type of software
  a real application, including the hardware, data, software and people needed to solve a problem
[Figure: GIS components. Earth, Database, Software tools (GIS software), Results (words, charts, graphs, tables, or maps) and Users, linked by abstraction, generalization and interaction]

GIS: Main Tasks
Data input from maps, aerial photos, satellites, surveys, and other sources
Data storage, retrieval, and query
Data transformation, analysis, and modeling, including spatial statistics
Data reporting, such as reports, maps, and plans

Some GIS Software
ArcInfo (ESRI): a complete GIS for data creation, update, query, mapping, and analysis
GeoMedia (Intergraph): a product specifically designed to collect and manage spatial data using standard databases
ArcView (ESRI): a desktop GIS that provides data visualization, query, analysis, and integration capabilities, along with the ability to create and edit geographic data
MapInfo (MapInfo): a desktop GIS organized around four key technology pillars: mapping, routing, geocoding and data
PopMap (VN-IOIT/UN): an integrated software package for geographical information, maps and a graphics database
and more

State of the Art of GIS Software
There are many kinds of GIS software, ranging from low-cost, easy-to-use desktop GIS to high-cost, powerful, difficult-to-use GIS. Fortunately, spatial data can be exchanged between almost all GIS.
GIS producers have added extension modules to their classic GIS (ArcView Spatial Analyst, Intergraph Grid Analyst, etc.), which provide many operations for spatial analysis.
Almost all GIS software integrates its own scripting language, which allows users to customize additional GIS operations (MapBasic for MapInfo, Avenue for ArcView, etc.)

What is PopMap?
Developed by the GIS Lab of Vietnam IOIT for UNSTAT
Integrated, easy-to-use software for developing geographical databases of population and related data
Components:
  tools for maintaining a geographical database
  capabilities for retrieving and processing data in a worksheet environment, and creating statistical graphs
  options for analyzing, interpreting, and developing effective data presentations using maps
Further information about PopMap is on the Web site: http://www.un.org/depts/unsd/softproj/index.htm

GIS Overview
2. How to create spatial databases?
  Data types
  Data capture and data collection
  Data organization

Creating Spatial Database: Data Sources
Obtaining data is an important part of any GIS project
We need to know:
  what types of data we can use with GIS
  how to evaluate them
  where to find them
  and how to create them ourselves

Two Types of Data Sources
Data measured directly by surveys, field data collection, remote sensing
Data obtained from existing maps, tables or other data sources
More and more ready-made digital GIS data sets are becoming available
  Government agencies: census geography
  Topographic survey

Data Model in GIS
Geographical variation in the real world is infinitely complex. The closer we look, the more detail we see, almost without limit.
Data must somehow be reduced to a finite and manageable quantity by a process of generalization or abstraction.
The set of rules used to convert real geographical variation into discrete objects is the data model.
There are two major choices of data model: raster and vector.

Raster Model and Vector Model
The raster model divides the entire study area into a regular grid of cells in a specific sequence.
The vector model uses discrete line segments or points to identify locations.
[Figure: real-world features (district, road, river, clinic, school, province) shown as a raster grid, i.e. an array of rows and columns, and as vector features with X, Y coordinates in PopMap]
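As a rough illustration of the raster model, the sketch below maps a real-world coordinate to the row and column of the grid cell that contains it. The grid origin, the cell size and the function name `cell_index` are assumptions made for this example; they are not taken from PopMap.

```python
import math

def cell_index(x, y, origin_x, origin_y, cell_size):
    """Return (row, col) of the raster cell containing point (x, y).

    Assumes (origin_x, origin_y) is the upper-left corner of the grid,
    rows increase downwards and columns increase to the right.
    """
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((origin_y - y) / cell_size)
    return row, col

# Hypothetical grid: 10-unit cells, upper-left corner at (0, 100).
clinic = (25.0, 95.0)
print(cell_index(*clinic, origin_x=0.0, origin_y=100.0, cell_size=10.0))
```

In the vector model the clinic would simply be stored as the coordinate pair itself; the raster model stores only which cell it falls in.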

Components of Spatial Objects
Spatial objects (entities) consist of both spatial and non-spatial data
Spatial data (also called geometric data) includes location, shape, size, and orientation
Non-spatial data (also called attribute or characteristic data) consists of values, or attributes, associated with a set of locations
Estimates are that 80% of all data has a spatial component (GIS.com)

Components of Spatial Objects (2)
GEOGRAPHICAL DATA
  GEOMETRIC DATA
    Geometry: Point, Line, Area
  ATTRIBUTE DATA
    Quantitative data: Ratio, Interval, Ordinal
    Qualitative data: Nominal

GIS: Capturing Geometric Data (1)
Surveying, digitizing and scanning maps
[Figure panels: Manual digitizing; Scanning; Automated vectorization; Total station]

GIS: Capturing Geometric Data (2)
Using the US GPS (Global Positioning System) or the Russian GLONASS
Remote sensing and aerial photography
[Figure panels: GPS satellite orbits; Imaging satellite NOAA-11; Handheld GPS]

Geometric Data Entry Using PopMap
Trace features to be digitized with pointing device (cursor)
Conversion of hardcopy to digital maps is the most time-consuming task in GIS (up to 80% of project costs)

Automatic Geometric Data Entry Using MapScan
The computer generates vector data from a raster image.
The vector data must then be organized in a spatial data structure that is suited to spatial data analysis operations.

GIS Overview
1. What is GIS?
2. How to create spatial databases?
  Data types
  Data capture and data collection
  Data organization
3. Some spatial operations in GIS useful for spatial data mining

"Spaghetti" Data Structure
A point is recorded as an x,y coordinate pair
A line is a series of x,y coordinates
An area is a series of x,y coordinates, with the first and last coordinate being identical (i.e., a "closed-loop polygon")
Area  Coordinates
A     (1,4), (1,6), (6,6), (6,4), (4,4), (1,4)
B     (1,4), (4,4), (4,1), (1,1), (1,4)
C     (4,4), (6,4), (6,1), (4,1), (4,4)
[Figure: areas A, B and C drawn on a grid with both axes numbered 1 to 6]
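A minimal sketch of the spaghetti structure, using the coordinate lists from the table above. Because no relationships are stored, adjacency has to be computed by comparing coordinates; the helper names (`polygon_area`, `edges`, `shared_edges`) are chosen for this example only.

```python
from itertools import combinations

# Closed rings copied from the slide: first and last coordinate are identical.
areas = {
    "A": [(1, 4), (1, 6), (6, 6), (6, 4), (4, 4), (1, 4)],
    "B": [(1, 4), (4, 4), (4, 1), (1, 1), (1, 4)],
    "C": [(4, 4), (6, 4), (6, 1), (4, 1), (4, 4)],
}

def polygon_area(ring):
    """Area of a closed ring by the shoelace formula."""
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice) / 2

def edges(ring):
    """Undirected edges of a ring, so that direction does not matter."""
    return {frozenset(pair) for pair in zip(ring, ring[1:])}

def shared_edges(a, b):
    """Edges stored in both polygons; each shared edge is stored twice."""
    return edges(areas[a]) & edges(areas[b])

adjacent = [(a, b) for a, b in combinations(sorted(areas), 2) if shared_edges(a, b)]
print({name: polygon_area(ring) for name, ring in areas.items()})
print(adjacent)
```

Every shared boundary is duplicated in two coordinate lists, which is the main weakness of this structure for spatial analysis.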

Topological Data Structure
Records x/y coordinates of spatial features and encodes spatial relationships
Node  X  Y
I     1  4
II    4  4
III   6  4
IV    4  1
Line  From  To   Left  Right
1     I     III  O     A
2     I     IV   B     O
3     III   IV   O     C
4     I     II   A     B
5     II    III  A     C
6     II    IV   C     B
Poly  Lines
A     1, 4, 5
B     2, 4, 6
C     3, 5, 6
O = "outside" polygon
[Figure: areas A, B and C on a grid with both axes numbered 1 to 6, with nodes I to IV and lines 1 to 6 labelled]
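The following sketch stores the line table from the slide and shows how relationships can be read from it without any coordinate comparison. The function names are illustrative assumptions, not part of any particular GIS.

```python
# Line table from the slide: line id -> (from node, to node, left poly, right poly).
lines = {
    1: ("I", "III", "O", "A"),
    2: ("I", "IV", "B", "O"),
    3: ("III", "IV", "O", "C"),
    4: ("I", "II", "A", "B"),
    5: ("II", "III", "A", "C"),
    6: ("II", "IV", "C", "B"),
}

def adjacent_polygons(lines):
    """'adjacent to': polygons on the left and right of the same line."""
    neighbours = {}
    for _from, _to, left, right in lines.values():
        if "O" in (left, right):  # skip lines on the outer boundary
            continue
        neighbours.setdefault(left, set()).add(right)
        neighbours.setdefault(right, set()).add(left)
    return neighbours

def boundary_lines(lines, polygon):
    """Rebuild a row of the polygon table from the line table."""
    return sorted(i for i, (_f, _t, left, right) in lines.items()
                  if polygon in (left, right))

def lines_at_node(lines, node):
    """'connect to': lines that start or end at the given node."""
    return sorted(i for i, (f, t, _l, _r) in lines.items() if node in (f, t))

print(adjacent_polygons(lines))
print(boundary_lines(lines, "A"))
print(lines_at_node(lines, "II"))
```

Note that the polygon table can be derived from the left and right columns, which gives a simple consistency check on the stored topology.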

Data Transformation
The source map is registered in a real-world coordinate system with a projection
Digitized coordinates are recorded in digitizing units (e.g., cm or inches from the table's origin)
For spatial data integration, coordinates need to be transformed into real-world units (longitude, latitude)
Example: a Japan map and an East Asia map must both be in real-world coordinates before they can be integrated
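One common way to carry out this transformation is to fit an affine mapping to control points whose table and real-world coordinates are both known. The sketch below assumes exactly three non-collinear control points and treats the target coordinates as planar; the control point values and function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve_axis(src, targets):
    """Solve target = a*x + b*y + c for (a, b, c) with Cramer's rule."""
    base = [[x, y, 1.0] for x, y in src]
    d = det3(base)
    if d == 0:
        raise ValueError("control points are collinear")
    coeffs = []
    for col in range(3):
        m = [row[:] for row in base]
        for r in range(3):
            m[r][col] = targets[r]
        coeffs.append(det3(m) / d)
    return coeffs

def fit_affine(src, dst):
    """Return a function mapping digitizer units to real-world units."""
    ax = solve_axis(src, [p[0] for p in dst])
    ay = solve_axis(src, [p[1] for p in dst])

    def transform(x, y):
        return (ax[0] * x + ax[1] * y + ax[2],
                ay[0] * x + ay[1] * y + ay[2])
    return transform

# Hypothetical control points: table position in cm -> (longitude, latitude).
table_cm = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
real_world = [(105.0, 20.0), (106.0, 20.0), (105.0, 21.0)]
to_world = fit_affine(table_cm, real_world)
print(to_world(5.0, 5.0))
```

In practice more than three control points are usually digitized and the mapping is fitted by least squares, so that digitizing errors can be estimated.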

Projection
Projection is a mathematical conversion from spherical to planar coordinates.
Projection concept: projecting the surface of a sphere onto a flat surface.
Many types of projections have been invented. Each of them is useful for some applications, and not useful for others.
A particular projection can preserve the area (or the shape or size) of map features
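To make the trade-off concrete, here is a small sketch of two well-known cylindrical projections on a sphere: Mercator, which preserves shape locally, and the Lambert cylindrical equal-area projection, which preserves area. The spherical Earth radius is a simplifying assumption; production GIS software uses ellipsoidal formulas.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean radius, spherical approximation (assumption)

def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Spherical Mercator: conformal, but inflates areas towards the poles."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return radius * lam, radius * math.log(math.tan(math.pi / 4 + phi / 2))

def cylindrical_equal_area(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Lambert cylindrical equal-area: preserves area, distorts shape."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return radius * lam, radius * math.sin(phi)

# The same (longitude, latitude) gives different planar y values.
point = (105.0, 60.0)
print(mercator(*point))
print(cylindrical_equal_area(*point))
```

Both functions share the same x coordinate; they differ only in how latitude is stretched, which is exactly what determines the property each projection preserves.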

Attribute Data Collection Using PopMap
Declaring meta-data
This module is used to declare variables and enter attribute values for spatial objects
The data are then stored in a relational database
In advanced GIS, the spatial data are stored in an object-oriented database

GIS: Geometric and Attribute Data Relationship
Hospitals
Id   Type     Staff
156  RPH      17
157  General  47
...  ...      ...
Census districts
Id   Pop     HH
305  20,838  5,934
306  74,293  21,893
...  ...     ...
[Figure: map showing census districts 302 to 306 and hospitals 154 to 160, each labelled with the Id used in its attribute table]

GIS: Spatial Database Architecture (1)
Dual architecture of the GIS
[Diagram: user interface and GIS tools on top of two storage components, spatial data management software and a commercial DBMS, which hold the coordinate files, topological files and attribute tables]
Examples: ARC/INFO (ESRI), MGE (Intergraph), PopMap

GIS: Spatial Database Architecture (2)
Integrated architecture of the GIS
a) User interface and GIS tool on top of a customized DBMS with an extension, holding the coordinate files, topological files and attribute tables
b) User interface and GIS tool on top of a commercial DBMS with an extension, holding the coordinate files, topological files and attribute tables
Examples: TIGRIS (Intergraph), GeoTropics (Universite de Paris VI)

Classification of Spatial Information
Each geographic feature (entity) is constructed from basic graphical elements (point, line and polygon)
Geographic data (administrative boundaries, soil, transportation, etc.) are categorized separately and stored in different map themes, layers, or coverages
[Figure: the real world decomposed into layers]

Turning Data into Information Using GIS
GIS can combine various display methods
[Figure panels: Map projection; Worksheet; Chart; Thematic maps (range, graduated symbol, dot-density)]

GIS Overview
3. Some spatial operations in GIS useful for spatial data mining
  Spatial relationships
  Buffer and overlay operations

Spatial Analysis in GIS
The most important issue in GIS: how can we create useful information from spatial data?
The answers can be:
  Querying the database (the most frequent GIS application): What is located at A? Where is X located?
  Performing spatial analysis (the key feature of GIS), for which spatial relationships are important
Spatial analysis is the general ability to manipulate spatial data into different forms and extract additional meaning as a result.
Network analysis, routing, cartographic algebra, site selection, projection, 3D modeling, etc. are spatial analysis functions

Query and Retrieval Operations
to display various kinds of objects
to locate and select features, measure distances and areas, and calculate statistics

Spatial Relationships
Logical connections between spatial objects
  Examples: "adjacent to", "connect to", "near to", "intersects with", "within", "overlaps", etc.
Some relationships are explicitly stored in the database
  Examples: left and right polygon in the line attribute table for "adjacent to", list of lines for "connect to"
Others need to be computed

"is nearest to"
point/point: which health center is closest to the village?
point/line: which road is nearest to the village?
the same applies to other combinations of spatial features
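Both questions can be answered by brute force when the data set is small. The sketch below assumes planar coordinates and hypothetical feature names; real GIS packages typically use a spatial index instead of scanning every candidate.

```python
import math

def nearest_point(origin, candidates):
    """point/point: name of the candidate point closest to origin."""
    return min(candidates, key=lambda name: math.dist(origin, candidates[name]))

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment
        return math.dist(p, a)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def nearest_line(origin, polylines):
    """point/line: name of the polyline closest to origin."""
    def distance(name):
        pts = polylines[name]
        return min(point_segment_distance(origin, a, b)
                   for a, b in zip(pts, pts[1:]))
    return min(polylines, key=distance)

# Hypothetical data in planar units.
village = (0.0, 0.0)
health_centers = {"H1": (3.0, 4.0), "H2": (1.0, 1.0)}
roads = {"R1": [(-1.0, 2.0), (3.0, 2.0)], "R2": [(5.0, 0.0), (5.0, 5.0)]}
print(nearest_point(village, health_centers))
print(nearest_line(village, roads))
```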

"is nearest to": Thiessen Polygons
Thiessen polygons define the individual areas of influence around each of a set of points: every location inside a polygon is closer to its own point than to any other point
The Thiessen polygon boundaries are equidistant from the neighboring points
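A raster approximation of Thiessen polygons can be sketched by labelling every grid cell with its nearest point. This is a simplification for illustration, with hypothetical seed points; it does not construct the exact polygon boundaries that a vector GIS would produce.

```python
import math

def thiessen_label(cell, seeds):
    """Name of the seed point nearest to the given cell."""
    return min(seeds, key=lambda name: math.dist(cell, seeds[name]))

def thiessen_raster(width, height, seeds):
    """Label every cell (x, y) of a width x height grid with its nearest seed."""
    return [[thiessen_label((x, y), seeds) for x in range(width)]
            for y in range(height)]

# Two hypothetical health centers on a 5 x 1 grid.
seeds = {"A": (0, 0), "B": (4, 0)}
print(thiessen_raster(5, 1, seeds))
```

The cell at x = 2 is equidistant from both seeds, so it lies on the polygon boundary; this sketch assigns such ties to whichever seed is listed first.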

"is near to": Buffer Operation
A buffer zone is an area of specified width drawn around one or more map elements
point buffer
  affected area around a polluting facility
  catchment area of a water source
line buffer
  how many people live near the polluted river?
  what is the area impacted by highway noise?
polygon buffer
  area around a reservoir where development should not be permitted
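For the point case, a buffer can be sketched as a circle approximated by a closed ring of vertices, together with a simple distance test for membership. Coordinates are assumed to be planar, and the village data and function names are hypothetical.

```python
import math

def point_buffer(center, radius, segments=32):
    """Approximate a circular buffer as a closed ring of vertices."""
    cx, cy = center
    ring = [
        (cx + radius * math.cos(2 * math.pi * i / segments),
         cy + radius * math.sin(2 * math.pi * i / segments))
        for i in range(segments)
    ]
    ring.append(ring[0])  # close the loop, as in a closed-loop polygon
    return ring

def within_buffer(point, center, radius):
    """True if the point lies inside or on the buffer boundary."""
    return math.dist(point, center) <= radius

# Hypothetical facility and villages, planar units (e.g. km).
facility = (0.0, 0.0)
villages = {"V1": (3.0, 4.0), "V2": (8.0, 9.0), "V3": (-6.0, 1.0)}
affected = sorted(name for name, pos in villages.items()
                  if within_buffer(pos, facility, 10.0))
print(affected)
```

Line and polygon buffers follow the same idea, with the distance to a point replaced by the distance to the nearest segment of the line or boundary.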

"overlaps": Polygon Overlay
In GIS, the normal case of polygon overlay takes two map layers and overlays them
[Figure: overlay of the hospital catchment areas layer on the districts layer]

Areal Interpolation
Given: the population of each district, and the fact that the areas of the district layer and the hospital catchment layer overlap
Estimate: the population of each area in the hospital catchment layer
The problem can be solved with the polygon overlay operation and the areal interpolation technique
Areal interpolation is based on area proportions, assuming that the attribute is evenly distributed within each source area
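Under the even-distribution assumption, each district contributes its population in proportion to the share of its area that falls inside a catchment. The sketch below takes the overlap areas as given (they would come from the polygon overlay); the district populations are those of the census table shown earlier, while the areas, overlaps and catchment names are hypothetical.

```python
def areal_interpolation(source_values, source_areas, overlaps):
    """Area-weighted estimate of a value for each target zone.

    overlaps maps (source_id, target_id) to the area of their intersection.
    """
    estimates = {}
    for (source_id, target_id), overlap_area in overlaps.items():
        share = overlap_area / source_areas[source_id]
        estimates[target_id] = (estimates.get(target_id, 0.0)
                                + source_values[source_id] * share)
    return estimates

population = {305: 20838, 306: 74293}      # from the census district table
district_area = {305: 100.0, 306: 200.0}   # hypothetical areas
overlap = {                                # hypothetical overlay result
    (305, "H1"): 25.0,
    (306, "H1"): 50.0,
    (306, "H2"): 150.0,
}
print(areal_interpolation(population, district_area, overlap))
```

The estimate is only as good as the even-distribution assumption; where population is concentrated in towns, the result can be far from the true value.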

PopMap: Buffer Operation
Point buffer operation
Creating point buffers with a 10 km radius for hospitals (Phu Tho Hospital, Viet Tri Hospital)

PopMap: Overlay Operation
Area overlay operation
Overlaying the Hospital Band layer and the District layer
Producing the population of the overlapped areas

PopMap: Buffer Operation
Line buffer operation
Which clinics are located near the railway?

Web-based GIS
Most GIS run on desktop machines and/or in local networks, so the users of these systems are limited by geographical location
Since 1997, the Internet has increased the efficiency and effectiveness of the ways in which users obtain, use, and share information. Using the Internet adds real value to existing GIS databases.
Many Web-based GIS have already been built, including MapOnline, developed by the GIS Lab of VN-IOIT

MapOnline: System Architecture
URL (Uniform Resource Locator): <protocol>://<machine id>/<local name>
HTTP: Hypertext Transfer Protocol
[Diagram: on the client, a WWW browser sends URL requests over the Internet and receives maps, images and HTML; on the server, the HTTP server software passes GIS commands through an interface to the GIS software and spatial database, which return maps, reports, etc.]
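The three parts of a URL can be seen by splitting one with Python's standard library. The example uses the PopMap address quoted earlier in this talk; it only parses the string and does not contact the server.

```python
from urllib.parse import urlsplit

# <protocol>://<machine id>/<local name>
parts = urlsplit("http://www.un.org/depts/unsd/softproj/index.htm")

print(parts.scheme)    # protocol
print(parts.hostname)  # machine id
print(parts.path)      # local name
```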

MapOnline: User Interfaces
Spatial Data Visualisation using Microsoft Internet Explorer

An Approach to Spatial Data Mining from a GIS Point of View
What is Spatial Data Mining?
Why is spatial data mining important?
Some tasks of spatial data mining

From GIS and Data Mining to Spatial Data Mining
Data mining and GIS have existed as two separate technologies, each with its own methods and approaches to visualisation and data analysis
Nowadays, a huge volume of geo-referenced data is available to users. The simple query and retrieval functions of GIS cannot satisfy their needs
Statistical spatial analysis has been the most common approach for analyzing spatial data, but it has some limitations.

GIS versus Data Mining
GIS:
  user generates hypothesis
  visualization in geographical space
  shows what's inside the data
  works on spatial databases
  hard to visualize multivariate dependencies on a map
Data Mining:
  system generates hypothesis
  search (and visualization) in abstract space
  inductive generalizations exceeding content of database
  search for multivariate dependencies
How can we use the benefits offered by Data Mining and GIS?

GIS + Data Mining = Spatial Data Mining?
GIS combined with Data Mining: what is that?
Extensions:
  new spatial analysis methods: association, sequential patterns, classification, clustering...
  visualization methods for multivariate dependencies on a map
  hypothesis languages (spatial data mining languages)
  new spatial data structures
  convergence of GIS and data mining on the Internet

What is Spatial Data Mining?
Spatial Data Mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases (J. Han, M. Kamber)
Spatial Data Mining is a subfield of data mining. It can be a combination of GIS and data mining algorithms that have been adapted to spatial data.
Functionalities of spatial data mining:
  Discovering spatial relationships and relationships between spatial and nonspatial data
  Constructing spatial knowledge bases
  Reorganizing spatial databases
  Optimizing spatial queries

Spatial Data Mining Tasks
Geo-Spatial Warehousing and OLAP
Spatial data classification/predictive modeling
Spatial clustering/segmentation
Spatial association and correlation analysis
Spatial regression analysis
Time-related spatial pattern analysis: trends, sequential patterns, partial periodicity analysis
Many more to be explored

Example of Spatial Classification
MINE CLASSIFICATION RULES
ANALYZE crimes100000R
WITH RESPECT TO
states_census.geo, statename,
capita_income,
with_bachelor_degp
FROM states_census

Example of Spatial Clustering
How can we cluster points?
What are the distinct features of the clusters?
MINE CLUSTERS AS "DBCities"
ANALYZE sum(pop90)
WITH RESPECT TO DBCities.geo, pop90, med_fam_income, with_bachelor_degp
FROM DBCities
There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!

Spatial Association Mining
"What kinds of spatial objects are close to each other in B.C.?"
Kinds of objects: cities, water, forests, usa_boundary, mines, etc.
Rules mined, each with [support, confidence]:
  is_a(x, large_town) ∧ intersect(x, highway) → adjacent_to(x, water) [7%, 85%]
  is_a(x, large_town) ∧ adjacent_to(x, georgia_strait) → close_to(x, u.s.a.) [1%, 78%]
Mining method: Apriori + multi-level association + geo-spatial algorithms (from rough to high precision)
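The two numbers attached to each rule can be computed as follows, once the spatial predicates of every object have been evaluated. This sketch covers only the counting step, not the Apriori search or the geo-spatial algorithms; the five towns and their predicates are made up for illustration.

```python
def rule_stats(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Each object is the set of spatial predicates that hold for it.
    """
    matching_antecedent = [o for o in objects if antecedent <= o]
    matching_both = [o for o in matching_antecedent if consequent <= o]
    support = len(matching_both) / len(objects)
    confidence = (len(matching_both) / len(matching_antecedent)
                  if matching_antecedent else 0.0)
    return support, confidence

# Hypothetical towns with their precomputed spatial predicates.
towns = [
    {"is_a(large_town)", "intersect(highway)", "adjacent_to(water)"},
    {"is_a(large_town)", "intersect(highway)", "adjacent_to(water)"},
    {"is_a(large_town)", "intersect(highway)"},
    {"is_a(large_town)", "adjacent_to(water)"},
    {"is_a(small_town)", "adjacent_to(water)"},
]
support, confidence = rule_stats(
    towns,
    antecedent={"is_a(large_town)", "intersect(highway)"},
    consequent={"adjacent_to(water)"},
)
print(support, confidence)
```

A rule is reported only if both values reach the user-defined thresholds, which is what the SUPPORT and CONFIDENCE settings of a mining query control.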

Example of Spatial Association Mining
FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"
FROM Washington_Golf_courses, Washington
WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
  AND Washington.CFCC <> "D81"
IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
SET SUPPORT THRESHOLD 0.5
SET CONFIDENCE THRESHOLD 0.5

Spatial Trend Detection & Characterization
Spatial trends describe a regular change of non-spatial attributes when moving away from certain start objects. Global and local trends can be distinguished
Spatial (region) characterization considers not only the attributes of the target regions but also the neighboring regions and their properties

Spatial Trend Detection & Characterization (2)
Spatial trend predictive modeling
  Discover centers: local maxima of some non-spatial attribute
  Determine the (theoretical) trend of some non-spatial attribute when moving away from the centers
  Discover deviations (from the theoretical trend)
  Explain the deviations
Examples
  Trend of the unemployment rate according to the distance to Osaka
  Trend of temperature with altitude, degree of pollution in relation to population density, etc.
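The second and third steps can be sketched with a simple linear trend, assuming the center is already known and that the attribute changes linearly with distance. Both are simplifying assumptions for this example, and the data values are invented.

```python
def fit_trend(distances, values):
    """Least-squares line: value = intercept + slope * distance."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_v = sum(values) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in zip(distances, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_d

def deviations(distances, values, threshold):
    """Indices of observations that differ from the trend by more than threshold."""
    slope, intercept = fit_trend(distances, values)
    return [i for i, (d, v) in enumerate(zip(distances, values))
            if abs(v - (intercept + slope * d)) > threshold]

# Hypothetical observations: distance from the center (km) and an attribute value.
distance_km = [0.0, 10.0, 20.0, 30.0, 40.0]
rate = [2.0, 3.0, 4.0, 9.0, 6.0]
print(fit_trend(distance_km, rate))
print(deviations(distance_km, rate, threshold=2.0))
```

The flagged observations are the candidates for the last step, explaining the deviations, which usually requires looking at other spatial layers.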

Conclusions
GIS provides for spatial data mining:
  the concepts and methods for creating spatial databases,
  methods to create internal spatial data structures,
  cartographic algebra to create spatial operations, and
  concepts and methods for visual, attractive representation of patterns on maps