
Creation of Spatial Databases for Data Mining

Dang Van Duc
GIS Lab, Vietnam Institute of Information Technology

Ho Tu Bao
KCM lab, Japan Advanced Institute of Science and Technology

15 November 2001

November 2001 2

Research Context

Joint Research on Spatial Knowledge Discovery and Data Mining

GIS lab (IOIT): Methods and tools to create primary spatial databases

PR&IP lab (IOIT+HUS): Methods and tools to create secondary spatial databases

KDC lab (JAIST): Methods and tools for spatial data mining

CSA lab (JAIST): Spatial data mining methods and applications

To create high-quality methods and tools for spatial data mining

Our Objectives

Finding high-quality methods to create spatial databases

Providing efficient spatial operations that are suited to spatial data mining

Outline

GIS Overview (a Case Study: PopMap)

An Approach to Spatial Data Mining from a GIS Point of View

Conclusions

GIS Overview

1. What is GIS?
    GIS definition
    Components and tasks of GIS

2. How to create spatial databases?
    Data types
    Data capture and data collection
    Data organization

3. Some spatial operations in GIS useful for spatial data mining
    Spatial relationships
    Buffer and overlay operations

GIS Definition

GIS stands for "geographic information system". It is a special kind of information system applied to geographically referenced data.

Information system: a set of processes executed on raw data to produce information that is useful in decision-making.

A GIS is a system for capturing, storing, checking, integrating, manipulating, analyzing and displaying data which are spatially referenced to the Earth (Chorley, 1987).

View of Information Systems (by Robert G. Cromley)

Information System
    Non-spatial Information System (Counting...)
    Spatial Information System
        GIS
            Land Information System (LIS)
                Land Use Information System
                Cadastral Information System
            Other GIS (Economic-Social, Population ...)
                PopMap
        Other Spatial Information System (CAD, CAM...)

GIS Components

GIS is:
    a type of software
    a real application, including the hardware, data, software and people needed to solve a problem

[Figure: GIS components: Earth, Database, Software tools (GIS Software), Users, and Results (Words, Charts, Graphs, Tables, or Maps), linked by Abstraction, Generalization, and Interaction]

GIS: Main Tasks

Data input from maps, aerial photos, satellites, surveys, and other sources

Data storage, retrieval, and query

Data transformation, analysis, and modeling, including spatial statistics

Data reporting, such as reports, maps, and plans

Some GIS Software

ArcInfo (ESRI): a complete GIS for data creation, update, query, mapping, and analysis

GeoMedia (Intergraph): a product specifically designed to collect and manage spatial data using standard databases

ArcView (ESRI): a desktop GIS that provides data visualization, query, analysis, and integration capabilities, along with the ability to create and edit geographic data

MapInfo (MapInfo): a desktop GIS organized around four key technology pillars: mapping, routing, geocoding and data

PopMap (VN-IOIT/UN): an integrated software package for geographical information, maps and graphics databases

and more

State of the Art of GIS Software

There are many kinds of GIS software, ranging from low-cost, easy-to-use desktop GIS to high-cost, powerful, difficult-to-use GIS. Fortunately, spatial data can be exchanged between almost all GIS.

GIS producers have added extension modules to their classic GIS (ArcView Spatial Analyst, Intergraph Grid Analyst, etc.), which offer many operations for spatial analysis.

Almost all GIS software integrates its own scripting language, which allows users to add customized operations (MapBasic in MapInfo, Avenue in ArcView, etc.)

What is PopMap?

Developed by the GIS Lab of Vietnam IOIT for UNSTAT

Integrated, easy-to-use software for developing geographical databases of population and related data

Components:
    tools for maintaining a geographical database
    capabilities for retrieving and processing data in a worksheet environment, and creating statistical graphs
    options for analyzing, interpreting, and developing effective data presentations using maps

Further information about PopMap is available on the Web site: http://www.un.org/depts/unsd/softproj/index.htm

GIS Overview

1. What is GIS?
    GIS definition
    Components and tasks of GIS

2. How to create spatial databases?
    Data types
    Data capture and data collection
    Data organization

3. Some spatial operations in GIS useful for spatial data mining
    Spatial relationships
    Buffer and overlay operations

Creating Spatial Database: Data Sources

Obtaining data is an important part of any GIS project.

We need to know:
    What types of data we can use with GIS
    How to evaluate the data
    Where to find the data
    How to create the data ourselves

Two Types of Data Sources

Data measured directly by surveys, field data collection, or remote sensing

Data obtained from existing maps, tables or other data sources

More and more ready-made digital GIS data sets are becoming available:
    Government agencies: census geography
    Topographic survey

Data Model in GIS

Geographical variation in the real world is infinitely complex. The closer we look, the more detail we see, almost without limit.

Data must somehow be reduced to a finite and manageable quantity by a process of generalization or abstraction.

The rules used to convert real geographical variation into discrete objects are the data model.

There are two major choices of data model: raster and vector.

Raster Model and Vector Model

The raster model divides the entire study area into a regular grid of cells arranged in a specific sequence.

The vector model uses discrete points or line segments to identify locations.

[Figure: real-world features (district, clinic, school, road, river) represented in the raster model as a grid/array of rows and columns, and in the vector model by X, Y coordinates, as in PopMap]
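A minimal Python sketch of the two models, under assumed coordinates: a few features are held as vector points and then placed into a raster grid of rows and columns. The feature labels, coordinates, cell size, and the helper names `to_cell` and `rasterize_points` are illustrative assumptions and are not taken from PopMap.

```python
def to_cell(x, y, x_min, y_max, cell_size):
    """Map a vector coordinate to a (row, column) raster cell.

    Row 0 is the top of the grid, so rows count downward from y_max.
    """
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return row, col


def rasterize_points(points, x_min, y_max, cell_size, n_rows, n_cols):
    """Build a grid of cells; each cell holds a feature label or '.'."""
    grid = [["." for _ in range(n_cols)] for _ in range(n_rows)]
    for label, (x, y) in points.items():
        row, col = to_cell(x, y, x_min, y_max, cell_size)
        # Points that fall outside the study area are skipped.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = label
    return grid


# Hypothetical vector features: label -> (x, y). C = clinic, S = school.
features = {"C": (2.5, 3.5), "S": (0.2, 0.3)}
grid = rasterize_points(features, x_min=0, y_max=4, cell_size=1,
                        n_rows=4, n_cols=4)
for row in grid:
    print(" ".join(row))
```

The vector form keeps exact coordinates, while the raster form only remembers which cell a feature falls in, so the cell size bounds the positional accuracy.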

Components of Spatial Objects

Spatial objects (entities) consist of both spatial and non-spatial data.

Spatial data (also called geometric data) includes location, shape, size, and orientation.

Non-spatial data (also called attribute or characteristic data) consists of values, or attributes, associated with a set of locations.

An estimated 80% of all data has a spatial component (GIS.com).

Components of Spatial Objects (2)

GEOGRAPHICAL DATA
    GEOMETRIC DATA
        Geometry: Area, Line, Point
    ATTRIBUTE DATA
        Quantitative data: Ratio, Interval, Ordinal
        Qualitative data: Nominal

GIS: Capturing Geometric Data (1)

Surveying, digitizing and scanning maps

[Figure: manual digitizing, scanning, automated vectorization, total station]

GIS: Capturing Geometric Data (2)

Using the US GPS (Global Positioning System) or the Russian GLONASS

Remote sensing and aerial photography

[Figure: GPS satellite orbits, handheld GPS, imaging satellite NOAA-11]


Geometric Data Entry Using PopMap

Trace features to be digitized with pointing device (cursor)

Conversion of hardcopy to digital maps is the most time-consuming task in GIS (up to 80% of project costs)

Automatic Geometric Data Entry Using MapScan

The computer generates vector data from a raster image.

The vector data must then be organized in a spatial data structure that is suited to spatial data analysis operations.

GIS Overview

1. What is GIS?

2. How to create spatial databases?
    Data types
    Data capture and data collection
    Data organization

3. Some spatial operations in GIS useful for spatial data mining

“Spaghetti” Data Structure

A point is recorded as a single x,y coordinate pair.

A line is a series of x,y coordinates.

An area is a series of x,y coordinates in which the first and last coordinates are identical (e.g., “closed-loop polygons”).

[Figure: areas A, B and C drawn on a grid with x and y axes from 1 to 6]

Area  Coordinates
A     (1,4), (1,6), (6,6), (6,4), (4,4), (1,4)
B     (1,4), (4,4), (4,1), (1,1), (1,4)
C     (4,4), (6,4), (6,1), (4,1), (4,4)
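As a rough illustration of the spaghetti structure, the three areas A, B and C from the table can be held as independent coordinate lists in Python. The helper names (`is_closed`, `shoelace_area`, `shared_segments`) are assumptions made for this sketch; the area calculation uses the standard shoelace formula, which the slides do not mention.

```python
# Spaghetti structure: every area is its own closed list of (x, y) pairs.
areas = {
    "A": [(1, 4), (1, 6), (6, 6), (6, 4), (4, 4), (1, 4)],
    "B": [(1, 4), (4, 4), (4, 1), (1, 1), (1, 4)],
    "C": [(4, 4), (6, 4), (6, 1), (4, 1), (4, 4)],
}


def is_closed(ring):
    """An area is valid when its first and last coordinates are identical."""
    return ring[0] == ring[-1]


def shoelace_area(ring):
    """Area of a closed ring via the shoelace formula."""
    total = 0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def shared_segments(ring_a, ring_b):
    """Undirected segments present in both rings, i.e. stored twice."""
    def segments(ring):
        return {frozenset(pair) for pair in zip(ring, ring[1:])}
    return segments(ring_a) & segments(ring_b)


for name, ring in areas.items():
    print(name, is_closed(ring), shoelace_area(ring))

# The A/B boundary is duplicated, and nothing records that A and B are neighbours.
print(shared_segments(areas["A"], areas["B"]))
```

Finding that A and B are neighbours requires comparing coordinates, which is the weakness that a topological structure addresses.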

Topological Data Structure

Records the x/y coordinates of spatial features and encodes their spatial relationships.

Node  X  Y
I     1  4
II    4  4
III   6  4
IV    4  1

Line  From  To   Left  Right
1     I     III  O     A
2     I     IV   B     O
3     III   IV   O     C
4     I     II   A     B
5     II    III  A     C
6     II    IV   C     B

Poly  Lines
A     1, 4, 5
B     2, 4, 6
C     3, 5, 6

O = “outside” polygon

[Figure: the same areas A, B and C, with nodes I to IV and lines 1 to 6 labelled on the grid]
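The node and line tables above can be sketched as Python dictionaries. The point of the sketch is that the polygon table and the "adjacent to" relationship both fall out of the left/right columns without any coordinate comparison. The function names are assumptions for illustration only.

```python
# Node table: node id -> (x, y)
nodes = {"I": (1, 4), "II": (4, 4), "III": (6, 4), "IV": (4, 1)}

# Line table: line id -> (from node, to node, left polygon, right polygon).
# "O" stands for the outside polygon.
lines = {
    1: ("I", "III", "O", "A"),
    2: ("I", "IV", "B", "O"),
    3: ("III", "IV", "O", "C"),
    4: ("I", "II", "A", "B"),
    5: ("II", "III", "A", "C"),
    6: ("II", "IV", "C", "B"),
}


def polygon_lines(line_table):
    """Rebuild the polygon table: polygon -> sorted list of bounding lines."""
    result = {}
    for line_id, (_, _, left, right) in sorted(line_table.items()):
        for poly in (left, right):
            if poly != "O":
                result.setdefault(poly, []).append(line_id)
    return result


def adjacent_polygons(line_table):
    """Pairs of polygons that share a line (the 'adjacent to' relationship)."""
    pairs = set()
    for _, _, left, right in line_table.values():
        if "O" not in (left, right):
            pairs.add(frozenset((left, right)))
    return pairs


print(polygon_lines(lines))
print(adjacent_polygons(lines))
```

Each shared boundary is stored once, as a single line, instead of once per polygon as in the spaghetti structure.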

Data Transformation

The source map is registered in a real-world coordinate system with a projection.

Digitized coordinates are recorded in digitizing units (e.g., cm or inches from the table’s origin).

For spatial data integration, coordinates need to be transformed into real-world units (Longitude, Latitude).

The Japan map and the East Asia map must be in real-world coordinates before they are integrated.
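A minimal sketch of such a transformation, assuming the map sheet lies square on the digitizing table (no rotation or skew), so that two control points with known real-world coordinates are enough to fix a scale and an offset per axis. The control-point values and the function name `fit_axis_aligned_transform` are hypothetical; a production workflow would typically use more control points and also account for rotation and the map projection.

```python
def fit_axis_aligned_transform(ctrl_a, ctrl_b):
    """Return a function mapping table units to real-world units.

    Each control point is ((table_x, table_y), (world_x, world_y)).
    Assumes no rotation: each axis gets its own scale and offset.
    """
    (ax, ay), (a_wx, a_wy) = ctrl_a
    (bx, by), (b_wx, b_wy) = ctrl_b
    scale_x = (b_wx - a_wx) / (bx - ax)
    scale_y = (b_wy - a_wy) / (by - ay)

    def to_world(x, y):
        return (a_wx + scale_x * (x - ax), a_wy + scale_y * (y - ay))

    return to_world


# Hypothetical control points: table cm -> (longitude, latitude) in degrees.
to_world = fit_axis_aligned_transform(
    ((0.0, 0.0), (100.0, 10.0)),
    ((50.0, 40.0), (110.0, 18.0)),
)

# A digitized point at (25 cm, 20 cm) lands halfway between the control points.
print(to_world(25.0, 20.0))
```

Once two maps have been transformed this way into the same real-world coordinate system, their features can be overlaid directly.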

Projection

A projection is a mathematical conversion from spherical to planar coordinates.

Projection concept: projecting the surface of a sphere onto a flat surface.

Many types of projection have been invented. Each of them is useful for some applications and not for others.

A particular projection can preserve certain properties of the map, such as shape or area (size).
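To make the trade-off concrete, here is a small sketch of two well-known cylindrical projections on a sphere: the Mercator projection, which preserves local shape, and the Lambert cylindrical equal-area projection, which preserves area. The choice of these two projections is an illustration and is not taken from the slides; the sphere radius defaults to 1 for simplicity.

```python
import math


def mercator(lon_deg, lat_deg, radius=1.0):
    """Spherical Mercator: x = R*lambda, y = R*ln(tan(pi/4 + phi/2))."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return radius * lam, radius * math.log(math.tan(math.pi / 4 + phi / 2))


def cylindrical_equal_area(lon_deg, lat_deg, radius=1.0):
    """Lambert cylindrical equal-area: x = R*lambda, y = R*sin(phi)."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return radius * lam, radius * math.sin(phi)


# The same location gets different planar coordinates under each projection.
for lat in (0, 30, 60):
    print(lat, mercator(105.0, lat), cylindrical_equal_area(105.0, lat))
```

Both functions agree on x, but Mercator stretches y increasingly toward the poles while the equal-area projection compresses it, which is why layers must share one projection before they are combined.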

Attribute Data Collection Using PopMap

Declaring meta-data

This module is used to declare variables and enter attribute values for spatial objects.

The data is then stored in a relational database.

In advanced GIS, the spatial data is stored in an object-oriented database.

GIS: Geometric and Attribute Data Relationship

Census districts (map labels: 302, 303, 304, 305, 306)

Id    Pop      HH
305   20,838   5,934
306   74,293   21,893
...   ...      ...

Hospitals (map labels: 154, 155, 156, 157, 158, 159, 160)

Id    Type      Staff
156   RPH       17
157   General   47
...   ...       ...
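The link between geometry and attributes is the shared Id. A minimal Python sketch, using the two census-district rows from the table above; the polygon coordinates and the function name `join_by_id` are hypothetical and serve only to show the join.

```python
# Attribute table, keyed by district Id (values from the table above).
district_attributes = {
    305: {"Pop": 20838, "HH": 5934},
    306: {"Pop": 74293, "HH": 21893},
}

# Geometry, keyed by the same Id. These coordinates are made up.
district_geometry = {
    305: [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)],
    306: [(4, 0), (9, 0), (9, 3), (4, 3), (4, 0)],
}


def join_by_id(geometry, attributes):
    """Combine geometry and attributes for every Id present in both tables."""
    return {
        fid: {"geometry": ring, **attributes[fid]}
        for fid, ring in geometry.items()
        if fid in attributes
    }


districts = join_by_id(district_geometry, district_attributes)
for fid, record in districts.items():
    # Average household size, derived from the joined attributes.
    print(fid, record["Pop"], round(record["Pop"] / record["HH"], 2))
```

Selecting a district on the map then amounts to a lookup by Id in the attribute table, and a query on the attributes returns Ids that can be highlighted on the map.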

GIS: Spatial Database Architecture (1)

Dual architecture of the GIS

[Figure: User Interface and GIS Tools on top of two stores: Spatial Data Management Software (Coordinate Files, Topological Files) and a Commercial DBMS (Attribute Tables)]

Examples: ARC/INFO (ESRI), MGE (Intergraph), PopMap

GIS: Spatial Database Architecture (2)

Integrated architecture of the GIS

[Figure: variants (a) and (b), each with User Interface, GIS Tool, Extension, and a DBMS (a Customized DBMS in one, a Commercial DBMS in the other) over the Coordinate Files, Topological Files, and Attribute Tables]

Examples: TIGRIS (Intergraph), GeoTropics (Universite de Paris VI)

Classification of spatial information

Each geographic feature (entity) is constructed from basic graphical elements (point, line and polygon).

Geographic data (administrative boundaries, soil, transportation, etc.) are categorized separately and stored in different map themes, layers, or coverages.

[Figure: the real world decomposed into layers]

Layers in PopMap

[Figure: Administrative Layer, Road Layer, Clinic Layer]

Turning Data into Information Using GIS

GIS can combine various display methods.

[Figure: map projection, worksheet, chart, and thematic maps (range, graduated symbol, dot-density)]

GIS Overview

1. What is GIS?
    GIS definition
    Components and tasks of GIS

2. How to create spatial databases?
    Data types
    Data capture and data collection
    Data organization

3. Some spatial operations in GIS useful for spatial data mining
    Spatial relationships
    Buffer and overlay operations

Spatial Analysis in GIS

The most important issue in GIS: how can we create useful information from spatial data?

The answers can be:

Querying the database (the most frequent GIS application): What is located at A? Where is X located?

Performing spatial analysis (the key feature of GIS), for which spatial relationships are important

Spatial analysis is the general ability to manipulate spatial data into different forms and extract additional meaning as a result.

Network analysis, routing, cartographic algebra, site selection, projection, and 3D modeling are examples of spatial analysis functions.

Query and Retrieval Operations

to display various kinds of objects

to locate and select features, measure distances and areas, and calculate statistics

Spatial Relationships

Logical connections between spatial objects

Examples: “adjacent to”, “connect to”, “near to”, “intersects with”, “within”, “overlaps”, etc.

Some relationships are explicitly stored in the database.

Examples: the left and right polygons in the line attribute table for “adjacent to”, the list of lines for “connect to”

Others need to be computed.

“is nearest to”

point/point: which health center is closest to the village?

point/line: which road is nearest to the village?

The same applies to other combinations of spatial features.
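Both questions reduce to distance computations. The sketch below answers them with Euclidean distances on planar coordinates; the village, health center and road coordinates are hypothetical, and the function names are assumptions made for illustration. It requires Python 3.8 or later for `math.dist`.

```python
import math


def nearest_point(target, candidates):
    """Name of the candidate point closest to the target (point/point)."""
    return min(candidates, key=lambda name: math.dist(target, candidates[name]))


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (point/line)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.dist(p, a)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.dist(p, (ax + t * dx, ay + t * dy))


def nearest_line(target, roads):
    """Name of the road (a list of vertices) closest to the target."""
    def distance(name):
        vertices = roads[name]
        return min(point_segment_distance(target, a, b)
                   for a, b in zip(vertices, vertices[1:]))
    return min(roads, key=distance)


# Hypothetical planar coordinates in km.
village = (2, 2)
health_centers = {"North": (2, 7), "East": (5, 2)}
roads = {"R1": [(0, 0), (10, 0)], "R2": [(0, 5), (5, 5), (10, 5)]}

print(nearest_point(village, health_centers))
print(nearest_line(village, roads))
```

This brute-force search checks every candidate; a GIS would normally use a spatial index to avoid that on large layers.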

“is nearest to”: Thiessen polygons

A Thiessen polygon defines the individual area of influence around each of a set of points.

The Thiessen polygon boundaries are equidistant from the neighboring points.

“is near to”: Buffer Operation

A buffer zone is an area of specified width drawn around one or more map elements.

point buffer:
    affected area around a polluting facility
    catchment area of a water source

line buffer:
    how many people live near the polluted river?
    what is the area impacted by highway noise?

polygon buffer:
    area around a reservoir where development should not be permitted
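For point features, a point buffer query reduces to a distance test against the buffer radius. The sketch below selects the villages inside a circular buffer and sums their population; all names, coordinates and population figures are hypothetical. A line buffer works the same way with the point-to-segment distance in place of the point-to-point distance.

```python
import math


def point_buffer_select(center, radius, features):
    """Names of the features whose location lies within the buffer."""
    return [name for name, (xy, _) in features.items()
            if math.dist(center, xy) <= radius]


def population_in_buffer(center, radius, features):
    """Total population of the features inside the buffer."""
    selected = point_buffer_select(center, radius, features)
    return sum(features[name][1] for name in selected)


# Hypothetical villages: name -> ((x_km, y_km), population)
villages = {
    "V1": ((3, 4), 1200),
    "V2": ((6, 7), 800),
    "V3": ((9, 9), 500),
}
facility = (0, 0)

print(point_buffer_select(facility, 10, villages))
print(population_in_buffer(facility, 10, villages))
```

Treating each village as a point is a simplification; when the population is attached to polygons, the buffer has to be overlaid on them instead.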

“overlaps”: Polygon overlay

In GIS, the normal case of polygon overlay takes two map layers and overlays them.

[Figure: the Hospital Catchment Areas layer and the Districts layer combined by Overlay]

Areal Interpolation

Given: the population of each district, and the fact that the areas of the district layer and the hospital catchment layer overlap

Estimate: the population of each area in the hospital catchment layer

The problem can be solved with the polygon overlay operation and the areal interpolation technique.

Areal interpolation is based on the proportion of overlapping area, assuming that the attribute is evenly distributed within each source area.
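A small worked sketch of area-weighted interpolation: each district contributes its population multiplied by the fraction of its area that falls inside the catchment. To keep the overlay step short, districts and the catchment are modelled as axis-aligned rectangles (x_min, y_min, x_max, y_max), which is a simplifying assumption; real layers need general polygon intersection. All figures are hypothetical.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle; zero if it is empty."""
    x_min, y_min, x_max, y_max = rect
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    return rect_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))


def interpolate_population(districts, target):
    """Estimate the population of the target zone by area weighting."""
    total = 0.0
    for rect, population in districts.values():
        share = overlap_area(rect, target) / rect_area(rect)
        total += population * share
    return total


# Hypothetical districts: name -> (rectangle, population)
districts = {
    "D1": ((0, 0, 4, 4), 8000),
    "D2": ((4, 0, 8, 4), 4000),
}
catchment = (2, 0, 6, 4)  # covers half of D1 and half of D2

print(interpolate_population(districts, catchment))
```

The estimate is only as good as the even-distribution assumption: if most of D1's people live in the half outside the catchment, the figure will be too high.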

PopMap: Buffer Operation

Point buffer operation

Creating point buffers with a 10 km radius around hospitals

[Figure: buffers around Phu Tho Hospital and Viet Tri Hospital]

PopMap: Overlay Operation

Area overlay operation

Overlaying the Hospital Band layer and the District layer

Producing the population of the overlapping areas

PopMap: Buffer Operation

Line buffer operation

Which clinics are located near the railway?

Web-based GIS

Most GIS run on desktop machines and/or local networks, so the users of these systems are limited by geographical location.

Since 1997, the Internet has increased the efficiency and effectiveness of the ways in which users obtain, use, and share information. Using the Internet adds real value to existing GIS databases.

Many Web-based GIS have already been built, including MapOnline, developed by the GIS Lab of VN-IOIT.

MapOnline: System Architecture

URL (Uniform Resource Locator): <protocol>://<machine id>/<local name>

HTTP: Hypertext Transfer Protocol

[Figure: client-server architecture. Client: WWW browser. Server: HTTP server software, interface, GIS software, spatial database. URL requests travel from the browser over the Internet; maps, images, and HTML are returned. GIS commands pass through the interface to the GIS software, which returns maps, reports, etc.]


MapOnline: User Interfaces

Spatial Data Visualisation using Microsoft Internet Explorer

An Approach to Spatial Data Mining from a GIS Point of View

What is Spatial Data Mining?

Why is spatial data mining important?

Some tasks of spatial data mining

From GIS and Data Mining to Spatial Data Mining

Data mining and GIS have existed as two separate technologies, each with its own methods and its own approaches to visualisation and data analysis.

Nowadays, a huge volume of geo-referenced data is available to users. The simple query and retrieval functions of a GIS cannot satisfy their needs.

Statistical spatial analysis has been the most common approach to analyzing spatial data, but it has some limitations.

GIS versus Data Mining

GIS:
    the user generates the hypothesis
    visualization in geographical space
    shows what is inside the data
    works on spatial databases
    hard to visualize multivariate dependencies on a map

Data Mining:
    the system generates the hypothesis
    search (and visualization) in abstract space
    inductive generalizations exceeding the content of the database
    search for multivariate dependencies

How can we use the benefits offered by both Data Mining and GIS?

GIS + Data Mining = Spatial Data Mining?

[Figure: GIS and Data Mining, with the question "What is that?"]

Extensions:
    new spatial analysis methods: association, sequential patterns, classification, clustering...
    visualization methods for multivariate dependencies on a map
    hypothesis languages (spatial data mining languages)
    new spatial data structures
    convergence of GIS and data mining on the Internet

What is Spatial Data Mining?

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases (J. Han, M. Kamber).

Spatial data mining is a subfield of data mining. It can be seen as a combination of GIS and data mining algorithms that have been adapted to spatial data.

Functionalities of spatial data mining:
    Discovering spatial relationships and relationships between spatial and non-spatial data
    Constructing spatial knowledge bases
    Reorganizing spatial databases
    Optimizing spatial queries


Spatial Data Mining Tasks

Geo-Spatial Warehousing and OLAP

Spatial data classification/predictive modeling

Spatial clustering/segmentation

Spatial association and correlation analysis

Spatial regression analysis

Time-related spatial pattern analysis: trends, sequential patterns, partial periodicity analysis

Many more to be explored


Example of Spatial Classification

MINE CLASSIFICATION RULES

ANALYZE crimes100000R

WITH RESPECT TO

states_census.geo, statename,

capita_income,

with_bachelor_degp

FROM states_census

Example of Spatial Clustering

How can we cluster points?

What are the distinct features of the clusters?

There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!

MINE CLUSTERS AS "DBCities"

ANALYZE sum(pop90)

WITH RESPECT TO DBCities.geo, pop90, med_fam_income, with_bachelor_degp

FROM DBCities

Spatial Association Mining

“What kinds of spatial objects are close to each other in B.C.?”

Kinds of objects: cities, water, forests, usa_boundary, mines, etc.

Rules mined:

is_a(x, large_town) ∧ intersect(x, highway) → adjacent_to(x, water) [7%, 85%]

is_a(x, large_town) ∧ adjacent_to(x, georgia_strait) → close_to(x, u.s.a.) [1%, 78%]

Mining method: Apriori + multi-level association + geo-spatial algorithms (from rough to high precision)
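Assuming the bracketed pair after each rule is read as [support, confidence], the usual convention for association rules, the two numbers can be computed as follows once the spatial predicates have been evaluated for every object. The objects and predicate sets below are invented for illustration, and this sketch covers only the final counting step, not the Apriori search or the geo-spatial computation of the predicates.

```python
def support_and_confidence(objects, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    objects is a list of sets, each holding the predicates true for one object.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for preds in objects if antecedent <= preds)
    n_both = sum(1 for preds in objects if (antecedent | consequent) <= preds)
    support = n_both / len(objects)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical objects with the predicates that hold for each of them.
objects = [
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway", "adjacent_to_water"},
    {"large_town", "intersects_highway"},
    {"large_town"},
    {"adjacent_to_water"},
]

support, confidence = support_and_confidence(
    objects,
    antecedent={"large_town", "intersects_highway"},
    consequent={"adjacent_to_water"},
)
print(f"[{support:.0%}, {confidence:.0%}]")
```

Support says how often the whole rule holds among all objects, while confidence says how often the consequent holds among the objects that satisfy the antecedent.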

Example of Spatial Association Mining

FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"

FROM Washington_Golf_courses, Washington

WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km") AND Washington.CFCC <> "D81"

IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC

SET SUPPORT THRESHOLD 0.5

SET CONFIDENCE THRESHOLD 0.5

Spatial Trend Detection & Characterization

Spatial trends describe a regular change of non-spatial attributes when moving away from certain start objects. Global and local trends can be distinguished.

Spatial (region) characterization considers not only the attributes of the target regions but also the neighboring regions and their properties.

Spatial Trend Detection & Characterization

Spatial trend predictive modeling:
    Discover centers: local maxima of some non-spatial attribute
    Determine the (theoretical) trend of some non-spatial attribute when moving away from the centers
    Discover deviations (from the theoretical trend)
    Explain the deviations

Examples:
    Trend of the unemployment rate according to the distance to Osaka
    Trend of temperature with altitude, degree of pollution in relation to regions of population density, etc.
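The "determine the trend" and "discover deviations" steps can be sketched with an ordinary least-squares line fitted to an attribute against the distance from a center. Using a straight line as the theoretical trend is an assumption of this sketch, as are the distances, the rates and the function names; the slides do not specify a model.

```python
def fit_linear_trend(distances, values):
    """Least-squares line: value = intercept + slope * distance."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_v = sum(values) / n
    s_dd = sum((d - mean_d) ** 2 for d in distances)
    s_dv = sum((d - mean_d) * (v - mean_v)
               for d, v in zip(distances, values))
    slope = s_dv / s_dd
    intercept = mean_v - slope * mean_d
    return slope, intercept


def deviations(distances, values, slope, intercept):
    """Observed value minus the trend value at each distance."""
    return [v - (intercept + slope * d) for d, v in zip(distances, values)]


# Hypothetical data: distance from the center (km) and an unemployment rate (%).
distances_km = [0, 10, 20, 30, 40]
rates = [2.0, 3.0, 4.5, 5.0, 6.0]

slope, intercept = fit_linear_trend(distances_km, rates)
residuals = deviations(distances_km, rates, slope, intercept)

# The object that departs most from the trend is the one worth explaining.
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(slope, intercept, distances_km[largest])
```

The last step on the slide, explaining the deviations, is left to the analyst: the code only points at the object that departs most from the fitted trend.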

Conclusions

GIS provides for spatial data mining:
    the concepts and methods for creating spatial databases,
    methods to create internal spatial data structures,
    cartographic algebra to create spatial operations, and
    concepts and methods for visual, attractive representation of patterns on maps