spatial databases: lecture 8 spatial data mining

Spatial Databases: Lecture 8
Spatial Data Mining

DT249-4, DT228-4, Semester 2, 2008
Pat Browne

Based on Chapter 7 of Spatial Databases: A Tour, by S. Shekhar & S. Chawla

http://www.comp.dit.ie/pbrowne/Spatial%20Databases%20SDEV4005/Spatial%20Databases%20SDEV4005.htm

Data Mining: Outline

Background to data mining & spatial data mining.
The data mining process.
Spatial autocorrelation, i.e. the non-independence of phenomena in a contiguous geographic area.
Spatial independence.
Classical data mining concepts:
    Classification
    Clustering
    Association rules
Spatial data mining, e.g. co-location rules.
Summary.

Data Mining

Data mining is the process of discovering interesting and potentially useful patterns of information embedded in large databases.
Spatial data mining has the same goals as conventional data mining but requires additional techniques that are tailored to the spatial domain.
A key goal of spatial data mining is to partially automate knowledge discovery, i.e., search for "nuggets" of information embedded in very large quantities of spatial data.

Data Mining

Data mining lies at the intersection of database management, statistics, and artificial intelligence. DM provides semi-automatic techniques for discovering unexpected patterns in very large data sets.

Data Mining

Spatial DM can be characterised by Tobler's first law of geography (near things are more related than far things). This means that the standard DM assumption that values are independently and identically distributed does not hold in spatial DM. The term spatial autocorrelation captures this property, which needs to be included in DM techniques.


The important techniques in conventional DM are association rules, clustering, classification, and regression. These techniques need to be modified for spatial DM. Two approaches are used when adapting DM techniques to the spatial domain:

1) Correct the underlying (iid) statistical model.
2) Modify the objective function which drives the search so that it includes a spatial term.


Size of spatial data sets:
NASA's Earth Orbiting Satellites capture about a terabyte ($10^{12}$ bytes) a day; YouTube = 6 terabytes.
Environmental agencies, utilities (e.g. ESB), the Central Statistics Office, government departments such as health/agriculture, and local authorities all have large spatial data sets.

It is very difficult to analyse such large data sets manually.
For examples see Chapter 7 of SDT.

Data Mining: Sub-processes

Data mining involves many sub-processes:
Data collection: the data was usually collected as part of the operational activities of an organization, not for the data mining task. It is unlikely that the data mining requirements were considered during data collection.
Data extraction/cleaning: hence data must be extracted & cleaned for the specific data mining task.

Data Mining: Sub-processes

Feature selection.
Algorithm design.
Analysis of output.
Level of aggregation: the level at which the data is being analysed must be decided. Identical experiments at different levels of scale can sometimes lead to contradictory results (e.g. the choice of basic spatial unit can influence the results of a social survey).

Geographic Data Mining Process

Close interaction between Domain Expert & Data-Mining Analyst

The output consists of hypotheses (data patterns) which can be verified with statistical tools and visualised using a GIS.

The analyst can interpret the patterns and recommend appropriate actions.

Statistics versus Data Mining

Do we know the statistical properties of the data? Is the data spatially clustered, dispersed, or random?
Data mining is strongly related to statistical analysis.
Data mining can be seen as a filter (exploratory data analysis) before applying a rigorous statistical tool.
Data mining generates hypotheses that are then verified.
The filtering process does not guarantee completeness (wrong elimination or missing data).

Data Mining as a Search Problem

DM searches for interesting & useful patterns in large databases.
Consider a 4×4 image where we want to classify each pixel into one of two classes, giving a total of $2^{16}$ potential combinations.
We could reduce the outcomes by asserting that each 2×2 block can only be assigned to one class. It often happens that neighbouring pixels belong to the same class (autocorrelation).

Data Mining as a Search Problem

a) One potential pattern out of $2^{16}$.
b) If we constrain the pattern such that each 2×2 block can only be assigned one class (black or white), the number of potential patterns is $2^{4}$.
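The two counts can be checked by brute force. The sketch below is only an illustration, not part of the original slides: it enumerates every two-class labelling of a 4×4 image and keeps those in which each 2×2 block is uniform. The names `SIDE`, `BLOCK` and `blocks_are_uniform` are made up for this example.

```python
from itertools import product

SIDE, BLOCK = 4, 2  # 4x4 image, 2x2 blocks


def blocks_are_uniform(labels):
    """True if every 2x2 block of the flattened 4x4 labelling has one class."""
    for r0 in range(0, SIDE, BLOCK):
        for c0 in range(0, SIDE, BLOCK):
            classes = {labels[r * SIDE + c]
                       for r in range(r0, r0 + BLOCK)
                       for c in range(c0, c0 + BLOCK)}
            if len(classes) > 1:
                return False
    return True


# Every assignment of class 0 or 1 to the 16 pixels.
all_patterns = list(product((0, 1), repeat=SIDE * SIDE))
# Only the assignments that respect the block constraint.
constrained = [p for p in all_patterns if blocks_are_uniform(p)]

print(len(all_patterns), len(constrained))
```

The constraint encodes an assumption of autocorrelation, and it shrinks the search space from 65536 candidates to 16.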

Unique features of spatial data mining

The difference between classical & spatial data mining parallels the difference between classical & spatial statistics.
Statistics assumes the samples are independently generated, which is generally not the case with spatial data.
Like things tend to cluster together.
Change tends to be gradual over space.

Non-Spatial Descriptive Data Mining

Descriptive analysis is an analysis that results in some description or summarization of existing data. It characterizes the properties of the data by discovering patterns in the data which would be difficult for the human analyst to identify by eye or by using standard statistical techniques. Description involves identifying rules or models that describe data. Both clustering and association rules are employed by supermarket chains.
Clustering (unsupervised learning) is a descriptive data mining technique. Clustering is the task of assigning cases into groups of cases (clusters) so that the cases within a group are similar to each other and are as different as possible from the cases in other groups. Clustering can identify groups of customers with similar buying patterns, and this knowledge can be used to help promote certain products.
Association rules. Association rule discovery (ARD) identifies the relationships within data. The rule can be expressed as a predicate in the form (IF x THEN y). ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used by a supermarket chain to help decide on the layout of the product lines.
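A rule of the form IF x THEN y is usually scored by its support and confidence. These two measures are standard in the association-rule literature but are not defined in these slides, so the sketch below is an assumption-laden illustration. The baskets are invented, and the helper names `support` and `confidence` are hypothetical.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"strawberries", "cream"},
    {"strawberries", "cream", "bread"},
    {"strawberries", "bread"},
    {"cream", "milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing x, the fraction that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Rule: IF strawberries THEN cream
rule_support = support({"strawberries", "cream"}, baskets)
rule_confidence = confidence({"strawberries"}, {"cream"}, baskets)
print(rule_support, rule_confidence)
```

Here two of the five baskets contain both items, and two of the three strawberry baskets also contain cream.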

Non-Spatial Predictive Data Mining

Predictive DM results in some description or summarization of a sample of data which predicts or forecasts the form of unobserved data. Prediction involves building a set of rules or a model that will enable unknown or future values of a variable to be predicted from known values of another variable.
Classification is a predictive data mining technique. Classification is the task of finding a model that maps (classifies) each case into one of several predefined classes. Classification is used in risk assessment in the insurance industry.
Regression analysis is a predictive data mining technique that uses a model to predict a value. Regression can be used to predict sales of new product lines based on advertising expenditure.

Case Study

Data from 1995 & 1996 concerning two wetlands on the shores of Lake Erie, USA.
Using this information we want to predict the spatial distribution of a marsh-breeding bird called the red-winged blackbird.
A uniform grid (pixel = 5 square metres) was superimposed on the wetland.
Seven attributes were recorded.
See the link to Spatial Databases: A Tour for details.

The significance of three key variables was established with statistical analysis:
Vegetation durability
Distance to open water
Water depth
The spatial distribution is shown in Fig. 7.3.


[Figure: color version of Fig. 7.3, p. 188. Panels: nest locations, distance to open water, vegetation durability, water depth.]

Classical Statistical Assumptions do not hold for spatially dependent data

The previous maps illustrate two important features of spatial data:
Spatial autocorrelation: not independently distributed.
Spatial heterogeneity: not identically distributed.
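Spatial autocorrelation can be quantified. One widely used statistic is Moran's I, which these slides do not name, so treat the sketch below as an outside illustration rather than the lecture's method. It assumes a regular grid, rook (edge-sharing) neighbours, and a weight of 1 for every neighbour pair. The function names are hypothetical.

```python
def rook_pairs(rows, cols):
    """Yield (i, j) index pairs for cells sharing an edge, in both directions."""
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    yield r * cols + c, rr * cols + cc


def morans_i(grid):
    """Moran's I for a 2-D grid with binary rook-contiguity weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = list(rook_pairs(rows, cols))  # each pair carries weight 1
    cross = sum(dev[i] * dev[j] for i, j in pairs)
    return (n / len(pairs)) * cross / sum(d * d for d in dev)


clustered = [[1, 1, 0, 0] for _ in range(4)]        # like values side by side
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(morans_i(clustered), morans_i(checkerboard))
```

Positive values indicate that like values cluster together, and negative values indicate dispersion. The checkerboard, where every neighbour differs, gives the extreme negative case.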

Why spatial DBs do not use classical DM(*)

Rich data types (e.g., extended spatial objects, possibly temporal data).
Explicit or implicit spatial relationships among the variables.
Observations that are not independent.
Spatial autocorrelation exists among the features of interest. In the earlier part of the course these were physical features (e.g. roads); here we are interested in observations or events that occur in geographic space.

Classical Data Mining

Association rules: Determination of interaction between attributes. For example: X → Y

Classification: Estimation of the attribute of an entity in terms of attribute values of another entity. Some applications are:
    Predicting locations (shopping centers, habitat, crime zones)
    Thematic classification (satellite images)

Clustering: Unsupervised learning, where classes and the number of classes are unknown. Uses a similarity criterion. Applications: clustering pixels from a satellite image on the basis of their spectral signature, identifying hot spots in crime analysis and disease tracking.
Regression: Takes a numerical dataset and develops a mathematical formula that fits the data. The results can be used to predict future behavior. Works well with continuous quantitative data like weight, speed or age. Not good for categorical data where order is not significant, like color, name, gender, nest/no nest.

Location Predictors & Thematic Classification

The goal of classification is to estimate the value of an attribute of a relation based on the values of the relation's other attributes.
Determining the location of nests based on the values of vegetation durability & water depth is a location prediction problem.
Classifying the pixels of a satellite image into various thematic classes such as water, forest, or agricultural is a thematic classification problem.

Determining the Interaction among Attributes(*)

We wish to discover relationships between attributes of a relation.

is_close(house, beach) -> is_expensive(house)

low(vegetationDurability) -> high(stem density)

Associations & association rules are often used to select subsets of features for more rigorous statistical correlation analysis.

Identification of Hot Spots: Clusters & Outliers

The above techniques can be used to determine areas of high crime density, nest locations, or disease clusters.
Cluster detection involves unsupervised learning.
Noise, error, deviations, or exceptions can be identified by outlier detection.
Hot spots represent higher than normal values of a variable. Hot spots can be identified visually and could be considered statistically as being (say) more than two standard deviations from the mean.
Cool spots
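The two-standard-deviation rule of thumb can be sketched directly. The incident counts below are invented for illustration, and the function name and the choice of population standard deviation (`statistics.pstdev`) are assumptions, not something the slides specify.

```python
from statistics import mean, pstdev

# Hypothetical incident counts per grid cell (invented data).
counts = [3, 4, 2, 5, 3, 4, 30, 3, 2, 4, 3, 5]


def hot_spots(values, k=2.0):
    """Indices whose value lies more than k standard deviations above the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if v > mu + k * sigma]


hot = hot_spots(counts)
print(hot)
```

Note that a single extreme cell inflates both the mean and the standard deviation, so this simple rule can miss weaker hot spots in the same data set.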

Classification techniques

Linear regression is used to model the interaction between independent and dependent variables using an equation. The linear equation y = mx + c is used for modelling the class boundary in linear regression analysis.
Maximum-likelihood uses the joint probability P(Dep, Ind) & Bayes' Theorem; used in remote sensing.
Decision-tree classifiers (DTC) divide attribute space D into labelled regions; used in business systems.
Neural networks generalise DTC by computing regions that have non-linear boundaries.
Spatial regression extends linear regression to the spatial domain.

Linear regression(*)

A model is an equation that describes data, or makes it possible to predict an unseen value from another, known value in the data. For example Y = a + bX, where X and Y are variables and a and b are parameters (constants) of the model. Once a and b are determined by the data mining task, values of Y can be calculated from values of X using the model. Models are stored in a model base, which is part of the model management system of a decision support system. A model base contains a number of pre-generated models (financial, statistical, what-if, goal-seeking) that can be used and modified by decision makers.

Linear regression

Real-valued class variables are best modeled using conditional expectation (a value from the domain) rather than conditional probability (a probability). The goal is to compute $E[C \mid A_1, \ldots, A_n]$.

$E[Y \mid X = x] = f(\alpha + \beta x)$, where $X = (x_1, \ldots, x_n)$

$Y = X\beta + \varepsilon$

Spatial regression(*)

Spatially referenced variables tend to exhibit spatial autocorrelation. We cannot always assume an independent, identical distribution. Spatial autoregressive regression (SAR) extends regression with a contiguity (or weight) matrix $W$:
$Y = \rho W Y + X\beta + \varepsilon$
Finding solutions for SAR equations is more complex than for linear regression, because $Y$ appears on both sides of the equation.
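To see why Y appears on both sides, the sketch below generates values that satisfy the SAR equation on a small grid. It makes several assumptions that are not in the slides: a single explanatory variable, a row-normalised rook-contiguity matrix, |ρ| < 1, and repeated substitution as the solver. It only shows the structure of the model and is not an estimation procedure for ρ or β. All names are hypothetical.

```python
import random


def rook_weights(rows, cols):
    """Row-normalised contiguity matrix W for a rows x cols grid (rook neighbours)."""
    n = rows * cols
    W = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            for rr, cc in nbrs:
                W[r * cols + c][rr * cols + cc] = 1.0 / len(nbrs)
    return W


def solve_sar(W, x, beta, rho, eps, iterations=200):
    """Solve y = rho*W*y + beta*x + eps by repeated substitution (needs |rho| < 1)."""
    n = len(x)
    base = [beta * x[i] + eps[i] for i in range(n)]
    y = base[:]
    for _ in range(iterations):
        y = [rho * sum(W[i][j] * y[j] for j in range(n)) + base[i]
             for i in range(n)]
    return y


random.seed(0)
W = rook_weights(4, 4)
x = [random.random() for _ in range(16)]
eps = [random.gauss(0, 0.1) for _ in range(16)]
y = solve_sar(W, x, beta=2.0, rho=0.5, eps=eps)

# Largest violation of the SAR equation by the computed y.
residual = max(
    abs(y[i] - (0.5 * sum(W[i][j] * y[j] for j in range(16)) + 2.0 * x[i] + eps[i]))
    for i in range(16)
)
print(residual)
```

With ρ = 0 the spatial term vanishes and the model reduces to ordinary linear regression, Y = Xβ + ε.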

Descriptive & Predictive Knowledge

Data mining tasks can provide descriptive and predictive knowledge.

Descriptive Analysis

Descriptive analysis is an analysis that results in some description or summarization of data. It characterizes the properties of the data by discovering patterns in the data which would be difficult for the human analyst to identify by eye or by using standard statistical techniques. Description involves identifying rules or models that describe data. Both clustering and association rules are employed by supermarket chains.
Clustering is a descriptive data mining technique. Clustering is the task of assigning cases into groups of cases (clusters) so that the cases within a group are similar to each other and are as different as possible from the cases in other groups. Clustering can identify groups of customers with similar buying patterns, and this knowledge can be used to help promote certain products. A cluster is a grouping in existing data.
Association rules. Association rule discovery (ARD) identifies the relationships within data. The rule can be expressed as a predicate in the form (IF x THEN y). ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used by a supermarket chain to help decide on the layout of the product lines.

Predictive Analysis

Predictive analysis is an analysis that results in some description or summarization of a sample of data which predicts the form of unobserved data. Prediction involves building a set of rules or a model that will enable unknown or future values of a variable to be predicted from known values of another variable.
Classification is a predictive data mining technique. Classification is the task of finding a model that maps (classifies) each case into one of several predefined classes. Classification is used in risk assessment in the insurance industry.
Regression is a predictive data mining technique that uses a model to predict a value. Regression can be used to predict sales of new product lines based on advertising expenditure.

Linear regression: Example

Below is a linear regression model. It shows the amount a customer spends in a supermarket fitted as a linear function of the person's income, where a (the intercept) and b (the slope) are found by the data mining task. If the model is reasonably accurate, values of AnnualSpending (Y) can be predicted (or calculated) from values of Income (X).
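A minimal sketch of how a and b might be found, assuming ordinary least squares as the fitting method (the slides do not say which method the data mining task uses). The income and spending figures are invented, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b


# Hypothetical data: income and annual spending, both in thousands.
income = [20, 30, 40, 50, 60]
spending = [3.1, 3.9, 5.2, 5.8, 7.0]

a, b = fit_line(income, spending)
predicted = a + b * 45          # predicted AnnualSpending for Income = 45
print(a, b, predicted)
```

Once a and b are fixed, prediction is just evaluation of the model at a new value of X.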

How does data mining differ from conventional methods of data analysis?(*)

Using conventional data analysis, the analyst formulates and refines the hypothesis. This is known as hypothesis verification: an approach to identifying patterns in data where a human analyst formulates and refines the hypothesis. For example, "Did the sales of cream increase when strawberries were available?"
Using data mining, the hypothesis is formulated and refined without human input. This is known as hypothesis generation: an approach to identifying patterns in data where the hypotheses are automatically formulated and refined. Knowledge discovery is where the data mining tool formulates and refines the hypothesis by identifying patterns in the data. For example, "What are the factors that determine the sales of cream?"

Classification techniques

A classification function, f : D → L, maps a domain D consisting of one or more variables (e.g. vegetation durability, water depth, distance to open water) to a set of labels L (e.g. nest or not-nest).
The goal of classification is to determine the appropriate f from a finite subset Train ⊂ D × L.
The accuracy of f is determined on Test, which is disjoint from Train.

Classification techniques

The classification problem is known as predictive modelling because f is used to predict the labels L from D.
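The Train/Test idea can be sketched with a very simple classifier. The example below uses a nearest-centroid rule, chosen here only for brevity and not one of the techniques listed in these slides. The feature triples (vegetation durability, water depth, distance to open water) and their labels are invented.

```python
import math


def train_centroids(train):
    """Learn f from Train (a subset of D x L) as one mean point per label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, v in enumerate(features):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def classify(centroids, features):
    """f(d): the label whose centroid is closest to the feature vector d."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], features))


def accuracy(centroids, test):
    """Fraction of Test cases whose predicted label matches the true label."""
    return sum(classify(centroids, f) == lab for f, lab in test) / len(test)


# Hypothetical cases, disjoint Train and Test sets.
train = [((80, 20, 10), "nest"), ((75, 25, 15), "nest"), ((70, 30, 12), "nest"),
         ((20, 70, 60), "not-nest"), ((25, 80, 55), "not-nest"),
         ((30, 75, 70), "not-nest")]
test = [((78, 22, 11), "nest"), ((22, 72, 65), "not-nest")]

model = train_centroids(train)
print(accuracy(model, test))
```

Measuring accuracy on cases that were not used for training is what makes the estimate a fair indication of predictive performance.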

Association rule discovery

Apriori
Spatial association rules
Co-location rules

Clustering Clustering

Clustering (unsupervised learning) is a Clustering (unsupervised learning) is a process for discovering groups in large process for discovering groups in large databases. Clusters are formed on the databases. Clusters are formed on the basis of similarity criterion which are used basis of similarity criterion which are used to determine the relationship between to determine the relationship between each pair of each pair of tuplestuples in the database.in the database.TuplesTuples that are similar are grouped and that are similar are grouped and the group is labelled.the group is labelled.

Clustering
Clustering is used in statistics; the role of data mining is to scale a clustering algorithm to handle very large and complex data sets:

Diverse data types
Large numbers of records per table
Large numbers of attributes per record.

Clustering operates in a multi-dimensional attribute space: with n data objects and m variables, each object is represented as a point in m-space. Clustering can then be interpreted as determining high-density groups of points from a set of non-uniformly distributed points in m-space. The search for potential groups requires a suitable similarity criterion.
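As an illustration of grouping points in m-space, here is a minimal k-means sketch (a partitional method). The lecture does not give an algorithm at this point, so the choice of k-means, Euclidean distance as the similarity criterion, seeding with the first k points, and the sample data are all assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Partition points (tuples in m-space) into k groups by Euclidean distance."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Hypothetical data: n = 6 objects, m = 2 variables.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.9, 1.1), (8.8, 9.1), (9.2, 8.9)]
labels, centres = kmeans(points, k=2)
print(labels)
```

Each pass compares every point with every centroid, which is the cost that grows with the number of records and attributes.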

Spatial Clustering

Consider population density as a function (y axis) of location (x axis)

Clustering Algorithms

Hierarchical
Partitional
Density based
Grid based

Spatial Outlier Detection(*)

Global outliers are observations which appear inconsistent with the remainder of the data set.
Global outliers deviate so much from other observations that it may be possible that they were generated by a different mechanism.
Spatial outliers are observations that appear inconsistent with their neighbours.


Detecting spatial outliers has important applications in transportation, ecology, public safety, public health, climatology and location based services.
Geographic objects have a spatial component (location, shape, metric & topological properties) & a non-spatial component (house owner, sensor id., soil type).


Spatial neighbourhoods may be defined using spatial attributes & spatial relations.
Comparisons between spatially referenced objects can be based on non-spatial attributes.
A spatial outlier is a spatially referenced object whose non-spatial attribute values differ from those of other spatially referenced objects in its spatial neighbourhood.


Global & spatial outliers. A Moran scatter plot is a plot of the normalised values Z[f(i)] = (f(i) - µ_f)/σ_f against the neighbourhood average of the normalised attribute values (W · Z), where W is the row-normalised (Σ_j W_ij = 1) neighbourhood matrix (W_ij > 0 iff neighbour(i, j)).
Recall that the dot product is defined as a · b = Σ_k a_k b_k, so entry i of W · Z is Σ_j W_ij Z[f(j)].

Data for Outlier detection(*)


The upper left & lower right quadrants of figure 7.17 indicate a spatial association of dissimilar values: low values surrounded by high-value neighbours (P & Q) and high values surrounded by low values (S).


A Moran outlier is a point located in the upper left or lower right quadrant of a Moran scatter plot:

Z[f(i)] × (Σ_j W_ij Z[f(j)]) < 0
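The outlier test above can be sketched in a few lines of Python. This is an illustration under assumptions: the function name moran_outliers, the adjacency-list input and the sample values are hypothetical, and W is taken to give equal weight to every neighbour of a location (which makes each row sum to 1).

```python
import statistics


def moran_outliers(values, neighbours):
    """Return the indices i with Z[f(i)] * sum_j W_ij Z[f(j)] < 0.

    values     -- attribute values f(i)
    neighbours -- neighbours[i] is the list of indices adjacent to i
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    z = [(v - mu) / sigma for v in values]
    outliers = []
    for i, adj in enumerate(neighbours):
        # Row-normalised W: each neighbour of i gets weight 1/len(adj).
        lag = sum(z[j] for j in adj) / len(adj)
        if z[i] * lag < 0:
            outliers.append(i)
    return outliers


# Hypothetical chain of seven locations; location 3 is a spike among low values.
values = [1.0, 1.2, 1.1, 9.0, 1.0, 0.9, 1.1]
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
print(moran_outliers(values, neighbours))
```

On this data the spike at location 3 is flagged (a high value surrounded by low values), and so are locations 2 and 4 (low values whose neighbourhood average is pulled up by the spike). That is the same pairing of quadrants the scatter plot shows.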

Model Evaluation

Consider the two-class classification problem 'nest' or 'no-nest'. The four possible outcomes (or predictions) are shown on the next slide. The desired predictions are:

1) where the model says there should be a nest and there is an actual nest (True Positive)
2) where the model says there is no nest and there is no nest (True Negative)

The other outcomes are not desirable and point to a flaw in the model.


In classification the goal is to predict the conditional probability of one attribute on the basis of the values of the other attributes; thus the outcomes of classification techniques are probabilities.
We choose a cut-off probability b; for a given b we have:
TPR(b) = TP(b) / (TP(b) + FN(b))
FPR(b) = FP(b) / (FP(b) + TN(b))


If we plot the TPR versus the FPR for classical regression (CR) and for spatial regression (SAR), then the classifier whose curve is further above the diagonal TPR = FPR is the better model for that specific data set.
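To make the cut-off b concrete, the sketch below computes one (TPR, FPR) point per cut-off using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The probabilities, the outcomes and the function name rates are hypothetical, and the data is assumed to contain at least one nest and one non-nest so that neither denominator is zero.

```python
def rates(probabilities, actual, b):
    """Return (TPR, FPR) when 'nest' is predicted for probability >= b."""
    tp = fp = tn = fn = 0
    for p, is_nest in zip(probabilities, actual):
        predicted = p >= b
        if predicted and is_nest:
            tp += 1  # true positive
        elif predicted and not is_nest:
            fp += 1  # false positive
        elif not predicted and is_nest:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical model outputs P(nest) and the true outcomes.
probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actual = [True, True, False, True, False, False]

# One point of the ROC curve per cut-off value.
roc = [rates(probabilities, actual, b) for b in (0.75, 0.5, 0.25)]
print(roc)
```

Lowering b moves the point up and to the right: more nests are found, at the price of more false alarms.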


ROC curve (ROC = receiver operating characteristics)

Association rules (defs. & calcs differ from handout)

Association rules(*)

An association rule is a pattern that can be expressed as a predicate in the form (IF x THEN y), where x and y are conditions (about cases). It states that if x (the antecedent) occurs then, in most cases, so will y (the consequence). The antecedent may contain several conditions but the consequence usually contains only one term.

Association rules(*)
Association rules need to be discovered. Rule discovery is a data mining technique that identifies relationships within data. In the non-spatial case rule discovery is usually employed to discover relationships within transactions or between transactions in operational data. The relative frequency with which an antecedent appears in a database is called its support. Support is high when this relative frequency is considered significant; the level at which that happens is called the support threshold (say 70%).

Association rules(*)

Example: Market basket analysis is a form of association rule discovery that discovers relationships in the purchases made by a customer during a single shopping trip. An itemset in the context of market basket analysis is the set of items found in a customer's shopping basket.

A-Priori algorithm(*)

The algorithm follows a two stage process.
1) Find the k-itemsets that are at or above the support threshold, giving the frequent k-itemsets. If none is found, stop; otherwise continue.
2) Generate the (k+1)-itemsets from the frequent k-itemsets. Go to 1.

A-Priori algorithm: Example(*)

Association Rules: A priori

Principle: If an itemset has high support, then so do all its subsets.
The steps of the algorithm are as follows:

first, discover all 1-itemsets that are frequent
combine to form 2-itemsets and analyse for frequent sets
go on until no more itemsets exceed the threshold
search for rules
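The steps above can be sketched as follows. This is a minimal, unoptimised illustration rather than a reference implementation: the function name apriori, the use of frozenset for itemsets, and the small basket data are assumptions, and the final "search for rules" step is left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Relative frequency of transactions containing the whole itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Start with every single item as a candidate 1-itemset.
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Stage 1: keep the candidates at or above the support threshold.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Stage 2: join frequent k-itemsets into (k+1)-candidates and keep
        # only those whose k-subsets are all frequent (the principle above).
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning in stage 2 is where the principle pays off: {bread, milk, butter} is never counted against the data because its subset {milk, butter} is already known to be infrequent.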

Association Rules: Example 2

Item      Code
CD        D
Alarm     A
TV        T
VCR       V
Computer  C

Case  Items
1     D A V C
2     A T C
3     D A V C
4     D A T C
5     D A T V C
6     A T V

Association Rules: Example 2

Frequency of itemsets

100% (6): A
83% (5): C, AC
67% (4): D, T, V, DA, DC, AT, AV, DAC
50% (3): DV, TC, VC, DAV, DVC, ATC, AVC, DAVC

Association Rules: Example 2

Confidence of association rules = 100%

D → A (4/4)
D → C (4/4)
D → AC (4/4)
T → A (4/4)
V → A (4/4)
C → A (5/5)

DA → C (4/4)
DV → A (3/3)
DV → C (3/3)
DC → A (4/4)
TC → A (3/3)
VC → D (3/3)

VC → A (3/3)
DV → AC (3/3)
VC → DA (3/3)
DAV → C (3/3)
DVC → A (3/3)
AVC → D (3/3)

Association rules with confidence >= 80%

C → D (4/5)
A → C (5/6)
C → DA (4/5)
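The counts in Example 2 can be checked mechanically. The sketch below encodes the six cases of the example and computes the confidence of a rule as count(antecedent and consequent together) / count(antecedent); the helper names count, support and confidence are hypothetical.

```python
# The six cases of Example 2 (D = CD, A = Alarm, T = TV, V = VCR, C = Computer).
CASES = [
    {"D", "A", "V", "C"},
    {"A", "T", "C"},
    {"D", "A", "V", "C"},
    {"D", "A", "T", "C"},
    {"D", "A", "T", "V", "C"},
    {"A", "T", "V"},
]


def count(itemset):
    """Number of cases that contain every item of itemset."""
    return sum(1 for case in CASES if set(itemset) <= case)


def support(itemset):
    """Relative frequency of the itemset over all cases."""
    return count(itemset) / len(CASES)


def confidence(antecedent, consequent):
    """count(antecedent and consequent together) / count(antecedent)."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


print(support("AC"))          # itemset AC appears in 5 of the 6 cases
print(confidence("D", "AC"))  # rule D -> AC, 4/4
print(confidence("C", "D"))   # rule C -> D, 4/5
```

Item codes are single letters, so a string such as "AC" is read as the itemset {A, C}.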

Association rules & Spatial Domain(*)

Differences with respect to the spatial domain:
1. The notion of transaction or case does not exist, since data are immersed in a continuous space. Partitioning the space may introduce errors, overestimating or underestimating confidences. The notion of transaction is replaced by neighbourhood.

2. The size of itemsets is smaller in the spatial domain. Thus, the cost of generating candidates is not a dominant factor. The enumeration of neighbours dominates the final computational cost.

3. In most cases, the spatial items are discrete versions of continuous variables.

Spatial Association Rules(*)

Table 7.5 shows examples of association rules, support, and confidence that were discovered in the Darr 1995 wetland data.

Co-Location rules(*)

Colocation rules attempt to generalise association rules to point collection data sets that are indexed by space. The colocation pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations, see Figure 7.12.

Co-location patterns
predator-prey species, symbiosis
Dental health and fluoride

Co-Location rules

Measuring spatial attraction

Use spatial statistics: the K function
In its basic form, for a single point pattern, let
NP = number of points within radius h of a random point
λ K(h) = E(NP), where λ is the intensity of the pattern (mean number of points per unit area)

If no spatial correlation: K(h) = π h^2

Attraction: K(h) > π h^2

Repulsion: K(h) < π h^2

Correlation between two point patterns:
λ_2 K_12(h) = E(number of points of type 2 within radius h of a random point of type 1)
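A naive estimate of K(h) for a single point pattern can be written directly from the definition λ K(h) = E(NP). The sketch below is an illustration only: it ignores edge effects at the boundary of the study region, estimates λ as n / area, and uses hypothetical points and a hypothetical function name k_function.

```python
import math


def k_function(points, h, area):
    """Naive estimate of K(h) for one point pattern (no edge correction)."""
    n = len(points)
    lam = n / area  # estimated intensity: points per unit area
    # Ordered pairs (i, j), i != j, that lie within distance h of each other.
    pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= h
    )
    # Mean number of other points within h of a point, divided by the intensity.
    return (pairs / n) / lam


# Hypothetical pattern: two tight groups in a 10 x 10 study region.
clustered = [(1, 1), (1.2, 1.1), (0.9, 1.3), (8, 8), (8.1, 7.9), (7.8, 8.2)]
h = 1.0
print(k_function(clustered, h, area=100.0), math.pi * h ** 2)
```

For this pattern the estimate is far above π h^2, which is the attraction case listed above.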

Associations, Spatial associations, Co-location(*)

Answer: two co-location patterns (shown in the figure).

Spatial Association Rules(*)
A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some non-spatial predicates.
Spatial association rules (SPAR) are defined in terms of spatial predicates rather than items.
P1 ∧ P2 ∧ ... ∧ Pn → Q1 ∧ ... ∧ Qm
where at least one of the terms (P or Q) is a spatial predicate.

is(x, country) ∧ touches(x, Mediterranean) → is(x, wine-exporter) (support %, confidence %)

Co-location V Association Rules(*)

Transactions are disjoint while spatial co-location is not. Something must be done. Three main options:

1. Divide the space into areas and treat them as transactions
2. Choose a reference point pattern and treat the neighbourhood of each of its points as a transaction
3. Treat all point patterns as equal
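Option 2 can be sketched as follows: each point of the reference pattern yields one transaction containing the event types found within a given radius of it. The function name, the radius and the event data are hypothetical, and Euclidean distance is assumed as the neighbourhood definition.

```python
import math


def neighbourhood_transactions(events, reference_type, radius):
    """Build one transaction per reference point from the event types near it.

    events -- list of (event_type, (x, y)) tuples
    """
    transactions = []
    for kind, location in events:
        if kind != reference_type:
            continue
        # Event types (other than the reference type) within the radius.
        nearby = {
            other_kind
            for other_kind, other_location in events
            if other_kind != reference_type
            and math.dist(location, other_location) <= radius
        }
        transactions.append(nearby)
    return transactions


# Hypothetical map of event types.
events = [
    ("nest", (0, 0)), ("reed", (0.5, 0)), ("water", (0, 0.8)),
    ("nest", (5, 5)), ("reed", (5.2, 5.1)),
    ("nest", (9, 0)), ("road", (9.3, 0.2)),
]
for transaction in neighbourhood_transactions(events, "nest", radius=1.0):
    print(sorted(transaction))
```

The resulting list of sets can be handed to an ordinary association rule algorithm. Every pairwise distance is examined here, which reflects the earlier point that enumerating neighbours dominates the computational cost.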

Co-location V Association Rules(*)

Spatial Association Rules Mining (SARM) is similar to the raster view in the sense that it tessellates a study region S into discrete groups based on spatial or aspatial predicates derived from concept hierarchies. For instance, a spatial predicate close_to(α, β) divides S into two groups: locations close to β and those not. So close_to(α, β) can be either true or false depending on α's closeness to β. A spatial association rule is a rule that consists of a set of predicates in which at least one spatial predicate is involved. For instance, is_a(α, house) and close_to(α, beach) -> is_expensive(α). This approach efficiently mines large datasets using a progressive deepening approach.
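The example rule can be evaluated on a small data set to show how a spatial predicate becomes a true/false test. Everything concrete here is an assumption made for illustration: the beach location, the distance threshold behind close_to, the price limit behind is_expensive, and the objects themselves. Support is taken as the fraction of all objects that satisfy the whole rule, which is one common convention and may differ from the handout.

```python
import math

BEACH = (0.0, 0.0)  # hypothetical beach location


def close_to(location, target, threshold=2.0):
    """Spatial predicate: true when location lies within threshold of target."""
    return math.dist(location, target) <= threshold


def is_expensive(price, limit=800):
    """Non-spatial predicate on a hypothetical price attribute."""
    return price >= limit


# Hypothetical objects: (kind, location, price).
objects = [
    ("house", (0.5, 0.5), 900),
    ("house", (1.0, 1.5), 850),
    ("house", (6.0, 6.0), 300),
    ("shop", (0.2, 0.1), 700),
    ("house", (1.5, 0.5), 350),
]

# Objects satisfying is_a(x, house) and close_to(x, beach).
antecedent = [o for o in objects if o[0] == "house" and close_to(o[1], BEACH)]
# Of those, the ones that also satisfy is_expensive(x).
both = [o for o in antecedent if is_expensive(o[2])]

support = len(both) / len(objects)
confidence = len(both) / len(antecedent)
print(support, confidence)
```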

Data Mining with Oracle


The following are examples of the kinds of data mining applications that could benefit from including spatial information in their processing:
Business prospecting: Determine if colocation of a business with another franchise (such as colocation of a pizza restaurant with a video store) might improve its sales.
Store prospecting (USA): Find a good store location that is within 50 miles of a major city and inside a state with no sales tax.
Hospital prospecting: Identify the best locations for opening new hospitals based on the population of patients who live in each neighbourhood.


Spatial region-based classification or personalization: Determine if south eastern United States customers in a certain age or income category are more likely to prefer "soft" or "hard" rock music.
Automobile insurance: Given a customer's home or work location, determine if it is in an area with high or low rates of accident claims or auto thefts.
Property analysis: Use colocation rules to find hidden associations between proximity to a highway and either the price of a house or the sales volume of a store.
Property assessment: In assessing the value of a house, examine the values of similar houses in a neighbourhood, and derive an estimate based on variations and spatial correlation.

Classical V Spatial Data Mining(*)

The statistical presumption of data independence does not hold for spatially dependent data.
Problems with spatial attributes in the dataset (spatial and non-spatial attributes)
Complicated structures for storing spatial data (R-trees, indexing)

Classical V Spatial Data Mining(*)

Spatial Autocorrelation
"All things are related, but nearby things are more related than distant things." [Tobler]
Spatial Heterogeneity

spatial data is not identically distributed in space
data properties are location dependent
local trends can sometimes contradict the global trends

Review of DM terms
Classification is a predictive data mining technique. Classification is the task of finding a model that maps (classifies) each case into one of several predefined classes, i.e. the estimation of an attribute of an entity in terms of the attribute values of another entity. Some applications are:
Predicting locations (shopping centers, habitat, crime zones)
Risk assessment in the insurance industry.
Thematic classification (satellite images)

Review of DM terms

Clustering (unsupervised learning) is a descriptive data mining technique. Clustering is the task of assigning cases into groups of cases (clusters) so that the cases within a group are similar to each other and are as different as possible from the cases in other groups. Clustering can identify groups of customers with similar buying patterns and this knowledge can be used to help promote certain products.

Review of DM terms

Association Rules(*). Association rule discovery (ARD) identifies the relationships within data. The rule can be expressed as a predicate in the form (IF x THEN y). ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used by a supermarket chain to help decide on the layout of the product lines.

Review of DM terms
Regression Analysis(*): Linear regression is used to model the interaction between independent and dependent variables using an equation. The linear equation y = mx + c is used for modelling the class boundary in linear regression analysis. Regression analysis can be used as a predictive data mining technique that uses a model to predict a value. Regression can be used to predict sales of new product lines based on advertising expenditure. It works well with continuous quantitative data like weight, speed or age. It is not good for categorical data where order is not significant, like colour, name, gender, nest/no nest.

Adapting Association Rules to the spatial case(*)

There are differences with respect to the spatial domain:
The notion of transaction or case does not exist, since data are immersed in a continuous space. Partitioning the space may introduce errors, overestimating or underestimating confidences. The notion of transaction is replaced by neighbourhood. But unlike traditional association rules we do not have a set of given local items in a basket. Therefore an algorithm will have to compare all near objects for association.
The size of itemsets is smaller in the spatial domain. Thus, the cost of generating candidates is not a dominant factor. The enumeration of neighbours dominates the final computational cost.


There are differences with respect to the spatial domain:
In most cases, the spatial items are discrete versions of continuous variables.
Co-location rules attempt to generalise association rules to point collection data sets that are indexed by space. The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations.
Examples of co-location patterns: predator-prey species, symbiosis, dental health and fluoride.


Co-location extends traditional association rules to the case where the set of transactions is a continuum in space, but we need additional definitions of both neighbour (say, a radius) and the statistical weight of a neighbour. Use a spatial statistic, the K function, to measure the correlation within one point pattern (same variable) and between two point patterns (different variables). K can indicate no spatial correlation, attraction, or repulsion between variables.

Adapting Regression to the spatial case(*)

Variables that are spatially referenced tend to exhibit autocorrelation and thus are not identically and independently distributed. To take this into account the standard regression equation must be augmented with a weight matrix W that represents the spatial contiguity of sample areas (or regions):
Y = ρWY + Xβ + ε
Finding solutions for SAR equations is more complex than for linear regression.
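One step that makes the extra complexity visible: Y appears on both sides of the SAR equation, so it has to be solved for Y before it can be used for prediction. Assuming the matrix I - ρW is invertible (I is the identity matrix, a symbol not used in the slide):

```latex
(I - \rho W)\,Y = X\beta + \varepsilon
\quad\Longrightarrow\quad
Y = (I - \rho W)^{-1}\,(X\beta + \varepsilon)
```

With ρ = 0 the inverse is the identity and the model reduces to the classical regression Y = Xβ + ε.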

Revision

Project (Leinster)
OGC Standard
Database Architecture for GIS
Shortest Path
Themes & Geographic Objects
Spaghetti & Topological Model
Spatial Join
Constraint Data Model

Revision

Intersection of regions
Spatial Autocorrelation
Spatial Heterogeneity
Understanding Moran's I statistic
Contiguity & Weight Matrix
Understanding the terms Association Rules and Regression in the context of DM
The Apriori algorithm for creating association rules.
