Spatial-Temporal Data Mining
Wei Wang
Data Mining Lab
Computer Science Department UCLA
Outline
• Introduction
• Active Spatial Data Mining
– Spatial data mining trigger
• Temporal Association Rule with Numerical Attributes
– Correlation among object evolutions
• Conclusions and Future Work
Introduction
• Huge amounts of spatial data are generated every day.
– Earth Observing System
– National Spatial Data Infrastructure
– National Imagery and Mapping Agency
– One meter resolution data
– Digital earth
• Users are usually interested in the hidden information.
– Aggregate information
– Clustering
– Patterns
Introduction
• Knowledge discovery processes are computationally expensive.
• Today’s technological advances provide the computing power needed to carry out such complicated processes.
Outline
• Introduction
• STING+: An approach to active spatial data mining
• Temporal association rules with numerical attributes
• Conclusions and Future Work
STING+
• Since data evolves over time, interesting patterns are likely to emerge or change.
• Goal: identify and find the (most) interesting patterns.
• Problems:
– Knowledge discovery processes are expensive. It is not feasible to re-process the entire data set for every change.
– Periodically examining the data has drawbacks:
• Long delays
• Transient patterns might be missed
• Natural solution: use triggers.
STING+
• Traditional database triggers cannot be directly applied:
– The expressive power of traditional database triggers is limited, especially in describing spatial relationships.
– Example: Trigger an investigation when the size of any cluster exceeds 20.
STING+
• STING+ was designed to introduce and support spatial triggers efficiently.
• Observation (spatial locality): At any given moment, only objects added to the shaded area can contribute to the growth of the cluster size.
[Figure: point objects with a shaded area]
STING+
• STING+ Strategy: Monitor only the area occupied by potential clusters and their neighborhoods.
• Observation (cumulative effect): At least 4 more objects are needed for the cluster size to reach 20.
• STING+ Strategy: Space is organized in a hierarchy so that updates can be suspended at various levels of the hierarchy until their cumulative effect might cause the trigger to fire.
[Figure: point objects at Level 1 and Level 2 of the hierarchy]
STING+
– Space is recursively divided into smaller rectangular cells down to a specified granularity and is organized via the inherent pyramid hierarchy.
STING+
– STING+ decomposes a trigger into a set of sub-triggers, each associated with an individual cell in the hierarchical structure, to monitor the cumulative effect of data changes within that cell.
[Figure: Level 4, sub-trigger on a cell; Level 3, higher-level sub-trigger on a cell]
STING+
– Updates and insertions are suspended at various levels of the hierarchy until their cumulative effect might cause the trigger condition to become satisfied.
[Figure: point objects at Level 1 and Level 0, with "+" marks]
STING+
[Figure: point objects at Level 3 and Level 2, with "+" marks; annotation: No update of cluster!]
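The suspension of updates across levels can be illustrated with a small counter-based sketch. This is not the STING+ implementation: the class names, the uniform `slack` value, and the unit-square pyramid are all assumptions made for illustration. Each cell's hypothetical sub-trigger absorbs insertions and notifies its parent cell only once enough of them have accumulated.

```python
from dataclasses import dataclass


@dataclass
class SubTrigger:
    """Hypothetical sub-trigger attached to one cell of the hierarchy."""
    slack: int          # insertions the cell absorbs before notifying its parent
    pending: int = 0    # insertions suspended at this cell so far

    def insert(self) -> bool:
        """Record one insertion; return True when the parent must be notified."""
        self.pending += 1
        if self.pending < self.slack:
            return False        # the update stays suspended at this level
        self.pending = 0        # the cumulative effect is pushed upward
        return True


class CellHierarchy:
    """Pyramid of square cells over [0, 1) x [0, 1); level k has 2**k cells per side."""

    def __init__(self, depth: int, slack: int):
        self.depth = depth
        self.slack = slack
        self.triggers = {}      # (level, row, col) -> SubTrigger
        self.evaluations = 0    # times the full trigger condition was re-checked

    def _cell(self, level: int, x: float, y: float):
        side = 2 ** level
        return (level, int(y * side), int(x * side))

    def insert(self, x: float, y: float) -> None:
        # Walk from the finest level toward the root and stop at the first
        # sub-trigger that keeps the update suspended.
        for level in range(self.depth, -1, -1):
            key = self._cell(level, x, y)
            trigger = self.triggers.setdefault(key, SubTrigger(self.slack))
            if not trigger.insert():
                return
        self.evaluations += 1   # passed the root: re-evaluate the trigger condition


hierarchy = CellHierarchy(depth=2, slack=2)
for _ in range(8):
    hierarchy.insert(0.1, 0.1)
print(hierarchy.evaluations)
```

With three levels and a slack of 2 per level, the full condition is re-checked only once per 8 insertions into the same finest-level cell. In a real system the slack would presumably be derived from the trigger condition itself, such as the 4 objects still missing before a cluster reaches size 20.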
STING+
• Primitive event: insertion, deletion, update
• Composite event: a set of primitive events
• In general, evaluating a trigger T involves two aspects:
– Find a set of composite events E(s) that may cause the trigger condition CT to become true.
– Each time some composite event in E(s) occurs, check the status (false or true) of CT (given that CT was false previously).
• Observation: As a side effect of the occurrence of some composite event, E(s) might also evolve over time.
STING+
• STING+ Strategy: Two sets of composite events are considered:
– the set of composite events E(s) that can cause CT to become true
• need to re-evaluate CT
– the set of composite events F(s) that can cause a change to E(s)
• need to update E(s)
– The sub-triggers are used to monitor composite events in E(s) and F(s) and change accordingly when E(s) and F(s) evolve.
STING+
• Observation: Trigger condition CT is a conjunction of predicates P1 ∧ P2 ∧ … ∧ Pn and cannot be true if any one predicate is false.
– The predicates can be evaluated in a specific order: the ith predicate is tested only when all previous (i − 1) predicates are true.
– The evaluation order should be chosen so that the total cost is minimized.
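The slides do not say how the evaluation order is chosen. One standard heuristic, sketched below under the assumption that predicates are independent and that each has a known evaluation cost and probability of being true (both hypothetical inputs here), is to sort predicates by cost / (1 − p) in ascending order. The sketch compares that order against a brute-force search over all orders.

```python
from itertools import permutations


def expected_cost(order):
    """Expected cost of short-circuit evaluation over (cost, p_true) pairs."""
    total, reach = 0.0, 1.0
    for cost, p_true in order:
        total += reach * cost   # this predicate is paid for only if it is reached
        reach *= p_true         # later predicates are reached only if this one holds
    return total


def rank(predicate):
    """Sort key: cost per unit of probability that evaluation stops here."""
    cost, p_true = predicate
    return cost / (1.0 - p_true) if p_true < 1.0 else float("inf")


def best_order(predicates):
    """Cheap, likely-false predicates go first (assumes independence)."""
    return sorted(predicates, key=rank)


# Hypothetical (cost, probability of being true) estimates for P1, P2, P3.
predicates = [(10.0, 0.9), (1.0, 0.5), (5.0, 0.1)]
ordered = best_order(predicates)
brute_force = min(permutations(predicates), key=expected_cost)
print(ordered, expected_cost(ordered), expected_cost(brute_force))
```

In practice the costs and probabilities would have to be estimated, for example from statistics kept with the index.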
STING+
• PK-tree is used to index instantiated cells
– Bound on height
– Bounds on number of children
– Uniqueness for any data set
• independent of order of insertion and deletion
– Solid theoretical foundation
– Fast retrieval and efficient maintenance
• Statistical information maintained at each node is used to facilitate the trigger process.
– Sub-trigger
STING+
• Comparison with periodic re-examination via STING
– 200,000 synthetic point objects
– 10,000 insertions/deletions/updates
– If the period is set to fewer than 4000 updates, STING+ consumes fewer CPU cycles.
– Significant delays and missed transient patterns can occur for larger periods.
• Not acceptable in many applications
– No delay and no transient patterns missed with STING+.
Outline
• Introduction
• STING+: An approach to active spatial data mining
• Temporal association rules with numerical attributes
• Conclusions and Future Work
Temporal Association Rules
• We now consider general databases with evolving numerical attributes.
• Interesting patterns exhibited in the data are often numerous and complicated.
– Customer churning: If a customer’s phone bill increases by at least $10 each month for six months, then he is likely to change his long distance telephone carrier.
– Real estate: People who receive a raise of at least 20% of their salary are likely to move away from a big city.
• Such patterns can be represented by association rules of the form X ⇒ Y, which indicates that the occurrences of X and Y are highly correlated.
Temporal Association Rules
• Earlier work on association rules mainly focused on binary attributes and intra-transaction relationships.
– E.g., ham ⇒ bread
– Support and strength are two metrics used to qualify interesting rules.
• support: number of instances that follow the rule
– N(ham, bread)
• strength: how strong the correlation is
– N(ham, bread) / N(ham)
– N(ham, bread) / (N(ham) × N(bread))
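As a small worked example of these counts, the sketch below computes N(ham, bread) together with the two ratios N(ham, bread) / N(ham) and N(ham, bread) / (N(ham) × N(bread)) on a made-up set of five transactions. The data and the helper name `count` are illustrative assumptions, not taken from the slides.

```python
def count(transactions, items):
    """N(items): number of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for transaction in transactions if wanted <= transaction)


# Hypothetical market-basket data.
transactions = [
    {"ham", "bread"},
    {"ham", "bread", "milk"},
    {"ham"},
    {"bread"},
    {"milk"},
]

n_ham = count(transactions, ["ham"])
n_bread = count(transactions, ["bread"])
n_both = count(transactions, ["ham", "bread"])

support = n_both                                  # N(ham, bread)
ratio_to_antecedent = n_both / n_ham              # N(ham, bread) / N(ham)
ratio_to_product = n_both / (n_ham * n_bread)     # N(ham, bread) / (N(ham) x N(bread))
print(support, ratio_to_antecedent, ratio_to_product)
```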
Temporal Association Rules
• Consider a set of objects, each of which has a unique ID and a set of time-varying numerical attributes; a sequence of snapshots is taken at some frequency.
– E.g., in an employee database, two attributes are considered: salary and monthly housing expense.
– For a given snapshot, each employee can be mapped to a point in a two-dimensional space.
[Figure: employees plotted as points; axes: salary vs. monthly housing expense]
Temporal Association Rules
– Given a sequence of snapshots, the trace of an employee can be mapped to a point in a high-dimensional space.
• (<s1, mhe1>, <s2, mhe2>, <s3, mhe3>, <s4, mhe4>, <s5, mhe5>)
[Figure: salary vs. monthly housing expense at Snapshots 1 through 5, along a time axis]
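The mapping from a trace to a single point can be sketched as a simple flattening: with 2 attributes and 5 snapshots, each employee becomes one point in a 10-dimensional space. The salary and expense values below are made up for illustration.

```python
def trace_to_point(trace):
    """Flatten a sequence of per-snapshot attribute tuples into one point."""
    return tuple(value for snapshot in trace for value in snapshot)


# Hypothetical trace: (salary, monthly housing expense) at five snapshots.
trace = [
    (52000, 1200),
    (53000, 1250),
    (54000, 1300),
    (55000, 1400),
    (56000, 1500),
]
point = trace_to_point(trace)
print(len(point), point)
```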
Temporal Association Rules
• Temporal association rules represent the correlation among object evolutions.
– (salary: [52000, 56000]→[54000, 58000]) ⇒ (monthly_housing_expense: [1200, 1400]→[1400, 1600])
– Each temporal association rule can also be viewed as an interpretation of a cluster (with a certain shape) of points.
[Figure: clusters of points; axes: salary vs. monthly_housing_expense]
Temporal Association Rules
• Observation: The domain of a numerical attribute might contain a large number of distinct values and might even be continuous.
– E.g., domain(salary) = [50000, 60000].
– Any sub-range can appear in a rule.
– The number of possible rules may be very large, if not infinite.
• Strategy: Each attribute domain is quantized into a set of equi-length base intervals.
– The domain of salary could be quantized into base intervals of length $2000:
[Figure: salary axis from 50000 to 60000 divided into base intervals]
– Values within the same interval are not distinguished.
• E.g., $51000 and $51500 are considered the same.
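Quantization into equi-length base intervals can be sketched in a few lines, using the salary domain [50000, 60000] and interval length 2000 from the slide. The function name and the choice to place the upper bound 60000 in the last interval are assumptions.

```python
def base_interval(value, lo=50000, hi=60000, width=2000):
    """Return the (start, end) base interval that contains `value`."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside the domain [{lo}, {hi}]")
    last_index = (hi - lo) // width - 1
    # Clamp so that the upper bound of the domain falls into the last interval.
    index = min(int((value - lo) // width), last_index)
    start = lo + index * width
    return (start, start + width)


print(base_interval(51000))   # same interval as 51500
print(base_interval(51500))
print(base_interval(60000))
```

Values such as 51000 and 51500 map to the same interval, so a rule can no longer distinguish them.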
Temporal Association Rules
• Attribute evolution
– E1(salary) = [52000, 54000] → [52000, 54000] → [54000, 56000]
– E2(salary) = [52000, 56000] → [52000, 54000] → [52000, 56000]
[Figure: salary axis with marks at 50000, 52000, 54000, 56000, 58000, 60000]
Temporal Association Rules
[Figure: evolution space over Snapshot 1, Snapshot 2, and Snapshot 3, showing the evolution cube of E1(salary), the evolution cube of E2(salary), and a base cube]
Temporal Association Rules
– The subcube-supercube relationship defines a lattice among all evolution cubes within the evolution space.
– This also holds for the evolution space of more than one attribute.
[Figure: axes: salary (50000 to 60000) vs. monthly housing expense (1000 to 2000)]
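The subcube-supercube relationship can be checked directly on the interval sequences. The sketch below represents an evolution as one (low, high) interval per snapshot, which is an assumed encoding, and uses E1(salary) and E2(salary) from the earlier slide. It also enumerates the base cubes that each evolution cube covers, assuming base intervals of length 2000.

```python
from itertools import product


def is_subcube(inner, outer):
    """True if every interval of `inner` lies within the matching interval of `outer`."""
    return len(inner) == len(outer) and all(
        o_lo <= i_lo and i_hi <= o_hi
        for (i_lo, i_hi), (o_lo, o_hi) in zip(inner, outer)
    )


def base_cubes(evolution, width=2000):
    """All base cubes (one base interval per snapshot) covered by an evolution cube."""
    per_snapshot = [
        [(start, start + width) for start in range(lo, hi, width)]
        for lo, hi in evolution
    ]
    return list(product(*per_snapshot))


e1 = [(52000, 54000), (52000, 54000), (54000, 56000)]
e2 = [(52000, 56000), (52000, 54000), (52000, 56000)]

print(is_subcube(e1, e2), is_subcube(e2, e1))
print(len(base_cubes(e1)), len(base_cubes(e2)))
```

Here E1 is itself a single base cube and is a subcube of E2, which covers four base cubes.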
Temporal Association Rules
• Some properties of the metrics enable us to search efficiently through the lattice in a bottom-up manner.
– Property of strength: The strength of an evolution cube is less than or equal to the highest strength of its subcubes.
– Property of support: The support of an evolution cube is greater than or equal to the support of any of its subcubes.
Temporal Association Rules
• Observation: Many valid but trivial rules may exist.
– (salary: [52000, 56000]) ⇒ (monthly_housing_expense: [1200, 1400])
– (salary: [50000, 56000]) ⇒ (monthly_housing_expense: [1200, 1400])
– Both rules have the same support and strength since no employee’s salary is between 50000 and 52000. However, the first rule conveys more precise information.
[Figure: axes: salary (50000 to 60000) vs. monthly housing expense (1000 to 2000)]
Temporal Association Rules
• Strategy: An interval can be included in a rule only if at least some minimum number of objects have attribute values that fall into that interval.
– The density of each base cube within the evolution cube of a rule has to meet some threshold.
– In the previous example, the second rule can be eliminated.
• Property of density: An evolution cube can satisfy the density threshold only when all of its subcubes satisfy the density threshold.
[Figure: example with min_density = 2]
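The density check can be sketched for the simplest case of one attribute and one snapshot, where a base cube is just a base interval. The salary values are made up, and the helper names are assumptions. The interval [50000, 56000] from the earlier example fails because its base interval [50000, 52000] holds no objects.

```python
from collections import Counter


def base_index(value, lo=50000, width=2000):
    """Index of the base interval containing `value`."""
    return int((value - lo) // width)


def meets_density(values, interval, min_density, lo=50000, width=2000):
    """True if every base interval inside `interval` holds at least min_density values."""
    counts = Counter(base_index(v, lo, width) for v in values)
    first = base_index(interval[0], lo, width)
    end = base_index(interval[1], lo, width)     # exclusive
    return all(counts[i] >= min_density for i in range(first, end))


# Hypothetical salaries; none falls between 50000 and 52000.
salaries = [52500, 53000, 53500, 54200, 55000, 55800]

print(meets_density(salaries, (52000, 56000), min_density=2))
print(meets_density(salaries, (50000, 56000), min_density=2))
```

The same test extends to several attributes and snapshots by counting objects per base cube instead of per base interval.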
Temporal Association Rules
• General Model:
– Data set D
– Language L
• express properties or define subgroups of data
– Selection predicate q
• evaluate whether a sentence ϕ ∈ L defines a potentially interesting class of D
– Task: find the set {ϕ ∈ L | q(D, ϕ) is true}
• If
– a lattice can be formed on sentences in L and
– a partial order exists on the selection predicate
• then the level-wise algorithm can be used to prune the search space efficiently.
Temporal Association Rules
• Temporal Association Rule:
– Language L: each sentence ϕ ∈ L is a temporal association rule.
– The selection predicate q(D, ϕ) is true iff
• support(D, ϕ) ≥ min_support and
• strength(D, ϕ) ≥ min_strength and
• density(D, ϕ) ≥ min_density
– Task: find the set of temporal association rules which satisfy all three predicates.
• Specialization relation <: defines a lattice on the sentences in L
– subcube/supercube relationship
[Figure: q1, q2, q3]
Temporal Association Rules
• partial order on qi with respect to <
– support(D, ϕ) ≤ support(D, γ) if ϕ < γ
– if strength(D, ϕ) < min_strength for all ϕ < γ, then strength(D, γ) < min_strength
– density(D, ϕ) ≥ density(D, γ) if ϕ < γ
• level-wise algorithm
– basic scheme: start from the most special (general) sentences, then evaluate more and more general (special) sentences, excluding those that cannot be interesting given all the information obtained in earlier iterations.
• Efficient space pruning
– Starting point
– Random sampling
– Order of predicate evaluation
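The basic level-wise scheme can be sketched on a deliberately small lattice: contiguous ranges of base intervals of a single attribute, ordered by the sub-range relation. The sketch assumes a predicate that, like density, can hold for a range only if it holds for every sub-range. It is an illustration of the pruning idea, not the algorithm used for the full three-predicate problem, and the counts are made up.

```python
def level_wise(n_base, is_interesting):
    """Bottom-up search over index ranges (start, end), with end exclusive.

    Assumes that if a range is interesting, every sub-range is too, so a
    candidate is evaluated only when both ranges one step shorter survived.
    """
    level = [(i, i + 1) for i in range(n_base) if is_interesting((i, i + 1))]
    found = list(level)
    while level:
        survivors = set(level)
        candidates = [
            (start, end + 1)
            for (start, end) in level
            if (start + 1, end + 1) in survivors
        ]
        level = [c for c in candidates if is_interesting(c)]
        found.extend(level)
    return found


# Hypothetical object counts per base interval, with min_density = 2.
counts = [0, 3, 3, 1, 2]
calls = []


def dense(index_range):
    calls.append(index_range)
    start, end = index_range
    return all(counts[i] >= 2 for i in range(start, end))


result = level_wise(len(counts), dense)
print(result, len(calls))
```

Only 6 of the 15 possible ranges are evaluated here, because ranges containing a sparse base interval are never generated as candidates.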
Temporal Association Rules
• Efficiency of space pruning
– SR algorithm: after quantization, base intervals are combined as long as their density satisfies the threshold. The original base intervals and the combined intervals are treated as a set of items.
– Setting: 100,000 objects, 100 snapshots, 5 attributes, 500 rules of length 5, density = 2, support = 5%, strength = 1.4
Conclusions and Future Work
• STING+ was developed to support spatial data mining triggers very efficiently by
– employing the spatial locality property and
– postponing the trigger condition evaluation until the cumulative effect might cause the trigger to fire.
• Temporal association rules were introduced to capture relationships among object evolutions.
• Selected continuing work
– Patterns whose cause and consequence do not happen together
• There is a delay before the consequence shows up.
– Patterns involving relationships among objects
• e.g., children tend to live farther away from their parents when they grow up.
Conclusions and Future Work
• Selected future work
– Data mining over the Internet
• data type
• networking issue
– Analytical model
• classify data mining problems
• devise efficient general approach
– Applications
• compiler/programming language
• WWW