Page 1: © Prentice Hall1 DATA MINING Introductory and Advanced Topics Part III Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist

© Prentice Hall 1

DATA MINING
Introductory and Advanced Topics
Part III

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Page 2: Data Mining Outline

PART I
  – Introduction
  – Related Concepts
  – Data Mining Techniques
PART II
  – Classification
  – Clustering
  – Association Rules
PART III
  – Web Mining
  – Spatial Mining
  – Temporal Mining

Page 3: Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining

Page 4: Web Mining Issues

Size
  – More than 350 million pages (1999)
  – Grows at about 1 million pages a day
  – Google indexes 3 billion documents
Diverse types of data

Page 5: Web Data

Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
  – Profiles
  – Registration information
  – Cookies

Page 6: Web Mining Taxonomy

[Slide image not captured in transcript]

Modified from [zai01]

Page 7: Web Content Mining

Extends work of basic search engines
Search Engines
  – IR application
  – Keyword based
  – Similarity between query and document
  – Crawlers
  – Indexing
  – Profiles
  – Link analysis

Page 8: Crawlers

A robot (spider) traverses the hypertext structure in the Web.
Collects information from visited pages.
Used to construct indexes for search engines.
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject

Page 9: Focused Crawler

Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
  – Classifier, which assigns a relevance score to each page based on the crawl topic.
  – Distiller, which identifies hub pages.
  – Crawler, which visits pages based on classifier and distiller scores.

Page 10: Focused Crawler

Classifier relates documents to topics.
Classifier also determines how useful outgoing links are.
Hub Pages contain links to many relevant pages. Must be visited even if their relevance score is not high.
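The interplay of classifier, distiller, and crawler described on these slides can be sketched roughly as follows. This is a minimal illustration and not the method from the text: the Web is a small in-memory dictionary `links`, the classifier is replaced by a precomputed `relevance` table, and the distiller is a simple stand-in that treats a page as a hub when at least `hub_min_links` of its targets are relevant. All of these names and thresholds are assumptions made for the example.

```python
import heapq

def focused_crawl(links, relevance, seeds, threshold=0.5, hub_min_links=2):
    """Best-first crawl that follows links only from relevant pages or hubs.

    links:     dict page -> list of outgoing pages (stand-in for the Web)
    relevance: dict page -> score in [0, 1] (stand-in for the classifier)
    """
    # Max-priority queue on relevance, implemented with negated scores.
    frontier = [(-relevance.get(s, 0.0), s) for s in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier:
        neg_score, page = heapq.heappop(frontier)
        visited.append(page)
        out = links.get(page, [])
        # Hypothetical distiller: a hub links to several relevant pages.
        relevant_targets = sum(1 for t in out if relevance.get(t, 0.0) >= threshold)
        is_relevant = -neg_score >= threshold
        is_hub = relevant_targets >= hub_min_links
        if not (is_relevant or is_hub):
            continue  # do not follow links from irrelevant non-hub pages
        for t in out:
            if t not in seen:
                seen.add(t)
                heapq.heappush(frontier, (-relevance.get(t, 0.0), t))
    return visited
```

In this sketch a low-scoring hub still has its links expanded, while a low-scoring page with few relevant targets is a dead end.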

Page 11: Focused Crawler

[Slide image not captured in transcript]

Page 12: Context Focused Crawler

Context Graph:
  – Context graph created for each seed document.
  – Root is the seed document.
  – Nodes at each level show documents with links to documents at next higher level.
  – Updated during crawl itself.
Approach:
  1. Construct context graph and classifiers using seed documents as training data.
  2. Perform crawling using classifiers and context graph created.

Page 13: Context Graph

[Slide image not captured in transcript]

Page 14: Virtual Web View

Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

Page 15: Personalization

Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.

Page 16: Web Structure Mining

Mine structure (links, graph) of the Web
Techniques
  – PageRank
  – CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

Page 17: PageRank

Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.

Page 18: PageRank (cont’d)

$PR(p) = c \left( \frac{PR(1)}{N_1} + \dots + \frac{PR(n)}{N_n} \right)$
  – $PR(i)$: PageRank for a page $i$ which points to target page $p$.
  – $N_i$: number of links coming out of page $i$
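A small iterative sketch of the formula, under stated assumptions: the slide does not define c, so here it is treated as a normalization constant chosen at each iteration so that all ranks sum to 1. The graph `out_links` is a hypothetical dictionary in which every page appears as a key and has at least one outgoing link.

```python
def pagerank(out_links, iterations=50):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over all pages i that link to p.

    out_links: dict page -> list of pages it links to (every page is a key).
    """
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}  # uniform starting ranks
    for _ in range(iterations):
        raw = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            for p in targets:
                raw[p] += pr[i] / len(targets)  # the PR(i) / N_i term
        c = 1.0 / sum(raw.values())  # assumed role of c: keep ranks summing to 1
        pr = {p: c * value for p, value in raw.items()}
    return pr
```

A page with backlinks from highly ranked pages ends up with a higher rank than a page with the same number of backlinks from poorly ranked pages, which is the weighting idea of the previous slide.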

Page 19: CLEVER

Identify authoritative and hub pages.
Authoritative Pages:
  – Highly important pages.
  – Best source for requested information.
Hub Pages:
  – Contain links to highly important pages.

Page 20: HITS

Hyperlink-Induced Topic Search
Based on a set of keywords, find set of relevant pages – R.
Identify hub and authority pages for these.
  – Expand R to a base set, B, of pages linked to or from R.
  – Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
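The weight calculation step can be sketched as the usual mutually reinforcing update: a page's authority weight sums the hub weights of pages linking to it, and its hub weight sums the authority weights of the pages it links to. The sketch below assumes the base set B is already given as a hypothetical dictionary `links`; building R and B from keywords is not shown.

```python
import math

def hits(links, iterations=20):
    """Compute hub and authority weights for the pages in a base set.

    links: dict page -> list of pages it links to (assumed base set B).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of pages pointing here.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: sum of authority weights of pages pointed to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the weights stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

Sorting pages by authority weight then gives a ranking of the kind the slide describes.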

Page 21: HITS Algorithm

[Slide image not captured in transcript]

Page 22: Web Usage Mining

Extends work of basic search engines
Search Engines
  – IR application
  – Keyword based
  – Similarity between query and document
  – Crawlers
  – Indexing
  – Profiles
  – Link analysis

Page 23: Web Usage Mining Applications

Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

Page 24: Web Usage Mining Activities

Preprocessing Web log
  – Cleanse
  – Remove extraneous information
  – Sessionize
    Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
  – Count patterns that occur in sessions
  – Pattern is a sequence of page references in a session.
  – Similar to association rules
    » Transaction: session
    » Itemset: pattern (or subset)
    » Order is important
Pattern Analysis

Page 25: ARs in Web Mining

Web Mining:
  – Content
  – Structure
  – Usage
Frequent patterns of sequential page references in Web searching.
Uses:
  – Caching
  – Clustering users
  – Develop user profiles
  – Identify important pages

Page 26: Web Usage Mining Issues

Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues

Page 27: Web Log Cleansing

Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code).
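The three cleansing steps might look like the sketch below. The record layout `(ip, url, status, timestamp)`, the rule that a status of 400 or above marks an error, and the suffix test for "page data" are all assumptions made for illustration; real Web logs vary.

```python
def cleanse(records, page_suffixes=(".html", ".htm", "/")):
    """Drop error and non-page records, then anonymize IPs and URLs.

    records: iterable of (ip, url, status, timestamp) tuples (assumed layout).
    """
    ip_ids, url_ids, cleaned = {}, {}, []
    for ip, url, status, timestamp in records:
        # Delete error records and records for figures, scripts, and so on.
        if status >= 400 or not url.endswith(page_suffixes):
            continue
        # Replace IP and URL with unique but non-identifying IDs.
        user = ip_ids.setdefault(ip, f"U{len(ip_ids) + 1}")
        page = url_ids.setdefault(url, f"P{len(url_ids) + 1}")
        cleaned.append((user, page, timestamp))
    return cleaned
```

The same IP or URL always maps to the same ID, so later pattern counting still works on the anonymized log.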

Page 28: Sessionizing

Divide Web log into sessions.
Two common techniques:
  – Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
  – All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
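The second technique (interclick threshold) can be sketched as below. The input layout `(user, page, timestamp)` with timestamps in seconds, and the default threshold of 1500 seconds, are assumptions for the example rather than values prescribed by the slides.

```python
def sessionize(clicks, max_gap=1500):
    """Split clicks into sessions using an interclick-time threshold.

    clicks:  iterable of (user, page, timestamp_seconds) tuples (assumed layout).
    max_gap: a click continues the user's session only if it follows the
             previous click by less than this many seconds.
    """
    open_sessions = {}  # user -> (last timestamp, pages of the current session)
    sessions = []
    for user, page, timestamp in sorted(clicks, key=lambda click: click[2]):
        last = open_sessions.get(user)
        if last is None or timestamp - last[0] >= max_gap:
            pages = []  # start a new session for this user
            sessions.append((user, pages))
        else:
            pages = last[1]
        pages.append(page)
        open_sessions[user] = (timestamp, pages)
    return sessions
```

The first technique differs only in the test: it would compare each timestamp with the start of the session instead of with the previous click.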

Page 29: Data Structures

Keep track of patterns identified during Web usage mining process
Common techniques:
  – Trie
  – Suffix Tree
  – Generalized Suffix Tree
  – WAP Tree

Page 30: Trie vs. Suffix Tree

Trie:
  – Rooted tree
  – Edges labeled with a character (page) from the pattern
  – Path from root to leaf represents pattern.
Suffix Tree:
  – Single child collapsed with parent. Edge contains labels of both prior edges.
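A rough sketch of both ideas, using nested dictionaries: `build_trie` stores one edge per page, and `collapse` merges every single-child chain into one edge whose label concatenates the prior labels. It assumes each page is a one-character string and reserves the key `"$"` (a hypothetical marker) for the count of patterns ending at a node. Note that it collapses a trie of whatever patterns it is given; generating all suffixes of a session first is left out.

```python
def build_trie(patterns):
    """Build a trie in which each edge is labeled with one page."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
        node["$"] = node.get("$", 0) + 1  # number of patterns ending here
    return root

def collapse(node):
    """Merge each single-child node with its parent edge."""
    out = {}
    for label, child in node.items():
        if label == "$":
            out["$"] = child
            continue
        # Follow the chain while the child has exactly one outgoing edge
        # and no pattern ends at it.
        while len(child) == 1 and "$" not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = collapse(child)
    return out
```

For the patterns "AB", "ABC", and "XYZ", the chain X, Y, Z becomes a single edge labeled "XYZ", while "AB" keeps a branch for the longer pattern.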

Page 31: Trie and Suffix Tree

[Slide image not captured in transcript]

Page 32: Generalized Suffix Tree

Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree:
  – Compressed version of generalized suffix tree

Page 33: Types of Patterns

Algorithms have been developed to discover different types of patterns.
Properties:
  – Ordered – Characters (pages) must occur in the exact order in the original session.
  – Duplicates – Duplicate characters are allowed in the pattern.
  – Consecutive – All characters in pattern must occur consecutively in given session.
  – Maximal – Not subsequence of another pattern.
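The difference between the ordered and consecutive properties can be illustrated with two small membership tests. The function names are made up for this sketch, and sessions and patterns are assumed to be strings with one character per page.

```python
def occurs_ordered(pattern, session):
    """True if the pages of pattern appear in session in the same order,
    possibly with other pages in between."""
    remaining = iter(session)
    # Each membership test consumes the iterator up to the match,
    # so later pages must be found after earlier ones.
    return all(page in remaining for page in pattern)

def occurs_consecutive(pattern, session):
    """True if pattern appears in session as one unbroken run."""
    n = len(pattern)
    return any(session[i:i + n] == pattern
               for i in range(len(session) - n + 1))
```

For the session "ABCAD", the pattern "BD" is ordered but not consecutive, "CA" is both, and "DA" is neither.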

Page 34: Pattern Types

Association Rules
  – None of the properties hold
Episodes
  – Only ordering holds
Sequential Patterns
  – Ordered and maximal
Forward Sequences
  – Ordered, consecutive, and maximal
Maximal Frequent Sequences
  – All properties hold

Page 35: Episodes

Partially ordered set of pages
Serial episode – totally ordered with time constraint
Parallel episode – partially ordered with time constraint
General episode – partially ordered with no time constraint

Page 36: DAG for Episode

[Slide image not captured in transcript]

Page 37: Spatial Mining Outline

Goal: Provide an introduction to some spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering

Page 38: Spatial Object

Contains both spatial and nonspatial attributes.
Must have a location-type attribute:
  – Latitude/longitude
  – Zip code
  – Street address
May retrieve object using either (or both) spatial or nonspatial attributes.

Page 39: Spatial Data Mining Applications

Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects

Page 40: Spatial Queries

Spatial selection may involve specialized selection comparison operations:
  – Near
  – North, South, East, West
  – Contained in
  – Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find object close to identified object.
Distance Scan – find object within a certain distance of an identified object where distance is made increasingly larger.
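The three query types can be sketched with brute-force scans over point objects. This is only an illustration: objects are assumed to be named 2-D points in a dictionary, the function names and the `max_radius` cutoff are invented for the example, and a real system would use a spatial index instead of scanning every object.

```python
import math

def region_query(points, xmin, ymin, xmax, ymax):
    """Names of points that fall inside the given rectangle."""
    return [name for name, (x, y) in points.items()
            if xmin <= x <= xmax and ymin <= y <= ymax]

def nearest_neighbor(points, target):
    """Name of the point closest to the target point (excluding itself)."""
    tx, ty = points[target]
    others = (name for name in points if name != target)
    return min(others,
               key=lambda n: math.hypot(points[n][0] - tx, points[n][1] - ty))

def distance_scan(points, target, start=1.0, step=1.0, max_radius=100.0):
    """Grow the search radius until some other point falls within it."""
    tx, ty = points[target]
    radius = start
    while radius <= max_radius:
        found = [name for name, (x, y) in points.items()
                 if name != target and math.hypot(x - tx, y - ty) <= radius]
        if found:
            return radius, found
        radius += step
    return None, []
```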

Page 41: Spatial Data Structures

Data structures designed specifically to store or index spatial data.
Often based on B-tree or Binary Search Tree
Cluster data on disk based on geographic location.
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
  – Quad Tree
  – R-Tree
  – k-D Tree

Page 42: MBR

Minimum Bounding Rectangle
Smallest rectangle that completely contains the object
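For an object given as a list of 2-D vertices, the axis-aligned MBR is just the extremes of the coordinates. The tuple layout `(xmin, ymin, xmax, ymax)` is a convention chosen for this sketch.

```python
def mbr(vertices):
    """Axis-aligned minimum bounding rectangle of a list of (x, y) vertices.

    Returns (xmin, ymin, xmax, ymax).
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))
```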

Page 43: MBR Examples

[Slide image not captured in transcript]

Page 44: Quad Tree

Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.

Page 45: Quad Tree Example

[Slide image not captured in transcript]

Page 46: R-Tree

As with Quad Tree the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
Lowest level cell has only one object.
Tree maintenance algorithms similar to those for B-trees.

Page 47: R-Tree Example

[Slide image not captured in transcript]

Page 48: k-D Tree

Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but successive divisions of the dimension range.
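A minimal construction sketch, assuming objects are points given as equal-length tuples. It uses the common variant that stores one point at every node and splits at the median, cycling through the dimensions level by level; the slides describe cells at the lowest level holding one object, which this sketch approximates rather than reproduces.

```python
def build_kd_tree(points, depth=0):
    """Build a k-D tree as nested dicts; each level splits on one dimension."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # the median point splits the range
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }
```

Searching follows the same rule as a binary search tree, comparing on `node["axis"]` at each level.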

Page 49: k-D Tree Example

[Slide image not captured in transcript]

Page 50: Topological Relationships

Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains
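For objects approximated by axis-aligned rectangles, the five relationships can be told apart with coordinate comparisons. This sketch assumes rectangles are `(xmin, ymin, xmax, ymax)` tuples and treats rectangles that merely touch as overlapping; both choices are assumptions for the example.

```python
def relationship(a, b):
    """Topological relationship of rectangle a to rectangle b."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return "equals"
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "disjoint"
    if bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2:
        return "inside"    # a is covered by b
    if ax1 <= bx1 and ay1 <= by1 and bx2 <= ax2 and by2 <= ay2:
        return "contains"  # a covers b
    return "overlaps"
```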

Page 51: Distance Between Objects

Euclidean
Manhattan
Extensions:

Page 52: Progressive Refinement

Make approximate answers prior to more accurate ones.
Filter out data not part of answer
Hierarchical view of data based on spatial relationships
Coarse predicate recursively refined

Page 53: Progressive Refinement

[Slide image not captured in transcript]

Page 54: Spatial Data Dominant Algorithm

[Slide image not captured in transcript]

Page 55: © Prentice Hall1 DATA MINING Introductory and Advanced Topics Part III Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist

© Prentice Hall 55

STING

STatistical Information Grid-based
Hierarchical technique to divide area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to quad tree
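A simplified sketch of such a grid hierarchy follows. It assumes a square region, keeps only a count and a sum per cell (the full technique keeps richer statistics), and merges 2 x 2 blocks of cells into a parent cell, quad-tree style. The function names and sample data are hypothetical.

```python
def build_bottom(points, size, cells):
    """Bottom-level grid. points: (x, y, value) with 0 <= x, y < size."""
    width = size / cells
    grid = {}
    for x, y, value in points:
        key = (int(x // width), int(y // width))
        count, total = grid.get(key, (0, 0.0))
        grid[key] = (count + 1, total + value)
    return grid

def build_parent(child):
    """Next level up: each parent cell summarizes a 2 x 2 block of child cells."""
    parent = {}
    for (i, j), (count, total) in child.items():
        key = (i // 2, j // 2)
        p_count, p_total = parent.get(key, (0, 0.0))
        parent[key] = (p_count + count, p_total + total)
    return parent

points = [(0.5, 0.5, 10), (1.5, 0.5, 20), (3.5, 3.5, 30)]
level2 = build_bottom(points, size=4, cells=4)  # 4 x 4 cells
level1 = build_parent(level2)                   # 2 x 2 cells
level0 = build_parent(level1)                   # single root cell
```

A query can start at the root and descend only into cells whose summaries look relevant, which is where the hierarchy pays off.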

STING

STING Build Algorithm

STING Algorithm

Spatial Rules

Characteristic Rule
– The average family income in Dallas is $50,000.
Discriminant Rule
– The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule
– The average family income in Dallas for families living near White Rock Lake is $100,000.

Spatial Association Rules

Either antecedent or consequent must contain spatial predicates.
View underlying database as set of spatial objects.
May create using a type of progressive refinement

Spatial Association Rule Algorithm

Spatial Classification

Partition spatial objects
May use nonspatial attributes and/or spatial attributes
Generalization and progressive refinement may be used.

ID3 Extension

Neighborhood Graph
– Nodes – objects
– Edges – connects neighbors
Definition of neighborhood varies
ID3 considers nonspatial attributes of all objects in a neighborhood (not just one) for classification.

Spatial Decision Tree

Approach similar to that used for spatial association rules.
Spatial objects can be described based on objects close to them – Buffer.
Description of class based on aggregation of nearby objects.

Spatial Decision Tree Algorithm

Spatial Clustering

Detect clusters of irregular shapes
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of order of input.

Spatial Clustering

CLARANS Extensions

Remove main memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram

Voronoi

SD(CLARANS)

Spatial Dominant
First clusters spatial components using CLARANS
Then iteratively replaces medoids, but limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive description of cluster.

SD(CLARANS) Algorithm

DBCLASD

Extension of DBSCAN
Distribution Based Clustering of LArge Spatial Databases
Assumes items in cluster are uniformly distributed.
Identifies distribution satisfied by distances between nearest neighbors.
Objects added if distribution is uniform.

DBCLASD Algorithm

Aggregate Proximity

Aggregate Proximity – measure of how close a cluster is to a feature.
Aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull

CRH

Temporal Mining Outline

Goal: Examine some temporal data mining issues and approaches.

Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules

Temporal Database

Snapshot – Traditional database
Temporal – Multiple time points
Ex:

Temporal Queries

Intersection Query
Inclusion Query
Containment Query
Point Query – Tuple retrieved is valid at a particular point in time.

[Figure: time lines comparing the query range (t_s^q, t_e^q) with the database tuple range (t_s^d, t_e^d) for each query type.]
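The four query types can be read as predicates on closed time intervals. The definitions below are an assumption based on the usual meaning of the names (the figure that defined them did not survive extraction): a query range q and a tuple's valid range d are each a (start, end) pair.

```python
def intersection_query(q, d):
    """Tuple qualifies if its range overlaps the query range at all."""
    return q[0] <= d[1] and d[0] <= q[1]

def inclusion_query(q, d):
    """Tuple qualifies if its range lies completely inside the query range (assumed reading)."""
    return q[0] <= d[0] and d[1] <= q[1]

def containment_query(q, d):
    """Tuple qualifies if its range completely covers the query range (assumed reading)."""
    return d[0] <= q[0] and q[1] <= d[1]

def point_query(t, d):
    """Tuple qualifies if it is valid at the single time point t."""
    return d[0] <= t <= d[1]

tuple_range = (5, 10)  # hypothetical valid-time range of one tuple
```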

Types of Databases

Snapshot – No temporal support
Transaction Time – Supports time when transaction inserted data
– Timestamp
– Range
Valid Time – Supports time range when data values are valid
Bitemporal – Supports both transaction and valid time.

Modeling Temporal Events

Techniques to model temporal events.
Often based on earlier approaches
Finite State Recognizer (Machine) (FSR)
– Each event recognizes one character
– Temporal ordering indicated by arcs
– May recognize a sequence
– Require precisely defined transitions between states
Approaches
– Markov Model
– Hidden Markov Model
– Recurrent Neural Network

FSR

Markov Model (MM)

Directed graph
– Vertices represent states
– Arcs show transitions between states
– Arc has probability of transition
– At any time one state is designated as current state.
Markov Property – Given a current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing
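Because of the Markov property, the probability of a whole state sequence is just the product of the arc probabilities along it. A minimal sketch, using a hypothetical two-state model whose states and probabilities are invented for illustration:

```python
# Hypothetical transition probabilities: transitions[current][next]
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sequence_probability(states, transitions):
    """Probability of following the given path, starting in states[0]."""
    p = 1.0
    for current, nxt in zip(states, states[1:]):
        # Only the current state matters, not how we got there.
        p *= transitions[current].get(nxt, 0.0)
    return p

p = sequence_probability(["sunny", "sunny", "rainy"], transitions)  # 0.8 * 0.2
```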

Markov Model

Hidden Markov Model (HMM)

Like MM, but states need not correspond to observable states.
HMM models process that produces as output a sequence of observable symbols.
HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train HMM to recognize a sequence.
Transition and observation probabilities learned from training set.

Hidden Markov Model

Modified from [RJ86]

HMM Algorithm

HMM Applications

Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
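These two questions are commonly answered with the forward algorithm and the Viterbi algorithm, respectively; the slides do not name them, so treat that pairing as an assumption. The sketch below uses a small hypothetical HMM (two hidden states, three observable symbols) with invented probabilities.

```python
states = ["H", "C"]
start = {"H": 0.6, "C": 0.4}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"1": 0.1, "2": 0.4, "3": 0.5},
        "C": {"1": 0.5, "2": 0.4, "3": 0.1}}

def forward(obs):
    """Probability that the HMM produces obs, summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely state path for obs, returned as (probability, path)."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s in states:
            # Best predecessor for state s at this step.
            prob, path = max(((best[r][0] * trans[r][s], best[r][1]) for r in states),
                             key=lambda t: t[0])
            new[s] = (prob * emit[s][o], path + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])

obs = ["3", "1", "3"]
likelihood = forward(obs)
best_prob, best_path = viterbi(obs)
```

Both functions run in time proportional to the sequence length times the square of the number of states, instead of enumerating every possible state path.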

Recurrent Neural Network (RNN)

Extension to basic NN
Neuron can obtain input from any other neuron (including output layer).
Can be used for both recognition and prediction applications.
Time to produce output unknown
Temporal aspect added by backlinks.

RNN

Time Series

Set of attribute values over time
Time Series Analysis – finding patterns in the values.
– Trends
– Cycles
– Seasonal
– Outliers

Analysis Techniques

Smoothing – Moving average of attribute values.
Autocorrelation – relationships between different subseries
– Yearly, seasonal
– Lag – Time difference between related items.
– Correlation Coefficient r
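Both techniques fit in a few lines. The sketch below assumes a simple unweighted moving average and the Pearson correlation coefficient for r; the function names and sample series are illustrative.

```python
def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

def lag_correlation(series, lag):
    """Correlate the series with itself shifted by `lag` positions (lag >= 1)."""
    return correlation(series[:-lag], series[lag:])

smoothed = moving_average([1, 2, 3, 4, 5], window=3)
periodic = [1, 2, 3, 1, 2, 3, 1, 2, 3]
r3 = lag_correlation(periodic, lag=3)  # the series repeats every 3 steps
r1 = lag_correlation(periodic, lag=1)
```

A value of r near 1 at some lag suggests a cycle of that length; here the lag of 3 lines the series up with itself, while a lag of 1 does not.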

Smoothing

Correlation with Lag of 3

Similarity

Determine similarity between a target pattern, X, and sequence, Y: sim(X,Y)
Similar to Web usage mining
Similar to earlier word processing and spelling corrector applications.
Issues:
– Length
– Scale
– Gaps
– Outliers
– Baseline

Longest Common Subseries

Find longest subseries they have in common.
Ex:
– X = <10,5,6,9,22,15,4,2>
– Y = <6,9,10,5,6,22,15,4,2>
– Output: <22,15,4,2>
– Sim(X,Y) = l/n = 4/9
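The example can be reproduced with a standard dynamic-programming scan for the longest common contiguous run. The sketch assumes that a subseries is contiguous and that n is the length of the longer series, which is what 4/9 implies here; the function names are illustrative.

```python
from fractions import Fraction

def longest_common_subseries(x, y):
    """Longest contiguous run of values that appears in both x and y."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run that ended at x[i-2], y[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]

def sim(x, y):
    """Length of the common subseries divided by the longer series length (assumed n)."""
    return Fraction(len(longest_common_subseries(x, y)), max(len(x), len(y)))

X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
common = longest_common_subseries(X, Y)
similarity = sim(X, Y)
```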

Similarity based on Linear Transformation

Linear transformation function f
– Convert a value from one series to a value in the second
ε_f – tolerated difference in results
δ – time value difference allowed

Prediction

Predict future value for time series
Regression may not be sufficient
Statistical Techniques
– ARMA
– ARIMA
NN

Pattern Detection

Identify patterns of behavior in time series
Speech recognition, signal processing
FSR, MM, HMM

String Matching

Find given pattern in sequence
Knuth-Morris-Pratt: Construct FSM
Boyer-Moore: Construct FSM
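As an illustration of the Knuth-Morris-Pratt idea, the sketch below precomputes a failure table from the pattern (this table plays the role of the state machine) and then scans the sequence once without backing up. It is a textbook formulation, not code from the slides.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

position = kmp_search("abcabcabd", "abcabd")
```

The same functions work on lists of events as well as on strings, since they only index and compare elements.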

Distance between Strings

Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character in target string into string
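With only these three transformations, the distance can be computed by dynamic programming over prefixes of the two strings. The sketch assumes a match costs nothing and each delete or insert costs 1; those costs, and the function name, are assumptions rather than values from the slides.

```python
def string_distance(source, target, delete_cost=1, insert_cost=1):
    """Minimum total cost to turn source into target using match, delete, insert."""
    m, n = len(source), len(target)
    # d[i][j] = cost to convert source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delete_cost
    for j in range(1, n + 1):
        d[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(d[i - 1][j] + delete_cost,   # delete source[i-1]
                       d[i][j - 1] + insert_cost)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, d[i - 1][j - 1])   # match, no cost
            d[i][j] = best
    return d[m][n]

distance = string_distance("cat", "cart")
```

Since there is no substitute operation in this list, changing one character costs a delete plus an insert.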

Distance between Strings

Frequent Sequence

Frequent Sequence Example

Purchases made by customers
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
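The support of a sequence is the fraction of customers whose purchase history contains its itemsets in order. The purchase table from the slide did not survive extraction, so the three customer histories below are hypothetical, chosen only so that they reproduce the three support values listed above.

```python
from fractions import Fraction

# Hypothetical purchase histories: customer -> ordered list of transactions (itemsets)
db = {
    "C1": [{"A"}, {"B", "C"}, {"D"}],
    "C2": [{"B", "C"}, {"A"}, {"D"}],
    "C3": [{"D"}, {"A"}],
}

def contains(history, pattern):
    """True if each itemset of pattern is a subset of some transaction, in order."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest later transaction that covers this itemset.
        while pos < len(history) and not itemset <= history[pos]:
            pos += 1
        if pos == len(history):
            return False
        pos += 1
    return True

def support(db, pattern):
    """Fraction of customers whose history contains the pattern."""
    return Fraction(sum(contains(h, pattern) for h in db.values()), len(db))

s_ac = support(db, [{"A"}, {"C"}])
s_ad = support(db, [{"A"}, {"D"}])
s_bcd = support(db, [{"B", "C"}, {"D"}])
```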

Frequent Sequence Lattice

SPADE

Sequential Pattern Discovery using Equivalence classes
Identifies patterns by traversing lattice in a top down manner.
Divides lattice into equivalence classes and searches each separately.
ID-List: Associates customers and transactions with each item.

SPADE Example

ID-List for Sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2
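A rough sketch of how ID-lists yield these counts: each item maps to (customer, transaction) pairs, and the list for a longer sequence is obtained by joining two lists on customer with a time-order check. The ID-list table from the slide is not available, so the data below is hypothetical and was picked to give the same two counts; the join shown is a simplification of what SPADE does.

```python
from collections import defaultdict

# Hypothetical purchase histories: customer -> ordered list of transactions
db = {
    1: [{"A"}, {"B", "C"}, {"D"}],
    2: [{"B", "C"}, {"A"}, {"D"}],
    3: [{"D"}, {"A"}],
}

def build_id_lists(db):
    """Map each item to the (customer id, transaction id) pairs where it occurs."""
    id_lists = defaultdict(list)
    for cid, history in db.items():
        for tid, itemset in enumerate(history):
            for item in sorted(itemset):
                id_lists[item].append((cid, tid))
    return id_lists

def temporal_join(first, second):
    """Occurrences of `second` that follow some occurrence of `first` for the same customer."""
    earliest = {}
    for cid, tid in first:
        earliest[cid] = min(tid, earliest.get(cid, tid))
    return [(cid, tid) for cid, tid in second
            if cid in earliest and tid > earliest[cid]]

def count(id_list):
    """Number of distinct customers in an ID-list."""
    return len({cid for cid, _ in id_list})

id_lists = build_id_lists(db)
count_a = count(id_lists["A"])
count_a_then_d = count(temporal_join(id_lists["A"], id_lists["D"]))
```

The point of the representation is that counts for longer sequences come from joining short lists, without rescanning the original transactions.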

Equivalence Classes

SPADE Algorithm

Temporal Association Rules

Transaction has time: <TID, CID, I_1, I_2, …, I_m, t_s, t_e>
[t_s, t_e] is range of time the transaction is active.
Types:
– Inter-transaction rules
– Episode rules
– Trend dependencies
– Sequence association rules
– Calendric association rules

Inter-transaction Rules

Intra-transaction association rules
– Traditional association Rules
Inter-transaction association rules
– Rules across transactions
– Sliding window – How far apart (time or number of transactions) to look for related itemsets.

Episode Rules

Association rules applied to sequences of events.
Episode – set of event predicates and partial ordering on them

Trend Dependencies

Association rules across two database states based on time.
Ex: (SSN, =) ⇒ (Salary, ≤)
Confidence = 4/5
Support = 4/36

Sequence Association Rules

Association rules involving sequences
Ex:
<{A},{C}> ⇒ <{A},{D}>
Support = 1/3
Confidence = 1

Calendric Association Rules

Each transaction has a unique timestamp.
Group transactions based on time interval within which they occur.
Identify large itemsets by looking at transactions only in this predefined interval.
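The steps above can be sketched by restricting the transactions to one interval and counting itemsets only there. The brute-force counting of itemsets up to size 2, the hour-of-day timestamps, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def large_itemsets_in_interval(transactions, start, end, min_support, max_size=2):
    """Itemsets (up to max_size items) whose support within [start, end) meets min_support."""
    window = [items for t, items in transactions if start <= t < end]
    if not window:
        return {}
    counts = Counter()
    for items in window:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {combo: c / len(window)
            for combo, c in counts.items() if c / len(window) >= min_support}

# Hypothetical transactions: (hour of day, items purchased)
transactions = [
    (9, {"coffee", "bagel"}),
    (10, {"coffee"}),
    (11, {"coffee", "bagel"}),
    (20, {"beer", "chips"}),
    (21, {"beer"}),
]
morning = large_itemsets_in_interval(transactions, 8, 12, min_support=0.6)
whole_day = large_itemsets_in_interval(transactions, 0, 24, min_support=0.6)
```

In this toy data, the bagel and coffee pair is large within the morning interval but not over the whole day, which is the kind of pattern a calendric rule is meant to surface.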