Algorithmics Research on Knowledge Discovery and Data Mining

© Vladimir Estivill-Castro, School of Computing and Information Technology

Upload: tommy96

Post on 29-Nov-2014


Page 1: Algorithmics Algorithmics Research on Knowledge Research on

Outline

• Motivation
• What is Data Mining
• Extent of the Field
    • Web Mining, Text Mining, Software Engineering and Data Mining
• Example of algorithm
    • Attribute Oriented Generalization

Motivation

• The problem of data overload looms ominously ahead.
• Our technology to analyze data and understand massive datasets lags far behind our technology to gather and store data.

Interest in Knowledge Discovery

• Emerged in databases
    • Large logs of transactions
        • Credit card transactions
        • Supermarket transactions
• Data Mining (1960s)
    • “Massaging the data so statistics would reflect my preconceived hypothesis”
• Deductive vs. inductive science
    • Data => hypothesis vs. hypothesis validated by data

Knowledge Discovery

The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in large data sets.

• Data: The geo-referenced layers.
• Information: The average population per administrative region.
• Knowledge: The patterns of growth of population densities and valid explanations for them.

Aspects of Knowledge Discovery

• Nontrivial
    • it goes beyond computing closed-form quantities or evaluating models.
• Valid
    • the discovered patterns are true, with some degree of certainty, on unseen data
• Potentially useful
    • for the user
• Understandable
    • simple, descriptive, informative

What is Data Mining?

• Different answers
    1. Using computers to make sense of large volumes of data
        • using a set of fundamental data manipulation tasks
            • classification
            • association rules / basket analysis
        • what makes it different from Statistics?
    2. A task in the process of knowledge discovery
    3. A multidisciplinary field
        • what makes it different from Statistics?

Mining Different Kinds of Knowledge

• Description
• Estimation
• Prediction
• Affinity grouping
• Classification
• Clustering

Mining Different Kinds of Knowledge (by J. Han)

• Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions.
• Association: Rules like “inside(x, city) → near(x, highway)”.
• Classification: Classify data based on the values in a classifying attribute, e.g., classify countries based on climate.
• Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns.
• Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis.
• Pattern-directed analysis: Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.

Example: Retail

• Bar-code technology makes it possible to collect and store massive amounts of sales data (the basket data).
• Information-driven marketing processes demand mining association rules over basket data.
    • “98% of customers that purchase tires and auto accessories also get service done”

Application of Rules

• Cross-marketing and attached mailing applications.
    • customized and designed junk-mail
• Catalog design, add-on sales.
• Store layout.
• Customer segmentation based on buying patterns.

Illustration

• “90% of purchases that have bread and butter also include milk”.
• It is a rule of the form A ⇒ B (90%).
    • A is the antecedent.
    • B is the consequent.
    • There is a confidence value associated with the rule.

[Table: Customers 1 to 3 (rows) against Products A to D (columns), with an X for each purchase. Customer 1 has three marks, Customer 2 two, Customer 3 three; the column positions were lost in extraction.]
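The confidence attached to a rule A ⇒ B can be computed directly from basket data. The following is a minimal sketch, assuming each basket is a Python set of item names; the function names `support` and `confidence` and the four toy baskets are illustrative and do not come from the slides.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    if not baskets:
        return 0.0
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Fraction of the baskets containing the antecedent that also
    contain the consequent (an estimate of P(B | A))."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [b for b in baskets if antecedent <= b]
    if not with_a:
        return 0.0
    return sum(1 for b in with_a if consequent <= b) / len(with_a)


# Hypothetical toy baskets (not data from the slides).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

rule_conf = confidence(baskets, {"bread", "butter"}, {"milk"})
rule_supp = support(baskets, {"bread", "butter", "milk"})
```

On these toy baskets, two of the three purchases containing bread and butter also include milk, so the rule holds with confidence 2/3 and support 1/2.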

Usage of mining for rules

• Find all rules that have Diet Coke as the consequent.
    • Help plan what should be done to boost the sales of Diet Coke.
• Find all the rules that have bagels in the antecedent.
    • Determine what products may be impacted if the store discontinues selling bagels.

Usage of mining rules

• Find all rules that have sausage in the antecedent and mustard in the consequent.
    • what items should be sold with sausages to make it highly likely that mustard will also be sold.
• Find all rules relating items located in shelves A and B.
    • understand if the distance affects the sales of items from both shelves.
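Once rules have been mined, the queries on these two slides reduce to filtering on the antecedent and the consequent. A minimal sketch, assuming rules are stored as (antecedent, consequent, confidence) records; the `Rule` type, the helper names and the three sample rules are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float


def with_consequent(rules: List[Rule], item: str) -> List[Rule]:
    """Rules that have `item` in the consequent."""
    return [r for r in rules if item in r.consequent]


def with_antecedent(rules: List[Rule], item: str) -> List[Rule]:
    """Rules that have `item` in the antecedent."""
    return [r for r in rules if item in r.antecedent]


def linking(rules: List[Rule], a_item: str, c_item: str) -> List[Rule]:
    """Rules with `a_item` in the antecedent and `c_item` in the consequent."""
    return [r for r in rules
            if a_item in r.antecedent and c_item in r.consequent]


# Hypothetical mined rules, for illustration only.
rules = [
    Rule(frozenset({"chips"}), frozenset({"diet coke"}), 0.60),
    Rule(frozenset({"bagels"}), frozenset({"cream cheese"}), 0.80),
    Rule(frozenset({"sausage", "buns"}), frozenset({"mustard"}), 0.70),
]

diet_coke_rules = with_consequent(rules, "diet coke")
bagel_rules = with_antecedent(rules, "bagels")
sausage_mustard = linking(rules, "sausage", "mustard")
```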

The Knowledge Discovery Process

Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Arrangement) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge

A formal process (and a standard)

Multi-disciplinary

• Finding useful patterns in data is known by different names among different communities:
    • data mining (statistics, databases)
    • knowledge discovery
    • information discovery, information harvesting
    • data archeology
    • pattern processing

A multi-disciplinary field

[Diagram of contributing fields: Databases, Statistics, Visualization, Logic Programming, Machine Learning]

Knowledge Discovery

• Discovery of useful knowledge from data.
    • Databases
    • Machine Learning
    • Pattern Recognition
    • Artificial Intelligence
    • Knowledge Acquisition
    • Scientific Discovery
    • High-Performance Computing
    • Algorithms (Analysis and Design)
    • Statistics

Differences with Statistics

A significant amount of overlap

• Statistics
    • probabilistic models
    • descriptive, non-parametric, exploratory
    • mathematically sound (advanced)
    • informative and predictive
• KDDM
    • concern for computability and scalability
    • interpret data
    • understandable
    • data on electronic media (and structured)

Exploratory Analysis of Massive Data Sets

Extent of the Field

Large number of conferences and specialized workshops

• KDD - ACM
    • but now IEEE conference on KDD
    • SIAM conference on KDD
• PAKDD (2001 - 5th Pacific-Asia Conf. on KDD)
• PKDD (2000 - 4th European Conf. on Principles and Practice of KDD)
• Database conferences
    • SIGMOD
    • VLDB
• Artificial Intelligence and Machine Learning
    • ICML
    • ECML

Successful Example

• Recent analysis of Bank of America loan database
    • 250 fields per customer
    • back to 1914!
    • over nine million records
• A clustering tool was used to automatically segment customers into groups with many similar categorical attributes.
    • 14 groups identified, only one could be explained
• The interesting cluster had 2 properties
    • 39% of customers had business and personal accounts with the bank
    • this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer

Text Mining

• With / without Natural Language Processing
• Different from Information Retrieval/Extraction
• Results:
    • There exists a text (or a set of texts) such as speak-much-of Miss Lewinsky & speak-little-of Mrs. Clinton
    • There exists another text such as speak-a-lot-of Mrs. Clinton & do-not-speak-of Miss Lewinsky

Spatial Data Mining

• Data Mining
    • system generates hypothesis
    • search (and visualization) in abstract space
    • inductive generalizations exceeding content of database
• GIS
    • user generates hypothesis
    • visualization in geographical space
    • shows what’s inside the data
    • hard to visualize multivariate dependencies on a map

Spatial Data Mining

[Figures: Decision Tree; Thematic Map]

Spatial Associations

FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"
FROM Washington_Golf_courses, Washington
WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
    AND Washington.CFCC <> "D81"
IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
SET SUPPORT THRESHOLD 0.5

Spatial Associations & Hierarchy of Spatial Relationships

• Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
• Topological relations:
    • intersects, overlaps, disjoint, etc.
• Spatial orientations:
    • left_of, west_of, under, etc.
• Distance information:
    • close_to, within_distance, etc.
• Hierarchy of spatial relationship:
    • “g_close_to”: near_by, touch, intersect, contain, etc.
    • First search for rough relationship and then refine it.

Example: Spatial Association Rule Mining

• “What kinds of spatial objects are close to each other in B.C.?”
    • Kinds of objects: cities, water, forests, usa_boundary, mines, etc.
• Rules mined:
    • is_a(x, large_town) ^ intersect(x, highway) → adjacent_to(x, water). [7%, 85%]
    • is_a(x, large_town) ^ adjacent_to(x, georgia_strait) → close_to(x, u.s.a.). [1%, 78%]
• Mining method: Apriori + multi-level association + geo-spatial algorithms (from rough to high precision).

Spatial Classification

• Generalization-based induction
• Interactive classification

Visualization of Predicted Distribution

Spatial Prediction and Trend Analysis

• Spatial trend predictive modeling (Ester et al.’97):
    • Discover centers: local maxima of some non-spatial attribute.
    • Determine the (theoretical) trend of some non-spatial attribute when moving away from the centers.
    • Discover deviations (from the theoretical trend).
    • Explain the deviations.
• Example: Trend of unemployment rate change according to the distance to Munich.
• Similar modeling can be used to study the trend of temperature with altitude, the degree of pollution in relation to regions of population density, etc.
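The first three steps above (find a center, fit a trend against distance from it, flag deviations) can be sketched as follows. This is a simplified illustration and not the method of Ester et al.: it assumes a single center taken as the global maximum, a straight-line trend fitted by least squares, and a fixed tolerance on the residuals. All names and the toy points are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def trend_and_deviations(points, tolerance):
    """points: list of (x, y, value). Returns (center, slope, deviations)."""
    # Simplification: one center, the point with the largest attribute value.
    cx, cy, _ = max(points, key=lambda p: p[2])
    dists = [math.hypot(x - cx, y - cy) for x, y, _ in points]
    values = [v for _, _, v in points]
    intercept, slope = fit_line(dists, values)
    # A deviation is a point whose value is far from the fitted trend.
    deviations = [p for p, d in zip(points, dists)
                  if abs(p[2] - (intercept + slope * d)) > tolerance]
    return (cx, cy), slope, deviations


# Hypothetical toy data: the attribute falls with distance, with one outlier.
points = [(0, 0, 10), (1, 0, 8), (2, 0, 6), (3, 0, 9), (4, 0, 2)]
center, slope, deviations = trend_and_deviations(points, tolerance=2.5)
```

Explaining the flagged deviations, the last step on the slide, remains a task for the analyst.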

Spatial Clustering

• How can we cluster points?
• What are the distinct features of the clusters?

There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!

Mining in Image & Raster Databases

• Magellan project (Fayyad et al.’96, JPL).
    • identify volcanoes on the surface of Venus
    • over 30,000 high resolution images
    • Resolution accuracy: 80%
    • 3 steps: data focusing, feature extraction, and classification learning
• POSS-II project (Palomar Observatory Sky Survey II)
    • 2 × 10^9 stellar objects (galaxies, stars, etc.) classified
    • Resolution: one magnitude better than in previous studies
    • Classification accuracy: no normalization 75%, with normalization 94%, and compared with neural networks.
• QuakeFinder (Stolorz et al.’96): Find earthquakes from space.
    • using statistics, massive parallelism, and global optimization

Basket Analysis / Link Analysis

• Amazon.com
    • Who bought what together
    • What are related books
• Finding who has fax at home
    • Get phone data for businesses with a fax number
    • Get usage records of lines to find who dials a business fax number from home for longer than 20 seconds
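The fax example is a two-step link analysis: collect the business fax numbers, then scan the call records. A minimal sketch, assuming call records are (caller, callee, seconds) tuples and that the sets of home lines and business fax numbers are already known; the function name and all phone numbers are made up.

```python
def likely_home_fax_owners(calls, business_fax_numbers, home_lines,
                           min_seconds=20):
    """Home lines that dialed a business fax number for longer than
    `min_seconds` seconds."""
    return sorted({
        caller
        for caller, callee, seconds in calls
        if caller in home_lines
        and callee in business_fax_numbers
        and seconds > min_seconds
    })


# Hypothetical call records: (caller, callee, duration in seconds).
calls = [
    ("555-0101", "555-9000", 45),   # home line, long call to a fax
    ("555-0102", "555-9000", 5),    # too short, probably a misdial
    ("555-0103", "555-1234", 60),   # callee is not a fax number
    ("555-0777", "555-9000", 90),   # caller is not a home line
]
business_fax_numbers = {"555-9000"}
home_lines = {"555-0101", "555-0102", "555-0103"}

owners = likely_home_fax_owners(calls, business_fax_numbers, home_lines)
```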

Crime detection

• crime investigation (e.g., the Oklahoma City bombing)
• fraud detection [Italy KDD-99 San Diego] but also Australian Taxation Office and HIC [PAKDD-99]
    • Characterization of Doctor Shoppers

Software Engineering and Data Mining

KDD techniques used in Software Engineering

Clustering

• (Mancoridis): Clustering for graph partitioning towards low coupling and high cohesion.
    • Graph partitioning algorithms using hill-climbing
• (Ouyang) Clustering towards improving the reusability in the design phase.
• (Anquetil) Re-modularization of software

Decomposition into subsystems - Non-OO case

• Input from ISA
    • Software S made of a set of programs P and a set of files F.
• Representation of the database of the system
    • A set alpha: A = {P′, F′} with P′ ⊆ P, F′ ⊆ F
    • Creation of a grouping table
        • Programs on rows, files in columns
        • T = t1, t2, ..., t|F|
        • ti = { p ∈ P | p uses fi }
• KDDM
    • Association rules
        • “c% of the programs that use file X also use files Y and Z”. Results in a group of programs that uses a similar set of files.
    • Collect and summarize results
• Association rules are used to guide clustering
    • A set of programs and files makes a subsystem

[Table: files f1 to f4 (rows) against programs p1 to p5 (columns), with an X where a program uses a file. f1 has four marks, f2 two, f3 three, f4 four; the column positions were lost in extraction.]

cp3 = 3/4, cp4 = 3/3

p4 ⇒ p3 (100%)
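The grouping table and the rule p4 ⇒ p3 can be illustrated with a small sketch. The placement of the marks below is hypothetical: it was chosen only to match the row counts (4, 2, 3, 4) and the figures cp3 = 3/4 and cp4 = 3/3 on the slide, read here as the confidences of p3 ⇒ p4 and p4 ⇒ p3. That reading, and the name `program_confidence`, are assumptions.

```python
# t_i: the set of programs that use file f_i (hypothetical placement of marks).
uses = {
    "f1": {"p1", "p2", "p3", "p4"},
    "f2": {"p3", "p5"},
    "f3": {"p2", "p3", "p4"},
    "f4": {"p1", "p3", "p4", "p5"},
}


def program_confidence(uses, a, b):
    """Among the files used by program `a`, the fraction also used by `b`,
    i.e. the confidence of the rule a => b over the grouping table."""
    with_a = [programs for programs in uses.values() if a in programs]
    if not with_a:
        return 0.0
    return sum(1 for programs in with_a if b in programs) / len(with_a)


conf_p4_p3 = program_confidence(uses, "p4", "p3")  # 3 of 3 files
conf_p3_p4 = program_confidence(uses, "p3", "p4")  # 3 of 4 files
```

Pairs of programs with high confidence in both directions are natural candidates for the same subsystem, which is how the rules can guide the clustering.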

Decomposition into Subsystems - OO

• Using metrics, OO, KDDM and clustering to split with low coupling.
• Approach:
    • Create list of related entity pairs (classes, methods and objects)
    • Use OO metrics to create sets of metrics
        • CBOSet, RFCSet, and DACSet (DAC: Data Abstraction Coupling)
        • CBO_d, CBO_d’, and DAC_t
    • Generate matrices of classes that interact (interaction matrices)
    • Apply KDDM algorithms (association rules)
    • Use hierarchical clustering on the coefficients of the association rules to produce a hierarchical decomposition.

[Figure: CBOSet’ interaction matrix]

Case Study - OO System II

Mozilla - Netscape Communicator
• HTML Editor

Mozilla Statistics

SUMMARY Symbol Table Statistics of 1223 projects
Files: 6713
Includes: 36492
Macros: 27024
Functions: 15898
Types: 3176
Variables: 11151
Enums: 715
Userdef: 0
Classes: 5933
    Instance Variables: 23757
    Methods: 41015
    Friends: 273
    Localdefs: 290

SUMMARY File Type Statistics of 1223 projects
File Type            Number of files
HTML                 763
Header               3331
Implementation       3382
Make                 117
Project Description  1309

HTML Editor Statistics (HTML Composer/Editor - Mozilla)

SUMMARY Symbol Table Statistics of 30 projects
Files: 111
Includes: 697
Macros: 147
Functions: 60
Types: 0
Variables: 79
Enums: 4
Userdef: 0
Classes: 90
    Inst Vars: 415
    Methods: 1320
    Friends: 25
    Localdefs: 46

SUMMARY File Type Statistics of 30 projects
File Type            Number of files
HTML                 4
Header               65
IDL Interface        4
Image                57
Implementation       42
Project Description  38

Graph Mining and WEB Mining

• Step beyond link analysis
• Finding sub-graphs that are similar
    • Chemical molecules
    • CAD/CAM parts (designs)
    • Patterns of use
• Analysis of WEB data
• Text Mining / Multimedia Mining

Time Series Mining

• Predicting stock market
• Monitoring condition of equipment, weather, pilot behavior during long flights.

An example of generalization

• For attribute-oriented data

Automated Attribute Oriented Induction

• Illustration of basic strategies
    • based on a hierarchy of concepts,
    • where discovery is initiated by a query for a rule with necessary conditions for a class.

The data set

STUDENT

NAME      STATUS     MAJOR       BIRTH PLACE  GPA
Anderson  M.A.       History     Vancouver    3.5
Bach      junior     Math        Calgary      3.7
Carlton   junior     Lib. Arts   Edmonton     2.6
Fraser    M.S.       Physics     Ottawa       3.9
Gupta     PhD        Math        Bombay       3.3
Jackson   senior     Computing   Victoria     3.5
Hart      sophomore  Chemistry   Richmond     2.7
Liu       PhD        Biology     Shanghai     3.4
Wang      M.S.       Statistics  Nanjing      3.2
Wise      freshman   Literature  Toronto      3.9

The concept hierarchy

For attribute STATUS:

ANY
    undergraduate: freshman, sophomore, junior, senior
    graduate: M.A., M.S., PhD

The discovery query

• In relation STUDENT, learn characteristic rule for STATUS=“graduate” in relevance to NAME, BIRTH PLACE, GPA
    • (threshold value of 3)

The induction

• 1) Select “graduate” students.

NAME      STATUS  MAJOR       BIRTH PLACE  GPA  VOTE
Anderson  M.A.    History     Vancouver    3.5  1
Fraser    M.S.    Physics     Ottawa       3.9  1
Gupta     PhD     Math        Bombay       3.3  1
Liu       PhD     Biology     Shanghai     3.4  1
Monk      PhD     Computing   Victoria     3.8  1
Wang      M.S.    Statistics  Nanjing      3.2  1
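Step 1 selects the tuples whose STATUS value lies below “graduate” in the concept hierarchy and gives each an initial vote of 1. A minimal sketch, assuming the hierarchy is stored as a child-to-parent dictionary; the names `STATUS_PARENT`, `is_a` and `select_class` are illustrative, and only five of the STUDENT rows are used to keep the example short.

```python
# Concept hierarchy for STATUS, stored as child -> parent.
STATUS_PARENT = {
    "freshman": "undergraduate", "sophomore": "undergraduate",
    "junior": "undergraduate", "senior": "undergraduate",
    "M.A.": "graduate", "M.S.": "graduate", "PhD": "graduate",
    "undergraduate": "ANY", "graduate": "ANY",
}


def is_a(value, concept, parent):
    """True if `value` equals `concept` or lies below it in the hierarchy."""
    while value is not None:
        if value == concept:
            return True
        value = parent.get(value)
    return False


def select_class(rows, concept):
    """Keep rows whose STATUS generalizes to `concept`; start each vote at 1."""
    return [dict(row, VOTE=1) for row in rows
            if is_a(row["STATUS"], concept, STATUS_PARENT)]


# A few rows of the STUDENT relation.
students = [
    {"NAME": "Anderson", "STATUS": "M.A.", "MAJOR": "History"},
    {"NAME": "Bach", "STATUS": "junior", "MAJOR": "Math"},
    {"NAME": "Fraser", "STATUS": "M.S.", "MAJOR": "Physics"},
    {"NAME": "Gupta", "STATUS": "PhD", "MAJOR": "Math"},
    {"NAME": "Wise", "STATUS": "freshman", "MAJOR": "Literature"},
]

graduates = select_class(students, "graduate")
```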


The induction

• 2) Eliminate attribute “STATUS” because it is “graduate” for all students.

NAME      MAJOR       BIRTH PLACE  GPA  VOTE
Anderson  History     Vancouver    3.5  1
Fraser    Physics     Ottawa       3.9  1
Gupta     Math        Bombay       3.3  1
Liu       Biology     Shanghai     3.4  1
Monk      Computing   Victoria     3.8  1
Wang      Statistics  Nanjing      3.2  1

The induction

• 3) Generalize on the smallest decomposable components.
    • Not illustrated here, because we do not have composite attributes.
• 4) If there is a large set of distinct values for an attribute but there is no higher level concept provided for the attribute, the attribute should be removed.
    • Attribute “NAME” satisfies this.
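The removal test in step 4 can be written as a short check. This is a sketch under two assumptions that the slides do not state explicitly: “large” means more distinct values than the generalization threshold (3 in the discovery query), and concept hierarchies exist for MAJOR, BIRTH PLACE and GPA. The function name `removable_attributes` is hypothetical.

```python
def removable_attributes(rows, hierarchies, threshold):
    """Attributes with more than `threshold` distinct values and no
    concept hierarchy to generalize them."""
    removable = []
    for attribute in rows[0]:
        distinct = {row[attribute] for row in rows}
        if len(distinct) > threshold and attribute not in hierarchies:
            removable.append(attribute)
    return removable


# The six graduate tuples after STATUS has been dropped.
rows = [
    {"NAME": "Anderson", "MAJOR": "History", "BIRTH PLACE": "Vancouver", "GPA": 3.5},
    {"NAME": "Fraser", "MAJOR": "Physics", "BIRTH PLACE": "Ottawa", "GPA": 3.9},
    {"NAME": "Gupta", "MAJOR": "Math", "BIRTH PLACE": "Bombay", "GPA": 3.3},
    {"NAME": "Liu", "MAJOR": "Biology", "BIRTH PLACE": "Shanghai", "GPA": 3.4},
    {"NAME": "Monk", "MAJOR": "Computing", "BIRTH PLACE": "Victoria", "GPA": 3.8},
    {"NAME": "Wang", "MAJOR": "Statistics", "BIRTH PLACE": "Nanjing", "GPA": 3.2},
]

# Assumption: these attributes have a concept hierarchy, NAME does not.
hierarchies = {"MAJOR", "BIRTH PLACE", "GPA"}

to_remove = removable_attributes(rows, hierarchies, threshold=3)
```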

The induction

• Applying 4): eliminate attribute “NAME” because it has too many values.

NAME      MAJOR       BIRTH PLACE  GPA  VOTE
Anderson  History     Vancouver    3.5  1
Fraser    Physics     Ottawa       3.9  1
Gupta     Math        Bombay       3.3  1
Liu       Biology     Shanghai     3.4  1
Monk      Computing   Victoria     3.8  1
Wang      Statistics  Nanjing      3.2  1

A generalization

MAJOR       BIRTH PLACE  GPA  VOTE
History     Vancouver    3.5  2
Physics     Ottawa       3.9  1
Math        Bombay       3.3  1
Biology     Shanghai     3.4  1
Computing   Victoria     3.8  2
Statistics  Nanjing      3.2  3

The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.

Concept tree ascension

• 4) If there is a higher level concept in the concept tree for an attribute, the substitution of the higher level concept generalizes the tuple.

[Concept tree for MAJOR: the values History, Physics, Math, Biology, Computing and Statistics generalize to the higher level concepts Science and Arts.]
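Concept tree ascension replaces a value by its parent concept and merges tuples that become identical, adding up their votes. A minimal sketch: the assignment of majors to Arts and Science below is an assumption (the flattened tree does not show the edges), the tuples are projected onto MAJOR alone so that the merging is visible, and the name `ascend` is illustrative.

```python
from collections import Counter

# Assumed concept tree for MAJOR (child -> parent).
MAJOR_PARENT = {
    "History": "Arts",
    "Physics": "Science", "Math": "Science", "Biology": "Science",
    "Computing": "Science", "Statistics": "Science",
}


def ascend(tuples_with_votes, position, parent):
    """Replace the value at `position` by its parent concept and
    accumulate the votes of tuples that become identical."""
    merged = Counter()
    for values, vote in tuples_with_votes:
        values = list(values)
        values[position] = parent.get(values[position], values[position])
        merged[tuple(values)] += vote
    return merged


# The six graduate tuples projected onto MAJOR, each with vote 1.
majors = [(("History",), 1), (("Physics",), 1), (("Math",), 1),
          (("Biology",), 1), (("Computing",), 1), (("Statistics",), 1)]

generalized = ascend(majors, 0, MAJOR_PARENT)
```

With full tuples the same function applies unchanged; tuples merge only once every attribute has been generalized far enough for them to coincide.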

A generalization of each attribute

[Table with columns MAJOR, BIRTH PLACE, GPA, VOTE, flattened in extraction. Column values as extracted: MAJOR: Art, Science, Science, Science, Science. BIRTH PLACE: Ontario, British Columbia, India, China, British Columbia. GPA: excellent, excellent, excellent, good, good. VOTE: 35, 30, 10, 15, 10.]

The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.

Threshold Control

• If the number of distinct values of an attribute is larger than the generalization threshold value, further generalization on this attribute should be performed.
    • Not the case for “Major”
    • It is the case for “Birth Place”
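Threshold control amounts to climbing the concept tree until the attribute has no more distinct values than the threshold. A minimal sketch with the threshold of 3 from the discovery query; the BIRTH PLACE hierarchy is assembled from the levels that appear in the slides' tables (city, then province or country, then Canada or foreign), and the name `generalize_until` is illustrative.

```python
# Assumed concept tree for BIRTH PLACE (child -> parent).
PLACE_PARENT = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Ottawa": "Ontario",
    "Bombay": "India",
    "Shanghai": "China", "Nanjing": "China",
    "British Columbia": "Canada", "Ontario": "Canada",
    "India": "foreign", "China": "foreign",
}


def generalize_until(values, parent, threshold):
    """Climb one level at a time while the attribute still has more
    than `threshold` distinct values."""
    values = list(values)
    while len(set(values)) > threshold:
        lifted = [parent.get(v, v) for v in values]
        if lifted == values:
            break  # no higher level concept left to climb to
        values = lifted
    return values


places = ["Vancouver", "Ottawa", "Bombay", "Shanghai", "Victoria", "Nanjing"]
generalized_places = generalize_until(places, PLACE_PARENT, threshold=3)

# MAJOR already has only two distinct values, so it is left unchanged.
majors = ["Art", "Science", "Science", "Science", "Science", "Science"]
generalized_majors = generalize_until(majors, {}, threshold=3)
```

The first pass leaves four distinct values (British Columbia, Ontario, India, China), still above the threshold, so a second pass lifts them to Canada and foreign.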

Further generalization

MAJOR    BIRTH PLACE  GPA        VOTE
Art      Canada       excellent  35
Science  foreign      good       25
Science  Canada       excellent  40

The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.

Transformation to rules

A graduate student is either (with 75% probability) a Canadian with an excellent GPA or (with 25% probability) a foreign student majoring in science with a good GPA.

MAJOR           BIRTH PLACE  GPA        VOTE
{Art, Science}  Canada       excellent  75
Science         foreign      good       25
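The final step merges tuples that differ only in MAJOR and turns the accumulated votes into percentages. A minimal sketch, assuming the three generalized tuples with votes 35, 40 and 25; the pairing of votes with rows is a reconstruction from the flattened table, and the name `to_rule_terms` is illustrative.

```python
def to_rule_terms(rows):
    """Merge rows that agree on (BIRTH PLACE, GPA) and express the
    accumulated votes as a percentage of all votes."""
    total = sum(vote for *_, vote in rows)
    grouped = {}
    for major, place, gpa, vote in rows:
        majors, votes = grouped.get((place, gpa), (set(), 0))
        grouped[(place, gpa)] = (majors | {major}, votes + vote)
    return [(sorted(majors), place, gpa, 100 * votes / total)
            for (place, gpa), (majors, votes) in grouped.items()]


# Generalized relation (row assignment assumed, see the note above).
rows = [
    ("Art", "Canada", "excellent", 35),
    ("Science", "Canada", "excellent", 40),
    ("Science", "foreign", "good", 25),
]

terms = to_rule_terms(rows)
```

The two resulting terms correspond to the disjuncts of the characteristic rule: Canadian with an excellent GPA (75%), and foreign science major with a good GPA (25%).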

Questions?