Self-Organised Data Mining – 20 Years after GUHA-80. Martin Kejkula, KEG, 8th April 2004

Page 1
Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula
KEG, 8th April 2004
http://gama.vse.cz/keg/

Page 2
Agenda
• Idea of Self-Organised Data Mining
• GUHA-80 revival
• Process of Self-Organised Data Mining
• Key factors for Self-Organised Data Mining
• Metabase, Knowledge Base, etc.
• Proposed EverMiner system for Self-Organised Data Mining

Page 3
Introduction
• Motivation: support X-Miner users
  • Collection of best practices and known problems
• Mueller, Lemke: Self-Organising Data Mining (2000)
• My thesis:
  • Design/test strings of jobs for EverMiner
  • Formalization/using heuristics

Page 4
References (1)
• Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134
• Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60

Page 5
References (2)
• Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982
• Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/

Page 6
References (3)
• Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003.
• Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

Page 7
GUHA-80: Main Features
• Application of artificial intelligence to exploratory data analysis
• To generate interesting views onto given empirical data (recognize interesting logical patterns)
• Views: relevant, useful

Page 8
GUHA-80 Sources (1)
• GUHA
  • Automatically generate all interesting hypotheses
• Lenat’s AM
  • Jobs (tasks)
  • Agenda of jobs
  • Hundreds of heuristic rules
  • Concepts

Page 9
GUHA-80 Sources (2)
• GUHA-80 vs. Lenat’s AM
  • Data
    • Data-processing procedures
• Statistical program packages
  • Effective modules

Page 10
GUHA-80 Paradigm
• Open-ended data analysis
  • To maximize interestingness value
• Hundreds of heuristic rules
  • Guide to define and study next step
  • Access potentially relevant rules,
  • Find truly relevant rules,
  • Follow truly relevant rules

Page 11
Interestingness in GUHA-80
• No explicit definition
• Determined by interplay
  • Heuristic rules
  • Weighting mechanisms
  • Testing in practice (adequate behaviour?)
• No algorithm, but constraints

Page 12
Principles of GUHA-80
• Domain dependence (…exploratory data analysis)
• Combine human capabilities with the machine
• More heuristics are relevant
• Interactivity with user
• Non-routine (GUHA-80 is not for everyday data processing)

Page 13
GUHA-80 Structure (1)

Page 14
GUHA-80 Structure (2)
• Input empirical data
• Input parameters
  • How “interestingness” is understood
• Effective modules (system’s knowledge)
  • Clustering procedures
  • GUHA procedures
• Agenda of jobs (priority/weight)
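
The agenda of jobs with priorities or weights can be illustrated with a small priority queue. This is only a sketch under stated assumptions: the `Agenda` class, the job names and the weights are hypothetical and are not taken from GUHA-80. The only ideas borrowed from the slide are that the job with the highest weight is carried out next and that a finished job may place further jobs on the agenda.

```python
import heapq
import itertools


class Agenda:
    """Hypothetical agenda: jobs are popped in order of decreasing weight."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker, keeps insertion order

    def add(self, weight, name, action):
        # heapq is a min-heap, so the weight is stored negated.
        heapq.heappush(self._heap, (-weight, next(self._counter), name, action))

    def run(self, max_jobs=100):
        executed = []
        while self._heap and len(executed) < max_jobs:
            _, _, name, action = heapq.heappop(self._heap)
            executed.append(name)
            # A job may propose follow-up jobs as (weight, name, action) triples.
            for follow_up in action():
                self.add(*follow_up)
        return executed


def describe_data():
    # Illustrative job: after basic statistics, propose two follow-up jobs.
    return [(0.9, "guha_procedure", lambda: []), (0.2, "clustering", lambda: [])]


agenda = Agenda()
agenda.add(0.5, "describe_data", describe_data)
agenda.add(0.1, "report", lambda: [])
order = agenda.run()
print(order)
```

In this toy run the follow-up jobs proposed by `describe_data` overtake the low-weight `report` job, which is the behaviour a weighted agenda is meant to give.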

Page 15
GUHA-80 Structure (3)
• Heuristics: optimal way to realize a job
• Changing system of concepts
• Hierarchy of concepts (applicability)
• Possible unification of heuristics, jobs,…

Page 16

Page 17

Page 18

Page 19

Page 20
GUHA-80 Input
• Data
• Input information
  • Decompositions/orderings of sets of quantities
  • Help understand “interestingness”

Page 21
GUHA-80 Effective modules
• Evaluation of usual statistical characteristics,…
• Complicated procedures
• Synthesis of parameters (“job on job”)

Page 22
GUHA-80
• Hundreds of heuristic rules
• No explicit definition of interestingness (exploration in a space)
• Interactivity with the user
• Non-routine character

Page 23
Process of S-O Data Mining
[Diagram labels: Empirical Data; Domain Knowledge,…; Chains of Data & Knowledge Processing Tasks (DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …); All Interesting Views, Patterns]

Page 24
Process of S-O Data Mining

Page 25
Key Factors of S-O Data Mining
• Data Preparation
• Modeling
• Evaluation
• Knowledge Base
• Domain Knowledge

Page 26
Data Preparation
• Discretization
  • Attribute type dependent:
    • Nominal/Ordinal/Interval/Ratio
  • Type of coefficient dependent
  • Discretization-Modeling Cycle (KL, 4ft, CF,…)
  • Known problem with intervals of categories without values
• Usually not one target attribute

Page 27
Attribute type dependent discretization
• Nominal
  • Classes of values
• Ordinal
  • Extreme/missing values
• Type of coefficient
• Usually not one target attribute

Page 28
Intervals of Categories without Values

Page 29
Intervals of Categories without Values
• Solution:
  • Statistics – extreme values
  • 4ft Task: correlations, implications
  • Potentially interesting patterns
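
One way to read the problem is that an extreme value stretches the range of an equidistant discretization, so that a run of neighbouring categories ends up with no values at all. The sketch below follows that reading, which is an assumption rather than something the slides state. The function names `equidistant_bins` and `empty_runs` and the sample data are hypothetical.

```python
def equidistant_bins(values, n_bins):
    """Count how many values fall into each of n_bins equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are equal: everything lands in the first interval.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum sits on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def empty_runs(counts):
    """Return (first, last) index pairs of consecutive empty categories."""
    runs, start = [], None
    for i, c in enumerate(counts):
        if c == 0 and start is None:
            start = i
        elif c != 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(counts) - 1))
    return runs


ages = [21, 23, 25, 28, 30, 31, 34, 120]  # 120 plays the role of an extreme value
counts = equidistant_bins(ages, 10)
gaps = empty_runs(counts)
print(counts, gaps)
```

A long run of empty categories next to a sparsely filled one is the kind of signal that could trigger a check of the extreme values before mining continues.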

Page 30
Extreme/Missing Values
• 4ft: Find associations between extreme/missing values (impl/correl)
• CF, KL: Find patterns with extreme/missing values

Page 31
Data Preparation
• Classes of attributes
• Partial cedents
• Associations between attributes in one class
• Associations between partial cedents

Page 32
Evaluation-Modeling
• Input information for partial cedents
• Mining for Interesting Patterns
  • Exceptions
  • Missing values
  • Extreme values
• Discovered hypotheses
  • Groups of hypotheses
  • Coverage of input data by hypotheses
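
Coverage of the input data by the discovered hypotheses can be made concrete with a short sketch. It rests on assumptions that are not in the slides: a hypothesis is simplified to an (antecedent, succedent) pair of boolean attributes, and a row counts as covered when it satisfies both sides of at least one hypothesis. All names are hypothetical.

```python
def covered_rows(rows, hypotheses):
    """Indices of rows that satisfy antecedent and succedent of some hypothesis."""
    return {
        i
        for i, row in enumerate(rows)
        if any(row[ant] and row[suc] for ant, suc in hypotheses)
    }


def coverage(rows, hypotheses):
    """Fraction of input rows covered by at least one hypothesis."""
    return len(covered_rows(rows, hypotheses)) / len(rows)


def uncovered(rows, hypotheses):
    """Rows not covered by any hypothesis, i.e. candidates for further mining."""
    done = covered_rows(rows, hypotheses)
    return [row for i, row in enumerate(rows) if i not in done]


rows = [
    {"a": True, "b": True},
    {"a": True, "b": False},
    {"a": False, "b": True},
    {"a": True, "b": True},
]
hypotheses = [("a", "b")]
print(coverage(rows, hypotheses), uncovered(rows, hypotheses))
```

The rows returned by `uncovered` are the part of the data on which a follow-up mining task could be run.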

Page 33
Heuristic Rules (1)
Examples:
• IF more extreme/missing values are found THEN search for associations with extreme/missing values
• IF 0 hypotheses are found THEN set up weaker quantifier (p, Base) values
• IF a subset of the input data is not covered by hypotheses THEN search for associations covering these data
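
The second rule can be sketched as a loop that weakens the quantifier until something is found. The slide only names the parameters (p, Base). The sketch assumes they belong to a founded-implication style quantifier on the four-fold table, a / (a + b) >= p and a >= Base, and it only relaxes p. Function names, the relaxation schedule and the toy data are all hypothetical.

```python
from itertools import permutations


def four_fold(rows, antecedent, succedent):
    """Return the four-fold table (a, b, c, d) of two boolean attributes."""
    a = sum(1 for r in rows if r[antecedent] and r[succedent])
    b = sum(1 for r in rows if r[antecedent] and not r[succedent])
    c = sum(1 for r in rows if not r[antecedent] and r[succedent])
    d = sum(1 for r in rows if not r[antecedent] and not r[succedent])
    return a, b, c, d


def founded_implication(a, b, p, base):
    """Assumed quantifier: a / (a + b) >= p and a >= base."""
    return a >= base and (a + b) > 0 and a / (a + b) >= p


def mine(rows, attributes, p, base):
    """Test every ordered pair of attributes against the quantifier."""
    found = []
    for ant, suc in permutations(attributes, 2):
        a, b, _, _ = four_fold(rows, ant, suc)
        if founded_implication(a, b, p, base):
            found.append((ant, suc))
    return found


def mine_with_relaxation(rows, attributes, p_schedule, base):
    """IF 0 hypotheses are found THEN retry with the next, weaker p."""
    for p in p_schedule:
        found = mine(rows, attributes, p, base)
        if found:
            return p, found
    return None, []


T, F = True, False
rows = [
    {"smoker": T, "cough": T},
    {"smoker": T, "cough": T},
    {"smoker": T, "cough": T},
    {"smoker": T, "cough": F},
    {"smoker": F, "cough": F},
    {"smoker": F, "cough": T},
    {"smoker": F, "cough": T},
]
result = mine_with_relaxation(rows, ["smoker", "cough"], (0.9, 0.8, 0.7), base=2)
print(result)
```

A fixed schedule of p values is used instead of repeated subtraction so that the thresholds are exact and the loop is guaranteed to stop.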

Page 34
Heuristic Rules (2)
Examples:
• IF nominal type of column (input data matrix) AND no associated table for discretization THEN each value is one category (attribute creation)
• Use “subset” coefficient type for nominal attributes
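
Both rules can be illustrated in a few lines. The sketch assumes that a "subset" coefficient means any set of categories up to a given size, which suits nominal attributes because their categories have no order. An interval-style coefficient, shown for contrast, only combines neighbouring categories. The function names and sample values are hypothetical and do not reproduce any X-Miner interface.

```python
from itertools import combinations


def one_category_per_value(column):
    """Nominal column without a discretization table: each value is a category."""
    return sorted(set(column))


def subset_coefficients(categories, max_len):
    """All sets of 1..max_len categories; category order plays no role."""
    return [
        set(combo)
        for n in range(1, max_len + 1)
        for combo in combinations(categories, n)
    ]


def interval_coefficients(categories, max_len):
    """Runs of 1..max_len consecutive categories; only sensible if ordered."""
    return [
        categories[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(categories) - n + 1)
    ]


colours = one_category_per_value(["red", "blue", "red", "green"])
nominal = subset_coefficients(colours, 2)
ordinal = interval_coefficients(["low", "mid", "high"], 2)
print(colours, len(nominal), ordinal)
```

For three categories the subset type yields six coefficients of length one or two, including the non-adjacent pair, while the interval type yields five.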

Page 35
Metabase, Knowledge Base
• Metadata (Knowledge):
  • Results of Previous X-Miner Tasks
  • Domain Knowledge
  • Interaction with User (learning?)

Page 36
GUHA-80 vs. X-Miner (1)
• Task parameters (partial cedents, …)
• SW, HW
• Experience with LM applications,…

Page 37
GUHA-80 vs. X-Miner (2)
• More complex heuristics

Page 38
EverMiner – Features
• Based on LISp-Miner (X-Miners)
• Agenda of jobs, priority/strings
• Heuristics
• Interaction with user
• Allows the process to be repeated on new data (“check” vs. new KDD process)

Page 39
EverMiner – where we are
• Experience (medicine, traffic, shares, sociology,…)
• Heuristics collection (www, brainstorming)
• Co-operation with data preparation experts (FEL, SumatraTT)
• Testing “Strings of jobs” (learning)

Page 40
Discussion