self-organised data mining – 20 years after guha-80 martin kejkula keg 8 th april 2004
Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula
KEG, 8th April 2004
http://gama.vse.cz/keg/

Agenda
• Idea of Self-Organised Data Mining
• GUHA-80 revival
• Process of Self-Organised Data Mining
• Key factors for Self-Organised Data Mining
• Metabase, Knowledge Base, etc.
• Proposed EverMiner system for Self-Organised Data Mining

Introduction
• Motivation: support X-Miner users
  • Collection of best practices and known problems
• Mueller, Lemke: Self-Organising Data Mining (2000)
• My thesis:
  • Design/test strings of jobs for EverMiner
  • Formalization and use of heuristics

References (1)
• Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134
• Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60

References (2)
• Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982
• Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/

References (3)
• Rauch, J.: EverMiner – studie projektu. Dokumentace projektu LISp-Miner, 2003.
• Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

GUHA-80: Main Features
• Application of artificial intelligence to exploratory data analysis
• To generate interesting views onto given empirical data (recognize interesting logical patterns)
• Views: relevant, useful

GUHA-80 Sources (1)
• GUHA
  • Automatically generate all interesting hypotheses
• Lenat’s AM
  • Jobs (tasks)
  • Agenda of jobs
  • Hundreds of heuristic rules
  • Concepts

GUHA-80 Sources (2)
• GUHA-80 vs. Lenat’s AM
  • Data
    • Data-processing procedures
• Statistical program packages
  • Effective modules

GUHA-80 Paradigm
• Open-ended data analysis
  • To maximize interestingness value
• Hundreds of heuristic rules
  • Guide to define and study the next step
  • Access potentially relevant rules,
  • Find truly relevant rules,
  • Follow truly relevant rules

Interestingness in GUHA-80
• No explicit definition
• Determined by the interplay of
  • Heuristic rules
  • Weighting mechanisms
  • Testing in practice (adequate behaviour?)
• No algorithm, but constraints

Principles of GUHA-80
• Domain dependence (…exploratory data analysis)
• Join human possibilities with machine
• More heuristics are relevant
• Interactivity with user
• Non-routine (GUHA-80 is not for everyday data processing)

GUHA-80 Structure (1)

GUHA-80 Structure (2)
• Input empirical data
• Input parameters
  • How “interestingness” is understood
• Effective modules (system’s knowledge)
  • Clustering procedures
  • GUHA procedures
• Agenda of jobs (priority/weight)
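
An agenda of jobs ordered by priority/weight can be pictured as a priority queue. The sketch below is illustrative only: the `Agenda` class, the job names and the weights are assumptions made for the example, not the GUHA-80 implementation.

```python
import heapq
from itertools import count


class Agenda:
    """Hypothetical agenda: the job with the highest weight is taken first."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: equal weights keep insertion order

    def add(self, name, weight):
        # heapq is a min-heap, so the weight is negated to pop the largest first
        heapq.heappush(self._heap, (-weight, next(self._seq), name))

    def pop(self):
        neg_weight, _, name = heapq.heappop(self._heap)
        return name, -neg_weight

    def __len__(self):
        return len(self._heap)


def run_order(jobs):
    """Return job names in the order the agenda would hand them out."""
    agenda = Agenda()
    for name, weight in jobs:
        agenda.add(name, weight)
    order = []
    while len(agenda):
        order.append(agenda.pop()[0])
    return order
```

In a fuller system a heuristic rule would add new jobs or change weights while the agenda is being processed; here the weights are fixed up front to keep the sketch small.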

GUHA-80 Structure (3)
• Heuristics: optimal way to realize a job
• Changing system of concepts
• Hierarchy of concepts (applicability)
• Possible unification of heuristics, jobs, …

GUHA-80 Input
• Data
• Input information
  • Decompositions/orderings of sets of quantities
  • Help understand “interestingness”

GUHA-80 Effective modules
• Evaluation of usual statistical characteristics, …
• Complicated procedures
• Synthesis of parameters (“job on job”)

GUHA-80
• Hundreds of heuristic rules
• No explicit definition of interestingness (exploration in a space)
• Interactivity with the user
• Non-routine character

Process of S-O Data Mining
• Empirical Data
• Chains of Data & Knowledge Processing Tasks
• Domain Knowledge, …
• All Interesting Views, Patterns
• DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …

Process of S-O Data Mining

Key Factors of S-O Data Mining
• Data Preparation
• Modeling
• Evaluation
• Knowledge Base
• Domain Knowledge

Data Preparation
• Discretization
  • Attribute type dependent:
    • Nominal/Ordinal/Interval/Ratio
  • Type of coefficient dependent
  • Discretization-Modeling Cycle (KL, 4ft, CF, …)
  • Known problem with intervals of categories without values
• Usually not one target attribute

Attribute type dependent discretization
• Nominal
  • Classes of values
• Ordinal
  • Extreme/missing values
• Type of coefficient
• Usually not one target attribute

Intervals of Categories without Values

Intervals of Categories without Values
• Solution:
  • Statistics – extreme values
  • 4ft Task: correlations, implications
  • Potentially interesting patterns
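
One reading of the problem, assumed here, is that equal-width discretization leaves some intervals empty when a few extreme values stretch the range. The sketch below only makes that effect visible; the function names and the sample values are made up for illustration.

```python
def equidistant_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: everything lands in the first interval
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum value is clamped into the last interval
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def empty_intervals(values, n_bins):
    """Indices of intervals (categories) that contain no value."""
    return [i for i, c in enumerate(equidistant_counts(values, n_bins)) if c == 0]
```

With the values 1, 2, 3, 4, 100 and five intervals, the single extreme value leaves the three middle intervals empty, which is why the slide points to statistics on extreme values first.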

Extreme/Missing Values
• 4ft: Find associations between extreme/missing values (implications/correlations)
• CF, KL: Find patterns with extreme/missing values
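
Before associations with extreme or missing values can be searched for, each value needs such a flag. A minimal sketch follows, assuming missing values are encoded as `None` and "extreme" means outside the 1.5 × IQR fences; both choices are assumptions for illustration, the slides do not fix a definition.

```python
import statistics


def flag_values(values, k=1.5):
    """Label each value as 'missing', 'extreme' or 'ok'."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        # too few values to estimate quartiles: only missing values are flagged
        return ["missing" if v is None else "ok" for v in values]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flags = []
    for v in values:
        if v is None:
            flags.append("missing")
        elif v < low or v > high:
            flags.append("extreme")
        else:
            flags.append("ok")
    return flags
```

The resulting flags can be treated as an extra categorical attribute, so a 4ft, CF or KL task can include them like any other category.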

Data Preparation
• Classes of attributes
• Partial cedents
• Associations between attributes in one class
• Associations between partial cedents

Evaluation-Modeling
• Input information for partial cedents
• Mining for Interesting Patterns
  • Exceptions
  • Missing values
  • Extreme values
• Discovered hypotheses
  • Groups of hypotheses
  • Coverage of input data by hypotheses
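
Coverage of input data by hypotheses can be checked row by row. The sketch below assumes a row counts as covered when it satisfies every attribute-value condition of at least one hypothesis; the dictionary representation and the attribute names are hypothetical.

```python
def matches(row, hypothesis):
    """True if the row has every attribute value the hypothesis requires."""
    return all(row.get(attr) == value for attr, value in hypothesis.items())


def uncovered_rows(rows, hypotheses):
    """Indices of rows that no hypothesis covers."""
    return [i for i, row in enumerate(rows)
            if not any(matches(row, h) for h in hypotheses)]


def coverage(rows, hypotheses):
    """Share of rows covered by at least one hypothesis."""
    if not rows:
        return 0.0
    return 1 - len(uncovered_rows(rows, hypotheses)) / len(rows)
```

The list of uncovered rows is the natural input for a follow-up task that searches for associations within exactly those rows.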

Heuristic Rules (1)
Examples:
• IF more extreme/missing values found THEN search for associations with extreme/missing values
• IF 0 hypotheses found THEN set up weaker quantifier (p, Base) values
• IF subset of input data not covered by hypotheses THEN search for associations covering these data
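
The second rule can be sketched as a loop that weakens the quantifier until something is found. The sketch assumes the founded-implication reading of (p, Base), i.e. a/(a+b) ≥ p and a ≥ Base, where a counts rows satisfying both antecedent and succedent and b rows satisfying the antecedent only. The step sizes, lower limits, function names and the frequencies in the usage note are made up for illustration.

```python
def founded_implication(a, b, p, base):
    """True when a >= base and the confidence a / (a + b) is at least p."""
    return (a + b) > 0 and a >= base and a / (a + b) >= p


def mine(tables, p, base):
    """Names of candidate rules whose (a, b) frequencies pass the quantifier."""
    return [name for name, (a, b) in tables.items()
            if founded_implication(a, b, p, base)]


def mine_with_relaxation(tables, p, base,
                         p_step=0.05, base_step=5, p_min=0.5, base_min=1):
    """Heuristic: IF 0 hypotheses found THEN weaken (p, Base) and try again."""
    while True:
        found = mine(tables, p, base)
        if found or (p <= p_min and base <= base_min):
            return found, p, base
        # rounding avoids floating-point drift such as 0.7999999999999999
        p = max(p_min, round(p - p_step, 10))
        base = max(base_min, base - base_step)
```

For example, with hypothetical frequencies `{"A => S": (40, 10), "B => S": (12, 8)}` and a start of p = 0.9, Base = 50, the loop relaxes twice and stops at p = 0.8, Base = 40, where "A => S" passes.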

Heuristic Rules (2)
Examples:
• IF nominal type of column (input data matrix) AND no associated table for discretization THEN each value is one category (attribute creation)
• Use “subset” coefficient type for nominal attributes
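
Both rules can be sketched in a few lines. The code assumes missing values are `None`, that a discretization table is a plain value-to-category dictionary, and that a "subset" coefficient means all subsets of categories within a length range. The function names and length limits are hypothetical.

```python
from itertools import combinations


def nominal_categories(column, table=None):
    """Categories for a nominal column.

    Without a discretization table each distinct value becomes one category;
    with a table, values are mapped through it (unmapped values stay as-is).
    """
    present = [v for v in column if v is not None]
    if table is None:
        return sorted(set(present))
    return sorted({table.get(v, v) for v in present})


def subset_coefficients(categories, min_len=1, max_len=2):
    """All subsets of categories with min_len..max_len elements."""
    subsets = []
    for k in range(min_len, max_len + 1):
        subsets.extend(combinations(categories, k))
    return subsets
```

Subsets suit nominal attributes because their categories have no order, so interval-style coefficients over neighbouring categories would carry no meaning.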

Metabase, Knowledge Base
• Metadata (Knowledge):
  • Results of Previous X-Miner Tasks
  • Domain Knowledge
  • Interaction with User (learning?)

GUHA-80 vs. X-Miner (1)
• Task parameters (partial cedents, …)
• SW, HW
• Experiences with LM applications, …

GUHA-80 vs. X-Miner (2)
• More complex heuristics

EverMiner – Features
• Based on LISp-Miner (X-Miners)
• Agenda of jobs, priority/strings
• Heuristics
• Interaction with user
• Allows the process to be repeated on new data (“check” vs. new KDD process)

EverMiner – where we are
• Experiences (medicine, traffic, shares, sociology, …)
• Heuristics collection (www, brainstorming)
• Co-operation with data preparation experts (FEL, SumatraTT)
• Testing “Strings of jobs” (learning)

Discussion