Development in the Ferda project

December 2006

Martin Ralbovský

Content

History
Changes in the 2.0 version, improved GUHA abilities
Background knowledge and ontologies
Further academic development

Ferda project history I

Ferda - successor of the LISp-Miner data mining system, visual and modular environment
Software project at MFF UK
KEG 10.11.2005
  Introduction of the system
  Description of parts of the working environment
  Implementation principles
Znalosti 2006 article
KEG 4.5.2006
  State of development in May 06
  Master theses themes discussed

Ferda project history II

Development since May 06
  "Experimental GUHA Procedures" by Tomáš Kuchař completed
  "Usage of Domain Knowledge for Applications of GUHA Procedures" by Martin Ralbovský completed
  Further development + testing

Available versions of Ferda

Version 1.0 (1.1) - approved MFF project version (+ improvements)
  Copy of the LISp-Miner system in terms of GUHA abilities (almost)
  Dependent on the LISp-Miner hypotheses generation engine
Version 2.0 based on the master thesis of Tomáš Kuchař
  Ferda no longer dependent on the LISp-Miner system
  Improved GUHA abilities (datasource, definition of relevant questions…)

Improved GUHA abilities theoretically I

Definition of a large set of relevant questions (original):
  If $A$ is an attribute and $\alpha$ is a non-empty subset of the categories of attribute $A$, then $A(\alpha)$ is a basic boolean attribute
  Each basic boolean attribute is a boolean attribute
  If $\varphi$ and $\psi$ are boolean attributes, then $\varphi \wedge \psi$, $\varphi \vee \psi$ and $\neg\varphi$ are boolean attributes
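The recursive definition above can be sketched as a small data structure. This is only an illustration in Python, not code from Ferda: the class names `Basic`, `Not`, `And`, `Or` and the function `evaluate` are hypothetical, and a data row is assumed to be a mapping from attribute name to a single category.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Union


@dataclass(frozen=True)
class Basic:
    """Basic boolean attribute A(alpha): attribute name plus a non-empty category subset."""
    attribute: str
    categories: FrozenSet[str]

    def __post_init__(self):
        # The definition requires alpha to be non-empty.
        if not self.categories:
            raise ValueError("the category subset must be non-empty")


@dataclass(frozen=True)
class Not:
    operand: "BooleanAttribute"


@dataclass(frozen=True)
class And:
    left: "BooleanAttribute"
    right: "BooleanAttribute"


@dataclass(frozen=True)
class Or:
    left: "BooleanAttribute"
    right: "BooleanAttribute"


BooleanAttribute = Union[Basic, Not, And, Or]


def evaluate(formula: BooleanAttribute, row: Mapping[str, str]) -> bool:
    """Evaluate a boolean attribute on one data row (attribute name -> category)."""
    if isinstance(formula, Basic):
        return row[formula.attribute] in formula.categories
    if isinstance(formula, Not):
        return not evaluate(formula.operand, row)
    if isinstance(formula, And):
        return evaluate(formula.left, row) and evaluate(formula.right, row)
    if isinstance(formula, Or):
        return evaluate(formula.left, row) or evaluate(formula.right, row)
    raise TypeError(f"not a boolean attribute: {formula!r}")
```

Because `And`, `Or` and `Not` accept any boolean attribute as operand, the connectives can be nested to any depth, which is what the original definition allows.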

Improved GUHA abilities theoretically II

Definition of a large set of relevant questions in LISp-Miner (and Ferda 1.0)
  Literal ~ basic boolean attribute or its negation
  Literal can be basic or remaining
    basic - in each partial cedent there has to be at least one basic literal
    remaining - the opposite
  Partial cedent ~ conjunction of literals
  Cedent ~ conjunction of partial cedents

Improved GUHA abilities theoretically III

Definition of a large set of relevant questions in Ferda 2.0
  Ferda 2.0 fully supports the original definition; the user can use conjunction, disjunction and negation multiple times
  Basic boolean attribute can be
    Basic - the same meaning
    Forced - must be present in every relevant question
    Auxiliary - conjunction and disjunction cannot be formed only with auxiliary boolean attributes (there must be a basic or forced attribute)
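One possible reading of the Basic / Forced / Auxiliary rules is sketched below in Python. It is an interpretation of the slide, not the Ferda 2.0 implementation: the names `Importance` and `admissible_subsets` are hypothetical, and the sketch only decides which sets of basic boolean attributes may serve as operands of one conjunction or disjunction.

```python
from enum import Enum
from itertools import combinations
from typing import Dict, List, Tuple


class Importance(Enum):
    BASIC = "basic"
    FORCED = "forced"
    AUXILIARY = "auxiliary"


def admissible_subsets(
    setting: Dict[str, Importance], max_length: int
) -> List[Tuple[str, ...]]:
    """List attribute subsets that may form one conjunction or disjunction.

    Assumed rules (taken from the slide):
      * every forced attribute must be present,
      * a subset made only of auxiliary attributes is rejected.
    """
    forced = {name for name, imp in setting.items() if imp is Importance.FORCED}
    names = sorted(setting)
    result = []
    for length in range(1, max_length + 1):
        for combo in combinations(names, length):
            chosen = set(combo)
            if not forced <= chosen:
                continue  # a forced attribute is missing
            if all(setting[name] is Importance.AUXILIARY for name in chosen):
                continue  # only auxiliary attributes: not allowed
            result.append(combo)
    return result
```

With one basic and two auxiliary attributes, for example, the auxiliary ones appear only together with the basic one, never alone.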

Improved GUHA abilities practically: 4FT - Ferda 1.0

Improved GUHA abilities practically: 4FT - Ferda 2.0

Improved GUHA abilities practically: KL - Ferda 1.0

Improved GUHA abilities practically: KL - Ferda 2.0

Ferda 2.0 versus LISp-Miner

We compare only the hypotheses generation engines, not the whole systems
Running time of procedures
  4FT approximately equal
  KL faster in Ferda 2.0
  CF faster in Ferda 2.0
  SD procedures much faster in LISp-Miner (no jump optimizations)
Some quantifiers not implemented in Ferda 2.0 (but are easy to implement)
LISp-Miner better tested

Background knowledge I - introduction

Background knowledge is a vague term for knowledge from domain experts that aids KDD.
There is no central definition or theory; different authors use the term differently.
The definition for GUHA mining: a set of various verbal rules that are accepted in a specific domain as common knowledge.
Background knowledge can be used as an effective means of communication between the knowledge expert and the data miner.
Usage of background knowledge in GUHA is described in the master thesis of Martin Ralbovský (and elsewhere)

Background knowledge II - examples

Sociomedical domain:
  If education increases, wine consumption increases as well
  Patients with greater responsibility at work tend to drive to work by car
Beer marketing domain:
  Younger consumers prefer draught beer
  Older consumers prefer beer in bottles
  More expensive brands sell better during holidays

Background knowledge III - preferred usage

Domain expert: knowledge about the domain and interpretation knowledge
Data miner: data mining techniques
Labels from the diagram connecting the two:
  Specification of interesting facts to the domain expert
  Rules can be transformed into mining tasks
  Tasks results
  Soundness of DM techniques

Background knowledge IV - in Ferda

Formalization of background knowledge rules sound for GUHA purposes created
Implemented modules of the Ferda system (version 1.1) to validate background knowledge rules
Experiments carried out to find presence of background knowledge rules in the data with the GUHA procedures 4FT and KL
So far rather disappointing results

Background knowledge V - experiment

Presumptions:
  Background knowledge rules are somehow stored in the data
  Data collection and attribute creation without mistakes
Question: Can the rules be found in data with "our" techniques?
Experiment: 8 background knowledge rules tested with the 4FT and KL

Background knowledge VI - results

Founded Implication with default values (base = 0.05, p = 0.95) - 1/8 rules approved
Above Average with default values (base = 0.05, P = 1.2) - 1/8 rules approved
Modifications of Kendall - 2/6 rules approved
Furthermore, quantifiers showed strange results (4/8 FI results with p below 0.4)
How good are our quantifiers?
Bigger experiments are planned for the future
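For orientation, the two 4FT quantifiers named above can be sketched on a four-fold table (a, b, c, d). The sketch is in Python and rests on assumptions that the slides do not confirm: base is taken as a relative frequency a/n (the default 0.05 suggests this), and the Above Average parameter is taken as a multiplicative factor on the overall frequency of the succedent. The names `FourFoldTable`, `founded_implication` and `above_average` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFoldTable:
    a: int  # antecedent and succedent both hold
    b: int  # antecedent holds, succedent does not
    c: int  # succedent holds, antecedent does not
    d: int  # neither holds

    @property
    def n(self) -> int:
        return self.a + self.b + self.c + self.d


def founded_implication(t: FourFoldTable, base: float = 0.05, p: float = 0.95) -> bool:
    """a/(a+b) >= p, with support a/n >= base (relative base is an assumption)."""
    if t.a + t.b == 0:
        return False
    return t.a / t.n >= base and t.a / (t.a + t.b) >= p


def above_average(t: FourFoldTable, base: float = 0.05, factor: float = 1.2) -> bool:
    """a/(a+b) >= factor * (a+c)/n, with support a/n >= base (both assumed forms)."""
    if t.a + t.b == 0 or t.a + t.c == 0:
        return False
    confidence = t.a / (t.a + t.b)
    overall = (t.a + t.c) / t.n
    return t.a / t.n >= base and confidence >= factor * overall
```

Under these assumptions, a rule with confidence below 0.4 fails Founded Implication at p = 0.95 regardless of its support, so such results say nothing by themselves about whether the Above Average condition holds.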

Ontologies I - introduction

In the past, attempts to enhance GUHA mining with domain ontologies (also presented at KEG)
  Data understanding
  Attribute creation
  Decomposition of tasks
  Task creation
Ralbovský's master thesis is the first work to examine automatic processing of domain ontologies
Deep analysis, however no tools implemented

Ontologies II - problems

Technical problems… not so bad
Conceptual problems
  Ontologies express knowledge on a very general level
  For GUHA mining, we need specific knowledge that usually is not present in ontologies
  Example: for attribute creation we need
    Maximum and minimum values
    Extreme values
    Significant values dividing the domain
    Typical values (for nominal domains)
Solution: probably specific ontologies for GUHA mining

Further academic development I

Alexander Kuzmin - "Relational GUHA procedures" master thesis
  Implementation of relational 4FT miner (and possibly others)
  Ferda 2.0, spring 2007
Daniel Kupka - "User support for 4ft-Miner procedure for data mining" master thesis
  Help scenarios depending on the settings of 4FT task
  Complex and modular system
  Ferda 2.0, spring 2007

Further academic development II

Martin Zeman - "Using ontologies in GUHA procedures"
  Definition of GUHA ontologies
  Tools for ontology support
  Ferda 2.0, autumn 2006
Michal Kováč - "User oriented language for solving KDD tasks"
  Only Michal knows what this is about
  Ferda 2.0, autumn 2006

Thank you for your attention.
