Topic Maps for Association Rule Mining

Tomáš Kliegr, Jan Zemánek, Marek Ovečka
Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague


Post on 19-Nov-2014


DESCRIPTION

This paper investigates the possibilities for post-processing the results of association rule mining algorithms with topic maps. Converting discovered association rules (DARs) as well as background knowledge to a topic map representation makes it possible to assess the interestingness of discovered rules automatically with a topic map query language. The paper introduces a DAR ontology based on the GUHA method, a background knowledge ontology, and a way of linking the two ontologies. An example shows how these topic map ontologies can represent particular mining data and how the tolog query language can automatically find interesting rules in such a representation.

TRANSCRIPT

Page 1: Topic Maps for Association Rule Mining

Topic Maps for

Association Rule Mining

Tomáš Kliegr, Jan Zemánek, Marek Ovečka

Department of Information and Knowledge Engineering

Faculty of Informatics and Statistics

University of Economics, Prague

Page 2: Topic Maps for Association Rule Mining

Data Mining using CRISP-DM

The goal of data mining is to obtain useful non-trivial patterns from the data.

Analytical Report

Page 3: Topic Maps for Association Rule Mining

Common data mining tasks

Clustering

Classification

Association rules

Sex(M) and Salary(Low) and District(Havlickuv Brod) => Quality(Bad)

Page 4: Topic Maps for Association Rule Mining

Association Rule Mining

EXAMPLE
Duration(2y+) and District(Prague) => Loan Quality(good)
Antecedent: Duration(2y+) and District(Prague)
Consequent: Loan Quality(good)

Unlike clustering and classification, association rules provide true “nuggets”: rules meeting selected interest measures.

THE PROBLEM WITH INTEREST MEASURES
It is usually not possible to tweak the interest measure thresholds so that only the really interesting rules are output. To be on the safe side, we often get (many!) more rules than desired.

THE QUEST FOR TOPIC MAPS
Select the really interesting rules from the rules output automatically. Help searching through the results.
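The threshold problem can be illustrated with a minimal sketch. It assumes that the interest measures are support and confidence (the slides do not name them), and it uses an invented toy table; the names `ROWS`, `support`, `confidence` and `passes` are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]

# Toy records, invented for illustration only (not the PKDD'99 data).
ROWS: List[Row] = [
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "good"},
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "good"},
    {"Duration": "2y+", "District": "Prague", "LoanQuality": "bad"},
    {"Duration": "1y", "District": "Brno", "LoanQuality": "good"},
]


def matches(row: Row, condition: Row) -> bool:
    """True if the row has every attribute value listed in the condition."""
    return all(row.get(key) == value for key, value in condition.items())


def support(rows: List[Row], antecedent: Row, consequent: Row) -> float:
    """Fraction of all rows that satisfy antecedent and consequent together."""
    both = {**antecedent, **consequent}
    return sum(matches(r, both) for r in rows) / len(rows)


def confidence(rows: List[Row], antecedent: Row, consequent: Row) -> float:
    """Fraction of rows satisfying the antecedent that also satisfy the consequent."""
    covered = [r for r in rows if matches(r, antecedent)]
    if not covered:
        return 0.0
    return sum(matches(r, consequent) for r in covered) / len(covered)


def passes(rows, antecedent, consequent, min_support, min_confidence) -> bool:
    """A rule is output only if it meets both thresholds."""
    return (support(rows, antecedent, consequent) >= min_support
            and confidence(rows, antecedent, consequent) >= min_confidence)


ANTECEDENT = {"Duration": "2y+", "District": "Prague"}
CONSEQUENT = {"LoanQuality": "good"}
```

Lowering `min_support` or `min_confidence` so that no interesting rule is lost also lets many uninteresting rules through, which is the situation the slide describes.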

Page 5: Topic Maps for Association Rule Mining

The quest

- Past results

- Background knowledge

- Redundant rules

Discovered nuggets

More precise tasks
or
Automatic rule filtering

The lingua franca for exchange of data mining models is PMML

Page 6: Topic Maps for Association Rule Mining

Predictive Model Markup Language

• XML Schema
• PMML is the leading standard for statistical and data mining models
• Supported by over 20 vendors and organizations
• Covers the technical part of the CRISP-DM Cycle

http://www.dmg.org/pmml_examples/index.html

Page 7: Topic Maps for Association Rule Mining

PMML is “just” an XML Schema

• Developed for deploying mining models
• Good for migration from one data mining environment to another

But:
• No explicit links between nodes
• Verbose
• Self-contained. Lacks support for
  – Interlinking multiple PMML documents
  – Interlinking PMML with other information
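The "no explicit links between nodes" point can be seen in how a PMML association model ties rules to items through id references. The sketch below parses a simplified fragment: the element and attribute names (`Item`, `Itemset`, `ItemRef`, `AssociationRule`, `antecedent`, `consequent`) follow the PMML association model, but the namespace and required model attributes are omitted, the values are invented, and `resolve_rules` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment; values are invented for illustration.
PMML_FRAGMENT = """
<AssociationModel>
  <Item id="1" value="Duration=2y+"/>
  <Item id="2" value="District=Prague"/>
  <Item id="3" value="LoanQuality=good"/>
  <Itemset id="10"><ItemRef itemRef="1"/><ItemRef itemRef="2"/></Itemset>
  <Itemset id="11"><ItemRef itemRef="3"/></Itemset>
  <AssociationRule antecedent="10" consequent="11"
                   support="0.5" confidence="0.67"/>
</AssociationModel>
"""


def resolve_rules(xml_text: str):
    """Follow the id references by hand: rule -> itemset -> item value."""
    root = ET.fromstring(xml_text)
    items = {item.get("id"): item.get("value") for item in root.iter("Item")}
    itemsets = {
        itemset.get("id"): [items[ref.get("itemRef")]
                            for ref in itemset.iter("ItemRef")]
        for itemset in root.iter("Itemset")
    }
    return [
        (itemsets[rule.get("antecedent")], itemsets[rule.get("consequent")])
        for rule in root.iter("AssociationRule")
    ]
```

The rule element itself only carries the strings "10" and "11"; the consumer has to know that they point at itemsets. In a topic map the same links would be explicit associations.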

Page 8: Topic Maps for Association Rule Mining

Association Rule Mining Ontology

The ontology is a “semantization” of the PMML XML Schema.

DESIGN GUIDELINES
The key design principle was to allow easy transformation of data from PMML to AROn.

SCOPE
The ontology is limited to the subset of PMML relevant to association rule mining: 60 topic types, 50 association types and 20 occurrence types.

USE
No automatic transformation is available yet, but we are working on one using the OKS framework. Currently, data can be input using Ontopoly.

Page 9: Topic Maps for Association Rule Mining

Design guidelines: Elements

• xs:element is mapped to topic type
• Topics are assigned same names as PMML Nodes
  – But respecting spaces between words and capitalization
• Superclasses are introduced for semantically similar XML Nodes
• Named elements used as children in other elements that carry most of the semantics of their parents are merged with parent
• If an XML element has a directly corresponding topic type in the ontology, the URI of the XML element within the schema is used as subject identifier
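The naming and subject identifier guidelines for elements can be sketched as follows. This is not the authors' transformation (the slides state that none is available yet). The schema URI, the `#name` form of the subject identifier and the function names are assumptions made for illustration; superclass introduction and parent/child merging are not covered.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
SCHEMA_URI = "http://example.org/pmml.xsd"  # hypothetical schema location

# Tiny schema fragment using real PMML element names.
XSD_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="AssociationModel"/>
  <xs:element name="AssociationRule"/>
  <xs:element name="Itemset"/>
</xs:schema>
"""


def topic_name(element_name: str) -> str:
    """Insert a space at each lower-to-upper case boundary of the node name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", element_name)


def topic_types(xsd_text: str, schema_uri: str):
    """Map every named xs:element to a topic type description."""
    root = ET.fromstring(xsd_text)
    result = []
    for element in root.iter(XS + "element"):
        name = element.get("name")
        if name is None:  # skip xs:element entries that only carry a ref
            continue
        result.append({
            "name": topic_name(name),
            # Assumed URI pattern: schema location plus element name fragment.
            "subject_identifier": schema_uri + "#" + name,
        })
    return result
```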

Page 10: Topic Maps for Association Rule Mining

Design guidelines: Attributes

• Enumeration restriction on an attribute is mapped as a topic type with enumeration superclass (this is a workaround for missing TMCL support in OKS)
• Attributes that could be interpreted as reference to other elements become associations
• Other attributes become occurrence types
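The three attribute rules amount to a small decision procedure, sketched below. The `Attribute` record and `map_attribute` are hypothetical, and the order in which the rules are tried when more than one could apply is an assumption (the slide only lists them).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Attribute:
    name: str
    enumeration: Tuple[str, ...] = ()  # allowed values, if restricted
    refers_to: Optional[str] = None    # referenced element name, if any


def map_attribute(attr: Attribute) -> str:
    """Decide which topic map construct an XML attribute becomes."""
    if attr.enumeration:
        # Workaround described on the slide: enumeration as a topic type.
        return "topic type with enumeration superclass"
    if attr.refers_to is not None:
        return "association type"
    return "occurrence type"


# Illustrative inputs based on PMML attribute names.
USAGE_TYPE = Attribute("usageType", enumeration=("active", "predicted"))
ANTECEDENT = Attribute("antecedent", refers_to="Itemset")
SUPPORT = Attribute("support")
```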

Page 11: Topic Maps for Association Rule Mining

Design guidelines: Associations

• Names for association types are arbitrarily chosen so that they are most descriptive
• Introduce fewer rather than more associations
  – Minimizes the effort when populating the ontology from PMML
  – Avoids unnecessary inflation of the topic map
• Link only the semantically closest topics
  – Additional “soft” relations can be introduced with inference statements or derived with tolog

Page 12: Topic Maps for Association Rule Mining

Design guidelines: Role types

• Topic types used to map PMML elements are used as role types
  – Unless multiple topics are permitted in association end. In that case superclass is used as a role type, or a new role type is introduced

Page 13: Topic Maps for Association Rule Mining

Two alternative association rule representations
- Apriori based (Item-Itemset)
- GUHA based (Boolean Attributes)

Page 14: Topic Maps for Association Rule Mining

Ongoing work

• Support for background knowledge: “already known association rules”
• Support for schema mapping: “linking of background knowledge with mining results”
• Already in the ontology, distinguished by base of subject identifier
  – Schema Mapping: http://keg.vse.cz/sma/XXX
  – Background Knowledge: http://keg.vse.cz/bko/xxx

Page 15: Topic Maps for Association Rule Mining

Data Mining Use case

PREDICT LOAN QUALITY
Find client characteristics that could be used to predict their attitude to paying back a loan.

BASED ON PAST RECORDS
Input data: records on already given loans

Page 16: Topic Maps for Association Rule Mining

The data

• 6181 clients in the PKDD’99 financial dataset

Data were preprocessed, i.e.

District    district
Prague      Prague
Brno        Brno
…           …

duration                            Duration
Many distinct values in <0;100>     <0;12>
                                    <13;23>
                                    <24;inf>

status    statusAgg
A         Good
B         Medium
C, D      Bad

ID      sex      age   duration      district   Loan quality
5464    male     54    12 [months]   Prague     A
5489    female   20    6 [months]    Ostrava    E
…       ..       ..    ..            ..         ..
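The discretization of loan duration into the three intervals listed on this slide can be sketched as follows. The function name is hypothetical, and the sketch assumes that duration is given in whole months.

```python
def bin_duration(months: int) -> str:
    """Map a loan duration in whole months to one of the slide's intervals."""
    if months < 0:
        raise ValueError("duration must be non-negative")
    if months <= 12:
        return "<0;12>"
    if months <= 23:
        return "<13;23>"
    return "<24;inf>"


# The two sample records from the table above.
SAMPLE = {5464: bin_duration(12), 5489: bin_duration(6)}
```

Both sample clients fall into the first interval; a loan of two years or more would fall into `<24;inf>`.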

Page 17: Topic Maps for Association Rule Mining

Preprocessed data

Association Rule Learner

• … and perhaps 9997 other association rules

Page 18: Topic Maps for Association Rule Mining

WE CAN’T PRESENT ALL 10,000 RULES TO THE CLIENT

ASK CLIENT WHAT HE KNOWS

If loan duration is more than two years and the loan was given in Prague district, we can expect good loan quality.

…background knowledge

Page 19: Topic Maps for Association Rule Mining

Semantize the results

Page 20: Topic Maps for Association Rule Mining

Formalize Background Knowledge

Page 21: Topic Maps for Association Rule Mining

Schema Mapping

• Background knowledge can use a different “vocabulary” than the data
• If we are to use background knowledge in querying, we need to interlink it with the data.

The same approach would apply if we interlinked several mining models (PMMLs).
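As a rough illustration of what the interlinking buys, the sketch below translates background knowledge literals into the vocabulary of the preprocessed data through a lookup table. In the presentation this link is expressed in the topic map, not in Python; the mapping entries, the attribute names on the background knowledge side and the function name are all invented for the example.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Literal = Tuple[str, str]  # (attribute, value)

# Hypothetical mapping from background knowledge terms to data terms.
SCHEMA_MAPPING: Dict[Literal, Literal] = {
    ("LoanDuration", "more than two years"): ("Duration", "<24;inf>"),
    ("LoanDistrict", "Prague district"): ("District", "Prague"),
    ("ExpectedQuality", "good"): ("statusAgg", "Good"),
}


def to_data_vocabulary(literals: Iterable[Literal],
                       mapping: Dict[Literal, Literal]) -> FrozenSet[Literal]:
    """Rewrite each literal via the mapping; unmapped literals pass through."""
    return frozenset(mapping.get(literal, literal) for literal in literals)


BACKGROUND_ANTECEDENT = [
    ("LoanDuration", "more than two years"),
    ("LoanDistrict", "Prague district"),
]
```

Once both sides use the same literals, background knowledge and discovered rules can be compared directly.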

Page 22: Topic Maps for Association Rule Mining

Deleting information with Topic Maps

• Find association rules that subsume background knowledge

Visualization of a tolog query
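The slide shows this filter as a tolog query over the topic map. The sketch below is not tolog; it restates the idea in plain Python under an assumed definition of subsumption: a discovered rule subsumes a piece of background knowledge if it has the same consequent and its antecedent contains every literal of the known antecedent. The slides do not spell out the definition, and the sample rules are invented.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

Literal = Tuple[str, str]  # (attribute, value)


class Rule(NamedTuple):
    antecedent: FrozenSet[Literal]
    consequent: FrozenSet[Literal]


def subsumes(rule: Rule, known: Rule) -> bool:
    """Assumed definition: same consequent, antecedent contains the known one."""
    return (known.antecedent <= rule.antecedent
            and known.consequent == rule.consequent)


def rules_subsuming(rules: List[Rule], known: Rule) -> List[Rule]:
    """Discovered rules that restate something the client already knows."""
    return [rule for rule in rules if subsumes(rule, known)]


KNOWN = Rule(
    frozenset({("Duration", "<24;inf>"), ("District", "Prague")}),
    frozenset({("statusAgg", "Good")}),
)
DISCOVERED = [
    Rule(frozenset({("Duration", "<24;inf>"), ("District", "Prague"),
                    ("sex", "male")}),
         frozenset({("statusAgg", "Good")})),
    Rule(frozenset({("District", "Brno")}),
         frozenset({("statusAgg", "Bad")})),
]
```

Under this definition only the first discovered rule is flagged, so it could be hidden from the client or reported as confirming existing knowledge.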

Page 23: Topic Maps for Association Rule Mining

Summary

• Methodology for transferring XML Schema to Topic Maps
• Association Rule Mining Ontology based on PMML
• Easily extensible to other data mining algorithms
• Initial attempts to formalize background knowledge
• Initial attempts to use Topic Maps for schema mapping

AROn On-Line: http://maiana.topicmapslab.de/u/lmaicher/tm/kliegr