Page 1/64

Data Mining Versus Semantic Web

Veljko Milutinovic, vm@etf.rs

http://home.etf.rs/~vm

This material was developed with financial help of the WUSA fund of Austria.



Page 2/64

Data Mining versus Semantic Web

• Two different avenues leading to the same goal!

• The goal: Efficient retrieval of knowledge from large compact or distributed databases, or the Internet.

• What is knowledge: Synergistic interaction of information (data) and its relationships (correlations).

• The major difference: Placement of complexity!

Page 3/64

Essence of Data Mining

• Data and knowledge represented with simple mechanisms (typically, HTML) and without metadata (data about data).

• Consequently, relatively complex algorithms have to be used (complexity migrated into the retrieval request time).

• In return, low complexity at system design time!

Page 4/64

Essence of Semantic Web

• Data and knowledge represented with complex mechanisms (typically XML) and with plenty of metadata (a byte of data may be accompanied by a megabyte of metadata).

• Consequently, relatively simple algorithms can be used (low complexity at the retrieval request time).

• However, large metadata design and maintenance complexity at system design time.

Page 5/64

Major Knowledge Retrieval Algorithms (for Data Mining)

• Neural Networks

• Decision Trees

• Rule Induction

• Memory-Based Reasoning, etc.

Consequently, the stress is on algorithms!

Page 6/64

Major Metadata Handling Tools (for Semantic Web)

• XML

• RDF

• Ontology Languages

• Verification (Logic + Trust) Efforts in Progress

Consequently, the stress is on tools!

Page 7/64

Issues in Data Mining Infrastructure

Authors: Nemanja Jovanovic, Milenkovic, Veljko Milutinovic

http://home.etf.rs/~vm

Page 8/64

Ivana Vujovic

Erich Neuhold

Peter Fankhauser

Claudia Niederée

Veljko Milutinovic

http://home.etf.rs/~vm

Page 9/64

Data Mining in a Nutshell

Uncovering the hidden knowledge

Huge NP-complete search space

Multidimensional interface

Page 10/64

A Problem …

You are a marketing manager for a cellular phone company

Problem: Churn is too high

Bringing back a customer after quitting is both difficult and expensive

Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)

You pay a sales commission of $250 per contract

Customers receive a free phone (cost $125) with the contract

Turnover (after contract expires) is 40%
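
A rough worked version of the arithmetic on this slide, offered as a sketch only. The commission, phone cost, and turnover figures come from the slide; the customer base of 1,000 and the function names are hypothetical, chosen just to make the numbers concrete.

```python
# Figures taken from the slide.
COMMISSION = 250        # sales commission per contract, in dollars
PHONE_COST = 125        # free phone given with each contract, in dollars
TURNOVER_PERCENT = 40   # share of customers who leave when the contract expires

# Hypothetical value, used only to make the arithmetic concrete.
CUSTOMERS = 1000


def churners(customers: int) -> int:
    """Number of customers expected to leave at contract expiry."""
    return customers * TURNOVER_PERCENT // 100


def replacement_cost(customers: int) -> int:
    """Cost of signing one new contract for every customer who leaves."""
    return churners(customers) * (COMMISSION + PHONE_COST)


def blanket_phone_cost(customers: int) -> int:
    """Cost of giving a new phone to every customer, churner or not."""
    return customers * PHONE_COST


def wasted_phone_cost(customers: int) -> int:
    """Part of the blanket cost spent on customers who would have stayed."""
    return (customers - churners(customers)) * PHONE_COST


if __name__ == "__main__":
    print("replace churners:", replacement_cost(CUSTOMERS))
    print("phone for everyone:", blanket_phone_cost(CUSTOMERS))
    print("of which wasted:", wasted_phone_cost(CUSTOMERS))
```

Under these assumptions, replacing the 400 customers who leave costs $150,000, while handing a phone to all 1,000 customers costs $125,000, of which $75,000 goes to customers who would have stayed anyway.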

Page 11/64

… A Solution

Three months before a contract expires, predict which customers will leave

If you want to keep a customer that is predicted to churn, offer them a new phone

The ones that are not predicted to churn need no attention

If you don’t want to keep the customer, do nothing

How can you predict future behavior?

Tarot Cards?

Magic Ball?

Data Mining?

Page 12/64

Still Skeptical?

Page 13/64

The Definition

The automated extraction of predictive information from (large) databases

Key terms: Automated, Extraction, Predictive, Databases

Page 14/64

History of Data Mining

Page 15/64

Repetition in Solar Activity

1613 – Galileo Galilei

1859 – Heinrich Schwabe

Page 16/64

The Return of the Halley Comet

Edmund Halley (1656 - 1742)

239 BC, 1531, 1607, 1682, 1910, 1986, 2061 ???

Page 17/64

Data Mining is Not

Data warehousing

Ad-hoc query/reporting

Online Analytical Processing (OLAP)

Data visualization

Page 18/64

Data Mining is

Automated extraction of predictive information from various data sources

Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

Page 19/64

Focus of this Presentation

Data Mining problem types

Data Mining models and algorithms

Efficient Data Mining

Available software

Page 20/64

Data Mining Problem Types

Page 21/64

Data Mining Problem Types

6 types

Often a combination solves the problem

Page 22/64

Data Description and Summarization

Aims at concise description of data characteristics

Lower end of scale of problem types

Provides the user an overview of the data structure

Typically a subgoal

Page 23/64

Segmentation

Separates the data into interesting and meaningful subgroups or classes

Manual or (semi)automatic

A problem in itself or just a step in solving a problem

Page 24/64

Classification

Assumption: existence of objects with characteristics that belong to different classes

Building classification models which assign correct labels in advance

Arises in a wide range of applications

Segmentation can provide labels or restrict data sets

Page 25/64

Concept Description

Understandable description of concepts or classes

Close connection to both segmentation and classification

Similarities to and differences from classification

Page 26/64

Prediction (Regression)

Similar to classification; the difference: discrete becomes continuous

Finds the numerical value of the target attribute for unseen objects
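
A minimal sketch of prediction as regression, assuming a single numeric input and a straight-line model fitted by ordinary least squares. The slide names no specific method, so the technique, the function names, and the sample data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return the continuous target value for an unseen object x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Invented training data: the target is numeric, not a class label.
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(model)
    print(predict(model, 10))
```

The only change from classification is the type of the target: `predict` returns a number on a continuous scale rather than one label from a fixed set.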

Page 27/64

Dependency Analysis

Finding the model that describes significant dependencies between data items or events

Prediction of value of a data item

Special case: associations
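
Associations are commonly scored with support and confidence. Those two measures are standard in association-rule mining but are not named on the slide, so treat this as an illustrative sketch; the basket contents and function names are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


if __name__ == "__main__":
    # Invented purchase baskets.
    baskets = [
        {"phone", "case"},
        {"phone", "case", "charger"},
        {"phone"},
        {"charger"},
    ]
    print(support(baskets, {"phone", "case"}))
    print(confidence(baskets, {"phone"}, {"case"}))
```

In this toy data, the rule "phone implies case" holds in two of the three baskets that contain a phone, which is the kind of dependency between data items the slide refers to.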

Page 28/64

Data Mining Models

7A AntiPasti

Analytics

Animation

Anecdotic

Advantages

Applications

AfterTea

Page 29/64

Neural Networks

Characterizes processed data with a single numeric value

Efficient modeling of large and complex problems

Based on biological structures: neurons

Network consists of neurons grouped into layers

Page 30/64

Neuron Functionality

[Figure: inputs I1, I2, I3, …, In, weighted by W1, W2, W3, …, Wn, feed a function f that produces the output]

Output = f (W1*I1, W2*I2, …, Wn*In)
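
A minimal sketch of a single neuron. The slide leaves f unspecified; the version below assumes the common choice of summing the weighted inputs and passing the sum through a sigmoid. Both the summation and the sigmoid are assumptions, as are the function names.

```python
import math


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, activation=sigmoid):
    """Compute the output of one neuron.

    Each input Ii is multiplied by its weight Wi; the products are summed
    (an assumed choice) and the sum is passed through the activation.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = sum(w * i for w, i in zip(weights, inputs))
    return activation(total)


if __name__ == "__main__":
    print(neuron([1.0, 2.0, 3.0], [0.5, -0.5, 0.0]))
```

Passing a different `activation` (for example a step function) changes the neuron's behavior without touching the weighted-sum part.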

Page 31/64

Training Neural Networks

Advantages? Applications? Anecdotes (Dark+Dark)?

Page 32/64

Decision Trees

A way of representing a series of rules that lead to a class or value

Iterative splitting of data into discrete groups maximizing distance between them at each split

CHAID, CART, Quest, C5.0

Classification trees and regression trees

Unlimited growth and stopping rules

Univariate splits and multivariate splits
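
One common way to pick a univariate split is to try each candidate threshold on one attribute and keep the threshold that leaves the purest groups. The slide speaks of maximizing the distance between groups without naming a measure; Gini impurity is used below as one illustrative choice, and the ages and labels are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    Returns (None, impurity_of_all_labels) if no split improves on no split.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds lie halfway between neighboring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    ages = [22, 25, 30, 35, 40, 50]
    outcome = ["leave", "leave", "leave", "stay", "stay", "stay"]
    print(best_split(ages, outcome))
```

A full tree applies the same search recursively to each resulting group until a stopping rule (for example a minimum group size) is met.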

Page 33/64

Decision Trees

Balance>10 Balance<=10

Age<=32 Age>32

Married=NO Married=YES

Page 34/64

Decision Trees

Advantages? Applications? Anecdotes (CRO+Past)?

Page 35/64

Rule Induction

Method of deriving a set of rules to classify cases

Creates independent rules that are unlikely to form a tree

Rules may not cover all possible situations

Rules may sometimes conflict in a prediction

Page 36/64

Rule Induction

If balance>100.000 then confidence=HIGH & weight=1.7

If balance>25.000 and status=married then confidence=HIGH & weight=2.3

If balance<40.000 then confidence=LOW & weight=1.9

Advantages? Applications? Anecdotes (Teeth+Devor)
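
The three rules on this slide can be written down directly. The slide does not say how conflicting rules are reconciled, so the sketch below assumes one simple scheme: add up the weights of all rules that fire for each confidence label and return the label with the largest total. That scheme, the function name, and the sample cases are assumptions; the thresholds read the slide's "100.000" as one hundred thousand.

```python
from collections import defaultdict

# (condition, confidence label, weight), transcribed from the slide.
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(case):
    """Return the confidence label with the largest total weight.

    Returns None when no rule covers the case.
    """
    votes = defaultdict(float)
    for condition, label, weight in RULES:
        if condition(case):
            votes[label] += weight
    if not votes:
        return None
    return max(votes, key=votes.get)


if __name__ == "__main__":
    print(classify({"balance": 30_000, "status": "married"}))
    print(classify({"balance": 50_000, "status": "single"}))
```

The first case triggers two rules that disagree (HIGH with weight 2.3, LOW with weight 1.9), and the second triggers none, which illustrates both caveats from the previous slide: rules may conflict, and they may not cover every situation.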

Page 37/64

K-nearest Neighbor and Memory-Based Reasoning (MBR)

Usage of knowledge of previously solved similar problems in solving the new problem

Assigning the new case to the class to which most of its k nearest "neighbors" belong

First step – finding the suitable measure for distance between attributes in the data

How far is black from green?

+ Easy handling of non-standard data types

- Huge models
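
A minimal k-nearest-neighbor sketch. The distance function is the part the slide flags as the first step; the version below assumes squared differences for numeric attributes and a 0/1 mismatch for everything else, which is only one possible answer to "How far is black from green?". The attribute values and labels are invented.

```python
from collections import Counter


def distance(a, b):
    """Mixed-type distance between two attribute tuples (an assumed measure)."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2            # numeric attribute
        else:
            total += 0.0 if x == y else 1.0  # categorical attribute: match or not
    return total ** 0.5


def knn_classify(examples, query, k=3):
    """Return the majority label among the k stored examples closest to query."""
    nearest = sorted(examples, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # The "model" is simply every stored example, hence the huge-models drawback.
    examples = [
        ((0.1, "black"), "churn"),
        ((0.2, "black"), "churn"),
        ((0.9, "green"), "stay"),
        ((0.8, "green"), "stay"),
        ((0.7, "black"), "stay"),
    ]
    print(knn_classify(examples, (0.15, "black")))
    print(knn_classify(examples, (0.85, "green")))
```

Swapping in a different `distance` is all it takes to handle a non-standard data type, which is the advantage listed above.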

Page 38/64

K-nearest Neighbor and Memory-Based Reasoning (MBR)

Advantages? Applications? Anecdotes (Predictions Based on Similarities)? E: Nd+1s, so(loMOTH), prof(badA>off), prof(abuA>fam)

Page 39/64

Data Mining Models and Algorithms

Logistic regression

Discriminant analysis

Generalized Additive Models (GAM)

Genetic algorithms

Etc…

Many other available models and algorithms

Many application-specific variations of known models

Final implementation usually involves several techniques

Adding selected metadata (e.g., time for prediction) E: 27+sports(5+6+7+11+…+2+1)

Page 40/64

Efficient Data Mining

Page 41/64

[Figure: humorous troubleshooting flowchart. Decision nodes: "Is It Working?", "Did You Mess With It?", "Anyone Else Knows?", "Will it Explode In Your Hands?", "Can You Blame Someone Else?". Outcomes: "Don't Mess With It!", "You Shouldn't Have!", "Hide It", "Look The Other Way", "You're in TROUBLE!", "NO PROBLEM!"]

Page 42/64

DM Process Model

CRISP-DM – is becoming a standard

5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)

SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)

Page 43/64

CRISP-DM

CRoss-Industry Standard Process for DM

Conceived in 1996 by three companies:

Page 44/64

CRISP-DM methodology

Four-level breakdown of the CRISP-DM methodology:

Phases

Generic Tasks

Specialized Tasks

Process Instances

Page 45/64

Mapping generic models to specialized models

Analyze the specific context

Remove any details not applicable to the context

Add any details specific to the context

Specialize generic contents according to concrete characteristics of the context

Possibly rename generic contents to provide more explicit meanings

Page 46/64

Generalized and Specialized Cooking

Generalized: preparing food on your own

• Find out what you want to eat

• Find the recipe for that meal

• Gather the ingredients

• Prepare the meal

• Enjoy your food

• Clean up everything (or leave it for later)

Specialized:

• Raw steak with vegetables?

• Check the cookbook or call mom

• Defrost the meat (if you had it in the fridge)

• Buy missing ingredients or borrow them from the neighbors

• Cook the vegetables and fry the meat

• Enjoy your food or even more

• You were cooking, so convince someone else to do the dishes

Page 47/64

CRISP-DM model

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment

Page 48/64

Business Understanding

Determine business objectives

Assess situation

Determine data mining goals

Produce project plan

Page 49/64

Data Understanding

Collect initial data

Describe data

Explore data

Verify data quality

Page 50/64

Data Preparation

Select data

Clean data

Construct data

Integrate data

Format data

Page 51/64

Modeling

Select modeling technique

Generate test design

Build model

Assess model

Page 52/64

Evaluation

Evaluate results

Review process

Determine next steps

results = models + findings

Page 53/64

Deployment

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project

Page 54/64

At Last…

Page 55/64

Available Software

14

Page 56/64

Comparison of fourteen DM tools

• The Decision Tree products were: CART, Scenario, See5, S-Plus

• The Rule Induction tools were: WizWhy, DataMind, DMSK

• Neural Networks were built from three programs: NeuroShell2, PcOLPARS, PRW

• The Polynomial Network tools were: ModelQuest Expert, Gnosis, a module of NeuroShell2, KnowledgeMiner

Page 57/64

Criteria for evaluating DM tools

A list of 20 criteria for evaluating DM tools, put into 4 categories:

• Capability measures what a desktop tool can do, and how well it does it

- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has programming language
- Provides useful output reports
- Visualisation

Page 58/64

Criteria for evaluating DM tools

• Learnability/Usability shows how easy a tool is to learn and use:

- Tutorials
- Wizards
- Easy to learn
- User’s manual
- Online help
- Interface

Page 59/64

Criteria for evaluating DM tools

• Interoperability shows a tool’s ability to interface with other computer applications

- Importing data
- Exporting data
- Links to other applications

• Flexibility

- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code

Page 60/64

A classification of data sets

• Pima Indians Diabetes data set

– 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
– 8 attributes plus the binary class variable for diabetes per instance

• Wisconsin Breast Cancer data set

– 699 instances of breast tumors, some of which are malignant, most of which are benign
– 10 attributes plus the binary malignancy variable per case

• The Forensic Glass Identification data set

– 214 instances of glass collected during crime investigations
– 10 attributes plus the multi-class output variable per instance

• Moon Cannon data set

– 300 solutions to the equation: x = 2 v^2 sin(θ) cos(θ) / g (θ is the launch angle, g the gravitational acceleration)
– the data were generated without adding noise
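
The equation above has the form of the standard projectile-range formula, with a launch angle inside sin and cos and gravitational acceleration in the denominator. The sketch below generates 300 noise-free solutions on that reading. The lunar gravity value, the speed and angle ranges, the random seed, and the function names are all assumptions; the slide does not say how the original data set was sampled.

```python
import math
import random

LUNAR_GRAVITY = 1.62  # m/s^2, approximate; assumed from the data set's name


def cannon_range(v, angle, g=LUNAR_GRAVITY):
    """Horizontal distance x = 2 v^2 sin(angle) cos(angle) / g."""
    return 2 * v ** 2 * math.sin(angle) * math.cos(angle) / g


def make_dataset(n=300, seed=0):
    """Return n noise-free rows of (speed, angle, distance)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # hypothetical speed range, m/s
        angle = rng.uniform(0.0, math.pi / 2)  # hypothetical angle range, radians
        rows.append((v, angle, cannon_range(v, angle)))
    return rows


if __name__ == "__main__":
    data = make_dataset()
    print(len(data))
    print(data[0])
```

Because the target is an exact function of the two inputs, a data set like this measures how well a tool can recover a known nonlinear relationship rather than how it copes with noise.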

Page 61/64

Evaluation of fourteen DM tools

Page 62/64

Conclusions

Page 63/64

WWW.NBA.COM

Page 64/64

Se7en