data mining and knowledge discovery in databases

Data Mining and Knowledge Discovery in Databases

Outline

• What is Data Mining and KDD?
• Characteristics
• Applications
• Methods
• Packages & Close Relatives

What is Data Mining & KDD?

• “The process of identifying hidden patterns and relationships within data”

or

• “Data mining helps end users extract useful business information from large databases”

What’s the Appeal?

• Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data

• Pervasive data

• Seek competitive advantage

The Challenge

[Slide shows a solid wall of raw, undelimited numeric records, beginning 51020188905212001539458199...]

Process: Knowledge Discovery In Databases

[Diagram: the KDD process. Labelled stages: databases; cleaning & integration; data warehouse; collect and transform; data mining (data mining engines, models); discovered patterns; evaluation & presentation; user interface and expert knowledge (domain). Feedback arrows: modify data selection; modify methods, parameters.]
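The stages in the process diagram can be sketched as a chain of small functions. This is only an illustration: the records, the field names (`region`, `product`) and the `min_count` threshold are hypothetical, and "mining" here is reduced to counting repeated combinations.

```python
from collections import Counter

# Hypothetical raw records; one has a missing value.
raw = [
    {"id": 1, "region": "north", "product": "A"},
    {"id": 2, "region": None, "product": "B"},
    {"id": 3, "region": "north", "product": "A"},
    {"id": 4, "region": "south", "product": "A"},
]

def clean(records):
    """Cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the fields of interest."""
    return [tuple(r[f] for f in fields) for r in records]

def mine(rows):
    """Data mining (toy version): count how often each combination occurs."""
    return Counter(rows)

def evaluate(patterns, min_count):
    """Evaluation: keep only patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

patterns = evaluate(mine(select(clean(raw), ["region", "product"])), min_count=2)
print(patterns)
```

Changing the `fields` list or `min_count` and re-running corresponds to the "modify data selection" and "modify methods, parameters" feedback arrows.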

Context

• Where you stand on Data Mining depends on where you sit:

• Business User

• Researcher

• Computer Scientist

Data Mining Might Mean…

• Statistics
• Visualization
• Artificial intelligence
• Machine learning
• Database technology
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• High performance computing
• And so on...

What’s needed?

• Suitable data
• Computing power
• Data mining software
• Skilled operator who knows both the nature of the data and the software tools
• Reason, theory, or hunch

Typical Applications of Data Mining & KDD

• Marketing
  • Market Basket Analysis
  • Customer Relationship Management
  • New Product Development
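Market basket analysis is commonly described in terms of the support and confidence of rules such as "bread implies milk". The sketch below computes both for a handful of made-up baskets; the items and the function names are assumptions for illustration, not part of the slides.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```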


• Financial Services
  • Credit Approval
  • Fraud Detection
  • Marketing


• Health Care
  • Epidemiological Analysis: incidence and prevalence of disease in large populations, and detection of the source and cause of epidemics of infectious disease
  • Knowledge for funding
  • Policy, programs

Two Basic Approaches

• Supervised
  • A dependent or target variable

• Unsupervised
  • “Pure Data Mining”
  • Fewer assumptions
  • Typically used for clustering techniques
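The difference between the two approaches can be seen on the same toy data. In this sketch the numbers, the `churned` target variable and both function names are hypothetical. The supervised function uses the labels to place a cut-off. The unsupervised function ignores them and simply groups the values around two centres.

```python
# Hypothetical data: monthly spend per customer.
spend = [12.0, 15.0, 14.0, 80.0, 95.0, 88.0]
# Target variable, available only in the supervised setting.
churned = [1, 1, 1, 0, 0, 0]

def supervised_threshold(xs, ys):
    """Supervised: use the target variable to place a cut-off midway
    between the mean of each labelled class."""
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    return (mean0 + mean1) / 2

def two_means(xs, iterations=10):
    """Unsupervised: no target variable; group the values around two
    centres (one-dimensional k-means with k = 2)."""
    lo, hi = min(xs), max(xs)
    for _ in range(iterations):
        group_lo = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        group_hi = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo = sum(group_lo) / len(group_lo)
        hi = sum(group_hi) / len(group_hi)
    return group_lo, group_hi

print(supervised_threshold(spend, churned))
print(two_means(spend))
```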

Automation

• The ability to aim a tool at some data and push a button

• Some methods of KDD/Data mining are more suitable for automation than others

Seven Basic Methods:

1. Decision Trees

2. (Artificial) Neural Networks

3. Cluster/Nearest Neighbour

4. Genetic Algorithms/Evolutionary Computing

5. Bayesian Networks

6. Statistics

7. Hybrids

Decision Trees

• Graphical representations of relationships within data

• Excel at classification and prediction models

Sample of a Decision Tree

[Figure: sample decision tree. The root splits on gender (female / male). Lower nodes split on age (<65 / >=65), married?, good health?, urban? and pet owner? (each yes / no). Leaves are labelled + or -.]

Decision Trees

• Strengths
  • Easily understood and interpreted
  • Represent complexity in a compact form
  • Handle non-linear data well
  • Relatively well suited to automation

• Weaknesses
  • Large trees with large numbers of variables become difficult to understand
  • Missing data must be appropriately managed in construction and use of the models
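One common way to grow a decision tree is to pick, at each node, the attribute whose split gives the largest information gain. The sketch below shows that single step only. The four records and the attribute names are made up for illustration, and entropy-based gain is one of several possible splitting criteria, not necessarily the one behind the sample tree.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training records.
rows = [
    {"gender": "female", "pet_owner": "yes", "outcome": "+"},
    {"gender": "female", "pet_owner": "no", "outcome": "+"},
    {"gender": "male", "pet_owner": "yes", "outcome": "-"},
    {"gender": "male", "pet_owner": "no", "outcome": "-"},
]

# The attribute with the highest gain becomes the split at this node.
best = max(["gender", "pet_owner"],
           key=lambda a: information_gain(rows, a, "outcome"))
print(best)
```

In this toy data `gender` separates the outcomes perfectly and `pet_owner` not at all, so `gender` is chosen as the split.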

Neural Networks

• Derived from Artificial Intelligence research
• Modelled on the human neuron

Neural Networks

[Figure: a small neural network. Input variables (Age, Gender, Income) feed a hidden layer, which feeds a single Prediction node. The connections carry weights: 0.6, 0.3, 0.1, 0.5, 0.7, 0.8, 0.4, 0.3, 0.2.]

Neural Networks

• Strengths
  • Accuracy of prediction
  • Robust performance with a wide variety of data types

• Weaknesses
  • Prone to overfitting
  • Poor clarity of model
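A prediction from a small network of this kind is a weighted sum passed through an activation function, layer by layer. The following is a minimal sketch of one forward pass, assuming a sigmoid activation, two hidden units, inputs scaled to the range 0 to 1, and an arbitrary assignment of weights. None of these choices is specified by the slides.

```python
from math import exp

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> hidden layer -> single prediction."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical scaled inputs: age, gender, income.
inputs = [0.4, 1.0, 0.7]
# Hypothetical weights: one row per hidden unit, one weight per input.
hidden_weights = [[0.6, 0.3, 0.1], [0.5, 0.7, 0.8]]
# Hypothetical weights from the hidden units to the prediction node.
output_weights = [0.4, 0.3]

prediction = forward(inputs, hidden_weights, output_weights)
print(prediction)
```

Training consists of adjusting the weights so that predictions match known outcomes. The weights themselves say little to a human reader, which is the "poor clarity of model" weakness.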

Clustering/Nearest Neighbour

• Aim to assign “like” records to a group
• Groups assigned according to some target variable or criteria
• Nearest neighbour used for prediction

• Applications:
  • Text processing: search engines
  • Image processing: radiology
  • Fraud detection: outliers

• Strengths
  • Easily understood and interpreted
  • Easily implemented in basic situations

• Weaknesses
  • Complex data not well suited to automation (much preprocessing required)
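Nearest-neighbour prediction can be sketched in a few lines: find the k stored records closest to a new record and let them vote. The two-dimensional points and the labels below are made up, and Euclidean distance with k = 3 is an assumption rather than a recommendation.

```python
from collections import Counter
from math import dist

def knn_predict(training, query, k=3):
    """Predict the label of `query` by majority vote among the k
    training records nearest to it (Euclidean distance)."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical records: (features, label).
training = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((8.0, 9.0), "suspicious"),
    ((8.5, 9.5), "suspicious"),
    ((9.0, 8.0), "suspicious"),
]

print(knn_predict(training, (1.1, 1.0)))
print(knn_predict(training, (8.7, 9.0)))
```

Real data usually needs scaling and encoding before distances are meaningful, which is the preprocessing burden listed as a weakness.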

Genetic Algorithms/Evolutionary Computing

• Grounded in Darwin – applied using mathematics

• Require
  • a way to represent a solution to a problem
  • a way to test the “fitness” of the solution

• Solutions are mathematically “mutated”
• Fittest solutions survive
• Convergence
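The ingredients listed above (a representation, a fitness test, mutation, survival of the fittest) fit into a short loop. This sketch uses a deliberately trivial problem: a solution is a list of bits and its fitness is the number of ones. The population size, mutation rate and seed are arbitrary assumptions, and only mutation is used (no crossover).

```python
import random

def fitness(bits):
    """Fitness test: count the ones (a toy objective)."""
    return sum(bits)

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(length=20, population_size=30, generations=60, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # The fittest half survives unchanged.
        survivors = population[: population_size // 2]
        # The rest are replaced by mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rate, rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because survivors are carried over unchanged, the best fitness never decreases from one generation to the next. The flattening of `history` over time is the convergence mentioned above.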

Genetic Algorithms/Evolutionary Computing

• Strengths
  • Suited to novel problems that are poorly understood
  • Suitable where data is dirty or missing
  • May be useful where other methods cannot be applied

• Weaknesses
  • Not easily automated
  • Require creativity in their application

Bayesian Networks

• Based on Bayes’ rule:
  • P(a|b) = P(b|a) * P(a) / P(b)

• Can construct networks of linked events, each with prior probabilities
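A worked instance of Bayes' rule makes the role of the prior concrete. The probabilities below are invented for illustration (a rare event `a` and an imperfect alert `b`). P(b) is expanded with the law of total probability.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(a|b) = P(b|a) * P(a) / P(b), where
    P(b) = P(b|a) * P(a) + P(b|not a) * (1 - P(a))."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 1% of transactions are fraudulent (a);
# an alert (b) fires on 90% of fraud and on 5% of legitimate ones.
p_fraud_given_alert = posterior(p_b_given_a=0.9, p_a=0.01, p_b_given_not_a=0.05)
print(round(p_fraud_given_alert, 4))  # 0.1538
```

Even with a fairly accurate alert, the low prior keeps the posterior near 15%.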

Bayesian Network Example

[Figure: Bayesian network centred on the event “J.R. Shot”. Other nodes: Bobby shot him; Just a dream sequence; Mistress shot him; Wife shot him; Suicide; J.R. treated for depression; Bobby publicly threatened; Producers desperate for ratings; Big fight between wife, mistress.]

Bayesian Networks

• Strengths
  • Clarity of the resulting models
  • Good precision in predicting
  • Easily adapt to new probabilities

• Weaknesses
  • Time consuming to construct and maintain
  • Poor at predicting rare events

Statistics

• With an outcome or dependent variable:
  • Correlations
  • ANOVA
  • Regression

• Used by themselves or to confirm findings of another method
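Two of the techniques named above, correlation and regression, can be computed directly from their textbook formulas. The sketch below does so for a tiny made-up data set in which y is exactly 2x + 1. The function names are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

print(pearson_r(xs, ys))
print(fit_line(xs, ys))
```

Both functions assume a linear relationship, which is one of the limitations these techniques carry with them.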

Statistics

• Strengths
  • “Gold Standard”: valid and trusted in scientific circles

• Weaknesses
  • Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)

Hybrids

• Techniques used in combination
• Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model

Recap

• Data Mining is the core activity or method within a process of Knowledge Discovery in Databases

• Done to find useful information in large amounts of data that could not be found using “conventional” approaches

• Variety of methods
• Requires knowledge of the data domain and the methods, as well as creativity

Data Mining Packages

• Major vendors of database/data management products (IBM, SPSS, Oracle PeopleSoft, SAS, and so on)

• Added as a component of turnkey packages
• May incorporate several methods (SAS Enterprise Miner)
• Single method (TreeAge Software Inc.: a dedicated decision tree product)

How to implement?

• Do it yourself (you know the data domain)
• Put a team together (domain and method specialists)
• Hire a consultant (who knows both your domain and the tools)
• Vertical markets in data mining

Close Relatives of Data Mining

• On-Line Analytical Processing (OLAP)
• Pivot tables in spreadsheets
• General statistical packages

• Intelligent Data Analysis – comprises the use of data mining methods in the analysis of “small” datasets
