Data Mining and Knowledge Discovery in Databases

Upload: judith-casey

Post on 26-Dec-2015


Page 1: Data Mining and Knowledge Discovery in Databases

Data Mining and Knowledge Discovery in Databases

Page 2: Data Mining and Knowledge Discovery in Databases

Outline

• What is Data Mining and KDD?
• Characteristics
• Applications
• Methods
• Packages & Close Relatives

Page 3: Data Mining and Knowledge Discovery in Databases

What is Data Mining & KDD?

• “The process of identifying hidden patterns and relationships within data”

or

• “Data mining helps end users extract useful business information from large databases”

Page 4: Data Mining and Knowledge Discovery in Databases

What’s the Appeal?

• Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data

• Pervasive data

• Seek competitive advantage

Page 5: Data Mining and Knowledge Discovery in Databases

The Challenge

[Figure: a screen filled with raw, undelimited numeric records]

Page 6: Data Mining and Knowledge Discovery in Databases

Process: Knowledge Discovery In Databases

[Figure: flow diagram of the KDD process. Labels: databases; cleaning & integration; data warehouse; collect and transform; data mining; data mining engines, models; discovered patterns; evaluation & presentation; user interface and expert knowledge; domain. Feedback arrows: modify data selection; modify methods, parameters.]

Page 7: Data Mining and Knowledge Discovery in Databases

Context

• Where you stand on Data Mining depends on where you sit:

• Business User

• Researcher

• Computer Scientist

Page 8: Data Mining and Knowledge Discovery in Databases

Data Mining Might Mean…

• Statistics
• Visualization
• Artificial intelligence
• Machine learning
• Database technology
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• High performance computing
• And so on...

Page 9: Data Mining and Knowledge Discovery in Databases

What’s needed?

• Suitable data
• Computing power
• Data mining software
• Skilled operator who knows both the nature of the data and the software tools
• Reason, theory, or hunch

Page 10: Data Mining and Knowledge Discovery in Databases

Typical Applications of Data Mining & KDD

• Marketing
  • Market Basket Analysis
  • Customer Relationship Management
  • New Product Development

Page 11: Data Mining and Knowledge Discovery in Databases

Typical Applications of Data Mining & KDD

• Financial Services
  • Credit Approval
  • Fraud Detection
  • Marketing

Page 12: Data Mining and Knowledge Discovery in Databases

Typical Applications of Data Mining & KDD

• Health Care
  • Epidemiological analysis: incidence and prevalence of disease in large populations, and detection of the source and cause of epidemics of infectious disease
  • Knowledge for funding
  • Policy, programs

Page 13: Data Mining and Knowledge Discovery in Databases

Two Basic Approaches

• Supervised
  • A dependent or target variable

• Unsupervised
  • “Pure Data Mining”
  • Fewer assumptions
  • Typically used for clustering techniques

Page 14: Data Mining and Knowledge Discovery in Databases

Automation

• The ability to aim a tool at some data and push a button

• Some methods of KDD/Data mining are more suitable for automation than others

Page 15: Data Mining and Knowledge Discovery in Databases

Seven Basic Methods:

1. Decision Trees

2. (Artificial) Neural Networks

3. Cluster/Nearest Neighbour

4. Genetic Algorithms/Evolutionary Computing

5. Bayesian Networks

6. Statistics

7. Hybrids

Page 16: Data Mining and Knowledge Discovery in Databases

Decision Trees

• Graphical representations of relationships within data

• Excel at classification and prediction models

Page 17: Data Mining and Knowledge Discovery in Databases

Sample of a Decision Tree

[Figure: decision tree. The root splits on gender (female / male). Lower splits include age (<65 / >=65), married? (yes / no), good health? (yes / no), urban? (yes / no), and pet owner? (yes / no). Leaves are labelled + or -.]
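A decision tree of this kind can be read as a set of nested if/else tests. The sketch below is a minimal Python illustration: it reuses the attribute names from the sample tree, but the branch layout, the leaf labels, and the `classify` function are illustrative assumptions, not a transcription of the slide's tree.

```python
def classify(person):
    """Walk a small hand-built decision tree and return '+' or '-'.

    Attribute names follow the sample tree; the branch layout and the
    leaf labels are hypothetical.
    """
    if person["gender"] == "female":
        if person["age"] < 65:
            return "+" if person["pet_owner"] else "-"
        return "+" if person["good_health"] else "-"
    # Male branch (hypothetical layout).
    if person["married"]:
        return "-" if person["urban"] else "+"
    return "+" if person["pet_owner"] else "-"


example = {"gender": "female", "age": 70, "married": False,
           "good_health": True, "urban": True, "pet_owner": False}
print(classify(example))  # female, 65 or over, in good health -> "+"
```

Each root-to-leaf path is one readable rule, which is why small trees are easy to interpret and large ones are not.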

Page 18: Data Mining and Knowledge Discovery in Databases

Decision Trees

• Strengths
  • Easily understood and interpreted
  • Represent complexity in a compact form
  • Handle non-linear data well
  • Relatively well suited to automation

• Weaknesses
  • Large trees with large numbers of variables become difficult to understand
  • Missing data must be appropriately managed in construction and use of the models

Page 19: Data Mining and Knowledge Discovery in Databases

Neural Networks

• Derived from Artificial Intelligence research
• Modelled on the human neuron

Page 20: Data Mining and Knowledge Discovery in Databases

Neural Networks

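[Figure: feed-forward neural network. Input variables (Age, Gender, Income) connect through weighted links to a hidden layer, which connects through weighted links to a single Prediction output. Weights shown: 0.6, 0.3, 0.1, 0.5, 0.7, 0.8, 0.4, 0.3, 0.2.]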
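The forward pass through such a network can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: two hidden nodes, a sigmoid activation, inputs already scaled to the range 0 to 1, and weight values chosen for illustration. None of these details are confirmed by the slide.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> hidden layer -> single prediction."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Hypothetical weights: one row per hidden node, one column per input.
hidden_weights = [[0.6, 0.3, 0.1],
                  [0.5, 0.7, 0.8]]
output_weights = [0.4, 0.3]

# Hypothetical scaled inputs: age, gender, income.
inputs = [0.45, 1.0, 0.30]
prediction = forward(inputs, hidden_weights, output_weights)
print(round(prediction, 3))
```

Training consists of adjusting the weights until predictions match known outcomes. The weights themselves say little about why a prediction was made, which is the "poor clarity" weakness of the method.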

Page 21: Data Mining and Knowledge Discovery in Databases

Neural Networks

• Strengths
  • Accuracy of prediction
  • Robust performance with a wide variety of data types

• Weaknesses
  • Prone to overfitting
  • Poor clarity of model

Page 22: Data Mining and Knowledge Discovery in Databases

Clustering/Nearest Neighbour

• Aim to assign “like” records to a group
• Groups assigned according to some target variable or criteria
• Nearest neighbour used for prediction
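Nearest-neighbour prediction can be sketched as follows: find the k training records closest to a new record and take a majority vote of their labels. The function name `knn_predict`, the Euclidean distance measure, the value k=3, and the toy records are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest records.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, income in thousands) -> responded to an offer?
train = [((25, 30), "no"), ((30, 35), "no"), ((28, 40), "no"),
         ((55, 90), "yes"), ((60, 85), "yes"), ((58, 95), "yes")]

print(knn_predict(train, (57, 88)))  # the three nearest records are all "yes"
```

In practice the features would need to be put on comparable scales first, which is part of the preprocessing burden for complex data.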

Page 23: Data Mining and Knowledge Discovery in Databases

Clustering/Nearest Neighbour

• Applications:
  • Text processing: search engines
  • Image processing: radiology
  • Fraud detection: outliers

Page 24: Data Mining and Knowledge Discovery in Databases

Clustering/Nearest Neighbour

• Strengths
  • Easily understood and interpreted
  • Easily implemented in basic situations

• Weaknesses
  • Complex data not well suited to automation (much preprocessing required)

Page 25: Data Mining and Knowledge Discovery in Databases

Genetic Algorithms/Evolutionary Computing

• Grounded in Darwin – applied using mathematics

• Require
  • a way to represent a solution to a problem
  • a way to test the “fitness” of the solution

• Solutions are mathematically “mutated”
• Fittest solutions survive
• Convergence
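The loop described above (represent, test fitness, mutate, keep the fittest, repeat until convergence) can be sketched in Python. The toy problem is an assumption chosen for brevity: a solution is a list of bits and its fitness is the number of 1s. The function names, population size, mutation rate, and seed are likewise hypothetical.

```python
import random


def fitness(bits):
    """Toy fitness test: count the 1s in the candidate solution."""
    return sum(bits)


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(length=20, pop_size=30, generations=60, rate=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Fittest half survives unchanged; the rest are mutated copies.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors), rate, rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Because survivors are carried over unchanged, the best fitness never decreases from one generation to the next. Choosing the representation and the fitness test for a real problem is the creative part that resists automation.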

Page 26: Data Mining and Knowledge Discovery in Databases

Genetic Algorithms/Evolutionary Computing

• Strengths
  • Suited to novel problems that are poorly understood
  • Suitable where data is dirty or missing
  • May be useful where other methods cannot be applied

• Weaknesses
  • Not easily automated
  • Require creativity in their application

Page 27: Data Mining and Knowledge Discovery in Databases

Bayesian Networks

• Based on Bayes’ rule:
  • P(a|b) = P(b|a) * P(a) / P(b)

• Can construct networks of linked events, each with prior probabilities
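Bayes' rule can be worked through numerically. In the sketch below, all probabilities are hypothetical values for a screening test: a is "has the disease" and b is "tests positive". P(b) is obtained by summing over both ways a positive test can occur.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b)."""
    return p_b_given_a * p_a / p_b


# Hypothetical inputs.
p_disease = 0.01            # prior P(a)
p_pos_given_disease = 0.90  # P(b|a)
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test, P(b).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = posterior(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 3))  # 0.154
```

Even with a fairly accurate test, the low prior keeps the posterior near 15%. A Bayesian network chains many such updates across linked events.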

Page 28: Data Mining and Knowledge Discovery in Databases

Bayesian Network Example

[Figure: Bayesian network for the event “J.R. shot”. Candidate causes: Bobby shot him; Mistress shot him; Wife shot him; Suicide; Just a dream sequence. Evidence nodes: J.R. treated for depression; Bobby publicly threatened; Producers desperate for ratings; Big fight between wife, mistress.]

Page 29: Data Mining and Knowledge Discovery in Databases

Bayesian Networks

• Strengths
  • Clarity of the resulting models
  • Good precision in predicting
  • Easily adapt to new probabilities

• Weaknesses
  • Time consuming to construct and maintain
  • Poor at predicting rare events

Page 30: Data Mining and Knowledge Discovery in Databases

Statistics

• With an outcome or dependent variable:
  • Correlations
  • ANOVA
  • Regression

• Used by themselves or to confirm findings of another method
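Two of the techniques listed, correlation and regression, can be sketched with plain Python arithmetic. The data (age against yearly clinic visits) and the function names are hypothetical; the formulas are the standard Pearson correlation coefficient and the least-squares line for one predictor.

```python
def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def least_squares(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Hypothetical data: age and number of clinic visits per year.
ages = [30, 40, 50, 60, 70]
visits = [2, 3, 5, 6, 9]

r = pearson_r(ages, visits)
slope, intercept = least_squares(ages, visits)
print(round(r, 3), round(slope, 2), round(intercept, 1))
```

Both calculations assume a linear relationship, one of the limitations noted for statistical methods.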

Page 31: Data Mining and Knowledge Discovery in Databases

Statistics

• Strengths
  • “Gold Standard”: valid and trusted in scientific circles

• Weaknesses
  • Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)

Page 32: Data Mining and Knowledge Discovery in Databases

Hybrids

• Techniques used in combination
• Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model

Page 33: Data Mining and Knowledge Discovery in Databases

Recap

• Data mining is the core activity within the process of Knowledge Discovery in Databases

• Its purpose is to find useful information in large amounts of data that “conventional” approaches cannot uncover

• A variety of methods is available
• Requires knowledge of the data domain and the methods, as well as creativity

Page 34: Data Mining and Knowledge Discovery in Databases

Data Mining Packages

• Major vendors of database/data management products (IBM, SPSS, Oracle PeopleSoft, SAS, and so on)
• Added as a component of turnkey packages
• May incorporate several methods (SAS Enterprise Miner)
• Single method (TreeAge Software Inc.: a dedicated decision tree product)

Page 35: Data Mining and Knowledge Discovery in Databases

How to implement?

• Do it yourself (you know the data domain)
• Put a team together (domain and method specialists)
• Hire a consultant (who knows both your domain and the tools)
• Vertical markets in data mining

Page 36: Data Mining and Knowledge Discovery in Databases

Close Relatives of Data Mining

• On-Line Analytical Processing (OLAP)
• Pivot tables in spreadsheets
• General statistical packages
• Intelligent Data Analysis: the use of data mining methods in the analysis of “small” datasets