Data Mining and Knowledge Discovery in Databases
Outline
• What is Data Mining and KDD?
• Characteristics
• Applications
• Methods
• Packages & Close Relatives
What is Data Mining & KDD?
• “The process of identifying hidden patterns and relationships within data”
or
• “Data mining helps end users extract useful business information from large databases”
What’s the Appeal?
• Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data
• Pervasive data
• Seek competitive advantage
The Challenge
[Slide shows a screen of raw, undelimited numeric records]
Process: Knowledge Discovery In Databases
[Diagram of the KDD process. Labels: databases; cleaning & integration; data warehouse; collect and transform; data mining (engines, models); discovered patterns; evaluation & presentation; user interface and expert domain knowledge. Feedback loops: modify data selection; modify methods, parameters.]
Context
• Where you stand on Data Mining depends on where you sit:
• Business User
• Researcher
• Computer Scientist
Data Mining Might Mean…
• Statistics
• Visualization
• Artificial intelligence
• Machine learning
• Database technology
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• High performance computing
• And so on...
What’s needed?
• Suitable data
• Computing power
• Data mining software
• Skilled operator who knows both the nature of the data and the software tools
• Reason, theory, or hunch
Typical Applications of Data Mining & KDD
• Marketing
  • Market Basket Analysis
  • Customer Relationship Management
  • New Product Development
Typical Applications of Data Mining & KDD
• Financial Services
  • Credit Approval
  • Fraud Detection
  • Marketing
Typical Applications of Data Mining & KDD
• Health Care
  • Epidemiological analysis: incidence and prevalence of disease in large populations, and detection of the source and cause of epidemics of infectious disease
  • Knowledge for funding
  • Policy, programs
Two Basic Approaches
• Supervised
  • A dependent or target variable
• Unsupervised
  • “Pure Data Mining”
  • Fewer assumptions
  • Typically used for clustering techniques
Automation
• The ability to aim a tool at some data and push a button
• Some methods of KDD/Data mining are more suitable for automation than others
Seven Basic Methods:
1. Decision Trees
2. (Artificial) Neural Networks
3. Cluster/Nearest Neighbour
4. Genetic Algorithms/Evolutionary Computing
5. Bayesian Networks
6. Statistics
7. Hybrids
Decision Trees
• Graphical representations of relationships within data
• Excel at classification and prediction models
Sample of a Decision Tree
[Diagram: a tree rooted at "gender" (female / male) that branches on "married?", "age" (<65 / >=65), "good health?", "urban?" and "pet owner?" (each yes / no), ending in leaves labelled + or -]
Decision Trees
• Strengths
  • Easily understood and interpreted
  • Represent complexity in a compact form
  • Handle non-linear data well
  • Relatively well suited to automation
• Weaknesses
  • Large trees with large numbers of variables become difficult to understand
  • Missing data must be appropriately managed in construction and use of the models
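One way a tree-building tool might decide which question to ask first is to score each candidate attribute by how cleanly it separates the outcomes. The Python sketch below uses Gini impurity for that score. The records, the attribute names and the choice of Gini are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(records, attribute, target):
    """Weighted Gini impurity of the groups produced by splitting on one attribute."""
    groups = {}
    for record in records:
        groups.setdefault(record[attribute], []).append(record[target])
    n = len(records)
    return sum(len(group) / n * gini(group) for group in groups.values())

def best_split(records, attributes, target):
    """Attribute whose split leaves the lowest weighted impurity."""
    return min(attributes, key=lambda a: split_impurity(records, a, target))

# Hypothetical toy records, invented for illustration only.
records = [
    {"pet_owner": "yes", "urban": "yes", "label": "+"},
    {"pet_owner": "yes", "urban": "no",  "label": "+"},
    {"pet_owner": "no",  "urban": "yes", "label": "-"},
    {"pet_owner": "no",  "urban": "no",  "label": "-"},
]
chosen = best_split(records, ["pet_owner", "urban"], "label")
```

In this toy data, splitting on `pet_owner` separates the two outcomes completely while `urban` does not, so `pet_owner` is chosen. A full tree builder would repeat the same step on each resulting group.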
Neural Networks
• Derived from Artificial Intelligence research
• Modelled on the human neuron
Neural Networks
[Diagram: input variables (Age, Gender, Income) feed a hidden layer, which feeds a single prediction node; each connection carries a weight (values shown: 0.6, 0.3, 0.1, 0.5, 0.7, 0.8, 0.4, 0.3, 0.2)]
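A prediction from a small network of this kind is a weighted sum of the inputs at each hidden node, followed by a weighted sum of the hidden values at the output. The Python sketch below shows one such forward pass. The assignment of weights to connections, the sigmoid activation, the absence of bias terms and the scaled input values are all assumptions made for illustration, since the slide does not specify them.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> hidden layer -> single prediction."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical weights: two hidden nodes, each with one weight per input.
hidden_weights = [[0.6, 0.3, 0.1], [0.5, 0.7, 0.8]]
output_weights = [0.4, 0.3]

# Hypothetical inputs (age, gender, income), each scaled to the range 0..1.
inputs = [0.5, 1.0, 0.2]
prediction = forward(inputs, hidden_weights, output_weights)
```

Training a network means adjusting these weights until the predictions match known outcomes; that step is not shown here.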
Neural Networks
• Strengths
  • Accuracy of prediction
  • Robust performance with a wide variety of data types
• Weaknesses
  • Prone to overfitting
  • Poor clarity of model
Clustering/Nearest Neighbour
• Aim to assign “like” records to a group
• Groups assigned according to some target variable or criteria
• Nearest neighbour used for prediction
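Nearest-neighbour prediction can be sketched in a few lines of Python: find the k training records closest to a new record and let them vote on its label. The Euclidean distance measure, the value of k and the sample data below are assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbours.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature records forming two well-separated groups.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.1), "high"),
]
```

With this data, `knn_predict(train, (1.1, 1.0))` returns `"low"`, because all three of the closest records carry that label.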
Clustering/Nearest Neighbour
• Applications:
  • Text processing: search engines
  • Image processing: radiology
  • Fraud detection: outliers
Clustering/Nearest Neighbour
• Strengths
  • Easily understood and interpreted
  • Easily implemented in basic situations
• Weaknesses
  • Complex data not well suited to automation (much preprocessing required)
Genetic Algorithms/Evolutionary Computing
• Grounded in Darwin – applied using mathematics
• Require
  • a way to represent a solution to a problem
  • a way to test the “fitness” of the solution
• Solutions are mathematically “mutated”
• Fittest solutions survive
• Convergence
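The ingredients listed above (a representation, a fitness test, mutation and survival of the fittest) can be sketched in Python as follows. The problem is a deliberately trivial stand-in: a solution is a list of bits and its fitness is the number of 1s. The population size, mutation rate and the use of mutation without crossover are assumptions made to keep the example short.

```python
import random

def fitness(bits):
    """Fitness test: count the 1s (a toy stand-in for a real objective)."""
    return sum(bits)

def mutate(bits, rate, rng):
    """Return a copy of `bits` with each bit flipped with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(length=20, pop_size=30, generations=60, rate=0.05, seed=0):
    """Evolve bit strings; returns the best solution and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]  # the fittest half survives unchanged
        children = [
            mutate(rng.choice(survivors), rate, rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness), history

best, history = evolve()
```

Because the fittest half is carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next, which is the convergence behaviour the slide refers to.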
Genetic Algorithms/Evolutionary Computing
• Strengths
  • Suited to novel problems that are poorly understood
  • Suitable where data is dirty or missing
  • May be useful where other methods cannot be applied
• Weaknesses
  • Not easily automated
  • Require creativity in their application
Bayesian Networks
• Based on Bayes’ rule:
  • P(a|b) = P(b|a) * P(a) / P(b)
• Can construct networks of linked events, each with prior probabilities
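As a worked instance of Bayes' rule, the Python sketch below updates a prior probability after one piece of evidence. The scenario (a = a transaction is fraudulent, b = it was flagged by a screening rule) and every probability in it are hypothetical numbers chosen for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b)."""
    if p_b <= 0:
        raise ValueError("P(b) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical inputs.
p_fraud = 0.01             # prior: P(a)
p_flag_given_fraud = 0.90  # P(b|a)
p_flag_given_legit = 0.05  # P(b|not a)

# P(b) by the law of total probability.
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_legit * (1 - p_fraud)

p_fraud_given_flag = bayes(p_flag_given_fraud, p_fraud, p_flag)
```

Here P(b) = 0.009 + 0.0495 = 0.0585, so the posterior is 0.009 / 0.0585, or roughly 0.15: even after a flag, most flagged transactions in this made-up scenario are legitimate because fraud is rare to begin with.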
Bayesian Network Example
[Diagram: a network around the event "J.R. shot". Candidate explanations: Bobby shot him; just a dream sequence; mistress shot him; wife shot him; suicide. Linked evidence: J.R. treated for depression; Bobby publicly threatened; producers desperate for ratings; big fight between wife and mistress.]
Bayesian Networks
• Strengths
  • Clarity of the resulting models
  • Good precision in predicting
  • Easily adapt to new probabilities
• Weaknesses
  • Time consuming to construct and maintain
  • Poor at predicting rare events
Statistics
• With an outcome or dependent variable:
  • Correlations
  • ANOVA
  • Regression
• Used by themselves or to confirm findings of another method
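Two of the techniques named above, correlation and simple linear regression, can be sketched in plain Python. The data points are invented, and the functions are minimal textbook formulas rather than a substitute for a statistics package.

```python
import math

def _centered_sums(xs, ys):
    """Sums of squares and cross-products about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data with a perfectly linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r = pearson_r(xs, ys)
slope, intercept = linear_fit(xs, ys)
```

For this data the correlation is 1 and the fitted line is y = 2x. Real data would give weaker values, and the usual assumptions (linearity, normality and so on) would need checking.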
Statistics
• Strengths
  • “Gold Standard” – valid and trusted in scientific circles
• Weaknesses
  • Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)
Hybrids
• Techniques used in combination
• Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model
Recap
• Data Mining is the core activity or method within a process of Knowledge Discovery in Databases
• Done to find useful information in large amounts of data that “conventional” approaches cannot uncover
• Draws on a variety of methods
• Requires knowledge of the data domain and the methods, as well as creativity
Data Mining Packages
• Major vendors of database/data management products (IBM, SPSS, Oracle PeopleSoft, SAS, and so on)
• Added as a component of turnkey packages
• May incorporate several methods (SAS Enterprise Miner)
• Single method (TreeAge Software Inc.: a dedicated decision tree product)
How to implement?
• Do it yourself (you know the data domain)
• Put a team together (domain and method specialists)
• Hire a consultant (who knows both your domain and the tools)
• Vertical markets in data mining
Close Relatives of Data Mining
• On-Line Analytical Processing (OLAP)
• Pivot tables in spreadsheets
• General statistical packages
• Intelligent Data Analysis – comprises the use of data mining methods in the analysis of “small” datasets