Data Mining Introduction
TYNE SYSTEM
Chun-hung, Chou
2003.12.09
Outline
1. Data Mining Overview
2. Functionalities
3. Software
4. R function
5. Example
6. Q & A
Data Mining Overview
Knowledge Discovery Process
1. Data cleaning - remove noise and inconsistent data
2. Data integration - combine multiple data sources
3. Data selection - retrieve data relevant to the analysis task
4. Data transformation - transform data into forms appropriate for mining
5. Data mining
6. Pattern evaluation - identify interesting patterns
7. Knowledge presentation
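The first four steps above can be sketched on toy data. This is only an illustration: the records, field names, and helper functions (`clean`, `integrate`, `select`, `transform`) are hypothetical, and min-max scaling is just one possible transformation.

```python
# Toy walk-through of data cleaning, integration, selection, and transformation.
# All records and helper names below are made up for illustration.

source_a = [{"id": 1, "temp": 20.5, "site": "A"},
            {"id": 2, "temp": None, "site": "A"}]   # record with a missing value
source_b = [{"id": 3, "temp": 22.0, "site": "B"},
            {"id": 4, "temp": 21.5, "site": "B"}]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine multiple data sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records, field):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]


merged = integrate(clean(source_a), clean(source_b))
prepared = transform(select(merged, ["id", "temp"]), "temp")
```

The `prepared` list is what a mining step (step 5) would then take as input.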
What is Data Mining?
• Viewed as part of the Knowledge Discovery process.
• Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data.
• Uses tools from Computer Science and Artificial Intelligence as well as Statistics.
Why do we need data mining?
– Large number of records (cases) (10^8 to 10^12 bytes)
– High dimensional data (variables) (10 to 10^4 attributes)
– Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
– Data that may never be explored continues to be collected out of fear that something that may prove important in the future may be missing.
– Magnitude of data precludes most traditional analysis ANOVA/PC/
Potential Applications
– Fraud Detection
– Manufacturing Processes
– Targeting Markets
– Scientific Data Analysis
– Risk Management
– Web Intelligence
– Bioinformatics
– …
Data Mining Myths
• Data mining tools need no guidance.
• Data mining models explain behavior.
• Data mining requires no data analysis skill.
• Data mining tools are "different" from statistics.
• Data mining eliminates the need to understand your business and your data.
Data Mining Functionalities
• Concept/Class Description
• Association Analysis
• Classification Analysis
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Concept Description
Generate descriptions for characterization and comparison of data
characterization:
summarizes and describes a collection of data
e.g. mean, distribution, percentile, …
comparison:
summarizes and distinguishes one collection of data from other collection(s) of data
Concept Description
Method:
visualization:
e.g. boxplot, bar chart, histogram, …
statistics/tabulation:
e.g. mean, standard deviation, proportion, contingency table, …
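A minimal sketch of characterization and comparison with summary statistics. The two groups and the helper `characterize` are made up for illustration; only Python's standard `statistics` module is used.

```python
import statistics

# Hypothetical measurements from two collections of data.
group_a = [4.1, 4.5, 3.9, 4.8, 4.2]
group_b = [5.0, 5.4, 4.9, 5.6, 5.1]


def characterize(values):
    """Characterization: summarize one collection of data."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
    }


summary_a = characterize(group_a)
summary_b = characterize(group_b)

# Comparison: distinguish one collection from the other by its summary.
mean_difference = summary_b["mean"] - summary_a["mean"]
```

Here the comparison is reduced to a difference of means; in practice a boxplot or contingency table of the two groups serves the same purpose.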
Association Analysis
Goal: find interesting relationships among items in a given data set
Association Analysis
Example:
• Market Basket Analysis - an example of rule-based machine learning
• Customer Analysis
– Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases
• Product Analysis
– Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase
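"Which products tend to be purchased together" is usually quantified with support and confidence. These two measures are not named in the slides; they are brought in here as the standard ones for association rules. The transactions below are made up.

```python
# Hypothetical market baskets: each set is one customer's purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


bread_milk_support = support({"bread", "milk"})          # 2 of 4 baskets
bread_to_milk = confidence({"bread"}, {"milk"})          # 2 of the 3 bread baskets
```

A rule such as bread -> milk is usually reported only when both values exceed thresholds chosen by the analyst.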
Classification Analysis
Goal:
Build a model that describes a predetermined set of data classes or concepts, and use the model for prediction
Classification Analysis
Method:
Decision tree
Bayesian network
Bayesian belief network
Neural network
k-nearest neighbor
Case-based reasoning
Genetic algorithm
Rough sets
Fuzzy logic
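Of the methods listed, k-nearest neighbor is the shortest to sketch. The training points, the labels "normal" and "abnormal", and the function name `knn_predict` are hypothetical; Euclidean distance and majority voting are assumed.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up labeled examples: (features, class label).
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.1), "normal"),
    ((5.0, 5.0), "abnormal"), ((5.2, 4.9), "abnormal"), ((4.8, 5.1), "abnormal"),
]

label_near_origin = knn_predict(train, (1.1, 1.0))
label_far_away = knn_predict(train, (5.1, 5.0))
```

The labeled examples play the role of the "predetermined set of data classes", and `knn_predict` is the prediction step.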
Cluster Analysis
Goal:
Group a set of physical or abstract objects into classes of similar objects
Cluster
• Method:
Partitioning methods: k-means
Hierarchical methods: top-down, bottom-up
Density-based methods: arbitrary shapes
Grid-based methods: cells
Model-based methods: best fit of given model
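As an illustration of a partitioning method, here is a bare-bones k-means sketch. The points, the starting centroids, and the fixed iteration count are assumptions made for the example; real implementations usually choose starting centroids randomly and stop when assignments no longer change.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]          # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

Each returned cluster is one "class of similar objects" in the sense of the goal stated earlier.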
Outlier Analysis
Outlier: a data object that is inconsistent with the rest of a given data set
Goal: find an efficient method to mine the outliers
Outlier Analysis
Method:
- Statistical-Based Outlier Detection
- Distance-Based Outlier Detection
- Deviation-Based Outlier Detection
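The first two approaches can be sketched on one-dimensional data. The values, thresholds, and function names are hypothetical; the statistical variant assumes a z-score rule and the distance variant assumes a "too few neighbors within a radius" rule.

```python
import statistics

# Made-up readings with one suspicious value.
values = [10, 11, 9, 10, 12, 10, 11, 50]


def zscore_outliers(data, threshold=2.0):
    """Statistical-based: flag values more than `threshold` standard
    deviations away from the mean."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)
    return [v for v in data if std > 0 and abs(v - mean) / std > threshold]


def distance_outliers(data, radius, min_neighbors):
    """Distance-based: flag values with fewer than `min_neighbors`
    other values within `radius`."""
    flagged = []
    for i, v in enumerate(data):
        neighbors = sum(1 for j, w in enumerate(data)
                        if j != i and abs(v - w) <= radius)
        if neighbors < min_neighbors:
            flagged.append(v)
    return flagged


by_statistics = zscore_outliers(values)
by_distance = distance_outliers(values, radius=3, min_neighbors=2)
```

Both rules flag the same reading here; on real data they can disagree, which is one reason several families of methods are listed.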
Evolution Analysis
• Goal:
Describe and model regularities or trends for objects whose behavior changes over time
Evolution Analysis
• Method:
Statistical Method
Trend Analysis
Similarity Search in Time-Series Analysis
Sequential Pattern Mining
Periodicity Analysis
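Trend analysis is often started with a moving average, which smooths short-term fluctuation so that the underlying direction shows. The series and window size below are made up, and a moving average is only one of several possible trend estimators.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical observations ordered in time.
series = [3, 5, 4, 6, 8, 7, 9]
trend = moving_average(series, window=3)
```

The raw series goes up and down, while the smoothed values rise steadily, which is the regularity the goal statement refers to.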
Commercial Software
• Full Suite
Product            Company          Price (US$)
Enterprise Miner   SAS              >75000
Clementine         SPSS             ~50000
Intelligent Miner  IBM              ??
Data Miner         STATISTICA       ~50000
IndexMiner         Index Software   ??
Method in R
Function   R Library
Tree       tree
Cluster    clara
Cluster    diana
Cluster    fanny
Cluster    mona
Cluster    hclust
Cluster    kmeans
Cluster    cluster
Example - Decision Tree
• Decision tree for detecting tool abnormalities
[Figure: decision tree; recovered labels: AWD080, AWD030, AWD050]
Example - Decision Tree
Example - Cluster
Question & Suggestion
Thanks!