other data mining techniques

Upload: tommy96

Post on 10-May-2015

Page 1: Other Data Mining Techniques

Another Look at Data Mining

Why do we mine?
What do we mine?
How do we mine?

Page 2: Other Data Mining Techniques

CS753 Dr. Mary Ann Robbert

What is Data Mining

Data mining discovers meaningful new correlations, hidden patterns and relationships in your data
Conceptual descendant of statistics
Combines machine learning, statistics, and databases
Knowledge discovery: process of building and implementing a data mining solution

Page 3: Other Data Mining Techniques

Data Mining Overview

Knowledge Discovery in Databases (KDD)
No one data mining approach
  each tool is viewed logically as a client application
  can reside on a separate machine or in a separate process and access the data warehouse
  RDBMS or proprietary OLAP engines embed data mining capabilities deeply within the engine to improve efficiency and add extensions
Requires a good foundation in terms of a data warehouse

Page 4: Other Data Mining Techniques

Data Mining Overview (cont’d)

Common algorithmic approaches
  association, affinity grouping
  predicting, sequence-based analysis
  clustering
  classification
  estimation
Steps are: data selection, data transformation, data mining, result interpretation.
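
The four steps can be illustrated with a small association (affinity grouping) example. This is only a sketch: the sample records, the store codes, the function names and the 0.5 support threshold are all assumptions made for illustration and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer id, comma-separated items, store code).
RAW = [
    (1, "bread,milk", "A"),
    (2, "bread,butter,milk", "A"),
    (3, "butter", "B"),
    (4, "bread,butter", "A"),
]

def select(records, store):
    """Data selection: keep only the records relevant to the study."""
    return [r for r in records if r[2] == store]

def transform(records):
    """Data transformation: turn each record into a set of items."""
    return [frozenset(items.split(",")) for _, items, _ in records]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    pairs = Counter()
    for basket in baskets:
        pairs.update(combinations(sorted(basket), 2))
    return pairs

def interpret(pairs, baskets, min_support=0.5):
    """Result interpretation: keep pairs whose support meets the threshold."""
    n = len(baskets)
    return {pair: count / n for pair, count in pairs.items() if count / n >= min_support}

baskets = transform(select(RAW, "A"))
frequent_pairs = interpret(mine(baskets), baskets)
```

Here support is the fraction of selected baskets that contain both items; real association tools also report measures such as confidence, which this sketch omits.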

Page 5: Other Data Mining Techniques

Strategic Benefit of Data Mining

Direct Marketing
Trend Analysis
Fraud detection
Forecasting in Financial Markets

Page 6: Other Data Mining Techniques

Why Data Mining Now?

Economics
  Unprecedented affordability of MIPS and MB
Parallel computing
  Enormous amounts of data can be processed
Popularity of data warehouses, data marts
  Relatively clean data available

Page 7: Other Data Mining Techniques

Data Mining compared to Traditional Analysis

Traditional Analysis
  Did sales of product X increase in Nov.?
  Do sales of product X decrease when there is a promotion on product Y?
Data mining is result oriented
  What are the factors that determine sales of product X?

Page 8: Other Data Mining Techniques

Data Mining compared to Traditional Analysis (cont’d)

Traditional analysis is incremental
  Does billing level affect turnover?
  Does location affect turnover?
  Analyst builds model step by step
Data mining is result oriented
  Identify the factors and predict turnover

Page 9: Other Data Mining Techniques

Steps in Data Mining

Data Manipulation - can be 70-80% of data mining effort
  data cleaning
  missing values
  data derivation
  merging data
Defining a study
  Supervised - articulating goal, choosing dependent variable or output and specifying data fields
  Unsupervised - group similar types of data or identify exceptions
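
The four data manipulation tasks can be sketched on a toy data set. Everything here is assumed for illustration: the field names, the mean imputation used for missing values, and the age cut-off for the derived field are not taken from the slides.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
customers = [
    {"id": 1, "name": " Ann ", "age": 34},
    {"id": 2, "name": "bob", "age": None},
    {"id": 3, "name": "Cy", "age": 50},
]
# Hypothetical billing totals keyed by the same customer id.
billing = {1: 120.0, 2: 80.0, 3: 200.0}

def prepare(customers, billing):
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    fill_age = mean(known_ages)  # mean imputation: one simple option among many
    rows = []
    for c in customers:
        age = c["age"] if c["age"] is not None else fill_age
        rows.append({
            "id": c["id"],
            "name": c["name"].strip().title(),  # data cleaning
            "age": age,                         # missing values filled in
            "age_missing": c["age"] is None,    # remember which values were imputed
            "senior": age >= 45,                # data derivation (assumed cut-off)
            "billed": billing.get(c["id"]),     # merging data from a second source
        })
    return rows

prepared = prepare(customers, billing)
```

Keeping the `age_missing` flag alongside the imputed value lets a later mining step treat imputed and observed values differently if that turns out to matter.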

Page 10: Other Data Mining Techniques

Steps in Data Mining (cont’d)

Reading the data and building the model
  model summarizes large amounts of data by accumulating indicators (frequencies, weight, conjunctions, differentiation)
Understanding the model
  Know the particular model
Prediction
  Choose the best outcome based on historical data
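
As a rough illustration of a model that accumulates frequencies and then predicts the most common historical outcome, consider the turnover question from the earlier slide. The history rows, the field values and the default outcome are hypothetical, and a frequency table is only one of many possible model types.

```python
from collections import Counter, defaultdict

# Hypothetical history: (billing level, location) -> what the customer did.
history = [
    (("high", "urban"), "left"),
    (("high", "urban"), "left"),
    (("high", "rural"), "stayed"),
    (("low", "urban"), "stayed"),
    (("low", "rural"), "stayed"),
    (("high", "urban"), "stayed"),
]

def build_model(rows):
    """Read the data once and accumulate outcome frequencies per input pattern."""
    model = defaultdict(Counter)
    for features, outcome in rows:
        model[features][outcome] += 1
    return model

def predict(model, features, default="stayed"):
    """Choose the outcome seen most often for this pattern in the history."""
    counts = model.get(features)
    if not counts:
        return default  # pattern never seen: fall back to an assumed default
    return counts.most_common(1)[0][0]

model = build_model(history)
prediction = predict(model, ("high", "urban"))
```

Because the model is just a table of counts, it is easy to inspect, which is one way of "knowing the particular model" before trusting its predictions.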

Page 11: Other Data Mining Techniques

Models

Genetic Algorithms
Neural Nets
Agents
Statistics
Visualization

Page 12: Other Data Mining Techniques

Genetic Algorithms

Artificial intelligence system that mimics the evolutionary, survival-of-the-fittest processes to generate increasingly better solutions to a problem.
Genetic algorithms produce several generations of solutions, choosing the best of the current set for each new generation.
Examples
  Generating human faces based on a few known features.
  Generating solutions to routing problems.
  Generating stock portfolios.

Page 13: Other Data Mining Techniques

EVOLUTION IN GENETIC ALGORITHMS

SELECTION - or survival of the fittest. The key is to give preference to better outcomes.
CROSSOVER - combining portions of good outcomes in the hope of creating an even better outcome.
MUTATION - randomly trying combinations and evaluating the success (or failure) of the outcome.
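
Selection, crossover and mutation can be sketched on a toy problem. The objective (count the 1-bits in a bit string), tournament selection, the elitism step and all parameter values are assumptions chosen to keep the example short; the slides do not prescribe any of them.

```python
import random

LENGTH = 20  # bits per candidate solution (illustrative choice)

def fitness(bits):
    """Toy objective: the number of 1-bits (higher is better)."""
    return sum(bits)

def select(population, rng):
    """Selection: prefer the fitter of two randomly drawn candidates."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(mother, father, rng):
    """Crossover: join the head of one parent to the tail of the other."""
    cut = rng.randrange(1, LENGTH)
    return mother[:cut] + father[cut:]

def mutate(bits, rng, rate=0.02):
    """Mutation: flip each bit with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(generations=40, size=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(LENGTH)] for _ in range(size)]
    first_best = max(population, key=fitness)
    for _ in range(generations):
        elite = max(population, key=fitness)  # carry the current best over unchanged
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rng)
            for _ in range(size - 1)
        ]
        population = [elite] + children
    return first_best, max(population, key=fitness)

first_best, final_best = evolve()
```

Carrying the best candidate into each new generation means the best fitness can never get worse from one generation to the next, although nothing guarantees that the optimum is reached.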

Page 14: Other Data Mining Techniques

Neural Nets

Mathematical Model of the Way a Brain Functions
Machine learning approach by which historical data can be examined for pattern recognition
A neural network simulates the human ability to classify things based on the experience of seeing many examples.
Pros - Numerical Data
Cons - Opaque, Art or Science
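
Learning to classify from examples can be shown with a single artificial neuron (a perceptron), the simplest building block of a neural net. This is a sketch under stated assumptions: the logical-AND training set, the learning rate and the epoch count are illustrative, and practical networks use many neurons and different training methods.

```python
def step(total):
    """Threshold activation: fire (1) only when the weighted sum is positive."""
    return 1 if total > 0 else 0

def train_perceptron(examples, epochs=20, rate=1):
    """Adjust weights a little after every misclassified example."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            total = sum(w * x for w, x in zip(weights, inputs)) + bias
            error = target - step(total)  # 0 when the example is already correct
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

def classify(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Hypothetical training examples: the logical AND of two numeric inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
```

The learned weights are just numbers with no obvious meaning, which hints at why the slide lists opacity as a drawback.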

Page 15: Other Data Mining Techniques

Example

Distinguishing different chemical compounds
Detecting anomalies in human tissue that may signify disease
Reading handwriting
Detecting fraud in credit card use

Page 16: Other Data Mining Techniques

Intelligent Agents

Software entities that carry out some set of operations on behalf of a user or program with some degree of autonomy, and employ some knowledge or representation of the user’s goals and desires.
Some common characteristics
  ability to communicate, cooperate and coordinate with other agents
  ability to act autonomously to achieve the collective goal of the system

Page 17: Other Data Mining Techniques

Intelligent Agents (cont’d)

Tasks
  automate repetitive tasks
  finding and filtering information
  summarizing complex data
Capability to learn and make recommendations
Black box approach hides complexity and allows for design of scalable system

Page 18: Other Data Mining Techniques

Comparison

AI System           Problem Type                                Based On                   Starting Information
Expert Systems      Diagnostic or prescriptive                  Strategies of experts      Expert’s know-how
Neural Networks     Identification, classification, prediction  The human brain            Acceptable patterns
Genetic Algorithms  Optimal solution                            Biological evolution       Set of possible solutions
Intelligent Agents  Specific and repetitive tasks               One or more AI techniques  Your preferences

Page 19: Other Data Mining Techniques

Statistics

SAS, SPSS
Pros - Established technology
Cons - Needs assumptions, nominal variable handling, management acceptance?

Page 20: Other Data Mining Techniques

Visualization

Data visualization refers to technologies that support visualization of information
Includes - digital images, GIS, multi-dimensions, 3-D presentations, animations
http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html

Page 21: Other Data Mining Techniques

Data Mining is Not a Silver Bullet

It does not:
  Find answers to questions you don’t ask
  Eliminate the need for domain experience
  Remove the need for data analysis skills

Page 22: Other Data Mining Techniques

Data Mining Software

http://www.kdnuggets.com/software/
http://www.attar.com/ - download
http://www.cs.bham.ac.uk/~anp/software.html - software listing

Page 23: Other Data Mining Techniques

Six Rules of Data Quality by Ken Orr

1. Data that is not used cannot be correct for very long
2. Data quality in an information system is a function of its use, not its collection
3. Data quality will ultimately be no better than its most stringent use
4. Data quality problems tend to become worse with the age of the system
5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change
6. Information overload affects data quality

Page 24: Other Data Mining Techniques

Data Quality Software

http://www.rulequest.com/gritbot-info.html

Page 25: Other Data Mining Techniques

General DW Data Transformation

Resolve inconsistent legacy formats
Strip out unwanted fields
Interpret codes into text
Combine data from multiple sources under a common key
Find fields used for multiple purposes and interpret each field’s value based on context
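
Several of these warehouse transformations can be sketched together. The field names, the status code table and the two date formats are hypothetical stand-ins for whatever the legacy sources actually contain.

```python
from datetime import datetime

# Hypothetical code table and legacy date formats; real systems will differ.
STATUS_TEXT = {"A": "active", "C": "closed"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def parse_date(text):
    """Resolve inconsistent legacy formats into one ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def transform(billing_rows, crm_rows, keep=("cust_id", "opened", "status", "region")):
    """Combine two sources under cust_id, decode codes, drop unwanted fields."""
    crm_by_key = {row["cust_id"]: row for row in crm_rows}
    out = []
    for row in billing_rows:
        merged = {**row, **crm_by_key.get(row["cust_id"], {})}
        merged["opened"] = parse_date(merged["opened"])
        merged["status"] = STATUS_TEXT.get(merged["status"], "unknown")
        out.append({k: merged.get(k) for k in keep})
    return out

billing_rows = [
    {"cust_id": 7, "opened": "03/15/1998", "status": "A", "filler": "xx"},
    {"cust_id": 8, "opened": "1999-01-02", "status": "C", "filler": "yy"},
]
crm_rows = [{"cust_id": 7, "region": "NE"}]
warehouse_rows = transform(billing_rows, crm_rows)
```

A customer with no matching row in the second source keeps a `region` of `None` rather than being dropped, so the gap stays visible downstream.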

Page 26: Other Data Mining Techniques

Data Transformation for Data Mining

Flag normal, abnormal, out of bounds or impossible facts
Recognize random or noise values from context and mask out
Apply uniform treatment to NULL values
Flag fact records with changed status
Classify individual record by one of its aggregates
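
Flagging facts and treating NULL values uniformly might look like the following sketch. The list of NULL-like tokens and the age limits are assumptions for illustration; appropriate bounds depend on the domain.

```python
# Hypothetical spellings of "no value" found in source data.
NULL_TOKENS = {"", "NULL", "N/A", "na", "-"}

def normalize_null(value):
    """Apply one uniform treatment to every NULL-like value."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() in NULL_TOKENS:
        return None
    return value

def flag_age(value, low=0, high=120, typical=(18, 90)):
    """Label an age as missing, out_of_bounds, abnormal, or normal."""
    value = normalize_null(value)
    if value is None:
        return "missing"
    age = float(value)
    if age < low or age > high:
        return "out_of_bounds"  # impossible under the assumed limits
    if age < typical[0] or age > typical[1]:
        return "abnormal"       # possible, but unusual for this data set
    return "normal"

flags = [flag_age(v) for v in [42, "N/A", -3, 101, None, " 67 "]]
```

Flagging rather than deleting suspect values leaves the decision about how to use them to the mining step.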

Page 27: Other Data Mining Techniques

Conclusion

For successful data mining:
  data analysis and mining goals must be identified and formulated
  appropriate data must be selected, cleaned and prepared for queries and business analysis
http://www.rulequest.com/cubist-examples.html#BOSTON
http://www.almaden.ibm.com/cs/quest/