Data Mining, by Dr. Awad Khalil, Computer Science Department, AUC


Upload: cameron-bell

Post on 10-Jan-2016


Data Mining

Dr. Awad Khalil

Computer Science Department

AUC

Content

- What and Why Data Mining
- Data Mining Applications
- Data Mining Operations & Associated Techniques
  - Predictive Modeling
  - Database Segmentation
  - Link Analysis
  - Deviation Detection
- The Data Mining Process
- The CRISP-DM Model

What and Why Data Mining?

Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.

Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

The focus of data mining is to reveal information that is hidden and unexpected.

Data mining requires a single, separate, clean, integrated, and self-consistent source of data. A data warehouse is well equipped for providing data for data mining.

Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.

Data Mining Applications

Retail/Marketing:
- Identifying buying patterns of customers
- Finding associations among customer demographic characteristics
- Predicting response to mailing campaigns
- Market basket analysis

Banking:
- Detecting patterns of fraudulent credit card use
- Identifying loyal customers
- Predicting customers likely to change their credit card affiliation
- Determining credit card spending by customer groups

Insurance:
- Claims analysis
- Predicting which customers will buy new policies

Medicine:
- Characterizing patient behavior to predict surgery visits
- Identifying successful medical therapies for different illnesses

Data Mining Operations & Associated Techniques

Predictive Modeling:
- Classification
- Value prediction

Database Segmentation:
- Demographic clustering
- Neural clustering

Link Analysis:
- Association discovery
- Sequential pattern discovery
- Similar time sequence discovery

Deviation Detection:
- Statistics
- Visualization

Predictive Modeling

Predictive modeling is similar to the human learning experience in using observations to form a model of the important characteristics of some phenomenon.

This approach uses generalization of the “real world” and the ability to fit new data into a general framework.

Predictive modeling can be used to analyze an existing database to determine some essential characteristics (model) about the data set.

Applications of predictive modeling include customer retention management, credit approval, cross-selling, and direct marketing.

There are two techniques associated with predictive modeling: classification and value prediction.

Classification

Classification is used to establish a specific predetermined class for each record in a database from a finite set of possible class values.

There are two specializations of classification:
- Tree induction
- Neural induction

Classification – Tree Induction

In the example shown, we are interested in predicting whether a customer who is currently renting property is likely to be interested in buying property.

A predictive model has determined that only two variables are of interest: the length of time the customer has rented property and the age of the customer.

The decision tree presents the analysis in an intuitive way.

The model predicts that those customers who have rented for more than two years and are over 25 years old are the most likely to be interested in buying property.
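
The tree on this slide can be read as two nested tests. The following is a minimal Python sketch of that reading, not output from a tree-induction tool: the `Renter` class and `likely_to_buy` function are hypothetical names, and the assumption that every branch other than "more than two years and over 25" predicts "not likely to buy" is ours, since the figure itself is not part of the transcript.

```python
from dataclasses import dataclass


@dataclass
class Renter:
    years_renting: float  # length of time the customer has rented property
    age: int              # age of the customer


def likely_to_buy(renter: Renter) -> bool:
    """Walk the two-level decision tree described in the text."""
    # First split: has the customer rented for more than two years?
    if renter.years_renting > 2:
        # Second split: is the customer over 25 years old?
        return renter.age > 25
    # Assumption: all remaining leaves are labelled "not likely to buy".
    return False


# Hypothetical customers, for illustration only.
customers = [Renter(3.5, 31), Renter(3.0, 22), Renter(1.0, 40)]
print([likely_to_buy(c) for c in customers])
```

In practice a tree-induction algorithm would learn the split variables and thresholds from the data; here they are hard-coded to mirror the example.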

Classification – Neural Network

A neural network contains collections of connected nodes with input, output, and processing at each node.

Between the visible input and output layers may be a number of hidden processing layers.

Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value, expressing the strength of the relationship.

The network attempts to mirror the way the human brain works in processing patterns by arithmetically combining all the variables associated with a given data point.

In this way, it is possible to develop nonlinear predictive models that “learn” by studying combinations of variables and how different combinations of variables affect different data sets.
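
To make "connected by a weighted value" concrete, here is a small sketch of a forward pass through one hidden layer and one output unit. The weights, biases, the input record, and the choice of a sigmoid activation are all illustrative assumptions; a real network would learn its weights from training data.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: every input feeds every unit through a weight."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

record = [0.6, 0.9]  # one data point with two (already scaled) variables
hidden = layer_forward(record, hidden_w, hidden_b)
output = layer_forward(hidden, output_w, output_b)
print(output)  # a single score between 0 and 1
```

The nonlinearity comes from the activation function: without it, stacked layers would collapse into a single linear combination of the inputs.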

Value Prediction

Value prediction is used to estimate a continuous numeric value that is associated with a database record.

This technique uses the traditional statistical techniques of linear regression and nonlinear regression.

Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.

Linear regression works well with linear data and is sensitive to the presence of outliers (that is, data values which do not conform to the expected norm).

Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot.

Applications of value prediction include credit card fraud detection and target mailing list identification.
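
The sensitivity of linear regression to outliers can be seen in a few lines. This sketch fits a line by ordinary least squares (one common way to fit a straight line; the slide does not name a fitting method) on made-up data, then refits after replacing one point with an outlier. The function name `fit_line` and the data are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # made-up data lying exactly on y = 2x
slope, intercept = fit_line(xs, ys)

ys_outlier = [2, 4, 6, 8, 40]  # same data with the last value replaced by an outlier
slope_o, intercept_o = fit_line(xs, ys_outlier)

print(slope, intercept)
print(slope_o, intercept_o)
```

On this data the clean fit recovers slope 2 and intercept 0, while the single outlier pulls the fit to slope 8 and intercept -12, so the fitted line no longer describes the other four points.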

Database Segmentation

The aim of database segmentation is to partition a database into an unknown number of segments, or clusters, of similar records, that is, records that share a number of properties and so are considered to be homogeneous.

This approach uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.

Database segmentation is less precise than other operations and is therefore less sensitive to redundant and irrelevant features.

Applications of database segmentation include customer profiling, direct marketing, and cross-selling.

Database segmentation is associated with demographic or neural clustering techniques, which are distinguished by the allowable data inputs, the methods used to calculate the distance between records, and the presentation of the resulting segments for analysis.
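
As a rough illustration of partitioning records into a number of segments that is not fixed in advance, the sketch below uses a simple threshold-based "leader" clustering with Euclidean distance. This is a stand-in chosen for brevity, not the demographic or neural clustering named above; the records, the `threshold` value, and the function names are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment(records, threshold):
    """Put each record in the first segment whose leader is within `threshold`.

    A record that is farther than `threshold` from every existing leader
    starts a new segment, so the number of segments emerges from the data.
    """
    leaders = []
    segments = []
    for record in records:
        for i, leader in enumerate(leaders):
            if distance(record, leader) <= threshold:
                segments[i].append(record)
                break
        else:
            leaders.append(record)
            segments.append([record])
    return segments


# Hypothetical (age, years renting) records.
records = [(25, 1), (27, 2), (52, 10), (26, 1), (50, 12)]
for seg in segment(records, threshold=5.0):
    print(seg)
```

The result depends on the distance measure, the threshold, and the order of the records, which is one reason real clustering tools differ in how they calculate the distance between records.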

Link Analysis

Link analysis aims to establish links, called associations, between the individual records, or sets of records, in a database.

There are three specializations of link analysis:

- Association discovery: finds items that imply the presence of other items in the same event. These affinities between items are represented by association rules. For example, “when a customer rents a property for more than two years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties.”

- Sequential pattern discovery: finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior.

- Similar time sequence discovery: is used, for example, in the discovery of links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. For example, within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.

Applications of link analysis include product affinity analysis, direct marketing, and stock price movement.
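
The two percentages in the association rule example are usually called confidence (the 40%) and support (the 35%). The sketch below computes both for a rule over a handful of made-up customer records, using the common definitions: support is the fraction of all records where the whole rule holds, and confidence is the fraction of records matching the "if" part that also match the "then" part. The item labels and `rule_metrics` are hypothetical names.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical customer records; each set lists the properties that hold.
customers = [
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25"},
    {"rents>2y"},
    {"age>25"},
]

support, confidence = rule_metrics(customers, {"rents>2y", "age>25"}, {"buys"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here two of the five customers satisfy the whole rule (support 0.40), and two of the three customers who match the "if" part go on to buy (confidence about 0.67).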

Deviation Detection

Deviation detection is a relatively new technique in terms of commercially available data mining tools.

It identifies outliers, which express deviation from some previously known expectation and norm.

This operation can be performed using statistics and visualization techniques. For example, linear regression facilitates the identification of outliers in data while modern visualization techniques display summaries and graphical representations that make deviations easy to detect.

Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
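
One simple statistical way to flag deviations is a z-score test: a value is treated as an outlier when it lies more than a chosen number of standard deviations from the mean. This is only one possible statistic, chosen here for brevity rather than taken from the slides; the claim amounts, the threshold of 2.0, and the function name are assumptions.

```python
import statistics


def find_outliers(values, z_threshold=2.0):
    """Return the values whose z-score exceeds `z_threshold` in absolute value."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > z_threshold]


# Hypothetical insurance claim amounts with one suspicious value.
claims = [120, 130, 125, 118, 122, 127, 900]
print(find_outliers(claims))
```

Because the outlier itself inflates the mean and standard deviation, small samples need a modest threshold; more robust statistics (for example, based on the median) are often preferred in practice.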

The Data Mining Process

In 1996 a consortium of vendors and users developed a specification called the Cross Industry Standard Process for Data Mining (CRISP-DM).

CRISP-DM specifies a data mining process that is not specific to any particular industry or tool.

CRISP-DM has evolved from the knowledge discovery processes used widely in industry and in direct response to user requirements.

The major aims of CRISP-DM are to make large data mining projects run more efficiently as well as to make them cheaper, more reliable, and more manageable.

The CRISP-DM Model

The CRISP-DM methodology is a hierarchical process model.

At the top level, the process is divided into six different generic phases, ranging from business understanding to deployment of project results.

The next level elaborates each of these phases as comprising several generic tasks. At this level, the description is generic enough to cover all the DM scenarios.

The third level specializes these tasks for specific situations. For example, the generic task might be cleaning data, and the specialized task could be cleaning of numeric or categorical values.

The fourth level is the process instance, that is, a record of actions, decisions, and results of an actual execution of a DM project.

The model also discusses relationships between different DM tasks.

The CRISP-DM Phases

Business understanding – determine business objectives, assess situation, determine data mining goal, and produce a project plan.

Data understanding – collect initial data, describe data, explore data, and verify data quality.

Data preparation – select data, clean data, construct data, integrate data, and format data.

Modeling – select modeling technique, generate test design, build model, and assess model.

Evaluation – evaluate results, review process, and determine next step.

Deployment – plan deployment, plan monitoring and maintenance, produce final report, and review project.
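
The six phases and their generic tasks form the top two levels of the CRISP-DM hierarchy, and can be written down as a simple ordered mapping, for example to drive a project checklist. This is an illustrative data structure only; `CRISP_DM_PHASES` is a hypothetical name and not part of any CRISP-DM tooling.

```python
# Phase -> generic tasks, in process order (dicts preserve insertion order).
CRISP_DM_PHASES = {
    "Business understanding": [
        "determine business objectives",
        "assess situation",
        "determine data mining goal",
        "produce project plan",
    ],
    "Data understanding": [
        "collect initial data",
        "describe data",
        "explore data",
        "verify data quality",
    ],
    "Data preparation": [
        "select data",
        "clean data",
        "construct data",
        "integrate data",
        "format data",
    ],
    "Modeling": [
        "select modeling technique",
        "generate test design",
        "build model",
        "assess model",
    ],
    "Evaluation": [
        "evaluate results",
        "review process",
        "determine next step",
    ],
    "Deployment": [
        "plan deployment",
        "plan monitoring and maintenance",
        "produce final report",
        "review project",  # the CRISP-DM 1.0 guide names this task "review project"
    ],
}

for number, (phase, tasks) in enumerate(CRISP_DM_PHASES.items(), start=1):
    print(f"{number}. {phase}: {', '.join(tasks)}")
```

The third and fourth levels (specialized tasks and process instances) would hang off each generic task, for instance as nested lists of project-specific steps and their recorded outcomes.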

Data Mining Tools

There are a growing number of commercial data mining tools on the marketplace.

The important features of data mining tools include:
- Data preparation
- Selection of data mining operations (algorithms)
- Product scalability and performance
- Facilities for understanding results

Thank you