testing

ISQS 6347, Data & Text Mining. Lecture Notes 1: Introduction to Data Mining. Zhangxi Lin, ISQS 6347, Texas Tech University

Upload: sankett

Post on 02-Nov-2014


Page 1: Testing

ISQS 6347, Data & Text Mining

Lecture Notes 1: Introduction to Data Mining

Zhangxi Lin

ISQS 6347

Texas Tech University

Page 2: Testing

What is Data Mining?

Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. (Berry and Linoff, 1997, 2000)
– Data Mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)

Page 3: Testing

Data Mining Process

Page 4: Testing

What is Text Mining?

Discover useful and previously unknown “gems” of information in large text collections.

Page 5: Testing

Motivation for Text Mining

Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation).
Information-intensive business processes demand that we move from simple document retrieval to “knowledge” discovery.

[Pie chart: Unstructured or Semi-structured Information, 90%; Structured Numerical or Coded Information, 10%]

Page 6: Testing

Text Mining Process

Text Preprocessing
– Syntactic/Semantic Text Analysis
Features Generation
– Bag of Words
Feature Selection
– Simple Counting
– Statistics
Text/Data Mining
– Classification (Supervised Learning)
– Clustering (Unsupervised Learning)
Analyzing Results
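The feature generation and feature selection steps above can be sketched in a few lines. This is a minimal illustration, not the course's method: the tokenizer, the `min_docs` cutoff, and the third document are assumptions made for the example, while the first two documents are sentences quoted elsewhere in these notes.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase alphabetic runs are good enough as "words".
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    # Features generation: one word-count bag per document.
    return [Counter(tokenize(doc)) for doc in docs]

def select_by_document_frequency(bags, min_docs=2):
    # Feature selection by simple counting: keep terms that occur
    # in at least `min_docs` documents (hypothetical cutoff).
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

def to_vectors(bags, vocabulary):
    # Fixed-length count vectors, ready for classification or clustering.
    return [[bag[term] for term in vocabulary] for bag in bags]

docs = [
    "AOL merges with Time-Warner",
    "Time-Warner is bought by AOL",
    "Apple sells computers",  # hypothetical extra document
]
bags = bag_of_words(docs)
vocabulary = select_by_document_frequency(bags)
vectors = to_vectors(bags, vocabulary)
print(vocabulary)
print(vectors)
```

Note that the first two documents end up with identical vectors: a bag of words keeps the counts but discards word order.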

Page 7: Testing

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Page 8: Testing

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation

Page 9: Testing

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data

[Figure: Data Mining at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Page 10: Testing

How much is 12 Exabytes?

1,200,000 Libraries of Congress

Emerging data sources
– Medical images: potential 1 EB/year
– Video monitors: potential 100 EB/year

Sources:
– http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
– The Expanding Digital Universe, IDC white paper, March 2007

55% in personal PCs
16% in corporate data warehouses
Internet only 21 TB
Email 500x more than Internet / year
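As a rough check on the Library of Congress comparison on this slide, the arithmetic can be written out. Decimal units (1 EB = 10^18 bytes, 1 TB = 10^12 bytes) are an assumption here; the slide does not say which convention it uses.

```python
# Assumption: decimal (SI) units; the slide does not state the convention.
EXABYTE = 10**18
TERABYTE = 10**12

total_bytes = 12 * EXABYTE   # "12 Exabytes" from the slide
libraries = 1_200_000        # "1,200,000 Libraries of Congress" from the slide

bytes_per_library = total_bytes // libraries
terabytes_per_library = bytes_per_library / TERABYTE
print(terabytes_per_library)
```

Under that assumption, the slide's two figures imply about 10 TB per Library of Congress.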

Page 11: Testing

Data Mining Tasks

Prediction Methods. Use some variables to predict unknown or future values of other variables.
– Classification
– Regression
– Deviation Detection
Description Methods. Find human-interpretable patterns that describe the data.
– Clustering
– Association Rule Discovery
– Sequential Pattern Discovery

Page 12: Testing

Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Page 13: Testing

Classification Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
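To make the train-then-predict flow concrete, here is a minimal sketch that uses the records from the slide. The slides do not say which classifier is learned, so a 1-nearest-neighbour rule is swapped in as the simplest possible stand-in; the distance function and the `INCOME_SCALE` constant are assumptions made for this example.

```python
# Training set from the slide: (refund, marital status, taxable income in K, cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabeled records from the slide (class shown as "?").
UNLABELED = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

INCOME_SCALE = 100.0  # assumption: 100K of income counts like one categorical mismatch

def distance(a, b):
    # Count categorical mismatches, then add the scaled income difference.
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / INCOME_SCALE

def classify(record, training=TRAINING):
    # 1-nearest-neighbour: copy the class of the closest training record.
    nearest = min(training, key=lambda row: distance(record, row))
    return nearest[3]

def accuracy(labeled, training):
    # Fraction of labeled records whose predicted class matches the true class.
    correct = sum(classify(row[:3], training) == row[3] for row in labeled)
    return correct / len(labeled)

predictions = [classify(record) for record in UNLABELED]
print(predictions)

# Hold out the last three labeled records to estimate accuracy.
holdout_accuracy = accuracy(TRAINING[7:], TRAINING[:7])
print(holdout_accuracy)
```

With only ten labeled records the hold-out estimate is very noisy; the point of the sketch is the split into a model-building set and a validation set, not the particular classifier.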

Page 14: Testing

Classification: Application 1

Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
    Use the data for a similar product introduced before.
    We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      – Type of business, where they stay, how much they earn, etc.
    Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 15: Testing

Classification: Application 2

Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
    Use credit card transactions and the information on the account holder as attributes.
      – When does a customer buy, what does he buy, how often does he pay on time, etc.
    Label past transactions as fraud or fair transactions. This forms the class attribute.
    Learn a model for the class of the transactions.
    Use this model to detect fraud by observing credit card transactions on an account.

Page 16: Testing

Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
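The definition above does not name an algorithm. As one common way to realize it with Euclidean distance, here is a small k-means sketch; the choice of k-means, the sample points, and the caller-supplied starting centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    # Assumption: the caller supplies the starting centroids.
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical 3-D points forming two well-separated groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (8, 8, 8), (9, 8, 8), (8, 9, 9)]
labels, centroids = kmeans(points, centroids=[points[0], points[-1]])
print(labels)
print(centroids)
```

`math.dist` requires Python 3.8 or later. For attributes that are not continuous, the distance call is the piece to replace with a problem-specific measure.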

Page 17: Testing

Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

[Figure: clusters of points in 3-D space; intracluster distances are minimized, intercluster distances are maximized]

Page 18: Testing

Clustering Example

Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
    Collect different attributes of customers based on their geographical and lifestyle related information.
    Find clusters of similar customers.
    Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Page 19: Testing

Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection;
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
    {Milk} --> {Coke}
    {Diaper, Milk} --> {Beer}
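How strongly the five transactions back each discovered rule can be checked by counting. The slide does not name any measures; support and confidence, the usual ones for association rules, are used here as an assumption, and the function names are invented for the example.

```python
# The five transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions=TRANSACTIONS):
    # Number of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(antecedent, consequent, transactions=TRANSACTIONS):
    # Fraction of all transactions containing antecedent and consequent together.
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    # Among transactions containing the antecedent, the fraction that
    # also contain the consequent.
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs, rhs),
          "confidence", confidence(lhs, rhs))
```

For {Milk} --> {Coke}, three of the five transactions contain both items and three of the four Milk transactions contain Coke, so support is 0.6 and confidence is 0.75. For {Diaper, Milk} --> {Beer}, the counts are two of five and two of three.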

Page 20: Testing

Association Rule Discovery Example

Marketing and Sales Promotion:
– Let the rule discovered be
    {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Page 21: Testing

Regression

Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and neural network fields.
Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
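For the linear case, the idea can be sketched with ordinary least squares on one predictor. The advertising and sales figures below are hypothetical and chosen to lie exactly on a line; they are not from the course.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    intercept, slope = model
    return intercept + slope * x

# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

model = fit_line(advertising, sales)
print(model)
print(predict(model, 6))
```

Real data will not sit exactly on a line, and a nonlinear dependency would need a different model form; the fitting and predicting steps stay the same.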

Page 22: Testing

Deviation/Anomaly Detection

Detect significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day
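One simple way to flag "significant deviations" is a robust score based on the median and the median absolute deviation (MAD). The slides do not prescribe a method; this technique, the 3.5 threshold, and the daily connection counts below are assumptions chosen for illustration.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    # Modified z-score: 0.6745 * |x - median| / MAD.
    # Values scoring above `threshold` are flagged as deviations.
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical daily connection counts, in millions.
daily_connections = [100, 102, 98, 101, 99, 100, 500]
flagged = mad_outliers(daily_connections)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the standard deviation enough to hide itself in a sample this small.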

Page 23: Testing

Text Mining Tasks

Exploratory Data Analysis
– Using text to form hypotheses about diseases (Swanson and Smalheiser, 1997).
Information Extraction
– (Semi)automatically create (domain specific) knowledge bases, and then use standard data-mining techniques.
– Bootstrapping methods (Riloff and Jones, 1999).
Text Classification
– Useful intermediary step for information extraction.
– Bootstrapping method using EM (Nigam et al., 2000).

Page 24: Testing

Example: Decision Support using Bank Call Center Data

The Needs:
– Analysis of call records as input into decision-making process of Bank’s management
– Quick answers to important questions
    Which offices receive the most angry calls?
    What products have the fewest satisfied customers?
    (“Angry” and “Satisfied” are recognizable sentiments)
– User friendly interface and visualization tools

Page 25: Testing

Example: Decision Support using Bank Call Center Data

The Information Source:
– Call center records
– Example:

AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, “Mr. Stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account.”
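A record like the one above could be turned into a sentiment label roughly as follows. This is only a sketch: the slides do not document the field layout, so the code assumes just that the free-text note is the last field, and both word lists are hypothetical.

```python
import re

# Hypothetical sentiment lexicons; a real system would need far richer ones.
ANGRY_WORDS = {"hates", "angry", "furious", "upset"}
SATISFIED_WORDS = {"happy", "pleased", "satisfied", "thanks"}

def split_record(record):
    # Assumption: ten comma-separated code fields, then the free-text note.
    return record.split(", ", 10)

def sentiment(note):
    # Label a note by whichever lexicon matches more of its words.
    words = set(re.findall(r"[a-z]+", note.lower()))
    angry = len(words & ANGRY_WORDS)
    satisfied = len(words & SATISFIED_WORDS)
    if angry > satisfied:
        return "angry"
    if satisfied > angry:
        return "satisfied"
    return "neutral"

record = (
    'AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, '
    '"Mr. Stark has been with the company for about 20 yrs. He hates his '
    'stmt format and wishes that we would show a daily balance to help him '
    'know when he falls below the required balance on the account."'
)
fields = split_record(record)
note = fields[-1]
print(fields[6], sentiment(note))
```

Counting labels per office or per product over many such records would then answer questions such as which offices receive the most angry calls.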

Page 26: Testing

Challenges of Data Mining

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data

Page 27: Testing

Challenges of Text Mining

Very high number of possible “dimensions”
– All possible word and phrase types in the language!!
Unlike data mining:
– records (= docs) are not structurally identical
– records are not statistically independent
Complex and subtle relationships between concepts in text
– “AOL merges with Time-Warner”
– “Time-Warner is bought by AOL”
Ambiguity and context sensitivity
– automobile = car = vehicle = Toyota
– Apple (the company) or apple (the fruit)

Page 28: Testing

SAS Training/Self-taught Courses

Getting Started with SAS® Enterprise Miner 4.3, 132p (EM_GS_7281.PDF)
Getting Started with SAS® 9.1 Text Miner, 60p (EM_TMGS_7693.PDF)
Data Mining - A Case Study Approach, 135p
Text Mining Using SAS® Software, 274p (DMTM.PDF)
Applying Data Mining Techniques Using Enterprise Miner, 308p (ADMT_001.PDF)
Effective Web Mining: Attracting and Keeping Valued Cyber Consumers, 632p (CCWEB_TKIT.PDF)