data mining intro-2009-v2

62
Data Mining Introduction Prithwis Mukerjee, Ph.D.

Upload: prithwis-mukerjee

Post on 11-May-2015

1.129 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Data mining intro-2009-v2

Data Mining

Introduction

Prithwis Mukerjee, Ph.D.

Page 2: Data mining intro-2009-v2

Prithwis Mukerjee 2

Why Suddenly Data Mining ?

Fluid, dynamic business environment Markets are, or seem to be, saturated Customers are aggresive and disloyal Speed is essential

“The quick and the dead”

Availability of data Vast amounts of data are generated, stored

electronically and are waiting to be processed !

Availability of tools and techniques Mathematical tools are available to “process”

this data This processing is significantly different from MIS / EDP

style of data processing

Software tools are available to implement some of these mathematical models

Page 3: Data mining intro-2009-v2

Prithwis Mukerjee 3

The Business Environment

Customer Behaviour Customers have access to more channels

Newer retail formats Online stores

Customers have access to more suppliers Increased commoditisation Customer loyalty is not assured any more

Market Saturation Multiple suppliers operating in each market Niche market based on demographics and preferences

Competition has intensified

Need for speed Product life cycles are getting shorter Time to market

Page 4: Data mining intro-2009-v2

Prithwis Mukerjee 4

New way of looking at customer

Customer Relationship Management Intimacy, collaboration, one-to-one partnerships

are necessary

Need to ask ... What classes of customers do we have ? Are

there subclasses in terms of behaviour ? How can we sell more to existing customers ?

What exactly are they buying now ? Is there a pattern in the way our customers

behave ? Who are my good customers ?

To whom we should sell more Who are my bad customers ?

Who are likely to default or defraud ?

Page 5: Data mining intro-2009-v2

Prithwis Mukerjee 5

Availability of vast amounts of data

ERP and OLTP systems With their centralised RDBMSs are huge pools

of firmwide data that can overwhelm even the most dedicated manager

Datawarehouse Technology has resulted in equally huge pools

of historical data

Storage Capacity Inexpensive Ultra high capacity

Page 6: Data mining intro-2009-v2

Prithwis Mukerjee 6

Availability of vast amounts of data

Cards Credit & Debit Cards Loyalty Card Result in capture of huge pools of data

Transactional Data Capture Point of sales systems, Bar code readers Capture vast amount of transaction data at

increasing levels of granularity What was sold ? Product, SKU When was it sold ? Date, time How was it sold ? Discount, Promotions

Beyond simple sales Telephone calls, frequent flyer data

Page 7: Data mining intro-2009-v2

Prithwis Mukerjee 7

Trends leading to Data Flood

More data is generated: Web, text, images … Business transactions,

calls, ... Scientific data:

astronomy, biology, etc

More data is captured: Storage technology

faster and cheaper DBMS can handle

bigger DB

Page 8: Data mining intro-2009-v2

Prithwis Mukerjee 8

Data Growth

In 2 years (2003 to 2005), the size of the largest database TRIPLED!

Page 9: Data mining intro-2009-v2

Prithwis Mukerjee 9

Coming to the point ....

Data mining Is the process of extracting unknown, valid

and actionable information from large databases and then using this information to make crucial business decisions.

DatabaseData Mining Tools

Data Presentationand Visualisation Tools

Decisions

Page 10: Data mining intro-2009-v2

Prithwis Mukerjee 10

Knowledge Discovery Definition

Knowledge Discovery in Data is the

non-trivial process of identifying valid novel potentially useful and ultimately understandable patterns in data.

from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996

Page 11: Data mining intro-2009-v2

Prithwis Mukerjee 11

Old Wine in new bottles ?

Are these the same as data mining ? SQL queries against large databases Multidimensional database analysis Online analytical processing Sophisticated graphic visualisation “Classical” statistical analysis

ANOVA ? Regression ? Correrelation ?

What is missing ? Discovery of information without a previously

articulated or formulated hypothesis

Page 12: Data mining intro-2009-v2

Prithwis Mukerjee 12

Related Fields

Statistics

MachineLearning

Databases

Visualization

Data Mining and Knowledge Discovery

Page 13: Data mining intro-2009-v2

Prithwis Mukerjee 13

Statistics, Machine Learning, Data Mining

Statistics: more theory-based more focused on testing hypotheses

Machine learning more heuristic focused on improving performance of a learning agent also looks at real-time learning and robotics – areas not

part of data mining

Data Mining and Knowledge Discovery integrates theory and heuristics focus on the entire process of knowledge discovery,

including data cleaning, learning, and integration and visualization of results

Distinctions are fuzzy

Page 14: Data mining intro-2009-v2

Data Mining Tasks

Page 15: Data mining intro-2009-v2

Prithwis Mukerjee 15

Some Definitions

Instance (also Item or Record): an example, described by a number of

attributes, e.g. a day can be described by temperature,

humidity and cloud status

Attribute or Field measuring aspects of the Instance, e.g.

temperature

Class (Label) grouping of instances, e.g. days good for

playing

Page 16: Data mining intro-2009-v2

Prithwis Mukerjee 16

Major Data Mining Tasks

Classification predicting an item

class

Clustering finding clusters in

data

Associations e.g. A & B & C occur

frequently

Visualization to facilitate human

discovery

Summarization describing a group

Deviation Detection finding changes

Estimation predicting a

continuous value

Link Analysis finding relationships

And So on ...

Page 17: Data mining intro-2009-v2

Prithwis Mukerjee 17

Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...

Page 18: Data mining intro-2009-v2

Prithwis Mukerjee 18

Clustering

Find “natural” grouping of instances given un-labeled data

Page 19: Data mining intro-2009-v2

Prithwis Mukerjee 19

Association Rules & Frequent Itemsets

TID Produce

1 MILK, BREAD, EGGS

2 BREAD, SUGAR

3 BREAD, CEREAL

4 MILK, BREAD, SUGAR

5 MILK, CEREAL

6 BREAD, CEREAL

7 MILK, CEREAL

8 MILK, BREAD, CEREAL, EGGS

9 MILK, BREAD, CEREAL

TransactionsFrequent Itemsets:

Milk, Bread (4)Bread, Cereal (3)Milk, Bread, Cereal (2)…

Rules:Milk => Bread (66%)

Page 20: Data mining intro-2009-v2

Prithwis Mukerjee 20

Visualization & Data Mining

Visualizing the data to facilitate human discovery

Presenting the discovered results in a visually "nice" way

Page 21: Data mining intro-2009-v2

Prithwis Mukerjee 21

Summarization

Describe features of the selected group

Use natural language and graphics

Usually in Combination with Deviation detection or other methods

Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...

Page 22: Data mining intro-2009-v2

Prithwis Mukerjee 22

Data Mining Central Quest

Find true patterns and avoid overfitting (finding seemingly signifcantbut really random patterns due to searching too many possibilites)

Page 23: Data mining intro-2009-v2

Classification Methods

Page 24: Data mining intro-2009-v2

Prithwis Mukerjee 24

Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Regression, Decision Trees,Bayesian,Neural Networks, ...

Given a set of points from classes what is the class of new point ?

Page 25: Data mining intro-2009-v2

Prithwis Mukerjee 25

Classification: Linear Regression

Linear Regression

w0 + w1 x + w2 y >= 0

Regression computes wi from data to minimize squared error to ‘fit’ the data

Not flexible enough

Page 26: Data mining intro-2009-v2

Prithwis Mukerjee 26

Regression for Classification

Any regression technique can be used for classification Training: perform a regression for each class, setting

the output to 1 for training instances that belong to class, and 0 for those that don’t

Prediction: predict class corresponding to model with largest output value (membership value)

For linear regression this is known as multi-response linear regression

Page 27: Data mining intro-2009-v2

Prithwis Mukerjee 27

Classification: Decision Trees

X

Y

if X > 5 then blueelse if Y > 3 then blueelse if X > 2 then greenelse blue

52

3

Page 28: Data mining intro-2009-v2

Prithwis Mukerjee 28

DECISION TREE

An internal node is a test on an attribute.

A branch represents an outcome of the test, e.g., Color=red.

A leaf node represents a class label or class label distribution.

At each node, one attribute is chosen to split training examples into distinct classes as much as possible

A new instance is classified by following a matching path to a leaf node.

Page 29: Data mining intro-2009-v2

Prithwis Mukerjee 29

Weather Data: Play or not Play?

Notruehighmildrain

Yesfalsenormalhotovercast

Yestruehighmildovercast

Yestruenormalmildsunny

Yesfalsenormalmildrain

Yesfalsenormalcoolsunny

Nofalsehighmildsunny

Yestruenormalcoolovercast

Notruenormalcoolrain

Yesfalsenormalcoolrain

Yesfalsehighmildrain

Yesfalsehighhotovercast

Notruehighhotsunny

Nofalsehighhotsunny

Play?WindyHumidityTemperatureOutlook

Note:Outlook is theForecast,no relation to Microsoftemail program

Page 30: Data mining intro-2009-v2

Prithwis Mukerjee 30

overcast

high normal falsetrue

sunny rain

No NoYes Yes

Yes

Example Tree for “Play?”

Outlook

HumidityWindy

Page 31: Data mining intro-2009-v2

Prithwis Mukerjee 31

Classification: Neural Nets

Can select more complex regions

Can be more accurate

Also can overfit the data – find patterns in random noise

Page 32: Data mining intro-2009-v2

Prithwis Mukerjee 32

Classification: other approaches

Naïve Bayes

Rules

Support Vector Machines

Genetic Algorithms

See www.KDnuggets.com/software/

Page 33: Data mining intro-2009-v2

Prithwis Mukerjee 33

Direct Marketing Paradigm

Find most likely prospects to contact

Not everybody needs to be contacted

Number of targets is usually much smaller than number of prospects

Typical Applications retailers, catalogues, direct mail (and e-mail)

customer acquisition, cross-sell, attrition prediction

...

Page 34: Data mining intro-2009-v2

Prithwis Mukerjee 34

Direct Marketing Evaluation

Accuracy on the entire dataset is not the right measure

Approach

develop a target model

score all prospects and rank them by decreasing score

select top P% of prospects for action

How do we decide what is the best subset of prospects ?

Page 35: Data mining intro-2009-v2

Prithwis Mukerjee 35

Model-Sorted List

…4897N0.925

2422

2734

3820

2478

1024

1746

CustID

N0.06100

…N0.1199

Age

Y0.934

……

Y0.943

N0.952

Y0.971

Target

Score

No

Use a model to assign score to each customerSort customers by decreasing scoreExpect more targets (hits) near the top of the list

3 hits in top 5% of the list

If there 15 targets overall, then top 5 has 3/15=20% of targets

Page 36: Data mining intro-2009-v2

Data Mining Applications

Page 37: Data mining intro-2009-v2

Prithwis Mukerjee 37

Data Mining Applications

Science: Chemistry, Physics, Medicine Biochemical

analysis Remote sensors on

a satellite Telescopes – star

galaxy classification

Medical Image analysis

Bioscience Sequence-based

analysis Protein structure

and function prediction

Protein family classification

Microarray gene expression

Page 38: Data mining intro-2009-v2

Prithwis Mukerjee 38

Microarrays: Classifying Leukemia

Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999

72 examples (38 train, 34 test), about 7,000 genes

ALL AML

Visually similar, but genetically very different

Best Model: 97% accuracy,1 error (sample suspected mislabelled)

Page 39: Data mining intro-2009-v2

Prithwis Mukerjee 39

Microarray Potential Applications

New and better molecular diagnostics Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip,

based on Affymetrix technology

New molecular targets for therapy few new drugs, large pipeline, …

Improved treatment outcome Partially depends on genetic signature

Fundamental Biological Discovery finding and refining biological pathways

Personalized medicine ?!

Page 40: Data mining intro-2009-v2

Prithwis Mukerjee 40

Pharmaceutical companies, Insurance and Health care, Medicine Drug development Identify successful

medical therapies Claims analysis,

fraudulent behavior Medical diagnostic

tools Predict office visits

Data Mining Applications

Financial Industry, Banks, Businesses, E-commerce Stock and

investment analysis Identify loyal

customers vs. risky customer

Predict customer spending

Risk management Sales forecasting

Page 41: Data mining intro-2009-v2

Prithwis Mukerjee 41

Retail and Marketing Customer buying

patterns/demographic characteristics

Mailing campaigns Market basket

analysis Trend analysis

Data Mining Applications

Page 42: Data mining intro-2009-v2

Prithwis Mukerjee 42

Application: Direct Marketing and CRM Most major direct marketing companies are

using modeling and data mining

Most financial companies are using customer modeling

Modeling is easier than changing customer behaviour

Example

Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $

Page 43: Data mining intro-2009-v2

Prithwis Mukerjee 43

Application: Security and Fraud Detection Credit Card Fraud Detection

over 20 Million credit cards protected by Neural networks (Fair, Isaac)

Securities Fraud Detection

NASDAQ KDD system

Phone fraud detection

AT&T, Bell Atlantic, British Telecom/MCI

Page 44: Data mining intro-2009-v2

Prithwis Mukerjee 44

Fraud Detection and Management (1)

Applications widely used in health care, retail, credit card

services, telecommunications (phone card fraud), etc.

Approach use historical data to build models of fraudulent

behavior and use data mining to help identify similar instances

Examples auto insurance: detect a group of people who stage

accidents to collect on insurance money laundering: detect suspicious money transactions

(US Treasury's Financial Crimes Enforcement Network) medical insurance: detect professional patients and ring

of doctors and ring of references

Page 45: Data mining intro-2009-v2

Prithwis Mukerjee 45

Fraud Detection and Management (2)

Detecting inappropriate medical treatment Australian Health Insurance Commission

identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud Telephone call model: destination of the call,

duration, time of day or week. Analyze patterns that deviate from an expected norm.

British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Retail Analysts estimate that 38% of retail shrink is due

to dishonest employees.

Page 46: Data mining intro-2009-v2

Prithwis Mukerjee 46

Application: e-Commerce

Amazon.com recommendations

if you bought (viewed) X, you are likely to buy Y

Netflix

If you liked "Monty Python and the Holy Grail",

you get a recommendation for "This is Spinal Tap"

Comparison shopping

Froogle, mySimon, Yahoo Shopping, …

Page 47: Data mining intro-2009-v2

Prithwis Mukerjee 47

Example : Processing Loan Applications

Given: questionnaire with financial and personal information

Problem: should money be lend?

Borderline cases referred to loan officers

But: 50% of accepted borderline cases defaulted!

Solution: reject all borderline cases?

Borderline cases are most active customers!

Page 48: Data mining intro-2009-v2

Prithwis Mukerjee 48

Enter Machine Learning

Given: 1000 training examples of borderline cases

20 attributes: age, years with current employer,years at

current address, years with the bank, years at current job, other credit cards

Learned rules predicted 2/3 of borderline cases correctly!

Rules could be used to explain decisions to customers

Page 49: Data mining intro-2009-v2

Prithwis Mukerjee 49

Case study 2:Screening images

Given: radar satellite images of coastal waters

Problem: detecting oil slicks in those images

Oil slicks = dark regions with changing size and shape

Look-alike dark regions can be caused by weather conditions (e.g. high wind)

Expensive process requiring highly trained personnel

Page 50: Data mining intro-2009-v2

Prithwis Mukerjee 50

Dark regions extracted from normalized image

Attributes: size of region, shape, area, intensity, sharpness

and jaggedness of boundaries, proximity of other regions, info about background

Constraints: Scarcity of training examples (oil slicks are

rare!) Unbalanced data: most dark regions aren’t oil

slicks Regions from same image form a batch Requirement is adjustable false-alarm rate

Enter Machine Learning

Page 51: Data mining intro-2009-v2

Prithwis Mukerjee 51

Data Mining Applications ..

Prediction & Description Would this customer buy this product ? Is this customer likely to leave ?

Relationship Marketing What kind of products have been bought by

this customer ? What kind of marketing strategy has this

customer responded to ?

Outlier identification and Fraud detection Locating unusual cases and behaviours

Customer Profiling & Segmentation Is the bottomline that we are all looking at ...

Page 52: Data mining intro-2009-v2

Prithwis Mukerjee 52

Data Mining Challenges

Computationally expensive to investigate all possibilities

Dealing with noise/missing information and errors in data

Choosing appropriate attributes/input representation

Finding the minimal attribute space

Finding adequate evaluation function(s)

Extracting meaningful information

Not overfitting

Page 53: Data mining intro-2009-v2

Prithwis Mukerjee 53

Are All “Discovered” Patterns Interesting?

Interestingness measures: A pattern is interesting if

it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user

Objective vs. subjective measures: Objective: based on statistics and structures of

patterns support and confidence

Subjective: based on user’s belief in the data unexpectedness, novelty, action ability, etc.

Completeness - Find all the interesting patterns Can a data mining system find all the interesting patterns? Association vs. classification vs. clustering

Page 54: Data mining intro-2009-v2

Privacy Issues

Page 55: Data mining intro-2009-v2

Prithwis Mukerjee 55

Data Mining, Privacy, and Security

TIA: Terrorism (formerly Total) Information Awareness Program – TIA program closed by Congress in 2003 because of

privacy concerns

However, in 2006 we learn that NSA is analyzing US domestic call info to find potential terrorists Invasion of Privacy or Needed Intelligence?

Page 56: Data mining intro-2009-v2

Prithwis Mukerjee 56

Criticism of Analytic Approaches to Threat Detection:

Data Mining will be ineffective - generate millions of false positives and invade privacy

First, can data mining be effective?

Page 57: Data mining intro-2009-v2

Prithwis Mukerjee 57

Can Data Mining and Statistics be Effective for Threat Detection?

Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives

Reality: Analytical models correlate many items of information to reduce false positives.

Example: Identify one biased coin from 1,000. After one throw of each coin, we cannot After 30 throws, one biased coin will stand out

with high probability. Can identify 19 biased coins out of 100 million

with sufficient number of throws

Page 58: Data mining intro-2009-v2

Prithwis Mukerjee 58

Another Approach: Link Analysis

Can find unusual patterns in the network structure

Page 59: Data mining intro-2009-v2

Prithwis Mukerjee 59

Analytic technology can be effective

Data Mining is just one additional tool to help analysts

Combining multiple models and link analysis can reduce false positives

Today there are millions of false positives with manual analysis

Analytic technology has the potential to reduce the current high rate of false positives

Page 60: Data mining intro-2009-v2

Prithwis Mukerjee 60

Data Mining with Privacy

Data Mining looks for patterns, not people!

Technical solutions can limit privacy invasion Replacing sensitive personal data with anon.

ID Give randomized outputs Multi-party computation – distributed data …

Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003

Page 61: Data mining intro-2009-v2

Prithwis Mukerjee 61

Summary

Data Mining and Knowledge Discovery are needed to deal with the flood of data

Knowledge Discovery is a process !

Avoid overfitting (finding random patterns by searching too many possibilities)

Page 62: Data mining intro-2009-v2

Prithwis Mukerjee 62

Additional Resources

www.KDnuggets.com data mining software, jobs, courses, etc

www.acm.org/sigkddACM SIGKDD – the professional society for data mining