
Upload: lisa-norman

Post on 27-Dec-2015


Page 1: Motivation: Why Data Mining? Holy Grail - Informed Decision Making Lots of Data are Being Collected – Business - Transactions, Web logs, GPS-track, … –

Motivation: Why Data Mining?

• Holy Grail: Informed Decision Making
• Lots of Data are Being Collected
  – Business: transactions, Web logs, GPS tracks, …
  – Science: remote sensing, micro-array gene expression data, …
• Challenges:
  – Volume (data) >> number of human analysts
  – Some automation needed
• Limitations of Relational Databases
  – Cannot predict the future! (questions about items not in the database!)
    • Ex. Predict tomorrow’s weather or the credit-worthiness of a new customer
  – Cannot compute transitive closure and more complex questions
    • Ex. What are natural groups of customers?
    • Ex. Which subsets of items are bought together?
• Data Mining may help!
  – Provide better and customized insights for business
  – Help scientists with hypothesis generation

Page 2

Motivation for Data Mining

• Understanding of a (new) phenomenon
• Discovery of a model may be aided by patterns
  – Ex. 1854 London:
    • Cholera deaths clustered around a water pump
    • Narrow down potential causes
    • Change hypothesis: Miasma => Water-borne
• Though, the final model may not involve patterns
  – Cause-effect, e.g., cholera is caused by germs

Page 3

Data Mining: Definition

• The process of discovering
  – interesting, useful, non-trivial patterns
    • patterns: non-specialist
    • exception to patterns: specialist
  – from large datasets
• Pattern families
  1. Clusters
  2. Outliers, Anomalies
  3. Associations, Correlations
  4. Classification and Prediction models
  5. …

Page 4

What’s NOT Data Mining

• Simple querying or summarization of data
  – Find the number of Subaru drivers in Ramsey County
  – Search space is not large (not exponential)
• Testing a hypothesis via a primary data analysis
  – Ex. Do Subaru drivers vote for Democrats?
  – Search space is not large!
  – DM: secondary data analysis to generate multiple plausible hypotheses
• Uninteresting or obvious patterns in data
  – Minneapolis and St. Paul have similar climates
  – Common knowledge: nearby places have similar climates!

Page 5

Context of Data Mining Models

• CRISP-DM (CRoss-Industry Standard Process for DM)
  – Application/Business Understanding
  – Data Understanding
  – Data Preparation
  – Modeling
  – Evaluation
  – Deployment

http://www.crisp-dm.org

[Figure: Phases of CRISP-DM]

Page 6

Outline

• Clustering
• Outlier Detection
• Association Rules
• Classification & Prediction
• Summary

Page 7

Clustering: What are natural groups of employees?

R Id  Age  Years of Service
A     30   5
B     50   25
C     50   15
D     25   5
E     30   10
F     55   25

K = 2

Page 8

Clustering: Geometric View shows 2 groups!

[Figure: scatter plot of employees A to F from the table on Page 7; x-axis Age (ticks 30, 40, 50), y-axis Years of Service (ticks 10, 20); K = 2]

Page 9

K-Means Algorithm: 1. Start with random seeds

[Figure: Age vs. Years of Service scatter plot of employees A to F with two random seeds marked; K = 2]

Page 10

K-Means Algorithm: 2. Assign points to closest seed

[Figure: two Age vs. Years of Service scatter plots with the seeds marked; in the second, the color of each point shows its closest seed; K = 2]

Page 11

K-Means Algorithm: 3. Revise seeds to group centers

[Figure: Age vs. Years of Service scatter plot with the revised seeds marked; K = 2]

Page 12

K-Means Algorithm: 2. Assign points to closest seed

[Figure: two Age vs. Years of Service scatter plots; the first shows the revised seeds, the second colors each point by its closest seed; K = 2]

Page 13

K-Means Algorithm: 3. Revise seeds to group centers

[Figure: two Age vs. Years of Service scatter plots; the first shows the revised seed, the second colors each point by its closest seed; K = 2]

Page 14

K-Means Algorithm: If seeds changed then loop back to step 2. Assign points to closest seed

[Figure: Age vs. Years of Service scatter plot; colors show the closest seed; K = 2]

Page 15

K-Means Algorithm: 3. Revise seeds to group centers

[Figure: two Age vs. Years of Service scatter plots; the first colors each point by its closest seed, the second is labeled "Termination"; K = 2]
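The K-Means steps walked through on the preceding slides (pick seeds, assign points to the closest seed, revise seeds to group centers, loop while the seeds change) can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `k_means` is hypothetical, Euclidean distance is assumed, and the seeds are fixed at employees A and B instead of being random so that the run is repeatable.

```python
from math import dist

# Employee table from the slides: id -> (Age, Years of Service)
employees = {
    "A": (30, 5), "B": (50, 25), "C": (50, 15),
    "D": (25, 5), "E": (30, 10), "F": (55, 25),
}


def k_means(points, seeds, max_iter=100):
    """Return (groups, seeds) once the seeds stop changing."""
    seeds = [tuple(s) for s in seeds]
    groups = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest seed.
        groups = [[] for _ in seeds]
        for name, p in points.items():
            closest = min(range(len(seeds)), key=lambda i: dist(p, seeds[i]))
            groups[closest].append(name)
        # Step 3: revise each seed to the center (mean) of its group.
        revised = []
        for i, members in enumerate(groups):
            if not members:  # keep an unused seed where it is
                revised.append(seeds[i])
                continue
            xs = [points[m][0] for m in members]
            ys = [points[m][1] for m in members]
            revised.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        # Termination: stop when no seed moved, otherwise loop back to step 2.
        if revised == seeds:
            break
        seeds = revised
    return groups, seeds


# Assumption: seeds start at A and B (the slides use random seeds).
groups, centers = k_means(employees, seeds=[employees["A"], employees["B"]])
print(groups)
```

With these assumed seeds the loop settles on the groups {A, D, E} and {B, C, F}; other starting seeds may need more iterations or settle differently.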

Page 16

Outline

• Clustering
• Outlier Detection
• Association Rules
• Classification & Prediction
• Summary

Page 17

Outliers – Global and local

• Ex. Traffic Data in Twin Cities
  – Abnormal Sensor 9

Page 18

Outlier Detection

• Distribution Tests
  – Global Outliers, i.e., different from population
  – Local Outliers, i.e., different from neighbors
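The difference between global and local outliers can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the readings are made-up numbers (not the Twin Cities traffic data), "neighbors" means the adjacent positions in a sequence, and the test is a simple z-score with a threshold of 2.

```python
from statistics import mean, pstdev


def global_outliers(values, threshold=2.0):
    """Indices whose z-score against the whole population exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def local_outliers(values, threshold=2.0):
    """Indices whose difference from the mean of the adjacent neighbors is unusual."""
    diffs = []
    for i, v in enumerate(values):
        neighbors = values[max(i - 1, 0):i] + values[i + 1:i + 2]
        diffs.append(v - mean(neighbors))
    # Apply the same z-score test to the neighbor differences.
    return global_outliers(diffs, threshold)


# Hypothetical readings: a steady rise with one spike at index 4.
trend = [10, 15, 20, 25, 55, 35, 40, 45, 50, 55]
# Hypothetical readings: a flat signal with one spike at index 4.
steady = [20, 21, 19, 20, 60, 21, 20, 19, 20, 21]

print(global_outliers(trend), local_outliers(trend))
print(global_outliers(steady), local_outliers(steady))
```

In the `trend` series the value 55 is ordinary for the population as a whole, so only the neighbor-based test flags it; in the `steady` series the spike stands out under both tests.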

Page 19

Outline

• Clustering
• Outlier Detection
• Association Rules
• Classification & Prediction
• Summary

Page 20

Associations: Which Items are bought together?

• Input: Transactions with Item-types
• Metrics balance computation cost and statistical interpretation!
  – Support: probability (Diaper and Beer in T) = 2/5
  – Confidence: probability (Beer in T | Diaper in T) = 2/2
• Algorithm Apriori [Agrawal, Srikant, VLDB94]
  – Support based pruning using monotonicity

Transaction  Items Bought
1            {socks, , milk, , beef, egg, …}
2            {pillow, , toothbrush, ice-cream, muffin, …}
3            { , , pacifier, formula, blanket, …}
…            …
n            {battery, juice, beef, egg, chicken, …}

Page 21

Apriori Algorithm: How to eliminate infrequent item-sets asap?

Transaction Id  Time   Item-types bought
1101            18:35  Milk, bread, cookies, juice
792             19:38  Milk, juice
2130            20:05  Milk, eggs
1735            20:40  Bread, cookies, coffee

Support threshold >= 0.5
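As a worked example of the support and confidence metrics, the following Python sketch applies them to the four transactions in the table above. The helper names `support` and `confidence` are illustrative, not taken from the slides.

```python
# The four transactions from the table (item-types only).
transactions = [
    {"milk", "bread", "cookies", "juice"},  # 1101
    {"milk", "juice"},                      # 792
    {"milk", "eggs"},                       # 2130
    {"bread", "cookies", "coffee"},         # 1735
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent in T | antecedent in T).

    Assumes the antecedent occurs in at least one transaction.
    """
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


print(support({"milk", "juice"}, transactions))       # 2 of 4 transactions
print(confidence({"milk"}, {"juice"}, transactions))  # 2 of the 3 milk transactions
print(confidence({"juice"}, {"milk"}, transactions))  # 2 of the 2 juice transactions
```

Confidence is not symmetric: every juice purchase here comes with milk, but only two of the three milk purchases come with juice.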

Page 22

Apriori Algorithm: Eliminate infrequent Singleton sets.

Item-type  Count
Milk       3
Bread      2
Cookies    2
Juice      2
Coffee     1
Eggs       1

[Figure: item-set lattice, singletons: Milk, Bread, Cookies, Juice, Coffee, Eggs]

Support threshold >= 0.5

Page 23

Apriori Algorithm: Make pairs from frequent items & Prune infrequent pairs!

Item Pair       Count
Milk, Juice     2
Bread, Cookies  2
Milk, Cookies   1
Milk, Bread     1
Bread, Juice    1
Cookies, Juice  1

[Figure: item-set lattice with singletons (Milk, Bread, Cookies, Juice, Coffee, Eggs) and candidate pairs (MB, MC, MJ, BC, BJ, CJ)]

Support threshold >= 0.5

Page 24

Apriori Algorithm: Make triples from frequent pairs & Prune infrequent triples!

[Figure: item-set lattice extended with the triples (MBC, MBJ, MCJ, BCJ) and the quadruple MBCJ]

No triples generated due to monotonicity!

Apriori algorithm examined only 12 subsets (6 singletons + 6 pairs) instead of 64 (all 2^6 subsets of the 6 item-types)!

Support threshold >= 0.5
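The level-wise pruning shown on the Apriori slides can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function name `apriori` and the `examined` counter are hypothetical, and candidates of size k+1 are formed as unions of frequent k-sets whose k-subsets are all frequent (the monotonicity property).

```python
from itertools import combinations

# The four transactions from the slides (item-types only).
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def apriori(transactions, min_support=0.5):
    """Return ({frequent itemset: support}, number of candidate sets examined)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]
    frequent, examined, k = {}, 0, 1
    while candidates:
        examined += len(candidates)
        # Count each candidate and keep those that meet the support threshold.
        level = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Build (k+1)-candidates; monotonicity: every k-subset must be frequent.
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in unions
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, examined


frequent, examined = apriori(transactions, min_support=0.5)
print(examined)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

On this data the sketch examines 6 singletons and 6 pairs and then stops, because no triple has all of its pairs frequent.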

Page 25

Outline

• Clustering
• Outlier Detection
• Association Rules
• Classification & Prediction
• Summary

Page 26

Find a (decision-tree) model to predict LoanWorthy!

Predict Class = LoanWorthy from other columns

Learning Samples

RID  Married  Salary    Acct_balance  Age   LoanWorthy
1    No       >=50K     < 5K          >=25  Yes
2    Yes      >=50K     >= 5K         >=25  Yes
3    Yes      20K..50K  < 5K          <25   No
4    No       <20K      >= 5K         <25   No
5    No       <20K      < 5K          >=25  No
6    Yes      20K..50K  >= 5K         >=25  Yes

Testing Samples

RID  Married  Salary  Acct_balance  Age   LoanWorthy
7    Yes      <20K    >= 5K         >=25  ?

Page 27

A Decision Tree to Predict LoanWorthy from other columns

Salary
  < 20K:     RID 4, 5 (LoanWorthy = No)
  20K..50K:  Age
               < 25:   RID 3 (LoanWorthy = No)
               >= 25:  RID 6 (LoanWorthy = Yes)
  >= 50K:    RID 1, 2 (LoanWorthy = Yes)

RID  Married  Salary  Acct_balance  Age   LoanWorthy
7    Yes      <20K    >= 5K         >=25  ?

Q? What is the decision on the new application?

Page 28

Another Decision Tree to Predict LoanWorthy from other columns

Age
  < 25:   RID 3, 4 (LoanWorthy = No)
  >= 25:  Salary
            < 20K:     RID 5 (LoanWorthy = No)
            20K..50K:  RID 6 (LoanWorthy = Yes)
            >= 50K:    RID 1, 2 (LoanWorthy = Yes)

RID  Married  Salary  Acct_balance  Age   LoanWorthy
7    Yes      <20K    >= 5K         >=25  ?

Q? What is the decision on the new application?
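One way to work through the question is to encode the two decision trees and walk the new application down each. The Python sketch below does this; the nested-tuple encoding and the `classify` helper are illustrative assumptions, not a representation used by the slides.

```python
# A tree is either a class label (str) or (attribute, {branch value: subtree}).
salary_first = ("Salary", {
    "<20K": "No",
    "20K..50K": ("Age", {"<25": "No", ">=25": "Yes"}),
    ">=50K": "Yes",
})
age_first = ("Age", {
    "<25": "No",
    ">=25": ("Salary", {"<20K": "No", "20K..50K": "Yes", ">=50K": "Yes"}),
})

columns = ("Married", "Salary", "Acct_balance", "Age", "LoanWorthy")
learning_rows = [
    ("No", ">=50K", "<5K", ">=25", "Yes"),
    ("Yes", ">=50K", ">=5K", ">=25", "Yes"),
    ("Yes", "20K..50K", "<5K", "<25", "No"),
    ("No", "<20K", ">=5K", "<25", "No"),
    ("No", "<20K", "<5K", ">=25", "No"),
    ("Yes", "20K..50K", ">=5K", ">=25", "Yes"),
]
learning_samples = [dict(zip(columns, row)) for row in learning_rows]


def classify(tree, record):
    """Follow the branches matching the record until a class label is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree


# Testing sample (RID 7) from the slides.
new_application = {"Married": "Yes", "Salary": "<20K", "Acct_balance": ">=5K", "Age": ">=25"}

print(classify(salary_first, new_application))
print(classify(age_first, new_application))
```

Both trees reproduce the LoanWorthy column of the six learning samples, and both send RID 7 (Salary < 20K, Age >= 25) to a "No" leaf.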

Page 29

ID3 Algorithm: Choosing a decision for Root Node - 1

Predict Class = LoanWorthy from other columns (learning samples RID 1 to 6 as on Page 26)

          Married  Salary  Acct_balance  Age  LoanWorthy
# Groups  2        3       2             2    2

Page 30

ID3 Algorithm: Choosing a decision for Root Node - 2

Predict Class = LoanWorthy from other columns (learning samples RID 1 to 6 as on Page 26)

          Married   Salary      Acct_balance  Age       LoanWorthy
# Groups  2         3           2             2         2
Groups    yyn, nny  yy, yn, nn  nny, yyn      yyyn, nn  yyy, nnn

Page 31

ID3 Algorithm: Choosing a decision for Root Node - 3

Predict Class = LoanWorthy from other columns (learning samples RID 1 to 6 as on Page 26)

          Married   Salary      Acct_balance  Age       LoanWorthy
# Groups  2         3           2             2         2
Groups    yyn, nny  yy, yn, nn  nny, yyn      yyyn, nn  yyy, nnn
Entropy   0.92      0.33        0.92          0.54      1

Page 32

ID3 Algorithm: Choosing a decision for Root Node - 4

Predict Class = LoanWorthy from other columns (learning samples RID 1 to 6 as on Page 26)

          Married   Salary      Acct_balance  Age       LoanWorthy
# Groups  2         3           2             2         2
Groups    yyn, nny  yy, yn, nn  nny, yyn      yyyn, nn  yyy, nnn
Entropy   0.92      0.33        0.92          0.54      1
Gain      0.08      0.67        0.08          0.46

Page 33

Predict Class = LoanWorthy from other columns (learning samples RID 1 to 6 as on Page 26)

          Married   Salary      Acct_balance  Age       LoanWorthy
# Groups  2         3           2             2         2
Groups    yyn, nny  yy, yn, nn  nny, yyn      yyyn, nn  yyy, nnn
Entropy   0.92      0.33        0.92          0.54      1
Gain      0.08      0.67        0.08          0.46

Root Node: Decision is based on Salary
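The Entropy and Gain rows can be recomputed from the six learning samples. The Python sketch below is one way to do that; the helper names are illustrative, entropy is measured in bits (log base 2), and "Entropy" for an attribute is taken to be the size-weighted entropy of LoanWorthy within the groups that attribute creates, which is the reading that reproduces the slide's numbers.

```python
from collections import Counter, defaultdict
from math import log2

columns = ("Married", "Salary", "Acct_balance", "Age", "LoanWorthy")
rows = [
    ("No", ">=50K", "<5K", ">=25", "Yes"),
    ("Yes", ">=50K", ">=5K", ">=25", "Yes"),
    ("Yes", "20K..50K", "<5K", "<25", "No"),
    ("No", "<20K", ">=5K", "<25", "No"),
    ("No", "<20K", "<5K", ">=25", "No"),
    ("Yes", "20K..50K", ">=5K", ">=25", "Yes"),
]
samples = [dict(zip(columns, row)) for row in rows]


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def split_entropy(samples, attribute, target="LoanWorthy"):
    """Size-weighted entropy of the target within each group of the attribute."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[attribute]].append(s[target])
    return sum(len(g) / len(samples) * entropy(g) for g in groups.values())


def information_gain(samples, attribute, target="LoanWorthy"):
    """Entropy of the target before the split minus the entropy after it."""
    before = entropy([s[target] for s in samples])
    return before - split_entropy(samples, attribute, target)


attributes = ("Married", "Salary", "Acct_balance", "Age")
for a in attributes:
    print(a, round(split_entropy(samples, a), 2), round(information_gain(samples, a), 2))

root = max(attributes, key=lambda a: information_gain(samples, a))
print(root)
```

Picking the attribute with the largest gain selects Salary for the root, matching the slide.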

Page 34

Root Node of a Decision Tree to Predict LoanWorthy

Salary
  < 20K:     RID 4, 5 (LoanWorthy = No, No)
  20K..50K:  RID 3, 6 (LoanWorthy = No, Yes)
  >= 50K:    RID 1, 2 (LoanWorthy = Yes, Yes)

Page 35

ID3 Algorithm: Which leaves need refinement?

Salary
  < 20K:     RID 4, 5 (LoanWorthy = No, No)
  20K..50K:  RID 3, 6 (LoanWorthy = No, Yes)
  >= 50K:    RID 1, 2 (LoanWorthy = Yes, Yes)

Page 36

ID3 Algorithm Output: A Decision Tree to Predict the LoanWorthy column from other columns

Salary
  < 20K:     RID 4, 5 (LoanWorthy = No)
  20K..50K:  Age
               < 25:   RID 3 (LoanWorthy = No)
               >= 25:  RID 6 (LoanWorthy = Yes)
  >= 50K:    RID 1, 2 (LoanWorthy = Yes)

Page 37

Another Decision Tree to Predict LoanWorthy from other columns

Salary
  < 20K:     RID 4, 5 (LoanWorthy = No)
  20K..50K:  Acct_balance
               < 5K:   RID 3 (LoanWorthy = No)
               >= 5K:  RID 6 (LoanWorthy = Yes)
  >= 50K:    RID 1, 2 (LoanWorthy = Yes)

Page 38

A Decision Root not preferred by ID3

ID3 prefers Salary over Age for the decision in the root node due to the difference in information gain, even though the choices are comparable for classification accuracy.

Age
  < 25:   RID 3, 4 (LoanWorthy = No, No)
  >= 25:  RID 1, 2, 5, 6 (LoanWorthy = Yes, Yes, No, Yes)

Page 39

A Decision Tree not preferred by ID3

ID3 is greedy, preferring Salary over Age for the decision in the root node. Thus, it prefers the decision trees in earlier slides over the following (despite comparable quality):

Age
  < 25:   RID 3, 4 (LoanWorthy = No)
  >= 25:  Salary
            < 20K:     RID 5 (LoanWorthy = No)
            20K..50K:  RID 6 (LoanWorthy = Yes)
            >= 50K:    RID 1, 2 (LoanWorthy = Yes)

Page 40

Summary

• The process of discovering
  – interesting, useful, non-trivial patterns
  – from large datasets
• Pattern families
  1. Clusters, e.g., K-Means
  2. Outliers, Anomalies
  3. Associations, Correlations
  4. Classification and Prediction models, e.g., Decision Trees
  5. …

Page 41

Review Quiz

Consider a Washingtonian.com article about election micro-targeting using a database of 200+ million records about individuals. The database is compiled from voter lists, memberships (e.g., advocacy groups, frequent buyer cards, catalog/magazine subscriptions, ...) as well as polls/surveys of effective messages and preferences. It is at www.washingtonian.com/articles/people/9627.html

Q1. Match the following use-cases in the article to the categories of traditional SQL2 query, association, clustering and classification:
(i) How many single Asian men under 35 live in a given congressional district?
(ii) How many college-educated women with children at home are in Canton, Ohio?
(iii) Jaguar, Land Rover, and Porsche owners tend to be more Republican, while Subaru, Hyundai, and Volvo drivers lean Democratic.
(iv) Some of the strongest predictors of political ideology are things like education, homeownership, income level, and household size.
(v) Religion and gun ownership are the two most powerful predictors of partisan ID.
(vi) ... it even studied the roads Republicans drove as they commuted to work, which allowed the party to put its billboards where they would do the most good.
(vii) Catalyst and its competitors can build models to predict voter choices. ... Based on how alike they are, you can assign a probability to them. ... a likelihood of support on each person based on how many character traits a person shares with your known supporters.
(viii) Will 51 percent of the voters buy what RNC candidate is offering? Or will DNC candidate seem like a better deal?

Q2. Compare and contrast Data Mining with Relational Databases.
Q3. Compare and contrast Data Mining with Traditional Statistics (or Machine Learning).