final project


Upload: roth-guthrie

Post on 01-Jan-2016


DESCRIPTION

Final Project. Data sets. Visit web site: http://www.kdnuggets.com/datasets/index.html - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Final Project

Final Project

Page 2: Final Project

Data sets

Visit web site: http://www.kdnuggets.com/datasets/index.html

This is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.

http://kdd.ics.uci.edu/

Page 3: Final Project

Data sets

Data Sets

by application area

by name

by date (reverse chronological)

Machine Learning Repository

Task Files

by task type

by application area

by name

by date (reverse chronological)

by data type

Page 4: Final Project

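Report & Presentation

Written report (50%) + presentation (50%) ==> counts as the final exam grade

Groups of 4 students

Written report: at least 8 pages, cover not included

Presentation: 15 minutes + questions (5 minutes). The presenting students do not ask questions; all other students must answer questions. Answers need not be immediate and may be given before the end of class. One class period is used for discussion and questions, and for fixing the chosen data set in advance (it may be changed within one week).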

Page 5: Final Project

Business Data Mining Applications

Page 6: Final Project

Business Data Mining Applications

Partial representative sample of applications

Catalog sales

CRM

Credit scoring

Banking (loans)

Investment risk

Insurance

Page 7: Final Project

Fingerhut

Founded 1948

today sends out 130 different catalogs to over 65 million customers

6 terabyte data warehouse

3000 variables on 12 million most active customers

over 300 predictive models

Focused marketing

Page 8: Final Project

Fingerhut

Purchased by Federated Department Stores for $1.7 billion in 1999 (for database)

Fingerhut had $1.6 to $2 billion business per year, targeted at lower income households

Can mail 400,000 packages per day

Each product line has its own catalog

Page 9: Final Project

Fingerhut

Uses segmentation, decision tree, regression, neural network tools from SAS and SPSS

Segmentation - combines order & demographic data with product offerings

can target mailings to greatest payoff

customers who recently had moved tripled their purchasing 12 weeks after the move

send furniture, telephone, decoration catalogs

Page 10: Final Project

Data for SEGMENTATION

cluster indices

subj  age  income  marital  grocery  dine out  savings
1001  53   80000   wife     180      90        30000
1002  48   120000  husband  120      110       20000
1003  32   90000   single   30       160       5000
1004  26   40000   wife     80       40        0
1005  51   90000   wife     110      90        20000
1006  59   150000  wife     160      120       30000
1007  43   120000  husband  140      110       10000
1008  38   160000  wife     80       130       15000
1009  35   70000   single   40       170       5000
1010  27   50000   wife     130      80        0

Page 11: Final Project

Initial Look at Data

Want to know features of those who spend a lot dining out

Include as many actionable variables as possible: things you can identify

Manipulate data: sort on most likely indicator (dine out)

Page 12: Final Project

Sorted by Dine Out

cluster indices

subject  age  income  marital  grocery  dine out  savings
1004     26   40000   wife     80       40        0
1010     27   50000   wife     130      80        0
1001     53   80000   wife     180      90        30000
1005     51   90000   wife     110      90        20000
1002     48   120000  husband  120      110       20000
1007     43   120000  husband  140      110       10000
1006     59   150000  wife     160      120       30000
1008     38   160000  wife     80       130       15000
1003     32   90000   single   30       160       5000
1009     35   70000   single   40       170       5000
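The sort on the dine-out column can be reproduced in a few lines of Python. This is only a sketch: the `customers` list and the `DINE_OUT` index are illustrative names for the slide's table, not part of the original material.

```python
# Rows copied from the segmentation table:
# (subject, age, income, marital, grocery, dine_out, savings)
customers = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]

DINE_OUT = 5  # column index of dine-out spending (illustrative constant)

# sorted() is stable, so customers with equal spending keep their original order.
by_dine_out = sorted(customers, key=lambda row: row[DINE_OUT])

for row in by_dine_out:
    print(*row)
```

Sorting in ascending order puts the two single customers at the bottom of the list, which is the pattern the slides go on to discuss.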

Page 13: Final Project

Analysis

Best indicators: marital status, groceries

Available: marital status might be easier to get

Page 14: Final Project

Fingerhut

Mailstream optimization: which customers are most likely to respond to existing catalog mailings

saved nearly $3 million per year

reversed trend of catalog sales industry in 1998

reduced mailings by 20% while increasing net earnings to over $37 million

Page 15: Final Project

LIFT

LIFT = (probability of class in sample) / (probability of class in population)

If population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5

Best lift not necessarily best

need sufficient sample size

as confidence increases, longer list but lower lift
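The lift ratio is simple enough to write as a one-line function. A minimal sketch follows; the function name `lift` and its argument names are illustrative choices.

```python
def lift(sample_rate: float, population_rate: float) -> float:
    """Ratio of the class probability in the selected sample to that in the population."""
    if population_rate <= 0:
        raise ValueError("population rate must be positive")
    return sample_rate / population_rate


# The slide's example: 30% response in the sample, 20% in the population.
print(round(lift(0.30, 0.20), 2))
```

A lift above 1 means the selected sample responds at a higher rate than the population as a whole.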

Page 16: Final Project

Lift Example

Product to be promoted

Sampled over 10 identifiable segments of potential buying population

Profit $50 per item sold

Mailing cost $1

Sorted by estimated response rates

Page 17: Final Project

Lift Data

Seg  Rate   Rev    Cost  Profit
1    0.042  $2.10  $1    $1.10
2    0.035  $1.75  $1    $0.75
3    0.025  $1.25  $1    $0.25
4    0.017  $0.85  $1    -$0.15
5    0.015  $0.75  $1    -$0.25
6    0.013  $0.65  $1    -$0.35
7    0.009  $0.45  $1    -$0.55
8    0.005  $0.25  $1    -$0.75
9    0.004  $0.20  $1    -$0.80
10   0.001  $0.05  $1    -$0.95
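The table's arithmetic (expected revenue = response rate × $50, profit = revenue − $1 mailing cost) can be checked with a short script. This is a sketch with illustrative names; the figures are per mailing piece, as in the table.

```python
PROFIT_PER_SALE = 50.0  # profit per item sold, from the Lift Example slide
MAILING_COST = 1.0      # cost per mailing, from the Lift Example slide

# Estimated response rates for segments 1..10, already sorted high to low.
rates = [0.042, 0.035, 0.025, 0.017, 0.015, 0.013, 0.009, 0.005, 0.004, 0.001]


def segment_profit(rate: float) -> float:
    """Expected profit of one mailing sent to a segment with this response rate."""
    return rate * PROFIT_PER_SALE - MAILING_COST


profits = [segment_profit(r) for r in rates]

# Running total of profit when mailing segments 1..k.
cumulative = []
running = 0.0
for p in profits:
    running += p
    cumulative.append(running)

profitable_segments = [i + 1 for i, p in enumerate(profits) if p > 0]
best_cutoff = max(range(len(cumulative)), key=lambda i: cumulative[i]) + 1

print("profitable segments:", profitable_segments)
print("mail through segment:", best_cutoff)
```

Only the first three segments repay the mailing cost, so cumulative profit peaks after segment 3 and declines with every further segment mailed.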

Page 18: Final Project

Lift Chart

[Figure: LIFT chart. Cumulative proportion (y-axis, 0 to 1.2) by segment (x-axis, 0 to 10); series: Cum Response, Random]

Page 19: Final Project

Profit Impact

[Figure: PROFIT chart. Dollars (y-axis, -4 to 12) by segment (x-axis, 0 to 10); series: Cum Revenue, Cum Cost, Cum Profit]

Page 20: Final Project

RFM

Recency, Frequency, Monetary

Same purpose as lift: identify customers more likely to respond

RFM tracks customer transactions by its 3 measures

Code each customer

Often 5 cells for each measure, or 125 combinations

Identify positive response of each of the combinations
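The coding step can be sketched as follows. The slides do not say how the five cells are cut, so this example assumes rank-based fifths; the `customers` records and the helper `quintile_scores` are hypothetical.

```python
from itertools import product


def quintile_scores(values, higher_is_better=True):
    """Assign each value a score 1..5 by rank (5 = best fifth). Assumed binning rule."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = 1 + (rank * 5) // len(values)
    return scores


# 5 cells per measure gives 5 * 5 * 5 = 125 possible (R, F, M) codes.
all_cells = list(product(range(1, 6), repeat=3))

# Hypothetical customers: (id, days since last order, number of orders, total spend)
customers = [
    ("A", 5, 12, 900),
    ("B", 40, 3, 150),
    ("C", 200, 1, 20),
    ("D", 15, 8, 600),
    ("E", 90, 2, 75),
]

recency = quintile_scores([c[1] for c in customers], higher_is_better=False)
frequency = quintile_scores([c[2] for c in customers])
monetary = quintile_scores([c[3] for c in customers])

rfm_codes = {c[0]: (r, f, m)
             for c, r, f, m in zip(customers, recency, frequency, monetary)}
print(len(all_cells), rfm_codes)
```

Response rates observed for each of the 125 codes would then be used to decide which cells to target.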

Page 21: Final Project

CUSTOMER RELATIONSHIP MANAGEMENT (CRM)

understanding value customer provides to firm

Kathleen Khirallah - The Tower Group: Banks will spend $9 billion on CRM by end of 1999

Deloitte: only 31% of senior bank executives confident that their current distribution mix anticipated customer needs

Page 22: Final Project

Customer Value

Middle age (41-55), 3-9 years on job, 3-9 years in town, savings account

Profit is discounted at a rate of 1.3 per year; net is the running total of discounted profit.

year  annual purchases  profit  discounted  net
1     1000              200     153         153
2     1000              200     118         272
3     1000              200     91          363
4     1000              200     70          433
5     1000              200     53          487
6     1000              200     41          528
7     1000              200     31          560
8     1000              200     24          584
9     1000              200     18          603
10    1000              200     14          618

Page 23: Final Project

Younger Customer

Young (21-29), 0-2 years on job, 0-2 years in town, no savings account

Profit is discounted at a rate of 1.3 per year; net is the running total of discounted profit.

year  annual purchases  profit  discounted  net
1     300               60      46          46
2     360               72      43          89
3     432               86      39          128
4     518               104     36          164
5     622               124     34          198
6     746               149     31          229
7     896               179     29          257
8     1075              215     26          284
9     1290              258     24          308
10    1548              310     22          331
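Both customer value tables follow the same calculation: yearly profit divided by 1.3 raised to the year number, then accumulated. The sketch below reproduces it; the 20% profit margin and the 20% yearly purchase growth for the younger customer are inferred from the table values rather than stated on the slides, and the function name is illustrative.

```python
def discounted_profits(first_year_purchases, growth, margin, discount_factor, years=10):
    """Return rows of (year, purchases, profit, discounted profit, running net)."""
    rows = []
    cumulative = 0.0
    purchases = float(first_year_purchases)
    for year in range(1, years + 1):
        profit = purchases * margin
        discounted = profit / discount_factor ** year
        cumulative += discounted
        rows.append((year, purchases, profit, discounted, cumulative))
        purchases *= 1 + growth  # purchases grow before the next year
    return rows


# Margin 0.20 and growth 0.20 are assumptions inferred from the tables.
middle_aged = discounted_profits(1000, growth=0.0, margin=0.20, discount_factor=1.3)
younger = discounted_profits(300, growth=0.20, margin=0.20, discount_factor=1.3)

for label, rows in (("middle-aged", middle_aged), ("younger", younger)):
    print(label, "10-year value:", round(rows[-1][4], 1))
```

The results agree with the tables to within rounding: about 618 for the middle-aged profile and about 331 for the younger one.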

Page 24: Final Project

Lifetime Value Application

Drew et al. (2001), Journal of Service Research 3:3

Cellular telephone division, major US telecommunications firm

Data on billing, usage, demographics

Neural net model of churn proportion by month of tenure

36 tenure classes

Tested model on 21,500 subscribers, April 1998

Trained on 15,000, tested on 6,500

Page 25: Final Project

Customer Tenure Segments

1. Least likely to churn

• Left alone

2. Slight propensity to churn at end of tenure

• Moderate pre-expiration marketing

3. Large spike in churn at expiration

• Concentrated marketing efforts before expiration

4. Highest risk

• Continued competitive offers

Page 26: Final Project

CREDIT SCORING

Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other

Statistical & mathematical models (regression) to predict repayment

Page 27: Final Project

CREDIT SCORING

Bank Loan Applications

Age  Income  Assets  Debts   Want   On-time
24   55557   27040   48191   1500   1
20   17152   11090   20455   400    1
20   85104   0       14361   4500   1
33   40921   91111   90076   2900   1
30   76183   101162  114601  1000   1
55   80149   511937  21923   1000   1
28   26169   47355   49341   3100   0
20   34843   0       21031   2100   1
20   52623   0       23054   15900  0
39   59006   195759  161750  600    1

Page 28: Final Project

Credit Card Management

Very profitable industry

Card surfing - pay old balance with new card

Promotions typically generate 1000 responses, about 1%

In early 1990s, almost all mass marketing

Data mining improves (lift)

Page 29: Final Project

British Credit Card Company

Monthly credit data: didn’t want those who paid in full (no profit)

Application scoring: continued what had been done manually for over 50 years

Behavioral scoring: monitor revolving credit accounts for early warning

90,000 customers

State variable: cumulative months of missed repayment

Selected sample of 10,000 observations

Initial state all 0 in selected data

Over 70% of customers never left state 0

Page 30: Final Project

Analysis

Clustering: unsupervised partitioning

K-median to get more stable results

Pattern search

Sought patterns from object grouping

Unexpectedly large number of similar objects

Estimated probability of each case belonging to objects

Page 31: Final Project

Comparison

Compared clustering partitions with pattern search groupings

Pattern search identified those behaving in anomalous manner

Page 32: Final Project

Banking

Among first users of data mining

Used to find out what motivates their customers (reduce churn)

Loan applications

Target marketing

Norwest: 3% of customers provided 44% of profits

Bank of America: program cultivating top 10% of customers

Page 33: Final Project

CHURN

Customer turnover

Critical to: telecommunications, banks, human resource management, retailers

Page 34: Final Project

Characteristics of Not On-Time

Age  Income  Assets  Debts  Want   On-time
28   26169   47355   49341  3100   0
20   52623   0       23054  15900  0

Here, Debts exceed Assets

Age: Young

Income: Low

BETTER: Base on statistics, large sample

supplement data with other relevant variables
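The warning about small samples can be illustrated by applying the "debts exceed assets" observation as a rule to all ten bank loan applications listed earlier. This is a sketch for illustration only; the variable names are not from the slides.

```python
# Rows from the Bank Loan Applications table:
# (age, income, assets, debts, want, on_time)
loans = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]

ASSETS, DEBTS, ON_TIME = 2, 3, 5  # column indices (illustrative constants)

flagged = [row for row in loans if row[DEBTS] > row[ASSETS]]
late = [row for row in loans if row[ON_TIME] == 0]
flagged_late = [row for row in flagged if row[ON_TIME] == 0]

print(f"{len(flagged)} of {len(loans)} applicants have debts greater than assets")
print(f"{len(flagged_late)} of those {len(flagged)} were not on time")
```

The rule catches both late payers but also flags five applicants who paid on time, which is why the slide recommends statistics on a large sample and additional variables.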

Page 35: Final Project

Identify Characteristics of Those Who Leave

Age (years)  Time-job (months)  Time-town (months)  min bal ($)  checking  savings  card  loan

27 12 12 549 x x

41 18 41 3259 x x x

28 9 15 286 x x

55 301 5 2854 x x x

43 18 18 1112 x x x

29 6 3 0 x

38 55 20 321 x x x

63 185 3 2175 x x x

26 15 15 386 x x

46 13 12 1187 x x x

37 32 25 1865 x x x

Page 36: Final Project

Analysis

What are the characteristics of those who leave?

Correlation analysis

Which customers do you want to keep?

Customer value - net present value of customer to the firm

Page 37: Final Project

Correlation

          Age  Time-Job  Time-Town  Min-Bal  Check  Saving  Card  Loan
Age       1.0  0.6       0.4        -0.4     0.0    0.4     0.2   0.3
Time-Job       1.0       0.9        -0.6     0.1    0.6     0.9   -0.2
Time-Town                1.0        -0.5     -0.1   0.3     0.5   0.4
Min-Bal                             1.0      -0.2   0.3     0.6   -0.1
Check                                        1.0    0.5     0.2   0.2
Saving                                              1.0     0.9   0.3
Card                                                        1.0   0.5
Loan                                                              1.0
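Each entry in a correlation matrix like this is a Pearson correlation coefficient between two columns. The sketch below computes it for the numeric columns of the "those who leave" table; the function name is illustrative, and the results are not claimed to match the matrix above, whose source data the slides do not show.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Numeric columns copied from the "Identify Characteristics of Those Who Leave" table.
age = [27, 41, 28, 55, 43, 29, 38, 63, 26, 46, 37]
time_job = [12, 18, 9, 301, 18, 6, 55, 185, 15, 13, 32]
min_bal = [549, 3259, 286, 2854, 1112, 0, 321, 2175, 386, 1187, 1865]

print("age vs time on job:", round(pearson(age, time_job), 2))
print("age vs min balance:", round(pearson(age, min_bal), 2))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.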

Page 38: Final Project

Bankruptcy Prediction

Sung et al. (1999), Journal of MIS 16:1

Late 20th-century East Asian corporate bankruptcy critical

Models built for normal & crisis conditions

Used decision tree models for explanation

Discriminant analysis applied to benchmark

Korean corporations

Data for all bankrupt corporations on Korean Stock Exchange, 2nd quarter 1997 to 1st quarter 1998: 75 such cases, full data on 30 of those

Normal, 2nd quarter 1991 to 1st quarter 1995: 56 firms, full data on 26

Page 39: Final Project

Korean Bankruptcy Study

Matched bankrupt firms with one or two nonbankrupt firms that had similar assets and size

56 financial ratios used

Eliminated 16 due to duplication

Page 40: Final Project

Financial Ratios

Growth (5)

Profitability (13)

Leverage (9)

Efficiency (6)

Productivity (7)

DV 0/1 variable of bankruptcy or not

Page 41: Final Project

Multivariate Discriminant Analysis

Used stepwise procedure

NORMAL PERIOD

Normal = 0.58 * cash flow/assets + 0.0623 * productivity of capital - 0.006 * average inventory turnover

BANKRUPT PERIOD

Bankrupt = 0.053 * cash flow/liabilities + 0.056 * productivity of capital + 0.014 * fixed assets/(equity+LT liab)

Page 42: Final Project

Decision Tree Models

Used C4.5

Applied boosting to improve predictive power, improved prediction success

NORMAL RULES

IF productivity of capital > 19.65 THEN OK

IF cash flow/total assets > 5.64 THEN OK

IF cash flow/total assets ≤ 5.64 & productivity of capital ≤ 19.65 THEN bankrupt

Page 43: Final Project

CRISIS RULES

IF productivity of capital > 20.61 THEN OK

IF cash flow/liabilities > 2.64 THEN OK

IF fixed assets/(equity+long-term invest) > 87.23 THEN OK

IF cash flow/liabilities ≤ 2.64

AND productivity of capital ≤20.61

AND fixed assets/(equity+long-term invest) ≤ 87.23 THEN bankrupt
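The crisis rules translate directly into a small rule-based classifier. This is a sketch using the thresholds printed above; the function and parameter names are illustrative, and the ratios are assumed to be expressed in the same units as on the slide.

```python
def classify_crisis(productivity_of_capital: float,
                    cash_flow_to_liabilities: float,
                    fixed_assets_ratio: float) -> str:
    """Apply the crisis-period rules; fixed_assets_ratio is
    fixed assets/(equity+long-term invest) as named on the slide."""
    if productivity_of_capital > 20.61:
        return "OK"
    if cash_flow_to_liabilities > 2.64:
        return "OK"
    if fixed_assets_ratio > 87.23:
        return "OK"
    # All three ratios are at or below their thresholds.
    return "bankrupt"


# Hypothetical firms, for illustration only.
print(classify_crisis(25.0, 1.0, 50.0))
print(classify_crisis(10.0, 1.0, 50.0))
```

A firm is predicted bankrupt only when it fails all three tests, which matches the final compound rule.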

Page 44: Final Project

Comparison

Model      Correct Bankrupt  Correct OK  Overall  Variables
DA-normal  0.69              0.90        0.82     3
DA-crisis  0.53              0.85        0.74     3
DT-normal  0.72              0.90        0.83     8
DT-crisis  0.67              0.89        0.81     6

Page 45: Final Project

Mortgage Market

Early 1990s - massive refinancing

Need to keep customers happy to retain

Contact current customers who have rates significantly higher than market

a major change in practice

data mining & telemarketing increased Crestar Mortgage’s retention rate from 8% to over 20%

Page 46: Final Project

Country Investment Risk

Outcome categories:

1. Most safe

2. Developed

3. Mature emerging markets

4. New emerging markets

5. Frontier

Page 47: Final Project

Investment Risk Analysis

Becerra-Fernandez et al. (2002), Computers and Industrial Engineering 43

Risk by country

Expert assessment available

Decision tree (C5), neural network models

Data:

Economic indicators (4)

Depth & liquidity (4)

Performance & value (5)

Economic & market risk (4)

Regulation & efficiency (4)

52 samples, so used bootstrapping
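Bootstrapping stretches a small data set by drawing repeated samples with replacement. The slides do not describe the exact procedure used in the study, so the following is only a generic sketch; `bootstrap_samples` and the placeholder `countries` list are hypothetical.

```python
import random


def bootstrap_samples(data, n_resamples, seed=0):
    """Draw n_resamples samples of len(data) items each, with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.choices(data, k=len(data)) for _ in range(n_resamples)]


# Placeholder identifiers standing in for the 52 observations.
countries = list(range(52))
resamples = bootstrap_samples(countries, n_resamples=100)

print(len(resamples), "resamples of", len(resamples[0]), "observations each")
```

A model would typically be fitted on each resample and the results averaged to estimate accuracy and its variability.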

Page 48: Final Project

Models

Decision trees

Pruning rate 50%

Pruning rate 75%

Neural networks

Backpropagation

Fuzzy (ARTMAP)

Learning vector quantization

Page 49: Final Project

Results

Decision tree algorithms more accurate

Lower pruning rate – lowest error rate

Neural networks disadvantaged by small data set

Decision tree algorithms consistently optimistic relative to expert ratings

Page 50: Final Project

Banking

Fleet Financial Group

$30 million data warehouse

hired 60 database marketers, statistical/quantitative analysts & DSS specialists

expected to add $100 million in profit by 2001

Page 51: Final Project

Banking

First Union

concentrated on contact point

previously had very focused product groups, little coordination

Developed offers for customers

Page 52: Final Project

INSURANCE

Marketing, as retailing & banking

Special: Farmers Insurance Group - underwriting system

generating $ millions in higher revenues, lower claims

7 databases, 35 million records

better understanding of market niches

lower rates on sports cars, increasing business

Page 53: Final Project

Insurance Fraud

Specialist criminals - multiple personas

InfoGlide specializes in fraud detection products

Similarity search engine

link names, telephone numbers, streets, birthdays, variations

identify 7 times more fraud than exact-match systems
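The idea behind similarity matching, as opposed to exact matching, can be shown with Python's standard `difflib` module. This is not InfoGlide's method, which the slides do not describe; it is a stand-in sketch, and the names and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0 and 1, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two strings as the same persona when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical claimant names.
print("Jon Smith" == "John Smith")              # exact match misses the variation
print(likely_same("Jon Smith", "John Smith"))   # similarity match links the two
print(likely_same("Jon Smith", "Mary Jones"))
```

The same comparison could be applied to telephone numbers, street addresses, and birthdays to link records that an exact-match system would treat as unrelated.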

Page 54: Final Project

Insurance Fraud - Link Analysis

claim type  amount  physician  attorney
back        50000   Welby      McBeal
neck        80000   Frank      Jones
arm         40000   Barnard    Fraser
neck        80000   Frank      Jones
leg         30000   Schmidt    Mason
multiple    120000  Heinrich   Feiffer
neck        80000   Frank      Jones
back        60000   Schwartz   Nixon
arm         30000   Templer    White
internal    180000  Weiss      Richards
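A very simple form of link analysis is to count how often the same physician and attorney appear together. The sketch below does this for the claims in the table; the variable names are illustrative.

```python
from collections import Counter

# Rows copied from the link analysis table:
# (claim type, amount, physician, attorney)
claims = [
    ("back", 50000, "Welby", "McBeal"),
    ("neck", 80000, "Frank", "Jones"),
    ("arm", 40000, "Barnard", "Fraser"),
    ("neck", 80000, "Frank", "Jones"),
    ("leg", 30000, "Schmidt", "Mason"),
    ("multiple", 120000, "Heinrich", "Feiffer"),
    ("neck", 80000, "Frank", "Jones"),
    ("back", 60000, "Schwartz", "Nixon"),
    ("arm", 30000, "Templer", "White"),
    ("internal", 180000, "Weiss", "Richards"),
]

# Count each (physician, attorney) pairing across all claims.
pair_counts = Counter((physician, attorney) for _, _, physician, attorney in claims)
repeated = {pair: n for pair, n in pair_counts.items() if n > 1}

print(repeated)
```

The pairing of Frank and Jones appears three times, each with the same claim type and amount, which is the kind of link that would be passed on for investigation.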

Page 55: Final Project

Insurance Fraud

Analytics’ NetMap for Claims

uses industrywide database

creates data mart of internal, external data

unusual activity for specific chiropractors, attorneys

HNC Insurance Solutions

workers compensation fraud

VeriComp - predictive software (neural nets) saved Utah over $2 million

Page 56: Final Project

Insurance Data Mining Examples

Smith et al. (2000), Journal of the Operational Research Society 51:5

Large data warehouse system

Recorded every transaction & claim

Data mining to predict average claim costs & frequency, impact on profitability

Pricing

Page 57: Final Project

Customer Retention Analysis

Over 20,000 motor vehicle policies due for renewal in one month

About 7% didn’t renew

Expected reasons: price, service, value of vehicle

Page 58: Final Project

Customer Retention Results

Data Mining

Enterprise Miner

Used data exploration to select variables (13)

Used log transforms for highly skewed data

Performed logistic regression, decision trees, neural networks

Neural network fit test set best

But low correct rate for termination

Page 59: Final Project

Claims Analysis

Recent growth in policies

Lower profitability

Could improve by lowering frequency, reducing claim amounts

Data over a three-year period

Sample size well over 100,000 per quarter

Descriptive statistics: high growth in young people, insurance over $40,000

Page 60: Final Project

Claims Models

Clustering

Predict group policy claims behavior

Used 50 clusters

K-means algorithm

Identified several clusters with abnormal cost ratios or frequency size
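For readers unfamiliar with K-means, the algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a plain-Python illustration on made-up data with 2 clusters; the study used 50 clusters on real policy data, and the starting centroids here (the first k points) are a simplifying assumption.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Basic K-means. Returns (centroids, labels). Starts from the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points]
    return centroids, labels


# Hypothetical (cost ratio, claim frequency) pairs for six policy groups.
points = [(0.5, 1.0), (0.6, 1.1), (0.55, 0.9), (2.0, 3.0), (2.1, 3.2), (1.9, 2.9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Clusters whose centroids show unusually high cost ratios or frequencies would be the ones singled out for further review.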

Page 61: Final Project

TELECOMMUNICATIONS

Deregulation - widespread competition

churn

1/3 poor call quality, 1/2 poor equipment

wireless performance monitor tracking

reduced churn about 61%, $580,000/year

cellular fraud prevention

spot problems when cell phones begin to go bad

Page 62: Final Project

Telecommunications

Metapath’s Communications Enterprise Operating System

help identify telephone customer problems

dropped calls, mobility patterns, demographics

to target specific customers

reduce subscription fraud: $1.1 billion

reduce cloning fraud: cost $650 million in 1996

Page 63: Final Project

Telecommunications

Churn Prophet, ChurnAlert

data mining to predict subscribers who cancel

Arbor/Mobile

set of products, including churn analysis

Page 64: Final Project

TELEMARKETING

MCI uses data marts to extract data on prospective customers

typically a 2-month program

20% improvement in sales leads

multimillion investment in data marts & hardware

staff of 45

trend spotting (which approaches specific customers)

Page 65: Final Project

Telemarketing

Australian Tourist Commission

maintained database since 1992

responses to travel inquiries on tours, hotels, airlines, travel agents, consumers

data mine to identify travel agents & consumers responding to various media

sales closure rate at 10% and up

lead lists faxed weekly to productive travel agents

Page 66: Final Project

Telemarketing

Segmentation

Which customers respond to new promotions, to discounts, to new product offers

Determine

whom to offer new service to

those most likely to commit fraud

Page 67: Final Project

Human Resource Management

Identify individuals liable to leave company without additional compensation or benefits

Firm may already know 20% use 80% of offered services

don’t know which 20%

data mining (business intelligence) can identify

Use most talented people in highest priority (or most profitable) business units

Page 68: Final Project

Human Resource Management

Downsizing

identify right people, treat them well

track key performance indicators

data on talents, company needs, competitor requirements

State of Mississippi’s MERLIN network

30 databases (finance, payroll, personnel, capital projects)

Cognos Impromptu system - 230 users

Page 69: Final Project

CASINOS

Casino gaming one of richest data sets known

Harrah’s - incentive programs

about 8 million customers hold Total Gold cards, used whenever the customer spends money in the casino

comprehensive data collection

Trump’s Taj Card similar

Page 70: Final Project

Casinos

Bellagio & Mandalay Bay

strategy of luxury visits

child entertainment

change from old strategy - cheap food

Identify high rollers - cultivate

identify those to discourage from play

estimate lifetime value of players

Page 71: Final Project

ARTS

Computerized box offices lead to high volumes of data

Identify potential consumers for shows

Software to manage shows

similar to airline seating chart software