Data Mining: The Next Revolution in Institutional Research
C. R. Thulasi Kumar
Office of Information Management & Analysis
University of Northern Iowa
May 31, 2004

Page 1

Data Mining: The Next Revolution in Institutional Research

C. R. Thulasi Kumar
Office of Information Management & Analysis
University of Northern Iowa
May 31, 2004

Page 2

The Evolution of Data Analysis

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| --- | --- | --- | --- | --- |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups | Prospective, proactive information delivery |

Source: SPSS BI

Page 3

What is Data Mining?

• The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).

• The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).

• The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro, and Matheus).

Page 4

Differences between Statistics and Data Mining

| Statistics | Data Mining |
| --- | --- |
| Confirmative | Explorative |
| Small data sets/File-based | Large data sets/Databases |
| Small number of variables | Large number of variables |
| Deductive | Inductive |
| Numeric data | Numeric and non-numeric |
| Clean data | Data cleaning |

Page 5

Why Data Mining?

• Too much data
    • Too many records
    • Too many variables

• Interesting patterns are difficult to find with traditional statistics, due to
    • Complex nonlinear relationships
    • Multi-variable combinations

Source: Abbott, Data Mining: Level II

Page 6

Data Mining is not…

• OLAP
• Data Warehousing
• Data Visualization
• SQL
• Ad Hoc Queries
• Reporting

Page 7

Data Mining Algorithms

• Statistics: Distributions, mathematics, etc.

• Machine Learning: Computer science, heuristics and induction algorithms

• Artificial Intelligence: Emulating human intelligence

• Neural Networks: Biological models, psychology and engineering

Page 8

Data Mining is…

• Predictive Modeling
    • Linear/Logistic Regression
    • Neural Networks
    • Decision Trees

• Clustering
    • Kohonen Neural Networks Clustering
    • K-Means Clustering
    • Nearest Neighbor Clustering

Page 9

Data Mining is… (cont’d)

• Segmentation
    • Decision Trees
    • Neural Networks
    • Predictive Modeling

• Affinity Analysis
    • Association Rule
    • Sequence Generators

[Figure: decision tree for Credit ranking (1=default). Each node lists percent and count (n) for Bad and Good, plus the node's share of all cases.]

Credit ranking (1=default): Bad 52.01% (168), Good 47.99% (155), Total 100.00% (323)
Split: Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
    Weekly pay: Bad 86.67% (143), Good 13.33% (22), Total 51.08% (165)
    Split: Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
        Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total 48.92% (158)
        Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total 2.17% (7)
    Monthly salary: Bad 15.82% (25), Good 84.18% (133), Total 48.92% (158)
    Split: Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
        Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total 15.17% (49)
        Split: Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
            Management; Clerical: Bad 0.00% (0), Good 100.00% (8), Total 2.48% (8)
            Professional: Bad 58.54% (24), Good 41.46% (17), Total 12.69% (41)
        Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total 33.75% (109)
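The split statistics in the credit tree can be checked from the node counts. The sketch below recomputes the statistic for the root split (Paid Weekly/Monthly) in plain Python. Treating the reported value as a likelihood-ratio chi-square is an inference from the arithmetic, not something the slide states. The Pearson form is included only for comparison, and the function names are illustrative.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def likelihood_ratio_chi_square(table):
    """G = 2 * sum(O * ln(O / E)); empty cells contribute nothing."""
    expected = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


def pearson_chi_square(table):
    """X^2 = sum((O - E)^2 / E), shown for comparison."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Root split of the credit tree: rows are pay frequency, columns are (Bad, Good).
root_split = [
    [143, 22],   # Weekly pay
    [25, 133],   # Monthly salary
]

g = likelihood_ratio_chi_square(root_split)
x2 = pearson_chi_square(root_split)
print(f"likelihood-ratio chi-square: {g:.4f}")
print(f"Pearson chi-square:          {x2:.4f}")
```

With these counts the likelihood-ratio form should land close to the 179.6665 shown for the root split, while the Pearson form gives a noticeably smaller value of roughly 162.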

Page 10

Kohonen Network

• Seeks to describe dataset in terms of natural clusters of cases

Source: SPSS BI
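As a rough illustration of how a Kohonen network groups cases into natural clusters, the sketch below trains a tiny one-dimensional self-organizing map in plain Python. The synthetic two-group data, the map size, and the learning-rate and neighbourhood schedules are all assumptions made for this example. It does not describe how Clementine or any other product implements the method.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to case x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)),
    )


def train_som(data, n_units=4, epochs=30, lr0=0.5, sigma0=1.5, seed=0):
    """Train a 1-D Kohonen map and return its weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(row[d] for row in data) for d in range(dim)]
    hi = [max(row[d] for row in data) for d in range(dim)]
    # Start with units spread evenly between the observed minimum and maximum.
    weights = [
        [lo[d] + (hi[d] - lo[d]) * j / (n_units - 1) for d in range(dim)]
        for j in range(n_units)
    ]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay               # learning rate shrinks over time
        sigma = sigma0 * decay + 0.1   # neighbourhood radius shrinks too
        cases = list(data)
        rng.shuffle(cases)
        for x in cases:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Units near the winner on the map are pulled along with it.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Synthetic data: two well-separated groups of cases (illustrative only).
rng = random.Random(1)
group_a = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(30)]
group_b = [[rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)] for _ in range(30)]

weights = train_som(group_a + group_b)
units_a = {best_matching_unit(weights, x) for x in group_a}
units_b = {best_matching_unit(weights, x) for x in group_b}
print("units used by group A:", sorted(units_a))
print("units used by group B:", sorted(units_b))
```

Each case is assigned to its closest map unit, so cases from the two groups end up on different units. That assignment is the "natural clusters" description in miniature.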

Page 11

Apriori

• Seeks association rules in dataset
• “Market Basket” analysis
• Sequence discovery

Source: SPSS BI
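To make the idea of association rules concrete, here is a minimal sketch of an Apriori-style search in plain Python. The baskets, the support and confidence thresholds, and the function names are all hypothetical, chosen only for illustration. The sketch covers frequent itemsets and rules and does not attempt sequence discovery.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Level-wise Apriori search. Returns {itemset: support}."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)
    items = sorted({item for b in baskets for item in b})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that occur in enough baskets.
        level = {}
        for cand in candidates:
            support = sum(1 for b in baskets if cand <= b) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def association_rules(frequent, min_confidence):
    """Rules (lhs, rhs, support, confidence) derived from frequent itemsets."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


# Hypothetical market baskets, not taken from the presentation.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = frequent_itemsets(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)
for lhs, rhs, support, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={support:.2f}", f"confidence={confidence:.2f}")
```

The prune step is what gives the method its name: an itemset can only be frequent if all of its subsets are, so most candidates are discarded before the data is scanned for them.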

Page 12

Areas of Current Application

• Credit Card/Insurance Fraud Detection
• Credit/Risk Scoring
• Direct Mail Marketing
• Parts Failure Prediction
• Recruiting/Attracting Customers
• Service Delivery and Customer Retention
• “Market Basket” Analysis

Page 13

Higher Education Applications

• Student academic success/Retention and graduation
• Identify high risk students
• Predict course demand
• Profile good transfer candidates
• Application success rates
• Predict potential alumni donations

Page 14

Software Vendors

• Clementine (SPSS)
• Intelligent Miner (IBM)
• Insightful Miner (Insightful)
• Enterprise Miner (SAS)
• Affinium Model (Unica)
• CART (Salford Systems)
• XLMiner
• GhostMiner
• S-PLUS

Page 15

Clementine (SPSS)

Page 16

Insightful Miner (Insightful)

Page 17

CART (Salford Systems)

Page 18

How much does it cost?

• Clementine (SPSS): Price varies
• Insightful Miner (Insightful): Small; a fraction of the price of other mining tools
• Enterprise Miner (SAS): Academic server license $40K-100K
• Affinium Model (Unica)
• Intelligent Miner (IBM)
• XLMiner: Standard academic version $199 for two years
• GhostMiner: $2.5K-30K + Maintenance fee
• CART (Salford Systems): Very low for academic license

Page 19

Resources

Web Sites
• http://www.kdnuggets.com/
• http://www.uni.edu/instrsch/dm/index.html

Training
• http://www.the-modeling-agency.com

Page 20

What is Data Mining?

• The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).

• The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro, and Matheus).

Data Mining in Institutional Research

• Data analysis for institutional research (IR) has evolved from simple retrospective data delivery in the 1960s to retrospective dynamic data delivery at multiple levels in the 1990s. Unlike past methodologies, data mining is prospective and proactive in data analysis and information delivery. With a blend of tools and techniques from disciplines such as statistics, computer science, mathematics, biology and engineering, data mining provides new opportunities for institutional research professionals to provide decision support data. This site provides a collection of resources from an introductory perspective for institutional research professionals interested in data mining.

• As this area is still in its infancy, real-world examples of IR applications are difficult to find, let alone emulate. As more examples in IR become available, this site will be updated. Until then, most of the examples refer to current data mining applications in the business and industry sectors.

• Data mining has been used by universities in a number of areas, including but not limited to enrollment management, retention and graduation analysis, survey data analysis, and donation prediction (alumni contribution).

Comments or Suggestions? Email Dr. Kumar, Information Management & Analysis

Last Modified: March 25, 2004

Copyright 2004 University of Northern Iowa Office of Information Management & Analysis

Page 21

Training (The Modeling Agency)

DATA MINING: LEVEL I
A Strategic Overview of Methods, Resources and Applications for Predictive Analytics by Tony Rathburn; Eric Siegel
Registration: $1,295, 2 Days*
• Washington, DC - June 21 & 22, 2004
• San Diego, CA - September 20 & 21, 2004
• Las Vegas, NV - November 29 & 30, 2004
*DM Levels I & II Package $1,995

DATA MINING: LEVEL II
A Tactical Drill-Down of the Data Mining Process, Tools and Techniques by Dean Abbott
Registration: $1,295, 2 Days*
• Washington, DC - June 23 & 24, 2004
• San Diego, CA - September 22 & 23, 2004
• Las Vegas, NV - December 1 & 2, 2004

DATA MINING: LEVEL III
A Hands-On Application Workshop for Data Mining Practitioners by Dean Abbott
Registration: $695, 1 Day*
• Washington, DC - June 25, 2004
• San Diego, CA - September 24, 2004
• Las Vegas, NV - December 3, 2004

Page 22

Selected Data Mining Books

Page 23

What percentage (%) of time in your data mining project(s) is spent on data cleaning and preparation? (187 votes total)

| Time spent | Votes | Share |
| --- | --- | --- |
| Over 80% | 46 | 25% |
| 61 to 80% | 73 | 39% |
| 41 to 60% | 46 | 25% |
| 21 to 40% | 7 | 4% |
| 20% or less | 15 | 8% |

Source: http://www.kdnuggets.com/
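The poll figures can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the rounded shares from the vote counts on the slide. It also adds one derived figure that the slide does not state: the combined share of respondents reporting more than 60% of project time.

```python
# Vote counts as listed on the slide.
votes = {
    "Over 80%": 46,
    "61 to 80%": 73,
    "41 to 60%": 46,
    "21 to 40%": 7,
    "20% or less": 15,
}

total = sum(votes.values())
shares = {label: round(100 * count / total) for label, count in votes.items()}

# Derived figure: respondents spending more than 60% of project time.
over_60 = 100 * (votes["Over 80%"] + votes["61 to 80%"]) / total

print("total votes:", total)
for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"more than 60% of project time: {over_60:.1f}% of respondents")
```

The counts sum to the stated 187 votes, and the rounded shares match the slide. Those shares add up to 101% only because of rounding.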

Page 24

Thank You