Data Mining: Introduction, Techniques, Case Studies & Benchmarking

Milan, April 2nd, 2008

Franco Orsogna
Deloitte Enterprise Risk Services Italia

Data Mining: Introduction and Case Studies ©2008 Deloitte Touche Tohmatsu

Index

•Objectives

•Definition

•Main Techniques

•Industry Sectors & Application Fields

•Case Studies with WizRule tool

•Conclusions & possible uses

•Appendix

- Data Mining tools & Benchmark

- Focus on WizRule

- WizRule vs Leading DM tools

Objectives

• Define Data Mining

• Show the main techniques available in DM

• Give examples of the main application fields

• Present the results of applying the WizRule software to 3 case studies

The appendix includes a benchmark of several Data Mining tools available on the market

Definition (1/2)

“A fast and inexpensive way of summarizing, exploring, understanding, and analyzing data…without requiring human intervention” (*)

“Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” (**)

(*) J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, 2000

(**) Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, “Advances in Knowledge Discovery and Data Mining”, 1996

Definition (2/2)

Data Mining is a discipline that brings together knowledge from several fields:

•Statistics

•Pattern recognition

•Operating Logic

•Algorithm Theory

•Artificial Intelligence

•Etc…

Main Techniques (1/2)

The main Data Mining Techniques are:

• Association Rules: this kind of algorithm finds items correlated by “if-then” rules

e.g.: if a customer has a high income and owns a car, then they may be interested in purchasing a car warranty extension

• Clustering: splits data into subsets that naturally belong together

e.g.: a set of customers can be split into subsets of customers with similar consumption patterns

• Decision Trees: build a pattern of nested “if-then” rules

[Figure: example decision tree. Decision nodes: “Salary > €30.000?” (Yes / No), “# children?” (<3 / >2), “state-employees?” (Yes / No). Leaves: “Reliable”, “Not reliable”.]
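The association rule example above (high income, car ownership, warranty extension) can be made concrete with a small sketch. The customer records, the field names and the helper `rule_stats` are all invented for illustration; the sketch only shows how the support and confidence of a single “if-then” rule are usually counted, and it is not the algorithm of any specific tool.

```python
# Toy customer records; field names and values are invented for illustration.
customers = [
    {"high_income": True,  "owns_car": True,  "bought_warranty": True},
    {"high_income": True,  "owns_car": True,  "bought_warranty": True},
    {"high_income": True,  "owns_car": True,  "bought_warranty": False},
    {"high_income": False, "owns_car": True,  "bought_warranty": False},
    {"high_income": True,  "owns_car": False, "bought_warranty": False},
]


def rule_stats(records, condition, conclusion):
    """Return (support, confidence) of the rule 'if condition then conclusion'."""
    matching = [r for r in records if condition(r)]      # records where the "if" part holds
    satisfied = [r for r in matching if conclusion(r)]   # ... and the "then" part holds too
    support = len(satisfied) / len(records)
    confidence = len(satisfied) / len(matching) if matching else 0.0
    return support, confidence


support, confidence = rule_stats(
    customers,
    condition=lambda r: r["high_income"] and r["owns_car"],
    conclusion=lambda r: r["bought_warranty"],
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule is normally kept only if both figures exceed user-chosen thresholds.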

Main Techniques (2/2)

• Classification: builds a predictive model to classify data

e.g.: analyzing existing bank customers to build a model for characterizing new ones

• Summarization: reduces the data to a smaller subset that still represents the original population well

e.g.: selecting some customers as a sample of the entire population in the data

• Deviation Detection: discovers instances (records) that differ from the rest of the population

e.g.: atypical users (such as users with atypical access privileges) in a corporate DB

• Etc…

Industry Sectors & Application Fields (1/2)

Many industry sectors use Data Mining:

•Banking

•Telecommunications

•Retail

•Etc…

Application fields include, for example:

•Customer acquisition

•Cross-sell, co-marketing

•Credit Risk analysis

•Fraud analysis

•Etc…

Industry Sectors & Application Fields (2/2)

Two kinds of DM services are possible:

• Data quality, data cleaning, data analysis, etc.

• Fraud detection (Audit Support / External Clients)

Typical examples of frauds to be detected are:

• Credit card frauds

• Money laundering

• Securities Frauds

• Telecom frauds

Case Studies: Introduction

We acquired a DM tool in order to evaluate the potential of Data Mining in Audit Support services and in possible new ERS services.

We chose WizRule for its simplicity and low license cost.

We applied the tool to datasets from previous IT Audit support work on FSA engagements. The 3 “case studies” that follow relate to:

• Journal entries

• Stock in/outflows

• Timesheet entries

Case Study 1: Journal Entries
Dataset

• Content of dataset: journal entries related to a company for an entire fiscal year.

• Number of records: 57.342

• Number of fields: 15

• Analysis parameter settings:

Minimum Probability of If-then Rules: 0,99

Minimum Accuracy Level of Formula Rules: 0,99

Minimum Number of Cases in a Rule: 300

Case Study 1: Journal Entries
Processing results and Performance

• Rules found: 3.729

• Spelling deviations found: 28

• Rule deviations found: 1.253

• Time spent by the tool*: 3 minutes

* on a laptop PC with the following specifications:

- Processor: Intel Pentium M 1,73 GHz

- RAM: 1 GB

- Free HD Space: 25 GB

Case Study 1: Journal Entries
Rules and deviations found – examples (1/3)

RULE:

28) If Account is 2203010101

Then

User is UFFAMMCICT1

Rule's probability: 1,000

The rule exists in 425 records.

Significance Level: Error probability is almost 0

Case Study 1: Journal Entries
Rules and deviations found – examples (2/3)

RULE DEVIATION:

27) If Account is 1701040100

Then

User is UFFAMMCICT4

Rule's probability: 0,993

The rule exists in 963 records.

Significance Level: Error probability is almost 0

Deviations (records' serial numbers):

33008, 19420, 19422, 24640, 24643, 29022, 29025
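The figures in this deviation report are consistent with reading the rule's probability as the share of records matching the condition that also satisfy the conclusion. The short check below works under that assumption, and under the further assumption that the seven listed deviations are the only records matching the condition but not the conclusion; neither is confirmed by the tool's documentation here.

```python
# Figures taken from the rule deviation report above.
rule_records = 963  # "The rule exists in 963 records."
deviations = [33008, 19420, 19422, 24640, 24643, 29022, 29025]

# Assumption: every record matching the condition either satisfies the rule
# or appears in the deviation list.
condition_records = rule_records + len(deviations)
probability = rule_records / condition_records

print(condition_records)      # 970
print(round(probability, 3))  # 0.993, matching the reported 0,993
```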

Case Study 1: Journal Entries
Rules and deviations found – examples (3/3)

SPELLING DEVIATION:

Deviation #2 (out of 28)
Record No. 10157

  Field            Value
  Entry_Date       28/02/2006
  Entry_N          3800000524
X User             DIRSAP
  Trial_Bal_Acc.   5002060101
  Posting_Date     07/03/2006
  …

Rules explaining how the case deviates from the norm:
The value DIR2SAP appears 3.208 times in the User field. There are 2 case(s) containing similar value(s): 10157, 10158.
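A spelling deviation of this kind (a rare value that closely resembles a very frequent one) can be approximated with Python's standard library. This is only a sketch of the idea, not WizRule's actual method: the function `spelling_deviations`, its thresholds and the similarity measure (`difflib.SequenceMatcher`) are assumptions, and the sample column reuses the two counts from the report plus an invented third value.

```python
from collections import Counter
from difflib import SequenceMatcher


def spelling_deviations(values, min_common=100, max_rare=5, min_similarity=0.8):
    """Return (rare, common, rare_count, common_count) for rare values that look like common ones."""
    counts = Counter(values)
    common = [v for v, n in counts.items() if n >= min_common]
    rare = [v for v, n in counts.items() if n <= max_rare]
    suspects = []
    for r in rare:
        for c in common:
            # ratio() is 1.0 for identical strings and falls as they differ.
            if SequenceMatcher(None, r, c).ratio() >= min_similarity:
                suspects.append((r, c, counts[r], counts[c]))
    return suspects


# Illustrative User column: two counts from the report, one invented filler value.
users = ["DIR2SAP"] * 3208 + ["DIRSAP"] * 2 + ["UFFAMMCICT1"] * 425
suspects = spelling_deviations(users)
print(suspects)  # [('DIRSAP', 'DIR2SAP', 2, 3208)]
```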

Case Study 2: Stocks Accounting
Dataset

• Content of dataset: quarterly stock in/outflows

• Number of records: 2.359.335

• Number of fields: 14

• Analysis parameter settings:

Minimum Probability of If-then Rules: 0,99

Minimum Accuracy Level of Formula Rules: 0,99

Minimum Number of Cases in a Rule: 10.000

Case Study 2: Stocks Accounting
Performance

• Rules found: 268

• Spelling deviations found: 25

• Rule deviations found: 5.889

• Time spent by the tool*: 14 minutes

* on a laptop PC with the following specifications:

- Processor: Intel Pentium M 1,73 GHz

- RAM: 1 GB

- Free HD Space: 25 GB

Case Study 2: Stocks Accounting
Rules and deviations found – examples (1/4)

RULE:

33) If WAREHOUSE_CODE is 3002

Then

UNIT OF MEASURE is NR (=Number of item)

Rule's probability: 1,000

The rule exists in 27.151 records.

Significance Level: Error probability is almost 0

Case Study 2: Stocks Accounting
Rules and deviations found – examples (2/4)

RULE:

34) If COD_MAG is 8001

Then

DATA is 2005-03-16

Rule's probability: 1,000

The rule exists in 12.755 records.

Significance Level: Error probability is almost 0

Case Study 2: Stocks Accounting
Rules and deviations found – examples (3/4)

RULE DEVIATION:

3532) If MAT_LOC is SDOP

Then

UNIT OF MEASURE is NR is MP

Rule's probability: 0,998

The rule exists in 13.668 records.

Significance Level: Error probability is almost 0

Deviations (records' serial numbers):

47613, 880530, 880745, 881033, 881654, 881700,

881702, 881726, 881748, 881778, …

Case Study 2: Stocks Accounting
Rules and deviations found – examples (4/4)

SPELLING DEVIATION:

No significant spelling deviations were found.

Case Study 3: Time Sheet Entries
Dataset

• Content of dataset: time sheet by customer

• Number of records: 10.978

• Number of fields: 10

• Analysis parameter settings:

Minimum Probability of If-then Rules: 0,90

Minimum Accuracy Level of Formula Rules: 0,90

Minimum Number of Cases in a Rule: 70

Case Study 3: Time Sheet
Performance

• Rules found: 346

• Spelling deviations found: 24

• Rule deviations found: 625

• Time spent by the tool*: 20 seconds

* on a laptop PC with the following specifications:

- Processor: Intel Pentium M 1,73 GHz

- RAM: 1 GB

- Free HD Space: 25 GB

Case Study 3: Time Sheet
Rules and deviations found – examples (1/3)

RULE:

3) If CUSTOMER is ABC COMPANY LTD

Then

CUST_CODE is 1.286,00

Rule's probability: 1,000

The rule exists in 664 records.

Significance Level: Error probability is almost 0

Case Study 3: Time Sheet
Rules and deviations found – examples (2/3)

RULE DEVIATION:

191) If CUSTOMER is XYZ COMPANY LTD

Then

OVERTIME is 0,00

Rule's probability: 0,945

The rule exists in 156 records.

Significance Level: Error probability is almost 0

Deviations (records' serial numbers):

3492, 3494, 3585, 3578, 3463, 3476, 3523, 3526, 3549

Case Study 3: Time Sheet
Rules and deviations found – examples (3/3)

SPELLING DEVIATION:

Deviation #1 (out of 24)
Record No. 3598

  Field       Value
  CUST_CODE   11179.000000
X CUSTOMER    XYZ COMPANY LTD
  CONTRACT    70383.000000
  BRANCH      Milan, March 22 (S)
  REG_NUM     86878.000000
  …

Rules explaining how the case deviates from the norm:
The value XYZ COMPANY LTD appears 165 times in the CUSTOMER field. There are 3 case(s) containing similar value(s): 3598, 3599, 3600.

Case Studies – Some Considerations

WizRule Pros

•Easy-to-use: just set the “quantity” and “quality” fields

•Good processing performance (depends mainly on the number of fields)

•Effective discovery of hidden knowledge in the datasets

WizRule Cons

•The reports produced are generally long

•Requires sufficiently deep knowledge of the data being analyzed

•A small change in the parameter settings can substantially change the reports produced

Case Studies - Possible uses

•Data cleaning

•Limited use in IT Audit support on FSA engagements: it could add value, but it takes up additional time

•The deviation analysis could support fraud detection activities

Appendix

Data Mining Tools

• Enterprise Miner (SAS Institute Inc.): http://www.sas.com/technologies/analytics/datamining/miner/

• Clementine (SPSS Inc.): http://www.spss.com/clementine/

• StatSoft Statistica Data Miner: http://www.statsoft.com/products/dataminer.htm

• IBM DB2 Intelligent Miner: http://www-06.ibm.com/software/uk/forms/pdf_form_catalog_uk.html

• Waikato Weka Project: http://www.cs.waikato.ac.nz/~ml/weka/

• Microsoft SQL Server 2005 (beta version): http://www.microsoft.com/sql/2005/

• WizSoftware WizRule: http://www.wizsoft.com

Note: the terms and prices on the following slides are indicative

SAS Enterprise Miner (1/2)

•Software house: SAS, Inc.

•Platform:

- Server: Unix/Linux, Solaris and MS-Windows

- Client: Java based

•User Interface: “user-friendly”

•Integrated with forecasting and other analysis SAS modules

•General Purpose Tool that includes many DM algorithms (Rule Induction, DT, NN, Clustering, …)

•Price: not available

SAS Enterprise Miner (2/2)

Enterprise Miner’s Front-end.

SPSS Clementine (1/2)

•Software house: SPSS, Inc.

•Platform:

- Server: Unix/Linux, Solaris and MS-Windows

- Client: Java based

•User Interface: “user-friendly”

•Integrated with SPSS statistical modules

•Price: ~ 60.000 €

SPSS Clementine (2/2)

Clementine’s visual interface

StatSoft Statistica Data Miner (1/2)

•Software house: StatSoft, Inc. (UK)

•Platform: MS-Windows, Unix

•User Interface: “user-friendly”

•General Purpose Tool that includes many DM algorithms (HMM, NN, DT, AR, K-Means Clustering, …)

•C/C++ Programming Interface

•Price: ~18.000 € + 3.000 €/year

StatSoft Statistica Data Miner (2/2)

Example of graphic representation of association rules found by the tool.

IBM DB2 Intelligent Miner (1/2)

• Platform: MS-Windows, AIX, OS/400

• It can be easily integrated with DB2 (it also supports Oracle DB)

• It supports an interactive process: data processing, statistical analysis and visualization of the results

• Many DM algorithms implemented

• Scalable processing

• Designed to process very large datasets

• Price: > 50.000 €

IBM DB2 Intelligent Miner (2/2)

Front-end and available algorithms in DB2 Intelligent Miner.

Waikato Weka Project (1/2)

•Developed by Waikato University (NZ)

•Platform: Java

•Very versatile (Clustering, Visualization, Analysis, ANN, …)

•Analysis and visualization tools

•Data filters

•Open Source

•Not very easy for “non DM” users

•Association rules represented only through text

•Price: free

Waikato Weka Project (2/2)

Visualization of association rules in Weka

Microsoft SQL Server 2005 (1/2)

•Platform: MS-Windows

•Analysis Services Module is part of SQL Server

•Built-in business rules, tools and wizards to help with the analysis of:

- Semi-additive measures

- Time intelligence

- Account intelligence

- Financial aggregations

•Price: ~ 4.000 $ for “Workgroup” version or ~ 730 $ for 5 licenses

Microsoft SQL Server 2005 (2/2)

Visualization of Decision Trees in SQL Server 2005

WizRule

•Platform: Windows

•Easy to use

•Reports easy to understand

•Audit perspective*

•It can acquire data from many sources

•Price: ~ 1.400 $ (for 1 license)

* through the Deviation Report it is possible to immediately investigate anomalies in the data

WizRule: what does it find? (1/2)

WizRule finds association rules and formula relationships (of specific types) among fields.

It also shows cases that deviate from the rules it found, provided that they:

• are not explained by other rules, and

• have a low frequency relative to the overall frequency.

The tool treats these kinds of deviations as suspected errors.

WizRule: what does it find? (2/2)

The rules it finds are of this kind:

• if C1=a and C2=b … Ci=w then Cz = y

• if C1 starts with “abcde” then C2=y

• if C1 is between a and b, then C2=y

where Cx is the value of the x-th field and “abcde” is a string of at most 5 characters.

It is also able to find formula relationships (of specific types) among fields, such as:

• C1 = a × C2 + b

• C1 = a / C2

where Cx is the value of the x-th field, and a and b are constants.
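A formula relationship of the first type can be checked against a dataset by counting how many records satisfy it. The sketch below assumes the constants a and b are already known and measures the fraction of rows that fit; how WizRule discovers the constants, or how it defines the accuracy level, is not described in the slides. The function `formula_accuracy` and the sample rows are invented for illustration.

```python
def formula_accuracy(rows, a, b, tol=1e-6):
    """Return (accuracy, deviating_indices) for the relationship C1 = a * C2 + b.

    rows is a list of (c1, c2) pairs; a row deviates when c1 differs from
    a * c2 + b by more than tol.
    """
    deviating = [i for i, (c1, c2) in enumerate(rows) if abs(c1 - (a * c2 + b)) > tol]
    accuracy = 1 - len(deviating) / len(rows)
    return accuracy, deviating


# Invented data: C1 = 1.2 * C2 + 5 holds in every row except the one at index 3.
rows = [(17.0, 10.0), (29.0, 20.0), (41.0, 30.0), (50.0, 40.0), (65.0, 50.0)]
accuracy, deviating = formula_accuracy(rows, a=1.2, b=5.0)
print(accuracy, deviating)
```

Rows listed in `deviating` play the same role as the deviations reported for if-then rules: they are candidates for a closer look, not proven errors.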

WizRule: how does it work?

The user:

• selects the source of data

• “fine-tunes” the analysis parameters

Then the tool reveals through specific reports:

• the rules governing the data

• suspected errors/deviations

WizRule: what source of data?

The tool is able to acquire data from numerous sources:

• .dbf files (dBase, Fox Pro, Clipper etc)

• Ms Access files

• Ms SQL Server tables

• Oracle tables

• ODBC compliant databases

• OLE DB compliant databases

• ASCII-type text files

WizRule: “fine-tuning”

The user can tune the following analysis parameters to increase or decrease the number of rules found:

•Minimum probability of “if-then” rules (confidence level)

•Minimum accuracy level of formula rules

•Minimum number of cases of a rule (support level)

•Minimum number of conditions
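The effect of the first and third parameters can be illustrated with a minimal rule miner restricted to single-condition rules of the form “if A is x then B is y”. This is a sketch under stated assumptions, not WizRule's algorithm: the function `mine_rules`, the sample records and the threshold values are invented, and the accuracy of formula rules and the minimum number of conditions are not modelled.

```python
from collections import Counter, defaultdict


def mine_rules(records, min_probability=0.99, min_cases=300):
    """Find rules 'if cond_field is x then then_field is y' that pass both thresholds.

    Each result is (cond_field, x, then_field, y, probability, cases, deviations),
    where deviations lists the indices of records that match the condition
    but not the conclusion.
    """
    fields = list(records[0].keys())
    rules = []
    for cond_field in fields:
        for then_field in fields:
            if cond_field == then_field:
                continue
            # For each condition value, count the values seen in the other field.
            by_value = defaultdict(Counter)
            for rec in records:
                by_value[rec[cond_field]][rec[then_field]] += 1
            for cond_value, outcomes in by_value.items():
                then_value, cases = outcomes.most_common(1)[0]
                probability = cases / sum(outcomes.values())
                if probability >= min_probability and cases >= min_cases:
                    deviations = [
                        i for i, rec in enumerate(records)
                        if rec[cond_field] == cond_value and rec[then_field] != then_value
                    ]
                    rules.append((cond_field, cond_value, then_field, then_value,
                                  probability, cases, deviations))
    return rules


# Invented journal-entry-like data: account A is almost always posted by user U1.
records = (
    [{"Account": "A", "User": "U1"}] * 99
    + [{"Account": "A", "User": "U2"}]
    + [{"Account": "B", "User": "U3"}] * 50
)

loose = mine_rules(records, min_probability=0.95, min_cases=60)
strict = mine_rules(records, min_probability=0.999, min_cases=60)
print(len(loose), len(strict))
```

With the looser probability threshold the rule “if Account is A then User is U1” is kept and record 99 is reported as its deviation; with the stricter one that rule disappears, which shows how a small parameter change can alter the reports.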

WizRule: what does it show?

Once the tool has processed the data, it creates the following reports:

1. Rule Report

2. Deviation Report

3. Spelling Report

WizRule: reports (1/3)

Rule Report (Screenshot)

WizRule: reports (2/3)

Spelling Report (Screenshot)

WizRule: reports (3/3)

Deviation Report (Screenshot)

Benchmark of DM Tools
Supported Platforms (1/3)

[Table: Supported Platforms. Columns: PC Standalone (Windows), Unix Standalone, Unix Server/PC Client, Windows Server/PC Client, Database Connectivity. Number of columns marked X per tool: WizRule 2, Enterprise Miner 4, Clementine 4. The transcript does not preserve which columns are marked.]

Benchmark of DM Tools
Data Input & Model Output (2/3)

[Table: Data Input & Model Output. Columns: Automatic Header, ODBC, Native Database Driver, Summary Report, Output Source Code. Number of columns marked X per tool: WizRule 4, Enterprise Miner 4, Clementine 3. The transcript does not preserve which columns are marked.]

Benchmark of DM Tools
Algorithms (3/3)

[Table: Algorithms. Columns: Decision Trees, Linear/Statistical, Multi-layer Perceptrons, Radial Basis Functions, Rule Induction, Generalized Linear Mod., K-Means, Association Rules, Kohonen. Marks per tool: WizRule X (1), Enterprise Miner 7 columns marked X, Clementine 7 columns marked X. The transcript does not preserve which columns are marked.]

(1) with WizWhy Module

Member of Deloitte Touche Tohmatsu

Deloitte refers to one or more of Deloitte Touche Tohmatsu, a Swiss Verein, its member firms, and their respective subsidiaries and affiliates. As a Swiss Verein (association), neither Deloitte Touche Tohmatsu nor any of its member firms has any liability for each other’s acts or omissions. Each of the member firms is a separate and independent legal entity operating under the names “Deloitte”, “Deloitte & Touche”, “Deloitte Touche Tohmatsu”, or other related names. Services are provided by the member firms or their subsidiaries or affiliates and not by the Deloitte Touche Tohmatsu Verein.