data mining techniques: study, analysis, prevention...

A PH.D. THESIS

DATA MINING TECHNIQUES: STUDY, ANALYSIS, PREVENTION & DETECTION FOR FINANCIAL

CYBER CRIME AND FRAUDS

SUBMITTED TO

GANPAT UNIVERSITY KHERVA

FOR THE AWARD OF

DOCTOR OF PHILOSOPHY

(COMPUTER SCIENCE AND APPLICATION)

BY

JYOTINDRA N. DHARWA

A.M.PATEL INSTITUTE OF COMPUTER STUDIES GANPAT UNIVERSITY, KHERVA

UNDER THE GUIDANCE OF

DR. A. R. PATEL

DIRECTOR, DEPARTMENT OF COMPUTER SCIENCE HEMCHANDRACHARYA NORTH GUJARAT UNIVERSITY, PATAN

APRIL 2010

CONTENTS

Abstract I

Acknowledgement II

Certificate by Research Guide III

Declaration by Ph.D. Student IV

List of Tables V

List of Figures VII

Chapter Contents XI

Chapter 1: Introduction 1

Chapter 2: A comparative Study of Data Mining Techniques 21

Chapter 3: Financial Cyber crime and Frauds 92

Chapter 4: Role of Data Mining in Financial Crime Detection 115

Chapter 5: Data Warehouse Implementation 123

Chapter 6: Development of Transaction Pattern Generation Tool (TPGT) 156

Chapter 7: Development of Transaction Risk Score Generation Model 183

(TRSGM)

Chapter 8: Proposed Financial Cyber Crime Prevention Model & Conclusion 245

I

ABSTRACT

The Internet in India is growing rapidly. It has given rise to new opportunities in every

field we can think of - be it entertainment, business, sports or education. There are two

sides to a coin. Internet also has its own disadvantages. One of the major disadvantages is

Cyber crime- illegal activity committed on the internet. The internet, along with its

disadvantages, has also exposed us to security risks that come with connecting to a large

network. Computers today are being misused for illegal activities like e-mail espionage,

credit card fraud, spasm, software piracy and so on, which invade our privacy and offend

our senses. Criminal activities in the cyberspace are on the rise.

Developing a financial cyber crime detection system is a challenging task. Whenever any

online transaction is performed through the credit card, then there is not any system that

surely predicts a transaction as fraudulent. It just predicts the likelihood of the transaction

to be a fraudulent.

We propose a novel approach for online transaction fraud detection, which combines

evidences from current as well as past behavior. The proposed transaction risk generation

model (TRSGM) consists of five major components, namely, DBSCAN algorithm,

Linear equation, Rules, Historical transaction database and Bayesian learner. DBSCAN

algorithm is used to form the clusters of past transaction amounts of the customer and

find out the deviation of new incoming transaction amount and cluster coverage. The

patterns generated by Transaction Pattern Generation Tool (TPGT) are used in Linear

equation along with its weightage to generate a risk score for new incoming transaction.

The guidelines shown in various web sites, print and electronic media as indication of

online fraudulent transaction for credit card company is implemented as rules in TRSGM.

In the first four components, we determine the suspicion level of each incoming

transaction based on the extent of its deviation from good pattern. The transaction is

classified as genuine, fraudulent or suspicious depending on this initial belief. Once a

transaction is found to be suspicious, belief is further strengthened or weakened

according to its similarity with fraudulent or transaction history using Bayesian learning.

II

ACKNOWLEDGEMENT

I hereby take a chance to express my sense of extreme gratitude towards my Ph.D. guide

Dr. A. R. Patel, for his suggestions and constant inspiration at every stage of the research.

He is an extremely sympathetic and principle-centered person. His skills, as a researcher

and guide helped me to overcome all the hurdles. Without his constant support and

encouragement, I would not have been able to complete my research work successfully.

I owe a debt of gratitude to Prof. P. I. Patel, Director, Ganpat University for

encouragement.

I am thankful to my manager brother who has provided me statistical data at the initial

stage of the research, so my initial model design is just possible due to his support.

I would like to thank my colleagues Dr. N. J. Patel, Dr. S. M. Parikh and staff at Acharya

Motibhai Patel Institute of Computer Studies for their invaluable encouragement and

help.

My parents have their own share in my success. I firmly believe that their blessings

always enlighten my path ahead. I hereby take a chance to salute my father Nathalal and

mother Late Menaben. I would like to thank my brother navinbhai, sister-in-law lilaben,

nephews Vikas and kunal for their love, blessings and moral support throughout my

research work. I give my special thanks to my wife Urmila, daughters Mudra and Aditi ,

without whose support and sacrifice this thesis would not have been possible for me.

At last, I thank the one and all, for the divine blessings.

Jyotindra N. Dharwa

III

CERTIFICATE

I hereby certify that Mr. Jyotindra N. Dharwa has completed his Ph.D. thesis for

doctorate degree on the topic “Data Mining Techniques: Study, Analysis, Prevention &

Detection for Financial Cyber Crime and Frauds” under my guidance.

I further certify that the whole work, done by him is of his own, original and tends to

general advancement of knowledge. According to the best of my knowledge, I also

certify that he has not been conferred any degree, diploma and distinction by either the

Ganpat University or any other university for this thesis.

Date: (Dr. A. R. Patel)

IV

DECLARATION

I, Mr. Jyotindra N. Dharwa hereby declare that my Ph.D. thesis titled “Data Mining

Techniques: Study, Analysis, Prevention & Detection for Financial Cyber Crime and

Frauds” is written as a partial fulfillment of the requirement for a doctorate degree on the

topic. The complete study is based on literature survey, study of periodicals, journals and

websites and building a model for proving the concept studied and designed.

I further declare that the complete thesis work, including all analysis, hypothesis,

inferences and interpretation of data and information, is done by me and it is my own and

original work. Moreover, I declare that no degree, diploma or distinction has been

conferred on the basis of this thesis by the Ganpat University or any other university to

me before.

Date: (Jyotindra N. Dharwa)

V

LIST OF TABLES

Chapter 2

Table 2.1 Steps in the Evolution of Data Mining 22

Table 2.2 Time Line of Data Mining Development 25

Table 2.3 Initial Weight Values for the Neural Network Shown in Figure 2.4 45

Table 2.4 Comparison of Clustering Algorithms 89

Table 2.5 Data Mining Technique for Data Mining Task 90

Chapter 3

Table 3.1 Average (Median) Loss Per Typical Complaint Demographics 109

Table 3.2 Losses based on fraud category wise 110

Chapter 5

Table 5.1 Transaction 126

Table 5.2 Customer_Master 127

Table 5.3 Creditcard_Master 128

Table 5.4 Seller_Master 128

Table 5.5 Address_Master 129

Table 5.6 Product_Master 130

Table 5.7 Product_Category_Master 130

Table 5.8 Shipping_Master 131

Table 5.9 Location_Master 132

Table 5.10 City_Master 133

Table 5.11 State_Master 133

Table 5.12 Country_Master 134

Table 5.13 User_Log_Master 134

Table 5.14 Cardholder_Master 135

Table 5.15 Fraud 136

Table 5.16 Suspect 137

Table 5.17 Customer_DailyCount 138

VI

Table 5.18 Customer_WeeklyCount 139

Table 5.19 Customer_FortnightlyCount 139

Table 5.20 Customer_MonthlyCount 140

Table 5.21 Customer_SundayCount 140

Table 5.22 Customer_HolidayCount 141

Table 5.23 Statistical data of expenditure in category by income 144

Table 5.24 Components of Gaussian distribution 144

Table 5.25 Sample Data of Table Transaction 145

Table 5.26 Credit Card Parameters 152

Chapter 7

Table 7.1 Parameters of the Equation 192

Table 7.2 Sample output of the application for different transaction amounts 231

Table 7.3 Sample output of the application for different sellers 234

Table 7.4 Sample output of the application for different locations 237

VII

LIST OF FIGURES

Chapter 2

Figure 2.1 Historical Perspective of Data Mining 24

Figure 2.2 Decision Tree for Example 2.1 38

Figure 2.3 Decision Tree for Example 2.2 40

Figure 2.4 A Fully Connected Feed-Forwarded Neural Network 44

Figure 2.5 Radial Basis Function Network 70

Figure 2.6 Classification of Clustering Algorithms 72

Figure 2.7 Example of Dendrogram 73

Chapter 3

Figure 3.1 Affecting the Person by Cyber Crime (in %) 93

Figure 3.2 IC3 Complaint Categories (in %) 107

Figure 3.3 Percentage of Referrals by Monetary Loss 108

Figure 3.4 Plastic Card Fraud Losses on UK-issued Cards 1998-2008 109

Figure 3.5 Percentage of Different Plastic Card Fraud Category in Year 1998 110

Figure 3.6 Percentage of Different Plastic Card Fraud Category in Year 2008 111

Figure 3.7 Internet/E-Commerce Fraud Losses on UK-issued Cards 111

Figure 3.8 Revenue Lost to Online Fraud (in %) 113

Chapter 4

Figure 4.1 Architecture of 2-Stage Solution 116

Chapter 5

Figure 5.1 Data warehouse Design Layout-I 142

Figure 5.2 Data warehouse Design Layout-II 143

Figure 5.3 Credit Card Number Semantic Graph 151

Figure 5.4 Sample of Credit Card 154

VIII

Chapter 6

Figure 6.1 Parameters of TPGT 156

Figure 6.2 Subparameters of DP 157

Figure 6.3 Subparameters of CP 158

Figure 6.4 Subparameters of PP 159

Figure 6.5 Subparameters of TP 160

Figure 6.6 Subparameters of WP 161

Figure 6.7 Subparameters of VP 162

Figure 6.8 Subparameters of AP 162

Figure 6.9 Subparameters of FP 163

Figure 6.10 Subparameters of MP 164

Figure 6.11 Subparameters of SP 164

Figure 6.12 Subparameters of HP 165

Figure 6.13 Subparameters of LP 166

Figure 6.14 Subparameters of GP 167

Chapter 7

Figure 7.1 Block Diagram of Proposed Financial Cyber Crime Detection System 209

Figure 7.2 Graph of clusters formed by DBSCAN algorithm for Card id=1 210




Figure 7.6 Sample output of Clusters formed by DBSCAN Algorithm – I 212

Figure 7.7 Sample output of Clusters formed by DBSCAN Algorithm – II 212

Figure 7.8 Sample output of Clusters formed by DBSCAN Algorithm – III 213

Figure 7.9 Sample output of Clusters formed by DBSCAN Algorithm – IV 213

Figure 7.10 Sample output of Clusters formed by DBSCAN Algorithm – V 214

Figure 7.11 Sample output of Clusters formed by DBSCAN Algorithm – VI 214

Figure 7.12 Sample output of Clusters formed by DBSCAN Algorithm – VII 215

Figure 7.13 Sample output of Clusters formed by DBSCAN Algorithm – VIII 215

Figure 7.14 Sample output of Clusters formed by DBSCAN Algorithm – IX 216

IX

Figure 7.15 Sample output of Clusters formed by DBSCAN Algorithm – X 216

Figure 7.16 Sample output of Clusters formed by DBSCAN Algorithm – XI 217

Figure 7.17 Sample output of Clusters formed by DBSCAN Algorithm – XII 217

Figure 7.18 Sample output of Data Mining Application for Genuine 223

Transaction–I


Transaction-II


Transaction-III

Figure 7.21 Sample output of Data Mining Application for Fraudulent 225

Transaction–I


Transaction–II


Transaction–III

Figure 7.24 Sample output of Data Mining Application for Suspicious 227

Transaction - I


Transaction - II


Transaction - III

Figure 7.27 Sample output of Data Mining Application for Multiple Order 229

Product Support - I


Product Support - II


Product Support – III

Figure 7.30 Sample output of Data Mining Application for different transaction 232

amounts – I


amounts – II

X


amounts – III


amounts – IV

Figure 7.34 Sample output of Data Mining Application for different sellers – I 235

Figure 7.35 Sample output of Data Mining Application for different sellers – II 235

Figure 7.36 Sample output of Data Mining Application for different sellers – III 236

Figure 7.37 Sample output of Data Mining Application for different sellers – IV 236

Figure 7.38 Sample output of Data Mining Application for different locations – I 238

Figure 7.39 Sample output of Data Mining Application for different locations – II 238

Figure 7.40 Sample output of Data Mining Application for different locations – III 239

Figure 7.41 Sample output of Data Mining Application for different locations – IV 239

Figure 7.42 Sample output of Data Mining Application for Cluster Coverage 240

Figure 7.43 Sample output of Data Mining Application for maximum 241

purchasing habit input - I


purchasing habit input - II


purchasing habit input – III

Figure 7.46 Sample output of Data Mining Application for Bayesian 243

Learning - I

Figure 7.47 Sample output of Data Mining Application for Bayesian 244

Learning - II

XI

CHAPTERS CONTENTS

Chapter 1 Introduction 1

1.1 Motivation 2

1.2 Objective of the research 2

1.3 Related Work 4

1.3.1 In Fraud Detection 4

1.3.2 In Financial Cyber crime Prevention 11

1.4 Research Issues 13

1.5 Outline of the Research 15

1.6 References 16

Chapter 2 A Comparative Study of Data Mining Techniques 21

2.1 Data Mining: A Definition 21

2.2 The Foundations of Data Mining 22

2.3 The Development of Data Mining 23

2.4 Data Mining Process 26

2.5 A Statistical Perspective on Data Mining 26

2.5.1 Point Estimation 26

2.5.2 Measures of Performance 30

2.5.3 Models Based on Summarization 33

2.6 Decision Trees 36

2.6.1 Strengths 41

2.6.2 Weaknesses 41

2.7 Neural Networks 41

2.7.1 Why use Neural Networks? 42

2.7.2 Network Layers 42

2.7.3 Neural Network Input and Output Format 44

2.7.4 The Sigmoid Function 44

2.7.5 Applications of Neural Networks 45

2.7.6 Strengths 46

XII

2.7.7 Weaknesses 47

2.8 Genetic Algorithms 47

2.8.1 Where Gas can be used? 48

2.8.2 Explanation of terms 48

2.8.3 Applications of GA 50

2.8.4 Strengths of GA 50

2.8.5 Weaknesses of GA 50

2.9 Classification 51

2.9.1 Statistical-Based Algorithms 51

2.9.2 Distance-Based Algorithms 55

2.9.3 Decision Tree-Based Algorithms 58

2.9.4 Neural Network-Based Algorithms 65

2.10 Clustering 71

2.10.1 Hierarchical Algorithms 72

2.10.2 Agglomerative Algorithms 73

2.10.3 Partitional Algorithms 75

2.10.4 Clustering Large Databases 82

2.10.5 Comparison of Clustering Algorithms 87

2.11 Selection Criteria of a Data Mining Technique 87

2.12 References 90

Chapter 3 Financial Cyber crime and Frauds 92

3.1 What is a Cyber Crime? 92

3.2 An Example of Financial Cyber Crime 93

3.3 Financial Cyber Crimes 93

3.3.1 Credit Card Fraud 93

3.3.2 Net Extortion 94

3.3.3 Phising 94

3.3.4 Salami Attack 94

3.3.5 Sale of Narcotics 94

3.4 What is a Fraud? 94

XIII

3.5 Types of Fraud 95


3.5.2 Telecommunications Fraud 97

3.5.3 Computer Intrusion 98

3.6 Financial Crimes 99

3.6.1 Types of Financial Crimes 99

3.7 Ways of Online Banking Fraud 105

3.7.1 Phising 105

3.7.2 Malware 105

3.7.3 Spyware 106

3.8 2008 Internet Crime Report 106

3.8.1 Complain Characteristics 106

3.8.2 Case Studies of APACS 109

3.9 Online Fraud Report, Cybersource 2010 112

3.10 References 113

Chapter 4 Role of Data Mining in Financial Crime Detection 115

4.1 Two Stage Solution for Financial Crime Detection 115

4.2 Types of Financial Crime 116


4.2.2 Card-Not-Present Fraud 117

4.2.3 Loan Default 117

4.2.4 Bank Fraud 118

4.2.5 Insurance Crimes 119

4.3 Conclusion 121

4.4 References 121

Chapter 5 Data Warehouse Implementation 123

5.1 Data Warehouse Architecture 123

5.1.1 Star Schema Architecture 123

5.1.2 Snowflake Schema Architecture 124

XIV

5.1.3 Fact Constellation Architecture 125

5.2 Fact Table 125

5.3 Dimensional Tables 127

5.4 Lookup Tables 138

5.5 Data Collection 143

5.6 Sample Data 145

5.7 Credit Card Number Generation 151

5.7.1 The Luhn Algorithm 152

5.7.2 An example of Luhn Validation Technique 153

5.8 References 154

Chapter 6 Development of Transaction Pattern Generation Tool (TPGT) 156

6.1 Main Patterns (Parameters) generated by TPGT 156

6.1.1 Subparameters of DP 157

6.1.2 Subparameters of CP 158

6.1.3 Subparameters of PP 159

6.1.4 Subparameters of TP 160

6.1.5 Subparameters of WP 161

6.1.6 Subparameters of VP 162

6.1.7 Subparameters of AP 162

6.1.8 Subparameters of FP 163

6.1.9 Subparameters of MP 164

6.1.10 Subparameters of SP 164

6.1.11 Subparameters of HP 165

6.1.12 Subparameters of LP 166

6.1.13 Subparameters of GP 167

6.2 Descriptions of the Patterns (Parameters) 167

6.2.1 Daily Parameters (DP) 167

6.2.2 Category Parameters (CP) 168

6.2.3 Product Parameters (PP) 169

6.2.4 Transaction Parameters (TP) 169

XV

6.2.5 Weekly Parameters (WP) 171

6.2.6 Seller or Vendor Parameters (VP) 171

6.2.7 Address Parameters (AP) 172

6.2.8 Fortnightly Parameters (FP) 172

6.2.9 Monthly Parameters (MP) 173

6.2.10 Sunday Parameters (SP) 173

6.2.11 Holiday Parameters (HP) 174

6.2.12 Location Parameters (LP) 175

6.2.13 Transaction Gap Parameters (GP) 176

6.3 Computations of the Patterns 177

6.3.1 TP1 to TP8 177

6.3.2 TP11 and TP12 179

6.3.3 GP1 to GP7 180

6.3.4 AP1 and AP2 181

6.4 References 182

Chapter 7 Development of Transaction Risk Score Generation Model

(TRSGM) 183

7.1 Significance of the parameters in TRSGM 183

7.2 TRSGM Components 189

7.2.1 DBSCAN Algorithm 190

7.2.2 Linear Equation 191

7.2.3 Rules 193

7.2.4 Historical Transaction Database (HTD) 194

7.2.5 Bayesian Learner 195

7.3 Algorithm 196

7.3.1 Description of data structure used in the algorithm 203

7.3.2 Description of the Algorithm 205

7.4 Graphs of Cluster Formation by DBSCAN Algorithm 210

7.5 Implementation Environment 218

7.5.1 Lookup Tables Auto Updation 218

XVI

7.5.2 Inter Transaction Gap Recording 219

7.5.3 Maximum Value Finding 222

7.6 Sample Results 223

7.6.1 Genuine Transaction 223

7.6.2 Fraudulent Transaction 225

7.6.3 Suspicious Transaction 227

7.6.4 Multiple Product Order Support 229

7.7 Result Analysis & Discussions 231

7.8 References 244

Chapter 8 Proposed Financial Cyber Crime Prevention Model

& Conclusion 245

8.1 Proposed Financial Cyber Crime Prevention Model 246

8.2 Features of Developed Data Mining Application Software 247

8.3 Significance of the Research 250

8.4 Limitation of the Study 250

8.5 Future Scope of the Research 251

8.6 References 251

1

CHAPTER 1

INTRODUCTION

1.1 MOTIVATION

1.2 OBJECTIVE OF THE RESEARCH

1.3 RELATED WORK

1.4 RESEARCH ISSUES

1.5 OUTLINE OF THE RESEARCH

1.6 REFERENCES

The Internet in India is growing rapidly. It has given rise to new opportunities in every

field we can think of - be it entertainment, business, sports or education. There are two

sides to a coin. Internet also has its own disadvantages. One of the major disadvantages is

Cyber crime- illegal activity committed on the internet. The internet, along with its

disadvantages, has also exposed us to security risks that come with connecting to a large

network. Computers today are being misused for illegal activities like e-mail espionage,

credit card fraud, spasm, software piracy and so on, which invade our privacy and offend

our senses. Criminal activities in the cyberspace are on the rise.

So in today’s electronic society, e-commerce has become an essential sales channel for

global business. Due to rapid advancement of e-commerce, use of credit cards for

purchases has dramatically increased. Unfortunately, fraudulent or illegal use of credit

card has also become an attractive source of revenue for fraudsters. Occurrences of credit

card fraud are increasing dramatically due to exposure of security weaknesses in

traditional credit card processing systems resulting in loss of billions of money every

year. Fraudsters now become very dynamic and use sophisticated techniques to perpetrate

credit card fraud. The fraudulent activities worldwide present unique challenges to banks

and other financial institutions who issue credit cards.

Chapter 1: Introduction

2

According to 2008 Internet Crime Report [41] of Internet Crime Complaint Center, From

January 1, 2008 – December 31, 2008, the IC3 website received 275,284 complaint

submissions. This is a (33.1%) increase when compared to 2007 when 206,884

complaints were received. These filings were composed of complaints primarily related

to fraudulent and non-fraudulent issues on the Internet. Dollar loss of referred complaints was

at an all time high in 2008, $264.59 million, exceeding last year’s record breaking dollar loss of

$239.09 million. On average, men lost more money than women.

A Gartner survey [40] of more than 160 companies reveals that 12 times more fraud

exists on Internet transactions and those e-tailers are paying credit card discount rates that

are 66 percent higher than traditional retailer fees. Moreover, Web merchants bear the

liability and costs in cases of fraud, while credit card companies generally absorb the

fraud for traditional retailers

1.1 MOTIVATION

The Various cyber crime cases through the credit card comes frequently in the daily news

papers and broad coverage in the television media inspired me to work in this area.

1.2 OBJECTIVE OF THE RESEARCH

The purpose of research is first to discuss the different financial cyber crimes and frauds

which are seen today in the forms of Credit card fraud, Phising etc. Secondly study the

different data mining techniques like Neural Network, Clustering techniques, Decision

trees etc. and eventually how these techniques can be used and applied to detect the

financial cyber crime and frauds.

Fraud Prevention describes measures to stop fraud occurring in the first place. In contrast,

fraud detection involves identifying fraud as quickly as possible once it has been

perpetrated. Fraud detection comes into play once fraud prevention has failed. In practice,

fraud detection must be used continuously, as one will typically be unaware that fraud


3

prevention has failed. We can try to prevent credit card fraud by guarding our cards

assiduously, but if nevertheless the card’s details are stolen, then we need to be able to

detect, as soon as possible, that fraud is being perpetrated.

Currently, data mining is a popular way to combat frauds because of its effectiveness.

The task of data mining is to analyze a massive amount of data and to extract some

usable information that we can interpret for future uses. In doing so, we have to define

the clear goal of data mining, and find out the right structure of possible model or

patterns that fit to the given data set. Once we have the right model for the data, we can

use the model for predicting future events by classifying the data. In terms of data

mining, fraud detection can be understood as the classification of the data. Input data is

analyzed with the appropriate model and determined whether it implies any fraudulent

activities or not. A well-defined classification model is developed by recognizing the

patterns of former fraudulent behaviors. Then the model can be used to predict any

suspicious activities implied by new data set.

The prediction of user behavior in financial systems can be used in many situations.

Predicating client migration, marketing or public relations can save a lot of money and

other resources. One of the most interesting fields of prediction is the fraud of credit line,

especially credit card payments. For the high data traffic of 400,000 transactions per day,

a reduction of 2.5% fraud triggers a saving of lots of money per year.

Certainly, all transactions, which deal with accounts of known misuse, are not authorized.

Nevertheless, there are transactions, which are formally valid, but experienced people can

tell that these transactions are probably misused, caused by stolen cards or fake

merchants. So, the task is to avoid a fraud by a credit card transaction before it is known

as “illegal”.

Data mining methods have made most impact on fraud detection. This is typically

because there are large quantities of information is numerical or can be easily converted

into the numerical in the form of counts and proportions. We should also consider speed


4

of processing is of the essence. This is particularly the case in transaction processing,

especially with telecoms and intrusion data, where vast numbers of records are processed

every day, but also applies in credit card, banking and retail sectors.

A key issue of the proposed work is how effective the tools are in detecting fraud and a

fraudulent problem is that one typically dose not know how many fraudulent cases slip

through the net. In applications such as average time to detection after fraud starts (in

minutes, number of transactions, etc.) should also be reported. Measures of this aspect

interact with measures of final detection rate: in many situations an account, telephone,

etc. will have to used for several fraudulent transactions before it is detected as fraudulent

, so that several false negative classifications will necessarily be made.

1.3 RELATED WORK

1.3.1 In Fraud detection

Credit card fraud detection has drawn a lot of research interest and a number of

techniques, with special emphasis on data mining, have been suggested. Gosh and Reilly

[1] have developed fraud detection system with neural network. Their system is trained

on large sample of labeled credit card account transactions. These transactions contain

example fraud cases due to lost cards, stolen cards, application fraud, counterfeit fraud,

mail-order fraud and non receive issue(NRI) fraud.

E. Aleskerov et al. [2] present CARDWATCH, a database mining system used for credit

card fraud detection. The system is based on a neural learning module and provides an

interface to variety of commercial databases.

Dorronsoro et al. [3] have suggested two particular characteristics regarding fraud

detection- a very limited time span for decisions and a large number of credit card

operations to be processed. They have separated fraudulent operations from the normal

ones by using Fisher’s discriminant analysis.


5

Syeda et al. [4] have used parallel granular neural network for improving the speed of

data mining and knowledge discovery in credit card fraud detection. A complete system

has been implemented for this purpose.

Chan et al. [5] have divided a large set of transactions into smaller subsets and then apply

distributed data mining for building models of user behavior. The resultant base models

are then combined to generate a meta-classifier for improving detection accuracy.

Chiu and Tsai [7] consider web services for data exchange among banks. A fraud pattern

mining (FPM) algorithm has been developed for mining fraud association rules which

give information regarding the new fraud patterns to prevent attacks.

Some survey papers have been published which categorize, compare and summarize

articles in the area of fraud detection. Phua et al. [8] did an extensive survey of data

mining based Fraud Detection Systems and presented a comprehensive report. Kou et al.

[9] have reviewed the various fraud detection techniques for credit card fraud,

telecommunication fraud and computer intrusion detection. Bolton and Hand [10]

describe the tools available for statistical fraud detection and areas in which fraud

detection technologies are most commonly used. D.W.Abbott et al. [21] compare five of

the most highly acclaimed commercial data mining tools on a fraud detection application,

with descriptions of their distinctive strengths and weaknesses, based on the lessons

learned by the authors during the process of evaluating the products. D.Yue [32] conduct

an extensive review on literatures to get the answers of the questions like (1) Can FSF be

detected? How likely and how to do it? (2) What data features can be used to predict

FSF? (3) What kinds of algorithm can be used to detect FSF? (4) How to measure the

performance of the detection? And (5) How effective of these algorithms in terms of

fraud detection?

V.Hanagandi et al. [11] generate a fraud a score using the historical information on credit

card account transactions. They describe a fraud-non fraud classification methodology


6

using radial basis function network (RBFN) with a density based clustering approach.

The input data is transformed into cardinal component space and clustering as well as

RBFN modeling is done using a few cardinal components.

A.Shen et al. [12] investigates the efficacy of applying classification models to credit

card fraud detection problems. They tested three classification methods i.e. neural

network, decision tree and logistic regression for their applicability in fraud detections.

H.shao et al. [13] introduced an application in data mining to detect fraud behavior in

customs declarations data and used data mining technology such as an easy-to-expand

multi-dimension-criterion data model and a hybrid fraud-detection strategy.

K.B.Bignell [14] outlines a framework for internet banking security using multi-layered,

feed-forward artificial neural networks.

A. Srivastava et al. [15] model the sequence of operations in credit card transaction

processing using Hidden Markov Model (HMM) and show how it can be used for

detection of frauds. An HMM is initially trained with normal behavior of card holder. If

an incoming credit card transaction is not accepted by trained HMM with sufficiently

high probability, it is considered to be fraudulent. At the same time they also try to ensure

that genuine transactions are not rejected.

B.Zhang et al. [16] consider network level features, such as users’ belief of other users to

deal with fraud in group behavior. They use loopy belief propagation algorithm and apply

it to network level fraud detection, classifying fraudsters, accomplices, honest users.

J.E.Carbal et al. [17] propose a methodology based on rough sets and KDD for fraud

detection made by electrical energy consumers. This methodology does a detailed

evaluation of the boundary region between fraudulent and normal customers, identifying

patterns of fraudulent behavior at historical data sets of electricity companies. They


7

derive classification rules using these patterns; it will permit the detection on the database

of electricity companies of those clients that present fraudulent feature.

J.Quah et al. [18] focuses on real time fraud detection and presents a new and innovative

approach in understanding spending patterns to decipher potential fraud cases. They

make use of self organizing map to decipher, filter and analyze customer behavior for

detection of fraud.

E.L.Barse et al. [19] generate synthetic test data for fraud detection in an IP based video-

on-demand service by ensuring that important statistical properties of the authentic data

are preserved.

J.Xu et al. [20] present an anomaly detection technique based on behavior mining and

monitoring that work at both the individual and system level. They utilize frequent

pattern tree to profile the normal behavior adaptively. They design a novel tree-based

pattern matching algorithm to discover individual level anomalies.

Recently fraud detection system is developed by Suvasini Panigrahi et al. [22], which

consist of four components, namely, rule-based filter, Dempster-Shafer adder, transaction

history database and Bayesian learner. In the rule based component, they determine the

suspicion level of each incoming transaction based on the extent of its deviation from

good pattern. Dempster-Shafer theory is used to combine multiple such evidences and an

initial belief is computed.

Yi Peng et al. [23] apply two clustering techniques, SAS EM and CLUTO, to a large real-

life health insurance dataset and compare the performances of these two methods.

J.Tuo et al. [24] propose a case-based genetic artificial immune system for fraud

detection (AISFD). Their system is a self-adapted system designed for credit card fraud

detection. With the case-based learning model and genetic algorithm, their system can


8

perform online learning with limited time and cost, and update the capability of fraud

detection in the rapid growth of transactions and commerce activities.

J.Kim [25] proposes a novel artificial immune system, called CIFD (Computer Immune

System for Fraud Detection), and adopts both negative selection and positive selection to

generate artificial immune cells. CIFD also employs an analogy of the self-major

histocompatability complex (MHC) molecules when antigen data is presented to the

system. Their novel mechanism improves the scalability of CIFD, which is designed to

process gigabytes or more of transaction data per day.

S.J.Stoflo et al. [26] developed the JAM distributed data mining system for the real world

problem of fraud detection in financial information systems. They have shown that cost-

based metrics are more relevant in certain domains, and defining such metrics poses

significant and interesting research questions both in evaluating systems and alternative

models, and in formalizing the problems to which one may wish to apply data mining

technologies. They also demonstrate how the techniques developed for fraud detection

can be generalized and applied to the important area of intrusion detection in networked

information systems.

F.Yu et al. [27] focus on how to build data mining algorithm centered application system

for common users. They present a case study about building a fraudulent tax declaration

detection system using decision tree classification algorithm.

A.Leung et al. [28] sheds some light on the designing issues on this add-on fraud

detection module, namely Fraud Detection Manager. Their design is based on the concept

of atomic transactions called Coupons that they implemented in e-wallet accounts.

W.Chai et al. [29] propose a method to convert fraud classification rules learned from a

genetic algorithm to a fuzzy score representing the degree to which a company’s financial

statements match those rules.


9

B.Garner and F.Chen [30] propose a paradigm, which involves an anomaly detection

model, case based hypothesis generation, and hypothesis synthesis, is deemed to provide

a basic platform for management intelligence systems and fraud detection in electronic

data processing environment.

V.Aggelis [31] demonstrates one successful fraud detection model. His scope is to

present its contribution in fast and reliable detection of any “strange” transaction

including fraudulent ones.

S.Rozsnyai et al. [33] introduce solution architecture for detection and preventing fraud

in real time by using an event-based system called SARI (Sense and Respond

Infrastructure). They present architecture and components for a real time fraud

management solution which can be easily adapted to the business needs of domain

experts and business users. Their SARI system provides functions to monitor customer

behavior as well as it can steer and optimize customer processes in real time. They show

fraud scenarios of an online gambling service provider.

T.M.Padmaja et al. [34] propose a new approach called extreme outlier elimination and

hybrid sampling technique. They use k reverse nearest neighbors (kRNNs) concept as a

data cleaning method for eliminating extreme outliers in minority regions. They conduct

the experiments with classifiers namely C4.5, Naïve Bayes, k-NN and Radial Basis

function networks and compared the performance of their approach against simple hybrid

sampling technique. They showed using obtained results that extreme outlier elimination

from minority class, produce high predictions for both fraud and non-fraud classes.

Z.Ferdousi et al. [35] use Peer Group Analysis (PGA), an unsupervised technique, to find

outliers in time series financial data. They apply the tool to the stock market data, which

has been collected from Bangladesh Stock Exchange to asses its performance in stock

fraud detection. They observe that PGA can detect those brokers who suddenly start

selling the stock in a different way to other brokers to whom they were previously

similar. They also apply t-statistics to find the deviations effectively.


10

M.Sternberg et al. [36] utilize a cultural algorithm (CA) to respond o dynamic changes in

the application of rule-based expert system. The CA provides self-adaptive capabilities

which can generate the information necessary for the expert system to respond

dynamically.

O.Dandash et al. [37] presents a security analysis of the proposed internet banking model

compared with that of the current existing models used in fraudulent internet payments

detection and prevention. Their proposed model facilitates internet banking fraud

detection and prevention (FDP) by applying two new secure mechanisms, Dynamic Key

Generation (DKG) and Group Key (GK).

S.Viaene et al. [38] apply the weight of evidence reformulation of AdaBoosted naive

Bayes scoring to the problem of diagnosing insurance claim fraud. Their method

effectively combines the advantages of boosting and the explanatory power of the weight

of evidence scoring framework.

E.Lundin et al. [39] developed a method for generating synthetic data that is derived from

authentic data. They also narrate that in many cases synthetic data is more suitable than

authentic data for the testing and training of fraud detection systems.

It is well known that every cardholder has a certain purchasing habits, which establishes

an activity profile for him. Almost all the existing fraud detection techniques try to

capture these behavioral patterns as rules and check for any violation in subsequent

transactions. However, these rules are largely static in nature. As a result, they become

ineffective when the cardholder develops new patterns of behavior that are not yet known

to the FDS. The goal of a reliable detection system is to learn the behavior of users

dynamically so as to minimize its own loss. Thus, systems that can not evolve or “learn”,

may soon become outdated resulting in large number of false alarms. A fraudster can also

attempt new types of attacks which should still get detected by the FDS. For example, a

fraudster may aim at deriving maximum benefit either by making a few high value


11

purchases or a large number of low value purchases in order to evade detection. Thus,

there is a need for developing fraud detection systems which can integrate multiple

evidences including patterns of genuine cardholders as well as that of fraudsters.

We propose a credit card fraud detection system that combines different types of

evidences effectively.

1.3.2 In Financial Cyber Crime Prevention

The first attempt at making online credit card transactions secure was to take the

transaction off-line. Many sites will allow us to call in our credit card number to a

customer support person. This solves the problem of passing the credit card number over

the Internet, but eliminates the merchant's ability to automate the purchasing process.

The next method that was developed, which is currently used by many sites, is hosting

the WWW site on a secure server. A secure server is one that uses a protocol such as SSL

or S-HTTP to transmit data between the browser and the server. These protocols encrypt

the data being transmitted, so when we submit our credit card number through WWW

form it travels to the server encrypted. This section describes the three most famous

system of secure credit card transactions First virtual, CyberCash and SET (Secure

Electronic Transactions)

1.3.2.1 First Virtual The first virtual was the first successfully used model that made internet transactions

secure. Instead of using credit card numbers, transactions are done using a First

VirtualPIN which references the buyer's First Virtual account. These PIN numbers can be

sent over the Internet because even if they are intercepted, they cannot be used to charge

purchases to the buyer's account. A person's account is never charged without email

verification from them accepting the charge.


12

Their payment system is based on existing Internet protocols, with the backbone of the

system designed around Internet email and the MIME (Multipurpose Internet Mail

Extensions) standard. First Virtual uses email to communicate with a buyer to confirm

charges against their account. Sellers use either email, Telnet or automated programs that

make use of First Virtual's Simple MIME Exchange Protocol (SMXP) to verify accounts

and initiate payment transactions. To use this scheme of transaction customer and

merchant, both should have an account on first virtual’s server. The First virtual’s model

was one of the most successfully used models but it is out of use now.

1.3.2.2 CyberCash

CyberCash makes safe passage over the Internet for credit card transaction data. They

take the data that is sent to them from the merchant, and pass it to the merchant's

acquiring bank for processing. Except for dealing with the merchant through CyberCash's

server, the acquiring bank processes the credit card transaction as they would process

transactions received through a point of sale (POS) terminal in a retail store.

The CyberCash payment system is centered on the CyberCash Wallet software program,

which buyers use when making a purchase. This program handles passing payment

information, encrypted, between the buyer and the merchant.

1.3.2.3 SET MasterCard and Visa have developed SET as a license-free protocol for credit card

transactions over the Internet. SET is based on two earlier protocols STT (Secure

transaction technology) and SEPP (Secure Electronic Payment Protocol). Secure

Electronic Transaction (SET) is a system for ensuring the security of financial

transactions on the Internet. It was supported initially by MasterCard, Visa, Microsoft,

Netscape, and others. With SET, a user is given an electronic wallet (digital certificate)

and a transaction is conducted and verified using a combination of digital certificates and


13

digital signatures among the purchaser, a merchant, and the purchaser's bank in a way

that ensures privacy and confidentiality.

SET makes use of Secure Socket Layer (SSL), and Secure Hypertext Transfer Protocol

(SHTTP). SET uses some but not all aspects of a public key infrastructure (PKI). Many

other systems are also functional like PayPal, DigiCash etc.

These systems are highly secure but are rarely used by customers and merchants. These

models secure your transaction over internet but cannot stop any forgery if credit card

information is lost physically or when customer gives his information in wrong hands.

1.3.2.4 Internet Virtual Credit Card Model

Anshul Jain et al. [43] have given this model. According to this model, a login id and a

password are issued by bank along with credit card. Once the customer logs in, he is

asked for his credit card details in order to make sure that the person logging in has the

possession of the card thus avoiding leakage of id and password. If the user is

authenticated then an internet virtual credit card number is issued. User has to select the

expiry date between present date and date of actual expiry date. Customers, who transact

very often, could activate the internet virtual credit card only for a few days, in order to

avoid forgery.

1.4 RESEARCH ISSUES

Financial fraud detection is quite confidential and is not much disclosed in public. The

major issue in this domain is that any financial institution or bank does not share its live

data with researchers as they have strict policy and they can not disclose it. Also there is

not benchmark of data set available in this area. So there are very few researchers (just

one or two) who have worked with real life credit card data and showed their results.


14

Most of the researchers have generated synthetic data based on the statistical techniques.

It may be noted that Aleskerov et al. [2] tested the performance of their CARDWATCH

system on sets of synthetic data based on Gaussian distribution. Chan et al. [5] have used

skewed distribution to generate a training set of labeled transactions. They have done

experiments to determine the most effective training distribution. Li and Zhang [42] have

modeled a customer’s payment by a Poisson process which can only capture the time gap

between two transactions. Panigrahi et al. [22] have generated synthetic data using a

Markov modulated poisson process (MMPP) and two Gaussian distribution functions.

According to E.L.Barse et al. [19], using synthetic data for evaluation, training and

testing offers several advantages over using authentic data. Properties of synthetic data

can be tailored to meet various conditions not available in authentic data sets. They

discussed motivation for using synthetic data for several reasons as authentic data can not

be used in some cases for a number of reasons. The target service may still be under

development and thus produce irregular or only small amounts of authentic data.

Synthetic data can be designed to demonstrate certain key properties or to include attacks

not available in the authentic data, giving a high degree of freedom during testing and

training. Synthetic data can cover extensive periods of time or represent large number of

users, a necessary property to train some of the more “intelligent” detection schemes.

There are two types of data mining techniques, Unsupervised and Supervised Methods.

Unsupervised methods do not need the prior knowledge of fraudulent and non-fraudulent

transactions in historical database, but instead detect changes in behavior or unusual

transactions. These methods model a baseline distribution that represents normal

behavior and then detect observations that show greatest departure from this norm.

Outliers are a basic form of non-standard observation that can be used for fraud detection.

In supervised methods, models are trained to discriminate between fraudulent and non-

fraudulent behavior so that new observations can be assigned to classes. Supervised

methods require accurate identification of fraudulent transactions in historical databases

and can only be used to detect frauds of a type that have previously occurred. An

advantage of using unsupervised methods over supervised methods is that previously


15

undiscovered types of fraud may be detected. Supervised methods are only trained to

discriminate between legitimate transactions and previously known fraud.

All the techniques or models of fraud detection just indicate the likelihood of fraud

occurrence. No one method surely confirms any transaction as fraudulent transaction.

1.5 OUTLINE OF THE RESEARCH

When the user performs transaction on the internet, then transaction related data are

generated. These data are stored in the dimensional modeling data warehouse. The

Transaction Pattern Generation Tool (TPG) generates the different patterns (parameters)

like maximum amount of transaction, time passed since the last transaction, time passed

since the same category purchased etc. based on the historical data stored in the data

warehouse. These all the parameters collectively represent the normal purchasing

behavior of the customer.

Whenever the deviation occurs than the normal behavior, then the model should raise the

alarm. The Transaction Risk Generation Model (TRSGM) works on this principle. The

Model predicts for each transaction how far or close it is from the previous set of all the

normal transactions. The risk score between 0 and 1 is generated by the model. The score

generated below 0.5 for the transaction, is considered as genuine transaction and above or

equal to 0.8 then the transaction is considered to be fraudulent and it is verified by

confirming the customer. If the risk score generated is between 0.5 and 0.8, then the

transaction is considered as suspicious transaction and an additional layer of Bayesian

learner is added by the model. Once the transaction is found suspicious, then it waits for

the next transaction on the same card. When the next transaction occurs on the same card,

again risk score is generated. If the risk score is less than 0.5 then transaction is declared

as genuine transaction, greater than 0.8 then transaction is declared as fraudulent

transaction or again found suspicious then the Bayesian learner calculates the posterior

probability whether the transaction comes from normal customer or fraudster from the

genuine transaction set and fraudulent transaction set. If probability of normal transaction


16

likelihood is more than fraudulent transaction likelihood then the transaction is

considered as genuine transaction otherwise fraudulent transaction.

The model is implemented by using the data mining techniques 1) Rules 2) DBSCAN

algorithm and 3) Bayesian Learner in oracle 9i.

Chapter 2 gives the overview of data mining and compares various data mining

techniques based on easiness of understanding and implementation of technique, input

and output issue, applications, strengths, weaknesses etc. Then it discuss the criteria that

is helpful for selecting a data mining technique such as whether learning is supervised or

unsupervised, the nature of input and output data, presence of noisy data, time (speed)

issue (algorithms for building decision tree and production rules typically execute much

faster than NN or GA), classification accuracy.

Various types of financial cyber crimes and frauds committed worldwide are discussed in

the chapter 3. Chapter 4 discusses about how various data mining techniques and rules

become helpful in financial crime detection. Chapter 5 describes the design and

implementation of the Data warehouse and also the various tables maintained by financial

cyber crime detection system (FCDS). Development of Transaction Pattern Generation

Tool (TPGT) is discussed in the Chapter 6, which generates various parameters for the

customer who is performing online transaction. Development of Transaction Risk Score

Generation Model (TRSGM) is discussed in the Chapter 7, which assigns a risk or fraud

score (0-1) for each transaction. Features of developed data mining application software,

significance of the research, limitation of study and future scope of the research as

conclusion is discussed in Chapter 8.

1.6 REFERENCES

[1] S.Ghosh, D.L.Reilly, “Credit card fraud detection with a neural–network”, in:

Proceedings of the Twenty-seventh Hawaii International Conference on system Sciences,

1994, pp. 621-630,


17

[2] E. Aleskerov, B. Freisleben, B.Rao, “CARDWATCH: a neural network based

database mining system for credit card fraud detection”, in: Proceedings of the

Computational Intelligence for Financial Enginnering, 1997, pp.220-226

[3] J. R.Dorronsoro, F. Ginel, C.Sanchez and C.S. Cruz, “Neural fraud detection in credit

card operations”, IEEE Transactions on Neural Network , Vol. 8, No. 4, July 1997, pp.

827-834

[4] M. Syeda, Y.Q.Zhang, Y. Pan, “Parallel granular neural networks for fast credit card

fraud detection”, in: Proceedings of the IEEE International Conference on Fuzzy

Systems, 2002, pp. 572-577

[5] P.K. Chan, W. Fan, A.L. Prodromidis, S.J. Stolfo, “Distributed data mining in credit

card fraud detection”, in: Proceedings of the IEEE Intelligent Systems, 1999, pp. 67-74

[6] Tao Guo, Gui-Yang Ali, “Neural data mining for credit card fraud detection”, in:

Proceedings of the Seventh International Conference on Machine learning and

cybernetics, Kunming, 12-15 July 2008, pp.3630-3634

[7] C. Chiu, C. Tsai, “A web service-based collaborative scheme for credit card fraud

detection”, in: Proceedings of the IEEE International Conference on e-Technology, e-

Commerce and e-Service, 2004, pp. 177-181.

[8] C. Phua, V.Lee, K.Smith, R.Gayler, “A comprehensive survey of data mining-based

fraud detection research”, March 2007

http://www.clifton.phua.googlepages.com/fraud-detection-survey.pdf

[9] Y.Kou, C.T.Lu, S.Sirwongwattana, Y.Huang, “Survey of fraud detection techniques”,

in: Proceedings of the IEEE International Conference on Networking, Sensing and

Control, vol. 1, 2004, pp.749-754.

[10] R.J.Bolton and D.J.Hand, “Statistical fraud detection”: a review, Journal of

Statistical Science(2002), pp.235-255.

[11] V.Hanagandi, A. Dhar, K.Buescher, “Density based clustering and radial basis

function modeling o generate credit card fraud score”, in: Proceedings of the IEEE

International Conference, February 1996, pp.247-251

[12] A.Shen, R.Tong, Y.Deng, “Application of classification models on credit card fraud

detection”, in: Proceedings of the IEEE Service Systems and Service Management,

International Conference, 9-11 June 2007, pp:1-4

http://www.clifton.phua.googlepages.com/fraud-detection-survey.pdf


18

[13] H.Shao, H. Zhao, G.Chang, “Applying Data mining to detect fraud behavior in

customs declaration”, in: Proceedings of the First International Conference on Machine

Learning and Cybernetics, Beijing, November 2002, pp.1241-1244

[14] K.B.Bignell, “Authentication in an internet banking environment strategy; towards

developing a strategy for fraud detection” in: Proceedings of International Conference

ICISP 2006, 26-28 Aug. 2006, pp.23

[15] A.Srivastava, A.Kundu, S.Sural, A.K.Majumdar, “Credit card fraud detection using

hidden markov model”, in: IEEE transactions on dependable and secure computing,Vol.

5, No. 1, January-March 2008.

[16] B.Zhang, Y. Zhou, C. faloutsos, “Toward a comprehensive model in internet auction

fraud detection”, in: Proceedings of the 41st Hawaii International Conference on System

sciences, 2008

[17] J.E.Carbal, J.Pinto, S.C.Linares, M.A.C.Pinto, Methodology for fraud detection

using rough sets, http://www.ieeexplore.ieee.org/iel5/10898/34297/01635791.pdf

[18] J.Quah, M.Sriganesh, “Real time credit card fraud detection using computational

intelligence”, in: Proceedings of the International Joint Conference on Neural Networks,

Florida, U.S.A, August 2007

[19] E.L.Barse, H.Kvanstrom, E.Jonsson, “Synthesizing test data for fraud detection

system”, in: Proceedings of the 18th Annual Computer Security Applications

Conference,2003

[20] J.Xu, A.H.Sung, Q.Liu,”Tree based behavior monitoring for adaptive fraud

detection”, in: Proceedings of the 18th International Conference on pattern recognition,

2006

[21] D.W.Abbott, I.P.Matkovsky, J.F. Elder IV, “An Evaluation of High-end Data

Mining Tools for Fraud Detection”, 1998, IEEE Xplore

[22] Suvasini Panigrahi, Amlan Kundu, Shamik Sural, A.K.Majumdar, “Credit card

fraud detection: A fusion approach using Dempster-Shafer theory and Bayesian

learning”, www.scincedirect.com

[23] Yi Peng, Gang Kou, A. Sabatka, Z. Chen, D.Khazanchi, Y.Shi, “Application of

Clustering Methods to Health Insurance Fraud Detection”, www.ieeexplore.ieee.org


19

[24] J.Tuo, S.Ren, W.Liu, X.Li, B.Li, L.Lei, “Artificial Immune System for Fraud

Detection”, in: Proceedings of the International Conference on systems, Man and

Cybernetics

[25] J.Kim, A.Ong, R.E.Overill, “Design of an Artificial Immune System as a Novel

Anomaly Detector for Combating Financial Fraud in the Retail Sector”, in: Evolutionary

Computation, 2003. CEC ’03, pp 405-412 Vol.1

[26] S.J.Stoflo, W.Lee, A.Prodromidis, P.K.Chan, “Cost-based Modeling for Fraud and

Intrusion Detection: Results from the JAM Project”, http://

www..citeseer.ist.psu.edu/244959.html

[27] F.Yu, Z.Qin, X.Jia, “Data Mining Issues in Fraudulent Tax Declaration Detection”,

in: Proceedings of the Second International Conference on Machine Learning and

Cybernetics, Xian, November 2003, pp.2202-2206

[28] A.Leung, Z.Yan, S.Fong, “On Designing a Flexible E-Payment System with Fraud

Detection Capability”, in: Proceedings of the IEEE International Conference on E-

Commerce Technology, 2004

[29] W.Chai, B.K.Hoogs, B.T.Verschueren, “Fuzzy Ranking of Financial Statements for

Fraud Detection”, in: Proceedings of the IEEE International Conference on Fuzzy

Systems, Canada, 2006, pp.152-158

[30] B.Garner, F.Chen, “Hypothesis Generation Paradigm for Fraud Detection”,

http://www.ieeexplore.ieee.org/iel2/2978/8447/00369309.pdf

[31] V.Aggelis, “Offline Internet Banking Fraud Detection”, in: Proceedings of the First

International Conference on Availability, Reliability and Security, 2006

[32] D.Yue, X.Wu, Y.Wang, Y.Li,C-H Chu, “A Review of Data Mining-based Financial

Fraud Detection Research”,

http://www.ieeexplore.ieee.org/iel5/4339774/4339775/04341127.pdf

[33] S.Rozsnyai, J.Schiefer, A.Schatten, “Solution Architecture for Detecting and

Preventing Fraud in Real Time”, in: Proceedings of the 2nd International Conference

ICDIM ’07, Volume 1, pp:152-158

[34] T.M.Padmaja, N.Dhulipalla, R.S.Bapi, P.R.Krishna, “Unbalanced Data

Classification Using extreme outlier Elimination and Sampling Techniques for Fraud


20

Detection”, in: Proceedings of 15th International Conference on Advanced Computing

and Communications, 2007, pp.511-516

[35] Z.Ferdousi, A.Maeda, “Unsupervised Outlier Detection in Time Series Data”, in:

Proceedings of 22nd International Conference on System Engineering Workshops, 2006,

pp.51-56

[36] M.Sternberg, R.G.Reynolds, “Using Cultural Algorithms to Support Re-Engineering

of Rule-Based Expert Systems in Dynamic Performance Environments: A Case Study in

Fraud Detection”, in: IEEE Transactions on Evolutionary Computation, Vol. 1, No.4,

1997

[37] O.Dandash, P.D.Le, B.Srinivasan, “Security Analysis for Internet Banking Models”,

in: Proceedings of Eighth International Conference on Software Engineering, Artificial

Intelligence, Networking and Parallel/Distributed Computing, 2007, pp. 1141-1146

[38] S.Viaene, R.A.Derrig, G.Dedene, “A Case Study of Applying Boosting Naïve Bayes

to Claim Fraud Diagnosis”, in: Proceedings of IEEE Transactions on Knowledge and

Data Engineering, Vol. 16, No.5, May 2004, pp. 612-620.

[39] E.Lundin, H.Kvanstrom, E.Jonsson, “A Synthetic Fraud Data Generation

Methodology”, ICICS 2002, Springer-Verlag Berlin

[40] Online fraud is twelve times higher than offline fraud, 20,June,2007

http://sellitontheweb.com/ezine/news0434.shtml

[41] http://www.ic3.gov/media/annualreport/2008_IC3Report.pdf

[42] Y.Li and X.Zhang, “Securing credit card transactions with one-time payment

scheme”, Journal of Electronic Commerce Research and Applications(2005), pp. 413-

426.

[43] A.Jain, T.Sharma, Internet Virtual Credit Card Model,

http://www.profile.iiita.ac.in/ajain1_b04/ivccm.pdf

http://sellitontheweb.com/ezine/news0434.shtml

http://www.ic3.gov/media/annualreport/2008_IC3Report.pdf

21

CHAPTER 2

A COMPARATIVE STUDY OF DATA MINING TECHNIQUES

2.1 DATA MINING: A DEFINITION

2.2 THE FOUNDATIONS OF DATA MINING

2.3 THE DEVELOPMENT OF DATA MINING

2.4 DATA MINING PROCESS

2.5 A STATISTICAL PERSPECTIVE ON DATA MINING

2.6 DECISION TREES

2.7 NEURAL NETWORKS

2.8 GENETIC ALGORITHMS

2.9 CLASSIFICATION

2.10 CLUSTERING

2.11 SELECTION CRITERIA OF A DATA MINING TECHNIQUE

2.12 REFERENCES

2.1 DATA MINING: A DEFINITION

Data Mining is the process of employing one or more computer learning techniques to

automatically analyze and extract knowledge from data contained within a database. The

purpose of a data mining session is to identify trends and patterns in data.

Data Mining has been defined as “the nontrivial extraction of implicit, previously

unknown, and potentially useful information from data”.

And “the science of extracting useful information from large data sets or databases.”

Hand et al define that data mining is “a well-defined procedure that takes data as input

and produces output in the forms of models or patterns.”

Chapter 2: A Comparative Study of Data Mining Techniques

22

2.2 THE FOUNDATIONS OF DATA MINING

Data mining techniques are the result of a long process of research and product

development. This evolution began when business data was first stored on computers,

continued with improvements in data access, and more recently, generated technologies

that allow users to navigate through their data in real time. Data mining takes this

evolutionary process beyond retrospective data access and navigation to prospective and

proactive information delivery. Data mining is ready for application in the business

community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection

• Powerful multiprocessor computers

• Data mining algorithms

Table 2.1 Steps in the Evolution of Data Mining

Evolutionary Step

Business Question

Enabling Technologies

Product Providers Characteristics

Data Collection

(1960s)

"What was my total revenue in the last five years?"

Computers, tapes, disks

IBM, CDC Retrospective, static data delivery

Data Access

(1980s)

"What were unit sales in New England last March?"

Relational databases (RDBMS), Structured Query Language (SQL), ODBC

Oracle, Sybase, Informix, IBM, Microsoft

Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support

(1990s)

"What were unit sales in New England last March? Drill down to Boston."

On-line analytic processing (OLAP), multidimensional databases, data warehouses

Pilot, Comshare, Arbor, Cognos, Microstrategy

Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)

"What’s likely to happen to Boston unit sales next month? Why?"

Advanced algorithms, multiprocessor computers, massive databases

Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)

Prospective, proactive information delivery


23

Commercial databases are growing at unprecedented rates. A recent META group survey

of data warehouse projects found that 19% of respondents are beyond the 50-gigabyte

level. In some industries, such as retail, these numbers can be much larger. The

accompanying need for improved computational engines can now be met in a cost-

effective manner with parallel multiprocessor computer technology. Data mining

algorithms embody techniques that have existed for at least 10 years, but have only

recently been implemented as mature, reliable, understandable tools that consistently

outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon

the previous one. For example, dynamic data access is critical for drill-through in data

navigation applications, and the ability to store large databases is critical to data mining.

Today, the maturity of these techniques, coupled with high-performance relational

database engines and broad data integration efforts, make these technologies practical for

current data warehouse environments.

2.3 THE DEVELOPMENT OF DATA MINING

The current evolution of data mining functions and products is the result of years of

influence from many disciplines, including databases, information retrieval, statistics,

algorithms and machine learning (Figure 2.1). Another computer science area that has

had a major impact on the KDD process is multimedia and graphics.


24

Figure 2.1 Historical perspective of data mining

Table 2.2 shows developments in the areas of artificial intelligence(AI), information

retrieval(IR), databases(DB) and statistics(Stat) leading to the current view of data

mining. These different historical influences, which have led to the development of the

total data mining area, have given rise to different views of what data mining functions

actually are:

• Induction is used to proceed from very specific knowledge to more general

information. This type of technique is often found in AI applications.

• Because the primary objective of data mining is to describe some characteristics

of a set of data by a general model, this approach can be viewed as a type of

compression. Here the detailed data within the database are abstracted and

compressed to a smaller description of the data characteristics that are found in

the model.

DATA MINING

Information Retrieval

Statistics

Algorithms Machine Learning

Databases


25

Table 2.2 Time Line of Data Mining Development

Time Area Contribution

Late 1700s Stat Bayes theorem of probability

Early 1900s Stat Regression Analysis

Early 1920s Stat Maximum likelihood estimate

Early 1940s AI Neural networks

Early 1950s Nearest neighbor

Single Link

Late 1950s AI Perceptron

Late 1950s Stat Resampling, bias reduction, jackknife estimator

Early 1960s AI ML started

Early 1960s DB Batch reports

Mid 1960s Decision trees

Mid 1960s Stat Linear models for classification

IR Similarity measures

IR Clustering

Stat Exploratory data analysis(EDA)

Late 1960s DB Relational data model

Early 1970s IR SMART IR systems

Mid 1970s AI Genetic Algorithms

Late 1970s Stat Estimation with incomplete data (EM Algorithm)

Late 1970s Stat K-means clustering

Early 1980s AI Kohonen self-organizing map

Mid 1980s AI Decision tree algorithms

Early 1990s DB Association rule algorithms

Web and search engines

1990s DB Data Warehousing

1990s DB Online analytic processing (OLAP)


26

2.4 DATA MINING PROCESS: A practical data mining application is often complex.

It is interactive and iterative, involving a number of key steps:

1.Understanding the application domain, and the application goals.

2. Extracting one or more target data sets from databases.

3. Cleaning data, e.g., removing noise and handling the missing data.

4. Removing the irrelevant attributes and tuples from the data.

5. Choosing the data mining task, i.e., deciding whether the goal of data mining

process is classification, association, clustering, etc, or a combination of them.

6. Choosing the data mining algorithms.

7. Data mining using the selected algorithms to discover hidden patterns in data.

8. Post-processing the discovered patterns, i.e., analyzing the patterns

automatically or semi-automatically to identify those truly interesting/useful

patterns for the user.

2.5 A STATISTICAL PERSPECTIVE ON DATA MINING

There are many statistical concepts that are the basis for data mining techniques. Here is a

brief review of some of these concepts.

2.5.1 Point Estimation

Point estimation is a well-known and computationally tractable tool for learning the

parameters of a data mining model. It can be used for many data mining tasks such as

summarization and time-series prediction. Summarization is the process of extracting or

deriving representative information about the data. Point estimation is used to estimate

mean, variance, standard deviation, or any other statistical parameter for describing the

data. In time-series prediction, point estimation is used to predict one or more values

appearing later in a sequence by calculating parameters for a sample.


27

2.5.1.1 Methods of Point Estimation

Several methods exist for obtaining point estimates, including least squares, the method

of moments, maximum likelihood estimation, Bayes estimators, and robust estimation.

DEFINITION 2.1: Let X1,X2,……,Xn be a random sample, and let θ = {θ 1,…..,θ k} be

the set of population parameters. An estimator is a function that maps a random sample

X1,X2,……,Xn to a set of parameter values φ = {φ 1,…..,φ k}, where φ j is the estimate of

parameter θ j.

2.5.1.1.1 The Method of Moments

The method of moments, introduced by Karl Pearson circa 1894, is one of the oldest

methods of determining estimates.

DEFINITION 2.2: Let X1,X2,……,Xn be a random sample from a population whose

density function depends on a set of unknown parameters θ = {θ 1,…..,θ k}. Assume

that the first k population moments exist as functions Ør(θ ) of the unknown parameters,

where r = 1,2,….,k Let

1

1 n

r ii

Xn

φ=

= ∑ T (2.1)

be the rth sample moment. By equating φ r to Ør, where r=1,2,……,k, k equations in k

unknown parameters can be obtained.

Therefore, if there are k population parameters to be estimated, the method of moments

consists of the following two steps.

(i) Express the first k population moments in terms of the k population parameters θ 1,

θ 2,….., θ k;


28

(ii) Equate the population moments obtained from step (i) to the corresponding sample

moments calculated using above equation and solve θ 1,….., θ k as the estimates of

parameters.

2.5.1.1.2 Maximum Likelihood Estimation

Sir Ronald A. Fisher circa 1920 introduced the method of maximization of likelihood

functions. Given a random sample X1,X2,……,Xn distributed with the density (mass)

function f(x;Θ ), the likelihood function of the random sample is the joint probability

density function, denoted by

L(Θ ; X1,X2,……,Xn) = f(X1,X2,……,Xn;Θ ). (2.2)

In above equation, Θ is the set of unknown population parameters {θ 1,…..,θ k}. If the

random sample consists of random variables that are independent and identically

distributed with a common density function f(x;Θ ), the likelihood function can be

reduced to

L(Θ ; X1,X2,……,Xn) = f(X1; Θ ) * …..* f(Xn; Θ ), (2.3)

which is the product of individual density functions evaluated at each sample point.

maximum likelihood estimate, therefore, is a set of parameter values Θ 1={θ 11,…..,

θ k1}that maximizes the likelihood function of the sample. A well-known approach to

find Θ 1 is to take the derivative of L, set it equal to zero and solve forΘ . Thus, Θ 1 can

be obtained by solving the likelihood equation

( ) 0L∂Θ =

∂Θ (2.4)


29

It is important to note that a solution to the likelihood equation is not necessarily a

maximum; it could also be a minimum or a stationary point. One should ensure that the

solution is a maximum before using it as a maximum likelihood estimate.

2.5.1.1.3 The Expectation-Maximization Algorithm

The Expectation-maximization (EM) algorithm is an approach that solves the estimation

problem with incomplete data. The EM algorithm finds an MLE for a parameter using a

two-step process: estimation and maximization. The basic algorithm is shown as under.

ALGORITHM 2.1

Input:

Ø = { Ø1,……., Øp} //Parameters to be estimated

Xobs ={x1,……..,xk} //Input database values observed

Xmiss={xk+1,…..,xn} //Input database values missing

Output:

Ô //Estimates for Ø

EM algorithm:

i:=0;

Obtain initial parameter MLE estimate, Ôi ;

repeat

Estimate missing data, Ximiss;

i++

Obtain next parameter estimate, Øi to

maximize likelihood;

until estimate coverages;

An initial set of estimates for the parameters is obtained. Given these estimates and the

training data as input, the algorithm 2.1 then calculates a value for the missing data. For

example, it might use the estimated mean to predict a missing value. These data are then


30

used to determine an estimate for the mean that maximizes the likelihood. These steps are

applied iteratively until successive parameter estimates converge. Any approach can be

used to find the initial parameter estimates. In algorithm 2.1 it is assumed that the input

database has actual observed values Xobs ={x1,……..,xk} as well as values that are

missing Xmiss={xk+1,…..,xn}. It is assumed that the entire database is actually X= Xobs

∪Xmiss.

The EM algorithm is useful in computational recognition, image retrieval, computer

vision, and many other fields. In data mining, the EM algorithm can be used when the

data set has missing values due to limitations of the observation process. It is especially

useful when maximizing the likelihood function directly is analytically intractable. In that

case, the likelihood function can be simplified by assuming that the hidden parameters

are known.

2.5.2 Measures of Performance

There are several different ways (estimators) to estimate unknown parameters. In order to

assess the usefulness of estimators, some criteria are necessary to measure the

performance of estimators. These are – bias, mean squared error, standard error,

efficiency and consistency.

2.5.2.1 Bias

The bias of an estimator is the difference between the expected value of the estimator and

the actual value:

Bias = E(Ô) – Ô (2.5)

An unbiased estimator is one whose bias is 0. While point estimators for small data sets

may actually be unbiased, for larger database applications we would expect that most

estimators are biased.


31

2.5.2.2 Mean Squared Error (MSE)

MSE is defined as the expected value of the squared difference between the estimate and

the actual value:

MSE(Ô) = E(Ô – O)2 (2.6)

2.5.2.3 Standard Error

The standard error gives a measure of the precision of the estimators. The standard error

of an estimator θ is defined as the standard deviation of its sampling distribution.

( ) ( )SE Vθ

θ θ σ= = (2.7)

The sample mean can be used as an example to illustrate the concept of standard error.

Let f(x) represents a probability density function with finite variance 2σ and meanμ .

Let X be the sample mean for a random sample of size n drawn from this distribution. By

the Central Limit Theorem, the distribution of X is approximately normally distributed

with mean μ and variance 2nσ . So the standard error is given by

( ) XSE Xn

σσ= = (2.8)

When the standard deviation σ for the underlying population is unknown, then an

estimate S for the parameter can be used as a substitute for it and leads to the estimated

standard error

( ) XSSE Xn

σ == (2.9)


32

2.5.2.4 Efficiency

Another measure used to compare estimators is called efficiency. Suppose there are two

estimators Ò and Õ for a parameter Ø based on the sample X1,…..,Xn. If the MSE of one

estimator is less than the MSE of the other, i.e. MSE(Ò) < MSE(Õ), then the estimator Ò

is said to be more efficient than Õ. The relative efficiency of Ò with respect to Õ is

defined as the ratio

eff(Ò,Õ) = MSE(Õ)/MSE(Ò) (2.10)

If this ratio is greater than one, then Ò is a more efficient estimator of the parameter

Ø. When the estimator is unbiased, the ratio is just the ratio of their variance, and the

most efficient estimator would be the one with minimum variance.

2.5.2.5 Consistency

Unlike the four measures defined previously, consistency is defined for increasing sample

sizes, not a fixed sample sizes. Like the efficiency, consistency is also defined using the

MSE. Let nθ be the estimator of a parameter based on a sample of size n, then an

estimator is said to be consistent if

lim ( ) 0nn

MSE θ→∞

= (2.11)

When MSE is written in terms of bias and variance. Thus equation holds if and only if

both variance and bias nθ tend o zero as n approaches infinite.


33

2.5.2.6 The Jackknife Method

It is a popular estimate technique. With this approach, the estimate of a parameter,θ , is

obtained by omitting one value from the set of observed values. Suppose there is a set of

n values X = {x1,x2,……,xn}. An estimate for the mean would be

1

1 1( )

1

i n

j

j j ii

x

nμ

=

= = +=

+

−

∑ ∑ (2.12)

Here the subscript (i) indicates that this estimate is obtained by omitting the ith value.

Given a set of jackknife estimates, ( ) XSSE Xn

σ == (i), these can in turn be used to

obtain an overall estimate.

( )

1(.)

n

jj

n

θθ ==

∑ (2.13)

2.5.3 Models Based on Summarization

There are many basic concepts that provide an abstraction and summarization of the data

as a whole. The basic well-known statistical concepts such as mean, variance, standard

deviation, median and mode are simple models of the underlying population. Fitting a

population to a specific frequency distribution provides an even better model of the data.

Of course, doing this with large databases that have multiple attributes, have complex

and/or multimedia attributes, and are constantly changing is not practical.

There are also many well-known techniques to display the structure of the data

graphically. For example, a histogram shows the distribution of the data. A box plot is a

more sophisticated technique that illustrates several different features of the population at

once.


34

Another visual technique to display data is called a scatter diagram. This is a graph on a

two-dimensional axis of points representing the relationships between x and y values.

2.5.3.1 Bayes Theorem

With statistical inference, information about a data distribution is inferred by examining

data that follow that distribution. Given a set of data X={x1,……,xn}, a data mining

problem is to uncover properties of the distribution from which the set comes. Bayes rule

is a technique to estimate the likelihood of a property given the set of data as evidence or

input. Suppose that either hypothesis h1 or hypothesis h2 must occur, but not both. Also

suppose that xi is an observable event.

DEFINITION 2.3: Bayes Rule or Bayes Theorem is

1 11

1 1 2 2

( | ) ( )( | )( | ) ( ) ( | ) ( )

ii

i i

P x h P hP h xP x h P h P x h P h

=+

(2.14)

Here P (h1 | xi) is called the posterior probability, while P (h1) is the prior probability

associated with hypothesis h1. P (xi ) is the probability of the occurrence of data value xi

and P (xi | h1) is the conditional probability that, given a hypothesis, the tuple satisfies it.

Where there are m different hypothesis we have:

P(xi) = 1

( | ) ( )m

i j jj

P x h P h=∑ (2.15)

Thus we have

1 11

( | ) ( )( | )( )

ii

i

P x h P hP h xP x

= (2.16)


35

Bayes rule allow us to assign probabilities of hypothesis given a data value, P (hj | xi).

Here we discuss tuples when in actuality each xi may be an attribute value or other data

label. Each hi may be an attribute value, set of attribute values, or even a combination of

attribute values.

2.5.3.2 Hypothesis Testing

Hypothesis testing attempts to find a model that explains the observed data by first

creating a hypothesis and then testing the hypothesis against the data. The hypothesis

usually is verified by examining a data sample. If the hypothesis holds for the sample, it

is assumed to hold for the population in general. Given a population, the initial

hypothesis to be tested, H0, is called the null hypothesis. Rejection of the null hypothesis

causes another hypothesis, H1, called the alternative hypothesis, to be made.

One technique to perform hypothesis testing is based on the use of the chi-squared

statistic. Actually, there is a set of procedures referred to as chi squared. These

procedures can be used to test the association between two observed variable values and

to determine if a set of observed variable values is statistically significant. A hypothesis

is first made, and then the observed values are compared based on this hypothesis.

Assuming that O represents the observed data and E is the expected values on the

hypothesis, the chi-squared statistic, X2 , is defined as:

2

2 ( )O EXE−

=∑ (2.17)

When comparing a set of observed variable values to determine statistical significance,

the values are compared to those of the expected case. This may be the uniform

distribution.


36

2.5.3.3 Regression and Correlation

Both bivariate regression and correlation can be used to evaluate the strength of a

relationship between two variables. Regression is generally used to predict future values

based on past values by fitting a set of points to a curve. Correlation, however, is used to

examine the degree to which the values for two variables behave similarly.

Linear regression assumes that a linear relationship exists between the input data and the

output data. The common formula for a linear relationship is used in this model:

Y = c0 + c1x1 +…….+cnxn (2.18)

Here there are n input variables, which are called predictors or regressors; one output

variable (the variable being predicted), which is called the response; and n + 1 constants,

which are chosen during the modeling process to match the input examples. This is called

multiple linear regression because there is more than one predictor.

2.6 DECISION TREES

The decision tree method of decision analysis uses a tree structure to illustrate the

decision process. Probabilities are assigned to events, and the expected value of each

alternative is determined. The alternative with the most attractive total expected value is

chosen. Depending on the decision, the most attractive expected value may be the highest

or lowest number.

It is based on the “Twenty Questions” game that children play, as illustrated by Example

2.1. Figure 2.2 graphically shows the steps in the game. This tree has as the root the first

question asked. Each subsequent level in the tree consists of questions at that stage in the

game. Nodes at the third level show questions asked at the third level in the game. Leaf

nodes represent a successful guess as to the object being predicted. This represents a

correct prediction. Each question successfully divides the search space much as a binary


37

search does. As with a binary search, questions should be posed so that the remaining

space is divided into two equal parts. Often young children tend to ask poor questions by

being too specific, such as initially asking “Is it my Mother? This is a poor approach

because the search space is not divided into two equal parts.

EXAMPLE 2.1

Mudra and Vikas are playing a game of “Twenty Questions”. Vikas has in mind some

object that Mudra tries to guess with no more than 20 questions. Mudra’s first question is

“Is this object alive?” Based on Vikas’s answer, Mudra then asks a second question. Her

second question is based on the answer that Vikas provides to the first question. Suppose

that Vikas says “yes” as his first answer. Mudra’s second question is “Is this a person?.

When Vikas responds “yes”, Mudra asks “Is it a friend?”. When Vikas says “no”, Mudra

then asks “Is it someone in my family?”. When Vikas responds “yes”, Mudra then begins

asking the names of family members and can immediately narrow down the search space

to identify the target individual. This game is illustrated in Figure 2.2.


38

Alive?

No Yes

Ever alive? Person?

No Yes No Yes

… … Mammal? Friend?

No Yes No Yes

… … In Family? …

No Yes

… Mom?

No Yes

… FINISHED

FIGURE 2.2: Decision tree for Example 2.1

DEFINITION 2.4. A decision tree(DT) is a tree where the root and each internal node is

labeled with a question. The arcs emanating from each node represent each possible

answer to the associated question. Each leaf node represents a prediction of a solution to

the problem under consideration.

DEFINITION 2.5. A decision tree(DT) model is a computational model consists of three

parts:

1. A decision tree as defined in Definition 2.4.

2. An algorithm to create the tree.

3. An algorithm that applies the tree to data and solves the problem under

consideration.


39

The building of the tree may be accomplished via an algorithm that examines data from a

training sample or could be created by a domain expert. Most decision tree techniques

differ in how the tree is created. Algorithm 2.2 shows the basic steps in applying a tuple

to the DT, step three in Definition 2.5. We assume here that the problem to be performed

is one of prediction, so the last step is to make the prediction as dictated by the final leaf

node in the tree. The complexity of the algorithm is straightforward to analyze. For each

tuple in the database, we search the tree from the root down to a particular leaf. At each

level, the maximum number of comparisons to make depends on the branching factor at

that level. So the complexity depends on the product of the number of levels and the

maximum branching factor.

ALGORITHM 2.2

Input:

T //Decision Tree

D //Input database

Output:

M //Model prediction

DTProc algorithm:

//Simplistic algorithm to illustrate prediction technique using DT

for each t ∈ D do

n = root node of T;

while n not leaf node do;

Obtain answer to question on n applied to t;

Identify arc from t, which contains correct answer;

n=node at end of this arc;

Make prediction for t based on labeling of n;

We use Example 2.2 to further illustrate the use of decision trees.


40

Gender

=F =M

Height Height

<1.3 m >1.8 m <1.5 m >2 m

>= 1.3 m >= 1.5m <= 1.8 m <= 2 m Short Medium Tall Short Medium Tall

Figure 2.3: Decision tree for Example 2.2

EXAMPLE 2.2

Suppose that students in a particular university are to be classified as short, tall or

medium based on their height. Assume that the database schema is {name, address,

gender, height, age, year, major}.To construct a decision tree, we must identify the

attributes that are important to the classification problem at hand. Suppose that height,

age and gender are chosen. Certainly, a female who is 1.95 m in height is considered as

tall. Also, a child 10 years of age may be tall if he or she is only 1.5 m. Since this is a set

of university students, we would expect most of them to be over 17 years of age. We thus

decide to filter out those under this age and perform their classification separately. We

may consider these students to be outliers because their ages are not typical of most

university students. Thus, for classification we have only gender and height. Using these

two attributes, a decision tree building algorithm will construct a tree using a sample of

the database with known classification values. This training sample forms the basis of

how tree is constructed. One possible resulting tree after training is shown in Figure 2.3.


41

2.6.1 Strengths

Decision trees have several advantages. Here is a list of a few of the many advantages

decision trees have to offer.

• Decision trees are easy to understand and map nicely to a set of production

rules.

• Decision trees have been successfully applied to real problems.

• Decision trees make no prior assumptions about the nature of the data.

• Decision trees are able to build models with datasets containing numerical as

well as categorical data.

2.6.2 Weaknesses

There are several issues surrounding decision tree usage. Specifically,

• Output attributes must be categorical, and multiple output attributes are not

allowed.

• Decision tree algorithms are unstable in that slight variations in the training

data can result in different attribute selections at each choice point within the

tree. The effect can be significant as attribute choices affect all descendent sub

trees.

• Trees created from numeric datasets can be quite complex as attribute splits

for numeric data are typically binary.

2.7 NEURAL NETWORKS

Neural networks offer a mathematical model that attempts to mimic the human brain.

Knowledge is often represented as a layered set of interconnected processors. These

processor nodes are frequently referred to as neurodes so as to indicate a relationship with


42

the neurons of the brain. Each node has a weighted connection to several other nodes in

adjacent layers. Individual nodes take the input received from connected nodes and use

the weights together with a simple function to compute output values.

2.7.1 Why use Neural Networks?

Neural networks, with their remarkable ability to derive meaning from complicated or

imprecise data, can be used to extract patterns and detect trends that are too complex to

be noticed by either humans or other computer techniques. A trained neural network can

be thought of as an “expert” in the category of information it has been given to analyze.

This expert can then be used to provide projections given new situations of interest and

answer “what if” questions.

2.7.2 Network Layers

The NN approach, like decision trees, requires that a graphical structure be built to

represent the model and then that the structure be applied to the data. The NN can be

viewed as a directed graph with source (input), sink (output) and internal (hidden) nodes.

The input nodes exist in an input layer, while the output nodes exist in an output layer.

The hidden nodes exist over one or more hidden layers. To perform the data mining task,

a tuple is input through the input nodes and the output node determines what the

prediction is. Unlike decision trees, which have only one input node (the root of the tree),

the NN has one input node for each attribute value to be examined to solve the data

mining function. Unlike decision trees, after a tuple is processed, the NN may be changed

to improve future performance. Although the structure of the graph does not change, the

labeling of the edges may change.

DEFINITION 2.6. A neural network (NN) is a directed graph, F=(V,A) with vertices

V= {1,2,….,n} and arcs A={(i,j) | 1<=i,j<=n}, with the following restrictions.

1. V is partitioned into set of input nodes, VI, hidden nodes, VH and output

nodes, Vo.


43

2. The vertices are also partitioned into layers {1,….,k} with all input nodes

in layer 1 and output nodes in layer k. All hidden nodes are in layers 2 to

k-1 which are called the hidden layers.

3. Any arc (i,j) must have node ii in layer h-1 and node j in layer h.

4. Arc (i,j) is labeled with a numeric value wij.

5. Node i is labeled with a function fi.

Definition 2.6 is a very simplistic view of NNs. Although there are many more

complicated types that do not fit this definition, this defines the most common type of

NN.

Figure 2.4 shows a fully connected feed-forward neural network structure together with a

single input instance [1.0,0.4,0.7]. Arrow indicates the direction of flow for each new

instance as it passes through the network. The network is fully connected because nodes

at one layer are connected to all nodes in the next layer.

The number of input attributes found within individual instances determines the number

of input layer nodes. The user specifies the number of hidden layers as well as the

number of nodes within a specific hidden layer. Determining a best choice for these

values is matter of experimentation. In practice, the total number of hidden layers is

usually restricted to two. Depending on the application, the output layer of the neural

network may contain one or several nodes.


44

Input Layer Hidden Layer Output Layer

1.0 W1j

W1i Wjk

W2j

0.4 W2i Wi k

W3j 0.7 W3i

Figure 2.4 A fully connected feed-forward neural network

2.7.3 Neural Network Input and Output Format

The input to individual neural network nodes must be numeric and fall in the closed

interval range [0, 1]. Because of this, we need a way to numerically represent categorical

data. We also require a conversion method for numerical data falling outside the [0, 1]

range.

The output nodes of a neural network represent continuous values in the [0, 1] range.

However, the output can be transformed to accommodate categorical class values.

2.7.4 The Sigmoid Function

The purpose of each node within a feed-forward neural network is to accept input values

and pass an output value to the next higher network layer. The nodes of the input layer

pass input attribute values to the hidden layer unchanged. Therefore for the input instance

shown in figure 2.4, the output of node 1 is 1.0, the output of node 2 is 0.4 and the output

of node 3 is 0.7.

Node 1

Node 2

Node 3

Node j

Node i

Node k


45

Table 2.3: Initial Weight Values for the Neural Network Shown in Figure

2.4

W1J W1I W2J W2I W3J W3I WJK WIK

0.20 0.10 0.30 -0.10 -0.10 0.20 0.10 0.50

A hidden or output layer node n takes input from the connected nodes of the previous

layer, combines the previous node values into a single value, and uses the new value as

input to an evaluation function. The output of the evaluation function is a number in the

closed interval [0, 1]. This value represents the output of node n.

Let’s look at an example. Table 2.3 shows sample weight values for the neural network

of Figure 2.4. Consider node j. To compute the input to node j, we determine the sum

total of the multiplication of each input weight by its corresponding input layer node

value. That is:

Input to node j= (0.2)(1.0) + (0.3)(0.4) + (-0.1)(0.7) = 0.25 (2.19)

Therefore 0.25 represents the input value for node j’s evaluation function.

The first criterion of an evaluation function is that the function must output values in the

[0, 1] interval range. A second criterion is that the function should output a value close to

1 when sufficely excited. The sigmoid function is computed as:

f(x)=1/1+e –x (2.20)

where e is the base of natural logarithms approximated by 2.718282.

2.7.5 Applications of neural networks

Character Recognition – The idea of character recognition has become very important

as handled devices like the Palm Pilot are becoming increasingly popular. Neural

networks can be used to recognize handwritten characters.


46

Image Compression – Neural networks can receive and process vast amounts of

information at once, making them useful in image compression. With the Internet

explosion and more sites using more images on their sites, using neural networks for

image compression is worth a look.

Stock Market Prediction – The day-to-day business of the stock market is extremely

complicated. Many factors weigh in whether a given stock will go up or down on any

given day. Since neural networks can examine a lot of information quickly and sort it all

out, they can be used to predict stock prices.

Traveling Salesman’s Problem – Interestingly enough, neural networks can solve the

traveling salesman problem, but only to a certain degree of approximation.

Medicine, Electronic Nose, Security and Loan Applications – These are some

applications that are in their proof-of-concept stage, with the acception of a neural

network that will decide whether or not to grant a loan, something that has already been

used more successfully than many humans.

2.7.6 Strengths

• Neural networks well with datasets containing large amounts of noisy input

data. Neural network evaluation functions such as the sigmoid function

naturally smooth input data variations caused by outliers and random error.

• Neural networks can process and predict numeric as well as categorical

outcome. However, categorical data conversions can be tricky.

• Neural networks can be used for applications that require a time element to be

included in the data.

• Neural networks have performed consistently well in several domains.

• Neural networks can be used for both supervised learning and unsupervised

clustering.


47

2.7.7 Weaknesses

• Probably the biggest criticism of neural networks is that they lack the ability

to explain their behavior.

• Neural network learning algorithms are not guaranteed to converge to an

optimal solution. With most types of neural networks, the problem can be

dealt with by manipulating various learning parameters.

• Neural networks can easily be overtrained to the point of working well on the

training data but poorly on test data. This problem can be monitored by

consistently measuring test set performance.

2.8 GENETIC ALGORITHMS

A Genetic Algorithm is heuristic, which means it estimates a solution. We won’t know if

we get the exact solution, but that may be a minor concern.

In fact, most real-life problems are like that: we estimate a solution rather than

calculating it exactly. For most problems we don’t have any formula for solving the

problem because it is too complex, or if we do, it just takes too long to calculate the

solution exactly. An example could be space optimization – it is very difficult to find the

best way to put objects of varying size into a room so they take as little space as possible.

The most feasible approach then is to use a heuristic method.

Genetic algorithms are different from other heuristic methods in several ways. The most

important difference is that a GA works on a population of possible solutions, while other

heuristic methods use a single solution in their iterations. Another difference is that GAs

are probabilistic (stochastic), not deterministic.

Each individual in the GA population represents a possible solution to the problem. The

suggested solution is coded into the “genes” of the individual. One individual might have

these genes:”1100101011”, another has these:”0101110001” (just examples). The values


48

(0 or 1) and their position in the “gene string” tell the genetic algorithm what solution the

individual represents.

2.8.1 Where GAs can be used?

GAs can be used where optimization is needed. It means that where there large solutions

to the problem but we have to find the best one. Like we can use GAs in finding best

moves in chess, mathematical problems, and financial problems and in many more areas.

DEFINITION 2.7. Given an alphabet A, an individual or chromosome is a string I =

I1,I2,…..,In where Ij ∈ A. Each character in the string, Ij, is called a gene. The values that

each character can have are called the alleles. A population, P, is a set of individuals.

2.8.2 Explanation of terms

Fitness: Fitness is the value assigned to an individual. It is based on how far or close an

individual is from the solution. Greater the fitness value better the solution it contains.

Fitness function: Fitness function is a function which assigns fitness value to the

individual. It is problem specific.

Breeding: Taking two fit individuals and intermingling there chromosome to create new

two individuals.

Crossover: The first genetic operator, forms new elements for the population by

combing parts of two elements currently in the population.

Mutation: A second genetic operator is sparingly applied to elements chosen for

elimination. Mutation can be applied by randomly flipping bits (or attribute values)

within a single element.

Selection: A third genetic operator that is sometimes used. With selection, the elements

deleted from the population are replaced by copies of elements that pass the fitness test

with high scores.


49

DEFINITION 2.8. A genetic algorithm (GA) is a computational model consisting of

five parts:

1. Starting set of individuals, P.

2. Crossover technique.

3. Mutation algorithm.

4. Fitness function.

5. Algorithm that applies the crossover and mutation techniques to P iteratively

using the fitness function to determine the best individuals in P to keep. The

algorithm replaces a predefined number of individuals from the population

with each iteration and terminates when some threshold is met.

ALGORITHM 2.3

Input:

P //Initial population

Output:

P’ //Improved population

Genetic algorithm:

//Algorithm to illustrate genetic algorithm

repeat

N=| P |;

P’=θ;

repeat

i1,i2 = select(P);

o1,o2= cross(i1,i2);

o1 = mutate(o1);

o2 = mutate(o2);

until | P’ | = N;

P = P’;

until termination criteria satisfied;


50

Algorithm 2.3 outlines the steps performed by a genetic algorithm. Initially, a population

of individuals, P, is created. Although different approaches can be used to perform this

step, they typically are generated randomly. From this population, a new population, P’,

of the same size is created. The algorithm 2.3 repeatedly selects individuals from whom

to create new ones. These parents i1, i2 are then used to produce two offspring, o1,o2,

using a crossover process. Then mutants may be generated. The process continues until

the new population satisfies the termination condition.

We assume here that the entire population is replaced with each iteration. An alternative

would be to replace the two individuals with the smallest fitness. Although this algorithm

is quite general, it is representative of all genetic algorithms. There are many variations

on this general theme.

2.8.3 Applications of GA

Typical applications of genetic algorithms include scheduling, robotics, economics,

biology, and pattern recognition.

2.8.4 Strengths of GA

• The major advantage to the use of genetic algorithms is that they are easily

parallelized.

• It can quickly scan a vast solution set. Bad proposals do not affect the end

solution negatively as they are simply discarded. The inductive nature of the

GA means that it doesn’t have to know any rules of the problem – it works by

its own internal rules. This is very useful for complex or loosely defined

problems.

2.8.5 Weaknesses of GA

• GAs are difficult to understand and to explain to end users.


51

• The abstraction of the problem and method to represent individuals is quite

difficult.

• Determining the best fitness function is difficult.

• Determining how to do crossover and mutation is difficult.

2.9 CLASSIFICATION

Classification is the most familiar and most popular data mining technique. Examples of

classification applications include image and pattern recognition, medical diagnosis, loan

approval, detecting faults in industry applications, and classifying financial market

trends. Estimation and prediction may be viewed as types of classification.

2.9.1 Statistical-Based Algorithms

2.9.1.1 Regression

Regression problems deal with an estimation of an output value based on input values.

When used for classification, the input values are values from the database and the output

values represent the classes. Regression can be used to solve classification problems, but

it can also be used for other applications such as forecasting.

In section, we briefly introduced linear regression using the formula

y=c0 + c1x1+……+cnxn (2.21)

By determining the regression coefficients c0,c1,…..,cn the relationship between the

output parameter, y and the input parameters x1,x2,…,xn can be estimated.

There are many reasons why the linear regression model may not be used to estimate

output data. One is that the data do not fit a linear model. It is possible, however, that the

data generally do actually represent a linear model, but the linear model generated is poor


52

because noise or outliers exist in the data. Noise is erroneous data. Outliers are data

values that are exceptions to the actual and expected data.

Suppose we are having k points in any training sample then we are having k formulas

yi=c0 + c1x1i + iε , i=1,…..,k (2.22)

With a simple linear regression, given an observable value (x1i, yi), iε is the error, and

thus the squared error technique introduced in the above section can be used to indicate

the error. To minimize the error, a method of least squares is used to minimize the least

square error. This approach finds coefficients c0,c1 so that the squared error is minimized

for the set of observable values. The sum of the squares of the errors is

2 20 1 1

1 1( )

k k

i iii i

L y c c xε= =

= = − −∑ ∑ (2.23)

Taking the partial derivatives (with respect to the coefficients) and setting equal to zero,

we can obtain the least squares estimates for the coefficients, 0c and 1c .

Regression can be used to perform classification using two different approaches:

1. Division: The data are divided into regions based on class.

2. Prediction: Formulas are generated to predict the output class value.

If the predictors in the linear regression function are modified by some function (square,

square root, etc.), then the model looks like

y=c0 + f1(x1) + …. + fn(xn) (2.24)

where fi is the function being used to transform the predictor. In this case the regression is

called nonlinear regression. Linear regression techniques, while easy to understand, are

not applicable to most complex data mining applications. They do not work well with


53

nonnumeric data. They also make the assumption that the relationship between the input

value and the output value is linear, which of course may not be the case.

Linear regression is not always appropriate because the data may not fit a straight line,

but also because the straight line values can be greater than 1 and less than 0. Thus, they

certainly cannot be used as the probability of occurrence of the target class. Another

commonly used regression technique is called logistic regression. Instead of fitting the

data to a straight line, logistic regression uses a logistic curve such as illustrated in

Figure. The formula for a univariate logistic curve is

( 0 1 1)

( 0 1 1)1

c c x

c c x

epe

+

+=+

(2.25)

The logistic curve gives a value between 0 and 1 so it can be interpreted as the

probability of class membership. As with linear regression, it can be used when

classification into two classes is desired. To perform the regression, the logarithmic

function can be applied to obtain the logistic function

0 1 1log( )1

p c c xp

= +−

(2.26)

Here p is the probability of being in the class and 1-p is the probability that it is not.

However, the process chooses values for c0 and c1 that maximizes the probability of

observing the values.

2.9.1.2 Bayesian Classification

Assuming that the contribution by all attributes are independent and that each contributes

equally to the classification problem, a simple classification scheme called naïve Bayes

classification has been proposed that is based on Bayes rule of conditional probability as

stated in Definition 2.3. This approach was briefly outlined in section. By analyzing the


54

contribution of each “independent” attribute, a conditional probability is determined. A

classification is made by combining the impact that the different attributes have on the

prediction to be made. The approach is called “naïve” because it assumes the

independence between the various attribute values. Given a data value xi the probability

that a related tuple, ti, is in class Cj is described by P(Cj | xi ). Training data can be used to

determine P(xi ), P (xi | Cj ) and P( Cj ). From these values, Bayes theorem allows us to

estimate the posterior probability P (Cj | xi ) and P(Cj | ti ).

Given a training set, the naïve Bayes algorithm first estimates the prior probability P( Cj )

for each class by counting how often each class occurs in the training data. For each

attribute, xi, the number of occurrences of each attribute value xi can be counted to

determine P (xi ). Similarly, the probability P (xi | Cj ) can be estimated by counting how

often each value occurs in the class in the training data. A tuple in the training data may

have many different attributes, each with many values. This must be done for all

attributes and all values of attributes. We then use these derived probabilities when a new

tuple must be classified. This is why naïve Bayes classification can be viewed as both a

descriptive and a predictive type of algorithm. The probabilities are descriptive and are

then used to predict the class membership for a target tuple.

When classifying a target tuple, the conditional and prior probabilities generated from the

training set are used to make the prediction. This is done by combing the effects of the

different attribute values from the tuple. Suppose that tuple ti has p independent attribute

values {xi1, xi2, ……., xip}. From the descriptive phase, we know P ( xik | Cj ), for each

class Cj and attribute xik. We then estimate P ( ti | Cj ) by

1

( | ) ( | )p

i j ik j

k

P t C P x C=

=∏ (2.27)

At this point in the algorithm, we then have the needed prior probabilities P ( Cj ) for

each class and the conditional probability P ( ti | Cj ). To calculate P (ti ), we can estimate

the likelihood that ti is in each class. This can be done by finding the likelihood that this


55

tuple is in each class and then adding all these values. The probability that ti is in a class

is the product of the conditional probabilities for each attribute value. The posterior

probability P ( Cj | ti ) is then found for each class. The class with the highest probability

is the one chosen for the tuple.

2.9.1.2.1 Strengths

• It is easy to use.

• Unlike other classification approaches, only one scan of the training data is

required.

• The naïve Bayes approach can easily handle missing values by simply omitting

that probability when calculating the likelihoods of membership in each class.

• In cases where there are simple relationships, the technique often does yield good

results.

2.9.1.2.2 Weaknesses

• Although the naïve Bayes approach is straightforward to use, it does not always

yield satisfactory results.

• The technique does not handle continuous data. Dividing the continuous values

into ranges could be used to solve this problem, but the division of the domain

into ranges is not an easy task, and how this is done can certainly impact the

results.

2.9.2 Distance-Based Algorithms

Each item that is mapped to the same class may be thought of as more similar to the other

items in that class that it is to be items found in other classes. Therefore, similarity (or

distance) measures may be used to identify the “alikeness” of different items in the

database.


56

Using a similarity measure for classification where the classes are predefined is

somewhat simpler than using a similarity measure for clustering where the classes are not

known in advance.

2.9.2.1 Simple Approach

The classification problem can be restated in Definition

DEFINITION 2.9 Given a database D = {t1,t2,……,tn} of tuples where each tuple

ti=<ti1,ti2,……,tin> contains numeric values and a set of classes C={C1,…….,Cm} where

each class Cj = <Cj1,Cj2,……Cjk> has numeric values, the classification problem is to

assign each ti to the class Cj such that (ti,Cj) >=sim(ti, Cl) ∀Cl ∈C where Cl <> Cj.

To calculate these similarity measures, the representative vector for each class must be

determined. A simple classification technique, then, would be to place each item in the

class where it is most similar to the center of that class. The representative for the class

may be found in other ways. For example, in pattern recognition problems, a predefined

pattern can be used to represent each class. Once a similarity measure is defined, each

item to be classified will be compared to each predefined pattern. The item will be placed

in the class with the largest similarity value. Algorithm 2.4 illustrates a straightforward

distance-based approach assuming that each class, ci, is represented by its center or

centroid. In the algorithm 2.4 ci is used to be the center for its class. Since each tuple

must be compared to the center for a class and there are a fixed number of classes, the

complexity to classify one tuple is O (n).

ALGORITHM 2.4

Input:

c1, ……, cm // Centers for each class

t // Input tuple to classify

Output:

c //Class to which t is assigned


57

Simple distance-based algorithm

dist=∞ ;

for i:=1 to m do

if dis(ci,t) < dist, then

c=i;

dist=dist(ci,t);

2.9.2.2 K Nearest Neighbors

One common classification scheme based on the use of distance measures is that of the K

nearest neighbors (KNN). The KNN technique assumes that the entire training set

includes not only the data in the set but also the desired classification for each item. In

effect, the training data become the model. When a classification is to be made for a new

item, its distance to each item in the training set must be determined. Only the K closet

entries in the training set are considered further. The new item is then placed in the class

that contains the most items from this set of K closet items.

Algorithm 2.5 outlines the use of KNN algorithm. We use T to represent the training

data. Since each tuple to be classified must be compared to each element in the training

data, this is O(q). Given n elements to be classified, this becomes an O(nq) problem.

Given that the training data are of a constant size, this can be viewed as an O(n) problem.

ALGORITHM 2.5

Input:

T //Training data

K //Number of neighbors

T //Input tuple to classify

Output:

C //Class to which t is assigned

KNN algorithm:


58

//Algorithm to classify tuple using KNN

N=∅ ;

//Find set of neighbors, N, for t

for each d∈ T do

if | N | <=K, then

N=N ∪ {d};

else

if ∃ u∈N such that sim(t,u) <= sim(t,d), then

begin

N=N – {u};

N=N ∪ {d};

end

//Find class for classification

c=class to which the most u∈N are classified;

2.9.3 Decision Tree-Based Algorithms

The decision tree approach is most useful in classification problems. With this technique,

a tree is constructed to model the classification process. Once the tree is built, it is applied

to each tuple in the database and results in a classification for that tuple. There are two

basic steps in the technique: building the tree and applying the tree to the database. Most

research has focused on how to build effective trees as the application process is

straightforward.

The decision tree approach to classification is to divide the search space into rectangle

regions. A tuple is classified based on the region into which it falls. A definition for a

decision tree used in classification is contained in Definition 2.10. There are alternative

definitions; for example, in a binary DT the nodes could be labeled with the predicates

themselves and each are would be labeled with yes or no (like in the “Twenty Questions”

game).


59

DEFINITION 2.10 Given a database D = {t1, ….. ,tn} where ti=<ti1, ….., tin> and the

database schema contains the following attributes {A1,A2, ……., An}. Also given is a set

of classes C= {C1, …., Cm}. A decision tree (DT) or classification tree is a tree associated

with D that has the following properties:

• Each internal node is labeled with an attribute, Ai.

• Each arc is labeled with a predicate that can be applied to the attribute associated

with the parent.

• Each leaf node is labeled with a class, Cj.

Solving the classification problem using decision trees is a two-step process.

1. Decision tree induction: Construct a DT using training data.

2. For each ti ∈ D, apply the DT to determine its class.

There are many advantages to the use of DTs for classification. DTs are certainly easy to

use and efficient. Rules can be generated that are easy to interpret and understand. They

scale well for large databases because the tree size is independent of the database size.

Each tuple in the database must be filtered through the tree. This takes time proportional

to the height of the tree, which is fixed. Trees can be constructed for data with many

attributes.

ALGORITHM 2.6

Input:

D //Training data

Output:

T //Decision tree

DTBuild algorithm:

//Simplistic algorithm to illustrate naïve approach to building DT

T=∅ ;

Determine best splitting criterion;

T=Create root node and label with splitting attribute;


60

T=Add arc to root node for each split predicate and label;

for each arc do

D=Database created by applying splitting predicate to D;

if stopping point reached for this path, then

T’=Create leaf node and label with appropriate class;

else

T’=DTBuild(D);

T=Add T’ to arc;

Disadvantages also exist for DT algorithms. First, they do not easily handle continuous

data. These attribute domains must be divided into categories to be handled. Handling

missing data is difficult because correct branches in the tree could not be taken. Since the

DT is constructed from the training data, overfitting may occur. This can be overcome via

tree pruning. Finally, correlations among attributes in the database are ignored by the DT

process.

2.9.3.1 ID3

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm used to generate

a decision tree invented by Ross Quinlan. ID3 is the precursor to the C4.5 algorithm. The

ID3 technique to building a decision tree is based on information theory and attempts to

minimize the expected number of comparisons.

The concept used to quantify information is called entropy. Entropy is used to measure

the amount of uncertainty or surprise or randomness in a set of data. Certainly, when all

data in a set belong to a single class, there is no uncertainty. In this case the entropy is

zero.

The ID3 algorithm can be summarized as follows:

1. Take all unused attributes and count their entropy concerning test samples.


61

2. Choose attribute for which entropy is minimum (or, equivalently,

information gain is maximum).

3. Make node containing that attribute.

ALGORITHM 2.7

ID3( examples, target_attribute, attributes)

create a root node for the tree

if all examples are positive, return the single-node tree root, with label=+.

if all examples are negative, return the single-node tree root, with label=-.

if number of predicting attributes is empty, then return the single node tree root

with label=most common value of the target attribute in the examples.

otherwise begin

A=the attribute that best classifies examples

decision tree attribute for root = A.

for each possible value, vi, of A

Add a new tree branch below root, corresponding to the test A= vi.

Let examples(vi), be the subset of examples that have the value vi

for A

if examples(vi) is empty

then below this new branch add a leaf node with label=most

common target value in the examples

else

below this new branch add the subtree ID3(examples(vi),

target_attribute, attributes – {A})

end

return root


62

2.9.3.1.2 The ID3 metric

The algorithm is based on Occam’s razor: it prefers smaller decision tree over larger

ones. However, it does not always produce the smallest tree, and is therefore a heuristic.

Occam’s razor is formalized using the concept of information entropy:

2

1

( ) ( , ) log ( , ).m

E

j

I i f i j f i j=

= −∑ (2.28)

2.9.3.2 C4.5

C4.5 is an algorithm used to generate a decision tree. C4.5 is an extension of Quinlan’s

ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and

for this reason, C4.5 is often referred to as a statistical classifier.

C4.5 builds decision trees from a set of training data in the same way as ID3, using the

concept of information entropy. The training data is a set S=S1,S2,……. of already

classified samples. Each sample Si=X1,X2,….. is a vector where X1,X2,….. represent

attributes or features of the sample. The training data is augmented with a vector

C=C1,C2,…. Where C1,C2,…. represent the class to which each sample belongs.

At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits

its set of samples into subsets enriched in one class or the other. Its creation is the

normalized information gain (difference in entropy) that results from choosing an

attribute for splitting the data. The attribute with the highest normalized information gain

is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.

This algorithm has a few base cases.

• All the samples in the list belong to the same class. When this happens, it

simply creates a leaf node for the decision tree saying to choose that class.


63

• None of the features provide any information gain. In this case, C4.5 creates a

decision node higher up the tree using the expected value of the class.

• Instance of previously-unseen class encountered. Again, C4.5 creates a

decision node higher up the tree using the expected value.

ALGORITHM 2.8

1. Check for base cases.

2. For each attribute a

find the normalized information gain from splitting on a

3. Let a_best be the attribute with the highest normalized information gain.

4. Creates a decision node that splits on a_best.

5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as

children of node.

2.9.3.2.1 Improvements from ID3 algorithm

C4.5 made a number of improvements to ID3. Some of these are:

• Handling both continuous and discrete attributes – In order to handle continuous

attributes. C4.5 creates a threshold and then splits the list into those whose

attribute value is above the threshold and those that are less than or equal to it.

• Handling training data with missing attribute values – C4.5 allows attribute values

to be marked as ? for missing. Missing attribute values are simply not used in gain

and entropy calculations.

• Handling attributes with differing costs.

• Pruning trees after creation – C4.5 goes back through the tree once it’s been

created and attempts to remove branches that do not help by replacing them with

leaf nodes.


64

2.9.3.2.2 Improvements in C5.0/See5 algorithm

C5.0 offers a number of improvements on C4.5. Some of these are:

• Speed – C5.0 is significantly faster than C4.5.

• Memory usage – C5.0 is more memory efficient than C4.5.

• Smaller decision trees – C5.0 gets similar results to C4.5 with considerably

smaller decision trees.

• Support for boosting – Boosting improves the trees and gives them more

accuracy.

• Weighting – C5.0 allows us to weight different attributes and misclassification

types.

• Winnowing – C5.0 automatically winnows the data to help reduce noise.

2.9.3.1 CART

A classification and regression tree (CART) is a technique that generates a binary

decision tree. As with ID3, entropy is used as a measure to choose the best splitting

attribute and criterion. Unlike ID3, however, where a child is created for each

subcategory, only two children are created. The splitting is performed around what is

determined to be the best split point. At each step, an exhaustive search is used to

determine the best split, where “best” is defined by

1

( / ) 2 | ( | ) ( | ) |m

L R j L j R

j

s t P P P C t P C t=

Φ = −∑ (2.29)

This formula is evaluated at the current node, t, and for each possible splitting attribute

and criterion, s. Here L and R are used to indicate the left and right subtrees of the current

node in the tree. PL, PR is the probability that a tuple in the training set will be on the left

or right side of the tree. This is defined as | || |

tuples in subtreetuples in training set

. We assume that the


65

right branch is taken on equality. P(Cj | tL ) or P(Cj | tR ) is the probability that a tuple is in

this class, Cj, and in the left or right subtree. This is defined as the

| || arg |tuples of class j in subtreetuples at the t et node

. At each step, only one criterion is chosen as the best over

all possible criteria.

CART handles missing data by simply ignoring that record in calculating the goodness of

a split on that attribute. The tree stops growing when no split will improve the

performance. Even though it is the best for the training data, it may not be the best for all

possible data to be added in the future. The CART algorithm also contains a pruning

strategy.

2.9.4 Neural Network-Based Algorithms

With neural networks (NNs), just as with decision trees, a model representing how to

classify any given database tuple is constructed. The activation functions typically are

sigmoidal. When a tuple must be classified, certain attribute values from that tuple are

input into the directed graph at the corresponding source nodes. There often is one sink

node for each class. The output value is generated indicates the probability that the

corresponding input tuple belongs to that class. The tuple will then be assigned to the

class with the highest probability of membership. The learning process modifies the

labeling of the arcs to better classify tuples. Given a starting structure and value for all

the labels in the graph, as each tuple in the training set is sent through the network, the

projected classification made by the graph can be compared with the actual classification.

Based on the accuracy of the prediction, various labeling in the graph can change. This

learning process continues with all the training data or until the classification accuracy is

adequate.


66

2.9.4.1 Propagation

The normal approach used for processing is called propagation. Given a tuple of values

input to the NN, X = <x1,x2,…….,xn>, one value is input at each node in the input layer.

Then the summation and activation functions are applied at each node, with an output

value created for each output arc from that node. These values are in turn sent to the

subsequent nodes. This process continues until a tuple of output values, Y = <y1,….,ym>,

is produced from the nodes in the output layer. The process of propagation is shown in

algorithm 2.9 using a neural network with one hidden layer. Here a hyperbolic tangent

activation function is used for nodes in the output layer. We assume that the constant c in

the activation function has been provided. We also use k to be number of edges coming

into a node.

ALGORITHM 2.9

Input:

N //neural network

X=<x1,x2,…….,xn> //Input tuple consisting of values for input attributes only

Output:

Y=<y1,y2,……,ym> //Tuple consisting of output values from NN

Propagation algorithm:

//Algorithm illustrates propagation of a tuple through a NN

for each node i in the input layer do

output xi on each output arc from i;

for each hidden layer do

for each node i do

1

( ( ));k

i ji jij

S w x=

= ∑

for each output arc from i do

Output (1 )(1 )

s

cs

e ie i

−

−

−+

;


67

for each node i in the output layer do

1

( ( ));k

i ji ji

j

S w x=

= ∑

Output 1(1 )

i csi

ye−=

+

2.9.4.2 NN Supervised Learning

The NN starting state is modified based on feedback of its performance with the data in

the training set. This type of learning is referred to as supervised because it is known a

priori what the desired output should be. Unsupervised learning can also be performed if

the output is not known. With unsupervised approaches, no external teacher set is used. A

training set may be provided, but no labeling of the desired outcome is included.

Supervised learning in an NN is the process of adjusting the arc weights based on its

performance with a tuple from the training set. The training set can be used as a “teacher”

during the training process. The output from the network is compared to this known

desired behavior. Algorithm 2.10 outlines the steps required.

ALGORITHM 2.10

Input:

N //Starting neural network

X //Input tuple from training set

D //Output tuple desired

Output:

N //Improved neural network

Suplearn algorithm

//Simplistic algorithm to illustrate approach to NN learning

Propagate X through N producing output Y;

Calcualte error by comparing D to Y;


68

Update weights on arcs in N to reduce error;

Assuming that the output from node i is yi but should be di, the error produced from a

node in any layer can be found by

| yi – di | (2.30)

The mean squared error (MSE) is found by

2( )

2i iy d− (2.31)

Backpropagation is a learning technique that adjusts weights in the NN by propagating

weight changes backward from the sink to the source nodes. Backpropagation is the most

well known form of learning because it is easy to understand and generally applicable.

Backpropagation can be thought as a generalized delta rule approach.

ALGORITHM 2.11

Input:


X=<x1,x2,…..,xn> //Input tuple from training set

D=<d1,d2,…...,dm> //Output tuple desired

Output:


Backpropagation algorithm:

//Illustrate backpropagation

Propogation(N,X);

2

11/ 2 ( )

m

i ii

E d y=

= −∑ ;

Gradient (N,E);


69

A simple version of backpropagation algorithm is shown in Algorithm 2.11. The MSE is

used to calculate the error. Each tuple in the training set is input to this algorithm. The

last step of the algorithm uses gradient descent as the technique to modify the weights in

the graph. The basic idea of gradient descent is to find the set of weights that minimizes

the MSE. ji

Ew∂∂

gives the slope (or gradient) of the error function for one weight. We thus

wish to find the weight where this slope is zero. Algorithm 2.12 illustrates the concept.

ALGORITHM 2.12

Input:


E //Error found from back algorithm

Output:


Gradient algorithm:

//Illustrates incremental gradient descent

for each node i in output layer do

for each node j input to i do

( ) (1 ) ;ji i i j i iw d y y y yηΔ = − −

;ji ji jiw w w= + Δ

layer = previous layer;

for each node j in this layer do

for each node k input to j do

21

( ) (1 );2

jkj k m m m jm m m

yw y d y w y yη

−Δ = − −∑

;kj kj kjw w w= + Δ

The algorithm 2.12 changes weights by working backward from the output layer to the

input layer. There are two basic versions of this algorithm. With the batch or offline

approach, the weights are changed once after all tuples in the training set are applied and

total MSE is found. With the incremental or online approach, the weights are changed


70

after each tuple in the training set is applied. The incremental technique is usually

preferred because it requires less space and may actually examine more potential

solutions (weights), thus leading to a better solution.

2.9.4.3 Radial Basis Function Networks

A radial function or a radial basis function (RBF) is a class of functions whose value

decreases (or increases) with the distance from a central point. An RBF has a Gaussian

shape, and an RBF network is typically an NN with three layers. The input layer is used

to simply input the data. A Gaussian activation function is used at the hidden layer, while

a linear activation function is used at the output layer. The objective is to have the hidden

nodes learn to respond only to a subset of the input, namely, that where the Gaussian

function is centered. This is usually accomplished via supervised learning. When RBF

functions are used as the activation functions on the hidden layer, the nodes can be

sensitive to a subset of the input values. Figure 2.5 shows the basic structure of an RBF

unit with one output node.

Figure 2.5: Radial basis function network

X1

X2

X3

f2

f1

∑

y c1

c2

w11

wk1

wkn

w2n

w21

w1n


71

2.9.4.4 Perceptrons

The simplest NN is called a perceptron. A perceptron is a single neuron with multiple

inputs and one output. The original perceptron proposed the use of a step activation

function, but it is more common to see another type of function such as a sigmoid

function. A simple perceptron can be used to classify into two classes. Using a unipolar

activation function, an output of 0 would be used to pass in other class.

2.10 CLUSTERING

Clustering is similar to classification in that data are grouped. However, unlike

classification, the groups are not predefined. Instead, the grouping is accomplished by

finding similarities between data according to characteristics found in the actual data. The

groups are called clusters. Many definitions for clusters have been proposed.

• Set of like elements. Elements from different clusters are not alike.

• The distance between points in a cluster is less than the distance between a point

in the cluster and any point outside it.

Some basic features of clustering are:

• The number of clusters is not known

• There may not be any a priori knowledge concerning the clusters

• Cluster results are dynamic.

DEFINITION 2.11: Given a database D ={t1,t2,….,tn} of tuples and an integer value k,

the clustering problem is to define a mapping f : D {1,…..,k} where each ti is

assigned to one cluster Kj, 1 <= j <=k. A cluster, Kj, contains precisely those tuples

mapped to it; that is, Kj = {ti | f(ti) = Kj, 1<=i<=n, and ti E D}.

A classification of different types of clustering algorithms is shown in the figure 2.6.


72

With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy

has a set of clusters. At the lowest level, each item is in its own unique cluster. At the

highest level, all items belong to the same cluster. With hierarchical clustering, the

desired number of clusters is not input. With partitional clustering, the algorithm creates

only one set of clusters.

2.10.1 Hierarchical Algorithms

Hierarchical clustering algorithms actually creates sets of clusters. Hierarchical

algorithms differ in how the sets are created. A tree data structure, called a dendrogram,

can be used to illustrate the hierarchical clustering technique and the sets of different

clusters. The root in a dendrogram tree contains one cluster where all elements are

together. The leaves in the dendrogram each consist of a single element cluster. Internal

nodes in the dendrogram represent new clusters formed by merging the clusters that

appear as its children in the tree. Each level in the tree is associated with the distance

measure that was used to merge the clusters. All clusters created at a particular level were

combined because the children clusters had a distance between them less than the

distance value associated with this level in the tree. One example of dendrogram can be

as given in figure 2.7.

Clustering

Partitional Categorical Hierarchical

Divisive Sampling

Figure 2.6: Classification of Clustering Algorithms

Agglomerative Compression

Large DB


73

The space complexity for hierarchical algorithms is O(n2) because this is the space

required for the adjancency matrix. The space required for the dendrogram is O(kn),

which is much less than O(n2). The time complexity for hierarchical algorithms is O(kn2)

because there is one iteration for each level in the dendrogram. Depending on the specific

algorithm, however, this could actually be O(maxd n2) where maxd is the maximum

distance between points.

Hierarchical techniques are well suited for many clustering applications that naturally

exhibit a nesting relationship between clusters. For example, in biology, plant and animal

taxonomies could easily be viewed as a hierarchy of clusters.

2.10.2 Agglomerative Algorithms

Agglomerative algorithms start with each individual item in its own cluster and

iteratively merge clusters until all items belong in one cluster. Different agglomerative

algorithms differ in how the clusters are merged at each level. Algorithm 2.13 illustrates

the typical agglomerative clustering algorithm. It assumes that a set of elements and

distances between them is given as input. We use an n * n vertex adjacency matrix, A, as

input. Here the adjacency matrix, A, contains a distance value rather than a simple

A B C D E F

Figure 2.7: Example of Dendrogram


74

boolean value: A[i,j] = dis(ti,tj).The output of the algorithm 2.13 is a dendrogram, DE,

which we represent as a set of ordered triples <d,k,K> where d is the threshold distance, k

is the number of clusters, and K is the set of clusters.

ALGORITHM 2.13

Input:

D = {t1,t2,……,tn} //set of elements

A //Adjacency matrix showing distance between elements

Output:

DE // Dendrogram represented as a set of ordered triples

Agglomerative algorithm:

d=0;

k=n;

K={{t1},….,{tn}};

DE={<d,k,K>}; // Initially dendrogram contains each element

in its own cluster.

repeat

oldk=k;

d=d+1;

Ad=Vertex adjacency matrix for graph with threshold distance of d;

<k,K>=NewClusters(Ad,D);

if oldk <> k then

DE=DE U <d,k,K>; // New set of clusters added to dendrogram

until k=1

2.10.2.1 Single Link Technique

The single link technique is based on the idea of finding maximal connected components

in a graph. A connected component is a graph in which there exists a path between any

two vertices. With the single link approach, two clusters are merged if there is at least one


75

edge that connects two clusters; that is, if the minimum distance between any two points

is less than or equal to the threshold distance being considered. For this reason, it is often

called the nearest neighbor clustering technique.

2.10.2.2 Complete Link Algorithm

Although the complete link algorithm is similar to the single link algorithm, it looks for

cliques rather than connected components. A clique is a maximal graph in which there is

an edge between any two vertices. Here a procedure is used to find the maximum

distance between any clusters so that two clusters are merged if the maximum distance is

less than or equal to the distance threshold. In this algorithm, we assume the existence of

a procedure, clique, which finds all cliques in a graph.

2.10.2.3 Average Link

The average link technique merges two clusters if the average distance between any two

points in the two target clusters is below the distance threshold.

2.10.2.4 Divisive Clustering

With divisive clustering, all items are initially placed in one cluster and clusters are

repeatedly spilt in two until all items are in their own cluster. The idea is to split up

clusters where some elements are not sufficiently close to other elements.

2.10.3 Partitional Algorithms

Nonhierarchical or partitional clustering creates the clusters in one step as opposed to

several steps. Some of the popular partitional algorithms are as below.


76

2.10.3.1 Minimum Spanning Tree

Since the clustering problem is to define a mapping, the output of this algorithm shows

the clusters as a set of ordered pairs <ti,j> where f(ti) = Kj.

ALGORITHM 2.14

Input:

D = {t1,t2,…..,tn} // Set of elements

A //Adjacency matrix showing distance between elements

k // Number of desired clusters

Output:

f // Mapping represented as a set of ordered pairs

Partitional MST Algorithm:

M = MST(A)

identify inconsistent edges in M;

remove k-1 inconsistent edges;

create output representation;

“Inconsistent” could be defined based on distance. Zahn proposes more reasonable

inconsistent measures based on the weight (distance) of an edge as compared to those

close to it. For example, an inconsistent edge would be one whose weight is much larger

than the average of the adjacent edges.

2.10.3.2 Squared Error Clustering Algorithm

The squared error clustering algorithm 2.15 minimizes the squared error. The squared

error for a cluster is the sum of the squared Euclidean distances between each element in

the cluster and the cluster centroid, Ck. Given a cluster Ki, let the set of items mapped to

that cluster be {ti1,ti2,……,tim}. The squared error is defined as


77

2

1

|| ||i

m

K ij k

j

se t C=

= −∑ (2.32)

Given a set of clusters K={K1,K2,…….KK}, the squared error for K is defined as

1

j

k

K Kj

se se=

= ∑ (2.33)

They follow the basic algorithm structure as shown in algorithm 2.15.

ALGORITHM 2.15

Input:

D={t1,t2,……,tn} // set of elements


Output:

K // Set of clusters

Squared error algorithm:

assign each item ti to a cluster;

calculate center for each cluster;

repeat

assign each item ti to the cluster which has the closet center;

calculate new center for each cluster;

calculate squared error;

until the difference between successive squared errors is below a threshold;

2.10.3.3 The K-Means Clustering Algorithm

The K-Means algorithm (Lloyd, 1982) is a simple yet effective statistical clustering

technique. It is an iterative clustering algorithm 2.16 in which items are moved among

sets of clusters until the desired set is reached.


78

The cluster mean of Ki = {ti1,ti2,……,tim} is defined as

1

1 m

i ijj

m tm =

= ∑ (2.34)

ALGORITHM 2.16

Input:

D = {t1,t2,….,tn} //Set of elements

Kk //Number of desired clusters

Output :

K //Set of clusters

K-means algorithm:

assign initial values for means m1,m2,….,mk;

repeat

assign each item to the cluster which has the closet mean; calculate

new mean for each cluster;

until convergence criteria is met;

EXAMPLE 2.3

Suppose that we are given the following items to cluster:

{2,4,10,12,3,20,30,11,25}

and suppose that k=2. We initially assign the means to the first two values: m1=2 and

m2=4. Using Euclidean distance, we find that


79

for m1=2 for m2=4

__________________________________

2 0 2 4

4 4 4 0

10 64 10 36

12 100 12 64

3 1 3 4

20 324 20 256

30 784 30 676

11 81 11 49

25 529 25 441

so we get initially K1={2,3} and K2={4,10,12,20,30,11,25}. The value 3 is equally close

to both means , so we arbitrarily choose K1. Any desired assignment could be used in the

case of ties. We then recalculate the means to get K1={2,3,4} and

K2={10,12,20,30,11,25}. Continuing in this fashion, we obtain the following.

m1 m2 K1 K2

3 18 {2,3,4,10} {12,20,30,11,25}

4.75 19.6 {2,3,4,10,11,12} {20,30,25}

7 25 {2,3,4,10,11,12} {20,30,25}

Note that the clusters in the last two steps are identical. This will yield identical means,

and thus the means have converged. Our answer is thus K1={2,3,4,10,11,12} and

K2={20,30,25}.

The time complexity of K-means is O(tkn) where t is the number of iterations. K-means

finds a local optimum and may actually miss the global optimum.


80

2.10.3.3.1 Strengths

• The K-means method is easy to understand and implement.


• Although the K-means algorithm often produces good results, it is not time-

efficient and does not scale well.

• The algorithm only works with real-valued data. If we have a categorical

attribute in our dataset we must either discard the attribute or convert the

attribute values to numeric equivalents.

• The K-means algorithm works best when the clusters that exist in the data are

of approximately equal size. This being the case, if an optimal solution is

represented by clusters of unequal size, the K-Means algorithm is not likely to

find a best solution.

• There is no way to tell which attributes are significant in determining the

formed clusters. For this reason several irrelevant attributes can cause less

than optimal results.

Despite these limitations the K-Means algorithm continues to be a favorite statistical

technique.

2.10.3.4 Nearest Neighbor Algorithm

An algorithm similar to the single link technique is called the nearest neighbor algorithm.

With this serial algorithm, items are iteratively merged into the existing clusters that are

closet. In this algorithm a threshold, t, is used to determine if items will be added to

existing clusters or if a new cluster is created.


81

ALGORITHM 2.17

Input:

D={t1,t2,….,tn} // Set of elements

A // Adjacency matrix showing distance between elements

Output:


Nearest neighbor algorithm:

K1={t1};

K={K1};

k=1;

for i=2 to n do

find the tm in some cluster Km in K such that dis(ti,tm) is the smallest;

if dis(ti,tm) <= 2 then

Km=Km U ti

Else

k=k+1;

kk={ti};

2.10.3.5 PAM Algorithm

The PAM (partitioning around medoids) algorithm, also called the K-medoids algorithm,

represents a cluster by a medoid. Using a medoid is an approach that handles outliers

well. Initially, a random set of k items is taken to be the set of medoids. Then at each

step, all items from the input dataset that are not currently medoids are examined one by

one to see if they should be medoids. By looking at all pairs of medoid, non-medoid

objects, the algorithm chooses the pair that improves the overall quality of the clustering

the best and exchanges them. Quality here is measured by the sum of all distances from a

non-medoid object to the medoid for the cluster it is in. An item is assigned to the cluster

represented by the medoid to which it is closet (minimum distance).

The total impact to quality by a medoid change TCih is given by


82

1

n

ih jihj

TC C=

= ∑ (2.35)

ALGORITHM 2.18

Input:

D={t1,t2,….,tn} // Set of elements

A // Adjacency matrix showing distance between elements


Output:


PAM algorithm:

arbitrarily select k medoids from D;

repeat

for each th not a medoid do

for each medoid ti do

calculate TCih;

find i,h where TCih is the smallest;

if TCih < 0, then

replace medoid ti with th;

until TCih >= 0;

for each ti ∈∈ D do

assign ti to Kj, where dis(ti,tj) is the smallest over all medoids;

2.10.4 Clustering Large Databases

2.10.4.1 BIRCH

BIRCH(balanced iterative reducing and clustering using hierarchies) is designed for

clustering a large amount of metric data. It is incremental and hierarchical, and it uses an


83

outlier handling technique. BIRCH applies only to numeric data. This algorithm 2.19

uses a tree called a CF tree as defined in Definition 2.12.

DEFINITION 2.12: A clustering feature (CF) is a triple (N,LS,SS), where the number of

the points in the cluster is N, LS is the sum of the points in the cluster, and SS is the sum

of the squares of the points in the cluster.

DEFINITION 2.13: A CF tree is a balanced tree with a branching factor B. Each internal

node contains a CF triple for each of its children. Each leaf node also represents a cluster

and contains a CF entry for each subcluster in it. A subcluster in a leaf node must have a

diameter no greater than a given threshold value T.

ALGORITHM 2.19

Input:

D = {t1,t2,….,tn} // Set of elements

T // Thresold for CF tree construction

Output:

K //Set of clusters

BIRCH clustering algorithm:

For each ti ∈ D do

determine correct leaf node for ti insertion;

if threshold condition is not violated, then

add ti to cluster and update CF triples;

else

if room to insert ti, then

insert ti as single cluster and update CF triples;

else

split leaf node and redistribute CF features;

BIRCH is linear in both space and I/O time. The choice of threshold values is imperative

to an efficient execution of the algorithm. Otherwise, the tree may have to be rebuilt


84

many times to ensure that it can be memory –resident. This gives the worst-case time

complexity of O(n2).

2.10.4.2 DBSCAN

The approach used by DBSCAN(density-based spatial clustering of applications with

noise) is to create clusters with a minimum size and density. Density is defined as a

minimum number of points within a certain distance of each other. This handles the

outlier problem by ensuring that an outlier will not create a cluster. One input parameter,

MinPts, indicates the minimum number of points in any cluster. In addition, for each

point in a cluster there must be another point in the cluster whose distance from it is less

than a threshold input value, Eps. The Eps-neighborhood or neighborhood of a point is

the set of points within a distance of Eps. The desired number of clusters, k, is not input

but rather is determined by the algorithm itself.

DEFINITION 2.14: Given values Eps and Minpts, a point p is directly density-reachable

from q if

• dis(p,q) <= Eps and

• | {r | dis(r,q) <= Eps} | >= MinPts

ALGORITHM 2.20

Input:

D={t1,t2,….,tn} //Set of elements

MinPts // Number of points in cluster

Eps // Maximum distance for density measure

Output:

K={K1,K2,….,KK}

DBSCAN algorithm:

K=0;

for i=1 to n do


85

if ti is not in a cluster, then

X={tj | tj is density-reachable from ti};

If X is a valid cluster, then

k=k+1;

Kk=X;

2.10.4.2.1 Strengths

• DBSCAN does not requite number of clusters as input, as it is decided by itself.

• DBSCAN can find arbitrarily shaped clusters. It can even find clusters completely

surrounded by a different cluster. Due to MinPts parameter, the so-called single-

link effect (different clusters being connected by a thin line of points) is reduced.

• DBSCAN has a notion of noise.

• DBSCAN requires just two parameters and is mostly insensitive to the ordering of

the points in the database.


• DBSCAN can only result in a good clustering as its distance measure is in the

function getNeighbors(P,epsilon). The most common distance metric used is the

euclidean distance measure. Especially for high-dimensional data, this distance

metric can be rendered almost useless.

• DBSCAN does not respond well to data sets with varying densities.

2.10.4.3 CURE Algorithm

One objective for the CURE (Clustering Using Representatives) clustering algorithm is to

handle outliers well. It has both a hierarchical component and partitioning component.

First, a constant number of points, c, are chosen from each cluster. These well-scattered

points are then shrunk toward the cluster’s centroid by applying a shrinkage factor,α .

When α is 1, all points are shrunk to just one-the centroid. These points represent the


86

cluster better than a single point (such as a medoid or centroid) could. With multiple

representative points, clusters of unusual shapes can be better represented. CURE then

uses a hierarchical clustering algorithm. At each step in the agglomerative algorithm,

clusters with the closest pair of representative points are chosen to be merged. The

distance between them is defined as the minimum distance between any pair of points in

the representative sets from the two clusters.

In the algorithm 2.21, we assume that each entry u in the heap contains the set of

representative points, u.rep; the mean of the points in the cluster, u.mean; and the cluster

closet to it, u.closest. We use the heap operations; heapify to create the heap, min to

extract the minimum entry in the heap, insert to add a new entry, and delete to delete an

entry. A merge procedure is used to merge two clusters. In CURE, a k-D tree is used to

assist in the merging of clusters.

ALGORITHM 2.21

Input:

D= {t1,t2,….,tn} //Set of elements

k // Desired number of clusters

Output:

Q // Heap containing one entry for each cluster

CURE algorithm:

T = build(D);

Q = heapify(D); // Initially build heap with one entry per item;

repeat

u = min(Q);

delete(Q, u.close);

w = merge(u,v);

delete(T,u);

delete(T,v);

insert(T,w);


87

for each x ∈ Q do

x.close = find closet cluster to x;

if x is closet to w, then

w.close = x;

insert(Q,w);

until number of nodes in Q is k;

2.10.5 Comparison of Clustering Algorithms

Here there is a comparison of different clustering algorithms based on type, space, time

and whether it is incremental, iterative or not. It is given in table 2.4.

2.11 SELECTION CRITERIA OF A DATA MINING TECHNIQUE

The following questions may be useful in determining which techniques to apply:

• Is learning supervised or unsupervised?

• Do we require a clear explanation about the relationships present in the data?

• Is there one set of input attributes and one set of output attributes or can

attributes interact with one another in several ways?

• Is the input data categorical, numerical or a combination of both?

• If learning is supervised, is there one output attribute or are there several

output attributes? Are the output attribute(s) categorical or numeric?

For a particular problem, these questions have obvious answers. For example, we know

neural network is a black-box structure. Therefore this technique is a poor choice if an

explanation about what has been learned is required. Also, association rules are usually a

best choice when attributes are allowed to play multiple roles in the data mining process.

There are some guidelines.

1. Does the data contain several missing values?


88

Most data mining researchers agree that, if applicable, neural networks tend to

outperform other models when a wealth of noisy data are present.

2. Is time an issue?

Algorithms for building decision trees and production rules typically execute

much faster than neural network or genetic learning approaches.

3. Do we know the distribution of the data?

Datasets containing more than a few hundred instances can be a problem for

data mining techniques that require the data to conform to certain standards.

For example, many statistical techniques assume the data to be normally

distributed.

4. Do we know which attributes best define the data to be modeled?

Decision trees and certain statistical approaches can determine those attributes

most predictive of class membership. Neural network, Nearest neighbor and

various clustering approaches assume attributes to be of equal importance.

This is a problem when several attributes not predictive of class membership

are present in the data.

5. Which technique is most likely to give best classification accuracy?

For a particular problem, these questions have obvious answers. For example, we know

neural network is a black-box structure. Therefore this technique is a poor choice if an

explanation about what has been learned is required. Also, association rules are usually a

best choice when attributes are allowed to play multiple roles in the data mining process.

We can also select data mining technique based on the data-mining task we want to

perform. In the table 2.5 data mining problem types are related to appropriate modeling

techniques.


89

Table 2.4 Comparison of Clustering Algorithms

Algorithm Type Space Time Notes

Single Link Hierarchical O(n2) O(kn2) Not Incremental

Average

Link

Hierarchical O(n2) O(kn2) Not Incremental

Complete

Link

Hierarchical O(n2) O(kn2) Not Incremental

MST Hierarchical

/Partitional

O(n2) O(n2) Not Incremental

Squared

Error

Partitional O(n) O(tkn) Iterative

K-Means Partitional O(n) O(tkn) Iterative;

No Categorical

Nearest

Neighbor

Partitional O(n2) O(n2) Iterative

PAM Partitional O(n2) O(tk(n-

k)2)

IterativeAdapted;ag

glomerative

BIRCH Partitional O(n) O(n) CF-tree;

Incremental;

Outliers

CURE Mixed O(n) O(n2lgn) Heap;k-D tree;

Incremental;

Outliers;

Sampling

ROCK Agglomerati

ve

O(n2) O(kn2) Sampling;

Categorical;

Links

DBSCAN Mixed O(n2) O(n2) Sampling;

Outliers


90

Table 2.5 Data Mining Technique for Data Mining Task

No Data Mining

Task

Data Mining Technique

1 Classification Decision trees, Neural networks, K-nearest

neighbors, Rule induction methods

2 Prediction Neural networks, K-nearest neighbors,

Regression Analysis

3 Dependency

Analysis

Correlation analysis, Regression Analysis,

Association rules, Bayesian networks,

Inductive logic

4 Data

description

and

summarization

Statistical techniques, OLAP

5 Segmentation

or clustering

Clustering techniques, Neural Networks

2.12 REFERENCES

[1] Margaret H. Dunham, S. Sridhar- DATA MINING Introductory and Advanced

Topics, Pearson Education, ISBN 81-7758-785-4

[2] Richard J. Roiger, Michael W. Geatz – Data Mining A Tutorial-based Primer,

Pearson Education, ISBN:81-297-1089-7

[3] Ian H. Witten, Eibe Frank – DATA MINING Practical Machine Learning Tools and

Techniques, Morgan Kaufmann Publishers, ISBN: 0-12-088407-0

[4] J. Han, M. Kamber – Data Ming Concepts and Techniques, Morgan Kaufmann

Publishers, ISBN: 81-8147-049-4

[5] V.B.Rao and H.V.Rao – C ++ Neural Networks and Fuzzy Logic MIS 1993.

[6] R.Bharath and J.Drosen – Neural Network Computing McGraw-Hill 1994

[7] R. Hecht–Nielsen – Neurocomputing


91

[8] M.Minsky and S.Papert – Perceptrons-Expanded Edition: An Introduction to

Computational Geometry. MIT Press 1987

[9] Dharwa Jyotindra N., Parikh S. M., Patel A. R. (Dr.) “A Comparative Study of Data

Mining Techniques and its Selection Issue” 61-65 Proceedings of the national conference

on IDBIT-2008, 23-24 February 2008 at SRIMCA, ISBN:978-81-906446-0-0.

[10] Website: http://www.jooneworld.com/: a free Neural Network engine written in

Java.

[11] Website: www.tek271.com/free/nuExpert.html: a free Neural Network engine

[12] Website: www.cs.purdue.edu

[13] Website: www.dbmsmag.com

[14] Website: www.osw.ca

[15] Website: www.jmis.bentley.edu

[16] Website: www.uni.edu

[17] Website: www.dms.irb.hr

[18] Website: www.ParasChopra.com

http://www.jooneworld.com/

http://www.tek271.com/free/nuExpert.html

http://www.cs.purdue.edu/

http://www.dbmsmag.com/

http://www.osw.ca/

http://www.jmis.bentley.edu/

http://www.uni.edu/

http://www.dms.irb.hr/

http://www.paraschopra.com/

92

CHAPTER 3

FINANCIAL CYBER CRIME AND FRAUDS

3.1 WHAT IS A CYBER CRIME?

3.2 AN EXAMPLE OF FINANCIAL CYBER CRIME

3.3 FINANCIAL CYBER CRIMES

3.4 WHAT IS A FRAUD?

3.5 TYPES OF FRAUD

3.6 FINANCIAL CRIMES

3.7 WAYS OF ONLINE BANKING FRAUD

3.8 2008 INTERNET CRIME REPORT

3.9 ONLINE FRAUD REPORT, CYBERSOURCE 2010

3.10 REFERENCES

3.1 WHAT IS A CYBER CRIME?

Cyber crime encompasses any criminal act dealing with computers and networks (called

hacking). Additionally, cyber crime also includes traditional crimes conducted through

the Internet. For example, hate crimes, telemarketing and Internet fraud, identity theft,

and credit card account thefts are considered to be cyber crimes when the illegal activities

are committed through the use of a computer and the Internet.

Information Systems Security Association (ISSA), Ireland conduct IRIS cyber crime

survey every year. They developed a questionnaire in which respondents indicated the

types of cyber crime incident which had affected their organization. Figure 3.1 detail the

responses received in the year 2007.

Chapter 3: Financial Cyber crime and Frauds

93

18

1615

13

1110

8 8

00

2

4

6

8

10

12

14

16

18

20

System or networ kintr usion(inter nal

sour ce)

Electr onic emloyeehar assment(exter nal

sour ce)

Electr onic f inancialf r aud(exter nal )

Or ganisationalidenti ty thef t(e.g.cloned websi te)

Thef t of intel lectualpr oper ty

Electr onic f inancialf r aud(inter nal )

Phising(di r ectedagainst the

or ganisation)

Telecom f r aud Attacks againstmanuf actur ing,

SCADA or pr ocesscontr ol systems

Figure 3.1 Affecting the Person by Cyber Crime (in %)

3.2 AN EXAMPLE OF FINANCIAL CYBER CRIME

One example of financial crime is, a website offered to sell Alphonso mangoes at a

throwaway price. Initially very few people responded to or supplied the website with

their credit card numbers. These people were actually sent the Alphonso mangoes. The

word about this website now spread like wildfire. Thousands of people from all over the

country responded to this site and ordered mangoes by providing their credit card

numbers. The owners of what was later proven to be a bogus website then fled taking the

numerous credit card numbers and proceeded to spend huge amounts of money much to

the chagrin of the card owners.

3.3 FINANCIAL CYBER CRIMES

3.3.1 Credit Card Fraud

We simply have to type credit card no, expiry date, CVV no into www page of the

vendor for online transaction. If electronic transactions are not secured the credit card

numbers can be stolen by the hackers who can misuse this card by impersonating the

credit card owner.


94

3.3.2 Net Extortion

Copying the company’s confidential data in order to extort said company for huge

amount.

3.3.3 Phising

It is technique of pulling out confidential information from the bank/financial

institutional account holders by deceptive means.

3.3.4 Salami Attack

In such crime criminal makes insignificant changes in such a manner that such changes

would go unnoticed. Criminal makes such program that deducts small amount like

Rs.1.00 per month from the account of all the customers of the bank and deposit the same

in his account. In this case no account holder will approach the bank for such small

amount but criminals gain huge amount.

3.3.5 Sale of Narcotics

This crime can be committed by sale and purchase through net. There are web sites which

offer sale and shipment of contrabands drugs. They may use the techniques of

stenography for hiding the messages.

3.4 WHAT IS A FRAUD?

Fraud may be defined as a dishonest or illegal use of services, with the intension to avoid

service charges. Frauds have plagued telecommunication industries, financial institutions

and other organizations for a long time. These frauds cost the business at great expenses

per year. As a result, fraud detection has become an important and urgent task for these

businesses. At present a number of methods have been implemented to detect frauds,


95

from both statistical approaches (e.g. data mining) and hardware approaches. (e.g.

firewalls, smart cards).

3.5 TYPES OF FRAUD

We discuss the types of fraud like Credit card fraud, Telecommunications fraud and

Intrusion in computer systems.

3.5.1 Credit Card Fraud

In simple terms credit card fraud can be defined as follows.

When an individual uses another individual’s credit card for personal reasons while the

owner of the card issuer are not aware of the fact that the card is being used. Further, the

individual using the card has no connection with cardholder or issuer, and has no

intension of either contacting the owner of the card or making repayments for the

purchases made.

Generally we can categorize credit card fraud into two main types 1.Identity theft fraud

and 2. Non-identity theft fraud

3.5.1.1 Identity theft

While identity theft and what we call credit card fraud are both pernicious crimes, and

both constitute fraud, we would like to distinguish the two for policy purposes. We place

identity theft into two basic categories.


96

3.5.1.1.1 Fraudulent Applications-Three Percent of Our Total Fraud Cases

This involves the unlawful acquisition and use of another person’s identifying

information to obtain credit, or the use of that information to create a fictitious identity to

establish an account.

In order to commit identity theft by means of fraudulent application, the perpetrator needs

to acquire not just a name, address or credit card number but unique identifiers such as

mother’s maiden name, social security number and detailed information about a person’s

credit history such as the amount of their most mortgage payment. This is why more than

40 percent of the identity theft cases that we see are committed by someone familiar to

the victim, frequently a family member or someone in a position of intimacy or trust.

This variety of identity theft represents three percent of our total fraud cases.

3.5.1.1.2 Account Takeover-One Percent of Our Total Fraud Cases

This occurs when someone unlawfully uses another person’s identifying information to

take ownership of an account. This would typically occur by making an unauthorized

change of address followed by a request for a new product such as a card or check, or

perhaps a PIN number. This variety of identity theft represents less than one percent of

our total fraud cases.

3.5.1.2 Non-identity Theft Fraud-The Other 96 Percent of Our Total Fraud Cases

This type of fraud constitutes the vast majority of occurrences and falls under four basic

headings.

1) Lost or Stolen Cards: The card is actually in possession of the customer and is

subsequently lost or stolen.

2) Non-Receipt: The card is never received by the customer and is intercepted by

the perpetrator prior to or during mail delivery.


97

3) Fraudulent Mail or Telephone Order: The card is in possession of the customer

and the account number and expiration date is compromised permitting purchases

by phone, mail or internet.

The prevention of credit card fraud is an important application for prediction techniques.

One major obstacle neural network training technique is the high necessary diagnostic

quality. Since only one financial transaction of a thousand is invalid no prediction success

less than 99.9% is acceptable.

3.5.2 Telecommunications Fraud

Johnson defines the telecommunications fraud as any transmission of voice or data across

a telecommunications network where the intent of the sender is to avoid or reduce

legitimate call charges. In similar vein, Davis and Goyal define fraud as obtaining

unbillable services and undeserved fees.

3.5.2.1 Types of Telecommunications Fraud

There are many different types of telecoms fraud, and these can occur at various levels.

The two most prevalent types are subscription fraud and superimposed or ‘surfing’ fraud.

Subscription fraud: This occurs when fraudster obtains a subscription to a service, often

with false identity details, with no intension of paying. This is thus at the level of a phone

number – all transactions from this number will be fraudulent.

Superimposed fraud: This is the use of a service without having the necessary authority

and is usually detected by the appearance of ‘phantom’ calls on a bill. There are several

ways to carry out superimposed fraud, including mobile phone cloning and obtaining

calling card authorization details. Superimposed fraud will generally occur at the level of

individual calls – the fraudulent calls will be mixed in with the legitimate ones.

Subscription fraud will generally be detected at some point through the billing process –

though one would aim to detect it well before that, since large costs can quickly be run

up. Superimposed fraud can remain undetected for a long time.


98

Other types of telecoms fraud include

‘Ghosting’ (technology that tricks the network in order to obtain free calls) and

‘Insider’ fraud where telecom company employees sell information to criminals that can

be exploited for fraudulent gain. This, of course, is a universal cause of fraud, whatever

the domain.

‘Tumling’ is a type of superimposed fraud in which rolling fake serial numbers are used

on cloned handsets, so that successive calls are attributed to different legitimate phones.

The chance of detection by spotting unusual patterns is small, and the illicit phone will

operate until all of the assumed identities have been spotted. The term ‘spoofing’ is

sometimes used to describe users pretending to be someone else.

3.5.3 Computer Intrusion

Intrusion detection plays a vital role in today’s networked environment. Intrusions into

computer systems include unauthorized users penetrating the computer systems and

authorized users abusing their privileges. Intrusion into computer systems is the most

epidemic type of fraud since it is easy to commit. Furthermore, it is very difficult to trace

the intruders because they may hide in any corner of the world so long as they have the

Internet connection.

In recent years, computer security has become increasingly important and an international

priority. Intrusion detection techniques are largely categorized into two types such as

anomaly detection and misuse detection.

Anomaly detection: In this technique, the task is focused on extracting normal (non-

fraudulent) usage patterns and finding out deviation from them.

Misuse detection: In this technique, the patterns of previous intrusions and the

vulnerable spots of a system are captured based on the historical data. Then, an intrusion

trail is compared with these identified previous patterns.


99

3.6 FINANCIAL CRIMES

This would include cheating, credit card frauds, money laundering etc.

3.6.1 TYPES OF FINANCIAL CRIME

3.6.1.1 Credit-Card Fraud

Credit-card fraud detection is especially challenging because the analyst needs to identify

both the physical theft of a card, as well as an individual's identity; this means stolen

cards, as well as cloned and personal identification number (PIN) thefts. This type of

fraud can also be the result of the theft of an individual's identification, such as his or her

home address, for the creation of new accounts under false or stolen identities.

Credit-card theft will defraud the credit-card issuer or merchant. It has a profile of many

small amounts, and an out-of-character purchasing pattern. The fraud activity is time-

constrained. The card will be reported as stolen at some point and identity theft will be

detected, at least by the next statement date. This time constraint forces perpetrators to

use the card rapidly and for amounts normally out of pattern—this is the signature of this

financial crime and a method to its detection. It is a crime where, inevitably, some loss

will occur before detection. This crime is both highly organized and opportunistic.

3.6.1.2 Card-Not-Present Fraud

Internet and phone-order transactions are the classic card-not-present (CNP) sales. They

are also time-sensitive crimes, where the thieves are racing to beat the credit-card

monthly statement mailing date.

Internet credit-card thieves do leave characteristic footprints. For example, many

businesses see fraud rates increase at certain times of the day, and orders coming in from

certain countries exhibit a higher percentage of fraud. Thieves also gravitate to certain

types of products, such as electronics, which are easy to sell via Web auction sites. Other


100

clues to these perpetrators are the use of Web-based e-mail addresses and different

shipping and billing addresses.

3.6.1.3 Loan Default

This type of financial crime involves the manipulation and inflation of an individual

credit rating prior to performing a "sting," leading to a loan default and a loss for the

financial service provider.

This financial crime relies on creating a false identity and takes time to develop. Once an

account has been created with a stolen or false identity, the marketing initiatives

employed by the bank or credit-card issuer assist the perpetrator in building a portfolio of

credit-cards, loan accounts, and a viable credit-rating and history—before defaulting on

them.

3.6.1.4 Bank Fraud

This financial crime involves the creation of fictitious bank accounts for the conduit of

money and the siphoning of other legitimate accounts. It may also be for fictitious

account purchases, particularly in association with investment accounts, bond and bearer

bond transactions.

Many of the methods of executing internal fraud are similar to money laundering, except

there is an obvious attempt to defraud the bank, whereas in money laundering the

objective is simply to hide the funds. In addition, this fraud often works in conjunction

with the establishment of creditworthy accounts, lines of credit, and fictitious accounts.

The sting is often a single or small number of large-volume transactions, often related to

real estate purchases, business investments, and the like.

3.6.1.5. Money Laundering

Money generated in large volume by illegal activities must be "laundered," or made to

look legitimate, before it can be freely spent or invested; otherwise, it may be seized by

law enforcement and forfeited to the government. Transferring funds by electronic


101

messages between banks—"wire transfer"—is one way to swiftly move illegal profits

beyond the easy reach of law enforcement agents and at the same time begin to launder

the funds by confusing the audit trail.

To launder money is to disguise the origin or ownership of illegally gained funds to make

them appear legitimate. Hiding legitimately acquired money to avoid taxation, or moving

money for the financing of terrorist attacks also qualify as money laundering activities.

Law enforcement officials describe three basic steps to money laundering.

1. Placement: introducing cash into the banking system or into legitimate commerce

2. Layering: separating the money from its criminal origins by passing it through

several financial transactions, such as transferring it into and then out of several

bank accounts, or exchanging it for travelers' checks or a cashier's check

3. Integration: aggregating the funds with legitimately obtained money or providing

a plausible explanation for its ownership

Wire transfers of illicit funds are yet another key vehicle for moving and laundering

money through the vast electronic funds transfer systems. Using data mining technologies

and techniques for the identification of these illicit transfers could reveal previously

unsuspected criminal operations or make investigations and prosecutions more effective

by providing evidence of the flow of illegal profits.

There are many ways to launder money. Any system that attempts to identify money

laundering will need to evaluate wire transfers against multiple profiles. In addition,

money launderers are believed to change their MOs frequently. If one method is

discovered and used to arrest and convict a ring of criminals, activity will switch to

alternative methods. Law enforcement and intelligence community experts stress that

criminal organizations engaged in money laundering are highly adaptable and flexible.

For example, they may use non bank financial institutions, such as exchange houses and

check cashing services and instruments like postal money orders, cashier's checks, and

certificates of deposit. In this way, money launderers resemble individuals who engage in

ordinary fraud: They are adaptive and devise complex strategies to avoid detection. They


102

often assume their transactions are being monitored and design their schemes so that each

transaction fits a profile of legitimate activity.

As with other criminal detection applications the major obstacle to using data mining

techniques is the absence of data uniformity. Related issues, such as the absence of

experts, high costs, and privacy concerns, are being reevaluated in light of the recent

terrorist attacks. The post-9/11 environment is changing the priorities of years ago. One

of the biggest obstacles to using data mining to detect the use of wire transfers for illegal

money laundering was the poor quality of the data; ineffective standards did not ensure

that all the data fields in the reporting forms were complete and validated.

3.6.1.6 Insurance Crimes

Insurance fraud and health care-related crimes are widespread and very costly to carriers,

the government, and the consumer public. Insurance fraud involves intentional deception

or misrepresentation intended to result in an unauthorized benefit. An example would be

billing for health care services that have not been rendered. Health care crime involves

charging for services that are not medically necessary, do not conform to professionally

recognized standards, or are unfairly priced. An example would be performing a

laboratory test on a large numbers of patients when only a few should have it. Health care

crime may be similar to insurance fraud, except that it is not possible to establish that the

abusive acts were done with intent to deceive the insurer.

3.6.1.7 False Claims

False-claim schemes are the most common type of health-insurance fraud. The goal in

these schemes is to obtain undeserved payment for a claim or series of claims.

This includes billing for services, procedures, or supplies that were not provided or used,

as well as misrepresentation of what was provided, when it was provided, the condition

or diagnosis, the charges involved, or the identity of the provider recipient. This may also

involve providing unnecessary services or ordering unnecessary tests.


103

3.6.1.8 Illegal Billing

Illegal billing schemes involve charging a carrier for a service that was not performed.

This includes unbundling of claims—that is, billing separately for procedures that

normally are covered by a single fee. A variation is double billing, charging more than

once for the same service, also known as upcoding, the scam of charging for a more

complex service than was performed. This may also involve kickbacks in which a person

receives payment or other benefits for making referrals.

3.6.1.9 Excessive or Inappropriate Testing

Billing for inappropriate tests—both standard and nonstandard—appears to be much

more common among chiropractors and joint chiropractic/medical practices than among

other health care providers.

The most commonly abused tests include:

• Computerized inclinometry: Inclinometry is a procedure that measures joint

flexibility.

• Nerve conduction studies: Personal injury mills often use these inappropriately to

follow the progress of their patients.

• Surface electromyography: this measures the electrical activity of muscles, which

can be useful for analyzing certain types of performance in the workplace.

However, some chiropractors claim that the test enables them to screen patients

for "subluxations." This usage is invalid.

• Thermography: Chiropractors who use thermography typically claim that it can

detect nerve impingements, or "nerve irritation" and is useful for monitoring the

effect of chiropractic adjustments on sub-luxations. These uses are not medically

appropriate.

• Ultrasound screening: Ultrasonography is not appropriate for diagnosing muscle

spasm or inflammation or for following the progress of patients treated for back

pain.


104

• Unnecessary X rays: It is not appropriate for chiropractors to routinely X-ray

every patient to measure the progress of patients who undergo spinal

manipulation.

• Spinal videofluoroscopy: This procedure produces and records X-ray pictures of

the spinal joints that show the extent to which joint motion is restricted. For

practical purposes, however, a simple physical examination procedure, such as

asking the patient to bend, provides enough information to guide the patient's

treatment.

3.6.1.10 Personal Injury Mills

Many instances have been discovered in which corrupt attorneys and health care

providers, usually chiropractors or medical clinics, combine to bill insurance companies

for nonexistent or minor injuries. The typical scam includes "cappers" or "runners," who

are paid to recruit legitimate or fake auto-accident victims or worker's compensation

claimants. Victims are commonly told they need multiple visits.

Mills fabricate diagnoses and reports, providing expensive, but unnecessary, services.

The lawyers then initiate negotiations on settlements based upon these fraudulent or

exaggerated medical claims.

3.6.1.11 Miscoding

In processing claims, insurance companies rely mainly on diagnostic and procedural

codes recorded on the claim forms. Their computers are programmed to detect services

that are not covered. Most insurance policies exclude non-standard or experimental

methods. To help boost their income, many non-standard practitioners misrepresent what

they do and may misrepresent their diagnoses.

Brief or intermediate-length visits may be coded as lengthy or comprehensive visits.

Patients receiving chelation therapy may be falsely diagnosed as suffering from lead

poisoning and may be billed for infusion therapy or simply an office visit. The

administration of quack cancer remedies may be billed as chemotherapy. Live-cell


105

analysis may be billed as one or more tests for vitamin deficiency. Nonstandard allergy

tests may be coded as standard ones.

3.7 WAYS OF ONLINE BANKING FRAUD

Scams such as phising, spyware and malware are responsible for online banking fraud.

3.7.1 Phising

Phising is the name given to the practice of sending emails at random, purporting to come

from a genuine company operating on the internet, in an attempt to trick customers of that

company into disclosing information at a bogus website operated by fraudsters. These

emails usually claim that it is necessary to ‘update’ or ‘verify’ your password, and they

urge to click on a link from the email that takes us to the bogus website. Any information

entered on the bogus website will be captured by the criminals for their own fraudulent

purposes.

Phising originated because the banks’ own systems have proved incredibly difficult to

attack. Criminals have turned their attention to phising attacks, individual internet users

in order to gain personal or secret information that can be used online for fraudulent

purposes.

3.7.2 Malware

Although the rising number of phising incidents has undoubtedly helped to raise fraud

losses, we also know that online banking customers are increasingly being targeted by

malware attacks. Malware (malicious software) includes computer viruses that can be

installed on a computer without the user’s knowledge, typically by users clicking on a

link in an unsolicited email, or by downloading suspicious software. Malware is capable

of logging keystrokes thereby capturing passwords and other financial information.


106

3.7.3 Spyware

Spyware is a type of computer virus that can be installed on computer without user

realizing. Spyware is sometimes capable of acting as a ‘keystroke logger’, capturing all

of the keystrokes entered into a computer keyboard. Typically the fraudsters will send out

emails at random, to get people to click on a link from the email and visit a malicious

website, where vulnerabilities on the customer’s computer are exploited to install the

spyware. The emails are not normally related to internet banking, and try to dupe people

into visiting, or clicking on the link to, the malicious website using a variety of excuses.

3.8 2008 INTERNET CRIME REPORT

The Internet Crime Complaint Centre (IC3) was established with a mission to serve as a

vehicle to receive, develop, and refer criminal complaints regarding the rapidly

expanding arena of cyber crime. IC3 accepts online Internet crime complaints from either

the person who believes they were defrauded or from a third party to the complainant.

3.8.1 Complain Characteristics

During 2008, non-delivery of merchandise and/or payment was by far the most reported

offense, comprising 32.9% of referred crime complaints. This represents a 32.1%

increase from the 2007 levels of non-delivery of merchandise and/or payment reported to

IC3. In addition, during 2008, auction fraud represented 25.5% of complaints (down

28.6% from 2007), and credit and debit card fraud made up an additional 9.0% of

complaints. Confidence fraud such as Ponzi schemes, computer fraud, and check fraud

complaints represented 19.5% of all referred complaints. Other complaint categories such

as Nigerian letter fraud, identity theft, financial institutions fraud, and threat complaints

together represented less than 9.7% of all complaints (See Figure 3.2).


107

2008 T op 10 I C3 Compl ai nt Cat egor i es

0 5 10 15 20 25 30 35

Non-del i ver y

Auct i on Fr aud

Cr edi t / Debi t Car d Fr aud

Conf i dence Fr aud

Computer Fr aud

Check Fr aud

Ni ger i an Let ter Fr aud

Ident i t y T hef t

Fi nanci al I nst i t ut i ons Fr aud

T hr eat

Figure 3.2 IC3 Complaint Categories (in %)

Source : www.ic3.gov During 2008, non-delivered merchandise and/or payment were, by far, the most reported

offense, comprising 32.9% of referred complaints. Internet auction fraud accounted for

25.5% of referred complaints. Credit/debit card fraud made up 9.0% of referred

complaints. Confidence fraud, computer fraud, check fraud, and Nigerian letter fraud

round out the top seven categories of complaints referred to law enforcement during the

year.

A key area of interest regarding Internet fraud is the average monetary loss incurred by

complainants contacting IC3 (See Figure 3.3). Such information is valuable because it

provides a foundation for estimating average Internet fraud losses in the general

population. To present information on average losses, two forms of averages are offered:

the mean and the median. The mean represents a form of averaging that is familiar to the

general public: the total dollar amount divided by the total number of complaints.

Because the mean can be sensitive to a small number of extremely high or extremely low

loss complaints, the median is also provided. The median represents the 50th percentile,

or midpoint, of all loss amounts for all referred complaints. The median is less

susceptible to extreme cases, whether high or low cost.

http://www.ic3.gov/


108

Of the 72,940 fraudulent referrals processed by IC3 during 2008, 63,382 involved a

victim who reported a monetary loss. Other complainants who did not file a loss may

have reported the incident prior to victimization (e.g., received a fraudulent business

investment offer online or in the mail), or may have already recovered money from the

incident prior to filing (e.g., zero liability in the case of credit/debit card fraud).

The total dollar loss from all referred cases of fraud in 2008 was $264.6 million. That loss

was greater than 2007 which reported a total loss of $239.1 million. Of those complaints

with a reported monetary loss, the mean dollar loss was $4,174.50 and the median was

$931.00. Nearly fifteen percent (14.8%) of these complaints involved losses of less than

$100.00, and (36.5%) reported a loss between $100.00 and $1,000.00. In other words,

over half of these cases involved a monetary loss of less than $1,000.00. Nearly a third

(33.7%) of the complainants reported

Figure 3.3 Percentage of Referrals by Monetary Loss

Source : www.ic3.gov A key area of interest regarding Internet fraud is the average monetary loss incurred by

complainants contacting IC3. Of the 72,940 fraudulent referrals processed by IC3 during

2008, 63,382 involved a victim who reported a monetary loss. The total dollar loss from

all referred cases of fraud in 2008 was $264.6 million.

Percentage of Referalls by Monetary Loss

15%

36%34%

8%

7%

0%

$.01 to $99.99

$100 to $999.99

$1,000 to $4,999.99

$5,000 to $9,999.99

$10,000 to $99,999.99

$100,000.00 and over

http://www.ic3.gov/


109

Table 3.1 Average (Median) Loss per Typical Complaint Demographics

Amount Lost per Referred Complaint by Selected Complainant Demographics

Average (Median) Loss Per Typical Complaint

Male $993.76

Female $860.98

Under 20 $500.00

20-29 $873.58

30-39 $900.00

40-49 $1,010.23

50-59 $1,000.00

60 and older $1,000.00

3.8.2 Case Studies of APACS (UK Payment Association and UK Card Association) 3.8.2.1 Plastic card fraud losses on UK-issued cards 1998-2008

135188.4

317

411.5 424.6 420.4

504.8

439.4 427

535.2

609.9

0

100

200

300

400

500

600

700

1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008

( X-axis : Year, Y-axis : £ millions)

Figure 3.4 Plastic Card Fraud Losses on UK-issued Cards 1998-2008

Source : www.cardwatch.org.uk

http://www.cardwatch.org.uk/


110

Table 3.2 Losses Based on Fraud Category Wise Fraud Type 98 99 00 01 02 03 04 05 06 07 08 Card Not Present

13.6 29.3 72.9 95.7 110.1 122.1 150.8 183.2 212.7 290.5 328.4

Counterfiet 26.8 50.3 107.1 160.4 148.5 110.6 129.7 96.8 98.6 144.3 169.8Lost/Stolen 65.8 79.7 101.9 114.0 108.3 112.4 114.5 89 68.5 56.2 54.1 Card IDTheft 16.8 14.4 17.4 14.6 20.6 30.2 36.9 30.5 31.9 34.1 47.4 Mail non-receipt

12.0 14.6 17.7 26.8 37.1 45.1 72.9 40.0 15.4 10.2 10.2

Total 135 188.4 317.0 411.5 424.6 420.4 504.8 439.4 427.0 535.2 609.9

APACS has been the forum for the co-operative activity of banks, building societies and

card issuers on payments and payment systems since the mid-80s in U.K. Figure 3.4

shows the total losses in £ millions from year 1998 to 2008 alone in U.K. due to credit

and debit card. Table 3.2 shows how this plastic card fraud occurs category wise like

card-not-present, counterfeit, lost/stolen, card id theft and mail non-receipt.

3.8.2.2 Card fraud losses split by type (as percentage of total losses)

49%

9%

20%

10%

12%

Lost/Stolen

Mail non-receipt

Counterfeit

Card-not-present

Card ID theft

1998

Figure 3.5 Percentage of Different Plastic Card Fraud Category in Year 1998 Source : www.cardwatch.org.uk



111

8%2%

28%

53%

9%

Lost/Stolen

Mail non-receipt

Counterfeit

Card-not-present

Card ID theft

2008

Figure 3.6 Percentage of Different Plastic Card Fraud Category in Year 2008 Source : www.cardwatch.org.uk 3.8.2.3 Internet / e-commerce fraud losses on UK – issued cards 2000 to 2008

3.815

28

45

117 117.1

154.5

178.3 181.7

0

20

40

60

80

100

120

140

160

180

200

2000 2001 2002 2003 2004 2005 2006 2007 2008

(X-axis: Year, Y-axis: £ millions)

Figure 3.7 Internet / E-Commerce Fraud Losses on UK – issued Cards Source : www.cardwatch.org.uk Figure 3.7 shows financial cyber crime in £ millions from year 2000 to 2008 in U.K.




112

3.9 ONLINE FRAUD REPORT, CYBERSOURCE 2010

According to the Cybersource, 11th Annual Online Fraud Report, which is based on

U.S.A. and Canadian online merchants, from 2006 to 2008 the percent of online revenues

lost to payment fraud was stable. However, total dollar losses from online payment fraud

in the U.S. and Canada steadily increased during this period as eCommerce continued to

grow.

The percent of accepted orders which are later determined to be fraudulent also fell in

2009. In 2009, merchants reported an overall average fraudulent order rate of 0.9%, down

from 1.1% in 2008 for their U.S. and Canadian orders. Over the past six years the average

percent of accepted orders which turn out to be fraudulent has varied from 1.0% to 1.3%.

2009 represents the first time this rate has dropped below the 1% threshold. Among

industry sectors, Consumer Electronics reported the highest fraudulent order rate,

averaging 1.5%, but this was down from 2.0% in 2008.

Since 2007, the percent of orders rejected due to suspicion of fraud has fallen from 4.2%

to 2.4% in 2009, a decline of more than 40% in order rejection, representing a 1.8%

increase in total orders accepted.


113

Figure 3.8 Revenue Lost to Online Fraud (in %) Source : Online Fraud Report 11th Annual Edition 2010 3.10 REFERENCES

[1] Jesus Mena : Investigative Data Mining for Security and Criminal Detection

[2] Website – www.ic3.gov

[3] Website – www.cardwatch.org.uk

[4] Website – www.cybersource.com

[5] Website - www.issaireland.org/cybercrime

[6] Website – www.en.wikipedia.org

[7] Website – www.fas.org

[8] Website – www.citizencentre.virtualpune.com

[9] Website – www.indiaforensic.com

[10] Website – www.itbusinessedge.com

http://www.ic3.gov/


http://www.cybersource.com/

http://www.en.wikipedia.org/

http://www.fas.org/

http://www.citizencentre.virtualpune.com/

http://www.indiaforensic.com/

http://www.itbusinessedge.com/


114

[11] Website – www.cybercrime.gov

[12] Website – www.cybercrime.planetindia.net

[13] Website – www.bespacific.com

[14] Website – www.dfait-maeci.gc.ca

[15] Website – www.isiconference.org

http://www.cybercrime.gov/

http://www.cybercrime.planetindia.net/

http://www.bespacific.com/

http://www.dfait-maeci.gc.ca/

http://www.isiconference.org/

115

CHAPTER 4

ROLE OF DATA MINING IN FINANCIAL CRIME DETECTION

4.1. TWO STAGE SOLUTION FOR FINANCIAL CRIME DETECTION

4.2. TYPES OF FINANCIAL CRIME

4.3. CONCLUSION

4.4 REFERENCES

Today Industry is facing huge losses due to these types of financial crimes, so it would

be able to find financial crime through data mining techniques and remove it then it can

be great benefit to the industry.

In this chapter we have suggested a two-tier architecture model for financial crime

detection. In the first stage the financial transaction is verified against the rule-based

system and is given risk score by the system. These rules contain the human insight and

then this transaction is passed to second stage of data mining technique, which will

learn from the past experience of fraudulent transactions and then decide about the

current transaction. So the accuracy of prediction increased as the financial transaction

has to pass through two stages, one of rule based system and second of data mining

technique based system.

4.1. TWO STAGE SOLUTION FOR FINANCIAL CRIME DETECTION

Here we have given a figure 4.1 of architecture of 2-stage solution for financial crime.

In the first stage, rule based system contains the static rules which is generally based on

human knowledge i.e. human insight. If the financial transaction passes through this

phase then it passes to the second phase.

Chapter 4: Role of Data Mining in Financial Crime Detection

116

In the second stage, data mining techniques generate dynamic rules based on past

fraudulent transactions. Here learning is totally dynamic so if the pattern of fraudulent

transaction changed then the model learns itself from transactions and generates

dynamic rules for prediction of financial crime.

Figure 4.1 Architecture of 2-Stage Solution

4.2. TYPES OF FINANCIAL CRIME

In this section we have suggested 2-stage solution in each type of financial crime.

4.2.1 CREDIT-CARD FRAUD

4.2.1.1 Rule Based System:

1. If number of transactions increased rapidly within short time then recommendation=

Fraud

2. If current transaction amount is very much greater than average transaction amount

and income range is medium then recommendation= Fraud

3. If purchase of same product of luxury category within short time then

recommendation = Fraud

Financial Transaction

Stage 2 Data Mining Based

Normal or Fraudulent Transaction

Stage 1 Rule Based System


117

4.2.1.2 Detection Technique: Sequencing of purchases will change; the merchant mix

will be out of character compared to previous consumer transactions. Frequency,

monetary, and recency (FMR) techniques can be examined and employed. Time-

sequence accumulated-risk scores may be used as an input to aggregated risk exposure.

A change in location may indicate a ring operation. There are a number of leads that

relate specifically to credit card and debit card fraud. They are common points-of-

purchase (CPP) detection, particularly with regard to new merchant agents. The main

method of detection is to look for outliers and changes in the normal patterns of usage.

A SOM neural network can be used to perform an autonomous clustering of patterns in

the data.

4.2.2 CARD-NOT-PRESENT FRAUD (ONLINE CREDIT CARD FRAUD)


1. If Billing Address is not same as Shipping Address then recommendation = Fraud

2. If transaction amount is greater than maximum specified limit then recommendation

= Fraud

3. If Duration between online transactions increased rapidly then recommendation =

Fraud

4.2.2.2 Detection Technique: Indicators include looking for repeated attempts with

slight variations of card numbers or the use of different names and addresses. Another

possible indication of trouble is an IP address at variance with other data. If

demographics are available, a model may be developed. The absence of certain data,

such as activity in a credit report, is also signals of possible identify theft and fraud.

4.2.3 LOAN DEFAULT

This type of financial crime involves the manipulation and inflation of an individual

credit rating prior to performing a "sting," leading to a loan default and a loss for the

financial service provider.


118

This financial crime is done by creating a false identity and it takes time to develop.

Once an account has been created with a stolen or false identity, the marketing

initiatives employed by the bank or credit-card issuer assist the perpetrator in building a

portfolio of credit-cards, loan accounts, and a viable credit-rating and history—before

defaulting on them.


A Rule Based scoring system can be developed for preventing loan default on various

parameters like age (i.e. age is less then more points given or more age then less),

educational qualification (for higher studies or degrees more points otherwise less), No

of Assets owned by borrower at home (for more assets more points otherwise less),

borrower’s income, margin etc.

4.2.3.2 Detection Technique: There are many lead indicators available. There is often

only one "pot" of money that is cycled through the various accounts—a pattern of cash

withdrawals from credit cards, and then at the end of the credit cycle, a similar amount

repaid, usually using a cash withdrawal from another credit card. Lead indicators

include credit cards that are rarely used to make actual merchant purchases and have

small outstanding credit balances. Another pattern to look for is a loan account that is

left unused. These techniques inflate a centrally controlled credit rating, providing a

false impression that the account is deemed responsible. Detection has to occur before

the "sting," which is a use of the credit and loan accounts very rapidly within a credit

cycle. This financial crime can result in high losses. Detection must occur before the

loss, because the sting has a short execution time.

4.2.4 BANK FRAUD

4.2.4.1 Detection Technique: The method of detection relies on out-of-pattern

transactions or anomalous account use. As with other financial crimes, detection must

occur before any loss is sustained. There are lead indicators like the "manipulation of

credit" described above and in the lack of references, high associations of matching

attributes, and dubious acceptance criteria.


119

The critical factors for detecting all of these financial fraud crimes is knowing the

behavior of credit, bank, and loan accounts and developing an understanding of the

categories of customers. Data mining can be used to spot outliers or account usages that

are normal and out of character. Sometimes the account seems "too good to be true,"

and it often is. The absence of telephone numbers or other contact information may

indicate a "ring." These rings enable fraudulent activities to be distanced from their

sources and add complexity to criminal detection. Another clue is the multiple use of

the same address or phone number for different accounts.

4.2.5 INSURANCE CRIMES


A Rule Based Scoring system can be developed for Insurance crime such as neck injury

can be given more risk points than leg injury, laboratory or x-ray report that is not

relevant or unnecessary to the disease (high risk score can be assigned according to the

irrelevancy of the disease), etc.

In the insurance industry, there are various methods by which carriers attempt to review

for fraud while processing policy claims. The following are some important data

attributes for detecting potential fraud claims:

• Duration of illness

• Net amount cost

• Illness (disease)

• Claimant sex

• Claimant age

• Claim cost

• Hospital

Using these variables, analyses can be performed to identify outliers for each, such as

test costs, hospital charges, illness duration, and doctor charges. These are some

temporal parameters for analyzing insurance claims.


120

4.2.5.2 FALSE CLAIMS

4.2.5.2.1 Detection Technique: Depending on the insurance carrier, we can use various

methods in an attempt to identify false claims, including red-flag reviews by fraud

specialists, both on-line and behind the scenes. A carrier may also use an expert system,

which is a rule-based program that codifies the rules of a human reviewer. Link analysis

may be used to look for a ring of fraudulent providers, and, of course, data mining tools,

such as neural networks, may be used for training and detection if samples of fraud

cases exist. The net amount of the claim may be too large compared to the average

amount of similar claims.

4.2.5.3 ILLEGAL BILLING

4.2.5.3.1 Detection Technique: The methods are the same as with false claims. In

addition, a carrier may use models and rules developed insurance special coupled with

those from data mining analyses, such as decision trees or rule generators to detect these

schemes

4.2.5.4 PERSONAL INJURY MILLS

4.2.5.4.1 Detection Technique: Mill activity can be suspected when claims are

submitted for many unrelated individuals who receive similar treatment from a small

number of providers. These claims are typically manually reviewed by claim specialists;

however, link analysis and rule generators can also be used for screening large volumes

of claims.

4.2.5.5 MISCODING

4.2.5.5.1 Detection Technique: Any code that is not standard must be subject to review

and matched against prior claims from similar clinics or practitioners, typically

performed by red-flag claim specialists. Clustering of historical data can be used to

detect outliers automatically, and to check a disease (illness) against average duration

and cost using a historical claims database to generate a histogram.


121

4.3. CONCLUSION

Data Mining Techniques like Neural Networks, Decision trees, Link Analysis etc. can

become very helpful for financial crime detection. These techniques can be used with

rule-based system combinely so accuracy of prediction increased very much. The two-

tier architecture model is used very effectively for any financial transaction verification.

Any financial transaction has to pass through two level of verification, so prediction

gets closer to real prediction and also any genuine or normal transaction is not caught

by the model as fraudulent transaction so normal or genuine customer does not have to

suffer.

Here we suggested a two-stage solution for financial crime detection, which is actually

hybrid approach and contains both human insight and machine insight also. In these

types of crime hybrid approach proves more powerful than any single stage solution and

also accuracy of prediction is increased drastically.

In this type of model or system, we also need to take care of that any normal or genuine

transaction must not be caught by as fraudulent transaction and create overhead on

customer. If any customer suffered then we might lose him.

4.4. REFERENCES

[1] S. Haykin: Neural networks- a comprehensive foundation, MacMillan, New York

[2] Fawcet, T. and F. Provost- Adaptive Fraud detection, Data Mining and Knowledge

Discovery

[3] Cannady J. –The Application of Artificial Neural Networks to misuse Detection:

Initial results http://www.packetstormsecurity.org/.../Application-of-ANN-to-Misuse-

Detection.pdf

[4] Mannila H., Toivonen H. – “Discovering Generalized Episodes using Minimal

Occurences” http://www.citeseer.ist.psu.edu/mannila96discovering.html

[5] U.Fayyad, G. Piasky-Shapiro, P.Smyth, R.Uthrusamy (eds): Advances in

Knowledge Discovery and Data Mining, Menalo Park, AAA/MIT Press

http://www.packetstormsecurity.org/.../Application-of-ANN-to-Misuse-Detection.pdf

http://www.packetstormsecurity.org/.../Application-of-ANN-to-Misuse-Detection.pdf

http://www.citeseer.ist.psu.edu/mannila96discovering.html


122

[6] Michael J.A. Berry, Gordon S. Linoff – Data Mining Techniques

[7] Aleskeorv, E., Freisleben, B. Rao “CARDWATCH: a neural network based database

mining system for credit card fraud detection. Computational Intelligence for Financial

Engineering”, Proceeding of the IEEE/IAFE, 220-226

[8] R. Srikant, R. agrawal : “Mining generalized association rules” . Proc. VLDB

Conference , Zurich, Switzerland

[9] Jesus Mena : Investigative Data Mining for Security and Criminal Detection

[10] Dharwa J. N., Parikh S. M., “Data Mining in Financial Crime Detection”

Proceedings of the National Conference on ECTKM-08, 23rd November 2008 at AITS,

Rajkot

[11] Website – www.en.wikipedia.org

[12] Website – www.fas.org

[13]Website-www.citizencentre.virtualpune.com

[14] Website – www.indiaforensic.com

[15] Website – www.itbusinessedge.com

http://www.en.wikipedia.org/

http://www.fas.org/

http://www.citizencentre.virtualpune.com/

http://www.indiaforensic.com/

http://www.itbusinessedge.com/

123

CHAPTER 5

DATA WAREHOUSE IMPLEMENTATION

5.1 DATA WAREOUSE ARCHITECTURE

5.2 FACT TABLE

5.3 DIMENSIONAL TABLES

5.4 LOOKUP TABLES

5.5 DATA COLLECTION

5.6 SAMPLE DATA

5.7 CREDIT CARD NUMBER GENERATION

5.8 REFERENCES

When transactional data is no longer of value to the operational environment, it is

removed from the database. If a business is without a decision support facility, the data is

achieved and eventually destroyed. However, if there is a decision support environment,

the data is transported to some type of interactive medium commonly referred to as a data

warehouse.

We can define the data warehouse as a historical database designed for decision support.

A more precise definition is given by W.H. Inmon(1996). Specifically,

“A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile

collection of data in support of management’s decision making process”.

5.1 DATA WAREHOUSE ARCHITECTURE

The main challenge of Data Warehouse architecture is to enable business to access

historical, summarized data with a read-only access of the end-users. Again, from a

technical standpoint the most SQL queries would start with a SELECT statement.

Chapter 5: Data Warehouse Implementation

124

In Data Warehouse environments, the relational model can be transformed into the

following architectures:

• Star schema

• Snowflake schema

• Constellation schema

5.1.1 Star schema architecture

Star schema architecture is the simplest data warehouse design. The main feature of a star

schema is a table at the center, called the fact table and the dimension tables which allow

browsing of specific categories, summarizing, drill-downs and specifying criteria.

Typically, most of the fact tables in a star schema are in database third normal form,

while dimensional tables are de-normalized (second normal form). Despite the fact that

the star schema is the simplest data warehouse architecture, it is most commonly used in

the data warehouse implementations across the world today (about 90-95% cases).

5.1.2 Snowflake schema architecture

The snowflake schema is a variation of the star schema model, where some dimension

tables are normalized, thereby further splitting the data into additional tables. The

resulting schema graph forms a shape similar to a snowflake.

The major difference between the snowflake and star schema models is that the

dimension tables of the snowflake model may be kept in normalized form to reduce

redundancies. Such table is easy to maintain and saves storage space because a large

dimension table can become enormous when the dimensional structure is included as

columns. However, this saving of space is negligible in comparison to the typical

magnitude of the fact table.


125

5.1.3 Fact constellation architecture

Sophisticated applications may require multiple fact tables to share dimensional tables.

This kind of schema can be viewed as a collection of stars, and hence is called a galaxy

schema or a fact constellation.

Here we have designed the data warehouse using snowflake schema architecture. Data

warehouse design layout is given in Figure 5.1 and 5.2.

5.2 FACT TABLE

5.2.1 Table Transaction

Description: This table contains information regarding the transaction performed by card

holder online. Whenever user orders any product through internet, then his transaction

details mentioned below are stored in this table. This is the fact table, so instead of

storing direct values, it stores the links from tables like product_master,

product_category_master, customer_master, creditcard_master, shipping_master,

location_master and seller_master.

Primary key: Serial_id

Foreign Key: Product_id references Product_Master

Product_cat_id references Product_Cat_Id

Customer_id references Customer_Master

Creditcard_id references Creditcard_Master

Seller_id references Seller_Master

Shipping_id references Shipping_Master

Location_id references Location_Master


126

Table 5.1 Transaction

Column Name Data Type Description

SERIAL_ID NUMBER(8) Unique ID for each product ordered

TRANSACTION_DATE DATE Date & time on which transaction

ordered through internet

AMOUNT NUMBER(12,2) Amount of purchase

PRODUCT_ID NUMBER(8) Foreign Key

PRODUCT_CAT_ID NUMBER(8) Foreign Key

CUSTOMER_ID NUMBER(8) Foreign Key

CREDITCARD_ID NUMBER(8) Foreign Key

SELLER_ID NUMBER(8) Foreign Key

SHIPPING_ID NUMBER(8) Foreign Key

TRANSACTION_DAY_T

YPE NUMBER(1) 1: Holiday, 0: Working day

LOCATION_ID NUMBER(8) Foreign Key


127

5.3 DIMENSIONAL TABLES

5.3.1 Table Customer_Master

Description: This table contains personal information mentioned below of the customer

who is currently performing the online transaction. This information is required by the

web site through which the customer wants to perform the transaction.

Primary key: Customer_id

Table 5.2 Customer_Master


CUSTOMER_ID NUMBER(8) Unique ID of customer

FIRST_NAME VARCHAR2(20) First Name of customer

MIDDLE_NAME VARCHAR2(20) Middle Name of customer

LAST_NAME VARCHAR2(20) Last Name of customer

GENDER VARCHAR2(6) Gender of customer

AGE NUMBER(3) Age of customer

ANNUAL_INCOME NUMBER(12,2) Annual Income of customer


128

5.3.2 Table Creditcard_Master

Description: It contains credit card details of the credit card holder. Whenever user wants

to purchase any thing through the credit card then he must give credit card number,

expiry date and customer verification value (CVV) number, then he is able to perform the

transaction.

Primary key: Creditcard_id

Table 5.3 Creditcard_Master


CREDITCARD_ID NUMBER(8) Unique ID of Credit Card

ACCOUNT_NO VARCHAR2(20) Account no of Credit Card

CREDITCARD_NO VARCHAR2(20) Credit Card No

CARD_TYPE VARCHAR2(15) Card Type e.g. Gold, Silver etc.

EXPIRY_DATE DATE Expiry Date of Credit Card

CVV_NO NUMBER(4) Secret Pin No of Credit Card

5.3.3 Table: Seller_Master

Description: It contains the seller or vendor name, with which customer is performing

the transaction.

Primary key: Seller_id

Table 5.4 Seller_Master


SELLER_ID NUMBER(8) Unique ID of Seller/Vendor

SELLER_NAME VARCHAR2(40) Name of Seller/Vendor


129

5.3.4 Table Address_Master

Description: It contains billing address or residential address of the credit card holder.

During the online transaction, this address is verified against the shipping address entered

by the buyer to decide the sensitivity of the transaction.

Primary key: Address_id

Foreign key: Cityid references City_Master

Stateid references State_Master

Countryid reference Country_Master

Table 5.5 Address_Master


ADDRESS_ID NUMBER(8) Unique ID of Billing Address

ADDRESS1 VARCHAR2(40) Address Details 1



CITYID NUMBER(8) Foreign key

PINCODE NUMBER(6) Pin Code No

STATEID NUMBER(8) Foreign key

COUNTRYID NUMBER(8) Foreign key


130

5.3.5 Table: Product_Master

Description: Product information is stored in this table.

Primary key: Product_id

Table 5.6 Product_Master


PRODUCT_ID NUMBER(8) Unique ID of Product

PRODUCTNAME VARCHAR2(40) Name of Product

5.3.6 Table Product_Category_Master

Description: Product category information is stored in this table. This table is useful to

study the customer purchase behavior in different categories, so the incoming transaction

is predicted according to this behavior.

Primary key: Product_cat_id

Table 5.7 Product_Category_Master


PRODUCT_CAT_ID NUMBER(8) Unique ID of Product Category

PRODUCT_CATEGORY VARCHAR2(40) Name of Product Category


131

5.3.7 Table: Shipping_Master

Description: This is the address entered by the customer during the online transaction,

where the customer wants his product to be shipped. This address may be different from

billing address.

Primary key: Shipping_id




Table 5.8 Shipping_Master


SHIPPING_ID NUMBER(8) Unique ID of Shipping Address

SHIPPING_ADDRESS1 VARCHAR2(40) Shipping Address Details 1




PINCODE NUMBER(6) Pin Code Number where item to be

shipped




132

5.3.8 Table Location_master

Description: It contains the address details, where the customer requests to purchase the

product through the internet. There are several free tools are available for capturing this

kind of information. The system matches the city where the transaction is performed with

the billing address’s city to consider the time zone if both is found from different country.

Primary key: Location_id




Table 5.9 Location_master


LOCATION_ID NUMBER(8) Unique ID of location where the item is ordered

through the net





PINCODE NUMBER(6) Pin Code No




133

5.3.9 Table City_master

Description: This table contains city related information along with time zone. Whenever

any online transaction is performed different or outside of the customer’s country, then

the system uses this table to convert time zone of one city to time zone of another city.

Primary key: Cityid

Table 5.10 City_master


CITYID NUMBER(8) Unique identification number of city

CITYNAME VARCHAR2(40) Name of city

TIME_ZONE VARCHAR2(10) Time zone of city

5.3.10 Table State_master

Description: It contains name of all the states.

Primary key: Stateid

Table 5.11 State_master


STATEID NUMBER(8) Unique identification number of state

STATENAME VARCHAR2(40) Name of state


134

5.3.11 Table Country_master

Description: It contains name of all the countries.

Primary key: Countryid

Table 5.12 Country_master


COUNTRYID NUMBER(8) Unique identification number of country

COUNTRYNAME VARCHAR2(40) Name of country

5.3.12 Table User_log_Master

Description: This table is used to store the user id and login date, time of the user.

System resets the value of the tables customer_daily_count, customer_weekly_count,

customer_fortnightly_count and customer_monthly_count by using this table only. E.g. If

logon_day contains the value of 1st January, 2010 then the next day 2nd January, 2010 the

value of daily_count and amount field of customer_daily_count table becomes zero. After

the completion of week, the value of weekly_count and amount field of

customer_weekly_count table becomes zero and accordingly for

customer_fortnightly_count and customer_monthly_count tables.

Primary key: User_id

Table 5.13 User_log_Master


USER_ID VARCHAR2(12) ID of the user

LOGON_DAY DATE Login Date & Time of the user


135

5.3.13 Table Cardholder_Master

Description: This table contains personal information mentioned below of the customer

who is the credit card holder.

Primary key: Cardholder_id

Foreign key: Address_id references Address_Master

Cardid references Creditcard_Master

Table 5.14 Cardholder_Master


CARDHOLDER_ID NUMBER(8) Unique ID of credit card holder

FIRST_NAME VARCHAR2(20) First Name of credit card holder

MIDDLE_NAME VARCHAR2(20) Middle Name of credit card holder

LAST_NAME VARCHAR2(20) Last Name of credit card holder

GENDER VARCHAR2(6) Gender of credit card holder

AGE NUMBER(3) Age of credit card holder

ANNUAL_INCOME NUMBER(12,2) Annual Income of credit card holder

ADDRESS_ID NUMBER(8) Foreign key

CARDID NUMBER(8) Foreign key


136

5.3.14 Table Fraud

Description: It is a generic fraud table maintained by the system. It stores the number of

fraud transactions performed within different given below time periods. The system

records time gap between each two transactions. If the transaction is found suspecious by

the system, then it uses the following table to calculate the posterior probability using

bayesian learning and decide about the sensitivity of the transaction.

Table 5.15 Fraud


EVENT1 NUMBER(4) Number of fraud transactions performed within 4

Hours


Hours

EVENT3 NUMBER(4) Number of fraud transactions performed within

16 Hours


24 Hours


days


15 days

EVENT7 NUMBER(4) Number of fraud transactions performed after 15

days

TOTAL NUMBER(5) Total number of fraud


137

5.3.15 Table Suspect

Description: Whenever any transaction is found suspicious by the system, then its details

are stored in the following table. It stores the different time periods since the last

transaction. If another transaction on the same card is found suspicious, then this table is

updated accordingly till either the next transaction is found genuine or the generated risk

score reaches the specified threshold.

Foreign key: Cardid references Creditcard_Master

Table 5.16 Suspect



TRANSACTION_DATE DATE Date on which transaction performed

DAYS NUMBER(4) Number of days since the last transaction

HOURS NUMBER(4) Number of hours since the last transaction

MINUTES NUMBER(4) Number of minutes since the last transaction

SECONDS NUMBER(4) Number of seconds since the last transaction

SUSPECT_COUNT NUMBER(3) Incremented to 1 if suspicious transaction

found


138

Apart from these there are additional tables maintained by the system, which holds the

current transaction details of the card holder basis on daily, weekly, fortnightly or

monthly time duration.

5.4 LOOKUP TABLES

5.4.1 Table Customer_DailyCount

Description: Whenever customer performs a transaction during day, this table is updated

automatically by the system. It stores the total number of transactions and total amount of

purchasing during the current day. (e.g. If customer first performs transaction of Rs.

4000, then transcount contains 1, amount contains 4000. If customer again performs the

second transaction of Rs.5000 in the same day, then transcount contains 2, amount

contains 9000).The next day value of number of transactions and amount of purchasing

becomes automatically zero by the system. So this table is used to observe the daily

behavior of customer. Then system match this data with the past daily customer behavior.


Table 5.17 Customer_DailyCount



TRANSCOUNT NUMBER(4) Total transactions performed daily

AMOUNT NUMBER(12,2) Total amount of products purchased daily


139

5.4.2 Table Customer_WeeklyCount

Description: This table is used to observe the behavior of current week of customer. All

the transactions performed in the current week are automatically reflected in this table.

This data is used to match past weekly customer behavior. The next week value of these

fields becomes zero.


Table 5.18 Customer_WeeklyCount



TRANSCOUNT NUMBER(4) Total transactions performed daily

AMOUNT NUMBER(12,2) Total amount of products purchased daily

5.4.3 Table Customer_FortnightlyCount

Description: This table stores the transaction details of the current fifteen days only. At

the end of fifteen days, value reset with zero by the system. Here also comparison is

made of current fifteen days behavior with the past fifteen days behavior to decide the

validity of the transaction.


Table 5.19 Customer_FortnightlyCount



TRANSCOUNT NUMBER(4) Total transactions performed fortnightly

AMOUNT NUMBER(12,2) Total amount of products purchased fortnightly


140

5.4.4 Table Customer_MonthlyCount

Description: It contains the total transaction details of the current month only. After the

completion of the month, the value of transcount and amount again starts to update

according to the transactions performed by the customer. This table is used to compare

the current monthly behavior of the customer with the past monthly behavior.


Table 5.20 Customer_MonthlyCount



TRANSCOUNT NUMBER(4) Total transactions performed monthly

AMOUNT NUMBER(12,2) Total amount of products purchased monthly

5.4.5 Table Customer_SundayCount

Description: It contains the total transaction details of today if today is Sunday. So

whenever user performs the transactions on Sunday, this table is updated automatically.

These current Sunday transaction details are verified against the past sundays behavior by

the system.


Table 5.21 Customer_SundayCount



TRANSCOUNT NUMBER(4) Total transactions performed on current Sunday

AMOUNT NUMBER(12,2)Total amount of products purchased on current

Sunday


141

5.4.6 Table Customer_HolidayCount

Description: It stores the transaction details of the whole current day if the customer

performs the transactions on holiday. Customer’s current holiday behavior is checked

with the past holidays behavior to predict about the transaction is genuine or not.


Table 5.22 Customer_HolidayCount



TRANSCOUNT NUMBER(4) Total number of transactions performed on

current holiday

AMOUNT NUMBER(12,2)Total amount of products purchased on current

holiday


142

Product_Master

PRODUCT_ID

PRODUCTNAME

Seller_Master

SELLER_ID

SELLER_NAME

Shipping_Master

SHIPPING_ID

SHIPPING_ADDRESS1

SHIPPING_ADDRESS2

SHIPPING_ADDRESS3

CITY

PINCODE

STATE

COUNTRY

Transaction

SERIAL_ID

TRANSACTION_ID

TRANSACTION_DATE

AMOUNT

PRODUCT_ID

PRODUCT_CAT_ID

CUSTOMER_ID

CREDITCARD_ID

SELLER_ID

SHIPPING_ID

LOCATION_ID

TRANSACTION_DAY_T

YPE Location_Master

LOCATION_ID

ADDRESS1

ADDRESS2

ADDRESS3

CITYID

PINCODE

STATEID

COUNTRYID

Product_Catego

ry_Master

PRODUCT_CAT

_ID

PRODUCT_CAT

EGORY

Customer_Master

CUSTOMER_ID

FIRST_NAME

MIDDLE_NAME

LAST_NAME

GENDER

AGE

ANNUAL_INCOME

Creditcard_Master

CREDITCARD_ID

ACCOUNT_NO

CARD_NO

CARD_TYPE

EXPIRY_DATE

CVV_NO

Figure 5.1 Data warehouse Design Layout-I


143

City_Master

CITYID

CITYNAME

TIME_ZONE

State_Master

STATEID

STATENAME

5.5 DATA COLLECTION

The data used in this work was gathered from an online shopping firm. Even though the

firm provided real credit card data for this research, it required that the firm name was

kept confidential. Though real credit card transactional data is obtained, real credit card

number, customer personal information is not given due to confidentiality and fraudulent

transactional records are not available.

We have also generated huge synthetic data based on the statistical data to test the model

speed on large scale data. We had used Gaussian distribution to generate this data. The

number of transactional records is more than 10, 00,000.

Location_Master

LOCATION_ID

ADDRESS1

ADDRESS2

ADDRESS3

CITYID

PINCODE

STATEID

COUNTRYID

Country_Master

COUNTRYID

COUNTRYNAME

Figure 5.2 Data warehouse Design Layout-II


144

Table 5.23 Statistical data of expenditure in category by income

Categor

y

<2000

0

20000

-

29999

30000

-

39999

40000

-

49999

50000

-

59999

60000

-

69999

70000

-

79999

80000

-

89999

>9000

0

1 .19 .18 .18 .17 .16 .15 .15 .14 .13

2 .36 .38 .37 .36 .34 .32 .32 .30 .31

3 .06 .05 .05 .05 .04 .04 .04 .04 .04

4 .16 .15 .16 .17 .19 .20 .20 .21 .18

5 .05 .08 .09 .09 .08 .07 .07 .06 .04

6 .04 .04 .04 .04 .05 .05 .05 .05 .06

7 .14 .12 .11 .12 .14 .17 .17 .20 .24

The data is generated using Gaussian distribution with the following mean and standard

deviation.

Table 5.24 Components of Gaussian distribution

<2000

0

20000

-

29999

30000

-

39999

40000

-

49999

50000

-

59999

60000

-

69999

70000

-

79999

80000

-

89999

>9000

0

Mean 500 1000 1500 2000 2500 3000 3500 4000 4500

Standard

Deviatio

n

100 225 300 500 600 800 1000 1000 1200


145

5.6 SAMPLE DATA

There is a huge data in the data warehouse from the year 2005 to 2009. Here is a sample

transaction of some customers of year 2005.

Table 5.25 Sample Data of Table Transaction

PID: Product Id PCID: Product Category Id

CID: Customer Id SID: Seller Id

SHID: Shipping Id LID: Location Id

CADID: Card Id

TRANSACTION_DATE AMOUNT PID PCID CID SID SHID LID CADID 1/03/2005 4:00:23 PM 1682 20105 2 37 246 37 189 37 1/03/2005 8:26:13 PM 1254 10425 1 312 597 312 201 312 1/03/2005 8:31:45 PM 1632 20125 2 890 554 890 405 890 1/1/2005 10:05:15 AM 3656 20210 2 68 139 68 345 68 1/1/2005 10:19:33 AM 2537 70145 7 68 344 68 345 68 1/1/2005 10:25:08 AM 3455 40146 4 1001 102 1001 345 1001 1/1/2005 2:21:12 PM 3295 40346 4 1220 305 1220 345 1220 1/1/2005 3:25:08 AM 1651 40150 4 38 467 38 194 38 1/1/2005 4:30:11 AM 1631 30451 3 45 807 45 196 45 1/1/2005 6:33:30 PM 2167 40425 4 8 295 8 41 8 1/1/2005 6:33:30 PM 1270 20223 2 1601 46 1601 194 1601 1/1/2005 7:25:08 AM 550 40103 4 74 408 74 375 74 1/1/2005 8:25:08 AM 1931 40456 4 14 119 14 73 14 1/1/2005 9:27:56 AM 1403 10003 1 125 899 125 194 125 1/1/2005 9:34:56 AM 2277 10378 1 8 81 8 41 8 1/1/2005 9:43:56 AM 2423 10993 1 44 731 44 224 44 1/1/2005 9:44:56 AM 1126 10053 1 46 144 46 233 46 1/1/2005 9:58:56 AM 2201 10603 1 48 948 48 242 48 1/1/2005 9:58:56 AM 2201 10555 1 498 948 498 242 498 1/10/2005 1:25:08 AM 3501 60034 6 444 817 444 421 444 1/10/2005 1:30:11 AM 5971 70016 7 84 305 84 421 84 1/10/2005 4:00:23 PM 1370 20005 2 35 863 35 179 35 1/10/2005 6:30:45 PM 3504 60016 6 11 66 11 56 11 1/10/2005 6:30:45 PM 2977 10034 1 465 110 465 79 465 1/10/2005 8:26:13 AM 1519 10450 1 485 383 485 179 485 1/10/2005 8:30:45 PM 570 50145 5 2 234 2 6 2 1/10/2005 8:31:45 PM 1245 20231 2 35 286 35 179 35 1/11/2005 1:30:11 AM 4599 70345 7 86 807 86 432 86 1/11/2005 11:45:45 AM 3445 70452 7 11 63 11 56 11 1/11/2005 2:25:08 AM 3392 60238 6 86 467 86 432 86 1/11/2005 6:30:45 PM 4117 10136 1 75 549 75 377 75 1/12/2005 1:30:11 AM 6810 70870 7 88 257 88 445 88 1/12/2005 3:25:08 AM 5152 60432 6 88 750 88 445 88


146

TRANSACTION_DATE AMOUNT PID PCID CID SID SHID LID CADID 1/12/2005 7:00:23 PM 1420 20125 2 455 69 455 1030 455 1/12/2005 8:15:13 AM 1764 10003 1 5 291 5 24 5 1/12/2005 8:31:45 PM 1356 20007 2 5 26 5 24 5 1/13/2005 1:30:11 AM 3041 70016 7 78 739 78 394 78 1/13/2005 2:25:08 AM 5668 10034 1 78 985 78 394 78 1/13/2005 4:25:08 AM 1990 60040 6 90 776 90 451 90 1/13/2005 9:11:04 AM 646 20034 2 2 11 2 6 2 1/15/2005 1:30:11 AM 4347 70016 7 80 546 80 405 80 1/15/2005 3:25:08 AM 5560 10034 1 440 870 440 405 440 1/16/2005 2:00:23 PM 1579 20005 2 491 220 491 206 491 1/16/2005 4:14:45 PM 1657 20007 2 41 910 41 206 41 1/16/2005 8:24:13 AM 1592 10103 1 491 15 491 206 491 1/17/2005 1:30:11 AM 1439 70450 7 442 605 442 414 442 1/17/2005 4:25:08 AM 1533 10250 1 1549 784 1549 414 1549 1/2/2005 1:12:30 PM 5357 20105 2 444 139 444 421 444 1/2/2005 1:30:11 AM 4286 60016 6 12 196 12 63 12 1/2/2005 1:33:30 PM 3015 20423 2 416 165 416 285 416 1/2/2005 10:04:56 AM 1292 10103 1 482 365 482 159 482 1/2/2005 10:30:56 AM 671 10459 1 34 320 34 170 34 1/2/2005 10:34:56 AM 948 10980 1 4 127 4 18 4 1/2/2005 11:07:33 AM 4760 40016 4 69 976 69 348 69 1/2/2005 11:25:23 AM 2584 40107 4 8 220 8 41 8 1/2/2005 2:12:30 PM 3789 20305 2 86 46 86 432 86 1/2/2005 2:33:30 PM 2234 20155 2 58 18 58 291 58 1/2/2005 3:12:30 PM 5242 20360 2 88 165 88 445 88 1/2/2005 3:33:30 PM 2151 20910 2 54 46 54 274 54 1/2/2005 4:00:30 PM 3804 20540 2 60 246 60 303 60 1/2/2005 4:00:56 AM 3688 70003 7 88 959 88 445 88 1/2/2005 4:01:30 PM 3375 20455 2 62 33 62 311 62 1/2/2005 4:02:30 PM 919 20145 2 66 139 66 331 66 1/2/2005 4:02:56 AM 652 71040 7 90 408 90 451 90 1/2/2005 4:03:30 PM 4060 20110 2 64 220 64 323 64 1/2/2005 4:06:30 PM 2253 20450 2 12 279 12 63 12 1/2/2005 4:11:56 AM 6008 70003 7 84 102 84 421 84 1/2/2005 4:12:30 PM 1261 20335 2 90 18 90 451 90 1/2/2005 4:33:30 PM 1526 20789 2 52 139 52 265 52 1/2/2005 4:52:56 AM 3186 70410 7 286 899 286 432 286 1/2/2005 4:59:56 AM 4737 70150 7 468 63 468 95 468 1/2/2005 5:52:12 PM 3380 50034 5 429 973 429 348 429 1/2/2005 6:01:30 PM 835 20124 2 482 897 482 159 482 1/2/2005 6:12:30 PM 1686 20650 2 542 62 542 95 542 1/2/2005 6:25:08 AM 3394 71204 7 67 286 67 337 67 1/2/2005 6:25:23 AM 1769 40123 4 48 824 48 242 48 1/2/2005 6:30:30 PM 1129 20341 2 34 451 34 170 34 1/2/2005 6:33:35 PM 2267 40134 4 499 94 499 248 499 1/2/2005 6:34:30 PM 700 20990 2 480 919 480 149 480 1/2/2005 7:25:23 AM 1772 40750 4 44 322 44 224 44 1/2/2005 7:47:56 AM 3221 10103 1 12 266 12 63 12 1/2/2005 8:04:56 AM 2392 10650 1 52 102 52 265 52 1/2/2005 8:21:56 AM 2570 10870 1 460 262 460 53 460 1/2/2005 8:25:08 AM 1878 20034 2 50 871 50 255 50 1/2/2005 8:25:23 AM 2670 40140 4 500 778 500 255 500


147

TRANSACTION_DATE AMOUNT PID PCID CID SID SHID LID CADID 1/2/2005 8:26:56 AM 2775 10450 1 54 899 54 274 54 1/2/2005 8:34:56 AM 5209 41005 4 71 885 71 360 71 1/2/2005 8:48:56 AM 2281 10560 1 418 408 418 291 418 1/2/2005 8:55:56 AM 1422 10780 1 56 959 56 285 56 1/2/2005 9:06:56 AM 922 10970 1 477 646 477 96 477 1/2/2005 9:15:16 AM 4487 40128 4 76 698 76 382 76 1/2/2005 9:17:56 AM 1228 10453 1 481 88 481 151 481 1/2/2005 9:19:56 AM 918 10870 1 29 96 29 141 29 1/2/2005 9:22:16 AM 5150 40148 4 440 874 440 405 440 1/2/2005 9:24:56 AM 1703 10126 1 45 156 45 229 45 1/2/2005 9:25:23 AM 2067 40346 4 496 959 496 233 496 1/2/2005 9:34:56 AM 1025 10143 1 3 22 3 11 3 1/2/2005 9:43:56 AM 1798 10650 1 47 587 47 240 47 1/2/2005 9:44:16 AM 863 40110 4 442 226 442 414 442 1/2/2005 9:44:56 AM 2052 10870 1 49 567 49 248 49 1/2/2005 9:54:56 AM 1136 10458 1 483 588 483 161 483 1/2/2005 9:55:16 AM 4255 40457 4 16 203 16 84 16 1/2/2005 9:55:56 AM 2124 10678 1 43 113 43 218 43 1/3/2005 1:25:23 AM 5595 20707 2 444 344 444 421 444 1/3/2005 1:30:11 AM 3084 60016 6 10 44 10 53 10 1/3/2005 1:30:11 AM 1098 30007 3 901 153 901 96 901 1/3/2005 10:25:08 AM 2732 70034 7 10 217 10 53 10 1/3/2005 10:25:23 AM 2083 20547 2 10 142 10 53 10 1/3/2005 2:11:12 PM 4108 40016 4 14 31 14 73 14 1/3/2005 2:25:23 AM 4210 40007 4 78 355 78 394 78 1/3/2005 2:31:12 PM 3399 40123 4 70 807 70 351 70 1/3/2005 2:41:12 PM 2843 40149 4 432 257 432 365 432 1/3/2005 2:51:12 PM 1132 40116 4 74 54 74 375 74 1/3/2005 3:25:23 AM 4086 40717 4 80 689 80 405 80 1/3/2005 4:25:08 AM 2757 70134 7 58 776 58 291 58 1/3/2005 4:25:23 AM 3213 20105 2 55 33 55 276 55 1/3/2005 5:25:23 AM 3465 40133 4 77 486 77 387 77 1/3/2005 6:25:23 AM 3824 20007 2 18 167 18 95 18 1/3/2005 7:01:33 AM 914 70007 7 74 467 74 375 74 1/3/2005 7:05:15 AM 909 20562 2 434 18 434 375 434 1/3/2005 7:25:23 AM 4057 40147 4 441 963 441 406 441 1/3/2005 8:00:33 AM 795 70117 7 72 356 72 365 72 1/3/2005 8:05:15 AM 4801 21220 2 14 14 14 73 14 1/3/2005 8:10:33 AM 3455 70119 7 714 133 714 73 714 1/3/2005 8:25:23 AM 2298 21062 2 411 863 411 260 411 1/3/2005 8:33:33 AM 2707 70221 7 70 358 70 351 70 1/3/2005 9:25:08 AM 1742 70034 7 54 467 54 274 54 1/3/2005 9:25:23 AM 2096 20112 2 12 91 12 63 12 1/4/2005 1:30:11 AM 1505 60016 6 8 261 8 41 8 1/4/2005 11:25:08 AM 2413 20034 2 880 5 880 41 880 1/4/2005 11:26:23 AM 984 30007 3 32 970 32 159 32 1/4/2005 11:34:36 AM 532 30003 3 2 130 2 6 2 1/4/2005 12:26:23 AM 1012 31200 3 4 54 4 18 4 1/4/2005 3:25:08 AM 1528 40034 4 36 817 36 183 36 1/4/2005 4:30:11 AM 1337 30016 3 36 305 36 183 36 1/4/2005 6:25:08 AM 1473 20034 2 48 357 48 242 48 1/4/2005 6:33:30 PM 1027 20205 2 36 139 36 879 36


148

TRANSACTION_DATE AMOUNT PID PCID CID SID SHID LID CADID 1/4/2005 7:25:08 AM 1447 20122 2 44 840 44 224 44 1/4/2005 9:09:56 AM 1650 10654 1 490 959 490 204 490 1/4/2005 9:25:08 AM 1390 21005 2 46 488 46 233 46 1/4/2005 9:39:56 AM 1567 10140 1 486 102 486 183 486 1/5/2005 1:25:08 AM 1467 40034 4 42 776 42 215 42 1/5/2005 1:30:11 AM 1023 30007 3 3 18 3 11 3 1/5/2005 1:35:11 AM 1688 41455 4 7 47 7 38 7 1/5/2005 11:25:08 AM 448 40077 4 1 34 1 1 1 1/5/2005 2:25:08 AM 2630 20007 2 83 286 83 418 83 1/5/2005 2:25:08 AM 2630 20654 2 173 286 173 418 173 1/5/2005 3:25:08 AM 1733 40123 4 456 282 456 26 456 1/5/2005 4:25:08 AM 6101 20670 2 87 372 87 437 87 1/2/2005 9:44:56 AM 2052 10110 1 49 567 49 248 49 1/2/2005 9:54:56 AM 1136 10775 1 483 588 483 161 483 1/2/2005 9:55:16 AM 4255 40780 4 16 203 16 84 16 1/5/2005 4:30:11 AM 1209 30116 3 6 58 6 26 6 1/5/2005 5:25:08 AM 4100 21340 2 89 910 89 447 89 1/5/2005 6:25:08 AM 3359 60234 6 468 114 468 95 468 1/5/2005 6:31:30 PM 1875 21230 2 42 18 42 215 42 1/5/2005 6:33:30 PM 1292 20859 2 456 283 456 26 456 1/5/2005 7:25:08 AM 1660 60456 6 49 608 49 248 49 1/5/2005 9:19:56 AM 1231 10112 1 1050 408 1050 215 1050 1/5/2005 9:51:56 AM 1697 10453 1 456 247 456 26 456 1/6/2005 1:30:11 AM 522 70678 7 2 268 2 6 2 1/6/2005 11:09:33 AM 4299 40786 4 73 952 73 367 73 1/6/2005 11:18:13 AM 3091 70634 7 424 352 424 323 424 1/6/2005 11:26:08 AM 845 50150 5 32 833 32 159 32 1/6/2005 12:26:08 AM 1105 50564 5 4 148 4 18 4 1/6/2005 3:30:16 PM 4647 50785 5 14 46 14 73 14 1/6/2005 3:45:16 PM 1293 50100 5 434 776 434 375 434 1/6/2005 4:00:23 PM 1566 20105 2 39 33 39 199 39 1/6/2005 5:01:12 PM 2593 50433 5 433 352 433 367 433 1/6/2005 6:25:16 PM 3322 50974 5 70 467 70 351 70 1/6/2005 8:31:45 PM 1076 20112 2 39 372 39 199 39 1/6/2005 8:35:16 PM 2846 50564 5 72 750 72 365 72 1/6/2005 8:50:13 AM 2286 11245 1 489 885 489 199 489 1/6/2005 9:15:16 PM 5122 50451 5 68 817 68 345 68 1/7/2005 1:25:23 AM 1612 21347 2 492 467 492 215 492 1/7/2005 11:05:15 AM 489 50324 5 1 172 1 1 1 1/7/2005 11:05:15 AM 1127 20543 2 3 271 3 11 3 1/7/2005 11:08:33 AM 1008 50120 5 29 746 29 141 29 1/7/2005 11:10:33 AM 1139 51007 5 477 651 477 96 477 1/7/2005 11:55:33 AM 4609 61003 6 445 973 445 428 445 1/7/2005 2:05:15 AM 5121 70100 7 83 427 83 418 83 1/7/2005 2:44:33 AM 6856 60560 6 443 705 443 418 443 1/7/2005 3:25:23 AM 2215 20177 2 6 238 6 26 6 1/7/2005 5:05:15 AM 4638 70670 7 449 952 449 447 449 1/7/2005 5:11:12 PM 546 20324 2 101 298 101 1 101 1/7/2005 5:14:12 PM 3060 70236 7 55 301 55 276 55 1/7/2005 5:21:12 PM 4544 70756 7 435 198 435 377 435 1/7/2005 5:23:12 PM 2341 70540 7 51 705 51 260 51 1/7/2005 5:25:12 PM 2014 20100 2 493 552 493 218 493


149

TRANSACTION_DATE AMOUNT PID PCID CID SID SHID LID CADID 1/7/2005 5:26:12 PM 2754 20452 2 49 297 49 248 49 1/7/2005 5:28:33 AM 5116 60120 6 449 352 449 447 449 1/7/2005 5:30:33 AM 6093 40227 4 77 954 77 387 77 1/7/2005 5:31:12 PM 4093 70650 7 437 521 437 387 437 1/7/2005 5:34:12 PM 3020 70457 7 53 973 53 266 53 1/7/2005 5:41:12 PM 4170 70132 7 439 216 439 396 439 1/7/2005 5:44:12 PM 1748 20567 2 45 250 45 229 45 1/7/2005 5:51:12 PM 1252 70168 7 417 352 417 290 417 1/7/2005 5:57:12 PM 2568 20450 2 47 6 47 240 47 1/7/2005 5:59:12 PM 2934 50862 5 427 705 427 337 427 1/7/2005 6:05:15 AM 2154 20172 2 53 554 53 266 53 1/7/2005 6:10:33 AM 2525 60123 6 53 976 53 266 53 1/7/2005 6:40:33 AM 2962 40213 4 67 427 67 337 67 1/5/2005 4:30:11 AM 1209 30653 3 6 58 6 26 6 1/5/2005 5:25:08 AM 4100 20117 2 89 910 89 447 89 1/5/2005 6:25:08 AM 3359 60234 6 468 114 468 95 468 1/5/2005 6:31:30 PM 1875 20125 2 42 18 42 215 42 1/5/2005 6:33:30 PM 1292 20654 2 456 283 456 26 456 1/7/2005 7:05:15 AM 3918 70412 7 81 574 81 406 81 1/7/2005 7:05:33 AM 3910 40816 4 431 991 431 360 431 1/7/2005 7:50:33 AM 4532 40135 4 200 578 200 406 200 1/7/2005 8:05:15 AM 3228 20860 2 411 286 411 260 411 1/7/2005 8:10:33 AM 2263 60456 6 51 427 51 260 51 1/7/2005 9:05:15 AM 1229 20467 2 481 999 481 151 481 1/7/2005 9:10:33 AM 703 50345 5 31 989 31 151 31 1/8/2005 1:45:13 AM 4458 71245 7 420 973 420 303 420 1/8/2005 11:25:08 AM 392 41345 4 2 224 2 6 2 1/8/2005 11:25:08 AM 457 41568 4 19 150 19 35 19 1/9/2005 3:55:13 AM 2246 71568 7 62 301 62 311 62 1/9/2005 8:15:13 AM 1739 70187 7 462 70 462 63 462 1/9/2005 8:48:13 AM 2630 70178 7 66 817 66 331 66 1/9/2005 8:48:13 AM 2630 71456 7 156 817 156 331 156 1/7/2005 7:05:15 AM 3918 70048 7 81 574 81 406 81 1/7/2005 7:05:33 AM 3910 41902 4 431 991 431 360 431 1/7/2005 7:50:33 AM 4532 41680 4 81 578 81 406 81 1/7/2005 8:05:15 AM 3228 23469 2 411 286 411 260 411 1/7/2005 8:10:33 AM 2263 61458 6 51 427 51 260 51 1/7/2005 9:05:15 AM 1229 21567 2 481 999 481 151 481 1/7/2005 9:10:33 AM 703 52598 5 31 989 31 151 31 1/8/2005 1:45:13 AM 4458 71003 7 420 973 420 303 420 1/8/2005 11:25:08 AM 392 41007 4 2 224 2 6 2 1/8/2005 11:25:08 AM 457 44678 4 19 150 19 35 19 1/9/2005 3:55:13 AM 2246 70156 7 62 301 62 311 62 1/9/2005 8:15:13 AM 1739 71234 7 462 70 462 63 462 1/9/2005 8:48:13 AM 2630 72456 7 66 817 66 331 66 1/9/2005 8:48:13 AM 2630 74567 7 156 817 156 331 156 1/7/2005 7:05:15 AM 3918 72345 7 81 574 81 406 81 1/7/2005 7:05:33 AM 3910 41245 4 431 991 431 360 431 1/7/2005 7:50:33 AM 4532 41567 4 81 578 81 406 81 1/7/2005 8:05:15 AM 3228 21345 2 411 286 411 260 411 1/7/2005 8:10:33 AM 2263 60459 6 51 427 51 260 51 1/7/2005 9:05:15 AM 1229 21234 2 481 999 481 151 481


150

TRANSACTION_DATE AMOUNT PID PCID CID SID SHID LID CADID 1/7/2005 9:10:33 AM 703 50408 5 31 989 31 151 31 1/8/2005 1:45:13 AM 4458 71214 7 420 973 420 303 420 1/8/2005 11:25:08 AM 392 40123 4 2 224 2 6 2 1/8/2005 11:25:08 AM 457 40245 4 19 150 19 35 19 1/9/2005 3:55:13 AM 2246 71345 7 62 301 62 311 62 1/9/2005 8:15:13 AM 1739 72456 7 462 70 462 63 462 1/9/2005 8:48:13 AM 2630 70101 7 66 817 66 331 66 1/9/2005 8:48:13 AM 2630 70237 7 156 817 156 331 156 1/7/2005 7:05:15 AM 3918 70457 7 81 574 81 406 81 1/7/2005 7:05:33 AM 3910 40134 4 431 991 431 360 431 1/7/2005 7:50:33 AM 4532 40145 4 81 578 81 406 81 1/7/2005 8:05:15 AM 3228 20657 2 411 286 411 260 411 1/7/2005 8:10:33 AM 2263 62345 6 51 427 51 260 51 1/7/2005 9:05:15 AM 1229 21345 2 481 999 481 151 481 1/7/2005 9:10:33 AM 703 50145 5 31 989 31 151 31 1/8/2005 1:45:13 AM 4458 70545 7 420 973 420 303 420 1/8/2005 11:25:08 AM 392 40145 4 2 224 2 6 2 1/8/2005 11:25:08 AM 457 40568 4 19 150 19 35 19 1/9/2005 3:55:13 AM 2246 70120 7 62 301 62 311 62 1/9/2005 8:15:13 AM 1739 70345 7 462 70 462 63 462 1/9/2005 8:48:13 AM 2630 70546 7 66 817 66 331 66 1/9/2005 8:48:13 AM 2630 70647 7 156 817 156 331 156 1/7/2005 7:05:15 AM 3918 71235 7 81 574 81 406 81 1/7/2005 7:05:33 AM 3910 41345 4 431 991 431 360 431 1/7/2005 7:50:33 AM 4532 40134 4 81 578 81 406 81 1/7/2005 8:05:15 AM 3228 20450 2 411 286 411 260 411 1/7/2005 8:10:33 AM 2263 60456 6 51 427 51 260 51 1/7/2005 9:05:15 AM 1229 20564 2 481 999 481 151 481 1/7/2005 9:10:33 AM 703 50324 5 31 989 31 151 31 1/8/2005 1:45:13 AM 4458 70125 7 420 973 420 303 420 1/8/2005 11:25:08 AM 392 40456 4 2 224 2 6 2 1/8/2005 11:25:08 AM 457 40224 4 19 150 19 35 19 1/9/2005 3:55:13 AM 2246 70911 7 62 301 62 311 62 1/9/2005 8:48:13 AM 2630 70123 7 66 817 66 331 66 1/9/2005 8:48:13 AM 2630 70620 7 156 817 156 331 156 1/7/2005 7:05:15 AM 3918 70324 7 81 574 81 406 81 1/7/2005 7:05:33 AM 3910 40862 4 431 991 431 360 431----------------------------- ------------------------- 1/18/2009 8:07:45 PM 2810 20009 2 1618 603 1618 443 1618 1/22/2009 7:18:53 AM 635 70543 7 1532 284 1532 9 1532 1/9/2009 8:44:54 PM 483 20635 2 1553 949 1553 123 1553 10/7/2009 11:12:34 AM 1963 20456 2 1575 833 1575 226 1575 11/13/2009 9:02:12 PM 6801 20432 2 1620 585 1620 452 1620 3/14/2009 8:53:01 PM 1048 20182 2 1533 298 1533 13 1533 3/2/2009 6:16:13 AM 4638 20345 2 1612 662 1612 415 1612 4/1/2009 6:16:53 PM 2900 20125 2 1600 981 1600 351 1600 4/12/2009 6:59:26 PM 1860 71348 7 1535 260 1535 23 1535 9/13/2005 2:45:34 AM 5129 10057 1 1608 336 1608 395 1608 1/18/2009 8:07:45 PM 2810 21021 2 1618 603 1618 443 1618


151

5.7 CREDIT CARD NUMBER GENERATION

We had also studied how the realistic credit card numbers are generated. To generate

realistic credit card numbers, we use the semantic graph shown in the figure 5.3.

The first digit on a credit card is the Major Industry Identifier (MII) which represents the

source from where the credit card was issued. For example, a credit card number starting

with 6 is assigned for merchandising and banking purposes, such as in the case of the

Discovery card. Credit card numbers starting with 4 and 5 are used for banking and

financing purposes, as in the case of Visa and MasterCard. Digit 3 is used to represent

travel and entertainment used, for instance the American Express card. Table 5.26 is an

overview of the rules for numbering credit card. The first six numbers including the MII

represents the issuer identifier. The rest of the digits on the credit card represent the

cardholder’s account number except the last digit. The lone digit at the very right end of

Creditcard Number

MII

Card Issuer

Card Type

Figure 5.3: Credit Card Number Semantic Graph


152

the complete 15 or 16 digit credit card number sequence is known as the “check digit”,

which often is the final number that is computer generated to satisfy the mathematical

formulations of the Luhn check sum process. Meanwhile, in between the first 6 digits and

the last single check digit is the actual personalized account number – the 8 or 9 digit

sequence given by the card issuer.

Table 5.26 Credit Card Parameters

Issuer Identifier Length (Numbers)

Discovery 6011xx 16

Mastercard 51xxxx-55xxxx 16

American Express 34xxx,37xxx 15

Visa 4xxxxx 13, 16

5.7.1 The Luhn Algorithm

The Luhn Algorithm is the check sum formula used by payment verification systems and

mathematicians to verify the sequential integrity of real credit card numbers. It’s used to

help bring order to seemingly random numbers and used to prevent erroneous credit card

numbers from being cleared for use. The Luhn algorithm is not used for straight credit

card number generation from scratch, but rather utilized as a simple computational way to

distinguish valid credit card numbers from random collections of numbers put together.

The validation formula also works with most debit cards as well.

The Luhn formula was created and filed as a patent (now freely in the public domain) in

1954 by Hans Peter Luhn of IBM to detect numerical errors found in pre-existing and

newly generated identification numbers. Since then, it’s primary use has been in the area

of check sum validation, made popular with its use to verify the validity of important

sequences such as credit card numbers. Currently, almost all credit card numbers issued

today are generated and verified using the Luhn Algorithm. The luhn algorithm only

validates the 15-16 digit credit card number and not the other critical components of a


153

genuine card account such as the expiration date and the commonly used Card

Verification Value (CVV) and Card Verification Code (CVC) numbers.

ALGORITHM 5.1

1. The Luhn Algorithm always starts from right to left, beginning with the rightmost

digit on the credit card face (the check digit). Starting with the check digit and

moving left, double the value of every alternate digit. Non-doubled digits will

remain the same. The check digit is never doubled. For example, if the credit card

is a 16 digit Visa card, the check digit would be the rightmost 16th digit. Thus we

would double the value of the 15th, 13th, 11th, 9th digits, and so on until all odd

digits have been doubled. The even digits would be left the same.

2. For any digit that becomes a two digit number of 10 or more when doubled, add

the two digits together. For example, the digit 5 when doubled will become 10,

which turns into a 1.

3. Now, lay out the new sequence of numbers. The new doubled digits will replace

the old digits. Non-doubled digits will remain the same.

4. Add up the new sequence of numbers together to get a sum total. If the combined

tally is perfectly divisible by ten, then the account number is mathematically valid

according to the Luhn formula. If not, the credit card number provided is not valid

and thus fake or improperly generated.

5.7.2 An Example of Luhn Validation Technique

We can follow the luhn steps from 1 to 4 below, starting with the right most digit. I have

taken my own credit card to check how it is mathematically correct according to the Luhn

validation technique.


154

Figure 5.4 Sample of Credit Card

(1) Start here at check digit and go left.

5 1 7 6 5 3 0 0 9 2 2 4 5 0 0 3

(2) Double every other number. If doubled numbers are two digits, then add them.

5 1 7 6 5 3 0 0 9 2 2 4 5 0 0 3

10 14 10 0 18 4 10 0

(3) Drop the numbers down to the bottom arrow and keep other digits as it is.

1 1 5 6 1 3 0 0 9 2 4 4 1 0 0 3

(4) Add these new numbers, which is 40 and perfectly divisible by 10, so according to

luhn algorithm, it is valid credit card number.

5.8 REFERENCES






155



[4] K.V.S Sarma – Statistics Made Simple Do It Yourself on PC, Prentice Hall of India,

ISBN: 81-203-1741-6

[5] R.S. Bhardwaj – Business Statistics, Excel Books, ISBN: 81-7446-181-7

[6] Ivan Bayross – SQL, PL/SQL The Programming Language of Oracle, BPB

Publications, ISBN 81-7656-964-X

[7] Nilesh Shah – Database Systems Using Oracle, Prentice Hall of India, ISBN: 81-203-

2147-2

[8] A. Leon, M. Leon – Database Management Systems, Vikas Publishing House, ISBN:

0-81-259-1165-0

[9] http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1081&rep...

[10] http:// www.thetaoofmakingmoney.com/2007/04/12/324.html

[11] http://www.etl-tools.info/en/bi/datawarehouse_star-schema.htm

http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1081&rep...

156

CHAPTER 6

DEVELOPMENT OF TRANSACTION PATTERN GENERATION TOOL (TPGT)

6.1 MAIN PATTERNS GENERATED BY TPGT

6.2 DESCRIPTIONS OF THE PATTERNS

6.3 COMPUTATIONS OF THE PATTERNS

6.4 REFERENCES

The transaction pattern generation tool (TPGT) will generate the patterns (parameters)

based on the historical data stored in the data warehouse. TPGT is implemented in the

Oracle 9i. All the patterns generated by TPGT will collectively decide the purchasing

behavior of the card holder. These patterns are very useful for deciding or verifying the

current transaction performed by the card holder online. Implementation code is given in

the Appendix.

6.1 MAIN PATTERNS (PARAMETERS) GENERATED BY TPGT

TPGT

CP FPPP TP MPWP VP AP HP SP

Figure 6.1 Parameters of TPGT

LPDP GP

Chapter 6: Development of Transaction Pattern Generation Tool (TPGT)

157

DP: Daily Parameters

CP: Category Parameters

PP: Product Parameters

TP: Transaction Parameters

WP: Weekly Parameters

VP: Vendor Parameters

AP: Address Parameters

FP: Fortnightly Parameters

MP: Monthly Parameters

SP: Sunday Parameters

HP: Holiday Parameters

LP: Location Parameters

GP: Transaction Gap Parameters

6.1.1 Subparameters of DP

DP1: Average Amount of purchases per day

DP2: Maximum Amount of Purchase daily

DP3: Maximum Number of transactions a day

DP4: Average number of transactions per day

DP

DP1 DP2 DP4 DP3

Figure 6.2: Subparameters of DP


158

6.1.2 Subparameters of CP

CP1: Amount spend in the current category

CP2: Time passed since the same category purchased

CP3: Number of times the transactions taken place within same category

CP

CP1 CP3 CP2

Figure 6.3 : Subparameters of CP


159

6.1.3 Subparameters of PP

PP1: Time passed since the same product purchased

PP2: Number of times the same product purchased

PP

PP1 PP2

Figure 6.4: Subparameters of PP


160

6.1.4 Subparameters of TP

TP1: Number of transactions performed during (3:01 to 6:00)








TP9: Time passed since the last transaction

TP10: Maximum Amount of Transaction

TP11: Number of transactions during day

TP12: Number of transactions during late night

TP

TP2 TP8TP3 TP4 TP9TP5 TP6 TP7 TP10

Figure 6.5 Parameters of TP

TP11TP1 TP12


161

6.1.5 Subparameters of WP

WP1: Average Amount of purchases per week

WP2: Average Number of transactions per week

WP3: Maximum Number of transactions a week

WP4: Maximum Amount of Purchase weekly

WP

WP1 WP2 WP4 WP3

Figure 6.6 : Subparameters of WP


162

6.1.6 Subparameters of VP

VP1: Number of transactions with the same seller

VP2: Amount of purchases with the same seller

6.1.7 Subparameters of AP

AP1: Number of transactions shipped with the same current shipping address

AP2: Number of transactions with different shipping and billing address

AP

AP1 AP2

Figure 6.8: Subparameters of AP

VP

VP1 VP2

Figure 6.7: Subparameters of VP


163

6.1.8 Subparameters of FP

FP1: Average Amount of purchases per fortnight

FP2: Average Number of transactions per fortnight

FP3: Maximum Number of transactions a fortnight

FP4: Maximum Amount of Purchase fortnightly

FP

FP1 FP2 FP4 FP3

Figure 6.9: Subparameters of FP


164

6.1.9 Subparameters of MP

MP1: Average Amount of purchases per month

MP2: Average Number of transactions per month

MP3: Maximum Number of transactions a month

MP4: Maximum Amount of Purchase monthly

6.1.10 Subparameters of SP

SP

SP1 SP2 SP5 SP3

Figure 6.11: Subparameters of SP

SP4

MP

MP1 MP2 MP4 MP3

Figure 6.10: Subparameters of MP


165

SP1: Average Amount of purchases per Sunday

SP2: Average Number of transactions per Sunday

SP3: Maximum Number of transactions a Sunday

SP4: Maximum individual Amount of transactions on Sunday

SP5: Maximum total Amount of transactions on Sunday

6.1.11 Subparameters of HP

HP1: Average Amount of purchases per holiday

HP2: Average Number of transactions per holiday

HP3: Maximum Number of transactions a holiday

HP4: Maximum individual Amount of transactions on holiday

HP5: Maximum total Amount of transactions on holiday

HP

HP1 HP2 HP5 HP3

Figure 6.12: Subparameters of HP

HP4


166

6.1.12 Subparameters of LP

LP1: Number of transactions ordered from the same location

LP2: Number of transactions ordered in the different city within same state

LP3: Number of transactions ordered in the different city outside of the state

LP4: Number of transactions ordered in the different country

LP5: Number of transactions shipped in the different city within same state

LP6: Number of transactions shipped in the different city outside of the state

LP7: Number of transactions shipped in the different country

LP

LP2 LP6 LP3

Figure 6.13: Subparameters of LP

LP4 LP5 LP1 LP7


167

6.1.13 Subparameters of GP

GP1: Number of transactions performed within 4 hour time gap




GP5: Number of transactions performed within 7 day time gap

GP6: Number of transactions performed within 15 day time gap

GP7: Number of transactions performed after 15 days

6.2 DESCRIPTIONS OF THE PATTERNS (PARAMETERS)

6.2.1 Daily Parameters (DP)

6.2.1.1 Average Amount of purchases per day (DP1)

This parameter contains the value of average amount of purchases per day. Suppose the

total amount of purchasing is done by customer is Rs.30000 in one year, then this value is

divided by 365 and then the value of the parameter is derived.

GP

GP3 GP4

Figure 6.14: Subparameters of GP

GP6 GP5 GP2 GP1 GP7


168

6.2.1.2 Maximum Amount of Purchase daily (DP2)

The tool also calculates the total amount of purchase daily and then it finds the maximum

amount of purchase from all the past daily transactions. This value is stored in the

parameter.

6.2.1.3 Maximum Number of transactions a day (DP3)

The tool keep track record of total number of transactions performed on each day. It finds

then the maximum number of transactions a day for that customer who is currently

performing the online transaction.

6.2.1.4 Average number of transactions per day (DP4)

This parameter contains the average number of transactions per day. If the customer

performs 365 transactions in one year then the average number of transactions per day is

1.

6.2.2 Category Parameters (CP)

6.2.2.1 Amount spend in the current category (CP1)

Whenever customer orders product in any category, then the system calculates the total

amount spend by the customer in this category. E.g. If customer buys product in

electronics category and the past transactions performed in this category worth Rs.50000,

then this parameter contains Rs.50000.

6.2.2.2 Time passed since the same category purchased (CP2)

This parameter stores the value of time since the last transaction performed in the same

category. The value is also stored in days, hours, minutes and seconds.

6.2.2.3 Number of times the transactions taken place within same category (CP3)

Total number of transactions in each category is also stored by the tool in this parameter.

E.g. If customer currently buys product of electronics category and in the past the

customer has performed total six transactions within same category, then this parameter

has value six.


169

6.2.3 Product Parameters (PP)

6.2.3.1 Time passed since the same product purchased (PP1)

It stores the value of time since the last transaction performed in the same product. The

value is also stored in days, hours, minutes and seconds.

6.2.3.2 Number of times the same product purchased (PP2)

The tool records how many times each product purchased by the customer. So if the

customer purchases any product, then this parameter has value of number of times the

same product purchased.

6.2.4 Transaction Parameters (TP)

6.2.4.1 Number of transactions performed during (3:01 to 6:00) (TP1)

This parameter contains the value of total number of past transactions performed from

3:01 A.M. to 6:00 A.M. by the customer.






9:01 A.M. to 12:00 P.M. by the customer.



12:01 P.M. to 3:00 P.M. by the customer.





170






9:01 P.M. to 12:00 A.M. by the customer.




6.2.4.9 Time passed since the last transaction (TP9)

It has value of time difference between current transaction date and the last transaction.

The value is also stored in days, hours, minutes and seconds.

6.2.4.10 Maximum Amount of Transaction (TP10)

From all the past transactions, the system finds the transaction having the maximum

amount and stores the value in this parameter. If customer buys three products at the

same time, then the system considers it’s as only one transaction and adds prices of all

the three products.

6.2.4.11 Number of transactions during day (TP11)

The tool records whether the transaction is performed during day or during late night. It

has value of total number of transactions performed during day only.

6.2.4.12 Number of transactions during late night (TP12)

If the transaction is performed during late night then it is considered as sensitive if the

customer has not performed any transaction during late night. This parameter has value of

total number of past transactions performed during late night.


171

6.2.5 Weekly Parameters (WP)

6.2.5.1 Average Amount of purchases per week (WP1)

The tool generates this parameter based on the average amount of purchases per week of

the customer performing the transaction. It is useful to compare the current weekly

behavior of customer with the past weekly behavior.

6.2.5.2 Average Number of transactions per week (WP2)

It has value of average number of transactions per week. So if the extensive number of

transactions is performed in a week, then it is taken into consideration by the system

based on this parameter.

6.2.5.3 Maximum Number of transactions a week (WP3)

The tool finds the number of transactions of a week in which the maximum number of

transactions is performed by the customer. So if in the current week, the customer’s

behavior changes drastically, then accordingly risk score is generated by the model.

6.2.5.4 Maximum Amount of Purchase weekly (WP4)

The tool also finds in which week, the customer has performed maximum amount of

transactions.

6.2.6 Seller or Vendor Parameter (VP)

6.2.6.1 Number of transactions with the same seller (VP1)

Whenever customer performs a new transaction, then this tool finds how many

transactions are performed with the same seller by this customer. So customer habit of

any particular choice of seller can be monitored by the model.

6.2.6.2 Amount of purchases with the same seller (VP2)

With the new transaction, the tool finds total amount of purchases with the same seller

and stores value in this parameter. If the customer is performing the transaction first time

with this seller, then the value of this parameter is 0.


172

6.2.7 Address Parameters (AP)

6.2.7.1 Number of transactions shipped with the same current shipping address

(AP1)

While performing the online transaction, the customer has to enter the shipping address

where he wants his item to be delivered. The tool finds how many times the past

transactions are performed with the same shipping address. If the past transactions are

performed on the same shipping address, then the model considers transaction as highly

genuine.

6.2.7.2 Number of transactions with different shipping and billing address (AP2)

The tool finds how many transactions the customer has performed other than his billing

address. So the customer habit of performing transaction other than his billing address

can be studied by the model and decide about the sensitivity of new incoming transaction.

6.2.8 Fortnightly Parameters (FP)

6.2.8.1 Average Amount of purchases per fortnight (FP1)

It stores the value of average amount of purchases per fifteen days by the customer.

6.2.8.2 Average Number of transactions per fortnight (FP2)

It has value of average number of transactions per fifteen days by the customer.

6.2.8.3 Maximum Number of transactions a fortnight (FP3)

It finds the maximum number of transactions from each fortnight and stores the value in

this parameter.

6.2.8.4 Maximum Amount of Purchase fortnightly (FP4)

It also finds the maximum amount of transactions from each fortnight and stores the

value in this parameter.


173

6.2.9 Monthly Parameters (MP)

6.2.9.1 Average Amount of purchases per month (MP1)

This parameter has value of average amount of purchases per month by the customer.

6.2.9.2 Average Number of transactions per month (MP2)

It stores the value of average number of transactions per month.

6.2.9.3 Maximum Number of transactions a month (MP3)

The tool finds in which month, the customer has performed the maximum number of

transactions and stored the value in this parameter.

6.2.9.4 Maximum Amount of Purchase monthly (MP4)

It also finds the maximum amount of monthly purchase and stores the value in this

parameter.

6.2.10 Sunday Parameters (SP)

6.2.10.1 Average Amount of purchases per Sunday (SP1)

Whenever transaction is performed on Sunday, then the tool automatically records

customer behavior separately. It stores the average amount of purchases per Sunday in

this parameter.

6.2.10.2 Average Number of transactions per Sunday (SP2)

Average number of transactions performed by customer on sundays is stored in this

parameter.

6.2.10.3 Maximum Number of transactions a Sunday (SP3)

The tool finds a Sunday, in which the customer has performed the maximum number of

transactions and stores value in it. So if the new incoming transactions are performed

heavily on Sunday, then it compare with this parameter to decide sensitivity of

transactions.


174

6.2.10.4 Maximum individual Amount of transactions on Sunday (SP4)

If multiple transactions are performed on Sunday, then the tool finds which transaction

has maximum amount value and stores in this parameter.

6.2.10.5 Maximum total Amount of transactions on Sunday (SP5)

The tool calculates total amount of purchases on each Sunday and finds the maximum

amount of purchase among these.

6.2.11 Holiday Parameters (HP)

6.2.11.1 Average Amount of purchases per holiday (HP1)

Customer behavior is also monitored separately on national holiday, as the customer has

free time on this day. So his behavior may be totally different than other working day.

This parameter has value of average amount of purchases of all holidays.

6.2.11.2 Average Number of transactions per holiday (HP2)

The tool generates this parameter based on average number of transactions performed by

customer per holiday.

6.2.11.3 Maximum Number of transactions a holiday (HP3)

The tool calculates the total number of transactions on each holiday and then finds the

maximum number of transactions among these.

6.2.11.4 Maximum individual Amount of transactions on holiday (HP4)

If multiple transactions are performed on holiday, then the tool finds which transaction

has maximum amount value and stores in this parameter.

6.2.11.5 Maximum total Amount of transactions on holiday (HP5)

The tool calculates total amount of transactions on each holiday and then finds the

maximum amount among these.


175

6.2.12 Location Parameters (LP)

6.2.12.1 Number of transactions ordered from the same location (LP1)

Whenever customer performs online transaction, the tool records the location from which

transaction is ordered and also generate this parameter showing how many total

transactions are ordered from the same location.

6.2.12.2 Number of transactions ordered in the different city within same state (LP2)

If customer initiates any order from different city other than his own city but within same

state, then it will be added into this parameter.

6.2.12.3 Number of transactions ordered in the different city outside of the state

(LP3)

If the customer orders a product outside of his state but within his country, then it will be

added into this parameter.

6.2.12.4 Number of transactions ordered in the different country (LP4)

This parameter has value of number of transactions; the user has performed from outside

of his country.

6.2.12.5 Number of transactions shipped in the different city within same state (LP5)

This parameter has a value of number of transactions; the user has requested to ship the

items in the different city other than his billing address city, but within his state.

6.2.12.6 Number of transactions shipped in the different city outside of the state

(LP6)


items in the different state other than his billing address state, but within his country.

6.2.12.7 Number of transactions shipped in the different country (LP7)


items in the different country other than his billing address country.


176

6.2.13 Transaction Gap Parameters (GP)

6.2.13.1 Number of transactions performed within 4 hour time gap (GP1)

The tool timestamps each transaction. It records within particular time duration how

many past transactions are performed by the customer. This parameter contains the

number of transactions the customer has performed within 4 hour time gap from the

previous transaction. E.g. If the customer performs the first transaction and second

transaction is performed after just one hour (within 4 hour time gap), then the tool adds 1

to this parameter.


This parameter contains the number of transactions the customer has performed within 5th

to 6th hours time gap from the previous transaction.


This parameter contains the value of number of transactions the customer has performed

within 8th to 16th hours time gap from the previous transaction.



within 16th to 24th hours from the previous transaction.

6.2.13.5 Number of transactions performed within 7 day time gap (GP5)


within 1st day to 7th day from the previous transaction.

6.2.13.6 Number of transactions performed within 15 day time gap (GP6)


within 8th to 15th day from the previous transaction.


177

6.2.13.7 Number of transactions performed after 15 days (GP7)


within 16th to 30th day from the previous transaction.

6.3 COMPUTATIONS OF THE PATTERNS

6.3.1 TP1 to TP8

The Calculation of the parameters TP1 to TP8 in the tool is done as follows.

The tool divides all the transactions of the customer into eight different time frames

according to the following.

T1 becomes true if the past transaction is performed from 3:00 to 6:00 time frame on the

card Ck within data warehouse.

1 | { 3 : 00 6 : 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.1)



2 | { 6 : 00 9 : 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.2)



3 | { 9 : 00 12 : 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.3)

T4 becomes true if the past transaction is performed from 12:00 to 15:00 time frame on

the card Ck within data warehouse.


178

4 | { 12 : 00 15 : 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.4)



5 | { 15 : 00 18 : 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.5)



6 | { 18 : 00 21: 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.6)



7 | { 21: 00 0 : 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.7)



8 | { 0 : 00 3: 00}kT TRUE Tc t= ∃ ∧ < ≤ (6.8)

The tool then finds the total number of the transactions performed by the customer in

time frame from T1 to T8.

TP1 = occurrences (count) of T1 on the card Ck from the data warehouse (6.9)







179



Finally the percentage of all the parameters of all the transactions is computed as follows.

Percent_TP1=(TP1 * 100) / total transactions on card Ck from the data warehouse (6.17)








6.3.2 TP11 and TP12

L1 becomes true if the transaction is performed from 0:00 to 4:00 on the card Ck from the

data warehouse.

1 | { 0 : 00 4 : 00}kL TRUE Tc t= ∃ ∧ < ≤ (6.25)

L2 becomes true if the transaction is performed except from 0:00 to 4:00 on the card Ck

within the data warehouse.

2 | { 4 : 00 0 : 00}kL TRUE Tc t= ∃ ∧ < ≤ (6.26)

Finally TP11 and TP12 are computed as follows.

TP11 = occurrences (count) of L1 on the card Ck from the data warehouse (6.27)

TP12 = occurrences (count) of L2 on the card Ck from the data warehouse (6.28)


180

6.3.3 GP1 to GP7

G1 becomes true if the transaction occurs just within 4 hours from the previous

transaction on the same card Ck from the data warehouse.

1 |{ (0 4)}kG True Tc d= ∃ ∧ < ≤ (6.29)

d stands for the duration in hours between two successive transactions.

G2 becomes true if the transaction occurs just within 5 to 8 hours from the previous


2 |{ (4 8)}kG True Tc d= ∃ ∧ < ≤ (6.30)



3 |{ (8 16)}kG True Tc d= ∃ ∧ < ≤ (6.31)



4 |{ (16 24)}kG True Tc d= ∃ ∧ < ≤ (6.32)

G5 becomes true if the transaction occurs from 2nd day to within a week from the previous


5 |{ (24 (24*7))}kG True Tc d= ∃ ∧ < ≤ (6.33)

G6 becomes true if the transaction occurs just within 15 days from the second week since

the previous transaction on the same card Ck from the data warehouse.


181

6 |{ ((24*7) (24*15))}kG True Tc d= ∃ ∧ < ≤ (6.34)

G7 becomes true if the transaction occurs after 15 days from the previous transaction on

the same card Ck from the data warehouse.

7 |{ ( (24*7))}kG True Tc d= ∃ ∧ > (6.35)

Now the parameters GP1 to GP7 are computed as follows.

GP1 = occurrences (count) of G1 on the card Ck from the data warehouse (6.36)







6.3.4 AP1 and AP2

A1 becomes true if the past transactions are also shipped with the same shipping address

from the data warehouse.

1 ( ) ( )| { }k addr Tcurrent addr TpastA TRUE Tc S S= ∃ ∧ = (6.43)

A2 becomes true if the transaction is performed with the different shipping and billing

address.

2 | { }k addr addrA TRUE Tc S B= ∃ ∧ ≠ (6.44)

Finally AP1 and AP2 are computed as follows.


182

AP1 = occurrences (count) of A1 on the card Ck from the data warehouse (6.45)

AP2 = occurrences (count) of A2 on the card Ck from the data warehouse (6.46)

Other parameters are computed in the similar way.

6.4 REFERENCES







[4] E. Aleskerov, B. Freisleben, B.Rao, “CARDWATCH: a neural network based

database mining system for credit card fraud detection”, in: Proceedings of the

Computational Intelligence for Financial Enginnering, 1997, pp.220-226

[5] A.Shen, R.Tong, Y.Deng, “Application of classification models on credit card fraud

detection”, in: Proceedings of the IEEE Service Systems and Service Management,

International Conference, 9-11 June 2007, pp:1-4

[6] Tao Guo, Gui-Yang Ali, “Neural data mining for credit card fraud detection”, in:

Proceedings of the Seventh International Conference on Machine learning and

cybernetics, Kunming, 12-15 July 2008, pp.3630-3634

[7] J.Quah, M.Sriganesh, “Real time credit card fraud detection using computational

intelligence”, in: Proceedings of the International Joint Conference on Neural Networks,

Florida, U.S.A, August 2007

183

CHAPTER 7

DEVELOPMENT OF TRANSACTION RISK SCORE GENERATION MODEL (TRSGM)

7.1 SIGNIFICANCE OF THE PARAMETERS IN TRSGM

7.2 TRSGM COMPONENTS

7.3 ALGORITHM

7.4 GRAPHS OF CLUSTER FORMATION BY DBSCAN ALGORITHM

7.5 IMPLEMENTATION ENVIRONMENT

7.6 SAMPLE RESULTS

7.7 RESULT ANALYSIS & DISCUSSIONS

7.8 REFERENCES

7.1 SIGNIFICANCE OF THE PARAMETERS IN TRSGM

7.1.1 Address Parameters (AP)

This is the most important parameters considered by the model. When the shipping

address entered by the customer is different than the billing address, then the model

checks how many previous transactions are performed on the same shipping address by

checking the value of parameter (AP1). If it is greater than zero, then model considers it

as highly genuine and generates 0 risk score. The model also learns how many total

transactions the customer has performed with different billing and shipping address by

the parameter (AP2).

Chapter 7: Development of Transaction Risk Generation Model (TRSGM)

184

7.1.2 Location Parameters (LP)

The model considers the location from which the current online transaction is performed.

The model then use the parameter LP1: number of transactions ordered from the same

location and accordingly generates a risk score. If there is no any transaction performed

from the same location then more risk score generated and more transactions performed

then less risk score generated.

7.1.3 Category Parameters (CP)

When a customer purchase a product in any category, then the model finds how much

amount is spend by the customer in this category. It uses the parameter CP1 for

generating risk score. Higher the value of CP1, less risk score generated as the model

assumes that it is matched with customer purchasing pattern. Less the value of CP1,

higher the risk score is generated as it is far from customer purchasing habit.

7.1.4 Product Parameters (PP)

The model also use the parameter PP1, time passed since the same product purchased. If

the value of this parameter is less and the product is costly, then the model considers the

transaction as sensitive and generates risk score accordingly. Number of times the same

product purchased is also recorded by the model.

7.1.5 Transaction Parameters (TP)

The parameters TP1 to TP8 has value of percentage of total number of transactions

performed within particular time frame. The model records the current transaction time

and finds percentage of total number of transactions in this time by using parameters TP1

to TP8. If percentage is higher, then less risk score generated as most of the past

transactions are performed within this time frame. If percentage is less, then risk score is

generated higher as it is not matched with customer past transaction time.


185

If current transaction is performed during late night, then the model checks the parameter

TP12 to find total past transactions performed by the customer during late night. If value

of TP12 is 0 or very less (as compared with total transactions), then the model considers

the transaction as sensitive and generates the risk score accordingly.

The customer is active and time passed since the last transaction (TP9) is more, the model

considers the transaction sensitive and generates the risk score accordingly.

The model also finds the deviation than maximum amount of all past transactions (TP10).

It generates a risk score based on how much the current transaction amount is greater

than TP10.If it is less then the risk score generated is less.

7.1.6 Vendor (Seller) Parameters (VP)

With the new incoming transaction with the seller, the model checks the parameter VP2

to find total amount of transactions performed by the customer with the same seller.

Higher purchasing is done with the seller, lesser risk score generated and lesser

purchasing is done, higher risk score generated.

7.1.7 Sunday Parameters (SP)

Customer behavior is observed separately by the model on Sunday. Customer’s current

Sunday behavior is matched with the past Sunday’s behavior and according to the

deviation the risk score is generated.

Customer’s first transaction amount is compared with SP4: maximum individual Amount

of transactions on Sunday and accordingly risk score is generated. If customer

subsequently performs transactions on this day, then its total amount and total number of

transactions are compared with SP5: Maximum total Amount of transactions on Sunday

and SP3: Maximum number of transactions on Sunday.


186

7.1.8 Holiday Parameters (HP)

Customer’s behavior is also monitored separately on holiday by the model and is

compared with holiday parameters to find the deviation of holiday behavior.

For the first transaction performed by customer on holiday, its amount is compared with

HP4: maximum individual Amount of transactions on holiday and accordingly risk score

is generated. If customer subsequently performs transactions on this day, then its total

amount and total number of transactions are compared with HP5: Maximum total amount

of transactions on holiday and HP3: Maximum number of transactions on holiday.

7.1.9 Daily Parameters (DP)

All the transactions performed by the customer on current day are monitored by model

and stored in the table customer_dailycount. They are compared with daily parameters to

find how close or far current day behavior is from past daily behavior.

Total amount of transactions on current day is compared with parameter DP2: Maximum

Amount of Purchase daily and risk score is generated accordingly. Higher risk score is

generated according its value is greater than DP2.

Total number of transactions on current day is matched with DP3: Maximum Number of

transactions a day and risk score is generated accordingly. Higher risk score is generated

according to its value is greater than DP3 otherwise less.

7.1.10 Weekly Parameters (WP)

The transactions of the current week are updated in the table customer_weeklycount. Its

value is matched with weekly parameters to find the deviation from the past weekly

behavior.


187

The weekly transaction amount is compared with WP4: Maximum Amount of Purchase

weekly and if it is greater than WP4 then more risk score generated and it is less than WP4

then less risk score is generated.

The total number of transactions of current week is matched with WP3: Maximum

Number of transactions a week. If it is higher then more risk score generated otherwise

less.

7.1.11 Fortnightly Parameters (FP)

Customer’s current 15 days behavior is stored in the table customer_fortnightlycount and

is compared with the fortnightly parameters to check how far or close the current 15 days

behavior is from the past fifteen days behavior.

Total number of transactions in the current fortnight is checked with the parameter FP3:

Maximum Number of transactions a fortnight and risk score is generated accordingly.

Total amount of transactions in the current fortnight is checked with FP4: Maximum

Amount of Purchase fortnightly and risk score is generated accordingly.

7.1.12 Monthly Parameters (MP)

All the transactions of the current month are stored in the customer_monthlycount table.

This table is used to find how far or close the current month’s behavior from past

monthly behavior by comparing with monthly parameters.

Total number of transactions of current month is compared with MP3: Maximum Number

of transactions a month and risk score is generated accordingly.

Total amount of transactions is compared with MP4: Maximum Amount of Purchase

monthly and risk score is generated accordingly.


188

7.1.13 Transaction Gap Parameters (GP)

The model has one important feature that it records the transaction gap between each two

successive transactions performed by the customer. Seven transaction gap parameters

GP1 to GP7 are generated according to the transaction gap.

Whenever any transaction is found suspicious by the model, it updates the field

suspect_count of the suspect table. Then the model first finds which event occurs on this

card and finds probability that it occurs from generic fraudulent transactions set or

normal transactions set by using these parameters. Finally posterior probabilities are

computed by the model.

Here the time gap between successive transactions on the same card is considered to

capture the frequency of card use. The transaction gap is divided into seven mutually

exclusive and exhaustive events – E1, E2, E3, E4, E5, E6 and E7. Occurrence of each event

depends on the time since the last purchase (transaction gap-g ) on any particular card.

The event E1 is defined as the occurrence of a transaction on the same card Ck within 4

hours of the last transaction which can be represented as:

1 |{ (0 4)}kE True Tc g= ∃ ∧ < ≤ (7.1)

The event E2 is defined as the occurrence of a transaction on the same card Ck from 4th to

8 hours of the last transaction which can be represented as:

2 |{ (4 8)}kE True Tc g= ∃ ∧ < ≤ (7.2)

The event E3 is defined as the occurrence of a transaction from 8th to 16 hours of the last

transaction.

3 |{ (8 16)}kE True Tc g= ∃ ∧ < ≤ (7.3)


189

The event E4 is defined as the occurrence of a transaction from 16th to 24 hours of the last

transaction

4 |{ (16 24)}kE True Tc g= ∃ ∧ < ≤ (7.4)

The event E5 is defined as the occurrence of a transaction within a week (from 2nd day to

7th day) of the last transaction

5 |{ (24 (24*7))}kE True Tc g= ∃ ∧ < ≤ (7.5)

The event E6 is defined as the occurrence of a transaction within a fortnight (from the 8th

day to 15th day) of the last transaction

6 |{ ((24*7) (24*15))}kE True Tc g= ∃ ∧ < ≤ (7.6)

The event E7 is defined as the occurrence of a transaction after 15 days of the last

transaction

7 |{ ( (24*15))}kE True Tc g= ∃ ∧ > (7.7)

7.2 TRSGM COMPONENTS

In the TRSGM, a number of rules are used to analyze the deviation of each incoming

transaction from the normal profile of the cardholder by computing the patterns generated

by TPGT. The initial belief value is obtained as the risk score. The belief is further

strengthened or weakened according to its similarity with fraudulent or genuine

transaction history using Bayesian learning. In order to meet this functionality, the

TRSGM is designed with the following five major components:

(1) DBSCAN algorithm


190

(2) Linear equation

(3) Rules

(4) Historical transaction database

(5) Bayesian learner

7.2.1 DBSCAN Algorithm

A customer usually carries out similar types of transactions in terms of amount, which

can be visualized as part of a cluster. Since a fraudster is likely to deviate from the

customer’s profile, his transactions can be detected as exceptions to the cluster – a

process known as outlier detection. It has important applications in the field of fraud

detection and has been used for quite some time to detect anomalous behavior.

DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a density

based clustering algorithm which can be used to filter out outliers and discover clusters of

arbitrary shapes. Formally, let C’= {c1,c2,……,cn}denote the clusters in a database for a

particular card Ck. A transaction T is detected as an outlier if it does not belong to any

cluster in the set C’. We can measure the extent of deviation of an incoming transaction

by its degree of outlierness. If the average distance of the amount p of an outlier

transaction T from the set of existing clusters in C’ is vavg, then its degree of outlierness

doutlier is given by:

1 | ( )|outlieravg

d if N P MinPtsv

εε

= − < (7.8)

doutlier=0 otherwise

where

MinPts: Minimum number of points required in the ε - neighborhood of each point to

form cluster.

ε : Maximum radius of the neighborhood ( ) { | ( , ) }.N p q D dist p qε ε= ∈ ≤


191

The key idea of the DBSCAN algorithm is that for each point p in a cluster ci, there are at

least a minimum number of points (MinPts) in the ε - neighborhood of that point p

denoted as Nε (p) i.e. the density in the ε - neighborhood has to exceed some threshold.

The larger the ε - neighborhood, the less is the number of clusters formed. If it is set too

high, there will be no cluster since the MinPts condition is not satisfied. However, if both

the parameters are small, there can be a lot of clusters. If MinPts is set to 1, then each

point in the database is treated as a separate cluster and even noise gets identified as a

separate cluster.

Here DBSCAN algorithm is used to form the clusters of transaction amounts spend by

the customer. Whenever a new transaction is performed by the customer, the algorithm

finds the cluster coverage of this particular amount. If this amount occurs more than once

in the past, then the TRSGM considers as highly genuine transaction.

7.2.2 Linear Equation

The TRSGM is based on the following linear equation, which generates a risk score and

indicates how far or close the current transaction is from the normal profile of the

customer. If the generated risk score is closer to 0, then it is considered closely match to

customer normal profile. If the risk score is greater than 0.5 or close to 1, then it

considered heavily deviation from the customer normal profile.

1(1 ) ( * )

n

i ii

Risk score thresold P W=

= − ∑ (7.9)

Where threshold=0.5

Pi = Parameter generated by TPGT

Wi= Weightage of the parameter which is given as input to algorithm 7.1

Weightage is in the percentage


192

7.2.2.1 Parameters

Table 7.1 Parameters of the Equation

Sr No Parameter Weightage 1 Location from which product is ordered W1 % 2 Amount of the transaction W2 % 3 Number of the transactions W3 % 4 Category of the purchase W4 % 5 Time frame during which product is ordered W5 % 6 Seller or Vendor, with whom product is

purchased W6 %

7 Same product purchased within short time W7 % 8 Time passed since the last transaction W8 % 9 Late night transaction W9 % 10 Overseas transaction W10 %

7.2.2 2 Formation of Linear Equation

The sigmoid function is computed as:

f(x)=1/1+e –x (7.10) where

e is the base of natural logarithms approximated by 2.718282.

This function is used when the value of parameter can not be shown in the percentage as

it maps the computation value in the range [0, 1].

The equation is a linear combination of the following sub equations.

1. ( 1 – percentage_location_count / 100) * W1 (7.11)

2. ( ( 1 – percentage_category_amount / 100) * W4) / (no_of_product_purchased) (7.12)

3. ( 1 / ( 1 + e –x) ) * W2 (7.13)

where x = (current_transaction_amount – max_transaction_amount) * 25 /

current_transaction_amount

4. ( 1 / ( 1 + e –x) ) * W3 (7.14)


193

where x = (current_transaction_total – max_transaction_total) * 25 / ( 7 *

current_transaction_total)

5. (1 – time_percentage / 100) * W5 (7.15)

6. ((1 – seller_amount_percentage / 100) * W6) / (no_of_product_purchased) (7.16)

7. (1 / (1 + e –x) ) * W7 (7.17)

where x = (1620 – time_same_product) / ( time_same_product * 0.005 )

8. (1 / ( 1 + e –x) ) * W8 (7.18)

where x = time_last_transaction / 75

9. (1 – latenight_transaction_percentage / 100) * W9 (7.19)

10. (1- overseas_transaction_percentage / 100) * W10 (7.20)

The co-efficient of sigmoid function is derived by creating a small simulator programs

and exhaustive run of the same.

The weightage of different parameters have been derived and implemented using

artificial intelligence. Despite that the application is not stick to this weightage, it is made

dynamic and can be changed if any credit card company wish to do that. It is also

observed that within particular month or time, fraudster becomes so active and fraudulent

transactions increased drastically. So it is useful as the weightage is dynamic because we

can give more weightage to any sensitive parameter when there is a fear of fraudster in a

particular time period.

7.2.3 Rules There are various guidelines given on several websites, print and electronic media as

indications for fraudulent transaction. These guidelines are implemented as rules in the

TRSGM.

• If the transaction is performed during the late night then it is considered as

sensitive. So weightage is given to this in the TRSGM.

• If the customer is active and performs the transactions frequently, but then stops

performing the transactions and after some time he or she becomes again active, it


194

is also considered as sensitive, it is monitored by the TRSGM also and risk score

is also generated according to the duration of time since the last transaction

performed.

• Generally customer doesn’t purchase the costly and luxury product again within

short time. So the TRSGM raises alarm by generating a risk score if similar event

occurs on the same card.

• Overseas transaction is also considered as highly sensitive by the TRSGM if in

the past no overseas transaction is performed on the same card.

7.2.4 Historical transaction database (HTD)

HTD is the transaction repository component of the proposed TRSGM, which is stored in

the data warehouse. The expected behavior of a fraudster is to maximize his benefit from

a stolen card. This can be achieved by carrying out high value transactions frequently.

However, to avoid detection, the fraudsters can make either high value purchases at

longer time gaps or smaller value purchases at shorter time gaps. Contrary to such usual

behavior, a fraudster may also carry out low value purchases at longer time gaps. This

would be difficult for the TRSGM to detect if it resembles the genuine cardholder’s

profile. However, in such cases, the total loss incurred by the credit card company will

also be quite low.

To capture the frequency of card use, we consider the time gap between successive

transactions on the same card. The transaction gap is divided into seven mutually

exclusive and exhaustive events – E1, E2, E3, E4, E5, E6 and E7. Occurrence of each event

depends on the time since last purchase (transaction gap) on any particular card. All the

events are already defined according to the equation (7.1) to (7.7).

The Event E is the union of all the seven events E1, E2, E3, E4, E5, E6 and E7 such that:

7

1

( ) ( ) 1i

i

P E P E=

= =∑ (7.21)


195

Now Compute P(Ei| f ) and P(Ei| f ) from the normal transaction set of that card holder

and generic fraud transactions set. P(Ei| f ) measures the probability of occurrence of Ei

given that a transaction is originating from a fraudster and P(Ei| f ) measures the

probability of occurrence of Ei given that it is genuine. The likelihood functions P(Ei| f )

and P(Ei| f ) are given by the following equations.

#( )( | )#( )

ii

Occurences of E in fraud transaction setP E fTransactions in fraud transaction set

= (7.22)

#( )( | )#( )

i k

k

Occurences of E on C of normal transaction setP Ei fTransactions on C in normal transaction set

= (7.23)

Using equations (7.22) and (7.23), P(Ei) can be computed as follows:

( ) ( | )* ( ) ( | )* ( )i i iP E P E f P f P E f P f= + (7.24) 7.2.5 Bayesian learner

Bayesian learning is a tool to measure evidences supporting alternative hypothesis and

arrive at optimal decisions. The general idea of belief revision is that, whenever new

information becomes available, it may require updating of prior beliefs. Initial risk score

is generated in the range 0 to 1 and is considered as prior probability. Bayes rule gives the

mathematical formula for belief revision, which can be expressed as follows:

( | )* ( )( | )( )

ii

i

P E f P fP f EP E

= (7.25)

By substituting equation (7.24) in equation (7.25) we get:


196

( | )* ( )( | )( | )* ( ) ( | )* ( )

ii

i i

P E f P fP f EP E f P f P E f P f

=+

(7.26)

We use Bayesian learning once the transaction is found suspicious in the light of the new

evidence Ei. Ψ is the probability that the current transaction is fraudulent.

The credit card fraud detection problem has the following two hypothesis: f : fraud and

f : fraud¬ . By substituting the values obtained from equations (7.22), (7.23) in (7.26),

the posterior probability for hypothesis f : fraud is given as:

( | ) * ( )( | )( | ) * ( ) ( | ) * ( )

ii

i i

P E fraud P fraudP fraud EP E fraud P fraud P E fraud P fraud

=+ ¬ ¬

(7.27)

Similarly, the posterior probability for hypothesis f : fraud¬ is given as:

( | ) * ( )( | )( | ) * ( ) ( | ) * ( )

ii

i i

P E fraud P fraudP fraud EP E fraud P fraud P E fraud P fraud

¬ ¬¬ =

¬ ¬ + (7.28)

Depending on which of the two posterior values is greater, future actions are decided by

the TRSGM.

TRSGM is based on the following algorithm.

7.3 ALGORITHM

The working principle of the proposed TRSGM is presented in Algorithm 7.1. It takes the

transaction parameters – card id, transaction amount, product, product category, shipping

address, location id from where transaction is performed and transaction day type(

working day or normal day) as well as design parameters - ε , MinPts and Wi (Weightage

of the parameter Pi) as input.


197

An incoming transaction is first checked for the address mismatch. If shipping address

and billing address is found same, then the transaction is considered to be genuine and is

approved and no other check is performed. The incoming transaction amount is checked

with the clusters formed by DBSCAN algorithm for its coverage. If coverage is found to

be more than 10%, then the transaction is considered to be genuine and is approved and

no other check is performed with the transaction. Then the linear equation of the patterns

generated by TPGT along with its weightage( Wi) generates a risk score for the

transaction. If the risk score < 0.5, the transaction is considered to be genuine and is

approved. On the other hand, if risk score > 0.8 then the transaction is declared to be

fraudulent and manual confirmation is made with the cardholder. In case 0.5 ≤ risk score

≤ 0.8, the transaction is allowed but the card Ck is labeled as suspicious. If this is the first

suspicious transaction on this card, the field suspect_count is incremented to 1 for this

card number in a suspect table. The TRSGM then waits until the next transaction occurs

on the same card number.

When the next transaction occurs on the same card Ck, it is also passed to the TRSGM.

The first four components of the TRSGM again generate a risk score to the transaction. In

case the transaction is found to be suspicious, the following events take place. Since each

transaction is time stamped, from the time gap g between the current and the last

transaction, the TRSGM determines which event E has occurred out of the seven Ei’s and

retrieves the corresponding ( | )iP E f and ( | )P Ei f . The posterior probabilities ( | )iP f E

and ( | )P f E are next computed using Eqs. (7.14) and (7.15). If ( | )iP f E > ( | )P f E

then the transaction is declared to be fraudulent and if ( | )P f E > ( | )iP f E then the

transaction is declared to be genuine. The flow of the events of the proposed financial

cyber crime detection system is shown in the figure 7.1.


198

AlGORITHM 7.1:

Input: Ck, Tamount(i), Saddr, Location, ε , MinPts, categoryi, producti,selleri, day_type, Wi,

no_of_products // ( No of the products customer has ordered online)

Tamount_daily ; // It stores total amount of current day purchase and update table

customer_dailycount accordingly

Ttotal_daily ; // It stores total number of current day transactions and update table

customer_dailycount accordingly

Tamount_weekly; // It stores total amount of current week purchase and update table

customer_weeklycount accordingly

Ttotal_weekly ; // It stores total number of current week transactions and update table

customer_weeklycount accordingly

Tamount_fortnightly ; // It stores total amount of current fortnight transactions and update table

customer_fortnightlycount accordingly

Ttotal_fortnightly ; // It stores total number of current fortnight transactions and update table

customer_fortnightlycount accordingly

Tamount_monthly; // It stores total amount of current month transactions and update table

customer_monthlycount accordingly

Ttotal_monthly ; // It stores total number of current month transactions and update table

customer_monthlycount accordingly

Tamount_sunday ; // It stores total amount of current day(if Sunday) purchase and update

table customer_sundaycount accordingly

Ttotal_sunday ; // It stores total number of current day(if Sunday) transactions and update

table customer_dailycount accordingly

Tamount_holiday ; // It stores total amount of current day(if holiday) purchase and update


Ttotal_holiday ; // It stores total number of current day(if holiday) transactions and update


Ψ = 0

trans_amount = 0


199

i = 1

while ( i <= no_of_products)

loop

Input category_id(i), product_id(i), Tamount(i), seller_id(i)

trans_amount := trans_amount + Tamount(i)

i := i + 1

end loop;

If Baddr = Saddr then

risk_score Ψ =0;

Output(“Genuine”) // The transaction is approved

End if

If Baddr ≠ Saddr then

Call Transaction_Pattern_Generation_Tool;

If AP1 > 0 then // AP1: No of transactions shipped with the same shipping

address

risk_score Ψ =0;

output(“Genuine”) // The transaction is approved

else

If current_day is running then

If current_week is running then

If current_fortnight is running then

If current_month is running then

Clusteri=DBSCAN_Algorithm(trans_amount,ε ,MinPts);// Number of clusters

Found by this algorithm

count_percen=Cluster_coverage(Clusteri, trans_amount );


200

If count_percen >= 10 then


else

risk_score Ψ = generate_and_update_risk_score_1 ( LP ); // LP: Location

Parameters

// Using Eq. (7.11)

risk_score Ψ = generate_and_update_risk_score_2( CP ); // CP: Category

Parameters

// Using Eq. (7.12)

risk_score Ψ = generate_and_update_risk_score_3 (PP); // PP: Product

Parameters

// Using Eq. (7.17)

risk_score Ψ = generate_and_update_risk_score_4 (TP);//TP: Transaction

Parameters

//Using Eqs. (7.12),(7.13),(7.15),(7.18) and (7.19)

risk_score Ψ = generate_and_update_risk_score_5 (VP);

// VP: Vendor(Seller) Parameters

// Using Eq. (7.16)

If (day_type is Sunday) then

risk_score Ψ = generate_and_update_risk_score_6 (SP);//SP: Sunday

Parameters

Tamount_sunday=Tamount_sunday + Tamount;

Ttotal_sunday=Ttotal_sunday + 1;

Update_customer_sundaycount_table(Tamount_sunday, Ttotal_sunday );

End if; // End of Sunday

// At the end of day, trigger is automatically executed and update

Table customer_sundaycount(Tamount_sunday=0, Ttotal_sunday=0 )


201

If (day_type is Holiday) then

risk_score Ψ = generate_and_update_risk_score_7 (HP);//HP: Holiday

Parameters

Tamount_holiday=Tamount_holily + Tamount;

Ttotal_holiday=Ttotal_holiday + 1;

Update_customer_holidaycount_table(Tamount_holiday, Ttotal_holiday );

End if; // End of Holiday


Table customer_holidaycount(Tamount_holiday=0, Ttotal_holiday=0 )

Tamount_daily=Tamount_daily + Tamount;

Ttotal_daily=Ttotal_daily + 1;

Update_customer_daily_count_table(Tamount_daily, Ttotal_daily );

End if; // End of current day

risk_score Ψ = generate_and_update_risk_score_8 (DP);//DP: Daily

Parameters


Table customer_dailycount(Tamount_daily=0, Ttotal_daily=0 )

Tamount_weekly=Tamount_weekly + Tamount;

Ttotal_weekly=Ttotal_weekly + 1;

Update_customer_weekly_count_table(Tamount_weekly, Ttotal_weekly );

End if; // End of current week

risk_score Ψ = generate_and_update_risk_score_9 (WP);

//WP: Weekly Parameters

// At the end of week, trigger is automatically executed and update

table customer_weeklycount(Tamount_weekly=0, Ttotal_weekly=0 )

Tamount_fortnightly=Tamount_fortnightly + Tamount;

Ttotal_fortnightly=Ttotal_fortnightly + 1;


202

Update_customer_fortnightlycount_table(Tamount_fortnightly, Ttotal_fortnightly );

End if; // End of current fortnight

risk_score Ψ = generate_and_update_risk_score_10 (FP);

//FP: Fortnightly Parameters

// At the end of fortnight, trigger is automatically executed and update

table customer_fortnightlycount(Tamount_fortnightly=0, Ttotal_fortnightly=0 )

Tamount_monthly=Tamount_monthly + Tamount;

Ttotal_monthly=Ttotal_monthly + 1;

Update_customer_monthly_count_table(Tamount_monthly, Ttotal_monthly );

End if; // End of current month

risk_score Ψ = generate_and_update_risk_score_11 (MP);

//MP: Monthly Parameters

// At the end of month, trigger is automatically executed and update

Table customer_monthlycount(Tamount_monthly=0, Ttotal_monthly=0 )

If (Ψ < 0.5) then


else if (Ψ > 0.8) then

output(“Fraudulent”) // Check with customer

if (transaction verified to be fraudulent) then

block_card(Ck);

end if;

else

if (suspect_count =0) then // Returns true if the suspect_count field of

suspect table is zero

suspect_count ++; // Update suspect_count for card Ck in suspect table

wait for the next transaction on the card Ck;

else

E=find_event(g); // Using Eqs. (7.1),(7.2),(7.3),(7.4),(7.5) ,(7.6) and (7.7)

Ef=compute_event_probf(E); // Using Eq. (7.22) and generic fraud table


203

fE =compute_event_prob f (E); // Using Eq. (7.23) and GP: Transaction

Gap Parameters

Posteriorf = compute_posterior_probf (Ψ , Ef , fE ); // Using Eq. (7.27)

Posterior f = compute_posterior_prob f (Ψ , Ef , fE ); // Using Eq. (7.28)

If (Posteriorf > Posterior f ) then

output(“Fraudulent”) // Check with customer

if (transaction verified to be fraudulent) then

block_card(Ck);

end if;

else

output(“Genuine”);

suspect_count := 0; // Update suspect_count for card Ck in suspect

table

End if;

Wait for the next transaction on the card Ck;

End if;

End if;

End if;

If (All the transactions of current month are found to be genuine) then

Store them in the data warehouse;

End if;

7.3.1 Description of data structure used in the algorithm

Variable Meaning

Ck Current online transaction is performed on a card Ck

Tamount Purchase amount of current online transaction of each product

Trans_amount Total purchase amount of all the products of current online transaction


204

Baddr Billing Address of the customer, which is given by customer while

opening an account

Saddr Shipping address given by the customer while performing online

Transaction, where he wants his item to be shipped

Product_idi It shows customer has purchased which kind of products

Category_idi It shows customer has purchased which kind of category

Seller_idi It shows customer has performed transaction with which seller

ε ε -neighborhood or neighborhood of a point is the set of points within

distance of ε

MinPts Minimum number of points in any cluster

day_type The model keeps track of current day is holiday or not by using this

variable ( 0: Working Day, 1: Holiday)

Tamount_daily It has total amount of current day purchase till the current day is

completed

Ttotal_daily It stores total number of current day transactions till the current day is

completed

Tamount_weekly It stores total amount of current week purchase till the current week is

completed

Ttotal_weekly It stores total number of current week transactions till the current week

is completed

Tamount_fortnightly It stores total amount of current fortnight transactions till the current

fortnight is completed

Ttotal_fortnightly It stores total number of current fortnight transactions till the current

fortnight is completed

Tamount_monthly It stores total amount of current month transactions till the current

month is completed

Ttotal_monthly It stores total number of current month transactions till the current

month is completed

Tamount_sunday It stores total amount of current day(if Sunday) purchase till the

current day completed


205

Ttotal_Sunday It stores total number of current day(if Sunday) transactions till the

current day is completed

Tamount_holiday It stores total amount of current day(if holiday) purchase till the


Ttotal_holiday It stores total number of current day(if holiday) transactions till the


Ψ Risk score generated by the model

Clusteri It indicates the particular cluster formed by DBSCAN Algorithm

g Transaction gap e.g. Number of hours since the last transaction on the

same card

E Model finds event based on eqs. From (7.1) to (7.6)

Ef Probability of event E coming from fraudulent transaction set

fE Probability of event E coming from normal transaction set

Posteriorf Posterior probability of event E that it is fraudulent

Posterior f Posterior probability of event E that it is genuine

Suspect_count If the current transaction is found suspicious, then the value of

suspect_count is incremented to 1 and system waits for the next

transaction.

Wi Weightage of the parameter

7.3.2 Description of the algorithm

• First Algorithm checks the shipping address entered by the customer with the

billing address given by the customer while performing online transaction, If both

are same then it considers the transaction highly genuine and generate risk score

0.

• If shipping address is different than billing address, then algorithm checks the

parameter AP1 generated by TPGT to check whether the past transactions are

successfully performed on the same shipping address. If products are successfully

shipped on the current shipping address, then also it considers the transaction

highly genuine and generate risk score 0.


206

• If this is the first shipping address on which current transaction is going to

perform then the algorithm proceeds further and generates the risk score based on

how far or close the current transaction is to all his past transactions.

• DBSCAN algorithm creates different clusters based on transaction amounts, ε

and MinPts. Cluster_Coverage() function finds the percentage of coverage of

current transaction amount in the cluster. If this value is greater than 10 then the

algorithm believes that this is the regular transaction performed by customer and

predicts the current transaction as highly genuine and generate risk score 0 for it.

• If the risk score generated by the algorithm applying all the parameters of TPGT

is less than 0.5, then the transaction is considered genuine by the TRSGM.

• If the risk score is greater than or equal to 0.8, then the transaction is considered

fraudulent by the TRSGM.

• If the risk score is between 0.5 and 0.8, then transaction is considered as

suspicious transaction and the value of suspect_count field is incremented to 1 for

this card Ck in suspect table. It waits for the next transaction on same card. If the

risk score of next transaction is less than 0.5, then it is declared as genuine and the

value of suspect_count field is set to 0. If the risk score of the next transaction is

greater than or equal to 0.8, then it is declared as fraudulent and verified by

confirming the customer. If the risk score of next transaction is again found

between 0.5 and 0.8 (i.e. found suspicious) then the event E will be decided by

find_event() function using equations from (7.1) to (7.7). Then the probability of

Event E Ef ( Probability that Event E is occurred from generic fraud transaction

set) and fE (Probability that Event E is occurred from normal transaction set) is

computed. Finally Posterior probability posteriorf (Probability that the transaction

is fraudulent) and Posterior f (Probability that the transaction is genuine) is

computed. If posteriorf is greater than Posterior f , then the transaction is

considered fraudulent and verified by contacting the customer. If posterior f is

greater than posteriorf , then the transaction is declared as genuine transaction by

the model.

• If the current day is Sunday then the current transaction is compared with Sunday

Parameters (SP) generated by TPGT and risk score is generated accordingly. If


207

customer subsequently performs transactions on this day, then total amount of

purchase and total number of transactions are updated in the table

customer_sundaycount accordingly. Each time risk score is generated if the

customer performs transaction on this day. The next day value of these two fields

is reset with 0 automatically by trigger.

• If the current day is holiday which is tracked by the variable day_type, then the

transaction is matched with the Holiday Parameters (HP) and risk score is

generated accordingly. If the subsequent transactions are performed by the

customer on this day, then each time the table customer_holidaycount is updated

and risk score is generated accordingly. On the next day the value of this table is

reset 0.

• Whatever the transactions are performed by the customer during the whole day,

they all are updated in the table customer_dailycount. For each new transaction on

day, the table is updated and risk score is generated accordingly using Daily

Parameters (DP) of TPGT. On the next day the trigger is automatically executed,

and set the value of the two fields of table to zero.

• The transactions of the current week are recorded in the table

customer_weeklycount. For each new transaction of the current week, this table is

updated and risk score is generated accordingly using Weekly Parameters (WP).

After the completion of current week, the trigger set the value of this table to zero.

• Whenever customer performs a transaction, its value is updated in the table

customer_fortnightlycount table till the current 15 days i.e. fortnight is not

completed. Each time risk score is also generated using parameters FP. At the end

of 15 days, the value of the fields of this table is automatically set to zero by the

trigger.

• All the transactions performed by the customer during the current month are

updated in the table customer_monthlycount. For each subsequent transaction

during the month, the risk score is generated using the monthly parameters to find

the deviation of customer monthly behavior. After the completion of current

month, the value of Tamount_monthly and Ttotal_monthly of this table is set to zero. If all


208

the transactions are found genuine then they are stored in the data warehouse, so

the next parameters are generated accordingly by TPGT.

• Here the block diagram of the proposed financial cyber crime detection system is

shown in the figure 7.1 which is the brief pictorial representation of the algorithm

7.1.


209

Data Warehouse

TPGT

TRSGM

Patterns

Current Transaction

Risk Score (0-1)

< 0.5 Yes No

No Genuine Transaction Yes

< 0.8

Suspicious Transaction

Past Transactions

Fraudulent Transaction

Update suspect_count field in Suspect table

Wait for the next transaction on the same card

For next transaction found suspicious Event Ei occurs

Bayesian Learner Analysis

Po.f > Po. f Yes

Fraudulent Transaction

No

Genuine Transaction

Figure 7.1 Block Diagram of Proposed Financial Cyber Crime Detection System


210

7.4 GRAPHS OF CLUSTER FORMATION BY DBSCAN ALGORITHM Here we have generated the scatter graphs of the different clusters formed by the

DBSCAN algorithm by taking transaction amount attribute for the various customers. In

all the examples ε =500 and MinPts=5 was taken.

Cluster formation by DBSCAN Algorithm

00.5

11.5

22.5

33.5

0 500 1000 1500 2000 2500

Transaction Amount

Clu

ster

Num

ber

Figure 7.2 Graph of clusters formed by DBSCAN algorithm for Card id=1

Cluster Formaton by DBSCAN Algorithm

01234567

0 1000 2000 3000 4000 5000 6000 7000

Transaction Amount

Clu

ster

Num

ber



211

Cluster Formation by DBSCAN Algorithm

0

2

4

6

8

0 1000 2000 3000 4000 5000 6000 7000 8000

Transaction Amount

Clu

ster

Num

ber


Cluster Formation by DBSCAN Algorithm

0

2

4

6

8

10

0 1000 2000 3000 4000 5000 6000 7000 8000

Transaction Amount

Clu

ster

Num

ber



212

Here a result is shown of the clusters formed by DBSCAN algorithm implemented in the

data mining application for various transaction amounts spend by the customer having

card id 1507.

Figure 7.6 Sample output of Clusters formed by DBSCAN Algorithm – I

Figure 7.7 Sample output of Clusters formed by DBSCAN Algorithm - II


213

Figure 7.8 Sample output of Clusters formed by DBSCAN Algorithm - III

Figure 7.9 Sample output of Clusters formed by DBSCAN Algorithm - IV


214

Figure 7.10 Sample output of Clusters formed by DBSCAN Algorithm - V

Figure 7.11 Sample output of Clusters formed by DBSCAN Algorithm - VI


215

Figure 7.12 Sample output of Clusters formed by DBSCAN Algorithm - VII

Figure 7.13 Sample output of Clusters formed by DBSCAN Algorithm - VIII


216

Figure 7.14 Sample output of Clusters formed by DBSCAN Algorithm - IX

Figure 7.15 Sample output of Clusters formed by DBSCAN Algorithm - X


217

Figure 7.16 Sample output of Clusters formed by DBSCAN Algorithm - XI

Figure 7.17 Sample output of Clusters formed by DBSCAN Algorithm - XII


218

7.5 IMPLEMENTATION ENVIRONMENT

The implementation of FCDS has been done in Oracle 9i. The data warehouse is

designed and implemented in oracle 9i, which consists of a number of tables, as shown in

the Chapter 6. Descriptions of all the tables are also shown in the same chapter. Lookup

tables are designed to store the current spending behavior of the customer. Current online

transaction is given as input to the FCDS. Linear equation along with the rules

implemented in the TRSGM generates a risk score for this transaction.

Stored procedures, functions, packages and triggers were written to facilitate the

functioning of the setup. These were used to check the deviation of each transaction from

the customer’s normal profile.

7.5.1 Lookup tables auto updation

The following trigger is automatically executed when logging into the system and

updates all the lookup tables according to their specified time duration.

create or replace trigger logon_update_trigger

after logon on database

call logrecordproc

The procedure is as below.

create or replace procedure logrecordproc is

today date;

previousday date;

begin

select logon_day into previousday from user_log_master;

commit;


219

insert into user_log_master values(user,sysdate);

select sysdate into today from dual;

if today > previousday then

update customer_dailycount set transcount=0,amount=0;

end if;

if today > (previousday + 6) then

update customer_weeklycount set transcount=0,amount=0;

end if;

if today > (previousday + 14) then

update customer_fortnightlycount set transcount=0,amount=0;

end if;

if today > (previousday +29) then

update customer_monthlycount set transcount=0,amount=0;

end if;

end;

7.5.2 Inter Transaction Gap Recording

As we discussed in the chapter 6, TPGT generates the parameters GP1 to GP7 for inter

transaction gap (time duration between each two successive transactions on the same

card). For this the following procedure time_previous_transaction( ) is implemented in

the data mining application.


220

/* This procedure finds the time difference in days, hours, minutes and seconds between

each two successive transactions. */

PROCEDURE time_previous_transaction(a_array1 IN tpg_date_array,time_diff out

tpg_array,days out tpg_array,hrs out tpg_array,mins out tpg_array,secs out tpg_array) is

hrs_frac number(12,6);

mins_frac number(12,6);

secs_frac number(12,6);

hrs_int number(12,6);

mins_int number(12,6);

secs_int number(12,6);

hrs_full number(12,6);

mins_full number(10,2);

secs_full number(12,6);

index_time number(7):=1;

BEGIN

for i in 2..a_array1.LAST

LOOP

select (a_array1(i) - a_array1(i-1)) into time_diff(index_time) from

dual;

SELECT floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600),

floor((((a_array1(i)-a_array1(i-1))*24*60*60) -

floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600)/60),

round((((a_array1(i)-a_array1(i-1))*24*60*60) -

floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600 -

(floor((((a_array1(i)-a_array1(i-1))*24*60*60) -

floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600)/60)*60) ))

into hrs_frac,mins_frac,secs_frac

FROM dual;


221

days(index_time):=floor(hrs_frac/24);

hrs_int:=hrs_frac/24-floor(hrs_frac/24);

hrs_full:=floor(hrs_int*24);

mins_full:=floor((hrs_full - floor(hrs_full))*60);

hrs(index_time):=hrs_full;

secs_full:=floor((mins_full - floor(mins_full))*60);

mins(index_time) := mins_full + mins_frac;

secs(index_time):=secs_full + secs_frac;

while hrs(index_time) >= 24

LOOP

days(index_time) := days(index_time) + 1;

hrs(index_time) := hrs(index_time) - 24;

END LOOP;

while mins(index_time) >= 60

LOOP

hrs(index_time) := hrs(index_time) + 1;

mins(index_time) := mins(index_time) - 60;

END LOOP;

if secs(index_time) >= 60 then

mins(index_time):=mins(index_time)+1;

secs(index_time):=secs(index_time)-60;

end if;

index_time := index_time + 1;

END LOOP;

END;


222

7.5.3 Maximum Value Finding

To find the maximum value from given list of array find_maximum() function is

implemented in the data mining application. This function is called several times to find

the maximum amount of transaction and maximum number of transactions.

/* This function finds the maximum value from the given array. */

FUNCTION find_maximum(a_array1 IN tpg_array) RETURN number IS

max_value number(12,2);

BEGIN

max_value:=0;

for m in a_array1.FIRST .. a_array1.LAST

loop

if a_array1(m) > max_value then

max_value:=a_array1(m);

end if;

end loop;

return max_value;

END;

In this way several procedures, functions, triggers, packages are implemented in the data

mining application.


223

7.6 SAMPLE RESULTS

7.6.1 Genuine Transaction

Figure 7.18 Sample output of Data Mining Application for Genuine Transaction - I

Figure 7.19 Sample output of Data Mining Application for Genuine Transaction - II


224

Figure 7.20 Sample output of Data Mining Application for Genuine Transaction -III


225

7.6.2 Fraudulent Transaction

Figure 7.21 Sample output of Data Mining Application for Fraudulent Transaction

–I

Figure 7.22 Sample output of Data Mining Application for Fraudulent Transaction -

II


226

Figure 7.23 Sample output of Data Mining Application for Fraudulent Transaction -

III


227

7.6.3 Suspicious transaction

Here a sample of suspicious transaction is shown along with the probability of the

transaction being genuine or fraudulent. The snapshot of table suspect is also shown

where the filed suspect_count is incremented.

Figure 7.24 Sample output of Data Mining Application for Suspicious Transaction -

I


228

Figure 7.25 Sample output of Data Mining Application for Suspicious Transaction –

II

Figure 7.26 Sample output of Data Mining Application for Suspicious Transaction -

III


229

7.6.4 Multiple product order support

Figure 7.27 Sample output of Data Mining Application for Multiple Order Product

Support - I


Support - II


230


Support - III


231

7.7 RESULT ANALYSIS & DISCUSSIONS

• The most interesting result of the TRSGM is that the risk score generated by it is very

dynamic. i.e. If the customer make any purchase and there is a very minor change in

transaction amount and keeping all other inputs same, then also risk score generated

is different. This minor change would also be reflected in the risk score. We have run

the application several times for different transaction amounts with slight variation

and keeping all other inputs fix and same. Before taking the result for second time

and onwards, we have also reset all the lookup tables. Here is an example.

For the input Card Id : 125

Category Id: 6

Product Id : 60050

Seller Id : 750

Shipping Id : 410

Location Id : 980

The output of risk score is as below.

Table 7.2 Sample output of the application for different transaction amounts

Amount 5000 50001 5002 5003

Risk Score 0.29277495 0.29277506 0.29277513 0.29287596


232

Figure 7.30 Sample output of Data Mining Application for different transaction

amounts – I


amounts - II


233


amounts - III


amounts - IV


234

In the same way, we have changed the different sellers for the same product, category,

amount, shipping address and location address. It is observed that this change would also

reflect in the risk score. Here is an example.


Category Id: 5

Product Id : 50010

Amount : 4500

Shipping Id : 590

Location Id : 110


Table 7.3 Sample output of the application for different sellers

Seller Id 801 587 986 30

Risk Score 0.28811595 0.28780605 0.28811623 0.28802632


235

Figure 7.34 Sample output of Data Mining Application for different sellers - I

Figure 7.35 Sample output of Data Mining Application for different sellers - II


236

Figure 7.36 Sample output of Data Mining Application for different sellers - III

Figure 7.37 Sample output of Data Mining Application for different sellers - IV


237

We have also checked if the customer purchases the same product, category, amount,

seller, shipping address on different location, then its change is reflected in the risk

score. Here is an example.


Category Id: 7

Product Id : 70150

Amount : 5300

Shipping Id : 1596

Seller Id : 110


Table 7.4 Sample output of the application for different locations

Location Id 351 352 353 354

Risk Score 0.25453484 0.24608244 0.22881003 0.2497576


238

Figure 7.38 Sample output of Data Mining Application for different locations - I

Figure 7.39 Sample output of Data Mining Application for different locations-II


239

Figure 7.40 Sample output of Data Mining Application for different locations - III

Figure 7.41 Sample output of Data Mining Application for different locations - IV


240

• The application finds the cluster coverage of each new incoming transaction

amount and if it is greater than 10% then model assumes that it is a genuine

transaction considering the regular payment of the customer. So the application

generates 0 risk score for the transaction. Here is an example.

Figure 7.42 Sample output of Data Mining Application for Cluster Coverage

• The author has extensively run the applications and check that the transaction

,which is the closely met by the customer purchasing habit (i.e. maximum

purchase in this category, maximum number of transactions in this time frame,

maximum number of transactions ordered from the same location etc.), generates

a least score. The transaction, which does not fall into customer purchasing habit

and more deviation than the normal profile, generate more risk score. Here is an

example. As more and more transactions performed within this particular set,

more risk score decreased.


241

The customer having cardid 1570 has the maximum purchasing habit in the given

field as below.

Category : 2

Time frame : 18:01 to 21:00

Location Id : 205

Seller Id : 257

Figure 7.43 Sample output of Data Mining Application for maximum

purchasing habit input - I


242


purchasing habit input - II


purchasing habit input - III


243

• In the domain of credit card fraud detection, the system should not raise too many

false alarms (i.e. genuine transactions should not be caught as fraudulent

transactions) because a credit card company needs to minimize its losses but, at

the same time, does not wish the cardholder to feel restricted too often. In the

same way, fraudulent transactions should also not get undetected. Considering

both of these matters, the model is designed flexible. Here we have taken upper

threshold value 0.8, but with more learning it can be changed. All the parameters’

weightage is also set according to the recommendation of credit card company.

• There is one interesting result by Bayesian learning. The customer having card id

8, first performs the transaction of 17000 is considered suspicious. After short

while he performs another transaction which is 13500, predicted as fraudulent by

Bayesian learning. Once transaction found suspicious, time duration since last

transaction is also stored in the table suspect. So if we consider both the

transactions as individual then they are seemed to be normal, but it is power of

Bayesian learning that occurrence of subsequent transaction after the first

transaction is predicted as fraudulent. Here is an example.

Figure 7.46 Sample output of Data Mining Application for Bayesian

Learning - I


244

Figure 7.47 Sample output of Data Mining Application for Bayesian Learning-II

7.8 REFERENCES







[4] Ivan Bayross – SQL, PL/SQL The Programming Language of Oracle, BPB

Publications, ISBN 81-7656-964-X

[5] Nilesh Shah – Database Systems Using Oracle, Prentice Hall of India, ISBN: 81-

203-2147-2

245

CHAPTER 8

PROPOSED FINANCIAL CYBER CRIME PREVENTION MODEL & CONCLUSION

8.1 PROPOSED FINANCIAL CYBER CRIME PREVENTION MODEL

8.2 FEATURES OF DEVELOPED DATA MINING APPLICATION SOFTWARE

8.3 SIGNIFICANCE OF THE RESEARCH

8.4 LIMITATION OF THE STUDY

8.5 FUTURE SCOPE OF THE RESEARCH

8.6 REFERENCES

As we discussed in Chapter 1, for financial cyber crime prevention different methods like

First Virtual, Cyber Cash and SET are used. These systems are highly secure but are

rarely used by customers and merchants. These models secure our transaction over

internet but cannot stop any forgery if credit card information is lost physically or when

customer gives his information in wrong hands.

Anshul Jain et al. [1] have given an Internet Virtual Credit Card Model. In this model, a

login id and password will be given by the bank. Then After logging into bank’s website

a virtual credit card number and expiry date of this virtual credit card will be issued by

the bank. So the customer has to give and remember four details like login id, password,

virtual credit card number and expiry date of this virtual card while performing the online

transaction. In my opinion, it will create overhead on customer and extra burden of

remembering these additional details.

Recently in India Reserve Bank of India has made a mandate for all the banks to issue

separate password to their credit card holders for online transaction. In other countries

this tactics is already being used. In my opinion, this tactics is not enough to prevent

fraud as the first transaction is highly secure, but the subsequent transactions we can not

Chapter 8: Proposed Financial Cyber Crime Prevention Model & Conclusion

246

surely consider as highly secure, because while performing the first transaction by the

customer, the password can be stolen by fraudster by hacking the computer or any other

tactics. Also the card holder is not given any control or flexibility to their own end to

prevent the fraud.

Considering all the limitations of above models, the financial cyber crime prevention

model has been proposed.

8.1 PROPOSED FINANCIAL CYBER CRIME PREVENTION MODEL

In this model, not only separate password for online transaction is given to credit card

holder, but also the validity of this password is given to the card holder according to his

choice. The customer has to log into bank website. Then he can set his password along

with the expiry date of this password for the online transaction. So whenever customer

performs online transaction, then he requires the password to complete the transaction.

The model checks the validity of the password; if the password expires then he is not able

to complete the transaction. If he is the genuine card holder, then he has to log in bank

website and set password and expiry date for this password.

So in this model the password remain valid only till the completion of expiry date. When

the password expires, the customer has to again obtain password along with its validity

from the bank. The expiry date selected by the customer must be between the present date

and actual expiry date of the card.

So here the control and flexibility is given to the customer to their own end. They can

give expiry date of password according to their convenience and keeping in mind of

avoiding the forgery. As we compare with Internet Virtual Credit Card Model, the

customer can not remember the virtual credit card number easily, as it is long and it is

issued by bank so he has to store it somewhere. While in our model, the password will be

given by the user, so it is user defined word, so he can easily keep in mind. Customers,

who transact very often, could give the expiry date of their password very short, in order


247

to avoid forgery. He can also give the expiry date of password such that it would expire

on the next day. So the customer can himself make his each online transaction very safe.

Customers who do not transact very often or consider this as overhead, can select long

expiry date. Whenever financial cyber crime increased drastically in particular month,

then user can give shorter expiry date to avoid forgery.

Thus the benefit of this model is that user can temporarily suspend his credit or debit card

by giving short expiry date as when he fears that his information may be stolen or when a

time comes that huge amount of cyber crime cases drastically increased. So no one can

use his or her credit card information for online purchases.

8.2 FEATURES OF DEVELOPED DATA MINING APPLICATION SOFTWARE

• Transaction Differentiation: The software itself differentiates the transaction by

looking at transaction date and time. E.g. if the purchase is made by the same

customer of the three different products at the same time, then it is considered as

only one transaction and summation of these three products is computed by the

software. If two purchases is made on same date but at different time, it is

considered as two transactions performed by the customer.

• Accuracy: It is the main feature of this software. All the computations and

calculations are performed very accurately by this software. The author has taken

proper care of the accuracy of the software by debugging and taking several

sample input as the result is depending on this calculation.

• Speed: Software is tested on huge scale data and its speed of execution is

checked. The software generates the risk score very fast despite of huge data. So

the huge data does not slow down the software.

• Dynamic risk score: The risk score generated by the software is very dynamic. If

there is a change of only 1% in the input, it would be reflected in the risk score.

• Live data: The application is implemented on pure live data. It extracts data from

the properly designed data warehouse.


248

• Global Application: The application is not designed and implemented keeping

view of only one country. The customer can perform online transaction from

outside of the country where the application software resides on the server. In this

case, the application automatically maps the time of the server to the country

where the transaction has been performed. The function conver_time( ) has been

implemented in the application for doing the same. It converts a time zone of one

city to another city.

• Pattern Generation: The software generates more than 60 patterns for each

customer, which collectively decides the normal profile of the customer.

• Lookup tables: Several lookup tables are designed in the database to monitor the

current spending behavior of the customer. So whenever deviation occurs than

normal behavior, software can raise the alarm.

• Inter transaction time gap recording: The time difference between each two

successive transaction on the same card is monitored and stored by the software to

capture the frequency of card use.

• Bayes rule: It is implemented in the application, which will calculate the

posterior probability of the transaction for whether it is performed by genuine

customer or fraudster.

• Clustering: A clustering technique DBSCAN algorithm is implemented in the

application for the various transaction amounts spend by the customer. Cluster

coverage of incoming transaction amount is also calculated by the software to

decide the sensitivity of the transaction.

• Type of the day: The software monitors and records the transactions separately

whether it is performed on normal working day, Sunday or holiday. Accordingly

it generates a risk score based on past spending behavior of the customer on the

similar types of day.

• Time stamping: Each transaction’s date and time is monitored and recorded by

the software. So for new incoming transaction, the software monitors how far or

close the transaction is to the customer’s purchasing time habit and generates a

risk score accordingly.


249

• Transaction counting: The software automatically calculates the total number of

transactions performed on each day, week, fortnight and month for every

customer. It also calculates the total number of transactions for Sunday and

holiday separately.

• Linear equation: A linear equation with the patterns generated by TPGT is

implemented in the software, which will generate a risk score to decide the

sensitivity of the transaction.

• Triggers: There are automatically executed triggers implemented in the software.

It updates the lookup tables accordingly. As for example, after the completion of

current day, the trigger set the values of total transaction amount and total number

of transaction of the table customer_dailycount to zero. Same actions are

performed for other lookup tables.

• Multiple product order support: The software is not designed and implemented

keeping view of only one product order. The customer can purchase more than

one product as a cart on the same time. The software can take input of all the

products purchased and generates a risk score by considering all the products with

their category.

• Rules: As discussed previously in the chapter 7, all the rules which are generally

given as indication of fraudulent transactions on various web sites and news

papers for online transaction, are implemented as rules in the software.

• Location recording: The software takes the input as location from which the

current online transaction is performed, and calculates how many past

transactions are performed on the same location to decide the sensitivity of the

transaction.

• Transaction summation: Whenever multiple products are purchased by the

customer at the same time, then the software considers it as one individual

transaction and adds the price of all the products and takes it as transaction

amount.

• Sigmoid function: Sigmoid function is used in the linear function for the

parameters which can not represent its value in percentage. This function doesn’t


250

allow any parameter to increase its share in the final risk score as it maps the

value in the range [0, 1].

• Flexibility: With the consultation of the bank, the weightage of different

parameters are derived and implemented in the software. But the software is not

stick to this weightage only and it is flexible as we can change the weightage of

any parameter according to the recommendations of the credit card company.

8.3 SIGNIFICANCE OF THE RESEARCH

• The work is unique in nature as in modeling part incorporate data mining

techniques, statistics and artificial intelligence in a single platform. The work

explained in the thesis must be helpful for the researchers; especially literature

survey of data mining techniques is effort to provide roadmap to study and select

appropriate data mining technique before implementation for the researchers.

Also understanding of role of data mining in financial crime detection is useful

for developing other financial applications also.

• Though the application is implemented keeping view of online transactions, it can

also be used for credit card holders who are making offline transactions.

• Though we have developed a specific application, we feel that with minor

application-specific modifications, the present approach can be effectively used to

counter intrusion in other database applications as well.

• We have inquired almost all the banks in Gujarat, no bank is currently using any

kind of software for financial cyber crime detection. So our data mining software

become very useful for them.

8.4 LIMITATION OF THE STUDY

• The developed data mining application is used only for those customers who are

making credit card purchases frequently. It is not for those who transact once or

very less in a year. The model has to learn all the purchasing habits of the

customer, so for new incoming transaction it can predict properly. As more and


251

more transactions are performed by the customer, the model becomes stronger,

learns the customer behavior and predicts the transaction more accurately.

• The application is also not used for a new customer for the same above reason.

• Though the application is global and implemented keeping view of all the

countries, the parameter holiday is not the same for all the countries. So the

application requires minor changes to implement this change.

8.5 FUTURE SCOPE OF THE RESEARCH

• In the current work the location, where the customer performs the online

transaction is considered. The computer on which the online transaction is

performed is not taken into account, but in future work the IP Address can also be

considered and patterns can be generated for this IP address. The only problem to

consider is that IP address is not static, but dynamic. So care should be taken to

consider this as one parameter.

• It may be worthwhile to generate more parameters to closely match the

customer’s purchasing habits.

• More dynamic rules can be derived from the historical data and applied for the

initial belief.

• Full care has been taken to ensure that the research is designed and conducted to

achieve the research objectives. This is really a thrilling domain, in which one can

not stop and it requires constant refreshing to incorporate the dynamic changes

occurred as the real problems.

• Though data mining algorithm DBSCAN is implemented for only transaction

amount, it can be implemented for other attribute as well.

8.6 REFERENCES

[1] A.Jain, T.Sharma, Internet Virtual Credit Card Model,

http://www.profile.iiita.ac.in/ajain1_b04/ivccm.pdf

data mining techniques: study, analysis, prevention...

Documents