oracle8_3

8/22/2019 oracle8_3


The Role of Domain Knowledge in a Large-Scale Data Mining Project

Kopanas I., Avouris N., Daskalaki S.

University of Patras

University of Patras, HCI Group - SETN02

Outline of the talk

Knowledge in a DM process

Case study in a large DM project: prediction of customer insolvency in the telecommunications business

The role of domain expertise (and domain experts) in the process

Summary and conclusions

Data Mining

Evolution of knowledge-based systems

Key partners in Data Mining

Data analyst / statistician

Knowledge Engineer

Domain Expert

Role of domain knowledge in Data Mining

DM phases

(a) Problem definition

(b) Creating target data set

(c) Data pre-processing and transformation

(d) Feature and algorithm selection

(e) Data Mining

(f) Evaluation of learned knowledge

(g) Fielding the knowledge base

Case study: Prediction of Customer Insolvency in Telecommunications business

Predict which customers are about to become insolvent, that is, the customers who will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly avertive) measures

Problem Objectives

Detect as many insolvent customers as possible

Minimize false alarms (solvent customers classified as insolvent)

Case study: problem characteristics

Significant loss of revenue for the company

Human behavior is (generally) unpredictable

Insolvency cases are rare compared to non-insolvencies

Information can be retrieved only after processing huge amounts of data from several sources

The billing process (domain knowledge)

[Figure: timeline running from June to the following April, marking the billing period, issue of bill, due date, service interruption, and nullification]

Target data set definition (semantic value of data)

Data from 3 different cities (combination of rural, urban and tourist areas)

Types of data:

Customer data (coded)

Data from billing and payments

Call detail records (from switching centers)

Time span of data studied:

Cases of collected and uncollected bills (10/99-2/01)

Call records (8/99-12/00)

Data pre-processing (knowledge-based reduction of search space)

Eliminated inexpensive calls (< 0.3)

Synchronizing data

Removing noise

Missing values

Data aggregation by period

[Figure: data warehouse]
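The filtering and aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the record layout `(customer_id, call_date, units, cost)`, the function name `aggregate_calls`, and the sample calls are assumptions made here, and the slide does not state the currency unit of the 0.3 threshold.

```python
from collections import defaultdict
from datetime import date

MIN_CALL_COST = 0.3   # threshold from the slide; its currency unit is not stated
PERIOD_DAYS = 14      # two-week aggregation periods

def aggregate_calls(calls, start):
    """Drop inexpensive calls, then total units and call counts per
    (customer, period). Each call is (customer_id, call_date, units, cost)."""
    totals = defaultdict(lambda: {"units": 0, "count": 0})
    for customer_id, call_date, units, cost in calls:
        if cost < MIN_CALL_COST:
            continue  # knowledge-based reduction: cheap calls are ignored
        period = (call_date - start).days // PERIOD_DAYS
        totals[(customer_id, period)]["units"] += units
        totals[(customer_id, period)]["count"] += 1
    return dict(totals)

# Hypothetical call detail records, for illustration only.
calls = [
    ("c1", date(1999, 8, 2), 12, 0.9),
    ("c1", date(1999, 8, 3), 1, 0.1),    # below the threshold, dropped
    ("c1", date(1999, 8, 20), 30, 2.4),  # falls in the second period
    ("c2", date(1999, 8, 5), 7, 0.5),
]
summary = aggregate_calls(calls, start=date(1999, 8, 1))
```

Aggregating per customer and period reduces millions of call records to a handful of summary values per customer, which is the form the later model-fitting variables take.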

Dataset for model fitting

Stratified sample of solvent customers

Class distribution: 90% solvent customers and 10% insolvent customers

2066 cases in total and 46 variables

2 variables describing the phone account

4 variables describing customer attitude towards previous phone bills

40 variables summarizing customer call habits over fifteen 2-week periods
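One way to reach a 90%/10% class mix is to keep every insolvent case and draw a stratified sample of solvent customers nine times as large. The sketch below assumes proportional allocation over a hypothetical `city` stratum; the slides do not say which stratification variable was used, and the population and counts here are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_total, seed=0):
    """Draw about n_total records, allocating to each stratum
    in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for members in strata.values():
        k = round(n_total * len(members) / len(records))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Toy population: 10,000 solvent customers over three hypothetical cities.
solvent = [{"id": i, "city": ("A", "B", "C")[i % 3]} for i in range(10_000)]
n_insolvent = 200              # illustrative count; all insolvent cases are kept
n_solvent = 9 * n_insolvent    # a 9:1 ratio gives a 90% / 10% class mix
sample = stratified_sample(solvent, lambda r: r["city"], n_solvent)
```

Because each stratum's allocation is rounded separately, the sample size can differ from `n_total` by a few records when strata sizes do not divide evenly.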

Data mining

Classification problem

2 classes: solvent and insolvent customers

Distribution among classes in original dataset: 99% solvent customers and 1% insolvent customers

Very small number of insolvencies

Very different costs of misclassification between the two classes of customers

Criteria for evaluation of prediction

The precision of the classifier, defined as the percentage of actually insolvent customers among those predicted as insolvent by the classifier.

The accuracy of the classifier, defined as the percentage of correctly predicted insolvent customers out of the total cases of insolvent customers in the data set.

Precision > 30% & Accuracy > 70%
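The two criteria can be written as small functions. This is a sketch using the slide's own terminology: what the talk calls accuracy is the quantity usually called recall (or sensitivity). The function names and the example counts are hypothetical and not taken from the study.

```python
def precision(true_insolvent, false_insolvent):
    """Share of customers predicted insolvent who really are insolvent."""
    return true_insolvent / (true_insolvent + false_insolvent)

def accuracy(true_insolvent, missed_insolvent):
    """Share of actually insolvent customers that the classifier detects
    (the slides call this accuracy; it is usually called recall)."""
    return true_insolvent / (true_insolvent + missed_insolvent)

def meets_targets(p, a, min_precision=0.30, min_accuracy=0.70):
    """Check both thresholds stated on the slide."""
    return p > min_precision and a > min_accuracy

# Hypothetical counts, for illustration only.
p = precision(true_insolvent=80, false_insolvent=120)
a = accuracy(true_insolvent=80, missed_insolvent=20)
ok = meets_targets(p, a)
```

The two measures pull in opposite directions: flagging more customers raises accuracy but adds false alarms, which lowers precision.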

Features selected (most popular in 50 classifiers)

NewCust

Latency

Count_X_charges

CountResiduals

StdDif

TrendDif11

TrendDif10

TrendDif7

TrendDif6

TrendDif3

TrendUnitsMax

TrendDif5

TrendDif8

Average_Dif

Type

MaxSec

TrendUnits5

AverageUnits

TrendCount5

CountInstallments

TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx
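Ranking features by how often they are selected across many classifiers can be sketched as a simple frequency count. The helper `most_popular_features` and the three selections below are hypothetical; only the feature names come from the list above.

```python
from collections import Counter

def most_popular_features(selected_per_classifier, top_n):
    """Count in how many classifiers each feature was selected
    and return the top_n most frequent ones."""
    counts = Counter()
    for features in selected_per_classifier:
        counts.update(set(features))  # count each feature once per classifier
    return counts.most_common(top_n)

# Illustrative selections from three classifiers (the study used 50).
selections = [
    ["NewCust", "Latency", "StdDif"],
    ["NewCust", "TrendDif11", "Latency"],
    ["NewCust", "CountResiduals"],
]
ranking = most_popular_features(selections, top_n=2)
```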

Deployment of the Knowledge-based system

The classifiers are combined (voting algorithms have been used)

Heuristics are used as applicability criteria

Visualization plays an important role in the design of the system

The roles of the user and the knowledge-based system have to be carefully defined
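Combining classifiers by voting, gated by a heuristic applicability check, might look like the sketch below. The slides do not specify the voting scheme or the heuristics, so simple majority voting, the tie-breaking rule, the three rule-based classifiers, and the customer fields are all assumptions made for illustration.

```python
from collections import Counter

def vote(predictions):
    """Majority vote over class labels; ties go to 'solvent'
    (an assumption made here to limit false alarms)."""
    counts = Counter(predictions)
    if counts["insolvent"] > counts["solvent"]:
        return "insolvent"
    return "solvent"

def classify(customer, classifiers, is_applicable):
    """Return the voted label, or None when the heuristic says
    the classifiers should not be applied to this customer."""
    if not is_applicable(customer):
        return None
    return vote([clf(customer) for clf in classifiers])

# Hypothetical classifiers and applicability heuristic.
classifiers = [
    lambda c: "insolvent" if c["late_payments"] >= 2 else "solvent",
    lambda c: "insolvent" if c["unpaid_amount"] > 100 else "solvent",
    lambda c: "insolvent" if c["calls_trend"] > 0.5 else "solvent",
]
is_applicable = lambda c: c["months_as_customer"] >= 6

customer = {"late_payments": 3, "unpaid_amount": 150,
            "calls_trend": 0.1, "months_as_customer": 24}
result = classify(customer, classifiers, is_applicable)
```

Returning `None` for non-applicable cases keeps the decision with the human user, in line with the point that the roles of user and system have to be defined.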

Stepwise Discriminant Analysis

Classification Results E3

                                                      Predicted
                                        Category       0        1    Total
Cases Selected       Original          Count    0     78       58      136
                                                1     28     1184     1212
                                       %        0  57.35    42.65      100
                                                1   2.31    97.69      100
                     Cross-validated   Count    0     77       59      136
                                                1     35     1177     1212
                                       %        0  56.62    43.38      100
                                                1   2.89    97.11      100
Cases not Selected   Original          Count    0     36       28       64
                                                1     22      632      654
                                       %        0  56.25    43.75      100
                                                1   3.36    96.64      100

93.6% of selected original grouped cases correctly classified

93.02% of selected cross-validated cases correctly classified

93.04% of unselected original grouped cases correctly classified
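The three summary percentages can be rechecked from the counts in the table. In this sketch, rows are the original class and columns the predicted class; the function name `percent_correct` is introduced here for illustration.

```python
def percent_correct(matrix):
    """Percentage of cases on the diagonal of a square confusion matrix
    (rows: original class, columns: predicted class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100 * correct / total

# Counts copied from the classification results table.
selected_original = [[78, 58], [28, 1184]]
selected_cross_validated = [[77, 59], [35, 1177]]
not_selected_original = [[36, 28], [22, 632]]

rates = [percent_correct(m) for m in (selected_original,
                                      selected_cross_validated,
                                      not_selected_original)]
```

The results agree with the reported figures to within rounding. The high overall rates are dominated by class 1: only about 57% of class 0 cases are classified correctly.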

Decision Tree

Classification Results E21

                                                 Predicted Group
                                   Category       0        1    Total
Cases Selected       Original     Count    0    101       35      136
                                           1      9     1203     1212
                                  %        0  74.26    25.74      100
                                           1   0.74    99.26      100
Cases not Selected   Original     Count    0     42       22       64
                                           1     16      638      654
                                  %        0  65.62    34.38      100
                                           1   2.45    97.55      100

Neural Network

Classification Results E30

                                                 Predicted Group
                                   Category       0        1    Total
Cases Selected       Original     Count    0     65       69      136
                                           1      8     1203     1212
                                  %        0   47.7     50.7      100
                                           1    0.6     99.2      100
Cases not Selected   Original     Count    0     24       40       64
                                           1     11      643      654
                                  %        0   37.5     62.5      100
                                           1    1.6     98.3      100

Evaluation of classifiers (example)

Performance over 90% in the majority class and over 83% in the minority class.

precision = 113/2844 ≈ 3.97%

accuracy = 113/136 ≈ 83.1%

                       Predicted cases
Actual cases      Insolvent (0)      Solvent (1)
Insolvent (0)     113 (83.1%)        23 (16.9%)
Solvent (1)       2731 (9.8%)        25081 (90.2%)
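The low precision alongside a high accuracy follows from the class imbalance. As a worked check, let π be the share of insolvent customers, r the accuracy on the insolvent class, and f the false alarm rate on the solvent class. These symbols are introduced here for illustration and are not used in the slides.

```latex
\[
\text{precision}
= \frac{r\,\pi}{r\,\pi + f\,(1-\pi)}
= \frac{\frac{113}{136}\cdot\frac{136}{27948}}
       {\frac{113}{136}\cdot\frac{136}{27948}
        + \frac{2731}{27812}\cdot\frac{27812}{27948}}
= \frac{113}{113 + 2731}
\approx 0.0397
\]
```

With π = 136/27948, under 0.5%, even a false alarm rate below 10% produces far more false alarms (2731) than true detections (113).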

Stage                                    DK       Type of DK

(a) Problem definition                   HIGH     Business and domain knowledge, requirements; implicit, tacit knowledge

(b) Creating target data set             MEDIUM   Attribute relations, semantics of corporate DB

(c) Data pre-processing                  HIGH     Tacit and implicit knowledge for inferences

(d) Feature and algorithm selection      MEDIUM   Interpretation of the selected features

(e) Data Mining                          LOW      Inspection of discovered knowledge

(f) Evaluation of learned knowledge      MEDIUM   Definition of criteria related to business objectives

(g) Fielding the knowledge base          HIGH     Supplementary domain knowledge necessary for implementing the system

Selection of DM tool (Elder 98)

Conclusion

Data mining is a knowledge-driven process

All stages contribute to the success of the process

Domain experts play a significant role in most phases of the process

Need for selection of algorithms and techniques that support interpretation of mined knowledge

Need for integrated tools and adequate techniques to support involvement of domain experts in the process
