oracle8_3
TRANSCRIPT
-
8/22/2019 oracle8_3
1/21
The role of Domain Knowledge in a large scale Data Mining Project
Kopanas I., Avouris N., Daskalaki S.
University of Patras
-
University of Patras, HCI Group - SETN02 2
Outline of the talk
Knowledge in a DM process
Case study in a large DM project: Prediction of customer insolvency in Telecommunications business
The role of domain expertise (and domain experts) in the process
Summary and conclusions
-
Data Mining
Evolution of knowledge-based systems
Key partners in Data Mining
Data analyst / statistician
Knowledge Engineer
Domain Expert
Role of domain knowledge in Data Mining
-
DM phases
(a) Problem definition
(b) Creating target data set
(c) Data pre-processing and transformation
(d) Feature and algorithm selection
(e) Data Mining
(f) Evaluation of learned knowledge
(g) Fielding the knowledge base
-
Case study: Prediction of Customer Insolvency in Telecommunications business
Predict the customers about to become insolvent, that is, the customers who will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly avertive) measures
Problem Objectives
Detect as many insolvent customers as possible
Minimize false alarms (solvent customers classified as insolvent)
-
Case study: problem characteristics
Significant loss of revenue for the company
Human behavior is (generally) unpredictable
Insolvency cases are rare compared to non-insolvencies
Information can be retrieved only after processing huge amounts of data from several sources
-
The billing process (domain knowledge)
[Figure: billing timeline with a month axis running from June to April; labels on the timeline: Billing Period, Due Date, Issue of Bill, Service Interruption, Nullification]
-
Target data set definition (semantic value of data)
Data from 3 different cities (combination of rural, urban and touristic areas)
Types of data
Customer data (coded)
Data from billing and payments
Call detail records (from switching centers)
Time span of data studied
Cases of collected and uncollected bills (10/99-2/01)
Call records (8/99-12/00)
-
Data pre-processing (knowledge-based reduction of search space)
Eliminating inexpensive calls (< 0.3 )
Synchronizing data
Removing noise
Handling missing values
Data aggregation by period
[Figure: data warehouse]
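Two of the pre-processing steps above, dropping inexpensive calls and aggregating by period, can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the record layout `(customer_id, call_date, cost)`, the function name `aggregate_calls` and the sample data are assumptions, the unit of the 0.3 threshold is not given on the slide, and the 14-day period length is taken from the two-week periods mentioned elsewhere in the talk.

```python
from collections import defaultdict
from datetime import date

MIN_COST = 0.3      # threshold from the slide; its unit is not stated there
PERIOD_DAYS = 14    # assumed two-week aggregation period


def aggregate_calls(calls, start):
    """Drop inexpensive calls, then total the rest per customer and period.

    `calls` is an iterable of (customer_id, call_date, cost) tuples, a
    hypothetical record layout. Returns a dict mapping
    (customer_id, period_index) to (number_of_calls, total_cost).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for customer_id, call_date, cost in calls:
        if cost < MIN_COST:
            continue  # knowledge-based reduction: cheap calls are ignored
        period = (call_date - start).days // PERIOD_DAYS
        entry = totals[(customer_id, period)]
        entry[0] += 1
        entry[1] += cost
    return {key: (count, total) for key, (count, total) in totals.items()}


# Illustrative records only; none of these values come from the study.
calls = [
    ("c1", date(1999, 8, 2), 0.10),   # dropped: below the threshold
    ("c1", date(1999, 8, 3), 1.50),   # first period
    ("c1", date(1999, 8, 20), 0.75),  # second period
    ("c2", date(1999, 8, 5), 2.00),   # first period
]
summary = aggregate_calls(calls, start=date(1999, 8, 1))
```

Filtering before aggregation keeps the per-period summaries small, which is the point of reducing the search space with domain knowledge.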
-
Dataset for model fitting
Stratified sample of solvent customers
Class distribution: 90% solvent customers and 10% insolvent customers
2066 total number of cases and 46 variables
2 variables describing the phone account
4 variables describing customer attitude towards previous phone bills
40 variables summarizing customer call habits over fifteen 2-week periods
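One way to obtain a 90%/10% class distribution is to keep all insolvent cases and draw a stratified sample of solvent customers sized to match. The sketch below assumes proportional allocation across strata; the slide does not say which stratification variable was used, so the strata keys, the function name `stratified_sample` and the counts in the example are all hypothetical.

```python
import random


def stratified_sample(solvent_by_stratum, n_insolvent, insolvent_share=0.10, seed=0):
    """Sample solvent customers per stratum so that insolvent cases make up
    roughly `insolvent_share` of the combined data set.

    `solvent_by_stratum` maps a stratum label to a list of customer ids.
    """
    rng = random.Random(seed)
    n_solvent_target = round(n_insolvent * (1 - insolvent_share) / insolvent_share)
    n_solvent_total = sum(len(ids) for ids in solvent_by_stratum.values())
    fraction = min(1.0, n_solvent_target / n_solvent_total)
    sample = []
    for ids in solvent_by_stratum.values():
        # Proportional allocation; rounding may shift the total slightly.
        k = round(fraction * len(ids))
        sample.extend(rng.sample(ids, k))
    return sample


# Hypothetical strata and sizes, for illustration only.
solvent = {
    "rural": list(range(0, 3000)),
    "urban": list(range(3000, 9000)),
    "touristic": list(range(9000, 10000)),
}
n_insolvent = 100
sample = stratified_sample(solvent, n_insolvent)
share = n_insolvent / (n_insolvent + len(sample))
```

Sampling the same fraction from every stratum keeps the solvent sample representative of each group while changing only the class ratio.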
-
Data mining
Classification problem
2 classes: solvent and insolvent customers
Distribution among classes in original dataset: 99% solvent customers and 1% insolvent customers
Very small number of insolvencies
Very different costs of misclassification
between the two classes of customers
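The combination of a 1% minority class and unequal misclassification costs is why the share of correct predictions alone is a poor yardstick here. The sketch below compares two classifiers under assumed unit costs; the population size, both cost values and the error counts of classifier B are hypothetical and chosen only to illustrate the effect.

```python
def total_cost(missed_insolvent, false_alarms, cost_missed, cost_false_alarm):
    """Total misclassification cost for given error counts and unit costs."""
    return missed_insolvent * cost_missed + false_alarms * cost_false_alarm


N_TOTAL = 10_000                # hypothetical population size
N_INSOLVENT = N_TOTAL // 100    # 1% insolvent, as in the original data set
N_SOLVENT = N_TOTAL - N_INSOLVENT

COST_MISSED = 50.0              # hypothetical cost of one undetected insolvency
COST_FALSE_ALARM = 1.0          # hypothetical cost of one false alarm

# Classifier A labels every customer solvent: no false alarms,
# but every insolvency is missed.
share_correct_a = N_SOLVENT / N_TOTAL
cost_a = total_cost(N_INSOLVENT, 0, COST_MISSED, COST_FALSE_ALARM)

# Classifier B (hypothetical error counts) misses 20 insolvencies
# and raises 1000 false alarms.
share_correct_b = (N_TOTAL - 20 - 1000) / N_TOTAL
cost_b = total_cost(20, 1000, COST_MISSED, COST_FALSE_ALARM)
```

Under these assumed costs, classifier A is correct more often yet costs more in total, because it never detects an insolvent customer.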
-
Criteria for evaluation of prediction
The precision of the classifier, defined as the percentage of actually insolvent customers among those predicted as insolvent by the classifier.
The accuracy of the classifier, defined as the percentage of correctly predicted insolvent customers out of the total cases of insolvent customers in the data set.
Precision > 30% & Accuracy > 70%
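The two criteria above can be written in confusion-matrix terms. The symbols are introduced here for illustration and do not appear on the slides: TP is the number of insolvent customers predicted as insolvent, FP the number of solvent customers predicted as insolvent, and FN the number of insolvent customers predicted as solvent. Note that "accuracy" as defined in this talk corresponds to what is more commonly called recall or sensitivity.

```latex
\[
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{accuracy} = \frac{TP}{TP + FN}
\]
```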
-
Features selected (most popular in 50 classifiers)
NewCust
Latency
Count_X_charges
CountResiduals
StdDif
TrendDif11
TrendDif10
TrendDif7
TrendDif6
TrendDif3
TrendUnitsMax
TrendDif5
TrendDif8
Average_Dif
Type
MaxSec
TrendUnits5
AverageUnits
TrendCount5
CountInstallments
TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx
-
Deployment of the Knowledge-based system
The classifiers are combined (voting algorithms have been used)
Heuristics are used as applicability criteria
Visualization plays an important role in the design of the system
The roles of the user and the knowledge-based system have to be carefully defined
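Combining classifiers by voting, gated by an applicability heuristic, might look like the sketch below. The slides do not say which voting algorithm or which heuristics were used, so this shows plain majority voting; the three toy classifiers, the applicability rule and the customer fields (`unpaid_bills`, `late_days`, `periods_observed`) are all hypothetical. Class labels follow the coding used in the talk: 0 for insolvent, 1 for solvent.

```python
from collections import Counter


def classify(customer, classifiers, is_applicable):
    """Return the majority vote of `classifiers`, or None when the
    applicability heuristic says the system should not decide."""
    if not is_applicable(customer):
        return None  # case is left to the human user
    votes = [clf(customer) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Toy classifiers over hypothetical customer fields.
def clf_unpaid(customer):
    return 0 if customer["unpaid_bills"] >= 2 else 1


def clf_late(customer):
    return 0 if customer["late_days"] > 30 else 1


def clf_mixed(customer):
    return 0 if customer["unpaid_bills"] >= 1 and customer["late_days"] > 10 else 1


def enough_history(customer):
    # Hypothetical heuristic: require at least three observed periods.
    return customer["periods_observed"] >= 3


classifiers = [clf_unpaid, clf_late, clf_mixed]
known = {"unpaid_bills": 2, "late_days": 15, "periods_observed": 6}
recent = {"unpaid_bills": 2, "late_days": 15, "periods_observed": 1}

label_known = classify(known, classifiers, enough_history)
label_recent = classify(recent, classifiers, enough_history)
```

Returning `None` for cases outside the heuristic keeps the division of roles explicit: the system abstains and the user decides.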
-
Stepwise Discriminant Analysis
Classification Results E3

Cases         Analysis         Measure  Category  Predicted 0  Predicted 1  Total
Selected      Original         Count    0         78           58           136
                                        1         28           1184         1212
                               %        0         57.35        42.65        100
                                        1         2.31         97.69        100
Selected      Cross-validated  Count    0         77           59           136
                                        1         35           1177         1212
                               %        0         56.62        43.38        100
                                        1         2.89         97.11        100
Not selected  Original         Count    0         36           28           64
                                        1         22           632          654
                               %        0         56.25        43.75        100
                                        1         3.36         96.64        100

93.6% of selected original grouped cases correctly classified
93.02% of selected cross-validated cases correctly classified
93.04% of unselected original grouped cases correctly classified
-
Decision Tree
Classification Results E21

Cases         Analysis  Measure  Category  Predicted 0  Predicted 1  Total
Selected      Original  Count    0         101          35           136
                                 1         9            1203         1212
                        %        0         74.26        25.74        100
                                 1         0.74         99.26        100
Not selected  Original  Count    0         42           22           64
                                 1         16           638          654
                        %        0         65.62        34.38        100
                                 1         2.45         97.55        100
-
Neural Network
Classification Results E30

Cases         Analysis  Measure  Category  Predicted 0  Predicted 1  Total
Selected      Original  Count    0         65           69           136
                                 1         8            1203         1212
                        %        0         47.7         50.7         100
                                 1         0.6          99.2         100
Not selected  Original  Count    0         24           40           64
                                 1         11           643          654
                        %        0         37.5         62.5         100
                                 1         1.6          98.3         100
-
Evaluation of classifiers (example)
Performance over 90% in the majority class and over 83% in the minority class.
precision = 113/2844 = 3.9%
accuracy = 113/136 = 83%

                      Predicted Insolvent (0)  Predicted Solvent (1)
Actual Insolvent (0)  113 (83.1%)              23 (16.9%)
Actual Solvent (1)    2731 (9.8%)              25081 (90.2%)
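The figures in this example follow directly from the four cells of the table. The sketch below recomputes them and compares the result with the targets stated earlier in the talk (precision above 30%, accuracy above 70%); the variable names are assumptions, while the counts are those shown on the slide.

```python
# Confusion matrix from the slide: rows are actual classes, columns predicted.
tp = 113     # insolvent customers predicted insolvent
fn = 23      # insolvent customers predicted solvent
fp = 2731    # solvent customers predicted insolvent (false alarms)
tn = 25081   # solvent customers predicted solvent

precision = tp / (tp + fp)            # 113 / 2844
accuracy_minority = tp / (tp + fn)    # 113 / 136, "accuracy" as defined in the talk
rate_majority = tn / (tn + fp)        # share of solvent customers classified correctly
false_alarm_rate = fp / (tn + fp)     # share of solvent customers flagged as insolvent

# Targets from the evaluation criteria slide.
meets_precision_target = precision > 0.30
meets_accuracy_target = accuracy_minority > 0.70
```

On these counts the accuracy target is met while the precision target is not, since the false alarms far outnumber the correctly detected insolvent customers.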
-
Stage                                DK      Type of DK
(a) Problem definition               HIGH    Business and domain knowledge, requirements; implicit, tacit knowledge
(b) Creating target data set         MEDIUM  Attribute relations, semantics of corporate DB
(c) Data pre-processing              HIGH    Tacit and implicit knowledge for inferences
(d) Feature and algorithm selection  MEDIUM  Interpretation of the selected features
(e) Data Mining                      LOW     Inspection of discovered knowledge
(f) Evaluation of learned knowledge  MEDIUM  Definition of criteria related to business objectives
(g) Fielding the knowledge base      HIGH    Supplementary domain knowledge necessary for implementing the system
-
Selection of DM tool (Elder 98)
-
Conclusion
Data mining is a knowledge-driven process
All stages contribute to the success of the process
Domain experts play a significant role in most phases of the process
Need for selection of algorithms and techniques that support interpretation of mined knowledge
Need for integrated tools and adequate techniques to support involvement of domain experts in the process