
    The Role of Domain Knowledge in a Large-Scale Data Mining Project

    Kopanas I., Avouris N., Daskalaki S.

    University of Patras

    University of Patras, HCI Group - SETN02

    Outline of the talk

    Knowledge in a DM process

    Case study in a large DM project: prediction of customer insolvency in the telecommunications business

    The role of domain expertise (and domain experts) in the process

    Summary and conclusions


    Data Mining

    Evolution of knowledge-based systems

    Key partners in Data Mining

    Data analyst / statistician

    Knowledge Engineer

    Domain Expert

    Role of domain knowledge in Data Mining


    DM phases

    (a) Problem definition

    (b) Creating target data set

    (c) Data pre-processing and transformation

    (d) Feature and algorithm selection

    (e) Data Mining

    (f) Evaluation of learned knowledge

    (g) Fielding the knowledge base


    Case study: Prediction of customer insolvency in the telecommunications business

    Predict the customers who are about to become insolvent, that is, the customers who will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly avertive) measures

    Problem objectives

        Detect as many insolvent customers as possible

        Minimize false alarms (solvent customers classified as insolvent)


    Case study: problem characteristics

    Significant loss of revenue for the company

    Human behavior is (generally) unpredictable

    Insolvency cases are rare compared to non-insolvencies

    Information can be retrieved only after processing huge amounts of data from several sources


    The billing process (domain knowledge)

    [Figure: billing timeline over the months June to April, with the labels Billing Period, Issue of Bill, Due Date, Service Interruption and Nullification]


    Target data set definition (semantic value of data)

    Data from 3 different cities (combination of rural, urban and tourist areas)

    Types of data

        Customer data (coded)

        Data from billing and payments

        Call detail records (from switching centers)

    Time span of data studied

        Cases of collected and uncollected bills (10/99-2/01)

        Call records (8/99-12/00)


    Data pre-processing (knowledge-based reduction of search space)

    Eliminated inexpensive calls (< 0.3)

    Synchronizing data

    Removing noise

    Missing values

    Data aggregation by period

    [Figure: data warehouse]
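A minimal sketch of two of the pre-processing steps listed above: eliminating inexpensive calls and aggregating the remaining ones by period. The record layout `(customer_id, call_date, cost, seconds)`, the function name `aggregate_calls`, the two-week period length and the rule of dropping records with missing values are assumptions made for illustration; the slides state only the 0.3 threshold (without a unit).

```python
from collections import defaultdict
from datetime import date

MIN_CALL_COST = 0.3  # threshold from the slide; its currency unit is not stated there
PERIOD_DAYS = 14     # assumed period length (two-week periods)


def aggregate_calls(calls, start):
    """Filter call records and aggregate them per customer and period.

    calls: iterable of (customer_id, call_date, cost, seconds) tuples
           (a hypothetical layout, not the project's actual schema).
    Returns {(customer_id, period_index): {"count", "cost", "seconds"}}.
    """
    totals = defaultdict(lambda: {"count": 0, "cost": 0.0, "seconds": 0})
    for customer_id, call_date, cost, seconds in calls:
        if cost is None or seconds is None:
            continue  # assumed handling of missing values: drop the record
        if cost < MIN_CALL_COST or call_date < start:
            continue  # drop inexpensive calls and records before the window
        period = (call_date - start).days // PERIOD_DAYS
        bucket = totals[(customer_id, period)]
        bucket["count"] += 1
        bucket["cost"] += cost
        bucket["seconds"] += seconds
    return dict(totals)


example = [
    ("c1", date(1999, 8, 2), 0.5, 60),    # kept, period 0
    ("c1", date(1999, 8, 3), 0.1, 10),    # dropped: cheaper than the threshold
    ("c1", date(1999, 8, 16), 1.0, 120),  # kept, period 1
    ("c2", date(1999, 8, 14), 0.3, 30),   # kept, period 0 (not below the threshold)
    ("c2", date(1999, 8, 5), None, 30),   # dropped: missing cost
]
summary = aggregate_calls(example, start=date(1999, 8, 1))
```

Filtering before aggregation keeps the per-period summaries small, which is the sense in which this step reduces the search space.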


    Dataset for model fitting

    Stratified sample of solvent customers

    Class distribution: 90% solvent customers and 10% insolvent customers

    2066 total number of cases and 46 variables

        2 variables describing the phone account

        4 variables describing customer attitude towards previous phone bills

        40 variables summarizing customer call habits over fifteen 2-week periods
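One way to obtain a 90%/10% class distribution from a much more skewed population is to keep all insolvent cases and draw a stratified sample of the solvent ones. The sketch below assumes proportional allocation per stratum and uses a hypothetical `city` field as the stratification key; the slides do not say which variables defined the strata, and `stratified_undersample` is an illustrative name.

```python
import random
from collections import defaultdict


def stratified_undersample(solvent, n_insolvent, minority_share=0.10,
                           stratum_of=lambda c: c["city"], seed=0):
    """Sample solvent customers so that insolvent ones form `minority_share`
    of the final data set, preserving each stratum's proportion."""
    rng = random.Random(seed)
    target = round(n_insolvent * (1 - minority_share) / minority_share)
    target = min(target, len(solvent))

    strata = defaultdict(list)
    for customer in solvent:
        strata[stratum_of(customer)].append(customer)

    # Proportional allocation with largest-remainder rounding.
    exact = {s: target * len(m) / len(solvent) for s, m in strata.items()}
    quota = {s: int(x) for s, x in exact.items()}
    leftover = target - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, quota[s]))
    return sample


# Hypothetical population: 2000 solvent customers in three cities, 20 insolvent.
solvent = ([{"city": "A"}] * 900) + ([{"city": "B"}] * 600) + ([{"city": "C"}] * 500)
sample = stratified_undersample(solvent, n_insolvent=20)
```

With 20 insolvent cases the function draws 180 solvent ones, so the insolvent class makes up 10% of the resulting 200 cases.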


    Data mining

    Classification problem

    2 classes: solvent and insolvent customers

    Distribution among classes in the original dataset: 99% solvent customers and 1% insolvent customers

    Very small number of insolvencies

    Very different costs of misclassification between the two classes of customers


    Criteria for evaluation of prediction

    The precision of the classifier, defined as the percentage of actually insolvent customers among those predicted as insolvent by the classifier.

    The accuracy of the classifier, defined as the percentage of correctly predicted insolvent customers out of all insolvent customers in the data set.

    Precision > 30% and accuracy > 70%
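The two criteria can be written directly in terms of confusion-matrix counts. In the sketch below, "accuracy" follows the definition on this slide (correctly predicted insolvent cases over all insolvent cases), which is the measure more commonly called recall or sensitivity. The function names and the example counts are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted-insolvent customers who are actually insolvent."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy_on_insolvent(tp, fn):
    """Share of actually insolvent customers the classifier detects
    (the slide's 'accuracy'; usually called recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def meets_criteria(tp, fp, fn, min_precision=0.30, min_accuracy=0.70):
    """Check the thresholds stated on the slide."""
    return (precision(tp, fp) > min_precision
            and accuracy_on_insolvent(tp, fn) > min_accuracy)


# Hypothetical counts: 80 insolvent customers caught, 120 false alarms, 20 missed.
ok = meets_criteria(tp=80, fp=120, fn=20)      # precision 0.40, accuracy 0.80
too_noisy = meets_criteria(tp=80, fp=720, fn=20)  # precision 0.10, accuracy 0.80
```

The second call fails only on precision: the same detection rate with six times as many false alarms no longer satisfies the first criterion.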


    Features selected (most popular in 50 classifiers)

    NewCust

    Latency

    Count_X_charges

    CountResiduals

    StdDif

    TrendDif11

    TrendDif10

    TrendDif7

    TrendDif6

    TrendDif3

    TrendUnitsMax

    TrendDif5

    TrendDif8

    Average_Dif

    Type

    MaxSec

    TrendUnits5

    AverageUnits

    TrendCount5

    CountInstallments

    TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx
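The slides do not give formulas for the dispersion features, so the following is only one plausible reading: count the distinct telephone numbers a customer called in each period, then summarize those counts across periods with a mean (for something like `Average_Dif`) and a standard deviation (for something like `StdDif`). The input layout and the function name are assumptions, and the trend features are deliberately left out because their definition is not recoverable from the slide.

```python
from statistics import mean, pstdev


def distinct_numbers_per_period(calls, n_periods):
    """calls: iterable of (period_index, called_number) for one customer
    (hypothetical layout). Returns the count of distinct numbers per period."""
    seen = [set() for _ in range(n_periods)]
    for period, number in calls:
        if 0 <= period < n_periods:
            seen[period].add(number)
    return [len(numbers) for numbers in seen]


calls = [(0, "a"), (0, "b"), (0, "a"), (1, "a"), (2, "c"), (2, "d"), (2, "e")]
dif = distinct_numbers_per_period(calls, n_periods=3)

average_dif = mean(dif)  # assumed reading of Average_Dif
std_dif = pstdev(dif)    # assumed reading of StdDif (population std. deviation)
```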


    Deployment of the knowledge-based system

    The classifiers are combined (voting algorithms have been used)

    Heuristics are used as applicability criteria

    Visualization plays an important role in the design of the system

    The roles of the user and the knowledge-based system have to be carefully defined
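The slide says that the classifiers are combined by voting but does not name the scheme. As an illustration only, the sketch below uses plain majority voting over class labels, with ties resolved towards "solvent" to limit false alarms; both the scheme and the tie-break rule are assumptions.

```python
from collections import Counter


def majority_vote(predictions, tie_break="solvent"):
    """Combine the class labels predicted by several classifiers for one customer."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break  # assumed rule: a tie does not raise an alarm
    return ranked[0][0]


# Three hypothetical classifiers (e.g. discriminant analysis, tree, neural network).
decision = majority_vote(["insolvent", "solvent", "insolvent"])
tied = majority_vote(["insolvent", "solvent"])
```

A weighted vote, where each classifier's weight reflects its measured precision, would be a natural variation on the same idea.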


    Stepwise Discriminant Analysis

    Classification Results E3

                                                        Predicted
    Cases         Evaluation       Measure  Category    0        1        Total
    Selected      Original         Count    0           78       58       136
                                            1           28       1184     1212
                                   %        0           57.35    42.65    100
                                            1           2.31     97.69    100
    Selected      Cross-validated  Count    0           77       59       136
                                            1           35       1177     1212
                                   %        0           56.62    43.38    100
                                            1           2.89     97.11    100
    Not selected  Original         Count    0           36       28       64
                                            1           22       632      654
                                   %        0           56.25    43.75    100
                                            1           3.36     96.64    100

    93.6% of selected original grouped cases correctly classified

    93.02% of selected cross-validated cases correctly classified

    93.04% of unselected original grouped cases correctly classified


    Decision Tree

    Classification Results E21

                                                   Predicted Group
    Cases         Evaluation  Measure  Category    0        1        Total
    Selected      Original    Count    0           101      35       136
                                       1           9        1203     1212
                              %        0           74.26    25.74    100
                                       1           0.74     99.26    100
    Not selected  Original    Count    0           42       22       64
                                       1           16       638      654
                              %        0           65.62    34.38    100
                                       1           2.45     97.55    100


    Neural Network

    Classification Results E30

                                                   Predicted Group
    Cases         Evaluation  Measure  Category    0        1        Total
    Selected      Original    Count    0           65       69       136
                                       1           8        1203     1212
                              %        0           47.7     50.7     100
                                       1           0.6      99.2     100
    Not selected  Original    Count    0           24       40       64
                                       1           11       643      654
                              %        0           37.5     62.5     100
                                       1           1.6      98.3     100


    Evaluation of classifiers (example)

    Performance over 90% in the majority class and over 83% in the minority class.

    precision = 113/2844 ≈ 4.0%

    accuracy = 113/136 ≈ 83.1%

                        Predicted cases
    Actual cases        Insolvent (0)      Solvent (1)
    Insolvent (0)       113 (83.1%)        23 (16.9%)
    Solvent (1)         2731 (9.8%)        25081 (90.2%)
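The arithmetic behind this example can be reproduced from the four cells of the table. The sketch uses only the counts shown above; the variable names are illustrative.

```python
# Counts from the example confusion matrix (0 = insolvent, 1 = solvent).
tp = 113    # insolvent customers predicted as insolvent
fn = 23     # insolvent customers predicted as solvent
fp = 2731   # solvent customers predicted as insolvent (false alarms)
tn = 25081  # solvent customers predicted as solvent

minority_rate = tp / (tp + fn)  # the slide's "accuracy": 113/136
majority_rate = tn / (tn + fp)  # share of solvent customers left alone
precision = tp / (tp + fp)      # 113/2844

print(f"minority class: {minority_rate:.1%}")
print(f"majority class: {majority_rate:.1%}")
print(f"precision:      {precision:.1%}")
```

Although both per-class rates are high, fewer than 1 in 20 raised alarms is a true insolvency, because the false alarms come from a solvent class roughly 200 times larger than the insolvent one.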


    Stage                                  Use of DK   Type of DK
    (a) Problem definition                 HIGH        Business and domain knowledge, requirements; implicit, tacit knowledge
    (b) Creating target data set           MEDIUM      Attribute relations, semantics of corporate DB
    (c) Data pre-processing                HIGH        Tacit and implicit knowledge for inferences
    (d) Feature and algorithm selection    MEDIUM      Interpretation of the selected features
    (e) Data Mining                        LOW         Inspection of discovered knowledge
    (f) Evaluation of learned knowledge    MEDIUM      Definition of criteria related to business objectives
    (g) Fielding the knowledge base        HIGH        Supplementary domain knowledge necessary for implementing the system


    Selection of DM tool (Elder 98)


    Conclusion

    Data mining is a knowledge-driven process

    All stages contribute to the success of the process

    Domain experts play a significant role in most phases of the process

    Need for selection of algorithms and techniques that support interpretation of mined knowledge

    Need for integrated tools and adequate techniques to support involvement of domain experts in the process