Post on 19-Dec-2015
Mining Financial Data
Histograms & Contingency Tables
Shishir Gupta
Under the guidance of Dr. Mirsad Hadzikadic
In memory of Dr. Jan Zytkow
SEP 09 1944 - JAN 16 2001
Agenda
• Database
• Task goals
• Tool & technique used
• Data preparation and cleaning
• Attribute selection
• Data transformation
• Data Mining/Pattern Evaluation
• Knowledge presentation
• Pros/Cons
• Questions & Demonstration
Database
• Financial Dataset from PKDD 1999
• Financial Dataset from a Czech Bank
• Relational Dataset
• 8 Relations
– ACCOUNT
– LOAN
– DEMOGRAPH
– ORDER
– TRANSACTION
– CARD
– DISPOSITION
– CLIENT
Task Goal
• Identify good clients, to whom additional services can be offered
• Identify bad clients, to be watched carefully to minimize bank losses
• Services offered:
– Loan
– Credit Card
Technique Used - Histogram
SQL Statement used
SELECT age, COUNT(age)
FROM table_x
GROUP BY age
ORDER BY age
Technique Used – C-Tables
SQL Statement used
SELECT sex, COUNT(sex), age
FROM table_x a, table_y b
WHERE a.id = b.fid
GROUP BY sex, age
ORDER BY sex, age
Technique Used – Correlation
SQL Statement used
SELECT x, y
FROM table_x a, table_y b
WHERE a.id = b.fid
ORDER BY x, y
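The query above only retrieves the paired values; the correlation itself has to be computed from them. The slides do not say which coefficient the tool uses, so the following is a minimal sketch assuming Pearson's r. The in-memory tables and their values are hypothetical stand-ins for table_x and table_y.

```python
import math
import sqlite3


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy tables standing in for table_x and table_y.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_x (id INTEGER PRIMARY KEY, x REAL);
    CREATE TABLE table_y (fid INTEGER, y REAL);
    INSERT INTO table_x VALUES (1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0);
    INSERT INTO table_y VALUES (1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0);
""")

# Same join as in the slide, then the coefficient is computed client-side.
pairs = conn.execute(
    "SELECT x, y FROM table_x a, table_y b WHERE a.id = b.fid ORDER BY x, y"
).fetchall()
r = pearson_r(pairs)
```

Computing the coefficient outside the database keeps the SQL portable, at the cost of transferring every (x, y) pair to the client.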
Tool - Architecture
Tool - Description
Data Cleaning
• Missing Value
– Relation DEMOGRAPHIC
• Incorrect Values
– Relation TRANSACTION
(Data reduced by 10% after cleaning)
Data Preparation
• Relation CLIENT
– Separating SEX & BDATE from BIRTHNUMBER
• All Date fields converted to AGE
– Reference date: 199901
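The two preparation steps above can be sketched as follows. This assumes the PKDD'99 encoding of the birth number (YYMMDD for men, with 50 added to the month for women), that all birth years fall in the 1900s, and that the reference 199901 means January 1999. The function names are hypothetical, not taken from the tool.

```python
def split_birth_number(birth_number):
    """Split an integer birth number into (sex, (year, month, day)).

    Assumes YYMMDD for men and YYMM+50DD for women, with all
    birth years in the 1900s.
    """
    yy = birth_number // 10000
    mm = (birth_number // 100) % 100
    dd = birth_number % 100
    sex = "F" if mm > 50 else "M"
    if sex == "F":
        mm -= 50
    if not (1 <= mm <= 12 and 1 <= dd <= 31):
        raise ValueError(f"not a valid birth number: {birth_number}")
    return sex, (1900 + yy, mm, dd)


def age_at(birth_date, ref_year=1999, ref_month=1):
    """Age in whole years at the reference month (month granularity).

    A birthday falling in the reference month is counted as already
    passed, since the reference carries no day component.
    """
    year, month, _ = birth_date
    age = ref_year - year
    if month > ref_month:
        age -= 1
    return age


sex, bdate = split_birth_number(706213)  # a woman born 1970-12-13
age = age_at(bdate)
```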
Data Preparation Cont.
• Creating table definitions
• Setting up data in a table-compatible format
• Loading data into the database
• Evaluating loading errors and changing attribute definitions accordingly
Attribute Selection
• Decision Relation
– LOAN
• Decision Attribute
– STATUS
• Classification Attributes
– All other attributes that do not belong to the LOAN relation
[Figure: decision tree testing attributes A4, A6 and A1, with Y/N branches leading to Class1 and Class2]
Data Transformation
• Discretization
– Continuous attributes into 4 to 10 buckets
• Only transactions performed in the year 1997 are considered for relation TRANSACTION
– Due to resource limitations
– The largest number of loans was approved during this period
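The slides do not say how bucket boundaries were chosen. As a minimal sketch, assuming equal-width buckets, a continuous attribute can be discretized like this (the function name and sample amounts are hypothetical):

```python
def equal_width_buckets(values, n_buckets=4):
    """Map each value to a bucket index in 0 .. n_buckets - 1.

    Buckets have equal width over [min(values), max(values)];
    the maximum value is placed in the last bucket.
    """
    if n_buckets < 1:
        raise ValueError("n_buckets must be positive")
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant attribute collapses into a single bucket.
        return [0] * len(values)
    width = (hi - lo) / n_buckets
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]


amounts = [0, 10, 20, 30, 40]  # hypothetical transaction amounts
buckets = equal_width_buckets(amounts, n_buckets=4)
```

Equal-width buckets are simple but sensitive to outliers; equal-frequency buckets would be an alternative for skewed attributes such as amounts.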
Data Mining/Pattern Evaluation
• Run Histogram on all non-key attributes to study their distributions.
• Discretize continuous attributes.
• Run Contingency Table to study the relationship between two attributes.
• Check significance with the Correlation function if both attributes are continuous.
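The histogram and contingency-table steps can be sketched end to end with an in-memory SQLite database. The table name, column names and rows below are hypothetical toy data, not values from the bank dataset; only the GROUP BY pattern follows the slides.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy relation with one categorical and one discretized attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, sex TEXT, age_bucket INTEGER);
    INSERT INTO client VALUES
        (1, 'F', 0), (2, 'F', 1), (3, 'M', 0), (4, 'M', 0), (5, 'F', 1);
""")

# Histogram: frequency of each value of a single attribute.
histogram = dict(conn.execute(
    "SELECT age_bucket, COUNT(*) FROM client "
    "GROUP BY age_bucket ORDER BY age_bucket"
).fetchall())

# Contingency table: joint frequencies of two attributes,
# pivoted into rows (sex) and columns (age_bucket).
table = defaultdict(dict)
for sex, bucket, freq in conn.execute(
    "SELECT sex, age_bucket, COUNT(*) FROM client "
    "GROUP BY sex, age_bucket ORDER BY sex, age_bucket"
):
    table[sex][bucket] = freq
```

Combinations that never occur (here, men in bucket 1) are simply absent from the result, so a presentation layer has to fill them in as zero.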
Knowledge Presentation - 1
• All loans on accounts where a second person is allowed to dispose are GOOD LOANS (100%)
Knowledge Presentation - 2
• Permanent Orders of type household & leasing indicate financial stability
Knowledge Presentation - 3
• Accounts with Cash withdrawals are more likely to repay their loans
Knowledge Presentation - 4
• Accounts with low transaction amounts indicate good loans
Knowledge Presentation - 5
• Accounts that are in debt indicate BAD LOANS
Pros
• Flexibility to alter the data presentation to understand the nature of the data
• Customers with no background in data mining can appreciate the output results because of their simplicity
• Since there is a provision to store the results in a file, subsequent analysis on a subset of the data becomes very easy
Cons
• Needs capability for Multi-Variable analysis.
• Some kind of quantification needs to be put in.
• Performance issues with using RDBMS.
Questions & Demonstration