deep learning vs traditional machine learning algorithms used in

Click here to load reader

Post on 26-Mar-2020




0 download

Embed Size (px)


  •   Deep Learning vs traditional Machine  Learning algorithms used in Credit Card 

    Fraud Detection           

    User Configuration Manual 



    Name: Sapna Gupta  Student Number:  X14115824  Course:  Msc Data Analytics 

    College:  National College of Ireland  School of Computing 


    Supervisor : Jason Roche       


      1. Introduction​                                                                                                           3  2. Application Environment Setup​                                                                             4 

    2.1 ​Hardware​                                                                                              4  2.2 ​Software​                                                                                               5 

    3. Data Mining Process Details:​                                                                                7  3.1 ​Dataset selection​                                                                                  7  3.2 Data ​Cleansing​                                                                                     7  3.3 Data ​Preprocessing( using Spark)​                                                       8  3.4 ​Exploratory Analysis ( Using Tableau)​                                                 8  3.5 ​Social Network Analysis.  ( Using Neo4J0 )​                                       12  3.6 ​Data Sampling & Data Modelling​                                                       14 

    4. Deep Learning model tuning​                                                                               19  4.1 ​Deep Learning model tuning using 5 fold cross validation​                 19  4.2 ​Deep Learning model tuning hyper parameter tuning Grid Search​    20 

    5. Result Evaluation​                                                                                                 21  6.  ​Appendix​: Scripts created for the analysis                                                         22 

    Section A​: Exploratory analysis                                                                22  Section B​: Neo4J Database Create Script                                               30  Section C​: Python ­ Neo4J integration for 3D Graph Analysis                 32  Section D​: DSS Data Loading                                                                  34  Section E​: Creating Sampling                                                                  34  Section F​: Model building of Sampling Methods                                      35  Section G​: Model Evaluation for Sampling Methods                                35  Section H​: SMOTE Sampling                                                                   36  Section I​:   Ensemble Methods GBM                                                        37  Section J​:  Ensemble Method Stacking                                                    39  Section K​:  Deep Learning using default Parameters                               41  Section L​: Deep Learning using 5 fold Cross Validation                           42  Section M​: Deep learning using Grid Search                                            43  Section N​: Result Comparison of All models Built                                    45 



    1. Introduction  This manual lists all the technical details required to design a credit card fraud detection                              application using advanced data mining software. Output of this process is in the form of                              business analytic reports , charts which explain fraud patterns , fraud rings and well trained                              binary classification model which can find fraud from new unseen credit card transactions.                          Objective of this study is to compare performance of deep learning with other binary                            classification techniques like RF , GBM and GLM and this study also:    1. Investigate the use of new advanced data mining software to analyse credit card transactions                              to find fraud more efficiently.   2. Finds different types of data mining approaches to identify patterns and fraud rings in credit                                card transactions.   3. Tries to find solutions to the class imbalance problem and show how these solutions can                                be  implemented  through the use of new advanced data mining softwares like H2O.      This document is supplementary to the dissertation thesis report submitted. Functional and                        business concepts related to credit card fraud application are very well demonstrated in the                            thesis report.                                     

  • Application Environment Setup   

      Figure 1. Overview of the hardware and software used during building process of the Fraud  detection application 

    Hardware  1. Mac  OS 

        Model Name: MacBook Pro    Model Identifier: MacBookPro11,5    Processor Name: Intel Core i7    Processor Speed: 2.5 GHz    Number of Processors: 1    Total Number of Cores: 4    L2 Cache (per Core): 256 KB    L3 Cache: 6 MB    Memory: 16 GB 

      1.1  Ubuntu VM 14.04 

    Operating System Oracle (64­bit)  Base Memory : 11296 MB   

    All the advanced data mining softwares were manually loaded into this VM. 

  •   1.2  DSS Virtual Machine (VM) 3.0.0 

    Operating System Red Hat (64­bit)  Base Memory : 11856 MB 


    DSS VM is downloaded from Dataiku web site ​  DSS comes with all the advanced data mining software preloaded. With the free version                            of DSS, user can only use limited softwares , so student licence was requested for using                                advance features or enterprise version of DSS. Due to easy installation of DSS and                            integration of all the advanced data mining softwares into DSS it became the best choice                              to carry out our final data modelling activity. 

    Software  Engineering tools like Spark , Neo4J , Python , Pyspark, R , SparkR , SparkSQL , Tableau , DSS                                      and H2O were used to study the acquired dataset.    Apache ​Spark ​is advertised as “lightning fast cluster computing”. Spark lets you run programs                            up to 100x faster in memory, or 10x faster on disk, than Hadoop. Spark contains very powerful                                  and widely used libraries like SparkSQL, Spark Streaming, MLlib (for machine learning), and                          GraphX. Spark can be used with python, java and R through its APIs.  1

      Spark was downloaded into VM1 using the following link. Version of the spark used for the                                thesis is spark­1.5.2­bin­hadoop2.6.  Please follow all the steps provided in the below link to install scala and spark.­ap

View more