Download - The Data Science Model - Occidental College
![Page 1: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/1.jpg)
The Data Science ModelMath 396
Treena Basu
Department of MathematicsOccidental College
February 9, 2021
1 / 33
![Page 2: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/2.jpg)
Summary
1 What is Data Science?
2 What is Machine Learning?
3 Classification ProblemsFraud DetectionEmail Spam FilterStudent College Commitment Decisions
2 / 33
![Page 3: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/3.jpg)
What is Data Science?
What is Data Science?
3 / 33
![Page 4: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/4.jpg)
What is Data Science?
Data Science
Data science is a broad field that concerns itself with the extraction ofknowledge and insights from data.
Data comes in different shapes and forms.
4 / 33
![Page 5: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/5.jpg)
What is Data Science?
Data Science
Data science is a broad field that concerns itself with the extraction ofknowledge and insights from data.
Data comes in different shapes and forms.
4 / 33
![Page 6: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/6.jpg)
What is Machine Learning?
What is Machine Learning?
5 / 33
![Page 7: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/7.jpg)
What is Machine Learning?
Machine Learning
Machine Learning (ML) is an application of artificial intelligence (AI) thatfocuses on the creation and development of mathematical models that usedata to learn for themselves.
The primary motivation behind creating these mathematical models or MLalgorithms is to allow the model to learn automatically without humanintervention or assistance and adjust actions accordingly.
6 / 33
![Page 8: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/8.jpg)
What is Machine Learning?
Machine Learning
Machine Learning (ML) is an application of artificial intelligence (AI) thatfocuses on the creation and development of mathematical models that usedata to learn for themselves.
The primary motivation behind creating these mathematical models or MLalgorithms is to allow the model to learn automatically without humanintervention or assistance and adjust actions accordingly.
6 / 33
![Page 9: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/9.jpg)
What is Machine Learning?
Types of Machine Learning
Machine learning approaches are traditionally divided into three broadcategories:
Supervised Learning: the model is “trained” with example inputs andtheir desired outputs; the goal is to learn a general rule that mapsinputs to outputs
Unsupervised Learning: the model is “trained” with unlabeled data,leaving it on its own to find structure in its input; the goal is todiscover hidden patterns in data
Reinforcement Learning: the model is receives “training” in the formof feedback that’s analogous to rewards, which it tries to maximize
7 / 33
![Page 10: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/10.jpg)
What is Machine Learning?
Types of Machine Learning
Machine learning approaches are traditionally divided into three broadcategories:
Supervised Learning: the model is “trained” with example inputs andtheir desired outputs; the goal is to learn a general rule that mapsinputs to outputs
Unsupervised Learning: the model is “trained” with unlabeled data,leaving it on its own to find structure in its input; the goal is todiscover hidden patterns in data
Reinforcement Learning: the model is receives “training” in the formof feedback that’s analogous to rewards, which it tries to maximize
7 / 33
![Page 11: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/11.jpg)
What is Machine Learning?
Types of Machine Learning
Machine learning approaches are traditionally divided into three broadcategories:
Supervised Learning: the model is “trained” with example inputs andtheir desired outputs; the goal is to learn a general rule that mapsinputs to outputs
Unsupervised Learning: the model is “trained” with unlabeled data,leaving it on its own to find structure in its input; the goal is todiscover hidden patterns in data
Reinforcement Learning: the model is receives “training” in the formof feedback that’s analogous to rewards, which it tries to maximize
7 / 33
![Page 12: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/12.jpg)
What is Machine Learning?
Types of Machine Learning
Machine learning approaches are traditionally divided into three broadcategories:
Supervised Learning: the model is “trained” with example inputs andtheir desired outputs; the goal is to learn a general rule that mapsinputs to outputs
Unsupervised Learning: the model is “trained” with unlabeled data,leaving it on its own to find structure in its input; the goal is todiscover hidden patterns in data
Reinforcement Learning: the model is receives “training” in the formof feedback that’s analogous to rewards, which it tries to maximize
7 / 33
![Page 13: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/13.jpg)
What is Machine Learning?
Supervised Learning
In a supervised learning problem we are given a set of data(x1, y1), (x2, y2), · · · (xn, yn).
The goal of the model is to be able to learn a function f(x) to predict ygiven x.
Note:
x can be multi-dimensional where each dimension corresponds to anattribute.
y can be real valued (Regression Problem) or y can be categorical(Classification Problem).
8 / 33
![Page 14: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/14.jpg)
What is Machine Learning?
Supervised Learning
In a supervised learning problem we are given a set of data(x1, y1), (x2, y2), · · · (xn, yn).
The goal of the model is to be able to learn a function f(x) to predict ygiven x.
Note:
x can be multi-dimensional where each dimension corresponds to anattribute.
y can be real valued (Regression Problem) or y can be categorical(Classification Problem).
8 / 33
![Page 15: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/15.jpg)
What is Machine Learning?
Supervised Learning
In a supervised learning problem we are given a set of data(x1, y1), (x2, y2), · · · (xn, yn).
The goal of the model is to be able to learn a function f(x) to predict ygiven x.
Note:
x can be multi-dimensional where each dimension corresponds to anattribute.
y can be real valued (Regression Problem) or y can be categorical(Classification Problem).
8 / 33
![Page 16: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/16.jpg)
What is Machine Learning?
Supervised Learning
In a supervised learning problem we are given a set of data(x1, y1), (x2, y2), · · · (xn, yn).
The goal of the model is to be able to learn a function f(x) to predict ygiven x.
Note:
x can be multi-dimensional where each dimension corresponds to anattribute.
y can be real valued (Regression Problem) or y can be categorical(Classification Problem).
8 / 33
![Page 17: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/17.jpg)
What is Machine Learning?
Supervised Learning
In a supervised learning problem we are given a set of data(x1, y1), (x2, y2), · · · (xn, yn).
The goal of the model is to be able to learn a function f(x) to predict ygiven x.
Note:
x can be multi-dimensional where each dimension corresponds to anattribute.
y can be real valued (Regression Problem) or y can be categorical(Classification Problem).
8 / 33
![Page 18: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/18.jpg)
What is Machine Learning?
Examples of Supervised Learning
Predicting House Prices (Regression Problem)
We will leverage data coming from thousands of houses, theirattributes (inputs: location, square footage, number of bedrooms,etc) and prices (labels or outputs), and train a supervised machinelearning model to predict a new house’s price based on the examplesobserved by the model.
Is it a Cat or Dog? (Classification Problem) We train a supervisedML algorithm using labelled pictures of cats and dogs. We then givethe model an image it has never seen before with goal that the modelwill predict what class the image belongs to: cat or dog.
9 / 33
![Page 19: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/19.jpg)
What is Machine Learning?
Components of a ML Algorithm
Every ML algorithm has the following three components:
Mathematical Representation
Parameter Optimization
Performance Evaluation
10 / 33
![Page 20: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/20.jpg)
What is Machine Learning?
Components of a ML Algorithm
Every ML algorithm has the following three components:
Mathematical Representation
Parameter Optimization
Performance Evaluation
10 / 33
![Page 21: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/21.jpg)
What is Machine Learning?
Components of a ML Algorithm
Every ML algorithm has the following three components:
Mathematical Representation
Parameter Optimization
Performance Evaluation
10 / 33
![Page 22: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/22.jpg)
What is Machine Learning?
Components of a ML Algorithm
Every ML algorithm has the following three components:
Mathematical Representation
Parameter Optimization
Performance Evaluation
10 / 33
![Page 23: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/23.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 24: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/24.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 25: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/25.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 26: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/26.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 27: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/27.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 28: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/28.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 29: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/29.jpg)
What is Machine Learning?
Some Supervised ML Algorithms for Classification
Listed below are some standard algorithms designed for classificationproblems:
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Naive Bayes
K Nearest Neighbors (KNN)
Each algorithm has its own set of parameters that can be optimized forbetter model performance.
11 / 33
![Page 30: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/30.jpg)
What is Machine Learning?
Linear Regression Versus Logistic Regression Visually
Logistics Regression uses a method called Gradient Descent to optimizeparameters in the Log Loss Function.
12 / 33
![Page 31: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/31.jpg)
What is Machine Learning?
Linear Regression Versus Logistic Regression Visually
Logistics Regression uses a method called Gradient Descent to optimizeparameters in the Log Loss Function.
12 / 33
![Page 32: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/32.jpg)
What is Machine Learning?
Decision Trees
Decision Trees uses a measure called entropy to find a way to split the setuntil all the data belongs to one of two classes (in case of binary
classification).
13 / 33
![Page 33: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/33.jpg)
What is Machine Learning?
Decision Trees
Decision Trees uses a measure called entropy to find a way to split the setuntil all the data belongs to one of two classes (in case of binary
classification).
13 / 33
![Page 34: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/34.jpg)
What is Machine Learning?
Decision Trees
14 / 33
![Page 35: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/35.jpg)
Classification Problems Fraud Detection
Classification ProblemsFraud Detection
15 / 33
![Page 36: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/36.jpg)
Classification Problems Fraud Detection
Credit Card Fraud Detection
Consider a data set that documents a total of 284,807 credit cardtransactions of which only 492 are fraudulent and the rest (284,315) areconsidered normal transactions.
We would like to build (train) a model that will be able to successfullyclassify future transactions as either normal or fraudulent, with theintention of immediately stopping a fraudulent transaction from beingcompleted.
Since we are interested in identifying future fraudulent transactions we willlabel fraudulent transactions as the positive class and normal transactionsas the negative class.
The data set is highly unbalanced, the positive class (frauds) account for0.172% of all transactions.
16 / 33
![Page 37: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/37.jpg)
Classification Problems Fraud Detection
Credit Card Fraud Detection
Consider a data set that documents a total of 284,807 credit cardtransactions of which only 492 are fraudulent and the rest (284,315) areconsidered normal transactions.
We would like to build (train) a model that will be able to successfullyclassify future transactions as either normal or fraudulent, with theintention of immediately stopping a fraudulent transaction from beingcompleted.
Since we are interested in identifying future fraudulent transactions we willlabel fraudulent transactions as the positive class and normal transactionsas the negative class.
The data set is highly unbalanced, the positive class (frauds) account for0.172% of all transactions.
16 / 33
![Page 38: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/38.jpg)
Classification Problems Fraud Detection
Credit Card Fraud Detection
Consider a data set that documents a total of 284,807 credit cardtransactions of which only 492 are fraudulent and the rest (284,315) areconsidered normal transactions.
We would like to build (train) a model that will be able to successfullyclassify future transactions as either normal or fraudulent, with theintention of immediately stopping a fraudulent transaction from beingcompleted.
Since we are interested in identifying future fraudulent transactions we willlabel fraudulent transactions as the positive class and normal transactionsas the negative class.
The data set is highly unbalanced, the positive class (frauds) account for0.172% of all transactions.
16 / 33
![Page 39: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/39.jpg)
Classification Problems Fraud Detection
Credit Card Fraud Detection
Consider a data set that documents a total of 284,807 credit cardtransactions of which only 492 are fraudulent and the rest (284,315) areconsidered normal transactions.
We would like to build (train) a model that will be able to successfullyclassify future transactions as either normal or fraudulent, with theintention of immediately stopping a fraudulent transaction from beingcompleted.
Since we are interested in identifying future fraudulent transactions we willlabel fraudulent transactions as the positive class and normal transactionsas the negative class.
The data set is highly unbalanced, the positive class (frauds) account for0.172% of all transactions.
16 / 33
![Page 40: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/40.jpg)
Classification Problems Fraud Detection
Steps to Developing a Model
In order to develop a classifier that will identify transactions as eitherfraudulent or normal we perform the following steps:
1 Partition the data into training set and testing set.
2 Train a classifier (e.g. Logistic Regression, Decision Trees) using thetraining data.
3 Tune the model by optimizing its parameters.
4 Pass the reserved testing data set to the final model and evaluatethe model performance.
The above steps can be executed that are available in the free,object-oriented, programming language Python https://www.python.org.
17 / 33
![Page 41: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/41.jpg)
Classification Problems Fraud Detection
Steps to Developing a Model
In order to develop a classifier that will identify transactions as eitherfraudulent or normal we perform the following steps:
1 Partition the data into training set and testing set.
2 Train a classifier (e.g. Logistic Regression, Decision Trees) using thetraining data.
3 Tune the model by optimizing its parameters.
4 Pass the reserved testing data set to the final model and evaluatethe model performance.
The above steps can be executed that are available in the free,object-oriented, programming language Python https://www.python.org.
17 / 33
![Page 42: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/42.jpg)
Classification Problems Fraud Detection
Steps to Developing a Model
In order to develop a classifier that will identify transactions as eitherfraudulent or normal we perform the following steps:
1 Partition the data into training set and testing set.
2 Train a classifier (e.g. Logistic Regression, Decision Trees) using thetraining data.
3 Tune the model by optimizing its parameters.
4 Pass the reserved testing data set to the final model and evaluatethe model performance.
The above steps can be executed that are available in the free,object-oriented, programming language Python https://www.python.org.
17 / 33
![Page 43: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/43.jpg)
Classification Problems Fraud Detection
Steps to Developing a Model
In order to develop a classifier that will identify transactions as eitherfraudulent or normal we perform the following steps:
1 Partition the data into training set and testing set.
2 Train a classifier (e.g. Logistic Regression, Decision Trees) using thetraining data.
3 Tune the model by optimizing its parameters.
4 Pass the reserved testing data set to the final model and evaluatethe model performance.
The above steps can be executed that are available in the free,object-oriented, programming language Python https://www.python.org.
17 / 33
![Page 44: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/44.jpg)
Classification Problems Fraud Detection
Steps to Developing a Model
In order to develop a classifier that will identify transactions as eitherfraudulent or normal we perform the following steps:
1 Partition the data into training set and testing set.
2 Train a classifier (e.g. Logistic Regression, Decision Trees) using thetraining data.
3 Tune the model by optimizing its parameters.
4 Pass the reserved testing data set to the final model and evaluatethe model performance.
The above steps can be executed that are available in the free,object-oriented, programming language Python https://www.python.org.
17 / 33
![Page 45: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/45.jpg)
Classification Problems Fraud Detection
Evaluating Model Performance
There are several standard metrics available to measure the performanceof a ML algorithm, however some are more appropriate than othersdepending on the goal.
Of the 284, 807 recorded transactions we reserve 25% for future testing.That leaves us with 213, 605 transactions to train the model on and71, 202 transactions to test how well the model performs.
Assume that the testing set consists of 71, 073 normal transactions and129 transactions labelled as fraudulent.
18 / 33
![Page 46: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/46.jpg)
Classification Problems Fraud Detection
Evaluating Model Performance
There are several standard metrics available to measure the performanceof a ML algorithm, however some are more appropriate than othersdepending on the goal.
Of the 284, 807 recorded transactions we reserve 25% for future testing.That leaves us with 213, 605 transactions to train the model on and71, 202 transactions to test how well the model performs.
Assume that the testing set consists of 71, 073 normal transactions and129 transactions labelled as fraudulent.
18 / 33
![Page 47: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/47.jpg)
Classification Problems Fraud Detection
Evaluating Model Performance
There are several standard metrics available to measure the performanceof a ML algorithm, however some are more appropriate than othersdepending on the goal.
Of the 284, 807 recorded transactions we reserve 25% for future testing.That leaves us with 213, 605 transactions to train the model on and71, 202 transactions to test how well the model performs.
Assume that the testing set consists of 71, 073 normal transactions and129 transactions labelled as fraudulent.
18 / 33
![Page 48: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/48.jpg)
Classification Problems Fraud Detection
TP, TN, FP, FN
Every prediction that the model makes will be in one of the fourcategories:
True Positive (TP): The transaction was fraudulent and the modelcorrectly predicted the transaction was fraudulent.
True Negative (TN): The transaction was normal and the modelcorrectly predicted the transaction was normal.
False Positive (FP): The transaction was normal but the modelincorrectly predicted that the transaction was fraudulent.
False Negative (FN): The transaction was fraudulent but the modelincorrectly predicted that the transaction was normal.
All this information can be represented in a confusion matrix.
19 / 33
![Page 49: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/49.jpg)
Classification Problems Fraud Detection
TP, TN, FP, FN
Every prediction that the model makes will be in one of the fourcategories:
True Positive (TP): The transaction was fraudulent and the modelcorrectly predicted the transaction was fraudulent.
True Negative (TN): The transaction was normal and the modelcorrectly predicted the transaction was normal.
False Positive (FP): The transaction was normal but the modelincorrectly predicted that the transaction was fraudulent.
False Negative (FN): The transaction was fraudulent but the modelincorrectly predicted that the transaction was normal.
All this information can be represented in a confusion matrix.
19 / 33
![Page 50: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/50.jpg)
Classification Problems Fraud Detection
TP, TN, FP, FN
Every prediction that the model makes will be in one of the fourcategories:
True Positive (TP): The transaction was fraudulent and the modelcorrectly predicted the transaction was fraudulent.
True Negative (TN): The transaction was normal and the modelcorrectly predicted the transaction was normal.
False Positive (FP): The transaction was normal but the modelincorrectly predicted that the transaction was fraudulent.
False Negative (FN): The transaction was fraudulent but the modelincorrectly predicted that the transaction was normal.
All this information can be represented in a confusion matrix.
19 / 33
![Page 51: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/51.jpg)
Classification Problems Fraud Detection
TP, TN, FP, FN
Every prediction that the model makes will be in one of the fourcategories:
True Positive (TP): The transaction was fraudulent and the modelcorrectly predicted the transaction was fraudulent.
True Negative (TN): The transaction was normal and the modelcorrectly predicted the transaction was normal.
False Positive (FP): The transaction was normal but the modelincorrectly predicted that the transaction was fraudulent.
False Negative (FN): The transaction was fraudulent but the modelincorrectly predicted that the transaction was normal.
All this information can be represented in a confusion matrix.
19 / 33
![Page 52: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/52.jpg)
Classification Problems Fraud Detection
TP, TN, FP, FN
Every prediction that the model makes will be in one of the fourcategories:
True Positive (TP): The transaction was fraudulent and the modelcorrectly predicted the transaction was fraudulent.
True Negative (TN): The transaction was normal and the modelcorrectly predicted the transaction was normal.
False Positive (FP): The transaction was normal but the modelincorrectly predicted that the transaction was fraudulent.
False Negative (FN): The transaction was fraudulent but the modelincorrectly predicted that the transaction was normal.
All this information can be represented in a confusion matrix.
19 / 33
![Page 53: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/53.jpg)
Classification Problems Fraud Detection
Confusion Matrix
Table: Confusion matrix
Predicted “fraudulent” Predicted “normal”“fraudulent” TP FN
“normal” FP TN
Suppose the confusion matrix for our credit card fraud detection model is:
Table: Confusion matrix
Predicted “fraudulent” Predicted “normal”“fraudulent” 72 57
“normal” 4 71,069
20 / 33
![Page 54: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/54.jpg)
Classification Problems Fraud Detection
Accuracy
Accuracy measures how often the classifier makes the correct prediction.
Accuracy =TP + TN
TP + TN + FP + FN=
72 + 71, 069
72 + 71, 069 + 4 + 57= 0.99914
Does that mean our model is phenomenal?
NO!!!
We can see that despite having an accuracy of over 99.9% our modelpredicted 57 fraud cases (nearly 44%) incorrectly as normal!
This usually happens when data set is unbalanced.
21 / 33
![Page 55: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/55.jpg)
Classification Problems Fraud Detection
Accuracy
Accuracy measures how often the classifier makes the correct prediction.
Accuracy =TP + TN
TP + TN + FP + FN=
72 + 71, 069
72 + 71, 069 + 4 + 57= 0.99914
Does that mean our model is phenomenal?
NO!!!
We can see that despite having an accuracy of over 99.9% our modelpredicted 57 fraud cases (nearly 44%) incorrectly as normal!
This usually happens when data set is unbalanced.
21 / 33
![Page 56: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/56.jpg)
Classification Problems Fraud Detection
Accuracy
Accuracy measures how often the classifier makes the correct prediction.
Accuracy =TP + TN
TP + TN + FP + FN=
72 + 71, 069
72 + 71, 069 + 4 + 57= 0.99914
Does that mean our model is phenomenal?
NO!!!
We can see that despite having an accuracy of over 99.9% our modelpredicted 57 fraud cases (nearly 44%) incorrectly as normal!
This usually happens when data set is unbalanced.
21 / 33
![Page 57: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/57.jpg)
Classification Problems Fraud Detection
Accuracy
Accuracy measures how often the classifier makes the correct prediction.
Accuracy =TP + TN
TP + TN + FP + FN=
72 + 71, 069
72 + 71, 069 + 4 + 57= 0.99914
Does that mean our model is phenomenal?
NO!!!
We can see that despite having an accuracy of over 99.9% our modelpredicted 57 fraud cases (nearly 44%) incorrectly as normal!
This usually happens when data set is unbalanced.
21 / 33
![Page 58: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/58.jpg)
Classification Problems Fraud Detection
Accuracy
Accuracy measures how often the classifier makes the correct prediction.
Accuracy =TP + TN
TP + TN + FP + FN=
72 + 71, 069
72 + 71, 069 + 4 + 57= 0.99914
Does that mean our model is phenomenal?
NO!!!
We can see that despite having an accuracy of over 99.9% our modelpredicted 57 fraud cases (nearly 44%) incorrectly as normal!
This usually happens when data set is unbalanced.
21 / 33
![Page 59: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/59.jpg)
Classification Problems Fraud Detection
Precision
Precision attempts to answer the question: What proportion of positiveidentifications was correct?
i.e., out of all the transactions predicted to be fraudulent how many ofthem were actually fraudulent?
Precision =TP
TP + FP=
72
72 + 4= 0.94736
22 / 33
![Page 60: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/60.jpg)
Classification Problems Fraud Detection
Precision
Precision attempts to answer the question: What proportion of positiveidentifications was correct?
i.e., out of all the transactions predicted to be fraudulent how many ofthem were actually fraudulent?
Precision =TP
TP + FP=
72
72 + 4= 0.94736
22 / 33
![Page 61: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/61.jpg)
Classification Problems Fraud Detection
Precision
Precision attempts to answer the question: What proportion of positiveidentifications was correct?
i.e., out of all the transactions predicted to be fraudulent how many ofthem were actually fraudulent?
Precision =TP
TP + FP=
72
72 + 4= 0.94736
22 / 33
![Page 62: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/62.jpg)
Classification Problems Fraud Detection
Recall
Recall attempts to answer the question: What proportion of actualpositives was correctly identified?
i.e., out of all transactions that were actually fraudulent, how many ofthem were correctly identified as fraudulent?
Recall =TP
TP + FN=
72
72 + 57= 0.55813
23 / 33
![Page 63: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/63.jpg)
Classification Problems Fraud Detection
Recall
Recall attempts to answer the question: What proportion of actualpositives was correctly identified?
i.e., out of all transactions that were actually fraudulent, how many ofthem were correctly identified as fraudulent?
Recall =TP
TP + FN=
72
72 + 57= 0.55813
23 / 33
![Page 64: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/64.jpg)
Classification Problems Fraud Detection
Precision Versus Recall
As the data expert, you will have to decide which of these scores (precisionor recall) is more important to optimize for your classification problem.
In this problem it would be more harmful to the financial company if thealgorithm incorrectly predicts a transaction as normal when in reality it isfraudulent compared to the trade-off of having a transaction flagged asfraudulent when it is actually normal. Thus, the credit card frauddetection model is a high recall model.
24 / 33
![Page 65: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/65.jpg)
Classification Problems Fraud Detection
Precision Versus Recall
As the data expert, you will have to decide which of these scores (precisionor recall) is more important to optimize for your classification problem.
In this problem it would be more harmful to the financial company if thealgorithm incorrectly predicts a transaction as normal when in reality it isfraudulent compared to the trade-off of having a transaction flagged asfraudulent when it is actually normal. Thus, the credit card frauddetection model is a high recall model.
24 / 33
![Page 66: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/66.jpg)
Classification Problems Fraud Detection
More Metrics
Other metrics that can be used to evaluate model performance are:
Fβ Score
AUC Score
25 / 33
![Page 67: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/67.jpg)
Classification Problems Email Spam Filter
Classification ProblemsEmail Spam Filter
26 / 33
![Page 68: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/68.jpg)
Classification Problems Email Spam Filter
Further Topics for Exploration
You should familiarize yourself with the following terms:
Bias Versus Variance
Model Complexity Graph
Learning Curve
k-fold Cross Validation
Grid Search
27 / 33
![Page 69: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/69.jpg)
Classification Problems Student College Commitment Decisions
Classification ProblemsStudent College Commitment Decisions
28 / 33
![Page 70: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/70.jpg)
Classification Problems Student College Commitment Decisions
The Admission Process
The goal of the research is to develop a mathematical model that canmake a prediction regarding each student’s college commitment decisionby classifying the student into one of two categories: accepts admissionoffer and rejects admission offer.
29 / 33
![Page 71: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/71.jpg)
Classification Problems Student College Commitment Decisions
Features (Input Variables) and Output Variable
Table: List of Variables
Binary Categorical NumericalGross Commit Indicator Application Term GPA
Net Commit Indicator Ethnic Background HS Class RankFinal Decision Permanent State/Region HS Class Size
Financial Aid Intent Permanent Zip Code/Postal Code ACT Composite ScoreScholarship Permanent Country SAT I Critical Reading
Legacy Current/Most Recent School Geomarket SAT I MathDirect Legacy First Source Date SAT I Writing
First Generation Indicator First Source Summary SAT I SuperscoreCampus Visit Indicator Top Academic Interest SATR EBRW
Interview Extracurricular Interests SATR MathRecruited Athlete Indicator Level of Financial Need SATR Total
Sex Reader Academic Rating
30 / 33
![Page 72: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/72.jpg)
Classification Problems Student College Commitment Decisions
The Data Distribution
Table: Counts and Percentages of First-Year Commit Data
Application Year Class of 2018 Class of 2019 Class of 2020 Class of 2021 TotalApplicants 6,071 5,911 6,409 6,775 25,166
Admits 2,549 2,659 2,948 2,845 11,001Admit Percentage 42% 45% 46% 42% 44%
Commits 547 518 502 571 2,138Commit Percentage 21% 19% 17% 20% 19%
Uncommits 2,002 2,141 2,446 2,274 8,863Uncommit Percentage 79% 81% 83% 80% 81%
31 / 33
![Page 73: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/73.jpg)
Acknowledgments
Acknowledgments
I would like to thank Prof. Buckmire for inviting me to Math 396 and forthe opportunity to deliver this lecture.
32 / 33
![Page 74: The Data Science Model - Occidental College](https://reader034.vdocuments.net/reader034/viewer/2022042313/625c59263957d540dc5fc569/html5/thumbnails/74.jpg)
The End
33 / 33