Comparing Prediction Methods for Early Warning Systems
by
Hareem Naveed
A thesis submitted in conformity with the requirements for the degree of Master of Science
Department of Mathematics
University of Toronto
© Copyright 2018 by Hareem Naveed
Abstract
Comparing Prediction Methods for Early Warning Systems
Hareem Naveed
Master of Science
Department of Mathematics
University of Toronto
2018
In this study, we investigate the use of prediction modeling to build an early warning system to identify
students who are at risk of interacting with the criminal justice system in the future. First, we review
algorithms for supervised learning and formulate the problem in a precise modeling framework. In the
data cleaning phase, we link records between the education and criminal justice datasets, achieving an 86% match rate. We then apply different
supervised learning methods and identify the best model for our problem. Using detailed variables,
temporal cross validation and our final prediction method of Random Forests, we achieved a precision
of 0.3 at 1% of the student population. This greatly out-performs the current threshold-based system
that flags a larger percentage of the student body while correctly identifying fewer at-risk students. We
also describe the results of a similar approach to developing an early warning system for public safety.
Acknowledgements
The helpful revisions and constructive feedback provided by Professor Adam Stinchcombe shaped this
thesis into its final form. It would not have been possible without his support and guidance.
I am grateful for funding from the Eric and Wendy Schmidt Foundation through the Data Science for
Social Good 2016 Fellowship at the University of Chicago. I am thankful for my teammates and mentors
during the fellowship who contributed to this work. Additionally, this thesis would not be possible
without the infrastructure, project management and technical support provided by my colleagues at the
Center for Data Science and Public Policy. A special thank you to Rayid Ghani and Adolfo De Unanue
for their constant support, advice and the opportunity to work on interesting projects with meaningful
social impact.
Contents
Acknowledgements
Table of Contents
1 Introduction
2 Supervised Learning Methods
2.1 Machine Learning Systems
2.2 Linear Regression
2.3 Logistic Regression
2.4 Support Vector Machines (SVM)
2.5 Decision Trees
2.6 Random Forest Classifier
3 Model Selection
3.1 Cross Validation
3.2 Metrics
3.3 Criteria to Consider during Model Selection
I Early Warning System for At-Risk Youth
4 Background: Identifying At-Risk Youth
4.1 Motivation
5 Data
5.1 Education Data
5.2 Criminal Justice Data
5.3 Data Challenges
5.4 Matching the Datasets
6 Methods
6.1 Model Evaluation
7 Results
7.1 Model Performance
7.2 Analysis of Results
8 Discussion
8.1 Future Work
8.2 Conclusions
II Early Warning System for Public Safety
9 Refining EWS for Public Safety
9.1 Background
9.2 Data
9.3 Results
9.4 Comparison to Current Approach
9.5 Future Work
Bibliography
Chapter 1
Introduction
Given a set of inputs, a predictive model predicts an outcome. The availability of large datasets and
improved computing capabilities, particularly cloud-based computing, have led to the development and
rapid adoption of predictive modeling in many fields. Traditional statistical modeling techniques are
structured around better understanding the data generating process. In this context, the predictive
model is considered a by-product and not the main goal. These models require strict assumptions and
are not always applicable where data-generating processes are too complicated. In recent years, with
great advances being made in the field of machine learning, there has been a switch in modeling approach
to prioritize prediction accuracy.
An Early Warning System (EWS) takes as input data about entities and outputs scores about the
likelihood of some event happening for that entity in the future. This is a classification task with the
goal being to develop a predictive model that based on data about past behavior of the entity is able to
separate the entities into classes of interest. For example, an EWS that predicts student performance
in a course will use static predictors such as student age in conjunction with dynamic predictors such
as attendance and assessment scores to determine whether the student will pass or fail the course [12].
There is a temporal component to the predictions of an EWS; they are "early" and thus are actionable.
For example, in the EWS for student performance, the predictions have to be generated at some point
before the course is over, to allow for effective intervention. The temporal component is a tunable
parameter that must be optimized during model selection (e.g., is the EWS better at predicting a student
failing halfway through the semester, or at the start of the semester?).
Machine learning-based EWSs have previously been developed for many applications and have been
used to predict asteroid impact [19], seismic events in coal mines, dengue outbreaks [2], and student
performance in an undergraduate course [12].
The thesis is structured as follows: in the preamble, I describe modeling methods in general, outline
metrics for model evaluation, and techniques for temporal cross-validation. In the first section, I describe
the development of an EWS used to identify students at risk of interacting with the criminal justice
system as juveniles. In the second section, I describe the replication and expansion of an EWS for public
safety agencies deployed in a new agency.
Chapter 2
Supervised Learning Methods
2.1 Machine Learning Systems
Machine learning systems can be categorized along two axes [8]:
1. If the system is trained with labeled data, it is a supervised learning system. If no labels are
available, it is known as an unsupervised learning system.
2. If the system detects patterns and builds models, it uses model-based learning. If it compares new
data points to known data points, it uses instance-based learning.
These categories can be combined. For example, an email spam filter can be built using a linear
regression model, a supervised learning method trained on emails labeled as spam and not-spam. This
makes it a model-based, supervised learning system. An Early Warning System (EWS) is a model-based,
supervised learning system.
In this section, we first introduce the mathematical structure and notation of the prediction problem.
Then, we present in detail some algorithmic approaches to solving the prediction problem in an EWS.
2.1.1 Problem
The set of examples used by a predictive model to learn patterns is called the training dataset. It has
the form $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, and each instance in it is a sample. A single observation is a pair $(x, y)$ where
$y$ is the response variable, and $x$ is a $k$-length vector of predictors for one entity. Assume that $y$ is a
realization of the random variable $Y$, and $x$ is a realization of the random variable $X$. Given some input
$x$, we want to predict the expected value of $Y$, which we assume depends on $X$. This is given by:
$$E(Y \mid X = x) = h(x) \tag{2.1}$$
If the response variable is quantitative, $h$ is a regression model defined by the function $h : \mathbb{R}^k \to \mathbb{R}$. If
the response variable is categorical, with $K$ possible classes, $h$ is a classification model defined by the
function $h : \mathbb{R}^k \to \mathcal{G}$ with $\mathcal{G} = \{G_1, \ldots, G_K\}$. In the case of an EWS, we want a classification model that
will have the form:
$$h : \mathbb{R}^k \to \{G_1, G_2\} \tag{2.2}$$
Table 2.1: Information about attributes in the Abalone dataset.
Name            Data Type    Unit    Feature Name
Sex             Categorical  -       sex
Length          Continuous   mm      length
Diameter        Continuous   mm      diameter
Height          Continuous   mm      height
Whole Weight    Continuous   grams   weight_w
Shucked Weight  Continuous   grams   weight_s
Viscera Weight  Continuous   grams   weight_v
Shell Weight    Continuous   grams   weight_sh
Rings           Integer      -       rings
Figure 2.1: A histogram for every numerical feature in the Abalone dataset
We are trying to predict whether or not an entity will have some defined event in the future. Let G1 be
the case where the entity has an event in the future, and G2 be the case where the entity does not. In
the algorithmic approach to prediction modeling that is adopted by machine learning, we try to fit the
best possible function for the problem at hand.
2.1.2 Sample Dataset
To ground our discussion of the different methods, we will implement them using the Abalone dataset
[14]. The dataset has 9 fields (see Table 2.1) and has been used as a benchmark for testing new methods
[21] as it is relatively simple. For context, abalones are a type of marine snail. The age of an abalone is
given by the number of rings on the shell + 1.5. In order to count the rings, the shell has to
be stained and analyzed under a microscope [14]. Thus, the age of an abalone is difficult to measure directly, and
we will try to predict the age using other physical characteristics with different supervised, model-based
learning methods. Age is a continuous variable, and we will use it as a label for detailing regression
methods. For the supervised classification methods, we will classify the snail as either male or female
based on other physical attributes.
A supervised learning algorithm analyzes the training data and uses the inferred function to map
new examples from the feature space to the label space. The feature space is the p-dimensional space
where the variables of interest live. From the abalone dataset, if we use all attributes as predictors for
the response variable age, the feature space is $\mathbb{R}^8$. Histograms of the seven numerical features are given
in Fig. 2.1. The categorical feature is gender. A sample from the dataset is a vector with values for all
8 attributes for an abalone, and the corresponding age.
2.1.3 Algorithmic Paradigm
We start with a training set of data and want to learn something about the structure of our data.
Characteristics of the entities represented in the data are captured by predictors. We make hypotheses
about the data structure and its relationships, and these are captured by parameters that we are trying
to learn [8].
In this section, we review the algorithms behind five supervised learning methods and the techniques
used to train them: linear regression, logistic regression, support
vector machines, decision trees, and random forests.
For each of these models, we have the following:
$$\hat{y} = h_\theta(X) \tag{2.3}$$
• $\hat{y}$ is the predicted value. If the chosen model and parameters correctly capture the structure of the
data, then $\hat{y} = y$.
• $h_\theta$ is the function we hypothesize will best model our data; it has the parameters $\theta$.
• $\theta$ is a vector of the model's parameters.
2.2 Linear Regression
If $h(X)$ is a linear function, then the model is linear regression. A linear regression makes a prediction
by computing a weighted sum of the inputs. The training task in the model is learning the weights for
each predictor. The prediction has the form:
$$h(X) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T X \tag{2.4}$$
1. $\theta_j$ is the j-th model parameter, with $\theta_1$ to $\theta_n$ the feature weights, and $\theta_0$ being the bias term.
$\theta = (\theta_0, \theta_1, \ldots, \theta_n)^T \in \mathbb{R}^{n+1}$, and $X = (1, x_1, \ldots, x_n)^T$ includes a constant 1 so that $\theta^T X$ picks up the bias term.
2. $n$ is the number of predictors.
Training a model means that we propose parameters that best model the data. Generally, for any
model, we try to minimize a cost function to get the best fit. To measure the fit of a linear model with
m samples, we use the Mean Squared Error (MSE) as our cost function:
$$\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\theta^T x_i - y_i)^2 \tag{2.5}$$
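MSE captures the difference between the estimated and the true value. We want a model that has
parameters $\theta$ that will minimize MSE. Computing the partial derivative of the MSE function gives:
$$\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} (\theta^T x_i - y_i)\, x_{ij}$$
Converting this to vector notation, with $X$ now denoting the $m \times (n+1)$ matrix whose rows are the samples $x_i^T$ and $y$ the vector of responses:
$$\nabla_\theta \mathrm{MSE}(\theta) = \begin{bmatrix} \frac{\partial}{\partial \theta_0} \mathrm{MSE}(\theta) \\ \frac{\partial}{\partial \theta_1} \mathrm{MSE}(\theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \mathrm{MSE}(\theta) \end{bmatrix} = \frac{2}{m} X^T (X \theta - y)$$
Setting the gradient equal to 0 and solving for $\theta$, we get:
$$\theta = (X^T X)^{-1} X^T y$$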
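As a minimal sketch of the closed-form solution above, the following Python solves the normal equations for a bias term plus a single predictor. In that case $X^T X$ is a 2×2 matrix, so its inverse can be written out explicitly. The function name `fit_normal_equation` and the toy data are illustrative assumptions and are not taken from the abalone experiments.

```python
def fit_normal_equation(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y where each row of X is [1, x]."""
    m = len(xs)
    # Entries of the 2x2 matrix X^T X and of the 2-vector X^T y.
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    det = m * s_xx - s_x * s_x  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular (the predictor is constant)")
    # Explicit 2x2 inverse applied to X^T y.
    theta0 = (s_xx * s_y - s_x * s_xy) / det
    theta1 = (m * s_xy - s_x * s_y) / det
    return theta0, theta1


# Hypothetical toy data lying exactly on the line y = 1.5 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.5, 3.5, 5.5, 7.5]
theta0, theta1 = fit_normal_equation(xs, ys)
```

On data that lies exactly on a line, the recovered bias and weight match the line's intercept and slope. With many predictors, one would solve the linear system with a numerical library rather than forming the inverse by hand.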
In the abalone example, let us fit a linear regression model with just one predictor (Fig. 2.2), the shell
weight, to predict the age. Our model has the form:
$$y_{\text{age}} = \theta_0 + \theta_1 x_{\text{weight shell}} \tag{2.6}$$
Making the assumption that the more rings a shell has, the heavier it will be, we test to see if that is
the case.
Figure 2.2: A linear regression model (red line) which tries to use the shell weight to predict age. The
score ($R^2$) of this model is 0.4, meaning that it explains 40% of the variability in the data.
This is a simple enough example with few predictors, and it is possible to invert the matrix.
But if there are more than 100,000 predictors, it becomes very slow to invert a non-sparse matrix. The
method used to train more complex linear regression models is gradient descent. Gradient descent is an
iterative optimization algorithm that is used to find the minimum of a function. For a function $f$ that is
defined and differentiable at a point $a_n$, it follows that if $a_{n+1} = a_n - \eta \nabla f(a_n)$ for a sufficiently small step size $\eta > 0$, then $f(a_{n+1}) \le f(a_n)$.
Intuitively, if we move in the negative direction of the gradient, we will move towards a smaller value. In
the case of linear regression, we use gradient descent to find the minimum of the MSE. In this case, the
cost function is quadratic and there is only one minimum (Fig. 2.3). Regardless of what the initial values
are for the gradient descent, we will arrive at the optimal solution, provided the learning rate is small enough. Specific to our parameters,
the gradient descent step is defined by:
$$\theta' = \theta - \eta \nabla_\theta \mathrm{MSE}(\theta)$$
$\eta$ is the learning rate, typically chosen between 0 and 1.
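The update rule above can be sketched in a few lines of Python. This is an illustrative batch gradient descent on the MSE, not the code used to produce the figures; the function names, the learning rate of 0.1, the step count, and the toy data are all assumptions chosen so that the iteration converges.

```python
def mse_gradient(theta, X, y):
    """Gradient (2/m) X^T (X theta - y), with X given as a list of rows."""
    m = len(X)
    residuals = [sum(t * x for t, x in zip(theta, row)) - yi
                 for row, yi in zip(X, y)]
    return [2.0 / m * sum(r * row[j] for r, row in zip(residuals, X))
            for j in range(len(theta))]


def gradient_descent(X, y, eta=0.1, n_steps=5000):
    """Repeat theta <- theta - eta * grad MSE(theta), starting from zeros."""
    theta = [0.0] * len(X[0])
    for _ in range(n_steps):
        grad = mse_gradient(theta, X, y)
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: each row is [1, x], so theta[0] is the bias term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.5, 3.5, 5.5, 7.5]
theta = gradient_descent(X, y)
```

For this toy problem the iterates approach the same parameters as the closed-form solution (bias 1.5, weight 2). A learning rate that is too large would make the iterates diverge instead, which is why the step size has to be tuned.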
Figure 2.3: The cost function for a linear regression model. The colored dots on the plot represent
different steps in the gradient descent.
For a more complex linear regression example, we use all the features from the dataset to predict
age (Fig. 2.4). This gives us a score of 0.48 which is slightly better than our original model (Fig. 2.2).
Note that linear regression assumes that the relationship between the predictors and the response is linear, and so performance is affected
when the data has a different underlying structure.
2.2.1 Regularization: Modifying Cost Functions
In selecting models, we want to be careful that they are not over-fitting to the data. Over-fitting means
that they perform very well on the training dataset but do not generalize to new samples. Above,
the two models we evaluated on the test set only have scores of 0.4 and 0.48. They do not generalize well
to new data (we posited above that this may be because the data is not linear, but for the purpose of
demonstration, let us ignore that fact). In linear regression, regularization is done by constraining the
weights of the predictors. This means that if one predictor is particularly oversampled, or has large
values in the training dataset, we try to constrain it. There are three common ways to do this. One way
is LASSO (Least Absolute Shrinkage and Selection Operator), which penalizes MSE with an $\ell_1$ norm
[18].
Figure 2.4: Comparing the predicted value for age with the true value. The model includes all predictors
and has a score of 0.48.
The cost function for LASSO regression is given by:
$$\mathrm{Cost}(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} |\theta_i| \tag{2.7}$$
In Ridge Regression, the cost function is modified by adding an $\ell_2$ norm:
$$\mathrm{Cost}(\theta) = \mathrm{MSE}(\theta) + \frac{\alpha}{2} \sum_{i=1}^{n} \theta_i^2 \tag{2.8}$$
The $\ell_1$ norm tends to shrink the weights of the least important features to zero, and so performs automatic feature
selection. It is preferred when there is an understanding that only a few predictors are important for the
prediction problem. The $\ell_2$ norm keeps predictor weights as small as possible, but does not eliminate
them. The third cost function is elastic net [24], a mix of both LASSO and ridge controlled by a mixing ratio $r$:
$$\mathrm{Cost}(\theta) = \mathrm{MSE}(\theta) + \frac{1 - r}{2}\, \alpha \sum_{i=1}^{n} \theta_i^2 + r \alpha \sum_{i=1}^{n} |\theta_i| \tag{2.9}$$
When $r = 0$, elastic net is equivalent to ridge regression, and when $r = 1$, elastic net is equivalent to
LASSO. Using the above regularization techniques, our linear model from Fig. 2.4 scores as follows for
different values of $\alpha$:
Table 2.2: Scores of the regularized linear regression from Fig. 2.4 with varying $\alpha$ values. Note that in
ridge regression, a large $\alpha$ will set all model weights close to 0, and in LASSO regression, a large $\alpha$ will
eliminate some features.
Alpha  LASSO Score  Ridge Score
0.1    0.29         0.50
0.5    0.19         0.52
1      -2.3         0.52
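To make the relationship between the three penalties concrete, the cost functions in Eqs. 2.7 to 2.9 can be written directly in Python. This is a sketch under stated assumptions: the function names, toy data, and parameter values are hypothetical, and library implementations may scale the MSE and penalty terms differently, so scores from such libraries need not match these formulas term for term.

```python
def mse(theta, X, y):
    """Mean squared error of a linear model; each row of X starts with 1."""
    m = len(X)
    return sum((sum(t * x for t, x in zip(theta, row)) - yi) ** 2
               for row, yi in zip(X, y)) / m


def lasso_cost(theta, X, y, alpha):
    # The bias theta[0] is not penalized: the sums in Eqs. 2.7-2.9 start at i = 1.
    return mse(theta, X, y) + alpha * sum(abs(w) for w in theta[1:])


def ridge_cost(theta, X, y, alpha):
    return mse(theta, X, y) + alpha / 2 * sum(w * w for w in theta[1:])


def elastic_net_cost(theta, X, y, alpha, r):
    l1 = sum(abs(w) for w in theta[1:])
    l2 = sum(w * w for w in theta[1:])
    return mse(theta, X, y) + (1 - r) / 2 * alpha * l2 + r * alpha * l1


# Hypothetical toy data and parameters, for illustration only.
X = [[1.0, 0.5, 2.0], [1.0, 1.5, 0.0], [1.0, 3.0, 1.0]]
y = [2.0, 3.0, 5.0]
theta = [0.5, 2.0, -1.0]
alpha = 0.5
```

Evaluating `elastic_net_cost` with `r = 0` reproduces `ridge_cost`, and with `r = 1` reproduces `lasso_cost`, which is the equivalence stated above.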
2.3 Logistic Regression
Linear regression can also be used for classification with some modifications; logistic regression computes
a weighted sum of the predictors like linear regression, but outputs the logistic of that result:
$$h(X) = \sigma(\theta^T X) \tag{2.10}$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$. Logistic regression is used for classification by imposing a threshold:
$$\hat{y} = \begin{cases} 0 & \text{if } \sigma(\theta^T X) < 0.5 \\ 1 & \text{if } \sigma(\theta^T X) \ge 0.5 \end{cases} \tag{2.11}$$
Instead of MSE, we use log-loss as the cost function for logistic regression as it allows us to penalize
false classifications [9]. For a single instance $(x, y)$, it is given by:
$$\mathrm{Cost}(\theta) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{2.12}$$
For a dataset, the cost function is given by:
$$\mathrm{Cost}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \tag{2.13}$$
This function can also be minimized using gradient descent as defined previously. Results for logistic
regression on the abalone dataset are presented at the conclusion of this section.
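The pieces of logistic regression described in this section (the logistic function, the 0.5 threshold, and the log-loss over a dataset) can be sketched as follows. The function names and toy data are illustrative assumptions. The sketch only evaluates the model and its cost, and it would need a more careful `sigmoid` to avoid overflow for very large negative inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x) = sigma(theta^T x), where x starts with a constant 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict(theta, x, threshold=0.5):
    """Class 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0


def log_loss(theta, X, y):
    """Average log-loss over the dataset, as in Eq. 2.13."""
    m = len(X)
    total = 0.0
    for row, yi in zip(X, y):
        p = predict_proba(theta, row)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / m


# Hypothetical toy data: rows are [1, x]; labels are 0 or 1.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
uninformed = log_loss([0.0, 0.0], X, y)  # every probability is 0.5
informed = log_loss([0.0, 2.0], X, y)    # probabilities lean the right way
```

With all parameters at zero, every predicted probability is 0.5 and the loss equals $\log 2$; parameters that push the probabilities towards the correct labels give a smaller loss.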
2.4 Support Vector Machines (SVM)
In classification, we want to find a linear decision boundary that clearly separates two classes. From
Fig. 2.5, note that the two classes in the dataset are clearly separated by a solid black line. The decision
boundary is supported by the instances that fall on the dotted lines. They are called support vectors [8].
On either side, the SVM fits the widest possible margin (distance between the decision boundary and
the support vectors) that separates the two classes. Adding examples that lie outside the margin will not
influence the decision boundary. If the data is clearly linearly separable, then it is possible to impose
hard-margin classification. This ensures that there are no margin violations: no data points are found
between the decision boundary and the support vectors. This is the strictest condition.
Figure 2.5: Separating two classes using a linear SVM. Source: [8]
In SVM, we are trying to fit a linear decision boundary. Consider the classifier setup:
$$h_{w,b}(x) = w^T x + b \tag{2.14}$$
If we let our label classes be $\{-1, 1\}$ then:
$$\hat{y} = \begin{cases} -1 & \text{if } w^T x + b < 0 \\ 1 & \text{if } w^T x + b \ge 0 \end{cases} \tag{2.15}$$
If $y = 1$ for some $x$, then $w^T x + b$ needs to be a large positive number (effectively $w^T x + b \ge 1$,
not just $\ge 0$). Similarly, if the label is negative, then that number needs to be a large negative number.
To be confident in the prediction, a large functional margin is required [9]. The functional margin has
the formula $\hat{\gamma}_i = y_i(w^T x_i + b)$, and gives an idea of whether or not a point is properly classified. The
slope of the decision function is $\|w\|$. The function identifies a hyperplane, with intercept $b$ and the
normal vector $w$ which is perpendicular to the hyperplane. To get a large margin, we want to minimize
the norm of the weights $w$. If we want to rule out all margin violations (the hard-margin rule), the decision function
must be at least 1 for all positive training samples, and at most -1 for all negative training samples.
Then, the objective function has the following form:
$$\min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, m$$
Scaling $(w, b)$ by a positive constant does not change the prediction function, since the output only
depends on the sign but not the magnitude of $w^T x + b$. Imposing a normalization condition such as $\|w\| = 1$ therefore means we
can replace $(w, b)$ with a rescaled pair without changing the classifier. The geometric margin of a sample is its functional margin divided by $\|w\|$, that is, its signed distance from the decision boundary; it is positive when the sample lies on the correct side. Given a training set, it is important to find
a decision boundary that maximizes the geometric margin, as this gives us a confident set of predictions.
This is the hard-margin classification problem.
We get the geometric margin of the training set as:
$$\gamma = \min_{i = 1, \ldots, m} \gamma^{(i)} \tag{2.16}$$
Assume that our training dataset is linearly separable, that is, the positive and negative examples can
be split using a separating hyperplane. To get the maximum geometric margin, we want to maximize $\gamma$ given
that each training example has margin at least $\gamma$. The optimization problem takes the form:
$$\max_{w,b} \; \gamma \tag{2.17}$$
$$\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \gamma, \quad i = 1, \ldots, m \tag{2.18}$$
$$\|w\| = 1 \tag{2.19}$$
This is a non-convex problem because of the constraint on $\|w\|$. We take away that constraint and instead fix the functional margin at $\hat{\gamma} = 1$; maximizing
$\hat{\gamma}/\|w\| = 1/\|w\|$ is then the same as minimizing $\|w\|^2$. The objective $\frac{1}{2}\|w\|^2$ has a much nicer derivative and, together
with the constraints $y^{(i)}(w^T x^{(i)} + b) \ge 1$, can be expressed as a Quadratic Programming problem. Many
solvers exist for this kind of problem, so it can be solved efficiently [5].
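The distinction between the functional and the geometric margin can be checked numerically. The sketch below assumes a fixed, hand-picked hyperplane $(w, b)$ and a hypothetical four-point dataset; it does not solve the quadratic program, it only evaluates the quantities that appear in it.

```python
import math


def functional_margins(w, b, X, y):
    """y_i * (w^T x_i + b) for every training sample."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            for x, yi in zip(X, y)]


def geometric_margin(w, b, X, y):
    """Smallest functional margin divided by ||w||."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(functional_margins(w, b, X, y)) / norm


def satisfies_hard_margin(w, b, X, y):
    """True if every sample meets the constraint y_i (w^T x_i + b) >= 1."""
    return all(g >= 1 for g in functional_margins(w, b, X, y))


# Hypothetical linearly separable data with labels in {-1, 1}.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], -1.0

gamma = geometric_margin(w, b, X, y)
gamma_rescaled = geometric_margin([10 * wj for wj in w], 10 * b, X, y)
```

Rescaling $(w, b)$ by 10 multiplies every functional margin by 10 but leaves the geometric margin unchanged, which is why the functional margin can be fixed at 1 without loss of generality.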
Fitting a linear SVM classifier to our abalone dataset to test whether length and diameter
can predict gender shows that the data is not linearly separable. From Fig. 2.6, we note that
regardless of the parameter tuning, the data does not separate linearly. The next step to consider
here would be different types of kernels that allow for non-linearity in the dataset. In the usual soft-margin formulation, where C weights the penalty on margin violations, the higher the C
parameter is, the fewer margin violations are tolerated.
Figure 2.6: Trying to classify males and females using a linear SVM with different parameter settings.
2.5 Decision Trees
The use of decision tree classifiers has been proposed in many areas ranging from speech recognition to
remote sensing [3]. A decision tree built on a subset of the abalone data is illustrated in Fig. 2.7. In
mathematics, a tree is an undirected graph where any two vertices are connected by exactly one path. In
trees, the vertices are nodes and the edges are branches. A decision tree has three kinds of nodes: a root
node, internal nodes, and leaf nodes. A root node has no incoming edges and a leaf node has no
outgoing edges. In Fig. 2.7, the root node checks whether the length of the sample is less than
0.61mm. At each internal node, a test is performed on the data, and the left branch
is taken by the samples that pass the test.
Essentially, a decision tree partitions the k-dimensional space of predictors into $K$ hypercubes, $H_l$ for
$l = 1, \ldots, K$, and fits a very simple, (usually) constant model on each region. With $c_l$ as some constant,
a decision tree can then be represented as:
$$h(x) = \sum_{l=1}^{K} c_l \, \mathbb{1}(x \in H_l) \tag{2.20}$$
Every internal node partitions the instance space into two or more subspaces; this process continues
recursively until the parts only contain samples from one class. This terminates at the leaf nodes.
Decision trees can be used for classification and regression. In this section, we focus on decision tree
classifiers as that is the method we will employ for our EWS.
There are three main algorithms for decision tree classifiers: ID3 (Iterative Dichotomizer 3) [16], C4.5,
and CART (Classification and Regression Tree). Each algorithm uses different splitting criteria, but all
three have the same tree coverage approach [3].
2.5.1 Iterative Dichotomizer 3 (ID3)
ID3 was first proposed by Ross Quinlan [?], and it uses information gain to split at each node. ID3 splits
data based on the homogeneity of a sample and uses entropy to calculate this homogeneity. A sample
has entropy of 0 if it is totally homogeneous and an entropy of 1 if it is evenly split between two classes.
Figure 2.7: A simple decision tree built using length and diameter to predict gender in Abalones.
To compute the entropy with one attribute, with $S$ being the original set, we use:
$$\mathrm{Entropy}(S) = \sum_{i=1}^{j} -p_i \log_2 p_i \tag{2.21}$$
where $j$ is the number of classes and $p_i$ is the probability of drawing a sample of class $i$ when randomly selecting from the set. We compute $p_i$
as $\frac{n_i}{|S|}$, where $n_i$ is the number of samples of class $i$.
To compute the entropy using two attributes, we use:
$$\mathrm{Entropy}(T, X) = \sum_{c \in X} P(c)\, E(c) \tag{2.22}$$
where $P(c)$ is the proportion of samples for which attribute $X$ takes the value $c$, and $E(c)$ is the entropy of the target $T$ within that subset.
The information gain is then computed based on the decrease in entropy after a dataset is split on
an attribute. Decision tree construction depends on finding the attributes that give the most homogeneous
branches, that is, those that return the highest information gain. First, the entropy of the target (label class) is
calculated. The dataset is then split on each of the different attributes and the entropy of each branch is
calculated. The gain is obtained by subtracting the entropy after the split from the entropy
before the split:
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X) \tag{2.23}$$
A branch with entropy zero is a leaf node, whereas a branch with entropy greater than zero still needs
further splitting. The ID3 algorithm runs recursively on all non-leaf branches until the tree is complete.
The main advantage of a decision tree is that it can be easily converted to a set of rules that maps
the data and shows how each decision is made. ID3 is the simplest decision tree classifier
algorithm and it has a depth-first approach. The main drawbacks of the ID3 algorithm are that it is only
built for categorical variables, and has low classification accuracy on large datasets [3]. In contrast, its
successor, the C4.5 algorithm, can handle numeric data but is also not very successful for large datasets
[3].
2.5.2 Classification and Regression Tree (CART)
The CART algorithm creates binary decision trees, which means that each non-leaf node has exactly two
children. In contrast, other methods can have more than two children per non-leaf node. We used CART
to classify the abalone dataset with length and diameter as predictors (Fig. 2.7). In CART, the training
set is split into two, using one feature $f$ and some threshold $t_f$ associated with that feature. The cost
function that the algorithm tries to minimize for a classification problem is given by:
$$\mathrm{Cost}(f, t_f) = \frac{n_{\text{left}}}{n}\, \mathrm{Gini}_{\text{left}} + \frac{n_{\text{right}}}{n}\, \mathrm{Gini}_{\text{right}} \tag{2.24}$$
• $n_{\text{left}}$ and $n_{\text{right}}$ are the number of samples in the left and right node respectively, and $n$ is the number
of samples in total.
• The Gini impurity of the $j$-th node is $\mathrm{Gini}_j = 1 - \sum_{i} p_{j,i}^2$, where the sum runs over the classes $i$ and $p_{j,i}$ is the ratio of instances of class $i$ at the $j$-th node.
The process is recursive: the algorithm continues to split the training subsets until only
leaf nodes remain. The regression cost function for CART is the same, except instead of Gini impurity,
the MSE is minimized.
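A single CART split on one numeric feature can be sketched as follows. The helper names, the choice of midpoints between consecutive distinct values as candidate thresholds, and the toy lengths and labels are assumptions for illustration; a full implementation would repeat this search over every feature and recurse into each child node.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_cost(values, labels, threshold):
    """Weighted Gini impurity of the two children, as in Eq. 2.24."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(side) / n * gini_impurity(side)
               for side in (left, right) if side)


def best_threshold(values, labels):
    """Search midpoints between consecutive distinct values (needs at least two)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_cost(values, labels, t))


# Hypothetical toy feature (a length) and labels.
lengths = [0.30, 0.45, 0.50, 0.65, 0.70]
genders = ["F", "F", "F", "M", "M"]
threshold = best_threshold(lengths, genders)
cost = split_cost(lengths, genders, threshold)
```

On this toy data the best threshold falls between 0.50 and 0.65, and both children are pure, so the cost of the split is zero.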
2.6 Random Forest Classifier
A random forest is a classifier made up of an ensemble of decision tree classifiers $h(\theta_k, x)$ where the $\theta_k$ are
i.i.d. random vectors and each tree classifier casts a vote for a label given the same input [6]. It is trained
using the bootstrap aggregation (bagging) method. In bootstrap aggregation, a diverse set of decision-tree-based
classifiers is fit by training each on a random subset of the training set, where sampling is performed
with replacement. The final prediction for a random forest comes from taking the majority vote across
the different trees [6].
Random Forests can be developed as an extension of the bagging algorithm. The bagging algorithm is simple
and is given by:
1. For $b = 1, \ldots, n$, sample with replacement from the training set to get $X_b, Y_b$.
2. Train a decision tree $dt_b$ on $X_b, Y_b$ using the methods outlined above under the description of
decision trees.
3. $\hat{y} = \frac{1}{n} \sum_{b=1}^{n} dt_b(x)$ gives the prediction for a new sample $x$ (for classification, the majority vote is taken instead of the average).
Random Forests diverge from the bagging algorithm in that each tree also considers only a random
subset of the features when choosing a split [6].
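The three bagging steps listed above can be sketched in Python as follows. To keep the example short, each "tree" is a decision stump (a single threshold on a single feature), which is a simplifying assumption and not the tree learner used in this thesis. The function names, the number of trees, the seed, and the toy data are also hypothetical, and the sketch does not include the random feature selection that a full random forest adds.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y):
    """Stand-in for a decision tree: the best single-feature threshold rule."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority label
        label = majority(y)
        return lambda row: label
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def train_bagged_ensemble(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (same size as the data, with replacement).
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        # Step 2: train one tree on the bootstrap sample.
        trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_majority(trees, row):
    # Step 3: aggregate the individual predictions by majority vote.
    return majority([tree(row) for tree in trees])


# Hypothetical toy data: [length, diameter] with gender labels.
X = [[0.30, 0.20], [0.35, 0.25], [0.40, 0.30],
     [0.60, 0.45], [0.65, 0.50], [0.70, 0.55]]
y = ["F", "F", "F", "M", "M", "M"]
trees = train_bagged_ensemble(X, y)
small = predict_majority(trees, [0.10, 0.10])
large = predict_majority(trees, [0.90, 0.80])
```

Because each stump sees a different bootstrap sample, the individual thresholds vary from tree to tree, and the vote averages out that variation.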
In a random forest, the margin function measures the extent to which the average number of votes
for the correct class at some input $X$ exceeds the average number of votes for any other class. This function is given by:
$$mg(X, Y) = E_k[\mathbb{1}(h_k(X) = Y)] - \max_{j \ne Y} E_k[\mathbb{1}(h_k(X) = j)] \tag{2.25}$$
• $E_k[\mathbb{1}(h_k(X) = Y)]$ is the proportion of classifiers for which $h_k(X) = Y$. This is equal to $\frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[h_k(X) = Y]$.
The margin function takes into account how the average number of votes at $(X, Y)$ for the correct class
compares to the average number of votes for the next-best class. When the margin is larger, we are more
confident in our predictions. This is similar to the setup of the SVM where the larger the geometric
margin (how far the samples are from the decision boundary), the more confident we are in our
classification prediction.
The generalization error measures how accurately the algorithm will predict unseen samples. It is
the probability that the margin function is less than zero:
$$e = P_{X,Y}(mg(X, Y) < 0) \tag{2.26}$$
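Given the votes cast by the individual trees, the margin function and an empirical estimate of the generalization error can be computed as sketched below. The function names and the vote lists are hypothetical, and the estimate is simply the fraction of a finite sample with negative margin, not the population probability in Eq. 2.26.

```python
from collections import Counter


def margin(votes, true_label):
    """Share of votes for the true label minus the largest share for any other label."""
    k = len(votes)
    counts = Counter(votes)
    correct = counts.get(true_label, 0) / k
    others = [c / k for label, c in counts.items() if label != true_label]
    return correct - (max(others) if others else 0.0)


def estimated_generalization_error(all_votes, true_labels):
    """Fraction of samples whose margin is negative."""
    margins = [margin(v, t) for v, t in zip(all_votes, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)


# Hypothetical votes from five trees on two samples whose true label is "M".
votes_a = ["M", "M", "F", "M", "I"]  # 3/5 correct, next-best class has 1/5
votes_b = ["F", "F", "M", "F", "M"]  # 2/5 correct, next-best class has 3/5
error = estimated_generalization_error([votes_a, votes_b], ["M", "M"])
```

The first sample has a positive margin and is classified correctly by the majority vote; the second has a negative margin and is misclassified, so half of this tiny sample counts towards the error estimate.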
Breiman [6] proves that as the number of trees increases, the generalization error converges. Random
forests are often among the best performing methods for a range of applications. In part, this is because they
require almost no input preparation and can handle categorical and numeric features without any need
for predictor scaling. In contrast, SVMs are very sensitive to predictor scaling, logistic regression assumes
a linear decision boundary, and decision trees are sensitive to dataset rotation. Random forests are also quick
to train, and perform implicit feature selection. In general, they are a strong "simple" model
and provide a good benchmark against which to evaluate other, more complicated
models.
As an overview of all the methods, let us compare the accuracy of the different classifier methods we
presented in this section at predicting the gender of abalone in the test dataset. The ensemble classifier
is constructed by building many decision trees and taking the majority vote, as a rough approximation
of a random forest.
Table 2.3: Comparing the performance of the different classifiers covered in this Introduction.
Classifier                             Accuracy Score
Logistic Regression                    0.5595
Support Vector Classifiers             0.5488
Decision Trees                         0.5417
Random Forest                          0.5027
Ensemble (Majority Voting Classifier)  0.5293
Chapter 3
Model Selection
3.1 Cross Validation
In order to evaluate a classifier, we will use cross-validation. A simple method for cross-validation is
k-fold cross-validation. In this case, the data is split into k folds and then predictions are made and
evaluated on each fold with a model that was trained using the other k - 1 folds. When working with
data with temporal structure, we cannot use standard methods to validate our models, as there may be
leakage of information from the features to the labels. For example, if one of the features is number of
disciplinary hearings an employee attends and the label is complaints, note that an employee who has
a complaint will always have a disciplinary hearing. If the training set includes all data over time, the
feature of number of disciplinary hearings will be a perfect predictor of adverse interactions. If we are
careful about the temporal splits, then we can use past data to predict future data. In setting up an
EWS, this is very important, as we do not care about overall prediction accuracy, but rather the ability
of the EWS to predict events in the future.
In the case of an early warning system, most features are event-based, with the idea that a specific
sequence of events increases the risk score of an entity over time. We need to perform temporal cross-
validation on our data. The splitting of the dataset needs to be done at the event level. A model that
is being put into production will need a training window that includes data from the beginning of time
until that date, and then will need a label window for the following year.
In Figure 3.1, the longer coloured blocks represent the features of the training set for each
model, and the gap immediately following is the label. The small block that follows is the testing set. A
model is a classifier with a set of hyper-parameters. Models with the same hyper-parameters in each of
the time blocks belong to the same model group. The same model parameters are used in training and
testing different splits of the data over time. Since it is important to have a well-trained, generalizable
model that is useful for predicting events outside the dataset, we pick as the best-performing model the
one that performs best over time and is also stable across different train/test splits. Using this same setup,
we also do back-testing to confirm how valid the models are at different points in time.
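The temporal splitting described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `temporal_splits`, the dictionary keys, the January 1 cut-offs and the `data_start_year` default are all assumptions made for the example. Every window is a half-open interval `[start, end)`.

```python
from datetime import date


def temporal_splits(first_test_year, last_test_year, label_years=1, data_start_year=2004):
    """Build one train/test split per test year (hypothetical helper).

    Training features use data from the start of the records up to the
    training cut-off; training labels come from the window right after it.
    The test set repeats the same layout, shifted forward in time.
    """
    splits = []
    start = date(data_start_year, 1, 1)
    for test_year in range(first_test_year, last_test_year + 1):
        test_cutoff = date(test_year, 1, 1)
        train_cutoff = date(test_year - label_years, 1, 1)
        splits.append({
            "train_features": (start, train_cutoff),
            "train_labels": (train_cutoff, test_cutoff),
            "test_features": (start, test_cutoff),
            "test_labels": (test_cutoff, date(test_year + label_years, 1, 1)),
        })
    return splits


for split in temporal_splits(2012, 2014):
    print(split["train_labels"], split["test_labels"])
```

Because the training label window always ends where the test label window begins, no information from the test period can leak into training.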
Figure 3.1: Training models on different blocks of data with the same parameter set lets us pick models that are stable over time. The x-axis represents time.
3.2 Metrics
There are several possible metrics that we can use to evaluate our models. First, let us define some
terms. True positives (TP) are the individuals designated by the model as being part of the class of
interest and are actually part of the predicted class. False positives (FP) are all those that are predicted
as part of the class, but are not actually part of the class of interest. Similarly true negatives (TN) and
false negatives (FN) are those that are correctly and incorrectly labeled as part of the negative class by
the classifier, respectively.
3.2.1 Accuracy
The accuracy score is the fraction of the predictions which are correct. The indicator function returns 1
when the predicted value is equal to the true value. Summed and divided by the number of samples, it
gives the fraction of the predictions which are correct.
\[ \text{Accuracy} = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} \mathbb{1}\big(y_{\text{pred},i} = y_{\text{true},i}\big) \]
3.2.2 Precision
Precision evaluates the efficiency of a model: out of the instances labeled as the class of interest, how many
of them are correct. Essentially, it indicates how much trust can be placed in the model’s predictions.
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3.1} \]
3.2.3 Recall
Recall evaluates the coverage of the model: out of all the instances that actually belong to the class of
interest, how many did the model label correctly.
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3.3} \]
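As a small worked example of the three metrics defined so far, the sketch below counts TP, FP, TN and FN for a toy set of binary labels and applies the formulas directly. The function names and the toy data are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, with 1 as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: 4 true positives in the population, the model flags 3 entities.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

Here TP = 2, FP = 1, TN = 3 and FN = 2, so accuracy is 5/8, precision is 2/3 and recall is 1/2.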
3.2.4 Precision-Recall Curves
Often we use precision-recall curves to understand the trade-off between the two metrics. Optimizing for
precision generally means that recall will drop. To read the precision-recall curve, one must pick a threshold on
the x-axis, and decide what to balance. In some cases, it makes sense to have better recall, in others, it
may make more sense to optimize on precision. For example, if a school district only has the resources to
intervene on 150 students a year, it makes sense that they would try to get the best model performance
in the top riskiest 150 students. This will become more evident in our examples with the early warning
systems that we will implement in the next few sections.
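When intervention capacity is fixed, as in the 150-student example, the relevant point on the curve is precision and recall among the k highest-scored entities. A minimal sketch, assuming risk scores where higher means riskier and binary labels (the function name and toy data are illustrative):

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall when only the k highest-scored entities are flagged."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:k]
    true_positives = sum(label for _, label in flagged)
    total_positives = sum(labels)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]
for k in (1, 3, 6):
    print(k, precision_recall_at_k(scores, labels, k))
```

Increasing k raises recall but typically lowers precision, which is the trade-off that the curve summarizes.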
Figure 3.2: The precision-recall curve for the SVM classifier we trained in the Supervised Learning Methods overview.
3.3 Criteria to Consider during Model Selection
For selecting the model, in addition to the precision/recall performance, we also look for the following
attributes:
• Performance stability across time in precision/recall (a model that performed exceptionally well
in 2015 but did not perform well in 2016 is less favorable than a model that performed reasonably
well in both 2015 and 2016).
• The model produces stable classifications. That is, the model produces (nearly) the same
classification of entities if run twice on the same data.
• The model differentiates the two populations (those at high risk vs those not) in a more or less
clear way.
• The top features from the selected model distinguish entities between the two classes.
• The model does not flag entities simply because more data is available for them.
A selected model does not need to be the best model in each of these categories, but it should perform well in
all of them. It is important to note that in a deployment setting, the model is intended to be temporary:
a new model should be selected from time to time to ensure it continues to perform as well as possible.
The exact refresh rate depends on how often data are collected and how quickly patterns change.
Part I
Early Warning System for At-Risk
Youth
Chapter 4
Background: Identifying At-Risk
Youth
In this study, we use and evaluate a range of classifiers in order to build a predictive early warning
system. In principle, an Early Warning System (EWS) takes historical data and learns patterns that
are correlated with labeled adverse outcomes in the future. The EWS then scores entities for future
dates and assigns them a score that is representative of their risk of having an adverse incident in the
future. Many recent studies define and deploy early warning systems for a range of problems: from
identifying students at-risk of failing an undergraduate course [12] to predicting future dengue outbreaks
[2]. Many companies utilize EWS as part of their business process, but the application to social good
problems is a relatively recent development.
The prediction task that we are interested in for this study is: Identifying students at risk of
interacting with the criminal justice system.
In developing the EWS, we have to define the prediction task and clarify the assumptions that we
are making about the data and the existing relationships. In this case study, we walk through the
development of an EWS for this purpose; from problem formulation to result validation.
4.1 Motivation
Historically, the juvenile-justice system was meant to rehabilitate delinquent youth to become productive
citizens. However, research shows that students, especially inner city youth, have trouble reintegrating
back into society once they have had a significant interaction with the juvenile justice system. Teenagers
who interact with the system are likely to experience significant negative life outcomes such as a decreased
likelihood of high school graduation [1], an increased likelihood of committing crimes in early adulthood
[4], and a significantly higher mortality rate [17].
The county that we are concerned with is afflicted by both low graduation rates and high rates of
juvenile crime. While juvenile arrest rates have been steadily decreasing nationally, arrest rates in the
county increased by 163% between 2011 and 2015, the last year recorded. Additionally,
while the state has a high school graduation rate of 88%, the county had a graduation rate of only 58%
in 2015. In response, the police department has commissioned several task forces focused on reducing
juvenile crime and the school system has designed broad interventions that aim to increase the county’s
graduation rate. It is clear that students are performing poorly at the high school level and also high
school age juveniles are interacting with the criminal justice system at higher rates. Previously, many
researchers have built prediction modelling systems that predict student academic performance, both at
the course level and more generally [7], and other work has been done on predicting recidivism for youth
in the criminal justice system [10]. A lot of work in the education field also exists on how to build early
warning systems for students at risk of not passing courses administered through a web-based learning
system. However, to our knowledge, no work has been done to combine datasets from the educational and
justice systems to tackle such a problem.
4.1.1 Problem Formulation
Our aim is two-fold. First, we want to provide the school district with a risk-score for current students
that provides insight into a student’s risk of interacting with the criminal justice system in the next three
years. It is important to note that the list of students and their associated risk scores is generated to
allow the school system to match students with various support programs to ensure they stay in school,
graduate on time, and avoid the criminal justice system. This ties into the current community-based
crime prevention methods that are already in place in the county. Second, we want to understand what
features are most predictive of high or low risk scores. We used juvenile and adult criminal justice data
through a DataShare platform, as well as data from the school district. Many studies suggest that poor
school performance and early truancy lead to juvenile delinquency [11], but prior to this work,
education records and criminal justice records had not been combined to build predictive models of
delinquency.
We framed the problem as a binary classification problem to predict which students will have an
interaction with the criminal justice system in the next three years. We find that the students who are
assigned a high risk score (in the top decile) by our system are four to five times more likely to have
an interaction with the criminal justice system in the future than those with lower scores (bottom 9
deciles). In addition, unlike the existing system that assigns a binary (at risk or not) flag to students,
our model allows the school to use the risk scores to prioritize students for appropriate interventions.
4.1.2 Current Interventions for At-Risk Youth
The school system currently employs three tiers of interventions for at-risk youth. Tier 1 consists of
school-level interventions such as regular assemblies reminding students of behavioural expectations. Tier
2 consists of targeted interventions to support students who are not responding to Tier 1. An example
of a Tier 2 level intervention is the Check-In/Check-Out (CICO) program: a student checks in briefly
each morning and afternoon with a designated school staff member who determines whether the student
is ready for class and, if required, whether the student will remain with them for further assistance and
guidance. Tier 3 interventions are intense and personalized; they are intended for students not responding
to Tier 2 interventions. There are no set criteria for being selected into a Tier 3 intervention, but when
considering the prediction problem, this is one level at which we can impact students’ well-being. One
example is the RENEW program, a structured school-to-career transition planning and individualized
wrap-around process for youth with emotional and behavioural challenges. Due to resource constraints,
the number of students receiving Tier 3 interventions can be no more than one to five percent of the
total student population.
To identify at-risk youth, the school district evaluates student attendance, behaviour, and curricular
performance (the ABCs). If a student is flagged as at risk in two of these three categories, they are
recommended to a Tier 2 intervention. Whether a student is flagged depends on their age and the
severity of the problem. For instance, the flags for behaviour are as follows:
• For students in kindergarten through grade 8, one Office Discipline Referral (ODR) in the past
20 school days or one out-of-school suspension in the past 90 days;
• For students in grades 9 through 12, three ODRs in the past 20 school days, or two out-of-school
suspensions in the past 90 school days.
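The behaviour flag and the two-of-three ABC rule can be written as a short threshold check. This is a sketch under assumptions, not the district's actual logic: the function and argument names are hypothetical, kindergarten is encoded as grade 0, and each threshold is read as "at least" the stated count.

```python
def behaviour_flag(grade, odrs_last_20_school_days, suspensions_last_90_days):
    """Rule-based behaviour flag; grade 0 stands for kindergarten (assumption)."""
    if 0 <= grade <= 8:
        return odrs_last_20_school_days >= 1 or suspensions_last_90_days >= 1
    if 9 <= grade <= 12:
        return odrs_last_20_school_days >= 3 or suspensions_last_90_days >= 2
    raise ValueError("grade must be between 0 (kindergarten) and 12")


def recommend_tier2(attendance_flag, behaviour_flag_value, curricular_flag):
    """A student flagged in at least two of the three ABC categories is recommended."""
    return sum([attendance_flag, behaviour_flag_value, curricular_flag]) >= 2


flag = behaviour_flag(grade=10, odrs_last_20_school_days=2, suspensions_last_90_days=1)
print(flag, recommend_tier2(True, flag, False))
```

A rule of this kind returns only a yes/no answer, which is why it cannot rank the flagged students against each other.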
Once a student is flagged, the school’s Building Intervention Team considers additional data such as
the nature of the ODR, credits, grades, attendance, teacher input, work samples, observation, etc., to
determine whether a student should receive an intervention, and if so, at which tier.
This system currently flags 22,000 students without any prioritization or ranking. Currently, the
school district has the capacity to intervene with 5,000 students every year. Thus, the current system
makes it untenable to match students effectively with the available interventions. A machine learning
approach to this problem gives a prioritized list of students and risk scores.
Chapter 5
Data
5.1 Education Data
The school data includes information on demographics, attendance, discipline, assessment, and school
programs for students enrolled between 2004 and 2015. Demographic data covers race, gender, birth
date, mailing address and school name per student identified by a unique student key.
Attendance data includes daily attendance records for each student, with a row representing a day
that a student was in attendance at their school. There are approximately 127 million records covering
179,780 students by unique student key.
Discipline data includes date and nature (e.g., classroom disruption, weapons related) of the disci-
plinary event. The file contains over 100,000 records which are recorded at the event-level and represent
97,000 students.
Assessment data includes descriptions of all tests taken (e.g., date taken, subject) as well as students’
scores. There are more than 5 million records representing 194,415 students. This includes repeated
standardized testing such as Measures of Academic Progress (MAP) which are administered multiple
times a year for students from kindergarten through high school, as well as college admissions tests such
as the Scholastic Aptitude Test (SAT) which are administered once per student.
Finally, school programs records include information on the type (e.g., HeadStart, Special Education)
and the dates the students were enrolled in these programs.
5.2 Criminal Justice Data
Data from the district attorney’s office covers all juvenile and adult interactions with the criminal justice
system from 2009 to 2015 where the case was referred to the DA’s Office. Once probable cause for criminal
behavior is identified by law enforcement, a juvenile can be assigned to an informal diversion, advised
and released, transported to a homeless shelter/detox service or referred to a psychiatric crisis team. If
the juvenile is arrested and booked, they are eventually ordered to the DA’s office.
After a charging decision is made by the DA, the office prepares the case and it proceeds to court.
After a bond hearing and a preliminary hearing the plea negotiation process is initiated or the case
proceeds to trial. If found guilty, the juvenile might be put on probation, end up in a juvenile detention
centre or pay a fine. The DA’s office serves the county and therefore covers a wider range of people
than the citywide school district. The criminal justice data represents 50,000 individuals. It contains
information such as the name of the defendant, as well as demographic variables such as date of birth,
gender and race. The dataset also contains information on the severity of the offense separated into
felony, misdemeanor, and forfeiture.
Dataset            Number of Records   Number of Unique Individuals
Demographic        1.5 million         300,000
Attendance         127 million         179,780
Discipline         100,000             97,000
Assessment         5 million           194,415
Criminal Justice   154,198             50,020
Table 5.1: A summary of the different datasets received and used in this analysis
5.3 Data Challenges
In this modeling framework, we want to create individual-level trajectories that capture a student’s
academic profile and their relationship with the criminal justice system. In tackling this problem, we
had some issues in our datasets.
· There is no source table for all the unique individuals in either of the two datasets.
There are over 1.5 million demographic records for more than 300,000 students enrolled in the school
district during the data collection period. Ideally, a new record is generated every time any of the fields
change.
However, there are multiple records per student, with newly generated unique identifiers referring
to the same individual. One reason for multiple records per student is that the school district has a highly
mobile population with many students changing schools and home addresses from year to year. There are
also 100,000 more students present in the demographic dataset than are present in the other datasets.
In consultation with the school district, we noted that while we identify students by unique student
keys, in some cases, when a student leaves the school district and re-enters at a later time, they will be
registered as a new student with a new student key. This was an important consideration in the entity
resolution stage as there are not actually 300,000 unique students represented in the educational data.
Additionally, when a juvenile enters the criminal justice system, and then re-interacts with it at a
later period in time, they may not be entered under the same identifying information.
· The criminal justice dataset only includes information on serious offences.
From arrest to when a juvenile enters into the records at the DA’s office, there are multiple endpoints
at which the juvenile can exit the system. For example, they can be released to community service or
if it is a municipal case (i.e. not a misdemeanor or a felony) they can be ordered to civil court and
released. This means that only serious crimes are represented in the data that we have.
· Data entry errors in static variables like race and gender. In the educational dataset,
demographic details were standardized at the student level. For example, ‘Black or African American’
or ‘African-Am’ are used to refer to African American students. Such discrepancies were identified and
normalized. Since new demographic records are generated often, there were many students who have
multiple different values for their race or gender due to data entry errors. We standardized these records
by taking the last non-null value for each field for every student and propagating it back over time.
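The two cleaning steps just described, normalizing label variants and then propagating the last non-null value, can be sketched in a few lines. The field names (`student_key`, `date`, `race`), the alias table and the ISO date strings are assumptions for illustration; the actual schema is not given in the text.

```python
# Hypothetical alias table built from the two variants quoted in the text.
RACE_ALIASES = {
    "black or african american": "African American",
    "african-am": "African American",
}


def normalize_race(value):
    """Map known spelling variants of a race label to one canonical form."""
    if value is None:
        return None
    cleaned = value.strip()
    return RACE_ALIASES.get(cleaned.lower(), cleaned)


def standardize_static_field(records, field):
    """Give every record of a student the last non-null value of `field`."""
    latest = {}
    for rec in sorted(records, key=lambda r: r["date"]):  # ISO dates sort correctly
        if rec.get(field) is not None:
            latest[rec["student_key"]] = rec[field]
    return [dict(rec, **{field: latest.get(rec["student_key"])}) for rec in records]


records = [
    {"student_key": 1, "date": "2010-09-01", "race": "White"},
    {"student_key": 1, "date": "2012-09-01", "race": "African-Am"},
    {"student_key": 1, "date": "2013-09-01", "race": None},
    {"student_key": 2, "date": "2011-09-01", "race": None},
]
cleaned = [dict(r, race=normalize_race(r["race"])) for r in records]
standardized = standardize_static_field(cleaned, "race")
```

In this toy input, all three records for student 1 end up with the value from the 2012 record, and student 2 stays null because no value was ever recorded.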
5.4 Matching the Datasets
In order to identify and link unique individuals within the educational data, we matched within the
datasets and created IDs for each person. This resolves the issue discussed above about not having a
table of all unique individuals captured in our dataset. We assumed that individuals having the same
first name, last name, and date of birth were the same person. However, simply matching on these fields
across the two data sets yielded no matches due to variations in formatting. Additionally, names are
captured in two different formats between the two datasets. For the criminal justice dataset, there is only a
single name field, for instance “Smith, P. Jones”. In the school dataset, there are separate fields for first,
last and middle names. Additionally, in both datasets, the same individual may be booked multiple
times (criminal) or re-enrolled multiple times (school) leading to variation in how the name may be
entered each time. For example, a name might be misspelled, only the first part of a hyphenated name
might be included, or an apostrophe might be used one time and replaced by a space the second time.
In order to improve the matching rate, we cleaned the first and last name fields to make them
more uniform by removing middle initials, whitespace, commas, quotation marks, hyphens and suffixes.
Based on input from the school district and the DA’s office, we expected an 80% match rate between
the two datasets. After this initial cleaning, we only achieved a 25% match rate. Recognizing that there
might be some variation due to spelling errors or the use of nicknames, we computed the Jaro-Winkler
distance for the first name and last name fields.
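A minimal sketch of the name-cleaning step follows. The suffix list, the treatment of periods and apostrophes, and the rule that a single-letter token after the first token is a middle initial are all assumptions; the text lists the categories of characters removed but not the exact rules.

```python
import re

# Hypothetical suffix list; the text does not enumerate the suffixes removed.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def clean_name(name):
    """Lower-case a name field and strip punctuation, suffixes, initials and spaces."""
    # Replace periods, commas, quotation marks, apostrophes and hyphens by spaces.
    text = re.sub("[.,\"'-]", " ", name.lower())
    tokens = [tok for tok in text.split() if tok not in SUFFIXES]
    # Treat single-letter tokens after the first token as middle initials.
    kept = [tok for i, tok in enumerate(tokens) if i == 0 or len(tok) > 1]
    return "".join(kept)


print(clean_name("O'Brien"), clean_name("O Brien"), clean_name("Smith-Jones, Jr."))
```

With these rules, a name entered once with an apostrophe and once with a space cleans to the same string, which is the kind of variation the text describes.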
The Jaro distance for two strings (s1, s2) with a non-zero number of matching characters l, and t
transpositions is given by:
\[ d_j = \frac{1}{3}\left( \frac{l}{|s_1|} + \frac{l}{|s_2|} + \frac{l - t}{l} \right) \tag{5.1} \]
• |s1| is the length of the string, for example, “THE” has length 3.
• l is the number of matching characters between two strings. Between “BART” and “BARE”, the
number of matching characters is 3.
• t gives the number of transpositions, that is, half the number of matching characters that
appear in a different order in the two strings.
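The Jaro-Winkler distance [22] for two strings is given by:
\[ d_w = d_j + k\,p\,(1 - d_j) \tag{5.2} \]
• dj is the Jaro distance between the two strings
• k is the length of the prefix common to both strings (capped at 4 in Winkler's standard formulation)
• p is a constant scaling factor that sets how much weight the common prefix receives (commonly 0.1)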
This means that the Jaro-Winkler distance assigns a more favorable score to strings that match more
at the beginning.
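Equations (5.1) and (5.2) can be implemented directly. The sketch below follows the standard definitions, with the usual matching window of half the longer string length minus one, p = 0.1 and a prefix cap of 4; these defaults are assumptions, since the text does not state which values were used.

```python
def jaro(s1, s2):
    """Jaro value of Eq. (5.1): 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    # t is half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler value of Eq. (5.2) with prefix length k capped at max_prefix."""
    dj = jaro(s1, s2)
    k = 0
    for a, b in zip(s1, s2):
        if a != b or k == max_prefix:
            break
        k += 1
    return dj + k * p * (1 - dj)


for a, b in [("grey", "gray"), ("musgrave", "musgraves")]:
    print(a, b, round(jaro(a, b), 4), round(jaro_winkler(a, b), 4))
```

Under these assumptions, the plain Jaro values for the two name pairs in Table 5.2 come out at about 0.8333 and 0.9630, which agree with the tabulated values to within rounding. The prefix-adjusted values are slightly higher, and all of them clear the 0.8 threshold used in the matching rule.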
If all three identifying fields (first name, last name, and birth date) match exactly, we consider the
record to belong to the same individual. If one or zero of the fields match exactly, we do not consider the
records as belonging to the same individual. If two of the three fields match exactly, then we consider
the records belonging to the same individual whenever:
• both names match, the birth dates share the same year and otherwise differ by a single character
• one of the two name fields matches and the birth dates match, and the Jaro-Winkler distance [22] between
the mismatched names is at least 0.8.
First Name   Last Name   Date of Birth   Jaro-Winkler Distance
Reginald     Grey        2004-08-03      0.8333
Reginald     Gray        2004-08-03
Khabaugh     Musgrave    1993-10-22      0.9629
Khabaugh     Musgraves   1993-10-22
Table 5.2: Example of the Computation of Jaro-Winkler Distances for two Entities
Two examples of individuals considered the same using the Jaro-Winkler Distance rule are illustrated
in Table 5.2.
Lastly, noting that there might be some birth dates that were entered incorrectly, we allowed for
some fuzziness. With an exact match on first name, last name, and the year of birth, we allowed up to
a 1 digit difference in the month and day. For example: 2010-02-04 and 2010-03-04 is a match but
2004-11-09 and 2004-11-22 is not considered a match.
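The birth-date tolerance can be expressed as a character-level comparison. A minimal sketch, assuming dates are stored as `YYYY-MM-DD` strings (the function name is illustrative):

```python
def birth_dates_match(d1, d2):
    """True when the years agree and at most one character of month/day differs."""
    if len(d1) != 10 or len(d2) != 10:
        return False
    if d1[:4] != d2[:4]:
        return False
    differing = sum(a != b for a, b in zip(d1[4:], d2[4:]))
    return differing <= 1


print(birth_dates_match("2010-02-04", "2010-03-04"))
print(birth_dates_match("2004-11-09", "2004-11-22"))
```

The two calls reproduce the examples in the text: the first pair differs in one digit and is accepted, while the second differs in two digits and is rejected.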
The criminal justice data contained one row per case per charge. If an individual is charged with
multiple charges for the same incident, this will be reflected in multiple rows with the exact same
information but with different charges. We want to identify individuals within the data set and match
case number to a unique generated Person ID. Starting with 96,066 rows in the juvenile data, we identified
15,451 distinct cases by DA Case Numbers. After applying the matching logic above, we identified 9,451
unique individuals and assigned them a Person ID which was then appended to the original dataset.
After identifying unique individuals within the criminal justice data, these individuals were matched
to the school data. We again applied the same logic as above. We successfully linked 86% of individuals
with a DA record to the education data. Since it is possible that individuals who have a criminal
record did not attend schools in the county (e.g. out-of-state offenders), we believe that 86% is a
reasonable match rate. Future work includes using more sophisticated machine learning based record
linkage approaches to improve the matching process.
Chapter 6
Methods
More than 70 features were generated and they covered the whole educational dataset. One of the
features used was the number of days that students were enrolled in the last year. The histogram for
the feature is shown in Fig. 6.1. It shows that most students were enrolled for the majority of the school
year.
Figure 6.1: Histogram of the number of days students were enrolled in the year 2011.
The labels were created using the criminal justice dataset. For a calendar year, if an individual had
an interaction with the criminal justice system, they were assigned a positive label class.
As described earlier, we formulated our problem as predicting whether a currently enrolled student
is at risk of interacting with the criminal justice system in the next 3 years. We implemented the
following models using scikit-learn [15] and a variety of hyperparameters: Random Forests (RF), Logistic
Regression (LR), Support Vector Machines (SVM), and Decision Trees (DT). All of these methods were
previously detailed in the Introduction.
We used all the different classifier options while testing a range of model hyper-parameters; however,
here we present the parameters of the top performing classifiers (Table 6.1). For example, for logistic
regression, we test a range of regularization parameters. Recall that previously we defined the ridge
and lasso regularization parameters. L1 corresponds to LASSO and L2 corresponds to Ridge regression.
The C value determines the strength of the regularization. The smaller the value, the stronger the
regularization, similar to SVMs. For RF, number of estimators is the number of trees to include in the
forest, max depth is the maximum depth of each tree, max features is the maximum number of features
to sample from at each split, and the minimum samples at split indicates the minimum number of samples
required to branch off an internal node.
Table 6.1: Grid Search parameters for model selection from the top performing models
Models and Hyperparameters
Logistic Regression
    C: 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10
    Penalty: L1, L2
Random Forest Classifier
    Number of Estimators: 1, 10, 100, 1000, 10000
    Max Depth: 1, 5, 10, 20, 50, 100
    Max Features: Square root, log2
    Minimum Samples at Split: 2, 5, 10
Decision Tree Classifier
    Criterion: gini, entropy
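To give a sense of the size of the search, the grid in Table 6.1 can be expanded into individual configurations with the standard library. The dictionary layout and the helper `expand_grid` are illustrative assumptions; the parameter keys follow scikit-learn's names for these estimators.

```python
from itertools import product

# Values transcribed from Table 6.1; keys use scikit-learn parameter names.
GRID = {
    "LogisticRegression": {
        "C": [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
        "penalty": ["l1", "l2"],
    },
    "RandomForestClassifier": {
        "n_estimators": [1, 10, 100, 1000, 10000],
        "max_depth": [1, 5, 10, 20, 50, 100],
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 5, 10],
    },
    "DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
    },
}


def expand_grid(grid):
    """Return a list of (model_name, params) pairs, one per grid point."""
    configs = []
    for model_name, params in grid.items():
        names = list(params)
        for values in product(*(params[name] for name in names)):
            configs.append((model_name, dict(zip(names, values))))
    return configs


configs = expand_grid(GRID)
print(len(configs))
```

Each configuration corresponds to one model group in the sense of Chapter 3: the same hyper-parameters are trained and tested on every temporal split.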
6.1 Model Evaluation
We validated our models using temporal validation by creating training and test sets that are temporally
disjoint. For example, if we are predicting an interaction with the criminal justice system in the years
2010-2012, the models are trained on all the data up to the end of 2009 and then the model predicts a
risk score for all students as of the beginning of 2010 that provides their risk of having a criminal justice
interaction from 2010 to 2012.
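The labelling side of this set-up can be sketched as follows. The function `build_labels`, its arguments and the toy data are hypothetical; the sketch assumes an as-of date of January 1 and a three-year label window treated as the half-open interval from the as-of date up to the same day three years later.

```python
from datetime import date


def build_labels(student_ids, interactions, as_of, horizon_years=3):
    """Label a student 1 if they have any interaction inside the label window.

    interactions: iterable of (student_id, interaction_date) pairs.
    """
    window_end = date(as_of.year + horizon_years, as_of.month, as_of.day)
    positives = {sid for sid, when in interactions if as_of <= when < window_end}
    return {sid: int(sid in positives) for sid in student_ids}


interactions = [
    (1, date(2011, 5, 1)),   # inside the 2010-2012 window
    (2, date(2009, 3, 1)),   # before the as-of date: may be a feature, not a label
    (3, date(2013, 1, 1)),   # after the window ends
]
labels = build_labels([1, 2, 3, 4], interactions, as_of=date(2010, 1, 1))
print(labels)
```

Events before the as-of date are available to the features, while events inside the window only ever appear as labels, which keeps the training and test information temporally disjoint.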
Chapter 7
Results
7.1 Model Performance
We evaluate the model performance based on two criteria:
1. Precision in the top 1%: We want the model to be as accurate as possible in the top 1% of
the predictions since that is the intervention capacity of the school system. The school has the
resources to administer Tier 3 interventions to no more than 1 to 5% of the school population.
Focusing on the 1% threshold allows us to better match students with the limited intervention
resources available to the school district.
2. Stability of that performance over time: We want a model that is stable in terms of Precision at
1% over time so it can be used consistently without risking drastic performance changes.
To achieve those two goals, we selected the 50 best performing models based on precision at 1%. We
then selected models that are consistently among the top 50 across each time period. We found that
Random Forests with the following hyperparameters performed the best based on these two criteria:
• n_estimators = 200
• max_depth = 10
• min_samples_split = 5
• max_features = 0.33
• criterion = entropy
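The two-step selection that led to this model (rank by precision at 1% in each time period, then keep only models that are always in the top 50) amounts to a set intersection. A minimal sketch with hypothetical model identifiers and scores:

```python
def stable_top_models(precision_by_period, top_n=50):
    """Return the models ranked in the top_n by precision at 1% in every period.

    precision_by_period: {period: {model_id: precision_at_1_percent}}.
    """
    stable = None
    for scores in precision_by_period.values():
        top = set(sorted(scores, key=scores.get, reverse=True)[:top_n])
        stable = top if stable is None else stable & top
    return stable or set()


# Toy scores for three hypothetical model groups over two test periods.
precision_by_period = {
    2012: {"rf_a": 0.30, "rf_b": 0.28, "lr_a": 0.10},
    2013: {"rf_a": 0.29, "rf_b": 0.12, "lr_a": 0.20},
}
print(stable_top_models(precision_by_period, top_n=2))
```

In the toy data only `rf_a` is in the top two in both periods, so it is the only model that survives the stability filter.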
The precision-recall curves for this model are shown in Fig. 7.2. At 1% of the population, the
precision is 0.3 and the recall is about 0.1. This is extremely encouraging: taking the top 1% of
the model predictions allows us to identify 10% of all the at-risk students at 30% precision. This is
significantly (more than 10 times) higher than a random baseline which would get 2.8% precision (there
are 300,000 students and only 9,500 juvenile offenders). In comparison to the school’s baseline (as
documented in Table 7.1), we correctly identify more students who actually have an adverse incident
while flagging fewer students. This reduces the load on the schools and makes them more efficient in
their interventions.
Figure 7.1: Performance of different classifiers trained on the same datasets. Panels: (a) Random Forest with 1K trees; (b) Random Forest with 10K trees; (c) Logistic Regression with minimal regularization; (d) Logistic Regression with maximum regularization. Predicting for the next three years, we have a very low base rate, as the number of students actually interacting with the criminal justice system is very low. This is evident in the plots with the high recall curve. The model performs very well on the top 0.1% which is our population of interest.
Figure 7.2: The precision-recall curve for the best-performing model.
7.2 Analysis of Results
In this section, we take the best performing model and show some diagnostics we performed to understand
and validate the model further.
7.2.1 Risk Scores
Figure 7.3 is a log-plot of the risk scores generated by the best model selected. This shows that there
are very few students with high risk and many students with low risk. This is promising for a risk score
that will be used for intervention targeting, as it suggests that the risk score is well calibrated.
Figure 7.3: Risk Score Distribution of the riskiest 1000 students
7.2.2 Evaluating the predictions by score decile
Figure 7.3 shows a decile plot that compares the actual number of positive labels in each decile versus
the predicted number. A well-performing model will have both values as close to each other as possible
in every decile and the number of predicted positive labels should go down as the risk score goes down.
As we can see from the graph, that is the case for our best performing model which gives us confidence
in the risk scores.
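The decile comparison described above can be computed as in the sketch below. It assumes that the risk scores are probabilities, so that the predicted number of positives in a decile is the sum of the scores in it; the text does not state how the predicted count was obtained, and the function name and toy data are illustrative.

```python
def decile_table(scores, labels):
    """Return (decile, size, predicted_positives, actual_positives) per score decile.

    Decile 1 holds the highest scores. Predicted positives are the sum of the
    scores in the decile (assumes scores are probabilities).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rows = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        predicted = sum(score for score, _ in chunk)
        actual = sum(label for _, label in chunk)
        rows.append((d + 1, len(chunk), predicted, actual))
    return rows


# Toy data: 20 entities, the 4 highest-scored ones are the true positives.
scores = [i / 20 for i in range(20)]
labels = [1 if i >= 16 else 0 for i in range(20)]
for row in decile_table(scores, labels):
    print(row)
```

For a well-behaved model, the actual count should fall from the first decile to the last, and the predicted and actual counts should be close within each decile.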
                   Flags    Correctly Identifies
Heuristic Method   22000    1310
Our Model          12000    1630
Table 7.1: Comparing the baseline method to the best performing model, we note that precision increases from 6% for the current system to 14% for the best model.
7.2.3 Comparison to current School-Based Approach
As our goal was to help the school target interventions for the relevant students, we compared our
model results to the method the schools use to flag students who need intervention. Students are flagged
as “generally at risk” using a rule-based method based on the number of suspensions and office discipline
referrals as well as their current grade level. We implemented the Tier 2 flagging rule as the baseline and
calculated the performance in terms of precision, recall, and percent of students that it flags. The
current system flags 22,000 students, and 1300 of those flagged actually have an interaction with the
criminal justice system (Table 7.1; precision of 5.9%). Compared to this, our model can identify the same
number of at-risk students while only flagging 33% as many students. If we allow our model to flag as
many students as the current method, we can identify 46% more students who will go on to interact
with the criminal justice system. This shows the effectiveness of our system compared to the current
methods being used in the school system today.
The features that are most important in the best performing model are:
1 Number of “Child In Need of Protective Services” (CHIPS) records
2 Age
3 Number of discipline incidents in last 2 years
4 Average absence days over the years
The number of CHIPS records is generated from the DA data set. A record is created if a child is abused
or neglected by their parent and the case was logged with the DA. This feature consistently shows up as
one of the top features in our best performing model. It is important to note that this is not necessarily
a causal relationship. It is possible that the number of CHIPS records is correlated with other attributes
of a juvenile and shows up as highly predictive for that reason. Age is also very predictive compared to other
features. This makes intuitive sense as a 15-year-old is generally more likely to commit an offense than
an 8-year-old. Number of discipline incidents in the last 2 years and average absence days over the years
are also among the top features, which is consistent with the findings in the literature [13] that
absenteeism and truancy are often causes of delinquency. Interestingly, common demographic
features such as gender and race are noticeably absent from the top features. This is often the case, since
behavioural attributes tend to be more predictive than demographics, although the two are highly correlated in
practice. To further investigate whether the number of CHIPS records is masking the contribution
of other demographic variables, we examine the cross-tabs of number of CHIPS records and race. The
result is presented in Table 7.2. Comparing the racial make-up of students with at least one CHIPS
record to that of students with none, we find that African American students make up a higher fraction, and
Hispanic students a lower fraction, of students with at least one record. Together, the results suggest that
African American students are more likely, Hispanic students are less likely, and White students are neither
more nor less likely to have CHIPS records.
No. CHIPS records                 0                1-10            11-20          21-30         31-40   41-50
African-American                  79840 (53.11%)   1185 (65.65%)   246 (68.14%)   78 (69.64%)   41      31
American Indian or Alaska Native  487 (0.32%)      13 (1.11%)      4 (1.05%)      0             0       0
Asian                             7953 (5.29%)     13 (0.72%)      5 (1.39%)      0             0       0
Hispanic                          33770 (22.46%)   287 (15.90%)    58 (16.07%)    3 (2.68%)     8       1
Native American                   1062 (0.71%)     17 (0.94%)      1 (0.28%)      18 (16.07%)   1       2
White                             25276 (16.81%)   280 (15.51%)    40 (11.08%)    12 (10.71%)   5       6
Other                             1943 (1.29%)     10 (0.55%)      7 (1.94%)      1 (0.89%)     0       0
*The figure within the parentheses denotes the fraction of overall cases. The table uses all data up to the year 2013.
Table 7.2: Number of CHIPS records by race
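A cross-tabulation of this kind can be sketched in a few lines. The example below is illustrative only: the function name `crosstab_column_fractions`, the group labels and the toy records are hypothetical, and it assumes that the percentages are shares within each CHIPS-count column, which is what the first three columns of Table 7.2 appear to report.

```python
from collections import Counter


def crosstab_column_fractions(records):
    """records: list of (race, chips_bin) pairs, one per student.

    Returns {(race, chips_bin): (count, share of that chips_bin column)}.
    """
    cell_counts = Counter(records)
    column_totals = Counter(chips_bin for _, chips_bin in records)
    return {
        (race, chips_bin): (count, count / column_totals[chips_bin])
        for (race, chips_bin), count in cell_counts.items()
    }


# Hypothetical toy records, not the study data.
records = (
    [("A", "0")] * 6 + [("B", "0")] * 4
    + [("A", "1-10")] * 3 + [("B", "1-10")] * 1
)
table = crosstab_column_fractions(records)
for (race, chips_bin), (count, share) in sorted(table.items()):
    print(f"{race:>2} | {chips_bin:>5} | {count} ({share:.2%})")
```

In this toy data, group A makes up 60% of students with no record but 75% of students with 1-10 records, which is the kind of shift the paragraph above reads off the table.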
Chapter 8
Discussion
8.1 Future Work
The existing system only predicts interaction with the juvenile criminal justice system. A natural next
step is to expand the label set to include adult interactions as well. We would also like to broaden
the definition of interaction by incorporating arrest data. For example, it was reported that there were
approximately 16000 arrests of juveniles in 2012, but based on the DA case data we only have information
about 1923 incidents in 2012. Currently, we are only able to predict severe offenses; by including arrest
data we can focus on models that would predict any interaction at all with the criminal justice system.
Many juveniles are cited and released into the custody of their parents for minor offenses, and
currently our labels do not capture this kind of interaction. Another extension of this work would be
to re-frame the problem as a multi-class prediction problem and predict classes of offense by severity. It
would be interesting to investigate whether features have different predictive power in predicting certain
classes of offenses.
Another area of future work is to generate more features using other data sets such as health and
family data. This would allow us to incorporate other likely predictive factors. For instance, the health
dataset contains information on students’ blood lead levels and vaccination status.
The premise of building such a system is that we assume there exist interventions that are effective
at reducing the risk of students having an interaction with the juvenile justice system. Our machine
learning system can then identify students who should be matched with those interventions in order
to improve their outcomes. A critical future endeavor is to 1) validate that assumption and determine
whether existing interventions are in fact effective at reducing the risk, especially for high-risk students,
and 2) determine which students are not responding to existing interventions and work with experts
to create new interventions. Having a system that can accurately assess future risk allows effective
evaluation of existing interventions and supports the development of new ones, thus improving outcomes
that we care about.
8.2 Conclusions
In this work, we show that using school records, we can accurately identify students who are at risk
of future juvenile criminal justice interactions. Experiments on historical data show that our model
performs significantly better than the existing early warning system being used at the school district. If
we allow our model to flag as many students as the current school method, we can identify 46% more
students who will go on to interact with the criminal justice system. To the best of our knowledge this
work represents the first data-driven approach to address the problem of juvenile delinquency using both
school and criminal justice data.
Part II
Early Warning System for Public
Safety
Chapter 9
Refining EWS for Public Safety
9.1 Background
Unlike the EWS for Criminal Justice, the EWS for public safety is an ongoing project with an established
methodology that was developed over the course of partnerships with several public safety agencies. A
public safety agency is one that is tasked with keeping members of the public secure, and thus its
employees have high numbers of interactions with members of the public. Sometimes those interactions
can go awry, as witnessed by recent media coverage. Here, we present the results of building iterative,
complex models with thousands of features in partnership with a public safety agency. This is the first
time that such complex features have been used at this granularity in building an EWS for employees
of a public safety agency. The classifier methods used in this project are the same as those described
previously.
Most public safety agencies already have a behavioural Early Intervention System (EIS) in place.
The goal of an EIS is to proactively identify employees who display patterns of problematic performance
or who show signs of job and personal stress in order to intervene and support them with training or
counseling. When an EIS alert is raised, that alert should indicate that the employee is at high risk
of having an adverse interaction in the near future. The current state-of-the-art EIS in many public
safety agencies is threshold-based: it issues an alert for any employee who reaches a predefined number of
events in a given timeframe, such as three complaints in a six-month period. In contrast, the data-driven
EWS takes all available data and detects patterns that precede adverse incidents with greater accuracy,
making the EWS predictive and enabling prioritized, preventative interventions. Also, by computing
the risk score of an employee over their career, it is possible to establish a risk profile that changes in
time and improves in accuracy with more data.
9.2 Data
We received data at the employee-level on many different types of interactions with members of the
public (e.g., use of force reports, compliments, complaints).
Using all of the datasets, we made count features for each employee, such as the number of interactions
in the past 1 month, 2 months, 2 years, etc. Due to the temporal nature of the problem, we created
daily, individual-level profiles as training samples. For example, employee 123 has some activity every
day, be it a verbal or physical interaction. A training sample in the dataset is not one employee but
an employee profile on a given day, so employee 123 has many copies in the training dataset, one for each
day that they are active. This means that our training samples are highly correlated, but algorithms like random
forests and decision trees can handle a high-dimensional, highly correlated feature space. The label was
determined from Internal Affairs (IA) data as any complaint against an employee that was sustained by the
review board in the year following the end of the training time period. If the end of the training dataset is
January 2014, then a label for the employee is positive if they have an IA complaint sustained against
them between January 2014 and January 2015. Additionally, we included demographic features,
such as age and gender, as predictors.
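The feature and label construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's pipeline: the function names, the window lengths in days (30, 60 and 730 as stand-ins for 1 month, 2 months and 2 years), the half-open window boundaries and the toy dates are all hypothetical.

```python
from datetime import date, timedelta


def count_features(event_dates, as_of, windows_days=(30, 60, 730)):
    """Count events in each trailing window [as_of - w days, as_of)."""
    return {
        f"events_last_{w}d": sum(
            1 for d in event_dates if as_of - timedelta(days=w) <= d < as_of
        )
        for w in windows_days
    }


def label(sustained_dates, train_end, horizon_days=365):
    """1 if any sustained complaint falls in [train_end, train_end + horizon)."""
    horizon_end = train_end + timedelta(days=horizon_days)
    return int(any(train_end <= d < horizon_end for d in sustained_dates))


# Hypothetical profile for one employee on one day (end of training: January 2014).
train_end = date(2014, 1, 1)
events = [date(2013, 12, 20), date(2013, 11, 15), date(2012, 6, 1)]
features = count_features(events, as_of=train_end)
positive = label([date(2014, 8, 3)], train_end)
negative = label([date(2015, 3, 1)], train_end)
print(features, positive, negative)
```

Only events strictly before the profile date feed the features, and only complaints on or after it feed the label, which keeps information from the prediction window out of the training sample.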
9.3 Results
We conducted a historical analysis on the current EIS and compared it to the ability of the data-driven
EWS at that point in time. We found that 19% of the alerts between 2009 and 2016 correctly identified an
employee who went on to have a sustained, unjustified or preventable internal affairs complaint in the
year following the alert. Then, using our approach, we built a data-driven prototype. The EWS ranks
employees by risk score, in contrast to a binary EIS, which only flags an employee as “at risk” or “not
at risk.” Between January 2014 and July 2016, 24% of the EWS's flags would have been correct.
In contrast, only 17% of the alerts raised by the threshold-based EIS in the same time frame were correct. The
EWS is more efficient: although it would have raised fewer alerts, its alerts are 40% more likely to go to
employees who end up having an adverse incident with a member of the public.
We talked previously about model selection. In this case, we look at the models' precision over time
(Fig. 9.1). Note that a precision of 20% in the top 50 employees means that, of the list of 50, 20% (10 employees) were
correctly classified in the high-risk category.
Figure 9.1: This plot shows precision in the top 50 highest-risk employees for the public safety agency for different models over the years.
Similarly, looking at the models’ recall over time (Fig. 9.2), we note that models with a coverage of
30% capture 30% of all the possible adverse incidents in the whole dataset.
Figure 9.2: This plot shows recall in the top 50 highest-risk employees for the public safety agency for different models over the years.
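Precision and recall in the top k of a ranked list, the two metrics plotted in Figures 9.1 and 9.2, can be computed as in the sketch below. The function name `precision_recall_at_k` and the toy scores are hypothetical; in the study k is 50 and the computation is repeated for each model and year.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall when the k highest-scoring employees are flagged."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    true_positives = sum(label for _, label in top_k)
    total_positives = sum(labels)
    precision = true_positives / len(top_k)
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical toy data: 10 employees, 3 of whom go on to have an adverse incident.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p_at_5, r_at_5 = precision_recall_at_k(scores, labels, k=5)
print(f"precision@5 = {p_at_5:.2f}, recall@5 = {r_at_5:.2f}")
```

Here two of the five flagged employees are true positives, so precision is 2/5, and those two are two of the three positives overall, so recall is 2/3.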
We also want to see that our top-performing models perform relatively similarly. This is evident in
Fig. 9.3.
Based on this model performance, we pick the best model and generate risk scores for
each employee for the next year. Note that previously we looked at the risk score distribution in the
at-risk youth example. Here, we are concerned with a risk score that separates the two classes well
(Fig. 9.4). The model tends to score employees who go on to have adverse incidents higher than
employees who will not have an adverse incident. A drawback of this model is visible, though, in
the overlapping regions between red and blue around a score value of 0.2 in this instance.
This indicates a lack of separability between the two populations, which would be resolved with more
granular data.
9.4 Comparison to Current Approach
Currently, the public safety agency has a manual, intensive process to handle alerts. The EIS issued
alerts for 45 employees in January 2014, 8 of whom went on to have an adverse incident in the next 12
months, and missed 193 officers who went on to have an adverse incident (18% precision, 4% recall). In
contrast, the EWS has slightly better performance for the same time period and is much less
resource-intensive.
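The precision and recall quoted for the EIS follow directly from the three counts given above, as this short check shows (variable names are illustrative):

```python
flagged = 45          # employees the EIS alerted on in January 2014
true_positives = 8    # flagged employees who went on to have an adverse incident
missed = 193          # employees with an adverse incident who were not flagged

precision = true_positives / flagged
recall = true_positives / (true_positives + missed)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints: precision = 17.8%, recall = 4.0%
```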
9.5 Future Work
In contrast to the EWS for at-risk youth, we are replacing an existing method that does a similar
job. There is a ready baseline for comparison and we can show the lift that our system is able to get
for the agency. Since we use black-box models which are not easily interpretable, we cannot make
causal claims about which predictors are responsible for increases or decreases in risk scores. However,
since all interventions in this case are non-punitive, it makes sense to optimize prediction accuracy
instead of interpretability. Some work has also been done on trying to interpret prediction scores and
defining important features using tools like LIME, which provides local, interpretable, model-agnostic
explanations for each result [20].
Figure 9.3: This plot shows the precision-recall curves for the top 100 performing models (out of 10,000), showing that they have relatively similar performance.
The work done in this analysis adds to a corpus of work that is the first of its kind with respect
to this domain. A start-up is taking the technology and modeling methods that were developed and
refined through the work presented here and using them on top of software that will serve as a record
management tool for public safety agencies. The methods refined here will be integrated seamlessly into
their databases and become the industry standard for how public safety agencies think about managing
risk for their employees.
From a modeling standpoint, it would be interesting to try more advanced modeling techniques from
deep learning such as transfer learning. Transfer learning is a general term for using a model that was
trained for one task as a starting point for another task. For example, if an agency wants to
implement an EWS but does not have enough data to build reliable predictive models, they can use one
of the base models trained from another agency as a starting point [23]. Since we have data from many
different agencies, the base model could even be trained across a combined dataset.
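The idea of starting from another agency's model can be illustrated with a deliberately simple stand-in. The sketch below is not a deep learning method and not something implemented in this work: it uses a hand-written logistic regression, trains it on a hypothetical data-rich "source" agency, and then continues training from those weights on a small "target" agency. All names and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, X, y):
    """Average logistic loss of a linear model on (X, y)."""
    eps = 1e-12
    total = 0.0
    for row, target in zip(X, y):
        p = sigmoid(sum(w * x for w, x in zip(weights, row)))
        total -= target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps)
    return total / len(X)


def fit(X, y, init_weights, lr=0.1, steps=200):
    """Batch gradient descent for logistic regression, starting from init_weights."""
    weights = list(init_weights)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, target in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, row))) - target
            for j, x in enumerate(row):
                grad[j] += error * x / len(X)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical toy data; each row is [bias, feature_1, feature_2].
X_source = [[1, 0.1, 0.2], [1, 0.9, 0.8], [1, 0.2, 0.1],
            [1, 0.8, 0.9], [1, 0.3, 0.3], [1, 0.7, 0.6]]
y_source = [0, 1, 0, 1, 0, 1]
X_target = [[1, 0.2, 0.4], [1, 0.9, 0.5]]
y_target = [0, 1]

# Base model from the agency with ample data, then a short fine-tune on the small agency.
base = fit(X_source, y_source, init_weights=[0.0, 0.0, 0.0])
tuned = fit(X_target, y_target, init_weights=base, steps=20)
print(log_loss(base, X_target, y_target), log_loss(tuned, X_target, y_target))
```

The only transfer step is `init_weights=base`: the target agency's model starts from what was learned elsewhere rather than from zero, and a few gradient steps adapt it to the local data.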
Figure 9.4: Separation of Risk Score by Class. Blue is employees with an adverse incident, red is employees without an adverse incident.
Bibliography
[1] Anna Aizer and Joseph J. Doyle Jr. Juvenile incarceration, human capital and future crime: Evidence
from randomly-assigned judges. The Quarterly Journal of Economics, February 2015.
[2] J. Albinati, W. Meira, Jr, and G. Lobo Pappa. An Accurate Gaussian Process-Based Early Warning
System for Dengue Fever. ArXiv e-prints, August 2016.
[3] Anuradha and G. Gupta. A self explanatory review of decision tree classifiers. pages 1–7, May 2014.
[4] J. Bernburg and M. Krohn. Labeling, life chances, and adult crime: The direct and indirect effects
of official intervention in adolescence on crime in early adulthood. Criminology, 41:1287–1316, 2003.
[5] Léon Bottou and Chih-Jen Lin. Support vector machine solvers, 2006.
[6] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.
[7] Dragan Gasevic, Shane Dawson, Tim Rogers, and Danijela Gasevic. Learning analytics should not
promote one size fits all: The effects of instructional conditions in predicting academic success.
28:68–84, 01 2016.
[8] Aurélien Géron. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools,
and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.
[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
Springer, 2001.
[10] Carter Hay, Alex O. Widdowson, Meg Bates, Michael T. Baglivio, Katherine Jackowski, and
Mark A. Greenwald. Predicting recidivism among released juvenile offenders in Florida. Youth
Violence and Juvenile Justice, 0(0):1541204016660161, 0.
[11] Paul J Hirschfield and Joseph Gasper. The relationship between school engagement and delinquency
in late childhood and early adolescence. Journal of Youth and Adolescence, 40(1):3–22, 2011.
[12] E. Howard, M. Meehan, and A. Parnell. Contrasting Prediction Methods for Early Warning Systems
at Undergraduate Level. ArXiv e-prints, December 2016.
[13] B. Jacob and L. Lefgren. Are idle hands the devil’s workshop? incapacitation, concentration, and
juvenile crime. American Economic Review, 93(5):1560–1577, 2003.
[14] M. Lichman. UCI machine learning repository: Abalone dataset, 2013.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research,
12:2825–2830, 2011.
[16] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986.
[17] R. Ramchand, A. Morral, and K. Becker. Seven-year life outcomes of adolescent offenders in Los
Angeles. American Journal of Public Health, 99(5):863–870, 2003.
[18] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society, Series B, 58:267–288, 1994.
[19] J. L. Tonry. An Early Warning System for Asteroid Impact. Publications of the Astronomical
Society of the Pacific, 123:58, 2011.
[20] M. Tulio Ribeiro, S. Singh, and C. Guestrin. “Why Should I Trust You?”: Explaining the Predictions
of Any Classifier. ArXiv e-prints, February 2016.
[21] Ilhan Uysal and H Altay Guvenir. Instance-based regression by partitioning feature projections.
Appl. Intell, 2004.
[22] W. E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model
of record linkage. Proc. Sec. Survey Res. Meth., pages 354–359, 1990.
[23] J. Zhang, W. Li, and P. Ogunbona. Transfer Learning for Cross-Dataset Recognition: A Survey.
ArXiv e-prints, May 2017.
[24] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society, Series B, 67:301–320, 2005.