
Comparing Prediction Methods for Early Warning Systems

by

Hareem Naveed

A thesis submitted in conformity with the requirements for the degree of Master of Science

Department of Mathematics, University of Toronto

© Copyright 2018 by Hareem Naveed

Abstract

Comparing Prediction Methods for Early Warning Systems

Hareem Naveed

Master of Science

Department of Mathematics

University of Toronto

2018

In this study, we investigate the use of prediction modeling to build an early warning system to identify

students who are at risk of interacting with the criminal justice system in the future. First, we review

algorithms for supervised learning and formulate the problem in a precise modeling framework. In the

data cleaning phase, we match records between the two datasets, achieving an 86% match rate. We then apply different

supervised learning methods and identify the best model for our problem. Using detailed variables,

temporal cross validation and our final prediction method of Random Forests, we achieved a precision

of 0.3 at 1% of the student population. This greatly outperforms the current threshold-based system

that flags a larger percentage of the student body while correctly identifying fewer at-risk students. We

also describe the results of a similar approach to developing an early warning system for public safety.


Acknowledgements

The helpful revisions and constructive feedback provided by Professor Adam Stinchcombe shaped this

thesis into its final form. It would not have been possible without his support and guidance.

I am grateful for funding from the Eric and Wendy Schmidt Foundation through the Data Science for

Social Good 2016 Fellowship at the University of Chicago. I am thankful for my teammates and mentors

during the fellowship who contributed to this work. Additionally, this thesis would not be possible

without the infrastructure, project management and technical support provided by my colleagues at the

Center for Data Science and Public Policy. A special thank you to Rayid Ghani and Adolfo De Unanue

for their constant support, advice and the opportunity to work on interesting projects with meaningful

social impact.


Contents

Acknowledgements

Table of Contents

1 Introduction

2 Supervised Learning Methods
2.1 Machine Learning Systems
2.2 Linear Regression
2.3 Logistic Regression
2.4 Support Vector Machines (SVM)
2.5 Decision Trees
2.6 Random Forest Classifier

3 Model Selection
3.1 Cross Validation
3.2 Metrics
3.3 Criteria to Consider during Model Selection

I Early Warning System for At-Risk Youth

4 Background: Identifying At-Risk Youth
4.1 Motivation

5 Data
5.1 Education Data
5.2 Criminal Justice Data
5.3 Data Challenges
5.4 Matching the Datasets

6 Methods
6.1 Model Evaluation

7 Results
7.1 Model Performance
7.2 Analysis of Results

8 Discussion
8.1 Future Work
8.2 Conclusions

II Early Warning System for Public Safety

9 Refining EWS for Public Safety
9.1 Background
9.2 Data
9.3 Results
9.4 Comparison to Current Approach
9.5 Future Work

Bibliography

Chapter 1

Introduction

Given a set of inputs, a predictive model predicts an outcome. The availability of large datasets and

improved computing capabilities, particularly cloud-based computing, have led to the development and

rapid adoption of predictive modeling in many fields. Traditional statistical modeling techniques are

structured around better understanding the data generating process. In this context, the predictive

model is considered a by-product and not the main goal. These models require strict assumptions and

are not always applicable where data-generating processes are too complicated. In recent years, with

great advances being made in the field of machine learning, there has been a switch in modeling approach

to prioritize prediction accuracy.

An Early Warning System (EWS) takes as input data about entities and outputs scores about the

likelihood of some event happening for that entity in the future. This is a classification task with the

goal being to develop a predictive model that, based on data about past behavior of the entity, is able to

separate the entities into classes of interest. For example, an EWS that predicts student performance

in a course will use static predictors such as student age in conjunction with dynamic predictors such

as attendance and assessment scores to determine whether the student will pass or fail the course [12].

There is a temporal component to the predictions of an EWS; they are "early" and thus are actionable. For example, in the EWS for student performance, the predictions have to be generated at some point before the course is over, to allow for effective intervention. The temporal component is a tunable parameter that must be optimized for in model selection (e.g., is the EWS better at predicting a student failing halfway through the semester, or at the start of the semester?).

Machine learning-based EWSs have previously been developed for many applications and have been

used to predict asteroid impact [19], seismic events in coal mines, dengue outbreaks [2], and student

performance in an undergraduate course [12].

The thesis is structured as follows: in the preamble, I describe modeling methods in general, outline

metrics for model evaluation, and techniques for temporal cross-validation. In the first section, I describe

the development of an EWS used to identify students at risk of interacting with the criminal justice

system as juveniles. In the second section, I describe the replication and expansion of an EWS for public

safety agencies deployed in a new agency.


Chapter 2

Supervised Learning Methods

2.1 Machine Learning Systems

Machine learning systems can be categorized along two axes [8]:

1. If the system is trained with labeled data, it is a supervised learning system. If no labels are

available, it is known as an unsupervised learning system.

2. If the system detects patterns and builds models, it uses model-based learning. If it compares new

data points to known data points, it uses instance-based learning.

These categories can be combined. For example, an email spam filter can be built using a linear

regression model, a supervised learning method trained on emails labeled as spam and not-spam. This

makes it a model-based, supervised learning system. An Early Warning System (EWS) is a model-based,

supervised learning system.

In this section, we first introduce the mathematical structure and notation of the prediction problem.

Then, we present in detail some algorithmic approaches to solving the prediction problem in an EWS.

2.1.1 Problem

The set of examples used by a predictive model to learn patterns is called the training dataset. It has the form $\{(x_1, y_1), \dots, (x_n, y_n)\}$, and each instance in it is a sample. A single observation is a pair $(x, y)$ where $y$ is the response variable, and $x$ is a $k$-length vector of predictors for one entity. Assume that $y$ is a realization of the random variable $Y$, and $x$ is a realization of the random variable $X$. Given some input $x$, we want to predict the expected value of $Y$, which we assume depends on $X$. This is given by:

$$E(Y \mid X = x) = h(x) \tag{2.1}$$

If the response variable is quantitative, $h$ is a regression model defined by the function $h : \mathbb{R}^k \to \mathbb{R}$. If the response variable is categorical, with $K$ possible classes, $h$ is a classification model defined by the function $h : \mathbb{R}^k \to \mathcal{G}$ with $\mathcal{G} = \{G_1, \dots, G_K\}$. In the case of an EWS, we want a classification model that will have the form:

$$h : \mathbb{R}^k \to \{G_1, G_2\} \tag{2.2}$$


Table 2.1: Information about attributes in the Abalone dataset.

Name            Data Type    Unit  Feature Name
Sex             Categorical  -     sex
Length          Continuous   mm    length
Diameter        Continuous   mm    diameter
Height          Continuous   mm    height
Whole Weight    Continuous   g     weight_w
Shucked Weight  Continuous   g     weight_s
Viscera Weight  Continuous   g     weight_v
Shell Weight    Continuous   g     weight_sh
Rings           Integer      -     rings

Figure 2.1: A histogram for every numerical feature in the Abalone dataset

We are trying to predict whether or not an entity will have some defined event in the future. Let G1 be

the case where the entity has an event in the future, and G2 be the case where the entity does not. In

the algorithmic approach to prediction modeling that is adopted by machine learning, we try to fit the

best possible function for the problem at hand.

2.1.2 Sample Dataset

To ground our discussion of the different methods, we will implement them using the Abalone dataset [14]. The dataset has 9 fields (see Table 2.1) and has been used as a benchmark for testing new methods [21] as it is relatively simple. For context, abalones are a type of marine snail. The age of an abalone in years is given by the number of rings on the shell + 1.5. In order to count the rings, the shell has to be stained and analyzed under a microscope [14]. Thus, the age of an abalone is difficult to measure directly, and we will try to predict the age using other physical characteristics with different supervised, model-based

learning methods. Age is a continuous variable, and we will use it as a label for detailing regression

methods. For the supervised classification methods, we will classify the snail as either male or female

based on other physical attributes.

A supervised learning algorithm analyzes the training data and uses the inferred function to map

new examples from the feature space to the label space. The feature space is the p-dimensional space

where the variables of interest live. From the abalone dataset, if we use all attributes as predictors for

the response variable age, the feature space is $\mathbb{R}^8$. Histograms of the seven numerical features are given in Fig. 2.1. The categorical feature is gender. A sample from the dataset is a vector with values for all 8 attributes for an abalone, and the corresponding age.

2.1.3 Algorithmic Paradigm

We start with a training set of data and want to learn something about the structure of our data. Characteristics about the entities represented in the data are captured by predictors. We make hypotheses about the data structure and its relationships, and these are captured by parameters that we are trying to learn [8].

In this section, we review the algorithms behind five supervised learning methods and the techniques used to train models: linear regression, logistic regression, support vector machines, decision trees, and random forests.

For each of these models, we have the following:

$$\hat{y} = h_\theta(X) \tag{2.3}$$

• $\hat{y}$ is the predicted value. If the chosen model and parameters correctly capture the structure of the data, then $\hat{y} = y$.

• $h_\theta$ is the function we hypothesize will best model our data; it has the parameters $\theta$.

• $\theta$ is a vector of the model's parameters.

2.2 Linear Regression

If h(X) is a linear function, then the model is linear regression. A linear regression makes a prediction

by computing a weighted sum of the inputs. The training task in the model is learning the weights for

each predictor. The prediction has the form:

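$$h(X) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T X \tag{2.4}$$

1. $\theta_j$ is the $j$-th model parameter, with $\theta_1$ to $\theta_n$ the feature weights, and $\theta_0$ being the bias term (so that $X$ carries a leading entry $x_0 = 1$). $\theta = (\theta_0, \theta_1, \dots, \theta_n)^T \in \mathbb{R}^{n+1}$.

2. $n$ is the number of predictors.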

Training a model means that we propose parameters that best model the data. Generally, for any

model, we try to minimize a cost function to get the best fit. To measure the fit of a linear model with

m samples, we use the Mean Squared Error (MSE) as our cost function:

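$$\mathrm{MSE}(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^T x_i - y_i\right)^2 \tag{2.5}$$

MSE captures the difference between the estimated and the true value. We want a model that has parameters $\theta$ that will minimize MSE. Computing the partial derivative of the MSE function gives:

$$\frac{\partial}{\partial \theta_j}\mathrm{MSE}(\theta) = \frac{2}{m}\sum_{i=1}^{m}\left(\theta^T x_i - y_i\right)x_{ij}$$

Converting this to vector notation:

$$\nabla_\theta \mathrm{MSE}(\theta) = \begin{bmatrix} \frac{\partial}{\partial \theta_0}\mathrm{MSE}(\theta) \\ \frac{\partial}{\partial \theta_1}\mathrm{MSE}(\theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n}\mathrm{MSE}(\theta) \end{bmatrix} = \frac{2}{m}X^T (X\theta - y)$$

Setting the partial derivatives equal to 0, and simplifying, we get:

$$\hat{\theta} = (X^T X)^{-1} X^T y$$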
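As a minimal sketch of the closed-form solution, the following pure-Python function solves the normal equation for a model with one predictor and a bias term, where $X^T X$ is a 2x2 matrix that can be inverted by hand. The function name `fit_normal_equation` and the toy data are illustrative assumptions and are not taken from the abalone experiments.

```python
def fit_normal_equation(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y for one predictor plus a bias term.

    With rows of X equal to (1, x_i), X^T X is the 2x2 matrix
    [[m, sum x], [sum x, sum x^2]] and X^T y is (sum y, sum x*y).
    """
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = m * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; the predictor is constant")

    # Apply the explicit inverse of the 2x2 matrix to X^T y.
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


# Hypothetical toy data generated from y = 1 + 2x (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_normal_equation(xs, ys)
```

On noise-free data the recovered bias and weight match the generating line; with more predictors the same formula applies, but the inverse is normally left to a linear algebra library.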

In the abalone example, let us fit a linear regression model with just one predictor (Fig. 2.2), the shell weight, to predict the age. Our model has the form:

$$\hat{y}_{\text{age}} = \theta_0 + \theta_1 x_{\text{shell weight}} \tag{2.6}$$

Making the assumption that the more rings a shell has, the heavier it will be, we test to see if that is the case.

Figure 2.2: A linear regression model (red line) which tries to use the shell weight to predict age. The score of this model is 0.4, meaning that it explains 40% of the variability in the data.

This is a simple enough example with few training samples, and it is possible to invert the matrix.

But if there are more than 100,000 predictors, it becomes very slow to invert a non-sparse matrix. The

method used to train more complex linear regression models is gradient descent. Gradient descent is an


iterative optimization algorithm that is used to find the minimum of a function. For a function $f$ that is defined and differentiable at a point $a_n$, it follows that if $a_{n+1} = a_n - \eta \nabla f(a_n)$ for a small enough step size $\eta > 0$, then $f(a_{n+1}) \le f(a_n)$. Intuitively, if we move in the negative direction of the gradient, we will move towards a smaller value. In the case of linear regression, we use gradient descent to find the minimum of the MSE. In this case, the cost function is quadratic and there is only one minimum (Fig. 2.3). Regardless of what the initial values are for the gradient descent, we will arrive at the optimal solution, provided the learning rate is small enough. Specific to our parameters, the gradient descent step is defined by:

$$\theta' = \theta - \eta \nabla_\theta \mathrm{MSE}(\theta)$$

$\eta$ is the learning rate, with values between 0 and 1.
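The update rule can be sketched in a few lines of pure Python for a model with one predictor and a bias term. The learning rate, step count, starting points, and toy data below are assumptions chosen only so that the iteration converges; they are not values used in this thesis.

```python
def mse_gradient(theta, xs, ys):
    """Gradient of MSE for the model y_hat = theta[0] + theta[1] * x."""
    m = len(xs)
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        residual = theta[0] + theta[1] * x - y
        g0 += residual          # x_{i0} = 1 for the bias term
        g1 += residual * x
    return (2.0 / m) * g0, (2.0 / m) * g1


def gradient_descent(xs, ys, eta=0.05, steps=5000, start=(0.0, 0.0)):
    """Repeat theta' = theta - eta * grad MSE(theta) for a fixed number of steps."""
    theta = list(start)
    for _ in range(steps):
        g0, g1 = mse_gradient(theta, xs, ys)
        theta[0] -= eta * g0
        theta[1] -= eta * g1
    return tuple(theta)


# Hypothetical toy data generated from y = 1 + 2x (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta_a = gradient_descent(xs, ys, start=(0.0, 0.0))
theta_b = gradient_descent(xs, ys, start=(10.0, -10.0))
```

Both starting points end at essentially the same parameters, which illustrates the single minimum of the quadratic cost. A learning rate that is too large for the data would make the same loop diverge.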

Figure 2.3: The cost function for a linear regression model. The colored dots on the plot represent different steps in the gradient descent.

For a more complex linear regression example, we use all the features from the dataset to predict

age (Fig. 2.4). This gives us a score of 0.48 which is slightly better than our original model (Fig. 2.2).

Note that linear regression makes the assumption that the data is linear, and so performance is affected when the data has a different underlying structure.

Figure 2.4: Comparing the predicted value for age with the true value. The model includes all predictors and has a score of 0.48.

2.2.1 Regularization: Modifying Cost Functions

In selecting models, we want to be careful that they are not over-fitting to the data. Over-fitting means that they perform really well on the training dataset and do not generalize to new samples. Above, the two models we tested on the test set only have scores of 0.4 and 0.48. They do not generalize well to new data (we posited above that this may be because the data is not linear, but for the purpose of demonstration, let us ignore that fact). In linear regression, regularization is done by constraining the weights of the predictors. This means that if one predictor is particularly oversampled, or has large values in the training dataset, we try to constrain it. There are three common ways to do this. One way is LASSO (Least Absolute Shrinkage and Selection Operator), which penalizes MSE with an $\ell_1$ norm [18]. The cost function for LASSO regression is given by:

$$\mathrm{Cost}(\theta) = \mathrm{MSE}(\theta) + \alpha\sum_{i=1}^{n}|\theta_i| \tag{2.7}$$

In Ridge Regression, the cost function is modified by adding a squared $\ell_2$ norm:

$$\mathrm{Cost}(\theta) = \mathrm{MSE}(\theta) + \alpha\,\frac{1}{2}\sum_{i=1}^{n}\theta_i^2 \tag{2.8}$$

The $\ell_1$ norm eliminates the weights of the least important features, and performs automatic feature selection. It is preferred when there is an understanding that only a few predictors are important for the prediction problem. The $\ell_2$ norm keeps predictor weights as small as possible, but does not eliminate them. The third cost function is elastic net [24] and it is a mix of both LASSO and ridge:

$$\mathrm{Cost}(\theta) = \mathrm{MSE}(\theta) + \alpha\,\frac{1 - r}{2}\sum_{i=1}^{n}\theta_i^2 + r\alpha\sum_{i=1}^{n}|\theta_i| \tag{2.9}$$

When $r = 0$, elastic net is equivalent to ridge regression, and when $r = 1$, elastic net is equivalent to LASSO. Using the above regularization techniques, our linear model from Fig. 2.4 scores as follows for different values of $\alpha$:

Table 2.2: Scores of the regularized linear regression from Fig. 2.4 with varying $\alpha$ values. Note that in ridge regression, a large $\alpha$ will set all model weights close to 0, and in LASSO regression, a large $\alpha$ will eliminate some features.

Alpha  LASSO Score  Ridge Score
0.1    0.29         0.50
0.5    0.19         0.52
1      -2.3         0.52
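To make the three penalized cost functions (2.7) to (2.9) concrete, the sketch below evaluates them for a fixed parameter vector. It only computes the costs and does not fit a model; the function names, the toy rows, and the parameter values are illustrative assumptions.

```python
def mse(theta, rows, ys):
    """Mean squared error; each row of `rows` starts with 1.0 for the bias."""
    m = len(rows)
    return sum(
        (sum(t * x for t, x in zip(theta, row)) - y) ** 2
        for row, y in zip(rows, ys)
    ) / m


def lasso_cost(theta, rows, ys, alpha):
    # The bias theta[0] is not penalized: the sums in (2.7)-(2.9) start at i = 1.
    return mse(theta, rows, ys) + alpha * sum(abs(t) for t in theta[1:])


def ridge_cost(theta, rows, ys, alpha):
    return mse(theta, rows, ys) + alpha * 0.5 * sum(t * t for t in theta[1:])


def elastic_net_cost(theta, rows, ys, alpha, r):
    l2 = sum(t * t for t in theta[1:])
    l1 = sum(abs(t) for t in theta[1:])
    return mse(theta, rows, ys) + alpha * (1 - r) / 2 * l2 + r * alpha * l1


# Hypothetical data: a bias column followed by two predictors.
rows = [[1.0, 0.5, 2.0], [1.0, 1.5, 1.0], [1.0, 2.5, 0.0]]
ys = [2.0, 3.0, 4.0]
theta = [1.0, 1.0, -0.5]
```

Evaluating `elastic_net_cost` with `r = 0` reproduces `ridge_cost`, and with `r = 1` it reproduces `lasso_cost`, which is the equivalence stated for the mixing ratio.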


2.3 Logistic Regression

Linear regression can also be used for classification with some modifications; logistic regression computes

a weighted sum of the predictors like linear regression, but outputs the logistic of that result.

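$$h(X) = \sigma(\theta^T X) \tag{2.10}$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$. Logistic regression is used for classification by imposing a threshold:

$$\hat{y} = \begin{cases} 0 & \text{if } \sigma(\theta^T X) < 0.5 \\ 1 & \text{if } \sigma(\theta^T X) \ge 0.5 \end{cases} \tag{2.11}$$

Instead of MSE, we use log-loss as the cost function for logistic regression as it allows us to penalize false classifications [9]. For a single instance, it is given by:

$$\mathrm{Cost}(\theta) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{2.12}$$

For a dataset, the cost function is given by:

$$\mathrm{Cost}(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \tag{2.13}$$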

This function can also be minimized using gradient descent as defined previously. Results for logistic

regression on the abalone dataset are presented at the conclusion of this section.
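The logistic function, the 0.5 threshold, and the log-loss over a dataset can be sketched as follows. This is a minimal illustration that evaluates the cost for given parameters rather than training them; the function names and the toy data are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, row):
    """h_theta(x) = sigma(theta^T x); `row` starts with 1.0 for the bias."""
    return sigmoid(sum(t * x for t, x in zip(theta, row)))


def predict_label(theta, row, threshold=0.5):
    return 1 if predict_proba(theta, row) >= threshold else 0


def log_loss(theta, rows, ys):
    """Average of -[y log(h) + (1 - y) log(1 - h)] over the dataset."""
    total = 0.0
    for row, y in zip(rows, ys):
        h = predict_proba(theta, row)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(rows)


# Hypothetical one-predictor data with a bias column.
rows = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 1]
good_theta = [0.0, 2.0]    # separates the classes correctly
bad_theta = [0.0, -2.0]    # predicts the opposite class everywhere
```

The cost is much lower for `good_theta` than for `bad_theta`, since confident wrong predictions contribute large negative logarithms.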

2.4 Support Vector Machines (SVM)

In classification, we want to find a linear decision boundary that clearly separates two classes. From

Fig. 2.5, note that the two classes in the dataset are clearly separated by a solid black line. The decision

boundary is supported by the instances that fall on the dotted-lines. They are called support vectors [8].

On either side, the SVM fits the widest possible margin (distance between the decision boundary and

the support vector) that separates the two classes. Adding examples off the support vectors will not

influence the decision boundary. If the data is clearly linearly separable, then it is possible to impose

hard-margin classification. This ensures that there are no margin violations - no data points are found

between the decision boundaries and the support vectors. This is the strictest condition.

Figure 2.5: Separating two classes using a linear SVM. Source: [8]

In SVM, we are trying to fit a linear decision boundary. Consider the classifier setup:

$$h_{w,b}(x) = w^T x + b \tag{2.14}$$

If we let our label class be $\{-1, 1\}$ then:

$$\hat{y} = \begin{cases} -1 & \text{if } w^T x + b < 0 \\ 1 & \text{if } w^T x + b \ge 0 \end{cases} \tag{2.15}$$

If $y = 1$ for some $x$, then $w^T x + b$ needs to be a large positive number (effectively $w^T x + b \ge 1$, not just $\ge 0$). Similarly if the label is negative, then that number needs to be a large negative number. To be confident in the prediction, a large functional margin is required [9]. The functional margin has the formula $\hat{\gamma}_i = y_i(w^T x_i + b)$, and gives an idea of whether or not a point is properly classified: it is positive exactly when the point lies on the correct side. The slope of the decision function is $\|w\|$. The function identifies a hyperplane, with intercept $b$ and the normal vector $w$ which is perpendicular to the hyperplane. To get a large margin, we want to minimize the weights given by $w$. If we want to avoid margin violations (the hard-margin condition), the decision function must be at least 1 for all positive training instances, and at most $-1$ for all negative training instances. Then, the objective function has the following form:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, m$$

Scaling $(w, b)$ by a positive constant does not change the prediction function, since the output only depends on the sign but not the magnitude of $w^T x + b$. The functional margin, however, does change under such scaling, so we impose a normalization condition and replace $(w, b)$ with $(w/\|w\|, b/\|w\|)$. The functional margin computed with this normalized pair is the geometric margin: the signed distance of a point from the decision boundary, which is positive when the point lies on the correct side. Given a training set it is important to find a decision boundary that maximizes the geometric margin as this gives us a confident set of predictions. This is the hard-margin classification problem.

We get the geometric margin of the training set as:

$$\gamma = \min_{i=1,\dots,m} \gamma^{(i)} \tag{2.16}$$

Assume that our training dataset is linearly separable, that is, that the positive and negative examples can be split using a separating hyperplane. To get the maximum geometric margin, we want to maximize $\gamma$ given that each training example has margin at least $\gamma$. The optimization problem takes the form:

$$\max_{w,b} \ \gamma \tag{2.17}$$

$$\text{s.t.} \quad y^{(i)}(w^T x_i + b) \ge \gamma, \quad i = 1, \dots, m \tag{2.18}$$

$$\|w\| = 1 \tag{2.19}$$

This is a non-convex problem because of the constraint $\|w\| = 1$. We can take away the constraint on $w$ by rescaling $(w, b)$ so that the functional margin $\hat{\gamma}$ equals 1; maximizing $\hat{\gamma}/\|w\| = 1/\|w\|$ is then the same as minimizing $\frac{1}{2}\|w\|^2$. This has a much nicer derivative (simply $w$) and together with the constraints otherwise identified can be expressed as a Quadratic Programming Problem. Many solvers exist for this kind of problem and it can be easily solved [5].
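The difference between the functional and the geometric margin can be checked numerically. The sketch below evaluates both for a fixed hyperplane on a hypothetical two-dimensional dataset; it does not solve the quadratic program, and all names and values are illustrative assumptions.

```python
import math


def functional_margin(w, b, x, y):
    """y * (w^T x + b) for a label y in {-1, +1}."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def geometric_margin(w, b, x, y):
    """Functional margin divided by ||w||: the signed distance to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm


def dataset_margin(w, b, points, labels):
    """Smallest geometric margin over the training set, as in (2.16)."""
    return min(geometric_margin(w, b, x, y) for x, y in zip(points, labels))


# Hypothetical linearly separable points and a separating hyperplane.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), -2.0

margin = dataset_margin(w, b, points, labels)
# Scaling (w, b) by 10 multiplies every functional margin by 10
# but leaves the geometric margin unchanged.
scaled_margin = dataset_margin((10.0, 10.0), -20.0, points, labels)
```

Because the geometric margin is invariant to rescaling, fixing the functional margin to 1 costs nothing, which is what allows the problem to be rewritten as the minimization of $\frac{1}{2}\|w\|^2$.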

Fitting a linear SVM classifier to our abalone dataset to test whether length and diameter can predict gender shows that the data is not linearly separable. From Fig. 2.6, we note that regardless of the parameter-tuning, the data does not separate linearly. The next step to consider here would be different types of kernels that allow for non-linearity in the dataset. In the usual soft-margin formulation, the lower the C parameter is, the greater the number of margin violations that are allowed to take place.

Figure 2.6: Trying to classify males and females using a linear SVM with different parameter settings.

2.5 Decision Trees

The use of decision tree classifiers has been proposed in many areas ranging from speech recognition to

remote sensing [3]. A decision tree built on the subset of the abalone data is illustrated in Fig. 2.7. In

mathematics, a tree is an undirected graph where any two vertices are connected by exactly one path. In

trees, the vertices are nodes and the edges are branches. A decision tree has three kinds of nodes; a root

node, an internal node, and a leaf node. A root node has no incoming edges and a leaf node has no

outgoing edges. In Fig. 2.7, the root node checks to see whether the length of the sample is less than

0.61 mm. At each internal node, a test is performed on the data, and the branch to the left is the one that passed the test.

Essentially, a decision tree partitions the $k$-dimensional space of predictors into $K$ hypercubes, $H_l$ for $l = 1, \dots, K$, and fits a very simple, (usually) constant model on each space. With $c_l$ as some constant and $\mathbb{1}$ the indicator function, a decision tree can then be represented as:

$$h(x) = \sum_{l=1}^{K} c_l \, \mathbb{1}(x \in H_l) \tag{2.20}$$

Every internal node partitions the instance space into two or more subspaces. This process continues recursively until the parts only contain samples from one class, and it terminates at the leaf nodes.

Decision trees can be used for classification and regression. In this section, we focus on decision tree

classifiers as that is the method we will employ for our EWS.

There are three main algorithms for decision tree classifiers: ID3 (Iterative Dichotomizer 3) [16], C4.5 and CART (Classification and Regression Tree). Each algorithm uses different splitting criteria, but all three have the same tree coverage approach [3].

2.5.1 Iterative Dichotomizer 3 (ID3)

ID3 was first proposed by Ross Quinlan [?], and it uses information gain to split at each node. ID3 splits data based on the homogeneity of a sample and uses entropy to calculate this homogeneity. A sample has entropy of 0 if it is totally homogeneous and an entropy of 1 if it is equally divided between two classes. To compute the entropy with one attribute, with $S$ being the original set, we use:

$$\mathrm{Entropy}(S) = \sum_{i=1}^{j} -p_i \log_2 p_i \tag{2.21}$$

where $p_i$ is the probability of getting a sample of class $i$ when randomly selecting from the set, and $j$ is the number of classes. We compute $p_i$ as $\frac{n_i}{|S|}$, where $n_i$ is the number of samples of class $i$.

Figure 2.7: A simple decision tree built using length and diameter to predict gender in Abalones.

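To compute the entropy using two attributes, that is, the entropy of the target $T$ after splitting on an attribute $X$, we use:

$$\mathrm{Entropy}(T, X) = \sum_{c \in X} P(c)\,E(c) \tag{2.22}$$

where $c$ ranges over the values of $X$, $P(c)$ is the fraction of samples with value $c$, and $E(c)$ is the entropy of that branch.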

The information gain is then computed based on the decrease in entropy after a dataset is split on an attribute. Decision tree construction depends on finding the attributes that give the most homogeneous branches, or return the highest information gain. First, the entropy of the target (label class) is calculated. The dataset is then split on all the different attributes and the entropy of each branch is calculated. The gain is the entropy before the split minus the resulting entropy after the split:

$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X) \tag{2.23}$$

A branch with entropy zero is a leaf node, whereas a branch with entropy greater than zero still needs further splitting. The ID3 algorithm runs recursively on all non-leaf branches until the tree is complete.
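The entropy and information gain computations used by ID3 can be sketched as below. The split entropy is computed as the sample-weighted average of the branch entropies, which is the usual convention; the function names and the toy attributes are illustrative assumptions and are not drawn from the abalone data.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = sum_i -p_i log2 p_i over the classes present in `labels`."""
    total = len(labels)
    return sum(
        -(n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def split_entropy(labels, attribute_values):
    """Entropy(T, X): entropy of each branch weighted by its share of samples."""
    total = len(labels)
    branches = {}
    for label, value in zip(labels, attribute_values):
        branches.setdefault(value, []).append(label)
    return sum(len(b) / total * entropy(b) for b in branches.values())


def information_gain(labels, attribute_values):
    """Gain(T, X) = Entropy(T) - Entropy(T, X)."""
    return entropy(labels) - split_entropy(labels, attribute_values)


# Hypothetical toy data: `size` separates the classes, `colour` does not.
sex = ["M", "M", "F", "F"]
size = ["large", "large", "small", "small"]
colour = ["dark", "light", "dark", "light"]
```

Here splitting on `size` yields pure branches and the maximum gain, while splitting on `colour` leaves each branch evenly mixed and yields no gain, so ID3 would choose `size` at this node.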

The main advantage of a decision tree is that it can be easily converted to a set of rules that maps

the data and gives the process for what makes each decision. ID3 is the simplest decision tree classifier

algorithm and it has a depth-first approach. The main drawbacks of the ID3 algorithm are that it is only built for categorical variables, and has low accuracy of classification on large datasets [3]. In contrast, its successor, the C4.5 algorithm, can handle numeric data but is also not very successful for large datasets

[3].


2.5.2 Classification and Regression Tree (CART)

The CART algorithm creates binary decision trees, which means that each non-leaf node only has two

children. In contrast, other methods can have more than two children per non-leaf node. We used CART

to classify the abalone dataset with length and diameter as predictors (Fig. 2.7). In CART, the training set is split into two, using one feature $f$ and some threshold $t_f$ associated with that feature. The cost function that the algorithm tries to minimize for a classification problem is given by:

$$\mathrm{Cost}(f, t_f) = \frac{n_{\text{left}}}{n}\,\text{Gini Impurity}_{\text{left}} + \frac{n_{\text{right}}}{n}\,\text{Gini Impurity}_{\text{right}} \tag{2.24}$$

• $n_{\text{left}}$, $n_{\text{right}}$ are the numbers of samples in the left and right node respectively, and $n$ is the number of samples in total.

• $\text{Gini Impurity}_j = 1 - \sum_{i} p_{j,i}^2$, where the sum runs over the classes and $p_{j,i}$ is the ratio of instances of class $i$ at the $j$-th node.

The process is iterative, and the function continues to split the training set recursively until only the leaf nodes remain. The regression cost function for CART is the same, except instead of Gini Impurity, the MSE must be minimized.
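A single CART split on one feature can be sketched as a search over candidate thresholds that minimizes the weighted Gini cost (2.24). Trying the midpoints between consecutive distinct values is one common choice and is an assumption here, as are the function names and the toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of the labels at one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_cost(values, labels, threshold):
    """Cost (2.24) of splitting one feature at `threshold` (left: value <= threshold)."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, cost)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(((t, split_cost(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Hypothetical lengths and labels, not taken from the abalone data.
lengths = [0.30, 0.35, 0.40, 0.60, 0.65, 0.70]
sexes = ["F", "F", "F", "M", "M", "M"]
threshold, cost = best_threshold(lengths, sexes)
```

A full CART implementation would repeat this search over every feature, keep the best pair of feature and threshold, and then recurse on the two resulting subsets.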

2.6 Random Forest Classifier

A random forest is a classifier made up of an ensemble of decision tree classifiers $h(\theta_k, x)$ where $\theta_k$ are i.i.d. random vectors and each tree classifier casts a vote for a label given the same input [6]. It is trained using the bootstrap aggregation method. In bootstrap aggregation, a diverse set of decision-tree based classifiers are fit by training them on random subsets of the training set where sampling is performed with replacement. The final prediction for a random forest comes from taking the majority vote across the different trees [6].

Random Forests can be developed as an extension of the bagging algorithm. The algorithm is simple

and is given by:

1. For $b = 1, \dots, n$, sample with replacement from the training set to get $X_b, Y_b$.

2. Train a decision tree $dt_b$ on $X_b, Y_b$ using the methods outlined above under the description of decision trees.

3. $\hat{y} = \frac{1}{n}\sum_{b=1}^{n} dt_b(x)$ gives the prediction for a new sample $x$.

Random Forests diverge from the bagging algorithm in that each tree also considers only a random subset of the predictors when choosing a split, rather than all of them.
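The bagging steps can be sketched with a very small base learner. In the code below a one-threshold "stump" stands in for a full decision tree, and the ensemble predicts by majority vote; this is plain bagging on a single feature, so the random selection of predictors that distinguishes a random forest is not shown. The seed, the ensemble size, and the toy data are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(xs, ys, rng):
    """Draw len(xs) training pairs with replacement (step 1)."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]


def train_stump(xs, ys):
    """Stand-in for a decision tree: one threshold on a single feature (step 2)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue
        l_label = Counter(left).most_common(1)[0][0]
        r_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_label, r_label)
    if best is None:  # the sample had a single distinct value
        label = Counter(ys).most_common(1)[0][0]
        return lambda x: label
    _, t, l_label, r_label = best
    return lambda x: l_label if x <= t else r_label


def bagging_predict(models, x):
    """Majority vote across the ensemble (step 3 for class labels)."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [0.30, 0.35, 0.40, 0.45, 0.60, 0.65, 0.70, 0.75]
ys = ["F", "F", "F", "F", "M", "M", "M", "M"]
models = [train_stump(*bootstrap_sample(xs, ys, rng)) for _ in range(25)]
```

Each stump sees a different bootstrap sample and may place its threshold differently, but the vote across the ensemble is stable for inputs far from the class boundary.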

In a random forest, the margin function measures the extent to which the average number of votes for the correct class at some input $X$ exceeds the average number of votes for any other class. This function is given by:

$$mg(X, Y) = E_k[\mathbb{1}(h_k(X) = Y)] - \max_{j \ne Y} E_k[\mathbb{1}(h_k(X) = j)] \tag{2.25}$$

• $E_k[\mathbb{1}(h_k(X) = Y)]$ is the proportion of classifiers for which $h_k(X) = Y$. This is equal to $\frac{1}{K}\sum_{k=1}^{K} \mathbb{1}[h_k(X) = Y]$.


The margin function takes into account how the average number of votes at (X, Y ) for the correct class

compares to the average number of votes for the next-best class. When the margin is larger, we are more

confident in our predictions. This is similar to the setup of the SVM where the larger the geometric

margin (how far apart the samples are from the decision boundary), the more confident we are in our

classification prediction.

The generalization error, a measure of how accurately the algorithm will predict unseen samples, is the probability that the margin function is less than zero:

$$e = P_{X,Y}(mg(X, Y) < 0) \tag{2.26}$$
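Given the votes cast by each tree, the margin (2.25) and an empirical estimate of the error (2.26) can be computed directly. The sketch below assumes the votes are already available as lists of labels; the function names and the example votes are hypothetical.

```python
from collections import Counter


def forest_margin(votes, true_label):
    """Share of votes for the true label minus the largest share for any other label."""
    k = len(votes)
    counts = Counter(votes)
    correct = counts.get(true_label, 0) / k
    others = [c / k for label, c in counts.items() if label != true_label]
    return correct - (max(others) if others else 0.0)


def empirical_error(all_votes, true_labels):
    """Fraction of samples with negative margin, an estimate of (2.26)."""
    margins = [forest_margin(v, y) for v, y in zip(all_votes, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)


# Hypothetical votes from a five-tree forest on three samples.
all_votes = [
    ["M", "M", "M", "F", "M"],   # true label M: 0.8 - 0.2 = 0.6
    ["F", "M", "F", "F", "M"],   # true label F: 0.6 - 0.4 = 0.2
    ["M", "M", "F", "M", "F"],   # true label F: 0.4 - 0.6 = -0.2
]
true_labels = ["M", "F", "F"]
```

Only the third sample has a negative margin, so it is the only one the majority vote misclassifies.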

Breiman [6] proves that as the number of trees increases, the generalization error converges. Random forests are often among the best performing methods for a range of applications. In part, this is because they require almost no input preparation and can handle categorical and numeric features without any need for predictor scaling. In contrast, SVMs are very sensitive to predictor scaling, logistic regression assumes that the data is linear and decision trees are sensitive to dataset rotation. Random forests are also quick to train, and perform implicit feature selection. In general, they are a strong “simple” model, and provide a good benchmark against which to evaluate other, more complicated models.

As an overview of all the methods, let us compare the accuracy of the different classifier methods we

presented in this section at predicting the gender of abalone in the test dataset. The ensemble classifier

is constructed by building many decision trees and taking the majority vote, as a rough approximation

of a random forest.

Table 2.3: Comparing the performance of the different classifiers covered in this Introduction.

Classifier                            Accuracy Score
Logistic Regression                   0.5595
Support Vector Classifiers            0.5488
Decision Trees                        0.5417
Random Forest                         0.5027
Ensemble (majority Voting Classifier) 0.5293

Chapter 3

Model Selection

3.1 Cross Validation

In order to evaluate a classifier, we will use cross-validation. A simple method for cross-validation is

k-fold cross-validation. In this case, the data is split into k folds and then predictions are made and

evaluated on each fold with a model that was trained using the other k − 1 folds. When working with

data with temporal structure, we can not use standard methods to validate our models, as there may be

leakage of information from the features to the labels. For example, if one of the features is number of

disciplinary hearings an employee attends and the label is complaints, note that an employee who has

a complaint will always have a disciplinary hearing. If the training set includes all data over time, the

feature of number of disciplinary hearings will be a perfect predictor of adverse interactions. If we are

careful about the temporal splits, then we can use past data to predict future data. In setting up an

EWS, this is very important, as we do not care about overall prediction accuracy, but rather the ability

of the EWS to predict events in the future.

In the case of an early warning system, most features are event-based, with the idea that a specific

sequence of events increases the risk score of an entity over time. We need to perform temporal cross-

validation on our data. The splitting of the dataset needs to be done at the event level. A model that

is being put into production will need a training window that includes data from the beginning of time

until that date, and then will need a label window for the following year.

In Figure 3.1, the longer coloured blocks represent the features of the training set for each model, and the gap immediately following is the label window. The small block that follows is the testing set. A model is a classifier with a set of hyper-parameters. Models with the same hyper-parameters in each of the time blocks belong to the same model group. The same model parameters are used in training and testing different splits of the data over time. Since it is important to have a well-trained, generalizable model that is useful for predicting events outside the dataset, we pick as the best-performing model one that performs best over time and is also stable on different train/test splits. Using this same setup, we also do back-testing to confirm how valid the models are at different points in time.
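The temporal splitting described above can be sketched as follows. This is an illustrative layout under stated assumptions, not the exact procedure used in this work: time is measured in whole calendar years, the label window directly follows the feature cut-off, and the names `TemporalSplit` and `temporal_splits` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalSplit:
    train_feature_end: int    # training features use events strictly before this year
    train_label_years: range  # years in which training labels are observed
    test_feature_end: int     # test features use events strictly before this year
    test_label_years: range   # years in which test labels are observed


def temporal_splits(first_year, last_year, label_window=1):
    """Build expanding-window splits in which the test period always follows training."""
    splits = []
    train_as_of = first_year + 1  # keep at least one year of history for features
    while train_as_of + 2 * label_window - 1 <= last_year:
        test_as_of = train_as_of + label_window
        splits.append(TemporalSplit(
            train_feature_end=train_as_of,
            train_label_years=range(train_as_of, train_as_of + label_window),
            test_feature_end=test_as_of,
            test_label_years=range(test_as_of, test_as_of + label_window),
        ))
        train_as_of += 1
    return splits


splits = temporal_splits(first_year=2009, last_year=2015, label_window=1)
```

Each split trains on everything known before its cut-off and is scored on the period that follows, so no label information from the test period can leak into the training features. A model group is then evaluated once per split, which gives one performance value per point in time.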


Figure 3.1: Training models on different blocks of data with the same parameter set lets us pick models that are stable over time. The x-axis represents time.

3.2 Metrics

There are several possible metrics that we can use to evaluate our models. First, let us define some terms. True positives (TP) are the individuals designated by the model as being part of the class of interest that actually belong to that class. False positives (FP) are those that are predicted as part of the class of interest but do not actually belong to it. Similarly, true negatives (TN) and false negatives (FN) are those that are correctly and incorrectly labeled as part of the negative class by the classifier, respectively.

3.2.1 Accuracy

The accuracy score is the fraction of the predictions which are correct. The indicator function returns 1 when the predicted value is equal to the true value and 0 otherwise. Summed and divided by the number of samples, it gives the fraction of the predictions which are correct.

\[
\text{Accuracy} = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} \mathbb{1}\left(y_{\text{pred},i} = y_{\text{true},i}\right)
\]

3.2.2 Precision

Precision evaluates the efficiency of a model: out of the instances labeled as the class of interest, how many of them are correct. Essentially, it indicates how much trust can be placed in the model’s predictions.

\[
\text{Precision} = \frac{TP}{TP + FP} \tag{3.1}
\]


3.2.3 Recall

Recall evaluates the coverage of the model: out of all the instances that actually belong to the class of interest, how many did the model label correctly.

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{3.3}
\]
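As a small worked example of the three metrics defined above, the sketch below counts TP, FP, TN and FN from a list of true and predicted labels and then applies the formulas. The function names and the toy label vectors are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    # Defined as 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Defined as 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: 1 marks the class of interest.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```

Here the model predicts three positives of which two are correct, and it finds two of the four actual positives, so precision and recall differ even though they share the same numerator.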

3.2.4 Precision-Recall Curves

Often we use precision-recall curves to understand the trade-off between the two metrics. Optimizing for precision typically means that recall will drop. To read the precision-recall curve, one must pick a threshold on the x-axis and decide what to balance. In some cases it makes sense to favour recall; in others, it may make more sense to optimize for precision. For example, if a school district only has the resources to intervene on 150 students a year, it makes sense that they would try to get the best model performance among the 150 highest-risk students. This will become more evident in our examples with the early warning systems that we will implement in the next few sections.
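The idea of evaluating a model only on the entities it ranks highest can be sketched as precision and recall at the top k. The function name, scores and labels below are illustrative assumptions; ties in the scores are broken by input order.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall when the k highest-scoring entities are flagged."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:k]
    true_positives = sum(label for _, label in flagged)
    total_positives = sum(labels)
    precision = true_positives / k if k else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Toy risk scores and binary outcomes (1 = adverse outcome observed).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

p_at_1, r_at_1 = precision_recall_at_k(scores, labels, k=1)
p_at_3, r_at_3 = precision_recall_at_k(scores, labels, k=3)
```

Sweeping k from 1 to the full population and plotting the resulting pairs traces out a precision-recall curve whose x-axis is the share of the population flagged. A district with capacity for 150 interventions would read off the values at k = 150.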

Figure 3.2: The precision-recall curve for the SVM classifier we trained in the Supervised Learning Methods overview.

3.3 Criteria to Consider during Model Selection

For selecting the model, in addition to the precision/recall performance, we also look for the following

attributes:

• Performance stability across time in precision/recall (a model that performed exceptionally well

in 2015 but did not perform well in 2016 is less favorable than a model that performed reasonably

well in both 2015 and 2016).

• The model produces stable classifications. That is, the model produces (nearly) the same classification of entities if run twice on the same data.

• The model differentiates in a more or less clear way the two populations (those at high risk vs those not).

• The top features from the selected model distinguish entities between the two classes.

• The model does not flag entities simply for having more data.

A selected model does not need to be the best model in each of these categories, but it should perform well in

all of them. It is important to note that in a deployment setting, the model is intended to be temporary:

a new model should be selected from time to time to ensure it continues to perform as well as possible.

The exact refresh rate depends on how often data are collected and how quickly patterns change.

Part I

Early Warning System for At-Risk

Youth


Chapter 4

Background: Identifying At-Risk

Youth

In this study, we use and evaluate a range of classifiers in order to build a predictive early warning

system. In principle, an Early Warning System (EWS) takes historical data and learns patterns that

are correlated with labeled adverse outcomes in the future. The EWS then scores entities for future

dates and assigns them a score that is representative of their risk of having an adverse incident in the

future. Many recent studies define and deploy early warning systems for a range of problems: from

identifying students at-risk of failing an undergraduate course [12] to predicting future dengue outbreaks

[2]. Many companies utilize EWS as part of their business process, but the application to social good

problems is a relatively recent development.

The prediction task that we are interested in for this study is: Identifying students at risk of

interacting with the criminal justice system.

In developing the EWS, we have to define the prediction task and clarify the assumptions that we

are making about the data and the existing relationships. In this case study, we walk through the

development of an EWS for this purpose, from problem formulation to result validation.

4.1 Motivation

Historically, the juvenile-justice system was meant to rehabilitate delinquent youth to become productive

citizens. However, research shows that students, especially inner city youth, have trouble reintegrating

back into society once they have had a significant interaction with the juvenile justice system. Teenagers

who interact with the system are likely to experience significant negative life outcomes such as a decreased

likelihood of high school graduation [1], an increased likelihood of committing crimes in early adulthood

[4], and a significantly higher mortality rate [17].

The county that we are concerned with is afflicted by both low graduation rates and high rates of juvenile crime. While juvenile arrest rates have been steadily decreasing nationally, arrest rates in the county increased by 163% between 2011 and 2015, the last year recorded. Additionally, while the state has a high school graduation rate of 88%, the county had a graduation rate of only 58% in 2015. In response, the police department has commissioned several task forces focused on reducing juvenile crime and the school system has designed broad interventions that aim to increase the county’s graduation rate. It is clear that students are performing poorly at the high school level and that high-school-age juveniles are interacting with the criminal justice system at higher rates. Previously, many researchers have built prediction modelling systems that predict student academic performance, both at the course level and more generally [7], and other work has addressed predicting recidivism for youth in the criminal justice system [10]. A lot of work in the education field also exists on how to build early warning systems for students at risk of not passing courses administered through a web-based learning system. However, no work has been done to combine datasets from the educational and justice systems to tackle such a problem.

4.1.1 Problem Formulation

Our aim is two-fold. First, we want to provide the school district with a risk-score for current students

that provides insight into a student’s risk of interacting with the criminal justice system in the next three

years. It is important to note that the list of students and their associated risk scores is generated to

allow the school system to match students with various support programs to ensure they stay in school,

graduate on time, and avoid the criminal justice system. This ties into the current community-based

crime prevention methods that are already in place in the county. Second, we want to understand what

features are most predictive of high or low risk scores. We used juvenile and adult criminal justice data

through a DataShare platform, as well as data from the school district. Many studies suggest that poor school performance and early truancy lead to juvenile delinquency [11], but prior to this work, education records and criminal justice records had not been combined to build predictive models of delinquency.

We framed the problem as a binary classification problem to predict which students will have an

interaction with the criminal justice system in the next three years. We find that the students who are

assigned a high risk score (in the top decile) by our system are four to five times more likely to have

an interaction with the criminal justice system in the future than those with lower scores (bottom 9

deciles). In addition, unlike the existing system that assigns a binary (at risk or not) flag to students,

our model allows the school to use the risk scores to prioritize students for appropriate interventions.

4.1.2 Current Interventions for At-Risk Youth

The school system currently employs three tiers of interventions for at-risk youth. Tier 1 consists of

school-level interventions such as regular assemblies reminding students of behavioural expectations. Tier

2 consists of targeted interventions to support students who are not responding to Tier 1. An example

of a Tier 2 level intervention is the Check-In/Check-Out (CICO) program: a student checks in briefly each morning and afternoon with a designated school staff member who determines whether the student is ready for class and, if required, whether the student will remain with them for further assistance and guidance. Tier 3 interventions are intense and personalized; they are intended for students not responding to Tier 2 interventions. There are no set criteria for being selected into a Tier 3 intervention, but when considering the prediction problem, this is one level at which we can impact students’ well-being. One

example is the RENEW program, a structured school-to-career transition planning and individualized

wrap-around process for youth with emotional and behavioural challenges. Due to resource constraints,

the number of students receiving Tier 3 interventions can be no more than one to five percent of the

total student population.


To identify at-risk youth, the school district evaluates student attendance, behaviour, and curricular

performance (the ABCs). If a student is flagged as at risk in two of these three categories, they are

recommended to a Tier 2 intervention. Whether a student is flagged depends on their age and the

severity of the problem. For instance, the flags for behaviour are as follows:

• For students in kindergarten through grade 8, one Office Discipline Referral (ODR) in the past 20 school days or one out-of-school suspension in the past 90 days;

• For students in grades 9 through 12, three ODRs in the past 20 school days, or two out-of-school suspensions in the past 90 school days.
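The behaviour rule above can be written as a small function. This is a sketch under assumptions that the text does not state explicitly: the thresholds are read as "at least" the listed counts, kindergarten is encoded as grade 0, and the counts for the two look-back windows are computed elsewhere and passed in. The function and parameter names are hypothetical.

```python
def behaviour_flag(grade, odrs_last_20_days, suspensions_last_90_days):
    """Return True if a student meets the behaviour flag for their grade band."""
    if 0 <= grade <= 8:
        # Kindergarten (encoded as 0) through grade 8: a single event is enough.
        return odrs_last_20_days >= 1 or suspensions_last_90_days >= 1
    if 9 <= grade <= 12:
        # Grades 9 through 12: higher thresholds.
        return odrs_last_20_days >= 3 or suspensions_last_90_days >= 2
    raise ValueError(f"grade out of range: {grade}")


flags = [
    behaviour_flag(grade=5, odrs_last_20_days=1, suspensions_last_90_days=0),
    behaviour_flag(grade=5, odrs_last_20_days=0, suspensions_last_90_days=0),
    behaviour_flag(grade=10, odrs_last_20_days=2, suspensions_last_90_days=1),
    behaviour_flag(grade=10, odrs_last_20_days=0, suspensions_last_90_days=2),
]
```

A rule of this form yields only a yes/no answer per student, which is why it cannot by itself rank the flagged students by priority.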

Once a student is flagged, the school’s Building Intervention Team considers additional data such as

the nature of the ODR, credits, grades, attendance, teacher input, work samples, observation, etc., to

determine whether a student should receive an intervention, and if so, at which tier.

This system currently flags 22,000 students without any prioritization or ranking. Currently, the

school district has the capacity to intervene with 5,000 students every year. Thus, the current system

makes it untenable to match students effectively with the available interventions. A machine learning

approach to this problem gives a prioritized list of students and risk scores.

Chapter 5

Data

5.1 Education Data

The school data includes information on demographics, attendance, discipline, assessment, and school

programs for students enrolled between 2004 and 2015. Demographic data covers race, gender, birth

date, mailing address and school name per student identified by a unique student key.

Attendance data includes daily attendance records for each student, with a row representing a day

that a student was in attendance at their school. There are approximately 127 million records covering

179,780 students by unique student key.

Discipline data includes date and nature (e.g., classroom disruption, weapons related) of the disciplinary event. The file contains over 100,000 records which are recorded at the event-level and represent 97,000 students.

Assessment data includes descriptions of all tests taken (e.g., date taken, subject) as well as students’

scores. There are more than 5 million records representing 194,415 students. This includes repeated

standardized testing such as Measures of Academic Progress (MAP) which are administered multiple

times a year for students from kindergarten through high school, as well as college admissions tests such

as the Scholastic Aptitude Test (SAT) which are administered once per student.

Finally, school programs records include information on the type (e.g., HeadStart, Special Education)

and the dates the students were enrolled in these programs.

5.2 Criminal Justice Data

Data from the district attorney’s office covers all juvenile and adult interactions with the criminal justice system from 2009 to 2015 where the case was referred to the DA’s Office. Once probable cause for criminal behavior is identified by law enforcement, a juvenile can be assigned to an informal diversion, advised and released, transported to a homeless shelter/detox service or referred to a psychiatric crisis team. If the juvenile is arrested and booked, they are eventually ordered to the DA’s office.

After a charging decision is made by the DA, the office prepares the case and it proceeds to court. After a bond hearing and a preliminary hearing the plea negotiation process is initiated or the case proceeds to trial. If found guilty, the juvenile might be put on probation, end up in a juvenile detention centre or pay a fine. The DA’s office serves the county and therefore covers a wider range of people than the citywide school district. The criminal justice data represents 50,000 individuals. It contains information such as the name of the defendant, as well as demographic variables such as date of birth, gender and race. The dataset also contains information on the severity of the offense separated into felony, misdemeanor, and forfeiture.

Datasets           Number of Records   Number of Unique Individuals
Demographic        1.5 million         300,000
Attendance         127 million         179,780
Discipline         100,000             97,000
Assessment         5 million           194,415
Criminal Justice   154,198             50,020

Table 5.1: A summary of the different datasets received and used in this analysis

5.3 Data Challenges

In this modeling framework, we want to create individual-level trajectories that capture a student’s

academic profile and their relationship with the criminal justice system. In tackling this problem, we

encountered several issues with our datasets.

· There is no source table for all the unique individuals in either of the two datasets.

There are over 1.5 million demographic records for more than 300,000 students enrolled in the school

district during the data collection period. Ideally, a new record is generated every time any of the fields

change.

However, there are multiple records per student, with newly generated unique identifiers referring

to the same individual. One reason for multiple records per student is that the school district has a highly

mobile population with many students changing schools and home addresses from year to year. There are

also 100,000 more students present in the demographic dataset than are present in the other datasets.

In consultation with the school district, we noted that while we identify students by unique student

keys, in some cases, when a student leaves the school district and re-enters at a later time, they will be

registered as a new student with a new student key. This was an important consideration in the entity

resolution stage as there are not actually 300,000 unique students represented in the educational data.

Additionally, when a juvenile enters the criminal justice system, and then re-interacts with it at a

later period in time, they may not be entered under the same identifying information.

·The criminal justice dataset only includes information on serious offences.

From arrest to when a juvenile enters into the records at the DA’s office, there are multiple endpoints

at which the juvenile can exit the system. For example, they can be released to community service or

if it is a municipal case (i.e. not a misdemeanor or a felony) they can be ordered to civil court and

released. This means that only serious crimes are represented in the data that we have.

·Data entry errors in static variables like race and gender. In the educational dataset, we standardized demographic details at the student level. For example, both ‘Black or African American’ and ‘African-Am’ are used to refer to African American students; such discrepancies were identified and normalized. Since new demographic records are generated often, there were many students who had multiple different values for their race or gender due to data entry errors. We standardized these records by taking the last non-null value for each field for every student and propagating it back over time.
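The standardization step described above can be sketched in plain Python. The record layout (`student_key`, `record_date`, `race`), the alias table and the function names are assumptions made for illustration; dates are assumed to be ISO strings so that they sort chronologically.

```python
# Hypothetical alias table mapping lower-cased variants to one canonical value.
RACE_ALIASES = {
    "african-am": "Black or African American",
    "black or african american": "Black or African American",
}


def normalize(value):
    """Trim a raw value, treat blanks as missing, and map known variants."""
    if value is None:
        return None
    cleaned = value.strip()
    if not cleaned:
        return None
    return RACE_ALIASES.get(cleaned.lower(), cleaned)


def standardize_field(records, field):
    """Give every record of a student that student's last non-null value."""
    last_value = {}
    for rec in sorted(records, key=lambda r: r["record_date"]):
        value = normalize(rec.get(field))
        if value is not None:
            last_value[rec["student_key"]] = value  # later records overwrite earlier ones
    return [{**rec, field: last_value.get(rec["student_key"])} for rec in records]


records = [
    {"student_key": 1, "record_date": "2010-09-01", "race": "White"},
    {"student_key": 1, "record_date": "2011-09-01", "race": "African-Am"},
    {"student_key": 1, "record_date": "2012-09-01", "race": None},
    {"student_key": 2, "record_date": "2010-09-01", "race": None},
    {"student_key": 2, "record_date": "2011-09-01", "race": "Asian"},
]
cleaned = standardize_field(records, "race")
```

Taking the last non-null value treats the most recent entry as the most reliable one, which is a modelling choice; a majority vote over all entries would be an alternative.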


5.4 Matching the Datasets

In order to identify and link unique individuals within the educational data, we matched within the

datasets and created IDs for each person. This resolves the issue discussed above about not having a

table of all unique individuals captured in our dataset. We assumed that individuals having the same

first name, last name, and date of birth were the same person. However, simply matching on these fields

across the two data sets yielded no matches due to variations in formatting. Additionally, names are

captured in two different formats between the two datasets. For the educational dataset, there is only a

single name field, for instance “Smith, P. Jones”. In the school dataset, there are separate fields for first,

last and middle names. Additionally, in both datasets, the same individual may be booked multiple

times (criminal) or re-enrolled multiple times (school) leading to variation in how the name may be

entered each time. For example, a name might be misspelled, only the first part of a hyphenated name

might be included, or an apostrophe might be used one time and replaced by a space the second time.

In order to improve the matching rate, we cleaned the first and last name fields to make them more uniform by removing middle initials, whitespace, commas, quotation marks, hyphens and suffixes. Based on input from the school district and the DA’s office, we expected an 80% match rate between the two datasets. After this initial cleaning, we only achieved a 25% match rate. Recognizing that there might be some variation due to spelling errors or the use of nicknames, we computed the Jaro-Winkler distance for the first name and last name fields.
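The name cleaning step can be sketched as follows. The suffix list, the rule for spotting a middle initial (a single-letter token that is not the first token) and the function name are assumptions for illustration, not the exact rules applied to the data.

```python
import re

# Hypothetical list of generational suffixes to drop.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def clean_name(raw):
    """Lower-case a name field and strip punctuation, suffixes, initials and spaces."""
    # Replace commas, quotation marks, apostrophes, hyphens, periods, etc. by spaces.
    text = re.sub(r"[^\w\s]|_", " ", raw.lower())
    tokens = [tok for tok in text.split() if tok not in SUFFIXES]
    # Treat a trailing single-letter token as a middle initial and drop it.
    kept = [tok for i, tok in enumerate(tokens) if i == 0 or len(tok) > 1]
    return "".join(kept)


examples = {
    "O'Neil": clean_name("O'Neil"),
    "O Neil": clean_name("O Neil"),
    "Smith-Jones Jr.": clean_name("Smith-Jones Jr."),
    "John P.": clean_name("John P."),
}
```

With these rules an apostrophe and a space in the same position produce the same cleaned string, which addresses the kind of variation described in the previous paragraph.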

The Jaro distance for two strings (s1, s2) with a non-zero number of matching characters l and t transpositions is given by:

\[
d_j = \frac{1}{3}\left(\frac{l}{|s_1|} + \frac{l}{|s_2|} + \frac{l - t}{l}\right) \tag{5.1}
\]

• |s1| is the length of the string, for example, “THE” has length 3.

• l is the number of matching characters between two strings. Between “BART” and “BARE”, the

number of matching characters is 3.

• t gives the number of transpositions. A transposition is defined as a pair of matching characters that appear in a different sequence order in the two strings.

The Jaro-Winkler distance [22] for two strings is given by:

\[
d_w = d_j + k\,p\,(1 - d_j) \tag{5.2}
\]

• dj is the Jaro distance between the two strings

• k is the length of the prefix common to both strings

• p is a constant scaling factor that sets how strongly a common prefix raises the score (a value of 0.1 is commonly used)

This means that the Jaro-Winkler distance assigns a more favorable score to strings that match more

at the beginning.
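Equations 5.1 and 5.2 can be implemented in a few lines. The sketch below follows the usual conventions, which are assumptions here rather than details taken from the text: two characters match only if they lie within a window of floor(max(|s1|, |s2|)/2) − 1 positions, t is half the number of matched characters that appear in a different order, p = 0.1, and the common prefix is capped at four characters. Higher scores mean more similar strings.

```python
def jaro(s1, s2):
    """Jaro score of Eq. 5.1 (1.0 for identical strings, 0.0 for no matches)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order pair counts half.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler score of Eq. 5.2 with a capped common-prefix length k."""
    dj = jaro(s1, s2)
    k = 0
    for a, b in zip(s1, s2):
        if a != b or k == max_prefix:
            break
        k += 1
    return dj + k * p * (1 - dj)


grey_gray_jaro = jaro("grey", "gray")
grey_gray_jw = jaro_winkler("grey", "gray")
```

With these definitions the plain Jaro score for "grey" and "gray" comes out to about 0.8333, and the prefix adjustment for the shared "gr" raises it to about 0.8667.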

If all three identifying fields (first name, last name, and birth date) match exactly, we consider the records to belong to the same individual. If one or zero of the fields match exactly, we do not consider the records as belonging to the same individual. If two of the three fields match exactly, then we consider the records as belonging to the same individual whenever:

• both names match, and the birth dates share the same year and otherwise differ by a single character, or

• one of the two name fields matches and the birth dates match, and the Jaro-Winkler distance [22] between the mismatched names is at least 0.8.

Two examples of individuals considered the same using the Jaro-Winkler distance rule are illustrated in Table 5.2.

First Name   Last Name   Date of Birth   Jaro-Winkler Distance
Reginald     Grey        2004-08-03      0.8333
Reginald     Gray        2004-08-03
Khabaugh     Musgrave    1993-10-22      0.9629
Khabaugh     Musgraves   1993-10-22

Table 5.2: Example of the Computation of Jaro-Winkler Distances for two Entities

Lastly, noting that there might be some birth dates that were entered incorrectly, we allowed for some fuzziness. With an exact match on first name, last name, and the year of birth, we allowed up to a 1 digit difference in the month and day. For example, 2010-02-04 and 2010-03-04 are a match, but 2004-11-09 and 2004-11-22 are not considered a match.
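The fuzzy birth date rule can be sketched as a character comparison on ISO-formatted dates. The function name is hypothetical, and the sketch assumes both dates are strings of the form YYYY-MM-DD.

```python
def birth_dates_fuzzy_match(date1, date2):
    """True if the years agree and month/day differ in at most one digit."""
    if len(date1) != 10 or len(date2) != 10:
        return False
    if date1[:4] != date2[:4]:
        return False  # the year must match exactly
    differing = sum(a != b for a, b in zip(date1[4:], date2[4:]))
    return differing <= 1


match_one_digit = birth_dates_fuzzy_match("2010-02-04", "2010-03-04")
match_two_digits = birth_dates_fuzzy_match("2004-11-09", "2004-11-22")
```

The first pair differs only in one digit of the month and is accepted, while the second pair differs in both digits of the day and is rejected, which reproduces the two examples in the text.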

The criminal justice data contained one row per case per charge. If an individual is charged with

multiple charges for the same incident, this will be reflected in multiple rows with the exact same

information but with different charges. We want to identify individuals within the data set and match

case number to a unique generated Person ID. Starting with 96,066 rows in the juvenile data, we identified

15,451 distinct cases by DA Case Numbers. After applying the matching logic above, we identified 9,451

unique individuals and assigned them a Person ID which was then appended to the original dataset.

After identifying unique individuals within the criminal justice data, these individuals were matched

to the school data. We again applied the same logic as above. We successfully linked 86% of individuals

with a DA record to the education data. Since it is possible that individuals who have a criminal

record did not attend schools in the county (e.g. out-of-state offenders), we believe that 86% is a

reasonable match rate. Future work includes using more sophisticated machine learning based record

linkage approaches to improve the matching process.

Chapter 6

Methods

More than 70 features were generated and they covered the whole educational dataset. One of the

features used was the number of days that students were enrolled in the last year. The histogram for

the feature is shown in Fig. 6.1. It shows that most students were enrolled for the majority of the school

year.

Figure 6.1: Histogram of the number of days students were enrolled in the year 2011.

The labels were created using the criminal justice dataset. For a calendar year, if an individual had

an interaction with the criminal justice system, they were assigned a positive label class.

As described earlier, we formulated our problem as predicting whether a currently enrolled student

is at risk of interacting with the criminal justice system in the next 3 years. We implemented the

following models using scikit-learn [15] and a variety of hyperparameters: Random Forests (RF), Logistic

Regression (LR), Support Vector Machines (SVM), and Decision Trees (DT). All of these methods were

previously detailed in the Introduction.

We used all the different classifier options while testing a range of model hyper-parameters; here, however, we present the parameters of the top performing classifiers (Table 6.1). For example, for logistic regression, we test a range of regularization parameters. Recall that previously we defined the ridge and lasso regularization parameters. L1 corresponds to LASSO and L2 corresponds to Ridge regression. The C value determines the strength of the regularization: the smaller the value, the stronger the regularization, similar to SVMs. For RF, number of estimators is the number of trees to include in the forest, max depth is the maximum depth of each tree, max features is the maximum number of features to sample from at each split, and the minimum samples at split indicates the minimum number of samples required to branch off an internal node.

Table 6.1: Grid Search parameters for model selection from the top performing models

Models and Hyperparameters

Logistic Regression
    C: 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10
    Penalty: L1, L2

Random Forest Classifier
    Number of Estimators: 1, 10, 100, 1000, 10000
    Max Depth: 1, 5, 10, 20, 50, 100
    Max Features: Square root, log2
    Minimum Samples at Split: 2, 5, 10

Decision Tree Classifier
    Criterion: gini, entropy
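To give a sense of the size of the search in Table 6.1, the sketch below expands each grid into its individual hyperparameter settings. The dictionary keys follow scikit-learn's parameter names for the corresponding estimators; the `GRIDS` and `expand` names are illustrative, and the sketch only enumerates settings without fitting any model.

```python
from itertools import product

# Grids transcribed from Table 6.1, keyed by scikit-learn parameter names.
GRIDS = {
    "LogisticRegression": {
        "C": [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
        "penalty": ["l1", "l2"],
    },
    "RandomForestClassifier": {
        "n_estimators": [1, 10, 100, 1000, 10000],
        "max_depth": [1, 5, 10, 20, 50, 100],
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 5, 10],
    },
    "DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
    },
}


def expand(grid):
    """Return one dict per combination of values in the grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


model_groups = {model: expand(grid) for model, grid in GRIDS.items()}
counts = {model: len(settings) for model, settings in model_groups.items()}
```

Each setting corresponds to one model group in the sense of Chapter 3, and every group is trained once per temporal split, so the number of fitted models is the number of settings multiplied by the number of splits.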

6.1 Model Evaluation

We validated our models using temporal validation by creating training and test sets that are temporally

disjoint. For example, if we are predicting an interaction with the criminal justice system in the years

2010-2012, the models are trained on all the data up to the end of 2009 and then the model predicts a

risk score for all students as of the beginning of 2010 that provides their risk of having a criminal justice

interaction from 2010 to 2012.

Chapter 7

Results

7.1 Model Performance

We evaluate the model performance based on two criteria:

1. Precision in the top 1%: We want the model to be as accurate as possible in the top 1% of

the predictions since that is the intervention capacity of the school system. The school has the

resources to administer Tier 3 interventions to no more than 1 to 5% of the school population.

Focusing on the 1% threshold allows us to better match students with the limited intervention

resources available to the school district.

2. Stability of that performance over time: We want a model that is stable in terms of Precision at

1% over time so it can be used consistently without risking drastic performance changes.

To achieve those two goals, we selected the 50 best performing models based on precision at 1%. We

then selected models that are consistently among the top 50 across each time period. We found that

Random Forests with the following hyperparameters performed the best based on these two criteria:

• n_estimators = 200

• max_depth = 10

• min_samples_split = 5

• max_features = 0.33

• criterion = entropy
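The two-step selection described above (rank by precision at 1%, then keep the model groups that stay near the top in every period) can be sketched as a set intersection. The model identifiers and precision values below are made up for illustration, and the function name is hypothetical; the sketch uses the top 2 rather than the top 50 to keep the example small.

```python
def stable_top_models(precision_by_period, top_n=50):
    """Model groups that rank in the top_n by precision in every time period."""
    top_sets = []
    for scores in precision_by_period.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_sets.append(set(ranked[:top_n]))
    stable = set.intersection(*top_sets) if top_sets else set()

    def mean_precision(model):
        values = [scores[model] for scores in precision_by_period.values()]
        return sum(values) / len(values)

    # Order the surviving model groups by their average precision over time.
    return sorted(stable, key=mean_precision, reverse=True)


# Illustrative precision at 1% per test period for four model groups.
precision_at_1pct = {
    2013: {"rf_a": 0.30, "rf_b": 0.28, "lr_a": 0.25, "dt_a": 0.10},
    2014: {"rf_a": 0.29, "rf_b": 0.20, "lr_a": 0.31, "dt_a": 0.12},
    2015: {"rf_a": 0.31, "rf_b": 0.27, "lr_a": 0.26, "dt_a": 0.09},
}
selected = stable_top_models(precision_at_1pct, top_n=2)
```

In this toy example only one model group is in the top two in all three periods, even though another group has the single best score in one of them.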

The precision-recall curves for this model are shown in Fig. 7.2. At 1% of the population, the precision is 0.3 and the recall is about 0.1. This is extremely encouraging: taking the top 1% of the model predictions allows us to identify 10% of all the at-risk students at 30% precision. This is significantly (more than 10 times) higher than a random baseline, which would get 2.8% precision (there are 300,000 students and only 9,500 juvenile offenders). In comparison to the school’s baseline (as documented in Table 7.1), we correctly identify more students who actually have an adverse incident while flagging fewer students. This reduces the load on the schools and makes them more efficient in their interventions.


(a) Random Forest with 1K trees
(b) Random Forest with 10K trees
(c) Logistic Regression with minimal regularization
(d) Logistic Regression with maximum regularization

Figure 7.1: Performance of different classifiers trained on the same datasets. Predicting for the next three years, we have a very low base rate, as the number of students actually interacting with the criminal justice system is very low. This is evident in the plots with the high recall curve. The model performs very well on the top 0.1% which is our population of interest.


Figure 7.2: The precision-recall curve for the best-performing model.

7.2 Analysis of Results

In this section, we take the best performing model and show some diagnostics we performed to understand

and validate the model further.

7.2.1 Risk Scores

Figure 7.3 is a log-plot of the risk scores generated by the best model selected. This shows that there are very few students with high risk and many students with low risk. This is promising for a risk score that will be used for intervention targeting, as it suggests that the risk score is well calibrated.

Figure 7.3: Risk Score Distribution of the riskiest 1000 students

7.2.2 Evaluating the predictions by score decile

Figure 7.3 shows a decile plot that compares the actual number of positive labels in each decile versus

the predicted number. A well-performing model will have both values as close to each other as possible

in every decile and the number of predicted positive labels should go down as the risk score goes down.

As we can see from the graph, that is the case for our best performing model which gives us confidence

in the risk scores.


                   Flags    Correctly Identifies
Heuristic Method   22000    1310
Our Model          12000    1630

Table 7.1: Comparing the baseline method to the best performing model, we note that precision increases from 6% for the current system to 14% for the best model.

7.2.3 Comparison to current School-Based Approach

As our goal was to help the schools target interventions at the relevant students, we compared our model results to the method the schools use to flag students who need intervention. Students are flagged as “generally at risk” using a rule-based method based on the number of suspensions and office discipline referrals as well as their current grade level. We implemented the tier-2 intervention as the baseline and calculated its performance in terms of precision, recall, and the percentage of students that it flags. The current system flags 22,000 students, and approximately 1,300 of those flagged actually have an interaction with the criminal justice system (Table 7.1; precision of 5.9%). Compared to this, our model can identify the same number of at-risk students while flagging only 33% as many students. If we allow our model to flag as many students as the current method, we can identify 46% more students who will go on to interact with the criminal justice system. This shows the effectiveness of our system compared to the methods currently used in the school system.
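As a quick check of the precision figures quoted for Table 7.1, the short sketch below recomputes precision from the flag counts in that table. The variable and function names are illustrative only. Recall is not computed because the total number of students with a positive label is not stated in this section.

```python
def precision(correctly_identified, flagged):
    """Fraction of flagged students who go on to have a positive label."""
    return correctly_identified / flagged


# Counts taken from Table 7.1.
heuristic_precision = precision(1310, 22000)
model_precision = precision(1630, 12000)

print(f"Heuristic method: {heuristic_precision:.1%}")
print(f"Our model:        {model_precision:.1%}")
```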

The features that are most important in the best performing model are:

1 Number of “Child In Need of Protective Services” (CHIPS) records

2 Age

3 Number of discipline incidents in last 2 years

4 Average absence days over the years

The number of CHIPS records is generated from the DA data set. A record is created if a child is abused or neglected by their parent and the case was logged with the DA. This feature consistently shows up as one of the top features in our best performing model. It is important to note that this is not necessarily a causal relationship. It is possible that the number of CHIPS records is correlated with other attributes of a juvenile and shows up as highly predictive for that reason. Age is also very predictive compared to other features. This makes intuitive sense, as a 15-year-old is generally more likely to commit an offense than an 8-year-old. The number of discipline incidents in the last 2 years and the average absence days over the years are also among the top features, which is consistent with findings in the literature [13] that absenteeism and truancy are often causes of delinquency. Interestingly, common demographic features such as gender and race are noticeably absent from the top features. This is often the case, since behavioural attributes tend to be more predictive than demographics, although the two are highly correlated in practice. To further investigate whether the number of CHIPS records is masking the contribution of demographic variables, we examine the cross-tabs of number of CHIPS records and race. The result is presented in Table 7.2. Comparing the racial make-up of students with at least one CHIPS record to that of students with none, we find that African American students make up a higher fraction, and Hispanic students a lower fraction, of students with at least one record. Together, the results suggest that African American students are more likely, Hispanic students are less likely, and White students are neither more nor less likely to have more CHIPS records.


No. CHIPS records                 0               1-10           11-20         21-30        31-40  41-50
African-American                  79840 (53.11%)  1185 (65.65%)  246 (68.14%)  78 (69.64%)  41     31
American Indian or Alaska Native  487 (0.32%)     13 (1.11%)     4 (1.05%)     0            0      0
Asian                             7953 (5.29%)    13 (0.72%)     5 (1.39%)     0            0      0
Hispanic                          33770 (22.46%)  287 (15.90%)   58 (16.07%)   3 (2.68%)    8      1
Native American                   1062 (0.71%)    17 (0.94%)     1 (0.28%)     18 (16.07%)  1      2
White                             25276 (16.81%)  280 (15.51%)   40 (11.08%)   12 (10.71%)  5      6
Other                             1943 (1.29%)    10 (0.55%)     7 (1.94%)     1 (0.89%)    0      0

*The figure in parentheses denotes the fraction of overall cases. The table uses all data up to year 2013.

Table 7.2: Number of CHIPS records by Race
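The percentages in Table 7.2 appear to be shares within each column, that is, each group's fraction of all students in a given CHIPS-record bracket. The sketch below illustrates that reading using the counts from the zero-records column; the dictionary and function names are hypothetical.

```python
# Counts of students with zero CHIPS records, copied from Table 7.2.
zero_record_counts = {
    "African-American": 79840,
    "American Indian or Alaska Native": 487,
    "Asian": 7953,
    "Hispanic": 33770,
    "Native American": 1062,
    "White": 25276,
    "Other": 1943,
}


def column_shares(counts):
    """Return each group's percentage share of the column total."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


zero_record_shares = column_shares(zero_record_counts)
for group, share in zero_record_shares.items():
    print(f"{group}: {share}%")
```

Comparing these shares with the shares in the columns for one or more records is what supports the statements about which groups are over- or under-represented.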

Chapter 8

Discussion

8.1 Future Work

The existing system only predicts interaction with the juvenile criminal justice system. A natural next

step is to expand the label set to include adult interactions as well. We would also like to broaden

the definition of interaction by incorporating arrest data. For example, it was reported that there were

approximately 16000 arrests of juveniles in 2012, but based on the DA case data we only have information

about 1923 incidents in 2012. Currently, we are only able to predict severe offenses; by including arrest data, we could build models that predict any interaction at all with the criminal justice system. Many juveniles are cited and released into the custody of their parents for minor offenses, and our labels do not currently capture this kind of interaction. Another extension of this work would be to re-frame the problem as a multi-class prediction problem and predict classes of offense by severity. It would be interesting to investigate whether features have different predictive power for certain classes of offenses.

Another area of future work is to generate more features using other data sets such as health and

family data. This would allow us to incorporate other likely predictive factors. For instance, the health

dataset contains information on students’ blood lead levels and vaccination status.

The premise of building such a system is the assumption that there exist interventions that are effective at reducing the risk of students having an interaction with the juvenile justice system. Our machine learning system can then identify students who should be matched with those interventions in order to improve their outcomes. A critical future endeavor is to 1) validate that assumption and determine whether existing interventions are in fact effective at reducing the risk, especially for high-risk students, and 2) determine which students are not responding to existing interventions and work with experts to create new interventions. Having a system that can accurately assess future risk allows effective evaluation of existing interventions and supports the development of new ones, thus improving the outcomes that we care about.

8.2 Conclusions

In this work, we show that using school records, we can accurately identify students who are at risk

of future juvenile criminal justice interactions. Experiments on historical data show that our model

performs significantly better than the existing early warning system being used at the school district. If

we allow our model to flag as many students as the current school method, we can identify 46% more

students who will go on to interact with the criminal justice system. To the best of our knowledge this

work represents the first data-driven approach to address the problem of juvenile delinquency using both

school and criminal justice data.

Part II

Early Warning System for Public Safety

Chapter 9

Refining EWS for Public Safety

9.1 Background

Unlike the EWS for Criminal Justice, the EWS for public safety is an ongoing project with an established

methodology that was developed over the course of partnerships with several public safety agencies. A

public safety agency is one that is tasked with keeping members of the public secure, and thus its

employees have a high number of interactions with members of the public. Sometimes those interactions

can go awry, as witnessed by recent media coverage. Here, we present the results of building iterative,

complex models with thousands of features in partnership with a public safety agency. This is the first

time that such complex features have been used at this granularity in building an EWS for employees

of a public safety agency. The classifier methods used in this project are the same as those described

previously.

Most public safety agencies already have a behavioural Early Intervention System (EIS) in place.

The goal of an EIS is to proactively identify employees who display patterns of problematic performance

or who show signs of job and personal stress in order to intervene and support them with training or

counseling. When an EIS alert is raised, that alert should indicate that the employee is at high risk

of having an adverse interaction in the near future. The current state-of-the-art EIS in many public safety agencies is threshold-based: it issues an alert for any employee who reaches a predefined number of events in a given timeframe, such as three complaints in a six-month period. In contrast, the data-driven EWS takes all available data and detects patterns that precede adverse incidents with greater accuracy, making the EWS predictive and enabling prioritized, preventative interventions. Also, by computing the risk score of an employee over their career, it is possible to establish a risk profile that changes over time and improves in accuracy with more data.
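To make the threshold-based rule concrete, the sketch below raises an alert when an employee has at least three complaints in a trailing six-month window. It is only an illustration of the kind of rule described above: the function name, the 182-day approximation of six months, and the example dates are assumptions, not the agency's actual EIS logic.

```python
from datetime import date, timedelta


def threshold_alert(complaint_dates, as_of, threshold=3, window_days=182):
    """Return True if the number of complaints in the trailing window
    ending on `as_of` reaches the threshold.

    Assumption: six months is approximated as 182 days.
    """
    window_start = as_of - timedelta(days=window_days)
    recent = [d for d in complaint_dates if window_start < d <= as_of]
    return len(recent) >= threshold


# Hypothetical complaint history for one employee.
complaints = [date(2013, 8, 1), date(2013, 10, 15), date(2013, 12, 20)]

alert_in_january = threshold_alert(complaints, as_of=date(2014, 1, 1))
alert_in_march = threshold_alert(complaints, as_of=date(2014, 3, 1))
print(alert_in_january, alert_in_march)
```

The same history triggers an alert in January but not in March, once the oldest complaint falls out of the window. A rule of this kind gives a binary answer and ignores every other signal, which is the limitation the data-driven EWS is meant to address.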

9.2 Data

We received data at the employee level on many different types of interactions with members of the public (e.g., use of force reports, compliments, and complaints).

Using all of the datasets, we made count features for each employee, such as the number of interactions in the past month, two months, two years, etc. Due to the temporal nature of the problem, we created daily, individual-level profiles as training samples. For example, employee 123 has some activity every

day, be it a verbal or physical interaction. A training sample in the dataset is not one employee but one employee profile, and employee 123 has many copies in the training dataset, one for each day that they are active. This means that our training dataset is highly correlated, but algorithms like random forests and decision trees can handle a high-dimensional, highly correlated feature space. The label was determined from Internal Affairs (IA) data as any complaint against an employee that was sustained by the review board in the year following the end of the training time period. If the end of the training dataset is January 2014, then the label for an employee is positive if they have an IA complaint sustained against them between January 2014 and January 2015. We also included demographic features, such as age and gender, as predictors.
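The construction of one training sample can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's feature pipeline: the function names, the window lengths in days (30, 60, and 730 as stand-ins for one month, two months, and two years), and the example dates are all hypothetical. Features count only events strictly before the profile date, and the label looks only at the following 365 days, so no information from the label period leaks into the features.

```python
from datetime import date, timedelta


def count_in_window(event_dates, as_of, days):
    """Count events in the `days` days before `as_of` (excluding `as_of`)."""
    start = as_of - timedelta(days=days)
    return sum(1 for d in event_dates if start <= d < as_of)


def build_profile(event_dates, sustained_complaint_dates, as_of,
                  windows=(30, 60, 730)):
    """Build one (features, label) pair for an employee on a given date.

    The label is 1 if a sustained complaint falls within the 365 days
    starting on `as_of`, and 0 otherwise.
    """
    features = {
        f"events_last_{w}d": count_in_window(event_dates, as_of, w)
        for w in windows
    }
    label_end = as_of + timedelta(days=365)
    label = int(any(as_of <= d < label_end for d in sustained_complaint_dates))
    return features, label


# Hypothetical history for one employee, profiled on 1 January 2014.
events = [date(2012, 6, 1), date(2013, 11, 15), date(2013, 12, 20)]
sustained = [date(2014, 9, 10)]
profile_features, profile_label = build_profile(events, sustained, date(2014, 1, 1))
print(profile_features, profile_label)
```

Calling `build_profile` once per active day yields the many correlated copies of each employee described above.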

9.3 Results

We conducted a historical analysis of the current EIS and compared it to the ability of the data-driven EWS at that point in time. We found that 19% of the EIS alerts between 2009 and 2016 correctly identified an employee who went on to have a sustained, unjustified or preventable internal affairs complaint in the year following the alert. Then, using our approach, we built a data-driven prototype. The EWS ranks employees by risk score, in contrast to a binary EIS, which only flags an employee as “at risk” or “not at risk.” Between January 2014 and July 2016, 24% of the employees flagged by the EWS would have been flagged correctly. In contrast, only 17% of the alerts raised by the threshold-based EIS were correct in the same time frame. The EWS is more efficient: although it would have raised fewer alerts, its alerts are 40% more likely to go to employees who end up having an adverse incident with a member of the public.

We discussed model selection previously; in this case, we look at the models’ precision over time (Fig. 9.1). Note that a precision of 20% in the top 50 employees means that, of the 50 employees on the list, 20% were correctly classified in the high-risk category.
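Precision and recall in the top k highest-risk employees can be computed as in the sketch below. The function name and the toy scores are hypothetical; the thesis uses k = 50, while the example uses k = 5 to keep the data small.

```python
def precision_recall_at_k(scores, labels, k):
    """Rank by descending score and evaluate the k highest-scored samples.

    precision@k = positives in the top k / k
    recall@k    = positives in the top k / all positives
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in order[:k])
    total_positives = sum(labels)
    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy data (hypothetical): ten employees, three with an adverse incident.
toy_scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2, 0.3, 0.6, 0.5, 0.05]
toy_labels = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
p_at_5, r_at_5 = precision_recall_at_k(toy_scores, toy_labels, k=5)
print(p_at_5, r_at_5)
```

Here two of the five highest-scored employees are true positives, so precision in the top 5 is 40%, and those two account for two of the three positives overall.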

Figure 9.1: This plot shows precision in the top 50 highest-risk employees for the public safety agency for different models over the years.

Similarly, looking at the models’ recall over time (Fig. 9.2), we note that a model with a recall of 30% captures 30% of all the adverse incidents in the whole dataset.

Figure 9.2: This plot shows recall in the top 50 highest-risk employees for the public safety agency for different models over the years.

We also want to see that our top-performing models perform relatively similarly. This is evident in Fig. 9.3.

Based on this model performance, we pick the best model and generate risk scores for each employee for the next year. Previously, we looked at the risk score distribution in the at-risk youth example. Here, we are concerned with a risk score that separates the two classes well (Fig. 9.4). The model tends to give higher scores to employees who go on to have adverse incidents than to employees who do not. A drawback of this model is visible, though, in the overlapping region between red and blue around a score value of 0.2. This indicates a lack of separability between the two populations, which might be reduced with more granular data.

9.4 Comparison to Current Approach

Currently, the public safety agency has a manual, intensive process to handle alerts. The EIS issued alerts for 45 employees in January 2014, 8 of whom went on to have an adverse incident in the next 12 months, and missed 193 officers who went on to have an adverse incident (18% precision, 4% recall). In contrast, the EWS has slightly better performance for the same time period and is much less resource-intensive.
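The precision and recall quoted for the January 2014 EIS alerts follow directly from the three counts given above. The sketch below shows the arithmetic; the variable names are illustrative only.

```python
# Counts for the EIS alerts issued in January 2014, as stated in the text.
alerts_issued = 45      # employees flagged by the EIS
true_positives = 8      # flagged employees who had an adverse incident
missed = 193            # unflagged officers who had an adverse incident

# Precision: share of alerts that were correct.
eis_precision = true_positives / alerts_issued

# Recall: share of all officers with an adverse incident who were flagged.
eis_recall = true_positives / (true_positives + missed)

print(f"precision = {eis_precision:.0%}, recall = {eis_recall:.0%}")
```

The recall denominator is the 201 officers who went on to have an adverse incident, whether or not they were flagged.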

9.5 Future Work

In contrast to the EWS for at-risk youth, we are replacing an existing method that does a similar job. There is a ready baseline for comparison, and we can show the lift that our system is able to get for the agency. Since we use black-box models which are not easily interpretable, we cannot make causal claims about which predictors are responsible for increases or decreases in risk scores. However, since all interventions in this case are non-punitive, it makes sense to optimize prediction accuracy instead of interpretability. Some work has also been done on interpreting prediction scores and identifying important features using tools like LIME, which provides local, interpretable, model-agnostic explanations for each result [20].

Figure 9.3: This plot shows the precision-recall curves for the top 100 performing models (out of 10,000), showing that they have relatively similar performance.

The work done in this analysis adds to a body of work that is the first of its kind with respect

to this domain. A start-up is taking the technology and modeling methods that were developed and

refined through the work presented here and using it on top of software that will serve as a record

management tool for public safety agencies. The methods refined here will be integrated seamlessly into

their databases and become the industry standard for how public safety agencies think about managing

risk for their employees.

From a modeling standpoint, it would be interesting to try more advanced modeling techniques from

deep learning such as transfer learning. Transfer learning is a general term for using a model that was

trained for one task as a starting point model for another task. For example, if an agency wants to

implement an EWS but does not have enough data to build reliable predictive models, they can use one

of the base models trained from another agency as a starting point [23]. Since we have data from many

different agencies, the base model could even be trained across a combined dataset.


Figure 9.4: Separation of Risk Score by Class. Blue is employees with an adverse incident, red is employees without an adverse incident.

Bibliography

[1] Anna Aizer and Joseph J. Doyle Jr. Juvenile incarceration, human capital and future crime: Evidence from randomly-assigned judges. The Quarterly Journal of Economics, February 2015.

[2] J. Albinati, W. Meira, Jr, and G. Lobo Pappa. An Accurate Gaussian Process-Based Early Warning

System for Dengue Fever. ArXiv e-prints, August 2016.

[3] Anuradha and G. Gupta. A self explanatory review of decision tree classifiers. pages 1–7, May 2014.

[4] J. Bernburg and M. Krohn. Labeling, life chances, and adult crime: The direct and indirect effects of official intervention in adolescence on crime in early adulthood. Criminology, 41:1287–1316, 2003.

[5] Léon Bottou and Chih-Jen Lin. Support vector machine solvers, 2006.

[6] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.

[7] Dragan Gasevic, Shane Dawson, Tim Rogers, and Danijela Gasevic. Learning analytics should not promote one size fits all: The effects of instructional conditions in predicting academic success. 28:68–84, January 2016.

[8] Aurélien Géron. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.

[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.

Springer, 2001.

[10] Carter Hay, Alex O. Widdowson, Meg Bates, Michael T. Baglivio, Katherine Jackowski, and Mark A. Greenwald. Predicting recidivism among released juvenile offenders in Florida. Youth Violence and Juvenile Justice, 0(0):1541204016660161, 0.

[11] Paul J Hirschfield and Joseph Gasper. The relationship between school engagement and delinquency

in late childhood and early adolescence. Journal of Youth and Adolescence, 40(1):3–22, 2011.

[12] E. Howard, M. Meehan, and A. Parnell. Contrasting Prediction Methods for Early Warning Systems

at Undergraduate Level. ArXiv e-prints, December 2016.

[13] B. Jacob and L. Lefgren. Are idle hands the devil’s workshop? incapacitation, concentration, and

juvenile crime. American Economic Review, 93(5):1560–1577, 2003.

[14] M. Lichman. UCI machine learning repository: Abalone dataset, 2013.

[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[16] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986.

[17] R. Ramchand, A. Morral, and K. Becker. Seven-year life outcomes of adolescent offenders in Los Angeles. American Journal of Public Health, 99(5):863–870, 2003.

[18] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical

Society, Series B, 58:267–288, 1994.

[19] J. L. Tonry. An Early Warning System for Asteroid Impact. Publications of the Astronomical

Society of the Pacific, 123:58, 2011.

[20] M. Tulio Ribeiro, S. Singh, and C. Guestrin. “Why Should I Trust You?”: Explaining the Predictions

of Any Classifier. ArXiv e-prints, February 2016.

[21] Ilhan Uysal and H Altay Guvenir. Instance-based regression by partitioning feature projections.

Appl. Intell, 2004.

[22] W. E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. Proc. Sec. Survey Res. Meth., pages 354–359, 1990.

[23] J. Zhang, W. Li, and P. Ogunbona. Transfer Learning for Cross-Dataset Recognition: A Survey.

ArXiv e-prints, May 2017.

[24] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the

Royal Statistical Society, Series B, 67:301–320, 2005.
