UNIVERSITY OF CALIFORNIA
Los Angeles
Missing Data Imputation for Tree-Based Models
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Statistics
by
Yan He
2006
© Copyright by Yan He
2006
The dissertation of Yan He is approved.
Susan Sorenson
Hongquan Xu
Mark Hansen
Richard Berk, Committee Chair
University of California, Los Angeles
2006
To my parents and my husband with love and gratitude
TABLE OF CONTENTS
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Classification and Regression Trees (CART) and Extensions . . . . . . . 6
2.1 Classification and Regression Trees (CART) . . . . . . . . . . . . . . 6
2.1.1 Splitting A Tree . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Pruning A Tree . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Taking Cost into Account . . . . . . . . . . . . . . . . . . . 11
2.2 Random Forest (RF) . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 The Comparative Advantage of Random Forests . . . . . . . 15
3 Standard Theory on Missing Data . . . . . . . . . . . . . . . . . . . . . 17
3.1 Mechanisms That Lead to Missing Data . . . . . . . . . . . . . . . . 17
3.2 Treatment of Missing Data . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Listwise Deletion . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Single Imputation . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Multiple Imputations through Data Augmentation . . . . . . . 23
3.2.4 Assessment of Multiple Imputations . . . . . . . . . . . . . . 25
4 Missing Data with CART/RF . . . . . . . . . . . . . . . . . . . . . . . 28
4.1 Missing Data with CART . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Missing Data with RF . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Nonparametric Bootstrap Methods to Impute Missing Data . . . . . . . 33
5.1 The Simple Bootstrap for Complete Data . . . . . . . . . . . . . . . 33
5.2 The Simple Bootstrap Applied to Imputed Incomplete Data . . . . . . 34
5.3 The Imputation Algorithm for Tree-Based Models . . . . . . . . . . . 36
6 Empirical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.1 Data of Diabetes . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 Data of Domestic Violence . . . . . . . . . . . . . . . . . . . 40
6.1.3 Data of Dolphin . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Missing Values in the Three Data Sets . . . . . . . . . . . . . . . . . 45
6.3 Comparison for CART . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.4 Comparison for Random Forests . . . . . . . . . . . . . . . . . . . . 61
7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 71
8 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
LIST OF FIGURES
6.1 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 CART Bootstrap: DV data; cost ratio = 5:1. . . . . . 56
6.2 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 CART Bootstrap: crime data; cost ratio = 10:1. . . . 57
6.3 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 CART Bootstrap: diabetes data; cost ratio = 2:1. . . 58
6.4 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 CART Bootstrap: dolphin data; cost ratio = 10:1. . . 59
6.5 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 RF Bootstrap: DV data. . . . . . . . . . . . . . . . 65
6.6 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 RF Bootstrap: crime data. . . . . . . . . . . . . . . 67
6.7 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 RF Bootstrap: diabetes data. . . . . . . . . . . . . . 68
6.8 Empirical Distributions for False Positive Errors & False Negative Er-
rors from 2000 RF Bootstrap: dolphin data. . . . . . . . . . . . . . 69
LIST OF TABLES
6.1 CART confusion table for DV objective: N = 516 complete cases;
cost ratio = 5:1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 CART confusion table for DV objective using surrogate: N = 636;
cost ratio = 5:1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3 CART confusion table for DV objective using nonparametric boot-
strap method to impute missing values (Algorithm 2): N = 636; B =
2000; cost ratio = 5:1. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.4 Surrogate Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.5 CART confusion table for DV objective using nonparametric boot-
strap method to impute missing values (Algorithm 2): N = 636; B =
30; cost ratio = 5:1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.6 CART confusion table for crime objective: N = 516 complete cases;
cost ratio = 10:1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.7 CART confusion table for crime objective using surrogate: N =
636; cost ratio = 10:1. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.8 CART confusion table for crime objective using nonparametric boot-
strap method to impute missing values (Algorithm 2): N = 636; B =
30; cost ratio = 10:1. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.9 CART confusion table for diabetes objective using full data set: N =
768; cost ratio = 2:1. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.10 CART confusion table for diabetes objective using surrogate: N =
768, cost ratio = 2:1. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.11 CART confusion table for diabetes objective using nonparametric
method to impute missing values (Algorithm 2): N = 768; B = 30; cost
ratio = 2:1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.12 CART confusion table for dolphin objective using full data set: N =
1000; cost ratio = 10:1. . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.13 CART confusion table for dolphin objective using surrogate: N =
1000; cost ratio = 10:1. . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.14 CART confusion table for dolphin objective using nonparametric
bootstrap method to impute missing values (Algorithm 2): N = 1000;
B = 30; cost ratio = 10:1. . . . . . . . . . . . . . . . . . . . . . . . . 54
6.15 Prediction Errors & 95% Confidence Intervals for False Positives and
False Negatives Using 2000 Bootstrap Samples: CART Model . . . . 61
6.16 RF confusion table for DV objective: N = 516 complete cases. . . . 61
6.17 RF confusion table for DV objective using rfImpute: N = 671. . . 62
6.18 RF confusion table for DV objective using nonparametric bootstrap
method to impute missing values (Algorithm 2): N = 671, B = 30. . . 63
6.19 RF confusion table for crime objective: N = 516 complete cases. . . 63
6.20 RF confusion table for crime objective using rfImpute: N = 671. . 63
6.21 RF confusion table for crime objective using nonparametric boot-
strap method to impute missing values (Algorithm 2): N = 671; B =
30. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.22 RF confusion table for diabetes objective using full data set: N = 768. 64
6.23 RF confusion table for diabetes objective using rfImpute: N = 768. 64
6.24 RF confusion table for diabetes objective using nonparametric boot-
strap method to impute missing values (Algorithm 2): N = 768; B =
30. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.25 RF confusion table for dolphin objective using full data set: N = 1000. 64
6.26 RF confusion table for dolphin objective using rfImpute: N = 1000. 66
6.27 RF confusion table for dolphin objective using nonparametric boot-
strap method to impute missing values (Algorithm 2): N = 1000; B =
30. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.28 Prediction Errors & 95% Confidence Intervals for Misclassification
Errors Using 2000 Bootstrap Samples: RF . . . . . . . . . . . . . . . 70
7.1 Prediction Errors for RF Model by Applying Different Imputation Meth-
ods to Test Sets with Missing Data. . . . . . . . . . . . . . . . . . . . 74
7.2 Prediction Errors for RF Model by Applying Different Imputation Meth-
ods to Test Sets with No Missing Data: I (Deleting 20% of Cases from
Learning Sample). . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3 Prediction Errors for RF Model by Applying Different Imputation Meth-
ods to Test Sets with No Missing Data: II (Deleting 50% of Cases from
Learning Sample). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
ACKNOWLEDGMENTS
Here first and foremost I would like to express my deepest gratitude to my advisor
and committee chair Professor Berk. His guidance, support, and kindness made this
work possible. He has my most sincere and hearty appreciation for giving me the
freedom to pursue the problems I chose in the way I liked.
I am also thankful to the other members of my committee, Professors Hansen, Xu
and Sorenson, for their suggestions and comments and the time they spent in reviewing
this dissertation.
I wish to express my appreciation to Professor Lin, a senior western-educated pro-
fessor, who opened this amazing field to me when I was still a college student. It is
under his mentorship that I became interested in studying and doing statistical analysis
to solve real world problems. During my four years in college, Professor Lin offered
me numerous opportunities to learn about cutting-edge research, and he also invited me
to actively participate in many national projects, which built my strong background in
mathematical analysis. It was his reference that made my application to the statistics
program at UCLA a lot easier.
I would also like to thank Professor Sun, a super nice professor, whose pecuniary
support made my joining UCLA a reality.
A debt of gratitude is owed to Professors Wu, Ferguson and Jan de Leeuw, who
were always so nice and so patient to answer my questions. I also appreciate kindness
from Mrs. Dean Dacumos, who made the Statistics Department a united community,
and our life enjoyable.
To my parents and my husband, for their constant support and understanding.
VITA
1975.10.24 Born, Nantong, P. R. China.
1994–1998 B.A. in Economics & B.A. in Economic Law, Huazhong University
of Science and Technology, Wuhan, P. R. China. With High Honors.
2000–2001 M.A. in Economics, Department of Economics, The Ohio State
University. Awarded University Fellowship.
2001–2003 M.S. in Statistics, Department of Statistics, UCLA.
2003.06–09 Fair Isaac Corporation, Internship.
2005–present Countrywide Home Loans, CA.
PUBLICATIONS
Yan He: Problems and Suggestions for Improving the Exchange Sterilization Op-
eration of China's Central Bank. The Study of Finance and Economics, April 2000,
Vol. 26, No. 4.
ShaoGong Lin, Qiming Tang, ZhiHong Fan and Yan He: Translated the book Econo-
metric Methods by Jack Johnston & John DiNardo (UCI) (4th edition) into Chinese.
Published by China Economics Publishing House (ISBN 7-5017-5063-7), 2002.
Juana Sanchez and Yan He: Examples of the Application of Statistics and Probability
to Computer Science. Presented at the Joint AMS-MAA (American Mathematical So-
ciety - Mathematical Association of America) Annual Meeting. January 7-10, 2004,
Phoenix, AZ.
Richard Berk, Yan He and Susan Sorenson: Developing a Practical Forecasting Screener
for Domestic Violence Incidents. Evaluation Review, 29(4): 358-382, August 2005.
Juana Sanchez and Yan He: Internet Data Analysis for the Undergraduate Statistics
Curriculum. Journal of Statistics Education, Volume 13(3), 2005.
ABSTRACT OF THE DISSERTATION
Missing Data Imputation for Tree-Based Models
by
Yan He
Doctor of Philosophy in Statistics
University of California, Los Angeles, 2006
Professor Richard Berk, Chair
A wide variety of data can include some form of censoring or missing information.
Missing data are a problem for all statistical analyses; tree-based models, such as
CART and Random Forests, are certainly no exception.
In recent years, there have been many newly developed tools that can be applied
to missing data problems: likelihood and estimating function methodology, cross-
validation, the bootstrap and other simulation techniques, Bayesian and multiple im-
putations, and the EM algorithm. Although applied successfully to well-defined para-
metric models, such methods may be inappropriate for tree-based models, which are
usually considered non-parametric models. CART/RF have built-in algorithms to
impute missing data, such as surrogate variables or proximity. But these imputation
methods have no formal rationale, and are unstable, especially for RF models.
The nonparametric bootstrap method for imputing missing values overcomes the
drawbacks implicit in both single and multiple imputation. It 1) does not depend
on the missing-data mechanism, 2) requires no knowledge of either the probability
distributions or the model structure, and 3) successfully incorporates the estimates
of uncertainty associated with the imputed data. Furthermore, 2000 replications of
bootstrap samples provide stable and accurate statistical inferences (Efron, 1994).
In my dissertation research, nonparametric bootstrap methods were implemented
to impute missing values before cases were dropped down the tree (CART/RF),
and the classification results were compared both to the complete-data/full-data
analyses and to the classification results using surrogate variables/proximity.
Significant improvements in the ability to predict were found for both CART and RF models.
CHAPTER 1
Introduction
A wide variety of data can include some form of censoring or missing informa-
tion. Data imputation can then be an important component of the analysis, but crude
methods for data imputation can lead to substantial bias in the results. For example, a
complete-case analysis simply ignores the missing data and risks substantial bias.
In recent years, there have been many new computationally intensive tools devel-
oped that can be applied to missing data problems: likelihood and estimating function
methodology, cross-validation, the bootstrap and other simulation techniques, Bayes
and multiple imputations, and the EM algorithm. Existing methods have been suc-
cessfully applied with well-defined parametric models, such as Gaussian regression,
and loglinear models. But their usefulness has yet to be demonstrated for tree-based
models, such as Classification and Regression Trees (CART) and random forests (RF),
which are usually considered non-parametric methods. It is this oversight that I will
attempt to remedy, in part, in the pages ahead.
More specifically, parametric models, such as linear regression, can provide useful
descriptions of simple structures in data. However, sometimes such simple structure
does not extend across an entire data set and may instead be confined more locally
within subsets of the data. Then, the structure might be better described by a model
that partitions the data into subsets, employing separate submodels for each. Such an
alternative can be accomplished by using a tree-based approach, known as CART
(Classification and Regression Trees).
Given a data set, a common strategy for finding a good tree is to use a greedy
algorithm to grow a tree and then to prune it back to avoid overfitting. Such greedy
algorithms typically grow a tree by sequentially choosing splitting rules for nodes on
the basis of maximizing some fitting criterion. This generates a sequence of trees,
each of which is an extension of previous trees. A single tree is then selected by prun-
ing the largest tree according to a model selection criterion such as cost-complexity
pruning (Breiman et al., 1984), cross-validation, or even multiple tests of whether two
adjoining nodes should be collapsed into a single node.
The overfitting problem in CART motivated people to develop bundling methods
such as bagging and random forests. Bagging predictors is a method for generating
multiple versions of a predictor and using these to get an aggregated result. In the case
of CART, the aggregation averages over the trees when predicting a numerical outcome
and does a plurality vote when predicting a class. The multiple versions are formed
by making bootstrap replicates of the learning set and using these as new learning data
sets. Tests on real and simulated data sets using classification and regression trees and
subset selection in linear regression have shown that bagging can allow for substantial
gains in accuracy (Breiman, 1996). The vital element is the instability of the prediction
method. If perturbing the learning set can cause significant changes in the predictor
constructed, then bagging can improve accuracy.
Random forests (RF) is a further extension of bagging. A Random forest model is
a combination of tree predictors such that each tree depends on the values of a random
vector sampled independently and with the same distribution for all trees in the forest.
The generalization error for RF converges almost surely to a limit as the number of
trees in the forest becomes large (Breiman, 2001). Using a random selection of features
to split each node yields error rates that compare favorably to AdaBoost (Freund and
Schapire, 1996), and are more robust with respect to noise.
Missing data can be a problem for all statistical analyses. CART/bagging/RF are
certainly no exception. Missing data can create the same kinds of difficulties they
create for conventional linear regression: there is the loss of statistical power with the
reduction in sample size, and a real possibility of bias if the observations are not lost at
random.
A general discussion and excellent treatment of missing data are easily found (Lit-
tle and Rubin, 2002). If the data are really missing completely at random (MCAR),
the only loss is statistical power. And if the number of cases lost is not large, the reduc-
tion in power is likely to be insignificant. It is, therefore, mandatory that the researcher
make a convincing argument that the data are missing completely at random. The re-
sults are then dependent upon the missing completely at random assumption, and may
be of little statistical interest unless the credibility of that assumption is determined.
A less strict assumption is that the data are missing at random (MAR). One
can subset the data based on the values of observed variables so that, for each such
subset, the data are missing completely at random. If this assumption is correct, the
analysis can be conducted separately for each of the subsets and then reassembled.
But again, the assumed mechanism by which the data are missing must be argued
convincingly.
If either of these assumptions can be justified, it will be useful to impute the values
of the missing data. Imputing missing values for the response variable is usually not
sensible because the relationship between the response and the predictors can be sys-
tematically altered. But sometimes it can be very helpful to impute missing data for
predictors.
The key problem with any imputation procedure is that when the data are ultimately
analyzed, including the real data and the imputed data, the statistical procedures ap-
plied cannot tell which is which and necessarily treat all of the observations alike. The
imputed values are estimates, and estimates usually come with random error. In addi-
tion, the imputed values, which are just fitted values, will have less variability than the
original value itself (Berk, 2005). In short, the imputed values will typically be less
variable than the real thing. The reduced variability can seriously undermine statistical
inference.
It is well known that CART/RF have built-in algorithms to impute missing data,
such as using surrogate variables or proximities. But these imputation methods have
no formal rationale. Furthermore, since CART/RF are nonparametric rather than
parametric models, advanced multiple imputation (MI) methods may not apply at all. In
short, tools for imputing missing data are likely to be inadequate.
This thesis will address nonparametric approaches to assessing the accuracy of
an estimator in a missing data situation. Three main topics are discussed: bootstrap
methods for missing data, their relationship to the theory of multiple imputation, and
their comparison to the surrogate variables/proximity method. Two main advantages (Efron,
1994) of nonparametric bootstrap imputation are: 1) it requires no knowledge of the
missing-data mechanism other than that it is missing at random or conditionally at
random; 2) the confidence interval turns out to give convenient and accurate answers.
The thesis is structured as follows: Chapter 1 introduces basic concepts about tree-
based models and missing data problem, and motivates this thesis. Chapter 2 intro-
duces Classification and Regression Trees (CART), as well as random forests (RF).
Standard theories of missing data and imputation methods are elaborated in Chapter 3,
which also illustrates the limitation of applying multiple imputation (MI) to tree-based
models. Chapter 4 explains how CART and RF deal with missing data, and their po-
tential limitations. Chapter 5 formally introduces nonparametric bootstrap methods to
impute missing data, and proposes the corresponding algorithms in detail. Chapter 6 is
an empirical study, which applies several imputation methods to various data sets. Here,
the classification errors from 2000 bootstrapped imputations are compared to the
surrogate method for CART models, and to the proximity method for RF models.
Significant improvement can be found by using
nonparametric bootstrap methods. Chapter 7 discusses the effectiveness of the non-
parametric bootstrap methods in their ability to classify, as well as possible limitations.
Further improvement to the algorithm is also suggested.
CHAPTER 2
Classification and Regression Trees (CART) and
Extensions
2.1 Classification and Regression Trees (CART)
We begin with a discussion of the general structure of a CART model. A CART
model describes the conditional distribution of $y$ given $X$, where $y$ is the response
variable and $X$ is a set of predictors ($X = (X_1, X_2, \ldots, X_p)$). This model has two
main components: a tree $T$ with $b$ terminal nodes, and a parameter
$\Theta = (\theta_1, \theta_2, \ldots, \theta_b) \subset R^k$ which associates the parameter value $\theta_m$ with the
$m$th terminal node. Thus a treed model is fully specified by the pair $(T, \Theta)$. If $X$
lies in the region corresponding to the $m$th terminal node, then $y|X$ has the distribution
$f(y|\theta_m)$, where we use $f$ to represent a conditional distribution indexed by $\theta_m$. The
model is called a regression tree or a classification tree according to whether the
response $y$ is quantitative or qualitative, respectively.
2.1.1 Splitting A Tree
The binary tree $T$ subdivides the predictor space as follows. Each internal node
has an associated splitting rule which uses a predictor to assign observations to either
its left or right child node. The observations in an internal node are thus partitioned
into two subsequent nodes using the splitting rule. For a quantitative predictor, the
splitting rule is based on a split point $s$, and assigns observations for which
$\{x_i \le s\}$ or $\{x_i > s\}$ to the left or right child node, respectively. For a qualitative
predictor, the splitting rule is based on a category subset $C$, and assigns observations
for which $\{x_i \in C\}$ or $\{x_i \notin C\}$ to the left or right child node, respectively.
For a regression tree, the conventional algorithm models the response in each region
$R_m$ as a constant $c_m$. Thus the overall tree model can be expressed as (Hastie, Tibshi-
rani and Friedman, 2001):

$$f(x) = \sum_{m=1}^{b} c_m I(X \in R_m) \quad (2.1)$$

where the $R_m$, $m = 1, 2, \ldots, b$, constitute a partition of the predictor space, and therefore
represent the space of $b$ terminal nodes. If we adopt the method of minimizing the
sum of squares $\sum_i (y_i - f(X_i))^2$ as our criterion to characterize the best split, it is easy
to see that the best $c_m$ is just the average of the $y_i$ in region $R_m$:

$$\hat{c}_m = \mathrm{ave}(y_i \mid X_i \in R_m) = \frac{1}{N_m} \sum_{X_i \in R_m} y_i \quad (2.2)$$

where $N_m$ is the number of observations falling in node $m$. The residual sum of
squares is then

$$Q_m(T) = \frac{1}{N_m} \sum_{X_i \in R_m} (y_i - \hat{c}_m)^2 \quad (2.3)$$

which will serve as an impurity measure for regression trees.

If the response is a factor taking outcomes $1, 2, \ldots, K$, the impurity measure $Q_m(T)$
defined in (2.3) is not suitable. Instead, we represent a region $R_m$ with $N_m$ observations by

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{X_i \in R_m} I(y_i = k) \quad (2.4)$$

which is the proportion of class $k$ ($k \in \{1, 2, \ldots, K\}$) observations in node $m$. We
classify the observations in node $m$ to the class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class
in node $m$. Different measures $Q_m(T)$ of node impurity include the following (Hastie,
Tibshirani and Friedman, 2001):

Misclassification error: $\frac{1}{N_m}\sum_{i \in R_m} I(y_i \ne k(m)) = 1 - \hat{p}_{mk(m)}$

Gini index: $\sum_{k \ne k'} \hat{p}_{mk}\hat{p}_{mk'} = \sum_{k=1}^{K}\hat{p}_{mk}(1 - \hat{p}_{mk})$

Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk}\log \hat{p}_{mk}$   (2.5)

For binary outcomes, if $p$ is the proportion of the second class, these three measures
are $1 - \max(p, 1-p)$, $2p(1-p)$, and $-p\log p - (1-p)\log(1-p)$, respectively.
All three definitions of impurity are concave, having minima at $p = 0$ and $p = 1$
and a maximum at $p = 0.5$. Entropy and the Gini index are the most common, and
generally give very similar results except when there are more than two response categories
(Berk, 2005).
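
To make the comparison concrete, the following R sketch (self-contained; the function names are our own) evaluates the three binary impurity measures over a grid of values of $p$ and plots them:

    # Three impurity measures for a binary node, as functions of p,
    # the proportion of cases in the second class (see (2.5))
    misclass <- function(p) 1 - pmax(p, 1 - p)
    gini     <- function(p) 2 * p * (1 - p)
    entropy  <- function(p) ifelse(p == 0 | p == 1, 0,
                                   -p * log(p) - (1 - p) * log(1 - p))

    p <- seq(0, 1, by = 0.01)
    matplot(p, cbind(misclass(p), gini(p), entropy(p)), type = "l",
            lty = 1:3, xlab = "p", ylab = "impurity")
    # All three curves are concave, vanish at p = 0 and p = 1,
    # and peak at p = 0.5, as noted above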
2.1.2 Pruning A Tree
To be consistent with conventional notation, let us define the impurity of a node $\tau$
as $I(\tau)$ ((2.3) for a regression tree, or any one of (2.5) for a classification tree). We
then choose the split with maximal impurity reduction

$$\Delta I = I(\tau) - p(\tau_L)I(\tau_L) - p(\tau_R)I(\tau_R) \quad (2.6)$$

where $\tau_L$ and $\tau_R$ are the left and right child nodes of $\tau$.
How large should we grow the tree then? Clearly a very large tree might overfit the
data, while a small tree may not be able to capture the important structure. Tree size is
a tuning parameter governing the model's complexity, and the optimal tree size should
be adaptively chosen from the data. One approach would be to continue the splitting
procedure only when the decrease in impurity due to a split exceeds some threshold. This
strategy is too short-sighted, however, since a seemingly worthless split might lead to a
very good split below it.
The preferred strategy is to grow a large tree T0, stopping the splitting process
when some minimum number of observations in a terminal node (say 10) is reached.
Then this large tree is pruned using cost-complexity pruning.
We define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$,
that is, by collapsing any number of its internal nodes, and define $\tilde{T}$ to be the set of
terminal nodes of $T$. As before, we index terminal nodes by $m$, with node $m$ representing
region $R_m$. Let $|\tilde{T}|$ denote the number of terminal nodes in $T$ ($|\tilde{T}| = b$); we use $|\tilde{T}|$
instead of $b$ in this section, following the conventional notation, and define the risk of
trees as

Regression tree: $R(T) = \sum_{m=1}^{|\tilde{T}|} N_m Q_m(T)$

Classification tree: $R(T) = \sum_{\tau \in \tilde{T}} P(\tau) r(\tau)$   (2.7)

where $r(\tau)$ measures the impurity of node $\tau$ in a classification tree (and can be any one
of the measures in (2.5)).

We define the cost-complexity criterion (Breiman et al., 1984)

$$R_\alpha(T) = R(T) + \alpha|\tilde{T}| \quad (2.8)$$

where $\alpha$ ($\alpha > 0$) is the complexity parameter. The idea is, for each $\alpha$, to find the
subtree $T_\alpha \subset T_0$ that minimizes $R_\alpha(T)$. The tuning parameter $\alpha \ge 0$ governs
the tradeoff between tree size and goodness of fit to the data (Hastie, Tibshirani and
Friedman, 2001). Large values of $\alpha$ result in smaller trees $T_\alpha$, and conversely for
smaller values of $\alpha$. As the notation suggests, with $\alpha = 0$ the solution is the full
tree $T_0$.
To find $T_\alpha$ we use weakest-link pruning: we successively collapse the internal node
that produces the smallest per-node increase in $R(T)$, and continue until we produce
the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can
show this sequence must contain $T_\alpha$. See Breiman et al. (1984) and Ripley (1996) for
details. Estimation of $\alpha$ ($\hat{\alpha}$) is achieved by five- or ten-fold cross-validation. Our final
tree is then denoted $T_{\hat{\alpha}}$.
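
As a sketch of how this grow-then-prune recipe looks in R's rpart implementation of CART (the data frame dat and response y are hypothetical names): rpart's cp argument plays the role of $\alpha$ (rescaled by the root-node risk), and the cross-validated error stored in the cptable is used to select $\hat{\alpha}$.

    library(rpart)

    # Grow a deliberately large tree: cp = 0 disables cost-complexity
    # stopping; minbucket = 10 sets the minimum terminal-node size
    big <- rpart(y ~ ., data = dat,
                 control = rpart.control(cp = 0, minbucket = 10, xval = 10))

    # Pick the complexity value that minimizes the 10-fold
    # cross-validated error, then prune the weakest links back
    best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
    final   <- prune(big, cp = best_cp)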
It follows that, in CART and related algorithms, classification and regression trees
are produced from data in two stages. In the first stage, a large initial tree is produced
by splitting one node at a time in an iterative, greedy fashion. In the second stage,
a small subtree of the initial tree is selected, using the same data set. Whereas the
splitting procedure proceeds in a top-down fashion, the second stage, known as prun-
ing, proceeds from the bottom-up by successively removing nodes from the initial tree.
Theorem 2.1 (Breiman et al., 1984, Section 3.3) For any value of the complexity pa-
rameter $\alpha$, there is a unique smallest subtree of $T_0$ that minimizes the cost-complexity.

Theorem 2.2 (Zhang and Singer, 1999, Section 4.2) If $\alpha_2 > \alpha_1$, the optimal sub-
tree corresponding to $\alpha_2$ is a subtree of the optimal subtree corresponding to $\alpha_1$.

More generally, suppose we end up with $m$ thresholds,

$$0 < \alpha_1 < \alpha_2 < \cdots < \alpha_m$$

and let $\alpha_0 = 0$. Also, let the corresponding optimal subtrees be
$\{T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_m}\}$; then

$$T_{\alpha_0} \supset T_{\alpha_1} \supset T_{\alpha_2} \supset \cdots \supset T_{\alpha_m} \quad (2.9)$$

where $T_{\alpha_0} \supset T_{\alpha_1}$ means that $T_{\alpha_1}$ is a subtree of $T_{\alpha_0}$. These are the so-called nested optimal
subtrees.
2.1.3 Taking Cost into Account
We talk about classification trees in this section. In many applications, tree-based
methods are used for the purpose of prediction. That is, given the characteristics of a
subject, we must predict the outcome of this subject before we know the outcome. For
example, physicians in emergency rooms must predict whether a patient with chest
pain suffers from a serious disease based on the information available within a few
hours of admission. For this purpose, we must first classify a node to either class 0
(normal) or 1 (abnormal); then we predict the outcome of an individual based on the
membership of the node to which the individual belongs. Unfortunately, we always
make mistakes in such a classification procedure, because some of the normal subjects
will be predicted as diseased and vice versa. These two mistakes are called false-
positive (predicting a normal condition as abnormal) and false-negative (predicting an
ill-conditioned outcome as normal), respectively. In any case, to weigh these mistakes,
we need to assign misclassification costs.
Let c(i, j) denote the misclassification cost that a class j subject is classified as a
class i subject. When i = j, we have the correct classification and the cost should
naturally be zero, i.e., c(i, i) = 0. If the outcome is binary, i and j take the values 0 or
1. Without loss of generality, we can set c(1, 0) = 1. In other words, one false positive
error counts as one. The clinicians and the statisticians need to work together to gauge
the relative cost ofc(0, 1). This is a subjective and difficult, but important, decision.
In the Domestic Violence (DV) analysis, 671 households reported DV incidents
during the study period; among them, about 21% of the households reported a new
call within the 3-month follow-up period. In this instance, the two errors are: 1) false
negative: failing to predict a new DV incident for a household where one actually
occurred, and 2) false positive: predicting a new DV incident for a household where
none occurred. Thus, a predictor that produced few false positives but many false negatives might
be discarded if the undesirable consequences from the false negatives were larger than
the undesirable consequences from the false positives. Therefore, we needed informa-
tion from the Los Angeles Sheriff's Department on the relative consequences of false
positives and false negatives.
Information from the Los Angeles Sheriff's Department led to a general conclusion
that false negatives were substantially more problematic than false positives. In other
words, they considered not responding to a call when there actually was a need for
law enforcement assistance more costly than responding to a call that turned out to
be a false alarm. But the precise figures for these costs could not be determined.
Fortunately, all we needed for statistical analysis was the ratio of false negative costs
to false positive costs. We then proceeded with a reasonable ratio of the costs of false
negatives to the costs of false positives of 5 to 1. Consistent with the information
provided by the Sheriff's Department, the failure to forecast a new call for service was
5 times more costly than incorrectly forecasting a new call for service.
We can now better understand the role of costs using the obtained 21% return-call figure in the DV data. If for every household (671 households), we predicted another call
within three months, we would be correct about 21% of the time. And, we would
also be wrong about 79% of the time. Conversely, if for every household, we always
predicted no calls within three months, we would be correct about 79% of the time.
And we would also be wrong about 21% of the time. Which is a better strategy: always
predicting a future call or not? The answer depends on the costs of false negatives
compared to the costs of false positives.
If both were equally costly, the best strategy would clearly be to never predict
a subsequent call. But since the failure to predict future calls was very costly (false
negatives were 5 times more costly than false positives), the best strategy would clearly
be to predict a subsequent call. In short, the relative costs of false negatives compared
to the relative costs of false positives can affect how forecasting is done (Berk, 2005).
And, it also affects which predictors are likely to be important. Hence, in subsequent
analysis, we take costs into account.
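
As a sketch of how such a cost ratio can be supplied to a CART fit in R's rpart (the data frame dv and response newcall are hypothetical names), a loss matrix is passed with rows indexed by the true class and columns by the predicted class:

    library(rpart)

    # Classes ordered c("no call", "new call"); loss[2, 1] = 5 encodes
    # the 5:1 rule: a false negative (a true "new call" predicted as
    # "no call") costs five times a false positive (loss[1, 2] = 1)
    loss <- matrix(c(0, 5,
                     1, 0), nrow = 2)   # filled column-wise
    fit <- rpart(newcall ~ ., data = dv, method = "class",
                 parms = list(loss = loss))

With equal costs, the fit would rarely predict the 21% class; the 5:1 loss matrix shifts the splits and node classes toward predicting a subsequent call.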
2.2 Random Forest (RF)
Significant improvement in classification accuracy can be obtained by growing an
ensemble of trees and letting them vote for the most popular class (namely, majority
vote). An early example is bagging (Breiman, 1996), where to grow each tree a random
sample is selected from the training set. Bagging stands for Bootstrap Aggregation and
may be best understood as nothing more than an algorithm.
The bagging algorithm for a data set having n observations and a binary response
variable can be summarized in the following steps:
1. Take a random sample of size n with replacement from the data.
2. Construct a classification tree as usual but do not prune.
3. Assign a class to each terminal node as in CART. Drop the out-of-bag data down
the tree, and store the class attached to each case.
4. Repeat steps 1-3 a large number of times (say, 1000).
5. For each observation in the data, count the number of times over trees that it is
classified in one category and the number of times over trees it is classified in
the other category.
6. Assign each observation to a final category by a majority vote over the set of
trees. Thus, if 51% of the time over a large number of trees a given observation
is classified as a 1, that becomes its final classification.
7. Construct the confusion table from these class assignments.
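
The steps above translate almost line for line into R. The following sketch (a simplified illustration, not production code; it assumes B is large enough that every case is out-of-bag at least once) grows unpruned rpart trees on bootstrap samples and records only the out-of-bag votes:

    library(rpart)

    bagged_classify <- function(formula, data, B = 1000) {
      n <- nrow(data)
      votes <- matrix(NA_character_, n, B)
      for (b in 1:B) {
        boot <- sample(n, replace = TRUE)                # step 1
        tree <- rpart(formula, data = data[boot, ],
                      control = rpart.control(cp = 0))   # step 2: no pruning
        oob  <- setdiff(seq_len(n), boot)                # step 3: out-of-bag
        votes[oob, b] <- as.character(
          predict(tree, newdata = data[oob, ], type = "class"))
      }
      # steps 5-6: majority vote over trees, per observation
      apply(votes, 1, function(v) names(which.max(table(v))))
    }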
2.2.1 The Algorithm
Random forests extend the ideas of bagging by allowing random selection of
both observations and predictors at each splitting step. Here, a large number
(say, 1000) of classification trees are constructed, each based on a bootstrap sample of
the data. In addition, at each split a random subset of predictors is selected. For each
tree constructed, data not used to grow the tree are dropped down to evaluate how well
the tree performs. Finally, overall results are produced by majority vote over the trees.
For example, if there are fifty predictors, choose seven candidates at random (it is
recommended to use the square root of the number of predictors) for defining the split.
Then choose the best split, as usual, by selecting only from the seven randomly chosen
predictors. Repeat this process for each node. Therefore, the random forests algorithm
is very much like the bagging algorithm. Again let n be the number of observations
and assume for now that the response variable is binary.
1. Take a random sample of size n with replacement from the data.
2. Take a random sample of the predictors without replacement.
3. Construct the first CART partition of the data using selected predictors.
4. Repeat step 2 for each subsequent split until the tree is as large as desired and
do not prune.
5. Drop the out-of-bag data down the tree, and store the class assigned to each
observation.
6. Repeat steps 1-5 a large number of times (e.g., 1000).
7. Using the observations not used to build the tree for evaluation, count the number
of times over trees that a given observation is classified in one category and the
number of times over trees it is classified in the other category.
8. Assign each case to a category by a majority vote over the set of trees. Thus, if
51% of the time over a large number of trees a given case is classified as a 1,
that becomes its estimated classification.
9. Construct the confusion table for these assigned classes.
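
In R, the randomForest package implements this algorithm directly. A minimal sketch, with a hypothetical data frame dat and factor response y:

    library(randomForest)

    # ntree trees, each grown on a bootstrap sample; at every split,
    # mtry predictors are sampled (the default for classification is
    # floor(sqrt(p)), the square-root rule of thumb quoted above)
    fit <- randomForest(y ~ ., data = dat, ntree = 1000)

    fit$confusion        # out-of-bag confusion table (steps 7-9)
    fit$err.rate[1000, ] # out-of-bag error after all 1000 trees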
2.2.2 The Comparative Advantage of Random Forests
It is apparent that random forests are more than bagging. By working with a random
sample of predictors at each possible split, the fitted values across trees are more
independent (Berk, 2005). As a result, the gains from averaging over a large number
of trees can be larger. A related benefit is that it is possible to work with a very
large number of predictors, and even more predictors than observations. It is well
known that in conventional regression modeling, all of the data mining procedures
considered so far have required that the number of predictors be less than the number
of observations (usually much less). An obvious gain is that more information can be
utilized in the fitting process, and more predictors can contribute.
The use of multiple trees (often as many as 1000) makes the random forests fitting
function much more complicated than the CART fitting function. However, the data
not included in each bootstrap sample are used to evaluate the model's performance,
and the averaging over trees directly compensates for the overfitting problem to which
CART is vulnerable (Berk, 2005). Therefore, the random forest results can be treated
as true forecasts.
Some other features of RF are:
(i) It is an excellent classifier, comparable in accuracy to many other classifiers.
(ii) It generates an internal unbiased estimate of the generalization error as the forest
building progresses.
(iii) It has an effective method for estimating missing data.
(iv) It has a method for balancing error in data sets with unbalanced class populations.
(v) Generated forests can be saved for future use on other data.
(vi) It gives estimates of which variables are important in the classification.
(vii) Output is generated that gives information about the relation between the vari-
ables and the classification.
(viii) It computes proximities between pairs of cases that can be used in clustering,
locating outliers, or, by scaling, to give interesting views of the data.
(ix) The capabilities of (viii) above can be extended to unlabeled data, leading to
unsupervised clustering, data views, and outlier detection. The missing value
replacement algorithm can also be extended to unlabeled data.
CHAPTER 3
Standard Theory on Missing Data
3.1 Mechanisms That Lead to Missing Data
Missing data are a problem for all statistical analyses. Missing data mechanisms
are crucial since the properties of missing-data methods depend very strongly on the
nature of these mechanisms. The crucial role of the mechanism in the analysis of data
with missing values was largely ignored until the concept was formalized in the theory
of Rubin (1976), through the simple device of treating the missing-data indicators as
random variables and assigning them a distribution.
Define the full data $Y = (y_{ij})$ and the missing-data indicator matrix $M = (M_{ij})$,
with $M_{ij}$ indicating whether the corresponding $y_{ij}$ is missing or not. The missing-
data mechanism is characterized by the conditional distribution of $M$ given $Y$, say
$f(M|Y, \phi)$, where $\phi$ denotes unknown parameters. If missingness does not depend on
the values of the data $Y$, missing or observed, that is, if

$$f(M|Y, \phi) = f(M|\phi) \quad \text{for all } Y, \quad (3.1)$$

the data are called missing completely at random (MCAR).

Let $Y_{obs}$ and $Y_{mis}$ denote the observed and missing components of $Y$, respectively.
An assumption less restrictive than MCAR is that missingness depends only on the
observed components of $Y$ ($Y_{obs}$), and not on the components that are missing ($Y_{mis}$).
That is,

$$f(M|Y, \phi) = f(M|Y_{obs}, \phi) \quad \text{for all } Y_{mis}. \quad (3.2)$$

This missing-data mechanism is then called missing at random (MAR). The third
mechanism is called not missing at random (NMAR) if the distribution of $M$ depends
on the missing values in the data matrix $Y$:

$$f(M|Y, \phi) = f(M|Y_{obs}, Y_{mis}, \phi) \quad \text{for all } \phi. \quad (3.3)$$

Some literature also calls this nonignorable missing data.
The simplest data structure is a univariate random sample for which some units are
missing. Let $Y = (y_1, \ldots, y_n)^T$, where $y_i$ denotes the value of a random variable for
unit $i$, and let $M = (M_1, \ldots, M_n)$, where $M_i = 0$ if unit $i$ is observed and $M_i = 1$ if
unit $i$ is missing. Suppose the joint distribution of $(y_i, M_i)$ is independent across units,
so in particular the probability that a unit is observed does not depend on the values of
$Y$ or $M$ for other units. Then (Little and Rubin, 2002),

$$f(Y, M|\theta, \phi) = f(Y|\theta)f(M|Y, \phi) = \prod_{i=1}^{n} f(y_i|\theta) \prod_{i=1}^{n} f(M_i|y_i, \phi) \quad (3.4)$$

where $f(y_i|\theta)$ denotes the density of $y_i$ indexed by unknown parameters $\theta$, and $f(M_i|y_i, \phi)$
is usually the density of a Bernoulli distribution for the binary indicator $M_i$ with prob-
ability $\Pr(M_i = 1|y_i, \phi)$ that $y_i$ is missing.

If missingness is independent of $Y$, that is, if $\Pr(M_i = 1|y_i, \phi) = \phi$, a constant
that does not depend on $y_i$, then the missing-data mechanism is MCAR (or, in this
case, equivalently MAR). If the mechanism depends on $y_i$, the mechanism is NMAR,
since it depends on $y_i$ that are missing, assuming that there are some. NMAR is the
most general situation, and valid statistical inferences generally require specifying the
correct model for the missing-data mechanism, a distributional assumption for the missing
$y_i$, or both. The resulting estimators and tests are typically very sensitive to these
assumptions.
Let $r$ denote the number of responding units (i.e., units with $M_i = 0$). An obvious
consequence of the missing values in this example is that the sample size is reduced
from $n$ to $r$ (Little and Rubin, 2002). We might want to carry out the same analyses
on the reduced sample as we intended for the size-$n$ sample. For example, if we assume
the values are normally distributed and wish to make inferences about the mean, we might
estimate the mean by the sample mean of the $r$ responding units, with standard error
$s/\sqrt{r}$, where $s$ is the sample standard deviation of the responding units. This strat-
egy is valid if the mechanism is MCAR or MAR, since then the observed cases are
a random subsample of all the cases (Little and Rubin, 2002). However, if the data
are NMAR, the analysis based on the responding subsample is generally biased for the
parameters of the distribution of $Y$.
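
A small simulation makes the bias point concrete (a sketch; the particular NMAR mechanism, under which larger values of $y_i$ are more likely to be missing, is an arbitrary choice):

    set.seed(1)
    n <- 10000
    y <- rnorm(n)                          # true mean is 0

    m_mcar <- rbinom(n, 1, 0.3)            # MCAR: Pr(M = 1) is constant
    m_nmar <- rbinom(n, 1, plogis(2 * y))  # NMAR: missingness increases with y

    mean(y[m_mcar == 0])  # near 0: complete cases are a random subsample
    mean(y[m_nmar == 0])  # well below 0: the complete-case mean is biased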
3.2 Treatment of Missing Data
There are three approaches to dealing with missing data:
1. Impute the missing data: that is, filling in the missing values.
2. Model the probability of missingness: this is a good option if imputation is in-
feasible; in certain cases it can account for much of the bias that would otherwise
occur.
3. Ignore the missing data: a poor choice, but by far the most common one.
This section gives a brief description of alternative approaches to handling the
problem of missing data.
3.2.1 Listwise Deletion
By far the most common approach is to simply omit those cases with missing data
and to run analyses on what remains.
If data are missing for the response variable, the only reasonable strategy is list-
wise deletion. That is, observations with missing response are dropped totally from
the analysis. If the data are missing completely at random, the only loss is statistical
power. If not, however, bias of unknown size and direction can be introduced.
When the data are missing for one or more predictors, we have more options. List-
wise deletion remains a possible choice, especially if there is not a lot of missing data
(e.g., less than 5% of the total number of observations). Listwise deletion is also easy
to implement and understand. However, this method ignores the possible systematic
difference between the complete cases and incomplete cases, and the resulting infer-
ence may not be applicable to the population of all cases, especially with a smaller
number of complete cases.
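
In R, listwise deletion is a one-liner (dat is a hypothetical data frame); the only bookkeeping worth doing is to record how many cases are lost:

    cc <- na.omit(dat)         # keep only fully observed rows
    nrow(dat) - nrow(cc)       # number of cases dropped
    1 - nrow(cc) / nrow(dat)   # fraction of the sample lost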
3.2.2 Single Imputation
Single imputation refers to filling in a missing value with a single replacement
value. Imputations are means or draws from a predictive distribution of the missing
values, and require a method of creating a predictive distribution for the imputation
based on the observed data. There are two generic approaches to generating this dis-
tribution (For details see Little and Rubin, 2002, Pages 59-60):
Explicit Modeling: the predictive distribution is based on a formal statistical model
(e.g. multivariate normal), and hence the assumptions are explicit.
Implicit Modeling: the focus is on an algorithm, which implies an underlying
model; assumptions are implicit, but they still need to be carefully assessed to ensure
that they are reasonable.
Explicit modeling methods include:
(a) Mean imputation, where means from the responding units in the sample are used
to fill in missing values. Sometimes, the means may be formed by weighting
within cells or classes.
(b) Regression imputation replaces missing values by predicted values from a re-
gression of the missing item on items observed for the unit. Mean imputation
can actually be regarded as a special case of regression imputation. The proper
regression model depends on the type of the to-be-imputed variable. A probit
or logit is used for binary variables, Poisson or other count models for integer-
valued variables, and OLS or related models for continuous variables. For exam-
ple, suppose for subject properties, there are some missing data for gross living
areas (GLA). But gross living areas are strongly related to number of bedrooms,
number of bathrooms, number of total rooms, and lot size. For the observations
with no missing data, GLA is regressed on number of bedrooms, number of
bathrooms, number of total rooms, and lot size. Then, for the observations that
have missing GLA data, the values for the four predictors are inserted into the
estimated regression equation. Predicted values are computed, which are used
to fill in the holes in the GLA data.
(c) Stochastic regression imputation goes one step further, replacing missing values
by a value predicted by regression imputation plus a residual, which is drawn
to reflect the uncertainty in the predicted value. For example, the residual for
Gaussian regression is naturally normal with mean zero and variance equal to
the residual variance in the regression. With a binary outcome, as in logistic
regression, the predicted value is a probability of 1 versus 0, and the imputed
value is then a 1 or 0 drawn with that probability.
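
A minimal sketch of (b) and (c) in R, using the GLA example with hypothetical column names (gla, beds, baths, rooms, lotsize) and assuming the four predictors are fully observed:

    obs  <- !is.na(dat$gla)
    fit  <- lm(gla ~ beds + baths + rooms + lotsize, data = dat[obs, ])
    pred <- predict(fit, newdata = dat[!obs, ])

    # (b) regression imputation: fill each hole with its fitted value
    dat$gla_reg <- dat$gla
    dat$gla_reg[!obs] <- pred

    # (c) stochastic regression imputation: add a normal residual so
    # the imputed values retain realistic variability
    dat$gla_sto <- dat$gla
    dat$gla_sto[!obs] <- pred + rnorm(sum(!obs), 0, summary(fit)$sigma)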
Implicit modeling methods include:
(d) Hot deck imputation involves substituting individual values imputed from sim-
ilar responding units. Hot deck imputation is common in survey practice and
can involve very elaborate schemes for selecting units that are similar for im-
putation (Little and Rubin, 2002). To perform hot deck imputation, all obser-
vations are divided into groups with similar characteristics, for example, prop-
erties priced 400K-800K. To impute a missing value, the researcher randomly
draws a value for that variable from the pool of properties having similar char-
acteristics. Creating a large number of subgroups yields some improvement in
accuracy, but it can also lead to very small sample sizes within some subgroups.
The primary difficulty of this method is the selection of proper subgroups.
(e) Substitution, replaces nonresponding units with alternative units not selected into
the sample. For example, in order to estimate a property value using sales com-
parison method, we need to find similar sales within 0.5 mile of the subject prop-
erty. If a similar sale cannot be found, then a similar sale beyond 0.5 mile may
be substituted. The tendency to treat the resulting sample as complete should
be resisted, since the substituted property may differ systematically
from properties within 0.5 mile. Hence at the analysis stage, substituted proper-
ties should be regarded as imputed values of a particular type.
(f) Cold deck imputation replaces a missing value of an item by a constant value
from an external source, such as a value from a previous realization. In the
property valuation example, we sometimes use the historical sales price adjusted
to the effective date (usually the evaluation date).
(g) Composite methods combine ideas from different methods. For example, hot
deck and regression imputation can be combined by calculating predicted means
from a regression, but then adding a residual randomly chosen from the empirical
residuals to the predicted value when forming values for imputation. See, for
example, Schieber (1978) and David et al. (1986).
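
A minimal sketch of hot deck imputation (d) in R, with a hypothetical grouping variable band (say, coarse price bands):

    hot_deck <- function(x, band) {
      for (g in unique(band)) {
        donors <- x[band == g & !is.na(x)]     # observed values in the group
        holes  <- which(band == g & is.na(x))  # missing values in the group
        if (length(donors) > 0 && length(holes) > 0)
          # index the donor pool directly to avoid sample()'s scalar gotcha
          x[holes] <- donors[sample.int(length(donors), length(holes),
                                        replace = TRUE)]
      }
      x
    }
    dat$price <- hot_deck(dat$price, dat$band)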
An important limitation of the single imputation methods described so far is that
standard variance formulas applied to the filled-in data systematically underestimate
the variance of estimates, even if the model used to generate imputations is correct.
Even if reasonably unbiased estimates can be constructed, single imputation methods
ignore the reduced variability of the predicted values and treat the imputed values
as fixed. One response is to impute several times for each observation, with values
drawn at random, say, from the conditional distributions implied by the regression
equation. It is then possible to get a better handle on the uncertainty associated with
the imputed values. Multiple imputation is one such example, but its obvious
disadvantages prevent it from being used in nonparametric situations. Another example
is nonparametric bootstrap imputation, which will be treated shortly.
3.2.3 Multiple Imputations through Data Augmentation
MI refers to the procedure of imputing each missing value $D$ ($D \ge 2$) times. When the
$D$ sets of imputations are repeated random draws from the predictive distribution of
the missing values under a particular model, the $D$ complete-data inferences can be
combined to form one inference that properly reflects uncertainty due to nonresponse
under the model (Little and Rubin, 2002).
As already indicated in Section 3.2.2, the obvious disadvantage of single imputa-
tion is that imputing a single value treats that value as known, and thus without special
adjustments, single imputation cannot reflect the sampling variability under the impu-
tation model for nonresponse. MI Shares advantages of single imputation and recti-
fies its disadvantages. Specifically, when the D imputations are repetitions under one
model for missingness, the resulting D complete-data analyses can be easily combined
to create an inference that validly reflects sampling variability because of the missing
values.
We now turn to the problem of creating the multiple imputations. Standard theory
suggests that we draw the missing values as

$$Y_{mis}^{(d)} \sim p(Y_{mis}|Y_{obs}), \quad d = 1, \ldots, D, \quad (3.5)$$

that is, from their joint posterior predictive distribution. Unfortunately, it is often dif-
ficult to draw from this predictive distribution in complicated problems, because of the
implicit requirement in Equation (3.5) to integrate over the unknown parameter $\theta$. Data
augmentation accomplishes this by iteratively drawing a sequence of values of the
parameters and missing data until convergence.
Data augmentation (Tanner and Wong, 1987) is an iterative two-step method of
imputing missing values by simulating the posterior distribution of $\theta$; it combines
features of the EM algorithm and multiple imputation. The two steps are the
imputation (or I) step and the posterior (or P) step. Start with an initial draw $\theta^{(0)}$ from
an approximation to the posterior distribution of $\theta$. Given a value $\theta^{(t)}$ of $\theta$, draw at
iteration $t$:

I step: Draw $Y_{mis}^{(t+1)}$ with density $p(Y_{mis}|Y_{obs}, \theta^{(t)})$;

P step: Draw $\theta^{(t+1)}$ with density $p(\theta|Y_{obs}, Y_{mis}^{(t+1)})$.

This procedure is motivated by the fact that the distributions in these two steps are often
much easier to draw from than either of the posterior distributions $p(Y_{mis}|Y_{obs})$ and
$p(\theta|Y_{obs})$, or the joint posterior distribution $p(\theta, Y_{mis}|Y_{obs})$. The iterative procedure
can be shown to eventually yield a draw from the joint posterior distribution of $Y_{mis}$
given $Y_{obs}$, in the sense that as $t$ tends to infinity, this sequence converges to a draw
from the joint distribution of $(\theta, Y_{mis})$ given $Y_{obs}$.
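
To fix ideas, here is a toy data augmentation chain in R for a univariate normal sample with unknown mean $\theta$, known variance 1, and a flat prior on $\theta$ (a sketch of the I and P steps for this special case, not the general scheme):

    set.seed(1)
    y <- rnorm(100, mean = 2)
    y[sample(100, 20)] <- NA         # punch 20 MCAR holes
    mis <- is.na(y); n <- length(y)

    theta <- 0                       # initial draw theta^(0)
    draws <- numeric(2000)
    for (t in 1:2000) {
      # I step: draw Y_mis from p(Y_mis | Y_obs, theta^(t)) = N(theta, 1)
      y[mis] <- rnorm(sum(mis), theta, 1)
      # P step: draw theta^(t+1) from p(theta | Y) = N(mean(y), 1/n)
      theta <- rnorm(1, mean(y), 1 / sqrt(n))
      draws[t] <- theta
    }
    mean(draws[-(1:500)])            # posterior mean after burn-in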
3.2.4 Assessment of Multiple Imputations
Although multiple imputation has desirable features, for instance, it allows one to
get good estimates of the standard error, certain requirements must be met for MI to
have these desirable properties. First, the data must be missing at random (MAR),
meaning that the probability of missingness on the data $Y$ depends only on what is ob-
served, and not on the components that are missing (see Equation 3.2). Second, the
model used to generate the imputed values must be correct in some sense. Third,
the model used for the analysis must match up, in some sense with the model used in
the imputation. All these conditions have been rigorously described by Rubin (1987,
1996).
The problem is that it is easy to violate these conditions in practice. First, when
the data are missing for reasons beyond the control of the investigators, one can never
be sure whether MAR holds. In fact, to speak of a single missingness mechanism
is often misleading, because in most studies missing values occur for a variety of
reasons; some of these may be entirely unrelated to the data in question, but others
may be closely related.
Unfortunately, it is not possible to relax the MAR assumption in any meaningful
way without replacing it with some other equally untestable assumptions. At present,
there are no principled nonignorable missing-data methods readily available to most
data analysts. Thus, MI methods based on the MAR assumption should be used
with an awareness of their limitations.
Furthermore, in order to generate imputations for the missing values, a probability
model on the full data (observed and missing values) must be imposed. Each of the
software packages applies to a different class of multivariate models (available in R).
NORM uses the multivariate normal distribution. CAT is based on loglinear models,
which have been traditionally used by social scientists to describe associations among
variables in cross-classified data. The MIX library relies on the general location model,
which combines a loglinear model for the categorical variables with a multivariate nor-
mal regression for the continuous ones. Details of these models are given by Schafer
(1997).
In reality, data rarely conform to convenient models such as the multivariate
normal. In most applications of MI, the model used to generate the imputations will
at best be only approximately true. An imputation model should therefore be chosen to be (at
least approximately) compatible with the real analyses to be performed on the imputed
datasets. In particular, the imputation model should be rich enough to preserve the
associations or relationships among variables that will be the focus of later investiga-
tion (Schafer and Olsen, 1998). The precision you lose when you include unimportant
predictors is usually a relatively small price to pay for the general validity of analyses
of the resultant multiply imputed data set (Rubin, 1996). Therefore, a rich imputation
model that preserves a large number of associations is desirable because it may be used
for a variety of post-imputation analyses.
Existing software packages, however, sometimes fail for imputation models with
a large number of variables, especially when many of them are categorical, since
the problems of the "curse of dimensionality" and sparse cells can then easily occur.
There is also the possibility of a misspecified imputation model,
which typically leads to overestimated variability and, thus, overcoverage of interval
estimates (Little and Rubin, 2002).
Third, the Bayesian nature of MI requires investigators to specify a prior distribution
for the parameter θ of the imputation model. In the Bayesian paradigm, this
prior distribution quantifies one's belief or state of knowledge about the model parameters
before any data are seen. Because different prior distributions can lead to different
results, Bayesian models have been regarded by some statisticians as subjective and
unscientific. We tend to view the prior distribution as a mathematical convenience
that allows us to generate the imputations in a principled fashion (Schafer and Olsen,
1998).
The nonparametric bootstrap method avoids these problems implicit in MI and thus
provides a good alternative for imputing missing values in a broader range of situations. The details
of the bootstrap method, together with the algorithm, are discussed in Chapter 5.
CHAPTER 4
Missing Data with CART/RF
4.1 Missing Data with CART
Missing data are a problem for virtually all statistical procedures, and CART and RF
are no exception. The imputation methods discussed so far can be used for
tree-based models, either directly or with some adjustments. For instance, if the
missing data amount to less than 5% of the total number of observations, listwise deletion
remains a possible choice.
A second option is to impute the data outside CART. A simple example would be to
employ conventional regression, in which a predictor with missing data is regressed
on other predictors to which it is likely to be related. The resulting regression equation
can then be used to impute what the missing values might be.
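A minimal sketch of this regression-based imputation in R, with hypothetical variable names x1, x2, and x3:

    ## Hypothetical sketch: impute x1 by regressing it on related
    ## predictors x2 and x3 (all names illustrative).
    df <- data.frame(x1 = c(2.1, NA, 3.7, NA, 5.0, 4.2),
                     x2 = c(1, 2, 3, 4, 5, 6),
                     x3 = c(0, 1, 0, 1, 1, 0))
    fit <- lm(x1 ~ x2 + x3, data = df)          # fit on the complete cases
    miss <- is.na(df$x1)
    df$x1[miss] <- predict(fit, newdata = df[miss, ])  # fill in predictions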
A third option is to address the missing data problems for predictors within CART
itself. There are a number of ways this might be done. Here, we consider one of the
better approaches, and the one readily available in the CART software.
The first place where missing data come up is when a split is chosen. Recall that
at each step we choose the split that gives the maximal reduction in impurity:
\Delta I = I(\tau) - p(L)\,I(L) - p(R)\,I(R) \quad (4.1)
where I(τ) is the value of the parent impurity, p(R) is the probability of a case falling
in the right daughter node, p(L) is the probability of a case falling in the left daughter
node, I(R) is the impurity of the right daughter node, and I(L) is the impurity of the
left daughter node. CART tries to find the predictor and the split rule for which ΔI
is as large as possible.
Consider the first term on the right-hand side, I(τ). We can easily calculate its
value without any predictors and thus do not have to worry about missing values. However,
to construct the two daughter nodes, predictors are required. Each predictor is
evaluated as usual, but using only the predictor values that are not missing. That is,
I(R) and I(L) are computed for the optimal split of each predictor using
only the data available, and the associated probabilities p(R) and p(L) are estimated
for each predictor based on the split actually present.
We are not done yet. Now, observations have to be assigned to one of the two
daughter nodes. How can this be done if the predictor values needed are missing?
CART imputes those missing values using surrogate variables.
Suppose there are 10 predictors x1–x10 to be included in the CART analysis, and suppose there are missing values for x1 only, which happens to be the best predictor
chosen to define the optimal split. The split necessarily defines two categories for
x1.
The predictor x1 now actually becomes a binary response variable with the two
classes determined by the split (Berk, 2005). CART is applied with x1 as the response
variable and x2–x10 as potential splitting variables. Only one partitioning is allowed here; a full tree is not constructed. The nine predictors are then ranked by the proportion
of cases in x1 that are misclassified. Predictors that do no better than the marginal
distribution of x1 are dropped from further consideration.
The variable with the lowest classification error for x1 is then used in place of x1 to
assign cases with missing values on x1 to one of the two daughter nodes. That is, the
predicted classes for x1 are used when the actual classes for x1 are missing (Berk, 2005).
If there are missing data on the best predictor x1, the best surrogate variable, say
x2, is used instead. If there are also missing data on that best surrogate, the second-best
surrogate, say x3, is used instead, and so on. If each of
the variables x2–x10 has missing data for a case, the majority direction of the x1 split is used.
For example, if the split is defined so that x1 ≤ c sends observations to the left and x1 > c sends cases to the right, cases with data missing on x1 that have no usable surrogate
are placed with the majority of cases. To be more specific, there are
three options in the actual implementation (the rpart library in R):
1. 0 = display only; an observation with a missing value for the primary split rule
is not sent further down the tree.
2. 1 = use surrogates, in order, to split subjects missing the primary variable; if all
surrogates are missing, the observation is not split.
3. 2 = if all surrogates are missing, send the observation in the majority direction.
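For concreteness, here is a minimal sketch of selecting among these options via the usesurrogate argument of rpart.control, using the diabetes data from the mlbench package (analyzed later in Chapter 6); note that this particular data frame contains no literal NAs, so the option only matters once missing values are present.

    library(rpart)
    library(mlbench)
    data(PimaIndiansDiabetes)       # the diabetes data of Chapter 6
    fit <- rpart(diabetes ~ ., data = PimaIndiansDiabetes,
                 method = "class",
                 control = rpart.control(usesurrogate = 2,  # option 3 above
                                         maxsurrogate = 5))
    summary(fit)                    # lists the surrogate splits at each node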
This would seem to be a reasonable response to missing data, and there may be
other alternatives that perform better. But the greatest risk is that if there are lots
of missing data and the surrogate variables are used, the correspondence between the
results and the data, had they been complete, can become very tenuous (Berk, 2005).
In practice, the data will rarely be missing completely at random (MCAR) or even
missing at random (MAR). Then, if too much of the data are manufactured, rather than
collected, a new kind of generalization error will be introduced. The problem is that
imputation can fail just when you need it the most.
Furthermore, a number of statistical difficulties can follow when the response variable
is highly skewed. The danger with missing data is that the skewing can be made
worse (Berk, 2005). Perhaps we should avoid using surrogate variables and instead impute
the missing data using alternative imputation methods, such as the nonparametric
bootstrap method.
4.2 Missing Data with RF
There are two ways with which random forests can impute missing data. Among
the two implementations in randomForest library, option na.roughfix is quick and
easy to implement. To be specific,
1. For numerical variables, NAs are replaced with column medians.
2. For factor variables, NAs are replaced with the most frequent level (breaking
ties at random).
3. If a data matrix contains no NAs, it is returned unaltered.
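A minimal illustration of na.roughfix on a toy data frame (the data are made up purely for illustration):

    library(randomForest)
    df <- data.frame(x1 = c(1.2, NA, 3.4, 4.1),
                     x2 = factor(c("a", "b", NA, "b")))
    na.roughfix(df)
    ## x1's NA becomes the column median of 1.2, 3.4, 4.1 (i.e., 3.4)
    ## x2's NA becomes "b", the most frequent level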
A more advanced algorithm capitalizes on the proximity matrix (rfImpute() in
the randomForest library). We now formally introduce a proximity matrix.
A proximity matrix is an n × n symmetric matrix that gives an intrinsic measure of the
similarities between cases, where n is the number of cases in the data set. All
cases in the training set are dropped down each tree. If case i and case j both land in
the same terminal node, the proximity between i and j (element (i, j) of the
matrix) is increased by one. At the end of the run, the proximities are divided by the number of
trees in the run, and the proximity between a case and itself is set equal to one. This
is an intrinsic proximity measure, inherent in the data and the RF algorithm. Thus
each cell in the proximity matrix shows the proportion of trees over which each pair of
observations falls in the same terminal node. The higher the proportion, the more alike
those observations are, and the more proximate they are.
The proximities between cases i and j form the matrix {prox(i, j)}. From their definition, it follows that the values 1 − prox(i, j) are squared distances in a Euclidean space of high dimension (Breiman, 2003).
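A short sketch of extracting this matrix with the randomForest library, again using the diabetes data from mlbench for illustration:

    library(randomForest)
    library(mlbench)
    data(PimaIndiansDiabetes)
    rf <- randomForest(diabetes ~ ., data = PimaIndiansDiabetes,
                       ntree = 500, proximity = TRUE)
    dim(rf$proximity)       # n x n, here 768 x 768
    rf$proximity[1:3, 1:3]  # diagonal is 1; off-diagonals are proportions
                            # of trees in which two cases share a node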
The function rfImpute() starts by imputing NAs using na.roughfix, then ran-
domForest() is called with the completed data. The proximity matrix from the random
forests is used to update the imputations of the NAs. For continuous predictors, the
imputed value is the weighted average of the non-missing observations, where the
weights are the proximities. So, cases that are more like the cases with the missing
data are given greater weight. For categorical predictors, the imputed value is the cat-
egory with the largest average proximity. Again, cases more like the case with the
missing data are given greater weight.
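A minimal sketch of rfImpute(), with NAs introduced artificially into the diabetes data purely for illustration:

    library(randomForest)
    library(mlbench)
    data(PimaIndiansDiabetes)
    dat <- PimaIndiansDiabetes
    set.seed(2)
    dat$glucose[sample(nrow(dat), 50)] <- NA   # artificial NAs
    imputed <- rfImpute(diabetes ~ ., data = dat, iter = 5, ntree = 300)
    ## 'imputed' is a complete data frame with the response in column 1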
This process is relatively slow, requiring up to six iterations of forest growing,
and the use of imputed values tends to make the OOB measures of fit too optimistic
(Breiman, 2003). The computational demands are also quite daunting and may be
impractical for many data sets until more efficient ways to handle the proximities are
found.
CHAPTER 5
Nonparametric Bootstrap Methods to Impute Missing
Data
In this chapter, we formally introduce one type of resampling method to impute
missing data: the nonparametric bootstrap. A primary advantage of the nonparametric
bootstrap method is that it does not depend on the missing-data mechanism, which
addresses a shortcoming shared by the other imputation methods. It also requires no knowledge
of either the probability distribution or the model structure, and it successfully incorporates
the uncertainty associated with the imputed data.
5.1 The Simple Bootstrap for Complete Data
Let θ̂ be a consistent estimate of a parameter θ based on a random sample Y =
(y_1, y_2, ..., y_n)^T. Let Y^(b) be a sample of size n obtained from the original sample Y
by simple random sampling with replacement, and let θ̂^(b) be the estimate of θ obtained by
applying the standard estimation method to Y^(b), where b = 1, 2, ..., B indexes the drawn samples.
Then the sequence (θ̂^(1), ..., θ̂^(B)) represents the set of estimates
obtained by repeating this procedure B times. The bootstrap estimate of θ is defined
as the average of the B bootstrap estimates:

\hat{\theta}_{boot} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)} \quad (5.1)
Large-sample inferences can be derived from the bootstrap distribution of θ̂^(b), which
is based on the histogram formed by the bootstrap estimates (θ̂^(1), ..., θ̂^(B)). In particular,
the bootstrap estimate of the variance of θ̂ or θ̂_boot is

V_{boot} = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^{(b)} - \hat{\theta}_{boot} \right)^2 \quad (5.2)

It can be shown that, under certain conditions, (a) the bootstrap estimator θ̂_boot is less
biased than the original estimator θ̂, and, under quite general conditions, (b) V_boot is
a consistent estimate of the variance of θ̂ or θ̂_boot as n and B tend to infinity (Efron,
1987). From property (b), we can see that if the bootstrap distribution is approximately
normal, a 100(1 − α)% bootstrap confidence interval for a scalar θ can be computed as

CI_{norm}(\theta) = \hat{\theta} \pm z_{1-\alpha/2} \sqrt{V_{boot}} \quad (5.3)

where z_{1−α/2} is the 100(1 − α/2) percentile of the standard normal distribution. Alternatively, if
the bootstrap distribution is non-normal, a 100(1 − α)% bootstrap confidence interval
can be computed empirically as

CI_{emp}(\theta) = \left( \hat{\theta}^{(b,l)}, \hat{\theta}^{(b,u)} \right) \quad (5.4)

where θ̂^(b,l) and θ̂^(b,u) are the (α/2) and (1 − α/2) percentiles of the empirical bootstrap
distribution of θ̂. Stable intervals based on Eq. (5.3) require bootstrap samples on the
order of B = 200. Intervals based on Eq. (5.4) require much larger samples, for example
B = 2000 or more (Efron, 1994).
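As a minimal illustration of Equations (5.1)–(5.4) in R, the following sketch bootstraps the sample median of a simulated complete sample; the data are made up purely for illustration.

    set.seed(3)
    y <- rexp(200)                           # an arbitrary complete sample
    B <- 2000
    theta.hat <- median(y)                   # the original estimate
    theta.b <- replicate(B, median(sample(y, replace = TRUE)))
    theta.boot <- mean(theta.b)                              # Eq. (5.1)
    V.boot <- sum((theta.b - theta.boot)^2) / (B - 1)        # Eq. (5.2)
    alpha <- 0.05
    theta.hat + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(V.boot) # Eq. (5.3)
    quantile(theta.b, c(alpha/2, 1 - alpha/2))               # Eq. (5.4)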
5.2 The Simple Bootstrap Applied to Imputed Incomplete Data
Suppose there is a simple random sample Y = (y_1, y_2, ..., y_n)^T, but some observations y_i are missing. A consistent estimate θ̂ of an unknown parameter θ is computed
by first filling in the missing values in Y using some imputation method Imp,
yielding imputed data Ŷ = Imp(Y), and then estimating θ from the imputed data Ŷ.
Bootstrap estimates (θ̂^(1), ..., θ̂^(B)) can be computed as follows.
For b = 1, ..., B:
1. Generate a bootstrap sample Y^(b) with replacement from the original incomplete
sample Y.
2. Fill in the missing data in Y^(b) by applying the imputation procedure Imp to the
bootstrap sample, so that Ŷ^(b) = Imp(Y^(b)).
3. Compute θ̂^(b) from the imputed complete data Ŷ^(b).
Then Equation (5.2) provides a consistent estimate of the variance of θ̂, and Equations
(5.3) or (5.4) can be used to generate confidence intervals for an unknown scalar parameter.
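A compact R sketch of Steps 1–3 above, with simple mean imputation standing in for a generic procedure Imp; all names and data are illustrative.

    set.seed(4)
    y <- rnorm(100); y[sample(100, 15)] <- NA   # an incomplete sample
    Imp <- function(v) { v[is.na(v)] <- mean(v, na.rm = TRUE); v }
    B <- 2000
    theta.b <- replicate(B, {
      yb <- sample(y, replace = TRUE)  # Step 1: resample the incomplete data
      mean(Imp(yb))                    # Steps 2-3: impute, then estimate
    })
    var(theta.b)                       # the variance estimate of Eq. (5.2)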
A key feature of this procedure is that the imputation procedure is applied B times,
once to each bootstrap sample; hence the approach is computationally intensive. A
simpler procedure would be to apply the imputation procedure Imp just once to yield
one imputed data set Ŷ, and then bootstrap the estimation method applied to the filled-in
data. However, this approach clearly does not propagate the uncertainty in the
imputations and hence does not provide valid inferences (Little and Rubin, 2002). A
second key feature is that the imputation method must yield a consistent estimate θ̂
of the true parameter. This is not required for Equation (5.2) to yield a valid estimate
of sampling error, but it is required for Equations (5.3) and (5.4) to yield appropriate
confidence coverage, and for tests to have the nominal size; see in particular Rubin's
(1994) discussion of Efron (1994).
This approach should be applied with caution since it assumes large samples. With
moderate-sized data sets, it is possible that an imputation procedure that works for the
full sample may need to be modified for one or more bootstrap samples.
A principal advantage of the nonparametric bootstrap method is that it does not
depend on the missing-data mechanism. Its main practical disadvantage is the computational
expense of the 2000 or so bootstrap replications required for reasonable numerical
accuracy when the bootstrap distribution is non-normal (Efron, 1994). Fortunately,
this is no longer a big concern with the computing power available nowadays.
5.3 The Imputation Algorithm for Tree-Based Models
The nonparametric bootstrap method to impute missing data for tree-based models
can be structured as follows.
Algorithm 1:
1. Draw B (say, 2000) bootstrap samples.
2. For each bootstrap sample, b = 1, 2, ..., B, impute missing values using the following steps:
- Replace missing values with the median (if the predictor is quantitative) or
the mode (if the predictor is qualitative), a.k.a. the "rough fix".
- Regress each categorical predictor on the other predictors to which it is
likely to be related, using logistic regression.
- Regress each continuous predictor on the other predictors to which it is
likely to be related, using Gaussian regression.
- Regress each count (integer-valued) predictor on the other predictors to
which it is likely to be related, using Poisson regression.
- For observations that have missing data, predict each missing field using the
corresponding regression equation; the missing values are then filled in
with the predicted values.
- Apply CART/RF to the imputed bootstrap sample and obtain the confusion
table.
- Extract and store the false positive and false negative errors from the confusion
tables.
3. Repeat Step 2 a large number of times (e.g., B = 2000).
4. Study the empirical distributions of the false positive and false negative errors
over the B runs.
5. Construct confidence intervals.
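A condensed, hypothetical R sketch of one replication of Step 2 is given below. For brevity, factors are assumed binary (so logistic regression applies) and the Poisson case for count predictors is omitted; 'dat' and 'response' are illustrative names, not from the thesis.

    library(rpart)
    library(randomForest)   # for na.roughfix

    one.rep <- function(dat, response) {
      b <- dat[sample(nrow(dat), replace = TRUE), ]     # bootstrap sample
      miss <- is.na(b)                                  # remember the NAs
      b <- na.roughfix(b)                               # rough fix first
      for (v in setdiff(names(b), response)) {          # refine each predictor
        if (!any(miss[, v])) next
        fam <- if (is.factor(b[[v]])) binomial() else gaussian()
        fit <- glm(reformulate(setdiff(names(b), c(v, response)), v),
                   data = b, family = fam)
        pred <- predict(fit, newdata = b[miss[, v], , drop = FALSE],
                        type = "response")
        b[[v]][miss[, v]] <- if (is.factor(b[[v]]))
          levels(b[[v]])[1 + (pred > 0.5)] else pred    # binary factors only
      }
      tree <- rpart(reformulate(".", response), data = b, method = "class")
      table(observed = b[[response]],                   # confusion table
            predicted = predict(tree, type = "class"))
    }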
Algorithm 2 differs from Algorithm 1 beginning at Step 4; the procedures used
to impute missing values remain the same. The only difference is that we now obtain an
overall estimate of the false positive and false negative errors instead of their confidence
intervals.
Algorithm 2:
1. Draw B (say, 2000) bootstrap samples.
2. For each bootstrap sample, b = 1, 2, ..., B, impute missing values using the following steps:
- Replace missing values with the median (if the predictor is quantitative) or
the mode (if the predictor is qualitative), a.k.a. the "rough fix".
- Regress each categorical predictor on the other predictors to which it is
likely to be related, using logistic regression.
- Regress each continuous predictor on the other predictors to which it is
likely to be related, using Gaussian regression.
- Regress each count (integer-valued) predictor on the other predictors to
which it is likely to be related, using Poisson regression.
- For observations that have missing data, predict each missing field using the
corresponding regression equation; the missing values are then filled in
with the predicted values.
- Apply CART/RF to the imputed bootstrap sample.
- Drop the cases in the bth bootstrap sample down the tree. Store the class
assigned to each in-the-sample observation along with that observation's
predictor values.
3. Repeat Step 2 a large number of times (e.g., B = 2000).
4. Using only the class assigned to each observation when that observation is in-the-sample,
count the number of times over the B replications that the observation
is classified in one category and the number of times it is
classified in the other category.
5. Assign each case to a category by a majority vote over the B replications. Thus, if
51% of the time a given case is classified as a "1", that becomes its estimated
classification.
6. Construct the confusion table using the assigned classes.
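A short R sketch of the voting in Steps 4–6, assuming a hypothetical n × B character matrix 'votes' whose (i, b) entry is the class ("0" or "1") assigned to observation i in replication b when it was in-the-sample, and NA otherwise; 'y' holds the observed classes. All names are illustrative.

    majority.class <- apply(votes, 1, function(v) {
      tab <- table(factor(v[!is.na(v)], levels = c("0", "1")))
      names(which.max(tab))        # Step 5: majority vote over B replications
    })
    table(observed = y, predicted = majority.class)  # Step 6: confusion table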
CHAPTER 6
Empirical Studies
6.1 Data Sets
The algorithms are applied to classification trees using the following data sets:
- diabetes
- domestic violence
- dolphin
The diabetes data set is in the UCI repository of machine learning databases
(ftp.ics.uci.edu/pub/machine-learning-databases); it can also be found in the datasets
or mlbench library in R. The dolphin data set comes from a real-world problem and will be
explained shortly. Missing values were artificially generated for these two data sets.
6.1.1 Data of Diabetes
These data were collected by the US National Institute of Diabetes and Digestive
and Kidney Diseases. A population of women who were at least 21 years old, of Pima
Indian heritage, and living near Phoenix, Arizona, was tested for diabetes according to
World Health Organization criteria. The diabetes data frame has 768 observations on
9 variables; 268 observations tested positive according to the WHO criteria.
pregnant: number of times pregnant.
glucose: plasma glucose concentration (in an oral glucose tolerance test).
pressure: diastolic blood pressure (mm Hg).
triceps: triceps skin fold thickness (mm).
insulin: 2-hour serum insulin (mu U/ml).
mass: body mass index (weight in kg / (height in m)^2).
pedigree: diabetes pedigree function.
age: age in years.
diabetes: yes or no, for diabetic according to WHO criteria. This is the response variable.
6.1.2 Data of Domestic Violence
The purpose of the study was to investigate the likelihood of domestic violence
incidents and their severity. The original research design specified a representative
sample of 1,500 households that were likely to involve domestic violence. The Sheriff's
deputies were expected to be at the scene and to administer a screener of about 30
questions related to the domestic violence incidents. These screener questions
were designed by a group of criminology experts.
In a three-month follow-up period, the Sheriff's deputies were expected to record whether
there was a new domestic violence incident call from these households. The data were
to be collected from six selected substations, because they accounted for the
largest numbers of domestic violence calls in Los Angeles County.1 However, due to some legal
1The substations are Century City, Compton, East Los Angeles, City of Industry, Lakewood, and
Lancaster.
issues and cooperation problems, the data we eventually obtained from the Sheriff's Department
covered fewer than half of the households specified by the research design (671
households in the end). Two response variables were included in the analysis: 1)
followup: whether there was another call during the 3-month follow-up period; and 2) crime:
whether the incident was related to a criminal charge. Different ratios of false negatives to false
positives were used for the two response variables.
The long screening instrument includes:
Is this the first time he/she tried to hurt you?
When was the last time he/she tried to hurt you?
How many times has he/she tried to hurt you?
How many times before have the police been called?
Was he/she ever arrested for domestic violence as a result?
Was he/she ever convicted for domestic violence as a result?
Is the violence getting worse as time goes on?
How long ago did the violence start?
Has he/she ever hurt you so that you need to see a medical doctor?
How many times?
Were you ever treated for those injuries in a hospital emergency room?
Does he/she have a problem with jealousy?
Does he/she keep track of whom you talk to on the phone?
Does he/she try to determine which of your friends you can see?
Does he/she try to put you down in front of your friends and family?
Does he/she have a drinking problem or a problem with drugs?
When he/she is angry with you, does he/she ever try to destroy things around the house?
Has he/she ever threatened to kill you or someone in your family?
Are there any children in the home?
Has he/she ever intentionally hurt him/her/any of them just because he/she was angry?
Does he/she have a handgun he/she can get to?
Did he/she purchase it himself?
When he/she is angry, has he/she ever threatened you with it?
Does he/she have a rifle he/she can get to?
Did he/she purchase it himself/herself?
When he/she is angry, has he/she ever threatened you with it?
Has he/she ever threatened you with other weapons like a knife?
Is there a restraining order against him/her right now?
Have you ever left him/her?
How many times?
Does he/she have a regular job?
6.1.3 Data of Dolphin
The dolphin data are from the Inter-American Tropical Tuna Commission (IATTC),
studying dolphin mortalities occurring incidental to tuna-fishing opera