
Department of Informatics

Imanol Studer (09-734-575)

Density Estimation Trees

Seminar Report: Database Systems

Department of Informatics - Database Technology
University of Zurich

Supervision

Prof. Dr. Michael Böhlen
Prof. Dr. Peter Widmayer

Arijit Khan

May 30, 2015


Contents

1 Introduction

2 Parzen-Window Density Estimation
  2.1 Definition

3 Density Estimation Tree
  3.1 Piecewise constant density estimate
  3.2 Loss Function
  3.3 Categorical and Ordinal features
  3.4 Learning algorithm
  3.5 Consistency

4 Comparison
  4.1 Adaptability within dimensions
  4.2 Adaptability between dimensions
  4.3 Importance of dimensions
  4.4 Outlier Detection
  4.5 Speed

5 Conclusion


Chapter 1

Introduction

In machine learning, density estimation, an instance of unsupervised learning, is a fundamental task. Compared with supervised learning it is a hard task, because it works without any ground truth regarding the estimated quantity. In supervised learning, decision trees have been widely used [1]. In this report, however, I want to focus on decision trees for unsupervised learning. A recent paper [3] introduced Density Estimation Trees (DETs), a novel method to build probability density functions from training samples in a histogram-like fashion. Density estimation methods can be roughly classified into parametric (e.g. maximum likelihood) and nonparametric (e.g. Parzen windows) models. Parametric models try to fit a given distribution model to the training samples by tuning its parameters; to evaluate a new observation, only the fitted distribution function needs to be used. Nonparametric models, however, generally don't need to be initialized, because they evaluate the density of an input by comparing it to the stored training samples, which can be computationally very expensive. Density Estimation Trees are nonparametric models which need to be initialized before use, which makes them a special case in the nonparametric estimator domain. The learning algorithm partitions the space in a top-down manner into a tree structure and precaches a density for each partition. The partitions are found by greedily minimizing a loss function, with the amount of pruning chosen through leave-one-out cross-validation (LOO-CV). To evaluate a new input, a Density Estimation Tree just returns the density stored in the partition the input belongs to, which makes it a very fast nonparametric estimator.

An interesting property of Density Estimation Trees is adaptability. They are adaptable between dimensions, which means that they don't treat each dimension the same way: some dimensions might have more influence on the density than others (e.g. dimensions that are pure noise sources have little influence). They are also adaptable within dimensions: in regions where the density changes quickly, smaller partitions are built than in regions where it changes slowly.

Density Estimation Trees are also more interpretable than the usual nonparametric models. DETs can be used to detect clusters and outliers. By looking at how different dimensions are treated in the partitioning process, i.e. the training algorithm, the importance of different dimensions can be derived. A DET also provides simple sets of rules with which a certain part of the data (e.g. a partition) may be found (e.g. in a database).

The remainder of this report focuses on how DETs work and on properties they have that are difficult to find in other unsupervised learning algorithms. Chapter 2 gives a brief introduction to the Parzen-Window estimator, a well-known method for nonparametric estimation. The main part is Chapter 3, which introduces DETs and specifies their details. After that, I compare DETs with Parzen Windows and reach a conclusion.


Chapter 2

Parzen-Window Density Estimation

In this chapter I want to give an overview of the Parzen-Window estimator. However, since the focus of this report is to show the advantages of DETs over Parzen windows, I will keep it brief.

In the early 1960s, Emanuel Parzen [2] introduced a nonparametric density estimator called the Parzen-Window estimator, also known as the KDE (Kernel Density Estimator). It has found widespread utility in areas like pattern recognition, classification, and image processing. At its core, the method is an interpolation technique.

Given an observation x, the Parzen-Window estimator places a window centred on x and estimates the density from the fraction of the stored training samples that fall inside the window. It also offers the possibility to use different kernels, which usually are themselves PDFs.

2.1 Definition

The Parzen-Window estimate is given by

P(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h_N^{d}} K\left( \frac{x - x_i}{h_N} \right)    (2.1)

where
N: number of observations
h_N: window width / bandwidth
d: number of dimensions
x_i: i-th training sample
K(x): kernel function

The parameter h_N is usually chosen based on the number of available training samples N. The kernel function is usually a unimodal function and a PDF by itself, which makes it simple to guarantee that the estimated function P(·) satisfies the conditions to be a PDF. Typical kernels are the Gaussian and uniform types.
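To make Eq. 2.1 concrete, the following is a minimal Python sketch, assuming a Gaussian kernel; the function and variable names are illustrative and not taken from any particular library.

# Minimal sketch of the Parzen-Window estimate of Eq. 2.1 with a Gaussian kernel.
import numpy as np

def parzen_estimate(x, samples, bandwidth):
    """Evaluate Eq. 2.1 at the point x (shape (d,)) given samples of shape (N, d)."""
    d = samples.shape[1]
    u = (x - samples) / bandwidth                               # (x - x_i) / h_N for every sample
    kernel = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return np.mean(kernel / bandwidth ** d)                     # (1/N) * sum of K(.) / h^d

# Usage: estimate the density of a 1-D standard normal at x = 0 (true value ~0.399).
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 1))
print(parzen_estimate(np.array([0.0]), samples, bandwidth=0.3))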


Chapter 3

Density Estimation Tree

This chapter explains how density estimation can be done using decision trees, based on [3]. First the estimator and its corresponding loss function are defined. Another section explains how continuous features can be mixed with categorical and ordinal ones. Finally, a consistency proof is given.

3.1 Piecewise constant density estimate

The piecewise constant density estimate of a decision tree T is given by:

f_N(x) = \sum_{l \in \tilde{T}} \frac{|l|}{N V_l} \, I(x \in l)    (3.1)

where
T: the density estimation tree
S: set of N observations in R^d
\tilde{T}: set of leaves of T, representing the partitions of the data space
|l|: number of observations of S in the leaf l
V_l: volume of leaf l within the d-dimensional bounding box of S
I: indicator function

The space is partitioned into pieces represented by different nodes of the tree. The density estimate is simply the number of training samples (i.e. observations) in the leaf containing the evaluated point, divided by the total number of training samples and by the volume of that leaf.
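As an illustration, the following is a minimal Python sketch of Eq. 3.1, assuming a fitted tree is available as a flat list of leaves, each storing its bounding box and the number of training points it contains; the Leaf structure and function names are illustrative.

# Minimal sketch of the piecewise constant estimate of Eq. 3.1.
import numpy as np
from dataclasses import dataclass

@dataclass
class Leaf:
    lower: np.ndarray   # lower corner of the leaf's box
    upper: np.ndarray   # upper corner of the leaf's box
    count: int          # |l|, number of training points that fell into the leaf

def det_density(x, leaves, n_total):
    """Return the piecewise constant estimate f_N(x)."""
    for leaf in leaves:
        if np.all(leaf.lower <= x) and np.all(x < leaf.upper):
            volume = np.prod(leaf.upper - leaf.lower)           # V_l
            return leaf.count / (n_total * volume)              # |l| / (N * V_l)
    return 0.0  # x lies outside the bounding box of the training data

# Usage: two leaves partitioning [0, 1], 100 training points in total.
leaves = [Leaf(np.array([0.0]), np.array([0.2]), 60),
          Leaf(np.array([0.2]), np.array([1.0]), 40)]
print(det_density(np.array([0.1]), leaves, n_total=100))        # 60 / (100 * 0.2) = 3.0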

3.2 Loss Function

Unsupervised learning of a density estimation tree in the above form requires a notion of a loss function that is minimized using a greedy learning algorithm. In [3] the Integrated Squared Error (ISE) loss function [4] was used.


In comparison with maximum-likelihood based functions, the ISE gives a robust notion of the error between the estimated and the true density. To learn a DET, we have to solve the following optimization problem:

\min_{f_N \in \mathcal{F}_N} \int_{\mathcal{X}} \left( f_N(x) - f(x) \right)^2 dx    (3.2)

where \mathcal{F}_N is the set of all estimators of the form of Eq. 3.1 that can be learned using a set of N training samples. Now we expand the square:

\min_{f_N \in \mathcal{F}_N} \int_{\mathcal{X}} \left( f_N(x)^2 - 2 f_N(x) f(x) + f(x)^2 \right) dx

The term f(x) is constant among different estimators f_N(x), and thus f(x)^2 does not have to be considered in this optimization problem. By using the Monte-Carlo substitution \int_{\mathcal{X}} f_N(x) f(x) \, dx \approx \frac{1}{N} \sum_{i=1}^{N} f_N(X_i), where X_i refers to the i-th training sample, we obtain the following estimator of the ISE [4]:

\min_{f_N \in \mathcal{F}_N} \left\{ \int_{\mathcal{X}} \left( f_N(x) \right)^2 dx - \frac{2}{N} \sum_{i=1}^{N} f_N(X_i) \right\}    (3.3)

Now we plug in the piecewise constant density estimator from Eq. 3.1. Substituting

f_N^2(x) = \sum_{l \in \tilde{T}} \left( \frac{|l|}{N V_l} \right)^2 I(x \in l)

(the cross terms in the expansion of f_N^2(x) vanish because of the indicator function) and

\sum_{i=1}^{N} f_N(X_i) = \sum_{i=1}^{N} \sum_{l \in \tilde{T}} \frac{|l|}{N V_l} \, I(X_i \in l) = \sum_{l \in \tilde{T}} \frac{|l|}{N V_l} \, |l| ,

we get the following objective function:

\int_{\mathcal{X}} \sum_{l \in \tilde{T}} \left( \frac{|l|}{N V_l} \right)^2 I(x \in l) \, dx - \frac{2}{N} \sum_{l \in \tilde{T}} \frac{|l|}{N V_l} \, |l|
\;=\; \sum_{l \in \tilde{T}} \left\{ \int_{l} \left( \frac{|l|}{N V_l} \right)^2 dx - \frac{2}{N} \frac{|l|}{N V_l} \, |l| \right\}    (3.4)

Doing another substitution,

\int_{l} \left( \frac{|l|}{N V_l} \right)^2 dx = \left( \frac{|l|}{N V_l} \right)^2 \int_{l} dx = \left( \frac{|l|}{N V_l} \right)^2 V_l ,


the estimator of the ISE for the DET turns into the following equation:

\sum_{l \in \tilde{T}} \left\{ \left( \frac{|l|}{N V_l} \right)^2 V_l - \frac{2}{N} \frac{|l|}{N V_l} \, |l| \right\}
= \sum_{l \in \tilde{T}} \left\{ \frac{|l|^2}{N^2 V_l} - \frac{2 |l|^2}{N^2 V_l} \right\}
= \sum_{l \in \tilde{T}} \left\{ - \frac{|l|^2}{N^2 V_l} \right\}    (3.5)

Thus, the greedy surrogate of the error for any node t is given by

R(t) = - \frac{|t|^2}{N^2 V_t}    (3.6)

The tree is then grown in a top-down manner by minimizing this greedy surrogate of the error over the available training samples.
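As a small illustration of Eq. 3.6, the following Python sketch computes R(t) for a single node; the names are illustrative.

# Minimal sketch of the greedy surrogate error R(t) of Eq. 3.6.
def node_error(node_count, node_volume, n_total):
    """node_count is |t|, node_volume is V_t, n_total is N."""
    return -(node_count ** 2) / (n_total ** 2 * node_volume)

# A node holding 60 of 100 training points in a box of volume 0.2:
print(node_error(60, 0.2, 100))   # -1.8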

3.3 Categorical and Ordinal features

In order to do density estimation over data with mixed (i.e. continuous, categorical and ordinal) features, [3] defines a novel density estimator and a loss function that allow mixing of different feature types and may be used to learn DETs. The piecewise constant density estimate of a decision tree T with mixed features is given by:

f_N(x) = \sum_{l \in \tilde{T}} \frac{|l| \, I(x \in l)}{N \cdot V_{l_d} \cdot \prod_{j=1}^{d'_l} R_{l_j} \cdot \prod_{i=1}^{d''_l} M_{l_i}}    (3.7)

where
T: the density estimation tree
S: set of N observations in R^d \times Z^{d'} \times C^{d''}
d: number of real features
d': number of ordinal features
d'': number of categorical features
\tilde{T}: set of leaves of T, representing the partitions of the data space
|l|: number of observations of S in the leaf l
V_{l_d}: volume of leaf l within the d-dimensional bounding box of S
R_{l_j}: range of ordinal values in the j-th of the d'_l ordinal dimensions present in l
M_{l_i}: number of categories present in l for the i-th of the d''_l categorical dimensions present in l
I: indicator function

Similar to the continuous features case, we can define the greedy surrogate of the error for any node t as:

R(t) = - \frac{|t|^2}{N^2 \cdot V_{t_d} \cdot \prod_{j=1}^{d'_t} R_{t_j} \cdot \prod_{i=1}^{d''_t} M_{t_i}}    (3.8)


3.4 Learning algorithm

In order to find an optimal tree T, the CART (Classification and Regression Tree) model [1] is used in [3]. However, since we are doing unsupervised learning, a different objective function, given by Eq. 3.8, is used. In general, the idea is the following:

1. Grow a large binary tree using forward selection. For each node, find the best split among all possible splits and dimensions. Grow until all leaf nodes either:

- have < n data points, where n is a predefined parameter, or

- have a sufficiently small cost given by Eq. 3.8.

2. Prune the tree back, creating a nested sequence of trees of decreasing complexity.

Splitting: For each node, the tree is split so that the summed-up loss of the child nodes is minimized. We denote a split by s ∈ S, where S is the set of all possible splits, and the best split by:

s^* = \arg\max_{s \in S} \left\{ R(t) - R(t_L) - R(t_R) \right\}    (3.9)

where the children of the node t satisfy |t| = |t_L| + |t_R|. For continuous and ordinal dimensions, the optimal split is found by trying every one of the |t| - 1 possible splits. For categorical dimensions, every way of dividing the k categories available at the node into two subsets is evaluated, which gives 2^{k-1} - 1 possible splits.
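The following Python sketch illustrates the exhaustive search of Eq. 3.9 along a single continuous dimension, using the surrogate error of Eq. 3.6; the helper names are illustrative, and a full DET would repeat this search over every dimension of every node.

# Minimal sketch of the best-split search of Eq. 3.9 in one continuous dimension.
import numpy as np

def node_error(count, volume, n_total):
    return -(count ** 2) / (n_total ** 2 * volume)      # R(t) from Eq. 3.6

def best_split_1d(values, lower, upper, n_total):
    """Try every |t|-1 split between sorted neighbours and return the best one."""
    values = np.sort(values)
    parent = node_error(len(values), upper - lower, n_total)
    best_gain, best_cut = -np.inf, None
    for i in range(1, len(values)):
        cut = 0.5 * (values[i - 1] + values[i])          # candidate split point
        if cut <= lower or cut >= upper:                 # skip degenerate splits
            continue
        left = node_error(i, cut - lower, n_total)
        right = node_error(len(values) - i, upper - cut, n_total)
        gain = parent - left - right                     # R(t) - R(t_L) - R(t_R)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

# Usage: 90 points packed into [0, 0.1] and 10 spread over [0.1, 1.0]; the best
# cut isolates the dense cluster, landing near 0.1.
rng = np.random.default_rng(0)
data = np.concatenate([rng.uniform(0.0, 0.1, 90), rng.uniform(0.1, 1.0, 10)])
print(best_split_1d(data, 0.0, 1.0, n_total=100))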

Pruning: Overfitting is a problem often encountered in machine learning algorithms. To avoid it, [3] introduces a regularization term, which penalizes model complexity, and adds it to the greedy surrogate error:

R_\alpha(t) = R(t) + \alpha \, |\tilde{T}_t|    (3.10)

where α is a regularization parameter found by cross-validation and \tilde{T}_t is the set of leaves in the subtree rooted at t. During the training process of T, α is gradually increased. The subtree rooted at t is pruned at the value of α for which R_α(t) equals R_α({t}), the regularized error of the pruned subtree, i.e. of t turned into a single leaf. The search space of α is finite, and therefore these critical values of α can be computed efficiently [1].

Cross Validation: Cross-validation is a technique often encountered in machine learning approaches. It is used to find the optimal value of parameters governing the training behaviour, in this case α. From [4] we obtain the following expression for our LOO-CV estimator:

J(\alpha) = \int_{\mathcal{X}} \left( f_N^{\alpha}(x) \right)^2 dx - \frac{2}{N} \sum_{i=1}^{N} f_{(-i)}^{\alpha}(X_i)    (3.11)


f_N^α denotes the estimator with the decision tree T pruned with parameter α, whereas f_{(-i)}^α is the estimator with the decision tree T_{(-i)}, trained with sample i excluded from the training set. The tree of optimal size is the tree T pruned with the parameter α* such that:

\alpha^* = \arg\min_{\alpha} J(\alpha)    (3.12)
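A minimal sketch of this model selection step, assuming two hypothetical helpers exist: fit_det(data, alpha), which returns a pruned tree, and loo_cv_score(tree, data), which evaluates the LOO-CV estimate J(alpha) of Eq. 3.11. Neither helper is defined here; the sketch only shows how alpha* would be chosen.

# Minimal sketch of Eq. 3.12: pick the alpha with the smallest LOO-CV score.
def select_alpha(data, candidate_alphas, fit_det, loo_cv_score):
    scores = {alpha: loo_cv_score(fit_det(data, alpha), data)
              for alpha in candidate_alphas}
    return min(scores, key=scores.get)      # alpha* = argmin_alpha J(alpha)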

3.5 Consistency

How do we know whether an estimator converges towards a useful result during training? We can find that out by proving its consistency. A consistent estimator is an estimator whose output converges towards the true values as the number of training samples goes to infinity; the variance of the error between the true and the estimated value goes to zero. Mathematically, convergence of a nonparametric estimator can be proven by showing that [2]:

\Pr\left( \lim_{N \to \infty} \int_{\mathcal{X}} \left( f_N(x) - f(x) \right)^2 dx = 0 \right) = 1    (3.13)

In this report I will only give the proof for the continuous case, based on [3]:

Theorem 1. The estimator f_N from Eq. 3.1 satisfies Eq. 3.13.

Proof. Let \mathcal{B} be the collection of all sets t ⊂ X that can be interpreted as the solution set of a system of k inequalities of the form b^T x ≤ c, where b ∈ R^d and c ∈ R. Each leaf l ∈ \tilde{T} of a decision tree T is the solution set of a system of k inequalities of the form b^T x ≤ c with b ∈ R^d, where b has exactly one entry equal to 1 and all others equal to 0. Hence, the partitions that are generated in a Density Estimation Tree are a subset of all possible solution sets of systems of inequalities of the above form, i.e. \tilde{T} ⊂ \mathcal{B}. Let X_i, i ≥ 1, be a random sample from a pdf f on X. For N ≥ 1, let F_N denote the empirical distribution of X_1, ..., X_N, defined on a set t ⊂ X by

F_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(X_i \in t) = \frac{|t|}{N} = \int_{t} f_N(x) \, dx    (3.14)

where |t| is the number of random samples in the set t ∩ {X_i | 1 ≤ i ≤ N}, i.e. all the training samples that fulfil the set of inequalities imposed by the node t and therefore fall into the corresponding partition of the sample space, and f_N(x) is the estimator given in Eq. 3.1. According to a general version of the Glivenko-Cantelli theorem [5], the following holds:

P\left( \lim_{N \to \infty} \sup_{t \in \mathcal{B}} \left| F_N(t) - \int_{t} f(x) \, dx \right| = 0 \right) = 1    (3.15)

To put it simply, this means that, with probability 1, the largest possible error between the estimated and the true distribution over any possible partition t converges to 0 as the sample size goes to infinity. By substituting the left term in the supremum using Eq. 3.14, we get:

P\left( \lim_{N \to \infty} \sup_{t \in \mathcal{B}} \left| \int_{t} f_N(x) \, dx - \int_{t} f(x) \, dx \right| = 0 \right) = 1
\;\Rightarrow\; P\left( \lim_{N \to \infty} \sup_{t \in \mathcal{B}} \int_{t} \left| f_N(x) - f(x) \right| dx = 0 \right) = 1    (3.16)

Now we make the assumption that if the sample size goes to infinity, the partition sizes become infinitesimally small, i.e. \lim_{N \to \infty} P(\mathrm{diameter}(t) \ge \varepsilon) = 0, because each leaf can only hold a bounded number of points. We come to the following conclusion with probability 1:

\lim_{N \to \infty} \sup_{t \in \mathcal{B}} \int_{t} \left| f_N(x) - f(x) \right| dx \le \lim_{N \to \infty} \left| f_N(x') - f(x') \right| \cdot \int_{t} dx = 0

for some x′ ∈ t. Therefore

P\left( \lim_{N \to \infty} \sup_{t \in \mathcal{B}} \int_{t} \left| f_N(x) - f(x) \right| dx = 0 \right) = 1
\;\Rightarrow\; P\left( \lim_{N \to \infty} \int_{\mathcal{X}} \left( f_N(x) - f(x) \right)^2 dx = 0 \right) = 1


Chapter 4

Comparison

To measure the accuracy of DETs compared to KDEs and histograms, several experiments were carried out in [3].

4.1 Adaptability within dimensions

The first experiment was to fit a DET, a histogram, a KDE and a local r-KDE (rodeo KDE, a special version of the Parzen-Window estimator) to a sample set of 1000 points. To have a ground truth to compare against, an artificial sample set was used, generated by a strongly skewed distribution given by

X \sim \sum_{i=0}^{7} \frac{1}{8} \, N\left( 3\left( \left( \tfrac{2}{3} \right)^i - 1 \right), \left( \tfrac{2}{3} \right)^{2i} \right).

The results can be seen in Fig. 4.1. The first thing to notice is that the DET has a smooth tail, whereas the other methods have wiggly tails due to their fixed bandwidth. As mentioned before, one main advantage of DETs is the adaptable bandwidth within dimensions: in areas with fast changes around the peak the partitions are very small, whereas in areas with slow changes above 0 the partitions get large. The other methods have to choose small windows, because otherwise they could not fit the peak, which makes the tail look noisy. If, on the other hand, one chose larger windows, the curves would become smoother but the peak would not show.
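For reference, the following Python sketch draws samples from the strongly skewed mixture above, assuming the second parameter of N(·, ·) denotes the variance, so component i has standard deviation (2/3)^i.

# Minimal sketch: sample from the strongly skewed normal mixture of Section 4.1.
import numpy as np

def sample_skewed(n, seed=0):
    rng = np.random.default_rng(seed)
    i = rng.integers(0, 8, size=n)                  # pick one of the 8 mixture components
    means = 3.0 * ((2.0 / 3.0) ** i - 1.0)
    stds = (2.0 / 3.0) ** i                         # sqrt of the variance (2/3)^(2i)
    return rng.normal(means, stds)

samples = sample_skewed(1000)                       # same sample size as in [3]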

4.2 Adaptability between dimensions

In the second experiment, two-dimensional data was used. In one dimension there was a mixture of beta distributions and in the other a simple uniform distribution, denoted by

X_1 \sim \tfrac{2}{3} B(1, 2) + \tfrac{1}{3} B(10, 10); \qquad X_2 \sim U(0, 1).

Figure 4.1: Density estimates obtained with 1000 training samples from the skewed distribution using DET, histogram and KDEs, compared to the ground truth. Image taken from [3].

The results are displayed in Fig. 4.2. This is a good example of adaptability between dimensions: the histogram and KDE methods fail to fit the irrelevant uniform dimension, whereas the DET performs very well in this setting.
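For reference, the following Python sketch draws the two-dimensional data of this experiment: a beta mixture in the first dimension and uniform noise in the second; the names are illustrative.

# Minimal sketch: sample from the beta-mixture x uniform data of Section 4.2.
import numpy as np

def sample_beta_uniform(n, seed=0):
    rng = np.random.default_rng(seed)
    pick = rng.random(n) < 2.0 / 3.0                       # mixture weights 2/3 and 1/3
    x1 = np.where(pick, rng.beta(1, 2, n), rng.beta(10, 10, n))
    x2 = rng.uniform(0.0, 1.0, n)                          # irrelevant uniform dimension
    return np.column_stack([x1, x2])

data = sample_beta_uniform(600)                            # 600 samples as in Fig. 4.2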

4.3 Importance of dimensions

Another intuitive feature of DETs is that the importance of the dimensions can easily be read off the tree. Dimensions with splits high up near the root node have a greater impact on the density than dimensions that are split further down. Additionally, dimensions that do not contribute to the density at all are never split, as can be seen in Fig. 4.2.

4.4 Outlier Detection

What happens when we have outliers in our data set? Outliers can roughly be defined as points lying far away from the clusters. If we run the DET algorithm on such a dataset, we get large partitions with very low density in the outlier regions. Therefore, to find outliers, we only have to look at large leaf nodes with a low density.
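The following Python sketch illustrates this heuristic, flagging leaves whose volume is large and whose estimated density is low; the Leaf structure and the thresholds are illustrative.

# Minimal sketch of the outlier heuristic: large leaves with low estimated density.
import numpy as np
from dataclasses import dataclass

@dataclass
class Leaf:
    lower: np.ndarray   # lower corner of the leaf's box
    upper: np.ndarray   # upper corner of the leaf's box
    count: int          # number of training points in the leaf

def outlier_leaves(leaves, n_total, min_volume, max_density):
    flagged = []
    for leaf in leaves:
        volume = np.prod(leaf.upper - leaf.lower)
        density = leaf.count / (n_total * volume)
        if volume >= min_volume and density <= max_density:
            flagged.append(leaf)
    return flagged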


Figure 4.2: Perspective plots of the density estimates obtained with 600 training samples from the mixture of beta × uniform distribution using DET, histogram and KDEs. Image taken from [3].

4.5 Speed

In [3], DETs were compared to KDEs using the fast approximate dual-tree algorithm with kd-trees. To estimate the optimal bandwidth for the KDEs, LOO-CV was used, which is the reason why the KDEs have a training time at all. The results showed that the training time is 2 to 4 times higher when using DETs. However, the query time is many orders of magnitude lower for DETs.


Chapter 5

Conclusion

The framework of Density Estimation Trees has some significant advantages over other nonparametric density estimators. DETs are defined by a simple set of rules which are intuitive and easily interpretable. It is possible to model not only continuous, but also ordinal and categorical dimensions. They also perform feature selection, a property seldom found among other nonparametric density estimators. New points can be evaluated cheaply, as in regression and classification trees, due to the simple tree structure. However, these advantages come at a cost. One problem that remains is the discontinuities along the borders of different partitions. This problem could be addressed, for instance, by smoothly interpolating the density values across neighbouring partitions using MLPs (multilayer perceptrons) or other such methods. Nevertheless, DETs can be applied to different kinds of data analysis such as outlier and anomaly detection. They could also be used in embedded systems due to their computational efficiency.


Bibliography

[1] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont, California, U.S.A., 1984.

[2] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

[3] Parikshit Ram and Alexander G. Gray. Density estimation trees. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 627–635, New York, NY, USA, 2011. ACM.

[4] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London, 1986.

[5] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
