


Scalable and provably accurate algorithms for differentially private distributed decision tree learning

Kai Wen Wang$^1$, Travis Dick$^2$, Maria-Florina Balcan$^3$

$^1$[email protected], $^2$[email protected], $^3$[email protected]
$^{1,3}$Carnegie Mellon University, $^2$University of Pennsylvania

Abstract

This paper introduces the first provably accurate algorithms for differentially private, top-down decision tree learning in the distributed setting (Balcan et al. 2012). We propose DP-TopDown, a general privacy-preserving decision tree learning algorithm, and present two distributed implementations. Our first method, NoisyCounts, naturally extends the single machine algorithm by using the Laplace mechanism. Our second method, LocalRNM, significantly reduces communication and added noise by performing local optimization at each data holder. We provide the first utility guarantees for differentially private top-down decision tree learning in both the single machine and distributed settings. These guarantees show that the error of the privately-learned decision tree quickly goes to zero provided that the dataset is sufficiently large. Our extensive experiments on real datasets illustrate the trade-offs of privacy, accuracy and generalization when learning private decision trees in the distributed setting.

1 Introduction

New technologies for edge computing and the Internet of Things (IoT) are driving computing toward the dispersed setting (Satyanarayanan 2017). A distributed learning algorithm has access to data of greater quantity and diversity than an algorithm learning from one data holder. Today, data-holding entities comprise mobile phones, schools, hospitals, companies, countries, and more. As an illustrative example, consider training a model for early detection of a new strain of malaria that has been tested in multiple hospitals across the country. Most hospitals alone would not have enough data to train an accurate model. Even if a hospital could locally train a model, the model would be biased toward, and only accurate for, that specific city; as such, it would perform poorly as a nationwide diagnostic tool. However, while the benefits of distributed learning are evident, privacy concerns can often deter cooperation.

Differential privacy (Dwork et al. 2006) is a strong guarantee of individual privacy with the additional benefit of encouraging generalization (Oneto, Ridella, and Anguita 2017). Differentially private mechanisms formally quantify the amount of leaked information and can greatly incentivize distributed learning with sensitive data. Another attractive feature of differential privacy is its robustness to post-processing: once the model is learned through a differentially private mechanism, arbitrary applications of, and post-processing to, the model can be performed without worrying about additional privacy leaks.

In this work, we investigate learning decision trees while preserving privacy in the distributed setting (Balcan et al. 2012). Decision trees are often used for their efficiency and interpretability: the learned structure of a decision tree is much more interpretable than that of other machine learning models such as deep neural networks. Motivated by the fact that constructing optimal decision trees is NP-complete (Laurent and Rivest 1976), greedy top-down learning algorithms are used in practice (e.g., ID3, C4.5 and CART (Mitchell 1997)). Top-down algorithms can be viewed as boosting algorithms that quickly drive their error to zero under the Weak Learning Assumption (Kearns and Mansour 1999).

In the distributed setting, any information communicated between data holders must be made differentially private, which limits the quantity and quality of information that can be published to learn the decision tree. Intermediate computations often cost privacy, which runs against the information minimization principle of differential privacy (Dwork and Roth 2014). Despite these constraints, we are able to prove theoretical guarantees that, under the Weak Learning Assumption, the error of the privately-learned tree quickly goes to zero with high probability provided that we have sufficient data.

We present DP-TopDown, a family of differentially private decision tree learning algorithms based on the TopDown algorithm (Kearns and Mansour 1999). The DP-TopDown procedure uses a differentially private subroutine, called PrivateSplit, to approximate the optimal split chosen by TopDown. Designing a private and distributed mechanism that satisfies the utility requirements of PrivateSplit is the main challenge of running DP-TopDown. In our first approach, NoisyCounts, each entity publishes noised counts that are aggregated to find the best splitting functions. Our second approach, LocalRNM, is a heuristic that performs local optimization at each entity to greatly reduce the amount of communication and noise compared to NoisyCounts.

Finally, we perform a comprehensive evaluation of our different implementations of DP-TopDown on real and relevant datasets. We provide plots that illustrate the privacy-accuracy trade-off for our algorithms and empirically show that LocalRNM performs better than NoisyCounts on most datasets. Motivated by our theoretical results, we also show that both the training and testing accuracy of the decision tree increase with more training data. On one dataset, we observe that our private algorithms can actually improve the performance of decision trees, which is similar in spirit to adding a bit of noise in non-convex optimization problems to escape local optima (Zhang, Liang, and Charikar 2017).

Our contributions are summarized as follows:

1. We propose DP-TopDown, a differentially private, top-down learning algorithm for decision trees, and present two implementations for the distributed setting.

2. We prove boosting-based utility guarantees for DP-TopDown. These are the first utility guarantees for differentially private top-down learning of decision trees in both the single machine and distributed settings.

3. We extensively evaluate DP-TopDown on real datasets. The code and instructions to reproduce our experimental results will be made available.

1.1 Related Works

Previous works have explored private top-down learning of decision trees for specific algorithms. (Blum et al. 2005) introduced the SuLQ framework and used it to implement ID3. (Friedman and Schuster 2010; Mohammed et al. 2011) demonstrated the effectiveness of the exponential mechanism for private ID3 and C4.5. Our approach generalizes these works and provides the first utility guarantees and sample complexity bounds for private, top-down decision tree learning, even in the distributed setting.

Apart from top-down decision tree learning (e.g., ID3 and CART), previous works have also considered random decision trees, where the tree structure is chosen entirely at random, even before seeing any data (Jagannathan, Pillaipakkamnatt, and Wright 2009). (Bojarski et al. 2014) studied techniques that average over collections of random decision trees (i.e., random forests) and gave guarantees on empirical and generalization errors. In practice, random forests are powerful models that sometimes outperform vanilla top-down learned decision trees. However, top-down learning performs well in many cases and generally leads to more interpretable trees than randomly constructed ones.

Previous works have used boosting to design improved differentially private algorithms for answering batches of queries (Dwork, Rothblum, and Vadhan 2010). Our work is different: we design differentially private boosting algorithms, in particular for top-down decision tree learning.

2 Preliminaries

We study learning problems where the goal is to map input examples from a space $X$ to a finite output space $Y$. We suppose that there is an unknown distribution $P$ defined over $X$, together with an unknown target function $f : X \to Y$. Given a sample $S$ of i.i.d. pairs $(x, f(x))$ with $x$ drawn from $P$, our goal is to learn an approximation to $f$.

We focus on the canonical problem of top-down decision tree learning. Each internal node is labeled by a splitting function $h : X \to \{0,1\}$ that "splits" the data according to $h(x)$, and the leaves of the tree are labeled by a class in $Y$. These splitting functions route each example $x \in X$ to exactly one leaf of the tree. That is, if the root of the tree has splitting function $h$, a point $x$ is routed to the left subtree if $h(x) = 0$ and to the right subtree if $h(x) = 1$. The prediction made by a decision tree $T$ for an input $x$ is the label of the leaf that $x$ is routed to. We let $H \subseteq \{0,1\}^X$ denote the class of splitting functions that may be placed at each internal node, and denote the error of a decision tree $T$ by $\varepsilon(T) = \Pr(T(x) \neq f(x))$, where $x$ is sampled from $P$.

2.1 Top-down Decision Tree Learning

In this section we describe the high-level structure of (non-private) top-down decision tree learning algorithms. Top-down learning algorithms construct decision trees by repeatedly "splitting" a leaf node in the tree built so far. On each iteration, the algorithm greedily finds the leaf and splitting function that maximally reduce an upper bound on the error of the tree. The upper bound (see Equation (1)) on error is induced by a splitting criterion $G : [0,1] \to [0,1]$ that is concave, symmetric about $\frac{1}{2}$, and that satisfies $G(\frac{1}{2}) = 1$ and $G(0) = G(1) = 0$. The selected leaf is replaced by an internal node labeled with the chosen splitting function, which partitions the data at the node into two new children leaves. Once the tree is built, each leaf is labeled by the most common class among the examples that reach it. Many effective and widely-used decision tree learning programs are instances of the TopDown algorithm: for example, C4.5 uses entropy as its splitting criterion while CART uses the Gini index.

The splitting criterion $G$ provides an upper bound on the error of a decision tree $T$. The error can be written as $\varepsilon(T) = \sum_{\ell \in \text{leaves}(T)} w(\ell) \min\{q(\ell), 1 - q(\ell)\}$, where $w(\ell) = \Pr_P[x \text{ reaches leaf } \ell]$ is the fraction of data that reaches $\ell$ and $q(\ell) = \Pr_P[f(x) = 1 \mid x \text{ reaches leaf } \ell]$ is the fraction of that data with label 1. For all $z \in [0,1]$, the splitting criterion satisfies $G(z) \geq \min\{z, 1-z\}$, and therefore the following potential function is an upper bound on the error $\varepsilon(T)$ of a decision tree $T$:

$$G(T) = \sum_{\ell \in \text{leaves}(T)} w(\ell)\, G(q(\ell)). \qquad (1)$$

With this, we can formally describe one iteration of top-down decision tree learning. For each leaf $\ell \in \text{leaves}(T)$, let $S_\ell \subseteq S$ denote the subset of data that reaches leaf $\ell$. For each splitting function $h \in H$, let $T(\ell, h)$ denote the tree obtained from $T$ by replacing $\ell$ with an internal node that splits $S_\ell$ into two children leaves $\ell_0, \ell_1$, where any data satisfying $h(x) = i$ goes to $\ell_i$. At each round, TopDown chooses the split-leaf pair $(h^*, \ell^*)$ that maximizes the decrease in the potential, $G(T) - G(T(\ell, h)) = w(\ell) J(\ell, h)$, where

$$J(\ell, h) = G(q(\ell)) - \frac{|S_{\ell_0}|}{|S_\ell|} G(q(\ell_0)) - \frac{|S_{\ell_1}|}{|S_\ell|} G(q(\ell_1)).$$

Kearns and Mansour showed that (non-private) top-down decision tree learning can be viewed as a form of boosting (Kearns and Mansour 1999). Whenever the class of splitting functions $H$ used to label internal nodes satisfies the Weak Learning Assumption (i.e., its functions can predict the target $f$ better than random guessing for any distribution on $X$), top-down learning algorithms amplify these slight advantages and quickly drive the training error of the tree to zero. Formally, the Weak Learning Assumption is as follows:

Definition 1. A function class $H \subseteq \{0,1\}^X$ $\gamma$-satisfies the Weak Learning Assumption for a target function $f : X \to \{0,1\}$ if for any distribution $P$ over $X$, there exists $h \in H$ such that $\Pr_{x \sim P}(h(x) \neq f(x)) \leq \frac{1}{2} - \gamma$. Commonly, $\gamma$ is referred to as the advantage.

We restate the result of (Kearns and Mansour 1999) when $G$ is entropy: $G(q) = -q\log(q) - (1-q)\log(1-q)$.

Theorem 2. Let $G(q)$ be entropy. Under the Weak Learning Assumption, for any $\varepsilon \in (0,1]$, it suffices for TopDown to make $(1/\varepsilon)^{O(\log(1/\varepsilon)/\gamma^2)}$ splits to achieve training error $\leq \varepsilon$.
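To make these quantities concrete, the following minimal Python sketch computes the entropy criterion $G$ and the split gain $J(\ell, h)$ from the labels at a leaf. The function names and array-based interface are our own illustration, not the paper's released code.

```python
import numpy as np

def entropy(q):
    """Splitting criterion G(q): binary entropy, with G(1/2) = 1 and G(0) = G(1) = 0."""
    return 0.0 if q <= 0 or q >= 1 else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def split_gain(S_y, h_vals, G=entropy):
    """J(l, h) = G(q(l)) - |S_l0|/|S_l| G(q(l0)) - |S_l1|/|S_l| G(q(l1)).
    `S_y` is the array of 0/1 labels at the leaf; `h_vals` holds h(x) for
    each of those points."""
    S_y, h_vals = np.asarray(S_y), np.asarray(h_vals)
    n = len(S_y)
    gain = G(S_y.mean())              # G(q(l)); q = fraction with label 1
    for b in (0, 1):                  # children l_0 and l_1
        mask = (h_vals == b)
        if mask.any():
            gain -= (mask.sum() / n) * G(S_y[mask].mean())
    return gain
```

For example, a split that perfectly separates the two labels at a balanced leaf attains the maximum gain $J = G(\frac{1}{2}) = 1$, while an uninformative split gives a gain near zero.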

2.2 Differential Privacy

Differential privacy is the de facto standard notion of privacy, and it enjoys useful composition theorems (Dwork et al. 2006). The formal notion is as follows. We say two datasets $S$ and $S'$ are neighboring if they differ on at most one point. A randomized algorithm $A$ is $\alpha$-differentially private if for any neighboring datasets $S, S'$ and outcomes $O \subseteq \text{Range}(A)$, we have

$$\Pr(A(S) \in O) \leq e^{\alpha} \Pr(A(S') \in O),$$

where the probability space is over the coin flips of $A$.

Our decision tree algorithms will make use of two standard differentially private components. First, the Laplace mechanism (LM) can be used to privately evaluate a function $f$ mapping datasets to real numbers, while preserving $\alpha$-differential privacy. Given a dataset $S$, the Laplace mechanism with privacy parameter $\alpha$ outputs $f(S) + Z$, where $Z$ is drawn from the Laplace distribution with scale parameter $\Delta f / \alpha$ and $\Delta f$ is the sensitivity of $f$:

$$\Delta f = \max_{S, S' \text{ neighboring}} |f(S) - f(S')|.$$

Next, we will sometimes use the Report Noisy Max (RNM) mechanism to privately find the best scoring function from a collection of $k$ scoring functions $f_1, \ldots, f_k$ mapping datasets to real numbers. Given a dataset $S$, RNM with privacy parameter $\alpha$ computes $\hat{f}_i = f_i(S) + Z_i$, where each $Z_i$ is drawn from the Laplace distribution with scale parameter $\max_{i=1,\ldots,k} 2\Delta f_i / \alpha$. It then outputs $i^* = \operatorname{argmax}_i \hat{f}_i$ together with the estimate $\hat{f}_{i^*}$. Importantly, RNM does not publish the estimated scores $\hat{f}_i$ for $i \neq i^*$, allowing it to add less noise than applying $k$ LMs. The RNM mechanism with privacy parameter $\alpha$ preserves $\alpha$-differential privacy.

Our decision tree learning algorithms will apply LM and RNM many times, with varying privacy parameters. Standard composition theorems, presented in Appendix A.1, guarantee that the overall algorithm is still $\alpha$-differentially private (Dwork et al. 2006; McSherry 2009).
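As a concrete illustration, here is a minimal Python sketch of the two mechanisms described above. The function names and signatures are assumptions of this sketch rather than a reference implementation; `sensitivity` plays the role of $\Delta f$ (for RNM, of $\max_i \Delta f_i$).

```python
import numpy as np

def laplace_mechanism(value, sensitivity, alpha, rng=np.random.default_rng()):
    """LM: release f(S) + Z with Z ~ Lap(sensitivity / alpha),
    preserving alpha-differential privacy."""
    return value + rng.laplace(scale=sensitivity / alpha)

def report_noisy_max(scores, sensitivity, alpha, rng=np.random.default_rng()):
    """RNM: noise every score with Lap(2 * sensitivity / alpha), but
    release only the argmax index and its noisy score."""
    noisy = [s + rng.laplace(scale=2 * sensitivity / alpha) for s in scores]
    i_star = int(np.argmax(noisy))
    return i_star, noisy[i_star]
```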

2.3 Distributed setting with privacy constraints

In this work, we consider the distributed learning setting with privacy constraints (Balcan et al. 2012). We suppose there are $k$ entities (i.e., data holders) and each entity $i \in [k]$ must preserve $\alpha$-differential privacy for its subset $S_i$ of the data. In general, we make no assumptions on how the $S_i$ partition $S$ in terms of size and distribution. An example of the distributed setting is hospitals with private patient data. Patients entrust their sensitive information to their local hospital, but the response to any query from outside the hospital must preserve privacy on behalf of the patients. Note that this notion is not local differential privacy (Warner 1965; Cormode et al. 2018), in which the patients would perturb their data locally before giving it to the hospital.

We remark that a distributed learning algorithm could also use cryptographic protocols such as secure multi-party computation to preserve privacy while learning (Evans et al. 2018). Assuming reasonable computational limitations on the adversary, cryptographic privacy leaks zero bits of information during the training process. Differential privacy, on the other hand, is an information-theoretic guarantee that bounds the amount of leaked information and is robust to post-processing. In this work, we focus on differential privacy protocols (Balcan et al. 2012), since they tend to be more communication-efficient and practical than cryptographic techniques.

3 Differentially Private TopDown

In this section, we propose a private, top-down decision tree learning algorithm called DP-TopDown that is based on the TopDown algorithm (Kearns and Mansour 1999). The key step of DP-TopDown that depends on the sensitive dataset $S$ is choosing the optimal split for a given leaf $\ell$. Since there are many ways to do this, we encapsulate this step in a function called PrivateSplit that satisfies the following privacy and utility requirements, which are sufficient to prove privacy and utility guarantees for DP-TopDown:

1. PrivateSplit$(\ell, \alpha, \delta)$ outputs a pair $(\hat{h}, \hat{J})$ where $\hat{h} \in H$ and $\hat{J} \in \mathbb{R}$, while preserving $\alpha$-differential privacy;

2. $\forall \zeta > 0$, $\exists N := N(\zeta, \alpha, \delta)$ such that if $|S_\ell| \geq N$, then $J(\ell, \hat{h}) \geq \max_h J(\ell, h) - \zeta$ and $|J(\ell, \hat{h}) - \hat{J}| \leq \zeta$ with probability $\geq 1 - \delta$. In words, PrivateSplit finds nearly optimal splits with high probability.

Pseudocode for DP-TopDown is given in Algorithm 1. It takes as input the dataset $S$, the privacy parameter $\alpha$, the maximum number $M$ of internal nodes (excluding leaves) in the tree, the desired error $\varepsilon \in (0,1]$, and the failure probability $\delta \in (0,1]$. We use a budget function $B(d)$ to assign a fraction of $\alpha$ to data queries at depth $d$ in the tree. Since the nodes at depth $d$ partition the data at that layer, together they preserve $B(d)$-differential privacy by parallel composition (Theorem 9). These parameters are used in Theorem 3.

Privacy guarantee: The privacy analysis follows from the composition theorems. We use half of the total privacy budget for labeling the leaves on line 14 using the Laplace mechanism, which preserves $\frac{\alpha}{2}$-differential privacy by parallel composition (Theorem 9) since the leaves partition $S$. The rest of DP-TopDown preserves $\frac{\alpha}{2}$-differential privacy for any appropriate budget function $B$ that satisfies $\sum_{\text{all depths } d} B(d) \leq 1$.


Algorithm 1 DP-TopDown

1: Input: dataset $S$, max size $M$, error $\varepsilon$, privacy $\alpha$, failure probability $\delta$
2: Initialize $T$ to a single leaf $\ell$ and $Q$ to a max-priority queue
3: $(h, \hat{J}) \leftarrow \text{PrivateSplit}(\ell, \frac{\alpha}{2}B(1), \frac{\delta}{2(2M+1)})$
4: $Q.\text{push}(\hat{J}, (h, \ell))$  [add $(h, \ell)$ to $Q$ with priority $\hat{J}$]
5: for $t = 1, 2, \ldots, M$ do
6:   $(\ell^*, h^*) \leftarrow Q.\text{pop}()$  [get max element in $Q$]
7:   $T \leftarrow T(\ell^*, h^*)$ with children leaves $\ell_0, \ell_1$
8:   for $i = 0, 1$ do
9:     $\alpha_\ell \leftarrow \frac{\alpha}{2}B(\text{depth}(\ell_i))$  [budget for leaf]
10:    $\hat{w}_i \leftarrow \frac{|S_{\ell_i}|}{|S|} + \text{Lap}(\frac{2}{|S|\alpha_\ell})$  [LM with budget $\frac{\alpha_\ell}{2}$]
11:    $(h_i, \hat{J}_i) \leftarrow \text{PrivateSplit}(\ell_i, \frac{\alpha_\ell}{2}, \frac{\delta}{2(2M+1)})$
12:    if $\hat{w}_i \geq \frac{\varepsilon}{M}$ then
13:      $Q.\text{push}(\hat{w}_i \cdot \hat{J}_i, (\ell_i, h_i))$
14: Label leaves by majority label  [RNM with budget $\frac{\alpha}{2}$]
15: return $T$

One privacy budgeting approach is to uniformly split the budget across all possible depths with $B(d) = \frac{1}{M}$; this gives us the tightest sample complexity bounds for Theorem 3. Another approach is to set $B(d) = \frac{1}{2^d}$. This exponentially decaying budgeting is motivated by the intuition that early splits in the tree are more important than later ones. Empirically, we found that this strategy gives higher accuracy than uniform budgeting.
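The following Python sketch illustrates the control flow of Algorithm 1 together with the two budget functions just described. It is a schematic under assumed interfaces: `private_split` stands in for PrivateSplit (with the failure probability $\delta$ folded into it), the leaf object's `n`, `depth`, and `split` members are hypothetical, `rng` is e.g. `numpy.random.default_rng()`, and tree bookkeeping and leaf labeling (line 14) are elided.

```python
import heapq
import itertools

def uniform_budget(d, M):
    """B(d) = 1/M: uniform across all possible depths."""
    return 1.0 / M

def decay_budget(d, M):
    """B(d) = 2^{-d}: exponentially decaying with depth."""
    return 0.5 ** d

def dp_topdown(root, private_split, n_total, M, eps, alpha, B, rng):
    """Schematic of Algorithm 1; leaf labeling (line 14) is elided."""
    heap, tick = [], itertools.count()   # max-priority queue via negation
    h, J = private_split(root, 0.5 * alpha * B(1, M))
    heapq.heappush(heap, (-J, next(tick), root, h))
    for _ in range(M):
        if not heap:
            break
        _, _, leaf, h = heapq.heappop(heap)            # line 6
        for child in leaf.split(h):                    # children l_0, l_1
            a_leaf = 0.5 * alpha * B(child.depth, M)   # line 9
            # Line 10: noisy weight estimate via the Laplace mechanism.
            w = child.n / n_total + rng.laplace(scale=2.0 / (n_total * a_leaf))
            if w >= eps / M:                           # line 12
                h_c, J_c = private_split(child, 0.5 * a_leaf)
                heapq.heappush(heap, (-w * J_c, next(tick), child, h_c))
    # Line 14 (not shown): label each leaf privately with budget alpha/2.
```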

Utility guarantee: Intuitively, for any fixed $\alpha$, the noise from preserving $\alpha$-differential privacy becomes negligible with high probability as the size of the dataset grows to infinity. The following theorem formalizes this intuition for DP-TopDown by providing general utility guarantees and sample complexity bounds, which are applicable in both the single machine and distributed settings. Under the Weak Learning Assumption, the error of the privately-learned tree quickly goes to zero provided that the dataset is large enough. The required dataset size can be bounded in terms of the properties of the budgeting strategy $B$ and of PrivateSplit.

Theorem 3. Let $G(q)$ be entropy. Let $M, \alpha, \varepsilon, \delta$ be inputs to DP-TopDown, as defined in Algorithm 1. Under the Weak Learning Assumption with advantage $\gamma$, it suffices for DP-TopDown to make $(1/\varepsilon)^{O(\log(1/\varepsilon)/\gamma^2)}$ splits to achieve training error $\leq \varepsilon$ with probability $\geq 1 - \delta$, provided the dataset $S$ has size at least

$$|S| \geq \max\left( O\!\left(\frac{M}{\gamma^2 \varepsilon \alpha b}\right),\; \frac{2M}{\varepsilon} N\!\left(O\!\left(\frac{\gamma^2 \varepsilon}{M}\right), \frac{\alpha b}{4}, O\!\left(\frac{\delta}{M}\right)\right) \right),$$

where $b = \min_{1 \leq d \leq M} B(d)$ and $N(\zeta, \alpha, \delta)$ is the data requirement of PrivateSplit.

Proof Sketch. Let $T_t$ denote the decision tree constructed by DP-TopDown at the beginning of the $t$-th iteration. The boosting analysis of (Kearns and Mansour 1999) has two key steps. First, they argue that during the $t$-th iteration, there must be a leaf $\ell$ of the tree $T_t$ whose weight is at least $\varepsilon(T_t)/t$. Next, they leverage the Weak Learning Assumption to guarantee that the optimal split for that leaf significantly reduces the potential $G_t$. This guarantees that after a small number of iterations, the error cannot be large.

There are two main challenges in extending this analysis to the DP-TopDown algorithm. First, since DP-TopDown only splits leaves whose estimated weight is large, there is some risk that it will not consider splitting the leaf used in the analysis of Kearns and Mansour. Second, even if that leaf is considered, the PrivateSplit algorithm only produces nearly optimal splits for that leaf when there is enough data reaching it (recall that PrivateSplit returns a $\zeta$-optimal split with probability at least $1 - \delta$, provided that there are at least $N(\zeta, \alpha, \delta)$ data points in the leaf).

The term $O(\frac{M}{\gamma^2 \varepsilon \alpha b})$ in our dataset size requirement is chosen to ensure that the error in the estimated weight of every leaf during the run of DP-TopDown is bounded by $\frac{\gamma^2 \varepsilon}{M}$ with high probability. This guarantees that the leaf used in the analysis of Kearns and Mansour will be considered by DP-TopDown with high probability. Moreover, every leaf on which PrivateSplit is applied has non-trivial weight. This term also ensures that the total error from labeling leaves is less than $\frac{\varepsilon}{2}$.

A consequence of the above argument is that DP-TopDown only runs the PrivateSplit algorithm on leaves with weight at least $\Omega(\frac{\varepsilon}{M})$. In other words, these leaves contain at least $\Omega(\frac{|S|\varepsilon}{M})$ data points. The term $\frac{2M}{\varepsilon} N(O(\frac{\gamma^2\varepsilon}{M}), \frac{b\alpha}{2}, O(\frac{\delta}{M}))$ in our dataset size requirement is chosen to guarantee that the number of data points in any leaf on which we run PrivateSplit is large enough to ensure that it returns a $\zeta$-optimal split, where $\zeta$ is at most half the reduction in $G(T)$ guaranteed by the split constructed in the Kearns and Mansour analysis. This ensures that our algorithm reduces the potential $G(T)$ by at least half of the guarantee of the non-private algorithm, which is sufficient to prove our utility guarantee.

We note that our proof technique generalizes to the other splitting criteria analyzed by (Kearns and Mansour 1999), such as the Gini index $G(q) = 4q(1-q)$ and $G(q) = 2\sqrt{q(1-q)}$. We also note that the only assumption on $H$ in Theorem 3 is the Weak Learning Assumption; in particular, there are no assumptions on the size of $H$. In the rest of the paper, we present implementations of PrivateSplit that iterate through $H$, which is why our corollaries depend on $|H|$ being finite.

SingleMachine algorithm: RNM. In the single machine setting, when the class of splitting functions $H$ is finite, we can simply use RNM to select an approximately optimal split from $H$. In particular, if we define $f_i(S_\ell) = J(\ell, h_i)$ for each $h_i \in H$ and use RNM to choose the optimal split, we satisfy the utility requirement of PrivateSplit as follows.

Lemma 4. Single machine RNM satisfies the PrivateSplit requirements with $N(\zeta, \alpha, \delta) = O(\frac{1}{\alpha\zeta})$.

Proof Sketch. In Lemma 11 in the Appendix, we show that the sensitivity of the function $f_i(S_\ell) = J(\ell, h_i)$ is bounded by $O(\log|S_\ell| / |S_\ell|)$. Let $\hat{f}_i = f_i(S_\ell) + Z_i$, where each $Z_i$ is drawn from the Laplace distribution with scale parameter $O(\frac{\log|S_\ell|}{\alpha|S_\ell|})$. With probability at least $1 - \delta$, we have $|Z_i| = O(\frac{\log|S_\ell|}{\alpha|S_\ell|}\log\frac{|H|}{\delta})$ simultaneously for all $i$. When $|S_\ell| \geq \Omega(\frac{1}{\alpha\zeta})$, the errors in all estimates made by the RNM mechanism are bounded by $\zeta/2$, which proves the claim.
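Below is a minimal sketch of this RNM-based PrivateSplit, reusing the `split_gain` function from the Section 2.1 sketch. The interface (a label array plus precomputed $h_i(x)$ values for each split) is an assumption of this illustration, and $\delta$ enters only through the dataset-size requirement, so it does not appear as an argument.

```python
import numpy as np

def private_split_rnm(S_y, split_values, alpha, rng):
    """Single machine PrivateSplit via RNM (sketch). `S_y` holds the 0/1
    labels reaching the leaf; `split_values[i]` holds h_i(x) for each
    point. Uses the sensitivity bound Delta J' <= (2/m)(3 lg m + 1)
    of Lemma 11 (Appendix A.2)."""
    m = len(S_y)
    sens = (2.0 / m) * (3 * np.log2(m) + 1)
    # RNM: noise each score J(l, h_i) with scale 2 * sensitivity / alpha,
    # then publish only the argmax index and its noisy score.
    noisy = [split_gain(S_y, hv) + rng.laplace(scale=2 * sens / alpha)
             for hv in split_values]
    i_star = int(np.argmax(noisy))
    return i_star, noisy[i_star]
```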


Corollary 5. Under uniform budgeting, using RNM as PrivateSplit requires $|S| \geq O(\frac{M^3}{\varepsilon^2\gamma^2\alpha})$ for Theorem 3.

We remark that another possible private implementation is to use the exponential mechanism, which recovers the algorithms in (Friedman and Schuster 2010; Mohammed et al. 2011).

Distributed setting with privacy constraint. Weight estimation on line 10 can be done in a distributed fashion by having each entity estimate the number of its data points at a particular leaf with LM. Leaf labeling can be done by first estimating, for each label, the number of data points at a particular leaf using LM, and then finding the label that maximizes the aggregated counts. The sample complexity of Theorem 3 accrues a factor of $k$ in the first term, as formally quantified in Corollary 16 of the Appendix. Therefore, in the distributed setting we can run DP-TopDown given a distributed PrivateSplit$(\ell, \alpha, \delta)$. We now present two methods.

Distributed algorithm 1: NoisyCounts. For each splitting function $h_i \in H$, each entity $j \in [k]$ uses LM to publish noisy estimated counts $\hat{c}^j_{y=a, h_i=b}$ of the number of data points at leaf $\ell$ with label $a$ and splitting value $b$. The coordinator then uses the aggregated noisy counts to compute each $J(\ell, h_i)$ and locate the maximum. The following lemma follows from the utility guarantees of LM.

Lemma 6. NoisyCounts satisfies the PrivateSplit requirements with $N(\zeta, \alpha, \delta) = O(\frac{k|H|}{\alpha\zeta})$.

Proof Sketch. The proof has two main steps: first, we bound the total error in the noisy counts collected by the coordinator; second, we argue that when the counts are sufficiently accurate, the chosen splitting function is nearly optimal. The collection of counts associated with any splitting function $h_i \in H$ forms a histogram query, and all of its counts can be approximated while satisfying $\alpha'$-differential privacy using the Laplace mechanism by adding $\text{Lap}(\frac{1}{\alpha'})$ noise to each count. We approximate one histogram query for each $h_i \in H$, so we set $\alpha' = \frac{\alpha}{|H|}$ so that the total privacy cost is $\alpha$. Next, using tail bounds for Laplace random variables, we know that with high probability the maximum error in any of the counts constructed by a single entity is bounded by $O(\frac{|H|}{\alpha})$. Therefore, each aggregated count obtained by the coordinator has error at most $O(\frac{k|H|}{\alpha})$, accumulated over the $k$ entities. Finally, we show that when $|S_\ell| \geq O(\frac{k|H|}{\alpha\zeta})$, the counts are accurate enough to guarantee a $\zeta$-optimal split.

Corollary 7. Under uniform budgeting, using NoisyCounts as PrivateSplit requires $|S| \geq O(\frac{M^3 k|H|}{\varepsilon^2\gamma^2\alpha})$ for Theorem 3.

Compared to the single machine setting, the bound on $N$ for NoisyCounts increases by a factor of $k|H|$. This is the consequence of having each entity publish all the counts so that $J(\ell, h_i)$ can be approximated for every $h_i \in H$, whereas we only need the maximum. Therefore, each of the $k$ entities calls LM with a factor of $|H|$ more noise. Note that unlike weight estimation and leaf labeling, which extend easily to the distributed setting while accruing only a factor of $k$, the natural method of NoisyCounts also accrues a factor of $|H|$, since there are $|H|$ splitting functions to try. This is not ideal, since $|H|$ can be much larger than $k$ in practice.
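The sketch below illustrates one round of NoisyCounts under the assumed interfaces of the earlier sketches: each entity publishes one noisy 2 × 2 histogram (label × split value) per $h_i$ with per-histogram budget $\alpha/|H|$, and the coordinator recomputes each $J(\ell, h_i)$ from the aggregated counts. Clipping noisy counts away from zero is a practical guard of this illustration, not part of the analysis.

```python
import numpy as np

def binary_entropy(q):
    return 0.0 if q <= 0 or q >= 1 else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def publish_noisy_counts(labels, split_values, alpha, rng):
    """One entity's step: noisy counts c^j_{y=a, h_i=b}, one 2x2 histogram
    per splitting function, each with budget alpha / |H|."""
    num_h = len(split_values)
    scale = num_h / alpha                    # Lap(1/alpha') with alpha' = alpha/|H|
    counts = np.zeros((num_h, 2, 2))
    for i, hv in enumerate(split_values):
        for a in (0, 1):                     # label
            for b in (0, 1):                 # split value
                c = np.sum((labels == a) & (hv == b))
                counts[i, a, b] = c + rng.laplace(scale=scale)
    return counts

def coordinator_pick(all_entity_counts):
    """Coordinator: aggregate counts over entities, rebuild each J(l, h_i)
    from the aggregated histogram, and return the best split's index and
    its estimated gain."""
    agg = np.sum(all_entity_counts, axis=0)  # shape (|H|, 2, 2)
    best_i, best_j = 0, -np.inf
    for i, tab in enumerate(agg):
        tab = np.clip(tab, 1e-9, None)       # guard against noisy negatives
        n = tab.sum()
        j = binary_entropy(tab[1].sum() / n)
        for b in (0, 1):
            nb = tab[:, b].sum()
            j -= (nb / n) * binary_entropy(tab[1, b] / nb)
        if j > best_j:
            best_i, best_j = i, j
    return best_i, best_j
```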

Distributed algorithm 2: LocalRNM. To reduce communication and to increase the privacy budget for each data query, we propose the following heuristic improvement over NoisyCounts. With budget $\frac{\alpha}{2}$, each entity $i$ uses RNM to compute the locally-best splitting function $h^*_i$ for its own data $S^i_\ell$ at leaf $\ell$. Then, with budget $\frac{\alpha}{2}$, the coordinator runs the NoisyCounts procedure to find the best splitting function over the set of locally-best splitting functions $\tilde{H} = \{h^*_i\}_{i \in [k]}$ rather than the entire $H$. In LocalRNM, the noise of each data query scales with the number of entities $k$ instead of the number of splitting functions $|H|$. Since the amount of injected noise is in general much smaller, LocalRNM can achieve higher accuracy than NoisyCounts. However, locally-best splitting functions, especially those from entities with small datasets, can perform poorly on the union of the data, which is why LocalRNM does not satisfy the utility requirements of PrivateSplit. Nevertheless, LocalRNM performs noticeably better than NoisyCounts on four out of seven of our chosen datasets, which suggests that at least one of the locally-best splitting functions tends to be good enough in practice.
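A sketch of the two phases of LocalRNM, composed from the `private_split_rnm`, `publish_noisy_counts`, and `coordinator_pick` sketches above; the `entities` list of (labels, split-values) pairs is an assumed interface.

```python
def local_rnm_split(entities, alpha, rng):
    """Phase 1: each entity spends alpha/2 nominating its locally-best
    split via RNM. Phase 2: NoisyCounts with budget alpha/2, restricted
    to the (at most k) nominated splits instead of all of H."""
    nominees = [private_split_rnm(y, hv, alpha / 2, rng)[0]
                for (y, hv) in entities]
    candidates = sorted(set(nominees))
    all_counts = [publish_noisy_counts(y, [hv[i] for i in candidates],
                                       alpha / 2, rng)
                  for (y, hv) in entities]
    best, j_hat = coordinator_pick(all_counts)
    return candidates[best], j_hat
```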

4 Experimental evaluation

We evaluated DP-TopDown on seven datasets, described below, that exhibit a wide spectrum of use cases where privacy could be of concern in the distributed setting. We pre-processed categorical features with one-hot encoding. The splitting class for each dataset comprised evenly spaced thresholds on each feature, with a single threshold at 0.5 for the one-hot encoded features.

Adult: US Census data for predicting whether one's income exceeds $50K (Dua and Graff 2017). It has 32,561 entries in total.
Bank: Predict subscription behavior based on client information from marketing campaigns (Moro, Cortez, and Rita 2014). It has 45,211 entries in total.
KDDCup 1999: Classify network connections as malicious or normal (Cup 1999). It has 494,021 entries in total.
MNIST: Greyscale 28 × 28 images of handwritten digits (LeCun 1998). The training set consists of 60,000 images and the test set consists of 10,000 images.
Creditcard: Client payment information for predicting default (Yeh and Lien 2009). It has 30,000 entries in total.
Avazu CTR: Predict whether a mobile ad is clicked, based on anonymized ad information (Avazu 2015). It has 1,100,000 entries in total.
Skin: Classify RGB pixels as skin or non-skin (Bhatt and Dhall 2010). It has 245,057 entries in total.

With the exception of MNIST, which has 60,000 training images and 10,000 test images, each dataset was split into fixed training and testing sets with a 9 : 1 size ratio. Adult, Bank, Creditcard and KDDCup all used 10 evenly spaced thresholds per continuous feature, giving splitting classes of sizes 159, 115, 210 and 427, respectively. Skin used 32 evenly spaced thresholds, giving a splitting class of size 96. MNIST and CTR had splitting classes of sizes 147 and 263, respectively; we defer their descriptions to Appendix A.7.


Figure 1: Test accuracy for privacy parameters $\alpha \in \{2^i \mid i = -3, -2, \ldots, 9\}$ on SingleMachine, NoisyCounts and LocalRNM. Baseline runs without noise are shown as solid lines in the same color as the corresponding dotted lines, which represent noised runs.

Our experiments used the entropy splitting criterion, the decay budgeting strategy, $M = 512$, and $\varepsilon = 0.1$. Data was divided among four entities uniformly at random. When $\hat{J} \leq 0.01$, we did not push the leaf onto $Q$, since this suggested that the leaf was mostly homogeneous or that the entropy estimate was too noisy. Our results are averaged over 100 independent runs. Shaded error bars denote one standard deviation in the mean, which is tiny in most plots.

Figure 1 shows the trade-off between privacy and testing accuracy (i.e., privacy curves) for the three private DP-TopDown algorithms presented in this paper. The (dotted) accuracy of the private algorithms increases as a function of the privacy parameter $\alpha$ and eventually converges to the (solid) non-private baseline with no noise. As expected, single machine RNM performs best. In the distributed setting, LocalRNM, while lacking utility guarantees, performs consistently better than NoisyCounts on Adult, Bank, Creditcard and Skin. Both trends agree with our intuition that injecting noise should result in worse performance.

For CTR, the test accuracy of the private algorithms is noticeably higher than that of their non-private counterparts for a range of $\alpha$ values. This counter-intuitive result shows the benefit of injecting noise into models that are prone to overfitting, such as decision trees. We found that the non-private decision trees had higher error due to some unlucky greedy choices, which were avoided by adding small amounts of noise through our private mechanisms. This is similar in spirit to adding noise in non-convex optimization problems to escape local optima (Zhang, Liang, and Charikar 2017). Of course, when $\alpha$ is too small, the decision tree inevitably has low accuracy, since the noise is so large that the learning algorithm cannot even correctly label the leaves. This observation suggests that when designing private greedy algorithms, there is a trade-off not just between privacy and accuracy, but also generalization.

Figure 2 shows the effect of training dataset size on the test-accuracy privacy curves. The curve shapes are regular, and larger training sizes consistently result in better test accuracy. We also observe that smaller dataset sizes generally lead to higher variance in the error bars, which is expected since the injected noise has a greater relative effect on the data queries. For the privacy curves of small datasets, the benefit of injecting noise for generalization is now very clear. For example, NoisyCounts's Bank plot shows that the (dotted) private curves for 0.01 and 0.05 consistently outperform the (solid) non-private baselines trained on the same subsets of data. These results confirm the intuition from Theorem 3 and its corollaries: DP-TopDown enjoys the same utility guarantees as TopDown when trained on large datasets.

We present more interesting results in Appendix A.8. Figure 4 shows that training on larger datasets also leads to higher training accuracy, which is counter-intuitive since smaller datasets are easier to overfit. Figure 3 plots the depths and sizes of the trees learned in Figure 1.

5 Conclusion

In this paper, we unify the theory and practice of differentially private top-down decision tree learning in the distributed setting. We provide the first utility guarantees for such algorithms, as well as a comprehensive experimental analysis on seven relevant datasets. Our design of DP-TopDown reduces the problem of private top-down learning to the key challenge of approximating optimal splits with PrivateSplit. Using Theorem 3, it is easy to derive utility guarantees for other implementations of PrivateSplit. Together with the code that will be made available, this is a great starting point for applying scalable and provably accurate differentially private decision trees in the distributed setting.

Acknowledgements: This work was supported in part by NSF grants CCF-1535967, IIS-1618714, CCF-1910321, SES-1919453, and an Amazon Research Award.


[Figure 2 panels: LocalRNM, NoisyCounts, single machine]

Figure 2: Each plot compares how test accuracy varies with the privacy parameter $\alpha$ for a fixed algorithm trained on differently sized training sets. For example, a 0.01 curve was trained on 1% of the total dataset.


References

Avazu. 2015. Avazu click-through rate prediction dataset. Kaggle competition at https://www.kaggle.com/c/avazu-ctr-prediction/.
Balcan, M. F.; Blum, A.; Fine, S.; and Mansour, Y. 2012. Distributed learning, communication complexity and privacy. In Conference on Learning Theory, 26–1.
Bhatt, R., and Dhall, A. 2010. Skin segmentation dataset. UCI Machine Learning Repository.
Blum, A.; Dwork, C.; McSherry, F.; and Nissim, K. 2005. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 128–138. ACM.
Bojarski, M.; Choromanska, A.; Choromanski, K.; and LeCun, Y. 2014. Differentially- and non-differentially-private random decision trees. arXiv preprint arXiv:1410.6973.
Cormode, G.; Jha, S.; Kulkarni, T.; Li, N.; Srivastava, D.; and Wang, T. 2018. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, 1655–1658. ACM.
Cup, K. 1999. KDD Cup 1999 dataset. Available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
Dua, D., and Graff, C. 2017. UCI machine learning repository.
Dwork, C., and Roth, A. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.
Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference (TCC), 265–284. Springer.
Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 51–60. IEEE.
Evans, D.; Kolesnikov, V.; Rosulek, M.; et al. 2018. A pragmatic introduction to secure multi-party computation. Foundations and Trends in Privacy and Security 2(2–3):70–246.
Friedman, A., and Schuster, A. 2010. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 493–502. ACM.
Jagannathan, G.; Pillaipakkamnatt, K.; and Wright, R. N. 2009. A practical differentially private random decision tree classifier. In 2009 IEEE International Conference on Data Mining Workshops, 114–121. IEEE.
Kearns, M., and Mansour, Y. 1999. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences 58(1):109–128.
Laurent, H., and Rivest, R. L. 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters 5(1):15–17.
LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
McSherry, F. D. 2009. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, 19–30. ACM.
Mitchell, T. M. 1997. Machine Learning.
Mohammed, N.; Chen, R.; Fung, B.; and Yu, P. S. 2011. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 493–501. ACM.
Moro, S.; Cortez, P.; and Rita, P. 2014. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62:22–31.
Oneto, L.; Ridella, S.; and Anguita, D. 2017. Differential privacy and generalization: Sharper bounds with applications. Pattern Recognition Letters 89:31–38.
Satyanarayanan, M. 2017. The emergence of edge computing. Computer 50(1):30–39.
Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Warner, S. L. 1965. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60(309):63–69.
Yeh, I.-C., and Lien, C.-h. 2009. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36(2):2473–2480.
Zhang, Y.; Liang, P.; and Charikar, M. 2017. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575.


A Appendix

A.1 Composition theorems for differential privacy

Theorem 8 (Sequential Composition). Let $A_1, \ldots, A_k$ be $k$ mechanisms such that $A_i$ satisfies $\alpha_i$-differential privacy. Then the mechanism $A(S) = (A_1(S), \ldots, A_k(S))$ satisfies $(\sum_i \alpha_i)$-differential privacy. This holds even if $A_i$ is chosen based on the outputs of $A_1(S), \ldots, A_{i-1}(S)$.

Theorem 9 (Parallel Composition). Let $D_1, \ldots, D_k$ be a partition of the input domain and suppose $A_1, \ldots, A_k$ are mechanisms such that $A_i$ satisfies $\alpha_i$-differential privacy. Then the mechanism $A(S) = (A_1(S \cap D_1), \ldots, A_k(S \cap D_k))$ satisfies $(\max_i \alpha_i)$-differential privacy.

A.2 Sensitivity of entropy

In this section, we analyze the sensitivity of entropy. We will use the following standard result.

Lemma 10. For any $x, y \in (0,1)$, if $|x - y| \leq \frac{1}{2}$, then $|x\lg(x) - y\lg(y)| \leq -|x - y|\lg(|x - y|)$.

We now deduce the global sensitivity of $J$, used in our implementation of PrivateSplit with RNM. In the proof below, we use the notation $J(S, h)$ where $S$ is a dataset, which is more general than $J(\ell, h)$ (equivalently, $J(S_\ell, h)$) for a leaf $\ell$.

Lemma 11. Suppose $G(q) = -q\lg(q) - (1-q)\lg(1-q)$, the entropy function. When applied to a dataset of size $m$, then for any $h \in H$, the sensitivity of $J'(S) = J(S, h)$ is $\Delta J' \leq \frac{2}{m}(3\lg(m) + 1)$.

Proof. For a dataset $S$ of size $m$ and splitting function $h$, denote $r_{y=i, h=j} = \frac{|\{x \in S \mid f(x) = i,\, h(x) = j\}|}{m}$, $r_{y=i} = \sum_{j \in \{0,1\}} r_{y=i, h=j}$, and similarly for $r_{h=j}$. Since each numerator is a count that changes by at most 1 between neighboring datasets $S$ and $S'$, we have $|r_* - r'_*| \leq \frac{1}{m}$ (here $*$ denotes a wildcard). Consider an arbitrary $h \in H$ and fix it. Then,

$$
\begin{aligned}
|J(S,h) - J(S',h)| = \Big|\; & -\Big[\sum_{a \in \{0,1\}} r_{y=a}\lg(r_{y=a}) - r'_{y=a}\lg(r'_{y=a})\Big] \\
& + \Big[ r_{h=0}\sum_{a \in \{0,1\}} r_{y=a,h=0}\lg(r_{y=a,h=0}) - r'_{h=0}\sum_{a \in \{0,1\}} r'_{y=a,h=0}\lg(r'_{y=a,h=0}) \Big] \\
& + \Big[ r_{h=1}\sum_{a \in \{0,1\}} r_{y=a,h=1}\lg(r_{y=a,h=1}) - r'_{h=1}\sum_{a \in \{0,1\}} r'_{y=a,h=1}\lg(r'_{y=a,h=1}) \Big] \;\Big| \\
\leq\; & \frac{2}{m}\lg(m) + 2\Big(\frac{1}{m} + \frac{2}{m}\lg(m)\Big)
\end{aligned}
$$

by the triangle inequality, Lemma 10, and the fact that $-x\log(x)$ is increasing for $x \in (0, e^{-1})$ (assuming $\frac{1}{m} \leq e^{-1}$).

A.3 RNM implements single machine PrivateSplit

We first recall the utility guarantee for the Laplace mechanism (Dwork and Roth 2014).

Lemma 12. If $Y \sim \text{Lap}(b)$ then $\Pr(|Y| \geq \ln(\frac{1}{\delta}) \cdot b) = \delta$. In general, if $Y_1, Y_2, \ldots, Y_k$ are i.i.d. samples from $\text{Lap}(b)$, then $\Pr(\max_i |Y_i| \geq \ln(\frac{k}{\delta}) \cdot b) \leq \delta$.

We first restate the RNM algorithm for $(\hat{h}, \hat{J}) = \text{PrivateSplit}(\ell, \alpha, \delta)$. For every $i = 1, 2, \ldots, |H|$, define $f_i(S_\ell) = J(\ell, h_i)$. By Lemma 11, $\Delta f_i \leq \frac{10\lg(m)}{m}$. Thus, RNM samples $X_i \sim \text{Lap}(\frac{20\lg(m)}{\alpha m})$, calculates $\hat{f}_i = f_i(S_\ell) + X_i$, and computes $i^* = \arg\max_{i \in [|H|]} \hat{f}_i$, while preserving $\alpha$-differential privacy. We now show that this procedure finds nearly optimal splits with high probability.

Lemma 13. Let $b > 0$. If $x \geq 2b\log(b)$, then $x \geq b\log(x)$ (Shalev-Shwartz and Ben-David 2014).

Theorem 14. The single machine RNM-based procedure described above satisfies: $\forall \zeta > 0$, $\exists N = O(\frac{1}{\alpha\zeta})$ such that if $|S_\ell| \geq N$, then with probability at least $1 - \delta$, we have $J(\ell, \hat{h}) \geq \max_h J(\ell, h) - \zeta$ and $|J(\ell, \hat{h}) - \hat{J}| \leq \zeta$.

Proof. We show that a stronger claim holds with high probability. By Lemma 12, $|X_i| = |\hat{f}_i - f_i| \leq \frac{\zeta}{2}$ for all $i \in [|H|]$ holds with probability $1 - \delta$ whenever $\ln(\frac{|H|}{\delta}) \cdot \frac{20\lg(m)}{\alpha m} \leq \frac{\zeta}{2}$, which holds for large enough $m$. By Lemma 13, it suffices to have $m \geq 2b\log(b)$ where $b = \ln(\frac{|H|}{\delta}) \cdot \frac{40}{\alpha\zeta}$.

We now show that $|\hat{f}_i - f_i| \leq \frac{\zeta}{2}$ for all $i$ is sufficient. This is equivalent to $|\hat{J}(\ell, h_i) - J(\ell, h_i)| \leq \frac{\zeta}{2}$ for all $i$; in particular, $|J(\ell, \hat{h}) - \hat{J}| \leq \frac{\zeta}{2} \leq \zeta$.

Let $j^* = \arg\max_{i \in [|H|]} J(\ell, h_i)$. Then $f_{i^*} \geq f_{j^*} - \zeta$, which is equivalent to $J(\ell, \hat{h}) \geq \max_h J(\ell, h) - \zeta$. This is because $f_{j^*} - f_{i^*} \leq (f_{j^*} - \hat{f}_{j^*}) + (\hat{f}_{j^*} - f_{i^*}) \leq \frac{\zeta}{2} + (\hat{f}_{i^*} - f_{i^*}) \leq \zeta$.

Therefore, $N(\zeta, \alpha, \delta) = O(\frac{1}{\alpha\zeta})$.

A.4 NoisyCounts implements distributed PrivateSplit

In NoisyCounts, each entity $j$ estimates counts $\hat{c}^j_{y=a, h_i=b}$, $\hat{c}_{y=a}$, $\hat{c}_{h_i=b}$ for every splitting function $h_i$ using LM with a privacy budget of $\frac{\alpha}{3|H|}$ for each type of count. Since counts have sensitivity 1, the noise added for LM is sampled from $\text{Lap}(\frac{3|H|}{\alpha})$. By Theorem 8, this entire operation preserves $\alpha$-differential privacy.

The coordinator aggregates the counts by summing over the entities: $\hat{c}_{y=a, h_i=b} = \sum_{j \in [k]} \hat{c}^j_{y=a, h_i=b}$. Then, the aggregated noisy counts are used to compute each $\hat{J}(\ell, h_i)$ and locate the maximum.

Theorem 15. The NoisyCounts procedure described above satisfies: $\forall \zeta > 0$, $\exists N = O(\frac{k|H|}{\alpha\zeta})$ such that if $|S_\ell| \geq N$, then with probability at least $1 - \delta$, we have $J(\ell, \hat{h}) \geq \max_h J(\ell, h) - \zeta$ and $|J(\ell, \hat{h}) - \hat{J}| \leq \zeta$.


Proof. The idea is that the estimated counts are accurate with high probability. Then, applying an analysis similar to the sensitivity analysis of $J$, these accurately estimated counts produce accurate estimates of $J$.

Any estimated count $\hat{c}_{y=a,h_i=b}$ is the true count $c_{y=a,h_i=b}$ noised by $X_1 + X_2 + \cdots + X_k$, where each $X_i \sim \text{Lap}(\frac{3|H|}{\alpha})$. Since with probability $1 - \frac{\delta}{3k|H|}$ we have $|X_i| \leq \ln(\frac{3k|H|}{\delta}) \cdot \frac{3|H|}{\alpha}$, applying a union bound yields that with probability $\geq 1 - \frac{\delta}{3}$ we have $|\hat{c}_{y=a,h_i=b} - c_{y=a,h_i=b}| \leq \ln(\frac{3k|H|}{\delta}) \cdot \frac{3k|H|}{\alpha}$ for every splitting function $h_i$. The same holds for $\hat{c}_{y=a}$ and $\hat{c}_{h_i=b}$. In the notation of the proof of Lemma 11, we have $|r_* - r'_*| \leq \ln(\frac{3k|H|}{\delta}) \cdot \frac{3k|H|}{m\alpha}$; the same analysis leads to

$$|\hat{J}(\ell, h_i) - J(\ell, h_i)| \leq 10\ln\!\Big(\frac{3k|H|}{\delta}\Big)\frac{3k|H|}{m\alpha}\,\ln\!\Big(\frac{m\alpha}{\ln(\frac{3k|H|}{\delta})\,3k|H|}\Big) \leq 30\ln\!\Big(\frac{3k|H|}{\delta}\Big)\frac{\lg(m)\,k|H|}{m\alpha},$$

assuming $\ln(\frac{3k|H|}{\delta}) \cdot \frac{3k|H|}{m\alpha} \in (0, \frac{1}{e})$. Thus, with probability at least $1 - \delta$, $|\hat{J}(\ell, h_i) - J(\ell, h_i)| \leq \frac{\zeta}{2}$ for all $i$ when $m \geq 2b\log(b)$ with $b = 60\ln(\frac{3k|H|}{\delta}) \cdot \frac{k|H|}{\alpha\zeta}$. This is sufficient, as shown in the proof of Theorem 14.

Therefore, $N(\zeta, \alpha, \delta) = O(\frac{k|H|}{\alpha\zeta})$.

A.5 Proof of Main Theorem

Theorem 3 is our main boosting-based utility guarantee. We first define some notation and recount key arguments used to prove boosting-based utility guarantees for TopDown (Kearns and Mansour 1999).

Define $\varepsilon_t = \varepsilon(T_t)$ and $G_t = G(T_t)$, where $T_t$ is the decision tree at the $t$-th iteration of DP-TopDown. Let $t \in [M]$ be an arbitrary but fixed step of the algorithm. After $t$ splits, there exists a leaf $\ell$ such that $w(\ell)\min(q(\ell), 1 - q(\ell)) \geq \frac{\varepsilon_t}{t}$, since the tree $T_t$ has $t$ leaves. Using the Weak Learning Assumption, Kearns and Mansour (Kearns and Mansour 1999) show that splitting the leaf $\ell$ with the optimal splitting function (i.e., the split $h \in H$ that maximizes $J(\ell, h)$) reduces $G_t$ significantly: $G_{t+1} \leq G_t - \frac{\gamma^2 G_t}{4t\log(2/G_t)}$. A solution to this recurrence is given by $G_t \leq e^{-\gamma\sqrt{\log(t)}/c}$ for a constant $c$, so it suffices to make $(1/\varepsilon)^{O(\log(1/\varepsilon)/\gamma^2)}$ splits.

Now we proceed with the proof of our main Theorem 3; afterwards, we deduce the distributed version in Corollary 16.

Theorem 3. Let $G(q)$ be entropy. Let $M, \alpha, \varepsilon, \delta$ be inputs to DP-TopDown, as defined in Algorithm 1. Under the Weak Learning Assumption with advantage $\gamma$, it suffices for DP-TopDown to make $(1/\varepsilon)^{O(\log(1/\varepsilon)/\gamma^2)}$ splits to achieve training error $\leq \varepsilon$ with probability $\geq 1 - \delta$, provided the dataset $S$ has size at least

$$|S| \geq \max\left( O\!\left(\frac{M}{\gamma^2 \varepsilon \alpha b}\right),\; \frac{2M}{\varepsilon} N\!\left(O\!\left(\frac{\gamma^2 \varepsilon}{M}\right), \frac{\alpha b}{4}, O\!\left(\frac{\delta}{M}\right)\right) \right),$$

where $b = \min_{1 \leq d \leq M} B(d)$ and $N(\zeta, \alpha, \delta)$ is the data requirement of PrivateSplit.

Proof. Let $\zeta \in (0, \frac{\varepsilon}{2M})$ be a target accuracy for the guarantees of PrivateSplit; we will set the value of $\zeta$ later in the proof. Recall that $\alpha_\ell = \frac{\alpha}{2}B(\text{depth}(\ell))$ is the privacy budget for $\ell$. We now consider each event:

• There are at most $2M + 1$ calls to PrivateSplit. By the properties of PrivateSplit, each call has low error when $|S_\ell| \geq N(\zeta, \frac{\alpha_\ell}{2}, \frac{\delta}{2(2M+1)})$, with probability at least $1 - \frac{\delta}{2(2M+1)}$; that is, $J(\ell, \hat{h}) \geq \max_h J(\ell, h) - \zeta$ and $|J(\ell, \hat{h}) - \hat{J}| \leq \zeta$. Thus, all the calls have low error with probability at least $1 - \frac{\delta}{2}$.

• There are also $2M$ weight estimates on line 10, and we want all of them to have error at most $\zeta$ with probability at least $1 - \frac{\delta}{4}$. So we want each weight estimate to have low error, $|w(\ell) - \hat{w}(\ell)| \leq \zeta$, with probability $1 - \frac{\delta}{8M}$. By Lemma 12, we need $|S| \geq \ln(\frac{8M}{\delta}) \cdot \frac{2}{\zeta\alpha_\ell}$.

• There are also at most $M + 1$ leaves. We want to bound the total error of leaf labeling by $\frac{\varepsilon}{2}$ with probability at least $1 - \frac{\delta}{4}$, so we bound each leaf's error by $\frac{\varepsilon}{4(M+1)}$ with probability at least $1 - \frac{\delta}{4(M+1)}$. Thus, even if we mistakenly pick a label other than the most common one, the label we pick gives error at most $\frac{\varepsilon}{2(M+1)}$ away from that of the most common label. RNM with budget $\frac{\alpha}{2}$ requires $|S| \geq \ln(\frac{4(M+1)}{\delta}) \cdot \frac{8(M+1)}{\varepsilon\alpha}$.

By a union bound, all of the above events happen with probability at least $1 - \delta$, so assume this high probability event for the rest of the proof. Note that since $\zeta < \frac{\varepsilon}{2M}$ and $\alpha_\ell < \alpha$, the sample complexity of the weight estimates is asymptotically lower bounded by the sample complexity of leaf labeling.

Now we show that our algorithm chooses leaves with enough data to satisfy the sample complexity of PrivateSplit. We know that after $t$ splits, there exists a leaf $\ell_t$ such that $w(\ell_t)\min(q(\ell_t), 1 - q(\ell_t)) \geq \frac{\varepsilon_t}{t}$. Since $\min(q(\ell_t), 1 - q(\ell_t)) \leq \frac{1}{2}$, we have $w(\ell_t) > \frac{2\varepsilon_t}{t} \geq \frac{2\varepsilon}{M}$. Restricting $\zeta < \frac{\varepsilon}{2M}$, then $\hat{w}(\ell_t) \geq w(\ell_t) - \zeta \geq \frac{\varepsilon}{M}$; so $\ell_t$ is pushed onto $Q$, which implies that DP-TopDown will consider splitting this leaf. Furthermore, any leaf $\ell$ in $Q$ satisfies $w(\ell) \geq \frac{\varepsilon}{2M}$ since $\zeta < \frac{\varepsilon}{2M}$. In other words, there is at least one leaf in the queue $Q$ with weight large relative to $\varepsilon$, and no leaves that have very small weight. Therefore, to satisfy the sufficient conditions for PrivateSplit to achieve low error when run on a leaf $\ell$, we require that $|S_\ell| \geq N(\zeta, \frac{\alpha_\ell}{2}, \frac{\delta}{2(2M+1)})$, which is guaranteed to be satisfied for any leaf in the queue $Q$ when $|S| \geq \frac{2M}{\varepsilon} N(\zeta, \frac{\alpha_\ell}{2}, \frac{\delta}{2(2M+1)})$, since $|S_\ell| \geq \frac{\varepsilon}{2M}|S|$.

Next, we argue that every iteration of DP-TopDown reduces the potential $G(T)$ by at least half as much as the non-private algorithm. This is due to a combination of the Weak Learning Assumption, together with the fact that for every leaf $\ell \in Q$ we are guaranteed that PrivateSplit returned a $\zeta$-optimal split for that leaf. Let $J^* = \max_h J(\ell, h)$ and let $(\hat{h}, \hat{J})$ be the output of PrivateSplit run on leaf $\ell$. Then

$$|w(\ell)J^* - \hat{w}(\ell)\hat{J}| \leq w(\ell)|J^* - \hat{J}| + \hat{J}|w(\ell) - \hat{w}(\ell)| \leq |J^* - J(\ell, \hat{h})| + |J(\ell, \hat{h}) - \hat{J}| + |w(\ell) - \hat{w}(\ell)| \leq 3\zeta,$$

where we used that $w(\ell)$ and $\hat{J}$ are bounded by 1. Note that the above is true for any estimated $\hat{J}$ from PrivateSplit; in particular, it is true for the leaves in $Q$. Therefore, after $t$ splits, the leaf $\ell^*$ we pop from $Q$ has $\hat{w}\hat{J}$ nearly as good as the optimal split for leaf $\ell$. Hence, applying the Kearns and Mansour weak-learning analysis, we are guaranteed that $G_{t+1} \leq G_t - \frac{\gamma^2 G_t}{4t\log(2/G_t)} + 6\zeta$. Choose $\zeta$ such that $6\zeta = \frac{\gamma^2\varepsilon}{8M\log(2/\varepsilon)}$. Then we have $G_{t+1} \leq G_t - \frac{\gamma^2 G_t}{8t\log(2/G_t)}$, and the solution to the recurrence inequality differs only by a constant. Thus, to achieve error at most $\frac{\varepsilon}{2}$, it suffices to make $(2/\varepsilon)^{O(\log(2/\varepsilon)/\gamma^2)}$ splits, which simplifies to $(1/\varepsilon)^{O(\log(1/\varepsilon)/\gamma^2)}$. As we bounded the leaf labeling error by $\frac{\varepsilon}{2}$, the total error of DP-TopDown is at most $\varepsilon$, as desired.

For the above claim to hold, the size of the dataset needs to be at least

$$|S| \geq \max_\ell \left\{ \ln\!\Big(\frac{8M}{\delta}\Big)\frac{2}{\zeta\alpha_\ell},\; \frac{2M}{\varepsilon} N\!\Big(\zeta, \frac{\alpha_\ell}{2}, \frac{\delta}{2(2M+1)}\Big) \right\},$$

where the maximum is taken over all leaves $\ell$ encountered by DP-TopDown (i.e., the leaves of any tree $T_t$). The terms in the above maximum depend on the leaves $\ell$ only through the privacy budget $\alpha_\ell$ assigned to each leaf. Since the learned decision tree has depth at most $M$, we are guaranteed that $\alpha_\ell = \frac{\alpha}{2}B(\text{depth }\ell) \geq \frac{\alpha b}{2}$, where $b = \min_{1 \leq d \leq M} B(d)$. Substituting this into the above bound on $|S|$ completes the proof.

Corollary 16. Consider the distributed setting. Given a distributed implementation of PrivateSplit, DP-TopDown has the guarantees of Theorem 3 with sample complexity

$$|S| \geq \max\left( O\!\left(\frac{kM}{\gamma^2 \varepsilon \alpha b}\right),\; \frac{2M}{\varepsilon} N\!\left(O\!\left(\frac{\gamma^2 \varepsilon}{M}\right), \frac{\alpha b}{4}, O\!\left(\frac{\delta}{M}\right)\right) \right).$$

Proof. We are only concerned with formally quantifying the effects of weight estimation and leaf labeling.

• Weight estimation can be done in a distributed fashion by having each entity publish an estimate of the number of its data points at a particular leaf with LM. The budget remains the same as in the single machine setting by parallel composition across the entities.

• Leaf labeling can be done in a distributed fashion by having each entity publish an estimate of its number of data points for each label, with LM. Since we are no longer using RNM, the noise parameter increases by a factor of 2 (since there are two possible labels).

In both cases, instead of bounding the magnitude of one Laplace random variable, we bound the sum of $k$ independent Laplace random variables by some error bound $\zeta$ with failure probability $\delta$. We can do so by bounding each error by $\frac{\zeta}{k}$ with failure probability $\frac{\delta}{k}$, for which Lemma 12 provides the sample complexity. In both cases, we accrue a factor of $k$ inside and outside the $\ln$ term. In particular, weight estimation requires $|S| \geq \ln(\frac{8kM}{\delta}) \cdot \frac{2k}{\zeta\alpha_\ell}$ and leaf labeling requires $|S| \geq \ln(\frac{4k(M+1)}{\delta}) \cdot \frac{8k(M+1)}{\varepsilon\alpha}$. Thus, to account for the distributed setting in Theorem 3, the first term in the max accrues an extra factor of $k$, becoming $O(\frac{kM}{\gamma^2\varepsilon\alpha b})$.

We note that the same style of analysis also applies for the splitting criteria $G(q) = 4q(1-q)$ and $G(q) = 2\sqrt{q(1-q)}$, as in (Kearns and Mansour 1999).

A.6 Future work

In our current work, we give half the budget to internal nodes (specifically, weight estimation and PrivateSplit) and the other half to labeling leaves. In general, spending half the budget on labeling leaves can be excessive, especially if the budget is large enough that the noise is much less than 0.5 with high probability; in that case, noise below 0.5 already guarantees selecting the correct label. Therefore, exploring how to balance the budget between splits and leaf labeling is an interesting direction for future work.

We also think it is possible to extend our algorithm to private weak learners; in particular, PrivateSplit only needs to privately return a splitting function that beats random guessing for the leaf with high probability. We leave the details for future work.

A.7 Splitting classes for MNIST and Avazu CTR

In this section we describe the splitting classes we used for MNIST and Avazu CTR.

MNIST: The splitting class for MNIST consists of three thresholds on the average pixel intensity of each of the 49 4 × 4 blocks of each 28 × 28 image. For example, the smallest threshold on the "first" block of a flattened MNIST image $x$ would return 1 if $(x[0] + x[1] + x[28] + x[29])/4 \leq 42.5$ and 0 otherwise.
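A sketch of how such a splitting class can be constructed; the helper names, the exact block indexing, and the two additional threshold values are assumptions of this illustration (only 42.5 appears in the text; three thresholds per block times 49 blocks matches the stated class size of 147).

```python
import numpy as np

def block_threshold_split(block_row, block_col, threshold, block=4):
    """h(x) = 1 iff the mean intensity of one block of a flattened
    28x28 image is <= threshold (block size per the text above)."""
    def h(x):
        img = np.asarray(x).reshape(28, 28)
        patch = img[block_row*block:(block_row+1)*block,
                    block_col*block:(block_col+1)*block]
        return int(patch.mean() <= threshold)
    return h

# 49 blocks x 3 thresholds = 147 splitting functions.
H_mnist = [block_threshold_split(r, c, t)
           for r in range(7) for c in range(7)
           for t in (42.5, 127.5, 212.5)]  # hypothetical threshold values
```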

Avazu CTR: The splitting class for Avazu CTR was chosen without a fixed number of thresholds per feature, since the range of each feature fluctuates more than in the other datasets. The splitting class consists of 7 thresholds on C1, 100 thresholds on C14, 4 thresholds on C15 and C16, 40 thresholds on C17, 10 thresholds on C19 and C21, and 15 thresholds on C20.

A.8 More experimental results

Figure 3 shows the training-accuracy privacy curves, as well as plots of the depths and sizes of the decision trees learned in Figure 1. In general, the depth and size of the tree decrease as the privacy budget decreases. This suggests that DP-TopDown is inclined to terminate earlier if it determines that it is no longer productive to make deeper splits. We also observe an interesting phenomenon: NoisyCounts consistently learns the deepest trees, while LocalRNM consistently learns the shallowest trees.

Figure 4 shows the effect of training dataset size on the training-accuracy privacy curves; it is the training-accuracy analog of Figure 2. In these plots, we see the same trend that greater training size leads to better training accuracy, as we observed for testing accuracy in Figure 2. This is especially interesting for training accuracy, since smaller datasets are easier to overfit and thus should yield higher training accuracy, which we indeed observe in the baselines. However, the privacy curves show that the training accuracy of private decision trees trained on smaller datasets remains lower than that of private decision trees trained on larger datasets. The plots in Figure 4 therefore suggest that the effect of dataset size on the performance of the private algorithms is strong enough to overcome even the effect of overfitting on small datasets.


[Figure 3 panels: trainAcc, NumNodes, MaxAchievedDepth]

Figure 3: Training accuracy, number of nodes and maximum achieved depth of decision trees under the same setup as Figure 1.


[Figure 4 panels: NoisyCounts, single machine, LocalRNM]

Figure 4: Training accuracy under the same setup as Figure 2.