Li-Yang Tan — slides: theory.stanford.edu/~liyang/top-down.pdf
TRANSCRIPT
Provable Guarantees for Decision Tree Induction
Li-Yang Tan (Stanford)
Based on joint works with: Guy Blanc (Stanford), Neha Gupta (Stanford), Jane Lange (MIT)
Learning decision trees from labeled data
Example 𝑥 ∈ {0,1}ⁿ, Label 𝑦 ∈ {0,1}
[Table: rows of labeled examples (𝑥, 𝑦)]
Fast algorithm ⟶ Small decision tree
[Figure: a small decision tree querying variables 𝑥ᵢ, with 0/1 leaves]
Decision trees: simple and effective
§ Fast to evaluate
§ Easy to understand, easy to explain predictions
§ Powerful “ensemble methods”: Random forests and Boosted decision trees
Top-down algorithms for learning DTs
A natural approach: Greedy, top-down
1. Choose a “good” variable to query as the root
2. Recurse
§ Definition of “good” = the algorithm’s splitting criterion
§ Intuitively, want the most “important”/“relevant” variable
§ Captures almost all algorithms used in practice: ID3, C4.5, CART, etc.
§ Widely employed, empirically successful
§ This talk: theoretical justification for their effectiveness?
This talk: Performance guarantees for top-down algorithms
Let 𝑓: {0,1}ⁿ → {0,1} be a target function and 𝒟 a distribution over {0,1}ⁿ.
Input: labeled examples (𝑥, 𝑓(𝑥)) where 𝑥 ∼ 𝒟; parameters 𝑠 ∈ ℕ and 𝜀 ∈ (0,1).
Goal: the top-down algorithm’s tree has size 𝑆 = 𝑆(𝑠, 𝜀), “not much larger than” 𝑠, and error ≤ optₛ + 𝜀, where optₛ is the error of the optimal size-𝑠 DT for 𝑓. That is, top-down algorithms achieve error within 𝜀 of that of the optimal size-𝑠 tree for 𝑓.
Our results: 𝒟 = product distribution (this talk: 𝒟 = uniform).
Learning decision trees: practice vs. theory
Decision trees have long been a workhorse of ML, and there is a rich and vast theory literature on learning decision trees.
Practice:
§ Algorithms used in practice proceed in a top-down fashion
§ Empirically successful, but lack performance guarantees
Theory:
§ Many algorithms do not return a decision tree hypothesis (“improper” algorithms)
§ Fastest proper algorithm [Ehrenfeucht–Haussler 89] proceeds bottom-up
Outline of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
[Figure: the space of all target functions, where impurity algorithms can fail badly, with monotone functions highlighted as a subclass]
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
Part I: Power and limitations of existing algorithms
“Top-down induction of decision trees: rigorous guarantees and inherent limitations”, Guy Blanc, Jane Lange, Li-Yang Tan. ITCS 2020.
“Provable guarantees for decision tree induction: the agnostic setting”, Guy Blanc, Jane Lange, Li-Yang Tan. ICML 2020.
How do these top-down algorithms work?
§ Main algorithms used in practice – ID3, C4.5, CART – are all “impurity-based”
§ Each is defined by an impurity function 𝒢: [0,1] → [0,1]
§ 𝒢 is concave and satisfies 𝒢(0) = 𝒢(1) = 0 and 𝒢(0.5) = 1
Examples: ID3 uses 𝒢(𝑝) = H₂(𝑝), the binary entropy; CART uses 𝒢(𝑝) = 4𝑝(1 − 𝑝), the “Gini impurity”
[Figure: a concave impurity function 𝒢 on [0,1] with 𝒢(0) = 𝒢(1) = 0 and 𝒢(0.5) = 1]
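For concreteness, here is a minimal Python sketch (my own illustration, not from the talk) of the two impurity functions named above, normalized so that 𝒢(0.5) = 1 as on the slide.

```python
import numpy as np

def binary_entropy(p):
    """H2(p), the impurity function underlying ID3/C4.5; H2(0) = H2(1) = 0, H2(0.5) = 1."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0) at the endpoints
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    """4p(1-p), the rescaled Gini impurity used by CART; also 0 at p = 0, 1 and 1 at p = 0.5."""
    return 4 * p * (1 - p)
```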
High-level idea
𝒢(𝑝) is an approximation of min{𝑝, 1 − 𝑝}.
∴ For 𝑓: {0,1}ⁿ → {0,1}, 𝒢(𝔼[𝑓]) ≈ min{𝔼[𝑓], 1 − 𝔼[𝑓]}, which measures how close 𝑓 is to a constant function.
Impurity function defines the splitting criterion
Target function 𝑓: {0,1}ⁿ → {0,1}. Which 𝑥ᵢ to query as the root?
Query 𝑥ᵢ as the root, where 𝑖 ∈ [𝑛] maximizes the “purity gain associated with splitting 𝑥ᵢ”:
𝒢(𝔼[𝑓]) − 𝔼_𝑏[𝒢(𝔼[𝑓_{𝑥ᵢ=𝑏}])]
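A hedged sketch of estimating this purity gain from labeled samples (X, y): the names, the empirical weighting of the two branches, and the reuse of the `gini` function sketched earlier are my own choices, not the talk's.

```python
import numpy as np

def purity_gain(X, y, i, impurity):
    """G(E[f]) - E_b[ G(E[f restricted to x_i = b]) ], with E_b weighted empirically."""
    mask = X[:, i] == 1
    p = y.mean()
    p1 = y[mask].mean() if mask.any() else p
    p0 = y[~mask].mean() if (~mask).any() else p
    return impurity(p) - (mask.mean() * impurity(p1) + (~mask).mean() * impurity(p0))

# Root split: query the x_i maximizing purity_gain(X, y, i, gini) (or binary_entropy).
```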
Order in which to split the leaves
Here’s the tree we’ve built so far: [Figure: a partially grown tree with leaves ℓ₁, ℓ₂, …]
Which leaf to split next?
We split the leaf ℓ with the highest “score”, where:
score(ℓ) ≔ ℙ_{𝑥∼𝒟}[𝑥 reaches ℓ] ⋅ max_{𝑖∈[𝑛]} Purity-gain_𝒢(ℓ, 𝑥ᵢ)
A single representative algorithm?
Different impurity functions 𝒢 give rise to different trees. Would be nice to analyze a single algorithm …
TopDown
Lemma [Blanc-Lange-T.]:
There is an algorithm, TopDown, that is “complete” for impurity algorithms: Guarantees for TopDown yield guarantees for all impurity algorithms
[Diagram: ID3, C4.5, and any impurity function 𝒢 reduce to TopDown]
The overhead in our reduction scales with the concavity of 𝒢.
TopDown
Forget about impurity functions. In each iteration, we split the leaf ℓ with the highest score, where:
score(ℓ) ≔ ℙ_{𝑥∼𝒟}[𝑥 reaches ℓ] ⋅ max_{𝑖∈[𝑛]} 𝔼[𝑓ℓ(𝑥)𝑥ᵢ]²   (the correlation between 𝑓ℓ and 𝑥ᵢ)
That is, we split ℓ with a query to the 𝑥ᵢ that’s best correlated with 𝑓ℓ.
How does the process end? When the tree reaches a user-specified size 𝑆, label each leaf ℓ with round(𝔼[𝑓ℓ]) and output the tree.
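Putting the pieces together, here is a rough, sample-based sketch of the TopDown loop just described. It is a simplification of mine, not the authors' code; in particular, I recenter each bit to ±1 inside the correlation, which is my own convention choice (the slides simply write 𝔼[𝑓ℓ(𝑥)𝑥ᵢ]²).

```python
import numpy as np

def topdown(X, y, S):
    """Grow a tree with at most S leaves; return [(fixed_coordinates, leaf_label), ...]."""
    n = X.shape[1]
    leaves = [(np.arange(len(y)), {})]            # (indices of examples reaching the leaf, fixed coords)

    def best_split(idx, fixed):
        # score = Pr[x reaches leaf] * max_i E[f_leaf(x) x_i]^2, estimated from the sample
        if len(idx) == 0 or y[idx].min() == y[idx].max():
            return 0.0, None                      # empty or pure leaf: nothing to gain
        weight = len(idx) / len(y)
        best_corr, best_i = 0.0, None
        for i in range(n):
            if i in fixed:
                continue
            xi = 2 * X[idx, i] - 1                # view the bit as +/-1
            corr = (y[idx] * xi).mean() ** 2
            if corr > best_corr:
                best_corr, best_i = corr, i
        return weight * best_corr, best_i

    while len(leaves) < S:
        scores = [best_split(idx, fixed) for idx, fixed in leaves]
        j = max(range(len(leaves)), key=lambda k: scores[k][0])
        _, i = scores[j]
        if i is None:
            break                                 # no leaf has a useful split left
        idx, fixed = leaves.pop(j)                # split the highest-scoring leaf on x_i
        for b in (0, 1):
            leaves.append((idx[X[idx, i] == b], {**fixed, i: b}))

    # label each leaf with round(E[f_leaf]) and output the tree
    return [(fixed, int(round(y[idx].mean())) if len(idx) else 0) for idx, fixed in leaves]
```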
Outline of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
Bad news for TopDown
There is an 𝑓: {0,1}ⁿ → {0,1} such that 𝑓 ≡ a size-4 DT, yet TopDown’s tree 𝑇 has size Ω(2ⁿ) and error ℙ[𝑇(𝑥) ≠ 𝑓(𝑥)] = 0.5:
𝑓(𝑥) = 𝑥ᵢ ⊕ 𝑥ⱼ for some 𝑖, 𝑗 ∈ [𝑛]
TopDown struggles because 𝔼[𝑓(𝑥)𝑥ₖ] = 0 for all 𝑘 ∈ [𝑛].
M. Kearns, “Recent developments in decision tree induction and the weak learning framework”, AAAI 1996.
Guarantees for monotone functions?
[Figure: within the space of all target functions, monotone functions form a subclass; [FP04] covers subclasses of it (read-once DNFs, halfspaces), our work covers all of it]
[Fiat–Pechyony 04] raised the possibility of performance guarantees for monotone functions
§ Reasonable assumption in practice; intensively studied in theory
§ [FP04]: Proved strong guarantees for various subclasses of monotone functions (read-once DNFs, halfspaces)
Theorem [Blanc-Lange-T.]:
Let 𝑓: {0,1}ⁿ → {0,1} be monotone and 𝒟 a product distribution over {0,1}ⁿ. For all 𝑠 ∈ ℕ and 𝜀 ∈ (0,1), given examples (𝑥, 𝑓(𝑥)) where 𝑥 ∼ 𝒟, TopDown grows a tree 𝑇 of size
𝑠^{Õ((log 𝑠)/𝜀²)}
that achieves ℙ_{𝑥∼𝒟}[𝑓(𝑥) ≠ 𝑇(𝑥)] ≤ optₛ + 𝜀.
Monotone 𝑓, size-𝑠 DT (error optₛ) ⟹ TopDown’s tree: size 𝑠^{Õ((log 𝑠)/𝜀²)}, error ≤ optₛ + 𝜀
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and monotone 𝑓, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀²)} and error ≤ optₛ + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Matching lower bound: For all 𝑠 ∈ ℕ, there is a monotone 𝑓 that is 0.01-close to a size-𝑠 DT, yet TopDown’s tree has error ≥ 0.49 until its size reaches 𝑠^{Ω̃(log 𝑠)}.
Proof sketch of the upper bound
Consider the realizable setting (optₛ = 0): 𝑓 ≡ a size-𝑠 DT.
Let’s focus on understanding the first decision that TopDown makes.
Q: What can we say about max_{𝑖∈[𝑛]} 𝔼[𝑓(𝑥)𝑥ᵢ]? What’s the relationship between max_{𝑖∈[𝑛]} 𝔼[𝑓(𝑥)𝑥ᵢ] and the complexity of 𝑓?
O’Donnell, Saks, Schramm, Servedio, “Every decision tree has an influential variable”, FOCS 2005
OSSS inequality: If 𝑓: {0,1}ⁿ → {0,1} is computed by a size-𝑠 decision tree, there is an 𝑖 ∈ [𝑛] such that
Infᵢ(𝑓) ≥ Var(𝑓)/log 𝑠,
where Infᵢ(𝑓) ≔ ℙ[𝑓(𝑥) ≠ 𝑓(𝑥^{⊕𝑖})] and 𝑥^{⊕𝑖} is 𝑥 with its 𝑖-th bit flipped.
§ If 𝑓 is monotone, Infᵢ(𝑓) = 𝔼[𝑓(𝑥)𝑥ᵢ]
§ If Var(𝑓) ≤ 𝜀 then 𝑓 is 𝑂(𝜀)-close to a constant
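As a sanity check (a toy example of my own, not from the talk), the inequality can be verified by brute force on a small decision tree, here 𝑓(𝑥) = 𝑥₁ if 𝑥₀ else 𝑥₂, which has 𝑠 = 4 leaves; log is taken base 2 below.

```python
import itertools, math

def f(x):                        # a size-4 decision tree: query x0, then x1 or x2
    return x[1] if x[0] else x[2]

points = list(itertools.product((0, 1), repeat=3))
mean = sum(map(f, points)) / len(points)
variance = sum((f(x) - mean) ** 2 for x in points) / len(points)

def influence(i):                # Inf_i(f) = Pr[f(x) != f(x with i-th bit flipped)]
    flip = lambda x: x[:i] + (1 - x[i],) + x[i + 1:]
    return sum(f(x) != f(flip(x)) for x in points) / len(points)

s = 4
assert max(influence(i) for i in range(3)) >= variance / math.log2(s)
```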
OSSS vs. Kahn-Kalai-Linial inequality
OSSS inequality: for 𝑓 ≡ a size-𝑠 DT, there exists 𝑖 ∈ [𝑛] such that Infᵢ(𝑓) ≥ Var(𝑓)/log 𝑠.
KKL inequality: for all 𝑓: {0,1}ⁿ → {0,1}, there exists 𝑖 ∈ [𝑛] such that Infᵢ(𝑓) ≥ Var(𝑓) ⋅ (log 𝑛)/𝑛.
§ OSSS = “KKL that takes into account the decision tree complexity of 𝑓”
§ Strengthens KKL whenever 𝑠 = 2^{𝑂(𝑛/log 𝑛)}
§ Exponentially stronger when 𝑠 = poly(𝑛)
Back to analyzing TopDown
Simplifying assumption #1: the realizable setting (opt = 0), i.e. 𝑓 ≡ a size-𝑠 DT.
Simplifying assumption #2: instead of how TopDown actually works (split the leaf with the highest score), we analyze a variant that splits all leaves, maintaining a complete tree.
A simple potential function argument
How does Var(𝑓) change when 𝑥ᵢ is queried, i.e. when 𝑓 is split into the subfunctions 𝑓_{𝑥ᵢ=0} and 𝑓_{𝑥ᵢ=1}?
Fact: 𝔼_𝑏[Var(𝑓_{𝑥ᵢ=𝑏})] = Var(𝑓) − 𝔼[𝑓(𝑥)𝑥ᵢ]²
Let 𝑇 be TopDown’s tree so far, and consider the quantity 𝔼_{leaves ℓ∼𝑇}[Var(𝑓ℓ)]:
§ When 𝑇 = ∅, this is simply Var(𝑓)
§ For all 𝑇, this quantity is ≥ 0
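A quick numerical check of the Fact above (an illustration I added): brute force over a random function on {0,1}⁴ under the uniform distribution, with 𝑥ᵢ recentered to ±1 inside the correlation.

```python
import itertools, random
import numpy as np

n, i = 4, 2
random.seed(0)
f = {x: random.randint(0, 1) for x in itertools.product((0, 1), repeat=n)}   # random truth table

def var(points):
    return np.var([f[x] for x in points])

points = list(f)
corr = np.mean([f[x] * (2 * x[i] - 1) for x in points])                      # E[f(x) x_i], x_i in {-1,+1}
lhs = np.mean([var([x for x in points if x[i] == b]) for b in (0, 1)])       # E_b[ Var(f_{x_i=b}) ]
rhs = var(points) - corr ** 2                                                # Var(f) - E[f(x) x_i]^2
assert abs(lhs - rhs) < 1e-12
```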
Q: Is it true that for a ≥ 1 − 𝜀 fraction of leaves ℓ of 𝑇, Var(𝑓ℓ) ≤ 𝜀?
§ If so, done: 𝑇 is 𝑂(𝜀)-close to 𝑓.
§ If not, split all leaves of 𝑇 to obtain 𝑇*. Then
𝔼_{ℓ∼𝑇*}[Var(𝑓ℓ)] = 𝔼_{ℓ∼𝑇}[Var(𝑓ℓ)] − 𝔼_{ℓ∼𝑇}[𝔼_𝑥[𝑓ℓ(𝑥)𝑥ᵢ(ℓ)]²],
where 𝑥ᵢ(ℓ) is the variable maximizing correlation with 𝑓ℓ. By [OSSS], 𝔼[𝑓ℓ(𝑥)𝑥ᵢ(ℓ)] ≥ Var(𝑓ℓ)/log 𝑠, and a ≥ 𝜀 fraction of leaves have Var(𝑓ℓ) ≥ 𝜀.
∴ 1 unit of depth buys us a ≥ 𝜀 ⋅ (𝜀/log 𝑠)² = 𝜀³/(log 𝑠)² decrease in the “potential”.
∴ Done when depth ≤ (log 𝑠)²/𝜀³, hence size ≤ 2^{(log 𝑠)²/𝜀³} = 𝑠^{Õ((log 𝑠)/𝜀³)}.
§ For the non-realizable setting, we need a “robust” version of the OSSS inequality, proved by [Jain-Zhang 11].
§ For the realizable setting, the argument can be optimized to get slightly better parameters than the 𝑠^{Õ((log 𝑠)/𝜀³)} bound sketched above.
We just sketched the proof of: in the realizable setting (optₛ = 0), for monotone 𝑓 ≡ a size-𝑠 DT, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀³)} and error ≤ 𝜀.
Beyond the proof sketch
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and monotone 𝑓, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀²)} and error ≤ optₛ + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Matching lower bound: For all 𝑠 ∈ ℕ, there is a monotone 𝑓 that is 0.01-close to a size-𝑠 DT, yet TopDown’s tree has error ≥ 0.49 until its size reaches 𝑠^{Ω̃(log 𝑠)}.
Next:
A monotone function that fools TopDown. Our function will be based on Tribes and Majority:
Tribes_ℓ(𝑥₁, …, 𝑥ℓ): 𝔼[Tribes_ℓ(𝑥)𝑥ᵢ] = Θ((log ℓ)/ℓ) for all 𝑖 ∈ [ℓ]
Majority_𝑘(𝑦₁, …, 𝑦ₖ): 𝔼[Majority_𝑘(𝑦)𝑦ᵢ] = Θ(1/√𝑘) for all 𝑖 ∈ [𝑘]
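A rough Monte Carlo illustration of these two correlation estimates (mine, not from the talk). The tribe width below is a crude, untuned choice (the actual construction tunes it so that 𝔼[Tribes] is close to 1/2), so treat the outputs as order-of-magnitude only.

```python
import numpy as np

rng = np.random.default_rng(0)
l, k, trials = 256, 64, 200_000
X = rng.integers(0, 2, size=(trials, l), dtype=np.int8)     # inputs to Tribes_l
Y = rng.integers(0, 2, size=(trials, k), dtype=np.int8)     # inputs to Majority_k

width = max(1, int(np.log2(l) - np.log2(np.log2(l))))       # crude tribe width (untuned)
blocks = X[:, : (l // width) * width].reshape(trials, -1, width)
tribes_vals = blocks.all(axis=2).any(axis=1).astype(float)  # OR of ANDs over disjoint blocks
maj_vals = (Y.sum(axis=1) * 2 > k).astype(float)

print(np.mean(tribes_vals * (2 * X[:, 0] - 1)))   # ~ Theta((log l)/l): small
print(np.mean(maj_vals * (2 * Y[:, 0] - 1)))      # ~ Theta(1/sqrt(k)): noticeably larger
```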
High-level idea: Our function 𝐹(𝑥₁, …, 𝑥ℓ, 𝑦₁, …, 𝑦ₖ) will be:
§ “99% Tribes_ℓ + 1% Majority_𝑘”, so 𝐹 ≈ Tribes_ℓ
§ However, ℓ and 𝑘 are chosen so that Majority_𝑘’s variables are better correlated with 𝐹 than Tribes_ℓ’s
∴ TopDown is fooled into querying Majority_𝑘’s variables
𝐹(𝑥₁, …, 𝑥ℓ, 𝑦₁, …, 𝑦ₖ) ≔ if 𝑥 lies outside a designated 1% region of {0,1}^ℓ, output Tribes_ℓ(𝑥); else output Majority_𝑘(𝑦).
Observation: 𝐹 is 0.01-close to Tribes_ℓ(𝑥), which has decision tree size ≤ 2^ℓ.
We choose ℓ and 𝑘 to satisfy 1/√𝑘 ≳ (log ℓ)/ℓ, ensuring that 𝔼[𝐹(𝑥, 𝑦)𝑦ᵢ] ≳ 𝔼[𝐹(𝑥, 𝑦)𝑥ᵢ].
∴ TopDown is fooled into building a DT for Majority_𝑘(𝑦), of size ≥ 2^𝑘.
[Figure: 𝐹 over {0,1}^ℓ × {0,1}^𝑘. Tribes_ℓ gives a tree of size 2^ℓ with error ≤ 0.01, while TopDown’s tree, building a DT for Majority_𝑘, has error ≥ 0.49 until size 2^𝑘.]
End of Part I
Performance guarantees for impurity algorithms when run on monotone target functions:
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and monotone 𝑓, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀²)} and error ≤ optₛ + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
We provide a matching 𝑠^{Ω̃(log 𝑠)} lower bound.
Outline of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
Part II: A new splitting criterion
“Universal guarantees for decision tree induction via a higher-order splitting criterion”, Guy Blanc, Neha Gupta, Jane Lange, Li-Yang Tan. NeurIPS 2020.
Our main result
A new splitting criterion and associated learning algorithm that achieves performance guarantees for all functions:
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and all target functions 𝑓, our algorithm’s tree has size 𝑠^{Õ((log 𝑠)²/𝜀²)} and error ≤ 𝑂(optₛ) + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Our algorithm can be viewed as a “noise stabilization” procedure.
Revisiting the bad news for TopDown
𝑓(𝑥) = 𝑥ᵢ ⊕ 𝑥ⱼ has a size-4 DT, yet TopDown’s tree 𝑇 has size Ω(2ⁿ) and error 0.5.
TopDown struggles because 𝔼[𝑓(𝑥)𝑥ₖ] = 0 for all 𝑘 ∈ [𝑛].
Easy fix for this example: take pairwise correlations 𝔼[𝑓(𝑥)𝑥ₖ𝑥ₗ] into account.
A generalized splitting criterion
TopDown’s splitting criterion: split on 𝑥ᵢ where 𝑖 ∈ [𝑛] maximizes 𝔼[𝑓(𝑥)𝑥ᵢ]².
Generalized splitting criterion: split on 𝑥ᵢ where 𝑖 ∈ [𝑛] maximizes
Σ_{𝑆 ∋ 𝑖, |𝑆| ≤ 𝑑} (1 − 𝛿)^{|𝑆|} ⋅ 𝔼[𝑓(𝑥) ∏_{𝑗∈𝑆} 𝑥ⱼ]²
a sum over sets 𝑆 of size ≤ 𝑑 containing 𝑖 of the squared correlation between 𝑓 and the variables in 𝑆, with an “attenuating” factor (1 − 𝛿)^{|𝑆|}: large sets contribute less (necessary!).
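A hedged sketch of estimating this generalized score from labeled samples (X, y) by enumerating subsets; the names and the sampling-based estimation are my own choices, and, as before, bits are viewed as ±1 inside the correlation.

```python
from itertools import combinations
import numpy as np

def higher_order_score(X, y, i, d, delta):
    """Sum over sets S containing i, |S| <= d, of (1-delta)^|S| * E[f(x) * prod_{j in S} x_j]^2."""
    n = X.shape[1]
    Xpm = 2 * X - 1                                    # bits as +/-1
    others = [j for j in range(n) if j != i]
    score = 0.0
    for size in range(1, d + 1):                       # S = {i} union T with |T| = size - 1
        for T in combinations(others, size - 1):
            prod = Xpm[:, [i, *T]].prod(axis=1)
            score += (1 - delta) ** size * (y * prod).mean() ** 2
    return score

# TopDown's criterion is recovered (up to the (1 - delta) factor) by taking d = 1.
```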
Combinatorial view: “Noisy influence”
Split on 𝑥ᵢ where 𝑖 ∈ [𝑛] maximizes Σ_{𝑆 ∋ 𝑖, |𝑆| ≤ 𝑑} (1 − 𝛿)^{|𝑆|} ⋅ 𝔼[𝑓(𝑥) ∏_{𝑗∈𝑆} 𝑥ⱼ]².
Def. Noise sensitivity of 𝑓 at rate 𝛿: NS_𝛿(𝑓) ≔ ℙ[𝑓(𝑥) ≠ 𝑓(𝑦)], where 𝑥 is uniform over {0,1}ⁿ and 𝑦 is a 𝛿-noisy copy of 𝑥 (each bit flipped independently with probability 𝛿).
This score ≍ the expected drop in noise sensitivity from querying 𝑥ᵢ: NS_𝛿(𝑓) − 𝔼_{𝑏∼{0,1}}[NS_𝛿(𝑓_{𝑥ᵢ=𝑏})].
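And a Monte Carlo estimate of NS_𝛿(𝑓) as just defined (an illustration I added; 𝑓 here is any 0/1-valued function of a bit vector):

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_sensitivity(f, n, delta, trials=100_000):
    """Estimate Pr[f(x) != f(y)] for uniform x and y a delta-noisy copy of x."""
    X = rng.integers(0, 2, size=(trials, n))
    flips = rng.random(size=(trials, n)) < delta       # flip each bit independently w.p. delta
    Y = np.where(flips, 1 - X, X)
    fX = np.apply_along_axis(f, 1, X)
    fY = np.apply_along_axis(f, 1, Y)
    return np.mean(fX != fY)

# e.g. for f(x) = x[0] XOR x[1], the true value is 2*delta*(1-delta):
# noise_sensitivity(lambda x: x[0] ^ x[1], n=10, delta=0.1) is approximately 0.18
```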
Our algorithm vs. TopDown
§ Recall the simple key fact in our analysis of TopDown:
𝔼[𝑓(𝑥)𝑥ᵢ]² = Var(𝑓) − 𝔼_𝑏[Var(𝑓_{𝑥ᵢ=𝑏})]
TopDown greedily drives down the variance of 𝑓.
§ Analogous identity for our new splitting criterion:
Σ_{𝑆 ∋ 𝑖, |𝑆| ≤ 𝑑} (1 − 𝛿)^{|𝑆|} ⋅ 𝔼[𝑓(𝑥) ∏_{𝑗∈𝑆} 𝑥ⱼ]² ≍ NS_𝛿(𝑓) − 𝔼_𝑏[NS_𝛿(𝑓_{𝑥ᵢ=𝑏})]
Our algorithm greedily drives down the noise sensitivity of 𝑓.
Driving down variance vs. noise sensitivity
Consider two candidate splits, 𝑥₁ and 𝑥₂: [Figure: the 1-inputs and 0-inputs of 𝑓 under the two candidate splits]
§ Expected variance of the subfunctions is the same for both splits
§ Expected noise sensitivity of the subfunctions is much lower for the 𝑥₂ split
∴ Both splits are equally appealing to TopDown
∴ Our algorithm identifies 𝑥₂ as the better split
End of Part II
A new splitting criterion and associated learning algorithm that achieves performance guarantees for all functions:
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and all target functions 𝑓, our algorithm’s tree has size 𝑠^{Õ((log 𝑠)²/𝜀²)} and error ≤ 𝑂(optₛ) + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Our algorithm can be viewed as a “noise stabilization” procedure.
Recap of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
[Figure: the space of all target functions, with monotone functions as the one subclass we understand; which others?]
Impurity algorithms:
§ Guarantees for other broad and natural subclasses?
§ Characterization?
Beyond impurity algorithms:
§ Understand power of new splitting criterion
§ Noise sensitivity for other domains and distributions?
Beyond a single tree:
§ Random forests and boosted decision trees