Li-Yang Tan — slides: theory.stanford.edu/~liyang/top-down.pdf
TRANSCRIPT
Provable Guarantees for Decision Tree Induction
Li-Yang Tan (Stanford)
Based on joint works with: Guy Blanc (Stanford), Neha Gupta (Stanford), Jane Lange (MIT)
Learning decision trees from labeled data
Example 𝑥 ∈ {0,1}ⁿ, Label 𝑦 ∈ {0,1}
[Table: rows of labeled examples (𝑥, 𝑦)]
Fast algorithm ⟶ Small decision tree
[Figure: a small decision tree querying variables 𝑥ᵢ, with 0/1 leaves]
Decision trees: simple and effective
§ Fast to evaluate
§ Easy to understand, easy to explain predictions
§ Powerful “ensemble methods”: Random forests and Boosted decision trees
Top-down algorithms for learning DTs
A natural approach: Greedy, top-down
1. Choose a “good” variable to query as the root
2. Recurse
§ Definition of “good” = the algorithm’s splitting criterion
§ Intuitively, want the most “important”/“relevant” variable
§ Captures almost all algorithms used in practice: ID3, C4.5, CART, etc.
§ Widely employed, empirically successful
§ This talk: theoretical justification for their effectiveness?
This talk: Performance guarantees for top-down algorithms
Let 𝑓: {0,1}ⁿ → {0,1} be a target function and 𝒟 a distribution over {0,1}ⁿ.
Input: labeled examples (𝑥, 𝑓(𝑥)) where 𝑥 ∼ 𝒟; parameters 𝑠 ∈ ℕ and 𝜀 ∈ (0,1).
Goal: the top-down algorithm’s tree has size 𝑆 = 𝑆(𝑠, 𝜀), “not much larger than” 𝑠, and error ≤ optₛ + 𝜀, where optₛ is the error of the optimal size-𝑠 DT for 𝑓. That is, top-down algorithms achieve error within 𝜀 of that of the optimal size-𝑠 tree for 𝑓.
Our results: 𝒟 = product distribution (this talk: 𝒟 = uniform).
Learning decision trees: practice vs. theory
Decision trees have long been a workhorse of ML, and there is a rich and vast theory literature on learning decision trees.
Practice:
§ Algorithms used in practice proceed in a top-down fashion
§ Empirically successful, but lack performance guarantees
Theory:
§ Many algorithms do not return a decision tree hypothesis (“improper” algorithms)
§ Fastest proper algorithm [Ehrenfeucht–Haussler 89] proceeds bottom-up
Outline of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
[Figure: the space of all target functions, where impurity algorithms can fail badly, with monotone functions highlighted as a subclass]
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
Part I: Power and limitations of existing algorithms
“Top-down induction of decision trees: rigorous guarantees and inherent limitations”, Guy Blanc, Jane Lange, Li-Yang Tan. ITCS 2020.
“Provable guarantees for decision tree induction: the agnostic setting”, Guy Blanc, Jane Lange, Li-Yang Tan. ICML 2020.
How do these top-down algorithms work?
§ Main algorithms used in practice – ID3, C4.5, CART – are all “impurity-based”
§ Each is defined by an impurity function 𝒢: [0,1] → [0,1]
§ 𝒢 is concave and satisfies 𝒢(0) = 𝒢(1) = 0 and 𝒢(0.5) = 1
Examples: ID3 uses 𝒢(𝑝) = H₂(𝑝), the binary entropy; CART uses 𝒢(𝑝) = 4𝑝(1 − 𝑝), the “Gini impurity”
[Figure: a concave impurity function 𝒢 on [0,1] with 𝒢(0) = 𝒢(1) = 0 and 𝒢(0.5) = 1]
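For concreteness, here is a minimal Python sketch (my own illustration, not from the talk) of the two impurity functions named above, normalized so that 𝒢(0.5) = 1 as on the slide.

```python
import numpy as np

def binary_entropy(p):
    """H2(p), the impurity function underlying ID3/C4.5; H2(0) = H2(1) = 0, H2(0.5) = 1."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0) at the endpoints
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    """4p(1-p), the rescaled Gini impurity used by CART; also 0 at p = 0, 1 and 1 at p = 0.5."""
    return 4 * p * (1 - p)
```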
High-level idea
𝒢(𝑝) is an approximation of min{𝑝, 1 − 𝑝}.
∴ For 𝑓: {0,1}ⁿ → {0,1}, 𝒢(𝔼[𝑓]) ≈ min{𝔼[𝑓], 1 − 𝔼[𝑓]}, which measures how close 𝑓 is to a constant function.
Impurity function defines the splitting criterion
Target function 𝑓: {0,1}ⁿ → {0,1}. Which 𝑥ᵢ to query as the root?
Query 𝑥ᵢ as the root, where 𝑖 ∈ [𝑛] maximizes the “purity gain associated with splitting 𝑥ᵢ”:
𝒢(𝔼[𝑓]) − 𝔼_𝑏[𝒢(𝔼[𝑓_{𝑥ᵢ=𝑏}])]
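A hedged sketch of estimating this purity gain from labeled samples (X, y): the names, the empirical weighting of the two branches, and the reuse of the `gini` function sketched earlier are my own choices, not the talk's.

```python
import numpy as np

def purity_gain(X, y, i, impurity):
    """G(E[f]) - E_b[ G(E[f restricted to x_i = b]) ], with E_b weighted empirically."""
    mask = X[:, i] == 1
    p = y.mean()
    p1 = y[mask].mean() if mask.any() else p
    p0 = y[~mask].mean() if (~mask).any() else p
    return impurity(p) - (mask.mean() * impurity(p1) + (~mask).mean() * impurity(p0))

# Root split: query the x_i maximizing purity_gain(X, y, i, gini) (or binary_entropy).
```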
Order in which to split the leaves
Here’s the tree we’ve built so far: [Figure: a partially grown tree with leaves ℓ₁, ℓ₂, …]
Which leaf to split next?
We split the leaf ℓ with the highest “score”, where:
score(ℓ) ≔ ℙ_{𝑥∼𝒟}[𝑥 reaches ℓ] ⋅ max_{𝑖∈[𝑛]} Purity-gain_𝒢(ℓ, 𝑥ᵢ)
A single representative algorithm?
Different impurity functions 𝒢 give rise to different trees. Would be nice to analyze a single algorithm …
TopDown
Lemma [Blanc-Lange-T.]:
There is an algorithm, TopDown, that is “complete” for impurity algorithms: Guarantees for TopDown yield guarantees for all impurity algorithms
[Diagram: ID3, C4.5, and any impurity function 𝒢 reduce to TopDown]
The overhead in our reduction scales with the concavity of 𝒢.
TopDown
Forget about impurity functions. In each iteration, we split the leaf ℓ with the highest score, where:
score(ℓ) ≔ ℙ_{𝑥∼𝒟}[𝑥 reaches ℓ] ⋅ max_{𝑖∈[𝑛]} 𝔼[𝑓ℓ(𝑥)𝑥ᵢ]²   (the correlation between 𝑓ℓ and 𝑥ᵢ)
That is, we split ℓ with a query to the 𝑥ᵢ that’s best correlated with 𝑓ℓ.
How does the process end? When the tree reaches a user-specified size 𝑆, label each leaf ℓ with round(𝔼[𝑓ℓ]) and output the tree.
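Putting the pieces together, here is a rough, sample-based sketch of the TopDown loop just described. It is a simplification of mine, not the authors' code; in particular, I recenter each bit to ±1 inside the correlation, which is my own convention choice (the slides simply write 𝔼[𝑓ℓ(𝑥)𝑥ᵢ]²).

```python
import numpy as np

def topdown(X, y, S):
    """Grow a tree with at most S leaves; return [(fixed_coordinates, leaf_label), ...]."""
    n = X.shape[1]
    leaves = [(np.arange(len(y)), {})]            # (indices of examples reaching the leaf, fixed coords)

    def best_split(idx, fixed):
        # score = Pr[x reaches leaf] * max_i E[f_leaf(x) x_i]^2, estimated from the sample
        if len(idx) == 0 or y[idx].min() == y[idx].max():
            return 0.0, None                      # empty or pure leaf: nothing to gain
        weight = len(idx) / len(y)
        best_corr, best_i = 0.0, None
        for i in range(n):
            if i in fixed:
                continue
            xi = 2 * X[idx, i] - 1                # view the bit as +/-1
            corr = (y[idx] * xi).mean() ** 2
            if corr > best_corr:
                best_corr, best_i = corr, i
        return weight * best_corr, best_i

    while len(leaves) < S:
        scores = [best_split(idx, fixed) for idx, fixed in leaves]
        j = max(range(len(leaves)), key=lambda k: scores[k][0])
        _, i = scores[j]
        if i is None:
            break                                 # no leaf has a useful split left
        idx, fixed = leaves.pop(j)                # split the highest-scoring leaf on x_i
        for b in (0, 1):
            leaves.append((idx[X[idx, i] == b], {**fixed, i: b}))

    # label each leaf with round(E[f_leaf]) and output the tree
    return [(fixed, int(round(y[idx].mean())) if len(idx) else 0) for idx, fixed in leaves]
```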
Outline of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
Bad news for TopDown
There is an 𝑓: {0,1}ⁿ → {0,1} such that 𝑓 ≡ a size-4 DT, yet TopDown’s tree 𝑇 has size Ω(2ⁿ) and error ℙ[𝑇(𝑥) ≠ 𝑓(𝑥)] = 0.5:
𝑓(𝑥) = 𝑥ᵢ ⊕ 𝑥ⱼ for some 𝑖, 𝑗 ∈ [𝑛]
TopDown struggles because 𝔼[𝑓(𝑥)𝑥ₖ] = 0 for all 𝑘 ∈ [𝑛].
M. Kearns, “Recent developments in decision tree induction and the weak learning framework”, AAAI 1996.
Guarantees for monotone functions?
[Figure: within the space of all target functions, monotone functions form a subclass; [FP04] covers subclasses of it (read-once DNFs, halfspaces), our work covers all of it]
[Fiat–Pechyony 04] raised the possibility of performance guarantees for monotone functions
§ Reasonable assumption in practice; intensively studied in theory
§ [FP04]: Proved strong guarantees for various subclasses of monotone functions (read-once DNFs, halfspaces)
Theorem [Blanc-Lange-T.]:
Let 𝑓: {0,1}ⁿ → {0,1} be monotone and 𝒟 a product distribution over {0,1}ⁿ. For all 𝑠 ∈ ℕ and 𝜀 ∈ (0,1), given examples (𝑥, 𝑓(𝑥)) where 𝑥 ∼ 𝒟, TopDown grows a tree 𝑇 of size
𝑠^{Õ((log 𝑠)/𝜀²)}
that achieves ℙ_{𝑥∼𝒟}[𝑓(𝑥) ≠ 𝑇(𝑥)] ≤ optₛ + 𝜀.
Monotone 𝑓, size-𝑠 DT (error optₛ) ⟹ TopDown’s tree: size 𝑠^{Õ((log 𝑠)/𝜀²)}, error ≤ optₛ + 𝜀
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and monotone 𝑓, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀²)} and error ≤ optₛ + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Matching lower bound: For all 𝑠 ∈ ℕ, there is a monotone 𝑓 that is 0.01-close to a size-𝑠 DT, yet TopDown’s tree has error ≥ 0.49 until its size reaches 𝑠^{Ω̃(log 𝑠)}.
Proof sketch of the upper bound
Consider the realizable setting (optₛ = 0): 𝑓 ≡ a size-𝑠 DT.
Let’s focus on understanding the first decision that TopDown makes.
Q: What can we say about max_{𝑖∈[𝑛]} 𝔼[𝑓(𝑥)𝑥ᵢ]? What’s the relationship between max_{𝑖∈[𝑛]} 𝔼[𝑓(𝑥)𝑥ᵢ] and the complexity of 𝑓?
O’Donnell, Saks, Schramm, Servedio, “Every decision tree has an influential variable”, FOCS 2005
OSSS inequality: If 𝑓: {0,1}ⁿ → {0,1} is computed by a size-𝑠 decision tree, there is an 𝑖 ∈ [𝑛] such that
Infᵢ(𝑓) ≥ Var(𝑓)/log 𝑠,
where Infᵢ(𝑓) ≔ ℙ[𝑓(𝑥) ≠ 𝑓(𝑥^{⊕𝑖})] and 𝑥^{⊕𝑖} is 𝑥 with its 𝑖-th bit flipped.
§ If 𝑓 is monotone, Infᵢ(𝑓) = 𝔼[𝑓(𝑥)𝑥ᵢ]
§ If Var(𝑓) ≤ 𝜀 then 𝑓 is 𝑂(𝜀)-close to a constant
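As a sanity check (a toy example of my own, not from the talk), the inequality can be verified by brute force on a small decision tree, here 𝑓(𝑥) = 𝑥₁ if 𝑥₀ else 𝑥₂, which has 𝑠 = 4 leaves; log is taken base 2 below.

```python
import itertools, math

def f(x):                        # a size-4 decision tree: query x0, then x1 or x2
    return x[1] if x[0] else x[2]

points = list(itertools.product((0, 1), repeat=3))
mean = sum(map(f, points)) / len(points)
variance = sum((f(x) - mean) ** 2 for x in points) / len(points)

def influence(i):                # Inf_i(f) = Pr[f(x) != f(x with i-th bit flipped)]
    flip = lambda x: x[:i] + (1 - x[i],) + x[i + 1:]
    return sum(f(x) != f(flip(x)) for x in points) / len(points)

s = 4
assert max(influence(i) for i in range(3)) >= variance / math.log2(s)
```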
OSSS vs. Kahn-Kalai-Linial inequality
OSSS inequality: for 𝑓 ≡ a size-𝑠 DT, there exists 𝑖 ∈ [𝑛] such that Infᵢ(𝑓) ≥ Var(𝑓)/log 𝑠.
KKL inequality: for all 𝑓: {0,1}ⁿ → {0,1}, there exists 𝑖 ∈ [𝑛] such that Infᵢ(𝑓) ≥ Var(𝑓) ⋅ (log 𝑛)/𝑛.
§ OSSS = “KKL that takes into account the decision tree complexity of 𝑓”
§ Strengthens KKL whenever 𝑠 = 2^{𝑂(𝑛/log 𝑛)}
§ Exponentially stronger when 𝑠 = poly(𝑛)
Back to analyzing TopDown
Simplifying assumption #1: the realizable setting (opt = 0), i.e. 𝑓 ≡ a size-𝑠 DT.
Simplifying assumption #2: instead of how TopDown actually works (split the leaf with the highest score), we analyze a variant that splits all leaves, maintaining a complete tree.
A simple potential function argument
How does Var(𝑓) change when 𝑥ᵢ is queried, i.e. when 𝑓 is split into the subfunctions 𝑓_{𝑥ᵢ=0} and 𝑓_{𝑥ᵢ=1}?
Fact: 𝔼_𝑏[Var(𝑓_{𝑥ᵢ=𝑏})] = Var(𝑓) − 𝔼[𝑓(𝑥)𝑥ᵢ]²
Let 𝑇 be TopDown’s tree so far, and consider the quantity 𝔼_{leaves ℓ∼𝑇}[Var(𝑓ℓ)]:
§ When 𝑇 = ∅, this is simply Var(𝑓)
§ For all 𝑇, this quantity is ≥ 0
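A quick numerical check of the Fact above (an illustration I added): brute force over a random function on {0,1}⁴ under the uniform distribution, with 𝑥ᵢ recentered to ±1 inside the correlation.

```python
import itertools, random
import numpy as np

n, i = 4, 2
random.seed(0)
f = {x: random.randint(0, 1) for x in itertools.product((0, 1), repeat=n)}   # random truth table

def var(points):
    return np.var([f[x] for x in points])

points = list(f)
corr = np.mean([f[x] * (2 * x[i] - 1) for x in points])                      # E[f(x) x_i], x_i in {-1,+1}
lhs = np.mean([var([x for x in points if x[i] == b]) for b in (0, 1)])       # E_b[ Var(f_{x_i=b}) ]
rhs = var(points) - corr ** 2                                                # Var(f) - E[f(x) x_i]^2
assert abs(lhs - rhs) < 1e-12
```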
Q: Is it true that for a ≥ 1 − 𝜀 fraction of leaves ℓ of 𝑇, Var(𝑓ℓ) ≤ 𝜀?
§ If so, done: 𝑇 is 𝑂(𝜀)-close to 𝑓.
§ If not, split all leaves of 𝑇 to obtain 𝑇*. Then
𝔼_{ℓ∼𝑇*}[Var(𝑓ℓ)] = 𝔼_{ℓ∼𝑇}[Var(𝑓ℓ)] − 𝔼_{ℓ∼𝑇}[𝔼_𝑥[𝑓ℓ(𝑥)𝑥ᵢ(ℓ)]²],
where 𝑥ᵢ(ℓ) is the variable maximizing correlation with 𝑓ℓ. By [OSSS], 𝔼[𝑓ℓ(𝑥)𝑥ᵢ(ℓ)] ≥ Var(𝑓ℓ)/log 𝑠, and a ≥ 𝜀 fraction of leaves have Var(𝑓ℓ) ≥ 𝜀.
∴ 1 unit of depth buys us a ≥ 𝜀 ⋅ (𝜀/log 𝑠)² = 𝜀³/(log 𝑠)² decrease in the “potential”.
∴ Done when depth ≤ (log 𝑠)²/𝜀³, hence size ≤ 2^{(log 𝑠)²/𝜀³} = 𝑠^{Õ((log 𝑠)/𝜀³)}.
§ For the non-realizable setting, we need a “robust” version of the OSSS inequality, proved by [Jain-Zhang 11].
§ For the realizable setting, the argument can be optimized to get slightly better parameters than the 𝑠^{Õ((log 𝑠)/𝜀³)} bound sketched above.
We just sketched the proof of: in the realizable setting (optₛ = 0), for monotone 𝑓 ≡ a size-𝑠 DT, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀³)} and error ≤ 𝜀.
Beyond the proof sketch
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and monotone 𝑓, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀²)} and error ≤ optₛ + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Matching lower bound: For all 𝑠 ∈ ℕ, there is a monotone 𝑓 that is 0.01-close to a size-𝑠 DT, yet TopDown’s tree has error ≥ 0.49 until its size reaches 𝑠^{Ω̃(log 𝑠)}.
Next:
A monotone function that fools TopDown. Our function will be based on Tribes and Majority:
Tribes_ℓ(𝑥₁, …, 𝑥ℓ): 𝔼[Tribes_ℓ(𝑥)𝑥ᵢ] = Θ((log ℓ)/ℓ) for all 𝑖 ∈ [ℓ]
Majority_𝑘(𝑦₁, …, 𝑦ₖ): 𝔼[Majority_𝑘(𝑦)𝑦ᵢ] = Θ(1/√𝑘) for all 𝑖 ∈ [𝑘]
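A rough Monte Carlo illustration of these two correlation estimates (mine, not from the talk). The tribe width below is a crude, untuned choice (the actual construction tunes it so that 𝔼[Tribes] is close to 1/2), so treat the outputs as order-of-magnitude only.

```python
import numpy as np

rng = np.random.default_rng(0)
l, k, trials = 256, 64, 200_000
X = rng.integers(0, 2, size=(trials, l), dtype=np.int8)     # inputs to Tribes_l
Y = rng.integers(0, 2, size=(trials, k), dtype=np.int8)     # inputs to Majority_k

width = max(1, int(np.log2(l) - np.log2(np.log2(l))))       # crude tribe width (untuned)
blocks = X[:, : (l // width) * width].reshape(trials, -1, width)
tribes_vals = blocks.all(axis=2).any(axis=1).astype(float)  # OR of ANDs over disjoint blocks
maj_vals = (Y.sum(axis=1) * 2 > k).astype(float)

print(np.mean(tribes_vals * (2 * X[:, 0] - 1)))   # ~ Theta((log l)/l): small
print(np.mean(maj_vals * (2 * Y[:, 0] - 1)))      # ~ Theta(1/sqrt(k)): noticeably larger
```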
High-level idea: Our function 𝐹(𝑥₁, …, 𝑥ℓ, 𝑦₁, …, 𝑦ₖ) will be:
§ “99% Tribes_ℓ + 1% Majority_𝑘”, so 𝐹 ≈ Tribes_ℓ
§ However, ℓ and 𝑘 are chosen so that Majority_𝑘’s variables are better correlated with 𝐹 than Tribes_ℓ’s
∴ TopDown is fooled into querying Majority_𝑘’s variables
𝐹(𝑥₁, …, 𝑥ℓ, 𝑦₁, …, 𝑦ₖ) ≔ if 𝑥 lies outside a designated 1% region of {0,1}^ℓ, output Tribes_ℓ(𝑥); else output Majority_𝑘(𝑦).
Observation: 𝐹 is 0.01-close to Tribes_ℓ(𝑥), which has decision tree size ≤ 2^ℓ.
We choose ℓ and 𝑘 to satisfy 1/√𝑘 ≳ (log ℓ)/ℓ, ensuring that 𝔼[𝐹(𝑥, 𝑦)𝑦ᵢ] ≳ 𝔼[𝐹(𝑥, 𝑦)𝑥ᵢ].
∴ TopDown is fooled into building a DT for Majority_𝑘(𝑦), of size ≥ 2^𝑘.
[Figure: 𝐹 over {0,1}^ℓ × {0,1}^𝑘. Tribes_ℓ gives a tree of size 2^ℓ with error ≤ 0.01, while TopDown’s tree, building a DT for Majority_𝑘, has error ≥ 0.49 until size 2^𝑘.]
End of Part I
Performance guarantees for impurity algorithms when run on monotone target functions:
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and monotone 𝑓, TopDown’s tree has size 𝑠^{Õ((log 𝑠)/𝜀²)} and error ≤ optₛ + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
We provide a matching 𝑠^{Ω̃(log 𝑠)} lower bound.
Outline of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
Part II: A new splitting criterion
“Universal guarantees for decision tree induction via a higher-order splitting criterion”, Guy Blanc, Neha Gupta, Jane Lange, Li-Yang Tan. NeurIPS 2020.
Our main result
A new splitting criterion and associated learning algorithm that achieves performance guarantees for all functions:
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and all target functions 𝑓, our algorithm’s tree has size 𝑠^{Õ((log 𝑠)²/𝜀²)} and error ≤ 𝑂(optₛ) + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Our algorithm can be viewed as a “noise stabilization” procedure.
Revisiting the bad news for TopDown
𝑓(𝑥) = 𝑥ᵢ ⊕ 𝑥ⱼ has a size-4 DT, yet TopDown’s tree 𝑇 has size Ω(2ⁿ) and error 0.5.
TopDown struggles because 𝔼[𝑓(𝑥)𝑥ₖ] = 0 for all 𝑘 ∈ [𝑛].
Easy fix for this example: take pairwise correlations 𝔼[𝑓(𝑥)𝑥ₖ𝑥ₗ] into account.
A generalized splitting criterion
TopDown’s splitting criterion: split on 𝑥ᵢ where 𝑖 ∈ [𝑛] maximizes 𝔼[𝑓(𝑥)𝑥ᵢ]².
Generalized splitting criterion: split on 𝑥ᵢ where 𝑖 ∈ [𝑛] maximizes
Σ_{𝑆 ∋ 𝑖, |𝑆| ≤ 𝑑} (1 − 𝛿)^{|𝑆|} ⋅ 𝔼[𝑓(𝑥) ∏_{𝑗∈𝑆} 𝑥ⱼ]²
a sum over sets 𝑆 of size ≤ 𝑑 containing 𝑖 of the squared correlation between 𝑓 and the variables in 𝑆, with an “attenuating” factor (1 − 𝛿)^{|𝑆|}: large sets contribute less (necessary!).
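A hedged sketch of estimating this generalized score from labeled samples (X, y) by enumerating subsets; the names and the sampling-based estimation are my own choices, and, as before, bits are viewed as ±1 inside the correlation.

```python
from itertools import combinations
import numpy as np

def higher_order_score(X, y, i, d, delta):
    """Sum over sets S containing i, |S| <= d, of (1-delta)^|S| * E[f(x) * prod_{j in S} x_j]^2."""
    n = X.shape[1]
    Xpm = 2 * X - 1                                    # bits as +/-1
    others = [j for j in range(n) if j != i]
    score = 0.0
    for size in range(1, d + 1):                       # S = {i} union T with |T| = size - 1
        for T in combinations(others, size - 1):
            prod = Xpm[:, [i, *T]].prod(axis=1)
            score += (1 - delta) ** size * (y * prod).mean() ** 2
    return score

# TopDown's criterion is recovered (up to the (1 - delta) factor) by taking d = 1.
```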
Combinatorial view: “Noisy influence”
Split on 𝑥ᵢ where 𝑖 ∈ [𝑛] maximizes Σ_{𝑆 ∋ 𝑖, |𝑆| ≤ 𝑑} (1 − 𝛿)^{|𝑆|} ⋅ 𝔼[𝑓(𝑥) ∏_{𝑗∈𝑆} 𝑥ⱼ]².
Def. Noise sensitivity of 𝑓 at rate 𝛿: NS_𝛿(𝑓) ≔ ℙ[𝑓(𝑥) ≠ 𝑓(𝑦)], where 𝑥 is uniform over {0,1}ⁿ and 𝑦 is a 𝛿-noisy copy of 𝑥 (each bit flipped independently with probability 𝛿).
This score ≍ the expected drop in noise sensitivity from querying 𝑥ᵢ: NS_𝛿(𝑓) − 𝔼_{𝑏∼{0,1}}[NS_𝛿(𝑓_{𝑥ᵢ=𝑏})].
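And a Monte Carlo estimate of NS_𝛿(𝑓) as just defined (an illustration I added; 𝑓 here is any 0/1-valued function of a bit vector):

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_sensitivity(f, n, delta, trials=100_000):
    """Estimate Pr[f(x) != f(y)] for uniform x and y a delta-noisy copy of x."""
    X = rng.integers(0, 2, size=(trials, n))
    flips = rng.random(size=(trials, n)) < delta       # flip each bit independently w.p. delta
    Y = np.where(flips, 1 - X, X)
    fX = np.apply_along_axis(f, 1, X)
    fY = np.apply_along_axis(f, 1, Y)
    return np.mean(fX != fY)

# e.g. for f(x) = x[0] XOR x[1], the true value is 2*delta*(1-delta):
# noise_sensitivity(lambda x: x[0] ^ x[1], n=10, delta=0.1) is approximately 0.18
```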
Our algorithm vs. TopDown
§ Recall the simple key fact in our analysis of TopDown:
𝔼[𝑓(𝑥)𝑥ᵢ]² = Var(𝑓) − 𝔼_𝑏[Var(𝑓_{𝑥ᵢ=𝑏})]
TopDown greedily drives down the variance of 𝑓.
§ Analogous identity for our new splitting criterion:
Σ_{𝑆 ∋ 𝑖, |𝑆| ≤ 𝑑} (1 − 𝛿)^{|𝑆|} ⋅ 𝔼[𝑓(𝑥) ∏_{𝑗∈𝑆} 𝑥ⱼ]² ≍ NS_𝛿(𝑓) − 𝔼_𝑏[NS_𝛿(𝑓_{𝑥ᵢ=𝑏})]
Our algorithm greedily drives down the noise sensitivity of 𝑓.
Driving down variance vs. noise sensitivity
Consider two candidate splits, 𝑥₁ and 𝑥₂: [Figure: the 1-inputs and 0-inputs of 𝑓 under the two candidate splits]
§ Expected variance of the subfunctions is the same for both splits
§ Expected noise sensitivity of the subfunctions is much lower for the 𝑥₂ split
∴ Both splits are equally appealing to TopDown
∴ Our algorithm identifies 𝑥₂ as the better split
End of Part II
A new splitting criterion and associated learning algorithm that achieves performance guarantees for all functions:
Theorem: For all 𝑠 ∈ ℕ, 𝜀 ∈ (0,1), and all target functions 𝑓, our algorithm’s tree has size 𝑠^{Õ((log 𝑠)²/𝜀²)} and error ≤ 𝑂(optₛ) + 𝜀, where optₛ is the error of the best size-𝑠 DT for 𝑓.
Our algorithm can be viewed as a “noise stabilization” procedure.
Recap of this talk
§ How these top-down algorithms work: “impurity-based”
§ Bad news: target functions for which impurity algorithms fail badly
§ Result I: Performance guarantees for monotone functions
§ Result II: A new, non-impurity-based, algorithm which achieves guarantees for all functions
[Figure: the space of all target functions, with monotone functions as the one subclass we understand; which others?]
Impurity algorithms:
§ Guarantees for other broad and natural subclasses?
§ Characterization?
Beyond impurity algorithms:
§ Understand power of new splitting criterion
§ Noise sensitivity for other domains and distributions?
Beyond a single tree:
§ Random forests and boosted decision trees