
Page 1: Harmonic Analysis in  Learning Theory

Harmonic Analysis in Learning Theory

Jeff Jackson

Duquesne University

Page 2: Harmonic Analysis in  Learning Theory

Themes

• Harmonic analysis is central to learning-theoretic results in a wide variety of models
– Results are generally the strongest known for learning with respect to the uniform distribution

• Work on learning problems has led to some new harmonic results
– Spectral properties of Boolean function classes
– Algorithms for approximating Boolean functions

Page 3: Harmonic Analysis in  Learning Theory

Uniform Learning Model

[Model diagram] A target function f : {0,1}^n → {0,1} is drawn from a Boolean function class F (e.g., DNF). The example oracle EX(f) supplies the learning algorithm A with uniform random examples <x, f(x)>. Given an accuracy ε > 0, A must output a hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 4: Harmonic Analysis in  Learning Theory

Circuit Classes

• Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)

• DNF: depth-2 circuit with OR at root

[Figure: a circuit of AND/OR gates arranged in d levels over inputs v1, v2, v3, …, vn; negations allowed.]

Page 5: Harmonic Analysis in  Learning Theory

Decision Trees

[Figure: a decision tree with root v3, internal nodes v1, v2, v4, and leaves labeled 0 and 1.]

Page 6: Harmonic Analysis in  Learning Theory

Decision Trees

[Figure: the same tree evaluated on x = 11001. Since x3 = 0, evaluation follows the 0-branch out of the root v3.]

Page 7: Harmonic Analysis in  Learning Theory

Decision Trees

[Figure: evaluation of x = 11001 continues at v1; since x1 = 1, it follows the 1-branch.]

Page 8: Harmonic Analysis in  Learning Theory

Decision Trees

[Figure: evaluation of x = 11001 ends at a leaf labeled 1, so f(x) = 1.]

Page 9: Harmonic Analysis in  Learning Theory

Function Size

• Each function representation has a natural size measure:
– CDC, DNF: # of gates
– DT: # of leaves

• Size s_F(f) of f with respect to class F is the size of the smallest representation of f within F
– For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)

Page 10: Harmonic Analysis in  Learning Theory

Efficient Uniform Learning Model

[Model diagram] Identical to the uniform learning model: the example oracle EX(f) for a target f : {0,1}^n → {0,1} from class F (e.g., DNF) supplies uniform random examples <x, f(x)> to the learning algorithm A, which, given accuracy ε > 0, outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε. In addition, A must now run in time poly(n, s_F, 1/ε).

Page 11: Harmonic Analysis in  Learning Theory

Harmonic-Based Uniform Learning

• [LMN]: constant-depth circuits are quasi-efficiently (n^{polylog(s/ε)}-time) uniform learnable

• [BT]: monotone Boolean functions are uniform learnable in time roughly 2^{√n · log n}
– Monotone: for all x, i: f(x|xi=0) ≤ f(x|xi=1)
– Also exponential in 1/ε (so assumes ε constant)
– But independent of any size measure

Page 12: Harmonic Analysis in  Learning Theory

Notation

• Assume f : {0,1}^n → {-1,1}

• For all a in {0,1}^n, χ_a(x) ≡ (-1)^{a·x}

• For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is (estimated empirically in the sketch below):

f̂(a) ≡ E_{x~U}[f(x)·χ_a(x)]

• Sometimes write, e.g., f̂({1}) for f̂(10…0)
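To make the definition concrete, here is a minimal sketch (not from the talk) that estimates a single coefficient f̂(a) from uniform random examples; the helper names and sample size are illustrative assumptions.

```python
import random

def chi(a, x):
    """Parity character χ_a(x) = (-1)^(a·x) for 0/1 vectors a, x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def estimate_coefficient(f, n, a, samples=10000, rng=random):
    """Estimate f̂(a) = E_{x~U}[f(x)·χ_a(x)] from uniform examples.

    f maps a length-n 0/1 list to {-1, +1}.
    """
    total = 0
    for _ in range(samples):
        x = [rng.randrange(2) for _ in range(n)]
        total += f(x) * chi(a, x)
    return total / samples

# Example: f = parity of the first two bits, so f̂(11000) ≈ 1.
f = lambda x: chi([1, 1, 0, 0, 0], x)
print(estimate_coefficient(f, 5, [1, 1, 0, 0, 0]))
```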

Page 13: Harmonic Analysis in  Learning Theory

Fourier Properties of Classes

Σ_{a∉S} f̂²(a) < ε if:

• [LMN]: f is a constant-depth circuit of depth d and S = {a : |a| < log^d(s/ε)} (|a| ≡ # of 1's in a)

• [BT]: f is a monotone Boolean function and S = {a : |a| < √n/ε}

Page 14: Harmonic Analysis in  Learning Theory

Spectral Properties

Page 15: Harmonic Analysis in  Learning Theory

Proof Techniques

• [LMN]: Håstad's Switching Lemma + harmonic analysis

• [BT]: Based on [KKL]
– Define AS(f) ≡ n · Pr_{x,i}[f(x|xi=0) ≠ f(x|xi=1)] (estimated empirically below)
– If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
– For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n
– Note: this is tight for MAJ
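The following minimal sketch (an illustration, not part of the [BT] proof) approximates AS(f) by sampling random (x, i) pairs; names and sample size are assumptions.

```python
import random

def average_sensitivity(f, n, samples=20000, rng=random):
    """Estimate AS(f) = n * Pr_{x,i}[f(x|xi=0) != f(x|xi=1)] by sampling.

    f maps a length-n 0/1 list to {-1, +1} (or {0, 1}).
    """
    disagreements = 0
    for _ in range(samples):
        x = [rng.randrange(2) for _ in range(n)]
        i = rng.randrange(n)
        x0, x1 = list(x), list(x)
        x0[i], x1[i] = 0, 1          # x with bit i forced to 0 and to 1
        disagreements += f(x0) != f(x1)
    return n * disagreements / samples

# Sanity check: AS(MAJ) ≈ √(2n/π), i.e., on the order of √n.
maj = lambda x: 1 if sum(x) * 2 > len(x) else -1
print(average_sensitivity(maj, 101))
```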

Page 16: Harmonic Analysis in  Learning Theory

Function Approximation

• For all Boolean f:

f = Σ_{a∈{0,1}^n} f̂(a)·χ_a

• For S ⊆ {0,1}^n, define

f_S ≡ Σ_{a∈S} f̂(a)·χ_a

• [LMN]:

Pr_{x~U}[f(x) ≠ sign(f_S(x))] ≤ Σ_{a∉S} f̂²(a)

Page 17: Harmonic Analysis in  Learning Theory

“The” Fourier Learning Algorithm

• Given: ε (and perhaps s, d)

• Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε

• Draw a sufficiently large sample of examples <x, f(x)> to closely estimate f̃(a) ≈ f̂(a) for all a∈S
– Chernoff bounds: ~n^k/ε sample size sufficient

• Output h ≡ sign(Σ_{a∈S} f̃(a)·χ_a)

• Run time ~ n^{2k}/ε (a code sketch follows)
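A minimal sketch of this low-degree algorithm, assuming labeled examples are already in hand; the helper names and the degree/sample choices are illustrative, and none of the sample-size bookkeeping above is enforced.

```python
import itertools, random

def chi(a, x):
    """χ_a(x) for a given as a tuple of coordinate indices (a subset of {1..n})."""
    return -1 if sum(x[i] for i in a) % 2 else 1

def low_degree_learn(examples, n, k):
    """LMN-style low-degree algorithm: estimate every Fourier coefficient
    of degree < k from the examples, then predict with the sign of the
    truncated Fourier expansion."""
    coeffs = {}
    for d in range(k):
        for a in itertools.combinations(range(n), d):
            coeffs[a] = sum(y * chi(a, x) for x, y in examples) / len(examples)
    def h(x):
        return 1 if sum(c * chi(a, x) for a, c in coeffs.items()) >= 0 else -1
    return h

# Usage sketch: learn MAJ on 7 bits (labels in {-1,+1}) with degree < 4.
maj = lambda x: 1 if sum(x) * 2 > len(x) else -1
xs = [[random.randrange(2) for _ in range(7)] for _ in range(4000)]
h = low_degree_learn([(x, maj(x)) for x in xs], 7, 4)
```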

Page 18: Harmonic Analysis in  Learning Theory

Halfspaces

• [KOS]: halfspaces are efficiently uniform learnable (given ε is constant)
– Halfspace: w∈R^{n+1} s.t. f(x) = sign(w · (x∘1))
– If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε
– Apply the LMN algorithm

• A similar result applies for an arbitrary function applied to a constant number of halfspaces
– Intersection of halfspaces is a key learning problem

Page 19: Harmonic Analysis in  Learning Theory

Halfspace Techniques

• [O] (cf. [BKS], [BJTa]):
– Noise sensitivity of f at γ is the probability that corrupting each bit of x independently with probability γ changes f(x)
– NS_γ(f) = ½(1 - Σ_a (1-2γ)^{|a|}·f̂²(a)) (estimated by sampling below)

• [KOS]:
– If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3·NS_γ(f)
– If f is a halfspace then NS_ε < 9√ε
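NS_γ(f) can be estimated directly by sampling correlated input pairs, as in this minimal sketch (names and sample count are assumptions, not from [O] or [KOS]):

```python
import random

def noise_sensitivity(f, n, gamma, samples=20000, rng=random):
    """Estimate NS_γ(f): draw uniform x, flip each bit independently
    with probability γ to get y, and report Pr[f(x) != f(y)]."""
    changed = 0
    for _ in range(samples):
        x = [rng.randrange(2) for _ in range(n)]
        y = [b ^ (rng.random() < gamma) for b in x]
        changed += f(x) != f(y)
    return changed / samples

# For a halfspace such as MAJ, NS_ε grows like √ε rather than ε.
maj = lambda x: 1 if sum(x) * 2 > len(x) else -1
print(noise_sensitivity(maj, 101, 0.01))
```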

Page 20: Harmonic Analysis in  Learning Theory

Monotone DT

• [OS]: monotone functions are efficiently learnable given:
– ε is constant
– s_DT(f) is used as the size measure

• Techniques:
– Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))
– [BT]: if S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
– Friedgut: ∃ T with |T| ≤ 2^{AS(f)/ε} s.t. Σ_{A∉T} f̂²(A) < ε

Page 21: Harmonic Analysis in  Learning Theory

Weak Approximators

• KKL also show that if f is monotone, there is an i such that -f̂({i}) ≥ log²n / n

• Therefore Pr[f(x) = -χ_{i}(x)] ≥ ½ + log²n / 2n

• In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f

• If A outputs a weak approximator for every f in F, then F is weakly learnable

Page 22: Harmonic Analysis in  Learning Theory

Uniform Learning Model

[Model diagram, repeated for comparison] The example oracle EX(f) for a target f : {0,1}^n → {0,1} from class F (e.g., DNF) supplies uniform random examples <x, f(x)> to the learning algorithm A, which, given accuracy ε > 0, outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 23: Harmonic Analysis in  Learning Theory

Weak Uniform Learning Model

[Model diagram] As in the uniform model, the example oracle EX(f) for a target f : {0,1}^n → {0,1} from class F (e.g., DNF) supplies uniform random examples <x, f(x)> to A, but now the hypothesis h : {0,1}^n → {0,1} need only satisfy Pr_{x~U}[f(x) ≠ h(x)] < ½ - 1/p(n,s).

Page 24: Harmonic Analysis in  Learning Theory

Efficient Weak Learning Algorithm for Monotone Boolean Functions

• Draw a set of ~n² examples <x, f(x)>

• For i = 1 to n:
– Estimate f̂({i})

• Output h ≡ -χ_{i} for the i maximizing -f̂({i}) (sketch below)
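A minimal sketch of this weak learner, assuming uniform examples are given as (x, y) pairs with y ∈ {-1, +1}; names and sample handling are illustrative.

```python
import random

def weak_learn_monotone(examples, n):
    """Estimate every degree-1 coefficient f̂({i}) = E[f(x)·(-1)^{x_i}]
    and return the single-variable hypothesis h = -χ_{i*} with the most
    negative coefficient (the best weak approximator per [KKL])."""
    m = len(examples)
    coeff = [sum(y * (-1) ** x[i] for x, y in examples) / m for i in range(n)]
    i_star = min(range(n), key=lambda i: coeff[i])  # maximizes -f̂({i})
    return lambda x: -((-1) ** x[i_star])

# Usage sketch on MAJ (monotone):
maj = lambda x: 1 if sum(x) * 2 > len(x) else -1
xs = [[random.randrange(2) for _ in range(9)] for _ in range(5000)]
h = weak_learn_monotone([(x, maj(x)) for x in xs], 9)
```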

Page 25: Harmonic Analysis in  Learning Theory

Weak Approximation for MAJ of Constant-Depth Circuits

• Note that adding a single MAJ to a CDC destroys the LMN spectral property

• [JKS]: MAJ of CDCs is quasi-efficiently, quasi-weakly uniform learnable
– If f is a MAJ of CDCs of depth d, and if the number of gates in f is s, then there is an A ∈ {0,1}^n such that:
• |A| < log^d s ≡ k
• Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4sn^k)

Page 26: Harmonic Analysis in  Learning Theory

Weak Learning Algorithm

• Compute k = log^d s

• Draw ~sn^k examples <x, f(x)>

• Repeat over A with |A| < k:
– Estimate f̂(A)

• Until an A is found s.t. f̂(A) > 1/(2sn^k)

• Output h ≡ χ_A

• Run time ~ n^{polylog(s)} (code sketch below)
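A minimal sketch of this search over low-degree parities; the threshold is passed in by the caller (e.g., 1/(2sn^k)), and names and sample handling are illustrative assumptions.

```python
import itertools

def chi(a, x):
    """χ_a(x) for a tuple of coordinate indices a."""
    return -1 if sum(x[i] for i in a) % 2 else 1

def weak_learn_maj_of_cdc(examples, n, k, threshold):
    """Scan all parities χ_A with |A| < k and return the first whose
    estimated Fourier coefficient exceeds the threshold."""
    m = len(examples)
    for d in range(k):
        for a in itertools.combinations(range(n), d):
            if sum(y * chi(a, x) for x, y in examples) / m > threshold:
                return lambda x, a=a: chi(a, x)
    return None  # no sufficiently correlated low-degree parity found
```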

Page 27: Harmonic Analysis in  Learning Theory

Weak Approximator Proof Techniques

• "Discriminator Lemma" (HMPST)
– Implies one of the CDCs is a weak approximator to f

• LMN spectral characterization of CDC

• Harmonic analysis

• Beigel result used to extend weak learning to CDC with polylog-many MAJ gates

Page 28: Harmonic Analysis in  Learning Theory

Boosting

• In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [F], …)
– Need to learn weakly with respect to near-uniform distributions
• For near-uniform distribution D, find weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s)
– Final h is typically a MAJ of the weak approximators

Page 29: Harmonic Analysis in  Learning Theory

Strong Learning for MAJ of Constant-Depth Circuits

• [JKS]: MAJ of CDC is quasi-efficiently uniform learnable
– Show that for near-uniform distributions, some parity function is a weak approximator
– Beigel result again extends this to CDC with polylog-many MAJ gates

• [KP] + boosting: there are distributions for which no parity is a weak approximator

Page 30: Harmonic Analysis in  Learning Theory

Uniform Learning from a Membership Oracle

[Model diagram] The learning algorithm A now has a membership oracle MEM(f) for the target f : {0,1}^n → {0,1} from class F (e.g., DNF): A sends any x of its choosing and receives f(x). Given accuracy ε > 0, A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 31: Harmonic Analysis in  Learning Theory

Uniform Membership Learning of Decision Trees

• [KM]
– L₁(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)
– If S = {a : |f̂(a)| ≥ ε/L₁(f)} then Σ_{a∉S} f̂²(a) < ε
– [GL]: algorithm (membership oracle) for finding {a : |f̂(a)| ≥ θ} in time ~n/θ⁶ (sketched below)
– So can efficiently uniform membership learn DT
– Output h has the same form as LMN: h ≡ sign(Σ_{a∈S} f̃(a)·χ_a)
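Below is a minimal sketch of the [KM]/[GL]-style coefficient search with a membership oracle, under simplifying assumptions (a fixed sample count per weight estimate, no failure-probability analysis); the function names are illustrative.

```python
import random

def km_weight(f, n, prefix, samples=2000, rng=random):
    """Estimate W(prefix) = Σ_b f̂(prefix∘b)², the Fourier weight of all
    coefficients extending the 0/1 prefix. Uses the identity
    W(a) = E_{x,y,z}[f(x∘z)·f(y∘z)·χ_a(x)·χ_a(y)], where x, y range over
    {0,1}^{|a|} and z over the remaining coordinates. f is a membership
    oracle mapping a length-n 0/1 list to {-1, +1}."""
    k = len(prefix)
    total = 0
    for _ in range(samples):
        x = [rng.randrange(2) for _ in range(k)]
        y = [rng.randrange(2) for _ in range(k)]
        z = [rng.randrange(2) for _ in range(n - k)]
        cx = -1 if sum(a & b for a, b in zip(prefix, x)) % 2 else 1
        cy = -1 if sum(a & b for a, b in zip(prefix, y)) % 2 else 1
        total += f(x + z) * f(y + z) * cx * cy
    return total / samples

def km_heavy_coefficients(f, n, theta, prefix=()):
    """Recursively find all a with |f̂(a)| ≥ θ (with high probability):
    extend the prefix one bit at a time, pruning whenever the remaining
    Fourier weight drops below θ²/2. Only ~1/θ² branches can survive at
    each level, by Parseval."""
    if len(prefix) == n:
        return [prefix]
    found = []
    for bit in (0, 1):
        p = prefix + (bit,)
        if km_weight(f, n, list(p)) >= theta ** 2 / 2:
            found += km_heavy_coefficients(f, n, theta, p)
    return found
```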

Page 32: Harmonic Analysis in  Learning Theory

Uniform Membership Learning of DNF

• [J]
– ∀ distributions D, ∃ χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6·s_DNF)
– Modified [GL] can efficiently locate such a χ_a given an oracle for near-uniform D
• Boosters can provide such an oracle when uniform learning
– Boosting then provides strong learning

• [BJTb] (see also [KS]):
– Modified Levin algorithm finds χ_a in time ~ns²

Page 33: Harmonic Analysis in  Learning Theory

Uniform Learning from a Classification Noise Oracle

[Model diagram] The classification noise oracle EX_η(f) for the target f : {0,1}^n → {0,1} from class F (e.g., DNF) draws uniform random x and, with error rate η > 0, returns <x, f(x)> with probability 1-η and <x, -f(x)> with probability η. Given accuracy ε > 0, A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 34: Harmonic Analysis in  Learning Theory

Uniform Learning from a Statistical Query Oracle

[Model diagram] The statistical query oracle SQ(f) for the target f : {0,1}^n → {0,1} from class F (e.g., DNF) accepts a query (q(·,·), τ) from A and returns E_U[q(x, f(x))] ± τ. Given accuracy ε > 0, A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 35: Harmonic Analysis in  Learning Theory

SQ and Classification Noise Learning

• [K]
– If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1-2η))
– Empirically, it is almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ poly in the other parameters)
• Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}

Page 36: Harmonic Analysis in  Learning Theory

Uniform SQ Hardness for PAR

• [BFJKMR]
– Harmonic analysis shows that for any q, χ_a:

E_U[q(x, χ_a(x))] = q̂(0^{n+1}) + q̂(a∘1) (derived below)

– Thus the adversarial SQ response to (q, τ) is q̂(0^{n+1}) whenever |q̂(a∘1)| < τ
– Parseval: |q̂(b∘1)| < τ for all but at most 1/τ² Fourier coefficients
– So each 'bad' query eliminates only poly-many coefficients
– Even PAR_{log n} is not efficiently SQ learnable
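The identity can be checked in a few lines. Here is a sketch of the calculation, assuming the label bit is encoded as y ∈ {±1} (my assumption about the slide's convention; note q̂(0^n ∘ 0) is the same coefficient as q̂(0^{n+1})):

```latex
\begin{align*}
q(x,y) &= \sum_{b} \hat{q}(b \circ 0)\,\chi_b(x)
        + \sum_{b} \hat{q}(b \circ 1)\,\chi_b(x)\,y
        && \text{(Fourier expansion, label } y \in \{\pm 1\})\\
\mathbf{E}_U\!\left[q(x,\chi_a(x))\right]
       &= \sum_{b} \hat{q}(b \circ 0)\,\mathbf{E}[\chi_b(x)]
        + \sum_{b} \hat{q}(b \circ 1)\,\mathbf{E}[\chi_b(x)\,\chi_a(x)]\\
       &= \hat{q}(0^n \circ 0) + \hat{q}(a \circ 1)
        && \text{(orthonormality of the } \chi_b)
\end{align*}
```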

Page 37: Harmonic Analysis in  Learning Theory

Uniform Learning from an Attribute Noise Oracle

[Model diagram] The attribute noise oracle EX_{D_N}(f) for the target f : {0,1}^n → {0,1} from class F (e.g., DNF) draws uniform random x and a noise vector r~D_N, and returns <x⊕r, f(x)>. Given accuracy ε > 0 and noise model D_N, A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 38: Harmonic Analysis in  Learning Theory

Uniform Learning with Independent Attribute Noise

• [BJTa]:
– On noisy examples, the LMN algorithm produces estimates of f̂(a) · E_{r~D_N}[χ_a(r)] (corrected for in the sketch below)

• Example application
– Assume the noise process D_N is a product distribution: D_N(r) = ∏_i (p_i·r_i + (1-p_i)·(1-r_i))
– Assume p_i < 1/polylog n, 1/ε at most quasi-poly(n) (mild restrictions)
– Then modified LMN uniform learns attribute-noisy AC0 in quasi-poly time
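For product noise the attenuation factor factors across coordinates: E_{r~D_N}[χ_a(r)] = ∏_{i∈a} (1 - 2p_i), so the noisy coefficient estimate can be divided back out. A minimal sketch under that assumption (names are illustrative):

```python
def corrected_coefficient(noisy_examples, a, flip_probs):
    """Estimate f̂(a) from attribute-noisy examples <x⊕r, f(x)>, where
    bit i of r is 1 with probability flip_probs[i] (independent product
    noise). The raw estimate equals f̂(a)·∏_{i∈a}(1-2p_i), so divide by
    that attenuation factor to recover f̂(a). Here a is a tuple of
    coordinate indices and labels are in {-1, +1}."""
    m = len(noisy_examples)
    raw = sum(y * (-1) ** sum(x[i] for i in a) for x, y in noisy_examples) / m
    attenuation = 1.0
    for i in a:
        attenuation *= 1 - 2 * flip_probs[i]
    return raw / attenuation
```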

Page 39: Harmonic Analysis in  Learning Theory

Agnostic Learning Model

[Model diagram] The target f : {0,1}^n → {0,1} is now an arbitrary Boolean function. The example oracle EX(f) supplies uniform random examples <x, f(x)> to A, which outputs a hypothesis h : {0,1}^n → {0,1} with Pr_{x~U}[f(x) ≠ h(x)] minimized.

Page 40: Harmonic Analysis in  Learning Theory

Near-Agnostic Learning via LMN

• [KKM]:
– Let f be an arbitrary Boolean function
– Fix any set S of coefficient indices and fix ε
– Let g be any function s.t.:
• Σ_{a∉S} ĝ²(a) < ε, and
• Pr[f ≠ g] is minimized (call this η)
– Then for the h learned by LMN by estimating the coefficients of f over S:
• Pr[f ≠ h] < 4η + ε

Page 41: Harmonic Analysis in  Learning Theory

Average Case Uniform Learning Model

[Model diagram] As in the uniform learning model, except that the target f : {0,1}^n → {0,1} is itself drawn at random from a distribution D over the class F (e.g., DNF). The example oracle EX(f) supplies uniform random examples <x, f(x)> to A, which, given accuracy ε > 0, outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.

Page 42: Harmonic Analysis in  Learning Theory

Average Case Learning of DT

• [JSa]:
– D: uniform over complete, non-redundant log-depth DTs
– DT efficiently uniform learnable on average
– Output is a DT (proper learning)

Page 43: Harmonic Analysis in  Learning Theory

Average Case Learning of DT

• Technique
– [KM]: all Fourier coefficients of a DT with min depth d are rational with denominator 2^d (illustrated below)
– In an average-case tree, the coefficient f̂({i}) for at least one variable v_i has an odd numerator
• So log(denominator) is the min depth of the tree
– Try all variables at the root and find the depth of the child trees, choosing the root with the shallowest children
– Recurse on the child trees to choose their roots
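The denominator fact is easy to experiment with, since coefficients of a small tree can be computed exactly by brute force. A minimal sketch (for small n only; names and the example tree are illustrative):

```python
import itertools
from fractions import Fraction

def exact_coefficient(f, n, a):
    """Exact Fourier coefficient f̂(a) by enumerating all of {0,1}^n.
    f maps a 0/1 tuple to {-1, +1}; a is a 0/1 tuple. For a depth-d
    decision tree, the result is rational with denominator dividing 2^d."""
    total = Fraction(0)
    for x in itertools.product((0, 1), repeat=n):
        chi = -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1
        total += f(x) * chi
    return total / 2 ** n

# Example: a depth-2 tree ("if x0 then leaf(x1) else leaf(x2)", ±1 output).
tree = lambda x: (1 if x[1] else -1) if x[0] else (1 if x[2] else -1)
print(exact_coefficient(tree, 3, (0, 1, 0)))  # -1/2; denominator divides 4
```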

Page 44: Harmonic Analysis in  Learning Theory

Average Case Learning of DNF

• [JSb]:
– D: s terms, each term drawn uniformly from the terms of length log s
– Monotone DNF with < n² terms and DNF with < n^{1.5} terms are properly and efficiently uniform learnable on average

• Harmonic property
– In an average-case DNF, the sign of f̂({i,j}) (usually) indicates whether v_i and v_j appear in a common term or not

Page 45: Harmonic Analysis in  Learning Theory

Summary

• Most uniform-learning results depend on harmonic analysis

• Learning theory provides motivation for new harmonic observations

• Even very “weak” harmonic results can be useful in learning-theory algorithms

Page 46: Harmonic Analysis in  Learning Theory

Some Open Problems

• Efficient uniform learning of monotone DNF
– Best to date for small s_DNF is [S], time ~ n·s^{log s} (based on [BT], [M], [LMN])

• Non-uniform learning
– Relatively easy to extend many results to product distributions, e.g. [FJS] extends [LMN]
– Key issue in real-world applicability

Page 47: Harmonic Analysis in  Learning Theory

Open Problems (cont’d)

• Weaker dependence on ε
– Several algorithms are fully exponential (or worse) in 1/ε

• Additional proper learning results
– Proper hypotheses allow for interpretation of the learned hypothesis