Post on 03-Jan-2016
Harmonic Analysis in Learning Theory
Jeff Jackson
Duquesne University
Themes
• Harmonic analysis is central to learning-theoretic results in a wide variety of models
  – Results are generally the strongest known for learning with respect to the uniform distribution
• Work on learning problems has led to some new harmonic results
  – Spectral properties of Boolean function classes
  – Algorithms for approximating Boolean functions
Uniform Learning Model
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Example oracle EX(f): returns uniform random examples <x, f(x)>
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
Circuit Classes
• Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC): d levels of gates over inputs v1, v2, …, vn, with negations allowed
• DNF: depth-2 circuit with OR at root
Decision Trees
[Figure: a decision tree over variables v1–v4 with root v3; each internal node branches on a variable (0 or 1) and each leaf is labeled 0 or 1.]
• Example: on input x = 11001, evaluation follows x3 = 0, then x1 = 1, reaching a 1-leaf, so f(x) = 1
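The evaluation walk above can be sketched in Python. The slide's exact tree is not recoverable from the transcript, so the tree below is a hypothetical one chosen only to be consistent with the example (root tests v3; x = 11001 reaches a 1-leaf via x3 = 0, then x1 = 1):

```python
# A decision tree as nested tuples: a leaf is 0 or 1; an internal node is
# (i, t0, t1), testing the 0-indexed variable x[i] and descending into t0 or t1.
def dt_eval(tree, x):
    while isinstance(tree, tuple):
        i, t0, t1 = tree
        tree = t1 if x[i] else t0
    return tree

# Hypothetical tree consistent with the slide's walk: root tests x3 (index 2);
# on x3 = 0 it tests x1 (index 0); v2 and v4 appear on the other branch.
tree = (2, (0, 0, 1), (1, (3, 0, 1), 1))

x = [1, 1, 0, 0, 1]      # the slide's example x = 11001
print(dt_eval(tree, x))  # -> 1, matching f(x) = 1 on the slide
```

The number of leaves of such a tuple tree is the size measure s_DT used in the next slide.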
Function Size
• Each function representation has a natural size measure:
  – CDC, DNF: # of gates
  – DT: # of leaves
• The size s_F(f) of f with respect to class F is the size of the smallest representation of f within F
  – For all Boolean f: s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)
Efficient Uniform Learning Model
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Example oracle EX(f): returns uniform random examples <x, f(x)>
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
• Time: poly(n, s_F, 1/ε)
Harmonic-Based Uniform Learning
• [LMN]: constant-depth circuits are quasi-efficiently (n^{polylog(s/ε)}-time) uniform learnable
• [BT]: monotone Boolean functions are uniform learnable in time roughly 2^{√n · log n}
  – Monotone: for all x, i: f(x|x_i=0) ≤ f(x|x_i=1)
  – Also exponential in 1/ε (so assumes ε constant)
  – But independent of any size measure
Notation
• Assume f : {0,1}^n → {-1,1}
• For all a in {0,1}^n, χ_a(x) ≡ (-1)^(a·x)
• For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is:
  f̂(a) ≡ E_{x~U}[f(x) · χ_a(x)]
• Sometimes write, e.g., f̂({1}) for f̂(10…0)
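The definition of f̂(a) is directly a Monte Carlo estimator; a minimal Python sketch (not from the talk, function names mine):

```python
import random

def chi(a, x):
    """The character chi_a(x) = (-1)^(a . x) over bit vectors."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def estimate_coefficient(f, a, n, samples=2000, rng=None):
    """Monte Carlo estimate of f^(a) = E_{x~U}[f(x) * chi_a(x)];
    f must map {0,1}^n (given as a list of bits) into {-1, 1}."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        total += f(x) * chi(a, x)
    return total / samples

# Sanity check: the coefficient of a parity at its own index is exactly 1.
parity = lambda x: chi((1, 0, 1), x)
print(estimate_coefficient(parity, (1, 0, 1), 3))  # -> 1.0
```

By Chernoff bounds, O(log(1/δ)/λ²) samples suffice to estimate a coefficient to within ±λ with probability 1-δ, which is what the sample-size claims on the later slides rely on.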
Fourier Properties of Classes
Σ_{a∉S} f̂²(a) < ε if:
• [LMN]: f is a constant-depth circuit of depth d and S = { a : |a| < log^d(s/ε) }  ( |a| ≡ # of 1's in a )
• [BT]: f is a monotone Boolean function and S = { a : |a| < √n / ε }
Spectral Properties: Proof Techniques
• [LMN]: Håstad's Switching Lemma + harmonic analysis
• [BT]: based on [KKL]
  – Define the average sensitivity AS(f) ≡ n · Pr_{x,i}[f(x|x_i=0) ≠ f(x|x_i=1)]
  – If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
  – For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n
  – Note: this is tight for MAJ
Function Approximation
• For all Boolean f:
  f = Σ_{a∈{0,1}^n} f̂(a) χ_a
• For S ⊆ {0,1}^n, define
  f_S ≡ Σ_{a∈S} f̂(a) χ_a
• [LMN]:
  Pr_{x~U}[f(x) ≠ sign(f_S(x))] ≤ Σ_{a∉S} f̂²(a)
“The” Fourier Learning Algorithm
• Given: ε (and perhaps s, d)
• Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε
• Draw a sufficiently large sample of examples <x, f(x)> to closely estimate f̂(a) for all a∈S
  – Chernoff bounds: ~n^k/ε sample size sufficient
• Output h ≡ sign(Σ_{a∈S} f̃(a) χ_a), where f̃(a) is the estimate of f̂(a)
• Run time ~ n^{2k}/ε
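The algorithm above can be sketched in Python (my own illustrative code, using brute-force enumeration of S = {a : |a| ≤ k} rather than any optimized version):

```python
import itertools
import random

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def low_degree_learn(f, n, k, samples=3000, rng=None):
    """LMN-style low-degree algorithm: estimate every Fourier coefficient
    of degree <= k from one sample of uniform examples, and return the sign
    of the resulting low-degree approximation as the hypothesis."""
    rng = rng or random.Random(1)
    xs = [[rng.randint(0, 1) for _ in range(n)] for _ in range(samples)]
    ys = [f(x) for x in xs]
    coeffs = {}
    for weight in range(k + 1):
        for ones in itertools.combinations(range(n), weight):
            a = tuple(1 if i in ones else 0 for i in range(n))
            coeffs[a] = sum(y * chi(a, x) for x, y in zip(xs, ys)) / samples
    return lambda x: 1 if sum(c * chi(a, x) for a, c in coeffs.items()) >= 0 else -1

# Example: MAJ on 3 bits is sign-represented by its degree-1 coefficients,
# so even k = 1 recovers it exactly.
maj3 = lambda x: 1 if sum(x) >= 2 else -1
h = low_degree_learn(maj3, 3, 1)
print(all(h(list(x)) == maj3(list(x)) for x in itertools.product((0, 1), repeat=3)))  # -> True
```

Enumerating all |a| ≤ k is where the n^k factor in the sample size and the n^{2k} factor in the run time come from.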
Halfspaces
• [KOS]: halfspaces are efficiently uniform learnable (given ε is constant)
  – Halfspace: w ∈ R^{n+1} s.t. f(x) = sign(w · (x º 1))
  – If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε
  – Apply the LMN algorithm
• A similar result applies for an arbitrary function applied to a constant number of halfspaces
  – Intersection of halfspaces is a key learning problem
Halfspace Techniques
• [O] (cf. [BKS], [BJTa]):
  – The noise sensitivity of f at γ is the probability that corrupting each bit of x independently with probability γ changes f(x)
  – NS_γ(f) = ½(1 - Σ_a (1-2γ)^{|a|} f̂²(a))
• [KOS]:
  – If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3 NS_γ(f)
  – If f is a halfspace then NS_ε(f) < 9√ε
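Noise sensitivity as defined above is easy to estimate empirically; a minimal Python sketch (my own code, names mine):

```python
import random

def noise_sensitivity(f, n, gamma, trials=4000, rng=None):
    """Estimate NS_gamma(f): draw x uniformly, flip each bit of x
    independently with probability gamma, and report how often f changes."""
    rng = rng or random.Random(2)
    disagree = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = [xi ^ (rng.random() < gamma) for xi in x]
        disagree += f(x) != f(y)
    return disagree / trials

# A dictator function f(x) = x_1 has NS_gamma = gamma exactly, since only
# its own bit matters; the estimate should land close to gamma.
dictator = lambda x: x[0]
print(noise_sensitivity(dictator, 5, 0.1))
```

For a halfspace such as majority, the estimate stays O(√γ), in line with the [KOS] bound above.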
Monotone DT
• [OS]: monotone functions are efficiently learnable given:
  – ε is constant
  – s_DT(f) is used as the size measure
• Techniques:
  – Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))
  – [BT]: if S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
  – Friedgut: ∃ T, |T| ≤ 2^{AS(f)/ε}, s.t. Σ_{A∉T} f̂²(A) < ε
Weak Approximators
• [KKL] also show that if f is monotone, there is an i such that -f̂({i}) ≥ log²n / n
• Therefore Pr[f(x) = -χ_{i}(x)] ≥ ½ + log²n / 2n
• In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n, s) is called a weak approximator to f
• If A outputs a weak approximator for every f in F, then F is weakly learnable
Weak Uniform Learning Model
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Example oracle EX(f): returns uniform random examples <x, f(x)>
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ½ - 1/p(n, s)
Efficient Weak Learning Algorithm for Monotone Boolean Functions
• Draw a set of ~n² examples <x, f(x)>
• For i = 1 to n:
  – Estimate f̂({i})
• Output h ≡ -χ_{i*} for i* = argmax_i (-f̃({i}))
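This weak learner can be sketched in Python (my own illustrative code): estimate every degree-1 coefficient from one sample and predict with the single best anti-correlated character.

```python
import random

def chi_i(i, x):
    """chi_{{i}}(x) = (-1)^(x_i)."""
    return -1 if x[i] else 1

def weak_learn_monotone(f, n, samples=5000, rng=None):
    """Estimate f^({i}) for every i and output h = -chi_{i*} for the i*
    maximizing -f^({i}); by [KKL], for monotone f : {0,1}^n -> {-1,1} this
    h agrees with f on at least 1/2 + log^2(n)/2n of the inputs."""
    rng = rng or random.Random(3)
    xs = [[rng.randint(0, 1) for _ in range(n)] for _ in range(samples)]
    ys = [f(x) for x in xs]
    ests = [sum(y * chi_i(i, x) for x, y in zip(xs, ys)) / samples
            for i in range(n)]
    i_star = max(range(n), key=lambda i: -ests[i])
    return lambda x: -chi_i(i_star, x)

# Example: majority on 5 bits (monotone); any single variable predicts it
# noticeably better than random guessing (agreement 11/16 in expectation).
maj5 = lambda x: 1 if sum(x) >= 3 else -1
h = weak_learn_monotone(maj5, 5)
```

The ~n² sample size on the slide is what Chernoff bounds need to resolve coefficients of magnitude log²n/n; the sketch just fixes a generous constant.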
Weak Approximation for MAJ of Constant-Depth Circuits
• Note that adding a single MAJ gate to a CDC destroys the LMN spectral property
• [JKS]: MAJ of CDCs is quasi-efficiently quasi-weakly uniform learnable
  – If f is a MAJ of CDCs of depth d, and if the number of gates in f is s, then there is an A ∈ {0,1}^n such that:
    • |A| < log^d s ≡ k
    • Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4s·n^k)
Weak Learning Algorithm
• Compute k = log^d s
• Draw ~s·n^k examples <x, f(x)>
• Repeat over A with |A| < k:
  – Estimate f̂(A)
• Until an A is found s.t. f̃(A) > 1/(2s·n^k)
• Output h ≡ χ_A
• Run time ~ n^{polylog(s)}
Weak Approximator Proof Techniques
• "Discriminator Lemma" [HMPST]
  – Implies one of the CDCs is a weak approximator to f
• LMN spectral characterization of CDC
• Harmonic analysis
• A result of Beigel is used to extend weak learning to CDC with polylog-many MAJ gates
Boosting
• In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [F], …)
  – Need to learn weakly with respect to near-uniform distributions
    • For near-uniform distribution D, find weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n, s)
  – Final h is typically a MAJ of the weak approximators
Strong Learning for MAJ of Constant-Depth Circuits
• [JKS]: MAJ of CDC is quasi-efficiently uniform learnable
  – Show that for near-uniform distributions, some parity function is a weak approximator
  – The Beigel result again extends this to CDC with polylog-many MAJ gates
• [KP] + boosting: there are distributions for which no parity is a weak approximator
Uniform Learning from a Membership Oracle
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Membership oracle MEM(f): on query x, returns f(x)
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
Uniform Membership Learning of Decision Trees
• [KM]:
  – L1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)
  – If S = {a : |f̂(a)| ≥ ε/L1(f)} then Σ_{a∉S} f̂²(a) < ε
  – [GL]: algorithm (membership oracle) for finding {a : |f̂(a)| ≥ θ} in time ~n/θ^6
  – So DT can be efficiently uniform membership learned
  – Output h has the same form as LMN: h ≡ sign(Σ_{a∈S} f̃(a) χ_a)
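A Python sketch of this style of coefficient search (a simplified Kushilevitz-Mansour/Goldreich-Levin recursion, my own illustrative code rather than the paper's exact algorithm): grow prefixes α of the index a, keeping a prefix only if the estimated weight W(α) = Σ_β f̂²(α∘β) stays above θ²/2, where W(α) is estimated with membership queries via W(α) = E_{y,y',z}[f(yz)·f(y'z)·χ_α(y)·χ_α(y')].

```python
import random

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def km_find_heavy(f, n, theta, samples=300, rng=None):
    """Return (with high probability) the set {a : |f^(a)| >= theta},
    querying f as a membership oracle.  By Parseval, at most O(1/theta^2)
    prefixes can survive any level of the recursion."""
    rng = rng or random.Random(4)

    def weight_est(alpha):
        # Estimates W(alpha) = sum over suffixes beta of f^(alpha.beta)^2.
        k, tot = len(alpha), 0
        for _ in range(samples):
            y1 = [rng.randint(0, 1) for _ in range(k)]
            y2 = [rng.randint(0, 1) for _ in range(k)]
            z = [rng.randint(0, 1) for _ in range(n - k)]
            tot += f(y1 + z) * f(y2 + z) * chi(alpha, y1) * chi(alpha, y2)
        return tot / samples

    prefixes = [[]]
    for _ in range(n):
        prefixes = [alpha + [b] for alpha in prefixes for b in (0, 1)
                    if weight_est(alpha + [b]) >= theta * theta / 2]
    return [tuple(alpha) for alpha in prefixes]

# Example: for a pure parity f = chi_a, the only heavy coefficient is a.
a = (1, 0, 1, 1)
print(km_find_heavy(lambda x: chi(a, x), 4, 0.9))  # -> [(1, 0, 1, 1)]
```

Note how the nested structure of W(α) is what makes membership queries essential: estimating it requires evaluating f on chosen, correlated inputs, not just on uniform random examples.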
Uniform Membership Learning of DNF
• [J]:
  – ∀ distributions D, ∃ χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6·s_DNF)
  – A modified [GL] can efficiently locate such a χ_a given an oracle for near-uniform D
    • Boosters can provide such an oracle when uniform learning
  – Boosting then provides strong learning
• [BJTb] (see also [KS]):
  – A modified Levin algorithm finds χ_a in time ~n·s²
Uniform Learning from a Classification Noise Oracle
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Classification noise oracle EX_η(f): for uniform random x, returns <x, f(x)> with probability 1-η and <x, ¬f(x)> with probability η
• Error rate η > 0
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
Uniform Learning from a Statistical Query Oracle
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Statistical query oracle SQ(f): given a query (q(·,·), τ), returns E_U[q(x, f(x))] ± τ
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
SQ and Classification Noise Learning
• [K]:
  – If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1-2η))
  – Empirically, it is almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ is poly in the other parameters)
• Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}
Uniform SQ Hardness for PAR
• [BFJKMR]:
  – Harmonic analysis shows that for any q, χ_a:
    E_U[q(x, χ_a(x))] = q̂(0^{n+1}) + q̂(a º 1)
  – Thus the adversarial SQ response to (q, τ) is q̂(0^{n+1}) whenever |q̂(a º 1)| < τ
  – Parseval: |q̂(b º 1)| < τ for all but 1/τ² Fourier coefficients
  – So a 'bad' query eliminates only polynomially many coefficients
  – Even PAR_{log n} is not efficiently SQ learnable
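The decomposition E_U[q(x, χ_a(x))] = q̂(0^{n+1}) + q̂(a º 1) can be checked numerically for small n; a Python sketch (my own check, identifying the {-1,1} label χ_a(x) with the bit a·x mod 2 in q's last coordinate):

```python
import itertools
import random

def chi(c, w):
    return -1 if sum(ci & wi for ci, wi in zip(c, w)) % 2 else 1

n = 4
rng = random.Random(5)
# A random real-valued query q on {0,1}^(n+1); the last bit is the label bit.
q = {w: rng.uniform(-1, 1) for w in itertools.product((0, 1), repeat=n + 1)}
qhat = lambda c: sum(v * chi(c, w) for w, v in q.items()) / len(q)

a = (1, 0, 1, 1)
label = lambda x: sum(ai & xi for ai, xi in zip(a, x)) % 2  # chi_a(x) as a bit
lhs = sum(q[x + (label(x),)]
          for x in itertools.product((0, 1), repeat=n)) / 2 ** n
rhs = qhat((0,) * (n + 1)) + qhat(a + (1,))
print(abs(lhs - rhs) < 1e-9)  # -> True
```

Only these two coefficients of q survive the expectation over x, which is exactly why one query reveals so little about which parity is the target.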
Uniform Learning from an Attribute Noise Oracle
• Boolean function class F (e.g., DNF)
• Target function f : {0,1}^n → {0,1}
• Attribute noise oracle EX_DN(f): for uniform random x, returns <x⊕r, f(x)> with r ~ DN
• Noise model DN
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
Uniform Learning with Independent Attribute Noise
• [BJTa]:
  – The LMN algorithm produces estimates of f̂(a) · E_{r~DN}[χ_a(r)]
• Example application:
  – Assume the noise process DN is a product distribution:
    • DN(x) = ∏_i (p_i·x_i + (1-p_i)(1-x_i))
  – Assume p_i < 1/polylog n, and 1/ε at most quasi-poly(n) (mild restrictions)
  – Then a modified LMN uniform learns attribute-noisy AC0 in quasi-poly time
Agnostic Learning Model
• Arbitrary Boolean target function f : {0,1}^n → {0,1}
• Example oracle EX(f): returns uniform random examples <x, f(x)>
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] is minimized
Near-Agnostic Learning via LMN
• [KKM]:
  – Let f be an arbitrary Boolean function
  – Fix any set S ⊆ {1..n} and fix ε
  – Let g be any function s.t.:
    • Σ_{a∉S} ĝ²(a) < ε, and
    • Pr[f ≠ g] is minimized (call this η)
  – Then for h learned by LMN by estimating the coefficients of f over S:
    • Pr[f ≠ h] < 4η + ε
Average Case Uniform Learning Model
• Boolean function class F (e.g., DNF)
• D-random target function f : {0,1}^n → {0,1} (f drawn from a distribution D over F)
• Example oracle EX(f): returns uniform random examples <x, f(x)>
• Accuracy ε > 0
• Learning algorithm A outputs hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
Average Case Learning of DT
• [JSa]:
  – D: uniform over complete, non-redundant log-depth DTs
  – DT is efficiently uniform learnable on average
  – Output is a DT (proper learning)
Average Case Learning of DT
• Technique:
  – [KM]: all Fourier coefficients of a DT with min depth d are rational with denominator 2^d
  – In an average-case tree, the coefficient f̂({i}) for at least one variable v_i has an odd numerator
    • So log(denominator) is the min depth of the tree
  – Try all variables at the root and find the depth of the child trees, choosing the root with the shallowest children
  – Recurse on the child trees to choose their roots
Average Case Learning of DNF
• [JSb]:
  – D: s terms, each term uniform from the terms of length log s
  – Monotone DNF with < n² terms, and DNF with < n^1.5 terms, are properly and efficiently uniform learnable on average
• Harmonic property:
  – In an average-case DNF, the sign of f̂({i,j}) (usually) indicates whether v_i and v_j appear in a common term
Summary
• Most uniform-learning results depend on harmonic analysis
• Learning theory provides motivation for new harmonic observations
• Even very “weak” harmonic results can be useful in learning-theory algorithms
Some Open Problems
• Efficient uniform learning of monotone DNF
  – Best to date for small s_DNF is [S], time ~n·s^{log s} (based on [BT], [M], [LMN])
• Non-uniform learning
  – Relatively easy to extend many results to product distributions, e.g., [FJS] extends [LMN]
  – Key issue in real-world applicability
Open Problems (cont’d)
• Weaker dependence on ε
  – Several algorithms are fully exponential (or worse) in 1/ε
• Additional proper learning results
  – Proper learning allows for interpretation of the learned hypothesis