Shahar Slides on Machine Learning


TRANSCRIPT

  • Slide 1/32: A Geometric Approach to Statistical Learning Theory

    Shahar Mendelson

    Centre for Mathematics and its Applications, The Australian National University

    Canberra, Australia

  • Slide 2/32: What is a learning problem?

    A class of functions $F$ on a probability space $(\Omega,\mu)$.
    A random variable $Y$ one wishes to estimate.
    A loss functional $\ell$.
    The information we have: a sample $(X_i,Y_i)_{i=1}^n$.
    Our goal: with high probability, find a good approximation to $Y$ in $F$ with respect to $\ell$, that is,
    find $f \in F$ such that $\mathbb{E}\,\ell(f(X),Y)$ is almost optimal.
    $f$ is selected according to the sample $(X_i,Y_i)_{i=1}^n$.

  • Slide 3/32: Example

    Consider the following setting:
    The random variable $Y$ is given by a fixed (real-valued) function $T$ on $[0,1]$, that is, $Y_i = T(X_i)$.
    The loss functional is $\ell(u,v) = (u-v)^2$.
    Hence, the goal is to find some $f \in F$ for which
    $$\mathbb{E}\,\ell(f(X),T(X)) = \mathbb{E}(f(X)-T(X))^2 = \int (f(t)-T(t))^2\,d\mu(t)$$
    is as small as possible. To select $f$ we use the sample $X_1,\dots,X_n$ and the values $T(X_1),\dots,T(X_n)$.
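    A minimal numerical sketch of this example. The particular target $T$, the class $F$ of constant functions, and the uniform measure on $[0,1]$ are illustrative assumptions, not choices made on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: T(x) = x^2 and mu = Uniform[0, 1].
def T(x):
    return x ** 2

# A toy class F of constant functions f_c(x) = c, for c on a grid.
cs = np.linspace(0.0, 1.0, 101)

# Population loss E(f(X) - T(X))^2 = \int_0^1 (c - t^2)^2 dt,
# estimated by Monte Carlo integration over mu.
x = rng.uniform(0.0, 1.0, size=200_000)
population_loss = np.array([np.mean((c - T(x)) ** 2) for c in cs])

best = cs[np.argmin(population_loss)]
print(f"best constant in F: {best:.2f}, population loss: {population_loss.min():.4f}")
```

    For this choice the population minimizer over constants is $c = \mathbb{E}X^2 = 1/3$, with loss $\operatorname{var}(X^2) = 4/45 \approx 0.089$, which the Monte Carlo estimate recovers.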


  • Slide 4/32: A variation

    Consider the following excess loss:
    $$\mathcal{L}_f(x) = (f(x)-T(x))^2 - (f^*(x)-T(x))^2 = \ell_f(x) - \ell_{f^*}(x),$$
    where $f^*$ minimizes $\mathbb{E}\,\ell_f(X) = \mathbb{E}(f(X)-T(X))^2$ in the class.
    The difference between the two cases: [figure on the original slide]


  • Slide 5/32: Our Goal

    Given a fixed sample size, how close to the optimal can one get using empirical data?
    How does the specific choice of the loss $\ell$ influence the estimate?
    What parameters of the class $F$ are important?
    Although one has access to a random sample, the measure $\mu$ which generates the data is not known.


  • Slide 6/32: The algorithm

    Given a sample $(X_1,\dots,X_n)$, select $\hat f \in F$ which satisfies
    $$\hat f \in \operatorname{argmin}_{f \in F}\ \frac{1}{n}\sum_{i=1}^n \ell_f(X_i),$$
    that is, $\hat f$ is the best function in the class on the data.
    The hope is that, with high probability, $\mathbb{E}(\ell_{\hat f}\mid X_1,\dots,X_n) = \int \ell_{\hat f}(t)\,d\mu(t)$ is close to the optimal.
    In other words, hopefully, with high probability, the empirical minimizer of the loss is almost the best function in the class with respect to the loss.
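    A minimal sketch of this empirical minimization over a finite class with the squared loss. The target, the class of "stump" functions, and the sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def T(x):                      # assumed target (not from the slides)
    return np.sin(2 * np.pi * x)

# Finite class F of "stumps": f_{t,a,b}(x) = a if x < t else b, on a grid.
thresholds = np.linspace(0.05, 0.95, 19)
levels = np.linspace(-1.0, 1.0, 9)
F = [(t, a, b) for t in thresholds for a in levels for b in levels]

def f_eval(params, x):
    t, a, b = params
    return np.where(x < t, a, b)

# Sample (X_i, Y_i) with Y_i = T(X_i).
n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = T(X)

# Empirical minimizer: argmin_f (1/n) sum_i (f(X_i) - Y_i)^2.
emp_loss = np.array([np.mean((f_eval(p, X) - Y) ** 2) for p in F])
f_hat = F[int(np.argmin(emp_loss))]

# Estimate the population loss of f_hat on a fresh large sample.
X_test = rng.uniform(0.0, 1.0, size=100_000)
print("empirical loss of f_hat :", emp_loss.min())
print("population loss of f_hat:",
      np.mean((f_eval(f_hat, X_test) - T(X_test)) ** 2))
```

    The gap between the empirical loss of $\hat f$ and its population loss is exactly the object studied in the remaining slides.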


  • Slide 7/32: Back to the squared loss

    In the case of the squared excess loss,
    $$\mathcal{L}_f(x) = (f(x)-T(x))^2 - (f^*(x)-T(x))^2,$$
    since the second term is the same for every $f \in F$, empirical minimization selects
    $$\operatorname{argmin}_{f\in F}\ \sum_{i=1}^n (f(X_i)-T(X_i))^2,$$
    and the question is how to relate this empirical distance to the real distance we are interested in.


  • Slide 8/32: Analyzing the algorithm

    For a second, let's forget the loss and, from here on, to simplify notation, denote by $G$ the loss class.
    We shall attempt to connect $\frac{1}{n}\sum_{i=1}^n g(X_i)$ (i.e. the random, empirical structure on $G$) to $\mathbb{E}g$.
    We shall examine various notions of similarity of the structures.
    Note: in the case of an excess loss, $0 \in G$ and our aim is to be as close to $0$ as possible. Otherwise, our aim is to approach $\inf_{g \in G}\mathbb{E}g$, which need not be $0$.


  • Slide 9/32: A road map

    Asking the correct question - beware of loose methods of attack.
    Properties of the loss and their significance.
    Estimating the complexity of a class.


  • Slide 10/32: A little history

    Originally, the study of $\{0,1\}$-valued classes (e.g. Perceptrons) used the uniform law of large numbers, i.e. the probability
    $$\Pr\Bigl(\sup_{g\in G}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \ge \varepsilon\Bigr),$$
    which is a uniform measure of similarity.
    If this probability is small, then for every $g \in G$ the empirical structure is close to the real one. In particular, this is true for the empirical minimizer $\hat g$, and thus, on the good event,
    $$\mathbb{E}\hat g \le \frac{1}{n}\sum_{i=1}^n \hat g(X_i) + \varepsilon.$$
    In a minute: this approach is suboptimal!
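    For a concrete feel for the quantity controlled above, here is a small simulation of the uniform deviation over a finite class; the class of threshold indicators and the uniform distribution of $X$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small finite {0,1}-valued class: indicators g_t(x) = 1[x <= t] on a grid
# (an illustrative stand-in for a perceptron-type class), with X ~ Uniform[0,1],
# so that Eg_t = t exactly.
ts = np.linspace(0.0, 1.0, 51)

def uniform_deviation(n):
    x = rng.uniform(0.0, 1.0, size=n)
    emp = np.array([np.mean(x <= t) for t in ts])   # (1/n) sum_i g_t(X_i)
    return np.max(np.abs(emp - ts))                 # sup_g |P_n g - P g|

for n in (100, 1_000, 10_000):
    devs = [uniform_deviation(n) for _ in range(200)]
    print(f"n={n:6d}  mean sup-deviation: {np.mean(devs):.4f}"
          f"  (compare 1/sqrt(n) = {1/np.sqrt(n):.4f})")
```

    The supremum shrinks at roughly the $1/\sqrt{n}$ rate, simultaneously for every function in the class; the next slides argue that this is more control than one actually needs.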

  • Slide 11/32: Why is this bad?

    Consider the excess loss case.

    We hope that the algorithm will get us close to 0...

    So, it seems likely that we would only need to control the part of $G$ which is not too far from $0$.
    There is no need to control functions which are far away, while in the ULLN we control every function in $G$.


  • Slide 12/32: Why would this lead to a better bound?

    Well, first, the set is smaller...

    More important:

    functions close to $0$ in expectation are likely to have a small variance (under mild assumptions)...

    On the other hand, because of the CLT, for every fixed function $g \in G$ and $n$ large enough, with probability $1/2$,
    $$\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \ge c\,\sqrt{\frac{\operatorname{var}(g)}{n}}$$
    for a suitable absolute constant $c$.
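    A quick simulation of this CLT floor for a single fixed function with nonzero variance; the Bernoulli model for $g(X)$ is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# A single fixed {0,1}-valued function with Eg = 0.3 and var(g) = 0.21
# (an illustrative choice): g(X) ~ Bernoulli(0.3).
p, n, trials = 0.3, 1_000, 5_000
samples = rng.binomial(1, p, size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - p)

# Fraction of trials on which the deviation exceeds c * sqrt(var/n), with c = 1/2.
c = 0.5
threshold = c * np.sqrt(p * (1 - p) / n)
print("median deviation            :", np.median(deviations))
print("sqrt(var/n)                 :", np.sqrt(p * (1 - p) / n))
print("Pr[deviation >= c*sqrt(var/n)] approx.", np.mean(deviations >= threshold))
```
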


  • Slide 13/32: So?

    Control over the entire class = control over functions with nonzero variance $\Rightarrow$ the rate of convergence can't be better than $c/\sqrt{n}$.
    Even if $g^* = 0$ (the realizable case), we can't hope to get a faster rate than $c/\sqrt{n}$ using this method.
    This shows the statistical limitation of the loss.


  • Slide 14/32: What does this tell us about the loss?

    To get faster convergence rates one has to consider the excess loss.
    We also need a condition implying that if the expectation is small, then the variance is small (e.g. $\mathbb{E}\mathcal{L}_f^2 \le B\,\mathbb{E}\mathcal{L}_f$ - a Bernstein condition).
    It turns out that this condition is connected to convexity properties of $\ell$ at $0$.
    One has to connect the richness of $G$ to that of $F$ (which follows from a Lipschitz condition on $\ell$).


  • Slide 15/32: Localization - Excess loss

    There are several ways to localize.

    It is enough to bound
    $$\Pr\Bigl(\exists g \in G:\ \frac{1}{n}\sum_{i=1}^n g(X_i) \le \varepsilon \ \text{ and }\ \mathbb{E}g \ge 2\varepsilon\Bigr).$$
    This event upper bounds the probability that the algorithm fails: if this probability is small, then since $\frac{1}{n}\sum_{i=1}^n \hat g(X_i) \le \varepsilon$ (recall that $0 \in G$ in the excess loss case, so the empirical minimum is at most $0 \le \varepsilon$), it follows that $\mathbb{E}\hat g \le 2\varepsilon$.
    Another (similar) option is a relative bound, controlling the normalized deviations
    $$\sup_{g\in G}\ \frac{n^{-1/2}\sum_{i=1}^n \bigl(g(X_i)-\mathbb{E}g\bigr)}{\sqrt{\operatorname{var}(g)}}.$$


  • Slide 16/32: Comparing structures

    Suppose that one could find $r_n$ for which, with high probability, for every $g \in G$ with $\mathbb{E}g \ge r_n$,
    $$\frac{1}{2}\,\mathbb{E}g \le \frac{1}{n}\sum_{i=1}^n g(X_i) \le \frac{3}{2}\,\mathbb{E}g$$
    (here, $1/2$ and $3/2$ can be replaced by $1-\varepsilon$ and $1+\varepsilon$).
    Then if $\hat g$ was produced by the algorithm it can either
    have a large expectation, $\mathbb{E}\hat g \ge r_n$ $\Rightarrow$ the structures are similar and thus $\mathbb{E}\hat g \le \frac{2}{n}\sum_{i=1}^n \hat g(X_i)$,
    or
    have a small expectation $\Rightarrow$ $\mathbb{E}\hat g \le r_n$.


  • Slide 17/32: Comparing structures II

    Thus, with high probability,
    $$\mathbb{E}\hat g \le \max\Bigl\{r_n,\ \frac{2}{n}\sum_{i=1}^n \hat g(X_i)\Bigr\}.$$
    This result is based on a ratio limit theorem, because we would like to show that
    $$\sup_{g\in G,\ \mathbb{E}g \ge r_n}\Bigl|\frac{n^{-1}\sum_{i=1}^n g(X_i)}{\mathbb{E}g} - 1\Bigr| \le \varepsilon.$$
    This normalization is possible if $\mathbb{E}g^2$ can be bounded using $\mathbb{E}g$ (which is a property of the loss). Otherwise, one needs a slightly different localization.


  • Slide 18/32: Star shape

    If $G$ is star-shaped, its relative richness increases as $r$ becomes smaller. (Recall that $G$ is star-shaped around $0$ if $\lambda g \in G$ for every $g \in G$ and every $\lambda \in [0,1]$.)


  • Slide 19/32: Why is this better?

    Thanks to the star-shape assumption, our aim is to find the smallest $r_n$ such that, with high probability,
    $$\sup_{g\in G,\ \mathbb{E}g = r_n}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \le \frac{r_n}{2}.$$
    This would imply that the error of the algorithm is at most $r_n$.
    For the non-localized result, to obtain the same error one needs to show that
    $$\sup_{g\in G}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \le r_n,$$
    where the supremum is over a much larger set, one which includes functions with a large variance.


  • Slide 20/32: BTW, even this is not optimal...

    It turns out that a structural approach (uniform or localized) does not give the best result that one could get on the error of the empirical minimization (EM) algorithm.
    A sharp bound follows from a direct analysis of the algorithm (under mild assumptions) and depends on the behavior of the (random) function
    $$\phi_n(r) = \sup_{g\in G,\ \mathbb{E}g=r}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr|.$$


  • Slide 21/32: Application of concentration

    Suppose that we can show that, with high probability, for every $r$,
    $$\sup_{g\in G,\ \mathbb{E}g=r}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \approx \mathbb{E}\sup_{g\in G,\ \mathbb{E}g=r}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| = \mathbb{E}\,\phi_n(r),$$
    i.e. that the expectation is a good estimate of the random variable. Then:
    The uniform estimate: the error is close to $\mathbb{E}\,\phi_n(1)$.
    The localized estimate: the error is close to the fixed point of $\mathbb{E}\,\phi_n(r) = r/2$.
    The direct analysis: ......
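    A small numerical sketch of locating such a fixed point. The specific shape assumed below for $\mathbb{E}\,\phi_n(r)$, namely $C\sqrt{r/n}$ (the kind of behavior a Bernstein-type condition $\mathbb{E}g^2 \le B\,\mathbb{E}g$ suggests), is an illustrative assumption, not a value taken from the slides:

```python
import numpy as np

# Assume, for illustration only, that the localized complexity behaves like
# E phi_n(r) = C * sqrt(r / n); this specific form is an assumption, not
# something stated on the slides.
def phi_n(r, n, C=2.0):
    return C * np.sqrt(r / n)

def fixed_point(n, C=2.0, r0=1.0, iters=200):
    # Solve phi_n(r) = r / 2 by simple fixed-point iteration r <- 2 * phi_n(r).
    r = r0
    for _ in range(iters):
        r = 2.0 * phi_n(r, n, C)
    return r

for n in (100, 1_000, 10_000):
    r_star = fixed_point(n)
    # Analytically, C*sqrt(r/n) = r/2 gives r* = 4*C^2/n.
    print(f"n={n:6d}  numerical fixed point: {r_star:.5f}"
          f"  analytic 4C^2/n: {4 * 2.0**2 / n:.5f}")
    # Compare with the uniform-scale quantity phi_n(1) ~ C/sqrt(n).
    print(f"          uniform-scale estimate phi_n(1): {phi_n(1.0, n):.5f}")
```

    Under this assumption the localized fixed point scales like $1/n$, while the uniform-scale quantity $\mathbb{E}\,\phi_n(1)$ only scales like $1/\sqrt{n}$, which is precisely the gain localization is after.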


  • Slide 22/32: A short summary

    Bounding the right quantity - one needs to understand
    $$\Pr\Bigl(\sup_{g\in A}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \ge t\Bigr).$$
    For an estimate on EM, $A \subset G$. The smaller we can take $A$ - the better the bound!
    Loss class vs. excess loss: being close to $0$ (hopefully) implies a small variance (a property of the loss).
    One has to connect the complexity of $A \subset G$ to the complexity of the subset of the base class $F$ that generated it.
    If one considers the excess loss (better statistical error), there is a question of the approximation error.


  • Slide 23/32: Estimating the complexity

    Concentration:
    $$\sup_{g\in A}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \approx \mathbb{E}\sup_{g\in A}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr|.$$
    Symmetrization:
    $$\mathbb{E}\sup_{g\in A}\Bigl|\frac{1}{n}\sum_{i=1}^n g(X_i) - \mathbb{E}g\Bigr| \le 2\,\mathbb{E}_X\mathbb{E}_\varepsilon\sup_{g\in A}\Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\, g(X_i)\Bigr| = 2\,R_n(A),$$
    where the $\varepsilon_i$ are independent signs with $\Pr(\varepsilon_i=1)=\Pr(\varepsilon_i=-1)=1/2$.
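    A minimal Monte Carlo sketch of the symmetrized quantity, here computed conditionally on a single sample (averaging over $X$ as well would mean repeating over fresh samples); the class of threshold indicators and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def rademacher_average(P_sigma_A, n_rounds=2_000):
    """Monte Carlo estimate of E_eps sup_{v in P_sigma_A} |(1/n) sum_i eps_i v_i|,
    where P_sigma_A is a (num_functions, n) array of coordinate projections."""
    m, n = P_sigma_A.shape
    eps = rng.choice([-1.0, 1.0], size=(n_rounds, n))   # Rademacher signs
    # correlations[r, j] = (1/n) sum_i eps[r, i] * P_sigma_A[j, i]
    correlations = eps @ P_sigma_A.T / n
    return np.mean(np.max(np.abs(correlations), axis=1))

# Illustrative class: thresholds g_t(x) = 1[x <= t] evaluated on one random sample.
n = 500
x = rng.uniform(0.0, 1.0, size=n)
ts = np.linspace(0.0, 1.0, 100)
P_sigma_A = np.array([(x <= t).astype(float) for t in ts])

print("estimated Rademacher average:", rademacher_average(P_sigma_A))
print("sqrt(1/n)                   :", np.sqrt(1.0 / n))
```
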


  • Slide 24/32: Estimating the Complexity II

    For $\sigma = (X_1,\dots,X_n)$, consider the coordinate projection $P_\sigma A = \{(g(X_1),\dots,g(X_n)) : g \in A\}$. Then
    $$R_n(A) = \mathbb{E}_X\,\mathbb{E}_\varepsilon\sup_{v\in P_\sigma A}\Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i v_i\Bigr|.$$
    For every (random) coordinate projection $V = P_\sigma A$, the complexity parameter $\mathbb{E}_\varepsilon\sup_{v\in P_\sigma A}\bigl|n^{-1}\sum_{i=1}^n \varepsilon_i v_i\bigr|$ measures the correlation of $V$ with random noise.
    The noise model: a random point $(\varepsilon_1,\dots,\varepsilon_n)$ of $\{-1,1\}^n$ (the $n$-dimensional combinatorial cube). $R_n(A)$ measures how well $V$ is correlated with this noise.


  • Slide 25/32: Example - A large set

    If the class of functions is bounded by $1$, then for any sample $(X_1,\dots,X_n)$, $P_\sigma A \subset [-1,1]^n$.
    For $V = [-1,1]^n$ (or $V = \{-1,1\}^n$),
    $$\frac{1}{n}\,\mathbb{E}_\varepsilon\sup_{v\in V}\ \sum_{i=1}^n \varepsilon_i v_i = 1.$$
    (If there are many large coordinate projections, $R_n$ does not converge to $0$ as $n \to \infty$!)
    Question: what subsets of $[-1,1]^n$ are "big" in the context of this complexity parameter?
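    A tiny check of this computation: for $V = \{-1,1\}^n$ the supremum is attained at $v = \varepsilon$, so the normalized value equals $1$ for every realization of the signs (the small $n$ below is only so that all $2^n$ vertices can be enumerated):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

# For V = {-1, 1}^n, sup_{v in V} sum_i eps_i * v_i is attained at v = eps,
# so (1/n) * sup equals 1 for every realization of the signs.
n = 12
V = np.array(list(itertools.product([-1, 1], repeat=n)))   # all 2^n vertices
for _ in range(3):
    eps = rng.choice([-1, 1], size=n)
    print(np.max(V @ eps) / n)   # prints 1.0 each time
```
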


  • Slide 26/32: Combinatorial dimensions

    Consider a class of $\{-1,1\}$-valued functions. Define the Vapnik-Chervonenkis dimension of $A$ by
    $$\mathrm{vc}(A) = \sup\bigl\{|\sigma| \ :\ P_\sigma A = \{-1,1\}^{|\sigma|}\bigr\}.$$
    In other words, $\mathrm{vc}(A)$ is the largest dimension of a coordinate projection of $A$ which is the entire (combinatorial) cube.
    There is a real-valued analog of the Vapnik-Chervonenkis dimension, called the combinatorial dimension.
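    A brute-force sketch of this definition for a finite class on a finite domain (the threshold class used as the example is an illustrative assumption):

```python
import itertools
import numpy as np

def vc_dimension(A):
    """Brute-force VC dimension of a finite class A of {-1,1}-valued functions,
    given as an array of shape (num_functions, num_points): the largest |sigma|
    such that the coordinate projection P_sigma A is the full cube {-1,1}^|sigma|."""
    num_points = A.shape[1]
    best = 0
    for size in range(1, num_points + 1):
        shattered_some = False
        for sigma in itertools.combinations(range(num_points), size):
            patterns = {tuple(row) for row in A[:, list(sigma)]}
            if len(patterns) == 2 ** size:        # projection is the whole cube
                best, shattered_some = size, True
                break
        if not shattered_some:
            break   # shattering is monotone: no larger set can be shattered either
    return best

# Illustrative class: sign-of-threshold functions on 6 domain points.
points = np.linspace(0.0, 1.0, 6)
A = np.array([np.where(points <= t, 1, -1) for t in np.linspace(-0.1, 1.1, 13)])
print("vc(A) =", vc_dimension(A))   # thresholds shatter only single points -> 1
```
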


  • Slide 27/32: Combinatorial dimensions II

    The combinatorial dimension: for every $\varepsilon$, it measures the largest dimension $|\sigma|$ of a cube of side length $\varepsilon$ that can be found in a coordinate projection $P_\sigma A$.


  • Slide 28/32: Some connection between the parameters

    If $\mathrm{vc}(A) \le d$ then $R_n(A) \le c\sqrt{d/n}$.
    If $\mathrm{vc}(A,\varepsilon)$ is the combinatorial dimension of $A$ at scale $\varepsilon$, then
    $$R_n(A) \le \frac{C}{\sqrt{n}}\int_0^1 \sqrt{\mathrm{vc}(A,\varepsilon)}\;d\varepsilon.$$
    Note: these bounds on $R_n$ are (again) not optimal and can be improved in various ways. For example:
    The bounds take into account the worst-case projection, not the average projection.
    The bounds do not take into account the diameter of $A$.
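    A quick numerical sanity check of the first bound for an illustrative class with $\mathrm{vc}(A) = 1$ (threshold indicators), reusing the Monte Carlo estimate of $R_n$ from the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

# Thresholds g_t(x) = 1[x <= t] have VC dimension 1; the bound above then
# predicts R_n(A) <= c * sqrt(1 / n) for an absolute constant c.
def estimate_Rn(n, n_thresholds=200, n_rounds=500):
    x = rng.uniform(0.0, 1.0, size=n)
    ts = np.linspace(0.0, 1.0, n_thresholds)
    P = np.array([(x <= t).astype(float) for t in ts])      # coordinate projection
    eps = rng.choice([-1.0, 1.0], size=(n_rounds, n))
    return np.mean(np.max(np.abs(eps @ P.T / n), axis=1))

for n in (100, 400, 1_600, 6_400):
    print(f"n={n:5d}  R_n estimate: {estimate_Rn(n):.4f}   sqrt(d/n): {np.sqrt(1/n):.4f}")
```

    The ratio of the two printed columns stays bounded by a small constant as $n$ grows, consistent with $R_n(A) \le c\sqrt{d/n}$.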

  • Slide 29/32: Important

    The VC dimension (combinatorial dimension) and other related complexity parameters (e.g. covering numbers) are only ways of upper bounding $R_n$.
    Sometimes such a bound is good, but at times it is not.
    Although the connections between the various complexity parameters are very interesting and nontrivial, for SLT it is always best to try and bound $R_n$ directly.
    Again, the difficulty of the learning problem is captured by the richness of a random coordinate projection of the loss class.


  • Slide 30/32: Example: Error rates for VC classes

    Let $F$ be a class of $\{0,1\}$-valued functions with $\mathrm{vc}(F) \le d$ and $T \in F$ (proper learning).
    $\ell$ is the squared loss and $G$ is the loss class. Note that $0 \in G$!
    $H = \mathrm{star}(G,0) = \{\lambda g \ :\ g \in G,\ \lambda \in [0,1]\}$ is the star-shaped hull of $G$.


  • Slide 31/32: Example: Error rates for VC classes II

    Then:

    Since $\ell_f(x) \ge 0$ and functions in $F$ are $\{0,1\}$-valued, $\mathbb{E}h^2 \le \mathbb{E}h$.
    The error rate is upper bounded by the fixed point of
    $$\mathbb{E}\sup_{h\in H,\ \mathbb{E}h = r}\Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i h(X_i)\Bigr| = R_n(H_r),$$
    i.e. the value of $r$ at which $R_n(H_r) = \frac{r}{2}$.
    The next step is to relate the complexity of $H_r$ to the complexity of $F$.


  • Slide 32/32: Bounding $R_n(H_r)$

    $F$ is small in the appropriate sense.
    $\ell$ is a Lipschitz function, and thus $G = \ell(F)$ is not much larger than $F$.

    The star-shaped hull of $G$ is not much larger than $G$.
    In particular, for every $n \ge d$, with probability larger than $1 - \bigl(\tfrac{ed}{n}\bigr)^{cd}$, if $\mathbb{E}_n \hat g \le \inf_{g\in G}\mathbb{E}_n g + \varepsilon$ (where $\mathbb{E}_n$ denotes the empirical mean), then
    $$\mathbb{E}\hat g \le c\,\max\Bigl\{\frac{d}{n}\log\frac{n}{ed},\ \varepsilon\Bigr\}.$$
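    One way to see how a bound of this shape can emerge from the fixed-point condition $R_n(H_r) = r/2$ of the previous slide is the following back-of-the-envelope calculation; the specific form assumed for $R_n(H_r)$ is an illustrative guess consistent with the final statement, not something taken from the slides. Suppose that
    $$R_n(H_r) \le C\sqrt{\frac{r\,d\log(n/(ed))}{n}}.$$
    Setting the right-hand side equal to $r/2$ and solving for $r$ gives
    $$C\sqrt{\frac{r^*\,d\log(n/(ed))}{n}} = \frac{r^*}{2} \quad\Longrightarrow\quad r^* = 4C^2\,\frac{d}{n}\log\frac{n}{ed},$$
    which matches the $\frac{d}{n}\log\frac{n}{ed}$ dependence of the displayed error bound.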
