8/11/2019 Learning theory2.rtf
1/39
Introduction to Statistical Learning Theory
Olivier Bousquet (1), Stéphane Boucheron (2), and Gábor Lugosi (3)

(1) Max-Planck Institute for Biological Cybernetics
Spemannstr. 38, D-72076 Tübingen, Germany
olivier.bousquet@m4x.org
home page: http://www.kyb.mpg.de/~bousquet
(2) Université de Paris-Sud, Laboratoire d'Informatique
Bâtiment 490, F-91405 Orsay Cedex, France
stephane.boucheron@lri.fr
home page: http://www.lri.fr/~boucheron
(3) Department of Economics, Pompeu Fabra University
Ramon Trias Fargas 25-27, Barcelona, Spain
Bousquet, Boucheron & Lugosi
1. Observe a phenomenon.
2. Construct a model of that phenomenon.
3. Make predictions using this model.

Of course, this definition is very general and could be taken more or less as the goal of Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process, which is the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or -1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the label of unseen instances.

Of course, given some training data, it is always possible to build a function that fits exactly the data. But, in the presence of noise, this may not be the best thing to do, as it would lead to poor performance on unseen instances (this is usually referred to as overfitting). The general idea behind the design of [...]

[...] Given the empirical risk R_n(g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i, and R_n(g_n) = 0), but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = -Y, so that R(g_n) = 1 (4). So one would have minimum empirical risk but maximum risk.
It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined): the first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for 'complicated' functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:

g_n = arg min_{g in G} R_n(g) .

Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible, while preventing overfitting.
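Empirical risk minimization over a finite model can be made concrete in a few lines. The threshold-classifier model and the toy data below are invented for the illustration; `erm` simply scans the model for a function with the smallest empirical risk:

```python
import random

def empirical_risk(g, data):
    """Fraction of examples that the classifier g mislabels."""
    return sum(1 for x, y in data if g(x) != y) / len(data)

def erm(model, data):
    """Return a function in the (finite) model minimizing the empirical risk."""
    return min(model, key=lambda g: empirical_risk(g, data))

def make_threshold(t):
    """Illustrative +/-1 classifier: sign of (x - t)."""
    return lambda x: 1 if x >= t else -1

# A small model G of 21 threshold classifiers.
model = [make_threshold(t / 10) for t in range(-10, 11)]

# Toy sample from a noiseless target with threshold 0.3 (which belongs to G).
rng = random.Random(0)
data = [(x, 1 if x >= 0.3 else -1) for x in (rng.uniform(-1, 1) for _ in range(100))]

g_n = erm(model, data)
print(empirical_risk(g_n, data))  # 0.0: the target belongs to the model
```

Since the target belongs to the model here, the minimizer attains zero empirical risk; with label noise or a mismatched model it would not.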
Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

g_n = arg min_{g in G_d, d in N} R_n(g) + pen(d, n) .

The penalty pen(d, n) gives preference to models where estimation error is small and measures the size or capacity of the model.

Regularization. Another, usually easier to implement, approach consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ||g||. Then one has to minimize the regularized empirical risk:

g_n = arg min_{g in G} R_n(g) + λ ||g||² .
(4) Strictly speaking this is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.
Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows one to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem and, most often, one uses extra validation data for this task.

Most existing (and successful) methods can be thought of as regularization methods.
Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be 'normalized', i.e. when it corresponds to some probability distribution over G.

Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer -log π(g) (5). Reciprocally, from a regularizer of the form ||g||², if there exists a measure µ on G such that ∫ e^{-λ||g||²} dµ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in R^d going through the origin, G can be identified with R^d and, taking µ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on R^d as a prior (6).

This type of normalized regularizer, or prior, can be used to construct another probability distribution ρ on G (usually called posterior), as

ρ(g) = e^{-γ R_n(g)} π(g) / Z(γ) ,

where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.

There are several ways in which this ρ can be used. If we take the function maximizing it, we recover regularization, as

arg max_g ρ(g) = arg min_g R_n(g) - (1/γ) log π(g) ,

where the regularizer is -(1/γ) log π(g) (7).

Also, ρ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to ρ and outputs g(x). This procedure is usually called Gibbs classification.

Another way in which the distribution ρ constructed above can be used is by taking the expected prediction of the functions in G:

g_n(x) = sgn(E_ρ[g(x)]) .
(5) This is fine when G is countable. In the continuous case, one has to consider the density associated to π. We omit these details.
(6) Generalization to infinite dimensional Hilbert spaces can also be done, but it requires more care. One can for example establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space.
(7) Note that minimizing γ R_n(g) - log π(g) is equivalent to minimizing R_n(g) - (1/γ) log π(g).
Statistical Learning Theory
This is typically called Bayesian averaging.
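These constructions can be sketched for a finite class. Everything concrete below (the threshold model, the data, the uniform prior π, and the value of γ) is invented for the illustration; the code builds ρ(g) ∝ e^{-γ R_n(g)} π(g), then predicts by Gibbs sampling and by Bayesian averaging:

```python
import math
import random

def empirical_risk(g, data):
    return sum(1 for x, y in data if g(x) != y) / len(data)

def make_threshold(t):
    return lambda x: 1 if x >= t else -1

model = [make_threshold(t / 10) for t in range(-10, 11)]
prior = [1 / len(model)] * len(model)           # uniform prior pi

rng = random.Random(0)
data = [(x, 1 if x >= 0.3 else -1) for x in (rng.uniform(-1, 1) for _ in range(50))]

gamma = 20.0                                    # arbitrary temperature
weights = [math.exp(-gamma * empirical_risk(g, data)) * p
           for g, p in zip(model, prior)]
z = sum(weights)                                # normalization factor Z(gamma)
rho = [w / z for w in weights]                  # the 'posterior'

def gibbs_predict(x):
    """Gibbs classification: sample g ~ rho, then output g(x)."""
    g = rng.choices(model, weights=rho)[0]
    return g(x)

def bayes_average(x):
    """Bayesian averaging: sgn of the rho-expected prediction."""
    return 1 if sum(r * g(x) for g, r in zip(model, rho)) >= 0 else -1

print(bayes_average(0.9), bayes_average(-0.9))
```

Gibbs classification gives a randomized answer drawn from the posterior, while Bayesian averaging deterministically follows the posterior-weighted majority.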
At this point we have to insist again on the fact that the choice of the class G, and of the associated regularizer or prior, has to come from a priori knowledge about the task at hand, and there is no universally best choice.
2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data) and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g in G} R(g), to write

R(g_n) - R* = [R(g*) - R*] + [R(g_n) - R(g*)] .

The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.
Estimating the approximation error is usually hard, since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of R*, or on the noise function s.

It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error (8) can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on the estimation error.

Another possible decomposition of the risk is the following:

R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)] .

In this case, one estimates the risk by its empirical counterpart, and some quantity which approximates (or upper bounds) R(g_n) - R_n(g_n).

To summarize, we write the three types of results we may be interested in.

(8) For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case for example of Structural Risk Minimization or Regularization based algorithms.
• Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.
• Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how 'optimal' the algorithm is, given the model it uses.
• Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.

3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.
3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = P[g_n(X) ≠ Y] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for

P[ R(g_n) - R_n(g_n) > ε ] .

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = { f : (x, y) ↦ 1[g(x) ≠ y] : g ∈ G } .   (1)

Notice that G contains functions with range in {-1, 1} while F contains non-negative functions with range in {0, 1}. In the remainder of the tutorial, we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).

We use the shorthand notation P f = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

P f_n - P_n f_n .   (2)

An empirical process is a collection of random variables indexed by a class of functions, and such that each random variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data):

{ P f - P_n f }_{f ∈ F} .
One of the most studied quantities associated to empirical processes is their supremum:

sup_{f ∈ F} (P f - P_n f) .

It is clear that if we know an upper bound on this quantity, it will be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.
3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:

R(g) - R_n(g) = E[f(Z)] - (1/n) Σ_{i=1}^n f(Z_i) .

It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that

P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)] = 0 ] = 1 .

This indicates that, with enough samples, the empirical risk of a function is a good approximation to its true risk.

It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P[ | (1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)] | > ε ] ≤ 2 exp( -2nε² / (b - a)² ) .

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then

P[ |P_n f - P f| > (b - a) √( log(2/δ) / (2n) ) ] ≤ δ ,

or (by inversion, see Appendix A) with probability at least 1 - δ,

|P_n f - P f| ≤ (b - a) √( log(2/δ) / (2n) ) .
Applying this to f(Z) = 1[g(X) ≠ Y], we get that for any g, and any δ > 0, with probability at least 1 - δ,

R(g) ≤ R_n(g) + √( log(2/δ) / (2n) ) .   (3)

Notice that one has to consider a fixed function g, and the probability is with respect to the sampling of the data. If the function depends on the data this does not apply!
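The fixed-function statement can be checked numerically. For a single {0, 1}-valued loss (a fixed f, here a Bernoulli variable with mean p standing in for P f), the fraction of simulated samples violating the inverted bound stays below δ; all parameters below are arbitrary:

```python
import math
import random

rng = random.Random(1)
n, delta, trials = 200, 0.05, 2000
p = 0.3                                            # plays the role of Pf
bound = math.sqrt(math.log(2 / delta) / (2 * n))   # (b - a) = 1 for 0/1 losses

violations = 0
for _ in range(trials):
    sample = [1 if rng.random() < p else 0 for _ in range(n)]
    if abs(sum(sample) / n - p) > bound:
        violations += 1

print(violations / trials)  # stays below delta = 0.05
```

In fact the observed violation rate is far below δ: Hoeffding's inequality ignores the variance of f(Z), one of the loosenesses discussed later in Section 3.6.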
3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which P f - P_n f ≤ √( log(2/δ) / (2n) ) (and this set of samples has measure P[S] ≥ 1 - δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {-1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that

P f - P_n f = 1 .

To see this, take the function g which is g(X_i) = Y_i on the data and g(X) = -Y everywhere else. This does not contradict Hoeffding's inequality but shows that it does not yield what we need.

Figure 2 illustrates the above argumentation. The horizontal axis corresponds to the functions in the class and the vertical axis to their risks.
[Figure 2: the curves R(g) and R_n(g) plotted over the function class, with the risk on the vertical axis and the functions g* and g_n marked.]
Fig. 2. Convergence of the empirical risk to the true risk over the class of functions.
3.4 Uniform Deviations

[...]

R(f_n) - R_n(f_n) ≤ sup_{f ∈ F} ( R(f) - R_n(f) ) .   (4)

In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.
Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define

C_i = { ((x_1, y_1), ..., (x_n, y_n)) : P f_i - P_n f_i > ε } .

This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,

P[C_i] ≤ δ .

We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2] ≤ 2δ .

More generally, if we have N functions in our class, we can write

P[C_1 ∪ ... ∪ C_N] ≤ Σ_{i=1}^N P[C_i] .

As a result we obtain

P[ ∃ f ∈ {f_1, ..., f_N} : P f - P_n f > ε ]
  ≤ Σ_{i=1}^N P[ P f_i - P_n f_i > ε ]
  ≤ N exp(-2nε²) .
Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 - δ,

∀ g ∈ G, R(g) ≤ R_n(g) + √( (log N + log(1/δ)) / (2n) ) .

This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.

Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].
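To get a feeling for the numbers, the finite-class bound can be evaluated directly; the values of n, N, and δ below are arbitrary. Note the mild, logarithmic growth in N:

```python
import math

def finite_class_bound(n, N, delta):
    """Uniform deviation bound sqrt((log N + log(1/delta)) / (2n))."""
    return math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

for N in (1, 100, 10**6):
    print(N, finite_class_bound(1000, N, 0.05))
```

Going from a single function to a million functions only a little more than doubles the bound.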
3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality

R(g*) ≤ R_n(g*) + sup_{g ∈ G} ( R(g) - R_n(g) ) ,

which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,

R_n(g*) - R_n(g_n) ≥ 0 .

Thus we obtain

R(g_n) = R(g_n) - R(g*) + R(g*)
  ≤ R_n(g*) - R_n(g_n) + R(g_n) - R(g*) + R(g*)
  ≤ 2 sup_{g ∈ G} |R(g) - R_n(g)| + R(g*) .

We obtain that, with probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √( (log N + log(2/δ)) / (2n) ) .

We notice that in the right hand side, both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.
3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far.

• Inference requires to put assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).
• The error bounds are valid with respect to the repeated sampling of training sets.
• For a fixed function g, for most of the samples,

R(g) - R_n(g) ≈ 1/√n .

• For most of the samples, if |G| = N,

sup_{g ∈ G} ( R(g) - R_n(g) ) ≈ √( log N / n ) .

The extra variability comes from the fact that the chosen g_n changes with the data.

So the result we have obtained so far is that, with high probability, for a finite class of size N,

sup_{g ∈ G} ( R(g) - R_n(g) ) ≤ √( (log N + log(1/δ)) / (2n) ) .

There are several things that can be improved:

• Hoeffding's inequality only uses the boundedness of the functions, not their variance.
• The union bound is as bad as if all the functions in the class were independent (i.e. if f_1(Z) and f_2(Z) were independent).
• The supremum over G of R(g) - R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) - R_n(g_n) by the supremum might be loose.
4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows us to extend the previous results to the (countably) infinite case.

Recall that by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),

P[ P f - P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ δ(f) .
Hence, if we have a countable set F, the union bound immediately yields

P[ ∃ f ∈ F : P f - P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ Σ_{f ∈ F} δ(f) .

Choosing δ(f) = δ p(f) with Σ_{f ∈ F} p(f) = 1, this makes the right-hand side equal to δ, and we get the following result. With probability at least 1 - δ,

∀ f ∈ F, P f ≤ P_n f + √( (log(1/p(f)) + log(1/δ)) / (2n) ) .

We notice that if F is finite (with size N), taking a uniform p gives the log N as before.

Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight to the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well-chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).
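A small sketch of this effect, with arbitrary numbers: under the bound above, a function carrying prior weight p(f) = 1/2 gets a much smaller deviation term than any of 1024 uniformly weighted functions:

```python
import math

def weighted_bound(p_f, n, delta):
    """Deviation term sqrt((log(1/p(f)) + log(1/delta)) / (2n))."""
    return math.sqrt((math.log(1 / p_f) + math.log(1 / delta)) / (2 * n))

n, delta = 1000, 0.05
print(weighted_bound(1 / 1024, n, delta))  # uniform weight over 1024 functions
print(weighted_bound(1 / 2, n, delta))     # a function favored by the prior
```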
4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider

F_{z_1,...,z_n} = { (f(z_1), ..., f(z_n)) : f ∈ F } .

The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}| .

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G and notice that S_F(n) = S_G(n).

It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 - δ,

∀ g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (log S_G(2n) + log(2/δ)) / n ) .
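For simple classes the growth function can be computed by enumerating the projections on a sample. A sketch for one-sided threshold classifiers on the line (an illustrative class of VC dimension 1, for which S(n) = n + 1):

```python
def projections(model, sample):
    """The class 'projected' on the sample: distinct vectors (g(z1), ..., g(zn))."""
    return {tuple(g(z) for z in sample) for g in model}

def make_threshold(t):
    return lambda x: 1 if x >= t else -1

sample = [0.1, 0.4, 0.7, 0.9]
# Only thresholds falling between consecutive sample points change the pattern,
# so finitely many candidates realize every labeling achievable on this sample.
candidates = [make_threshold(t) for t in (-1.0, 0.2, 0.5, 0.8, 1.0)]
print(len(projections(candidates, sample)))  # 5 = n + 1 patterns, far below 2^n
```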
Notice that, in the finite case where |G| = N, we have S_G(n) ≤ N, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing S_G(n).

4.3 VC Dimension

Since g ∈ {-1, 1}, it is clear that S_G(n) ≤ 2^n. If S_G(n) = 2^n, there is a set of size n such that the class of functions can generate any classification on these points (we say that G shatters the set).

Definition 2 (VC dimension). The VC dimension of a class G is the largest n such that

S_G(n) = 2^n .

In other words, the VC dimension of a class G is the size of the largest set that it can shatter.

In order to illustrate this definition, we give some examples. The first one is the set of half-planes in R^d (see Figure 3). In this case, as depicted for the case d = 2, one can shatter a set of d + 1 points but no set of d + 2 points, which means that the VC dimension is d + 1.

Fig. 3. Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.

It is interesting to notice that the number of parameters needed to define half-spaces in R^d is d, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only:

{ sgn(sin(tx)) : t ∈ R } ,

which actually has infinite VC dimension (this is an exercise left to the reader).
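The exercise can be settled constructively: for the points x_i = 2^{-i}, choosing t = π(1 + Σ_i b_i 2^i) with b_i = (1 - y_i)/2 makes sgn(sin(t x_i)) reproduce any prescribed labeling (y_1, ..., y_n), so n points are shattered for every n. A numerical check of this classical construction:

```python
import math
from itertools import product

def sin_classifier(t):
    return lambda x: 1 if math.sin(t * x) > 0 else -1

n = 8
xs = [2.0 ** -(i + 1) for i in range(n)]  # x_i = 2^{-i}

shattered = True
for ys in product((-1, 1), repeat=n):
    bits = [(1 - y) // 2 for y in ys]     # b_i = (1 - y_i) / 2
    t = math.pi * (1 + sum(b * 2 ** (i + 1) for i, b in enumerate(bits)))
    g = sin_classifier(t)
    if tuple(g(x) for x in xs) != ys:
        shattered = False
print(shattered)  # True: all 2^8 labelings are realized
```

The reason the formula works: modulo 2π, t·x_j reduces to π(b_j + r_j) with 0 < r_j < 1, so the sign of sin(t·x_j) is determined by the bit b_j alone.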
Fig. 4. VC dimension of sinusoids.

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class has VC dimension h, it entails that for all n ≤ h, S_G(n) = 2^n, and S_G(n) < 2^n otherwise. This seems of little use, but actually, an intriguing phenomenon occurs for n ≥ h, as depicted in Figure 5.
[...] and for all n ≥ h,

S_G(n) ≤ (en/h)^h .

Using this lemma along with Theorem 2, we immediately obtain that if G has VC dimension h, with probability at least 1 - δ,

∀ g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (h log(2en/h) + log(2/δ)) / n ) .

What is important to recall from this result is that the difference between the true and empirical risk is at most of order

√( h log n / n ) .

An interpretation of VC dimension and growth functions is that they measure the effective size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just 'count' the number of functions in the class but depends on the geometry of the class (rather, of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.
4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient of the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs to have more data to be able to apply the result. The extra data set is usually called a 'virtual' or 'ghost' sample.

We will denote by Z'_1, ..., Z'_n an independent (ghost) sample and by P'_n the corresponding empirical measure.

Lemma 2 (Symmetrization). For any t > 0, such that nt² ≥ 2,

P[ sup_{f ∈ F} (P - P_n) f ≥ t ] ≤ 2 P[ sup_{f ∈ F} (P'_n - P_n) f ≥ t/2 ] .

Proof. Let f_n be the function achieving the supremum (note that it depends on Z_1, ..., Z_n). One has (with ∧ denoting the conjunction of two events)

1[(P - P_n) f_n > t] 1[(P - P'_n) f_n < t/2] = 1[ (P - P_n) f_n > t ∧ (P'_n - P) f_n > -t/2 ] ≤ 1[ (P'_n - P_n) f_n > t/2 ] .

Taking expectations with respect to the second sample gives

1[(P - P_n) f_n > t] P'[ (P - P'_n) f_n < t/2 ] ≤ P'[ (P'_n - P_n) f_n > t/2 ] .
By Chebyshev's inequality (see Appendix A),

P'[ (P - P'_n) f_n ≥ t/2 ] ≤ 4 Var[f_n] / (n t²) ≤ 1/(n t²) .

Indeed, a random variable with range in [0, 1] has variance less than 1/4. Hence

1[(P - P_n) f_n > t] (1 - 1/(n t²)) ≤ P'[ (P'_n - P_n) f_n > t/2 ] .

Taking expectation with respect to the first sample gives the result.

This lemma allows us to replace the expectation P f by an empirical average over the ghost sample. As a result, the right hand side only depends on the projection of the class F on the double sample,

F_{Z_1,...,Z_n,Z'_1,...,Z'_n} ,

which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient that is needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:

P[ P_n f - P'_n f > t ] ≤ e^{-n t² / 2} .

We now just have to put the pieces together:

P[ sup_{f ∈ F} (P - P_n) f ≥ t ]
  ≤ 2 P[ sup_{f ∈ F} (P'_n - P_n) f ≥ t/2 ]
  = 2 P[ sup_{f ∈ F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (P'_n - P_n) f ≥ t/2 ]
  ≤ 2 S_F(2n) P[ (P'_n - P_n) f ≥ t/2 ]
  ≤ 4 S_F(2n) e^{-n t² / 8} .

Using inversion finishes the proof of Theorem 2.
4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows one to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions.

We now show how to modify the proof above to get a distribution-dependent result. We use the following notation: N(F, z_1^n) = |F_{z_1,...,z_n}|.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as

H_F(n) = log E[ N(F, Z_1^n) ] .
Theorem 3. For any δ > 0, with probability at least 1 - δ,

∀ g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (H_G(2n) + log(2/δ)) / n ) .

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity

I = P[ sup_{f ∈ F_{Z_1^n, Z'_1^n}} (P'_n - P_n) f ≥ t/2 ] .

Let σ_1, ..., σ_n be n independent random variables such that P(σ_i = 1) = P(σ_i = -1) = 1/2 (they are called Rademacher variables). We notice that the quantities (P'_n - P_n) f and (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) have the same distribution, since changing one σ_i corresponds to exchanging Z_i and Z'_i. Hence we have

I ≤ E[ P_σ[ sup_{f ∈ F_{Z_1^n, Z'_1^n}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ] ,

and the union bound leads to

I ≤ E[ N(F, Z_1^n, Z'_1^n) max_f P_σ[ (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ] .

Since σ_i (f(Z'_i) - f(Z_i)) ∈ [-1, 1], Hoeffding's inequality finally gives

I ≤ E[ N(F, Z_1^n, Z'_1^n) ] e^{-n t² / 8} .

The rest of the proof is as before.
5 Capacity Measures

We have seen so far three measures of capacity or size of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are however other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.
5.1 Covering Numbers

[...] This is the normalized Hamming distance of the 'projections' on the sample. Given such a metric d_n, we say that a set f_1, ..., f_N covers F at radius ε if

F ⊂ ∪_{i=1}^N B(f_i, ε) .

We then define the covering numbers of F as follows.

Definition 4 (Covering number). The covering number of F at radius ε, with respect to d_n, denoted by N(F, ε, n), is the minimum size of a cover of radius ε.

Notice that it does not matter if we apply this definition to the original class G or the loss class F, since N(F, ε, n) = N(G, ε, n).

The covering numbers characterize the size of a function class as measured by the metric d_n. The rate of growth of the logarithm of N(G, ε, n), usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if G is a compact set in a d-dimensional Euclidean space, N(G, ε, n) ≈ ε^{-d}.

When the covering numbers are finite, it is possible to approximate the class G by a finite set of functions (which cover G), which again allows us to use the finite union bound, provided we can relate the behavior of all functions in G to that of functions in the cover. A typical result, which we provide without proof, is the following.

Theorem 4. For any t > 0,

P[ ∃ g ∈ G : R(g) > R_n(g) + t ] ≤ 8 E[ N(G, t, n) ] e^{-n t² / 128} .
Covering numbers can also be defined for classes of real-valued functions.

We now relate the covering numbers to the VC dimension. Notice that, because the functions in G can only take two values, for all ε > 0, N(G, ε, n) ≤ |G_{X_1,...,X_n}| = N(G, X_1^n). Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies log N(G, ε, n) ≤ h log(en/h), but one can have a considerably better result.

Lemma 3 (Haussler). Let G be a class of VC dimension h. Then, for all ε > 0, all n, and any sample,

N(G, ε, n) ≤ C h (4e)^h ε^{-h} .

The interest of this result is that the upper bound does not depend on the sample size n.

The covering number bound is a generalization of the VC entropy bound, where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages

Recall that we used, in the proof of Theorem 3, Rademacher random variables, i.e. independent {-1, 1}-valued random variables with probability 1/2 of taking either value.
We define R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i). We will denote by E_σ the expectation taken with respect to the Rademacher variables (i.e. conditionally on the data), while E will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample, and the Rademacher variables).

Definition 5 (Rademacher averages). For a class F of functions, the Rademacher average is defined as

R(F) = E[ sup_{f ∈ F} R_n f ] ,

and the conditional Rademacher average is defined as

R_n(F) = E_σ[ sup_{f ∈ F} R_n f ] .

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all δ > 0, with probability at least 1 - δ,

∀ f ∈ F, P f ≤ P_n f + 2 R(F) + √( log(1/δ) / (2n) ) ,

and also, with probability at least 1 - δ,

∀ f ∈ F, P f ≤ P_n f + 2 R_n(F) + √( 2 log(2/δ) / n ) .

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.

The proof of the above result requires a powerful tool called a concentration inequality for empirical processes.

Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when n increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.
Theorem 6 (McDiarmid). Assume that for all i = 1, ..., n,

sup_{z_1,...,z_n, z'_i} | F(z_1, ..., z_i, ..., z_n) - F(z_1, ..., z'_i, ..., z_n) | ≤ c ,

then for all ε > 0,

P[ |F - E[F]| > ε ] ≤ 2 exp( -2ε² / (n c²) ) .

The meaning of this result is thus that, as soon as one has a function of n independent random variables which is such that its variation is bounded when one variable is modified, the function will satisfy a Hoeffding-like inequality.
Proof of Theorem 5. The proof proceeds in steps: we first apply McDiarmid's inequality to the supremum of the deviations, then use symmetrization, and finally relate the Rademacher average to the conditional one.

We first show that McDiarmid's inequality can be applied to sup_{f ∈ F} (P f - P_n f). We denote temporarily by P^i_n the empirical measure obtained by modifying one element of the sample (e.g. Z_i is replaced by Z'_i). It is easy to check that the following holds:

| sup_{f ∈ F} (P f - P_n f) - sup_{f ∈ F} (P f - P^i_n f) | ≤ sup_{f ∈ F} | P^i_n f - P_n f | .

Since f ∈ {0, 1}, we obtain

| P^i_n f - P_n f | = (1/n) | f(Z_i) - f(Z'_i) | ≤ 1/n ,

and thus McDiarmid's inequality can be applied with c = 1/n. This concludes the first step of the proof.
e ne"t prove the @&irst part o& theA &ollo5ing sy((etriGation le((a)Le((a ) ;or any class ;,
sup$ & $n & 2 sup >n & ,
& ; & ;
and sup
R$ & $n & R 1 >n & 1 2 n )2 sup& ; & ;
Proof. We only prove the first part. We introduce a ghost sample Z_1', …, Z_n' and its corresponding empirical measure P_n'. We successively use the fact that E[P_n' f] = Pf and that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):

  E sup_{f∈F} (Pf − P_n f)
    = E sup_{f∈F} ( E'[P_n' f] − P_n f )
    ≤ E sup_{f∈F} ( P_n' f − P_n f )
    = E sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ( f(Z_i') − f(Z_i) )
    ≤ E sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(Z_i') + E sup_{f∈F} (1/n) Σ_{i=1}^n (−σ_i) f(Z_i)
    = 2 R(F),
Statistical Learning Theory
where the third step uses the fact that f(Z_i') − f(Z_i) and σ_i (f(Z_i') − f(Z_i)) have the same distribution, and the last step uses the fact that σ_i f(Z_i') and −σ_i f(Z_i) have the same distribution.

The above, combined with the first step, establishes the first part of Theorem 5. It is easy to check that R_n(F), as a function of the sample, also satisfies McDiarmid's assumptions with c = 1/n. As a result, R(F) = E[R_n(F)] can be sharply estimated by R_n(F), which gives the second part.
Loss Class and Initial Class. In order to make use of Theorem 5 we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that σ_i and −σ_i Y_i have the same distribution:

  R(F) = E sup_{g∈G} (1/n) Σ_{i=1}^n σ_i 1[g(X_i) ≠ Y_i]
       = E sup_{g∈G} (1/n) Σ_{i=1}^n σ_i (1 − Y_i g(X_i))/2
       = (1/2) E sup_{g∈G} (1/n) Σ_{i=1}^n (−σ_i Y_i) g(X_i) = (1/2) R(G),

where the last line uses the fact that the term (1/2n) Σ_i σ_i does not depend on g and has zero expectation. Notice that the same is valid for conditional Rademacher averages, so that we obtain that with probability at least 1 − δ,

  ∀g ∈ G,  R(g) ≤ R_n(g) + R_n(G) + √( 2 log(2/δ) / n ).
Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. We write the following:

  (1/2) E_σ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(X_i)
    = (1/2) E_σ sup_{g∈G} ( 1 − (2/n) Σ_{i=1}^n 1[g(X_i) ≠ σ_i] )
    = 1/2 − E_σ inf_{g∈G} (1/n) Σ_{i=1}^n 1[g(X_i) ≠ σ_i]
    = 1/2 − E_σ inf_{g∈G} R_n(g, σ),

where R_n(g, σ) denotes the empirical risk of g computed with the σ_i used as labels.
This indicates that, given a sample and a choice of the random variables σ_1, …, σ_n, computing R_n(G) is not harder than computing the empirical risk minimizer in G. Indeed, the procedure would be to generate the σ_i randomly and minimize the empirical error in G with respect to the labels σ_i.

An advantage of rewriting R_n(G) as above is that it gives an intuition of what it actually measures: how much the class G can fit random noise. If the class G is very large, there will always be a function which can perfectly fit the σ_i; then E_σ inf_{g∈G} R_n(g, σ) = 0 and the quantity above attains its maximal value 1/2, so that there is no hope of uniform convergence to zero of the difference between true and empirical risks.
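This procedure is easy to implement. The sketch below (a 1-D threshold class and small samples; all parameter choices are illustrative) estimates the conditional Rademacher average of threshold classifiers g(x) = s · sign(x − t) via the identity above, by repeatedly drawing random labels σ_i and minimizing the empirical error over the class:

```python
import random

def rademacher_via_erm(xs, draws=200, seed=0):
    """Estimate E_sigma sup_g (1/n) sum_i sigma_i g(x_i) for 1-D threshold
    classifiers g(x) = s * sign(x - t), s in {-1, +1}, using the identity
    1 - 2 E_sigma inf_g R_n(g, sigma) derived above."""
    rng = random.Random(seed)
    n = len(xs)
    cuts = sorted(xs) + [float("inf")]
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        best = n  # smallest number of disagreements found so far
        for t in cuts:
            # errors of the orientation predicting +1 on [t, infinity)
            err = sum(1 for x, s in zip(xs, sigma) if (1 if x >= t else -1) != s)
            best = min(best, err, n - err)  # n - err: opposite orientation
        total += best / n
    return 1 - 2 * total / draws

r8 = rademacher_via_erm([i / 8 for i in range(8)])
r100 = rademacher_via_erm([i / 100 for i in range(100)])
```

As expected, the estimate shrinks as the sample grows: a small class fits random noise less and less well, in accordance with the bounds for finite and VC classes.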
For a finite set G with |G| = N, one can show that

  R_n(G) ≤ 2 √( log N / n ),

where we again see the logarithmic factor log N. A consequence of this is that, by considering the projection of a class G of VC dimension h on the sample, and using Lemma 1, we have

  R(G) ≤ 2 √( h log(en/h) / n ).
This result, along with Theorem 5, allows to recover the Vapnik-Chervonenkis bound with a concentration-based proof.

Although the benefit of using concentration may not be entirely clear at this point, let us just mention that one can actually improve the dependence on n of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does. One has the following result, called Dudley's entropy bound:

  R_n(F) ≤ (C/√n) ∫₀^∞ √( log N(F, t, n) ) dt.

As a consequence, along with Haussler's upper bound, we can get the following result:

  R_n(F) ≤ C √( h/n ).

We can thus, with this approach, remove the unnecessary log n factor of the VC bound.
6 Advanced Topics

In this section, we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.
6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function f, the distribution of P_n f is actually a binomial law of parameters Pf and n (since we are summing n i.i.d. random variables f(Z_i) which can either be 0 or 1 and are equal to 1 with probability E[f(Z_i)] = Pf). Denoting p = Pf, we have an exact expression for the deviations of P_n f from Pf:

  P[ Pf − P_n f ≥ t ] = Σ_{k=0}^{⌊n(p−t)⌋} C(n,k) p^k (1−p)^{n−k}.

Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are each an upper bound on P[ Pf − P_n f ≥ t ]:

  ( (p/(p−t))^{p−t} ((1−p)/(1−p+t))^{1−p+t} )^n      (exponential)
  exp( −n p ( (1 − t/p) log(1 − t/p) + t/p ) )       (Bennett)
  exp( −n t² / ( 2 p (1−p) + 2t/3 ) )                (Bernstein)
  exp( −2 n t² )                                     (Hoeffding)
="a(ining the above bounds @and using inversionA, 5e can say that roughly
spea%ing, the s(all deviations o& $& $n & have a aussian behavior o& the
&or( e"p@ nt2 42p@1 pAA @i)e) aussian 5ith variance p@1 pAA 5hile the largedeviations have a $oisson behavior o& the &or( e"p@ 3nt42A)
So the tails are heavier than aussian, and oe ding8s inequality consists in
upper bounding the tails 5ith a aussian 5ith (a"i(u( variance, hence the
ter( e"p@ 2nt2 A)=ach &unction & ; has a di erent variance $ & @1 $ & A $ & ) !oreover,
&or each & ;, by Bernstein8s inequality, 5ith probability at least 1 ,
$ & $n & 2$ & log 13n )n 2 log 1
The aussian part @second ter( in the right hand sideA do(inates @&or $ & not
too s(all, or n large enoughA, and it depends on $ & ) e thus 5ant to co(bine
Bernstein8s inequality 5ith the union bound and the sy((etriGation)
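These bounds are easy to compare numerically with the exact binomial tail. In the sketch below, the values n = 100, p = 0.1 and t = 0.05 are arbitrary illustrative choices:

```python
import math

def lower_tail_bounds(n, p, t):
    """Exact binomial tail P[Pf - P_n f >= t] (i.e. P[n P_n f <= n(p - t)])
    together with the Bernstein and Hoeffding upper bounds."""
    k_max = math.floor(n * (p - t))
    exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
                for k in range(k_max + 1))
    bernstein = math.exp(-n * t ** 2 / (2 * p * (1 - p) + 2 * t / 3))
    hoeffding = math.exp(-2 * n * t ** 2)
    return exact, bernstein, hoeffding

exact, bernstein, hoeffding = lower_tail_bounds(100, 0.1, 0.05)
```

For such a small p the variance p(1−p) is far below the worst case 1/4, so Bernstein's bound is markedly smaller than Hoeffding's, while both remain well above the exact tail.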
6.2 Normalization

The idea is to consider the ratio

  (Pf − P_n f) / √(Pf).

Here (f ∈ {0,1}), Var f ≤ P f² = Pf.
The reason for considering this ratio is that, after normalization, fluctuations are more 'uniform' in the class F. Hence the supremum in

  sup_{f∈F} (Pf − P_n f) / √(Pf)

is not necessarily attained at functions with large variance, as was the case previously. Moreover, we know that our goal is to find functions with small error Pf (hence small variance). The normalized supremum takes this into account.
We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik-Chervonenkis [18]). For δ > 0, with probability at least 1 − δ,

  ∀f ∈ F,  (Pf − P_n f) / √(Pf) ≤ 2 √( ( log S_F(2n) + log(4/δ) ) / n ),

and also with probability at least 1 − δ,

  ∀f ∈ F,  (P_n f − Pf) / √(P_n f) ≤ 2 √( ( log S_F(2n) + log(4/δ) ) / n ).

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:

  P[ sup_{f∈F} (Pf − P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_{f∈F} (P_n' f − P_n f)/√( (P_n f + P_n' f)/2 ) ≥ t ].

The second step consists in randomization (with Rademacher variables):

  … = 2 P[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ( f(Z_i') − f(Z_i) ) / √( (P_n f + P_n' f)/2 ) ≥ t ].

Finally, one uses a tail bound of Bernstein type.
Let us e"plore the consequences o& this result)
;ro( the &act that &or non#negative nu(bers ?, B, ',
? B ' B ' ,? ? B '2
5e easily get &or e"a(ple
& ;, $ & $n & 2 $n & log S; @2nA log n
log S; @2nA log n )
In the ideal situation where there is no noise (i.e. Y = t(X) almost surely), and t ∈ G, denoting by g_n the empirical risk minimizer, we have R* = 0 and also R_n(g_n) = 0. In particular, when G is a class of VC dimension h, we obtain

  R(g_n) = O( h log n / n ).

So, in a way, Theorem 7 allows to interpolate between the best case, where the rate of convergence is O(h log n / n), and the worst case, where the rate is O(√(h log n / n)) (it does not allow to remove the log n factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least 1 − δ,

  R(g_n) ≤ R(g*) + 2 √( R(g*) ( log S_G(2n) + log(4/δ) ) / n ) + 4 ( log S_G(2n) + log(4/δ) ) / n.

We notice here that when R(g*) = 0 (i.e. t ∈ G and R* = 0), the rate is again of order 1/n while, as soon as R(g*) > 0, the rate is of order 1/√n. Therefore, it is not possible to obtain a rate with a power of n in between 1/2 and 1. The main reason is that the factor √(R(g*)) of the square root term is not the right quantity to use here, since it does not vary with n. We will see later that one can have instead R(g_n) − R(g*) as a factor, which usually converges to zero as n increases. Unfortunately, Theorem 7 cannot be applied to functions of the type f − f* (which would be needed to have the mentioned factor), so we will need a refined approach.
6.3 Noise Conditions

The refinement we seek requires certain specific assumptions about the noise function s(x), the ideal case being when s(x) = 0 everywhere (which corresponds to R* = 0 and Y = t(X)). We now introduce quantities that measure how well-behaved the noise function is.

The situation is favorable when the regression function η(x) is not too close to 0, or at least not too often close to 0. Indeed, η(x) = 0 means that the noise is maximum at x (s(x) = 1/2) and that the label is completely undetermined (any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions.

Definition 6 (Massart's Noise Condition). For some c > 0, assume

  |η(X)| > 1/c  almost surely.
This condition implies that there is no region where the decision is completely random, or in other words that the noise is bounded away from 1/2.

Definition 7 (Tsybakov's Noise Condition). Let α ∈ [0,1]; assume that one of the following equivalent conditions is satisfied:

  (i) ∃c > 0, ∀g ∈ {−1,1}^X,  P[ g(X) η(X) ≤ 0 ] ≤ c ( R(g) − R* )^α,
  (ii) ∃c > 0, ∀A ⊂ X,  ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α,
  (iii) ∃B > 0, ∀t ≥ 0,  P[ |η(X)| ≤ t ] ≤ B t^{α/(1−α)}.

Condition (iii) is probably the easiest to interpret: it means that η(x) is close to the critical value 0 with low probability.
We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent.

(i) ⇒ (ii): It is easy to check that R(g) − R* = E[ |η(X)| 1[g(X)η(X) ≤ 0] ]. For each set A, there exists a function g such that {x : g(x)η(x) ≤ 0} = A, so that (i) gives

  ∫_A dP(x) = P[ g(X)η(X) ≤ 0 ] ≤ c ( R(g) − R* )^α = c ( ∫_A |η(x)| dP(x) )^α.

(ii) ⇒ (iii): Let A = {x : |η(x)| ≤ t}. Then

  P[ |η(X)| ≤ t ] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α,

which gives  P[ |η(X)| ≤ t ] ≤ c^{1/(1−α)} t^{α/(1−α)}.

(iii) ⇒ (i): We write

  R(g) − R* = E[ |η| 1[gη ≤ 0] ] ≥ E[ |η| 1[gη ≤ 0] 1[|η| > t] ]
    ≥ t ( E[1[gη ≤ 0]] − E[1[gη ≤ 0] 1[|η| ≤ t]] )
    ≥ t ( P[gη ≤ 0] − P[|η| ≤ t] ) ≥ t ( P[gη ≤ 0] − B t^{α/(1−α)} ).

Taking t = ( (1−α) P[gη ≤ 0] / B )^{(1−α)/α} finally gives

  P[gη ≤ 0] ≤ B^{1−α} (1−α)^{−(1−α)} α^{−α} ( R(g) − R* )^α.

We notice that the parameter α has to be in [0,1]. Indeed, one has the opposite inequality

  R(g) − R* = E[ |η| 1[gη ≤ 0] ] ≤ E[ 1[gη ≤ 0] ] = P[ g(X)η(X) ≤ 0 ],

which is incompatible with condition (i) if α > 1. We also notice that when α = 0, Tsybakov's condition is void, and when α = 1, it is equivalent to Massart's condition.
Definition 8 (Local Rademacher Average). For r ≥ 0, the local Rademacher average of a class F is defined as

  R(F, r) = E[ sup_{f∈F: Pf² ≤ r} (1/n) Σ_{i=1}^n σ_i f(Z_i) ].

We will also use the star-hull of F, defined as ★F = { αf : f ∈ F, α ∈ [0,1] }.
The reason for this definition is that, as we have seen before, the crucial ingredient to obtain better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows to focus on the part of the function class where the fast rate phenomenon occurs, that is, functions with small variance.

Next we introduce the concept of a sub-root function, a real-valued function with certain monotony properties.

Definition 9 (Sub-Root Function). A function ψ : [0,∞) → [0,∞) is sub-root if
  (i) ψ is non-decreasing,
  (ii) ψ is non-negative,
  (iii) ψ(r)/√r is non-increasing.
An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function has a unique positive fixed point, i.e. there is a unique r* > 0 such that ψ(r*) = r*.

Consequently, since the localized Rademacher average of the star-hull of a class behaves like a sub-root function, it has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions F,

  r ↦ R_n(★F, r) is sub-root.
One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of F and of ★F. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.
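The fixed point r* can be computed by simple iteration: since ψ(r)/√r is non-increasing, the iteration r ← ψ(r) converges to r* from any positive starting point. A sketch, where the sub-root function ψ(r) = √(cr) (whose fixed point is exactly r* = c) is an illustrative stand-in for a VC-type bound:

```python
import math

def subroot_fixed_point(psi, r0=1.0, tol=1e-10, max_iter=10000):
    """Unique positive fixed point of a sub-root function psi, obtained by
    iterating r <- psi(r) until the iterates stabilize."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

# psi(r) = sqrt(c * r) is sub-root with fixed point r* = c.
c = 0.05
r_star = subroot_fixed_point(lambda r: math.sqrt(c * r))
```

For this particular ψ, the iteration halves the distance of log r to log c at each step, so convergence is geometric regardless of the starting point.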
Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let F be a class of bounded functions (e.g. f ∈ [−1,1]) and let r* be the fixed point of R(★F, r). There exists a constant C > 0 such that with probability at least 1 − δ,

  ∀f ∈ F,  Pf − P_n f ≤ C ( √( r* Var f ) + ( log(1/δ) + log log n ) / n ).

If in addition the functions in F satisfy Var f ≤ c (Pf)^β, then one obtains that with probability at least 1 − δ,

  ∀f ∈ F,  Pf ≤ C ( P_n f + (r*)^{1/(2−β)} + ( log(1/δ) + log log n ) / n ).
Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. which includes the variance). This inequality tells us that with high probability,

  sup_{f∈F} (Pf − P_n f) ≤ E[ sup_{f∈F} (Pf − P_n f) ] + c √( sup_{f∈F} Var f / n ) + c'/n,

for some constants c, c'.

2. The second step consists in 'peeling' the class, that is, splitting it into subclasses according to the variance of the functions:

  F_k = { f : Var f ∈ [x^k, x^{k+1}) },
3. We can then apply Talagrand's inequality to each of the subclasses separately to get, with high probability,

  sup_{f∈F_k} (Pf − P_n f) ≤ E[ sup_{f∈F_k} (Pf − P_n f) ] + c √( x Var f / n ) + c'/n.

4. Then the symmetrization lemma allows to introduce local Rademacher averages. We get that with high probability

  ∀f ∈ F,  Pf − P_n f ≤ 2 R(F, x Var f) + c √( x Var f / n ) + c'/n.

5. The sub-root property allows to upper bound the local Rademacher average by the value of its fixed point, since for r ≥ r* one has R(★F, r) ≤ √(r r*), i.e. the local average behaves like a square root function. With high probability,

  Pf − P_n f ≤ 2 √( r* x Var f ) + c √( x Var f / n ) + c'/n.

6. Finally, we use the relationship between variance and expectation,

  Var f ≤ c (Pf)^β,

and solve the inequality in Pf to get the result.
We will not go into the details of how to apply the above result, but we give some remarks about its use.

An important example is the case where the class F is of finite VC dimension h. In that case, one has

  R(F, r) ≤ C √( r h log n / n ),

so that r* ≤ C h log n / n. As a consequence, under Tsybakov's condition, the rate of convergence of P f_n to P f* is O(1/n^{1/(2−α)}). It is important to note that in this case, the rate of convergence of P_n f to Pf is O(1/√n). So we obtain a fast rate by looking at the relative error. These fast rates can be obtained provided t ∈ G (but it is not needed that R* = 0). This requirement can be removed if one uses structural risk minimization or regularization.

Another related result is that, as in the global case, one can obtain a bound with data-dependent (i.e. conditional) local Rademacher averages:

  R_n(F, r) = E_σ [ sup_{f∈F: P_n f² ≤ r} (1/n) Σ_{i=1}^n σ_i f(Z_i) ].

The result is the same as before (with different constants) under the same conditions as in Theorem 8: with probability at least 1 − δ,

  Pf ≤ C ( P_n f + (r_n*)^{1/(2−β)} + ( log(1/δ) + log log n ) / n ),
where r_n* is the fixed point of a sub-root upper bound of R_n(F, r).

Hence, we can get improved rates when the noise is well-behaved, and these rates interpolate between n^{−1/2} and n^{−1}. However, it is not in general possible to estimate the parameters (c and α) entering in the noise conditions, but we will not discuss this issue further here. Another point is that although the capacity measure that we use seems 'local', it does depend on all the functions in the class, but each of them is implicitly appropriately rescaled: in R(★F, r), each function f ∈ F with Pf² ≥ r is considered at scale r/Pf².

Bibliographical remarks. Hoeffding's inequality appears in [19]. For a proof of the contraction principle we refer to Ledoux and Talagrand [20].

The Vapnik-Chervonenkis-Sauer-Shelah lemma was proved independently by Sauer [21], Shelah [22], and Vapnik and Chervonenkis [18]. For related combinatorial results we refer to Alesker [23], Alon, Ben-David, Cesa-Bianchi and Haussler [24], and Cesa-Bianchi and Haussler [25].
The use of Rademacher averages in classification was first promoted by Koltchinskii.
B No Free Lunch

We can now give a formal definition of consistency and state the core results about the impossibility of universally good algorithms.

Definition 11 (Consistency). An algorithm is consistent if for any probability measure P,

  lim_{n→∞} R(g_n) = R*  almost surely.

It is important to understand the reasons that make possible the existence of consistent algorithms. In the case where the input space X is countable, things are somehow easy, since even if there is no relationship at all between inputs and outputs, by repeatedly sampling data independently from P, one will get to see an increasing number of different inputs, which will eventually cover all the inputs. So, in the countable case, an algorithm which would simply 'learn by heart' (i.e. make a majority vote when the instance has been seen before, and produce an arbitrary prediction otherwise) would be consistent.
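The 'learn by heart' strategy on a countable input space can be sketched as follows (the ten-point support, uniform sampling distribution and parity labels are illustrative choices; unseen instances get the arbitrary prediction +1):

```python
import random
from collections import Counter, defaultdict

def learn_by_heart(sample):
    """Majority vote on instances already seen in the sample; an arbitrary
    prediction (+1) on instances never seen."""
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1
    return lambda x: votes[x].most_common(1)[0][0] if x in votes else 1

# Finite (hence countable) input space with deterministic labels t(x).
support = list(range(10))
t = lambda x: 1 if x % 2 == 0 else -1

def risk(g):
    # exact risk under the uniform distribution on the support
    return sum(g(x) != t(x) for x in support) / len(support)

rng = random.Random(0)
sample = [(x, t(x)) for x in (rng.choice(support) for _ in range(500))]
g_n = learn_by_heart(sample)
```

With 500 uniform draws from a ten-point support, every instance has been seen with overwhelming probability, so the risk of g_n is zero; consistency follows because the same eventually happens, point by point, for any countable support.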
In the case where X is not countable (e.g. X = R), things are more subtle. Indeed, in that case, there is a seemingly innocent assumption that becomes crucial: to be able to define a probability measure P on X, one needs a σ-algebra on that space, which is typically the Borel σ-algebra. So the hidden assumption is that P is a Borel measure. This means that the topology of X plays a role here, and thus the target function t will be Borel measurable. In a sense this guarantees that it is possible to approximate t from its value (or approximate value) at a finite number of points. The algorithms that will achieve consistency are thus those which use the topology, in the sense of 'generalizing' the observed values to neighborhoods (e.g. local classifiers). In a way, the measurability of t is one of the crudest notions of smoothness of functions.
We now cite two important results. The first one tells us that for a fixed sample size, one can construct arbitrarily bad problems for a given algorithm.

Theorem 9 (No Free Lunch, see e.g. [4]). For any algorithm, any n and any ε > 0, there exists a distribution P such that R* = 0 and

  P[ R(g_n) ≥ 1/2 − ε ] = 1.

The second result is more subtle and indicates that, given an algorithm, one can construct a problem for which this algorithm will converge as slowly as one wishes.

Theorem 10 (No Free Lunch at All, see e.g. [4]). For any algorithm, and any sequence (a_n) that converges to 0, there exists a probability distribution P such that R* = 0 and

  E[ R(g_n) ] ≥ a_n.

In the above theorem, the 'bad' probability measure is constructed on a countable set (where the outputs are not related at all to the inputs, so that no generalization is possible), and is such that the rate at which one gets to see new inputs is as slow as the convergence of a_n.
Finally, we mention other notions of consistency.

Definition 12 (VC consistency of ERM). The ERM algorithm is consistent if for any probability measure P,

  R(g_n) → R(g*)  in probability,

and

  R_n(g_n) → R(g*)  in probability.

Definition 13 (VC non-trivial consistency of ERM). The ERM algorithm is non-trivially consistent for the set F and the probability distribution P if for any c ∈ R,

  inf_{f∈F: Pf > c} P_n f → inf_{f∈F: Pf > c} Pf  in probability.
References

1. Vapnik, V.: Statistical Learning Theory. John Wiley, New York (1998)
2. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999)
3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International, Belmont, CA (1984)
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
5. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley, New York (1973)
6. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1972)
7. Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts (1994)
8. Kulkarni, S., Lugosi, G., Venkatesh, S.: Learning pattern classification – a survey. IEEE Transactions on Information Theory 44 (1998) 2178–2206. Information Theory: 1948–1998. Commemorative special issue
9. Lugosi, G.: Pattern classification and learning theory. In: Györfi, L. (ed.): Principles of Nonparametric Learning. Springer, Vienna (2002)
10. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York (1992)
11. Mendelson, S.: A few notes on statistical learning theory. In: Mendelson, S., Smola, A. (eds.): Advanced Lectures in Machine Learning. LNCS 2600, Springer (2003) 1–40
12. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA (1991)
13. Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
14. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
1.) von Lu"burg, 7), Bousquet, O), Sch/ ol%op&, B) ? co(pression approach to support
vector (odel selection) The ]ournal o& !achine Learning >esearch < @2--A 2:3Y
323
1) !c+iar(id, ') On the (ethod o& bounded di erences) In Surveys in 'o(bina#
torics 1:*:, 'a(bridge 7niversity $ress, 'a(bridge @1:*:A 1*Y1**
1*) Capni%, C), 'hervonen%is, ?) On the uni&or( convergence o& relative &requencies
o& events to their probabilities) Theory o& $robability and its ?pplications 1.
@1:1A 2.Y2*-
1:) oe ding, ) $robability inequalities &or su(s o& bounded rando( variables)
]ournal o& the ?(erican Statistical ?ssociation
39. Vapnik, V., Chervonenkis, A.: Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications 26 (1981) 821–832
40. Assouad, P.: Densité et dimension. Annales de l'Institut Fourier 33 (1983) 233–282
41. Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers 14 (1965) 326–334
42. Dudley, R.M.: Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics 31 (3) (1979) 306–308
43. Goldberg, P., Jerrum, M.: Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. Machine Learning 18 (1995) 131–148
44. Wenocur, R.S., Dudley, R.M.: Some special Vapnik-Chervonenkis classes. Discrete Mathematics 33 (1981) 313–318