Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 4: MDL and PAC-Bayes


Page 1:

Computational and Statistical Learning Theory

TTIC 31120

Prof. Nati Srebro

Lecture 4: MDL and PAC-Bayes

Page 2:

Uniform vs Non-Uniform Bias

• No Free Lunch: we need some "inductive bias"

• Limiting attention to a hypothesis class $\mathcal{H}$: "flat" bias
  • $p(h) = \frac{1}{|\mathcal{H}|}$ for $h \in \mathcal{H}$, and $p(h) = 0$ otherwise

• Non-uniform bias: $p(h)$ encodes the bias
  • Can use any $p(h) \ge 0$ s.t. $\sum_h p(h) \le 1$
  • E.g. choose a prefix-disambiguous encoding $d(h)$ and use $p(h) = 2^{-|d(h)|}$
  • Or choose a programming language $P : \mathcal{U} \to \mathcal{Y}^{\mathcal{X}}$ over prefix-disambiguous programs $\mathcal{U} \subset \{0,1\}^*$ and use $p(h) = 2^{-\min_{P(\sigma)=h} |\sigma|}$
  • The choice of $p(\cdot)$, $d(\cdot)$ or $P(\cdot)$ encodes our expert knowledge / inductive bias

Page 3:

Minimum Description Length Learning

• Choose a "prior" $p(h)$ s.t. $\sum_h p(h) \le 1$ (or a description language $d(\cdot)$ or $P(\cdot)$)

• Minimum Description Length learning rule (based on the above prior / description language):
$$\mathrm{MDL}_p(S) = \arg\max_{h:\,L_S(h)=0} p(h) = \arg\min_{h:\,L_S(h)=0} |d(h)|$$

• For any $\mathcal{D}$, w.p. $\ge 1-\delta$,
$$L(\mathrm{MDL}_p(S)) \;\le\; \inf_{h \text{ s.t. } L_{\mathcal{D}}(h)=0} \sqrt{\frac{-\log p(h) + \log(2/\delta)}{2m}}$$

• Sample complexity: $m(\epsilon,\delta,h) = O\!\left(\frac{-\log p(h)}{\epsilon^2}\right) = O\!\left(\frac{|d(h)|}{\epsilon^2}\right)$ (a more careful analysis gives $O\!\left(\frac{|d(h^*)|}{\epsilon}\right)$)
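A minimal sketch (not from the lecture) of this rule over a small finite candidate set: among hypotheses consistent with the sample, pick the one with the shortest description, i.e. the largest $p(h) = 2^{-|d(h)|}$. The threshold hypotheses and their description lengths below are made up for illustration.

# Minimal MDL sketch (illustrative, not from the lecture notes).
# Hypotheses are functions X -> {-1,+1}; desc_len[h] plays the role of |d(h)|,
# so the implicit prior is p(h) = 2**(-desc_len[h]).

def empirical_error(h, S):
    """L_S(h): fraction of sample points that h gets wrong."""
    return sum(h(x) != y for x, y in S) / len(S)

def mdl_rule(hypotheses, desc_len, S):
    """Among hypotheses with L_S(h) = 0, return the one with shortest description."""
    consistent = [h for h in hypotheses if empirical_error(h, S) == 0]
    if not consistent:
        return None  # realizability assumed; MDL is only defined when some h fits S
    return min(consistent, key=lambda h: desc_len[h])

# Toy example: threshold functions on the integers, described by their threshold.
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
desc_len = {h: t + 1 for t, h in enumerate(hypotheses)}  # hypothetical code lengths
S = [(0, -1), (1, -1), (3, 1), (4, 1)]
h_mdl = mdl_rule(hypotheses, desc_len, S)
print(empirical_error(h_mdl, S))  # 0.0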

Page 4:

MDL and Universal Learning

• Theorem: For any $\mathcal{H}$ and $p:\mathcal{H}\to[0,1]$ s.t. $\sum_h p(h) \le 1$, and any source distribution $\mathcal{D}$, if there exists $h$ with $L(h)=0$ and $p(h)>0$, then w.p. $\ge 1-\delta$ over $S\sim\mathcal{D}^m$:
$$L(\mathrm{MDL}_p(S)) \le \sqrt{\frac{-\log p(h) + \log(2/\delta)}{2m}}$$

• Can learn any countable class!
  • The class of all computable functions, with $p(h) = 2^{-\min_{P(\sigma)=h}|\sigma|}$.
  • Any class enumerable by $n:\mathcal{H}\to\mathbb{N}$, with $p(h) = 2^{-n(h)}$.

• But VCdim(all computable functions) $= \infty$!

• Why is there no contradiction to the Fundamental Theorem?
  • PAC learning: the sample complexity $m(\epsilon,\delta)$ is uniform over all $h\in\mathcal{H}$. It depends only on the class $\mathcal{H}$, not on the specific $h^*$.
  • MDL: the sample complexity $m(\epsilon,\delta,h)$ depends on $h$.

Page 5:

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\exists m(\epsilon,\delta)$, $\forall\mathcal{D}$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Theorem: $\forall\mathcal{D}$, if there exists $h$ with $L(h)=0$, then $\forall^{\delta}_{S\sim\mathcal{D}^m}$,
$$L(\mathrm{MDL}_p(S)) \le \sqrt{\frac{-\log p(h) + \log(2/\delta)}{2m}}$$

• Can we also compete with $h$ s.t. $L(h) > 0$?

Page 6:

Allowing Errors: From MDL to SRM

$$L(h) \le L_S(h) + \sqrt{\frac{-\log p(h) + \log(2/\delta)}{2m}}$$

• Structural Risk Minimization:
$$\mathrm{SRM}_p(S) = \arg\min_h \; L_S(h) + \sqrt{\frac{-\log p(h)}{2m}}$$

• Theorem: For any prior $p(h)$ with $\sum_h p(h)\le 1$ and any source distribution $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S\sim\mathcal{D}^m$:
$$L(\mathrm{SRM}_p(S)) \le \inf_h \; L(h) + 2\sqrt{\frac{-\log p(h) + \log(2/\delta)}{m}}$$

• In the SRM objective, the first term ($L_S(h)$: fit the data) is minimized by ERM, while the second term (match the prior, i.e. a simple / short description) is minimized by MDL.
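As a rough illustration (a sketch, not the lecture's code; it assumes a finite candidate set and a dictionary of prior values), the SRM rule trades the two terms off explicitly:

import math

def srm_rule(hypotheses, prior, S):
    """SRM_p(S): minimize L_S(h) + sqrt(-log p(h) / (2m)) over all candidates.

    `prior` maps each hypothesis to p(h), with sum_h p(h) <= 1.
    Unlike MDL, a hypothesis with L_S(h) > 0 may win if it is much "simpler".
    """
    m = len(S)
    def objective(h):
        train_err = sum(h(x) != y for x, y in S) / m
        complexity = math.sqrt(-math.log(prior[h]) / (2 * m))
        return train_err + complexity
    return min(hypotheses, key=objective)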

Page 7:

Non-Uniform Learning: Beyond Cardinality

• MDL is still essentially based on cardinality ("how many hypotheses are simpler than me") and ignores relationships between predictors.

• It generalizes the cardinality bound: using $p(h) = \frac{1}{|\mathcal{H}|}$ we get
$$m(\epsilon,\delta,h) = m(\epsilon,\delta) = \frac{\log|\mathcal{H}| + \log(2/\delta)}{\epsilon^2}$$

• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality to the "growth function"?

• E.g.: $\mathcal{H} = \{\,\mathrm{sign}(f(\phi(x))) \mid f:\mathbb{R}^d\to\mathbb{R} \text{ is a polynomial}\,\}$, $\phi:\mathcal{X}\to\mathbb{R}^d$
  • $\mathrm{VCdim}(\mathcal{H}) = \infty$
  • $\mathcal{H}$ is uncountable, and there is no distribution with $\forall h\in\mathcal{H}\;\; p(h) > 0$
  • But what if we bias toward lower-order polynomials?

• Answer 1: a prior over hypothesis classes
  • Write $\mathcal{H} = \cup_r \mathcal{H}_r$ (e.g. $\mathcal{H}_r$ = degree-$r$ polynomials)
  • Use a prior $p(\mathcal{H}_r)$ over hypothesis classes

Page 8:

Prior Over Hypothesis Classes

• VC bound: $\forall r$,
$$\mathbb{P}\left[\forall h\in\mathcal{H}_r\;\;\; L(h) \le L_S(h) + O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_r) + \log(1/\delta_r)}{m}}\right)\right] \ge 1-\delta_r$$

• Setting $\delta_r = p(\mathcal{H}_r)\cdot\delta$ and taking a union bound,
$$\forall^{\delta}_{S\sim\mathcal{D}^m}\;\; \forall \mathcal{H}_r\;\; \forall h\in\mathcal{H}_r\;\;\; L(h) \le L_S(h) + O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_r) - \log p(\mathcal{H}_r) + \log(1/\delta)}{m}}\right)$$

• Structural Risk Minimization over hypothesis classes:
$$\mathrm{SRM}_p(S) = \arg\min_{h\in\mathcal{H}_r}\; L_S(h) + C\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)}{m}}$$

• Theorem: w.p. $\ge 1-\delta$,
$$L_{\mathcal{D}}(\mathrm{SRM}_p(S)) \le \min_{\mathcal{H}_r,\, h\in\mathcal{H}_r}\; L_{\mathcal{D}}(h) + O\!\left(\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) + \log(1/\delta)}{m}}\right)$$

Page 9:

Structural Risk Minimization

• Theorem: For a prior $p(\mathcal{H}_r)$ with $\sum_{\mathcal{H}_r} p(\mathcal{H}_r) \le 1$ and any $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$,
$$L_{\mathcal{D}}(\mathrm{SRM}_p(S)) \le \min_{\mathcal{H}_r,\, h\in\mathcal{H}_r}\; L_{\mathcal{D}}(h) + O\!\left(\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) + \log(1/\delta)}{m}}\right)$$

• For singleton classes $\mathcal{H}_i = \{h_i\}$:
  • $\mathrm{VCdim}(\mathcal{H}_i) = 0$
  • Reduces to "standard" SRM with a prior over hypotheses

• For $p(\mathcal{H}_r) = 1$ (a single hypothesis class):
  • Reduces to ERM over a finite-VC class

• More general. E.g. for polynomials over $\phi(x)\in\mathbb{R}^d$ with $p(\text{degree } r) = 2^{-r}$,
$$m(\epsilon,\delta,h) = O\!\left(\frac{\mathrm{degree}(h) + (d+1)^{\mathrm{degree}(h)} + \log(1/\delta)}{\epsilon^2}\right)$$

• Allows non-uniform learning of a countable union of finite-VC classes
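A rough sketch (assumptions: a finite list of candidate hypotheses per class and made-up VC dimensions) of the SRM-over-classes rule with the prior $p(\mathcal{H}_r) = 2^{-r}$ used above:

import math

def srm_over_classes(classes, S, C=1.0):
    """SRM over nested classes H_1 in H_2 in ... with prior p(H_r) = 2^{-r}.

    `classes` is a list of (hypotheses, vcdim) pairs, ordered by r = 1, 2, ...
    Returns the h minimizing L_S(h) + C * sqrt((r*log 2 + vcdim(H_r)) / m);
    since the penalty grows with r, each h is effectively charged with the
    smallest class that contains it.
    """
    m = len(S)
    def train_err(h):
        return sum(h(x) != y for x, y in S) / m
    best, best_obj = None, float("inf")
    for r, (hyps, vc) in enumerate(classes, start=1):
        penalty = C * math.sqrt((r * math.log(2) + vc) / m)
        for h in hyps:
            obj = train_err(h) + penalty
            if obj < best_obj:
                best, best_obj = h, obj
    return best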

Page 10:

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\exists m(\epsilon,\delta)$, $\forall\mathcal{D}$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Theorem: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if and only if it is a countable union of finite-VC classes ($\mathcal{H} = \cup_{i\in\mathbb{N}} \mathcal{H}_i$ with $\mathrm{VCdim}(\mathcal{H}_i) < \infty$)

• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\forall h$, $\forall\mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

Page 11:

Consistency

• $\mathcal{X}$ countable (e.g. $\mathcal{X} = \{0,1\}^*$), $\mathcal{H} = \{\pm 1\}^{\mathcal{X}}$ (all possible functions)

• $\mathcal{H}$ is uncountable, it is not a countable union of finite-VC classes, and is thus not non-uniformly learnable

• Claim: $\mathcal{H}$ is "consistently learnable" using
$$\mathrm{ERM}_{\mathcal{H}}(S)(x) = \mathrm{MAJORITY}\{\, y_i \text{ s.t. } (x_i,y_i)\in S,\; x_i = x \,\}$$

• Proof sketch: for any $\mathcal{D}$,
  • Sort $\mathcal{X}$ by decreasing probability. The tail has diminishing probability, and thus for any $\epsilon$ there exists some finite prefix $\mathcal{X}'$ of this ordering s.t. the tail $\mathcal{X}\setminus\mathcal{X}'$ has probability mass $\le \epsilon/2$.
  • We give up on the tail. $\mathcal{X}'$ is finite, and so $\{\pm 1\}^{\mathcal{X}'}$ is also finite.

• Why only "consistently learnable"?
  • The size of $\mathcal{X}'$ required to capture $1-\epsilon/2$ of the mass depends on $\mathcal{D}$.
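A small sketch of the "majority vote per seen point" ERM used in the claim (how to label unseen points and break ties is arbitrary; the choice of +1 below is an assumption, not from the lecture):

from collections import Counter

def memorize_erm(S):
    """ERM over all functions: on each seen x, predict the majority label among
    the sample points with that x; on unseen x, the choice is arbitrary (+1 here)."""
    votes = {}
    for x, y in S:
        votes.setdefault(x, Counter())[y] += 1
    def h(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]
        return +1  # arbitrary outside the sample
    return h

h = memorize_erm([("a", 1), ("a", 1), ("a", -1), ("b", -1)])
print(h("a"), h("b"), h("c"))  # 1 -1 1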

Page 12:

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\exists m(\epsilon,\delta)$, $\forall\mathcal{D}$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$
  • (Agnostically) PAC-learnable iff $\mathrm{VCdim}(\mathcal{H}) < \infty$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$
  • Non-uniformly learnable iff $\mathcal{H}$ is a countable union of finite-VC classes

• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon,\delta>0$, $\forall h$, $\forall\mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}}$,
$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

Page 13:

SRM In Practice

$$\mathrm{SRM}_p(S) = \arg\min_{h\in\mathcal{H}_r}\; L_S(h) + C\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)}{m}}$$

• The bound is loose anyway. Better to view SRM as a bi-criteria optimization problem:
$$\arg\min\; L_S(h) \quad\text{and}\quad -\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)$$
E.g. scalarize it as
$$\arg\min\; L_S(h) + \lambda\left(-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)\right)$$

• Typically $-\log p(\mathcal{H}_r)$ and $\mathrm{VCdim}(\mathcal{H}_r)$ are monotone in the "complexity" $r$:
$$\arg\min\; L_S(h) \quad\textbf{and}\quad r(h), \qquad\text{where } r(h) = \min\{\, r \text{ s.t. } h\in\mathcal{H}_r \,\}$$

Page 14:

SRM as a Bi-Criteria Problem

$$\arg\min\; L_S(h) \quad\text{and}\quad r(h)$$

• Regularization path $= \left\{\, \arg\min_h\; L_S(h) + \lambda\cdot r(h) \;\middle|\; 0 \le \lambda \le \infty \,\right\}$

• Select $\lambda$ using a validation set; the exact bound is not needed.

[Figure: the regularization path (Pareto frontier) in the $(L_S(h), r(h))$ plane, traced from $\lambda = \infty$ to $\lambda = 0$.]
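A sketch of tracing the regularization path over a finite candidate set and selecting $\lambda$ on a held-out validation set (all function and variable names here are illustrative, not from the lecture):

def regularization_path(hypotheses, r, S_train, lambdas):
    """For each lambda, the point on the path: argmin_h L_S(h) + lambda * r(h)."""
    def train_err(h):
        return sum(h(x) != y for x, y in S_train) / len(S_train)
    return {lam: min(hypotheses, key=lambda h: train_err(h) + lam * r(h))
            for lam in lambdas}

def select_by_validation(path, S_val):
    """Pick the point on the path with the smallest validation error."""
    def val_err(h):
        return sum(h(x) != y for x, y in S_val) / len(S_val)
    return min(path.values(), key=val_err)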

Page 15:

Non-Uniform Learning: Beyond Cardinality

• MDL is still essentially based on cardinality ("how many hypotheses are simpler than me") and ignores relationships between predictors.

• Can we treat continuous classes (e.g. linear predictors)? Move beyond cardinality? Take into account that many predictors are similar?

• Answer 1: a prior $p(\mathcal{H}_r)$ over hypothesis classes

• Answer 2: PAC-Bayes theory
  • Prior distribution $P$ (not necessarily discrete) over $\mathcal{H}$

(David McAllester)

Page 16:

PAC-Bayes

• Until now (MDL, SRM) we used a discrete "prior" (a discrete "distribution" $p(h)$ over hypotheses, or a discrete "distribution" $p(\mathcal{H}_r)$ over hypothesis classes)

• Instead: encode the inductive bias as a distribution $P$ over hypotheses

• Use a randomized (averaged) predictor $h_Q$ which, for each prediction, draws $h\sim Q$ and predicts $h(x)$
  • $h_Q(x) = y$ w.p. $\mathbb{P}_{h\sim Q}[h(x)=y]$
  • $L_{\mathcal{D}}(h_Q) = \mathbb{E}_{h\sim Q}[L_{\mathcal{D}}(h)]$

• Theorem: for any distribution $P$ over hypotheses and any $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$, for all $Q$,
$$L_{\mathcal{D}}(h_Q) - L_S(h_Q) \le \sqrt{\frac{KL(Q\,\|\,P) + \log\frac{2m}{\delta}}{2(m-1)}}$$

Page 17:

KL-Divergence

$$KL(Q\,\|\,P) = \mathbb{E}_{h\sim Q}\left[\log\frac{dQ}{dP}\right]$$
$$= \sum_h q(h)\log\frac{q(h)}{p(h)} \quad\text{for discrete distributions with pmfs } q, p$$
$$= \int f_Q(h)\log\frac{f_Q(h)}{f_P(h)}\,dh \quad\text{for continuous distributions}$$

• Measures how much $Q$ deviates from $P$

• $KL(Q\,\|\,P) \ge 0$, and $KL(Q\,\|\,P) = 0$ if and only if $Q = P$

• If $Q(A) > 0$ while $P(A) = 0$, then $KL(Q\,\|\,P) = \infty$ (the other direction is allowed)

• $KL(H_1\,\|\,H_0)$ = information per sample for rejecting $H_0$ when $H_1$ is true

• $KL(Q\,\|\,\mathrm{Unif}(n)) = \log n - H(Q)$

• $I(X;Y) = KL\left(P(X,Y)\,\|\,P(X)P(Y)\right)$

(Solomon Kullback, Richard Leibler)
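A quick numeric check (illustrative only) of the identity $KL(Q\,\|\,\mathrm{Unif}(n)) = \log n - H(Q)$, with entropy in nats:

import math

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(q):
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

n = 4
Q = [0.5, 0.25, 0.125, 0.125]
uniform = [1 / n] * n
print(kl(Q, uniform), math.log(n) - entropy(Q))  # both are about 0.173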

Page 18:

PAC-Bayes

• For any distribution $P$ over hypotheses and any $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$,
$$L_{\mathcal{D}}(h_Q) - L_S(h_Q) \le \sqrt{\frac{KL(Q\,\|\,P) + \log\frac{2m}{\delta}}{2(m-1)}}$$

• Can only use hypotheses in the support of $P$ (otherwise $KL(Q\,\|\,P) = \infty$)

• For a finite $\mathcal{H}$ with $P = \mathrm{Unif}(\mathcal{H})$:
  • Consider $Q$ = point mass on $h$
  • $KL(Q\,\|\,P) = \log|\mathcal{H}|$
  • Generalizes the cardinality bound (up to $\log m$)

• More generally, for a discrete $P$ and $Q$ = point mass on $h$:
  • $KL(Q\,\|\,P) = \sum_{h'} q(h')\log\frac{q(h')}{p(h')} = \log\frac{1}{p(h)}$
  • Generalizes MDL/SRM (up to $\log m$)

• For a continuous $P$ (e.g. over linear predictors or polynomials):
  • For $Q$ = point mass (or any discrete $Q$), $KL(Q\,\|\,P) = \infty$
  • Take $h_Q$ as an average over similar hypotheses (e.g. with the same behavior on $S$)

Page 19:

PAC-Bayes

$$L_{\mathcal{D}}(h_Q) \le L_S(h_Q) + \sqrt{\frac{KL(Q\,\|\,P) + \log\frac{2m}{\delta}}{2(m-1)}}$$

• What learning rule does the PAC-Bayes bound suggest?
$$Q_\lambda = \arg\min_Q\; L_S(h_Q) + \lambda\cdot KL(Q\,\|\,P)$$

• Theorem: $q_\lambda(h) \propto p(h)\,e^{-\beta L_S(h)}$ for some "inverse temperature" $\beta$

• As $\lambda\to\infty$ we ignore the data, corresponding to infinite temperature, $\beta\to 0$

• As $\lambda\to 0$ we insist on minimizing $L_S(h_Q)$, corresponding to zero temperature, $\beta\to\infty$, and the predictor becomes ERM (or rather, a distribution over the ERM hypotheses in the support of $P$)
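A sketch of the Gibbs distribution $q(h) \propto p(h)e^{-\beta L_S(h)}$ over a finite hypothesis set, with $\beta$ chosen by hand (the hypotheses, prior and training errors below are made up):

import math

def gibbs_posterior(prior, train_errs, beta):
    """q(h) proportional to p(h) * exp(-beta * L_S(h)), normalized over a finite set."""
    weights = {h: prior[h] * math.exp(-beta * train_errs[h]) for h in prior}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

prior = {"h0": 0.5, "h1": 0.25, "h2": 0.25}
train_errs = {"h0": 0.3, "h1": 0.0, "h2": 0.1}
print(gibbs_posterior(prior, train_errs, beta=0))    # equals the prior: ignore the data
print(gibbs_posterior(prior, train_errs, beta=100))  # nearly a point mass on the ERM "h1"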

Page 20:

PAC-Bayes vs Bayes

Bayesian approach:
• Assume $h\sim\mathcal{P}$,
• $y_1,\dots,y_m$ i.i.d. conditioned on $h$, with
$$y_i \mid x_i, h = \begin{cases} h(x_i) & \text{w.p. } 1-\nu \\ -h(x_i) & \text{w.p. } \nu \end{cases}$$

Use the posterior:
$$p(h\mid S) \propto p(h)\,p(S\mid h) = p(h)\prod_i p(x_i)\,p(y_i\mid x_i,h) \propto p(h)\prod_i \left(\tfrac{\nu}{1-\nu}\right)^{[h(x_i)\neq y_i]} = p(h)\left(\tfrac{\nu}{1-\nu}\right)^{\sum_i [h(x_i)\neq y_i]} = p(h)\,e^{-\beta L_S(h)}$$
where $\beta = m\log\frac{1-\nu}{\nu}$

Page 21:

PAC-Bayes vs Bayes

PAC-Bayes:
• $P$ encodes an inductive bias, not an assumption about reality
• The SRM-type bound is minimized by the Gibbs distribution
$$q_\lambda(h) \propto p(h)\,e^{-\beta L_S(h)}$$
• Post-hoc guarantee always valid ($\forall^{\delta}_S$), with no assumption about reality:
$$L_{\mathcal{D}}(h_Q) \le L_S(h_Q) + \sqrt{\frac{KL(Q\,\|\,P) + \log\frac{2m}{\delta}}{2(m-1)}}$$
• The bound is valid for any $Q$
• If the inductive bias is very different from reality, the bound will be high

Bayesian approach:
• $\mathcal{P}$ is a prior over reality
• The posterior is given by the Gibbs distribution
$$q_\lambda(h) \propto p(h)\,e^{-\beta L_S(h)}$$
• Risk analysis assumes the prior

Page 22:

PAC-Bayes: Tighter Version

• For any distribution $P$ over hypotheses and any source distribution $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$,
$$KL\!\left(L_S(h_Q)\,\|\,L_{\mathcal{D}}(h_Q)\right) \le \frac{KL(Q\,\|\,P) + \log\frac{2m}{\delta}}{m-1}$$
where $KL(\alpha\,\|\,\beta) = \alpha\log\frac{\alpha}{\beta} + (1-\alpha)\log\frac{1-\alpha}{1-\beta}$ for $\alpha,\beta\in[0,1]$.

• This implies
$$L_{\mathcal{D}}(h_Q) \le L_S(h_Q) + \sqrt{\frac{2\,L_S(h_Q)\left(KL(Q\,\|\,P) + \log\frac{2m}{\delta}\right)}{m-1}} + \frac{2\left(KL(Q\,\|\,P) + \log\frac{2m}{\delta}\right)}{m-1}$$

• This generalizes the realizable case ($L_S(h_Q) = 0$, so only the $1/m$ term appears) and the agnostic case (where the $1/\sqrt{m}$ term is dominant)

• Numerically much tighter

• The same KL form can also be used as a tail bound instead of Hoeffding or Bernstein, also with cardinality- or VC-based guarantees; it arises naturally in PAC-Bayes.
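A sketch of how the tighter statement is typically used: given $L_S(h_Q)$ and the right-hand side $\epsilon$, invert the binary KL by bisection to obtain an upper bound on $L_{\mathcal{D}}(h_Q)$ (the numbers in the example are arbitrary):

import math

def binary_kl(a, b):
    """kl(a||b) for a, b in [0,1), with the convention 0*log(0/x) = 0."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def kl_inverse_upper(emp_loss, eps, tol=1e-10):
    """Largest q >= emp_loss with kl(emp_loss || q) <= eps, found by bisection."""
    lo, hi = emp_loss, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_loss, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. KL(Q||P) = 10, m = 10000, delta = 0.05, empirical loss L_S(h_Q) = 0.02:
m, delta = 10_000, 0.05
eps = (10 + math.log(2 * m / delta)) / (m - 1)
print(kl_inverse_upper(0.02, eps))  # an upper bound on L_D(h_Q), about 0.031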