three concepts: information

80
Outline Occam’s Razor MDL Principle Three Concepts: Information Lecture 5: MDL Principle Teemu Roos Complex Systems Computation Group Department of Computer Science, University of Helsinki Fall 2007 Teemu Roos Three Concepts: Information

Upload: others

Post on 19-May-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

Three Concepts: InformationLecture 5: MDL Principle

Teemu Roos

Complex Systems Computation GroupDepartment of Computer Science, University of Helsinki

Fall 2007

Teemu Roos Three Concepts: Information

Page 2: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

Lecture 5: MDL Principle

Jorma Rissanen (left) receiving the IEEE Information TheorySociety Best Paper Award from Claude Shannon in 1986.

IEEE Golden Jubilee Award forTechnological Innovation (for theinvention of arithmetic coding)1998; IEEE Richard W. HammingMedal (for fundamentalcontribution to information theory,statistical inference, control theory,and the theory of complexity)1993; Kolmogorov Medal 2006;IBM Corporate Award (for theMDL/PMDL principles andstochastic complexity) 1991; IBMOutstanding Innovation Award(for work in statistical inference,information theory, and the theoryof complexity) 1988; ...

Teemu Roos Three Concepts: Information

Page 3: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

1 Occam’s RazorHouseVisual RecognitionAstronomyRazor

2 MDL PrincipleIdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Teemu Roos Three Concepts: Information

Page 4: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

1 Occam’s RazorHouseVisual RecognitionAstronomyRazor

2 MDL PrincipleIdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Teemu Roos Three Concepts: Information

Page 5: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Teemu Roos Three Concepts: Information

Page 6: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 7: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 8: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 9: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 10: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 11: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 12: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 13: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 14: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 pneumonia,

common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 meningitis.

common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 15: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 common cold,

2 appendicitis,

gout medicine,

3 food poisoning,

gout medicine,

4 hemorrhage,

gout medicine,

5 common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 16: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

House

Brandon has

1 cough,

2 severe abdominal pain,

3 nausea,

4 low blood pressure,

5 fever.

1 common cold,

2 gout medicine,

3 gout medicine,

4 gout medicine,

5 common cold.

No single disease causes all of these.

Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,

2 pharmacy error: cough medicine replaced by gout medicine.

Teemu Roos Three Concepts: Information

Page 17: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 18: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 19: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 20: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 21: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 22: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 23: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 24: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 25: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Astronomy

Teemu Roos Three Concepts: Information

Page 26: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Astronomy

Teemu Roos Three Concepts: Information

Page 27: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Astronomy

Teemu Roos Three Concepts: Information

Page 28: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

William of Ockham (c. 1288–1348)

Teemu Roos Three Concepts: Information

Page 29: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Occam’s Razor

Occam’s Razor

Entities should not be multiplied beyond necessity.

Isaac Newton: “We are to admit no more causes of natural thingsthan such as are both true and sufficient to explain theirappearances.”

Diagnostic parsimony: Find the fewest possible causes thatexplain the symptoms.

(Hickam’s dictum: “Patients can have as many diseases as they damnwell please.”)

Teemu Roos Three Concepts: Information

Page 30: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Occam’s Razor

Occam’s Razor

Entities should not be multiplied beyond necessity.

Isaac Newton: “We are to admit no more causes of natural thingsthan such as are both true and sufficient to explain theirappearances.”

Diagnostic parsimony: Find the fewest possible causes thatexplain the symptoms.

(Hickam’s dictum: “Patients can have as many diseases as they damnwell please.”)

Teemu Roos Three Concepts: Information

Page 31: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Occam’s Razor

Occam’s Razor

Entities should not be multiplied beyond necessity.

Isaac Newton: “We are to admit no more causes of natural thingsthan such as are both true and sufficient to explain theirappearances.”

Diagnostic parsimony: Find the fewest possible causes thatexplain the symptoms.

(Hickam’s dictum: “Patients can have as many diseases as they damnwell please.”)

Teemu Roos Three Concepts: Information

Page 32: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Occam’s Razor

Occam’s Razor

Entities should not be multiplied beyond necessity.

Isaac Newton: “We are to admit no more causes of natural thingsthan such as are both true and sufficient to explain theirappearances.”

Diagnostic parsimony: Find the fewest possible causes thatexplain the symptoms.

(Hickam’s dictum: “Patients can have as many diseases as they damnwell please.”)

Teemu Roos Three Concepts: Information

Page 33: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 34: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 35: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 36: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

HouseVisual RecognitionAstronomyRazor

Visual Recognition

Teemu Roos Three Concepts: Information

Page 37: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

1 Occam’s RazorHouseVisual RecognitionAstronomyRazor

2 MDL PrincipleIdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Teemu Roos Three Concepts: Information

Page 38: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

MDL Principle

Minimum Description Length (MDL) Principle (2-part)

Choose the hypothesis which minimizes the sum of

1 the codelength of the hypothesis, and

2 the codelength of the data with the help of the hypothesis.

How to encode data with the help of a hypothesis?

Teemu Roos Three Concepts: Information

Page 39: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

MDL Principle

Minimum Description Length (MDL) Principle (2-part)

Choose the hypothesis which minimizes the sum of

1 the codelength of the hypothesis, and

2 the codelength of the data with the help of the hypothesis.

How to encode data with the help of a hypothesis?

Teemu Roos Three Concepts: Information

Page 40: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

Teemu Roos Three Concepts: Information

Page 41: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

Teemu Roos Three Concepts: Information

Page 42: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

Teemu Roos Three Concepts: Information

Page 43: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

k = 1 :

(n

1

)= 625� 2625.

Codelength log2(n + 1) + log2

(n

k

)≈ 19 vs. 625

Teemu Roos Three Concepts: Information

Page 44: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

k = 2 :

(n

2

)= 195 000� 2625.

Codelength log2(n + 1) + log2

(n

k

)≈ 27 vs. 625

Teemu Roos Three Concepts: Information

Page 45: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

k = 3 :

(n

3

)= 40 495 000� 2625.

Codelength log2(n + 1) + log2

(n

k

)≈ 35 vs. 625

Teemu Roos Three Concepts: Information

Page 46: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

k = 10 :

(n

10

)= 2331 354 000 000 000 000 000� 2625.

Codelength log2(n + 1) + log2

(n

k

)≈ 80 vs. 625

Teemu Roos Three Concepts: Information

Page 47: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

k = 100 :

(n

100

)≈ 9.5× 10117 � 2625.

Codelength log2(n + 1) + log2

(n

k

)≈ 401 vs. 625

Teemu Roos Three Concepts: Information

Page 48: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Black box of size 25× 25 = 625, whitedots at (x1, y1), (x2, y2), (x3, y3).

For image of size n = 625, there are 2n

different images, and(n

k

)=

n!

k!(n − k)!

different groups of k exceptions.

k = 300 :

(n

300

)≈ 2.7× 10186 < 2625.

Codelength log2(n + 1) + log2

(n

k

)≈ 629 vs. 625

Teemu Roos Three Concepts: Information

Page 49: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Probabilistic Models

Idea 2: Hypothesis = probability distribution.

Noisy PSNR=19.8 MDL (A-B) PSNR=32.9

How to encode a distribution?

Teemu Roos Three Concepts: Information

Page 50: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Encoding Data: Probabilistic Models

Idea 2: Hypothesis = probability distribution.

Noisy PSNR=19.8 MDL (A-B) PSNR=32.9

How to encode a distribution?

Teemu Roos Three Concepts: Information

Page 51: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Two-Part Codes

Let M = {pθ : θ ∈ Θ} be a parametric probabilistic model class,i.e., a set of distributions pθ indexed by parameter θ.

If the parameter space Θ is discrete, we can construct a (prefix)code C1 : Θ→ {0, 1}∗ which maps each parameter value to acodeword.

For each distribution pθ there is a prefix code Cθ : D → {0, 1}∗where D ∈ D is a data-set to be encoded, such that the codewordlengths satisty

`θ(D) ≈ log21

pθ(D).

Using parameter value θ, the total codelength becomes (≈)

`1(θ) + log21

pθ(D).

Teemu Roos Three Concepts: Information

Page 52: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Two-Part Codes

Let M = {pθ : θ ∈ Θ} be a parametric probabilistic model class,i.e., a set of distributions pθ indexed by parameter θ.

If the parameter space Θ is discrete, we can construct a (prefix)code C1 : Θ→ {0, 1}∗ which maps each parameter value to acodeword.

For each distribution pθ there is a prefix code Cθ : D → {0, 1}∗where D ∈ D is a data-set to be encoded, such that the codewordlengths satisty

`θ(D) ≈ log21

pθ(D).

Using parameter value θ, the total codelength becomes (≈)

`1(θ) + log21

pθ(D).

Teemu Roos Three Concepts: Information

Page 53: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Two-Part Codes

Let M = {pθ : θ ∈ Θ} be a parametric probabilistic model class,i.e., a set of distributions pθ indexed by parameter θ.

If the parameter space Θ is discrete, we can construct a (prefix)code C1 : Θ→ {0, 1}∗ which maps each parameter value to acodeword.

For each distribution pθ there is a prefix code Cθ : D → {0, 1}∗where D ∈ D is a data-set to be encoded, such that the codewordlengths satisty

`θ(D) ≈ log21

pθ(D).

Using parameter value θ, the total codelength becomes (≈)

`1(θ) + log21

pθ(D).

Teemu Roos Three Concepts: Information

Page 54: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Two-Part Codes

Let M = {pθ : θ ∈ Θ} be a parametric probabilistic model class,i.e., a set of distributions pθ indexed by parameter θ.

If the parameter space Θ is discrete, we can construct a (prefix)code C1 : Θ→ {0, 1}∗ which maps each parameter value to acodeword.

For each distribution pθ there is a prefix code Cθ : D → {0, 1}∗where D ∈ D is a data-set to be encoded, such that the codewordlengths satisty

`θ(D) ≈ log21

pθ(D).

Using parameter value θ, the total codelength becomes (≈)

`1(θ) + log21

pθ(D).

Teemu Roos Three Concepts: Information

Page 55: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Two-Part Codes

The parameter value minimizing the codelength is given by themaximum likelihood parameter θ̂:

minθ∈Θ

log21

pθ(D)= log2

1

maxθ∈Θ

pθ(D)= log2

1

pθ̂(D).

It could of course be that `1(θ̂) is so large that some otherparameter value gives a shorter total codelength.

Teemu Roos Three Concepts: Information

Page 56: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Two-Part Codes

The parameter value minimizing the codelength is given by themaximum likelihood parameter θ̂:

minθ∈Θ

log21

pθ(D)= log2

1

maxθ∈Θ

pθ(D)= log2

1

pθ̂(D).

It could of course be that `1(θ̂) is so large that some otherparameter value gives a shorter total codelength.

Teemu Roos Three Concepts: Information

Page 57: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Multi-Part Codes

If there are more than one model classes, M1,M2, . . . it ispossible to construct multi-part codes where the parts are

1 Encoding of the model class index: C0(i), i ∈ N.

2 Encoding of the parameter (vector): Ci (θ), θ ∈ Θi .

3 Encoding of the data: Cθ(D), D ∈ D.

For instance, the models could be polynomials with differentdegrees, the parameters are the coefficients

θ0 + θ1x + θ2x2 + θ3x

3 + . . . + θkxk .

The more complex the model class (the higher the degree), thebetter it fits the data but the longer the second part Ci (θ)becomes.

Teemu Roos Three Concepts: Information

Page 58: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Multi-Part Codes

If there are more than one model classes, M1,M2, . . . it ispossible to construct multi-part codes where the parts are

1 Encoding of the model class index: C0(i), i ∈ N.

2 Encoding of the parameter (vector): Ci (θ), θ ∈ Θi .

3 Encoding of the data: Cθ(D), D ∈ D.

For instance, the models could be polynomials with differentdegrees, the parameters are the coefficients

θ0 + θ1x + θ2x2 + θ3x

3 + . . . + θkxk .

The more complex the model class (the higher the degree), thebetter it fits the data but the longer the second part Ci (θ)becomes.

Teemu Roos Three Concepts: Information

Page 59: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Multi-Part Codes

If there are more than one model classes, M1,M2, . . . it ispossible to construct multi-part codes where the parts are

1 Encoding of the model class index: C0(i), i ∈ N.

2 Encoding of the parameter (vector): Ci (θ), θ ∈ Θi .

3 Encoding of the data: Cθ(D), D ∈ D.

For instance, the models could be polynomials with differentdegrees, the parameters are the coefficients

θ0 + θ1x + θ2x2 + θ3x

3 + . . . + θkxk .

The more complex the model class (the higher the degree), thebetter it fits the data but the longer the second part Ci (θ)becomes.

Teemu Roos Three Concepts: Information

Page 60: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Multi-Part Codes

If there are more than one model classes, M1,M2, . . . it ispossible to construct multi-part codes where the parts are

1 Encoding of the model class index: C0(i), i ∈ N.

2 Encoding of the parameter (vector): Ci (θ), θ ∈ Θi .

3 Encoding of the data: Cθ(D), D ∈ D.

For instance, the models could be polynomials with differentdegrees, the parameters are the coefficients

θ0 + θ1x + θ2x2 + θ3x

3 + . . . + θkxk .

The more complex the model class (the higher the degree), thebetter it fits the data but the longer the second part Ci (θ)becomes.

Teemu Roos Three Concepts: Information

Page 61: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Multi-Part Codes

If there are more than one model classes, M1,M2, . . . it ispossible to construct multi-part codes where the parts are

1 Encoding of the model class index: C0(i), i ∈ N.

2 Encoding of the parameter (vector): Ci (θ), θ ∈ Θi .

3 Encoding of the data: Cθ(D), D ∈ D.

For instance, the models could be polynomials with differentdegrees, the parameters are the coefficients

θ0 + θ1x + θ2x2 + θ3x

3 + . . . + θkxk .

The more complex the model class (the higher the degree), thebetter it fits the data but the longer the second part Ci (θ)becomes.

Teemu Roos Three Concepts: Information

Page 62: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Polynomials

Teemu Roos Three Concepts: Information

Page 63: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 64: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 65: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 66: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 67: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 68: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 69: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 70: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 71: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Continuous Parameters

What if the parameters are continuous (like polynomialcoefficients)? How to encode continuous values?

Solution: Quantization. Choose a discrete subset of points,θ(1), θ(2), . . ., and use only them.

Θ

Information Geometry!

If the points are sufficiently dense (in a codelength sense) then thecodelength for data is still almost as short as minθ∈Θ `θ(D).

Teemu Roos Three Concepts: Information

Page 72: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

About Quantization

How many points should there be in the subset θ(1), θ(2), . . .?

Intuition: Estimation accuracy of order1√n.

Theorem

Optimal quantization accuracy is of order1√n.

⇒ number of points ≈√

nk

= nk/2, where k = dim(Θ).

Teemu Roos Three Concepts: Information

Page 73: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

About Quantization

How many points should there be in the subset θ(1), θ(2), . . .?

Intuition: Estimation accuracy of order1√n.

Theorem

Optimal quantization accuracy is of order1√n.

⇒ number of points ≈√

nk

= nk/2, where k = dim(Θ).

Teemu Roos Three Concepts: Information

Page 74: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

About Quantization

How many points should there be in the subset θ(1), θ(2), . . .?

Intuition: Estimation accuracy of order1√n.

Theorem

Optimal quantization accuracy is of order1√n.

⇒ number of points ≈√

nk

= nk/2, where k = dim(Θ).

Teemu Roos Three Concepts: Information

Page 75: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

About Quantization

How many points should there be in the subset θ(1), θ(2), . . .?

Intuition: Estimation accuracy of order1√n.

Theorem

Optimal quantization accuracy is of order1√n.

⇒ number of points ≈√

nk

= nk/2, where k = dim(Θ).

Teemu Roos Three Concepts: Information

Page 76: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

About Quantization

How many points should there be in the subset θ(1), θ(2), . . .?

Intuition: Estimation accuracy of order1√n.

Theorem

Optimal quantization accuracy is of order1√n.

⇒ number of points ≈√

nk

= nk/2, where k = dim(Θ).

The codelength for the quantized parameters becomes

`(θq) ≈ log2 nk/2 =k

2log2 n .

Teemu Roos Three Concepts: Information

Page 77: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Old-Style MDL

With the precision 1√n

the codelength for data is almost optimal:

minθq∈{θ(1),θ(2),...}

`θq(D) ≈ minθ∈Θ

`θ(D) = log21

pθ̂(D).

This gives the total codelength formula:

“Steam MDL”

`θq(D) + `(θq) ≈ log21

pθ̂(D)+

k

2log2 n .

Teemu Roos Three Concepts: Information

Page 78: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Old-Style MDL

With the precision 1√n

the codelength for data is almost optimal:

minθq∈{θ(1),θ(2),...}

`θq(D) ≈ minθ∈Θ

`θ(D) = log21

pθ̂(D).

This gives the total codelength formula:

“Steam MDL”

`θq(D) + `(θq) ≈ log21

pθ̂(D)+

k

2log2 n .

Teemu Roos Three Concepts: Information

Page 79: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Old-Style MDL

The k2 log2 n formula is only a rough approximation, and works well

only for very large samples.

Next week:

More advanced codes: mixtures, normalized maximumlikelihood, etc.

Foundations of MDL.

Teemu Roos Three Concepts: Information

Page 80: Three Concepts: Information

OutlineOccam’s RazorMDL Principle

IdeaRules & ExceptionsProbabilistic ModelsOld-Style MDL

Old-Style MDL

The k2 log2 n formula is only a rough approximation, and works well

only for very large samples.

Next week:

More advanced codes: mixtures, normalized maximumlikelihood, etc.

Foundations of MDL.Teemu Roos Three Concepts: Information