April 3, 2014 slides (Mayo)
Post on 05-Dec-2014
1
April 3, 2014 Phil 6334
The Howson (1997) paper was invited as a discussion of the general
topic of a paper of mine (and one other paper): “Duhem’s Problem, the
Bayesian Way, and Error Statistics, or ‘What’s Belief Got to Do With It?’”
The physicist can never submit an isolated hypothesis to the control
of experiment, but only a whole group of hypotheses. When
experiment is in disagreement with his predictions, it teaches him
that one at least of the hypotheses that constitute this group is wrong
and must be modified. But experiment does not show him the one
that must be changed (Duhem 1954, p.185).
2
“...is to point out that the Bayesian personalist approach to
scientific inference provides a ...solution to this [Duhem] puzzle by
telling us exactly when [disregarding unsuccessful predictions] can
be reconstructed as rational and when it has to be deemed
irrational. Rationality here, for the Bayesian, simply means
conformity with [Bayes'] theorem” (Dorling 1979, p. 177).
Bayes' Theorem (one form):
P(H | e) = P(e | H) P(H) / [P(e | H) P(H) + P(e | not-H) P(not-H)]
The Bayesian “catchall factor”: P(e | not-H).
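The form of the theorem above can be sketched as a one-line function (the function name and the illustrative numbers are mine, not from the slides):

```python
# Posterior via the form of Bayes' theorem displayed above.
def posterior(likelihood, prior, catchall):
    """P(H|e) from P(e|H), P(H), and the catchall factor P(e|not-H)."""
    return (likelihood * prior) / (likelihood * prior + catchall * (1 - prior))

# Illustration: prior .5, with e much more probable under H than under not-H.
p = posterior(likelihood=0.9, prior=0.5, catchall=0.2)
print(p)  # 0.45 / (0.45 + 0.10) = 9/11 ≈ 0.818
```

Note that the catchall factor P(e | not-H) carries all the weight of "every way H could be false," which is exactly what is at issue in the Duhem problem.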
3
1. Dorling’s Homework Problem:
(I). The components:
Hypothesis H: Newton’s theory of motion and gravitation
e: the predicted secular acceleration of the moon
e’: the observed acceleration of the moon--the anomalous result
Auxiliary hypothesis A: the effects of tidal friction are not of a
sufficient order of magnitude to affect the acceleration of the
moon.
H and A entail e, but e’ is observed.
4
(II) An informal (and neutral) description of a situation where
anomaly e’ indicates (or is best explained by) auxiliary A being in
error:
(1) there is a great deal of evidence in favor of a theory or
hypothesis H, whereas
(2) there is little evidence for the truth of auxiliary A, say hardly
more evidence for its truth than for its falsity, and
(3) unless A is false, there is no other plausible way to explain e’.
A Bayesian rendering may be obtained by inserting
"agent X believes that"
prior to assertions (1), (2), and (3).
5
(III) The Numerical Solution to the Homework Problem:
ASSUME:
(i) P(H) = 0.9 and P(A) = 0.6
(ii) “The agent contemplates auxiliary A being true”.
The probability of e’, given A and not-H, is very small. Let this
very small value be ε.
(Dorling takes ε to be 0.001.)
(iii)“The agent contemplates auxiliary A being false”
(a) The probability of e’, given H and not-A, is 50ε.
(b) The probability of e’, given not-H and not-A, is 50ε.
(iv) H and A are probabilistically independent:
P(H and A) = P(H)P(A)
6
1.1. THE RESULTS
The Bayesian catchall factor: P(e’|not-H) = 20.6ε
(= .0206, with ε = .001)
The posterior probabilities: P(H|e’) = 0.897
P(A|e’) = 0.003
H hasn’t gone down much; the blame for the anomaly is placed on
auxiliary A.
Of course, the opposite assignment could have been given thus
putting the blame on the theory.
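The arithmetic behind these results can be checked directly from the stipulated values (ε = .001, the priors, and the conditional probabilities given above):

```python
# Checking Dorling's homework-problem arithmetic (values as stipulated above).
eps = 0.001                      # Dorling's small probability epsilon
pH, pA = 0.9, 0.6                # priors: P(H) = .9, P(A) = .6 (independent)

# Conditional probabilities of the anomalous result e':
p_e_H_A       = 0.0              # H and A entail e, so e' is excluded
p_e_notH_A    = eps              # "very small"
p_e_H_notA    = 50 * eps
p_e_notH_notA = 50 * eps

# Catchall factor P(e'|not-H), averaging over A and not-A:
catchall = p_e_notH_A * pA + p_e_notH_notA * (1 - pA)   # 20.6 * eps = .0206

p_e_H = p_e_H_A * pA + p_e_H_notA * (1 - pA)            # P(e'|H) = 20 * eps
p_e   = p_e_H * pH + catchall * (1 - pH)                # total probability of e'

post_H = p_e_H * pH / p_e                               # P(H|e') ~ 0.897
p_e_A  = p_e_H_A * pH + p_e_notH_A * (1 - pH)           # P(e'|A)
post_A = p_e_A * pA / p_e                               # P(A|e') ~ 0.003
print(round(post_H, 3), round(post_A, 3))
```

The posteriors come out to .897 and .003, matching the slide: H barely budges while A collapses.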
7
Two key features of the error statistical approach (of central relevance
to Duhem's problem):
1. A Piecemeal Approach.
Two contrasts with the Bayesian Way may be noted:
I. Gets beyond a single probability pie
II. Gets beyond a white-glove analysis
8
2. The Fundamental Use of Error Probabilities of Tests
▪ The question of whether data provide good evidence for a
hypothesis is regarded as an objective (though empirical)
one, not a subjective one.
▪ Data count as good evidence for H just to the extent that H
passes a severe test.
9
Statistical Significance Tests
The null (or “test”) hypothesis: H0, there is no increased risk of
R (in a given population)
[H0 says it is an error to suppose a genuine increased risk is
responsible for any observed difference in risk rates.]
The alternative hypothesis: J, there is an increased risk of R (in a
given population)
Here e is anomalous for J.
10
The significance question: how often would a (positive)
difference in R rates as high as (or even higher than) the one
observed (e) occur, if in fact H0 were true? The answer is called
the statistical significance level of the data.
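The significance question can be sketched numerically. A minimal version, under my own simplifying assumptions (invented counts, and a pooled two-proportion z-test with a normal approximation):

```python
# Sketch of the significance question for a difference in risk rates R.
from math import sqrt, erfc

def p_value(events_t, n_t, events_c, n_c):
    """One-sided p-value: how often a difference in rates as high as (or
    higher than) the observed one would occur if H0 (no increased risk)
    were true, via the pooled two-proportion z-test."""
    p_t, p_c = events_t / n_t, events_c / n_c
    pooled = (events_t + events_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    return 0.5 * erfc(z / sqrt(2))   # P(Z >= z) under H0

# Invented data: 30/100 with risk factor R in the exposed group vs 20/100:
print(round(p_value(30, 100, 20, 100), 3))  # ~ 0.051
```

The returned value is the statistical significance level of the data in the sense defined above.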
11
Error Probabilities Register Illegitimate Ways to “Save”
Hypotheses From Anomalies
H0: there is no increased risk of R (in a given population)
J: there is an increased risk of R (in a given population)
e: a 0 (or a statistically insignificant) difference
Way #1: J + “compensation hypothesis” (there really is a risk
but something compensated for it in this data)
Way #2: J’: there is an increased risk of R’ (in a given
population)
(There’s some other risk, found through searching the data)
12
The hypotheses erected to accord with the evidence fail to
pass severe tests.
The probability of erroneously finding some alleged
compensating factor or other, Way #1, and the probability of
erroneously finding one or another excess in risk, Way #2, are
no longer the low .01 level as at the start, but are instead higher,
in extreme cases, maximal (i.e., 1)
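How quickly the error probability climbs when one searches for "some compensating factor or other" can be sketched under a simplifying assumption of k independent candidate factors, each tested at the .01 level (the independence assumption is mine):

```python
# Probability of erroneously finding SOME nominally significant factor
# when k independent candidates are each tested at level alpha.
def prob_some_false_alarm(alpha, k):
    return 1 - (1 - alpha) ** k

for k in (1, 10, 50, 200):
    print(k, round(prob_some_false_alarm(0.01, k), 3))
# k=1 gives the nominal .01, but k=50 gives ~ .395 and k=200 gives ~ .866
```

With enough searching the probability approaches 1, which is the sense in which the post-hoc "saves" fail to pass severe tests.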
13
skip
Two Types of Strategies in the Error Statistical Approach to
Duhem's Problem:
(1) “blocker” strategies:
We criticize attempts to explain away anomalies (e.g., as due to
H-saving factors) on the grounds that
they fail to pass severe tests
their denials pass severe tests.
(2) Show anomaly e’ may be blamed on an auxiliary hypothesis A
by showing A’ (the denial of A), passes a severe test.
14
In Dorling's illustration, a result e’ that is anomalous for H is
taken to provide positive grounds for discrediting A and
confirming its denial A'. (The degree of belief in A' went from
0.4 to 0.99—by dint of anomaly e’.)
But the error statistician wants to know if the test is severe!
▪ This requires positive evidence that the alleged extraneous factor is
responsible for the anomaly.
▪ Strong belief in H together with a low enough degree of belief in the
Bayesian catchall factor does not suffice to show A' has passed a
severe test.
▪ The move from satisfying the Bayesian conditions to declaring strong
evidence for A' is a very unreliable one (it makes it too easy to blame
auxiliary hypothesis A even if A is true).
▪ Such an appeal to A’ would thereby be blocked.
15
Mini overview:
Severity Requirement: Data x provides good evidence for
inferring H only if it results from a procedure which, taken as a
whole, constitutes H having passed a severe test — that is, a
procedure which would have (at least with very high probability)
uncovered the falsity of, or errors in, H, and yet H emerged
unscathed.
Inductive learning, in this view, proceeds by testing hypotheses and
inferring those which pass probative or severe tests — tests which
very probably would have unearthed some error in the hypothesis H,
were such an error present.
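For a one-sided Normal test the severity assessment has a simple form. The following setup (H0: mu <= 0 vs H1: mu > 0, with standard error 1) is my own illustration, not an example from the slides:

```python
# Sketch of a severity assessment for a one-sided Normal test.
from math import sqrt, erf

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity(x_bar, mu_1, sem=1.0):
    """Severity for inferring mu > mu_1 having observed mean x_bar:
    the probability the test would have produced a result less
    discordant with H0 than x_bar, were mu no greater than mu_1."""
    return Phi((x_bar - mu_1) / sem)

# Observed mean 1.96 standard errors above 0:
print(round(severity(1.96, 0.0), 3))   # mu > 0 passes with severity ~ .975
print(round(severity(1.96, 1.96), 3))  # but mu > 1.96 passes with only ~ .5
```

The same data thus warrant some inferences severely and others hardly at all, which is the piecemeal character of the approach.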
16
A methodology for induction, accordingly, is a methodology for
arriving at severe tests, and for scrutinizing inferences by considering
the severity with which they have passed tests.
Methodological rules and strategies are claims about how to avoid
mistakes and learn from different types of errors; their appraisal
turns on understanding how methods enable avoidance of specific
errors.
Hence an inductive methodology of severe testing will focus on
understanding the properties of tools for generating, modeling and
analyzing data so as to learn about some aspect of the data-
generating mechanism.
These properties, while empirical, are objective.
17
Highly Probable vs. Highly Probed Hypotheses
The Criticism: H may pass with high severity (with data x) even
though the (Bayesian) posterior probability for H (given x) is low.
All such “funny Bayesian examples” need to assume prior
probability assignments to an exhaustive set of hypotheses, while for
a frequentist error statistician, a hypothesis could only be given a
probability assignment if its truth is the outcome of a random trial
(but “events” do not also serve as statistical hypotheses).
Subjective degree of belief assignments will not ensure the error
probability, and thus the severity, assessments we need.
Examples with frequentist priors, however, commit the fallacy of
probabilistic instantiation.
18
The Fallacy of Probabilistic Instantiation
Hypothesis H is true of p% of the populations (bags) in this urn of
populations U,
1. P(H is true of a randomly selected bag from an urn of bags U) = p
2. The bag randomly selected and used in test T1 is b1.
Therefore:
(*) P(H is true of b1) = p.
For the frequentist: either H is true of b1 or not — the probability in
(*) is fallacious and results from an unsound instantiation.
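The point can be made concrete with a small simulation (the urn composition and numbers are my invention): p is a relative frequency belonging to the selection procedure, while H is simply true or false of any particular bag.

```python
# The fallacy in (*): p describes the selection procedure, not bag b1.
import random

random.seed(0)
urn = [i < 300 for i in range(1000)]      # fixed urn: H is true of 300 of 1000 bags

draws = [random.choice(urn) for _ in range(100_000)]
freq = sum(draws) / len(draws)            # ~ .3: a fact about the PROCEDURE

b1 = urn[17]                              # a particular bag: H is just true or false of it
print(round(freq, 2), b1)
```

Instantiating the procedure's frequency as a probability "of b1" is the unsound step (*).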
19
Students from the Wrong Side of Town
Isaac has passed comprehensive tests of mastery of high school
subjects regarded as indicating college readiness…
Since such high scores s could rarely result among high school
students who are not sufficiently prepared to be deemed ‘college
ready’, we regard s as good evidence for
H(I): Isaac is college ready.
And let the denial be H’:
H’(I): Isaac is not college ready (i.e., he is deficient).
The probability for such good results, given a student is college ready,
is extremely high:
P(s | H(I)) is practically 1,
20
while very low assuming he is not college ready:
P(s | H’(I)) = .05.
But imagine Isaac was randomly selected from the population of
students in, let us say, Fewready Town—where college readiness is
extremely rare, say one out of one thousand. The critic infers that the
prior probability of Isaac’s college-readiness is therefore .001:
(*) P(H(I)) = .001.
If so, then the posterior probability that Isaac is college ready, given
his high test results, would be very low:
P(H(I)|s) is very low,
even though the posterior probability has increased from the prior in
(*).
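The critic's computation runs as follows, using the numbers above (I take P(s | H(I)) = 1 for "practically 1"):

```python
# The critic's posterior for Isaac's readiness, given the high scores s.
prior = 0.001                    # (*) P(H(I)) from the Fewready instantiation
lik_ready, lik_not = 1.0, 0.05   # P(s|H(I)), P(s|H'(I))

post = lik_ready * prior / (lik_ready * prior + lik_not * (1 - prior))
print(round(post, 4))            # ~ 0.0196: "very low", though raised from .001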
21
This is supposedly problematic for testers because we’d say this was
evidence for H(I) (readiness).
Actually I would want degrees of readiness to make my inference, but
these are artificially excluded here.
But, even granting his numbers, the main fallacy here is fallacious
probabilistic instantiation.
Although the probability that a student randomly selected from
among Fewready Town high schoolers is college ready is .001, it does
not follow that Isaac, the one we happened to select, has a
probability of .001 of being college ready.
22
Achinstein says he will grant the fallacy…but only for frequentists:
“My response to the probabilistic fallacy charge is to say that
it would be true if the probabilities in question were construed
as relative frequencies. However, …I am concerned with
epistemic probability.”
He is prepared to grant the following instantiations:
▪ p% of the hypotheses in a given pool of hypotheses are true (or a
character holds for p%).
▪ The particular hypothesis Hi was randomly selected from this pool.
▪ Therefore, the objective epistemic probability P(Hi is true) = p.
23
Of course, epistemic probabilists are free to endorse this road to
posteriors—this just being a matter of analytic definition.
But the consequences speak loudly against the desirability of doing
so.
No Severity. The example considers only two outcomes: reaching the
high scores s, or reaching lower scores, ~s.
Clearly a lower grade gives even less evidence of readiness;
that is, P(H’(I)| ~s) > P(H’(I)|s). Therefore, whether Isaac scored as
high as s or lower, ~s, the epistemic probabilist is justified in having
high belief that Isaac is not ready.
The probability of finding evidence of Isaac’s readiness even if in fact
he is ready (H(I) is true) is low if not zero.
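The "No Severity" point can be checked by computing the posterior for readiness under both possible outcomes (here I use P(s | H(I)) = .99 as a stand-in for "practically 1"):

```python
# No Severity: with the .001 prior, NEITHER outcome (s or ~s) can yield
# a high posterior for readiness, so no result could count as evidence
# of readiness even if Isaac is in fact ready.
prior = 0.001
lik_s_ready, lik_s_not = 0.99, 0.05    # P(s|H(I)), P(s|H'(I))

def post_ready(lik_r, lik_n):
    return lik_r * prior / (lik_r * prior + lik_n * (1 - prior))

post_given_s    = post_ready(lik_s_ready, lik_s_not)           # high score s
post_given_nots = post_ready(1 - lik_s_ready, 1 - lik_s_not)   # low score ~s
print(round(post_given_s, 3), round(post_given_nots, 6))
# both tiny: the "test" cannot discriminate readiness from unreadiness
```

Whatever Isaac does, the epistemic probabilist remains justified in high belief that he is not ready, so the procedure has no capacity to uncover that verdict's error.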
24
Bayesian B-boosters might interpret things differently, noting that
since the posterior for readiness has increased, the test scores provide
at least some evidence for H(I)—but then the invocation of the
example to demonstrate a conflict between a frequentist and Bayesian
assessment would seem to diminish or evaporate.
Reverse Discrimination? To push the problem further, suppose that
the epistemic probabilist receives a report that Isaac was in fact
selected randomly, not from Fewready Town, but from a population
where college readiness is common, Fewdeficient Town.
The same scores s now warrant the assignment of a strong objective
epistemic belief in Isaac’s readiness (i.e., H(I)).
A high-school student from Fewready Town would need to have
scored quite a bit higher on these same tests than a student selected
from Fewdeficient Town for his scores to be considered evidence of
his readiness.
25
When we move from hypotheses like “Isaac is college ready” to
scientific generalizations, the difficulties become even more serious.
We need not preclude that H(I) has a legitimate frequentist prior; the
frequentist probability that Isaac is college ready might refer to
generic and environmental factors that determine the chance of his
deficiency—although I do not have a clue how one might compute it.
The main thing is that this probability is not given by the probabilistic
instantiation above.
These examples, repeatedly used in criticisms, invariably shift the
meaning from one kind of experimental outcome—a randomly
selected student has the property “college ready”—to another—a
genetic and environmental “experiment” concerning Isaac in which
the outcomes are ready or not ready.
This also points out the flaw in trying to glean reasons for epistemic
belief with just any conception of “low frequency of error.”
26
If we declared each student from Fewready to be “unready,” we
would rarely be wrong, but in each case the “test” has failed to
discriminate the particular student’s readiness from his unreadiness.
Moreover, were we really interested in the probability of the event
that a student randomly selected from a town is college ready, and had
the requisite probability model (e.g., Bernoulli), then there would be
nothing to stop the frequentist error statistician from inferring the
conditional probability.
However, there seems to be nothing “Bayesian” in this relative
frequency calculation.
Bayesians scarcely have a monopoly on the use of conditional
probability! But even here it strikes me as a very odd way to talk
about evidence.
(Howson says it shows unsoundness because he identifies a p-value
with a posterior probability in a hypothesis)
27
A Common Variant on the Criticisms (p-values vs. posterior
probabilities):
Certain choices of prior probabilities for the null and
alternative hypotheses show that a small p-value is consistent with a
much higher posterior probability in the null hypothesis.
The alternative hypothesis would, in such cases, pass severely,
even though the null hypothesis has a high posterior (Bayesian)
probability.
28
A statistically significant difference from H0 can correspond to large
posteriors in H0
From the Bayesian perspective, it follows that p-values come up short
as a measure of inductive evidence;
the significance testers balk at the fact that the recommended priors
result in highly significant results being construed as no evidence
against the null — or even evidence for it!
29
The conflict often considers the two-sided test:
H0: μ = μ0 versus H1: μ ≠ μ0.
(The difference between p-values and posteriors is far less marked
with one-sided tests.)
“Assuming a prior of .5 to H0, with n = 50 one can classically ‘reject H0
at significance level p = .05,’ although P(H0|x) = .52 (which would
actually indicate that the evidence favors H0).”
This is taken as a criticism of p-values only because it is assumed the
.52 posterior is the appropriate measure of the belief-worthiness.
As the sample size increases, the conflict becomes more
noteworthy.
30
If n = 1000, a result statistically significant at the .05 level leads
to a posterior to the null of .82!
SEV (H1) = .95 while the corresponding posterior has gone
from .5 to .82. What warrants such a prior?
Posterior probability P(H0|x) for a just-significant result, with prior P(H0) = .5:
______________________________________________________
p        t       n=10    n=20    n=50    n=100   n=1000
.10      1.645   .47     .56     .65     .72     .89
.05      1.960   .37     .42     .52     .60     .82
.01      2.576   .14     .16     .22     .27     .53
.001     3.291   .024    .026    .034    .045    .124
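The quoted .52 (n = 50) and .82 (n = 1000) posteriors can be reproduced with one standard reconstruction: a "spiked" prior P(H0) = .5 on H0: μ = μ0, with the remaining .5 spread as a N(μ0, σ²) prior over the alternative (a Jeffreys-type assignment of the sort analyzed by Berger and Sellke). Under that prior the posterior has a closed form:

```python
# P(H0|x) under the spiked prior P(H0)=.5 with a N(mu0, sigma^2) prior on
# the alternative; t is the observed z-statistic, n the sample size.
from math import sqrt, exp

def posterior_null(t, n):
    bayes_factor = sqrt(n + 1) * exp(-t * t * n / (2 * (n + 1)))
    return bayes_factor / (1 + bayes_factor)

print(round(posterior_null(1.960, 50), 2))    # .52, as quoted
print(round(posterior_null(1.960, 1000), 2))  # .82
```

The exponential factor shrinks slowly in n while the sqrt(n + 1) term grows, which is why a fixed-level significant result favors H0 more and more as n increases.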
31
(1) Some claim the prior of .5 is a warranted frequentist assignment:
H0 was randomly selected from an urn in which 50% are true.
(*) Therefore P(H0) = .5
H0 may be 0 change in extinction rates, 0 lead concentration, etc.
What should go in the urn of hypotheses?
For the frequentist: either H0 is true or false; the probability in (*) is
fallacious and results from an unsound instantiation.
We are very interested in how false it might be, which is what we
can learn by means of a severity assessment.
32
(2) Subjective degree of belief assignments will not ensure the error
probability, and thus the severity, assessments we need.
(3) Some suggest an “impartial” or “uninformative” Bayesian prior gives
.5 to H0, the remaining .5 probability being spread out over the alternative
parameter space (Jeffreys).
This “spiked concentration of belief in the null” is at odds with the
prevailing view “we know all nulls are false”.
33
Upshot: However severely I might wish to say that a hypothesis H
has passed a test, the Bayesian critic assigns a sufficiently low prior
probability to H so as to yield a low posterior probability in H.
But no argument is given for why this counts in favor of, rather than
against, their Bayesian computation as an appropriate assessment of
the warrant to be accorded to hypothesis H.
To begin with, in order to use techniques for assigning frequentist
probabilities to events, their examples invariably involve
“hypotheses” that consist of asserting that a sample possesses a
characteristic, such as “having a disease” or “being college ready” or,
for that matter, “being true.”
This would not necessarily be problematic if it were not for the fact
that their criticism requires shifting the probability to the particular
sample selected.
34
Bayesians sometimes tell us they will cure the significance tester’s
tendency to exaggerate the evidence against the null (in two-sided
testing) by using some variant on a spiked prior.
But the result of their “cure” is that outcomes may too readily be
taken as no evidence against, or even evidence for, the null
hypothesis, even if it is false.
We actually don’t think we need a cure.
Faced with conflicts between error probabilities and Bayesian
posterior probabilities, the error statistician may well conclude that
the flaw lies with the latter measure.