
Generalized inferential models: basics and beyond

Ryan Martin1

North Carolina State University / Researchers.One

Statistics Seminar, Northwestern Polytechnical University, China

November 19, 2021

1. www4.stat.ncsu.edu/~rmartin

1 / 40

Introduction

Statistics aims to give reliable/valid uncertainty quantification about unknowns based on data, models, etc.

Two dominant schools of thought:

frequentist
Bayesian

Both have familiar pros and cons...

Do we have to accept the cons? Can’t we just have all pros?

e.g., Efron (2013):

“Perhaps the most important unresolved problem in statistical inference is the use of Bayes theorem in the absence of prior information.”

2 / 40

Introduction, cont.

Chuanhai Liu2 and I developed a pros-focused approach.

Objectives:

data-dependent “probabilities,” without priors
calibration properties to make inference reliable

Our framework: inferential models (IMs).

Some similarities to what Fisher & others did.

Key difference:

reliability requires “probabilities” to be imprecise

2. https://www.stat.purdue.edu/people/faculty/chuanhai.html

3 / 40

This talk

Background / inferential models (IMs)

Generalized IMs

easier construction
still valid

Applications:

meta-analysis
survival analysis

Next-generation generalized IMs...

4 / 40

Inferential models

Observable data: Y in sample space 𝕐
Statistical model: Y ∼ P_{Y|θ}, θ in parameter space Θ

Goal: learn about unknown θ from observed data, y

That is, quantify uncertainty about θ based on y

Bayes, Fisher, and others use probability distributions
Dempster & Shafer use “belief functions”

These are special cases of IMs...

5 / 40

IMs, cont.

Mathematically, an IM is a mapping that takes data y to a pair of lower and upper probabilities:3,4

Π_y(A) = degree of belief in “θ ∈ A” (the lower probability)

Π̄_y(A) = degree of plausibility of “θ ∈ A” (the upper probability)

→ Probabilities are additive: Π_y = Π̄_y.

→ Belief functions, etc., are non-additive: Π_y ≤ Π̄_y.

Clearly lots of options, how to choose?

Recommend an IM that’s “statistically reliable”

3. Technically, these are super/sub-additive and monotone capacities.
4. Linked via the duality Π̄_y(A) = 1 − Π_y(A^c).

6 / 40

IMs, cont.

“Reliable” in what sense?

Basic principle: if Π_Y(A) is large, infer A.

Reid & Cox: “it is unacceptable if a procedure ... of representing uncertain knowledge would, if used repeatedly, give systematically misleading conclusions.”

We don’t want, e.g., Π_Y(A) to be large if A is false.

Idea: require that y ↦ Π_y(·) satisfy

{θ ∉ A and Y ∼ P_{Y|θ}} ⟹ Π_Y(A) tends to be small.

7 / 40

IMs, cont.

Definition.

An IM y ↦ (Π_y, Π̄_y) is valid if

sup_{θ∉A} P_{Y|θ}{Π_Y(A) > 1 − α} ≤ α, ∀ A ⊆ Θ, α ∈ [0, 1].

Validity controls the frequency at which the IM assigns relatively high beliefs to false assertions.

There’s an equivalent statement in terms of Π̄_y:

sup_{θ∈A} P_{Y|θ}{Π̄_Y(A) ≤ α} ≤ α, ∀ A ⊆ Θ, α ∈ [0, 1].

False confidence theorem:5 additive IMs can’t be valid.

5. Balch, M., and Ferson, arXiv:1706.08565

8 / 40

IMs, cont.

Theorem.

If (Π_Y, Π̄_Y) is valid, then derived procedures control frequentist error rates:

“reject H₀: θ ∈ A if Π̄_y(A) ≤ α” is a size-α test;

the 100(1 − α)% plausibility region {ϑ : Π̄_y({ϑ}) > α} has coverage probability ≥ 1 − α.

IM validity ⟹ usual frequentist validity

Connection is mutually beneficial:

IMs help with interpretation of frequentist output
calibration makes the IM’s (Π_y, Π̄_y) real-world relevant

9 / 40

IMs, cont.

How to construct a valid IM?

A. Y = a(θ, U), U ∼ P_U.

P. Use a random set 𝒰 to “guess” the unobserved U.

C. Data-dependent random set

Θ_y(𝒰) = ⋃_{u∈𝒰} {ϑ : y = a(ϑ, u)}

leads to lower and upper probs

Π_y(A) = P_𝒰{Θ_y(𝒰) ⊆ A}
Π̄_y(A) = P_𝒰{Θ_y(𝒰) ∩ A ≠ ∅}.

This is what the book is about!
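A minimal numerical sketch of this A/P/C recipe (not from the slides): a hypothetical toy model Y ∼ N(θ, 1), the association Y = θ + U with U ∼ N(0, 1), and a symmetric random set for U; the observed y, the assertion A, and the Monte Carlo size are illustrative choices.

```python
# Toy A/P/C construction: Y ~ N(theta, 1), association Y = theta + U,
# U ~ N(0, 1), and the symmetric random set {u : |u| <= |U_tilde|} with
# U_tilde ~ N(0, 1).  The data-dependent random set is then the interval
# [y - |U_tilde|, y + |U_tilde|]; lower/upper probabilities for an interval
# assertion A are estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)

def im_lower_upper(y, a_lo, a_hi, n_draws=100_000):
    """Monte Carlo lower/upper probabilities for the assertion A = [a_lo, a_hi]."""
    r = np.abs(rng.standard_normal(n_draws))        # half-widths |U_tilde|
    lo, hi = y - r, y + r                           # Theta_y(U) = [y - r, y + r]
    belief = np.mean((lo >= a_lo) & (hi <= a_hi))   # P{Theta_y(U) subset of A}
    plaus = np.mean((hi >= a_lo) & (lo <= a_hi))    # P{Theta_y(U) meets A}
    return belief, plaus

bel, pl = im_lower_upper(y=0.3, a_lo=-1.0, a_hi=1.0)   # y and A are illustrative
print(f"belief = {bel:.3f} <= plausibility = {pl:.3f}")
```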

10 / 40

IMs, cont.

Problems first considered by Fisher:

Scalar Y ∼ P_{Y|θ} and scalar θ
continuous distribution function F_θ
range of F_θ(y) unconstrained by fixed y or fixed θ

IM construction:6

A. Y = F_θ^{-1}(U), U ∼ P_U = Unif(0, 1)
P. 𝒰 = {u ∈ [0, 1] : |u − 0.5| ≤ |Ũ − 0.5|}, Ũ ∼ Unif(0, 1)
C. Θ_y(𝒰) = {ϑ : F_ϑ(y) ∈ 𝒰}, with plausibility contour

π_y(ϑ) = P_𝒰{Θ_y(𝒰) ∋ ϑ} = 1 − |2 F_ϑ(y) − 1|.

Lots of examples can be covered by this analysis.

6. For original details, see M. and Liu, arXiv:1206.4091

11 / 40

IMs, cont.

Two examples: Cauchy(θ, 1) and Gamma(θ, 1)

Plots below show the plausibility contour π_y(ϑ).

How is this used? (a code sketch follows)
confidence interval: {ϑ : π_y(ϑ) > α}
upper probabilities: Π̄_y(A) = sup_{ϑ∈A} π_y(ϑ)
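A minimal sketch of these two uses for the Cauchy(θ, 1) example, based on the closed-form contour from the previous slide; the observed y, the grid, the level, and the assertion are illustrative choices (assumes NumPy/SciPy).

```python
# Closed-form contour pi_y(theta) = 1 - |2 F_theta(y) - 1| for Cauchy(theta, 1),
# the derived plausibility interval {theta : pi_y(theta) > alpha}, and an
# upper probability for an assertion A.
import numpy as np
from scipy.stats import cauchy

def plaus_contour(y, theta_grid):
    """pi_y(theta) for the Cauchy location model."""
    F = cauchy.cdf(y, loc=theta_grid)            # F_theta(y)
    return 1.0 - np.abs(2.0 * F - 1.0)

y = 0.0
grid = np.linspace(-30, 30, 2001)
pi = plaus_contour(y, grid)

alpha = 0.05
keep = grid[pi > alpha]                          # plausibility interval, level alpha
print(f"95% plausibility interval: [{keep.min():.1f}, {keep.max():.1f}]")

# upper probability of an assertion, e.g. A = {theta > 2}
print(f"upper prob of 'theta > 2': {pi[grid > 2].max():.3f}")
```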

[Plots of the plausibility contour π_y(ϑ): (a) y = 0, Cauchy; (b) y = 1, gamma.]

12 / 40

IMs, cont.

Despite IM’s nice features, practical challenges can arise

Basic issue: A-step must determine P_{Y|θ}
Challenges:

efficiency-motivated auxiliary variable dimension reduction7

eliminating nuisance parameters8

(big) requires a fully-specified statistical model...

Formal remedies are difficult to carry out.

Idea: do these dimension-reduction-related tasks less formally before starting the IM construction.

Leads to a generalized IM...

7. M. and Liu, arXiv:1211.1530
8. M. and Liu, arXiv:1306.3092

13 / 40

Generalized IMs

The idea9 is to connect some function of (Y, θ) to an auxiliary variable with known distribution.

Let T_{y,ϑ} be a real-valued function of (y, ϑ).

Good example to keep in mind: the relative likelihood

T_{y,ϑ} = L_y(ϑ) / L_y(θ̂),

where θ̂ is the maximum likelihood estimator.

Note: the value of T_{Y,θ} does not determine (Y, θ).

So, an association in terms of T_{Y,θ} amounts to a “loss of information” in a sense that turns out to be irrelevant.

9. M., arXiv:1203.6665 and arXiv:1511.06733

14 / 40

Generalized IMs, cont.

Generalized association:

T_{Y,θ} = F_θ^{-1}(U), U ∼ Unif(0, 1),

where F_θ(t) = P_{Y|θ}(T_{Y,θ} ≤ t), t ∈ ℝ.

Unlike before, the generalized association doesn’t determine the distribution of Y — but that’s not important.

Key benefits:

U is a scalar, no dimension reduction needed!
ordering in ϑ ↦ T_{y,ϑ} suggests a particular random set 𝒰

15 / 40

Generalized IMs, cont.

Generalized IM construction.

A. T_{Y,θ} = F_θ^{-1}(U) for U ∼ Unif(0, 1).

P. Introduce a suitable random set 𝒰 on [0, 1].

C. Combine to get a new random set on Θ:

Θ_y(𝒰) = {ϑ : F_ϑ(T_{y,ϑ}) ∈ 𝒰}.

For the special case 𝒰 = [U′, 1] with U′ ∼ Unif(0, 1), some simplification is possible:

π_y(ϑ) = P_𝒰{Θ_y(𝒰) ∋ ϑ} = F_ϑ(T_{y,ϑ}), ϑ ∈ Θ.

Immediately gives valid, prior-free probabilistic inference across a wide range of problems!
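A minimal sketch of this recipe for the binomial example on the next slide, with T the relative likelihood; because Y takes only n + 1 values, F_ϑ can be computed exactly by summation here (in general one would use Monte Carlo). The grid and level are illustrative choices.

```python
# Generalized IM for Y ~ Binomial(n, theta) with T the relative likelihood:
# pi_y(theta) = F_theta(T_{y,theta}) = P{T_{Y,theta} <= T_{y,theta}}.
import numpy as np
from scipy.stats import binom

def rel_lik(y, theta, n):
    """T_{y,theta} = L_y(theta) / L_y(theta_hat), with theta_hat = y / n."""
    return np.exp(binom.logpmf(y, n, theta) - binom.logpmf(y, n, y / n))

def gim_contour(y, theta, n):
    """pi_y(theta) = P_{Y ~ Bin(n, theta)}{ T_{Y,theta} <= T_{y,theta} }."""
    ys = np.arange(n + 1)
    below = rel_lik(ys, theta, n) <= rel_lik(y, theta, n)
    return binom.pmf(ys, n, theta)[below].sum()

n, y = 25, 15                                   # matches the slide's example
grid = np.linspace(0.01, 0.99, 99)
pi = np.array([gim_contour(y, t, n) for t in grid])
ci = grid[pi > 0.05]                            # 95% plausibility interval
print(f"95% plausibility interval for theta: [{ci.min():.2f}, {ci.max():.2f}]")
```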

16 / 40

Generalized IMs, cont.

Simple binomial example.

Left: plot of π_y(ϑ) based on (n, y) = (25, 15)

Right: GIM’s and Clopper–Pearson’s coverage probability.

17 / 40

Generalized IMs, cont.

Too good to be true?

Computation of F_θ can be challenging.

Lots of sampling-based methods available for this.

To evaluate π_y(ϑ) on a grid:
do separate Monte Carlo at each grid point
Monte Carlo + importance sampling adjustments
other things...10

Better/general strategies for GIM computation would be an interesting and welcome contribution!

10. Syring and M., arXiv:2103.02659

18 / 40

Generalized IMs, cont.

Often only interested in some feature of θ.

Split θ = (φ, λ), interest in φ.

Now the idea is to connect a function of (Y, φ) to an auxiliary variable with known distribution.

Let T_{y,ϕ} be a real-valued function of (y, ϕ).

Good example to keep in mind: the relative profile likelihood

T_{y,ϕ} = L_y(ϕ, λ̂_ϕ) / L_y(φ̂, λ̂),

where λ̂_ϕ maximizes λ ↦ L_y(ϕ, λ) and (φ̂, λ̂) is the MLE.

19 / 40

Generalized IMs, cont.

Generalized association:

T_{Y,φ} = F_{φ,λ}^{-1}(U), U ∼ Unif(0, 1),

where F_{φ,λ}(t) = P_{Y|φ,λ}(T_{Y,φ} ≤ t).

T_{Y,φ} doesn’t directly depend on λ, but its distribution does.

If λ were known, or if the dependence on λ dropped out,11

then this would be exactly like before.

That is, we end up with

“π_y(ϕ)” = F_{ϕ,λ}(T_{y,ϕ}), ϕ ∈ φ(Θ).

11. e.g., bivariate normal with φ the correlation and λ the means and variances

20 / 40

Generalized IMs, cont.

Natural idea is to use a plug-in estimate

Define λ̂_ϕ = arg max_λ L_y(ϕ, λ).

Generalized IM has (plug-in) plausibility contour

π_y(ϕ) = F_{ϕ,λ̂_ϕ}(T_{y,ϕ}).

Plug-in means it can’t be exactly valid; but one can usually prove asymptotic validity, i.e.,

lim_{n→∞} P_{Y^n|φ,λ}{π_{Y^n}(φ) ≤ α} = α.

Open question: Empirically, this convergence is very fast, but is there a built-in “higher-order accuracy”?
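A minimal sketch (not from the slides) of how that calibration can be checked empirically in a toy problem: normal data with interest parameter µ and nuisance σ², plug-in contour estimated by Monte Carlo, and the distribution of π_Y(µ) compared with Unif(0, 1). All settings are illustrative choices.

```python
# Calibration check for the plug-in contour pi_y(phi) = F_{phi, lam_hat_phi}(T)
# in the toy model Y_1,...,Y_n ~ N(mu, sigma^2); T is the relative profile
# likelihood for mu with sigma^2 as the nuisance.
import numpy as np

rng = np.random.default_rng(1)

def rel_profile_lik(y, mu):
    """T_{y,mu} = sup_s L_y(mu, s) / sup_{m,s} L_y(m, s); works row-wise on 2-D y."""
    y = np.atleast_2d(y)
    n = y.shape[-1]
    s2_hat = y.var(axis=-1)                  # unrestricted MLE of sigma^2
    s2_mu = np.mean((y - mu) ** 2, axis=-1)  # MLE of sigma^2 with mu fixed
    return (s2_hat / s2_mu) ** (n / 2)

def plug_in_contour(y, mu, n_mc=1000):
    """pi_y(mu): Monte Carlo estimate of F at the plug-in value (mu, s2_mu)."""
    n = y.size
    s2_mu = np.mean((y - mu) ** 2)
    t_obs = rel_profile_lik(y, mu)
    y_sim = rng.normal(mu, np.sqrt(s2_mu), size=(n_mc, n))
    return np.mean(rel_profile_lik(y_sim, mu) <= t_obs)

# pi_Y(mu_true) should be roughly Unif(0, 1), i.e. P{pi_Y(mu_true) <= alpha} ~ alpha
n, mu_true, sigma_true = 10, 0.0, 1.0
pis = np.array([plug_in_contour(rng.normal(mu_true, sigma_true, n), mu_true)
                for _ in range(500)])
for alpha in (0.05, 0.10, 0.25):
    print(f"P(pi <= {alpha:.2f}) is about {np.mean(pis <= alpha):.3f}")
```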

21 / 40

Applications

Two recent applications of generalized IMs:

1 Meta-analysis with few studies.12

2 Survival analysis.13

Both involve nuisance parameters and non-trivial computation.

Generalized IM methods outperform existing methods.

Below are some details for each in turn.

12. Cahoon and M., arXiv:1910.00533
13. Cahoon and M., arXiv:1912.00037

22 / 40

Meta-analysis

It’s natural in science for multiple researchers to carry out their own analysis related to the same question.

Pool these separate analyses into a “meta-analysis”?

K independent studies produce data (Y_k, σ_k²)

Y_k is an estimate of µ from study-k data
σ_k is the study-k standard error, treated as fixed

Basic model: Y_k ∼ N(µ, ν + σ_k²), k = 1, ..., K.

ν > 0 is the across-study variance, unknown.

Goal is inference on µ.
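A minimal sketch (not the paper's code) of the plug-in generalized IM for this model, using the relative profile likelihood for µ, a crude grid for profiling out ν, and Monte Carlo for the plug-in sampling distribution; the study data and all tuning choices below are made-up illustrative numbers.

```python
# Plug-in generalized IM for Y_k ~ N(mu, nu + sigma_k^2): T is the relative
# profile likelihood for mu; F_{mu, nu_hat_mu} is approximated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(2)

def profile_loglik(y, s2, mu_grid, nu_grid):
    """Profile log-likelihood over nu (on a grid), as a function of mu_grid."""
    v = nu_grid[None, :, None] + s2[None, None, :]          # (1, N, K)
    resid2 = (y[None, None, :] - mu_grid[:, None, None])**2  # (M, 1, K)
    ll = -0.5 * np.sum(np.log(v) + resid2 / v, axis=2)       # (M, N)
    return ll.max(axis=1)                                    # profile out nu

def rel_profile_lik(y, s2, mu, mu_grid, nu_grid):
    """T_{y,mu} = profile likelihood at mu over its (grid) maximum."""
    pl = profile_loglik(y, s2, np.append(mu_grid, mu), nu_grid)
    return np.exp(pl[-1] - pl.max())

def plug_in_contour(y, s2, mu, mu_grid, nu_grid, n_mc=200):
    """pi_y(mu) = Monte Carlo estimate of P{T_{Y,mu} <= T_{y,mu}} at nu_hat_mu."""
    v = nu_grid[:, None] + s2[None, :]
    ll_nu = -0.5 * np.sum(np.log(v) + (y[None, :] - mu)**2 / v, axis=1)
    nu_hat = nu_grid[np.argmax(ll_nu)]                       # plug-in nu_hat_mu
    t_obs = rel_profile_lik(y, s2, mu, mu_grid, nu_grid)
    t_sim = np.array([rel_profile_lik(rng.normal(mu, np.sqrt(nu_hat + s2)), s2,
                                      mu, mu_grid, nu_grid) for _ in range(n_mc)])
    return np.mean(t_sim <= t_obs)

y = np.array([0.3, 0.1, 0.6, -0.1, 0.4])         # K = 5 made-up study estimates
s2 = np.array([0.04, 0.09, 0.06, 0.05, 0.08])    # their squared standard errors
mu_grid = np.linspace(-1.0, 1.5, 41)
nu_grid = np.linspace(0.0, 2.0, 101)
pi = np.array([plug_in_contour(y, s2, m, mu_grid, nu_grid) for m in mu_grid])
ci = mu_grid[pi > 0.05]
print(f"95% plausibility interval for mu: [{ci.min():.2f}, {ci.max():.2f}]")
```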

23 / 40

Meta-analysis, cont.

Estimating µ is easy, valid inference on µ is difficult.

Challenge comes from the nuisance parameter ν.

Asymptotic confidence intervals for µ are available,14 but these require large K.

I was unsuccessful trying to work out basic IM details for this, but the aforementioned generalized IM works.

Take-away messages:

Probabilistic inference on µ
Asymptotically valid, empirically accurate for small K
Outperforms other methods we tried

14. DerSimonian & Laird is classic

24 / 40

Meta-analysis, cont.

Left: 3 individual plausibility contours & combined

Right: empirical CDF of π_Y(µ) for K = 5 studies.

25 / 40

Meta-analysis, cont.

Simulation comparison of GIM against competitors.

As functions of K , compare 95% CIs for µ

coverage probability (left)
average length (right)

e.g., GIM (black), oracle (green), DL (purple)

26 / 40

Survival analysis

Data may be incomplete in some applications.

e.g., in time-to-event studies, event times may be censored.

Survival analysis deals with such things.

Basic right-censoring model:

X_i ∼ H_φ and C_i ∼ G, with θ = (φ, G)
T_i = X_i ∧ C_i, D_i = 1(X_i ≤ C_i)

Data: Y_i = (T_i, D_i) for i = 1, ..., n.

Goal is inference on φ

G is an (infinite-dim) nuisance parameter.
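A minimal sketch of the censored-data ingredients for this setup: simulated right-censored Weibull data, the censored-data log-likelihood, and the relative profile likelihood for the shape parameter. The full GIM would additionally calibrate T by Monte Carlo at the plug-in Kaplan–Meier estimate of G; that step is omitted here, and the exponential censoring and all settings are illustrative choices.

```python
# Censored-data likelihood and relative profile likelihood for a Weibull H_phi.
import numpy as np
from scipy.stats import weibull_min
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)

# simulate: X ~ Weibull(shape, scale), censoring C ~ Exp(mean 3)
shape_true, scale_true, n = 1.5, 2.0, 50
x = weibull_min.rvs(shape_true, scale=scale_true, size=n, random_state=rng)
c = rng.exponential(3.0, size=n)
t, d = np.minimum(x, c), (x <= c).astype(float)   # observed time, event indicator

def loglik(shape, scale, t, d):
    """Censored-data log-likelihood: density for events, survival for censored."""
    return np.sum(d * weibull_min.logpdf(t, shape, scale=scale)
                  + (1 - d) * weibull_min.logsf(t, shape, scale=scale))

def profile_loglik(shape, t, d):
    """Maximize out the scale (nuisance) for a fixed shape."""
    res = minimize_scalar(lambda ls: -loglik(shape, np.exp(ls), t, d),
                          bounds=(-3.0, 3.0), method="bounded")
    return -res.fun

shape_grid = np.linspace(0.5, 3.0, 60)
pl = np.array([profile_loglik(a, t, d) for a in shape_grid])
T = np.exp(pl - pl.max())                         # relative profile likelihood
inside = shape_grid[T > 0.15]                     # 0.15 cutoff is illustrative
print(f"shape values with T > 0.15: [{inside.min():.2f}, {inside.max():.2f}]")
```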

27 / 40

Survival analysis, cont.

There’s a likelihood function, hence MLEs

Asymptotic normality of the MLE φ̂ can be used for inference.

Bayesian methods are also available.

I was unsuccessful trying to work out basic IM details for this, but the aforementioned generalized IM works.

Take-away messages:

Probabilistic inference on φ
Asymptotically valid, empirically accurate for small n
Outperforms other methods we tried

28 / 40

Survival analysis, cont.

Some computational details:

Monte Carlo to evaluate π_y(ϕ)
To simulate censoring times for Monte Carlo, use the plug-in Kaplan–Meier Ĝ, the nonparametric MLE

Some theoretical details:

challenging to handle the infinite-dim plug-in Ĝ
works because Ĝ is root-n consistent
empirical results suggest higher-order accuracy...

29 / 40

Survival analysis, cont.

H_φ is Weibull with φ = (shape α, scale β)

Compare coverage probability with existing methods

GIM (black), MLE (red), Bayes (green)

Not an extensive simulation...

30 / 40

Survival analysis, cont.

Real-data example:

Chemical concentrations in soil
Left-censored — measuring instrument has limited precision
H_φ is log-normal, φ = (µ, σ²)

Plots:

Left: joint plausibility contour for (µ, σ²)
Right: derived marginal for ψ = exp(µ + σ²/2)

31 / 40

Next-gen GIMs

Generalized IMs relaxed the basic IM’s construction by not requiring the association to determine the model for Y.

But still requires a statistical model.

In machine learning applications, it’s common to work without a statistical model.

This has been a barrier for IMs and other probabilistic inference frameworks.

Turns out the generalized IM can be generalized even further to cover certain no-model cases.

I’ll talk briefly about two such extensions.

32 / 40

No-model inference

Often the quantity of interest isn’t a model parameter.

Common situation: θ = arg min_ϑ E{ℓ_ϑ(Y)}
Loss ℓ could be squared error, classification error, etc.

Analogue to the relative likelihood:

T_{y,ϑ} = exp[−{R_y(ϑ) − R_y(θ̂_y)}], where R_y(ϑ) = (1/n) Σ_{i=1}^n ℓ_ϑ(y_i) is the empirical risk and θ̂_y its minimizer.

Similar in principle to the generalized IM before, but there are several new challenges.15

e.g., bootstrap needed to compute the distribution of T_{Y,θ}

Asymptotically valid under regularity conditions.
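A minimal sketch of one plausible bootstrap calibration for this no-model setup, with θ the risk minimizer under squared-error loss (i.e., the mean); the bootstrap distribution of T at the empirical risk minimizer stands in for the unknown sampling distribution of T at the truth. The exact recipe in the cited work may differ, and the data are made up.

```python
# No-model GIM sketch: empirical-risk-based T calibrated by the bootstrap.
import numpy as np

rng = np.random.default_rng(4)

def emp_risk(y, theta):
    """R_y(theta) = (1/n) sum_i (y_i - theta)^2 (squared-error loss)."""
    return np.mean((y - theta) ** 2)

def T_stat(y, theta):
    """T_{y,theta} = exp(-{R_y(theta) - R_y(theta_hat)}), theta_hat = sample mean."""
    return np.exp(-(emp_risk(y, theta) - emp_risk(y, y.mean())))

def no_model_contour(y, theta_grid, n_boot=2000):
    theta_hat = y.mean()
    t_boot = np.array([T_stat(rng.choice(y, size=y.size, replace=True), theta_hat)
                       for _ in range(n_boot)])          # bootstrap T at theta_hat
    return np.array([np.mean(t_boot <= T_stat(y, th)) for th in theta_grid])

y = rng.exponential(1.0, size=40)            # made-up data, no model assumed
grid = np.linspace(0.5, 2.0, 61)
pi = no_model_contour(y, grid)
ci = grid[pi > 0.05]
print(f"95% plausibility interval for the mean: [{ci.min():.2f}, {ci.max():.2f}]")
```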

15. Cella and M., almost done...

33 / 40

No-model inference, cont.

Quantile regression is an important example.

Conditional τth quantile: Q_τ(Y | x) = xᵀβ_τ

“Check loss” defines the risk minimization problem.

34 / 40

No-model prediction

Prediction of future observations is important.

Can do model-based IM prediction.16

Model assumptions are restrictive in applications, so what about a no-model version?

Recent development using a new type of generalized IM.17

Only assumes exchangeability, provably valid.

Close connections to conformal prediction.18
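For context, a minimal sketch of split conformal prediction, the method the no-model predictive IM is closely connected to; this is plain conformal rather than the IM construction itself, and the simulated data, the ridge regressor, and the 90% level are illustrative choices (assumes scikit-learn).

```python
# Split conformal prediction interval for regression.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)

# made-up regression data
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

# split the data: fit on one half, calibrate residuals on the other
idx = rng.permutation(n)
train, calib = idx[: n // 2], idx[n // 2:]
model = Ridge(alpha=1.0).fit(X[train], y[train])
scores = np.abs(y[calib] - model.predict(X[calib]))      # conformity scores

alpha = 0.1
k = int(np.ceil((1 - alpha) * (len(calib) + 1)))         # conformal rank
q = np.sort(scores)[k - 1]                               # adjusted (1 - alpha) quantile

x_new = rng.normal(size=(1, p))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```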

16. M. and Lingham, arXiv:1403.7589
17. Cella and M., https://researchers.one/articles/20.01.00010
18. Vovk, Shafer, etc.

35 / 40

No-model prediction, cont.

Latest developments for supervised learning, e.g., regression.19

Very general relationships between response and predictors.

Leads to valid probabilistic inference!

19. Cella and M., almost done...

36 / 40

Conclusion

IM framework is a promising potential solution to Efron’s “most important unresolved problem”

Necessary breakthrough to get past old B vs. F debates is the importance of non-additivity/imprecision

I didn’t fully appreciate how fundamental imprecision was until after the book was finished.

So I’ve written quite a bit about this recently...

37 / 40

Conclusion, cont.

Two recent papers about validity and the importance of imprecision/non-additivity

38 / 40

Conclusion, cont.

Generalized IM framework is also powerful; in some sense it might be the “right” way to do IMs.

I’m excited about the latest developments.

Tons of potential applications!

Open questions/problems:

General computational strategies?
Possible higher-order accuracy?
Anything lost in the move IM → GIM?
Likelihood-based inference is “optimal,” so can GIMs shed light on “optimal IM constructions”?

39 / 40

The end

Thanks for your attention!

Questions? rgmarti3@ncsu.edu

Links to papers: www4.stat.ncsu.edu/~rmartin/

100% open peer review & publication: https://researchers.one

www.twitter.com/ResearchersOne

40 / 40
