How to interpret Bayesian (posterior predictive) p-value of 0.5?

In the following paper, found here and referenced below, the author suggests that "if the model is true or close to true, the posterior predictive p-value will almost certainly be very close to 0.5". This statement appears at the very beginning of Section 6 of the paper.



I am trying to interpret what is meant when he says the 'model is true'.
My questions are:



i) Statistically, what is a "true model" as used in the quote above?



ii) What does a value of 0.5 mean in simple words?



Gelman, A. (2013). Two simple examples for understanding posterior p-values whose distributions are far from uniform. Electronic Journal of Statistics, 7, 2595-2602.







asked Aug 17 at 1:28 by user121, edited Aug 17 at 15:54


  • Could you give a full reference to the paper just in case the link goes dead in the future (so as to make the thread comprehensible for future readers)? – Richard Hardy, Aug 17 at 7:23

2 Answers

Accepted answer (answered Aug 17 at 3:53 by Taylor, edited Aug 17 at 4:36)

The model is true if you're assuming that's how the data are generated. This is not the setup where you consider multiple models $M_1, M_2, \ldots$ and have discrete probability distributions describing the model uncertainty.



A p-value of .5 means your test statistic $T(y^\text{rep})$ (which will be computed on simulated data) will be right around the median of the posterior predictive distribution. Roughly speaking, a test statistic $T(y^\text{rep})$ calculated from simulated data will be pretty close to the test statistic $T(y)$ you have computed from the "real" (training) data.



The posterior predictive distribution is
\begin{align*}
p(y^\text{rep} \mid y) &= \int p(y^\text{rep}, \theta \mid y) \, d\theta\\
&= \int p(y^\text{rep} \mid \theta, y) \, p(\theta \mid y) \, d\theta\\
&= \int \underbrace{p(y^\text{rep} \mid \theta)}_{\text{model}} \, \underbrace{p(\theta \mid y)}_{\text{posterior}} \, d\theta \\
&= E_{\theta \mid y}\left[ p(y^\text{rep} \mid \theta) \right].
\end{align*}



Then you take this distribution and integrate over the region where
$T(y^\text{rep})$ is greater than some calculated nonrandom statistic of the dataset $T(y)$.



$$
P(T(y^\text{rep}) > T(y) \mid y) = \int_{\{y^\text{rep} \,:\, T(y^\text{rep}) > T(y)\}} p(y^\text{rep} \mid y) \, dy^\text{rep}.
$$



In practice, if computing the above integral is too difficult, this means simulating many $y^\text{rep}$s, calculating $T(y^\text{rep})$ for each of them, and then seeing what percentage are above your single calculated $T(y)$. For more information, see this thread: What are posterior predictive checks and what makes them useful?
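
To make that simulation recipe concrete, here is a minimal sketch (not from the original answer) using an assumed toy setup: a Normal model with known variance and a flat prior on the mean, so the posterior is available in closed form. The data, the choice of the sample mean as $T$, and all names are purely illustrative.

```
import numpy as np

rng = np.random.default_rng(0)

# Toy "observed" data; in practice y would be your real dataset.
y = rng.normal(loc=0.3, scale=1.0, size=50)
n, sigma = y.size, 1.0          # sigma treated as known for simplicity


def T(data):
    """Test statistic; the sample mean is used purely for illustration."""
    return data.mean()


n_rep = 5000
# Flat prior + known sigma  =>  theta | y ~ Normal(ybar, sigma^2 / n).
theta_draws = rng.normal(y.mean(), sigma / np.sqrt(n), size=n_rep)

# For each posterior draw, simulate a replicated dataset y_rep ~ p(y_rep | theta)
# and compute the test statistic on it.
T_rep = np.array([T(rng.normal(th, sigma, size=n)) for th in theta_draws])

# Posterior predictive p-value: fraction of replicated statistics above T(y).
ppp = np.mean(T_rep > T(y))
print(f"posterior predictive p-value ~ {ppp:.3f}")
```

With this particular statistic the replicated means are centred on the observed mean, so the printed value should land very close to 0.5, which previews the point raised in the comments below.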



Because you are assuming there is no model uncertainty, $p(y^\text{rep} \mid y)$ is an integral over the parameter space, not over the parameter space AND the model space.




























  • if I do a ppd check Pr(T(y_rep) > T(y)) and get 0.5, I know this means the simulated and observed data are similar and the model is a good fit. But my question is why? What does the 0.5 tell us, as opposed to, say, 0.8? – user121, Aug 17 at 4:08

  • @user121 it might help to think about the situation when it comes out to be .0001. This is bad because your model is predicting $T$s that are nowhere near the $T$ you calculated from the data. – Taylor, Aug 17 at 4:10

  • What would .9999 imply? – user121, Aug 17 at 4:12

  • It's far away (as measured by your posterior) in the other direction. It means your model is predicting something not like the data you have, and it's important to understand why it's overriding common sense/intuition. .5 is good because half of the simulations are above your one real $T$ and half are below. – Taylor, Aug 17 at 4:16

  • Appreciate the feedback. I don't think I am asking clearly what I'm looking for. If the p-value = 0.9999, what statement can we make versus a p-value = 0.5214? I do not understand why 0.5 is the 'ideal' number for model fit. It is something fundamental I'm missing. Is there some physical meaning behind it? – user121, Aug 17 at 4:23


















Answer (answered Aug 17 at 4:36 by Dave Harris)













I would recommend reading the underlying papers from which this paper is derived, as the terminology doesn't appear to have become standard in the field. The original paper is by Rubin, but Gelman is writing from Meng.




Meng, X. (1994). Posterior Predictive p-Values. The Annals of Statistics, 22(3), 1142-1160.




As to your questions:




I am trying to interpret what is meant when he says 'model is true'. My questions are:



i) Statistically, what is a "true model" as said in the quote above?



ii) What does a value of 0.5 mean in simple words?




So there is some unfortunate language usage as p-values are a Frequentist idea and Bayesian methods do not have p-values. Nonetheless, within the context of the articles beginning with Rubin, we can discuss the idea of a Bayesian p-value in a narrow sense.



As to question one, Bayesian models do not falsify a null hypothesis. In fact, except where some method is intending to mimic Frequentist methods, as in this paper, the phrasing "null hypothesis" is rarely used. Instead, Bayesian methods are generative methods and are usually constructed from a different set of axioms.



The easiest way to approach your question is from Cox's axioms.




Cox, R. T. (1961). The Algebra of Probable Inference. Baltimore, MD: Johns Hopkins University Press.




Cox's first axiom is that the plausibility of a proposition is a real number that varies with the information related to the proposition. Notice that the word probability wasn't used, as this also allows us to think in terms of odds or other mechanisms. This differs very strongly from null hypothesis methods. To see an example, consider binary hypotheses $H_1, H_2$, which in Frequentist methods would be denoted $H_0, H_A$. What is the difference?



$H_0$ is conditioned to be true, and the p-value gives the probability of observing the sample (or one at least as extreme), given that the null is true. It does not test whether the null is actually true or false, and $H_A$ has no form of probability statement attached to it. So, if $p < .05$, this does not imply that $\Pr(H_A) > .95$.



In the Bayesian framework, each proposition has a probability assigned to it, so that if $H_1: \mu \le k$ and $H_2: \mu > k$, then it follows that if $\Pr(H_1) = .7327$ then $\Pr(H_2) = .2673$.



The true model is the model that generated the data in nature. This varies from the Frequentist method which depends only on the sampling distribution of the statistic, generally.



As to question two, Gelman is responding to Meng. He was pointing out that, in a broad variety of circumstances, if the null hypothesis is true then the posterior predictive p-value will cluster around .5 when you average over the sample space. He provides a case where this is useful and one where it is a problem. However, the hint as to why comes from the examples, particularly the first.



The first has been constructed so that there is great prior uncertainty, and this use of a nearly uninformative prior propagates through to the predictive distribution in such a way that, almost regardless of your sample, Rubin and Meng's posterior predictive p-values will be near 50%. In this case, it would tell you there is a 50% chance the null is true, which is highly undesirable since you would rather be near 100%, or, in the case of falsehood, 0%.
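
As a hedged illustration of that clustering (using an assumed toy Normal model with a flat prior rather than Gelman's exact construction), the sketch below draws many 'observed' datasets from the true model, computes a posterior predictive p-value for each, and shows that the values pile up around 0.5 instead of spreading uniformly over (0, 1).

```
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_rep, n_datasets = 20, 1.0, 2000, 500


def ppp_value(y):
    """Posterior predictive p-value with T = sample mean, flat prior, known sigma."""
    theta_draws = rng.normal(y.mean(), sigma / np.sqrt(n), size=n_rep)
    # The mean of a size-n replicated dataset from N(theta, sigma^2) is itself
    # N(theta, sigma^2 / n), so T(y_rep) can be drawn directly.
    T_rep = rng.normal(theta_draws, sigma / np.sqrt(n))
    return np.mean(T_rep > y.mean())


# Repeatedly generate data from the *true* model (theta = 0 here) and record
# the posterior predictive p-value for each dataset.
pvals = np.array([ppp_value(rng.normal(0.0, sigma, size=n)) for _ in range(n_datasets)])

# Under the true model these p-values concentrate near 0.5 rather than being
# uniform on (0, 1).
print(np.quantile(pvals, [0.05, 0.5, 0.95]))
```

For a location statistic like the mean the centring is essentially exact, so this is the extreme version of the behaviour; less symmetric test statistics give distributions that are less degenerate but, per the paper's title, still far from uniform.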



The idea of Bayesian posterior p-values rests on the observation that, since you are now treating the sample space as random rather than the parameter space, the rough interpretation of this Bayesian posterior probability is remarkably close to the Frequentist p-value. It is problematic because the model is not considered a parameter in itself and has no prior probability, as would be the case in a test of many different models. The model, $H_A$ versus $H_B$, is implicit.



This article is a warning about something that should, in a sense, be obvious. Imagine you had fifty million data points and there was no ambiguity as to the location of the parameter; then you would be stunned if the resulting predictive distribution were a bad estimator over the sample space. Now consider a model where the results are ambiguous and the prior is at best weakly informative; then, even if the model is true, it would be surprising to get a clear result from the posterior predictive distribution.



He provides an example where data is drawn from a population that has a standard normal distribution. The required sample would have to be 28,000 to get a rejection of the model. In a standard normal population, that will never happen.



The concern is about the propagation of uncertainty and whether or not Rubin/Meng's idea generates a useful construct when it is needed most, that is, when the data are poor, small, weak, or ambiguous, as opposed to samples that are stunningly clear. As an out-of-sample testing tool, its sampling properties are undesirable in some circumstances but desirable in others.



In this case, what Gelman is saying is that, regardless of the true probability of the model, the out-of-sample validation score provided by the Bayesian posterior predictive p-value will be near 50% when the null is true and the data don't clearly point to a solution.



This has led to the criticism that the idea is uncalibrated with respect to the true probabilities. See




Bayarri, M. J. and Berger, J. (2000). P-values for composite null models. Journal of the American Statistical Association 95, 1127–1142.







  • very good feedback – user121, Aug 17 at 4:49









