Description
This issue was originally raised on Discourse; see that thread for more details.
Suppose x and y are the predictor and response for a simple linear regression. I fit the model and draw posterior predictive samples with sample_ppc as follows:
import pymc3 as pm

# observed_x, observed_y, and n (the number of data points) are assumed
# to be defined beforehand.
with pm.Model() as model1:
    x = pm.Uniform('x', lower=0.0, upper=10.0, observed=observed_x, shape=n)
    sigma = pm.HalfCauchy('sigma', beta=1.0)
    beta = pm.Normal('beta', sd=5.0)
    alpha = pm.Normal('alpha', sd=5.0)
    y_hat = beta * x + alpha
    y = pm.Normal('y', mu=y_hat, sd=sigma, shape=n, observed=observed_y)
    trace = pm.sample()
    ppc = pm.sample_ppc(trace, samples=1)
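For reference, a hypothetical data setup consistent with this report (the values here are made up, not taken from the original) might look like:

import numpy as np

n = 100
observed_x = np.random.uniform(0.0, 10.0, size=n)
# hypothetical true parameters, for illustration only
observed_y = 2.0 * observed_x + 1.0 + np.random.normal(scale=0.5, size=n)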
Then I plot the sampled y against the sampled x,

plt.scatter(ppc['x'], ppc['y'])

and I find that they are uncorrelated.
Now, I was expecting it to instead look like the plot produced by

plt.scatter(observed_x, ppc['y'])
@junpenglao pointed out that sample_ppc is not drawing samples of y conditional on the newly drawn values of x, but rather conditional on the observed data observed_x, which is not what one might expect. As a consequence, the sampled x and y are not draws from their joint distribution and do not show the right correlation.
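As a rough check of this explanation (a sketch based on the model above, not an official fix), one can condition y on the resampled x by pushing ppc['x'] through posterior draws of alpha, beta, and sigma by hand; the resulting scatter plot then shows the expected correlation:

import numpy as np

# Take the single posterior-predictive draw of x and one posterior draw of the
# parameters, then generate y conditional on the new x values.
x_new = ppc['x'][0]
idx = np.random.randint(len(trace['beta']))
mu_new = trace['beta'][idx] * x_new + trace['alpha'][idx]
y_new = np.random.normal(loc=mu_new, scale=trace['sigma'][idx])

plt.scatter(x_new, y_new)  # now shows the fitted linear trend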