
update GLM out of sample prediction notebook to v4 #370


Merged

merged 32 commits on Jun 7, 2022
Commits (32)
a35fddb
create truncated regression example
drbenvincent Jan 24, 2021
bc3d659
delete truncated regression example from main branch
drbenvincent Jan 25, 2021
3aba79b
Merge branch 'pymc-devs:main' into main
drbenvincent Jun 30, 2021
d84d852
create truncated regression example
drbenvincent Jan 24, 2021
d3dabca
delete truncated regression example from main branch
drbenvincent Jan 25, 2021
664ab97
create truncated regression example
drbenvincent Jan 24, 2021
cc6693f
delete truncated regression example from main branch
drbenvincent Jan 25, 2021
612abc4
Merge branch 'main' of https://github.com/drbenvincent/pymc-examples
Feb 1, 2022
f0812aa
Merge branch 'main' of https://github.com/drbenvincent/pymc-examples
Feb 8, 2022
7372a46
Merge remote-tracking branch 'upstream/main' into main
Feb 8, 2022
22c9935
Merge remote-tracking branch 'upstream/main'
Feb 18, 2022
89c5af5
Merge remote-tracking branch 'upstream/main'
Mar 20, 2022
11347c9
Merge remote-tracking branch 'upstream/main'
Apr 2, 2022
4e368ec
Merge remote-tracking branch 'upstream/main'
Apr 9, 2022
d8b04b8
Merge remote-tracking branch 'upstream/main'
Apr 10, 2022
f62b1ef
Merge remote-tracking branch 'upstream/main'
Apr 17, 2022
0e67ecb
Merge remote-tracking branch 'upstream/main'
May 21, 2022
52687a0
fix incorrect statement about pm.NormalMixture
May 21, 2022
f07611c
Merge remote-tracking branch 'upstream/main'
May 30, 2022
7085eeb
Merge remote-tracking branch 'upstream/main'
May 31, 2022
ce771fe
Merge remote-tracking branch 'upstream/main'
Jun 1, 2022
2866e2d
Merge remote-tracking branch 'upstream/main'
Jun 2, 2022
e2ddbaa
Merge remote-tracking branch 'upstream/main'
Jun 4, 2022
d8cb2e1
Merge remote-tracking branch 'upstream/main'
Jun 5, 2022
109466d
Merge remote-tracking branch 'upstream/main'
Jun 5, 2022
4bce72d
update to v4
Jun 5, 2022
92ab809
remove a commented out line of code + add myst file
Jun 5, 2022
f0a959e
spelling
Jun 5, 2022
9ddb9f1
revisions based on feedback
Jun 5, 2022
619b6e3
fix: outputs disappeared for some reason
Jun 5, 2022
7d8916d
updates based on feedback
Jun 5, 2022
9c665fb
revert to using labels for coords
Jun 6, 2022
116,038 changes: 480 additions & 115,558 deletions examples/generalized_linear_models/GLM-out-of-sample-predictions.ipynb

Large diffs are not rendered by default.

261 changes: 133 additions & 128 deletions myst_nbs/generalized_linear_models/GLM-out-of-sample-predictions.myst.md
@@ -6,87 +6,70 @@ jupytext:
format_version: 0.13
jupytext_version: 1.13.7
kernelspec:
display_name: 'Python 3.7.6 64-bit (''website_projects'': conda)'
metadata:
interpreter:
hash: fbddea5140024843998ae64bf59a7579a9660d103062604797ea5984366c686c
display_name: Python 3.9.12 ('pymc-dev-py39')
language: python
name: python3
---

# GLM in PyMC3: Out-Of-Sample Predictions
(GLM-out-of-sample-predictions)=
# Out-Of-Sample Predictions

In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). I am particularly interested in the model definition using [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input!

**Resources**


- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html)

- In particular check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html)

- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.

- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)
:::{post} June, 2022
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
:category: beginner
:::

+++

## Prepare Notebook
For many applications we need to make predictions on out-of-sample data. This experiment was motivated by the discussion in the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input! Note, however, that the GLM sub-module has since been deprecated in favour of [`bambi`](https://github.com/bambinos/bambi); this notebook instead implements a 'raw' PyMC model.
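
For reference, the same model can be written very compactly with `bambi`'s formula interface. The following is a minimal sketch only (it assumes `bambi` is installed and a dataframe `df` with columns `x1`, `x2`, `y` like the one created below); it is not executed in this notebook.

```python
# Sketch only: equivalent formula-based model in bambi (not run here).
import bambi as bmb

model = bmb.Model("y ~ x1 * x2", df, family="bernoulli")
idata = model.fit()  # returns an ArviZ InferenceData object
```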

```{code-cell} ipython3
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9", "grid.color": ".8"})
sns.set_palette(palette="deep")
sns_c = sns.color_palette(palette="deep")

import arviz as az
import patsy
import pymc3 as pm
import pymc as pm
import seaborn as sns

from pymc3 import glm
from scipy.special import expit as inverse_logit
from sklearn.metrics import RocCurveDisplay, accuracy_score, auc, roc_curve
from sklearn.model_selection import train_test_split
```

plt.rcParams["figure.figsize"] = [7, 6]
plt.rcParams["figure.dpi"] = 100
```{code-cell} ipython3
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")
```

## Generate Sample Data

We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
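
Concretely, the next cell simulates from

$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2, \qquad
p = \frac{1}{1 + e^{-z}}, \qquad
y \sim \text{Bernoulli}(p),
$$

with $\beta_0 = -0.5$, $\beta_1 = 1$, $\beta_2 = -1$ and $\beta_{12} = 2$.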

```{code-cell} ipython3
SEED = 42
np.random.seed(SEED)

# Number of data points.
# Number of data points
n = 250
# Create features.
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
# Define target variable.
# Create features
x1 = rng.normal(loc=0.0, scale=2.0, size=n)
x2 = rng.normal(loc=0.0, scale=2.0, size=n)
# Define target variable
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = 1 / (1 + np.exp(-z))
y = np.random.binomial(n=1, p=p, size=n)

p = inverse_logit(z)
# note: a binomial with n=1 is equivalent to a Bernoulli
y = rng.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))

df.head()
```

Let us do some exploration of the data:

```{code-cell} ipython3
sns.pairplot(
data=df, kind="scatter", height=2, plot_kws={"color": sns_c[1]}, diag_kws={"color": sns_c[2]}
);
sns.pairplot(data=df, kind="scatter");
```

- $x_1$ and $x_2$ are not correlated.
@@ -95,91 +78,85 @@ sns.pairplot(

```{code-cell} ipython3
fig, ax = plt.subplots()
sns_c_div = sns.diverging_palette(240, 10, n=2)
sns.scatterplot(x="x1", y="x2", data=df, hue="y", palette=[sns_c_div[0], sns_c_div[-1]])
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
sns.scatterplot(x="x1", y="x2", data=df, hue="y")
ax.legend(title="y")
ax.set(title="Sample Data", xlim=(-9, 9), ylim=(-9, 9));
```

## Prepare Data for Modeling

I wanted to use the *`classmethod`* `from_formula` (see [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). As a workaround, I created the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use *`class`* `pymc3.glm.linear.GLM` (this was motivated by going into the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)).

```{code-cell} ipython3
# Define model formula.
formula = "y ~ x1 * x2"
# Create features.
y, x = patsy.dmatrices(formula_like=formula, data=df)
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)
```

As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix.

```{code-cell} ipython3
print(f"labels = {labels}")
```

Now we do a train-test split.

```{code-cell} ipython3
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=SEED)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
```

## Define and Fit the Model

We now specify the model in PyMC3.
We now specify the model in PyMC.

```{code-cell} ipython3
with pm.Model() as model:
# Set data container.
data = pm.Data("data", x_train)
# Define GLM family.
family = pm.glm.families.Binomial()
# Set priors.
priors = {
"Intercept": pm.Normal.dist(mu=0, sd=10),
"x1": pm.Normal.dist(mu=0, sd=10),
"x2": pm.Normal.dist(mu=0, sd=10),
"x1:x2": pm.Normal.dist(mu=0, sd=10),
}
# Specify model.
glm.GLM(y=y_train, x=data, family=family, intercept=False, labels=labels, priors=priors)
# Configure sampler.
trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)
COORDS = {"coeffs": labels}

with pm.Model(coords=COORDS) as model:
# data containers
X = pm.MutableData("X", x_train)
y = pm.MutableData("y", y_train)
# priors
b = pm.Normal("b", mu=0, sigma=1, dims="coeffs")
# linear model
mu = pm.math.dot(X, b)
# link function
p = pm.Deterministic("p", pm.math.invlogit(mu))
# likelihood
pm.Bernoulli("obs", p=p, observed=y)

pm.model_to_graphviz(model)
```

```{code-cell} ipython3
# Plot chains.
az.plot_trace(data=trace);
with model:
idata = pm.sample()
```

```{code-cell} ipython3
az.summary(trace)
az.plot_trace(idata, var_names="b", compact=False);
```

The chains look good.

+++
```{code-cell} ipython3
az.summary(idata, var_names="b")
```

And we do a good job of recovering the true parameters for this simulated dataset.

```{code-cell} ipython3
az.plot_posterior(
idata, var_names=["b"], ref_val=[intercept, beta_x1, beta_x2, beta_interaction], figsize=(15, 4)
);
```

## Generate Out-Of-Sample Predictions

Now we generate predictions on the test set.

```{code-cell} ipython3
# Update data reference.
pm.set_data({"data": x_test}, model=model)
# Generate posterior samples.
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)
with model:
pm.set_data({"X": x_test, "y": y_test})
idata.extend(pm.sample_posterior_predictive(idata))
```

```{code-cell} ipython3
# Compute the point prediction by taking the mean
# and defining the category via a threshold.
p_test_pred = ppc_test["y"].mean(axis=0)
# Compute the point prediction by taking the mean and defining the category via a threshold.
p_test_pred = idata.posterior_predictive["obs"].mean(dim=["chain", "draw"])
y_test_pred = (p_test_pred >= 0.5).astype("int")
```

@@ -188,24 +165,20 @@ y_test_pred = (p_test_pred >= 0.5).astype("int")
First let us compute the accuracy on the test set.

```{code-cell} ipython3
from sklearn.metrics import accuracy_score

print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
```

Next, we plot the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).

```{code-cell} ipython3
from sklearn.metrics import RocCurveDisplay, auc, roc_curve

fpr, tpr, thresholds = roc_curve(
y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
roc_display = roc_display.plot(ax=ax, marker="o", markersize=4)
ax.set(title="ROC");
```

@@ -238,60 +211,92 @@ Observe that this curve is a hyperbola centered at the singularity point $x_1 = -\beta_2 / \beta_{12}$.
Let us now plot the model decision boundary using a grid:

```{code-cell} ipython3
# Construct grid.
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid

x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)

x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)

# Create features on the grid.
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))

x_grid_ext = np.asarray(x_grid_ext)
def make_grid():
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
return x1_grid, x2_grid, x_grid


x1_grid, x2_grid, x_grid = make_grid()

with model:
# Create features on the grid.
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
x_grid_ext = np.asarray(x_grid_ext)
# set the observed variables
pm.set_data({"X": x_grid_ext})
# calculate pushforward values of `p`
ppc_grid = pm.sample_posterior_predictive(idata, var_names=["p"])
```

# Generate model predictions on the grid.
pm.set_data({"data": x_grid_ext}, model=model)
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)
```{code-cell} ipython3
# grid of predictions
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid.posterior_predictive.p.mean(dim=["chain", "draw"])
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
```

Now we compute the model decision boundary on the grid for visualization purposes.
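
The boundary is the set of points where $p = 0.5$, i.e. where the linear predictor vanishes. Solving for $x_2$ gives

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2 = 0
\quad \Longrightarrow \quad
x_2 = -\frac{\beta_0 + \beta_1 x_1}{\beta_2 + \beta_{12}\, x_1},
$$

which is what the helper below evaluates at the posterior means of the coefficients.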

```{code-cell} ipython3
numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
bd_grid = numerator / denominator

grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid["y"].mean(axis=0)
grid_df.sort_values("p", inplace=True)

p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
def calc_decision_boundary(idata, x1_grid):
# posterior mean of coefficients
intercept = idata.posterior["b"].sel(coeffs="Intercept").mean().data
b1 = idata.posterior["b"].sel(coeffs="x1").mean().data
b2 = idata.posterior["b"].sel(coeffs="x2").mean().data
b1b2 = idata.posterior["b"].sel(coeffs="x1:x2").mean().data
# decision boundary equation
return -(intercept + b1 * x1_grid) / (b2 + b1b2 * x1_grid)
```

We finally get the plot and the predictions on the test set:

```{code-cell} ipython3
fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)

# data
sns.scatterplot(
x=x_test[:, 1].flatten(),
y=x_test[:, 2].flatten(),
hue=y_test,
palette=[sns_c_div[0], sns_c_div[-1]],
ax=ax,
)
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)

# decision boundary
ax.plot(x1_grid, calc_decision_boundary(idata, x1_grid), color="black", linestyle=":")

# grid of predictions
ax.contourf(x1_grid, x2_grid, p_grid, alpha=0.3)

ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.lines[0].set_linestyle("dotted")
ax.set(title="Model Decision Boundary", xlim=(-9, 9), ylim=(-9, 9), xlabel="x1", ylabel="x2");
```

**Remark:** Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a `Deterministic` variable as in [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb).
Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and AUC).
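
One way to do this for the decision boundary is to evaluate the boundary equation for every posterior draw and summarize the resulting curves, e.g. as an HDI band. A minimal sketch, assuming `idata`, `x1_grid`, and the `coeffs` coordinate defined above are in scope:

```python
# Sketch only: one decision boundary per posterior draw, summarized as an HDI band.
import xarray as xr

x1 = xr.DataArray(x1_grid, dims="grid")  # broadcasts against (chain, draw)
post = idata.posterior["b"]
b0 = post.sel(coeffs="Intercept")
b1 = post.sel(coeffs="x1")
b2 = post.sel(coeffs="x2")
b12 = post.sel(coeffs="x1:x2")
# shape (chain, draw, grid); unstable near the asymptote where the denominator ~ 0
bd_draws = -(b0 + b1 * x1) / (b2 + b12 * x1)

az.plot_hdi(x1_grid, bd_draws);  # uncertainty band around the boundary
```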

+++

## References

- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)

+++

## Authors
- Created by [Juan Orduz](https://github.com/juanitorduz).
- Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) to PyMC v4 in June 2022

+++

## Watermark

```{code-cell} ipython3
%load_ext watermark
%watermark -n -u -v -iv -w
%watermark -n -u -v -iv -w -p aesara,aeppl
```

:::{include} ../page_footer.md
:::