Out-of-sample predictions for BART intro notebook #507

Merged · 2 commits · Jan 26, 2023
796 changes: 762 additions & 34 deletions examples/case_studies/BART_introduction.ipynb

Large diffs are not rendered by default.

194 changes: 190 additions & 4 deletions examples/case_studies/BART_introduction.myst.md
@@ -5,7 +5,7 @@ jupytext:
format_name: myst
format_version: 0.13
kernelspec:
display_name: Python 3 (ipykernel)
display_name: pymc-examples-env
language: python
name: python3
---
@@ -27,6 +27,11 @@ import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
import seaborn as sns

from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = "retina"

print(f"Running on PyMC v{pm.__version__}")
```
@@ -86,7 +91,7 @@ In PyMC a BART variable can be defined very similar to other random variables. O
```{code-cell} ipython3
with pm.Model() as model_coal:
μ_ = pmb.BART("μ_", X=x_data, Y=y_data, m=20)
μ = pm.Deterministic("μ", np.abs(μ_))
μ = pm.Deterministic("μ", pm.math.abs(μ_))
y_pred = pm.Poisson("y_pred", mu=μ, observed=y_data)
idata_coal = pm.sample(random_seed=RANDOM_SEED)
```
@@ -137,15 +142,17 @@ try:
except FileNotFoundError:
bikes = pd.read_csv(pm.get_data("bikes.csv"))

X = bikes[["hour", "temperature", "humidity", "workingday"]]
features = ["hour", "temperature", "humidity", "workingday"]

X = bikes[features]
Y = bikes["count"]
```

```{code-cell} ipython3
with pm.Model() as model_bikes:
α = pm.Exponential("α", 1 / 10)
μ = pmb.BART("μ", X, Y)
y = pm.NegativeBinomial("y", mu=np.abs(μ), alpha=α, observed=Y)
y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y)
idata_bikes = pm.sample(random_seed=RANDOM_SEED)
```

@@ -179,11 +186,190 @@ Additionally, PyMC-BART provides a novel method to assess the variable importanc
pmb.plot_variable_importance(idata_bikes, μ, X, samples=100);
```

### Out-of-Sample Predictions

In this section we show how to generate out-of-sample predictions with BART. We use the same dataset as before, but this time we split it into a training set and a test set: we fit the model on the training set and evaluate it on the test set.

+++

#### Regression

Let's start by modelling this data as a regression problem. In this context we randomly split the data into a training and a test set.

```{code-cell} ipython3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
```

Now, we fit the same model as above, but this time using a *shared variable* for the covariates so that we can later replace it to generate the out-of-sample predictions.

```{code-cell} ipython3
with pm.Model() as model_oos_regression:
X = pm.MutableData("X", X_train)
Y = Y_train
α = pm.Exponential("α", 1 / 10)
μ = pmb.BART("μ", X, Y)
y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
idata_oos_regression = pm.sample(random_seed=RANDOM_SEED)
posterior_predictive_oos_regression_train = pm.sample_posterior_predictive(
trace=idata_oos_regression, random_seed=RANDOM_SEED
)
```

Next, we replace the data in the model and sample from the posterior predictive distribution.

```{code-cell} ipython3
with model_oos_regression:
X.set_value(X_test)
posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
trace=idata_oos_regression, random_seed=RANDOM_SEED
)
```

Finally, we can compare the posterior predictive distribution with the observed data.

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(
nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
)

az.plot_ppc(
data=posterior_predictive_oos_regression_train, kind="cumulative", observed_rug=True, ax=ax[0]
)
ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))

az.plot_ppc(
data=posterior_predictive_oos_regression_test, kind="cumulative", observed_rug=True, ax=ax[1]
)
ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
```

Yay! The results look quite reasonable 🙂!
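
As a rough sanity check (a sketch, not part of this notebook), we can also compare the posterior-predictive mean against the observed test counts, for example via the mean absolute error:

```{code-cell} ipython3
# Optional sanity check (illustrative sketch): compare the posterior-predictive
# mean with the observed test counts via the mean absolute error.
y_pred_test = (
    posterior_predictive_oos_regression_test.posterior_predictive["y"]
    .mean(("chain", "draw"))
    .to_numpy()
)
mae_test = np.mean(np.abs(y_pred_test - Y_test.to_numpy()))
print(f"Test MAE: {mae_test:.1f} rentals")
```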

+++

#### Time Series

We can view the same data from a *time series* perspective using the `hour` feature. From this point of view, we need to make sure we do not shuffle the data, so that we do not leak information. Thus, we define the train-test split using the `hour` feature.

```{code-cell} ipython3
train_test_hour_split = 19

train_bikes = bikes.query("hour <= @train_test_hour_split")
test_bikes = bikes.query("hour > @train_test_hour_split")

X_train = train_bikes[features]
Y_train = train_bikes["count"]

X_test = test_bikes[features]
Y_test = test_bikes["count"]
```

We can then run the same model (but with different input data!) and generate out-of-sample predictions as above.

```{code-cell} ipython3
with pm.Model() as model_oos_ts:
X = pm.MutableData("X", X_train)
Y = Y_train
α = pm.Exponential("α", 1 / 10)
μ = pmb.BART("μ", X, Y)
y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
idata_oos_ts = pm.sample(random_seed=RANDOM_SEED)
posterior_predictive_oos_ts_train = pm.sample_posterior_predictive(
trace=idata_oos_ts, random_seed=RANDOM_SEED
)
```

We generate out-of-sample predictions.

```{code-cell} ipython3
with model_oos_ts:
X.set_value(X_test)
posterior_predictive_oos_ts_test = pm.sample_posterior_predictive(
trace=idata_oos_ts, random_seed=RANDOM_SEED
)
```

As above, we can compare the posterior predictive distribution with the observed data.

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(
nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
)

az.plot_ppc(data=posterior_predictive_oos_ts_train, kind="cumulative", observed_rug=True, ax=ax[0])
ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))

az.plot_ppc(data=posterior_predictive_oos_ts_test, kind="cumulative", observed_rug=True, ax=ax[1])
ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
```

Wow! This does not look right! The predictions on the test set look very odd 🤔. To better understand what is going on, we can plot the predictions as time series:

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(figsize=(12, 6))
az.plot_hdi(
x=X_train.index,
y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
hdi_prob=0.94,
color="C0",
fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (train)"},
smooth=False,
ax=ax,
)
az.plot_hdi(
x=X_train.index,
y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
hdi_prob=0.5,
color="C0",
fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (train)"},
smooth=False,
ax=ax,
)
ax.plot(X_train.index, Y_train, label="train (observed)")
az.plot_hdi(
x=X_test.index,
y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
hdi_prob=0.94,
color="C1",
fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (test)"},
smooth=False,
ax=ax,
)
az.plot_hdi(
x=X_test.index,
y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
hdi_prob=0.5,
color="C1",
fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (test)"},
smooth=False,
ax=ax,
)
ax.plot(X_test.index, Y_test, label="test (observed)")
ax.axvline(X_train.shape[0], color="k", linestyle="--", label="train/test split")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(
title="BART model predictions for bike rentals",
xlabel="observation index",
ylabel="number of rentals",
);
```

This plot helps us understand the reason behind the poor performance on the test set: recall that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only contains `hour` values up to $19$ (since that is our train-test threshold). As BART learns how to partition the (training) data, it cannot differentiate between `hour` values of, say, $20$ and $22$; it only cares that both are greater than $19$. This is very important to understand when using BART! In particular, it explains why one should not use BART for time series forecasting if there is a trend component. In that case it is better to detrend the data first, model the remainder with BART, and model the trend with a different model (a hypothetical sketch of this idea is shown below).
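
The following is a hypothetical sketch of the detrend-then-BART idea, not part of the bikes example: we build a synthetic series with a linear trend plus a daily pattern, fit the trend with a simple linear regression, model the detrended remainder with BART and a Normal likelihood, and add the extrapolated trend back when forecasting. All names here (`y_series`, `model_detrended`, etc.) are placeholders.

```{code-cell} ipython3
# Hypothetical sketch of the detrend-then-BART recommendation above,
# using synthetic data; none of these variables belong to the bikes example.
rng = np.random.default_rng(RANDOM_SEED)
t = np.arange(500)
y_series = 0.5 * t + 50 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, size=t.size)

t_train_s, t_test_s = t[:400], t[400:]
y_train_s, y_test_s = y_series[:400], y_series[400:]

# 1. Fit and remove a linear trend on the training window.
slope, intercept = np.polyfit(t_train_s, y_train_s, deg=1)
detrended = y_train_s - (intercept + slope * t_train_s)

# 2. Model the remainder with BART, using hour-of-day as the covariate.
X_season = (t_train_s % 24)[:, None]
with pm.Model() as model_detrended:
    X_ = pm.MutableData("X_", X_season)
    μ_res = pmb.BART("μ_res", X_, detrended)
    σ = pm.HalfNormal("σ", 5)
    pm.Normal("y_res", mu=μ_res, sigma=σ, observed=detrended, shape=μ_res.shape)
    idata_detrended = pm.sample(random_seed=RANDOM_SEED)

# 3. Forecast: BART predicts the remainder for the test window, while the
#    linear trend is extrapolated analytically and added back.
with model_detrended:
    X_.set_value((t_test_s % 24)[:, None])
    ppc_res = pm.sample_posterior_predictive(idata_detrended, random_seed=RANDOM_SEED)

y_forecast = ppc_res.posterior_predictive["y_res"].mean(("chain", "draw")).to_numpy() + (
    intercept + slope * t_test_s
)
```

Because the trend is handled by a model that can extrapolate, the forecast no longer flattens out beyond the range seen by BART during training.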

+++

## Authors
* Authored by Osvaldo Martin in Dec, 2021 ([pymc-examples#259](https://github.com/pymc-devs/pymc-examples/pull/259))
* Updated by Osvaldo Martin in May, 2022 ([pymc-examples#323](https://github.com/pymc-devs/pymc-examples/pull/323))
* Updated by Osvaldo Martin in Sep, 2022
* Updated by Osvaldo Martin in Nov, 2022
* Juan Orduz added out-of-sample section in Jan, 2023

+++
