Commit dbcce72

Juan Orduz authored

Out-of-sample predictions for BART intro notebook (#507)

* add out-of-sample section
* add regression example

1 parent 951c0ef commit dbcce72

2 files changed: +952 -38 lines changed

examples/case_studies/BART_introduction.ipynb

Lines changed: 762 additions & 34 deletions
Large diffs are not rendered by default.

examples/case_studies/BART_introduction.myst.md

Lines changed: 190 additions & 4 deletions
@@ -5,7 +5,7 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: pymc-examples-env
   language: python
   name: python3
 ---
@@ -27,6 +27,11 @@ import numpy as np
 import pandas as pd
 import pymc as pm
 import pymc_bart as pmb
+import seaborn as sns
+
+from sklearn.model_selection import train_test_split
+
+%config InlineBackend.figure_format = "retina"
 
 print(f"Running on PyMC v{pm.__version__}")
 ```
@@ -86,7 +91,7 @@ In PyMC a BART variable can be defined very similar to other random variables. O
 ```{code-cell} ipython3
 with pm.Model() as model_coal:
     μ_ = pmb.BART("μ_", X=x_data, Y=y_data, m=20)
-    μ = pm.Deterministic("μ", np.abs(μ_))
+    μ = pm.Deterministic("μ", pm.math.abs(μ_))
     y_pred = pm.Poisson("y_pred", mu=μ, observed=y_data)
     idata_coal = pm.sample(random_seed=RANDOM_SEED)
 ```
@@ -137,15 +142,17 @@ try:
 except FileNotFoundError:
     bikes = pd.read_csv(pm.get_data("bikes.csv"))
 
-X = bikes[["hour", "temperature", "humidity", "workingday"]]
+features = ["hour", "temperature", "humidity", "workingday"]
+
+X = bikes[features]
 Y = bikes["count"]
 ```
 
 ```{code-cell} ipython3
 with pm.Model() as model_bikes:
     α = pm.Exponential("α", 1 / 10)
     μ = pmb.BART("μ", X, Y)
-    y = pm.NegativeBinomial("y", mu=np.abs(μ), alpha=α, observed=Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y)
     idata_bikes = pm.sample(random_seed=RANDOM_SEED)
 ```
 
@@ -179,11 +186,190 @@ Additionally, PyMC-BART provides a novel method to assess the variable importanc
 pmb.plot_variable_importance(idata_bikes, μ, X, samples=100);
 ```
 
+### Out-of-Sample Predictions
+
+In this section we show how to generate out-of-sample predictions with BART. We use the same dataset as before, but this time we split it into a training set and a test set: we fit the model on the training set and evaluate it on the test set.
+
++++
+
+#### Regression
+
+Let's start by modelling this data as a regression problem. In this context, we randomly split the data into training and test sets.
+
+```{code-cell} ipython3
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
+```
+
+Now, we fit the same model as above, but this time we use a *shared variable* for the covariates so that we can later replace it with the test data to generate out-of-sample predictions.
+
+```{code-cell} ipython3
+with pm.Model() as model_oos_regression:
+    X = pm.MutableData("X", X_train)
+    Y = Y_train
+    α = pm.Exponential("α", 1 / 10)
+    μ = pmb.BART("μ", X, Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
+    idata_oos_regression = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_oos_regression_train = pm.sample_posterior_predictive(
+        trace=idata_oos_regression, random_seed=RANDOM_SEED
+    )
+```
+
+Next, we replace the data in the model and sample from the posterior predictive distribution.
+
+```{code-cell} ipython3
+with model_oos_regression:
+    X.set_value(X_test)
+    posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
+        trace=idata_oos_regression, random_seed=RANDOM_SEED
+    )
+```
+
+Finally, we can compare the posterior predictive distribution with the observed data.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(
+    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
+)
+
+az.plot_ppc(
+    data=posterior_predictive_oos_regression_train, kind="cumulative", observed_rug=True, ax=ax[0]
+)
+ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
+
+az.plot_ppc(
+    data=posterior_predictive_oos_regression_test, kind="cumulative", observed_rug=True, ax=ax[1]
+)
+ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
+```
+
+Yay! The results look quite reasonable 🙂!
+
++++
+
+#### Time Series
+
+We can view the same data from a *time series* perspective using the `hour` feature. From this point of view, we need to make sure we do not shuffle the data so that we do not leak information. Thus, we define the train-test split using the `hour` feature.
+
+```{code-cell} ipython3
+train_test_hour_split = 19
+
+train_bikes = bikes.query("hour <= @train_test_hour_split")
+test_bikes = bikes.query("hour > @train_test_hour_split")
+
+X_train = train_bikes[features]
+Y_train = train_bikes["count"]
+
+X_test = test_bikes[features]
+Y_test = test_bikes["count"]
+```
+
+We can then run the same model (but with different input data!) and generate out-of-sample predictions as above.
+
+```{code-cell} ipython3
+with pm.Model() as model_oos_ts:
+    X = pm.MutableData("X", X_train)
+    Y = Y_train
+    α = pm.Exponential("α", 1 / 10)
+    μ = pmb.BART("μ", X, Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
+    idata_oos_ts = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_oos_ts_train = pm.sample_posterior_predictive(
+        trace=idata_oos_ts, random_seed=RANDOM_SEED
+    )
+```
+
+We generate out-of-sample predictions.
+
+```{code-cell} ipython3
+with model_oos_ts:
+    X.set_value(X_test)
+    posterior_predictive_oos_ts_test = pm.sample_posterior_predictive(
+        trace=idata_oos_ts, random_seed=RANDOM_SEED
+    )
+```
+
+As above, we can compare the posterior predictive distribution with the observed data.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(
+    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
+)
+
+az.plot_ppc(data=posterior_predictive_oos_ts_train, kind="cumulative", observed_rug=True, ax=ax[0])
+ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
+
+az.plot_ppc(data=posterior_predictive_oos_ts_test, kind="cumulative", observed_rug=True, ax=ax[1])
+ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
+```
+
+Wow! This does not look right! The predictions on the test set look very odd 🤔. To better understand what is going on, we can plot the predictions as time series:
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(figsize=(12, 6))
+az.plot_hdi(
+    x=X_train.index,
+    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
+    hdi_prob=0.94,
+    color="C0",
+    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (train)"},
+    smooth=False,
+    ax=ax,
+)
+az.plot_hdi(
+    x=X_train.index,
+    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
+    hdi_prob=0.5,
+    color="C0",
+    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (train)"},
+    smooth=False,
+    ax=ax,
+)
+ax.plot(X_train.index, Y_train, label="train (observed)")
+az.plot_hdi(
+    x=X_test.index,
+    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
+    hdi_prob=0.94,
+    color="C1",
+    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (test)"},
+    smooth=False,
+    ax=ax,
+)
+az.plot_hdi(
+    x=X_test.index,
+    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
+    hdi_prob=0.5,
+    color="C1",
+    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (test)"},
+    smooth=False,
+    ax=ax,
+)
+ax.plot(X_test.index, Y_test, label="test (observed)")
+ax.axvline(X_train.shape[0], color="k", linestyle="--", label="train/test split")
+ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
+ax.set(
+    title="BART model predictions for bike rentals",
+    xlabel="observation index",
+    ylabel="number of rentals",
+);
+```
+
+This plot helps us understand the reason behind the poor performance on the test set: recall that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only contains `hour` values up to $19$ (since that is our train-test threshold). As BART learns how to partition the (training) data, it cannot differentiate between `hour` values of, say, $20$ and $22$; it only cares that both values are greater than $19$. This is very important to understand when using BART! It also explains why one should not use BART for time series forecasting when there is a trend component. In that case it is better to detrend the data first, model the remainder with BART, and model the trend with a different model (a minimal sketch of this idea follows after the diff).
+
++++
+
 ## Authors
 * Authored by Osvaldo Martin in Dec, 2021 ([pymc-examples#259](https://github.com/pymc-devs/pymc-examples/pull/259))
 * Updated by Osvaldo Martin in May, 2022 ([pymc-examples#323](https://github.com/pymc-devs/pymc-examples/pull/323))
 * Updated by Osvaldo Martin in Sep, 2022
 * Updated by Osvaldo Martin in Nov, 2022
+* Juan Orduz added out-of-sample section in Jan, 2023
 
 +++

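The new notebook section closes by recommending that, when the data has a trend, one should detrend first, model the remainder with BART, and model the trend separately. The snippet below is a minimal sketch of that idea, not part of the commit: it assumes the `X_train`, `Y_train`, `X_test`, and `RANDOM_SEED` objects from the notebook, uses a plain linear trend over the observation index, and all other names (`model_detrended`, `resid_train`, `y_forecast`, ...) are hypothetical helpers introduced only for illustration.

```python
# Hypothetical sketch of "detrend, then BART on the remainder" (not part of this commit).
# Assumes X_train, Y_train, X_test and RANDOM_SEED exist as in the notebook.
import numpy as np
import pymc as pm
import pymc_bart as pmb

# 1. Fit a simple linear trend on the training set, using the observation index as a time proxy.
t_train = np.arange(len(Y_train))
slope, intercept = np.polyfit(t_train, Y_train, deg=1)
trend_train = intercept + slope * t_train

# 2. Model the detrended remainder with BART (residuals can be negative, so use a Normal likelihood).
resid_train = Y_train.to_numpy() - trend_train
with pm.Model() as model_detrended:
    X_ = pm.MutableData("X_", X_train)
    μ = pmb.BART("μ", X_, resid_train)
    σ = pm.HalfNormal("σ", 1)
    pm.Normal("resid", mu=μ, sigma=σ, observed=resid_train, shape=μ.shape)
    idata_detrended = pm.sample(random_seed=RANDOM_SEED)

# 3. Predict the remainder out of sample and add the extrapolated trend back.
with model_detrended:
    X_.set_value(X_test)
    pp_test = pm.sample_posterior_predictive(idata_detrended, random_seed=RANDOM_SEED)

t_test = np.arange(len(Y_train), len(Y_train) + len(X_test))
trend_test = intercept + slope * t_test
y_forecast = trend_test + pp_test.posterior_predictive["resid"].mean(("chain", "draw"))
```

The linear trend is only a placeholder: any component that can extrapolate (a regression on time, a Gaussian process with a suitable mean, etc.) could play the trend role, while BART handles the remaining structure it is good at.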