
Commit 1f0d793

Author: juanitorduz
add out-os-sample section

1 parent 5288e05

File tree

2 files changed: +519 -38 lines changed

examples/case_studies/BART_introduction.ipynb

Lines changed: 410 additions & 34 deletions
Large diffs are not rendered by default.

examples/case_studies/BART_introduction.myst.md

Lines changed: 109 additions & 4 deletions
@@ -5,7 +5,7 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: pymc-examples-env
   language: python
   name: python3
 ---
@@ -28,6 +28,8 @@ import pandas as pd
 import pymc as pm
 import pymc_bart as pmb
 
+%config InlineBackend.figure_format = "retina"
+
 print(f"Running on PyMC v{pm.__version__}")
 ```
@@ -86,7 +88,7 @@ In PyMC a BART variable can be defined very similar to other random variables. O
 ```{code-cell} ipython3
 with pm.Model() as model_coal:
     μ_ = pmb.BART("μ_", X=x_data, Y=y_data, m=20)
-    μ = pm.Deterministic("μ", np.abs(μ_))
+    μ = pm.Deterministic("μ", pm.math.abs(μ_))
     y_pred = pm.Poisson("y_pred", mu=μ, observed=y_data)
     idata_coal = pm.sample(random_seed=RANDOM_SEED)
 ```
@@ -137,15 +139,17 @@ try:
 except FileNotFoundError:
     bikes = pd.read_csv(pm.get_data("bikes.csv"))
 
-X = bikes[["hour", "temperature", "humidity", "workingday"]]
+features = ["hour", "temperature", "humidity", "workingday"]
+
+X = bikes[features]
 Y = bikes["count"]
 ```
 
 ```{code-cell} ipython3
 with pm.Model() as model_bikes:
     α = pm.Exponential("α", 1 / 10)
     μ = pmb.BART("μ", X, Y)
-    y = pm.NegativeBinomial("y", mu=np.abs(μ), alpha=α, observed=Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y)
     idata_bikes = pm.sample(random_seed=RANDOM_SEED)
 ```

@@ -179,11 +183,112 @@ Additionally, PyMC-BART provides a novel method to assess the variable importanc
 pmb.plot_variable_importance(idata_bikes, μ, X, samples=100);
 ```
 
+### Out-of-Sample Predictions
+
+In this section we show how to generate out-of-sample predictions with BART. We use the same dataset as before, but this time we split the data into a training and a test set: we fit the model on the training set and evaluate it on the test set.
+As this is a time series problem, we need to make sure we do not shuffle the data. Hence we do the split using the `hour` feature.
+
+```{code-cell} ipython3
+train_test_hour_split = 19
+
+train_bikes = bikes.query("hour <= @train_test_hour_split")
+test_bikes = bikes.query("hour > @train_test_hour_split")
+
+X_train = train_bikes[features]
+Y_train = train_bikes["count"]
+
+X_test = test_bikes[features]
+Y_test = test_bikes["count"]
+```
+
+Now we fit the same model as above, but this time using a *shared variable* for the covariates so that we can later replace them to generate the out-of-sample predictions.
+
+```{code-cell} ipython3
+with pm.Model() as model_bikes_train_test:
+    X = pm.MutableData("X", X_train)
+    Y = Y_train
+    α = pm.Exponential("α", 1 / 10)
+    μ = pmb.BART("μ", X, Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
+    idata_bikes_train = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_train = pm.sample_posterior_predictive(
+        trace=idata_bikes_train, random_seed=RANDOM_SEED
+    )
+```
+
+Next, we replace the data in the model and sample from the posterior predictive distribution.
+
+```{code-cell} ipython3
+with model_bikes_train_test:
+    X.set_value(X_test)
+    posterior_predictive_test = pm.sample_posterior_predictive(
+        trace=idata_bikes_train, random_seed=RANDOM_SEED
+    )
+```
+
+Finally, let's plot the results:
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(figsize=(12, 6))
+az.plot_hdi(
+    x=X_train.index,
+    y=posterior_predictive_train.posterior_predictive["y"],
+    hdi_prob=0.94,
+    color="C0",
+    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (train)"},
+    smooth=False,
+    ax=ax,
+)
+az.plot_hdi(
+    x=X_train.index,
+    y=posterior_predictive_train.posterior_predictive["y"],
+    hdi_prob=0.5,
+    color="C0",
+    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (train)"},
+    smooth=False,
+    ax=ax,
+)
+ax.plot(X_train.index, Y_train, label="train (observed)")
+az.plot_hdi(
+    x=X_test.index,
+    y=posterior_predictive_test.posterior_predictive["y"],
+    hdi_prob=0.94,
+    color="C1",
+    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (test)"},
+    smooth=False,
+    ax=ax,
+)
+az.plot_hdi(
+    x=X_test.index,
+    y=posterior_predictive_test.posterior_predictive["y"],
+    hdi_prob=0.5,
+    color="C1",
+    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (test)"},
+    smooth=False,
+    ax=ax,
+)
+ax.plot(X_test.index, Y_test, label="test (observed)")
+ax.axvline(X_train.shape[0], color="k", linestyle="--", label="train/test split")
+ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
+ax.set(
+    title="BART model predictions for bike rentals",
+    xlabel="observation index",
+    ylabel="number of rentals",
+);
+```
+
+The out-of-sample predictions look a bit odd. Why? Note that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only contains `hour` values up to $19$. As BART learns how to partition the (training) data, it cannot differentiate between `hour` values of, say, $20$ and $22$; it only registers that both values are greater than $19$. This is very important to understand when using BART! It also explains why one should not use BART for time series forecasting when there is a trend component: in that case it is better to detrend the data first, model the remainder with BART, and model the trend with a different model.
+
++++
+
 ## Authors
 * Authored by Osvaldo Martin in Dec, 2021 ([pymc-examples#259](https://github.com/pymc-devs/pymc-examples/pull/259))
 * Updated by Osvaldo Martin in May, 2022 ([pymc-examples#323](https://github.com/pymc-devs/pymc-examples/pull/323))
 * Updated by Osvaldo Martin in Sep, 2022
 * Updated by Osvaldo Martin in Nov, 2022
+* Out-of-sample section added by Juan Orduz in Jan, 2023
 
 +++
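The extrapolation behaviour described in the new out-of-sample section (BART cannot distinguish covariate values beyond the training range, it only knows both are past the last split) can be illustrated with a single decision tree standing in for BART's sum of trees. This is a hypothetical sketch using scikit-learn, which the notebook itself does not use:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# train only on hour <= 19, mirroring the train/test split in the diff
hour_train = np.arange(0, 20).reshape(-1, 1)
y_train = np.sin(hour_train.ravel() / 3.0)  # some hour-dependent signal

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(hour_train, y_train)

# hours 20, 21 and 22 were never observed, so every split threshold
# lies at or below 19: all three inputs fall into the same leaf and
# receive the same prediction
preds = tree.predict(np.array([[20.0], [21.0], [22.0]]))
print(preds)
```

A sum of trees behaves the same way leaf by leaf, which is why the test-set predictions for hours 20 to 23 look flat.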

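The closing advice in the new section, detrend first and model the remainder with BART, can be sketched as follows. This is a minimal illustration on synthetic data; the linear trend fit via `np.polyfit` is one simple choice, and in practice the detrended remainder would be handed to `pmb.BART` rather than left as-is:

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic series: linear trend plus a periodic remainder plus noise
t = np.arange(200, dtype=float)
y = 0.5 * t + 10.0 + 5.0 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=0.5, size=t.size)

# 1. fit a simple trend model (here: ordinary least squares on time)
slope, intercept = np.polyfit(t, y, deg=1)
trend_hat = slope * t + intercept

# 2. detrend; this remainder is what a BART model would be fit on
y_detrended = y - trend_hat

# 3. a forecast combines the extrapolated trend with the
#    remainder model's prediction for the future covariates
t_future = np.arange(200, 224, dtype=float)
trend_forecast = slope * t_future + intercept
```

The trend component extrapolates where the tree-based remainder model cannot, which is the point of the decomposition.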