
Commit ae24deb

juanitorduz committed: add regression example
1 parent 1f0d793 · commit ae24deb

File tree

2 files changed: +487 −54 lines


examples/case_studies/BART_introduction.ipynb

Lines changed: 390 additions & 38 deletions
Large diffs are not rendered by default.

examples/case_studies/BART_introduction.myst.md

Lines changed: 97 additions & 16 deletions
@@ -27,6 +27,9 @@ import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
+import seaborn as sns
+
+from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = "retina"
@@ -186,7 +189,69 @@ pmb.plot_variable_importance(idata_bikes, μ, X, samples=100);
### Out-of-Sample Predictions

In this section we want to show how to do out-of-sample predictions with BART. We are going to use the same dataset as before, but this time we are going to split the data into a training and a test set. We are going to use the training set to fit the model and the test set to evaluate the model.
-As this is a time series problem we need to make sure we do not shuffle the data. Hence we do the split using the `hour` feature.
+
++++
+
+#### Regression
+
+Let's start by modelling this data as a regression problem. In this context we randomly split the data into a training and a test set.
+
+```{code-cell} ipython3
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
+```
+
+Now, we fit the same model as above, but this time using a *shared variable* for the covariates so that we can later replace its values to generate the out-of-sample predictions.
+
+```{code-cell} ipython3
+with pm.Model() as model_oos_regression:
+    X = pm.MutableData("X", X_train)
+    Y = Y_train
+    α = pm.Exponential("α", 1 / 10)
+    μ = pmb.BART("μ", X, Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
+    idata_oos_regression = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_oos_regression_train = pm.sample_posterior_predictive(
+        trace=idata_oos_regression, random_seed=RANDOM_SEED
+    )
+```
+
+Next, we replace the data in the model and sample from the posterior predictive distribution.
+
+```{code-cell} ipython3
+with model_oos_regression:
+    X.set_value(X_test)
+    posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
+        trace=idata_oos_regression, random_seed=RANDOM_SEED
+    )
+```
+
+Finally, we can compare the posterior predictive distribution with the observed data.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(
+    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
+)
+
+az.plot_ppc(
+    data=posterior_predictive_oos_regression_train, kind="cumulative", observed_rug=True, ax=ax[0]
+)
+ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
+
+az.plot_ppc(
+    data=posterior_predictive_oos_regression_test, kind="cumulative", observed_rug=True, ax=ax[1]
+)
+ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
+```
+
+Yay! The results look quite reasonable 🙂!
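The cumulative posterior predictive plots give a visual check; a numeric summary can complement them. Below is a minimal sketch that compares the posterior predictive mean against the observed counts via the root mean squared error. It assumes the `posterior_predictive_oos_regression_*` objects and `Y_train` / `Y_test` from the cells above are in scope; the RMSE summary and the helper name `ppc_rmse` are illustrative additions, not part of this commit.

```python
# Sketch: a numeric check of the regression fit. Assumes the objects created in
# the cells above are in scope; RMSE is an illustrative choice of summary.
import numpy as np


def ppc_rmse(idata_ppc, observed):
    # Collapse the posterior predictive samples over chains and draws to a
    # point prediction, then compare it with the observed counts.
    y_pred = idata_ppc.posterior_predictive["y"].mean(dim=["chain", "draw"])
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(observed)) ** 2)))


print("train RMSE:", ppc_rmse(posterior_predictive_oos_regression_train, Y_train))
print("test RMSE:", ppc_rmse(posterior_predictive_oos_regression_test, Y_test))
```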
+
++++
+
+#### Time Series
+
+We can view the same data from a *time series* perspective using the `hour` feature. From this point of view, we need to make sure we do not shuffle the data so that we do not leak information. Thus, we define the train-test split using the `hour` feature.

```{code-cell} ipython3
train_test_hour_split = 19
@@ -201,40 +266,56 @@ X_test = test_bikes[features]
Y_test = test_bikes["count"]
```
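Only the opening and closing lines of this split cell fall inside the diff context shown here. For orientation, a time-ordered split consistent with those context lines might look like the sketch below; the `bikes` dataframe, the `query`-based filtering, and `train_bikes` are assumptions, not lines taken from this commit.

```python
# Hypothetical sketch of a time-ordered train-test split. The `bikes` dataframe
# and the query-based filtering are assumptions; only `train_test_hour_split`,
# `features`, `test_bikes`, and the X/Y assignments appear in the context above.
train_bikes = bikes.query("hour <= @train_test_hour_split")
test_bikes = bikes.query("hour > @train_test_hour_split")

X_train = train_bikes[features]
Y_train = train_bikes["count"]

X_test = test_bikes[features]
Y_test = test_bikes["count"]
```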

-Now, we fit the same model as above but this time using a *shared variable* for the covariatates so that we can then replace them to generate the out of sample predictions.
+We can then run the same model (but with different input data!) and generate out-of-sample predictions as above.

```{code-cell} ipython3
-with pm.Model() as model_bikes_train_test:
+with pm.Model() as model_oos_ts:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
-    idata_bikes_train = pm.sample(random_seed=RANDOM_SEED)
-    posterior_predictive_train = pm.sample_posterior_predictive(
-        trace=idata_bikes_train, random_seed=RANDOM_SEED
+    idata_oos_ts = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_oos_ts_train = pm.sample_posterior_predictive(
+        trace=idata_oos_ts, random_seed=RANDOM_SEED
    )
```

-Next, we replace the data in the model and sample from the posterior predictive distribution.
+We generate out-of-sample predictions.

```{code-cell} ipython3
-with model_bikes_train_test:
+with model_oos_ts:
    X.set_value(X_test)
-    posterior_predictive_test = pm.sample_posterior_predictive(
-        trace=idata_bikes_train, random_seed=RANDOM_SEED
+    posterior_predictive_oos_ts_test = pm.sample_posterior_predictive(
+        trace=idata_oos_ts, random_seed=RANDOM_SEED
    )
```

-Finally, let's plot the results:
+As above, we can compare the posterior predictive distribution with the observed data.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(
+    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
+)
+
+az.plot_ppc(data=posterior_predictive_oos_ts_train, kind="cumulative", observed_rug=True, ax=ax[0])
+ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
+
+az.plot_ppc(data=posterior_predictive_oos_ts_test, kind="cumulative", observed_rug=True, ax=ax[1])
+ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
+```
+
+Wow! This does not look right! The predictions on the test set look very odd 🤔. To better understand what is going on, we can plot the predictions as time series:

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(figsize=(12, 6))
az.plot_hdi(
    x=X_train.index,
-    y=posterior_predictive_train.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (train)"},
@@ -243,7 +324,7 @@ az.plot_hdi(
)
az.plot_hdi(
    x=X_train.index,
-    y=posterior_predictive_train.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
    hdi_prob=0.5,
    color="C0",
    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (train)"},
@@ -253,7 +334,7 @@ az.plot_hdi(
ax.plot(X_train.index, Y_train, label="train (observed)")
az.plot_hdi(
    x=X_test.index,
-    y=posterior_predictive_test.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (test)"},
@@ -262,7 +343,7 @@ az.plot_hdi(
)
az.plot_hdi(
    x=X_test.index,
-    y=posterior_predictive_test.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
    hdi_prob=0.5,
    color="C1",
    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (test)"},
@@ -279,7 +360,7 @@ ax.set(
);
```

-The out-of-sample predictions look a bit odd. Why? Well, note that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data just sees `hour` values until $19$. As BART learns how to partition the (training) data, it can not differentiate between `hour` values between $20$ and $22$ for example. It just cares that both values are greater that $19$. This is very important to understand when using BART! This explains why one should not use BART for time series forecasting if there is a trend component. In this case it is better to detrend the data first, model the remainder with BART and model the trend with a different model.
+This plot helps us understand the reason behind the bad performance on the test set: recall that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only sees `hour` values up to $19$ (since that is our train-test threshold). As BART learns how to partition the (training) data, it cannot differentiate between `hour` values of, say, $20$ and $22$; it only cares that both values are greater than $19$. This is very important to understand when using BART! It also explains why one should not use BART for time series forecasting if there is a trend component. In that case it is better to detrend the data first, model the remainder with BART, and model the trend with a different model.

+++
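The closing recommendation can be made concrete with a small, self-contained sketch: fit a simple trend on the training window, let BART model the detrended remainder using features that stay inside their training range, and add the extrapolated trend back at prediction time. Everything below (the toy data, the linear trend, the Normal likelihood on the residuals, and the variable names) is an illustrative assumption and not part of this commit or the upstream notebook.

```python
# Minimal sketch of the "detrend first, then BART" idea. All data and names
# here are illustrative assumptions, not part of the bikes example.
import numpy as np
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(42)

# Toy hourly series: linear trend + nonlinear daily cycle + noise.
t = np.arange(240)
y = 0.05 * t + 3 * np.sin(2 * np.pi * (t % 24) / 24) + rng.normal(0, 0.5, t.size)

# Time-ordered split (no shuffling).
t_train, t_test = t[:192], t[192:]
y_train, y_test = y[:192], y[192:]

# 1. Model the trend with a simple linear fit on the training window only.
slope, intercept = np.polyfit(t_train, y_train, deg=1)
trend_train = intercept + slope * t_train
trend_test = intercept + slope * t_test  # extrapolates beyond the training window

# 2. Model the detrended remainder with BART, using features that do not grow
#    with time (here: hour of day), so test inputs stay inside the training range.
X_train = (t_train % 24).reshape(-1, 1)
X_test = (t_test % 24).reshape(-1, 1)
resid_train = y_train - trend_train

with pm.Model() as detrended_model:
    X = pm.MutableData("X", X_train)
    mu = pmb.BART("mu", X, resid_train)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("resid", mu=mu, sigma=sigma, observed=resid_train, shape=mu.shape)
    idata = pm.sample(random_seed=42)

with detrended_model:
    X.set_value(X_test)
    ppc = pm.sample_posterior_predictive(trace=idata, random_seed=42)

# 3. Forecast = extrapolated trend + BART prediction of the remainder.
resid_pred = ppc.posterior_predictive["resid"].mean(dim=["chain", "draw"]).to_numpy()
y_forecast = trend_test + resid_pred
```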