In this section we want to show how to do out-of-sample predictions with BART. We are going to use the same dataset as before, but this time we split it into a training and a test set: we fit the model on the training set and evaluate it on the test set.
#### Regression
Let's start by modelling this data as a regression problem. In this context we randomly split the data into a training and a test set.
```{code-cell} ipython3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
```
Now, we fit the same model as above, but this time using a *shared variable* for the covariates so that we can later replace them to generate the out-of-sample predictions.
```{code-cell} ipython3
with pm.Model() as model_oos_regression:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
```
#### Time Series

We can view the same data from a *time series* perspective using the `hour` feature. From this point of view, we need to make sure we do not shuffle the data, so that we do not leak information. Thus, we define the train-test split using the `hour` feature.
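A sketch of such a split could look as follows. The `df`, `X`, and `Y` used here are stand-ins for the objects built in the earlier preprocessing, and the cut-off hour of $19$ matches the train-test threshold discussed later in this section:

```python
import numpy as np
import pandas as pd

# Stand-in frame; in the notebook df, X and Y come from the earlier steps.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"hour": rng.integers(0, 24, size=200), "temperature": rng.normal(size=200)}
)
X = df[["hour", "temperature"]]
Y = pd.Series(rng.poisson(lam=10, size=200))

# Keep the time order intact: everything up to this hour is training data.
train_test_hour_split = 19
train_bool = df["hour"] <= train_test_hour_split

X_train, X_test = X[train_bool], X[~train_bool]
Y_train, Y_test = Y[train_bool], Y[~train_bool]
```

Unlike the random split above, this split guarantees that the test set only contains hours the model has never seen.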
We can then run the same model (but with different input data!) and generate out-of-sample predictions as above.
```{code-cell} ipython3
with pm.Model() as model_oos_ts:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
```
Wow! This does not look right! The predictions on the test set look very odd 🤔. To better understand what is going on, we can plot the predictions as time series:
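Such a plot might be sketched as below. The draws here are synthetic stand-ins (in the notebook they come from the posterior predictive of `model_oos_ts`), and the 94% band and variable names are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the observed counts and the predictive draws.
rng = np.random.default_rng(1)
hour = np.arange(24)
observed = rng.poisson(lam=100, size=24)
pred_draws = rng.poisson(lam=100, size=(500, 24))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(hour, observed, "o-", color="C0", label="observed")
ax.plot(hour, pred_draws.mean(axis=0), color="C1", label="posterior predictive mean")
lo, hi = np.percentile(pred_draws, [3, 97], axis=0)
ax.fill_between(hour, lo, hi, color="C1", alpha=0.3, label="94% interval")
ax.axvline(19.5, color="black", linestyle="--", label="train/test split")
ax.set(xlabel="hour", ylabel="number of rented bikes")
ax.legend()
```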
This plot helps us understand the reason behind the bad performance on the test set: recall that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only contains `hour` values up to $19$ (since that is our train-test threshold). As BART learns how to partition the (training) data, it cannot differentiate between `hour` values of, say, $20$ and $22$: it only cares that both values are greater than $19$. This is very important to understand when using BART! It also explains why one should not use BART for time series forecasting if there is a trend component. In that case it is better to detrend the data first, model the remainder with BART, and model the trend with a different model.
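That detrend-then-BART suggestion can be sketched on synthetic data. The linear trend, the `np.polyfit` fit, and the data below are illustrative assumptions; BART itself would then be fit on `remainder`:

```python
import numpy as np

# Synthetic series: linear trend + seasonal component + noise.
rng = np.random.default_rng(42)
t = np.arange(100, dtype=float)
y = 0.5 * t + 2.0 * np.sin(t / 5) + rng.normal(scale=0.5, size=t.size)

# 1. Model the trend with something that can extrapolate (a straight line here).
slope, intercept = np.polyfit(t, y, deg=1)
trend = slope * t + intercept

# 2. The remainder is what a BART model would be fit on.
remainder = y - trend

# 3. A forecast adds the extrapolated trend back onto the BART prediction.
t_future = np.arange(100, 120, dtype=float)
trend_future = slope * t_future + intercept
```

The point is that the extrapolation happens in the trend model, which can produce values outside the training range, while BART only has to capture the bounded, repeating structure in the remainder.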