
Commit ae24deb

juanitorduz committed: add regression example
1 parent 1f0d793 · commit ae24deb

File tree

2 files changed: +487 −54 lines


examples/case_studies/BART_introduction.ipynb

Lines changed: 390 additions & 38 deletions
Large diffs are not rendered by default.

examples/case_studies/BART_introduction.myst.md

Lines changed: 97 additions & 16 deletions
@@ -27,6 +27,9 @@ import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
+import seaborn as sns
+
+from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = "retina"
@@ -186,7 +189,69 @@ pmb.plot_variable_importance(idata_bikes, μ, X, samples=100);
### Out-of-Sample Predictions

In this section we want to show how to do out-of-sample predictions with BART. We are going to use the same dataset as before, but this time we are going to split the data into a training and a test set. We are going to use the training set to fit the model and the test set to evaluate the model.
-As this is a time series problem we need to make sure we do not shuffle the data. Hence we do the split using the `hour` feature.
+
++++
+
+#### Regression
+
+Let's start by modelling this data as a regression problem. In this context we randomly split the data into a training and a test set.
+
+```{code-cell} ipython3
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
+```
+
+Now, we fit the same model as above, but this time using a *shared variable* for the covariates so that we can later replace its values to generate the out-of-sample predictions.
+
+```{code-cell} ipython3
+with pm.Model() as model_oos_regression:
+    X = pm.MutableData("X", X_train)
+    Y = Y_train
+    α = pm.Exponential("α", 1 / 10)
+    μ = pmb.BART("μ", X, Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
+    idata_oos_regression = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_oos_regression_train = pm.sample_posterior_predictive(
+        trace=idata_oos_regression, random_seed=RANDOM_SEED
+    )
+```
+
+Next, we replace the data in the model and sample from the posterior predictive distribution.
+
+```{code-cell} ipython3
+with model_oos_regression:
+    X.set_value(X_test)
+    posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
+        trace=idata_oos_regression, random_seed=RANDOM_SEED
+    )
+```
+
+Finally, we can compare the posterior predictive distribution with the observed data.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(
+    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
+)
+
+az.plot_ppc(
+    data=posterior_predictive_oos_regression_train, kind="cumulative", observed_rug=True, ax=ax[0]
+)
+ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
+
+az.plot_ppc(
+    data=posterior_predictive_oos_regression_test, kind="cumulative", observed_rug=True, ax=ax[1]
+)
+ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
+```
+
+Yay! The results look quite reasonable 🙂!
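The cumulative posterior predictive plots give a visual check; a numeric summary can complement them. Below is a minimal sketch that compares the posterior predictive mean against the observed counts via the root mean squared error. It assumes the `posterior_predictive_oos_regression_*` objects and `Y_train` / `Y_test` from the cells above are in scope; the RMSE summary and the helper name `ppc_rmse` are illustrative additions, not part of this commit.

```python
# Sketch: a numeric check of the regression fit. Assumes the objects created in
# the cells above are in scope; RMSE is an illustrative choice of summary.
import numpy as np


def ppc_rmse(idata_ppc, observed):
    # Collapse the posterior predictive samples over chains and draws to a
    # point prediction, then compare it with the observed counts.
    y_pred = idata_ppc.posterior_predictive["y"].mean(dim=["chain", "draw"])
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(observed)) ** 2)))


print("train RMSE:", ppc_rmse(posterior_predictive_oos_regression_train, Y_train))
print("test RMSE:", ppc_rmse(posterior_predictive_oos_regression_test, Y_test))
```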
+
++++
+
+#### Time Series
+
+We can view the same data from a *time series* perspective using the `hour` feature. From this point of view, we need to make sure we do not shuffle the data so that we do not leak information. Thus, we define the train-test split using the `hour` feature.

```{code-cell} ipython3
train_test_hour_split = 19
@@ -201,40 +266,56 @@ X_test = test_bikes[features]
Y_test = test_bikes["count"]
```
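Only the opening and closing lines of this split cell fall inside the diff context shown here. For orientation, a time-ordered split consistent with those context lines might look like the sketch below; the `bikes` dataframe, the `query`-based filtering, and `train_bikes` are assumptions, not lines taken from this commit.

```python
# Hypothetical sketch of a time-ordered train-test split. The `bikes` dataframe
# and the query-based filtering are assumptions; only `train_test_hour_split`,
# `features`, `test_bikes`, and the X/Y assignments appear in the context above.
train_bikes = bikes.query("hour <= @train_test_hour_split")
test_bikes = bikes.query("hour > @train_test_hour_split")

X_train = train_bikes[features]
Y_train = train_bikes["count"]

X_test = test_bikes[features]
Y_test = test_bikes["count"]
```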

-Now, we fit the same model as above but this time using a *shared variable* for the covariatates so that we can then replace them to generate the out of sample predictions.
+We can then run the same model (but with different input data!) and generate out-of-sample predictions as above.

```{code-cell} ipython3
-with pm.Model() as model_bikes_train_test:
+with pm.Model() as model_oos_ts:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
-    idata_bikes_train = pm.sample(random_seed=RANDOM_SEED)
-    posterior_predictive_train = pm.sample_posterior_predictive(
-        trace=idata_bikes_train, random_seed=RANDOM_SEED
+    idata_oos_ts = pm.sample(random_seed=RANDOM_SEED)
+    posterior_predictive_oos_ts_train = pm.sample_posterior_predictive(
+        trace=idata_oos_ts, random_seed=RANDOM_SEED
    )
```

-Next, we replace the data in the model and sample from the posterior predictive distribution.
+We generate out-of-sample predictions.

```{code-cell} ipython3
-with model_bikes_train_test:
+with model_oos_ts:
    X.set_value(X_test)
-    posterior_predictive_test = pm.sample_posterior_predictive(
-        trace=idata_bikes_train, random_seed=RANDOM_SEED
+    posterior_predictive_oos_ts_test = pm.sample_posterior_predictive(
+        trace=idata_oos_ts, random_seed=RANDOM_SEED
    )
```

-Finally, let's plot the results:
+As above, we can compare the posterior predictive distribution with the observed data.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+fig, ax = plt.subplots(
+    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
+)
+
+az.plot_ppc(data=posterior_predictive_oos_ts_train, kind="cumulative", observed_rug=True, ax=ax[0])
+ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
+
+az.plot_ppc(data=posterior_predictive_oos_ts_test, kind="cumulative", observed_rug=True, ax=ax[1])
+ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
+```
+
+Wow! This does not look right! The predictions on the test set look very odd 🤔. To better understand what is going on, we can plot the predictions as time series:

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(figsize=(12, 6))
az.plot_hdi(
    x=X_train.index,
-    y=posterior_predictive_train.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (train)"},
@@ -243,7 +324,7 @@ az.plot_hdi(
)
az.plot_hdi(
    x=X_train.index,
-    y=posterior_predictive_train.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
    hdi_prob=0.5,
    color="C0",
    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (train)"},
@@ -253,7 +334,7 @@ az.plot_hdi(
ax.plot(X_train.index, Y_train, label="train (observed)")
az.plot_hdi(
    x=X_test.index,
-    y=posterior_predictive_test.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (test)"},
@@ -262,7 +343,7 @@ az.plot_hdi(
)
az.plot_hdi(
    x=X_test.index,
-    y=posterior_predictive_test.posterior_predictive["y"],
+    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
    hdi_prob=0.5,
    color="C1",
    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (test)"},
@@ -279,7 +360,7 @@ ax.set(
);
```

-The out-of-sample predictions look a bit odd. Why? Well, note that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data just sees `hour` values until $19$. As BART learns how to partition the (training) data, it can not differentiate between `hour` values between $20$ and $22$ for example. It just cares that both values are greater that $19$. This is very important to understand when using BART! This explains why one should not use BART for time series forecasting if there is a trend component. In this case it is better to detrend the data first, model the remainder with BART and model the trend with a different model.
+This plot helps us understand the reason behind the bad performance on the test set: recall that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only sees `hour` values up to $19$ (since that is our train-test threshold). As BART learns how to partition the (training) data, it cannot differentiate between `hour` values of, say, $20$ and $22$; it only cares that both values are greater than $19$. This is very important to understand when using BART! It also explains why one should not use BART for time series forecasting if there is a trend component. In that case it is better to detrend the data first, model the remainder with BART, and model the trend with a different model.

+++
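The closing recommendation can be made concrete with a small, self-contained sketch: fit a simple trend on the training window, let BART model the detrended remainder using features that stay inside their training range, and add the extrapolated trend back at prediction time. Everything below (the toy data, the linear trend, the Normal likelihood on the residuals, and the variable names) is an illustrative assumption and not part of this commit or the upstream notebook.

```python
# Minimal sketch of the "detrend first, then BART" idea. All data and names
# here are illustrative assumptions, not part of the bikes example.
import numpy as np
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(42)

# Toy hourly series: linear trend + nonlinear daily cycle + noise.
t = np.arange(240)
y = 0.05 * t + 3 * np.sin(2 * np.pi * (t % 24) / 24) + rng.normal(0, 0.5, t.size)

# Time-ordered split (no shuffling).
t_train, t_test = t[:192], t[192:]
y_train, y_test = y[:192], y[192:]

# 1. Model the trend with a simple linear fit on the training window only.
slope, intercept = np.polyfit(t_train, y_train, deg=1)
trend_train = intercept + slope * t_train
trend_test = intercept + slope * t_test  # extrapolates beyond the training window

# 2. Model the detrended remainder with BART, using features that do not grow
#    with time (here: hour of day), so test inputs stay inside the training range.
X_train = (t_train % 24).reshape(-1, 1)
X_test = (t_test % 24).reshape(-1, 1)
resid_train = y_train - trend_train

with pm.Model() as detrended_model:
    X = pm.MutableData("X", X_train)
    mu = pmb.BART("mu", X, resid_train)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("resid", mu=mu, sigma=sigma, observed=resid_train, shape=mu.shape)
    idata = pm.sample(random_seed=42)

with detrended_model:
    X.set_value(X_test)
    ppc = pm.sample_posterior_predictive(trace=idata, random_seed=42)

# 3. Forecast = extrapolated trend + BART prediction of the remainder.
resid_pred = ppc.posterior_predictive["resid"].mean(dim=["chain", "draw"]).to_numpy()
y_forecast = trend_test + resid_pred
```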