
Update GLM-robust.ipynb #356

Closed · wants to merge 5 commits
421 changes: 298 additions & 123 deletions examples/generalized_linear_models/GLM-robust.ipynb

Large diffs are not rendered by default.

111 changes: 75 additions & 36 deletions myst_nbs/generalized_linear_models/GLM-robust.myst.md
jupytext:
format_version: 0.13
jupytext_version: 1.13.7
kernelspec:
display_name: Python 3.8.13 ('pymc-dev-py38')
language: python
name: python3
---

# GLM: Robust Linear Regression

This tutorial is the second of a three-part series on Bayesian *generalized linear models (GLMs)* that first appeared on [Thomas Wiecki's blog](https://twiecki.io/):

1. {ref}`Linear Regression <pymc:GLM_linear>`
2. {ref}`Robust Linear Regression <GLM-robust>`
3. {ref}`Hierarchical Linear Regression <GLM-hierarchical>`

In this blog post I will write about:

- How a few outliers can largely affect the fit of linear regression models.
- How replacing the normal likelihood with a Student T distribution produces robust regression.


In the [linear regression tutorial](https://docs.pymc.io/en/v3/pymc-examples/examples/generalized_linear_models/GLM-linear.html), I described how minimizing the squared distance of the regression line is the same as maximizing the likelihood of a Normal distribution with the mean coming from the regression line. This latter probabilistic expression allows us to easily formulate a Bayesian linear regression model.
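
In symbols, the standard argument goes as follows: with intercept $\beta_0$ and slope $\beta_1$, assuming

$$
y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\ \sigma^2),
$$

the log-likelihood is $-\frac{1}{2\sigma^2}\sum_i \left(y_i - \beta_0 - \beta_1 x_i\right)^2$ plus a constant, so maximizing it over $\beta_0$ and $\beta_1$ is exactly the ordinary-least-squares problem.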

This worked splendidly on simulated data. The problem with simulated data though is that it's, well, simulated. In the real world things tend to get more messy and assumptions like normality are easily violated by a few outliers.

Again, import our modules.
```{code-cell} ipython3
%matplotlib inline

import aesara
import aesara.tensor as at
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import xarray as xr
```

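The cell that simulates the data is collapsed in this diff, but the cells below rely on `RANDOM_SEED`, `x`, `true_regression_line`, `x_out`, `y_out`, and `data`. A minimal sketch that produces comparable data (the specific seed, coefficients, and outlier values are assumptions, not the notebook's exact code):

```{code-cell} ipython3
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)

# simulate a noisy straight line
size = 100
true_intercept = 1
true_slope = 2
x = np.linspace(0, 1, size)
true_regression_line = true_intercept + true_slope * x
y = true_regression_line + rng.normal(scale=0.5, size=size)

# add a few extreme outliers
x_out = np.append(x, [0.1, 0.15, 0.2])
y_out = np.append(y, [8, 6, 9])

data = pd.DataFrame(dict(x=x_out, y=y_out))
```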

## Robust Regression

### Normal Likelihood

Let's see what happens if we estimate our Bayesian linear regression model by fitting a regression model with a normal likelihood. Note that the [`bambi`](https://bambinos.github.io/bambi/main/index.html) library provides an easy-to-use interface with which an equivalent model can be built in a single line of code (see the sketch after the model below).

```{code-cell} ipython3
with pm.Model() as model:
    # define priors
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    slope = pm.Normal("slope", mu=0, sigma=1)
    sigma = pm.HalfCauchy("sigma", beta=10)

    mu = pm.Deterministic("mu", intercept + slope * x_out)

    # define likelihood
    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y_out)

    # inference
    trace = pm.sample(tune=2000)
```
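
As noted above, essentially the same model can be expressed in one line with `bambi`; a rough equivalent (assuming `import bambi as bmb`, and keeping in mind that bambi chooses its own default priors) looks like:

```{code-cell} ipython3
# Rough bambi equivalent of the PyMC model above; bambi uses its own
# (data-dependent) default priors, so results will differ slightly.
bambi_model = bmb.Model("y ~ x", data)
bambi_trace = bambi_model.fit(draws=2000)
```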

We now want to generate samples from the [*posterior predictive*](https://en.wikipedia.org/wiki/Posterior_predictive_distribution) distribution of the mean of the regression line, `mu`, at each of the observed values `x_out`.

```{code-cell} ipython3
with model:
    pm.sample_posterior_predictive(
        trace, var_names=["mu"], random_seed=RANDOM_SEED, extend_inferencedata=True
    )
```

```{code-cell} ipython3
pp = az.extract_dataset(trace, group="posterior_predictive", num_samples=500)["mu"]

plt.scatter(x_out, y_out, label="data")
plt.plot(x_out, pp, alpha=0.02)
plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="r")
plt.legend(loc=0)
plt.title("Posterior predictive for normal likelihood")
```

As you can see, the fit is quite skewed and we have a fair amount of uncertainty in our estimate as indicated by the wide range of different posterior predictive regression lines. Why is this? The reason is that the normal distribution does not have a lot of mass in the tails and consequently, an outlier will affect the fit strongly.
A common remedy is to replace the Normal likelihood with a Student T distribution, which has heavier tails. Let's look at those two distributions to get a feel for them.
normal_dist = pm.Normal.dist(mu=0, sigma=1)
t_dist = pm.StudentT.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
plt.plot(x_eval, at.exp(pm.logp(normal_dist, x_eval)).eval(), label="Normal", lw=2.0)
plt.plot(x_eval, at.exp(pm.logp(t_dist, x_eval)).eval(), label="Student T", lw=2.0)
plt.xlabel("x")
plt.ylabel("Probability density")
plt.legend();
```

As you can see, values far away from the mean (0 in this case) are much more probable under the `T` distribution than under the Normal distribution.
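
To put a number on this claim (reusing `normal_dist` and `t_dist` from the cell above), we can evaluate both densities at a point far out in the tail, say $x = 5$:

```{code-cell} ipython3
# Density far in the tail: the Student T (nu=1, i.e. a Cauchy) puts roughly
# four orders of magnitude more density on x = 5 than the standard Normal.
print(at.exp(pm.logp(normal_dist, 5.0)).eval())  # ~1.5e-06
print(at.exp(pm.logp(t_dist, 5.0)).eval())  # ~1.2e-02
```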

Below is a PyMC model with the `likelihood` term following a `StudentT` distribution with $\nu=3$ degrees of freedom, as opposed to the `Normal` distribution used above.

```{code-cell} ipython3
with pm.Model() as robust_model:
    # define priors
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    slope = pm.Normal("slope", mu=0, sigma=1)
    sigma = pm.HalfCauchy("sigma", beta=10)

    mu = pm.Deterministic("mu", intercept + slope * x_out)

    # define likelihood
    likelihood = pm.StudentT("y", mu=mu, sigma=sigma, nu=3, observed=y_out)

    # inference
    robust_trace = pm.sample(tune=4000)
```

```{code-cell} ipython3
with robust_model:
    pm.sample_posterior_predictive(
        robust_trace, var_names=["mu"], random_seed=RANDOM_SEED, extend_inferencedata=True
    )
```

```{code-cell} ipython3
robust_pp = az.extract_dataset(robust_trace, group="posterior_predictive", num_samples=500)["mu"]

plt.scatter(x_out, y_out, label="data")
plt.plot(x_out, robust_pp, alpha=0.02)
plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="r")
plt.legend(loc=0)
plt.title("Posterior predictive for Student-T likelihood")
```

There, much better! The outliers are barely influencing our estimation at all because our likelihood function assumes that outliers are much more probable than under the Normal distribution.
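
A quick numerical check of that claim is to compare the posterior summaries of the regression parameters under the two likelihoods, using the traces from the models above:

```{code-cell} ipython3
# Under the normal likelihood the estimates are pulled toward the outliers;
# under the Student-T likelihood they should sit close to the values used to
# simulate the data.
print(az.summary(trace, var_names=["intercept", "slope"], kind="stats"))
print(az.summary(robust_trace, var_names=["intercept", "slope"], kind="stats"))
```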

## Summary

- By changing the likelihood from a Normal distribution to a Student T distribution -- which has more mass in the tails -- we can perform *Robust Regression*.


*Extensions*:

- The Student-T distribution has, besides the mean and variance, a third parameter called *degrees of freedom* that describes how much mass should be put into the tails. Here it is set to 3, which puts a lot of mass into the tails (setting it to infinity recovers the Normal distribution). One could easily place a prior on this parameter rather than fixing it, which I leave as an exercise for the reader ;) (a rough sketch follows below).
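
Picking up that exercise, a minimal sketch of such a model (the `Gamma(alpha=3, beta=1)` hyperparameters and the variable names are illustrative choices, not part of this notebook):

```{code-cell} ipython3
with pm.Model() as robust_model_nu:
    # same priors as before for the regression parameters
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    slope = pm.Normal("slope", mu=0, sigma=1)
    sigma = pm.HalfCauchy("sigma", beta=10)

    # prior on the degrees of freedom instead of fixing nu=3
    nu = pm.Gamma("nu", alpha=3, beta=1)

    mu = pm.Deterministic("mu", intercept + slope * x_out)
    pm.StudentT("y", mu=mu, sigma=sigma, nu=nu, observed=y_out)

    robust_trace_nu = pm.sample(tune=4000)
```
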
Author: [Thomas Wiecki](https://twitter.com/twiecki)
%load_ext watermark
%watermark -n -u -v -iv -w -p xarray
```

```{code-cell} ipython3

```