
GLM-robust update #451

Closed · wants to merge 20 commits

754 changes: 636 additions & 118 deletions examples/generalized_linear_models/GLM-robust.ipynb

Large diffs are not rendered by default.

122 changes: 96 additions & 26 deletions myst_nbs/generalized_linear_models/GLM-robust.myst.md
@@ -6,18 +6,29 @@ jupytext:
format_version: 0.13
jupytext_version: 1.13.7
kernelspec:
display_name: Python 3
display_name: Python 3 (ipykernel)
language: python
name: python3
---

(GLM-robust)=
# GLM: Robust Linear Regression

:::{post} August, 2013
:tags: regression, linear model, robust
:category: beginner
:author: Thomas Wiecki
:::

+++

# GLM: Robust Linear Regression

This tutorial first appeared as a post in a small series on Bayesian GLMs:

1. [The Inference Button: Bayesian GLMs made easy with PyMC3](http://twiecki.github.com/blog/2013/08/12/bayesian-glms-1/)
2. [This world is far from Normal(ly distributed): Robust Regression in PyMC3](http://twiecki.github.io/blog/2013/08/27/bayesian-glms-2/)
3. [The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3](http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)
1. [The Inference Button: Bayesian GLMs made easy with PyMC3](https://twiecki.io/blog/2013/08/12/bayesian-glms-1/)
2. [This world is far from Normal(ly distributed): Robust Regression in PyMC3](https://twiecki.io/blog/2013/08/27/bayesian-glms-2/)
3. [The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3](https://twiecki.io/blog/2014/03/17/bayesian-glms-3/)

In this blog post I will write about:

@@ -33,18 +44,18 @@ Let's see what happens if we add some outliers to our simulated data from the last post.

+++

Again, import our modules.
First, let's import our modules.

```{code-cell} ipython3
%matplotlib inline

import aesara
import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano
import pymc as pm

print(f"Running on PyMC v{pm.__version__}")
```
@@ -76,10 +87,14 @@ y_out = np.append(y, [8, 6, 9])
data = pd.DataFrame(dict(x=x_out, y=y_out))
```

```{code-cell} ipython3
data.head()
```

Plot the data together with the true regression line (the three points in the upper left corner are the outliers we added).

```{code-cell} ipython3
fig = plt.figure(figsize=(7, 7))
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x_out, y_out, "x", label="sampled data")
ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
@@ -89,38 +104,62 @@ plt.legend(loc=0);
## Robust Regression


Lets see what happens if we estimate our Bayesian linear regression model using the `bambi`. This function takes a [`formulae`](https://bambinos.github.io/formulae/api_reference.html) string to describe the linear model and adds a Normal likelihood by default.
Let's see what happens if we estimate our Bayesian linear regression model using `bambi`. Its `Model` class takes a [`formulae`](https://bambinos.github.io/formulae/api_reference.html) string describing the linear model and adds a Normal likelihood by default, together with automatically chosen priors for the Intercept and slope.

```{code-cell} ipython3
model = bmb.Model("y ~ x", data)
trace = model.fit(draws=2000, cores=2)
```

To evaluate the fit, I am plotting the posterior predictive regression lines by taking regression parameters from the posterior distribution and plotting a regression line for each (this is all done inside of `plot_posterior_predictive()`).
```{code-cell} ipython3
model.graph()
```

```{code-cell} ipython3
az.summary(trace)
```

To evaluate the fit, the code below takes regression parameters from the posterior distribution and plots the posterior predictive regression line for every 10th sample.

```{code-cell} ipython3
plt.figure(figsize=(7, 5))
plt.plot(x_out, y_out, "x", label="data")
pm.plot_posterior_predictive_glm(trace, samples=100, label="posterior predictive regression lines")
plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="y")
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x_out, y_out, "x", label="sampled data")

plt.legend(loc=0);
# calculate posterior regression lines (for every 10th point from posterior)
for chain in range(2):
    for i in range(0, 2000, 10):
        regression_line = (
            trace.posterior.Intercept[chain, i].values + trace.posterior.x[chain, i].values * x
        )
        ax.plot(x, regression_line, lw=0.5, alpha=0.25, color="black", label="posterior regression")

ax.plot(x, true_regression_line, label="true regression line", lw=2.0)

# remove duplicate legend labels for posterior regression lines
handles, labels = plt.gca().get_legend_handles_labels()
newLabels, newHandles = [], []
for handle, label in zip(handles, labels):
    if label not in newLabels:
        newLabels.append(label)
        newHandles.append(handle)
_ = ax.legend(newHandles, newLabels, loc=0)
```

As you can see, the fit is quite skewed and we have a fair amount of uncertainty in our estimate as indicated by the wide range of different posterior predictive regression lines. Why is this? The reason is that the normal distribution does not have a lot of mass in the tails and consequently, an outlier will affect the fit strongly.

A Frequentist would estimate a [Robust Regression](http://en.wikipedia.org/wiki/Robust_regression) and use a non-quadratic distance measure to evaluate the fit.
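
For intuition, here is a minimal sketch of that frequentist approach, assuming `statsmodels` is available (it is not used in the original notebook): Huber's loss grows linearly rather than quadratically for large residuals, so the outliers are down-weighted.

```python
# Hypothetical illustration, not part of the original notebook:
# a frequentist robust fit with statsmodels' RLM and Huber's loss.
import statsmodels.api as sm

X = sm.add_constant(x_out)  # design matrix with an intercept column, reusing x_out from above
huber_fit = sm.RLM(y_out, X, M=sm.robust.norms.HuberT()).fit()
print(huber_fit.params)  # intercept and slope estimates, much less affected by the outliers
```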

But what's a Bayesian to do? Since the problem is the light tails of the Normal distribution we can instead assume that our data is not normally distributed but instead distributed according to the [Student T distribution](http://en.wikipedia.org/wiki/Student%27s_t-distribution) which has heavier tails as shown next (I read about this trick in ["The Kruschke"](http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/), aka the puppy-book; but I think [Gelman](http://www.stat.columbia.edu/~gelman/book/) was the first to formulate this).
But what's a Bayesian to do? Since the problem is the light tails of the Normal distribution, we can instead assume that our data is not normally distributed but rather follows the [Student T distribution](http://en.wikipedia.org/wiki/Student%27s_t-distribution), which has heavier tails, as shown next (I read about this trick in ["The Kruschke"](https://www.elsevier.com/books/doing-bayesian-data-analysis/kruschke/978-0-12-405888-0), aka the puppy book; but I think [Gelman](http://www.stat.columbia.edu/~gelman/book/) was the first to formulate this).
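
To put a number on "heavier tails", here is a quick check using `scipy.stats` (an assumption; scipy is not imported in the notebook): the probability of landing more than three standard units from the center is tiny under the Normal but substantial under a Student T with nu=1.

```python
# Quick numeric check (illustrative, not part of the original notebook):
# probability mass beyond |x| > 3 under each distribution.
from scipy import stats

print("Normal(0, 1):   P(|x| > 3) =", 2 * stats.norm.sf(3))     # roughly 0.003
print("StudentT(nu=1): P(|x| > 3) =", 2 * stats.t.sf(3, df=1))  # roughly 0.2
```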

Let's look at those two distributions to get a feel for them.

```{code-cell} ipython3
normal_dist = pm.Normal.dist(mu=0, sigma=1)
t_dist = pm.StudentT.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
plt.plot(x_eval, theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label="Normal", lw=2.0)
plt.plot(x_eval, theano.tensor.exp(t_dist.logp(x_eval)).eval(), label="Student T", lw=2.0)
plt.plot(x_eval, aesara.tensor.exp(pm.logp(normal_dist, x_eval)).eval(), label="Normal", lw=2.0)
plt.plot(x_eval, aesara.tensor.exp(pm.logp(t_dist, x_eval)).eval(), label="Student T", lw=2.0)
plt.xlabel("x")
plt.ylabel("Probability density")
plt.legend();
@@ -137,11 +176,37 @@ trace_robust = model_robust.fit(draws=2000, cores=2)
```

```{code-cell} ipython3
plt.figure(figsize=(7, 5))
plt.plot(x_out, y_out, "x")
pm.plot_posterior_predictive_glm(trace_robust, label="posterior predictive regression lines")
plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="y")
plt.legend();
model_robust.graph()
```

```{code-cell} ipython3
az.summary(trace_robust)
```

```{code-cell} ipython3
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x_out, y_out, "x", label="sampled data")

# calculate posterior regression lines (for every 10th point from posterior)
for chain in range(2):
    for i in range(0, 2000, 10):
        regression_line = (
            trace_robust.posterior.Intercept[chain, i].values
            + trace_robust.posterior.x[chain, i].values * x
        )
        ax.plot(x, regression_line, lw=0.5, alpha=0.25, color="black", label="posterior regression")

ax.plot(x, true_regression_line, label="true regression line", lw=2.0)

# remove duplicate legend labels for posterior regression lines
handles, labels = plt.gca().get_legend_handles_labels()
newLabels, newHandles = [], []
for handle, label in zip(handles, labels):
    if label not in newLabels:
        newLabels.append(label)
        newHandles.append(handle)
_ = ax.legend(newHandles, newLabels, loc=0)
```

There, much better! The outliers are barely influencing our estimation at all because our likelihood function assumes that outliers are much more probable than under the Normal distribution.
@@ -151,17 +216,22 @@
## Summary

- `Bambi` allows you to pass in a `family` argument that selects the likelihood (see the sketch after this list).
- By changing the likelihood from a Normal distribution to a Student T distribution -- which has more mass in the tails -- we can perform *Robust Regression*.
- By changing the likelihood from a Normal distribution to a Student-T distribution -- which has more mass in the tails -- we can perform *Robust Regression*.
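
The exact `model_robust` definition is collapsed in this diff, but based on Bambi's `family` argument it presumably looks something like the following sketch (the `family="t"` choice here is an assumption, not a quote from the PR):

```python
# Illustrative sketch only; the actual model definition is collapsed in the diff above.
import bambi as bmb

model_normal = bmb.Model("y ~ x", data)              # default family: Normal likelihood
model_robust = bmb.Model("y ~ x", data, family="t")  # Student-T likelihood instead
trace_robust = model_robust.fit(draws=2000, cores=2)
```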

The next post will be about logistic regression in PyMC3 and what the posterior and oatmeal have in common.

*Extensions*:

- The Student-T distribution has, besides the mean and variance, a third parameter called *degrees of freedom* that describes how much mass should be put into the tails. Here it is set to 1, which gives maximum mass to the tails (setting this to infinity results in a Normal distribution!). One could easily place a prior on this rather than fixing it, which I leave as an exercise for the reader ;) (a sketch follows after this list).
- T distributions can be used as priors as well. I will show this in a future post on hierarchical GLMs.
- How do we test if our data is normal or violates that assumption in an important way? Check out this [great blog post](http://allendowney.blogspot.com/2013/08/are-my-data-normal.html) by Allen Downey.
- How do we test if our data is normal or violates that assumption in an important way? Check out this [great blog post](http://allendowney.blogspot.com/2013/08/are-my-data-normal.html) by Allen Downey.
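
As a starting point for that exercise, here is a minimal PyMC sketch (an illustration under assumed priors, not the author's solution) that puts a prior on the degrees of freedom instead of fixing it:

```python
# Illustrative sketch only: let the data decide how heavy the tails should be
# by giving the degrees of freedom (nu) its own prior.
import pymc as pm

with pm.Model() as model_nu:
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)  # weakly informative prior on the tail heaviness
    pm.StudentT("y", mu=intercept + slope * x_out, sigma=sigma, nu=nu, observed=y_out)
    trace_nu = pm.sample()
```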

+++

## Authors

Author: [Thomas Wiecki](https://twitter.com/twiecki)
* Authored by Thomas Wiecki in August, 2013
* Updated by Igor Kuvychko in October, 2022

```{code-cell} ipython3
%load_ext watermark