
Commit 989172b

Added tutorial notebook on updating priors
1 parent 5a76df5 commit 989172b

Lines changed: 285 additions & 0 deletions
@@ -0,0 +1,285 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Updating priors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, I will show how to update the priors as new data becomes available. The example is a slightly modified version of the linear regression in the [Getting started with PyMC3](https://github.com/pymc-devs/pymc3/blob/master/docs/source/notebooks/getting_started.ipynb) notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib as mpl\n",
    "from pymc3 import Model, Normal, Slice\n",
    "from pymc3 import sample\n",
    "from pymc3 import traceplot\n",
    "from pymc3.distributions import Continuous\n",
    "from theano import as_op\n",
    "import theano.tensor as tt\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Initialize random number generator\n",
    "np.random.seed(123)\n",
    "\n",
    "# True parameter values\n",
    "alpha_true, beta0_true, beta1_true = 5, 7, 13\n",
    "\n",
    "# Size of dataset\n",
    "size = 100\n",
    "\n",
    "# Predictor variables\n",
    "X1 = np.random.randn(size)\n",
    "X2 = np.random.randn(size) * 0.2\n",
    "\n",
    "# Simulate outcome variable\n",
    "Y = alpha_true + beta0_true * X1 + beta1_true * X2 + np.random.randn(size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model specification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our initial beliefs about the parameters are quite informative (sd=1) and a bit off from the true values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "basic_model = Model()\n",
    "\n",
    "with basic_model:\n",
    "    \n",
    "    # Priors for unknown model parameters\n",
    "    alpha = Normal('alpha', mu=0, sd=1)\n",
    "    beta0 = Normal('beta0', mu=12, sd=1)\n",
    "    beta1 = Normal('beta1', mu=18, sd=1)\n",
    "    \n",
    "    # Expected value of outcome\n",
    "    mu = alpha + beta0 * X1 + beta1 * X2\n",
    "    \n",
    "    # Likelihood (sampling distribution) of observations\n",
    "    Y_obs = Normal('Y_obs', mu=mu, sd=1, observed=Y)\n",
    "    \n",
    "    # draw 1000 posterior samples\n",
    "    trace = sample(1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "traceplot(trace)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To update our beliefs about the parameters, we use the posterior distributions from one inference as the prior distributions for the next. The data used for each inference iteration has to be independent of the data from previous iterations, otherwise the same (possibly wrong) belief is injected into the system over and over, misleading the inference. With independent data, the estimates should converge to the true parameter values.\n",
    "\n",
    "Because sampling gives us draws from the posterior distribution (shown on the right in the figure above) rather than an analytical form, we need to estimate its probability density (shown on the left in the figure above). Kernel density estimation (KDE) is one way to do this, and it is the technique we use here. The result is an empirical distribution that cannot be expressed analytically. Fortunately, PyMC3 provides a way to build custom distributions: we just need to inherit from the *Continuous* class and provide our own *logp* method. The code below does just that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def from_posterior(param, samples):\n",
    "\n",
    "    class From_posterior(Continuous):\n",
    "        def __init__(self, *args, **kwargs):\n",
    "            self.from_posterior_logp = _from_posterior_logp()\n",
    "            super(From_posterior, self).__init__(*args, **kwargs)\n",
    "        def logp(self, value):\n",
    "            return self.from_posterior_logp(value)\n",
    "\n",
    "    class From_posterior_logp:\n",
    "        def __init__(self, samples):\n",
    "            smin, smax = np.min(samples), np.max(samples)\n",
    "            self.x = np.linspace(smin, smax, 100)  # grid spanning the posterior samples\n",
    "            self.y = stats.gaussian_kde(samples)(self.x)  # KDE evaluated on the grid\n",
    "            #self.y /= np.sum(self.y)\n",
    "        def from_posterior_logp(self, value):\n",
    "            return np.array(np.log(np.interp(value, self.x, self.y, left=0, right=0)))  # zero density (logp = -inf) outside the grid\n",
    "    \n",
    "    from_posterior_logp = From_posterior_logp(samples)\n",
    "\n",
    "    def _from_posterior_logp():\n",
    "        @as_op(itypes=[tt.dscalar], otypes=[tt.dscalar])  # wrap the Python logp as a Theano op\n",
    "        def logp(value):\n",
    "            return from_posterior_logp.from_posterior_logp(value)\n",
    "        return logp\n",
    "\n",
    "    return From_posterior(param, testval=np.median(samples))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we just need to generate more data and build our Bayesian model so that the prior distributions for the current iteration are the posterior distributions from the previous iteration. We save the posterior samples from each iteration so that we can plot their distributions and watch them change from one iteration to the next (early iterations are plotted in yellow, later ones in red)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "update_i = 0\n",
    "traces = [trace]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for _ in range(10):\n",
    "\n",
    "    # generate more data\n",
    "    X1 = np.random.randn(size)\n",
    "    X2 = np.random.randn(size) * 0.2\n",
    "    Y = alpha_true + beta0_true * X1 + beta1_true * X2 + np.random.randn(size)\n",
    "\n",
    "    model = Model()\n",
    "    with model:\n",
    "        burnin = int(len(trace) / 5)\n",
    "        # Priors for unknown model parameters\n",
    "        alpha = from_posterior('alpha', trace['alpha'][burnin:])\n",
    "        beta0 = from_posterior('beta0', trace['beta0'][burnin:])\n",
    "        beta1 = from_posterior('beta1', trace['beta1'][burnin:])\n",
    "\n",
    "        # Expected value of outcome\n",
    "        mu = alpha + beta0 * X1 + beta1 * X2\n",
    "\n",
    "        # Likelihood (sampling distribution) of observations\n",
    "        Y_obs = Normal('Y_obs', mu=mu, sd=1, observed=Y)\n",
    "        \n",
    "        step = Slice([alpha, beta0, beta1])\n",
    "        \n",
    "        # draw 1000 posterior samples\n",
    "        trace = sample(1000, step=step)\n",
    "        traces.append(trace)\n",
    "    \n",
    "    update_i += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cmap = mpl.cm.autumn\n",
    "for param in ['alpha', 'beta0', 'beta1']:\n",
    "    plt.figure(figsize=(8, 2))\n",
    "    for update_i, trace in enumerate(traces):\n",
    "        samples = trace[param][burnin:]\n",
    "        smin, smax = np.min(samples), np.max(samples)\n",
    "        x = np.linspace(smin, smax, 100)\n",
    "        y = stats.gaussian_kde(samples)(x)\n",
    "        plt.plot(x, y, color=cmap(1 - update_i / len(traces)))  # early updates in yellow, later ones in red\n",
    "    plt.axvline({'alpha': alpha_true, 'beta0': beta0_true, 'beta1': beta1_true}[param], c='k')  # true value\n",
    "    plt.ylabel('Frequency')\n",
    "    plt.title(param)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can re-execute the last two cells to generate more updates.\n",
    "\n",
    "Note that the posterior distributions for our parameters tend to center on their true values (vertical lines), and the distributions get thinner and thinner. This means that we become more confident with each update, and the (false) beliefs we had at the beginning are washed away by the new data we incorporate."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
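For readers skimming the diff, here is a minimal standalone sketch (not part of the committed notebook) of the core idea behind from_posterior: approximate the empirical posterior with a Gaussian KDE evaluated on a fixed grid, then look up log-densities by linear interpolation. The posterior_samples array is an illustrative stand-in for something like trace['alpha'], and the sketch only needs NumPy and SciPy.

import numpy as np
from scipy import stats

# Illustrative stand-in for posterior samples such as trace['alpha']
rng = np.random.RandomState(123)
posterior_samples = rng.normal(loc=5.0, scale=0.3, size=2000)

# Build the KDE once on a grid spanning the samples, as From_posterior_logp.__init__ does
x = np.linspace(posterior_samples.min(), posterior_samples.max(), 100)
y = stats.gaussian_kde(posterior_samples)(x)

def logp(value):
    # np.interp returns 0 outside the grid, so the log-density is -inf there,
    # i.e. the updated prior puts no mass outside the range of the previous samples
    return np.log(np.interp(value, x, y, left=0, right=0))

print(logp(5.0))    # finite log-density near the posterior mode
print(logp(50.0))   # -inf: far outside the sampled range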
