:author: Thomas Wiecki, updated by Chris Fonnesbeck
:::
**Probabilistic Programming**, **Deep Learning** and "**Big Data**" are among the biggest topics in machine learning. Inside of PP, a lot of innovation is focused on making things scale using **Variational Inference**. In this example, I will show how to use **Variational Inference** in PyMC to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.
### Probabilistic Programming at scale
**Probabilistic Programming** allows very flexible creation of custom probabilistic models and is mainly concerned with **inference** and learning from your data. The approach is inherently **Bayesian** so we can specify **priors** to inform and constrain our models and get uncertainty estimation in the form of a **posterior** distribution. Using {ref}`MCMC sampling algorithms <multilevel_modeling>` we can draw samples from this posterior to very flexibly estimate these models. PyMC, [NumPyro](https://github.com/pyro-ppl/numpyro), and [Stan](http://mc-stan.org/) are the current state-of-the-art tools for constructing and estimating these models. One major drawback of sampling, however, is that it's often slow, especially for high-dimensional models and large datasets. That's why more recently, **variational inference** algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms fit a distribution (*e.g.* normal) to the posterior, turning a sampling problem into an optimization problem. Automatic Differentiation Variational Inference {cite:p}`kucukelbir2015automatic` is implemented in several probabilistic programming packages, including PyMC, NumPyro and Stan.
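To make the sampling-versus-optimization distinction concrete, here is a minimal sketch (not part of the original notebook; the toy model and all parameter values are illustrative) that fits the same small model once with MCMC and once with ADVI:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=500)  # synthetic observations

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)      # prior on the location
    sigma = pm.HalfNormal("sigma", 5.0)  # prior on the scale
    pm.Normal("obs", mu, sigma, observed=data)

    # MCMC: draw samples directly from the posterior
    idata_mcmc = pm.sample(1000, tune=1000)

    # ADVI: fit a factorized normal approximation, then sample from it
    approx = pm.fit(n=20_000, method="advi")
    idata_advi = approx.sample(1000)
```

Both `idata_mcmc` and `idata_advi` can be inspected with ArviZ; the difference is that ADVI obtains its draws from an optimized approximation rather than from the exact posterior.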
Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) (e.g. [random forests](https://en.wikipedia.org/wiki/Random_forest) or [gradient boosted regression trees](https://en.wikipedia.org/wiki/Boosting_(machine_learning))).
### Deep Learning
Now in their third renaissance, neural networks have been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games {cite:p}`mnih2013playing`, and beating the world-champion Lee Sedol at Go {cite:p}`silver2016masteringgo`. From a statistical point of view, neural networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders {cite:p}`kingma2014autoencoding` and in all sorts of other interesting ways (e.g. [Recurrent Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), or [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html) to estimate multimodal distributions). Why do they work so well? No one really knows, as the statistical properties are still not fully understood.
A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars:
* Speed: computing on the GPU allowed for much faster processing.
First, let's generate some toy data -- a simple binary classification problem that's not linearly separable.
```{code-cell} ipython3
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
import seaborn as sns

from sklearn.datasets import make_moons
from sklearn.preprocessing import scale
```
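The excerpt cuts off before the dataset is actually created. A minimal sketch of generating and splitting such a dataset is below; the `noise`, `n_samples`, and `test_size` values are illustrative, and `train_test_split` is an extra import not shown in the cell above.

```python
from sklearn.model_selection import train_test_split  # assumed import, not in the excerpt

# Two interleaved half-moons: binary labels that are not linearly separable.
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)  # standardize features to zero mean and unit variance

# Hold out half of the data for evaluating predictions later.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=0)
```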
```{code-cell} ipython3
            total_size=Y_train.shape[0],  # IMPORTANT for minibatches
        )
    return neural_network
```
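The excerpt only preserves the tail of the function that builds the network. For orientation, here is a minimal sketch (an assumption, not the notebook's actual code) of what such a model-building function might look like: a single hidden layer with `Normal` priors on the weights, a Bernoulli likelihood, and `total_size` set so that minibatch ADVI can rescale the log-likelihood to the full dataset. The container names `ann_input`/`ann_output`, the function name, and the layer size are made up for illustration.

```python
import numpy as np
import pymc as pm


def build_bayesian_nn(X_train, Y_train, n_hidden=5):
    rng = np.random.default_rng(42)
    init_1 = rng.standard_normal((X_train.shape[1], n_hidden))
    init_out = rng.standard_normal(n_hidden)

    with pm.Model() as neural_network:
        # Data containers so the inputs can be swapped out later (e.g. for the test set)
        ann_input = pm.Data("ann_input", X_train)
        ann_output = pm.Data("ann_output", Y_train)

        # Normal priors regularize the weights
        w_in_1 = pm.Normal("w_in_1", 0, sigma=1,
                           shape=(X_train.shape[1], n_hidden), initval=init_1)
        w_1_out = pm.Normal("w_1_out", 0, sigma=1,
                            shape=(n_hidden,), initval=init_out)

        # Forward pass: tanh hidden layer, sigmoid output probability
        act_1 = pm.math.tanh(pm.math.dot(ann_input, w_in_1))
        act_out = pm.math.sigmoid(pm.math.dot(act_1, w_1_out))

        # Binary classification -> Bernoulli likelihood
        pm.Bernoulli(
            "out",
            act_out,
            observed=ann_output,
            total_size=Y_train.shape[0],  # IMPORTANT for minibatches
        )
    return neural_network
```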
That's not so bad. The `Normal` priors help regularize the weights.
### Variational Inference: Scaling model complexity
We could now just run an MCMC sampler like {class}`pymc.NUTS` which works pretty well in this case, but as was already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.
Instead, we will use the {class}`pymc.ADVI` variational inference algorithm. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior.
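The fit call itself falls outside the excerpt. A minimal sketch of what it might look like is shown below; the number of iterations is illustrative, and the ELBO plot is just one way to check convergence.

```python
# Assumes `neural_network` is the model returned by the construction function above.
with neural_network:
    approx = pm.fit(n=30_000, method="advi")

# approx.hist records the optimization loss (negative ELBO) per iteration;
# a flat tail suggests the approximation has converged.
plt.plot(approx.hist)
plt.xlabel("iteration")
plt.ylabel("loss");
```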
```{code-cell} ipython3
%%time
trace = approx.sample(draws=5000)
```
Now that we have trained our model, let's predict on the hold-out set using a posterior predictive check (PPC). We can use {func}`~pymc.sample_posterior_predictive` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).
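The excerpt does not show how the model is pointed at the held-out data before prediction. One way to do this, assuming the inputs were wrapped in data containers named `ann_input`/`ann_output` as in the sketch above (those names are illustrative, not from the excerpt), is:

```python
# Swap the data containers to the hold-out set before posterior-predictive sampling.
with neural_network:
    pm.set_data({"ann_input": X_test, "ann_output": Y_test})
```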
```{code-cell} ipython3
with neural_network:
    ppc = pm.sample_posterior_predictive(trace)
```
We can average the predictions for each observation to estimate the underlying probability of class 1.
```{code-cell} ipython3
pred = ppc.posterior_predictive["out"].mean(("chain", "draw")) > 0.5
```
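As a quick sanity check (not shown in the excerpt), the thresholded predictions can be compared against the held-out labels; `Y_test` is assumed to be the hold-out target vector from the train/test split sketched earlier.

```python
# Fraction of hold-out points classified correctly.
accuracy = (pred.values == Y_test).mean()
print(f"Hold-out accuracy: {accuracy:.1%}")
```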