
Commit 463fdd4

Author: Severin Hatt (committed)

Update according to Jupyter Style Guide, but missing authors
1 parent d8baf4a commit 463fdd4

File tree

2 files changed (+141, -556 lines)


examples/case_studies/probabilistic_matrix_factorization.ipynb

Lines changed: 122 additions & 539 deletions
Large diffs are not rendered by default.

myst_nbs/case_studies/probabilistic_matrix_factorization.myst.md

Lines changed: 19 additions & 17 deletions
@@ -6,15 +6,16 @@ jupytext:
     format_version: 0.13
     jupytext_version: 1.13.7
 kernelspec:
-  display_name: Python 3
+  display_name: Python 3 (ipykernel)
   language: python
   name: python3
 ---
 
+(probabilistic_matrix_factorization)=
 # Probabilistic Matrix Factorization for Making Personalized Recommendations
 
-:::{post} Sept 20, 2021
-:tags: case study, pymc3.Model, pymc3.MvNormal, pymc3.Normal
+:::{post} June 3, 2022
+:tags: case study
 :category: intermediate
 :::
 
@@ -25,10 +26,8 @@ import arviz as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
-import pymc3 as pm
+import pymc as pm
 import xarray as xr
-
-print(f"Running on PyMC3 v{pm.__version__}")
 ```
 
 ```{code-cell} ipython3
@@ -42,7 +41,7 @@ az.style.use("arviz-darkgrid")
 
 So you are browsing for something to watch on Netflix and just not liking the suggestions. You just know you can do better. All you need to do is collect some ratings data from yourself and friends and build a recommendation algorithm. This notebook will guide you in doing just that!
 
-We'll start out by getting some intuition for how our model will work. Then we'll formalize our intuition. Afterwards, we'll examine the dataset we are going to use. Once we have some notion of what our data looks like, we'll define some baseline methods for predicting preferences for movies. Following that, we'll look at Probabilistic Matrix Factorization (PMF), which is a more sophisticated Bayesian method for predicting preferences. Having detailed the PMF model, we'll use PyMC3 for MAP estimation and MCMC inference. Finally, we'll compare the results obtained with PMF to those obtained from our baseline methods and discuss the outcome.
+We'll start out by getting some intuition for how our model will work. Then we'll formalize our intuition. Afterwards, we'll examine the dataset we are going to use. Once we have some notion of what our data looks like, we'll define some baseline methods for predicting preferences for movies. Following that, we'll look at Probabilistic Matrix Factorization (PMF), which is a more sophisticated Bayesian method for predicting preferences. Having detailed the PMF model, we'll use PyMC for MAP estimation and MCMC inference. Finally, we'll compare the results obtained with PMF to those obtained from our baseline methods and discuss the outcome.
 
 ## Intuition
 
@@ -305,23 +304,23 @@ Given small precision parameters, the priors on $U$ and $V$ ensure our latent va
 import logging
 import time
 
+import aesara
 import scipy as sp
-import theano
 
 # Enable on-the-fly graph computations, but ignore
 # absence of intermediate test values.
-theano.config.compute_test_value = "ignore"
+aesara.config.compute_test_value = "ignore"
 
 # Set up logging.
 logger = logging.getLogger()
 logger.setLevel(logging.INFO)
 
 
 class PMF:
-    """Probabilistic Matrix Factorization model using pymc3."""
+    """Probabilistic Matrix Factorization model using pymc."""
 
     def __init__(self, train, dim, alpha=2, std=0.01, bounds=(1, 5)):
-        """Build the Probabilistic Matrix Factorization model using pymc3.
+        """Build the Probabilistic Matrix Factorization model using pymc.
 
         :param np.ndarray train: The training data to use for learning the model.
         :param int dim: Dimensionality of the model; number of latent factors.
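
Editor's aside: the hunk above only touches imports and docstrings, so the PMF model itself is not visible in this diff. For readers following along, here is a minimal, self-contained sketch of the generative model the notebook describes (zero-mean Gaussian priors on the latent factor matrices $U$ and $V$, a Gaussian likelihood on the ratings with precision $\alpha$). It is illustrative only, not code from this commit; the variable names, the toy fully observed ratings matrix, and the prior precision choice are assumptions.

```python
import numpy as np
import pymc as pm

# Toy, fully observed user x movie ratings matrix (assumption for illustration;
# the real notebook uses MovieLens data with missing entries).
rng = np.random.default_rng(42)
ratings = rng.integers(1, 6, size=(30, 20)).astype("float64")

n, m = ratings.shape
dim, alpha, std = 5, 2, 0.01  # latent dimensionality, rating precision, prior scale

with pm.Model() as pmf_sketch:
    # Zero-mean multivariate normal priors on the latent user and movie factors.
    U = pm.MvNormal("U", mu=np.zeros(dim), tau=(1 / std**2) * np.eye(dim), shape=(n, dim))
    V = pm.MvNormal("V", mu=np.zeros(dim), tau=(1 / std**2) * np.eye(dim), shape=(m, dim))
    # Each rating is normally distributed around the inner product U_i V_j^T.
    R = pm.Normal("R", mu=pm.math.dot(U, V.T), tau=alpha, observed=ratings)
```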
@@ -390,7 +389,7 @@ We'll also need functions for calculating the MAP and performing sampling on our
 
 $$ E = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^M I_{ij} (R_{ij} - U_i V_j^T)^2 + \frac{\lambda_U}{2} \sum_{i=1}^N \|U\|_{Fro}^2 + \frac{\lambda_V}{2} \sum_{j=1}^M \|V\|_{Fro}^2, $$
 
-where $\lambda_U = \alpha_U / \alpha$, $\lambda_V = \alpha_V / \alpha$, and $\|\cdot\|_{Fro}^2$ denotes the Frobenius norm {cite:p}`mnih2008advances`. Minimizing this objective function gives a local minimum, which is essentially a maximum a posteriori (MAP) estimate. While it is possible to use a fast Stochastic Gradient Descent procedure to find this MAP, we'll be finding it using the utilities built into `pymc3`. In particular, we'll use `find_MAP` with Powell optimization (`scipy.optimize.fmin_powell`). Having found this MAP estimate, we can use it as our starting point for MCMC sampling.
+where $\lambda_U = \alpha_U / \alpha$, $\lambda_V = \alpha_V / \alpha$, and $\|\cdot\|_{Fro}^2$ denotes the Frobenius norm {cite:p}`mnih2008advances`. Minimizing this objective function gives a local minimum, which is essentially a maximum a posteriori (MAP) estimate. While it is possible to use a fast Stochastic Gradient Descent procedure to find this MAP, we'll be finding it using the utilities built into `pymc`. In particular, we'll use `find_MAP` with Powell optimization (`scipy.optimize.fmin_powell`). Having found this MAP estimate, we can use it as our starting point for MCMC sampling.
 
 Since it is a reasonably complex model, we expect the MAP estimation to take some time. So let's save it after we've found it. Note that we define a function for finding the MAP below, assuming it will receive a namespace with some variables in it. Then we attach that function to the PMF class, where it will have such a namespace after initialization. The PMF class is defined in pieces this way so I can say a few things between each piece to make it clearer.
 
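Editor's aside: as a hedged illustration of the workflow the paragraph above describes (not part of this commit), MAP-then-sample might look roughly like the following, continuing from the toy `pmf_sketch` model in the earlier sketch. Argument values are arbitrary assumptions.

```python
# Illustrative only: find the MAP with Powell optimization, then use it as the
# starting point for MCMC sampling.
with pmf_sketch:
    map_estimate = pm.find_MAP(method="Powell")  # Powell optimization via scipy.optimize
    idata = pm.sample(draws=200, tune=200, chains=2, initvals=map_estimate)
```
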
@@ -424,9 +423,9 @@ So now our PMF class has a `map` `property` which will either be found using Pow
 ```{code-cell} ipython3
 # Draw MCMC samples.
 def _draw_samples(self, **kwargs):
-    kwargs.setdefault("chains", 1)
+    # kwargs.setdefault("chains", 1)
     with self.model:
-        self.trace = pm.sample(**kwargs, return_inferencedata=True)
+        self.trace = pm.sample(**kwargs)
 
 
 # Update our class with the sampling infrastructure.
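
Editor's aside on the change above (not from the diff itself): in PyMC v4, `pm.sample` returns an `arviz.InferenceData` object by default, so the explicit `return_inferencedata=True` is redundant, and dropping the hard-coded `chains=1` lets sampling fall back to PyMC's own multi-chain default. A minimal check, using a throwaway model unrelated to PMF:

```python
import arviz as az
import pymc as pm

# Tiny throwaway model, only to show that pm.sample now returns InferenceData
# by default (illustrative; not related to the PMF model in this notebook).
with pm.Model():
    x = pm.Normal("x")
    idata = pm.sample(draws=100, tune=100, chains=2, progressbar=False)

assert isinstance(idata, az.InferenceData)
```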
@@ -721,7 +720,7 @@ print("Improvement from MAP: %.5f" % (pmf_map_rmse - final_test_rmse)
 print("Improvement from Mean of Means: %.5f" % (baselines["mom"] - final_test_rmse))
 ```
 
-We have some interesting results here. As expected, our MCMC sampler provides lower error on the training set. However, it seems it does so at the cost of overfitting the data. This results in a decrease in test RMSE as compared to the MAP, even though it is still much better than our best baseline. So why might this be the case? Recall that we used point estimates for our precision parameters $\alpha_U$ and $\alpha_V$ and we chose a fixed precision $\alpha$. It is quite likely that by doing this, we constrained our posterior in a way that biased it towards the training data. In reality, the variance in the user ratings and the movie ratings is unlikely to be equal to the means of sample variances we used. Also, the most reasonable observation precision $\alpha$ is likely different as well.
+We have some interesting results here. As expected, our MCMC sampler provides lower error on the training set. However, it seems it does so at the cost of overfitting the data. This results in a decrease in test RMSE as compared to the MAP, even though it is still much better than our best baseline. So why might this be the case? Recall that we used point estimates for our precision parameters $\alpha_U$ and $\alpha_V$ and we chose a fixed precision $\alpha$. It is quite likely that by doing this, we constrained our posterior in a way that biased it towards the training data. In reality, the variance in the user ratings and the movie ratings is unlikely to be equal to the means of sample variances we used. Also, the most reasonable observation precision $\alpha$ is likely different as well.
 
 +++
 
@@ -748,11 +747,11 @@ ax.set_ylabel("RMSE");
 
 ## Summary
 
-We set out to predict user preferences for unseen movies. First we discussed the intuitive notion behind the user-user and item-item neighborhood approaches to collaborative filtering. Then we formalized our intuitions. With a firm understanding of our problem context, we moved on to exploring our subset of the Movielens data. After discovering some general patterns, we defined three baseline methods: uniform random, global mean, and mean of means. With the goal of besting our baseline methods, we implemented the basic version of Probabilistic Matrix Factorization (PMF) using `pymc3`.
+We set out to predict user preferences for unseen movies. First we discussed the intuitive notion behind the user-user and item-item neighborhood approaches to collaborative filtering. Then we formalized our intuitions. With a firm understanding of our problem context, we moved on to exploring our subset of the Movielens data. After discovering some general patterns, we defined three baseline methods: uniform random, global mean, and mean of means. With the goal of besting our baseline methods, we implemented the basic version of Probabilistic Matrix Factorization (PMF) using `pymc`.
 
 Our results demonstrate that the mean of means method is our best baseline on our prediction task. As expected, we are able to obtain a significant decrease in RMSE using the PMF MAP estimate obtained via Powell optimization. We illustrated one way to monitor convergence of an MCMC sampler with a high-dimensionality sampling space using the Frobenius norms of the sampled variables. The traceplots using this method seem to indicate that our sampler converged to the posterior. Results using this posterior showed that attempting to improve the MAP estimation using MCMC sampling actually overfit the training data and increased test RMSE. This was likely caused by the constraining of the posterior via fixed precision parameters $\alpha$, $\alpha_U$, and $\alpha_V$.
 
-As a followup to this analysis, it would be interesting to also implement the logistic and constrained versions of PMF. We expect both models to outperform the basic PMF model. We could also implement the fully Bayesian version of PMF (BPMF) {cite:p}`salakhutdinov2008bayesian`, which places hyperpriors on the model parameters to automatically learn ideal mean and precision parameters for $U$ and $V$. This would likely resolve the issue we faced in this analysis. We would expect BPMF to improve upon the MAP estimation produced here by learning more suitable hyperparameters and parameters. For a basic (but working!) implementation of BPMF in `pymc3`, see [this gist](https://gist.github.com/macks22/00a17b1d374dfc267a9a).
+As a followup to this analysis, it would be interesting to also implement the logistic and constrained versions of PMF. We expect both models to outperform the basic PMF model. We could also implement the fully Bayesian version of PMF (BPMF) {cite:p}`salakhutdinov2008bayesian`, which places hyperpriors on the model parameters to automatically learn ideal mean and precision parameters for $U$ and $V$. This would likely resolve the issue we faced in this analysis. We would expect BPMF to improve upon the MAP estimation produced here by learning more suitable hyperparameters and parameters. For a basic (but working!) implementation of BPMF in `pymc`, see [this gist](https://gist.github.com/macks22/00a17b1d374dfc267a9a).
 
 If you made it this far, then congratulations! You now have some idea of how to build a basic recommender system. These same ideas and methods can be used on many different recommendation tasks. Items can be movies, products, advertisements, courses, or even other people. Any time you can build yourself a user-item matrix with user preferences in the cells, you can use these types of collaborative filtering algorithms to predict the missing values. If you want to learn more about recommender systems, the first reference is a good place to start.
 
@@ -778,3 +777,6 @@ goldberg2001eigentaste
 %load_ext watermark
 %watermark -n -u -v -iv -w
 ```
+
+:::{include} ../page_footer.md
+:::
