added draft of the prevalence of malaria example #691
Conversation
Hi @Dekermanjian sorry for the delay. I think this looks great, short and to the point. I think it could benefit from a little more narrative, but I wouldn't object as it is. You could also provide a little guidance on how you chose the Matern32 or m=[40, 40], c=2.5. Or link to the other HSGP example nb that covers this.
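To illustrate the Matern32-vs-Matern52 question raised above, here is a minimal numpy sketch (not from the notebook; the helper functions `matern32` and `matern52` are hypothetical names implementing the standard Matern correlation formulas for nu=3/2 and nu=5/2):

```python
import numpy as np

def matern32(r, ls):
    """Matern nu=3/2 correlation as a function of distance r and lengthscale ls."""
    z = np.sqrt(3.0) * r / ls
    return (1.0 + z) * np.exp(-z)

def matern52(r, ls):
    """Matern nu=5/2 correlation: smoother (twice mean-square differentiable) paths."""
    z = np.sqrt(5.0) * r / ls
    return (1.0 + z + z**2 / 3.0) * np.exp(-z)

r = np.linspace(0.0, 3.0, 200)
k32 = matern32(r, ls=1.0)
k52 = matern52(r, ls=1.0)
# Both correlations equal 1 at r = 0 and decay toward 0 with distance;
# Matern52 stays slightly higher at short range, i.e. it produces
# smoother surfaces than Matern32 for the same lengthscale.
```

Plotting `k32` and `k52` against `r` makes the smoothness trade-off between the two kernels visible at a glance.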
Shouldn't be an issue, 82kB and 25kB isn't bad.
I think it's great as is, and it'd be helpful to include for people building a spatial model.
This I don't know... You should be able to build the documentation locally and check if it works that way. I think if it works locally you're alright. @OriolAbril would know the answer to this.
```
:::{post} Aug 04, 2024
:tags: spatial, autoregressive, count data
:category: beginner, tutorial
:author: Jonathan Dekermanjian, bwengals (please add your name here)
```
Suggested change:

```diff
- :author: Jonathan Dekermanjian, bwengals (please add your name here)
+ :author: Jonathan Dekermanjian, Bill Engels
```
Oh no need to add me as an author. It's great work and I wouldn't mind being associated with it, but I didn't write a single word!
We are collaborating on this example. Even though you did not write a single word, your feedback and ideas are being incorporated into the work. I won't force your name to be added, but I do think credit should be given to all contributors to a piece of work.
```
# The prevalence of malaria in the Gambia

:::{post} Aug 04, 2024
:tags: spatial, autoregressive, count data
```
these are too detailed, please take a look at existing tags and re-use those, should add a GP one.
Hey @twiecki , I had copied those tags from the notebook "Conditional Autoregressive (CAR) Models for Spatial Data". Would it make more sense if the tags were just "spatial" and "gp"?
GP should be added. I have no strong opinion on the others.
Yes, that is a great idea. In fact I used that example to choose my m and c values. As for the Matern32: the Matern family of covariance functions is popular for spatially related data, and I selected v=3/2 rather than 5/2 because I thought the 5/2 would overly smooth the transitions (I may be wrong since I have not tested that out). I can add a section about why the Matern family of covariance functions is typically used for spatially related data and talk about the smoothing as we change v. Do you think that would be helpful? Another thing I would like to talk about is the length scale. I always struggle with this concept and would really like to offer the reader a simple explanation of how to think about it. Is it correct to say this: in the notebook you'll see that the posterior mean of "ls" is 0.187. Since we are passing degrees of longitude and latitude to the Matern, is it fair to interpret the length scale this way: we can expect the Gaussian process mean to decay towards 0 (since we set a 0 mean function) as we move 0.187 degrees away from any sampled point on the map?
Sure, as much as you'd like! Why Matern covariances are used for spatial data is super interesting, and might steer people away from ExpQuad. Or even include the reasoning you give here (thought Matern52 would be too smooth, etc.).
Yes, exactly this. But it's not a hard cutoff, and usually the lengthscale posterior is poorly constrained by the data, so it still has a large standard deviation. For example, if you plot:

```python
x = np.linspace(0, 10, 100)
K = pm.gp.cov.Matern52(input_dim=1, ls=1)(x[:, None]).eval()
plt.plot(x, K[:, 0])
```

then you'll see the covariance between x and x' that it gives with a specific lengthscale ls=1.
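As a back-of-the-envelope check of what ls = 0.187 implies (not part of the notebook; `matern32_corr` is a hypothetical helper implementing the standard Matern-3/2 correlation formula):

```python
import numpy as np

def matern32_corr(r, ls):
    # Matern nu=3/2 correlation between two points a distance r apart
    z = np.sqrt(3.0) * r / ls
    return (1.0 + z) * np.exp(-z)

ls = 0.187  # posterior mean lengthscale from the notebook, in degrees
for r in [0.5 * ls, ls, 2 * ls, 4 * ls]:
    print(f"distance {r:.3f} deg -> correlation {matern32_corr(r, ls):.2f}")
# At a distance of one lengthscale the correlation is still ~0.48, so the
# GP mean decays gradually toward zero rather than cutting off sharply
# at 0.187 degrees.
```

This makes the "not a hard cutoff" point concrete: even two lengthscales away the correlation has not fully vanished.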
@bwengals Thank you for that explanation it was very helpful. Okay, so what I will do is:
I will do my best to have these done in the next couple of days.
Sure, no rush, thanks! I saw this recently; it might be helpful for the lengthscale interpretation.
Plots seem to render correctly: https://pymcio--691.org.readthedocs.build/projects/examples/en/691/spatial/malaria_prevalence_PyMCv5.html#data-processing, so kudos to pydata-sphinx-theme team
A couple of extra general comments. The filename makes it into the URL for the page (as seen above), so I would remove pymcv5 from the name.
You can use the authors section for a more detailed authorship description and acknowledgements, but I agree that if you wrote everything you should be the only author (for now). At the bottom you can link to the original source/inspiration and say you were the one to adapt it, but with extensive feedback by Bill, or something like that.
…xplanation to why we use matern vs expquad, added acknowledgements section under authors
@OriolAbril Thank you for checking the interactive plots! I removed pymcv5 from the filename and added a sub-section.
@bwengals I added the items that I listed earlier:
If you don't mind, whenever you have the chance, could you review the new additions? Thank you both for your help with this example!
Thank you for sharing this with me! I will give it a read and see if I can provide an enhanced interpretation on the length scale.
Hey @bwengals I read through the article you shared with me. Here are my high level take-aways:
The conditions of their testing were fixing the signal to noise ratio across varying length scales, simulating data using a uniform distribution, and attempting to recover both the true length scale and true function using an ExpQuad 1-d GP.
View / edit / reply to this conversation on ReviewNB aloctavodia commented on 2024-08-23T17:11:42Z To add a little bit more narrative and make the example friendlier to non-experts, you could turn the comments into text in markdown cells.
View / edit / reply to this conversation on ReviewNB aloctavodia commented on 2024-08-23T17:11:43Z Again, for non-experts, could you explain what EPSG is and why we care in particular about 4326?
View / edit / reply to this conversation on ReviewNB aloctavodia commented on 2024-08-23T17:11:43Z maybe pass
View / edit / reply to this conversation on ReviewNB aloctavodia commented on 2024-08-23T17:11:44Z Not entirely clear to me. Can we overlap the map or the observations?
View / edit / reply to this conversation on ReviewNB bwengals commented on 2024-08-23T17:21:46Z Just a couple of small spelling things. Otherwise, thanks for adding this explanation!
- Quadtratic -> Quadratic
- paramter -> parameter
- smooting -> smoothing
I'm good with the nb as is, pending @aloctavodia's suggestions.
@aloctavodia Thank you very much for your review and suggestions. I will make the requested additions. @bwengals Thank you again for reviewing this example. I will clean up those spelling typos. I appreciate all the feedback, thank you!
…rative, made the graph after inference clearer, fixed typos
@aloctavodia I made the changes that you suggested. I added a paragraph to describe coordinate reference systems, put more of a narrative in addition to the comments in the data processing section, and I plotted the actual prevalences side by side next to the inferred prevalences. When you have a chance please let me know if that looks good to you.
View / edit / reply to this conversation on ReviewNB
bwengals commented on 2024-08-26T06:12:59Z: I wonder if the divergences are caused by the really large m and c values. Since the data size is small, what if you use
Dekermanjian commented on 2024-08-27T01:28:10Z: Hey @bwengals, I swapped out the HSGP for an actual GP but I am still getting divergences. I tried increasing the sample size, swapping out the priors on the length scale and on eta, but nothing seems to get rid of the divergences.
Thanks @Dekermanjian! It reads much clearer now.
@aloctavodia Thank you for your feedback. I really appreciate it. |
Try this code instead:

```python
# Fit a GP to model the simulated data
with pm.Model() as matern_model:
    eta = pm.Exponential("eta", scale=10.0)
    ls = pm.Lognormal("ls", mu=0.5, sigma=0.75)
    cov_func = eta**2 * pm.gp.cov.Matern32(input_dim=1, ls=ls)
    gp = pm.gp.Latent(cov_func=cov_func)
    s = gp.prior("s", X=x[:, None])
    measurement_error = pm.Exponential("measurement_error", scale=5.0)
    pm.Normal("likelihood", mu=s, sigma=measurement_error, observed=y)
    matern_idata = pm.sample(tune=2000, nuts_sampler="numpyro", target_accept=0.98)
```

The divergences aren't completely gone, but the different priors on the hyperparameters are helping to tamp them down. GP models are really sensitive to how those are set. To get rid of divergences completely, you can switch to `pm.gp.Marginal`:

```python
# Fit a GP to model the simulated data
with pm.Model() as matern_model:
    eta = pm.Exponential("eta", scale=10.0)
    ls = pm.Lognormal("ls", mu=0.5, sigma=0.75)
    cov_func = eta**2 * pm.gp.cov.Matern32(input_dim=1, ls=ls)
    gp = pm.gp.Marginal(cov_func=cov_func)
    measurement_error = pm.Exponential("measurement_error", scale=5.0)
    gp.marginal_likelihood("likelihood", X=x[:, None], y=y, sigma=measurement_error)
    matern_idata = pm.sample(tune=2000, nuts_sampler="numpyro")
```
@bwengals thank you! That certainly reduced the divergences quite a bit. I added in your changes and pushed up a new version.
I'd approve, as long as @aloctavodia is good with things?
Hey @aloctavodia, is there anything in the example you'd like me to update or anything you think I should add to make it a better example? Thank you for taking the time to review the example. I really appreciate it!
Thanks, @Dekermanjian, this is a great addition to the example gallery.
Thank you very much, @aloctavodia!!
Relates to proposal #684
I have put together the first draft. Tagging @bwengals: if you can review the example, that would be wonderful.
This is my first time adding an example to pymc-examples and I have a few questions:
📚 Documentation preview 📚: https://pymc-examples--691.org.readthedocs.build/en/691/