added draft of the prevalence of malaria example #691

Merged

Conversation

Contributor

@Dekermanjian Dekermanjian commented Aug 3, 2024

Relates to proposal #684

I have put together the first draft. Tagging @bwengals; if you can review the example, that would be wonderful.

This is my first time adding an example to pymc-examples and I have a few questions:

  • I added two data files (1 csv (tabular data) format, 1 tif (raster) format) to the data folder. I am not sure if that was the correct thing to do?
  • The data processing section is a little bit long. I can save the processed data as a csv file and skip a lot of the processing. What are your thoughts on this?
  • There are 2 interactive plots that use folium. I don't know if that will be supported on the website?

📚 Documentation preview 📚: https://pymc-examples--691.org.readthedocs.build/en/691/


@bwengals
Collaborator

bwengals commented Aug 6, 2024

Hi @Dekermanjian sorry for the delay. I think this looks great, short and to the point. I think it could benefit from a little more narrative, but I wouldn't object as it is. You could also provide a little guidance on how you chose the Matern32 or m=[40, 40], c=2.5. Or link to the other HSGP example nb that covers this.
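
For concreteness, a rough sketch of where those choices enter the model (placeholder data and priors, just to show how Matern32, m=[40, 40], and c=2.5 are wired up in an HSGP):

import numpy as np
import pymc as pm

# Placeholder 2-D coordinates, only so the sketch runs end to end
X = np.random.default_rng(0).uniform(0, 1, size=(50, 2))

with pm.Model() as sketch_model:
    eta = pm.Exponential("eta", scale=1.0)
    ls = pm.Gamma("ls", mu=0.2, sigma=0.1)
    cov_func = eta**2 * pm.gp.cov.Matern32(input_dim=2, ls=ls)

    # m is the number of basis functions per input dimension,
    # c expands the approximation domain beyond the range of the data
    gp = pm.gp.HSGP(m=[40, 40], c=2.5, cov_func=cov_func)
    f = gp.prior("f", X=X)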

I added two data files (1 csv (tabular data) format, 1 tif (raster) format) to the data folder. I am not sure if that was the correct thing to do?

Shouldn't be an issue, 82kB and 25kB isn't bad.

The data processing section is a little bit long. I can save the processed data as a csv file and skip a lot of the processing. What are your thoughts on this?

I think it's great as is, and it'd be helpful to include for people building a spatial model.

There are 2 interactive plots that use folium. I don't know if that will be supported on the website?

This I don't know... You should be able to build the documentation locally and check if it works that way. I think if it works locally you're alright. @OriolAbril would know the answer to this.

:::{post} Aug 04, 2024
:tags: spatial, autoregressive, count data
:category: beginner, tutorial
:author: Jonathan Dekermanjian, bwengals (please add your name here)
Member

Suggested change
:author: Jonathan Dekermanjian, bwengals (please add your name here)
:author: Jonathan Dekermanjian, Bill Engels

Collaborator

Oh no need to add me as an author. It's great work and I wouldn't mind being associated with it, but I didn't write a single word!

Contributor Author

We are collaborating on this example. Even though you did not write a single word, your feedback and ideas are being incorporated into the work. I won't force your name to be added, but I do think credit should be given to all contributors to a piece of work.

# The prevalence of malaria in the Gambia

:::{post} Aug 04, 2024
:tags: spatial, autoregressive, count data
Member

These are too detailed; please take a look at existing tags and re-use those. A GP one should be added.

Contributor Author

Hey @twiecki, I had copied those tags from the notebook "Conditional Autoregressive (CAR) Models for Spatial Data". Would it make more sense if the tags were just "spatial" and "gp"?

Member

GP should be added. I have no strong opinion on the others.

@Dekermanjian
Contributor Author

Dekermanjian commented Aug 7, 2024

@bwengals

You could also provide a little guidance on how you chose the Matern32 or m=[40, 40], c=2.5. Or link to the other HSGP example nb that covers this.

Yes, that is a great idea. In fact I used that example to choose my m and c values. As for the Matern32, the Matern family of covariance functions is popular for spatially related data, and I selected v=3/2 rather than 5/2 because I thought 5/2 would overly smooth the transitions (I may be wrong, since I have not tested that out). I can add a section about why the Matern family of covariance functions is typically used for spatially related data and talk about the smoothing as we change v. Do you think that would be helpful?

Another thing I would like to talk about is the length scale. I always struggle with this concept and would really like to offer the reader a simple explanation of how to think about it. Is it correct to say this: In the notebook you'll see that the posterior mean of "ls" is 0.187. Since we are passing in degrees of longitude and latitude to the Matern, is it fair to interpret the length scale this way: We can expect the Gaussian mean to decay towards 0 (since we set a 0 mean function) as we move 0.187 degrees away from any sampled point on the map?

@bwengals
Collaborator

bwengals commented Aug 8, 2024

Yes, that is a great idea. In fact I used that example to choose my m and c values. As for the Matern32, the Matern family of covariance functions is popular for spatially related data, and I selected v=3/2 rather than 5/2 because I thought 5/2 would overly smooth the transitions (I may be wrong, since I have not tested that out). I can add a section about why the Matern family of covariance functions is typically used for spatially related data and talk about the smoothing as we change v. Do you think that would be helpful?

Sure, as much as you'd like! Why Matern covariances are used for spatial data is super interesting, and might steer people away from ExpQuad. Or even include the reasoning you give here (that you thought Matern52 would be too smooth, etc.).

is it fair to interpret the length scale this way: We can expect the Gaussian mean to decay towards 0 (since we set a 0 mean function) as we move 0.187 degrees away from any sampled point on the map?

Yes exactly this. But it's not a hard cut off, and usually the lengthscale posterior is poorly constrained by the data so it still has a large standard deviation. Like if you plot:

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
x = np.linspace(0, 10, 100)
K = pm.gp.cov.Matern52(input_dim=1, ls=1)(x[:, None]).eval()
plt.plot(x, K[:, 0]);

Then you'll see the covariance between x and x' it gives with a specific lengthscale ls=1.
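
The same kind of plot also makes the smoothness comparison concrete; a quick sketch overlaying ExpQuad, Matern52, and Matern32 at the same lengthscale:

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm

x = np.linspace(0, 10, 100)
kernels = {
    "ExpQuad": pm.gp.cov.ExpQuad(input_dim=1, ls=1),
    "Matern52": pm.gp.cov.Matern52(input_dim=1, ls=1),
    "Matern32": pm.gp.cov.Matern32(input_dim=1, ls=1),
}
for name, cov in kernels.items():
    K = cov(x[:, None]).eval()
    plt.plot(x, K[:, 0], label=name)  # covariance with the point at x=0
plt.legend();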

@Dekermanjian
Contributor Author

@bwengals Thank you for that explanation; it was very helpful. Okay, so what I will do is:

  • Link to the HSGP first steps example
  • Add a section talking about why we use the Matern and why we wouldn't want to use the ExpQuad
  • Talk about how to interpret the posterior mean of the length scale parameter

I will do my best to have these done in the next couple of days.

@bwengals
Collaborator

Sure, no rush, thanks! I saw this recently; it might be helpful for the lengthscale interpretation.

Member

@OriolAbril OriolAbril left a comment

Plots seem to render correctly: https://pymcio--691.org.readthedocs.build/projects/examples/en/691/spatial/malaria_prevalence_PyMCv5.html#data-processing, so kudos to pydata-sphinx-theme team

A couple of extra general comments. The filename makes it into the URL for the page (as seen above), so I would remove pymcv5 from the name.

You can use the authors section for a more detailed authorship description and acknowledgements, but I agree that if you wrote everything you should be the only author (for now). At the bottom you can link to the original source/inspiration and say you were the one to adapt it, with extensive feedback from Bill, or something like that.

…xplanation to why we use matern vs expquad, added acknowledgements section under authors
@Dekermanjian
Contributor Author

@OriolAbril Thank you for checking the interactive plots! I removed pymcv5 from the filename and I added a sub-section for Acknowledgements within the Authors section.

@bwengals I added the items that I listed earlier:

  • Mentioned and linked the HSGP first steps example, which covers in detail how to choose the m and c parameters of the HSGP
  • Put in a little blurb about how to think about the estimated length scale.
  • Added a section right before the conclusions highlighting the advantages of the Matern covariance function compared to the ExpQuad covariance function and why we would want that in spatial analyses.

If you don't mind, whenever you have the chance, could you review the new additions?

Thank you both for your help with this example!

@Dekermanjian
Contributor Author

Sure, no rush, thanks! I saw this recently; it might be helpful for the lengthscale interpretation.

Thank you for sharing this with me! I will give it a read and see if I can provide an enhanced interpretation of the length scale.

@Dekermanjian
Contributor Author

Hey @bwengals I read through the article you shared with me. Here are my high level take-aways:

  • For shorter length scales, the choice of prior placed on the length scale has a negligible effect on the posterior mean and HDI coverage (they specifically tested Cauchy, half normal, gamma, and Weibull; the rationale was to vary the density mass near 0 and in the tails).
  • For longer length scales, the Cauchy and Weibull priors in their simulations resulted in large posterior mean bias relative to the true function and poorer HDI coverage, while the differences between the half normal and gamma priors were negligible.

Their testing conditions were: fixing the signal-to-noise ratio across varying length scales, simulating data using a uniform distribution, and attempting to recover both the true length scale and the true function using a 1-D ExpQuad GP.
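
If any of that ends up in the notebook, the prior families they compared are all available directly in PyMC; a minimal sketch of placing them on a lengthscale (the hyperparameter values are placeholders, and a half-Cauchy stands in for the Cauchy since the lengthscale is positive):

import pymc as pm

with pm.Model():
    ls_halfcauchy = pm.HalfCauchy("ls_halfcauchy", beta=1.0)
    ls_halfnormal = pm.HalfNormal("ls_halfnormal", sigma=1.0)
    ls_gamma = pm.Gamma("ls_gamma", alpha=2.0, beta=1.0)
    ls_weibull = pm.Weibull("ls_weibull", alpha=2.0, beta=1.0)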


review-notebook-app bot commented Aug 23, 2024


aloctavodia commented on 2024-08-23T17:11:42Z
----------------------------------------------------------------

To add a little bit more narrative and make the example friendlier to non-experts, you could turn the comments into text in markdown cells.



review-notebook-app bot commented Aug 23, 2024


aloctavodia commented on 2024-08-23T17:11:43Z
----------------------------------------------------------------

Again, for non-experts, could you explain what EPSG is and why we care in particular about 4326?
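
(For context: EPSG codes are numeric identifiers for coordinate reference systems, and EPSG:4326 is the WGS 84 longitude/latitude system. With geopandas the conversion is roughly the following; the file name is a placeholder:)

import geopandas as gpd

# Illustrative only: reproject whatever CRS the file comes in to EPSG:4326
gdf = gpd.read_file("some_spatial_data.shp")  # placeholder path
print(gdf.crs)                                # the CRS the data was read in with
gdf_wgs84 = gdf.to_crs(epsg=4326)             # WGS 84 longitude/latitude, in degrees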



review-notebook-app bot commented Aug 23, 2024


aloctavodia commented on 2024-08-23T17:11:43Z
----------------------------------------------------------------

maybe pass kind="stats" argument
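
(Possibly referring to az.summary, where kind="stats" keeps the posterior summary columns and drops the sampler diagnostics; a toy illustration:)

import numpy as np
import arviz as az

# Toy posterior just to show the call; kind="stats" omits the diagnostics columns
idata = az.from_dict(posterior={"mu": np.random.default_rng(0).normal(size=(2, 100))})
print(az.summary(idata, kind="stats"))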



review-notebook-app bot commented Aug 23, 2024


aloctavodia commented on 2024-08-23T17:11:44Z
----------------------------------------------------------------

Not entirely clear to me. Can we overlap the map or the observations?



review-notebook-app bot commented Aug 23, 2024


bwengals commented on 2024-08-23T17:21:46Z
----------------------------------------------------------------

Just a couple small spelling things. Otherwise, thanks for adding this explanation!

Quadtratic -> Quadratic

paramter -> parameter

smooting -> smoothing


@bwengals
Collaborator

I'm good with the nb as is, pending @aloctavodia's suggestions.

@Dekermanjian
Contributor Author

@aloctavodia thank you very much for your review and suggestions. I will make the requested additions.

@bwengals Thank you again for reviewing this example. I will clean up those spelling typos.

I appreciate all the feedback, thank you!

…rative, made the graph after inference clearer, fixed typos
@Dekermanjian
Contributor Author

@aloctavodia I made the changes that you suggested. I added a paragraph describing coordinate reference systems, added more narrative alongside the comments in the data processing section, and plotted the actual prevalences side by side with the inferred prevalences. When you have a chance, please let me know if that looks good to you.


review-notebook-app bot commented Aug 26, 2024


bwengals commented on 2024-08-26T06:12:59Z
----------------------------------------------------------------

I wonder if the divergences are caused by the really large m and c values. Since the data size is small, what if you use pm.gp.Latent here instead, so that the HSGP approximation doesn't come into play and it's only about the difference in the Matern kernels?


Dekermanjian commented on 2024-08-27T01:28:10Z
----------------------------------------------------------------

Hey @bwengals, I swapped out the HSGP for an actual GP but I am still getting divergences. I tried increasing the sample size, swapping out the priors on the length scale and on eta, but nothing seems to get rid of the divergences.

@aloctavodia
Member

thanks @Dekermanjian! It reads much clearer now.

@Dekermanjian
Contributor Author

thanks @Dekermanjian! It reads much clearer now.

@aloctavodia Thank you for your feedback. I really appreciate it.

@bwengals
Collaborator

bwengals commented Aug 28, 2024

Try this code instead:

# Fit a GP to model the simulated data
with pm.Model() as matern_model:

    eta = pm.Exponential("eta", scale=10.0)
    ls = pm.Lognormal("ls", mu=0.5, sigma=0.75)
    cov_func = eta**2 * pm.gp.cov.Matern32(input_dim=1, ls=ls)
    gp = pm.gp.Latent(cov_func=cov_func)
    s = gp.prior("s", X=x[:, None])

    measurement_error = pm.Exponential("measurement_error", scale=5.0)
    pm.Normal("likelihood", mu=s, sigma=measurement_error, observed=y)
    matern_idata = pm.sample(tune=2000, nuts_sampler="numpyro", target_accept=0.98)

The divergences aren't completely gone, but the different priors on the hyperparameters are helping to tamp them down. GP models are really sensitive to how those are set. To get rid of divergences completely, you can switch to pm.gp.Marginal instead of pm.gp.Latent, which is only possible because the likelihood is Gaussian. Notice you don't need to turn target_accept up here. That code is:

# Fit a GP to model the simulated data
with pm.Model() as matern_model:

    eta = pm.Exponential("eta", scale=10.0)
    ls = pm.Lognormal("ls", mu=0.5, sigma=0.75)
    cov_func = eta**2 * pm.gp.cov.Matern32(input_dim=1, ls=ls)
    
    gp = pm.gp.Marginal(cov_func=cov_func)
    
    measurement_error = pm.Exponential("measurement_error", scale=5.0)
    gp.marginal_likelihood("likelihood", X=x[:, None], y=y, sigma=measurement_error)
    
    matern_idata = pm.sample(tune=2000, nuts_sampler="numpyro")

Contributor Author

@bwengals thank you! That certainly reduced the divergences quite a bit. I added in your changes and pushed up a new version.

@bwengals bwengals self-requested a review August 29, 2024 00:36
@bwengals
Collaborator

I'd approve, as long as @aloctavodia is good with things?

@Dekermanjian
Contributor Author

Hey @aloctavodia, is there anything in the example you'd like me to update or anything you think I should add to make it a better example? Thank you for taking the time to review the example. I really appreciate it!

@aloctavodia aloctavodia merged commit 2d9fec2 into pymc-devs:main Sep 7, 2024
2 checks passed
@aloctavodia
Member

Thanks, @Dekermanjian, this is a great addition to the example gallery.

@Dekermanjian
Contributor Author

Thank you very much, @aloctavodia!!
