Skip to content

Update dirichlet mixture model notebooks #212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jan 9, 2022

Conversation

chiral-carbon
Copy link
Collaborator

Addresses issues #100 and #101 and aims to advance the notebooks to best practices.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 18, 2021

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-08-18T19:25:39Z
----------------------------------------------------------------

ost_pdf_contribs = sp.stats.norm.pdf(
    np.atleast_3d(x_plot),
    posterior["mu"].T.values[:, np.newaxis, :],
    1.0 / np.sqrt(posterior["lambda_"] * posterior["tau"]).T.values[:, np.newaxis, :],
)

Is there a better way to code for the plots than using posterior["mu"].T.values[:, np.newaxis, :]?

I tried passing the xarray dataset of all the data variables from trace.posterior , like posterior["mu"] with np.atleast_3d(xr.DataArray(x_plot)), but it raised an error.

I thought that might be because the scipy.stats.norm functions might not take xarray inputs, but I am not sure. If there is a better way of utilizing xarray here then do point that out, I'll try changing it.


OriolAbril commented on 2021-08-24T18:11:42Z
----------------------------------------------------------------

There are some ways to make scipy.stats.norm compatible with xarray inputs. If you have the inferencedata saved we can discuss that tomorrow.

chiral-carbon commented on 2021-08-26T06:22:32Z
----------------------------------------------------------------

didnt get a chance to ask about using xarray with scipy.stats.norm functions, do you have any quick links I could refer to?

OriolAbril commented on 2021-08-26T14:04:04Z
----------------------------------------------------------------

https://arviz-devs.github.io/arviz/user_guide/pymc3_refitting_xr_lik.html is the example closest to this situation, then also check the apply_ufunc docs and this answer: https://stackoverflow.com/questions/58719696/how-to-apply-a-xarray-u-function-over-netcdf-and-return-a-2d-array-multiple-new/62012973#62012973

@@ -65,8 +65,9 @@
},
Copy link
Collaborator Author

@chiral-carbon chiral-carbon Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried running the notebook first without making any major changes (except using numpy random Generator for setting the seed), but I get an error here. The summary_multinomial dataframe that I get is not the same as before, because,

creation of the column ess_mean_per_sec depends on a column called ess_mean. However, here we don't get the columns ess_mean and ess_sd , is there something obvious I missed here?


Reply via ReviewNB

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arviz summary now uses ess_bulk as default. They are not exactly the same but nearly. For the purposes of these notebook we can just change ess_mean for ess_bulk

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 19, 2021

View / edit / reply to this conversation on ReviewNB

Sayam753 commented on 2021-08-19T13:37:35Z
----------------------------------------------------------------

I think the equation $w_i = \beta_i \prod_{j=1}^{n-1}(1-\beta_j)$ should be $w_i = \beta_i \prod_{j=1}^{i-1}(1-\beta_j)$

Reference - https://en.wikipedia.org/wiki/Dirichlet_process#The_stick-breaking_process


chiral-carbon commented on 2021-08-19T16:54:07Z
----------------------------------------------------------------

ah thanks!

OriolAbril commented on 2021-08-24T18:03:25Z
----------------------------------------------------------------

I don't know if this comes from wikipedia or if it's maybe a typo, but I find using both lowercase omega and w in the same set of equations to be extremely confusing and a recipe for disaster.

chiral-carbon commented on 2021-08-25T07:15:33Z
----------------------------------------------------------------

will change the variable for weights to \mu

OriolAbril commented on 2021-08-25T08:10:00Z
----------------------------------------------------------------

how about weights if that is what the variable represents?

chiral-carbon commented on 2021-08-25T21:23:36Z
----------------------------------------------------------------

wouldn't 'weights' be a little too big for a variable name? will change to that though if you think that would make things clearer

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 19, 2021

View / edit / reply to this conversation on ReviewNB

Sayam753 commented on 2021-08-19T13:37:36Z
----------------------------------------------------------------

Line #3.    ax.plot(x_plot, sample_cdfs[0], c="gray", alpha=0.75, label="DP sample CDFs")
Line #4.    ax.plot(x_plot, sample_cdfs[1:].T, c="gray", alpha=0.75)

This can be combined in one ax.plot(x_plot, sample_cdfs.T, c="gray", alpha=0.75, label="DP sample CDFs")


chiral-carbon commented on 2021-08-19T16:54:54Z
----------------------------------------------------------------

thanks, will try running this.

OriolAbril commented on 2021-08-24T18:04:48Z
----------------------------------------------------------------

arviz dev version will have an ecdf plot very soon which could be used in this notebook https://github.com/arviz-devs/arviz/pull/1753

chiral-carbon commented on 2021-08-25T07:23:29Z
----------------------------------------------------------------

then it would be better to wait for that I think because on contracting

Line #3.    ax.plot(x_plot, sample_cdfs[0], c="gray", alpha=0.75, label="DP sample CDFs")
Line #4.    ax.plot(x_plot, sample_cdfs[1:].T, c="gray", alpha=0.75)

to

ax.plot(x_plot, sample_cdfs.T, c="gray", alpha=0.75, label="DP sample CDFs")  

I end up getting multiple labels for DP sample CDFs lines in the legend and we would have to write a code snippet to handle the legend correction.

Copy link
Collaborator Author

ah thanks!


View entire conversation on ReviewNB

Copy link
Collaborator Author

thanks, will try running this.


View entire conversation on ReviewNB

Copy link
Member

I don't know if this comes from wikipedia or if it's maybe a type, but I find using both lowercase omega and w in the same set of equations to be extremely confusing and a recipe for disaster.


View entire conversation on ReviewNB

Copy link
Member

arviz dev version will have an ecdf plot very soon which could be used in this notebook https://github.com/arviz-devs/arviz/pull/1753


View entire conversation on ReviewNB

Copy link
Member

There are some ways to make scipy.stats.norm compatible with xarray inputs. If you have the inferencedata saved we can discuss that tomorrow.


View entire conversation on ReviewNB

@@ -65,8 +65,9 @@
},
Copy link
Member

@OriolAbril OriolAbril Aug 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use coords. This example tries to be more down-to-earth even if using simulated data so the number of clusters (k again) corresponds to the tree species, so it would be great to use as coord values real tree names :). For the number of observations, here they now correspond to the number of forests, so again, as there are only 10 "forests" we could also use real forest names.


Reply via ReviewNB

Copy link
Collaborator Author

@chiral-carbon chiral-carbon Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have used coords/dims, what do you mean by using real tree names? 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the dimension should be tree, not num_trees because the number of trees is an integer, k in this case. tree_num or tree_id could also work, but I'd much rather have non-integer coordinate values now that they make sense. Here we count trees on different forests, so we can use the names of the trees we count and the names of the forests where we have counted these trees. The tree dimension for example could have ["pine", "oak", "cedar", "cottonwook"... and the coordinate values for the forest dimension can also be real forest names. For example cool catalan forests: https://www.descobrir.cat/ca/notices/2021/03/7-boscos-espectaculars-per-descobrir-a-catalunya-4742.php

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it, that's a cool idea. will add that!

@@ -65,8 +65,9 @@
},
Copy link
Member

@OriolAbril OriolAbril Aug 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #3.            draws=int(5e3), chains=4, step=pm.Metropolis(), return_inferencedata=True

5000 is shorter than int(5e3) and I find it more clear


Reply via ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 24, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-24T18:15:48Z
----------------------------------------------------------------

k represents the number of components, n the number of observations, so obs_id should be an arange of length N and be used on the NormalMixture


@review-notebook-app
Copy link

review-notebook-app bot commented Aug 24, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-24T18:15:49Z
----------------------------------------------------------------

truncating (...) to thirty components


@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-24T18:15:49Z
----------------------------------------------------------------

Line #5.    ax.bar(plot_w - 0.5, posterior["w"].mean(axis=1), width=1.0, lw=0)

Here we want the mean over the chain and draw dimensions, so we don't really need to use the stacked posterior, and as a general rule, we should not use axis=1 because it's not clear what are we reducing. It should be mean(("chain", "draw")) 


Copy link
Collaborator Author

will change the variable for weights to mu


View entire conversation on ReviewNB

Copy link
Collaborator Author

then it would be better to wait for that I think because on contracting

Line #3.    ax.plot(x_plot, sample_cdfs[0], c="gray", alpha=0.75, label="DP sample CDFs")
Line #4.    ax.plot(x_plot, sample_cdfs[1:].T, c="gray", alpha=0.75)

to

ax.plot(x_plot, sample_cdfs.T, c="gray", alpha=0.75, label="DP sample CDFs")  

I end up getting multiple labels for DP sample CDFs lines in the legend and we would have to write a code snippet to handle the legend correction.


View entire conversation on ReviewNB

Copy link
Member

how about weights if that is what the variable represents?


View entire conversation on ReviewNB

Copy link
Collaborator Author

wouldn't 'weights' be a little too big for a variable name? will change to that though if you think that would make things clearer


View entire conversation on ReviewNB

Copy link
Collaborator Author

didnt get a chance to ask about using xarray with scipy.stats.norm functions, do you have any quick links I could refer to?


View entire conversation on ReviewNB

Copy link
Member

https://arviz-devs.github.io/arviz/user_guide/pymc3_refitting_xr_lik.html is the example closest to this situation, then also check the apply_ufunc docs and this answer: https://stackoverflow.com/questions/58719696/how-to-apply-a-xarray-u-function-over-netcdf-and-return-a-2d-array-multiple-new/62012973#62012973


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 1, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-01T18:52:29Z
----------------------------------------------------------------

Line #2.        frac = pm.Dirichlet("frac", a=np.ones(k))

frac has dim tree and should be annotated, this will also make the right coordinate value appear in the plot_forest. I would also use singular names for the dimension names for coherence with the arviz built-in chain and draw


@review-notebook-app
Copy link

review-notebook-app bot commented Sep 1, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-01T18:52:30Z
----------------------------------------------------------------

This is looping over trees so it should use sel(tree="abc") instead of the [:, :, :, j] as well as using the tree names as labels for the plots.


chiral-carbon commented on 2021-09-03T13:37:47Z
----------------------------------------------------------------

even though I initialize counts with dims=("forest", "tree") it ends up having dimensions counts_dim_0 and counts_dim_1 in the posterior predictive group, will rename the dims

Copy link
Collaborator Author

even though I initialize counts with dims=("forest", "tree") it ends up having dimensions counts_dim_0 and counts_dim_1 in the posterior predictive group, will rename the dims


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 3, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-03T14:08:50Z
----------------------------------------------------------------

Here you are using from_dict instead of from_pymc3 which is why it doesn't know the dims and coords automatically


chiral-carbon commented on 2021-09-03T15:02:39Z
----------------------------------------------------------------

using from_pymc3 also doesnt make it realize the coords/dims, and also for some reason ends up confusing the chains and draws to form 5 dims in total like:

  • chain: 1
  • counts_dim_0: 5000
  • counts_dim_1: 10
  • counts_dim_2: 5
  • draw: 4

I'm not sure why this is happening

OriolAbril commented on 2021-09-03T15:53:03Z
----------------------------------------------------------------

It's probably using keep_size=True in sample_posterior_predictive. And make sure to make the conversion inside a model context, the coords and dims are stored in the model object so they won't be available otherwise.

Copy link
Collaborator Author

using from_pymc3 also doesnt make it realize the coords/dims, and also for some reason ends up confusing the chains and draws to form 5 dims in total like:

  • chain: 1
  • counts_dim_0: 5000
  • counts_dim_1: 10
  • counts_dim_2: 5
  • draw: 4

I'm not sure why this is happening


View entire conversation on ReviewNB

Copy link
Member

It's probably using keep_size=True in sample_posterior_predictive. And make sure to make the conversion inside a model context, the coords and dims are stored in the model object so they won't be available otherwise.


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 22, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-22T00:42:06Z
----------------------------------------------------------------

the references here should be cited with the cite role so they are then also incorporated into the bibliography automatically.


@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-22T00:42:07Z
----------------------------------------------------------------

the theme already adds a table of contents automatically, there is no longer need for a manual agenda


@review-notebook-app
Copy link

review-notebook-app bot commented Sep 22, 2021

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-09-22T13:11:07Z
----------------------------------------------------------------

Ohlssen, et al. provide justification for truncation, showing that 

the link here points to http://fisher.osu.edu/~schroeder.9/AMIS900/Ohlssen2006.pdf which gives an error. I'm also not sure which paper they are referring to from searching the keywords in the URL itself. it most likely is by this author but I'm not sure which paper this is supposed to cite.


chiral-carbon commented on 2021-09-23T11:56:37Z
----------------------------------------------------------------

hey @AustinRochford , could you maybe let me know which paper you meant to reference with http://fisher.osu.edu/~schroeder.9/AMIS900/Ohlssen2006.pdf as the link now seems to be broken?

Copy link
Collaborator Author

hey @AustinRochford , could you maybe let me know which paper you meant to reference with http://fisher.osu.edu/~schroeder.9/AMIS900/Ohlssen2006.pdf as the link now seems to be broken?


View entire conversation on ReviewNB

@OriolAbril OriolAbril force-pushed the dirichlet-mix-models branch from 0ea547f to 78ce655 Compare January 9, 2022 05:13
@OriolAbril OriolAbril merged commit 7af5789 into pymc-devs:main Jan 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants