
Update GLM predictions #204


Closed
wants to merge 4 commits into from

Conversation

chiral-carbon
Collaborator

Addresses issue #85 and aims to advance it to best practices: use the arviz-darkgrid style and use bambi.


@review-notebook-app

review-notebook-app bot commented Aug 6, 2021


chiral-carbon commented on 2021-08-06T17:11:58Z
----------------------------------------------------------------

@tomicapretto had a doubt here as to how to convert this to use bambi instead. We have to pass a dataframe rather than x and y in bambi.Model(), so should I concat x with y and pass that? Would that be correct?

tomicapretto commented on 2021-08-06T17:26:26Z
----------------------------------------------------------------

Yep! Bambi only works with data frames so far. So you need to put all the data in a pandas data frame and then pass the data as the data argument.
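As a concrete illustration of this point, combining the predictors and the response into a single data frame might look like the sketch below. The column names, the simulated data, and the bernoulli formula are invented for the example, not taken from the PR; the actual bambi call is left commented out.

```python
import numpy as np
import pandas as pd

# Simulated predictors and binary outcome (names x1, x2, y are assumptions)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = (x1 + x2 + rng.normal(size=100) > 0).astype(int)

# Bambi takes one data frame plus a formula, not separate x/y arrays
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# import bambi as bmb
# model = bmb.Model("y ~ x1 + x2", df, family="bernoulli")
```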

OriolAbril commented on 2021-08-06T17:39:27Z
----------------------------------------------------------------

does it support out of sample posterior predictive sampling though?

tomicapretto commented on 2021-08-06T19:58:18Z
----------------------------------------------------------------

Yes, it does (in the dev version)

model.predict(idata, kind="pps", data=new_data)

Where

model is the Bambi model.

idata is the inference data returned by the sampling process.

new_data is a new pandas data frame that has the same structure as the data frame used to fit the model, except for the variable that indicates the number of successes.

If you have Model("p(y, n) ~ x1*x2", data, family="binomial")

Then the new data frame must have columns for x1, x2 and n
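Putting the steps above together, a minimal sketch of the out-of-sample frame for the binomial example could look like this. The values are invented, and the predict call is commented out since, per the comment, it needs the dev version of Bambi at the time of writing.

```python
import numpy as np
import pandas as pd

# For Model("p(y, n) ~ x1*x2", data, family="binomial"), the new frame
# needs columns x1, x2 and the trials column n, but not the successes y.
new_data = pd.DataFrame({
    "x1": np.linspace(-2, 2, 5),
    "x2": np.linspace(-1, 1, 5),
    "n": np.repeat(20, 5),  # number of trials per row (invented value)
})

# model.predict(idata, kind="pps", data=new_data)  # dev-version Bambi
```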


@chiral-carbon chiral-carbon mentioned this pull request Aug 13, 2021
@MarcoGorelli MarcoGorelli marked this pull request as draft August 13, 2021 13:25
@review-notebook-app

review-notebook-app bot commented Aug 24, 2021


OriolAbril commented on 2021-08-24T16:56:14Z
----------------------------------------------------------------

I would not compute the decision boundary with the mean of the posterior but instead use xarray; again, everything should broadcast automatically. If we do this we can delete this remark.

Also, the Deterministic is only valid in pure pymc3, whereas postprocessing with xarray only needs the InferenceData.
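A sketch of the xarray postprocessing described here, using a fabricated posterior in place of idata.posterior. The coefficient names (Intercept, x1, x2) and the logistic-regression boundary formula are assumptions about the model, not taken from the notebook.

```python
import numpy as np
import xarray as xr

# Fake posterior with (chain, draw) dims standing in for idata.posterior
chains, draws = 2, 50
rng = np.random.default_rng(1)
post = xr.Dataset(
    {
        "Intercept": (("chain", "draw"), rng.normal(0.0, 0.1, (chains, draws))),
        "x1": (("chain", "draw"), rng.normal(1.0, 0.1, (chains, draws))),
        "x2": (("chain", "draw"), rng.normal(-1.0, 0.1, (chains, draws))),
    }
)

# Grid of x1 values; xarray broadcasts (chain, draw) against (x1_grid,)
x1_grid = xr.DataArray(np.linspace(-3, 3, 101), dims="x1_grid")

# Decision boundary of a logistic regression: x2 where the logit is zero,
# computed per posterior sample with no Deterministic inside the model
boundary = -(post["Intercept"] + post["x1"] * x1_grid) / post["x2"]

# boundary has dims (chain, draw, x1_grid): one curve per posterior sample
```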


@chiral-carbon chiral-carbon marked this pull request as ready for review September 1, 2021 12:42
@review-notebook-app

review-notebook-app bot commented Sep 1, 2021


chiral-carbon commented on 2021-09-01T12:44:44Z
----------------------------------------------------------------

I get a memory error here, but the shape of the dataset being passed is 90000x4, not 90000x25000 as printed.

so what is the mistake here?


tomicapretto commented on 2021-09-01T13:48:28Z
----------------------------------------------------------------

This is definitely a problem with Bambi trying to create a very large object within the predict method. I will try to replicate it on my side to understand what is going on.

What I'm not sure about is why you are using design_matrices and then calling as_dataframe(). I think this can be simplified to:

grid = np.linspace(start=-9, stop=9, num=300)
x1, x2 = np.meshgrid(grid, grid)
x_grid = np.stack(arrays=[x1.flatten(), x2.flatten()], axis=1)
new_data = pd.DataFrame(x_grid, columns=["x1", "x2"])

model.predict(trace, kind="pps", data=new_data, draws=1000)

chiral-carbon commented on 2021-09-01T14:28:22Z
----------------------------------------------------------------

I think I simply replaced the previous patsy usage with formulae.design_matrices(), but thanks, this looks much simpler. I do get the same error though when running the code you provided.

Sayam753 commented on 2021-09-19T08:46:21Z
----------------------------------------------------------------

so what is the mistake here?

I don't think there is any mistake. The shape of dataset is (90k, 4).

The shape (90k, 25k) can be interpreted as -

  • There are 90k rows in the dataset (generated by the meshgrid)
  • We also have 25k samples (5 chains with 5000 draws each) for each random variable in the PyMC3 model.
  • For each row (a tuple of (Intercept, x1, x2, x1*x2)), we have 25k ways of generating y.
  • So, for all the rows in the dataset, the resulting shape of y will be (90k, 25k), which is a pretty big array.

To solve this, two obvious ways would be to

  • either reduce the number of draws/chains in pm.sample while making sure the sampler converged,
  • or reduce the number of data points to generate a smaller artificial dataset.
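The arithmetic above can be checked directly: at float64 precision, the (90k, 25k) posterior predictive array alone needs roughly 17 GiB, which is consistent with the memory error.

```python
# Rough size of the (90_000, 25_000) posterior predictive array
# described above: float64 needs 8 bytes per value.
n_rows = 300 * 300       # 90k grid points from the 300x300 meshgrid
n_samples = 5 * 5000     # 5 chains x 5000 draws = 25k samples
gib = n_rows * n_samples * 8 / 2**30
print(round(gib, 1))     # ~16.8 GiB
```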

chiral-carbon commented on 2021-09-20T17:24:38Z
----------------------------------------------------------------

I'll try this and see if it works

@chiral-carbon chiral-carbon changed the title [WIP] Update GLM predictions Update GLM predictions Sep 1, 2021
@review-notebook-app

review-notebook-app bot commented Sep 1, 2021


chiral-carbon commented on 2021-09-01T12:49:50Z
----------------------------------------------------------------

I think the model is overfitting? Have I made any mistakes in the model definition step?


@review-notebook-app


OriolAbril commented on 2021-09-08T15:46:55Z
----------------------------------------------------------------

I don't understand this sentence, and there is still a link to the old glm module code in the pymc3 repo.


@review-notebook-app


Sayam753 commented on 2021-09-19T03:59:53Z
----------------------------------------------------------------

The first sentence should use bambi.




@review-notebook-app


chiral-carbon commented on 2021-10-10T07:47:06Z
----------------------------------------------------------------

@OriolAbril the decision boundary does not look accurate to me, and I think it could be because of the reduced sampling size and the reduced dataset size generated by np.meshgrid()? How should I correct it?


@fonnesbeck
Member

Is this close to being ready? At this point we should go ahead and convert to v4.

@chiral-carbon
Collaborator Author

@fonnesbeck Hi, sorry to leave it here without updating. Will work on it this week and finish it up.

@fonnesbeck
Member

@chiral-carbon any chance of pushing this out, or do you want to hand it over to someone?

@chiral-carbon
Collaborator Author

chiral-carbon commented May 11, 2022 via email

@chiral-carbon
Collaborator Author

chiral-carbon commented May 15, 2022

@fonnesbeck should I try to update this to v4 now in this PR, or update to v3 only and leave the v4 update for a new PR? The same question applies to the other open PRs I have pending.

@ghost

ghost commented May 17, 2022

Definitely v4, as we are trying to get as many examples ported before the v4 release as possible. Let me know if you need a hand!

@chiral-carbon
Collaborator Author

@cfonnesbeck okay! I will need some help in that case. You could just point out what I should refer to/start with, or anything else that should be done first.

@fonnesbeck
Member

It looks like this one is in a bit of a holding pattern until the update to Bambi is released, but if you want to get a head start you'd have to build Bambi from its v4 branch, switch from pymc3 to pymc and see what breaks.

@tomicapretto

It looks like this one is in a bit of a holding pattern until the update to Bambi is released, but if you want to get a head start you'd have to build Bambi from its v4 branch, switch from pymc3 to pymc and see what breaks.

@chiral-carbon @fonnesbeck most things should work fine in the new v4 branch. The problems are related to some specific cases. I can assist if you need help!

@chiral-carbon
Collaborator Author

@tomicapretto @fonnesbeck thanks! I will update here as and when I need help.

@drbenvincent
Contributor

This notebook was updated to v4 by #370, so closing this PR
