
Missing data and Bayesian Imputation #500


Merged
merged 20 commits into from
Feb 3, 2023


NathanielF
Contributor

@NathanielF NathanielF commented Jan 16, 2023

A notebook on Missing Data methods and Bayesian imputation

Related to #461

This notebook showcases methods for imputing missing data, primarily Bayesian ones. We focus on a dataset of employee satisfaction metrics drawn from the book Applied Missing Data Analysis. We demonstrate how FIML and Bayesian imputation under the multivariate normal distribution work and how they differ, and we show how to approximate the multivariate distribution using sequential chained-equation methods.
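
To make the chained-equation idea concrete, here is a minimal NumPy sketch; it is not the notebook's code. The function name `chained_regression_impute` is hypothetical, and the approach is a deterministic simplification: each incomplete column is regressed on the others and its missing entries are filled with fitted values, whereas a Bayesian version would draw them from a posterior predictive distribution.

```python
import numpy as np


def chained_regression_impute(data, n_iter=10):
    """Fill NaNs column by column with linear-regression predictions.

    Hypothetical helper for illustration only. A full chained-equation
    scheme would sample imputations rather than plug in fitted values.
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()

    # Initialise every missing cell with its column's observed mean.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        for j in range(data.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            # Predictors: an intercept plus all other columns (current fill).
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit only on rows where column j was actually observed.
            coef, *_ = np.linalg.lstsq(design[~rows], data[~rows, j], rcond=None)
            filled[rows, j] = design[rows] @ coef
    return filled
```

Cycling through the columns several times lets imputations in one column inform the regressions for the others, which is the sense in which the sequence of univariate models approximates a joint distribution.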

…t FIML

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
@NathanielF NathanielF marked this pull request as ready for review January 23, 2023 22:32
@NathanielF
Contributor Author

I think this is ready for review now. It's quite long and covers a number of approaches to imputation.

(i) We discuss the taxonomy of missing-ness: MCAR, MAR and MNAR. I try to set it up as a prelude to considerations about causal inference.

(ii) FIML and MLE approaches to estimating a multivariate model given missing data
(iii) Bayesian imputation of missing values using the multivariate Gaussian and the posterior predictive distribution
(iv) Two examples of imputation using sequential regression equations

Each of the approaches so far is presented in the Enders book and our estimates match those presented there.

(v) I apply the missing data imputation to a hierarchical model and estimate the values of the missing data informed by the structure of "team" clusters in our employee data set. The model is estimated using the blackjax sampler and shows divergences, but converges nicely with good Rhat numbers. I use the differences in imputation patterns between the hierarchical model and the simpler regression models to argue that we need to be aware of heterogeneous patterns of imputation, and that this is analogous to concerns about heterogeneous treatment effects in causal inference.

We finish with a wrap-up and a celebration of the flexibility of Bayesian modelling in an enterprise that has to work with confounding and complexity.
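
As a rough illustration of item (ii), the quantity that FIML maximises can be written as a sum of multivariate normal log-densities, each evaluated only on a row's observed entries. The sketch below assumes that standard formulation; `observed_data_loglik` is an illustrative name, not a function from the notebook or from PyMC.

```python
import numpy as np


def observed_data_loglik(data, mu, cov):
    """Multivariate-normal log-likelihood using only observed entries.

    Illustrative helper (an assumption, not the notebook's code): for each
    row, the mean vector and covariance matrix are subset to the columns
    that are not NaN before the log-density is evaluated.
    """
    data = np.asarray(data, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)
        k = int(obs.sum())
        if k == 0:
            continue  # a fully missing row contributes nothing
        resid = row[obs] - mu[obs]
        cov_oo = cov[np.ix_(obs, obs)]  # observed-by-observed block
        _, logdet = np.linalg.slogdet(cov_oo)
        quad = resid @ np.linalg.solve(cov_oo, resid)
        total += -0.5 * (k * np.log(2 * np.pi) + logdet + quad)
    return total
```

Handing the negative of this function to a numerical optimiser over `mu` and `cov` would give maximum-likelihood estimates without discarding incomplete rows.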

@review-notebook-app

review-notebook-app bot commented Jan 24, 2023

View / edit / reply to this conversation on ReviewNB

fonnesbeck commented on 2023-01-24T02:41:15Z
----------------------------------------------------------------

The table looks janky. Does it need to be placed in a code block to enforce monospace?


NathanielF commented on 2023-01-24T10:55:40Z
----------------------------------------------------------------

Fair. It was a bit needless. I've taken another approach, just adding the patterns of missing-ness as a pandas DataFrame.
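
For readers who want to build such a table themselves, one possible way to tabulate missing-ness patterns with pandas is sketched below. This is a guess at the general approach, not the notebook's actual cell; `missingness_patterns` is a hypothetical helper and the column names in the usage note are placeholders.

```python
import numpy as np
import pandas as pd


def missingness_patterns(df):
    """Count how often each observed/missing pattern occurs across rows.

    Hypothetical helper: each output row is one pattern, with True marking
    a missing column, plus the number of rows showing that pattern.
    """
    indicator = df.isna()
    patterns = (
        indicator.groupby(list(indicator.columns))
        .size()
        .reset_index(name="n_rows")
        .sort_values("n_rows", ascending=False, ignore_index=True)
    )
    return patterns
```

Calling `missingness_patterns(pd.DataFrame({"empowerment": [1.0, np.nan], "satisfaction": [2.0, 3.0]}))` returns one row per distinct pattern, which renders as an ordinary table in a notebook.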


fonnesbeck commented on 2023-01-24T02:41:16Z
----------------------------------------------------------------

Should add a legend if possible.


NathanielF commented on 2023-01-24T11:02:17Z
----------------------------------------------------------------

Done.


fonnesbeck commented on 2023-01-24T02:41:17Z
----------------------------------------------------------------

Perhaps add a sentence or two interpreting these plots?


NathanielF commented on 2023-01-24T10:56:09Z
----------------------------------------------------------------

Updated and added some more explanatory text


fonnesbeck commented on 2023-01-24T02:41:18Z
----------------------------------------------------------------

Line #15.        pm.Potential("x_logp", pm.logp(rv=pm.MvNormal.dist(mus, chol=cov_flat_prior), value=x))

Why are potentials being constructed here rather than just imputing with the MvNormal likelihood? Does that not work anymore? (perhaps I'm missing something obvious)


NathanielF commented on 2023-01-24T10:57:31Z
----------------------------------------------------------------

Yes, I think it's broken or not implemented in the latest version. I was getting the same error discussed here: https://discourse.pymc.io/t/automatic-imputation-of-multivariate-models/11029
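
Whichever mechanism supplies the log-density, imputing under a multivariate normal amounts to describing each row's missing entries conditional on its observed ones. The NumPy sketch below shows that conditional distribution for a single row, treating `mu` and `cov` as known purely for illustration. `conditional_mvn` is a hypothetical name and this is not how PyMC implements imputation internally.

```python
import numpy as np


def conditional_mvn(row, mu, cov):
    """Mean and covariance of a row's missing entries given its observed ones.

    Illustrative only: uses the standard Gaussian conditioning formulas with
    fixed mu and cov, which a Bayesian model would instead sample.
    """
    row = np.asarray(row, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    mis = np.isnan(row)
    obs = ~mis
    cov_oo = cov[np.ix_(obs, obs)]
    cov_mo = cov[np.ix_(mis, obs)]
    cov_mm = cov[np.ix_(mis, mis)]
    # Regression weights of the missing block on the observed block.
    weights = np.linalg.solve(cov_oo, cov_mo.T).T
    cond_mean = mu[mis] + weights @ (row[obs] - mu[obs])
    cond_cov = cov_mm - weights @ cov_mo.T
    return cond_mean, cond_cov
```

Drawing from a normal with this mean and covariance, repeated across posterior samples of `mu` and `cov`, is one way to think about what the imputed values represent.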


fonnesbeck commented on 2023-01-24T02:41:19Z
----------------------------------------------------------------

Lower case y in "PyMC"


NathanielF commented on 2023-01-24T10:57:43Z
----------------------------------------------------------------

Adjusted!


fonnesbeck commented on 2023-01-24T02:41:19Z
----------------------------------------------------------------

I'm not sure printing out the entire idata object is helpful, given how large and verbose it is. Maybe pull a few elements that are interesting?


NathanielF commented on 2023-01-24T10:59:00Z
----------------------------------------------------------------

Removed the idata_uniform entirely as it was a bit overkill. I left the idata_normal. I like having the ability to inspect the model output. Makes reproductions easier to check for consistency.

@fonnesbeck
Member

Great tutorial!

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Contributor Author

Thank you for taking the time to review!! Glad you liked it.


Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
@NathanielF NathanielF requested a review from fonnesbeck January 25, 2023 19:54
Contributor

@drbenvincent drbenvincent left a comment


Left some comments via nbreview.

Overall really cool. Bit of a vague comment, but I'd be tempted to add in a little more explanation. But that's just my own style, so feel free to ignore. For this more advanced level, it's quite possibly the case that people don't need more hand holding. Nevertheless, if you wanted to add some, it could make it more accessible to a broader range of readers.

House move is ongoing... the actual move won't happen for another month or so :)

@NathanielF
Contributor Author

Perfect, thanks @drbenvincent. Will adjust this evening.

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Contributor Author

Done



…d regression notebook

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Contributor Author

Yes


Contributor Author

I thought he just meant the legend in the picture, i.e. the color labels for Empowerment etc., which were missing at the time he commented but are there now for me.


Contributor Author

Changed this


Contributor Author

Linked to that notebook too.


Contributor Author

Done



Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
…ot by team

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
…text

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
@NathanielF
Contributor Author

That should be good to go now @drbenvincent. I've tidied a few things and added some more explanatory text to signpost what I'm doing a bit more. I think I've also addressed all comments above.

Contributor

Sorry if I'm missing it, but can't see a reference to the example


Contributor

Thanks! A quick find shows some remaining examples which are not actual L2 markdown headings: `## Percentage Missing` in this cell, a bunch in cell 31, and one in cell 11.


Contributor

Ah yes, Chris meant legend, but I meant figure caption :)


Contributor

@drbenvincent drbenvincent left a comment


Looks good. I added in a few minor replies to comments, but happy to approve after that.

Contributor Author

It's just above the introduction of the employee data set, below the MNAR definition.


Contributor

👍🏻



Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Contributor Author

Agh... sorry. I think I've got them all now.


Contributor

@drbenvincent drbenvincent left a comment


Great

@drbenvincent drbenvincent self-requested a review February 3, 2023 09:34
Contributor

@drbenvincent drbenvincent left a comment


Great stuff 👍🏻

@drbenvincent drbenvincent merged commit 9028ba3 into pymc-devs:main Feb 3, 2023