
modified rugby analytics case study with increased arviz usage #62


Merged: 18 commits merged into pymc-devs:main on Apr 3, 2021

Conversation

mjhajharia
Member

@mjhajharia mjhajharia commented Mar 30, 2021

I tried to convert the plots to ArviZ wherever possible, removed a few warnings caused by deprecated plot names, and switched to ArviZ InferenceData.
Still working on: adding priors, etc.

Addresses issue #49 and aims to advance it to "Best Practices" column

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@mjhajharia
Member Author

@OriolAbril I had a few doubts

  1. In the second-to-last cell I tried:

    az.plot_pair(
        data,
        var_names=["atts"],
        coords={"team": teams.team.values},
        kind="scatter",
        divergences=True,
        textsize=25,
        figsize=(10, 10),
    )

    instead of sns.pairplot. Does that make sense?

  2. In the plot " HDI of Attack Strength, by Team", can we again use arviz forestplot?

  3. Are there any other changes I should look into, other than adding priors?

@mjhajharia mjhajharia marked this pull request as draft March 30, 2021 11:24
@mjhajharia mjhajharia mentioned this pull request Mar 30, 2021
@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:11Z
----------------------------------------------------------------

The try/except looks like a leftover from the days of Python 2 compatibility. It should be replaced by a plain import: from io import StringIO
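As a minimal sketch of the change being suggested (the CSV text below is a made-up stand-in, not the rugby data):

```python
# Python 3 only needs the plain import; the old try/except fallback
# between `StringIO` and `io.StringIO` was a Python 2 compatibility shim.
from io import StringIO

# StringIO wraps a string in a file-like object, e.g. usable by pd.read_csv
buf = StringIO("home_team,away_team\nWales,Italy\n")
first_line = buf.readline()
print(first_line)
```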


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:12Z
----------------------------------------------------------------

not sure why or how it appeared, but this unnamed column, which is a duplicate of the previous one, should not be present


almostmeenal commented on 2021-03-30T15:50:08Z
----------------------------------------------------------------

fixed that by changing

try:
    df_all = pd.read_csv("../data/rugby.csv")
except:
    df_all = pd.read_csv(pm.get_data("rugby.csv"), index_col=0)

to

df_all = pd.read_csv("../data/rugby.csv", index_col=0)

in the earlier case, the try branch did not pass index_col=0, which was causing the trouble

@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:13Z
----------------------------------------------------------------

unlike in the arviz example, here we should use dims instead of shape, which will add dims and coords automatically to the InferenceData object.

We should also use pm.Data for home_team and away_team so they are stored in the constant_data group. I wrote a blogpost which should be helpful in understanding how to make the most of the pymc3-arviz relation: https://oriolabril.github.io/oriol_unraveled/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html. Look at the explanations, not at the priors used, which are copied from the bad versions of the notebooks.


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:13Z
----------------------------------------------------------------

here we only need the trace; the posterior predictive is computed later (we'll get to that too) and the prior is not used. We should therefore use return_inferencedata=True


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:14Z
----------------------------------------------------------------

this goes for all the plots, and in general: you'll see warnings if you try to use arviz plots directly on pymc3 MultiTrace objects (the return type of pm.sample(..., return_inferencedata=False)). The warnings go away if you run those plots from within a model context, but this is not a good fix. The warning is shown because, when passing MultiTraces to arviz plots, arviz converts them to InferenceData under the hood; it does so automatically, and no arguments can be used to customize the conversion.

For both better functionality and performance, one should convert to InferenceData (or use return_inferencedata=True) and then call arviz functions. When doing this, no model context is needed; all the necessary info is contained in the InferenceData object.
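The explicit-conversion point can be sketched with arviz alone; the samples below are random stand-ins for a real trace, and the team names are illustrative:

```python
import numpy as np
import arviz as az

# Stand-in for posterior samples: 2 chains x 500 draws x 3 teams
rng = np.random.default_rng(0)
raw = {"atts": rng.normal(size=(2, 500, 3))}

# Converting explicitly lets us attach dims and coords, which the
# automatic under-the-hood conversion of a MultiTrace cannot do
idata = az.from_dict(
    posterior=raw,
    dims={"atts": ["team"]},
    coords={"team": ["Wales", "France", "Ireland"]},
)
print(idata.posterior["atts"].dims)
```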


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:15Z
----------------------------------------------------------------

This cell should be deleted, and the one below should be simplified to az.plot_energy(data);, we can keep the figsize, but that's all.


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:15Z
----------------------------------------------------------------

Gelman-Rubin is not used anymore; it should be $\hat{R}$, but I don't think it's shown in plot_energy with default arviz labels and legend. I would suggest adding a cell with az.summary(data, kind="diagnostics") and leaving the cell that checks for rhat.
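A sketch of that suggested diagnostics cell, using random stand-in samples rather than the notebook's trace:

```python
import numpy as np
import arviz as az

# Fake posterior: 2 chains, 500 draws, 3 teams, converted to InferenceData
rng = np.random.default_rng(0)
idata = az.from_dict(posterior={"atts": rng.normal(size=(2, 500, 3))})

# kind="diagnostics" restricts the summary table to the sampler
# diagnostics columns (mcse, ess, r_hat), dropping the posterior stats
summ = az.summary(idata, kind="diagnostics")
print(summ.columns.tolist())
```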


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:16Z
----------------------------------------------------------------

you can get this with idata.posterior["atts"].median(("chain", "draw")). ArviZ builds on top of xarray, which already implements most of the functions we'll want to use. Updating notebooks to get to the ArviZ column will require both ArviZ and xarray knowledge.
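That one-liner relies on xarray's named-dimension reductions; a small self-contained sketch with made-up samples:

```python
import numpy as np
import xarray as xr

# Stand-in posterior with dims (chain, draw, team)
atts = xr.DataArray(
    np.random.default_rng(0).normal(size=(2, 500, 3)),
    dims=("chain", "draw", "team"),
    coords={"team": ["Wales", "France", "Ireland"]},
)

# Reducing over the sampling dims leaves one median per team,
# with the team coordinate preserved
medians = atts.median(("chain", "draw"))
print(medians.dims)
```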


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:16Z
----------------------------------------------------------------

We do not want to convert the xarray dataset we get from az.hdi to pandas; we should use xarray all the way, which in my opinion will make the code much clearer. I would also recommend not using errorbar and going with vlines instead; the code should be very similar to cells 21, 27 or 46 in https://docs.pymc.io/notebooks/multilevel_modeling.html minus the loops, sorting and other preprocessing. 46 is the one that will look most similar to this case, I think.

Also try using ax = idata.posterior["atts"].median(("chain", "draw")).plot() (there are also examples similar to this in the multilevel modeling notebook) to create and label the axes, and then use ax.vlines.

Note: this applies both to this cell and to the 2 cells below.
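A sketch of staying in xarray from az.hdi onwards, with random stand-in data and illustrative names:

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
idata = az.from_dict(
    posterior={"atts": rng.normal(size=(2, 500, 3))},
    dims={"atts": ["team"]},
    coords={"team": ["Wales", "France", "Ireland"]},
)

# az.hdi returns an xarray Dataset with an extra "hdi" dim (lower/higher)
hdi = az.hdi(idata, var_names=["atts"])["atts"]
lower = hdi.sel(hdi="lower")
higher = hdi.sel(hdi="higher")

# medians and interval bounds share the "team" coord, so they can feed
# vlines-style plotting directly, without a round trip through pandas
medians = idata.posterior["atts"].median(("chain", "draw"))
print(bool((lower < higher).all()))
```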


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:17Z
----------------------------------------------------------------

We don't need this anymore, now all the plots have the actual team in their title, there is no need to have the number encoding at hand.


almostmeenal commented on 2021-03-30T16:16:57Z
----------------------------------------------------------------

i was thinking i'll replace teams as an array instead of a df in that case

@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:17Z
----------------------------------------------------------------

Here we should use

with model:
    data.extend(az.from_pymc3(
        posterior_predictive=pm.sample_posterior_predictive(data)
    ), inplace=True)

which will add the posterior predictive samples to the InferenceData object as the posterior_predictive group. We will therefore use arviz and xarray for postprocessing wherever possible; I am convinced all the looping, dicts and so on will be easier to handle. I have some examples of groupbys and other useful xarray computations here. That example is about handling log likelihoods, but I think it will be illustrative enough, and as it uses the same model, it will be easier to follow than the xarray docs; I can also add links to relevant sections of the xarray docs if you prefer.


almostmeenal commented on 2021-03-30T18:32:28Z
----------------------------------------------------------------

i did that, although i checked https://arviz-devs.github.io/arviz/api/generated/arviz.InferenceData.extend.html?highlight=data%20extend#arviz.InferenceData.extend and inplace is not a valid argument, i got that error as well

OriolAbril commented on 2021-03-30T18:38:27Z
----------------------------------------------------------------

True, I misremembered. extend can only operate in place, so it has no inplace argument; removing it should fix that
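A minimal stand-alone sketch of that corrected call, with az.from_dict standing in for the notebook's az.from_pymc3/pm.sample_posterior_predictive (shapes and names are illustrative):

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
idata = az.from_dict(posterior={"atts": rng.normal(size=(2, 100, 3))})

# Stand-in for az.from_pymc3(posterior_predictive=...)
pp = az.from_dict(
    posterior_predictive={"home_points": rng.poisson(20, size=(2, 100, 6))}
)

# extend always works in place, hence it takes no inplace argument
idata.extend(pp)
print(hasattr(idata, "posterior_predictive"))  # True
```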

almostmeenal commented on 2021-03-30T18:43:21Z
----------------------------------------------------------------

hi, so I made the changes except these next two cells for the "sim_table"; I'll read that notebook and see how to work with xarray, I'm still new to that. I'll commit the rest of the changes so you can have a look, and meanwhile I'll fix these as well. Thanks for the patient and detailed responses; I understand I don't know a lot of things and made many naive errors.

OriolAbril commented on 2021-03-30T20:09:30Z
----------------------------------------------------------------

This sounds great! Thanks!

i understand I don't know a lot of things and made many naive errors.

We have little to no documentation on that, which is one of the reasons we need to update everything urgently, so we don't expect anyone to know all this; even between maintainers, we are constantly asking each other about the parts of the library we don't know too much about. There are no stupid questions, so don't hesitate to ask anything.

@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:18Z
----------------------------------------------------------------

Looks great, in this particular case, I would add marginals=True to keep the look more similar to the original plot.


@review-notebook-app

review-notebook-app bot commented Mar 30, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-03-30T14:16:19Z
----------------------------------------------------------------

Given the extent of changes in data processing and visualization, I would update this to something along the lines of author Peadar Coyle, updated by Meenal Jhajharia to use ArviZ and xarray. No need to leave your mail.


@OriolAbril OriolAbril left a comment (Member)

I added several comments with ReviewNB that are also mirrored on GitHub, but don't read them on GitHub, as there it's not possible to see which cells and changes they refer to.


@mjhajharia
Member Author

@OriolAbril I couldn't make a commit because black is failing as a pre-commit hook. I'll look into it and resolve it, then you can check the cells; just give me some time

@OriolAbril
Member

You should try committing again right after the first time it fails. pre-commit edits the files in addition to making the commit error out, so the second time you try, the requirements should be met, at least for black, which is completely automated.

You can also run pre-commit manually as explained in the jupyter style guide

@mjhajharia
Member Author

mjhajharia commented Mar 31, 2021 via email

@mjhajharia
Member Author

mjhajharia commented Mar 31, 2021

@OriolAbril okay, so now it's done

Member Author

I've used arviz InferenceData to extend the posterior predictive part as you mentioned, but I've stuck to dataframes for generating the sim table. I tried using xarray operations to calculate 3*home_won and do the grouping, but it was getting complicated (maybe it's just me). I think it's fairly reasonable to use pandas for generating that table: all the data stays in arviz InferenceData, but its manipulation for viewing can be done with pandas.

Also, I kept constant data and used it when calculating the posterior predictive, because I feel it's good practice in general; having the constant_data group store home_team etc. makes everything happen in arviz and reduces the need to get data from the initial csv/df.

Let me know if i should change anything else

@OriolAbril
Member

yeah, I thought that was supposed to happen, so when I tried the next time it showed the same error again; when I pushed, it just showed everything up to date

Oh, I see, yes. I said commit above, but you first have to make sure to add the modified files before committing again, or use git commit -a (a for all), which is what I tend to do and why I forget to be explicit about the add part. In any case, take a look at git status right after running pre-commit to see if you have changes not staged for commit.

I've used arviz InferenceData to extend the posterior predictive part as you mentioned, but I've stuck to dataframes for generating the sim table. I tried using xarray operations to calculate 3*home_won and do the grouping, but it was getting complicated (maybe it's just me). I think it's fairly reasonable to use pandas for generating that table: all the data stays in arviz InferenceData, but its manipulation for viewing can be done with pandas.

Also, I kept constant data and used it when calculating the posterior predictive, because I feel it's good practice in general; having the constant_data group store home_team etc. makes everything happen in arviz and reduces the need to get data from the initial csv/df.

Let me know if i should change anything else

I have skimmed through the changes and they look amazing, thanks! There are some nits to polish, but nothing too big. I'll review later on ReviewNB.

Regarding the sim table, if you don't mind I can give it a go; now that everything else uses xarray and ArviZ, it will be much easier for me to jump in and focus only on this problematic computation. Using pandas is fine, but I think that here everything will be much simpler once converted to xarray, because the data is inherently multidimensional, and trying to get pandas to play nicely with it takes more time than the job itself. If so, I would do this after giving you time to work on the minor changes still needed, so there are no git conflicts, and push directly to this PR.

@review-notebook-app

review-notebook-app bot commented Apr 2, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-04-02T15:48:39Z
----------------------------------------------------------------

Add a semicolon at the end (like in plot_trace call) to remove undesired text output.


@review-notebook-app

review-notebook-app bot commented Apr 2, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-04-02T15:48:40Z
----------------------------------------------------------------

Title should be "HDI" or "Posterior HDI" instead of HPD, which is an acronym we have not introduced and which is not part of the library "corpus". Also, minor comment: we don't need plt.show() in notebooks


@review-notebook-app

review-notebook-app bot commented Apr 2, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-04-02T15:48:41Z
----------------------------------------------------------------

Do you want to add the code that uses the extra simulation dim?

If so we should update this to

[...] over a total of 120k simulations, 30 per sample in the posterior.

We can do that with the same ease we would generate a single simulation per draw in the posterior thanks to PyMC3+ArviZ+xarray integration.

If not we should update to "... over a total of 4000 simulations, one per sample in the posterior"



@review-notebook-app

review-notebook-app bot commented Apr 2, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-04-02T15:48:42Z
----------------------------------------------------------------

I don't like what black is doing here. Can you try reformatting to

pp["home_win"] = (
    (pp["home_points"] > pp["away_points"]) * 3     # home team wins and gets 3 points
    + (pp["home_points"] == pp["away_points"]) * 2  # tie -> home team gets 2 points
)
pp["away_win"] = (
    (pp["home_points"] < pp["away_points"]) * 3 
    + (pp["home_points"] == pp["away_points"]) * 2
)

and see if it likes it better?
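The scoring rule in that cell can be checked in isolation; here with plain numpy arrays as stand-ins for the posterior-predictive variables:

```python
import numpy as np

# Made-up scores for three matches: home win, away win, tie
home_points = np.array([20, 10, 15])
away_points = np.array([10, 20, 15])

# home team wins -> 3 points; tie -> 2 points each
home_win = (home_points > away_points) * 3 + (home_points == away_points) * 2
away_win = (home_points < away_points) * 3 + (home_points == away_points) * 2
print(home_win.tolist(), away_win.tolist())  # [3, 0, 2] [0, 3, 2]
```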


@review-notebook-app

review-notebook-app bot commented Apr 2, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-04-02T15:48:43Z
----------------------------------------------------------------

The xlabel should be updated; we are plotting all ranks now. Maybe "Probability of finishing in each rank (including ties)"? We need to keep the note about ties.


@OriolAbril OriolAbril left a comment (Member)

The notebook looks amazing now!

All remaining comments are mostly styling. I'll merge after this last review iteration. Thanks for all the patience and for addressing all the comments.

@mjhajharia
Member Author

mjhajharia commented Apr 2, 2021

@OriolAbril

g = sns.FacetGrid(df_all, col="away_team", col_wrap=2, height=5)
g = g.map(plt.scatter, "year", "difference_non_abs").set_axis_labels("Year", "Score Difference")

gives an error: /Users/almostmeenal/anaconda3/envs/p/lib/python3.9/site-packages/seaborn/axisgrid.py:64: UserWarning: This figure was using constrained_layout==True, but that is incompatible with subplots_adjust and or tight_layout: setting constrained_layout==False. self.fig.tight_layout(*args, **kwargs)

I think this is happening after setting the arviz style. I also need to change the font size; is there something you suggest doing about this error?

update: fixed it with

plt.rcParams['figure.constrained_layout.use'] = False
sns.set(font_scale=1)

@mjhajharia
Member Author

@OriolAbril took care of all the comments, except the one with nbqa-black; tried twice, apparently it's adamant

@OriolAbril
Member

except the one with nbqa-black, tried twice, apparently it's adamant

wow, the reformatting around the comments is terrible. Turn formatting off for this cell then; the style guide shows how: https://github.com/pymc-devs/pymc3/wiki/PyMC3-Jupyter-Notebook-Style-Guide

@mjhajharia
Member Author

done

@OriolAbril
Member

It looks like the notebook is not all executed now, maybe you saved and pushed before it finished?

@review-notebook-app

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-04-02T18:34:35Z
----------------------------------------------------------------

Sorry I didn't notice before: sns.set(font_scale=1) is somehow messing up the theme; the color palette used now is different and no longer colorblind friendly. I guess it gets switched to the default seaborn palette. It should only be necessary to set plt.rcParams["figure.constrained_layout.use"] once; I'd do so in the same cell where we set arviz-darkgrid as the style


@mjhajharia
Member Author

is somehow messing up the theme,

the only problem is that without changing the font, the x labels collapse into each other and are not at all legible

@OriolAbril
Member

Try g.fig.autofmt_xdate() after g.map...

@mjhajharia
Member Author

mjhajharia commented Apr 2, 2021

Try g.fig.autofmt_xdate() after g.map...

thanks, this worked! This PR took insane patience, thank you for that. The notebook finally looks good, I think.

also, in case you missed it in the GitHub notifications, I dropped an email too about random GSoC stuff

@OriolAbril
Member

thanks this worked!

Great! I see the change in the first cell with the seaborn facetgrid, but not in the second, you should edit the 2nd cell (currently labeled In [13]) and then do a "Restart kernel & run all", all cells should be run sequentially to make sure everything works as expected.

@mjhajharia
Member Author

thanks this worked!

Great! I see the change in the first cell with the seaborn facetgrid, but not in the second, you should edit the 2nd cell (currently labeled In [13]) and then do a "Restart kernel & run all", all cells should be run sequentially to make sure everything works as expected.

did

@mjhajharia
Member Author

if this is done and ready to be merged then i can(?) finally remove my local branch, right now i'm constantly shifting between two branches

@OriolAbril
Member

then do a "Restart kernel & run all", all cells should be run sequentially to make sure everything works as expected.

did

I still don't see all cells executed sequentially, there are some >30 in the middle of the notebook. I can rerun it myself too if you prefer

if this is done and ready to be merged then i can(?) finally remove my local branch, right now i'm constantly shifting between two branches

Once the PR is merged, GitHub itself will show you a button to delete the branch, but switching between branches is good; you should get familiar with that, work on a single feature/change per PR, and submit each PR from its own branch.

@mjhajharia
Member Author

Once the PR is merged, GitHub itself will show you a button to delete the branch, but switching between branches is good; you should get familiar with that, work on a single feature/change per PR, and submit each PR from its own branch.

makes sense yeah

I still don't see all cells executed sequentially, there are some >30 in the middle of the notebook.

ah yeah, I didn't save the notebook after the re-run, my bad. It should be fixed in the last commit I made.

@OriolAbril OriolAbril merged commit 779c1a1 into pymc-devs:main Apr 3, 2021
@OriolAbril
Member

Thanks for the PR!

@mjhajharia mjhajharia deleted the rugby branch April 3, 2021 17:01
twiecki pushed a commit that referenced this pull request Jan 17, 2023