Skip to content

ENH: propagating attrs always uses deepcopy #55314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 4, 2023

Conversation

timhoffm
Copy link
Contributor

@timhoffm timhoffm commented Sep 28, 2023

Always using a deepcopy prevents shared state and thus unintentional modification of the attrs of other objects. IMHO this safety has a higher priority than the slight performance cost of the deepcopy. The implementation now skips the copying if attrs are not used (i.e. an empty dict). This check takes only ~20ns. Thus, the attrs copy mechanism has no performance impact if attrs are not used.

@timhoffm
Copy link
Contributor Author

timhoffm commented Sep 28, 2023

Importing stdlib copy at the top of generic.py issues the a number of pylint warnings of the type:

pandas/core/generic.py:289:8: W0621: Redefining name 'copy' from outer scope (line 5) (redefined-outer-name)

What is the preferred way in the project to handle this?

@mroeschke
Copy link
Member

What is the preferred way in the project to handle this?

Probably using from copy import deepcopy should be fine

@mroeschke mroeschke added the metadata _metadata, .attrs label Sep 28, 2023
@mroeschke
Copy link
Member

Just a pre-commit check failing otherwise looks good

Always using a deepcopy prevents shared state and thus unintentional
modification of the attrs of other objects. IMHO this safety has a
higher priority than the slight performance cost of the deepcopy.
The implementation now skips the copying if *attrs* are not used (i.e.
an empty dict). This check takes only ~20ns. Thus, the attrs propagation
mechanism has no performance impact if attrs are not used.

Closes pandas-dev#54134.
@mroeschke mroeschke added this to the 2.2 milestone Oct 4, 2023
@mroeschke mroeschke merged commit f3f0795 into pandas-dev:main Oct 4, 2023
@mroeschke
Copy link
Member

Thanks @timhoffm

@timhoffm timhoffm deleted the attrs-deepcopy branch October 4, 2023 17:55
@rhshadrach
Copy link
Member

This patch may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay.

Please check the links below. If any ASVs are parameterized, the combinations of parameters that a regression has been detected for appear as subbullets.

Subsequent benchmarks may have skipped some commits. The link below lists the commits that are between the two benchmark runs where the regression was identified.

6a83910...364c9cb

@timhoffm
Copy link
Contributor Author

timhoffm commented Oct 20, 2023

I believe the detected changes come from this PR.

The test uses obj.attrs, which this PR changes to deepcopy in finalize. finalize performance should not decrease if attrs are not used (in fact, it might be marginally faster), but I don't see whether you have an ASV test for that as well.

Note also that the tests use attrs with 1000 entries, which makes the deepcopy cost quite prominent. I expect that typical attrs is a more like a bit of metadata with << 1000 entries.

The use of deepcopy is expected to be a bit slower. I'd argue that it's a necessary safety measure to prevent multiple objects to share the same attrs value objects. It might otherwise be quite surprising that a user changes attrs down the road of operations, and by that change the attrs of an input object. Such a coupling is even more surprising nowadays that dataframe values themselves can be considered independent because of copy-on-write.
IMHO, if the user does not want to afford the cost of the safety net, they should not use attrs but mange that state themselves outside of pandas. - Or if there is really interest make shallow copy available as an opt-in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metadata _metadata, .attrs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: copy.deepcopy() doesn't deepcopy the metadata in .attrs
4 participants