-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
differences in Series.map with defaultdict with different dtypes #49011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this fix the defaultdict mutation issue? #47527
Just tested this ( & followed the older threads on this issue ). Eg -
After changes -
|
@@ -613,7 +630,7 @@ def test_map_defaultdict_ignore_na(): | |||
mapping = defaultdict(int, {1: 10, np.nan: 42}) | |||
ser = Series([1, np.nan, 2]) | |||
result = ser.map(mapping) | |||
expected = Series([10, 0, 0]) | |||
expected = Series([10, 42, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this older testcase seemed incorrect & needed to be corrected with this change.
Am however unsure if the older behavior is as per intention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the discussion we had in #47585 the old behavior is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh I see.
Do we keep it as it is now ? or is this enhancement fine - since there are some discrepancies observed (based on dtype) in map behavior as highlighted in bug description #48813
TIA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @rhshadrach can you weigh in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on changing the behavior in this test. The 0 in question here is because np.nan
does not equal itself, and NumPy often returns views so that ids are not equal either; e.g.
mapping = defaultdict(int, {1: 10, np.nan: 42})
arr = np.array([1, np.nan, 2])
print(mapping[arr[1]])
# 0
I think introducing a better lookup for NaN values makes sense, and brings this in line with the Series case:
mapping = pd.Series({1: 10, np.nan: 42})
ser = Series([1, np.nan, 2])
print(ser.map(mapping))
# 0 10.0
# 1 42.0
# 2 NaN
# dtype: float64
Could you add a test to check that the default dict isn't mutated? |
cc @phofl since you made a fix here recently |
Updated the PR with a testcase to check defaultdict mutation. |
e8f11ee
to
349d54f
Compare
490dac3
to
ee788c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @phofl or @rhshadrach merge when ready
Thx @dannyi96 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.