Skip to content

differences in Series.map with defaultdict with different dtypes #49011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Oct 14, 2022

Conversation

dannyi96
Copy link
Contributor

@dannyi96 dannyi96 commented Oct 8, 2022

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this fix the defaultdict mutation issue? #47527

@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Series Series data structure labels Oct 10, 2022
@dannyi96
Copy link
Contributor Author

dannyi96 commented Oct 10, 2022

Does this fix the defaultdict mutation issue? #47527

Just tested this ( & followed the older threads on this issue ).
Yes, these changes does fix the defaultdict mutation issue.

Eg -
Before changes -

s = pd.Series([1, 2, np.nan])
default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", np.nan: "c"})

s.map(default_map) 
# 0              a
# 1              b
# 2    missing
# dtype: object

# defaultdict is mutated
default_map 
# defaultdict(<function <lambda> at 0x7f3e792833a0>, {1: 'a', 2: 'b', nan: 'c', nan: 'missing'})

After changes -

s = pd.Series([1, 2, np.nan])
default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", np.nan: "c"})

s.map(default_map) 
# 0    a
# 1    b
# 2    c
# dtype: object

# defaultdict is not mutated
default_map 
# defaultdict(<function <lambda> at 0x7fe002b773a0>, {1: 'a', 2: 'b', nan: 'c'})

@@ -613,7 +630,7 @@ def test_map_defaultdict_ignore_na():
mapping = defaultdict(int, {1: 10, np.nan: 42})
ser = Series([1, np.nan, 2])
result = ser.map(mapping)
expected = Series([10, 0, 0])
expected = Series([10, 42, 0])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this older testcase seemed incorrect & needed to be corrected with this change.
Am however unsure if the older behavior is as per intention.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the discussion we had in #47585 the old behavior is correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh I see.
Do we keep it as it is now ? or is this enhancement fine - since there are some discrepancies observed (based on dtype) in map behavior as highlighted in bug description #48813
TIA

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @rhshadrach can you weigh in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on changing the behavior in this test. The 0 in question here is because np.nan does not equal itself, and NumPy often returns views so that ids are not equal either; e.g.

mapping = defaultdict(int, {1: 10, np.nan: 42})
arr = np.array([1, np.nan, 2])
print(mapping[arr[1]])

# 0

I think introducing a better lookup for NaN values makes sense, and brings this in line with the Series case:

mapping = pd.Series({1: 10, np.nan: 42})
ser = Series([1, np.nan, 2])
print(ser.map(mapping))

# 0    10.0
# 1    42.0
# 2     NaN
# dtype: float64

@mroeschke
Copy link
Member

Yes, these changes does fix the defaultdict mutation issue.

Could you add a test to check that the default dict isn't mutated?

@mroeschke
Copy link
Member

cc @phofl since you made a fix here recently

@dannyi96
Copy link
Contributor Author

Yes, these changes does fix the defaultdict mutation issue.

Could you add a test to check that the default dict isn't mutated?

Updated the PR with a testcase to check defaultdict mutation.

@mroeschke mroeschke added this to the 2.0 milestone Oct 14, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @phofl or @rhshadrach merge when ready

@phofl phofl merged commit 42a43e3 into pandas-dev:main Oct 14, 2022
@phofl
Copy link
Member

phofl commented Oct 14, 2022

Thx @dannyi96

@dannyi96 dannyi96 deleted the daniel_bug_fix branch October 14, 2022 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Series Series data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: differences in Series.map with defaultdict with different dtypes
4 participants