BUG: Pandas map function does not work properly with StringDtype #40823

Closed
@jaspersival

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd


df1 = pd.DataFrame(
    {
        "col1": [pd.NA, "foo", "bar"]
    }, index=["id1", "id2", "id3"], dtype=pd.StringDtype()
)

df2 = pd.DataFrame(
    {
        "id": ["id4", "id2", "id1"]
    }, dtype=pd.StringDtype()
)

df2["col1"] = df2["id"].map(df1["col1"])

print(df2.head())

Problem description

Traceback:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2020.3.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2020.3.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/XXX/AppData/Roaming/JetBrains/PyCharm2020.3/scratches/scratch_22.py", line 16, in <module>
    df2["col1"] = df2["id"].map(df1["col1"])
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\series.py", line 3909, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\base.py", line 908, in _map_values
    new_values = algorithms.take_1d(mapper._values, indexer)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\algorithms.py", line 1699, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\_mixins.py", line 78, in take
    return self._from_backing_data(new_data)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\numpy_.py", line 190, in _from_backing_data
    return type(self)(arr)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\string_.py", line 195, in __init__
    self._validate()
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\string_.py", line 200, in _validate
    raise ValueError("StringArray requires a sequence of strings or pandas.NA")
ValueError: StringArray requires a sequence of strings or pandas.NA

This issue can be circumvented if I do:

df2["col1"] = df2["id"].map(df1["col1"].astype("object"))
print(df2.head())

Gives:
    id  col1
0  id4   NaN
1  id2   foo
2  id1  <NA>

The resulting column mixes np.nan (for the key id4, which is missing from df1's index) and pd.NA (the value stored for id1).

If I then convert the result back to pd.StringDtype:

df2["col1"] = df2["id"].map(df1["col1"].astype("object")).astype(pd.StringDtype())
print(df2.head())

I get the expected output:
    id  col1
0  id4  <NA>
1  id2   foo
2  id1  <NA>

But this is ugly as hell.
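
Until the mapping works directly, the two casts can at least be kept in one place. The sketch below wraps the workaround from this report in a helper; `map_to_string` is a hypothetical name (not a pandas function), and it assumes the lookup Series has a unique index.

```python
import pandas as pd


def map_to_string(keys: pd.Series, lookup: pd.Series) -> pd.Series:
    """Map `keys` through `lookup` and return a StringDtype Series.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Cast the lookup to object so the intermediate result may hold np.nan
    # for keys that are absent from the lookup's index.
    mapped = keys.map(lookup.astype("object"))
    # Cast back to StringDtype, which converts np.nan to pd.NA.
    return mapped.astype(pd.StringDtype())


lookup = pd.Series(
    [pd.NA, "foo", "bar"], index=["id1", "id2", "id3"], dtype=pd.StringDtype()
)
keys = pd.Series(["id4", "id2", "id1"], dtype=pd.StringDtype())

result = map_to_string(keys, lookup)
print(result)
```

With the frames from the code sample this would be called as `df2["col1"] = map_to_string(df2["id"], df1["col1"])`.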

Expected Output

    id  col1
0  id4  <NA>
1  id2   foo
2  id1  <NA>
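
As a possible alternative that avoids the round trip through object dtype, the same lookup can be written with `Series.reindex`, which fills absent labels with the dtype's own missing value. This is a sketch under assumptions: it has not been checked against pandas 1.2.3 (the version in this report), it requires the lookup index to be unique, and the variable names are illustrative.

```python
import pandas as pd

lookup = pd.Series(
    [pd.NA, "foo", "bar"], index=["id1", "id2", "id3"], dtype=pd.StringDtype()
)
keys = pd.Series(["id4", "id2", "id1"], dtype=pd.StringDtype())

# reindex looks each key up in lookup's index; labels that are not present
# (here "id4") are filled with the missing value of the string dtype.
aligned = lookup.reindex(keys.to_numpy())

# The reindexed Series is labelled by the keys themselves, so rebuild it on
# the original index of `keys` before assigning it as a column.
col1 = pd.Series(aligned.array, index=keys.index)
print(col1)
```

Unlike `map`, this variant only supports lookups by label; it does not accept a function or dict as the mapper.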

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2c8480
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.2.3
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.1
sqlalchemy : 1.4.3
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None
