Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "baz", pd.NA]
    }, index=["id1", "id2", "id3", "id4"], dtype=pd.StringDtype()
)

filtered_df = pd.DataFrame(
    {
        "A": ["this", "that"]
    }, index=["id2", "id3"], dtype=pd.StringDtype()
)

filter = pd.Series([False, True, True, False])
new_df = df.mask(filter, filtered_df)
Problem description
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2020.3.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2020.3.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/XXX/AppData/Roaming/JetBrains/PyCharm2020.3/scratches/scratch_24.py", line 19, in <module>
    new_df = df.mask(filter, filtered_df)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\generic.py", line 9317, in mask
    return self.where(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\generic.py", line 9286, in where
    return self._where(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\generic.py", line 9056, in _where
    _, other = self.align(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\frame.py", line 4102, in align
    return super().align(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\generic.py", line 8813, in align
    return self._align_frame(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\generic.py", line 8880, in _align_frame
    right = other._reindex_with_indexers(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\generic.py", line 4877, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\internals\managers.py", line 1311, in reindex_indexer
    new_blocks = [
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\internals\managers.py", line 1312, in <listcomp>
    blk.take_nd(
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\internals\blocks.py", line 1863, in take_nd
    new_values = self.values.take(indexer, fill_value=fill_value, allow_fill=True)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\_mixins.py", line 78, in take
    return self._from_backing_data(new_data)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\numpy_.py", line 190, in _from_backing_data
    return type(self)(arr)
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\string_.py", line 195, in __init__
    self._validate()
  File "C:\Users\XXX\AppData\Local\pypoetry\Cache\virtualenvs\basel-w19Rd76b-py3.9\lib\site-packages\pandas\core\arrays\string_.py", line 200, in _validate
    raise ValueError("StringArray requires a sequence of strings or pandas.NA")
ValueError: StringArray requires a sequence of strings or pandas.NA
In generic.py starting on line 9053, there is this code block:
        # align with me
        if other.ndim <= self.ndim:

            _, other = self.align(
                other, join="left", axis=axis, level=level, fill_value=np.nan
            )
When filtered_df has fewer rows than the original df, this code block is executed and passes np.nan as the fill_value to the align method. That fill_value propagates down to StringArray, which accepts only strings or pandas.NA, so the ValueError shown in the stack trace above is raised. I think this issue could be solved fairly easily if fill_value were a configurable parameter of the .mask and .where methods.
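In the meantime, one possible workaround is to give the replacement frame the same index as df before calling .mask, so the internal align call has no missing rows to fill with np.nan. The sketch below is only an illustration under that assumption: the names cond and other are mine, the condition is written as a DataFrame indexed like df, and I have not checked it against every pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "baz", pd.NA]},
    index=["id1", "id2", "id3", "id4"],
    dtype=pd.StringDtype(),
)
filtered_df = pd.DataFrame(
    {"A": ["this", "that"]},
    index=["id2", "id3"],
    dtype=pd.StringDtype(),
)

# Boolean condition with the same index and columns as df (illustrative name).
cond = pd.DataFrame({"A": [False, True, True, False]}, index=df.index)

# Reindex the replacement values onto df's index up front. Rows missing from
# filtered_df are filled with the string dtype's own missing value (pd.NA),
# so .mask does not need to fill anything with np.nan during alignment.
other = filtered_df.reindex(df.index)

# Replace the rows where cond is True with the aligned replacement values.
new_df = df.mask(cond, other)
print(new_df)
```

This keeps the values where the condition is False, so it corresponds to the second of the expected outputs below.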
Expected Output
"""
----| A
id1 | <NA>
id2 | this
id3 | that
id4 | <NA>
"""
or
"""
----| A
id1 | foo
id2 | this
id3 | that
id4 | <NA>
"""
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f2c8480
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.2.3
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.1
sqlalchemy : 1.4.3
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None