Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": list("caaab")})
df["y"] = df["y"].astype(pd.StringDtype("pyarrow"))
df["y"] = df["y"].astype("category")
df.to_parquet("test-arrow.parquet")
Issue Description
I have a dataframe with string[pyarrow]
. I can convert it to category
, but the resulting dataframe can't be written with to_parquet
:
Traceback (most recent call last):
File "/Users/jbennet/Library/Application Support/JetBrains/PyCharm2022.3/scratches/writeparquet_pyarrow.py", line 6, in <module>
df.to_parquet("test-arrow.parquet")
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/frame.py", line 2976, in to_parquet
return to_parquet(
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/io/parquet.py", line 430, in to_parquet
impl.write(
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/io/parquet.py", line 174, in write
table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
File "pyarrow/table.pxi", line 3557, in pyarrow.lib.Table.from_pandas
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 611, in dataframe_to_arrays
arrays = [convert_column(c, f)
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 611, in <listcomp>
arrays = [convert_column(c, f)
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
raise e
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert <pyarrow.StringScalar: 'a'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column y with type category')
Expected Behavior
Expected parquet file to be created.
Installed Versions
INSTALLED VERSIONS
commit : d9ba285
python : 3.8.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.3.0
Version : Darwin Kernel Version 22.3.0: Thu Jan 5 20:50:36 PST 2023; root:xnu-8792.81.2~2/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.0.dev0+112.gd9ba285ac1
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.1
hypothesis : 6.68.2
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.8
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : 1.3.6
brotli :
fastparquet : 2023.2.0
fsspec : 2023.1.0
gcsfs : 2023.1.0
matplotlib : 3.6.3
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.1.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.4
tables : 3.7.0
tabulate : 0.9.0
xarray : 2023.1.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None