Skip to content

BUG: Unable to write an empty dataframe to parquet #52034

Closed
@galipremsagar

Description

@galipremsagar

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [1]: import pandas as pd

In [5]: df = pd.DataFrame(index=pd.Index(["a", "b", "c"], name="custom name"))

In [6]: df.to_parquet("abc")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 df.to_parquet("abc")

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/frame.py:2873, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2786 """
   2787 Write a DataFrame to the binary parquet format.
   2788 
   (...)
   2869 >>> content = f.read()
   2870 """
   2871 from pandas.io.parquet import to_parquet
-> 2873 return to_parquet(
   2874     self,
   2875     path,
   2876     engine,
   2877     compression=compression,
   2878     index=index,
   2879     partition_cols=partition_cols,
   2880     storage_options=storage_options,
   2881     **kwargs,
   2882 )

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/io/parquet.py:434, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    430 impl = get_engine(engine)
    432 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 434 impl.write(
    435     df,
    436     path_or_buf,
    437     compression=compression,
    438     index=index,
    439     partition_cols=partition_cols,
    440     storage_options=storage_options,
    441     **kwargs,
    442 )
    444 if path is None:
    445     assert isinstance(path_or_buf, io.BytesIO)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/io/parquet.py:176, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    166 def write(
    167     self,
    168     df: DataFrame,
   (...)
    174     **kwargs,
    175 ) -> None:
--> 176     self.validate_dataframe(df)
    178     from_pandas_kwargs: dict[str, Any] = {"schema": kwargs.pop("schema", None)}
    179     if index is not None:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/io/parquet.py:138, in BaseImpl.validate_dataframe(df)
    136 else:
    137     if df.columns.inferred_type not in {"string", "empty"}:
--> 138         raise ValueError("parquet must have string column names")
    140 # index level names must be strings
    141 valid_names = all(
    142     isinstance(name, str) for name in df.index.names if name is not None
    143 )

ValueError: parquet must have string column names

Issue Description

There seems to be an error when we try to write an empty dataframe with no columns to a parquet file. However passing columns=[] to DataFrame constructor works. Is this an intended behavior?

Expected Behavior

Idealy pd.DataFrame(index=pd.Index(["a", "b", "c"], name="custom name")) has to work. But if pd.DataFrame(index=pd.Index(["a", "b", "c"], name="custom name", columns=[])) is the right way to do it starting pandas 2.0, please let me know.

Installed Versions

In [2]: pd.show_versions()
/nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.6.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.2
hypothesis : 6.70.0
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : 1.10.1
snappy :
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions