Description
Code Sample, a copy-pastable example if possible
import pandas as pd # v0.22.0
import scipy.sparse # v1.0.1
rpd = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
rpd.to_parquet('rpd.pq')
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<ipython-input-65-1aeaae9e36a0> in <module>()
      4                          columns=list(map(str, range(1000))),
      5                          default_fill_value=0.0)
----> 6 rpd.to_parquet('rpd.pq')
...
ArrowIOError: Column 8 had 4 while previous column had 8
Problem description
SparseDataFrames and parquet should be a match made in data science heaven: parquet should be able to compress the mostly-zero sparse columns well, giving big space and IO savings. But the to_parquet method fails with the ArrowIOError shown above when it is given a sparse dataframe.
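To make the expected savings concrete, here is a rough sketch of why a mostly-zero column should compress well. It uses only the Python standard library, with zlib standing in for a parquet compression codec. That substitution, the helper name compression_ratio, and the 1% density (the default of scipy.sparse.random) are assumptions for illustration, not a measurement of what pyarrow would actually write.

```python
import array
import random
import zlib


def compression_ratio(values):
    """Return raw size / compressed size for a column of float64 values.

    zlib is only a stand-in here (assumption); parquet writers use their
    own encodings and codecs.
    """
    raw = array.array("d", values).tobytes()  # dense float64 buffer
    return len(raw) / len(zlib.compress(raw))


rng = random.Random(0)  # fixed seed so runs are repeatable
n = 100_000

# About 1% non-zero entries, the rest equal to the fill value 0.0.
sparse_col = [rng.random() if rng.random() < 0.01 else 0.0 for _ in range(n)]
# A fully populated column of random floats for comparison.
dense_col = [rng.random() for _ in range(n)]

sparse_ratio = compression_ratio(sparse_col)
dense_ratio = compression_ratio(dense_col)

print(f"sparse column compresses {sparse_ratio:.1f}x")
print(f"dense column compresses  {dense_ratio:.1f}x")
```

The sparse column should shrink far more than the dense one, which is the kind of saving I was hoping to get from writing a SparseDataFrame to parquet.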
Output of pd.show_versions()
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.7.1
xarray: None
IPython: 6.3.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None