BUG: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays #43689

Open
@judahrand

Description

Reproducible Example

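```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


df = pd.DataFrame({'int_array_col': [[[1,2,3]], [[4,5,6]]]})

# Path 1: let pandas write the file with the pyarrow engine, then read it back.
df.to_parquet('/tmp/test', engine='pyarrow')
pandas_parquet_table = pq.read_table('/tmp/test')

# Path 2: write the same data with pyarrow directly, requesting
# spec-compliant nested types, then read it back from memory.
pyarrow_table = pa.Table.from_pandas(df)
writer = pa.BufferOutputStream()
pq.write_table(
    pyarrow_table,
    writer,
    use_compliant_nested_type=True
)
reader = pa.BufferReader(writer.getvalue())
parquet_table = pq.read_table(reader)

print("Pandas:", pandas_parquet_table.schema.types)
print("Non-compliant Parquet:", pyarrow_table.schema.types)
print("Compliant Parquet:", parquet_table.schema.types)
# A pyarrow Table exposes column types through its schema, not a `.types` attribute.
assert pandas_parquet_table.schema.types == pyarrow_table.schema.types
# Fails: pandas writes `item` where the compliant writer produces `element`.
assert pandas_parquet_table.schema.types == parquet_table.schema.types
```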


```python-traceback
Pandas: [ListType(list<item: list<item: int64>>)]
Non-compliant Parquet: [ListType(list<item: list<item: int64>>)]
Compliant Parquet: [ListType(list<element: list<element: int64>>)]
Traceback (most recent call last):
  File "/Users/judahrand/test_dir/pandas_parquet.py", line 25, in <module>
    assert pandas_parquet_table.schema.types == parquet_table.schema.types
AssertionError
```

Issue Description

`DataFrame.to_parquet()` currently does not write the Parquet logical types for nested arrays as defined in the Parquet format specification: list elements are written under the name `item` rather than `element`. This can cause problems when using Parquet as an intermediate format, for example when loading data into BigQuery, which expects compliant data.

This was originally an issue in PyArrow itself, but it was fixed in ARROW-11497, which added the `use_compliant_nested_type` flag. I believe this flag should be set by pandas if we are to claim that the pandas `.to_parquet()` method actually outputs Parquet.

Expected Behavior

Output compliant Parquet.
