Description
Reproducible Example
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({'int_array_col': [[[1,2,3]], [[4,5,6]]]})
df.to_parquet('/tmp/test', engine='pyarrow')
pandas_parquet_table = pq.read_table('/tmp/test')
pyarrow_table = pa.Table.from_pandas(df)
writer = pa.BufferOutputStream()
pq.write_table(
    pyarrow_table,
    writer,
    use_compliant_nested_type=True,
)
reader = pa.BufferReader(writer.getvalue())
parquet_table = pq.read_table(reader)
print("Pandas:", pandas_parquet_table.schema.types)
print("Non-compliant Parquet:", pyarrow_table.schema.types)
print("Compliant Parquet:", parquet_table.schema.types)
assert pandas_parquet_table.schema.types == pyarrow_table.schema.types
assert pandas_parquet_table.schema.types == parquet_table.schema.types
```python-traceback
Pandas: [ListType(list<item: list<item: int64>>)]
Non-compliant Parquet: [ListType(list<item: list<item: int64>>)]
Compliant Parquet: [ListType(list<element: list<element: int64>>)]
Traceback (most recent call last):
  File "/Users/judahrand/test_dir/pandas_parquet.py", line 25, in <module>
    assert pandas_parquet_table.schema.types == parquet_table.schema.types
AssertionError
```
Issue Description
`DataFrame.to_parquet()` currently does not write adherent Parquet logical types for nested arrays, as defined in the Parquet format specification. This can cause problems when using Parquet as an intermediate format, for example when loading data into BigQuery, which expects adherent data.
This was originally an issue in PyArrow itself, but it was fixed in ARROW-11497. I believe that the `use_compliant_nested_type` flag should be set in pandas if we are to claim that the pandas `.to_parquet()` method actually outputs Parquet.
Expected Behavior
Output compliant Parquet.