fixed size list type is not retained when writing to parquet #957

Open
@matko

Description

When I use datafusion to write an Arrow table with a fixed-size list column to a Parquet file and then read that file back, the column is no longer a fixed-size list. It comes back as a variable-size list instead.

Example:

import datafusion as df
import pyarrow as pa

FILENAME = "/tmp/fixed_array_example.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
df_table = ctx.from_arrow(table)
print("original schema:")
print(df_table.schema())

df_table.write_parquet(FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())

Output:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
roundtrip schema:
array: list<item: float>
  child 0, item: float

As the output shows, the datafusion dataframe being written has the correct schema, but the file that is read back does not.

If I use pyarrow instead of datafusion to write the Parquet file, I do get the expected fixed-size list type when I read it back with datafusion.

import datafusion as df
import pyarrow as pa
import pyarrow.parquet as pq

FILENAME = "/tmp/fixed_array_example_pyarrow.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})

print("original schema:")
print(table.schema)

pq.write_table(table, FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())

Output:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
roundtrip schema:
array: fixed_size_list<element: float>[2]
  child 0, element: float

Metadata

Labels: bug