Description
When I use datafusion to write an Arrow table with a fixed-size list column to a Parquet file and then read the file back, the column is no longer a fixed-size list but a variable-size list.
Example:
import datafusion as df
import pyarrow as pa
FILENAME = "/tmp/fixed_array_example.parquet"
ctx = df.SessionContext()
array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
df_table = ctx.from_arrow(table)
print("original schema:")
print(df_table.schema())
df_table.write_parquet(FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())
Output:
original schema:
array: fixed_size_list<item: float>[2]
child 0, item: float
roundtrip schema:
array: list<item: float>
child 0, item: float
As the output shows, the datafusion DataFrame that is written out has the expected fixed_size_list schema, but the file that is read back has a plain list column instead.
If I write the Parquet file with pyarrow instead of datafusion, I do get the expected fixed-size list type when I read it back with datafusion:
import datafusion as df
import pyarrow as pa
import pyarrow.parquet as pq
FILENAME = "/tmp/fixed_array_example_pyarrow.parquet"
ctx = df.SessionContext()
array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
print("original schema:")
print(table.schema)
pq.write_table(table, FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())
Output:
original schema:
array: fixed_size_list<item: float>[2]
child 0, item: float
roundtrip schema:
array: fixed_size_list<element: float>[2]
child 0, element: float