Expand Parquet pandas schema metadata to store RangeIndex without serialization  #25672

Closed
@wesm

In https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst, there is no affordance for storing RangeIndex without serializing it to a column of integers. This wastes both memory and time.

I'll propose an evolution of the metadata that permits "non-serialized" indexes like RangeIndex to be stored without a conversion step of some kind.

This will have to mind forward compatibility (so we can read old files), but not backward compatibility (i.e. allowing new files to be read by old readers; see below). I would suggest changing index_columns to include dictionaries like

{
    'kind': 'range',
    'start': 0,
    'stop': 10,
    'step': 1
}

versus

{
    'kind': 'serialized',
    'field_name': '__index_level_0__'
}

So if a string is encountered in this field (instead of a dict), we know it is "old" metadata. This will break old readers, but I think that is OK.
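To make the string-versus-dict dispatch concrete, here is a minimal sketch of how a reader might normalize one index_columns entry. The function names `normalize_index_entry` and `rebuild_range` are hypothetical, not part of pandas or pyarrow, and the built-in `range` stands in for RangeIndex so the snippet has no dependencies.

```python
def normalize_index_entry(entry):
    """Turn one index_columns entry into a dict descriptor (hypothetical helper)."""
    if isinstance(entry, str):
        # Old-style metadata: a bare string names the serialized index column.
        return {"kind": "serialized", "field_name": entry}
    if isinstance(entry, dict):
        kind = entry.get("kind")
        if kind == "range":
            # New-style metadata: the index is described, not stored as a column.
            return {
                "kind": "range",
                "start": entry["start"],
                "stop": entry["stop"],
                "step": entry["step"],
            }
        if kind == "serialized":
            return {"kind": "serialized", "field_name": entry["field_name"]}
        raise ValueError(f"unknown index kind: {kind!r}")
    raise TypeError(f"unexpected index_columns entry: {entry!r}")


def rebuild_range(descriptor):
    """Rebuild the index values from a 'range' descriptor (range stands in for RangeIndex)."""
    return range(descriptor["start"], descriptor["stop"], descriptor["step"])


old_style = normalize_index_entry("__index_level_0__")
new_style = normalize_index_entry({"kind": "range", "start": 0, "stop": 10, "step": 1})
index_values = rebuild_range(new_style)
```

An old reader that expects every entry to be a string would fail on the dict form, which is the breakage accepted above.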

Cross ref with https://issues.apache.org/jira/browse/ARROW-1639

cc @cpcloud @martindurant @kszucs @xhochy

Labels: IO Data, IO Parquet
