ENH: Specification document for embedding pandas-specific metadata in binary file formats #16010

Closed
@wesm

From the discussion in dask/dask#2222 and elsewhere, there is a need to standardize how custom metadata related to serializing pandas DataFrame objects is stored in binary file formats like Apache Parquet, Arrow, and Feather.

Because there are multiple implementations of the various file formats, specifying the "official" pandas metadata representation wherever possible (e.g. if one or more columns represent the index, then these are clearly marked) will help avoid implementation divergence and incompatibilities.

Things that we should specify:

  • How to determine the Index when reading the object
  • The original data types, where there is possible ambiguity (e.g. multiple types stored in the same physical storage type)
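As a rough illustration of the two items above, here is a minimal sketch of what such metadata could look like if it were stored as a JSON blob alongside the file's schema. The key names (`index_columns`, `columns`, `pandas_type`, `storage_type`) and the helper `build_pandas_metadata` are hypothetical, chosen only for this example; nothing here is an agreed-upon format.

```python
import json


def build_pandas_metadata(index_columns, columns):
    """Build a hypothetical metadata dict (illustrative key names only).

    index_columns: names of the stored columns that make up the Index.
    columns: (name, pandas_type, storage_type) tuples, one per stored column.
    """
    return {
        "index_columns": list(index_columns),
        "columns": [
            {"name": name, "pandas_type": pandas_type, "storage_type": storage_type}
            for name, pandas_type, storage_type in columns
        ],
    }


# Writer side: describe a frame whose index is stored as the column "ts".
metadata = build_pandas_metadata(
    index_columns=["ts"],
    columns=[
        ("ts", "datetime64[ns]", "int64"),  # timestamps stored as integers
        ("label", "category", "int32"),     # categorical stored as integer codes
        ("value", "float64", "double"),
    ],
)
encoded = json.dumps(metadata).encode("utf-8")  # bytes to embed in the file

# Reader side: recover the index columns and the original types.
decoded = json.loads(encoded.decode("utf-8"))
index_names = decoded["index_columns"]
original_types = {col["name"]: col["pandas_type"] for col in decoded["columns"]}
```

A reader that understands these keys can tell that `ts` should be restored as the index, and that `label` was a categorical column even though it is physically stored as integers.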

I don't think this needs to be more complicated than a Markdown document in the pandas repo, so we can maintain a single point of truth and reach consensus as a community.
