ENH: Specification document for embedding pandas-specific metadata in binary file formats #16010

From the discussion in https://github.com/dask/dask/issues/2222 and elsewhere, there is a need to standardize how custom metadata related to serializing pandas DataFrame objects is stored in binary file formats like Apache Parquet, Arrow, and Feather.

Because there are multiple implementations of these file formats, specifying an "official" pandas metadata representation wherever possible (e.g. if one or more columns represent the index, then these are clearly marked) will help avoid implementation divergence and incompatibilities.

Things that we should specify:

* How to determine the Index when reading the object
* The original data types, where there is possible ambiguity (e.g. multiple types stored in the same physical storage type)
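To make the two points above concrete, here is a minimal sketch of what such metadata could look like as a JSON blob stored next to the flattened table. The key names (`index_columns`, `columns`, `pandas_type`) and the helpers `to_flat` and `from_flat` are hypothetical, chosen only for illustration; they are not an agreed format.

```python
import json

import pandas as pd


def to_flat(df):
    """Flatten df for storage and describe it in a JSON metadata string.

    The metadata layout is an assumption made for this sketch.
    """
    index_columns = list(df.index.names)
    if any(name is None for name in index_columns):
        raise ValueError("this sketch requires named index levels")
    flat = df.reset_index()  # index levels become ordinary columns
    metadata = {
        "index_columns": index_columns,
        "columns": [
            {"name": str(col), "pandas_type": str(flat[col].dtype)}
            for col in flat.columns
        ],
    }
    return flat, json.dumps(metadata)


def from_flat(flat, metadata_json):
    """Rebuild the original DataFrame from the flat table and its metadata."""
    metadata = json.loads(metadata_json)
    restored = flat.copy()
    for col in metadata["columns"]:
        # Recover the original dtype where the storage type was ambiguous.
        restored[col["name"]] = restored[col["name"]].astype(col["pandas_type"])
    return restored.set_index(metadata["index_columns"])


df = pd.DataFrame(
    {"color": pd.Categorical(["red", "blue", "red"]), "value": [1.0, 2.0, 3.0]},
    index=pd.Index([10, 20, 30], name="id"),
)
flat, meta = to_flat(df)
# Simulate a format that stores categoricals as plain values.
flat["color"] = flat["color"].astype(object)
restored = from_flat(flat, meta)
print(meta)
print(restored.dtypes)
```

A reader that ignores the metadata still gets a usable flat table, while a pandas-aware reader can restore the index and the categorical dtype.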

I don't think this needs to be more complicated than a Markdown document in the pandas repo, so we can maintain a single point of truth and reach consensus as a community.
