Closed
From the discussion in dask/dask#2222 and elsewhere, there is a need to standardize how custom metadata is stored when serializing pandas DataFrame objects to binary file formats like Apache Parquet, Arrow, and Feather.
Because there are multiple implementations of the various file formats, specifying an "official" pandas metadata representation wherever the format allows it (e.g. if one or more columns represent the index, then these are clearly marked) will help avoid implementation divergence and incompatibilities.
Things that we should specify:
- How to determine the Index when reading the object
- The original data types, where there is possible ambiguity (e.g. multiple types stored in the same physical storage type)
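To make the two items above concrete, here is a rough sketch of what such a metadata record could look like. Everything in it is an assumption for illustration only: the key names (`index_columns`, `columns`, `pandas_type`, `storage_type`), the helper functions, the example type pairings, and the choice of JSON as the encoding are all hypothetical and not a proposed or existing spec. In practice the inputs would come from something like `df.index.names` and `df.dtypes`.

```python
import json


def build_metadata(index_columns, column_types):
    """Describe a serialized table so a reader can rebuild the DataFrame.

    index_columns: names of the stored columns that made up the Index, in order.
    column_types: mapping of column name -> (original pandas dtype,
                  physical storage type). All names here are hypothetical.
    """
    return {
        "index_columns": list(index_columns),
        "columns": [
            {"name": name, "pandas_type": pandas_type, "storage_type": storage_type}
            for name, (pandas_type, storage_type) in column_types.items()
        ],
    }


def ambiguous_storage_types(metadata):
    """Return storage types that back more than one distinct pandas type."""
    seen = {}
    for col in metadata["columns"]:
        seen.setdefault(col["storage_type"], set()).add(col["pandas_type"])
    return sorted(s for s, types in seen.items() if len(types) > 1)


# Illustrative input: a timestamp index and a plain integer column that
# are assumed to share the same physical storage type.
metadata = build_metadata(
    index_columns=["ts"],
    column_types={
        "ts": ("datetime64[ns]", "int64"),
        "count": ("int64", "int64"),
        "label": ("category", "int32"),
    },
)

# Round-trip through JSON, as the record might be stored in a file's
# key/value metadata (an assumption, not something any format mandates).
encoded = json.dumps(metadata)
decoded = json.loads(encoded)
```

A reader would use `decoded["index_columns"]` to decide which stored columns to turn back into the Index, and the per-column `pandas_type` to resolve cases that `ambiguous_storage_types` flags, where the storage type alone cannot tell the original dtypes apart.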
I don't think this needs to be more complicated than a Markdown document in the pandas repo, so we can maintain a single point of truth and reach consensus as a community.