Metadata for DataFrame and Column

To avoid losing information encoded in DataFrames and Columns that our API does not cover, we may want to provide extra metadata slots for both.

One may argue that such information should be covered by the API, and that an escape hatch defeats the purpose of a standard. I think it's a very pragmatic approach, though: it guarantees lossless roundtripping for information outside of this standard, and having an escape hatch helps adoption.

Example metadata for a dataframe:
 * path: for when it's backed by a file or remote
 * description: metadata describing the dataframe
 * license: CC0, MIT
 * history: log of how the data was produced

Example metadata for a column:
 * unit: string that describes the unit (`'km/s'`, `'parsec'`, `'furlong'`)
 * description: metadata describing the column
 * expression: in vaex, this is the expression in string form
 * is_index: an indicator that this column is the `index` in Pandas.
 
This could also help to round-trip Arrow extension types (https://arrow.apache.org/docs/python/extending_types.html), and I guess the same holds for Pandas.

An implementation could be a `def get_metadata(self) -> dict[str, Any]` method, where we recommend prefixing keys with implementation-specific names, like `'arrow.extension_type'`, `'vaex.unit'`, `'pandas.extension_type_name'`, etc.
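To make the idea concrete, here is a rough sketch of what this might look like. Only the `get_metadata` signature comes from the proposal above; the `Column` and `DataFrame` classes, the `keys_for` helper, and the specific key names other than `'vaex.unit'` are hypothetical and only there for illustration.

```python
from __future__ import annotations

from typing import Any


class Column:
    """Hypothetical column wrapper, not taken from any existing library."""

    def __init__(self, data: list, metadata: dict[str, Any] | None = None):
        self._data = data
        self._metadata = dict(metadata or {})

    def get_metadata(self) -> dict[str, Any]:
        # Return a copy so callers cannot mutate the column's own metadata.
        return dict(self._metadata)


class DataFrame:
    """Hypothetical dataframe wrapper holding named columns."""

    def __init__(self, columns: dict[str, Column],
                 metadata: dict[str, Any] | None = None):
        self._columns = dict(columns)
        self._metadata = dict(metadata or {})

    def get_metadata(self) -> dict[str, Any]:
        return dict(self._metadata)


def keys_for(metadata: dict[str, Any], prefix: str) -> dict[str, Any]:
    """Select the keys owned by one implementation, e.g. prefix 'vaex'."""
    return {k: v for k, v in metadata.items() if k.startswith(prefix + ".")}


# Assumed key names; only 'vaex.unit' appears in the proposal itself.
speed = Column(
    [1.0, 2.5],
    {
        "vaex.unit": "km/s",
        "vaex.expression": "sqrt(vx**2 + vy**2)",
        "pandas.is_index": False,
    },
)
df = DataFrame({"speed": speed}, {"vaex.description": "toy example"})

vaex_keys = keys_for(speed.get_metadata(), "vaex")
```

A consumer that does not understand a given prefix can simply copy those keys along unchanged, which is what makes the roundtrip lossless; a consumer that does understand it (vaex reading `'vaex.*'` keys here) can pick out its own keys and ignore the rest.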

Commonly used keys could later be upgraded to non-prefixed keys that we formalize and document as part of the API.

FYI: metadata is a first-class citizen in the Clojure language https://clojure.org/reference/metadata

