Tracking issue: dataframe protocol implementation #46

Open
@rgommers

The bulk of the dataframe interchange protocol was done in gh-38. A number of TODOs remain, however, and more will likely pop up once there are multiple implementations and we can actually turn one type of dataframe into another. This is the tracking issue for those TODOs and open questions:

  • Categorical dtypes: we should allow having null as a category; it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in 8 Apr meeting.
  • Categorical dtypes: should they be a dtype in themselves, or should they be a part of the dtype tuple? Currently dtype is (kind, bitwidth, format_str, endianness), with categorical being a value of the kind enum. Is making a 5th element in the dtype, with that element being another dtype 4-tuple, thereby allowing for nesting, sensible?
  • Add a metadata attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there. See PR #43 ("Add metadata attribute to DataFrame and Column").
  • Add a flag to throw an exception if the export cannot be zero-copy (e.g. for pandas this can happen because of the block manager, where rows are contiguous and columns are not; add a test for that). See PR #44 ("Add allow_copy flag to interchange protocol").
  • Add a string dtype, with variable-length strings implemented with the same scheme as Arrow uses (an offsets buffer and a data buffer; see the comment on #38, "Add a prototype of the dataframe interchange protocol"). See PR #45 ("Add variable-length string support").
  • Signature of the from_dataframe protocol? See #42 ("Signature for a standard from_dataframe constructor function") and the meeting of 20 May.
  • What can be reused between implementations in different libraries, and can/should we have a reference implementation? --> question needs answering somewhere.
  • What is the ownership for buffers, who owns the memory? This should be clearly spelled out in the docs. An owner attribute is perhaps needed. See the meeting minutes of 4 March, #39 ("How to consume a single buffer & connection to array interchange"), and comments on this PR.
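
To make the variable-length string item above concrete, here is a minimal sketch of the Arrow-style layout it refers to: all strings are concatenated into one UTF-8 data buffer, and an offsets buffer with one more entry than there are strings marks where each one starts and ends. The function names `encode_strings` and `decode_strings` are hypothetical helpers for illustration and are not part of the protocol. Plain Python objects stand in for real buffers, and null handling (which Arrow does with a separate validity bitmap) is left out.

```python
from typing import List, Tuple


def encode_strings(values: List[str]) -> Tuple[bytes, List[int]]:
    """Pack strings into one UTF-8 data buffer plus a list of byte offsets.

    Hypothetical helper: illustrates the layout only, not the protocol API.
    """
    data = bytearray()
    offsets = [0]  # offsets[i] is the byte position where string i starts
    for value in values:
        data.extend(value.encode("utf-8"))
        offsets.append(len(data))  # end of string i, which is the start of string i + 1
    return bytes(data), offsets


def decode_strings(data: bytes, offsets: List[int]) -> List[str]:
    """Recover the strings by slicing the data buffer between consecutive offsets."""
    return [
        data[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]


data, offsets = encode_strings(["a", "bcd", "", "é"])
roundtrip = decode_strings(data, offsets)
```

Note that offsets count bytes, not characters, so a multi-byte character such as "é" advances the offset by two, and an empty string shows up as two equal consecutive offsets.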
