API: Add Dictionary-encoded Extension Type #20899

Open
@TomAugspurger

Currently, Categorical serves two main purposes:

  1. A type for expressing data from a fixed set of categories
  2. A memory efficient storage format for low-cardinality objects

This proposal is to add a new extension type (let's call it DictEncodedArray
for now) for the second use case. The storage format would be the same as
Categorical: an Index of the unique "keys" (categories) and an array of codes.
Much of the implementation would be shared, but the two types would have
different semantics for some operations:

  • concat (union by default)
  • groupby (unobserved categories would be dropped by default)
  • value_counts (unobserved categories would be dropped by default)
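
For reference, here is a minimal sketch of how the existing Categorical handles these three operations today, which is the behavior the proposed type would change. It uses only current pandas API. DictEncodedArray does not exist, so nothing below shows the proposed type, and the variable names are purely illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# "z" is a declared category that never appears in the data (unobserved).
a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["y", "w"])

# concat: categoricals with different categories do not stay categorical.
combined = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)

# Taking the union of the categories currently requires an explicit helper.
unioned = union_categoricals([a, b])

# value_counts: the unobserved category "z" is reported with a count of 0.
counts = pd.Series(a).value_counts()

# groupby: observed=False keeps the unobserved group, observed=True drops it.
df = pd.DataFrame({"key": a, "val": [1, 2]})
sizes_all = df.groupby("key", observed=False).size()
sizes_observed = df.groupby("key", observed=True).size()
```

Under the proposal, the union and "drop unobserved" behaviors would be the defaults for the new type rather than opt-in.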

This is most useful for strings, but could even be useful for storing a large
array of 64-bit values with few distinct items (store each distinct 64-bit
value once, then use an int16 or int32 array for the codes).
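
A rough sketch of the memory argument for 64-bit values, using plain NumPy rather than any pandas internals. The array size, the number of distinct values, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed workload: 100,000 float64 values drawn from only 100 distinct values.
pool = rng.normal(size=100)
values = rng.choice(pool, size=100_000)

# Dictionary-encode: store each distinct value once, plus one code per element.
keys, codes = np.unique(values, return_inverse=True)
assert len(keys) <= np.iinfo(np.int16).max  # codes must fit in int16
codes = codes.astype(np.int16)

# Decoding is a single take from the keys.
decoded = keys[codes]

dense_bytes = values.nbytes                  # 8 bytes per element
encoded_bytes = keys.nbytes + codes.nbytes   # 8 bytes per key + 2 per element
```

With these assumed sizes the encoded form needs roughly a quarter of the dense storage, since each element costs 2 bytes instead of 8.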
