Currently, Categorical serves two main purposes:
- A type for expressing data from a fixed set of categories
- A memory efficient storage format for low-cardinality objects
This proposal is to add a new extension type (let's call it DictEncodedArray
for now) for the second use case. The storage format would be the same as
Categorical: an Index of the unique "keys" (categories) and an array of codes.
Much of the implementation would be shared, but the two types would have
different semantics for some operations:
- concat (union by default)
- groupby (unobserved categories would be dropped by default)
- value_counts (unobserved categories would be dropped by default)
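For context, here is a rough sketch of how today's Categorical behaves on these three operations, which is what the proposed defaults would differ from. It only uses existing pandas APIs; nothing here is the proposed DictEncodedArray, which does not exist yet.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(["x", "y"], dtype="category")
b = pd.Series(["y", "z"], dtype="category")

# concat: with differing categories the result is no longer categorical
combined = pd.concat([a, b], ignore_index=True)
still_categorical = isinstance(combined.dtype, pd.CategoricalDtype)

# Getting the union today requires an explicit helper
unioned = union_categoricals([a, b])
union_categories = list(unioned.categories)

# value_counts: the unobserved category "z" is reported with a count of 0
cat = pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"])
counts = pd.Series(cat).value_counts()

# groupby: unobserved categories are kept unless observed=True is passed
df = pd.DataFrame({"key": cat, "val": [1, 2, 3]})
all_groups = df.groupby("key", observed=False)["val"].sum()
observed_groups = df.groupby("key", observed=True)["val"].sum()

print(still_categorical, union_categories)
print(counts)
print(len(all_groups), len(observed_groups))
```

Under the proposal, the union and "drop unobserved" results would presumably be the defaults rather than opt-in.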
This is most useful for strings, but could also help when storing a large
array of 64-bit values with few distinct elements (store each distinct 64-bit
value once, then use an int16 or int32 array for the codes).
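A minimal sketch of the memory argument, using the existing Categorical as a stand-in for the proposed type since the storage layout would be the same. The array size and the number of distinct values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
uniques = rng.random(1_000)                      # 1,000 distinct float64 values
values = rng.choice(uniques, size=1_000_000)     # low-cardinality float64 array

cat = pd.Categorical(values)

# Dense storage: 8 bytes per element
dense_bytes = values.nbytes

# Dictionary-encoded storage: each distinct value once, plus small integer codes
encoded_bytes = cat.categories.to_numpy().nbytes + cat.codes.nbytes

# The original values are recoverable by indexing the categories with the codes
roundtrip = cat.categories.to_numpy()[cat.codes]

print(cat.codes.dtype, dense_bytes, encoded_bytes)
```

The saving comes almost entirely from the narrower codes array, so it shrinks as the number of distinct values grows and the codes need a wider integer type.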