ENH: One-hot decoding #34260

Closed
@clbarnes

Description

Is your feature request related to a problem?

pd.get_dummies provides a way to turn a sequence of category-like data into a one-hot encoded data frame. However, there is no easy way (to my knowledge) of going in the other direction: given a boolean dataframe where the row sums are all 1, produce a categorical series. This task is particularly valuable for serialisation.
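To make the gap concrete, here is a minimal sketch of the round trip using only existing pandas calls. The sample data is made up for illustration, and the `idxmax` workaround is just one possible approach, not something the issue itself proposes.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
labels = pd.Series(["a", "b", "a", "c"])

# Forward direction: one column per category, one truthy cell per row.
dummies = pd.get_dummies(labels)

# One possible workaround for the reverse direction: idxmax(axis=1) returns
# the label of the first column holding the row maximum, which is the single
# truthy column when every row sums to exactly 1. The cast to int keeps the
# code independent of whether get_dummies returned bool or uint8 columns.
decoded = dummies.astype(int).idxmax(axis=1)

# Rebuild a categorical series, reusing the dummy columns as categories.
restored = pd.Series(pd.Categorical(decoded, categories=dummies.columns))
```

Note that this workaround does no validation: a row with no truthy cell, or with several, is silently assigned to the first matching column, which is part of why a dedicated decoder would be useful.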

Describe the solution you'd like

Some way of constructing a Categorical array from a one-hot encoded dataframe (view). To avoid piling extra functionality into the existing constructor, a class method could be used.

Scratch implementation:

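import numpy as np
import pandas as pd

class Categorical:
    ...

    @classmethod
    def from_dummies(cls, df: pd.DataFrame, **kwargs):
        # Treat any truthy cell as membership in that column's category.
        onehot = df.astype(bool)

        if (onehot.sum(axis=1) > 1).any():
            raise ValueError("Some rows belong to >1 category")

        # Position 0 is NaN, so an all-False row decodes to a missing value.
        index_into = pd.Series([np.nan] + list(onehot.columns))
        mult_by = np.arange(1, len(index_into))

        # At most one cell per row is True, so the weighted row sum is the
        # 1-based position of that column (0 if the row is empty).
        indexes = (onehot.astype(int) * mult_by).sum(axis=1)
        values = index_into[indexes]

        return cls(values, df.columns, **kwargs)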
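Since the scratch class above is only a stub, the same idea can be tried out as a standalone function. The sketch below is an assumption-laden variant, not the proposed API: the name `categorical_from_dummies` is hypothetical, and it swaps the weighted-sum lookup for `argmax` plus `pd.Categorical.from_codes`, using code `-1` for rows with no category.

```python
import numpy as np
import pandas as pd


def categorical_from_dummies(df: pd.DataFrame, **kwargs) -> pd.Categorical:
    """Hypothetical free-function variant of the scratch from_dummies."""
    # Treat any truthy cell as membership in that column's category.
    onehot = df.astype(bool).to_numpy()

    if (onehot.sum(axis=1) > 1).any():
        raise ValueError("Some rows belong to >1 category")

    # argmax gives the position of the single True in each row. Rows with
    # no True get code -1, which pandas interprets as a missing value.
    codes = np.where(onehot.any(axis=1), onehot.argmax(axis=1), -1)

    # Extra keyword arguments (e.g. ordered=True) are passed to from_codes.
    return pd.Categorical.from_codes(codes, categories=df.columns, **kwargs)


# Usage with made-up data: the last row belongs to no category.
df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]})
result = categorical_from_dummies(df)
```

Like the scratch version, this rejects rows that belong to more than one category and maps all-zero rows to a missing value rather than raising.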

Describe alternatives you've considered

  • A free function (less discoverable, less self-documenting)
  • Importing scikit-learn

Additional context

sklearn.preprocessing.OneHotEncoder.inverse_transform
