ENH: DataFrame.struct.explode(column, *, separator=".") method to pull struct subfields into the parent DataFrame #59585

Open
@tswast

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, I can use Series.struct.explode() to create a DataFrame out of the subfields of an ArrowDtype(pa.struct(...)) column, but joining these back to the original DataFrame is a little awkward. It'd be nice to have a top-level explode(), similar to how DataFrame.explode() works on lists.

Feature Description

Add a new StructFrameAccessor to pandas/core/arrays/arrow/accessors.py. I think the implementation could be almost identical to what I did here: googleapis/python-bigquery-dataframes#916

class StructFrameAccessor:
    """
    Accessor object for structured data properties of the DataFrame values.
    """

    def __init__(self, data: DataFrame) -> None:
        self._parent = data


    def explode(self, column, *, separator: str = "."):
        """
        Extract all child fields of struct column(s) and add to the DataFrame.

        **Examples:**

            >>> countries = pd.Series(["cn", "es", "us"])
            >>> files = pd.Series(
            ...     [
            ...         {"version": 1, "project": "pandas"},
            ...         {"version": 2, "project": "pandas"},
            ...         {"version": 1, "project": "numpy"},
            ...     ],
            ...     dtype=pd.ArrowDtype(pa.struct(
            ...         [("version", pa.int64()), ("project", pa.string())]
            ...     ))
            ... )
            >>> downloads = pd.Series([100, 200, 300])
            >>> df = pd.DataFrame({"country": countries, "file": files, "download_count": downloads})
            >>> df.struct.explode("file")
              country  file.version file.project  download_count
            0      cn             1       pandas             100
            1      es             2       pandas             200
            2      us             1        numpy             300

        Args:
            column:
                Column(s) to explode. For multiple columns, specify a non-empty
                list in which each element is a str or tuple. Every specified
                column must have a struct dtype, such as
                ``ArrowDtype(pa.struct(...))``.
            separator:
                Separator/delimiter to use to separate the original column name
                from the sub-field column name.


        Returns:
            DataFrame:
                Original DataFrame with exploded struct column(s).
        """
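        df = self._parent
        column_labels = check_column(column)

        for label in column_labels:
            # Remember where the struct column sits so that its subfields
            # can take its place.
            position = df.columns.to_list().index(label)
            # drop() returns a new DataFrame, so the parent is never mutated.
            df = df.drop(columns=label)
            subfields = self._parent[label].struct.explode()
            # Insert in reverse so the subfields keep their original order.
            for subfield in reversed(subfields.columns):
                df.insert(
                    position, f"{label}{separator}{subfield}", subfields[subfield]
                )

        return df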


from collections.abc import Hashable, Sequence
from typing import Union

from pandas.api.types import is_list_like


def check_column(
    column: Union[Hashable, Sequence[Hashable]],
) -> Sequence[Hashable]:
    # A tuple is treated as a single (MultiIndex) label, not a list of labels.
    if isinstance(column, tuple) or not is_list_like(column):
        column_labels: Sequence[Hashable] = (column,)
    else:
        column_labels = tuple(column)

    if not column_labels:
        raise ValueError("column must be nonempty")
    if len(column_labels) > len(set(column_labels)):
        raise ValueError("column must be unique")

    return column_labels

Alternative Solutions

An alternative would be to modify DataFrame.explode to support exploding a struct into columns, potentially with an axis parameter that selects exploding into columns instead of rows.

Additional Context

See also the Series.struct accessor added last year: #54977

Metadata

Assignees

No one assigned

    Labels

    Arrow (pyarrow functionality), Enhancement, Needs Discussion (Requires discussion from core team before further action), Needs Info (Clarification about behavior needed to assess issue)

