Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
Currently, I can use Series.struct.explode()
to create a DataFrame
out of the subfields of a ArrowDtype(pa.struct(...))
column. Joining these back to the original DataFrame
is a little awkward. It'd be nice to have a top-level explode()
similar to how DataFrame.explode()
works on lists.
Feature Description
Add a new StructFrameAccessor to pandas/core/arrays/arrow/accessors.py
. I think implementation could be almost identical to what I did here: googleapis/python-bigquery-dataframes#916
class StructFrameAccessor:
"""
Accessor object for structured data properties of the DataFrame values.
"""
def __init__(self, data: DataFrame) -> None:
self._parent = data
def explode(self, column, *, separator: str = "."):
"""
Extract all child fields of struct column(s) and add to the DataFrame.
**Examples:**
>>> countries = pd.Series(["cn", "es", "us"])
>>> files = pd.Series(
... [
... {"version": 1, "project": "pandas"},
... {"version": 2, "project": "pandas"},
... {"version": 1, "project": "numpy"},
... ],
... dtype=pd.ArrowDtype(pa.struct(
... [("version", pa.int64()), ("project", pa.string())]
... ))
... )
>>> downloads = pd.Series([100, 200, 300])
>>> df = pd.DataFrame({"country": countries, "file": files, "download_count": downloads})
>>> df.struct.explode("file")
country file.version file.project download_count
0 cn 1 pandas 100
1 es 2 pandas 200
2 us 1 numpy 300
[3 rows x 4 columns]
Args:
column:
Column(s) to explode. For multiple columns, specify a non-empty
list with each element be str or tuple, and all specified
columns their list-like data on same row of the frame must
have matching length.
separator:
Separator/delimiter to use to separate the original column name
from the sub-field column name.
Returns:
DataFrame:
Original DataFrame with exploded struct column(s).
"""
df = self._parent
column_labels = check_column(column)
for label in column_labels:
position = df.columns.to_list().index(label)
df = df.drop(columns=label)
subfields = self._parent[label].struct.explode()
for subfield in reversed(subfields.columns):
df.insert(
position, f"{label}{separator}{subfield}", subfields[subfield]
)
return df
def check_column(
column: Union[blocks.Label, Sequence[blocks.Label]],
) -> Sequence[blocks.Label]:
if not is_list_like(column):
column_labels = cast(Sequence[blocks.Label], (column,))
else:
column_labels = cast(Sequence[blocks.Label], tuple(column))
if not column_labels:
raise ValueError("column must be nonempty")
if len(column_labels) > len(set(column_labels)):
raise ValueError("column must be unique")
return column_labels
Alternative Solutions
An alternative could be to modify DataFrame.explode
to support exploding a struct into columns. Potentially with an axis
parameter to explode into columns instead of rows.
Additional Context
See also, the Series.struct
accessor added last year: #54977