Implement DataFrame.__array_ufunc__ #23743

Closed
@jorisvandenbossche

Description

Applying a ufunc to a DataFrame with sparse columns does not retain the sparse dtype:

In [105]: df = pd.SparseDataFrame(np.array([[0, 1, 0], [1, 0, 1]]),
                                  columns=['a', 'b', 'c'], default_fill_value=0)
In [106]: df2 = pd.DataFrame(df)

In [107]: np.exp(df)['a']
Out[107]: 
0    1.000000
1    2.718282
Name: a, dtype: Sparse[float64, 0]
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([2], dtype=int32)

In [108]: np.exp(df2)['a']
Out[108]: 
0    1.000000
1    2.718282
Name: a, dtype: float64

Although SparseDataFrame returns the correct result here, I am not sure it actually works as desired: I doubt it avoids materializing the full dense data, which in principle should be avoidable.
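
One way to probe whether a ufunc densifies sparse data is to inspect what the result stores. This is a minimal sketch, assuming a pandas version in which `pd.arrays.SparseArray` implements `__array_ufunc__` (which may not match the version used in the session above); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sparse array with fill value 0: only the single non-zero entry is stored.
arr = pd.arrays.SparseArray([0, 1, 0, 0], fill_value=0)

# NumPy dispatches to the array's own __array_ufunc__ if it defines one.
result = np.exp(arr)

# If the result is still sparse and stores as many points as the input,
# that suggests the ufunc was applied to the stored values only.
still_sparse = isinstance(result, pd.arrays.SparseArray)
same_npoints = still_sparse and result.npoints == arr.npoints
print(still_sparse, same_npoints)
```

This only checks the outcome, not the intermediate steps, so it is a hint rather than proof that no dense copy was made along the way.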


edit from Tom

Implementing DataFrame.__array_ufunc__ is probably the best way to do this.

The semantics will be similar to Series.__array_ufunc__, but applied blockwise.

  1. Series and DataFrame objects in inputs will first be aligned.
  2. All arrays will be unboxed from blocks.
  3. The ufunc will be applied to each array. If the array defines __array_ufunc__, it'll be called.
  4. The results will be re-boxed in a DataFrame with the original labels.
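
The unbox / apply / re-box steps above can be sketched roughly as follows. This is not the pandas implementation: `apply_ufunc_columnwise` is a hypothetical helper that handles a single DataFrame input and a unary ufunc only, works column by column through the public API rather than on internal blocks, skips alignment entirely, and assumes unique column labels.

```python
import numpy as np
import pandas as pd


def apply_ufunc_columnwise(ufunc, df):
    """Hypothetical helper: apply a unary ufunc to each column's array."""
    results = {}
    for label, column in df.items():
        # Unbox: .array gives the column's underlying (extension) array.
        arr = column.array
        # Apply: if arr defines __array_ufunc__, NumPy dispatches to it.
        results[label] = ufunc(arr)
    # Re-box with the original row and column labels.
    return pd.DataFrame(results, index=df.index)


df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 1, 0]),
    "b": [1.0, 2.0, 3.0],
})
out = apply_ufunc_columnwise(np.exp, df)
print(out.dtypes)
```

Because each array handles the ufunc itself, a sparse column can stay sparse while a dense column stays dense. Binary ufuncs, broadcasting, and mixed Series/DataFrame inputs would need the alignment step that this sketch omits.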

There are some additional complications with dimensionality, shapes, and broadcasting. But the basic idea of applying __array_ufunc__ blockwise, so that the underlying array's __array_ufunc__ is called, makes sense.
