Closed
Description
Applying a ufunc on a DataFrame with sparse columns does not retain its sparse dtype:
In [105]: df = pd.SparseDataFrame(np.array([[0, 1, 0], [1, 0, 1]]),
columns=['a', 'b', 'c'], default_fill_value=0)
In [106]: df2 = pd.DataFrame(df)
In [107]: np.exp(df)['a']
Out[107]:
0 1.000000
1 2.718282
Name: a, dtype: Sparse[float64, 0]
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([2], dtype=int32)
In [108]: np.exp(df2)['a']
Out[108]:
0 1.000000
1 2.718282
Name: a, dtype: float64
Although SparseDataFrame returns the correct result here, I am not sure it actually works as desired: I am not sure it avoids materializing the full data (which in principle should be avoidable).
edit from Tom
Implementing DataFrame.__array_ufunc__ is probably the best way to do this. The semantics will be similar to Series.__array_ufunc__, but applied blockwise.
- Series and DataFrame objs in inputs will first be aligned.
- All arrays will be unboxed from blocks.
- The ufunc will be applied to each array. If the array defines __array_ufunc__, it'll be called.
- The results will be re-boxed in a DataFrame with the original labels.
There are some additional complications with dimensionality, shapes, broadcasting... But the basic idea of using __array_ufunc__ blockwise so that the underlying array's __array_ufunc__ is called makes sense.
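To make the unbox / apply / re-box idea concrete, here is a rough sketch using only NumPy. It is not the pandas implementation: `ToySparseArray` and `apply_ufunc_columnwise` are hypothetical names made up for illustration, a plain dict of label -> array stands in for the DataFrame's blocks, and the alignment step is skipped entirely. The point is only that calling the ufunc on each unboxed array lets that array's own __array_ufunc__ decide how to compute the result, so a sparse array never has to be densified.

```python
import numpy as np


class ToySparseArray:
    """Hypothetical stand-in for a sparse array (NOT pandas' SparseArray).

    Only the values that differ from ``fill_value`` are stored, together
    with their positions.
    """

    def __init__(self, length, positions, values, fill_value=0):
        self.length = length
        self.positions = np.asarray(positions, dtype=np.intp)
        self.values = np.asarray(values)
        self.fill_value = fill_value

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # This sketch only handles plain unary calls such as np.exp(arr).
        if method != "__call__" or len(inputs) != 1 or kwargs:
            return NotImplemented
        # Apply the ufunc to the stored values and to the fill value only;
        # the dense array is never built here.
        return ToySparseArray(
            self.length,
            self.positions,
            ufunc(self.values),
            ufunc(self.fill_value),
        )

    def to_dense(self):
        # Materialize on request (used below only to check the result).
        dense = np.full(self.length, self.fill_value, dtype=self.values.dtype)
        dense[self.positions] = self.values
        return dense


def apply_ufunc_columnwise(ufunc, columns):
    """Apply ``ufunc`` to each unboxed array and re-box under the same labels.

    ``columns`` is a dict of label -> array, a simplified stand-in for blocks.
    NumPy dispatches to an array's __array_ufunc__ when it defines one.
    """
    return {label: ufunc(array) for label, array in columns.items()}


columns = {
    "a": ToySparseArray(2, positions=[1], values=[1]),  # dense form: [0, 1]
    "b": np.array([1.0, 0.0]),                          # ordinary dense column
}
result = apply_ufunc_columnwise(np.exp, columns)

print(type(result["a"]).__name__)  # the sparse container is preserved
print(result["a"].to_dense())
print(result["b"])
```

A real implementation would also have to deal with the alignment, broadcasting and multi-input cases mentioned above, which this sketch deliberately ignores.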