stack() method of SparseDataFrame should return a SparseSeries and optimize memory usage #15045

Closed
@datapythonista

Description

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, np.nan, np.nan, 1., np.nan],
                   'b': [1., np.nan, np.nan, 1., np.nan, np.nan, np.nan]}).to_sparse()
print(type(df))
print(type(df.stack()))

<class 'pandas.sparse.frame.SparseDataFrame'>
<class 'pandas.core.series.Series'>

Problem description

I'm trying to convert a SparseDataFrame (obtained from pd.get_dummies()) into a SciPy sparse matrix using the experimental .to_coo(). Because this method accepts a Series with a MultiIndex rather than a DataFrame, I first call the .stack() method of the SparseDataFrame.

The problem is that .stack() doesn't appear to process the SparseDataFrame as sparse. Instead it stacks it as dense, consuming too much memory, and returns a (dense) Series.

Returning a dense Series could be all right, as np.nan values are dropped by default (the dropna parameter), but the memory consumption is a problem.
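
To illustrate the kind of behaviour I have in mind, here is a minimal pure-Python sketch of a sparse-aware stack. It is not how pandas implements anything: the functions sparse_stack and to_coo_triplets and the dict-of-dicts column layout are hypothetical stand-ins for a sparse frame. The point is only that the work and memory scale with the number of stored values, not with rows times columns.

```python
def sparse_stack(columns):
    """Stack sparse columns into row-major ((row, label), value) pairs.

    `columns` maps each column label to a dict {row: value} containing only
    the stored (non-NaN) entries; missing entries are never materialized.
    This layout is an assumption made for illustration only.
    """
    position = {label: i for i, label in enumerate(columns)}
    stacked = [((row, label), value)
               for label, entries in columns.items()
               for row, value in entries.items()]
    # Row-major order: sort by row, then by the column's original position.
    stacked.sort(key=lambda item: (item[0][0], position[item[0][1]]))
    return stacked


def to_coo_triplets(stacked, labels):
    """Split stacked pairs into parallel (rows, cols, data) lists."""
    position = {label: i for i, label in enumerate(labels)}
    rows = [row for (row, _), _ in stacked]
    cols = [position[label] for (_, label), _ in stacked]
    data = [value for _, value in stacked]
    return rows, cols, data


# The frame from the code sample, holding only its three stored values.
columns = {'a': {5: 1.0}, 'b': {0: 1.0, 3: 1.0}}
stacked = sparse_stack(columns)
rows, cols, data = to_coo_triplets(stacked, list(columns))
print(stacked)
print(rows, cols, data)
```

The stacked pairs come out in the same (row, column) order that .stack() gives for the three non-NaN values, and the three parallel lists have the (data, (rows, cols)) shape that scipy.sparse.coo_matrix accepts, so in principle no dense intermediate is needed.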

I'm aware that the sparse functionality as a whole is not yet mature. I also saw the function pd.sparse.frame.stack_sparse_frame, which I guess is a step toward fixing this problem (although it doesn't work for me). As I couldn't find an existing issue for this problem, I thought it was worth opening one.

Expected Output

<class 'pandas.sparse.frame.SparseDataFrame'>
<class 'pandas.sparse.series.SparseSeries'>

Output of pd.show_versions()

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.5-100.fc23.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.19.2+0.g825876c.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
