BUG: multiple application of .loc on sparse DataFrame results in NaN and filling the DataFrame  #34687

Closed
@deusebio

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from scipy.sparse import eye
import pandas as pd

df = pd.DataFrame.sparse.from_spmatrix(eye(10))

df.sparse.density 
# Should output 0.1

df.loc[range(5)]                                                        
#     0    1    2    3    4    5    6    7    8    9
# 0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
# 4  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0

df.loc[range(5)].sparse.density
# outputs 0.1

df.loc[range(5)].loc[range(3)]                                          
#     0    1    2    3    4   5   6   7   8   9
# 0  1.0  0.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
# 1  0.0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
# 2  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN

df.loc[range(5)].loc[range(3)].sparse.density
# outputs 0.6

Problem description

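It seems that a sparse DataFrame obtained through .loc does not behave like the original one. In the example above, a second .loc on the selected rows fills columns 5 to 9 with NaN, and the density rises from 0.1 to 0.6. This creates inconsistencies in our processing pipelines: depending on the filtering and selection already applied, a further selection sometimes produces NaN values, which severely increases memory consumption and computation time.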
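Until the chained selection behaves consistently, one possible workaround is to do the row selections on a dense copy and convert back to sparse afterwards. The sketch below is only an illustration under stated assumptions: `loc_via_dense` is a hypothetical helper, not a pandas function, the frame is built from NumPy rather than with `from_spmatrix` to keep the example free of SciPy, and all columns are assumed to be float with a fill value of 0.0. Note that it temporarily materialises the dense data, so it trades memory for predictable output.

```python
import numpy as np
import pandas as pd


def loc_via_dense(frame, *row_selections, fill_value=0.0):
    """Hypothetical helper: chain row selections on a dense copy, then re-sparsify."""
    # DataFrame.sparse.to_dense() returns an ordinary dense frame.
    dense = frame.sparse.to_dense()
    for rows in row_selections:
        dense = dense.loc[rows]
    # Convert every column back to a sparse dtype with the chosen fill value.
    return dense.astype(pd.SparseDtype("float64", fill_value))


# Same 10x10 identity as in the report, built from NumPy instead of SciPy.
df = pd.DataFrame(np.eye(10)).astype(pd.SparseDtype("float64", 0.0))

subset = loc_via_dense(df, range(5), range(3))
subset_values = subset.to_numpy(dtype="float64")
```

Selecting the final rows in a single `.loc` call, as in `df.loc[range(3)]`, avoids the dense copy whenever the intermediate selection is not needed.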

Expected Output

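The result of .loc should not depend on whether the selection is applied in one step or in several, i.e. the following two expressions should return the same frame:

df.loc[range(5)].loc[range(3)]
df.loc[range(3)]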
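The expected behaviour can be turned into a small check, which might serve as a starting point for a regression test. This is a sketch, not the project's actual test: `chained_loc_matches_direct` is a hypothetical name, and the frame is again built from NumPy (an assumption made to avoid the SciPy dependency). On a pandas version where chained selection works, `result` should be True; on an affected version such as the 1.0.4 install in the report, it would be expected to be False.

```python
import numpy as np
import pandas as pd


def chained_loc_matches_direct(frame, outer_rows, inner_rows):
    """Hypothetical check: chained .loc equals a single .loc and adds no NaN."""
    chained = frame.loc[outer_rows].loc[inner_rows]
    direct = frame.loc[inner_rows]

    # Compare on dense float arrays so sparse fill values are materialised.
    chained_values = chained.to_numpy(dtype="float64")
    direct_values = direct.to_numpy(dtype="float64")

    no_nan = not np.isnan(chained_values).any()
    same_values = np.array_equal(chained_values, direct_values)
    same_density = np.isclose(chained.sparse.density, direct.sparse.density)
    return bool(no_nan and same_values and same_density)


# 10x10 sparse identity, equivalent to the frame used in the report.
df = pd.DataFrame(np.eye(10)).astype(pd.SparseDtype("float64", 0.0))

result = chained_loc_matches_direct(df, range(5), range(3))
```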

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.3
setuptools : 39.0.1
Cython : 0.29.19
pytest : 5.4.3
hypothesis : 5.16.1
sphinx : 3.1.0
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.9
numba : 0.49.1

Labels

Indexing, Needs Tests, Sparse, good first issue