Building an empty SparseDataFrame by columns is very slow compared to by index. #16197

Closed
@cfrancois7

Code Sample, a copy-pastable example if possible

I want to create a sparse matrix with a 4-level MultiIndex on both axes and about 340 000 x 340 000 cells.
It is not possible to build it as a dense frame and then convert it to sparse.
So I tried to build it directly as a SparseDataFrame.
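For a sense of scale, here is a rough back-of-envelope estimate of what the dense version would cost. It assumes 8-byte float64 cells (an assumption; the issue does not state the dtype) and uses the index length reported below.

```python
# Rough memory estimate for a dense n x n frame.
n = 338275              # len(index), as reported in this issue
bytes_per_cell = 8      # assumption: float64 values

cells = n * n
dense_bytes = cells * bytes_per_cell
dense_gib = dense_bytes / 2**30

print(f"{cells:,} cells, about {dense_gib:,.0f} GiB if stored densely")
```

Under that assumption the dense frame would need on the order of 850 GiB, far beyond the 8 GB of RAM available, which is why going through a dense intermediate is ruled out.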

n = len(index)
print(n)
>>> 338275
%timeit df0 = pd.SparseDataFrame(index=index)
1000 loops, best of 3: 419 µs per loop

But if I try to construct:

df1 = pd.SparseDataFrame(columns=index)

or

df2 = pd.SparseDataFrame(index=index, columns=index)

A whole night wasn't enough to build the empty SparseDataFrame.
I don't understand how to build this empty SparseDataFrame in a reasonable time (less than 20 minutes with 8 GB of RAM).
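One possible workaround, sketched here under assumptions and not as the pandas-recommended approach, is to keep the values in a `scipy.sparse` matrix and use the MultiIndex only to translate labels into integer positions. This swaps `SparseDataFrame` for a different data structure. The small index, its level names, and the `position` helper below are hypothetical stand-ins for the real 338275-entry index.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical 4-level MultiIndex standing in for the real one.
index = pd.MultiIndex.from_product(
    [["DE", "FR"], ["agri", "steel"], ["a", "b"], [0, 1]],
    names=["region", "sector", "kind", "rank"],  # assumed level names
)
n = len(index)

# Empty n x n sparse matrix: nothing is allocated per row or per column.
matrix = sparse.dok_matrix((n, n), dtype=np.float64)


def position(label):
    """Translate a full 4-level label tuple into its integer position."""
    return index.get_loc(label)


# Set one cell by label on both axes.
row = position(("FR", "steel", "b", 1))
col = position(("DE", "agri", "a", 0))
matrix[row, col] = 3.5

# Convert to CSR once filling is done, for fast arithmetic and slicing.
csr = matrix.tocsr()
```

The design choice is that the labels and the values live in separate objects, so creating the empty container does not depend on how many columns there are.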

Output of pd.show_versions()

```
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
```

Labels: Performance (memory or execution speed performance), Sparse (sparse data type)
