Description
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np

n = 2000000
data = pd.DataFrame({'a' : range(n)})

# 99 string columns, each inserted one at a time
for i in range(1, 100):
    data['col_' + str(i)] = np.random.choice(['a', 'b'], n)

# 599 constant integer columns, each inserted one at a time
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    print(str(i))
Problem description
Following this StackOverflow question.
I ran this code sample on an Ubuntu 18.04 LTS machine with 16 GB of RAM and 2 GB of swap. Execution produces the following output and stack trace:
294
295
296
297
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1053, in set
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "py-memory-test.py", line 12, in <module>
    data['test_{}'.format(i)] = i
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3370, in __setitem__
    self._set_item(key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3446, in _set_item
    NDFrame._set_item(self, key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3172, in _set_item
    self._data.set(key, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1056, in set
    self.insert(len(self.items), item, value)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1184, in insert
    self._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
    _can_consolidate=_can_consolidate)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 3149, in _merge_blocks
    new_values = new_values[argsort]
MemoryError
I found the following code inside pandas core:
def insert(self, loc, item, value, allow_duplicates=False):
    ...
    self._known_consolidated = False
    if len(self.blocks) > 100:
        self._consolidate_inplace()
It seems that this consolidation takes place roughly every 100th iteration and substantially affects performance and memory usage. To test this hypothesis, I changed 100 to 1000000, and it worked just fine: no performance gaps and no MemoryError.
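For readers who cannot patch pandas, one possible way to sidestep the per-column inserts is sketched below. It builds all the constant columns in a single frame and attaches them with one pd.concat call, so the block count never grows one insert at a time. This is an illustration rather than a confirmed fix: the small n, the name constants, and the explicit int64 dtype are assumptions made for the example, and memory use at n = 2000000 was not measured.

```python
import numpy as np
import pandas as pd

n = 1000  # small size for illustration; the report uses n = 2000000

data = pd.DataFrame({'a': range(n)})

# Build every constant column up front instead of inserting them one by one.
# 'constants' is a hypothetical helper name, not part of the original script.
constants = pd.DataFrame(
    {'test_{}'.format(i): np.full(n, i, dtype=np.int64) for i in range(1, 600)},
    index=data.index,
)

# A single concat attaches all 599 columns in one step.
data = pd.concat([data, constants], axis=1)

print(data.shape)
```

The same idea should apply to the col_ columns: collect them in a dict first and attach them in one step.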
This looks quite odd from my perspective, since 'consolidation' sounds like it should reduce memory usage. Perhaps pandas should allocate some private swap files (e.g. via mmap) when it is running out of RAM plus system swap, so that it can complete the consolidation successfully.
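A rough back-of-the-envelope estimate may help explain why consolidation can raise peak memory even though the final layout is more compact. The sketch below assumes that, at the failing line, three full-size copies of the integer data are alive at once: the original per-column blocks, a stacked array, and the reordered array created by new_values[argsort]. Only that last indexing step is confirmed by the traceback; the factor of three is an assumption, and the string payloads of the object columns are not counted.

```python
# Back-of-the-envelope estimate; the "three copies" factor is an assumption.
N_ROWS = 2_000_000
ITEM_BYTES = 8  # one int64 value, or one object pointer on a 64-bit build
GIB = 2 ** 30


def block_bytes(n_columns, n_rows=N_ROWS, item_bytes=ITEM_BYTES):
    """Size in bytes of one 2-D block holding n_columns columns."""
    return n_columns * n_rows * item_bytes


# 'a' plus test_1 .. test_298 are integer columns when the merge fails.
int_block = block_bytes(1 + 298)
# col_1 .. col_99 store object pointers (string payloads not counted).
object_block = block_bytes(99)

# Assumed live copies of the integer data during the merge (see lead-in).
ASSUMED_LIVE_COPIES = 3
peak_estimate = ASSUMED_LIVE_COPIES * int_block + object_block

print("int64 block:  {:.2f} GiB".format(int_block / GIB))
print("object block: {:.2f} GiB".format(object_block / GIB))
print("assumed peak: {:.2f} GiB".format(peak_estimate / GIB))
```

Under these assumptions the transient peak is close to the 16 GB of RAM on the test machine, which would be consistent with the failure appearing around iteration 298 rather than earlier.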
Expected Output
1
2
3
4
5
...
599
No substantial freezes every ~100th iteration and no MemoryError.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None