Description
Code Sample
import pandas as pd

NUM_ROWS = 1000
CHUNKSIZE = 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    print(chunk_index)
Problem description
In v0.24.0rc1, using chunksize in pandas.read_csv with the C engine causes exponential memory growth (engine='python' works fine).
The code sample above uses a very small chunksize to better illustrate the issue, but it also happens with more realistic values such as NUM_ROWS = 1000000 and CHUNKSIZE = 1024. The low_memory parameter of pd.read_csv() doesn't affect the behavior.
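For reference, a sketch of the same reproduction at the larger sizes mentioned above (the values and the engine='python' comparison come from this description; everything else mirrors the code sample):

import pandas as pd

NUM_ROWS = 1000000
CHUNKSIZE = 1024

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

# engine='c' exhibits the growth; engine='python' completes normally.
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='python')):
    print(chunk_index)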
On Windows, the process becomes very slow as memory usage grows.
On Linux, an out-of-memory exception is raised after some chunks are processed and the buffer length grows too much. For example:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Traceback (most recent call last):
File "test_csv.py", line 10, in <module>
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1110, in __next__
return self.get_chunk()
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1168, in get_chunk
return self.read(nrows=size)
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1134, in read
ret = self._engine.read(nrows)
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1977, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 962, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 949, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2166, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
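One way to watch the growth chunk by chunk on Linux, not part of the original script, is to print the process peak RSS after each chunk using only the standard library (this assumes test.csv from the code sample above already exists; ru_maxrss is reported in KiB on Linux):

import resource

import pandas as pd

CHUNKSIZE = 20

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    # Peak resident set size of the process so far, in KiB on Linux.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(chunk_index, peak_kib)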
I tried to debug the C code from the tokenizer as of 0bd454c. The unexpected behavior seems to have been present since 011b79f, which introduced these lines (and other changes) to fix #23509:
pandas/pandas/_libs/src/parser/tokenizer.c, lines 294 to 306 in 0bd454c
I'm not familiar with the code, so I could be misinterpreting it, but I believe that block, coupled with how self->words_cap and self->max_words_cap are handled, could be the source of the issue. There are some potentially misleading variable names, like nbytes, which seems to refer to a number of bytes that is later interpreted as a number of tokens. I couldn't follow everything that is happening, but hopefully this report helps.
It seems the issue could also be related to #16537 and #21516, but the specific changes that cause it are newer and were not present in previous releases.
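To make the suspicion above concrete, here is a purely speculative toy model in Python (rather than C) of the kind of feedback loop hinted at: if the space reserved for the next chunk is derived from the historical maximum capacity minus a byte count instead of a token count, the capacity can ratchet up on every chunk. All names and the exact formula below are assumptions for illustration, not the actual tokenizer.c logic.

# Toy model only: NOT the pandas C code.  It assumes a capacity update of
# the shape reserve = max_words_cap - nbytes, where nbytes is a byte count,
# which is one way the growth reported above could arise.
TOKENS_PER_CHUNK = 20    # tokens each chunk actually produces
BYTES_PER_CHUNK = 60     # bytes handed to the tokenizer for that chunk

words_cap = 64           # hypothetical current token capacity
max_words_cap = words_cap

for chunk in range(10):
    # Reserve space based on the historical maximum minus a *byte* count,
    # so the reservation always exceeds what the chunk really needs.
    reserve = max(max_words_cap - BYTES_PER_CHUNK, TOKENS_PER_CHUNK)
    words_cap += reserve
    max_words_cap = max(max_words_cap, words_cap)
    print('chunk {}: words_cap={}'.format(chunk, words_cap))

Under these assumed numbers, words_cap roughly doubles on every chunk after the first few, which matches the exponential growth described above.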
Expected Output
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0rc1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None