read_csv using C engine and chunksize can grow memory usage exponentially in 0.24.0rc1 #24805

Closed
@PMeira

Description

Code Sample

import pandas as pd

NUM_ROWS = 1000
CHUNKSIZE = 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))
        
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    print(chunk_index)

Problem description

In v0.24.0rc1, using chunksize in pandas.read_csv with the C engine causes exponential memory growth (engine='python' works fine).
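
For reference, the same loop completes with flat memory usage when the Python engine is forced (a minimal variation of the sample above):

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='python')):
    print(chunk_index)  # no runaway memory growth with engine='python'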

The code sample above uses a very small chunksize to better illustrate the issue, but it also occurs with more realistic values such as NUM_ROWS = 1000000 and CHUNKSIZE = 1024. The low_memory parameter of pd.read_csv() doesn't affect the behavior.
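
To make the growth visible, resident memory can be logged per chunk; this sketch assumes psutil is available (it is not part of the environment listed below):

import os
import psutil
import pandas as pd

proc = psutil.Process(os.getpid())
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=20, engine='c')):
    # resident set size in MiB: it keeps climbing from chunk to chunk
    print(chunk_index, proc.memory_info().rss / 2**20)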

On Windows, the process becomes very slow as memory usage grows.

On Linux, an out-of-memory error is raised after some chunks are processed, once the buffer length has grown too large. For example:

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Traceback (most recent call last):
  File "test_csv.py", line 10, in <module>
    for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1110, in __next__
    return self.get_chunk()
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1168, in get_chunk
    return self.read(nrows=size)
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1134, in read
    ret = self._engine.read(nrows)
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1977, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 962, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 949, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2166, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory

I tried to debug the tokenizer's C code as of 0bd454c. The unexpected behavior seems to be present since 011b79f, which introduced these lines (among other changes) to fix #23509:

/**
 * If we are reading in chunks, we need to be aware of the maximum number
 * of words we have seen in previous chunks (self->max_words_cap), so
 * that way, we can properly allocate when reading subsequent ones.
 *
 * Otherwise, we risk a buffer overflow if we mistakenly under-allocate
 * just because a recent chunk did not have as many words.
 */
if (self->words_len + nbytes < self->max_words_cap) {
    length = self->max_words_cap - nbytes;
} else {
    length = self->words_len;
}

I'm not familiar with the code, so I could be misinterpreting it, but I believe that code block, coupled with how self->words_cap and self->max_words_cap are handled, could be the source of the issue. Some of the variable names are potentially misleading: nbytes refers to a number of bytes, yet it is compared against and subtracted from quantities counted in words. I couldn't follow everything that's happening, but hopefully this report helps.
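
To illustrate the suspected mechanism, here is a rough Python simulation of the quoted block. It assumes, from my reading of tokenizer.c, that grow_buffer doubles the capacity while length + nbytes >= capacity, and that max_words_cap is afterwards raised to the new capacity; the constants are made up and only the growth pattern matters:

words_cap = 64        # current capacity of the words buffer
max_words_cap = 64    # high-water mark carried across chunks
nbytes = 60           # bytes per chunk (not a word count!)
words_len = 20        # words actually produced per chunk

for chunk in range(8):
    # the quoted block: mixes byte counts with word counts
    if words_len + nbytes < max_words_cap:
        length = max_words_cap - nbytes
    else:
        length = words_len
    # grow_buffer: double the capacity until it holds length + nbytes entries
    while length + nbytes >= words_cap:
        words_cap *= 2
    max_words_cap = max(max_words_cap, words_cap)
    print(chunk, words_cap)  # prints 128, 256, 512, ... doubling every chunk

Under these assumptions, once the length = max_words_cap - nbytes branch is taken, the requested size length + nbytes always equals the current high-water mark, so the buffer doubles on every single chunk.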

The issue also seems related to #16537 and #21516, but the specific changes that cause it are newer and not present in previous releases.

Expected Output

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0rc1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Metadata

Labels

IO CSV (read_csv, to_csv)
Regression (functionality that used to work in a prior pandas version)
