Description
Code Sample
import pandas as pd

NUM_ROWS = 1000
CHUNKSIZE = 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    print(chunk_index)
Problem description
In v0.24.0rc1, using chunksize in pandas.read_csv with the C engine causes exponential memory growth (engine='python' works fine).
The code sample above uses a very small chunksize to better illustrate the issue, but it also happens with more realistic values such as NUM_ROWS = 1000000 and CHUNKSIZE = 1024. The low_memory parameter of pd.read_csv() doesn't affect the behavior.
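For reference, a sketch of the same reproduction at the larger sizes mentioned above (the values and the engine='python' comparison come from this description; everything else mirrors the code sample):

import pandas as pd

NUM_ROWS = 1000000
CHUNKSIZE = 1024

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

# engine='c' exhibits the growth; engine='python' completes normally.
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='python')):
    print(chunk_index)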
On Windows, the process becomes very slow as memory usage grows.
On Linux, an out-of-memory exception is raised after some chunks are processed and the buffer length grows too much. For example:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Traceback (most recent call last):
File "test_csv.py", line 10, in <module>
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1110, in __next__
return self.get_chunk()
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1168, in get_chunk
return self.read(nrows=size)
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1134, in read
ret = self._engine.read(nrows)
File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1977, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 962, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 949, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2166, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
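One way to watch the growth chunk by chunk on Linux, not part of the original script, is to print the process peak RSS after each chunk using only the standard library (this assumes test.csv from the code sample above already exists; ru_maxrss is reported in KiB on Linux):

import resource

import pandas as pd

CHUNKSIZE = 20

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    # Peak resident set size of the process so far, in KiB on Linux.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(chunk_index, peak_kib)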
I tried to debug the C code from the tokenizer as of 0bd454c. The unexpected behavior seems to have been present since 011b79f, which introduced these lines (and other changes) to fix #23509:
pandas/pandas/_libs/src/parser/tokenizer.c, lines 294 to 306 in 0bd454c
I'm not familiar with the code, so I could be misinterpreting it, but I believe that block, coupled with how self->words_cap and self->max_words_cap are handled, could be the source of the issue. There are some potentially misleading variable names, like nbytes, which seems to refer to a number of bytes that is later interpreted as a number of tokens. I couldn't follow everything that is happening, but hopefully this report helps.
It seems the issue could also be related to #16537 and #21516, but the specific changes that cause it are newer and were not present in previous releases.
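To make the suspicion above concrete, here is a purely speculative toy model in Python (rather than C) of the kind of feedback loop hinted at: if the space reserved for the next chunk is derived from the historical maximum capacity minus a byte count instead of a token count, the capacity can ratchet up on every chunk. All names and the exact formula below are assumptions for illustration, not the actual tokenizer.c logic.

# Toy model only: NOT the pandas C code.  It assumes a capacity update of
# the shape reserve = max_words_cap - nbytes, where nbytes is a byte count,
# which is one way the growth reported above could arise.
TOKENS_PER_CHUNK = 20    # tokens each chunk actually produces
BYTES_PER_CHUNK = 60     # bytes handed to the tokenizer for that chunk

words_cap = 64           # hypothetical current token capacity
max_words_cap = words_cap

for chunk in range(10):
    # Reserve space based on the historical maximum minus a *byte* count,
    # so the reservation always exceeds what the chunk really needs.
    reserve = max(max_words_cap - BYTES_PER_CHUNK, TOKENS_PER_CHUNK)
    words_cap += reserve
    max_words_cap = max(max_words_cap, words_cap)
    print('chunk {}: words_cap={}'.format(chunk, words_cap))

Under these assumed numbers, words_cap roughly doubles on every chunk after the first few, which matches the exponential growth described above.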
Expected Output
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0rc1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None