
pandas.read_csv leaks memory while opening massive files with chunksize & iterator=True #21516

Closed
@Shirui816

Description

I am using Anaconda and my pandas version is 0.23.1. When dealing with a single large file, setting chunksize or iterator=True works fine and memory usage stays low. The problem arises when I try to deal with 5000+ files (the file names are in filelist):

trajectory = [pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000) for f in filelist]

Memory usage rises very fast and quickly exceeds 20 GB. However, trajectory = [open(f, 'r')....] followed by reading 10000 lines from each file works fine.
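Roughly, the plain-file version looks like this (a sketch; read_chunk is just a name for the part I elided above, and islice is one way to grab 10000 lines at a time):

# Sketch of the plain-file variant that keeps memory low; the exact loop
# is elided above ("...."), so read_chunk is a hypothetical helper.
from itertools import islice

trajectory = [open(f, 'r') for f in filelist]  # same filelist as above

def read_chunk(fh, n_lines=10000):
    # Read the next n_lines lines from an already-open file handle.
    return list(islice(fh, n_lines))

chunks = [read_chunk(fh) for fh in trajectory]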

I also tried the low_memory=True option, but it does not help. Both the engine='python' and memory_map=<some file> options solve the memory problem, but when I use the data with

X = np.asarray([f.get_chunk().values for f in trajectory])
FX = np.fft.fft(X, axis=0)

the multi-threading of MKL-FFT no longer works.
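For completeness, this is roughly the engine='python' variant I tried (a sketch; the exact options may have differed slightly, and the FFT part is the same as above):

import numpy as np
import pandas as pd

trajectory = [
    pd.read_csv(f, delim_whitespace=True, header=None,
                chunksize=10000, engine='python')  # python engine keeps memory low
    for f in filelist
]

X = np.asarray([r.get_chunk().values for r in trajectory])  # one 10000-row chunk per file
FX = np.fft.fft(X, axis=0)  # MKL-FFT stops using multiple threads here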

Labels: IO CSV (read_csv, to_csv)
