BUG: to_json memory leak (introduced in 1.1.0) #43877

Closed
@vernetya

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# any loop
for _ in range(1000):
    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(10)})
    df.to_json()  # same regardless of orient, or when writing to a file

Issue Description

This looks like a memory leak in to_json, introduced in version 1.1.0. It seems to prevent the DataFrame from being garbage collected. Here is a memory profile of pandas 1.1.0 compared with the previous version, 1.0.5:

[image: mem_prof]

I see the same trend on Windows 10 and Ubuntu Linux, with Python 3.7, 3.8 and 3.9.
The leak is still present in the latest pandas version, 1.3.3, and is proportional to the size of the DataFrame. I've tried calling del and gc.collect() directly, but it doesn't change anything.
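
One way to check that the retained memory survives an explicit gc.collect() is to record how much traced memory is still allocated after each iteration. The following is only a minimal sketch: the helper names traced_memory_per_iteration and to_json_workload are hypothetical, and it assumes the leaked buffers are visible to tracemalloc (the NumPy allocation in the traceback further down suggests at least part of them are).

```python
import gc
import tracemalloc


def traced_memory_per_iteration(workload, iterations=5):
    """Run `workload` repeatedly; return traced bytes still allocated after each run."""
    tracemalloc.start()
    try:
        readings = []
        for _ in range(iterations):
            workload()
            gc.collect()  # rule out garbage that is merely awaiting collection
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
        return readings
    finally:
        tracemalloc.stop()


def to_json_workload():
    # Same shape as the reproducible example. The imports are local so the
    # helper above stays usable without pandas installed.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(10)})
    df.to_json()
```

Calling traced_memory_per_iteration(to_json_workload) on an affected version would be expected to return steadily increasing readings, while a version without the leak should stay roughly flat after the first iteration.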

It's specific to the to_json method. I haven't observed a leak with other formats such as CSV.

I don't know if this makes sense or helps, but here's the tracemalloc output from this code:

import tracemalloc

import numpy as np
import pandas as pd


def foo():
    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(5)})
    df.to_json()


if __name__ == "__main__":
    tracemalloc.start(50)

    foo()

    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('traceback')

    # pick the biggest memory block
    stat = top_stats[0]
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)

With pandas 1.1.0 or 1.3.3:

5 memory blocks: 782.5 KiB
  File "main.py"
    foo()
  File "main.py"
    df.to_json()
  File ".\ven37\lib\site-packages\pandas\core\generic.py", line 2571
    storage_options=storage_options,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 122
    indent=indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 183
    indent=self.indent,
  File ".\ven37\lib\site-packages\pandas\core\indexes\base.py", line 4367
    return self._data
  File ".\ven37\lib\site-packages\pandas\core\indexes\range.py", line 186
    return np.arange(self.start, self.stop, self.step, dtype=np.int64)

whereas 1.0.5 produces this:

9 memory blocks: 1.6 KiB
  File "main.py"
    foo()
  File "main.py"
    df.to_json()
  File ".\ven37\lib\site-packages\pandas\core\generic.py", line 2364
    indent=indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 85
    indent=indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 145
    self.indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 245
    indent,
  File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 167
    indent=indent,
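
The two tracebacks above come from a single snapshot each. To see which allocation sites actually grow from call to call, two snapshots can be compared instead. This is a sketch under assumptions, not a definitive diagnosis: top_growth is a hypothetical helper name, and it only reports allocations that tracemalloc can trace.

```python
import gc
import tracemalloc


def top_growth(workload, repeats=3, limit=3, nframes=25):
    """Return the allocation sites whose live size grew most across `repeats` runs."""
    tracemalloc.start(nframes)
    try:
        workload()  # warm-up so one-time caches are not counted as growth
        gc.collect()
        before = tracemalloc.take_snapshot()
        for _ in range(repeats):
            workload()
        gc.collect()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries, largest absolute size change first
    return after.compare_to(before, "traceback")[:limit]
```

Passing a function that builds the DataFrame and calls df.to_json() as workload, then printing size_diff and traceback.format() for each returned entry, should show whether the np.arange allocation in range.py keeps accumulating on the affected versions.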

Expected Behavior

No leak is expected, as in version 1.0.5.

Installed Versions

Versions with leak:
master -----------------
commit : 6599834
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.4.0.dev0+833.g6599834103
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
...

1.3.3 ------------------
commit : 73c6825
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.3.3
numpy : 1.21.2
...

1.1.0 ------------------
commit : d9fff27
pandas : 1.1.0
numpy : 1.21.2
...

Versions without leak:
commit : None
pandas : 1.0.5
numpy : 1.21.2
...

Metadata

    Labels

    Bug
    IO JSON: read_json, to_json, json_normalize
    Performance: Memory or execution speed performance
