Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
# any loop
for _ in range(1000):
    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(10)})
    df.to_json()  # same regardless of orient, or when writing to a file
Issue Description
Calling to_json appears to leak memory; the leak was introduced in version 1.1.0. It seems to prevent the DataFrame from being garbage collected. Here's a memory profile of pandas 1.1.0 compared with the previous version, 1.0.5:
I see the same trend on Windows 10 and Ubuntu Linux, with Python 3.7, 3.8, and 3.9.
This leak is still present in the latest pandas version, 1.3.3, and is proportional to the size of the DataFrame. I've tried calling del and gc.collect() directly, but it doesn't change anything.
It's specific to the to_json method. I haven't observed a leak with other formats such as CSV.
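One way to check whether memory really stays allocated across calls, even after gc.collect(), is to record the traced allocation total after every iteration. The following is a minimal sketch, not part of the original report: traced_growth, leaky and clean are hypothetical names, and the two stand-in functions replace df.to_json() so that the sketch runs without pandas.

```python
import gc
import tracemalloc


def traced_growth(fn, iterations=10):
    """Call fn repeatedly; return traced bytes still allocated after each call."""
    gc.collect()
    tracemalloc.start()
    try:
        baseline, _peak = tracemalloc.get_traced_memory()
        growth = []
        for _ in range(iterations):
            fn()
            gc.collect()  # rule out garbage that is merely waiting for collection
            current, _peak = tracemalloc.get_traced_memory()
            growth.append(current - baseline)
        return growth
    finally:
        tracemalloc.stop()


# Hypothetical stand-ins so the sketch runs without pandas.
_retained = []


def leaky():
    _retained.append(bytearray(100_000))  # reference survives the call


def clean():
    bytearray(100_000)  # freed as soon as the call returns


print("leaky:", traced_growth(leaky)[-1], "bytes retained")
print("clean:", traced_growth(clean)[-1], "bytes retained")
```

To apply this to the report, one could pass a function that builds the DataFrame and calls to_json, and a second one that calls to_csv for comparison. A total that keeps climbing with each iteration would be consistent with the leak described here; a flat one would not.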
I don't know whether this helps, but here's the tracemalloc output from this code:
import tracemalloc

import numpy as np
import pandas as pd


def foo():
    df = pd.DataFrame({str(c): np.random.random_sample(size=100_000) for c in range(5)})
    df.to_json()


if __name__ == "__main__":
    tracemalloc.start(50)  # keep up to 50 frames per traceback
    foo()
    # df went out of scope when foo() returned, so any large block still traced was retained
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('traceback')
    # pick the biggest memory block
    stat = top_stats[0]
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)
With pandas 1.1.0 or 1.3.3:
5 memory blocks: 782.5 KiB
File "main.py"
foo()
File "main.py"
df.to_json()
File ".\ven37\lib\site-packages\pandas\core\generic.py", line 2571
storage_options=storage_options,
File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 122
indent=indent,
File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 183
indent=self.indent,
File ".\ven37\lib\site-packages\pandas\core\indexes\base.py", line 4367
return self._data
File ".\ven37\lib\site-packages\pandas\core\indexes\range.py", line 186
return np.arange(self.start, self.stop, self.step, dtype=np.int64)
whereas 1.0.5 produces this:
9 memory blocks: 1.6 KiB
File "main.py"
foo()
File "main.py"
df.to_json()
File ".\ven37\lib\site-packages\pandas\core\generic.py", line 2364
indent=indent,
File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 85
indent=indent,
File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 145
self.indent,
File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 245
indent,
File ".\ven37\lib\site-packages\pandas\io\json\_json.py", line 167
indent=indent,
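A complementary way to locate retained allocations is to diff two tracemalloc snapshots taken before and after the call, instead of inspecting a single snapshot. This is only a sketch under assumptions: top_growth and hold_reference are hypothetical names, and hold_reference stands in for foo() so that the code runs without pandas.

```python
import tracemalloc


def top_growth(fn, limit=3, frames=25):
    """Run fn once; return the allocation sites whose traced size changed most."""
    tracemalloc.start(frames)
    try:
        before = tracemalloc.take_snapshot()
        fn()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to() returns StatisticDiff objects, largest size change first
    return after.compare_to(before, "lineno")[:limit]


# Hypothetical stand-in for foo(): keeps a reference alive after returning.
_kept = []


def hold_reference():
    _kept.append(bytearray(500_000))


for diff in top_growth(hold_reference):
    print(diff)
```

Passing foo from the script above instead of hold_reference should, on an affected version, put the retained blocks at the top of the list. The output in this report already shows a NumPy allocation (np.arange), so such buffers appear to be visible to tracemalloc.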
Expected Behavior
No leak is expected; memory use should match version 1.0.5.
Installed Versions
Versions with leak:
master -----------------
commit : 6599834
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.4.0.dev0+833.g6599834103
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
...
1.3.3 ------------------
commit : 73c6825
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.3.3
numpy : 1.21.2
...
1.1.0 ------------------
commit : d9fff27
pandas : 1.1.0
numpy : 1.21.2
...
Versions without leak:
commit : None
pandas : 1.0.5
numpy : 1.21.2
...