Description
Is your feature request related to a problem?
Currently to_pickle has the following code to work around issue #39002. The result is an extra in-memory copy with pickle protocol 5 for some formats.
Lines 104 to 109 in c90294d
Describe the solution you'd like
The file compressors fail because they often assume their input is bytes-like (though not as general as a memoryview). In particular, they assume the data is 1-D, contiguous, and of uint8 type. With PickleBuffer's raw method, it is pretty straightforward to construct a memoryview with this format. If the buffer is non-contiguous (not sure how often this would come up here), raw will raise a BufferError, though this can be mitigated by falling back to an in-memory copy if that exception occurs.
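As a minimal sketch of the behavior described above (standard library only; the variable names are just for illustration), the following shows raw() flattening a 2-D, 16-bit view into a 1-D view of unsigned bytes, and raising BufferError for a strided view, where a copy is used instead:

```python
from pickle import PickleBuffer

# A 2-D view of 16-bit integers: not the flat uint8 layout compressors expect.
data = bytearray(range(12))
view_2d = memoryview(data).cast("H", [2, 3])

# raw() exposes the same memory as a 1-D view of unsigned bytes, without copying.
raw = PickleBuffer(view_2d).raw()
print(raw.ndim, raw.format, raw.nbytes)

# A strided view is not contiguous, so raw() raises BufferError.
strided = PickleBuffer(memoryview(b"abcdef")[::2])
try:
    flat = strided.raw()
except BufferError:
    flat = bytes(strided)  # fall back to an in-memory copy
```

In the contiguous case no data is copied, so writes to `data` remain visible through `raw`.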
Given this, one option would be to wrap the write method of the compressor file objects that need this handling. For example, the following would work on Python 3.8+:
from bz2 import BZ2File as _BZ2File
from pickle import PickleBuffer


class BZ2File(_BZ2File):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super().write(b)
Potentially this could live alongside other custom file objects like this one in pandas.io.common (albeit as private objects), though there may be other places that make sense.
API breaking implications
NA
When protocol 5 is set, the pickled data is already written out in memory before being written to the file. This change would only save the memcpy before writing to the file. Data written before and after this change should still be readable just the same.
Describe alternatives you've considered
Ultimately it would be preferable to have this fixed upstream. In fact, to some extent that has already happened (python/cpython#88605). However, Python 3.8, which pandas still supports, is not covered. Also, users stuck on an earlier patch version of 3.9 (only 3.9.6+ has the fix) may not have the fix. In these cases, it may make sense to have this workaround to provide the improved efficiency while protecting against this issue.
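If the workaround should apply only where the upstream fix is missing, a version check along these lines could gate it. This is a sketch, not a tested policy: `needs_write_workaround` is a hypothetical name, and it assumes every release from 3.9.6 onward (including 3.10 and later) carries the fix.

```python
import sys


def needs_write_workaround(version_info=sys.version_info):
    """Hypothetical check: True on interpreters assumed to lack the upstream fix."""
    # Assumption: Python 3.8.x and 3.9.0 through 3.9.5 are the affected versions.
    return tuple(version_info[:3]) < (3, 9, 6)
```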
Additional context
NA (included above)