ENH: Always write directly to output in to_pickle #46747

Closed
@jakirkham

Description

Is your feature request related to a problem?

Currently to_pickle has the following code to work around #39002. The result is an extra in-memory copy with pickle protocol 5 for some compression formats.

pandas/pandas/io/pickle.py, lines 104 to 109 at c90294d:

if handles.compression["method"] in ("bz2", "xz") and protocol >= 5:
# some weird TypeError GH#39002 with pickle 5: fallback to letting
# pickle create the entire object and then write it to the buffer.
# "zip" would also be here if pandas.io.common._BytesZipFile
# wouldn't buffer write calls
handles.handle.write(pickle.dumps(obj, protocol=protocol))
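
For illustration, a minimal sketch contrasting the two write paths (the object, file names, and compression settings below are placeholders, not the exact pandas code):

import bz2
import pickle

import pandas as pd

obj = pd.DataFrame({"a": range(10)})

# Direct path: the pickler streams its frames straight into the open handle.
with open("direct.pkl", "wb") as handle:
    pickle.dump(obj, handle, protocol=5)

# Fallback path (the bz2/xz branch above): the entire pickle is first built
# as one bytes object in memory and only then written out -- an extra
# full-size copy of the serialized data.
with bz2.open("fallback.pkl.bz2", "wb") as handle:
    handle.write(pickle.dumps(obj, protocol=5))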

Describe the solution you'd like

The compressor file objects fail because they expect something bytes-like, though not something as general as an arbitrary memoryview: in particular, they assume the data is 1-D, contiguous, and of uint8 type. With PickleBuffer's raw method it is straightforward to construct a memoryview in exactly that format without copying. If the buffer is non-contiguous (not sure how often that would come up here), raw raises a BufferError, which can be mitigated by falling back to an in-memory copy when that exception occurs.
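
As a quick illustration of that behavior (the NumPy array is just a stand-in for any buffer-exporting object; the strided slice supplies a non-contiguous case):

import numpy as np
from pickle import PickleBuffer

arr = np.arange(8, dtype="int64")

# Contiguous buffer: raw() exposes the same memory as a 1-D, C-contiguous,
# uint8-formatted memoryview with no copy.
mv = PickleBuffer(arr).raw()
assert mv.ndim == 1 and mv.format == "B" and mv.contiguous
assert mv.nbytes == arr.nbytes

# Non-contiguous buffer: raw() raises, so fall back to an in-memory copy.
try:
    PickleBuffer(arr[::2]).raw()
except BufferError:
    data = bytes(PickleBuffer(arr[::2]))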

Given this, one option would be to wrap the write method of the compressor file objects that need this handling. For example, the following would work on Python 3.8+.

from bz2 import BZ2File as _BZ2File
from pickle import PickleBuffer


class BZ2File(_BZ2File):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super().write(b)
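
A quick sanity check of the coercion in isolation, using the class and imports above (the path is a placeholder; in practice it would be pickle's protocol-5 machinery handing buffers to write):

payload = bytearray(b"some pickled bytes")

with BZ2File("demo.pkl.bz2", "wb") as f:
    f.write(PickleBuffer(payload))  # goes through raw(), no intermediate bytes copy

with _BZ2File("demo.pkl.bz2", "rb") as f:
    assert f.read() == payload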

Potentially this could live alongside other custom file objects like _BytesZipFile in pandas.io.common (albeit as private objects), though there may be other places that would make sense.

API breaking implications

NA

For the affected compression formats, data pickled with protocol 5 is already serialized entirely in memory before being written to the file; this change would only remove that extra copy before the write. Data written before and after this change should still be readable just the same.
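
A round trip like the following (the file name is a placeholder) behaves the same either way, since only how the bytes are handed to the file object changes:

import pandas as pd

df = pd.DataFrame({"a": range(10)})

df.to_pickle("frame.pkl.bz2", compression="bz2", protocol=5)
assert pd.read_pickle("frame.pkl.bz2", compression="bz2").equals(df)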

Describe alternatives you've considered

Ultimately it would be preferable to have this fixed upstream, and to some extent that has already happened ( python/cpython#88605 ). However, the fix does not cover Python 3.8, which pandas still supports, and users on a 3.9 patch release earlier than 3.9.6 also lack it. In these cases it may make sense to carry this workaround, providing the improved efficiency while protecting against the issue.
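
A hypothetical gate for applying the workaround, assuming the upstream fix is present in 3.9.6 and every later release:

import sys

# Only interpreters without the upstream fix (python/cpython#88605) would
# need the wrapped write() above.
NEEDS_PICKLEBUFFER_WRITE_WORKAROUND = sys.version_info < (3, 9, 6)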

Additional context

NA (included above)
