ENH: Always write directly to output in to_pickle #46747

Closed
@jakirkham

Description

Is your feature request related to a problem?

Currently to_pickle has the following code to work around #39002. The result is an extra in-memory copy with pickle protocol 5 for some compression formats.

pandas/pandas/io/pickle.py, lines 104 to 109 at c90294d:

if handles.compression["method"] in ("bz2", "xz") and protocol >= 5:
# some weird TypeError GH#39002 with pickle 5: fallback to letting
# pickle create the entire object and then write it to the buffer.
# "zip" would also be here if pandas.io.common._BytesZipFile
# wouldn't buffer write calls
handles.handle.write(pickle.dumps(obj, protocol=protocol))
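
For illustration, a minimal sketch contrasting the two write paths (the object, file names, and compression settings below are placeholders, not the exact pandas code):

import bz2
import pickle

import pandas as pd

obj = pd.DataFrame({"a": range(10)})

# Direct path: the pickler streams its frames straight into the open handle.
with open("direct.pkl", "wb") as handle:
    pickle.dump(obj, handle, protocol=5)

# Fallback path (the bz2/xz branch above): the entire pickle is first built
# as one bytes object in memory and only then written out -- an extra
# full-size copy of the serialized data.
with bz2.open("fallback.pkl.bz2", "wb") as handle:
    handle.write(pickle.dumps(obj, protocol=5))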

Describe the solution you'd like

The compressor file objects fail because they expect something bytes-like, though not something as general as an arbitrary memoryview: in particular, they assume the data is 1-D, contiguous, and of uint8 type. With PickleBuffer's raw method it is straightforward to construct a memoryview in exactly that format without copying. If the buffer is non-contiguous (not sure how often that would come up here), raw raises a BufferError, which can be mitigated by falling back to an in-memory copy when that exception occurs.
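
As a quick illustration of that behavior (the NumPy array is just a stand-in for any buffer-exporting object; the strided slice supplies a non-contiguous case):

import numpy as np
from pickle import PickleBuffer

arr = np.arange(8, dtype="int64")

# Contiguous buffer: raw() exposes the same memory as a 1-D, C-contiguous,
# uint8-formatted memoryview with no copy.
mv = PickleBuffer(arr).raw()
assert mv.ndim == 1 and mv.format == "B" and mv.contiguous
assert mv.nbytes == arr.nbytes

# Non-contiguous buffer: raw() raises, so fall back to an in-memory copy.
try:
    PickleBuffer(arr[::2]).raw()
except BufferError:
    data = bytes(PickleBuffer(arr[::2]))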

Given this, one option would be to wrap the write method of the compressor file objects that need this handling. For example, the following would work on Python 3.8+.

from bz2 import BZ2File as _BZ2File
from pickle import PickleBuffer


class BZ2File(_BZ2File):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super().write(b)
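
A quick sanity check of the coercion in isolation, using the class and imports above (the path is a placeholder; in practice it would be pickle's protocol-5 machinery handing buffers to write):

payload = bytearray(b"some pickled bytes")

with BZ2File("demo.pkl.bz2", "wb") as f:
    f.write(PickleBuffer(payload))  # goes through raw(), no intermediate bytes copy

with _BZ2File("demo.pkl.bz2", "rb") as f:
    assert f.read() == payload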

Potentially this could live alongside other custom file objects like _BytesZipFile in pandas.io.common (albeit as private objects), though there may be other places that would make sense.

API breaking implications

NA

For the affected compression formats, data pickled with protocol 5 is already serialized entirely in memory before being written to the file; this change would only remove that extra copy before the write. Data written before and after this change should still be readable just the same.
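
A round trip like the following (the file name is a placeholder) behaves the same either way, since only how the bytes are handed to the file object changes:

import pandas as pd

df = pd.DataFrame({"a": range(10)})

df.to_pickle("frame.pkl.bz2", compression="bz2", protocol=5)
assert pd.read_pickle("frame.pkl.bz2", compression="bz2").equals(df)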

Describe alternatives you've considered

Ultimately it would be preferable to have this fixed upstream, and to some extent that has already happened ( python/cpython#88605 ). However, the fix does not cover Python 3.8, which pandas still supports, and users on a 3.9 patch release earlier than 3.9.6 also lack it. In these cases it may make sense to carry this workaround, providing the improved efficiency while protecting against the issue.
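
A hypothetical gate for applying the workaround, assuming the upstream fix is present in 3.9.6 and every later release:

import sys

# Only interpreters without the upstream fix (python/cpython#88605) would
# need the wrapped write() above.
NEEDS_PICKLEBUFFER_WRITE_WORKAROUND = sys.version_info < (3, 9, 6)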

Additional context

NA (included above)
