Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
See the issue description below (two programs are required).
Issue Description
Pandas 2.0 cannot read Pandas 1.5 pickle files when compression is used.
According to the documentation of read_pickle:
read_pickle is only guaranteed to be backwards compatible to pandas 0.20.3 provided the object was serialized with to_pickle.
From that, I understand that files written with 1.5.3 are supposed to be readable with 2.0.3, but the compatibility breaks when using a zstandard-compressed pickle file.
To reproduce, you need two programs in different venvs.
- venv with
pip install pandas==1.5.3 zstandard
- use this to write the pickle files
import pandas as pd
df = pd.DataFrame(
data={"i": 1, "d": pd.Timestamp("2000-01-01")}, index=pd.Index([10, 12])
)
df.to_pickle("bug.pkl.zst")
df.to_pickle("ok.pkl")
- venv with
pip install pandas==2.0.3 zstandard
import pandas as pd
df1 = pd.read_pickle("ok.pkl")
df2 = pd.read_pickle("bug.pkl.zst") # <-- exception here
pd.testing.assert_frame_equal(df1, df2)
The marked line causes an exception: OSError: cannot seek zstd decompression stream backwards (happens when handling ModuleNotFoundError: No module named 'pandas.core.indexes.numeric').
The problem is that pandas goes into compatibility mode when it encounters the "old" type from pandas 1.5.3 and tries to seek the stream back to the start, but the stream cannot do that (zstd decompression streams can only seek forward, not backwards). I have been using zstandard since support for it was added in 1.4.0 because it is such a good and fast compression method. Now my problem is that I can no longer read the files. The offending line is here: https://github.com/pandas-dev/pandas/blob/v2.0.3/pandas/compat/pickle_compat.py#L210
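A possible workaround, as a minimal sketch only: decompress the whole file into memory first and hand pandas a seekable buffer, so the compatibility fallback can rewind it. The helper name read_pickle_via_buffer and its opener parameter are my own illustration, not pandas API. I am assuming that zstandard.open is available in the installed zstandard version and that the file fits in memory.

```python
import io

import pandas as pd


def read_pickle_via_buffer(path, opener=None):
    """Decompress `path` fully into memory, then unpickle from a seekable buffer.

    `opener` is a hypothetical hook: any callable that opens a compressed
    file for binary reading, e.g. gzip.open. It defaults to zstandard.open.
    """
    if opener is None:
        import zstandard  # only needed for .zst files

        opener = zstandard.open
    with opener(path, "rb") as fh:
        raw = fh.read()
    # io.BytesIO supports seek(0), unlike the zstd decompression stream,
    # so a fallback loader that rewinds the handle can do so.
    return pd.read_pickle(io.BytesIO(raw), compression=None)
```

In the pandas 2.0.3 venv this would be called as read_pickle_via_buffer("bug.pkl.zst") in place of pd.read_pickle("bug.pkl.zst").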
Also (probably a separate issue), I have noticed that loading a 1.5.3 pickle file in 2.0.3 is very slow. I have a DataFrame here (300000 rows × 30 columns, but only 5 MB as a .zst file and 70 MB as an uncompressed pkl, so I consider it small) which loads in under 300 ms when written and read with the same pandas version. When it is written with 1.5.3, reading it with 2.0.3 takes more than 4 seconds, more than 10x slower, which is too slow for me. That came as a very big surprise. I am not sure how best to transition from 1.5 to 2.0.
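To compare load times across the two venvs, something like the following could be used. It is a rough sketch, not a proper benchmark: the helper name time_read_pickle and the repeats parameter are made up for illustration, and it only measures wall-clock time for pd.read_pickle.

```python
import time

import pandas as pd


def time_read_pickle(path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` loads of `path`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        pd.read_pickle(path)
        best = min(best, time.perf_counter() - start)
    return best
```

Running time_read_pickle("ok.pkl") in each venv on the same file gives numbers that can be compared directly.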
Expected Behavior
There should not be an exception: read_pickle should load the zstandard-compressed file written by pandas 1.5.3.