BUG: reading zst compressed pickle from 1.5.3 fails #54753

Open
@behrenhoff

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

See the issue description below (two programs are required).

Issue Description

pandas 2.0.3 cannot read pickle files written by pandas 1.5.3 when zstandard compression is used.

According to the documentation of read_pickle:

read_pickle is only guaranteed to be backwards compatible to pandas 0.20.3 provided the object was serialized with to_pickle.

From that, I understand it is supposed to be compatible from 1.5.3 -> 2.0.3 but the compatibility breaks when using a zstandard compressed pickle file.

To reproduce, you need two programs in different venvs.

  1. venv with pip install pandas==1.5.3 zstandard - use this to write the pickle files:

     import pandas as pd
     df = pd.DataFrame(
         data={"i": 1, "d": pd.Timestamp("2000-01-01")}, index=pd.Index([10, 12])
     )
     df.to_pickle("bug.pkl.zst")
     df.to_pickle("ok.pkl")

  2. venv with pip install pandas==2.0.3 zstandard - use this to read the pickle files:

     import pandas as pd
     df1 = pd.read_pickle("ok.pkl")
     df2 = pd.read_pickle("bug.pkl.zst") # <-- exception here

     pd.testing.assert_frame_equal(df1, df2)

The marked line raises OSError: cannot seek zstd decompression stream backwards. The error occurs while pandas is handling ModuleNotFoundError: No module named 'pandas.core.indexes.numeric'.

The problem is that pandas falls back to its compatibility mode when it encounters the "old" type from pandas 1.5.3 and then tries to seek backwards in the stream, but the stream does not support that (zstd decompression streams can only seek forward, not backward). I have been using zstandard ever since pandas added support for it in 1.4.0 because it is such a good and fast compression method. Now my problem is that I can no longer read those files. The offending line is here: https://github.com/pandas-dev/pandas/blob/v2.0.3/pandas/compat/pickle_compat.py#L210
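A possible workaround, as a minimal sketch rather than a fix in pandas itself: decompress the file into an in-memory buffer first, so that read_pickle receives a stream that can seek backwards. The helper names read_all_into_seekable and read_zst_pickle are hypothetical. The sketch assumes the whole decompressed pickle fits in memory and that the compatibility path succeeds once the stream is seekable, as it does for the uncompressed ok.pkl above.

```python
import io


def read_all_into_seekable(stream, chunk_size=1 << 20):
    """Copy a forward-only binary stream into a seekable in-memory buffer."""
    buf = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read marks the end of the stream
            break
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller starts reading at the beginning
    return buf


def read_zst_pickle(path):
    """Hypothetical helper: load a zstd-compressed pickle via a seekable buffer."""
    # Third-party imports are kept local so the helper above stays stdlib-only.
    import pandas as pd
    import zstandard

    with zstandard.open(path, "rb") as reader:
        buf = read_all_into_seekable(reader)
    # The buffer already holds decompressed bytes and supports seek(0),
    # so the fallback code path in pandas is free to rewind it.
    return pd.read_pickle(buf)
```

In the pandas 2.0.3 venv this would be called as df2 = read_zst_pickle("bug.pkl.zst"). The trade-off is that the full decompressed pickle is held in memory once in addition to the resulting DataFrame.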

Also (probably a separate issue), I have noticed that loading a 1.5.3 pickle file in 2.0.3 is very slow. I have a DataFrame here (300000 rows × 30 columns, but only 5 MB as a .zst file and 70 MB as an uncompressed pkl, so I consider it small) which loads in under 300 ms when written and read with the same pandas version. When it is written with 1.5.3 and read with 2.0.3, loading takes more than 4 seconds, which is more than 10x slower and too slow for me. That came as a very big surprise. I am not sure how best to transition from 1.5 to 2.0.
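To compare the load times, something like the following sketch could be run in each venv. The helper names best_of and time_read_pickle are hypothetical, and the file name in the usage note is a placeholder for your own file.

```python
import time


def best_of(fn, repeat=3):
    """Return the fastest wall-clock time in seconds over `repeat` calls of fn()."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def time_read_pickle(path, repeat=3):
    """Hypothetical helper: time pd.read_pickle on an uncompressed pickle file."""
    import pandas as pd  # local import keeps best_of usable without pandas

    return best_of(lambda: pd.read_pickle(path), repeat=repeat)
```

For example, print(time_read_pickle("my_frame.pkl")) in the 2.0.3 venv, once for a file written by 1.5.3 and once for the same data written by 2.0.3. Taking the minimum of several runs reduces the influence of disk caching on the first read.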

Expected Behavior

There should not be an exception.

Installed Versions

pandas : 2.0.3
numpy : 1.25.2
zstandard : 0.21.0
