BUG: read_feather doesn't work when columns are shuffled #33878

Closed
@Benjamin15

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
    "C": [True, False]
})
df.to_feather("./test_data.feather")

df2 = pd.read_feather("./test_data.feather", columns=['B', 'A'])

Error message

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-4-1e23cf201732> in <module>
     15 
     16 
---> 17 df2 = pd.read_feather("/misc/labshare/datasets3/rating/data/preprocessing/tests/test_data.feather", columns=['B', 'A'])

~/.conda/envs/venv/lib/python3.6/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads)
    101     path = stringify_path(path)
    102 
--> 103     return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
    206     """
    207     _check_pandas_version()
--> 208     return (read_table(source, columns=columns, memory_map=memory_map)
    209             .to_pandas(use_threads=use_threads))
    210 

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240 
    241     column_type_names = [t.__name__ for t in column_types]

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
B: string
A: int64
vs
A: int64
B: string

Problem description

We don't always know the order in which the columns are stored in the file.
The error started appearing after we updated pyarrow to 0.17.0.

This line works fine:

df2 = pd.read_feather("./test_data.feather", columns=['A', 'B'])

Should the fix be applied here in pandas or in the pyarrow repository?
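
In the meantime, a possible workaround is sketched below. It assumes, as the traceback suggests, that pyarrow 0.17.0 returns the selected columns in file order and only fails when the requested order differs. The helper names `in_file_order` and `read_feather_any_order` and the `file_columns` parameter are hypothetical and not part of pandas or pyarrow. When the file order is unknown, the sketch falls back to reading every column and selecting afterwards.

```python
def in_file_order(file_columns, wanted):
    """Return the wanted column names, sorted into the order they have in the file."""
    missing = [c for c in wanted if c not in file_columns]
    if missing:
        raise KeyError("columns not found in file: {}".format(missing))
    wanted_set = set(wanted)
    return [c for c in file_columns if c in wanted_set]


def read_feather_any_order(path, columns, file_columns=None):
    """Read `columns` from a feather file and return them in the requested order."""
    import pandas as pd  # imported lazily so in_file_order works without pandas

    if file_columns is None:
        # Order unknown: read every column, then select. Simple, but reads more data.
        df = pd.read_feather(path)
    else:
        # Order known: request the subset in file order, the case that the
        # traceback suggests still works with pyarrow 0.17.0.
        df = pd.read_feather(path, columns=in_file_order(file_columns, columns))
    # Reorder (and subset) on the pandas side.
    return df[list(columns)]
```

With the sample above, `read_feather_any_order("./test_data.feather", ["B", "A"])` would read the whole file and then return only columns B and A, in that order.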

Expected Output

df2 = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
})

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : 0.29.15
pytest : 5.3.2
hypothesis : 5.5.4
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.2.3
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0
