Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
df = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
    "C": [True, False],
})
df.to_feather("./test_data.feather")
df2 = pd.read_feather("./test_data.feather", columns=['B', 'A'])
Error message
ArrowInvalid Traceback (most recent call last)
<ipython-input-4-1e23cf201732> in <module>
15
16
---> 17 df2 = pd.read_feather("/misc/labshare/datasets3/rating/data/preprocessing/tests/test_data.feather", columns=['B', 'A'])
~/.conda/envs/venv/lib/python3.6/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads)
101 path = stringify_path(path)
102
--> 103 return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))
~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
206 """
207 _check_pandas_version()
--> 208 return (read_table(source, columns=columns, memory_map=memory_map)
209 .to_pandas(use_threads=use_threads))
210
~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
237 return reader.read_indices(columns)
238 elif all(map(lambda t: t == str, column_types)):
--> 239 return reader.read_names(columns)
240
241 column_type_names = [t.__name__ for t in column_types]
~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()
~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Schema at index 0 was different:
B: string
A: int64
vs
A: int64
B: string
Problem description
We don't always know the order in which the columns are stored in the file.
The issue appeared after we updated pyarrow to 0.17.0.
Before the update, this line worked fine:
df2 = pd.read_feather("./test_data.feather", columns=['B', 'A'])
Should the fix be applied here or in the pyarrow repository?
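Until this is fixed, one possible stopgap is to read the file without `columns` and select the columns afterwards in pandas. The following is only a minimal sketch: `read_feather_any_order` is a hypothetical helper name, not part of pandas or pyarrow, and the `reader` parameter is an assumption added so the function can be exercised without a feather file.

```python
import pandas as pd


def read_feather_any_order(path, columns, reader=pd.read_feather):
    """Hypothetical helper: read every column, then select in the requested order."""
    # Reading without `columns` avoids handing an out-of-order
    # column list down to pyarrow.
    df = reader(path)
    # Plain pandas selection returns the columns in the order asked for.
    return df[list(columns)]
```

It would be called as `read_feather_any_order("./test_data.feather", ["B", "A"])`. The trade-off is that every column is loaded before the selection, so this costs more I/O and memory than a working `columns=` argument would.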
Expected Output
df2 = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
})
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : 0.29.15
pytest : 5.3.2
hypothesis : 5.5.4
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.2.3
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0