BUG: lib.is_all_arraylike should not assume a _data attribute means "array-like". #46030

Open
@zpincus

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

class A:
    _dat = 5  # attribute is not named _data

# Succeeds: A instances are used as ordinary index labels.
pd.DataFrame(data={1: range(10), 2: range(10)}, index=[A() for _ in range(10)])

class B:
    _data = 5

# Fails: B instances are assumed array-like because they have a _data attribute,
# so pandas tries (and fails) to build a MultiIndex from them.
pd.DataFrame(data={1: range(10), 2: range(10)}, index=[B() for _ in range(10)])

Issue Description

In pandas.core.indexes.base, ensure_index() calls pandas._libs.lib.is_all_arraylike to determine whether a list of entries should be treated as a regular Index or as a MultiIndex.

However, this function incorrectly assumes that any instance with a _data attribute is "array-like". This is highly nonstandard. It would seem better to look for something canonical such as __array_interface__ or similar. Just looking for a _data attribute is both nonspecific (lots of classes have _data attributes) and misses any array-like instances that don't happen to have a _data attribute. (Even NumPy arrays don't have _data: is_all_arraylike has to special-case actual arrays.)

The result is that instances of non-array-like classes that happen to have a _data attribute cannot be used as pandas index entries.
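The check described above can be paraphrased in plain Python to show why class B is misclassified. This is only a sketch: looks_arraylike_by_data_attr is a hypothetical name, the code is not the pandas source, and the special case for real NumPy arrays is left out so the example needs nothing beyond the standard library.

```python
def looks_arraylike_by_data_attr(values):
    """Hypothetical paraphrase of the check described in this issue.

    An entry counts as "array-like" if it is a list or merely has a
    ``_data`` attribute. (The NumPy ndarray special case is omitted.)
    """
    return all(isinstance(v, list) or hasattr(v, "_data") for v in values)


class A:
    _dat = 5  # no _data attribute


class B:
    _data = 5  # unrelated attribute that happens to be called _data


a_flag = looks_arraylike_by_data_attr([A() for _ in range(3)])
b_flag = looks_arraylike_by_data_attr([B() for _ in range(3)])

print(a_flag)  # A instances are not treated as array-like
print(b_flag)  # B instances are treated as array-like
```

Under this paraphrase, a list of B instances passes the check purely because of the attribute name, which is what sends ensure_index() down the MultiIndex path.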

Here's the source on main:
https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/lib.pyx#L762

Expected Behavior

pandas._libs.lib.is_all_arraylike should correctly report array-like things as array-like, and should report things like class B above as not array-like.
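One way to picture the expected behavior is a check keyed on __array_interface__, as suggested in the description. The sketch below is an assumption-laden illustration, not a proposed patch: is_all_arraylike_strict and FakeArray are hypothetical names, and the sketch does not consider how pandas' own array types would need to be recognized.

```python
def is_all_arraylike_strict(values):
    """Hypothetical stricter check: lists, or objects advertising
    ``__array_interface__``, count as array-like."""
    return len(values) > 0 and all(
        isinstance(v, list) or hasattr(v, "__array_interface__")
        for v in values
    )


class B:
    _data = 5  # has _data but is not array-like


class FakeArray:
    """Stand-in that only advertises the attribute; not a real array."""

    __array_interface__ = {"shape": (3,), "typestr": "<i8", "version": 3}


b_result = is_all_arraylike_strict([B(), B()])
array_result = is_all_arraylike_strict([FakeArray(), [1, 2, 3]])
mixed_result = is_all_arraylike_strict([FakeArray(), B()])

print(b_result, array_result, mixed_result)
```

With a check of this shape, objects like B fall through to the regular Index path, while anything exposing the array interface is still grouped with lists.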

Installed Versions

My versions don't matter (they're listed below anyway), as the offending code is on main:
https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/lib.pyx#L762

pd.show_versions()

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-18-cloud-amd64
Version : #1 SMP Debian 4.19.208-1 (2021-09-29)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.27.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.15.0
pyarrow : 4.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.28
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.54.0

Metadata

    Labels

    Bug; Internals (related to non-user accessible pandas implementation); Needs Discussion (requires discussion from core team before further action)
