BUG: pandas is using absolute from pandas._libs cimport and is shipping broken private .pxd files #51875

Open
@rgommers

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Put this in `tmp.pyx`:

# In Cython code - any use of `_libs.khash` will trigger this
from pandas._libs.khash cimport kh_int64_t

Then run `cython tmp.pyx`. That will result in:

Error compiling Cython file:
------------------------------------------------------------
...
    bint kh_exist_strbox(kh_strbox_t*, khiter_t) nogil

    khuint_t kh_needed_n_buckets(khuint_t element_n) nogil


include "khash_for_primitive_helper.pxi"
^
------------------------------------------------------------

/home/rgommers/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/_libs/khash.pxd:129:0: 'khash_for_primitive_helper.pxi' not found

Issue Description

I found this when following up on #49115 (comment):

Cython.Compiler.Errors.InternalError: Internal compiler error: 'khash_for_primitive_helper.pxi' not found

There are a couple of related issues that interact here:

  1. pandas is shipping lots of files in wheels that should not be there, in particular .pxd and .pyx files in pandas/_libs.
  2. Use of absolute cimports, which should probably be relative.
  3. Use of include "<name>.pxi" in .pxd files. This should be replaced by shared declarations in a common .pxd file (see the warning in http://docs.cython.org/en/latest/src/userguide/language_basics.html#the-include-statement-and-include-files).

For (1), if you download any pandas 1.5.3 wheel, you'll see in pandas/_libs:

khash.pxd
khash_for_primitive_helper.pxi.in

Notably, khash.pxd contains include "khash_for_primitive_helper.pxi", and that file is not present (only the .pxi.in template is). So pandas is shipping a broken private .pxd file, which is then picked up during the build in gh-49115 because of absolute from pandas._libs.khash cimport ... statements inside pandas itself.
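To check an existing environment for this kind of broken .pxd, something like the following sketch could be used. The function name find_missing_includes is made up for illustration, and the check is deliberately simplified: it only looks for the included file next to the .pxd, whereas Cython may also search additional include directories.

```python
import re
from pathlib import Path

# Matches Cython include statements such as: include "name.pxi"
INCLUDE_RE = re.compile(r'^\s*include\s+["\']([^"\']+)["\']', re.MULTILINE)


def find_missing_includes(package_dir):
    """Return (pxd_path, include_name) pairs whose include target is absent.

    Hypothetical helper: only checks the directory of the including file,
    which is an assumption and not Cython's full lookup logic.
    """
    missing = []
    for pxd in sorted(Path(package_dir).rglob("*.pxd")):
        text = pxd.read_text(encoding="utf-8", errors="replace")
        for name in INCLUDE_RE.findall(text):
            if not (pxd.parent / name).exists():
                missing.append((str(pxd), name))
    return missing
```

Pointing it at the directory of an installed pandas (for example the parent directory of pandas.__file__) should list khash.pxd together with the missing .pxi name if the installation is affected.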

That particular issue probably shows up in the Meson build but not during the setup.py-based build, because in the latter the .pxi file is generated in-place rather than in the build dir. However, as my reproducer above shows, this is a bit of a house of cards, because the absolute from pandas._libs cimports are actually broken.

Expected Behavior

The expected behavior is that the .pxd files aren't shipped, so that anyone trying to access private .pxd files gets a clear error. This will be automatically fixed when the Meson build is merged. However, that still leaves potential issues in any environments that already have pandas installed.
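Whether a given wheel still ships these files can be checked without installing it, since a wheel is a zip archive. A minimal sketch, where the helper name cython_sources_in_wheel and the suffix list are assumptions for illustration:

```python
import zipfile

# File suffixes treated as Cython sources (an assumption for this sketch)
CYTHON_SUFFIXES = (".pxd", ".pyx", ".pxi", ".pxi.in")


def cython_sources_in_wheel(wheel_path, prefix="pandas/_libs/"):
    """List archive members under `prefix` that look like Cython sources."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            if name.startswith(prefix) and name.endswith(CYTHON_SUFFIXES)
        )
```

An empty result for a wheel would match the expected behavior described above.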

My suggestion is to:

  • Use relative cimports for accessing anything within pandas (needs testing, because Cython's cimport mechanism is very fragile all around).
  • Get rid of the .pxi.in and replace it with the recommended .pxd method.
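As a starting point for the first suggestion, the absolute cimports in the source tree could be located with a small script. This is only a sketch: the name find_absolute_cimports is hypothetical, and the pattern only covers the from <package>... cimport form, not every way a cimport can be written.

```python
import re
from pathlib import Path

# File suffixes to scan (an assumption for this sketch)
CYTHON_SUFFIXES = (".pyx", ".pxd", ".pxi", ".pxi.in")


def find_absolute_cimports(source_dir, package="pandas"):
    """Return (path, line_number, line) for each `from <package>... cimport` line."""
    pattern = re.compile(
        rf"^\s*from\s+{re.escape(package)}(\.\w+)*\s+cimport\b"
    )
    hits = []
    for path in sorted(Path(source_dir).rglob("*")):
        if not (path.is_file() and path.name.endswith(CYTHON_SUFFIXES)):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.match(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Each hit is a candidate for conversion to a relative cimport, which would still need to be tested individually as noted above.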

Installed Versions

INSTALLED VERSIONS

commit : 2e218d1
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.1-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Sun, 26 Feb 2023 03:39:23 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.1
hypothesis : 6.68.2
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.8
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : 1.3.6
brotli :
fastparquet : 2023.2.0
fsspec : 2023.1.0
gcsfs : 2023.1.0
matplotlib : 3.6.3
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.1.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.4
tables : 3.7.0
tabulate : 0.9.0
xarray : 2023.1.0
xlrd : 2.0.1
xlwt : None
zstandard : 0.19.0
tzdata : None

Labels: Bug, Internals (Related to non-user accessible pandas implementation)
