BUG: pd.cut(df, bins=N) current behavior is incorrect, returns float where int is the limit #47996

Open
@FlorinAndrei

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

https://github.com/FlorinAndrei/misc/blob/master/HeartDisease.csv

import pandas as pd
hd = pd.read_csv("HeartDisease.csv")
pd.cut(hd["Age"], bins=3, include_lowest=True)

Issue Description

The lowest of the three bins created is (28.951, 45.0]. This is incorrect in two ways.

First, I passed include_lowest=True, so I expect a left-inclusive bin there. That bin is not left-inclusive.

Second, the minimum value in that column is 29, not 28.951. That float is an artifact of the library and does not exist in the data.
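For context, the 28.951 edge matches how pd.cut is documented to behave when bins is an integer: the range of the data is extended by 0.1% so that the extreme value lands inside a bin. The sketch below reproduces this on a small made-up Series. The values are illustrative stand-ins, not the HeartDisease.csv data; only the minimum of 29 comes from this report, and the maximum of 77 is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for hd["Age"]: made-up values with min 29 and max 77.
ages = pd.Series([29, 34, 45, 52, 61, 77])

binned, edges = pd.cut(ages, bins=3, include_lowest=True, retbins=True)

# With an integer `bins`, the edges are equally spaced over [min, max],
# and the lowest edge is then moved down by 0.1% of the range.
span = ages.max() - ages.min()
expected_low = ages.min() - 0.001 * span

print(edges)                               # computed bin edges
print(binned.cat.categories[0])            # label of the lowest bin
print(np.isclose(edges[0], expected_low))  # compares edge with min - 0.1% of range
```

In this sketch the value 29 is still assigned to the lowest bin. The lowered edge changes how the bin is labelled, not which rows it contains.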

One workaround you can find online is this:

_, edges = pd.cut(hd["Age"], bins=3, include_lowest=True, retbins=True)
edges_r = [round(x) for x in edges]
pd.cut(hd["Age"], bins=edges_r)

But this is pointless and annoying. The library should simply return the true minimum value.
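If integer, left-inclusive edges are the goal, one option may be to pass explicit edges with right=False instead of rounding the computed ones. This is only a sketch under assumptions: the Series is a made-up stand-in for hd["Age"], and the edges are picked by hand, with the upper edge set to max + 1 because right=False leaves the last bin open on the right. It yields [29, 45) rather than [29, 45], since the intervals pd.cut returns all share one closed side.

```python
import pandas as pd

# Illustrative stand-in for hd["Age"]: made-up values with min 29 and max 77.
ages = pd.Series([29, 34, 45, 52, 61, 77])

# Hand-picked integer edges; the upper edge is max + 1 so that 77
# still falls inside the last left-closed bin.
edges = [29, 45, 61, int(ages.max()) + 1]

left_closed = pd.cut(ages, bins=edges, right=False)
print(left_closed.cat.categories)

# For comparison: integer edges starting at the minimum, with the default
# right=True and include_lowest=False, leave the value 29 outside every bin.
right_closed = pd.cut(ages, bins=[29, 45, 61, 77])
print(right_closed.isna().sum())
```

The comparison at the end shows why explicit edges need care with the default settings: a value equal to the first edge is not assigned to any bin.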

Expected Behavior

The bin I expect is [29.0, 45.0]. I would also settle for [29, 45].

Installed Versions

pd.show_versions()
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util\_print_versions.py", line 109, in show_versions
    deps = _get_dependency_info()
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util\_print_versions.py", line 88, in _get_dependency_info
    mod = import_optional_dependency(modname, errors="ignore")
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\compat\_optional.py", line 138, in import_optional_dependency
    module = importlib.import_module(name)
  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 1050, in _gcd_import
  File "", line 1027, in _find_and_load
  File "", line 1002, in _find_and_load_unlocked
  File "", line 945, in _find_spec
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 79, in find_spec
    return method()
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 100, in spec_for_pip
    if self.pip_imported_during_build():
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 111, in pip_imported_during_build
    return any(
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\_distutils_hack\__init__.py", line 112, in
    frame.f_globals['__file__'].endswith('setup.py')
KeyError: '__file__'

I do have the latest pandas installed (1.4.3), but a separate bug in show_versions() now prevents me from printing that info.

Python 3.10.6

NumPy 1.22.4

Windows 10

Jupyter Notebook
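When pd.show_versions() fails like this, the core version details can still be collected by hand. A minimal sketch, using only version attributes and the standard-library platform module; it does not cover the optional dependencies that show_versions() normally lists.

```python
import platform

import numpy as np
import pandas as pd

# Gather the basics that show_versions() would otherwise report.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("python:", platform.python_version())
print("os:", platform.platform())
```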
