ENH: numpy histogram bin edges in cut (GH 14627) #23567

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -236,6 +236,7 @@ Other Enhancements
- Compatibility with Matplotlib 3.0 (:issue:`22790`).
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
- :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
- :func:`~cut` `bins` kwarg now accepts a string, which is dispatched to `numpy.histogram_bin_edges`. (:issue:`14627`)

.. _whatsnew_0240.api_breaking:

42 changes: 36 additions & 6 deletions pandas/core/reshape/tile.py
@@ -13,6 +13,8 @@
is_scalar, is_timedelta64_dtype)
from pandas.core.dtypes.missing import isna

from pandas.compat import string_types

from pandas import (
Categorical, Index, Interval, IntervalIndex, Series, Timedelta, Timestamp,
to_datetime, to_timedelta)
@@ -35,12 +37,14 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
----------
x : array-like
The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or pandas.IntervalIndex
bins : int, str, sequence of scalars, or pandas.IntervalIndex
The criteria to bin by.

* int : Defines the number of equal-width bins in the range of `x`. The
range of `x` is extended by .1% on each side to include the minimum
and maximum values of `x`.
* str : Bin calculation dispatched to `np.histogram_bin_edges`. See that
documentation for details. (versionadded:: 0.24.0)
Member

sphinx won't detect the version added, should be in its own line starting with ..
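
The layout the reviewer asks for can be sketched as follows. This is only an illustration: `cut_sketch` is a hypothetical stand-in, and the exact indentation under the bullet is an assumption about numpydoc-style docstrings, not text taken from the PR. The point is that the `.. versionadded::` directive sits on its own line, preceded by a blank line, instead of inside parentheses in running prose.

```python
def cut_sketch(x, bins):
    """
    Hypothetical stand-in for ``cut``; only the docstring layout matters.

    Parameters
    ----------
    bins : int, str, sequence of scalars, or pandas.IntervalIndex
        The criteria to bin by.

        * str : Bin calculation dispatched to `np.histogram_bin_edges`.

          .. versionadded:: 0.24.0
    """
    raise NotImplementedError("docstring-only sketch")


# The directive now occupies a line of its own, which is what Sphinx parses.
directive_lines = [
    line.strip()
    for line in cut_sketch.__doc__.splitlines()
    if line.strip().startswith(".. versionadded::")
]
```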

* sequence of scalars : Defines the bin edges allowing for non-uniform
width. No extension of the range of `x` is done.
* IntervalIndex : Defines the exact bins to be used.
@@ -83,11 +87,11 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,

* False : returns an ndarray of integers.

bins : numpy.ndarray or IntervalIndex.
bins : numpy.ndarray or IntervalIndex
The computed or specified bins. Only returned when `retbins=True`.
For scalar or sequence `bins`, this is an ndarray with the computed
bins. If set `duplicates=drop`, `bins` will drop non-unique bin. For
an IntervalIndex `bins`, this is equal to `bins`.
For scalar, str, or sequence `bins`, this is an ndarray with the
computed bins. If set `duplicates=drop`, `bins` will drop non-unique
bin. For an IntervalIndex `bins`, this is equal to `bins`.

See Also
--------
@@ -98,6 +102,8 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
Series : One-dimensional array with axis labels (including time series).
pandas.IntervalIndex : Immutable Index implementing an ordered,
sliceable set.
numpy.histogram_bin_edges : Bin calculation dispatched to this method when
`bins` is a string.

Notes
-----
@@ -181,14 +187,38 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0, 1], NaN, (2, 3], (4, 5]]
Categories (3, interval[int64]): [(0, 1] < (2, 3] < (4, 5]]

Passing a string for `bins` dispatches the bin calculation to numpy's
`histogram_bin_edges`. (Starting in version 0.24.)
>>> pd.cut(np.array([0.1, 0.1, 0.2, 0.5, 0.5, 0.9, 1.0]),
Member

leave a blank line before this line

... bins="auto")
... # doctest: +ELLIPSIS
Member

this should go in the previous line, after the code
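
A minimal sketch of the two doctest conventions raised in the comments above: a blank line separates the prose from the `>>>` prompt, and the `# doctest: +ELLIPSIS` directive trails the code it applies to. The function `ellipsis_demo` and its example are hypothetical and unrelated to `cut`; they only exercise the standard-library `doctest` module.

```python
import doctest


def ellipsis_demo():
    """
    Prose sentence introducing the example.

    >>> print(list(range(30)))  # doctest: +ELLIPSIS
    [0, 1, 2, ..., 29]
    """


# Parse the docstring and run its single example directly.
parser = doctest.DocTestParser()
test = parser.get_doctest(ellipsis_demo.__doc__, {}, "ellipsis_demo", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

`results` is a `(failed, attempted)` pair, so a passing run reports zero failures for one attempted example.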

[(0.0991, 0.325], (0.0991, 0.325], (0.0991, 0.325], (0.325, 0.55],
(0.325, 0.55], (0.775, 1.0], (0.775, 1.0]]
Categories (4, interval[float64]): [(0.0991, 0.325] < (0.325, 0.55] <
(0.55, 0.775] < (0.775, 1.0]]
"""
# NOTE: this binning code is changed a bit from histogram for var(x) == 0

# for handling the cut for datetime and timedelta objects
x_is_series, series_index, name, x = _preprocess_for_cut(x)
x, dtype = _coerce_to_type(x)

if not np.iterable(bins):
if isinstance(bins, string_types):
# GH 14627
bins = np.histogram_bin_edges(x, bins)
mn, mx = bins[0], bins[-1]
Contributor

Is this equivalent to doing pd.cut(np.histogram_bin_edges(array, bins))? Why do we do the additional processing / adjustment starting here?
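
One way to explore the question above, offered as a sketch and not as the PR author's rationale: `np.histogram_bin_edges` places its first edge at the minimum of the data, so passing the raw edges to `pd.cut` with right-closed intervals leaves the minimum outside every bin. The snippet reproduces the 0.1% adjustment from the diff by hand on the data used in the PR's test; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([0.1, 0.1, 0.2, 0.5, 0.5, 0.9, 1.0])

# Raw edges from numpy: the first edge coincides with data.min().
edges = np.histogram_bin_edges(data, bins="auto")

# Right-closed bins built from the raw edges exclude the minimum values.
n_missing_raw = int(pd.isna(pd.cut(data, edges)).sum())

# Lower the first edge by 0.1% of the range, as the diff does when right=True.
adjusted = edges.copy()
adjusted[0] -= (edges[-1] - edges[0]) * 0.001
n_missing_adjusted = int(pd.isna(pd.cut(data, adjusted)).sum())
```

On this data the two minimum values go unbinned with the raw edges and are captured once the first edge is lowered. Passing `include_lowest=True` to `pd.cut` is another way to capture them without shifting the edge.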

adj = (mx - mn)
if adj:
adj *= 0.001 # 0.1% of the range
else:
adj = 0.001
if right:
bins[0] -= adj
else:
bins[-1] += adj

elif not np.iterable(bins):
if is_scalar(bins) and bins < 1:
raise ValueError("`bins` should be a positive integer.")

27 changes: 26 additions & 1 deletion pandas/tests/reshape/test_tile.py
@@ -16,7 +16,6 @@
from pandas.core.algorithms import quantile
import pandas.core.reshape.tile as tmod


class TestCut(object):

def test_simple(self):
@@ -37,6 +36,32 @@ def test_bins(self):
tm.assert_almost_equal(bins, np.array([0.1905, 3.36666667,
6.53333333, 9.7]))

def test_str_bins(self):
# GH 14627
data = np.array([0.1, 0.1, 0.2, 0.5, 0.5, 0.9, 1.0])
result, bins_cut = cut(data, bins="auto",
retbins=True)

bins_np = np.histogram_bin_edges(data, "auto")
adj = (bins_np[-1] - bins_np[0]) * 0.001
bins_np[0] -= adj
tm.assert_almost_equal(bins_cut, bins_np)
tm.assert_almost_equal(np.round(bins_cut, 4),
np.array([0.0991, 0.325, 0.55, 0.775, 1.0]))

intervals = IntervalIndex.from_breaks(np.round(bins_np, 4),
closed="right")
expected = Categorical(intervals, ordered=True)
tm.assert_index_equal(result.categories,
expected.categories)


# Test that a `bins` string not recognized by `np.histogram_bin_edges`
# raises a ValueError.
tm.assert_raises_regex(ValueError,
"'*' is not a valid estimator for `bins`",
cut, data, "bad bins")
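
A side note on the pattern used in the assertion above, as a sketch: the expected message is treated as a regular expression, so `'*'` means "zero or more quote characters followed by a quote", not a wildcard between quotes. The check below takes the message from numpy itself and compares that pattern with a stricter, hypothetical alternative (`'.*'`) that spans the quoted estimator name.

```python
import re

import numpy as np

data = np.array([0.1, 0.1, 0.2, 0.5, 0.5, 0.9, 1.0])

# Capture the message numpy raises for an unknown estimator name.
try:
    np.histogram_bin_edges(data, "bad bins")
    message = None
except ValueError as err:
    message = str(err)

# Pattern from the test: matches only from the closing quote onward.
loose = re.search("'*' is not a valid estimator for `bins`", message)

# Stricter alternative (an assumption, not part of the PR): spans the name.
strict = re.search("'.*' is not a valid estimator for `bins`", message)
```

Both patterns match, but only the stricter one covers the quoted name that was rejected.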

def test_right(self):
data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1, 2.575])
result, bins = cut(data, 4, right=True, retbins=True)