-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: numpy histogram bin edges in cut (GH 14627) #23567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: numpy histogram bin edges in cut (GH 14627) #23567
Conversation
Passig a string to `pd.cut` bins kwarg dispatches bin calculation to `np.histogram_bin_edges`.
Hello @bla1089! Thanks for updating the PR.
Comment last updated on January 05, 2019 at 17:33 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a whatsnew entry in other enhancements
Is there anything left to resolve? |
@bla1089 : We recently had a huge doc-formatting change. If you could resolve the merge conflict, that would be fantastic. |
Not even sure how to do that. The site says they're too complex to solve
online.
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Thu, Nov 15, 2018 at 12:06 AM gfyoung ***@***.***> wrote:
Is there anything left to resolve?
@bla1089 <https://github.com/bla1089> : We recently had a huge
doc-formatting change. If you could resolve the merge conflict, that would
be fantastic.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVSgPqqKI7fqCnivPEHOYiR43WLCtks5uvPZagaJpZM4YUzrl>
.
|
Passing a string to the `pd.cut` bins kwarg dispatches bin calculation to `np.histogram_bin_edges`. Closes pandas-devgh-14627.
d7b2e3e
to
79b7145
Compare
That's because it's attempting to do a merge, which is much more difficult when the file that you modified actually got deleted on I just rebased for you right now. In essence, what we needed to do is undo your changes to |
Thank you. I won't pretend to know how to do all that.
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Thu, Nov 15, 2018 at 12:27 AM gfyoung ***@***.***> wrote:
Not even sure how to do that. The site says they're too complex to solve
online.
That's because it's attempting to do a merge, which is much more difficult
when the file that you modified actually got deleted on master. You need
to do something a little more than that.
I just rebased for you right now. In essence, what we needed to do is undo
your changes to v0.24.0.txt and re-apply them to v0.24.0.rst (the new
file), so I squashed your commits into one to facilitate the "undo",
rebased on master, and then added your whatsnew entry to v0.24.0.rst.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVRKwglBPityX1ZejPiJxeibwtplCks5uvPtPgaJpZM4YUzrl>
.
|
Can someone help me interpret the Travis CI and Azure failures? |
@bla1089 : When was |
I believe "auto" was added as an option to `np.histogram` in 1.11
and `histogram_bin_edges` was exposed independent of `histogram` in 1.15.0
or 1.15.1.
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sun, Nov 18, 2018 at 11:59 PM gfyoung ***@***.***> wrote:
Can someone help me interpret the Travis CI and Azure failures?
@bla1089 <https://github.com/bla1089> : When was histogram_bin_edges
introduced to numpy ? It seems that it's breaking on some of the numpy
versions.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVR5xUWdhBkdFge8bamSu32aQOsTAks5uwjqRgaJpZM4YUzrl>
.
|
@bla1089 : Hmm...if you could double check when this was introduced, that would be great. It looks like this feature only applies to certain |
The documentation is unclear. `np.histogram` docstring indicates
`bins="string"` option was added in 1.11, as do the release notes. (
https://docs.scipy.org/doc/numpy-1.11.0/release.html) The 1.15.0 release
notes indicate that `histogram_bin_edges` was refactors into an independent
function, but there is no similar notice in the `histogram_bin_edges`
docstring.
For numpy version compatibility, I can add a try-except statement that will
call `np.histogram` when `np.histogram_bin_edges` raises an
`AttributeError`. Would that be sufficient?
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Mon, Nov 19, 2018 at 9:03 AM gfyoung ***@***.***> wrote:
@bla1089 <https://github.com/bla1089> : Hmm...if you could double check
when this was introduced, that would be great. It looks like this feature
only applies to certain numpy versions.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVW5wyg3FG28bqMVAHGA1H6uTL5Gtks5uwroTgaJpZM4YUzrl>
.
|
@bla1089 : That seems reasonable for now. Give that a shot! |
1) Should bring compatibility down from 1.15 to 1.11.
…hub.com/bla1089/pandas into cut-bins-from-numpy-histogram-bin-edges
@gfyoung can you help me fugue out the status here?
…On Mon, Nov 19, 2018, 11:11 gfyoung ***@***.*** wrote:
@bla1089 <https://github.com/bla1089> : That seems reasonable for now.
Give that a shot!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVf3Y2XHbXuEpNvSP4OYbPFLLXCXuks5uwtgQgaJpZM4YUzrl>
.
|
@bla1089 : Sorry, I'm not sure I fully understand you here. Are you referring to debugging the tests? |
yes
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Tue, Nov 27, 2018 at 12:33 AM gfyoung ***@***.***> wrote:
can you help me fugue out the status here?
@bla1089 <https://github.com/bla1089> : Sorry, I'm not sure I fully
understand you here. Are you referring to debugging the tests?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVR9Lf9cB-FRb_8Aco5W8Srd8hDSsks5uzM6ugaJpZM4YUzrl>
.
|
Based on the Travis failures, I see a couple of things: It looks like your version checking of Not 100% sure off the top of my head how to fix the doc issues though... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of comments about the docstring, it'd be good to render it to see how it's rendered
pandas/core/reshape/tile.py
Outdated
The criteria to bin by. | ||
|
||
* int : Defines the number of equal-width bins in the range of `x`. The | ||
range of `x` is extended by .1% on each side to include the minimum | ||
and maximum values of `x`. | ||
* str : Bin calculaton dispatched to `np.histogram_bin_edges`. See that | ||
documentation for details. (versionadded:: 0.24.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sphinx won't detect the version added, should be in its own line starting with ..
|
||
Passng a string for `bins` dispatches the bin calculation to numpy's | ||
`histogram_bin_edges`. (Starting in version 0.24.) | ||
>>> pd.cut(array([0.1, 0.1, 0.2, 0.5, 0.5, 0.9, 1.0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave a blank line before this line
pandas/core/reshape/tile.py
Outdated
`histogram_bin_edges`. (Starting in version 0.24.) | ||
>>> pd.cut(array([0.1, 0.1, 0.2, 0.5, 0.5, 0.9, 1.0]), | ||
... bins="auto") | ||
... # doctest: +ELLIPSIS` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should go in the previous line, after the code
What's needed to complete the merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some merge conflicts that need to be addressed.
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -233,6 +233,7 @@ Other Enhancements | |||
- :class:`DatetimeIndex` gained :attr:`DatetimeIndex.timetz` attribute. Returns local time with timezone information. (:issue:`21358`) | |||
- :meth:`round`, :meth:`ceil`, and meth:`floor` for :class:`DatetimeIndex` and :class:`Timestamp` now support an ``ambiguous`` argument for handling datetimes that are rounded to ambiguous times (:issue:`18946`) | |||
- :meth:`round`, :meth:`ceil`, and meth:`floor` for :class:`DatetimeIndex` and :class:`Timestamp` now support a ``nonexistent`` argument for handling datetimes that are rounded to nonexistent times. See :ref:`timeseries.timezone_nonexistent` (:issue:`22647`) | |||
- :func: `~cut` `bins` kwarg now accepts a string, which is dispatched to `numpy.histogram_bin_edges`. (:issue:`14627`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No space after :func:.
You can link to the numpy method with :func:`numpy.historgram_bin_edges`.
pandas/core/reshape/tile.py
Outdated
@@ -44,6 +46,10 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | |||
* sequence of scalars : Defines the bin edges allowing for non-uniform | |||
width. No extension of the range of `x` is done. | |||
* IntervalIndex : Defines the exact bins to be used. | |||
* str : Bin calculaton dispatched to `np.histogram_bin_edges`. See that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, link to the numpy function here with :func:numpy.histogram_bin_edges
.
except AttributeError: | ||
cnt, bins = np.histogram(x, bins) | ||
|
||
mn, mx = bins[0], bins[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this equivalent to doing pd.cut(np.histogram_bin_edges(array, bins))
? Why do we do the additional processing / adjustment starting here?
pandas/tests/reshape/test_tile.py
Outdated
# Test that a `bin` string not present in `np.histogram_bin_edges` | ||
# throws a ValueError. | ||
with pytest.raises(ValueError, | ||
match="'*' is not a valid estimator for `bins`", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will potentially fail if NumPy adjusts their error message. Given that it's not really testing anything in pandas, I think we're OK removing the match
. It's probably fine to ensure that a ValueError is raised.
I'm not sure what you're referring to regarding the processing. Are you
asking why the try-except statement? If so, it's because numpy didn't
refactor `histogram_bin_edges` from `histogram` until until v1.15.
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sun, Dec 16, 2018 at 2:58 PM Tom Augspurger ***@***.***> wrote:
***@***.**** commented on this pull request.
I think there are some merge conflicts that need to be addressed.
------------------------------
In doc/source/whatsnew/v0.24.0.rst
<#23567 (comment)>:
> @@ -233,6 +233,7 @@ Other Enhancements
- :class:`DatetimeIndex` gained :attr:`DatetimeIndex.timetz` attribute. Returns local time with timezone information. (:issue:`21358`)
- :meth:`round`, :meth:`ceil`, and meth:`floor` for :class:`DatetimeIndex` and :class:`Timestamp` now support an ``ambiguous`` argument for handling datetimes that are rounded to ambiguous times (:issue:`18946`)
- :meth:`round`, :meth:`ceil`, and meth:`floor` for :class:`DatetimeIndex` and :class:`Timestamp` now support a ``nonexistent`` argument for handling datetimes that are rounded to nonexistent times. See :ref:`timeseries.timezone_nonexistent` (:issue:`22647`)
+- :func: `~cut` `bins` kwarg now accepts a string, which is dispatched to `numpy.histogram_bin_edges`. (:issue:`14627`)
No space after :func:.
You can link to the numpy method with :func:`numpy.historgram_bin_edges`.
------------------------------
In pandas/core/reshape/tile.py
<#23567 (comment)>:
> @@ -44,6 +46,10 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
* sequence of scalars : Defines the bin edges allowing for non-uniform
width. No extension of the range of `x` is done.
* IntervalIndex : Defines the exact bins to be used.
+ * str : Bin calculaton dispatched to `np.histogram_bin_edges`. See that
Again, link to the numpy function here with :func:
numpy.histogram_bin_edges.
------------------------------
In pandas/core/reshape/tile.py
<#23567 (comment)>:
> """
# NOTE: this binning code is changed a bit from histogram for var(x) == 0
# for handling the cut for datetime and timedelta objects
x_is_series, series_index, name, x = _preprocess_for_cut(x)
x, dtype = _coerce_to_type(x)
- if not np.iterable(bins):
+ if isinstance(bins, string_types):
+ # GH 14627
+ # NOTE: when the minimum numpy requirement is
+ # increased to 1.15, the try-except statement
+ # can be removed.
+ try:
+ bins = np.histogram_bin_edges(x, bins)
+ except AttributeError:
+ cnt, bins = np.histogram(x, bins)
+
+ mn, mx = bins[0], bins[-1]
Is this equivalent to doing pd.cut(np.histogram_bin_edges(array, bins))?
Why do we do the additional processing / adjustment starting here?
------------------------------
In pandas/tests/reshape/test_tile.py
<#23567 (comment)>:
> + bins_np[0] -= adj
+ tm.assert_almost_equal(bins_cut, bins_np)
+ tm.assert_almost_equal(np.round(bins_cut, 4),
+ np.array([0.0991, 0.325, 0.55, 0.775, 1.0]))
+
+ intervals = IntervalIndex.from_breaks(np.round(bins_np, 4),
+ closed="right")
+ expected = Categorical(intervals, ordered=True)
+ tm.assert_index_equal(result.categories,
+ expected.categories)
+
+
+ # Test that a `bin` string not present in `np.histogram_bin_edges`
+ # throws a ValueError.
+ with pytest.raises(ValueError,
+ match="'*' is not a valid estimator for `bins`",
This test will potentially fail if NumPy adjusts their error message.
Given that it's not really testing anything in pandas, I think we're OK
removing the match. It's probably fine to ensure that a ValueError is
raised.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVdZtzKEx0cWddEGSwJfiD834ZG0Nks5u5qXngaJpZM4YUzrl>
.
|
@bla1089 this needs to merge master then can have a look |
Pushing now. I hadn't b/c I was waiting for @TomAugspurger to reply*.*
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sat, Jan 5, 2019 at 12:28 PM Jeff Reback ***@***.***> wrote:
@bla1089 <https://github.com/bla1089> this needs to merge master then can
have a look
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVREkuqOaCCJjLSu6SjCKjjk8uP6Wks5vAODDgaJpZM4YUzrl>
.
|
@bla1089 you need to merge master |
lets try that.
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sat, Jan 5, 2019 at 1:34 PM Jeff Reback ***@***.***> wrote:
@bla1089 <https://github.com/bla1089> you need to merge master
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVZ4FBx5tJSaBzwhaHEa8wKDEtJ2Eks5vAPA9gaJpZM4YUzrl>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge master and address conflicts?
@@ -233,6 +233,7 @@ Other Enhancements | |||
- :class:`DatetimeIndex` gained :attr:`DatetimeIndex.timetz` attribute. Returns local time with timezone information. (:issue:`21358`) | |||
- :meth:`round`, :meth:`ceil`, and meth:`floor` for :class:`DatetimeIndex` and :class:`Timestamp` now support an ``ambiguous`` argument for handling datetimes that are rounded to ambiguous times (:issue:`18946`) | |||
- :meth:`round`, :meth:`ceil`, and meth:`floor` for :class:`DatetimeIndex` and :class:`Timestamp` now support a ``nonexistent`` argument for handling datetimes that are rounded to nonexistent times. See :ref:`timeseries.timezone_nonexistent` (:issue:`22647`) | |||
- :func:`~cut` `bins` kwarg now accepts a string, which is dispatched to `numpy.histogram_bin_edges`. (:issue:`14627`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to 0.25 at this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure how to get my fork to update here. I recently changed my GH id and I'm unsure if that's getting in the way. Any ideas?
merge master |
That's what I did, which is why I'm confused.
pandas-bla1089 balterma$ git merge master
Already up to date.
pandas-bla1089 balterma$ git push
Everything up-to-date
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sat, Mar 2, 2019 at 9:19 PM Jeff Reback ***@***.***> wrote:
merge master
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVbduv54oi0Frr92t12TM9VZpjnN8ks5vSzEQgaJpZM4YUzrl>
.
|
you need to merge upstream |
pandas-bla1089 balterma$ git merge upstream/master
Already up to date.
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sat, Mar 2, 2019 at 9:38 PM Jeff Reback ***@***.***> wrote:
you need to merge upstream
git merge upstream/master
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVT_IuYx3aXSg7DXOAKrJYMpVja-fks5vSzW3gaJpZM4YUzrl>
.
|
@blalterman : Two questions:
|
Here's the former.
pandas-bla1089$ git remote -v
origin https://github.com/blalterman/pandas.git (fetch)
origin https://github.com/blalterman/pandas.git (push)
upstream https://github.com/pandas-dev/pandas.git (fetch)
upstream https://github.com/pandas-dev/pandas.git (push)
*fetch upstream* pulled down a bunch of changes. Now what?
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sat, Mar 2, 2019 at 11:48 PM gfyoung ***@***.***> wrote:
@blalterman <https://github.com/blalterman> : Two questions:
- When you run git remote -v, what does upstream point to?
- Did you fetch from upstream first?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVTiO_GiYYYXY89UD27QW7WAtVhsqks5vS1QsgaJpZM4YUzrl>
.
|
Now do a |
ok...
…---------------------
B. L. Alterman
Candidate, Applied Physics
Solar and Heliospheric Research Group
Climate and Space Sciences and Engineering
University of Michigan
balterma@umich.edu
On Sun, Mar 3, 2019 at 2:02 PM Tom Augspurger ***@***.***> wrote:
Now do a git merge upstream/master (and you'll need to fixup the
conflicts.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23567 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AMPWVR5ie-Psa4D3YWsUHUg4-YBPVy2Jks5vTCpQgaJpZM4YUzrl>
.
|
@blalterman are you able to update here? If you changed your GH credentials might be easiest to push a new PR |
Closing as stale |
Uh oh!
There was an error while loading. Please reload this page.