Description
Binning continuous data into discrete bins is a non-trivial problem, especially when one is trying to capture features at various scales. This astropy example is much clearer than any I could hope to write. The current implementation of pandas.cut does not give users easy access to these tools.
numpy provides a number of simple binning algorithms that capture these features in computationally efficient ways. As is done in astropy, perhaps we can pass the bin edge calculations directly to numpy. To do so, we would need to call histogram_bin_edges (available since numpy 1.15.0). This function isolates the calculation of bin edges.
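As a rough illustration of what histogram_bin_edges returns, the sketch below compares a few of the string strategies numpy accepts on the same sample. The sample size, the seed, and the variable names (`rng`, `bin_counts`) are assumptions made for this example only.

```python
import numpy as np

# Illustrative sample; the seed and size are arbitrary choices.
rng = np.random.default_rng(0)
data = rng.normal(size=10_000)

# Each strategy estimates a bin width from the data and returns the edges.
bin_counts = {}
for strategy in ("auto", "fd", "sturges", "scott", "sqrt"):
    edges = np.histogram_bin_edges(data, bins=strategy)
    bin_counts[strategy] = len(edges) - 1
    print(f"{strategy:>8}: {bin_counts[strategy]} bins")

# An integer still works and yields equal-width bins over the data range.
fixed_edges = np.histogram_bin_edges(data, bins=5)
```

The returned array always has one more entry than the number of bins, and its first and last entries are the minimum and maximum of the data.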
To utilize np.histogram_bin_edges, a user would currently call:
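import numpy as np
import pandas as pd
data = np.random.normal(size=100000)  # 100000 samples; normal(100000) would return one value with mean 100000
edges = np.histogram_bin_edges(data, bins="auto")  # Other strings are acceptable.
cut = pd.cut(data, bins=edges)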
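One detail worth checking in this workaround: when pd.cut is given explicit edges, its intervals are right-closed by default, so a value equal to the first edge falls outside every bin. Since histogram_bin_edges uses the data minimum as the first edge, the sketch below compares the default call with `include_lowest=True`. The seed, sample size, and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative sample; the seed and size are arbitrary choices.
rng = np.random.default_rng(0)
data = pd.Series(rng.normal(size=1_000))

edges = np.histogram_bin_edges(data.to_numpy(), bins="auto")

# Default: intervals are (left, right], so the sample minimum is not binned.
cut_default = pd.cut(data, bins=edges)

# include_lowest=True closes the first interval on the left as well.
cut_inclusive = pd.cut(data, bins=edges, include_lowest=True)

print("unbinned values (default):       ", cut_default.isna().sum())
print("unbinned values (include_lowest):", cut_inclusive.isna().sum())
```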
We can simplify this by replacing one or two lines in the pandas.cut source:
bins = np.linspace(mn, mx, bins + 1, endpoint=True)
becomes
bins = np.histogram_bin_edges(x, bins=bins)
so that the user can simply call
data = np.random.normal(size=100000)
cut = pd.cut(data, bins="auto") # Or any of the other strings accepted by numpy
and have access to all the techniques available there.
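Until something like this exists in pandas itself, the proposed behaviour can be approximated with a small wrapper. This is only a sketch: `cut_with_numpy_bins` is a hypothetical helper, not a pandas function, and defaulting to `include_lowest=True` for string strategies is an assumption made here so that the sample minimum is binned.

```python
import numpy as np
import pandas as pd


def cut_with_numpy_bins(x, bins="auto", **cut_kwargs):
    """Hypothetical helper: accept numpy's bin-strategy strings in pd.cut."""
    if isinstance(bins, str):
        # Delegate the edge calculation to numpy for string strategies.
        bins = np.histogram_bin_edges(np.asarray(x), bins=bins)
        # Assumption: keep the minimum value inside the first bin.
        cut_kwargs.setdefault("include_lowest", True)
    # Integers and explicit edges pass through to pd.cut unchanged.
    return pd.cut(x, bins=bins, **cut_kwargs)


rng = np.random.default_rng(0)
data = rng.normal(size=1_000)

binned = cut_with_numpy_bins(data, bins="fd")
four_bins = cut_with_numpy_bins(data, bins=4)
print(len(binned.categories), "bins from the 'fd' strategy")
```

A change inside pandas.cut would also have to decide how the existing range-padding for integer bins interacts with the edges numpy returns; the wrapper sidesteps that by leaving the integer path untouched.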
The histogram_bin_edges docs are available here. The source is here.