Reimagining histogram autobin

Histogram autobin works relatively well for a single trace, though it could be better. For multiple traces, however, despite several attempts to clean it up over time (#2028, #1944, #1901, ...), it has a number of problems:
- If you leave out all binning information, we'll push `autobinx: true` back to the first trace, but `autobinx: false` to all subsequent traces. This doesn't have any immediate impact, but if you then try to alter bins later, the results depend on which trace you edit. In particular, if you edit the first one (which seems logical, right?), the later ones will keep their original bin sizes (https://codepen.io/alexcjohnson/pen/pOVrGj?editors=0010), but if instead you edit the other two (so change `[0]` to `[1,2]` in the `restyle` call at the end), the first one will get this new size as well.
- Stacked/grouped histograms can have different bin sizes or incompatible start positions (this is the resulting situation in the codepen above). This results in a misleading plot (you can make peaks shift, so two matching peaks look separated, for example, and make a flat distribution look like it has gaps), and I would argue even if explicitly supplied this way we should not allow this situation, and take the size only from the first one that explicitly specifies it, similar to how we handle stacked area options. It's fine though to have independent sizes & starts if `barmode='overlay'`. I initially thought we might need a "`bingroup`" attribute similar to scatter's `stackgroup` but now I think it's better to just enforce a match across the already known group.
- #1944, while matching up autobin sizes across histograms, made what I now think is the wrong decision: for multiple autobinned histograms the bin size "is the minimum any of them were auto-assigned". I think a much better solution would be to concatenate all the data together and autobin it as a single unit. That will generally result in roughly the largest bin size of any of the constituent traces, but sometimes it will be bigger than any of the individual traces (if they have a bigger range together than separately), sometimes smaller (if they have similar ranges but the total sample size is now big enough that we choose a smaller bin), and it becomes very clear how to shift the start to reduce ambiguity (to minimize data exactly at bin edges). Initially I was thinking we should make this optional, but I've come to think if we clean up the first two points above, the autobin size should just be changed to be the composite size.
- We're mutating `gd.data` (`autobinx`, `xbins.(start|end|size)`) - and doing so in buggy ways at that.
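
To make the third point concrete, here is a minimal sketch comparing "minimum of the individual auto sizes" with "autobin the concatenated data". It is only an illustration: `squareRootRuleSize` is a stand-in using the simple square-root rule, not the actual plotly.js autobin routine (which also rounds to nice sizes and shifts the start), and all function names here are hypothetical.

```javascript
// Stand-in bin-size rule (square-root rule): NOT the real plotly.js autobin.
// Assumes a non-empty array of finite numbers.
function squareRootRuleSize(values) {
    var min = Math.min.apply(null, values);
    var max = Math.max.apply(null, values);
    var nBins = Math.ceil(Math.sqrt(values.length));
    return (max - min) / nBins;
}

// Behavior described for #1944: autobin each trace, keep the minimum size.
function minOfIndividualSizes(traces) {
    return Math.min.apply(null, traces.map(squareRootRuleSize));
}

// Proposed behavior: concatenate all the data and autobin it as a single unit.
function compositeSize(traces) {
    var all = [].concat.apply([], traces);
    return squareRootRuleSize(all);
}

// Case 1: disjoint ranges, so the combined range is much larger.
var traceA = [0, 1, 2, 3, 4, 5, 6, 7];
var traceB = [10, 11, 12, 13, 14, 15, 16, 17];

// Case 2: identical ranges, so only the sample size grows.
var traceC = [0, 1, 2, 3, 4, 5, 6, 7, 8];
var traceD = [0, 1, 2, 3, 4, 5, 6, 7, 8];

var disjoint = [traceA, traceB];
var overlapping = [traceC, traceD];

// Composite size is larger than either individual size in case 1...
console.log(minOfIndividualSizes(disjoint), compositeSize(disjoint));
// ...and smaller than either individual size in case 2.
console.log(minOfIndividualSizes(overlapping), compositeSize(overlapping));
```

With this stand-in rule, the disjoint pair gets a composite size bigger than either trace alone (bigger range together), while the overlapping pair gets a smaller one (same range, more samples), which are the two cases described above.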

Proposal:
- Drop `autobin(x|y)` entirely, and just use the (improved) autobin routine to fill in whatever gaps there are in the explicitly specified attributes. So if you have explicitly specified an attribute and would like it to revert to auto, instead of turning on `autobin` you would delete the value.
- For backward compatibility with `restyle` we can convert `autobinx: true` into `xbins: null`. I'll have to investigate the existing behavior when `autobinx: true` and `xbins` are both specified upfront to figure out what we would want to do in `cleanData`.
- Coerce `nbinsx` if and only if an explicit `xbins.size` is not found in any trace in the group.
- Determine any needed auto values for `xbins`, and stash these *and only these* in `fullTrace._xbins`, but have both `supplyDefaults` and `calc` ensure the two are merged in `fullTrace.xbins`. So for example, if you specify `trace.xbins: {end: 30}` we'll set `fullTrace._xbins: {start: 0, size: 5}` and `fullTrace.xbins: {start: 0, end: 30, size: 5}`. The reason for this is `Plotly.react`: we want the full set in `fullTrace`, and `supplyDefaults` needs access to the auto-generated values, but in case you delete an explicit value from `trace` we don't want `supplyDefaults` to be able to fill it back in, so we will see the change and trigger a `calc`.
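
A minimal sketch of the stash-and-merge idea in the last point, using the `{end: 30}` example from above. `mergeBins` and its arguments are hypothetical names for illustration only; the auto values are passed in as if they had already been computed, and the real `supplyDefaults` / `calc` wiring is not shown.

```javascript
var BIN_KEYS = ['start', 'end', 'size'];

// Hypothetical helper. `explicitBins` is what the user put in trace.xbins
// (may be null/undefined); `autoBins` is a complete set of auto-generated values.
// Returns the auto-only stash (_xbins) and the merged result (xbins).
function mergeBins(explicitBins, autoBins) {
    var explicit = explicitBins || {};
    var stash = {};
    var full = {};

    BIN_KEYS.forEach(function(key) {
        var value = explicit[key];
        if (value !== undefined && value !== null) {
            // Explicit value wins and is NOT copied into the stash.
            full[key] = value;
        } else {
            // Gap: fill from auto, and remember that this value was auto.
            stash[key] = autoBins[key];
            full[key] = autoBins[key];
        }
    });

    return {_xbins: stash, xbins: full};
}

// Assumed auto values for this illustration.
var autoValues = {start: 0, end: 32, size: 5};

var withExplicitEnd = mergeBins({end: 30}, autoValues);
console.log(withExplicitEnd._xbins); // { start: 0, size: 5 }
console.log(withExplicitEnd.xbins);  // { start: 0, end: 30, size: 5 }

// If the user later deletes the explicit `end`, nothing in the stash can
// restore 30, so the merged xbins changes and the difference is detectable.
var afterDelete = mergeBins(null, autoValues);
console.log(afterDelete.xbins.end); // 32
```

Because the stash never contains explicitly specified keys, removing an explicit value from `trace` always changes the merged result (unless the auto value happens to be identical), which is the property `Plotly.react` needs here.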

@etpinard I know this is a fairly big change, potentially breaking for some users, but the existing behavior is sufficiently broken and problematic already that I think we should consider it. This would also allow us to take a broader view of what "autobin" means, realizing it's not just a boolean but `size`, `start`, and `end` can all be independently auto or explicit.

Note that `start` does not need to match exactly from one trace to the next but does need to be compatible (all of them must be the same modulo `size`). `end` can be completely independent from trace to trace. So this logic will still be a little bit intricate...
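
The compatibility condition on `start` can be sketched as a check that every start differs from the first by a whole number of bins. This is only an illustration for numeric axes with a positive `size`; `startsAreCompatible` is a hypothetical name, and date or category axes would need their own handling.

```javascript
// Hypothetical check: all starts must be equal modulo `size`.
// Assumes numeric starts and size > 0. `end` is deliberately ignored,
// since it can be independent from trace to trace.
function startsAreCompatible(starts, size) {
    var EPSILON = 1e-9; // tolerance for floating-point noise, in units of bins
    return starts.every(function(start) {
        var binsFromFirst = (start - starts[0]) / size;
        return Math.abs(binsFromFirst - Math.round(binsFromFirst)) < EPSILON;
    });
}

console.log(startsAreCompatible([0, 10, -5], 5)); // true: offsets of 2 and -1 bins
console.log(startsAreCompatible([0, 2.5], 5));    // false: offset of half a bin
```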

Reimagining histogram autobin #3001
