Transform input data: groupby, filter #917

Closed
@monfera

Previously discussed (some lists are from @chriddyp):

A groupby transform should split a trace into several traces, one per unique value or bin of the groupby dimension. Example:

groupby: ['a', 'b', 'a', 'b']
x: [1, 2, 1, 2]
y: [10, 20, 30, 40]

should generate two traces:

trace 1 (group 'a'):
x: [1, 1]
y: [10, 30]

trace 2 (group 'b'):
x: [2, 2]
y: [20, 40]
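
A minimal sketch of the split for x and y only, assuming plain JS arrays of equal length. `splitByGroup` is a hypothetical helper, not an existing plotly.js function, and the sample data is made up for illustration:

```javascript
// Hypothetical helper: split a trace's x/y arrays by the values of a groupby array.
function splitByGroup(trace, groupby) {
  const groups = new Map(); // a Map keeps group keys in first-seen order
  groupby.forEach((key, i) => {
    if (!groups.has(key)) {
      groups.set(key, { name: String(key), x: [], y: [] });
    }
    const target = groups.get(key);
    target.x.push(trace.x[i]);
    target.y.push(trace.y[i]);
  });
  return Array.from(groups.values());
}

// Made-up sample: points 0 and 2 belong to 'u', point 1 belongs to 'v'.
const traces = splitByGroup({ x: [0, 1, 2], y: [5, 6, 7] }, ['u', 'v', 'u']);
console.log(JSON.stringify(traces));
```

Each output trace collects the points whose groupby entry equals its key, so point order within a group follows the input order.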

Static groupby as a means of splitting spatially and/or aesthetically

  • distinct categorical values: numbers, strings or datetime strings
  • evenly spaced bins based on numerical data or time (datetime strings) in the groupby attribute, reusing the logic of plotly's existing histogram binning algorithm
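
For the binned case, one possible approach is to convert numeric groupby values into bin labels first and then group on the labels as if they were categories. The sketch below uses plain fixed-width binning with an explicit start and size; it is not plotly's histogram autobin logic, and `binLabels` is a hypothetical name:

```javascript
// Hypothetical helper: map each numeric value to the label of its evenly spaced bin.
// `start` is the left edge of the first bin and `size` is the bin width (both assumed given).
function binLabels(values, start, size) {
  return values.map((v) => {
    const index = Math.floor((v - start) / size);
    const lo = start + index * size;
    return `[${lo}, ${lo + size})`; // half-open interval label
  });
}

const labels = binLabels([1, 4, 11, 19, 25], 0, 10);
console.log(labels);
```

The resulting labels can be used directly as the groupby array, so the categorical code path does not need to know about bins.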

Functional aspects:

  1. groupby needs to work across numbers, dates, and categories (@chriddyp in the JS context, meaning strings, correct?)

  2. groupby needs to split across all of the arrays or array-like specifications in a trace, not just x and y. For example, marker.color or marker.line.color. Not all array-like specifications in a trace are actual arrays (consider colorscale)

  3. There must be a way of specifying distinct styles for the split apart traces so that they're discernible - example:

    transform:
        groupby: ['a', 'b', 'a', 'b']
        marker:
            color:
                a: 'blue'
                b: 'red'
    
  4. @etpinard found some issues with legend items as he wrote an initial version of transforms (see the comment on "Introducing transform plugins", #499). We'll probably need to modify some of the transforms and the API. That's OK: transforms were made for groupby.

  5. All relevant denotations for groupby, and the related animation split use (see below) need to be in the JSON format for serializability, fitting in the current declarative structure

  6. The transforms such as groupby must work in the restyle and relayout steps, not just the initial plot step

  7. gd.data is expected to preserve the single trace and the groupby spec as the user supplied them, while _fullData holds the individual (split) traces and no longer has the groupby attribute

  8. We must map traces in _fullData back to groups or styles in data. Styling controls will be populated with the defaults from _fullData (e.g. _fullData[4].marker.color) but they’ll need to update the attributes in the data object (e.g. data[0].transform.marker.color.d). That’s because we serialize and save data, not _fullData.
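
Items 2, 3, 7 and 8 can be sketched together as follows. This is an illustration under stated assumptions, not the plotly.js implementation: `pickRows`, `styleFor`, `merge`, `splitTrace`, `_sourceIndex` and `_group` are hypothetical names, the `transform` layout follows the example in item 3, and "array length equals number of points" is a stand-in heuristic for the attribute metadata that a real implementation would need.

```javascript
const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

// Keep only rows `indices` of every per-point array, recursing into containers
// such as marker. Arrays of a different length (e.g. colorscale) are passed through.
function pickRows(obj, indices, nPoints) {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    if (Array.isArray(value)) {
      out[key] = value.length === nPoints ? indices.map((i) => value[i]) : value;
    } else if (isPlainObject(value)) {
      out[key] = pickRows(value, indices, nPoints);
    } else {
      out[key] = value;
    }
  }
  return out;
}

// Resolve per-group styles: an object with only scalar values is read as a
// { groupName: value } map; any other object is treated as a container.
function styleFor(spec, group) {
  const out = {};
  for (const [key, value] of Object.entries(spec)) {
    if (!isPlainObject(value)) continue; // skips the groupby array itself
    const isGroupMap = Object.values(value).every((v) => v === null || typeof v !== 'object');
    if (isGroupMap) {
      if (group in value) out[key] = value[group];
    } else {
      out[key] = styleFor(value, group);
    }
  }
  return out;
}

// Deep-merge `extra` over `base` without mutating either.
function merge(base, extra) {
  const out = { ...base };
  for (const [key, value] of Object.entries(extra)) {
    out[key] = isPlainObject(value) && isPlainObject(out[key]) ? merge(out[key], value) : value;
  }
  return out;
}

// Build the split traces for data[sourceIndex]; the input trace is left untouched.
function splitTrace(data, sourceIndex) {
  const { transform, ...attrs } = data[sourceIndex];
  const rows = new Map(); // group value -> point indices, in first-seen order
  transform.groupby.forEach((g, i) => {
    if (!rows.has(g)) rows.set(g, []);
    rows.get(g).push(i);
  });
  return Array.from(rows, ([group, indices]) => ({
    ...merge(pickRows(attrs, indices, transform.groupby.length), styleFor(transform, group)),
    name: String(group),
    _sourceIndex: sourceIndex, // back-reference into data
    _group: group,             // which group of the source trace this came from
  }));
}

const data = [{
  type: 'scatter',
  x: [1, 2, 3, 4],
  y: [10, 20, 30, 40],
  marker: { size: [5, 6, 7, 8], colorscale: [[0, 'white'], [1, 'black']] },
  transform: { groupby: ['a', 'b', 'a', 'b'], marker: { color: { a: 'blue', b: 'red' } } },
}];
const fullData = splitTrace(data, 0);
```

With the back-references in place, a styling control showing `fullData[1].marker.color` could write its change to `data[fullData[1]._sourceIndex].transform.marker.color[fullData[1]._group]`, so that the serialized `data` stays the source of truth.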

Preliminary work

Related PR, containing the initial, analogous filter work by @timelyportfolio: #859
groupby: https://github.com/plotly/plotly.js/blob/master/test/jasmine/assets/transforms/groupby.js

Planned groupby coverage of the initial sprint

  1. It would cover a positive list of attributes for groupby, such as x and y, but not all at once. However, the preferred solution aims for generality, because other transforms (e.g. filter) will need a similar approach, and future arraylike attributes should be covered without coupling their code to transformations. Consequences: we'll have to check whether there's enough attribute metadata to tell if an attribute is arraylike, or whether we need further metadata; and whether there's a programmatic way of separating arraylike data, e.g. a colorscale that's not represented as an array at input. Otherwise we need to handle them attribute by attribute. We'll have to come back to this topic after a first round of work.
    Initial attributes at least: x, y, marker.color, marker.size (scatter, bar, histogram, box)
    Then lat, lon (maps), a, b, c (ternary), z (scatter3d), error_y.array
  2. It would cover a set of (initially, non-WebGL) traces
  3. First goalpost is separation by category (JS number or string)

It is expected that trace separation (and transformations in general) is performed in the supply defaults step.

Subsequent goal: splitting data for animations

Instead of generating n different paths as described above, plotly would arrive at a temporal sequence of n frames
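
A rough sketch of that idea, reusing the groupby spec to produce one frame per distinct group value. `groupsToFrames` is a hypothetical name, the `{ name, data }` frame shape is an assumption made for illustration, and only x and y are handled:

```javascript
// Hypothetical helper: one frame per distinct groupby value, in first-seen order.
function groupsToFrames(trace, groupby) {
  const keys = [...new Set(groupby)];
  return keys.map((key) => ({
    name: String(key),
    data: [{
      // keep only the points whose groupby entry matches this frame's key
      x: trace.x.filter((_, i) => groupby[i] === key),
      y: trace.y.filter((_, i) => groupby[i] === key),
    }],
  }));
}

// Made-up sample: the groupby values act as time steps.
const frames = groupsToFrames(
  { x: [1, 2, 3, 4], y: [10, 20, 30, 40] },
  [2015, 2016, 2015, 2016]
);
console.log(JSON.stringify(frames));
```

The grouping step is the same as for trace splitting; only the container for the result differs, which suggests both uses could share one code path.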

Possible future items:

  1. Incremental recalculation (e.g. of bins, upon newly arriving data points)
  2. Combine this with a subplots transform for rendering the traces into separate subplots (as small multiples plots)
