Aggregate transforms #1924

Merged
merged 8 commits into from
Aug 3, 2017
Changes from 3 commits
11 changes: 11 additions & 0 deletions lib/aggregate.js
@@ -0,0 +1,11 @@
/**
* Copyright 2012-2017, Plotly, Inc.
* All rights reserved.
*
* This source code is licensed under the MIT license found in the
* LICENSE file in the root directory of this source tree.
*/

'use strict';

module.exports = require('../src/transforms/aggregate');
1 change: 1 addition & 0 deletions lib/index.js
@@ -56,6 +56,7 @@ Plotly.register([
 // https://github.com/plotly/plotly.js/pull/978#pullrequestreview-2403353
 //
 Plotly.register([
+    require('./aggregate'),
     require('./filter'),
     require('./groupby'),
     require('./sort')
16 changes: 12 additions & 4 deletions src/plots/cartesian/axes.js
@@ -124,7 +124,7 @@ axes.cleanPosition = function(pos, gd, axRef) {
     return cleanPos(pos);
 };

-axes.getDataToCoordFunc = function(gd, trace, target, targetArray) {
+var getDataConversions = axes.getDataConversions = function(gd, trace, target, targetArray) {
     var ax;

     // If target points to an axis, use the type we already have for that
@@ -155,15 +155,23 @@ axes.getDataToCoordFunc = function(gd, trace, target, targetArray) {

     // if 'target' has corresponding axis
     // -> use setConvert method
-    if(ax) return ax.d2c;
+    if(ax) return {d2c: ax.d2c, c2d: ax.c2d};

     // special case for 'ids'
     // -> cast to String
-    if(d2cTarget === 'ids') return function(v) { return String(v); };
+    if(d2cTarget === 'ids') return {d2c: toString, c2d: toString};

     // otherwise (e.g. numeric-array of 'marker.color' or 'marker.size')
     // -> cast to Number
-    return function(v) { return +v; };
+
+    return {d2c: toNum, c2d: toNum};
+};
+
+function toNum(v) { return +v; }
+function toString(v) { return String(v); }
+
+axes.getDataToCoordFunc = function(gd, trace, target, targetArray) {
+    return getDataConversions(gd, trace, target, targetArray).d2c;
 };

 // empty out types for all axes containing these traces
283 changes: 283 additions & 0 deletions src/transforms/aggregate.js
@@ -0,0 +1,283 @@
/**
* Copyright 2012-2017, Plotly, Inc.
* All rights reserved.
*
* This source code is licensed under the MIT license found in the
* LICENSE file in the root directory of this source tree.
*/

'use strict';

var Axes = require('../plots/cartesian/axes');
var Lib = require('../lib');
var PlotSchema = require('../plot_api/plot_schema');
var BADNUM = require('../constants/numerical').BADNUM;

exports.moduleType = 'transform';

exports.name = 'aggregate';

var attrs = exports.attributes = {
enabled: {
valType: 'boolean',
dflt: true,
description: [
'Determines whether this aggregate transform is enabled or disabled.'
].join(' ')
},
groups: {
// TODO: groupby should support string or array grouping this way too
// currently groupby only allows a grouping array
valType: 'string',
strict: true,
noBlank: true,
arrayOk: true,
dflt: 'x',
description: [
'Sets the grouping target to which the aggregation is applied.',
'Data points with matching group values will be coalesced into',
'one point, using the supplied aggregation functions to reduce data',
'in other data arrays.',
'If a string, *groups* is assumed to be a reference to a data array',
'in the parent trace object.',
'To aggregate by nested variables, use *.* to access them.',
'For example, set `groups` to *marker.color* to aggregate',
Contributor

Is there a difference between the way backticks and asterisks render in the docs?

Collaborator Author

Ah good catch - not quite sure how it renders, but we've tried to use backticks for attribute names and asterisks for string attribute values, looks like I mixed them up a bit here.

Contributor

All good. I didn't notice which was which. 😄

Contributor

For the record, this special attribute / value formatting isn't currently used anywhere. I originally thought it could be on https://plot.ly/javascript/reference/, but no-one ever got to implementing it.

'about the marker color array.',
'If an array, *groups* is itself the data array by which we aggregate.'
].join(' ')
},
aggregations: {
_isLinkedToArray: 'style',
Contributor

For the record, the _isLinkedToArray values are used in the python api to build the graph objects e.g. go.Annotation. So, we should change this line to _isLinkedToArray: 'aggregation'.

This won't do anything at the moment as transforms and anything inside them don't have corresponding python graph objects (I think) at the moment, but might as well stay consistent.

Collaborator Author

oh haha copy/paste error - thanks.

array: {
Contributor

why not target as in groupby and filter?

Collaborator Author

array -> target in ffc4ee2

Contributor

To recap:

In filter, target = data by which filter is applied
In sort, target = data by which data is sorted
In groupby, groups = data by which trace is grouped (should this be target?)
In aggregate, target = data by which trace is aggregated

Collaborator Author

In groupby, groups = data by which trace is grouped (should this be target?)

aggregate has both groups (in the main container) and target (in each aggregation) - I suppose we could make all of these be target but that's starting to sound a bit confusing.

Contributor

Ah, I thought maybe I was missing one. That works for me.
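
For reference, here is a minimal sketch of how a trace might configure this transform under the naming settled in this thread. It assumes the `array` -> `target` rename from ffc4ee2 is applied; the trace type and data values are made up for illustration.

```javascript
// Hypothetical trace; the data values are invented for illustration.
var trace = {
    type: 'scatter',
    x: ['a', 'b', 'a', 'b', 'c'],
    y: [1, 2, 3, 4, 5],
    marker: {size: [10, 20, 30, 40, 50]},
    transforms: [{
        type: 'aggregate',
        // main container: the array (or a reference to one) that defines the groups
        groups: 'x',
        // one entry per data array to reduce within each group
        aggregations: [
            {target: 'y', func: 'sum'},
            {target: 'marker.size', func: 'avg'}
        ]
    }]
};
```

With the grouping logic in this diff, one would expect this to coalesce to three points, with `x` reduced by *first*, `y` summed per group, and `marker.size` averaged per group.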

valType: 'string',
role: 'info',
description: [
'A reference to the data array in the parent trace to aggregate.',
'To aggregate by nested variables, use *.* to access them.',
'For example, set `groups` to *marker.color* to aggregate',
'about the marker color array.',
Contributor

Subtle, but maybe "to aggregate over the marker color array?"

'The referenced array must already exist, unless `func` is *count*,',
'and each array may only be referenced once.'
].join(' ')
},
func: {
valType: 'enumerated',
values: ['count', 'sum', 'avg', 'min', 'max', 'first', 'last'],
Contributor

Can't imagine median would be very popular… mode…? variance…?
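
To make the suggestion concrete, here is a rough sketch of what such reducers could look like if written in the same `(array, indices)` shape as the ones in this diff. None of these functions are part of the PR. The names are hypothetical, and a plain numeric cast stands in for the `d2c` / `BADNUM` handling the real reducers use.

```javascript
// Hypothetical reducers, not part of this PR. Non-numeric values are skipped
// via isNaN instead of the d2c / BADNUM checks used in the diff.
function median(array, indices) {
    var vals = [];
    for(var i = 0; i < indices.length; i++) {
        var v = +array[indices[i]];
        if(!isNaN(v)) vals.push(v);
    }
    if(!vals.length) return undefined;
    vals.sort(function(a, b) { return a - b; });
    var mid = Math.floor(vals.length / 2);
    return (vals.length % 2) ? vals[mid] : (vals[mid - 1] + vals[mid]) / 2;
}

// Most frequent value; on ties, the value that reaches the top count first wins.
function mode(array, indices) {
    var counts = Object.create(null);
    var best;
    var bestCount = 0;
    for(var i = 0; i < indices.length; i++) {
        var v = array[indices[i]];
        var key = String(v);
        counts[key] = (counts[key] || 0) + 1;
        if(counts[key] > bestCount) {
            bestCount = counts[key];
            best = v;
        }
    }
    return best;
}

// Population variance, accumulated in a single pass (Welford's method).
function variance(array, indices) {
    var n = 0;
    var mean = 0;
    var m2 = 0;
    for(var i = 0; i < indices.length; i++) {
        var v = +array[indices[i]];
        if(isNaN(v)) continue;
        n++;
        var delta = v - mean;
        mean += delta / n;
        m2 += delta * (v - mean);
    }
    return n ? m2 / n : undefined;
}
```

Median needs all of a group's values before it can answer, whereas the variance sketch only keeps three running numbers per group.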

dflt: 'first',
Contributor

@rreusser Aug 2, 2017

By default, presumably any data array just returns the first entry as a way of dealing with the issue of needing to specify operations for all data arrays?

Collaborator Author

By default, presumably any data array just returns the first entry as a way of dealing with the issue of needing to specify operations for all data arrays?

exactly
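
The fallback described here can be sketched in isolation as follows. This is a simplified stand-in for the last loop of `supplyDefaults` in this diff, not the actual implementation: the helper name is hypothetical, it uses the `array` key as in this revision, and it ignores the special handling of the `groups` array.

```javascript
// Hypothetical helper: any data array without an explicit aggregation
// gets an entry with the default func appended after the explicit ones.
var DEFAULT_FUNC = 'first';

function fillDefaultAggregations(arrayAttrNames, aggregations) {
    var covered = Object.create(null);
    var out = aggregations.slice();
    var i;

    for(i = 0; i < aggregations.length; i++) {
        covered[aggregations[i].array] = true;
    }
    for(i = 0; i < arrayAttrNames.length; i++) {
        if(!covered[arrayAttrNames[i]]) {
            out.push({array: arrayAttrNames[i], func: DEFAULT_FUNC});
        }
    }
    return out;
}

// Only 'y' is specified; 'x' and 'marker.size' fall back to the default.
var filled = fillDefaultAggregations(
    ['x', 'y', 'marker.size'],
    [{array: 'y', func: 'sum'}]
);
```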

role: 'info',
description: [
'Sets the aggregation function.',
'All values from the linked `array`, corresponding to the same value',
'in the `groups` array, are collected and reduced by this function.',
'*count* is simply the number of values in the `groups` array, so does',
'not even require the linked array to exist. *first* (*last*) is just',
'the first (last) linked value.'
].join(' ')
},
}
};

/**
* Supply transform attributes defaults
*
* @param {object} transformIn
* object linked to trace.transforms[i] with 'type' set to exports.name
* @param {object} traceOut
* the _fullData trace this transform applies to
* @param {object} layout
* the plot's (not-so-full) layout
* @param {object} traceIn
* the input data trace this transform applies to
*
* @return {object} transformOut
* copy of transformIn that contains attribute defaults
*/
exports.supplyDefaults = function(transformIn, traceOut) {
var transformOut = {};
var i;

function coerce(attr, dflt) {
return Lib.coerce(transformIn, transformOut, attrs, attr, dflt);
}

var enabled = coerce('enabled');

if(!enabled) return transformOut;

/*
* Normally _arrayAttrs is calculated during doCalc, but that comes later.
* Anyway this can change due to *count* aggregations (see below) so it's not
* necessarily the same set.
*
* For performance we turn it into an object of truthy values
* we'll use 1 for arrays we haven't aggregated yet, 0 for finished arrays,
* as distinct from undefined which means this array isn't present in the input
* missing arrays can still be aggregate outputs for *count* aggregations.
*/
var arrayAttrArray = PlotSchema.findArrayAttributes(traceOut);
Contributor

Of subtle interest is the fact that arrayAttrs does include transform data arrays themselves. I glossed over this in groupby since it wasn't critically important for performance (iterated groupbys are definitely a corner case) and since the groups are already decided by the time this transform modifies its own arrayAttrs, so the result is correct. Perhaps it should filter out transform[i] for i > transformIn._index.

Collaborator Author

Ah interesting... I will have to look for cases where this matters, I can't quite tell offhand whether transforming the earlier transforms is merely unnecessary or can actually lead to bugs. Combine that with groupby happening first and the condition is a bit more complicated than just i > transformIn._index anyway.

Contributor

I think it's just a tiny bit of unnecessary work we can 🔪 at some point.
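
A sketch of the simple version of that filter, for whenever someone gets to it. It assumes the attribute strings for transform arrays look like `transforms[0].groups`, which this thread does not confirm, and the helper name is hypothetical. It implements only the plain `i > transformIn._index` condition, not the more complicated one mentioned above for groupby running first.

```javascript
// Hypothetical helper: drop array attributes that belong to transforms
// after the current one. Assumes names of the form 'transforms[N].attr'.
function dropLaterTransformArrays(arrayAttrNames, transformIndex) {
    var re = /^transforms\[(\d+)\]\./;
    return arrayAttrNames.filter(function(name) {
        var match = re.exec(name);
        // keep non-transform arrays and arrays of this or earlier transforms
        return !match || +match[1] <= transformIndex;
    });
}

var kept = dropLaterTransformArrays(
    ['x', 'marker.size', 'transforms[0].groups', 'transforms[2].target'],
    1
);
```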

var arrayAttrs = {};
for(i = 0; i < arrayAttrArray.length; i++) arrayAttrs[arrayAttrArray[i]] = 1;

var groups = coerce('groups');

if(!Array.isArray(groups)) {
if(!arrayAttrs[groups]) {
transformOut.enabled = false;
return transformOut;
}
arrayAttrs[groups] = 0;
}

var aggregationsIn = transformIn.aggregations;
var aggregationsOut = transformOut.aggregations = [];

if(aggregationsIn) {
for(i = 0; i < aggregationsIn.length; i++) {
var aggregationOut = {};
var array = Lib.coerce(aggregationsIn[i], aggregationOut, attrs.aggregations, 'array');
var func = Lib.coerce(aggregationsIn[i], aggregationOut, attrs.aggregations, 'func');

// add this aggregation to the output only if it's the first instance
// of a valid array attribute - or an unused array attribute with "count"
if(array && (arrayAttrs[array] || (func === 'count' && arrayAttrs[array] === undefined))) {
arrayAttrs[array] = 0;
aggregationsOut.push(aggregationOut);
Contributor

@etpinard Aug 2, 2017

So in general, aggregationsIn.length !== aggregationsOut.length. Hmm, that might cause issues in the workspace (but I'm sure you thought about that 😏 ). Maybe we could instead add an aggregations[i].enabled attribute to make it more similar to other isLinkedToArray items?

Collaborator Author

good idea. We're never going to be able to ensure equal in/out lengths, as we have to add entries for missing arrays - but we can at least ensure that all inputs contribute and the extras go at the end of the array so entries that do appear in the input have the same index in the output. added enabled -> 6d2d32c
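
The index-preserving idea can be sketched roughly like this. It is an illustration of the approach described above, not the contents of 6d2d32c: the helper name is hypothetical, and it assumes the `target` key and a default `func` of *first*.

```javascript
// Hypothetical helper: every input aggregation yields an output entry at the
// same index. Invalid or duplicate entries are kept but marked enabled: false.
// Defaults for arrays the input never mentions are appended at the end.
function alignAggregations(arrayAttrNames, aggregationsIn) {
    // 1 = present and not yet aggregated, 0 = already handled,
    // undefined = array not present in the trace
    var status = Object.create(null);
    var out = [];
    var i;

    for(i = 0; i < arrayAttrNames.length; i++) status[arrayAttrNames[i]] = 1;

    for(i = 0; i < aggregationsIn.length; i++) {
        var target = aggregationsIn[i].target;
        var func = aggregationsIn[i].func || 'first';
        var ok = !!target && (
            status[target] === 1 ||
            (func === 'count' && status[target] === undefined)
        );
        if(ok) status[target] = 0;
        out.push({target: target, func: func, enabled: ok});
    }

    for(i = 0; i < arrayAttrNames.length; i++) {
        if(status[arrayAttrNames[i]] === 1) {
            out.push({target: arrayAttrNames[i], func: 'first', enabled: true});
        }
    }
    return out;
}

var aligned = alignAggregations(['x', 'y'], [
    {target: 'y', func: 'sum'},
    {target: 'y', func: 'max'},   // duplicate target: kept in place, disabled
    {target: 'z', func: 'count'}  // missing array is acceptable for count
]);
```

Entries 0 to 2 of the output line up with the input, and the default entry for `x` lands at index 3.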

}
}
}

// any array attributes we haven't yet covered, fill them with the default aggregation
for(i = 0; i < arrayAttrArray.length; i++) {
if(arrayAttrs[arrayAttrArray[i]]) {
aggregationsOut.push({
array: arrayAttrArray[i],
func: attrs.aggregations.func.dflt
});
}
}

return transformOut;
};


exports.calcTransform = function(gd, trace, opts) {
    if(!opts.enabled) return;

    var groups = opts.groups;

    var groupArray = Lib.getTargetArray(trace, {target: groups});
    if(!groupArray) return;

    var i, vi, groupIndex;

    var groupIndices = {};
    var groupings = [];
    for(i = 0; i < groupArray.length; i++) {
        vi = groupArray[i];
        groupIndex = groupIndices[vi];
        if(groupIndex === undefined) {
            groupIndices[vi] = groupings.length;
            groupings.push([i]);
        }
        else groupings[groupIndex].push(i);
    }

    var aggregations = opts.aggregations;

    for(i = 0; i < aggregations.length; i++) {
        aggregateOneArray(gd, trace, groupings, aggregations[i]);
    }

    if(typeof groups === 'string') {
        aggregateOneArray(gd, trace, groupings, {array: groups, func: 'first'});
    }
};

function aggregateOneArray(gd, trace, groupings, aggregation) {
    var attr = aggregation.array;
    var targetNP = Lib.nestedProperty(trace, attr);
    var arrayIn = targetNP.get();
    var conversions = Axes.getDataConversions(gd, trace, attr, arrayIn);
    var func = getAggregateFunction(aggregation.func, conversions);

    var arrayOut = new Array(groupings.length);
    for(var i = 0; i < groupings.length; i++) {
        arrayOut[i] = func(arrayIn, groupings[i]);
Contributor

@rreusser Aug 2, 2017

One more comment: is it necessary to split the arrays or would it be possible to use online algorithms to just make one pass without splitting into arrays? Like mean and variance, for example. Going through the list with online algorithms in mind:

Collaborator Author

Good point - I think for the moment I'll leave it as is, all the existing aggregations would be fairly easy to replicate online (and yes, 'last' is easy - perhaps even easier than 'first', you just always replace the output with the new value) but some of the others we've listed in the comments might be trickier (median and mode, in particular). I'll keep this in mind though as a potential performance gain for later.

To note from a private convo with @rreusser - the case where this is most important is when you have a large number of small groups - thousands of groups aggregating just a couple of items each - in which case creating all the little groupings arrays entails significant overhead. If we see unreasonable drag in this case, switching to online algorithms would be the fix.

Contributor

Oh. Yeah. Easy. Obv.
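
For the record, a one-pass version of the existing aggregations might look roughly like the sketch below. It keeps one accumulator per group instead of first collecting per-group index arrays. The function name is hypothetical, plain numeric casts stand in for the `d2c` / `c2d` conversions, and it handles a single values array, so it is an illustration of the idea rather than a drop-in replacement.

```javascript
// Hypothetical one-pass ("online") aggregation: no per-group index arrays,
// just a small accumulator object per distinct group value.
function aggregateOnline(groupArray, values, func) {
    var groupIndices = Object.create(null);
    var acc = [];
    var i, gi, v, a;

    for(i = 0; i < groupArray.length; i++) {
        gi = groupIndices[groupArray[i]];
        if(gi === undefined) {
            gi = groupIndices[groupArray[i]] = acc.length;
            acc.push({
                n: 0, valid: 0, sum: 0,
                min: Infinity, max: -Infinity,
                first: values[i], last: values[i]
            });
        }
        a = acc[gi];
        a.n++;
        // 'last' is the easy one: always overwrite with the newest value
        a.last = values[i];
        v = +values[i];
        if(!isNaN(v)) {
            a.valid++;
            a.sum += v;
            if(v < a.min) a.min = v;
            if(v > a.max) a.max = v;
        }
    }

    return acc.map(function(a) {
        switch(func) {
            case 'count': return a.n;
            case 'sum': return a.sum;
            case 'avg': return a.valid ? a.sum / a.valid : undefined;
            case 'min': return a.valid ? a.min : undefined;
            case 'max': return a.valid ? a.max : undefined;
            case 'last': return a.last;
            default: return a.first;
        }
    });
}
```

For example, `aggregateOnline(['a', 'b', 'a', 'b', 'c'], [1, 2, 3, 4, 5], 'sum')` gives one value per group in order of first appearance. Median and mode would not fit this shape without storing more per-group state.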

    }
    targetNP.set(arrayOut);
}

function getAggregateFunction(func, conversions) {
    var d2c = conversions.d2c;
    var c2d = conversions.c2d;

    switch(func) {
        // count, first, and last don't depend on anything about the data
        // point back to pure functions for performance
        case 'count':
            return count;
        case 'first':
            return first;
        case 'last':
            return last;

        case 'sum':
            // This will produce output in all cases even though it's nonsensical
            // for date or category data.
            return function(array, indices) {
                var total = 0;
                for(var i = 0; i < indices.length; i++) {
                    var vi = d2c(array[indices[i]]);
                    if(vi !== BADNUM) total += +vi;
                }
                return c2d(total);
            };

        case 'avg':
            // Generally meaningless for category data but it still does something.
            return function(array, indices) {
                var total = 0;
                var cnt = 0;
                for(var i = 0; i < indices.length; i++) {
                    var vi = d2c(array[indices[i]]);
                    if(vi !== BADNUM) {
                        total += +vi;
                        cnt++;
                    }
                }
                return cnt ? c2d(total / cnt) : BADNUM;
            };

        case 'min':
            return function(array, indices) {
                var out = Infinity;
                for(var i = 0; i < indices.length; i++) {
                    var vi = d2c(array[indices[i]]);
                    if(vi !== BADNUM) out = Math.min(out, +vi);
                }
                return (out === Infinity) ? BADNUM : c2d(out);
            };

        case 'max':
            return function(array, indices) {
                var out = -Infinity;
                for(var i = 0; i < indices.length; i++) {
                    var vi = d2c(array[indices[i]]);
                    if(vi !== BADNUM) out = Math.max(out, +vi);
                }
                return (out === -Infinity) ? BADNUM : c2d(out);
            };
    }
}

function count(array, indices) {
    return indices.length;
}

function first(array, indices) {
    return array[indices[0]];
}

function last(array, indices) {
    return array[indices[indices.length - 1]];
}
6 changes: 4 additions & 2 deletions src/transforms/groupby.js
@@ -63,10 +63,12 @@ exports.attributes = {
  *
  * @param {object} transformIn
  * object linked to trace.transforms[i] with 'type' set to exports.name
- * @param {object} fullData
- * the plot's full data
+ * @param {object} traceOut
+ * the _fullData trace this transform applies to
  * @param {object} layout
  * the plot's (not-so-full) layout
+ * @param {object} traceIn
+ * the input data trace this transform applies to
  *
  * @return {object} transformOut
  * copy of transformIn that contains attribute defaults