Description
On the last call, the cuDF folks stated that one of their goals is to let users prototype with pandas on CPU, and then switch to GPU with cuDF and have their code "just work".
One issue they currently face is `groupby`: pandas sorts by default, so that's what their users expect.
Even if pandas were to adopt the standard in its main namespace, this would not accomplish their goal. This is because whilst the standard doesn't specify whether `groupby` should sort, it doesn't forbid it either - thus, pandas could continue sorting in `groupby` by default whilst respecting the standard, but then cuDF would be no better off.
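For reference, here is a minimal sketch of the pandas behaviour in question. The frame and column names are made up for illustration; the only API used is `DataFrame.groupby` with its `sort` parameter.

```python
import pandas as pd

# Toy data, purely illustrative: keys appear in the order b, a, c.
df = pd.DataFrame({"key": ["b", "a", "b", "c", "a"], "value": [1, 2, 3, 4, 5]})

# pandas default (sort=True): group keys come back in sorted order.
sorted_keys = df.groupby("key")["value"].sum().index.tolist()

# sort=False: group keys keep their order of first appearance.
unsorted_keys = df.groupby("key", sort=False)["value"].sum().index.tolist()

print(sorted_keys)    # ['a', 'b', 'c']
print(unsorted_keys)  # ['b', 'a', 'c']
```

Code that silently relies on the first ordering is the kind that may stop "just working" on a library that doesn't sort by default.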
cuDF also said that if pandas were to implement the standard in a separate namespace, then users wouldn't necessarily want to look up the standard and would expect things to just work as they're used to (i.e. pandas main namespace).
The only way I can think of to accomplish cuDF's goal, without changes in cuDF, would be:
- the Standard forbids sorting in `groupby`
- pandas adopts the standard in its main namespace
I don't mean to be a "Pessimistic Pete", but the second one seems unlikely to land. Some deviations from pandas in cuDF may be warranted.
My suggestion is:
- for cuDF's end-user groupby issue, they could remove the default from `groupby`, thus forcing users to specify a value for `sort`. Users would be required to type an extra 2 words, but they'd probably be better off because of it - the Standard stays developer-oriented, rather than end-user-oriented
- to minimise surprise to users of the Standard, the standard could explicitly forbid sorting in `groupby`. It's been mentioned that developers might only test their code using pandas and then expect it to work with other DataFrame libraries, so this would reduce the chances of surprises. And so long as it's in a separate namespace, there's no risk of breaking millions of users' code
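To make the first suggestion concrete, here's a rough sketch of what "no default for `sort`" could look like. `groupby_explicit` is a hypothetical helper I've made up for illustration, and it wraps pandas only as a stand-in; it is not how cuDF implements `groupby`.

```python
import pandas as pd


def groupby_explicit(df, by, *, sort):
    """Hypothetical wrapper: `sort` is keyword-only and has no default,
    so callers must state whether they want sorted group keys."""
    return df.groupby(by, sort=sort)


# Toy data, purely illustrative.
df = pd.DataFrame({"key": ["b", "a", "b"], "value": [1, 2, 3]})

# Callers spell out their intent explicitly.
result = groupby_explicit(df, "key", sort=True)["value"].sum()

# Omitting `sort` fails loudly instead of silently picking a behaviour.
try:
    groupby_explicit(df, "key")
    raised = False
except TypeError:
    raised = True
```

The cost to the user is typing `sort=True` or `sort=False`; in exchange, the same call means the same thing regardless of which library is underneath.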
EDIT: on the last point - maybe the standard doesn't need to forbid sorting, but the pandas implementation of it shouldn't sort by default