groupby sorting - don't specify, or forbid? #102

Closed
@MarcoGorelli

From the last call, the cuDF folks stated that a goal of theirs is to let users prototype with pandas on CPU, and then switch to GPU with cuDF and have their code "just work".

One issue they currently face is groupby: pandas sorts by default, so that's what their users expect.
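For reference, here is a minimal sketch of the pandas behaviour in question. The column names and data are made up for illustration; the only API assumed is `DataFrame.groupby` and its `sort` parameter, which defaults to `True`.

```python
import pandas as pd

# Illustrative data: keys deliberately appear out of alphabetical order.
df = pd.DataFrame(
    {"key": ["b", "a", "b", "c", "a"], "value": [1, 2, 3, 4, 5]}
)

# Default behaviour: group keys come back sorted.
sorted_keys = list(df.groupby("key")["value"].sum().index)

# With sort=False, groups appear in order of first appearance.
unsorted_keys = list(df.groupby("key", sort=False)["value"].sum().index)

print(sorted_keys)    # ['a', 'b', 'c']
print(unsorted_keys)  # ['b', 'a', 'c']
```

Code that silently relies on the first ordering is the kind that would not "just work" on a library that does not sort by default.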

Even if pandas were to adopt the standard in its main namespace, this would not accomplish their goal. This is because whilst the standard doesn't specify whether groupby should sort, it doesn't forbid it either - thus, pandas could continue sorting in groupby by default whilst respecting the standard, but then cuDF would be no better off.

cuDF also said that if pandas were to implement the standard in a separate namespace, then users wouldn't necessarily want to look up the standard and would expect things to just work as they're used to (i.e. pandas main namespace).

The only way I can think of to accomplish cuDF's goal, without changes in cuDF, would be:

  1. the Standard forbids sorting in groupby
  2. pandas adopts the standard in its main namespace

I don't mean to be a "Pessimistic Pete", but the second one seems unlikely to land. Some deviations from pandas in cuDF may be warranted.

My suggestion is:

  • for cuDF's end-user groupby issue, they could remove the default from groupby, thus forcing users to specify a value for sort. Users would be required to type an extra 2 words, but they'd probably be better off because of it
  • the Standard stays developer-oriented, rather than end-user-oriented
  • to minimise surprise to users of the Standard, the Standard explicitly forbids sorting in groupby. It's been mentioned that developers might only test their code using pandas and then expect it to work with other DataFrame libraries, so this would reduce the chances of surprises. And so long as it's in a separate namespace, there's no risk of breaking millions of users' code
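To illustrate the first suggestion, here is a rough sketch of what "no default for sort" could look like from the user's side. `groupby_explicit` is a hypothetical helper written on top of pandas purely for illustration; it is not part of cuDF, pandas, or the Standard.

```python
import pandas as pd


def groupby_explicit(df, by, *, sort):
    """Hypothetical wrapper: `sort` is keyword-only and has no default,
    so callers must state whether they want sorted group keys."""
    return df.groupby(by, sort=sort)


# Illustrative data, not taken from the discussion.
df = pd.DataFrame(
    {"key": ["b", "a", "b", "c", "a"], "value": [1, 2, 3, 4, 5]}
)

# Explicit choice: works, and the intent is visible at the call site.
result = groupby_explicit(df, "key", sort=False)["value"].sum()

# Omitting `sort` fails immediately instead of silently picking a default.
try:
    groupby_explicit(df, "key")
    missing_sort_error = None
except TypeError as exc:
    missing_sort_error = exc
```

The trade-off is a little extra typing in exchange for code whose ordering assumptions are written down rather than inherited from one library's default.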

EDIT: on the last point - maybe the standard doesn't need to forbid sorting, but the pandas implementation of it shouldn't sort by default
