-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426
Conversation
6f89c68
to
e42a008
Compare
e42a008
to
f1171a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any tests to check that the sorting algos are stable?
If no could be something to add (not necessarily here)
Good point! There is a test that verifies that the counting sort is stable by comparing it to numpy's pandas/pandas/tests/test_algos.py Lines 2118 to 2138 in e2dec8d
The mergesort in There is also a test case that tests the pandas/pandas/tests/test_sorting.py Lines 104 to 124 in e2dec8d
If you think it's worthwhile, I can also add a test case verifying that the multicolumn sort is stable directly. (I at least have not found any such test.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for looking into this! I think that's worth doing though not in this PR (it's always good to keep orthogonal tasks separate) |
@jotasi can you add test for the multi-column sorting as well. |
@jreback Sure I can do that. There is already a test that according to its name should test this but doesn't really as it doesn't contain duplicated pairs: pandas/pandas/tests/frame/methods/test_sort_values.py Lines 220 to 241 in 4b7c4bb
I'll just adapt that one. Should I open a new PR as @arw2019 suggested or should I do it here? |
@jotasi happy in this PR |
@pytest.mark.parametrize("na_position", ["first", "last"]) | ||
def test_sort_values_stable_multicolumn_sort( | ||
self, expected_idx_non_na, ascending, na_position | ||
): | ||
df = DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment with the GH id of this pr or the issue here and i'd say this is done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I've added a reference to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm ping on greenish
@jreback Thanks. The checks are done now. The Travis build is failing. But it looks like this is unrelated to the changes here. |
thanks @jotasi very nice, keep em coming! |
…iple columns or labels (pandas-dev#38426)
tests added / passedblack pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entryFrom what I understand from the code, for
by
containing more than one entry,pandas.DataFrame.sort_values
usespandas.core.sorting.lexsort_indexer
, which converts the columns to orderedCategorical
s, gets grouped lexicographic group indices viaget_group_index
, compressed these viacompress_group_index
and then sorts using counting sort or mergesort viaget_group_index_sorter
. Both of these are stable, as explained inget_group_index_sorter
's docstring. Therefore, the sorting should be stable ifsort_values
is called withlen(by)>1
.