Skip to content

DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 22, 2020

Conversation

jotasi
Copy link
Contributor

@jotasi jotasi commented Dec 12, 2020

From what I understand from the code, for by containing more than one entry, pandas.DataFrame.sort_values uses pandas.core.sorting.lexsort_indexer, which converts the columns to ordered Categoricals, gets grouped lexicographic group indices via get_group_index, compressed these via compress_group_index and then sorts using counting sort or mergesort via get_group_index_sorter. Both of these are stable, as explained in get_group_index_sorter's docstring. Therefore, the sorting should be stable if sort_values is called with len(by)>1.

@pep8speaks
Copy link

pep8speaks commented Dec 12, 2020

Hello @jotasi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-22 18:13:05 UTC

@jotasi jotasi force-pushed the clarify_doc_sort_values_kind branch from 6f89c68 to e42a008 Compare December 12, 2020 17:36
@jotasi jotasi force-pushed the clarify_doc_sort_values_kind branch from e42a008 to f1171a8 Compare December 12, 2020 17:38
Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any tests to check that the sorting algos are stable?

If no could be something to add (not necessarily here)

@jotasi
Copy link
Contributor Author

jotasi commented Dec 13, 2020

Good point!

There is a test that verifies that the counting sort is stable by comparing it to numpy's argsort with kind="mergesort", which is stable:

def test_groupsort_indexer():
a = np.random.randint(0, 1000, 100).astype(np.int64)
b = np.random.randint(0, 1000, 100).astype(np.int64)
result = libalgos.groupsort_indexer(a, 1000)[0]
# need to use a stable sort
# np.argsort returns int, groupsort_indexer
# always returns int64
expected = np.argsort(a, kind="mergesort")
expected = expected.astype(np.int64)
tm.assert_numpy_array_equal(result, expected)
# compare with lexsort
# np.lexsort returns int, groupsort_indexer
# always returns int64
key = a * 1000 + b
result = libalgos.groupsort_indexer(key, 1000000)[0]
expected = np.lexsort((b, a))
expected = expected.astype(np.int64)

The mergesort in get_group_index_sorter is done by argsort with kind="mergesort".

There is also a test case that tests the lexsort_indexer, which tests that the None entries are sorted stably (no other duplicates are in the data though):

def test_lexsort_indexer(self):
keys = [[np.nan] * 5 + list(range(100)) + [np.nan] * 5]
# orders=True, na_position='last'
result = lexsort_indexer(keys, orders=True, na_position="last")
exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))
# orders=True, na_position='first'
result = lexsort_indexer(keys, orders=True, na_position="first")
exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))
# orders=False, na_position='last'
result = lexsort_indexer(keys, orders=False, na_position="last")
exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))
# orders=False, na_position='first'
result = lexsort_indexer(keys, orders=False, na_position="first")
exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

If you think it's worthwhile, I can also add a test case verifying that the multicolumn sort is stable directly. (I at least have not found any such test.)

@jotasi jotasi requested a review from arw2019 December 21, 2020 16:19
Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arw2019
Copy link
Member

arw2019 commented Dec 21, 2020

Good point!

There is a test that verifies that the counting sort is stable by comparing it to numpy's argsort with kind="mergesort", which is stable:

def test_groupsort_indexer():
a = np.random.randint(0, 1000, 100).astype(np.int64)
b = np.random.randint(0, 1000, 100).astype(np.int64)
result = libalgos.groupsort_indexer(a, 1000)[0]
# need to use a stable sort
# np.argsort returns int, groupsort_indexer
# always returns int64
expected = np.argsort(a, kind="mergesort")
expected = expected.astype(np.int64)
tm.assert_numpy_array_equal(result, expected)
# compare with lexsort
# np.lexsort returns int, groupsort_indexer
# always returns int64
key = a * 1000 + b
result = libalgos.groupsort_indexer(key, 1000000)[0]
expected = np.lexsort((b, a))
expected = expected.astype(np.int64)

The mergesort in get_group_index_sorter is done by argsort with kind="mergesort".

There is also a test case that tests the lexsort_indexer, which tests that the None entries are sorted stably (no other duplicates are in the data though):

def test_lexsort_indexer(self):
keys = [[np.nan] * 5 + list(range(100)) + [np.nan] * 5]
# orders=True, na_position='last'
result = lexsort_indexer(keys, orders=True, na_position="last")
exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))
# orders=True, na_position='first'
result = lexsort_indexer(keys, orders=True, na_position="first")
exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))
# orders=False, na_position='last'
result = lexsort_indexer(keys, orders=False, na_position="last")
exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))
# orders=False, na_position='first'
result = lexsort_indexer(keys, orders=False, na_position="first")
exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1))
tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

If you think it's worthwhile, I can also add a test case verifying that the multicolumn sort is stable directly. (I at least have not found any such test.)

Thanks for looking into this! I think that's worth doing though not in this PR (it's always good to keep orthogonal tasks separate)

@jreback jreback added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 21, 2020
@jreback jreback added this to the 1.3 milestone Dec 21, 2020
@jreback
Copy link
Contributor

jreback commented Dec 21, 2020

@jotasi can you add test for the multi-column sorting as well.

@jotasi
Copy link
Contributor Author

jotasi commented Dec 22, 2020

@jreback Sure I can do that. There is already a test that according to its name should test this but doesn't really as it doesn't contain duplicated pairs:

def test_sort_values_stable_descending_multicolumn_sort(self):
df = DataFrame(
{"A": [1, 2, np.nan, 1, 6, 8, 4], "B": [9, np.nan, 5, 2, 5, 4, 5]}
)
# test stable mergesort
expected = DataFrame(
{"A": [np.nan, 8, 6, 4, 2, 1, 1], "B": [5, 4, 5, 5, np.nan, 2, 9]},
index=[2, 5, 4, 6, 1, 3, 0],
)
sorted_df = df.sort_values(
["A", "B"], ascending=[0, 1], na_position="first", kind="mergesort"
)
tm.assert_frame_equal(sorted_df, expected)
expected = DataFrame(
{"A": [np.nan, 8, 6, 4, 2, 1, 1], "B": [5, 4, 5, 5, np.nan, 9, 2]},
index=[2, 5, 4, 6, 1, 0, 3],
)
sorted_df = df.sort_values(
["A", "B"], ascending=[0, 0], na_position="first", kind="mergesort"
)
tm.assert_frame_equal(sorted_df, expected)

I'll just adapt that one.

Should I open a new PR as @arw2019 suggested or should I do it here?

@jreback
Copy link
Contributor

jreback commented Dec 22, 2020

@jotasi happy in this PR

@pytest.mark.parametrize("na_position", ["first", "last"])
def test_sort_values_stable_multicolumn_sort(
self, expected_idx_non_na, ascending, na_position
):
df = DataFrame(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment with the GH id of this pr or the issue here and i'd say this is done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I've added a reference to this PR.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ping on greenish

@jotasi
Copy link
Contributor Author

jotasi commented Dec 22, 2020

@jreback Thanks. The checks are done now. The Travis build is failing. But it looks like this is unrelated to the changes here.

@jreback jreback merged commit 254e260 into pandas-dev:master Dec 22, 2020
@jreback
Copy link
Contributor

jreback commented Dec 22, 2020

thanks @jotasi very nice, keep em coming!

@jotasi jotasi deleted the clarify_doc_sort_values_kind branch December 22, 2020 20:34
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: Clarify stability of sorting in documentation of DataFrame.sort_values for multiple columns / labels
4 participants