DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426

jotasi · 2020-12-12T17:31:36Z

closes DOC: Clarify stability of sorting in documentation of DataFrame.sort_values for multiple columns / labels #38357
~~tests added / passed~~
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
~~whatsnew entry~~

From what I understand from the code, for by containing more than one entry, pandas.DataFrame.sort_values uses pandas.core.sorting.lexsort_indexer, which converts the columns to ordered Categoricals, gets grouped lexicographic group indices via get_group_index, compressed these via compress_group_index and then sorts using counting sort or mergesort via get_group_index_sorter. Both of these are stable, as explained in get_group_index_sorter's docstring. Therefore, the sorting should be stable if sort_values is called with len(by)>1.

pep8speaks · 2020-12-12T17:31:40Z

Hello @jotasi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-22 18:13:05 UTC

arw2019

Do we have any tests to check that the sorting algos are stable?

If no could be something to add (not necessarily here)

jotasi · 2020-12-13T18:55:36Z

Good point!

There is a test that verifies that the counting sort is stable by comparing it to numpy's argsort with kind="mergesort", which is stable:

pandas/pandas/tests/test_algos.py

Lines 2118 to 2138 in e2dec8d

    
           def test_groupsort_indexer(): 
        
               a = np.random.randint(0, 1000, 100).astype(np.int64) 
        
               b = np.random.randint(0, 1000, 100).astype(np.int64) 
        
               result = libalgos.groupsort_indexer(a, 1000)[0] 
        
               # need to use a stable sort 
        
               # np.argsort returns int, groupsort_indexer 
        
               # always returns int64 
        
               expected = np.argsort(a, kind="mergesort") 
        
               expected = expected.astype(np.int64) 
        
               tm.assert_numpy_array_equal(result, expected) 
        
               # compare with lexsort 
        
               # np.lexsort returns int, groupsort_indexer 
        
               # always returns int64 
        
               key = a * 1000 + b 
        
               result = libalgos.groupsort_indexer(key, 1000000)[0] 
        
               expected = np.lexsort((b, a)) 
        
               expected = expected.astype(np.int64)

The mergesort in get_group_index_sorter is done by argsort with kind="mergesort".

There is also a test case that tests the lexsort_indexer, which tests that the None entries are sorted stably (no other duplicates are in the data though):

pandas/pandas/tests/test_sorting.py

Lines 104 to 124 in e2dec8d

    
           def test_lexsort_indexer(self): 
        
               keys = [[np.nan] * 5 + list(range(100)) + [np.nan] * 5] 
        
               # orders=True, na_position='last' 
        
               result = lexsort_indexer(keys, orders=True, na_position="last") 
        
               exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110)) 
        
               tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) 
        
               # orders=True, na_position='first' 
        
               result = lexsort_indexer(keys, orders=True, na_position="first") 
        
               exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105)) 
        
               tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) 
        
               # orders=False, na_position='last' 
        
               result = lexsort_indexer(keys, orders=False, na_position="last") 
        
               exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110)) 
        
               tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp)) 
        
               # orders=False, na_position='first' 
        
               result = lexsort_indexer(keys, orders=False, na_position="first") 
        
               exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1)) 
        
               tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

If you think it's worthwhile, I can also add a test case verifying that the multicolumn sort is stable directly. (I at least have not found any such test.)

arw2019

LGTM

arw2019 · 2020-12-21T16:37:33Z

Good point!

There is a test that verifies that the counting sort is stable by comparing it to numpy's argsort with kind="mergesort", which is stable:

pandas/pandas/tests/test_algos.py

Lines 2118 to 2138 in e2dec8d

def test_groupsort_indexer():

a = np.random.randint(0, 1000, 100).astype(np.int64)

b = np.random.randint(0, 1000, 100).astype(np.int64)

result = libalgos.groupsort_indexer(a, 1000)[0]

# need to use a stable sort

# np.argsort returns int, groupsort_indexer

# always returns int64

expected = np.argsort(a, kind="mergesort")

expected = expected.astype(np.int64)

tm.assert_numpy_array_equal(result, expected)

# compare with lexsort

# np.lexsort returns int, groupsort_indexer

# always returns int64

key = a * 1000 + b

result = libalgos.groupsort_indexer(key, 1000000)[0]

expected = np.lexsort((b, a))

expected = expected.astype(np.int64)

The mergesort in get_group_index_sorter is done by argsort with kind="mergesort".

There is also a test case that tests the lexsort_indexer, which tests that the None entries are sorted stably (no other duplicates are in the data though):

pandas/pandas/tests/test_sorting.py

Lines 104 to 124 in e2dec8d

def test_lexsort_indexer(self):

keys = [[np.nan] * 5 + list(range(100)) + [np.nan] * 5]

# orders=True, na_position='last'

result = lexsort_indexer(keys, orders=True, na_position="last")

exp = list(range(5, 105)) + list(range(5)) + list(range(105, 110))

tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

# orders=True, na_position='first'

result = lexsort_indexer(keys, orders=True, na_position="first")

exp = list(range(5)) + list(range(105, 110)) + list(range(5, 105))

tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

# orders=False, na_position='last'

result = lexsort_indexer(keys, orders=False, na_position="last")

exp = list(range(104, 4, -1)) + list(range(5)) + list(range(105, 110))

tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

# orders=False, na_position='first'

result = lexsort_indexer(keys, orders=False, na_position="first")

exp = list(range(5)) + list(range(105, 110)) + list(range(104, 4, -1))

tm.assert_numpy_array_equal(result, np.array(exp, dtype=np.intp))

If you think it's worthwhile, I can also add a test case verifying that the multicolumn sort is stable directly. (I at least have not found any such test.)

Thanks for looking into this! I think that's worth doing though not in this PR (it's always good to keep orthogonal tasks separate)

jreback · 2020-12-21T20:07:43Z

@jotasi can you add test for the multi-column sorting as well.

jotasi · 2020-12-22T14:07:08Z

@jreback Sure I can do that. There is already a test that according to its name should test this but doesn't really as it doesn't contain duplicated pairs:

pandas/pandas/tests/frame/methods/test_sort_values.py

Lines 220 to 241 in 4b7c4bb

    
           def test_sort_values_stable_descending_multicolumn_sort(self): 
        
               df = DataFrame( 
        
                   {"A": [1, 2, np.nan, 1, 6, 8, 4], "B": [9, np.nan, 5, 2, 5, 4, 5]} 
        
               ) 
        
               # test stable mergesort 
        
               expected = DataFrame( 
        
                   {"A": [np.nan, 8, 6, 4, 2, 1, 1], "B": [5, 4, 5, 5, np.nan, 2, 9]}, 
        
                   index=[2, 5, 4, 6, 1, 3, 0], 
        
               ) 
        
               sorted_df = df.sort_values( 
        
                   ["A", "B"], ascending=[0, 1], na_position="first", kind="mergesort" 
        
               ) 
        
               tm.assert_frame_equal(sorted_df, expected) 
        
               expected = DataFrame( 
        
                   {"A": [np.nan, 8, 6, 4, 2, 1, 1], "B": [5, 4, 5, 5, np.nan, 9, 2]}, 
        
                   index=[2, 5, 4, 6, 1, 0, 3], 
        
               ) 
        
               sorted_df = df.sort_values( 
        
                   ["A", "B"], ascending=[0, 0], na_position="first", kind="mergesort" 
        
               ) 
        
               tm.assert_frame_equal(sorted_df, expected)

I'll just adapt that one.

Should I open a new PR as @arw2019 suggested or should I do it here?

jreback · 2020-12-22T14:20:48Z

@jotasi happy in this PR

pandas/tests/frame/methods/test_sort_values.py

arw2019 · 2020-12-22T18:07:58Z

pandas/tests/frame/methods/test_sort_values.py

+    @pytest.mark.parametrize("na_position", ["first", "last"])
+    def test_sort_values_stable_multicolumn_sort(
+        self, expected_idx_non_na, ascending, na_position
+    ):
        df = DataFrame(


add a comment with the GH id of this pr or the issue here and i'd say this is done

Ok. I've added a reference to this PR.

jreback

lgtm ping on greenish

jotasi · 2020-12-22T20:03:03Z

@jreback Thanks. The checks are done now. The Travis build is failing. But it looks like this is unrelated to the changes here.

jreback · 2020-12-22T20:33:06Z

thanks @jotasi very nice, keep em coming!

…iple columns or labels (pandas-dev#38426)

jotasi force-pushed the clarify_doc_sort_values_kind branch from 6f89c68 to e42a008 Compare December 12, 2020 17:36

DOC: Clarify that DataFrame.sort_values is stable for len(by)>1

f1171a8

jotasi force-pushed the clarify_doc_sort_values_kind branch from e42a008 to f1171a8 Compare December 12, 2020 17:38

arw2019 reviewed Dec 12, 2020

View reviewed changes

jotasi requested a review from arw2019 December 21, 2020 16:19

arw2019 approved these changes Dec 21, 2020

View reviewed changes

jreback added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 21, 2020

jreback added this to the 1.3 milestone Dec 21, 2020

TST: Update test case to show multilabel sort is stable

2e24f69

arw2019 reviewed Dec 22, 2020

View reviewed changes

pandas/tests/frame/methods/test_sort_values.py Outdated Show resolved Hide resolved

TST: Parametrize sort options in multicol sort_values test

2addf1d

arw2019 reviewed Dec 22, 2020

View reviewed changes

pandas/tests/frame/methods/test_sort_values.py Show resolved Hide resolved

Jonathan Tammo Siebert added 2 commits December 22, 2020 18:31

TST: Simplified parametrization

da6b97a

TST: Removed parametrization IDs

33c1a93

arw2019 reviewed Dec 22, 2020

View reviewed changes

TST: Added reference to github PR

911745a

jreback approved these changes Dec 22, 2020

View reviewed changes

jreback merged commit 254e260 into pandas-dev:master Dec 22, 2020

jotasi deleted the clarify_doc_sort_values_kind branch December 22, 2020 20:34

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

DOC: Clarify that DataFrame.sort_values is stable for sorting by mult…

1aa9181

…iple columns or labels (pandas-dev#38426)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426

DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426

Uh oh!

jotasi commented Dec 12, 2020 •

edited

Loading

Uh oh!

pep8speaks commented Dec 12, 2020 •

edited

Loading

Uh oh!

arw2019 left a comment

Uh oh!

jotasi commented Dec 13, 2020 •

edited

Loading

Uh oh!

arw2019 left a comment

Uh oh!

arw2019 commented Dec 21, 2020

Uh oh!

jreback commented Dec 21, 2020

Uh oh!

jotasi commented Dec 22, 2020

Uh oh!

jreback commented Dec 22, 2020

Uh oh!

Uh oh!

Uh oh!

arw2019 Dec 22, 2020

Uh oh!

jotasi Dec 22, 2020

Uh oh!

jreback left a comment

Uh oh!

jotasi commented Dec 22, 2020

Uh oh!

jreback commented Dec 22, 2020

Uh oh!

Uh oh!

Uh oh!

DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426

DOC: Clarify that DataFrame.sort_values is stable for sorting by multiple columns or labels #38426

Uh oh!

Conversation

jotasi commented Dec 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pep8speaks commented Dec 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2020-12-22 18:13:05 UTC

Uh oh!

arw2019 left a comment

Choose a reason for hiding this comment

Uh oh!

jotasi commented Dec 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arw2019 left a comment

Choose a reason for hiding this comment

Uh oh!

arw2019 commented Dec 21, 2020

Uh oh!

jreback commented Dec 21, 2020

Uh oh!

jotasi commented Dec 22, 2020

Uh oh!

jreback commented Dec 22, 2020

Uh oh!

Uh oh!

Uh oh!

arw2019 Dec 22, 2020

Choose a reason for hiding this comment

Uh oh!

jotasi Dec 22, 2020

Choose a reason for hiding this comment

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

jotasi commented Dec 22, 2020

Uh oh!

jreback commented Dec 22, 2020

Uh oh!

Uh oh!

jotasi commented Dec 12, 2020 •

edited

Loading

pep8speaks commented Dec 12, 2020 •

edited

Loading

jotasi commented Dec 13, 2020 •

edited

Loading