Skip to content

Add string comparison support so that x.sort() works #37

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 23, 2023

Conversation

peytondmurray
Copy link
Contributor

This PR builds on numpy/numpy#23173 to provide a PyArray_CompareFunc so that x.sort() works.

I also updated the README with a bit of information about needing the nightly numpy wheel for support, added a note about --no-build-isolation, and removed a bit about meson-python not supporting editable builds (it does now!).

This PR remains in draft until numpy/numpy#23173 is merged.

@ngoldbaum
Copy link
Member

Can we get a test too?

@ngoldbaum
Copy link
Member

It occurs to me that it may not be correct to rely on strcmp to reliably sort utf-8 encoded data in a sensible way. It may be a good idea to look up edge cases related to sorting unicode strings to see how this approach handles them.

@mattip
Copy link
Member

mattip commented Feb 9, 2023

Unicode string comparison is tricky. There are lots of edge cases.

@peytondmurray
Copy link
Contributor Author

Yep, you're both right - strcmp will work for the ASCII dataset but a more robust approach seems important here. I'll spend some time looking into this.

@peytondmurray peytondmurray force-pushed the string-comparison branch 3 times, most recently from 681a2a1 to bce48f6 Compare February 22, 2023 04:50
@peytondmurray
Copy link
Contributor Author

After some discussion I've made the choice to sort strings by their UCS4 code points. This is the approach that python takes by default, and if you want the more advanced sorting methods required for unicode comparison, you can use the unicodedata package.

I still think it would be cool to have vectorized functions for normalization and sorting, but we can introduce those later, since you can't provide the options you would need to do that for arr.sort() here anyway.

@peytondmurray peytondmurray changed the title [WIP] Add string comparison support so that x.sort() works Add string comparison support so that x.sort() works Feb 22, 2023
@peytondmurray peytondmurray marked this pull request as ready for review February 22, 2023 05:05
@ngoldbaum
Copy link
Member

Besides the inline comment, can you also increment the API version in the other dtypes? You'll need to do that anyway before this is merged to get the tests to pass.

@peytondmurray
Copy link
Contributor Author

Thanks for the feedback - I've updated the API versions everywhere.

@ngoldbaum ngoldbaum merged commit 988da4b into numpy:main Feb 23, 2023
@peytondmurray peytondmurray deleted the string-comparison branch February 23, 2023 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants