-
Notifications
You must be signed in to change notification settings - Fork 13.4k
Support for NFD and NFKD #8445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for NFD and NFKD #8445
Conversation
Wonderful! Any chance of throwing in NFKC? |
Pardon my ignorance, but are the added combining tables enough information for splitting a string into grapheme clusters? |
@graydon Yes and No. I definitely want to add support for NFC and NFKC in the not-so-far future. But Unicode Composition is a bit more involved than Decomposition and my time is a bit limited right now. So that would be in a later pull request, if that's okay. @Kimundi From a quick glance at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries it seems to me that this also requires at least the Grapheme_Cluster_Break property. |
Certainly. Just curious. |
Don't we need access to non-allocating normalizations when they are used for equality test? |
Right, if you could write those normalizations to work as a |
@blake2-ppc @Kimundi Would you elaborate a bit what you mean exactly by "without allocating". |
@Florob Hm, if normalization involves reordering an unbounded number of characters, then you might indeed need an allocation in there. Still, would be nice if that is all that's necessary: One dynamic vector as part of the state. And if we ever get not-terrible smallvectors, we could use those, so that for normal lengths of combining characters, it doesn't need an allocation at all. |
fn test_nfkd() { | ||
assert_eq!("abc".nfd(), ~"abc"); | ||
assert_eq!("\u2026".nfd(), ~"..."); | ||
assert_eq!("\u2126".nfd(), ~"\u03a9"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like you are calling nfd
in the nfkd
tests.
Updated to provide an iterator instead. The code is much more complex than I'd like it to be, so I'm happy to see any constructive criticism and suggestions. |
I have to say it's entirely unclear to me why this failed the tests (twice). The output looks like failures unrelated to this PR to me. Am I missing something? |
@Florob as far as I can tell, it looks like the failure is "just" a transient LLVM issue, since there've been successful builds on that slave ( |
This adds support for performing Unicode Normalization Forms D and KD on strings. To enable this the decomposition and canonical combining class properties are added to std::unicode. On my system this increases libstd's size by ~250KiB.
Llint for casting between raw slice pointers with different element sizes This lint disallows using `as` to convert from a raw pointer to a slice (e.g. `*const [i32]`, `*mut [Foo]`) to any other raw pointer to a slice if the element types have different sizes. When a raw slice pointer is cast, the data pointer and count metadata are preserved. This means that when the size of the inner slice's element type changes, the total number of bytes pointed to by the count changes. For example a `*const [i32]` with length 4 (four `i32` elements) is cast `as *const [u8]` the resulting pointer points to four `u8` elements at the same address, losing most of the data. When the size *increases* the resulting pointer will point to *more* data, and accessing that data will be UB. On its own, *producing* the pointer isn't actually a problem, but because any use of the pointer as a slice will either produce surprising behavior or cause UB I believe this is a correctness lint. If the pointer is not intended to be used as a slice, the user should instead use any of a number of methods to produce just a data pointer including an `as` cast to a thin pointer (e.g. `p as *const i32`) or if the pointer is being created from a slice, the `as_ptr` method on slices. Detecting the intended use of the pointer is outside the scope of this lint, but I believe this lint will also lead users to realize that a slice pointer is only for slices. There is an exception to this lint when either of the slice element types are zero sized (e.g `*mut [()]`). The total number of bytes pointed to by the slice with a zero sized element is zero. In that case preserving the length metadata is likely intended as a workaround to get the length metadata of a slice pointer though a zero sized slice. The lint does not forbid casting pointers to slices with the *same* element size as the cast was likely intended to reinterpret the data in the slice as some equivalently sized data and the resulting pointer will behave as intended. --- changelog: Added ``[`cast_slice_different_sizes`]``, a lint that disallows using `as`-casts to convert between raw pointers to slices when the elements have different sizes.
…endoo,xFrednet fix ICE in `cast_slice_different_sizes` fixes rust-lang#8708 changelog: fixes an ICE introduced in rust-lang#8445
This adds support for performing Unicode Normalization Forms D and KD on strings.
To enable this the decomposition and canonical combining class properties are added to std::unicode.
On my system this increases libstd's size by ~250KiB.