
Support for NFD and NFKD #8445


Closed
wants to merge 3 commits into from

Conversation

@Florob (Contributor) commented Aug 11, 2013

This adds support for performing Unicode Normalization Forms D and KD on strings.
To enable this the decomposition and canonical combining class properties are added to std::unicode.
On my system this increases libstd's size by ~250KiB.
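To illustrate the difference between the two forms (this is a toy sketch, not the PR's implementation): NFD applies only canonical decompositions, while NFKD additionally applies compatibility decompositions. The hand-written mappings below stand in for the generated `std::unicode` tables, and a real implementation would also decompose recursively and apply canonical ordering.

```rust
// Toy table: U+2126 OHM SIGN decomposes *canonically* to U+03A9 GREEK
// CAPITAL LETTER OMEGA (changed by both NFD and NFKD), while U+2026
// HORIZONTAL ELLIPSIS has only a *compatibility* decomposition to "..."
// (changed by NFKD alone).
fn decompose(c: char, compat: bool) -> Vec<char> {
    match c {
        '\u{2126}' => vec!['\u{03A9}'],
        '\u{2026}' if compat => vec!['.', '.', '.'],
        _ => vec![c],
    }
}

fn normalize(s: &str, compat: bool) -> String {
    s.chars().flat_map(|c| decompose(c, compat)).collect()
}

fn main() {
    assert_eq!(normalize("\u{2126}", false), "\u{03A9}"); // NFD changes it
    assert_eq!(normalize("\u{2026}", false), "\u{2026}"); // NFD: unchanged
    assert_eq!(normalize("\u{2026}", true), "...");       // NFKD changes it
    println!("ok");
}
```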

@graydon (Contributor) commented Aug 11, 2013

Wonderful! Any chance of throwing in NFKC?

@Kimundi (Member) commented Aug 11, 2013

Pardon my ignorance, but are the added combining tables enough information for splitting a string into grapheme clusters?

@Florob (Contributor, Author) commented Aug 11, 2013

@graydon Yes and No. I definitely want to add support for NFC and NFKC in the not-so-far future. But Unicode Composition is a bit more involved than Decomposition and my time is a bit limited right now. So that would be in a later pull request, if that's okay.

@Kimundi From a quick glance at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries it seems to me that this also requires at least the Grapheme_Cluster_Break property.

@graydon (Contributor) commented Aug 11, 2013

Certainly. Just curious.

@bluss (Member) commented Aug 11, 2013

Don't we need access to non-allocating normalizations when they are used for equality tests?

@Kimundi (Member) commented Aug 11, 2013

Right, if you could write those normalizations to work as a char iterator, you could then use them for comparison without allocating, and for lazy calculation of a string's normalized form.
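A lazy, iterator-based comparison could look roughly like this. Note this is a sketch: `nfd_chars()` is a hypothetical adaptor name (the PR's eventual API may differ), and plain `chars()` stands in for it in the demo.

```rust
// Minimal sketch: equality of two char streams without materializing
// owned strings. `Iterator::eq` compares element-wise and short-circuits,
// so the comparison itself performs no allocation.
fn eq_by_chars<A, B>(a: A, b: B) -> bool
where
    A: Iterator<Item = char>,
    B: Iterator<Item = char>,
{
    a.eq(b)
}

fn main() {
    // With a real normalizing iterator one would write something like
    // `s.nfd_chars().eq(t.nfd_chars())` (hypothetical name).
    assert!(eq_by_chars("abc".chars(), "abc".chars()));
    assert!(!eq_by_chars("abc".chars(), "abd".chars()));
    println!("ok");
}
```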

@Florob (Contributor, Author) commented Aug 12, 2013

@blake2-ppc @Kimundi Would you elaborate a bit on what exactly you mean by "without allocating"?
Rewriting this as an iterator would be possible, but I think I would still need to allocate some memory in the iterator to be able to sort the combining characters (i.e. the iterator might have quite a bit of state, including a dynamically sized vector).
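For reference, the step that buffer is needed for is the Canonical Ordering Algorithm from UAX #15. A standalone sketch (not the PR's actual code), with hard-coded combining-class values standing in for the real `std::unicode` tables:

```rust
// Canonical Ordering (UAX #15): after decomposition, maximal runs of
// characters with non-zero canonical combining class (ccc) are stably
// sorted by ccc. Starters (ccc == 0) act as barriers between runs.
fn canonical_sort(buf: &mut [(char, u8)]) {
    let mut i = 0;
    while i < buf.len() {
        if buf[i].1 == 0 {
            i += 1;
            continue;
        }
        let start = i;
        while i < buf.len() && buf[i].1 != 0 {
            i += 1;
        }
        // `sort_by_key` is stable, preserving order among equal classes.
        buf[start..i].sort_by_key(|&(_, ccc)| ccc);
    }
}

fn main() {
    // 'q' + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW
    // (ccc 220): the dot below must reorder before the acute.
    let mut buf = [('q', 0u8), ('\u{0301}', 230), ('\u{0323}', 220)];
    canonical_sort(&mut buf);
    assert_eq!(buf, [('q', 0), ('\u{0323}', 220), ('\u{0301}', 230)]);
    println!("ok");
}
```

Since a run of non-starters can in principle be arbitrarily long, the buffer backing such a run is the allocation discussed above.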

@Kimundi (Member) commented Aug 12, 2013

@Florob Hm, if normalization involves reordering an unbounded number of characters, then you might indeed need an allocation in there. Still, it would be nice if that were all that's necessary: one dynamic vector as part of the state.

And if we ever get not-terrible smallvectors, we could use those, so that for typical runs of combining characters no allocation would be needed at all.

    fn test_nfkd() {
        assert_eq!("abc".nfd(), ~"abc");
        assert_eq!("\u2026".nfd(), ~"...");
        assert_eq!("\u2126".nfd(), ~"\u03a9");

It looks like you are calling nfd in the nfkd tests.

@Florob (Contributor, Author) commented Aug 15, 2013

Updated to provide an iterator instead. The code is much more complex than I'd like it to be, so I'm happy to see any constructive criticism and suggestions.

@Florob (Contributor, Author) commented Aug 21, 2013

I have to say it's entirely unclear to me why this failed the tests (twice). The output looks like failures unrelated to this PR. Am I missing something?
The "upside" is that I had a shame-filled epiphany yesterday: I had forgotten to implement decomposition of Hangul syllables, a special case that is handled algorithmically rather than via tables. I have now added support for that.
r? @graydon
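The Hangul special case is purely arithmetic (UAX #15, Hangul Syllable Decomposition): precomposed syllables in the block starting at U+AC00 decompose into their leading consonant, vowel, and optional trailing consonant jamo by index arithmetic. A standalone sketch using the constants from the spec, not the PR's actual code:

```rust
// Constants from UAX #15: the precomposed-syllable block and the jamo
// base code points. N_COUNT = V_COUNT * T_COUNT, S_COUNT = L_COUNT * N_COUNT.
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = 21 * T_COUNT; // 588
const S_COUNT: u32 = 19 * N_COUNT; // 11172

/// Decomposes a precomposed Hangul syllable into its jamo,
/// or returns None if `s` is not in the syllable block.
fn decompose_hangul(s: char) -> Option<Vec<char>> {
    let si = (s as u32).wrapping_sub(S_BASE);
    if si >= S_COUNT {
        return None;
    }
    let l = char::from_u32(L_BASE + si / N_COUNT)?;
    let v = char::from_u32(V_BASE + (si % N_COUNT) / T_COUNT)?;
    let mut out = vec![l, v];
    let ti = si % T_COUNT;
    if ti != 0 {
        out.push(char::from_u32(T_BASE + ti)?);
    }
    Some(out)
}

fn main() {
    // U+D55C HANGUL SYLLABLE HAN decomposes to U+1112 U+1161 U+11AB.
    assert_eq!(
        decompose_hangul('\u{D55C}'),
        Some(vec!['\u{1112}', '\u{1161}', '\u{11AB}'])
    );
    assert_eq!(decompose_hangul('a'), None);
    println!("ok");
}
```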

@huonw (Member) commented Aug 21, 2013

@Florob as far as I can tell, it looks like the failure is "just" a transient LLVM issue, since there've been successful builds on that slave (win2) since this PR.

bors added a commit that referenced this pull request Aug 21, 2013
@bors bors closed this Aug 21, 2013
flip1995 pushed a commit to flip1995/rust that referenced this pull request Mar 14, 2022
Lint for casting between raw slice pointers with different element sizes

This lint disallows using `as` to convert from a raw pointer to a slice (e.g. `*const [i32]`, `*mut [Foo]`) to any other raw pointer to a slice if the element types have different sizes. When a raw slice pointer is cast, the data pointer and count metadata are preserved. This means that when the size of the inner slice's element type changes, the total number of bytes pointed to by the count changes. For example, when a `*const [i32]` with length 4 (four `i32` elements) is cast `as *const [u8]`, the resulting pointer points to four `u8` elements at the same address, losing most of the data. When the size *increases*, the resulting pointer points to *more* data, and accessing that data will be UB.

On its own, *producing* the pointer isn't actually a problem, but because any use of the pointer as a slice will either produce surprising behavior or cause UB, I believe this is a correctness lint. If the pointer is not intended to be used as a slice, the user should instead use one of a number of methods to produce just a data pointer, including an `as` cast to a thin pointer (e.g. `p as *const i32`) or, if the pointer is being created from a slice, the `as_ptr` method on slices. Detecting the intended use of the pointer is outside the scope of this lint, but I believe this lint will also lead users to realize that a slice pointer is only for slices.

There is an exception to this lint when either of the slice element types is zero sized (e.g. `*mut [()]`). The total number of bytes pointed to by a slice with a zero-sized element is zero. In that case, preserving the length metadata is likely intended as a workaround to obtain the length metadata of a slice pointer through a zero-sized slice.

The lint does not forbid casting between slice pointers with the *same* element size, as such a cast was likely intended to reinterpret the data in the slice as equivalently sized data, and the resulting pointer will behave as intended.
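The length-metadata behavior described above can be demonstrated directly. A minimal sketch (the cast compiles today, which is exactly why the lint exists):

```rust
// Casting between raw slice pointers preserves the length metadata
// verbatim, so the number of *bytes* the slice covers changes with the
// element size.
fn main() {
    let data: [i32; 4] = [1, 2, 3, 4];
    let p: *const [i32] = &data[..]; // fat pointer: (addr, len = 4)
    let q = p as *const [u8];        // still len = 4, but now 4 *bytes*

    // Reading through `q` is sound here only because the first 4 bytes
    // of `data` are initialized; the new slice covers 4 bytes instead
    // of 16, silently dropping most of the data.
    let bytes: &[u8] = unsafe { &*q };
    assert_eq!(bytes.len(), 4);
    assert_eq!(std::mem::size_of_val(bytes), 4); // was 16 for the i32 slice
    println!("ok");
}
```

Casting the other way (`*const [u8]` with length 4 to `*const [i32]`) would instead claim 16 bytes, and reading past the original 4 is UB.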

---

changelog: Added ``[`cast_slice_different_sizes`]``, a lint that disallows using `as`-casts to convert between raw pointers to slices when the elements have different sizes.
flip1995 pushed a commit to flip1995/rust that referenced this pull request May 5, 2022

fix ICE in `cast_slice_different_sizes`

fixes rust-lang#8708

changelog: fixes an ICE introduced in rust-lang#8445
6 participants