
Support for NFD and NFKD #8445


Closed
wants to merge 3 commits into from

Conversation

@Florob (Contributor) commented Aug 11, 2013

This adds support for performing Unicode Normalization Forms D and KD on strings.
To enable this the decomposition and canonical combining class properties are added to std::unicode.
On my system this increases libstd's size by ~250KiB.
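To illustrate the difference between the two forms (this is a toy sketch, not the PR's implementation): NFD applies only canonical decompositions, while NFKD additionally applies compatibility decompositions. The hand-written mappings below stand in for the generated `std::unicode` tables, and a real implementation would also decompose recursively and apply canonical ordering.

```rust
// Toy table: U+2126 OHM SIGN decomposes *canonically* to U+03A9 GREEK
// CAPITAL LETTER OMEGA (changed by both NFD and NFKD), while U+2026
// HORIZONTAL ELLIPSIS has only a *compatibility* decomposition to "..."
// (changed by NFKD alone).
fn decompose(c: char, compat: bool) -> Vec<char> {
    match c {
        '\u{2126}' => vec!['\u{03A9}'],
        '\u{2026}' if compat => vec!['.', '.', '.'],
        _ => vec![c],
    }
}

fn normalize(s: &str, compat: bool) -> String {
    s.chars().flat_map(|c| decompose(c, compat)).collect()
}

fn main() {
    assert_eq!(normalize("\u{2126}", false), "\u{03A9}"); // NFD changes it
    assert_eq!(normalize("\u{2026}", false), "\u{2026}"); // NFD: unchanged
    assert_eq!(normalize("\u{2026}", true), "...");       // NFKD changes it
    println!("ok");
}
```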

@graydon (Contributor) commented Aug 11, 2013

Wonderful! Any chance of throwing in NFKC?

@Kimundi (Member) commented Aug 11, 2013

Pardon my ignorance, but are the added combining tables enough information for splitting a string into grapheme clusters?

@Florob (Contributor, Author) commented Aug 11, 2013

@graydon Yes and No. I definitely want to add support for NFC and NFKC in the not-so-far future. But Unicode Composition is a bit more involved than Decomposition and my time is a bit limited right now. So that would be in a later pull request, if that's okay.

@Kimundi From a quick glance at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries it seems to me that this also requires at least the Grapheme_Cluster_Break property.

@graydon (Contributor) commented Aug 11, 2013

Certainly. Just curious.

@bluss (Member) commented Aug 11, 2013

Don't we need access to non-allocating normalizations when they are used for equality tests?

@Kimundi (Member) commented Aug 11, 2013

Right, if you could write those normalizations to work as a char iterator, you could then use them for comparison without allocating, and for lazy calculation of a string's normalized form.
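A lazy, iterator-based comparison could look roughly like this. Note this is a sketch: `nfd_chars()` is a hypothetical adaptor name (the PR's eventual API may differ), and plain `chars()` stands in for it in the demo.

```rust
// Minimal sketch: equality of two char streams without materializing
// owned strings. `Iterator::eq` compares element-wise and short-circuits,
// so the comparison itself performs no allocation.
fn eq_by_chars<A, B>(a: A, b: B) -> bool
where
    A: Iterator<Item = char>,
    B: Iterator<Item = char>,
{
    a.eq(b)
}

fn main() {
    // With a real normalizing iterator one would write something like
    // `s.nfd_chars().eq(t.nfd_chars())` (hypothetical name).
    assert!(eq_by_chars("abc".chars(), "abc".chars()));
    assert!(!eq_by_chars("abc".chars(), "abd".chars()));
    println!("ok");
}
```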

@Florob (Contributor, Author) commented Aug 12, 2013

@blake2-ppc @Kimundi Would you elaborate a bit on what exactly you mean by "without allocating"?
Rewriting this as an iterator would be possible, but I think I would still need to allocate some memory in the iterator to be able to sort the combining characters (i.e. the iterator might have quite a bit of state, including a dynamically sized vector).
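For reference, the step that buffer is needed for is the Canonical Ordering Algorithm from UAX #15. A standalone sketch (not the PR's actual code), with hard-coded combining-class values standing in for the real `std::unicode` tables:

```rust
// Canonical Ordering (UAX #15): after decomposition, maximal runs of
// characters with non-zero canonical combining class (ccc) are stably
// sorted by ccc. Starters (ccc == 0) act as barriers between runs.
fn canonical_sort(buf: &mut [(char, u8)]) {
    let mut i = 0;
    while i < buf.len() {
        if buf[i].1 == 0 {
            i += 1;
            continue;
        }
        let start = i;
        while i < buf.len() && buf[i].1 != 0 {
            i += 1;
        }
        // `sort_by_key` is stable, preserving order among equal classes.
        buf[start..i].sort_by_key(|&(_, ccc)| ccc);
    }
}

fn main() {
    // 'q' + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW
    // (ccc 220): the dot below must reorder before the acute.
    let mut buf = [('q', 0u8), ('\u{0301}', 230), ('\u{0323}', 220)];
    canonical_sort(&mut buf);
    assert_eq!(buf, [('q', 0), ('\u{0323}', 220), ('\u{0301}', 230)]);
    println!("ok");
}
```

Since a run of non-starters can in principle be arbitrarily long, the buffer backing such a run is the allocation discussed above.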

@Kimundi (Member) commented Aug 12, 2013

@Florob Hm, if normalization involves reordering an unbounded number of characters, then you might indeed need an allocation in there. Still, it would be nice if that were all that's necessary: one dynamic vector as part of the state.

And if we ever get not-terrible smallvectors, we could use those, so that for typical runs of combining characters no allocation would be needed at all.

    fn test_nfkd() {
        assert_eq!("abc".nfd(), ~"abc");
        assert_eq!("\u2026".nfd(), ~"...");
        assert_eq!("\u2126".nfd(), ~"\u03a9");

It looks like you are calling nfd in the nfkd tests.

@Florob (Contributor, Author) commented Aug 15, 2013

Updated to provide an iterator instead. The code is much more complex than I'd like it to be, so I'm happy to see any constructive criticism and suggestions.

@Florob (Contributor, Author) commented Aug 21, 2013

I have to say it's entirely unclear to me why this failed the tests (twice). The output looks like failures unrelated to this PR. Am I missing something?
The "upside" is that I had a shame-filled epiphany yesterday: I had forgotten to implement decomposition of Hangul syllables, a special case that is handled algorithmically rather than via tables. I have now added support for that.
r? @graydon
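The Hangul special case is purely arithmetic (UAX #15, Hangul Syllable Decomposition): precomposed syllables in the block starting at U+AC00 decompose into their leading consonant, vowel, and optional trailing consonant jamo by index arithmetic. A standalone sketch using the constants from the spec, not the PR's actual code:

```rust
// Constants from UAX #15: the precomposed-syllable block and the jamo
// base code points. N_COUNT = V_COUNT * T_COUNT, S_COUNT = L_COUNT * N_COUNT.
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = 21 * T_COUNT; // 588
const S_COUNT: u32 = 19 * N_COUNT; // 11172

/// Decomposes a precomposed Hangul syllable into its jamo,
/// or returns None if `s` is not in the syllable block.
fn decompose_hangul(s: char) -> Option<Vec<char>> {
    let si = (s as u32).wrapping_sub(S_BASE);
    if si >= S_COUNT {
        return None;
    }
    let l = char::from_u32(L_BASE + si / N_COUNT)?;
    let v = char::from_u32(V_BASE + (si % N_COUNT) / T_COUNT)?;
    let mut out = vec![l, v];
    let ti = si % T_COUNT;
    if ti != 0 {
        out.push(char::from_u32(T_BASE + ti)?);
    }
    Some(out)
}

fn main() {
    // U+D55C HANGUL SYLLABLE HAN decomposes to U+1112 U+1161 U+11AB.
    assert_eq!(
        decompose_hangul('\u{D55C}'),
        Some(vec!['\u{1112}', '\u{1161}', '\u{11AB}'])
    );
    assert_eq!(decompose_hangul('a'), None);
    println!("ok");
}
```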

@huonw (Member) commented Aug 21, 2013

@Florob as far as I can tell, it looks like the failure is "just" a transient LLVM issue, since there've been successful builds on that slave (win2) since this PR.

bors added a commit that referenced this pull request Aug 21, 2013
@bors bors closed this Aug 21, 2013
flip1995 pushed a commit to flip1995/rust that referenced this pull request Mar 14, 2022
Lint for casting between raw slice pointers with different element sizes

This lint disallows using `as` to convert from a raw pointer to a slice (e.g. `*const [i32]`, `*mut [Foo]`) to any other raw pointer to a slice if the element types have different sizes. When a raw slice pointer is cast, the data pointer and count metadata are preserved. This means that when the size of the inner slice's element type changes, the total number of bytes pointed to by the count changes. For example, when a `*const [i32]` with length 4 (four `i32` elements) is cast `as *const [u8]`, the resulting pointer points to four `u8` elements at the same address, losing most of the data. When the size *increases*, the resulting pointer points to *more* data, and accessing that data will be UB.

On its own, *producing* the pointer isn't actually a problem, but because any use of the pointer as a slice will either produce surprising behavior or cause UB, I believe this is a correctness lint. If the pointer is not intended to be used as a slice, the user should instead use one of a number of methods to produce just a data pointer, including an `as` cast to a thin pointer (e.g. `p as *const i32`) or, if the pointer is being created from a slice, the `as_ptr` method on slices. Detecting the intended use of the pointer is outside the scope of this lint, but I believe this lint will also lead users to realize that a slice pointer is only for slices.

There is an exception to this lint when either of the slice element types is zero sized (e.g. `*mut [()]`). The total number of bytes pointed to by a slice with a zero-sized element is zero. In that case, preserving the length metadata is likely intended as a workaround to obtain the length metadata of a slice pointer through a zero-sized slice.

The lint does not forbid casting between slice pointers with the *same* element size, as such a cast was likely intended to reinterpret the data in the slice as equivalently sized data, and the resulting pointer will behave as intended.
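The length-metadata behavior described above can be demonstrated directly. A minimal sketch (the cast compiles today, which is exactly why the lint exists):

```rust
// Casting between raw slice pointers preserves the length metadata
// verbatim, so the number of *bytes* the slice covers changes with the
// element size.
fn main() {
    let data: [i32; 4] = [1, 2, 3, 4];
    let p: *const [i32] = &data[..]; // fat pointer: (addr, len = 4)
    let q = p as *const [u8];        // still len = 4, but now 4 *bytes*

    // Reading through `q` is sound here only because the first 4 bytes
    // of `data` are initialized; the new slice covers 4 bytes instead
    // of 16, silently dropping most of the data.
    let bytes: &[u8] = unsafe { &*q };
    assert_eq!(bytes.len(), 4);
    assert_eq!(std::mem::size_of_val(bytes), 4); // was 16 for the i32 slice
    println!("ok");
}
```

Casting the other way (`*const [u8]` with length 4 to `*const [i32]`) would instead claim 16 bytes, and reading past the original 4 is UB.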

---

changelog: Added ``[`cast_slice_different_sizes`]``, a lint that disallows using `as`-casts to convert between raw pointers to slices when the elements have different sizes.
flip1995 pushed a commit to flip1995/rust that referenced this pull request May 5, 2022

fix ICE in `cast_slice_different_sizes`

fixes rust-lang#8708

changelog: fixes an ICE introduced in rust-lang#8445
6 participants