Unicode case folding, caseless matching, and iterator methods #19277

Closed
@SimonSapin

Description

I made https://github.com/SimonSapin/rust-casefold for Servo, because the HTML spec requires “compatibility caseless matching”. Some of it might be interesting to have in libunicode/libcollections. @aturon, @alexcrichton, how much do you think is appropriate to include? I’d like your input before I prepare a PR (and have to deal with Rust bootstrapping).

zip_all and iter_eq are two generic functions (independent of Unicode) that could be default methods of Iterator. The former is like i.zip(j).all(f), but also returns false if the two iterators have different lengths. The latter (which uses the former) checks that the iterators have the same content. That is, it is equivalent to i.collect::<Vec<_>>() == j.collect::<Vec<_>>(), but compares elements one by one and does not allocate. (It also stops at the first difference rather than consuming both iterators until the end.)
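As a rough illustration of the two helpers described above, here is a minimal sketch written as free functions rather than Iterator methods. The signatures and trait bounds are assumptions made for this example and are not necessarily those used in rust-casefold.

```rust
/// Like `i.zip(j).all(f)`, but also returns `false` when one iterator
/// ends before the other. Stops at the first pair for which `f` fails.
fn zip_all<I, J, F>(i: I, j: J, mut f: F) -> bool
where
    I: IntoIterator,
    J: IntoIterator,
    F: FnMut(I::Item, J::Item) -> bool,
{
    let mut i = i.into_iter();
    let mut j = j.into_iter();
    loop {
        match (i.next(), j.next()) {
            // Both iterators still have items: test the pair.
            (Some(a), Some(b)) => {
                if !f(a, b) {
                    return false;
                }
            }
            // Both ended at the same time: every pair passed.
            (None, None) => return true,
            // Exactly one ended: the lengths differ.
            _ => return false,
        }
    }
}

/// Element-by-element equality without collecting into a `Vec`.
fn iter_eq<I, J>(i: I, j: J) -> bool
where
    I: IntoIterator,
    J: IntoIterator,
    I::Item: PartialEq<J::Item>,
{
    zip_all(i, j, |a, b| a == b)
}
```

With these definitions, iter_eq("abc".chars(), "abc".chars()) holds, while comparing "abc" with "abcd" fails on the length check instead of silently ignoring the extra item as a plain zip would.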

Case folding is fairly straightforward. The data could be generated with src/etc/unicode.py and kept in src/libunicode/tables.rs, like existing Unicode data.
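To make the table-driven approach concrete, here is a small sketch of case folding over an arbitrary char iterator. FOLD_TABLE and case_fold are hypothetical names, and the table is a tiny hand-written excerpt of full case folding mappings standing in for generated data. ASCII letters are handled by to_ascii_lowercase, and any other character missing from the excerpt passes through unchanged, so this is not a complete implementation.

```rust
/// Hypothetical excerpt of a case folding table, sorted by code point.
/// A real table would be generated from the Unicode data files.
static FOLD_TABLE: &[(char, &[char])] = &[
    ('\u{DF}', &['s', 's']),        // LATIN SMALL LETTER SHARP S
    ('\u{130}', &['i', '\u{307}']), // LATIN CAPITAL LETTER I WITH DOT ABOVE
    ('\u{17F}', &['s']),            // LATIN SMALL LETTER LONG S
    ('\u{3A3}', &['\u{3C3}']),      // GREEK CAPITAL LETTER SIGMA
    ('\u{3C2}', &['\u{3C3}']),      // GREEK SMALL LETTER FINAL SIGMA
];

/// Lazily case-folds a stream of chars without building a `String`.
fn case_fold<I>(chars: I) -> impl Iterator<Item = char>
where
    I: Iterator<Item = char>,
{
    chars.flat_map(|c| {
        // Either a (possibly multi-char) table mapping, or one fallback char.
        let (mapped, single): (&'static [char], Option<char>) =
            match FOLD_TABLE.binary_search_by_key(&c, |&(k, _)| k) {
                Ok(idx) => (FOLD_TABLE[idx].1, None),
                // Not in the excerpt: fold ASCII, leave everything else as-is.
                Err(_) => (&[], Some(c.to_ascii_lowercase())),
            };
        mapped.iter().copied().chain(single)
    })
}
```

Because one character can fold to several, the result is exposed as an iterator rather than a char-to-char function.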

Caseless matching, however, is more complex: there are different variants of it. Other than the “default” variant, they require NFD and NFKD normalization. libunicode already has nfd_chars and nfkd_chars methods on &str, but here that would require allocating an intermediate String. So, in the same spirit as #19042, it might be useful to expose another API for Unicode normalization (all four variants of it, while we’re at it) from a generic Iterator<char> rather than just &str / Chars.
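The “default” variant can be sketched as a lazy comparison of two folded char streams, with no intermediate String. In this sketch, caseless_match_with is a hypothetical name and the per-character folding step is passed in as a parameter, since a real case folding function is not assumed to exist here. The variants that need NFD or NFKD would wrap each side in a normalizing iterator adaptor, which is not shown.

```rust
/// Compares two char streams after applying `fold` to every char.
/// `fold` may expand one char into several, as full case folding does.
fn caseless_match_with<A, B, F, T>(a: A, b: B, fold: F) -> bool
where
    A: Iterator<Item = char>,
    B: Iterator<Item = char>,
    F: Fn(char) -> T + Copy,
    T: Iterator<Item = char>,
{
    // `Iterator::eq` compares lazily and stops at the first difference.
    a.flat_map(fold).eq(b.flat_map(fold))
}
```

For example, passing char::to_lowercase as a stand-in for fold makes "HELLO" match "hello". Lowercasing is not case folding, though: with that stand-in, "straße" does not match "STRASSE", whereas a true full case folding would map both to "strasse".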

Thoughts?

Nothing urgent here, but consider this when stabilizing libunicode.
