Add titlecase APIs to `char`

# Proposal

## Problem statement

The `char` module provides `char::is_uppercase()` and `char::is_lowercase()` for determining whether a character is uppercase or lowercase, and `char::to_uppercase()` and `char::to_lowercase()` for converting to uppercase or lowercase. However, [a small number](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%253ALt%253A%5D) of characters are [titlecase](https://www.unicode.org/faq/casemap_charprop.html#4), which is in between the two; the standard library provides no APIs for handling titlecase.

## Motivating examples or use cases

Many software systems place restrictions on the allowed case of characters, or use case for various semantic distinctions. For example, a programming language might require local variable names to be lowercase, or constants to be uppercase. Because most characters and languages are caseless, such rules are usually best implemented by *excluding* particular cases rather than *requiring* a particular case. Titlecase characters are conceptually both partly lowercase and partly uppercase, so an API that excludes either lowercase or uppercase characters will want to exclude titlecase as well, and an API that assigns special meaning to a particular case will generally want to assign the meaning to titlecase also.

In addition, it's common to want to convert a string to titlecase, which means capitalizing the first letter of all or most words. Defining what a word is, and deciding which words should be capitalized, is complex and context-dependent, and thus unsuited for the standard library. (Notably, UAX 29 and the [`unicode-segmentation`](https://github.com/unicode-rs/unicode-segmentation) crate are *not* the final word on determining word boundaries. For example, software identifiers, like those dealt with by [`heck`](https://github.com/withoutboats/heck), have a very different concept of what a word is compared to normal running text.) However, once individual words have been isolated for capitalization, the capitalization process and result are the same across all domains (disregarding locale-specific special casings that the standard library does not handle). The exact rule is [defined by the Unicode Standard](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G34078):

> For a string X: [...]
>
> R3 toTitlecase(X): Find the word boundaries in X [...] For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

This algorithm is not complicated to implement, as long as the titlecase mappings are available. However, if the titlecase mappings are not available, users are far more likely to resort to an erroneous implementation using `to_uppercase` than to add an additional crates.io dependency or compile the data themselves.
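As a rough illustration of rule R3 applied to a single, already-isolated word, here is a minimal sketch on stable Rust. `title_mapping` and `is_cased_approx` are hypothetical stand-ins for the proposed `char::to_titlecase` and `char::is_cased`; they hard-code only the four Latin digraph families and `ß`, so this is not a complete implementation.

```rust
/// Hypothetical stand-in for the proposed `char::to_titlecase`.
/// Assumption: only a handful of mappings are hard-coded here; everything
/// else falls back to the uppercase mapping, which is wrong for some
/// characters (for example the Greek letters with iota subscript).
fn title_mapping(c: char) -> String {
    match c {
        '\u{01C4}'..='\u{01C6}' => "\u{01C5}".to_string(), // DŽ family -> ǅ
        '\u{01C7}'..='\u{01C9}' => "\u{01C8}".to_string(), // LJ family -> ǈ
        '\u{01CA}'..='\u{01CC}' => "\u{01CB}".to_string(), // NJ family -> ǋ
        '\u{01F1}'..='\u{01F3}' => "\u{01F2}".to_string(), // DZ family -> ǲ
        '\u{00DF}' => "Ss".to_string(),                    // ß -> Ss
        _ => c.to_uppercase().collect(),
    }
}

/// Hypothetical stand-in for the proposed `char::is_cased`.
/// Approximation: only the four digraph titlecase letters are recognized
/// in addition to lowercase and uppercase characters.
fn is_cased_approx(c: char) -> bool {
    c.is_lowercase()
        || c.is_uppercase()
        || matches!(c, '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}')
}

/// Applies R3 to one word: the first cased character gets its titlecase
/// mapping, every later character is lowercased, and uncased characters
/// before the first cased one are left unchanged.
fn titlecase_word(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    let mut seen_cased = false;
    for c in word.chars() {
        if seen_cased {
            out.extend(c.to_lowercase());
        } else if is_cased_approx(c) {
            out.push_str(&title_mapping(c));
            seen_cased = true;
        } else {
            out.push(c);
        }
    }
    out
}
```

For a word beginning with `ǆ` (U+01C6), this yields `ǅ` (U+01C5) as the first character, whereas an implementation built on `to_uppercase` would produce `Ǆ` (U+01C4).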

## Solution sketch

Add the following to `core::char`:

```rust
/// Analogous to [`ToUppercase`](https://doc.rust-lang.org/core/char/struct.ToUppercase.html)
/// and [`ToLowercase`](https://doc.rust-lang.org/core/char/struct.ToLowercase.html).
#[derive(Clone, Debug)]
pub struct ToTitlecase(/*...*/);

impl Iterator for ToTitlecase {
    type Item = char;
    /* ... */
}

impl fmt::Display for ToTitlecase { /* ... */ }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}
```

Add the following implementations to `char`:

```rust
/// Whether the character is uppercase, lowercase, or titlecase.
/// `core` already includes a data table for this property internally
/// (used to implement the final-sigma casing rules),
/// so implementation is trivial.
#[must_use]
#[inline]
pub fn is_cased(self) -> bool {
    match self {
        'A'..='Z' | 'a'..='z' => true,
        '\0'..='\u{A9}' => false,
        _ => unicode::Cased(self),
    }
}

/// Whether the character is in Unicode general category Titlecase_Letter.
#[must_use]
#[inline]
pub fn is_titlecase(self) -> bool {
    match self {
        '\0'..='\u{01C4}' => false,
        _ => self.is_cased() && !self.is_lowercase() && !self.is_uppercase(),
    }
}

use core::char::CharCase;

/// The case of this character, or `None` if it is uncased.
#[must_use]
pub fn case(self) -> Option<CharCase> {
    match self {
        'A'..='Z' => Some(CharCase::Upper),
        'a'..='z' => Some(CharCase::Lower),
        '\0'..='\u{A9}' => None,
        _ if !self.is_cased() => None,
        _ if self.is_lowercase() => Some(CharCase::Lower),
        _ if self.is_uppercase() => Some(CharCase::Upper),
        _ => Some(CharCase::Title),
    }
}

use core::char::ToTitlecase;

/// The only proposed API
/// that requires adding new static data to `core::unicode`.
/// Most characters map to the same uppercase and titlecase,
/// so we would only need to store the mappings that differ.
#[must_use]
#[inline]
pub fn to_titlecase(self) -> ToTitlecase {
    ToTitlecase(CaseMappingIter::new(conversions::to_title(self)))
}
```
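
To illustrate the "exclude rather than require" rule from the motivation, here is a minimal sketch of how a caller might use a `case()`-style API. `CharCase` is re-declared locally, and `case_of`, `is_titlecase_letter`, and `is_valid_local_name` are hypothetical names. Because stable Rust has no titlecase query, the titlecase letters are hard-coded as ranges, which is exactly the data the proposal would make available from `core`.

```rust
/// Local copy of the proposed enum, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CharCase {
    Lower,
    Title,
    Upper,
}

/// Hard-coded list of the characters in general category Titlecase_Letter
/// (an assumption of this sketch; the proposal would read this from `core`).
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Hypothetical free-function stand-in for the proposed `char::case`.
fn case_of(c: char) -> Option<CharCase> {
    if is_titlecase_letter(c) {
        Some(CharCase::Title)
    } else if c.is_lowercase() {
        Some(CharCase::Lower)
    } else if c.is_uppercase() {
        Some(CharCase::Upper)
    } else {
        None // uncased: digits, punctuation, most scripts
    }
}

/// Example rule: a local variable name may not contain uppercase or
/// titlecase characters. Uncased characters are allowed, so names in
/// caseless scripts remain valid.
fn is_valid_local_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| !matches!(case_of(c), Some(CharCase::Upper | CharCase::Title)))
}
```

A check written only as `!c.is_uppercase()` would let `ǅ` (U+01C5) through, since that character is neither uppercase nor lowercase; matching on the case value rejects it together with the uppercase letters.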

## Alternatives

These APIs could be implemented by a crate on crates.io (and in fact, several options exist already). However, doing so in `core` is more efficient for binary sizes, as `core` already contains an internal data table for the Cased property (while third-party implementations must include their own duplicate copy). Also, developers are far more likely to simply not handle titlecase correctly than to add a dependency just to deal with it.

## Links and related work

`char::to_titlecase` was added before 1.0 (https://github.com/rust-lang/rust/pull/26039) but later removed (https://github.com/rust-lang/rust/issues/26555, https://github.com/rust-lang/rust/pull/26561), with the justification that converting a string to titlecase requires a word breaking algorithm from outside `std`. However, as I have argued above, providing titlecase APIs within `core` would be beneficial even if word breaking must still be implemented outside of it.
