Add titlecase APIs to char #354

Open
@Jules-Bertholet

Proposal

Problem statement

The char module provides char::is_uppercase() and char::is_lowercase() for determining whether a character is uppercase or lowercase, and char::to_uppercase() and char::to_lowercase() for converting to uppercase or lowercase. However, a small number of characters are titlecase, which is in between the two; the standard library provides no APIs for handling titlecase.

Motivating examples or use cases

Many software systems place restrictions on the allowed case of characters, or use case for various semantic distinctions. For example, a programming language might require local variable names to be lowercase, or constants to be uppercase. Because most characters and languages are caseless, such rules are usually best implemented by excluding particular cases rather than requiring a particular case. Titlecase characters are conceptually both partly lowercase and partly uppercase, so an API that excludes either lowercase or uppercase characters will want to exclude titlecase as well, and an API that assigns special meaning to a particular case will generally want to assign the meaning to titlecase also.
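As a small illustration of the point above, the following sketch implements a hypothetical "no uppercase in local variable names" rule by exclusion, using only APIs that exist in the standard library today. The function names are invented for this example, and the hard-coded titlecase list is an assumption standing in for the proposed predicate: it covers only the four Latin digraphs, not the remaining titlecase letters in the Greek Extended block.

```rust
/// Hypothetical rule: a local variable name must not contain uppercase
/// characters. Written by exclusion, so names in caseless scripts stay valid.
fn is_valid_local_naive(name: &str) -> bool {
    !name.chars().any(|c| c.is_uppercase())
}

/// Stand-in for a titlecase predicate (assumption: NOT complete).
/// Covers only the four Latin titlecase digraphs: ǅ, ǈ, ǋ, ǲ.
fn is_latin_titlecase_digraph(c: char) -> bool {
    matches!(c, '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}')
}

/// Same rule, but also excluding the titlecase letters known to the
/// stand-in predicate above.
fn is_valid_local_strict(name: &str) -> bool {
    !name
        .chars()
        .any(|c| c.is_uppercase() || is_latin_titlecase_digraph(c))
}
```

With these definitions, `is_valid_local_naive("\u{01C5}ep")` accepts a name that begins with the titlecase letter ǅ, because `char::is_uppercase` returns `false` for it, while `is_valid_local_strict` rejects the same name.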

In addition, it's common to want to convert a string to titlecase, which means capitalizing the first letter of all or most words. Defining what a word is, and deciding which words should be capitalized, is complex and context-dependent, and thus unsuited for the standard library. (Notably, UAX 29 and the unicode-segmentation crate are not the be-all and end-all of determining word boundaries. For example, software identifiers, like those dealt with by heck, have a very different concept of what a word is compared to normal running text.) However, once individual words have been isolated for capitalization, the capitalization process and result are the same across all domains (disregarding locale-specific special casings that the standard library does not handle). The exact rule is defined by the Unicode Standard:

For a string X: [...]

R3 toTitlecase(X): Find the word boundaries in X [...] For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

This algorithm is not complicated to implement, as long as the titlecase mappings are available. However, if the titlecase mappings are not available, users are far more likely to resort to an erroneous implementation using to_uppercase than to add an additional crates.io dependency or compile the data themselves.
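To make the difference concrete, here is a minimal sketch of rule R3 applied to a single, already-isolated word, next to the erroneous to_uppercase shortcut. Everything named here is hypothetical: `title_mapping` is a tiny hand-written stand-in for Titlecase_Mapping that covers only a few characters (the real data is what this proposal asks `core` to provide), and `is_cased_approx` only approximates the Cased property using stable predicates.

```rust
/// Hypothetical, INCOMPLETE stand-in for Unicode's Titlecase_Mapping.
/// Only a few characters whose titlecase differs from their uppercase are
/// listed; everything else falls back to `char::to_uppercase`.
fn title_mapping(c: char) -> String {
    match c {
        '\u{01C4}'..='\u{01C6}' => "\u{01C5}".to_string(), // Ǆ ǅ ǆ -> ǅ
        '\u{01C7}'..='\u{01C9}' => "\u{01C8}".to_string(), // Ǉ ǈ ǉ -> ǈ
        '\u{01CA}'..='\u{01CC}' => "\u{01CB}".to_string(), // Ǌ ǋ ǌ -> ǋ
        '\u{01F1}'..='\u{01F3}' => "\u{01F2}".to_string(), // Ǳ ǲ ǳ -> ǲ
        'ß' => "Ss".to_string(),
        _ => c.to_uppercase().collect(),
    }
}

/// Approximation of the Cased property (assumption: only the Latin
/// titlecase digraphs are recognized in addition to lower/upper).
fn is_cased_approx(c: char) -> bool {
    c.is_lowercase()
        || c.is_uppercase()
        || matches!(c, '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}')
}

/// R3 for one word: leave characters before the first cased character
/// untouched, titlecase that character, lowercase everything after it.
fn titlecase_word(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    let mut seen_first_cased = false;
    for c in word.chars() {
        if seen_first_cased {
            out.extend(c.to_lowercase());
        } else if is_cased_approx(c) {
            out.push_str(&title_mapping(c));
            seen_first_cased = true;
        } else {
            out.push(c);
        }
    }
    out
}

/// The erroneous shortcut: uppercase the first character, lowercase the rest.
fn capitalize_with_uppercase(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first
            .to_uppercase()
            .chain(chars.flat_map(char::to_lowercase))
            .collect(),
        None => String::new(),
    }
}
```

For the input `"\u{01C6}emal"` (ǆemal), `titlecase_word` produces a word starting with the titlecase digraph ǅ (U+01C5), while `capitalize_with_uppercase` produces one starting with the fully uppercase Ǆ (U+01C4). Word segmentation is deliberately left out, in line with the argument above that it belongs outside the standard library.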

Solution sketch

Add the following to core::char:

/// Analogous to [`ToUppercase`](https://doc.rust-lang.org/core/char/struct.ToUppercase.html)
/// and [`ToLowercase`](https://doc.rust-lang.org/core/char/struct.ToLowercase.html).
#[derive(Clone, Debug)]
pub struct ToTitlecase(/*...*/);

impl Iterator for ToTitlecase {
    type Item = char;
    /* ... */ 
}

impl fmt::Display for ToTitlecase { /* ... */ }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}

Add the following implementations to char:

/// Whether the character is uppercase, lowercase, or titlecase.
/// `core` already includes a data table for this property internally
/// (used to implement the final-sigma casing rules),
/// so implementation is trivial.
#[must_use]
#[inline]
pub fn is_cased(self) -> bool {
    match self {
        'A'..='Z' | 'a'..='z' => true,
        '\0'..='\u{A9}' => false,
        _ => unicode::Cased(self),
    }
}

/// Whether the character is in Unicode general category Titlecase_Letter.
#[must_use]
#[inline]
pub fn is_titlecase(self) -> bool {
    match self {
        '\0'..='\u{01C4}' => false,
        _ => self.is_cased() && !self.is_lowercase() && !self.is_uppercase()
    }
}

use core::char::CharCase;

/// The case of this character, or `None` if it is uncased.
#[must_use]
pub fn case(self) -> Option<CharCase> {
    match self {
        'A'..='Z' => Some(CharCase::Upper),
        'a'..='z' => Some(CharCase::Lower),
        '\0'..='\u{A9}' => None,
        _ if !self.is_cased() => None,
        _ if self.is_lowercase() => Some(CharCase::Lower),
        _ if self.is_uppercase() => Some(CharCase::Upper),
        _ => Some(CharCase::Title),
    }
}

use core::char::ToTitlecase;

/// The only proposed API
/// that requires adding new static data to `core::unicode`.
/// Most characters map to the same uppercase and titlecase,
/// so we would only need to store the mappings that differ.
#[must_use]
#[inline]
pub fn to_titlecase(self) -> ToTitlecase {
    ToTitlecase(CaseMappingIter::new(conversions::to_title(self)))
}
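For readers who want to experiment before any of this lands, the proposed `case` method can be roughly approximated on stable Rust. The sketch below is not the proposed implementation: `CharCase` is redeclared locally to mirror the enum above, `approx_case` is a hypothetical free function, and, since `unicode::Cased` is internal to `core`, the "is it cased at all?" check is replaced by a heuristic (assumption: a character that is neither lowercase nor uppercase but changes under `to_lowercase` or `to_uppercase` is treated as titlecase).

```rust
use std::iter::once;

/// Local mirror of the proposed `core::char::CharCase`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}

/// Hypothetical approximation of the proposed `char::case`.
/// Heuristic only: it does not consult the Unicode Cased property.
pub fn approx_case(c: char) -> Option<CharCase> {
    if c.is_lowercase() {
        Some(CharCase::Lower)
    } else if c.is_uppercase() {
        Some(CharCase::Upper)
    } else if c.to_lowercase().ne(once(c)) || c.to_uppercase().ne(once(c)) {
        // Neither lowercase nor uppercase, yet it has a case mapping:
        // treated here as titlecase.
        Some(CharCase::Title)
    } else {
        None
    }
}
```

With the discriminants chosen in the proposal, the derived ordering places `Title` between `Lower` and `Upper`, so a check such as `approx_case(c) >= Some(CharCase::Title)` selects both titlecase and uppercase characters while leaving lowercase and uncased ones out.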

Alternatives

These APIs could be implemented by a crate on crates.io (and in fact, several options exist already). However, doing so in core is more efficient for binary sizes, as core already contains an internal data table for the Cased property (while third-party implementations must include their own duplicate copy). Also, developers are far more likely to simply not handle titlecase correctly than to add a dependency just to deal with it.

Links and related work

char::to_titlecase was added before 1.0 (rust-lang/rust#26039) but later removed (rust-lang/rust#26555, rust-lang/rust#26561), with the justification that converting a string to titlecase requires a word breaking algorithm from outside std. However, as I have argued above, providing titlecase APIs within core would be beneficial even if word breaking must still be implemented outside of it.
