overlong utf-8 sequences should be treated as invalid utf-8 #3787

Closed
@Blub

Description

The current UTF-8 implementation contains assertions that check the validity of UTF-8 bytes. For example, passing an invalid sequence such as "\x80\xae" to a string function fails with "Assertion is_utf8(v) failed".
Overlong encodings, however, are accepted without any such error. A sequence such as "\xC0\xAE" (an overlong encoding of U+002E, a dot) is accepted and appears in the resulting Rust string.
This raises the security concerns described in RFC 3629, Section 10:
https://tools.ietf.org/html/rfc3629#section-10
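
To make the problem concrete, here is a minimal sketch in present-day Rust of a two-byte decoder. The function names are hypothetical and are not taken from the Rust standard library. The naive version only checks the bit patterns of the lead and continuation bytes, so it decodes 0xC0 0xAE to U+002E; the strict version additionally rejects any result below 0x80, which is the smallest code point that legitimately needs two bytes.

```rust
/// Hypothetical helper: decode a two-byte sequence with no overlong check.
fn decode_two_byte_naive(b0: u8, b1: u8) -> Option<u32> {
    // The lead byte must look like 110xxxxx and the continuation like 10xxxxxx.
    if b0 & 0xE0 != 0xC0 || b1 & 0xC0 != 0x80 {
        return None;
    }
    // Combine the 5 payload bits of the lead byte with the 6 of the continuation.
    Some((((b0 & 0x1F) as u32) << 6) | (b1 & 0x3F) as u32)
}

/// Hypothetical helper: same decoder, but overlong forms are rejected.
fn decode_two_byte_strict(b0: u8, b1: u8) -> Option<u32> {
    match decode_two_byte_naive(b0, b1) {
        // Code points below 0x80 fit in one byte, so a two-byte form is overlong.
        Some(cp) if cp >= 0x80 => Some(cp),
        _ => None,
    }
}

fn main() {
    // The overlong dot slips through the naive decoder as U+002E.
    assert_eq!(decode_two_byte_naive(0xC0, 0xAE), Some(0x2E));
    // The strict decoder refuses it.
    assert_eq!(decode_two_byte_strict(0xC0, 0xAE), None);
    // A genuine two-byte character (U+00E9) is still accepted.
    assert_eq!(decode_two_byte_strict(0xC3, 0xA9), Some(0xE9));
    // A stray continuation byte in lead position is rejected by both.
    assert_eq!(decode_two_byte_naive(0x80, 0xAE), None);
}
```

The minimum-value check is only shown for two-byte sequences; three- and four-byte sequences would need the analogous lower bounds of 0x800 and 0x10000.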

Short example: a program that lets a user access files but wants to forbid "../" in paths must not be bypassable with an overlong encoding of a dot, and the author of the program shouldn't have to rely on the OS to perform that check either.
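
The scenario above could look roughly like the following sketch, written against a present-day Rust toolchain, where `std::str::from_utf8` rejects overlong forms. The function `is_safe_relative_path` is a hypothetical name invented for this illustration, and the check is deliberately simplistic: it validates the bytes as UTF-8 first and only then looks for ".." path components.

```rust
use std::str;

/// Hypothetical check: refuse invalid UTF-8, then refuse any ".." component.
fn is_safe_relative_path(raw: &[u8]) -> bool {
    match str::from_utf8(raw) {
        // Only inspect path components once the bytes are known to be valid.
        Ok(path) => !path.split('/').any(|part| part == ".."),
        // Overlong encodings land here, so they cannot smuggle in a dot.
        Err(_) => false,
    }
}

fn main() {
    assert!(is_safe_relative_path(b"docs/readme.txt"));
    assert!(!is_safe_relative_path(b"../etc/passwd"));
    // "../x" with both dots written as the overlong pair 0xC0 0xAE.
    assert!(!is_safe_relative_path(&[0xC0, 0xAE, 0xC0, 0xAE, 0x2F, b'x']));
}
```

If the decoder accepted the overlong bytes and mapped them to '.', a check that scans the raw bytes for "../" would pass while the decoded string would still contain the traversal.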

fn main() {
    // overlong dot, should be invalid but is accepted
    let s1 = str::from_bytes([0xc0 as u8, 0xae as u8]);
    io::println(fmt!("len: %u, chars: %u, value: %s", s1.len(), str::char_len(s1), s1));
    // regular invalid utf, triggering an assertion fail
    let s2 = str::from_bytes([0x80 as u8, 0xae as u8]);
    io::println(fmt!("len: %u, chars: %u, value: %s", s2.len(), str::char_len(s2), s2));
}
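
For comparison, a rough equivalent of the reproduction above for a present-day Rust toolchain might look like this. It is a sketch, not the code from the report: it uses `std::str::from_utf8`, which returns a `Result` instead of asserting, and the helper `first_invalid_offset` is a hypothetical name. Both byte sequences are expected to be rejected at offset 0.

```rust
use std::str;

/// Hypothetical helper: offset of the first invalid byte, or None if valid.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    // Utf8Error::valid_up_to reports how many leading bytes were valid UTF-8.
    str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    // Overlong dot: rejected, with no valid prefix.
    let overlong = [0xC0u8, 0xAE];
    assert_eq!(first_invalid_offset(&overlong), Some(0));

    // Stray continuation byte: rejected as well.
    let stray = [0x80u8, 0xAE];
    assert_eq!(first_invalid_offset(&stray), Some(0));

    // A plain ASCII dot is valid.
    assert_eq!(first_invalid_offset(b"."), None);
}
```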
