overlong utf-8 sequences should be treated as invalid utf-8

The current UTF-8 implementation contains some assertions on the validity of UTF-8 bytes. Specifically, passing a sequence such as "\x80\xae" to a string function fails with `Assertion is_utf8(v) failed`.
However, overlong encodings are accepted without any such error. For example, the sequence "\xC0\xAE" (an overlong encoding of \x2E, a dot) is accepted and appears in the resulting Rust string.
This raises security concerns, as described in RFC 3629, Section 10:
https://tools.ietf.org/html/rfc3629#section-10
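
To make the arithmetic concrete, here is a minimal sketch of why "\xC0\xAE" decodes to a dot and how an overlong form could be detected. It is written in present-day Rust rather than the dialect used in this issue, and the names `min_code_point`, `decode_lenient` and `is_overlong` are hypothetical helpers for illustration, not functions from the standard library.

```rust
/// Smallest code point that legitimately needs `len` bytes in UTF-8.
fn min_code_point(len: usize) -> u32 {
    match len {
        1 => 0x0,
        2 => 0x80,
        3 => 0x800,
        _ => 0x1_0000,
    }
}

/// Hypothetical lenient decoder: reads one sequence from the start of
/// `bytes` and returns (code point, length). It checks the lead byte and
/// the continuation bytes, but does NOT reject overlong forms.
fn decode_lenient(bytes: &[u8]) -> Option<(u32, usize)> {
    let first = *bytes.first()?;
    let (len, init) = match first {
        0x00..=0x7F => (1, first as u32),
        0xC0..=0xDF => (2, (first & 0x1F) as u32),
        0xE0..=0xEF => (3, (first & 0x0F) as u32),
        0xF0..=0xF7 => (4, (first & 0x07) as u32),
        _ => return None, // stray continuation byte or invalid lead byte
    };
    let mut cp = init;
    for &b in bytes.get(1..len)? {
        if b & 0xC0 != 0x80 {
            return None; // not a continuation byte
        }
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    Some((cp, len))
}

/// True if the sequence decodes but uses more bytes than necessary.
fn is_overlong(bytes: &[u8]) -> bool {
    match decode_lenient(bytes) {
        Some((cp, len)) => cp < min_code_point(len),
        None => false,
    }
}
```

With these helpers, `decode_lenient(&[0xC0, 0xAE])` yields the code point 0x2E with a length of 2, and `is_overlong` flags it because 0x2E is below the two-byte minimum of 0x80. The sequence "\x80\xae" is rejected outright because 0x80 is a continuation byte, not a lead byte.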

Short example: when a program lets a user access files but wants to block paths containing "../", it must not be possible to circumvent this check with an overlong encoding of a dot, and the program's author shouldn't have to rely on the OS to perform such a check either.
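
The scenario might look roughly like the following sketch, again in present-day Rust. Everything here is an assumption made for illustration: `looks_safe` stands in for the program's path check, `lenient_to_string` stands in for a decoder that accepts overlong two-byte forms (as the behaviour reported in this issue does), and `looks_safe_strict` shows one possible way to close the gap by validating the bytes first.

```rust
/// Hypothetical byte-level filter: rejects any request containing "../".
fn looks_safe(raw: &[u8]) -> bool {
    !raw.windows(3).any(|w| w == b"../")
}

/// Hypothetical lenient decoder for 1- and 2-byte sequences that accepts
/// overlong forms instead of rejecting them.
fn lenient_to_string(raw: &[u8]) -> Option<String> {
    let mut out = String::new();
    let mut i = 0;
    while i < raw.len() {
        let b = raw[i];
        if b < 0x80 {
            // Plain ASCII byte.
            out.push(b as char);
            i += 1;
        } else if b & 0xE0 == 0xC0 {
            // Two-byte sequence; no check that the result is >= 0x80.
            let c = *raw.get(i + 1)?;
            if c & 0xC0 != 0x80 {
                return None;
            }
            let cp = (((b & 0x1F) as u32) << 6) | ((c & 0x3F) as u32);
            out.push(char::from_u32(cp)?);
            i += 2;
        } else {
            return None; // longer sequences are out of scope for this sketch
        }
    }
    Some(out)
}

/// Stricter variant: validate the bytes as UTF-8 before filtering.
/// `std::str::from_utf8` rejects overlong encodings.
fn looks_safe_strict(raw: &[u8]) -> bool {
    match std::str::from_utf8(raw) {
        Ok(s) => !s.contains("../"),
        Err(_) => false,
    }
}
```

For the input `b"\xC0\xAE\xC0\xAE/secret"`, `looks_safe` finds no "../" in the raw bytes and lets the request through, while `lenient_to_string` turns the same bytes into "../secret". `looks_safe_strict` refuses the input because it is not valid UTF-8.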

```
fn main() {
    // overlong dot, should be invalid but is accepted
    let s1 = str::from_bytes([0xc0 as u8, 0xae as u8]);
    io::println(fmt!("len: %u, chars: %u, value: %s", s1.len(), str::char_len(s1), s1));
    // regular invalid utf, triggering an assertion fail
    let s2 = str::from_bytes([0x80 as u8, 0xae as u8]);
    io::println(fmt!("len: %u, chars: %u, value: %s", s2.len(), str::char_len(s2), s2));
}
```

overlong utf-8 sequences should be treated as invalid utf-8 #3787