Inconsistent whitespace definitions in string literals and language itself #60209

Open
@matklad

The lexer uses the Pattern_White_Space Unicode property when skipping over trivia. However, when we process string literals with escaped newlines, we only skip ASCII whitespace:

Some(' ') | Some('\n') | Some('\r') | Some('\t') => {
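
To make the mismatch concrete, here is a minimal sketch that compares the two whitespace sets. The function names are hypothetical and this is not the compiler's code: `is_pattern_white_space` hard-codes the code points that carry the Pattern_White_Space property, and `is_escaped_newline_whitespace` mirrors the four ASCII characters matched in the line above.

```rust
/// Code points with the Unicode `Pattern_White_Space` property.
/// (Hypothetical helper for illustration, not the lexer's own function.)
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}

/// The ASCII-only set from the match arm quoted in the issue.
fn is_escaped_newline_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\n' | '\r' | '\t')
}

/// Characters that count as whitespace between tokens
/// but are not skipped after an escaped newline.
fn mismatched_chars() -> Vec<char> {
    (0u32..=0x2029)
        .filter_map(char::from_u32)
        .filter(|&c| is_pattern_white_space(c) && !is_escaped_newline_whitespace(c))
        .collect()
}

fn main() {
    for c in mismatched_chars() {
        println!("U+{:04X}", c as u32);
    }
}
```

Under these assumptions, every character the sketch lists is one that is whitespace in program text but would be kept as string content after an escaped newline.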

Here's an example program showing that U+200F (RIGHT-TO-LEFT MARK) is ignored in program text, but not inside the string literal:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ec59778d31dde69f29f1095aff2c9b66

Here's the text of the program in Debug format, to make the whitespace slightly more visible:

"fn main() {\n\u{200f}\u{200f}\u{200f}\n    let s = \"\\\n\u{200f}\u{200f}\u{200f}hello\n\";\n    println!(\"{:?}\", s);\n}    \n"

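The effect on the string in that program can be modelled without putting invisible characters in the source. The sketch below is an illustration only: both helper names are hypothetical, and `trim_start_matches` stands in for the lexer's skipping loop. It applies an ASCII-only skip and a Pattern_White_Space skip to the text that follows the backslash in the literal.

```rust
/// Skips the ASCII whitespace that is currently skipped after an escaped newline.
/// (Hypothetical model of the behaviour described in the issue.)
fn skip_ascii_only(rest: &str) -> &str {
    rest.trim_start_matches(|c: char| matches!(c, ' ' | '\n' | '\r' | '\t'))
}

/// Skips every `Pattern_White_Space` code point, as the lexer does between tokens.
/// (Shown for comparison only; the issue does not say this is the intended fix.)
fn skip_pattern_white_space(rest: &str) -> &str {
    rest.trim_start_matches(|c: char| {
        matches!(
            c,
            '\u{0009}'..='\u{000D}'
                | '\u{0020}'
                | '\u{0085}'
                | '\u{200E}'
                | '\u{200F}'
                | '\u{2028}'
                | '\u{2029}'
        )
    })
}

fn main() {
    // The text that follows the backslash inside the example's string literal.
    let after_backslash = "\n\u{200f}\u{200f}\u{200f}hello\n";

    // ASCII-only skipping stops at the first U+200F, so the marks stay in the string.
    println!("{:?}", skip_ascii_only(after_backslash));

    // Skipping by Pattern_White_Space would consume the marks as well.
    println!("{:?}", skip_pattern_white_space(after_backslash));
}
```

In this model the ASCII-only skip keeps the three U+200F characters in front of `hello`, which matches the inconsistency reported here.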
Metadata

Assignees

No one assigned

Labels

    A-Unicode (Area: Unicode)
    A-frontend (Area: Compiler frontend (errors, parsing and HIR))
    A-parser (Area: The lexing & parsing of Rust source code to an AST)
    C-bug (Category: This is a bug.)
    T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
