Forbidden numeric character references appear in sanitized HTML #223

Open
@simon-greatrix

The HTML living standard ( https://html.spec.whatwg.org/multipage/syntax.html#character-references ) states:

The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.
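For illustration, the quoted rule can be sketched as a predicate over code points. This is a minimal sketch, not code from the sanitizer: the class and method names are hypothetical, the definitions of "noncharacter", "control" and "ASCII whitespace" are taken from the WHATWG Infra standard as I understand it, and the input is assumed to be a valid code point in the range 0 to 0x10FFFF. Surrogates are not covered by the quoted sentence and are not handled here.

```java
// Hypothetical helper, not part of the library under discussion.
final class NumericReferenceRules {

  // Noncharacters: U+FDD0..U+FDEF, plus the last two code points of every plane
  // (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
  static boolean isNoncharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  // Controls: C0 controls U+0000..U+001F, and U+007F..U+009F.
  static boolean isControl(int cp) {
    return (cp >= 0x00 && cp <= 0x1F) || (cp >= 0x7F && cp <= 0x9F);
  }

  // ASCII whitespace: TAB, LF, FF, CR, SPACE.
  static boolean isAsciiWhitespace(int cp) {
    return cp == 0x09 || cp == 0x0A || cp == 0x0C || cp == 0x0D || cp == 0x20;
  }

  // True if a numeric character reference may refer to cp under the quoted rule.
  static boolean isAllowedInNumericReference(int cp) {
    if (cp == 0x0D) {
      return false; // CR is excluded explicitly
    }
    if (isNoncharacter(cp)) {
      return false;
    }
    return !isControl(cp) || isAsciiWhitespace(cp);
  }
}
```

Under this sketch, `isAllowedInNumericReference(0x5FFFE)` is false, while an ordinary supplementary code point such as U+1F600 is allowed.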

However, noncharacters from the supplementary planes are currently emitted as numeric character references.

The code:

    StringBuilder builder = new StringBuilder();
    Encoding.encodeRcdataOnto(Character.toString(0x5fffe), builder);
    System.out.println(builder.toString());

Produces a numeric character reference for U+5FFFE, which is a noncharacter.
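A test case for this could scan encoded output for numeric character references that name a noncharacter. The following is only a sketch of such a check: the class and method names are hypothetical, it uses nothing from the sanitizer itself, and it recognises only well-formed references of the form `&#NNN;` or `&#xHHH;` with at most seven decimal or six hexadecimal digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test helper, not part of the library under discussion.
final class ForbiddenReferenceScanner {

  // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
  private static final Pattern NUMERIC_REFERENCE =
      Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

  static boolean isNoncharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  // Returns the code points of all numeric references in html that name a noncharacter.
  static List<Integer> findNoncharacterReferences(String html) {
    List<Integer> found = new ArrayList<>();
    Matcher m = NUMERIC_REFERENCE.matcher(html);
    while (m.find()) {
      int cp = (m.group(1) != null)
          ? Integer.parseInt(m.group(1), 16)
          : Integer.parseInt(m.group(2), 10);
      // Values above U+10FFFF are not code points at all and are ignored here.
      if (cp <= Character.MAX_CODE_POINT && isNoncharacter(cp)) {
        found.add(cp);
      }
    }
    return found;
  }
}
```

A regression test could then assert that `findNoncharacterReferences(builder.toString())` is empty for inputs containing U+5FFFE and the other plane-final noncharacters.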

I see two possible simple solutions, but I am loath to recommend either one:

First, the characters could be elided, in line with the existing elision of U+FFFE and U+FFFF. This produces a strange botch that is required by the rules of neither HTML nor XML, and I don't like it.

Alternatively, the character is allowed if it is not numerically escaped, and the noncharacters U+FDD0 to U+FDEF are already emitted unescaped, so treating these code points consistently with the other noncharacters would produce legal HTML. However, all other supplementary code points are represented by numeric escapes to avoid corruption when converting between Unicode encodings. I am not happy with introducing special cases for these supplementary code points.
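To make the two options concrete, here is a rough sketch of how each would treat supplementary code points. It is not the library's encoder: the class, enum and method names are hypothetical, BMP characters are passed through untouched (a real encoder would still escape `<`, `&` and so on), and the hexadecimal reference format is an assumption.

```java
// Hypothetical sketch of the two options, not the library's implementation.
final class SupplementaryNoncharacterOptions {

  enum Mode { ELIDE, EMIT_RAW }

  static boolean isNoncharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  static String encode(String s, Mode mode) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      i += Character.charCount(cp);
      if (!Character.isSupplementaryCodePoint(cp)) {
        // Simplification: BMP handling is out of scope for this sketch.
        out.appendCodePoint(cp);
      } else if (isNoncharacter(cp)) {
        if (mode == Mode.EMIT_RAW) {
          out.appendCodePoint(cp); // second option: emit unescaped
        }
        // first option (ELIDE): drop the code point entirely
      } else {
        // All other supplementary code points stay numerically escaped.
        out.append("&#x").append(Integer.toHexString(cp)).append(';');
      }
    }
    return out.toString();
  }
}
```

With `Mode.ELIDE` the input `"a" + U+5FFFE + "b"` becomes `"ab"`; with `Mode.EMIT_RAW` it comes back unchanged, which is the special case described above.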

A more complex solution would be to introduce a policy for handling the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.

I am happy to put the time into creating a fix and test cases, but I need guidance as to what is the "correct" solution.
