Forbidden numeric character references appear in sanitized HTML

The HTML Living Standard (https://html.spec.whatwg.org/multipage/syntax.html#character-references) states:

> The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

However, the sanitizer encodes noncharacters from the supplementary planes as numeric character references, which the rule quoted above forbids.
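To make the quoted rule concrete, here is a minimal sketch of it as a predicate. It is not the sanitizer's code: the class and method names are hypothetical, and it covers only the exclusions named in the quoted sentence (CR, noncharacters, and controls other than ASCII whitespace).

```java
final class NumericReferenceCheck {
    /** Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)
            || (cp >= 0 && cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
    }

    /** ASCII whitespace as defined by the HTML standard: TAB, LF, FF, CR, SPACE. */
    static boolean isAsciiWhitespace(int cp) {
        return cp == 0x09 || cp == 0x0A || cp == 0x0C || cp == 0x0D || cp == 0x20;
    }

    /** Controls: C0 controls (U+0000..U+001F) and U+007F..U+009F. */
    static boolean isControl(int cp) {
        return (cp >= 0x00 && cp <= 0x1F) || (cp >= 0x7F && cp <= 0x9F);
    }

    /** True if the quoted rule forbids a numeric character reference to cp. */
    static boolean isForbiddenInNumericReference(int cp) {
        return cp == 0x0D
            || isNoncharacter(cp)
            || (isControl(cp) && !isAsciiWhitespace(cp));
    }

    public static void main(String[] args) {
        // U+5FFFE is a supplementary-plane noncharacter.
        System.out.println(isForbiddenInNumericReference(0x5FFFE)); // true
        // U+1F600 is an ordinary supplementary code point.
        System.out.println(isForbiddenInNumericReference(0x1F600)); // false
    }
}
```

The `(cp & 0xFFFE) == 0xFFFE` test matches U+FFFE and U+FFFF in the BMP as well as U+nFFFE and U+nFFFF in each supplementary plane, which is the case this issue is about.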

The code:

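```java
// U+5FFFE is a noncharacter in a supplementary plane.
// Character.toString(int) requires Java 11 or later.
StringBuilder builder = new StringBuilder();
Encoding.encodeRcdataOnto(Character.toString(0x5fffe), builder);
System.out.println(builder.toString());
```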

Produces: "&#x5fffe;"

I see two simple solutions, but I am loath to recommend either one:

First, the characters could be elided, in line with the existing elision of U+FFFE and U+FFFF. This produces a strange botch that is required by the rules of neither HTML nor XML, and I don't like it.

Alternatively, the characters could be emitted unescaped. A noncharacter is allowed in HTML as long as it is not numerically escaped, and the noncharacters U+FDD0 to U+FDEF are already emitted unescaped, so treating the supplementary noncharacters the same way would be consistent and would produce legal HTML. However, all other supplementary code points are represented by numeric escapes to avoid corruption when converting between Unicode encodings. I am not happy with introducing special cases for these supplementary code points.
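The two options can be compared side by side in a rough sketch. Everything here is hypothetical (the class, the `Policy` enum, and the `encode` method are not part of the sanitizer), and the sketch ignores the escaping of `&`, `<` and other BMP characters that a real encoder performs. It only shows how supplementary noncharacters would be treated under each option while other supplementary code points stay numerically escaped.

```java
final class NoncharacterPolicySketch {
    /** ELIDE drops supplementary noncharacters; PASS_THROUGH emits them unescaped. */
    enum Policy { ELIDE, PASS_THROUGH }

    /** True for U+nFFFE and U+nFFFF in planes 1 through 16. */
    static boolean isSupplementaryNoncharacter(int cp) {
        return Character.isSupplementaryCodePoint(cp) && (cp & 0xFFFE) == 0xFFFE;
    }

    static String encode(String text, Policy policy) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isSupplementaryNoncharacter(cp)) {
                // Option 1 drops the code point; option 2 writes it raw.
                if (policy == Policy.PASS_THROUGH) {
                    out.appendCodePoint(cp);
                }
            } else if (Character.isSupplementaryCodePoint(cp)) {
                // All other supplementary code points keep their numeric escape.
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                // BMP handling is out of scope for this sketch.
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a" + new String(Character.toChars(0x5FFFE)) + "b"
            + new String(Character.toChars(0x1F600));
        System.out.println(encode(input, Policy.ELIDE)); // ab&#x1f600;
        System.out.println(encode(input, Policy.PASS_THROUGH).length());
    }
}
```

Either way, the special case for supplementary noncharacters sits in the same branch, so switching between the two options later would be a small change.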

A more complex solution would be to introduce a policy for handling the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.

I am happy to put in the time to create a fix and test cases, but I need guidance on what the "correct" solution is.



 



Forbidden numeric character references appear in sanitized HTML #223
