Description
The HTML Living Standard (https://html.spec.whatwg.org/multipage/syntax.html#character-references) states:
The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.
However, the encoder emits noncharacters from the supplementary planes as numeric character references.
The code:
StringBuilder builder = new StringBuilder();
Encoding.encodeRcdataOnto(Character.toString(0x5fffe), builder);
System.out.println(builder.toString());
Produces a numeric character reference to the noncharacter U+5FFFE, which the rule quoted above does not allow.
I see two simple possible solutions, but I am loath to recommend either one:
First, the characters could be elided, in line with the existing elision of U+FFFE and U+FFFF. This is a strange botch that neither HTML nor XML requires, and I don't like it.
Alternatively, the character is allowed if it is not numerically escaped, and the noncharacters U+FDD0 to U+FDEF are already emitted unescaped, so treating the supplementary noncharacters the same way would produce legal HTML. However, all other supplementary code points are written as numeric escapes to avoid corruption when converting between Unicode encodings, and I am not happy about introducing a special case for these code points.
A more complex solution would be to introduce a policy for handling the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.
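Either option needs a way to recognise noncharacters in every plane, not just U+FFFE and U+FFFF. A minimal sketch of such a check follows, based on the Unicode definition of the 66 noncharacters (U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes). The class and method names are hypothetical and are not part of the library.

```java
/** Hypothetical helper, for illustration only; not part of the library. */
final class Noncharacters {
    private Noncharacters() {}

    /**
     * Returns true for the 66 Unicode noncharacters: U+FDD0..U+FDEF and
     * the code points ending in FFFE or FFFF in each plane (U+FFFE,
     * U+FFFF, U+1FFFE, ..., U+10FFFF).
     */
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            return false; // not a Unicode code point at all
        }
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true; // contiguous block in the BMP
        }
        // Low 16 bits equal to FFFE or FFFF: the last two code points of a plane.
        return (codePoint & 0xFFFE) == 0xFFFE;
    }
}
```

With this check, Noncharacters.isNoncharacter(0x5fffe) is true, so the code point from the example above could be routed to whichever handling is chosen before the numeric escape is written.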
I am happy to put the time into creating a fix and test cases, but I need guidance as to what is the "correct" solution.