Incorrect utf8toUnicode transformation for 00xx

In utf8_unicode_inplace_ex(), we have the following code:
```
c = *utf;
/* If first byte begins with binary 0 it is single byte encoding */
if ((c & 0x80) == 0) {
    /* single byte unicode (7 bit ASCII equivilent) has no validation */
    count++;
    if (count <= len) {
        if (c == 0) *data = x2c(&c);
        else *data++ = c;
    }
}
```
The code `if (c == 0) *data = x2c(&c);` is wrong.
utf the input string.
If input is "x", we copy the character => correct.
If input is "\x00\xA0", we call the x2c function which expects an hexadecimal string as input (like "20" for a space).
Furthermore, the output buffer (data) is not increased, so the character is overwritten on the next loop.
The correct code is
```
if (c == 0) {
    sprintf(data, "%%u%04x", utf[1]);
    count += 4;
    data += 6;
    i++;
    *changed = 1;
}
```

Remarks:

1. Using "%04x" replaces a lot of useless code used in other parts of the code
2. I didn't add a check `if (count <= len)` after `count += 4;` because it cannot overflow if we fix the code: `len = input_len * 6 + 1;` instead of `len = input_len * 4 + 1;` (4 bytes -> %u1234). The check (and the variable count) could thus be removed everywhere
3. After filling data, we have several checks (/* invalid UTF-8 character number range (RFC 3629) */, /* check for overlong */, ...) with this code `count++; *data++ = c;` What's the point? This copies the character after the ones we already copied. This looks wrong to me. Any reason to keep this code?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Incorrect utf8toUnicode transformation for 00xx #3129

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Incorrect utf8toUnicode transformation for 00xx #3129

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions