Skip to content

Incorrect utf8toUnicode transformation for 00xx #3129

Open
@marcstern

Description

@marcstern

In utf8_unicode_inplace_ex(), we have the following code:

c = *utf;
/* If first byte begins with binary 0 it is single byte encoding */
if ((c & 0x80) == 0) {
    /* single byte unicode (7 bit ASCII equivilent) has no validation */
    count++;
    if (count <= len) {
        if (c == 0) *data = x2c(&c);
        else *data++ = c;
    }
}

The code if (c == 0) *data = x2c(&c); is wrong.
utf the input string.
If input is "x", we copy the character => correct.
If input is "\x00\xA0", we call the x2c function which expects an hexadecimal string as input (like "20" for a space).
Furthermore, the output buffer (data) is not increased, so the character is overwritten on the next loop.
The correct code is

if (c == 0) {
    sprintf(data, "%%u%04x", utf[1]);
    count += 4;
    data += 6;
    i++;
    *changed = 1;
}

Remarks:

  1. Using "%04x" replaces a lot of useless code used in other parts of the code
  2. I didn't add a check if (count <= len) after count += 4; because it cannot overflow if we fix the code: len = input_len * 6 + 1; instead of len = input_len * 4 + 1; (4 bytes -> %u1234). The check (and the variable count) could thus be removed everywhere
  3. After filling data, we have several checks (/* invalid UTF-8 character number range (RFC 3629) /, / check for overlong */, ...) with this code count++; *data++ = c; What's the point? This copies the character after the ones we already copied. This looks wrong to me. Any reason to keep this code?

Metadata

Metadata

Assignees

No one assigned

    Labels

    2.xRelated to ModSecurity version 2.x

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions