Indicate discrepancies with Unicode specifications for UTF-16/32 schemes #128571

Open
@youkidearitai

Description

Bug report

Bug description:

b"ab".decode("UTF-16")

According to https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G28070, when UTF-16 data has no BOM and no higher-level protocol specifies the byte order, the UTF-16 encoding scheme is big-endian:

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.
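To make the quoted rule concrete, here is a minimal sketch of a decoder that follows it: honor a BOM if one is present, otherwise assume big-endian. The name `decode_utf16_per_spec` is hypothetical and only for illustration; it is not a CPython API and not a proposed patch.

```python
import codecs


def decode_utf16_per_spec(data: bytes) -> str:
    """Hypothetical helper: use the BOM if present, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):   # b"\xfe\xff"
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # b"\xff\xfe"
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol: big-endian per the quoted text.
    return data.decode("utf-16-be")


# Same result on every host, regardless of CPU byte order.
print(f"U+{ord(decode_utf16_per_spec(b'ab')):04X}")  # U+6162
```

Because the byte order is always chosen explicitly, the result does not depend on the machine the code runs on.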

However, CPython's actual behavior seems to depend on the CPU architecture.

I tested on x86_64 (WSL Ubuntu) and aarch64 (Raspberry Pi with Raspbian, and macOS).

The x86_64 result is U+6162; the aarch64 result is U+6261.
I think UTF-16 without a BOM should be decoded as big-endian.
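A small script along these lines may help others compare platforms. It decodes the same two bytes with both explicit byte orders and with the plain "UTF-16" codec, then prints the host byte order from `sys.byteorder` alongside the result. The comparison assumes that the "UTF-16" codec falls back to the host's native byte order when no BOM is present.

```python
import sys

data = b"ab"  # bytes 0x61 0x62, no BOM

big = data.decode("utf-16-be")     # explicit big-endian: U+6162
little = data.decode("utf-16-le")  # explicit little-endian: U+6261
native = data.decode("utf-16")     # no BOM: byte order chosen by the codec

# Assumption: without a BOM, the codec follows the host byte order.
expected = little if sys.byteorder == "little" else big

print(sys.byteorder, f"U+{ord(native):04X}", native == expected)
```

Using "utf-16-be" or "utf-16-le" explicitly gives the same result on every machine.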

CPython versions tested on:

3.10, 3.12

Operating systems tested on:

Linux, macOS
