Unexpected UnicodeError instead of UnicodeDecodeError within codec.readline() only for 'utf-16' encoding #112812

Open
@agowa

Bug report

Bug description:

import chardet
import codecs

def detect_encoding(file_path):
  with open(file_path, 'rb') as f:
    result = chardet.detect(f.read())
  return result['encoding']

def check_for_invalid_characters(file_path, encoding):
  try:
    with codecs.open(file_path, 'r', encoding=encoding, errors='strict') as f:
      f.readline()
      f.readline()
      third_line = f.readline()
      print(f"The 3rd line of the file: {third_line.strip()}")
      f.seek(0)  # Reset the file pointer to the beginning
      f.read()   # Check for invalid characters
    print(f"The file {file_path} is encoded with {encoding} and does not contain invalid characters.")
  except UnicodeDecodeError as e:
    print(f"The file {file_path} has invalid characters when decoded with {encoding}.")


def main(file_path):
  detected_encoding = detect_encoding(file_path)
  print(f"Detected encoding: {detected_encoding}")
  check_for_invalid_characters(file_path, detected_encoding)
  encodings_to_try = ['utf-8', 'latin-1', 'utf-16', 'iso-8859-1', 'iso-8859-15', 'iso-8859-7']
  for encoding in encodings_to_try:
    check_for_invalid_characters(file_path, encoding)

main('/tmp/booking.ics')  # path of the test file, as shown in the console output

In the code above, when the file is not valid UTF-16 and is opened with encoding 'utf-16', readline() raises a different exception than it does for the other encodings.

When 'utf-16' is specified and decoding fails, readline() ends up raising a plain UnicodeError ("UTF-16 stream does not start with BOM") instead of the expected UnicodeDecodeError, so the except UnicodeDecodeError clause in the script does not catch it. This second UnicodeError appears only for 'utf-16', not for 'utf-16-le' or 'utf-16-be'.
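
The difference can probably be reproduced without the attached file. The following is a minimal sketch, assuming the trigger is BOM-less input whose first invalid UTF-16 code unit comes after at least one valid one. The sample bytes and the helper name readline_error are made up for illustration and are not taken from booking.ics.

```python
import codecs
import io

# Hypothetical sample: one valid code unit, then 0xDCDC, which is a lone
# low surrogate in either byte order. There is no BOM.
SAMPLE = b"a\x00\xdc\xdc"


def readline_error(encoding, data=SAMPLE):
    """Return the name of the exception type raised by readline(), or None."""
    reader = codecs.getreader(encoding)(io.BytesIO(data), errors="strict")
    try:
        reader.readline()
    except UnicodeError as exc:  # base class, so UnicodeDecodeError is caught too
        return type(exc).__name__
    return None


for name in ("utf-16", "utf-16-le", "utf-16-be"):
    print(name, readline_error(name))
```

On the CPython 3.11 from this report, the 'utf-16' row would be expected to show UnicodeError while the other two show UnicodeDecodeError. Other versions may behave differently.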

The console output for the above code is:

Detected encoding: ISO-8859-7
The 3rd line of the file: DESCRIPTION;LANGUAGE=de-DE:φίδ
The file /tmp/booking.ics is encoded with ISO-8859-7 and does not contain invalid characters.
The file /tmp/booking.ics has invalid characters when decoded with utf-8.
The 3rd line of the file: DESCRIPTION;LANGUAGE=de-DE:ößä
The file /tmp/booking.ics is encoded with latin-1 and does not contain invalid characters.
Traceback (most recent call last):
  File "<frozen codecs>", line 507, in read
  File "/usr/lib/python3.11/encodings/utf_16.py", line 135, in decode
    codecs.utf_16_ex_decode(input, errors, 0, False)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 58-59: illegal encoding

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in main
  File "<stdin>", line 4, in check_for_invalid_characters
  File "<frozen codecs>", line 711, in readline
  File "<frozen codecs>", line 561, in readline
  File "<frozen codecs>", line 511, in read
  File "/usr/lib/python3.11/encodings/utf_16.py", line 141, in decode
    raise UnicodeError("UTF-16 stream does not start with BOM")
UnicodeError: UTF-16 stream does not start with BOM
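
Until both cases raise the same type, a check like check_for_invalid_characters can catch the base class instead. UnicodeDecodeError is a subclass of UnicodeError, so a single except clause covers both exceptions in the traceback above. This is a minimal sketch that works on in-memory bytes rather than a file path; can_decode is a hypothetical helper, not part of the script above.

```python
import codecs
import io


def can_decode(data, encoding):
    """Return True if data decodes with encoding under errors='strict'."""
    reader = codecs.getreader(encoding)(io.BytesIO(data), errors="strict")
    try:
        reader.readline()  # the call that raised the plain UnicodeError
        reader.read()      # decode whatever is left
    except UnicodeError:   # also matches UnicodeDecodeError
        return False
    return True


# Hypothetical inputs: BOM-less bytes with a lone surrogate, and valid text.
print(can_decode(b"a\x00\xdc\xdc", "utf-16"))
print(can_decode("hi\n".encode("utf-16"), "utf-16"))
```

Catching the base class trades precision for robustness: the handler no longer distinguishes a missing BOM from an undecodable byte sequence.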

My test file:
booking.zip

CPython versions tested on:

3.11

Operating systems tested on:

Linux
