Bug report
Bug description:
import chardet
import codecs


def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']


def check_for_invalid_characters(file_path, encoding):
    try:
        with codecs.open(file_path, 'r', encoding=encoding, errors='strict') as f:
            f.readline()
            f.readline()
            third_line = f.readline()
            print(f"The 3rd line of the file: {third_line.strip()}")
            f.seek(0)  # Reset the file pointer to the beginning
            f.read()  # Check for invalid characters
        print(f"The file {file_path} is encoded with {encoding} and does not contain invalid characters.")
    except UnicodeDecodeError as e:
        print(f"The file {file_path} has invalid characters when decoded with {encoding}.")


def main(file_path):
    detected_encoding = detect_encoding(file_path)
    print(f"Detected encoding: {detected_encoding}")
    check_for_invalid_characters(file_path, detected_encoding)
    encodings_to_try = ['utf-8', 'latin-1', 'utf-16', 'iso-8859-1', 'iso-8859-15', 'iso-8859-7']
    for encoding in encodings_to_try:
        check_for_invalid_characters(file_path, encoding)


main('/tmp/booking.ics')  # path as it appears in the console output
In the code above, when the file cannot be decoded as 'utf-16' because it contains an invalid byte sequence, the exception raised differs from the one raised for the other encodings.
For some reason, when 'utf-16' is specified as the encoding and decoding fails, codecs.py raises a plain UnicodeError in addition to the expected UnicodeDecodeError. The `except UnicodeDecodeError` clause does not catch it, so the script aborts. This second UnicodeError appears only for 'utf-16', not for 'utf-16-le' or 'utf-16-be'.
The console output for the above code is:
Detected encoding: ISO-8859-7
The 3rd line of the file: DESCRIPTION;LANGUAGE=de-DE:φίδ
The file /tmp/booking.ics is encoded with ISO-8859-7 and does not contain invalid characters.
The file /tmp/booking.ics has invalid characters when decoded with utf-8.
The 3rd line of the file: DESCRIPTION;LANGUAGE=de-DE:ößä
The file /tmp/booking.ics is encoded with latin-1 and does not contain invalid characters.
Traceback (most recent call last):
File "<frozen codecs>", line 507, in read
File "/usr/lib/python3.11/encodings/utf_16.py", line 135, in decode
codecs.utf_16_ex_decode(input, errors, 0, False)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 58-59: illegal encoding
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 7, in main
File "<stdin>", line 4, in check_for_invalid_characters
File "<frozen codecs>", line 711, in readline
File "<frozen codecs>", line 561, in readline
File "<frozen codecs>", line 511, in read
File "/usr/lib/python3.11/encodings/utf_16.py", line 141, in decode
raise UnicodeError("UTF-16 stream does not start with BOM")
UnicodeError: UTF-16 stream does not start with BOM
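The second traceback ends in a plain UnicodeError, and UnicodeDecodeError is a subclass of UnicodeError, so a handler written for the subclass lets the parent through. As a possible stopgap (not a fix for the behaviour reported here), the check could catch UnicodeError instead. The sketch below is only an illustration: `can_decode` is a hypothetical helper that is not part of the script above, and it reads from an in-memory `io.BytesIO` through `codecs.getreader` rather than from a file through `codecs.open`, purely to stay self-contained.

```python
import codecs
import io


def can_decode(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` decodes cleanly through a codecs stream reader.

    Hypothetical helper for illustration; it mirrors the readline()/read()
    sequence of the script above on in-memory bytes.
    """
    reader = codecs.getreader(encoding)(io.BytesIO(raw), errors="strict")
    try:
        reader.readline()
        reader.seek(0)  # rewind and reset the decoder state
        reader.read()   # decode the whole input
    except UnicodeError:
        # UnicodeDecodeError is a subclass of UnicodeError, so this clause
        # handles both of the exceptions shown in the traceback.
        return False
    return True


print(can_decode("ab\n".encode("utf-16"), "utf-16"))  # input starts with a BOM
print(can_decode(b"\xff", "utf-8"))                   # invalid UTF-8 start byte
```

Catching the parent class hides the distinction between the two exceptions, so it is only appropriate when the caller just needs a yes/no answer.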
My test file:
booking.zip
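For readers without access to the attachment, the difference between 'utf-16' and 'utf-16-le' can probably be reproduced with a few synthetic bytes. This is a sketch under assumptions: `SAMPLE` and `first_line_error` are made up for illustration, the bytes are not taken from booking.ics, and the reader is built with `codecs.getreader` over `io.BytesIO` instead of `codecs.open`, so the traceback frames will not match the ones above exactly.

```python
import codecs
import io

# Hypothetical stand-in for the attachment: "ab", then a lone low surrogate
# (bytes 00 DC in little-endian order), then "c\n". There is no BOM.
SAMPLE = b"a\x00b\x00\x00\xdcc\x00\n\x00"


def first_line_error(raw, encoding):
    """Return the type of the exception raised by readline(), or None."""
    reader = codecs.getreader(encoding)(io.BytesIO(raw), errors="strict")
    try:
        reader.readline()
    except UnicodeError as exc:
        return type(exc)
    return None


for enc in ("utf-16", "utf-16-le"):
    print(enc, first_line_error(SAMPLE, enc))
```

With this input, 'utf-16' should report UnicodeError and 'utf-16-le' should report UnicodeDecodeError, which matches the asymmetry described in this report.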
CPython versions tested on:
3.11
Operating systems tested on:
Linux