Description
Describe the bug
The underlying issue is described in MySQL 8.0 Reference Manual / Character Sets, Collations, Unicode / Connection Character Sets and Collations: Connect-Time Error Handling. When you use the MySQL 8.0 client library to connect to a MySQL 5.7 server with the utf8mb4 encoding, then the server falls back to latin1 instead, and the server will send result sets that are invalid UTF-8, resulting in UnicodeDecodeError when the client tries to decode them. The symptoms are similar to a server with --skip-character-set-client-handshake
.
As the Oracle document describes, when MySQLdb is linked with the mysql-client 8.0 library and attempts to connect to a mysql 5.7 server with "charset": "utf8mb4", then the mysql client library sends a HandshakeResponse41 packet requesting charset+collation number 255: utf8mb4_0900_ai_ci, which is the new default collation for the utf8mb4 charset in MySQL 8.0. However, this new collation does not exist on a MySQL 5.7 server, so the server silently falls back to the character_set_server (latin1) and collation_server (latin1_swedish_ci). In MySQL 5.7, the default collation for utf8mb4 was charset+collation number 45: utf8mb4_general_ci.
We currently call the C function mysql_character_set_name
, but that function is a client-side lookup and does not verify the character set in case the server silently ignored the charset+collation from the handshake.
The connection’s collation is apparently only used for comparing literal strings, not for comparing columns (which have their own collation), so all this trouble is for a pretty uncommon use case (character_set_collation).
This bug occurs if you use the OSX Homebrew mysql@8.0.26 package, but surprisingly it does not occur when you use mysql-client@8.0.26 (brew install mysql-client@8.0; PATH=/usr/local/opt/mysql-client@8.0/bin:$PATH pip install -e /path/to/mysqlclient
). This is because mysql-client was compiled with -DDEFAULT_COLLATION=utf8mb4_general_ci
, whereas the mysql formula does not change the DEFAULT_COLLATION. With the default collation altered, mysql-client sends 45: utf8mb4_general_ci to the server in the HandshakeResponse41 packet, which mysql 5.7 recognizes.
Googling, I saw that this seems to have occurred to other people too:
- 2021-05-25 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 11: invalid start byte
- 2021-08-29 stackoverflow: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 180: invalid start byte
To Reproduce
Server
docker run --name=mysql \
-e MYSQL_ALLOW_EMPTY_PASSWORD=y \
-e MYSQL_ROOT_PASSWORD= \
-e MYSQL_INITDB_SKIP_TZINFO=y \
-e MYSQL_DATABASE=db \
-p 127.0.0.1:3306:3306 mysql:5.7.33
Code
On Mac OSX, you can use the mysql package:
brew install mysql@8.0
pip install mysqlclient
con = MySQLdb.connect(host='127.0.0.1', user='root', passwd='', port=3306, client_flag=2, charset='utf8mb4', use_unicode=True)
with con.cursor() as c:
# use charset literals https://dev.mysql.com/doc/refman/8.0/en/charset-literal.html
# to explicitly create a latin1 string
c._query(b"select _latin1'\x92'")
c._executed = b"select _latin1'\x92'" # needed 1.3.7, not 1.4.6
c.fetchall()
Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/yonran/third-party/mysqlclient/MySQLdb/cursors.py", line 321, in _query
self._post_get_result()
File "/Users/yonran/third-party/mysqlclient/MySQLdb/cursors.py", line 355, in _post_get_result
self._rows = self._fetch_row(0)
File "/Users/yonran/third-party/mysqlclient/MySQLdb/cursors.py", line 328, in _fetch_row
return self._result.fetch_row(size, self._fetch_type)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
You also query the server to see that character_set_results is set to latin1 rather than utf8mb4:
with con.cursor() as c:
c.execute("show variables like 'character_set_%'")
for row in c:
print(row)
(b'character_set_client', b'latin1')
(b'character_set_connection', b'latin1')
(b'character_set_database', b'latin1')
(b'character_set_filesystem', b'binary')
(b'character_set_results', b'latin1')
(b'character_set_server', b'latin1')
(b'character_set_system', b'utf8')
(b'character_sets_dir', b'/usr/share/mysql/charsets/')
Other implementations
How do other implementations handle the handshake?
- node mysql sends utf8mb4_general_ci in the handshake response, never utf8mb4_0900_ai_ci.
- rust mysql_common (used by the mysql crate) sends utf8mb4_general_ci in the handshake response packet if the server is ≥5.5.3, but never utf8mb4_0900_ai_ci.
- MySQL Connector/J has a complicated algorithm in 8.0.26 that they describe somewhat in 6.7 Using Character Sets and Unicode. I think what it does is:
- if the server version ≥8.0.1, then handshake with utf8mb4_0900_ai_ci; otherwise handshake with utf8mb4_general_ci (configurePreHandshake) to encode the username/password
- After the handshake, the client queries the server for the connection settings such as
@@character_set_results
(loadServerVariables) - Then the client executes an additional
SET NAMES
statement if the charset/collation do not match what is required (configurePostHandshake)
Workaround
If we are using the MySQL 8.0 client library to connect to a MySQL 5.7 server, we need to perform an additional SET NAMES
to set the charset to utf8mb4.
Currently, Connection.set_character_set
(which is called during Connection.__init__
) only executes SET NAMES
if it thinks that the parameters changed after connect. But because we can’t trust the client-side mysql_character_set_name
function to return the server’s value of @@character_set_results
, we should just SET NAMES
unconditionally.
con.query("SET NAMES utf8mb4")
con.store_result()
Environment
MySQL Server
- Server OS: Linux RDS or Docker
- Server Version: MySQL 5.7.33
MySQL Client
-
OS (e.g. Windows 10, Ubuntu 20.04): OS X
-
Python: Homebrew Python 3.9.7 (but it also occurs in 2.7.10 with a compatible mysqlclient 1.4.6)
-
Connector/C: Homebrew mysql-client 8.0.26