Description
Describe the bug
Use case: the database is configured with a character set, and a non-default collation for that character set.
[mysqld]
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci
For reference, note that on this particular installation (MariaDB 10.3), the default collation for utf8mb4
is utf8mb4_general_ci
, NOT utf8mb4_unicode_ci
.
The bug in this case is that when using mysqlclient
, it is IMPOSSIBLE to set the correct collation. There is an option charset
, which whether it is provided, blank, or empty, issues a command as such:
SET NAME <charset>
SET NAME
will ALWAYS use the default collation for the charset if the collation is not specified, which is documented here: https://mariadb.com/kb/en/set-names/, and much more deeply documented here: https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html.
In this case we are specifying charset="utf8mb4"
. The command is being generated is:
SET NAMES utf8mb4
When COLLATION is omitted, this uses the default collation for utf8mb4
which utf8mb4_general_ci
. This is hard-coded into MariaDB and cannot be changed: https://mariadb.com/kb/en/change-is_default-value-for-information_schemacollations/
The problem is that certain SQL queries, such as a CAST as generated by the Django Cast function, will infer the connection's collation, and can fail if the connection's collation does not match the table:
(1267, "Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='")
The simple documented solution is to also specify the collation:
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci
I have tried several workarounds to set the collation, which currently fail, because set_character_set
is always run AFTER the workaround, therefore negating it.
Workaround 1
Use the init_command
option to send SET NAME utf8mb4 COLLATE utf8mb4_unicode_ci
after connecting. Unfortunately, from examining the query log, charset
is set AFTER init_command
, therefore negating it. The query log looks as such:
31 Query SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci -- run by init_command
31 Query SET NAMES utf8mb4 -- from mysqlclient
Workaround 2
Use the init_connect
option on the MySQL server. Unfortunately, the story is the same, as the charset
is issued after the init_connect
script is run, therefore negating it.
32 Query SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci -- run by the server
32 Query SET NAMES utf8mb4 -- from mysqlclient
Possible fixes
Fix 1: Do not set charset if empty.
If charset is empty, or possibly None
, do not set it. This way, it does not override the server's default behavior or the init_command
mysqlclient/MySQLdb/connections.py
Lines 195 to 197 in 5c04abf
It seems this behavior (which is the source of my bug) was introduce in 2.1: #509
Fix 2: Add a collation
option.
This deviates from the MySQL C connector behavior, but it would be really nice to be able to specify both "charset" and "collation". Special implementation in Python might be required for this, e.g. to issue a SQL command after connecting.
Environment
MySQL Server
- MariaDB (Debian 11, Windows 10)
- MariaDB 10.3.35
MySQL Client
-
Debian, Windows
-
Python tested on multiple versions: 3.7+