
Commit 7ed7aea

bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
Modify locale.localeconv(), time.tzname, os.strerror() and other functions to ignore the UTF-8 Mode: always use the current locale encoding.

Changes:

* Add _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx(). On decoding or encoding error, they return the position of the error and an error message, which are used to raise Unicode errors in PyUnicode_DecodeLocale() and PyUnicode_EncodeLocale().
* Replace _Py_DecodeCurrentLocale() with _Py_DecodeLocaleEx().
* PyUnicode_DecodeLocale() now uses _Py_DecodeLocaleEx() for all cases, especially for the strict error handler.
* Add _Py_DecodeUTF8Ex(): it returns more information on decoding errors and supports the strict error handler.
* Rename _Py_EncodeUTF8_surrogateescape() to _Py_EncodeUTF8Ex().
* Replace _Py_EncodeCurrentLocale() with _Py_EncodeLocaleEx().
* Ignore the UTF-8 mode to encode/decode localeconv(), strerror() and time zone name.
* PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize() and PyUnicode_EncodeLocale() now ignore the UTF-8 mode: always use the "current" locale.
* Remove _PyUnicode_DecodeCurrentLocale(), _PyUnicode_DecodeCurrentLocaleAndSize() and _PyUnicode_EncodeCurrentLocale().
1 parent ee3b835 commit 7ed7aea

12 files changed: +472 -505 lines changed

Doc/c-api/sys.rst

Lines changed: 22 additions & 0 deletions
@@ -106,6 +106,16 @@ Operating System Utilities
    surrogate character, escape the bytes using the surrogateescape error
    handler instead of decoding them.

+   Encoding, highest priority to lowest priority:
+
+   * ``UTF-8`` on macOS and Android;
+   * ``UTF-8`` if the Python UTF-8 mode is enabled;
+   * ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+     ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+     and :c:func:`mbstowcs` and :c:func:`wcstombs` functions uses the
+     ``ISO-8859-1`` encoding.
+   * the current locale encoding.
+
    Return a pointer to a newly allocated wide character string, use
    :c:func:`PyMem_RawFree` to free the memory. If size is not ``NULL``, write
    the number of wide characters excluding the null character into ``*size``
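The priority list added above can be read as a simple decision chain. The following is a minimal sketch of that reading, not CPython's implementation: the `encoding_inputs` struct, the `encoding_choice` enum and `choose_encoding()` are hypothetical names invented for illustration, and the three flags stand in for checks that CPython performs on its own (platform, interpreter configuration, locale probing).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inputs (assumption): each flag summarizes one condition
   from the documented list. */
struct encoding_inputs {
    int is_macos_or_android;  /* non-zero on macOS and Android */
    int utf8_mode;            /* non-zero if the Python UTF-8 mode is enabled */
    int force_ascii;          /* non-zero if LC_CTYPE is "C", CODESET reports
                                 ASCII, and mbstowcs()/wcstombs() behave as
                                 ISO-8859-1 */
};

/* Hypothetical result type (assumption). */
enum encoding_choice { ENC_UTF8, ENC_ASCII, ENC_CURRENT_LOCALE };

/* Walk the documented list from highest to lowest priority and return
   the first entry whose condition holds. */
enum encoding_choice
choose_encoding(const struct encoding_inputs *in)
{
    assert(in != NULL);
    if (in->is_macos_or_android) {
        return ENC_UTF8;
    }
    if (in->utf8_mode) {
        return ENC_UTF8;
    }
    if (in->force_ascii) {
        return ENC_ASCII;
    }
    return ENC_CURRENT_LOCALE;
}
```

Because the checks are ordered, enabling the UTF-8 mode wins over the ``ASCII`` special case, and the current locale encoding is only used when no earlier entry applies.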
@@ -137,6 +147,18 @@ Operating System Utilities
    :ref:`surrogateescape error handler <surrogateescape>`: surrogate characters
    in the range U+DC80..U+DCFF are converted to bytes 0x80..0xFF.

+   Encoding, highest priority to lowest priority:
+
+   * ``UTF-8`` on macOS and Android;
+   * ``UTF-8`` if the Python UTF-8 mode is enabled;
+   * ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+     ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+     and :c:func:`mbstowcs` and :c:func:`wcstombs` functions uses the
+     ``ISO-8859-1`` encoding.
+   * the current locale encoding.
+
+   The function uses the UTF-8 encoding in the Python UTF-8 mode.
+
    Return a pointer to a newly allocated byte string, use :c:func:`PyMem_Free`
    to free the memory. Return ``NULL`` on encoding error or memory allocation
    error
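The surrogateescape rule quoted in this hunk (U+DC80..U+DCFF become bytes 0x80..0xFF) is small enough to show in isolation. This is a sketch under simplifying assumptions, not the code path CPython uses: `escape_to_byte()` and `escape_from_byte()` are hypothetical helpers, and the ASCII pass-through stands in for the real locale encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point to one byte (hypothetical helper).
   ASCII maps to itself; a lone surrogate in U+DC80..U+DCFF maps back to
   the raw byte 0x80..0xFF it was created from.
   Return 0 on success, -1 if the code point needs a real encoder. */
int
escape_to_byte(uint32_t ch, unsigned char *out)
{
    assert(out != NULL);
    if (ch < 0x80) {
        *out = (unsigned char)ch;
        return 0;
    }
    if (ch >= 0xDC80 && ch <= 0xDCFF) {
        *out = (unsigned char)(ch - 0xDC00);
        return 0;
    }
    return -1;
}

/* Inverse direction (hypothetical helper): an undecodable byte
   0x80..0xFF becomes U+DC80..U+DCFF, ASCII stays unchanged. */
uint32_t
escape_from_byte(unsigned char byte)
{
    return byte < 0x80 ? (uint32_t)byte : 0xDC00u + byte;
}
```

The two helpers are inverses on the escaped range, which is what lets undecodable bytes survive a decode and encode round trip unchanged.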

Doc/c-api/unicode.rst

Lines changed: 16 additions & 0 deletions
@@ -770,12 +770,20 @@ system.
    :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
    Python startup).

+   This function ignores the Python UTF-8 mode.
+
    .. seealso::

       The :c:func:`Py_DecodeLocale` function.

    .. versionadded:: 3.3

+   .. versionchanged:: 3.7
+      The function now also uses the current locale encoding for the
+      ``surrogateescape`` error handler. Previously, :c:func:`Py_DecodeLocale`
+      was used for the ``surrogateescape``, and the current locale encoding was
+      used for ``strict``.
+

 .. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)

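The note above names two error handlers, ``strict`` and ``surrogateescape``, while the new internal functions in this commit take an `int surrogateescape` flag. One plausible bridge between the two is a small name-to-flag mapping, sketched below. `parse_locale_errors()` is a hypothetical name, and treating `NULL` as ``strict`` is an assumption made for the sketch (the diff only shows that `NULL` is passed in `Modules/_localemodule.c`).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map an error-handler name to the surrogateescape flag (hypothetical).
   Return 0 and set *surrogateescape on success, or -1 if the name is
   neither "strict" nor "surrogateescape". */
int
parse_locale_errors(const char *errors, int *surrogateescape)
{
    assert(surrogateescape != NULL);
    /* Assumption: a NULL name selects the strict handler. */
    if (errors == NULL || strcmp(errors, "strict") == 0) {
        *surrogateescape = 0;
        return 0;
    }
    if (strcmp(errors, "surrogateescape") == 0) {
        *surrogateescape = 1;
        return 0;
    }
    return -1;
}
```

A caller would reject any other handler name up front and pass the resulting flag down to the decoder or encoder.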
@@ -797,12 +805,20 @@ system.
    :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
    Python startup).

+   This function ignores the Python UTF-8 mode.
+
    .. seealso::

       The :c:func:`Py_EncodeLocale` function.

    .. versionadded:: 3.3

+   .. versionchanged:: 3.7
+      The function now also uses the current locale encoding for the
+      ``surrogateescape`` error handler. Previously, :c:func:`Py_EncodeLocale`
+      was used for the ``surrogateescape``, and the current locale encoding was
+      used for ``strict``.
+

 File System Encoding
 """"""""""""""""""""

Include/fileutils.h

Lines changed: 30 additions & 7 deletions
@@ -20,18 +20,41 @@ PyAPI_FUNC(char*) _Py_EncodeLocaleRaw(
 #endif

 #ifdef Py_BUILD_CORE
+PyAPI_FUNC(int) _Py_DecodeUTF8Ex(
+    const char *arg,
+    Py_ssize_t arglen,
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int surrogateescape);
+
+PyAPI_FUNC(int) _Py_EncodeUTF8Ex(
+    const wchar_t *text,
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int raw_malloc,
+    int surrogateescape);
+
 PyAPI_FUNC(wchar_t*) _Py_DecodeUTF8_surrogateescape(
-    const char *s,
-    Py_ssize_t size,
-    size_t *p_wlen);
+    const char *arg,
+    Py_ssize_t arglen);

-PyAPI_FUNC(wchar_t *) _Py_DecodeCurrentLocale(
+PyAPI_FUNC(int) _Py_DecodeLocaleEx(
     const char *arg,
-    size_t *size);
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);

-PyAPI_FUNC(char*) _Py_EncodeCurrentLocale(
+PyAPI_FUNC(int) _Py_EncodeLocaleEx(
     const wchar_t *text,
-    size_t *error_pos);
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);
 #endif

 #ifndef Py_LIMITED_API
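The new prototypes return an `int` status and report failures through output parameters (a position and a `reason` string), which is what the commit message describes. The sketch below imitates that calling shape with a standalone UTF-8 decoder. It is not CPython's code: `decode_utf8_sketch()` and `decode_one()` are hypothetical, the output type is `uint32_t` instead of `wchar_t` to keep the example portable, and the return codes (0 success, -1 memory error, -2 decoding error) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode one UTF-8 sequence from s (n bytes available, n >= 1).
   Return its length and store the code point, or return 0 if invalid. */
static size_t
decode_one(const unsigned char *s, size_t n, uint32_t *ch)
{
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (s[0] < 0x80) { *ch = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else { return 0; }
    if (len > n) { return 0; }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { return 0; }
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (c < min_value[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        return 0;
    }
    *ch = c;
    return len;
}

/* Return 0 on success (*out is a malloc'ed, zero-terminated array and
   *outlen its length), -1 on allocation failure, -2 on a decoding error
   (*outlen is the byte offset of the error, *reason a static message). */
int
decode_utf8_sketch(const char *arg, size_t arglen, uint32_t **out,
                   size_t *outlen, const char **reason, int surrogateescape)
{
    const unsigned char *s = (const unsigned char *)arg;
    uint32_t *buf;
    size_t i = 0, n = 0;

    assert(arg != NULL && out != NULL && outlen != NULL);
    buf = malloc((arglen + 1) * sizeof(uint32_t));
    if (buf == NULL) { return -1; }
    while (i < arglen) {
        uint32_t ch;
        size_t len = decode_one(s + i, arglen - i, &ch);
        if (len == 0) {
            if (!surrogateescape) {
                free(buf);
                *outlen = i;
                if (reason != NULL) { *reason = "invalid UTF-8 byte sequence"; }
                return -2;
            }
            /* Escape the undecodable byte as U+DC80..U+DCFF. */
            ch = 0xDC00u + s[i];
            len = 1;
        }
        buf[n++] = ch;
        i += len;
    }
    buf[n] = 0;
    *out = buf;
    *outlen = n;
    return 0;
}
```

Returning the error position and reason separately lets a higher layer build a precise Unicode error, instead of only learning that decoding failed.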

Include/unicodeobject.h

Lines changed: 0 additions & 14 deletions
@@ -1810,20 +1810,6 @@ PyAPI_FUNC(PyObject*) PyUnicode_EncodeLocale(
     PyObject *unicode,
     const char *errors
     );
-
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocale(
-    const char *str,
-    const char *errors);
-
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocaleAndSize(
-    const char *str,
-    Py_ssize_t len,
-    const char *errors);
-
-PyAPI_FUNC(PyObject*) _PyUnicode_EncodeCurrentLocale(
-    PyObject *unicode,
-    const char *errors
-    );
 #endif

 /* --- File system encoding ---------------------------------------------- */

Modules/_datetimemodule.c

Lines changed: 1 addition & 1 deletion
@@ -696,7 +696,7 @@ static int parse_isoformat_date(const char *dtstr,
     if (NULL == p) {
         return -1;
     }
-
+
     if (*(p++) != '-') {
         return -2;
     }

Modules/_localemodule.c

Lines changed: 2 additions & 1 deletion
@@ -572,8 +572,9 @@ PyIntl_bind_textdomain_codeset(PyObject* self,PyObject*args)
     if (!PyArg_ParseTuple(args, "sz", &domain, &codeset))
         return NULL;
     codeset = bind_textdomain_codeset(domain, codeset);
-    if (codeset)
+    if (codeset) {
         return PyUnicode_DecodeLocale(codeset, NULL);
+    }
     Py_RETURN_NONE;
 }
 #endif

Modules/getpath.c

Lines changed: 2 additions & 2 deletions
@@ -449,8 +449,8 @@ search_for_exec_prefix(const _PyCoreConfig *core_config,
             n = fread(buf, 1, MAXPATHLEN, f);
             buf[n] = '\0';
             fclose(f);
-            rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n, NULL);
-            if (rel_builddir_path != NULL) {
+            rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n);
+            if (rel_builddir_path) {
                 wcsncpy(exec_prefix, calculate->argv0_path, MAXPATHLEN);
                 exec_prefix[MAXPATHLEN] = L'\0';
                 joinpath(exec_prefix, rel_builddir_path);

Modules/readline.c

Lines changed: 2 additions & 2 deletions
@@ -132,13 +132,13 @@ static PyModuleDef readlinemodule;
 static PyObject *
 encode(PyObject *b)
 {
-    return _PyUnicode_EncodeCurrentLocale(b, "surrogateescape");
+    return PyUnicode_EncodeLocale(b, "surrogateescape");
 }

 static PyObject *
 decode(const char *s)
 {
-    return _PyUnicode_DecodeCurrentLocale(s, "surrogateescape");
+    return PyUnicode_DecodeLocale(s, "surrogateescape");
 }


Modules/timemodule.c

Lines changed: 5 additions & 6 deletions
@@ -418,11 +418,11 @@ tmtotuple(struct tm *p
     SET(8, p->tm_isdst);
 #ifdef HAVE_STRUCT_TM_TM_ZONE
     PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(p->tm_zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(p->tm_zone, "surrogateescape"));
     SET(10, p->tm_gmtoff);
 #else
     PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(zone, "surrogateescape"));
     PyStructSequence_SET_ITEM(v, 10, _PyLong_FromTime_t(gmtoff));
 #endif /* HAVE_STRUCT_TM_TM_ZONE */
 #undef SET
@@ -809,8 +809,7 @@ time_strftime(PyObject *self, PyObject *args)
 #ifdef HAVE_WCSFTIME
             ret = PyUnicode_FromWideChar(outbuf, buflen);
 #else
-            ret = _PyUnicode_DecodeCurrentLocaleAndSize(outbuf, buflen,
-                                                        "surrogateescape");
+            ret = PyUnicode_DecodeLocaleAndSize(outbuf, buflen, "surrogateescape");
 #endif
             PyMem_Free(outbuf);
             break;
@@ -1541,8 +1540,8 @@ PyInit_timezone(PyObject *m) {
     PyModule_AddIntConstant(m, "altzone", timezone-3600);
 #endif
     PyModule_AddIntConstant(m, "daylight", daylight);
-    otz0 = _PyUnicode_DecodeCurrentLocale(tzname[0], "surrogateescape");
-    otz1 = _PyUnicode_DecodeCurrentLocale(tzname[1], "surrogateescape");
+    otz0 = PyUnicode_DecodeLocale(tzname[0], "surrogateescape");
+    otz1 = PyUnicode_DecodeLocale(tzname[1], "surrogateescape");
     PyModule_AddObject(m, "tzname", Py_BuildValue("(NN)", otz0, otz1));
 #else /* !HAVE_TZNAME || __GLIBC__ || __CYGWIN__*/
     {
