Commit 371367c

Reintroduce legacy 'SJIS-win' text encoding in mbstring
In e245985, I combined mbstring's "SJIS-win" text encoding into CP932. This was done after some testing which appeared to show that the mappings for "SJIS-win" were the same as those for "CP932".

Later, it was found that there was actually a small difference prior to e245985 when converting Unicode to CP932. The mappings for the following two codepoints were different:

        CP932   SJIS-win
U+203E  0x7E    0x81 0x50
U+00A5  0x5C    0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and 'YEN SIGN' to the ASCII bytes which have conflicting uses in most legacy Japanese text encodings. "SJIS-win" mapped these to equivalent JIS X 0208 fullwidth characters.

Since e2459867af was not intended to cause any user-visible change in behavior, I am rolling back the merge of "CP932" and "SJIS-win".

It is unclear whether these two text encodings should be kept separate or merged in a future release. An extensive discussion of the related historical background and compatibility issues involved can be found in this GitHub thread: #8308
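The difference described in the commit message can be sketched in a few lines of C. This is only an illustration of the two divergent mappings, not libmbfl code: the enum, the function name, and the convention of packing the output bytes into one integer are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical names for illustration; none of these exist in libmbfl. */
enum demo_variant { DEMO_VARIANT_CP932, DEMO_VARIANT_SJIS_WIN };

/*
 * Returns the encoded bytes packed into one integer (0x7E for a single
 * byte, 0x8150 for the byte pair 0x81 0x50), or 0 if the codepoint is not
 * one of the two which differ between the encodings.
 */
static uint16_t demo_encode_divergent(uint32_t codepoint, enum demo_variant variant)
{
    switch (codepoint) {
    case 0x203E: /* OVERLINE */
        return variant == DEMO_VARIANT_CP932 ? 0x7E : 0x8150;
    case 0x00A5: /* YEN SIGN */
        return variant == DEMO_VARIANT_CP932 ? 0x5C : 0x818F;
    default:
        return 0; /* every other codepoint maps identically in both */
    }
}
```

Every codepoint other than these two takes the same path in both encodings, which is why the earlier testing could miss the difference.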
1 parent 7f26661 commit 371367c

6 files changed: +130 -22 lines changed

ext/mbstring/libmbfl/filters/mbfilter_cp932.c

Lines changed: 82 additions & 13 deletions
@@ -27,6 +27,36 @@
  *
  */

+/* CP932 is Microsoft's version of Shift-JIS.
+ *
+ * What we call "SJIS-win" is a variant of CP932 which maps U+00A5
+ * and U+203E the same way as eucJP-win; namely, instead of mapping
+ * U+00A5 (YEN SIGN) to 0x5C and U+203E (OVERLINE) to 0x7E,
+ * these codepoints are mapped to appropriate JIS X 0208 characters.
+ *
+ * When converting from Shift-JIS to Unicode, there is no difference
+ * between CP932 and "SJIS-win".
+ *
+ * Additional facts:
+ *
+ * • In the libmbfl library which formed the base for mbstring, "CP932" and
+ *   "SJIS-win" were originally aliases. The differing mappings were added in
+ *   December 2002. The libmbfl author later stated that this was done so that
+ *   "CP932" would comply with a certain specification, while "SJIS-win" would
+ *   maintain the existing mappings. He does not remember which specification
+ *   it was.
+ * • The WHATWG specification for "Shift_JIS" (followed by web browsers)
+ *   agrees with our mappings for "CP932".
+ * • Microsoft Windows' "best-fit" mappings for CP932 (via the
+ *   WideCharToMultiByte API) convert U+00A5 to 0x5C, which also agrees with
+ *   our mappings for "CP932".
+ * • glibc's iconv converts U+203E to CP932 0x7E, which again agrees with
+ *   our mappings for "CP932".
+ * • When converting Shift-JIS to CP932, the conversion goes through Unicode.
+ *   Shift-JIS 0x7E converts to U+203E, so mapping U+203E to 0x7E means that
+ *   0x7E will go to 0x7E when converting Shift-JIS to CP932.
+ */
+
 #include "mbfilter.h"
 #include "mbfilter_cp932.h"

@@ -54,7 +84,8 @@ static const unsigned char mblen_table_sjis[] = { /* 0x80-0x9f,0xE0-0xFF */
 	2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
 };

-static const char *mbfl_encoding_cp932_aliases[] = {"MS932", "Windows-31J", "MS_Kanji", "SJIS-win", "SJIS-ms", "SJIS-open", NULL};
+static const char *mbfl_encoding_cp932_aliases[] = {"MS932", "Windows-31J", "MS_Kanji", NULL};
+static const char *mbfl_encoding_sjiswin_aliases[] = {"SJIS-ms", "SJIS-open", NULL};

 const mbfl_encoding mbfl_encoding_cp932 = {
 	mbfl_no_encoding_cp932,
@@ -87,6 +118,37 @@ const struct mbfl_convert_vtbl vtbl_wchar_cp932 = {
 	NULL,
 };

+const mbfl_encoding mbfl_encoding_sjiswin = {
+	mbfl_no_encoding_sjiswin,
+	"SJIS-win",
+	"Shift_JIS",
+	mbfl_encoding_sjiswin_aliases,
+	mblen_table_sjis,
+	MBFL_ENCTYPE_GL_UNSAFE,
+	&vtbl_sjiswin_wchar,
+	&vtbl_wchar_sjiswin
+};
+
+const struct mbfl_convert_vtbl vtbl_sjiswin_wchar = {
+	mbfl_no_encoding_sjiswin,
+	mbfl_no_encoding_wchar,
+	mbfl_filt_conv_common_ctor,
+	NULL,
+	mbfl_filt_conv_cp932_wchar,
+	mbfl_filt_conv_cp932_wchar_flush,
+	NULL,
+};
+
+const struct mbfl_convert_vtbl vtbl_wchar_sjiswin = {
+	mbfl_no_encoding_wchar,
+	mbfl_no_encoding_sjiswin,
+	mbfl_filt_conv_common_ctor,
+	NULL,
+	mbfl_filt_conv_wchar_sjiswin,
+	mbfl_filt_conv_common_flush,
+	NULL,
+};
+
 #define CK(statement) do { if ((statement) < 0) return (-1); } while (0)

 #define SJIS_ENCODE(c1,c2,s1,s2) \
@@ -132,12 +194,7 @@ const struct mbfl_convert_vtbl vtbl_wchar_cp932 = {
 			} \
 		} while (0)

-
-/*
- * SJIS-win => wchar
- */
-int
-mbfl_filt_conv_cp932_wchar(int c, mbfl_convert_filter *filter)
+int mbfl_filt_conv_cp932_wchar(int c, mbfl_convert_filter *filter)
 {
 	int c1, s, s1, s2, w;

@@ -224,18 +281,16 @@ static int mbfl_filt_conv_cp932_wchar_flush(mbfl_convert_filter *filter)
 	return 0;
 }

-/*
- * wchar => SJIS-win
- */
-int
-mbfl_filt_conv_wchar_cp932(int c, mbfl_convert_filter *filter)
+int mbfl_filt_conv_wchar_cp932(int c, mbfl_convert_filter *filter)
 {
 	int c1, c2, s1, s2;

 	s1 = 0;
 	s2 = 0;
 	if (c >= ucs_a1_jis_table_min && c < ucs_a1_jis_table_max) {
 		s1 = ucs_a1_jis_table[c - ucs_a1_jis_table_min];
+	} else if (c == 0x203E) {
+		s1 = 0x7E;
 	} else if (c >= ucs_a2_jis_table_min && c < ucs_a2_jis_table_max) {
 		s1 = ucs_a2_jis_table[c - ucs_a2_jis_table_min];
 	} else if (c >= ucs_i_jis_table_min && c < ucs_i_jis_table_max) {
@@ -251,7 +306,7 @@ mbfl_filt_conv_wchar_cp932(int c, mbfl_convert_filter *filter)
 	}
 	if (s1 <= 0) {
 		if (c == 0xa5) { /* YEN SIGN */
-			s1 = 0x216F; /* FULLWIDTH YEN SIGN */
+			s1 = 0x5C;
 		} else if (c == 0xff3c) { /* FULLWIDTH REVERSE SOLIDUS */
 			s1 = 0x2140;
 		} else if (c == 0x2225) { /* PARALLEL TO */
@@ -310,3 +365,17 @@ mbfl_filt_conv_wchar_cp932(int c, mbfl_convert_filter *filter)

 	return 0;
 }
+
+int mbfl_filt_conv_wchar_sjiswin(int c, mbfl_convert_filter *filter)
+{
+	if (c == 0xA5) {
+		CK((*filter->output_function)(0x81, filter->data));
+		CK((*filter->output_function)(0x8F, filter->data));
+	} else if (c == 0x203E) {
+		CK((*filter->output_function)(0x81, filter->data));
+		CK((*filter->output_function)(0x50, filter->data));
+	} else {
+		return mbfl_filt_conv_wchar_cp932(c, filter);
+	}
+	return 0;
+}
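The new `mbfl_filt_conv_wchar_sjiswin` function intercepts the two special codepoints and hands everything else to the CP932 converter. The sketch below reproduces that delegation pattern outside of libmbfl so it can be compiled on its own. The `demo_*` types and functions are hypothetical stand-ins: `demo_filter` models only the `output_function` and `data` members used in the diff, and `demo_wchar_cp932` handles just ASCII and the two special codepoints rather than the real JIS tables.

```c
#include <stddef.h>

#define CK(statement) do { if ((statement) < 0) return (-1); } while (0)

/* Hypothetical stand-in for mbfl_convert_filter (two members only). */
typedef struct demo_filter {
    int (*output_function)(int c, void *data);
    void *data;
} demo_filter;

typedef struct demo_buffer {
    unsigned char bytes[4];
    size_t len;
} demo_buffer;

/* Output callback: appends one byte, fails if the buffer is full. */
static int demo_collect(int c, void *data)
{
    demo_buffer *buf = data;
    if (buf->len >= sizeof(buf->bytes)) {
        return -1;
    }
    buf->bytes[buf->len++] = (unsigned char)c;
    return 0;
}

/* Simplified base converter: ASCII plus the two CP932 special cases. */
static int demo_wchar_cp932(int c, demo_filter *filter)
{
    if (c == 0xA5) {
        CK((*filter->output_function)(0x5C, filter->data));
    } else if (c == 0x203E) {
        CK((*filter->output_function)(0x7E, filter->data));
    } else if (c >= 0 && c < 0x80) {
        CK((*filter->output_function)(c, filter->data));
    } else {
        return -1; /* out of scope for this sketch */
    }
    return 0;
}

/* Same shape as the function added in the diff: override, else delegate. */
static int demo_wchar_sjiswin(int c, demo_filter *filter)
{
    if (c == 0xA5) {
        CK((*filter->output_function)(0x81, filter->data));
        CK((*filter->output_function)(0x8F, filter->data));
    } else if (c == 0x203E) {
        CK((*filter->output_function)(0x81, filter->data));
        CK((*filter->output_function)(0x50, filter->data));
    } else {
        return demo_wchar_cp932(c, filter);
    }
    return 0;
}

/* Converts one codepoint and packs the output bytes; -1 on failure. */
static long demo_convert(int (*convert)(int, demo_filter *), int c)
{
    demo_buffer buf = { {0}, 0 };
    demo_filter filter = { demo_collect, &buf };
    long packed = 0;
    size_t i;

    if (convert(c, &filter) < 0) {
        return -1;
    }
    for (i = 0; i < buf.len; i++) {
        packed = (packed << 8) | buf.bytes[i];
    }
    return packed;
}
```

With this layout the variant needs no tables of its own, which matches how the diff reuses `mbfl_filt_conv_cp932_wchar` for the decoding direction.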

ext/mbstring/libmbfl/filters/mbfilter_cp932.h

Lines changed: 5 additions & 0 deletions
@@ -36,7 +36,12 @@ extern const mbfl_encoding mbfl_encoding_cp932;
 extern const struct mbfl_convert_vtbl vtbl_cp932_wchar;
 extern const struct mbfl_convert_vtbl vtbl_wchar_cp932;

+extern const mbfl_encoding mbfl_encoding_sjiswin;
+extern const struct mbfl_convert_vtbl vtbl_sjiswin_wchar;
+extern const struct mbfl_convert_vtbl vtbl_wchar_sjiswin;
+
 int mbfl_filt_conv_cp932_wchar(int c, mbfl_convert_filter *filter);
 int mbfl_filt_conv_wchar_cp932(int c, mbfl_convert_filter *filter);
+int mbfl_filt_conv_wchar_sjiswin(int c, mbfl_convert_filter *filter);

 #endif /* MBFL_MBFILTER_CP932_H */

ext/mbstring/libmbfl/mbfl/mbfl_encoding.c

Lines changed: 1 addition & 0 deletions
@@ -121,6 +121,7 @@ static const mbfl_encoding *mbfl_encoding_ptr_list[] = {
 	&mbfl_encoding_utf8_kddi_b,
 	&mbfl_encoding_utf8_sb,
 	&mbfl_encoding_cp932,
+	&mbfl_encoding_sjiswin,
 	&mbfl_encoding_cp51932,
 	&mbfl_encoding_jis,
 	&mbfl_encoding_2022jp,
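Registering `mbfl_encoding_sjiswin` in the pointer list, together with the split alias arrays, changes which encoding a given name resolves to. The following sketch shows that effect with a simplified registry. The `demo_*` names are hypothetical, the struct carries only a name and an alias list, and case-insensitive matching is an assumption of this example rather than something shown in the diff.

```c
#include <stddef.h>

/* Hypothetical identifiers; the real enum is mbfl_no_encoding. */
enum demo_id { DEMO_ID_NOT_FOUND = -1, DEMO_ID_CP932 = 0, DEMO_ID_SJISWIN = 1 };

typedef struct demo_encoding {
    int id;
    const char *name;
    const char *const *aliases; /* NULL-terminated, as in the diff */
} demo_encoding;

/* Alias lists copied from the post-commit state of mbfilter_cp932.c. */
static const char *const demo_cp932_aliases[] = {"MS932", "Windows-31J", "MS_Kanji", NULL};
static const char *const demo_sjiswin_aliases[] = {"SJIS-ms", "SJIS-open", NULL};

static const demo_encoding demo_encodings[] = {
    { DEMO_ID_CP932, "CP932", demo_cp932_aliases },
    { DEMO_ID_SJISWIN, "SJIS-win", demo_sjiswin_aliases },
};

static int demo_lower(int ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch - 'A' + 'a' : ch;
}

/* ASCII case-insensitive string equality (an assumption of this sketch). */
static int demo_names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (demo_lower((unsigned char)*a) != demo_lower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
    return *a == *b;
}

/* Checks each encoding's primary name first, then its aliases. */
static int demo_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(demo_encodings) / sizeof(demo_encodings[0]); i++) {
        const char *const *alias;

        if (demo_names_equal(name, demo_encodings[i].name)) {
            return demo_encodings[i].id;
        }
        for (alias = demo_encodings[i].aliases; *alias != NULL; alias++) {
            if (demo_names_equal(name, *alias)) {
                return demo_encodings[i].id;
            }
        }
    }
    return DEMO_ID_NOT_FOUND;
}
```

Under these assumptions, `demo_lookup("SJIS-open")` yields `DEMO_ID_SJISWIN` while `demo_lookup("MS932")` yields `DEMO_ID_CP932`, mirroring the alias split made in this commit.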

ext/mbstring/libmbfl/mbfl/mbfl_encoding.h

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ enum mbfl_no_encoding {
 	mbfl_no_encoding_sjis_mac,
 	mbfl_no_encoding_sjis2004,
 	mbfl_no_encoding_cp932,
+	mbfl_no_encoding_sjiswin,
 	mbfl_no_encoding_cp51932,
 	mbfl_no_encoding_jis,
 	mbfl_no_encoding_2022jp,

ext/mbstring/tests/cp932_encoding.phpt

Lines changed: 39 additions & 7 deletions
@@ -1,5 +1,5 @@
 --TEST--
-Exhaustive test of CP932 encoding verification and conversion
+Exhaustive test of CP932 encoding verification and conversion (including 'SJIS-win' variant)
 --EXTENSIONS--
 mbstring
 --SKIPIF--
@@ -34,8 +34,10 @@ for ($i = 0xF0; $i <= 0xF9; $i++) {
 $fromUnicode["\x00\xA2"] = "\x81\x91";
 /* U+00A3 is POUND SIGN; convert to FULLWIDTH POUND SIGN */
 $fromUnicode["\x00\xA3"] = "\x81\x92";
-/* U+00A5 is YEN SIGN; convert to FULLWIDTH YEN SIGN */
-$fromUnicode["\x00\xA5"] = "\x81\x8F";
+/* U+00A5 is YEN SIGN; convert to 0x5C, which has conflicting uses
+ * (either as backslash or as Yen sign) */
+$fromUnicode["\x00\xA5"] = "\x5C";
+

 /* We map the JIS X 0208 FULLWIDTH TILDE to U+FF5E (FULLWIDTH TILDE)
  * But when converting Unicode to CP932, we also accept U+301C (WAVE DASH) */
@@ -51,12 +53,13 @@ $fromUnicode["\x20\x16"] = "\x81\x61";
  * but when converting Unicode to CP932, we also accept U+00AC (NOT SIGN) */
 $fromUnicode["\x00\xAC"] = "\x81\xCA";

-/* U+203E is OVERLINE; convert to JIS X 0208 FULLWIDTH MACRON */
-$fromUnicode["\x20\x3E"] = "\x81\x50";
-
-/* U+00AF is MACRON; it can also go to FULLWIDTH MACRON */
+/* U+00AF is MACRON; convert to FULLWIDTH MACRON */
 $fromUnicode["\x00\xAF"] = "\x81\x50";

+/* U+203E is OVERLINE; convert to 0x7E, which has conflicting uses
+ * (either as tilde or as overline) */
+$fromUnicode["\x20\x3E"] = "\x7E";
+
 findInvalidChars($validChars, $invalidChars, $truncated, array_fill_keys(range(0x81, 0x9F), 2) + array_fill_keys(range(0xE0, 0xFC), 2));

 findInvalidChars($fromUnicode, $invalidCodepoints, $unused, array_fill_keys(range(0, 0xFF), 2));
@@ -106,17 +109,46 @@ echo "CP932 verification and conversion works on all invalid characters\n";
 convertAllInvalidChars($invalidCodepoints, $fromUnicode, 'UTF-16BE', 'CP932', '%');
 echo "Unicode -> CP932 conversion works on all invalid codepoints\n";

+/* Now test 'SJIS-win' variant of CP932, which is really CP932 but with
+ * two different mappings
+ * Instead of mapping U+00A5 and U+203E to the single bytes 0x5C and 07E
+ * (which have conflicting uses), 'SJIS-win' maps them to appropriate
+ * JIS X 0208 characters */
+
+/* U+00A5 is YEN SIGN; convert to FULLWIDTH YEN SIGN */
+$fromUnicode["\x00\xA5"] = "\x81\x8F";
+/* U+203E is OVERLINE; convert to JIS X 0208 FULLWIDTH MACRON */
+$fromUnicode["\x20\x3E"] = "\x81\x50";
+
+testAllValidChars($validChars, 'SJIS-win', 'UTF-16BE');
+foreach ($nonInvertible as $cp932 => $unicode)
+	testValidString($cp932, $unicode, 'SJIS-win', 'UTF-16BE', false);
+echo "SJIS-win verification and conversion works on all valid characters\n";
+
+testAllInvalidChars($invalidChars, $validChars, 'SJIS-win', 'UTF-16BE', "\x00%");
+echo "SJIS-win verification and conversion works on all invalid characters\n";
+
+convertAllInvalidChars($invalidCodepoints, $fromUnicode, 'UTF-16BE', 'SJIS-win', '%');
+echo "Unicode -> SJIS-win conversion works on all invalid codepoints\n";
+
 // Test "long" illegal character markers
 mb_substitute_character("long");
 convertInvalidString("\x80", "%", "CP932", "UTF-8");
 convertInvalidString("\xEA", "%", "CP932", "UTF-8");
 convertInvalidString("\x81\x20", "%", "CP932", "UTF-8");
 convertInvalidString("\xEA\xA9", "%", "CP932", "UTF-8");
+convertInvalidString("\x80", "%", "SJIS-win", "UTF-8");
+convertInvalidString("\xEA", "%", "SJIS-win", "UTF-8");
+convertInvalidString("\x81\x20", "%", "SJIS-win", "UTF-8");
+convertInvalidString("\xEA\xA9", "%", "SJIS-win", "UTF-8");

 echo "Done!\n";
 ?>
 --EXPECT--
 CP932 verification and conversion works on all valid characters
 CP932 verification and conversion works on all invalid characters
 Unicode -> CP932 conversion works on all invalid codepoints
+SJIS-win verification and conversion works on all valid characters
+SJIS-win verification and conversion works on all invalid characters
+Unicode -> SJIS-win conversion works on all invalid codepoints
 Done!

ext/mbstring/tests/mb_internal_encoding_variation2.phpt

Lines changed: 2 additions & 2 deletions
@@ -176,10 +176,10 @@ string(9) "eucJP-win"
 -- Iteration 20 --
 string(9) "eucJP-win"
 bool(true)
-string(5) "CP932"
+string(8) "SJIS-win"

 -- Iteration 21 --
-string(5) "CP932"
+string(8) "SJIS-win"
 bool(true)
 string(11) "ISO-2022-JP"
