UCS-4 conversion does not pass BOM through to output

alexdowad · alexdowad · commit 97f8495e0f9e · 2021-08-30T16:29:58.000+02:00
This is to match the way that we handle UCS-2. When a BOM is found at
the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take
note of the intended byte order and handle the string accordingly, but
do NOT emit a BOM to the output. Rather, we just use the default byte
order for the requested output encoding.

Some might argue that if the input string used a BOM, and we are
emitting output in a text encoding where both big-endian and
little-endian byte orders are possible, we should include a BOM in the
output string. To such hypothetical debaters of minutiae, I can only
offer you a shoulder shrug. No reasonable program which handles UCS-2
and UCS-4 text should require a BOM.

Really, the concept of the BOM is a poor idea and should not have been
included in Unicode. Standardizing on a single byte order would have
been much better, similar to 'network byte order' for the Internet
Protocol. But this is not the place to speak at length of such things.
diff --git a/ext/mbstring/libmbfl/filters/mbfilter_ucs4.c b/ext/mbstring/libmbfl/filters/mbfilter_ucs4.c
@@ -185,11 +185,10 @@ int mbfl_filt_conv_ucs4_wchar(int c, mbfl_convert_filter *filter)
 			} else {
 				filter->status = 0x100;		/* little-endian */
 			}
-			CK((*filter->output_function)(0xfeff, filter->data));
-		} else {
-			filter->status &= ~0xff;
+		} else if (n != 0xfeff) {
 			CK((*filter->output_function)(n, filter->data));
 		}
+		filter->status &= ~0xff;
 		break;
 	}
 

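To illustrate the behaviour described in the commit message, here is a minimal standalone sketch. It is not the libmbfl filter: `read_u32` and `ucs4_decode` are hypothetical names, the sketch works on a whole buffer rather than byte by byte, it assumes big-endian as the default byte order, and it only looks for a BOM in the first four bytes. A leading BOM selects the byte order and is then dropped instead of being passed to the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libmbfl): read one 32-bit unit. */
static uint32_t read_u32(const unsigned char *p, int little_endian)
{
	if (little_endian) {
		return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
		       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
	}
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical decoder for a 'UCS-4' byte string (not 'UCS-4BE'/'UCS-4LE').
 * Writes code points to out and returns how many were written.
 * out must have room for at least len / 4 entries.
 */
static size_t ucs4_decode(const unsigned char *in, size_t len, uint32_t *out)
{
	assert(in != NULL && out != NULL);

	int little_endian = 0;	/* assumed default: big-endian */
	size_t pos = 0, count = 0;

	if (len >= 4) {
		/* Read the first unit as big-endian and compare against both BOM forms. */
		uint32_t first = read_u32(in, 0);
		if (first == 0xfffe0000u) {
			/* Bytes FF FE 00 00: little-endian BOM. Note the order, skip it. */
			little_endian = 1;
			pos = 4;
		} else if (first == 0x0000feffu) {
			/* Bytes 00 00 FE FF: big-endian BOM. Skip it. */
			pos = 4;
		}
	}

	/* Decode the remaining complete 4-byte units; the BOM is never emitted. */
	for (; pos + 4 <= len; pos += 4) {
		out[count++] = read_u32(in + pos, little_endian);
	}
	return count;
}
```

Given the bytes FF FE 00 00 41 00 00 00, this sketch yields the single code point U+0041; the BOM affects how the following bytes are read but does not appear in the result.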