Open
Description
What steps will reproduce the problem?
Parsing a string containing certain Unicode characters, such as ［ (U+FF3B
FULLWIDTH LEFT SQUARE BRACKET, not to be confused with the ASCII [). For
example, run this program:
require 'html5'
include HTML5
t="test\357\274\273\343\201\202\357\274\275\n"
$KCODE="UTF8"
print HTMLParser.parse_fragment(t,{:encoding => 'utf-8'})
$KCODE="NONE"
print HTMLParser.parse_fragment(t,{:encoding => 'utf-8'})
What is the expected output? What do you see instead?
Expected output:
test［あ］
test［あ］
Actual output:
test���あ���
test［あ］
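As a sanity check that the test string itself is well-formed, the sketch below decodes the octal escapes from the program above. It assumes Ruby 1.9 or later (where strings carry an encoding), not the Ruby 1.8 environment of this report, and the variable names are only illustrative.

```ruby
# Bytes of the test string from the report, written with the same octal escapes.
raw = "test\357\274\273\343\201\202\357\274\275\n".b

# Reinterpret the raw bytes as UTF-8 and list the code points (newline dropped).
text = raw.dup.force_encoding(Encoding::UTF_8)
codepoints = text.chomp.codepoints.map { |cp| format('U+%04X', cp) }

puts text.valid_encoding?   # true: the bytes are well-formed UTF-8
puts codepoints.join(' ')   # U+0074 U+0065 U+0073 U+0074 U+FF3B U+3042 U+FF3D
```

So the input is valid UTF-8 (U+FF3B, U+3042, U+FF3D after "test"), and the replacement characters in the first line of the actual output come from the parser, not from the input.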
Please provide any additional information below.
Some Ruby applications run with $KCODE set to UTF8; notably, this is the
default for Ruby on Rails applications. An effect of this setting is that
regular expressions support Unicode characters by default (i.e., /a/ acts
like /a/u). inputstream.rb uses a regular expression to check for valid
UTF-8:
when 0xC0..0xFF
  if instance_variables.include?("@win1252") && @win1252
    "\xC3" + (c - 64).chr # convert to utf-8
  # from http://www.w3.org/International/questions/qa-forms-utf-8.en.php
  elsif @buffer[@tell - 1..@tell + 3] =~ /^
      ( [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
      )/x
    @tell += $1.length - 1
    $1
  else
    [0xFFFD].pack('U') # invalid utf-8
  end
When $KCODE is set to UTF8, the expression fails to recognize the UTF-8
representation of ［ as valid. The problem can be solved by adding the "n"
option at the end of the expression, which makes the regexp match raw bytes
regardless of $KCODE. For example:
irb(main):004:0> $KCODE='UTF8'
=> "UTF8"
irb(main):005:0> "\357\274\273" =~ /^( [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )/x
=> nil
irb(main):006:0> "\357\274\273" =~ /^( [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )/xn
=> 0
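For readers on Ruby 1.9 or later, where $KCODE no longer has any effect, the same byte-oriented check can be sketched as follows. This is only an illustration of the "n" option, not the html5lib code: the names UTF8_CHAR and leading_utf8_char are hypothetical, the pattern is anchored with \A instead of ^, and the input is converted to a binary string first, because a binary regexp with high-byte escapes cannot be matched against a UTF-8 tagged string.

```ruby
# Byte-level pattern for one well-formed multi-byte UTF-8 sequence, following
# the W3C page cited in the report. The trailing "n" makes the regexp binary.
UTF8_CHAR = /\A(?:
    [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)/xn

# Hypothetical helper: returns the leading multi-byte sequence as a binary
# string, or nil if the input does not start with a well-formed one.
def leading_utf8_char(bytes)
  match = UTF8_CHAR.match(bytes.b)  # String#b gives a binary (ASCII-8BIT) copy
  match && match[0]
end

p leading_utf8_char("\357\274\273")  # the three bytes of U+FF3B are accepted
p leading_utf8_char("\xC0\x80")      # overlong encoding is rejected (nil)
```

Matching on a binary copy keeps the check independent of whatever encoding the caller's string happens to be tagged with, which is the same effect the "n" option has under $KCODE in Ruby 1.8.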
(I blame the lack of a preview button for any errors in this submission ;-) )
Original issue reported on code.google.com by camillo....@gmail.com
on 27 Apr 2008 at 1:22