Check for valid utf-8 in inputstream.rb gives false negatives when $KCODE is set to "UTF8" [w/fix] #66

Open
@GoogleCodeExporter

What steps will reproduce the problem?

Parsing a string containing certain Unicode characters, such as ［ (U+FF3B
FULLWIDTH LEFT SQUARE BRACKET, not to be confused with [). For example, run
this program:

require 'html5'
include HTML5
t="test\357\274\273\343\201\202\357\274\275\n"
$KCODE="UTF8"
print HTMLParser.parse_fragment(t,{:encoding => 'utf-8'})
$KCODE="NONE"
print HTMLParser.parse_fragment(t,{:encoding => 'utf-8'})
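
For reference, the octal escapes in the test string can be decoded to check which characters are involved. This is a minimal sketch for Ruby 1.9 or later; the helper name non_ascii_code_points is made up for illustration and is not part of html5lib.

```ruby
# Hypothetical helper (not part of html5lib): lists the code points of the
# non-ASCII characters in a byte string that is assumed to be valid UTF-8.
def non_ascii_code_points(bytes)
  text = bytes.dup.force_encoding("UTF-8")
  text.chars.reject(&:ascii_only?).map { |ch| format("U+%04X", ch.ord) }
end

t = "test\357\274\273\343\201\202\357\274\275\n"
puts non_ascii_code_points(t).join(" ")  # prints "U+FF3B U+3042 U+FF3D"
```

Each of the three characters is a valid three-byte UTF-8 sequence, so none of them should be replaced with U+FFFD.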


What is the expected output? What do you see instead?

Expected output:
test［あ］
test［あ］

Actual output:
test���あ���
test［あ］


Please provide any additional information below.

Some Ruby applications run with $KCODE set to UTF8; notably, this is the
default for Ruby on Rails applications. An effect of this setting is that
regular expressions support Unicode characters by default (i.e., /a/ acts like
/a/u). inputstream.rb uses a regular expression to check for valid UTF-8:

        when 0xC0..0xFF
          if instance_variables.include?("@win1252") && @win1252
            "\xC3" + (c - 64).chr # convert to utf-8
          # from http://www.w3.org/International/questions/qa-forms-utf-8.en.php
          elsif @buffer[@tell - 1..@tell + 3] =~ /^
                ( [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
                |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
                | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
                |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
                |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
                | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
                |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
                )/x
            @tell += $1.length - 1
            $1
          else
            [0xFFFD].pack('U') # invalid utf-8
          end
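
The same byte patterns can be tried in isolation. The following is a sketch only, assuming Ruby 2.0 or later (for String#b); the constant UTF8_MULTIBYTE and the method leading_utf8_sequence are hypothetical names and do not exist in html5lib. The input is converted to a binary string so that the pattern is always applied byte by byte.

```ruby
# Hypothetical names, not part of html5lib. The alternatives are the ones
# quoted above, compiled with the "n" option so they describe raw bytes.
UTF8_MULTIBYTE = /\A(?:
    [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)/xn

# Returns the well-formed multi-byte UTF-8 sequence at the start of `bytes`
# (as a binary string), or nil if the leading bytes are not one.
def leading_utf8_sequence(bytes)
  match = UTF8_MULTIBYTE.match(bytes.b)
  match && match[0]
end

p leading_utf8_sequence("\xEF\xBC\xBB")  # U+FF3B: accepted
p leading_utf8_sequence("\xC0\x80")      # overlong encoding: nil
```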

When $KCODE is set to UTF8, the expression fails to recognize the UTF-8
representation of ［ as valid. The problem can be solved by adding the "n"
option at the end of the expression. For example:

irb(main):004:0> $KCODE='UTF8'
=> "UTF8"
irb(main):005:0> "\357\274\273" =~ /^( [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )/x
=> nil
irb(main):006:0> "\357\274\273" =~ /^( [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )/xn
=> 0
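
The session above depends on Ruby 1.8, since $KCODE is ignored from Ruby 1.9 on. The distinction behind the "n" option, character-oriented versus byte-oriented matching, can still be illustrated on a newer Ruby. The sketch below is an analogy rather than a reproduction of the 1.8 behavior, and single_unit? is a hypothetical helper name.

```ruby
# Hypothetical helper: true when the regex engine sees exactly one
# "character" in str (one code point for UTF-8, one byte for binary).
def single_unit?(str)
  !(/\A.\z/m =~ str).nil?
end

bracket = "\357\274\273".dup.force_encoding("UTF-8")  # U+FF3B

puts single_unit?(bracket)    # true: one UTF-8 character
puts single_unit?(bracket.b)  # false: three separate bytes
puts bracket.b.bytes.map { |b| format("%02X", b) }.join(" ")  # EF BC BB
```

A pattern written in terms of individual bytes, such as the one in inputstream.rb, only makes sense under the byte-oriented view.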


(I blame the lack of a preview button for any errors in this submission ;-) )

Original issue reported on code.google.com by camillo....@gmail.com on 27 Apr 2008 at 1:22
