
Problems with UTF8 #63

Open
@GoogleCodeExporter

I can't believe this hasn't been covered before, but I have so far been
unable to find a solution to the following:

 puts HTML5::HTMLParser.new.parse('Test dátá')

provides:

 <html><head/><body>Test dรกtรก</body></html>

As can be seen, the text in the body has the wrong characters where each á
should be, so I suspected an ordinary UTF-8 conversion bug.
However, just to really mess with my mind, I thought the following would be
a more complete test to post here:

 puts HTML5::HTMLParser.new.parse('Sámple Téxt Wíth Acceñts')

produces:

 <html><head/><body>Sámple Téxt Wíth Acceñts</body></html>

which is correct!! My next step was to try removing each accent one by one,
until only the first á is present. Each attempt worked except the last,
which produced:

 <html><head/><body>Sรกmple Text With Accents</body></html>

Clearly, there is something very strange here, and it's causing major pain.
Does anyone have any suggestions as to what's going on and, more importantly,
how to fix it?
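
One observation that may help narrow this down: the wrong characters match what the UTF-8 bytes of á (0xC3 0xA1) look like when read as the Thai encoding TIS-620, where 0xC3 is ร and 0xA1 is ก. That would suggest the input encoding is being guessed wrongly somewhere, rather than the conversion itself being broken, but that is an assumption and is not confirmed by anything above. A minimal sketch that reproduces the symptom without the html5 gem follows; misread_as_tis620 is a hypothetical helper written only for this illustration, and it needs Ruby 1.9 or later for the \u escapes:

```ruby
# Hypothetical helper, not part of the html5 gem.
# Reinterprets the raw bytes of a UTF-8 string as TIS-620. In TIS-620 the
# bytes 0xA1..0xFB map to U+0E01..U+0E5B, a fixed offset of 0x0D60.
# Simplification: bytes below 0xA1 are passed through unchanged.
def misread_as_tis620(utf8_text)
  codepoints = utf8_text.bytes.map do |byte|
    byte >= 0xA1 ? byte + 0x0D60 : byte
  end
  codepoints.pack("U*") # build a UTF-8 string from the code points
end

# "Test d\u00E1t\u00E1" is 'Test dátá' written with escapes
puts misread_as_tis620("Test d\u00E1t\u00E1")  # prints: Test dรกtรก
```

The output is the same as the bad parser output for 'Test dátá', which is at least consistent with a TIS-620 misreading.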

Versions:

gem -v 1.0.1
html5 (0.10.0)
ruby 1.8.6 
Ubuntu 7.10 systems

Many thanks, Sam

Original issue reported on code.google.com by sam.l...@gmail.com on 15 Feb 2008 at 11:44
