Encoding - Characters converted to UTF8-hex #17

Version 1.4.0
Original text in .msg body:

> Char-å-CharChar-Å-CharChar-ø-CharChar-Ø-CharChar-æ-CharChar-Æ-Char

After calling parseMsg on the file and inspecting BodyHTML and ConvertedBodyHTML of OutlookMessage, both values are null. BodyRtf now holds the content, but the non-ASCII characters have been replaced with UTF-8 hex escapes, and the body in the returned string contains the following:


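> `Char-\'c3\'a5-CharChar-\'c3\'85-CharChar-\'c3\'b8-CharChar-\'c3\'98-CharChar-\'c3\'a6-CharChar-\'c3\'86-Char`

Note that each apostrophe `'` is preceded by a backslash `\`, so every byte is written as the RTF hex escape `\'xx`. The backslash may not show up if the text above is rendered without code formatting.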
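
For illustration, the escapes above can be turned back into the original characters by collecting each run of `\'xx` bytes and decoding the run as UTF-8. This is only a minimal sketch: `RtfHexEscapes` and `decodeUtf8` are hypothetical names, not part of outlook-message-parser or rtf-to-html, and it assumes the RTF really carries UTF-8 bytes (as the `65001` code page in this message suggests). It is not a real RTF parser and ignores every other control word.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this issue.
final class RtfHexEscapes {

    // Replaces each run of \'xx escapes with the text obtained by
    // decoding the collected bytes as UTF-8. Everything else is copied as-is.
    static String decodeUtf8(String rtf) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtf.length()) {
            if (isHexEscape(rtf, i)) {
                // Two hex digits after \' form one byte of the run.
                run.write(Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(run, out);
                out.append(rtf.charAt(i));
                i++;
            }
        }
        flush(run, out);
        return out.toString();
    }

    // True if the four characters starting at i look like \'xx.
    private static boolean isHexEscape(String s, int i) {
        return i + 3 < s.length()
                && s.charAt(i) == '\\'
                && s.charAt(i + 1) == '\''
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    // Decodes the pending byte run as UTF-8 and appends it to the output.
    private static void flush(ByteArrayOutputStream run, StringBuilder out) {
        if (run.size() > 0) {
            out.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
            run.reset();
        }
    }

    public static void main(String[] args) {
        String body = "Char-\\'c3\\'a5-CharChar-\\'c3\\'98-Char";
        System.out.println(decodeUtf8(body));
    }
}
```

Collecting whole runs before decoding matters because one character such as Ø is spread over two escapes (`\'c3\'98`); decoding byte by byte would not work.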


If I try to convert the RTF extracted from the .msg using the recently forked library "rtf-to-html" (with either _RTF2HTMLConverterRFCCompliant_ or _RTF2HTMLConverterClassic_), I get the following exception:

<code>Exception in thread "main" java.nio.charset.UnsupportedCharsetException: 65001
	at org.bbottema.rtftohtml.impl.util.CharsetHelper.findCharset(CharsetHelper.java:19)
	at org.bbottema.rtftohtml.impl.RTF2HTMLConverterRFCCompliant.rtf2html(RTF2HTMLConverterRFCCompliant.java:112)</code>
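
For context, 65001 is the Windows code page identifier for UTF-8, and Java does not recognise the bare number as a charset name. I have not looked at how `CharsetHelper.findCharset` is implemented, so the following is only a sketch of the kind of mapping that would avoid the exception. `CodePages` and `forCodePage` are hypothetical names, and the list of candidate names is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not the library's actual CharsetHelper.
final class CodePages {

    // Maps a Windows code page number (e.g. from \ansicpg) to a Java Charset.
    static Charset forCodePage(int codePage) {
        if (codePage == 65001) {
            // Windows code page 65001 is UTF-8.
            return StandardCharsets.UTF_8;
        }
        // Assumed naming patterns; Java knows e.g. "windows-1252" and "Cp1252".
        String[] candidates = {
            "windows-" + codePage,
            "Cp" + codePage,
            "x-windows-" + codePage
        };
        for (String name : candidates) {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        throw new UnsupportedCharsetException(String.valueOf(codePage));
    }

    public static void main(String[] args) {
        System.out.println(forCodePage(65001));
        System.out.println(forCodePage(1252));
    }
}
```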

If I use RTF2HTMLConverterJEditorPane, I am able to convert the RTF to HTML, but the result contains some encoding issues. To partially solve them, I first encode the resulting string to a byte array using "Cp1252" and then decode that byte array as a "UTF-8" string. After this I get almost all the results I wanted to achieve:
> Char-å-CharChar-Å-CharChar-ø-CharChar-#-CharChar-æ-CharChar-Æ-Char

As you can see, I am able to convert most of the characters to the correct encoding, except the Ø character.
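
The Cp1252-to-UTF-8 step described above looks roughly like this. It is a minimal sketch with hypothetical names (`MojibakeRepair`, `reinterpretAsUtf8`), and it assumes the converter decoded UTF-8 bytes as Cp1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the workaround; not part of any library.
final class MojibakeRepair {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Re-encodes the wrongly decoded text as Cp1252 to recover the raw bytes,
    // then decodes those bytes as UTF-8.
    static String reinterpretAsUtf8(String misdecoded) {
        return new String(misdecoded.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the problem: UTF-8 bytes of "å" read as Cp1252.
        String misdecoded = new String("\u00e5".getBytes(StandardCharsets.UTF_8), CP1252);
        System.out.println(reinterpretAsUtf8(misdecoded));
    }
}
```

One caveat with this approach: `String.getBytes` silently substitutes a replacement byte for any character that has no Cp1252 mapping, so the round trip only restores characters whose intermediate form survived the first, wrong decoding intact.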

My current solution is to go back to version 1.1.16 and retrieve ConvertedBodyHTML, which is not null in that version. I then convert this HTML string from "Cp1252" to a byte array and the byte array to a "UTF-8" string. This way I don't use the newly forked "rtf-to-html" and am able to get the HTML from OutlookMessageParser itself.

Is there some other workaround to make the newest version of Outlook-Message-Parser work?
